Language Testing and Evaluation 17

Diagnostic Writing Assessment

native speakers of English and learners of English as an additional language.
The rating scale was then validated using both quantitative and qualitative
methods. The study showed that a detailed data-based rating scale is more
valid and more useful for diagnostic purposes than the more commonly used
impressionistic rating scale.
Volume 17
PETER LANG
Frankfurt am Main · Berlin · Bern · Bruxelles · New York · Oxford · Wien
Ute Knoch
Diagnostic Writing
Assessment
The Development and Validation
of a Rating Scale
PETER LANG
Internationaler Verlag der Wissenschaften
Bibliographic information published by the Deutsche Nationalbibliothek:
the Deutsche Nationalbibliothek lists this publication in the
Deutsche Nationalbibliografie; detailed bibliographic
data are available on the Internet at http://dnb.d-nb.de.
ISSN 1612-815X
ISBN 978-3-631-58981-6
© Peter Lang GmbH
Internationaler Verlag der Wissenschaften
Frankfurt am Main 2009
All rights reserved.
This work, including all of its parts, is protected by copyright.
Any use outside the narrow limits of copyright law without the
publisher's consent is prohibited and liable to prosecution. This applies
in particular to reproduction, translation, microfilming, and
storage and processing in electronic systems.
www.peterlang.de
ABSTRACT
Alderson (2005) suggests that diagnostic tests should identify strengths and
weaknesses in learners' use of language, focus on specific elements rather than
global abilities, and provide detailed feedback to stakeholders. However, rating
scales used in performance assessment have been repeatedly criticized for being
imprecise, for using impressionistic terminology (Fulcher, 2003; Upshur &
Turner, 1999; Mickan, 2003) and for often resulting in holistic assessments
(Weigle, 2002).
The results indicate that rater reliability and candidate discrimination were
generally higher and that raters were able to better distinguish between
different aspects of writing ability when the more detailed,
empirically-developed descriptors were
used. The interviews and questionnaires showed that most raters preferred using
the empirically-developed descriptors because they provided more guidance in the
rating process. The findings are discussed in terms of their implications for rater
training and rating scale development, as well as score reporting in the context of
diagnostic assessment.
ACKNOWLEDGEMENTS
This book would not have been possible without the help and support of many
individuals. I would like to thank the following people:
• Professor Rod Ellis, for his patient support and expert guidance throughout
the preparation of this research. Our discussion of all aspects of the
research was enormously helpful. I am especially grateful for the long hours
he spent reading and checking my drafts.
• Janet von Randow, for her incredible enthusiasm and helpfulness at all
stages of this study, for providing access to the DELNA materials and for
her wonderful morning teas.
• A special thanks is reserved for Associate Professor Catherine Elder,
for sparking my interest in language assessment.
• Carol Myford and Mike Linacre, who answered my copious questions
about FACETS. I appreciate their comments with regard to several of the
statistics used in this study.
• The raters who agreed to take part in my study, for patiently undertaking
the task of marking and remarking the one hundred writing scripts, showing
both good humour and a real sense of responsibility and dedication
throughout.
This publication is supported by a grant from the Research and Research Training
Committee, Faculty of Arts, The University of Melbourne and by a Grant-in-Aid
from the School of Languages and Linguistics, Faculty of Arts, The University of
Melbourne.
TABLE OF CONTENTS
Chapter 1: Introduction
APPENDICES
REFERENCES
Chapter 1: Introduction
1.1 Background
The diagnostic section of the assessment, which is administered after the
screening, comprises listening and reading tasks (which are developed and
validated at the University of Melbourne) and an expository writing task (which
is developed in-house). The reading and listening tasks each produce a single
score. The writing task, which is the focus of this study, is scored using an
analytic rating scale. The DELNA rating scale has nine traits, arranged into
three groups (fluency, content and form). Each trait is divided into six level
descriptors, ranging from level four to level nine. The rating scale was
adapted from a pre-existing scale used at the University of Melbourne.
No information is available on how that scale was developed. Since its
introduction to DELNA, the rating scale has been modified a number of times,
mainly through consultation with raters.
A closer inspection of the DELNA rating scale reveals that it is typical of
rating scales commonly used in performance assessment systems such as IELTS
(International English Language Testing System) and TOEFL (Test of English as a
Foreign Language). The traits (organisation, cohesion, style, content,
grammatical accuracy, sentence structure, and vocabulary and spelling) are
representative of traits usually encountered in rating scales of writing. The
level descriptors make use of a common practice in writing performance
assessment: adjectives (e.g. satisfactory, adequate, limited, inadequate) are
used to differentiate between the different levels.
DELNA writing scores are reported to two stakeholder groups. Students receive
one score averaged from the nine traits on the rating scale. Students are also
given a brief statement about their performance on each of the three categories
of fluency, content and form. Departments are presented with one overall
writing score for each student.
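The student-facing score described above is a simple average of the nine trait bands. A minimal sketch of that arithmetic follows; the function name and the rounding convention are assumptions for illustration, since the text does not state DELNA's actual reporting implementation.

```python
def delna_writing_score(trait_scores):
    """Average nine trait band scores (each from 4 to 9) into the single
    score reported to students.

    Rounding to one decimal place is an illustrative assumption; the text
    does not specify DELNA's reporting convention.
    """
    if len(trait_scores) != 9:
        raise ValueError("DELNA rates nine traits")
    return round(sum(trait_scores) / 9, 1)
```

For example, a profile of mixed sixes and sevens yields a single averaged band between the two.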
I was first confronted with rating scales for writing assessment in 2001. In that
year, I first joined the team of DELNA raters at the University of Auckland and a
little later became an IELTS-accredited rater. Because I was relatively
inexperienced at rating writing at that time, I often found that the descriptors provided me
with very little guidance. On what basis was I meant to, for example, decide that a
student uses cohesive devices ‘appropriately’ rather than ‘adequately’ or that the
style of a writing script ‘is not appropriate to the task’ rather than displaying ‘no
apparent understanding of style’? And what exactly should I look for when
assessing the style of a writing script? This lack of guidance by the rating
scale often forced me to return to a more holistic form of marking where the choice of the
different analytic categories was mostly informed by my first impression of a
writing script.
Although I thought that my inexperience with rating writing was the cause of my
difficulties, I also realised during rater training sessions that I was not the only
one experiencing problems. We would often spend endless time discussing why a
certain script should be awarded a seven instead of a six, only to be told that the
benchmark raters had given it a seven, and even though the rater trainer did not
seem to entirely agree with this mark, that was what we would have to accept. At
other times the rater trainers told us to rely on our ‘gut feeling’ of the level of a
script. If we felt it was, for example, a six overall, we should rely on that and rate
accordingly. I often felt that this was not a legitimate way to rate and that
important information might be lost in this process.
I also felt uncomfortable with rating scales mixing different aspects of writing into
one level descriptor. For example, vocabulary and spelling might be described in
one descriptor, or grammatical range and accuracy might be grouped together.
But what happens if a writer is at different developmental levels in the two traits?
Should the rater prioritize one trait or average the scores on the two?
During those early years as an IELTS and DELNA rater I was not aware of the
differences between diagnostic assessment and proficiency assessment. A number
of raters would, like me, rate both these types of assessment, often in the same
week. Although DELNA and IELTS use slightly different rating scales, both
scales are very similar in terms of the types of features they display at the
descriptor level. Rater training is also conducted in a very similar fashion.
Only recently have I become aware that diagnostic assessment is quite different
from other types of assessment. One important feature of diagnostic assessment
is the detailed feedback that is provided to candidates. Therefore,
relying on one’s ‘gut feeling’ when rating might cause potentially important in-
formation to be lost.
Later in the book, Alderson (2005) describes the use of indirect tests (in this case
the DIALANG2 test) of writing rather than the use of performance tests (such as
the writing test in DELNA). However, indirect tests of writing are used less and
less in this era of performance testing and therefore an argument can easily be
made that diagnostic tests of writing should be direct rather than indirect.
The question, however, is how direct diagnostic tests of writing should differ from
proficiency or placement tests. One central aspect in the performance assessment
of writing is the rating scale. McNamara (2002) and Turner (2000), for example,
have argued that the rating scale (and the way raters interpret the rating scale)
represents the de facto test construct. It should therefore not be assumed that
rating scales used in proficiency or placement testing function validly and reliably in
a diagnostic context.
Existing rating scales of writing used in proficiency or placement tests have also
been subject to some criticism. It has, for example, been claimed that they are
often developed intuitively, which means that they are either adapted from
already existing scales or based purely on what developers think might be
common features of writing at various proficiency levels (Brindley, 1991; Fulcher,
1996a, 2003; North, 1995). Furthermore, Brindley (1998) and other authors have
pointed out that the criteria often use impressionistic terminology which is open to
subjective interpretations (Mickan, 2003; Upshur & Turner, 1995; Watson Todd,
Thienpermpool, & Keyuravong, 2004). The band levels have furthermore been
criticized for often using relativistic wording as well as adjectives and intensifiers
to differentiate between levels (Mickan, 2003).
There is also a growing body of research that indicates that raters often experience
problems when using these rating scales. Claire (2002, cited in Mickan, 2003), for
example, reported that raters regularly debate the criteria in moderation sessions
and describe problems with applying descriptors which make use of adjectives
like ‘appropriate’ or ‘sufficient’. Similarly, Smith (2000), who conducted
think-aloud protocols of raters marking writing scripts, noted that raters had ‘difficulty
interpreting and applying some of the relativistic terminology used to describe
performances’ (p. 186).
The problems with existing rating scales described above might affect the raters’
ability to make fine-grained distinctions between different traits on a rating scale.
This might result in important diagnostic information being lost. Similarly, if
raters resort to letting an overall, global impression guide their ratings, even when
using an analytic rating scale, the resulting scoring profile would be less useful to
candidates. It is therefore doubtful whether existing rating scales are suitable for a
diagnostic context.
The study was conducted in two main phases. During the first phase, the analysis
phase, over six hundred DELNA writing scripts at different proficiency levels
were analysed using a range of discourse analytic measures. These measures were
selected because they were able to distinguish between writing scripts at
different proficiency levels and because they represented a range of
aspects of writing. Based on the findings in Phase 1, a new rating scale was
developed.
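To illustrate the kind of discourse analytic measure meant here (the actual measures used in the study are the subject of Chapter 4), a minimal sketch might compute one syntactic and one lexical index from a script. The function name and the choice of measures are hypothetical, assuming only the field's common proxies of mean sentence length and type-token ratio.

```python
import re

def discourse_measures(script: str) -> dict:
    """Compute two simple, illustrative discourse analytic measures:
    mean sentence length (a rough proxy for syntactic complexity) and
    type-token ratio (a rough proxy for lexical diversity).

    This is a sketch, not the study's actual instruments.
    """
    # Split on sentence-final punctuation; keep non-empty segments.
    sentences = [s for s in re.split(r"[.!?]+", script) if s.strip()]
    # Crude word tokenisation on lowercase letters and apostrophes.
    tokens = re.findall(r"[a-z']+", script.lower())
    return {
        "mean_sentence_length": round(len(tokens) / len(sentences), 2),
        "type_token_ratio": round(len(set(tokens)) / len(tokens), 2),
    }
```

Measures like these can then be compared across band levels to see which ones actually separate scripts at different proficiency levels.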
During the second phase of this study, the validation phase, ten raters rated one
hundred pre-selected writing scripts using first the existing descriptors and then
the new rating scale. After these two rating rounds, the raters completed a
questionnaire designed to elicit their opinions about the efficacy of the new
scale. Detailed interviews were conducted with seven of the ten raters. The purpose of this
phase was not only to establish the reliability and validity of the two scales based
on the rating data, but also to elicit the raters’ opinions of the efficacy of the two
scales.
Because the overarching research question is broad, three more specific questions
were formulated to guide the data collection and analysis:
Research question 1 (Phase 1): Which discourse analytic measures are successful
in distinguishing between writing samples at different DELNA writing levels?
Research question 2a (Phase 2): Do the ratings produced using the two rating
scales differ in terms of (a) the discrimination between candidates, (b) rater spread
and agreement, (c) variability in the ratings, (d) rating scale properties and (e)
what the different traits measure?
Research question 2b (Phase 2): What are raters’ perceptions of the two different
rating scales for writing?
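The agreement side of research question 2a can be sketched with a simple calculation over two raters' band scores, assuming scores on a common integer band scale. This is a hypothetical illustration only; the study itself used many-facet Rasch measurement (FACETS), which this simple statistic does not replicate.

```python
def rater_agreement(scores_a, scores_b):
    """Return (exact, adjacent) agreement proportions for two raters.

    'Exact' counts identical bands; 'adjacent' counts bands within one
    point of each other. A sketch of basic agreement indices, not the
    many-facet Rasch analysis used in the study.
    """
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent
```

Comparing such indices for the existing and the new scale would show whether the more detailed descriptors bring raters closer together.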
This book is organised into eleven chapters. Chapter 1, this chapter, provides an
overview of the research and its purpose. Chapters 2 to 4 provide a review of the
relevant literature. Chapter 2 gives a general introduction to performance assess-
15
ment of writing, in particular diagnostic assessment. The chapter goes on to dis-
cuss models of performance assessment of writing and how these could be rele-
vant to diagnostic assessment of writing. Specifically, the influence of the rater,
the task and the test taker on the outcome of an assessment is described. Chapter 3
reviews the literature on rating scales, which is the main focus of this study. As
part of this chapter, possible design features of rating scales for diagnostic writing
assessment are considered. The final chapter of the literature review, Chapter 4,
first considers what constructs should be assessed in a diagnostic assessment of
writing and then reviews discourse analytic measures for each of these constructs.
Chapters 5 to 7 contain the methodology, results and discussion chapters of Phase
1 of the study, the development of the rating scale. Chapter 5, the method
chapter, provides a detailed description of the context of the study and an
outline of the methodology used. This chapter also contains an account of the
pilot study. Chapter 6 presents the results of Phase 1, the analysis of the writing scripts. The results
are discussed in the following chapter, Chapter 7. Here, the development of the
pilot scale is described and the different trait scales are presented. The
following three chapters present the methodology (Chapter 8), results (Chapter 9)
and discussion (Chapter 10) of Phase 2 of this study, the validation of the rating scale.
Chapter 9 is divided into two sections, one providing the results of the quantitative
analysis of the rating scores and the other presenting the results from the
questionnaires and interviews. Chapter 10 then draws these results together and
discusses the overarching research question. Chapter 11, the concluding chapter,
summarises the study as a whole and discusses the implications of the study both
at a practical and theoretical level. Suggestions for further research are offered
and limitations of the study are identified.
----
Notes:
1. Although not the focus of this study, the writing tasks used in diagnostic
assessment might also be different to those in proficiency tests of writing.
2. DIALANG is a diagnostic language test for 14 European languages based on the
‘Common European Framework of Reference’.
Chapter 2: Performance Assessment of Writing
2.1 Introduction
Until the 1950s, writing assessment was mainly undertaken by individual teachers
in the context of their classes. However, with an increase in the number of
university enrolments came a greater demand for reliability. In response to this
demand, psychometricians developed indirect writing assessments (Grabe & Kaplan,
1996), which evaluate students’ knowledge of writing by using discrete test items
that assess knowledge of particular linguistic features, such as grammatical
choices or errors or even more specific writing behaviours such as spelling or
punctuation (Cumming, 1997). In these discrete-point tests, reliability issues were
seen as more important than questions of validity.
A very influential test that used multiple-choice components to measure writing
was the Test of Standard Written English (TSWE), developed by the Educational
Testing Service (ETS) for English first language writers. This test was part of
a common pre-university assessment measure in the United States (Grabe &
Kaplan, 1996).
During the late 1970s and early 1980s, direct assessment of writing (or
performance assessment of writing) became standard practice in English L1
(English as a first language) contexts and was also widely adopted by L2
(English as a second language) teachers who favoured testing students on meaningful, communicative
tasks (e.g. letter writing). With this shift back to the direct assessment of writing,
the problems regarding content and construct validity were addressed. However,
a whole range of concerns regarding the methods of collecting and evaluating
writing samples as true indicators of writing ability were raised (Grabe & Kaplan,
1996). Therefore, research since that time has focussed on a number of validity
issues, especially on improved procedures for obtaining valid writing samples
(taking into account the reader, task type, rater background, rater training and the
type of rating scale used).
In the 1980s, the skills and components model of the 1970s came under criticism
and a broadened view of language proficiency based on communicative competence
was proposed by Canale and Swain (1980)1. Since then, the testing of writing
has commonly taken the following form: students write a brief (30-45 minute)
essay (Cumming, 1997, p. 53), which is then rated either holistically or analytically
(for a description of these terms refer to Chapter 3) by trained raters using a rating
scale.
One of the largest direct tests of writing is administered by the Educational
Testing Service (ETS) as part of the TOEFL iBT (Test of English as a Foreign
Language internet-based test) test battery. Students produce two pieces of
writing, one independent writing task and one integrated task (which requires
test takers to write texts based on listening or reading input). The integrated
task has a time limit of 20 minutes, whilst the independent task has a time
limit of 30 minutes. Both tasks are evaluated by two trained raters (and a
third rater in case of discrepancies). The TOEFL iBT has undergone extensive
validity and reliability checks which have often directly contributed to
changes in rater training, topic comparison, essay scoring and prompt
development. Both the TOEFL iBT and IELTS are currently administered around
the world and are often used as gate-keeping examinations for university
entrance and immigration.
Whilst the two writing tests described above are considered to be proficiency
tests, as they are designed to assess general writing ability, writing assessments for
other purposes are also administered around the world. Students are, for example,
often required to write an essay which is then used for placement purposes. Their
result might determine which course or class at a certain institution would be the
most appropriate for the students concerned. Achievement tests are often
administered at the end of a writing course to determine the progress that
students have made whilst taking the course. Finally, diagnostic writing tests
might be administered to identify the strengths and weaknesses in candidates’
writing ability. Because diagnostic assessment is the focus of this study, the
following section focuses entirely on this type of test.
few tests are designed specifically for diagnostic purposes. A frequent
alternative is to use achievement or proficiency tests (which typically
provide only very general information), because it is difficult and time-
consuming to construct a test which provides detailed diagnostic
information (p. 43)
Despite repeated calls by Spolsky in the 1980s and 1990s (e.g. Spolsky, 1981,
1992), Alderson (2005) argues that very little research has looked at diagnostic
assessment. He points out, in the most detailed discussion of diagnostic
assessment to date, that diagnostic tests are frequently confused with placement tests.
He also disapproves of the fact that a number of definitions of diagnostic tests
claim that achievement and proficiency tests can be used for diagnostic purposes.
He also criticizes Bachman’s (1990) considerations of what the content of a
diagnostic test should look like:
Alderson (2005) argues that the former test type in Bachman’s description is
generally regarded as an achievement test and the latter as a proficiency test.
Therefore, he argues that there are no specifications in the literature of what
the content of diagnostic tests should look like.
Moussavi (2002), in his definition of diagnostic tests, argues that it is not
the purpose of the test so much that makes an assessment diagnostic, but rather the way
in which scores are analysed and used. Alderson (2005), however, argues that the
content of a diagnostic test needs to be more specific and focussed than that of
proficiency tests. Moreover, the profiles of performance that are produced as a
result of the test should contain very detailed information on the performance
across the different language aspects in question. He therefore believes that the
construct definition of a diagnostic test needs to be different from that of other
tests.
Summarizing the existing literature, he stresses:
(...) the language testing literature offers very little guidance on how
diagnosis might appropriately be conducted, what content diagnostic tests
might have, what theoretical basis they might rest on, and how their use
might be validated. (p. 10)
After a detailed review of the scarce existing literature on diagnostic assessment
in second and foreign language assessment, he provides a series of features that
could distinguish diagnostic tests from other types of tests. These can be found
below.
Alderson stresses, however, that this is a list of hypothetical features which need
to be reviewed and which he produced mainly to guide further thinking about this
much under-described area of assessment.
Alderson (2005) further points out that, whilst all definitions of diagnostic testing
emphasize feedback, there is no discussion of how scores should be reported. He
argues that feedback is probably one of the most crucial components of diagnostic
assessment. Merely reporting a test score without any detailed explanation is not
appropriate in the context of diagnostic assessment. He writes, ‘the essence of a
diagnostic test must be to provide meaningful information to users which they can
understand and upon which they or their teachers can act’ (p. 208). Also, he
argues, this feedback should be as immediate as possible and not, as is often the
case for proficiency tests, two or more weeks after the test administration.
Although Alderson suggests the use of indirect tests of writing for diagnostic
testing, these tests, as mentioned earlier, lack face validity and have generally fallen
out of favour. However, if a direct test of writing (or performance test) is used for
diagnostic purposes, a number of possible sources of variation are introduced into
the test context. The following section reviews models of performance assessment
of writing to identify these potential sources of variation. Research on each source
is then reported and the findings are evaluated in terms of their relevance to
diagnostic assessment.
2.4 Models of performance assessment of writing
Taking all the above-mentioned factors into account, McNamara (1996) developed
a model which organises language testing research and accounts for factors that
contribute to the systematic variance of a performance test score. McNamara’s
model, which is based on an earlier model by Kenyon (1992), was developed in
the context of oral assessment. It is, however, just as valid for written test
performance. For the purpose of this literature review it has been slightly
adapted to exclude any aspects relevant only to oral performance.
The model (Figure 1 above) places performance in a central position. The arrows
indicate that it is influenced by several factors, including the tasks, which drive
the performance and the raters who judge the performance using rating scales
and criteria. The final score can therefore only be partly seen as a direct
index of performance. The performance is also influenced by other contextual factors like, for
example, the test taking conditions. The model also accounts for the candidate and
the way his or her underlying competence will influence the performance. It is
assumed that the candidate draws on these underlying competences in a
straightforward manner.
Skehan (1998a) points out that it is not only important to understand the
individual components that influence test performance, but that it is necessary
to recognize an interaction between these components. He argues that, for
example, the rating scale, which is often seen as a neutral ruler, actually has
a great deal of influence on variation in test scores. There is competition
between processing goals within a performance. As shown by Skehan and Foster
(1997), fluency, accuracy and complexity compete with each other for processing
capacity. If the rating scale emphasizes each of these areas, then the final
writing score might be influenced by the processing goals the test taker
emphasized at the time. This might
further be influenced by a rater prioritizing certain areas of performance.
Similarly, certain task qualities and conditions might lead to an emphasis on one or
two of the above-mentioned processing goals.
Fulcher (2003) further revised the model to include more detailed descriptions of
various factors that influence the score of a written performance (Figure 3 below).
In the case of the raters, he acknowledges that rater training and rater
characteristics (or rater background, as it is called by other authors) play a role in the score
awarded to a writer. Fulcher’s model shows the importance of the scoring
philosophy and the construct definition of the rating scale for the outcome of
the rating process. He also indicates that there is an interaction between the rating scale
and a student’s performance which results in the score and any inferences that are
made about the test taker. Fulcher further acknowledges the importance of context
in test performance by including local performance conditions. Like Skehan,
Fulcher includes aspects that influence the task. Among these are the task orientation,
goals, and topics, as well as any context-specific task characteristics or conditions.
Finally, Fulcher’s model shows a number of variables that influence the test taker.
These include any individual differences between candidates (like personality),
their actual ability in the constructs tested, their ability for real-time processing
and any task-specific knowledge or skills they might possess. Fulcher (2003) sees
this model as provisional and requiring further research.
The models discussed above were conceived in the context of proficiency testing.
Because this book addresses diagnostic assessment, it is important to review the
research on the different sources of score variation presented by the models and
evaluate how they might affect the scoring of a direct diagnostic test of writing.
Each of the four influences on performance in bold-face in Fulcher’s model above
will be discussed in turn in the remainder of this chapter. Research on tasks, test
takers and raters will be discussed in this chapter, whilst issues surrounding the
rating scale (being the main focus of this study) will be described in the following
chapter (Chapter 3). Because the task and the test taker are not as central as the
rater to the purpose of this study, these two issues will be described more briefly
in this chapter.
2.4.1 Tasks
Hamp-Lyons (1990) writes that the variables of the task component of a writing
test are those elements that can be manipulated and controlled to give test takers
the opportunity to produce their best performance. Amongst these she names the
length of time available to students to write, the mode of writing (if students write
by hand or use a word processor), the topic and the prompt. She argues that of the
variables named above, the topic variable is the most controversial. Some studies
have found no differences in student performance across tasks (e.g. Carlson,
Bridgeman, Camp, & Waanders, 1985) whilst others have found differences in
content quality and quantity due to topic variation (Freedman & Calfee, 1983;
Pollitt, Hutchinson, Entwhistle, & DeLuca, 1985). Hamp-Lyons argues, however,
that if there are no differences in performance found between different tasks, then
this can also be due to the fact that the scoring procedure and the raters influence
the score and any differences are lessened as a result of these factors.
A large number of studies have been undertaken to investigate the impact of task
variability in oral language. Based on Skehan’s (1998a) model (see Figure 2),
Wigglesworth (2000), for example, divided the sources of error surrounding the
task into two groups. Firstly, there are the task characteristics, which
include features internal to the task, such as structure, cognitive load or
familiarity of content. Secondly, there are the task conditions, like planning
time or a native speaker/non-native speaker interlocutor in the case of a
speaking test. In her study, Wigglesworth manipulated two task characteristics
and two task conditions to see how these affected task difficulty. She found
that generally more structure made the task more difficult. Her results for
familiarity were mixed and therefore inconclusive. The task conditions
influenced the results in the following manner: a native speaker interlocutor
made a task easier and planning time did not improve the
results. Yuan and Ellis (2003), however, found that pre-task planning resulted in
improved lexical and grammatical complexity and an increase in fluency, and that
online planning improved accuracy and grammatical complexity. It is important to
note that all these studies were carried out in the context of speaking and it is not
clear if the results can be transferred to writing. In a similar study, again in the
context of speaking but not in a testing context, Skehan (2001) investigated the
effect of a number of task characteristics on complexity, accuracy and fluency. A
summary table of his results can be seen in Table 1 below.
Table 1: Summary of the effects of task characteristics on complexity, accuracy and fluency
Relatively little research on task effects has been undertaken in the context of
writing assessment. Studies investigating whether different task prompts elicit
language which is different in quantity and quality have resulted in mixed find-
ings. For example, whilst Quellmalz, Capell and Chou (1982) found that the type
of task did not significantly influence writing quality, Brown, Hilgers and
Marsella (1991) were able to show that both prompts and the type of topic re-
sulted in a significant difference between ratings based on a holistic scale.
O’Loughlin and Wigglesworth (2003) pointed out, however, that most studies in-
vestigating task variables have used ratings as the basis of their investigations and
have not looked at the actual discourse produced. One exception is a study by
Wigglesworth (1999, cited in O’Loughlin and Wigglesworth, 2003) in which she
investigated the effects of different tasks on both the ratings and the discourse
produced. She was able to show that the candidates produced more complex, less
accurate language when writing on the report task, and less complex but more ac-
curate language when responding to the recount tasks.
A more recent study that examined task characteristics was undertaken by
O’Loughlin and Wigglesworth (2003) in the context of the IELTS writing task.
The authors examined how quantity and manner of presentation of information in
the Academic Writing Task 1 affected the candidates’ writing. They found that
students wrote more complex texts if the task included less information, except in
the case of students with very high proficiency, who wrote more complex texts if
more information was given to them.
Ellis and Yuan (2004) investigated the influence of the task characteristic plan-
ning on written output. They found that pre-task planning impacted positively on
fluency and complexity, whilst online planning increased accuracy (i.e. the results
were very similar to those reported by Yuan and Ellis (2003) for oral performance).
Some research findings suggest that the differences in performance resulting from
different task characteristics are too fine to be measured by ratings (see
O’Loughlin and Wigglesworth, 2003 above). However, Alderson (2005) argues
that diagnostic assessment should focus on specific rather than global abilities and
therefore these differences might be more salient in a diagnostic context.
Test takers vary not only in their linguistic skills but also in their cultural back-
ground, writing proficiency, knowledge, ideas, emotions, opinions (Kroll, 1998),
language background, socio-economic status, cultural integration (Hamp-Lyons,
1990), personality and learning style (Hamp-Lyons, 2003). Because of this, writ-
ers’ performance varies from occasion to occasion. Hamp-Lyons (1990) points
out that for this reason researchers (see for example A. Wilkinson, 1983) have
criticized models of writing development for failing to account for affective fac-
tors, and for focussing only on descriptions of linguistic skills and cognitive abili-
ties. This view is supported by Porter (1991) who found a number of affective
variables to influence the test score awarded to a student in the context of an oral
assessment. Hamp-Lyons (2003) also found that test takers bring certain expecta-
tions to the test which are usually based on their life experiences up to that point.
It is therefore important to make sure that test takers receive as much background
information about the test as possible.
The findings reported are significant for diagnostic assessment. Firstly, Alderson
(2005) noted that diagnostic tests are usually low-stakes or no stakes and that
therefore little anxiety or few affective barriers arise on the part of the test taker.
However, it is important to investigate the TLU situation for which the diagnosis
is undertaken. For example, if a diagnostic test is designed to provide test takers
with detailed instruction to help them with an essay that will be written in a very
high-stakes testing context, then the data elicited in the context of a low-stakes
diagnostic assessment might not be representative of what the learner would be
able to achieve in the more pressured, high-stakes context. Secondly, it is possible
that although students’ extrinsic motivation might be lower in the context of a
low-stakes diagnostic test, their intrinsic motivation might be increased because
they are aware that they will receive valuable feedback on their writing ability.
2.4.3 Raters
This section on the rater first addresses the different ways in which raters can
vary. This is then followed by a discussion of research which has investigated the
reasons why raters differ in their ratings. The third part describes the most com-
mon way of dealing with rater variability: rater training. Finally, the research is
discussed in terms of its implications for diagnostic assessment.
A number of different studies have identified a variety of ways in which raters can
vary (McNamara, 1996; Myford & Wolfe, 2003, 2004).
The first possible rater effect is the severity effect. In this case, raters are found to
consistently rate either too harshly or too leniently if compared to other raters or
established benchmark ratings. The second rater effect is the halo effect. The halo
effect occurs when raters fail to discriminate between a number of conceptually
distinct traits, but rather rate a candidate’s performance on the basis of a general,
overall impression. The third rater effect described in the literature is the central
tendency effect. Landy and Farr (1983) described this effect as ‘the avoidance of
extreme (favourable or unfavourable) ratings or a preponderance of ratings at or
near the scale midpoint’ (p.63). The fourth rater effect is inconsistency or what
Myford and Wolfe (2003) term randomness. Inconsistency is defined as a ten-
dency of a rater to apply one or more rating scale categories in a way that is in-
consistent with the way in which other raters apply the same scale. This rater will
display more random variation than can be expected. The fifth rater effect is the
bias effect. When exhibiting this effect, raters tend to rate unusually harshly or
leniently on one aspect of the rating situation. For example, they might favour a
certain group of test takers or they might always rate too harshly or leniently on
one category of the rating scale in use. All these rater effects can be displayed ei-
ther by individual raters or a whole group of raters.
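The severity and central tendency effects described above can be made concrete with a toy computation. The ratings below are entirely invented for illustration; operational analyses use many-facet Rasch measurement rather than these raw-score indicators.

```python
from statistics import mean, stdev

# Illustrative toy example (all numbers invented): three raters score the
# same five essays on a 1-9 scale.
ratings = {
    "rater_A": [5, 6, 4, 7, 5],   # tracks the consensus
    "rater_B": [3, 4, 2, 5, 3],   # consistently lower: a severity effect
    "rater_C": [5, 5, 5, 5, 5],   # glued to the midpoint: central tendency
}

n_essays = 5
consensus = [mean(r[i] for r in ratings.values()) for i in range(n_essays)]

# Severity indicator: mean deviation from consensus (negative = harsh).
severity = {name: mean(s - c for s, c in zip(scores, consensus))
            for name, scores in ratings.items()}

# Central tendency indicator: very small spread across essays.
spread = {name: stdev(scores) for name, scores in ratings.items()}
```

Here `rater_B` shows a negative severity index of over a point, and `rater_C`'s zero spread signals central tendency; a halo effect would show up analogously as near-identical scores across conceptually distinct traits.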
How the raters of a writing product interpret their role, the task, and the scoring
procedures constitutes one source of variance in writing assessment. Several re-
searchers (e.g. Hamp-Lyons, 2003) have shown that apart from free variance
(variance which cannot be systematically explained), raters differ in their deci-
sion-making because of their personal background, professional training, work
experience and rating background and this influences their performance. Differ-
ences have, for example, been found in the way ESL trained teachers (teachers
specifically trained to teach ESL students) and English Faculty staff (who have no
specific ESL training) rate essays (O'Loughlin, 1993; Song & Caruso, 1996;
Sweedler-Brown, 1993). Song and Caruso (1996), for example, found that English
Faculty staff seemed to give greater weight to overall content and quality of rhe-
torical features than they did to language.
Rater occupation also seems to influence rating. Brown (1995), in the context of
oral performance, found that ESL teachers rate grammar, expression, vocabulary
and fluency more harshly than tour guides. Elder (1993), also in an oral context,
compared ESL teachers with mathematics and science teachers. She found that
ESL raters focus more on language components. There was little agreement be-
tween the two groups on accuracy and comprehension and most agreement on in-
teraction and communicative effectiveness. Finally, raters seem to be as much in-
fluenced by their own cultural background as they are by the students’ (Connor-
Linton, 1995; Kobayashi & Rinnert, 1996) and by more superficial aspects of the
writing script like handwriting (A. Brown, 2003; Milanovic, Saville, & Shen,
1996; S. D. Shaw, 2003; Vaughan, 1991).
Rater training has been shown to be effective. For example, Weigle (1994a;
1994b) was able to show that rater training is able to increase self-consistency of
individual raters by reducing random error, to reduce extreme differences between
raters in terms of leniency and harshness, to clarify understanding of the rating
criteria and to modify rater expectations in terms of both the characteristics of the
writers and the demands of the writing tasks.
Research has also examined the effects of giving raters individualized feedback on their rating patterns. Raters in one such study were found to display bias when rating the tape and the live version of the interview but, in general, seemed able to incorporate the feedback into their subsequent rating sessions, so that ‘in many cases, bias previously evident in various aspects of their ratings is reduced’ (p. 318). A follow-up study by Lunt, Morton and Wigglesworth (1994), however, failed to confirm any significant changes in the pattern of rating after feedback was given in this way. A more recent study by Elder, Knoch,
Barkhuizen and von Randow (2005), conducted in the context of writing assess-
ment, found that although for the whole group of raters the feedback resulted in
improved rating behaviour, some raters were more receptive to this type of train-
ing than others.
Most studies of rater training have shown that differences in judge severity persist and in some cases can account for as much as 35% of variance in students’ written performance (Cason & Cason, 1984). Raw scores, therefore, cannot be considered a reliable guide to candidate ability (McNamara, 1996) and double or multiple rating is often recommended. In addition, in large-scale testing contexts, it
may also be necessary to use statistical programs which adjust for differences be-
tween individual raters on the basis of their known patterns of behaviour.
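The idea of adjusting scores for known rater severity can be sketched crudely as follows. This is my own simplification with invented scores, not the procedure of any particular program: operational systems estimate severity within a many-facet Rasch model rather than by subtracting raw means.

```python
from statistics import mean

# Invented data: (rater, essay, raw score). Rater r1 is known to be lenient,
# r2 harsh, by the same margin.
scores = [("r1", "essay1", 6), ("r1", "essay2", 7),
          ("r2", "essay1", 4), ("r2", "essay2", 5)]

overall = mean(s for _, _, s in scores)
severity = {r: mean(s for rr, _, s in scores if rr == r) - overall
            for r in {r for r, _, _ in scores}}

# Subtract each rater's average deviation from the pooled mean.
adjusted = {(r, e): s - severity[r] for r, e, s in scores}
```

After adjustment the two raters agree on both essays, which is the effect such programs aim for on a much larger and noisier scale.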
So how can these research findings contribute to diagnostic assessment? First,
rater variation needs not only to be minimized for the overall writing score, it also
needs to be minimized across all traits on the rating scale. This is because scores
in the context of diagnostic assessment should not be averaged, but rather reported
back to stakeholders individually. This would ensure that the diagnostic informa-
tion is as accurate and informative as possible. Secondly, as the background of
raters has an influence on the ratings, it is important that raters are trained and
monitored so that their background does not lead them to emphasize certain traits
in the rating scale over others. This would result in a distorted feedback profile.
The feedback needs to be as unbiased as possible across all traits in the scale. It
was also reported by Brown (1995) that native speaker raters seem to adhere more
closely to the rating scale than non-native speaker raters who have been found to
be more influenced by their intuitions. This might suggest that native speaker rat-
ers are more desirable in the context of diagnostic assessment, as rating based on
intuitions might result in a halo effect (i.e. very similar ratings across different
traits) which leads to a loss of diagnostic information. Thirdly, it is important that,
as part of rater training, regular bias analyses are conducted, so that raters display-
ing a bias towards a certain trait in the rating scale are identified and retrained.
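A trait-level bias check of the kind suggested here can be illustrated with a toy computation (ratings invented; real bias analyses use many-facet Rasch measurement, not raw means): flag any rater whose average deviation from the group on one trait is much larger than on the others.

```python
from statistics import mean

traits = ["accuracy", "fluency", "organisation"]
ratings = {                      # invented: mean score each rater gave per trait
    "r1": {"accuracy": 5.0, "fluency": 5.2, "organisation": 5.1},
    "r2": {"accuracy": 5.1, "fluency": 4.9, "organisation": 5.0},
    "r3": {"accuracy": 3.2, "fluency": 5.0, "organisation": 5.2},
}

# Group mean per trait, then flag rater-trait pairs deviating by more than
# one scale point (an arbitrary threshold for this sketch).
group = {t: mean(r[t] for r in ratings.values()) for t in traits}
flags = [(name, t) for name, r in ratings.items() for t in traits
         if abs(r[t] - group[t]) > 1.0]
```

Only `r3` is flagged, and only on accuracy: exactly the rater-by-trait bias that would distort a diagnostic feedback profile while leaving the overall score looking unremarkable.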
In the following section, research about automated ratings using the computer is
reviewed and the relevance of these programs to diagnostic assessment is dis-
cussed.
2.5 Automated essay scoring
The difficulty of obtaining consistently high reliability in the ratings of human
judges has resulted in research in the field of automated essay scoring. This re-
search began as early as the 1960s (Shermis & Burstein, 2003). Several computer
programs have been designed to help with the automated scoring of essays.
In the past few years, a number of computer programs have become available
which completely replace human raters. This advance has been made possible by
developments in Natural Language Processing (NLP). NLP uses tools such as
syntactic parsers which analyse discourse structure and organisation, and lexical
similarity measures which analyze the word use of a text. There are some general
advantages to automated assessment. It is generally understood to be cost effec-
tive, highly consistent, objective and impartial. However, sceptics of NLP argue
that these computer techniques are not able to evaluate communicative writing
ability. Shaw (2004) reviews four automated essay assessment programs: Project
Essay Grader, the E-rater model, the Latent semantic analysis model and the text
categorisation model.
Project Essay Grader (Page, 1994) examines the linguistic features of an essay. It
makes use of multiple linear regression to ascertain an optimal combination of
weighted features that most accurately predict human markers’ ratings. Development of this program began in the 1960s. It was only a partial success, as it addressed only indirect measures of writing and could not capture rhetorical, organisational and stylistic features of writing.
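The regression idea behind Project Essay Grader can be sketched as follows. All numbers are invented, and a single indirect feature (essay length) stands in for the many weighted features PEG combined via multiple regression.

```python
# Toy sketch: predict human holistic scores from an indirect, countable
# feature of the text by least-squares linear regression.
lengths = [150, 300, 450, 600]   # words per essay (invented)
human = [2.0, 3.0, 4.0, 5.0]     # holistic scores awarded by human markers

n = len(lengths)
mean_x = sum(lengths) / n
mean_y = sum(human) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, human))
         / sum((x - mean_x) ** 2 for x in lengths))
intercept = mean_y - slope * mean_x

def predict(words):
    """Predicted holistic score for an essay of the given length."""
    return intercept + slope * words
```

The sketch also makes the criticism above concrete: the fitted model rewards whatever its indirect features measure (here, sheer length), not rhetorical, organisational or stylistic quality.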
The second program evaluated by Shaw (2004) is Latent Semantic Analysis
(LSA). LSA is based on word co-occurrence statistics represented as a matrix,
which is “decomposed and then subjected to a dimensionality technique” (p.14).
This system looks beneath surface lexical content to quantify deeper content by
mapping words onto a matrix and then rates the essay on the basis of this matrix
and the relations in it. The LSA model is the basis of the Intelligent Essay Assessor (Foltz, Laham, & Landauer, 2003). LSA has been found to be almost as reliable as human assessors, but because it does not account for syntactic information, it can be tricked. Nor can it cope with certain features that are difficult for NLP (e.g. negation).
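Why a purely lexical model ignores syntax can be shown with a minimal illustration of my own (this is not the Intelligent Essay Assessor's actual pipeline): bag-of-words vectors for two texts with opposite meanings are identical. Full LSA additionally factorizes the word-by-document co-occurrence matrix (by singular value decomposition) to capture deeper semantic similarity; that step is omitted here.

```python
from collections import Counter
from math import sqrt

def vectorize(text):
    # Word-frequency vector: all syntactic information is discarded.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = lambda v: sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

# Opposite meanings, identical word counts: similarity is ~1.0.
same = cosine(vectorize("the dog bit the man"),
              vectorize("the man bit the dog"))
```

This is the sense in which such a system "can be tricked": a writer who reuses the right vocabulary scores well regardless of what the sentences actually assert.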
The third program, the Text Categorisation Technique Model (Larkey, 1998), uses
a combination of key words and linguistic features. In this model a text document
is grouped into one or more pre-existing categories based on its content. This
model has been shown to match the ratings of human examiners about 65% of the
time. Almost all ratings were within one grade point of the human ratings.
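The two agreement figures quoted for these systems, exact agreement and agreement within one grade point, are simple proportions; the grades below are invented for illustration.

```python
# Invented parallel gradings of eight essays.
human = [3, 4, 2, 5, 4, 3, 1, 4]
machine = [3, 5, 2, 4, 4, 2, 1, 4]

# Proportion of essays where the grades match exactly,
# and where they differ by at most one grade point.
exact = sum(h == m for h, m in zip(human, machine)) / len(human)
within_one = sum(abs(h - m) <= 1 for h, m in zip(human, machine)) / len(human)
```

On this toy data exact agreement is 62.5% while adjacent agreement is 100%, which mirrors the pattern reported for the Text Categorisation Technique Model.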
Finally, e-rater was developed by the Educational Testing Service (ETS) (Burstein et al., 1998). It uses a combination of statistical and NLP techniques to extract
linguistic features. The programme compares essays at different levels in its data
base with features (e.g. sentence structure, organisation and vocabulary) found in
the current essay. Essays earning high scores are those with characteristics most
similar to the high-scoring essays in the data base and vice versa. Over one hun-
dred automatically extractable essay features and computerized algorithms are
used to extract values for every feature from each essay. Then, stepwise linear re-
gression is used to group features in order to optimize rating models. The content
of an essay is checked by vectors of weighted content words. An essay that remains focused is judged coherent, as evidenced by its use of discourse structures, good lexical resource and varied syntactic structure. E-rater has been evaluated by Burstein
et al. (1998) and has been found to have levels of agreement with human raters of
87 to 94 percent. E-rater is used operationally in GMAT (Graduate Management
Admission Test) as one of two raters and research is underway to establish the
feasibility of using e-rater operationally as second rater for the TOEFL iBT inde-
pendent writing samples (Jamieson, 2005; Weigle, Lu, & Baker, 2007).
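The "most similar to the high-scoring essays in the data base" idea can be sketched as a nearest-profile lookup. The features and values below are hypothetical; e-rater's actual models are regression-based over more than a hundred automatically extracted features.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical feature profile per score band:
# (mean sentence length, distinct-word ratio)
prototypes = {2: (8.0, 0.35), 4: (14.0, 0.50), 6: (21.0, 0.65)}

def score(features):
    # Assign the band whose typical profile the essay's features are
    # closest to.
    return min(prototypes, key=lambda band: dist(prototypes[band], features))
```

An essay with a mean sentence length of 15 words and a distinct-word ratio of 0.52 would land in band 4 here, because its feature vector sits nearest that band's prototype.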
Based on the E-rater technology, ETS has developed a programme called Crite-
rion. This programme is able to provide students with immediate feedback on
their writing ability, in the form of a holistic score, trait level scores and detailed
feedback.
There are several reasons why computerized rating of performance essays might
be useful for diagnostic assessment. The main advantage of computer grading
might be the quick, immediate feedback that this scoring method can provide
(Weigle et al., 2007). Alderson (2005) stressed that for diagnostic tests to be ef-
fective, the feedback should be immediate, a feature which his indirect test of
writing in the context of DIALANG is able to achieve. Performance assessment of
writing rated by human raters will inevitably mean a delay in score reporting. The
second advantage might be the internal consistency of such computer programs
(see for example the feedback provided by the Criterion programme developed by
ETS). However, research comparing human raters and the e-rater technology has shown (1) that e-rater was not as sensitive to some aspects of writing as human raters were when length was removed as a variable (Chodorow & Burstein, 2004), (2) that
human/human correlations were generally higher than human/e-rater correlations
(Weigle et al., 2007), and (3) that human raters fared better than automated scor-
ing systems when correlations were investigated of writing scores with grades,
instructor assessment of writing ability, independent rater assessment on disci-
pline-specific writing tasks and student self-assessment of writing (Powers,
Burstein, Chodorow, Fowles, & Kukich, 2000; Weigle et al., 2007).
There are also a number of concerns about using computerized essay rating.
Firstly, these ratings might not be practical in contexts where computers are not
readily available. Furthermore, it could be argued that writing is essentially a social act and that writing to a computer violates the social nature of writing. Similarly, what counts as an error might vary across different sociolinguistic contexts
and therefore human raters might be more suitable to evaluate writing (Cheville,
2004). In addition, diagnostic tests should provide feedback on a wide variety of features of a learner’s performance, but current rating programs cannot measure as many features as human raters can. This means that automated
scoring programs might under-represent the writing construct. For example, the
programs reviewed above were not able to evaluate communicative writing ability
or more advanced features of syntactic complexity. Taking all the above into ac-
count, it can be argued that human raters should be able to provide more useful
information for diagnostic assessment.
2.6 Conclusion
This chapter has attempted to situate diagnostic assessment within the literature
on performance assessment of writing, and research regarding the influences of a
number of variables on performance assessment was reported. Because the focus
of this study is the rating scale, research relating to rating scales and rating scale
development, as well as considerations regarding the design of a rating scale for
diagnostic assessment, are considered in the following chapter.
---
Notes:
1 For a more detailed discussion of this and later models refer to Chapter 3.
2 Most research cited in this section is based on studies conducted in the context of oral assessment. This research is equally relevant to writing assessment.
Chapter 3: RATING SCALES
3.1 Introduction
The aim of this study is to develop a rating scale that is valid for diagnostic as-
sessment. This chapter therefore begins with a definition of rating scales. To es-
tablish what options are available to a rating scale developer interested in develop-
ing a scale specific to diagnostic assessment, the chapter illustrates the different
options available during the development process. This is followed by a section
on criticisms of current rating scales. Then the chapter turns to an examination of
rating scales for diagnostic assessment. Here reasons are suggested why current
rating scales are unsuitable for the diagnostic context. Drawing on the literature
on diagnostic assessment as well as considerations in rating scale development,
five suggestions are made as to what a rating scale for diagnostic assessment
should look like.
Weigle (2002) describes a number of very practical steps that should be taken into
account in the process of scale development. Because these different steps illus-
trate the different options rating scale designers have in the design process, each is
described in detail below. For a rating scale to be valid, each of the different de-
sign options has to be weighed carefully.
1. What type of rating scale is desired? The scale developer should decide if
a holistic, analytic, primary trait or multi-trait rating scale is preferable
(each of these options will be described in detail below).
2. Who is going to use the rating scale? The scale developer needs to decide
between three functions of rating scales identified by Alderson (1991).
3. What aspects of writing are most important and how will they be divided
up? The scale developer needs to decide on what criteria to use as the basis
for the ratings.
4. What will the descriptors look like and how many scoring levels will be
used? There are limits to the number of distinctions raters can make. Many
large-scale examinations use between six and nine scale steps. This is de-
termined by the range of performances that can be expected and what the
test result will be used for. Developers also have to make decisions regard-
ing the way that band levels can be distinguished from each other and the
types of descriptor.
5. How will scores be reported? Scores from an analytic rating scale can ei-
ther be reported separately or combined into a total score. This decision
needs to be based on the use of the test score. The scale developer also has
to decide if certain categories on the rating scale are going to be weighted.
6. How will the rating scale be validated? The rating scale developer needs to
consider how the rating scale will be developed and what aspects of valid-
ity are paramount for the type of rating scale designed.
Weigle (2002) provides a useful overview
of the four different types of rating scales (Table 2):
Table 2: Types of rating scales for the assessment of writing (based on Weigle, 2002)
                 Specific to a particular     Generalizable to a class of
                 writing task                 writing tasks
Single score     Primary Trait                Holistic
Multiple score   Multiple Trait               Analytic
Holistic scoring is based on a single, integrated score of writing behavior and re-
quires the rater to respond to writing as a whole. Raters are encouraged to read
each writing script quickly and base their score on a ‘general impression’. This
global approach to the text reflects the idea that writing is a single entity, which is
best captured by a single score that integrates the inherent qualities of the writing.
A well-known example of a holistic scoring rubric in ESL is the scale used for the
Test of Written English (TWE), which was administered as an optional extra with
the TOEFL test and has now been largely replaced by the TOEFL iBT1.
One of the advantages of this scoring procedure is that test takers are unlikely to
be penalized for poor performance on one aspect (e.g. grammatical accuracy).
Generally, it can be said that the approach emphasizes what is well done and not
the deficiencies (White, 1985). Holistic rating is generally seen as very efficient,
both in terms of time and cost. It has however been criticized and has nowadays
generally fallen out of favor for the following reasons. Firstly, it has been argued
that one score is not able to provide sufficient diagnostic information to be of
much value to the stakeholders. Uneven abilities, as often displayed by L2 writing
candidates (Kroll, 1998), are lumped together in one score. Another problem with
holistic scoring is that raters might overlook one or two aspects of writing per-
formance. Furthermore, it can be argued that, if raters are allowed to assign
weightings for different categories to different students, this might produce unfair
results and a loss of reliability and ultimately of validity. A further problem spe-
cific to L2 writing is that the rating scale might lump both writing ability and lan-
guage proficiency into one composite score. This might potentially result in the
same writing score for ESL learners who struggle with their linguistic skills and a
native speaker who lacks essay writing skills. The fact that writers are not neces-
sarily penalized for weaknesses but rather rated on their strengths can also be seen
as a disadvantage as areas of weakness might be important for decision-making
regarding promotion (Bacha, 2001; Charney, 1984; Cumming, 1990; Hamp-
Lyons, 1990). Finally, it is likely that test takers who attempt more difficult forms
and fail to produce these accurately might be penalized more heavily than test
takers using very basic forms accurately. Research has shown that holistic scores
correlate with quite superficial characteristics like handwriting (e.g. Sloan &
McGinnis, 1982).
Table 3 below from Weigle (2002, p.121) summarizes the advantages and disad-
vantages of holistic and analytic scales.
Table 3: A comparison between holistic and analytic rating scales (based on Weigle, 2002)

Quality             Holistic Scale                            Analytic Scale
Reliability         Lower than analytic but still             Higher than holistic
                    acceptable
Construct Validity  Holistic scale assumes that all relevant  Analytic scales more appropriate for
                    aspects of writing develop at the same    L2 writers as different aspects of
                    rate and can thus be captured in a        writing ability develop at different
                    single score; holistic scores correlate   rates
                    with superficial aspects such as length
                    and handwriting
Practicality        Relatively fast and easy                  Time-consuming; expensive
Impact              Single score may mask an uneven writing   More scales provide useful diagnostic
                    profile and may be misleading for         information for placement and/or
                    placement                                 instruction; more useful for rater
                                                              training
Authenticity        White (1985) argues that reading          Raters may read holistically and
                    holistically is a more natural process    adjust analytic scores to match
                    than reading analytically                 holistic impression
A third scale type is primary trait scoring, which was developed in the mid-1970s
by Lloyd-Jones (1977) for the National Assessment of Educational Progress
(NAEP) in an effort to obtain more information than a single holistic score. The
goal is to predetermine criteria for writing on a particular topic. It therefore repre-
sents ‘a sharpening and narrowing of criteria to make the rating scale fit the spe-
cific task at hand’ (Cohen, 1994, p. 32) and is therefore context-dependent
(Fulcher, 2003). The approach allows for attention to only one aspect of writing.
Because these scales only focus on one aspect of writing, they may not be integra-
tive enough. Also, it might not be fair to argue that the aspect singled out for as-
sessment is primary enough to base a writing score on it. Another reason why
primary trait scoring has not been readily adopted is that each task takes about 60 to 80 hours to develop.
The fourth and final type of rating scale is multi-trait scoring. Essays are scored
for more than one aspect, but the criteria are developed so that they are consistent
with the prompt. Validity is improved as the test is based on expectations in a par-
ticular setting. As the ratings are more task-specific, they can provide more diag-
nostic information than can a generalized rating scale. However, the scales are
again very time consuming to develop and it might be difficult to identify and
empirically validate aspects of writing that are especially suitable for the given
context. There is also no assurance that raters will not fall back on their traditional
way of rating.
Neither primary trait nor multi-trait scoring has been commonly used in ESL assessment, probably because they are very time consuming to design and cannot be
reused for other tasks. Holistic and analytic rating scales have most commonly
been used in writing assessment.
It is important that the format of the rating scale, the theoretical orientation of the
description and the formulation of the definitions are appropriate for the context
and purpose in mind. In drawing attention to this, Alderson (1991) identified three
different rating scale subcategories depending on the purpose the score will be
used for. It is important to note that, for each of these subcategories, descriptors
might be formulated in different ways. Firstly, user-oriented scales are used to
report information about typical behaviors of a test taker at a given level. This in-
formation can be useful for potential employers and others outside the education
system to clarify the circumstances in which a test taker will be able to operate
adequately (Pollitt & Murray, 1996). Descriptors are usually formulated as ‘can
do’ statements. The second type of scale that Alderson considers is assessor-oriented scales, which are designed to guide the rating process, focussing on the
quality of the performance typically observed in a student at a certain level.
Thirdly, there are constructor-oriented scales which are produced to help the test
developer select tasks for a test by describing what sort of tasks a student can do
at a certain level. The scales describe potential test items that might make up a
discrete point test for each level. Fulcher (2003) points out that the information in
each of these scales might be different and it is therefore essential for establishing
validity that scales are used only for the purpose for which they were designed.
North (2003) argues that scales used to rate second language performance should
be assessor-oriented, which means that they should focus on aspects of ability
shown by the performance. Although this might seem obvious, he then shows that
rating scales that follow the Foreign Service Institute (FSI) family of rating scales
(described later in this chapter in more detail) often mix these different purposes
in the one scale.
The ways in which rating scales and rating criteria are constructed and interpreted
by raters act as the de facto test construct (McNamara, 2002). North (2003) how-
ever cautions that viewing the rating scale as a representation of the construct is
simplistic, as the construct is produced by a complex interplay of tasks, perform-
ance conditions, raters and rating scale. However, it is fair to say that the rating
scale represents the developers’ view of the construct. Therefore, rating scales for
writing are usually based on what scale developers think represents the construct
of writing proficiency and the act of defining criteria involves operationalizing the
construct of proficiency.
Turner (2000) suggests that although rating scales play such an important part in
the rating process and ultimately represent the construct on which the perform-
ance evaluation is based, there is surprisingly little information on how commonly
used rating scales are constructed. The same point has also been made by McNa-
mara (1996), Brindley (1998) and Upshur and Turner (1995). It is however vital
to have some knowledge of how scales are commonly constructed in order to un-
derstand some of the main issues associated with rating scales. Fulcher (2003)
points out that many rating scales are developed based on intuition (see also
Brindley, 1991). He describes three sub-types of intuitive methods, which are out-
lined below.
Several researchers have described models of rating scale development that are
not based on intuition. These design methods can be divided into two main
groups. Firstly, there are rating scales that are based on a theory. This could be a
theory of communicative competence, a theory of writing or a model of the deci-
sion-making of expert raters. Secondly, scales can be based on empirical methods. The
following sections describe intuition-based, theory-based and empirically-based
methods in more detail.
The Foreign Service Institute (FSI) family of rating scales is based on intuitive
design methods (Fulcher, 2003). These scales became the basis for many other scales, such as the ILR (Interagency Language Roundtable) and ACTFL (American Council on the Teaching of Foreign Languages) rating scales, which are still commonly used today.
The FSI scale was developed very much in-house in a United States government testing context to test foreign service personnel, and all the scales rest on several very basic principles, all of which have been criticized. Firstly, the scale descriptors are defined in relation to levels within the scale and are not based on external criteria. The only reference point is the ‘educated native speaker’. As early as the late 1960s, Perren (1968) criticized this and argued that the scale should instead be based on a proficient second language speaker. The ILR scale ranges from ‘no practical
ability’ to ‘well-educated native speaker’. The Australian Second Language Proficiency Ratings (ASLPR) also use these criteria. The concept of the ‘educated native speaker’ has come increasingly under attack (see for example Bachman & Savignon, 1986; Lantolf & Frawley, 1985) because native speakers vary considerably in their ability.
Secondly, it has been contended that the scale descriptors of the FSI family of rating scales are based on very little empirical evidence. In a related case, Alderson (1991) was able to show that some of the IELTS band descriptors described performances that were not observed in any of the actual samples. This is a clear threat to the validity of the test.
Thirdly, the descriptors in the FSI family of rating scales range from zero proficiency through to native-like performance. Each descriptor exists in relation to the others. The criticism that has been made in relation to this point (see for example Pienemann, Johnston, & Brindley, 1988; Young, 1995) is that the progression of the descriptors is not based on language development as shown by researchers investigating second language acquisition. It can therefore be argued that the theories underlying the development of the rating scales have not been validated and are probably based on the intuitions and personal theories of the scale developers.
Fourthly, rating scales in the FSI family are often marked by a certain amount of
vagueness in the descriptors. Raters are asked to base their judgements on key
terms like ‘good’, ‘fluent’, ‘better than’, ‘always’, ‘usually’, ‘sometimes’ or ‘many
mistakes’. Despite all these criticisms levelled at intuitively developed rating
scales, it is important to note that they are still the most commonly used scales in
high-stakes assessments around the world.
North (2003) argues that, inevitably, the descriptors in proficiency scales are a
simplification of a very complex phenomenon. In relation to language learning, it
would be ideal if one could base the progression in a language proficiency scale
on what is known of the psycholinguistic development process. However, the insights from this area of investigation are still quite limited and therefore hard to
apply (see for example Ingram, 1995). It could then be argued that if the stages of
proficiency cannot be described satisfactorily, one should not use proficiency
scales. But, as North points out, raters need some sort of reference point to follow.
Another possible response to the problem, taken for example by Mislevy (1995), is that proficiency scales should be based on some sort of simplified student model, that is, a basic description of selected aspects that characterize real students. It is, however, clear that unless the underlying framework of a rating scale takes some account of linguistic theory and research in its definition of proficiency, the validity of the scale will be limited (Lantolf & Frawley, 1985).
Below, four types of theories (or models) are described which could be used as a
basis for a rating scale of writing: the four skills model, models of communicative
competence, theories/models of writing and theories of rater decision-making.
3.3.3.2.1 The Four Skills Model
A widely used conceptual framework seems to be the Four Skills Model proposed
by Lado (1961) and Carroll (1968). North (2003) summarizes the common features of the model with respect to language in the following table (Table 4):
It can be seen from Table 4 that each skill is underpinned by the three elements,
phonology/orthography, lexicon and grammar. In order to write or speak, the
learner puts lexis into appropriate grammatical structures and uses phonology or
orthography to realize the sentence or utterance. North (2003) points out that although the model is not theoretically based, it is generic and is therefore potentially applicable to any context.
An example of a rating scale based on the Four Skills Model is a scale proposed
by Madsen (1983). This scale shows that in the Four Skills Model communication
quality and content are not assessed (see Table 5 below).
Table 5: Example of rating scale representing Four Skills Model (Madsen, 1983)
Mechanics 20%
Vocabulary choice 20%
Grammar and usage 30%
Organisation 30%
Total 100%
The scale in Table 5 above has the additional feature that ‘grammar and usage’ and ‘organisation’ are given more weight than the other two categories. It is not clear whether this weighting was based on any theoretical or empirical evidence.
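To make the arithmetic of such a weighted scheme concrete, the sketch below combines four component scores using the Table 5 weights. This is purely illustrative: the component scores are invented, and it is assumed (which the table does not specify) that each component is first scored out of 100.

```python
# Illustrative sketch of combining analytic component scores with the
# Table 5 weights (Madsen, 1983). The weights come from the table; the
# component scores below are invented, and each component is assumed
# to be scored out of 100.

WEIGHTS = {
    "mechanics": 0.20,
    "vocabulary_choice": 0.20,
    "grammar_and_usage": 0.30,
    "organisation": 0.30,
}

def weighted_total(scores):
    """Combine per-component scores (each 0-100) into one weighted total."""
    return sum(WEIGHTS[component] * score for component, score in scores.items())

scores = {
    "mechanics": 80,
    "vocabulary_choice": 70,
    "grammar_and_usage": 60,
    "organisation": 75,
}
print(weighted_total(scores))  # 16 + 14 + 18 + 22.5 = 70.5
```

Because ‘grammar and usage’ and ‘organisation’ carry 30% each, a one-point difference in either of them moves the total half as much again as the same difference in the other two components.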
The main advantage of adopting the categories of the Four Skills Model for rating scales lies in their simplicity. The rating categories are simple and familiar to everyone. North (2003) points out that the main disadvantage of the model is that it does not differentiate between range and accuracy of both vocabulary and grammar. Grammar may be interpreted purely in terms of counting mistakes. There is also no measurement of communicative ability in this type of rating scale.
One way of dealing with the lack of communicative meaning in the Four Skills
Model is to base the assessment criteria on a model of language ability (see also
Luoma, 2004 on this topic in the context of speaking assessment). A number of
test designers (e.g. Clarkson & Jensen, 1995; Connor & Mbaye, 2002; Council of
Europe, 2001; Grierson, 1995; Hawkey, 2001; Hawkey & Barker, 2004; McKay,
1995; Milanovic, Saville, Pollitt, & Cook, 1995) have chosen to base their rating
scales on Canale and Swain’s (1980; 1983), Bachman’s (1990) or Bachman and
Palmer’s (1996) models of communicative competence which will be described in
more detail below.
One of the first theories of communicative competence was developed by Hymes
(1967; 1972). He suggested four distinct levels of analysis of language use that are
relevant for understanding regularities in people’s use of language. The first level
is what is possible in terms of language code, the grammatical level. At another
level, what a language user can produce or comprehend in terms of time and processing constraints should be examined. Another level is concerned with what is appropriate in different language-use situations. Finally, language use is shaped by the conventions and habits of a community of users. Hymes also made a distinction between language performance, as in a testing situation, and more abstract models of underlying knowledge and capacities which might not be tapped in most performance situations. Hymes’ model of communicative competence was developed for the L1 context but it seems equally relevant for the L2 context.
Canale and Swain (1980) were the first authors to adapt Hymes’ model for the L2
context. The most influential feature of this model was that it treated different
domains of language as separate, which was ground-breaking after a decade of
research based on Oller’s hypothesis that language ability is a unitary construct
(see for example Oller, 1983; Oller & Hinofotis, 1980; Scholz, Hendricks, Spurling, Johnson, & Vandenburg, 1980). Canale and Swain (1980) proposed the following domains of language knowledge: grammatical competence, sociolinguistic competence and strategic competence. Canale (1983) later extended this to include discourse competence. Sociolinguistic competence stresses the appropriateness of language use, the language user’s understanding of social relations and
how language use relates to them. Discourse competence is concerned with the
ability of the language user to handle language beyond the sentence level. This
includes the knowledge of how texts are organised and how underlying meaning can be extracted based on these principles. As Skehan (1998a) points out, it is important to note here that while native speakers distinguish themselves mainly in the area of linguistic competence, some might have problems in the areas of sociolinguistic and discourse competence. Strategic competence, according to Canale and Swain (1980), only comes into play if the other competences are unable to cope.
According to Skehan (1998a), the model proposed by Canale and Swain (1980), and later extended by Canale (1983), is lacking in a number of ways. It does not relate the underlying abilities to performance, it does not account for different contexts, and it fails to recognize that some of the competencies might be more important in some situations than in others. Skehan also criticizes the position given to strategic competence, which in this model only comes into play when there is a communication breakdown and is therefore used only to compensate for problems with the other competences.
The model proposed by Canale and Swain was subsequently further developed by
Bachman (1990). His model distinguishes three components of language ability:
language competence, strategic competence and psycho-physiological mechanisms/skills. Language competence in turn consists of two components, organisational and pragmatic competences. Organisational competence includes the
knowledge involved in creating or recognizing grammatically correct utterances
and comprehending their propositional content (grammatical competence) and in
organising them into text (textual competence). Pragmatic competence includes
illocutionary competence and sociolinguistic competence. Bachman (1990) redefines strategic competence as a “general ability which enables an individual to
make the most effective use of available abilities in carrying out a given task” (p.
106).
In 1996, Bachman and Palmer revised their model to include the role played by
affective factors in influencing language use. A further change in this model is
that strategic competence is now seen as consisting of a set of metacognitive
strategies. ‘Knowledge structures’ (knowledge of the world) from the 1990 model
has been relabelled ‘topical knowledge’. In this model, strategic knowledge can
be thought of as a higher order executive process (Bachman & Palmer, 1996)
which includes goal-setting (deciding what to do), assessment (deciding what is
needed and how well one has done) and planning strategies (deciding how to use what one has). The role and subcomponents of the language knowledge component remain essentially unchanged from the Bachman (1990) model.
Skehan (1998a) sees the Bachman and Palmer model as an improvement on previous models in that it is more detailed in its specification of the language component, defines the relationships between the different components more adequately, is more grounded in linguistic theory and is also more empirically based.
He finds, however, that there are problems with the operationalization of the con-
cepts, which are generally structured in the form of a list. It is difficult to find any
explanation in the model for why some tasks are more difficult than others and
how this influences accuracy, fluency and complexity. Luoma (2004) suggests
that the quite detailed specification of the language component distracts from other components and knowledge types, which may as a result receive less emphasis. She therefore suggests that test developers might want to use Bachman and Palmer’s (1996) model in conjunction with other frameworks. Like Bachman’s (1990) model, the Bachman and Palmer (1996) model has not been validated empirically.
After examining the Four Skills model and the various models of communicative
competence as possible theories for rating scale development, we now turn to theories in the area of writing research to establish whether any of these models would be useful as a theoretical basis for rating scale development.
Grabe and Kaplan (1996) propose a model of text construction. They argue that
from the research they have reviewed it becomes clear that any model of text construction needs at least seven basic components: syntactic structures, semantic senses and mapping, cohesion signalling, genre and organisational structuring to support coherence interpretations, lexical forms and relations, stylistic and register dimensions of text structure, and non-linguistic knowledge bases, including world knowledge. These seven components (syntax, semantics, lexicon, cohesion, coherence, functional dimensions and non-linguistic resources) form the centre of
the text construction model. On the sentential level, two components are specified: syntax and semantics. On a textual, or intersentential, level are cohesion and coherence. The lexicon is connected to all four of the other components, in both surface form and underlying organisation, and is therefore placed in a central position. On an interpersonal level, the style level, are the components of posture and
stance.
The syntactic component involves types of phrases and clauses and the ordering
of phrases and words within a sentence. The authors suggest that a researcher
might, for example, want to investigate the number of types of passive structures.
Overall, syntactic analysis at this stage will involve the counting of various constructions and categories. Grabe and Kaplan (1996) acknowledge that the semantic component is open to alternative frameworks as there is no complete theory of
semantics currently available. Cohesion and coherence, on the text level, can be
seen as equivalent to syntax and semantics at the level of the sentence (or the
clause). The authors point out that there is no consensus on an overall theory of
cohesion, nor is there a satisfactory overall definition. It is also not completely
clear what the relationship is between cohesion and coherence.
The lexicon, which influences all components described above, is placed in a central position. Vocabulary used in text construction provides the meaning and signals that are needed for syntax, semantics and pragmatic interpretations.
The third, interpersonal level of writing shows the writer’s attitudes to the reader,
the topic and the situation. Style ultimately reflects the personality of the writer.
Several parameters are available to express this personality, such as formality or
distance.
Because this model was not created with test development in mind, Grabe and Kaplan (1996) offer no explanation of how the building blocks of the model can be evaluated in a student’s piece of writing, nor is there any consideration of whether some features contribute more than others to a successful piece of writing.
Another way to arrive at a theory of writing is to gather all the information that can be collected through an ethnography of writing and categorize it into a taxonomy of writing skills and contexts. This is, according to Grabe and Kaplan, a useful way to identify any gaps that can be further investigated. However, what becomes clear from this taxonomy is just how many different aspects and variables are encompassed in writing and need to be considered when conducting research. The taxonomy offers, however, no information that could be used in the writing of descriptors for a rating scale, nor does it attempt to structure the information hierarchically.
Cumming (1990) showed that raters use a wide range of knowledge and strategies
and that their decision-making processes involve complex, interactive mental
processes. He identified 28 interpretation and judgment strategies used by the raters in his study and was able to show that both expert and novice raters were able to distinguish between language proficiency and writing ability. Based on Cumming’s study, Milanovic, Saville and Shen (1996) devised a model of the decision-making involved in holistic scoring. This model can be seen in Figure 4 below. It shows that raters first scan the script and form an overall idea of the length, format, handwriting and organisation, followed by a quick read which establishes an indication of the overall level of the writing script. Only then do raters proceed to rating.
Milanovic and his co-researchers also created a list of items that raters focus on.
These include length, legibility, grammar, structure, communicative effectiveness,
tone, vocabulary, spelling, content, task realization and punctuation. Their findings also give some indication of how the raters weighted these essay features. They noticed, for example, that spelling and punctuation were not seen to be as important as other features. It seems, however, that the findings with regard to weighting are quite inconclusive and vary greatly among individual raters.
Both Vaughan (1991) and Lumley (2002; 2005) showed that raters generally follow the rating criteria specified in the rating scale, but if an essay does not fit the pre-defined categories, they are forced to make decisions that are based neither on the rating scale nor on any rater training they received. Consequently, these decisions are unreliable and might lack validity. Vaughan (1991) showed that in such cases the raters based their rating on first impression or used one or two categories, like grammar and/or content, to arrive at their rating. Similarly, Lumley (2002; 2005) found that if some aspect of the script was not covered by the rating scale descriptors, the raters used their own knowledge or intuitions to resolve uncertainties, or they resorted to other strategies like heavily weighting one aspect or comparing the script with previously rated compositions. He acknowledges that scale development and rater training might help, but found that these could not prevent the problem from occurring. He therefore argues that it is possible that the rater, and not the rating scale, is at the centre of the rating process.
Sakyi (2000), who also used verbal protocols, found four distinct styles among the raters in his study: focus on errors in the text, focus on essay topic and presentation of ideas, focus on the rater’s personal reaction to the text, and focus on the scoring guide. He also noticed that certain criteria were more associated with high and low marks (this was also observed by A. Brown, 2002, and Pollitt & Murray, 1996). On the basis of his findings, Sakyi proposed a model of the holistic scoring process, as seen in Figure 5 below.
Cumming, Kantor and Powers (2001; 2002) undertook a series of studies to develop and verify a descriptive framework of the decision-making processes of raters as part of the development process of TOEFL 2000. They also investigated whether there were any differences between the decision-making processes of English-mother-tongue (EMT) raters and ESL/EFL-trained raters. In the first study, a preliminary descriptive framework was developed based on the think-aloud protocols of ten experienced raters rating essays without scoring criteria. In the second study, this framework was applied to verbal data from another seven experienced raters. In the third study, this framework was revised by analyzing think-aloud
protocols from the same raters. The results of their studies showed that raters put
more weight on rhetoric and ideas (compared to language) when scoring higher
level compositions. They also found that ESL-trained raters attended more extensively to language than to rhetoric and ideas, whilst the EMT raters divided their attention more evenly. Overall, however, the research showed that the two groups
of raters rate compositions very similarly, which verified the framework. Most
participants in the study noted that their background, teaching experiences and
previous rating experiences had influenced the process of rating as well as the criteria they applied. The authors argue that a descriptive framework of the rating processes of experienced raters is necessary to formulate, field-test, and validate rating scales as well as to guide rater training. The descriptive framework of decision-making behaviors of the raters in Cumming et al.’s (2001; 2002) study can be found in Table 6 below.
Interpretation Strategies
- Read or interpret prompt or task input or both
- Read or reread composition
- Envision personal situation of the writer
- Discern rhetorical structure
- Summarize ideas or propositions
- Scan whole composition or observe layout
- Classify errors into types
- Interpret or edit ambiguous or unclear phrases

Judgment Strategies
- Decide on macro-strategy for reading and rating; compare with other compositions, or summarize, distinguish or tally judgments collectively
- Assess reasoning, logic or topic development
- Assess quantity of total written production
The 27 behaviors identified in Table 6 are those that one might expect from experienced raters when rating ESL/EFL compositions. On the basis of this, the authors argue that analytic rating scales should reflect how experienced raters score. Experienced raters, for example, divide their attention fairly equally between content and language. It might also make sense to weight criteria more heavily towards language at the lower end of the scale and more towards rhetoric and ideas at the higher end. This would reflect the observation that language learners need to manifest a certain threshold of language before raters are able to attend to their ideas. The study also showed that it would be necessary for each task type to have a rating scale uniquely designed for it; however, this might not be practical.
Another data-based approach to scale development has been proposed more recently by researchers working on the Cambridge ESOL examinations (Hawkey, 2001; Hawkey & Barker, 2004). The aim of the study was to develop an overarching rating scale to cover Cambridge ESOL writing examinations at different levels. They used a corpus-based approach to distinguish key features at four pre-assessed proficiency levels. Writing scripts were classed into subcorpora at different levels on the basis of previous ratings. The subcorpora were then analysed to identify the salient features underlying each level. The scripts were reread by the main researcher, who then decided which features should be included in the rating scale. The criteria included in this study therefore emerged partly from the intuitions of the main researcher as well as from features identified by the corpus analyst. On the basis of this, a draft scale was designed. It is not clear, however, whether any validation of this common scale of writing was undertaken.
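A minimal sketch of this kind of corpus comparison, with everything invented (tiny scripts, a single crude feature), might look like this:

```python
# Illustrative sketch of the corpus-based step described above: scripts
# are grouped into subcorpora by their previously assigned level, and a
# simple surface feature is computed per level so salient differences
# can be inspected. The scripts and the single feature (mean words per
# script) are invented; the actual study examined far richer features.

from statistics import mean

def mean_words_per_level(scripts):
    """scripts: list of (level, text). Returns {level: mean word count}."""
    by_level = {}
    for level, text in scripts:
        by_level.setdefault(level, []).append(len(text.split()))
    return {level: mean(counts) for level, counts in sorted(by_level.items())}

scripts = [
    (1, "I like my town"),
    (1, "My town is small"),
    (2, "My town is small but it has a very nice market square"),
    (2, "I have lived in this town since I was a small child"),
]
print(mean_words_per_level(scripts))  # {1: 4, 2: 12}
```

In the study proper, such per-level feature summaries were only raw material: the researcher still had to judge which of the features that discriminated between subcorpora deserved a place in the scale descriptors.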
3.3.3.3.1 Empirically derived, binary-choice, boundary definition scales
(EBBs)
Upshur and Turner (1995) claim that the main difference between EBB scales and traditional rating scales is that, instead of descriptors defining the midpoint of a band, a series of questions describes the boundaries between categories. Ratings are therefore based on differences rather than similarities. They also contend that the strength of their scale lies in its simplicity, as no more than one feature competes at a particular level. Fulcher (2003), however, argues that EBB rating scales do not take into account a theoretical, linear process of second language acquisition; they rely entirely on the decisions of expert raters. Another weakness of EBB scales is that they can only be applied to a specific task and cannot be generalized to other tasks or to the real world. Also, they again rely heavily on the judgment of expert raters working within a particular context. Finally, the scale weights some criteria heavily over others: criteria at a higher level of decision-making, which require a decision to be made first, are weighted more heavily. Upshur and Turner found increased inter-rater reliability, but no post-hoc validation studies were carried out.
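The binary-choice logic Upshur and Turner describe can be sketched as a short decision tree. Everything in this example is invented (a hypothetical four-level writing task with made-up boundary questions); what it illustrates is the structural point that raters answer ordered yes/no questions at category boundaries, so the question asked first carries the most weight.

```python
# Hypothetical sketch of an EBB (empirically derived, binary-choice,
# boundary definition) rating: the rater answers ordered yes/no boundary
# questions instead of matching band descriptors. The questions and the
# four-level scale are invented for illustration.

def ebb_rate(answers):
    """Assign a level (1-4) from yes/no answers to boundary questions."""
    # The first question marks the boundary between levels 1-2 and 3-4.
    # Because it is asked first, it outweighs all later decisions.
    if not answers["task_accomplished"]:
        # Boundary between level 1 and level 2.
        return 2 if answers["some_relevant_content"] else 1
    # Boundary between level 3 and level 4.
    return 4 if answers["ideas_well_organised"] else 3

rating = ebb_rate({
    "task_accomplished": True,
    "some_relevant_content": True,
    "ideas_well_organised": False,
})
print(rating)  # 3
```

Note that no band has a descriptor of its own: a script’s level is defined entirely by which side of each boundary it falls on, which is what Upshur and Turner mean by rating on differences rather than similarities.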
rank them according to ‘low’, ‘middle’ and ‘high’ and then divide each group into
a further two levels to arrive at six levels. The descriptors that were ranked most
consistently were then put into questionnaires linked by common anchor items,
which were the same in all questionnaires. The third phase involved the quantitative analysis of the data. Raters were asked to rate a small number of students
from their own classes using the descriptors in the questionnaires. Multi-faceted Rasch measurement was then used to construct a single scale from the descriptors, identifying any misfitting descriptors in the process. Next, cut-off points were established using difficulty estimates, natural gaps and groupings. The whole process was repeated in the fourth and final phase, in which other languages (French and German) were added as well as other skills (listening, reading and speaking). North and Schneider (1998) acknowledge that their method is essentially a-theoretical in nature, as it is based neither on empirically validated descriptions of language proficiency nor on a model of language learning.
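One ingredient of the cut-off procedure described above, locating ‘natural gaps’ in the descriptor difficulty estimates, can be sketched as follows. This is a deliberate simplification and not North and Schneider’s actual algorithm; the difficulty estimates (in logits) and the gap threshold are invented.

```python
# Simplified illustration of locating "natural gaps" in a set of Rasch
# difficulty estimates as candidate cut-off points. Not North and
# Schneider's actual procedure; the estimates (logits) and the minimum
# gap size are invented for illustration.

def natural_gaps(difficulties, min_gap=0.5):
    """Return the midpoint of every gap wider than min_gap between
    neighbouring sorted difficulty estimates."""
    ds = sorted(difficulties)
    return [(lo + hi) / 2 for lo, hi in zip(ds, ds[1:]) if hi - lo > min_gap]

# Eight descriptor difficulties that cluster into three groups.
estimates = [-2.1, -1.9, -1.7, -0.4, -0.2, 0.9, 1.1, 1.3]
print(natural_gaps(estimates))  # two candidate cut-offs, between the clusters
```

In the full procedure such gaps are only one source of evidence; groupings of descriptors and substantive judgment are weighed alongside them.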
This section has reviewed ways in which rating scale descriptors can be developed. First was a description of intuition-based scale development, and then theory-based scale development was explored. Possible theories discussed were the four skills model, models of communicative competence, theories/models of writing and models of rater decision-making. The final scale development method described was empirical scale development. After reviewing each of these possible approaches to scale development, it is clear that each approach provides different types of information and therefore none seems sufficient on its own. Implications for the development of a rating scale for diagnostic assessment can be found later in this chapter.
3.3.4 What will the descriptors look like and how many scoring levels will be
used?
The rating scale developer also has to make a number of decisions at the descriptor level. Firstly, it needs to be decided how many bands the rating scale should have. Secondly, the developer has to decide how the descriptors will differentiate between the levels. Finally, the descriptor formulation style needs to be determined.
Research has shown that raters can only differentiate between seven (plus or minus two) levels (Miller, 1956). North (2003) points out that there is a certain tension when deciding on the number of levels: one needs enough levels to show progress and discriminate between different learners, but the number of bands should not exceed a certain number so that raters can still make reasonable distinctions. He argues that there is a direct relationship between reliability and decision power. Myford (2002) investigated the reliability and candidate separation of a number of different scales and concluded that reliability was highest for scales ranging from five to nine scale points, lending credibility to Miller’s suggestion.
Another issue is how many bands are appropriate for specific categories. Some categories might not lend themselves to distinctions as fine as others. This might manifest itself in the inability of a scale developer to formulate descriptors at all levels or the failure of the raters to distinguish between the levels even if they are
defined (North, 2003). According to North, there are different ways of reacting to this problem: test developers can admit to the problem, circumvent it, or investigate it. To circumvent the problem, one can combine categories into broader categories. In the investigation approach, the researcher examines each band by making use of Rasch scalar analysis (as suggested by Davidson, 1993). This requires an iterative process in which misfunctioning scale bands are revised or reformulated and then modelled again by statistical analysis until the problem is solved.
Another issue that the rating scale developer has to tackle is how to distinguish
between the different levels. Several approaches are possible. For example, not all
scales provide descriptors for the levels. Some rating scales might start with 100 points and ask the rater to subtract points for each mistake. These are therefore based on a deficiency approach rather than a competence approach, which would give credit for ability. Such a scheme was presented by Reid (1993), and an extract is presented in Figure 6 below:
Begin with 100 points and subtract points for each deficiency:
- appropriate register (formality or informality) - 10 points
- language conventions - 10 points
- accuracy and range of vocabulary - 5 points
Figure 6: Extract from deficit marking scheme (Reid, 1993)
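The subtraction logic of the extract in Figure 6 can be sketched directly. Only the three deductions shown in the extract are modelled, and the list of observed deficiencies is invented for illustration:

```python
# Sketch of the deficit marking logic in Figure 6 (Reid, 1993): start at
# 100 and subtract a fixed number of points for each deficiency observed.
# Only the three deductions in the extract are modelled; the observed
# deficiencies below are invented.

PENALTIES = {
    "register": 10,              # inappropriate register (formality/informality)
    "language_conventions": 10,  # violated language conventions
    "vocabulary": 5,             # limited accuracy and range of vocabulary
}

def deficit_score(observed_deficiencies):
    """Subtract the penalty for each observed deficiency from 100."""
    score = 100
    for deficiency in observed_deficiencies:
        score -= PENALTIES[deficiency]
    return max(score, 0)  # do not go below zero

print(deficit_score(["register", "vocabulary"]))  # 100 - 10 - 5 = 85
```

The sketch makes the criticism in the surrounding text visible: marks are only ever withheld for deficiency, never awarded for ability, so a script with no detected deficiencies scores 100 regardless of what it actually demonstrates.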
Other rating schemes might require the rater to score each aspect under investigation out of three. With no scale to guide the rating, it is very hard to know why two raters would agree on a particular score for a certain feature.
An alternative to the kinds of rating schemes shown above is to assign marks on a
scale. There are three different types of scales which roughly follow the historical
development of rating scales (North, 2003).
a) Graphic and numerical rating scales: These scales present a continuous line between two points representing the top and the bottom ends of the scale. Graphic scales require the rater to choose a point on the scale, whilst numerical scales divide the continuum into intervals represented by numbers. An example of each of these can be found in Figure 7 below. The graphic scale is at the top and the numerical scale is at the bottom.
Quality: High
____________________________________ Low
Quality: High
____________________________________ Low
5 4 3 2 1
Figure 7: Graphic and numerical rating scales (North, 2003)
A drawback of these types of scales is that they say nothing about the behavior
associated with each of the levels of the continuum. It is therefore not clear why
two raters might agree on a particular level.
b) Labeled scales: Later, rating scale developers set out to add cues to the various points along the scale. Cues were usually quite vague, with stages on the continuum ranging from, for example, ‘too many errors’ to ‘almost never makes mistakes’, or from ‘poor’ to ‘excellent’. The obvious disadvantage of these types of scales lies in their vagueness. It is, for example, a quite subjective judgment whether a learner’s writing is ‘above average’ or ‘excellent’.
c) Defined scales: Another step in rating scale development was taken when the
horizontal scales described above were changed to vertical scales, so that there
was suddenly ample space for longer descriptions. An example of such a scale is
Shohamy et al.’s (1992) ESL Writing scale. Shohamy’s team was able to show
that these more detailed descriptors led to a higher level of inter-rater reliability.
An extract from the scale can be found in Figure 8 below.
Accuracy
5 Near native accuracy
4 Few sporadic mistakes; more sophisticated, complex sentence structures; idiomatic expression
3 Consistent errors; accurate use of varied/richer vocabulary; longer sentence structures.
2 Frequent consistent errors yet comprehensible; basic structures and simple vocabulary
1 Poor grammar and vocabulary strongly interfering with comprehensibility; elementary errors.
0 Entirely inaccurate
Figure 8: ESL Writing: Linguistic (Shohamy et al., 1992)
Myford (2002) compared the reliability of a number of different scale types. She
was interested to see whether the number of scale points or the presence or ab-
sence of a defined midpoint made a difference. She found no significant differ-
ences in the resultant reliability and therefore concluded that the training of raters
is more important than the type of descriptors used.
The rating scale designer also has to decide how to formulate the descriptors.
North (2003) distinguishes three different approaches to formulating descriptors.
Descriptors can be formulated by schemes, by behavioral objectives and by research tools which analyze speaking
or writing in terms of objective features like the number of words per utterance or
number of words in error-free utterances. This third formulation style aims for
objectivity in a very simplistic manner. One example of such a scale, in the con-
text of speaking, can be found in the Oral Situation Test scoring rubric by Raf-
faldini (1988). An extract is presented in Figure 9 below. The scale attempts to
have raters count structures where possible (e.g. for cohesion, structures and vo-
cabulary). However, although Raffaldini attempts to reduce subjectivity, the rater
still has to make some very subjective decisions. It is, for example, not clear what
is classed as a ‘major’ as opposed to a ‘minor’ error. Furthermore, using a quantitative
approach for operational purposes is extremely time-consuming.
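The kind of objective features such research-tool descriptors rely on can be illustrated with a minimal sketch. Here the segmentation into units and the error counts are assumed inputs (producing them is itself a substantial, partly subjective analysis step), and the function name and sample sentences are invented for illustration:

```python
# Sketch of the kind of objective features such "research tool" descriptors
# count: average words per unit and the proportion of error-free units.

def objective_features(units):
    """units: list of (text, error_count) pairs, one per utterance or T-unit."""
    word_counts = [len(text.split()) for text, _ in units]
    error_free = [wc for (_, errs), wc in zip(units, word_counts) if errs == 0]
    return {
        "words_per_unit": sum(word_counts) / len(units),
        "error_free_units_ratio": len(error_free) / len(units),
        "words_in_error_free_units": sum(error_free),
    }

# Invented sample: three units, one containing two errors.
sample = [
    ("The graph shows a steady increase", 0),
    ("Number of student have grown", 2),
    ("This trend is likely to continue", 0),
]
print(objective_features(sample))
```

Even a sketch like this makes the practical cost visible: every script must first be segmented and error-annotated before any counting can begin, which is why such quantitative approaches are rarely operationally viable.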
Finally, the scale developer has to decide how to report the scores. Here the rating
scale developer should return to Alderson’s (1991) rating scale categories and de-
cide what the initial purpose of the scale was, as well as what the purpose of the
writing test was. Scores should for example not be combined if the stakeholders
could profit from knowing sub-scores. However, where it is not important to
know sub-scores, a combined score should be reported. Similarly, the rating scale
designer needs to decide if any of the categories on the rating scale should be
weighted.
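The score-reporting decision can be illustrated with a small sketch; the trait names, band values and weights below are invented for illustration and are not drawn from any operational scale:

```python
# Hypothetical sub-scores and weights, invented for illustration only.
sub_scores = {"accuracy": 6, "organisation": 7, "cohesion": 5}
weights = {"accuracy": 2.0, "organisation": 1.0, "cohesion": 1.0}

def combined_score(scores, trait_weights):
    """Weighted average of trait sub-scores."""
    total_weight = sum(trait_weights[t] for t in scores)
    return sum(scores[t] * trait_weights[t] for t in scores) / total_weight

# Diagnostic reporting keeps the profile of sub-scores;
# a single combined score hides it.
print(sub_scores)
print(round(combined_score(sub_scores, weights), 2))  # → 6.0
```

The combined score of 6.0 conceals exactly the information a diagnostic context needs: that cohesion (5) is markedly weaker than organisation (7).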
As McNamara (1996) and Weigle (2002) point out, the scale that is used in assessing
writing performance implicitly or explicitly represents the theoretical basis
of a writing test. That means it embodies the test developer’s notion of what
underlying abilities are being measured by the test. Therefore, the rating scale is
of great importance to the validity of a test.
Before reviewing the relevant literature on how rating scales can be validated, it is
important to briefly explore how validity is conceptualized and then discuss how
it can be applied to rating scales.
The view of validation has changed historically (Chapelle, 1999). Whilst in the
1960s it was seen as one of two important aspects of language tests (the other be-
ing reliability), subsequent work has focussed on identifying a number of different
features of tests which contribute to validity. Prior to Messick’s (1989) seminal
paper, different types of validity were established as separate aspects, each of
which will be briefly described below.
• construct validity
• content validity
• criterion-related validity, consisting of concurrent and predictive validity
• face validity
The construct validity of a language test was defined by Davies et al. (1999) as an
indication of how representative it is of an underlying theory of language use. In
the case of writing assessment, construct validity determines how far the task
measures writing ability (Hyland, 2003). Hamp-Lyons (2003) argues that con-
structs cannot be seen and are therefore difficult to measure. They have to be
measured by tapping some examples of behaviour that represent the construct. In
the case of writing assessment, this ability is operationalized by the rating scale
descriptors.
Content validity evaluates whether the tasks in a test are similar to what writers
are required to write about in the target language situation (Hamp-Lyons, 1990;
Hyland, 2003). This is usually established through a needs analysis. Hamp-Lyons
argues that whilst there has been a call for, say, history majors to be required to
write on a certain history topic, this does not guarantee that they have actually
studied this particular topic. She therefore argues that it is more useful to cover
this issue under construct validity and sample what it is that writers do when writ-
ing on a history topic. While content validity is more central to the task that learn-
ers are required to perform, the rating scale should also display content validity in
the sense that it should reflect as much as possible how writing is perceived by
readers in the target language use domain.
Criterion-related validity refers to the way a test score compares to other similar
measures. There are two types of criterion-related validity (Hughes, 2003). Firstly,
concurrent validity measures how the test scores compare with other comparable
test scores. The result of the comparison is usually expressed as a correlation coef-
ficient, ranging in value from -1.0 to +1.0 (Alderson et al., 1995). Most concurrent
validity coefficients range from +0.5 to +0.7. Higher coefficients are possible for
closely related and reliable tests. Secondly, predictive validity differs from con-
current validity in that, instead of collecting the external measures at the same
time as the administration of the experimental test, the external measures are
gathered some time after the test has been given (Alderson et al., 1995; Hamp-
Lyons, 1990). Predictive validity measures how well a test predicts performance
on an external criterion.
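A concurrent validity coefficient of the kind quoted above is typically a Pearson correlation between the two sets of scores. A minimal sketch, with invented score lists for six hypothetical candidates:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two lists of test scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Invented scores of the same six candidates on an experimental test
# and an established comparison test.
experimental = [55, 62, 48, 70, 66, 59]
established = [58, 65, 50, 74, 63, 61]
print(round(pearson(experimental, established), 2))  # → 0.95
```

The deliberately close score lists yield a coefficient near +1.0; in practice, as noted above, concurrent validity coefficients more commonly fall between +0.5 and +0.7.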
Messick (1989; 1994; 1996) proposed a more integrated view of validity. He saw
assessment as a process of reasoning and evidence gathering which is carried out
so that inferences can be made about test takers. He argued that establishing the
meaningfulness of those inferences should be seen as the main task of test devel-
opers. He therefore redefined validity as ‘an integrated evaluative judgement of
the degree to which empirical evidence and theoretical rationales support the ade-
quacy and appropriateness of inferences and actions based on test scores’ (1989,
p.13). Messick argued that construct validity is the unifying factor to which all
other validities contribute and he also extended the notion of validity beyond test
score meaning to include relevance, utility, value implications and social conse-
quences.
Table 7 below shows the different facets of validity identified by Messick. He iso-
lated two sources of justification for test validity: the evidential basis and the con-
sequential basis. The evidential basis focuses on establishing validity through em-
pirical investigation. The consequential basis focuses on justification based on the
effects of a test after its administration. Both the evidential basis and the conse-
quential basis need to be evaluated in terms of the two functions which Messick
labelled across the top of the table: test interpretation, which focuses on how ade-
quate test interpretations are, and test use, which focuses on the adequacy of ac-
tions based on the test.
Table 7: Messick's (1989) facets of validity

                      Test interpretation    Test use
Evidential Basis      Construct Validity     Construct Validity + Relevance/Utility
Consequential Basis   Value Implications     Social Consequences
Chapelle (1999) produced a summary table which outlines the contrasts between
past and current conceptions of validation (Table 8).
Table 8: Summary of contrasts between past and current conceptions of validation (from
Chapelle, 1999)

Past: Validity was considered a characteristic of a test: the extent to which a test measures what it is supposed to measure.
Current: Validity is considered an argument concerning test interpretation and use: the extent to which test interpretations and uses can be justified.

Past: Reliability was seen as distinct from and a necessary condition for validity.
Current: Reliability can be seen as one type of validity evidence.

Past: Validity was often established through correlations of a test with other tests.
Current: Validity is argued on the basis of a number of types of rationales and evidence, including the consequences of testing.

Past: Construct validity was seen as one of three types of validity (the three validities were content, criterion-related, and construct).
Current: Validity is a unitary concept with construct validity as central (content and criterion-related evidence can be used as evidence about construct validity).

Past: Establishing validity was considered within the purview of testing researchers responsible for developing large-scale, high-stakes tests.
Current: Justifying the validity of test use is the responsibility of all test users.
Bachman and Palmer’s (1996, forthcoming) facets of test usefulness were devel-
oped to establish the validity of entire tests and not to validate aspects of tests, like
for example the rating scale. However, most of the aspects can be adapted to be
used as a framework for rating scale validation, in combination with warrants
which represent an ideal situation. Table 9 below presents the aspects of test usefulness
with the relevant warrants which will be used for the validation of the rating
scale later in this book (Chapter 10). However, not all aspects of test usefulness
can be usefully applied to rating scale validation; interactiveness, in particular,
cannot be established for a rating scale and was therefore excluded.
Table 9: Facets of rating scale validity (based on Bachman and Palmer, 1996)

Construct validity
• The scale provides the intended assessment outcome appropriate to purpose and context, and the raters perceive the scale as representing the construct adequately
• The trait scales successfully discriminate between test takers and the raters report that the scale is functioning adequately
• The rating scale descriptors reflect current applied linguistics theory as well as research

Reliability
• Raters rate reliably and interchangeably when using the scale

Impact
• The test scores and feedback are perceived as relevant, complete and meaningful by stakeholders
• The impact on raters is positive

Practicality
• The scale use is practical
• The scale development is practical
Several criticisms have been leveled at existing rating scales. Firstly, as has been
mentioned earlier in this chapter, the a priori nature of rating scale development
has been criticized (Brindley, 1991; Fulcher, 1996a; North, 1995). Rating scales
are often not based on an accepted theory or model of language development
(Fulcher, 1996b; North & Schneider, 1998) nor are they based on an empirical
investigation of language performance (Young, 1995). This results in scales that
include features that do not actually occur in the writing performances of learners
(Fulcher, 1996a; Upshur & Turner, 1995). Rating scales based on pre-existing
scales might also result in rating scale criteria which are irrelevant to the task in
question or the context (Turner & Upshur, 2002). Other researchers have con-
tended that rating scales are often not consistent with findings from second lan-
guage acquisition (Brindley, 1998; North, 1995; Turner & Upshur, 2002; Upshur
& Turner, 1995). Rating scales also generally assume a linear development of
language ability, although studies such as those undertaken by Meisel, Clahsen
and Pienemann (1981) show that this might not be justified (Young, 1995).
Fewer studies have focussed on the problems raters experience when using rating
scales. There is, however, a growing body of research that indicates that raters of-
ten find it very difficult to assign levels and that they employ a number of strate-
gies to cope with these problems. Shaw (2002), for example, noted that about a
third of the raters he interviewed reported problems when using the criteria. How-
ever, he does not mention what problems they referred to. Claire (2002, cited in
Mickan, 2003) reported that raters regularly debate the criteria in moderation ses-
sions and describe problems in applying descriptors with terms like ‘appropri-
ately’. Similarly, Smith (2000), who conducted think-aloud protocols of raters
marking writing scripts, noted that raters had ‘difficulty interpreting and applying
some of the relativistic terminology used to describe performances’ (p. 186).
However, Lumley (2002; 2005), who also conducted think-aloud protocols with
raters, noted raters experiencing problems only in unusual situations, when raters
for example encountered problem scripts or features that were not mentioned in
the scale. He observed how, when raters were not able to apply the criteria, they
fell back on their personal experiences. Otherwise, he found that raters encoun-
tered very few problems in applying the criteria.
It is therefore arguable that rating scales used in other assessment contexts are not
appropriate for diagnostic purposes.
Alderson (2005) further suggests that diagnostic assessment usually focuses on
specific features rather than on global abilities. Some of the literature reviewed
above, however, suggests that current rating scales make use of vague and impres-
sionistic terminology and that raters often seem to struggle when employing these
types of scales. Impressionistic and vague terminology on the descriptor level
might not be conducive to identifying specific features in a writing script.
Alderson (2005) also argues that a diagnostic test should be either theoretically-
based or based on a syllabus. Because the rating scale represents the de facto test
construct, the rating scale used for a diagnostic assessment of writing should be
based on a theory (or syllabus). Alderson further suggests that this theory should
be as detailed as possible, rather than global. Diagnostic tests should also be based
on current SLA theory and research.
Overall, it seems doubtful that rating scales which are designed for proficiency or
placement procedures would also be appropriate for diagnostic assessment. But
what features would a rating scale for diagnostic assessment have to display?
Weigle’s (2002) five steps in rating scale development suggest the following:
(1) The rating scale should be analytic rather than holistic, so that separate aspects
of writing ability can be assessed and reported without being mixed into one descriptor.
(2) The rating scale should be assessor-oriented, so that raters are assisted in
identifying specific details in learners’ writing. Rating scales should therefore provide
as much information as necessary for raters to assign bands reliably. Similarly, it
could also be argued that the scale should be user-oriented, as feedback is central
in diagnostic assessment.
(3) The rating scale should be based on a theory or model of language develop-
ment5 (as suggested by Alderson, 2005). In this way, the criteria chosen will re-
flect as closely as possible our current understanding of writing (and language)
development. The theory should be as detailed as possible, to provide a useful ba-
sis for the descriptors. The descriptors should ideally be empirically-developed. In
this way, they will be based on actual student performance rather than being con-
ceived in a vacuum. If the descriptors are based on empirical investigation, they
can be based on our current understanding of SLA theory.
(4) The rating scale descriptors should be formulated in as objective a style as
possible, avoiding vague and impressionistic terminology.
(5) The way the scores are reported to stakeholders is central to diagnostic assessment.
Scores should be provided in such a way as to offer as much feedback
as possible to students.
3.6 Conclusion
This chapter has investigated a number of options available to rating scale devel-
opers and has then discussed features of scales which might be most suitable to
the diagnostic context. One suggestion is that a rating scale for diagnostic assess-
ment should be theory-based. However, a closer look at the different models and
theories that could be or have been used for rating scale development reveals that
none of them provide an outright solution. Our current understanding of writing is
not sufficiently developed to base a rating scale just on one theory. The following
chapter therefore attempts to follow a similar path to Grabe and Kaplan’s (1996)
taxonomy of writing in order to establish a taxonomy of aspects of writing rele-
vant to rating scale development, which will then serve as a theoretical basis for
the design of the rating scale.
---
Notes:
1. The TWE is still administered in areas where the TOEFL iBT has not been introduced (e.g. where access to computers is difficult).
2. In this case an example from the context of speaking is chosen because it is the most well-known study exemplifying this type of scale development. The principles of this study are just as applicable to the context of writing.
3. This method is described in more detail in Chapter 8.
4. Miller was not referring to raters in his article, but was instead referring to human processing capacity in general.
5. A diagnostic test based on a syllabus is also possible, but not the focus of this study.
Chapter 4: Measuring Constructs and Constructing Measures
4.1 Introduction
In the previous chapter, it was proposed that a rating scale for diagnostic assess-
ment should be (1) based on a theory of writing and/or language development and
(2) based on empirical investigation at the descriptor level. This chapter, there-
fore, sets out to achieve two purposes. Firstly, it attempts to arrive at a taxonomy
of the different theories and models available to rating scale developers. It will be
argued that, because currently no satisfactory theory or model of writing devel-
opment is available, a taxonomy based on a number of theories and models can
provide the most comprehensive description of our current knowledge about writ-
ing development. The first part of the chapter describes such a taxonomy. Based
on this taxonomy, a number of aspects of writing will be chosen which will serve
as the trait categories in the rating scale. To conclude the first part of the chapter,
the rating scale the Diagnostic English Language Needs Assessment (DELNA)1
currently uses to rate writing scripts is reviewed in terms of these constructs.
The second aim of the chapter is to arrive at discourse analytic measures which
can be used as a basis for the empirical investigation of a new rating scale. These
discourse analytic measures should represent each of the different trait categories
chosen from the taxonomy. The relevant literature on each of these different as-
pects of writing (or traits) is reviewed to establish which discourse analytic meas-
ures should be used to operationalize each of these traits. At the end of this chap-
ter, a list of discourse analytic measures is presented, which will then be used dur-
ing the pilot study.
A number of authors have argued that a theoretical basis for rating scale design is
necessary. For example, McNamara (1996, p.49) writes ‘an atheoretical approach
to rating scale design in fact provides an inadequate basis for practice’ and that
‘the completeness, adequacy and coherence of such models is crucial’, and North
(2003, following Lantolf and Frawley, 1985) argues that ‘unless the conceptual
framework behind the scale takes some account of linguistic theory and research
in its definition of proficiency, its validity will be limited and it can be accused of
constructing a closed reality of little general interest’. North further argues that
one cannot actually avoid theory. He claims that it is more than sensible to have a
valid conceptual framework and to try to incorporate relevant insights from theory
when the scale is being developed. Therefore, for him, models of language use
are a logical starting point. Alderson (2005) also suggests that diagnostic tests
should be based on a theory.
However, although there is general agreement that rating scales should be based
on a theoretical framework, there are a number of problems with the models cur-
rently available (as reviewed in the previous chapter). These problems are further
discussed below.
Models of communicative competence have been used as the theoretical basis for
many rating scales. They have the advantage of being general models, which can
be transferred across contexts, assuring generalisability of results. Therefore re-
sults should be expected to show less variation across different tasks and general-
ise to other contexts (as shown by Fulcher, 1995). However, there are a number of
problems with using these models as a conceptual framework for a rating scale.
The first problem is that they are models of communicative competence and not
models of performance; they therefore struggle to account for what happens when
competence is put to use. North (2003), for example, argues that these models have no
place for fluency, which is a component of performance. This is one of the most
obvious elements necessary to turn a model of competence into a model of per-
formance, which, as North (2003) and McNamara (1996) point out, is really
needed. The second problem relates to the operationalisation of the models. The
fact that certain aspects are components of a theoretical model does not mean
these parameters can be isolated as observable aspects and operationalised into
rating scale descriptors and hence tested separately.
Most writing is not undertaken for the writer himself/herself but for a certain au-
dience. It could therefore be argued that raters’ decision-making models could be
used as a basis for rating scale design, since raters are the readers of the writing
scripts produced in the context of assessment. The assessment of writing should
take into account how readers of L2 writing think and respond to writing. These
decision-making processes have been modelled by Cumming et al. (2001; 2002)
in a reader-writer model (shown in Table 6 in the previous chapter). Brindley
(1991), however, has concerns about using raters’ decision-making processes as a
basis for the assessment process and the rating scale, for a number of reasons.
Firstly, he finds it hard to define what makes an ‘expert’ judge. He also argues
that these judges might be unreliable and base their judgements on different crite-
ria, as background and context might play an important role (as was seen in re-
search reported in Chapter 2). Even the method used in rater decision-making
studies, the concurrent think-aloud protocol, has been questioned (e.g. Stratman &
Hamp-Lyons, 1994) and recent research by Barkaoui (2007a; 2007b) reinforces
the doubts about the validity of this method.
It therefore seems that there is no theory currently available that can serve by it-
self as a basis for the design of a rating scale for writing for diagnostic assess-
ment. North (2003) argues that describing the stages of learning surpasses our
knowledge of the learning process, a point also made by Lantolf and Frawley
(1985, cited in McNamara, 1996). There are therefore those who argue that one
should not attempt to describe the stages of attainment in a rating scale (e.g.
Mislevy, 1993). However, in practical
terms, teachers and raters need some reference point to base their decisions on.
(1992) suggestion that an assessment is more valid if more than one model is used
in combination. It would also conform with North’s (2003) argument that a rating
scale based on a general model is more valid.
Fluency is not an aspect that is part of the models of language competence. How-
ever, raters seem to consider this construct as part of their decision-making proc-
esses. A group of features connected with the reader includes stance or audience
awareness. These are termed socio-linguistic knowledge in the models of commu-
nicative competence and stance, posture and audience awareness in Grabe and
Kaplan’s two models. However, very little mention of this aspect of writing can
be found in the models of rater decision-making, which might be because raters
are often not specifically trained to rate these. These aspects will from now on be
grouped together and referred to as features of reader/writer interaction.
Only aspects of writing that can be assessed on the basis of the writing product are
included in the taxonomy. For example, whilst there is no doubt that the affect of
a writer plays an important role in the outcome of the writing product, it is unreal-
istic for raters to assess this. It is therefore not included in the list of criteria. Simi-
larly, the world knowledge of a writer cannot be assessed on the basis of a prod-
uct, only on the quality of the content of a piece of writing. It is moreover doubt-
ful that interest, creativity and originality of content can be assessed objectively
and these aspects are therefore also not included in the list.
The features in the taxonomy were grouped into eight categories, which will form
the basis of the constructs further pursued in the remainder of this study (see
Table 10).
4.4 Evaluation of the usefulness of the DELNA rating scale for diagnostic
assessment
In this section, I will analyze the existing rating scale in terms of the constructs
that have been identified in the preceding section as well as some of the criticisms
of rating scales discussed in Chapter 3. The DELNA2 rating scale (Table 11 be-
low), as it is currently in use, has evolved over several years. It was developed
from other existing rating scales and on the basis of expert intuition. Over the
years several changes have been carried out, mainly on the basis of suggestions by
raters.
The DELNA rating scale has nine categories, grouped into three groups: form,
fluency and content. Each category is divided into six levels, ranging from band
four to band nine, each with its own descriptor. A first glance at the scale reveals
that a number of the constructs identified in the previous section are represented. A closer
look at the scale however, also reveals some of the problems common to rating
scales identified in Chapter 3. These problems can mostly be found in the group-
ings of the categories and the wording of the level descriptors.
The group of categories under the heading form consists of sentence structure,
grammatical accuracy and vocabulary & spelling. A closer look at the category of
sentence structure shows that the descriptors mix aspects of accuracy and com-
plexity. At level 6, for example, one reads ‘adequate range – errors in complex
sentences may be frequent’. There is, however, no indication of what adequate
range means and how that is different from level 7 ‘satisfactory variety – reduced
accuracy in complex sentences’.
Overall, the traits under form represent the aspects of accuracy and complexity as
identified in the taxonomy as well as one aspect of mechanics, namely spelling.
Under the heading of content, there are three categories: description of data, in-
terpretation of data and development of ideas. These three categories are gener-
ally intended to follow the three sub-sections of the task3. The first section of the
task requires the writers to describe the data provided in a graph or a table. The
level descriptors in this category represent a cline from ‘clearly and accurately’
through ‘accurately’ and ‘generally accurately’ to ‘adequately’ and ‘inadequately’.
These levels might be hard for raters to distinguish. The second category under
the heading of content refers to the ‘interpretation of ideas’. Here a number of the
categories identified by Cumming et al.’s (2001; 2002) rater decision-making
processes are mixed in the level descriptors. For example, at some levels the raters
are asked to rate the relevance of ideas, at others the quantity of ideas and the clar-
ity or the length of the essay. A similar problem can be identified in the next cate-
gory entitled development of ideas. Again, some level descriptors include rele-
vance, supporting evidence, length of essay or clarity, which are all separate con-
cepts according to Cumming et al.’s findings.
In general, the content category can be equated with the construct of content iden-
tified in the taxonomy.
The third heading is entitled fluency. However, none of the categories is measur-
ing fluency, as the three categories are organisation, cohesion and style. Organi-
sation looks at the paragraphing of the writing as well as logical organisation.
These might possibly be separate constructs, with the formatting conventions be-
ing an aspect of mechanics and organisation being an aspect of coherence. The
category of cohesion refers to cohesive devices, but does not explain what exactly
raters should look for. The category of style might refer to the category of
reader/writer interaction.
However, the raters are given very little guidance in terms of the features of style
to rate. The heading of fluency equates to the constructs of cohesion and coher-
ence, reader/writer interaction, features of academic writing and possibly some
aspects of mechanics.
Overall, it can be said that the DELNA rating scale is a comprehensive scale that
covers almost all constructs of writing identified in the taxonomy in the previous
section. However, the groupings are at times arbitrary; some level descriptors mix
separate constructs and some rating scale descriptors could be criticized for being
vague and using impressionistic terminology. The construct of fluency, which has
been identified as being important when measuring performance, is not part of the
DELNA rating scale.
When compared to the list of features that a diagnostic rating scale should display,
the following observations can be made. (1) The DELNA scale is an analytic rat-
ing scale, but at times separate aspects of writing ability are mixed into one de-
scriptor. (2) The rating scale is assessor-oriented, although at times the descriptors
include vague terminology and might therefore not provide sufficient guidance to
raters. (3) It is not clear whether the scale is based on any theory or model of writ-
ing development. It was developed in what Fulcher (2003) would term an ‘intui-
tive’ manner. (4) The scale descriptors do not have an objective formulation style,
and many descriptors make use of adjectives or adverbs to differentiate between
levels. (5) Scores are currently not reported to stakeholders separately. Students
receive a single averaged score and comments based on fluency, content and
form. Taking all these features of the DELNA rating scale into account, it is
doubtful that the scale provides an adequate basis for diagnostic assessment.
Fluency group: Organisation / Cohesion / Style

9  Organisation: Essay organised effectively – fluent – introduction and concluding comment
   Cohesion: Skilful use of cohesive devices – message able to be followed effortlessly
   Style: Academic – appropriate to task
8  Organisation: Essay fluent – well organised – logical paragraphing
   Cohesion: Appropriate use of cohesive devices – message able to be followed throughout
   Style: Generally academic – may be slight awkwardness
7  Organisation: Essay organised – paragraphing adequate
   Cohesion: Adequate use of cohesive devices – slight strain for reader
   Style: Adequate understanding of academic style
6  Organisation: Evidence of organisation – paragraphing may not be entirely logical
   Cohesion: Lack / inappropriate use of cohesive devices causes some strain for reader
   Style: Some understanding of academic style
5  Organisation: Little organisation – possibly no paragraphing
   Cohesion: Cohesive devices absent / inadequate / inappropriate – considerable strain for reader
   Style: Style not appropriate to task
4  Organisation: Lacks organisation
   Cohesion: Cohesive devices absent – severe strain for reader
   Style: No apparent understanding of style

Table 11: DELNA rating scale – fluency
Form group: Sentence structure / Grammatical accuracy / Vocabulary & spelling

9  Sentence structure: Sophisticated control of sentence structure
   Grammatical accuracy: Error free
   Vocabulary & spelling: Extensive vocab / may be one or two minor spelling errors
8  Sentence structure: Controlled and varied sentence structure
   Grammatical accuracy: No significant errors in syntax
   Vocabulary & spelling: Vocab appropriate / may be few minor spelling errors
7  Sentence structure: Satisfactory variety – reduced accuracy in complex sentences
   Grammatical accuracy: Errors minor / not intrusive
   Vocabulary & spelling: Vocab adequate / occasionally inappropriate / some minor spelling errors
6  Sentence structure: Adequate range – errors in complex sentences may be frequent
   Grammatical accuracy: Errors intrusive / may cause problems with expression of ideas
   Vocabulary & spelling: Limited, possibly inaccurate / inappropriate vocab / spelling errors
5  Sentence structure: Limited control of sentence structure
   Grammatical accuracy: Frequent errors in syntax cause significant strain
   Vocabulary & spelling: Range and use of vocab inadequate; errors in word formation & spelling cause strain
4  Sentence structure: Inadequate control of sentence structure
   Grammatical accuracy: Frequent basic syntactical errors impede comprehension
   Vocabulary & spelling: Basic errors in word formation / spelling; errors disproportionate to length and complexity of script

Table 11 (cont.): DELNA rating scale – form
Whilst the taxonomy described above has provided us with eight constructs which
can be used as the basis for the trait scales of the new rating scales, the descriptors
will be derived empirically. For this purpose, operational definitions of the dif-
ferent constructs need to be developed. That is the intention of the second part of
this chapter.
The eight constructs constituting the taxonomy of writing are described in more
detail in the following sections. In these sections the theoretical basis of the con-
structs is discussed, followed by examples of research that has identified discourse
analytic measures to operationalize the different constructs. The main aim of
this section is to identify measures which have successfully distinguished different
proficiency levels of writing. Based on the findings of the review of the literature,
a summary of suitable measures for the empirical investigation will be presented.
The following section discusses the theoretical basis underlying the analytic
measures of accuracy, fluency and complexity. First, a theoretical framework
based on an information-processing model is presented and then each measure is
described in detail. In these sections, the varying measures previous studies have
employed to operationalize the different concepts are investigated.
Measures of accuracy, fluency and complexity are often used in second language
acquisition research because they provide a balanced picture of learner language
(Ellis & Barkhuizen, 2005). Accuracy refers to ‘freedom from error’ (Foster &
Skehan, 1996, p. 305), fluency refers to ‘the processing of language in real time’
(Schmidt, 1992, p. 358) where there is ‘primacy of meaning’ (Foster & Skehan,
1996, p. 304), and complexity is ‘the extent to which learners produce elaborated
language’ (Ellis & Barkhuizen, 2005).
Skehan (1998b) uses the model proposed above to suggest three aspects underly-
ing L2 performance (see Figure 10 below). Learner production is to be analysed
with an initial partition between meaning and form. Form can further be subdi-
vided into control and restructuring. Meaning is reflected in fluency, while form is
either displayed in accuracy (if the learner prioritizes control) or in complexity (if
opportunities for restructuring arise because the learner is taking risks).
Figure 10: Skehan’s three aspects of L2 performance (from Ellis & Barkhuizen, 2005)
Skehan (1996, p. 50) considers the possible results of learners allocating their at-
tentional resources in a certain way. He argues that a focus on accuracy makes it
less likely that interlanguage change will occur (production will be slow and
probably consume a large part of the attentional resources). A focus on complex-
ity and the process of restructuring increases the chance that new forms can be
incorporated in the interlanguage system. A focus on fluency will lead to language
being produced more quickly and with lower attention to producing accurate lan-
guage and incorporating new forms. He proposes that as learners do not have
enough processing capacity available to attend to all three aspects equally, it is
important to understand the consequences of allocating resources in one direction
or another. A focus on performance is likely to prioritize fluency, with restructur-
ing and accuracy assigned lesser importance. A focus on development might shift
the concentration to restructuring, with accuracy and fluency becoming less im-
portant.
In the context of language testing, Iwashita et al. (2001) have criticized the meas-
ures of accuracy, fluency and complexity used in research as being too complex
and time consuming to be used under operational testing conditions. They call for
more practical and efficient measures of ability that are not as sensitive to varia-
tions in task structure and processing conditions. In their study, they propose a
rating scale based on aspects of accuracy, fluency and complexity.
Table 12: Factor analysis for measures of accuracy, fluency and complexity (Tavakoli and
Skehan, 2005)

Measure                 Factor 1   Factor 2   Factor 3   Communality
Reformulations                       .88                  .880
False starts                         .94                  .892
Replacements                         .41                  .276
Repetitions                          .62                  .490
Accuracy                                        .65       .662
Complexity                                      .87       .716
Length of run            -.66       -.44       .43       .767
Speech rate              -.84                             .793
Total silence             .95                             .912
Time spent speaking      -.94                             .902
No. of pauses             .80                             .736
Mean length of pause      .87                             .844
The measures loading on the first factor (speech rate, total silence, time spent
speaking, number of pauses and mean pause length) represent what the authors
refer to as the temporal aspects of fluency. The second factor is based on the
measures of reformulations, false starts, replacements and repetitions. These
measures are associated with another aspect of fluency, namely repair fluency
(e.g. Skehan, 2001). The third factor has load-
ings of measures of accuracy and complexity as well as length of run. This indi-
cates that more accurate language was also more complex. These loadings also
suggest that the measures represent the same underlying construct, which confirms
Skehan’s (1998b) model of task performance according to which accuracy
and complexity are both aspects of form, while fluency is meaning-oriented. The
results of this factor analysis are potentially useful for the field of language test-
ing, especially rating scale design, as it can be shown which measures are in fact
distinct entities and can therefore be represented separately on a rating scale. It is
worth noting, however, that the research investigated oral language use and that
the results may not be applicable to written production.
In the three sections below discourse analytic measures of accuracy, fluency and
complexity are examined in more detail. Definitions are given and commonly
used measures are reviewed.
4.5.1.1 Accuracy
Polio (1997) reviewed several studies that employed measures of accuracy. Some
studies used holistic measures in the form of a rating scale (looking at the accu-
racy of syntax, morphology, vocabulary and punctuation), whilst others used more
objective measures like error-free t-units5. Others counted the number of errors
with or without classifying them.
The accuracy of writing texts has been analyzed through a number of discourse
analytic measures. Usually, errors in the text are counted in some fashion. Two
approaches have been developed. The first one involves focusing on whether a
structural unit (e.g. clause, t-unit) is error free. Typical measures found in the lit-
erature include the number of error-free t-units per total number of t-units or the
number of error-free clauses per total number of clauses. For this measure, a deci-
sion has to be made as to what constitutes an error. According to Wolfe-Quintero
(1998), this decision might be quite subjective as it might depend on the re-
searcher’s preferences or views on what constitutes an error for a certain popula-
tion of students. Error-free measures of accuracy have been criticized by Bar-
dovi-Harlig and Bofman (1989) for not being sufficiently discriminating because
a unit with only one error is treated in the same way as a unit with more than one
error. Furthermore, error-free measures do not disclose the types of errors that are
involved as some might impede communication more than others. In light of these
criticisms, a second approach to measuring accuracy was developed based on the
number of errors in relation to a certain production unit (e.g. the number of errors
per t-unit). One problem of this method is that all errors are still given the same
weight. Some researchers (e.g. Homburg, 1984) have developed a system of cod-
ing errors according to gravity, but Wolfe-Quintero et al. (1998) argue that these
systems are usually based on the intuitions of the researchers rather than being
empirically based.
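As a concrete illustration, both approaches can be sketched in a few lines of Python. This is a hypothetical example, not taken from the studies reviewed: it assumes the script has already been segmented into t-units and that a rater has counted the errors in each unit by hand.

```python
# Hypothetical sketch: accuracy measures computed from rater-annotated t-units.
# Each t-unit is represented as (text, error_count); segmentation and error
# identification are assumed to have been done manually by a trained rater.

def accuracy_measures(t_units):
    """Return the two families of accuracy measures discussed above."""
    total = len(t_units)
    error_free = sum(1 for _, errors in t_units if errors == 0)
    total_errors = sum(errors for _, errors in t_units)
    return {
        "error_free_t_units": error_free,
        "error_free_t_unit_ratio": error_free / total,
        "errors_per_t_unit": total_errors / total,
    }

sample = [
    ("The results was surprising", 1),      # one agreement error
    ("The study used two groups", 0),       # error-free
    ("Each group they wrote a essays", 2),  # two errors
    ("Scores improved over time", 0),       # error-free
]
print(accuracy_measures(sample))
# error_free_t_unit_ratio = 2/4 = 0.5; errors_per_t_unit = 3/4 = 0.75
```

The same logic applies to the clause-based variants; only the unit of segmentation changes. Note that the subjective step, deciding what counts as an error, happens before the arithmetic.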
Several studies have found a relationship between the number of error-free t-
units and proficiency as measured by program level (Hirano, 1991; Sharma, 1980;
Tedick, 1990), standardized test scores (Hirano, 1991), holistic ratings (Homburg,
1984; Perkins, 1980), grades (Tomita, 1990) or comparison with native speakers
(Perkins & Leahy, 1980). Two studies found no relationship between error-free t-
units and grades (Kawata, 1992; Perkins & Leahy, 1980). Wolfe-Quintero et al.
argue that for the number of error-free t-units to be effective, a time-limit for
completing the writing task needs to be set (as was done by most studies they in-
vestigated). Another measure that seems promising according to Wolfe-Quintero
et al. is the number of error-free clauses. This measure has only been employed
by Ishikawa (1995) to differentiate between proficiency levels. Ishikawa devel-
oped this measure with the idea that her beginning students were less likely to
have errors in all clauses than in t-units, because the string is likely to be shorter.
She found a significant improvement after three months of instruction.
The error-free t-unit ratio (error-free t-units per total number of t-units) or the
percentage of error-free t-units has been employed by several studies to examine
the relationship between this measure and proficiency. According to Wolfe-
Quintero et al., twelve studies have found a significant relationship but eleven
have not. Of the twelve significant studies, some investigated the relationship be-
tween error-free t-units ratio and program level (Hirano, 1991; Larsen-Freeman,
1978; Larsen-Freeman & Strom, 1977), test scores (Arnaud, 1992; Hirano, 1991;
Vann, 1979) or grades (Kawata, 1992; Tomita, 1990). However, three studies re-
lating to program level were not significant (Henry, 1996; Larsen-Freeman, 1983;
Tapia, 1993). Some longitudinal studies were also not able to capture a significant
increase in accuracy, indicating that the percentage of error-free t-units cannot
capture short-term increases over time. Another accuracy measure, error-free
clause ratio (total number of error-free clauses divided by the total number of
clauses) was used by only two researchers with mixed results. Ishikawa (1995)
chose this measure as a smaller unit of analysis for her beginner-level learners.
She found a significant increase for one of her groups over a three month period.
Her other group and Tapia’s (1993) students all increased in this measure without
showing a statistically significant difference. Another measure in this group is
errors per t-unit (total number of errors divided by the total number of t-units).
This measure has been shown to be related to holistic ratings (Flahive & Gerlach
Snow, 1980; Perkins, 1980; Perkins & Leahy, 1980) but has been less successful
in discriminating between program level and proficiency level (Flahive & Gerlach
Snow, 1980; Homburg, 1984). Wolfe-Quintero et al. therefore argue that this
might indicate that this measure does not discriminate between program level and
proficiency level, but rather gives an indication of what teachers look for when
making comparative judgements between learners. However, they argue that this
issue needs to be examined in more detail. The last measure in this group is the
errors per clause ratio (total number of errors divided by total number of
clauses). The findings were the same as those of the errors per t-unit measure,
showing that these two measures are more related to holistic ratings than to pro-
gram level.
4.5.1.2 Fluency
Fluency has been defined in a variety of ways. It might refer to the smoothness of
writing or speech in terms of temporal aspects; it might represent the level of
automatisation of psychological processes; or it might be defined in contrast to
accuracy (Koponen & Riggenbach, 2000). Reflecting the multi-faceted nature of
fluency, researchers have developed a number of measures to assess fluency. Ske-
han (2003) has identified four groups of measures: breakdown fluency, repair flu-
ency, speech/writing rate and automatisation. All these categories were developed
in the context of speech rather than writing. They are, however, just as applicable
to the context of writing. Breakdown fluency in the context of speech is measured
by silence. In the context of writing this could be measured by a break in the writ-
ing process, which cannot be examined on the basis of the product alone. Repair
fluency has been operationalised in the context of speech as reformulations, re-
placements, false starts and repetition. For writing, this could be measured by the
number of revisions (self-corrections) a writer undertakes during the composing
process (Chenoweth & Hayes, 2001). Kellogg (1996) has shown that this editing
process can take place at any stage during or after the writing process. Another
sub-category of fluency is speech/writing rate, a temporal aspect of fluency, op-
erationalised by the number of words per minute. The final sub-group is automa-
tisation, measured by length of run (Skehan, 2003). Only repair fluency and tem-
poral aspects of writing (writing rate) can be measured on the basis of a writing
product. Furthermore, writing rate can only be established if the product was pro-
duced under a time limit or if the time spent writing was recorded. That repair flu-
ency and temporal aspects of fluency are separate entities has been shown by Ta-
vakoli and Skehan’s (2005) factor analysis (Table 12).
In the context of writing, Chenoweth and Hayes (2001) found that even within a
period of only two semesters their students displayed a significant increase in
writing fluency. This included an increase in burst length (automatisation), a de-
crease in the frequency of revision (repair fluency), and an increase in the number
of words accepted and written down (writing rate).
One measure that can be used to investigate temporal aspects of fluency is the
number of words which, according to Wolfe-Quintero et al. (1998), has produced
rather mixed results. According to their analysis, eleven studies found a signifi-
cant relationship between the number of words and writing development, while
seven studies did not. However, this measure might be more reliable if it is applied
to writing that has been produced under time pressure. Kennedy and Thorp
(2002), who investigated the differences in writing performance at three different
IELTS levels, found a difference between essays at levels 4, 6 and 8, with writers
at level 4 struggling to meet the word limit. However, they also report a large
amount of overlap between the levels. Cumming et al. (2005), in a more recent
study focussing on the next generation TOEFL, found statistically significant dif-
ferences only between essays at levels 3 and 4 (and levels 3 and 5), but no differ-
ences between levels 4 and 5. The descriptive statistics indicate a slight increase
in the number of words between levels 4 and 5. Another interesting measure to
pursue might be the number of verbs. This measure has only been used once
(Harley & King, 1989) in a study which compared native and non-native speakers
and which produced significant results. However, it has never been used to differ-
entiate between different proficiency levels.
No studies of the writing product have investigated repair fluency. The number of
self-corrections, a measure mirroring the number of reformulations and false
starts in speech, might be a worthwhile measure to pursue in this study.
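The product-based fluency measures discussed above reduce to simple counts. The sketch below is illustrative only: the writing time and the number of self-corrections are assumed to have been recorded or annotated by hand, since neither can be recovered from the word count alone.

```python
# Hypothetical sketch of product-based fluency measures: text length,
# writing rate (requires a recorded writing time) and repair fluency
# (self-corrections must be identified manually in the script).

def fluency_measures(text, minutes, self_corrections):
    words = len(text.split())
    return {
        "number_of_words": words,
        "words_per_minute": words / minutes,
        "self_corrections": self_corrections,
    }

script = "The government should invest more money in public transport systems"
print(fluency_measures(script, minutes=30, self_corrections=4))
# 10 words written in 30 minutes -> 10/30 ≈ 0.33 words per minute
```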
4.5.1.3 Complexity
Ellis and Barkhuizen (2005) suggest that complexity can be analysed according to
the language aspects they relate to. These could include interactional, proposi-
tional, functional, grammatical or lexical aspects. As propositional and functional
complexity are hard to operationalize and interactional complexity is a feature of
speech, only grammatical and lexical complexity will be considered here
(following Wolfe-Quintero et al., 1998).
The measures that have been shown to most significantly distinguish between pro-
ficiency levels, according to Wolfe-Quintero et al. (1998), seem to be the t-unit
complexity ratio, the dependent clause per clause ratio and the dependent clause
per t-unit ratio (with the last two producing rather mixed results in previous stud-
ies).
The t-unit complexity ratio (number of clauses per t-units) was first used by Hunt
(1965). A t-unit contains one independent clause plus any number of other clauses
(including adverbial, adjectival and nominal clauses). Therefore, a t-unit complex-
ity ratio of two would mean that on average each t-unit consists of one independ-
ent clause plus one other clause. Wolfe-Quintero et al. (1998) point out that in L2
writing not all sentences are marked for tense or have subjects. They argue that it
is therefore important to include all finite and non-finite verb phrases in the t-unit
(as was done by Bardovi-Harlig & Bofman, 1989). This would change the meas-
ure to a verb phrases per t-unit measure. They argue that it would be useful to
compare which of these measures is more revealing. The t-unit complexity ratio
was designed to measure grammatical complexity, assuming that in more complex
writing there are more clauses per t-unit. However, in second language research,
there have been mixed results. Hirano (1991) found a significant relationship be-
tween the t-unit complexity ratio and program level, as did Cooper (1976) and
Monroe (1975) between this measure and school level, and Flahive and Snow
(1980) found a relationship between this measure and a number of their program
levels. However other studies (Bardovi-Harlig & Bofman, 1989; Ishikawa, 1995;
Perkins, 1980; Sharma, 1980) obtained no significant results. For example, Cum-
ming et al.’s (2005) detailed analysis of TOEFL essays resulted in a similar num-
ber of clauses across proficiency levels. The means ranged from 1.5 to 1.8 for the
different levels. Similarly, Banerjee and Franceschina (2006) found no differences
between proficiency levels when conducting a similar analysis on IELTS writing
scripts. According to Wolfe-Quintero et al. (1998) this measure is most related to
program or school level and holistic ratings. They also point to the fact that even
in studies that found no significant results, scores on this measure increased.
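Once the segmentation has been done, the ratio itself is trivial to compute. The sketch below is a hypothetical illustration, assuming an analyst has already counted the clauses in each t-unit; the verb-phrases-per-t-unit variant differs only in what is counted per unit.

```python
# Illustrative sketch of the t-unit complexity ratio (clauses per t-unit).
# Clause counts per t-unit are assumed to come from manual analysis.

def t_unit_complexity_ratio(clause_counts):
    """clause_counts: number of clauses in each t-unit of the script."""
    return sum(clause_counts) / len(clause_counts)

# Four t-units containing 2, 1, 3 and 2 clauses respectively:
print(t_unit_complexity_ratio([2, 1, 3, 2]))  # 8 / 4 = 2.0
```

A result of 2.0 corresponds to the interpretation given above: each t-unit consists, on average, of one independent clause plus one other clause.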
(1979) did not. Vann also did not find the measure to be a predictor in a multiple
regression step analysis of TOEFL scores.
The most commonly known ratio measure of lexical complexity is the type/token
ratio (total number of different word types divided by the total number of words).
Type/token ratios, however, have been criticized, as they are sensitive to the
length of the writing sample. It is therefore important that if the type/token ratio is
used, the length of the sample has to be limited to a certain number of words6.
This might be one possible reason for Cumming and Mellow (1996) not finding a
significant difference between their learners of English in different program lev-
els. They did however find that, although not significant, the data showed the ex-
pected trend.
In a more recent study conducted by Cumming et al. (2005) in the context of the
next generation TOEFL, the authors used average word length as an indicator of
lexical complexity. This measure had been used successfully in other studies (e.g.
Engber, 1995; Frase et al., 1999; Grant & Ginther, 2000), but failed to differenti-
ate between candidates at different proficiency levels in Cumming et al.’s study.
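Both lexical complexity measures are straightforward to compute. The sketch below is illustrative: it truncates the sample before computing the type/token ratio to control for the length sensitivity noted above (the 100-word limit is an arbitrary choice for this example, not one taken from the studies cited).

```python
# Illustrative sketch of two lexical complexity measures: the type/token
# ratio (computed on a length-limited sample) and average word length.

def type_token_ratio(text, limit=100):
    tokens = text.lower().split()[:limit]  # truncate to control for length
    return len(set(tokens)) / len(tokens)

def average_word_length(text):
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens)

essay = "the cat sat on the mat and the dog sat on the rug"
print(type_token_ratio(essay))     # 8 types / 13 tokens ≈ 0.615
print(average_word_length(essay))  # 37 characters / 13 tokens ≈ 2.846
```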
4.5.1.4 Summary of accuracy, fluency and complexity
As the constructs of accuracy, fluency and complexity are based on a current view
of second language acquisition, they are more promising for the investigation of
writing performance than more traditional constructs and measures like grammar,
vocabulary or error counts. Measures of accuracy, fluency and complexity have
been shown to successfully distinguish between different levels of writing devel-
opment and have been shown to be separate constructs, as shown by Skehan’s fac-
tor analysis. A number of measures from the literature review were selected to be
further pursued in the pilot study. These can be seen in Table 13 below.
Table 13: Measures of accuracy, fluency and complexity worthy of further investigation
Construct Measures
Accuracy Number of error-free t-units
Number of error-free clauses
Error-free t-unit ratio
Error-free clause ratio
Errors per t-unit
Fluency Number of words
Number of self-corrections
Grammatical complexity Clauses per t-unit
Dependent clauses per t-unit
Dependent clauses per clause
Lexical complexity Average word length
Lexical sophistication
Measures were selected based on two principles. Firstly, they needed to have been
shown by previous research to be successful in distinguishing between different
proficiency levels of writing and secondly, they needed to be sufficiently easy for
raters to apply during the rating process.
4.5.2 Mechanics
Very few studies have attempted to quantify aspects of mechanics, which include
spelling, punctuation, capitalization, and indentation (Polio, 2001). Most studies
that have investigated this construct to date (e.g. Pennington & So, 1993; Tsang,
1996), have made use of the Jacobs scale (Jacobs et al., 1981). However, none of
the studies had mechanics as a focus. It is therefore not clear if the scale is able to
reliably distinguish between different levels of mechanical quality. A second issue
raised by Polio (2001) is that it is not entirely clear if mechanics is a construct at
all. It is for example not clear if the different sub-components are related. Polio
further points out that in studies looking at accuracy, spelling is in fact often dis-
regarded. Bereiter (1980) argues however, that writing is significantly different
from speech in that it requires certain conventions like spelling and punctuation
and it might therefore be necessary to measure these.
Two studies were identified that measured aspects of mechanics without the use
of a rating scale. Firstly, Mugharbil (1999) set out to discover the order in which
second language learners acquire punctuation marks. He concluded that the period
(or full stop) was the first punctuation mark acquired and the semi-colon the last.
For beginning learners, he was able to show that the comma was the least often
correctly placed. The second study that included a measure for mechanics was
conducted by Kennedy and Thorp (2002) in the context of an analysis of textual
features produced by candidates of the IELTS test. The authors looked at para-
graphing and counted the number of paragraphs produced by writers at three dif-
ferent levels of writing, levels 4, 6 and 8.
Table 14: Percentage of essays by number of paragraphs (Kennedy & Thorp, 2002)

Number of paragraphs             1    2    3    4    5    6    7    8    9   10
Percentage of essays – Level 6   2    0    6   48   26   14    2    0    2    0
Percentage of essays – Level 4  10    8   18   24   22   14    4    0    0    0
They were able to show that ten percent of the writers at level 4 produced only
one paragraph, whilst writers at level 6 generally produced four or more para-
graphs. However, the results (shown in Table 14) are anything but conclusive.
So overall, the area of mechanics seems to have been very little explored in stud-
ies of second language writing. Several areas seem to be of interest and will there-
fore be further pursued in the pilot study. These are: punctuation, spelling, capi-
talization and paragraphing.
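As a rough illustration of how some of these features could be counted automatically, the hypothetical sketch below profiles paragraphing and punctuation; spelling and capitalization errors would still require manual annotation. The blank-line paragraph convention is an assumption about how scripts are transcribed.

```python
# Hypothetical sketch: counting mechanics features from a transcribed script.
# Paragraphs are assumed to be separated by blank lines in the transcription.

def mechanics_profile(text):
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "paragraphs": len(paragraphs),
        "periods": text.count("."),
        "commas": text.count(","),
        "semicolons": text.count(";"),
    }

script = "First paragraph, with a comma.\n\nSecond paragraph; it has a semicolon."
print(mechanics_profile(script))
# {'paragraphs': 2, 'periods': 2, 'commas': 1, 'semicolons': 1}
```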
posed study. Therefore, only measures that can be operationalised for a rating
scale will be reviewed.
4.5.3.1 Coherence
Overall, it can be said that coherence resides at textual level (not sentence level),
where it creates links between ideas to create meaning, show organisation and
make a text summarizable. It is further concerned with how people interpret the
text. Coherence is created not only by the writer’s purpose but also by the readers’
(possibly even whole discourse communities’) expectations. Lautamatti (1990) dis-
tinguishes two types of coherence: interactional and propositional. The former is
created when succeeding speech acts in discourse are linked. This is the case in
spoken informal language. The latter occurs through links created by the idea-
tional content of the discourse and is evident in more formal settings and written
language. This chapter will discuss only propositional coherence that can be found
in writing.
Taking the above definitions of coherence into account, it is not surprising that the
concept of coherence has been one of the most criticized in existing rating scales.
Descriptors are usually vague, as has been shown by Watson-Todd et al. (2004),
who provide examples of typical descriptors. For example, good writing should be
‘well organised’, and ‘cohesive’, should have a ‘clear progression of well-linked
ideas’. Poor quality writing, on the other hand, is often described as so ‘fragmen-
tary that comprehension of the intended communication is virtually impossible’.
These descriptors often require subjective interpretation and might lead to confu-
sion among the raters. Hoey (1991) argues that, because coherence resides outside
the text, judgments will inevitably have to be subjective and vary from reader to
reader. Chiang (1999; 2003), however, was able to show that raters, contrary to
what has been shown by many studies, put more emphasis on coherence and co-
hesion in writing than on grammar, if they have clear descriptors to focus on. The
following section on coherence therefore aims to illustrate the debate in the litera-
ture on coherence and describe the measures which have been proposed to meas-
ure coherence objectively.
Research investigating coherence dates back as far as the 19th century. Then, how-
ever, coherence was predominantly defined in terms of sentence connections and
paragraph structure (Lee, 2002a). Only since the emergence of discourse analysis
in the 1960s has more emphasis been placed on constituents larger than the sen-
tence. Researchers began investigating what principles tie a text together and in
what contexts texts occur. Coherence, according to Grabe and Kaplan (1996),
should derive its meaning from what a text is and how a text is constructed. This
can be considered either as internal to the text or internal to the reader. If defined
as internal to the text, coherence can be explained as the formal properties of a
text. In this context Halliday and Hasan (1976) developed their theory of cohe-
sion, which will be discussed in more detail in the section on cohesion below.
Other researchers investigated information distribution in texts, introducing the
concepts of given and new information (Vande Kopple, 1983, 1986), also referred
to as topic and comment (Connor & Farmer, 1990) or theme and rheme (Halliday,
1985, 1994). From these, Lautamatti (1987) and later Schneider and Connor (1990)
developed topical structure analysis as a tool for analyzing coherence. They were
able to identify different structural patterns in texts and were able to teach this
method to ESL students to successfully investigate the coherence of their texts
(Connor & Farmer, 1990). This method will be described in more detail later in
this chapter.
Kintsch and van Dijk (1978) described coherence in terms of propositions and their
ordering in text. Thus coherence has been described in terms of cohesion and the
ordering of information structure to form the macrostructure of texts. Hoey (1991)
looked at lexical patterns in a text, whilst other linguists have looked at metadis-
coursal features of a text, for example, logical connectors, sequencers and hedges,
and how they contribute to the overall coherence of texts (Cheng & Steffensen,
1996; Crismore, Markkanen, & Steffensen, 1993). There is therefore, from a lin-
guistic perspective, plenty of evidence that coherence can be found, at least partly,
within texts.
Other research, however, has defined coherence as internal to the reader. This
view has its basis in modern reading theories, which have shown that text process-
ing is an interaction between the reader and the text and that readers use their
world knowledge and knowledge of text structures to make sense of a text
(Carrell, 1988). Readers can anticipate upcoming textual information, which helps
to organise the text into understandable information (Bamberg, 1983). The reader
can therefore be regarded as an important contributor to coherence.
Although it is quite clear from these two strands of research that coherence resides
both in the text and is created through an interaction between the reader and the
text, for the purpose of this research, only coherence internal to the text is consid-
ered. Although probably not a complete picture of coherence, coherence internal
to the text can be more easily operationalised for the purpose of rating scale de-
scriptors and can be defined in more detail. Aspects of writing that are created by
an interaction between the reader and the text are investigated in a later section
called ‘reader/writer interaction’.
Several different ways of measuring coherence have been proposed in the litera-
ture. This section will describe three measures: metadiscourse markers, topical
structure analysis and topic-based analysis.
Logical connectives include coordinating conjunctions (e.g. and, but) and con-
junctive adverbs (e.g. therefore, in addition). Sequencers include numbers as well
as counting and numbering words like ‘first’, ‘second’ and so on. Reminders are
expressions that refer to earlier text, like, for instance, ‘as I noted earlier’. Topi-
calizers are words or phrases that indicate a topic shift. These can include ‘well’,
‘now’, ‘in regard to’ or ‘speaking of’.
Interpretive markers include code glosses and illocution markers. Code glosses
are explanations of text introduced by expressions such as ‘namely’, ‘for example’
or ‘what I mean is’. These expressions provide more information for words or
propositions which the writer anticipates will be difficult for the reader. Illocution
markers name the act that the writer is performing. These might include expres-
sions like ‘I state again that…’, ‘to sum up’, ‘to conclude’, ‘to give an example’
or ‘I plead with you’.
Intaraprawat and Steffensen (1995) used the categories described above to inves-
tigate the difference between good and poor ESL essays. They found that good
essays displayed twice as many metadiscoursal features as poor essays. They also
found a higher density of metadiscourse features in the good essays (calculated as
features per average number of t-units). Good writers used more than twice the
number of code glosses and three times as many illocutionary markers. They
found very little difference in connectives between the two groups and explained
this by suggesting that these are explicitly taught in many writing courses. The
good essays had a higher percentage of interpersonal features while the poor had a
higher percentage of textual features.
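A density measure of this kind can be sketched in a few lines. The example below is hypothetical and much cruder than Intaraprawat and Steffensen's procedure: the marker lists are small illustrative samples of the categories described above (not the full taxonomy), t-unit segmentation is assumed to have been done beforehand, and matching is by raw substring search, which real coding would refine.

```python
# Rough sketch of metadiscourse density (markers per t-unit). The marker
# lists are illustrative samples only, and crude substring matching stands
# in for the hand-coding an actual analysis would require.

MARKERS = {
    "logical_connectives": ["but", "therefore", "in addition"],
    "sequencers": ["first", "second", "finally"],
    "code_glosses": ["namely", "for example", "what i mean is"],
    "illocution_markers": ["to sum up", "to conclude"],
}

def metadiscourse_density(t_units):
    text = " ".join(t_units).lower()
    hits = sum(text.count(m) for ms in MARKERS.values() for m in ms)
    return hits / len(t_units)

essay = [
    "First, public transport reduces congestion",
    "In addition, it lowers emissions",
    "To sum up, investment is justified",
]
print(metadiscourse_density(essay))  # 3 markers / 3 t-units = 1.0
```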
Topical structure analysis (TSA) was first developed by Lautamatti (1987) in the
context of text readability to analyse topic development in reading material. She
defined the topic of a sentence as ‘what the sentence is about’ and the comment of
a sentence as ‘what is said about the topic’. Lautamatti described three types of
progression which advance the discourse topic by developing a sequence of sen-
tence topics. Through this sequence of sentence topics, local coherence is created.
The three types of progression can be summarized as follows (Hoenisch, 1996):
Parallel progression, in which the topics of successive sentences are the same,
producing a repetition of topic that reinforces the idea for the reader (<a, b>, <a,
c>, <a, d>);
Sequential progression, in which the topic of a sentence derives from the comment
of the preceding sentence (<a, b>, <b, c>, <c, d>);
Extended parallel progression, in which the first and the last topics of a piece of
text are the same but are interrupted with some sequential progression (<a, b>, <b,
c>, <a, d>).
Witte (1983a; 1983b) made use of TSA in writing research. He compared two
groups of persuasive writing scripts, one rated high and one rated low, in terms of
the use of the three types of progression described above. He found that the higher
level writers used less sequential progression and more extended and parallel pro-
gression. There are however several shortcomings in Witte’s study. Firstly, the
raters were not professional raters, but were rather recruited from a variety of pro-
fessions. Secondly, Witte did not use a standardized scoring scheme. He also con-
ducted the study in a controlled revision situation. The students revised a text
written by another person. Furthermore, Witte did not report any intercoder reli-
ability analysis.
In 1990, Schneider and Connor set out to compare the use of topical structure by
45 writers taking the TWE (Test of Written English). They grouped the 45 argu-
mentative essays into three different levels (high, medium, low). As with Witte’s
study, Schneider and Connor did not report any intercoder reliability statistics.
The findings were contradictory to Witte’s findings. The higher level writers used
more sequential progression while the low and middle group used more parallel
progression. There was no difference between the levels in the use of extended
parallel progression. Schneider and Connor drew up clear guidelines on how to
code TSA and also suggested a re-interpretation of sequential progression in their
discussion section. They suggested dividing sequential progression into the fol-
lowing subcategories:
Direct sequential progression, in which the comment of the previous sentence be-
comes the topic of the following sentence. The topic and comment are either word
derivations (e.g. science, scientist) or they form a part-whole relation (these
groups, housewives, children) (<a,b>, <b,c>, <c,d>).
Unrelated sequential progression, in which topics are not clearly related to either
the previous sentence topic or discourse topic (<a,b>, <c,d>, <e,f>).
Wu (1997), applying these categories, found no difference in
the use of parallel progression between high and low level writers. Higher level
writers used slightly more extended parallel progression and more related sequen-
tial progression.
A more recent study using TSA to compare groups of writing based on holistic
ratings was undertaken by Burneikaité and Zabiliúté (2003). Using the original
criteria of topical structure developed by Lautamatti and Witte, they investigated
the use of topical structure in argumentative essays by three groups of students
rated as high, middle and low, based on a rating scale adapted from Tribble
(1996). They found that the lower level writers overused parallel progression
whilst the higher level writers used a balance between parallel and extended paral-
lel progression. The differences in terms of sequential progression were small,
although they did show that lower level writers used this type of progression
slightly less regularly. Burneikaité and Zabiliúté failed to report any inter-rater
reliability statistics.
All studies conducted since Witte’s study in 1983 have produced broadly similar
findings, although with some differences. Two out of three studies found that lower
level writers used more parallel progression than higher level writers. However,
Wu (1997) found no significant difference. All three studies found that higher
level writers used more extended parallel progression. In terms of sequential pro-
gression the differences in findings can be explained by the different ways this
category was applied. Schneider and Connor (1990) and Burneikaité and
Zabiliúté (2003) used the definition of sequential progression with no subcategories.
Both studies found that higher level writers use more sequential progression. Wu
found no differences between different levels of writing using this same category.
However, he was able to show that higher level writers used more related sequen-
tial progression. It is also not entirely clear how much task type or topic familiar-
ity influences the use of topical structure and if findings can be transferred from
one writing situation to another.
In conclusion, it can be said that coherence remains a fuzzy concept, and that it
will be hard to define the concept operationally. For the purpose of this study,
topical structure analysis and metadiscoursal markers seem the most promising.
4.5.3.2 Cohesion
Cohesion has been defined by Fitzgerald and Spiegel (1986) as ‘the linguistic
features which help to make a sequence of sentences a text’ (i.e. give it texture).
Reid (1992) defined it as ‘explicit linguistic devices used to convey information,
specifically the discrete lexical cues used to signal relations between parts of dis-
course’. To her, cohesion devices are therefore words and phrases that act as sig-
nals to the reader; these words relate what is being stated to what has been stated
and to what will soon be stated. She goes on to argue that cohesion is a subcate-
gory or sub-element of coherence.
Analysis of cohesion has received much attention among applied linguists and
writing researchers. The term ‘cohesion’ was popularized by Halliday and Hasan
(1976) who developed a model for analysing texts. They showed that cohesive
ties involve a relation between two items within a text. One item cannot be effec-
tively decoded without reference to the other. Cohesive ties are ties that operate
intersententially (between sentences). For the purpose of this study, cohesive ties
were operationalized as operating between t-units. They can also, however, as was
pointed out by Halliday and Hasan (1976), operate between clauses.
Halliday and Hasan show that cohesion is not always necessary in achieving
communication, but helps guide the reader’s or listener’s understanding of text
units. Their model has been criticized by various authors, but nevertheless has
been a major influence in language teaching.
Halliday and Hasan (1976) identify the following two broad types of cohesion in
English: grammatical cohesion and lexical cohesion.
The first item of grammatical cohesion described by Halliday and Hasan (1976) is
the term reference. Reference refers to items of language that, instead of being
interpreted semantically in their own right, make reference to other items for which
the context is clear to both sender and receiver. The retrieval of these items can be
either exophoric (outside the text) or endophoric (within the text). Exophoric
reference looks outside the text to the immediate situation or refers to cultural or
general knowledge
(homophoric). Endophoric reference can be either anaphoric (referring to a word
or phrase used earlier in a text) or cataphoric (referring to a word or phrase used
later in the text). There are three types of reference: personal, demonstrative and
comparative. These words indicate to the listener/reader that information is to be
retrieved from elsewhere.
A further category of grammatical cohesion described by Halliday and Hasan
(1976) is conjunction, which links stretches of text through connecting elements.
Conjunctive relations fall into four types:
- additive
- adversative
- causal
- temporal
The second major group of cohesive relations is lexical cohesion. The cohesive
effect is achieved by the selection of certain vocabulary items that occur in the
context of related lexical items. Halliday and Hasan (1976) identify two principal
kinds and their subcategories:
- reiteration:
  - repetition
  - synonym, near-synonym
  - antonym
  - superordinate relations: hyponym, meronym
  - general nouns
- collocation
Several authors have debated whether collocations properly belong to the notion
of lexical cohesion, since collocation only refers to the probability that lexical
items will co-occur and there is no semantic relation between them.
Halliday and Hasan (1976) acknowledged some of the problems with their model
when they suggested that the boundaries between lexical and grammatical cohe-
sion are not always clear. They further observed that the closer the ties the greater
the cohesive strength, and that a higher density of cohesive ties increases the co-
hesive strength.
Halliday and Hasan’s (1976) categories of cohesion have been applied in a num-
ber of research projects with varying results. Witte and Faigley (1981) in the con-
text of L1 English, for example, compared the cohesion of high and low level es-
says. They found a higher density of cohesive ties in high level essays. Almost a
third of all words in the better essays contributed to cohesion and the cohesive ties
spanned shorter distances than in lower-level essays. They also found that the ma-
jority of lexical ties in low-level essays involved repetition, whilst high-level es-
says relied more on lexical collocation. In contrast, Neuner (1987) found that none
of the ties were used more in good essays than in poor freshman essays. He did,
however, find a difference between cohesive chains (three or more cohesive ties
that refer to each other), in the cohesive distance and in the variety of word types
and the maturity of word choice. For example, in good essays, cohesive chains are
sustained over greater distances and involve greater proportions of the whole text.
Good writers also used more different words in their cohesive chains as well as
less frequent words than the poor writers. A very similar result was found by
Crowhurst (1987), who compared cohesion at different grade levels in two differ-
ent genres (arguments and narratives). He also found that the overall frequency of
cohesive ties did not increase with grade level, but that synonyms and collocations
(a sign of more mature vocabulary) did.
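Neuner’s notion of a cohesive chain (three or more cohesive ties that refer to each other) can be illustrated with a small sketch that groups coder-identified ties into chains. This is a hypothetical illustration: identifying the ties themselves is manual analytic work, and the item indices here are invented.

```python
from collections import defaultdict

def cohesive_chains(ties):
    """Group pairs of tied items (given as index pairs) into chains,
    keeping only chains of three or more items, following Neuner's
    (1987) definition of a cohesive chain."""
    graph = defaultdict(set)
    for a, b in ties:
        graph[a].add(b)
        graph[b].add(a)
    seen, chains = set(), []
    for node in graph:
        if node in seen:
            continue
        # collect everything connected to this node
        stack, chain = [node], set()
        while stack:
            n = stack.pop()
            if n in chain:
                continue
            chain.add(n)
            stack.extend(graph[n] - chain)
        seen |= chain
        if len(chain) >= 3:
            chains.append(sorted(chain))
    return chains
```

Given ties [(1, 2), (2, 5), (7, 8)], items 1, 2 and 5 form one chain of length three, while the pair 7–8 falls below the threshold.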
Jafarpur (1991) applied Halliday and Hasan’s categories to ESL writing. He found
that in the essays the number of cohesive ties and the number of different types of
cohesion successfully discriminated between different proficiency levels. Reid
(1992), investigating ESL and NS writing, focussed on the percentages of coordi-
nate conjunctions, subordinate conjunctions, prepositions and pronouns and found
that ESL writers used more pronouns and coordinating conjunctions than NS, but
fewer prepositions and subordinating conjunctions. Two other studies also com-
pared native and non-native speaking writers in terms of their use of connectors.
Field and Yip (1992) were able to show that Cantonese writers significantly over-
use such devices. However, Granger and Tyson (1996) in a large scale investiga-
tion of the International Corpus of Learner English were not able to confirm these
findings. They emphasised that a qualitative analysis of the connectors is impor-
tant. They documented the underuse of some connectors and overuse of others.
Two recent studies compared the performances of test takers over different profi-
ciency levels. Firstly, Kennedy and Thorp (2002), in the context of IELTS, were
able to show that writers at levels 4 and 6 used markers like ‘however, firstly,
secondly’ and subordinators more than writers at level 8. They concluded that
writers at level 8 seemed to have other means at their disposal to mark these con-
nections, whilst lower level writers needed to rely on these overt lexico-
grammatical markers to structure their argument. Even more recently and also in
the context of IELTS, Banerjee and Franceschina (2006) looked at the use of de-
monstrative reference over five different IELTS levels. They found that the use of
‘this’ and ‘these’ increased with proficiency level whilst the use of ‘that’ and
‘those’ stayed relatively level or decreased.
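A raw frequency count of the kind underlying such comparisons can be sketched as follows. This is a minimal illustration, not Banerjee and Franceschina’s actual procedure; a proper analysis would need to distinguish demonstrative reference from other uses of these words.

```python
import re

def demonstrative_counts(text):
    """Count near ('this'/'these') and far ('that'/'those')
    demonstrative forms in a script (surface forms only)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    near = sum(t in ("this", "these") for t in tokens)
    far = sum(t in ("that", "those") for t in tokens)
    return {"this/these": near, "that/those": far}
```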
Several authors have specifically investigated lexical cohesion (Hoey, 1991; Liu,
2000; Reynolds, 1995), arguing that this is the most common and important type
of cohesion. Hoey, for example, investigated the types of lexical repetition and
classified them into simple and complex lexical repetition and paraphrase. He
showed how lexical repetition can be mapped into a matrix, revealing the links
throughout the whole text. This method of analysis, although very promising, will
not be further pursued, as it is too complex to be performed by raters during a rat-
ing session. Both Hasan’s (1984) and Hoey’s (1991) models were developed for
the first language writing context and rely on the concept that quantity is signifi-
cant. However, Reynolds (1995) questions whether quantity makes a text more
cohesive. It is also not clear if this can be transferred to the L2 writing context.
4.5.4 Reader/writer interaction
Reader/writer interaction expands the focus of study beyond the ideational dimen-
sions of texts to the ways in which texts function at the interpersonal level.
Hyland (2000b) argues that writers do more than produce texts in which they pre-
sent an external reality; they also negotiate the status of their claims, present their
work so that readers are most likely to find it persuasive, and balance fact with
evaluation and certainty with caution. Writers have to take a position with respect
to their statements and to their audiences, and a variety of features have been ex-
amined to see how they contribute to this negotiation of a successful reader-writer
relationship.
Hedges have been defined as ‘ways in which authors tone down uncertain or po-
tentially risky claims’ (Hyland, 2000a), as ‘conventions of inexplicitness’ and as
‘a guarded stance’ (P. Shaw & Liu, 1998), as structures that ‘signal a tentative as-
sessment of referential information and convey collegial respect for the views of
colleagues’ (Hyland, 2000a) or as ‘the absence of categorical commitment, the
expression of uncertainty, typically realized by lexical devices such as might’
(Hyland, 2000b). Examples of hedges are epistemic modals like might, may,
could, and other structures such as I think, I feel, I suppose, perhaps, maybe, it is
possible. Hyland (1996a; 1996b; 1998) differentiates between two functions of
hedging: content-oriented and reader-oriented. Content-oriented hedges mitigate
between the propositional content of a piece of writing and the discourse commu-
nity’s conception of what the truth is like. Content-oriented hedges can in turn be
divided into accuracy-oriented hedges and writer-oriented hedges. The writer
needs to express propositions as accurately as possible. This is made possible by
accuracy-oriented hedges which allow the writer to express claims with greater
precision, acknowledge uncertainty and signal that a statement is based on the
writer’s plausible reasoning rather than assured knowledge. The writer, however,
also needs to acknowledge contrary views from readers. Writer-oriented hedges
permit the writer to speculate. The second major category of hedges are the
reader-oriented hedges. Through these, the writer develops a writer-reader
relationship. These structures help to tone down statements in order to gain the
reader’s ratification of claims. Hyland (2000b) suggests that hedges are highly
frequent in academic writing and are more frequent than one in every 50 words.
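Counting such lexical hedging devices can be sketched as a simple frequency count. This is a minimal illustration using only the example items listed above; a real analysis would need a much fuller inventory and manual disambiguation, since forms like ‘may’ and ‘could’ are not always hedges.

```python
import re

# Illustrative list drawn from the examples in the text (not exhaustive).
HEDGES = ["might", "may", "could", "i think", "i feel", "i suppose",
          "perhaps", "maybe", "it is possible"]

def count_hedges(text):
    """Count occurrences of the listed hedging expressions,
    matching on word boundaries so 'may' does not match 'maybe'."""
    lowered = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(h) + r"\b", lowered))
               for h in HEDGES)
```

Dividing the count by the word total (and scaling by 50) would give the per-50-words density Hyland refers to.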
Attitude markers express the writer’s affective values and emphasize the proposi-
tional content, but do not show commitment to it. These include words and
phrases like ‘unfortunately’ or ‘most importantly’. They can perform the functions
of expressing surprise, concession, agreement, disagreement and so on.
Intaraprawat and Steffensen (1995) used all the categories described above to in-
vestigate differences between good and poor ESL essays. They found that good
students used twice as many hedges, attitude markers and attributors, more than
double the number of emphatics (boosters) and three times as many commentar-
ies.
Apart from hedges, boosters, attributors, attitude markers and commentaries, writ-
ers can also express reader-writer interaction by showing writer identity in their
writing. As Hyland (2002a) suggests, academic writing is not just about convey-
ing an ideational ‘content’, it is also about the representation of self. Ivanic (1998;
Ivanic & Weldon, 1999) identifies three aspects of identity interacting in writing.
Firstly, there is the autobiographical self, which is influenced by the writer’s life
history. Then there is the discoursal self, which represents the image or ‘voice’ the
writer projects in a text. Finally, there is the authorial self, which is the extent to
which a writer intrudes into a text and claims responsibility for its content. This is
achieved through ‘stance’. For the purpose of this study only the third type of
identity will be discussed here. Academic writing is a site in which social posi-
tioning is constructed. The academy’s emphasis on analysis and interpretation
means that students must position themselves in relation to the material they dis-
cuss, finding a way to express their own arguments (Hyland, 2002a). Writers are
therefore required to establish a stance towards their propositions and to get be-
hind their words. The problem with identity, however, is that expectations surrounding it conflict. On
the one hand, an impersonal style is seen as a key feature of academic writing, as
it symbolizes the idea that academic research is objective and empirical. However,
textbooks encourage writers to make their own voice clear through the first per-
son. This constitutes a problem for L2 writers. Hyland (2002b) argues that L2
writers are often told not to use ‘I’ or ‘in my opinion’ in their academic writing. In
his investigation on the use of the first person in L1 expert and L2 writing, he
found that professional writers are four times more likely to use the first person
than L2 student writers (Hyland, 2002a).
Hyland (2002b) argues that this underuse of first person pronouns in L2 writing
inevitably results in a loss of voice. Contrary to Hyland’s (2002a; 2002b) findings,
Shaw and Liu (1998) showed that as L2 students’ writing develops, they move
away from using personal pronouns in their writing and make more use of passive
verbs. They therefore argue that more developed writing has less authorial refer-
ence.
If writers choose not to display writer identity, but rather want to keep a piece of
writing more impersonal, they could do this by increased use of the passive voice.
This was investigated by Banerjee and Franceschina (2006), who found that the
higher the IELTS score awarded to a writing script, the more passives the writer
had used.
Summing up, there are various devices available to writers to establish a success-
ful writer-reader relationship. Among these are hedges, boosters, attributors and
attitude markers, as well as markers of identity and the use of the passive voice,
all of which will be further pursued in the pilot study.
4.5.5 Content
Kennedy and Thorp (2002) recorded the main topics for IELTS essays produced
at three proficiency levels. However, their analysis was inconsistent in that they
did not follow the same procedures for essays at levels 4, 6 and 8. Therefore, the
results are difficult to compare. No other research was located that compared can-
didates’ performance on content over different proficiency levels without using a
rating scale. Because of the lack of discourse analytic measures of content in the
literature, a measure specific to the current study will be designed.
4.6 Conclusion
Overall, this chapter has shown that, although no adequate model or theory of
writing or writing proficiency is currently available, a taxonomy based on current
models of language development can guide the rating scale design process and
provide an underlying theoretical basis.
Table 15: List of measures to be trialed during pilot study
Construct: Measures
Accuracy: Number of error-free t-units; Number of error-free clauses; Error-free t-unit ratio; Error-free clause ratio; Errors per t-unit
Fluency: Number of words; Number of self-corrections
Complexity: Clauses per t-unit; Dependent clauses per t-unit; Dependent clauses per clause; Average word length; Lexical sophistication
Mechanics: Number of punctuation errors; Number of spelling errors; Number of capitalization errors; Paragraphing
Cohesion: Number of anaphoric pronominals; Number of linking devices; Number of lexical chains
Coherence: Categories of topical structure analysis; Metadiscoursal markers
Reader/writer interaction: Number of hedges; Number of boosters; Number of attributors; Number of attitude markers; Number of markers of writer identity; Number of instances of passive voice
Content: Measure specific to this research
I have shown that the constructs identified as important aspects of academic writ-
ing have been operationalized to varying degrees and with varying success. Table
15 above shows the eight constructs from the taxonomy in the left hand column,
whilst the column on the right presents the different discourse analytic measures
that were chosen as operationalisations of these constructs. Each discourse ana-
lytic measure will be trialed during the pilot study phase, which is described in the
following chapter.
---
Notes:

1. For a detailed description of DELNA (Diagnostic English Language Needs Assessment), refer to the methodology section.
2. For a detailed description of DELNA (Diagnostic English Language Needs Assessment), refer to the methodology section.
3. For a detailed description of the three DELNA writing tasks, refer to the methodology section.
4. The data set used was based on oral performance. It is not clear if the same results would be obtained for written performance.
5. A t-unit contains one independent clause plus any number of other clauses (including adverbial, adjectival, and nominal). The t-unit was first developed by Hunt (1965).
6. Recent developments in type/token ratio take length into account (Jarvis, 2002). These complex formulae are, however, not suitable for the context of this study. Simpler measures must therefore be calculated on the basis of equal-length word segments.
7. Interpersonal metadiscourse is described in the section on reader/writer interaction.
8. Textual metadiscourse markers were discussed in the section on coherence.
Chapter 5: METHODOLOGY – ANALYSIS OF WRITING
SCRIPTS
5.1 Design
The study reported here was implemented in two phases. At the beginning of
Phase 1, a pilot study was undertaken to select the most suitable discourse analytic
measures from those identified in the literature review. The main aim of the pilot
study was to identify discourse analytic measures which are successful in differ-
entiating between different levels of writing performance. Then, during the main
analysis, a large number of writing scripts were analysed using those discourse
analytic measures. Those measures successful in discriminating between scripts at
different proficiency levels during the main analysis were then used as the basis
for the descriptors during the development of the rating scale. The final part of
this first phase was the design of a new rating scale based on the findings of the
main analysis. The hypothesis was that this newly developed rating scale would be
more suitable for diagnostic purposes because it is theoretically-based (i.e. based
on the taxonomy described in Chapter 4), empirically-developed and therefore has
level descriptors which are more specific (rather than global) and avoid vague,
impressionistic terminology.
The second phase of the study involved the validation of the new rating scale for
diagnostic writing assessment. For this purpose, ten raters rated one hundred writ-
ing samples, first using the existing DELNA (Diagnostic English Language Needs
Assessment) rating scale and then the same ten raters rated the same one hundred
scripts using the new rating scale. The rating results from these two scales were
then compared. To elicit the raters’ opinions about the efficacy of the two scales, a
questionnaire was administered and a subset of the raters was interviewed.
The two phases of this research study were characterized by two different types of
research design. The first phase, the analysis of the writing scripts, followed what
Seliger and Shohamy (1989) termed ‘descriptive research’ because it is used to
establish phenomena by explicitly describing them. Descriptive research provides
measures of frequency for different features of interest. It is important to empha-
size that descriptive research does not manipulate contexts (by for example estab-
lishing groups of participants, as is often found in experimental studies). The
groups used in the analysis were pre-existing. In this study, the groups were de-
termined according to a proficiency score based on the performance of each can-
didate. The data analysis was quantitative.
The second phase employed two rather different research designs. The
first part of Phase 2, the ratings based on the two rating scales, can also be de-
scribed as a descriptive study because the ratings of the ten raters were compared
under two conditions. It is best viewed as a descriptive study comparing the scores
obtained for two groups (Seliger & Shohamy, 1989). It should be noted that the
candidates were not randomly selected and the two types of treatment (the two
rating scales) were not administered in a counterbalanced design. If the study had
displayed these two features, it could have been considered a quasi-experimental
study (Mackey & Gass, 2005; Nunan, 1992). The data analysis was quantitative
and employed statistical procedures. The second part of Phase 2, the administra-
tion of questionnaires and interviews, involved qualitative data analysed qualita-
tively. Therefore, it can be argued that the study overall followed a mixture of
qualitative, quantitative and descriptive designs.
For reasons of readability, the method, results and discussion sections of the two
phases are kept separate. The method, results and discussion of Phase 2 can be
found later in this book.
The current chapter presents the research questions for both phases and a general
introduction to the context in which the whole study was conducted.
5.2 Research Questions
The overarching research question for the whole project was the following:
For reasons of practicality, the main research question was further divided into
three subsidiary questions, one guiding the analysis of Phase 1, and the other two
relevant to Phase 2.
Phase 1:
1. Which discourse analytic measures are successful in distinguishing between
writing samples at different DELNA writing levels?
Phase 2:
2a. Do the ratings produced using the two rating scales differ in terms of (a) the
discrimination between candidates, (b) rater spread and agreement, (c) variability
in the ratings, (d) rating scale properties and (e) what the different traits measure?
2b. What are raters’ perceptions of the two different rating scales for writing?
Although it was optional for most years since 2001, since 2007 the assessment has
been a requirement for all first year undergraduate students.
DELNA consists of two parts: screening and diagnosis. The screening section in-
cludes two components: vocabulary and speed reading. It is conducted online and
takes 30 minutes. The diagnosis section, which takes two hours and is conducted
by the pen and paper method, comprises sub-tests of reading and listening (devel-
oped and validated at the University of Melbourne) and an expository writing
task, which requires students to describe a graph or information in a table and then
interpret the data. The writing component, which is the focus of this study, was
developed in-house and is scored analytically on nine 6-point scales ranging from
4 (“at high risk of failure due to limited academic English”) to 9 (“highly compe-
tent academic writer”), with accompanying level descriptors describing the nature
of the writer’s performance against each of the analytic criteria. The complete
DELNA rating scale was presented in Table 11 in the previous chapter.
Based on their DELNA results, students are advised to attend suitable courses.
EAL students might be advised to seek help in the English Language Self-access
Centre (ELSAC) and ESB students might be advised to seek writing help in the
Student Learning Centre (SLC) which provides similar assistance to writing labs
found at other universities.
While DELNA can be considered a low-stakes test in the sense that it is used for
diagnosis rather than selection purposes, its utility is dependent on the accuracy of
test scores in diagnosing students’ language needs. The writing task is therefore
assessed twice by separate raters and concerted training efforts have been made to
enhance the reliability of scoring (Elder, Barkhuizen, Knoch, & von Randow,
2007; Elder et al., 2005; Knoch, Read, & von Randow, 2007).
The DELNA raters are all experienced teachers of English and/or English as a
second language. All raters have high levels of English language proficiency al-
though not all are native-speakers (NS) of English. Some raters are certified
IELTS (International English Language Testing System) examiners whereas oth-
ers have gained experience of writing assessment in other contexts. All raters take
part in regular training sessions which are conducted throughout the year both
online and face-to-face (Elder et al., 2007; Elder et al., 2005; Knoch et al., 2007).
At the time of the study, five DELNA writing prompts were in use, all of which
follow a similar three-part structure. Students are required to first describe a graph
or table of information presented to them. This graph or table consists of some
simple statistics requiring no specialist knowledge. Students are then asked to in-
terpret this information, suggesting reasons for any trends observed. In the final
part, students are required to either compare this information with the situation in
their own country, or suggest ideas on how this situation could be changed or dis-
cuss how it will impact on the country.
The writing task has a set time limit of 30 minutes. Students can, however, hand
in their writing earlier if they have finished.
The students taking DELNA are generally undergraduate students, although some
are postgraduates. More detailed background information on the students whose
writing samples were investigated in this study will be provided in the methodol-
ogy section of the main analysis of Phase 1 later in this chapter.
5.4.1 Introduction
The method section below describes the analysis of the writing scripts in more
detail. Because a number of suitable measures were identified in the review of the
literature, a pilot study was first undertaken to finalise the discourse analytic
measures to be used in the main analysis of Phase 1. Two criteria were stipulated
to ensure measures were suitable. Firstly, each measure had to differentiate be-
tween writing at the different band levels of DELNA. Secondly, a measure had to
be sufficiently simple to be transferable into a rating scale.
After gaining ethics approval to use the DELNA scripts for research purposes, the
scripts for the pilot study were selected. This involved cataloguing the scripts
available from the 2004 DELNA administration into a data base. All scripts were
given a running ID number, which was recorded on the script and also entered
into the data base. Other information recorded in the data base included scores
awarded for each script by the two different raters and a number of background
variables which will be described in more detail in the methodology section of the
main study of Phase 1 later in this chapter. For the purpose of the pilot study, the
six levels used in the current rating scale were collapsed into three levels. The ra-
tionale for this was that the pilot study was conducted only on a small number of
scripts and by collapsing the levels it was hoped that the analysis would yield
clearer results. Fifteen scripts, five at each of the three resulting proficiency lev-
els, were randomly selected from the 2004 administration of the DELNA assess-
ment. The only selection criterion used for these scripts was that the two raters
had agreed on the level of the essay. The three groups of scripts will henceforth be
referred to as ‘low’ for scripts of levels 4 and 5, ‘middle’ for scripts of levels 6
and 7, and ‘high’ for scripts of levels 8 and 9.
Analysis of the pilot study was undertaken manually by the researcher. The sec-
tion below outlining the method and results of the pilot study explains the process
taken during the pilot analysis and why certain measures were further pursued or
adjusted according to the data in hand. Because of the extremely small sample
size in the pilot study, no inferential statistics were calculated and the data was not
double coded. Coding for inter-rater reliability was, however, undertaken in the
main study.
As the methodology of the pilot study is described below in detail (including
definitions for each measure), these definitions are not repeated in the description
of the methodology of the main analysis.
5.4.3.1 Accuracy:
For the purpose of this study, error was defined, following Lennon (1991), as ‘a
linguistic form or combination of forms which, in the same context and under
similar conditions of production, would, in all likelihood, not be produced by the
speakers’ native speaker counterparts’ (p. 182). A t-unit was defined following
Hunt (1965) as containing ‘one independent clause plus any number of other
clauses (including adverbial, adjectival, and nominal)’. A clause was defined as ‘a
group of words containing a subject and a verb which form part of a sentence’. An
independent clause was defined as ‘a clause that can stand alone as a sentence’.
T-unit boundaries were located at each full-stop (following Schneider and Con-
nor, 1990), as well as at boundaries between two independent clauses as a t-unit is
defined as an independent clause with all its dependent clauses. Therefore, typi-
cally, t-unit boundaries occur before co-ordinating conjunctions like and, but, or,
yet, for, nor, so. As some of the data that forms part of this study was written by
learners of English, occasionally there were problems deciding on t-unit bounda-
ries because at times either the main verb or the subject (or both) were omitted. It
was therefore decided that, to qualify as a t-unit, the independent clause needed to
have both a subject and a main verb. Only the placement of a full stop by a
student could override this rule. For example, the sequence ‘the rise in opportunity
for students.’ was coded as a t-unit even though no verb is present, because the
writer had placed a full stop at the end.
Below, a sample extract from the pilot study is reproduced (Figure 12). Errors are
marked in bold (with omissions indicated in square brackets), t-unit boundaries
are indicated with // whilst clause boundaries are indicated with a /.
<84>
The graph indicates [missing: the] average minutes per week spent on hobbies and games
by age group and sex. //
The males age between 12-24 years old spent the most time on hobbies and games.// It is
indicated approximately 255 minutes per week.// As comparison, female in the same age
group spent around 90 minutes on hobbies and games.// Males spent [missing: the] least
time on hobbies and games at 45-54 years old// but females spent [missing: the] least time
on hobbies and game at 25-34 years old.// As we can see, both sexes increase their time on
hobbies and games after 45-54 years old. //
Figure 12: Sample text for accuracy
Results of the pilot study can be seen in Table 19 below. The results are arranged
by the three different proficiency levels (low, middle, high) as described above.
Table 19 displays the means and standard deviations for each measure at each
level, low, medium and high. It becomes clear from this analysis that all the
measures were successful in distinguishing between the different levels, although
some were more successful than others. Among these measures were error-free t-
units, error-free clauses and errors/clause. The percentage of error-free t-units was
selected for the second phase of this study as this measure might be the easiest for
the raters to apply and is unaffected by the length of the script.
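Expressed as a computation, the measure is straightforward. The sketch below is purely illustrative (the function name and the example counts are invented for this sketch, not taken from the study):

```python
def error_free_tunit_percentage(error_free_tunits: int, total_tunits: int) -> float:
    """Percentage of error-free t-units in a script.

    Dividing by the total number of t-units is what makes the
    measure independent of script length.
    """
    if total_tunits == 0:
        raise ValueError("script contains no t-units")
    return 100.0 * error_free_tunits / total_tunits

# Hypothetical counts for a 'middle'-level script:
# 6 of 19 t-units contain no error.
middle_script = error_free_tunit_percentage(6, 19)
```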
Table 19: Descriptive statistics – accuracy

Accuracy                     Low            Middle         High
                             Mean    SD     Mean    SD     Mean    SD
Error-free t-units           1.40    1.14   6.40    2.30   15.60   1.82
Error-free clauses           5.67    1.75   13.33   4.32   30.67   4.84
Error-free t-units/t-units   0.08    0.04   0.32    0.10   0.84    0.11
Error-free clauses/clauses   0.23    0.05   0.41    0.10   0.95    0.03
Errors/t-units               2.21    0.18   1.43    0.26   0.07    0.16
Errors/clause                1.36    0.20   0.75    0.14   0.03    0.01
5.4.3.2 Fluency:
Fluency was divided into two separate aspects: writing rate and repair fluency ac-
cording to the findings of the literature review. Writing rate (temporal fluency)
was operationalised as the number of words. This measure was possible because
the essays were written under a time limit and these conditions were the same for
all students taking the assessment. It is, however, possible that some students did
not utilise the whole time available; this measure therefore needs to be interpreted
with some caution.
Self-correction: any single instance of self-correction. This can be crossed-out letters or
words or a longer uninterrupted stretch of writing, even one as long as a paragraph. An in-
sertion also counts as one, no matter how long it is. If there is an insertion and a deletion in
the same place, this counts as two.
Number of words in self-corrections: all the words (or individual free-standing attempts at
words) that have been deleted, plus the number of words inserted.
If there is a deletion as part of an insertion, or an insertion as part of a deletion, it is counted as
part of the larger correction in the number of words, but not counted separately.
If a letter is written over by another letter, this is counted as one self-correction, not two.
A deletion that ranges over two sentences or two paragraphs is counted as one.
Scripts where it is apparent that a correction has been rubbed out are marked as ‘pencil’ and ex-
cluded from any further analysis, as the exact number of insertions or deletions cannot be estab-
lished.
It was furthermore of interest whether, apart from the number of self-corrections,
there was any difference in the average length of each self-correction produced by
the writers at different levels.
The results for the analysis of fluency can be found in Table 18 above. It is clear
from the table that the number of words and the number of self-corrections were
successful measures, whilst the average length of self-correction was not. There-
fore, only the number of words and the number of self-corrections were used in
the main analysis.
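The repair fluency measures described above reduce to simple counts. The following sketch is illustrative only (the function name is invented, and nested corrections are assumed to have already been folded into the larger correction, per the coding rules):

```python
def self_correction_stats(correction_lengths):
    """Summarise repair fluency for one script.

    `correction_lengths` holds one entry per self-correction
    (deletion or insertion), giving the number of words involved.
    Returns (number of self-corrections, average length in words).
    """
    n = len(correction_lengths)
    avg = sum(correction_lengths) / n if n else 0.0
    return n, avg

# Hypothetical script with three self-corrections of 2, 1 and 5 words:
count, average_length = self_correction_stats([2, 1, 5])
```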
5.4.3.3 Grammatical complexity:
The same definitions of clauses and independent clauses were used as in the sec-
tion on accuracy. A dependent clause was defined as ‘a clause that cannot stand
on its own in the sense that it depends on another clause for its meaning’.
Table 19 above shows that all three measures distinguished between the three
groups of writing scripts, although there was considerable overlap. Because both
clauses per t-unit and dependent clauses per t-unit ultimately measure the same
construct, only the measure ‘clauses per t-unit’ was used in the main analysis.
5.4.3.4 Lexical complexity:
In this case, not all measures were equally successful in differentiating between the
different levels of data. All measures except ‘word types per total words’ were
able to differentiate between the levels. The variables used for the main analysis
were ‘the average word length’, ‘the number of sophisticated lexical words over
the total number of lexical words’ and ‘the percentage words from the Academic
Word List’.
5.4.3.5 Mechanics:
To measure mechanics, accuracy of punctuation, spelling, capitalisation and para-
graphing was assessed. Punctuation errors were defined as ‘errors in the placing
of full stops’. Commas were not included as accurate comma use is hard to opera-
tionalise. Other punctuation marks were not included as they were used only
rarely. Full stop mistakes are indicated by a / (slash) in the example (Figure 13)
below.
There are many factors that may have impacted on these trends,/firstly there was a change
of laws as the Australian government decided to discontinue New Zealand citizens from ob-
taining Australian benefits,/ this prevented many low-socio-economic families from migrat-
ing to Australia.
Figure 13: Sample text with punctuation mistakes
Spelling errors were defined as ‘any errors in spelling’. The example below (Fig-
ure 14) has the spelling mistake highlighted.
And the reason for a drop in 15-64 is the job oppotunities in New Zealand has a significant
decrease
Figure 14: Sample text with spelling error
Capitalisation errors were defined as (a) failure to use a capital letter for a noun
where it is required in English or (b) an inappropriate use of a capital letter. The
following example sentence (Figure 15) has all errors in capitalisation marked in
bold.
The trend of weekly time spent on hobbies and games by males and females of Third world
countries might be different to that of New Zealand, Australia and European Countries.
Figure 15: Sample text with capitalisation errors
For paragraphing, it was decided not to adopt Kennedy and Thorp’s (2002) sys-
tem of simply counting paragraphs produced, as this did not return very meaning-
ful results. Instead, a new measure was developed. It was assumed that because of
the nature of the task, a five-paragraph model could be expected. Because each
task is divided into three main sections, the writers should ideally produce a para-
graph on each of these sections as well as an introduction and conclusion. This
means that paragraphing was measured very mechanically. The maximum number
of points a writer could score in this section was five, one point for each para-
graph. If students further divided any of these paragraphs, that was still only
counted as one (i.e. if a writer produced three paragraphs as part of the interpreta-
tion section, that was scored only as one point, not as three). If a writer connected
for example the introduction and the data description into one, this was scored as
one, not as two, because only one paragraph was produced. If it was logical, writ-
ers could also have body paragraphs that described a part of the data and then
gave the reasons for that piece of data, and then a separate paragraph for the next
data and reasons, etc., but no more than two were counted. Also, if one part of the
question clearly was not answered, then the writer would not be able to score full
points. Below (see Table 21) are some examples of how students divided their
texts and how they were scored (/ indicates a paragraph break). It should be noted
that this was a very mechanical way of scoring and that no regard was taken of
organisation within paragraphs, which was partly covered by coherence.
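The scoring logic above can be sketched as follows. This is an illustrative reconstruction, not the study's actual coding instrument: the section labels and the set-based annotation of which sections a paragraph covers are invented, and the allowance for up to two data-plus-reasons body paragraphs is omitted for brevity.

```python
EXPECTED_SECTIONS = ("introduction", "section 1", "section 2",
                     "section 3", "conclusion")

def paragraphing_score(paragraphs):
    """Score paragraphing on a 0-5 scale, one point per expected paragraph.

    `paragraphs` holds one entry per paragraph the writer produced;
    each entry is the set of expected sections that paragraph covers.

    - several paragraphs on the same section earn one point in total;
    - a paragraph merging two sections earns one point, not two;
    - a section left unanswered earns nothing.
    """
    score = 0
    covered = set()
    for para in paragraphs:
        newly_covered = (set(para) & set(EXPECTED_SECTIONS)) - covered
        if newly_covered:
            score += 1  # each produced paragraph earns at most one point
            covered |= newly_covered
    return min(score, len(EXPECTED_SECTIONS))
```

For example, a script that merges the introduction and the data description into one paragraph but covers everything else separately scores four, not five.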
The results of the analysis of mechanics can be found in Table 22 below. The fig-
ures for punctuation, spelling and capitalisation indicate the average number of
errors per essay, whilst the scores under paragraphing denote the analysis of para-
graphing as described above.
Table 22 shows that whilst punctuation and spelling mistakes decreased as the
writing level increased, the same was not the case for capitalisation. In the case of
paragraphing, students of higher writing ability used more paragraphs than lower
level writers. However, there was much overlap. Punctuation, spelling and para-
graphing were analysed in the main study.
Table 22: Descriptive statistics – mechanics

Mechanics        Low            Middle         High
                 Mean    SD     Mean    SD     Mean    SD
Punctuation      2.30    2.07   2.00    1.40   0.00    0.00
Spelling         8.17    5.70   3.80    2.56   0.33    0.52
Capitalisation   1.00    1.00   2.20    1.92   0.00    0.00
Paragraphing     2.00    1.00   3.20    4.50   4.20    0.84

5.4.3.6 Coherence:
Schneider and Connor (1990) showed that topical structure analysis was able to
differentiate between writing at three levels of the Test of Written English (as was
described in the literature review). In parallel progression, the topic of a t-unit is
identical to the topic of the preceding t-unit. In sequential progression, the topic of
a t-unit relates back to the comment of the previous t-unit. In extended parallel
progression, the topic of a t-unit is identical to a topic of a t-unit before the imme-
diately preceding t-unit. As part of their discussion, Schneider and Connor sug-
gested three subcategories of sequential progression. The first they termed ‘di-
rectly related sequential progression’. This includes (a) the comment of the previ-
ous t-unit becoming the new topic, (b) word derivations (e.g. science, scientist)
and (c) part-whole relations (e.g. these groups, housewives, children, and old peo-
ple). The second subcategory was termed ‘indirectly related sequential topics’
which include related semantic sets (e.g. scientists and the invention of the radio).
The final subcategory was ‘unrelated sequential topics’ where the topic does not
relate back to the previous t-unit. An initial analysis using these categories (i.e.
parallel progression, the three subcategories of sequential progression and ex-
tended parallel progression) showed that for the current data, this differentiation
only partially worked (see Table 23 below). The table expresses in percent-
ages the extent to which each type of progression was used in each writing script.
Table 23 above shows that as the level of the essays increased, students made use
of less parallel progression and more direct sequential progression (as was found
by Schneider and Connor). However, indirect and unrelated sequential progres-
sion did not follow a clear pattern. Very few instances of extended parallel pro-
gression were found.
A further, more detailed analysis of the category of unrelated sequential progres-
sion made it clear, however, that more categories were necessary. For example, it
was found that especially in the higher level essays, a large percentage of the t-
units found unrelated in the above analysis were in fact perfectly cohesive because
the writer introduced the topic at the beginning of a paragraph or used a linking
device to create coherence. In Schneider and Connor’s system, cases like these
were not recognised as coherent because they did not conform to the above cate-
gories. The analysis revealed, however, that more skilful writers use linking de-
vices or paragraph introductions quite commonly. For the final analysis, both
these categories were analysed together in one category called ‘superstructure’.
Superstructure creates coherence through a linking device or a paragraph introduction
instead of topical progression.
Another category created after the more detailed analysis was the category of co-
herence breaks. In this case, the writer attempts coherence but fails. This might be
caused by either an incorrect linking device or an erroneously used pronominal
reference.
Apart from the two new categories created for this analysis, two other categories
of topical structure analysis were adapted from the literature. Firstly, indirect se-
quential progression was extended to indirect progression, to include cases in
which the topic of a t-unit indirectly links back to the previous topic. Similarly,
extended parallel progression was changed to extended progression to include an
extended link back to an earlier comment. Table 24 below shows all categories of
topical structure used in the pilot study. Definitions and examples are also sup-
plied.
1. Parallel progression
Topics of successive sentences are the same (or synonyms)
<a,b> <a,c>
Maori and PI males are just as active as the rest of NZ. They also have other interests.
2. Direct sequential progression
The comment of the previous sentence becomes the topic of the following sentence
<a,b> <b,c>
The graph showing the average minutes per week spent on hobbies and games by age group
and sex, shows many differences in the time spent by females and males in NZ on hobbies and
games. These differences include on age factor.
3. Indirect progression
The topic or comment of the previous sentence becomes the topic of the following sentence.
The topic/or comment are only indirectly related (by inference, e.g. related semantic sets)
<a,b> <indirect a, c> or <a,b> <indirect b, c>
The main reasons for the increase in the number of immigrates is the development of some
third-world countries. e.g. China. People in those countries has got that amount of money to
support themselves living in a foreign country.
4. Superstructure
Coherence is created by a linking device instead of topic progression
<a,b> <linking device, c,d>
Reasons may be the advance in transportation and the promotion of New Zealand's natural
environment and "green image". For example, the filming of "The Lord of the rings"
brought more tourist to explore the beautiful nature of NZ.
5. Extended progression
The topic or comment before the previous sentence becomes the topic of the new sentence
<a,b> ... <a,c> or <a,b> ... <b,c>
The first line graph shows New Zealanders arriving in and departing from New Zealand be-
tween 2000 and 2002. The horizontal axis shows the times and the vertical axis shows the
number of passengers which are New Zealanders. The number of New Zealanders leaving
and arriving have increased slowly from 2000 to 2002.
6. Coherence break
Attempt at coherence fails because of an error
<a,b> <failed attempts at a or b or linker, c>
The reasons for the change on the graph. It’s all depends on their personal attitude.
7. Unrelated progression
The topic of a sentence is not related to the topic or comment in the previous sentence
<a,b> <c,d>
The increase in tourist arrivers has a direct affect to New Zealand economy in recent years.
The government reveals that unemployment rate is down to 4% which is a great news to all
New Zealander’s.
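The per-script percentages reported for these categories follow directly from the hand-coded labels. The sketch below is illustrative only (the label strings and function name are invented; the coding itself was done manually):

```python
from collections import Counter

CATEGORIES = ("parallel", "direct sequential", "indirect", "superstructure",
              "extended", "coherence break", "unrelated")

def progression_profile(codes):
    """Percentage of each progression type in one script.

    `codes` holds one hand-assigned category label per t-unit link;
    the percentages sum to 100 (up to rounding).
    """
    counts = Counter(codes)
    total = len(codes)
    return {cat: 100.0 * counts[cat] / total for cat in CATEGORIES}
```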
Table 25 below presents the results of the pilot study. The mean scores in the table
are the mean percentages of each category found in essays at that level.
The table shows that as students progressed, they used less parallel progression,
more direct sequential progression, more indirect progression and more super-
structure. Higher level students produced fewer coherence breaks and less unre-
lated progression. Extended progression showed no clear trend over the different
levels of writing. All categories were, however, included in the main analysis of
the data as, to calculate percentage of usage, all types of progression were re-
quired.
5.4.3.7 Cohesion:
But the old people are emmigrating to the green countries like Australia or New Zealand.
Because they need a better environment to live in for the rest of their life.
Figure 16: Anaphoric pronominal
Very few instances of ellipsis and substitution were found in the data, and these
measures were therefore excluded from the main analysis.
Public can also gain better nutrious products. Therefore the life span increases over time.
As time goes by, more and more elders would stay at home and could not devote themselves
to the society. Less young people could actually work for NZ society and might make NZ’s
economy be worse and non-competitive.
Furthermore, the population trends in NZ are more likely as European countries which pro-
vide sufficient medical facilities, many nutrious products and better education. However,
many countries, such as Africa or india are quite different from NZ with many young chil-
dren in one family.
Figure 17: Linking devices
Lexical chains were defined as ‘a group of lexical items which are related to each
other by synonymy, antonymy, meronymy, hyponymy or repetition’. In the exam-
ple below (Figure 18), a complete text is reproduced. Lexical chains that weave
through the text are indicated in superscript and bold writing. The lexical chain
indicated with number one relates to the different age groups mentioned in the
data. The lexical chain indicated by a two in superscript is made up of lexical
items that describe an increase. The third lexical chain (indicated with a three) re-
lates to health and medicine, whilst the last lexical chain (indicated by a zero) re-
lates to work and the economy.
The table 1 shows that the age group 15-64 years old¹ occupies the greatest portion among
the three groups¹ from year 1996 to 2051. The age group above 65 years old¹ has the
smalles portion compared with the other two groups¹. However, the percentage of age
group above 65 years old¹ keeps increasing² while the percentage of the other two age
groups¹ increase². Furthermore, The population¹ is growing² an the average age is also in-
creasing² from year 1996 to 2051. There are two possible reasons for the increasing² in
Population¹ over time. One is the modern medical technology³. People¹ could access to the
medical facilities³ which can provide better medical facilities³ which can provide better
medical services³ and improve public’s¹ health³. Public¹ can also gain better nutrious
products³.
Therefore the life span increases² over time. The other reason is that better education makes
people¹ know how to keep a healthy³ life for themselves.
In addition, as time goes by, more and more elders¹ would stay at homeº and could not de-
vote themselves to the society. Less young people¹ could actually workº for NZ society and
might make NZ’s economyº be worse and non-competitiveº.
Furthermore, the population¹ trends in NZ are more likely as European countries which
provide sufficient medical facilities³, many nutrious products³ and better education. How-
ever, many countries, such as Africa or india are quite different from NZ with many young
children¹ in one family.
Figure 18: Example of lexical chains
Table 26 lists the findings of the cohesion analysis. It shows that higher level
writers used more anaphoric pronominals and fewer linking devices. Higher level
writers also used more lexical chains.
For the main study, it was decided to use the number of anaphoric pronominals
and the number of linking devices. Measuring the number of lexical chains was
found to be very time-consuming. Also, because it is a high-inference measure and
rater reliability would be hard to achieve, it was deemed unsuitable for both the
main analysis and the rating scale.
5.4.3.8 Reader/writer interaction:
Instances of hedges can be found in the example below (Figure 19).
The leap by 12% in this range for 2051 will likely impact a) the workforce: costs to pay for
the elderly may be higher; more + more of the population approaching 65+ + after may
choose to stay in the workforce longer.
Figure 19: Hedges
Boosters were defined as ‘ways in which writers emphasise their assertions’. In-
stances of boosters can be found in the example below (Figure 20).
In New Zealand, the population trends represented unsignificantly from the past to present
time. But there is a clearly change for the population trends in future.
Figure 20: Boosters
Finally, an instance of the passive voice can be seen in the following example
(Figure 21).
This big progress could have been achieved by investing more in promoting
accurate driving habit, such as driving at safe speed, fasterning seat belt and
so on.
Figure 21: Passive voice
The results for the analysis of reader/writer interaction can be found in Table 27
below. The table shows that as students’ writing ability increased, they used more
hedges and fewer boosters, and writers at the highest level made use of the pas-
sive voice more than the lower two levels. Although very few instances of writer
identity were found in the sample used for the pilot study, it was decided that this
measure would be pursued in the main analysis in order to see, first, if a relation-
ship could be found between the use of the passive voice and markers of writer
identity and also because it is very easy to analyse with the help of a concor-
dancing program. Hedges, boosters, markers of writer identity and passive voice
were included in the main study.
5.4.3.9 Content:
The scripts written by the DELNA raters were deemed to be model answers.
These model answers were then analysed in three stages in terms of their content.
Firstly, the content of the data description section was analysed. Here, the types
of information produced by most raters in their task answers were recorded. In-
formation from the prompt which was usually summarised, or not mentioned at all
in the answers, was also recorded in this analysis. The same was done for the
other two sections in the writing task, the interpretation of data and Part three in
which writers are asked to either discuss any future possible developments or de-
scribe the situation in their own country. In these two parts each proposition made
by the model answers was noted down.
After the model answers were examined, a scoring system was developed as fol-
lows: For section one (data description), each trend described correctly was given
one mark, and each trend described by the appropriate figures was given another
point. For sections two (interpretation) and three, each proposition was given one
point. For sections two and three, writers were also given additional points for
supporting ideas.
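The scoring system can be summarised as simple addition per section. The sketch below is illustrative (the function and parameter names are invented for this sketch):

```python
def content_scores(trends_correct, trends_with_figures,
                   propositions_part2, support_part2,
                   propositions_part3, support_part3):
    """Content scores for the three sections of the task.

    Section 1 (data description): one mark per correctly described
    trend, plus one more for each trend backed by the appropriate
    figures. Sections 2 (interpretation) and 3: one mark per
    proposition, plus one per supporting idea.
    """
    section1 = trends_correct + trends_with_figures
    section2 = propositions_part2 + support_part2
    section3 = propositions_part3 + support_part3
    return section1, section2, section3

# Hypothetical script: 3 trends described, 2 with figures;
# 4 propositions and 1 supporting idea in part 2; 2 and 1 in part 3.
scores = content_scores(3, 2, 4, 1, 2, 1)
```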
Table 28 above shows the findings for the pilot study. From the table
it can be seen that higher level writers described more of the data provided and
that they also provided more ideas and supporting arguments in the second and
third parts of the essay.
Cohesion                    No. of anaphoric pronominal references
                            No. of connectors
Reader/Writer Interaction   No. of hedges
                            No. of boosters
                            No. of markers of writer identity
                            No. of passive voice verbs
Content                     % of data described correctly
                            No. of propositions in part 2 of task
                            No. of propositions in part 3 of task
All these measures were seen as useful for the main analysis. Because the
amounts of data provided by the different tasks varied slightly, it was decided to
convert the score for the data description into a percentage score which represents
the amount of data described out of the total data that could be described.
Based on the pilot study reported above, the measures in Table 29 were chosen for
the main study.
The following section will briefly describe the writing scripts collected as part of
the 2004 administration of the DELNA assessment. Of the just over two thousand
scripts, 601 were randomly chosen for the main analysis.
5.5.1 Instruments:
Five prompts were used in the administration of DELNA in 2004. Table 30 below
illustrates the distribution across scripts in the sample. As mentioned previously,
scripts on prompt five were excluded based on a FACETS analysis (in which
prompt was specified as a facet), which showed that it was marginally more diffi-
cult than the others.
The length of the scripts ranged from 47 to 628 words, with a mean of 270 words.
Deletions were not part of the word count. All scripts were originally written by
hand and then typed for the analysis.
Table 31 below shows the distribution of final scores awarded to the writing
scripts. This is based on the averaged final score from both raters. It can be seen
that no scripts were awarded a nine overall by both raters.
5.5.2 Participants
5.5.2.1 The writers:
Several background variables were available for the participants, because DELNA
students routinely fill in a background information sheet when booking their as-
sessment. Here, gender, age group and L1 of the students in the sample are re-
ported.
Table 32 below shows that there were somewhat more females in the sample
overall.
Table 33 below shows that most students fell into the under-20 category. Very
few writing scripts in the sample were produced by writers over 41.
The L1 of the students was also noted as part of the self-report questionnaire. Ta-
ble 34 below shows that the two largest L1 groups were students speaking an East
Asian language as L1 (41%), closely followed by students with English as their
first language (36%). Other L1s included in the sample were European languages
other than English (9%), Pacific Island languages (4%), languages from Paki-
stan/India and Sri Lanka (4%) and others (3%). A further 3% of students did not
specify their L1.
Finally, the distribution of the final average writing
mark in relation to the test takers’ L1 was calculated. Table 35 shows that almost
all students scoring an eight overall were native speakers of English, while the
largest number scoring lower marks (fours or fives) were from Asian back-
grounds. Test takers that did not specify their language background were not in-
cluded in this table.
5.5.2.2 The raters:
Very little specific information was available about the raters of the 601 scripts
during the 2004 administration. However, as mentioned earlier, all DELNA raters
are experienced teachers of either ESOL or English, a large number have rating
experience outside the context of DELNA (for example in the context of IELTS)
and all have postgraduate qualifications. More background details on the partici-
pating raters in Phase 2 of the study will be reported in Chapter 8.
5.5.3 Procedures
The 601 writing scripts randomly selected for the purpose of this study were col-
lected as part of the normal administration of the DELNA writing component over
the course of the academic year 2004. All scripts were rated by two raters and, in
case of discrepancies of more than two band scores, a third rater was consulted.
As part of the DELNA administration, a background information sheet is rou-
tinely collected from each student. Several categories on this background informa-
tion sheet were entered into a database (see section on data entry).
Data were entered into a Microsoft Access database, which included a random ID
number for each script, the students’ ID number to identify the script, the task
(prompt) number, the score awarded to the scripts by the two raters on the three
different categories in the analytic scale (fluency, content, form) as well as any
relevant background information about the students. The variables entered from
the background information sheet were as follows: country of birth, gender, age
group, L1, home language, time in NZ, time in other English speaking country,
marks on other relevant English exams and enrolment at the University of Auck-
land at time of sitting the assessment (i.e. first, second or third year). The scores
awarded on each category of the analytic scale (i.e. fluency, content, form), by the
two (or three) raters were then averaged (where the average was not a whole band
score, it was rounded down) to arrive at a final score for each script in each category.
An overall writing score was also calculated for each script. This was based on the
average of the mean scores for each of the three categories of fluency, content and
form. The overall score was rounded down if .333 and up if .667.
5.5.3.3 Data analysis:
5.5.3.3.1 Accuracy:
As mentioned in the pilot study, the measure chosen for accuracy was the per-
centage of error-free t-units. This therefore involved identifying both t-unit
boundaries and errors. As these variables cannot be coded with the aid of com-
puter programs (Sylviane Granger, personal communication), both had to be
coded manually. To save time, t-units were coded in combination with clause
boundaries (see grammatical complexity) and errors were coded in combination
with spelling mistakes and punctuation mistakes (see mechanics).
After coding t-unit boundaries and errors, all error-free t-units were entered into
an SPSS (Statistical Package for the Social Sciences) spreadsheet. To make the
variable more meaningful, the percentage of error-free t-units was calculated by
dividing the error-free t-units by the total number of t-units. A second coder was
then involved to ensure inter-rater reliability by double-coding a subset of the
whole sample (50 scripts). A Pearson correlation coefficient was calculated using
SPSS.
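The correlation computed in SPSS is the standard product-moment coefficient; a minimal sketch of the same computation (illustrative only, with invented names) is:

```python
from math import sqrt

def pearson_r(coder1, coder2):
    """Pearson correlation between two coders' values on the
    double-coded subset (here, percentages of error-free t-units)."""
    n = len(coder1)
    m1 = sum(coder1) / n
    m2 = sum(coder2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(coder1, coder2))
    s1 = sqrt(sum((a - m1) ** 2 for a in coder1))
    s2 = sqrt(sum((b - m2) ** 2 for b in coder2))
    return cov / (s1 * s2)
```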
5.5.3.3.2 Fluency:
Temporal fluency was operationalised by the number of words written. This was
established using a Perl program specifically produced for this task. The output of
the Perl program is composed of the script number from 0 to 601 in one column
and the number of words in the script in the adjacent column. The output is in
TextPad (a freely downloadable program for Windows) and this can then easily be
transferred into Excel or SPSS spreadsheets. The reason a Perl program was
chosen for this task is that, instead of having to go through the laborious task of
checking the number of words in each individual script through the help of the
Microsoft Word Tools menu, Perl performs the analysis within seconds. Because
this variable was analysed by a computer program, double rating was unneces-
sary. However, it should be mentioned that as part of the design process of the
Perl program, a number of spot checks were carried out to ensure that the program
was working in the way required.
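The counting logic of such a script is straightforward; a rough Python equivalent (illustrative only; the study's own program was written in Perl and its exact tokenisation rules are not shown here) might look like this:

```python
# Count the words in each script, roughly mirroring the per-script word-count
# output described above (script number in one column, word count in the next).
import re

def count_words(text):
    """Count word tokens: runs of letters, optionally with an apostrophe."""
    return len(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text))

# Hypothetical mini-corpus keyed by script number:
scripts = {1: "The graph shows a clear decline in accidents.",
           2: "Traffic accidents, it's said, have declined."}
for script_id, text in sorted(scripts.items()):
    print(script_id, count_words(text))
```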
The variable chosen to analyse repair fluency was the number of self-corrections.
The self-corrections were operationalised as described in the pilot study. To en-
sure inter-rater reliability, this variable was double rated in 50 scripts and a Pear-
son correlation coefficient was calculated using SPSS.
5.5.3.3.4 Grammatical complexity:
As noted above, grammatical complexity was coded manually as the number of
clauses per t-unit, with clause boundaries identified alongside the t-unit boundaries.
5.5.3.3.5 Lexical complexity:
Lexical complexity was coded into three variables: firstly, sophisticated lexical
words per total lexical words, secondly the average length of words and finally the
number of AWL words. The variable sophisticated lexical words per total lexical
words was analysed with the help of the computer program Web VocabProfile
(Cobb, 2002), which is an adaptation of Heatley and Nation’s Range (1994).
Before the data was entered into VocabProfile, all spelling mistakes were cor-
rected. This was done because the program would not be able to recognise mis-
spelled words and would therefore move them to the Off-List word list. The ra-
tionale behind including these words in the analysis was that the writer had at-
tempted the items, but was just not able to spell them correctly. Items of vocabu-
lary that were too unclear to be corrected were excluded from the analysis.
The sophisticated lexical words were taken from the tokens of the AWL (aca-
demic word list) and the Off-List Word tokens. However, as the Off-List words
also included abbreviations and words like ‘Zealander’ from New Zealander,
this list was first scanned and then only the ‘real’ Off-List words were included in
the analysis. The Off-List words could be checked easily because each of their
tokens was listed lower down the screen. The number of sophisti-
cated lexical words was then divided by the total number of content words. As the
number of content words was not stated in the output of VocabProfile, the value
for lexical density had to be used. Lexical density is defined as the number of con-
tent words divided by the total number of words. Therefore, it was quite straight-
forward to arrive at the number of content words (i.e. by multiplying the value of
lexical density by the total number of words). Because the variable sophisticated
lexical words over total lexical words was analysed with the aid of the computer
program VocabProfile, no inter-rater reliability check was deemed necessary.
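The arithmetic described above is simple enough to state explicitly (an illustrative sketch with invented numbers, assuming VocabProfile-style output values; these are not figures from the study):

```python
# Recovering the number of content words from lexical density, then computing
# sophisticated lexical words per total lexical (content) words.
total_words = 280
lexical_density = 0.50          # content words / total words, from VocabProfile
awl_tokens = 22                 # Academic Word List tokens
real_offlist_tokens = 6         # Off-List tokens remaining after hand-screening

content_words = lexical_density * total_words            # 140 content words
sophisticated = awl_tokens + real_offlist_tokens         # 28 sophisticated tokens
ratio = sophisticated / content_words
print(round(ratio, 2))  # 0.2
```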
The second variable that was investigated for lexical complexity was the average
length of words. This was done completely automatically, again using a Perl
script specifically designed for the task. The Perl program was written so that it
identified the number of characters in each script, as well as the number of spaces
between words. Before this count, the Perl script disregarded all punctuation
marks (so that they were not added into the final count where they might inflate
the length of words). To arrive at the final average word length for each script, the
number of characters was divided by the number of spaces between words. As this
was done completely automatically, no inter-rater reliability check was deemed
necessary. The Perl program was however thoroughly checked for any mistakes
before it was used.
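The procedure just described can be sketched in a few lines (illustrative Python rather than the original Perl; the function name and example sentence are invented, and the division by inter-word spaces follows the description above):

```python
# Average word length as described: strip punctuation, count the characters in
# the remaining words, divide by the number of spaces between words.
import string

def average_word_length(text):
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    characters = sum(len(w) for w in words)
    spaces = len(words) - 1          # spaces between words
    return characters / spaces

print(round(average_word_length("The graph shows a clear, steady decline."), 2))
```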
Finally, the number of words from the Academic Word List was recorded in the
spreadsheet. This was also taken from the output of VocabProfile.
5.5.3.3.6 Mechanics:
The first group of variables examined for mechanics was the number of errors in
each script for spelling and punctuation. They were coded at the same time as the
rest of the errors (i.e. the types of errors analysed for accuracy). Each of these was
defined as described in the methodology section of the pilot study. A second rater
rated a subset of the data (50 scripts) and Pearson correlation coefficients were
calculated for each of the variables using SPSS.
5.5.3.3.7 Coherence:
Using the categories established in the pilot study, the scripts were coded manu-
ally. The same t-unit breaks as for accuracy were used. Inter-rater reliability was
established by having a second coder rate a subset of 50 scripts; a Pearson
correlation coefficient was calculated in SPSS.
5.5.3.3.8 Cohesion:
The variable chosen to investigate cohesion was the number of anaphoric pro-
nominals (e.g. this, that, these) used by the writer. The pronominals used in the
main analysis are listed in Appendix 1. The decision was made that instead of
hand-coding these in the 601 writing scripts, with the risk of missing some due to
human error, a concordancing program would be used to search for each of these
pronominals individually. The concordancer chosen for this task was MonoConc
Pro Concordance Software Version 2.2 (Barlow, 2002).
MonoConc not only displays the concordance lines, but also shows as much
context as is requested. This proved invaluable, because many of the words identi-
fied were not anaphoric pronominals and thus were not acting as cohesive devices
as described by Halliday and Hasan (1976). Although this method of data analysis
has the advantage that it saves time compared to the manual method, it still
proved time-consuming in the sense that all instances of the words in the concor-
dance needed to be checked in the top window, to eliminate all occasions where
the word was not used as a cohesive device. For example, when counting the use
of those, all instances of those as in those of us, needed to be discarded as well as
the those used in the sense of those people that I am familiar with. After pronomi-
nals that were not used as cohesive devices were discarded, the next step was to
assess if the referent referred to by the pronominal was in fact over the clause
boundary in accordance with the definition adopted for cohesive devices. This ex-
cluded a number of possessive pronominals occurring in the same clause as the
referent as for example the use of its in ... the motor vehicle crashes declined to
half its number....
Following this procedure, each pronominal was recorded and entered into an
SPSS spreadsheet next to the relevant script number. The next step was to ex-
clude all pronouns that occurred fewer than 50 times in all scripts. This was done
because it was not deemed useful to include very rare items in a rating scale.
Therefore the following words were excluded from any further analysis: here, its,
those, his, her, she and he. Then the results for each pronoun were correlated with
the final score awarded by the DELNA raters. Finally, an inter-rater reliability
check was undertaken by double-rating 50 scripts and calculating a Pearson corre-
lation coefficient.
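The frequency filter applied here can be sketched as follows (an illustrative Python fragment with invented counts; the study did its searching in MonoConc over the 601 scripts, and the threshold of 50 occurrences is the one stated above):

```python
# Exclude pronominals that occur fewer than 50 times across all scripts before
# any further analysis.
from collections import Counter

pronominals = ["this", "these", "it", "they", "them", "their", "those", "here"]
MIN_TOTAL = 50   # items rarer than this across all scripts are excluded

# Hypothetical counts summed over every script:
totals = Counter({"this": 900, "these": 410, "it": 700, "they": 650,
                  "them": 220, "their": 530, "those": 31, "here": 12})

kept = [p for p in pronominals if totals[p] >= MIN_TOTAL]
print(kept)
```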
5.5.3.3.9 Reader-writer interaction:
The variables investigated in this category were hedges, boosters, markers of
writer identity and the passive voice. The complete list of items investigated was
established based on previous research of the literature and can be found in Ap-
pendix 1. Each lexical item was investigated individually using MonoConc. Here
special care needed to be taken, so that lexical items that did not function as
hedges or boosters were excluded from the analysis. For example, in the case of
the booster certain, all uses of certain + noun needed to be excluded as this struc-
ture does not act as a boosting device. In the case of the lexical item major, all
uses of the word in conjunction with cities or axial routes, for example, needed to
be excluded because these were also not used as boosters. So for each lexical item
in Appendix 1, the whole concordancing list produced in MonoConc needed to be
thoroughly examined before each instance of that item could be entered into a
spreadsheet. Finally, all items were added together, so that a final frequency count
for each script was found for hedges, boosters and markers of writer identity. The
passive voice was initially also investigated using MonoConc. However, because
it was impossible to search for erroneous instances of the passive (i.e. unsuccess-
ful attempts), this analysis was later refined by a manual search.
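The manual screening of concordance hits can be approximated in code (a rough, illustrative sketch only: the study read each MonoConc line by hand, and the small noun list and function name here are invented to stand in for that human judgement):

```python
# Keep a hit for the booster "certain" only when it is not directly followed by
# a noun ("certain cities" etc. is not a boosting use).
import re

NOUNS = {"cities", "people", "areas", "routes", "number"}  # hypothetical list

def is_booster_use(sentence):
    m = re.search(r"\bcertain\b(?:\s+(\w+))?", sentence.lower())
    if not m:
        return False
    following = m.group(1)               # word after "certain", if any
    return following not in NOUNS       # noun-modifying uses are excluded

print(is_booster_use("I am certain that rates fell."))   # True
print(is_booster_use("Certain cities saw a decline."))   # False
```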
5.5.3.3.10 Content:
Using the scoring scheme described in the pilot study, the scripts were manually
coded. A second rater was used to ensure inter-rater reliability by scoring a subset
of 50 scripts. A Pearson correlation coefficient was calculated using SPSS.
To ascertain that any differences found between different DELNA writing levels
did not occur purely due to sampling variation, each measure in the analysis was
subjected to an Analysis of Variance (ANOVA). A number of assumptions under-
lie an ANOVA (A. Field, 2000; Wild & Seber, 2000). The first assumption relates
to independence of samples. This assumption is satisfied in this situation, as no
writing script is repeated in any of the groups (DELNA band levels) compared.
The second assumption stipulates that the sample should be normally distributed.
However, according to Wild & Seber (2000, p. 452), ANOVA is robust enough to
cope with departures from this assumption. Furthermore, because most groups in
this analysis were very large, we can rely on the central limit theorem, which
stipulates that large samples will always be approximately normally distributed.
The third assumption stipulates that the groups compared should have equal vari-
ances. This is the most important assumption relating to ANOVA. Wild & Seber
(2000) suggest that this can be tested by ensuring that the largest standard devia-
tion is no more than twice as large as the smallest standard deviation². If the vari-
ances were found to be unequal following this analysis, a Welch test (Welch’s
variance-weighted ANOVA) was used. This test is robust enough to cope with
departures from the assumption of equality of variances and performs well in
situations where group sizes are unequal. The post hoc test used for all analyses
was the Games-Howell procedure. This test is appropriate when variances are
unequal or when variances and group sizes are unequal (A. Field, 2000, p.276).
This was found to be the most appropriate test of pair-wise comparisons because
in all cases the groups were unequal (with DELNA band levels 4 and 8 having
fewer cases than band levels 5, 6 and 7).
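The decision rule described above is easy to state as code (an illustrative Python sketch with invented group data; the study ran the actual tests in SPSS):

```python
# Compare the largest and smallest group standard deviations; if the ratio
# exceeds 2, fall back to a variance-robust test (Welch) instead of an ANOVA.
import statistics

groups = {                     # hypothetical per-band measurements
    4: [3.1, 4.0, 2.8, 3.5],
    5: [4.2, 5.1, 4.8, 5.5],
    6: [5.0, 6.2, 5.8, 6.5],
}

sds = [statistics.stdev(values) for values in groups.values()]
equal_variances = max(sds) <= 2 * min(sds)
test = "ANOVA" if equal_variances else "Welch test"
print(test)
```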
Whilst pair-wise post hoc comparisons were performed for each measure, it was
not deemed important for each measure to achieve statistical significance be-
tween each adjacent level. Pair-wise comparisons between adjacent levels are
however briefly mentioned in the results chapter.
After the ANOVAs and pair-wise post hoc comparisons had been computed, it
came to my attention that a MANOVA would be more suitable for this
type of data as it would reduce the risk of Type I errors. Because the data violated some un-
derlying assumptions of inferential statistics, especially the assumptions of equal
variances, a non-parametric MANOVA was chosen. The computer program
PERMANOVA (Anderson, 2001, 2005; McArdle & Anderson, 2001) was used
for this as SPSS is unable to compute non-parametric MANOVAs. However, the
resulting significance values for each structure showed very little difference from
those computed by the ANOVAs described above, and it was therefore decided to
keep the ANOVAs in the results section as these results are more easily presented
and interpreted.
---
Notes:
1. In retrospect it might have been better to have looked at the number of items in a lexical chain.
2. This test was chosen over Levene’s test for equality of variances, as Levene’s test almost
always returns significant results in the case of large samples.
Chapter 6: Results – Analysis Of Writing Scripts
6.1 Introduction
The following chapter presents the results of Phase 1, which address the following
research question:
For each variable under investigation, two pieces of information are presented.
Firstly, side-by-side box plots showing the distribution over the different DELNA
writing proficiency levels are provided. The box of each plot portrays the middle
50% of students, while the thick black line inside the box denotes the median. The
whiskers indicate the points above and below which the highest and lowest 10%
of cases occur. Cases lying outside this area are outliers, or extreme scores. The y-
axis on which these plots are charted represents the frequency (or proportion of
usage) of the variable in question, while the x-axis represents the average DELNA
mark, ranging from 4 to 8.
The second piece of information is a table presenting the descriptive statistics for
each variable at each DELNA level. As in the pilot study, the minimum and
maximum were chosen over the range to illustrate any overlap between levels.
6.2 Accuracy
Accuracy was measured as the percentage of error-free t-units³.
Figure 22: Distribution of proportion of error-free t-units over overall sample and DELNA
sublevels
The side-by-side box plots in Figure 22 depict the distribution of the proportion of
error-free t-units over the different DELNA bands. The variable successfully dis-
tinguished the different levels, with some overlap.
Table 36 above shows the descriptive statistics for each of the five proficiency
levels. Because the equality of variance assumption was not violated in this case,
an analysis of variance (ANOVA) test was performed. The analysis of variance
revealed significant differences between the different band levels, F (4, 576) =
60.28, p = .000. The Games-Howell post hoc procedure showed statistically sig-
nificant differences between two adjacent pairs of levels, levels 5 and 6 and levels
6 and 7.
6.3 Fluency
The variable chosen for temporal fluency was the average number of words per
script.
The box plots in Figure 23 and the descriptive statistics in Table 37 both indicate
that although the average number of words generally increased as the writing level
rose, there was much overlap. There also seemed to be a ceiling effect to the vari-
able, indicating that writers at levels 6, 7 and 8 seemed to produce a very similar
number of words on average. So while there was a clear difference between the
number of words produced on average between levels 4 to 6 (although with much
overlap in the range), for levels 6 and above the variable did not successfully dis-
criminate between the writers.
Figure 23: Distribution of number of words per essay over overall sample and DELNA sub-
levels
Because the assumption of equal variances was not violated, an ANOVA was per-
formed. This analysis revealed a statistically significant difference between the
five band levels, F (4, 577) = 5.82, p = .000. The Games-Howell procedure re-
vealed that the only adjacent levels that were significantly different were levels 5
and 6.
The variable chosen for repair fluency was the number of self-corrections.
While the mean for all scripts was 14.13 self-corrections, the scripts ranged
widely. Over 50 writers made no self-corrections, while some scripts had as many
as 64.
Figure 24: Distribution of number of self-corrections over overall sample and DELNA sub-
levels
This variable also produced a large number of outliers as can be seen when the
number of self-corrections were plotted over the DELNA bands. Although there
was considerable overlap, the measure discriminated between the different
DELNA bands (see Figure 24 and Table 38 below), showing that the lower the
level of the writer, the more self-corrections were made⁴.
Table 38: Descriptive statistics - Number of self-corrections
DELNA level M SD Minimum Maximum
4 21.33 5.19 0 32
5 17.21 11.41 0 52
6 15.00 9.58 0 64
7 12.38 9.57 0 57
8 6.96 5.84 0 37
Because the assumption of equality of variances did not hold in this case, a Welch
test was performed which revealed statistically significant differences between the
different groups, F (4, 60.7) = 4.14, p = .005. However, the Games-Howell proce-
dure revealed that no immediately adjacent levels were significantly different.
6.4 Complexity
The variable chosen to analyse grammatical complexity was clauses per t-units.
An inter-rater reliability check was undertaken for the coding of both clauses and
t-units. Both showed a strong positive relationship, with the correlation coefficient
for t-units, r = .981, N = 50, p = .000, being slightly higher than that for clauses, r
= .934, N = 50, p = .000.
Figure 25: Distribution of clauses per t-units over overall sample and DELNA sublevels
The box plots in Figure 25 and the descriptive statistics in Table 39 above show
that the variable failed to differentiate between scripts at different ability levels.
This means that, in contrast to what was expected, higher level writers did not use
more complex sentences (more subordination).
Overall, very little subordination was used in the scripts as is indicated by the
mean of 1.46 for all scripts included. That is, fewer than every second t-unit in-
cluded subordination.
Because the assumption of equality of variances held in this case, an ANOVA was
performed which returned a statistically significant result, F (4, 575) = 3.08, p =
.016. The Games-Howell procedure showed that the only adjacent band level pair
that was significantly different was level 5 and 6.
Two separate variables were chosen for lexical complexity in the pilot study, the
average word length and sophisticated lexical words per total lexical words. As
part of the main analysis, the number of AWL words was also recorded, because,
forming part of the output of VocabProfile, the coding required no extra time.
Firstly, the average word length was investigated. The average word length for all
words in the whole sample was 4.78.
The box plots (Figure 26) and the table displaying the descriptive statistics (Table
40) show that the variable successfully discriminated between different levels of
writing, in that the higher the level of writing, the longer the average word.
Figure 26: Distribution of average word length over overall sample and DELNA sublevels
The second variable investigated for lexical complexity was the number of so-
phisticated lexical words per total number of lexical words.
Figure 27: Distribution of sophisticated lexical words per total lexical words over overall
sample and DELNA sublevels
Figure 27 and Table 41 show that the higher the level of writing, the more sophis-
ticated lexical words per total lexical words were used by the writers⁵.
Table 41: Descriptive statistics - Sophisticated lexical words per total lexical words
DELNA level M SD Minimum Maximum
4 .13 .05 .03 .21
5 .15 .06 .00 .30
6 .17 .07 .00 .39
7 .18 .07 .00 .37
8 .21 .07 .00 .34
Although not initially planned to be part of the analysis, the number of words in
the Academic Word List (AWL) was also recorded as part of the VocabProfile
output. As Figure 28 and Table 42 indicate, this variable differentiates well
between the different levels of writing⁶.
Figure 28: Distribution of number of AWL words over overall sample and DELNA sublevels
Because the assumption of equal variances was not satisfied in this case, a Welch
procedure was performed, which revealed statistically significant differences be-
tween the groups, F (4, 66.22) = 39.99, p = .000. The Games-Howell procedure
showed that all adjacent pairs of band levels differed statistically significantly.
6.5 Mechanics
Inter-rater reliability for the variable was investigated by having a second coder
double rate a subset of 50 scripts. A Pearson correlation coefficient showed a
strong relationship between the two counts of errors, r = .959, N = 50, p = .000.
Many scripts displayed no or very few mistakes, suggesting that this variable
might not be suitable as a measure. Over a third of all scripts displayed no spelling
errors, while the overall mean for all scripts was 3.5 spelling errors per script.
Figure 29: Distribution of number of spelling errors over overall sample and DELNA sub-
levels
The box plots present the number of spelling mistakes for each DELNA band
level. It can be seen that this variable differentiated between levels.
However, the majority of scripts, with the exception of some outliers, did not dis-
play a large number of spelling mistakes and the differences between levels 5 to 7
were very small. In contrast, there was a large difference in the mean for scripts
scored at level 4 and 5. The mean for level 4 scripts was 8.27 while the mean for
level 5 scripts was just below 4 per script. The descriptive statistics for each level
are displayed in Table 43 above. Because the assumption of equal variances did
not hold in this case, a Welch procedure was used instead of an analysis of vari-
ance. The Welch test revealed statistically significant differences, F (4, 58.46) =
6.01, p = .000. The Games-Howell procedure showed that only levels 7 and 8
were statistically significantly different from each other.
First, inter-rater reliability was established for this variable. A correlation showed
a strong relationship between the ratings of the two coders, r = .864, n = 50, p =
.000.
As with the number of spelling mistakes, this variable showed a positively skewed
distribution. For the overall sample of scripts, the average was 3.04 punctuation
errors.
Figure 30: Distribution of number of punctuation errors over overall sample and DELNA
sublevels
As with spelling, this variable also failed to differentiate between the five differ-
ent levels of writing and in this case there was very little differentiation in terms
of the mean scores of the five writing levels (Figure 30 and Table 44).
The third and final variable investigated in the category of mechanics was para-
graphing, which was measured as the number of paragraphs (of the five para-
graph model) produced.
Figure 31: Distribution of paragraphing over overall sample and DELNA sublevels
When the box plots (Figure 31) and the descriptive statistics (Table 45) for the
different DELNA proficiency levels were compared, it could be seen that writers
at level 4 produced only two of the expected paragraphs on average, whilst writers
at level 8 produced just under four. Students at levels 5, 6 and 7 had a very similar
mean (around three paragraphs) on this variable; however the box plots show a
clear differentiation between levels 5 and 6.
Table 45: Descriptive statistics - Paragraphing
DELNA level M SD Minimum Maximum
4 2.27 1.10 1 4
5 2.88 .85 1 5
6 3.09 .91 1 5
7 3.17 .91 1 5
8 3.68 .56 3 5
6.6 Coherence
The inter-rater correlation for indirect progression was below .80, which was cho-
sen as the cut-off for this study. However, because it is a high-inference variable,
it was decided that this level would be acceptable.
Next, the following hypothesis was made: Parallel progression, direct sequential
progression and super-structure would all contribute towards coherence. There-
fore, there was an expectation that these might be produced more commonly by
more proficient writers. On the other hand, unrelated progression and coherence
breaks were thought to be reasons for coherence to break down and might there-
fore be produced by less proficient writers. No clear hypothesis could be stated for
indirect progression and extended progression.
However, it was decided, instead of having pre-conceived hypotheses about what
the writers might produce at different levels, to let the data speak for itself. There-
fore, a correlation analysis was undertaken, in which the proportion of usage of
each of these categories was correlated with the final score the writers received
from the two raters. The results from the correlation confirmed some of the hy-
potheses, whilst others were refuted. The correlations (Table 47 below) showed
that higher level writers used more direct sequential progression, superstructure
and indirect progression (resulting in significant positive correlations). Categories
used more by lower level writers were parallel progression, unrelated progression
and coherence breaks (resulting in significant negative correlations). Extended
progression was used equally by lower and higher level writers and therefore re-
sulted in a correlation close to zero.
Table 47: Topical structure categories correlated with final DELNA writing score
Topical structure category Final writing score
Parallel progression -.215**
Direct sequential progression .292**
Superstructure .258**
Indirect progression .220**
Extended progression -.07
Unrelated progression -.202**
Coherence break -.246**
n = 601; **p < .01
Figure 32: Distribution of proportion of parallel progression
The Games-Howell procedure showed that levels 6 and 7 were statistically
significantly different.
Direct sequential progression was used more frequently by higher level writers. In
particular, writers at level 8 used this type of progression for nearly a third of their
sentence topics. An analysis of variance revealed statistically significant differ-
ences between the groups, F (4, 575) = 2.86, p = .023. However, the Games-
Howell procedure showed no statistically significant differences between adjacent
groups.
Figure 35: Distribution of proportion of superstructure over DELNA sublevels
The result for indirect progression was less clear, but showed an increase accord-
ing to level. An analysis of variance revealed statistically significant differences
between the different levels of writing, F (4, 574) = 5.85, p = .000. However
again, the Games-Howell procedure resulted in no statistically significant differ-
ences between any adjacent band levels.
Figure 36: Distribution of proportion of extended progression over DELNA sublevels
Figure 36 above shows that extended progression was used more frequently by
lower level writers. The distribution over levels 6, 7 and 8 was very similar. An
analysis of variance, however, revealed no statistically significant differences be-
tween the groups, F (4, 577) = 1.62, p = .168.
Unrelated progression (Figure 37 above), whilst being used in about a quarter of
all topic progressions at level 4, was only very rarely found in writing at level 8. A
Welch test revealed statistically significant differences between the groups, F (4,
576) = 6.40, p = .000. The Games-Howell procedure showed that the only border-
ing band levels that were statistically distinct from each other were levels 5 and
6.
6.7 Cohesion
First an inter-rater reliability analysis was undertaken for the anaphoric pronomi-
nals. This involved a second researcher double-coding 50 scripts. The correlation
coefficient indicated a high level of inter-rater reliability, r = .969, n = 50, p =
.000.
Each anaphoric pronominal investigated as part of cohesion (after pronominals
used less than 50 times overall were deleted) was then correlated with the final
average score, using a Pearson correlation coefficient to establish if some were
used more commonly by either low or high level writers. The results of this corre-
lational analysis can be seen in Table 48 below.
Based on the correlations, the items were divided into two groups: firstly, items
that showed a negative correlation with the overall average score (and were there-
fore used more commonly by lower level writers), and secondly those that corre-
lated positively with the average score (and were therefore more commonly used
by higher level writers). The correlation for ‘that’ was negative but not significant,
and was therefore excluded from the analysis. Table 49 below shows the two
groups.
Table 49: Anaphoric pronominals that correlate positively and negatively with the DELNA
writing score
Positive correlation Negative correlation
- these - they
- this - them
- it
- their
Then, these two groups of pronominals were plotted against the overall score (see
Figures 39 and 40 below) to provide a graphic representation of the distribution.
Table 50 below provides the relevant descriptive statistics.
Figure 39: Distribution of this, these over overall sample and DELNA sublevels
The overall distribution of this and these was slightly positively skewed as well as
peaked. It can be seen that the overall mean score was nearly three per script.
About one hundred writers in the sample did not use either of these items.
The box plots (Figure 39) for this and these show that the use of these two pro-
nominals increased as the proficiency level increased. The same can also be seen
in the table depicting the descriptive statistics (Table 50). A Welch test showed
statistically significant differences between the groups, F (4, 66.00) = 20.74, p =
.000. The Games-Howell procedure of pair-wise comparisons showed that adjoin-
ing levels 5 and 6, as well as 7 and 8, were statistically different from each other.
Next, the distribution of the group of pronominals that correlated negatively with
the overall DELNA score were investigated over the different DELNA levels. The
box plots in Figure 40 below show that with increasing proficiency level, fewer of
these pronominals were used. This is also evidenced in Table 51 below.
Figure 40: Distribution of they, them, it, their over overall sample and DELNA sublevels
The second variable for cohesion was the number of linking devices in the data.
An analysis of the number of linking devices over the different proficiency levels
(Figure 41) showed that when this variable was controlled for essay length, lower
level writers used far fewer linking devices than did higher level writers. This
variable was, however, slightly inconclusive when not controlled for length. Table
52 below shows the descriptive statistics.
Figure 41: Distribution of number of linking devices per total words over overall sample and
DELNA sublevels
Table 52: Descriptive statistics – Number of linking devices per total words
DELNA level Mean SD Minimum Maximum
4 .03 .01 .02 .06
5 .02 .01 .00 .06
6 .02 .01 .00 .06
7 .02 .01 .00 .05
8 .02 .01 .00 .04
6.8 Reader-Writer Interaction
Over half of the scripts made no use of writer identity. However, although so
many scripts did not make use of this category, the mean was just under 2.5 mark-
ers per script. This shows that a number of writers used a large number of these
markers.
Figure 42: Distribution of features of writer identity over overall sample and DELNA sub-
levels
The box plots in Figure 42 and the table of descriptive statistics (Table 53) above
show the distribution over the different DELNA levels. It is clear that this variable
did not differentiate distinctly between the different proficiency levels.
The analysis of variance showed no statistically significant difference between the
different levels of writing, F (4, 577) = 1.07, p = .368.
The second variable under investigation was the number of hedging devices. On
average, writers used just under six of these structures per script.
Figure 43: Distribution of hedging devices over overall sample and DELNA sublevels
When broken up into the different DELNA band levels, the use of hedging de-
vices can be seen to have quite clearly distinguished between different levels of
writing. This is revealed in the box plot (Figure 43 above) as well as in the table
summarising the descriptive statistics (Table 54 below). The table shows that
whilst writers at lower levels used on average about five hedging devices in their
writing, higher level writers used more than eight of these devices.
The analysis of variance revealed a statistically significant difference between the
groups, F (4, 596) = 7.39, p = .000. The Games-Howell procedure showed that
levels 5 and 6 as well as levels 7 and 8 were statistically distinct from each other.
The hedging variable was also investigated when script length was controlled.
This showed an even stronger difference between the different proficiency levels.
The final variable investigated in this category was boosters. The distribution per
band level (box plots in Figure 44) and the table indicating descriptive statistics
(Table 55) show that this category failed to distinguish between different levels of
writing because writers of all levels used on average about 2.5 boosters in their
writing.
Figure 44: Distribution of boosters over overall sample and DELNA sublevels
Finally, the use of the attempted passive voice was investigated.
Figure 45: Distribution of passives over overall sample and DELNA sublevels
The box plots (Figure 45) and the table above (Table 56) show that higher level
writers used the passive more frequently, whilst hardly any writers at level 4
attempted this structure; however, the differences between the different levels
of writing proficiency were very small on average.
Finally, it was of interest whether there was a relationship between the use of
markers of writer identity and the passive voice. It is conceivable, for instance,
that writers who use markers of writer identity (by projecting their own voice into
the text) use fewer passives. A correlation analysis was conducted which showed
a positive relationship between these two variables, r = .304, n = 583, p = .000.
This means that writers who used more passives also tended to use more markers
of writer identity.
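The correlation reported here is a standard Pearson product-moment coefficient, which can be computed directly; the per-script counts below are hypothetical, not the study's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-script counts (not the study's data):
passives = [0, 1, 1, 2, 3, 4]
identity = [1, 0, 2, 3, 3, 5]
print(round(pearson_r(passives, identity), 3))
```

The same computation underlies the inter-rater reliability figures reported in this chapter (r = .821 and r = .811), where the two samples are the two raters' codings of the same fifty scripts.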
6.9 Content
The final category investigated was content. Content was divided into three
sections, closely following the three sections of the prompts: data description,
data interpretation and Content Part 3.
Part one, the description of data, was calculated as the percentage of
information described.
Inter-rater reliability was established by having a second rater double code a sub-
set of fifty scripts. The relationship was significant, r = .821, n = 50, p = .000.
The mean for all scripts was 0.59, indicating that, on average, the writers included
just under 60% of the possible data. Whilst some writers did not attempt this
section of the task and therefore scored 0%, about 70 writers described all the
pieces of information deemed important by the expert writers and therefore scored
100%. The largest number of writers (more than 150) described 50% of the data.
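The scoring of data description can be sketched as a set overlap between the information units a writer described and those identified by the expert writers; the unit labels below are hypothetical (Note 3 explains that the scores are proportions of 1).

```python
def data_description_score(described, key_points):
    """Proportion of expert-identified information units the writer described.
    Both arguments are sets of unit labels; the labels used here are
    hypothetical, invented for this sketch."""
    return len(described & key_points) / len(key_points)

key = {"overall_trend", "peak_year", "gender_gap", "final_decline"}
writer = {"overall_trend", "peak_year"}
print(data_description_score(writer, key))  # → 0.5
```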
Figure 46: Distribution of proportion of data description over overall sample and DELNA
sublevels
The box plots in Figure 46 and Table 57 indicate that this variable, based on the
mean scores, splits the data set into clearly separate levels.
A Welch test was performed to investigate differences between the groups. The
analysis revealed statistically significant differences between the groups,
F (4, 55.58), p = .032. However, no adjacent pairs were found to be statistically
distinct by the Games-Howell procedure.
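Welch's variant of the one-way ANOVA, used here presumably because the group variances were unequal, weights each group by the precision of its mean. A minimal sketch, with invented proportions rather than the study's data:

```python
def welch_anova(groups):
    """Welch's heteroscedastic one-way ANOVA; returns (F, df1, df2)."""
    k = len(groups)
    ns = [len(g) for g in groups]
    means = [sum(g) / n for g, n in zip(groups, ns)]
    variances = [sum((x - m) ** 2 for x in g) / (n - 1)
                 for g, m, n in zip(groups, means, ns)]
    w = [n / v for n, v in zip(ns, variances)]        # precision weights
    total_w = sum(w)
    grand = sum(wi * mi for wi, mi in zip(w, means)) / total_w
    num = sum(wi * (mi - grand) ** 2 for wi, mi in zip(w, means)) / (k - 1)
    h = sum((1 - wi / total_w) ** 2 / (n - 1) for wi, n in zip(w, ns))
    f = num / (1 + 2 * (k - 2) * h / (k ** 2 - 1))
    return f, k - 1, (k ** 2 - 1) / (3 * h)

# Invented proportions of data described at three band levels:
groups = [[0.3, 0.5, 0.4, 0.6], [0.5, 0.6, 0.7, 0.5], [0.7, 0.9, 0.8, 1.0]]
f, df1, df2 = welch_anova(groups)
print(f"F({df1}, {df2:.2f}) = {f:.2f}")
```

Note the fractional second degree of freedom, which is why values like 55.58 appear in the reported results.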
The second part of each writing task, the interpretation of data, was scored in
terms of the number of reasons given for the facts described in the data.
An inter-rater reliability test established a strong correlation between the two cod-
ers, r = .811, n = 50, p = .000.
The data distribution over the five DELNA proficiency levels was investigated
using side-by-side box plots (Figure 47). The means (as seen in Table 58 below)
ranged from 1.6 to 4.5 reasons, showing a clear differentiation according to level.
A Welch test revealed statistically significant differences between the groups in-
volved, F (4, 57.26) = 5.78, p = .001. The Games-Howell procedure showed that
no adjacent band levels were statistically distinct from each other.
Figure 47: Distribution of interpretation of data over overall sample and DELNA sublevels
The final part of each prompt, Content Part 3, required the writer either to
describe how the current situation will or can be changed in the future or to
describe a similar situation in their own country. This part was again scored by
giving a point for each proposition.
When the number of propositions in Part 3 of the prompt was plotted against
the overall DELNA score (Figure 48), it became clear that this variable separated
the data well. Descriptive statistics can be seen in Table 59 below.
Figure 48: Distribution of Content Part 3 over overall sample and DELNA sublevels
6.10 Conclusion
The results in this chapter are based on the analysis of 601 writing scripts pro-
duced as part of the 2004 administration of DELNA. Each measure was plotted
against the different proficiency levels, providing a clear visual overview of the
distribution. Inferential statistics were presented for each structure. The analysis
revealed that the variables in Table 60 below successfully differentiated between
the different levels.
Table 60: Variables successful in differentiating between levels
Construct Measure
Accuracy Percentage error-free t-units
Fluency Number of self-corrections
Complexity Average word length
Sophisticated lexical words / total lexical words
Number of AWL words
Mechanics Paragraphing
Coherence Parallel progression
Direct sequential progression
Superstructure
Indirect progression
Unrelated progression
Coherence break
Cohesion Anaphoric pronominals – these, this
Linking devices – qualitative analysis
Reader/writer interaction Number of hedges
Content Percentage data supplied
Number of propositions (Part 2 and 3)
The next chapter discusses the findings presented above. Here, relevant previous
research is related to the current data. Based on the findings in this chapter, the
new rating scale is developed.
---
Notes:
1. The n-size in each histogram differs because of missing values resulting from the analysis.
2. No writing scripts scored at level 9 were included in this analysis because, of the more than two thousand scripts, only three writing samples received a score of 9 from both raters. These three scripts were excluded from any further analysis. Scripts that were scored at level 9 by only one rater were rounded down as part of the calculation of the average of the two raters.
3. Percentages are represented as proportions of 1 in the data below.
4. A further analysis showed that if the variable is controlled for the number of words per script, it discriminates even more strongly between levels. For reasons of space this analysis is not reproduced in this chapter.
5. See Note 4.
6. See Note 4.
Chapter 7: Discussion – Analysis of Writing Scripts
7.1 Introduction
The main aim of this study was to investigate whether a more detailed,
empirically developed rating scale for writing would result in more reliable and
valid rater judgements in a diagnostic context than a more intuitively developed
rating scale. During Phase 1, 601 writing scripts produced as part of a normal
operational administration of the DELNA (Diagnostic English Language Needs
Assessment) were analysed using detailed discourse analytic measures. The
methodology and results of Phase 1 were presented in Chapters 5 and 6. This
chapter presents a discussion of the first subsidiary research question.
In this chapter, key findings are summarized and discussed in relation to previous
literature. At the end of the discussion of each aspect of writing, the relevant new
trait scale is presented. The following principles were followed for the design of
the rating scales:
7.2 Accuracy
A number of measures identified in the literature were explored for the analysis of
accuracy in the pilot study. All measures trialed discriminated successfully
between the different levels. However, most of these measures were difficult to
apply in the rating process or did not account for differences in text length. It was
decided to use the percentage of error-free t-units in the main study.
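The measure itself only aggregates manual codings (identifying errors and t-unit boundaries remains a human judgement) and can be sketched as:

```python
def pct_error_free(unit_codes):
    """Percentage of error-free t-units for one script.
    `unit_codes` holds one boolean per t-unit: True = no errors.
    The error coding itself is done by hand; this only aggregates it."""
    return 100 * sum(unit_codes) / len(unit_codes)

codes = [True, True, False, True, False]  # hypothetical coding of one script
print(pct_error_free(codes))  # → 60.0
```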
This measure proved highly successful in distinguishing between the five different
band levels in the DELNA corpus although there was some overlap. The analysis
of variance showed that there were statistically significant differences between the
different levels of writing.
The measure of percentage of error-free t-units has been used repeatedly in the
literature. Wolfe-Quintero et al. (1998), in the most comprehensive review of
studies on measures of accuracy, fluency and complexity, noted the varying
success of the measure. Twelve studies found a significant relationship between
proficiency level and the percentage of error-free t-units, while eleven did not. In this
study, the measure proved successful with some overlap between levels. It is pos-
sible that the overlap was partly due to some learners’ accuracy not increasing in a
linear fashion (as was for example found by Henry, 1996).
One reason why this measure was so successful in discriminating between the dif-
ferent levels of writing ability could be the way the scripts were grouped into lev-
els (one of the limitations of this study). The scripts were classified according to
scores awarded on the basis of the existing DELNA rating scale. However, some
authors (e.g. Weigle, 2002) have shown that raters base their decisions mainly on
a holistic rating of writing scripts. This holistic score is often highly correlated
with the number of errors in a script. Raters seem to base their decisions on the
accuracy of a script, as this is a very noticeable feature of a writing script. There-
fore, it is possible that the ratings used for the groupings of the scripts in this
study were mainly based on a holistic impression of the accuracy of a script.
The rating scale for accuracy was designed so that the raters did not have to actu-
ally count each error-free t-unit. Instead, it required them to estimate the propor-
tion of error-free t-units when reading a script. It was further decided that raters
did not need to be trained to identify t-units in this data because a brief analysis of
t-unit borders showed that these coincided in over 90% of the cases with sentence
breaks.
Although the analysis of the scripts only showed five distinct levels of accuracy
(because no scripts at level 9 were included in the analysis), a sixth level was
added to the trait scale of accuracy to acknowledge completely error-free scripts.
The rating scale for accuracy is shown in Table 61.
Table 61: Rating scale - Accuracy
9  All sentences error-free
8  Nearly all sentences error-free
7  About ¾ of sentences error-free
6  About half of sentences error-free
5  About ¼ of sentences error-free
4  Nearly no or no error-free sentences
7.3 Fluency
Two types of fluency were investigated in both the pilot study and the main analy-
sis of the writing scripts: temporal and repair fluency. The measure chosen for
temporal fluency was the number of words produced within the time limit of 30
minutes. Although some doubt existed about this measure because of varying
findings of other research studies and the fact that not all students used the 30
minutes they were entitled to, this measure produced some promising results in
the pilot study. Therefore, the number of words was analysed in the main study.
Although the histogram showed large variation among the scripts in terms of the
number of words produced, this measure was not successful in distinguishing be-
tween the different proficiency levels.
Wolfe-Quintero et al. (1998), in their review of the literature, also found varying
results for this measure. Although ten studies found significant differences among
proficiency levels, seven did not. As in this study, Larsen-Freeman (1978; 1983)
and Henry (1996) found a ceiling effect around the higher levels or even a de-
crease at the advanced level. The findings of this study are also in line with Cum-
ming et al.’s (2005) investigation of TOEFL essays. The authors also failed to
find a significant difference between the two higher levels (levels 4 and 5). A
similar study looking at IELTS essays (Kennedy & Thorp, 2002) did not differen-
tiate between immediately adjacent levels, but looked only at differences between
essays at levels 4, 6 and 8. Although the authors fail to report means for each
level, the minimum and maximum number of words at each level also indicate a
large amount of overlap, even though the levels were not adjacent. Essays at level
4 ranged from 111 to 370 words, essays at level 6 from 184 to 485 words and es-
says at level 8 from 239 to 457 words. These ranges suggest that there was proba-
bly no statistical difference between the essays at higher levels. It could therefore
be argued that the number of words is more successful in distinguishing between
lower level writers, but is not a measure that can be expected to successfully dif-
ferentiate between students who have already been admitted to university.
7.3.1.1 Trait scale: temporal fluency
It was decided not to include temporal fluency in the rating scale because there
was little evidence from the analysis of the scripts that there are differences be-
tween the levels of writing in terms of the number of words that writers produce.
The second measure of fluency was repair fluency, operationalised as the number
of self-corrections. This measure has not been applied to writing before but was
‘borrowed’ from research on speaking. The variable distinguished successfully
between the different proficiency levels but the differences between levels were a
lot less pronounced than for accuracy. The measure was included in the rating
scale but there is some doubt regarding its usefulness in the context of writing.
This will be discussed in more detail in the context of Research Question 2, in
light of the feedback from the raters.
On the basis of these findings, the rating scale for fluency was based only on the
variable ‘the number of self-corrections’. The scale largely followed the findings
from the analysis. The levels were slightly adjusted to allow for better distinctions
between bands. For example, band level 8 was designed to include no more than
five self-corrections although the analysis of band 8 resulted in a mean of nearly 7
and so on. As with accuracy, a sixth level (level 9) was added to the scale to ac-
knowledge scripts with no self-corrections. The rating scale for repair fluency is
shown in Table 62 below.
7.4 Complexity
Two different types of complexity were investigated in both the pilot study and
the main analysis of the writing scripts: grammatical and lexical complexity.
Grammatical complexity was operationalised as clauses per t-unit. Although the
measure was successful in the pilot study, it failed to distinguish between the five
levels of writing in the main analysis. Interestingly, Wolfe-Quintero et al. (1998)
found that although a number of studies in their review returned non-significant
results for this measure, it seemed to generally increase at least with overall profi-
ciency level. However, a more recent study also undertaken in an assessment con-
text, in this case TOEFL (Cumming et al., 2005), also returned non-significant
results for this measure. The authors report very little difference between the pro-
ficiency levels, with the means ranging from 1.5 to 1.8. These are slightly higher
and more varied than those found in this study (which found means ranging from
1.4 to 1.5), but present a similar picture to the current findings. It is possible that
the context under which the data are collected plays a role in this measure. The
students taking the DELNA assessment were aware that their writing was going to
be assessed. It is possible, therefore, that when students are in an assessment situa-
tion, they employ a play-it-safe method and focus more on the accuracy (and lexi-
cal complexity) of their writing at the expense of grammatical complexity. It is
interesting though that the complexity of sentences is regularly included in rating
scales of writing. If this and other studies show that writers do not differ greatly
from each other in terms of the complexity of their sentence structure when in an
assessment context, this measure should perhaps not be included in rating scales
in the future. It might be important to make raters and rating scale designers aware
of the limitation of this measure.
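For reference, the grammatical complexity measure discussed above is a simple ratio over manually coded clause counts; a sketch with hypothetical codes:

```python
def clauses_per_t_unit(clause_counts):
    """Mean clauses per t-unit for one script.
    `clause_counts` holds the (manually coded) number of clauses in
    each t-unit of the script."""
    return sum(clause_counts) / len(clause_counts)

print(round(clauses_per_t_unit([1, 2, 1, 1, 2]), 2))  # → 1.4
```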
It can further be argued that not having a successful measure for grammatical
complexity is a limitation of this study. If time had allowed it, it would have been
useful to pursue other measures of grammatical complexity. A possibility for
further research would be measures of the number of passives per t-unit or complex
nominals per t-unit. However, Wolfe-Quintero et al.’s review shows that in previ-
ous research not many measures of grammatical complexity have been successful.
Based on these findings, the decision was made not to include this variable in the
rating scale.
7.4.2 Lexical complexity
The second type of complexity pursued was lexical complexity. Several measures
were examined in the pilot study and the three most promising measures were ex-
amined in the main analysis. These were the sophisticated words over total lexical
words, the average word length and the number of AWL words. All three meas-
ures were successful in distinguishing between the different levels of writing. The
measure of sophisticated lexical words over total words was used in a longitudi-
nal study by Laufer (1994). Like Laufer’s, this study was able to show that this
measure differentiates between proficiency levels. The concern was however, that
it would be difficult for raters to use in the rating process. The average word
length was also successful in distinguishing between the different proficiency lev-
els as it has been in other studies (e.g. Grant & Ginther, 2000). However, there
was a concern whether raters would be able to judge the average word length
when rating. Differences between the different proficiency levels were not pro-
nounced enough to be detected by human raters examining a hand-written writing
product. For this reason, the measure, although promising, was not included in the
rating scale. The only measure incorporated in the scale was the number of AWL
words in a text. Although the original measure was the percentage of AWL words,
a brief investigation of the scripts in the sample showed that controlling for text
length in this manner made no difference to the result. It was therefore thought
that it might be easier for raters to look for the number of AWL words. No prior
research could be located for this measure, although it parallels the variable of so-
phisticated lexical words over total lexical words. There are, of course, several
problems with this measure. It does not control for students reusing the same word
on several occasions. It would possibly have been better to measure the number of
different AWL words. However, overall, the number of AWL words seems a
promising measure which might be usefully applied in other contexts.
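Counting AWL tokens is mechanical enough to sketch in code. The word list below is a tiny hypothetical stand-in; the actual Academic Word List (Coxhead, 2000) comprises 570 word families. Note that, as discussed above, repeated tokens are counted each time.

```python
# Tiny hypothetical stand-in for the Academic Word List.
AWL_SAMPLE = {"analyse", "concept", "data", "environment", "establish",
              "method", "research", "significant", "theory", "vary"}

def awl_token_count(text):
    """Number of AWL tokens in a script (repeats counted each time)."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return sum(t in AWL_SAMPLE for t in tokens)

print(awl_token_count("The research method used to analyse the data."))  # → 4
```

Counting *different* AWL words instead, as suggested above, would only require replacing the sum with `len({t for t in tokens if t in AWL_SAMPLE})`.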
Two main considerations went into the design of the descriptors for lexical com-
plexity. Firstly, the variable used needed to be usable by raters in a testing situa-
tion. This excluded the average word length and sophisticated lexical words over
total lexical words because measuring these would be time-consuming. The num-
ber of AWL words was seen to be usable in a rating situation. Secondly, it was
decided that six levels of AWL words would be difficult to distinguish for raters.
Therefore, only four levels were created, joining levels 4 with 5 and 8 with 9. The
rating scale for lexical complexity can be seen in Table 63 below.
Table 63: Rating scale – Lexical complexity
8  Large number of words from academic wordlist (more than 20)
7  Between 12 and 20 AWL words
6  5–12 words from AWL
5  Less than 5 words from AWL
7.5 Mechanics
Three measures were used for the analysis of mechanics: spelling, punctuation
and paragraphing. The number of spelling mistakes was a promising measure, but
the only really worthwhile difference was found between level 4 (with eight
mistakes on average) and level 5 (with just under four errors on average). The other
four proficiency levels were very similar and probably not distinguishable by rat-
ers. No prior research was found that investigated spelling mistakes over different
proficiency levels. However, it was interesting to see that although there were
slight differences between the levels, this measure was not very successful in dif-
ferentiating them. One explanation why the number of spelling errors did not dif-
ferentiate between the different writing levels is that lower level writers know
fewer words and although they produce many mistakes when spelling these
words, in relative terms they can make only a certain number of mistakes. The
words they know are often just very simple, easily spelt words. Writers at higher
levels have access to a larger vocabulary and therefore the chances of misspelling
words also increase. For these reasons, higher level writers produce the same
number of mistakes as lower level writers.
It is therefore not surprising that the measure of the number of spelling mistakes
was not successful in identifying differences between writers at different levels
(except between levels 4 and 5). However, spelling is regularly included in rating
scales of writing. If there are indeed very few differences between writers in the
number of mistakes they produce, then it might be necessary to bring this fact to
the attention of rating scale developers. Further research in this area is clearly
necessary.
The number of punctuation errors also did not distinguish between the different
levels of writing. Very little research was identified on this measure. Mugharbil
(1999) was able to show in his study that the full stop (the only punctuation mark
that was examined in this study) is acquired first by learners. Therefore, it is pos-
sible that there were very few differences between the learners in this study be-
cause all had reached post-beginner level as they were already at university. It
might have been better to include comma errors into the analysis, as Mugharbil
was able to show that the correct usage of the comma is what differentiates higher
and lower level learners. However, the comma was not included, as it would have
been difficult to achieve inter-rater reliability on this measure. Punctuation and
capitalisation, which were already identified in the pilot study as not being able to
differentiate between the learners, do not seem to be worthwhile measures to pur-
sue in the future.
The third measure used for the analysis of mechanics was paragraphing. This
measure has not been used in this form in any previous studies and was created
specifically for the tasks used in the context of DELNA. A slightly similar analy-
sis of paragraphing in a study by Kennedy and Thorp (2002) failed to produce any
clear differences between writers in the IELTS test, although they found that
writers at level 4 produced significantly more essays with only one paragraph than writers at
level 6. Although the measure used in this study distinguished between the differ-
ent proficiency levels, there are some problems with it. For example, it disregards
unnecessary paragraph breaks. That is, if a writer produces very short, two-
sentence paragraphs, this is not penalized. The measure also does not account for
the ordering of the information within a paragraph. So, whilst the measure can be
applied easily and seems to be successful in discriminating between different lev-
els of writing (as was shown by the analysis), it is very mechanical and has very
clear shortcomings. It would be useful if further studies could attempt to develop a
more sophisticated measure of paragraphing.
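The mechanical character of the paragraphing measure is easy to see in code. The blank-line convention below is an assumption of this sketch; in the hand-written DELNA scripts, paragraph breaks were identified visually.

```python
def paragraph_count(script_text):
    """Number of paragraphs, taking a blank line as the break.
    Illustrates why the measure is 'very mechanical': it sees neither
    paragraph length nor the ordering of information within paragraphs."""
    return len([b for b in script_text.split("\n\n") if b.strip()])

sample = "Intro paragraph.\n\nBody paragraph one.\n\nConclusion."
print(paragraph_count(sample))  # → 3
```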
Based on these findings, it was decided that spelling and punctuation would not be
included in the rating scale, as a clear differentiation between levels of writing
could not be shown. The trait scale for paragraphing can be found in Table 64
below.
The first consideration when designing the trait scale was that although the
DELNA scale has six levels, for this new scale only five levels would be possible.
The histogram shows that only very small percentages of students in the sample
produced either one or five paragraphs and the majority produced three (nearly
half of the students). A decision had to be made as to which level to leave empty –
either the highest level (9) or the lowest level (4). As five paragraphs were seen as
the perfect response, level 4 was left empty and the descriptors were scaled from
levels 5 to 9.
7.6 Coherence
Seven topical structure categories were investigated, three of which had been used
previously in other studies (parallel progression, direct sequential progression and
unrelated progression), two were taken from the literature but adapted to suit the
data (indirect progression and extended progression) and two were newly devel-
oped (superstructure and coherence break).
The findings of the analysis of the scripts are generally in line with existing stud-
ies, showing that higher level writers use more direct sequential progression (as
was shown by Wu, 1997), less parallel progression (as shown by Burneikaitė &
Zabiliūtė, 2003; Schneider & Connor, 1990; Wu, 1997) and less unrelated pro-
gression (as found by Wu, 1997). As with Schneider and Connor, no difference
was found in the use of extended progression by higher and lower level writers.
The findings for the new categories followed the results of the pilot study. It was
shown that higher level writers use more superstructures and linkers to make their
writing coherent and make use of more indirect progression and fewer coherence
breaks.
The correlation coefficients that resulted from the correlation of the overall writ-
ing score with the different topical structure categories can be seen as rather weak.
There are two reasons for this. Firstly, the large sample of scripts (601) affects
how the resulting correlation coefficients should be read: with a sample of this
size, even weak correlations reach statistical significance, so modest coefficients
are unsurprising. A second reason is that there are a number of intervening variables at play.
The final writing score is a product of a number of different aspects. Earlier in this
chapter, a case was made for the fact that raters often put more emphasis on more
explicit features of a piece of writing, for example, accuracy and vocabulary.
Therefore, lower correlation coefficients can be expected. Overall, the results for
the analysis of coherence are very promising and might be applicable to other
contexts and studies.
Although the correlations of the topical structure categories with the overall writ-
ing score (see Table 47) were all of more or less similar strength, it is possible that
a multiple regression analysis might show that some categories are more indica-
tive of high or low level writing. If some categories predict certain levels of writ-
ing more than others, then it would be possible to reduce the number of categories
that are included in the rating scale. This would probably simplify the rating
process for the raters and make rater training significantly easier. However, as the
idea of a multiple regression analysis only emerged after the design and trialling
of the rating scale, it was not pursued but left as a possible avenue for further re-
search.
The rating scale that was based on these findings can be seen in Table 65 below.
The design of the trait scale for coherence was more difficult than for other scales,
as the results for a number of categories had to be considered and synthesized.
Firstly, the number of band levels needed to be decided. Because the analysis of
the scripts was based only on levels 4 to 8, it was decided that these levels would
also be used for this scale. The two outer levels (levels 4 and 8/9) were described
first. Here, information was included for raters on what features might be most
commonly or least commonly expected. The central three levels were scaled
based on the findings of the analysis, after a detailed scrutiny of the box plots for
these levels.
7.7 Cohesion
Two aspects of cohesion were investigated in the main study: anaphoric pronomi-
nals and the number of linking devices. The anaphoric pronominals this and
these were shown to be used more by writers of higher proficiency, whilst the re-
mainder (they, them, it, their) were used more by lower level writers (as shown by
a negative correlation with the overall score). The same finding was made by
Banerjee and Franceschina (2006) in the context of the IELTS test. They found a
strong increase of the use of this and these at higher levels. The reason why higher
level writers use more of these two demonstrative pronominals might be that it is
more difficult for lower level writers to refer anaphorically to ideas (which is the
main function of this and these).
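A surface count of *this* and *these* can be sketched as below; the study's coding additionally required each token to refer anaphorically to an idea, a judgement a simple string match cannot make.

```python
import re

def demonstrative_count(text):
    """Count tokens of 'this' and 'these' (anaphoric status not checked;
    in the study that judgement was made manually)."""
    return len(re.findall(r"\b(?:this|these)\b", text.lower()))

sample = "This suggests a trend. These findings support this view."
print(demonstrative_count(sample))  # → 3
```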
Whilst the use of this and these produced a clear differentiation between the levels
of writing in terms of means (although with a lot of overlap in the distribution of
the levels), the number of linking devices (e.g. however, therefore) used was not
very strongly indicative of writing proficiency. However, it was interesting to ob-
serve that lower level writers produced slightly more of these devices than higher
level writers. It is also interesting that the findings in the literature are divided on
this. Most studies investigated the difference between native and non-native
speakers in the use of linking devices. Reid (1992) and Field and Yip (1992) both
found that non-native speakers overused these devices, although a large scale
study by Granger and Tyson (1996) was not able to confirm these findings. Ken-
nedy and Thorp (2002), however, in the context of IELTS, were able to show that
lower level writers used linking devices like ‘however’ and enumerative markers
like ‘firstly’ more frequently than writers rated at level 8. In the current study,
only a careful qualitative analysis of the type of devices used was able to differen-
tiate between writing levels (as was also suggested by Granger and Tyson, 1996).
Even this differentiation was not as discriminating as other measures used in this
analysis, and therefore resulted only in four levels in the new scale.
Lexical cohesion, as operationalised by Halliday and Hasan (1976), was not in-
cluded in the main analysis of the writing scripts. Counting lexical chains or the
length of chains proved too time-consuming to be included in the rating process.
The lack of a measure of lexical cohesion needs to be accepted as a limitation of
this study.
The design of the trait scale for cohesion was the most difficult, because the
analysis of the frequency of use of linking devices did not produce clear findings
and the differences between the levels in the use of anaphoric pronominals were
small. It was decided not to include the pronominals used more by lower level
writers to avoid having too lengthy level descriptors. Because the analysis of the
number of linking devices was not as clear as hoped for, it was decided to include
the findings from the qualitative analysis. The analysis of the scripts did not make
it possible to easily distinguish between more than four levels and therefore levels
8 and 9 as well as levels 6 and 7 were combined. Table 66 shows the scale devel-
oped for this category.
Table 66: Rating scale – Cohesion
8 Connectives used sparingly and skilfully (not mechanically) compared to text
length, and often describe a relationship between ideas
Writers might use this/these to refer to ideas more than four times
7 Slight overuse of connectives compared to text length.
Connectives might be used mechanically (e.g. firstly, secondly, in conclusion)
One or two connectives might be misused
Some connectives skilfully used
This/these to refer to ideas possibly used up to four times
5 Overuse of connectives compared to text length
Connectives used are often simple (and, but, because).
Some might be used incorrectly.
This/these to refer to ideas used only once or twice
4 Overuse of connectives compared to text length
Connectives used are often simple (and, but, because).
Some might be used incorrectly.
This/these not or very rarely used.
ers use more than double the number of boosters as higher level ESL writers. This
study, with a mixed cohort of L2 and L1 writers, was not able to show that lower
level writers used more boosters than higher level writers, as might be hypothe-
sised based on the findings of previous research. The pilot study indicated that this
might be the case but when the larger corpus was investigated, there was only a
very slight difference between the levels. It might be necessary to do a more fine-
grained, qualitative analysis of the type of boosters used to identify differences
between writers of higher and lower proficiency.
The fourth category of reader/writer interaction investigated was the use of the
passive voice, a category related to writer identity. In line with what Shaw and Liu
(1998) and Banerjee and Franceschina (2006) found, it was seen that as the writ-
ing proficiency level increased, more instances of passive voice were found.
However, overall the frequency was very low. This again might be a feature of the
genre of the task. There was furthermore no negative correlation between the use
of passives and instances of writer identity, as might be expected. The interaction
between these two devices clearly warrants further research.
After this analysis, it was decided that the only measure of reader-writer interac-
tion that could be transferred into the rating scale was the measure of hedging.
The rating scale can be seen in Table 67 below.
The analysis showed distinct levels in the number of hedging devices. The de-
scriptors were scaled to match the levels in the DELNA scale and to allow for
clear differentiation between levels.
7.9 Content
The trait scale for data description can be found in Table 68 below. It can be seen
that the descriptors were scaled to include all levels of the DELNA scale, ranging
from ‘data description not attempted’ to ‘all data described’. It was decided that
the level descriptors would be worded to only broadly represent the percentages
investigated in the quantitative analysis in Phase 1. The information in brackets in
most level descriptors was given to clarify the exact details of the data description
the raters should expect at each level.
The second and third parts of the prompts were evaluated in terms of the number
of reasons (or ideas) and the number of supporting ideas that writers supplied.
This was possible because these sections are clearly demarcated in the essays, and
because of the relatively short time limit. Ideas that did not relate to the topic and
were therefore also not found in the essays of the expert writers were not included
in the count. These measures were able to discriminate successfully between the
different levels of writing, although for both sections there was a lot of overlap
between writers at levels 5 and 6. As for the description of the data, no literature
on similar measures was located. It is interesting to see, however, that better writ-
ers were able to produce more relevant ideas in the existing time limit. The reason
for this might be that less space in their working memory is taken up by producing
sentence-level grammatical constructions and therefore more ideas can be pro-
duced in the time available.
The resulting scale can be found in Table 68 below. In this case, not all six levels
of the DELNA scale were used, because the analysis in Phase 1 did not provide
evidence for six levels. Only four levels were found, but a separate band at the
lowest level was added to provide for scripts that did not attempt this section of
the prompt.
As for the interpretation of ideas, the four levels found in the data were used in the
rating scale. A separate level was added at the bottom of the scale to include the
large number of scripts that did not attempt this section. The scale can be found in
Table 69 below.
7.10 Conclusion
Overall, it can be said that most measures investigated were able to discriminate
between the different proficiency levels. A clear limitation of this study is, how-
ever, the way the independent variable of writing proficiency was measured. Be-
cause no independent measure of writing ability (or language ability) was avail-
able, DELNA ratings were used as a basis on which the corpora of the different
sub-levels were created. However, the ratings are a product of the rating scale
used, which was the existing DELNA scale. Therefore, it is not clear if the rank-
ing of the candidates into the levels can be trusted. This is problematic because
one criticism of the existing rating scale is that its descriptors are non-specific,
making the ratings somewhat unreliable. To alleviate this problem at
least to a certain extent, the ratings of the two raters were averaged. Therefore,
any erratic rating behaviour of individual raters was controlled for.
Chapter 8: Methodology - Validation of Rating
Scale
8.1 Introduction
While the previous chapters presented the methodology, results and discussion of
the first phase of this study (the analysis of the writing scripts), the following
chapters provide the methodology, results and discussion chapters of the second
phase, the validation of the new scale. As was mentioned previously in Chapter 5,
the validation phase of this study involved two very different research designs.
The first part, the comparison of the measurement properties of the two different
rating scales, involved a quantitative methodology whilst the analysis of the ques-
tionnaires and interviews was conducted within a qualitative paradigm.
This chapter presents the methodology of the second phase of the study. First, an
overview of the design of the second phase is presented, after which the partici-
pants, instruments and procedures are described in more detail.
8.2 Design
The validation phase of the new rating scale took place in several stages. After
approval from the Human Participants Ethics Committee was obtained, raters
were recruited. This was done by contacting all DELNA raters via an email with
information about the study. Some raters immediately agreed to take part, whilst
others requested more details. The first ten raters who volunteered to participate
were recruited for the study. Therefore, raters were self-selected.
Raters first took part in a rater training session for the existing DELNA rating
scale. After this training session, each rater was given a rating pack of 100 scripts
to take home. They were asked to complete the rating of these scripts over a pe-
riod of three weeks in late January and early February 2006.
While the ten DELNA raters were completing their ratings, the new scale, a train-
ing manual and a questionnaire were trialed on a group of ten research students.
These students studied the training manual at home and were then asked to rate a
number of scripts at a plenary meeting. Some of these ten research students were
DELNA raters not taking part in the study; others had done no previous DELNA
rating although most had some experience of rating writing in other contexts. Dur-
ing the plenary session, the students were able to provide feedback on the scale
and the training manual. At the end of the session, they completed a trial ques-
tionnaire to provide further feedback on the rating scale. A number of changes
were made both to the scale and the training manual based on this feedback.
In early March 2006, the raters participating in the study were trained in using the
new scale. The same procedures were employed as during the rater training ses-
sion in January. Only eight raters were able to take part in this rater training and
the following rating round in May/June 2006. The eight raters filled in the ques-
tionnaire as soon as they completed their rating. The two remaining raters were
unavailable at the time of the training due to personal reasons. They were indi-
vidually trained during May 2006 (following the same procedures) and completed
the rating round during May and June 2006, after which they also completed the
questionnaire.
After these data collection procedures, a break of two months occurred, during
which the researcher analysed the data. During that time, the decision was made
to conduct in-depth follow-up interviews to elicit more detailed information from
the raters. These were undertaken in September 2006, after a refresher rating
round. The schedule for the validation activities outlined above can be found in
Table 70 below.
Overall, the aim of the validation phase of this study was to keep all aspects cen-
tral to performance assessment (the raters, the tasks, the candidates and the rater
training) constant, whilst only the rating scale was varied. It was hoped that any
resulting differences in the ratings were therefore due to differences in the scales
alone.
8.3 Participants
Three groups of participants took part in the validation phase of the study. The
first group were the students who produced the writing scripts rated by the raters.
The second group of participants were the research student raters who took part in
a trial of the rating scale, the training manual and the questionnaire. This trial is
described in more detail under Procedures later in this chapter. The third group of
participants was the group of raters. All three groups of participants are described
in more detail below.
8.3.1 The student writers
One hundred scripts were chosen for the validation phase of the new scale. These
scripts were selected to represent as closely as possible the larger sample of
scripts in terms of marks awarded and background characteristics of the writers.
The student writers were not actively recruited, but had agreed, when doing the
DELNA assessment, that their scripts could be used for research purposes. Very
little information was available about the writers apart from information recorded
in a self-report questionnaire completed before the administration of the assess-
ment. Below is a summary of the information available.
Most of the one hundred writing samples selected were produced by students who
reported either an Asian language or English as their first language. A smaller
group of nearly ten percent had a European language other than English as their
L1, whilst only two students reported a Pacific Island language or Maori as their
first language. Two students fell into the ‘Other’ category. A summary of the stu-
dents’ first languages can be seen in Table 71 below.
Table 71: First languages of writers of scripts chosen for validation phase (self-report ques-
tionnaire)
L1 N
Asian 40
English 47
European 9
PI/Maori 2
Other 2
Total 100
No information was available about the ages of the students but it is probable that
most of the students were in their late teens or early twenties, as all except one
were enrolled in undergraduate programmes. The gender distribution seen in Ta-
ble 72 below shows that more females than males were in the sample, a trend also
observed in the wider DELNA test taker population as well as the university as a
whole.
Table 72: Gender of writers of scripts chosen for validation phase (self-report questionnaire)
Gender N
Female 60
Male 40
Table 73: Faculties of writers of scripts chosen for validation phase (self-report question-
naire)
Faculty N
Engineering 39
Architecture, Property, Planning and Fine Arts 17
Education 17
Arts 11
Medical and Health Sciences 6
Science 5
Conjoint 2
No information available 3
Total 100
8.3.2 The research student raters
To trial the new scale, training manual and questionnaire, ten research students
were recruited. These students were part of a group that met weekly to discuss
their research or other topics of relevance or interest to their study. All research
students were PhD candidates in the Department of Applied Language Studies
and Linguistics at the University of Auckland. All had a background in teaching
English as a second language and therefore experience in marking essays written
by students at different proficiency levels. Two of the research students were also
DELNA raters not taking part in the main validation phase. The procedures em-
ployed in the trial of these instruments will be described later in this chapter.
8.3.3 The raters
The DELNA raters involved in the current study were drawn from a larger pool of
raters who were all experienced teachers of English and/or English as a second
language. All raters had high levels of English language proficiency although not
all were native speakers of English. Some raters were certificated IELTS examin-
ers whereas others had gained experience of marking writing in other academic
contexts. Table 74 below presents background information about each rater taking
part in the study. Specific qualifications relating to Language Teaching are noted
with an ‘LT’ next to the qualification. Because the pool of DELNA raters was not
very large, the background information has been kept to a minimum and general-
ised, so that individual raters cannot be easily identified.
8.4 Instruments
Several instruments were used as part of the validation phase of this study. Firstly,
there were the one hundred writing scripts that were rated twice by the raters, first
using the existing rating scale and then using the new scale. Other instruments in-
cluded the two different rating scales, the rating sheets, the training manual, the
questionnaires and the interview questions used in the semi-structured interviews.
Each of these will be described in detail in the following sections.
8.4.1 Writing scripts
One hundred writing scripts were chosen from the larger pool of writing scripts
from the 2004 administration of DELNA. The scripts were chosen, as mentioned
in the section on the writers above, to represent the different DELNA levels and
the different background profiles of the candidates in the larger corpus. The dis-
tribution of the four different prompts (described in more detail in Chapter 5) used
in this study is shown in Table 75 below:
Table 75: Distribution of four different writing prompts in sample of one hundred scripts
Prompt N
1 22
2 20
3 38
4 20
The scripts were all photocopied from their original handwritten form, in such a
way that the students’ names and ID numbers could not be identified by the raters.
The scripts ranged from 173 to 450 words (deletions were not included in the
word count), with a mean of 261 words.
8.4.2 Rating scales
To compare the validity of the existing DELNA scale and the newly developed
scale, the raters rated one hundred scripts using both scales. The existing DELNA
rating scale can be found in Chapter 4 and the new scale can be found in this
chapter. Both were analytic rating scales; however, they differed in three ways.
The first and most obvious difference relates to the descriptor styles. Whilst the
existing rating scale had relative, vague descriptors which made use of adjectives
like ‘appropriate’ and ‘extensive’, the descriptors on the new scale were more
specific and mostly involved counting features of writing. Secondly, the DELNA
rating scale had level descriptors for nine categories (or traits) whilst the new
scale had descriptors for ten traits. A comparison of the traits of the two scales can
be seen in Table 76 below. Similar traits are noted in the same row of the table.
The third way in which the scales differed had to do with the number of levels as-
sociated with each trait. The DELNA scale had the same number of levels for
each trait (six band levels ranging from 4 to 9), whilst the new scale had varying
numbers of levels for different traits. Some categories had only four levels whilst
others had six. The reason for this was that the analysis of the writing scripts con-
ducted in Phase 1 of this research did not provide evidence for the same number
of levels for each trait scale. The number of band levels for each trait scale can
also be seen in Table 76 below.
Table 76: Comparison of traits and band levels in existing and new scale

DELNA scale (band levels)         New scale (band levels)
Sentence structure (6)            Accuracy (6)
Vocabulary and spelling (6)       Lexical complexity (4)
Data description (6)              Data description (6)
Data interpretation (6)           Data interpretation (5)
Data – Part 3 (6)                 Data – Part 3 (5)
Style (6)                         Hedging (6)
Organisation (6)                  Paragraphing (5)
                                  Coherence (5)
Cohesion (6)                      Cohesion (4)
                                  Repair fluency (6)
8.4.3 Rating sheets
For every assessment using DELNA, raters use a marking sheet based on the de-
scriptors for the current rating scale. Therefore, the same sheet was used in this
study when the existing rating scale was employed. A different marking sheet was
designed for the new scale, for two reasons. Firstly, for the purpose of this study,
the raters did not need to write any comments (as they usually do when using the
existing DELNA scale). Therefore, no space for comments was needed on the
marking sheet. Secondly, to save space, the traits were laid out as columns and the
scripts as rows. In this way, more ratings could fit on one page.
8.4.4 Training manual
Because the raters were all busy and could not put a large amount of time aside
for familiarizing themselves with the new scale, a training manual was produced
which could be studied at home before the rater training and so shorten the train-
ing session. In the manual, clear instructions were provided on how each trait was
to be rated. Traits that were more complicated to rate were further illustrated by
examples and practice exercises. For example, raters could practise identifying
words from the academic word list in a sample text or practise the identification of
the different topical structure analysis categories. At the end of the training man-
ual, the raters were provided with the correct answers.
8.4.5 Questionnaire
After the two rating rounds, a questionnaire was administered. The purpose of the
questionnaire was to elicit raters’ perceptions of the measurement efficacy and
usability of the new scale. Raters were asked to consider the categories in the new
scale, the band levels of each trait and the wording of the descriptors. Raters were
also encouraged to reflect on the rating process when using the scale. The ques-
tions can be found in Table 77 below.
Two versions of the questionnaire were created: a hard copy and an electronic
version, so that raters could choose the way in which they wanted to complete it
(see procedures).
8.4.6 Interviews
Two months after the rating rounds, more in-depth interviews were conducted
with a subset of seven raters. The questions in the interviews resembled those
found in the questionnaire but in this case, the raters were asked about both scales.
Because the interviews were semi-structured, the exact interview questions varied
slightly from participant to participant. The guiding questions and broad topics
can be found in Table 78 below. For each participant, more specific questions
were created which reflected individual rating patterns noticed during the analy-
sis. For example, if a rater consistently over- or under-used certain band levels of
a trait scale, then the rater was asked more in-depth questions about that particular
trait scale.
Table 78: Interview topics
DELNA rating scale:
1. Do you think that the way we have broken the rating down into many specific parts
helps or hinders the process?
2. How would you change the rating scale?
a. in terms of the wording
b. in terms of the categories on the scale
c. in terms of the number of levels on the scale
3. Are there any categories which you find particularly good or problematic?
8.5 Procedures
Before having the raters use the new rating scale, training manual and question-
naire, a trial was conducted. The forum for this was the weekly departmental
meeting of research students which was thought to be suitable for a trial for the
following reasons. Firstly, some of the research students were DELNA raters who
were not part of the study. It was therefore considered fitting to trial the scale on
this group of people, because at least some members of the group had experience
with rating DELNA writing scripts. Because the pool of DELNA raters was lim-
ited, no more DELNA raters could be found for the trial of the materials. Sec-
ondly, the meeting of research students is designed to trial ideas or materials. All
members of the group were conducting doctoral research themselves and were
therefore willing to give time to discussing others’ work.
8.5.1.1 Trial procedures
The research students were asked to read the training manual before the session
and rate five scripts at the research meeting. During the session, they were able to
ask any questions and to criticize the material in any way they wanted to. The rat-
ings of the five scripts were then discussed by the group and at the end of the ses-
sion they completed the questionnaire.
The comments from the group were extremely helpful for revisions of all three
instruments. Firstly, the group suggested several changes to the training manual.
These changes had to do with clarifications about how some scale categories
should be used. For example, they suggested the provision of a more comprehen-
sive list of hedging devices and a list of Academic Word List (AWL) headwords.
The research students also suggested including a comprehensive list of linking
devices in the section on cohesion.
Several changes were also made to the rating scale itself. It became clear that
some raters had problems applying the descriptors for lexical complexity based on
the AWL words. Because of these problems, the descriptors were extended to in-
clude more general descriptions. The changed descriptors for lexical complexity
can be seen in Table 79 below.
Apart from this, the group noticed several minor spelling mistakes and made sug-
gestions about the layout of the scale. All these ideas were helpful and were taken
into consideration before the final scale was completed. The final scale can be
seen in Table 80 on the following pages.
Accuracy
9    All sentences error-free
8    Nearly all sentences error-free
7    About ¾ of sentences error-free
6    About half of the sentences are error-free
5    About ¼ of sentences error-free
4    Nearly no or no error-free sentences

Fluency
9    No self-corrections
8    No more than 5 self-corrections
7    6-10 self-corrections
6    11-15 self-corrections
5    16-20 self-corrections
4    More than 20 self-corrections

Complexity
9-8  Large number of words from Academic Word List (more than 20) / vocabulary
     extensive – makes use of large number of sophisticated words
7    Between 12 and 20 AWL words / makes use of a number of sophisticated words
6    5-12 words from AWL / vocabulary limited, uses only some sophisticated words
5-4  Less than 5 words from AWL / uses only very basic vocabulary
Coherence
9-8  Writer makes regular use of superstructures, sequential progression and
     possibly indirect progression. Few incidences of unrelated progression.
     No coherence breaks
7    Frequent sequential progression, superstructure occurring more frequently.
     Infrequent parallel progression. Possibly no coherence breaks
6    Mixture of most categories. Superstructure relatively rare. Few coherence
     breaks
5    As for level 4, but coherence might be achieved in stretches of discourse by
     overusing parallel progression. Only some coherence breaks
4    Frequent: unrelated progression, coherence breaks and some extended
     progression. Infrequent: sequential progression and superstructure

Cohesion
9-8  Connectives used sparingly but skilfully (not mechanically) compared to text
     length, and often describe a relationship between ideas. Writer might use
     this/these to refer to ideas more than four times
7-6  Slight overuse of connectives compared to text length. Connectives might be
     used mechanically (e.g. firstly, secondly, in conclusion). One or two
     connectives might be misused. Some connectives skilfully used. This/these to
     refer to ideas possibly used up to four times or writer uses connectives
     rarely, but some ideas could be more skilfully connected
5    Overuse of connectives compared to text length. Connectives used are often
     simple (and, but, because). Some might be used incorrectly. This/these to
     refer to ideas only used once or twice or hardly any connectives used
4    Writer uses few connectives, there is little cohesion. This/these not or very
     rarely used
In the trial questionnaire, all students noted that they found the scale very
usable. None of the students considered any categories to be missing, although
one thought that the scale should include descriptors on the use of passives. Gen-
erally, the group was positive about the scale.
8.5.2.1 Rater training: existing scale
For the existing rating scale, the raters were trained in a face-to-face training ses-
sion. In this session, the raters met in plenary and rated 12 scripts as a group. Af-
ter each script, their ratings were discussed and compared to the benchmark rat-
ings awarded to the scripts by four highly experienced raters. The rater training
session lasted for about two hours with a 15 minute break.
8.5.2.2 Rater training: new scale
The training procedures for the new scale were generally the same. Eight raters
read the training manual at home and then rated 12 scripts in a plenary session at
which the same procedures were employed as in the first training session. Two
raters, however, could not meet at the time of the plenary session for personal rea-
sons. These two raters were trained individually at a later date (see Table 70
above). The procedures for these sessions replicated those of the group session.
The one hundred writing scripts selected for the study were photocopied for each
of the ten raters and given a random ID number from one to one hundred. The
scripts were then put in random order into five clearly labelled envelopes of
twenty scripts each. The rating packs also included a copy of the relevant rating scale and
marking sheet, the prompts, and for the second session a list of AWL headwords
and the training manual. The scripts included no information which could identify
the student writers.
As described earlier, the raters produced ratings of one hundred writing scripts
under two conditions, first using the existing rating scale and then using the new
rating scale. Ideally, a counterbalanced design should have been used, in which
half of the raters first produced the ratings using the existing scale, whilst the
other half of the raters first used the new scale before the two groups changed
over. However, this was not possible for practical reasons. Because the raters
were mostly busy teaching at the time of the study, the rating rounds had to be
arranged around the holidays and the semester breaks. In addition, several raters
had to leave the country around the middle of May 2006. As the DELNA rating
scale was pre-existing, the decision was made to ask raters to produce the ratings
using this scale early in 2006, before the start of the first semester. At this stage,
the new scale had not been completed or trialed. Therefore, all raters rated the
scripts using first the existing scale and then the new scale. It was assumed that
raters would not be able to remember details of individual scripts over the period
of two to three months between the two rating rounds and therefore no order ef-
fect was anticipated.
In each rating round, the raters were given all one hundred scripts at the same time
in five envelopes of twenty scripts each. The raters were instructed to rate no
more than ten scripts in one session to avoid fatigue, and were given three weeks
to complete the ratings. Once the scripts had been rated, the envelopes were
handed back to the researcher. The results were immediately entered onto an Ex-
cel spreadsheet.
The raters were paid the current DELNA rate per script so that they spent the
same amount of time on each script as they would do under a regular administra-
tion of the assessment.
At the end of the second rating round, the raters were asked to keep the new rating
scale and the training manual, and were provided with the questionnaire. They
then had three days to complete the questionnaire. This time limit was set so that
their memories would still be fresh. The raters were able to either complete the
paper version of the questionnaire or ask for an electronic version via email
which they could fill in on the computer and return in the same way to the re-
searcher.
Three months after the second rating round, all raters were invited to participate in
interviews. Seven raters agreed to participate. Therefore the raters in the inter-
views were self-selected. Before the interviews, the raters rated five scripts using
both the existing and the new rating scale to refresh their memories. The inter-
views were undertaken in a quiet conference room at the university and digitally
recorded using an Olympus Digital Voice recorder WS-100. The interviews were
semi-structured and lasted for about 30 minutes. All interviews were later tran-
scribed broadly by the researcher using the Olympus DSS Player A-400 transcrip-
tion software. This software enables the transcription to be carried out with the
help of a foot pedal to stop, start and rewind the recording, reducing the transcrip-
tion time considerably.
The long break of three to four months between the last rating round and the in-
terviews occurred because all data needed to be analyzed first, so that more de-
tailed questions could be devised for each participant (see instrument section).
The analysis of the data will be described in three sub-sections. The first two sec-
tions describe the details of the analysis of the rating data, whilst the third section
briefly comments on the analysis of the interviews and questionnaires respec-
tively.
8.5.4.1 Rasch analysis
The rating data was analyzed using the multi-faceted Rasch measurement program
FACETS version 3.59.0 (Linacre, 1988, 1994, 2006; Linacre & Wright, 1993).
The basic Rasch model was first advanced by Rasch (1960; 1980) as a mathe-
matical representation of the interaction of person ability and item difficulty.
These two parameters were modeled on a common scale of log odds units or
‘logits’ (McNamara, 1996). This logit scale has the advantage that it is an interval
scale. It can therefore not only tell the researcher that one item is more difficult
than another, or that one person has more ability than another, but also how large
this difference is. The basic Rasch model was designed to be used for dichoto-
mously scored data (i.e. right/wrong scoring).
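The basic dichotomous model can be illustrated with a short calculation (a sketch for illustration only; the analysis in this study was carried out with the FACETS program, not hand-rolled code). The probability of a correct response depends only on the difference between person ability and item difficulty, both expressed in logits:

```python
import math

def rasch_p_correct(ability, difficulty):
    """Probability of a correct response under the basic dichotomous
    Rasch model, with ability and difficulty both in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the probability is exactly 0.5.
p_equal = rasch_p_correct(1.0, 1.0)

# Because logits form an interval scale, an ability gap of 1 logit always
# corresponds to a log-odds difference of 1, whatever the item difficulty.
gap = math.log(rasch_p_correct(2.0, 0.0) / (1 - rasch_p_correct(2.0, 0.0))) \
      - math.log(rasch_p_correct(1.0, 0.0) / (1 - rasch_p_correct(1.0, 0.0)))
```

This is what makes the logit scale informative about *how large* a difference is, not merely its direction.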
The initial Rasch model was further extended into the Rating scale model by An-
drich (1978). This model not only expresses the overall difficulty of a particular
item but is also able to calculate the step difficulty between each step in a rating
scale. A further extension of the Rasch model was proposed in the early 1980s by
Masters and Wright (Masters, 1982; Wright & Masters, 1982). This model has the
added ability of being able to work with items scored using a partial credit scoring
system.
The Rasch model used in this analysis, the multi-faceted Rasch model, is an ex-
tension of both the rating scale model and the partial credit model. The multi-
faceted Rasch model proposed by Linacre (1989) was designed to include any
number of facets pertinent to the assessment situation. A typical basic multi-
faceted Rasch analysis would include candidates, items and raters as facets. The
researcher is able to analyze rating data by summarising overall rating patterns in
terms of group-level main effects for the raters, candidates, traits and any other
variables of the rating situation. In the analyses, the contribution of each facet is
separated out to determine if the various facets are functioning as intended. The
analysis further allows the researcher to look at individual-level effects of the
various elements within a facet (i.e. how individual raters, candidates, or traits in-
cluded in the analysis are performing).
The multi-faceted Rasch model is an additive linear model based on logistic trans-
formation of the observed ratings to a logit scale. In this model, the logit can be
viewed as the dependent variable, while the various facets function as independent
variables influencing these logits (Myford & Wolfe, 2003). When the analysis is
undertaken, the various aspects (or facets) are analysed simultaneously but inde-
pendently and are then calibrated onto the common logit scale. This makes it pos-
sible to measure rater severity on the same scale as candidate ability and trait dif-
ficulty and thus carry out comparisons between different facets.
For each element of each facet, the analysis provides a logit measure, a standard
error (which provides information about the precision of the calibration) and fit
indices (which supply information about how well the data of this element fit the
expectation of the measurement model).
The multi-faceted model is now the most general Rasch model and all other mod-
els can be derived from it (McNamara, 1996).
The form of the many-faceted Rasch model used in this study (a multi-faceted
version of the partial credit model) can be represented by the following mathe-
matical model:
log (Pnijk / Pnij(k-1)) = Bn – Cj – Di – Fik

where Pnijk is the probability of candidate n being awarded a rating of k by rater j on trait i, Pnij(k-1) is the probability of a rating of k–1, Bn is the ability of candidate n, Cj the severity of rater j, Di the difficulty of trait i, and Fik the difficulty of achieving step k on trait i.
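As an illustrative sketch (not part of the original study), the score-category probabilities implied by this partial credit form can be computed directly from the facet estimates; the function name and all logit values below are invented for demonstration.

```python
import math

def category_probs(ability, rater_severity, trait_difficulty, steps):
    """Score-category probabilities under the many-facet partial credit
    model: category k's log-odds numerator accumulates the quantity
    (B - C - D - F_k) over the steps up to k."""
    logits = [0.0]  # the bottom category serves as the reference
    total = 0.0
    for step in steps:
        total += ability - rater_severity - trait_difficulty - step
        logits.append(total)
    exps = [math.exp(v) for v in logits]
    denom = sum(exps)
    return [e / denom for e in exps]

# Hypothetical logit values for one candidate, one rater and one trait;
# four step difficulties imply a five-category scale.
probs = category_probs(1.5, 0.2, -0.3, [-2.0, -0.5, 1.0, 2.5])
```

The probabilities sum to one, and raising the ability estimate while holding the other facets constant shifts probability mass towards the higher categories, which is the behaviour the model description above implies.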
Multi-faceted Rasch analysis was chosen for this study because analytical tools
from classical test theory have several limitations. For example, an ANOVA-based approach could be chosen to study group-level rater effects as well as
rater-effect interactions. However, ANOVA has the limitation that possible inter-
action effects can contaminate main effects, making the interpretation of the main
effects more difficult (Wild & Seber, 2000). As mentioned earlier, multi-faceted
Rasch measurement goes beyond the detection of main effects and interaction
effects, as it allows for the detection of individual level effects. In this respect,
multi-faceted Rasch measurement is superior to ANOVA-based approaches and
regression approaches (Myford & Wolfe, 2003).
Another approach possible when working with rating data is generalisability the-
ory (or G-theory). One limitation of G-theory, which is addressed in multi-faceted
Rasch measurement, is that although it identifies sources of variance attributed to
each facet and its interactions, the impact of such differences on the candidates’
scores during a particular examination is not corrected. Therefore, the candidates
receive the raw scores they earn from the raters they encounter, and not an ad-
justed raw score due to rater differences or other attributes of the examination, as
is produced in multi-faceted Rasch measurement.
FACETS (Linacre, 1988, 2006) makes it possible to analyze data based on an ana-
lytic rating scale both as a whole (to see the functionality of the rating scale as a
whole) or, by employing a partial credit model, with respect to each individual
trait scale. It is also possible to investigate the rating behavior of all raters in the
study as a group or individually or to investigate how each rater employs each in-
dividual trait scale.
For the purpose of the following results chapter, the rating behavior of all raters as
a group was investigated both when using the scale as a whole, and when utilizing
individual trait scales.
The rating behavior of individual raters was also analyzed (when using the scales
as a whole and the individual trait scales), but this is not presented in the results
section as it is not relevant to answering the research questions, which are con-
cerned with group rather than individual behaviour. The individual analysis was
undertaken merely for the purpose of developing more detailed questions for the
interviews.
Before the data were analyzed, a number of hypotheses were developed for com-
paring the two rating scales. Each of these hypotheses related to a group of statis-
tics generated by the FACETS program. These were:
1) candidate discrimination
2) rater separation
3) rater reliability
4) variation in ratings
5) scale step functionality
Why each of these was chosen and the hypothesis relating to each group of statis-
tics will be discussed in detail below.
The first hypothesis was that a more discriminating rating scale can be seen as su-
perior. It is important for an assessment to be able to differentiate between candi-
dates. In the case of performance assessment, the tool that is used to achieve this
is the rating scale. The more levels of candidate ability a group of raters can dis-
cern with the help of a rating scale, the better the scale is functioning.
When a rating scale is analyzed, the candidate separation ratio is an excellent in-
dicator of the discrimination of the rating scale. The candidate separation ratio
measures the spread of candidates’ performances relative to their precision
(Fisher, 1992). According to Myford and Wolfe (2003, p.410), this separation is
expressed as a ratio of the ‘true’ standard deviation of ratee performance measures
over the average ratee standard error as in the equation G = True SD / RMSE,
where True SD is the standard deviation of the ratee performance measures (i.e.
true standard deviation = the observed standard deviation – average measurement
error), and RMSE is the root mean-square of the standard errors of the ratee performance measures, or the statistical ‘average’ measurement error of those measures. The higher the separation ratio, the more discriminating the rating scale. FACETS reports two further measures of candidate separation, the candidate fixed chi-square value and the reliability of the candidate separation. However, neither of these provides additional information beyond the separation ratio, so they are not reported in the following chapter.
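The statistic G = True SD / RMSE described above can be sketched in a few lines of Python. The variance-based adjustment used here (subtracting the mean-square error from the observed variance) is the standard Rasch computation; the function name and sample values are illustrative.

```python
import math

def separation_ratio(measures, standard_errors):
    """G = True SD / RMSE: the spread of the measures relative to
    their average measurement error (Myford & Wolfe, 2003)."""
    n = len(measures)
    mean = sum(measures) / n
    observed_var = sum((m - mean) ** 2 for m in measures) / (n - 1)
    rmse = math.sqrt(sum(se ** 2 for se in standard_errors) / n)
    true_var = max(observed_var - rmse ** 2, 0.0)  # 'true' variance
    return math.sqrt(true_var) / rmse

# Five hypothetical candidate measures, each with a standard error of 0.5.
g = separation_ratio([0.0, 1.0, 2.0, 3.0, 4.0], [0.5] * 5)
```

The same computation applies to rater severity measures, which is how the rater separation ratio discussed below is obtained.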
The next hypothesis was that a well-functioning rating scale would result in
small differences between raters in terms of their leniency and harshness as a
group. If a scale is functioning well, the raters will be able to discern the ability of
a candidate easily and do this in agreement with other raters. Thus, raters will not
vary greatly in terms of leniency and harshness. For this reason, a rating scale re-
sulting in a smaller rater separation ratio is seen to be superior. The rater separa-
tion ratio, like the candidate separation ratio, provides a measure of the spread of
the rater severity measures relative to the precision of those measures (Myford &
Wolfe, 2004). The higher the rater separation ratio, the more the raters differed in
terms of severity in their ratings.
8.5.4.1.3 Rater reliability:
The third hypothesis was that a necessary condition for validity of a rating scale is
rater reliability (Davies & Elder, 2005). A scale that results in higher levels of
rater reliability can be seen as superior.
FACETS provides two measures of rater reliability: (a) the rater point-biserial
correlation index (or single rater - rest of raters correlation), which is a measure of
how similarly the raters are ranking the candidates, and (b) the percentage of exact rater agreement, which indicates how often a rater awarded exactly the same score as another rater in the sample. Both types of rater
reliability statistics were deemed necessary based on Stemler’s (2004) paper, in
which he cautions against the use of just one statistic to describe inter-rater reli-
ability.
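The exact-agreement percentage can be sketched as follows; the data structure and function name are assumptions for illustration, not FACETS output.

```python
from itertools import combinations

def exact_agreement(ratings):
    """ratings: dict mapping rater -> {script_id: score}. Returns the
    percentage of rater-pair observations where both raters awarded
    exactly the same score to the same script."""
    agree = total = 0
    for r1, r2 in combinations(ratings, 2):
        # Only scripts rated by both members of the pair are compared.
        for script in set(ratings[r1]) & set(ratings[r2]):
            total += 1
            agree += ratings[r1][script] == ratings[r2][script]
    return 100.0 * agree / total

# Two hypothetical raters scoring two scripts: they agree on one of two.
pct = exact_agreement({"A": {1: 6, 2: 7}, "B": {1: 6, 2: 8}})
```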
8.5.4.1.4 Variation in ratings:
Because rating behaviour is a direct result of using a rating scale, it was further
contended that a better functioning rating scale would result in fewer raters rating
either inconsistently or overly consistently (by overusing the central categories of
the rating scale). The idea behind this was that if a rater is unsure what level to
award when using a rating scale, the rater might either rate inconsistently or resort
to a play-it-safe method and overuse the inner categories of a rating scale and
avoid the outside band levels.
The measure indicating variability in raters’ scores is the rater infit mean square
value. Rater infit mean square values have an expected value of 1 and can range
from 0 to infinity. The closer the calculated value is to 1, the closer the rater’s rat-
ings are to the expected ratings. Infit mean square values significantly higher than
1.3 (following McNamara, 1996 and Myford and Wolfe, 2000) denote ratings that
are further away from the expected ratings than the model predicts. This is a sign
that the rater in question is rating inconsistently and therefore showing too much
variation. This is called ‘misfit’. Similarly, values lower than .7 indicate that the
observed ratings are closer to the expected ratings than the Rasch model predicts.
This is called ‘overfit’. This could indicate that a rater is rating very consistently.
However, it is more likely that the rater concerned is overusing certain categories
of the rating scale, normally the inside values. This can be confirmed by scrutiniz-
ing the FACETS analysis of individual raters using the individual trait scales.
Only raters actually overusing the inside categories were included in the summary
report on rater variation in Chapter 9.
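The infit statistic and the misfit/overfit classification just described can be sketched as follows. The 1.3 and .7 cut-offs follow the text; the function names and the worked values are illustrative.

```python
def infit_mean_square(observed, expected, variances):
    """Information-weighted fit statistic: squared score residuals
    summed over observations, divided by the summed model variances;
    its expected value is 1."""
    residual_sq = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return residual_sq / sum(variances)

def fit_label(mnsq, low=0.7, high=1.3):
    """Classify a rater's infit mean square using the cut-offs adopted
    in the text (following McNamara, 1996; Myford & Wolfe, 2000)."""
    if mnsq > high:
        return "misfit"    # too much variation: inconsistent rating
    if mnsq < low:
        return "overfit"   # too little variation, e.g. central tendency
    return "acceptable"

# Two hypothetical ratings with model expectations and variances.
label = fit_label(infit_mean_square([5, 6], [5.5, 5.5], [0.25, 0.25]))
```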
The final hypothesis was that a better functioning rating scale would result in bet-
ter scale step functionality. A rating scale is made up of a number of different
band levels. It is important for each level to function appropriately for the entire
scale to perform efficiently.
Linacre (1999) reports a number of statistics that need to be scrutinized when rat-
ing scale functionality is of interest. All features of scale functionality are reported
in a single table in the output of FACETS, entitled category statistics. Figure 49
below presents the category statistics for grammatical accuracy of the existing
scale (as an example). For the output to be valid (i.e. for FACETS to return reli-
able results), each scale category (band level) needs to include at least ten obser-
vations.
Figure 49: Scale category statistics: Grammatical accuracy – existing scale
This can be verified in the second column of the rating scale category statistics
table. Linacre (1999) further argues that it is important that the observations (as
seen in column 2) are regular (i.e. more or less normally distributed with only one
peak). He also suggests that the average measures (seen in column 5) should ad-
vance monotonically. These measures show the average logit value of the candi-
dates rated at each band level. The outfit mean square measures (column 7) indi-
cate the difference between the observed average measure and the expected aver-
age measure in each category. The expected outfit mean square value is one. If a
band level displays an outfit mean square value of over 1.4, this indicates unex-
pected ‘noise’ in the category. The reason for this could be found in either the
scale, the candidates or the traits, and needs to be further investigated (Mike Lina-
cre, personal communication, May 2006). Finally, the step calibration measures
(column 8) are the rating scale category thresholds, the point at which a candidate
of that measure of ability has a 50% probability of being graded into either of the
adjacent band levels. These should advance monotonically and the steps should
advance by at least 1.0 (for a five level scale) and by less than 5.0 logits.
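Linacre's guideline for the step calibrations can be expressed as a simple check; this is a sketch of the rule just stated, not FACETS code, and the threshold values are invented.

```python
def steps_advance_properly(thresholds):
    """Step calibrations should advance monotonically, each gap being
    at least 1.0 logits (for a five-level scale) and less than 5.0."""
    gaps = [b - a for a, b in zip(thresholds, thresholds[1:])]
    return all(1.0 <= gap < 5.0 for gap in gaps)

# Hypothetical thresholds for a five-level scale (four steps):
ok = steps_advance_properly([-3.0, -1.0, 1.2, 3.0])       # well ordered
bad = steps_advance_properly([-1.0, -1.5, 2.0])           # disordered
```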
Each table presenting the statistics of two trait scales in the following chapter also
provides a short comment on how the different rating scale band levels (entitled
scale properties) were utilized. Usually this comment focuses on extremely un-
derused categories. Underutilization of levels can occur for two reasons. Firstly, it
could mean that the raters did not use that band level because the descriptors were
not clear to them or did not represent what is actually displayed in the writing
scripts. Alternatively, it could also mean that there were no student performances
at that level in the sample of scripts that the raters rated.
The last piece of information provided by FACETS that is of interest to researchers concerned with scale step functionality is the set of probability curves. Probability curves are a visual representation of the rating scale category statistics. Figure 50 below
presents the probability curve for the trait scale ‘grammatical complexity’ of the
existing DELNA rating scale (as an example). The horizontal axis represents the
candidate proficiency scale (in logits) and the vertical axis denotes the probability
of a score being awarded (from 0 to 1).
When examining probability curves, the chief concern is whether there is a sepa-
rate peak for each scale category probability curve and whether the curves appear
as an evenly spaced series of hills. If there are some categories (band levels) that
never become most probable (and therefore do not have separate peaks), then that
may suggest that one or more raters are experiencing problems when using the
rating scale. The intersection points of each scale curve are referred to as the rat-
ing scale category thresholds (which can also be found in Figure 49 above, the
rating scale category statistics). According to Myford and Wolfe (2003), a rating
scale category threshold represents the point at which the probability is 50% of a
candidate being rated in one or the other of these two adjacent categories, given
that the candidate is in one of them (Andrich, 1998). Figure 50 below, for example, shows that the band levels of the grammatical complexity trait scale were more
or less evenly spaced, although level 8 had a slightly lower peak than the other
band levels.
Very few noteworthy problems were identified with the scale probability curves.
To limit the length of Chapter 9, any problems are noted in the relevant tables un-
der scale properties only.
Figure 50: Scale category probability curve: Grammatical complexity – existing scale
8.5.4.2 Correlational analyses
PFA, like all inferential statistics, has several assumptions. First of all, the data set
should not include any extreme outliers. This was not a problem with this data set.
Further, none of the variables in the data should correlate too highly. SPSS in-
cludes a test for multicollinearity or singularity (the determinant). The results of
these tests are reported in the results section. Otherwise, PFA assumes linearity
and multivariate normality, assumptions which are both met by the data. It also
requires a large enough sample, which can be tested with the Kaiser-Meyer-Olkin
measure of sampling adequacy (A. Field, 2000).
Finally, to establish if the ratings based on the two scales are ranking the candi-
dates similarly, the candidate ability measures resulting from the two FACETS
analyses were correlated using a simple Pearson correlation.
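The final step above can be sketched as a plain Pearson correlation. In practice a statistics package would be used; this hand-rolled version is purely illustrative, and the sample data are invented.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between the candidate ability measures
    produced by the two FACETS analyses."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two hypothetical sets of ability measures that rank candidates identically.
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```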
Both questionnaires and interviews were saved as text files and then coded manu-
ally by the researcher. Some coding categories (or themes) were devised a priori
based on the questions, and some emerged during the coding process. To refine
the coding process, the questionnaires were first read thoroughly and categories
were identified. Then the data were grouped according to these themes. A second
researcher was asked to verify the selection of categories and code a subset of
three questionnaires.
The broad, overarching categories (or themes) identified for both questionnaires
and interviews can be found in Table 81 below.
Chapter 9: Results – Validation of Rating Scale
The following chapter presents the findings from the validation phase of the
study. The first section displays the findings of the quantitative analysis (Research
Question 2a), while the second part of the chapter presents the findings from the
questionnaires and interviews (Research Question 2b).
Do the ratings produced using the two rating scales differ in terms of (a) the
discrimination between candidates, (b) rater spread and agreement, (c) vari-
ability in the ratings, (d) rating scale properties and (e) what the different
traits measure?
The results for Research Question 2a are presented in two parts to aid comprehen-
sion. Firstly, the analysis of the individual trait scales is presented. To compare
the two rating scales (the existing DELNA scale and the new scale), the results for
corresponding trait scales are presented together. After the results for the individ-
ual trait scales, the scales as a whole are compared.
The first two trait scales are those relating to accuracy (see Table 82 below). It can
be seen that the candidate separation ratio for the new scale was higher than that
for the existing DELNA scale, which suggests that the new scale was more dis-
criminating. The statistics indicating rater separation and reliability show that the
raters rated more similarly in terms of leniency and harshness when using the new
scale (indicated by the lower rater separation ratio) and ranked the candidates
more similarly (rater point biserial) and also chose the same band level of the rat-
ing scale more often (percentage exact agreement).
Table 82 above also presents the percentage of unusually high or low infit mean
square values exhibited by the raters. If, for example, two of the ten raters dis-
played very high infit mean square values, then the table indicates that twenty
percent of raters showed this tendency. Whilst three raters displayed either unac-
ceptably high or low infit mean square values when using the DELNA scale, no
raters rated with too little or too much variation when applying the new scale for
accuracy.
Finally, the section on scale properties indicates that for the existing scale both
outside levels were underutilized, whilst for the new scale, only level 9 was un-
derutilized.
In summary, it can be argued that when the two accuracy scales were compared,
all indicators point to the fact that the new scale functioned better.
Table 83 below shows a comparison of the four groups of statistics for the two
rating scales focussing on lexis. In this case, the discrimination of the new scale
was only slightly greater than that of the existing scale. That the candidate separa-
tion of the new scale was higher than that of the existing rating scale is surprising
given that the new scale had two fewer band levels.
Table 83: Rating scale statistics for vocabulary/spelling and lexical complexity

DELNA scale – Vocabulary and spelling
  Candidate discrimination:
    Candidate separation ratio: 4.38
  Rater separation and reliability:
    Rater separation ratio: 3.48
    Rater point biserial: .78
    % Exact agreement: 40.6%
  Variation in ratings:
    % Raters infit high: 20%
    % Raters infit low: 20%
  Scale properties:
    Scale: 4, 5 and 9 underused, strong central tendency
    Probability curve: low peak for level 5

New scale – Lexical complexity
  Candidate discrimination:
    Candidate separation ratio: 4.54
  Rater separation and reliability:
    Rater separation ratio: 6.65
    Rater point biserial: .85
    % Exact agreement: 49.7%
  Variation in ratings:
    % Raters infit high: 10%
    % Raters infit low: 10%
  Scale properties:
    Scale: well spread
Because the discrimination of the two scales cannot be easily compared in this
way, Mike Linacre (personal communication, July 2006) offered the formula in
Equation 1 below (an application of the Spearman-Brown Prophecy formula to
this situation) to equate the separation ratios of two scales with differing band lev-
els.
Candidate separation (new scale) = √[(no. of levels in new scale – 1) / (no. of levels in old scale – 1)] × candidate separation (old scale)
Equation 1: Equation used to predict the candidate separation of two scales with differing
levels
If the empirical candidate separation ratio (new scale) found in the FACETS
analysis exceeds what the formula predicts, then the new scale is more discrimi-
nating. The result of the formula predicts that if converted to only four levels, the
existing scale would only have a candidate separation ratio of 3.39. Therefore, the
new scale was clearly more discriminating, even though it had fewer band levels.
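Equation 1 can be checked against the vocabulary figures reported above; the function name is illustrative.

```python
import math

def predicted_separation(old_separation, old_levels, new_levels):
    """Equation 1 (an application of the Spearman-Brown Prophecy
    formula): the candidate separation the old scale would be
    expected to yield if it had the new scale's number of levels."""
    return math.sqrt((new_levels - 1) / (old_levels - 1)) * old_separation

# The six-level existing vocabulary scale (separation 4.38), reduced to
# four levels, is predicted to separate at about 3.39, as reported above.
predicted = predicted_separation(4.38, 6, 4)
```

Since the new scale's empirical separation ratio (4.54) exceeds this prediction, the new scale is the more discriminating despite its fewer band levels.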
As was found with the accuracy trait scales, both the rater point biserial correla-
tion coefficient and the exact agreement were higher for the new scale. However,
interestingly, the rater separation ratio indicates that the raters were more spread
out in terms of severity when using the new scale. So, although they seemed to be
ranking the candidates more similarly when using the new scale, the raters as a
group were more varied in terms of severity.
When using the new scale, fewer raters rated either inconsistently or overly con-
sistently (only 20% of raters compared to 40% of raters when applying the exist-
ing descriptors).
The raters strongly underused levels 4, 5 and 9 in the DELNA trait scale for vo-
cabulary and spelling. The category probability curve indicates a very low peak
for level 5. When the new scale for lexical complexity was used, the ratings were
well spread over all levels.
The existing DELNA scale has level descriptors for sentence structure. The de-
scriptors in this scale refer both to accuracy and complexity of sentences. Accu-
racy of sentences is covered in the new scale in the trait scale for accuracy, whilst
the analysis of the writing scripts showed no differences in grammatical complex-
ity between the different levels of writing. Therefore, for completeness of the re-
sults section, the scale statistics for the DELNA sentence structure trait scale are
presented here.
Table 84 above shows that the trait scale was discriminating. The rater separation
ratio indicates that the raters were not all rating exactly alike in terms of severity.
The single rater-rest of raters correlation (point biserial) was high, as was the ex-
act agreement. However, of the ten raters, four rated with either too much varia-
tion (inconsistently) or with too little variation (underusing the extreme levels of
the scale). An analysis of the use of the different band levels revealed that the
outer categories of this trait scale were generally underused by the raters.
One trait scale not found in the existing rating scale, but included in the new scale,
was the trait scale for repair fluency. Table 85 below displays the rating scale sta-
tistics for this new scale.
Table 85: Rating scale statistics for repair fluency
New scale – Repair fluency
Candidate discrimination:
Candidate separation ratio: 5.82
Rater separation and reliability:
Rater separation ratio: 5.34
Rater point biserial: .93
% Exact agreement: 61.9%
Variation in ratings:
% Raters infit high: 20%
% Raters infit low: 20%
Scale properties:
Scale: 9 underused, otherwise functioned well
Distribution: multimodal distribution
Probability curve: level 8 very high peak
The candidate separation ratio indicates that the discrimination of this new scale
was high. The inter-rater reliability, as indicated by the point biserial correlation,
was also high (.93), more so than for all other trait scales examined so far. The
same can be said for the exact agreement (61.9%). The rater separation ratio,
however, indicates large differences in severity between the most severe and the
most lenient rater in the group.
Nearly half of the raters were identified as rating either inconsistently or too con-
sistently.
The band levels of the scale were generally well used, except level 9, which was rarely utilized by the raters. The scale category statistics indicate that there are a
number of problems with this trait scale. Firstly, not enough raters utilized level 9,
which might cause problems with the stability of the statistics measured for this
trait scale. Secondly, the distribution over the rating scale levels was not normal
with one peak, but multimodal with several peaks. Thirdly, the scale category
probability curve indicates problems with level 8 of the scale, which had a very
high peak.
The next section presents the findings for the trait scales associated with para-
graphing (Table 86 below). As was the case with the two trait scales for vocabu-
lary, the two rating scales focussing on paragraphing did not have the same num-
ber of band levels. Whilst the DELNA scale had six levels, the new rating scale
for paragraphing was designed with only five levels. To compare the candidate
separation ratios of the two scales and therefore the discrimination of the two
scales, the formula in Equation 1 would have to be used. However, it is
very clear from the candidate separation ratios in Table 86 below, that the new
scale was a lot more discriminating. Also, the inter-rater reliability as indicated by
the rater point biserial coefficient and the exact agreement were higher and the
raters rated slightly more similarly in terms of severity.
When using the existing scale, half of the raters could be identified as rating in-
consistently or with too little variation, whilst only 20% of the raters displayed
similar behaviour when using the new scale. For both scales, the outer categories
were underutilized, but more so when the existing scale was used as the basis for
the ratings.
Table 87 below shows the findings of the trait scales for cohesion and coherence.
The existing scale had no category for coherence, and therefore the trait scale for
coherence of the new scale is displayed in the same table for space reasons.
The results for the trait scales for cohesion were mixed. The candidate separation
ratio of the new cohesion trait scale was lower than that of the existing cohesion
scale. However, this was another trait scale that had fewer levels than the existing
scale (only four levels compared to the six levels of the existing scale). When the
formula in Equation 1 was applied, the candidate separation ratio of 3.18 of the
new scale had to be compared to a candidate separation ratio of 2.79 for the exist-
ing scale. Therefore, if the number of levels in the two scales were equivalent, the
new scale would be more discriminating.
Table 87: Rating scale statistics for cohesion and coherence trait scales

DELNA scale – Cohesion
  Candidate discrimination:
    Candidate separation ratio: 3.62
  Rater separation and reliability:
    Rater separation ratio: 5.53
    Rater point biserial: .71
    % Exact agreement: 37.9%
  Variation in ratings:
    % Raters infit high: 20%
    % Raters infit low: 20%
  Scale properties:
    Scale: strong central tendency – 4, 5 and 9 underused, skewed
    Level 4: fewer than 10 observations

New scale – Cohesion
  Candidate discrimination:
    Candidate separation ratio: 3.18
  Rater separation and reliability:
    Rater separation ratio: 4.12
    Rater point biserial: .65
    % Exact agreement: 51.5%
  Variation in ratings:
    % Raters infit high: 0%
    % Raters infit low: 0%
  Scale properties:
    Scale: 4 underused, otherwise functioning well, slightly skewed

New scale – Coherence
  Candidate discrimination:
    Candidate separation ratio: 3.56
  Rater separation and reliability:
    Rater separation ratio: 4.62
    Rater point biserial: .72
    % Exact agreement: 36.1%
  Variation in ratings:
    % Raters infit high: 0%
    % Raters infit low: 0%
  Scale properties:
    Scale: 4 underused, otherwise functioning well, slightly skewed
The raters rated slightly more similarly in terms of severity when using the new
scale. When the inter-rater reliability of the two scales was compared, the results
were mixed. The rater point biserial correlation coefficient (the single rater – rest
of rater correlation coefficient) of the new cohesion scale was slightly lower than
that for the existing cohesion trait scale (.65 and .71 respectively). However, the
percentage of exact agreement was considerably higher for the new scale (51.5%)
than for the existing scale (37.9%). The reason for these differences can be attrib-
uted to the different number of levels in the scales. If there are fewer levels in a
scale, the chance of raters choosing the same category is likely to be higher.
The variation in the raters’ ratings shows that nearly half of the raters rated with
too much or too little variation when applying the existing scale, whilst none fell
into this category when applying the new scale. With the existing scale, the raters
as a group displayed a central tendency effect, because levels 4, 5 and 9 were un-
derused, whilst when using the new cohesion trait scale level 4 was underused.
The distribution of the ratings based on the existing scale was skewed and fewer
than ten observations were recorded for level 4, which might cause problems for
the reliability of the results.
Table 87 above also displays the trait scale of coherence. This new scale had no
equivalent in the existing scale although raters were trained to apply the cohesion
scale as a coherence and cohesion scale. The coherence trait scale was as dis-
criminating as the existing cohesion trait scale (although it has one level less) and
also displayed a similar level of inter-rater reliability (as measured by the rater
point biserial correlation coefficient and the percentage of exact agreement).
However, the raters rated slightly more similarly in terms of severity and there
were no raters identified as rating with too little or too much variation when using
this new coherence trait scale. All levels of the scale were applied by the raters,
although level 4 was slightly underused.
The next comparison between two rating scales focussed on the trait scales relat-
ing to style. The existing scale had descriptors pertaining to academic style in
general, whilst the new scale had descriptors only for the use of hedging because
the analysis of the writing scripts reported in Chapter 6 could find no other aspects
of academic style that clearly differentiated between the DELNA levels. Table 88
below displays the rating scale statistics for the two trait scales. The new scale for
hedging was clearly more discriminating (in this case both scales have six levels)
with a candidate separation ratio of 5.86 compared to the separation ratio of 3.32
for the existing scale.
The raters rated considerably more similarly in terms of severity when using the
new scale. Furthermore, both the inter-rater reliability statistics were significantly
higher than those of the existing scale. Fewer raters rated with too much or too
little variation (only 20% of the raters compared to 40% of raters when applying
the existing scale). A closer scrutiny of the use of the different band levels showed
that the raters (as a group) displayed a strong central tendency effect when using
the existing scale – levels 4, 5 and 9 were underutilized. When using the new
scale, only level 9 was underused. So few instances of 4s were identified by the
raters when using the DELNA scale that the results of the FACETS analysis
might not be stable.
The final group of rating scale statistic comparisons focuses on the trait scales re-
lating to content. Firstly, the two trait scales for data description were compared
(see Table 89 below). Both trait scales for data description had six band levels.
The new scale was more discriminating as can be seen by the higher candidate
separation ratio.
When using the existing trait scale for data description, the raters rated more alike
in terms of leniency and harshness. The inter-rater reliability statistics were, how-
ever, in favour of the new scale. No raters were found to be rating with too much
or too little variation when applying the new scale, whilst 20% of raters were
identified in these categories when using the existing scale. When applying the
existing DELNA descriptors, the outer levels of the scale (levels 4 and 9) were underutilized, whilst when the new scale was used, only level 4 was used rarely.
9.1.1.9 Content – interpretation of data scales
The next two trait scales compared in this analysis were the trait scales pertaining
to the interpretation of data (see Table 90 below). In this case, the new scale had
one level less than the existing scale (only five compared to six levels). But even
without applying the formula in the equation above, it can be seen that the new
scale was more discriminating between candidates. Both the rater point biserial
correlation coefficient and the percentage of exact agreement were higher for the
new scale, whilst the rater separation ratio was almost identical for the two scales.
The analysis of the variation in the ratings showed that two raters rated inconsis-
tently when applying the existing scale, whilst none fell in this category when ap-
plying the new scale. However, whereas only one rater rated with too little varia-
tion when using the existing scale, two raters displayed a central tendency effect
when applying the new scale. Overall, the two outside categories (band levels 4
and 9) were underused in the case of the existing scale, whilst level 4, the lowest
level was underused when the new scale was employed. For both scales, the peaks
of band level 7 on the probability curves were low.
The final two trait scales compared in this section were the two trait scales per-
taining to the rating of the content of part three of the prompt, in which the writers
were asked to either compare the data to the situation in their own country or ex-
tend the ideas developed in the interpretation section to the future. The summary
statistics for the two trait scales can be found in Table 91 below.
Table 91: Rating scale statistics for part three of the content

DELNA scale – Part three of content
  Candidate discrimination:
    Candidate separation ratio: 3.89
  Rater separation and reliability:
    Rater separation ratio: 4.19
    Rater point biserial: .77
    % Exact agreement: 36.1%
  Variation in ratings:
    % Raters infit high: 30%
    % Raters infit low: 20%
  Scale properties:
    Scale: category 9 underused, skewed
    Probability curve: very flat

New scale – Part three of content
  Candidate discrimination:
    Candidate separation ratio: 4.65
  Rater separation and reliability:
    Rater separation ratio: 3.66
    Rater point biserial: .90
    % Exact agreement: 57.5%
  Variation in ratings:
    % Raters infit high: 10%
    % Raters infit low: 10%
  Scale properties:
    Scale: 4 underused, slightly skewed
As was the case with the two trait scales used for the interpretation of the data, the
number of band levels differed for these two trait scales. Again, the new scale had
only five levels, whilst the existing scale had, like all DELNA trait scales, six lev-
els. Without applying the formula in Equation X above, it can be seen that al-
though the new trait scale had a level less, it was more discriminating. Further-
more, the rater reliability and separation statistics show that the new rating scale
was applied more similarly by the raters. Both the rater point biserial correlation
coefficient and the percentage of exact agreement were higher for the new scale
(.90 and 57.5% respectively) than those for the existing scale (.77 and 36.1% re-
spectively) and the raters rated more alike in terms of severity when applying this
new scale.
Considerably fewer raters were identified as rating with too much or too little
variation when applying the new scale (only 20% of raters) than when applying
the existing scale (half of the raters). Finally, when the underutilized band levels on the trait scales
were established, the scales were very similar. In the case of the existing scale,
level 9, the highest level, was underused by the raters, whilst in the case of the
new scale, level 4, the lowest level, was used very little. The probability curve for
the existing scale was very flat.
After the individual trait scales were analysed and compared, it was of further in-
terest to explore how the two scales as a whole performed. Figures 51 and 52 be-
low present the Wright maps of the two rating scales.
Figure 51: Wright map of entire DELNA scale including all trait scales
Figure 52: Wright map of entire new rating scale including all trait scales
The left-hand column of each Wright map displays the logit values ranging from
positive values to negative values. The second column shows the candidates in the
sample indicated as asterisks. Higher ability candidates are plotted higher on the
logit scale, whilst lower ability candidates can be found lower on the logit scale.
The next column in the Wright map represents the raters, here indicated as num-
bers. Raters plotted higher on this map rated more severely than those plotted
lower on the map. The wide column in the centre of each Wright map shows the
traits in each rating scale. More difficult traits are plotted higher on the map than
easier traits. Finally, the narrow columns on the right of each Wright map repre-
sent the trait scales (with band levels) in the order they were entered into
FACETS. For the existing scale, these are from left to right: organisation (S1),
cohesion (S2), style (S3), data description (S4), data interpretation (S5), part three
of prompt (S6), sentence structure (S7), grammatical accuracy (S8) and vocabu-
lary/spelling (S9). For the new scale, these are from left to right: accuracy (S1),
repair fluency (S2), lexical complexity (S3), paragraphing (S4), hedging (S5), data
description (S6), data interpretation (S7), part three of prompt (S8), coherence
(S9) and cohesion (S10). These columns display the different band levels avail-
able to raters for each particular trait scale and how these were spread in terms of
difficulty. Myford and Wolfe (2003) describe the horizontal dotted line across a
column (rating scale threshold) as an indicator of the point at which the likelihood
of a candidate receiving the next higher rating begins to exceed the likelihood of a
candidate receiving the next lower rating. For example, a candidate pictured on
the same logit as a category boundary between two band levels has a 50% chance
of being awarded either of these two band levels.
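The 50% interpretation of a rating scale threshold can be checked with a small sketch of the Rasch-Andrich rating scale model that underlies FACETS. The trait difficulty and threshold values below are hypothetical; when the candidate measure sits exactly on a threshold, the two adjacent band levels come out equally likely.

```python
import numpy as np

def category_probabilities(theta, difficulty, thresholds):
    """Rating-scale (Rasch-Andrich) model: probability of each band level.

    thresholds[k] is the step from category k to k+1; at theta equal to a
    threshold, the two adjacent categories are equally likely (50/50)."""
    steps = theta - difficulty - np.asarray(thresholds, dtype=float)
    logits = np.concatenate(([0.0], np.cumsum(steps)))  # log-odds per category
    expd = np.exp(logits - logits.max())                # stabilised softmax
    return expd / expd.sum()

# Hypothetical trait with difficulty 0 and thresholds at -1.5, 0 and 1.5 logits;
# a candidate at exactly 1.5 logits sits on the top threshold:
p = category_probabilities(theta=1.5, difficulty=0.0,
                           thresholds=[-1.5, 0.0, 1.5])
```

Here the two highest categories receive identical probabilities, which is precisely the Myford and Wolfe reading of the dotted threshold lines on the Wright maps.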
When the two Wright maps were compared, the following observations could be
made. First of all, when the raters used the existing scale, the candidates were
more spread out, ranging over five logits. When the raters employed the new
scale, the candidates were spread over only three logits. Therefore, although most
individual trait scales on the new scale were more discriminating (as shown in the
section on individual trait scales earlier in this chapter), it seems that as a whole,
the existing scale was more discriminating. This is also confirmed by the first of
the rating scale statistics for the whole scale, the candidate separation ratio, dis-
played in Table 92 below.
It also became apparent that the raters were a lot less spread out when using the
new scale. Their severity measures (in logits) ranged from .25 (for the harshest
rater) to -.21 (for the most lenient rater), a range of less than half a logit. When
employing the existing scale, the raters were spread from .64 to -.74 logits, a
range of nearly one and a half logits. That the raters rated more similarly in terms
of severity could also be seen by the inter-rater reliability statistics in Table 92,
which showed that the exact agreement was higher when the new scale was used
(51.2%) than when the existing scale was applied (37.9%). The rater point biserial
correlation coefficient, however, was lower when the new scale was used.
Table 92: Rating scale statistics for entire existing and new rating scales

DELNA scale
  Candidate discrimination:
    Candidate separation ratio: 8.15
  Rater separation and reliability:
    Rater separation ratio: 8.67
    Rater point biserial: .47
    % Exact agreement: 37.9%
  Variation in ratings:
    % Raters infit high: 40%
    % Raters infit low: 10%
  Trait statistics:
    Spread of trait measures: .78 to -.37
    Trait separation: 9.14
    Trait fit values: data and part three much over 1.3, no low
  Scale properties:
    Scale: central tendency, 4 and 9 underused

New scale
  Candidate discrimination:
    Candidate separation ratio: 5.34
  Rater separation and reliability:
    Rater separation ratio: 4.19
    Rater point biserial: .38
    % Exact agreement: 51.2%
  Variation in ratings:
    % Raters infit high: 0%
    % Raters infit low: 0%
  Trait statistics:
    Spread of trait measures: .53 to -.76
    Trait separation: 12.47
    Trait fit values: repair fluency and data slightly high, lexis and coherence low
  Scale properties:
    Scale: 9 underused, otherwise good
Next, the number of raters displaying too much or too little variability in their
ratings was scrutinized. For the existing scale, half the raters fell into one of these
categories, whilst no raters did for the new scale.
The scale property statistics indicate that on the existing scale the outer levels
were underused. Only level 9 on the new scale was not sufficiently utilized.
When the different traits were examined on the two Wright maps, it became clear
that the traits on the new scale were slightly more spread out in terms of difficulty,
ranging from .78 on the logit scale for repair fluency to -.74 for cohesion, a differ-
ence of one and a half logits. On the DELNA scale the traits spread from .78 (for
Data – part three) to -.37 (for style), a difference of just over one logit. In a crite-
rion-referenced situation as was the case when these rating scales were used, it is
not necessarily a problem to have a bunching up of traits around the zero logit
point, as is found in Figure 51 with the traits on the existing rating scale. How-
ever, it indicates that raters had difficulty distinguishing between the different
traits or that the traits were related or dependent on each other (Carol Myford,
personal communication, July 2006). The fact that the traits in Figure 52 (new
scale) were more spread out shows that the different traits were measuring differ-
ent aspects.
Apart from indicating the difficulty of the traits, the columns on the right of each
Wright map representing the different trait scales also provide a visual display of
how the different band levels were used by the raters as a group. Wider band lev-
els show a higher probability of a candidate being awarded that level. The outside
levels of the trait scales are indicated in brackets only, because they are infinitely
wide. When the two Wright maps were compared, it became clear that the rating
scale thresholds of the new scale were a lot less tidy. There were greater differ-
ences between the thresholds than when the existing scale was employed. This is
another indication that not all the traits in the new scale were measuring the same
aspects of writing. If the traits were not measuring the same underlying construct,
then this explains why both the candidate separation of the new scale and the rater
point biserial correlation coefficient of the new scale were lower than that of the
existing scale.
To explore this idea further, the item correlations were scrutinized. Like the rater
point biserial correlation, the item point biserial correlation coefficient measures
how similar the different traits are in terms of what they are measuring. Each cor-
relation coefficient shows the correlation of that particular trait with all the other
traits in the rating scale. Table 93 below presents the different traits in each rating
scale with the associated point biserial correlation coefficient. The average corre-
lation coefficient for each scale can be found in the last row of the table.
Table 93: Item correlation coefficients for existing DELNA scale and new scale

DELNA scale                   Pt Biserial
  Accuracy                    .55
  Sentence structure          .54
  Vocabulary and spelling     .55
  Cohesion                    .53
  Style                       .50
  Organisation                .39
  Data description            .35
  Data interpretation         .48
  Data Part 3                 .38
  Mean                        .47

New scale                     Pt Biserial
  Accuracy                    .52
  Repair fluency              .27
  Lexical complexity          .50
  Coherence                   .47
  Cohesion                    .43
  Hedging                     .27
  Paragraphing                .07
  Data description            .20
  Data interpretation         .29
  Data Part 3                 .32
  Mean                        .33
The correlation coefficients in Table 93 above show that whilst the traits in the
existing scale seemed to be measuring similar aspects of writing, the correlation
coefficients in the new scale were dissimilar to each other. In particular, para-
graphing, hedging, repair fluency and all trait scales pertaining to content resulted
in very low correlation coefficients.
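One common analogue of the item point biserial described here is the corrected item-total correlation: each trait's scores correlated with the sum of the remaining traits. The sketch below uses hypothetical data in which two traits share an underlying ability and a third measures something unrelated, so the third trait yields a much lower coefficient, mirroring the pattern in Table 93.

```python
import numpy as np

# Hypothetical score matrix: rows = scripts, columns = traits.
rng = np.random.default_rng(0)
base = rng.normal(size=60)                   # shared writing ability
scores = np.column_stack([
    base + rng.normal(scale=0.5, size=60),   # trait aligned with the rest
    base + rng.normal(scale=0.5, size=60),   # another aligned trait
    rng.normal(size=60),                     # trait measuring something else
])

def item_total_correlation(matrix, item):
    """Correlate one trait with the total of the remaining traits."""
    rest = np.delete(matrix, item, axis=1).sum(axis=1)
    return np.corrcoef(matrix[:, item], rest)[0, 1]

r = [item_total_correlation(scores, i) for i in range(scores.shape[1])]
```

Traits that tap the same construct as the rest of the scale produce high coefficients; traits contributing distinct information (like paragraphing or repair fluency above) produce low ones.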
Because Table 93 above indicates only that the traits were measuring different
underlying abilities, but not how many different groups of traits the data was
measuring, a principal axis factor analysis (or principal factor analysis – PFA)
was performed on the rating data of each of the two rating scales. This was done
because PFA is able to reduce a set of variables to a smaller number of underlying
factors. To ensure suitability of the data for a PFA, the determinant was calcu-
lated, which tests for multicollinearity or singularity. The determinant of the R-
matrix should be greater than 0.00001 (A. Field, 2000). In addition, the Kaiser-
Meyer-Olkin (KMO) measure of sample adequacy was calculated, which should
be greater than .5. The results of these can be found in Table 94 below.
Table 94: Determinants and KMO statistics for principal factor analyses

                 DELNA scale    New scale
  Determinant    .001           .071
  KMO            .937           .814
The results in Table 94 above indicate that both data sets were suitable for a prin-
cipal factor analysis.
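Both suitability checks can be computed directly from the correlation matrix. The sketch below (with hypothetical ratings) calculates the determinant of the R-matrix and the KMO measure from the partial correlations of the anti-image matrix, following the standard formula.

```python
import numpy as np

def kmo_and_determinant(data):
    """Determinant of the correlation matrix and the KMO sampling-adequacy
    measure, computed from the partial correlations (anti-image matrix)."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations: -inv_ij / sqrt(inv_ii * inv_jj)
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d
    off = ~np.eye(corr.shape[0], dtype=bool)   # off-diagonal mask
    r2 = (corr[off] ** 2).sum()
    p2 = (partial[off] ** 2).sum()
    kmo = r2 / (r2 + p2)
    return np.linalg.det(corr), kmo

# Hypothetical ratings: three traits driven by one shared ability plus noise.
rng = np.random.default_rng(1)
ability = rng.normal(size=(100, 1))
ratings = ability + rng.normal(scale=0.6, size=(100, 3))
det, kmo = kmo_and_determinant(ratings)
```

A determinant above 0.00001 rules out multicollinearity or singularity, and a KMO above .5 indicates the correlations are compact enough for factoring, which is what Table 94 confirmed for both data sets.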
PFA reduces the data in hand into a number of components, each with an eigen-
value representing the amount of variance of the components. Components with
low eigenvalues are discarded from the analysis, as they are not seen to be con-
tributing enough to the overall variance. Table 95 (DELNA scale) and Table 96
(new scale) below show the results from the principal factor analysis.
The scree plots, which provide a visual representation of the eigenvalues, can be
found in Figure 53 below.
Figure 53: Scree plots of principal factor analysis
– DELNA scale (top) and new scale (bottom)
Both the scree plots and the tables displaying the results from the principal factor
analysis show that when the existing rating scale was analyzed, only one major
component was found. This component had an eigenvalue of 5.8 and accounted
for about 65% of the entire variance. All other eigenvalues were clearly below 1
(following Kaiser, 1960) and below .7 (following Jolliffe, 1986) and there was no
further leveling off point on the scree plot. When the new scale was analyzed,
however, the results were different. The principal factor analysis resulted in three
components with eigenvalues over 1, and a further three above 0.7 after which a
leveling off could be seen in the scree plot (resulting in six components).
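The eigenvalue-based retention decisions just described can be sketched as follows. The score matrix is hypothetical: four traits share an underlying factor and two are unrelated, so the leading eigenvalue is large while the retention counts under the two criteria can be read off directly.

```python
import numpy as np

# Hypothetical score matrix: rows = scripts, columns = traits.
rng = np.random.default_rng(2)
shared = rng.normal(size=(100, 1))
traits = np.hstack([shared + rng.normal(scale=0.5, size=(100, 4)),
                    rng.normal(size=(100, 2))])   # two unrelated traits

# Eigenvalues of the correlation matrix, largest first.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(traits, rowvar=False))[::-1]
kaiser = int((eigenvalues > 1.0).sum())    # Kaiser (1960): retain > 1
jolliffe = int((eigenvalues > 0.7).sum())  # Jolliffe (1986): retain > .7
pct_variance = eigenvalues / eigenvalues.sum() * 100
```

Because each eigenvalue is a component's share of the total variance (the eigenvalues of a correlation matrix sum to the number of traits), the percentages reported in the text follow immediately from this decomposition.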
The next step in a PFA is to identify which variables load onto which component.
For this, a rotation of the data is necessary. However, because only one compo-
nent was identified for the existing scale, no factor loadings can be displayed. A
varimax rotation was chosen to facilitate the interpretation of the factors of the
new scale. A trait was considered to be loading on a factor if the loading was
higher than .4 (as indicated in bold font). The loadings on the six factors for the
new scale can be seen in Table 97 below.
Table 97: Loadings for principal factor analysis

                                        Component
                           1      2      3      4      5      6
  Accuracy               .796   .119  -.045   .037   .168  -.035
  Repair fluency         .155  -.005   .067  -.062   .959  -.064
  Lexical complexity     .731   .112   .288   .116   .106   .004
  Paragraphing           .009  -.037   .072   .029  -.061   .992
  Hedging                .174   .945   .056  -.009  -.030  -.041
  Data description       .141   .017   .025   .971  -.062   .030
  Interpretation of data .338   .448  -.340   .253   .241   .009
  Content – Part 3       .269   .105   .850   .090   .139   .092
  Coherence              .867   .097   .009   .083   .016   .030
  Cohesion               .875   .089   .064   .064   .039   .021
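The varimax rotation used to produce loadings like those in Table 97 can be implemented in a few lines with the classic SVD-based algorithm. The loading matrix below is hypothetical: two clean factors deliberately mixed by a 30-degree rotation, which a correct varimax implementation should rotate back towards simple structure.

```python
import numpy as np

def varimax(loadings, tol=1e-8, max_iter=200):
    """Classic varimax rotation: find the orthogonal rotation that
    maximises the variance of the squared loadings per factor."""
    L = np.asarray(loadings, dtype=float)
    n, k = L.shape
    R = np.eye(k)
    crit_old = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (rotated ** 3
                   - rotated @ np.diag((rotated ** 2).sum(axis=0)) / n))
        R = u @ vt
        crit_new = s.sum()
        if crit_new - crit_old < tol:
            break
        crit_old = crit_new
    return L @ R

# Hypothetical simple structure, deliberately mixed by a 30-degree rotation:
simple = np.array([[0.9, 0.0], [0.8, 0.1], [0.1, 0.9], [0.0, 0.8]])
angle = np.pi / 6
mix = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
recovered = varimax(simple @ mix)
```

After rotation each trait again loads strongly on one factor only (up to sign and column order), which is the "simple structure" that makes loading tables like Table 97 interpretable.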
The factor loadings for the six different components can be described in the fol-
lowing way. The largest factor, accounting for 34% of the variance was made up
of accuracy, lexical complexity, coherence and cohesion. This factor can be de-
scribed as a general writing ability factor. The second factor, which accounted for
a further 13% of the variance, was made up of hedging and interpretation of data.
This is, at first glance, an unusual factor. However, it can be argued that writers
need to make use of hedging in the section where the data is interpreted since the
writer is speculating rather than stating facts. For this reason, a writer who scored
high on hedging might also have put forward more ideas in this section of the es-
say. The third factor, which accounted for 12% of the variance, consisted of part
three of the content, the section in which writers are required to extend their ideas.
The fourth factor, which accounted for 9% of the variance, was another content
factor, the description of data. That all three parts of the content load on separate
factors shows that they were all measuring different aspects of content. Repair
fluency was the only measure that loaded on the fifth factor, which accounted for
another 8% of the variance. The last factor, which also accounted for 8% of the
variance, had only paragraphing loading on it. The six factors together accounted
for 83% of the entire variance of the scores, whilst the single factor found in the
analysis of the existing rating scale accounted for only 64% of the variance.
It can therefore be argued that the ratings based on the new scale accounted for
not only more aspects of writing ability, but also for a larger amount of variation
of the scores. In other words, there was less unaccounted variance when the new
scale was used.
A further investigation was conducted into the score profiles produced by individ-
ual raters. The bunching up of traits around the zero logit on the existing scale and
the single factor found in the principal factor analysis described above, indicate
that raters probably displayed a halo effect when using the existing scale. There-
fore, the score profiles for each individual candidate were scrutinized. If a rater
awarded the same score seven or more times (of the nine or ten traits on the two
scales respectively), this was considered to be rating with a halo effect. Table 98
shows that raters displayed a halo effect in the case of nearly half the scripts when
employing the DELNA scale, whilst 12% of scripts were rated with a halo effect
when the new scale was employed.
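The halo-effect criterion described above (the same band level awarded seven or more times across a script's trait scales) is straightforward to operationalise. The score profiles below are hypothetical.

```python
from collections import Counter

def rated_with_halo(script_scores, threshold=7):
    """True if the most frequent band level in one script's profile was
    awarded `threshold` or more times across the trait scales."""
    _, count = Counter(script_scores).most_common(1)[0]
    return count >= threshold

# Hypothetical ten-trait score profiles for two scripts:
flat_profile = [6, 6, 6, 6, 6, 6, 6, 7, 6, 6]     # nine identical scores
varied_profile = [6, 7, 5, 6, 8, 7, 6, 5, 7, 8]   # no dominant level
```

Applying such a flag to every script and rater, and then taking the percentage of flagged scripts per scale, yields comparisons of the kind reported in Table 98.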
It was then of interest whether the two rating scales resulted in similar rankings of
the one hundred candidates. For this purpose, the candidate ability measures were
correlated by means of a Pearson correlation and a scatterplot was created to rep-
resent the results visually. The scatterplot (seen in Figure 67 below) indicates a
reasonable correlation between the two variables ‘old’ for the existing scale and
‘new’ for the new scale, although there was some scatter in the lower and higher
ability categories. The correlation coefficient showed a strong positive correlation,
r = .891, p < .001. This indicates that the two scales, when employed as a whole,
ranked the candidates reasonably similarly, especially around the middle ability
levels.
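The ranking comparison can be reproduced in outline with synthetic data. The measures below are hypothetical and merely tuned so that the correlation lands near the reported range; they are not the study's data.

```python
import numpy as np

# Hypothetical candidate ability measures (logits) from the two scales:
rng = np.random.default_rng(3)
old_measures = rng.normal(size=100)                            # existing scale
new_measures = 0.6 * old_measures + rng.normal(scale=0.3,
                                               size=100)       # new scale

r = np.corrcoef(old_measures, new_measures)[0, 1]              # Pearson r
```

A Pearson coefficient close to 1 means the two scales order the candidates almost identically, even where their absolute logit measures differ.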
The findings described above show that, in general, the individual trait scales of
the newly developed rating scale were functioning better in a variety of respects
when compared to the existing scale. However, when the two scales as a whole
were analyzed, the existing scale resulted in a higher candidate discrimination. A
principal factor analysis showed that this can be explained by the number of dif-
ferent aspects that the two scales appeared to be measuring. The trait scales on the
existing scale generally appeared to be evaluating the same aspect of writing,
whilst a number of trait scales of the new scale seemed to be assessing different
information about candidates’ writing performance not measured by the existing
scale. The factor analysis of the new scale resulted in six factors before a leveling
off could be seen in the scree plot, whilst the factor analysis of the scores of the
existing scale only produced one large factor. All raters rated fewer scripts with a
halo effect when using the new scale than when using the DELNA scale. Overall,
the two scales ranked the candidates very similarly.
The following section, which is dedicated to Research Question 2b, will present
the qualitative findings based on the questionnaire and interview results:
What are raters’ perceptions of the two different rating scales for writing?
To answer this research question, the findings from the questionnaire will be re-
ported first. The questionnaire focussed on a number of questions ranging from
the raters’ overall impression of the new scale, and the rating process employed
when using the new scale, to questions about each trait scale individually. An im-
portant point to note here is that the questionnaire focussed only on the new scale.
Questions about the existing scale were asked only during the interviews reported
later in this chapter. The questionnaire questions can be found in Table 77 in
Chapter 8.
Because the answers provided to the questionnaire were generally very short and
very few themes emerged, the findings for each question will, where possible, be
provided in tables with illustrative excerpts.
The first question asked the raters what they liked about the new scale. Raters
mentioned that they thought it was good that the scale was ‘more objective’ (Rater
10), that it ‘gives more guidance’ (Rater 1) and that it made them ‘focus on the
underlying language skills’ (Rater 8). Five raters indicated that they thought that
the rating scale made the rating process easier, because ‘the categories were de-
scriptive’ (Rater 2), the scale made the rating process ‘more mechanical’ (Raters 4
and 6) and it was therefore ‘quick’ (Rater 5) and ‘easy to arrive at a score’ (Rater
7), as well as the fact that the scale was ‘clearly set out without any ambiguities’
(Rater 3). Finally, Rater 2 thought that this kind of ‘descriptive’ rating would
benefit the students in that ‘more specific information about where they did well,
and what they need to work on’ could be provided.
The next question in the questionnaire asked the raters if they thought any catego-
ries were missing in the new scale. The findings are summarized in Table 99 be-
low:
Half of the raters thought that no categories were missing. The four raters who
listed aspects of writing which they thought the new scale did not account for,
however, produced a list of nine different categories1.
The third question asked if the raters thought any of the trait scales had the wrong
number of band levels. Table 100 below presents an overview of the results from
that section:
Five of the raters thought that some of the categories did not have the right num-
ber of levels. Only two raters, however, supplied reasons for their choices. The
two comments provided by Rater 3 (about paragraphing) and Rater 7 (about lexi-
cal complexity) can be seen in Table 100 above.
The fourth question asked the raters if they thought the wording of any of the de-
scriptors needed changing. A brief summary of the results is presented in Table
101 below:
As Table 101 above shows, most raters were satisfied with the wording in the rat-
ing scale. Rater 5 criticized lexical complexity and accuracy for not allowing him
to use the whole range of levels, whilst Rater 8 criticized one particular level de-
scriptor for data description.
The fifth question asked the raters if they thought any of the descriptors were dif-
ficult to apply. All the raters stated that the descriptors for coherence were a little
difficult to apply, although four of the raters (Raters 1, 2, 8 and 9) mentioned that
they got used to using them with practice. Rater 5 also found accuracy, fluency
and paragraphing difficult to apply, but failed to give more specific reasons and
two raters (Raters 4 and 10) reported having problems applying the descriptors for
cohesion.
In the sixth question raters were asked if they at any time resorted to thinking of a
holistic (overall) score for a script to arrive at a band level for a particular trait
scale. Typical rater responses to this question can be found in Table 102 below.
Table 102: Summary of answers to Question 6

No: 7 raters
- ‘I felt the scale didn’t allow for that. To make a script fit to a descriptor, I had to actually use the descriptors as guidance, much more than I would do using the DELNA scale’ (Rater 1)
Yes: 3 raters
- ‘Occasionally for coherence’ (Raters 7 and 8)
- ‘I occasionally [did this] if I felt the scripts would rate too high on the scale, for example, high on paragraphing, accuracy, hedges, fluency (the areas easier to get high marks), which would bring up the final mark, when I feel the student required more help. I think it is useful sometimes to have an idea about the final score’ (Rater 2)

Did it bother you that you did not know what the final score for each script would come out as?
No: 10 raters
- ‘liberating not always having to assess so subjectively’ (Rater 4)
- ‘I abandoned my idea of a script as ‘roughly seven’, for example, as I could see that scripts were coming out with considerable variation in scores’ (Rater 10)
The raters were further asked if it bothered them not to know how the final score
for each script would be derived. Table 102 above shows that none of the raters
was concerned with the final score. Two illustrative quotes by Raters 4 and 10
exemplify this.
The final part of the questionnaire asked the raters to provide any specific com-
ments about the different trait scales. The remarks made about each trait scale are
summarized in Table 103 below.
Table 103: Summary of answers to Question 7 (cont.)

Lexical complexity
Positive remarks: 7 raters
- ‘the Academic Word List helped and after a while I found I got better at applying this category. It made me look in much more detail at the vocabulary produced’ (Rater 1)
- ‘this category is very indicative of competence’ (Rater 5)
Negative remarks: 3 raters
- ‘I feel the graph and the topic are more relevant than Academic Word List. For example, has the student used language that is applicable to the task?’ (Rater 2)
- ‘too strict’ (Rater 7)

Paragraphing
Positive remarks: 3 raters
- ‘picked up many writers who did not have a clear grasp of paragraphing structure and of one main item per paragraph, and introduction and conclusion’ (Rater 4)
- ‘logical’ (Rater 7)
Negative remarks: 5 raters
- ‘illogical paragraphing not penalized’ (Rater 3)
- ‘easy category for students to score well in’ (Rater 10)
No comments: 2 raters

Hedges
Positive comments: 6 raters
- ‘this is a great addition as the previous scale lacked such subtlety’ (Rater 7)
- ‘good, interesting, effective’ (Rater 8) and Rater 1 noted that ‘the scale was easy to follow’
Negative comments: 4 raters
- ‘some good scripts managed with no hedging. Is it really necessary? Does it show lack of understanding of academic style to do without? Probably, yes. Many picked up hedging from the question. That should have given all the hint that it was necessary to remember to use it’ (Rater 4)
- ‘I was not quite satisfied with this category as I felt that it only measured explicit hedging devices’ (Rater 10)

Content – data description
Positive comments: 8 raters
- ‘it was very easy to apply. Clearer than the DELNA scale’ (Rater 1)
- ‘was useful to know what to look for as opposed to the DELNA scale’ (Rater 2)
No comments: 2 raters

Content – interpretation of data and Part 3 (the comments were identical for both categories and are therefore reported together)
Positive comments: 8 raters
- ‘the scale really helped – having a number of ideas made it easier to score’ (Rater 7)
- ‘this was an improvement on the DELNA - very specific and clear’ (Rater 3)
Negative comments: 2 raters
- ‘as the emphasis was on quantitative measurement in the descriptors, the quality of ideas, appropriateness, clarity of expression, depth of explanation and support, any repetition got ignored a little’ (Rater 5)
- ‘whether an idea was supported is what should get more marks. Two well-supported reasons should score higher than three unsupported reasons’ (Rater 9)

Coherence
Positive comments: 1 rater
- ‘it is doing the writers more justice in this way’ (Rater 1)
Negative comments: 7 raters
- ‘descriptors are long and sometimes confusing’ (Rater 5)
- ‘I had trouble coming to grips with coherence’ (Rater 6)
No comments: 2 raters

Cohesion
Positive comments: 5 raters
- ‘nicely specific and easier to make accurate judgments than the DELNA descriptors’ (Rater 3)
Negative comments: 3 raters
- ‘difficult to categorize with fairness. Connected to lexical complexity and coherence’ (Rater 4)
No comments: 2 raters
Overall, Table 103 above indicates that there was general support for the catego-
ries of accuracy, lexical complexity and the three content trait scales. The remarks
were slightly more mixed, but still overall positive, for repair fluency, hedging
and cohesion. Raters were least in favour of the descriptors for paragraphing
and coherence.
In summary, it can be said that in general the new rating scale was perceived as
positive. The raters were happy with the categories included in the scale, although
some raters made suggestions for other traits that could be added. There were also
some suggestions about levels and descriptors that needed changing, but no clear
consensus could be reached among the participants. The raters seemed to use the
new scale as intended, in an analytic fashion, by focussing on the individual cate-
gories, but some indicated that they reverted to holistic overall impressionistic rat-
ing when they found descriptors problematic or they did not agree with the cali-
bration of a scale. Most individual trait scales were positively received, but a
number of raters had reservations about the descriptors for coherence and para-
graphing.
As mentioned in the methodology section, all interviews were conducted after the
analysis of the quantitative data from the rating rounds was completed. This was
done in the hope that some interesting interview topics might emerge from the
results of the data analysis. The broad interview topics can be seen in Table 78 in
Chapter 8.
The rater comments are presented in this section without any evaluation or discus-
sion. A discussion of the interview data can be found in the following chapter
(Chapter 10).
Several themes emerged from the analysis of the interviews. These themes have
been grouped into three broad categories for the presentation of results. Firstly,
themes emerging about the existing DELNA scale are discussed. Then, any
themes pertaining to the new scale are described. The final section presents a
number of more general themes relating to a comparison of the two scales.
This section is divided into two parts. The first looks at the problems that raters
experienced when using the DELNA scale and the strategies that they used to
cope with these problems whilst the second section describes positive aspects of
the DELNA scale.
The most regularly emerging theme in the interviews was that raters often experi-
enced problems when using the DELNA scale. A broad outline of these problems
can be seen in the following list. Each will be discussed in turn:
One of the most commonly mentioned problems was that the raters thought the
descriptors were often too vague for them to easily arrive at a score. In the extract
below, for example, Rater 4 talked about the problems she encountered when de-
ciding on a score for Content part three. The sections in italics refer to the word-
ing in the DELNA descriptors. Any comments in square brackets were added by
the researcher to aid understanding and set the context if necessary:
Rater 4: [...] And here relevant and supported, I find that tricky support,
what exactly is support. Because sometimes it is actually, sometimes you
have a number of ideas but there is not much support for them and what is
sufficient. You see, I can, several times I thought that is sufficient, but oth-
ers have said there is not enough. You know, I think, oh well, that is okay
for a page essay or a page and a half or whatever you are supposed to
write. You just can’t, there is nothing specific there to hang things on. I
mean you get the idea and then you feel a bit mean if you just, if there is
just one idea and they might have supported it adequately, so is that idea
not sufficient, but it is certainly relevant and it is certainly expressed
clearly and it is certainly appropriately supported, but there might be only
one idea. Mmh and so where do you put it?
Problems with the vagueness of the DELNA descriptors were also reflected in the
comments by Raters 5 and 9 below:
Researcher: Would you like to change anything in the actual wording [of
the DELNA scale]?
Rater 5: [...] Sometimes I look at it [the descriptors] I’m going ‘what do
you mean by that?’ [...] You just kind of have to find a way around it
cause it’s not really descriptive enough, yeah.
A number of raters pointed directly to the adjectives used in the descriptors as the source of the scale's vagueness. In the example below, Rater 10 talked about the descriptors for vocabulary and spelling:
Rater 10: Well there’s always a bit of a problem with vocabulary and
spelling anyway in deciding you know the difference between extensive,
appropriate, adequate, limited and inadequate. So there’s sort of adverbial
[sic]. Yeah, it’s really just a sort of adverbial thingy anyway isn’t it so I
think I just go with gut instinct on that one. I probably would privilege vo-
cabulary I think over spelling also bearing in mind that they’re under
exam conditions
Researcher: Do these kinds of words like you just mentioned, are they a
problem in other categories as well for you?
Rater 10: Um, well I just end up going with my gut feeling. It’s quite simi-
lar I guess with things like, um, interpretation of data in part two with the
content where we go from wholly appropriate, sufficient or appropriate,
generally adequate, just adequate or often inaccurate.
Another problem that raters described was a concern that descriptors might mean
different things to different raters. This is, for example, reflected in Rater 4’s
comment about rating the category style.
Some raters also noted problems with several aspects of writing being mixed into
one scale category (as is the case, for example, with vocabulary and spelling).
Rater 3 raised this topic when she was asked what she would like to change about
the DELNA rating scale.
Rater 3: Well, I would say the vocabulary and spelling one when I think
about it now [...] I mean they could be a poor speller but have good vocab
and I’ve mentioned that, about the risk taking. It’s a matter then of judge-
ment, whether you can credit them and I think people do need to be cred-
ited otherwise you do have people being great spellers who never use
words that are very taxing.
Finally, raters sometimes had trouble differentiating rating scale categories from
each other. The extract below illustrates the problems raters seemed to experi-
ence, for example, in differentiating between the categories of sentence structure
and grammatical accuracy. Both scale categories refer to accuracy in the band de-
scriptors.
doesn’t have a major impact on the sentence structure overall, so maybe
that should be grammatical. So that is quite tricky. I find it hard to distin-
guish between those
Although most raters reported having problems deciding on band levels with the
DELNA scale, their methods of coping differed considerably across raters. A va-
riety of strategies (both conscious and subconscious) emerged from the inter-
views. These are discussed in turn below.
The first strategy that a number of raters referred to in their interviews was as-
signing a global score to a script, usually after the first reading. One rater, Rater
10, provided a reason for preferring this global type of marking; she comes from
a background where impressionistic rating is more commonly used.
This overall, holistic type rating often results in a halo effect, where a rater
awards the same score for a number of categories on the scale. Below, Rater 10
talked about awarding scores for the three categories grouped under fluency in the
DELNA scale (organisation, cohesion and style):
Rater 10: For style, again, I just tend to go with the gut instinct. And I sus-
pect I often tend to give the same grade or similar grades for cohesion and
style. Probably for the whole of fluency. [...] So in a way, it is almost like
giving a global mark for the three things in consideration. With, if some-
one had no paragraphing, but everything else was good, maybe a bit of
variation.
Apart from holistic rating and rating that results in a halo effect, some raters also
reported mainly using the middle levels of the DELNA scale, which results in a
central tendency effect. This is illustrated by the
comment by Rater 5 below:
Researcher: Do you feel you use the whole range of scores for style?
Rater 5: Again, I think it is similar to [...], I think that is probably similar
for me throughout. I tend not to use four and nine. I tend to move within
the other four. Probably, the most common marks I am sure I use are five,
six and seven and sometimes an eight.
Other raters, when having problems assigning a score, resorted to the strategy of
comparing scripts to each other. This is shown in Rater 10’s comment below:
Rater 10: I think it’s a bit like the case we talked about below where we go
from appropriate, appropriately adequate, da da da, that they [the descrip-
tors] are a little bit vague, but it does seem to work out in practice that I
just go with my gut instinct and I guess it’s really about comparing differ-
ent scripts so that when you don’t have a lot of them you can say ok this
one, yeah.
Some raters seemed to clearly disregard the DELNA descriptors and override
them with their own impression of a script. Rater 10 (below) was talking about the
score she would award for organisation to a script that had no clear paragraph
breaks but was otherwise well organised. The DELNA descriptors recommend
awarding either a five or a six.
Rater 10: Mmh [...] well, I think I would, (sighs), looking at this it ought
to be a six, but it is possible particularly if I suspected that it was a native
speaker, and that it was someone that wasn’t so strong in academic writing
but actually had very good English, I might even go up to a seven, but I
[...] yeah, if I had other reservations about the language and stuff, then I
would give it a six or even a five if it is really bad. But if I was sort of
convinced by the writer in every other way, I might well push the score up
in a way not to pull them down. Just for the paragraphing.
The final strategy reported when dealing with vague descriptors was to ask the
administrator for directions. For example, Rater 9 described how she deals with
the category of organisation for scripts that have not answered the third part of the
prompt:
Rater 9: This was the direction from the administrator. If there is no para-
graph for part three, then it isn’t fluent and organised effectively.
The problems arising from the use of the existing rating scale were discussed
above. The following section turns to positive themes that emerged about the
DELNA scale.
Whilst some raters seemed to agonize over differentiating between the different
band levels in the DELNA scale, other raters reported that they found them very
easy to use and had no problems assigning levels. This is illustrated in a quote
from Rater 5:
Researcher: What is the difference between a five and a six for cohesion?
Rater 5: It is just a matter of degree [...], this [level 5] would cause me a
lot of trouble reading, I would find it hard to read it ... But between here
[between five and six], this is when a non-native speaker is really strug-
gling, a five in cohesion is really struggling and it is really hard for me to
read it. [...] This [six] is like that, but not as bad. And then a little better
[for a seven]. So the difference between acceptable and not quite accept-
able and pretty bad, quite bad. Maybe I wouldn’t really go much between
eight and nine. I mean if it is really really well written, then a nine maybe,
if it really flows, no trouble reading it, wow, that is great.
Rater 9, as shown in the quote below, used her feeling of how a native speaker
would write as a measure or benchmark when deciding on a score for cohesion:
Rater 9: [...] if it was a native speaker for example, this would jump out at
you. .. This would be a kind of subliminal thing. On first reading, which I
would have already done for the content, I would have picked up on
whether it was effortless for me to follow the message. Effortless is differ-
ent from appropriate. Yeah, I think effortless is something you just notice
or not. And I don’t have to judge, I don’t have to go, oh, they used these
types cohesive devices, I just know. And this is, yeah, and then for this
one (eight), it is really not bad at all, like a first draft for example by a
good writer. And again, the middle area (sighs), excellent second language
writer (seven) who would cause me slight strain, but who is not a particu-
larly great writer, but has other things going. But again, that is an impres-
sion.
As already reflected in the questionnaire results, the raters liked the fact that the
descriptors in the new scale were more explicit. Further evidence of this can be
found in the following extracts from the interviews:
Researcher: Do you feel you used the whole range there [accuracy in the
new scale]?
Rater 10: Yes, yes, I did. Yeah. I think I would be more likely to. Because
I thought I had something to actually back it up with, it had a clearer
guideline for what I was actually doing, so I was more confident for giv-
ing nines and fours. And I think also because I didn’t, I let go of the sense
of this is a seven, so I have to make it come out as a seven and I’d say,
well, sorry, if they have no error-free sentences they get a four and I don’t
care if it is something that might otherwise get a six or a seven and yes, if
they can write completely error-free then I can give them a nine. I have no
problems with that.
The idea of being able to arrive precisely at a score was also echoed in a com-
ment made by Rater 7.
The more explicit criteria in the new scale resulted in a number of raters reporting
that it was not possible for them to use impressionistic marking when using the
new scale. It is important to remember that the raters did not know how the final
score would be derived for the new scale. The following quote is from the ques-
tionnaire administered immediately following the rating round using the new
scale. The raters were asked if they ever used a holistic score to arrive at a level
for a script.
Rater 1: I felt that the scale didn’t allow for that. To make the script fit to a
descriptor, I had to actually use the descriptors as guidance, much more
than I would do using the DELNA scale
Raters were also asked in their interviews which scale they thought would result
in higher intra- and inter-rater reliability for their rating. A typical rater response
can be seen in the extract below:
Researcher: With which scale do you think you are more self-consistent?
Rater 7: Probably with the new scale, because it is less subjective, you
know, you can say look there is five self-corrections whereas with the
DELNA one you have, you know, organisation it looks good today,
maybe next week I will think it is not.
Researcher: How about inter-rater reliability?
Rater 7: I think it is going to be more consistent with the new one. For the
same reason, because you can count, fluency, complexity, mechanics,
reader-writer interaction, content. Cohesion and coherence I am probably
a bit open to various ideas, but it is mostly, more than half of it, I am able
to say, no no look that is nine or that is eight.
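As an aside, the intra- and inter-rater consistency the raters speculate about here is conventionally quantified as a correlation between sets of scores. The sketch below uses invented scores (not data from this study) purely to illustrate the two measures:

```python
import numpy as np

# Illustrative only: the scores below are invented for demonstration,
# not taken from the study. Scores are band levels for ten scripts.
rater_a_round1 = np.array([5, 6, 7, 5, 8, 6, 7, 4, 6, 5])
rater_a_round2 = np.array([5, 6, 6, 5, 8, 7, 7, 4, 6, 5])  # same rater, later occasion
rater_b_round1 = np.array([6, 6, 7, 4, 8, 6, 6, 5, 6, 5])  # a second rater

# Intra-rater reliability: one rater's consistency across occasions.
intra = np.corrcoef(rater_a_round1, rater_a_round2)[0, 1]

# Inter-rater reliability: agreement between two raters on the same scripts.
inter = np.corrcoef(rater_a_round1, rater_b_round1)[0, 1]

print(f"intra-rater r = {intra:.2f}, inter-rater r = {inter:.2f}")
```

Pearson correlation is only one of several indices used for rater reliability (exact-agreement rates and many-facet Rasch measurement are common alternatives), but it conveys the basic idea of "self-consistent" versus "consistent with other raters".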
One of the most unexpected themes emerging from the interviews was that al-
most all raters reported that their rating behaviour had changed since using the
new scale. The first rater interviewed (Rater 3) raised the topic and it was then in-
cluded in the interviews that followed. Here is what Rater 3 said:
Rater 3: Yeah. I found the first time round, there was definitely an im-
provement in my DELNA marking
Researcher: In what way?
Rater 3: It made me more aware, I hadn’t really thought about hedging
very much, I have to say, mmh, so that then I started to notice them, so
there is, it has had a very positive spin-off. It has pinpointed things, be-
cause the DELNA one is less specific, it is less specific, so this, the two
kind of go together quite nicely, this [the DELNA scale] pinpoints things.
But by marking with the new scale, it has, I have got in my mind now, I
can see hedging
Researcher: So maybe like a training scale?
Rater 3: Yeah, it definitely has been very useful. It is sort of more aware-
ness of things which I might have glossed over [...]
This very interesting idea of the scale being useful as a training tool will be dis-
cussed further in the following chapters.
Whilst all the comments about the new scale reported above shed a positive light
on the scale, some less positive comments were also made by the raters. These can
be grouped into three categories:
The first group of comments points to aspects of writing which are not assessed
by the new scale and which raters felt were missing. Rater 3, for example,
challenged the fact that spelling was not included in the new scale:
Rater 3: But for spelling, the DELNA one actually considers spelling, that
suits me in that way, I think this is quite a ‘me’ scale
Researcher: Because you are used to it?
Rater 3: No, that is what I am saying, I would always be looking at spell-
ing, that would be one thing, it’s ideas, but spelling does really bug me
The second group of negative comments related to aspects of writing in the new
scale that are not included in the DELNA scale. The first comment below relates
to the inclusion of hedging in the new scale:
Rater 10: I just wasn’t sure with, I guess the hedging devices would be an
example, mmh, sometimes I might think it was actually a pretty good
script, but they just hadn’t put any hedging devices in and so I felt like I
was marking them down for something that they didn’t know they were
supposed to do. And that they could maybe produce a pretty good piece of
work without having hedging devices and no kind of account was taken.
So I guess this [the new scale] seemed a bit more rigid to me and maybe
not fitting each individual case.
Two raters criticized the inclusion of repair fluency in the new scale. Rater 5
talked about his own writing style as a comparison:
would say it is irrelevant. Maybe that is clouded by my own experience
writing. I am a very hesitant writer. Maybe that is a reflection of ability,
because people’s processes of writing go differently, their ability to write
a first draft quite well. Maybe it is a reflection of writing. But it doesn’t
seem to me from my experience.
Raters further criticized the fact that some information was lost because the de-
scriptors in the new scale were too specific. Rater 5, for example, argued that a
simple count of hedging devices could not capture variety and appropriateness:
Researcher: You said that, other than hedging, style wasn’t really consid-
ered.
Rater 5: Yeah, it does seem a bit limited. And then they might repeat the
same hedge and they might copy the one from the prompt and so they get
automatic points which I suppose is a strategy you can use when you are
doing academic writing, but quite often sort of non-native speakers will
rely on one or two hedges all the way through [...] Whereas the good writ-
ers will very sparingly use hedges but they will use them just right and
they will vary them. [...] So maybe something about variety of hedges and
appropriateness as well. I suppose that is similar [to the DELNA scale] it
sort of relies on the marker’s knowledge of English in a more kind of
global way sort of. But maybe that is the inter-rater reliability issue com-
ing up. So the DELNA scale allows me the flexibility to use my own
judgement about a script in all categories.
Whilst the first and second section above described themes arising out of the in-
terviews pertinent to the DELNA and the new scale separately, the third section
below looks at topics that emerged about both scales.
The first theme relates to the time taken when using the two scales. This question
was included in the interviews, as it emerged from the questionnaires. Interest-
ingly, the raters differed greatly in the time it took them to rate scripts with the
new scale. One rater, Rater 3, reported taking about three to four times longer
when using the new scale. On the other hand, Rater 5 reported rating a lot faster
when using the new scale. Most raters, however, suggested that it took only
slightly longer with the new scale and that the time taken was within reason. The
comments of Rater 4 reflected the sentiment of most raters interviewed.
The final theme pertaining to both scales is providing feedback to students. This
question was included in the interview after the analysis of the quantitative data
revealed that the new scale seemed to be measuring more aspects of writing. As
DELNA is a diagnostic assessment, it was thought that it would be interesting to
see which scale the raters would perceive as more appropriately capturing the dif-
ferent aspects of writing and therefore being more useful for providing feedback
to students. Not all raters were able to answer this question, probably because this
is not an aspect of the assessment situation they are usually confronted with. Rater
9, however, had experience working in the English Language Self Access Centre
(ELSAC), a facility which is often recommended to DELNA candidates, and she
thought in her interview that the DELNA scale might be more useful for providing
feedback.
Rater 10, another rater with experience in working at ELSAC, however, thought
that the new scale might be more useful:
new scale might offer some more explicit feedback, I guess we might
really be saying, you need to have more than one or two reasons, whereas
here it might say your interpretation was not adequate. So I guess this
might provide some more concrete things to give to students and the data
description one. Yeah, again, if you could say you described the trends,
but you didn’t put any figures in, that might be better than saying your
data description was inadequate. So I think if you could get some preci-
sion out of this [the new scale], and things like the vocabulary, that would
be easier to back that up. And certainly the accuracy.
Overall, no clear consensus was reached on this question on the basis of the inter-
views.
The first section presented themes that emerged about the DELNA scale. Most
raters reported experiencing some problems when using the DELNA scale. The
interviews also brought a number of strategies to light that were used when raters
encountered problems. A small number of raters, however, preferred the DELNA
descriptors.
The second section illustrated the themes that emerged about the new scale. These
were divided into positive and negative aspects. The raters found the band de-
scriptors more explicit, as many of them were able to count features, and they
noted that this probably increased both intra- and inter-rater reliability. They also
noted a positive spin-off on their rating behaviour. All negative comments related
to aspects in which the new scale differs from the existing one, whether features
missing from the new scale or features included in it that are not normally found
in other scales.
The third section above covered broader themes that related to both scales.
Raters differed in the time they took to apply the two scales. Finally, raters dif-
fered in their opinions about which scale might be more useful for providing
feedback to students.
9.3 Conclusion
Chapter 9 presented the results in response to research questions 2a and 2b. The
following chapter, Chapter 10, will attempt to answer the overarching research
question guiding this study.
---
Notes:
1. One rater failed to answer this question.
Chapter 10: Discussion – Validation of Rating Scale
This chapter focuses on the validation phase of the new scale. The previous two
chapters, Chapters 8 and 9, presented the methodology and results of this phase,
whilst this chapter presents the discussion of the findings.
Although the results in Chapter 9 were reported in two parts (divided into research
questions 2a and 2b), the discussion in this chapter will focus on answering the
overarching research question.
The aim of this chapter is to build a validity argument. In contrast to other studies
that aim to validate one test or measure, this study set out to compare the validity
of two rating scales. Validity is therefore established through a comparison of the
two scales. To determine which rating scale is more valid, Bachman’s (2005) and
Bachman and Palmer’s (forthcoming) Assessment Use Argument was used as a
basis. To facilitate the comparison, a table was created in which the relative valid-
ity of the two scales was noted. The empty grid can be seen in Table 104 below.
As can be seen from the rows in Table 104, to establish validity, Bachman and
Palmer’s (1996; forthcoming) facets of test usefulness were used as guidelines.
The authors define test usefulness in terms of six aspects: construct validity, reli-
ability, authenticity, interactiveness, impact and practicality. These will provide
the structure of this chapter. It is important to point out that the facets of test use-
fulness used in the table above were designed for the validation of entire tests,
not of rating scales. However, because most aspects can be modified and applied to
rating scale validation, the decision was made to follow this framework. Because
interactiveness cannot be established with respect to rating scales, this concept has
been excluded from any further discussion. To guide the discussion of the remain-
ing five facets of test usefulness, a number of warrants have been formulated. A
warrant is a statement which has been devised to represent an ideal situation and
the discussion will establish how closely each rating scale reflects this.
Bachman and Palmer (1996) define construct validity as ‘the meaningfulness and
appropriateness of the interpretations that we make on the basis of test scores’ (p.
21). According to Weigle (2002), construct validation refers to the process of de-
termining whether a test is actually measuring what it is intended to measure. To
establish construct validity for a rating scale, we need to understand what the pur-
pose and the context of an assessment are, and whether the rating scale is helping
raters to arrive at scores which represent the abilities in question. A variety of
types of evidence can be used to establish construct validity. Of the types men-
tioned by Chapelle (1998), content analysis and empirical investigation will be
used. Three warrants focussing on construct validity have been formulated. The
first two warrants will employ content analysis and empirical investigation, while
the third warrant will involve a consideration of the procedures used during rating
scale development.
10.1.1 Warrant 1: The scale provides the intended assessment outcome appropriate to purpose and context and the raters perceive the scale as representing the construct adequately
Each of these four statements will now be discussed in turn.
Alderson’s first statement calls for diagnostic assessments to identify strengths
and weaknesses in a learner’s knowledge and use of language. Both rating scales
compared in this study were analytic scales and were designed to identify
strengths and weaknesses in different aspects of the learners’ writing ability.
However, the principal factor analysis showed that the new scale distinguished six
different writing factors, accounting for 83% of the variance, whilst the current
DELNA scale resulted in one large factor accounting for 64% of the variance.
Therefore, it could be argued that the new scale was more successful in identify-
ing different strengths and weaknesses as well as accounting for more variance in
the final score. This result also shows that test takers’ abilities in different areas of
writing performance do not develop in parallel (as has been suggested by Young
1995 and Perkins and Gass 1996) but rather develop at different rates and at dif-
ferent times.
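To make the variance figures concrete, the sketch below shows how the proportion of variance explained by a factor can be computed from the eigenvalues of the correlation matrix of rating categories. The data are simulated (they are not the study's ratings) to mimic a single dominant "halo" dimension of the kind described here:

```python
import numpy as np

# Illustrative sketch (not the study's data): proportion of variance
# explained by factors, taken from the eigenvalues of the correlation
# matrix of analytic rating categories.
rng = np.random.default_rng(0)

# Simulate 100 scripts rated on 9 categories, where one strong
# underlying dimension dominates all ratings (a halo pattern).
ability = rng.normal(size=(100, 1))
ratings = ability + 0.4 * rng.normal(size=(100, 9))

corr = np.corrcoef(ratings, rowvar=False)           # 9 x 9 correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
variance_explained = eigenvalues / eigenvalues.sum()

# With one dominant factor, the first eigenvalue accounts for most of
# the variance, analogous to the single large factor found for the
# DELNA scale; a multi-factor solution spreads variance across factors.
print(f"First factor: {variance_explained[0]:.0%}")
```

A full principal factor analysis additionally involves communality estimation and rotation, but the eigenvalue decomposition above is the core of how "accounting for 64% of the variance" is calculated.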
The main reason that the ratings based on the DELNA scale resulted in only one
factor was the halo effect displayed by most raters. Although developed as an ana-
lytic scale, the existing scale seemed to lend itself to a more holistic approach to
rating. It is possible that, as hypothesized in this study, the rating scale descrip-
tors do not offer raters sufficient information on which to base their decisions
and so raters resort to a global impression when awarding scores. This then would
explain why, when using the empirically developed new scale with its more de-
tailed descriptors, the raters were able to discern distinct aspects of a candidate’s
writing ability.
Some studies have in fact found that raters display halo effects only when encoun-
tering problems in the rating process (e.g. Lumley, 2002; Vaughan, 1991). Lum-
ley, for example, found that when raters could not identify certain features in the
descriptors, they would resort to more global, impressionistic type rating. This
study suggests that the halo effect and impressionistic type marking might be
more widespread than has so far been reported. The halo effect is usually seen in
the literature as being a rater effect that needs to be reduced or even eliminated by
rater training. However, Cascio (1982, cited in Myford and Wolfe, 2003, p. 396)
notes that the halo effect is ‘doggedly resistant to extinction’. This study has
shown that simply providing raters with more explicit scoring criteria can signifi-
cantly reduce this effect. It could therefore be argued that the halo effect is not
necessarily only a rater effect, but also a rating scale effect.
However, what was not established in this study was whether the raters rated
analytically because they were unfamiliar with the new scale. It is possible that
extended use of the new scale might also result in more holistic rating behaviour. This
point will be taken up in the suggestions for further research in the following
chapter.
As mentioned above, when using the new scale, the raters were able to determine
six different factors of writing ability. However, it is also important to examine
the usefulness of these six factors. The first factor, made up of accuracy, lexical
complexity, coherence and cohesion, can be described as a general writing factor.
This is a very useful factor for students to receive feedback on, as it seems to be a
good indicator of overall writing ability.
The second factor, made up of hedging and interpretation of data, is a factor that
lower level writers often struggle with. They are often able to describe data pro-
vided in a graph or table, but struggle to interpret the information and to present it
in the appropriate style (e.g. by using hedging devices). This would again be a
useful factor for students to receive separate feedback on.
The third factor represented content – Part 3, where students are asked to extend
their ideas. Because the interpretation of data and Part 3 of content load on sepa-
rate factors, it can be argued that students who perform well in one part might
struggle with the other section. This could mean that the time-limit is not suffi-
cient to enable good students to receive high marks on both sections. Further re-
search into this is clearly necessary.
The fourth factor represents the description of the data. From the factor analysis, it
is clear that very different skills are required from learners when describing data
and interpreting them. Therefore, it is again a useful factor to report separately to
students. These results also provide evidence that averaging the different catego-
ries relating to content into one score results in a loss of important information.
The fifth factor represents repair fluency. The usefulness of this measure is doubt-
ful because the quantitative findings were not as convincing as those for the other
scale categories, and the raters were generally not persuaded of the efficacy of this category.
However, it is of course possible to argue that copious self-corrections are dis-
tracting to the reader. Even though more and more writing is produced on the
computer, students are expected to produce hand-written answers in exams and
too many self-corrections might distract from their writing. Therefore, although
this measure did not produce very convincing quantitative results, it can be argued
that it could be useful.
within a paragraph. Also, if the scale were publicly available to students, achieving
a high score on paragraphing would be easy, as formatting their writing into
the five paragraphs would be simple. It would therefore be beneficial to develop a
more meaningful scale for paragraphing, but until then, the current scale descrip-
tors are arguably of some use to students.
Alderson’s (2005) second and third statements assert that diagnostic assessments
should enable a detailed analysis and report of responses to tasks and that this
feedback should be in a form that can be acted upon. Both rating scales lend
themselves to a detailed report of a candidate’s performance. However, as evident
in the quantitative analysis, if the raters at times resort to a holistic impression to
guide their marking when using the DELNA scale, this will reduce the amount of
detail that can be provided to students. If most scores are, for example, centred
around the middle of the scale range, then this information is less useful to stu-
dents than if they are presented with a more jagged profile of some higher and
some lower scores which therefore affords a clear indication of which aspects of
their writing they need to focus on.
For both scales, it is unclear to what extent the statements in the scales are of use
to students. It might not help much for a student to know that he or she has ‘little
understanding of academic style’, whilst the more detailed descriptors in the new
scale - ‘less than four hedges used’ - might be more informative. Technical vo-
cabulary, like the term ‘hedges’, would of course have to be defined for students.
The descriptors in the DELNA scale are often very broad and general and might
therefore not provide enough detail to be useful as a basis for instruction. If the
feedback is to be useful to stakeholders, it needs to be detailed and in a form that
can be understood by candidates and their future instructors. Therefore, it might
be useful for testing centres to develop two different types of feedback; one which
is designed solely for students who might have less metalinguistic knowledge and
the other for potential instructors with more technical vocabulary. This idea will
be further developed in the section on practical implications in Chapter 11.
Alderson’s fourth statement asserts that diagnostic tests are more likely to be fo-
cussed on specific elements rather than on global abilities. If a diagnostic test of
writing is aimed at focussing on specific elements, then this needs to be reflected
in the rating scale. Therefore, the descriptors need to lend themselves to isolating
more detailed aspects of a writing performance. The descriptors of the new scale
were more focussed on specific elements of writing because they were based on
discourse analytic measures. The band descriptors on the DELNA scale generally
reflect more global abilities, with vaguer, more general band descriptors. It was
interesting to observe, however, that the raters thought that, because of the more
specific descriptors, important information was lost. Some raters even remarked in
their questionnaires that they thought a scale with band descriptors for an overall
assessment was missing.
The way the scores are reported is also important. It is not effective to use an ana-
lytic scale and then average the scores when reporting back to stakeholders (as is
currently the case with the existing scale), because this will result in a more global
impression of the performance and important information is therefore lost. Cur-
rently the writing scores are reported to test takers as one averaged score with
brief accompanying descriptions about their performance in fluency, content and
form. Academic departments only receive one averaged score. As described in
Chapter 5, students also receive a recommendation on where to receive appropri-
ate help for their level of English proficiency. This could be either the English
Language Self-Access Centre, the Student Learning Centre, or the advice might
be that they should enrol in ESOL credit papers if their scores are found to be suf-
ficiently low. None of this advice, however, focusses on details of their writing
performance. In this way, the current practice is more representative of profi-
ciency tests or placement tests.
Finally, it was also important to establish the stakeholders’ perceptions of the effi-
cacy of the two scales for diagnostic assessment. Only the raters’ opinions were
determined. Raters’ perceptions of the scale’s usefulness are important as they pro-
vide one perspective on the construct validity of the scale. They are, for example,
able to judge whether the writing construct is adequately represented by the scale.
Just as important as the raters’ perceptions of the usefulness of the scale for diag-
nostic assessment would have been the test takers’ views, as well as the judge-
ments of stakeholders, such as teachers of the students. These were, however, not
canvassed as the scope of this study did not allow for this.
Raters were asked during the interviews which scale they thought might be more
useful for providing feedback to learners. Not all raters commented on this topic.
The raters who were able to answer this question were divided on this issue. Some
raters thought that the DELNA descriptors were more useful as the basis for feed-
back, whilst others considered the new descriptors to be better.
In the course of the interviews and questionnaires it became apparent that most
raters treated DELNA as a proficiency or placement test rather than a diagnostic
assessment. For example, Rater 10 wrote in her questionnaire: I think I prefer the
existing DELNA scale because I like to mark on ‘gut instinct’ and to feel that a
script is ‘probably a six’ etc. It was a little disconcerting with the ‘new’ scale to
feel that scores were varying widely for different categories for the same script.
Similarly, Rater 5 mentioned in his interview: ‘I notice these things [features of
form] as I am reading through, but I try not to focus too much on them. I try to go
for broad ideas and sort of are they answering the question. Are they communi-
cating to me what they need to communicate first of all. And how well do they do
that.’ Also, some raters suggested in their questionnaires that they would have
liked to see descriptors assessing the overall quality of a script. It seems therefore
that the purpose of the assessment was not clear to them. The findings of this
study suggest that raters need to be made aware of the purpose of the assessment
in their training sessions, so that they recognize the importance of rating each as-
pect of writing separately. This might result in raters displaying less of the halo or
central tendency effects.
Summarizing the evidence for Warrant 1, it can be said that the new scale is able
to provide more information about the strengths and weaknesses of learners, as
more distinct aspects of writing ability were distinguished and a larger amount of
variance could be explained. It is therefore better equipped to form the basis of
detailed feedback profiles. The descriptors of the new scale also focus more on
specific elements of the writing product than the more general descriptors in the
DELNA scale. The raters’ perceptions of the efficacy of the two scales were di-
vided but slightly in favour of the new scale. However, there was evidence that
the raters were not aware of the purpose of the assessment; a number of their
comments showed that they were treating the assessment as a proficiency test.
10.1.2 Warrant 2: The trait scales successfully discriminate between test takers
and raters report that the scale is functioning adequately
Although the main focus of this section is the discrimination power of the scales,
aspects such as reliability contributing to the discrimination of the scale are also
discussed.
The discrimination power of the whole scale is less important in the context of a
diagnostic assessment, as results should be reported back for each trait individu-
ally. In this section, the focus is therefore on the trait scales and the raters’ percep-
tions of these. After the discussion of the individual scales, the raters’ perceptions
of the validity of the scales as a whole will be considered.
The trait scales on the new scale generally resulted in a higher candidate separa-
tion ratio, which means that they were able to discriminate between more levels of
candidate ability. The main reason for the increased candidate separation ratio was
the fact that the raters were rating more similarly to each other and were using
more levels on the rating scale. If raters rate with large differences, their ratings
cancel each other out and this reduces the candidate separation ratio and therefore
also the validity of the assessment.
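The candidate separation ratio can be made concrete with a short sketch. This is a minimal illustration, not the FACETS computation itself: it assumes candidate ability measures (in logits) and their standard errors are already available from a many-facet Rasch analysis, and computes the ratio of the error-adjusted ‘true’ spread to the average measurement error.

```python
import math

def separation_ratio(measures, standard_errors):
    """Candidate separation ratio G = 'true' SD / RMSE, as used in
    Rasch analysis.

    measures: ability estimates (logits), one per candidate
    standard_errors: the corresponding standard errors of measurement
    """
    n = len(measures)
    mean = sum(measures) / n
    observed_var = sum((m - mean) ** 2 for m in measures) / (n - 1)
    mse = sum(se ** 2 for se in standard_errors) / n   # mean error variance
    true_var = max(observed_var - mse, 0.0)            # remove error component
    return math.sqrt(true_var) / math.sqrt(mse)

# Widely spread candidates measured precisely -> high separation
g_high = separation_ratio([-2.0, -1.0, 0.0, 1.0, 2.0], [0.3] * 5)
# Tightly bunched candidates measured imprecisely -> separation near zero
g_low = separation_ratio([0.0, 0.1, -0.1, 0.05, -0.05], [0.5] * 5)
```

Higher values indicate that the scale distinguishes more statistically distinct levels of candidate ability; divergent or noisy ratings inflate the error term and drive the ratio down, which mirrors the point above about raters’ ratings cancelling each other out.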
Overall, this study was able to show that the trait scales of the new scale func-
tioned better in most aspects than the DELNA trait scales. However, there were
some exceptions. For example, the discrimination (as measured by the candidate
separation ratio) was slightly lower for the new trait scales of cohesion and coher-
ence if the formula for equating the number of band levels (see Chapter 9) was not
applied. There are two possible explanations for these results. Firstly, trait
scales on the new scale were found to be inferior to the existing trait scales where
the focus was on features of writing that are difficult to describe in precise detail. For
example, cohesion and especially coherence are aspects of the writing product
which are inherently difficult to quantify and to grade. It is therefore possible that
some aspects of writing lend themselves more to the type of descriptors used in
this study, whilst others are just as successfully applied even if they are not em-
pirically-based. However, it is also possible that more training and more experi-
ence in using these new descriptors might help raters in rating these traits more
reliably.
To provide further evidence for the warrant above, each trait scale on the new
scale will be evaluated individually, with evidence taken from both the statistical
analysis and the rater questionnaires and interviews. Based on these findings, rec-
ommendations for future revisions of the new scale will be made.
10.1.2.1 Accuracy:
Most raters remarked positively about the category of accuracy. However, there
was some disapproval of the measure of the percentage of error-free t-units. The
criticism raised reflects similar criticisms in theoretical discussions of this meas-
ure (see for example Wolfe-Quintero et al., 1998). One rater was not convinced
that it was fair that writers with one error per t-unit should be penalized in the
same way as writers with many errors per t-unit, a criticism also raised by re-
searchers such as Bardovi-Harlig and Bofman (1989). Overall, though, it could be
said that the rating scale category of accuracy functioned well in terms of its quan-
titative aspects and was also generally well perceived by the raters. It should
therefore be adopted in any future use of the rating scale. It might be useful to col-
lapse levels 8 and 9 on the new scale as the top band level was underused, espe-
cially since level 9 was created without any empirical basis. Collapsing the two
top levels to read ‘nearly all or all error-free sentences’ would also more accu-
rately mirror the lowest level, which reads ‘nearly no or no error-free sentences’.
However, Myford and Wolfe (2004) caution against collapsing adjacent band lev-
els. It is possible that a larger sample size would have shown that the top band is
in fact necessary.
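The contested measure can be sketched in a few lines (a hypothetical helper, not one of the study’s instruments). The example also illustrates the criticism raised by Bardovi-Harlig and Bofman: a t-unit with one error and a t-unit with many errors lower the percentage by exactly the same amount.

```python
def percent_error_free_t_units(error_counts):
    """Percentage of error-free t-units in a script.

    error_counts: number of errors found in each t-unit. Note that a
    t-unit with one error counts the same as one with many errors,
    which is the criticism discussed in the text.
    """
    error_free = sum(1 for e in error_counts if e == 0)
    return 100.0 * error_free / len(error_counts)

# Both scripts score 50%, although the second contains far more errors
light = percent_error_free_t_units([0, 1, 0, 1])
heavy = percent_error_free_t_units([0, 5, 0, 7])
```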
10.1.2.2 Repair fluency:
The quantitative findings for the category of repair fluency were rather mixed.
Although it resulted in very high candidate discrimination and rater reliability, the
raters differed substantially from each other in terms of leniency and harshness
and nearly half of the raters were shown to be rating with too much or too little
variation. Raters’ comments collected as part of the questionnaire were divided.
About half the raters made positive comments. However, these related generally
to the ease of use of the category rather than its validity. Four raters were not con-
vinced of the efficacy of this scale. One rater suggested that there was little corre-
lation between the results for this category and English ability. Another rater sug-
gested that writers should be encouraged to self-correct rather than be penalized
for too many self-corrections. The reasons for these problems with the category of
repair fluency lie in its origins in speaking. A breakdown in fluency in speaking is
likely to be more problematic than in writing. A writer has time to self-correct
without immediately influencing the reader. Also, while speech occurs in real
time, it is not clear when a self-correction occurs in writing purely on the basis of
the writing product. If the self-correction was made during the initial writing
process, then this could be considered a breakdown in fluency. However, if the
correction occurs as a result of a revision at a later time, it should not be consid-
ered a breakdown in fluency. Whilst all the previous points made focussed on the
writer, it is also possible to view repair fluency from a reader’s perspective. Copi-
ous self-corrections could be seen to have a negative influence on someone read-
ing a text.
Overall, there are arguments for, as well as against, keeping this trait scale. Exces-
sive self-corrections might be distracting to the reader, but a case can also be
made for encouraging self-corrections as these provide evidence of writers’ revi-
sion processes. The quantitative findings of this trait scale were mixed, so if it is
reused in future administrations of the assessment, further revisions are clearly
necessary. If the scale is deleted, then this leaves the problem that no suitable
measure of fluency is available and this could be seen as a weakness of the scale.
However, it could also be argued that fluency is a lot less vital in writing than it is
in speech and thus does not need to be assessed.
10.1.2.3 Lexical complexity:
The quantitative findings for lexical complexity were generally positive when
compared to the vocabulary and spelling category on the existing scale. All as-
pects that were compared resulted in better values, except for the rater separation
ratio, which was slightly higher for the new scale, indicating that the raters were
spread more in terms of leniency and harshness. It seems that raters were able to
rank candidates more similarly when using the new descriptors but they differed
from each other in severity. The reason for this could lie in the fact that the de-
scriptors consisted of two parts: one focussed raters on the Academic Word List,
whilst the other part was more general. This second part was designed for raters
experiencing problems identifying words from the Academic Word List. It is pos-
sible that raters who used one type of descriptor produced more lenient ratings
than raters using the other type of descriptor. It would have been useful to ask rat-
ers which type of descriptor they used, to see if this explained the differences
among them.
The questionnaire comments on the trait scale of lexical complexity were gener-
ally positive. One rater for example remarked that this category is very indicative
of competence. This is also what the analysis of the writing scripts during Phase 1
suggested. This finding is further supported by other research. Loewen and Ellis
(2004), for example, were able to show that vocabulary knowledge is a good pre-
dictor of academic success as measured by grade point average (GPA), especially
if only the GPAs for language and writing-rich courses were used. A similar find-
ing was reported by Elder and von Randow (2002).
Overall, it seems that the trait scale of lexical complexity functions successfully
and therefore should be included if the new scale is adopted. All band levels on
the scales were sufficiently used by the raters, so that there is no reason to change
the number of levels.
10.1.2.4 Paragraphing:
The quantitative results for the category of paragraphing clearly favoured the new
scale. On the other hand, the qualitative results were not quite so clear-cut. Three
raters commented positively, but five raters were less convinced. They remarked
for example that this category did not account for the ordering of information
within paragraphs. This needs to be acknowledged as a clear weakness of the trait
scale. However, it is very difficult to design a more detailed scale for paragraphing
without returning to more open-ended, vague descriptors. No previous research
measuring paragraphing empirically was available and it is clear that more work
in this area is necessary. On a purely mechanical basis, the scale functioned well
enough to be included in the new scale. It is not clear, though, if the number of
levels should be changed. The analysis showed that both outer levels were slightly
underused. It might be necessary to collect more data to see if there is any empiri-
cal basis for collapsing any levels.
10.1.2.5 Hedging:
The category of hedging performed well when the quantitative data was analyzed,
outperforming the existing rating scale of style in all aspects. The raters’ com-
ments in the questionnaire were also generally positive, although some raters
thought that a script could be highly successful without hedging devices. It is clear
that the category of hedging provides a substantially narrower picture of a writer’s
academic style than its broader counterpart in the DELNA scale. The vaguer de-
scriptors in the DELNA scale, however, resulted in a central tendency effect.
Hardly any raters used the outer scale categories. This was possibly the case be-
cause raters did not know what specific features to focus on. In Phase 1 of this
study, several aspects of style were pursued, but the only one that successfully
discriminated between the levels was the category of hedging. Further research
resulting in the detailed description of academic style is necessary. In the inter-
views, a number of raters suggested that whilst lexical complexity focusses on
academic vocabulary, a good discriminator of academic style would be the use (or
non-use) of informal vocabulary. Although slightly subjective, this is an avenue
that might be worthwhile pursuing further. The category of hedging, although
functioning well, is clearly just one aspect of academic style. Future revisions of
the scale will hopefully include a wider variety of features of academic style. For
example, it might be interesting to investigate whether the category of voice used
by Cumming et al. (2005) is a meaningful measure for the type of writing genre
investigated in this study. In terms of the number of band levels, the highest level
(band 9) was slightly underused. Future research should investigate if it should be
combined with band level 8.
10.1.2.6 Content – Data description:
The quantitative findings for the comparison of the two scales of data description
favored the new scale, although the raters were further spread in terms of leniency
and harshness. All raters were generally positively disposed towards this scale and
it should therefore be included in any future use of the new scale. Level 4 on the
scale was underused. However, an argument could be made for keeping this level
in the scale to cover instances in which candidates misread the question and do
not describe the data. If further use of the scale shows that this level is underused,
then it should perhaps be collapsed into the next higher level.
10.1.2.7 Content – Interpretation and Part 3:
All categories in the quantitative comparison of the interpretation of data and con-
tent Part 3 of the two rating scales pointed to the two trait scales in the new scale
functioning better than the equivalent trait scales in the DELNA scale. The ques-
tionnaire results showed that almost all the raters were positively disposed to-
wards these new trait scales. One rater, however, preferred using the existing trait
scales because it left him more room to bring in his own knowledge and experi-
ence. He suggested that this more quantitative measurement of content was not
able to evaluate the quality of ideas, their appropriateness and the clarity of ex-
pression or the depth of explanation and support. As was the case with other scale
categories, it is clear that the way of measuring content used in the new trait scales
has its limitations. To be able to arrive at a more reliable judgment, more explicit
categories had to be developed. Aspects such as the clarity of the expression of
ideas are very subjective and were therefore not included in the analysis. In both
the content categories, level 4 was slightly underused. Further use of the scale
needs to establish if these categories need to be combined with the next higher
level.
10.1.2.8 Coherence:
The quantitative findings for coherence were mixed. The new scale was not more
discriminating, but the rater separation was lower; that is, the raters differed less
in their severity and fewer raters were found to be rating with too little or too
much variation. A problem with the trait scale can however be found in the raters’
responses to the questionnaire. Almost all the raters found the category of coher-
ence too difficult or too time-consuming to use. Some raters stated that they got
used to the category as they marked more scripts, so there is some evidence that
the trait scale might become more usable with more training and experience.
However, since the data were collected, another suggestion has been put forward.
It might be useful to undertake a multiple regression analysis with the different
categories used for the analysis of coherence as independent variables and with
the average DELNA score as the dependent variable. In this way, it might be pos-
sible to identify two or three of the seven categories used in the new scale, which
are more indicative of writing ability than others. If, for example, raters could fo-
cus only on the categories of superstructure, coherence breaks and direct sequen-
tial progression, this might make the rating task substantially easier. This would
also mean that a more simplified rating scale could be designed based on the find-
ings of the multiple regression analysis. All raters were asked as part of the inter-
views if they thought a simplified version of the coherence scale might be useful.
All seven raters thought that this might make training and use of this scale suffi-
ciently easier for the scale to be useful in future administrations of the assessment.
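The suggested multiple regression analysis can be sketched as follows. The data here are simulated purely for illustration (no DELNA scores are used): seven hypothetical coherence-category measures predict an overall score, and ordinary least squares identifies the categories most indicative of that score.

```python
import random

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y,
    solved with Gaussian elimination. A stand-in for a full regression
    package; rows of X are scripts, columns are predictors."""
    rows = [[1.0] + list(r) for r in X]              # add intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    for col in range(k):                             # forward elimination
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            xtx[r] = [a - f * b for a, b in zip(xtx[r], xtx[col])]
            xty[r] -= f * xty[col]
    coef = [0.0] * k                                 # back substitution
    for r in reversed(range(k)):
        coef[r] = (xty[r] - sum(xtx[r][j] * coef[j]
                                for j in range(r + 1, k))) / xtx[r][r]
    return coef

# Simulated data only: 7 hypothetical coherence categories per script;
# by construction, only the first three drive the overall score.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(7)] for _ in range(60)]
y = [2.0 * r[0] + 1.5 * r[1] + 1.0 * r[2] + random.gauss(0, 0.5) for r in X]
coef = ols(X, y)
ranked = sorted(range(7), key=lambda i: -abs(coef[i + 1]))  # skip intercept
```

On data like this, the three influential categories surface at the top of `ranked`; applied to real ratings, the same procedure could identify the two or three categories (superstructure, coherence breaks, direct sequential progression, and so on) on which a simplified coherence scale might be based.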
Compared to other rating scales in use for coherence, which have been criticized
for being too vague (Watson Todd et al., 2004), this approach to rating scale de-
velopment promises to provide raters with more guidance in the rating process. It
is, however, also possible that aspects like coherence, as suggested above, are not
suitable for scales which aim to provide more explicit descriptors. It is possible
that the two very different scales of coherence compared in this study will always
result in very similar statistical findings, because an aspect of writing like coher-
ence will always resist more detailed description. In terms of the number of band
levels, level 4 was slightly underused in the new scale. More research is necessary
before any decision on the band levels is possible.
10.1.2.9 Cohesion:
The quantitative findings for cohesion were very similar to those for coherence
except that the raters did not rank the candidates more similarly than with the ex-
isting scale. The questionnaire comments were generally positive. Five raters
mentioned that they found the new descriptors easier to use than the existing ones.
Three raters were less convinced, with one commenting that it is difficult to di-
vorce cohesion from other aspects of writing, such as coherence and lexical com-
plexity. The descriptors for cohesion clearly need to be further revised. None of
the measures used was sufficiently clear in discriminating between the writers.
However, some aspects that might be more successful in distinguishing between
different writing ability levels, like lexical cohesion and cohesive chains (Halliday
& Hasan, 1976; Hoey, 1991; Neuner, 1987; Reynolds, 1995, 1996) are impractical
as they cannot be easily rated. Level 4 of the new scale was underused, but this is
an improvement on the existing scale where levels 4, 5 and 9 were underused.
Overall, the trait scale for cohesion seems promising, but more research is clearly
necessary so that the band descriptors can be refined.
When considering the scales as a whole, the raters’ comments were mixed. Most
raters reported encountering problems when using the DELNA descriptors. Al-
most all of these comments were related to the descriptors being too vague or non-
specific for raters to be able to easily decide on a score. One reason mentioned in
this respect is the use of adjectives like ‘extensive’, ‘appropriate’ or ‘adequate’.
Raters were very aware that these could mean different things to different raters.
This problem has also been pointed out by a number of researchers (for example
Brindley, 1998; Mickan, 2003; Upshur & Turner, 1995; Watson Todd et al.,
2004). Furthermore, recent evidence from think-aloud protocols of the rating
process lends support to the fact that raters struggle with vague descriptors. Smith
(2000), for example, found that raters had ‘difficulty interpreting and applying
some of the relativistic terminology used to describe performances’ (p. 186).
Shaw (2002) noted that about a third of the raters he interviewed reported prob-
lems when using the criteria but he did not specify what specific problems they
encountered. Similarly, Claire (cited in Mickan, 2003) reported that raters regu-
larly debate the rating scale descriptors in rater training sessions and describe
problems in applying descriptors with terms like ‘appropriately’. It can therefore
be said that there is a growing body of research available that supports the results
obtained for the interviews. Raters do often seem to find the descriptors vague and
consider this to be a problem.
Some raters did not report any problems when using the DELNA descriptors.
However, these raters usually gave some indication in their interviews that they
were rating mostly holistically, and therefore were probably not aware of the need
for a diagnostic assessment to report back individual scores to test takers. It is
possible that rater background played a role in their perception of the descriptors.
Two raters mentioned in their respective interviews that they preferred the non-
specific descriptor style of the DELNA descriptors because their background was
in English literature and they were accustomed to a more holistic type of rating. That
ESL trained teachers and English Faculty staff rate differently has previously been
shown in studies conducted by O’Loughlin (1993), Song and Caruso (1996) and
Sweedler-Brown (1993). Similarly, because most of the raters have gained their
rating experience in the context of proficiency tests (e.g. IELTS), specific
instructions during rater induction training may need to focus on the differences
between these two assessment types.
The rater comments about the new scale as a whole were generally positive. The
raters liked the fact that the level descriptors were more explicit and objective and
provided more guidance than the descriptors raters were used to. They reported
that it was much easier for them to ‘let go’ of impressionistic marking.
A number of criticisms of the new scale emerged from the interviews. Some as-
pects of writing were found to be missing from the new scale. Some of these as-
pects mentioned by raters were excluded from the scale based on the findings in
Phase 1 of this study. These were, for example, spelling, punctuation, capitalisa-
tion, sentence structure (as operationalised in the measures of grammatical complexity) and
certain aspects of academic style. Raters had not been briefed on how the new
scale was designed and therefore did not know that these categories had been ex-
cluded based on empirical findings. It might have been useful to inform raters about
the scale development process before they used the scale so that they understood
why these categories were excluded. This awareness might help raters in the rat-
ing process.
Raters also suggested that a number of categories appeared to be missing. These had
not been included in the scale as they were found difficult to operationalize. These
were, for example, strength of ideas (appropriacy and development) and quality of
expression (clarity of ideas, conciseness). These aspects are areas which raters can
currently include when using the existing descriptors. However, what is not clear
with more general descriptors such as the ones used in the DELNA scale is
whether raters end up focussing on the same aspects of writing. There was some
evidence in the interviews that raters focussed on very different features when us-
ing the same category on the rating scale. For example, when rating academic
style, some raters focussed more on non-academic vocabulary whilst others were
more irritated by persistent use of markers of writer identity such as ‘I’. It could
therefore be argued that categories that could be interpreted in a variety of ways
by raters add to the unreliability that has been observed in performance assess-
ments. Others might suggest that it is the role of rater training to counteract any of
these differences. Some of the extracts from the interviews suggest, though, that
raters are very fixed in their views of the writing product and it might not be pos-
sible to change the rating behavior of all raters to achieve high levels of rater reli-
ability. Similarly, authors such as Huot (1990) might argue that forcing raters to
discard their valuable personal experience and background might reduce the valid-
ity of a test. However, I would like to suggest that although raters’ personal ex-
perience and background are important in the rating process, achieving a certain
level of reliability is important for validity. This reliability has been shown to be
difficult to achieve with more conventional rating scales (see for example Cason
& Cason, 1984) and it might therefore be necessary to guide the rating process by
using more detailed, empirically-based descriptors.
Similarly, some raters felt that important aspects of writing were lost because the
new scale descriptors were too specific. For example, the only aspect of academic
style included in the new scale was that of hedging. The DELNA rating scale,
however, has more open-ended descriptors for style, allowing raters to award or
penalize a variety of aspects in this category. Some raters noted that the vagueness
of the descriptors allowed them to look at a more complete picture of the style of a
writing script, whilst hedging is very narrow and not necessary for a successful
script. This, of course, is a valid criticism which can be extended to a number of
features in the new scale. It could therefore be argued that the empirically devel-
oped descriptors take a narrower view of writing, not giving a true representation
of what raters take into account when using less rigid descriptors. Similar criti-
cisms were levelled at analytic rating scales, when compared to more holistic
ratings, by authors such as Huot (1990), Charney (1984) and Barrit, Stock and
Clarke (1986).
However, it is also possible to see this from a different point of view. The nature
of the more general descriptors allows different raters to focus on a variety of as-
pects in the same category. In the case of style, for example, it is possible that one
rater focusses typically on hedging whilst another rater is more concerned with
penalizing informal vocabulary. These differences will very likely result in low-
ered reliability. It would be hard to imagine how these differences could be elimi-
nated to arrive at more reliable ratings (which are of course the basis for any va-
lidity argument). Also, as discussed above in the section on Warrant 1, Alderson
(2005) suggested that raters should focus on more specific, rather than global
abilities, when diagnosing writing ability.
10.1.3 Warrant 3: The rating scale descriptors reflect current applied linguistics
theory as well as research
Messick (1989) argued that the construct validation process includes the collec-
tion of empirical evidence (which was discussed in Warrants 1 and 2 above) and a
theoretical rationale. Warrant 3 will now be considered in terms of the theoretical
underpinnings of the two rating scales. This warrant is important for the validity
of rating scales in all assessment contexts and can also be found in Alderson’s
(2005) list of aspects of diagnostic assessment.
The existing DELNA scale was based on pre-existing rating scales and has been
further developed according to the intuitions of administrators and raters involved
in the DELNA assessment. No information is available on the theoretical basis for
the DELNA scale, but it was adopted from a context other than the one it is cur-
rently used for. Overall, Fulcher (2003) would describe the existing DELNA scale
as an intuitively developed scale (see Chapter 3, p. X).
The categories in the new scale, however, were based on a taxonomy derived
from our understanding of language and/or writing development (as was described
in Chapter 4). A taxonomy was necessary as no comprehensive theory of writing is currently
available. As a result, a number of different models were used and a taxonomy
was established. The descriptors in the new scale were developed empirically,
based on the investigation of writing samples collected in the context of the
DELNA assessment. This is important, as it shows that the descriptors are based
on actual performance and therefore closely represent what actually happens in
writing scripts. Therefore, the new scale is based on linguistic theory as well
as on research (empirical investigation). Thus, it could be argued that the new
scale has more construct validity than the DELNA descriptors because it more
closely reflects current applied linguistics theory and is based on an empirical in-
vestigation.
Table 106 below summarizes the three warrants relating to construct validity dis-
cussed above. The final column shows that in this specific context, the new scale
has more construct validity as it was able to discern more aspects of writing which
can be reported back to stakeholders as diagnostic information and it resulted in
higher discrimination on the trait scales. Most raters found the new scale easier to
use. Finally, it was established that the new scale more closely reflects current
theory, and that basing the descriptors on actual student performance
lends it construct validity.
Warrant 3: The rating scale descriptors reflect current applied linguistics theory as well as research
New scale: based on a taxonomy of writing and rating models; descriptors empirically developed
Existing scale: basis of categories not clear; descriptors intuitively improved over the years
In favour of: new scale
10.2 Reliability
10.2.1 Warrant 4: Raters rate reliably and interchangeably when using the scale
The second aspect of reliability investigated was the rater point biserial coefficient
(or single-rater/rest-of-rater correlation). The pattern that emerged was that raters
generally ranked candidates more similarly when using the new trait scales, result-
ing in a higher rater point biserial value. A high rater point biserial for a trait scale
directly results in higher candidate discrimination, arguably a necessary condition
for a valid rating scale.
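The single-rater/rest-of-raters correlation described here can be sketched as follows; the ratings matrix and the helper function below are invented for illustration and are not the DELNA data:

```python
import numpy as np

def rest_of_raters_correlation(ratings: np.ndarray) -> np.ndarray:
    """For each rater (column), correlate that rater's scores with the
    mean score of all remaining raters over the same candidates.
    ratings: candidates x raters matrix of band levels."""
    n_cand, n_raters = ratings.shape
    out = np.empty(n_raters)
    for r in range(n_raters):
        rest = np.delete(ratings, r, axis=1).mean(axis=1)
        out[r] = np.corrcoef(ratings[:, r], rest)[0, 1]
    return out

# Illustrative ratings: 6 candidates rated by 3 raters on a 9-band scale
ratings = np.array([
    [4, 5, 4],
    [6, 6, 7],
    [5, 5, 5],
    [8, 7, 8],
    [3, 4, 3],
    [7, 7, 6],
])
print(rest_of_raters_correlation(ratings).round(2))
```

Higher values indicate that raters rank candidates similarly, which is what the new trait scales produced.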
The third rater reliability statistic measured was the percentage of exact agree-
ment. This is higher if more raters choose exactly the same scale categories. Gen-
erally, the percentage of exact agreement was higher when the new trait scales
were applied, except for coherence and cohesion. The percentage of exact agree-
ment of raters is a variable that has only recently been introduced into the
FACETS output (Myford & Wolfe, 2004). This measure needs to be considered
with some caution and is probably less meaningful than the other rater statistics,
for two reasons. Firstly, exact agreement can be achieved if raters avoid the outer
scale categories and tend to mainly award the inner band levels. This rating be-
havior has been described as the central tendency effect (Landy & Farr, 1983;
Myford & Wolfe, 2003) and is not desirable. However, if high exact agreement is
achieved because of a central tendency effect, then this should inevitably result in
a lower candidate separation and more raters displaying a low infit mean square
value. This was generally not observed in the case of the new scales. Another
reason for high exact agreement could be that a rating scale has few band
levels. Therefore, the chance of two raters awarding the same score level is much
higher with fewer levels to choose from. If the raters were choosing band levels
purely by chance, without referring to any descriptors, the percentage agreement
for the rating scale with fewer categories would inevitably be higher. Therefore,
unless the trait scales that are compared have the same number of band levels, the
measure of percentage of exact agreement is difficult to compare. In the case of
this study, some trait scales on the new scale had fewer scale categories than the
existing trait scales they were compared to.
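The two caveats above (central tendency and the number of band levels) are easy to see in a small sketch; the figures below are illustrative and not drawn from the study:

```python
import numpy as np

def exact_agreement(rater_a, rater_b):
    """Percentage of scripts for which two raters award exactly
    the same band level."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    return 100.0 * float(np.mean(a == b))

# Two raters agreeing on 3 of 4 scripts:
print(exact_agreement([4, 5, 6, 7], [4, 5, 6, 6]))  # 75.0

# If raters chose bands purely at random, the expected exact agreement
# is 1/k for a k-band scale, so a scale with fewer band levels
# yields higher agreement by chance alone:
for k in (4, 6, 9):
    print(f"{k} bands: chance agreement = {100 / k:.1f}%")
```

This is why exact-agreement percentages are only comparable across trait scales with the same number of band levels.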
Another aspect which also contributes to reliability is the percentage of raters who
rated with too much or too little variation (compared to what the FACETS pro-
gram would predict) and whose ratings therefore resulted in very high or low infit
mean square values. The percentage of raters identified to be rating with too much
or too little variation was found to be significantly lower when raters used the em-
pirically developed new scale. A possible reason for this lies in the raters’ reaction
when confronted with scale descriptors that do not offer enough information on
which to make a defensible decision on a band level. It is possible that raters react
to this in two different ways. Some raters might, when not sure what band level to
award to a script, choose a play-it-safe method and mainly award scores that are
in the inner levels of the rating scales. Some evidence of this was found in the in-
terviews and questionnaires of the raters. Anastasi (1988) points out that if raters
avoid using the extreme categories of the rating scale, this reduces the discrimination power of a rating scale. Other raters, when struggling to decide on a band
level, might however attempt to use most levels on the rating scale, and although
they might be quite self-assured about which band level to award when rating, the
resulting ratings might be slightly erratic and therefore inconsistent.
The results of this study seem to suggest that such rating behavior is not only the
result of rater background or individual characteristics that can be alleviated by
training, but might be the product of rating scale descriptors which are not de-
tailed enough to provide a solid basis for the raters in the rating process. There-
fore, it could be argued that Lumley’s (2002; 2005) suggestion that the rating
scale is mainly an inanimate object with little influence on the rating process
might have to be revisited. Because all aspects of the rating situation except the
scale were kept stable in this study, the findings suggest that the rating scale does
have a significant influence on rater behavior. Interestingly, Myford and Wolfe
(2003; 2004) suggest that to reduce both the central tendency effect and inconsis-
tencies in raters, the scale categories need to be defined more precisely so that rat-
ers will have a better idea of what the different band levels mean. This is exactly
what was attempted in this study, with some success it seems.
Overall, the results of the FACETS analysis suggest higher reliability in the rat-
ings based on the new scale. A summary of the findings relating to Warrant 4 can
be found in Table 107 below.
10.3 Authenticity
Authenticity is a way of assessing the extent to which the score interpre-
tations generalize beyond performance on the assessment to language use in the
target language use (TLU) domain. To establish the authenticity of a rating scale,
we need to consider if what raters are doing is representative of how readers in the
TLU domain would approach a piece of writing. Warrant 5 reflects the degree of
authenticity of the rating scale.
10.3.1 Warrant 5: The scale reflects as much as possible how writing is perceived
by readers in the TLU domain.
No data was collected as evidence for this warrant; however, a discussion of this
issue is necessary. Weigle (1998) argues that rating scales are inevitably a reduc-
tion of the construct being measured. Because of this, some authors have argued
that holistic rating is more authentic because it mirrors more closely the natural
process of reading (e.g. White, 1995). On the other hand, we again need to be
mindful of the purpose of this assessment. Alderson (2005), as discussed earlier in
this chapter, pointed out that a diagnostic test should give detailed feedback and
focus on specific, rather than global abilities. By its nature, therefore, diagnosis
reduces authenticity.
The DELNA scale, which often results in more global ratings, is for that reason
not necessarily appropriate for a diagnostic context. When assessing diagnosti-
cally, we are less interested in the generalizability of the ratings to a context be-
yond the assessment context than in identifying detailed strengths and weaknesses
of learners.
Therefore, the authenticity of the rating scale might be less of an issue in diagnos-
tic assessment than it is in proficiency tests. Table 108 below summarizes the dis-
cussion of this aspect of test usefulness.
10.4 Impact
According to Bachman and Palmer (1996), the impact of test use operates at two
levels: a micro level which is concerned with effects on individuals (stakeholders)
and a macro level which is concerned with effects on the educational system or
society. Because DELNA is generally considered a low-stakes test, impact will
only be considered at the micro level. Three warrants have been formulated about
individuals who could potentially be affected: the test takers, other stakeholders
and the raters.
10.4.1 Warrant 6: The feedback test takers receive is relevant, complete and
meaningful
No data to support this warrant was collected in the context of this study, as no
test takers were interviewed. Therefore, any suggestions made in this section are
purely speculative. It has been argued earlier in this chapter that the feedback pro-
vided to test takers would be more detailed based on the ratings with the new
scale, as the raters were able to discern more aspects of writing ability. Therefore,
it is possible to assume that the feedback based on the new scale would provide a
more detailed picture of the learners’ strengths and weaknesses. The feedback
might also be more meaningful as the descriptors are more concrete and test tak-
ers could therefore act upon them. However, neither rating scale was designed to
be directly used as feedback (as was mentioned earlier) and therefore more mean-
ingful descriptors need to be designed if students are truly meant to benefit from
the feedback (see Chapter 11).
10.4.2 Warrant 7: The test scores and feedback are perceived as relevant, com-
plete and meaningful by other stakeholders
Other stakeholders who might be impacted upon are teachers (or staff in self-
access labs) and the departments of test takers. More specific feedback on learn-
ers’ difficulties might lead to the introduction of language tutorials in certain con-
tent courses as well as teachers being able to offer more specific help. Again, al-
though this is an important issue for the validity of the scale, no data was col-
lected, so no conclusions can be drawn.
10.4.3 Warrant 8: The impact on raters is positive
The three warrants pertaining to impact can be seen in Table 109 above. No data
was collected from test takers or from stakeholders other than the raters, although
speculatively we could expect the new scale to be more useful. The evidence
relating to Warrant 8 likewise suggested that the new scale would be more useful.
10.5 Practicality
Bachman and Palmer (1996) define practicality in the following way. If the avail-
able resources are divided by the required resources and the result is equal to or
larger than one, the test development and use can be seen as practical. If the result
is lower than one, then the test development and use are not practical. The authors
list three types of relevant resources that need to be examined. Firstly, human re-
sources, which include test developers, raters, test administrators and clerical sup-
port. The second type of resource is material resources, which can be divided into
space (e.g. rooms for test development and administration), equipment (e.g. com-
puters, software) and materials (e.g. paper, library resources). The final resource is
time. Bachman and Palmer (1996) argue that all the resources mentioned above
are ultimately a function of the financial budget. In the case of the new scale, two
aspects of practicality need to be considered. First, it is important to consider
whether the development of an empirically-developed rating scale is practical. This
will be considered under Warrant 10, drawing on evidence from the process of
scale development. Second, it needs to be considered whether the use of the scale
is practical. This will be discussed under Warrant 9 below, based on the findings
described in the previous chapter.
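Bachman and Palmer's ratio can be written as a minimal sketch; the resource figures below are hypothetical:

```python
def practicality(available: float, required: float) -> float:
    """Bachman and Palmer's (1996) practicality index: available
    resources divided by required resources. A value of 1 or more
    means test development and use can be seen as practical."""
    return available / required

# Hypothetical figures, e.g. rater hours available vs. required:
print(practicality(available=120, required=100))  # 1.2 -> practical
print(practicality(available=80, required=100))   # 0.8 -> not practical
```

In practice all three resource types (human, material, time) would feed into both terms of the ratio.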
Raters seemed to differ greatly in the time they took when using the new scale.
The majority of raters, however, agreed that the time taken to rate a script with the
new scale was slightly longer than that for the DELNA scale, but that this was
within reason. A significant increase in the time taken by raters might mean that
testing centers would have to pay more to their staff, which might not be viable (a
concern raised by North, 2003). Almost all raters agreed that the additional time
taken was not a threat to the practicality of the assessment. In fact, a number of
them seemed to agree that if the category of coherence were to be simplified, very
little or no additional time would be needed in comparison to the existing descrip-
tors.
The DELNA scale was developed based on a pre-existing scale taken from another
context and has since been continuously changed based on rater feedback. The
new scale, on the other hand, was developed based on an analysis of writing per-
formances, which was extremely time-consuming. Therefore, the development of
the DELNA scale can be considered to be more practical. Also, the scale is avail-
able when the assessment is first administered, while the new scale could only be
developed after sufficient test taker performances were available.
Overall, it can be said that in terms of practicality the arguments were in favor of
the DELNA scale (see Table 110 above).
10.6 Conclusion
Table 111 below presents a summary of the warrants investigated for the Assess-
ment Use Argument of the two rating scales.
Table 111: Summary of evidence for Assessment Use Argument

Construct validity
Warrant 1: … representing the construct adequately

Warrant 2: The trait scales successfully discriminate between test takers and raters report that the scale is functioning adequately
  New scale: discrimination generally higher than with the DELNA scale; raters' perceptions generally positive, but mixed
  DELNA scale: discrimination generally slightly lower than with the new scale; raters report problems using the scale and often do not rate diagnostically
  In favour of: mixed, but possibly the new scale

Warrant 3: The rating scale descriptors reflect current applied linguistics theory as well as research
  New scale: based on a taxonomy of writing and rating models; descriptors empirically developed
  DELNA scale: basis of categories not clear; descriptors intuitively improved over the years
  In favour of: the new scale

Reliability
Warrant 4: Raters rate reliably and interchangeably when using the scale
  New scale: rater reliability generally higher than with the DELNA scale
  DELNA scale: rater reliability generally lower than with the new scale
  In favour of: the new scale

Authenticity
Warrant 5: The scale reflects as much as possible how writing is perceived by readers in the TLU domain
  New scale: raters focus more on details, which is appropriate in a diagnostic context
  DELNA scale: raters read more holistically, which is less suitable for diagnosis
  In favour of: the new scale

Impact (test consequences)
Warrant 6: The feedback test takers receive is relevant, complete and meaningful
  New scale: no data was collected
  DELNA scale: no data was collected
  In favour of: N/A

Warrant 7: The test scores and feedback are perceived as relevant, complete and meaningful by other stakeholders
  New scale: no data was collected
  DELNA scale: no data was collected
  In favour of: N/A

Warrant 8: The impact on raters is positive
  New scale: use of the new scale (during operation or training) raised awareness
  DELNA scale: no aspects came to light in the study (but the question was not directly investigated)
  In favour of: possibly the new scale

Practicality
Warrant 9: The scale use is practical
  New scale: marginally more time-consuming to use than the DELNA scale
  DELNA scale: marginally quicker to use than the new scale
  In favour of: the DELNA scale (but no great difference)

Warrant 10: The scale development is practical
  New scale: more time-consuming to develop
  DELNA scale: less time-consuming as it evolves over the years
  In favour of: the DELNA scale
Returning to the overarching research question which guided the discussion in this
chapter, the following conclusions can be drawn. Bachman and Palmer (1996)
suggest that ‘the most important consideration in designing and developing a lan-
guage test is the use for which it was intended’ (p.17). We need, therefore, to re-
member that the purpose of this test is to provide detailed diagnostic information
to the stakeholders on test takers’ writing ability. Most warrants provided evi-
dence in favour of the new scale. Warrant 1, which focuses on the construct valid-
ity of the assessment and pays special attention to the context of the test, is espe-
cially important.
However, not all the warrants above favoured the new scale. For example, the
warrants relating to practicality were in favour of the DELNA scale. Also, I was
only able to speculate on the impact (test consequences) of the two scales, as no
data was collected to support Warrants 6 and 7. But, as Weigle (2002) argues, it is
impossible to maximise all of the aspects described above. The task of the test de-
veloper is to determine an appropriate balance among the qualities in a specific
situation. Since each context is different, the importance of each quality of test
usefulness discussed above varies from situation to situation. Test developers
should therefore strive to maximize overall usefulness given the constraints of a
particular context, rather than try to maximize all qualities. In the context of
DELNA, a diagnostic assessment, it could be argued that the warrants relating to
construct validity are the most central (as is the case in most assessment situa-
tions). Authenticity and test consequences are possibly less important, considering
that diagnostic tests are generally regarded as low-stakes for the students (Alder-
son, 2005). Practicality is always a crucial consideration, but wherever possible,
construct validity should not be sacrificed simply to ensure practicality.
Overall, the new scale has been shown to generally function more validly and
reliably in the diagnostic context in which it was trialled than the pre-existing scale.
Chapter 11: Conclusion
11.1 Introduction
I will first summarize my main findings and then discuss their theoretical and
practical implications and limitations. Finally, I will outline some potentially fruit-
ful areas of future research.
Based on the taxonomy described above, eight constructs were chosen as the basis
for the traits in the rating scale. These were: accuracy, fluency, complexity, me-
chanics, reader-writer interaction, content, coherence and cohesion. The aim of
Phase 1 was to identify discourse analytic measures for each of these constructs
that were able to successfully differentiate between writing scripts at different
DELNA band levels and that were at the same time sufficiently simple to be used
by raters during the rating process.
Table 112 below shows the discourse analytic measures which were chosen as
suitable measures for the design of the rating scale.
Table 112: Discourse analytic measures included in rating scale
  Accuracy: percentage of error-free t-units
  Fluency: number of self-corrections
  Complexity (lexical): number of AWL words
  Mechanics: number of paragraphs from the five-paragraph model
  Reader-writer interaction: number of hedging devices
  Content: percentage of data described; number of reasons and supporting ideas in Parts 2 and 3 of the content
  Coherence: categories from topical structure analysis
  Cohesion: these/this; number of linking devices in combination with qualitative analysis of the linking devices
The analysis during Phase 1 also identified several measures which were not able
to discriminate between the five DELNA proficiency levels or that proved too
complex for inclusion in the scale. These measures (listed in Table 113 below)
were therefore not used in the rating scale.
The measures in Table 112 above were then used to develop the new rating scale
and the scale was validated in Phase 2.
The quantitative findings of Phase 2 suggested that, on the whole, the individual
trait scales on the new scale outperformed the trait scales on the pre-existing scale
in a number of categories: they generally resulted in higher candidate discrimina-
tion, higher rater reliability and smaller differences between the raters in terms of
severity. There were by and large also fewer raters identified as rating with too
much or too little variation when using the new scale.
When the ratings based on the scales as a whole were analysed, it became clear
that raters had awarded very similar scores across the different trait categories
when using the DELNA scale. When the new scale was applied, the rating pro-
files were more jagged, suggesting that the different categories on the scale were
measuring different aspects of writing. This was confirmed by a principal factor
analysis which suggested that the ratings based on the DELNA scale only resulted
in one main factor (accounting for 64% of the variance of the writing score),
whilst those based on the new scale resulted in six factors which accounted for
83% of the variance.
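The contrast between one dominant factor and several distinct factors can be illustrated with a small simulation, using principal components as a stand-in for the principal factor analysis reported; the data below are simulated, not the ratings from the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated ratings: 200 scripts x 6 trait categories, built from one
# shared dimension plus independent noise. This mimics flat rating
# profiles in which a single factor dominates the variance.
shared = rng.normal(size=(200, 1))
ratings = shared + 0.8 * rng.normal(size=(200, 6))

# Eigenvalues of the inter-trait correlation matrix give the share of
# variance associated with each principal component.
corr = np.corrcoef(ratings, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
share = eigvals / eigvals.sum()
print((100 * share).round(1))  # first component dominates
```

Ratings with jagged, trait-specific profiles would instead spread the variance across several components, as the new scale did.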
The qualitative findings showed that the raters experienced problems when as-
signing scores with the DELNA descriptors. This was mainly because of the
vagueness of the level descriptors which often did not provide enough information
to the raters. Raters reported a number of strategies they had developed to deal
with these problems. It was furthermore clear from the interviews that raters were
not sufficiently aware that rating in the context of diagnostic assessment required
different types of rating processes than for example when rating a proficiency test
– an issue which needs to be addressed in rater training. When asked about the
new scale, the raters generally reported that they enjoyed being provided with
more objective level descriptors because these provided more information to back
up their ratings.
11.3 Implications
The current study has a number of theoretical and practical implications, which
will be discussed in the following section.
The theoretical implications relate to models of performance assessment, models
of the rating process, the classification of rating scale types, and rater reliability
in performance assessment. Each of these will be discussed in turn below.
The first theoretical implication resulting from this study relates to the current
models of performance assessment. In the literature review, a number of models
of writing performance were discussed (see Chapter 2). Earlier models were de-
veloped by Kenyon (1992), McNamara (1996) and Skehan (1998a). The latest
model available is by Fulcher (2003). Most of these models were conceived in the
context of oral performance assessment and were therefore adapted for the pur-
pose of this study.
The findings from this study suggest a number of possible additions to Fulcher’s
model. The expanded model can be seen in Figure 53 below.
A number of features have been added to this model to reflect the findings of this
study. New variables are indicated in bold font and striped boxes, and new arrows
are indicated in bold font. Firstly, ‘scale development method’ was included in the
model as an additional variable. This variable has a direct influence on the rating
scale, as was shown in this study. An arrow was also added from ‘orientation /
scoring philosophy’ to ‘scale develop-ment method’ as the scoring philosophy is
likely to influence the scale development method. Scale developers might choose
a certain method to develop a rating scale because of an underlying scoring phi-
losophy, but they could also change this scoring philosophy based on the resulting
rating scale as indicated by the double-sided arrow linking scoring philosophy and
the rating scale.
A variable called ‘raters’ attitudes’ was also added. This variable influences the
raters and their use of the rating scale. Raters' attitudes are influenced by their
understanding of the context of the assessment, the purpose of the assessment and
the expected assessment outcomes. An arrow was included between this new vari-
able and the rater characteristics as these can inform and shape the raters’ atti-
tudes. This study, for example, showed that raters were not aware of the impor-
tance of rating for diagnosis. This had a direct influence on the raters’ use of the
scale descriptors and their rating process. Most raters also had a background in
rating a proficiency test and often seemed to transfer this approach to rating into
the context of DELNA, a diagnostic assessment. This understanding should of
course be shaped in rater training (as indicated by an arrow). This same arrow is
double-sided because the raters’ understanding of the purpose of the assessment
can also have a direct influence on the rater training (and its effectiveness).
Like Fulcher’s model, this expanded model is also seen as provisional and requir-
ing further research.
The second theoretical implication is related to the models of the rating process
available in the literature. Several models (e.g. Erdosy, 2000; Milanovic et al.,
1996; Sakyi, 2000) have shown that raters differ in their typical rating procedures.
Several different rating approaches have been identified. These include the fol-
lowing:
- the read-through-once-then-scan approach
- the performance criteria-focussed approach
- the first-impression-dominates approach
In this study, most raters were identified as following ‘the first-impression-
dominates approach’ when using the existing DELNA descriptors. However,
when using the new rating scale, the raters seemed to follow 'the performance
criteria-focussed approach’. Because the same raters were used in this study, this
may mean that existing models need to be either expanded or adjusted to allow for
the role the rating scale seems to play in the rating approach that raters choose.
More specific rating scale descriptors seem to be conducive to ‘the performance
criteria-focussed approach’ whilst vaguer, broader level descriptors seem to in-
duce raters to be guided by their first impression.
Existing models of the rating process were formulated on the basis of think-aloud
protocols. It is quite possible (as has been generally acknowledged) that think-
aloud protocols might result in a Hawthorne effect among the raters, suggesting
that raters are a lot more careful when rating than they actually are in reality. Fur-
ther doubts about think-aloud protocols have also been raised by a recent study by
Barkaoui (2007a; 2007b). In the study described here, no think-aloud protocols
were collected, but raters reported in their interviews some of the strategies that
they employed when rating with the existing rating scale descriptors.
The third theoretical implication refers to the classification of rating scale types
commonly found in the literature. Weigle (2002), in her comprehensive summary
of the different types of rating scales, distinguished holistic and analytic rating
scales and listed the differences in a table (see Table 4 in Chapter 3). Weigle (as
well as authors such as Bachman & Palmer, 1996; Cohen, 1994; Fulcher, 2003;
Grabe & Kaplan, 1996; Hyland, 2003; Kroll, 1998) described these two types of
rating scales as being distinct from each other. However, this study seems to sug-
gest that although these two scales are different, it is also necessary to distinguish
two types of analytic scales: less detailed, a priori developed scales and more de-
tailed, empirically-developed scales. It is possible that further research will be
able to describe different degrees of analyticity in rating scales. Therefore, Wei-
gle’s summary table can be expanded in the following manner (Table 114):
Table 114: Extension of Weigle's (2002) table to include empirically-developed analytic scales

Reliability
  Holistic scale: lower than analytic but still acceptable.
  Analytic scale (intuitively developed): higher than holistic.
  Analytic scale (empirically developed): higher than intuitively developed analytic scales.

Construct validity
  Holistic scale: assumes that all relevant aspects of writing develop at the same rate and can thus be captured in a single score; holistic scores correlate with superficial aspects such as length and handwriting.
  Analytic scale (intuitively developed): more appropriate, as different aspects of writing ability develop at different rates; but raters might rate with a halo effect.
  Analytic scale (empirically developed): higher construct validity, as based on real student performance; assumes that different aspects of writing ability develop at different speeds.

Practicality
  Holistic scale: relatively fast and easy.
  Analytic scale (intuitively developed): time-consuming; expensive.
  Analytic scale (empirically developed): time-consuming; most expensive.

Impact
  Holistic scale: single score may mask an uneven writing profile, may be misleading for placement and may not provide enough relevant information for diagnostic purposes.
  Analytic scale (intuitively developed): more scales can provide useful diagnostic information for placement, instruction and diagnosis, but might be used holistically by raters; useful for rater training.
  Analytic scale (empirically developed): provides even more diagnostic information than an intuitively developed analytic scale; especially useful for rater training.

Authenticity
  Holistic scale: White (1985) argues that reading holistically is a more natural process than reading analytically.
  Analytic scale (intuitively developed): raters may read holistically and adjust analytic scores to match their holistic impression.
  Analytic scale (empirically developed): raters assess each aspect individually.
11.3.1.4 Rater reliability in performance assessment
The fourth theoretical implication has to do with how rater reliability can be
achieved in performance assessment. A number of researchers have focussed on
counteracting rater effects through extensive rater training and restandardization
sessions (see for example Elder et al., 2007; Elder et al., 2005; McIntyre, 1993;
Weigle, 1994a, 1994b, 1998; Wigglesworth, 1993). But whilst the rating scale has
been acknowledged repeatedly as a source of measurement error (Fulcher, 2003;
McNamara, 1996; Skehan, 1998a), very little time and energy seems to have been
invested in reducing this type of measurement error. The quantitative results for
the comparison of the individual trait scales show that introducing descriptors
based on discourse analytic measures can be very successful in reducing meas-
urement error and therefore adding to the validity and reliability of a writing as-
sessment.
This study has three practical implications. These relate to the weighting of
scores, rater training, and rating scale development and score reporting. Each will
be discussed in turn below.
One issue that was not covered in this study is the question of the weighting of
categories. Depending on the context that a rating scale is used in, it might be sen-
sible to give extra weight to certain rating scale categories. For example, it could
be argued that in an assessment in which both grammatical accuracy and spelling
are evaluated, it might make sense to give more weighting to grammatical accu-
racy as correct spelling is arguably less important (especially as a large number of
writers nowadays use computers). However, the weight of categories might be
less important in diagnostic contexts, as the categories are reported individually to
maximise the feedback to stakeholders. Nevertheless, it might be useful for stu-
dents to understand which aspects of writing are particularly important for their
further success at university. This is arguably different for different disciplines.
Students taking language-rich subjects, for example, need to be able to organise
their writing to a much greater extent than students in the sciences who mainly
have to write technical reports which often follow prescribed formats. Similarly,
students at post-graduate level involved in writing theses and dissertations need to
produce much lengthier discourse than most undergraduate students. It might
therefore be argued that the emphasis of the feedback should differ for different
students. Feedback could also vary according to the proficiency level of students.
However, this might not be practical.
If scores for different categories need to be averaged to provide more accessible
information to groups of stakeholders (e.g. the academic departments of test tak-
ers), it might be useful to draw on the factor loadings as a basis for the weightings.
For example, factor 1 (accuracy, lexical complexity, coherence and cohesion) ac-
counted for 33 percent of the variance, whilst factor 2 (hedging and interpretation
of data) accounted for only 13 percent of the variance. Therefore, factor 1 could
be given more weighting in the final score, if averaging was desired in that con-
text. For this, either the percentage variance could be used, or the score in each
category could be multiplied by the eigenvalue of that factor. However, in the
feedback for test takers the scores should not be averaged.
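A variance-weighted average along these lines might look as follows; the band levels are hypothetical, while the 33% and 13% weights are the variance figures quoted above:

```python
def weighted_average(scores, weights):
    """Average factor-level scores, weighting each factor by the
    percentage of variance it accounts for (weights are renormalised
    to sum to one)."""
    total = sum(weights)
    return sum(s * w / total for s, w in zip(scores, weights))

# Hypothetical band levels: 6 on factor 1 (accuracy, lexical
# complexity, coherence, cohesion) and 4 on factor 2 (hedging,
# interpretation of data); weights are the reported variance shares.
print(round(weighted_average([6.0, 4.0], [33, 13]), 2))  # 5.43
```

Using eigenvalues instead of percentage variance would only rescale the weights, since they are renormalised before averaging.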
Two issues relating to rater training emerged from the findings of this study.
Firstly, as suggested in the previous chapter, even if a rating scale such as the new
scale is not used under operational rating conditions, there is some evidence from
the interviews that raters profited from using the more detailed new descriptors
because it made them more aware of what aspects of writing to look for and how
to differentiate rating scale categories and band levels. The new scale seemed to
force raters to rate analytically. A more detailed rating scale could be used during
training and could also be kept on the premises where the rating is undertaken for
raters to refer to when experiencing problems. Similarly, raters could be invited to
regularly review the more detailed descriptors in their own time after a break in
rating.
Raters also need to be clear about the purpose of an assessment they are involved
in. This study showed that raters seem to be transferring their rating experience
from proficiency tests to the context of diagnostic assessment, which has a differ-
ent aim and focus. Raters, therefore, need to be made aware of the differences be-
tween these types of assessment during their initial induction training and this idea
needs to be repeated in every rater training session. Raters need to understand that
their scores are not averaged and not used as a basis for student placement, but
rather serve as feedback to help students identify their strengths and weaknesses
and find the appropriate support structures available at university.
The final practical implication relates to score reporting. Firstly, some more gen-
eral considerations will be discussed. The second part of this section will then
provide some suggestions as to how scores could be reported in the context of
diagnostic assessment in an English for academic purposes (EAP) context, such as the one
which provided the context for this study.
Table 115 below presents several different purposes for which writing tests are
administered. The second column of the table provides a brief, general definition
of each test type. The third and fourth columns indicate that depending on the
purpose of the writing test, the rating scale in use might need to be different and
that the score will be reported in a different manner. For example, whilst for a
proficiency test it might be less important if the rating scale in use is holistic or
analytic (as long as it results in reliable ratings), the rating scale used in diagnostic
assessment would need to be analytic and at the same time should provide a dif-
ferentiated score profile. The need for these different types of scales is a conse-
quence of the way the scores are reported. Results of tests of writing proficiency
are usually only reported as one averaged score but, as Alderson (2005) suggests,
the score profiles of diagnostic tests should be as detailed as possible and there-
fore any averaging of scores is not desirable.
Table 115: Rating scales and score reporting for different types of writing assessment

Proficiency test
  Definition: Designed to test the general writing ability of students.
  Rating scale: Holistic or analytic.
  Score reporting: One averaged score.

Placement test
  Definition: Designed to be used to place students in specific writing courses.
  Rating scale: Usually holistic.
  Score reporting: Usually one averaged score, but scores might be grouped according to the focus of the courses.

Achievement test
  Definition: Designed to show that a student has achieved a certain standard on a writing course.
  Rating scale: Depends on the focus of the writing course.
  Score reporting: Depends on the focus of the writing course.

Diagnostic test
  Definition: Designed to identify strengths and weaknesses in writing ability; designed to provide detailed feedback which students can act upon; designed to focus on specific rather than global abilities.
  Rating scale: Needs to be analytic and needs to result in differentiated scores across traits.
  Score reporting: In detail; separate for each trait rating scale.
The rating scale used for a placement test would differ depending on the context.
If, for example, it is more important to establish the overall writing ability of a
student to place test takers in courses, then a holistic scale (or an analytic scale)
which results in one overall (averaged) score would be needed. However, if the
courses have different foci in terms of the aspects of writing they concentrate on,
then an analytic scale might be more useful in differentiating between test takers.
This might mean that the score is not necessarily reported as one overall score, but
as a group of scores. For example, if one course focuses more on sentence level
features of writing (e.g. grammatical accuracy and mechanics) and another on
overall text organisation in different writing genres, then the scores might be re-
ported in two parts, rather than one averaged score. Finally, the rating scale used
in achievement tests and the way the score is reported depends on the focus of the
writing course under consideration.
The following section proposes more specific types of feedback that could be pro-
vided to different stakeholder groups in the context of a diagnostic EAP writing
assessment such as the one that provided the background for this study. Three
groups of stakeholders will be considered: the test takers, the academic English
tutors of the candidates and the content teachers or the academic departments of
the students.
In the case of the test takers, the feedback should be clear, detailed and written in
sufficiently simple language that it can be acted upon. The feedback for each student
should contain advice on the test taker’s strengths and weaknesses in all the areas
assessed.
For example, the feedback on the student’s ability in the category of hedging
could read as follows (Figure 69 below):
In academic writing, writers usually tone down the strength of the claims they make. They
would use phrases and words like ‘might’, ‘it is likely that’, ‘one can assume that’ and ‘it
seems that’. A good writer might use several of these phrases or words to soften what he or
she is saying. You used only a few in the entire essay.
Figure 69: Feedback on hedging for test takers
Apart from detailed explanations, such as the one on hedging presented above, it
might also be useful to provide students with a visual presentation of their per-
formance across different traits. An example of this can be seen in Figure 54 be-
low. In this way students can easily see which aspects of writing require more ur-
gent attention.
Figure 54: Visual feedback for test takers
In addition to the feedback for each category of the rating scale, the test taker should
find a summary and a recommendation of what steps to take to improve any
weaknesses found. This could take the form of an action plan for the student. A
complete feedback portfolio for the student represented in the graph above might
look as follows (see Table 116 below).
Table 116: Feedback profile for test takers

… able to plan without compromising the quality of your essay.

Complexity
  Feedback: You used very little academic vocabulary in your writing.
  Recommendation: You should enrol in an ESOL credit course. ESOL 100 might be the most appropriate course to help you with your problems with vocabulary. In your own time, you could also seek help at the English Language Self-Access Centre (ELSAC). When you make an appointment there, the tutor will guide you to the best resources available on campus. There are also several useful self-help books available in the library which you can use to study academic vocabulary in your own time. See, for example, the following book: (give example) and you can access the following website from home (give example).

Mechanics
  Feedback: Your paragraphing was very good. Your essay had a clear introduction and conclusion and the different sections of the essay were well differentiated by paragraphs.
  Recommendation: Well done, no action needed.

Reader/writer interaction
  Feedback: In academic writing, writers usually tone down the strength of the claims they make. They would use phrases and words like ‘might’, ‘it is likely that’, ‘one can assume that’ and ‘it seems that’. A good writer might use several of these phrases or words to soften what he or she is saying. You used only a few in the entire essay.
  Recommendation: You should enrol in an ESOL credit course. ESOL 100 might be the most appropriate course to help you to improve your hedging. In your own time, you could also seek help at the English Language Self-Access Centre (ELSAC). When you make an appointment there, the tutor will guide you to the best resources available on campus. Attached to this feedback is a list of possible hedging devices. It might also be useful to have a look at the model answer attached to this feedback to give you an idea when it is appropriate to hedge.

Data description
  Feedback: You described all the main trends in the data provided to you, but you failed to back this up with any relevant figures.
  Recommendation: Imagine the reader does not have the data you are describing. You need to make sure they can understand the information by your description alone. You should have a look at the model answer provided. You can get further help with this by taking ESOL credit courses, by enrolling in Engwrit (English Writing) or going to ELSAC.

Data interpretation
  Feedback: You provided enough ideas and support for your ideas in this section.
  Recommendation: Well done, no further action required.

Data extension of ideas
  Feedback: You provided enough ideas and support for your ideas in this section.
  Recommendation: Well done, no further action required.

Coherence
  Feedback: The topics of your sentences often did not link back to previous sentences or you attempted to link back, but failed by using incorrect linking devices.
  Recommendation: The best way to improve your coherence in writing is to enrol in an ESOL credit course like ESOL 100 or ESOL 101.

Cohesion
  Feedback: You overused simple linking devices like ‘and’, ‘but’, ‘because’.
  Recommendation: The best way to improve your cohesion in writing is to enrol in an ESOL credit course like ESOL 100 or ESOL 101. You could also go to the ELSAC for help and there are many self-help books and websites available which the tutor at the ELSAC can show you. Have a look at the linking devices used in the model answer attached to this action plan.
The second group of stakeholders that should receive feedback are the English
tutors of the test takers. These tutors might be employed in a self-access lab at the
university or they might teach an ESOL credit paper. The feedback to the teachers
should ideally also be clear and detailed, but the language used can be more
sophisticated and include metalinguistic terms. Keeping with the example of hedging
used above, the feedback about hedging to a tutor might read as follows (Figure 55):
Whilst we would expect a writer at a high proficiency level to use a number of
hedging devices in this type of writing, this test taker used hardly any.
Figure 55: Feedback on hedging for academic English tutors
An overall recommendation at the end of the feedback could summarize the in-
formation and ask the tutor to focus attention on certain weaknesses.
The third group of stakeholders who regularly receive feedback in diagnostic as-
sessment are the university departments. In this case, very detailed feedback is of
less use. It is, however, of interest to the departments how their student cohorts as
a whole perform. This might help them to schedule language-focussed content
tutorials, which could then focus on common weaknesses in their group of test
takers. In this case, the feedback could possibly focus on each individual category
in the rating scale and summarize the behaviour of all students in the group.
Again, keeping with the example of hedging, the feedback for a cohort of students
could read (Figure 56):
We expect students to qualify the claims they make by using phrases like ‘it seems that’, ‘…
might be the reason’ or ‘it is likely that’. These phrases are referred to as hedging devices.
Of the twenty learners in your group, the majority did not use sufficient hedging devices in
their writing assessment. Whilst we would expect several hedging devices, most of your
learners used only a few. There were five learners who used no hedging devices at all.
Figure 56: Feedback on hedging for content teachers and academic departments
It might also be useful for this group of stakeholders to receive visual feedback on
the performance of the whole group. Figure 57 below shows what such a graph
might look like. Each grey bar represents the range of the whole group in terms of
test scores, whilst the black crossbars represent the mean of the group. In this
way, the content teachers get a visual impression of how spread out their group is in
terms of abilities and with which aspects of writing the group as a whole had the
greatest problems. Specialist terms such as cohesion and coherence
would have to be explained separately.
Figure 57: Visual group feedback profile for content teachers and departments
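A group profile of the kind described above rests on simple per-category summaries. The sketch below, with invented scores and a hypothetical function name, computes the range (the grey bar) and the mean (the black crossbar) for each rating scale category.

```python
# Hypothetical sketch of the group summary behind a range-and-mean chart:
# for each rating scale category, report the score range and the group mean.
# The scores below are invented example data.

group_scores = {
    "accuracy":  [3, 4, 5, 6, 4],
    "coherence": [2, 3, 3, 5, 4],
    "hedging":   [1, 2, 2, 3, 2],
}

def summarise(group_scores):
    """Return (min, max, mean) per category for a range-and-mean chart."""
    summary = {}
    for category, scores in group_scores.items():
        summary[category] = (min(scores), max(scores), sum(scores) / len(scores))
    return summary

for category, (low, high, mean) in summarise(group_scores).items():
    print(f"{category}: range {low}-{high}, mean {mean:.1f}")
```

A plotting library would then draw one bar per category from the minimum to the maximum, with a crossbar at the mean.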
Not all of the ideas proposed above may be practical, but they provide a frame-
work for developing best practice in providing feedback in a diagnostic assess-
ment context.
11.4 Limitations
Although the study reported in this book was carefully designed, several short-
comings must be acknowledged. These can broadly be divided into two groups.
The first group relates to limitations in the analysis of the 601 writing scripts and
the development of the resulting new scale. The second group of limitations is as-
sociated with the validation phase of the new scale.
The first limitation of Phase 1 of this study, and at the same time one of the major
limitations of this study, is the manner in which the five different writing levels
used as the basis for the analysis of the writing scripts were established. Because
no independent measure of writing ability was available to form the five different
sub-corpora, average ratings from the administration of the DELNA writing as-
sessment were used. These are of course based on the pre-existing scale. This
means that the existing rating scale had a direct influence on the development of
the new scale. Two factors may have reduced this potential influence. Firstly, the
ratings of the two raters were averaged in an attempt to reduce the influence of
any extreme individual raters. Secondly, the overall score was chosen, in the hope
that this might represent a general writing ability score. The findings from the fac-
tor analysis in Chapter 9 suggest that because raters’ scores seem to be influenced
by a holistic, global impression of writers’ ability, the overall DELNA writing
score is affected by the scale only to a certain extent. It is therefore hoped that this
flaw in the research design only had a limited effect on the outcome of this study.
The second limitation of Phase 1 is that it was not possible to establish any suit-
able measures for a number of categories in the analysis. One aspect which is piv-
otal to the assessment of performance as opposed to competence (North, 2003) is
fluency. However, although two measures of fluency which could be applied to
the writing product were identified in the literature review and included in the
analysis of the scripts, neither seemed to be very useful in the assessment of writ-
ing. The number of words produced, a measure of temporal fluency, was not suc-
cessful in discriminating between the five levels of writing (and therefore not in-
cluded in the new scale), whilst the number of self-corrections, a measure of re-
pair fluency, was criticized by the raters. Similarly, the new scale lacked a meas-
ure of grammatical complexity, arguably one of the more serious limitations of
the scale. More time might have resulted in the successful exploration of other
potential measures, which could be more fruitfully applied in assessment situa-
tions. For example, the number of passives or complex nominals per t-unit might
be promising measures. Whilst a measure for paragraphing was identified, a num-
ber of raters criticized the simplicity of this measure. Further research might be
able to develop a discourse analytic measure of paragraphing which is able to as-
sess a wider representation of paragraphing, including the organisation within
paragraphs. Further limitations of the scale include the complexity of the meas-
ures of coherence, the lack of a measure for lexical cohesion, and the lack of more
varied measures of academic style (which was only assessed by the number of hedging
devices used). Apart from these limitations with the scale categories, there were
also criticisms of missing scale categories. Some raters noted that there were no
categories in the new scale which could be applied to the inappropriate use of in-
formal vocabulary and abbreviations.
A third limitation relating to the first phase of this research is the fact that the rat-
ing scale was based on a specific task type. This of course limits the generalizabil-
ity of the resulting ratings to other contexts. Messick (1994) describes the aim of
performance testing as not being to measure particular performances but to be
able to deduce competence from those performances. Fulcher (1996c) argued that
aspects of performance based on a general theory of language ability (such as a
model of communicative competence would provide) result in greater gener-
alizability of a test. As soon as task effects are built into the descriptors, the rat-
ings are dependent on the tasks that the writers performed and therefore result in a
context dependent measure that does not generalize.
A case for generalizability can also be made from the perspective of the task in
use. If the task generally represents what is expected in the target use domain (in
this case the academic setting), then this would establish generalizability even if
the scale is based on this task alone. Research into faculties’ perceptions of which
tasks best represent the writing expected in their target language domain has
shown that there are major differences depending on the content area (e.g. S.
Smith, 2003) thus showing that arriving at a task type suitable to all disciplines is
difficult and potentially impossible.
A fourth limitation relating to the first phase of this research is the fact that the
raters were not directly involved in scale development. Although the scale was
developed on an empirical basis, raters could have been given a voice in the scale
development process. For example, the category of coherence could have been
simplified earlier in the scale development process if raters had been asked to trial
some sample performances. Besides getting them involved in the actual develop-
ment of the scale, it might have proven useful to explain to them certain decisions
that were made in the development process prior to the validation phase. It is pos-
sible that raters would have understood perfectly well why, for example, spelling
was not included in the scale and would therefore have approached the entire scale
differently.
The last limitation relating to Phase 1 is the fact that the new scale (as well as the
existing DELNA scale) is used to assess the performance of both native and non-
native speakers of English. It could be argued that these two groups of candidates
experience very different problems and should therefore not be assessed using a
common assessment instrument (see for example Elder, 1995). On the other hand,
it is very difficult to establish the language background of students. In the context
of DELNA, the first language of students is established based on a self-report
questionnaire. Anecdotal evidence suggests that a number of students report Eng-
lish as their first language even if this is not the case. Similarly, there are nowa-
days more and more students who learn English from a very early age (for exam-
ple in Singapore or India), but who would experience very different problems in
their language proficiency than for example students who grew up and underwent
their schooling in a country like New Zealand, Australia, the United Kingdom or
the United States of America. For practical reasons, it seems almost inevitable that
the scale used needs to be applicable to all students, regardless of their back-
ground.
The limitations listed above all related to the first phase of this study, the analysis
of the writing scripts and the scale development phase. Now, we turn to shortcom-
ings of the second phase.
The first limitation, and possibly the most important one, is the fact that the design
of the second phase of this study was not counterbalanced. As mentioned earlier
in Chapter 8, all ten raters first rated all one hundred scripts using the DELNA
scale and only then rated the hundred scripts using the new scale. Ideally, how-
ever, half of the raters should have used the new scale first and then the DELNA
scale, and the other half of the raters should have rated the scripts using the scales
in the opposite order. However, this was not possible for practical reasons. The
raters were only able to rate such large numbers of scripts at certain times during
the year, which constrained the design possibilities. There was
some indication from the interviews, however, that this order had no influence on
the outcome of the study, because the raters were used to rating with the existing
scale. However, if the new scale had been used first, this might have influenced
the outcome of the findings, as a number of raters reported changing their rating
behaviour after using the new descriptors. Also, because of the large number of
scripts that were rated, it is highly unlikely that raters were able to remember in-
dividual scripts from the first rating round to the second (two to three months
later).
A second limitation of Phase 2 relates to sample size. Only ten raters were used
and these raters rated only one hundred scripts each. The number of raters and
scripts had to be kept to these limits for financial reasons. Although two research
grants were secured to reimburse raters for the time they spent rating, these were
able to cover the expenses for only ten raters. Experts in multi-faceted Rasch
measurement suggest that a larger number of raters and scripts would have re-
turned even more stable results (Mike Linacre and Carol Myford, personal com-
munications). In particular, the scale category statistics would be more trustwor-
thy. As mentioned in Chapter 9, each band level should be used by at least ten rat-
ers for FACETS to return stable results. This was generally, but not always, the
case. Because this problem was anticipated, a fully crossed design was chosen,
which meant that all raters rated all one hundred scripts under both conditions.
This is generally regarded as helping to improve the stability of the statistics re-
turned by FACETS (Myford & Wolfe, 2003).
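The band-level requirement mentioned above (each band level used by at least ten raters for FACETS to return stable category statistics) can be sketched as a simple data check on the ratings before analysis. The ratings and function names below are invented for illustration; FACETS itself performs no such pre-check here.

```python
# Hypothetical sketch: flag band levels awarded by fewer than ten distinct
# raters, the threshold mentioned for stable FACETS category statistics.
# The (rater_id, band_level) pairs below are invented toy data.

from collections import defaultdict

ratings = [
    ("r1", 4), ("r2", 4), ("r3", 5), ("r1", 5), ("r2", 6),
    ("r3", 4), ("r4", 4), ("r4", 6), ("r5", 5), ("r5", 4),
]

def raters_per_band(ratings):
    """Map each band level to the set of raters who awarded it."""
    bands = defaultdict(set)
    for rater, band in ratings:
        bands[band].add(rater)
    return bands

def underused_bands(ratings, min_raters=10):
    """Band levels awarded by fewer than min_raters distinct raters."""
    return sorted(b for b, r in raters_per_band(ratings).items()
                  if len(r) < min_raters)

print(underused_bands(ratings))  # prints [4, 5, 6]: all toy bands fall short
```

With real data, any band levels flagged this way would mark the category statistics that should be interpreted with caution.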
The findings of the current study point to various directions for future research,
which again can be divided into two groups: those relating to Phase 1 of the cur-
rent study and those relating to the second phase.
The first group of suggestions for further research follows from the shortcomings
identified relating to the scale development. No suitable measure was, for exam-
ple, found for fluency in writing, which is an integral part of assessing perform-
ance (North, 2003). More research is necessary to establish what factors contrib-
ute to creating fluency in writing (especially in the writing product) and how these
factors are different to aspects of fluency in speech (as have been described by
Skehan, 2003). Similarly, as this and a number of other studies (e.g. Cumming et
al., 2005) have shown, writers do not necessarily produce more subordination
in an assessment context; more detailed research is necessary to establish if there
are other measures of grammatical complexity which could be successfully incor-
porated into a rating scale for writing. Further research might also be able to es-
tablish a less mechanical measure of paragraphing. As suggested previously, a
multiple regression analysis of all the measures of topical structure used in this
study might have been able to ascertain the measures most indicative of writing
ability. This analysis should be conducted in the future and trialled on another
group of raters. Finally, a measure of lexical cohesion should be sought in future
developments of the scale as well as more measures of reader-writer identity.
A second group of suggestions for future research relates to the second phase of
this study.
Firstly, it might be useful to collect think-aloud protocols from raters employing the
two very different rating scales used in this study. Although the quantitative rat-
ings give an indication of differences between the rating processes employed by
raters, and anecdotal evidence was collected from the raters during the interviews,
it might be fruitful to obtain an insight into the cognitive processes of raters dur-
ing the rating procedure. However, some doubt has been cast on the validity of
think-aloud protocols (e.g. Barkaoui, 2007a, 2007b; Stratman & Hamp-Lyons,
1994). It is not entirely clear if the online accounts of raters provide a complete
picture of their thought-processes, if the rating process is fundamentally altered in
the process of providing a think-aloud protocol, if raters are even able to describe
their thought processes or if these remain inaccessible. If think-aloud protocols are
seen to provide valid evidence of the rating process, then it might be useful to
compare the different processes raters follow when using the two different scales.
Findings from this could inform future scale development and rater training.
It would also be useful to establish if the raters rated more analytically because
they were unfamiliar with the new scale. Research needs to ascertain if certain
rating behaviour is associated with specific scale types, or if raters shift their rat-
ing patterns over time irrespective of what type of scale they are using.
A further area of possible future research could be a comparison of the rating be-
haviour of experienced and less experienced raters when employing the two
scales. Although not a focus of this study, there was some evidence that less ex-
perienced raters preferred using the more detailed, empirically-developed scale,
whilst more experienced raters preferred the less descriptive intuitively-developed
scale. If this is in fact the case, then it might be useful to employ more detailed
descriptors in rater induction sessions. Similarly, there was also some evidence
that professional background played a role in raters’ perceptions of the two scales.
Raters coming from a background in English as a first language teaching rather
than English for Speakers of Other Languages (ESOL) seemed more accustomed
to rating holistically and therefore preferred the pre-existing descriptors. It is pos-
sible that raters with a background in English rather than ESOL need to be trained
not to rate globally when assessing in a diagnostic context. However, not enough
data was collected on either of these groups of raters and further research is there-
fore necessary.
This study set out to explore a way to develop rating scales empirically, so that the
descriptors reflect more closely what happens in students’ performance. Although
the scale development method explored here is not the only possible way to de-
velop empirically-based descriptors, the resulting rating scale was shown to pro-
vide raters with a more explicit basis for their rating decisions than
the more commonly used intuitively-developed rating scales. Not only were the
ratings on individual trait scales of the new scale more reliable and discriminating,
the resulting score profiles were more differentiated, because raters were able to
discern more aspects of the writers’ performance. Therefore, it could be
argued that the type of scale developed in this study is more suitable in a diagnos-
tic context, where the aim is to provide students with feedback about their
strengths and weaknesses.
---
Notes:
1. A section specifically considering the feedback on the performance of content can be found below.
Appendix 1:
Writer Identity
I, you, we, us, our, me, mine, yours, my, your
Hedges
Can, could, may, might, perhaps, maybe, possible/possibly, suppose/supposed, I think, I
feel, sometimes, seem, relative/relatively, would, appear, probably, possibility, fairly,
usually, tend, hardly, more or less, should, suggest, indicate, potential/ly, assume, gener-
ally, about, believe, hypothesise, likely, speculate, estimate, doubt (used without a nega-
tive), presume
Boosters
Certain/ly, clear/ly, I know, definite/ly, fact, obvious/ly, sure/ly, like/ly, significant/ly,
enormous/ly, no/never, a lot, really, main/ly, very, extremely, at last, major, always,
demonstrate, substantially, will, all, many, apparent, evident, doubt (used in negative
sense, i.e. no doubt), doubtless, indeed, of course
Cohesion – anaphoric pronominals
this, that, these, those, it, he, she, its, her, him, his, me, their, them, they, there, here, the
former, the latter
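The count of hedging devices that underlies several of the feedback examples in this book can be sketched from the hedge list above. This is a deliberately simplified, hypothetical illustration: it matches single words only, ignores multi-word hedges such as ‘more or less’, and does not implement the negation condition on ‘doubt’.

```python
# Hypothetical sketch: counting hedging devices in an essay using a subset of
# the single-word hedges from Appendix 1. A real implementation would also
# need multi-word phrases, inflected forms, and the condition on 'doubt'.

import re

HEDGES = {
    "can", "could", "may", "might", "perhaps", "maybe", "possible", "possibly",
    "seem", "seems", "would", "probably", "usually", "should", "suggest",
    "indicate", "assume", "generally", "likely",
}

def count_hedges(text):
    """Count single-word hedges (case-insensitive) in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for token in tokens if token in HEDGES)

essay = "It seems that prices might rise, and this could possibly continue."
print(count_hedges(essay))  # prints 4
```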
References
Alderson, C. (1991). Bands and scores. In C. Alderson & B. North (Eds.), Language test-
ing in the 1990s: The communicative legacy. London: Modern English Publica-
tions/British Council/Macmillan.
Alderson, C. (2005). Diagnosing foreign language proficiency: The interface between
learning and assessment. London: Continuum.
Alderson, C., & Clapham, C. (1992). Applied linguistics and language testing: A case
study. Applied Linguistics, 13(2), 149-167.
Alderson, C., Clapham, C., & Wall, D. (1995). Language test construction and evalua-
tion. Cambridge: Cambridge University Press.
Allison, D. (1995). Assertions and alternatives: Helping ESL undergraduates extend their
choice in academic writing. Journal of Second Language Writing, 4, 1-16.
Anastasi, A. (1988). Psychological testing. New York: Macmillan.
Anderson, M. J. (2001). A new method for non-parametric multivariate analysis of vari-
ance. Austral Ecology, 26, 32-46.
Anderson, M. J. (2005). PERMANOVA: a FORTRAN computer program for permuta-
tional multivariate analysis of variance. Department of Statistics: University of
Auckland, New Zealand.
Andrich, D. (1978). A general form of Rasch's extended logistic model for partial credit
scoring. Applied Measurement in Education, 4, 363-378.
Andrich, D. (1998). Threshold, steps, and rating scale conceptualization. Rasch Meas-
urement: Transactions of the Rasch Measurement SIG, 12, 648.
Arnaud, P. J. L. (1992). Objective lexical and grammatical characteristics of L2 written
compositions and the validity of separate-component tests. In P. J. L. Arnaud &
H. Bejoint (Eds.), Vocabulary and applied linguistics. London: Macmillan.
Bacha, N. (2001). Writing evaluation: What can analytic versus holistic essay scoring tell
us? System, 29, 371-383.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford
University Press.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment
Quarterly, 2(1), 1-34.
Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and
rater judgements in a performance test of foreign language speaking. Language
Testing, 12, 238-252.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford
University Press.
Bachman, L. F., & Palmer, A. S. (forthcoming). Language assessment in practice. Ox-
ford: Oxford University Press.
Bachman, L. F., & Savignon, S. J. (1986). The evaluation of communicative language
proficiency: A critique of the ACTFL Oral Interview. Modern Language Journal,
70, 380-390.
Bamberg, B. (1983). What makes a text coherent? College Composition and Communica-
tion, 34(4), 417-429.
Banerjee, J., & Franceschina, F. (2006). Documenting features of written language pro-
duction typical at different IELTS band score levels. Paper presented at the
Workshop sponsored by the European Science Foundation entitled 'Bridging the
gap between research on second language acquisition and research on language
testing', Amsterdam, February 2006.
Bardovi-Harlig, K., & Bofman, T. (1989). Attainment of syntactic and morphological
accuracy by advanced language learners. Studies in Second Language Acquisi-
tion, 11, 17-34.
Barkaoui, K. (2007a). Effects of thinking aloud on ESL essay rater performance: A Fac-
ets analysis. Paper presented at the Language Testing Research Colloquium, Bar-
celona, June 2007.
Barkaoui, K. (2007b). Raters' perceptions of the effects of thinking aloud on their ESL
essay rating performance: A qualitative study. Paper presented at the Annual
conference of the American Association of Applied Linguistics: Costa Mesa, CA,
April 2007.
Barlow, M. (2002). MonoConc Pro 2.2. Houston: Athelstan.
Barnwell, D. (1989). 'Naive' native speakers and judgements of oral proficiency in Span-
ish. Language Testing, 6, 152-163.
Barritt, L., Stock, P., & Clarke, F. (1986). Researching practice: Evaluating assessment
essays. College Composition and Communication, 37, 315-327.
Bereiter, C. (1980). Development in writing. In L. W. Gregg & E. R. Steinberg (Eds.),
Cognitive Processes in Writing. Hillsdale, NJ: Lawrence Erlbaum.
Bloor, M., & Bloor, T. (1991). Cultural expectations and socio-pragmatic failure in aca-
demic writing. In P. Adams, B. Heaton & P. Howarth (Eds.), Academic writing in
a second language: Essays on research and pedagogy. Norwood, NJ: Ablex.
Brindley, G. (1991). Defining language ability: The criteria for criteria. In S. Anivan
(Ed.), Current developments in language testing. Singapore: SEAMEO Regional
Language Centre.
Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F.
Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition
and language testing research. Cambridge: Cambridge University Press.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12, 1-15.
Brown, A. (2000). An investigation of the rating process in the IELTS oral interview.
IELTS Research Reports, 3.
Brown, A. (2003). Legibility and the rating of second language writing: An investigation
of the rating of handwritten and word-processed IELTS task two essays. IELTS
Research Reports, 4, 131-151.
Brown, J. D., Hilgers, T., & Marsella, J. (1991). Essay prompts and topics: Minimizing the effect of mean differences. Written Communication, 8, 533-556.
Burneikaitė, N., & Zabiliūtė, J. (2003). Information structuring in learner texts: A possible relationship between the topical structure and the holistic evaluation of
learner essays. Studies about Language, 4, 1-11.
Burstein, J., Kukich, K., Wolff, S., Chi, L., Chodorow, M., Braden-Harder, L., et al.
(1998). Automated scoring using a hybrid feature identification technique. Paper
presented at the Annual Meeting of the Association of Computational Linguis-
tics, Montreal, Canada.
Canale, M. (1983). From communicative competence to communicative language peda-
gogy. In J. C. Richards & R. Schmidt (Eds.), Language and communication (pp.
2-27). London, UK: Longman.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to sec-
ond language teaching and testing. Applied Linguistics, 1, 1-47.
Carlson, S., Bridgeman, B., Camp, R., & Waanders, J. (1985). Relationship of admission
test scores to writing performance of native and nonnative speakers of English
(TOEFL Research Report No.19). Princeton, NJ: Educational Testing Service.
Carrell, P. (1988). Interactive text processing: Implications for ESL/second language
reading classrooms. In P. Carrell, J. Devine & D. Eskey (Eds.), Interactive ap-
proaches to second language reading. New York, Cambridge: Cambridge Uni-
versity Press.
Carroll, J. B. (1968). The psychology of language testing. In A. Davies (Ed.), Language
testing symposium: A psycholinguistic approach. Oxford: Oxford University
Press.
Carson, J. G. (2001). Second language writing and second language acquisition. In T.
Silva & P. K. Matsuda (Eds.), On second language writing. Mahwah, NJ: Law-
rence Erlbaum Associates.
Cascio, W. F. (1982). Applied psychology in personnel management. Reston, VA: Reston
Publishing Company.
Cason, G. J., & Cason, C. L. (1984). A deterministic theory of clinical performance rat-
ing. Evaluation and the Health Professions, 7, 221-247.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral
Research, 1, 245-276.
Chalhoub-Deville, M. (1995). Theoretical models, assessment frameworks and test con-
struction. Language Testing, 14(1), 3-22.
Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA research. In L. F.
Bachman & A. Cohen (Eds.), Interfaces between second language acquisition
and language testing research. Cambridge: Cambridge University Press.
Chapelle, C. A. (1999). Validity in language assessment. Annual Review of Applied Lin-
guistics, 19, 254-272.
Charney, D. (1984). The validity of using holistic scoring to evaluate writing: A critical
overview. Research in the Teaching of English, 18, 65-81.
Cheng, X., & Steffensen, M. S. (1996). Metadiscourse: A technique for improving stu-
dent writing. Research in the Teaching of English, 30(2), 149-181.
Chenoweth, N. A., & Hayes, J. R. (2001). Fluency in writing. Generating text in L1 and
L2. Written Communication, 18(1), 80-98.
Cheville, J. (2004). Automated scoring technologies and the rising influence of error.
English Journal, 93(4), 47-52.
Chiang, S. (1999). Assessing grammatical and textual features in L2 writing samples: The
case of French as a Foreign Language. The Modern Language Journal, 83, 219-
232.
Chiang, S. (2003). The importance of cohesive conditions to perceptions of writing qual-
ity at the early stages of foreign language learning. System, 31, 471-484.
Chodorow, M., & Burstein, J. (2004). Beyond essay length: Evaluating e-rater's per-
formance on TOEFL essays. TOEFL Research Report No. RR-73, ETS RR-04-
04. Princeton, NJ: Educational Testing Service.
Clarkson, R., & Jensen, M.-T. (1995). Assessing achievement in English for professional
employment programs. In G. Brindley (Ed.), Language assessment in action.
Sydney: National Centre for English Language Teaching and Research.
Cobb, T. (2002). Web Vocabprofile. Retrieved 12 December 2005, from
http://www.lextutor.ca/vp/
Cohen, A. D. (1994). Assessing language ability in the classroom (2nd ed.). Boston, Mas-
sachusetts: Heinle and Heinle Publishers.
Congdon, P. J., & McQueen, J. (2000). The stability of rater severity in large-scale as-
sessment programs. Journal of Educational Measurement, 37(2), 163-178.
Connor-Linton, J. (1995). Crosscultural comparison of writing standards: American ESL
and Japanese EFL. World Englishes, 14, 99-115.
Connor, U. (1990). Linguistic/rhetorical measures for international persuasive student
writing. Research in the Teaching of English, 24(1), 67-87.
Connor, U., & Farmer, F. (1990). The teaching of topical structure analysis as a revision
strategy for ESL writers. In B. Kroll (Ed.), Second language writing: Research
insights for the classroom. Cambridge: Cambridge University Press.
Connor, U., & Mbaye, A. (2002). Discourse approaches to writing assessment. Annual
Review of Applied Linguistics, 22, 263-278.
Cooper, T. C. (1976). Measuring written syntactic patterns of second language learners of
German. The Journal of Educational Research, 69, 176-183.
Corbel, C. (1995). Exrater: a knowledge-based system for language assessors. In G.
Brindley (Ed.), Language assessment in action. Sydney: National Centre for Eng-
lish Language Teaching and Research.
Council of Europe. (2001). Common European Framework of Reference for Languages:
Learning, teaching, assessment. Cambridge: Cambridge University Press.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213-238.
Crismore, A., Markkanen, R., & Steffensen, M. S. (1993). Metadiscourse in persuasive
writing: A study of texts written by American and Finnish university students.
Written Communication, 10, 39-71.
Crookes, G. (1989). Planning and interlanguage variation. Studies in Second Language
Acquisition, 11, 183-199.
Crowhurst, M. (1987). Cohesion in argument and narration at three grade levels. Re-
search in the Teaching of English, 21(2), 185-201.
Cumming, A. (1990). Expertise in evaluating second language compositions. Language
Testing, 7, 31-51.
Cumming, A. (1997). The testing of writing in a second language. In C. Clapham & D.
Corson (Eds.), Encyclopedia of Language and Education (Vol. 7: Language Test-
ing and Assessment). Dordrecht: Kluwer Academic Publishers.
Cumming, A. (1998). Theoretical perspectives on writing. Annual Review of Applied Lin-
guistics, 18, 61-78.
Cumming, A., Kantor, R., Baba, K., Erdosy, U., Eouanzoui, K., & James, M. (2005). Dif-
ferences in written discourse in independent and integrated prototype tasks for
next generation TOEFL. Assessing Writing, 10(1), 1-75.
Cumming, A., Kantor, R., & Powers, D. E. (2001). Scoring TOEFL essays and TOEFL
2000 prototype writing tasks: An investigation into raters' decision making and
development of a preliminary analytic framework. TOEFL Monograph Series 22.
Princeton, New Jersey: Educational Testing Service.
Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating
ESL/EFL writing tasks: A descriptive framework. The Modern Language Jour-
nal, 86, 67-96.
Cumming, A., & Mellow, D. (1996). An investigation into the validity of written indica-
tors of second language proficiency. In A. Cumming & R. Berwick (Eds.), Validation in language testing. Clevedon (England), Philadelphia: Multilingual Matters.
Cumming, A., & Riazi, A. (2000). Building models of adult second-language writing in-
struction. Learning and Instruction, 10, 55-71.
Davidson, F. (1993). Statistical support for training in ESL composition rating. In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts. Norwood, NJ: Ablex Publishing Company.
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Dictionary of language testing. Cambridge: Cambridge University Press.
Davies, A., & Elder, C. (2005). Validity and validation in language testing. In E. Hinkel
(Ed.), Handbook of research in second language teaching and learning. Mah-
wah, NJ: Lawrence Erlbaum.
Edgeworth, F. Y. (1888). The statistics of examinations. Journal of the Royal Statistical
Society, 51, 599-635.
Elder, C. (1993). How do subject specialists construe classroom language proficiency?
Language Testing, 10(3), 235-254.
Elder, C. (1995). The effect of language background on 'foreign' language test perform-
ance: Problems of classification and measurement. Language Testing Update, 17,
34-36.
Elder, C. (2003). The DELNA initiative at the University of Auckland. TESOLANZ
Newsletter, 12(1), 15-16.
Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater re-
sponses to an online rater training program. Language Testing, 24(1), 37-64.
Elder, C., & Erlam, R. (2001). Development and validation of the diagnostic English lan-
guage needs assessment (DELNA): Final Report. Auckland: University of Auck-
land, Department of Applied Language Studies and Linguistics.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to
enhance rater training: Does it work? Language Assessment Quarterly, 2(3), 175-
196.
Elder, C., & von Randow, J. (2002). Report on the 2002 Pilot of DELNA at the University
of Auckland. Auckland: University of Auckland, Department of Applied Lan-
guage Studies and Linguistics.
Ellis, R. (1987). Interlanguage variability in narrative discourse: Style shifting in the use
of the past tense. Studies in Second Language Acquisition, 18, 1-20.
Ellis, R., & Barkhuizen, G. (2005). Analyzing learner language. Oxford: Oxford Univer-
sity Press.
Ellis, R., & Yuan, F. (2004). The effects of planning on fluency, complexity, and accu-
racy in second language narrative writing. Studies in Second Language Acquisi-
tion, 26, 59-84.
Engber, C. (1995). The relationship of lexical proficiency to the quality of ESL composi-
tions. Journal of Second Language Writing, 4, 139-155.
Erdosy, U. (2000). Exploring the establishment of scoring criteria for writing ability in a
second language: The influence of background factors on variability in the deci-
sion-making processes of four experienced raters of ESL composition. Unpub-
lished MA thesis, University of Toronto.
Farhady, H. (1983). On the plausibility of the unitary language proficiency factor. In J.
W. Oller (Ed.), Issues in language testing research. Rowley, Mass. : Newbury
House.
Field, A. (2000). Discovering statistics using SPSS for Windows. London: SAGE Publica-
tions.
Field, Y., & Yip, L. (1992). A comparison of internal cohesive conjunction in the English
essay writing of Cantonese speakers and native speakers of English. RELC Jour-
nal, 23(1), 15-28.
Fisher, W. P. (1992). Reliability statistics. Rasch Measurement: Transactions of the
Rasch Measurement SIG, 6(3), 238.
Fitzgerald, J., & Spiegel, D. (1986). Textual cohesion and coherence in children's writing.
Research in the Teaching of English, 20, 263-280.
Flahive, D. E., & Gerlach Snow, B. (1980). Measures of syntactic complexity in evaluat-
ing ESL compositions. In J. W. Oller & K. Perkins (Eds.), Research in language testing. Rowley, Mass.: Newbury House.
Foltz, P. W., Laham, D., & Landauer, T. K. (2003). Automated essay scoring: Applications to educational technology. Retrieved from http://www.psych.nmsu.edu/~pfoltz/reprints/Edmedia99.html
Foster, P., & Skehan, P. (1996). The influence of planning and task type on second lan-
guage performance. Studies in Second Language Acquisition, 18, 299-323.
Frase, L., Faletti, J., Ginther, L., & Grant, L. (1999). Computer analysis of the TOEFL
Test of Written English. Research report No. 64. Princeton, NJ: Educational Test-
ing Service.
Freedman, S. W., & Calfee, R. C. (1983). Holistic assessment of writing: Experimental
design and cognitive theory. In J. Mosenthal, L. Tamor & S. Walmsley (Eds.),
Research in writing: Principles and methods. New York: Longman.
Friedlander, A. (1990). Composing in English: Effects of a first language on writing in
English as a Second Language. In B. Kroll (Ed.), Second language writing: Research insights for the classroom (pp. 109-125). Cambridge: Cambridge University Press.
Fulcher, G. (1987). Tests of oral performance: The need for data-based criteria. ELT
Journal, 41(4), 287-291.
Fulcher, G. (1995). Variable competence in second language acquisition: A problem for
research methodology? System, 23(1), 25-33.
Fulcher, G. (1996a). Does thick description lead to smart tests? A data-based approach to
rating scale construction. Language Testing, 13(2), 208-238.
Fulcher, G. (1996b). Invalidating validity claims for the ACTFL Oral rating scale. Sys-
tem, 24(2), 163-172.
Fulcher, G. (1996c). Testing tasks: Issues in task design and the group oral. Language
Testing, 13, 23-51.
Fulcher, G. (2003). Testing second language speaking. London: Pearson Longman.
Grabe, W., & Kaplan, R. B. (1996). Theory and practice of writing. New York: Long-
man.
Granger, S., & Tyson, S. (1996). Connector usage in the English essay writing of native and non-native EFL speakers of English. World Englishes, 15(1), 17-27.
Grant, L., & Ginther, L. (2000). Using computer-tagged linguistic features to describe L2
writing differences. Journal of Second Language Writing, 9(2), 123-145.
Grierson, J. (1995). Classroom-based assessment in intensive English centres. In G.
Brindley (Ed.), Language assessment in action. Sydney: National Centre for Eng-
lish Language Teaching and Research.
Halliday, M. A. K. (1985). An introduction to functional grammar. London: Arnold.
Halliday, M. A. K. (1994). An introduction to functional grammar. London: Edward Arnold.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.
Hamp-Lyons, L. (1990). Second language writing: Assessment issues. In B. Kroll (Ed.),
Second language writing: Research insights for the classroom. New York: Cam-
bridge University Press.
Hamp-Lyons, L. (2001). Fourth generation writing assessment. In T. Silva & P. K. Ma-
tsuda (Eds.), On second language writing. Mahwah, NJ: Lawrence Erlbaum As-
sociates.
Hamp-Lyons, L. (2003). Writing teachers as assessors of writing. In B. Kroll (Ed.), Ex-
ploring the dynamics of second language writing. Cambridge: Cambridge Uni-
versity Press.
Harley, B., & King, M. L. (1989). Verb lexis in the written composition of young L2
learners. Studies in Second Language Acquisition, 11, 415-439.
Hasan, R. (1984). Coherence and cohesive harmony. In J. Flood (Ed.), Understanding
reading comprehension. Newark, DE: International Reading Association.
Hawkey, R. (2001). Towards a common scale to describe L2 writing performance. Cam-
bridge Research Notes, 5, 9-13.
Hawkey, R., & Barker, F. (2004). Developing a common scale for the assessment of writ-
ing. Assessing Writing, 9(2), 122-159.
Heatley, A., & Nation, P. (1994). Range [Computer program, available at http://vuw.ac.nz/lals/]. Wellington, NZ: Victoria University of Wellington.
Henry, K. (1996). Early L2 writing development: A study of autobiographical essays by
university-level students of Russian. Modern Language Journal, 80, 309-326.
Hill, K. (1997). Who should be the judge? The use of non-native speakers as raters on a
test of English as an international language. Paper presented at the Current de-
velopments and alternatives in language assessment: Proceedings of LTRC 96,
Jyvaskyla, Finland: University of Jyvaskyla and University of Tampere.
Hinkel, E. (2003). Simplicity without elegance: Features of sentences in L1 and L2 aca-
demic texts. TESOL Quarterly, 37(2), 275-300.
Hinofotis, F. B., Bailey, K., & Stern, S. L. (1981). Section II. Empirical research. In A. S.
Palmer, P. Groot & G. Trosper (Eds.), The construct validation of tests of com-
municative competence (pp. 106-126). Washington, DC: TESOL.
Hirano, K. (1991). The effect of audience on the efficacy of objective measures of EFL
proficiency in Japanese university students. Annual Review of English Language
Education in Japan, 2, 21-30.
Hoenisch, S. (1996). The theory and method of topical structure analysis. Retrieved 30
April 2007, from http://www.criticism.com/da/tsa-method.php
Hoey, M. (1991). Patterns of lexis in text. Oxford: Oxford University Press.
Homburg, T. J. (1984). Holistic evaluation of ESL composition: Can it be validated ob-
jectively? TESOL Quarterly, 18, 87-107.
Hu, Z., Brown, D., & Brown, L. (1982). Some linguistic differences in the written Eng-
lish of Chinese and Australian students. Language Learning and Communication,
1(1), 39-49.
Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge: Cambridge Uni-
versity Press.
Hunt, K. W. (1965). Grammatical structures written at three grade levels. Unpublished manuscript, Champaign, IL: NCTE.
Huot, B. (1990). Reliability, validity, and holistic scoring: What we know, what we need
to know. College Composition and Communication, 41(2), 201-213.
Hyland, K. (1996a). Talking to the academy: Forms of hedging in science research arti-
cles. Written Communication, 13(1), 251-281.
Hyland, K. (1996b). Writing without conviction? Hedging in science research articles.
Applied Linguistics, 17(4), 433-454.
Hyland, K. (1998). Hedging in scientific research articles. Amsterdam: John Benjamins.
Hyland, K. (2000a). Hedges, boosters and lexical invisibility: Noticing modifiers in aca-
demic texts. Language Awareness, 9(4), 179-301.
Hyland, K. (2000b). 'It might be suggested that...': Academic hedging and student writing.
Australian Review of Applied Linguistics, 16, 83-97.
Hyland, K. (2002a). Authority and invisibility: Authorial identity in academic writing.
Journal of Pragmatics, 34, 1091-1112.
Hyland, K. (2002b). Options of identity in academic writing. ELT Journal, 56(4), 351-
358.
Hyland, K. (2003). Second language writing. Cambridge: Cambridge University Press.
Hyland, K., & Milton, J. (1997). Hedging in L1 and L2 student writing. Journal of Sec-
ond Language Writing, 6(2), 183-296.
Hymes, D. H. (1967). Models of the interaction of language and social setting. Journal of
Social Issues, 23(2), 8-38.
Hymes, D. H. (1972). On communicative competence. In J. Holmes (Ed.), Sociolinguis-
tics: Selected readings. Harmondsworth, Middlesex: Penguin.
Ingram, D. E. (1995). Scales. Melbourne Papers in Language Testing, 4(2), 12-29.
Intaraprawat, P., & Steffensen, M. S. (1995). The use of metadiscourse in good and poor
ESL essays. Journal of Second Language Writing, 4(3), 253-272.
Ishikawa, S. (1995). Objective measurement of low-proficiency EFL narrative writing.
Journal of Second Language Writing, 4, 51-70.
Ivanic, R. (1998). Writing and identity: The discoursal construction of identity in aca-
demic writing. Amsterdam: Benjamins.
Ivanic, R., & Weldon, S. (1999). Researching the writer-reader relationship. In C. N.
Candlin & K. Hyland (Eds.), Writing: Texts, processes and practices. London:
Longman.
Iwashita, N., McNamara, T., & Elder, C. (2001). Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information-processing approach to task design. Language Learning, 51(3), 401-436.
Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hughey, J. (1981). Testing ESL
composition: A practical approach. Rowley, MA: Newbury House.
Jafarpur, A. (1991). Cohesiveness as a basis for evaluating compositions. System, 19(4),
459-465.
Jamieson, J. (2005). Trends in computer-based second language assessment. Annual Re-
view of Applied Linguistics, 25, 228-242.
Jarvis, S. (2002). Short texts, best-fitting curves and new measures of lexical diversity.
Language Testing, 19(1), 57-84.
Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educa-
tional and Psychological Measurement, 20, 141-151.
Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin,
112(3), 527-535.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educa-
tional Measurement: Issues and Practice, 12, 5-17.
Kawata, K. (1992). Evaluation of free English composition. CASELE Research Bulletin,
22, 305-313.
Kellogg, R. T. (1996). A model of working memory in writing. In C. M. Levy & S.
Ransdell (Eds.), The science of writing. Theories, methods, individual differ-
ences, and applications. Mahwah, NJ: Lawrence Erlbaum.
Kennedy, C., & Thorp, D. (2002). A corpus-based investigation of linguistic responses to an IELTS academic writing task. Birmingham: University of Birmingham.
Kenyon, D. (1992). An investigation of the validity of the demands of tasks on performance-based tasks of oral proficiency. Paper presented at the Language Testing Research Colloquium, Vancouver, Canada.
Kepner, C. (1991). An experiment in the relationship of types of written feedback to the
development of second-language writing skills. Modern Language Journal, 75,
305-313.
Kintsch, W., & Keenan, J. (1973). Reading rate and retention as a function of the number
of propositions in the base structure of sentences. Cognitive Psychology, 5, 257-
274.
Kintsch, W., & van Dijk, T. A. (1978). Toward a model of text comprehension and pro-
duction. Psychological Review, 85, 363-394.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How
does it compare with face-to-face training? Assessing Writing, 12, 26-43.
Kobayashi, H., & Rinnert, C. (1996). Factors affecting composition evaluation in an EFL
context: Cultural rhetorical pattern and readers' background. Language Learning,
46(3), 397-437.
Koponen, M., & Riggenbach, H. (2000). Overview: Varying perspectives on fluency. In
H. Riggenbach (Ed.), Perspectives on fluency. Ann Arbor: The University of
Michigan Press.
Kroll, B. (1998). Assessing writing abilities. Annual Review of Applied Linguistics, 18,
219-240.
Lado, R. (1961). Language testing. New York: McGraw-Hill.
Landy, F. J., & Farr, J. L. (1983). The measurement of work performance: Methods, the-
ory, and application. San Diego, CA: Academic Press.
Lantolf, J. P., & Frawley, W. (1985). Oral proficiency testing: A critical analysis. Modern
Language Journal, 69(4), 337-345.
Larkey, L. S. (1998). Automatic essay grading using text categorisation techniques. Paper
presented at the Twenty first annual international ACM SIGIR conference on re-
search and development in information retrieval, Melbourne, Australia.
Larsen-Freeman, D. (1978). Implications of the morpheme studies for second language
acquisition. Review of Applied Linguistics, 39-40, 93-102.
Larsen-Freeman, D. (1983). Assessing global second language proficiency. In H. W.
Seliger & M. H. Long (Eds.), Classroom-oriented research in second language
acquisition. Rowley, MA: Newbury House.
Larsen-Freeman, D., & Strom, V. (1977). The construction of a second language acquisi-
tion index of development. Language Learning, 27, 123-134.
Laufer, B. (1994). The lexical profile of second language writing: Does it change over
time? RELC Journal, 25, 21-33.
Lautamatti, L. (1987). Observations on the development of the topic of simplified dis-
course. In U. Connor & R. B. Kaplan (Eds.), Writing across languages: Analysis
of L2 text. Reading, MA: Addison-Wesley.
Lautamatti, L. (1990). Coherence in Spoken and Written Discourse. In U. Connor & A.
M. Johns (Eds.), Coherence in Writing. Research and Pedagogical Perspectives.
Alexandria, Virginia: Teachers of English to Speakers of Other Languages.
Lee, I. (2002a). Helping students develop coherence in writing. English Teaching Forum,
July 2002, 32-39.
Lee, I. (2002b). Teaching coherence to ESL students: a classroom inquiry. Journal of
Second Language Writing, 11, 135-159.
Lennon, P. (1991). Error: Some problems of definition, identification and distinction. Ap-
plied Linguistics, 12, 180-196.
Linacre, J. M. (1988). FACETS: A computer program for the analysis of multi-faceted
data. Chicago: MESA Press.
Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago, IL: MESA Press.
Linacre, J. M. (1994). A user's guide to FACETS: Rasch measurement computer pro-
gram. Chicago: MESA Press.
Linacre, J. M. (1999). Investigating rating scale category utility. Journal of Outcome
Measurement, 3(2), 103-122.
Linacre, J. M. (2006). Facets Rasch measurement computer program. Chicago: Win-
steps.com.
Linacre, J. M., & Wright, B. D. (1993). A user's guide to FACETS (Version 2.6). Chi-
cago, IL: MESA Press.
Liu, D. (2000). Writing cohesion. Using content lexical ties in ESOL. Forum, 38(1), 28-
36.
Lloyd-Jones, R. (1977). Primary trait scoring. In C. R. Cooper & L. Odell (Eds.), Evalu-
ating writing. New York: National Council of Teachers of English.
Loewen, S., & Ellis, R. (2004). The relationship between English vocabulary knowledge
and the academic success of second language university students. New Zealand
Studies in Applied Linguistics, 10(1), 1-29.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really
mean to the raters? Language Testing, 19(3), 246-276.
Lumley, T. (2005). Assessing second language writing. The rater's perspective. Frank-
furt: Peter Lang.
Lumley, T., & McNamara, T. (1995). Rater characteristics and rater bias: Implications for
training. Language Testing, 12(1), 54-71.
Lunt, H., Morton, J., & Wigglesworth, G. (1994). Rater behaviour in performance testing: Evaluating the effect of bias feedback. Melbourne: University of Melbourne.
Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
Mackey, A., & Gass, S. (2005). Second language research: Methodology and design.
Mahwah, NJ: Lawrence Erlbaum.
Madsen, H. S. (1983). Techniques in testing. Oxford: Oxford University Press.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
McArdle, B. H., & Anderson, M. J. (2001). Fitting multivariate models to community
data: a comment on distance-based redundancy analysis. Ecology, 82, 290-297.
McIntyre, P. N. (1993). The importance and effectiveness of moderation training on the
reliability of teachers' assessments of ESL writing samples. Unpublished MA
thesis, University of Melbourne.
McKay, P. (1995). Developing ESL proficiency descriptions for the school context: The
NLLIA ESL bandscales. In G. Brindley (Ed.), Language assessment in action.
Sydney: National Centre for English Language Teaching and Research.
McNamara, T. (1996). Measuring second language performance. Harlow, Essex: Pearson
Education.
McNamara, T. (2002). Discourse and assessment. Annual Review of Applied Linguistics,
22, 221-242.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Oxford: Basil Blackwell.
Mehnert, U. (1998). The effects of different length of time for planning on second lan-
guage performance. Studies in Second Language Acquisition, 20, 83-106.
Meisel, J., Clahsen, H., & Pienemann, M. (1981). On determining developmental stages
in natural second language acquisition. Studies in Second Language Acquisition,
3, 109-135.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.).
New York: Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of per-
formance assessments. Educational Researcher, 23(2), 13-23.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13,
241-256.
Mickan, P. (2003). 'What's your score?' An investigation into language descriptors for
rating written performance. Canberra: IELTS Australia.
Milanovic, M., Saville, N., Pollitt, A., & Cook, A. (1995). Developing rating scales for
CASE: Theoretical concerns and analyses. In A. Cumming & R. Berwick (Eds.),
Validation in language testing. Clevedon: Multilingual Matters.
Milanovic, M., Saville, N., & Shen, S. (1996). A study of the decision-making behaviour
of composition markers. In M. Milanovic & N. Saville (Eds.), Studies in Lan-
guage Testing 3: Performance, cognition and assessment. Cambridge: Cam-
bridge University Press.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our
capacity for processing information. Psychological Review, 63(2), 81-97.
Mislevy, R. J. (1993). Foundations of a new test theory. In N. Frederiksen, R. J. Mislevy
& I. I. Bejar (Eds.), Test theory for a new generation of tests. Hillsdale, N.J: Law-
rence Erlbaum Associates.
Mislevy, R. J. (1995). Test theory and language learning assessment. Language Testing,
12(3), 341-369.
Mislevy, R. J. (1996). Test theory reconceived. Journal of Educational Measurement, 33,
379-416.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of assessment
arguments. Measurement: Interdisciplinary Research and Perspectives, 1, 3-62.
Monroe, J. H. (1975). Measuring and enhancing syntactic fluency in French. The French
Review, 48, 1023-1031.
Moussavi, S. A. (2002). An encyclopedic dictionary of language testing (3rd ed.). Taiwan: Tung Hua Book Company.
Mugharbil, H. (1999). Second language learners' punctuation: Acquisition and aware-
ness. Unpublished PhD dissertation, University of Southern California.
Myford, C. M. (2002). Investigating design features of descriptive graphic rating scales.
Applied Measurement in Education, 15(2), 187-215.
Myford, C. M., & Wolfe, E. W. (2000). Monitoring sources of variability within the Test
of Spoken English assessment system. Princeton, NJ: Educational Testing Ser-
vice.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using
many-facet rasch measurement: Part I. Journal of Applied Measurement, 4(4),
386-422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using
many-facet rasch measurement: Part II. Journal of Applied Measurement, 5(2),
189-227.
Neuner, J. L. (1987). Cohesive ties and chains in good and poor freshman essays. Re-
search in the Teaching of English, 21(1), 92-105.
North, B. (1995). The development of a common framework scale of descriptors of lan-
guage proficiency based on a theory of measurement. System, 23(4), 445-465.
North, B. (2003). Scales for rating language performance: Descriptive models, formula-
tion styles, and presentation formats. TOEFL Monograph 24. Princeton: Educa-
tional Testing Service.
North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales.
Language Testing, 15(2), 217-263.
Nunan, D. (1992). Research methods in language learning. Cambridge: Cambridge Uni-
versity Press.
O'Loughlin, K. (1993). The assessment of writing by English and ESL teachers. Unpublished manuscript, Cambridge, August 1993.
O'Loughlin, K., & Wigglesworth, G. (2003). Task design in IELTS academic writing
Task 1: The effect of quantity and manner of presentation of information on can-
didate writing. IELTS Research Reports, 4, 89-129.
Oller, J. W. (1983). Evidence for a general language proficiency factor: An expectancy
grammar. In J. W. Oller (Ed.), Issues in language testing research. Rowley, Mas-
sachusetts: Newbury House.
Oller, J. W., & Hinofotis, F. B. (1980). Two mutually exclusive hypotheses about second
language ability: Indivisible or partially divisible competence. In J. W. Oller &
K. Perkins (Eds.), Research in Language Testing. Rowley, Massachusetts: New-
bury House.
Ortega, L. (1999). Planning and focus on form in L2 oral performance. Studies in Second
Language Acquisition, 21, 109-148.
Page, E. B. (1994). Computer grading of student prose, using modern concepts and soft-
ware. Journal of Experimental Education, 62, 127-142.
Pennington, M., & So, S. (1993). Comparing writing process and product across two lan-
guages: A study of six Singaporean university student writers. Journal of Second
Language Writing, 2, 41-63.
Perkins, K. (1980). Using objective methods of attained writing proficiency to discrimi-
nate among holistic evaluations. TESOL Quarterly, 14(1), 61-69.
Perkins, K., & Gass, S. (1996). An investigation of patterns of discontinuous learning:
Implications for ESL measurement. Language Testing, 13(1), 63-82.
Perkins, K., & Leahy, R. (1980). Using objective measures of composition to compare
native and non-native compositions. In R. Silverstein (Ed.), Occasional Papers in
Linguistics, No. 6. Carbondale: Southern Illinois University.
Perren, G. E. (1968). Testing spoken language: Some unresolved problems. In A. Davies
(Ed.), Language Testing Symposium. Oxford: Oxford University Press.
Pienemann, M., Johnston, M., & Brindley, G. (1988). Constructing an acquisition-based
procedure for second language assessment. Studies in Second Language Acquisi-
tion, 10, 217-234.
Polio, C. G. (1997). Measures of linguistic accuracy in second language writing research.
Language Learning, 47(1), 101-143.
Polio, C. G. (2001). Research methodology in second language writing research: The case
of text-based studies. In T. Silva & P. K. Matsuda (Eds.), On Second Language
Writing. Mahwah, NJ: Lawrence Erlbaum Associates.
Pollitt, A., Hutchinson, C., Entwistle, N., & DeLuca, C. (1985). What makes exam ques-
tions difficult? An analysis of 'O' grade questions and answers (Research Report
for Teachers No. 2). Edinburgh: Scottish Academic Press.
Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic
& N. Saville (Eds.), Performance testing, cognition and assessment: Selected pa-
pers from the 15th Language Testing Research Colloquium (LTRC), Cambridge
and Arnhem (Vol. 3). Cambridge: Cambridge University Press.
Porter, D. (1991). Affective factors in the assessment of oral interaction: Gender and
status. In S. Anivan (Ed.), Current developments in language testing. Singapore:
SEAMEO Regional Language Centre.
Powers, D. E., Burstein, J., Chodorow, M., Fowles, M. E., & Kukich, K. (2000). Compar-
ing the validity of automated and human essay scoring. GRE Board Research
Report No. 98-08a, ETS RR-00-10. Princeton, NJ: Educational Testing Service.
Quellmalz, E. S., Capell, F., & Chou, C. P. (1982). Effects of discourse and response
mode on the measurement of writing competence. Journal of Educational Meas-
urement, 19, 241-258.
Raffaldini, T. (1988). The use of situation tests as measures of communicative compe-
tence. Studies in Second Language Acquisition, 10(2), 197-216.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Co-
penhagen: Danish Institute for Educational Research.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (Ex-
panded ed.). Chicago: University of Chicago Press.
Reid, J. (1992). A computer text analysis of four cohesion devices in English discourse
by native and nonnative writers. Journal of Second Language Writing, 1(2), 79-
107.
Reid, J. (1993). Teaching ESL writing. Boston, Massachusetts: Prentice Hall.
Reynolds, D. W. (1995). Repetition in non-native speaker writing: More than quantity.
Studies in Second Language Acquisition, 17, 185-209.
Reynolds, D. W. (1996). Repetition in second language writing. Unpublished PhD disser-
tation, Indiana University.
Roberts, R., & Kreuz, R. J. (1993). Nonstandard discourse and its coherence. Discourse
Processes, 16, 451-464.
Sakyi, A. A. (2000). Validation of holistic scoring for ESL writing assessment: How rat-
ers evaluate compositions. In A. J. Kunnan (Ed.), Fairness and validation in lan-
guage assessment: Selected papers from the 19th Language Testing Research
Colloquium, Orlando, Florida. Cambridge: Cambridge University Press.
Schmidt, R. (1992). Psychological mechanisms underlying second language fluency.
Studies in Second Language Acquisition, 14, 357-385.
Schneider, M., & Connor, U. (1990). Analyzing topical structure in ESL essays. Studies
in Second Language Acquisition, 12(4), 411-427.
Scholz, G., Hendricks, D., Spurling, R., Johnson, M., & Vandenburg, L. (1980). Is lan-
guage ability divisible or unitary? A factor analysis of 22 English language profi-
ciency tests. In J. W. Oller & K. Perkins (Eds.), Research in language testing.
Rowley, Mass.: Newbury House.
Seliger, H. W., & Shohamy, E. (1989). Second language research methods. Oxford: Ox-
ford University Press.
Sharma, A. (1980). Syntactic maturity: Assessing writing proficiency in a second lan-
guage. In R. Silverstein (Ed.), Occasional Papers in Linguistics, No.6. Carbon-
dale: Southern Illinois University.
Shaw, P., & Liu, E. T.-K. (1998). What develops in the development of second-language
writing? Applied Linguistics, 19, 225-254.
Shaw, S. D. (2002). IELTS writing: Revising assessment criteria and scales (Phase 1).
Cambridge Research Notes, 9, 16-18.
Shaw, S. D. (2003). Legibility and the rating of second language writing: The effect on
examiners when assessing handwritten and word-processed scripts. Cambridge
Research Notes, 11, 7-15.
Shaw, S. D. (2004). Automated writing assessment: A review of four conceptual models.
Cambridge Research Notes, 17, 13-18.
Shermis, M. D., & Burstein, J. (Eds.). (2003). Automated essay scoring: A cross-
disciplinary perspective. Mahwah, N.J.: Lawrence Erlbaum.
Shohamy, E. (1998). How can language testing and SLA benefit from each other? The
case of discourse. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between
SLA and Language Testing Research. Cambridge: Cambridge University Press.
Shohamy, E., Gordon, C. M., & Kraemer, R. (1992). The effects of raters' background
and training on the reliability of direct writing tests. The Modern Language Jour-
nal, 76(1), 27-33.
Skehan, P. (1996). A framework for the implementation of task-based instruction. Ap-
plied Linguistics, 17, 38-62.
Skehan, P. (1998a). A cognitive approach to language learning. Oxford: Oxford Univer-
sity Press.
Skehan, P. (1998b). Task-based instruction. Annual Review of Applied Linguistics, 18,
268-286.
Skehan, P. (2001). Tasks and language performance assessment. In M. Bygate, P. Skehan
& M. Swain (Eds.), Researching pedagogic tasks: Second language learning,
teaching and testing. Harlow: Longman.
Skehan, P. (2003). Task-based instruction. Language Teaching, 36, 1-14.
Skehan, P., & Foster, P. (1997). Task type and processing conditions as influences on
foreign language performance. Language Teaching Research, 1(3), 185-211.
Sloan, C., & McGinnis, I. (1982). The effect of handwriting on teachers' grading of high
school essays. Journal of the Association for the Study of Perception, 17(2), 15-
21.
Smith, D. (2000). Rater judgments in the direct assessment of competency-based second
language writing ability. In G. Brindley (Ed.), Studies in immigrant English lan-
guage assessment. Sydney: National Centre for English Language Teaching and
Research.
Smith, S. (2003). Standards for academic writing: Are they common within and across
disciplines? Unpublished MA thesis, University of Auckland, New Zealand.
Song, B., & Caruso, I. (1996). Do English and ESL faculty differ in evaluating the essays
of native English-speaking and ESL students? Journal of Second Language Writ-
ing, 5(2), 163-182.
Spolsky, B. (1981). The gentle art of diagnostic testing. Paper presented at the Interuni-
versitaere Sprachtestgruppe Workshop on Diagnostic Testing, 15 December,
Hasensprungmuehle.
Spolsky, B. (1992). The gentle art of diagnostic testing revisited. In E. Shohamy & R. E.
Walton (Eds.), Language assessment for feedback: Testing and other strategies.
Dubuque, Iowa: Kendall/Hunt Publishing Company.
Stahl, A., & Lunz, M. E. (1992). Judge performance reports. Paper presented at the an-
nual meeting of the American Educational Research Association: San Francisco,
CA.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement ap-
proaches to estimating interrater reliability. Practical Assessment, Research and
Evaluation, 9(4), 1-20.
Stratman, J., & Hamp-Lyons, L. (1994). Reactivity in concurrent think-aloud protocols:
Issues for research. In P. Smagorinsky (Ed.), Speaking about writing: Reflections
on research methodology. Thousand Oaks, CA: Sage.
Sunderland, J. (1995). Gender and language testing. Language Testing Update, 17, 24-35.
Sweedler-Brown, C. (1993). ESL essay evaluation: The influence of sentence-level and
rhetorical features. Journal of Second Language Writing, 2(1), 3-17.
Tapia, E. (1993). Cognitive demand as a factor in interlanguage syntax: A study in topics
and texts. Indiana University.
Tavakoli, P., & Skehan, P. (2005). Strategic planning, task structure, and performance
testing. In R. Ellis (Ed.), Planning and task performance in a second language.
Oxford: Oxford University Press.
Tedick, D. J. (1990). ESL writing assessment: Subject-matter knowledge and its impact
on performance. English for Specific Purposes, 9, 123-143.
Tomita, Y. (1990). T-unit o mochiita kokosei no jiyu eisaku bun noryoku no sokutei (As-
sessing the writing ability of high school students with the use of t-units). Step
Bulletin, 2, 14-28.
Tribble, C. (1996). Writing. Oxford: Oxford University Press.
Tsang, W. K. (1996). Comparing the effects of reading and writing on writing perform-
ance. Applied Linguistics, 17, 210-233.
Turner, C. E. (2000). Listening to the voices of rating scale developers: Identifying sali-
ent features for second language performance assessment. The Canadian Modern
Language Review, 56(4), 555-584.
Turner, C. E., & Upshur, J. A. (2002). Rating scales derived from student samples: Ef-
fects of the scale maker and the student sample on scale content and student
scores. TESOL Quarterly, 36(1), 49-70.
Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language
tests. ELT Journal, 49(1), 3-12.
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language
speaking ability: Test method and learner discourse. Language Testing, 16(1),
82-111.
Vande Kopple, W. J. (1983). Something old, something new: Functional Sentence Per-
spective. Research in the Teaching of English, 17, 85-99.
Vande Kopple, W. J. (1985). Some exploratory discourse on metadiscourse. College
Composition and Communication, 36, 82-93.
Vande Kopple, W. J. (1986). Given and new information and some aspects of the struc-
tures, semantics, and pragmatics of written texts. In C. Cooper & S. Greenbaum
(Eds.), Studying writing: Linguistic approaches. London: Sage.
Vann, R. J. (1979). Oral and written syntactic relationships in second language learning.
In C. Yorio, K. Perkins & J. Schachter (Eds.), On TESOL '79: The learner in fo-
cus. Washington, D.C.: TESOL.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater's mind? In L. Hamp-
Lyons (Ed.), Assessing second language writing in academic contexts. Nor-
wood, New Jersey: Ablex Publishing Corporation.
Vollmer, H. J., & Sang, F. (1983). Competing hypotheses about second language ability:
A plea for caution. In J. W. Oller (Ed.), Issues in language testing research.
Rowley, Mass.: Newbury House.
Watson Todd, R. (1998). Topic-based analysis of classroom discourse. System, 26, 303-
318.
Watson Todd, R., Thienpermpool, P., & Keyuravong, S. (2004). Measuring the coherence
of writing using topic-based analysis. Assessing Writing, 9, 85-104.
Weigle, S. C. (1994a). Effects of training on raters of English as a second language com-
positions: Quantitative and qualitative approaches. Unpublished PhD disserta-
tion, University of California, Los Angeles.
Weigle, S. C. (1994b). Effects of training on raters of ESL compositions. Language Test-
ing, 11(2), 197-223.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing,
15(2), 263-287.
Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.
Weigle, S. C., Lu, Y., & Baker, A. (2007). Validation of automated essay scoring for ESL
writers. Paper presented at the Language Testing Research Colloquium, Barce-
lona, June 2007.
Weir, C. J. (1990). Communicative language testing. New Jersey: Prentice Hall Regents.
White, E. M. (1985). Teaching and assessing writing. San Francisco: Jossey-Bass Inc.
White, E. M. (1995). An apologia for the timed impromptu essay test. College Composi-
tion and Communication, 46, 30-45.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consis-
tency in assessing oral interaction. Language Testing, 10(3), 305-323.
Wigglesworth, G. (1997). An investigation of planning time and proficiency level on oral
test discourse. Language Testing, 14, 85-106.
Wigglesworth, G. (2000). Influences on performance in task-based oral assessments. In
M. Bygate, P. Skehan & M. Swain (Eds.), Researching pedagogic tasks: Second
language learning, teaching and testing. Harlow: Longman.
Wild, C., & Seber, G. (2000). Chance encounters: A first course in data analysis and in-
ference. New York: John Wiley & Sons, Inc.
Wilkinson, A. (1983). Assessing language development: The Crediton project. In A.
Freedman, I. Pringle & J. Yalden (Eds.), Learning to write: First lan-
guage/second language. New York: Longman.
Wilkinson, L., Blank, G., & Gruber, C. (1996). Desktop data analysis with SYSTAT. Up-
per Saddle River, NJ: Prentice-Hall.
Witte, S. (1983a). Topical structure analysis and revision: An exploratory study. College
Composition and Communication, 34(3), 313-341.
Witte, S. (1983b). Topical structure and writing quality: Some possible text-based expla-
nations of readers' judgments of students' writing. Visible Language, 17, 177-205.
Witte, S., & Faigley, L. (1981). Cohesion, coherence and writing quality. College Com-
position and Communication, 32(2), 189-204.
Wolfe-Quintero, K., Inagaki, S., & Kim, H.-Y. (1998). Second language development in
writing: Measures of fluency, accuracy and complexity. Technical Report No. 17.
Honolulu, HI: University of Hawai'i Press.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wu, J. (1997). Topical structure analysis of English as a second language (ESL) texts
written by college South-east Asian refugee students. Unpublished PhD disserta-
tion, University of Minnesota.
Young, R. (1995). Discontinuous interlanguage development and its implications for oral
proficiency rating scales. Applied Language Learning, 6, 13-26.
Yuan, F., & Ellis, R. (2003). The effects of pre-task planning and on-line planning on
fluency, complexity and accuracy in L2 monologic oral production. Applied Lin-
guistics, 24, 1-27.
Series editors: Rüdiger Grotjahn and Günther Sigott
Vol. 17 Ute Knoch: Diagnostic Writing Assessment. The Development and Validation of a Rating
Scale. 2009.