Testing for Language Teachers

Arthur Hughes
The right of the University of Cambridge to print and sell all manner of books was granted by Henry VIII in 1534. The University has printed and published continuously since 1584.
Melbourne Sydney

Bibliography: p.
Includes index.
ISBN 0 521 25264 4. - ISBN 0 521 27260 2 (pbk.)
1. Language and languages - Ability testing. I. Title. II. Series.
P53.4.H84 1989
407.5-dc19

ISBN 0 521 25264 4 hard covers
ISBN 0 521 27260 2 paperback
Copyright

The law allows a reader to make a single copy of part of a book for purposes of private study. It does not allow the copying of entire books or the making of multiple copies of extracts. Written permission for any such copying must always be obtained from the publisher in advance.
In this series:

Drama Techniques in Language Learning - A resource book of communication activities for language teachers by Alan Maley and Alan Duff

Games for Language Learning

Once Upon a Time
For Vicky, Meg and Jake
Contents

Acknowledgements viii
Preface
1 Teaching and testing
Validity
Reliability
Achieving beneficial backwash
9 Testing writing
10 Testing oral ability
11 Testing reading
12 Testing listening
14 Test administration
Appendix 1
Appendix 2
Bibliography
Index
Acknowledgements
... for extracts: American Council on the Teaching of Foreign Languages Inc. for extracts from ACTFL Provisional Proficiency Guidelines and from Generic Guidelines; ... Boğaziçi University Language ... for extracts from examinations; ... Council for the draft of the scale for ...; ... Press for M. Swan and C. Walter ... (1988); Educational Testing Service ...; ...lingual House for M. Garman and ... Chapter 4 (1983); The Foreign ... pp. 35-8 (1979); Harper & Row, Publishers, Inc. for the graph on p. 121 from Basic Statistical Methods by N. M. Downie and Robert W. Heath, copyright © 1985 by N. M. Downie and Robert W. Heath, reprinted with permission of Harper & Row, Publishers, Inc.; The Independent for N. Timmins: 'Passive smoking comes under fire', 14 March 1987; ... Board for extracts from March 1980 and June 1980 ...; Language Learning and J. W. Oller Jr. and C. ... for 'The cloze technique and ESL proficiency', Language Learning ...; Oxford ...
Preface
The simple objective of this book is to help language teachers write better tests. It takes the view that test construction is essentially a matter of problem solving, with every teaching situation setting a different testing problem. In order ...
... of testers. The starting point for this book is the admission that this mistrust is ...

... backwash need not always be harmful; it can be positively beneficial.
appropriate and the testing is not; we are then likely to suffer from harmful backwash. This would seem to be the situation that leads Davies to confine testing to the role of servant of teaching. But equally there may be occasions when teaching is poor or inappropriate and when testing is able to exert a beneficial influence. We cannot expect testing only to follow teaching. What we should demand of it, however, is that it should be supportive of good teaching and, where necessary, exert a corrective influence on bad teaching. If testing always had a beneficial backwash on teaching, it would have a much better reputation amongst teachers. Chapter 6 of this book is devoted to a discussion of how beneficial backwash can be achieved.
The second reason for mistrusting tests is that very often they fail to measure accurately whatever it is that they are intended to measure. Students' true abilities are not always reflected in the test scores that they obtain. To a certain extent this is inevitable. Language abilities are not easy to measure; we cannot expect a level of accuracy comparable to those of measurements in the physical sciences. But we can expect greater accuracy than is frequently achieved.
Why are tests inaccurate? The causes of inaccuracy (and ways of minimising their effects) are identified and discussed in subsequent chapters. ...
... something about the test creates a tendency for candidates to perform significantly differently on different occasions when they might take the test. Their performance might be quite different if they took the test on, say, Wednesday rather than on the following day. As a result, even if the scoring of their performance on the test is perfectly accurate (that is, the scorers do not make any mistakes), they will nevertheless obtain a markedly different score, depending on when they actually sat the test, even though there has been no change in the ability which the test is meant to measure. This is not the place to list all possible features of a test which might make it unreliable, but examples are: unclear instructions, ambiguous questions, items that result in guessing on the part of the test takers. While it is not possible entirely to eliminate such differences in behaviour from one test administration to another (human beings are not machines), there are principles of test construction which can reduce them.
In the second case, the same performances are accorded significantly different scores. For example, the same composition may be given very different scores by different markers (or even by the same marker on different occasions). Fortunately, there are well-understood ways of minimising such differences in scoring.
Most (but not all) large testing organisations, to their credit, take every precaution ... teachers' assessments ... sufficient, this is not true for the ... cases ... comparisons ...
If it is accepted that tests are necessary, and if we care about testing and its effect on teaching and learning, the other conclusion (in my view the correct one) to be drawn from a recognition of the poor quality of so much testing is that we should do everything that we can to improve the practice of testing.
... interpreted widely. It is used to ... ability. No distinction is made ...
What is to be done?

The teaching profession can make two contributions to the improvement of testing: they can write better tests themselves, and they can put pressure ...

READER ACTIVITIES

1. Think of tests with which you are familiar (the tests may be international or local, written by professionals or by teachers). What do you think the backwash effect of each of them is? Harmful or beneficial? What are your reasons for coming to these conclusions?
2. Consider these tests again. Do you think that they give accurate or inaccurate information? What are your reasons for coming to these conclusions?
Further reading

For an account of how the introduction of a new test can have a striking beneficial effect on teaching and learning, see Hughes (1988a). For a review of the new TOEFL writing test which acknowledges its potential beneficial backwash effect but which also points out that the narrow range of writing tasks set (they are of only two types) may result in narrow training in writing, see Greenberg (1986). For a discussion of the ethics of language testing, see Spolsky (1981).
Testing as problem solving

The purpose of this chapter is to introduce readers to the idea of testing as problem solving and to show how the content and structure of the book are designed to help them to become successful solvers of testing problems.

... takes place in the context of teaching institutions. In the same way, two teaching institutions may require very different tests, depending amongst other things on the objectives of their courses, the purpose and importance of the tests, and the resources that are available. The assumption that has to be made therefore is that each testing situation is unique and ... must ... solution. Every testing problem can be expressed in the same general terms: we have to develop tests which will: consistently ... which we are interested; have a beneficial effect on teaching (in those cases where the tests are likely to influence teaching); and be economical in terms of time and money.

... technical sense. It refers simply to what people ... for example, include the ability to converse ... to recite grammatical rules (if that is ...). It does not, however, refer to ...
The first thing that testers have to be clear about is the purpose of testing in any particular situation. Different purposes will usually require different kinds of test. This may seem obvious but it is something which seems not always to be recognised. The purposes of testing discussed in this book are: to measure language proficiency regardless of any language courses ...
... referred to in the previous chapter ... an absolutely essential quality of tests - what use is a test if it will give widely differing estimates of an individual's (unchanged) ability? - yet it is something which is ...
Further reading

The collection of critical reviews of nearly 50 English language tests ... Krahnke and Stansfield ... communicative language testing.
We use tests to obtain information. The information that we hope to obtain ... kinds of information being sought. This categorisation will prove useful both in deciding whether an existing test is suitable for a particular purpose and in writing appropriate new tests where these are necessary. The four types of test which we will discuss in the following sections are: proficiency tests, achievement tests, diagnostic tests, and placement tests.

Proficiency tests

... test's development. There are other proficiency tests which, by contrast, do not have any ...
Achievement tests

Most teachers are unlikely to be responsible for proficiency tests. It is much more probable that they will be involved in the preparation and use of achievement tests. In contrast to proficiency tests, achievement tests are directly related to language courses, their purpose being to establish how successful individual students, groups of students, or the courses themselves have been in achieving objectives. ... final achievement tests ... administered at the end of a course ...
... to be preferred: ... will provide accurate information about individual and group achievement, and it is likely to promote a more beneficial backwash effect on teaching.1
1 Of course, if objectives are unrealistic, then tests will also reveal a failure to achieve them. This too can only be regarded as salutary. There may be disagreement as to why there has been a failure to achieve the objectives, but at least this provides a starting point for necessary discussion which otherwise might never have taken place.
... have not been prepared. In a sense this is true. But in another sense it is not. If a test is based on the content of a poor or inappropriate course, the students taking it will be misled as to the extent of their achievement and the quality of the course. Whereas if the test is based on course objectives, not only will the information it gives be more useful, but there is less chance of the course surviving in its present unsatisfactory form. Initially some students may suffer, but future students will benefit from the pressure for change. The long-term interests of students are best served by final achievement tests whose content is based on course objectives.
The reader may ... difference between ... tests and ... tests ... objectives.
It has been argued in this section that it is better to base the content of achievement tests on course objectives rather than on the detailed content of a course. However, it may not be at all easy to convince colleagues of this, especially if the latter approach is already being followed. Not only is there likely to be natural resistance to change, but such a change may represent a threat to many people. A great deal of skill, tact and, possibly, ...
Diagnostic tests

... strengths and weaknesses. ... single ... since a student might give the ... As a result, a comprehensive diagnostic test ... (think of what would be involved in ...) ... information. The lack of good diagnostic tests is unfortunate. They could be ... instruction or self-instruction. Learners ... in their command of the language ... of information, exemplification and ... The availability of relatively inexpensive com... change the situation. Well-written ... that the learner spent no more time ... obtain the desired information, and without the need for a test administrator. Tests of this kind will still need a tremendous amount of work to produce. Whether or not they become generally available will depend on the willingness of individuals to write them and of publishers to distribute them.
Placement tests

... institution, a ... that is commercially available must be that it will not work well ... the most successful are those constructed 'in house' ... is rewarded by the saving in time and effort ... An example of how ... might be designed within ... given in Chapter ... of placement tests is ...

... of uses to which test results are put. We now distinguish between two approaches to test construction.
... the skills in which he or she is interested ... methods for achieving this are discussed in Chapters 11 and 12. Interestingly enough, in many texts on language testing it is the testing of productive skills that is presented as being most problematic, for reasons usually connected with reliability. In fact the problems are by no means insurmountable, as we shall see in Chapters 9 and 10.

Direct testing has a number of attractions. First, provided that we are clear about just what abilities we want to assess, it is relatively straightforward to create the conditions which will elicit the behaviour on which to base our judgements. Secondly, at least in the case of the productive skills, the assessment and interpretation of students' performance is also quite straightforward. Thirdly, since practice for the test involves practice of the skills that we wish to foster, there is likely to be a helpful backwash effect.
Indirect testing attempts to measure the abilities which underlie the skills in which we are interested. One section of the TOEFL, for example, was developed as an indirect measure of writing ability. It contains items of the following kind:

At first the old woman seemed unwilling to accept anything that was offered her by my friend and I.

where the candidate has to identify which ... erroneous or inappropriate in formal ...
... composition writing, to predict accurately composition writing ability from scores on tests which measure the abilities which we believe underlie it. We may construct tests of grammar, vocabulary, discourse markers, handwriting, punctuation, and what we will, but we still will not be able to predict accurately scores on compositions (even if we make sure of the representativeness of the composition scores by taking many samples).

It seems to me that in our present state of knowledge, at least as far as proficiency and final achievement tests are concerned, it is preferable to concentrate on direct testing. Provided that we sample reasonably widely ...
Norm-referenced and criterion-referenced testing

... proficiency in German Level 1 can 'speak and react to others using simple language in the following contexts':
to greet, interact with and take leave of others;
to exchange information on personal background, home, school life and interests;
t7
In these two cases we learn nothing about how the individual's performance compares with that of other candidates. Rather we learn something
abourt rvhat he or she can actually do in the language. Tests vihich a.r+
this kind of infornnation directly r.re sa.id tr 1,.:
..-lee;r'ned
to
."
evrrb..vs
-*
fsrc',.,ide
cr it ir ion - r ef er e n c e d.2
The purpose of criterion-referenced tests is to classify people according to whether or not they are able to perform some task or set of tasks satisfactorily. The tasks are set, and the performances are evaluated. It does not matter in principle whether all the candidates are successful, or none of the candidates is successful. The tasks are set, and those who perform them satisfactorily 'pass'; those who don't, 'fail'. This means that students are encouraged to measure their progress in relation to meaningful criteria, without feeling that, because they are less able than most of their fellows, they are destined to fail. In the case of the Berkshire German Certificate, for example, it is hoped that all students who are entered for it will be successful. Criterion-referenced tests therefore have two positive virtues: they set standards in terms of what people can do, which do not change with particular groups of candidates; and they motivate students ...
The need for direct interpretation of performance means that the construction of a criterion-referenced test may be quite different from that of a norm-referenced test designed to serve the same purpose. Let us imagine that the purpose is to assess the English language ability of students in relation to the demands made by English medium universities. The criterion-referenced test would almost certainly have to be based on an analysis of what students had to be able to do with or through English at university. Tasks would then be set similar to those to be met at university. If this were not done, direct interpretation of performance would be impossible. The norm-referenced test, on the other hand, while its content might be based on a similar ...
2 People differ somewhat in their use of the term 'criterion-referenced'. This is not important provided that the sense intended is made clear. The sense in which it is used here is the one which I feel will be most useful to the reader in analysing testing problems.
... with that score should be allowed to carry, this being based on experience over the years of students with similar scores, not on any meaning in the score itself. In the same way, university administrators have learned from experience how to interpret TOEFL scores and to set minimum scores for their own institutions.

Books on language testing have tended to give advice which is more appropriate to norm-referenced testing than to criterion-referenced testing. One reason for this may be that procedures for use with norm-referenced tests (particularly with respect to such matters as the analysis of items and the estimation of reliability) are well established, while those for criterion-referenced tests are not. The view taken in this book, and argued for in Chapter 6, is that criterion-referenced tests are often to be preferred, not least for the beneficial backwash effect they are likely to have. The lack of agreed procedures for such tests is not sufficient reason for them to be excluded from consideration.
... communicative testing are to be found throughout. A recapitulation under a separate heading would therefore be redundant.
READER ACTIVITIES

Consider a number of language tests with which you are familiar. For each of them, answer the following questions:
...
... the subjective items according to degree of subjectivity?
5. Is the test norm-referenced or criterion-referenced?
6. Does the test measure communicative abilities? Would you describe it as a communicative test? Justify your answers.
7. What relationship is there between the answers to question 5 and the answers to the other questions?
Further reading

... (1987) reports on research towards achievement test content ... the computer to language testing ... to be as authentic as possible: Vol. ... Testing is devoted to articles on ... account of the development of an ... Godshalk et al. (1966). Classic short ... norm-referencing (not restricted to ...
Validity

... Chapter 2 that a test is said to be valid if it measures accurately what it is intended to measure. This seems simple enough; however, the concept of validity ... deserves our attention. This chapter will present each aspect in turn, and attempt to show its relevance for the solution of language testing problems.
Content validity

... backwash effect. Areas ... ignored in teaching ... determined by the ... content is a fair reflection ... The writing of specifications ... found in Chapter 6.
Criterion-related validity

... the 'functions' which students ... which might take 45 minutes ... impractical. Perhaps it is felt that only ten minutes can be devoted to each student for the oral component. The question then arises: can such a ten-minute session give a sufficiently accurate estimate of the student's ability with respect to the functions specified in the course objectives? Is it, in other words, a valid measure?
... would be the criterion test against which the shorter test would be judged. The students' scores on the full test would be compared with the ones they obtained on the ten-minute session, which would have been conducted and scored in the usual way, without knowledge of their performance on the longer version. If the comparison between the two sets of scores reveals a high level of agreement, then the shorter version of the oral component may be considered valid, inasmuch as it gives results similar to those obtained with the longer version. If, on the other hand, the two sets of scores show little agreement, the shorter version cannot be considered valid; it cannot be used as a dependable measure of achievement with respect to the functions specified in the objectives. Of course, if ten minutes really is all that can be spared for each student, then the oral component may be included for the contribution that it makes to the assessment of students' overall achievement and for its backwash effect. But it cannot be regarded as an accurate measure in itself.

References to 'a high level of agreement' and 'little agreement' raise the question of how the level of agreement is measured. There are in fact standard procedures for comparing sets of scores in this way, which generate what is called a 'validity coefficient', a mathematical measure of similarity. Perfect agreement between two sets of scores will result in a validity coefficient of 1. Total lack of agreement will give a coefficient of zero. To get a feel for the meaning of a coefficient between these two extremes, read the contents of Box 1.
Box 1

...
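The computation of a validity coefficient is, in practice, usually a Pearson correlation between the two sets of scores. The sketch below assumes this; the student scores are invented for illustration and are not from Box 1.

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation coefficient: 1 = perfect agreement, 0 = none."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Invented scores for eight students: the full oral test (the criterion)
# and the ten-minute session being validated against it.
full_test  = [78, 65, 59, 83, 47, 70, 62, 55]
ten_minute = [74, 60, 62, 80, 45, 66, 58, 57]

validity_coefficient = pearson(full_test, ten_minute)
```

A coefficient near 1 would support treating the ten-minute session as a dependable substitute for the full test; a coefficient near zero would not.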
... ability. The saving in time would not be worth the risk of appointing someone with insufficient ability in the relevant foreign language. On the other hand, a coefficient of the same size might be perfectly acceptable for a brief interview forming part of a placement test.

It should be said that the criterion for concurrent validation is not necessarily a proven, longer test. A test may be validated against, for example, teachers' assessments of their students, provided that the assessments themselves can be relied on. This would be appropriate where a test was developed which claimed to be measuring something different from all existing tests, as was said of at least one quite recently developed 'communicative' test.
The second kind of criterion-related validity is predictive validity. This concerns the ...
Construct validity

... if it can be demonstrated that it measures just the ability which it is supposed to measure ... any underlying ability which is hypothesised in a theory of language ability ... for example, that the ability to ...
... we would obtain a set of scores ... coefficients could then be calculated between ... coefficients between scores on the same construct ... than those between scores on different constructs ... that we are indeed measuring separate and identifiable constructs.

Construct validation is a research activity, the means by which theories are put to the test and are confirmed, modified, or abandoned. It is through construct validation that language testing can be put on a sounder, more scientific footing. But it will not all happen overnight; there is a long way to go. In the meantime, the practical language tester should try to keep abreast of what is known. When in doubt, where it is possible, direct testing of abilities is recommended.
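The comparison of coefficients used in construct validation can be sketched concretely: scores on two tests of the same construct (say, two reading tests) should correlate more highly with each other than either does with scores on a test of a different construct (say, listening). All scores below are invented for illustration.

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation between two sets of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Invented scores for eight students on three tests.
reading_a = [50, 62, 41, 75, 58, 66, 47, 70]
reading_b = [48, 65, 44, 72, 60, 63, 45, 74]
listening = [55, 50, 60, 52, 49, 58, 62, 51]

same_construct  = pearson(reading_a, reading_b)   # two tests of reading
cross_construct = pearson(reading_a, listening)   # reading against listening
```

If the same-construct coefficient is clearly the higher of the two, that is evidence that the reading tests are measuring a separate, identifiable construct.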
Face validity

... particularly where it is intended to use indirect testing, reference should be made to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be used (this may often result in disappointment - another reason for favouring direct testing!).

Any published test should supply details of its validation, without ...
27
*
u
Validity
I
B
Consicier any tests with which you ar'e iamiiiar. Assess'eacii oi iiiciii iii
rerms of the various kinds of validity that have been presented in this
chapter. What empiricalevidence is there that the test is valid? If evidence
is lacking, how would you set about gathering it?
Further reading

For general discussion of test validity and ways of measuring it, see Anastasi (1975). For an interesting recent example of test validation (of the British Council ELTS test) in which a number of important issues are raised, see Criper and Davies (1988) and Hughes, Porter and Weir (1988). For the argument (with which I do not agree) that there is no criterion against which 'communicative' language tests can be validated (in the sense of criterion-related validity), see Morrow (1985). Bachman ...
Reliability
Imagine that a hundred students take a 100-item test at three o'clock one ... said to be.

Look at the hypothetical data in Table 1a). They represent the scores obtained by ten students who took a 100-item test (A) on a particular occasion, and those that they would have obtained if they had taken it a day later. Compare the two sets of scores. (Do not worry for the moment about the fact that we would never be able to obtain this information. Ways of estimating what scores people would have got on another occasion are discussed later. The most obvious of these is simply to have people take the same test twice.) Note the size of the difference between the two scores for each student.
TABLE 1a) SCORES ON TEST A (two administrations, ten students)

Students: Bill, Mary, Ann, Harry, Cyril, Pauline, Don, Colin, Irene, Sue
Scores: 58, 82, 28, 34, 67, 46, 19, 89, 43, 55, 43, 27, 76, 62, 63, 59, 35, 23, 62, 49
Now look at Table 1b).

TABLE 1b) SCORES ON TEST B (two administrations, ten students)

Students: Bill, Mary, Ann, Harry, Cyril, Pauline, Don, Colin, Irene, Sue
Scores: 65, 48, 23, 85, 59, 52, 44, 56, 38, 21, 90, 39, 59, 35, 19, 16, 67, 52, 52, 57
Which test seems the more reliable? The differences between the two sets of scores are much smaller for Test B than for Test A. On the evidence that we have here (and in practice we would not wish to make claims about reliability on the basis of such a small number of individuals), Test B appears to be more reliable than Test A.
Look now at Table 1c), which represents the scores of the same students on an interview using a five-point scale.
TABLE 1c) SCORES ON INTERVIEW (two administrations, ten students)

Students: Bill, Mary, Ann, Harry, Cyril, Pauline, Don, Colin, Irene, Sue
Scores (five-point scale): 2, 2, 4, 2, 4, ...
In one sense the two sets of interview scores are very similar. The largest difference between a student's actual score and the one which would have been obtained on the following day is 3. But the largest possible difference is only 4! Really the two sets of scores are very different. This becomes apparent once we compare the size of the differences between students with the size of differences between scores for individual students. They are of about the same order of magnitude. The result of this can be seen if we place the students in order according to their interview scores: the order based on their actual scores is quite different from the one based on the scores they would have obtained in an interview on the following day. This interview, then, is not very reliable at all.
The reliability coefficient

It is possible to quantify the reliability of a test in the form of a reliability coefficient. Reliability coefficients are like validity coefficients (Chapter 4). They allow us to compare the reliability of different tests. The ideal reliability coefficient is 1. A test with a reliability coefficient of 1 is one which would give precisely the same results for a particular set of candidates regardless of when it happened to be administered. A test which had a reliability coefficient of zero (and let us hope that no such test exists!) would give sets of results quite unconnected with each other, in the sense that the score that someone actually got on a Wednesday would be no help at all in attempting to predict the score he or she would get if they took the test the following day.
... coefficient we should expect for different types of language tests. Lado (1961), for example, says that good vocabulary, structure and reading tests are usually in the .90 to .99 range, while auditory comprehension tests are more often in the .80 to .89 range. Oral production tests may be in the .70 to .79 range. He says that a reliability coefficient of .85 might be high for an oral production test but low for a reading test. This relates to what he sees as the difficulty in achieving reliability in tests of different abilities. In fact the reliability coefficient that we demand should depend also on other considerations, most particularly the importance of the decisions that are to be taken on the basis of the test. The more important the decisions, the greater reliability we must demand: if we are going to refuse someone the opportunity to study overseas because of their score on a language test, then we have to be pretty sure that their score would not have been much different if they had taken the test a day or two earlier or later. The next section will explain how the reliability coefficient can be used to arrive at another figure (the standard error of measurement) to estimate likely differences of this kind. Before this is done, however, something needs to be said about the way in which reliability coefficients are arrived at.
The most obvious way of obtaining the necessary two sets of scores is to have the same group of subjects take the same test twice. If the second administration of the test is too soon after the first, then subjects are likely to recall items and their ... If there is too long a gap between administrations, then learning (or forgetting!) will have taken place, and the coefficient will be lower than it should be. However long the gap, the subjects are unlikely to be very motivated to take the same test twice, and this too is likely to have a depressing effect on the coefficient. These effects are reduced somewhat by the use of two different forms of the test (the alternate forms method). However, alternate forms are ...

... methods of obtaining the necessary two sets of scores which involve only one administration of one test ... provide us with a coefficient of ... the split half ... In this the subjects take the test in the usual way, but each subject is given two scores. One score is ...
] 7
|,I
1
R.eliability
f.or-thetest"+o-be-spli
&
I
Bl
-l
es
_[
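The split half procedure, together with the adjustment for test length that is usually applied (assumed here to be the Spearman-Brown formula; the ten pairs of half-test scores are invented for illustration), can be sketched as follows:

```python
# Split half reliability: correlate the two half-test scores across
# subjects, then adjust upward for full test length (Spearman-Brown).
# All scores below are invented for illustration.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_reliability(half_a, half_b):
    r_half = pearson(half_a, half_b)   # reliability of a half-length test
    return 2 * r_half / (1 + r_half)   # Spearman-Brown adjustment

half_a = [20, 17, 25, 12, 22, 15, 19, 24, 10, 18]
half_b = [19, 18, 23, 11, 24, 14, 17, 25, 12, 16]
print(round(split_half_reliability(half_a, half_b), 2))
```

The adjusted coefficient is always higher than the raw half-test correlation, reflecting the fact that the full test is twice as long as either half.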
The standard error of measurement and the true score
While the reliability coefficient allows us to compare the reliability of tests, it does not tell us directly how close an individual's actual score is to what it might have been. With a little further calculation, however, it is possible to estimate how close a person's actual score is to what is called their true score.

Imagine that it were possible for someone to take the same language test over and over again, an indefinitely large number of times, without their performance being affected by having already taken the test, and without their ability in the language changing. Unless the test is perfectly reliable, and provided that it is not so easy or difficult that the student always gets full marks or zero, we would expect their scores on the various administrations to vary. If we then calculated the average of all these scores, it would seem reasonable to regard that average, for our purposes, as the score which best represents their ability with respect to this particular test. It is this score that is referred to as the candidate's true score.

We are able to make statements about the probability that a candidate's true score (the one which best represents their ability on the test) is within a certain number of points of the score they actually obtained on the test. In order to do this, we first have to calculate the standard error of measurement of the particular test. The calculation (described in Appendix 1) is based on the reliability coefficient and a measure of the spread of all the scores on the test.
1. Because of the reduced length, which will lower the coefficient obtained, a statistical adjustment has to be made (see Appendix 1).
The use of the standard error of measurement is best illustrated by an example.
Suppose that a test has a standard error of measurement of 5, and that an individual scores 56 on that test. We are then in a position to make the following statements:

We can be about 68 per cent certain that the person's true score lies in the range of 51 to 61 (i.e. within one standard error of measurement of the score actually obtained on this occasion).

We can be about 95 per cent certain that their true score is in the range 46 to 66 (i.e. within two standard errors of measurement of the score actually obtained).

We can be 99.7 per cent certain that their true score is in the range 41 to 71 (i.e. within three standard errors of measurement of the score actually obtained).
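These bands can be computed directly. The sketch below uses the standard classical formula for the standard error of measurement, sem = sd × √(1 − r) (presumably the calculation the Appendix describes), then reproduces the three ranges above; the standard deviation of 10 and reliability of .75 are invented values chosen to give a standard error of 5, as in the example.

```python
# Standard error of measurement (SEM) from a test's reliability
# coefficient and the standard deviation of its scores, using the
# classical formula sem = sd * sqrt(1 - r).  The 68/95/99.7 per cent
# bands for the true score then follow from the normal distribution.

def standard_error(sd, reliability):
    return sd * (1 - reliability) ** 0.5

def true_score_bands(score, sem):
    # (confidence label, low end, high end) for 1, 2 and 3 standard errors
    return [(label, score - k * sem, score + k * sem)
            for k, label in ((1, '68%'), (2, '95%'), (3, '99.7%'))]

sem = standard_error(10, 0.75)   # invented: sd of 10, reliability .75 -> SEM of 5.0
for label, low, high in true_score_bands(56, sem):
    print(f'{label} certain the true score is between {low} and {high}')
```

With a score of 56 and a standard error of 5, this prints the ranges 51 to 61, 46 to 66 and 41 to 71, matching the worked example.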
These statements are based on what is known about the pattern of scores that would occur if it were in fact possible for someone to take the test repeatedly in the way described above. About 68 per cent of their scores would be within one standard error of measurement of their true score, and so on. If in fact they take the test only once, we cannot be sure how close their score on that occasion is to their true score, but we are still able to make statements of the kind illustrated above.
2. It is known how a person's scores would tend to be distributed if they took the same test an indefinitely large number of times (without the experience of any test-taking occasion affecting performance on any other occasion). The scores would follow what is called a normal distribution, a full discussion of which is beyond the scope of the present book. It is the known properties of the normal distribution that allow us to say what proportion of scores will fall within a certain range (for example, about 68 per cent of scores will fall within one standard error of measurement of the true score). Since about 68 per cent of actual scores will be within one standard error of measurement of the true score, we can be about 68 per cent certain that any particular actual score will be within one standard error of measurement of the true score.
3. It should be clear that there is no such thing as a 'good' or a 'bad' standard error of measurement. It is the particular use made of particular scores in relation to a particular standard error of measurement that may be considered acceptable or unacceptable.
Providers of tests should therefore give users not only the reliability coefficient but also the standard error of measurement. If a test is not reliable, we know that the actual scores of many candidates are likely to be quite different from their true scores, and we can place correspondingly little reliance on those scores, or on comparisons between candidates based on them.
Scorer reliability
In the first example given in this chapter we spoke about scores on a multiple choice test. It was most unlikely, we thought, that every candidate would get precisely the same score on both of two possible administrations of the test. We assumed, however, that the scoring of the test would be 'perfect'. That is, if a particular candidate did perform in exactly the same way on the two occasions, they would be given the same score on both occasions: any one scorer would give the same score on the two occasions, and this would be the same score as would be given by any other scorer on either occasion. It is possible to quantify the level of agreement given by different scorers on different occasions by means of a scorer reliability coefficient, which can be interpreted in a similar way to the test reliability coefficient. In the case of the multiple choice test just described, the scorer reliability coefficient would be 1.
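Where scoring involves judgement, a scorer reliability coefficient of this kind can be obtained by correlating the two sets of scores. A minimal sketch, using the Pearson correlation and invented marks given by two scorers to the same eight compositions:

```python
# Scorer reliability as the correlation between the marks two scorers
# give to the same set of performances.  Marks invented for illustration.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

scorer_1 = [14, 11, 18, 9, 16, 12, 17, 10]
scorer_2 = [15, 10, 17, 9, 18, 11, 16, 12]
print(round(pearson(scorer_1, scorer_2), 2))   # close to 1 = high agreement
```

A coefficient of 1 would mean perfectly consistent scoring; the further it falls below 1, the less the scores can be relied upon.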
If the scoring of a test is not reliable, then the test results cannot be reliable either.

How to make tests more reliable

Take enough samples of behaviour

Other things being equal, the more items that you have on a test, the more reliable that test will be. This seems intuitively right. If we wanted to know how good an archer someone was, we wouldn't rely on the evidence of a single shot at the target. That one shot could be quite unrepresentative of their ability. To be satisfied that we had a really reliable measure of the ability we would want to see a large number of shots at the target.

The same is true for language testing. It has been demonstrated empirically that the addition of further items will make a test more reliable. There is even a formula (see Appendix 1) that allows one to estimate how many additional items similar to those already in the test are needed to bring its reliability coefficient up to a required level.
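A formula for estimating the effect of lengthening a test (assumed here to be the Spearman-Brown formula, with invented coefficients for illustration) can be sketched as follows:

```python
# Spearman-Brown: estimated reliability of a test lengthened by a
# factor k (e.g. k = 2 doubles the number of comparable items), and
# the inverse: how much longer a test must be to reach a target
# reliability.  The .70 and .90 figures are invented examples.

def lengthened_reliability(r, k):
    return k * r / (1 + (k - 1) * r)

def length_factor_needed(r_now, r_wanted):
    # factor by which the test must be lengthened to reach r_wanted
    return r_wanted * (1 - r_now) / (r_now * (1 - r_wanted))

print(round(lengthened_reliability(0.70, 2), 2))   # doubling a .70 test -> about .82
print(round(length_factor_needed(0.70, 0.90), 1))  # about 3.9 times as many items
```

Note the diminishing returns: each further batch of comparable items raises the coefficient by less than the last.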
Do not allow candidates too much freedom

In some kinds of language test there is a tendency to offer candidates a choice of questions and then to allow them a great deal of freedom in the way that they answer the ones they have chosen. An example would be a test of writing where the candidates are simply given a selection of titles from which to choose. The greater the freedom given, the more a candidate's performance is likely to vary from occasion to occasion. In general, therefore, candidates should not be given a choice, and the range over which possible answers might vary should be restricted. Compare the following writing tasks:

a) Write a composition on tourism.
b) Write a composition on tourism in this country.
c) Write a composition on how we might develop the tourist industry in this country.
d) Discuss the following measures intended to increase the number of foreign tourists coming to this country:
i) More/better advertising and/or information (where? what form should it take?).
ii) Improve facilities (hotels, transportation, communication etc.).
iii) Training of personnel (guides, hotel managers etc.).
The successive tasks impose more and more control over what is written. The fourth task is likely to be a much more reliable indicator of writing ability than the first.

The general principle of restricting the freedom of candidates will be taken up again in chapters relating to particular skills. It should perhaps be said here, however, that in restricting the students we must be careful not to distort too much the task that we really want to see them perform. The potential tension between reliability and validity is taken up at the end of the chapter.
It is important that additional items should be independent of each other and of those already in the test. Imagine a reading test that asks the question: 'Where did the thief hide the jewels?' If an additional item following that took the form 'What was unusual about the hiding place?', it would not make a full contribution to an increase in the reliability of the test. Why not? Because it is hardly possible for someone who got the original question wrong to get the supplementary question right. Such candidates are effectively prevented from answering the additional question; for them, in reality, there is no additional question. We do not get an additional sample of their behaviour, and so the reliability of our estimate of their ability is not increased. Each additional item should as far as possible represent a fresh start for the candidate. By doing this we obtain additional information on all of the candidates, information which will make test results more reliable. The advice to take enough samples of behaviour should therefore not be taken to mean only that a test should contain many items; each item should also provide a genuinely independent sample of that behaviour.
Write unambiguous items

The fact that a candidate might interpret a question in different ways on different occasions means that the item is not contributing fully to the reliability of the test.
The best way to arrive at unambiguous items, having drafted them, is to subject them to the critical scrutiny of colleagues, who should try as hard as they can to find alternative interpretations to the ones intended. If this task is entered into in the right spirit, one of good-natured perversity, most of the problems can be identified before the test is administered.
… a parallel version, for example. For that reason, every effort must be made in advance of the test to ensure that candidates are quite clear about what will be required of them.
Provide uniform and non-distracting conditions of administration
The greater the differences between one administration of a test and another, the greater the differences one can expect between a candidate's scores on them. Great care should therefore be taken to ensure that every administration of the test is conducted in the same way: time limits should be specified and strictly adhered to, and quiet, non-distracting conditions maintained.
Use items that permit scoring which is as objective as possible

Multiple choice items permit completely objective scoring, but they are notoriously difficult to write and always require extensive pretesting. An alternative which also restricts the way in which candidates can respond is the open-ended item with a unique, possibly one-word, correct response. Even here, however, scoring is not always straightforward. The open-ended question What was different about the results? may be designed to elicit the response Success was closely associated with high motivation. This is likely to cause problems for scoring. Greater scorer reliability will probably be achieved if the question is followed by:

Success was ..........
Items of this kind are discussed in later chapters.

Make comparisons between candidates as direct as possible

This reinforces the suggestion already made that candidates should not be given a choice of items and that they should be restricted in the way they are allowed to respond. Scoring all the compositions on one topic will be more reliable than scoring compositions on a variety of topics chosen by the candidates.
Provide a detailed scoring key

This should specify acceptable answers and the assignment of points for acceptable partially correct responses. It should be the outcome of efforts to anticipate all possible responses, and should have been subjected to group criticism. (This advice applies only where responses can be specified fairly precisely; in the case of compositions, for instance, it applies only partially.)
Train scorers

This is especially important where scoring is most subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score accurately compositions from past administrations. After each administration, patterns of scoring should be analysed; individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.
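One simple way of analysing patterns of scoring, sketched below with invented marks and an arbitrary tolerance of two points, is to compare each scorer's average mark with the overall average and flag large deviations for follow-up:

```python
# Flag scorers whose average marks deviate markedly from the norm
# (the mean over all scorers).  Marks and the tolerance of 2 points
# are invented for illustration; this checks only marked deviation,
# not inconsistency, which would need a spread measure as well.

def flag_deviant_scorers(marks_by_scorer, tolerance=2.0):
    means = {s: sum(m) / len(m) for s, m in marks_by_scorer.items()}
    norm = sum(means.values()) / len(means)
    return [s for s, m in means.items() if abs(m - norm) > tolerance]

marks = {
    'scorer A': [12, 14, 13, 15],
    'scorer B': [13, 12, 14, 14],
    'scorer C': [18, 19, 17, 20],   # consistently well above the others
}
print(flag_deviant_scorers(marks))   # ['scorer C']
```

A flagged scorer is not necessarily wrong, of course; the point is to identify scoring that needs to be examined before that scorer is used again.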
Agree acceptable responses and appropriate scores at the outset of scoring

A sample of responses should be examined immediately after the administration of the test, and acceptable responses and appropriate scores agreed upon before scoring proper begins. Whenever the supervisor of scoring subsequently has to make a decision on the acceptability of a response or the points to be assigned, the supervisor should convey it to all the scorers concerned.
Reliability and validity

To be valid a test must provide consistently accurate measurements. It must therefore be reliable. A reliable test, however, may not be valid at all. For example, as a writing test we might require candidates to write down the translation equivalents of 500 words in their own language. This could well be a reliable test; but it is unlikely to be a valid test of writing.
In our efforts to make tests reliable, we must be wary of reducing their validity. Earlier in this chapter it was admitted that restricting the scope of what candidates are permitted to write might diminish the validity of the task. This depends in part on what exactly we are trying to measure by setting the task. If we are interested in candidates' ability to structure a composition, then it would be hard to justify providing them with a structure in order to increase reliability. At the same time we would still wish to restrict candidates in ways which would not render their performance on the task invalid.

There will always be some tension between reliability and validity. The tester has to balance gains in one against losses in the other.