Testing for Language Teachers

Arthur Hughes
The right of the University of Cambridge to print and sell all manner of books was granted by Henry VIII in 1534. The University has printed and published continuously since 1584.
Melbourne Sydney

Bibliography: p.
Includes index.
ISBN 0 521 25264 4. - ISBN 0 521 27260 2 (pbk.)
1. Language and languages - Ability testing. I. Title. II. Series.
P53.4.H84 1989
407.5-dc19

ISBN 0 521 25264 4 hard covers
ISBN 0 521 27260 2 paperback
Copyright

The law allows a reader to make a single copy of part of a book for purposes of private study. It does not allow the copying of entire books or the making of multiple copies of extracts. Written permission for any such copying must always be obtained from the publisher in advance.
In this series:

Drama Techniques in Language Learning - A resource book of communication activities for language teachers by Alan Maley and Alan Duff

Games for Language Learning

Once Upon a Time
For Vicky, Meg and Jake
Contents

Acknowledgements viii
Preface
1 Teaching and testing
Validity
Reliability
Achieving beneficial backwash
9 Testing writing
10 Testing oral ability
11 Testing reading
12 Testing listening
14 Test administration
Appendix 1
Appendix 2
Bibliography
Index
Acknowledgements
... for extracts: American Council on the Teaching of Foreign Languages Inc. for extracts from ACTFL Provisional Proficiency Guidelines and from Generic Guidelines; ... Boğaziçi University Language ... for extracts from examinations; ... Council for the draft of the scale for ...; ... Press for M. Swan and C. Walter ... (1988); Educational Testing Service ...; ...lingual House for M. Garman and ... Chapter 4 (1983); The Foreign ... pp. 35-8 (1979); Harper & Row, Publishers, Inc. for the graph on p. 121 from Basic Statistical Methods by N. M. Downie and Robert W. Heath, copyright © 1985 by N. M. Downie and Robert W. Heath, reprinted with permission of Harper & Row, Publishers, Inc.; The Independent for N. Timmins: 'Passive smoking comes under fire', 14 March 1987; ... Board for extracts from March 1980 and June 1980 ...; Language Learning and J. W. Oller Jr. and C. ... for 'The cloze technique and ESL proficiency', Language Learning ...; Oxford ...
Preface
The simple objective of this book is to help language teachers write better tests. It takes the view that test construction is essentially a matter of problem solving, with every teaching situation setting a different testing problem. In order ...
... of testers. The starting point for this book is the admission that this mistrust is ...

... backwash need not always be harmful; it can be positively beneficial.
appropriate and the testing is not; we are then likely to suffer from harmful backwash. This would seem to be the situation that leads Davies to confine testing to the role of servant of teaching. But equally there may be occasions when teaching is poor or inappropriate and when testing is able to exert a beneficial influence. We cannot expect testing only to follow teaching. What we should demand of it, however, is that it should be supportive of good teaching and, where necessary, exert a corrective influence on bad teaching. If testing always had a beneficial backwash on teaching, it would have a much better reputation amongst teachers. Chapter 6 of this book is devoted to a discussion of how beneficial backwash can be achieved.
The second reason for mistrusting tests is that very often they fail to measure accurately whatever it is that they are intended to measure. Students' true abilities are not always reflected in the test scores that they obtain. To a certain extent this is inevitable. Language abilities are not easy to measure; we cannot expect a level of accuracy comparable to those of measurements in the physical sciences. But we can expect greater accuracy than is frequently achieved.
Why are tests inaccurate? The causes of inaccuracy (and ways of minimising their effects) are identified and discussed in subsequent chapters. ...
... something about the test creates a tendency for candidates to perform significantly differently on different occasions when they might take the test. Their performance might be quite different if they took the test on, say, Wednesday rather than on the following day. As a result, even if the scoring of their performance on the test is perfectly accurate (that is, the scorers do not make any mistakes), they will nevertheless obtain a markedly different score, depending on when they actually sat the test, even though there has been no change in the ability which the test is meant to measure. This is not the place to list all possible features of a test which might make it unreliable, but examples are: unclear instructions, ambiguous questions, items that result in guessing on the part of the test takers. While it is not possible entirely to eliminate such differences in behaviour from one test administration to another (human beings are not machines), there are principles of test construction which can reduce them.
In the second case, the same performances are accorded significantly different scores. For example, the same composition may be given very different scores by different markers (or even by the same marker on different occasions). Fortunately, there are well-understood ways of minimising such differences in scoring.
Most (but not all) large testing organisations, to their credit, take every precaution ... teachers' assessments ... sufficient, this is not true for the ... cases ... comparisons ...
If it is accepted that tests are necessary, and if we care about testing and its effect on teaching and learning, the other conclusion (in my view the correct one) to be drawn from a recognition of the poor quality of so much testing is that we should do everything that we can to improve the practice of testing.
... interpreted widely. It is used to ... ability. No distinction is made ...
What is to be done?

The teaching profession can make two contributions to the improvement of testing: they can write better tests themselves, and they can put pressure ...

READER ACTIVITIES

1. Think of tests with which you are familiar (the tests may be international or local, written by professionals or by teachers). What do you think the backwash effect of each of them is? Harmful or beneficial? What are your reasons for coming to these conclusions?
2. Consider these tests again. Do you think that they give accurate or inaccurate information? What are your reasons for coming to these conclusions?
Further reading

For an account of how the introduction of a new test can have a striking beneficial effect on teaching and learning, see Hughes (1988a). For a review of the new TOEFL writing test which acknowledges its potential beneficial backwash effect but which also points out that the narrow range of writing tasks set (they are of only two types) may result in narrow training in writing, see Greenberg (1986). For a discussion of the ethics of language testing, see Spolsky (1981).
Testing as problem solving

The purpose of this chapter is to introduce readers to the idea of testing as problem solving and to show how the content and structure of the book are designed to help them to become successful solvers of testing problems.

... takes place in the context of teaching institutions. In the same way, two teaching institutions may require very different tests, depending amongst other things on the objectives of their courses, the purpose and importance of the tests, and the resources that are available. The assumption that has to be made therefore is that each testing situation is unique and ... must ... solution. Every testing problem can be expressed in the same general terms: we have to develop tests which will: consistently ... which we are interested; have a beneficial effect on teaching (in those cases where the tests are likely to influence teaching); and be economical in terms of time and money.

... technical sense. It refers simply to what people ... for example, include the ability to converse ... to recite grammatical rules (if that is ...). It does not, however, refer to ...
The first thing that testers have to be clear about is the purpose of testing in any particular situation. Different purposes will usually require different kinds of test. This may seem obvious but it is something which seems not always to be recognised. The purposes of testing discussed in this book are: to measure language proficiency regardless of any language courses ...
... referred to in the previous chapter ... an absolutely essential quality of tests - what use is a test if it will give widely differing estimates of an individual's (unchanged) ability? - yet it is something which is ...
Further reading

The collection of critical reviews of nearly 50 English language tests ... Krahnke and Stansfield ... communicative language testing.
We use tests to obtain information. The information that we hope to obtain ... kinds of information being sought. This categorisation will prove useful both in deciding whether an existing test is suitable for a particular purpose and in writing appropriate new tests where these are necessary. The four types of test which we will discuss in the following sections are: proficiency tests, achievement tests, diagnostic tests, and placement tests.

Proficiency tests

... test's development. There are other proficiency tests which, by contrast, do not have any ...
Achievement tests

Most teachers are unlikely to be responsible for proficiency tests. It is much more probable that they will be involved in the preparation and use of achievement tests. In contrast to proficiency tests, achievement tests are directly related to language courses, their purpose being to establish how successful individual students, groups of students, or the courses themselves have been in achieving objectives. ... final achievement tests ... administered at the end of a course ...
... to be preferred: ... will provide accurate information about individual and group achievement, and it is likely to promote a more beneficial backwash effect on teaching.1
1 Of course, if objectives are unrealistic, then tests will also reveal a failure to achieve them. This too can only be regarded as salutary. There may be disagreement as to why there has been a failure to achieve the objectives, but at least this provides a starting point for necessary discussion which otherwise might never have taken place.
... have not been prepared. In a sense this is true. But in another sense it is not. If a test is based on the content of a poor or inappropriate course, the students taking it will be misled as to the extent of their achievement and the quality of the course. Whereas if the test is based on course objectives, not only will the information it gives be more useful, but there is less chance of the course surviving in its present unsatisfactory form. Initially some students may suffer, but future students will benefit from the pressure for change. The long-term interests of students are best served by final achievement tests whose content is based on course objectives.
The reader may ... difference between ... tests and ... tests ... objectives.
It has been argued in this section that it is better to base the content of achievement tests on course objectives rather than on the detailed content of a course. However, it may not be at all easy to convince colleagues of this, especially if the latter approach is already being followed. Not only is there likely to be natural resistance to change, but such a change may represent a threat to many people. A great deal of skill, tact and, possibly, ...
Diagnostic tests

... strengths and weaknesses. ... single ... since a student might give the ... As a result, a comprehensive diagnostic test ... (think of what would be involved in ...) ... information. The lack of good diagnostic tests is unfortunate. They could be ... instruction or self-instruction. Learners ... in their command of the language ... of information, exemplification and ... The availability of relatively inexpensive com... change the situation. Well-written ... that the learner spent no more time ... obtain the desired information, and without the need for a test administrator. Tests of this kind will still need a tremendous amount of work to produce. Whether or not they become generally available will depend on the willingness of individuals to write them and of publishers to distribute them.
Placement tests

... institution, a ... that is commercially available must be that it will not work well ... the most successful are those constructed 'in house' ... is rewarded by the saving in time and effort ... An example of how ... might be designed within ... given in Chapter ... of placement tests is ...

... of uses to which test results are put. We now distinguish between two approaches to test construction.
... the skills in which he or she is interested ... methods for achieving this are discussed in Chapters 11 and 12. Interestingly enough, in many texts on language testing it is the testing of productive skills that is presented as being most problematic, for reasons usually connected with reliability. In fact the problems are by no means insurmountable, as we shall see in Chapters 9 and 10.

Direct testing has a number of attractions. First, provided that we are clear about just what abilities we want to assess, it is relatively straightforward to create the conditions which will elicit the behaviour on which to base our judgements. Secondly, at least in the case of the productive skills, the assessment and interpretation of students' performance is also quite straightforward. Thirdly, since practice for the test involves practice of the skills that we wish to foster, there is likely to be a helpful backwash effect.
Indirect testing attempts to measure the abilities which underlie the skills in which we are interested. One section of the TOEFL, for example, was developed as an indirect measure of writing ability. It contains items of the following kind:

At first the old woman seemed unwilling to accept anything that was offered her by my friend and I.

where the candidate has to identify which ... erroneous or inappropriate in formal ...
... composition writing, to predict accurately composition writing ability from scores on tests which measure the abilities which we believe underlie it. We may construct tests of grammar, vocabulary, discourse markers, handwriting, punctuation, and what we will, but we still will not be able to predict accurately scores on compositions (even if we make sure of the representativeness of the composition scores by taking many samples).

It seems to me that in our present state of knowledge, at least as far as proficiency and final achievement tests are concerned, it is preferable to concentrate on direct testing. Provided that we sample reasonably widely ...
Norm-referenced and criterion-referenced testing

... proficiency in German Level 1 can 'speak and react to others using simple language in the following contexts':
to greet, interact with and take leave of others;
to exchange information on personal background, home, school life and interests;
t7
In these two cases we learn nothing about how the individual's performance compares with that of other candidates. Rather we learn something
abourt rvhat he or she can actually do in the language. Tests vihich a.r+
this kind of infornnation directly r.re sa.id tr 1,.:
..-lee;r'ned
to
."
evrrb..vs
-*
fsrc',.,ide
cr it ir ion - r ef er e n c e d.2
The purpose of criterion-referenced tests is to classify people according to whether or not they are able to perform some task or set of tasks satisfactorily. The tasks are set, and the performances are evaluated. It does not matter in principle whether all the candidates are successful, or none of the candidates is successful. The tasks are set, and those who perform them satisfactorily 'pass'; those who don't, 'fail'. This means that students are encouraged to measure their progress in relation to meaningful criteria, without feeling that, because they are less able than most of their fellows, they are destined to fail. In the case of the Berkshire German Certificate, for example, it is hoped that all students who are entered for it will be successful. Criterion-referenced tests therefore have two positive virtues: they set standards in terms of what people can do, which do not change with particular groups of candidates; and they motivate students ...
The need for direct interpretation of performance means that the construction of a criterion-referenced test may be quite different from that of a norm-referenced test designed to serve the same purpose. Let us imagine that the purpose is to assess the English language ability of students in relation to the demands made by English medium universities. The criterion-referenced test would almost certainly have to be based on an analysis of what students had to be able to do with or through English at university. Tasks would then be set similar to those to be met at university. If this were not done, direct interpretation of performance would be impossible. The norm-referenced test, on the other hand, while its content might be based on a similar ...
2 People differ somewhat in their use of the term 'criterion-referenced'. This is not important provided that the sense intended is made clear. The sense in which it is used here is the one which I feel will be most useful to the reader in analysing testing problems.
... with that score should be allowed to carry, this being based on experience over the years of students with similar scores, not on any meaning in the score itself. In the same way, university administrators have learned from experience how to interpret TOEFL scores and to set minimum scores for their own institutions.

Books on language testing have tended to give advice which is more appropriate to norm-referenced testing than to criterion-referenced testing. One reason for this may be that procedures for use with norm-referenced tests (particularly with respect to such matters as the analysis of items and the estimation of reliability) are well established, while those for criterion-referenced tests are not. The view taken in this book, and argued for in Chapter 6, is that criterion-referenced tests are often to be preferred, not least for the beneficial backwash effect they are likely to have. The lack of agreed procedures for such tests is not sufficient reason for them to be excluded from consideration.
... communicative testing are to be found throughout. A recapitulation under a separate heading would therefore be redundant.
READER ACTIVITIES

Consider a number of language tests with which you are familiar. For each of them, answer the following questions:
...
... the subjective items according to degree of subjectivity?
5. Is the test norm-referenced or criterion-referenced?
6. Does the test measure communicative abilities? Would you describe it as a communicative test? Justify your answers.
7. What relationship is there between the answers to question 5 and the answers to the other questions?
Further reading

... (1987) reports on research towards achievement test content ... the computer to language testing ... to be as authentic as possible: Vol. ... Testing is devoted to articles on ... account of the development of an ... Godshalk et al. (1966). Classic short ... norm-referencing (not restricted to ...
Validity

... Chapter 2 that a test is said to be valid if it measures accurately what it is intended to measure. This seems simple enough; however, the concept of validity ... deserves our attention. This chapter will present each aspect in turn, and attempt to show its relevance for the solution of language testing problems.
Content validity

... backwash effect. Areas ... ignored in teaching ... determined by the ... content is a fair reflection ... The writing of specifications ... found in Chapter 6.
Criterion-related validity

... the 'functions' which students ... which might take 45 minutes ... impractical. Perhaps it is felt that only ten minutes can be devoted to each student for the oral component. The question then arises: can such a ten-minute session give a sufficiently accurate estimate of the student's ability with respect to the functions specified in the course objectives? Is it, in other words, a valid measure?
... would be the criterion test against which the shorter test would be judged. The students' scores on the full test would be compared with the ones they obtained on the ten-minute session, which would have been conducted and scored in the usual way, without knowledge of their performance on the longer version. If the comparison between the two sets of scores reveals a high level of agreement, then the shorter version of the oral component may be considered valid, inasmuch as it gives results similar to those obtained with the longer version. If, on the other hand, the two sets of scores show little agreement, the shorter version cannot be considered valid; it cannot be used as a dependable measure of achievement with respect to the functions specified in the objectives. Of course, if ten minutes really is all that can be spared for each student, then the oral component may be included for the contribution that it makes to the assessment of students' overall achievement and for its backwash effect. But it cannot be regarded as an accurate measure in itself.

References to 'a high level of agreement' and 'little agreement' raise the question of how the level of agreement is measured. There are in fact standard procedures for comparing sets of scores in this way, which generate what is called a 'validity coefficient', a mathematical measure of similarity. Perfect agreement between two sets of scores will result in a validity coefficient of 1. Total lack of agreement will give a coefficient of zero. To get a feel for the meaning of a coefficient between these two extremes, read the contents of Box 1.
Box 1

...
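The computation of a validity coefficient is, in practice, usually a Pearson correlation between the two sets of scores. The sketch below assumes this; the student scores are invented for illustration and are not from Box 1.

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation coefficient: 1 = perfect agreement, 0 = none."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Invented scores for eight students: the full oral test (the criterion)
# and the ten-minute session being validated against it.
full_test  = [78, 65, 59, 83, 47, 70, 62, 55]
ten_minute = [74, 60, 62, 80, 45, 66, 58, 57]

validity_coefficient = pearson(full_test, ten_minute)
```

A coefficient near 1 would support treating the ten-minute session as a dependable substitute for the full test; a coefficient near zero would not.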
... ability. The saving in time would not be worth the risk of appointing someone with insufficient ability in the relevant foreign language. On the other hand, a coefficient of the same size might be perfectly acceptable for a brief interview forming part of a placement test.

It should be said that the criterion for concurrent validation is not necessarily a proven, longer test. A test may be validated against, for example, teachers' assessments of their students, provided that the assessments themselves can be relied on. This would be appropriate where a test was developed which claimed to be measuring something different from all existing tests, as was said of at least one quite recently developed 'communicative' test.
The second kind of criterion-related validity is predictive validity. This concerns the ...
Construct validity

... if it can be demonstrated that it measures just the ability which it is supposed to measure ... any underlying ability which is hypothesised in a theory of language ability ... for example, that the ability to ...
... we would obtain a set of scores ... coefficients could then be calculated between ... coefficients between scores on the same construct ... than those between scores on different constructs ... that we are indeed measuring separate and identifiable constructs.

Construct validation is a research activity, the means by which theories are put to the test and are confirmed, modified, or abandoned. It is through construct validation that language testing can be put on a sounder, more scientific footing. But it will not all happen overnight; there is a long way to go. In the meantime, the practical language tester should try to keep abreast of what is known. When in doubt, where it is possible, direct testing of abilities is recommended.
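The comparison of coefficients used in construct validation can be sketched concretely: scores on two tests of the same construct (say, two reading tests) should correlate more highly with each other than either does with scores on a test of a different construct (say, listening). All scores below are invented for illustration.

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation between two sets of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Invented scores for eight students on three tests.
reading_a = [50, 62, 41, 75, 58, 66, 47, 70]
reading_b = [48, 65, 44, 72, 60, 63, 45, 74]
listening = [55, 50, 60, 52, 49, 58, 62, 51]

same_construct  = pearson(reading_a, reading_b)   # two tests of reading
cross_construct = pearson(reading_a, listening)   # reading against listening
```

If the same-construct coefficient is clearly the higher of the two, that is evidence that the reading tests are measuring a separate, identifiable construct.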
Face validity

... particularly where it is intended to use indirect testing, reference should be made to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be used (this may often result in disappointment - another reason for favouring direct testing!).

Any published test should supply details of its validation, without ...
27
*
u
Validity
I
B
Consicier any tests with which you ar'e iamiiiar. Assess'eacii oi iiiciii iii
rerms of the various kinds of validity that have been presented in this
chapter. What empiricalevidence is there that the test is valid? If evidence
is lacking, how would you set about gathering it?
Further reading

For general discussion of test validity and ways of measuring it, see Anastasi (1975). For an interesting recent example of test validation (of the British Council ELTS test) in which a number of important issues are raised, see Criper and Davies (1988) and Hughes, Porter and Weir (1988). For the argument (with which I do not agree) that there is no criterion against which 'communicative' language tests can be validated (in the sense of criterion-related validity), see Morrow (1985). Bachman ...
Reliability
Imagine that a hundred students take a 100-item test at three o'clock one ... said to be.

Look at the hypothetical data in Table 1a). They represent the scores obtained by ten students who took a 100-item test (A) on a particular occasion, and those that they would have obtained if they had taken it a day later. Compare the two sets of scores. (Do not worry for the moment about the fact that we would never be able to obtain this information. Ways of estimating what scores people would have got on another occasion are discussed later. The most obvious of these is simply to have people take the same test twice.) Note the size of the difference between the two scores for each student.
TABLE 1a) SCORES ON TEST A (two administrations, ten students)

Students: Bill, Mary, Ann, Harry, Cyril, Pauline, Don, Colin, Irene, Sue
Scores: 58, 82, 28, 34, 67, 46, 19, 89, 43, 55, 43, 27, 76, 62, 63, 59, 35, 23, 62, 49
Now look at Table 1b).

TABLE 1b) SCORES ON TEST B (two administrations, ten students)

Students: Bill, Mary, Ann, Harry, Cyril, Pauline, Don, Colin, Irene, Sue
Scores: 65, 48, 23, 85, 59, 52, 44, 56, 38, 21, 90, 39, 59, 35, 19, 16, 67, 52, 52, 57
Which test seems the more reliable? The differences between the two sets of scores are much smaller for Test B than for Test A. On the evidence that we have here (and in practice we would not wish to make claims about reliability on the basis of such a small number of individuals), Test B appears to be more reliable than Test A.
Look now at Table 1c), which represents the scores of the same students on an interview using a five-point scale.
TABLE 1c) SCORES ON INTERVIEW (two administrations, ten students)

Students: Bill, Mary, Ann, Harry, Cyril, Pauline, Don, Colin, Irene, Sue
Scores (five-point scale): 2, 2, 4, 2, 4, ...
In one sense the two sets of interview scores are very similar. The largest difference between a student's actual score and the one which would have been obtained on the following day is 3. But the largest possible difference is only 4! Really the two sets of scores are very different. This becomes apparent once we compare the size of the differences between students with the size of differences between scores for individual students. They are of about the same order of magnitude. The result of this can be seen if we place the students in order according to their interview scores: the order based on their actual scores is quite different from the one based on the scores they would have obtained in an interview on the following day. This interview, then, is not very reliable at all.
The reliability coefficient

It is possible to quantify the reliability of a test in the form of a reliability coefficient. Reliability coefficients are like validity coefficients (Chapter 4). They allow us to compare the reliability of different tests. The ideal reliability coefficient is 1. A test with a reliability coefficient of 1 is one which would give precisely the same results for a particular set of candidates regardless of when it happened to be administered. A test which had a reliability coefficient of zero (and let us hope that no such test exists!) would give sets of results quite unconnected with each other, in the sense that the score that someone actually got on a Wednesday would be no help at all in attempting to predict the score he or she would get if they took the test the following day.
... coefficient we should expect for different types of language tests. Lado (1961), for example, says that good vocabulary, structure and reading tests are usually in the .90 to .99 range, while auditory comprehension tests are more often in the .80 to .89 range. Oral production tests may be in the .70 to .79 range. He says that a reliability coefficient of .85 might be high for an oral production test but low for a reading test. This relates to what he sees as the difficulty in achieving reliability in tests of different abilities. In fact the reliability coefficient that we demand should depend also on other considerations, most particularly the importance of the decisions that are to be taken on the basis of the test. The more important the decisions, the greater reliability we must demand: if we are going to refuse someone the opportunity to study overseas because of their score on a language test, then we have to be pretty sure that their score would not have been much different if they had taken the test a day or two earlier or later. The next section will explain how the reliability coefficient can be used to arrive at another figure (the standard error of measurement) to estimate likely differences of this kind. Before this is done, however, something needs to be said about the way in which reliability coefficients are arrived at.
The most obvious way of obtaining the necessary two sets of scores is to have the same group of subjects take the same test twice. If the second administration of the test is too soon after the first, then subjects are likely to recall items and their ... If there is too long a gap between administrations, then learning (or forgetting!) will have taken place, and the coefficient will be lower than it should be. However long the gap, the subjects are unlikely to be very motivated to take the same test twice, and this too is likely to have a depressing effect on the coefficient. These effects are reduced somewhat by the use of two different forms of the test (the alternate forms method). However, alternate forms are ...

... methods of obtaining the necessary two sets of scores which involve only one administration of one test ... provide us with a coefficient of ... the split half ... In this the subjects take the test in the usual way, but each subject is given two scores. One score is ...
] 7
|,I
1
R.eliability
f.or-thetest"+o-be-spli
&
I
Bl
-l
es
_[
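The split half procedure, together with the adjustment for test length that is usually applied (assumed here to be the Spearman-Brown formula; the ten pairs of half-test scores are invented for illustration), can be sketched as follows:

```python
# Split half reliability: correlate the two half-test scores across
# subjects, then adjust upward for full test length (Spearman-Brown).
# All scores below are invented for illustration.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_reliability(half_a, half_b):
    r_half = pearson(half_a, half_b)   # reliability of a half-length test
    return 2 * r_half / (1 + r_half)   # Spearman-Brown adjustment

half_a = [20, 17, 25, 12, 22, 15, 19, 24, 10, 18]
half_b = [19, 18, 23, 11, 24, 14, 17, 25, 12, 16]
print(round(split_half_reliability(half_a, half_b), 2))
```

The adjusted coefficient is always higher than the raw half-test correlation, reflecting the fact that the full test is twice as long as either half.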
The standard error of measurement and the true score
While the reliability coefficient allows us to compare the reliability of tests, it does not tell us directly how close an individual's actual score is to what it might have been. With a little further calculation, however, it is possible to estimate how close a person's actual score is to what is called their true score.

Imagine that it were possible for someone to take the same language test over and over again, an indefinitely large number of times, without their performance being affected by having already taken the test, and without their ability in the language changing. Unless the test is perfectly reliable, and provided that it is not so easy or difficult that the student always gets full marks or zero, we would expect their scores on the various administrations to vary. If we then calculated the average of all these scores, it would seem reasonable to regard that average, for our purposes, as the score which best represents their ability with respect to this particular test. It is this score that is referred to as the candidate's true score.

We are able to make statements about the probability that a candidate's true score (the one which best represents their ability on the test) is within a certain number of points of the score they actually obtained on the test. In order to do this, we first have to calculate the standard error of measurement of the particular test. The calculation (described in Appendix 1) is based on the reliability coefficient and a measure of the spread of all the scores on the test.
1. Because of the reduced length, which will lower the coefficient obtained, a statistical adjustment has to be made (see Appendix 1).
The use of the standard error of measurement is best illustrated by an example.
Suppose that a test has a standard error of measurement of 5, and that an individual scores 56 on that test. We are then in a position to make the following statements:

We can be about 68 per cent certain that the person's true score lies in the range of 51 to 61 (i.e. within one standard error of measurement of the score actually obtained on this occasion).

We can be about 95 per cent certain that their true score is in the range 46 to 66 (i.e. within two standard errors of measurement of the score actually obtained).

We can be 99.7 per cent certain that their true score is in the range 41 to 71 (i.e. within three standard errors of measurement of the score actually obtained).
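These bands can be computed directly. The sketch below uses the standard classical formula for the standard error of measurement, sem = sd × √(1 − r) (presumably the calculation the Appendix describes), then reproduces the three ranges above; the standard deviation of 10 and reliability of .75 are invented values chosen to give a standard error of 5, as in the example.

```python
# Standard error of measurement (SEM) from a test's reliability
# coefficient and the standard deviation of its scores, using the
# classical formula sem = sd * sqrt(1 - r).  The 68/95/99.7 per cent
# bands for the true score then follow from the normal distribution.

def standard_error(sd, reliability):
    return sd * (1 - reliability) ** 0.5

def true_score_bands(score, sem):
    # (confidence label, low end, high end) for 1, 2 and 3 standard errors
    return [(label, score - k * sem, score + k * sem)
            for k, label in ((1, '68%'), (2, '95%'), (3, '99.7%'))]

sem = standard_error(10, 0.75)   # invented: sd of 10, reliability .75 -> SEM of 5.0
for label, low, high in true_score_bands(56, sem):
    print(f'{label} certain the true score is between {low} and {high}')
```

With a score of 56 and a standard error of 5, this prints the ranges 51 to 61, 46 to 66 and 41 to 71, matching the worked example.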
These statements are based on what is known about the pattern of scores that would occur if it were in fact possible for someone to take the test repeatedly in the way described above. About 68 per cent of their scores would be within one standard error of measurement of their true score, and so on. If in fact they take the test only once, we cannot be sure how close their score on that occasion is to their true score, but we are still able to make statements of the kind illustrated above.
2. It is known how a person's scores would tend to be distributed if they took the same test an indefinitely large number of times (without the experience of any test-taking occasion affecting performance on any other occasion). The scores would follow what is called a normal distribution, a full discussion of which is beyond the scope of the present book. It is the known properties of the normal distribution that allow us to say what proportion of scores will fall within a certain range (for example, about 68 per cent of scores will fall within one standard error of measurement of the true score). Since about 68 per cent of actual scores will be within one standard error of measurement of the true score, we can be about 68 per cent certain that any particular actual score will be within one standard error of measurement of the true score.
3. It should be clear that there is no such thing as a 'good' or a 'bad' standard error of measurement. It is the particular use made of particular scores in relation to a particular standard error of measurement that may be considered acceptable or unacceptable.
Providers of tests should therefore give users not only the reliability coefficient but also the standard error of measurement. If a test is not reliable, we know that the actual scores of many candidates are likely to be quite different from their true scores, and we can place correspondingly little reliance on those scores, or on comparisons between candidates based on them.
Scorer reliability
In the first example given in this chapter we spoke about scores on a multiple choice test. It was most unlikely, we thought, that every candidate would get precisely the same score on both of two possible administrations of the test. We assumed, however, that the scoring of the test would be 'perfect'. That is, if a particular candidate did perform in exactly the same way on the two occasions, they would be given the same score on both occasions: any one scorer would give the same score on the two occasions, and this would be the same score as would be given by any other scorer on either occasion. It is possible to quantify the level of agreement given by different scorers on different occasions by means of a scorer reliability coefficient, which can be interpreted in a similar way to the test reliability coefficient. In the case of the multiple choice test just described, the scorer reliability coefficient would be 1.
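Where scoring involves judgement, a scorer reliability coefficient of this kind can be obtained by correlating the two sets of scores. A minimal sketch, using the Pearson correlation and invented marks given by two scorers to the same eight compositions:

```python
# Scorer reliability as the correlation between the marks two scorers
# give to the same set of performances.  Marks invented for illustration.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

scorer_1 = [14, 11, 18, 9, 16, 12, 17, 10]
scorer_2 = [15, 10, 17, 9, 18, 11, 16, 12]
print(round(pearson(scorer_1, scorer_2), 2))   # close to 1 = high agreement
```

A coefficient of 1 would mean perfectly consistent scoring; the further it falls below 1, the less the scores can be relied upon.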
If the scoring of a test is not reliable, then the test results cannot be reliable either.

How to make tests more reliable

Take enough samples of behaviour

Other things being equal, the more items that you have on a test, the more reliable that test will be. This seems intuitively right. If we wanted to know how good an archer someone was, we wouldn't rely on the evidence of a single shot at the target. That one shot could be quite unrepresentative of their ability. To be satisfied that we had a really reliable measure of the ability we would want to see a large number of shots at the target.

The same is true for language testing. It has been demonstrated empirically that the addition of further items will make a test more reliable. There is even a formula (see Appendix 1) that allows one to estimate how many additional items similar to those already in the test are needed to bring its reliability coefficient up to a required level.
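A formula for estimating the effect of lengthening a test (assumed here to be the Spearman-Brown formula, with invented coefficients for illustration) can be sketched as follows:

```python
# Spearman-Brown: estimated reliability of a test lengthened by a
# factor k (e.g. k = 2 doubles the number of comparable items), and
# the inverse: how much longer a test must be to reach a target
# reliability.  The .70 and .90 figures are invented examples.

def lengthened_reliability(r, k):
    return k * r / (1 + (k - 1) * r)

def length_factor_needed(r_now, r_wanted):
    # factor by which the test must be lengthened to reach r_wanted
    return r_wanted * (1 - r_now) / (r_now * (1 - r_wanted))

print(round(lengthened_reliability(0.70, 2), 2))   # doubling a .70 test -> about .82
print(round(length_factor_needed(0.70, 0.90), 1))  # about 3.9 times as many items
```

Note the diminishing returns: each further batch of comparable items raises the coefficient by less than the last.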
Do not allow candidates too much freedom

In some kinds of language test there is a tendency to offer candidates a choice of questions and then to allow them a great deal of freedom in the way that they answer the ones they have chosen. An example would be a test of writing where the candidates are simply given a selection of titles from which to choose. The greater the freedom given, the more a candidate's performance is likely to vary from occasion to occasion. In general, therefore, candidates should not be given a choice, and the range over which possible answers might vary should be restricted. Compare the following writing tasks:

a) Write a composition on tourism.
b) Write a composition on tourism in this country.
c) Write a composition on how we might develop the tourist industry in this country.
d) Discuss the following measures intended to increase the number of foreign tourists coming to this country:
i) More/better advertising and/or information (where? what form should it take?).
ii) Improve facilities (hotels, transportation, communication etc.).
iii) Training of personnel (guides, hotel managers etc.).
The successive tasks impose more and more control over what is written. The fourth task is likely to be a much more reliable indicator of writing ability than the first.

The general principle of restricting the freedom of candidates will be taken up again in chapters relating to particular skills. It should perhaps be said here, however, that in restricting the students we must be careful not to distort too much the task that we really want to see them perform. The potential tension between reliability and validity is taken up at the end of the chapter.
It is important that additional items should be independent of each other and of those already in the test. Imagine a reading test that asks the question: 'Where did the thief hide the jewels?' If an additional item following that took the form 'What was unusual about the hiding place?', it would not make a full contribution to an increase in the reliability of the test. Why not? Because it is hardly possible for someone who got the original question wrong to get the supplementary question right. Such candidates are effectively prevented from answering the additional question; for them, in reality, there is no additional question. We do not get an additional sample of their behaviour, and so the reliability of our estimate of their ability is not increased. Each additional item should as far as possible represent a fresh start for the candidate. By doing this we obtain additional information on all of the candidates, information which will make test results more reliable. The advice to take enough samples of behaviour should therefore not be taken to mean only that a test should contain many items; each item should also provide a genuinely independent sample of that behaviour.
Write unambiguous items

The fact that a candidate might interpret a question in different ways on different occasions means that the item is not contributing fully to the reliability of the test.
The best way to arrive at unambiguous items, having drafted them, is to subject them to the critical scrutiny of colleagues, who should try as hard as they can to find alternative interpretations to the ones intended. If this task is entered into in the right spirit, one of good-natured perversity, most of the problems can be identified before the test is administered.
… a parallel version, for example. For that reason, every effort must be made in advance of the test to ensure that candidates are quite clear about what will be required of them.
Provide uniform and non-distracting conditions of administration
The greater the differences between one administration of a test and another, the greater the differences one can expect between a candidate's scores on them. Great care should therefore be taken to ensure that every administration of the test is conducted in the same way: time limits should be specified and strictly adhered to, and quiet, non-distracting conditions maintained.
Use items that permit scoring which is as objective as possible

Multiple choice items permit completely objective scoring, but they are notoriously difficult to write and always require extensive pretesting. An alternative which also restricts the way in which candidates can respond is the open-ended item with a unique, possibly one-word, correct response. Even here, however, scoring is not always straightforward. The open-ended question What was different about the results? may be designed to elicit the response Success was closely associated with high motivation. This is likely to cause problems for scoring. Greater scorer reliability will probably be achieved if the question is followed by:

Success was ..........
Items of this kind are discussed in later chapters.

Make comparisons between candidates as direct as possible

This reinforces the suggestion already made that candidates should not be given a choice of items and that they should be restricted in the way they are allowed to respond. Scoring all the compositions on one topic will be more reliable than scoring compositions on a variety of topics chosen by the candidates.
Provide a detailed scoring key

This should specify acceptable answers and the assignment of points for acceptable partially correct responses. It should be the outcome of efforts to anticipate all possible responses, and should have been subjected to group criticism. (This advice applies only where responses can be specified fairly precisely; in the case of compositions, for instance, it applies only partially.)
Train scorers

This is especially important where scoring is most subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score accurately compositions from past administrations. After each administration, patterns of scoring should be analysed; individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.
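One simple way of analysing patterns of scoring, sketched below with invented marks and an arbitrary tolerance of two points, is to compare each scorer's average mark with the overall average and flag large deviations for follow-up:

```python
# Flag scorers whose average marks deviate markedly from the norm
# (the mean over all scorers).  Marks and the tolerance of 2 points
# are invented for illustration; this checks only marked deviation,
# not inconsistency, which would need a spread measure as well.

def flag_deviant_scorers(marks_by_scorer, tolerance=2.0):
    means = {s: sum(m) / len(m) for s, m in marks_by_scorer.items()}
    norm = sum(means.values()) / len(means)
    return [s for s, m in means.items() if abs(m - norm) > tolerance]

marks = {
    'scorer A': [12, 14, 13, 15],
    'scorer B': [13, 12, 14, 14],
    'scorer C': [18, 19, 17, 20],   # consistently well above the others
}
print(flag_deviant_scorers(marks))   # ['scorer C']
```

A flagged scorer is not necessarily wrong, of course; the point is to identify scoring that needs to be examined before that scorer is used again.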
Agree acceptable responses and appropriate scores at the outset of scoring

A sample of responses should be examined immediately after the administration of the test, and acceptable responses and appropriate scores agreed upon before scoring proper begins. Whenever the supervisor of scoring subsequently has to make a decision on the acceptability of a response or the points to be assigned, the supervisor should convey it to all the scorers concerned.
Reliability and validity

To be valid a test must provide consistently accurate measurements. It must therefore be reliable. A reliable test, however, may not be valid at all. For example, as a writing test we might require candidates to write down the translation equivalents of 500 words in their own language. This could well be a reliable test; but it is unlikely to be a valid test of writing.
In our efforts to make tests reliable, we must be wary of reducing their validity. Earlier in this chapter it was admitted that restricting the scope of what candidates are permitted to write might diminish the validity of the task. This depends in part on what exactly we are trying to measure by setting the task. If we are interested in candidates' ability to structure a composition, then it would be hard to justify providing them with a structure in order to increase reliability. At the same time we would still wish to restrict candidates in ways which would not render their performance on the task invalid.

There will always be some tension between reliability and validity. The tester has to balance gains in one against losses in the other.