
Assessing authentic tasks: alternatives to mark-schemes

Dylan Wiliam
The kinds of authentic tasks that have been used in national assessments in England and Wales over the last thirty years - typically open-ended, 'pure' investigative tasks - are described, and the marking schemes used for their assessment are classified as either task-specific or generic. Generic schemes are further classified according to whether the 'degree of difficulty' of the task or the 'extent of progress' through the task is given most emphasis. A view of validation is presented that requires consideration of the value implications and social consequences of implementing assessment procedures, and it is argued that both task-specific and generic schemes will have the effect of stereotyping student approaches to these tasks. An alternative paradigm to norm-referenced and criterion-referenced interpretations of assessments, entitled 'construct-referenced' assessment, is proposed as being more consistent with the rationale behind such authentic assessments. Suggestions for the implementation of such a system are made, and indices derived from signal detection theory are suggested as appropriate measures for the evaluation of the accuracy of such assessments.

Introduction

For most of this century, there have been two levels of national examinations in England and Wales: one intended for 16-year-old students, which has come to serve as a school-leaving examination, and one taken at age 18, for university entrance. These examinations and other 'high-stakes' assessments in mathematics have always involved a preponderance of constructed-response questions. Indeed, at university entrance level it has been common to find a three-hour examination paper in mathematics in which the candidate is expected to answer only six or seven extended questions. The assessment of such examinations has been conducted in a largely pragmatic way. Rightly or wrongly, many of those involved in administering the national assessment systems have regarded classical psychometrics as having little to say about the design, implementation and assessment of such complex, performance-based tasks.
Dylan Wiliam (PhD) is Senior Lecturer in Mathematics Education at the Centre for Educational Studies, King's College, University of London, Great Britain.
Note: This article is based on a paper presented at the Nordic symposium Research on Assessment in Mathematics Education in Göteborg, November 5-7, 1993.

Nordic Studies in Mathematics Education Vol 2, No 1, 48-69

However, over the last thirty years there has been increasing concern that even such complex performance-based examinations only assess a sample of the mathematics felt to be important. Other forms of assessment, such as 'portfolios', extended pieces of work, and oral and aural tests, were developed. These developments were given added impetus when, in 1984, the Government announced that from 1988, all syllabuses for the school-leaving examinations should incorporate some school-based assessment. The same pragmatic approach to assessment has been applied to the co-ordination of these new school-based assessments, with little attention being paid to the underlying theoretical issues. This is unfortunate because a substantial amount of very important work that has been carried out by the five regional Examination Groups that administer national examinations has not been published.

This article is an attempt to pull the two poles closer together - by bringing to the attention of psychometricians some of the innovative assessment practices undertaken in Great Britain over the last thirty years, and by extending some of the concepts of psychometrics so as to be more applicable to the kinds of authentic assessments that are used in public examining. In section 2, I characterise the kinds of tasks that have come to be associated with school-based assessment in England and Wales, and describe some of the schemes that have been proposed for their assessment. I shall argue that these schemes, by treating certain approaches to supposedly 'open' tasks as canonical, have tended to stereotype mathematical activity in classrooms, and that this problem is inherent in all prescriptive assessment schemes. Section 3 reviews the development of the concept of the validity of an assessment, emphasising the role of inferences made on the basis of the assessment results rather than the assessments themselves. Section 4 takes this idea further by focusing on the referents of the assessment - ie, with what is the observed behaviour compared?
The well-known norm- and criterion-referenced interpretations of assessment results, and the less well known ipsative interpretations, are discussed, but it is argued that these are inadequate to describe some of the assessment practice that has emerged in recent years. A fourth kind of assessment and interpretation - construct-referenced assessment - is proposed and elaborated. Finally, in section 5, some of the practical requirements of construct-referenced assessment are discussed briefly, and the concepts of Signal Detection Theory are proposed as being adequate for the evaluation of the dependability of such assessments.
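As a rough illustrative sketch of how such signal-detection indices might be computed (the function, the correction used, and the figures below are my assumptions, not the article's), a marker's grading decisions can be cross-tabulated against an expert's, and a sensitivity index d' derived from the hit and false-alarm rates:

```python
# Illustrative sketch only: one way a signal-detection sensitivity index
# could summarise how dependably a marker reproduces an expert's judgements.
# Treating the expert's judgement as 'truth', awarding a grade the expert
# also awards is a hit; awarding it when the expert does not is a false alarm.
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    # Log-linear correction keeps the rates strictly between 0 and 1,
    # where the inverse normal is defined.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# A marker who agrees with the expert on 45 of 50 scripts above a grade
# boundary, and wrongly promotes 5 of 50 scripts below it:
print(round(d_prime(45, 5, 5, 45), 2))
```

Higher values indicate that the marker separates above-boundary from below-boundary work more dependably; d' = 0 means the marker's awards are unrelated to the expert's.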

2 Assessing authentic tasks in mathematics


One of the earliest uses of authentic tasks in a 'high-stakes' assessment setting was provided by a version of the school-leaving examination offered in 1966. In one of the three examination papers (Associated Examining Board, 1966), candidates had four hours to answer just one question from five:

1 Discuss the relevance of matrices to networks. Illustrate by suitable examples.
2 Discuss "Relations" with special reference to their representations. Illustrate by suitable examples.
3 Discuss the applications of sets to linear programming.
4 [After a definition and an example of a simple continued fraction] Investigate simple continued fractions.
5 Investigate either: Quadrilaterals: classification by symmetry, or: Triangles and their associated circles.

Over the next twenty years, a variety of syllabuses incorporated such 'open-ended' tasks, either as part of a school-based examination component or as questions in a formal examination. However, the means for their assessment were generally ad hoc, and often highly idiosyncratic. The announcement in June 1984 that 20% of the assessment in all school-leaving examinations in mathematics should be school-based triggered a sudden upsurge in the development of authentic tasks in mathematics, although the development has tended to favour some kinds of tasks more than others. In practice, there has been a strong bias towards open-ended problems in 'pure' mathematics, especially those involving combinatorics and enumeration. Wells (1986) argues that as well as being stereotyped in terms of content, the approach to these tasks in classrooms is also stereotyped - an approach Wells calls data-pattern-generalisation (DPG) - relying on a 'naive inductivist' view of mathematics.

The tasks that have been developed in the last thirty years therefore cannot claim to be a comprehensive or representative selection of mathematical activity. Nevertheless, the open-ended nature of the tasks did create significant problems for the constructors of assessment schemes. The use of the results of these examinations to make life-affecting decisions about employment and further education required that the assessments be highly reliable, while at the same time, the open-ended nature of the tasks makes it very difficult to prescribe and cater for the likely responses.

The schemes that have been developed are either generic, where a general assessment scheme is used for more than one (and, typically, every) task, or task-specific, where a separate assessment scheme is constructed for each task. These are discussed in turn below.

2.1 Generic assessment schemes

Clearly, if workable generic schemes could be produced, they would be much more efficient than having to re-author new assessment schemes as new tasks were developed. Accordingly, almost all of the effort of the five regional Examining Groups responsible for the development and implementation of the new examination went into the development of generic assessment schemes. A large number of assessment schemes were produced, and a retrospective analysis of these schemes suggested that almost all focused exclusively on one of two aspects of increasing competence in mathematical investigations (Wiliam, 1989). In the 'cognitive demand' approach, the focus (adopting a metaphor from competitive diving) is on the 'degree of difficulty' of the task. Continuing the diving metaphor, the other approach focuses on the extent of progress made on the task, or the 'marks for style'.

2.1.1 'Cognitive demand' approaches

The developmental psychology literature provides a number of models of the 'difficulty' of a task (Case, 1985; Pascual-Leone, 1970; Piaget, 1956), based on the cognitive processes required for successful action. Other models, such as the SOLO taxonomy (Biggs & Collis, 1982), ignore the cognitive processes and concentrate instead on the quality of the learning outcome. However, it was clear from very early development work that these frameworks were unlikely to be useful in developing assessment schemes for open-ended work, for two reasons. The first was that the number of different stages in the frameworks tended to be small - typically only four or five levels to cover the whole range of intellectual development from birth to adulthood - while the new examination was to report the attainment of just the age-16 cohort on an eight-point scale.
The second problem was that even though the number of levels was small, many tasks that should, according to their structure, be at the same level, actually showed widely differing degrees of difficulty - a phenomenon dubbed 'horizontal décalage' by Piaget & Inhelder (1941).

2.1.2 'Extent of progress' approaches

Instead, almost all the examination groups attempted to assess authentic tasks in terms of a problem-solving heuristic. Based on Polya's (1957)
four-step model of problem-solving, John Mason and Leone Burton had developed models of the problem-solving process (Burton, 1984; Mason, 1984), which were developed further by the Examining Groups into viable assessment schemes. The schemes have used 'criteria' to exemplify the stages reached by students, although these are not the precise behavioural objectives advocated by proponents of criterion-referenced assessment such as Popham (1980). Instead they used broad descriptors such as:

"Formulates general rules" (LEAG, 1987, p. 6)
"Express a generalisation in words" (OCEA, 1987, p. 13)
"Make and test generalisations and simple hypotheses" (DES & Welsh Office, 1989, p. 4)
"Make a generalisation and test it" (DES & Welsh Office, 1991, p. 4)

All these process-based schemes have, by ignoring the task variables, treated all tasks as essentially equivalent. Consequently, these process-based schemes would not distinguish between the same process displayed in different problem-contexts, even though the difficulty (as determined by, say, facility) might be very different. For example, the number of integer-sided triangles that can be made with longest side n is given by

    n(n+2)/4 for even n, and ((n+1)/2)^2 for odd n,

while the number of such triangles with perimeter n is

    (n^2 + 6n - 1 + 6(-1)^((n-3)/2))/48 for odd n not divisible by 3, and

    (n^2 + 6n + 15 + 6(-1)^((n-3)/2))/48 for odd n divisible by 3,

with the results for even n the same as for (odd) n - 3 (Wiliam, 1993b). Yet both could arise naturally out of the open stimulus "Investigate integer-sided triangles".

The descriptor "Formulates general rules" mentioned above was associated with the highest grade of the new examination - the General Certificate of Secondary Education or GCSE - and was designed to be attained by about 5% of the age-16 cohort. The generalisation for n as longest side would probably be regarded as 'too easy' for this standard, and that for n as perimeter as too hard. In this sense, the heuristic-based scheme does not give what teachers and examiners would regard as the 'right' result for either of these two versions of the same open task, even
where the response of the student follows the model of progression implicit in the assessment. Other approaches taken by students might diverge even further from the progression envisaged by the constructors of the assessment scheme.

Teachers' school-based assessments are subject to scrutiny by external moderators, who have (and have used) the power to revise marks by coarse re-scaling. Aware of this, and concerned to avoid having their grades revised downwards, teachers appear to have 'played safe', and used only coursework tasks that conform to the model of progression and the particular calibration implicit in the generic descriptors. This has produced a considerable stereotyping of the kinds of open-ended mathematical tasks that teachers offer to students - typically combinatoric problems with two independent variables and quadratic generalisations. This convergence can be viewed as teachers taking control of the system and 'making it make sense'. But it can also be viewed, from a Foucauldian perspective, as legislating the horizontal décalages out of existence by constraining the discourse within which the assessment takes place (Foucault, 1977) - 'if it doesn't fit the scheme, it's not proper mathematics'. These stereotyping effects would appear to be inherent in any generic assessment scheme. Nevertheless, such a model of general grade or level descriptors has been adopted for one of the dimensions of the revised National Curriculum for mathematics (DES & Welsh Office, 1991), nominally accounting for 20% of the curriculum.

2.2 Task-specific schemes

In response to the difficulties raised by horizontal décalage when using generic level descriptors, some assessment schemes (Bell, Burkhardt, & Swan, 1992; Graded Assessment in Mathematics, 1988, 1992) have developed task-specific level descriptions which take into account the context and difficulty of the tasks.
This creates an immediate difficulty in that only tasks for which assessment schemes have been prepared can be used, thus limiting the range of available tasks. However, in the development of the Graded Assessment in Mathematics (GAIM) scheme at King's College during 1984-1990, we discovered another difficulty. When performance descriptions for particular tasks were presented, teachers often regarded the identified descriptions as the only way of achieving a particular level, rather than an exemplification of the standard associated with that level. Furthermore, this was only partially alleviated by presenting multiple descriptions for each exemplified level. As well as restricting the kinds
of tasks that can be used, therefore, task-specific schemes may well promote stereotyped approaches to tasks to a greater extent than is the case with generic assessment schemes.

Open tasks are claimed, by their proponents, to allow opportunities for students to pose problems, to refine areas for investigation, and to make decisions about how to proceed (Brown & Walter, 1983; Mason, Burton, & Stacey, 1982). However, while the activity may be student-centred, the assessment is not. From a Foucauldian reading of the situation, it is clear that the discourse is constrained. In a high-stakes setting (Popham, 1987, p. 77), or one in which the stakes are perceived to be high (Madaus, 1988, p. 86), students will be 'disciplined' (Paechter, 1992) into adopting easily assessable, stereotyped responses. They will still be playing 'Guess what's in teacher's head'.

To sum up, by delineating particular 'canonical' responses, the task-specific schemes appear to lead teachers to direct students towards approaches that yield more easily 'assessable' responses. On the other hand, general schemes have tended to treat all tasks as equivalent, with scoring dependent upon the mathematical processes involved, which has placed a premium on selecting tasks that are likely to elicit the appropriate processes. What is required is a way of assessing authentic tasks on their own terms - in terms of what the student set out to do - but it does not seem as if any kind of explicit assessment scheme can achieve this. The possibilities for an implicit assessment scheme are discussed in sections 4 and 5, and in order to lay the foundations for this discussion, section 3 reviews the important developments in the concept of the validity of assessment.
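The closed forms for the triangle-counting tasks quoted in section 2.1.2 can be checked by brute-force enumeration. The sketch below is my illustration, not part of the article; it counts triples a ≤ b ≤ c of positive integers satisfying the triangle inequality a + b > c:

```python
# Brute-force check of the integer-sided triangle counts from section 2.1.2.

def triangles_with_longest_side(n):
    """Triangles a <= b <= c with longest side c = n (requires a + b > n)."""
    return sum(1 for b in range(1, n + 1)
                 for a in range(1, b + 1)
                 if a + b > n)

def triangles_with_perimeter(n):
    """Triangles a <= b <= c with a + b + c = n."""
    count = 0
    for a in range(1, n + 1):
        for b in range(a, n + 1):
            c = n - a - b
            if c >= b and a + b > c:
                count += 1
    return count

def perimeter_closed_form(n):
    """Closed form for odd perimeter n, as quoted in the text."""
    eps = (-1) ** ((n - 3) // 2)
    if n % 3 == 0:
        return (n * n + 6 * n + 15 + 6 * eps) // 48
    return (n * n + 6 * n - 1 + 6 * eps) // 48

# Longest-side formula: n(n+2)/4 for even n, ((n+1)/2)^2 for odd n.
for n in range(1, 60):
    expected = n * (n + 2) // 4 if n % 2 == 0 else ((n + 1) // 2) ** 2
    assert triangles_with_longest_side(n) == expected
# Perimeter formulas for odd n, and T(n) = T(n-3) for even n.
for n in range(3, 60, 2):
    assert triangles_with_perimeter(n) == perimeter_closed_form(n)
for n in range(4, 60, 2):
    assert triangles_with_perimeter(n) == triangles_with_perimeter(n - 3)
```

Both versions agree with the formulas, yet the perimeter version is far harder to derive - which is the point being made about treating the same process as equivalent across different problem-contexts.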

3 The validity of assessments


The classical definition of validity has changed little over the last 50 years. Garrett (1937) defined the validity of an assessment as the extent to which it measured "what it purports to measure" (p. 324), although applying this in practice has led to a proliferation of kinds of validity.

Clearly it is important that an assessment is both relevant to the domain addressed and representative of it, and these two requirements are traditionally regarded as defining the extent to which the assessment has content validity. Unfortunately, the use of the term 'content' in this way appears to exclude assessment of the psychomotor and affective domains. Popham (1978) proposed, but then retracted (1980), the term 'descriptive validity' as more appropriate, and Embretson (1983) has used the term 'construct representation' to focus on the psychological processes involved in responding to the assessment.

54

NordicSfudiesin Mathematics EducationNo t,lgg4

Assessing authentic fasks; alternatives to mark-schemes

However, while assessments are frequently used to determine a particular individual's competence in a domain, they are more often administered in order to make decisions. Sometimes, the decision relates to future performance ("can this person become a pilot?"), in which case the predictive aspect of validity is paramount. At other times, whether the assessment produces results similar to a more complex procedure is more important ("is this child being sexually abused?"), so that validation is concerned with the concurrence of the two procedures. Both predictive validity and concurrent validity concern the ability of an assessment to stand as a proxy for performance on some criterion, and are therefore often referred to collectively as criterion-related validity.

The idea that validity should be more concerned with the inferences made from assessment results, rather than the results themselves, was made more explicit in the development, during the 1950s, of construct validity. The term construct validation was first introduced in 1954 by a joint committee of the APA, AERA and NCME (1954) and required "investigating what psychological qualities a test measures, ie by demonstrating that certain explanatory constructs account to some degree for performance on the test" (p. 14). Originally intended to be used only where "the tester has no definitive criterion measure of the quality with which he is concerned and must use indirect measures to validate the theory" (p. 14), construct validation was held by Loevinger (1957) to be "the whole of validity from a scientific point of view" (p. 636). This view was disputed by Bechtoldt (1959), who regarded the definition as conflating the meaning of a score - construct representation - with what it signifies (nomothetic span). Nevertheless, by the late 1970s, Angoff (1988) notes that Loevinger's view "became more generally accepted" (p. 28), and Messick (1980) asserted that "construct validity is indeed the unifying concept of validity that integrates criterion and content considerations into a common framework for testing rational hypotheses about theoretically relevant relationships" (p. 1015).

In the same paper, Messick argued that even this unifying concept of construct validity was not broad enough to allow a full consideration of the quality of an assessment. In addition to construct validity, he argued, the evaluation of an assessment procedure should consider:
a) the value judgements associated with the interpretations of the assessment results,
b) evidence for the relevance of the construct and the utility of the particular applications, and
c) the social consequences of the proposed assessment and the use of the results - what Madaus (1988) has called the impact of an assessment.

This was presented as the result of crossing the basis of the assessment (whether the focus is on the evidence provided by the assessment, or on the consequences of its use) with the function of the assessment (whether the focus is on the interpretations of the assessments or their application). This provides the structure for Table 1. In addition, Table 1 includes some of the other terms proposed by other writers which, while not being exact matches to Messick's partitioning of types of validity argument, appear to be close enough to be instructive.

                          result interpretation          result use
  evidential basis        construct validity             construct validity +
  (Messick);              (Messick);                     relevance/utility (Messick);
  interpretive basis      construct representation       nomothetic span (Embretson);
  (Moss)                  (Embretson);                   significance (Bechtoldt)
                          meaning (Bechtoldt)
  consequential basis     value implications             social consequences (Messick);
  (Messick)                                              impact (Madaus)

Table 1. Facets of validity.

The various reasons given by different authors for incorporating authentic tasks into a scheme of assessment can be located within this framework. Incorporating authentic tasks into an assessment scheme for mathematics because it is believed better to represent the mathematical performance of the student is an appeal to construct representation (top left). Doing so because we believe we can better predict success in further study or employment is concerned with relevance or utility as well as construct validity (top right). Doing so because it shows investigative work to be an important part of mathematics, and so worth assessing, concerns the consequential basis of result interpretations or, in other words, the value implications (bottom left). Finally, if we assess investigations because we believe that this will encourage teachers to incorporate such activities into their teaching, then we are concerned with validation in terms of the social consequences of the assessments (bottom right).

The important point about this framework is that while interpretive validity arguments (ie top row) can (although need not) be discussed
within a rationalist programme, arguments that involve discussion of consequences (bottom row) must be conducted within a value-system.

The incorporation of the impact or social consequences of assessments into the validity argument has enriched the field enormously, and has illuminated many aspects of practice that could not easily be explained with narrower concepts of validity. In particular it can be used to illustrate the role that values play in assessment. In one reading, teaching to the test is unacceptable (or even morally 'wrong') because it robs a test of its ability to provide useful information (Cannell, 1987). Other readings are, however, possible. In the US, given the use that is to be made of test results, many have argued that, at times, not teaching to the test is more damaging to students than doing so (Airasian, 1987).

As well as discussion of these negative 'backwash' effects, there has recently been a substantial amount of interest in the possible beneficial effects of assessments, leading to the idea of the 'beneficence' (Elton, 1992) of an assessment. As an example of this, in 1988, the GAIM project described itself as an 'assessment-led curriculum development' project and was premissed on the assumption that teachers' practice could be changed by changing summative assessment practices. In the US, similar arguments are made by proponents of 'measurement-driven instruction' or MDI (Airasian, 1988).

It is clear, therefore, that the unified concept of validity is, ultimately, subjective and personal: "a test is valid to the extent that you are happy for a teacher to teach towards the test" (Wiliam, 1992, p. 17).

4 Norms, criteria and other referents


The previous section presented a framework within which the incorporation of authentic tasks in mathematics assessments can be considered. However, very few of the benefits of such tasks are likely to accrue if they are subject to the kind of stereotyping that, it was argued in section 2, is likely to occur if prescriptive schemes of assessment are used. This section suggests that a solution lies in a movement away from traditional notions of criterion- and norm-referenced assessment, and towards the idea of a 'construct-referenced' assessment.

Any assessment functions by comparing the behaviours observed during the assessment with something else - the referent. This referent might be the performance of other students of the same age, an external criterion, or even the student's own previous performance. The importance of the referent is that the interpretations of assessment results are (implicitly at least) inferences with respect to these referents. It is for this reason that it is now widely acknowledged that it is more useful to
speak of norm- and criterion-referenced interpretations of assessment results, than of norm- and criterion-referenced assessments per se. However, the design of the assessment still needs to take into account the inferences that it is proposed to make from the outcomes. While it is possible, in some cases, to make satisfactory criterion-referenced interpretations from tests designed to provide norm-referenced inferences (and vice-versa), this is not always possible, and constructors of assessments need to bear in mind the interpretations that are likely to be made of assessment results. For this reason, it does still make sense to talk of a 'norm-referenced test', although what we mean is a test designed specifically to provide norm-referenced inferences or interpretations.

4.1 Norm-referenced assessment

In a norm-referenced assessment, the referent is the normative group, but the inferences that are sought are likely to be to a much wider population. The key is therefore the ability of the normative group to represent the population. Typically, the interpretations (even if only implicitly) are expressed in terms of the proportion of the cohort doing better or worse than the individual in question, and, typically, the nomothetic span of the assessment is at least as important as its construct representation, placing a premium on the ability of the assessment to discriminate. A useful touchstone is that an assessment scheme has some degree of norm-referencing if 'sabotaging' the efforts of other candidates is likely to help your own assessment!

4.2 Criterion-referenced assessment

The term 'criterion-referenced assessment' is generally attributed to Glaser (1963), although the underlying ideas are much older. Writing nearly twenty years earlier, Cattell (1944) had suggested that as well as 'normative' or 'populometric' measurement, there was another category of 'absolute' measurements, which related performance to "literal, logical dimensions defining events" (p. 295).
He termed this assessment 'interactive' to stress that what was being related was the interaction between the individual and the "external world" (p. 294). The essence of criterion-referenced assessment is that the domain to which inferences are to be made is specified with great precision (Popham, 1980). However, as Angoff (1974) has pointed out, if you scratch the surface of any criterion-referenced assessment you will find a norm-referenced set of assumptions lurking underneath. Popham (1993) regards
this not just as a property of poorly-framed objectives, but as an inevitable feature of all performance criteria. Any criterion will have a degree of 'plasticity' (Wiliam, 1993c, p. 34), in that there is a range of interpretations that can reasonably be made. The particular interpretation of a criterion that is chosen should be the most useful, bearing in mind the inferences that are desired. The central feature of a criterion-referenced assessment that distinguishes it from norm-referenced assessment is that once we have decided on an interpretation, it does not then change according to the proportion of the population achieving it.

4.3 Ipsative assessment

In the paper cited above, Cattell also proposed a third form of measurement "for designating scale units relative to other measurements on the person himself" (p. 292), which he termed ipsative (from the Latin ipse, self). Authors such as Stricker (1976) have further required that with ipsative measures, "the sum of scores for a set of variables is the same for each person" (p. 218). More recently, at least in the UK, there has been a tendency to use the term ipsative assessment to describe the performance of an individual at one moment in time, compared with that individual's previous levels of performance (eg Stronach, 1989). Such comparisons are certainly not norm-referenced, since there is no comparison with a normative group, nor do such assessments satisfy the requirements of domain-specification needed in criterion-referenced assessment. However, such assessments are part of the day-to-day activities of good teachers - arguably the most significant part - and the relative lack of attention that these kinds of assessments have received in the literature hardens many practising teachers' beliefs that psychometrics has nothing useful to say to them.
4.4 Construct-referenced assessment

There is another class of referents, widely used in education, to which assessments are frequently related, that has received little attention in the psychometric literature. These referents are used when the domain of assessment is holistic, rather than being defined in terms of precise objectives. The essence of this fourth kind of referent is that the domain of assessment is never defined explicitly. Examples of behaviours are used in illustrating or exemplifying performance, but the standards, in so far as they exist anywhere, exist as shared constructs in the minds of those involved in making the assessments.
NordicSfudiesin Mathematics Education No 1,1994

59

Dylan Wiliam

The most successful example of this kind of assessment in the UK has been the assessment of GCSE English, which, in three-quarters of the schools in England and Wales, is entirely school-based. In order to safeguard standards, teachers are trained to use the appropriate standards for marking by the use of 'agreement trials'. There are many models for such agreement trials, but typically, a marker is given a piece of work to assess. When she has made an assessment, feedback is given by an 'expert' as to whether the assessment agrees with the expert assessment. The process of marking different pieces of work continues until the teacher demonstrates that she has converged on the correct marking standard, at which point she is 'accredited' as a marker for some fixed period of time.

The term domain-referenced assessment might be an appropriate description for this kind of assessment but for the fact that most authors - see for example Berk (1990, p. 490) or Hambleton and Rogers (1991, p. 4) - use this synonymously with 'criterion-referenced' assessment. Because of this, and because of the use that is made of shared constructs, I have proposed the term 'construct-referenced assessment' (Wiliam, 1992) - first used by Messick (1975) - to describe this kind of assessment. The meaning I wish to attach to the term has a slightly different emphasis from that proposed by Messick, but both emphasise the meanings that are attached to assessment results (Bechtoldt, 1959) or how well they represent the constructs (Embretson (Whitely), 1983).
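How closely a marker has converged on the shared standard in such agreement trials can be quantified. As a purely illustrative sketch (the article prescribes no particular statistic, and the grades below are invented), chance-corrected agreement between marker and expert might be summarised by Cohen's kappa:

```python
# Illustrative sketch: Cohen's kappa as a measure of chance-corrected
# agreement between a teacher's grades and an 'expert' moderator's grades.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(ratings_a) == len(ratings_b) > 0
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement estimated from each rater's marginal grade frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    chance = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)
    return (observed - chance) / (1 - chance)

teacher = ['A', 'B', 'B', 'C', 'A', 'C', 'B', 'A']
expert  = ['A', 'B', 'C', 'C', 'A', 'C', 'B', 'B']
print(round(cohens_kappa(teacher, expert), 2))  # prints 0.63
```

A trial might accredit a marker once kappa against the expert exceeds some agreed threshold; kappa of 1 indicates perfect agreement, and 0 indicates agreement no better than chance.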
The assessment is not objective, but the UK experience is that it can be made reliable. To put it crudely, it is not necessary for the raters (or anybody else) to know what they are doing, only that they do it right.

Both criterion-referenced and construct-referenced assessments relate performance to some external standard, and although there may be cases where both are equally appropriate, it will usually be the case that one is more appropriate than the other. Where aspects of performance or achievement can be broken down into specific behaviours, which collectively exhaust the domain, then that performance is reducible. In such a case we are assured that if someone can perform each of the constituent behaviours, then we also know that they can perform the complete behaviour. Where the specific behaviours can be defined precisely by explicit criteria, we can say that the behaviour is definable. Criterion-referenced assessments are appropriate for achievements that are both reducible and definable.

However, many complex skills cannot be treated in this way; the whole may be greater than the sum of the parts (so the performance is not reducible), or we cannot write down criteria which capture what we want to describe (so the performance is not definable). This is recognised in many areas of social decision making, and is encapsulated in the often-cited legal dictum that 'hard cases make bad law'. Construct-referenced assessments may use statements to describe the domains - indeed many systems of assessment describe these statements as criteria - but the role of these statements is quite different from the role of criteria in criterion-referenced assessments.

The touchstone for distinguishing between criterion- and construct-referenced assessment is the relationship between the written descriptions and the domains. Where written statements collectively define the level of performance required (or, more precisely, where they define the justifiable inferences), then the assessment is criterion-referenced. However, where such statements merely exemplify the kinds of inferences that are warranted, then the assessment is, to an extent at least, construct-referenced.

In practice, no assessment relates exclusively to a single one of these four kinds of referents. For example, a selection procedure for employment is likely to combine several aspects. If there is only a single post, then to be successful one has to be (in some sense) the best candidate, so there is a degree of norm-referencing. However, one also has to be 'appointable'; if there are selection criteria, then this may be a criterion-referenced assessment, but if there are not, then it may involve some construct-referencing (i.e. can this person do the job?). To complete the analogy, ipsative concerns may be involved if the application is for promotion, where the decision might be about whether the candidate has made sufficient improvement over (say) the last year.

5 Implementing construct-referenced assessment


Although the use of the term may not be well-established, there is a considerable amount of experience in England and Wales of the development of construct-referenced assessment. Unfortunately, this has not been researched rigorously, nor has it appeared in the literature of educational and psychological measurement. The findings presented below must therefore be considered as suggestive at best, but, in the absence of any more rigorous literature, may help others avoid some of the problems solved by trial and error in the UK.

5.1 Assessor training

The major requirement in achieving a dependable system of construct-referenced assessment is achieving unidimensionality: teachers may disagree about which grades, levels or scores to award, but they must agree on the rank order. For example, a common experience in the early development of school-based assessment in GCSE English was that some teachers focused unduly on the technical accuracy of the writing, almost ignoring the descriptive quality. Others paid little attention to punctuation, grammar and the 'conventions' of Standard English, awarding grades almost solely on the basis of descriptive power.

The first stage is to identify the various dimensions of competence in the domain, however intuitively, and, within each dimension, to identify (again possibly intuitively) degrees of progression. Samples of students' work for agreement trialling can then be selected to illustrate the extremes of differences in grade or quality along the different dimensions. Once the illustrative set of work samples has been determined, there are many ways to proceed. For example, consistent rank ordering of samples can be established first, with correct gradings established as a subsequent calibration exercise, or both can be established simultaneously.

It is our experience that in many cases, teams of experienced teachers working together can produce unanimous agreements quickly, presumably because of the amount of experience they share, although how quickly new entrants to the profession can be enculturated is an important issue. However, preliminary results from work with mathematics students on initial teacher training courses (Gill, 1993) have shown that constructs of 'levelness' can be formed quite quickly. This is, however, a single, very small-scale study in a complex and diverse field, and an exploration into how this process of enculturation can be speeded up must be a pressing item on the research agenda of construct-referenced assessment.
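As an illustrative sketch only (this computation was not part of the trialling procedures described here, and the marks below are invented), the extent to which two markers agree on the rank order of a set of scripts can be summarised by Kendall's tau:

```python
# Minimal sketch: rank-order agreement between two markers.
# The marks are invented; in an agreement trial they would be the
# marks awarded independently to the same set of scripts.

def kendall_tau(a, b):
    """Kendall's tau over paired scores: +1 means identical rank order,
    -1 means completely reversed rank order (assumes no tied pairs)."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1    # pair ranked the same way by both markers
            elif s < 0:
                discordant += 1    # pair ranked in opposite ways
    return (concordant - discordant) / (n * (n - 1) / 2)

marker_1 = [12, 9, 15, 7, 11]   # hypothetical marks for five scripts
marker_2 = [10, 8, 14, 6, 13]   # a second marker's marks for the same scripts
print(kendall_tau(marker_1, marker_2))  # → 0.8
```

A tau near +1 indicates that the two markers share a rank order even where they disagree about the grades themselves; a value near 0 would suggest that they are attending to different dimensions of the work.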
The Graded Assessment in Mathematics scheme mentioned above developed a series of mathematical investigations and practical problem-solving tasks which had to be graded on a scale from 1 to 15, with the top seven grades being equivalent to the grades of the school leaving examination at 16 (the GCSE). Although very little of the project's findings have been published, several lessons about communicating constructs of assessment standards to teachers were learnt (Wiliam, 1993a):

- generalised level descriptions caused too many 'décalages' to provide a reliable basis for assessment.

- task-specific level descriptions tended to be either too specific to a particular approach taken to the task, or too general for teachers to be able to identify which were the important features. There did not appear to be an easily located 'middle ground'.

- attempts to communicate standards were consistently most successful when actual samples of students' work for each of the levels of the system, annotated to illustrate important points, with several different approaches to each task at each level, were used.


Generating such samples of work, however, caused its own problems. Whenever teachers were asked to provide specific samples of work to exemplify levels, the examples provided were usually at the correct level, but extremely well presented. To use such materials as exemplification would have created an unrealistic expectation of the standards associated with a particular level. Generally, more appropriate exemplification materials were generated when teachers were asked to submit whole class-sets of work, from which samples could be chosen.

5.2 Evaluating the quality of assessments

Construct-referenced assessment involves making complex, holistic and discrete attributions, and it is clear that in such settings, classical test theory is inappropriate for evaluating the quality of the assessments made. However, since any discrete attribution can be treated as a series of ordered dichotomous attributions, signal detection theory provides appropriate indices of the accuracy of assessments.

Signal detection theory (SDT) developed out of attempts to analyse the performance of different (human) receivers of radio signals. However, while having been developed in communication engineering, the idea of dichotomous decision-making in the presence of noise has a very wide range of application (Green & Swets, 1966).

For consistency with the language of signal detection theory, the term 'positive' is generally used to denote the situation where the threshold has been exceeded, irrespective of whether this has positive or negative connotations (Swets, 1988, p. 1285). Similarly, the term 'negative' is used to denote a situation where the threshold is not reached. If the numbers of false and true attributions are expressed as proportions of the true positives, then these proportions will sum to 1, as will the false and true attributions of true negatives.

The behaviour of the system can then be described by two indices: the number of correctly attributed positives (called 'hits') and incorrectly attributed negatives ('false alarms') are usually chosen, although Sperling and Dosher (1986) argue that the use of hits and correct negatives gives more easily interpreted results.
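To make the two proportions concrete, the following sketch treats an 'expert' judgement as the true classification and a teacher's attributions as the received signal. The judgements are invented and the function name is illustrative, not drawn from the paper:

```python
# Sketch: hit and false-alarm proportions for a rater judged against an
# expert standard. Each entry is True where the rater (or expert) judged
# that the threshold grade had been reached; the data are invented.

def hit_and_false_alarm_rates(expert, rater):
    """Return (hit rate, false-alarm rate).

    Hit rate: proportion of the expert's positives the rater also
    marked positive. False-alarm rate: proportion of the expert's
    negatives the rater wrongly marked positive.
    """
    positives = [r for e, r in zip(expert, rater) if e]
    negatives = [r for e, r in zip(expert, rater) if not e]
    return (sum(positives) / len(positives),
            sum(negatives) / len(negatives))

expert = [True, True, True, False, False, False, True, False]
rater  = [True, True, False, False, True, False, True, False]
print(hit_and_false_alarm_rates(expert, rater))  # → (0.75, 0.25)
```

For the data above, the rater detects three of the expert's four positives (a hit rate of 0.75) and wrongly awards one of the four negatives (a false-alarm rate of 0.25).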


This use of proportions satisfies Swets' first property required of a measure of the performance of a diagnostic system: that the measure should be unaffected by the proportion of positives and negatives in the test sample. The second of Swets' requirements is that a measure of the performance of the system as a whole should be independent of the way that the decision criterion is set. In our case, we should want our measure of the accuracy of teachers assessing authentic tasks in mathematics to be the same whether they are told to be lenient or to be harsh in deciding whether to award a particular grade or level.

The essence of signal detection theory is that the decision-consistency of the system is measured over a wide range of possible settings of the threshold between lenient and strict interpretations of the criterion. The graph of the pairs of false-positive proportions (false alarms) and true-positive proportions is called the ROC (originally 'receiver operating characteristic', but now often 'relative operating characteristic') of the system, and describes the accuracy of the system over different settings of the criterion. This information can then be given to those who have to determine the setting of the threshold, so that the performance of the system at the chosen setting is known reasonably well in advance. If a single index, rather than a curve on a graph, is required, then Swets (1988, p. 1287) suggests that the area of the graph below the curve (denoted A) can be used as an index of system performance, which ranges from 0.50, when the ROC is a diagonal line (corresponding to the situation where no discrimination exists), to 1.00 (where there are no incorrect classifications). The lack of a sampling distribution of the index A creates difficulties, but nevertheless signal detection theory appears to hold considerable promise where essentially continuous data have to be reported in a dichotomous way.
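A minimal sketch of how the ROC and the index A might be computed, assuming each rater produces a continuous 'quality' score that is dichotomised at every possible threshold. The scores and expert labels are invented, and the trapezoidal rule is just one common way to estimate the area:

```python
# Sketch: ROC points and the area index A, estimated by the trapezoidal
# rule. Scores and expert classifications are invented for illustration.

def roc_points(scores, labels):
    """(false-alarm rate, hit rate) pairs, one per threshold setting,
    sweeping the decision threshold from strict to lenient."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    points = [(0.0, 0.0)]  # strictest setting: nothing classified positive
    for t in sorted(set(scores), reverse=True):
        hits = sum(1 for s in pos if s >= t) / len(pos)
        false_alarms = sum(1 for s in neg if s >= t) / len(neg)
        points.append((false_alarms, hits))
    return points

def area_under_roc(points):
    """Swets' index A: area under the ROC curve (0.5 corresponds to no
    discrimination, 1.0 to perfect classification)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # hypothetical rater scores
labels = [1, 1, 0, 1, 0, 0]               # expert's dichotomous judgements
print(area_under_roc(roc_points(scores, labels)))
```

Perfectly separated scores give A = 1.0, while scores carrying no information about the expert classification give points along the diagonal and A = 0.5.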

6 Summary
The incorporation of open-ended authentic tasks in formal assessments of mathematics achievement can be justified on many grounds. It has been argued that the inclusion of such tasks means that assessments represent better the nature of mathematical thinking, have greater utility for selecting students for advanced study, and represent more appropriately the values of mathematics.

However, the inclusion of such tasks in high-stakes assessment creates significant problems. Natural justice requires a high degree of inter-rater agreement, which has in the past meant adopting very tightly controlled assessment schemes. While defensible for more narrowly focused tasks, such assessment schemes cannot anticipate all the different directions in which students might proceed. Task-specific assessment schemes cannot therefore be used without compromising the rationale for introducing authentic tasks in the first place.

While generic assessment schemes may, in the future, offer workable solutions, at the moment not enough is known about the nature of progression and competence in mathematics to provide a scheme that allows a piece of mathematical investigation to be assessed 'on its own terms'. In section 2 it was argued that the generic schemes that have been developed to date have focused on particular aspects of performance (whether degree of difficulty or extent of progress). It was also argued that any generic scheme will, by its very nature, serve to canonise certain aspects of performance at the expense of others.

As a partial (and possibly temporary) solution, construct-referenced assessment has been proposed as a way of working towards sufficiently high inter-rater agreement without compromising the rationale for open-ended work in mathematics. Whether construct-referenced assessment is sufficiently different from norm- and criterion-referenced assessment to be useful remains to be seen, but even if the term does no more than focus attention on the inadequacies of the contrastive rhetoric of norm- and criterion-referencing as descriptions of complex human judgements, then it will have served its purpose.

References
Airasian, P. (1987). State mandated testing and educational reform: context and consequences. American Journal of Education, 95(3), 393-412.
Airasian, P. (1988). Measurement-driven instruction: a closer look. Educational Measurement: Issues and Practice (Winter), 6-11.
American Psychological Association, American Educational Research Association, & National Council on Measurement Used in Education (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin Supplement, 51(2, part 2), 1-38.
Angoff, W. H. (1974). Criterion-referencing, norm-referencing and the SAT. College Board Review, 92 (Summer), 2-5.
Angoff, W. H. (1988). Validity: an evolving concept. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 19-32). Hillsdale, NJ: Lawrence Erlbaum Associates.
Associated Examining Board (1966). Mathematics syllabus C paper 11. London, UK: Associated Examining Board.
Bechtoldt, H. P. (1959). Construct validity: a critique. American Psychologist, 14, 619-629.


Bell, A. W., Burkhardt, H., & Swan, M. (1992). Assessment of extended tasks. In R. Lesh & S. J. Lamon (Eds.), Assessment of authentic performance in school mathematics (pp. 145-176). Washington, DC: American Association for the Advancement of Science.
Berk, R. A. (1990). Criterion-referenced tests. In H. J. Walberg & G. D. Haertel (Eds.), The international encyclopaedia of educational evaluation (pp. 490-495). Oxford, UK: Pergamon.
Biggs, J. B., & Collis, K. F. (1982). Evaluating the quality of learning: the SOLO taxonomy (Structure of the Observed Learning Outcome). London, UK: Academic Press.
Brown, S. I., & Walter, M. I. (1983). The art of problem posing. Philadelphia, PA: Franklin Institute Press.
Burton, L. (1984). Thinking things through. Oxford, UK: Basil Blackwell.
Cannell, J. J. (1987). Nationally normed elementary achievement testing in America's public schools: how all fifty states are above the national average. Daniels, WV: Friends for Education.
Case, R. (1985). Intellectual development: birth to adulthood. New York, NY: Academic Press.
Cattell, R. B. (1944). Psychological measurement: normative, ipsative, interactive. Psychological Review, 51, 292-303.
Department of Education and Science, & Welsh Office (1989). Mathematics in the National Curriculum. London, UK: Her Majesty's Stationery Office.
Department of Education and Science, & Welsh Office (1991). Mathematics in the National Curriculum. London, UK: Her Majesty's Stationery Office.
Embretson (Whitely), S. E. (1983). Construct validity: construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179-197.
Foucault, M. (1977). Discipline and punish (Sheridan-Smith, A. M., Trans.). Harmondsworth, UK: Penguin.
Garrett, H. E. (1937). Statistics in psychology and education. New York, NY: Longmans, Green.
Gill, P. N. G. (1993). Using the construct of "levelness" in assessing open work in the National Curriculum. British Journal of Curriculum and Assessment, 3(3), 17-18.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: some questions. American Psychologist, 18, 519-521.
Graded Assessment in Mathematics (1988). Development pack. London, UK: Macmillan Education.
Graded Assessment in Mathematics (1992). Complete pack. Walton-on-Thames, UK: Thomas Nelson.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York, NY: Wiley.
Hambleton, R. K., & Rogers, H. J. (1991). Advances in criterion-referenced measurement. In R. K. Hambleton & J. N. Zaal (Eds.), Advances in educational and psychological testing (pp. 3-43). Boston, MA: Kluwer Academic Publishers.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3 (Monograph Supplement 9), 635-694.
London and East Anglian Group for GCSE Examinations (1987). Mathematics: centre-based assessment. London, UK: London and East Anglian Group for GCSE Examinations.
Madaus, G. (1988). The influence of testing on the curriculum. In L. N. Tanner (Ed.), Critical issues in curriculum: the 87th yearbook of the National Society for the Study of Education (part 1) (pp. 83-121). Chicago, IL: University of Chicago Press.
Mason, J. (1984). Mathematics: a psychological perspective. Milton Keynes, UK: Open University Press.
Mason, J., Burton, L., & Stacey, K. (1982). Thinking mathematically. London, UK: Addison-Wesley.


Messick, S. (1975). The standard problem: meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35(11), 1012-1027.
Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: implications for performance assessment. Review of Educational Research, 62(3), 229-258.
Oxford Certificate of Educational Achievement (1987). Mathematics: putting it into practice. Oxford, UK: Oxford International Assessment Services Limited.
Paechter, C. (1992, August). Discipline as examination/examination as discipline: cross-subject coursework and the assessment-focused subject subculture. Paper presented at the British Educational Research Association Annual Conference held at Stirling University. London, UK: King's College Centre for Educational Studies.
Pascual-Leone, J. (1970). A mathematical model for the transition rule in Piaget's developmental stages. Acta Psychologica, 32, 301-345.
Piaget, J. (1956). Les stades du développement mental chez l'enfant et l'adolescent. In P. Osterreich, J. Piaget, R. de Saussure, J. M. Tanner, H. Wallon, & R. Zazzo (Eds.), Le problème des stades en psychologie de l'enfant. Paris, France: Presses Universitaires de France.
Piaget, J., & Inhelder, B. (1941). Le développement des quantités chez l'enfant. Neuchâtel, Switzerland: Delachaux et Niestlé.
Polya, G. (1957). How to solve it. Princeton, NJ: Princeton University Press.
Popham, W. J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall.
Popham, W. J. (1980). Domain specification strategies. In R. A. Berk (Ed.), Criterion-referenced measurement: the state of the art (pp. 15-31). Baltimore, MD: Johns Hopkins University Press.
Popham, W. J. (1987). Can high-stakes tests be developed at the local level? NASSP Bulletin, 71(496), 77-84.
Popham, W. J. (1993, April). The instructional consequences of criterion-referenced clarity. Paper presented at the symposium Criterion-referenced measurement: a thirty year retrospective at the annual meeting of the American Educational Research Association held at Atlanta, GA. Los Angeles, CA: University of California.
Sperling, G., & Dosher, B. A. (1986). Strategies and optimization in human information processing. In K. Boff, J. Thomas, & L. Kaufmann (Eds.), Handbook of perception and performance. New York, NY: Wiley.
Stricker, L. J. (1976). Ipsative measures. In S. B. Anderson, S. Ball, & R. T. Murphy (Eds.), Encyclopedia of educational evaluation: concepts and techniques for evaluating education and training programs (pp. 217-220). San Francisco, CA: Jossey-Bass.
Stronach, I. (1989). A critique of the 'new assessment': from currency to carnival. In H. Simons & J. Elliott (Eds.), Rethinking appraisal and assessment. Milton Keynes, UK: Open University Press.
Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science, 240(4857), 1285-1293.
Wells, D. G. (1986). Problem solving and investigations. Westbury-on-Trym, UK: Rain Publications.
Wiliam (Williams), D. (1989). Assessment of open-ended work in the secondary school. In D. F. Robitaille (Ed.), Evaluation and assessment in mathematics education (pp. 135-140). Paris, France: UNESCO.
Wiliam, D. (1992). Some technical issues in assessment: a user's guide. British Journal for Curriculum and Assessment, 2(3), 11-20.


Wiliam, D. (1993a, April). Assessing open-ended problem solving and investigative work in mathematics. Paper presented at the Australian Council for Educational Research Second National Conference on Assessment in the Mathematical Sciences held at Surfers Paradise, Australia.
Wiliam, D. (1993b). Paradise postponed? Mathematics Teaching (144), 20-23.
Wiliam, D. (1993c). Validity, dependability and reliability in national curriculum assessment. The Curriculum Journal, 4(4), 335-350.

Assessing authentic tasks: alternatives to mark-schemes

Summary
The article describes the kinds of 'authentic' tasks - typically open-ended, 'purely' investigative tasks - that have been used in national assessment in England and Wales over the last thirty years. Mark-schemes for their assessment are classified as either task-specific or generic. Schemes in the latter category are in turn classified according to whether the 'degree of difficulty' of the task or the 'partial solution' the student has produced is given most emphasis in the assessment. A view of validity is also presented that requires consideration of the value implications and social consequences of different assessment procedures. It is argued that both task-specific and generic schemes have the effect of stereotyping students' approaches to the tasks. An alternative paradigm to norm- and criterion-referenced interpretations of assessments, called 'construct-referenced', is held to be better suited to the assessment of authentic tasks. The article makes suggestions for the implementation of such a system, and, drawing on signal detection theory, proposes appropriate measures for evaluating the accuracy of such assessments.

Author
Dylan Wiliam is a lecturer in mathematics education at the Centre for Educational Studies, King's College, University of London, Great Britain.

Address
Centre for Educational Studies, Cornwall House Annex, Waterloo Road, London SE1 8TX, Great Britain.

