
WPA: Writing Program Administration, Volume 16, Numbers 1-2, Fall/Winter 1992

Council of Writing Program Administrators

Validity and Reliability Issues in the Direct Assessment of Writing

Karen L. Greenberg

During the past decade, writing assessment programs have mushroomed
at American colleges and universities. Faced with legislative mandates to
certify and to credential students' literacy skills, college writing teachers
have become more knowledgeable about writing assessment, and they are
trying to make writing tests parallel more closely their writing curricula
and pedagogy.
According to national surveys of post-secondary writing assessment
practices, writing teachers have developed and administered holistically-
scored essay tests of writing, which they prefer over all other types of
writing tests (CCCC Committee on Assessment; Greenberg, Wiener, and
Donovan; Lederman, Ryzewic, and Ribaudo). These surveys confirm the
growing consensus within our profession that the best way to assess
students' writing skills is through writing or "direct" assessment. While
there is still much disagreement about what constitutes an effective writing
sample test, there is agreement that multiple-choice testing--the "indirect"
assessment that once dominated post-secondary writing assessment--is no
longer adequate for our purposes; yet direct writing assessment continues
to be challenged. This essay will elaborate on and respond to some of these
challenges and will speculate on future directions in writing assessment.

The Reliability of Essay Test Scores


Most essay tests of writing are evaluated by holistic scoring, a procedure
based on the response of trained readers to a meaningful "whole" piece of
writing. Holistic scoring involves reading a writing sample for an overall
impression of the writing and assigning the sample a score point value
based on a set of scoring criteria. Typically, holistic scoring systems use a
scoring scale or guide, created by composition faculty, that describes
papers at different levels of competence (for examples, see Cooper;
Greenberg, Wiener, and Donovan; White, 1985). In order for a holistic
scoring system to be of any value, it must be shown to be "reliable," that is,
it should yield the same relative magnitude of scores for the same group of
writers under differing conditions. Reliability is an estimate of a test score's
accuracy and consistency, an estimate of the extent to which the score
measures the behavior being assessed rather than other sources of score
variance.

No test score is perfectly reliable because every testing situation
differs. Sources of error inherent in any measurement situation include
inconsistencies in the behavior of the person being assessed (e.g., illness,
lack of sleep), variability in the administration of the test (inadequate light,
insufficient space), and differences in raters' scoring behaviors (leniency,
harshness). This last source of error has been the focus of almost all direct
writing assessment programs, probably because it is subject to the greatest
control (i.e., raters can be trained to apply scale criteria more consistently).
Most programs calculate only an inter-rater scoring reliability--an estimate
of the extent to which readers agree on the scores assigned to essays.
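
To make the arithmetic behind such estimates concrete, here is a minimal sketch
(an illustration only, not drawn from any of the programs or studies discussed in
this essay) that computes a Pearson correlation between the holistic scores two
readers might independently assign to the same ten essays; both score lists are
invented for the example.

    # Illustrative sketch only: estimating an inter-rater reliability
    # coefficient as the Pearson correlation between two readers' holistic
    # scores. The scores below are invented; operational audits use far
    # larger samples of paired readings.

    reader_a = [4, 3, 5, 2, 6, 4, 3, 5, 2, 4]   # Reader A's scores on a 6-point scale
    reader_b = [4, 4, 5, 2, 5, 3, 3, 6, 2, 4]   # Reader B's scores on the same essays

    def pearson_r(x, y):
        """Pearson product-moment correlation between two lists of scores."""
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        var_x = sum((a - mean_x) ** 2 for a in x)
        var_y = sum((b - mean_y) ** 2 for b in y)
        return cov / (var_x * var_y) ** 0.5

    print(round(pearson_r(reader_a, reader_b), 2))   # prints 0.87 for these invented scores

Operational readings involve far more essays and are often supplemented with
other agreement statistics, but the underlying logic of the coefficient is the same.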
The inter-rater reliability of holistic scores, and of essay tests in
general, has been under attack since 1916, when the College Entrance
Examination Board (the College Board) added an hour-long essay test to its
Comprehensive Examination in English. The College Board--the country's
largest private testing agency and creator of multiple-choice tests of
writing--has published most of these attacks, and it has done so in rather
acrimonious terms: "The history of direct writing assessment is bleak. As
far back as 1880 it was recognized that the essay examination was beset with
the curse of unreliability" (Breland et al. 2).

During the first half of this century, essay tests did indeed have
relatively low inter-rater reliability correlations. As Thomas Hopkins
showed in 1921, the score that a student achieved on a College Board
English exam might depend more on "which year he appeared for the
examination, or on which person read his paper, than it would on what he
had written" (Godshalk et al. 2). Concern with the reliability of essay tests
increased with the College Board's introduction of essay tests of achievement
in the 1940s. In 1945, three College Board researchers examined data
from various College Board essay tests and wrote a report indicating that
the reliability (.58-.59) was too low to meet College Board standards
(Noyes, Sale, and Stalnaker). The late 1940s and early 1950s were the heyday
of indirect "component" measurement. Introductory college courses overflowed
with thousands of World War II veterans who did not possess the
usually expected skills and knowledge, and faculty turned to multiple-choice
tests to screen and diagnose students (Ohmer 19). During the 1950s,
the College Board began commissioning studies to assess the "component
skills of writing ability" (Godshalk et al. 2). In 1950, Paul B. Diederich
examined the correlations between the grades assigned to writers by their
high school English teachers and the scores that the writers achieved on
three types of tests: multiple-choice questions on the verbal sections of the
Scholastic Aptitude Test, an objective editing test, and the English Composition
essay test. Diederich found that the SAT verbal score was the best
predictor of the teachers' grades ("The 1950 Study"). In 1954, Edith Huddleston
conducted a similar analysis of the correlations between essays written by
763 high-school students and their English grades, their scores on multiple-choice
tests of composition, and their teachers' judgments of their writing
ability. Huddleston concluded that the multiple-choice measures were
superior to essay tests because they had higher correlations with teachers'
grades.

Similar conclusions were reached by Paul Diederich and his colleagues
John French and Sydell Carlton in their 1961 study of inter-rater
reliability. They asked fifty-three academic and nonacademic professional
writers to sort into nine piles 300 essays written by college freshmen. Using
factor analysis, the researchers discovered five clusters of readers who
were judging the essays on five basic characteristics: ideas, reasoning,
form, flavor, and mechanics. They also found that the readers who were
most influenced by one of the five characteristics also favored two or three
of the other characteristics; the average inter-correlation among the five
factors was .31. The researchers concluded that this low correlation was
unacceptable.

However, many teachers believed that a test of writing ability should
require students to write, and they found support for their belief in a
comprehensive review of research in written composition done in 1963 by
Richard Braddock, Richard Lloyd-Jones, and Lowell Schoer. After detailing
the shortcomings in multiple-choice testing, Braddock, Lloyd-Jones,
and Schoer noted flaws in several of the College Board studies described
above. They pointed out that Diederich never gave his readers any
standards or criteria for judging the essays, so he should not have been
surprised that the readers did not often agree with each other, and he should
not have used this disagreement to condemn holistic scoring procedures
(which usually do use scoring guides) (43). They also noted that none of
Huddleston's measures included items on logic, detail, focus, or clarity and
that her twenty-minute essay topics did not give students an opportunity
to analyze and formulate their ideas (42). In addition, Braddock, Lloyd-Jones,
and Schoer commented that defenders of multiple-choice tests of
writing "seem to overlook or regard as suspicious the high reliabilities that
they obtained in some of their studies of essay testing" (41). They concluded
with a question that still needs answering today:

     in how many schools has objective testing been a
     good predictor of success in composition classes
     because those classes have emphasized precise
     grammatical and mechanical matters with little or no
     emphasis on central idea, analysis, organization, and
     content? (43)

At the same time that Braddock, Lloyd-Jones, and Schoer were
writing their criticisms of multiple-choice tests, the College Board was
commissioning the most comprehensive study of essay tests to date, a
study that was to influence the way in which writing was assessed in
America for the next two decades. Fred Godshalk, Frances Swineford, and
William Coffman examined a sample of 646 high-school students, each of
whom wrote five free-writing samples (two 40-minute essays and three 20-minute
paragraphs) that were holistically scored by five readers. The
researchers summed the scores on each of these writing samples and
examined their correlations with the students' scores on multiple-choice
tests and on editing tests. The results led the researchers to two major
conclusions: (1) the scores on these students' multiple-choice usage and
sentence construction tests correlated highly with their writing sample
scores, and (2) the scores on the 20-minute writing samples were as reliable
as the scores on the 40-minute samples (Godshalk et al. 40-41).

These conclusions came to define writing assessment, as noted in a
recent College Board publication: "This study [by Godshalk et al.] was
considered for some time the quintessential study of writing assessment.
The findings of this study led to the use of multiple-choice usage and
sentence correction items as the primary testing devices in the College
Board's English Composition Achievement Test and in other composition
tests" (Breland et al. 3). Almost twenty-five years passed before the College
Board commissioned another comprehensive study of direct and indirect
writing assessment. During those years, multiple-choice tests continued to
dominate writing assessment, but essay tests of writing began to gain
popularity. In 1978, Rexford Brown, the coordinator of the first National
Assessment of Educational Progress, pointed out the weaknesses in current
methods of assessing writing ability. Commenting on the growing dissatisfaction
of writing teachers with multiple-choice tests, he noted that their
high reliability is often illusory:

     Of course these tests correlate with writing ability
     and predict academic success; but the number of books
     or television sets or bathrooms in one's home also
     correlate with his writing ability, and parental edu-
     cation is one of the best predictors there is. All
     existing objective tests of writing are very similar
     to I.Q. tests; even the best of them can only test reading,
     proofreading, editing, logic, and guessing skills. They
     cannot distinguish between proofreading errors and
     process errors, reading problems and scribal stutter,
     failure to consider audience or lack of interest in
     materials manufactured by someone else. (3)

Brown also noted the invidious influences of multiple-choice tests of
writing: they require a passive mental state whereas writing samples
require active mental processes; they focus on and give undue importance
to the less significant components of writing (usage, spelling, and punctuation);
and they are often culturally and linguistically biased (4). This last
problem was documented by researchers at the California State University
System, who found that multiple-choice tests of English usage produced a
far more negative picture of the writing abilities of minority students than
did the university's essay test (White, Teaching and Assessing).

By 1980, universities and colleges across the country were developing
or refining their own holistically-scored essay tests of writing for varied
purposes, including determining placement, diagnosing strengths and
weaknesses, and certifying proficiency (CCCC Committee on Assessment;
Greenberg, Wiener, and Donovan). The College Board was quick to
respond to the decline in the use of multiple-choice writing assessment. In
1984, the Board commissioned Hunter Breland, Roberta Camp, Robert
Jones, Margaret Morris, and Donald Rock to replicate the "Godshalk" study
and to examine the reliabilities of essay and of multiple-choice measures of
writing skills.

For this study, 267 students from six colleges and universities each
wrote six essays on two topics in three modes (narration-description,
exposition, and persuasion), and each essay was holistically scored by three
readers and analyzed for errors in grammar, usage, sentence structure, and
mechanics. In addition, the researchers examined students' scores on six
multiple-choice tests, their course grades, and their teachers' ratings of
their writing skills. Their research indicated that the essay scores had lower
inter-rater reliability correlations than did the scores on the multiple-choice
tests. The researchers concluded that "these problems in reliability [of essay
tests] can be alleviated only through multiple essays or through the
combination of essay and non-essay measures" (Breland et al. 57).

Once again the College Board was asserting that multiple-choice tests
were better than essay tests because they had higher inter-rater reliability
correlations. But nowhere in this College Board challenge to the reliability
of essay testing--or in any of the earlier studies--was there any discussion
of whether users of essay tests should strive for "perfect" reliability.

Obviously, multiple-choice scores will always be more consistent than
scores on essay tests simply because machines can score a multiple-choice
answer more accurately than humans can score a piece of writing. However,
a "correct" score on a multiple-choice test can never reflect the
subjective, social process of writing evaluation as it genuinely occurs in
the academy and in the "real world." Indeed, as Ed White recently pointed
out, there can never be one "true" score for a piece of writing:

     But when we evaluate student writing (not to speak
     of writing programs or high schools), we sometimes
     find differences of opinion that cannot be resolved
     and where the concept of the true score makes no
     sense. . . . Some disagreements (within limits) should
     not be called error, since, as with the arts, we do not
     really have a true score, even in theory. Yet if we
     imagine that we are seeking to approximate a true
     score, we exaggerate the negative effects of disagree-
     ment and distort the meaning of the scores we do
     achieve. ("Language and Reality" 192)

In other words, differences in readers' judgments of a writing
sample are often simply that--deliberate differences, not random or non-random
errors. Forty years ago, Stephen Wiseman, the British critic of
indirect assessment, asserted this view: "It is arguable that, provided
markers are experienced teachers, . . . lack of high inter-correlation is
desirable, since it points to a diversity of viewpoint in the judgment of
complex material . . . and the total mark gives a better all-round picture"
("Marking" 206).

Reliability is a continuum, however. In their selected summaries of research,
Braddock, Lloyd-Jones, and Schoer cited essay test inter-rater scorer
reliabilities ranging from .87 to .96 (41-42). Over the past decade, as faculty
experience with scoring essay tests has grown, many colleges and universities
have been able to achieve respectable inter-rater reliabilities for the
scoring of their writing samples. For example, at The City University of
New York (which requires a university-wide single-sample writing test
that is scored holistically by two readers), the annual audits of approximately
2,000 essays per year have revealed inter-rater correlations ranging
from .75 to .88 (Ryzewic 25). Similarly, single-sample, double-reader
reliabilities on the Freshman English Equivalency Examination at the
California State University and Colleges have ranged from .68 to .84 over
a six-year period (White, "Comparison and Contrast" 291). Administrators
of writing programs need not resort to multiple-choice testing in order to
feel that students' test scores are reliable. Currently, many writing programs
have essay test writing tasks and scoring criteria that eliminate sources of
random error and enable readers to score reliably. Indeed, the Educational
Testing Service has produced a scoring manual that enables teachers all
over the country to improve the reliabilities of their essay test readings
(ETS Quality Assurance Free-Response Testing Team). However, as
administrators and teachers focus on improving the reliabilities of their
essay tests, they must simultaneously examine their tests' validities.
Stephen Wiseman's 1956 comment about validity and reliability still holds
true today: "There seems to be no doubt that, over the past two or three
decades, educational psychologists have slowly but steadily inflated the
importance of reliability, perhaps at the expense of validity" ("Symposium" 178).

The Validity of Essay Test Scores

Validity is a controversial subject in writing assessment and in all behavioral
research. Determining any test's validity involves finding evidence to
establish the extent to which performance on the test corresponds to the
actual behavior or knowledge that the test user wants to measure. Objections
to the validity of multiple-choice writing assessment have a long
history. For decades, many English teachers have claimed that multiple-choice
tests of writing oversimplify and trivialize writing as the mere ability
to memorize rules of grammar, spelling, and punctuation and foster
instruction in memorizing discrete bits of language (Witte et al.).

The three sources of evidence for any test's validity include its
content, its relationship to the underlying "construct," and its ability to
predict scores on related "criterion" measures (American Psychological
Association 1-4). "Content-related validity" depends on the extent to which
a test reflects a specific domain of content (in this case, the content of the
writing courses to which the test is attached). Many researchers have noted
that there is little evidence for the content validity of multiple-choice tests
of writing because their "content" is English grammar and usage; none of
them samples what most teachers consider the important content of
writing courses--the processes of composing, revising, and editing ideas
(Brossell; Bridgeman and Carlson; Brown; CCCC Committee on Assessment;
Cooper; Faigley et al.; Gere; Greenberg, Wiener, and Donovan;
Lloyd-Jones; Lucas; Nystrand; Odell).

Most of the research on the validity of writing tests has focused on
"criterion-related validity"--the extent to which scores on essay tests and on
multiple-choice tests correlate with other measures that purport to assess
writing ability; and almost every study of criterion validity done in the past
four decades has focused on the correlation between test scores and course
grades. While course grade is a reasonable criterion variable for tests that
are designed for admissions purposes (like the tests of the College Board
and the Educational Testing Service), it is not an appropriate criterion for
placement, competency, and proficiency testing. A student's grade in an
English or a composition course results from many variables besides
writing ability (including student motivation, attendance, and diligence).

Further, a serious problem with focusing exclusively on criterion
validity is that one can easily lose sight of the skills or abilities being taught
and assessed. As Rex Brown pointed out, there is a high correlation
between the number of bathrooms in one's home and the grade he or she
is assigned in a composition or English class, but that does not mean we
should consider "quantity of bathrooms" a valid measure of writing ability.

More important than criterion validity is the evidence for a measure's
"construct-related validity." In essence, construct validity is the degree to
which a test score measures the psychological or cognitive construct that
the test is intended to measure. To determine the construct validity of a
writing test, one attempts to identify the factors underlying people's
performance on the test. This is a hypothesis-testing activity, rooted in
theories about the ways in which writing ability manifests itself. Because
of the inadequacy of our profession's current heterogeneous theories of the
nature of writing ability, however, the construct validity of most writing
tests is very questionable. Only a handful of test developers have tried to
analyze the domain of abilities, skills, understandings, and awareness that
comprise the construct of writing ability (see Bridgeman and Carlson;
Faigley et al.; Gorman, Purves, and Degenhart; and Witte for some excellent
attempts at defining this construct). Unfortunately, even the best analyses
result in taxonomies of the domain, and, as Arthur Applebee has noted,
"There is no widely accepted taxonomy of types of writing, and certainly
none that holds up to empirical examination of the kinds of tasks on which
students can be expected to perform similarly well (or poorly)" (7).

To repeat, the evidence for the construct validity of a test of writing
grows out of its conceptual framework. The strategy for obtaining this
evidence is to build a theoretical model of the writing process and then to
examine the dimensions of the process that the test taps. This is the method
that College Board researchers used in a recent attempt to provide evidence
for the construct validity of multiple-choice tests of writing (Breland et al.).
Using factor analytic techniques, College Board researchers tested a series
of hypothesized models of writing ability: a single-factor model ("general
writing ability"); a three-factor model ("narrative, expository, and persuasive
writing abilities"); and a hierarchical model ("general writing ability"
and "narrative, expository, and persuasive writing abilities in response to
two different topics in each mode") (Breland et al. 45). They found that the
model that predicted the data best was the hierarchical one, and they
concluded that "while there is one dominant writing ability factor that
explains about 78 percent of the common variance, there are definable
subfactors based on writing topics and, to a lesser extent, mode of
expression" (45).

This finding calls into question the practice of assessing writing ability
with a single writing sample, and it also echoes recent evidence that writing
ability is not a single construct but rather is a composite of several situation-specific
constructs (Applebee, Langer, and Mullis; Bridgeman and Carlson;
Faigley et al.; Gorman et al.; Lucas; Purves et al.). If writing ability does vary
across discourse modes, the implication is that it should be assessed by two
or more tasks (including tasks that require writing to construct or to
communicate one's knowledge, writing to convince readers to feel or to do
something, writing to entertain, and so forth). The College Board researchers,
however, did not reach this conclusion, possibly because a multi-sample,
multi-domain essay test is expensive; and, as they pointed out, the
increase in validity might not be "cost-effective":

     Although small increments in validity are possible
     with additional readings [of essays], the marginal
     cost of these additional readings is very high. . . .
     Accordingly, one way to reduce the cost of essay
     assessments is to combine them with a multiple-choice
     assessment. (Breland et al. 59)

However, while multiple-choice tests cost less to score than do essay
tests, they are more costly to develop. Because of recent "truth-in-testing"
laws, many institutions have to buy multiple forms and new revisions of
their multiple-choice tests to keep them secure. Thus, in the long run, it is
often less expensive to develop essay tests.

Essay tests of writing have one remaining advantage over multiple-choice
tests: Faculty who share ideas and work together to develop an essay test
often shape an exam that is grounded in their theories, curricula, and
classroom practices. This has been true for teachers who have developed
"portfolio" writing assessments that sample several discourse domains
and that provide opportunities for students to revise their writing (Anson;
Belanoff and Dixon; Belanoff and Elbow; Camp, 1985; Elbow and Belanoff;
Faigley et al.; Lucas).

Portfolio evaluation is probably the most valid means of assessing
writing available to us today because it enables teachers to assess
composing and revising across a wide range of communicative contexts and
tasks (Camp, "The Writing Folder"). Moreover, this process sends the message
that the construct of "writing" means developing and revising extended
pieces of discourse, not filling in blanks in multiple-choice exercises or on
computer screens. It communicates to everyone involved--students, teachers,
parents, and legislators--our profession's beliefs about the nature of
writing and about how writing is taught and learned.

Further, multi-sample writing tests enable teachers to evaluate parts
of the writing processes that so many of us emphasize in our curricula. As
Leo Ruth and Sandra Murphy have pointed out, it is possible to design
large-scale writing tests that preserve more of the steps of real writing than
normally occur in most testing situations (113-114). For example, in
England's version of a national assessment of writing, teachers administering
the test introduce the writing task and discuss it before students begin
writing. In addition, three samples of writing, each involving different
types of tasks, are collected (113). In Ontario, Canada, students have two
days to take their writing assessment. On the first day, students generate
and record ideas; and on the second day, they write and revise their essays
(114).

Multi-sample portfolio tests seem more relevant to our theories about
the construct of writing and to our classroom practices than do other
writing assessment measures. However, questions about the scoring of
these tests remain unresolved. Is holistic scoring the most appropriate
method of rating writing samples? Are holistic ratings based on the
consistent application of mutually agreed-upon substantive criteria for
"good writing"? Researchers have not addressed these questions extensively,
and the results of existing research are contradictory. Several
studies indicated that holistic scores correlate strongly with superficial
aspects of writing, such as handwriting, spelling, word choice, and errors
(Greenberg, 1981; McColly; Nold and Freedman; Neilson and Piche).
However, the essays being scored in these studies were quite brief (written
in time periods ranging from twenty to sixty minutes), and one can argue
that readers were predisposed to attend to these salient but superficial
errors. To my knowledge, no one has yet analyzed the holistic scores of
raters who evaluate tests that require writers to do extensive revision or to
write multi-sample portfolios.

The question of substantive criteria for "good writing" relates directly
to the issue of construct validity. Lacking a theoretical model of effective
writing ability, most test developers fall back on descriptions of text
characteristics for inclusion as criteria in holistic scoring guides. An
examination of these guides reveals that many of them are quite similar,
indicating some professional consensus about the characteristics of effective
writing across a wide variety of tasks and testing purposes. Nevertheless,
the skills described in the criteria on current holistic scoring guides do
not provide an adequate definition of "good writing" or of the many factors
that contribute to effective writing in different contexts. Criteria on holistic
scoring guides cannot accommodate many of the cognitive and social
determinants of effective writing, not the least of which include the writer's
intentions and the situational expectations of the writer and of potential
readers. Nor can holistic criteria assess a writer's composing processes or
the ways in which these processes--and products--vary in relation to the
writer's purpose, audience, role, topic, and context. But should they?

Evidence for validity depends on the purpose for which the assessment
instrument is used. According to the surveys cited earlier, most post-secondary
writing assessment in America is done for entry-level placement
purposes (CCCC Committee on Assessment; Greenberg, Wiener, and
Donovan; Lederman, Ryzewic, and Ribaudo). The second most frequently
cited purpose is to certify that students have mastered the competencies
that they practiced in a specific writing course. These surveys revealed that
the majority of the respondents whose schools used holistically-scored
essay tests for these purposes were "very satisfied" with the results of the
tests. These findings indicate that many users of holistically-scored essay
tests believe that the tests can provide adequate and appropriate information
about students' abilities to function successfully in different types of
writing courses. In addition, holistically-scored essay tests may provide a
more accurate assessment of the writing skills of minority students than
multiple-choice testing can provide. The California State University
research (mentioned earlier) indicated that many more Black, Mexican-American,
and Asian-American students than white students received the
lowest possible score on a multiple-choice test of usage but earned a middle
or high score from trained essay readers for their writing ability (White,
Teaching and Assessing). This disparity casts further doubt on the validity
of multiple-choice testing as an indicator of writing ability.

Research is needed to determine whether holistically-scored portfolio
tests can provide diagnostic information and can function as fully
contextualized assessments of students' writing competence or improvement.
As Brian Huot pointed out in his recent review of research on holistic
scoring, "We need to question and explore the particular problems associated
with the specific uses of holistic scoring" (208). What we have now,
holistically-scored essay tests, serve our limited purposes very well. What
we still need--a multi-trait instrument that adequately represents writing
in different discourse domains for different purposes and for different
communities--is an inchoate vision that many of us share.


The Implications of These Challenges for Our Profession

Writing teachers are asked to do more assessing than are any other
humanities colleagues, yet many of us are particularly subject to insecurity
about our ability to understand and manipulate data. It is no coincidence
that most of the research on writing and writing assessment that followed
the 1966 publication of the College Board's "Godshalk" study borrowed the
quantitative empiricism of research in the physical sciences. This reverence
for objective data diminished in the early 1980s, partially in response to the
publication of Janet Emig's contextualized research and her scathing
indictments of empirical and experimental research. One of the recent
outgrowths of the trend toward contextualized research on writing was a
consensus about the need for naturalistic, context-rich, qualitative models
of evaluating students' writing. Current portfolio measures come closest
to capturing these models.

Those of us who are committed to the direct assessment of writing
understand that we do not have to model our programs on multiple-choice
assessments, that there is no need to create the "perfect" essay test. Readers
will always differ in their judgments of the quality of a piece of writing;
there is no one "right" or "true" judgment of a person's writing ability. If we
accept that writing is a multidimensional, situational construct that fluctuates
across a wide variety of contexts, then we must also respect the
complexity of teaching and testing it.

Works Cited

American Psychological Association. Joint Technical Standards for Educational
and Psychological Testing, 4th Draft. Washington, D.C.: American
Psychological Association, 1984.

Anson, Chris. "Portfolio Assessment Across the Curriculum." Notes from
the National Testing Network in Writing 8 (1988): 6-7.

Applebee, Arthur. "Musings." Research in the Teaching of English 22 (1988):
6-8.

Applebee, Arthur, Judith Langer, and Ina Mullis. The Writing Report Card:
Writing Achievement in American Schools. Princeton, NJ: Educational
Testing Service, 1987.

Belanoff, Patricia, and Marcia Dixon. Portfolio Grading: Process and Product.
Portsmouth, NH: Heinemann, 1991.

Belanoff, Patricia, and Peter Elbow. "Using Portfolios to Increase Collaboration
and Community in a Writing Program." WPA: Writing Program
Administration 9.3 (1986): 27-40.

Braddock, Richard, Richard Lloyd-Jones, and Lowell Schoer. Research in
Written Composition. Champaign, IL: National Council of Teachers of
English, 1963.

Breland, Hunter M., et al. Assessing Writing Skill. New York: College
Entrance Examination Board, 1987.

Bridgeman, Brent, and Sybil Carlson. Survey of Academic Writing Tasks
Required of Graduate and Undergraduate Foreign Students. Princeton,
NJ: Educational Testing Service, 1983.

Brossell, Gordon. "Current Research and Unanswered Questions in
Writing Assessment." Writing Assessment: Issues and Strategies. Eds.
Karen Greenberg, Harvey Wiener, and Richard Donovan. New York:
Longman, 1986. 168-182.

Brown, Rexford. "What We Know Now and How We Could Know More
About Writing Ability in America." Journal of Basic Writing 4 (1978):
1-6.

Camp, Roberta. "Thinking Together About Portfolios." The Quarterly of the
National Writing Project 12 (1990): 8-14.

Camp, Roberta. "The Writing Folder in Post-Secondary Assessment."
Directions and Misdirections in English Evaluation. Ed. Peter Evans.
Ottawa, Canada: Canadian Council of Teachers of English, 1985. 91-99.

CCCC Committee on Assessment. Post-Secondary Writing Assessment: An
Update on Practices and Procedures. (Spring 1988). Report to the
Executive Committee of the Conference on College Composition and
Communication.

Cooper, Charles. "Holistic Evaluation of Writing." Evaluating Writing:
Describing, Measuring, Judging. Eds. Charles Cooper and Lee Odell.
Urbana, IL: National Council of Teachers of English, 1977.

Diederich, Paul B. "The 1950 College Board English Validity Study."
Research Bulletin RB 50-58. Princeton, NJ: Educational Testing
Service, 1950.

---. Measuring Growth in English. Urbana, IL: National Council of
Teachers of English, 1974.

Diederich, Paul B., John W. French, and Sydell T. Carlton. "Factors in
Judgments of Writing Ability." Research Bulletin RB 61-15. Princeton,
NJ: Educational Testing Service, 1961.

Elbow, Peter, and Pat Belanoff. "Portfolios as Substitutes for Proficiency
Examinations." College Composition and Communication 37 (1987): 336-339.

Emig, Janet. The Composing Processes of Twelfth Graders. Urbana, IL:
National Council of Teachers of English, 1971.

ETS Quality Assurance Free-Response Testing Team. Guidelines for Developing
and Scoring Free-Response Testing. Princeton, NJ: Educational
Testing Service, 1987.

Faigley, Lester, et al. Assessing Writers' Knowledge and Processes of Composing.
Norwood, NJ: Ablex, 1985.

Gere, Ann. "Written Composition: Toward a Theory of Evaluation." College
English 42 (1980): 44-58.

Godshalk, Fred, et al. The Measurement of Writing Ability. New York: College
Entrance Examination Board, 1966.

Gorman, Tom, Alan Purves, and Elaine Degenhart, Eds. The IEA Study of
Written Composition. Vol. 1. New York: Pergamon Press, 1988.

Greenberg, Karen. "Competency Testing: What Role Should Teachers of
Composition Play?" College Composition and Communication 33 (1982):
366-378.

---. The Effects of Variations in Essay Questions on the Writing Performance
of CUNY Freshmen. New York: CUNY Instructional Resource Center,
1981.

Greenberg, Karen, Harvey Wiener, and Richard Donovan. "Preface."
Writing Assessment: Issues and Strategies. Eds. Karen Greenberg,
Harvey Wiener, and Richard Donovan. New York: Longman, 1986.
xi-xvii.

Huddleston, Edith M. "Measurement of Writing Ability at the College
Level: Objective vs. Subjective Techniques." Journal of Experimental
Education 22 (1954): 165-213.

Huot, Brian. "Reliability, Validity, and Holistic Scoring: What We Know
and What We Need to Know." College Composition and Communication
41 (1990): 201-213.

Lederman, Marie Jean, Susan Ryzewic, and Michael Ribaudo. Assessment
and Improvement of the Academic Skills of Entering Freshmen: A National
Survey. New York: CUNY Instructional Resource Center, 1983.

Lloyd-Jones, Richard. "Skepticism about Test Scores." Notes from the
National Testing Network in Writing 3 (1983): 9.

Lucas, Catharine Keech. "Recontextualizing Literacy Assessment." The
Quarterly 10 (1988): 4-10.

McColly, William. "What Does Educational Research Say about the
Judging of Writing Ability?" Journal of Educational Research 64 (1970):
147-156.

Neilson, Lorraine, and Gene Piche. "The Influence of Headed Nominal
Complexity and Lexical Choice on Teachers' Evaluation of Writing."
Research in the Teaching of English 15 (1981): 65-74.

Nold, Ellen, and Sarah Freedman. "An Analysis of Readers' Responses to
Essays." Research in the Teaching of English 11 (1977): 164-173.

Noyes, Edward S., William M. Sale, and John M. Stalnaker. Report on the
First Six Tests in English Composition. New York: College Entrance
Examination Board, 1945.

Nystrand, Martin. "An Analysis of Errors in Written Communication."
What Writers Know. Ed. Martin Nystrand. New York: Academic
Press, 1982. 57-73.

Odell, Lee. "Defining and Assessing Competence in Writing." The Nature
and Measurement of Competency in English. Ed. Charles Cooper.
Urbana, IL: National Council of Teachers of English, 1981. 95-138.

Purves, Alan, et al. "Towards a Domain-Referenced System for Classifying
Composition Assignments." Research in the Teaching of English 18
(1984): 417-438.

Ruth, Leo, and Sandra Murphy. Designing Tasks for the Assessment of
Writing. Norwood, NJ: Ablex, 1988.

Ryzewic, Susan. The CUNY Writing Assessment Test: A Three-Year Audit.
New York: CUNY Instructional Resource Center, 1982.


White, Edward, ed. Comparison and Contrast: The California State University
Freshman English Equivalency Examination. Vol. 8. Long Beach:
California State University, 1981.

---. "Language and Reality in Writing Assessment." College Composition
and Communication 41 (1990): 187-200.

---. Teaching and Assessing Writing. San Francisco: Jossey-Bass, 1985.

Wiseman, Stephen. "The Marking of English Composition in Grammar
School Selection." British Journal of Educational Psychology 19 (1949):
200-209.

---. "Symposium: The Use of Essays in Selection at 11+." British Journal
of Educational Psychology 26 (March 1956): 172-179.

Witte, Stephen. "Direct Assessment and the Writing Ability Construct."
Notes from the National Testing Network in Writing 8 (1988): 13-14.

Witte, Stephen, et al. "Literacy and the Direct Assessment of Writing: A
Diachronic Perspective." Writing Assessment: Issues and Strategies.
Eds. Karen Greenberg, Harvey Wiener, and Richard Donovan. New
York: Longman, 1986. 13-34.
