
Assessing Writing 27 (2016) 11–23

Comparing the accuracy of different scoring methods for identifying sixth graders at risk of failing a state writing assessment
Joshua Wilson a,*, Natalie G. Olinghouse b, D. Betsy McCoach b, Tanya Santangelo c, Gilbert N. Andrada d

a School of Education, University of Delaware, Newark, DE, United States
b Department of Educational Psychology, University of Connecticut, Mansfield, CT, United States
c Department of Education, Arcadia University, Glenside, PA, United States
d Psychometrics and Applied Research, Bureau of Student Assessment, Connecticut State Department of Education, Hartford, CT, United States

The authors would like to thank Trish Martin, Fran Brown, Andrew O'Neill, and Tiwanna Bazemore of Measurement Incorporated for their assistance in preparing the datasets used for analysis. Thanks also to Andrew Petsche and Samantha Evans for assistance scoring.
* Corresponding author. Tel.: +1 3028312955. E-mail address: joshwils@udel.edu (J. Wilson).

http://dx.doi.org/10.1016/j.asw.2015.06.003
1075-2935/© 2015 Elsevier Inc. All rights reserved.

ARTICLE INFO

Article history:
Received 20 March 2015
Received in revised form 15 June 2015
Accepted 27 June 2015

Keywords:
Classification accuracy
At-risk students
Automated essay scoring
Scoring methods
Project Essay Grade
Benchmark tests
ROC curve analysis

ABSTRACT

Students who fail state writing tests may be subject to a number of negative consequences. Identifying students who are at risk of failure affords educators time to intervene and prevent such outcomes. Yet, little research has examined the classification accuracy of predictors used to identify at-risk students in the upper-elementary and middle-school grades. Hence, the current study compared multiple scoring methods with regards to their accuracy for identifying students at risk of failing a state writing test. In the fall of 2012, students composed a persuasive prompt in response to a computer-based benchmark writing test, and in the spring of 2013 they participated in the state writing assessment. Predictor measures included prior writing achievement, human holistic scoring, automated essay scoring via Project Essay Grade (PEG), total words written, compositional spelling, and sentence accuracy. Classification accuracy was measured using the area under the ROC curve. Results indicated that prior writing achievement and PEG Overall Score had the highest classification accuracy. A multivariate model combining these two measures resulted in only slight improvements over univariate prediction models. Study findings indicated that choice of scoring method affects classification accuracy, and automated essay scoring can be used to accurately identify at-risk students.

© 2015 Elsevier Inc. All rights reserved.

1. Introduction

In light of evidence that the majority of U.S. students in grades four, eight, and twelve fail to achieve grade-level proficiency in writing (National Center for Education Statistics, 2012; Persky, Daane, & Jin, 2002), a growing body of research has focused on methods of identifying struggling writers in need of intervention in the early grades (K-2), before writing difficulties
become severe and intractable (Coker & Ritchey, 2014; Fewster & McMillan, 2002; Gansle et al., 2004; McMaster, Parker, &
Jung, 2012; Parker, Tindal, & Hasbrouck, 1991; Ritchey & Coker, 2014). However, the need to develop predictive models that
identify struggling writers does not disappear as students enter upper-elementary and middle-school grades.
As students progress through these grades (G4-8), they transition from developing lower-level writing skills (handwriting, spelling, sentence construction, grammar, and punctuation) to developing higher-level writing skills, such as utilizing genre-specific methods of idea development, organization, and word choice (Berninger, Abbott, Whitaker, Sylvester, & Nolen, 1995). Coordinating these low- and high-level skills strains working memory and may impact writing performance (Flower & Hayes, 1980; Kellogg & Whiteford, 2009; McCutchen, 1996, 2011). Consequently, students in upper elementary and middle school may be at risk of developing, or worsening, writing difficulties.
In addition, students in these grades must participate in state and national accountability assessments which are used to
determine whether students have attained grade-level standards (Hamilton et al., 2007). Performance on these assessments
has a number of consequences for students, such as: (a) being assigned to particular schools, programs, or classes (i.e., academic tracking) (Decker & Bolt, 2008; Goertz & Duffy, 2003); (b) being referred for additional instructional support (Graham, Hebert, & Harris, 2011b; Jones et al., 1999); and (c) being retained or promoted to the next grade (Darling-Hammond, 2004; Hamilton et al., 2007). Indirect consequences associated with repeated failure of accountability assessments include increased risk for school dropout (Heubert & Hauser, 1999) or referral to special education (Figlio & Getzler, 2002; Haney, 2000).
Thus, given the developmental challenges faced by upper-elementary and middle-grade students with regards to writing proficiency, and the direct and indirect consequences associated with poor performance on accountability assessments, it is important to develop predictive models that accurately identify students at risk for writing failure. Once identified, at-risk students may receive intervention to prevent and remediate their writing difficulties. While an emerging body of research has focused on identifying at-risk writers in the early grades (K-2), there is little research to guide educators in developing accurate prediction models for students in upper elementary or middle school. Hence, the present study compared measures of sixth-graders' writing ability with regard to classification accuracy, i.e., the accuracy of predicting which students passed or failed a state writing test.

1.1. Prior research examining predictors of performance on state writing tests

The majority of prior research examining predictors of performance on state writing tests has focused on assessment procedures and scoring measures associated with curriculum-based measurement for writing (W-CBM). For example, a study of fourth-grade students examined correlations between writing quality measures derived from two three-minute CBM writing probes and subtest scores of the Louisiana Educational Assessment Program writing test (Gansle, Noell, VanDerHeyden, Naquin, & Slider, 2002). Writing probes were scored for 12 W-CBM measures, two computer-scored measures of text readability (the Flesch Reading Ease score and Flesch–Kincaid grade level), and computer-scored measures of sentence and vocabulary complexity. Only number of verbs, words spelled correctly, and correct word sequences (CWS) demonstrated statistically significant correlations with scores of the state test: r = .33, .29, and .41, respectively.
McMaster and Campbell (2008) sampled fifth-grade students who completed two passage copying tasks, two picture prompts, two narrative prompts, and two expository prompts. Each of the writing tasks was scored for total words written (TWW), words spelled correctly (WSC), correct word sequences (CWS), and correct minus incorrect word sequences (C-IWS). Evidence of criterion validity with the Minnesota state writing test differed by scoring metric: TWW and WSC were not statistically significantly correlated with state test performance for any of the writing tasks; CWS was moderately correlated (range r = .54–.56) for the three-minute and five-minute narrative writing tasks and the five-minute expository writing task, but for no other writing task; and C-IWS was moderately correlated (range r = .54–.68) for the three-minute and five-minute narrative and expository tasks. Similar results were reported in a study of eighth-grade students (Espin et al., 2000): moderate correlations were found between TWW, WSC, CWS, IWS, and C-IWS scored from two story writing and two descriptive writing samples and a district writing test.
Finally, Lopez and Thompson (2011) sampled students in grades 6–8 who responded to a story starter scored for CWS and who participated in the Arizona state writing assessment. CWS was not a statistically significant predictor for grade six, but demonstrated moderate correlations with the criterion measure for grades seven and eight. The authors also examined how accurately a CWS cutscore of one standard deviation below the mean identified students who scored at the "Does not meet expectations" level on the Arizona state writing test. They reported classification accuracy of 75% for grade six, 87% for grade seven, and 96% for grade eight. Classification accuracy is the percent of students correctly classified as true positives or true negatives, i.e., as truly at risk or truly not at risk. However, classification accuracy is a misleading measure of diagnostic accuracy when the base rate (i.e., prevalence) of a condition is low (Meehl & Rosen, 1955; Wilson & Reichmuth, 1985). When the base rate is low, it is possible to achieve high classification accuracy by simply diagnosing all students as not at risk. In the Lopez and Thompson study, the base rates of students failing the state writing test were 17%, 22%, and 4%, respectively, for the sixth, seventh, and eighth-grade samples. Thus, classification accuracy rates of 83%, 78%, and 96% would have been achieved by simply assuming that no students were at risk. To warrant utility for making selection/screening decisions, a measure should yield classification accuracy values significantly better than those obtained by identifying no at-risk students (Johnson, Jenkins, Petscher, & Catts, 2009), which was not the case in this study.

1.2. Summary and study purpose

Based on prior research, it appears that choice of writing task and scoring method affect the accuracy of predicting upper-elementary and middle-school students' performance on state writing tests. However, with the exception of Gansle et al. (2002), who included computer-scored measures of text readability, sentence sophistication, and vocabulary usage, prior research has focused solely on the use of W-CBM scoring procedures. These procedures involve calculating the frequency or proportion of observable and quantifiable indices of writing ability. While count/frequency-based scoring has a number of desirable qualities, including efficiency and high inter-rater reliability (Gansle et al., 2002, 2004), alternate scoring methods exist, including holistic scoring, analytic scoring, and automated essay scoring (AES). Each of these methods measures the construct of writing ability differently, and, unlike count/frequency-based scoring, each is used to score large-scale writing assessments.
Furthermore, studies have yet to examine measures derived from alternative formative assessment methods, such as benchmark assessments. Benchmark assessments are typically administered three to six times a year and evaluate knowledge and skills related to academic standards (Bulkely, Nabors Olah, & Blanc, 2010; Perie et al., 2009). In the area of writing, benchmark assessments often include extended constructed-response items, in which students are given 45–60 min to respond to a writing prompt. Compared to W-CBM, this format more closely approximates the testing conditions of state and national writing accountability assessments (Olinghouse et al., 2012). Research is needed to explore the predictive validity and classification accuracy of measures derived from benchmark writing assessments.
Finally, with one exception (Lopez & Thompson, 2011), each of the above-mentioned studies assessed evidence of criterion validity, but did not directly assess classification accuracy. Demonstrating evidence of criterion validity is not the same as demonstrating whether a measure yields accurate selection/screening decisions (Wilson & Reichmuth, 1985). Doing so requires different analytic procedures, such as Receiver Operating Characteristic (ROC) curve analysis or logistic regression, which calculate classification accuracy, sensitivity (true positive rate), and specificity (true negative rate) (Fielding & Bell, 1997; Hosmer, Lemeshow, & Sturdivant, 2013).
Thus, the purpose of the present study was to compare the accuracy of different scoring methods for identifying sixth-graders at risk of failing a state writing assessment. Students in the fall of their sixth-grade year participated in a computer-based benchmark writing assessment. Student responses were scored using human holistic scoring, AES, and count/frequency-based scoring. Correlations, ROC curve analysis, and logistic regression were used to assess criterion validity and classification accuracy. Two research questions guided the study: Does choice of scoring method affect classification accuracy? Is classification accuracy improved by combining measures in a multivariate prediction model?

2. Methods

2.1. Sample selection

The current study sampled from sixth-grade students who participated in a statewide computer-based benchmark writing assessment (BWA) between September 1st, 2012 and January 15th, 2013. The BWA was a non-compulsory assessment resource developed by the state department of education in collaboration with Measurement Incorporated. It was offered to students in grades 3–12 for the purpose of supporting teachers' instructional decision-making. Students participated in the BWA by logging in to a web-based application, inputting their ID and password, and typing their response to an on-screen prompt within a 60 min timeframe. Students responded to both system-created and teacher-created writing prompts offered in multiple genres: narrative, informative/descriptive, and persuasive. Completed responses were scored via an AES system called Project Essay Grade (PEG; Page, 1994, 2003), which provided students with an overall holistic rating of their writing quality, as well as individual trait ratings (see Section 2.2.2 for more information on PEG). In accord with the Family Educational Rights and Privacy Act (FERPA) (20 U.S.C. 1232g; 34 CFR Part 99), both individual participant data and the identity of the state have been de-identified for reporting here.
In order to build and compare predictive models, a sample of 272 students was selected in the following manner. First, we identified all students who participated in the BWA and responded to persuasive writing prompts (n = 1666). The persuasive genre was selected because the sixth-grade state writing test evaluated persuasive text. It was hypothesized that keeping the prompt genre consistent between the BWA and the state writing test would increase the accuracy of the prediction models. Indeed, research indicates that prompt genre contributes unique variance to students' writing performance (Graham, Harris, & Hebert, 2011a; Olinghouse & Wilson, 2013). Persuasive prompts were identified in two ways. The genre of system-created prompts was identified a priori by the test developer, Measurement Incorporated. The genre of teacher-created prompts was identified based on consensus by the first and second author. Specifically, a persuasive prompt was one which required students to support, defend, or argue (for or against) a position using reasons and details. This definition of persuasive text is consistent with the Common Core State Standards (CCSSI, 2010).
Then, all students who responded to persuasive prompts were dichotomized as either at risk or not at risk based on whether they passed or failed the spring 2013 state writing test. At-risk students were classified as those who scored in the failing range of the state test. Specifically, at-risk students were those who scored at Bands 1 and 2, indicating Below Basic and Basic performance levels. Students who scored at Bands 3–5 (Proficient, Goal, or Advanced) passed the state writing test and were deemed not at risk.

Table 1
Demographic information for sample of at-risk and not-at-risk students.

Variable                                   At-risk(a)   Not-at-risk(b)   Pearson chi-square    Strength of association(c)
Total Students (n)                         136          136
Number of Districts Represented (n)        15           13
Number of Schools Represented (n)          18           15
Male (%)                                   66.90        41.90            17.13***, df = 1      .25
Race (%)                                                                 14.29*, df = 6        .23
  White                                    63.90        71.50
  Hispanic/Latino                          21.50        9.00             At-risk > not-at-risk
  African American                         8.30         11.00
  Asian                                    1.40         4.20
  American Indian/Native Alaskan           .70          2.80
  Native Hawaiian/Pacific Islander         .70
  Two or more races                        3.50         2.10
Free or Reduced Lunch (%)                  44.90        30.10            6.28*, df = 1         .15
English Language Learners (%)              5.90         .70              5.63*, df = 1         .14
Special Education (%)                      41.90        8.80             39.32***, df = 1      .38
DRG(e) (%)                                                               11.46**, df = 3       .21
  A-C                                      22.10        21.30
  D-F                                      50.00        49.30
  G-I                                      14.70        25.70
  X-Y                                      13.20        3.70             At-risk > not-at-risk

(a) At-risk = students who scored at Below Basic and Basic levels on the sixth-grade state writing test.
(b) Not-at-risk = students who scored at or above Proficient on the sixth-grade state writing test.
(c) Measured using the phi coefficient and Cramer's V.
(e) DRG = District Reference Group, a statewide classification of educational districts based on indicators of socio-economic status, student need, and school enrollment. DRG A represents districts with the greatest affluence and least need, while DRG I represents those with the least affluence and greatest need. DRG X and Y indicate charter and magnet schools, respectively.
* p < .05.
** p < .01.
*** p < .001.

The final step of sample selection involved selecting equal samples of at-risk and not-at-risk students. Obtaining equal samples of students across outcome categories maximizes power to detect the effect of an independent variable (or set of variables) for contributing to a predictive model estimated using ROC curve analysis and logistic regression (Zweig & Campbell, 1993). Furthermore, doing so addresses the issue of measuring classification accuracy in samples with low base rates (Meehl & Rosen, 1955). Of the 1666 students who responded to persuasive prompts, 136 students performed at Bands 1 and 2 on the state writing test. These students were selected to form the at-risk sample. Next, a stratified random sample of 136 not-at-risk students was selected from Bands 3, 4, and 5 to achieve the full sample of 272 students.
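As an illustration of this sampling step, the sketch below (not the authors' code) selects every at-risk student and an equally sized, stratified random sample of not-at-risk students from Bands 3–5. The DataFrame name, its `band` column, and the near-equal per-band allocation are assumptions made for the example; Table 2 suggests the actual allocation was approximately equal across Bands 3–5.

```python
# Sketch of the sample-selection step; `bwa` is an assumed pandas DataFrame
# with one row per student and a `band` column (spring 2013 achievement band).
import pandas as pd

def build_model_sample(bwa: pd.DataFrame, seed: int = 2013) -> pd.DataFrame:
    at_risk = bwa[bwa["band"].isin([1, 2])].copy()   # all Band 1-2 students
    pool = bwa[bwa["band"].isin([3, 4, 5])]          # not-at-risk population

    # Stratified random sample: draw a near-equal number from each of
    # Bands 3-5 so the not-at-risk group matches the at-risk group in size.
    bands = sorted(pool["band"].unique())
    per_band, remainder = divmod(len(at_risk), len(bands))
    sizes = {b: per_band + (1 if i < remainder else 0) for i, b in enumerate(bands)}
    not_at_risk = pd.concat(
        [pool[pool["band"] == b].sample(n=n, random_state=seed) for b, n in sizes.items()]
    ).copy()

    at_risk["risk_status"] = 1        # 1 = at risk (failed the state test)
    not_at_risk["risk_status"] = 0    # 0 = not at risk
    return pd.concat([at_risk, not_at_risk], ignore_index=True)
```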
Table 1 displays demographic data, reported separately for at-risk and not-at-risk students. All variables displayed statistically significant but weak associations with risk status. Chi-square tests of independence indicated that at-risk students were statistically significantly more likely to be male, of Hispanic/Latino ethnicity, to receive free/reduced lunch, to be English Language Learners, to receive special education services, or to attend charter or magnet schools.

2.2. Measures and fidelity

2.2.1. Criterion measure


The sixth-grade state writing test consisted of two sections, the Direct Assessment of Writing (DAW) and the Editing and Revising (ER) test. The DAW required students to compose a persuasive essay in response to a prompt within a 45 min time limit. Student essays were scored holistically by two raters whose scores (each on a 1–6 scale) were summed together for a final score (range: 2–12). Raters used a rubric which evaluated the following features: inclusion of elaboration and specific details, logical organization, and proficiency of language usage including transition words. The ER test required students to answer a total of 36 multiple choice questions which assessed their ability to identify errors in word choice, grammar, spelling, punctuation, capitalization, and organization. Each multiple choice item was worth one point (range: 0–36). Raw scores on both sections were combined to form a weighted composite score which was then converted into a scale score (range: 100–400). The state then identified performance cutpoints along the scale score range to correspond to the five achievement bands. Table 2 displays the distribution of at-risk and not-at-risk students in the current study by achievement band and writing scale score range. For data analysis, performance on the criterion measure was dichotomized as 1 = at risk and 0 = not at risk, based on whether or not students failed the 2013 sixth-grade state writing test (see Section 2.1).

Table 2
Sample distribution by achievement band and writing scale score.

Band              N     Sample scale score range    State scale score range for band
1 = Below Basic   26    140–183                     100–184
2 = Basic         110   185–209                     185–210
3 = Proficient    45    212–236                     211–236
4 = Goal          46    239–279                     237–283
5 = Advanced      45    284–400                     284–400

N = 272. At-risk students were defined as those scoring at Bands 1 and 2 (n = 136). Not-at-risk students were defined as those scoring at Bands 3–5 (n = 136).

2.2.2. Predictor measures


Predictor measures included a measure of prior writing achievement and measures derived from applying different scoring methods to students' writing samples generated in response to persuasive prompts from the BWA. Scoring methods included human holistic scoring, automated essay scoring, and four measures assessed using count/frequency-based scoring: text length, words spelled correctly, percent of words spelled correctly, and sentence accuracy. Each of these predictor measures is described in turn, below. Table 3 illustrates each of the study measures organized by date of administration.

2.2.2.1. Prior writing achievement. Prior writing achievement was operationalized as students' scale score (range: 100–400) on the spring 2012 state writing test, a test given when the sample was in fifth grade. Like the sixth-grade test, this test included two subtests, the DAW and the ER. However, the genre of the fifth-grade prompt was expository, as opposed to persuasive, which was used for the sixth-grade test.

2.2.2.2. Holistic scoring. For the current study, a genre-specific holistic rubric was developed based on rubrics used by the authors in prior research (Olinghouse & Wilson, 2013). This rubric evaluated students' overall ability to compose persuasive writing and asked raters to consider the following components based on the sixth-grade Common Core State Standards: organization and structure, appropriate use of the elements of persuasive writing, persuasion and elaboration, variety and accuracy of sentence structure, variety and maturity of vocabulary, and accuracy of spelling. Consistent with procedures outlined by Penny, Johnson, and Gordon (2000), two graduate assistants independently double-scored all of the texts, assigning scores from 1 to 6 which they supplemented with a (+) or (−) to indicate a high or low score for that score point. Scores were then transformed to a scale of 1–18 (e.g., 1− = 1, 1 = 2, 1+ = 3, 2− = 4, and so on). To achieve high inter-rater reliability (IRR), the two raters were trained on the use of the rubric and then completed multiple rounds of training. After each round, the raters discussed their scores and why they assigned the score they did. During initial rounds of training, the first and second author were present to arbitrate and to clarify language in the rubric. Two criteria needed to be met before training was complete. First, the raters needed to achieve a minimum of 80% exact agreement for the non-supplemented rubric range (range: 1–6), and, following this, they needed to achieve a minimum of 80% exact agreement for the supplemented rubric range (1–18). Then, the two raters independently double-scored all of the texts. IRR was high: r = .95 (p < .001); percentage of exact agreement = 89.33%. Scoring differences were resolved via consensus prior to data analysis.
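For clarity, the mapping from the supplemented 1–6 ratings onto the expanded 1–18 scale reduces to a simple arithmetic rule. The sketch below is illustrative only and assumes the plus/minus supplements were recorded as separate suffixes.

```python
# Convert a 1-6 holistic rating with a "-"/"+" supplement to the 1-18 scale
# (1- = 1, 1 = 2, 1+ = 3, 2- = 4, ..., 6+ = 18). Illustrative sketch only.
def expand_holistic(score: int, suffix: str = "") -> int:
    if not 1 <= score <= 6 or suffix not in ("-", "", "+"):
        raise ValueError("score must be 1-6 with suffix '-', '', or '+'")
    offset = {"-": 1, "": 2, "+": 3}[suffix]
    return 3 * (score - 1) + offset

assert expand_holistic(1, "-") == 1
assert expand_holistic(2, "-") == 4
assert expand_holistic(6, "+") == 18
```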

2.2.2.3. PEG quality score. The BWA utilized an AES engine called Project Essay Grade (PEG) which was developed by Ellis Page
and colleagues (Page, 1966, 1994; Page, Poggio, & Keith, 1997) and acquired in 2002 by Measurement Incorporated. PEG
uses a combination of techniques such as natural language processing, syntactic analysis, and semantic analysis to measure

Table 3
Study measures organized by date of administration.

Spring 2012: Prior writing achievement
  Fifth-grade state writing assessment scored on a 100–400 point range, based on performance on two subtests.
  Direct Assessment of Writing: students compose text for 45 min in response to an expository prompt.
  Editing and Revising Test: students answer 36 multiple choice questions related to knowledge of conventions and grammar.

Fall–Winter 2012: Benchmark writing assessment (BWA)
  Computer-based benchmark writing assessment administered by teachers using system-created or teacher-created prompts.
  Student writing samples were scored using the following measures:
    Human Holistic Score(a)
    PEG Overall Score(b)
    Text Length(c)
    Words Spelled Correctly(c)
    Percent Words Spelled Correctly(c)
    Sentence Accuracy(c)

Spring 2013: Criterion measure
  Sixth-grade state writing assessment scored on a 100–400 point range, based on performance on two subtests.
  Direct Assessment of Writing: students compose text for 45 min in response to a persuasive prompt.
  Editing and Revising Test: students answer 36 multiple choice questions related to knowledge of conventions and grammar.

(a) Human Holistic Score range = 1–18.
(b) PEG = Project Essay Grade, an automated essay quality rating ranging from 6 to 36.
(c) Measures derived using count/frequency-based scoring similar to that of curriculum-based measurement for writing (W-CBM).

more than 500 variables that are combined in a regression-based algorithm that predicts human holistic and analytic trait ratings (T. Martin, personal communication). PEG has demonstrated strong evidence of score reliability and convergent validity with human essay ratings (Keith, 2003; Page et al., 1997; Shermis, 2014; Shermis, Koch, Page, Keith, & Harrington, 2002; Shermis, Mzumara, Olson, & Harrington, 2001). Within the context of the BWA, PEG provided students with an Overall Score ranging from 6 to 36. The PEG Overall Score summarized students' performance across six traits of writing quality, each measured on a 1–6 scale: overall development, organization, support, sentence structure, word choice, and mechanics. Since PEG is an AES system, scores were measured in the absence of rater error.

2.2.2.4. Text length. Text length was measured by counting the total words written (TWW), disregarding spelling errors. TWW is a general outcome measure of overall writing ability used in W-CBM for middle-grade students (Espin, De La Paz, Scierka, & Roelofs, 2005; Espin et al., 2000; McMaster & Campbell, 2008; Parker et al., 1991). TWW was calculated by PEG, which used a word-count program similar to that of Microsoft Word™, allowing text length to be measured without rater error.

2.2.2.5. Compositional spelling. Compositional spelling refers to the ability to accurately spell words when composing text versus when responding to dictated spelling assessments. Like TWW, compositional spelling is a general outcome measure of writing ability used in W-CBM for middle-grade students (Gansle et al., 2002; Espin et al., 2000; Fewster & MacMillan, 2002). Compositional spelling was measured in two ways: (a) as a count of the number of words spelled correctly (WSC), and (b) as the percentage of correctly spelled words in a text (%WSC). The first measure, WSC, did not control for text length, and thereby provided information on both compositional spelling and text length. The second measure, %WSC, was calculated as %WSC = ([WSC/TWW] × 100). This measure controlled for individual differences in text length. Both measures were calculated without human rater error by using PEG's spelling-error detection software, which is similar to that used by Microsoft Word.
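The three spelling-related counts can be computed as shown in the sketch below. This is a minimal illustration, not PEG's software: the regex tokenizer and the toy dictionary are stand-ins for PEG's spelling-error detection.

```python
# Compute TWW, WSC, and %WSC = (WSC / TWW) x 100 for a writing sample.
# The tokenizer and dictionary lookup are simplifying assumptions.
import re

def spelling_measures(text: str, dictionary: set) -> dict:
    words = re.findall(r"[A-Za-z']+", text)                  # total words written
    tww = len(words)
    wsc = sum(1 for w in words if w.lower() in dictionary)   # words spelled correctly
    pwsc = (wsc / tww) * 100 if tww else 0.0
    return {"TWW": tww, "WSC": wsc, "%WSC": round(pwsc, 2)}

lexicon = {"the", "dog", "runs", "fast"}                     # toy dictionary
print(spelling_measures("The dog runns fast", lexicon))      # {'TWW': 4, 'WSC': 3, '%WSC': 75.0}
```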

2.2.2.6. Sentence accuracy. The Sentence Accuracy measure was scored using procedures similar to scoring for correct word
sequence (CWS), a general-outcome measure used in W-CBM for middle-grade students (Espin et al., 2000; Gansle et al.,
2002, 2004; Lopez & Thompson, 2011; McMaster & Campbell, 2008). Pairs of adjacent words were assessed for errors in
grade-level conventions related to capitalization, punctuation, usage/grammar, and sentence structure as documented in
the fifth-grade Common Core State Standards (CCSSI, 2010). Fifth-grade standards were selected because students were
assessed early in sixth grade, before they had an opportunity to master the conventions of sixth-grade writing. To control for
individual differences in text length, the total number of correct word pairs was divided by the total number of word pairs
in the text to generate a proportion. Inter-rater reliability was calculated as the percentage of exact agreement between
two raters who double-scored 100% of the writing samples (n = 300). IRR was high: exact agreement = 98.64%. All scoring
differences were resolved through consensus prior to data analysis.

2.3. Data analysis

ROC curve analysis was used to estimate the overall classification accuracy of predictive models. Classification accuracy was summarized via the area under the ROC curve statistic (AUC). The AUC is interpreted as the estimated probability that, given a specific prediction model and a randomly selected pair of cases in which one case has Y = 1 and the other case has Y = 0, the case with Y = 1 will have a higher predicted probability than the case with Y = 0 (Zweig & Campbell, 1993). AUC values range from .5 to 1.0, with .5 representing chance or random guessing, and 1.0 representing perfect discrimination. The following guidelines are suggested for interpreting AUC values: .5 = chance; .5 < AUC < .7 = poor discrimination; .7 ≤ AUC < .8 = acceptable discrimination; .8 ≤ AUC < .9 = excellent discrimination; AUC ≥ .9 = outstanding discrimination (Hosmer et al., 2013).
In addition, results of the ROC curve analysis were used to select cutpoints and probability thresholds associated with sensitivity values of .90. Sensitivity refers to the rate of identifying true positives and is calculated as True Positives/(True Positives + False Negatives). A minimum acceptable sensitivity for models designed to prevent academic failure is .90 (Jenkins, Hudson, & Johnson, 2007). Models were then compared according to the false positive rate associated with sensitivity equal to .90 (FP rate = 1 − specificity). This was done to provide an indication of the feasibility of each model. Models with higher FP rates are less feasible because schools' limited resources are taxed by unnecessarily providing intervention to students who are not at risk.
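The sketch below shows how this procedure could be carried out with scikit-learn for a single predictor. It is an illustration under assumed variable names (risk_status, predictor), not the analysis code used in the study.

```python
# ROC analysis for one screening measure: compute the AUC, then find the
# cutpoint whose sensitivity reaches .90 and report the associated FP rate.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def screen_accuracy(risk_status, predictor, min_sensitivity=0.90):
    # Lower writing scores indicate greater risk, so negate the predictor so
    # that higher screening scores correspond to risk_status = 1.
    scores = -np.asarray(predictor, dtype=float)
    auc = roc_auc_score(risk_status, scores)
    fpr, tpr, thresholds = roc_curve(risk_status, scores)

    idx = np.argmax(tpr >= min_sensitivity)      # first threshold with sensitivity >= .90
    return {"AUC": auc,
            "cutpoint": -thresholds[idx],        # back on the original score scale
            "sensitivity": tpr[idx],
            "FP rate": fpr[idx]}                 # FP rate = 1 - specificity
```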
Finally, to determine whether observed differences in AUC values were statistically significant, a critical ratio z value was calculated using procedures outlined by Hanley and McNeil (1983). This procedure corrects for dependence in AUC values when both values are derived from the same participants. The equation takes the form of $z = (AUC_1 - AUC_2) / \sqrt{SE_1^2 + SE_2^2 - 2r\,SE_1 SE_2}$, where $r$ is the estimated correlation between $AUC_1$ and $AUC_2$.
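The critical ratio is straightforward to compute once the two AUCs, their standard errors, and the estimated correlation r (tabled in Hanley & McNeil, 1983) are in hand. The function below is a direct transcription of the formula, not the authors' implementation.

```python
# Hanley and McNeil (1983) critical ratio z for two correlated AUCs
# estimated from the same participants, with a two-tailed p value.
from math import sqrt
from scipy.stats import norm

def hanley_mcneil_z(auc1, se1, auc2, se2, r):
    z = (auc1 - auc2) / sqrt(se1**2 + se2**2 - 2 * r * se1 * se2)
    p = 2 * (1 - norm.cdf(abs(z)))
    return z, p
```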

2.3.1. Developing univariate and multivariate prediction models


Univariate and multivariate prediction models were generated to predict Risk Status (coded 1 = at risk, 0 = not at risk).
Univariate models were estimated directly using ROC curve analysis. To determine whether classification accuracy could be

Table 4
Correlations and descriptive statistics of criterion and predictor measures.

Measure                                 1.      2.       3.      4.      5.       6.       7.      8.
1. Risk Status(a)
2. Prior Writing Achievement(b)         −.62
3. Human Holistic Score(c)              −.41    .53
4. PEG Overall Score(d)                 −.49    .58      .81
5. Text Length(e)                       −.42    .55      .79     .87
6. Words Spelled Correctly              −.43    .55      .80     .87     1.00
7. Percent Words Spelled Correctly      −.31    .35      .49     .44     .39      .45
8. Sentence Accuracy                    −.30    .45      .37     .34     .25      .29      .52

Mean                                            236.76   8.28    19.83   270.15   256.26   93.75   78.24
Standard Deviation                              44.21    3.33    4.76    147.63   145.57   5.18    9.94

Note. N = 272. All correlations statistically significant at p < .001.
(a) Risk Status was coded 1 = at risk, 0 = not at risk.
(b) Prior Writing Achievement = fifth-grade performance on the state writing accountability test (scale score range: 100–400).
(c) Human Holistic Score range = 1–18.
(d) PEG Overall Score range = 6–36.
(e) Text Length = total words written.

improved by combining measures derived from multiple scoring methods, a multivariate prediction model was developed using logistic regression and then analyzed using ROC curve analysis. To derive this model, each of the seven predictors was entered, and then the model was re-estimated retaining only statistically significant unique predictors. Model fit was assessed using the deviance statistic and the Hosmer–Lemeshow Chi-Square test (Hosmer et al., 2013). The deviance statistic (D) measures whether the probabilities produced by the model accurately reflect the observed outcomes. The deviance is calculated as −2 times the log likelihood value (D = −2LL) and is distributed as a Chi-Square with df equal to the number of parameters in the model. The Hosmer–Lemeshow Chi-Square test is generated by comparing expected and observed values of estimated probabilities for a set of predictors across deciles. Non-significant p-values indicate failure to reject the null hypothesis that the model fits the data well. To test whether the parsimonious model fit the data equally as well as the more complex model with all seven predictors, the difference in deviance statistics (ΔD) was used. This statistic is distributed as a chi-square with df equal to the difference in the number of parameters in the two models.
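A possible implementation of the deviance comparison with statsmodels is sketched below (the Hosmer–Lemeshow test is omitted). The DataFrame and column names are placeholders, and the quartile conversion discussed in Section 3.2 is assumed to have been applied already.

```python
# Fit the full and trimmed logistic regression models, compute each deviance
# (D = -2LL), and test the change in deviance (delta D) against a chi-square
# distribution with df equal to the difference in number of parameters.
import statsmodels.api as sm
from scipy.stats import chi2

def compare_models(df, all_predictors, trimmed_predictors, outcome="risk_status"):
    def fit(cols):
        X = sm.add_constant(df[cols])
        return sm.Logit(df[outcome], X).fit(disp=False)

    full, trimmed = fit(all_predictors), fit(trimmed_predictors)
    d_full, d_trimmed = -2 * full.llf, -2 * trimmed.llf
    delta_d = d_trimmed - d_full
    df_diff = len(all_predictors) - len(trimmed_predictors)
    p = chi2.sf(delta_d, df_diff)   # non-significant p: trimmed model fits as well
    return {"D_full": d_full, "D_trimmed": d_trimmed,
            "delta_D": delta_d, "df": df_diff, "p": p}
```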

3. Results

Table 4 presents descriptive statistics and correlations among the predictor variables and Risk Status. Prior writing achievement displayed the strongest correlation with Risk Status (r = −.62), indicating that students with higher scale scores on the fifth-grade state writing test tended to be not at risk in sixth grade. PEG Overall Score displayed the next strongest correlation with Risk Status (r = −.49). Human holistic scoring, text length, and words spelled correctly (WSC) displayed virtually identical correlations with Risk Status: r = −.41, −.42, and −.43, respectively. When compositional spelling was measured as the percent of words spelled correctly (%WSC), the relationship with Risk Status weakened (r = −.31). Sentence accuracy, a percentage-based measure which also controlled for text length, displayed a small but statistically significant correlation with Risk Status (r = −.30).

3.1. Univariate prediction models

Results of ROC curve analyses comparing the classification accuracy of the seven predictors are presented in Table 5. Also included in that table are the cutpoints associated with sensitivity equal to .90. Prior writing achievement was the only screening measure to yield an AUC value in the excellent range (.88). This measure also had the lowest associated false positive rate (FP rate = .34). PEG Overall Score yielded the second highest AUC value (.78) and second lowest FP rate (.45). Similar to the correlation analyses, Human Holistic Score, TWW, and WSC performed virtually identically. Their AUC values

Table 5
ROC curve results of univariate screening models.

Measure                             AUC    CI 95% AUC    Cutpoint    Sensitivity    FP rate
Prior Writing Achievement           .88    [.84, .92]    247         .90            .34
Human Holistic Score                .74    [.68, .80]    10          .90            .55
PEG Overall Score                   .78    [.73, .84]    22          .90            .45
Text Length                         .75    [.69, .81]    350         .90            .59
Words Spelled Correctly             .76    [.70, .82]    334         .90            .62
Percent Words Spelled Correctly     .69    [.63, .75]    99.00       .90            .84
Sentence Accuracy                   .68    [.61, .74]    86.57       .90            .72

N = 272.

Table 6
Coefficients of the optimal multivariate model predicting risk status.

Predictor                     β (SE)           Wald
Prior Writing Achievement     −.05 (.01)       49.65***
PEG Overall Score             −.18 (.05)       14.3***
Constant                      14.84 (1.82)     66.28***

N = 272.
*** p < .001.

and FP rates were within 1% of each other. The percentage-based measures, %WSC and Sentence Accuracy, had the lowest
AUC values (AUC = .69 and .68, respectively) and highest FP rates (.84 and .72, respectively).
To determine if differences in the point estimates of AUC values were statistically significant, a critical ratio z value was calculated using the method developed by Hanley and McNeil (1983). Based on these comparisons, prior writing achievement had superior accuracy compared to PEG Overall Score (z = 3.16, p = .002). PEG Overall Score had superior accuracy compared to Human Holistic Scoring (z = 1.98, p = .048), TWW (z = 2.37, p = .018), %WSC (z = 2.80, p = .005), and Sentence Accuracy (z = 2.99, p = .003). The difference in AUC values for PEG Overall Score and WSC approached statistical significance (z = 1.87, p = .062), but PEG Overall Score demonstrated superiority in terms of its lower FP rate (.45 versus .62). None of the differences in AUC values between human holistic scoring and the count/frequency-based measures were statistically significant at p ≤ .05: TWW (z = .45, p = .653), WSC (z = .86, p = .390), %WSC (z = 1.54, p = .124), Sentence Accuracy (z = 1.74, p = .082). Thus, the most accurate predictors were Prior Writing Achievement followed by PEG Overall Score.

3.2. Multivariate prediction model

Following recommendations of Hosmer et al. (2013), prior to estimating the logistic regression model, each predictor was analyzed independently with the criterion using a crosstabs analysis to determine the extent of sparseness in the data. Sparseness refers to having zero frequencies, or very small frequencies (<5), in some of the cells. When sparseness affects greater than 20% of cells in a logistic regression model, the Pearson Chi-Square test of model fit and the Wald statistic test are not valid because their underlying distribution is no longer a Chi-Square (Cohen, Cohen, West, & Aiken, 2003). One method of addressing sparseness is to convert continuous variables into quartile scores and use the quartile-converted measures as predictors in the logistic regression model (Hosmer et al., 2013). Based on the crosstabs analysis, the four count/frequency-based predictors (TWW, WSC, %WSC, and Sentence Accuracy) were subject to issues of sparseness and required conversion to quartile measures. For each measure, 100% of the cells in the crosstabs had expected counts less than 5. Therefore, these variables were input into the logistic regression model as quartile scores.
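With pandas, for example, this conversion can be done with qcut; the sketch below assumes a DataFrame of scores and uses placeholder column names for the four count/frequency-based measures.

```python
# Replace each sparse count/frequency-based measure with quartile scores (1-4)
# before entering it as a predictor in the logistic regression model.
import pandas as pd

def to_quartiles(df: pd.DataFrame, columns) -> pd.DataFrame:
    out = df.copy()
    for col in columns:
        # labels=False returns bin codes 0-3; add 1 to obtain quartiles 1-4
        out[col] = pd.qcut(out[col], q=4, labels=False, duplicates="drop") + 1
    return out

# e.g., to_quartiles(bwa_scores, ["TWW", "WSC", "PWSC", "SentenceAccuracy"])
```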
The deviance (D) of the null model with only the outcome variable, Risk Status, was 377.07. The deviance of the full model with all seven predictors was 213.09, indicating a statistically significantly better fit to the data than the null model (ΔD = 163.98, df = 7, p < .001). Only two variables were unique predictors: Prior Writing Achievement and PEG Overall Score. Accordingly, a trimmed model was estimated which included these two variables. The deviance of this model was 219.31. The Hosmer–Lemeshow Chi-Square test indicated this model was a good fit to the data: χ2 = 7.42, df = 8, p = .493. The parsimonious model fit the data equally well as the full model with all seven predictors (ΔD = 6.22, df = 5, p = .285).
Table 6 presents the logistic regression coefficients of this model. The model met the assumptions of logistic regression. The assumption of linearity of the logit was met for each of the predictors, as indicated by non-statistically significant Wald values for the interaction term between each predictor and its natural log (Field, 2013). Tests of multicollinearity between the predictors in the model also indicated no issues: all tolerance and VIF values were within normal limits. Thus, the model including only Prior Writing Achievement and PEG Overall Score was retained as the optimal prediction model.
Predicted probabilities generated from this model were estimated using ROC curve analysis. The resulting AUC was in the excellent to outstanding range: AUC = .89, CI95% = [.86, .93]. Based on a probability threshold of .35, sensitivity was equal to .90 and the concomitant FP rate was .33. Hanley and McNeil (1983) comparisons between the AUC of this model and that of the univariate prediction model containing Prior Writing Achievement approached statistical significance: z = 1.93, p = .053. This indicates that while PEG Overall Score contributed uniquely to the multivariate prediction model, its contribution resulted in minimal improvements in overall classification accuracy.

3.3. Illustration of performance of optimal model for predicting at-risk students

The multivariate model which included Prior Writing Achievement and PEG Overall Score was applied to the full sample of 1666 sixth-grade students who participated in the BWA and who responded to persuasive prompts. This involved sampling with replacement: the 272 students used to develop the prediction model were replaced into the population of students from which they were drawn. To apply the prediction model, the logistic regression coefficients from Table 6 were used to calculate a predicted logit for each student using the equation $\mathrm{logit}(y) = e^{14.84 + (-0.05)\,\mathrm{PriorWritingAch} + (-0.18)\,\mathrm{PEGOverallScore}}$. The predicted logit was converted to a predicted probability using the equation $P(y) = \mathrm{logit}(y)/(1 + \mathrm{logit}(y))$. Then, the ROC probability threshold associated with 90% sensitivity was applied to identify students predicted to be at risk. Any student
whose predicted probability was ≥ .35 was identified as at risk of failing the state writing assessment. Using the resultant classifications, it was possible to assess the model's sensitivity, specificity, false positive rate, and overall classification accuracy.
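A sketch of this application step is shown below, using the coefficients reported in Table 6 and the .35 probability threshold. The array names are placeholders for the full BWA sample, and the code is illustrative rather than the authors' implementation.

```python
# Apply the multivariate model: compute each student's predicted probability
# from the Table 6 coefficients, flag students at or above the .35 threshold,
# and summarize sensitivity, specificity, FP rate, and overall accuracy.
import numpy as np

def classify(prior_ach, peg, actual_risk, threshold=0.35):
    prior_ach, peg, actual_risk = map(np.asarray, (prior_ach, peg, actual_risk))
    odds = np.exp(14.84 - 0.05 * prior_ach - 0.18 * peg)   # the "predicted logit" in the text
    prob = odds / (1 + odds)                               # P(y) = odds / (1 + odds)
    predicted = (prob >= threshold).astype(int)

    tp = np.sum((predicted == 1) & (actual_risk == 1))
    fp = np.sum((predicted == 1) & (actual_risk == 0))
    tn = np.sum((predicted == 0) & (actual_risk == 0))
    fn = np.sum((predicted == 0) & (actual_risk == 1))
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "FP rate": fp / (fp + tn),
            "accuracy": (tp + tn) / actual_risk.size}
```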
The multivariate prediction model performed well. It yielded a sensitivity value of .93, the upper limit of the 95% CI for the AUC value derived from the model-building sample (n = 272). Its specificity value was in the acceptable range (.75). The associated false positive rate was .25, which was lower than the false positive rate of .33 derived from the model-building sample. Total classification accuracy was 77%. In real terms, had this model been used to identify at-risk students, a total of 520 students (31%) would have been identified as needing intervention. However, since only 136 of these students were true positives, the model still identified a substantial number of false positives.

4. Discussion

The purpose of the present study was to compare multiple methods of scoring writing samples with regard to their accuracy for identifying students at risk of failing a state writing test administered in the spring of 2013. Prompts written in response to a computer-based benchmark writing assessment administered in the fall were scored for six different measures of writing ability using three scoring methods: human holistic scoring, AES, and count/frequency-based scoring. In addition, a measure of prior writing achievement (students' performance on the fifth-grade state writing test) was also examined as a predictor. The current study extends previous research by comparing measures derived from three different scoring methods, by examining predictors derived from a benchmark writing assessment, and by utilizing analytic methods that directly evaluated classification accuracy.
Results of the ROC curve analyses indicated that choice of scoring method does affect classification accuracy. AUC values of univariate predictors ranged from poor to excellent discrimination across measures (.68–.88). FP rates also exhibited a wide range across predictors (.34–.84), indicating dramatically different implications for feasibility. Interestingly, the measures most commonly used as general outcome measures in W-CBM (the four count/frequency-based measures of TWW, WSC, %WSC, and Sentence Accuracy) had the highest FP rates among the predictors, thereby limiting their feasibility. This differs from findings of prior research suggesting that these measures have utility for supporting selection/screening decisions for upper-elementary and middle-grade students in the context of state writing tests (Espin et al., 2000; Gansle et al., 2002; McMaster & Campbell, 2008). This underscores the importance of supplementing correlation analyses that provide criterion-related evidence of validity with analytic procedures that directly evaluate classification accuracy.
Also of note was that although human holistic scoring had a lower FP rate than the count/frequency-based measures, there were no statistically significant differences between any of these measures in terms of overall classification accuracy, as measured by differences in the AUC. This has implications for stakeholders who must select the most accurate, efficient, and cost-effective method of scoring benchmark writing assessments. Human holistic scoring incurs greater time and financial costs, and is arguably less reliable (Huot, 1990), than simply measuring the total number of words written. Results of the current study suggest that bearing these costs does not afford greater classification accuracy than simply measuring TWW.
To our knowledge, this was the first study to compare the classification accuracy of an AES writing quality measure (PEG Overall Score) with human holistic scoring and count/frequency-based measures. The current study utilized an AES system called Project Essay Grade (PEG) which yielded an Overall Score ranging from 6 to 36, summarizing students' performance across six traits of writing ability, each measured on a 1–6 scale. PEG Overall Score demonstrated statistically significantly superior classification accuracy compared to the other scoring methods applied to prompts generated in response to the benchmark writing assessment. Also, PEG Overall Score was the only predictor to add to the multivariate prediction model that included prior writing achievement. This suggests that beyond efficiency and reliability, PEG may have utility for identifying at-risk students before they experience writing failure. This is a notable finding because automated formative assessment systems employing AES are increasingly utilized by states and school districts in upper elementary and middle school (Folz, 2014; Warschauer & Grimes, 2008), and study results suggest such systems may be used to accurately identify at-risk students in need of additional writing intervention.
While the classification accuracy of PEG Overall Score was superior to the other predictors associated with the benchmark assessment, it was not superior to Prior Writing Achievement. Prior writing achievement had the highest AUC value and lowest concomitant FP rate of any predictor, suggesting that students' performance on the state writing test was relatively stable between fifth and sixth grade. Such stability emphasizes the importance of using predictive models to identify and provide intervention to upper-elementary and middle-school students. In the absence of such prevention efforts, students are likely to experience continued writing failure and be placed at greater risk for broader academic difficulties, school dropout, or referral to special education (Graham & Perin, 2007; Heubert & Hauser, 1999).
Efforts to improve classification accuracy by combining measures in a multivariate prediction model proved relatively unsuccessful. While the addition of PEG Overall Score resulted in a statistically significant improvement in the fit of the logistic regression model, the practical significance of this model was minor. Specifically, the multivariate model had a 1% higher AUC and 1% lower FP rate than the univariate model with Prior Writing Achievement. Nevertheless, application of the multivariate model in the full population of sixth-graders who responded to persuasive prompts identified 93% of true positives and had a lower FP rate (25%) than any univariate model.

4.1. Implications of study findings for construct representation

Current notions of validity focus on developing a construct argument with claims and warrants based on specific sources of validity evidence to support the design and use of an assessment for a specific purpose (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014; Cronbach, 1988; Kane, 1992, 2006; Messick, 1989). Construct definition is a critical first step in developing such an argument (Kane, 2006), and it is the foundation of contemporary methods of assessment design such as Evidence-Centered Design (Mislevy & Haertel, 2006).
While there is no consistently agreed-upon definition of the construct of writing ability, writing has been defined as a complex skill requiring the coordination of knowledge sources (topic, genre, linguistic, process), cognitive processes (planning, translating, reviewing, evaluating, repairing), affective processes (motivation, disposition and self-efficacy, self-regulation), and fluency and accuracy (handwriting, keyboarding, spelling, grammar, sentence construction skills), in order to communicate meaning to an audience for a specific purpose (Deane, 2013; Hayes, 2012). Given this definition, what are the implications that (a) among the measures that were applied to the BWA, the automated essay quality score (PEG Overall Score) was the most accurate; and (b) a measure of text length (TWW) was just as accurate as human holistic scoring?
Automated essay scoring and text length have both been criticized for lacking construct representativeness (see Perelman, 2012, 2014). Criticisms focus on the fact that automated scoring is unable to process the meaning of language, and that TWW, as a measure of text length, eschews consideration of content, organization, style, or even mechanics. Arguably, when compared to human holistic quality, automated scoring and text length represent a narrower range of the writing construct, namely the range that relates to fluency and accuracy (Deane, 2013). Yet, in the current study, these measures displayed equivalent or greater classification accuracy than human holistic quality, even when holistic quality was measured with high levels of inter-rater reliability.
This may be explained in light of the specific assessment purpose for which the measures were being used: identifying students at risk of failing a state writing test. Struggling writers and students with disabilities commonly produce text which is shorter and more error-filled than their normally achieving and typically developing peers (Berninger, Nielsen, Abbott, Wijsman, & Raskind, 2008; Nelson & Van Meter, 2007; Scott & Windsor, 2000). This population of students evinces further impairments in other aspects of writing quality because lack of automaticity of lower-level writing skills increases cognitive load and decreases resources available for the development and expression of higher-level writing skills (Berninger & Swanson, 1994; Kellogg & Whiteford, 2009; McCutchen, 1988, 1996). Hence, fluency and accuracy function as threshold skills: below a certain point, fluency and accuracy map on well to the writing construct because broader dimensions of the construct are effectively inhibited from evaluation with product-based measures when students cannot produce enough text, or enough accurate text, to effectively communicate meaning to an audience for a specific purpose. In sum, the findings of the current study reinforce the salience of fluency and accuracy to the writing construct when the assessment purpose is identifying at-risk students.

4.2. Limitations and directions for future research

Study findings must be interpreted in light of the following limitations. First, study findings must be interpreted in light of the choice to use a state writing assessment as the criterion measure. Though such tests are subject to rigorous technical evaluation, they are not perfect measures of students' writing ability; indeed, an issue facing classification research more broadly is the lack of a gold-standard criterion measure of student writing (Elliot, 2005). In the current study, the sixth-grade state writing test included a single extended constructed-response item and 36 multiple-choice items. While the inclusion of the multiple-choice items likely improved score reliability (Godshalk, Swineford, & Coffman, 1966), multiple writing prompts often are required before a truly stable estimate of student writing ability is obtained (Graham et al., 2011a,b; Graham, Hebert, Sandbank, & Harris, 2014). Therefore, it is possible that some of the error attributed to the prediction models may actually be error in the criterion measure. This highlights the need to interpret prediction models within the context of the selected criterion measure. Future research that uses a similar design could employ different, more widely used accountability assessments, such as the NAEP writing assessments or the summative ELA assessments developed by SBAC and PARCC.
Second, the BWA is also subject to the limitations associated with using a single writing prompt as a measure of students' writing ability. Though multiple predictors were evaluated, with the exception of Prior Writing Achievement, they were measured from only a single sample of students' writing. While it is common practice to make instructional decisions based on scores from a single writing prompt, classification accuracy may have been improved if scores were averaged across multiple writing prompts. Future research should compare prediction models derived from a single writing sample to those from multiple writing samples. Analyses should consider whether classification accuracy is positively affected by (a) increasing the number of writing samples within a genre (i.e., varying topics), or (b) increasing the number of writing samples across genres. This would help to determine whether genre-specific screening models are necessary, and whether certain genres prove better for screening students at different grades.
Third, the BWA used a single AES system, PEG, to examine the accuracy of an AES score for identifying at-risk students.
However, AES systems use unique, often proprietary, methods for parsing and analyzing text. Variability in AES systems
means that the performance of a single system, such as PEG, cannot be used to validate the use of AES more broadly for

identifying at-risk students. While study results supporting the use of PEG are promising, future research is needed to
construct a broader validity argument for the use of AES for identifying students at-risk of failing state or national writing
assessments. An interesting study would be to compare multiple AES systems with regards to their AUC, sensitivity, and FP
rates when predicting passing/failing scores on a common criterion measure. Akin to the recent study by Shermis (2014),
such a study would improve stakeholders' ability to judge the utility of AES for supporting prevention-intervention efforts.
Fourth, all measures assessed the writing product; no measures of process, cognition, or affect were examined. In keeping with prior research on screening in other academic areas, like reading (Johnson et al., 2009), we elected to evaluate predictive models using measures of language production. Future research can compare the classification accuracy of measures assessing other aspects of the writing construct beyond what can be assessed via the writing product. It is possible that the combination of measures derived from these multiple sources may increase classification accuracy. However, it is also possible that gains in accuracy may not outweigh concomitant losses in efficiency and cost-effectiveness. Future research should weigh these costs and benefits.
Finally, despite the relatively strong classification accuracy of Prior Writing Achievement, PEG Overall Score, and the multivariate model with both predictors, each model generated FP rates that would be unfeasible for limited school intervention resources. Reducing FP rates without sacrificing sensitivity should be a priority for future research. One way of doing so may be to administer multiple writing prompts to increase score generalizability (Graham et al., 2014). Another promising method may be to employ a two-stage gated screening procedure (Compton et al., 2010; Fuchs, Fuchs, & Compton, 2012). This method involves first identifying all students who are clearly not at risk, as measured by performance surpassing a cutpoint on a predictor measure. Then, the remaining students, who may or may not be at risk, are administered further assessments to distinguish truly at-risk students from false positives.

5. Conclusion

There are a number of challenges to developing an accurate prediction model which identifies students at risk of failing state or national writing tests. Some of these challenges are associated with issues facing writing assessment more broadly: selecting measures which yield reliable scores; minimizing the time and financial costs associated with rater training, test administration, and scoring; and balancing the need to make valid inferences about students' writing ability with the availability of often only a single writing sample. However, the results of the current study illustrate an additional unique challenge: selecting a scoring method that maximizes classification accuracy. Given the consequences associated with failing accountability assessments, and the importance of identification as the foundation of prevention, it is important that research continues to elucidate and address these challenges.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for
educational and psychological testing. Washington, DC: American Educational Research Association.
Berninger, V. W., Abbott, R. D., Whitaker, D., Sylvester, L., & Nolen, S. B. (1995). Integrating low- and high-level skills in instructional protocols for writing
disabilities. Learning Disability Quarterly, 18, 293–309.
Berninger, V. W., Nielsen, K. H., Abbott, R. D., Wijsman, E., & Raskind, W. (2008). Writing problems in developmental dyslexia: Under-recognized and
under-treated. Journal of School Psychology, 46, 1–21.
Berninger, V. W., & Swanson, H. L. (1994). Modifying Hayes and Flower's model of skilled writing to explain beginning and developing writing. Advances in
Cognition and Educational Practice, 2, 57–81.
Bulkley, K. E., Nabors Olah, L., & Blanc, S. (2010). Introduction to the special issue on benchmarks for success? Interim assessments as a strategy for
educational improvement. Peabody Journal of Education, 85, 115–124.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer, & H. Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ:
Lawrence Erlbaum.
Coker, D. L., & Ritchey, K. D. (2014). Universal screening for writing risk in kindergarten. Assessment for Effective Intervention, 39, 245–256.
Common Core State Standards Initiative. (2010). Common core state standards for English language arts & literacy in history/social studies, science, and
technical subjects. Retrieved from http://www.corestandards.org/assets/CCSSI_ELA%20Standards.pdf.
Compton, D. L., Fuchs, D., Fuchs, L. S., Bouton, B., Gilbert, J. K., Barquero, L. A., et al. (2010). Selecting at-risk first-grade readers for early intervention:
Eliminating false positives and exploring the promise of a two-stage gated screening procedure. Journal of Educational Psychology, 102, 327–340.
Darling-Hammond, L. (2004). Standards, accountability, and school reform. Teachers College Record, 106, 1047–1085.
Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assessing Writing, 18, 7–24.
Decker, D. M., & Bolt, S. E. (2008). Challenges and opportunities for promoting student achievement through large-scale assessment results: Research,
reflections, and future directions. Assessment for Effective Intervention, 34, 43–51.
Elliot, N. (2005). On a scale: A social history of writing assessment in America. New York, NY: Peter Lang Publishing.
Espin, C. A., De La Paz, S., Scierka, B. J., & Roelofs, L. (2005). The relationship between curriculum-based measures in written expression and quality and
completeness of expository writing for middle school students. The Journal of Special Education, 38(4), 208–217.
Espin, C., Shin, J., Deno, S. L., Skare, S., Robinson, S., & Benner, B. (2000). Identifying indicators of written expression proficiency for middle school students.
The Journal of Special Education, 34(3), 140–153.
Fewster, S., & MacMillan, P. D. (2002). School-based evidence for the validity of curriculum-based measurement of reading and writing. Remedial and
Special Education, 23(3), 149–156.
Field, A. (2013). Discovering statistics using IBM SPSS statistics (4th ed.). Los Angeles, CA: Sage.
Fielding, A. H., & Bell, J. F. (1997). A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental
Conservation, 24, 39–49.
Figlio, D. N., & Getzler, L. S. (2002). Accountability, ability, and disability: Gaming the system? Cambridge, MA: National Bureau of Economic Research.
Flower, L. S., & Hayes, J. R. (1980). The dynamics of composing: Making plans and juggling constraints. In L. W. Gregg, & E. R. Steinberg (Eds.), Cognitive
processes in writing (pp. 3–29). Hillsdale, NJ: Lawrence Erlbaum Associates.
Foltz, P. W. (2014). Improving student writing through automated formative assessment: Practices and results. Paper presented at the International
Association for Educational Assessment (IAEA), Singapore.
Fuchs, D., Fuchs, L. S., & Compton, D. L. (2012). Smart RTI: A next-generation approach to multilevel prevention. Exceptional Children, 78(3), 263–279.
Gansle, K. A., Noell, G. H., VanDerHeyden, A. M., Slider, N. J., Hoffpauir, L. D., Whitmarsh, E. L., et al. (2004). An examination of the criterion validity and
sensitivity to brief intervention of alternate curriculum-based measures of writing skill. Psychology in the Schools, 41(3), 291–300.
Gansle, K. A., Noell, G. H., VanDerHeyden, A. M., Naquin, G. M., & Slider, N. J. (2002). Moving beyond total words written: The reliability, criterion validity,
and time cost of alternate measures for curriculum-based measurement in writing. School Psychology Review, 31, 477–497.
Godshalk, F. I., Swineford, F., & Coffman, W. E. (1966). The measurement of writing ability. New York, NY: College Entrance Examination Board.
Goertz, M., & Duffy, M. (2003). Mapping the landscape of high-stakes testing and accountability programs. Theory into Practice, 42, 4–11.
Graham, S., Harris, K. R., & Hebert, M. A. (2011). Informing writing: The benefits of formative assessment: A Carnegie Corporation Time to Act report.
Washington, DC: Alliance for Excellent Education.
Graham, S., Hebert, M., & Harris, K. R. (2011). Throw 'em out or make 'em better? State and district high-stakes writing assessments. Focus on Exceptional
Children, 44, 1–12.
Graham, S., Hebert, M., Sandbank, M. P., & Harris, K. R. (2014). Assessing the writing achievement of young struggling writers: Application of
generalizability theory. Learning Disability Quarterly, http://dx.doi.org/10.1177/0731948714555019 (Advance online publication)
Graham, S., & Perin, D. (2007). Writing next: Effective strategies to improve writing of adolescents in middle and high schools – A report to Carnegie Corporation
of New York. Washington, DC: Alliance for Excellent Education.
Hamilton, L. S., Stecher, B. M., Marsh, J. A., McCombs, J. S., Robyn, A., Russell, J. L., et al. (2007). Standards-based accountability under No Child Left Behind:
Experiences of teachers and administrators in three states. Santa Monica, CA: Rand Corporation.
Haney, W. (2000). The myth of the Texas miracle in education. Education Policy Analysis Archives, 8(41). Retrieved March 20, 2004, from
http://epaa.asu.edu/epaa/v8n41/.
Hanley, J. A., & McNeil, B. J. (1983). A method of comparing the areas under receiver operating characteristic curves derived from the same cases.
Radiology, 148, 839–843.
Hayes, J. R. (2012). Modeling and remodeling writing. Written Communication, 29, 369–388.
Heubert, J., & Hauser, R. (Eds.). (1999). High stakes: Testing for tracking, promotion, and graduation. A report of the National Research Council. Washington,
DC: National Academy Press.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (3rd ed.). Hoboken, NJ: John Wiley & Sons Inc.
Huot, B. (1990). Reliability, validity, and holistic scoring: What we know and what we need to know. College Composition and Communication, 41, 201–213.
Jenkins, J. R., Hudson, R. F., & Johnson, E. S. (2007). Screening for at-risk readers in a response to intervention framework. School Psychology Review, 36,
582–600.
Johnson, E. S., Jenkins, J. R., Petscher, Y., & Catts, H. W. (2009). How can we improve the accuracy of screening instruments? Learning Disabilities Research
and Practice, 24(4), 174–185.
Jones, G., Jones, B., Hardin, B., Chapman, L., Yarbrough, T., & Davis, M. (1999). The impact of high-stakes testing on teachers and students in North
Carolina. Phi Delta Kappan, 81(3), 199–203.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.
Keith, T. Z. (2003). Validity and automated essay scoring systems. In M. D. Shermis, & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary
perspective (pp. 147–167). Mahwah, NJ: Lawrence Erlbaum Associates Inc.
Kellogg, R. T., & Whiteford, A. P. (2009). Training advanced writing skills: The case for deliberate practice. Educational Psychologist, 44, 250–266.
Lopez, F. A., & Thompson, S. S. (2011). The relationship among measures of written expression using curriculum-based measurement and the Arizona
Instrument to Measure Skills (AIMS) at the middle school level. Reading and Writing Quarterly, 27, 129–152.
McCutchen, D. (1988). Functional automaticity in children's writing. Written Communication, 5, 306–324.
McCutchen, D. (1996). A capacity theory of writing: Working memory in composition. Educational Psychology Review, 8, 299–325.
McCutchen, D. (2011). From novice to expert: Implications of language skills and writing-relevant knowledge for memory during the development of
writing skill. Journal of Writing Research, 3, 51–68.
McMaster, K. L., & Campbell, H. (2008). New and existing curriculum-based writing measures: Technical features within and across grades. School
Psychology Review, 37, 550–566.
McMaster, K. L., Parker, D., & Jung, P. (2012). Use of curriculum-based measurement for beginning writers within a response to intervention framework.
Reading Psychology, 33, 190–216.
Meehl, P. E., & Rosen, A. (1955). Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores. Psychological Bulletin, 52(3),
194–216.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: American Council on Education and
Macmillan.
Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25,
6–20.
National Center for Education Statistics. (2012). The Nation's Report Card: Writing 2011 (NCES 2012-470). Washington, D.C.: Institute of Education Sciences,
U.S. Department of Education.
Nelson, N. W., & Van Meter, A. M. (2007). Measuring written language ability in narrative samples. Reading and Writing Quarterly, 23, 287–309.
Olinghouse, N. G., & Wilson, J. (2013). The relationship between vocabulary and writing quality in three genres. Reading and Writing, 26, 45–65.
Olinghouse, N. G., Zheng, J., & Morlock, L. (2012). State writing assessment: Inclusion of motivational factors in writing tasks. Reading and Writing
Quarterly, 28, 97–119.
Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 48, 238–243.
Page, E. B. (1994). Computer grading of student prose, using modern concepts and software. The Journal of Experimental Education, 62(2), 127–142.
Page, E. B. (2003). Project essay grade: PEG. In M. D. Shermis, & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 43–54).
Mahwah, NJ: Lawrence Erlbaum Associates Inc.
Page, E. B., Poggio, J. P., & Keith, T. Z. (1997). Computer analysis of student essays: Finding trait differences in student profile. Paper presented at the
annual meeting of the American Educational Research Association, March 1997, Chicago, IL.
Parker, R., Tindal, G., & Hasbrouck, J. (1991). Countable indices of writing quality: Their suitability for screening-eligibility decisions. Exceptionality, 2, 1–17.
Penny, J., Johnson, R. L., & Gordon, B. (2000). The effect of rating augmentation on inter-rater reliability: An empirical study of a holistic rubric. Assessing
Writing, 7, 143–164.
Perelman, L. (2012). Construct validity, length, score, and time in holistically graded writing assessments: The case against automated essay scoring (AES).
In C. Bazerman, C. Dean, K. Lunsford, S. Null, P. Rogers, A. Stansell, et al. (Eds.), New directions in international writing research (pp. 121–132).
Anderson, SC: Parlor Press.
Perelman, L. (2014). When the state of the art is counting words. Assessing Writing, 21, 104–111.
Persky, H. R., Daane, M. C., & Jin, Y. (2002). The Nation's Report Card: Writing 2002 (NCES 2003-529). Washington, D.C.: National Center for Education
Statistics, Institute of Education Sciences, U.S. Department of Education.
Perie, M., Marion, S., & Gong, B. (2009). Moving toward a comprehensive assessment system: A framework for considering interim assessments.
Educational Measurement: Issues and Practice, 28(3), 5–13.
Ritchey, K. D., & Coker, D. L. (2014). Identifying writing difficulties in first grade: An investigation of writing and reading measures. Learning Disabilities
Research and Practice, 29(2), 54–65.
Scott, C. M., & Windsor, J. (2000). General language performance measures in spoken and written narrative and expository discourse of school-age
children with language learning disabilities. Journal of Speech, Language, and Hearing Research, 43, 324–339.
Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing
Writing, 20, 53–76.
Shermis, M. D., Koch, C. M., Page, E. B., Keith, T. Z., & Harrington, S. (2002). Trait ratings for automated essay grading. Educational and Psychological
Measurement, 62, 5–18.
Shermis, M. D., Mzumara, H. R., Olson, J., & Harrington, S. (2001). On-line grading of student essays: PEG goes on the World Wide Web. Assessment and
Evaluation in Higher Education, 26(3), 247–259.
Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom. Pedagogies: An International Journal, 3, 22–36.
Wilson, B. J., & Reichmuth, M. (1985). Early-screening programs: When is predictive accuracy sufficient? Learning Disability Quarterly, 8, 182–188.
Zweig, M. H., & Campbell, G. (1993). Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clinical Chemistry,
39, 561–577.

Joshua Wilson, Ph.D. is an Assistant Professor in the School of Education at the University of Delaware. His research focuses on methods of assessing and
instructing struggling writers and on the application of automated essay scoring in formative assessment contexts.

Natalie G. Olinghouse, Ph.D. is an Associate Professor in the Neag School of Education at the University of Connecticut. Her research focuses on
writing instruction and assessment, individual differences in writing, the role of vocabulary in written composition, and analyzing the alignment
among writing standards and assessments.

D. Betsy McCoach, Ph.D. is Professor and Program Chair of the Measurement, Evaluation, and Assessment Program in the Neag School of Education at
the University of Connecticut. She has extensive experience in structural equation modeling, longitudinal data analysis, hierarchical linear modeling,
instrument design, and factor analysis.

Tanya Santangelo, Ph.D. is an Associate Professor of Special Education at Arcadia University. Her research focuses on the development and validation
of effective procedures for teaching academic and self-regulatory strategies to students who experience learning and/or behavioral difculties.

Gilbert N. Andrada, Ph.D. has been with the Connecticut State Department of Education for 21 years. In addition to having been the program manager
for the Connecticut Benchmark Assessment System (CBAS), his duties involve psychometric and statistical analyses, applied research projects, program
evaluations, and large-scale student assessment.
