Cawthon, S. (2011). Test item linguistic complexity and assessments for deaf students. American Annals of the Deaf, 156(3), 255–269.

TEST ITEM LINGUISTIC COMPLEXITY AND ASSESSMENTS FOR DEAF STUDENTS

STEPHANIE CAWTHON

Linguistic complexity of test items is one test format element that has been studied in the context of struggling readers and their participation in paper-and-pencil tests. The present article presents findings from an exploratory study on the potential relationship between linguistic complexity and test performance for deaf readers. A total of 64 students completed 52 multiple-choice items, 32 in mathematics and 20 in reading. These items were coded for linguistic complexity components of vocabulary, syntax, and discourse. Mathematics items had higher linguistic complexity ratings than reading items, but there were no significant relationships between item linguistic complexity scores and student performance on the test items. The discussion addresses issues related to the subject area, student proficiency levels in the test content, factors to look for in determining a linguistic complexity effect, and areas for further research in test item development and deaf students.

CAWTHON IS AN ASSISTANT PROFESSOR, SCHOOL PSYCHOLOGY PROGRAM, DEPARTMENT OF EDUCATIONAL PSYCHOLOGY, UNIVERSITY OF TEXAS, AUSTIN.

Under current accountability reform frameworks such as those guided by the No Child Left Behind Act of 2001 (NCLB), evaluation of student knowledge and skill is almost exclusively conducted via large-scale, standardized assessments. Because of the accountability mechanisms (i.e., financial and programmatic sanctions) that follow poor student performance on assessments, a great deal of emphasis is placed on how to best measure proficiency in students who have not traditionally seen academic success. In almost all cases, these accountability assessments are in written English. Proficiency in reading is thus a gateway skill to accessing the content of state assessments. For students who are not grade-level English users, including many deaf students, the written-English format of the assessment may be a barrier to demonstrating what they know. The challenge in accessing test content may compound other difficulties in measuring achievement when students are several years behind grade level. Deaf¹ students, on average, show significant academic delays in all subject areas (Mitchell, 2008). Many of the early complaints about the high-stakes testing under NCLB concerned the requirement that states use grade-level tests even for students whose instruction was 2 or more years behind the

VOLUME 156, NO. 3, 2011

AMERICAN ANNALS OF THE DEAF

test (Cawthon, 2007). In response to these discrepancies, the most recent iterations of the Individuals With Disabilities Education Act and NCLB include guidance on the development of assessments for students whose disability results in challenges reaching grade-level achievement (U.S. Department of Education, 2007a, 2007b). These assessments typically include some level of modification of the test item format and, at times, test content. Due to issues of test score validity, particularly when modified assessment scores are used as part of accountability reform initiatives, the psychometric implications of modifying test items face great scrutiny and prompt greater caution (Lazarus, Thurlow, Lail, Eisenbraun, & Kato, 2006).

Little research has been conducted on the effects of test item modifications on deaf students' academic achievement. In the present article, I first summarize some of the policies related to test item modification, the research literature on linguistic complexity, and literacy development in deaf students. I then present the findings from an exploratory study on the relationship between test item linguistic complexity and deaf student performance on mathematics and reading test items. This article concludes with a discussion of issues raised in this analysis, including areas for further research and implications for assessment development.

Policy Context
Because of the reliance on written English in standardized assessments, students with disabilities or those who are English Language Learners (ELL) often face challenges in accessing the content of tests used in accountability reforms (Abedi, 2002; Phillips, 1994). Previous policy-guided research in this area includes analysis of assessment strategies such as the use of accommodations (e.g., dictionaries or glossaries, or the provision of extra time), translated assessments (i.e., in the student's native language), and modified assessments (Elliott & Roach, 2007; Lazarus et al., 2006; Sireci, Scarpati, & Li, 2005). In a further expansion of the state assessment systems used for NCLB, the U.S. Department of Education now allows states to develop two versions of alternate assessments: Alternate Assessments Based on Modified Academic Standards (AA-MAS) and Alternate Assessments Based on Alternate Academic Standards (AA-AAS). The AA-MAS targets the content areas of the regular standards, but in a more accessible and sometimes less rigorous format (Cawthon, Leppo, & Highley, in press). It is this set of assessments that draws upon the test item development literature on increasing the accessibility of test item format without reducing the rigor of item content.

Linguistic Complexity of Test Items
Research on test item development indicates that choices about item design can improve the quality of test items (e.g., Haladyna, Downing, & Rodriguez, 2002; Kettler, Elliott, & Beddow, 2009; Rodriguez, 2005). Whereas some development and review criteria focus on test item content bias, others relate to the way the item itself is presented (Lollis & LaSasso, 2009). For test item format, specific attention is often paid to the level of the language used in multiple-choice item question stems and responses, also referred to as its linguistic complexity (Abedi, Courtney, & Goldberg, 2000; Beddow, Kettler, & Elliott, 2008). Test items with high levels of linguistic complexity have unfamiliar vocabulary, words with multiple meanings, passive sentence structures, embedded clauses, and other features that are difficult for non-native English users. Writing about item development in South Carolina, Foster (2008) describes how a team of individuals reviews the language of each item to evaluate its potential challenge to deaf students. Her example of a problematic item contains the word "state," a word with multiple meanings: "According to the article, why did the author state, 'It was a lean mean machine'?" (p. 119).
The key word "state" could be replaced with "say" and still test the intended construct, the meaning of the phrase "lean, mean machine." In other words, the content standard being measured is not changed with a less linguistically complex question. When syntax is being considered, direct, active-tense sentences with clear referents are easier for students to understand than those with passive voice and relative clauses that separate the subject from the action of the sentence. Taken together, test item linguistic features can contribute to an item's level of accessibility for students who are not reading at grade level.

Thus far, the effects of language demand have been studied primarily in the context of ELL and these learners' participation in mathematics assessments (e.g., Abedi & Herji, 2004; Kopriva, 2000). ELL have been shown to benefit from changes to test item formats that reduce levels of linguistic complexity, including simplified syntax and vocabulary (Abedi, Courtney, Mirocha, Leon, & Goldberg, 2005). The strength of the effect, however, varies across different studies and may depend on student characteristics, test subjects, and the level of classroom instruction (Cawthon et al., in press). For example, Sato, Rabinowitz, Gallagher, and Huang (2010) examined the linguistic complexity of mathematics items and compared the performance of seventh- and eighth-grade students across levels of English proficiency (ranging from not proficient to proficient English-language users). Sato and colleagues found a consistent trend across all groups of better performance on mathematics content test items with lower linguistic complexity than items with higher linguistic complexity. This difference was most striking for students who were ELL, in comparison with students who were no longer considered ELL but were not yet fully proficient in English language arts or students who were fully English proficient. Reduction in linguistic complexity appears to give a specific benefit to students who are not yet proficient in English.

As noted above, linguistic complexity includes features such as vocabulary and syntax. It may be that specific components of language contribute to the effects more than others. Shaftel, Belton-Kocher, Glassnap, and Poggio (2006) looked at the linguistic components of typical test questions and whether they affected students' scores. In the study, students were from a variety of backgrounds; the sample included students with disabilities (SWD), ELL, and students without either disabilities or ELL status. The researchers found that language, specifically mathematics vocabulary, affected item difficulty strongly in 4th grade (moderate effect) and had lesser effects on item difficulty in 10th grade. Additionally, they pointed out that mathematics items have linguistic components that may be difficult for all students regardless of SWD or ELL status. Most importantly, ambiguous or multiple-meaning words increased item difficulty for all students.

Implications for Deaf Students
Before findings about test item linguistic complexity can be generalized from students who are ELL to deaf students, it is important to further study how test format features affect students within the latter population, specifically (American Educational Research Association, American Psychological Association, & National Council on Educational Measurement, 1999). In their analysis of potential bias and adequacy in the development of tests for deaf students taking a high school graduation exam, Lollis and LaSasso (2009) review previous literature on linguistic structures and students who are deaf, much of it from the 1970s and early 1980s. Their summary of test construction bias raised critical questions about how test developers consider the needs of students whose first language is not English, including deaf students.

Much of the discussion about deaf students and access to standardized assessments centers on literacy development and academic achievement. The educational implications of prelingual or early childhood hearing loss can be significant if sustained interventions are not in place to give students access to a robust language model such as American Sign Language (ASL) or speech via amplification (e.g., Moores, 2009; Nussbaum, La Porta, & Hinger, 2003; Yoshinaga-Itano & Gravel, 2001). Even with early intervention, educational institutions historically have struggled to provide deaf students with opportunities for academic success (Harris & Bamford, 2001; Mutua & Elhoweris, 2002; Traxler, 2000). Part of the struggle has been in literacy development, which is often delayed. The three features of test item linguistic complexity targeted in the present study were vocabulary, syntax, and discourse. In order to have a sense of where test item features might interact with deaf students' language and literacy development, I review some of the findings in these three areas.

Although all three contribute to reading comprehension, there are some distinctions in how they contribute to literacy achievement in deaf students.

Vocabulary, or semantics, is a core component of language development and is one of the earliest parts of language that children acquire (Bloom, 1991). Within the first few years of life, children begin to connect names to individual objects and, later, to groups of objects (Golinkoff et al., 2000). For hearing children, word acquisition builds upon acquisition of phonology and morphology, or the sounds and units of speech that make up English. The extent to which deaf children use phonology and morphology in early vocabulary acquisition is a debated topic (see Mayberry, del Guidice, & Lieberman, 2011, for a meta-analysis), but the ultimate goal for deaf readers is to be able to decode English, or to read and identify words in print. Mitchell (2008) provides summaries of two recent studies of vocabulary achievement with deaf students, first with the Woodcock-Johnson III (WJ) and second with the Stanford Achievement Test (SAT, 10th ed.). The advantage of these findings is that they are normed data and provide a sense of achievement relative to the general population. For the WJ data on letter-word identification, over 60% of students with hearing impairments were in the first quartile, or in the lowest 25% of scores on the assessment. For the SAT, nearly 90% of deaf students were in the first quartile on a reading vocabulary assessment. Although the test samples were not exactly the same in their representativeness of the deaf student population, both sets of findings point toward a low level of achievement in English vocabulary relative to deaf students' hearing peers.

The second two features of linguistic complexity are syntax and discourse; taken together, they affect readers' fluency and, ultimately, their reading comprehension. Syntax is in itself multifaceted and refers to the rules by which language provides meaning when two or more words are linked in a sentence or passage. In essence, syntax consists of the structure of a language. For example, English morphology markers provide additional meaning to a word, including plurality (e.g., adding "s" at the end of a noun to indicate more than one object) or verb tense (e.g., "ed" at the end of a verb root to indicate past tense). Syntax markers for English and for ASL are distinct due to the differences between marking with sound and marking within a physical space. If one cannot hear these English markers, it is still necessary to learn to decipher their meaning when reading English print through direct instruction (Schirmer, 2000). Deaf students may have difficulty with reading fluency and comprehension due both to (a) reduced levels of knowledge about syntax and (b) the increased cognitive processing demand in tasks such as decoding unfamiliar vocabulary, which leave less working memory available for integrating meaning across words and sentences (Kelly, 1996; Paul, 1998). Depending on the model of literacy proposed, syntax may play a direct and/or indirect role in reading achievement; regardless of the approach, deaf students appear to struggle with English syntax.

The discourse component of test item linguistic complexity most closely parallels how readers integrate meaning across sentences, a syntactic feature of the task of reading a passage (as opposed to reading a sentence). Processing beyond a single sentence, such as in the question stem of a test item that has several steps linked together, involves metacognitive skills such as monitoring comprehension and making inferences about how sentences fit together to convey a larger message.
In a sense, the discourse component of linguistic complexity focuses on the broader processes of reading comprehension when there is more than one sentence involved in the question prompt. (This is different from the line of research on discourse genres and how deaf students express ideas in response to different stimuli, such as that found in Schick, 1997.) Deaf students show lower levels of metacognition on both reading and problem-solving tasks (Spencer & Marschark, 2010); furthermore, deaf students who are good readers show strengths in using metacognitive strategies (Gibbs, 1989). Data summarized in Mitchell (2008) emphasize how few deaf students are strong in reading comprehension tasks. Reading comprehension subtests on the WJ and the SAT have shown a large proportion of deaf students, over 60% for the WJ and over 80% for the SAT, in the lowest quartile of students in those areas.

Purpose of the Study
Deaf students with delayed literacy skills and lower levels of content area proficiency may face obstacles to accessing the content of a test item (written in English) due to high levels of linguistic complexity found on some standardized assessments. The purpose of the present study was to explore the potential relationship between student performance on an assessment and the linguistic complexity of the multiple-choice test items. The study was designed to answer the following research question: What is the relationship between test item linguistic complexity and test item performance for deaf students? I included both mathematics and reading content subject areas in this analysis to see if there was a differential effect of linguistic complexity between the two areas. The context of this analysis was a larger study on the effects of accommodations on student performance on standardized assessments (Cawthon, Winton, Garberoglio, & Gobble, 2011). The larger study focused mainly on student and accommodations factors; the present article complements that work by exploring the potential contribution of test item linguistic complexity to performance on mathematics and reading multiple-choice items.

Method

Sample
One of the greatest challenges in research with deaf students is the low incidence of the disability. Consequently, my fellow researchers and I recruited participants from six schools for deaf students across the United States. When teachers and administrators sought clarification on who would be a suitable participant, we provided them with sample items and an example of the test format and asked them to judge whether a potential participant would be able to complete the research study tasks. Students with a severe to profound hearing loss but without disabilities that required additional test accommodations were included in the sample. In the end there was a total of 64 students, 29 boys and 35 girls, who were enrolled in fifth through eighth grades (and ranged in age from 10 to 15 years).

Students also completed a standardized measure of reading and mathematics proficiency, the Iowa Test of Basic Skills, or ITBS (Hoover et al., 2001). The ITBS reading test (parts 1 and 2) is a widely used assessment of a student's proficiency in vocabulary and reading comprehension. For mathematics we used the ITBS mathematics test, parts 1 and 3 (Hoover et al., 2001), which included problem-solving and computation sections. Student scores in both reading and mathematics were converted into grade equivalencies based on norms provided by the ITBS (Cawthon et al., 2011). In the sample, students had grade equivalencies in reading ranging from 1.5 to 6.6 (M = 3.6). The overall ITBS mathematics grade equivalency ranged from 1.9 to 7.6 (M = 4.2). As a point of reference, the test items were selected to target fifth- and sixth-grade reading and mathematics skills. While there was a strong correlation between grade level performance on the ITBS reading and mathematics assessments (r = .774, p < .01), there was not a significant relationship between academic grade level (i.e., grade enrolled at school) and the ITBS grade level performance in either reading (r = .193, ns) or mathematics (r = .125, ns).

Finally, we asked students to share with us information about their experiences using language in both academic and social settings (Cawthon et al., 2011). While we did not ask the students to complete an ASL proficiency measure, we did gather information about the length of time they had used ASL, their preferences for language use across different settings, and whether they had family members who were Deaf. For half of the students, ASL was the first language; for those students who had acquired ASL as a second language, the average length of time they had been learning and using ASL was 5.4 years (SD = 3.3 years). Because our population was enrolled in the fifth through eighth grades, this is roughly parallel to the length of time many of these students had been in elementary school. Students predominantly received instruction in ASL (91%), with Signing Exact English (SEE) following with 20% and spoken English with 16%. (Students could respond with more than one language mode.) When asked what language they used at home, only 53% of students indicated ASL; 16% indicated SEE and 50% spoken language (47% English and 3% Spanish).

Test Items
Mathematics and reading tests were based on released practice problems from the fifth- and sixth-grade 2003 and 2004 Texas state assessments. There were a total of 32 mathematics items and 20 reading items. For mathematics, the questions were word problems that focused on a range of concepts such as number properties, proportions, and geometric properties. For reading, the passages were three to four paragraphs long and covered topics such as how scientists monitor penguins in the Arctic, panda bears visiting the United States, training elephants in Africa, and weaving lace in Paraguay. Two example items, one in mathematics and one in reading, are shown in Appendix A.

Linguistic Complexity Coding
A measure of the linguistic complexity of each test item (inclusive of test item stem and responses) was included in the model (Lockhart, Beretvas, Cawthon, & Kaye, 2011). In accordance with previous research (Abedi, 2006; Abedi & Herji, 2004; Ketterlin-Geller, Yovanoff, & Tindal, 2007), linguistic complexity was assessed on the basis of three overarching features: vocabulary, syntax, and discourse. In this analysis, we adapted the coding framework put forward by Abedi and colleagues (2005). The original approach by Abedi and colleagues was developed with the use of a multistep review of potential language demands in test items and definitions of key terms identified as linguistically complex. In our adapted approach, we used the same categories of linguistic demand and the criteria these categories used to identify potentially challenging areas (see Appendix B). In our coding scheme, each feature was defined and coded as described in the following sections for each item. (The codes for examples in Appendix A are provided in Appendix C.)

Vocabulary
We counted the number of complex vocabulary words in each text unit. Complex vocabulary items were defined as words with multiple meanings, nonliteral usage, and manipulation of lexical forms (Martinello, 2008). For example, if the item included the word "plane," this word would be counted as a complex vocabulary word because it has different meanings depending on the context of the sentence. Content words in mathematics, such as "parallelogram" or other vocabulary that was relevant to the problem, were not included in the vocabulary count even if they might otherwise have multiple meanings. Similarly, words that were introduced in the reading passages as target concepts were not included in the complex vocabulary count. If there were no complex words, the vocabulary complexity was given a rating of 0; if one to two words, the score was 1; if three to four words, the score was 2; an item with five or more complex words received a score of 3 for vocabulary.

Syntax
The syntax score is a composite of a checklist of the following items: atypical parts of speech, uncommon syntactic structures, complex syntax, academic syntactic form, long nominals, conditional clauses, relative clauses, and complex questions. The presence or absence of passive voice also was included in the syntax score. An item received 1 point for each syntax element present. The minimum raw syntax score that could be given was 0; there was no maximum score. Raw scores were converted into a syntax score. Raw score values of 0, 1, and 2 were converted into syntax scores with the same values; raw score values at 3 or higher received a 3 for the syntax score.

Discourse
Complex discourse was defined in the present study as uncommon genre, the need for multiclausal processing, or the use of academic language (Abedi, Lord, & Plummer, 1997). Item discourse was also considered complex if students were required to synthesize information across sentences or to make clausal connections between concepts and sentences. Discourse was coded as a discrete variable (1 or 0) based on the presence or absence of one or more of these features.

Using the linguistic complexity rubric, two raters convened to discuss and rate the test items used in the project. The raters first took part in a training session using examples from released state assessment items. The training period ended once the raters reached at least a 90% agreement rate. They then independently rated the project items. For reading, initial agreement was 91% for vocabulary, 64% for syntax, and 100% for discourse. For mathematics, initial agreement was 84% for vocabulary, 81% for syntax, and 100% for discourse. When there was a disagreement, we took the average of the two ratings. In some cases, this resulted in a score with a .5, such as the average of 1 and 2 equaling 1.5. Abedi and colleagues (2005) give three levels of classification for the scores: not complex (0), moderately complex (1–2), and very complex (3 or higher). We therefore adapted the complexity ranges to accommodate our half-point scores, so that total score classifications were as follows: less than 1 = not complex, scores ranging from 1 to less than 3 = moderately complex, and scores 3 and above = very complex.

Results
The first analysis focused on the linguistic complexity scores for test items, with a specific emphasis on the test subject areas, reading and mathematics. In many respects, reading is typically thought to be a more linguistically demanding subject area for a test taker than mathematics. In the present study, however, the mathematics items were word problems, not computation items. It was relevant, therefore, to verify whether linguistic complexity scores for reading and mathematics varied in the sample. The average linguistic complexity score for reading items was M = 2.24 (SD = 1.47), whereas the average linguistic complexity score for mathematics items was M = 2.98 (SD = 1.67), a significant difference (t = 2.07, p < .05). This is perhaps opposite to the direction one would expect, and implications are addressed in the discussion section.

Using Pearson correlations, we analyzed the relationship between an item's linguistic complexity score (and subscores) and the proportion of students who answered the item correctly. Results of this analysis for both reading and mathematics are shown in Table 1. The results show small relationships between an item's linguistic complexity scores (and subscores) and the difficulty of the test item (operationalized as the proportion of students who answered the item correctly). The strongest relationship is for the syntax rating for mathematics items, with a negative correlation at r = -.22, meaning that the higher the syntax rating, the less likely students were to answer the item correctly. This is in the expected direction, given the literature on students who are ELL and the effects of higher levels of linguistic complexity, but overall, these are very small effect sizes.

We also disaggregated student performance on items by level of complexity (not complex, moderately complex, and highly complex) and subject area (see Table 2). The samples for the not complex categories for both mathematics and reading were too small to permit chi-square analyses, so these figures are presented only as descriptive findings. There does not appear to be a pattern that illustrates a relationship between the level of complexity and the likelihood that students would answer the items correctly. However, the standard deviations for the reading items do appear to be higher (again, descriptively) than for the mathematics items. This implies that while the average performances on reading and mathematics items were roughly similar, the range of performance across the sample was broader for reading than for mathematics.

Discussion
The present study represents a very early look at the potential relationship between test item linguistic complexity and the likelihood that deaf test takers would answer the items correctly. The results suggest that students performed relatively similarly across the items, with little variation between different levels of linguistic complexity of the test items.

Limitations of the Study


The present study had many significant limitations that affected our ability to draw meaningful conclusions from the data. The first limitation is that we had a very restricted set of information about the students and their characteristics. The small amount of information was due to a number

Table 1
Correlation of Item Linguistic Complexity Scores and Item Difficulty

Subject                   Linguistic complexity component     r
Reading (20 items)        Syntax                              .06
                          Vocabulary                          .03
                          Discourse                           .12
                          Total                               .09
Mathematics (32 items)    Syntax                             -.22
                          Vocabulary                          .05
                          Discourse                           .12
                          Total                              -.12

Note. r = correlation of linguistic complexity score with percentage of students who answered item correctly.
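
The correlations in Table 1 relate each item's linguistic complexity score to the proportion of students who answered that item correctly. A minimal sketch of this item-level calculation follows, assuming the standard Pearson product-moment formula. The function name `pearson_r` and the five example items are hypothetical illustrations; they are not the study's data and will not reproduce the values in Table 1.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical items (made up for illustration): total complexity score
# and the proportion of students who answered each item correctly.
complexity = [0.0, 1.5, 2.0, 3.0, 4.5]
proportion_correct = [0.55, 0.48, 0.50, 0.41, 0.44]

r = pearson_r(complexity, proportion_correct)
print(f"r = {r:.2f}")
```

A negative value of `r` here would correspond to the pattern reported for mathematics syntax ratings: items with higher complexity scores being answered correctly by a smaller share of students.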

Table 2
Linguistic Complexity Levels and Student Performance

Subject        Linguistic complexity level               Percentage of students who answered item correctly, M (SD)
Reading        Not complex (< 1); n = 1                  49% (n/a)
               Moderately complex (1 to < 3); n = 11     42% (45%)
               Highly complex (3 and higher); n = 6      43% (52%)
Mathematics    Not complex (< 1); n = 4                  44% (28%)
               Moderately complex (1 to < 3); n = 8      52% (24%)
               Highly complex (3 and higher); n = 20     46% (17%)
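
The levels in Table 2 follow from the coding rules given in the Method section. The sketch below restates those rules in Python under stated assumptions: all function names are hypothetical, rater agreement is taken to mean identical scores on an item (the article does not define it), and the total score is passed in directly because the article does not spell out how component scores were combined.

```python
def vocabulary_score(complex_word_count):
    """Map a count of complex vocabulary words to the 0-3 vocabulary score."""
    if complex_word_count < 0:
        raise ValueError("count cannot be negative")
    if complex_word_count == 0:
        return 0
    if complex_word_count <= 2:
        return 1
    if complex_word_count <= 4:
        return 2
    return 3  # five or more complex words


def syntax_score(raw_points):
    """Cap the raw syntax checklist total at 3; 0, 1, and 2 are unchanged."""
    if raw_points < 0:
        raise ValueError("raw score cannot be negative")
    return min(raw_points, 3)


def percent_agreement(ratings_a, ratings_b):
    """Percentage of items on which two raters gave identical scores (assumed definition)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


def resolve(score_a, score_b):
    """Average two raters' scores; a disagreement can yield a half point."""
    return (score_a + score_b) / 2


def complexity_level(total_score):
    """Classify a total score using the adapted half-point ranges."""
    if total_score < 1:
        return "not complex"
    if total_score < 3:
        return "moderately complex"
    return "highly complex"


# Example: raters disagree (2 vs. 3), so the averaged score of 2.5 is classified.
print(complexity_level(resolve(2, 3)))
```

Keeping each rule in its own small function mirrors the way the rubric is described: component scores are assigned first, disagreements are averaged next, and only the resulting total is classified.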

of logistical challenges we faced in conducting the study. There were challenges raised by requests for anonymity from school sites that limited the range of information we could gather about each student and still allow them to remain unidentifiable to project staff. There were also limitations in allocated research time, with the study time needing to be balanced with instructional time and significant levels of testing for school and state accountability programs. Adding additional measures or asking for file reviews by school staff was not feasible under the time constraints that were in place. This said, the reduced demographic information resulted in only a very general idea of who the participating students were and how their individual characteristics may have influenced their approach to the test tasks. For example, we know that students at these sites tend to have severe to profound hearing loss, and that about half of the students came from families in which ASL was used at home. However, more precise information about individual students' degree of hearing impairment, use of a cochlear implant, or age at onset would provide a sense of their exposure to English, and how their background may have interacted with their processing of the different language characteristics (e.g., vocabulary vs. syntax) of the passage.

A second limitation is the technical properties of the adapted linguistic complexity scale. The findings of the present study could be conceptualized as a part of the larger process of providing evidence for or against the external validity of the low, moderate, and high ratings for linguistic complexity. However, these categories have not been compared with other measures of item difficulty or levels of complexity that take similar approaches to item analysis. For example, Abedi (2006) has a holistic rating scale that gathers information about the relative strength and weakness of an item. Instead of counting points for where there are components of linguistic complexity, the holistic scale asks raters to score each item on a scale from 1 to 5, with higher scores representing higher levels of linguistic complexity. The scale provides sample features associated with each score level, with all of the sample features relating to the language demands made by the item. For instance, for a score of 3, which identifies a weak item, the scale includes features such as relatively unfamiliar or seldom used words, long sentences, abstract concepts, complex sentence/conditional tense/adverbial clause, and a few passive voice or abstract or impersonal presentations. The holistic nature of this scale allows raters to attend to the overall gestalt of the item but also requires that the raters have experience with students who are ELL and a solid understanding of the difficulties involved in learning a new language. The measure used in the present study may not capture the features of an inaccessible item, or it may not have sufficient range to detect meaningful differences in effects on test takers.

Findings and Implications


The first plausible explanation of the study results is that the test content represented a high degree of difficulty, high enough that issues related to the linguistic structure of the test item did not have a significant enough impact on the students' interaction with the item content. Overall, students answered the items correctly slightly less than half the time. This represents a relatively low level of proficiency in the subject matter, at least for some students in the sample. It should be recalled that the range of performance for reading was broader than for mathematics, an indication that in reading there were students whose proficiency levels were more spread apart than they were for mathematics. It may be that, on average, the level of proficiency in the subject areas was not high enough to evoke a linguistic complexity effect. In other words, do students need to have sufficient command of the material before something as nuanced as linguistic complexity makes a measurable difference in student performance? Or perhaps do students with strong academic backgrounds also have stronger reading skills, with a consequence being that the linguistic complexity effect is more likely to occur in lower-performing students? These floor and ceiling effects may result in similar outcomes for students at the two ends of the spectrum, albeit for different reasons. Further investigation might focus on the students in the upper and lower portions of the range. For example, research that includes cognitive lab protocols, with an active measure of the process students go through when answering test items, would shed light on the plausibility of these hypotheses.

On the other hand, it is also possible that the linguistic complexity of the items was sufficiently low not to interfere with how the students accessed the test content. How complex do items need to be before they are a challenge (beyond the test content)? The complexity scores ranged from 0 to 6 in the study sample, with half of the items (n = 26) being what Abedi and colleagues (2005) have reported to be classified as highly linguistically complex. That said, there is no research that validates the linguistic complexity scale with deaf students. It may be that the complex elements (e.g., vocabulary words in the passage for reading comprehension items) in this assessment were not those that might pose a barrier to students' access to the content of the test item (Lollis & LaSasso, 2009). There may have been additional features outside the realm of the linguistic complexity scale that contributed to construct-irrelevant variance.

The subject area of an assessment (in the present article, mathematics vs. reading) is a central theme in the accommodations and test modification literature. Efforts to increase access to test content through accommodations can sometimes pose potential risks to the validity of the test scores (Elliott, McKevitt, & Kettler, 2002). This is particularly true when the accommodation removes part of the task that the test is trying to measure (Crawford & Tindal, 2004; Fletcher et al., 2006). Take, for example, a mathematics test in which the goal is to measure how well a student knows her multiplication tables. If the student uses a calculator as an accommodation on that assessment, the test score no longer represents her ability to multiply numbers, but, rather, her ability to use a calculator. For reading, the read-aloud accommodation is often seen as a threat to the validity of a reading assessment because the student is no longer directly decoding the text, a skill that some see as integral to the purpose of the test. The question thus becomes, Is reducing linguistic complexity a similar concern on a reading assessment? A response to this question must consider whether the format of an item on a reading comprehension test is part of the targeted skill in the assessment. It is also likely to depend on what aspect of linguistic complexity contributes to that higher score. For example, it is difficult for any item to receive a 0 on a discourse component of the linguistic complexity rating if the test item presents two or more ideas for a student to connect together. Two sentences are often required for test items that are asking about the relationship between two ideas or about causal connections such as in an if . . . then sequence.
This skill would appear to be needed for the test items on this assessment that were measuring a student's ability to make inferences from the text. On the other hand, some of the reading assessment items focused on identifying key information or the meaning of specific terms used in the text. Because content area vocabulary was left out of the linguistic complexity ratings, using single-meaning words would appear to be a reasonable adjustment to reduce the linguistic load for students, even within a reading assessment. Therefore, the extent to which linguistic complexity is an integral component of the target skill depends on the component of linguistic complexity and the purpose of the test item.

Significant in the present study is the finding that the test items in mathematics had a higher linguistic complexity than those in reading. The implication of this result is tied to the kind of knowledge students must bring to the test process versus what is presented in the test itself. Reading comprehension items followed a reading passage, whereas the mathematics test items were stand-alone concepts. In reading, students were therefore making inferences from an identifiable, presented knowledge base (i.e., the passage). Test items may have required students to infer meaning or intent, a skill they certainly would have to varying degrees, but the availability of the information to make those inferences was standardized across participants. In contrast, a student's knowledge of how to interpret a mathematics word problem resided primarily within the student (and his or her previous experiences with the content). The mathematics word problems were opportunities for students to apply strategies they had learned to do particular kinds of inquiry (e.g., transforming objects across a grid, solving for a missing variable). Gaining access to the test item content is thus a very different process for reading comprehension items and mathematics word problems. Due to the different cognitive tasks required, levels of linguistic complexity for reading comprehension test items may have different effects on student performance than a similar level of linguistic complexity for mathematics items. How the item content is presented, including the linguistic complexity level, would appear to be even more critical when the student was connecting the demand of the test item with his or her own knowledge base without the additional support of a reading passage. In short, the question of access to test item content, with potential barriers within the test item format, needs to take into account the type of cognitive skills that are measured by the standardized assessment format in the content area (a finding echoed by Ansell & Pagliaro, 2006). Although our results did not show a difference in student performance between mathematics and reading, future research will need to investigate this finding on a larger scale, using larger item pools and controlling for item task cognitive difficulty while manipulating the level of linguistic complexity for the item pair(s).

Conclusion

When considering the potential impact of policies that guide alternate assessments and test item modifications for students with disabilities, it is important to identify which aspects of the test format are revised from the standard version. The assumption behind changing language components such as the syntactic structure or level of vocabulary is that students will find the simplified version easier to understand. In the definition of alternate assessments, modified assessments, and related special frameworks for testing students with significant cognitive disabilities, we have moved perhaps a step away from the concept of universal design of assessments that arose 10 or 15 years ago (Center for Universal Design, 1997; Thompson, Johnstone, & Thurlow, 2002). The continuing evolution of large-scale assessments to create different formats for eligible populations (i.e., for the three percent) places a great deal of emphasis on the specific match between student characteristic and test format (Elliott & Roach, 2007; Weigert, 2009; Zigmond & Kloo, 2009). However, especially for students from heterogeneous populations such as deaf students, this match is very difficult and can lead to inconsistencies in assessment practices across schools, districts, and states.

Issues of linguistic complexity, if they do arise as pertinent for deaf students, may be best addressed through assessments that look at comprehensive access features (instead of items with only modified language components). For example, researchers from Vanderbilt University (e.g., the Consortium for Alternate Assessment Validity and Experimental Studies, or CAAVES) take an aggregate approach to item modification. In this approach, a range of characteristics is evaluated as an overall accessibility construct. More specifically, in the CAAVES Accessibility Rating Matrix, characteristics such as the format of the test item or the use of graphics are summed in an aggregate manner to measure the extent to which access has been increased for students under the modified test condition (Beddow et al., 2008). By taking into account both the visual and linguistic features of a test item, test developers may be able to create assessments that allow deaf students to access test items using multiple representations of information not solely based on English text.

Note

1. The definition of Deaf or hard of hearing varies by multiple factors including hearing threshold and cultural identity. Deaf or hard of hearing may include people who are culturally Deaf, individuals who identify as audiologically deaf, sign language users, those with cochlear implants, those who wear hearing aids, and those who use a range of communication styles in a variety of settings. The term deaf is used in the present article to refer generally to students who have a severe to profound hearing loss but does not specify other characteristics such as whether they participate in the Deaf community or if they may be categorized demographically as hard of hearing.

References

Abedi, J. (2002). Standardized achievement tests and English Language Learners: Psychometrics issues. Educational Assessment, 8, 231–257.
Abedi, J. (2006). Language issues in item development. In S. Downing & T. Haladyna (Eds.), Handbook of test development (pp. 377–399). Rahway, NJ: Erlbaum.
Abedi, J., Bailey, A., Butler, F., Castellon-Wellington, M., Leon, S., & Mirocha, J. (2005). The validity of administering large-scale content assessments to English Language Learners: An investigation from three perspectives (CSE Report No. 663). Los Angeles, CA: Center for Research on Evaluation Standards and Student Testing.
Abedi, J., Courtney, M., & Goldberg, J. (2000). Language modification of reading, science, and mathematics test items. Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing.
Abedi, J., Courtney, M., Mirocha, J., Leon, S., & Goldberg, J. (2005). Language accommodations for English Language Learners in large-scale assessments: Bilingual dictionaries and linguistic modification. Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing.
Abedi, J., & Herji, F. (2004). Accommodations for students with limited English proficiency in the National Assessment of Educational Progress. Applied Measurement in Education, 17(4), 371–392.
Abedi, J., Lord, C., & Plummer, J. (1997). Final report of language background as a variable in NAEP mathematics performance (CRESST Technical Report No. 429). Los Angeles, CA: Center for Research on Evaluation Standards and Student Testing.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing (3rd ed.). Washington, DC: American Educational Research Association.
Ansell, E., & Pagliaro, C. (2006). The relative difficulty of signed arithmetic story problems for primary-level deaf and hard of hearing students. Journal of Deaf Studies and Deaf Education, 11(2), 153–170. doi:10.1093/deafed/enj030
Beddow, P., Kettler, R., & Elliott, S. (2008). Test accessibility and modification inventory. Nashville, TN: Vanderbilt University.
Bloom, L. (1991). Language development from birth to three. New York, NY: Cambridge University Press.
Cawthon, S. (2007). Hidden benefits and unintended consequences of No Child Left Behind policies for students who are Deaf or hard of hearing. American Educational Research Journal, 44(3), 460–492.
Cawthon, S., Leppo, R., & Highley, K. (in press). Review of empirical research on test item modifications for students with disabilities and English-language learners. School Psychology Forum.
Cawthon, S., Winton, S., Garberoglio, C., & Gobble, M. (2011). The effects of ASL as an accommodation for students who are deaf or hard of hearing. Journal of Deaf Studies and Deaf Education, 16(2), 198–211.
Center for Universal Design. (1997). The principles of universal design. Raleigh: University of North Carolina.
Crawford, L., & Tindal, G. (2004). Effects of a student-reads-aloud accommodation on the performance of students with and without learning disabilities on a test of reading comprehension. Exceptionality, 12(2), 71–88.
Elliott, S. N., McKevitt, B. C., & Kettler, R. (2002). Testing accommodations research and decision making: The case of good scores being highly valued but difficult to achieve for all students. Measurement and Evaluation in Counseling and Development, 35, 153–166.
Elliott, S. N., & Roach, R. T. (2007). Alternate assessments of students with significant disabilities: Alternative approaches, common technical challenges. Applied Measurement in Education, 20(3), 301–333.
Fletcher, J., Francis, D. J., Boudousquie, A., Copeland, K., Young, V., Kalinowski, S., & Vaughn, S. (2006). Effects of accommodations on high-stakes testing for students with reading disabilities. Exceptional Children, 72(2), 136–150.
Foster, C. (2008). One state's perspective on the appropriate inclusion of deaf students in large-scale assessments. In R. Johnson & R. Mitchell (Eds.), Testing deaf students in an age of accountability (pp. 115–135). Washington, DC: Gallaudet University Press.
Gibbs, K. (1989). Individual differences in cognitive skills related to reading ability in the deaf. American Annals of the Deaf, 134, 214–218.
Golinkoff, R. M., Hirsh-Pasek, K., Bloom, L., Smith, L., Woodward, A., Akhtar, N., et al. (2000). Becoming a word learner: A debate on lexical acquisition. New York, NY: Oxford University Press.
Haladyna, T., Downing, S., & Rodriguez, M. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(2), 309–344.
Harris, J., & Bamford, C. (2001). The uphill struggle: Services for Deaf and hard of hearing people: Issues of equality, participation, and access. Disability and Society, 16(7), 969–980.
Hoover, H. D., Dunbar, S. B., Frisbie, D. A., Oberley, K. R., Bray, G. B., Naylor, R. J., et al. (2001). Iowa Test of Basic Skills: Survey battery. Rolling Meadows, IL: Riverside Publishing.
Individuals With Disabilities Education Act Amendments of 2004, Pub. L. 108-446, 118 Stat. 2647.
Kelly, L. (1996). The interaction of syntactic competence and vocabulary during reading by deaf students. Journal of Deaf Studies and Deaf Education, 14(1), 75–90.
Kettler, R. J., Elliott, S. N., & Beddow, P. A. (2009). Modifying achievement test items: A theory-guided and data-based approach for better measurement of what students with disabilities know. Peabody Journal of Education, 84, 529–551.
Ketterlin-Geller, L. R., Yovanoff, P., & Tindal, G. (2007). Developing a new paradigm for accommodations research. Exceptional Children, 73(3), 331–347.
Kopriva, R. (2000). Ensuring accuracy in testing for English Language Learners. Washington, DC: Council of Chief State School Officers.
Lazarus, S. S., Thurlow, M. L., Lail, K. E., Eisenbraun, K. D., & Kato, K. (2006). 2005 state policies on assessment participation and accommodations for students with disabilities (Synthesis Report No. 64). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved from University of Minnesota website: http://education.umn.edu/NCEO/OnlinePubs/Synthesis64/
Lockhart, L., Beretvas, S. N., Cawthon, S., & Kaye, A. (2011). A multilevel measurement model that assesses whether accommodations moderate linguistic complexity's effect on items' difficulties. Manuscript submitted for publication.
Lollis, J., & LaSasso, C. (2009). The appropriateness of the NC state-mandated reading competency test for deaf students as a criterion for high school graduation. Journal of Deaf Studies and Deaf Education, 14(1), 76–98.
Martinello, M. (2008). Language and the performance of English Language Learners in math word problems. Harvard Educational Review, 78(2), 333–368.
Mayberry, R., del Giudice, A., & Lieberman, A. (2011). Reading achievement in relation to phonological coding and awareness in deaf readers: A meta-analysis. Journal of Deaf Studies and Deaf Education, 16(2), 164–188.
Mitchell, R. (2008). Academic achievement of deaf students. In R. Johnson & R. Mitchell (Eds.), Testing deaf students in an age of accountability (pp. 38–50). Washington, DC: Gallaudet University Press.
Moores, D. (2009). Cochlear failures. American Annals of the Deaf, 153(5), 423–424.
Mutua, N. K., & Elhoweris, H. (2002). Parents' expectations about the postschool outcomes of children with hearing impairments. Exceptionality, 10(3), 189–201.
No Child Left Behind Act of 2001, Pub. L. 107-110, 20 U.S.C. 6301 et seq.
Nussbaum, D., LaPorta, R., & Hinger, J. (Eds.). (2003). Cochlear implants and sign language: Putting it all together. Washington, DC: Gallaudet University.
Paul, P. (1998). Literacy and deafness. Boston, MA: Allyn & Bacon.
Phillips, S. E. (1994). High-stakes testing accommodations: Validity versus disabled rights. Applied Measurement in Education, 7(2), 93–120.
Rodriguez, M. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24, 3–13.
Sato, E., Rabinowitz, S., Gallagher, C., & Huang, C.-W. (2010). Accommodations for English Language Learner students: The effect of linguistic modification of math test item sets. Washington, DC: National Center for Education Evaluation and Regional Assistance. Retrieved from U.S. Department of Education website: http://ies.ed.gov/ncee/edlabs/regions/west/pdf/REL_20094079.pdf
Schick, B. (1997). The effects of discourse genre on English-language complexity in school-age deaf students. Journal of Deaf Studies and Deaf Education, 2(4), 234–251.
Schirmer, B. (2000). Language and literacy development in children who are deaf (2nd ed.). Needham Heights, MA: Allyn & Bacon.
Shaftel, J., Belton-Kocher, E., Glassnap, J., & Poggio, J. (2006). The impact of language characteristics in mathematics test items on the performance of English Language Learners and students with disabilities. Educational Assessment, 11, 105–126.
Sireci, S. G., Scarpati, S. E., & Li, S. (2005). Test accommodations for students with disabilities: An analysis of the interaction hypothesis. Review of Educational Research, 75(4), 457–490.
Spencer, P., & Marschark, M. (2010). Evidence-based practice in educating deaf and hard of hearing students. New York, NY: Oxford University Press.
Thompson, S. J., Johnstone, C. J., & Thurlow, M. L. (2002). Universal design applied to large-scale assessments (Synthesis Report No. 44). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved from University of Minnesota website: http://education.umn.edu/NCEO/OnlinePubs/Synthesis44.html
Traxler, C. (2000). The Stanford Achievement Test, ninth edition: National norming and performance standards for deaf and hard of hearing students. Journal of Deaf Studies and Deaf Education, 5(4), 337–348.
U.S. Department of Education. (2007a). Modified academic achievement standards: Nonregulatory guidance. Washington, DC: Author.
U.S. Department of Education. (2007b). Standards and assessments peer review guidance. Washington, DC: Author.
Weigert, S. (2009). Perspectives on the current state of alternate assessments based on modified academic achievement standards: Commentary on Peabody Journal of Education special issue. Peabody Journal of Education, 84, 585–594.
Yoshinaga-Itano, C., & Gravel, J. (2001). The evidence for universal newborn hearing screening. American Journal of Audiology, 10(2), 62–64.
Zigmond, N., & Kloo, A. (2009). The two percent students: Considerations and consequences of eligibility decisions. Peabody Journal of Education: Issues of Leadership, Policy, and Organization, 84(4), 478–495.



Appendix A Sample Problems From Released Texas State Assessment Items

Mathematics
3. Marcy bought 6 apples priced at $0.35 each. She used a coupon worth $0.50 off the total cost. Which number sentence can be used to find how much money Marcy needed in order to buy the apples?
A (6 × 0.35) − 0.50 = 1.60
B (6 + 0.35) + 0.50 = 6.85
C (6 − 0.35) + 0.50 = 6.15
D (6 × 0.50) − 0.35 = 2.65

Reading (after passage about penguins)

Why did scientists need to study penguins in the Antarctic before building the Penguin Encounter?
A The scientists were afraid the penguins' homes would be destroyed before the Penguin Encounter was finished.
B The scientists wanted to make the Penguin Encounter as much like the penguins' natural home as they could.
C The scientists wanted to see how the penguins reacted to global warming before taking them to sunny California.
D The scientists knew it would take many years to capture several hundred penguins for the Penguin Encounter.


Appendix B Linguistic Complexity Scoring Rubric

PART 1: VOCABULARY
Description: uncommon usage, nonliteral usage, manipulation of lexical forms
List of uncommon words here:

| # of uncommon words | Score |
|---|---|
| 0 | 0 |
| 1–2 | 1 |
| 3–4 | 2 |
| 5+ | 3 |

Vocabulary score:

PART 2: SYNTAX
Description: Atypical parts of speech, uncommon syntactic structures, complex syntax, academic syntactic form

Number of long nominals?
Does the passage use passive voice? (no = 0, yes = 1)
Number of conditional clauses?
Number of relative clauses?
Number of complex questions?
Total raw syntax score:

| Raw syntax score | Syntax score |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3+ | 3 |

Syntax score:

PART 3: DISCOURSE
Description: Uncommon genre, need for multiclausal processing, academic language. Consider whether students are required to synthesize information across sentences.
Is the student required to make clausal connections between concepts and sentences? YES (= 1) NO (= 0)

TOTAL SCORE
Syntax score + Vocabulary score + Discourse score = Total score

| Total score | Classification |
|---|---|
| 0 | Not linguistically complex |
| 1–2 | Moderately linguistically complex |
| 3+ | Very linguistically complex |
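
The rubric above amounts to three small lookups and a sum. A minimal Python sketch of that arithmetic follows; it is an added illustration, not the study's scoring procedure. The class and function names (`ItemCounts`, `vocabulary_score`, and so on) are hypothetical, and the tallies for the two sample items are read from the examples in Appendix C, using one rater's syntax count of 3 for the apples item.

```python
from dataclasses import dataclass


@dataclass
class ItemCounts:
    """Rater tallies for one test item (field names are illustrative)."""
    uncommon_words: int
    long_nominals: int
    passive_voice: bool
    conditional_clauses: int
    relative_clauses: int
    complex_questions: int
    clausal_connections: bool


def vocabulary_score(uncommon_words: int) -> int:
    # Part 1 bands: 0 -> 0, 1-2 -> 1, 3-4 -> 2, 5+ -> 3
    if uncommon_words <= 0:
        return 0
    if uncommon_words <= 2:
        return 1
    if uncommon_words <= 4:
        return 2
    return 3


def raw_syntax_score(item: ItemCounts) -> int:
    # Part 2: sum of the five syntax tallies (passive voice counts 0 or 1)
    return (item.long_nominals + int(item.passive_voice)
            + item.conditional_clauses + item.relative_clauses
            + item.complex_questions)


def syntax_score(raw: int) -> int:
    # Part 2 bands: 0, 1, 2 map to themselves; 3 or more maps to 3
    return min(raw, 3)


def total_score(item: ItemCounts) -> int:
    # Total = vocabulary score + syntax score + discourse score (0 or 1)
    return (vocabulary_score(item.uncommon_words)
            + syntax_score(raw_syntax_score(item))
            + int(item.clausal_connections))


def classify(total: int) -> str:
    # Classification bands: 0, 1-2, 3+
    if total == 0:
        return "Not linguistically complex"
    if total <= 2:
        return "Moderately linguistically complex"
    return "Very linguistically complex"


# Tallies taken from the Appendix C examples (assumed reading of that table).
apples = ItemCounts(uncommon_words=1, long_nominals=1, passive_voice=True,
                    conditional_clauses=0, relative_clauses=0,
                    complex_questions=1, clausal_connections=True)
penguin = ItemCounts(uncommon_words=0, long_nominals=0, passive_voice=True,
                     conditional_clauses=0, relative_clauses=0,
                     complex_questions=0, clausal_connections=False)

for name, item in [("apples", apples), ("penguin", penguin)]:
    total = total_score(item)
    print(name, total, classify(total))
```

Because the syntax score is capped at 3, a raw syntax count of 3 or of 4 yields the same syntax score, which is why the rater disagreement noted in Appendix C does not change the apples item's total.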


Appendix C Example Item Rubrics

Using the items in Appendix A, here is the breakdown of their rating scores with information about how the items were scored (these are raw scores):

| Category | Mathematics item: apples | Reading item: penguin |
|---|---|---|
| Vocabulary | sentence | No uncommon or multiple-meaning words |
| Vocabulary raw score | 1 | 0 |
| Vocabulary score | 1 | 0 |
| Syntax | Passive tense (can be used); long nominal (coupon worth $.50); complex question (2nd sentence) | Passive tense (would be) |
| Syntax raw score | 3.5 (a) | 1 |
| Syntax score | 3 | 1 |
| Discourse | Integrate across sentences | No discourse elements |
| Discourse score | 1 | 0 |
| Total | 5 | 1 |
| Linguistic rating | Very complex | Moderately complex |

(a) The raters disagreed on this category, and thus the score was averaged. The resulting syntax score is the same with either of the raters' scores (3 vs. 4).
