
RELIABILITY AND PERFORMANCE-BASED ASSESSMENTS

By Sheila Scott
Sheila Scott is associate professor of music at Brandon University in Brandon, Manitoba, Canada.

Most classroom teachers do not think much about the reliability of the assessments they use. Performance-based assessments, in particular, often seem very subjective. Teachers may wonder what reliability really is and how they can improve the reliability of the assessments they use. A process of discovering how the reliability of assessment can be improved is told here as the story of "Anita," a fictional elementary general music teacher. While this article is written from her point of view, her experiences are a combination of the author's experiences, the experiences of other teachers, and experiences that may be typical of elementary music classroom teachers in general.

My name is Anita. I'm a general music teacher in an elementary school. Recent literature on using performance-based assessments to document a student's attainment of musical competencies has propelled me to examine my own educational practice. In doing so, I use the nine content standards for music education (MENC 1994) as a basis for developing my curriculum. In addition, I incorporate recommendations from MENC to develop performance-based assessments that provide information about what students are able to do as a result of instruction (MENC 1996).

I recently attended a state-sponsored workshop on the benefits and limitations of performance-based assessments. The clinician's presentation covered the technical features of performance-based assessments, including an in-depth examination of reliability. I was, however, left with two unanswered questions: (1) What should I, as a general music teacher, know about reliability in performance-based assessment to help me in my everyday work in the classroom? (2) How can this information help me obtain meaningful information about the music skills and knowledge my students acquire as a result of instruction?

To understand this issue, I examined two aspects of reliability: (1) What is reliability in performance-based assessments? (2) How can music teachers enhance the reliability of the scores obtained from the performance-based assessments used to evaluate their students' work? I examined selected literature in the area of student assessment and applied recommendations from these sources to my classroom practice. Here is what I learned.

Reliability and Performance-Based Assessments


The reliability of performance-based assessments refers to the consistency of scores obtained by using an observational measure such as a checklist or rubric to obtain information about what a student is able to do as a result of instruction.1 As noted by Popham (1999): "Reliability is a central concept in measurement ... if an assessment procedure fails to yield consistent results, it is almost impossible to make any accurate inferences about what an examinee's score signifies. Inconsistent measurement is, indeed, unreliable" (p. 32).

In terms of my own practice, this means that when examining the consistency of assessments, I need to have confidence that the score assigned is based on the student's performance on the intended task and not on conditions irrelevant to the performance of this task. For example, was Lynn's critique of the symphony performance graded lower than expected due to
her faulty use of grammar? Did Andrew perform the ostinato incorrectly because the assessment was administered on a Friday afternoon? Were the scores Lucy earned on the singing rubric lower than those from last year because her former music teacher was a lenient rater? Did Kevin do better than expected on the composition assignment because he was working with a group of musically gifted students?2

I also learned that there are many different types of reliability. For example, if I use a 5-point scale to assess my students' performances of the song "Over My Head" in September and repeat this assessment one month later, the results would be reliable across this period of time (Nitko 1996). If another music teacher rates these performances using the same scale, a comparison of our scores would provide a measure of reliability across different raters. Using a third example, if I assess my students' performances of the song "Over My Head" in September and one month later use the same rubric to assess their ability to sing the song "Old Brass Wagon," the scores would be "reliable with respect to equivalent versions of the same task" (p. 63). As these examples illustrate, reliability in assessment represents different forms of consistency: consistency over time, consistency with different raters, and consistency across similar tasks.
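To make these forms of consistency concrete, the short Python sketch below tallies exact agreement between two sets of 5-point rubric scores. It is an illustration added here, not material from the article or its sources, and the scores and function name are invented.

```python
# Illustrative sketch only, not part of the original article: the scores and
# the function below are invented to show how exact agreement on a 5-point
# rubric could be tallied for two forms of consistency.

def percent_agreement(scores_a, scores_b):
    """Return the share of performances given the same rating in both lists."""
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)

# Consistency over time: the same performances rated in September and again
# one month later (hypothetical ratings).
september = [3, 4, 2, 5, 4, 3, 3, 2]
october = [3, 4, 3, 5, 4, 3, 2, 2]

# Consistency across raters: my ratings and a colleague's ratings of the same
# videotaped performances (hypothetical ratings).
my_ratings = [3, 4, 2, 5, 4, 3, 3, 2]
colleague_ratings = [3, 5, 2, 5, 3, 3, 3, 2]

print(f"Agreement over time:     {percent_agreement(september, october):.0%}")
print(f"Agreement across raters: {percent_agreement(my_ratings, colleague_ratings):.0%}")
```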

My concern for the reliability of performance-based assessments stems from my need for confidence in the accuracy of the scores obtained by using these measures (Nitko 1996). What does a score of 5 on a 5-point scale really mean? If more than one teacher assesses the performance, what does each score signify if the judges disagree? If performances scored at the beginning of a marking session are rated differently from performances scored at the end of this session, what does the score assigned to each student really mean? (Herman, Aschbacher, and Winters 1992). As I thought about these questions, I wondered whether it was possible for assessment results based on subjective judgment to be reliable and fair. To gain insight on this problem, I studied two aspects of reliability for judging the consistency of scores generated from performance assessments: the specificity of the scoring guide or rubric, and the consistency of my judgments.

How Can Music Teachers Enhance the Reliability of Assessments?

Specificity of the Scoring Guide

In studying the literature, I found that the consistency of scores largely depends on the specificity of the scoring guide. This was explained by Freeman and Lewis (1998): "... good criteria are explicit, understood, and agreed by all assessors and also by students. A small number of simple criteria usually lead to greater reliability than do complicated marking schemes, because they are more manageable ... [and clarify] the nature of the task, thereby helping your students focus on exactly what is required" (pp. 25-26).

In applying these principles in practice, I have many choices about how to document my students' proficiency in a variety of elementary-level music tasks: for example, the proficiency with which my first-grade students maintain a beat.3 Using a checklist, I can document whether the students can step to the beat by recording my judgment in the corresponding column (figure 1). I can also use rating scales to assess each performance for various levels of proficiency. In this case, the evaluative criteria are described in general terms, for example, along a continuum ranging from "unable to perform" to "musical performance" (figure 2). I can also use a rubric in which the levels of proficiency reflect the performances I expect my students to demonstrate (figure 3). The details provided in this rubric afford the most confidence in the ratings.

The time needed to develop rubrics seems daunting. Fortunately, I do not have to develop all my scoring guides from scratch. The MENC publication Performance Standards for Music (1996) has rubrics for one assessment strategy for each achievement standard appearing under the nine national content standards for music. These scoring guides provide skill-specific feedback at three levels of proficiency: basic, proficient, and advanced. When used in the classroom, these rubrics supply diagnostic information about each student's performance. This information is useful for planning educational experiences and gives feedback that my students can use to improve future performances.

Realizing that my colleagues were facing a similar dilemma, I organized a districtwide workshop that brought teachers together to examine the use of rubrics in the elementary music classroom. Group discussions, using the MENC publication as a foundation, helped clarify the expected learning outcomes assessed by these rubrics, thereby increasing the reliability of these measures when used in our classrooms.
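As a further illustration (again an addition to this text rather than part of the MENC publication), a skill-specific rubric can be written down in a structured form so the criteria behind each score stay explicit and easy to share with colleagues. The sketch below paraphrases the descriptors of the beat-stepping rubric in figure 3.

```python
# Illustrative sketch only, not part of the MENC materials or the original
# article: a skill-specific rubric stored in a structured form so the
# criteria behind each score remain explicit. Descriptors are paraphrased
# from the beat-stepping rubric in figure 3.

beat_stepping_rubric = {
    1: ("Unable to perform", ["unsteady beat"]),
    2: ("Experiences difficulty", ["feels the accented beat",
                                   "most remaining beats incorrect"]),
    3: ("Inaccurate", ["feels the accented beat", "performance errors",
                       "drifts away from the beat and then returns"]),
    4: ("Accurate", ["no errors", "stiff movements"]),
    5: ("Musical", ["no errors", "free movements"]),
}

def describe(level):
    """Return the label and descriptors that a score at this level reflects."""
    label, descriptors = beat_stepping_rubric[level]
    return f"{level} ({label}): " + "; ".join(descriptors)

for level in sorted(beat_stepping_rubric):
    print(describe(level))
```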

Teachers' Judgments

Consistency of judgment is a key to reliable assessment. My judgments are reliable if I make the same scoring decision whenever I view a similar performance. So, if I assign a certain performance an A, all other performances where students demonstrate a similar level of ability should receive an A. This task is often difficult. However, there are strategies that can help increase the reliability of scores obtained with performance-based measures, such as monitoring the discrepancies between the scores I assign and those assigned by two or more other teachers, or monitoring discrepancies when I score the same performances more than once.

To examine consistency with different raters, I asked for assistance from a music teacher in a neighboring school district. Beginning with the rubric, the colleague and I decided how the given criteria would be applied to the student performances (Herman, Aschbacher, and Winters 1992). From there, we both scored videotaped examples of student work
independently. Discussion of the results revealed disagreements in the interpretation of the rubric, thereby facilitating further delineation of the scoring guide. Although I found it helpful to work with a colleague, this process was so time-consuming that I would only use it when working with a particularly detailed rubric or when working in areas where multiple aspects of performance are assessed concurrently (for example, assessing music-based improvisations). For this reason, I needed to find strategies that I could use on my own to improve the reliability of performance-based assessments.

I tested my consistency over time by assessing videotaped performances twice, with two weeks separating the first and second assessments (Nitko 1996). I then examined both sets of marks to see whether I gave the same ratings each time. Consistency with similar tasks involved using the same rubric to assess performances obtained with two similar tasks (for example, stepping the beat while singing the songs "Bow, Wow, Wow" and "John Kanaka") and comparing the results obtained on both assessments. While these procedures were time-consuming, I found them useful for identifying inconsistencies in the scores. I decided to continue using this strategy with a small number of performances to monitor my use of subjectively scored assessments.

My scoring decisions changed as I continued through the grading process. For example, when scoring twenty student improvisations, I found that scores awarded to the first ten students were higher than those awarded to the last ten students. I began to monitor my consistency by rescoring samples of student work regularly (Herman, Aschbacher, and Winters 1992). While this procedure helped me check the consistency of the scores I assigned, it did not help me identify personal biases in these judgments. I addressed this problem by examining how errors may negatively influence the scores I assign to my students' work. I did this by applying Nitko's (1996) summary of rating-scale errors to my work in the music classroom: "Leniency error occurs when a teacher tends to make almost all ratings toward the high end of the scale, avoiding the low end of the scale. Severity error is the opposite of leniency error: A teacher tends to make almost all ratings toward the low end of the scale. Central tendency error occurs when a teacher hesitates to use extremes and uses the middle part of the scale only. ... A halo effect occurs when a teacher lets her general impression of the student affect how she rates the student on specific dimensions" (p. 277).

I noticed leniency errors in my scoring patterns when I assessed newly acquired skills. I tended to award higher marks than those warranted by the performances because students were in the process of acquiring the musical competencies measured by the particular scale. On the other hand, I observed a pattern of severity errors when scoring performances for which I wanted subsequent assessments to show growth or improvement over time. Central tendency errors occurred when I assessed creative aspects of musical proficiency such as improvisation and composition. In hesitating to award marks at the low or high ends of the scale, I tended to score all performances as "average" (Nitko 1996).

I observed the halo effect when I scored performances of students who were struggling in music class but who were trying to succeed. In these cases, I tended to award marks based on effort rather than on attributes of the completed work. Conversely, I sometimes awarded low scores to the performances of students who demonstrated a haphazard approach to the music class. I also found the halo effect at work when I made decisions about performances on the border between two scores (Nitko 1996): for example, should a recorder performance be awarded a 2 or a 3 out of 5? In these cases, the recorder performances of the hard-working students were given the higher score, while the performances of students who failed to demonstrate a serious approach to this work (often those students who challenged my skills in classroom management) were awarded the lower score.

For scores to be reliable, they should be based on the merits of the performances, not on the predilections of the evaluator. The classifications outlined by Nitko (1996) serve as a guide in identifying sources of bias in my own work. I also use certain strategies to monitor my consistency over time, consistency across similar tasks, and consistency with different raters in an effort to improve the reliability of the scores I assign using performance-based assessments.

In the future, I will engage my students in the process of assessment. Together, we can review the rubric that will be used to score performance of a particular task. They can assist me in determining the properties of these performances at the basic, proficient, and advanced levels. I will be interested in learning how my views correspond to or differ from those of my students. The insight gained from this experience should increase the reliability of the scores generated from the performance-based assessments I use in my classroom.
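The informal checks described in this section, comparing scores from the first and second halves of a grading session and watching where ratings cluster on the scale, can be tallied quickly. The sketch below is an added illustration with invented scores; it is not a procedure taken from Nitko (1996) or Herman, Aschbacher, and Winters (1992).

```python
# Illustrative sketch only, not a procedure from Nitko (1996) or Herman,
# Aschbacher, and Winters (1992): the scores below are hypothetical 5-point
# rubric ratings listed in the order they were marked.
from statistics import mean

scores = [5, 4, 5, 4, 5, 4, 4, 3, 5, 4, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3]

# Drift across a grading session: compare the first half with the second half.
half = len(scores) // 2
print(f"Mean of first half:  {mean(scores[:half]):.2f}")
print(f"Mean of second half: {mean(scores[half:]):.2f}")

# Leniency, severity, and central tendency: see where the ratings cluster.
total = len(scores)
high = sum(s >= 4 for s in scores) / total    # mostly high: possible leniency
low = sum(s <= 2 for s in scores) / total     # mostly low: possible severity
middle = sum(s == 3 for s in scores) / total  # mostly middle: central tendency
print(f"High (4-5): {high:.0%}   Low (1-2): {low:.0%}   Middle (3): {middle:.0%}")
```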

Conclusion

"Anita" is a composite of hard-working professionals who strive to improve their educational practices. Her story shows that teachers can develop and use performance-based assessments to provide information about what their students are able to do as a result of instruction. Using these assessments is not enough. Teachers need to be confident about the consistency, or reliability, of the scores generated with this method of assessment.

Music teachers do not need to study the statistical properties of reliability to apply this concept to their educational practice. Rather, they can apply a practical knowledge of reliability to improve the consistency of subjectively scored measurements used to document what students can do as a result of formal music instruction. Using the strategies outlined here, teachers can improve the consistency with which they award marks on performance-based assessments. Based on these standards, assessment results founded on subjective judgment can be reliable and fair.

Endnotes

1. Teachers may assume that student learning occurs as a result of instruction. Students may also gain understandings and competencies in formal and informal educational settings outside the classroom.

2. Questions adapted from Herman, Aschbacher, and Winters (1992).

3. This summary of performance-based assessment is based on Scott (2001).

References

Consortium of National Arts Education Associations. 1994. National standards for arts education. Reston, VA: MENC.

Freeman, R., and R. Lewis. 1998. Planning and implementing assessment. London: Kogan Page.

Herman, J. L., P. R. Aschbacher, and L. Winters. 1992. A practical guide to alternative assessment. Alexandria, VA: Association for Supervision and Curriculum Development.

MENC: The National Association for Music Education. 1996. Performance standards for music. Reston, VA: MENC.

Nitko, A. J. 1996. Educational assessment of students. 2d ed. Englewood Cliffs, NJ: Prentice Hall.

Popham, W. J. 1999. Classroom assessment: What teachers need to know. 2d ed. Boston, MA: Allyn and Bacon.

Scott, S. J. 2001. Using checklists and rating scales (rubrics) to assess student learning in music: Heather's story. Manitoba Music Educator 41(3): 7-9.

Figure 1. Checklist

Specific learning target: Given songs in 2/4 and 4/4 meter, student steps the beat.
Directions: Observe the student as he/she steps the beat.

Name              Yes    No
J. B. Lock
H. E. Woodward
F. Faraci
M. Lucerne
J. M. Fortier

Figure 2. Generic Rating Scale (Rubric)

Specific learning target: Given songs in 2/4 and 4/4 meter, student steps the beat.

Scale:
1. Unable to perform
2. Experiences some difficulty
3. Inaccurate performance
4. Accurate performance
5. Musical performance


Figure 3. Skill-Specific Rating Scale (Rubric)

Specific learning target: Given songs in 2/4 and 4/4 meter, student steps the beat.

1. Unable to perform: unsteady beat.
2. Experiences difficulty: able to feel accented beat; most remaining beats incorrect.
3. Inaccurate: able to feel accented beat; performance errors; drifts away from beat and then returns.
4. Accurate: no errors; stiff movements.
5. Musical: no errors; free movements.
