Journal of Research in Music Education
http://jrm.sagepub.com
Published by SAGE Publications (http://www.sagepublications.com)
Downloaded from http://jrm.sagepub.com at UNIV FEDERAL RIO GRANDE SO SU on August 13, 2009
Performer, Rater, Occasion, and Sequence as Sources of Variability in Music Performance Assessment
DOI: 10.1177/0022429408317515
Martin J. Bergee
University of Kansas
This study examined performer, rater, occasion, and sequence as sources of variability
in music performance assessment. Generalizability theory served as the study’s basis.
Performers were 8 high school wind instrumentalists who had recently performed a
solo. The author audio-recorded performers playing excerpts from their solo three
times, establishing an occasion variable. To establish a rater variable, 10 certified adju-
dicators were asked to rate the performances from 0 (poor) to 100 (excellent). Raters
were randomly assigned to one of five performance sequences, thus nesting raters
Please address correspondence to Martin J. Bergee, University of Kansas, School of Fine Arts, Department
of Music and Dance, Murphy Hall, 1530 Naismith Drive, Room 460, Lawrence, KS 66045-3102; e-mail:
mbergee@ku.edu.
sizeable error term. This suggests that a substantial amount of measurement error
might have been present.
Because the independent variables in these four studies were dichotomized and
scrutinized carefully for categorization errors, a great deal of measurement error
among them was unlikely. On the other hand, the dependent variable, adjudicator
ratings, was potentially rife with measurement error. The reliability of festival adjudication has been called into question over a long span of time (e.g., to cite a few, Burnsed, Hinkle, & King, 1985; Fiske, 1983; Hare, 1960; Thompson & Williamon,
2003). To date, however, the issue remains underinvestigated, in part because of the
difficulties involved in identifying sources of measurement error. Addressing concerns
about festival adjudication requires comprehensive and psychometrically sound
approaches to determining these sources of error. Approaches common in performance assessment (calculating interrater reliability, for example) lack the requisite level of sophistication.
Unresolved issues surround the measurement purposes of festival adjudication
and similar approaches to performance assessment. Is the festival experience pri-
marily an opportunity for students to perform in public up to designated standards,
or do festivals evaluate young performers’ achievement so that useful suggestions for
improvement can be made? Both purposes, one more summative and the other more
formative, have merit, and they are not wholly incompatible.
The latter purpose perhaps is more defensible pedagogically, especially for youth
without a great deal of performance experience. If music educators are to accept this
more formative purpose as central, then in a measurement sense an individual’s true
performance level, that is, a rendering of the music that faultlessly expresses his or
her achievement level at a precise moment in time, should encounter true assessment: the consensus of recommendations for improvement from a large (infinite, theoretically) pool of qualified raters. The reality of adjudication is of course no match for
this ideal.
Stated in psychometric terms, adjudication’s measurement concerns (some of the
more pressing at least) involve the extent to which (a) a single performance represents
a given performer’s actual state of achievement, that is, his or her hypothetical true
score; (b) a single adjudicator, despite a tight schedule, fatigue, and a myriad of other
obstacles, is able to discern this true score, that is, to evaluate each entrant with perfect reliability and validity; (c) performers’ serial position potentially influences an
adjudicator’s ability to evaluate multiple events fairly across time; and (d) these phe-
nomena and others might interact. Because these issues remain unresolved, teachers
and students understandably conjecture that festival adjudication is unreliable.
Research findings have lent substance to their concerns. Researchers have
frequently concluded that more than one adjudicator is necessary for good reliabil-
ity (e.g., to cite only a few, Bergee, 2003; Fiske, 1983; Sagin, 1983; Vasil, 1973).
The one-adjudicator model remains the norm, however, especially in solo and
small-ensemble festival evaluation. Other adjudicator issues are present as well. How
experienced should judges be? Some studies have used student evaluators with accept-
able results (e.g., Bergee, 1995; Wapnick, Ryan, Lacaille, & Darrow, 2004), but
students apparently do not have enough expertise to validly assess high-level perfor-
mance (Thompson, Diamond, & Balkwill, 1998). Another issue is interrater consis-
tency versus agreement. Authors of most studies, if they report anything at all, usually
report interrater consistency in the form of correlation coefficients. (A notable excep-
tion is Sagin, 1983, who used analysis of variance.) Correlation coefficients, however,
are insensitive to differences in rater agreement. Two raters with a similar contour will
correlate highly, even if one rates far more stringently than does the other.
Beyond the raters themselves, sequence issues persistently arise. In music con-
texts, sequence effects have been found in daylong sequences (e.g., Bergee, 2006;
Flôres & Ginsburgh, 1996), sequences of intermediate length (Wapnick, Flowers,
Alegant, & Jasinskas, 1993), and even among pairs (Duerksen, 1972). Specifically,
a tendency to evaluate later performances more leniently has been noted. These
effects have been found in performance areas outside of music too (e.g., figure skat-
ing, Bruine de Bruin, 2005; synchronized swimming, Wilson, 1977).
I found no studies of the extent to which performances were judged to remain
consistent across repeated trials. (A closely related issue is interrater reliability
derived from a single stimulus, which has been studied. James, Demaree, and Wolf,
1984, 1993, developed a reliability formula for this scenario, although Schmidt and
Hunter, 1989, argued that multiple stimuli continue to be necessary.) Multiple trials, that is, the same individual performing the same task several times, are critical to
establishing the true score, which is estimated from observed scores assumed to be
normally distributed around the true score. Therefore, other things being equal, a
larger sample of behaviors should yield a more accurate estimate of a given per-
former’s true level of achievement. In most music assessment contexts, however, the
performers play or sing only once.
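The reasoning above amounts to simple estimation from a sample of trials. A sketch with hypothetical numbers (none drawn from this study): averaging observed scores estimates the true score, and the standard error of that estimate shrinks as trials are added.

```python
import math
import statistics

# Three hypothetical ratings of the same performer on repeated trials.
observed = [82.0, 85.0, 81.0]

# The mean of the observed scores estimates the performer's true score...
est_true = statistics.mean(observed)

# ...and the standard error of that estimate shrinks as 1/sqrt(n_trials),
# which is why a larger sample of behaviors yields a more accurate estimate.
se = statistics.stdev(observed) / math.sqrt(len(observed))
```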
In brief, error in performance measurement has been shown to originate from
multiple sources. Classical test theory approaches to determining score reliability,
however, are not capable of identifying and untangling this profusion of error.
Classical reliability was not conceptualized to do this; it accounts for only one error
source, the consistency (or lack thereof) with which raters evaluate a set of perfor-
mances. Other potential sources remain, but as undifferentiated error. A more
advanced method is needed, one capable of accommodating multiple sources of
error and of placing findings into theoretical contexts beyond the local panels of
oped in the early 20th century (e.g., Spearman, 1904). Despite its recent emergence, G theory is "perhaps the most broadly defined measurement model currently in existence, and ... represents a major contribution to psychometrics" (Brennan, 2001, p. 9).
Method
Performers
To control for variability owing strictly to different types of performers or performances, I used only soloists and only woodwind and brass instrumentalists. All
were high school soloists (Grades 10 to 12) who had received a Superior (I) rating
at the district level and had gone on to perform at the state level. Ideally, I would
have randomly selected performers from the sum total of all who played wind solos
at this state festival. Because this was not feasible, I randomly selected schools of
three sizes, one large (size classification 5A), one medium (3A), and one small (1A),
from the communities surrounding my university.
I contacted these three schools’ instrumental music teachers and learned that only
wind soloists from the 5A school had received a Superior rating at the district level,
which accorded them eligibility to perform at the state level. Rather than attempt to
locate other schools, I remained faithful to the original random selection process and
recruited participants only from the 5A school’s wind soloists. Doing so also helped
to minimize such additional sources of unwanted variability as extreme heterogene-
ity of performance quality, geographical location of the community, differences in
quality of instruction among their band directors, and so forth. After obtaining per-
mission to conduct the study, I spoke with the eligible wind soloists at this school
about participating; 2 of the 11, however, were absent on that day. With the 9 who
were present, I discussed what participation would entail. I let them know that their
participation was strictly voluntary and that no penalty would be attached to non-
participation. All 9 agreed to participate. However, 1 later withdrew owing to a
scheduling conflict on the day of data collection, which left a total of 8 performers-
3 flutists, 2 clarinetists, 1 alto saxophonist, 1 trumpeter, and 1 tubist.
Raters
I approached the official of our state’s high school activities association responsible
for music events and asked her for a roster of all woodwind and brass judges who
had undergone the association’s official training for adjudicators and thus were eli-
gible to judge in this state. She agreed to supply me with this roster. I then identified
all certified adjudicators within a reasonable driving distance of my university and
randomly selected 10. I contacted these individuals and asked them to serve as raters
for the study. All agreed. Of the 10 raters, 8 were current or retired members of uni-
versity faculties, and 2 were public school band directors, 1 of whom had recently
retired. Years of teaching experience ranged from 8 to "40-plus."
Evaluation Procedures
Because this study required a continuous dependent variable (for ANOVA purposes, as explained in the following), I asked the raters to evaluate the 24 performances globally from 0 (very poor) to 100 (excellent). Whether music performance is better evaluated using global or specific protocols has been a source of disagreement (e.g., Stanley, Brooker, & Gilbert, 2002). Some (e.g., Fiske, 1977; Mills, 1987)
have argued for a global approach, whereas others (e.g., Bergee, 2003; Thompson &
Williamon, 2003) have found good interrater reliability among a limited number of
subscales. In the latter two studies, the subscale scores correlated highly with over-
all scores. Wapnick and Ekholm (1997), who found a similar pattern, suggested that
raters first form an overall impression and then respond to individual scale items
accordingly. Radocy and Boyle (1987) suggested that the approach to assessment
should depend on the function of the assessment. Because studying raters, not per-
formers or performances per se, was the function of the present study, a global
assessment seemed the more suitable choice.
Each rater evaluated independently, using as a playback unit the same electronic
device on which the performances had been recorded and with a set of Sony MDR-
027 headphones. I was present at all rating sessions and operated the playback
equipment. Earlier, I had reordered the 24 performances into five different random
sequences. I attempted no stratifications or other manipulations of the sequences. I
had randomly assigned each of the raters to one of the five sequences; accordingly,
two raters evaluated within each of the sequences.
The raters first read a letter that explained the purpose of the study and provided
information about the task. I supplemented the letter by asking the raters to use
whatever criteria they were comfortable with, so long as they attempted to remain
consistent within themselves. I cautioned raters that they would listen to each per-
formance only once and that when they had scored a performance and moved on to
another one, they could not later return to change their score. I reminded raters that
the performers would not receive these scores. The rating sessions went smoothly,
each taking about 45 minutes to complete.
Theoretical Framework
G theory allows for precise identification of sources of measurement error. One
obvious source is the persons undergoing evaluation (in this study, the 8 performers, designated p; cf. tables). The variance representing that error, known in G theory
Table 1
ANOVA for p x o x r (Performer by Occasion by Rater, Completely Crossed)
Note: Estimated variance components: σ²(p) = [MS(p) − MS(po) − MS(pr) + MS(por)]/(n_o n_r); σ²(o) = [MS(o) − MS(po) − MS(or) + MS(por)]/(n_p n_r); σ²(r) = [MS(r) − MS(pr) − MS(or) + MS(por)]/(n_p n_o); σ²(po) = [MS(po) − MS(por)]/n_r; σ²(pr) = [MS(pr) − MS(por)]/n_o; σ²(or) = [MS(or) − MS(por)]/n_p; σ²(por) = MS(por).
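The note's algorithm can be expressed directly in code. This is an illustrative sketch, not the author's program; the function name and the mean-square inputs are hypothetical stand-ins for the quantities reported in Table 1.

```python
# Illustrative sketch (not the author's program): estimating variance
# components for a fully crossed p x o x r random-effects G study from
# ANOVA mean squares, following the note to Table 1.

def g_study_p_x_o_x_r(ms, n_p, n_o, n_r):
    """ms: dict of mean squares keyed 'p', 'o', 'r', 'po', 'pr', 'or', 'por'."""
    sigma2 = {
        "p":   (ms["p"]  - ms["po"] - ms["pr"] + ms["por"]) / (n_o * n_r),
        "o":   (ms["o"]  - ms["po"] - ms["or"] + ms["por"]) / (n_p * n_r),
        "r":   (ms["r"]  - ms["pr"] - ms["or"] + ms["por"]) / (n_p * n_o),
        "po":  (ms["po"] - ms["por"]) / n_r,
        "pr":  (ms["pr"] - ms["por"]) / n_o,
        "or":  (ms["or"] - ms["por"]) / n_p,
        "por": ms["por"],
    }
    # Negative estimates are conventionally truncated to zero (Brennan, 2001).
    return {k: max(v, 0.0) for k, v in sigma2.items()}
```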
Table 2
ANOVA for p x (o x [r : s]) (Raters Nested Within Sequences)
Note: Estimated variance components: σ²(p) = [MS(p) − MS(po) − MS(ps) + MS(pos)]/(n_o n_s n_r:s); σ²(o) = [MS(o) − MS(po) − MS(os) + MS(pos)]/(n_p n_s n_r:s); σ²(s) = [MS(s) − MS(ps) − MS(os) − MS(r : s) + MS(pos) + MS(pr : s) + MS(or : s) − MS(por : s)]/(n_p n_o n_r:s); σ²(po) = [MS(po) − MS(pos)]/(n_s n_r:s); σ²(ps) = [MS(ps) − MS(pos) − MS(pr : s) + MS(por : s)]/(n_o n_r:s); σ²(os) = [MS(os) − MS(pos) − MS(or : s) + MS(por : s)]/(n_p n_r:s); σ²(r : s) = [MS(r : s) − MS(pr : s) − MS(or : s) + MS(por : s)]/(n_p n_o); σ²(pos) = [MS(pos) − MS(por : s)]/n_r:s; σ²(pr : s) = [MS(pr : s) − MS(por : s)]/n_o; σ²(or : s) = [MS(or : s) − MS(por : s)]/n_p; σ²(por : s) = MS(por : s).
resulting performances. Facets usually are nested within other facets because of
resource or design considerations. In this investigation’s second G study, I nested
raters within the sequence facet because completely crossing sequence with all other
variables would have led to an unwieldy number of performances to evaluate (with
five sequences, 120). In addition, variables are analyzed as either fixed or random
effects. Because the performers, raters, occasions, and sequences in this study can be
thought of as being selected from a far larger theoretical population, I analyzed all
four as random effects.
G studies serve as the basis for making optimal decisions about the objects of
measurement (in the present study, the performers). In such D (decision) studies, an
investigator generalizes based on specified measurement procedures. One kind of
decision is relative; I as the investigator might wish to know where performers stand
in relation to other performers. A second type of decision is absolute, in which case
I am interested in whether performers meet established criteria. D studies pose "what if" questions by estimating reliability under different hypothetical scenarios, using
variance components established in G studies to make these estimations. They
explore what reliability might be for different theorized combinations of levels of the
facets (in the present study, different theorized combinations of occasions, raters,
and sequences). By doing so, an investigator identifies those combinations that result
in theoretically optimal reliability. D studies establish both relative (δ) error variance and absolute (Δ) error variance (cf. Table 3). From these error variances, two kinds of coefficients are determined. Essentially an enhanced intraclass correlation coefficient, the generalizability coefficient (Eρ²; Table 3, bottom panel), the ratio of performer variance, σ²(τ), to itself plus relative error variance, σ²(δ), is analogous to the reliability coefficient in classical test theory. (Throughout, I have used G theory standard notational conventions as found in Brennan, 2001.) A second coefficient, the index of dependability (Φ; Table 3, bottom panel), is the ratio of performer variance to itself plus absolute error variance, σ²(Δ). This latter coefficient, which takes into account all other sources of variance outside of the performer main effect, is not possible in classical test theory determinations of score reliability.
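These definitions translate directly into a small D-study calculator. The sketch below uses the standard random-model formulas for a p x O x R design; the function name and the variance components fed to it in the usage check are hypothetical, not the study's values (which appear in Tables 1 and 3).

```python
# Sketch of a p x O x R D study: relative (delta) and absolute (Delta)
# error variances, the generalizability coefficient E-rho^2, and the
# index of dependability Phi. Variance-component inputs are hypothetical.

def d_study_p_x_O_x_R(vc, n_o, n_r):
    """vc: G-study variance components keyed 'p', 'o', 'r', 'po', 'pr',
    'or', 'por'. n_o, n_r: hypothetical occasion and rater sample sizes."""
    # Relative error: interactions of the object of measurement with facets.
    rel_err = vc["po"] / n_o + vc["pr"] / n_r + vc["por"] / (n_o * n_r)
    # Absolute error adds the facet main effects and the facet interaction.
    abs_err = rel_err + vc["o"] / n_o + vc["r"] / n_r + vc["or"] / (n_o * n_r)
    e_rho2 = vc["p"] / (vc["p"] + rel_err)  # analogous to classical reliability
    phi = vc["p"] / (vc["p"] + abs_err)     # dependability, absolute decisions
    return e_rho2, phi
```

Because absolute error includes the facet main effects, Φ can never exceed Eρ² for the same scenario, which mirrors the gap between the two coefficients reported in the Results.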
For the present investigation, I formulated two G studies and one follow-up D
study. The first of the G studies used three completely crossed random variables-
performers, occasions, and raters-with the last two serving as the measurement
facets. In G theory symbolization, such a study is notated p x o x r (cf. Table 1).
Because previous research findings have made a strong case to include sequence as
an additional (random) facet and because design limitations required me to nest
raters within sequences, a second G study became necessary. Without the second G
study, I would not have been able to include sequence in this investigation and study
the raters as a main effect. The rather complex design of this second study in G
theory symbolization is p x (o x [r : s]) (cf. Table 2), read as "raters are nested within
sequences, both of which are completely crossed with occasions, and all three of which
are completely crossed with performers." For reasons discussed soon, the follow-up
Table 3
Random Effects p x O x R (Decisions on Occasions and Raters) Decision Study Design for Performance Evaluation Data
Note: α = effect; n′_r and n′_o = modifications of rater and occasion sample sizes, respectively; τ = object of measurement (here, performers); σ²(δ) = relative error variance; σ²(Δ) = absolute error variance.
a. Components of relative (δ) and absolute (Δ) error variance terms with o and r as random effects.
D study (presented in Table 3) modified the occasion and rater facets only, resulting in a p x O x R design. (Facets modified in D studies are assigned capital letters.)
Results
G Study 1: p x o x r
Table 1 displays the results of the first G study. The table reports outcomes by
effect (α); each effect’s df, sums of squares, and mean square; and the estimated variance component for each effect, σ²(α), obtained from mean squares via an algorithm
Brennan (2001) illustrated. (Contributions of the mean squares to each of the vari-
ance components via this algorithm can be found in the notes at the bottom of Tables
1 and 2.) Estimated variance components are presented for all main effects and inter-
actions among the three fully crossed variables. Statistically speaking, there was no
variability in the occasion and rater by occasion effects. After the collapsing of all
raters’ evaluations into the occasion effect, four of the performers’ ratings essentially
remained the same among the three occasions, another’s increased marginally, and the remaining three performers’ ratings actually decreased, but also marginally.
The most variability was found in the performer, rater, and performer by rater
effects. These performers clearly were at different levels of maturity; therefore, the
performer variability was anticipated. Rater variability, which ideally should have
been zero, was quite high. Furthermore, the raters’ rank ordering of the performers
clearly varied (cf. the pronounced pr effect). The three-way interaction’s variance
component was not large despite its confounding with the remaining unexplained
error. The p x o x r model explained a large share (94%) of variability in the ratings.
G Study 2: p x (o x [r : s])
Results of the second G study, which incorporated the sequence facet, are found
in Table 2. The main effect of sequence produced virtually no variability in the
model; neither did the performer by sequence, occasion by sequence, performer by
occasion by sequence, and occasion by rater within sequence interactions.
Apparently, it made little difference which sequence the raters evaluated within.
Because raters were nested within sequence, isolating a main effect for rater was
not possible. Variability among raters is located in the rater within sequence (r : s)
and performer by rater within sequence (pr : s) effects and, congruent with the first G study, is quite pronounced.
D Study: p x O x R
Table 3 presents findings for the D study. Because estimated variability owing to
sequence was effectively zero, I did not include sequence in the D study. The final
two rows, Eρ², the reliability-like generalizability coefficient, and Φ, the index of dependability, are the most important. Within the universe of generalizability established in this investigation, estimated reliability (Eρ²) for a hypothetical one-occasion, one-rater scenario was .47. Addition of a 2nd hypothetical rater substantially increased Eρ² to .69. To reach a benchmark of .80 (cf. Carmines & Zeller, 1979, who proposed this figure as a lower bound for good reliability), 5 hypothetical raters were needed. Eρ² increased only marginally as the number of hypothetical raters expanded beyond 5. Even 17 hypothetical raters did not push Eρ² beyond .90.
The more stringent index of dependability, Φ, for the one-occasion, one-rater scenario was .25. Introducing a variance component from the raters clearly had a depressing effect. Indeed, 4 hypothetical raters still resulted in a relatively low figure, .55. In contrast to Eρ², Φ increased steadily until about the 14th hypothetical rater. In the one-occasion scenarios, Φ did not reach the .80 benchmark until about the 17th hypothetical rater.
Discussion
2006; Bruine de Bruin, 2005), sequence also seemed to make little difference.
Performers were rated virtually the same way no matter where their performances
fell in the five randomly ordered sequences. The typical festival scenario, which per-
mits only one performance and schedules events more or less randomly (or at least
without considering performance level), might not necessarily compromise per-
formers’ ability to establish a "true" level of achievement.
On the other hand, measurement error originating from raters’ disagreements with one another and, to a lesser but still substantive extent, from their lack of rank-order consistency was substantial. An analysis along classical test theory lines would not have uncovered these discrepancies. Hoyt’s (1941) popular approach, for example, which subtracts variability owing
to rater inconsistency (represented by the mean square of the performer by rater
interaction) from performer variability (represented by the mean square of the per-
former main effect) and then places this figure over the performer variability, results
in a high interrater reliability figure of .93 ([1342.3 − 89.2]/1342.3; cf. Table 1).¹ In
terms of what actually happened among the present study’s raters, this reliability
figure is too high. It contrasts sharply with the index of dependability of .72 for 10
raters and one occasion, although less so with the more relaxed generalizability coef-
ficient (.84; cf. Table 3). Within the limited frame of classical test theory, Hoyt’s
technique is quite sound for continuous data. But in the present study, it might have
led to erroneous conclusions about the extent of rater agreement.
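The contrast is easy to verify arithmetically from the quantities just cited (this is a check of the reported figures, not a new analysis):

```python
# Hoyt's interrater reliability from the Table 1 mean squares cited above.
ms_p, ms_pr = 1342.3, 89.2
hoyt = (ms_p - ms_pr) / ms_p

# The G theory correlate (see Note 1): performer variance over itself
# plus relative error variance for 10 raters.
sigma2_tau, sigma2_delta = 125.3, 8.92
e_rho2 = sigma2_tau / (sigma2_tau + sigma2_delta)

print(round(hoyt, 2), round(e_rho2, 2))  # 0.93 0.93
```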
Hypothetical one-occasion, one-rater scenarios resulted in predictably low relia-
bility and dependability outcomes. The generalizability coefficient (akin to classical
test theory’s reliability coefficient) for this scenario (.47) was unacceptably low. The
more stringent index of dependability was markedly lower (.25). Adding a modest
number of raters (a hypothetical panel of five), however, brought the generalizabil-
ity coefficient to an acceptable benchmark (.80) and into consonance with other find-
ings (Fiske, 1983; Smith, 2004; Vasil, 1973). Generalizability theory, however,
extends beyond consistency. Are ratings that are consistent with one another in a relative sense sufficient, or should rater agreement in a more absolute sense also be considered? How strongly do other factors (e.g., variability among performers, performances taking place in a linear sequence, how often a performer is allowed to
display his or her "wares," etc.) influence either or both? For those looking beyond
classical reliability to other sources of variability in performance assessment, G
theory provides a method to address these questions.
G theory’s index of dependability (Φ) is quite stringent. Given the resource con-
straints found in most music assessment contexts, it is perhaps too stringent. The
great majority of the time, assembling panels of 15 to 20 adjudicators would not be
practical. On the other hand, performance assessment sometimes can have exceed-
ingly high stakes-acceptance into a prestigious conservatory, auditions for posi-
tions in professional orchestras or opera companies, competitions with large cash
prizes, and so forth. In these contexts, protocols for assessment and the assessors
themselves should be held to the highest available standards.
As the first of its kind, this study has some limitations. Of the 10 raters who par-
ticipated, 3 rated with more severity than did the others. At this point, there is no way
to know whether 3 of every 10 raters are this severe or 3 of every 100 are, and my
selection of raters, despite efforts to standardize their qualifications and to minimize
selection bias, was unfortunate. G theory estimations, of course, must hedge accord-
ingly. Furthermore, the novel use of a 100-point scale for global assessment may
have contributed to the lack of agreement. There is no way to demonstrate that it did
not. But global evaluation is usually quite consistent, and it was in the present study
as well. Judges who rated more severely rated consistently more severely. The stan-
dard evidence for a faulty rating protocol is inconsistency.
To help establish acceptable levels of agreement, further research should exam-
ine the effect of training evaluators to the assessment protocols they will use. Raters
should reach consensus on trial "anchor" performances before proceeding with the
main task. Along similar lines, in terms of the number of raters needed, further study
should broaden rater pools used to establish "universes" for generalizability. All
other things being equal, as sample size increases, performer variance remains the
same, variance among facets decreases, and generalizability coefficients and indices
of dependability increase (Brennan, 2001, p. 112). Broadening the pool of raters,
however, might involve trade-offs. The N increases, potentially narrowing statistical
confidence intervals, but at a potential sacrifice of standardization in terms of train-
ing and experience. Perhaps there are creative yet cost-efficient ways to increase the
number of qualified raters available for evaluated events. Raters need not be experi-
enced performers on a given examinee’s medium; raters should, however, have a
background in the same general family (Bergee, 2003; Fiske, 1975).
In summary, the present study’s findings suggest the possibility of substantive
measurement error among raters. Classical test theory indices of reliability are
unable to identify such crucial and potentially strong sources of error. As a conse-
quence, musicians at present might not always receive the consistency and depend-
ability of performance assessment that we would wish for them to receive.
Note
1. Generalizability (G) theory should and does have a correlate to Hoyt’s. Eρ² = σ²(τ)/[σ²(τ) + σ²(δ)], in this instance for n′_p = 8 and n′_r = 10, is analogous to classical test theory interrater reliability. With raters the only facet, σ²(δ) = σ²(pr)/n′_r = σ²(pr)/10. Like Hoyt’s performer by rater interaction mean square, MS(pr) estimates the extent to which performers are rank ordered differently by the different raters (Brennan, 2001). From Table 1, σ²(τ)/[σ²(τ) + σ²(δ)] = 125.3/(125.3 + 8.92), or .93.
References
Bergee, M. J. (1995). Primary and higher-order factors in a scale assessing concert band performance.
Bulletin of the Council for Research in Music Education, 126, 1-14.
Bergee, M. J. (2003). Faculty interjudge reliability of music performance evaluation. Journal of Research
in Music Education, 51, 137-148.
Bergee, M. J. (2006). Validation of a model of extramusical influences on solo and small-ensemble festi-
val ratings. Journal of Research in Music Education, 54, 244-256.
Bergee, M. J., & McWhirter, J. L. (2005). Selected influences on solo and small-ensemble festival rat-
ings: Replication and extension. Journal of Research in Music Education, 53, 177-190.
Bergee, M. J., & Platt, M. C. (2003). Influence of selected variables on solo and small-ensemble festival
ratings. Journal of Research in Music Education, 51, 342-353.
Bergee, M. J., & Westfall, C. R. (2005). Stability of a model explaining selected extramusical influences
on solo and small-ensemble festival ratings. Journal of Research in Music Education, 53, 253-271.
Mills, J. (1987). Assessment of solo musical performance: A preliminary study. Bulletin of the Council
for Research in Music Education, 91, 119-125.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory. New York: McGraw-Hill.
Radocy, R. E., & Boyle, J. D. (1987). Measurement and evaluation of musical experiences. New York:
Schirmer.
Sagin, D. P. (1983). The development and validation of a university band performance rating scale.
Journal of Band Research, 18(2), 1-11.
Schmidt, F. L., & Hunter, J. E. (1989). Interrater reliability coefficients cannot be computed when only
one stimulus is rated. Journal of Applied Psychology, 74, 368-370.
Smith, B. (2004). Five judges’ evaluation of audiotaped string performance. Bulletin of the Council for Research in Music Education, 160, 61-69.
Spearman, C. E. (1904). The proof and measurement of association between two things. American Journal
of Psychology, 15, 72-101.
Stanley, M., Brooker, R., & Gilbert, R. (2002). Examiner perceptions of using criteria in music perfor-
mance assessment. Research Studies in Music Education, 18, 43-52.
Thompson, S., & Williamon, A. (2003). Evaluating evaluation: Musical performance assessments as a
research tool. Music Perception, 21, 21-41.
Thompson, W. F., Diamond, C. T., & Balkwill, L. L. (1998). The adjudication of six performances of a
Chopin etude: A study of expert knowledge. Psychology of Music, 26, 154-174.
Vasil, T. (1973). The effects of systematically varying selected factors on music performance adjudica-
tion. Unpublished doctoral dissertation, University of Connecticut.
Wapnick, J., & Ekholm, E. (1997). Expert consensus in solo voice performance evaluation. Journal of Voice, 11, 429-436.
Wapnick, J., Flowers, P., Alegant, M., & Jasinskas, L. (1993). Consistency in piano performance evalua-
tion. Journal of Research in Music Education, 41, 282-292.
Wapnick, J., Ryan, C., Lacaille, N., & Darrow, A. A. (2004). Effects of selected variables on musicians’
ratings of high-level piano performances. International Journal of Music Education, 22, 7-20.
Wilson, V. E. (1977). Objectivity and effect of order of appearance in judging of synchronized swimming
meets. Perceptual and Motor Skills, 44, 295-298.
Martin J. Bergee is professor of music education and music therapy at the University of Kansas. His
research interests include music performance assessment.