Journal of Research in Music Education
http://jrm.sagepub.com
Performer, Rater, Occasion, and Sequence as Sources of Variability in Music Performance Assessment
Martin J. Bergee
Journal of Research in Music Education 2007; 55; 344
DOI: 10.1177/0022429408317515
The online version of this article can be found at:

http://jrm.sagepub.com/cgi/content/abstract/55/4/344
Published by:
http://www.sagepublications.com
On behalf of:
MENC: The National Association for Music Education
Performer, Rater, Occasion, and Sequence as Sources of Variability in Music Performance Assessment
Martin J. Bergee
University of Kansas
This study examined performer, rater, occasion, and sequence as sources of variability
in music performance assessment. Generalizability theory served as the study’s basis.
Performers were 8 high school wind instrumentalists who had recently performed a
solo. The author audio-recorded performers playing excerpts from their solo three
times, establishing an occasion variable. To establish a rater variable, 10 certified adju-
dicators were asked to rate the performances from 0 (poor) to 100 (excellent). Raters
were randomly assigned to one of five performance sequences, thus nesting raters
within a sequence variable. Two G (generalizability) studies established that occasion
and sequence produced virtually no measurement error. Raters were a strong source of
error. D (decision) studies established the one-rater, one-occasion scenario as unreli-
able. In scenarios using the generalizability coefficient as a criterion, 5 hypothetical
raters were necessary to reach the .80 benchmark. Using the dependability index, 17
hypothetical raters were necessary to reach .80.
Keywords: music; performance; assessment; generalizability theory
A series of recent studies has developed a model of selected extramusical variables’
influence on solo and small-ensemble festival ratings. The first three of these studies
(Bergee & McWhirter, 2005; Bergee & Platt, 2003; Bergee & Westfall, 2005)
established that performing as a soloist and entering from a large, metropolitan area,
relatively well-financed school led to high odds for success at a state-level adjudi-
cated festival. Serving as the validation phase, the fourth study (Bergee, 2006) veri-
fied the model’s ability to explain variability in festival ratings.
One unanticipated outcome of the fourth study, however, was that the model did
not meet Nunnally and Bernstein’s (1994) sufficiency criterion. In particular, the
model, although it demonstrated acceptable external and internal validity, accounted
for a relatively small proportion of the variance in ratings and thus contained a
sizeable error term. This suggests that a substantial amount of measurement error
might have been present.
Please address correspondence to Martin J. Bergee, University of Kansas, School of Fine Arts, Department
of Music and Dance, Murphy Hall, 1530 Naismith Drive, Room 460, Lawrence, KS 66045-3102; e-mail:
mbergee@ku.edu.
Because the independent variables in these four studies were dichotomized and
scrutinized carefully for categorization errors, a great deal of measurement error
among them was unlikely. On the other hand, the dependent variable, adjudicator
ratings, was potentially rife with measurement error. The reliability of festival adju-
dication has been called into question over a long span of time (e.g., to cite a few,
Burnsed, Hinkle, & King, 1985; Fiske, 1983; Hare, 1960; Thompson & Williamon,
2003). To date, however, the issue remains underinvestigated, in part because of the
difficulties involved in identifying sources of measurement error. Addressing concerns
about festival adjudication requires comprehensive and psychometrically sound
approaches to determining these sources of error. Approaches common in performance
assessment (calculating interrater reliability, for example) lack the requisite
level of sophistication.
Unresolved issues surround the measurement purposes of festival adjudication
and similar approaches to performance assessment. Is the festival experience pri-
marily an opportunity for students to perform in public up to designated standards,
or do festivals evaluate young performers’ achievement so that useful suggestions for
improvement can be made? Both purposes, one more summative and the other more
formative, have merit, and they are not wholly incompatible.
The latter purpose perhaps is more defensible pedagogically, especially for youth
without a great deal of performance experience. If music educators are to accept this
more formative purpose as central, then in a measurement sense an individual’s true
performance level, that is, a rendering of the music that faultlessly expresses his or
her achievement level at a precise moment in time, should encounter true assessment:
the consensus of recommendations for improvement from a large (infinite, theoretically)
pool of qualified raters. The reality of adjudication is of course no match for
this ideal.
Stated in psychometric terms, adjudication’s measurement concerns (some of the
more pressing at least) involve the extent to which (a) a single performance represents
a given performer’s actual state of achievement, that is, his or her hypothetical true
score; (b) a single adjudicator, despite a tight schedule, fatigue, and a myriad of other
obstacles, is able to discern this true score, that is, to evaluate each entrant with perfect
reliability and validity; (c) performers’ serial position potentially influences an
adjudicator’s ability to evaluate multiple events fairly across time; and (d) these phe-
nomena and others might interact. Because these issues remain unresolved, teachers
and students understandably conjecture that festival adjudication is unreliable.
Research findings have lent substance to their concerns. Researchers have
frequently concluded that more than one adjudicator is necessary for good reliabil-
ity (e.g., to cite only a few, Bergee, 2003; Fiske, 1983; Sagin, 1983; Vasil, 1973).
The one-adjudicator model remains the norm, however, especially in solo and
small-ensemble festival evaluation. Other adjudicator issues are present as well. How
experienced should judges be? Some studies have used student evaluators with accept-
able results (e.g., Bergee, 1995; Wapnick, Ryan, Lacaille, & Darrow, 2004), but
students apparently do not have enough expertise to validly assess high-level perfor-
mance (Thompson, Diamond, & Balkwill, 1998). Another issue is interrater consis-
tency versus agreement. Authors of most studies, if they report anything at all, usually
report interrater consistency in the form of correlation coefficients. (A notable excep-
tion is Sagin, 1983, who used analysis of variance.) Correlation coefficients, however,
are insensitive to differences in rater agreement. Two raters with a similar contour will
correlate highly, even if one rates far more stringently than does the other.
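The distinction between consistency and agreement can be illustrated with a minimal sketch. The scores below are hypothetical (they are not data from this study), and `pearson_r` is a helper written only for this illustration. Two raters share the same contour, yet one is 15 points more stringent throughout.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings of 8 performers on the 0-100 scale (illustrative only).
lenient_rater = [92, 85, 78, 88, 70, 95, 81, 74]
# A more stringent rater with the same contour: 15 points lower every time.
stringent_rater = [score - 15 for score in lenient_rater]

# Consistency: the correlation ignores the constant difference in severity.
consistency = pearson_r(lenient_rater, stringent_rater)
# Agreement: the average gap between the two raters' scores.
mean_gap = sum(a - b for a, b in zip(lenient_rater, stringent_rater)) / len(lenient_rater)

print(f"correlation = {consistency:.2f}, mean gap = {mean_gap:.1f} points")
```

The correlation is perfect even though the raters never award the same score, which is why a consistency coefficient alone says little about agreement.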
Beyond the raters themselves, sequence issues persistently arise. In music con-
texts, sequence effects have been found in daylong sequences (e.g., Bergee, 2006;
Flôres & Ginsburgh, 1996), sequences of intermediate length (Wapnick, Flowers,
Alegant, & Jasinskas, 1993), and even among pairs (Duerksen, 1972). Specifically,
a tendency to evaluate later performances more leniently has been noted. These
effects have been found in performance areas outside of music too (e.g., figure skat-
ing, Bruine de Bruin, 2005; synchronized swimming, Wilson, 1977).
I found no studies of the extent to which performances were judged to remain
consistent across repeated trials. (A closely related issue is interrater reliability
derived from a single stimulus, which has been studied. James, Demaree, and Wolf,
1984, 1993, developed a reliability formula for this scenario, although Schmidt and
Hunter, 1989, argued that multiple stimuli continue to be necessary.) Multiple trials
(that is, the same individual performing the same task several times) are critical to
establishing the true score, which is estimated from observed scores assumed to be
normally distributed around the true score. Therefore, other things being equal, a
larger sample of behaviors should yield a more accurate estimate of a given per-
former’s true level of achievement. In most music assessment contexts, however, the
performers play or sing only once.
In brief, error in performance measurement has been shown to originate from
multiple sources. Classical test theory approaches to determining score reliability,
however, are not capable of identifying and untangling this profusion of error.
Classical reliability was not conceptualized to do this; it accounts for only one error
source, the consistency (or lack thereof) with which raters evaluate a set of perfor-
mances. Other potential sources remain, but as undifferentiated error. A more
advanced method is needed, one capable of accommodating multiple sources of
error and of placing findings into theoretical contexts beyond the local panels of
raters found in individual studies.

We have such a method. An important extension of thinking about reliability, gen-
eralizability theory (G theory) has distinct advantages over classical test theory
(Kieffer, 1999). Specifically, generalizability theory is able to encompass multiple
sources of measurement error simultaneously, account for interaction and main
effect measurement error, and estimate reliability-like coefficients in both relative
(akin to classical test theory reliability) and absolute senses. G theory is considered
a modern measurement theory, in contrast to the more classical approaches devel-
oped in the early 20th century (e.g., Spearman, 1904). Despite its recent emergence,
G theory is “perhaps the most broadly defined measurement model currently in existence,
and ... represents a major contribution to psychometrics” (Brennan, 2001,
p. vii). It is especially well suited to evaluating ratings of human performance
(Nunnally & Bernstein, 1994).
G theory seems an ideal utility for examining multiple sources of error in music
performance measurement. With this study, I applied G theory principles to the
determination of measurement error in evaluations of music performance.
Specifically, I applied these principles to study four sources of error: performers,
raters, multiple trials (hereafter occasions), and sequences.
Method
Performers
To control for variability owing strictly to different types of performers or performances,
I used only soloists and only woodwind and brass instrumentalists. All
were high school soloists (Grades 10 to 12) who had received a Superior (I) rating
at the district level and had gone on to perform at the state level. Ideally, I would
have randomly selected performers from the sum total of all who played wind solos
at this state festival. Because this was not feasible, I randomly selected schools of
three sizes, one large (size classification 5A), one medium (3A), and one small (1A),
from the communities surrounding my university.
I contacted these three schools’ instrumental music teachers and learned that only
wind soloists from the 5A school had received a Superior rating at the district level,
which accorded them eligibility to perform at the state level. Rather than attempt to
locate other schools, I remained faithful to the original random selection process and
recruited participants only from the 5A school’s wind soloists. Doing so also helped
to minimize such additional sources of unwanted variability as extreme heterogene-
ity of performance quality, geographical location of the community, differences in
quality of instruction among their band directors, and so forth. After obtaining per-
mission to conduct the study, I spoke with the eligible wind soloists at this school
about participating; 2 of the 11, however, were absent on that day. With the 9 who
were present, I discussed what participation would entail. I let them know that their
participation was strictly voluntary and that no penalty would be attached to non-
participation. All 9 agreed to participate. However, 1 later withdrew owing to a
scheduling conflict on the day of data collection, which left a total of 8 performers:
3 flutists, 2 clarinetists, 1 alto saxophonist, 1 trumpeter, and 1 tubist.
Because performance levels decline at different rates for different performers, I
recorded all performers the morning after the state festival. I recorded the perfor-
mances with a Sony MZ-NH900 portable minidisc recorder and Sony ECM-MS907
electret condenser microphone. I recorded only about the first quarter of each per-
former’s solo. Vasil (1973) found that this proportion of a total composition was suf-
ficient for rating purposes. I recorded each performer playing his or her excerpt three
times in succession, with a brief break of about 1 minute between each occasion. I
asked the performers not to stop and not to speak during the recording process, but
I also mentioned that if either happened I would delete that track and begin another
one. Before each session, I again reminded performers that they were free to with-
draw at any time, including after the session began. All 8 students seemed comfort-
able with the process, and all completed their session without difficulties.
Raters
With an eye to minimizing variability among raters as much as possible, I
approached the official of our state’s high school activities association responsible
for music events and asked her for a roster of all woodwind and brass judges who
had undergone the association’s official training for adjudicators and thus were eli-
gible to judge in this state. She agreed to supply me with this roster. I then identified
all certified adjudicators within a reasonable driving distance of my university and
randomly selected 10. I contacted these individuals and asked them to serve as raters
for the study. All agreed. Of the 10 raters, 8 were current or retired members of uni-
versity faculties, and 2 were public school band directors, 1 of whom had recently
retired. Years of teaching experience ranged from 8 to “40-plus.”
Evaluation Procedures
Because this study requires a continuous dependent variable (for ANOVA purposes,
as explained in the following), I asked the raters to evaluate the 24 performances
globally from 0 (very poor) to 100 (excellent). Whether music performance
is better evaluated using global or specific protocols has been a source of disagreement
(e.g., Stanley, Brooker, & Gilbert, 2002). Some (e.g., Fiske, 1977; Mills, 1987)
have argued for a global approach, whereas others (e.g., Bergee, 2003; Thompson &
Williamon, 2003) have found good interrater reliability among a limited number of
subscales. In the latter two studies, the subscale scores correlated highly with over-
all scores. Wapnick and Ekholm (1997), who found a similar pattern, suggested that
raters first form an overall impression and then respond to individual scale items
accordingly. Radocy and Boyle (1987) suggested that the approach to assessment
should depend on the function of the assessment. Because studying raters, not per-
formers or performances per se, was the function of the present study, a global
assessment seemed the more suitable choice.
Each rater evaluated independently, using as a playback unit the same electronic
device on which the performances had been recorded and with a set of Sony MDR-
027 headphones. I was present at all rating sessions and operated the playback
equipment. Earlier, I had reordered the 24 performances into five different random
sequences. I attempted no stratifications or other manipulations of the sequences. I
had randomly assigned each of the raters to one of the five sequences; accordingly,
two raters evaluated within each of the sequences.
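The ordering and assignment procedure just described can be sketched in a few lines. This is a hedged illustration only: the function name `build_design`, the seed, and the integer labels for performances and raters are assumptions made for the example, not details of the procedure actually used.

```python
import random


def build_design(n_performances=24, n_sequences=5, n_raters=10, seed=0):
    """Return random playback orders and a rater-to-sequence assignment."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    performances = list(range(1, n_performances + 1))

    # Each sequence is an unstratified random reordering of all performances.
    sequences = []
    for _ in range(n_sequences):
        order = performances[:]
        rng.shuffle(order)
        sequences.append(order)

    # Shuffle the raters, then deal them out so that every sequence
    # receives the same number of raters (raters nested within sequences).
    raters = list(range(1, n_raters + 1))
    rng.shuffle(raters)
    assignment = {index: [] for index in range(n_sequences)}
    for position, rater in enumerate(raters):
        assignment[position % n_sequences].append(rater)
    return sequences, assignment


sequences, assignment = build_design()
for index, rater_ids in assignment.items():
    print(f"sequence {index + 1}: raters {rater_ids}")
```

With 10 raters and 5 sequences, each sequence is heard by exactly two raters, which is what makes rater a nested rather than a crossed facet in the second G study.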
The raters first read a letter that explained the purpose of the study and provided
information about the task. I supplemented the letter by asking the raters to use
whatever criteria they were comfortable with, so long as they attempted to remain
consistent within themselves. I cautioned raters that they would listen to each per-
formance only once and that when they had scored a performance and moved on to
another one, they could not later return to change their score. I reminded raters that
the performers would not receive these scores. The rating sessions went smoothly,
each taking about 45 minutes to complete.
Theoretical Framework
G theory allows for precise identification of sources of measurement error. One
obvious source is the persons undergoing evaluation (in this study, the 8 performers
[designated p; cf. tables]). The variance representing that error, known in G theory
parlance as universe score variance, is expected and can be considered analogous to
true score variance in classical test theory. G theory also specifies conditions of measurement,
or facets, as sources of error variance. Such variance, analogous to error
variance in classical test theory, is not desirable. In this study, occasions, raters, and
sequences (designated o, r, and s, respectively; cf. the Effect columns in Tables 1 and 2)
comprised the measurement facets.
G theory estimates variance components by means of random- or mixed-effects
analysis of variance designs. G theory, however, does not use ANOVA for the usual
hypothesis testing purposes; that is, no F values are generated. Instead, G theory uses
ANOVA in a more denotative sense, as an “analysis of statistical variance in ...
order to determine the contributions of given factors or variables to the variance”
(Mish, 2003, p. 44). In G theory, the interchangeability of facet levels (in the present
study, the different raters, occasions, and sequences) becomes an empirical question.
Efforts to address such questions are known as G (generalizability) studies.
G study designs can have either crossed or nested variables. In a completely
crossed design, each variable is fully represented in all other variables. In the first of
the G studies in the present investigation, performers were completely crossed with
the occasion and rater facets, which were completely crossed with one another. In
other words, each participating soloist played his or her designated excerpt three
times (which defined the occasion facet), and each of the 10 raters evaluated all 24
resulting performances. Facets usually are nested within other facets because of
resource or design considerations. In this investigation’s second G study, I nested
raters within the sequence facet because completely crossing sequence with all other
variables would have led to an unwieldy number of performances to evaluate (with
five sequences, 120). In addition, variables are analyzed as either fixed or random
effects. Because the performers, raters, occasions, and sequences in this study can be
thought of as being selected from a far larger theoretical population, I analyzed all
four as random effects.
Table 1
ANOVA for $p \times o \times r$ (Performer by Occasion by Rater, Completely Crossed)
Generalizability Study 1 Design for Performance Evaluation Data
Note: Estimated variance components:
$\sigma^2(p) = [MS(p) - MS(po) - MS(pr) + MS(por)]/(n_o n_r)$;
$\sigma^2(o) = [MS(o) - MS(po) - MS(or) + MS(por)]/(n_p n_r)$;
$\sigma^2(r) = [MS(r) - MS(pr) - MS(or) + MS(por)]/(n_p n_o)$;
$\sigma^2(po) = [MS(po) - MS(por)]/n_r$;
$\sigma^2(pr) = [MS(pr) - MS(por)]/n_o$;
$\sigma^2(or) = [MS(or) - MS(por)]/n_p$;
$\sigma^2(por) = MS(por)$.
a. Estimated variance component is negative and reset to 0.
Table 2
ANOVA for $p \times (o \times [r:s])$ (Performer by Occasion by Rater Nested Within Sequence)
Generalizability Study 2 Design for Performance Evaluation Data
Note: Estimated variance components:
$\sigma^2(p) = [MS(p) - MS(po) - MS(ps) - MS(pos) - MS(pr:s) + MS(por:s)]/(n_o n_s n_{r:s})$;
$\sigma^2(o) = [MS(o) - MS(po) - MS(os) - MS(pos) - MS(or:s) + MS(por:s)]/(n_p n_s n_{r:s})$;
$\sigma^2(s) = [MS(s) - MS(ps) - MS(os) - MS(r:s) - MS(pos) - MS(pr:s) - MS(or:s) + MS(por:s)]/(n_p n_o n_{r:s})$;
$\sigma^2(po) = [MS(po) - MS(pos) + MS(por:s)]/(n_s n_{r:s})$;
$\sigma^2(ps) = [MS(ps) - MS(pos) - MS(pr:s) + MS(por:s)]/(n_o n_{r:s})$;
$\sigma^2(os) = [MS(os) - MS(pos) - MS(or:s) + MS(por:s)]/(n_p n_{r:s})$;
$\sigma^2(r:s) = [MS(r:s) - MS(pr:s) - MS(or:s) + MS(por:s)]/(n_p n_o)$;
$\sigma^2(pos) = [MS(pos) - MS(por:s)]/n_{r:s}$;
$\sigma^2(pr:s) = [MS(pr:s) - MS(por:s)]/n_o$;
$\sigma^2(or:s) = [MS(or:s) - MS(por:s)]/n_p$;
$\sigma^2(por:s) = MS(por:s)$.
a. Estimated variance component is negative and reset to 0.
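For the completely crossed p × o × r design, the route from ratings to variance components can be sketched as follows. This is an illustrative, assumption-laden sketch rather than the software used for the analysis: the function `variance_components`, the nested-list layout `x[p][o][r]`, and the demo scores are all hypothetical. The mean-square combinations follow the Table 1 note, and negative estimates are reset to 0.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def variance_components(x):
    """Estimate p x o x r variance components from scores x[p][o][r]."""
    n_p, n_o, n_r = len(x), len(x[0]), len(x[0][0])
    perf, occ, rat = range(n_p), range(n_o), range(n_r)

    # Grand mean and marginal means for every main effect and two-way cell.
    grand = mean(x[p][o][r] for p in perf for o in occ for r in rat)
    m_p = [mean(x[p][o][r] for o in occ for r in rat) for p in perf]
    m_o = [mean(x[p][o][r] for p in perf for r in rat) for o in occ]
    m_r = [mean(x[p][o][r] for p in perf for o in occ) for r in rat]
    m_po = [[mean(x[p][o][r] for r in rat) for o in occ] for p in perf]
    m_pr = [[mean(x[p][o][r] for o in occ) for r in rat] for p in perf]
    m_or = [[mean(x[p][o][r] for p in perf) for r in rat] for o in occ]

    # Sums of squares for the seven effects of the crossed design.
    ss = {
        "p": n_o * n_r * sum((m - grand) ** 2 for m in m_p),
        "o": n_p * n_r * sum((m - grand) ** 2 for m in m_o),
        "r": n_p * n_o * sum((m - grand) ** 2 for m in m_r),
        "po": n_r * sum((m_po[p][o] - m_p[p] - m_o[o] + grand) ** 2
                        for p in perf for o in occ),
        "pr": n_o * sum((m_pr[p][r] - m_p[p] - m_r[r] + grand) ** 2
                        for p in perf for r in rat),
        "or": n_p * sum((m_or[o][r] - m_o[o] - m_r[r] + grand) ** 2
                        for o in occ for r in rat),
        "por": sum((x[p][o][r] - m_po[p][o] - m_pr[p][r] - m_or[o][r]
                    + m_p[p] + m_o[o] + m_r[r] - grand) ** 2
                   for p in perf for o in occ for r in rat),
    }
    df = {
        "p": n_p - 1, "o": n_o - 1, "r": n_r - 1,
        "po": (n_p - 1) * (n_o - 1),
        "pr": (n_p - 1) * (n_r - 1),
        "or": (n_o - 1) * (n_r - 1),
        "por": (n_p - 1) * (n_o - 1) * (n_r - 1),
    }
    ms = {effect: ss[effect] / df[effect] for effect in ss}

    # Mean-square combinations as given in the Table 1 note.
    raw = {
        "p": (ms["p"] - ms["po"] - ms["pr"] + ms["por"]) / (n_o * n_r),
        "o": (ms["o"] - ms["po"] - ms["or"] + ms["por"]) / (n_p * n_r),
        "r": (ms["r"] - ms["pr"] - ms["or"] + ms["por"]) / (n_p * n_o),
        "po": (ms["po"] - ms["por"]) / n_r,
        "pr": (ms["pr"] - ms["por"]) / n_o,
        "or": (ms["or"] - ms["por"]) / n_p,
        "por": ms["por"],
    }
    # Negative estimates are reset to 0.
    return {effect: max(0.0, value) for effect, value in raw.items()}


# Hypothetical scores: 3 performers x 2 occasions x 2 raters (not study data).
demo_scores = [
    [[80, 70], [82, 71]],
    [[90, 85], [88, 86]],
    [[60, 40], [62, 45]],
]
components = variance_components(demo_scores)
for effect, value in components.items():
    print(f"sigma^2({effect}) = {value:.2f}")
```

The same bookkeeping extends to the nested design of Table 2, with rater effects replaced by rater-within-sequence effects; that extension is not shown here.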
G studies serve as the basis for making optimal decisions about the objects of
measurement (in the present study, the performers). In such D (decision) studies, an
investigator generalizes based on specified measurement procedures. One kind of
decision is relative; I as the investigator might wish to know where performers stand
in relation to other performers. A second type of decision is absolute, in which case
I am interested in whether performers meet established criteria. D studies pose “what
if” questions by estimating reliability under different hypothetical scenarios, using
variance components established in G studies to make these estimations. They
explore what reliability might be for different theorized combinations of levels of the
facets (in the present study, different theorized combinations of occasions, raters,
and sequences). By doing so, an investigator identifies those combinations that result
in theoretically optimal reliability. D studies establish both relative ($\delta$) error variance
and absolute ($\Delta$) error variance (cf. Table 3). From these error variances, two
kinds of coefficients are determined. Essentially an enhanced intraclass correlation
coefficient, the generalizability coefficient ($E\rho^2$; Table 3, bottom panel), the ratio of
performer variance, $\sigma^2(\tau)$, to itself plus relative error variance, $\sigma^2(\delta)$, is analogous to
the reliability coefficient in classical test theory. (Throughout, I have used G theory
standard notational conventions as found in Brennan, 2001.) A second coefficient,
the index of dependability ($\Phi$; Table 3, bottom panel), is the ratio of performer variance
to itself plus absolute error variance, $\sigma^2(\Delta)$. This latter coefficient, which takes
into account all other sources of variance outside of the performer main effect, is not
possible in classical test theory determinations of score reliability.
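Written out for a random-effects design with performers crossed with occasions and raters, the two ratios take the following form. This is a sketch in standard G theory notation; the primed sample sizes $n'_o$ and $n'_r$ denote the hypothetical numbers of occasions and raters in a D study scenario, and the grouping of components into relative and absolute error assumes the completely crossed case.

```latex
\begin{align*}
\sigma^2(\delta) &= \frac{\sigma^2(po)}{n'_o} + \frac{\sigma^2(pr)}{n'_r}
                  + \frac{\sigma^2(por)}{n'_o\, n'_r} \\[4pt]
\sigma^2(\Delta) &= \sigma^2(\delta) + \frac{\sigma^2(o)}{n'_o}
                  + \frac{\sigma^2(r)}{n'_r} + \frac{\sigma^2(or)}{n'_o\, n'_r} \\[4pt]
E\rho^2 &= \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\delta)}, \qquad
\Phi = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\Delta)}
\end{align*}
```

Because $\sigma^2(\Delta)$ contains every term in $\sigma^2(\delta)$ plus the facet main effects and their interaction, $\Phi$ can never exceed $E\rho^2$.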
For the present investigation, I formulated two G studies and one follow-up D
study. The first of the G studies used three completely crossed random variables
(performers, occasions, and raters), with the last two serving as the measurement
facets. In G theory symbolization, such a study is notated $p \times o \times r$ (cf. Table 1).
Because previous research findings have made a strong case to include sequence as
an additional (random) facet and because design limitations required me to nest
raters within sequences, a second G study became necessary. Without the second G
study, I would not have been able to include sequence in this investigation and study
the raters as a main effect. The rather complex design of this second study in G
theory symbolization is $p \times (o \times [r:s])$ (cf. Table 2), read as “raters are nested within
sequences, both of which are completely crossed with occasions, and all three of which
are completely crossed with performers.” For reasons discussed soon, the follow-up
D study (presented in Table 3) modified the occasion and rater facets only, resulting
in a $p \times O \times R$ design. (Facets modified in D studies are assigned capital letters.)
Table 3
Random Effects $p \times O \times R$ (Decisions on Occasions and Raters) Decision
Study Design for Performance Evaluation Data
Note: $\alpha$ = effect; $n'_r$ and $n'_o$ = modifications of rater and occasion sample sizes, respectively;
$\tau$ = object of measurement (here, $\tau$ = performers); $\sigma^2(\delta)$ = relative error variance;
$\sigma^2(\Delta)$ = absolute error variance; $E\rho^2$ = generalizability coefficient; $\Phi$ = index of dependability.
a. Components of relative ($\delta$) and absolute ($\Delta$) error variance terms with o and r as random effects.
Statistical Analysis and Interpretation

Brennan (2001) recommended interpreting G study variance components in the
following manner: Assume that for each performer in the population an investigator
was able to obtain this performer’s mean score over all occasions and all raters in the
universe of admissible observations. Thus, $\sigma^2(p)$ is the estimated variance of these
mean scores over the population of performers. Variance components for the occasion
and rater facets can be interpreted similarly. Approximate statements can be
made about interaction variance components; $\sigma^2(pr)$, for example, estimates the
extent to which performers are ranked differently by different raters.
Variance components as estimates have sampling variability. Consequently, a
variance component might be expressed as a negative number, although negative
variance is not possible. An estimated variance component close to 0 would have
sampling variability around 0. The suggested procedure is to reset the negative estimate
to 0 (Brennan, 2001). As specified in Tables 1 and 2, I did so for the effects
with negative variance components.
D studies forecast the expected reliability and dependability among a number of
different hypothetical scenarios. In the present study, I modified the number of occasions,
$n'_o$, and especially the number of raters, $n'_r$. (In G theory, primes are used to
indicate facets whose sample sizes undergo modification.) G study variance components
are the sources of D study components: the G study components (found in
Table 1) are divided by the given facet’s or facets’ modifications of sample size. For
example, $\sigma^2(R) = \sigma^2(r)/n'_r$ and $\sigma^2(OR) = \sigma^2(or)/(n'_o n'_r)$, but $\sigma^2(pR) = \sigma^2(pr)/n'_r$ because
performers as the object of measurement do not undergo modification. The second
column of Table 3 specifies whether a given effect’s variance component is included
in the calculation of the relative ($\delta$) error variance, absolute ($\Delta$) error variance, or
both. From the relative and absolute error variance terms, respectively, the reliability-like
generalizability coefficient ($E\rho^2$) and the index of dependability ($\Phi$) are
determined.
A stringent interrater agreement criterion, the index of dependability encom-
passes variability arising from all main effects except performers and all interactions,
in the present study both those that involved the performer effect (po, pr, por, components
of the generalizability coefficient) and those that did not (the remainder: o,
r, or).
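The D study arithmetic described above can be sketched as a small function. The variance components in `hypothetical_vc` are invented for illustration and are not the estimates reported in Table 1; the function name `d_study` and its argument names are likewise assumptions of this sketch.

```python
def d_study(vc, n_o=1, n_r=1):
    """Return (generalizability coefficient, index of dependability).

    vc maps effect labels of the p x o x r design to G study variance
    components; n_o and n_r are the hypothetical D study sample sizes.
    """
    # Relative error: interactions that involve the performer effect.
    relative = vc["po"] / n_o + vc["pr"] / n_r + vc["por"] / (n_o * n_r)
    # Absolute error: relative error plus the remaining effects (o, r, or).
    absolute = relative + vc["o"] / n_o + vc["r"] / n_r + vc["or"] / (n_o * n_r)
    g_coefficient = vc["p"] / (vc["p"] + relative)
    dependability = vc["p"] / (vc["p"] + absolute)
    return g_coefficient, dependability


# Hypothetical variance components (illustrative only, not Table 1 values).
hypothetical_vc = {"p": 45.0, "o": 0.0, "r": 60.0,
                   "po": 0.0, "pr": 30.0, "or": 0.0, "por": 10.0}

for raters in (1, 2, 4, 8):
    g, phi = d_study(hypothetical_vc, n_o=1, n_r=raters)
    print(f"{raters} rater(s): E rho^2 = {g:.2f}, Phi = {phi:.2f}")
```

In this made-up example the rater main effect is large, so the index of dependability trails the generalizability coefficient at every panel size, mirroring the qualitative pattern the two coefficients are designed to expose.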
Results
G Study 1: $p \times o \times r$
Table 1 displays the results of the first G study. The table reports outcomes by
effect ($\alpha$); each effect’s df, sums of squares, and mean square; and the estimated variance
component for each effect, $\sigma^2(\alpha)$, obtained from mean squares via an algorithm
Brennan (2001) illustrated. (Contributions of the mean squares to each of the vari-
ance components via this algorithm can be found in the notes at the bottom of Tables
1 and 2.) Estimated variance components are presented for all main effects and inter-
actions among the three fully crossed variables. Statistically speaking, there was no
variability in the occasion and rater by occasion effects. After the collapsing of all
raters’ evaluations into the occasion effect, four of the performers’ ratings essentially
remained the same among the three occasions, another’s increased marginally, and
the remaining three others’ actually decreased, but also marginally.
The most variability was found in the performer, rater, and performer by rater
effects. These performers clearly were at different levels of maturity; therefore, the
performer variability was anticipated. Rater variability, which ideally should have
been zero, was quite high. Furthermore, the raters’ rank ordering of the performers
clearly varied (cf. the pronounced pr effect). The three-way interaction’s variance
component was not large despite its confounding with the remaining unexplained
error. The p x o x r model explained a large share (94%) of variability in the ratings.
(This figure can be obtained from the sums of squares column.)
G Study 2: $p \times (o \times [r:s])$
Results of the second G study, which incorporated the sequence facet, are found
in Table 2. The main effect of sequence produced virtually no variability in the
model; neither did the performer by sequence, occasion by sequence, performer by
occasion by sequence, and occasion by rater within sequence interactions.
Apparently, it made little difference which sequence the raters evaluated within.
Because raters were nested within sequence, isolating a main effect for rater was
not possible. Variability among raters is located in the rater within sequence (r : s)
and performer by rater within sequence (pr : s) effects and, congruent with the first
G study, was quite pronounced.
D Study: $p \times O \times R$
Table 3 presents findings for the D study. Because estimated variability owing to
sequence was effectively zero, I did not include sequence in the D study. The final
two rows, $E\rho^2$, the reliability-like generalizability coefficient, and $\Phi$, the index of
dependability, are the most important. Within the universe of generalizability established
in this investigation, estimated reliability ($E\rho^2$) for a hypothetical one-occasion,
one-rater scenario was .47. Addition of a 2nd hypothetical rater substantially
increased $E\rho^2$ to .69. To reach a benchmark of .80 (cf. Carmines & Zeller, 1979, who
proposed this figure as a lower bound for good reliability), 5 hypothetical raters were
needed. $E\rho^2$ increased only marginally as the number of hypothetical raters expanded
beyond 5. Even 17 hypothetical raters did not push $E\rho^2$ beyond .90.
The more stringent index of dependability, $\Phi$, for the one-occasion, one-rater scenario
was .25. Introducing a variance component from the raters clearly had a
depressing effect. Indeed, 4 hypothetical raters still resulted in a relatively low
figure, .55. In contrast to $E\rho^2$, $\Phi$ increased steadily until about the 14th hypothetical
rater. In terms of $\Phi$, the one-occasion scenarios did not reach the .80 benchmark until
about the 17th hypothetical rater.
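The "how many raters are needed" question amounts to a simple search over panel sizes. The sketch below assumes the crossed random-effects error terms and uses invented variance components, so its answers (4 and 9 raters) are properties of the made-up numbers, not a reproduction of Table 3. The function names and the `max_raters` cap are assumptions of this illustration.

```python
def coefficients(vc, n_o, n_r):
    """Generalizability coefficient and index of dependability for p x O x R."""
    relative = vc["po"] / n_o + vc["pr"] / n_r + vc["por"] / (n_o * n_r)
    absolute = relative + vc["o"] / n_o + vc["r"] / n_r + vc["or"] / (n_o * n_r)
    return vc["p"] / (vc["p"] + relative), vc["p"] / (vc["p"] + absolute)


def raters_needed(vc, target=0.80, n_o=1, absolute=False, max_raters=100):
    """Smallest number of raters whose coefficient reaches the target."""
    for n_r in range(1, max_raters + 1):
        g_coefficient, dependability = coefficients(vc, n_o, n_r)
        value = dependability if absolute else g_coefficient
        if value >= target:
            return n_r
    return None  # benchmark not reached within max_raters


# Hypothetical variance components (illustrative only, not Table 1 values).
hypothetical_vc = {"p": 45.0, "o": 0.0, "r": 60.0,
                   "po": 0.0, "pr": 30.0, "or": 0.0, "por": 10.0}

relative_panel = raters_needed(hypothetical_vc, absolute=False)
absolute_panel = raters_needed(hypothetical_vc, absolute=True)
print(f"raters for E rho^2 >= .80: {relative_panel}")
print(f"raters for Phi >= .80: {absolute_panel}")
```

As in the reported results, the absolute criterion demands a noticeably larger panel than the relative one whenever raters differ in overall severity.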
Discussion
This project defined sources of measurement error in ratings of music performance.
Two generalizability studies demonstrated that occasion and sequence facets
contributed negligibly to measurement error. Raters on the other hand were respon-
sible for a sizeable share of error. Decision study scenarios demonstrated that a
hypothetical one-occasion, one-rater model was not reliable or dependable. Adding
hypothetical raters soon brought reliability to more acceptable levels. Dependability,
the more stringent criterion, seemed less amenable to improvement as hypothetical
raters were added, presumably owing to the lack of agreement among raters.
Contrary to my expectations, there were no incremental or decremental changes
in rated performance quality across the three occasions. In contrast to some other
kinds of performers (e.g., pole vaulters, who vault three times at each height), musi-
cians perform only once. The present study provided no evidence that multiple trials
serve as a better index of achievement. In contrast to the literature (e.g., Bergee,
2006; Bruine de Bruin, 2005), sequence also seemed to make little difference.
Performers were rated virtually the same way no matter where their performances
fell in the five randomly ordered sequences. The typical festival scenario, which per-
mits only one performance and schedules events more or less randomly (or at least
without considering performance level), might not necessarily compromise performers’ ability to establish a “true” level of achievement.
On the other hand, measurement error originating from raters’ disagreements with
one another and, to a lesser but still substantive extent, their lack of rank order consistency overwhelmed other sources of variance, even performer variance. There
were pronounced discrepancies in terms of rater severity. A standard reliability
analysis along classical test theory lines would not have uncovered these discrepancies. Hoyt’s (1941) popular approach, for example, which subtracts variability owing
to rater inconsistency (represented by the mean square of the performer by rater
interaction) from performer variability (represented by the mean square of the performer main effect) and then places this figure over the performer variability, results
in a high interrater reliability figure of .93 ([1342.3 − 89.2]/1342.3; cf. Table 1).¹ In
terms of what actually happened among the present study’s raters, this reliability
figure is too high. It contrasts sharply with the index of dependability of .72 for 10
raters and one occasion, although less so with the more relaxed generalizability coef-
ficient (.84; cf. Table 3). Within the limited frame of classical test theory, Hoyt’s
technique is quite sound for continuous data. But in the present study, it might have
led to erroneous conclusions about the extent of rater agreement.
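The arithmetic behind the Hoyt figure can be checked directly. The short Python sketch below is illustrative only: the function name is hypothetical, and the two mean squares are the values quoted above from Table 1.

```python
def hoyt_reliability(ms_performer: float, ms_interaction: float) -> float:
    """Hoyt-style reliability: (MS_p - MS_pr) / MS_p.

    ms_performer   -- mean square for the performer main effect
    ms_interaction -- mean square for the performer-by-rater interaction
    """
    if ms_performer <= 0:
        raise ValueError("performer mean square must be positive")
    return (ms_performer - ms_interaction) / ms_performer


# Mean squares as quoted in the text (cf. Table 1).
MS_PERFORMER = 1342.3
MS_INTERACTION = 89.2

print(round(hoyt_reliability(MS_PERFORMER, MS_INTERACTION), 2))  # 0.93
```

Note that the rater main effect (differences in overall severity) does not enter this calculation at all, which is why the figure can remain high even when raters disagree in an absolute sense.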
Hypothetical one-occasion, one-rater scenarios resulted in predictably low relia-
bility and dependability outcomes. The generalizability coefficient (akin to classical
test theory’s reliability coefficient) for this scenario (.47) was unacceptably low. The
more stringent index of dependability was markedly lower (.25). Adding a modest
number of raters (a hypothetical panel of five), however, brought the generalizabil-
ity coefficient to an acceptable benchmark (.80) and into consonance with other find-
ings (Fiske, 1983; Smith, 2004; Vasil, 1973). Generalizability theory, however,
extends beyond consistency. Are ratings consistent with one another in a relative
sense sufficient, or should rater agreement in a more absolute sense also be considered? How strongly do other factors (e.g., variability among performers, performances taking place in a linear sequence, how often a performer is allowed to
display his or her “wares,” etc.) influence either or both? For those looking beyond
classical reliability to other sources of variability in performance assessment, G
theory provides a method to address these questions.
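The distinction between relative and absolute agreement can be written out for a simplified case. The sketch below uses standard G theory notation (Brennan, 2001) and assumes performers crossed with raters only (a p × R design), which is simpler than the present study's design; the rater main-effect variance, σ²(r), is the term that separates the two indices.

```latex
\[
E\rho^2 = \frac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\delta)},
\qquad
\sigma^2(\delta) = \frac{\sigma^2(pr)}{n'_r}
\]
\[
\Phi = \frac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\Delta)},
\qquad
\sigma^2(\Delta) = \frac{\sigma^2(r)}{n'_r} + \frac{\sigma^2(pr)}{n'_r}
\]
```

Because σ²(Δ) contains σ²(r)/n′_r in addition to σ²(δ), Φ cannot exceed Eρ² in this design. Differences in rater severity therefore lower Φ while leaving Eρ² untouched.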
G theory’s index of dependability (Φ) is quite stringent. Given the resource constraints found in most music assessment contexts, it is perhaps too stringent. The
great majority of the time, assembling panels of 15 to 20 adjudicators would not be
practical. On the other hand, performance assessment sometimes can have exceedingly high stakes: acceptance into a prestigious conservatory, auditions for positions in professional orchestras or opera companies, competitions with large cash
prizes, and so forth. In these contexts, protocols for assessment and the assessors
themselves should be held to the highest available standards.
As the first of its kind, this study has some limitations. Of the 10 raters who par-
ticipated, 3 rated with more severity than did the others. At this point, there is no way
to know whether 3 of every 10 raters are this severe or 3 of every 100 are, and my
selection of raters, despite efforts to standardize their qualifications and to minimize
selection bias, was unfortunate. G theory estimations, of course, must hedge accord-
ingly. Furthermore, the novel use of a 100-point scale for global assessment may
have contributed to the lack of agreement. There is no way to demonstrate that it did
not. But global evaluation is usually quite consistent, and it was in the present study
as well. Judges who rated more severely rated consistently more severely. The stan-
dard evidence for a faulty rating protocol is inconsistency.
To help establish acceptable levels of agreement, further research should examine the effect of training evaluators to the assessment protocols they will use. Raters
should reach consensus on trial “anchor” performances before proceeding with the
main task. Along similar lines, in terms of the number of raters needed, further study
should broaden rater pools used to establish “universes” for generalizability. All
other things being equal, as sample size increases, performer variance remains the
same, variance among facets decreases, and generalizability coefficients and indices
of dependability increase (Brennan, 2001, p. 112). Broadening the pool of raters,
however, might involve trade-offs. The N increases, potentially narrowing statistical
confidence intervals, but at a potential sacrifice of standardization in terms of train-
ing and experience. Perhaps there are creative yet cost-efficient ways to increase the
number of qualified raters available for evaluated events. Raters need not be experi-
enced performers on a given examinee’s medium; raters should, however, have a
background in the same general family (Bergee, 2003; Fiske, 1975).
In summary, the present study’s findings suggest the possibility of substantive
measurement error among raters. Classical test theory indices of reliability are
unable to identify such crucial and potentially strong sources of error. As a conse-
quence, musicians at present might not always receive the consistency and depend-
ability of performance assessment that we would wish for them to receive.
Note
1. Generalizability (G) theory should and does have a correlate to Hoyt’s. Eρ² = σ²(τ)/[σ²(τ) + σ²(δ)], in this instance for n′_p = 8 and n′_r = 10, is analogous to classical test theory interrater reliability. With raters as the sole measurement facet (i.e., a p × R design), σ²(τ) = σ²(p) = [MS(p) − MS(pr)]/n′_r, and σ²(δ) = σ²(pR) = σ²(pr)/n′_r = MS(pr)/10. Like Hoyt’s performer by rater interaction mean square, σ²(pr) estimates the extent to which performers are rank ordered differently by the different raters (Brennan, 2001). From Table 1, σ²(τ)/[σ²(τ) + σ²(δ)] = 125.3/(125.3 + 8.92), or .93.
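The calculation in this note can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the analysis used in the study: the function names are hypothetical, the design is the single-facet p × R design described in the note, and the mean squares are those quoted from Table 1. Because the occasion and sequence facets are left out, values for other numbers of raters will not match the D study figures reported in the article.

```python
def variance_components(ms_p: float, ms_pr: float, n_r: int) -> tuple[float, float]:
    """Estimate sigma^2(p) and sigma^2(pr) for a p x r design, one score per cell."""
    var_p = (ms_p - ms_pr) / n_r   # universe score (performer) variance
    var_pr = ms_pr                 # interaction variance, confounded with residual
    return var_p, var_pr


def g_coefficient(var_p: float, var_pr: float, n_r_prime: int) -> float:
    """Generalizability coefficient for a D study averaging over n_r_prime raters."""
    relative_error = var_pr / n_r_prime   # sigma^2(delta)
    return var_p / (var_p + relative_error)


# Mean squares quoted from Table 1; 10 raters took part in the G study.
var_p, var_pr = variance_components(ms_p=1342.3, ms_pr=89.2, n_r=10)

# Hypothetical D study panel sizes for the simplified design.
for n in (1, 5, 10):
    print(n, round(g_coefficient(var_p, var_pr, n), 2))
```

With 10 raters the sketch reproduces the .93 figure given in the note, which is the same value as Hoyt's coefficient computed from the same two mean squares.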
References
Bergee, M. J. (1995). Primary and higher-order factors in a scale assessing concert band performance.
Bulletin of the Council for Research in Music Education, 126, 1-14.
Bergee, M. J. (2003). Faculty interjudge reliability of music performance evaluation. Journal of Research
in Music Education, 51, 137-148.
Bergee, M. J. (2006). Validation of a model of extramusical influences on solo and small-ensemble festi-
val ratings. Journal of Research in Music Education, 54, 244-256.
Bergee, M. J., & McWhirter, J. L. (2005). Selected influences on solo and small-ensemble festival rat-
ings: Replication and extension. Journal of Research in Music Education, 53, 177-190.
Bergee, M. J., & Platt, M. C. (2003). Influence of selected variables on solo and small-ensemble festival
ratings. Journal of Research in Music Education, 51, 342-353.
Bergee, M. J., & Westfall, C. R. (2005). Stability of a model explaining selected extramusical influences
on solo and small-ensemble festival ratings. Journal of Research in Music Education, 53, 253-271.
Brennan, R. L. (2001). Generalizability theory. New York/Berlin: Springer-Verlag.

Bruine de Bruin, W. (2005). Save the last dance for me: Unwanted serial position effects in jury evalua-
tions. Acta Psychologica, 118, 245-260.
Burnsed, V., Hinkle, D., & King, S. (1985). Performance evaluation reliability at selected concert festivals. Journal of Band Research, 21(1), 22-29.
Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment (Sage Quantitative
Applications in the Social Sciences Series No. 17). Beverly Hills, CA: Sage.
Duerksen, G. (1972). Some effects of expectation on evaluation of recorded musical performances.
Journal of Research in Music Education, 20, 268-272.
Fiske, H. E., Jr. (1975). Judge-group differences in the rating of secondary school trumpet performers.
Fiske, H. E., Jr. (1977). Relationship of selected factors in trumpet performance adjudication reliability.
Fiske, H. E., Jr. (1983). Judging musical performance: Method or madness? Update: Applications of
Research in Music Education, 1(3), 7-10.
Flôres, R. G., & Ginsburgh, V. A. (1996). The Queen Elisabeth musical competition: How fair is the final
ranking? The Statistician, 45, 97-104.
Hare, R. Y. (1960). Problems and considerations in adjudicating music contests. The Instrumentalist, 14,
106-107.
Hoyt, C. (1941). Test reliability estimated by analysis of variance. Psychometrika, 6, 153-160.
James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and
without response bias. Journal of Applied Psychology, 69, 85-98.
James, L. R., Demaree, R. G., & Wolf, G. (1993). r_wg: An assessment of within-group interrater agreement. Journal of Applied Psychology, 78, 306-309.
Kieffer, K. M. (1999). Why generalizability theory is essential and classical test theory is often inade-
quate. In B. Thompson (Ed.), Advances in social science methodology (Vol. 5, pp. 149-170).
Greenwich, CT: JAI.
Mish, F. C. et al. (Ed.). (2003). Merriam-Webster’s collegiate dictionary (11th ed.). Springfield, MA:
Author.
Mills, J. (1987). Assessment of solo musical performance: A preliminary study. Bulletin of the Council
for Research in Music Education, 91, 119-125.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory. New York: McGraw-Hill.
Radocy, R. E., & Boyle, J. D. (1987). Measurement and evaluation of musical experiences. New York:
Schirmer.
Sagin, D. P. (1983). The development and validation of a university band performance rating scale.
Journal of Band Research, 18(2), 1-11.
Schmidt, F. L., & Hunter, J. E. (1989). Interrater reliability coefficients cannot be computed when only
one stimulus is rated. Journal of Applied Psychology, 74, 368-370.
Smith, B. (2004). Five judges’ evaluation of audiotaped string performance. Bulletin of the Council for
Research in Music Education, 160, 61-69.
Spearman, C. E. (1904). The proof of measurement of association between two things. American Journal
of Psychology, 15, 72-101.
Stanley, M., Brooker, R., & Gilbert, R. (2002). Examiner perceptions of using criteria in music perfor-
mance assessment. Research Studies in Music Education, 18, 43-52.
Thompson, S., & Williamon, A. (2003). Evaluating evaluation: Musical performance assessments as a
research tool. Music Perception, 21, 21-41.
Thompson, W. F., Diamond, C. T., & Balkwill, L. L. (1998). The adjudication of six performances of a
Chopin etude: A study of expert knowledge. Psychology of Music, 26, 154-174.
Vasil, T. (1973). The effects of systematically varying selected factors on music performance adjudica-
tion. Unpublished doctoral dissertation, University of Connecticut.
Wapnick, J., & Ekholm, E. (1997). Expert consensus in solo voice performance evaluation. Journal of
Voice, 11, 429-436.
Wapnick, J., Flowers, P., Alegant, M., & Jasinskas, L. (1993). Consistency in piano performance evalua-
tion. Journal of Research in Music Education, 41, 282-292.
Wapnick, J., Ryan, C., Lacaille, N., & Darrow, A. A. (2004). Effects of selected variables on musicians’
ratings of high-level piano performances. International Journal of Music Education, 22, 7-20.
Wilson, V. E. (1977). Objectivity and effect of order of appearance in judging of synchronized swimming
meets. Perceptual and Motor Skills, 44, 295-298.
Martin J. Bergee is professor of music education and music therapy at the University of Kansas. His
research interests include music performance assessment.
Submitted October 10, 2006; accepted November 20, 2007.
