Rater Reliability
Why Consider Rater Reliability?
Whenever human judgment is a part of the measurement process, disagreements
can be a substantial source of variance in scores. For example, when teachers grade
essay exams, they often disagree substantially about the merits of various papers. In
meta-analysis, judgments about the study's rigor, quality of the experimental design,
relevance of the participants, and so forth can be expected to yield judgments that vary
across judges. Even numerical estimates such as the effect size can be subject to
disagreement. Two judges are unlikely to make many recording mistakes (e.g., if they
both see a d statistic of .57, they will both likely record it as .57). However, they may
pull the information used to calculate an effect size from different places in an
article, and such differences may result in different estimates across judges. Therefore,
it is a good idea to estimate the reliability of judges whenever they code any study
information. Reliability of data will be of interest to your readers, critics, and general
audience.
How Much Data?
To estimate the reliability of judges, you need to have at least two judges for at
least some of the studies. You can have more than two judges provide data on all of the
studies, of course. But you can also estimate the reliability of a single judge so long as
you have data on at least two judges for at least some of the studies. I recommend that
you have at least two judges complete coding for at least 30 studies. If you have access
to more judges, have all of them complete coding on the same studies, even if it's only 10
or 15 studies. For most meta-analyses, I recommend having two judges code each scale
for all articles. This requirement may be relaxed if the meta-analysis is so large that
having two people code every study is too expensive.
Reliability for Continuous and Categorical Data
When the judges code studies, their judgments will be either categorical or
continuous. I suggest that you consider a judgment to be categorical if the judgment
clearly belongs to a class, category, or other label designation. For example, type of
publication (journal or dissertation), type of participant (student, job applicant), country
in which the study was completed (U.S., Germany, Argentina), and so forth, would be
considered categorical. Use kappa to measure agreement among judges if the scores for
the scale are categorical.
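
For the categorical case, here is a minimal sketch of how you might obtain kappa in SAS
with the AGREE option of PROC FREQ. The data set, the variable names (judge1, judge2),
and the publication-type codes are invented for illustration only.

*Hypothetical data: two judges classify six studies as
*journal (1) or dissertation (2). The AGREE option of
*PROC FREQ requests the kappa coefficient;
data pubtype;
input study judge1 judge2;
cards;
1 1 1
2 1 2
3 2 2
4 1 1
5 2 2
6 1 1
;
proc freq data=pubtype;
tables judge1*judge2 / agree;
run;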
I suggest that you consider as continuous those scales in which there are several
or many ordinal, interval, or ratio scores. For example, study rigor scored on a 1 to 5
scale where 1 is poor and 5 is excellent, average age of the study participants, and effect
size for a study would all be considered continuous. Use intraclass correlations to
estimate and quantify the agreement (reliability) of judges when the scales are
continuous.

Estimate the reliability between or among judges separately for each scale (one
estimate each for the country of the study's origin, effect size, percentage of males, etc.).
What Intraclass Correlation to Use?
There are two main types of intraclass correlations, Case 2 (ICC(2)), which is the
random case, and Case 3 (ICC(3)), which is the fixed case. In Case 3, the judges and
studies are crossed, meaning that for a given scale, each judge codes all studies. So if in
my meta-analysis, Jim and Joan code all 25 studies for effect size, then studies and
people are completely crossed and Case 3, the fixed case, applies. In Case 2, the judges
and studies are not crossed. For all 25 studies, you have two people code each study, but
different people code different studies. For example, Jim and Joan code studies 1-5,
Jim and Steve code studies 6-10, Joan and Steve code studies 11-15, and so forth. If
so, studies are nested in raters, and Case 2, the random case, applies.
Another way to collect data is to have different numbers of people code each
study. Joan codes study 1, Jim and Steve code study 2, and Joan, Jim and Steve all code
study 3. Avoid collecting data like that. If you do, some studies will be coded more
reliably than others and no single number will accurately estimate the reliability of the
data.
The proper intraclass correlation to use depends on how you collect the data
during your study. If for the scale in question, the same people rate each and every study,
then use ICC(3). If different people code different studies for a scale, then use ICC(2). If
different people code different scales, but the same people code the same scales across
studies, you can still use ICC(3) because you report reliability by scale.
Collecting Data to Estimate Reliability Before a Full Study
Regardless of how you collect data for your whole study, however, I recommend
that you estimate the whole-study reliability by collecting data on a subsample of studies
where all the judges and studies are completely crossed. If you do that, you will have the
data you need to estimate both ICC(2) and ICC(3) the way I show you how to do it here.
It doesn't work the other way. If you collect nested data, you have to use some ugly
models and most likely hire a statistician to figure out how to get the right estimates (if
you are interested or just feeling lucky, some models for more complicated data collection
designs are developed for you in Brennan, 1992; Crocker & Algina, 1986; Cronbach et al., 1972).
The simple models that I present here were developed by Shrout & Fleiss (1979).
They assume that the data were collected in a design in which the judges and studies
(targets) are completely crossed. I will show you
1. How to estimate the reliability of a single judge in the fixed and random
conditions
2. How to estimate the reliability of any number of judges (e.g., two) given the
reliability of a single judge, and
3. How to estimate the number of judges required to attain any desired level of
reliability for the scale.

Illustrative Example
Scenario
Jim, Joe, and Sue have gathered data for a meta-analysis of the effects of classical
music on plant growth. They have drawn a random sample of five of their studies, and
each of them has rated the rigor of the same five studies. Their ratings are reproduced
below.
Study   Jim   Joe   Sue
  1      2     3     1
  2      3     2     2
  3      4     3     3
  4      5     4     4
  5      5     5     3

If we use SAS PROC GLM, we can specify a model in which rating is a function of rater,
study, and their interaction.
[Technical note: In doing so, we have what is essentially an ANOVA model in which
there is one observation per cell. In such a design, the error and interaction terms are not
separately estimable. If there is good reason to believe that there is an interaction
between raters and targets (e.g., Olympic figure skating judges' ratings of people from
their own country), then the entire design should be replicated to allow a within cell error
term.]

SAS Input
data rel1;
input rating rater target;
*************************************************
*Rating is the variable that is each judge's
*evaluation of the study's rigor.
*Jim is rater 1, Joe is 2, and Sue is 3.
*Target is the study number.
*************************************************;
cards;
2 1 1
3 1 2
4 1 3
5 1 4
5 1 5
3 2 1
2 2 2
3 2 3
4 2 4
5 2 5
1 3 1
2 3 2
3 3 3
4 3 4
3 3 5
;
*proc print;
proc glm;
class rater target;
model rating = rater target rater*target;
run;

Note that rater*target is the interaction term, and we will use that as the error term under
the assumption that the interaction is negligible.

SAS Output

The GLM Procedure

Class Level Information

Class     Levels    Values
rater          3    1 2 3
target         5    1 2 3 4 5

Number of observations    15

Dependent Variable: rating

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              14       20.93333333     1.49523810          .         .
Error               0        0.00000000              .
Corrected Total    14       20.93333333

R-Square    Coeff Var    Root MSE    rating Mean
1.000000            .           .       3.266667

Source          DF      Type I SS    Mean Square    F Value    Pr > F
rater            2     3.73333333     1.86666667          .         .
target           4    14.26666667     3.56666667          .         .
rater*target     8     2.93333333     0.36666667          .         .

Source          DF    Type III SS    Mean Square    F Value    Pr > F
rater            2     3.73333333     1.86666667          .         .
target           4    14.26666667     3.56666667          .         .
rater*target     8     2.93333333     0.36666667          .         .

What we want from the output is the Type III mean squares (in this case, 1.87 for rater,
3.57 for target, and .37 for rater*target).

Now to compute the estimates:

Reliability of one random rater, ICC(2,1):

$$\mathrm{ICC}(2,1) = \frac{BMS - EMS}{BMS + (k-1)EMS + k(JMS - EMS)/n} = \frac{3.57 - .37}{3.57 + (3-1)(.37) + 3(1.87 - .37)/5} = .61$$

Reliability of one fixed rater, ICC(3,1):

$$\mathrm{ICC}(3,1) = \frac{BMS - EMS}{BMS + (k-1)EMS} = \frac{3.57 - .37}{3.57 + (3-1)(.37)} = .74$$

Note.
BMS = mean square for targets (studies)
JMS = mean square for raters (judges)
EMS = mean square for rater*target
k = number of raters
n = number of targets (studies)

Notice that the reliability of one random rater is less than the reliability of one fixed rater.
This is because mean differences among raters reduce reliability when raters are random
but not when raters are fixed. The reliability estimate of the one random rater is a bit low
to use in practice.
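
If you would rather have SAS do this arithmetic, a small data step like the sketch below
reproduces the two estimates from the rounded Type III mean squares. The data set and
variable names (bms, jms, ems, and so on) are my own labels, not anything SAS requires.

data icc;
bms = 3.57;   *mean square for targets (studies);
jms = 1.87;   *mean square for raters (judges);
ems = 0.37;   *mean square for rater*target;
k = 3;        *number of raters;
n = 5;        *number of targets (studies);
icc2_1 = (bms - ems) / (bms + (k - 1)*ems + k*(jms - ems)/n);   *about .61;
icc3_1 = (bms - ems) / (bms + (k - 1)*ems);                     *about .74;
run;
proc print data=icc;
var icc2_1 icc3_1;
run;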
We can use the estimate of reliability of one rater to estimate the reliability of any
number of raters using the Spearman-Brown prophecy formula. We can use a variation
on the theme to estimate the number of raters we need to achieve any desired reliability.
Suppose we want to know what the reliability will be if we have two raters. The general
form of the Spearman-Brown is:

$$\mathrm{ICC}' = \frac{k \cdot \mathrm{ICC}}{1 + (k-1)\,\mathrm{ICC}}$$

where ICC is the reliability of one rater, either ICC(2,1) or ICC(3,1), and ICC' is the
reliability of the average of k raters. If we move to two raters, then k is two. For two
random judges, we have:

$$\mathrm{ICC}' = \frac{2(.61)}{1 + .61} = .76.$$

For two fixed judges, we have:

$$\mathrm{ICC}' = \frac{2(.74)}{1 + .74} = .85.$$
For both estimates, we see that the average of two judges' ratings will be more reliable
than one judge's ratings. Fixed judges are still more reliable than random judges.
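
The same arithmetic can be handed to SAS. The sketch below is a minimal data step
(the variable names are mine) that applies the Spearman-Brown formula for k raters.

data sb;
icc2_1 = .61;   *reliability of one random rater;
icc3_1 = .74;   *reliability of one fixed rater;
k = 2;          *number of raters whose ratings are averaged;
sb_random = (k*icc2_1) / (1 + (k - 1)*icc2_1);   *about .76;
sb_fixed  = (k*icc3_1) / (1 + (k - 1)*icc3_1);   *about .85;
run;
proc print data=sb;
var sb_random sb_fixed;
run;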

Suppose we want to achieve a reliability of .90. Then we can use a variant of the
Spearman-Brown that looks like this:

$$m = \frac{\rho^{*}(1 - \rho_{L})}{\rho_{L}(1 - \rho^{*})}$$

where m is the number of raters needed (rounded up to an integer), $\rho^{*}$ is our
aspiration level, and $\rho_{L}$ is our lower estimate, either ICC(2,1) or ICC(3,1). In our
example, for the random case, we have:

$$m = \frac{.9(1 - .61)}{.61(1 - .90)} = 5.75,$$

or 6 when rounded up.

For fixed judges, we have:

$$m = \frac{.9(1 - .74)}{.74(1 - .90)} = 3.16,$$

or 4 when we round up.
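
If you want SAS to do the rounding for you, a minimal sketch (again with made-up
variable names) uses the CEIL function:

data nraters;
want = .90;      *desired level of reliability;
icc2_1 = .61;    *reliability of one random rater;
icc3_1 = .74;    *reliability of one fixed rater;
m_random = ceil(want*(1 - icc2_1) / (icc2_1*(1 - want)));   *6 raters;
m_fixed  = ceil(want*(1 - icc3_1) / (icc3_1*(1 - want)));   *4 raters;
run;
proc print data=nraters;
var m_random m_fixed;
run;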

References
Brennan, R. L. (1992). Elements of generalizability theory. Iowa City, IA: ACT
Publications.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test
theory. New York: Holt, Rinehart & Winston.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The
dependability of behavioral measurements. New York: Wiley.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing
rater reliability. Psychological Bulletin, 86, 420-428.