
International Journal of Nursing Studies 46 (2009) 1274–1283


Content validity is naught


Jason W. Beckstead *
University of South Florida, College of Nursing, 12901 Bruce B. Downs Boulevard, MDC22, Tampa, FL 33612, USA

A R T I C L E   I N F O

Article history:
Received 26 January 2009
Received in revised form 24 April 2009
Accepted 29 April 2009

Keywords:
Validity
Content validity
Operational definition
Interrater agreement

A B S T R A C T

Content validation theory and practice have received considerable attention in the nursing research literature. This paper positions the discourse within the broader scientific literature on validity of measurement. The content validity index has been recommended as a means to quantify content validity; this paper critically examines its origins, theoretical interpretations, and statistical properties. In addition, the author sets out to understand why many nurse researchers are occupied with content validity and its estimation. This investigation may be of interest to the scholar who desires to deeply understand the issues surrounding validity of measurement.

© 2009 Elsevier Ltd. All rights reserved.

What is already known about this topic?

• Content validity is widely used in nursing research.
• The CVI and its variants have been advocated as measures of content validity.

What this paper adds

• This paper reviews the origin, historical development, and theoretical underpinnings of content validity.
• The role and significance of the CVI and its variants in scientific measurement are discussed.
The purpose of this article is to discuss content
validation theory and practice as this topic has received
considerable attention in the nursing research literature
for over two decades. A broad, in-depth review of the
scientific literature on content validity, and validity in
general, was conducted to provide context. In addition,
some of the technical aspects involved in attempts to
quantify content validity are critically examined in detail.
The text is structured as a narrative of discovery; the facts

* Tel.: +1 813 974 7667.


E-mail address: jbeckste@health.usf.edu.
0020-7489/$ see front matter 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ijnurstu.2009.04.014

and opinions uncovered and the conclusions that may be drawn from them will be shared with the reader. This
investigation may be of interest to the scholar who desires
to deeply understand the issues surrounding validity of
measurement. There are two aims: (1) to persuade those
scientists who are preoccupied with establishing content
validity that they are misdirecting their energies, and (2) to
show how the current practice of quantifying agreement
among a small group of experts as the means for
establishing content validity is woefully deficient on
technical grounds.
The concept of content validity and how best to
quantify it have recently received attention from nurse
scholars (e.g., Polit and Beck, 2006; Polit et al., 2007; Wynd
et al., 2003). These authors make two main points: (1)
content validity is necessary/essential in the development
of new measurement instruments, and (2) the content
validity index (CVI), which quantifies the extent to which a small group of content experts agrees on the relevance of items making up a new measure, is problematic because it does not adjust for chance agreement among the experts. The CVI therefore provides an inflated estimate of content validity. All these authors recommend that nurse researchers employ an alternative class of index, based on Cohen's kappa (1960), for estimating content validity because such coefficients discount for chance agreement among experts.


As a methodologist I have long been acquainted with Cohen's kappa and its many uses, but as I was unfamiliar with the CVI until I read the aforementioned papers, I wanted to know more about it. What are its origins, its theoretical interpretation(s), its statistical properties, and, at a more fundamental level, why are nurse researchers so interested in content validity and its estimation? The following is an attempt to answer these questions.
1. On validation of measurement

1.1. Content validity in nursing research

The first clue was provided by Polit and Beck who stated: "Although the criterion-related and construct validity of a new instrument are considered especially important, information about the content validity of the measure is also viewed as necessary in drawing conclusions about the scale's quality" (Polit and Beck, 2006, p. 489). I continued reading their article with great interest but they never did state what this information is exactly, or why it is necessary, or how it supplements criterion and construct validity. Their paper did clarify that the CVI is computed at the item level as the proportion of a small group of experts who agree that the item is relevant to the construct being proposed by the scale developers; that there are three ways in which item-level CVIs may be combined to form an index of the content validity of an entire scale; and that nurse researchers often do not report in sufficient detail how they compute scale-level CVIs.

The second clue uncovered was "Content validity is an essential step in the development of new empirical measuring devices because it represents a beginning mechanism for linking abstract concepts with observable and measurable indicators" (Wynd et al., 2003, p. 508). This statement required reading several times; somewhere mid-sentence the focus shifts from scale development (and presumably content validity) to construct validity. Hence, the distinction between content validity and construct validity may be blurred for some nurse researchers, as it has been for some researchers in psychology and education (see Tenopyr, 1977, for discussion). These authors concluded by saying "Kappa offers additional information beyond proportion agreement because it removes random chance agreement" (Wynd et al., 2003, p. 516). This remark seems puzzling because kappa is a proportion of agreement; it has simply been adjusted to remove the contribution of chance by applying principles from probability theory. The authors made no further statement specifying what exactly this additional information is, however.

To better understand the remarks made by these scholars, it is beneficial to examine the sources that they cited as they built their arguments. Polit and Beck (2006), Polit et al. (2007), as well as Wynd et al. (2003) all referenced Lynn (1986) as the primary/seminal source for the CVI in nursing research. Lynn (1986), however, cited Waltz and Bausell (1981) as providing the definition of the CVI. Waltz and Bausell cited an even earlier source, referencing Hambleton et al. (1975) as being responsible for the approach they advocated for calculating the CVI. Unfortunately I could not obtain a copy of Hambleton et al. (1975), as the Waltz and Bausell citation appears to refer to a handout obtained at an annual meeting on educational research, and not to an article published in a peer-reviewed journal. So, at this point the sequence of citations in the nursing literature comes to an end without offering much insight into the origin and properties of the CVI, nor into the question of why nurse researchers are so focused on content validity and its estimation. Perhaps more answers are to be found in the historical literature on validity. It is to this literature that we now turn.

1.2. Historical context

A database search was initiated, turning first to CINAHL, only to find that it does not index material published prior
to 1982. As the last reference to content validity in the nursing literature was published in 1981, another database, PsycInfo, which indexes many publications dating back to the 1890s, was consulted. Entering the term "content validity" in the title field revealed that many scholars in education and psychology had been writing about the validity of measurement for many years. The earliest writings on the topic (McCall, 1922; Thorndike, 1918) recognized two basic types of validity: logical and empirical. These eventually became the four types of validity that we are familiar with today (content validity, predictive validity, concurrent validity, and construct validity) when they were first laid out in the Technical Recommendations for Psychological Tests and Diagnostic Techniques produced by the American Psychological Association (APA) Committee on Test Standards (1954).
Notable members of the Committee were Lee Cronbach
and Paul Meehl who published the seminal work on the
rationale for construct validity and its relation to the other
types (see Cronbach and Meehl, 1955). In distinguishing
content validity from the other three, they stated: Content
validity is established by showing that the test items are a
sample of a universe in which the investigator is
interested. Content validity is ordinarily to be established
deductively, by dening a universe of items and sampling
systematically within this universe to establish the test
(p. 282). So, here is the fundamental denition of content
validity; it concerns only the test items (i.e., stimuli) and
makes no mention of the responses that people provide to
the test items. Cronbach reiterated this distinction again
some 15 years later stating Content validity has to do with
the test as a set of stimuli and as a set of observing
conditions (1971, p. 452). Shortly thereafter Messick
(1975) stated succinctly that content validity, as dened, is
a xed property of a test rather than a property of test
responses.
A flurry of papers were published on content validity around 1955 and again around 1975. This was no coincidence, as two academic conferences on the topic had taken place in these years. The APA had held a symposium entitled "Content Validity of Non-factual Tests" at their annual meeting in San Francisco in 1955, and there had also been a major conference, "Content Validity II," held at Bowling Green State University in 1975, attended by many of the people who had a hand in drafting the APA Standards for Educational and Psychological Tests (1974). Although separated by 20 years, the papers given at these two meetings addressed common themes, providing additional clues; these are summarized in the following paragraphs.
Lennon (1956, p. 295) noted that "Early analyses of validity commonly recognized the two meanings of the term, often labeled empirical or statistical, on the one hand, and logical, or, for educational tests, curricular, on the other; and content validity is a lineal descendant of this second branch of the family" [italics in original]. Messick
(1975) picked up on this theme noting that while
psychologists were emphasizing the value of valid interpretations of scores in terms of attributes or processes
underlying performance (i.e., construct validity), educational measurement specialists seemed to have spent
decades preoccupied with comparative interpretations of
scores (either with regard to norms or to standards)
accepting tests as being valid on the basis of content
analysis. So, the prioritization of content and construct
validity had long ago diverged in these two fields of study. This clarifies, somewhat, why nurse researchers are so
focused on content validity and its estimation; many of
them have studied measurement theory primarily as it
applies to achievement testing in educational settings.
This divergence leads to another interesting topic, the
discussion of which began over 50 years ago: When should, and when can, a test be validated? Huddleston (1956), in discussing the test development process, began by noting
that achievement tests, those most common in educational
settings, are largely or entirely factual in content. She went
on to state that if a test is purely factual in its content,
measuring nothing other than factual recall, the question
of its content validity becomes a relatively simple one. If
the facts called for in the test are fairly representative of
the universe of facts in the field covered by the test, then
the test assuredly has content validity. But, what about the
case when a test goes beyond simple factual recall, as is
often the case when measuring abilities or temperaments?
In this case, the problem is that there is no domain or
universe of items, so it is impossible to speak about how
well the test samples from it (see Loevinger, 1965). This
criticism has led educational measurement specialists to
propose the use of techniques such as the test blueprint
and the content-by-process grid as means of operationally defining the content universe. Content validity is then the outcome of judging the sampling adequacy of test content relative to an operational definition of the universe. The dilemma here is that an operational definition does not necessarily yield a valid representation of the construct of interest, except by fiat (see Mosier, 1947).
Therefore, in order to avoid tautology, any validation process must consider responses as well as stimuli. This assessment is not original; it echoes the conclusions drawn by others who have given the matter serious thought over the years. "Validity cannot be regarded as a fixed or a unitary characteristic of a test" (Gulliksen, 1950a, p. 88). Or, as Cronbach (1971, p. 447) put it, "one validates, not a test, but an interpretation of data arising from a specified procedure." Messick wrote extensively on the issue: "tests per se do not have construct validities, nor reliabilities, or predictive validities for that matter. These are properties of test responses, not of tests, and test responses are a function of the persons making them and of factors in the environmental setting and measurement context" (Messick, 1975, p. 956). He went on, "The major problem here is that content validity...is focused upon test forms [items] rather than test scores, upon instruments rather than measurements.... Any concept of validity of measurement must include reference to empirical consistency. Content coverage is an important consideration in test construction and interpretation, to be sure, but in itself it does not provide validity" (p. 960).
Following the Content Validity II conference, other methodologists weighed in on the relevance of content validity to scientific investigation. Tenopyr (1977) considered content validity as merely propaedeutic: "Content validity deals with inferences about test construction; construct validity involves inferences about test scores. Since by definition, all validity is the accuracy of inferences about test scores, that which has been called content validity is not validity at all" (Tenopyr, 1977, p. 50). The issue of content validity received attention in the context of personnel selection by Guion (1977), who distinguished the term "content validity" from "content domain sampling," concluding that the latter is important for all types of psychological measurement and that the former does not exist. Fitzpatrick (1983), in an exquisite exposé on the meaning of content validity, also argued that the sampling adequacy of test content, the relevance of test content, and the clarity of domain definitions are important features of test development, but that they should be associated with the terms "content representativeness," "content relevance," and "domain clarity," respectively, and not with the term "content validity," because these notions do not provide evidence for the interpretation of responses and thus cannot refer to any kind of validity.
So, what may we conclude from reading this extended
literature? Validity refers to the inferences investigators
make from the scores (responses) they obtain, not to the
questionnaire items (stimuli) used to obtain those
responses. Validation efforts based solely on judgments
about items (e.g., their relevance or representativeness)
cannot, by definition, provide any empirical support for
validity. Thus it would seem that researchers interested in
interpreting scores in terms of attributes or processes that
underlie performance had pretty much written off content
validity as being irrelevant and had focused instead on
assessing construct validity.
2. On technical issues involved in the quantification of interrater agreement
Despite having reached the conclusion that content validity is not, by definition, validity, it is informative to examine nurse researchers' efforts to gauge what they have called content validity. At the heart of their efforts lies the calculation of interrater agreement. In this section there are four technical issues to be discussed as they apply to interrater agreement in general, and by subsumption, to the practice of using interrater agreement to define content validity. The issues, in no particular order, are: the statistical model underlying interrater agreement, the collapsing of response categories, the correction for chance agreement among raters, and the age-old philosophical problem of induction. These issues will be laid out in the following sections and their implications demonstrated by example. The discussion will also highlight how each issue has been addressed (or neglected) by various nurse scholars in their attempts to quantify content validity.
2.1. Specifying a model of interrater agreement
The fundamental problem with the CVI is the statistical
model for interrater agreement upon which it is based.
Lynn (1986) specified the proportion of experts whose endorsement is required to establish content validity. For instance, when 4 of 5 experts endorse an item as "valid" she takes this result to indicate 80% agreement (see Lynn, 1986, Table 2). To be sure, 4/5 = 0.80; however, this is not an accurate model of the amount of agreement among experts. To see why this is the case, say that the five experts, Anne, Beth, Carol, Ted, and Bob, are asked to endorse an item as being "valid" or "not valid." Anne can agree or disagree with Beth, with Carol, with Ted, and with Bob; Beth can agree or disagree with Carol, with Ted, and with Bob; Carol can agree or disagree with Ted, and with Bob; and Ted can agree or disagree with Bob. There are thus 5(5 − 1)/2 = 10 possible opportunities for agreement (or disagreement). For the sake of illustration, assume that Anne said that the item was "not valid," while everyone else said it was "valid." There are thus four disagreements out of the possible ten (Anne disagreed with Beth, with Carol, with Ted, and with Bob). Conversely, the number of agreements is six and so interrater agreement is 60%, not 80%. Lynn's mis-specification of the statistical model for agreement also biases her correction for chance agreement (see below).
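To make this model concrete, the following short sketch (hypothetical code, not part of the original article) counts agreements over all pairs of raters; with one dissenter among five experts it returns 0.60 rather than the 0.80 implied by the CVI-style calculation.

```python
from itertools import combinations

def pairwise_agreement(ratings):
    """Proportion of agreeing pairs among all n(n - 1)/2 pairs of raters."""
    pairs = list(combinations(ratings, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# One dissenter among five experts: 6 of the 10 rater pairs agree.
print(pairwise_agreement(["not valid", "valid", "valid", "valid", "valid"]))  # 0.6
# The naive CVI-style calculation would instead report 4/5 = 0.8.
```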
2.2. Collapsing response categories
In the typical study conducted to assess content
validity, experts are asked to rate the relevance of each
item, usually on a 4-point scale such as: 1 = not relevant,
2 = somewhat relevant, 3 = quite relevant, and 4 = very
relevant. After such data are collected, researchers have
been instructed to collapse the four ordinal response options into two dichotomous categories, such as "not relevant" and "relevant," and then to determine the item's CVI by calculating the proportion of experts who rated the item as relevant, either by assigning it a 3 or a 4 (Lynn, 1986; Polit and Beck, 2006; Waltz and Bausell, 1981). Aside from
making the computation of the CVI easier, there is no
defensible basis for this recommendation. There are,
however, several reasons why such a practice is problematic and should be avoided.
Tinsley and Weiss (1975) differentiated interrater
reliability from interrater agreement. Interrater reliability
represents the degree to which the ratings of different
experts are proportional when expressed as deviations
from their means and is typically reported in terms of
correlational indices. Interrater agreement, on the other hand, represents the extent to which the different experts tend to make exactly the same judgments about an item. When judgments are made on a numerical scale, agreement means that the experts assigned exactly the same values when rating the item. Recognizing this distinction, Wynd et al. commented "Interrater agreement [not reliability] is the indicated procedure for quantitatively estimating the content validity of new instruments" (2003, p. 512). But they failed to consider how collapsing response categories distorts interrater agreement.
Conceptually, combining the response categories prior to calculating agreement fundamentally alters the meaning of the resulting proportion; it no longer reflects exactly what the experts did, nor how much they actually agreed. Statistically, this practice also alters the basis for defining the role of chance, which is key in computing alternative indices, such as Cohen's kappa, that represent the proportion of agreement remaining after chance agreement has been removed. Correcting for chance is discussed in greater detail in the next section. For now, let us consider further how combining response categories alters the meaning of interrater agreement and degrades the information transmitted to the researcher.

If five experts are asked to judge an item's relevance using the 4-point scale described above, there are 56 possible outcomes, or ways that their judgments can be distributed across the four response categories (this is determined using the multinomial distribution). Four of these outcomes count as interrater agreement: all five gave a rating of 1, all five gave a rating of 2, all five gave a rating of 3, and all five gave a rating of 4. Only two of these outcomes count as favorable agreement (all five gave a rating of 3, and all five gave a rating of 4). But if the 4-point scale is dichotomized, then twelve of the 56 possible outcomes are counted as interrater agreement and six of these are counted as favorable. The meaning of interrater agreement has been changed by counting as instances of agreement judgments that were not exactly the same.
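These counts are easy to verify by enumeration. The sketch below (hypothetical code, not from the article) lists every distinguishable way five ratings can fall across the categories and tallies how many count as agreement before and after dichotomizing.

```python
from itertools import combinations_with_replacement

# All distinguishable outcomes: how five experts' ratings can fall
# across the four response categories (56 multisets).
outcomes = list(combinations_with_replacement([1, 2, 3, 4], 5))
print(len(outcomes))  # 56

# Exact agreement on the 4-point scale: all five ratings identical.
exact = [o for o in outcomes if len(set(o)) == 1]
print(len(exact))  # 4 (of which 2 are "favorable": all 3s or all 4s)

# Counted as "agreement" after dichotomizing into {1, 2} vs. {3, 4}.
collapsed = [o for o in outcomes
             if all(r <= 2 for r in o) or all(r >= 3 for r in o)]
print(len(collapsed))  # 12 (of which 6 are "favorable")
```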
Collapsing response categories after the fact also leads
to loss of information. This problem has been studied
extensively by Garner and colleagues (Garner, 1960;
Garner and Hake, 1951; Garner and McGill, 1956) and is
summarized briey to highlight how this applies to
interrater agreement. The purpose in using a rating scale
(e.g., 1 = not relevant, 2 = somewhat relevant, 3 = quite
relevant, and 4 = very relevant) is to allow a rater or group
of raters to demonstrate their perceptual discriminations
among a set of stimuli (items in the present case). The
rating scale may be thought of as having a certain capacity
for transmitting information about the items from the
expert rater to the researcher. This capacity is related to the
number of response categories on the scale, the sample of
experts providing the ratings, and to the sample of items
being rated.
A simple example (using hypothetical data) may suffice to illustrate how collapsing responses is detrimental to the transmission of information. The upper panel of Table 1 contains the frequency distribution of responses from five experts who have rated the relevance of four items. Using the method of successive categories (see Guilford, 1954) a psychological continuum of relevance may be constructed


Table 1
Frequency distributions, scale values, information transmitted, and measures of multirater agreement for four items rated by five experts.

Upper panel: data as obtained (response categories 1–4)

Item    1    2    3    4    Scale values    Agreement    Discounted agreement
A       4    1    –    –    0.00            0.60         0.47
B       1    3    1    –    1.31            0.30         0.07
C       –    1    3    1    3.37            0.30         0.07
D       –    –    1    4    4.68            0.60         0.47

Variance in scale values = 3.27
Information transmitted = 1.05 bits
Mean proportion agreement = 0.45
Mean proportion agreement due to chance = 0.25
Multirater kappa = 0.27

Lower panel: same data with response categories collapsed

Item    1 and 2    3 and 4    Scale values    Agreement    Discounted agreement
A       5          –          0.00            1.00         1.00
B       4          1          1.12            0.60         0.20
C       1          4          1.97            0.60         0.20
D       –          5          3.09            1.00         1.00

Variance in scale values = 1.28
Information transmitted = 0.60 bits
Mean interrater agreement = 0.80
Mean proportion agreement due to chance = 0.50
Multirater kappa = 0.60

Note: Upper panel shows analysis of data as obtained. Lower panel shows analysis of the same data collapsing over response categories. Original response categories: 1 = not relevant, 2 = somewhat relevant, 3 = quite relevant, and 4 = very relevant. Scale values determined using the method of successive categories (Guilford, 1954). Information transmitted = 1/2[log2(variance + 1)]. Agreement, mean proportion agreement, mean proportion agreement due to chance, and multirater kappa calculated using Fleiss' (1971) Equations 2, 3, 5, and 7, respectively. Discounted agreement = (agreement − mean proportion agreement due to chance)/(1 − mean proportion agreement due to chance).

from the distribution of the responses. The resulting scale values (see Table 1) express relative differences among the items in their perceived relevance. The spacing of the items along the continuum is dependent upon the amount of interrater agreement. The variance in these scale values expresses the degree of discrimination that the experts exhibited regarding the items. Garner (1960) showed how this variance can be equated to the amount of information transmitted. For this set of data, the transmitted information is 1.05 bits. Collapsing the data and repeating the scaling process reveals that only 0.60 bits of information are transmitted (see lower panel of Table 1). This amounts to a loss of (1.05 − 0.60)/1.05 = 43% of the potential information contained in the data. Here the loss of information is due to the fact that ratings of items B and C overlap the categories on the collapsed rating scale. In short, collapsing response categories should be avoided because it distorts the meaning of agreement and discards pertinent information.
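As a check on the arithmetic, the information-loss calculation can be reproduced directly from the formula given in the note to Table 1 (a hypothetical sketch using the scale-value variances reported there; the collapsed value comes out near 0.59 from the rounded variance, reported as 0.60 in the table).

```python
import math

def bits_transmitted(variance):
    # Information transmitted = 1/2 * log2(variance + 1), per the note to Table 1.
    return 0.5 * math.log2(variance + 1)

full = bits_transmitted(3.27)       # ~1.05 bits on the 4-point scale
collapsed = bits_transmitted(1.28)  # ~0.59 bits after dichotomizing
print(round(full, 2), round(collapsed, 2))
print(f"loss = {(full - collapsed) / full:.0%}")  # ~43%
```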
2.3. Correcting for chance agreement in a sample of experts
Over the years much has been written about how best
to quantify the degree of agreement among human
observers (Cohen, 1960; Fleiss, 1971; James et al., 1984;
Landis and Koch, 1977; Lindell and Brandt, 1999; Lindell
et al., 1999; Stemler, 2004; Tinsley and Weiss, 1975;
Wakefield, 1980). A major issue raised in these discussions is that the simple proportion of agreement capitalizes on chance, thus providing an inflated estimate
of the true agreement among observers. Various methods
have been proposed for isolating and removing chance
agreement. It is noteworthy that none of these writers
addressed content validity, but only the concepts of
interrater agreement (reliability). Nevertheless, this
literature has informed nurse researchers as they have developed approaches for assessing content validity via
expert panel agreement (e.g., Polit et al., 2007; Wynd et al.,
2003).
The first approach to adjusting the CVI for chance agreement was offered by Lynn (1986). After dichotomizing the responses from a 4-point scale, she proceeded to incorporate a large-sample approximation to the standard error for proportions in order to establish a cut-off for chance versus real agreement. She concluded "If there are five or fewer experts, all must agree on the content validity for their rating to be considered a reasonable representation of the universe of possible ratings. When six or more experts are used, one or more can be in disagreement with the others and the instrument will be assessed as content valid" (p. 383). There are two technical problems with Lynn's approach. The first problem stems from the issue raised earlier; by mis-specifying the statistical model for interrater agreement, the cut-off values which she proposed, based on that model, are irrelevant. The second problem (presented mainly for pedagogic reasons, because it is moot in light of the first problem) has to do with her use of the large-sample approximation for the standard error of a proportion. According to Polit et al. (2007), Lynn used this approximation to calculate the 95% lower-bound confidence limit and then compared this value to 0.50, her criterion for chance agreement, in order to establish her recommended cut-off values for sample size. For example, 5/6 gives a proportion of 0.833 and, using the large-sample approximation, the 95% lower-bound confidence limit is 0.535; Lynn's conclusion (quoted above) then follows. The large-sample approximation, however, is not at all accurate over the range of sample sizes that Lynn applied it to in her Table 2. The exact 95% lower-bound confidence limit (see Zar, 1999, p. 527) for 5/6 is 0.359, well below her criterion for chance agreement.
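The contrast is easy to reproduce. The sketch below is hypothetical code; it assumes the exact limit is the Clopper-Pearson bound obtained from the beta distribution, which agrees with the value quoted above, and compares it with the large-sample approximation for 5 of 6 experts agreeing.

```python
import math
from scipy.stats import beta, norm

x, n = 5, 6
p_hat = x / n  # 0.833

# Large-sample (normal) approximation to the 95% lower confidence limit.
approx_lower = p_hat - norm.ppf(0.975) * math.sqrt(p_hat * (1 - p_hat) / n)

# Exact (Clopper-Pearson) lower limit based on the binomial/beta relationship.
exact_lower = beta.ppf(0.025, x, n - x + 1)

print(round(approx_lower, 3), round(exact_lower, 3))  # ~0.535 vs ~0.359
```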


Table 2
Evaluation of I-CVIs with different numbers of experts and agreement showing lower-bound confidence limits.

Columns: (1) No. of experts; (2) No. giving rating 3–4; (3) I-CVI; (4) pc*; (5) k*; (6) Evaluation; (7) 95% L.C.L. for k*; (8) Evaluation of 95% L.C.L.

(1)   (2)   (3)      (4)         (5)      (6)          (7)      (8)
3     3     1.000    0.125       1.000    Excellent    0.292    Poor
3     2     0.667    0.375       0.467    Fair         0.094    Poor
4     4     1.000    0.063       1.000    Excellent    0.398    Poor
4     3     0.750    0.250       0.667    Good         0.194    Poor
5     5     1.000    0.031       1.000    Excellent    0.478    Fair
5     4     0.800    0.156       0.763    Excellent    0.284    Poor
6     6     1.000    0.016       1.000    Excellent    0.541    Fair
6     5     0.833    0.094       0.816    Excellent    0.359    Poor
6     4     0.667    0.234       0.565    Fair         0.223    Poor
7     7     1.000    0.008       1.000    Excellent    0.590    Fair
7     6     0.857    0.055       0.849    Excellent    0.421    Fair
7     5     0.714    0.164       0.658    Good         0.290    Poor
8     8     1.000    0.004       1.000    Excellent    0.631    Good
8     7     0.875    0.031       0.871    Excellent    0.473    Fair
8     6     0.750    0.109       0.719    Good         0.349    Poor
9     9     1.000    0.002       1.000    Excellent    0.664    Good
9     8     0.889    0.018       0.887    Excellent    0.518    Fair
9     7     0.778    0.070       0.761    Excellent    0.400    Poor

12    12    1.000    2.44E−04    1.000    Excellent    0.735    Good
13    13    1.000    1.22E−04    1.000    Excellent    0.753    Excellent
13    12    0.923    1.59E−03    0.923    Excellent    0.640    Good
20    19    0.950    1.91E−05    0.950    Excellent    0.751    Excellent
20    18    0.900    1.81E−04    0.900    Excellent    0.683    Good
35    35    1.000    2.91E−11    1.000    Excellent    0.900    Excellent
35    34    0.971    1.02E−09    0.971    Excellent    0.851    Excellent
54    53    0.981    3.00E−15    0.981    Excellent    0.901    Excellent
54    52    0.963    7.94E−14    0.963    Excellent    0.873    Excellent

Note: Values in the upper portion of columns 1–6 are after those found in Polit et al. (2007, Table 4). I-CVI = item-level content validity index; pc* = probability of chance agreement; k* = modified kappa coefficient designating proportion agreement on relevance (Polit et al., 2007). Evaluation criteria for kappa: poor < 0.40, fair = 0.40–0.599, good = 0.60–0.749, excellent ≥ 0.75 (Fleiss, 1981). 95% L.C.L. = ninety-five percent lower-bound confidence limit for k* based on confidence limits for a population proportion (Zar, 1999, p. 527).

Wynd et al. (2003) addressed the issue of chance agreement by recommending that nurse researchers use both the item-level CVI and the multirater kappa statistic developed by Fleiss (1971). To understand this recommendation let us briefly review the kappa statistic. Cohen (1960) proposed kappa to quantify the proportion of agreement, after chance agreement has been removed, between two judges who had classified a large set of N objects into j categories. Cohen defines the proportion of observed agreement, po, as the number of objects placed in the same category by both judges relative to the total number of objects and then uses the marginal distributions of both judges' responses to define the proportion of agreement expected due to chance, pc. Kappa discounts the observed proportion of agreement for chance agreement by subtracting pc and normalizes the result by the complement of pc. Thus, Cohen's Equation 1 defines kappa = (po − pc)/(1 − pc). Kappa ranges from −1.0 to 1.0; when obtained agreement equals chance agreement, kappa = 0. Greater than chance agreement produces positive values of kappa; less than chance agreement leads to negative values. Fleiss et al. (1969) derived accurate standard errors of kappa for calculating confidence intervals and testing hypotheses.
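For readers who want the computation spelled out, a minimal sketch of Cohen's kappa for two judges follows (hypothetical data and code, not an example from the article).

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's (1960) kappa from a j x j cross-classification of two judges."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    po = np.trace(table) / n                                     # observed agreement
    pc = (table.sum(axis=0) * table.sum(axis=1)).sum() / n**2    # chance agreement from the marginals
    return (po - pc) / (1 - pc)

# Two judges classifying 100 hypothetical items as relevant / not relevant.
table = [[40, 10],   # judge 1 "relevant":     judge 2 relevant / not relevant
         [15, 35]]   # judge 1 "not relevant": judge 2 relevant / not relevant
print(round(cohens_kappa(table), 2))  # 0.5
```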
Fleiss (1971) developed the multirater kappa as a means of handling situations wherein each of n judges has classified some, but not necessarily all, of the N objects. In his approach, agreement, and agreement due to chance, are computed between all n(n − 1)/2 possible pairs of judges, using the objects classified by both, and then these values are averaged to obtain an overall index of agreement and agreement due to chance (see his Equations 2 and 3). Multirater kappa is then computed from these averaged quantities (his Equation 7, which is equivalent to Cohen's Equation 1 with appropriate substitutions). Once these quantities have been calculated they may be used to discount agreement for individual objects. Fleiss (1971) also derived the variance and standard error for multirater kappa for computing confidence intervals.
Wynd et al. (2003) used eight experts to evaluate 23 items pertaining to osteoporosis; they reported a multirater kappa of only 0.039 and noted that this was not statistically greater than zero. Cohen (1960) pointed out that in most measurement situations it is trivial to test the significance of kappa since we usually expect much more than this in the way of reliability. He did, however, mention that demonstrating statistical significance can serve as the minimum standard for establishing the reliability of the resulting judgments. So, in Wynd et al.'s study, kappa was not significantly greater than zero, indicating that the experts' judgments of item relevance were unreliable. What is puzzling is that these authors ignored this (unfortunate) finding and proceeded to interpret substantively item-level CVIs based on these unreliable judgments. It is unclear from their discussion whether the low value of multirater kappa was based on collapsed response categories or not. It is to this issue that we now turn.
The problem alluded to in the previous section on collapsing response categories now comes to the forefront. As Maclure and Willett (1987) pointed out, the magnitude of kappa is dependent more upon how the categories are defined in the calculation than upon the degree of reproducibility in the observations. By altering the number of categories after the fact, not only do we distort the proportion of judgments which agree exactly, but we change the probability of chance agreement. The probability of an expert's judgment falling into any given response category due to chance is 1/j, where j is the number of response categories. Specifically, when the 4-point scale is used, the probability of an expert's judgment falling into any given response category due to chance is 1/4 = 0.25, but when the four categories are collapsed into two this probability is 1/2 = 0.50. Applying Fleiss' (1971) Equation 3 for the mean proportion agreement and his Equation 7 for multirater kappa to the data in the upper and lower panels of Table 1 illustrates the consequences of this practice. When the actual judgments are analyzed (upper panel), mean proportion agreement is poor (0.45) and kappa is quite low (0.27). Collapsing response categories (in the lower panel) inflates both these indices substantially (0.80 and 0.60, respectively).
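The inflation is easy to reproduce from the Table 1 frequency counts. The sketch below is hypothetical code, written for the simple case in which every expert rates every item; the equation references in the comment follow the note to Table 1.

```python
import numpy as np

def fleiss_quantities(counts):
    """Mean proportion agreement, chance agreement, and multirater kappa
    (cf. Fleiss, 1971, Equations 3, 5, and 7) for an items-by-categories
    table of rating counts in which every rater rates every item."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                                      # raters per item
    p_item = (counts * (counts - 1)).sum(axis=1) / (n * (n - 1))   # per-item agreement
    p_chance = ((counts.sum(axis=0) / counts.sum()) ** 2).sum()
    kappa = (p_item.mean() - p_chance) / (1 - p_chance)
    return p_item.mean(), p_chance, kappa

# Table 1 data: the original 4-point ratings and the dichotomized version.
full = [[4, 1, 0, 0], [1, 3, 1, 0], [0, 1, 3, 1], [0, 0, 1, 4]]
collapsed = [[5, 0], [4, 1], [1, 4], [0, 5]]

print([round(x, 2) for x in fleiss_quantities(full)])       # [0.45, 0.25, 0.27]
print([round(x, 2) for x in fleiss_quantities(collapsed)])  # [0.8, 0.5, 0.6]
```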
The disturbing consequences of collapsing response categories after the fact can also be illustrated at the item level using the information in Table 1. First, compare the data for item B in the upper and lower panels. Notice that collapsing response categories inflates the amount of agreement among the experts (from 0.30 to 0.60). Second, compare item A in the upper panel with item B in the lower panel. In this comparison both items show agreement of 0.60; however, when correcting for chance agreement the degree of discounting is not the same because of the different values of 1/j. Agreement on item B is discounted for chance by a factor of (0.60 − 0.20)/0.60 = 67%, while agreement on item A is discounted by a factor of (0.60 − 0.47)/0.60, less than 22%. Thus collapsing response categories after the fact undermines the statistical foundations by which chance agreement is removed.
The most recent approach for correcting for chance agreement among experts is offered by Polit et al. (2007). These authors have developed k*, a statistic that they refer to as a modified kappa because, as they state, "...it is an index of agreement of a certain type, namely, agreement among the judges that the item is relevant. Agreement about non-relevance is not counted, because such agreement does not inform judgments about the content validity of an item" (p. 465). The derivation of k* begins by collapsing response categories and then using the binomial distribution to calculate the probabilities of specific outcomes (i.e., the number of successes, r, observed in n trials, where r is defined as the number of experts responding "relevant" to an item and n is the number of experts sampled). The authors then defined chance agreement, pc* = [n!/(r!(n − r)!)]0.5^n. They showed how k* can be used to discount the item-level CVI for chance agreement by subtracting pc* and normalizing the result by the complement of pc*. Thus, for an item, k* = (CVI − pc*)/(1 − pc*). Polit et al. went on to interpret/evaluate k* as though it were kappa using standards recommended by Fleiss (1981) and also provided a table of standards for relating k* to the item-level CVI. They did not, however, develop the standard error for k*.

Because k* is based on Lynn's original CVI, it too suffers from the problems of collapsed response categories and a mis-specified model of interrater agreement. For example, consider again the case where 4 of 5 experts have evaluated an item favorably (item C of Table 1). Interrater agreement is only 0.30; the impact of collapsing response categories inflates this to 0.60, but applying the mis-specified model of the CVI gives 4/5 = 0.80. Computing k* = (0.8 − 0.156)/(1 − 0.156) = 0.763. When interrater agreement is discounted for chance agreement it is (0.30 − 0.25)/(1 − 0.25) = 0.07. By comparison in this example, k* actually inflates the interrater agreement by a factor of |0.30 − 0.763|/0.30 = 154% rather than discounting it for chance agreement. Given that k* does not behave as kappa, it seems dubious to evaluate it against Fleiss' (1981) standards for kappa.
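A short sketch reproduces this comparison (hypothetical code; the ratings for item C, one 2, three 3s, and one 4, are read from the upper panel of Table 1).

```python
from math import comb
from itertools import combinations

n, r = 5, 4                                     # five experts, four endorsing the item
cvi = r / n                                     # item-level CVI = 0.80
pc_star = comb(n, r) * 0.5**n                   # chance agreement per Polit et al. (2007) = 0.15625
k_star = (cvi - pc_star) / (1 - pc_star)        # ~0.763

# Pairwise interrater agreement for item C of Table 1 (ratings 2, 3, 3, 3, 4),
# discounted for chance with four response categories (1/j = 0.25).
ratings = [2, 3, 3, 3, 4]
pairs = list(combinations(ratings, 2))
agreement = sum(a == b for a, b in pairs) / len(pairs)   # 0.30
discounted = (agreement - 0.25) / (1 - 0.25)             # ~0.07

print(round(k_star, 3), round(agreement, 2), round(discounted, 2))  # 0.763 0.3 0.07
```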
2.4. The problem of induction
Since ancient times the issue of how to justify inductive inference from the specifics to the general has stymied philosophers. To address this quandary, known as the problem of induction, modern social science relies on the theory of inferential statistics whereby confidence intervals can be established around sample statistics. Whether an item-level CVI is adjusted for chance agreement as in the case of k*, or simply used as is, it is still just a sample statistic (as is kappa). Sampling experts as to their opinions about an item is merely a mechanism for obtaining an estimate of an item's relevance. The quality (i.e., precision) of any such estimate is a function of the number of experts sampled.
To illustrate, consider the following situation: we ask a group of n experts to judge the relevance of each of N items for a new instrument we are developing. Assume responses have been dichotomized into "relevant" and "not relevant." (Although we have already discussed how this practice is problematic, it is used in the following example so that the results are directly comparable with Polit et al., 2007 and others). Using the most conservative procedure, that of universal agreement, we deem an item to be appropriate for inclusion if 100% of our experts judge it as "relevant." Assume that this is the outcome; 100% of our experts did in fact judge the item in question as "relevant." How sure can we be that we would have obtained the same outcome if we had sampled a different group of n experts from the population of all qualified experts?
If we assume that the outcome (the number of experts who judged the item as "relevant") is Poisson distributed, the 95% upper bound on the probability of a "not relevant" response to the item in the future, given that we did not obtain any "not relevant" responses from our experts, may be approximated by 3/n (see van Belle, 2002, p. 49). So, if we used n = 30 experts and they all judged the item as "relevant," we can be 95% confident that the maximum probability of observing any "not relevant" judgments in the future is 3/30 or only 0.10. Conversely, we can be 95% sure that the minimum probability is (1 − 0.10) = 0.90 that the population of experts will all judge the item as "relevant."

On the other hand, had we used only n = 5 experts and they all judged the item as "relevant," we can be 95% confident that the maximum probability of observing any "not relevant" judgments in the future is 3/5 or 0.60, and conversely, we can be 95% sure that the minimum probability is only (1 − 0.60) = 0.40 that the population of experts will all judge the item as "relevant." This result does not seem as persuasive as the former for establishing the relevance of the item. Thus, if we are to rely on a sample of expert opinion to establish content relevance, then we must incorporate the notion of sampling error into our approach.
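The rule of 3s is simple enough to tabulate; the sketch below (hypothetical code) reproduces the two cases just described.

```python
def rule_of_threes(n):
    """Approximate 95% upper bound on the probability of a "not relevant" response
    when none were observed among n experts (van Belle, 2002): 3/n."""
    upper = 3 / n
    return upper, 1 - upper

for n in (5, 30):
    upper, lower = rule_of_threes(n)
    print(f"n = {n}: max P('not relevant') = {upper:.2f}, "
          f"min P(all judge 'relevant') = {lower:.2f}")
# n = 5:  max = 0.60, min = 0.40
# n = 30: max = 0.10, min = 0.90
```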
Having examined the basic idea of sampling error by applying the rule of 3s, let us consider a more accurate (but more cumbersome to calculate) estimate of the 95% lower-bound confidence limit for k* based on the binomial distribution. Table 2 has been prepared to facilitate our discussion; it shows the item-level CVI and Polit et al.'s (2007) pc* and k* for a variety of sample sizes (n). The information in the upper left portion of the table (Columns 1–6) is after that provided by Polit et al. in their Table 4; note that there are a few discrepancies due to errors in their calculations of pc*. Polit et al. applied Fleiss' (1981) guidelines for evaluating kappa to k*; these evaluations are shown in Column 6 (as noted above this application may not be warranted but we will proceed with it for the sake of illustration). Note that with only a few experts (e.g., three or four), agreement must be unanimous in order for k* to be considered "excellent"; with five experts, the criterion is relaxed a bit, allowing for one disagreement. As the number of experts increases, the number of disagreements permitted also increases.

Using the exact standard error for a binomial proportion (Zar, 1999, p. 527), the 95% lower-bound confidence limit was calculated for the values of k* (see Column 7). Applying the same guidelines to these confidence limits (Column 8) reveals the inadequacy of the small sample sizes commonly used in studies assessing item relevance via expert panels.
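Column 7 can be reproduced with a few lines of code. The sketch below is hypothetical; consistent with the table note and the tabled values, it takes the lower limit to be the exact Clopper-Pearson bound on the observed proportion of experts giving a rating of 3 or 4.

```python
from math import comb
from scipy.stats import beta

def lower_limit(r, n, alpha=0.05):
    """Exact (Clopper-Pearson) lower 95% confidence limit for r agreements among n experts."""
    return beta.ppf(alpha / 2, r, n - r + 1) if r > 0 else 0.0

for n, r in [(5, 4), (9, 9), (13, 13), (20, 19), (35, 35)]:
    i_cvi = r / n
    pc_star = comb(n, r) * 0.5**n
    k_star = (i_cvi - pc_star) / (1 - pc_star)
    print(n, r, round(k_star, 3), round(lower_limit(r, n), 3))
# e.g. (5, 4) -> k* = 0.763, lower limit = 0.284; (13, 13) -> 1.000, 0.753;
#      (20, 19) -> 0.950, 0.751; (35, 35) -> 1.000, 0.900
```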
The rows in the lower portion of Table 2 are used to make three points. First, when the evaluation guidelines (i.e., excellent ≥ 0.75) are to be applied to the lower-bound estimates, the minimum number of experts needed to conclude with 95% confidence that k* is at least 0.75 is 13; note that this would only be the case if all 13 gave a rating of 3 or 4. If just one of 13 experts responded 1 or 2, the result would not meet the criterion for "excellent." Second, applying this same standard, the minimum number of experts in which a single disagreement would be acceptable is 20. Third, it may be that some instrument developers are uncomfortable with the guidelines offered (i.e., excellent ≥ 0.75) and would prefer a more rigorous standard, such as a lower-bound estimate of at least 0.90. In this case the minimum number of experts needed is 35 if they showed unanimous agreement. The minimum sample size in which a single disagreement would be acceptable is 54. (It is worth noting that if the appropriate statistical model for interrater agreement were used, the sample sizes needed to obtain the same evaluations are even larger.)
In light of these numbers, many readers may feel disappointment from discovering that estimates of item relevance that are based on agreement in small samples of experts are not as reliable as they might have believed. Many may also be realizing for the first time that obtaining a high degree of agreement with a high degree of confidence is beyond their resources. This is precisely the point of this exercise; sampling of experts is merely a mechanism for obtaining an estimate of the item's (or instrument's) relevance to the construct at hand. There is no escaping the law of large numbers; the quality (i.e., precision) of an estimate is a function of sample size.
3. Conclusion
Validity of measurement is an important and complex issue that has received much attention from scholars for nearly 100 years. Validity is a property of the inferences investigators make from the responses they obtain and not a fixed property of the instruments used to obtain those responses. Expert judgments about the relevance or representativeness of an instrument's content should not be construed as validity. Gulliksen (1950b), writing about validity, noted that when the early investigations of psychological tests were conducted, the value of these tests was assessed by comparing the test results with expert judgments. He cited as examples the pioneering work of Binet (1899) and Cattel (1890) on mental testing, noting that the judgment of a teacher or supervisor was regarded as a criterion which the test should approximate as closely as possible. There are two important points here. First, expert judgment has been a part of the measurement validation process since the beginning. Second, and this is the vitally important point, it was the experts' assessments of the test takers, not of the test items, that provided the basis for establishing validity. Although the current review did not establish how it happened, it does seem that over the last century the original purpose for using expert judgment, as a standard for establishing criterion validity, has been forgotten by many so-called measurement specialists.
In nursing and the other social and behavioral sciences, measurements are typically obtained on individual persons, with the intent of treating these measurements as representing individual differences on some construct of interest. We, as researchers, often speak about the instruments we use in terms of their validity. When we do so, it is essential to remember that we are referring to the inferences we wish to make from the responses we have obtained, and not to the tests or scales we have used to procure these responses. Questions regarding the validity of an instrument are questions about what may be properly inferred from a score on the instrument. Without scores, there is no basis for assessing validity. Consequently, studies that report only on item relevance or representativeness, as judged by experts, cannot, by definition, offer any support for validity. Researchers and measurement specialists should abandon the term "content validity" and instead speak specifically to issues such as domain clarity and the adequacy of content domain sampling when discussing instrument development.
The various methods for quantifying what has been called content validity in the nursing research literature were shown to be deficient on one, or more, of the following technical grounds: they distort interrater agreement and discard information by collapsing response categories, they mis-specify the statistical model of interrater agreement, they do not adequately correct for chance agreement, and they neglect consideration of the huge sampling errors incurred by the use of small samples of experts. If nurse researchers feel it necessary to seek expert opinion on item relevance as part of instrument development, then large samples of experts should be employed, response categories should never be collapsed after the fact, indices like multirater kappa should be used along with their standard errors to examine interrater agreement, and the results should not be interpreted as addressing validity but as addressing the acceptability of an operational definition.
Nursing comprises the application and adaptation of established scientific knowledge to the promotion, improvement, and maintenance of human health and well-being (Beckstead and Beckstead, 2006). The field of psychology has much to offer nursing in this regard. Theories from clinical psychology have already influenced many nurse scholars. Humanistic psychologists Carl Rogers and Abraham Maslow shared an optimistic view of people as being capable of self-care and self-determination given a secure, nurturing environment, themes that pervade many nursing theories. Beckstead and Beckstead showed that the ideas of these and other psychologists have been productively incorporated into the thinking of various nurse scholars. The current article highlights how the ideas of some distinguished methodologists in the field of psychology have shifted attention from content to construct validity. The ideas of Cronbach and Meehl regarding construct validity, introduced into psychology some 50 years ago, have served to productively redirect intellectual energies in that field. Nurse scholars can benefit from these ideas as well. Focusing our attention on the attributes or processes that underlie the individual differences that we see in our own data (collected from patients, students, or colleagues), rather than debating how best to quantify what an insufficiently small sample of experts think about an arbitrary operational definition, seems a much more productive activity for advancing nursing as a scientific enterprise.
Conflict of interest. None declared.
Funding. None.
References
American Psychological Association, American Educational Research Association & National Council on Measurements Used in Education, 1954. Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin 51 (2), 201–238.
American Psychological Association, American Educational Research Association & National Council on Measurements Used in Education, 1974. Standards for Educational and Psychological Tests. American Psychological Association, Washington.
Beckstead, J.W., Beckstead, L.G., 2006. A multidimensional analysis of the epistemic origins of nursing theories, models, and frameworks. International Journal of Nursing Studies 43, 113–122.
Binet, A., 1899. Attention et adaptation. L'Année Psychologique 6, 248–404.
Cattel, J.McK., 1890. Mental tests and measurements. Mind 15, 373–381.
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20 (1), 37–46.
Cronbach, L.J., 1971. Test validation. In: Thorndike, R.L. (Ed.), Educational Measurement, 2nd ed. American Council on Education, Washington, DC.
Cronbach, L.J., Meehl, P.E., 1955. Construct validity in psychological tests. Psychological Bulletin 52 (4), 281–302.
Fitzpatrick, A.R., 1983. The meaning of content validity. Applied Psychological Measurement 7 (1), 3–13.
Fleiss, J.L., 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76 (3), 378–382.
Fleiss, J.L., 1981. Statistical Methods for Rates and Proportions, 2nd ed. John Wiley & Sons, Inc., New York.
Fleiss, J.L., Cohen, J., Everitt, B.S., 1969. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin 72 (5), 323–327.
Garner, W.R., 1960. Rating scales, discriminability, and information transmission. Psychological Review 67, 343–352.
Garner, W.R., Hake, H.W., 1951. The amount of information in absolute judgments. Psychological Review 58, 446–459.
Garner, W.R., McGill, W.J., 1956. The relation between information and variance analysis. Psychometrika 21, 219–228.
Guilford, J.P., 1954. Psychometric Methods, 2nd ed. McGraw-Hill, New York.
Guion, R.M., 1977. Content validity – the source of my discontent. Applied Psychological Measurement 1 (1), 1–10.
Gulliksen, H., 1950a. Theory of Mental Tests. John Wiley & Sons, Inc., New York.
Gulliksen, H., 1950b. Intrinsic validity. American Psychologist 5, 511–517.
Hambleton, R.K., et al., 1975. Criterion-referenced Testing and Measurement: Review of Technical Issues and Developments. An Invited Symposium Presented at the Annual Meeting of the American Educational Research Association (mimeo.), Washington, DC.
Huddleston, E.M., 1956. Test development on the basis of content validity. Educational and Psychological Measurement 16, 283–293.
James, L.R., Demaree, R.G., Wolf, G., 1984. Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology 69 (1), 85–98.
Landis, J.R., Koch, G.G., 1977. The measurement of observer agreement for categorical data. Biometrics 33, 1159–1174.
Lennon, R.T., 1956. Assumptions underlying the use of content validity. Educational and Psychological Measurement 16, 294–304.
Lindell, M.K., Brandt, C.J., 1999. Assessing interrater agreement on the job relevance of a test: a comparison of the CVI, rWG(J), and r*WG(J) indexes. Journal of Applied Psychology 84 (4), 640–647.
Lindell, M.K., Brandt, C.J., Whitney, D.J., 1999. A revised index of agreement for multi-item ratings of a single target. Applied Psychological Measurement 23 (2), 127–135.
Loevinger, J., 1965. Person and population as psychometric concepts. Psychological Review 72, 143–155.
Lynn, M.R., 1986. Determination and quantification of content validity. Nursing Research 35 (6), 382–385.
Maclure, M., Willett, W.C., 1987. Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology 126 (2), 161–169.
McCall, W.A., 1922. How to Measure in Education. Macmillan, New York.
Messick, S., 1975. The standard problem: meaning and values in measurement and evaluation. American Psychologist 30 (10), 955–966.
Mosier, C.I., 1947. A critical examination of the concepts of face validity. Educational and Psychological Measurement 7, 191–205.
Polit, D.F., Beck, C.T., 2006. The content validity index: are you sure you know what's being reported? Critique and recommendations. Research in Nursing & Health 29, 489–497.
Polit, D.F., Beck, C.T., Owen, S.V., 2007. Is the CVI an acceptable indicator of content validity? Appraisal and recommendations. Research in Nursing & Health 30, 459–467.
Stemler, S.E., 2004. A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation 9 (4). Retrieved October 27, 2008 from http://PAREonline.net.
Tenopyr, M.L., 1977. Content-construct confusion. Personnel Psychology 30, 47–54.
Thorndike, L.M., 1918. The nature, purposes and general methods of measurement of educational products. In: The Measurement of Educational Products. 17th Yearbook, Part II, National Society for the Study of Education, Chicago.
Tinsley, H.E.A., Weiss, D.J., 1975. Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology 22 (4), 358–376.
van Belle, G., 2002. Statistical Rules of Thumb. John Wiley & Sons, New York.
Wakefield, J.A., 1980. Relationship between two expressions of reliability: percentage agreement and phi. Educational and Psychological Measurement 40 (3), 593–597.
Waltz, C.F., Bausell, R.B., 1981. Nursing Research: Design, Statistics and Computer Analysis. F.A. Davis, Philadelphia.
Wynd, C.A., Schmidt, B., Schaefer, M.A., 2003. Two quantitative approaches for estimating content validity. Western Journal of Nursing Research 25 (5), 508–518.
Zar, J.H., 1999. Biostatistical Analysis, 4th ed. Prentice Hall, Upper Saddle River, NJ.
