Published by SAGE in Applied Psychological Measurement, 32(3) (http://apm.sagepub.com/).
Person-fit methods are used to uncover atypical test performance as reflected in the pattern of scores on individual items in a test. Unlike parametric person-fit statistics, nonparametric person-fit statistics do not require fitting a parametric test theory model. This study investigates the effectiveness of generalizations of nonparametric person-fit statistics to polytomous item response data. A simulation study using varying test and item characteristics shows that a simple count of the Guttman errors is effective in detecting serious person misfit. The simulation study further shows that in most conditions a simple nonparametric person-fit statistic is as effective as a commonly used parametric person-fit statistic in detecting deviant item score vectors. An empirical example illustrates the use of the nonparametric person-fit statistics in real data. Index terms: aberrant response behavior, nonparametric item response theory, person-fit analysis, person misfit, polytomous items
Person-fit analysis is concerned with uncovering atypical test performance as reflected in the
pattern of scores on individual items in a test (Meijer & Sijtsma, 2001). Because the validity of
such atypical item score vectors may be questionable, it is important to identify these patterns to
prevent the drawing of inadequate conclusions from the test results. Person-fit methods may help
to identify invalid outcomes of a test caused by, for example, a lack of motivation to take the test
seriously, concentration problems on a cognitive test, and faking on a personality test. Person-fit
analysis has a successful history in the domain of cognitive and achievement testing (Meijer &
Sijtsma, 2001). Examples include identification of respondents with language deficiencies on an
intelligence test (Van der Flier, 1982) and students suffering from test anxiety on a cognitive test
(Birenbaum & Nassar, 1994).
A distinction can be made between parametric and nonparametric person-fit methods. Unlike
parametric person-fit statistics, nonparametric person-fit statistics do not require a parametric item
response theory (IRT) model that fits the data. Parametric and nonparametric person-fit methods
were extensively studied for dichotomous items (Karabatsos, 2003; Meijer & Sijtsma, 2001).
Studies of person-fit methods for polytomous items (i.e., items with three or more answer categories), however, have concentrated primarily on parametric approaches. Examples include van
Krimpen-Stoop and Meijer (2002) and Dagohoy (2005) in the context of educational testing, and
Zickar and Drasgow (1996), Zickar, Gibby, and Robie (2004), and Reise and Widaman (1999) in
the context of noncognitive assessment.
The main purpose of this study is to introduce three nonparametric person-fit methods for polytomous items (i.e., the number of Guttman errors, the newly proposed normed number of Guttman errors, and the generalized U3 statistic), to study their performance under various conditions of misfit (e.g., careless responding and extreme response behavior), and to benchmark the results against an often-used parametric person-fit statistic, lpz (Drasgow, Levine, & Williams, 1985).
In the context of nonparametric item response theory (NIRT), Molenaar (1991) proposed the
weighted number of Guttman errors as an index of model fit. The weighted number of Guttman errors
can also be calculated from an individual’s vector of item scores, which can be used as an index of
person fit. However, the properties of the number of Guttman errors as an index of person fit were not
studied as extensively as in the dichotomous case. Furthermore, a disadvantage of the weighted
number of Guttman errors is that its maximum depends on the sum score. This limits the comparabil-
ity of the index across sum-score levels. In this study, a normed version of the number of Guttman
errors is proposed that weights the number of Guttman errors by its maximum given the sum score.
In the context of dichotomous NIRT models, the U3 statistic (Van der Flier, 1980) was devel-
oped, which takes into account both the item-difficulty ordering and the values of the item difficul-
ties (i.e., proportion of correct or coded answers). Karabatsos (2003) compared 36 person-fit
statistics and found that U3 was in the top four of most powerful statistics (see also Emons,
Sijtsma, & Meijer, 2005). In this study, the U3 statistic for dichotomous item scores is generalized
to polytomous item scores, and its properties are compared with the number of Guttman errors and
the normed number of Guttman errors.
This report is organized as follows. First, a theoretical framework is provided for the NIRT
models used in this study. Second, the nonparametric person-fit statistics for polytomous items are
discussed. Third, a simulation study is presented in which the properties of the nonparametric person-fit statistics were studied and compared with a popular parametric person-fit statistic for poly-
tomous items. Fourth, the results of the simulation study are discussed. This report concludes with
an empirical example on industrial malodor.
Figure 1
Examples of Item Step Response Functions for (A) the Monotone Homogeneity
Model and (B) the Double Monotonicity Model
which can be described by the DMM. The fit of the DMM for polytomous items can be evaluated
in empirical data using methods discussed by Sijtsma and Molenaar (2002). Examples of applica-
tions of the DMM include Rivas, Bersabé, and Berrocal (2005) and Van Onna (2003).
For M = 1, statistic Gp specializes to the number of Guttman errors for a vector of dichotomous
items (e.g., Meijer & Sijtsma, 2001). Statistic Gp is implemented in the computer program MSP5
for Windows (Molenaar & Sijtsma, 2000).
which is the sum of the log odds of the item-step difficulties of the steps that were passed. The polytomous generalization of U3, denoted by U3p, is obtained by norming W as follows:

$$U3_p = \frac{\max(W \mid X_+) - W}{\max(W \mid X_+) - \min(W \mid X_+)}, \qquad (3)$$

with a minimum value of U3p equal to 0 indicating no misfit and a value of 1 indicating extreme misfit. The maximum, max(W | X+), in equation (3) is obtained if and only if the X+ easiest item steps are passed; that is,

$$\max(W \mid X_+) = \sum_{k=1}^{X_+} \operatorname{logit}(\hat{p}_k).$$

Because of structural dependencies between the item-step scores, the minimum value, min(W | X+), cannot be expressed in closed form. Therefore, min(W | X+) was computed using a recursion algorithm (details can be found in the appendix).
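To make the counting and norming concrete, the following Python sketch computes the number of Guttman errors and U3p for a small polytomous test. The step popularities, the ordering of steps by estimated popularity, and the brute-force enumeration of max(W | X+) and min(W | X+) are illustrative assumptions of this sketch; the study itself computes min(W | X+) with a recursion algorithm, which scales far better than enumeration.

```python
import itertools
import math

def step_pattern(x, M):
    """Dichotomize a polytomous item score x into M item-step scores:
    step m (m = 1, ..., M) is passed if x >= m."""
    return [1 if x >= m else 0 for m in range(1, M + 1)]

def guttman_errors(scores, step_popularity, M):
    """Count Guttman errors: pairs of item steps in which a less popular
    (harder) step is passed while a more popular (easier) step is failed."""
    steps = [s for x in scores for s in step_pattern(x, M)]
    pops = [p for item in step_popularity for p in item]
    order = sorted(range(len(steps)), key=lambda k: -pops[k])  # easiest first
    ordered = [steps[k] for k in order]
    return sum(1 for i in range(len(ordered)) for j in range(i)
               if ordered[i] == 1 and ordered[j] == 0)

def u3p(scores, step_popularity, M):
    """U3p by brute force for a tiny test: W is the sum of logits of the
    step popularities over passed steps; max(W|X+) and min(W|X+) are found
    by enumerating all score vectors with the same sum score."""
    pops = [p for item in step_popularity for p in item]
    logits = [math.log(p / (1 - p)) for p in pops]

    def W(vec):
        steps = [s for x in vec for s in step_pattern(x, M)]
        return sum(l for s, l in zip(steps, logits) if s == 1)

    x_plus = sum(scores)
    w_vals = [W(v) for v in itertools.product(range(M + 1), repeat=len(scores))
              if sum(v) == x_plus]
    w_max, w_min = max(w_vals), min(w_vals)
    return (w_max - W(scores)) / (w_max - w_min)
```

Because the sum of the item-step scores equals X+, enumerating score vectors with the same sum score enumerates exactly the admissible step patterns given X+.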
Simulation Study
Data Generation
Data were generated under the graded response model (GRM; Samejima, 1969, 1997). The GRM also assumes unidimensionality and local independence, but defines the ISRFs as

$$P_{jx_j}(y) = \frac{\exp\left[a_j (y - d_{jx_j})\right]}{1 + \exp\left[a_j (y - d_{jx_j})\right]}, \qquad (4)$$

where a_j is the slope parameter and d_{jx_j} is the location parameter of the ISRF for x_j. The location parameter d_{jx_j} indicates for which y the probability of scoring x_j or higher equals .50. The option response curve (ORC) defines the probability of scoring x_j on item j conditional on y, which is obtained from the ISRFs as follows:

$$P^{*}_{jx_j}(y) = P(X_j = x_j \mid y) = \begin{cases} 1 - P_{j1}(y) & \text{if } x_j = 0, \\ P_{jx_j}(y) - P_{j(x_j+1)}(y) & \text{if } 1 \le x_j \le M - 1, \\ P_{jM}(y) & \text{if } x_j = M. \end{cases}$$

A response to item j was generated by drawing a random score from the multinomial distribution with M + 1 outcomes and parameters P*_{jx_j}(y), with x_j = 0, ..., M.
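Under the stated assumptions (logistic ISRFs with ordered location parameters, scores drawn from the ORC probabilities), this generation step can be sketched as follows; the parameter values in the test below are illustrative, not those of Table 1.

```python
import math
import random

def isrf(theta, a, delta):
    """ISRF of equation (4): probability of scoring x_j or higher."""
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))

def orc(theta, a, deltas):
    """Option response curve: probability of each score 0, ..., M,
    obtained as differences of adjacent ISRFs (deltas must be ordered)."""
    M = len(deltas)
    p_cum = [1.0] + [isrf(theta, a, d) for d in deltas] + [0.0]
    return [p_cum[m] - p_cum[m + 1] for m in range(M + 1)]

def draw_response(theta, a, deltas, rng=random):
    """Draw one item score from the categorical distribution given by the
    ORC (inverse-CDF sampling from a single uniform draw)."""
    u, cum = rng.random(), 0.0
    for m, p in enumerate(orc(theta, a, deltas)):
        cum += p
        if u <= cum:
            return m
    return len(deltas)  # guard against floating-point round-off
```

Repeating `draw_response` over J items with item-specific parameters yields one simulated item score vector.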
For the GRM, the ISRFs do not intersect if a1 = a2 = . . . = aJ . Although nonintersecting
ISRFs are the basis of the nonparametric person-fit statistics used in this study, the data were gen-
erated under GRMs that allowed the as to vary and thus did not strictly satisfy the assumption of
the DMM. The choice of a more general model in which ISRFs intersect is justified by results from
simulation studies in the context of dichotomous person-fit analysis, which consistently showed
that nonparametric person-fit methods are robust against mild to moderate departures from nonin-
tersection (e.g., Emons, 2003; Sijtsma & Meijer, 2001). The latter condition is often realized in
real data because tests are assembled to have items with steep slopes and varying item difficulties
(Emons et al., 2005). Items with relatively flat slopes, causing many intersections with the other
response functions, are often excluded from the test because they provide little information for
measurement.
Tendency to choose extreme response options. Respondents may differ in their tendency to
choose the extreme response options (Hamilton, 1968; Paulhus, 1991). This means that some
respondents are more inclined to endorse one of the extreme response options (e.g., strongly dis-
agree or strongly agree) regardless of the item content and his or her y, whereas others have the
tendency to avoid using extreme response options. This type of response behavior is referred to as extreme response style behavior. Large differences in extreme response style behavior may impair
the comparability of individual test scores (e.g., Van Herk, Poortinga, & Verhallen, 2004). This
lack of comparability of individual test scores may be revealed by a person-fit analysis.
A respondent exhibiting extreme response style behavior has a higher probability of endorsing
one of the extreme response categories than predicted from his or her y and the ISRFs. These indi-
vidual increases of endorsement probabilities for the extreme response options can be accounted
for by individual changes in the distances between the threshold parameters djm (to be explained
shortly). This means that for each person who exhibits extreme response style behavior, there is
a unique vector of threshold parameters, which differs from the threshold parameters that describe
the postulated ISRFs. Psychometric models that take into account these individual threshold struc-
tures to model extreme response style behavior were pursued by Johnson (2004), Rennie (1982),
and Rossi, Gilula, and Allenby (2001).
To simulate data for extreme response style behavior, an appropriate transformation of the djxj s
is needed at the individual level. This transformation must maintain the ordering of the djxj s and
must result in endorsement probabilities that are higher for extreme response options. A suitable
approach is by means of a linear transformation of the item-step location parameters djxj that shifts
them closer to the average of the djxj s. This approach is comparable to the proportional threshold
approach that was proposed by Rossi et al. (2001). Let x denote the person parameter that governs
the transformation of the threshold parameters and thus reflects the individual’s extreme response
style behavior. Furthermore, let dj be the mean of the M location parameters djxj of item j. The
linear transformation of the djxj s, denoted by d*jxj, which was used in this study, is obtained by

$$d^{*}_{jx_j} = \bar{d}_j + (1 + x)\,(d_{jx_j} - \bar{d}_j). \qquad (5)$$

For x < 0, equation (5) shifts the location parameters toward the mean d̄j and, as a result, decreases the distance between the djxj s and d̄j. For persons at the lower end of the y scale, this results in
higher endorsement probabilities for the lowest response option, and for persons at the higher end
of the y scale, this results in higher endorsement probabilities for the highest response option.
Figures 2A and 2B give the ORCs of a hypothetical item with M = 3, for x = 0 (i.e., the null model) and x = −0.8. For x > 0, the transformed location parameters d*jxj are more dispersed with respect to the mean item difficulty, resulting in decreased option response probabilities for the extreme response options (i.e., a tendency to avoid extreme response options; see Figure 2C).
In this study, data were simulated for x = −0.8. This means that extreme response style behavior was treated as a fixed effect. The choice of x was based on preliminary simulations using the same item and test characteristics as used in this study. In these simulations, the effect of x on the item responses in the normal and aberrant samples was verified using an overall measure of extreme response behavior proposed by Bachman and O'Malley (1984; see also Van Herk et al., 2004). This index is the number of responses in the extreme response categories divided by J. For x = −0.8, the differences between the means of the extreme response indices for the normal sample and the aberrant sample were significant (t test, p < .001). Effect sizes for these differences ranged from 0.46 to 2.58, indicating medium to strong effects (Cohen, 1988). These results led to the conclusion that x = −0.8 is a reasonable choice for simulating extreme response behavior.
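A minimal sketch of this simulation step, assuming a simple linear shrinkage of the item-step locations toward their item mean (the exact transformation in equation (5) of the study may be parameterized differently), together with a Bachman–O'Malley-type extreme response index:

```python
def shift_thresholds(deltas, x):
    """Shrink (x < 0) or spread (x > 0) the item-step locations around
    their item mean; x = 0 leaves them unchanged (the null model).
    The linear form used here is an assumption of this sketch."""
    d_bar = sum(deltas) / len(deltas)
    return [d_bar + (1.0 + x) * (d - d_bar) for d in deltas]

def extreme_response_index(scores, M):
    """Fraction of responses in the two extreme categories (0 or M),
    in the spirit of the Bachman and O'Malley (1984) index."""
    return sum(1 for s in scores if s in (0, M)) / len(scores)
```

Because the transformation is linear with a positive factor (1 + x) for x > −1, the ordering of the location parameters is preserved, as required.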
Reversed scoring. Personality questionnaires and attitude scales may consist of both items that
are positively worded (i.e., high scores correspond to high y levels) and items that are negatively
worded (i.e., high scores correspond to low y levels). Respondents may fail to notice the different
directions of the wording and, as a result, may answer some items opposite to what they meant to
do. This type of aberrant response behavior was simulated by means of recoding the generated
score xj under the IRT model into M − xj .
Figure 2
Option Response Curves of a Four-Choice Item for Simulating Data Under
(A) the Null Model (x = 0), (B) the Tendency to Choose Extreme Response Options
(x < 0), and (C) the Tendency to Avoid Extreme Response Options (x > 0)
nonparametric person-fit statistics. Let d_{x_j}(m) = 1 if x_j = m (m = 0, ..., M), and 0 otherwise. The unstandardized log-likelihood person-fit statistic for polytomous items, lp, is given by

$$l_p = \sum_{j=1}^{J} \sum_{m=0}^{M} d_{x_j}(m) \ln P^{*}_{jm}(y).$$
Statistic lpz can be interpreted as a standard normal deviate, with large negative values of lpz (say, ≤ −2.0) indicating misfit.
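Given the ORC probabilities for each item, the unstandardized statistic lp reduces to a sum of log probabilities of the observed scores; a sketch (the standardization to lpz, which requires the mean and variance of lp, is omitted here):

```python
import math

def l_p(scores, orc_probs):
    """Unstandardized log-likelihood person-fit statistic l_p: the sum over
    items of the log ORC probability of the observed score. The double sum
    with the indicator d_{x_j}(m) collapses to this single sum, because the
    indicator selects exactly m = x_j for each item."""
    return sum(math.log(orc_probs[j][x]) for j, x in enumerate(scores))
```

Here `orc_probs[j]` is the list of M + 1 option probabilities for item j evaluated at the person's y estimate; smaller (more negative) values indicate less likely, and hence more suspect, item score vectors.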
Independent Variables
Test length and number of response options. Data were generated for two levels of test length:
J = 12 and J = 24. For each level of J, data were generated for two levels of the number of
response options: M = 2 and M = 4. These choices of J and M were based on the characteristics
of existing personality scales. For example, the Neuroticism-Extroversion-Openness Five-Factor
Inventory (NEO-FFI; Costa & McCrae, 1992) measures each factor of the five-factor model (the
Big Five) using 12 items. The item parameter values (Table 1) that were used for generating the
data were taken from Embretson and Reise (2000, p. 100). These values are representative of other
empirical studies on the fit of the GRM to data from personality questionnaires (e.g., Reise, Widaman, & Pugh, 1993). For J = 24, the 12-item set was doubled.
Discrimination power. Several studies (e.g., Meijer, Molenaar, & Sijtsma, 1994; Meijer &
Sijtsma, 2001) showed that the power of person-fit statistics depends on the discrimination power
of the items. A higher discrimination means a more reliable score X+ . This may produce higher
detection rates. To investigate the effect of item discrimination, the y variance was varied because
for fixed ISRFs increasing the y variance results in higher item discrimination and higher test-
score reliability (e.g., Hemker et al., 1995). In particular, y variances equal to 1.0 and 1.6 were
used to simulate low and moderate discrimination, respectively.
Number of misfitting item scores in a vector. In this study, three types of aberrant response
behavior were discerned. These types of response behavior may govern the answers to all J
items, for example, when respondents do not seriously answer any of the items. They may also
govern only a part of the answers to the items. For example, respondents may be more inclined to
choose one of the extreme response options of the questions they consider particularly important
Table 1
Configuration of the Item Parameters in the Simulation Study
for a favorable presentation of themselves. To investigate the effect of the number of affected
items, Jmisfit , two levels of misfit for J = 12 (Jmisfit = 6 and 12) and four levels of misfit for J = 24
(Jmisfit = 6, 12, 18, and 24) were simulated.
Dependent Variables
The usefulness of a person-fit statistic as a diagnostic tool for detecting aberrant item score vec-
tors is determined by the trade-off between the detection rates (i.e., the degree to which misfitting
item score vectors are detected) and the Type I error rate (i.e., the degree to which fitting item score
vectors are incorrectly diagnosed as misfitting). The detection rates for each statistic were obtained
at five fixed Type I error rates: .01, .025, .05, .10, and .20. It should be noted that person-fit researchers (e.g., Meijer, 2003) may prefer relatively large α levels because most person-fit statistics have relatively low power at low α levels and incorrect rejection of the null hypothesis of no misfit often has no serious consequences.
Detection rates were obtained as follows:
1. A total of 1,000 item score vectors were simulated under the null model of normal response behavior.
These data were used to estimate the ordering of the item-step difficulties p̂ and the item parameters
of the GRM (equation (4)). The parameters were estimated using the program MULTILOG (Thissen,
1991).
2. Two data sets of 3,000 item score vectors each were simulated: one data set under the null model of
normal response behavior (called the clean data set) and the other under the aberrant response mecha-
nism of interest (called the aberrant data set).
3. For each person, a y estimate was obtained using the item parameter estimates obtained in Step 1.
Then, the person-fit statistics were computed for the 3,000 simulated item score vectors in the clean
sample and the 3,000 simulated item score vectors in the aberrant sample.
4. For statistic Gp, for each Type I error level the critical value tcv was determined such that in the clean sample the fraction of respondents having a value higher than tcv equals the Type I error rate; that is, P(Gp ≥ tcv) = q (with q = .01, .025, .05, .10, and .20 for each Type I error rate, respectively). The detection rate is the fraction of respondents in the aberrant sample having a person-fit value Gp higher than tcv. Detection rates for GpN and U3p were obtained in the same way. For lpz, critical values tcv were determined by the fractions P(lpz ≤ tcv) = q (with q = .01, .025, .05, .10, and .20) in the clean sample, and the corresponding fraction in the aberrant sample produced the detection rate.
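The cutoff and detection-rate computation of Step 4 can be sketched as follows for an upper-tail statistic such as Gp; for lpz, the lower tail would be used instead. The values in the sketch are illustrative.

```python
def critical_value(clean_values, alpha):
    """Empirical upper-tail cutoff t_cv: chosen so that a fraction alpha
    of the clean sample lies strictly above the returned value."""
    ordered = sorted(clean_values)
    n_above = int(alpha * len(ordered))  # number of clean values above t_cv
    return ordered[len(ordered) - n_above - 1]

def detection_rate(aberrant_values, t_cv):
    """Fraction of the aberrant sample with a fit value above t_cv."""
    return sum(1 for v in aberrant_values if v > t_cv) / len(aberrant_values)
```

With ties in the clean sample the realized Type I error rate can differ slightly from the nominal alpha; in the simulations described above, the large clean samples make this discrepancy negligible.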
Results
Results under normal response behavior. The scatterplots in Figure 3 show for J = 12 and
M = 4 the relationship between X + and statistics Gp (panels in the first row), GpN (panels in the
second row), and U3p (panels in the third row) under the null model of normal response behavior
for low discrimination (left-hand side panels) and high discrimination (right-hand side panels).
The plots show that the distribution of statistic Gp conditional on X+ varies across X+ . Compared
with low and high X+ , the conditional Gp distribution for medium X+ had a higher mean and
a larger variance. Except for very high and very low X+ scores, the conditional distributions of
GpN and U3p were approximately the same across X+ and therefore less confounded with X+ .
Similar results were found in the other conditions. The results corroborate the conclusion that it is
best not to use person-fit indices for patterns with scores near the extremes. These patterns contain
too little information to draw valid conclusions about aberrant response behavior. Similar results
were found for dichotomous items (Emons, Meijer, & Sijtsma, 2002). Item score vectors yielding X+ ≤ M or X+ ≥ JM − M were discarded from further analysis. This means that critical values (tcv)
were obtained in the clean sample, and detection rates in the aberrant sample, from which extreme
patterns were removed.
Results for carelessness and inattention. For the five significance levels, Table 2 shows the
detection rates of Gp , GpN , U3p , and lpz for detecting carelessness and inattention under 12 combi-
nations of M, J, Jmisfit , and discrimination power. For the three nonparametric person-fit statistics,
statistic Gp showed, in general, the best performance. In only a few conditions did U3p perform slightly better than Gp, but the differences were negligible. The detection rates for Gp were somewhat smaller than those for lpz; differences ranged from .01 to .11. For J = 12 and Jmisfit = 6, the detection rates were smaller than .50 for α levels smaller than .05. This means that under this condition, the person-fit statistics lacked the power to detect carelessness at conventional α levels. The dif-
ferences between the detection rates obtained for M = 2 and M = 4 were small; absolute differ-
ences ranged from .02 to .06 for J = 12, Jmisfit = 6, and from .01 to .15 for J = 12, Jmisfit = 12.
Thus, increasing the number of response options had minor effects on the detection rates for a small
number of misfitting item scores and somewhat larger effects for a larger number of misfitting item
scores. This result was found for both the nonparametric person-fit statistics and the parametric
statistic lpz . As expected, increasing the number of misfitting item scores had a substantial effect on
the detection rates. In particular, for J = 24, Jmisfit = 12, and low discrimination, a detection rate of
.72 was found for M = 2, and .73 for M = 4. Small differences were found between low and high
discrimination (the largest absolute difference was .11). The simulations suggest that of all three
nonparametric person-fit statistics, Gp is most effective in detecting extreme cases of carelessness
or inattention in which all item scores are affected. Such patterns are realistic for this type of
response behavior because serious lack of motivation, or serious lack of concentration, is indepen-
dent of the specific item content and most likely to affect the response behavior to all items.
Results for tendency to choose extreme response options. Table 3 shows, for the five significance levels, the detection rates of Gp, GpN, U3p, and lpz for detecting tendencies to choose extreme
response options. Compared with the detection rates for carelessness, the detection rates for
extreme response behavior were smaller and, in particular, the detection rates for J = 12 and
Figure 3
Scatterplots of Sum Score (X + ) Against (A) Statistic Gp for Low Discrimination, (B) Statistic
Gp for High Discrimination, (C) Statistic GpN for Low Discrimination, (D) Statistic GpN for High
Discrimination, (E) Statistic U3p for Low Discrimination, and (F) Statistic U3p for High Discrimination
Table 2
Detection Rates for Carelessness and Inattention at Five Significance Levels
Table 3
Detection Rates for Tendency to Choose Extreme Response Options at Five Significance Levels
Jmisfit = 6 were smaller than .50 at conventional α levels, a result that was also found for the lpz statistic. This indicates that this type of misfit is difficult to detect. The detection rates were higher for M = 4 than for M = 2, yielding acceptable detection rates if the number of affected items was large enough. In particular, for J = 12, Jmisfit = 12, and low discrimination, the detection rates at α = .05 were .71 and .73 for GpN and U3p, respectively. Compared with Gp and lpz, the detection rates were higher for the normed statistics GpN and U3p.
The simulations also revealed higher detection rates for low discrimination than for high dis-
crimination. The explanation for this result is that the higher the item discrimination, the higher
the probability of responses in the lowest category of difficult items, and likewise for the highest
category of easy items. This means that as the discrimination increases, a tendency to choose
extreme response options leads to smaller discrepancies between the expected pattern under the
null model and the observed item score vector and, thus, lower detection rates.
The simulations suggest that for items with five or more answer categories, the normed non-
parametric person-fit statistics are particularly effective in detecting cases that show a tendency to
choose the extreme response categories on all items. Both parametric and nonparametric person-
fit statistics lack power if the number of response options is small or if the tendency to choose
extreme response options is exhibited on less than half of the items. Note that the tendency to
choose extreme responses is also independent of item content and, as a result, most likely to affect
the response behavior to most of the items.
Results for reversed scoring. Table 4 shows the detection rates of Gp , GpN , U3p , and lpz for
detecting reversed scoring. For the three nonparametric person-fit statistics, statistic Gp showed
the highest detection rates in the conditions in which half the item scores were affected. In these
conditions, statistic Gp also performed better than lpz , but differences were small. No clear trends
were found if all item scores were affected, but in general, higher detection rates were found if half
of the items were reverse scored than if all items were reverse scored. For example, for J = 24 the highest detection rates were found if 18 items were reverse scored. Note that for tests that use
different wording directions, only a subset of items are worded in the opposite direction. Failure to
notice the changes in the direction of wording affects only a subset of item scores. In this case, sta-
tistic Gp is most useful for detecting item score vectors containing errors in the direction of the
scoring.
Comparison of the results for low and high discrimination showed that higher discrimination
produced higher detection rates when half of the items were reverse scored, but lower detection
rates when all items were reverse scored. If only half of the items are reverse scored, there is a vec-
tor in which some of the easy (difficult) items are unaffected and a response in the highest (lowest)
category is observed, whereas other easy (difficult) items are affected and a response in the oppo-
site lowest (highest) category is observed. This yields large inconsistencies between the item
scores within this pattern, given the item-step difficulty ordering. This effect becomes stronger as
the item discrimination increases because higher discrimination increases the probability that
respondents will choose extreme categories of easy and difficult items. Reverse scoring of all
items affects the item score vectors as follows. For respondents with a high y, who have high
scores on most items, reverse scoring results in a pattern with low scores on most items and a low
X+ . This pattern of low scores and a low X+ is not as inconsistent with the NIRT model as if half
of the items were reverse scored. Similar results hold for respondents with a low y, who have low
scores on most items. Thus, it is the combination of the number of affected items and item discrim-
ination that influenced the detection rates. This effect depends on the characteristics of the test
(e.g., spread of item difficulties).
Table 4
Detection Rates for Reverse Scoring at Five Significance Levels
Table 5
Item Labels and Item Means for the Items Used in the Real Data Example,
and the Observed Item Score Vector and Person-Fit Results for Five Cases
There were n = 153 respondents having X+ ≤ 2 or X+ ≥ 22 (i.e., scores near the extremes
0 or 24). These subjects were excluded from the person-fit analysis, yielding a final sample of
Figure 4
Venn Diagram Indicating Overlap of the Number of Item Score Vectors
Classified as Misfitting Using Gp , GpN , and U3p
(Region counts: Gp only = 10; GpN only = 1; Gp and GpN only = 1; all three = 23; Gp and U3p only = 0; GpN and U3p only = 9; U3p only = 0.)
675 item score vectors (mean X+ = 7.42, SD = 3.65). Inspection of the correlations between X+ and the person-fit indices showed the highest correlation for Gp (r = .36) and the smallest correlation for U3p (r = .11). The correlations among the nonparametric person-fit indices ranged from .88 to .89. The Venn diagram in Figure 4 shows how many item score vectors
produced a person-fit index in the highest 5% for a single statistic, a combination of two statis-
tics, or all three person statistics. In particular, it shows that 44 patterns were detected by at
least one of the statistics, and 23 patterns were identified by all three statistics. There were 10
vectors found by Gp that were not found by one of the normed statistics GpN or U3p , and vice
versa.
Table 5 shows five individual cases that had person-fit values in the upper 5% range for at least
one of the nonparametric person-fit statistics. Substantive interpretation of these patterns is diffi-
cult without additional information, but the score pattern of Case 4 suggests a tendency to choose
extreme response patterns. The same tendency may also explain the high person-fit values for the
other cases. Case 5 shows a high U3p value and relatively low values for the other statistics. This
case showed an unexpectedly high score on the most difficult item, which explained the relatively
high U3p compared to Gp and GpN .
Discussion
Person-fit methods are important tools used to detect individual item score vectors that deviate
from the other vectors in the sample and therefore need further inquiry. This study investigated
the effectiveness of three nonparametric person-fit statistics for polytomous item scores, both in
simulated data and in real data. Under varying test and item characteristics, simulations were
done under IRT models that are characteristic of real data applications. In general, a simple count
of the number of Guttman errors in the pattern of item-step scores was most effective in detecting
item score vectors that showed considerable misfit. The number of Guttman errors has the disad-
vantage of being confounded with X+ , particularly for relatively low and high X+ . The normed
statistics GpN and U3p were less confounded with X+ , but they showed lower detection rates in
most conditions. The simulations showed that item score vectors yielding X+ near the extremes
should be discarded from the person-fit analysis. The simulations further showed that in most
conditions, the detection rates of Gp were comparable to the parametric person-fit statistic lpz .
This means that the choice of a nonparametric approach does not necessarily imply a substantial
reduction of power.
An advantage of nonparametric approaches is that the underlying NIRT models are less restrictive with respect to the data than their parametric counterparts. Empirical studies (Chernyshenko,
Stark, Chan, Drasgow, & Williams, 2001; Steinberg & Thissen, 1996) showed, for example, that
fitting parametric IRT models to personality data may not be straightforward, and NIRT may be
a useful alternative (e.g., Meijer & Baneke, 2004). A second advantage of NIRT models is that
a smaller sample size is needed to obtain reliable estimation of psychometric characteristics (e.g.,
item-step difficulties p̂) for person-fit analysis.
For the nonparametric person-fit statistics used in this study, the fit of a particular item score
vector is evaluated in relation to the fit values of all other item score vectors in the sample.
Because one does not want to confound detection of person misfit with X+ , normed person-fit sta-
tistics were proposed to reduce the dependency on X+ . Alternatively, the fit of a particular item
score vector can also be evaluated by comparing its fit value to the other fit values in the group of
persons with the same X+ (i.e., using distributions of person-fit statistics conditional on X+ ; e.g.,
Molenaar & Hoijtink, 1990). A serious limitation of this approach is that in particular for polyto-
mous item tests, large sample sizes are needed to adequately decide about the fit of an individual
item score vector.
In this study, simulated cutoff values for fixed Type I error rates were used to obtain detec-
tion rates that are comparable across different statistics. In practice, one cannot generate item
score patterns under the true model for setting cutoff values for fixed Type I error rates. There-
fore, in most applications, person-fit indices are used as descriptive measures for identification
of misfitting item score vectors in the sample at hand. Alternatively, cutoff values may be deter-
mined in empirical research. In the context of psychological assessment, psychological scales
are extensively studied in norm populations for score interpretation before they are put into
practice. Test users may use the distribution of nonparametric person-fit statistics obtained in
the norm population as the reference distribution to evaluate the consistency of new individual
observations. Item score vectors that produce a person-fit statistic that is in the upper tail of the
empirical distribution derived in the norm population are suspicious and subject to further
inquiry.
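A norm-based decision rule of this kind can be sketched in a few lines. The function names and the conservative, non-interpolating empirical quantile are our choices for illustration, not a prescription from the study.

```python
def empirical_cutoff(norm_fit_values, upper_tail=0.05):
    """Empirical (1 - upper_tail) quantile of the person-fit values
    observed in the norm population (no interpolation between order
    statistics, so the cutoff is conservative)."""
    ordered = sorted(norm_fit_values)
    k = int((1.0 - upper_tail) * (len(ordered) - 1))
    return ordered[k]

def flag_vector(fit_value, cutoff):
    """A new item score vector is flagged for further inquiry when its
    person-fit value falls above the norm-population cutoff."""
    return fit_value > cutoff
```

With a norm sample of fit values 0, 1, ..., 99 and the default 5% upper tail, the cutoff is the 95th order statistic, so a new fit value of 99 is flagged while 50 is not.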
A topic for future research is the use of nonparametric estimates of the ISRFs. For example,
Sijtsma and Van der Ark (2003) used the estimated item rest-score regression curves to impute
missing data. Their approach can be extended to other applications as well, including person-fit
analysis. In particular, nonparametric estimates of the IRT model can be obtained (e.g., using
Testgraf; Ramsay, 2000). For the estimated nonparametric model, bootstrap methods (Efron &
Tibshirani, 1993) can be used to investigate the distributional characteristics of the person-fit
statistics under the null model. This may lead to cutoff values defining ranges of Gp , GpN , or U3p
that indicate varying levels of misfit (e.g., ‘‘no misfit,’’ ‘‘moderate misfit,’’ and ‘‘serious mis-
fit’’). Another topic for further research is applications of person-fit methods to nonmonotonic
items. In the context of personality measurement, items are sometimes found to be single peaked
(e.g., Meijer & Baneke, 2004). NIRT models may be pursued that accommodate nonmonotonic
items. Generalizations of person-fit methods to this class of models are also a topic for future
research.
Appendix
Initialization Step (s = 1)
First, rank numbers are assigned to the item steps as follows. The item steps for xj = 0 are
ranked 0. The remaining item steps are rank-numbered according to increasing difficulty (i.e.,
decreasing Pjxj(θ); see columns 6-9 of Table A1). Second, for each item the cumulative rank
numbers of the item steps are computed (see columns 10-13 of Table A1). For example, the rank
numbers of Item 1 were 0, 1, 2, and 5, and the corresponding cumulative ranks are 0, 1, 3, and 8. Third,
let rj = (rj1 , . . . , rj(M+1) ) be the row vector containing the cumulative ranks of item j (e.g.,
r1 = (0, 1, 3, 8)). An (M + 1) × (M + 1) matrix V(s=1) is computed, with elements

    V(s=1)_kl = r1k + r2l ,  with k, l = 1, . . . , M + 1.
In the example, the resulting matrix V(s=1) is given by

    V(s=1) = |  0   3   9  18 |
             |  1   4  10  19 |
             |  3   6  12  21 |
             |  8  11  17  26 |
Fourth, from this matrix V(s=1) , a new vector T(s=1) of length 2M + 1 is computed, with elements
T(s=1)_k (k = 1, . . . , 2M + 1) given by

    T(s=1)_k = max{ V(s=1)_(k−l, 1+l) },  l = (0, . . . , k − 1),      if k ≤ M,
    T(s=1)_k = max{ V(s=1)_(k−l, 1+l) },  l = (k − M − 1, . . . , M),  if k > M.      (A1)

In the example, T(s=1) = (0, 3, 9, 18, 19, 21, 26). This ends the initialization step.
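The initialization step can be sketched compactly with 0-based indexing: T collects, per anti-diagonal of V, the maximum rank sum, which is what Equation (A1) expresses with 1-based indices. The function name is ours.

```python
def init_step(r1, r2):
    """Initialization step: V[k][l] = r1[k] + r2[l]; T[k] is the maximum
    of V over anti-diagonal k (all cells whose row + column == k)."""
    M = len(r1) - 1  # number of item steps per item
    V = [[r1[k] + r2[l] for l in range(M + 1)] for k in range(M + 1)]
    return [max(V[i][k - i] for i in range(max(0, k - M), min(k, M) + 1))
            for k in range(2 * M + 1)]
```

For the cumulative ranks of the first two hypothetical items, r1 = (0, 1, 3, 8) and r2 = (0, 3, 9, 18), this reproduces T(s=1) = (0, 3, 9, 18, 19, 21, 26).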
Recursion Steps
The initialization step is followed by J − 2 recursion steps. Each step s (s = 2, . . . , J − 1)
proceeds as follows. A new matrix V(s) is computed, with elements

    V(s)_kl = T(s−1)_k + r(s+1), l ,  with k = 1, . . . , Ms + 1 and l = 1, . . . , M + 1.
This matrix has Ms + 1 rows and M + 1 columns. From this matrix a new vector T(s) is obtained,
in which the elements T(s)_k , with k = 1, . . . , M(s + 1) + 1, are given by

    T(s)_k = max{ V(s)_(k−l, 1+l) },  l = (0, . . . , k − 1),       if k ≤ M,
    T(s)_k = max{ V(s)_(k−l, 1+l) },  l = (0, . . . , M),           if M < k ≤ Ms,
    T(s)_k = max{ V(s)_(k−l, 1+l) },  l = (k − Ms − 1, . . . , M),  if k > Ms.      (A2)
This vector T(s) has M(s + 1) + 1 elements. After all recursion steps are accomplished, the vector
T contains, element-wise, the maximum rank sum for X+ = 0, . . . , JM. In particular, the maximum
rank sum given X+ is the (X+ + 1)th element of the final vector T.
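The complete recursion is a standard dynamic program over items: starting from T = (0), each item's cumulative ranks are folded in, keeping the maximum rank sum per total score. A compact sketch (function name ours; the per-item cumulative ranks are assumed given, as in Table A1):

```python
def max_rank_sums(cumulative_ranks):
    """T[x] = maximum rank sum attainable over all item score vectors
    with total score X+ = x (Equations A1-A2 folded into one loop)."""
    T = [0]
    for r in cumulative_ranks:  # r has M + 1 entries for one item
        M = len(r) - 1
        new = [float("-inf")] * (len(T) + M)
        for k, t in enumerate(T):
            for l in range(M + 1):
                # take l steps of this item on top of a partial score of k
                new[k + l] = max(new[k + l], t + r[l])
        T = new
    return T
```

Applied to the first two hypothetical items, max_rank_sums([(0, 1, 3, 8), (0, 3, 9, 18)]) reproduces the vector (0, 3, 9, 18, 19, 21, 26) obtained in the initialization step.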
Table A1
Item-Step Probabilities, Ranked Steps, and Cumulative Ranks for J = 4
Hypothetical Items, Each With Four Answer Categories (M = 3)
Table A2
Maximum Number of Guttman Errors and Norming Indices for U3p
for the Four Hypothetical Items With M = 3

X+   max(Gp|X+)   Minimum Sum of Logits   Maximum Sum of Logits
 0        0             0.000000               0.0000000
 1        7            −1.450010               1.9924302
 2       16            −3.647235               2.3563955
 3       25            −5.960870               2.4765399
 4       25            −6.242721               2.1946887
 5       27            −7.567646               1.6624719
 6       31            −9.658387               0.8151740
 7       27            −9.538243              −0.5097514
 8       25           −10.385541              −1.9597616
 9       25           −11.901888              −3.4761091
10       16            −9.909458              −5.5668501
11        7            −9.545493              −7.7640747
12        0           −10.077710             −10.0777097
The maximum rank sum is finally transformed to the number of Guttman errors using the mini-
mum rank sum given X+ , which is equal to X+ (X+ + 1)/2 (e.g., Emons, 2003). The maximum
number of Guttman errors given X+ is then given by

    max(Gp |X+ ) = T(X+ + 1) − X+ (X+ + 1)/2.

The values of max(Gp |X+ ) for the hypothetical example are given in Table A2.
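The final transformation is a one-liner. Below it is applied to the two-item vector T = (0, 3, 9, 18, 19, 21, 26) from the initialization example; Table A2 itself uses all four items, whose full cumulative ranks are listed in Table A1.

```python
def max_guttman_errors(T):
    """max(Gp | X+) = T(X+ + 1) - X+(X+ + 1)/2, with T indexed by X+
    from 0 (i.e., T[x] is the maximum rank sum given X+ = x)."""
    return [t - x * (x + 1) // 2 for x, t in enumerate(T)]
```

For the two-item example, max_guttman_errors([0, 3, 9, 18, 19, 21, 26]) gives (0, 2, 6, 12, 9, 6, 5).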
Norming U3p
The algorithm for finding the norming values to compute U3p is based on the same recursion
algorithm as that for finding max(Gp |X+ ), but it differs in two respects. First, the cumulative
ranks are replaced by the cumulative sums of the logits of the
item-step probabilities. Second, in Equations (A1) and (A2) of the algorithm, the minimum values
are recorded in the vector T instead of the maximum values. As a consequence, the algorithm pro-
duces the minimum value of the sum of logits for each level of X+ . The maximum value of the
sum of the logits given X+ is the sum of the logits of the first X+ item steps. These values can be
used for the normed U3p statistic. The conditional norming values for U3p are also given in
Table A2.
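The same dynamic program yields the U3p norming values once the cumulative ranks are replaced by cumulative logit sums and the maximum by a minimum. The sketch below assumes each item's step probabilities are supplied in step order, so that taking score xj on item j contributes the cumulative logit sum of its first xj steps; names and input format are ours.

```python
import math

def min_logit_sums(step_probs):
    """Minimum attainable sum of item-step logits per total score X+:
    the same recursion as for the maximum rank sums, with the cumulative
    ranks replaced by cumulative logit sums and max replaced by min."""
    T = [0.0]
    for probs in step_probs:  # probabilities of passing each step of one item
        cum = [0.0]
        for p in probs:
            cum.append(cum[-1] + math.log(p / (1.0 - p)))
        M = len(cum) - 1
        new = [math.inf] * (len(T) + M)
        for k, t in enumerate(T):
            for l in range(M + 1):
                new[k + l] = min(new[k + l], t + cum[l])
        T = new
    return T
```

For two one-step items with step probabilities 0.8 and 0.6, the minimum sum of logits at X+ = 1 is log(0.6/0.4) ≈ 0.405, the smaller of the two step logits, as expected.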
References
Bachman, J. G., & O'Malley, P. M. (1984). Yea-saying, nay-saying, and going to extremes: Black-white differences in response styles. Public Opinion Quarterly, 48, 491-509.
Birenbaum, M., & Nassar, F. (1994). On the relationship between test anxiety and test performance. Measurement and Evaluation in Counseling and Development, 27, 293-301.
Cavalini, P. M. (1992). It's an ill wind that brings no good. Studies on odour annoyance and the dispersion of odorant concentrations from industries. Unpublished doctoral dissertation, University of Groningen, Netherlands.
Chernyshenko, O. S., Stark, S., Chan, K., Drasgow, F., & Williams, B. (2001). Fitting item response theory models to two personality inventories: Issues and insights. Multivariate Behavioral Research, 36, 523-562.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum.
Costa, P. T., & McCrae, R. R. (1992). The NEO Personality Inventory and NEO Five Factor Inventory professional manual. Odessa, FL: Psychological Assessment Resources.
Dagohoy, A. V. T. (2005). Person fit for tests with polytomous responses. Unpublished doctoral dissertation, University of Twente, Enschede, Netherlands.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Emons, W. H. M. (2003). Investigating the local fit of item-score vectors. In H. Yanai, A. Okada, K. Shigemasu, Y. Kano, & J. J. Meulman (Eds.), New developments in psychometrics (pp. 289-296). Tokyo: Springer.
Emons, W. H. M., Meijer, R. R., & Sijtsma, K. (2002). Comparing simulated and theoretical sampling distributions of the U3 person-fit statistic. Applied Psychological Measurement, 26, 88-108.
Emons, W. H. M., Sijtsma, K., & Meijer, R. R. (2005). Global, local, and graphical person-fit analysis using person response functions. Psychological Methods, 10, 101-119.
Hamilton, D. L. (1968). Personality attributes associated with extreme response style. Psychological Bulletin, 69, 192-203.
Hemker, B. T., Sijtsma, K., & Molenaar, I. W. (1995). Selection of unidimensional scales from a multidimensional item bank in the polytomous Mokken IRT model. Applied Psychological Measurement, 19, 337-352.
Hemker, B. T., Sijtsma, K., Molenaar, I. W., & Junker, B. W. (1997). Stochastic ordering using the latent trait and the sum score in polytomous IRT models. Psychometrika, 62, 331-347.
Johnson, T. R. (2004). On the use of heterogeneous thresholds ordinal regression models to account for individual differences in response style. Psychometrika, 68, 563-583.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.
Meijer, R. R. (2003). Diagnosing item score patterns on a test using IRT based person-fit statistics. Psychological Methods, 8, 72-87.
Meijer, R. R., & Baneke, J. (2004). Analyzing psychopathology items: A case for nonparametric item response theory modeling. Psychological Methods, 9, 354-367.
Meijer, R. R., Molenaar, I. W., & Sijtsma, K. (1994). Influence of test and person characteristics on nonparametric appropriateness measurement. Applied Psychological Measurement, 18, 111-120.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.
Molenaar, I. W. (1982). Mokken scaling revisited. Kwantitatieve Methoden, 3(8), 145-164.
Molenaar, I. W. (1991). A weighted Loevinger H-coefficient extending Mokken scaling to multicategory items. Kwantitatieve Methoden, 12(37), 97-117.
Molenaar, I. W. (1997). Nonparametric models for polytomous responses. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 369-380). New York: Springer.
Molenaar, I. W., & Hoijtink, H. (1990). The many null distributions of person-fit indices. Psychometrika, 55, 75-106.
Molenaar, I. W., & Sijtsma, K. (2000). MSP5 for Windows: User's manual [Computer manual]. Groningen, Netherlands: ProGAMMA.
Paulhus, D. L. (1991). Measurement and control of response bias. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 17-59). San Diego, CA: Academic Press.
Ramsay, J. O. (2000). Testgraf: A program for the graphical analysis of multiple choice test and questionnaire data [Computer software]. Montreal, Canada: Department of Psychology, McGill University.
Reise, S. P., & Widaman, K. F. (1999). Assessing the fit of measurement models at the individual level: A comparison of item response theory and covariance structure approaches. Psychological Methods, 4, 3-21.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
Rennie, L. J. (1982). Research note: Detecting a response set to Likert-style attitude items with the rating model. Education Research and Perspectives, 9, 114-118.
Rivas, T., Bersabé, R., & Berrocal, C. (2005). Application of the double monotonicity model to polytomous items: Scalability of the Beck depression items on subjects with eating disorders. European Journal of Psychological Assessment, 21, 1-10.
Rossi, P. E., Gilula, Z., & Allenby, G. M. (2001). Overcoming scale usage heterogeneity: A Bayesian hierarchical approach. Journal of the American Statistical Association, 96, 20-31.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 17.
Samejima, F. (1997). The graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York: Springer.
Sijtsma, K., & Meijer, R. R. (2001). The person response function as a tool in person-fit research. Psychometrika, 66, 191-208.
Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response theory. Thousand Oaks, CA: Sage.
Sijtsma, K., & Van der Ark, L. A. (2003). Investigation and treatment of missing item scores in test and questionnaire data. Multivariate Behavioral Research, 38, 505-528.
Steinberg, L., & Thissen, D. (1996). Uses of item response theory and the testlet concept in the measurement of psychopathology. Psychological Methods, 1, 81-97.
Thissen, D. (1991). MULTILOG user's guide: Multiple categorical item analysis and test scoring using item response theory [Computer manual]. Chicago: Scientific Software.
Van der Ark, L. A. (2001). Relationships and properties of polytomous item response theory models. Applied Psychological Measurement, 25, 273-282.
Van der Flier, H. (1980). Vergelijkbaarheid van individuele testprestaties [Comparability of individual test performance]. Lisse, Netherlands: Swets & Zeitlinger.
Van der Flier, H. (1982). Deviant response patterns and comparability of test scores. Journal of Cross-Cultural Psychology, 13, 267-298.
Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.
Van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (2002). Detection of person misfit in computerized adaptive tests with polytomous items. Applied Psychological Measurement, 26, 164-180.