Applied Psychological Measurement
http://apm.sagepub.com/

Nonparametric Person-Fit Analysis of Polytomous Item Scores

Wilco H. M. Emons
Applied Psychological Measurement 2008 32: 224
DOI: 10.1177/0146621607302479

The online version of this article can be found at:
http://apm.sagepub.com/content/32/3/224

Published by: http://www.sagepublications.com

Version of Record - Apr 8, 2008


Nonparametric Person-Fit Analysis
of Polytomous Item Scores
Wilco H. M. Emons, Tilburg University

Person-fit methods are used to uncover atypical test performance as reflected in the pattern of scores on individual items in a test. Unlike parametric person-fit statistics, nonparametric person-fit statistics do not require fitting a parametric test theory model. This study investigates the effectiveness of generalizations of nonparametric person-fit statistics to polytomous item response data. A simulation study using varying test and item characteristics shows that a simple count of the Guttman errors is effective in detecting serious person misfit. The simulation study further shows that in most conditions a simple nonparametric person-fit statistic is as effective as a commonly used parametric person-fit statistic in detecting deviant item score vectors. An empirical example illustrates the use of the nonparametric person-fit statistics in real data. Index terms: aberrant response behavior, nonparametric item response theory, person-fit analysis, person misfit, polytomous items

Person-fit analysis is concerned with uncovering atypical test performance as reflected in the
pattern of scores on individual items in a test (Meijer & Sijtsma, 2001). Because the validity of
such atypical item score vectors may be questionable, it is important to identify these patterns to
prevent the drawing of inadequate conclusions from the test results. Person-fit methods may help
to identify invalid outcomes of a test caused by, for example, a lack of motivation to take the test
seriously, concentration problems on a cognitive test, and faking on a personality test. Person-fit
analysis has a successful history in the domain of cognitive and achievement testing (Meijer &
Sijtsma, 2001). Examples include identification of respondents with language deficiencies on an
intelligence test (Van der Flier, 1982) and students suffering from test anxiety on a cognitive test
(Birenbaum & Nassar, 1994).
A distinction can be made between parametric and nonparametric person-fit methods. Unlike
parametric person-fit statistics, nonparametric person-fit statistics do not require a parametric item
response theory (IRT) model that fits the data. Parametric and nonparametric person-fit methods
were extensively studied for dichotomous items (Karabatsos, 2003; Meijer & Sijtsma, 2001).
Studies of person-fit methods for polytomous items (i.e., items with three or more answer cate-
gories), however, were primarily concentrated on parametric approaches. Examples include van
Krimpen-Stoop and Meijer (2002) and Dagohoy (2005) in the context of educational testing, and
Zickar and Drasgow (1996), Zickar, Gibby, and Robie (2004), and Reise and Widaman (1999) in
the context of noncognitive assessment.
The main topic of this study is to introduce three nonparametric person-fit methods for polytomous items (i.e., the number of Guttman errors, the newly proposed normed number of Guttman errors, and the generalized U3 statistic), to study their performance under various conditions of misfit (e.g., careless responding and extreme response behavior), and to benchmark the obtained results against an often-used parametric person-fit statistic, lpz (Drasgow, Levine, & Williams, 1985).

Applied Psychological Measurement, Vol. 32 No. 3, May 2008, 224-247
DOI: 10.1177/0146621607302479
© 2008 Sage Publications
In the context of nonparametric item response theory (NIRT), Molenaar (1991) proposed the
weighted number of Guttman errors as an index of model fit. The weighted number of Guttman errors
can also be calculated from an individual’s vector of item scores, which can be used as an index of
person fit. However, the properties of the number of Guttman errors as an index of person fit were not
studied as extensively as in the dichotomous case. Furthermore, a disadvantage of the weighted
number of Guttman errors is that its maximum depends on the sum score. This limits the comparabil-
ity of the index across sum-score levels. In this study, a normed version of the number of Guttman
errors is proposed that weights the number of Guttman errors by its maximum given the sum score.
In the context of dichotomous NIRT models, the U3 statistic (Van der Flier, 1980) was developed, which takes into account both the item-difficulty ordering and the values of the item difficulties (i.e., proportions of correct or coded answers). Karabatsos (2003) compared 36 person-fit statistics and found that U3 was among the four most powerful (see also Emons, Sijtsma, & Meijer, 2005). In this study, the U3 statistic for dichotomous item scores is generalized to polytomous item scores, and its properties are compared with those of the number of Guttman errors and the normed number of Guttman errors.
This report is organized as follows. First, a theoretical framework is provided for the NIRT
models used in this study. Second, the nonparametric person-fit statistics for polytomous items are
discussed. Third, a simulation study was done in which the properties of the nonparametric per-
son-fit statistics were studied and compared with a popular parametric person-fit statistic for poly-
tomous items. Fourth, the results of the simulation study are discussed. This report concludes with
an empirical example on industrial malodor.

Nonparametric Polytomous IRT


The person-fit statistics used in this study are defined in the context of Mokken scaling for polytomous item scores (e.g., Hemker, Sijtsma, Molenaar, & Junker, 1997; Molenaar, 1997; Sijtsma & Molenaar, 2002, chaps. 7 and 8). Let J be the number of items, each with M + 1 ordered response categories; let Xj be the random variable for the score on item j (j = 1, ..., J), with possible realizations xj = 0, ..., M; and let X+ = Σ(j=1..J) Xj be the sum score. The central part of polytomous NIRT models are the item step response functions (ISRFs), which relate the probability of scoring xj or higher to the latent trait y of interest (e.g., aptitude, ability, or attitude). The ISRFs are denoted by Pjxj(y) = P(Xj ≥ xj | y), with xj = 1, ..., M. The person-fit statistics used in this study are defined under NIRT models that satisfy the following four assumptions. First, the latent trait y that governs the item responses is unidimensional. Second, the item scores are independent conditional on y. Third, the ISRFs are monotonically increasing in y: for two arbitrary values ya and yb, Pjxj(ya) ≥ Pjxj(yb) whenever ya > yb. A model that satisfies the first three assumptions is Mokken's model of monotone homogeneity (MHM). The practical importance of the MHM is that it justifies the use of the sum score for ordering persons on y (Sijtsma & Molenaar, 2002, p. 121; Van der Ark, 2001). Figure 1A gives the ISRFs of two items, each with four answer categories (i.e., M = 3), which can be described by the MHM.

The fourth assumption is that the ISRFs of different items do not intersect: for two items i and j and a fixed value y0, if Pixi(y0) < Pjxj(y0), then Pixi(y) ≤ Pjxj(y) for all values of y. Note that the ISRFs within one item cannot intersect by definition. A model that satisfies all four assumptions is Mokken's double monotonicity model (DMM). Figure 1B gives the ISRFs of two items, each with four answer categories (i.e., M = 3),


Figure 1
Examples of Item Step Response Functions for (A) the Monotone Homogeneity
Model and (B) the Double Monotonicity Model


which can be described by the DMM. The fit of the DMM for polytomous items can be evaluated
in empirical data using methods discussed by Sijtsma and Molenaar (2002). Examples of applica-
tions of the DMM include Rivas, Bersabé, and Berrocal (2005) and Van Onna (2003).

Nonparametric Person-Fit Statistics

Number of Guttman Errors


The first statistic that was studied was the number of Guttman errors for polytomous items
(Molenaar, 1991; see also Hemker, Sijtsma, & Molenaar, 1995), denoted by Gp . Statistic Gp is
explained in the context of Mokken models using a fictitious example for J = 2 and M = 3. Mok-
ken models for polytomous items are defined using the concept of item steps (Molenaar, 1982,
1997; Sijtsma & Molenaar, 2002). Consider an item with four response categories (i.e., M = 3),
labeled 0 = strongly disagree, 1 = disagree, 2 = agree, and 3 = strongly agree. In the first step,
the respondent ascertains whether he or she has enough of the latent trait to take the first step from
strongly disagree to disagree. If and only if the first step is taken, he or she moves on to the next
step and ascertains whether he or she has enough of the latent trait to also take the second step from
disagree to agree. This process of consecutive steps proceeds until the respondent fails at an item
step or reaches the highest response category.
Let pjxj be the item-step difficulty, which is the population proportion of respondents with a score xj or higher on item j, and let p̂jxj be its sample estimate. In this example, the hypothetical item-step difficulties equal p11 = .80, p12 = .70, and p13 = .30 for Item 1, and p21 = .90, p22 = .50, and p23 = .20 for Item 2. This means that the easiest item step is Step 1 of Item 2 (passed by 90% of the respondents); the next easiest item step is Step 1 of Item 1 (passed by 80%), followed by Step 2 of Item 1 (passed by 70%), Step 2 of Item 2 (passed by 50%), and Step 3 of Item 1 (passed by 30%); and the most difficult step is Step 3 of Item 2 (passed by 20%). Suppose a respondent scored x1 = 1 and x2 = 3. This respondent passed the second step of Item 2 but failed the easier Step 2 of Item 1. In addition, this person passed Step 3 of Item 2 but not the easier Steps 2 and 3 of Item 1. Thus, of all possible item-step pairs, there are three pairs in which the easier step was failed and the more difficult step was passed. This count is the number of Guttman errors. The higher the number of Guttman errors, the stronger the evidence of person misfit.
Statistic Gp is formally defined as follows. Let Y denote the item-step score variable, with realization 1 if the item step is passed and 0 otherwise, and let the random vector Y denote the joint vector of JM item-step scores in ascending item-step difficulty order. This ordering is obtained from the item-step difficulties p̂jxj (j = 1, ..., J; xj = 1, ..., M). In particular, let the JM item steps be ordered and numbered by increasing difficulty, p̂1 ≥ p̂2 ≥ ... ≥ p̂k ≥ ... ≥ p̂JM (k = 1, ..., JM); and let Y = (Y1, ..., Yk, ..., YJM) be the corresponding ordered vector of the JM item-step scores and y = (y1, ..., yk, ..., yJM) its realization. The item-step scores in Y are structurally dependent because passing a particular item step of one item implies that the easier steps of that same item are also passed. Note that X+ is related to Y by X+ = Σ(k=1..JM) Yk. In this fictitious example, item scores x1 = 1 and x2 = 3 yield y = (1, 1, 0, 1, 0, 1). For an observed vector y, the number of Guttman errors is

    Gp = Σ(k=1..JM) Σ(l<k) yk(1 − yl).    (1)

For M = 1, statistic Gp specializes to the number of Guttman errors for a vector of dichotomous
items (e.g., Meijer & Sijtsma, 2001). Statistic Gp is implemented in the computer program MSP5
for Windows (Molenaar & Sijtsma, 2000).
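To make the count in equation (1) concrete, a minimal Python sketch follows; it assumes the item-step scores are already sorted from easiest to hardest step, and the example vector is the one from the fictitious two-item example above.

```python
def guttman_errors(y):
    """Gp for an ordered 0/1 item-step vector y (easiest step first):
    count pairs (l, k), l < k, with step l failed and step k passed."""
    return sum(1 for k, yk in enumerate(y) if yk
               for l in range(k) if y[l] == 0)

# Fictitious example from the text: x1 = 1, x2 = 3 gives
# y = (1, 1, 0, 1, 0, 1) and three Guttman errors.
print(guttman_errors([1, 1, 0, 1, 0, 1]))  # 3
```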


Normed Number of Guttman Errors


The minimum of Gp is equal to 0, which is obtained if and only if the X+ easiest item steps are passed. The maximum of Gp, however, depends on X+ and the ordering of the JM item-step difficulties. To compare Gp across different X+ scores, statistic Gp was normed by the maximum possible value given X+ and the item-step difficulty ordering of all JM item steps. The normed number of Guttman errors, denoted GpN, is given by

    GpN = Gp / max(Gp | X+),    (2)

with a minimum value of 0 (no misfit) and a maximum value of 1 (extreme misfit). Because the item-step scores Y are structurally dependent, max(Gp | X+) cannot be expressed in closed form. Therefore, a recursion algorithm was developed that determines the maximum of Gp conditional on the item-step difficulty ordering obtained from the p̂ks and X+. Details of the algorithm can be found in the appendix, and the accompanying software can be obtained from the author.
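The recursion algorithm itself is in the appendix; for small J and M, however, max(Gp | X+) can also be found by brute force over all admissible item score vectors, which is a convenient way to check an implementation. The sketch below is that brute force, not the paper's algorithm, applied to the two-item example (the step ordering is the corrected one from the text).

```python
from itertools import product

def guttman_errors(y):
    return sum(1 for k, yk in enumerate(y) if yk
               for l in range(k) if y[l] == 0)

def step_scores(x, order):
    """0/1 scores on the JM item steps, listed easiest first;
    `order` holds (item index, step number) pairs."""
    return [1 if x[j] >= m else 0 for j, m in order]

def max_guttman_errors(x_plus, J, M, order):
    """Brute-force max(Gp | X+) over all item score vectors with sum X+."""
    return max(guttman_errors(step_scores(x, order))
               for x in product(range(M + 1), repeat=J) if sum(x) == x_plus)

# Two-item example: item steps ordered from easiest to hardest.
order = [(1, 1), (0, 1), (0, 2), (1, 2), (0, 3), (1, 3)]
print(max_guttman_errors(4, 2, 3, order))  # 3, so GpN = 3/3 = 1 for x = (1, 3)
```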

Generalized U3 Person-Fit Statistic


A generalization of Van der Flier's (1980) U3 person-fit statistic to polytomous items is proposed, which is defined as follows. For an observed vector y, let

    W(y) = Σ(k=1..JM) yk log[ p̂k / (1 − p̂k) ],

which is the sum of the log odds of the item-step difficulties of the steps that were passed. The polytomous generalization of U3, denoted U3p, is obtained by norming W(y) as follows:

    U3p = [max(W | X+) − W(y)] / [max(W | X+) − min(W | X+)],    (3)

with a minimum value of 0 indicating no misfit and a maximum value of 1 indicating extreme misfit. The maximum max(W | X+) in equation (3) is obtained if and only if the X+ easiest item steps are passed; that is,

    max(W | X+) = Σ(k=1..X+) logit(p̂k).

Because of structural dependencies between the item-step scores, the minimum value min(W | X+) cannot be expressed in closed form. Therefore, min(W | X+) was computed using a recursion algorithm (details can be found in the appendix).
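As with GpN, min(W | X+) has no closed form; the sketch below again substitutes brute force for the appendix's recursion and uses the item-step difficulties of the fictitious two-item example.

```python
from itertools import product
from math import log

def logit(p):
    return log(p / (1.0 - p))

def w_stat(x, p_hat):
    """W(y): summed log odds of the passed item steps;
    p_hat[j][m - 1] estimates P(Xj >= m)."""
    return sum(logit(p_hat[j][m]) for j, xj in enumerate(x) for m in range(xj))

def u3p(x, p_hat, M):
    J, x_plus = len(x), sum(x)
    odds = sorted((logit(p) for row in p_hat for p in row), reverse=True)
    w_max = sum(odds[:x_plus])            # pass the X+ easiest steps
    w_min = min(w_stat(v, p_hat)          # brute-force min(W | X+)
                for v in product(range(M + 1), repeat=J) if sum(v) == x_plus)
    return (w_max - w_stat(x, p_hat)) / (w_max - w_min)

# Item-step difficulties from the fictitious example.
p_hat = [[.80, .70, .30], [.90, .50, .20]]
print(round(u3p([1, 3], p_hat, 3), 3))  # 1.0: the worst vector with X+ = 4
```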

Simulation Study
Data Generation
Data were generated under the graded response model (GRM; Samejima, 1969, 1997). The
GRM also assumes unidimensionality and local independence, but defines the ISRFs as
    Pjxj(y) = exp[aj(y − djxj)] / {1 + exp[aj(y − djxj)]},    (4)


where aj is the slope parameter and djxj is the location parameter of the ISRF for xj. The location parameter djxj indicates for which y the probability of scoring xj or higher equals .50. The option response curve (ORC) defines the probability of scoring xj on item j conditional on y, which is obtained from the ISRFs as follows:

    P*jxj(y) = P(Xj = xj | y) = 1 − Pj1(y)                if xj = 0,
                                Pjxj(y) − Pj(xj+1)(y)     if 1 ≤ xj ≤ M − 1,
                                PjM(y)                    if xj = M.

A response to item j was generated by drawing a random score from the multinomial distribution with M + 1 outcomes and parameters P*jxj(y), with xj = 0, ..., M.
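A minimal generator for this scheme under the GRM of equation (4); the example call uses the parameters of Item 2 from Table 1 (M = 4).

```python
import random
from math import exp

def isrf(theta, a, d):
    """GRM item step response function P(Xj >= xj | theta), equation (4)."""
    return 1.0 / (1.0 + exp(-a * (theta - d)))

def orc(theta, a, deltas):
    """Option response curve: P(Xj = m | theta) for m = 0, ..., M."""
    cum = [1.0] + [isrf(theta, a, d) for d in deltas] + [0.0]
    return [cum[m] - cum[m + 1] for m in range(len(deltas) + 1)]

def draw_score(theta, a, deltas, rng=random):
    """One multinomial draw with the ORC values as cell probabilities."""
    u, total = rng.random(), 0.0
    for m, p in enumerate(orc(theta, a, deltas)):
        total += p
        if u < total:
            return m
    return len(deltas)

# Item 2 from Table 1: a = 1.42, locations (-2.07, -0.22, 0.93, 2.42).
print(draw_score(0.0, 1.42, [-2.07, -0.22, 0.93, 2.42]))
```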
For the GRM, the ISRFs do not intersect if a1 = a2 = . . . = aJ . Although nonintersecting
ISRFs are the basis of the nonparametric person-fit statistics used in this study, the data were gen-
erated under GRMs that allowed the as to vary and thus did not strictly satisfy the assumption of
the DMM. The choice of a more general model in which ISRFs intersect is justified by results from
simulation studies in the context of dichotomous person-fit analysis, which consistently showed
that nonparametric person-fit methods are robust against mild to moderate departures from nonin-
tersection (e.g., Emons, 2003; Sijtsma & Meijer, 2001). The latter condition is often realized in
real data because tests are assembled to have items with steep slopes and varying item difficulties
(Emons et al., 2005). Items with relatively flat slopes, causing many intersections with the other
response functions, are often excluded from the test because they provide little information for
measurement.

Simulation of Misfitting Item Score Vectors


Carelessness and inattention. When a respondent takes a test without any personal interest in
the outcomes, he or she may blindly answer the questions by randomly choosing one of the
response options without considering the item content. This type of response behavior may be
found, for example, when the respondent participated in a test because of a reward (e.g., a fee or
study credits) and has to give the appearance that he or she seriously took the test. Other examples
of response behavior leading to random item score patterns include carelessness, serious loss of
concentration, misreading of items, and alignment errors. An item score vector X resulting from random response behavior was simulated by drawing J random numbers zj (j = 1, ..., J) from the uniform distribution on [0, M + 1] and then taking the integer part of each zj; that is, Xj = trunc(zj).
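This random-response mechanism is a one-liner in Python; the J and M values in the call are illustrative.

```python
import random

def careless_vector(J, M, rng=random):
    """Random response behavior: Xj = trunc(zj), zj ~ uniform on [0, M + 1)."""
    return [int(rng.uniform(0, M + 1)) for _ in range(J)]

print(careless_vector(12, 4, random.Random(42)))
```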

Tendency to choose extreme response options. Respondents may differ in their tendency to
choose the extreme response options (Hamilton, 1968; Paulhus, 1991). This means that some
respondents are more inclined to endorse one of the extreme response options (e.g., strongly dis-
agree or strongly agree) regardless of the item content and his or her y, whereas others have the
tendency to avoid using extreme response options. This type of response behavior is indicated as
extreme response style behavior. Large differences in extreme response style behavior may impair
the comparability of individual test scores (e.g., Van Herk, Poortinga, & Verhallen, 2004). This
lack of comparability of individual test scores may be revealed by a person-fit analysis.
A respondent exhibiting extreme response style behavior has a higher probability of endorsing
one of the extreme response categories than predicted from his or her y and the ISRFs. These indi-
vidual increases of endorsement probabilities for the extreme response options can be accounted
for by individual changes in the distances between the threshold parameters djm (to be explained


shortly). This means that for each person who exhibits extreme response style behavior, there is
a unique vector of threshold parameters, which differs from the threshold parameters that describe
the postulated ISRFs. Psychometric models that take into account these individual threshold struc-
tures to model extreme response style behavior were pursued by Johnson (2004), Rennie (1982),
and Rossi, Gilula, and Allenby (2001).
To simulate data for extreme response style behavior, an appropriate transformation of the djxjs is needed at the individual level. This transformation must maintain the ordering of the djxjs and must result in higher endorsement probabilities for the extreme response options. A suitable approach is a linear transformation of the item-step location parameters djxj that shifts them closer to the average of the djxjs. This approach is comparable to the proportional threshold approach proposed by Rossi et al. (2001). Let x denote the person parameter that governs the transformation of the threshold parameters and thus reflects the individual's extreme response style behavior. Furthermore, let d̄j be the mean of the M location parameters djxj of item j. The linear transformation of the djxjs used in this study, denoted d*jxj, is obtained by

    d*jxj = exp(x) × (djxj − d̄j) + d̄j.    (5)

For x < 0, equation (5) shifts the location parameters toward the mean d̄j and, as a result, decreases the distance between the djxjs and d̄j. For persons at the lower end of the y scale, this results in higher endorsement probabilities for the lowest response option, and for persons at the higher end of the y scale, in higher endorsement probabilities for the highest response option. Figures 2A and 2B give the ORCs of a hypothetical item with M = 3 for x = 0 (i.e., the null model) and x = −0.8. For x > 0, the d*jxjs are more dispersed around the mean item difficulty, resulting in decreased option response probabilities for the extreme response options (i.e., a tendency to avoid extreme response options; see Figure 2C).
In this study, data were simulated for x = −0.8; that is, extreme response style behavior was treated as a fixed effect. The choice of x was based on preliminary simulations using the same item and test characteristics as in this study. In these simulations, the effect of x on the item responses in the normal and aberrant samples was verified using an overall measure of extreme response behavior proposed by Bachman and Malley (1984; see also Van Herk et al., 2004). This index is the count of the number of responses in the extreme response categories divided by J. For x = −0.8, the differences between the mean extreme response indices of the normal and aberrant samples were significant (t test, p < .000). Effect sizes for these differences ranged from 0.46 to 2.58, indicating medium to strong effects (Cohen, 1988). These results led to the conclusion that x = −0.8 is a reasonable choice for simulating extreme response behavior.
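Equation (5) in code, applied to Item 1 from Table 1 (M = 4) with x = −0.8: the ordering and the item mean are preserved while the spacing between locations shrinks by a factor exp(−0.8) ≈ 0.45.

```python
from math import exp

def transform_thresholds(deltas, xi):
    """Equation (5): shift the location parameters toward their mean;
    xi < 0 models a tendency toward the extreme response options."""
    d_bar = sum(deltas) / len(deltas)
    return [exp(xi) * (d - d_bar) + d_bar for d in deltas]

# Item 1 from Table 1 with xi = -0.8.
print(transform_thresholds([-3.80, -1.93, -0.87, 1.88], -0.8))
```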

Reversed scoring. Personality questionnaires and attitude scales may consist both of items that
are positively worded (i.e., high scores correspond to high y levels) and items that are negatively
worded (i.e., high scores correspond to low y levels). Respondents may fail to notice the different
directions of the wording and, as a result, may answer some items opposite to what they meant to
do. This type of aberrant response behavior was simulated by means of recoding the generated
score xj under the IRT model into M − xj .
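Simulating this mechanism only requires recoding the affected items; the index set of misread items below is a hypothetical illustration, not taken from the study design.

```python
def reverse_scores(x, M, misread):
    """Recode xj into M - xj for the items in `misread`."""
    return [M - xj if j in misread else xj for j, xj in enumerate(x)]

print(reverse_scores([0, 3, 1, 2], 3, {1, 3}))  # [0, 0, 1, 1]
```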

Comparison With the Parametric Polytomous Person-Fit Statistic l pz


The nonparametric person-fit statistics that are used in this study are compared with the para-
metric standardized log-likelihood statistic for polytomous items, known as lpz (Drasgow et al.,
1985). This statistic serves as a benchmark to give a better interpretation of the power of the nonparametric person-fit statistics.

Figure 2
Option Response Curves of a Four-Choice Item for Simulating Data Under
(A) the Null Model (x = 0), (B) the Tendency to Choose Extreme Response Options
(x < 0), and (C) the Tendency to Avoid Extreme Response Options (x > 0)

Let dxj(m) = 1 if xj = m (m = 0, ..., M), and 0 otherwise. The
unstandardized log-likelihood person-fit statistic for polytomous items, lp, is given by

    lp = Σ(j=1..J) Σ(m=0..M) dxj(m) ln P*jm(y).

The standardized log likelihood, lpz, is given by

    lpz = [lp − E(lp)] / [Var(lp)]^(1/2),    (6)

where E(lp) is the expected value of lp, given by

    E(lp) = Σ(j=1..J) Σ(xj=0..M) P*jxj(y) ln P*jxj(y),

and Var(lp) is the variance of lp, given by

    Var(lp) = Σ(j=1..J) Σ(x*j=0..M) Σ(xj=0..M) P*jx*j(y) P*jxj(y) ln P*jx*j(y) ln[ P*jx*j(y) / P*jxj(y) ].

Statistic lpz can be interpreted as a standard normal deviate, with large negative values of lpz (say, ≤ −2.0) indicating misfit.
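A self-contained sketch of lpz under the GRM (equation (6)); here y is treated as known, whereas the study plugs in an estimate, and Var(lp) is computed in the equivalent per-item form Σ E[(ln P*)²] − (E[ln P*])². The two-item parameters in the usage example are made-up, not the paper's Table 1 values.

```python
from math import exp, log, sqrt

def orc(theta, a, deltas):
    """GRM option response curve P*(Xj = m | theta), m = 0, ..., M."""
    cum = [1.0] + [1.0 / (1.0 + exp(-a * (theta - d))) for d in deltas] + [0.0]
    return [cum[m] - cum[m + 1] for m in range(len(deltas) + 1)]

def lpz(x, theta, a_params, delta_params):
    """Standardized log-likelihood person-fit statistic (equation (6))."""
    lp = e_lp = var_lp = 0.0
    for xj, a, deltas in zip(x, a_params, delta_params):
        probs = orc(theta, a, deltas)
        lp += log(probs[xj])
        mean_log = sum(p * log(p) for p in probs)
        e_lp += mean_log
        var_lp += sum(p * log(p) ** 2 for p in probs) - mean_log ** 2
    return (lp - e_lp) / sqrt(var_lp)

# Hypothetical two-item example: a likely pattern scores higher than
# an unlikely one, so the unlikely pattern is flagged first.
a_params = [1.5, 1.5]
delta_params = [[-2.0, -1.0, 0.0, 1.0], [-1.0, 0.0, 1.0, 2.0]]
print(lpz([2, 1], 0.0, a_params, delta_params) >
      lpz([4, 4], 0.0, a_params, delta_params))  # True
```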

Independent Variables
Test length and number of response options. Data were generated for two levels of test length:
J = 12 and J = 24. For each level of J, data were generated for two levels of the number of
response options: M = 2 and M = 4. These choices of J and M were based on the characteristics
of existing personality scales. For example, the Neuroticism-Extraversion-Openness Five-Factor Inventory (NEO-FFI; Costa & McCrae, 1992) measures each factor of the five-factor model (the Big Five) using 12 items. The item parameter values (Table 1) that were used for generating the data were taken from Embretson and Reise (2000, p. 100). These values are representative of other empirical studies on the fit of the GRM to data from personality questionnaires (e.g., Reise, Widaman, & Pugh, 1993). For J = 24, the 12-item set was doubled.

Discrimination power. Several studies (e.g., Meijer, Molenaar, & Sijtsma, 1994; Meijer &
Sijtsma, 2001) showed that the power of person-fit statistics depends on the discrimination power
of the items. A higher discrimination means a more reliable score X+ . This may produce higher
detection rates. To investigate the effect of item discrimination, the y variance was varied because
for fixed ISRFs increasing the y variance results in higher item discrimination and higher test-
score reliability (e.g., Hemker et al., 1995). In particular, y variances equal to 1.0 and 1.6 were
used to simulate low and moderate discrimination, respectively.

Number of misfitting item scores in a vector. In this study, three types of aberrant response
behavior were discerned. These types of response behavior may govern the answers to all J
items, for example, when respondents do not seriously answer any of the items. They may also
govern only a part of the answers to the items. For example, respondents may be more inclined to
choose one of the extreme response options of the questions they consider particularly important


Table 1
Configuration of the Item Parameters in the Simulation Study

                 M = 2                          M = 4
  j     aj     dj1     dj2       dj1     dj2     dj3     dj4
  1    0.70   −2.87    0.51     −3.80   −1.93   −0.87    1.88
  2    1.42   −1.15    1.68     −2.07   −0.22    0.93    2.42
  3    1.43   −1.65    0.48     −2.37   −0.93   −0.39    1.34
  4    1.31   −1.77    0.95     −2.72   −0.81    0.04    1.85
  5    1.14   −1.87    1.68     −3.14   −0.60    0.64    2.72
  6    1.84   −0.65    0.99     −1.15   −0.15    0.37    1.60
  7    1.06   −2.37    1.29     −3.75   −0.99    0.11    2.47
  8    0.65   −2.76    2.36     −4.43   −1.08    0.75    3.96
  9    2.09   −1.07    1.06     −1.93   −0.20    0.42    1.70
 10    1.18   −1.73    1.31     −2.81   −0.64    0.37    2.24
 11    1.69   −0.69    1.47     −1.46    0.08    0.81    2.13
 12    1.15   −1.64    0.84     −2.52   −0.76   −0.04    1.71

for a favorable presentation of themselves. To investigate the effect of the number of affected
items, Jmisfit , two levels of misfit for J = 12 (Jmisfit = 6 and 12) and four levels of misfit for J = 24
(Jmisfit = 6, 12, 18, and 24) were simulated.

Dependent Variables
The usefulness of a person-fit statistic as a diagnostic tool for detecting aberrant item score vec-
tors is determined by the trade-off between the detection rates (i.e., the degree to which misfitting
item score vectors are detected) and the Type I error rate (i.e., the degree to which fitting item score
vectors are incorrectly diagnosed as misfitting). The detection rates for each statistic were obtained
at five fixed Type I error rates: .01, .025, .05, .10, and .20. It should be noted that person-fit
researchers (e.g., Meijer, 2003) may prefer relatively large a levels because most person-fit statis-
tics have relatively low power at low a levels and incorrect rejection of the null hypothesis of no
misfit often has no serious consequences.
Detection rates were obtained as follows:

1. A total of 1,000 item score vectors were simulated under the null model of normal response behavior. These data were used to estimate the ordering of the item-step difficulties (the p̂s) and the item parameters of the GRM (equation (4)). The parameters were estimated using the program MULTILOG (Thissen, 1991).
2. Two data sets of 3,000 item score vectors each were simulated: one data set under the null model of
normal response behavior (called the clean data set) and the other under the aberrant response mecha-
nism of interest (called the aberrant data set).
3. For each person, a y estimate was obtained using the item parameter estimates obtained in Step 1.
Then, the person-fit statistics were computed for the 3,000 simulated item score vectors in the clean
sample and the 3,000 simulated item score vectors in the aberrant sample.
4. For statistic Gp, for each Type I error level the critical value tcv was determined such that in the clean sample the fraction of respondents having a value higher than tcv equals the Type I error rate; that is, P(Gp ≥ tcv) = q (with q = .01, .025, .05, .10, and .20, respectively). The detection rate is the fraction of respondents in the aberrant sample having a person-fit value Gp higher than tcv. Detection rates for GpN and U3p were obtained in the same way. For lpz, critical values tcv were determined by the fractions P(lpz ≤ tcv) = q (with q = .01, .025, .05, .10, and .20) in the clean sample, and the corresponding fraction in the aberrant sample produced the detection rate.
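Step 4 amounts to taking empirical quantiles. A sketch for statistics in which larger values signal misfit (Gp, GpN, U3p; for lpz the comparison is reversed), with toy samples in place of the simulated ones:

```python
def detection_rate(clean, aberrant, alpha):
    """Set t_cv at the empirical (1 - alpha) quantile of the clean sample;
    the detection rate is the fraction of the aberrant sample above t_cv."""
    t_cv = sorted(clean)[int((1 - alpha) * len(clean)) - 1]
    return sum(1 for v in aberrant if v > t_cv) / len(aberrant)

# Toy samples: half the "aberrant" values clearly exceed the clean range.
print(detection_rate(list(range(100)), [95] * 50 + [50] * 50, 0.10))  # 0.5
```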

Results
Results under normal response behavior. The scatterplots in Figure 3 show for J = 12 and
M = 4 the relationship between X + and statistics Gp (panels in the first row), GpN (panels in the
second row), and U3p (panels in the third row) under the null model of normal response behavior
for low discrimination (left-hand side panels) and high discrimination (right-hand side panels).
The plots show that the distribution of statistic Gp conditional on X+ varies across X+ . Compared
with low and high X+ , the conditional Gp distribution for medium X+ had a higher mean and
a larger variance. Except for very high and very low X+ scores, the conditional distributions of
GpN and U3p were approximately the same across X+ and therefore less confounded with X+ .
Similar results were found in the other conditions. The results corroborate the conclusion that it is
best not to use person-fit indices for patterns with scores near the extremes. These patterns contain
too little information to draw valid conclusions about aberrant response behavior. Similar results
were found for dichotomous items (Emons, Meijer, & Sijtsma, 2002). Item score vectors yielding X+ ≤ M or X+ ≥ JM − M were therefore discarded from further analysis. This means that critical values (tcv) were obtained in the clean sample, and detection rates in the aberrant sample, after these extreme patterns were removed.

Results for carelessness and inattention. For the five significance levels, Table 2 shows the
detection rates of Gp, GpN, U3p, and lpz for detecting carelessness and inattention under 12 combi-
nations of M, J, Jmisfit, and discrimination power. Of the three nonparametric person-fit statistics,
Gp showed, in general, the best performance. In only a few conditions did U3p perform slightly
better than Gp, and the differences were negligible. The detection rates for Gp were somewhat
smaller than for lpz; differences ranged from .01 to .11. For J = 12 and Jmisfit = 6, the detection
rates were smaller than .50 for α levels smaller than .05. This means that under this condition,
the person-fit statistics lacked the power to detect carelessness at conventional α levels. The dif-
ferences between the detection rates obtained for M = 2 and M = 4 were small; absolute differ-
ences ranged from .02 to .06 for J = 12, Jmisfit = 6, and from .01 to .15 for J = 12, Jmisfit = 12.
Thus, increasing the number of response options had minor effects on the detection rates for a small
number of misfitting item scores and somewhat larger effects for a larger number of misfitting item
scores. This result was found for both the nonparametric person-fit statistics and the parametric
statistic lpz. As expected, increasing the number of misfitting item scores had a substantial effect on
the detection rates. In particular, for J = 24, Jmisfit = 12, and low discrimination, a detection rate at
α = .05 of .72 was found for M = 2, and .73 for M = 4. Small differences were found between low
and high discrimination (the largest absolute difference was .11). The simulations suggest that of
the three nonparametric person-fit statistics, Gp is most effective in detecting extreme cases of
carelessness or inattention in which all item scores are affected. Such patterns are realistic for this
type of response behavior because a serious lack of motivation or concentration is independent of
the specific item content and most likely to affect the responses to all items.

Results for tendency to choose extreme response options. Table 3 shows, at the five significance
levels, the detection rates of Gp, GpN, U3p, and lpz for detecting tendencies to choose extreme
response options. Compared with the detection rates for carelessness, the detection rates for
extreme response behavior were smaller and, in particular, the detection rates for J = 12 and

Downloaded from apm.sagepub.com at TEXAS SOUTHERN UNIVERSITY on November 19, 2014


W. H. M. EMONS
PERSON-FIT ANALYSIS OF POLYTOMOUS ITEMS 235

Figure 3
Scatterplots of Sum Score (X + ) Against (A) Statistic Gp for Low Discrimination, (B) Statistic
Gp for High Discrimination, (C) Statistic GpN for Low Discrimination, (D) Statistic GpN for High
Discrimination, (E) Statistic U3p for Low Discrimination, and (F) Statistic U3p for High Discrimination

[Panels A–F: scatterplots with Sum Score (X+), ranging from about 5 to 45, on the horizontal
axes; the vertical axes show the number of Guttman errors Gp (Panels A and B), the normed
Guttman errors GpN (Panels C and D), and U3p (Panels E and F).]

Volume 32 Number 3 May 2008
236 APPLIED PSYCHOLOGICAL MEASUREMENT

Table 2
Detection Rates for Carelessness and Inattention at Five Significance Levels

                     M = 2                            M = 4
α          Gp     GpN    U3p    lpz        Gp     GpN    U3p    lpz

J = 12, Jmisfit = 6, low disc.
.01       .26    .14    .20    .25        .26    .19    .17    .29
.025      .36    .30    .33    .41        .39    .31    .27    .40
.05       .47    .43    .46    .50        .49    .43    .41    .52
.10       .62    .57    .57    .62        .62    .58    .56    .64
.20       .73    .69    .73    .76        .76    .73    .71    .78

J = 12, Jmisfit = 12, low disc.
.01       .56    .39    .54    .67        .66    .54    .49    .76
.025      .70    .57    .71    .78        .74    .67    .66    .83
.05       .79    .73    .80    .85        .83    .78    .77    .88
.10       .88    .85    .89    .91        .89    .88    .86    .93
.20       .94    .92    .95    .96        .95    .94    .94    .97

J = 12, Jmisfit = 6, high disc.
.01       .28    .20    .24    .36        .31    .26    .21    .35
.025      .39    .31    .35    .46        .44    .37    .32    .51
.05       .48    .42    .46    .57        .54    .46    .40    .61
.10       .61    .56    .60    .69        .67    .59    .55    .71
.20       .75    .71    .76    .81        .79    .74    .70    .82

J = 12, Jmisfit = 12, high disc.
.01       .55    .35    .46    .64        .66    .47    .41    .72
.025      .69    .58    .69    .76        .74    .66    .60    .81
.05       .80    .72    .77    .85        .82    .75    .70    .87
.10       .88    .84    .87    .92        .88    .83    .82    .92
.20       .94    .92    .94    .96        .94    .92    .90    .96

J = 24, Jmisfit = 12, low disc.
.01       .49    .18    .35    .53        .49    .39    .31    .57
.025      .59    .40    .56    .67        .63    .53    .49    .71
.05       .72    .56    .69    .76        .73    .68    .63    .80
.10       .81    .71    .82    .85        .84    .80    .76    .87
.20       .91    .84    .91    .93        .92    .90    .88    .93

J = 24, Jmisfit = 18, low disc.
.01       .73    .44    .64    .79        .76    .61    .48    .82
.025      .84    .64    .81    .87        .85    .76    .69    .89
.05       .89    .80    .89    .93        .90    .86    .81    .93
.10       .95    .91    .94    .96        .95    .92    .90    .97
.20       .98    .96    .98    .98        .98    .97    .95    .99


Table 3
Detection Rates for Tendency to Choose Extreme Response Options at Five Significance Levels

                     M = 2                            M = 4
α          Gp     GpN    U3p    lpz        Gp     GpN    U3p    lpz

J = 12, Jmisfit = 6, low disc.
.01       .06    .07    .08    .06        .14    .14    .14    .12
.025      .10    .13    .14    .12        .21    .24    .23    .18
.05       .16    .18    .21    .18        .29    .32    .32    .28
.10       .25    .29    .33    .29        .40    .43    .48    .39
.20       .41    .47    .49    .45        .55    .60    .63    .55

J = 12, Jmisfit = 12, low disc.
.01       .16    .21    .32    .29        .38    .47    .49    .39
.025      .28    .35    .45    .38        .49    .59    .63    .52
.05       .39    .47    .55    .47        .57    .71    .73    .60
.10       .49    .62    .67    .56        .67    .80    .84    .69
.20       .62    .75    .77    .68        .75    .89    .91    .79

J = 12, Jmisfit = 6, high disc.
.01       .05    .04    .09    .06        .09    .09    .10    .10
.025      .08    .11    .15    .10        .15    .16    .16    .17
.05       .12    .16    .21    .16        .22    .25    .27    .26
.10       .20    .26    .31    .25        .34    .40    .41    .37
.20       .34    .42    .44    .41        .46    .57    .61    .52

J = 12, Jmisfit = 12, high disc.
.01       .18    .20    .29    .24        .24    .40    .41    .30
.025      .23    .34    .40    .32        .35    .52    .53    .40
.05       .30    .45    .53    .39        .46    .65    .67    .49
.10       .42    .60    .62    .49        .55    .76    .78    .57
.20       .55    .74    .74    .61        .61    .85    .88    .69

J = 24, Jmisfit = 12, low disc.
.01       .13    .08    .14    .15        .27    .23    .22    .31
.025      .21    .19    .26    .26        .38    .40    .39    .42
.05       .31    .30    .36    .36        .47    .54    .55    .52
.10       .43    .46    .51    .49        .58    .67    .70    .62
.20       .60    .63    .67    .64        .69    .79    .83    .76

J = 24, Jmisfit = 18, low disc.
.01       .27    .16    .32    .36        .49    .56    .55    .48
.025      .38    .31    .46    .48        .56    .69    .69    .61
.05       .47    .47    .58    .56        .64    .77    .79    .67
.10       .58    .63    .69    .66        .72    .84    .86    .75
.20       .70    .80    .80    .75        .79    .92    .94    .84


Jmisfit = 6 were smaller than .50 at conventional α levels, a result that was also found for the lpz sta-
tistic. This indicates that this type of misfit is difficult to detect. The detection rates were higher for
M = 4 than for M = 2, yielding acceptable detection rates if the number of affected items is large
enough. In particular, for J = 12, Jmisfit = 12, and low discrimination, the detection rates at α = .05
were .71 and .73 for GpN and U3p, respectively. Compared with Gp and lpz, the detection rates were
higher for the normed statistics GpN and U3p.
The simulations also revealed higher detection rates for low discrimination than for high dis-
crimination. The explanation for this result is that the higher the item discrimination, the higher
the probability of responses in the lowest category of difficult items, and likewise for the highest
category of easy items. This means that as the discrimination increases, a tendency to choose
extreme response options leads to smaller discrepancies between the expected pattern under the
null model and the observed item score vector and, thus, to lower detection rates.
The simulations suggest that for items with five or more answer categories, the normed non-
parametric person-fit statistics are particularly effective in detecting cases that show a tendency to
choose the extreme response categories on all items. Both parametric and nonparametric person-
fit statistics lack power if the number of response options is small or if the tendency to choose
extreme response options is exhibited on fewer than half of the items. Note that the tendency to
choose extreme responses is also independent of item content and, as a result, most likely to affect
the response behavior on most of the items.

Results for reversed scoring. Table 4 shows the detection rates of Gp, GpN, U3p, and lpz for
detecting reversed scoring. Of the three nonparametric person-fit statistics, Gp showed the highest
detection rates in the conditions in which half of the item scores were affected. In these conditions,
statistic Gp also performed better than lpz, but the differences were small. No clear trends were
found if all item scores were affected, but in general, detection rates were higher if half of the
items were reverse scored than if all items were reverse scored. For example, for J = 24 the highest
detection rates were found if 18 items were reverse scored. Note that tests that mix wording
directions word only a subset of the items in the opposite direction, so failure to notice the changes
in wording direction affects only a subset of the item scores. In this case, statistic Gp is most
useful for detecting item score vectors containing errors in the direction of the scoring.
Comparison of the results for low and high discrimination showed that higher discrimination
produced higher detection rates when half of the items were reverse scored, but lower detection
rates when all items were reverse scored. If only half of the items are reverse scored, vectors arise
in which some of the easy (difficult) items are unaffected and a response in the highest (lowest)
category is observed, whereas other easy (difficult) items are affected and a response in the oppo-
site lowest (highest) category is observed. This yields large inconsistencies between the item
scores within the pattern, given the item-step difficulty ordering. This effect becomes stronger as
the item discrimination increases because higher discrimination increases the probability that
respondents choose the extreme categories of easy and difficult items. Reverse scoring of all
items affects the item score vectors as follows. For respondents with a high θ, who have high
scores on most items, reverse scoring results in a pattern with low scores on most items and a low
X+. This pattern of low scores combined with a low X+ is not as inconsistent with the NIRT model
as when half of the items are reverse scored. Similar results hold for respondents with a low θ,
who have low scores on most items. Thus, it is the combination of the number of affected items
and the item discrimination that influenced the detection rates. This effect depends on the
characteristics of the test (e.g., the spread of the item difficulties).


Table 4
Detection Rates for Reverse Scoring at Five Significance Levels

                     M = 2                            M = 4
α          Gp     GpN    U3p    lpz        Gp     GpN    U3p    lpz

J = 12, Jmisfit = 6, low disc.
.01       .24    .15    .16    .22        .20    .13    .09    .17
.025      .30    .23    .23    .30        .30    .21    .17    .28
.05       .39    .32    .31    .37        .40    .30    .25    .38
.10       .51    .42    .44    .48        .52    .45    .37    .48
.20       .65    .58    .58    .63        .72    .63    .53    .63

J = 12, Jmisfit = 12, low disc.
.01       .09    .10    .06    .06        .08    .07    .05    .05
.025      .16    .16    .10    .11        .16    .13    .09    .11
.05       .25    .25    .17    .18        .22    .22    .19    .18
.10       .35    .37    .26    .27        .34    .36    .30    .31
.20       .49    .53    .40    .42        .50    .53    .47    .47

J = 12, Jmisfit = 6, high disc.
.01       .37    .23    .30    .40        .42    .31    .25    .40
.025      .42    .32    .38    .48        .51    .40    .33    .50
.05       .52    .41    .46    .55        .56    .47    .40    .55
.10       .60    .52    .56    .63        .68    .58    .51    .63
.20       .74    .64    .69    .74        .81    .72    .64    .75

J = 12, Jmisfit = 12, high disc.
.01       .10    .10    .07    .07        .06    .05    .05    .06
.025      .14    .18    .13    .12        .12    .12    .10    .11
.05       .19    .26    .18    .17        .17    .20    .18    .20
.10       .29    .38    .27    .26        .31    .31    .29    .31
.20       .41    .53    .41    .41        .44    .48    .45    .47

J = 24, Jmisfit = 12, low disc.
.01       .28    .13    .22    .28        .33    .21    .16    .32
.025      .37    .22    .31    .37        .42    .31    .23    .40
.05       .43    .30    .39    .44        .53    .42    .33    .48
.10       .56    .41    .51    .54        .64    .54    .44    .59
.20       .69    .57    .64    .65        .78    .68    .58    .71

J = 24, Jmisfit = 18, low disc.
.01       .33    .18    .24    .30        .37    .27    .20    .36
.025      .44    .29    .32    .39        .48    .38    .26    .45
.05       .54    .42    .40    .47        .60    .50    .38    .54
.10       .67    .53    .53    .57        .73    .65    .52    .67
.20       .79    .69    .68    .70        .86    .78    .66    .78


Table 5
Item Labels and Item Means for the Items Used in the Real Data Example,
and the Observed Item Score Vector and Person-Fit Results for Five Cases

                                                Item Score per Case
Item Content                        Mean      1      2      3      4      5

Search source of malodor            1.85      3      0      0      3      0
Talk to friends and family          0.98      0      3      0      0      0
Do something to get rid of it       0.86      0      2      0      3      0
Try to find solutions               0.82      0      0      0      3      0
Experience unrest                   0.65      1      0      1      0      0
Go elsewhere for fresh air          0.54      0      0      3      3      0
File complaint at producer          0.35      3      3      0      3      0
Call environmental agency           0.26      3      1      0      3      3

                                             Person-Fit Value per Case
Person-fit index                              1      2      3      4      5

Gp                                        62.00  67.00  42.00  56.00  25.00
GpN                                        0.59   0.66   0.76   0.65   0.50
U3p                                        0.62   0.62   0.80   0.61   0.57
lpz                                       −4.25  −3.87  −2.50  −1.91  −1.17

Real Data Application


Test and Data
Data from Cavalini's (1992) study on industrial malodor were used to illustrate the nonparametric
person-fit statistics used in this study. The complete data set consisted of 828 subjects answering 17
items on coping behavior with industrial malodor. Each item had four answer categories (i.e.,
M = 3). The fit of the DMM was evaluated using MSP5 for Windows (Molenaar & Sijtsma, 2000).
Using MSP5's search procedure (Molenaar & Sijtsma, 2000), a subset of eight items was selected
that constituted a unidimensional scale and satisfied the DMM (see Table 5; items are ordered in
decreasing popularity from top to bottom). The monotonicity assumption was evaluated using the
scalability coefficient H and rest-score regression (Sijtsma & Molenaar, 2002, pp. 41-42). For the
eight items, the scalability coefficient H = 0.39, indicating a weak scale. The rest-score regression
analysis showed several sample violations of monotonicity, but only one was significant at the 5%
level and one at the 1% level. The results suggest no serious violations of monotonicity. Nonintersec-
tion of the ISRFs was also investigated by means of rest-score regression (Sijtsma & Molenaar,
2002, pp. 98-99). Many sample violations were found, but only two of them were significant at the
5% level. MSP5 also provides a diagnostic index, denoted Crit Value, which summarizes viola-
tions of the assumption of nonintersecting ISRFs. Crit Values higher than 80 cast serious doubt on
the assumption (Molenaar & Sijtsma, 2000, p. 74). The highest Crit Value for testing nonintersecting
ISRFs was 68. The results suggest only minor violations of the assumption of nonintersecting ISRFs.

Person-Fit Results
There were n = 153 respondents with X+ ≤ 2 or X+ ≥ 22 (i.e., scores near the extremes
0 or 24). These subjects were excluded from the person-fit analysis, yielding a final sample of


Figure 4
Venn Diagram Indicating Overlap of the Number of Item Score Vectors
Classified as Misfitting Using Gp , GpN , and U3p

[Region counts: Gp only = 10; GpN only = 1; Gp and GpN only = 1; all three
statistics = 23; Gp and U3p only = 0; GpN and U3p only = 9; U3p only = 0.]

675 item score vectors (mean X+ = 7.42, SD = 3.65). Inspection of the correlations between X+
and the person-fit indices showed the highest correlation for Gp (r = .36) and the lowest cor-
relation for U3p (r = .11). The correlations between the nonparametric person-fit indices ranged
from .88 to .89. The Venn diagram in Figure 4 shows how many item score vectors
produced a person-fit index in the highest 5% for a single statistic, a combination of two statis-
tics, or all three statistics. In particular, it shows that 44 patterns were detected by at
least one of the statistics, and 23 patterns were identified by all three statistics. There were 10
vectors found by Gp that were not found by either of the normed statistics GpN and U3p, and vice
versa.
Table 5 shows five individual cases that had person-fit values in the upper 5% range for at least
one of the nonparametric person-fit statistics. Substantive interpretation of these patterns is diffi-
cult without additional information, but the score pattern of Case 4 suggests a tendency to choose
extreme response options. The same tendency may also explain the high person-fit values for the
other cases. Case 5 shows a high U3p value and relatively low values for the other statistics. This
case showed an unexpectedly high score on the most difficult item, which explains the relatively
high U3p compared with Gp and GpN.
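The Guttman-error count Gp can be made concrete in a few lines of code. The sketch below is illustrative only and is not the software used in the study: item scores are dichotomized into item-step scores, the steps are ordered by decreasing popularity, and every passed step that is preceded by a failed, more popular step counts as one Guttman error. The popularities and score vectors are hypothetical.

```python
def guttman_errors(scores, step_popularity):
    """Count Guttman errors in a vector of polytomous item scores.

    scores: integer item scores x_j (0..M), one per item.
    step_popularity: step_popularity[j][x-1] is the estimated proportion
        of respondents passing step x of item j, i.e. P(X_j >= x).
    """
    # Dichotomize each item score into item-step scores Y_jx = 1 iff x_j >= x.
    steps = []
    for j, probs in enumerate(step_popularity):
        for x, p in enumerate(probs, start=1):
            steps.append((p, 1 if scores[j] >= x else 0))
    # Order the item steps from most to least popular (easy to difficult).
    steps.sort(key=lambda s: -s[0])
    # Each passed step preceded by a failed easier step is one Guttman error.
    g, failed_so_far = 0, 0
    for _, passed in steps:
        if passed:
            g += failed_so_far
        else:
            failed_so_far += 1
    return g

# Two hypothetical items with M = 2: scoring 0 on the easy item but 2 on the
# difficult item is maximally inconsistent; the reverse pattern is consistent.
pops = [[0.9, 0.6], [0.5, 0.2]]
print(guttman_errors([0, 2], pops))  # 4
print(guttman_errors([2, 0], pops))  # 0
```

The normed variant GpN would divide this count by the conditional maximum max(Gp|X+) obtained with the recursive algorithm in the Appendix.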

Discussion

Person-fit methods are important tools used to detect individual item score vectors that deviate
from the other vectors in the sample and therefore need further inquiry. This study investigated
the effectiveness of three nonparametric person-fit statistics for polytomous item scores, both in
simulated data and in real data. Under varying test and item characteristics, simulations were
done under IRT models that are characteristic of real data applications. In general, a simple count
of the number of Guttman errors in the pattern of item-step scores was most effective in detecting
item score vectors that showed considerable misfit. The number of Guttman errors has the disad-
vantage of being confounded with X+ , particularly for relatively low and high X+ . The normed


statistics GpN and U3p were less confounded with X+ , but they showed lower detection rates in
most conditions. The simulations showed that item score vectors yielding X+ near the extremes
should be discarded from the person-fit analysis. The simulations further showed that in most
conditions, the detection rates of Gp were comparable to the parametric person-fit statistic lpz .
This means that the choice of a nonparametric approach does not necessarily imply a substantial
reduction of power.
An advantage of nonparametric approaches is that the underlying NIRT model is less restrictive
with respect to the data than its parametric counterparts. Empirical studies (Chernyshenko,
Stark, Chan, Drasgow, & Williams, 2001; Steinberg & Thissen, 1996) showed, for example, that
fitting parametric IRT models to personality data may not be straightforward, and NIRT may be
a useful alternative (e.g., Meijer & Baneke, 2004). A second advantage of NIRT models is that
a smaller sample size is needed to obtain reliable estimates of the psychometric characteristics
used in person-fit analysis (e.g., the item-step difficulties π̂).
For the nonparametric person-fit statistics used in this study, the fit of a particular item score
vector is evaluated in relation to the fit values of all other item score vectors in the sample.
Because one does not want to confound the detection of person misfit with X+, normed person-fit
statistics were proposed to reduce the dependency on X+. Alternatively, the fit of a particular item
score vector can be evaluated by comparing its fit value to the fit values in the group of
persons with the same X+ (i.e., using distributions of person-fit statistics conditional on X+; e.g.,
Molenaar & Hoijtink, 1990). A serious limitation of this approach is that, in particular for polyto-
mous-item tests, large sample sizes are needed to decide adequately about the fit of an individual
item score vector.
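Assuming enough respondents share each sum score, the conditional approach can be sketched as follows; the fit values below are drawn from a toy distribution and serve only to illustrate the bookkeeping, not any substantive model.

```python
import numpy as np

# Hypothetical sample: sum scores and Gp values for 1,000 respondents.
rng = np.random.default_rng(1)
x_plus = rng.integers(5, 20, size=1000)   # sum scores X+
gp = rng.poisson(lam=10, size=1000)       # person-fit values (toy)

def conditional_tail_proportion(gp_value, x_value, gp, x_plus):
    """Proportion of respondents with the same X+ whose Gp is at least as
    large as the target value; small proportions flag possible misfit."""
    same_group = gp[x_plus == x_value]
    return float(np.mean(same_group >= gp_value))

# A Gp of 25 is extreme within the X+ = 12 group in this toy sample.
print(conditional_tail_proportion(25, 12, gp, x_plus))
```

Because every X+ level needs its own reference group, the per-level sample sizes shrink quickly, which is exactly the limitation of the conditional approach.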
In this study, simulated cutoff values for fixed Type I error rates were used to obtain detec-
tion rates that are comparable across different statistics. In practice, one cannot generate item
score patterns under the true model for setting cutoff values for fixed Type I error rates. There-
fore, in most applications, person-fit indices are used as descriptive measures for identification
of misfitting item score vectors in the sample at hand. Alternatively, cutoff values may be deter-
mined in empirical research. In the context of psychological assessment, psychological scales
are extensively studied in norm populations for score interpretation before they are put into
practice. Test users may use the distribution of nonparametric person-fit statistics obtained in
the norm population as the reference distribution to evaluate the consistency of new individual
observations. Item score vectors that produce a person-fit statistic that is in the upper tail of the
empirical distribution derived in the norm population are suspicious and subject to further
inquiry.
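This norm-based procedure amounts to little more than storing an upper percentile of the reference distribution. A minimal sketch, with a hypothetical norm distribution standing in for real norm data:

```python
import numpy as np

# Hypothetical Gp values observed in a large norm population.
rng = np.random.default_rng(7)
gp_norm = rng.poisson(lam=12, size=5000)

# The 95th percentile of the norm distribution serves as the cutoff.
cutoff = np.percentile(gp_norm, 95)

def flag_misfit(gp_new, cutoff):
    """Flag new item score vectors whose Gp exceeds the norm-based cutoff."""
    return gp_new > cutoff

print(cutoff, flag_misfit(np.array([5, 25]), cutoff))
```

In practice, the reference distribution would come from the scale's norming study rather than from simulated values.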
A topic for future research is the use of nonparametric estimates of the ISRFs. For example,
Sijtsma and Van der Ark (2003) used the estimated item rest-score regression curves to impute
missing data. Their approach can be extended to other applications as well, including person-fit
analysis. In particular, nonparametric estimates of the IRT model can be obtained (e.g., using
Testgraf; Ramsay, 2000). For the estimated nonparametric model, bootstrap methods (Efron &
Tibshirani, 1993) can be used to investigate the distributional characteristics of the person-fit
statistics under the null model. This may lead to cutoff values defining ranges of Gp , GpN , or U3p
that indicate varying levels of misfit (e.g., ‘‘no misfit,’’ ‘‘moderate misfit,’’ and ‘‘serious mis-
fit’’). Another topic for further research is applications of person-fit methods to nonmonotonic
items. In the context of personality measurement, items are sometimes found to be single-peaked
(e.g., Meijer & Baneke, 2004). NIRT models may be pursued that accommodate nonmonotonic
items. Generalizations of person-fit methods to this class of models are also a topic for future
research.


Appendix

Finding max(Gp|X+) and min(W|X+)


The algorithm is illustrated for J = 4 hypothetical items, each with four answer categories (i.e.,
M = 3). The item-step probabilities Pjxj(θ) (j = 1, …, 4; xj = 0, …, 3) are given in Table A1
(columns 2-5). Note that the algorithm requires the probabilities of the (redundant) item steps
P(Xj ≥ 0) (j = 1, …, J), which equal 1 by definition. The algorithm proceeds as follows.

Initialization Step (s = 1)
First, rank numbers are assigned to the item steps as follows. The item steps for xj = 0 are
ranked 0. The remaining item steps are rank numbered according to increasing difficulty (i.e.,
decreasing Pjxj(θ); see columns 6-9 of Table A1). Second, for each item the cumulative rank num-
bers of the item steps are computed (see columns 10-13 of Table A1). For example, the rank num-
bers of Item 1 are 0, 1, 2, and 5, and the corresponding cumulative ranks are 0, 1, 3, and 8. Third,
let rj = (rj1, …, rj(M+1)) be the row vector containing the cumulative ranks of item j (e.g.,
r1 = (0, 1, 3, 8)). An (M + 1) × (M + 1) matrix V(s=1) is computed, with elements

    Vkl(s=1) = r1k + r2l, with k, l = 1, …, M + 1.

In the example, the resulting matrix V(s=1) is given by

             [ 0   3   9  18 ]
    V(s=1) = [ 1   4  10  19 ]
             [ 3   6  12  21 ]
             [ 8  11  17  26 ].

Fourth, from this matrix V(s=1), a new vector T(s=1) of length 2M + 1 is computed, with elements
Tk(s=1) (k = 1, …, 2M + 1) given by

    Tk(s=1) = max{Vk−l, 1+l}, with l = 0, …, k − 1 if k ≤ M,
              and l = k − M − 1, …, M if k > M.                        (A1)

In the example, T(s=1) = (0, 3, 9, 18, 19, 21, 26). This ends the initialization step.

Recursion Steps
The initialization step is followed by J − 2 recursion steps. Each step s (s = 2, …, J − 1)
proceeds as follows. A new matrix V(s) is computed, with elements

    Vkl(s) = Tk(s−1) + r(s+1)l, with k = 1, …, Ms + 1 and l = 1, …, M + 1.

This matrix has Ms + 1 rows and M + 1 columns. From this matrix a new vector T(s) is obtained,
in which the elements Tk(s), with k = 1, …, M(s + 1) + 1, are given by

    Tk(s) = max{Vk−l, 1+l}, with l = 0, …, k − 1 if k ≤ M;
            l = 0, …, M if M < k ≤ Ms;
            and l = k − Ms − 1, …, M if k > Ms.                        (A2)

This vector T(s) has M(s + 1) + 1 elements. After all recursion steps are completed, the vector
T contains, element-wise, the maximum rank sum for X+ = 0, …, JM. In particular, the maximum
rank sum given X+ is the (X+ + 1)th element of the final vector T.


Table A1
Item-Step Probabilities, Ranked Steps, and Cumulative Ranks for J = 4
Hypothetical Items, Each With Four Answer Categories (M = 3)

             Pjxj(θ)                Ranked Steps        Cumulative Ranks
j    xj = (0)   1     2     3     (0)   1    2    3     (0)   1    2    3

1      1.00    .88   .59   .37     0    1    2    5      0    1    3    8
2      1.00    .53   .30   .18     0    3    6    9      0    3    9   18
3      1.00    .43   .21   .11     0    4    7   10      0    4   11   21
4      1.00    .19   .10   .09     0    8   11   12      0    8   19   31

Table A2
Maximum Number of Guttman Errors and Norming Indices for U3p
for the Four Hypothetical Items With M = 3

                           Norming Indices for U3p
X+     max(Gp|X+)      min(W|X+)       max(W|X+)

 0          0            0.000000        0.000000
 1          7           −1.450010        1.992430
 2         16           −3.647235        2.356396
 3         25           −5.960870        2.476540
 4         25           −6.242721        2.194689
 5         27           −7.567646        1.662472
 6         31           −9.658387        0.815174
 7         27           −9.538243       −0.509751
 8         25          −10.385541       −1.959762
 9         25          −11.901888       −3.476109
10         16           −9.909458       −5.566850
11          7           −9.545493       −7.764075
12          0          −10.077710      −10.077710

The maximum rank sum is finally transformed into the number of Guttman errors using the mini-
mum rank sum given X+, which equals (1/2)X+(X+ + 1) (e.g., Emons, 2003). The maximum
number of Guttman errors given X+ is then given by

    max(Gp|X+) = T(X+ + 1) − (1/2)X+(X+ + 1).

The values of max(Gp|X+) for the hypothetical example are given in Table A2.
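The initialization and recursion steps can be folded into one small dynamic program. The sketch below is not the author's code, but it reproduces max(Gp|X+) for the hypothetical items: each item contributes its vector of cumulative step ranks (Table A1), the vectors are merged one at a time by taking maxima over all score splits, and the minimum rank sum (1/2)X+(X+ + 1) is subtracted at the end.

```python
def max_guttman_errors(cum_ranks):
    """cum_ranks[j] holds the cumulative step ranks (r_j1, ..., r_j(M+1))
    of item j; returns max(Gp | X+) for X+ = 0, ..., JM."""
    t = list(cum_ranks[0])                  # initialize with item 1
    for r in cum_ranks[1:]:                 # fold in items 2, ..., J
        t = [max(t[k - l] + r[l]
                 for l in range(len(r))
                 if 0 <= k - l < len(t))    # all splits of score k
             for k in range(len(t) + len(r) - 1)]
    # Subtract the minimum rank sum X+(X+ + 1)/2 to get Guttman errors.
    return [t[x] - x * (x + 1) // 2 for x in range(len(t))]

# Cumulative ranks of the four hypothetical items (Table A1).
cum_ranks = [(0, 1, 3, 8), (0, 3, 9, 18), (0, 4, 11, 21), (0, 8, 19, 31)]
print(max_guttman_errors(cum_ranks))
# [0, 7, 16, 25, 25, 27, 31, 27, 25, 25, 16, 7, 0]  (Table A2)
```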

Norming U3p
The algorithm for finding the norming values to compute U3p is based on the same recursion
algorithm as that for finding max(Gp|X+), but it differs in two respects. First, the cumulative
ranks are replaced by the cumulative sums of the logits of the item-step probabilities. Second, in
equations (A1) and (A2) of the algorithm, the minimum values are recorded in the vector T instead
of the maximum values. As a consequence, the algorithm produces the minimum value of the sum
of logits for each level of X+. The maximum value of the sum of the logits given X+ is the sum of
the logits of the X+ easiest item steps. These values can be used for the normed U3p statistic. The
conditional norming values for U3p are also given in Table A2.
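The same dynamic program, run with minima over cumulative logits instead of maxima over cumulative ranks, yields min(W|X+). A sketch under the Table A1 item-step probabilities (again illustrative code, not the study's software):

```python
from math import log

def min_logit_sums(step_probs):
    """step_probs[j] = (P(X_j >= 1), ..., P(X_j >= M)) for item j;
    returns min(W | X+) for X+ = 0, ..., JM."""
    # Cumulative sums of logits per item, starting at 0 for x_j = 0.
    cum = []
    for probs in step_probs:
        c, s = [0.0], 0.0
        for p in probs:
            s += log(p / (1 - p))
            c.append(s)
        cum.append(c)
    t = list(cum[0])
    for c in cum[1:]:                        # fold in the remaining items
        t = [min(t[k - l] + c[l]
                 for l in range(len(c))
                 if 0 <= k - l < len(t))
             for k in range(len(t) + len(c) - 1)]
    return t

# Item-step probabilities of the four hypothetical items (Table A1).
probs = [(.88, .59, .37), (.53, .30, .18), (.43, .21, .11), (.19, .10, .09)]
w_min = min_logit_sums(probs)
print([round(w, 4) for w in w_min[:4]])
# [0.0, -1.45, -3.6472, -5.9609]  (cf. Table A2)
```

The matching maxima, max(W|X+), are simply the logit sums of the X+ most popular item steps.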

References

Bachman, J. G., & O'Malley, P. M. (1984). Yea-saying, nay-saying, and going to extremes: Black-white differences in response styles. Public Opinion Quarterly, 48, 491-509.
Birenbaum, M., & Nassar, F. (1994). On the relationship between test anxiety and test performance. Measurement and Evaluation in Counseling and Development, 27, 293-301.
Cavalini, P. M. (1992). It's an ill wind that bring no goods: Studies on odour annoyance and the dispersion of odorant concentrations from industries. Unpublished doctoral dissertation, University of Groningen, Netherlands.
Chernyshenko, O. S., Stark, S., Chan, K., Drasgow, F., & Williams, B. (2001). Fitting item response theory models to two personality inventories: Issues and insights. Multivariate Behavioral Research, 36, 523-562.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum.
Costa, P. T., & McCrae, R. R. (1992). The NEO Personality Inventory and NEO Five Factor Inventory professional manual. Odessa, FL: Psychological Assessment Resources.
Dagohoy, A. V. T. (2005). Person fit for tests with polytomous responses. Unpublished doctoral dissertation, University of Twente, Enschede, Netherlands.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Emons, W. H. M. (2003). Investigating the local fit of item-score vectors. In H. Yanai, A. Okada, K. Shigemasu, Y. Kano, & J. J. Meulman (Eds.), New developments in psychometrics (pp. 289-296). Tokyo: Springer.
Emons, W. H. M., Meijer, R. R., & Sijtsma, K. (2002). Comparing simulated and theoretical sampling distributions of the U3 person-fit statistic. Applied Psychological Measurement, 26, 88-108.
Emons, W. H. M., Sijtsma, K., & Meijer, R. R. (2005). Global, local, and graphical person-fit analysis using person response functions. Psychological Methods, 10, 101-119.
Hamilton, D. L. (1968). Personality attributes associated with extreme response style. Psychological Bulletin, 69, 192-203.
Hemker, B. T., Sijtsma, K., & Molenaar, I. W. (1995). Selection of unidimensional scales from a multidimensional item bank in the polytomous Mokken IRT model. Applied Psychological Measurement, 19, 337-352.
Hemker, B. T., Sijtsma, K., Molenaar, I. W., & Junker, B. W. (1997). Stochastic ordering using the latent trait and the sum score in polytomous IRT models. Psychometrika, 62, 331-347.
Johnson, T. R. (2004). On the use of heterogeneous thresholds ordinal regression models to account for individual differences in response style. Psychometrika, 68, 563-583.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.
Meijer, R. R. (2003). Diagnosing item score patterns on a test using IRT based person-fit statistics. Psychological Methods, 8, 72-87.
Meijer, R. R., & Baneke, J. (2004). Analyzing psychopathology items: A case for nonparametric item response theory modeling. Psychological Methods, 9, 354-367.
Meijer, R. R., Molenaar, I. W., & Sijtsma, K. (1994). Influence of test and person characteristics on nonparametric appropriateness measurement. Applied Psychological Measurement, 18, 111-120.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.
Molenaar, I. W. (1982). Mokken scaling revisited. Kwantitatieve Methoden, 3(8), 145-164.
Molenaar, I. W. (1991). A weighted Loevinger H-coefficient extending Mokken scaling to multicategory items. Kwantitatieve Methoden, 12(37), 97-117.
Molenaar, I. W. (1997). Nonparametric models for polytomous responses. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 369-380). New York: Springer.
Molenaar, I. W., & Hoijtink, H. (1990). The many null distributions of person-fit indices. Psychometrika, 55, 75-106.
Molenaar, I. W., & Sijtsma, K. (2000). MSP5 for Windows: User's manual [Computer manual]. Groningen, Netherlands: ProGAMMA.
Paulhus, D. L. (1991). Measurement and control of response bias. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 17-59). San Diego, CA: Academic Press.
Ramsay, J. O. (2000). Testgraf: A program for the graphical analysis of multiple choice test and questionnaire data [Computer software]. Montreal, Canada: Department of Psychology, McGill University.
Reise, S. P., & Widaman, K. F. (1999). Assessing the fit of measurement models at the individual level: A comparison of item response theory and covariance structure approaches. Psychological Methods, 4, 3-21.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
… depression items on subjects with eating disorders. European Journal of Psychological Assessment, 21, 1-10.
Rossi, P. E., Gilula, Z., & Allenby, G. M. (2001). Overcoming scale usage heterogeneity: A Bayesian hierarchical approach. Journal of the American Statistical Association, 96, 20-31.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 17.
Samejima, F. (1997). The graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York: Springer.
Sijtsma, K., & Meijer, R. R. (2001). The person response function as a tool in person-fit research. Psychometrika, 66, 191-208.
Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response theory. Thousand Oaks, CA: Sage.
Sijtsma, K., & Van der Ark, L. A. (2003). Investigation and treatment of missing item scores in test and questionnaire data. Multivariate Behavioral Research, 38, 505-528.
Steinberg, L., & Thissen, D. (1996). Uses of item response theory and the testlet concept in the measurement of psychopathology. Psychological Methods, 1, 81-97.
Thissen, D. (1991). MULTILOG user's guide: Multiple categorical item analysis and test scoring using item response theory [Computer manual]. Chicago: Scientific Software.
Van der Ark, L. A. (2001). Relationships and properties of polytomous item response theory models. Applied Psychological Measurement, 25, 273-282.
Van der Flier, H. (1980). Vergelijkbaarheid van individuele testprestaties [Comparability of individual test performance]. Lisse, Netherlands: Swets & Zeitlinger.
Van der Flier, H. (1982). Deviant response patterns and comparability of test scores. Journal of Cross-Cultural Psychology, 13, 267-298.
ment invariance. Psychological Bulletin, 114, Van Herk, H., Poortinga, Y. H., & Verhallen, T. M.
552-566. M. (2004). Response styles in rating scales: Evi-
Rennie, L. J. (1982). Research note: Detecting dence of method bias in data from six EU coun-
a response set to Likert-style attitude items with tries. Journal of Cross Cultural Psychology, 35,
the rating model. Education Research and Per- 346-360.
spectives, 9, 114-118. Van Krimpen-Stoop, E. M. L. A., & Meijer, R. R.
Rivas, T., Bersabé, R., & Berrocal, C. (2005). (2002). Detection of person misfit in computerized
Application of the double monotonicity model adaptive tests with polytomous items. Applied Psy-
to polytomous items: Scalability of the Beck chological Measurement, 26, 164-180.


Van Onna, M. J. H. (2003). Ordered latent class models in nonparametric item response theory. Unpublished doctoral dissertation, University of Groningen, Netherlands.
Zickar, M. J., & Drasgow, F. (1996). Detecting faking on a personality instrument using appropriateness measurement. Applied Psychological Measurement, 20, 71-88.
Zickar, M. J., Gibby, R. E., & Robie, C. (2004). Uncovering faking samples in applicant, incumbent, and experimental data sets: An application of mixed model item response theory. Organizational Research Methods, 7, 168-190.

Acknowledgments

The author would like to thank Klaas Sijtsma for his helpful comments on earlier versions of this article.

Author's Address

Address correspondence to Wilco H. M. Emons, Department of Methodology and Statistics, FSW, Tilburg University, P.O. Box 90153, 5000 LE Tilburg, Netherlands; e-mail: w.h.m.emons@uvt.nl.
