Handbook of Polytomous Item Response Theory Models

Edited by
Michael L. Nering
Remo Ostini


Routledge
Taylor & Francis Group
270 Madison Avenue
New York, NY 10016

Routledge
Taylor & Francis Group
27 Church Road
Hove, East Sussex BN3 2FA

© 2010 by Taylor and Francis Group, LLC
Routledge is an imprint of Taylor & Francis Group, an Informa business

This edition published in the Taylor & Francis e-Library, 2011.


To purchase your own copy of this or any of Taylor & Francis or Routledge's
collection of thousands of eBooks please go to www.eBookstore.tandf.co.uk.

International Standard Book Number: 978-0-8058-5992-8 (Hardback)

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Handbook of polytomous item response theory models / editors, Michael L. Nering, Remo Ostini.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-8058-5992-8 (hardcover : alk. paper)
1. Social sciences--Mathematical models. 2. Item response theory. 3. Psychometrics. 4. Social
sciences--Statistical methods. I. Nering, Michael L. II. Ostini, Remo.

H61.25.H358 2010
150.287--dc22 2009046380

Visit the Taylor & Francis Web site at


http://www.taylorandfrancis.com
and the Psychology Press Web site at
http://www.psypress.com

ISBN 0-203-86126-4 Master e-book ISBN



Contents

Preface vii
Contributors ix

Part I Development of Polytomous IRT Models

Chapter 1 New Perspectives and Applications 3
Remo Ostini and Michael L. Nering

Chapter 2 IRT Models for the Analysis of Polytomously Scored Data: Brief and Selected History of Model Building Advances 21
Ronald K. Hambleton, Wim J. van der Linden, and Craig S. Wells

Chapter 3 The Nominal Categories Item Response Model 43
David Thissen, Li Cai, and R. Darrell Bock

Chapter 4 The General Graded Response Model 77
Fumiko Samejima

Chapter 5 The Partial Credit Model 109
Geoff N. Masters

Chapter 6 Understanding the Response Structure and Process in the Polytomous Rasch Model 123
David Andrich

Part II Polytomous IRT Model Evaluation

Chapter 7 Factor Analysis of Categorical Item Responses 155
R. Darrell Bock and Robert Gibbons

Chapter 8 Testing Fit to IRT Models for Polytomously Scored Items 185
Cees A. W. Glas

Part III Application of Polytomous IRT Models

Chapter 9 An Application of the Polytomous Rasch Model to Mixed Strategies 211
Chun-Wei Huang and Robert J. Mislevy

Chapter 10 Polytomous Models in Computerized Adaptive Testing 229
Aimee Boyd, Barbara Dodd, and Seung Choi

Chapter 11 Equating With Polytomous Item Response Models 257
Seonghoon Kim, Deborah J. Harris, and Michael J. Kolen

Index 293


Preface
The Handbook of Polytomous Item Response Theory Models brings together leaders in the field to tell the story of polytomous item response theory (IRT). It is designed to be a valuable resource for researchers, students, and end-users of polytomous IRT models, bringing together, in one book, the primary actors in the development of the most important polytomous IRT models to describe their work in their own words. Through the chapters in the book, the authors show how these models originated and were developed, as well as how they
have inspired or assisted applied researchers and measurement practitioners. It is
hoped that hearing these stories and seeing what can be done with these models
will inspire more researchers, who might not otherwise have considered using
polytomous IRT models, to apply these models in their own work and thereby
achieve the type of improved measurement that IRT models can provide.
This handbook is for measurement specialists, practitioners, and graduate
students in psychological and educational measurement who want a compre-
hensive resource for polytomous IRT models. It will also be useful for those
who want to use the models but do not want to wade through the fragmented
mass of original literature and who need a more comprehensive treatment of
the topic than is available in the individual chapters that occasionally show up
in textbooks on IRT. It will also be useful to specialists who are unfamiliar
with polytomous IRT models but want to add it to their repertoire, particularly
psychologists and assessment specialists in individual differences, social, and
clinical psychology, who develop and use tests and measures in their work.
The handbook contains three sections. Part 1 is a comprehensive account
of the development of the most commonly used polytomous IRT models
and their location within two general theoretical frameworks. The context
of the development of these models is presented within either an historical
or a conceptual framework. Chapter 1 describes the contents of this book
and discusses major issues that cut across different models. It also provides
a model reference guide that introduces the major polytomous IRT models
in a common notation and describes how to calculate information functions
for each model. Chapter 2 outlines the historical context surrounding the
development of influential models, providing a basis from which to investi-
gate individual models more deeply in subsequent chapters. Chapters 1 and
2 also briefly introduce software that can be used to implement polytomous
IRT models, providing readers with a practical resource when they are ready
to use these models in their own work. In Chapters 3, 4, 5, and 6, the psy-
chometricians responsible for important specific models describe the devel-
opment of the models, outlining important underlying features of the models
and how they relate to measurement with polytomous test items.
Part 2 contains two chapters that detail two very different approaches to evaluating how well specific polytomous IRT models work in a given measurement context. Although model-data fit is not the focus of Chapter 7, while being very much the focus of Chapter 8, each of these chapters makes a substantial contribution to this difficult problem. Reminiscent of the earlier struggles in structural equation modeling, the lack of a strong fit-testing regimen is a serious impediment to the widespread adoption of polytomous IRT models.
Careful appraisal of the properties of the evaluation procedures and fit tests
outlined in the two chapters in this section, along with their routine implemen-
tation in accessible IRT software, would go far towards filling this need.
The final section demonstrates a variety of ways in which these models
have been used. In Chapter 9 the authors investigate the different test-taking
strategies of respondents using a multidimensional polytomous IRT model.

Chapter 10 comprehensively addresses the major issues in computerized


adaptive testing (CAT) using polytomous IRT models and provides a review
of CAT applications in both applied and research settings. Equating test
scores across different testing contexts is an important practical challenge in
psychological and educational testing. The theoretical and practical consider-
ations in accomplishing this task with polytomous IRT models are the focus
of the last chapter in this handbook.
Disparate elements of the book are linked through editorial sidebars that
connect common ideas across chapters, compare and reconcile differences in
terminology, and explain variations in mathematical notation. This approach allows the chapters to remain in the authors' own voice while drawing together commonalities that exist across the field.

Acknowledgements
This book is clearly a collaborative effort and we first and foremost acknowl-
edge the generosity of our contributing authors, particularly for sharing their
expertise, but also for their ongoing support for this project and for giving us a
glimpse into the sources of the inspirations and ideas that together form the field
of polytomous item response theory. A project like this always has unheralded
contributors working behind the scenes. That this book was ever completed is
due in no small part to the determination, skills, abilities, hard work, forbearance
and good humor of Kate Weber. Thanks Kate! We are also grateful for the assistance of the reviewers, including Mark D. Reckase of Michigan State University and Terry Ackerman of the University of North Carolina at Greensboro, as well as other unsung colleagues who contributed by casting a careful eye over different parts of
this project. We especially thank Jenny Ostini, Wonsuk Kim, Liz Burton, Rob
Keller, Tom Kesel and Robin Petrowicz. A critical catalyst in bringing this proj-
ect to fruition was a generous visiting fellowship from Measured Progress to RO,
which we gratefully acknowledge. Finally, we thank the capable staff at Taylor
and Francis, particularly Debra Riegert and Erin Flaherty, for their confidence
in this project and for their skill in turning our manuscript into this handbook.

MN Dover, New Hampshire


RO Ipswich, Queensland



Contributors

David Andrich, University of Western Australia
R. Darrell Bock, University of Illinois at Chicago
Aimee Boyd, Pearson
Li Cai, University of North Carolina at Chapel Hill
Seung Choi, Northwestern University
Barbara Dodd, University of Texas at Austin
Cees A. W. Glas, University of Twente
Robert Gibbons, University of Illinois at Chicago
Ronald K. Hambleton, University of Massachusetts at Amherst
Deborah J. Harris, ACT, Inc.
Chun-Wei Huang, WestEd
Seonghoon Kim, Keimyung University
Michael J. Kolen, University of Iowa
Geoff N. Masters, Australian Council for Educational Research
Robert J. Mislevy, University of Maryland
Michael L. Nering, Measured Progress
Remo Ostini, Healthy Communities Research Centre, University of Queensland
Fumiko Samejima, University of Tennessee
David Thissen, University of North Carolina at Chapel Hill
Wim J. van der Linden, CTB/McGraw Hill
Craig S. Wells, University of Massachusetts at Amherst


Part I
Development of Polytomous IRT Models
Chapter 1
New Perspectives and Applications

Remo Ostini
Healthy Communities Research Centre, University of Queensland

Michael L. Nering
Measured Progress

Polytomous item response theory (IRT) models are mathematical mod-


els used to help us understand the interaction between examinees and test questions where the test questions have various response categories. These test questions are not scored in a simple dichotomous manner (i.e., correct/incorrect); rather, they are scored in a way that reflects the particular score category that an examinee has achieved, been classified into, or selected (e.g., a score point of 2 on an item that is scored from 0 to 4, or selecting "somewhat agree" on a survey).
Polytomous items have become omnipresent in the educational and psy-
chological testing community because they offer a much richer testing expe-
rience for the examinee while also providing more psychometric information
about the construct being measured. There are many terms used to describe
polytomous items (e.g., constructed response items, survey items), and poly-
tomous items can take on various forms (e.g., writing prompts, Likert type
items). Essentially, polytomous IRT models can be used for any test question
where there are several response categories available.
The development of measurement models that are specifically designed
around polytomous items is complex, spans several decades, and involves a
variety of researchers and perspectives. In this book we intend to tell the
story behind the development of polytomous IRT models, explain how
model evaluation can be done, and provide some concrete examples of work
that can be done with polytomous IRT models. Our goal in this text is to
give the reader a broad understanding of these models and how they might
be used for research and operational purposes.

Y102002_Book.indb 3 3/3/10 6:56:50 PM


4 Remo Ostini and Michael L. Nering

Who Is This Book For?


This book is intended for anyone who wants to learn more about polytomous IRT models. Many of the concepts discussed in this book are technical in nature and will require an understanding of measurement theory and some familiarity with dichotomous IRT models. There are several excellent sources for learning more about measurement generally (Allen & Yen, 1979; Anastasi, 1988; Crocker & Algina, 1986; Cronbach, 1990) and dichotomous IRT models specifically (e.g., Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991). Throughout the book there are numerous references that are valuable resources for those interested in learning
more about polytomous IRT.

The Approach of This Book and Its Goals


This handbook is designed to bring together the major polytomous IRT
models in a way that helps both students and practitioners of social science
measurement understand where these state-of-the-art models come from,
how they work, and how they can be used. As Hambleton, van der Linden,
and Wells (Chapter 2) point out, the handbook is not an exhaustive cata-
logue of all polytomous IRT models, but the most commonly used models
are presented in a comprehensive manner.
It speaks to the maturation of this field that there are now models that
appear to have fallen by the wayside despite what could be considered desir-
able functional properties. Rost's (1988) successive intervals model might be an example of such a model in that very little research has been focused on it. Polytomous IRT also has its share of obscure models that served their purpose as the field was finding its feet but which have been supplanted by more flexible models (e.g., Andrich's (1982) dispersion model has given way to the partial credit model) or by mathematically more tractable models (e.g., Samejima's (1969) normal ogive model is more difficult to use than her logistic model).
Perhaps the most prominent model to not receive separate treatment in
this handbook is the generalized partial credit model (GPCM; Muraki,
1992). Fortunately, the structure and functioning of the model are well cov-
ered in a number of places in this book, including Hambleton and colleagues' survey of the major polytomous models (Chapter 2) and Kim, Harris, and Kolen's exposition of equating methods (Chapter 11).
Rather than focus on an exhaustive coverage of available models, this
handbook tries to make polytomous IRT more accessible to a wider range
of potential users in two ways. First, providing material on the origins and
development of the most influential models brings together the historical
and conceptual setting for those models that are not easily found elsewhere.
The appendix to Thissen, Cai, and Bock's chapter (Chapter 3) is an example of previously unpublished material on the development context for the nominal model.


Second, this handbook addresses important issues around using the


models, including the challenge of evaluating model functioning (Bock &
Gibbons, Chapter 7; Glas, Chapter 8) and applying the models in comput-
erized adaptive testing (CAT; Boyd, Dodd, & Choi, Chapter 10), equating
test scores derived from polytomous models (Kim et al., Chapter 11), and
using a polytomous IRT model to investigate examinee test-taking strategies
(Huang & Mislevy, Chapter 9).

Part 1: Development

In this book we attempt to bring together a collection of different polyto-


mous IRT models with the story of the development of each model told by
the people whose work is most closely associated with the models. We begin
with a chapter by Hambleton, van der Linden, and Wells (Chapter 2), which
broadly outlines various influential polytomous models, introducing their
mathematical form and providing some of the common historical setting
for the models. Introducing a range of models in this consistent way forms a
solid basis for delving into the more complex development and measurement
issues addressed in later chapters. Hambleton and colleagues also introduce
models that are not addressed in later model development chapters (e.g.,
generalized partial credit model, nonparametric IRT models) and touch on
parameter estimation issues and other challenges facing the field.
Thissen and Cai (Chapter 3) provide a succinct introduction to the nomi-
nal categories item response model (often known in other places as the
nominal response model). They neatly describe derivations and alternative
parameterizations of the model as well as showing various applications of the
model. Saving the best for last, Thissen and Cai provide a completely new
parameterization for the nominal model. This new parameterization builds
on 30 years of experience to represent the model in a manner that facilitates
extensions of the model and simplifies the implementation of estimation
algorithms for the model. The chapter closes by coming full circle with a special contribution by R. Darrell Bock, which provides previously unpublished insight into the background to the model's genesis and is one of the highlights of this book.
Samejima's chapter (Chapter 4) is in some ways the most ambitious in this book. It presents a framework for categorizing and cataloguing every possible unidimensional polytomous IRT model, including every specific model developed to date as well as future models that may be developed. Issues with nomenclature often arise in topics of a technical nature, and this is certainly the case in the world of polytomous IRT models. For example, Samejima typically used the term graded response model or the related general graded response model to refer to her entire framework of models. In effect, the graded response model is, for her, a model of models. In common usage, however, graded response model (GRM) refers to a specific model, which Samejima developed before fully expounding her framework. Samejima herself calls this specific model the logistic model in the homogeneous case, and never


refers to it as the graded response model. There is no simple way to resolve this terminology conflict. Ultimately the reader simply needs to be aware that when Samejima refers to the graded response model, she is referring to her framework, while other authors are referring to her logistic model. The other major terminological issue is the distinction between Samejima's homogeneous case and her heterogeneous case. This dichotomy refers to different types of models, the two major branches in her framework. Early researchers understood the heterogeneous case to simply be Samejima's logistic model (usually called the GRM) with a discrimination parameter that varied across categories. This is not correct. The simplest way to understand the distinction is that the homogeneous case is the term for models that are elsewhere called difference models (Thissen & Steinberg, 1986), cumulative models (Mellenbergh, 1995), or indirect models (Hambleton et al., Chapter 2), whereas models in the heterogeneous case are essentially all other polytomous IRT models, including the nominal response model and Rasch type models. These issues will be highlighted on occasion as they arise throughout the book.
Prior to presenting the comprehensive framework in Chapter 4, Samejima outlines a set of criteria for evaluating the adequacy of any given model. These criteria are essentially an argument for assessing a model, not at the model-data fit level, but rather at the level of how the model operates: the structural and functional properties that determine how the model represents response data.
The chapter by Andrich (Chapter 6) also presents an argument about model functioning. Andrich argues that the feature of polytomous Rasch models which allows item category thresholds to be modeled in a different order to the response categories themselves provides an important diagnostic tool for testing data, and the items that produced it. Moving beyond the simple, intuitively appealing, but ultimately inaccurate representation of these thresholds as successive steps in a response process, Andrich argues that the presence of unordered category thresholds is an indication of an improperly functioning item. He notes that this diagnostic ability is not available in models belonging to Samejima's homogeneous case of graded response models. Thus, while Samejima's argument outlines how models should function to properly represent response data, Andrich's argument concerns the properties that should be present in response data if they are to be properly modeled.
Nested between these two chapters is a relatively unargumentative presentation of the logic behind the partial credit model (PCM) as told by the originator of that model (Masters, Chapter 5). The PCM is an extremely flexible polytomous model that can be applied to any polytomous response data, including data from tests that have items with different numbers of categories, from questionnaires using rating scales, or both. While the PCM is a model in the Rasch family of models, it was developed separately from Rasch's (1961) general polytomous model.


Some Issues with Polytomous Model Item Parameters


The presentation of the PCM in Masters's chapter (as well as its description in the Hambleton et al. chapter) focuses more narrowly on its use with the types of items that give rise to the model's name: ability test items for which it is possible to obtain partial credit. This focus can obscure the versatility of the model and also tends to reinforce the notion of item category thresholds being successive steps in an item response process. This notion glosses over the fact that these thresholds do not model responses to pairs of independent categories, as Masters notes toward the end of his chapter, since all the categories in an item are interconnected. Nor do the thresholds model successive steps, because they do not take into account response probabilities beyond the categories being modeled (see, e.g., Tutz, 1990; Verhelst & Verstralen, 1993).
Differences over step terminology and the ambiguity surrounding what
the difficulty parameter of PCM thresholds actually represents should not
obscure the fact that this is a very versatile model with desirable statistical
properties. Part of its flexibility derives from the fact that for polytomous
models (including the PCM), the discriminating power of a specific category
depends not only on a separate modeled parameter but also on the proximity
of thresholds adjacent to each other (Muraki, 1992, 1993; Muraki & Bock,
1999). The closer two adjacent thresholds are, the more discriminating the
category that they bound.
Digging deeper into why the item step notion is often misunderstood in the PCM reveals a feature of polytomous IRT model item parameters that underlies a broader set of misunderstandings in polytomous IRT models generally. Simply put, the probability of traversing a category boundary threshold (i.e., of passing an item step) is not the same as the probability of responding in the next category, except in one trivial case. These probabilities are never the same in the case of polytomous Rasch models or other models in Samejima's heterogeneous case of graded models (also called divide-by-total models (Thissen & Steinberg, 1986), adjacent category models (Mellenbergh, 1995), or direct models (Hambleton et al., Chapter 2)). What this means, and why these probabilities are not the same, is easiest to see for models in Samejima's homogeneous case of graded models (also called difference models (Thissen & Steinberg, 1986), cumulative models (Mellenbergh, 1995), or indirect models (Hambleton et al., Chapter 2)). As
will be shown in later chapters, in this type of polytomous IRT model the
probability of passing an item threshold is explicitly modeled as the probabil-
ity of responding in any category beyond that threshold. This is clearly not
the same as responding in the category immediately beyond the threshold,
unless it is the final item threshold and there is only one category remaining
beyond it. For example, the modeled probability of passing the threshold
between the second and third categories in a five-category item (i.e., passing
the second step in a four-step item) is explicitly modeled as the probability
of responding in Categories 3, 4, or 5. Clearly, this cannot be the same prob-
ability as that of responding in Category 3.
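The distinction is easy to verify numerically. Below is a minimal sketch (in Python; the item parameter values are hypothetical and chosen purely for illustration) that computes both probabilities for a five-category item under a cumulative (homogeneous case) model of the kind just described.

    import math

    def p_star(a, b, theta):
        # Cumulative boundary function: probability of responding
        # beyond the boundary located at b
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    a = 1.2                          # item discrimination (hypothetical)
    b = [-1.5, -0.5, 0.5, 1.5]       # four ordered boundary locations (hypothetical)
    theta = 0.0                      # trait level at which to evaluate

    # Boundary probabilities P*_1..P*_4, padded with P*_0 = 1 and P*_5 = 0
    bounds = [1.0] + [p_star(a, bk, theta) for bk in b] + [0.0]

    # Category probabilities: P_k = P*_k - P*_(k+1), for categories 1..5
    p_cat = [bounds[k] - bounds[k + 1] for k in range(5)]

    # Passing the second threshold means responding in Categories 3, 4, or 5,
    # which is clearly not the probability of responding in Category 3 alone
    print("P(pass threshold 2)      =", bounds[2])   # = P_3 + P_4 + P_5
    print("P(respond in Category 3) =", p_cat[2])

With these particular values the sketch prints approximately 0.65 for passing the second threshold but only about 0.29 for responding in Category 3.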


The distinction between passing a category threshold (step) and responding in a category is more difficult to appreciate, but equally real, in Rasch type (divide-by-total, adjacent category) polytomous IRT models, because the category boundary thresholds are only defined (and modeled) between pairs of adjacent categories. Thus, in this type of model, the probability of passing the threshold from the second to the third categories in the aforementioned five-category item is modeled simply as the probability of responding in Category 3 rather than in Category 2. This is not the same probability as simply responding in Category 3, even though it might sound like it should be. In fact, the probability of just responding in Category 3 is a function of the probability of passing each threshold up to Category 3 divided by a function of the probability of passing every threshold in the entire item. In simple terms, the probability of responding in Category 3 is not the same as that of responding in Category 3 rather than Category 2, because it is instead the probability of responding in Category 3 rather than responding in Categories 1, 2, 4, or 5.
In practice, a manifestation of this distinction is that for a given set of
data, modeled by a general direct and a general indirect polytomous model,
the probabilities associated with the category boundaries (e.g., the difficulty
or location parameters) will be quite different for the two models, whereas
the probability of responding in a particular category will be almost identical
across the range of the measurement scale (Ostini, 2001).

The Importance of the Distinction


One thing that the foregoing discussion tells us is that category thresholds (specifically, their associated parameters) have different meanings in difference models compared to divide-by-total models, a situation that does not exist for dichotomous models. Put another way, whereas restricting a two-parameter dichotomous model to only have one modeled item parameter effectively makes it a Rasch model, this is not at all the case for polytomous models. The modeled threshold parameters for the generalized partial credit model (GPCM) and Samejima's logistic model do not have the same meaning, even though they have the same number and type of parameters, and removing the discrimination parameter from the logistic model does not make it a polytomous Rasch model.
The failure to appreciate the distinction between passing a threshold and
responding in a category is easy to understand considering that polytomous
IRT models were extrapolated from dichotomous models where this dis-
tinction does not exist. Passing the threshold between the two categories in
a dichotomous item has the same probability as responding in the second
category. In the dichotomous case, that is precisely what passing the thresh-
old means. It means getting the item right, choosing yes instead of no, and
getting a score of 1 rather than 0. As has hopefully been made clear, passing
a polytomous item threshold is not nearly that simple.
Failing to make the distinction between the two probabilities (passing
a threshold and responding in a category) with polytomous IRT models is


even easier to understand considering the focus of early IRT modeling on


tests of ability. In that context, passing the threshold between the second and
the third category in an item is commonly understood to mean that you get
the mark for the third category. What is ignored in this understanding is that
it also means that you failed to pass subsequent thresholds on the item, even
though this failure must be (and is) included in the modeled probability of
responding in that third category.
The early context for polytomous IRT models, combined with a failure to clearly enunciate the semantic distinction between threshold probabilities and category probabilities, likely contributed to the misunderstanding surrounding the step notion for category thresholds in polytomous Rasch models. This misunderstanding leads to the belief that the difficulty of passing a category threshold is the same as the difficulty for that category when, as we have seen above, it is not the same probability.
The failure to rigorously distinguish between category threshold probabil-
ities and the probability of responding in a category can lead to some loose-
ness in terminology in discussing polytomous IRT models and their usage.
In such cases, passing a threshold is spoken of as responding in a category.
Examples of this sort of blurring of the distinction between the two types of
probability can be seen by implication in parts of Chapters 2 and 11.
While blurring the distinction between the two types of probability is unlikely to have any adverse consequences on respondents' test scores, for example, it can lead to misunderstandings about the nature of, and relationships between, different types of models. It can also lead to misunderstandings about how polytomous models operate: what they can do, how they do it, and what they provide the test user. Equally importantly, this distinction has implications for the types of arguments that Andrich makes in Chapter 6. As a result, being clear about the distinction between the probability of passing a category threshold and the probability of responding in a category is important for discussions about choosing among the different models.

Part 2: Evaluation
On our journey to actually using the polytomous IRT models in our application section, we provide two different methods for evaluating the use of the polytomous models relative to the data at hand. The task of model evaluation is not an easy one, and there are several ways one might perform this task. Our intent here is to provide an overview of a couple of approaches that might be considered.
Bock and Gibbons (Chapter 7) describe the development of an extension to full information factor analysis (FIFA), which not only brings multidimensional IRT a step closer, but also allows the dimensionality of an instrument to be evaluated through a confirmatory factor analysis procedure. A feature of this chapter is the worked example that clearly shows how to take advantage of the possibilities that this method provides. An interesting but uncommon form of confirmatory factor analysis, bifactor analysis, is also


described and demonstrated in this chapter. This innovative model test pro-
cess provides an elegant way to test models for data that contain one general
factor and a number of group factors.
In Chapter 8, Glas focuses squarely on the problem of evaluating fit in polytomous IRT models. Outlining both an innovative likelihood-based framework and a Bayesian approach, Glas systematically addresses the challenges and complexities of evaluating both person and item fit in models with a substantial number of estimated parameters. He shows how these approaches can be applied to general versions of three broad types of polytomous IRT models: Rasch type, Samejima homogeneous case type, and sequential models. Furthermore, Glas demonstrates how relevant fit statistics can be calculated. Given that the lack of adequate fit tests might be considered the Achilles' heel of polytomous IRT, the solutions that Glas provides warrant enthusiastic investigation to determine whether they can fulfill their promise. If these methods prove successful, their inclusion in future polytomous IRT model estimation software would greatly enhance their usability and reach.

Part 3: Applications
Rather than catalogue examples of areas where polytomous IRT models have
been used in practice, the approach in this book is to focus on a few key issues
that are important when using polytomous IRT models in applied settings. In
Chapter 9, Huang and Mislevy apply a multidimensional polytomous Rasch
model to the investigation of the different strategies that test takers bring to an
examination. The multidimensional model being used here is parameterized very
differently than the multidimensional model used in the FIFA chapter (Chapter
7), with different dimensions representing respondent differences rather than
item differences. This very flexible model is used to score student responses in
terms of their conceptions of a content domain, rather than in terms of the cor-
rectness of the response, which is typically the focus of ability measurement.
In their chapter on computerized adaptive testing (CAT) with polytomous IRT models, Boyd et al. (Chapter 10) provide a detailed and clear presentation of both the major issues that arise in CAT and how polytomous CAT
has been used. The presentation includes sections on CAT use in research
and applied settings and describes both the challenges and opportunities
associated with polytomous CAT.
In the final chapter, Kim, Harris, and Kolen (Chapter 11) provide a care-
ful and comprehensive survey of equating methods and how those methods
can be applied in the context of polytomous IRT models. The breadth and
depth of coverage in this chapter results in an excellent overview of different
equating methods, their advantages, their challenges, and issues specifically
associated with their use in polytomous IRT models. The figures and the
example provided in the chapter are a welcome feature and help to make
more concrete some of the distinctions between the three equating methods
that are described in the first part of the chapter. This chapter is particularly


important as equating becomes a significant topic with measurement models


increasingly used operationally rather than primarily being presented and
studied from a theoretical framework.
Full information factor analysis (Chapter 7), the approach to model-data
fit developed by Glas (Chapter 8), and the investigation of mixed-response
strategies (Chapter 9) are features of polytomous IRT modeling that are still
largely confined to research settings. In contrast, CAT and equating have
moved beyond the research setting and are important elements in the routine
use of these models. It is a sign of the maturation that is occurring in this field
that most of the models described in the following chapters are being used to

do testing rather than being limited to measurement research. The advantages


of polytomous IRT models are also being drawn on more in the construc-
tion of published measurement instruments, such as, for example, Hibbard, Mahoney, Stockard, and Tusler's (2005) patient activation measure.

Integration
Rather than modify what authors have written, we have tried to enhance the
flow of the book by connecting chapters with notes that draw together rela-
tionships among different chapters. We do this in the first instance through
editor note sidebars at the beginning of chapters.

Relationships Among Models


Additionally, we will make comments within a chapter to allow us to further
compare and contrast various aspects of the models presented in the book.
Below is an example of an editor note sidebar, highlighting relationships
among models, and printed in the style you will see throughout the text:

Relationship to Other Models: Often it is important to compare and contrast models to better
understand the intricate details of a model. Pay particular attention to pop-out boxes that focus on
model comparison so that you will have a comprehensive understanding of the various models used.

Terminology Note
We have tried to highlight the meaning of important elements of models and
applications, especially where similar concepts are represented differently, by
using terminology notes. Again, we will do this primarily through editor note
sidebars within chapters. The goal of using these special editor notes is to help
connect basic ideas from one chapter to the next, to help readers understand
some confusing concepts, or to offer an alternative explanation to a key con-
cept. Below is how we will highlight terminology notes throughout this book:

Terminology Note: We prefer to use the term boundary parameter to describe the statistical term
used for the functions that separate response categories. Different authors use this term or concept
differently, and we will highlight this throughout the text.


Notational Differences
Various elements in the field of polytomous IRT developed out of different
mathematical traditions, and consequently, a range of notational conventions
are used across the field. In another place (Ostini & Nering, 2006) we have
attempted to present a unified notational approach. In this handbook, how-
ever, we want authors to present what is often their life's work in their own
voice and have kept the preferred notation of each author. Retaining the
notation used by the contributing authors allows readers who follow up the
work of any of these authors to find consistent notation across the author's body of work. Instead of changing notation or terminology, we have provided


brief editor notes in sidebars throughout the text to highlight links between
differing notations and uses of terms across chapters.

Notational Differences: These will be highlighted so that comparisons between models can be
made and to help avoid confusion from one chapter to the next.

Model Reference Guide


Below is a model reference guide that can be used while reading this text, or
while using the models for research or operational purposes. For this partic-
ular section we have highlighted what we believe to be the most commonly
used polytomous IRT models, or models that deserve special attention.
Within this reference guide we have used a common notational method that
allows the reader to more readily compare and contrast models, and we have
included information functions that the reader might find useful.

Note on Information
Polytomous IRT information can be represented in two ways. It can be eval-
uated at the item level or at the category level. Starting at the category level,
Samejima (1977, 1988, 1996, 1998) defines information as the second derivative of the log of the category response probability

$$I_{ik}(\theta) = -\frac{\partial^2}{\partial\theta^2}\log P_{ik}(\theta) \qquad (1.1)$$

where $I_{ik}(\theta)$ is information for category k of item i evaluated across the range of $\theta$, and $P_{ik}(\theta)$ is the probability of responding in category k of item i.
Category information can then be combined to produce item information $I_i(\theta)$,

$$I_i(\theta) = \sum_k I_{ik}(\theta)\,P_{ik}(\theta) \qquad (1.2)$$


which Samejima (1969) notes is equivalent, in conditional expectation terms, to describing item information as the expected value of category information. That is,

$$I_i = E[\,I_{ik} \mid \theta\,] \qquad (1.3)$$

For operational purposes it is perhaps simpler to work at the item level where, broadly speaking, item information can be defined as the squared item response function slope divided by the conditional variance. Operationalizing this definition rests on understanding that a polytomous item response function (IRF) is essentially a regression of the item score on the trait scale (Chang & Mazzeo, 1994; Lord, 1980), that is, the expected value of an item response as a function of the change in $\theta$ (Andrich, 1988). For polytomous items, the expected value of response x (where $x_k = 0, 1, \ldots, m$) is

$$E[X_i] = \sum_k k\,P_{ik}(\theta) \qquad (1.4)$$

Item information is then a partial derivative of the expected value of an item response

$$\frac{\partial E[X_i]}{\partial\theta} = V[X_i] = \sum_k k^2 P_{ik} - \left[\sum_k k\,P_{ik}\right]^2 \qquad (1.5)$$

If necessary, category information can then simply be obtained as a partition of item information due to a particular category, by

$$I_{ik}(\theta) = P_{ik}(\theta)\,I_i(\theta) \qquad (1.6)$$


This equation is in the normal ogive metric and is for models without a separate discrimination parameter. The logistic metric can be obtained by multiplying Equation 1.5 by the usual correction factor squared (i.e., $D^2$, where $D = 1.702$). Similarly, multiplying Equation 1.5 by the squared item discrimination ($a^2$) takes account of separately modeled item discrimination in calculating information.
Information for all of the models for ordered response category data below could be calculated using Equation 1.5. However, information functions for the graded response model (Samejima's logistic model in the homogeneous case) have traditionally been obtained at the category level, and so that approach will be shown below. Dodd and Koch (1994) found that the two approaches to obtaining item information produce almost identical results empirically. Matters are more complicated for the nominal model, and the procedure described below draws heavily from both the logic and the mathematical derivations provided by Baker (1992). The first two information functions in the following model reference guide will be defined in terms of category information, while the final three functions will be based on Equation 1.5. Note that even though most of the information functions are defined at the item level ($I_i$), the IRT models themselves describe category-level functions ($P_{ik}$).
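As a computational companion to Equations 1.4 through 1.6, the following sketch (Python with NumPy; the function names are ours and do not come from any of the software packages discussed later in this chapter) computes item and category information from a vector of category response probabilities for a model without a separate discrimination parameter.

    import numpy as np

    def item_information(p):
        # Equation 1.5: item information as the conditional variance of the
        # item score, given the category probabilities p = (P_i0, ..., P_im)
        # evaluated at a particular value of theta
        p = np.asarray(p)
        k = np.arange(len(p))                    # category scores 0, 1, ..., m
        return np.sum(k**2 * p) - np.sum(k * p)**2

    def category_information(p):
        # Equation 1.6: category information as a partition of item
        # information due to each particular category
        p = np.asarray(p)
        return p * item_information(p)

    # The metric corrections described above are applied by multiplication,
    # e.g., for the logistic metric: (1.702**2) * item_information(p)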

Nominal Model
The Model
$$P_{ik}(u = k \mid \theta; \boldsymbol{a}, \boldsymbol{c}) = P_{ik}(\theta) = \frac{\exp(a_k\theta + c_k)}{\sum_v \exp(a_v\theta + c_v)} \qquad (1.7)$$

where $P_{ik}(\theta)$ is the probability that a response u to item i is in category k (k = 0, 1, …, m), as a function of the ability or trait continuum $\theta$, with a category slope parameter $a_k$ and category intercept parameter $c_k$, and with $(a_k\theta + c_k) \equiv Z_k$.

Item Information
The most practical way to present item information for the nominal model is through a three-step process. Firstly, a general equation is presented (Equation 1.8). This contains two derivatives that require calculation. Each part is described separately (Equations 1.9 and 1.10), and each is typically calculated separately, with the appropriate values substituted back into Equation 1.8 to obtain the information function for an item.

The general equation is

$$I_i(\theta) = \sum_{k=1}^{m}\left[\frac{[P'_{ik}(\theta)]^2}{P_{ik}(\theta)} - P''_{ik}(\theta)\right] \qquad (1.8)$$

where $I_i(\theta)$ is information evaluated across the range of $\theta$ for item i, and $P_{ik}(\theta)$ is defined in Equation 1.7.

The equation for the first derivative, $P'_{ik}(\theta)$, is

$$P'_{ik}(\theta) = \frac{\exp(Z_k)\sum_v \exp(Z_v)(a_k - a_v)}{\left[\sum_v \exp(Z_v)\right]^2} \qquad (1.9)$$

while the equation for the second derivative, $P''_{ik}(\theta)$, is

$$P''_{ik}(\theta) = \frac{\exp(Z_k)\left\{\left[\sum_v \exp(Z_v)\right]\left[\sum_v \exp(Z_v)\left(a_k^2 - a_v^2\right)\right] - 2\left[\sum_v \exp(Z_v)(a_k - a_v)\right]\left[\sum_v a_v \exp(Z_v)\right]\right\}}{\left[\sum_v \exp(Z_v)\right]^3} \qquad (1.10)$$

where $Z_k$ is defined as $(a_k\theta + c_k)$, $c_k$ is the category intercept parameter, and $a_k$ is the category slope parameter.
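For computation, Equations 1.9 and 1.10 simplify considerably once numerators and denominators are divided through by the normalizing sum: the first derivative becomes $P'_{ik} = P_{ik}(a_k - \bar{a})$ and the second becomes $P''_{ik} = P_{ik}[(a_k - \bar{a})^2 - \mathrm{Var}(a)]$, where $\bar{a}$ and $\mathrm{Var}(a)$ are the mean and variance of the slope parameters weighted by the category probabilities. The sketch below (Python with NumPy; parameter values hypothetical) uses these equivalent forms.

    import numpy as np

    def nominal_probs(theta, a, c):
        # Equation 1.7: nominal model category probabilities
        z = a * theta + c                # Z_k = a_k * theta + c_k
        ez = np.exp(z - z.max())         # subtract max(z) for numerical stability
        return ez / ez.sum()

    def nominal_item_information(theta, a, c):
        # Equation 1.8, via the closed forms of Equations 1.9 and 1.10
        p = nominal_probs(theta, a, c)
        a_bar = np.sum(a * p)                  # probability-weighted mean slope
        var_a = np.sum(a**2 * p) - a_bar**2    # probability-weighted slope variance
        p1 = p * (a - a_bar)                   # first derivative (Equation 1.9)
        p2 = p * ((a - a_bar)**2 - var_a)      # second derivative (Equation 1.10)
        return np.sum(p1**2 / p - p2)          # Equation 1.8

    # Hypothetical three-category item:
    a = np.array([0.0, 1.0, 2.0])              # category slope parameters
    c = np.array([0.0, 0.5, -0.5])             # category intercept parameters
    print(nominal_item_information(0.0, a, c))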


Considerations in Using This Model


This model was specifically designed for polytomous item types where the
response categories do not need to follow a specific order.

Graded Response Model (Logistic ModelHomogeneous Case)


The Model

$$P_{ik}(\theta) = \frac{\exp[a_i(\theta - b_{ik})]}{1 + \exp[a_i(\theta - b_{ik})]} - \frac{\exp[a_i(\theta - b_{i,k+1})]}{1 + \exp[a_i(\theta - b_{i,k+1})]} \qquad (1.11)$$

which is summarized as $P_{ik} = P^*_{ik} - P^*_{i,k+1}$, where $P_{ik}(\theta)$ is the probability of responding in category k (k = 0, 1, …, m) of item i, $P^*_{ik}$ represents the category boundary (threshold) function for category k of item i, $a_i$ is the item discrimination parameter, and $b_{ik}$ is the difficulty (location) parameter for category boundary (threshold) k of item i.

Item Information

$$I_i(\theta) = \sum_k A_{ik} \qquad (1.12)$$

$$A_{ik} = D^2 a_i^2\,\frac{\left[P^*_{ik}(\theta)[1 - P^*_{ik}(\theta)] - P^*_{i,k+1}(\theta)[1 - P^*_{i,k+1}(\theta)]\right]^2}{P_{ik}(\theta)} \qquad (1.13)$$

where $I_i(\theta)$ is information evaluated across the range of $\theta$ for item i, $A_{ik}(\theta)$ is described as the basic function, $P_{ik}(\theta)$ is defined in Equation 1.11, $P^*_{ik}$ represents the category boundary (threshold) function for category k of item i, D is the scaling factor 1.702, and $a_i$ is the item discrimination parameter.
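A corresponding sketch for the graded response model (Python with NumPy; parameter values hypothetical). Because Equation 1.13 carries the $D^2$ factor, the boundary functions below include D on the assumption that $a_i$ is expressed in the normal ogive metric; if $a_i$ is already in the logistic metric, D would be set to 1 inside the boundary functions.

    import numpy as np

    def grm_item_information(theta, a, b, D=1.702):
        # Equations 1.11-1.13: a is the item discrimination a_i and
        # b holds the increasing boundary locations b_i1, ..., b_im
        b = np.asarray(b)
        # Boundary functions P*_k, padded with P*_0 = 1 and P*_(m+1) = 0
        p_star = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
        p_star = np.concatenate(([1.0], p_star, [0.0]))
        p_cat = p_star[:-1] - p_star[1:]                 # Equation 1.11
        q = p_star * (1.0 - p_star)                      # P*(1 - P*) at each boundary
        A = D**2 * a**2 * (q[:-1] - q[1:])**2 / p_cat    # basic functions (Equation 1.13)
        return A.sum()                                   # Equation 1.12

    # Hypothetical five-category item:
    print(grm_item_information(0.0, a=1.2, b=[-1.5, -0.5, 0.5, 1.5]))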

Rating Scale Model


The Model

$$P_{ik}(\theta) = \frac{\exp\sum_{j=0}^{k}\left[\theta - (\delta_i + \tau_j)\right]}{\sum_{h=0}^{m_i}\exp\sum_{j=0}^{h}\left[\theta - (\delta_i + \tau_j)\right]} \qquad (1.14)$$

where $P_{ik}(\theta)$ is the probability of responding in category k (k = 0, 1, …, m) of item i, $\delta_i$ is the item difficulty (location) parameter, and $\tau_k$ is the common category boundary (threshold) parameter for all the items using a particular rating scale. The $\tau_k$ define how far from any given item location a particular threshold for the scale is located.


Item Information

$$I_i(\theta) = \sum_k k^2 P_{ik} - \left[\sum_k k\,P_{ik}\right]^2 \qquad (1.15)$$

where $I_i(\theta)$ is information evaluated across the range of $\theta$ for item i, summed across k categories (k = 0, 1, …, m), and $P_{ik}(\theta)$ is defined in Equation 1.14.

Partial Credit Model


The Model

$$P_{ik}(\theta) = \frac{\exp\sum_{j=0}^{k}(\theta - \delta_{ij})}{\sum_{h=0}^{m_i}\exp\sum_{j=0}^{h}(\theta - \delta_{ij})} \qquad (1.16)$$

where $P_{ik}(\theta)$ is the probability of responding in category k (k = 0, 1, …, m) of item i, and $\delta_{ik}$ is the difficulty (location) parameter for category boundary (threshold) k of item i.

Item Information

$$I_i(\theta) = \sum_k k^2 P_{ik} - \left[\sum_k k\,P_{ik}\right]^2 \qquad (1.17)$$

where this is identical to the equation for the rating scale model because both equations are in the same metric and in both cases $P_{ik}(\theta)$ is calculated without reference to a separate discrimination parameter. Information will nevertheless differ across the two models, for a given set of data, because $P_{ik}(\theta)$ will be different at any given level of $\theta$ for the two models.
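A minimal sketch of Equations 1.16 and 1.17 (Python with NumPy; threshold values hypothetical). The same two functions serve the rating scale model: to obtain Equation 1.14, simply pass thresholds of the form $\delta_i + \tau_j$.

    import numpy as np

    def pcm_probs(theta, delta):
        # Equation 1.16: PCM category probabilities for the thresholds
        # delta = (delta_i1, ..., delta_im); category 0 has an empty sum
        z = np.concatenate(([0.0], np.cumsum(theta - np.asarray(delta))))
        ez = np.exp(z - z.max())         # subtract max for numerical stability
        return ez / ez.sum()

    def pcm_item_information(theta, delta):
        # Equation 1.17: the conditional variance of the item score
        p = pcm_probs(theta, delta)
        k = np.arange(len(p))
        return np.sum(k**2 * p) - np.sum(k * p)**2

    # Hypothetical four-category item (three thresholds):
    print(pcm_item_information(0.0, delta=[-1.0, 0.0, 1.0]))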

Generalized Partial Credit Model (Two-Parameter Partial Credit Model)


The Model

$$P_{ik}(\theta) = \frac{\exp\sum_{j=0}^{k}1.7a_i(\theta - b_i + d_j)}{\sum_{h=0}^{m_i}\exp\sum_{j=0}^{h}1.7a_i(\theta - b_i + d_j)} \qquad (1.18)$$

where $P_{ik}(\theta)$ is the probability of responding in category k (k = 0, 1, …, m) of item i, $a_i$ is the item discrimination parameter, $b_i$ is the item difficulty (location) parameter, and $d_j$ is the category boundary (threshold) parameter for an item. The $d_j$ define how far from an item location a threshold is located.


Item Information


$$I_i(\theta) = D^2 a_i^2\left[\sum_k k^2 P_{ik} - \left(\sum_k k\,P_{ik}\right)^2\right] \qquad (1.19)$$

where $I_i(\theta)$ is information evaluated across the range of $\theta$ for item i, summed across k categories (k = 0, 1, …, m), $P_{ik}(\theta)$ is defined in Equation 1.18, D is the scaling factor 1.702, and $a_i$ is the item discrimination parameter. This is the same as for the partial credit model with the addition of the squared item discrimination parameter. The generalized partial credit model is also typically reported in the logistic metric, hence the further addition of $D^2$.
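The corresponding sketch for Equations 1.18 and 1.19 (Python with NumPy; parameter values hypothetical):

    import numpy as np

    def gpcm_probs(theta, a, b, d, D=1.7):
        # Equation 1.18: GPCM category probabilities, taking d_0 = 0
        d = np.concatenate(([0.0], np.asarray(d)))
        z = np.cumsum(D * a * (theta - b + d))    # running sums over j = 0, ..., k
        ez = np.exp(z - z.max())
        return ez / ez.sum()

    def gpcm_item_information(theta, a, b, d, D=1.7):
        # Equation 1.19: D^2 * a_i^2 times the conditional score variance
        p = gpcm_probs(theta, a, b, d, D)
        k = np.arange(len(p))
        return (D * a)**2 * (np.sum(k**2 * p) - np.sum(k * p)**2)

    # Hypothetical four-category item:
    print(gpcm_item_information(0.0, a=0.8, b=0.2, d=[0.7, 0.0, -0.7]))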

A Word on Software
Below is a list of commonly used software for estimating polytomous IRT model parameters, all of which the authors have used. Some of the older products listed now have GUIs. Most products give very similar item and person parameter estimates where they estimate parameters for the same models. However, different software products typically estimate parameters for different models. Most products also provide fit statistics that differ from one product to another. None yet implement the kinds of fit analyses that Glas discusses in his chapter. Table 1.1 indicates the specific polytomous IRT models that are estimated by a particular program.

Parscale. Muraki, E., and Bock, R. D. (2003). Version 4. Scientific Software International, 7383 North Lincoln Avenue, Suite 100, Chicago IL, 60646. http://www.ssicentral.com
Multilog. Thissen, D. (2003). Version 7. Scientific Software International, 7383 North Lincoln Avenue, Suite 100, Chicago IL, 60646. http://www.ssicentral.com

Table 1.1 Polytomous Models and Software Programs (X Indicates Models That an Estimation Program Can Fit)

                              Estimation Procedure
Models    Parscale  Multilog  Rumm  WinMira  BigSteps  ConQuest  Quest
GRM          X         X
RS-GRM       X
GPCM         X         X
PCM          X         X       X       X        X         X        X
RSM          X                 X       X        X         X        X
SIM                                    X
DSLM                           X
DLM                            X       X


Rumm2020. Andrich, D., Lyne, A., Sheridan, B., and Luo, G. (2003). Windows version. Rumm Laboratory, 14 Dodonaea Court, Duncraig 6023, Western Australia, Australia. http://www.rummlab.com.au
WinMira2001. von Davier, M. (2000). Version 1.36 for Windows. http://winmira.von-davier.de
WinSteps. Linacre, J. M., and Wright, B. D. (2009). Version 3.68.1. MESA Press, 5835 South Kimbark Avenue, Chicago IL, 60637. http://www.winsteps.com
ACER ConQuest. Wu, M. L., Adams, R. J., and Wilson, M. R. (2000). Build date August 22, for DOS and Windows. The Australian Council for Educational Research, 19 Prospect Hill Road, Camberwell, Melbourne, Victoria, 3124, Australia. mailto:quest@acer.edu.au
Quest. Adams, R. J., and Khoo, S.-T. (1996). Version 2.1 for PowerPC Macintosh. The Australian Council for Educational Research, 19 Prospect Hill Road, Camberwell, Melbourne, Victoria, 3124, Australia. mailto:quest@acer.edu.au

Conclusion
This chapter introduces the subsequent chapters, highlighting their contri-
butions and discussing issues in polytomous IRT that cut across different
models. It organizes the handbooks content by describing what individual
chapters do and where they fit in relation to other chapters. This chapter also
explains how the editors have attempted to integrate content across chapters
through editor note sidebars.
The goal of this chapter is to make the handbook, and by extension polyto-
mous IRT generally, more accessible and useful to readers, emphasizing its value
and making it easier for readers to unlock that value. In addition to its organiz-
ing role, the chapter helps readers to consider how they might use polytomous
IRT most effectively, in part, by providing an overarching reference guide to
polytomous IRT models in a common notation. The information functions that
are included for each model in the reference guide provide an important practi-
cal tool for designing and evaluating tests and items. Access to polytomous IRT
is also improved by the inclusion of a brief section on the software that is avail-
able to implement the models in research or applied measurement settings.

References
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole.
Anastasi, A. (1988). Psychological testing (6th ed.). Upper Saddle River, NJ: Prentice Hall.
Andrich, D. (1982). An extension of the Rasch model for ratings providing both location and dispersion parameters. Psychometrika, 47, 105-113.
Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York: Marcel Dekker.

Chang, H.-H., & Mazzeo, J. (1994). The unique correspondence of the item response function and item category response functions in polytomously scored item response models. Psychometrika, 59, 391-404.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth, TX: Harcourt Brace Jovanovich.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York: Harper Collins.
Dodd, B. G., & Koch, W. R. (1994). Item and scale information functions for the successive intervals Rasch model. Educational and Psychological Measurement, 54, 873-885.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.


Hibbard, J. H., Mahoney, E. R., Stockard, J., & Tusler, M. (2005). Development and testing of a short form of the patient activation measure. Health Services Research, 40, 1918-1930.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Mellenbergh, G. J. (1995). Conceptual notes on models for discrete polytomous item responses. Applied Psychological Measurement, 19, 91-100.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.
Muraki, E. (1993). Information functions of the generalized partial credit model. Applied Psychological Measurement, 17, 351-363.
Muraki, E., & Bock, R. D. (1999). PARSCALE: IRT item analysis and test scoring for rating-scale data, Version 3.5. Chicago: Scientific Software International.
Ostini, R. (2001). Identifying substantive measurement differences among a variety
of polytomous IRT models (Doctoral dissertation, University of Minnesota,
2001). Dissertation Abstracts International, 62-09, Section B, 4267.
Rost, J. (1988). Measuring attitudes with a threshold model drawing on a traditional scaling concept. Applied Psychological Measurement, 12, 397-409.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, Monograph Supplement 17.
Samejima, F. (1977). A method of estimating item characteristic functions using the maximum likelihood estimate of ability. Psychometrika, 42, 163-191.
Samejima, F. (1988). Comprehensive latent trait theory. Behaviormetrika, 24, 1-24.
Samejima, F. (1996). Evaluation of mathematical models for ordered polychotomous responses. Behaviormetrika, 23, 17-35.
Samejima, F. (1998). Efficient nonparametric approaches for estimating the operating characteristics of discrete item responses. Psychometrika, 63, 111-130.
von Davier, M. (2000). WinMira2001 user manual: Version 1.36 for Windows. Author.

References in Embedded Editor Notes


Andrich, D. (1978). A rating formulation for ordered response categories.
Psychometrika, 43, 561–573.
Andrich, D. (1988). A general form of Rasch's extended logistic model for partial
credit scoring. Applied Measurement in Education, 1, 363–378.

Masters, G. N. (1988). Measurement models for ordered response categories. In
R. Langeheine & J. Rost (Eds.), Latent traits and latent class models (pp. 11–29).
New York: Plenum Press.
Mellenbergh, G. J. (1995). Conceptual notes on models for discrete polytomous item
responses. Applied Psychological Measurement, 19, 91100.
Molenaar, I. W. (1983). Item steps (Report HB-83-630-EX). Heymans Bulletins
Psychological Institute, University of Groningen.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algo-
rithm. Applied Psychological Measurement, 16, 159176.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology.
In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and
Probability (pp. 321–334). Berkeley: University of California Press.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models.
Psychometrika, 51, 567577.



Chapter 2
IRT Models for the Analysis
of Polytomously Scored Data

Brief and Selected History
of Model Building Advances

Ronald K. Hambleton
University of Massachusetts at Amherst

Wim J. van der Linden
CTB/McGraw-Hill

Craig S. Wells
University of Massachusetts at Amherst

Introduction

Editor Introduction: This chapter places the models developed in later chapters into a common
historical context. The authors review dichotomous IRT models and lay an important foundation for
the concept of information, a key concept in IRT. They also discuss nonparametric IRT and provide an
introduction to the issue of parameter estimation. This provides an excellent starting point from which
to delve deeper into the specific models, issues surrounding these models, and uses of the models that
are provided in later chapters.

The publication of several papers by Lord (1952, 1953a, 1953b) marked


the beginning of the transition from classical to modern test theory and
practices, a response, in part, to Gulliksen's (1950) challenge to develop
invariant item statistics for test development. Modern test theory is charac-
terized by strong modeling of the data, and modeling examinee responses
or scores at the item level. Item response theory, originally called latent trait
theory and later item characteristic curve theory, is about linking examinee
item responses or scores via item response functions to the latent trait or
traits measured by the test. According to Baker (1965), Tucker (1946) may
have been the first to use the term item characteristic curve, and Lazarsfeld
(1950) appears to have been the first to use the term latent trait (Lord, per-
sonal communication, 1977). In his 1980 monograph, Lord coined the terms
item response function (IRF) and ability as alternatives for item characteristic
curve and trait.
Progress in model building and model parameter estimation was slow
initially, almost certainly because of the mathematical complexity of the
modeling and the computational demands required for model parameter
estimation. There was also considerable skepticism among some researchers
about any measurement advantages that might accrue from the modeling.
This skepticism remained well into the 1970s. The skeptics were particularly
concerned about the strong model assumptions that needed to be made (e.g.,
unidimensionality), and secondarily the computational challenges that item
response theory (IRT) modeling posed. But the promise of these IRT mod-
els was great (e.g., model parameter invariance; possibility to deal with miss-
ing data designs, for instance, in equating studies or adaptive testing) and the
research continued at an exponential rate in the 1970s and 1980s.
The real turning point in the transition process came with the publication
of Statistical Theories of Mental Test Scores by Lord and Novick (1968). The
transition was helped along considerably by Rasch (1960) and work in the
late 1950s by Birnbaum (1957, 1958).
Today, there is widespread use of IRT models with both binary and poly-
tomously scored data, although the latter is less well developed and under-
stood. Hence the need for this book and others like it (e.g., van der Linden &
Hambleton, 1997). Until the mid-1980s, test developers, psychologists, and
researchers with polytomous response data tended to do classical scaling, as
outlined by Thurstone (1925). This meant assuming a normal distribution,
certainly unrealistic in many applications. Some researchers simply dichoto-
mized their data, even though one of the consequences may be loss of fit
of the model (Jansen & Roskam, 1983; Roskam & Jansen, 1986). Lack of
software limited IRT applications using polytomous response data, though
Andrich (1978) and later Thissen (1981) moved things forward with software
for selected polytomous response IRT models.
Interestingly, although Lord and Novick (1968) provided a formulation
of a general theory of multidimensional ability spaces, there had been no
development up to that time with specific models for the analysis of poly-
tomous response data. Perhaps this was because both authors were working
in the educational testing area, where in 1968, binary data were much more
common.
The purposes of this chapter will be (1) to introduce many of the polyto-
mous response IRT models that are available today, including several that are
the focus of this book, (2) to provide background for the motivations of model
developers, and (3) to highlight similarities and differences among the models,
and challenges that still remain to be addressed for successful applications.
It is interesting to note that polytomous response IRT models were intro-
duced long before they found any use in education and psychology, albeit
without the necessary software to implement the models. We will begin by
introducing Samejima's (1969, 1972) work in the late 1960s with the graded
response model, the free response model, and multidimensional models too.
Her work was followed by Bock (1972) and the nominal response model;
here, the multiple score categories did not even need to be ordered. Later,
advances came from Andrich (1978, 1988) with the rating scale model, and
this model did receive some fairly quick use, in part because he made soft-
ware available. Other models to follow Samejima's pioneering work included
the partial credit model and variations (e.g., Andersen, 1973; Masters &
Wright, 1984; Tutz, 1997; Verhelst, Glas, & de Vries, 1997), the generalized
partial credit model (e.g., Muraki, 1992), as well as models by Embretson,
Fischer, McDonald, and others for applying multidimensional IRT models
to polytomous response data. By 1997, when van der Linden and Hambleton
published their edited book Handbook of Modern Item Response Theory (1997),
they reported the existence of over 100 IRT models and organized them
into six categories: (1) models for items with polytomous response formats,
(2) nonparametric models, (3) models for response time or multiple attempts
on items, (4) models for multiple abilities or cognitive components, (5) models
for nonmonotone items, and (6) models with special assumptions about the
response models. Only models in the first two categories will be described in
this chapter. Readers are referred to van der Linden and Hambleton (1997)
for details on models in all six categories.

Development of IRT Models to Fit Polytomously Scored Data


Some of the seminal contributions to the topic of IRT model development
are highlighted in Figure 2.1.
We will begin our selected history of IRT model development by focus-
ing first on those models that were developed to handle binary-scored data
such as multiple-choice items and short-answer items with achievement
and aptitude tests, and true-false, yes-no type items with personality tests.
These developments laid the foundation for those that followed for polyto-
mous response data, as the models are based on the same assumptions of
unidimensionality and statistical independence of item responses, and model
parameters are similar in their purpose and interpretation. Parameter esti-
mation methods were simply extended to handle the extra model parameters
with the polytomous response data.

IRT Models to Fit Binary-Scored Data


Year: Model and the Developer

1952: one- and two-parameter normal-ogive models (Lord)
1957–1958: two- and three-parameter logistic models (Birnbaum)
1960: one-parameter logistic model (Rasch)
1961: Rasch rating scale model (Rasch)
1967: normal-ogive multidimensional model (McDonald)
1969: two-parameter normal ogive and logistic graded response model (Samejima)
1969: multidimensional model (Samejima)
1972: continuous (free) response model (Samejima)
1972: nominal response model (Bock)
1973: Rasch rating scale model (Andersen)
1976: linear logistic Rasch model (Fischer)
1978: Rasch rating scale model (Andrich's full development was carried out
independently of the work of Andersen)
1980: multi-component response models (Embretson)
1981: four-parameter logistic model (Barton & Lord)
1982: partial credit model (Masters)
1985: linear logistic multidimensional model (Reckase)
1988: unfolding model (Andrich)
1990: sequential step model (Tutz)
1991: nonparametric models (Ramsay)
1992: generalized partial credit model (Muraki)
1992: full information item bifactor model (Gibbons & Hedeker)
1993: steps model (Verhelst & Glas)

Figure 2.1 The most popular of the unidimensional and multidimensional models for analyzing
binary-scored and polytomously scored data.

A primary reason for IRT's attractive features is that explicit, falsifiable
models are used in developing a scale on which test items and examinees
are placed (Baker, 1965; Hambleton, Swaminathan, & Rogers, 1991; van
der Linden & Hambleton, 1997). All IRT models define the probability of
a positive response as a mathematical function of item properties, such as
difficulty, and examinee properties, such as ability level. For example, one of
the popular models used with dichotomous item response data is the three-
parameter logistic model (3PLM), expressed as

P(u_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_j - b_i)]}{1 + \exp[a_i(\theta_j - b_i)]} \qquad (2.1)

Here P(uij = 1 | θj) indicates the probability of a correct response for exam-
inee j. Hereafter, P(uij = 1 | θj) will be written as P(θ). The ability param-
eter for person j is denoted θj and, although theoretically unbounded, ranges
from −3.0 to 3.0 for a typical population with ability estimates scaled to a
mean of zero and a standard deviation of 1.0, where larger positive values
indicate higher ability. The lower asymptote is denoted ci, also known as the
guessing parameter; in other words, the c parameter indicates the probability
of positively endorsing an item for examinees with very low ability levels.
The difficulty parameter for item i is denoted bi. The b parameter is on the
same scale as θ and is defined as the θ-value where P(θ) is halfway between
the c parameter value and 1.0 (i.e., the θ-value associated with P(θ) = (1 + c)/2).
The values for the b parameter also typically range from −3.0 to 3.0, where
larger positive values indicate harder items and larger negative values indicate
easier items. The discrimination of item i is denoted ai and is proportional to
the slope of the IRF at θ = b (see Figure 2.2). For good items, the a param-
eter typically ranges from 0.40 to 2.50. Some testing programs use a ver-
sion of the model with a scaling constant D = 1.7, which was introduced by
Birnbaum in 1957 to eliminate scale differences between the item parameters
in the two-parameter normal ogive and logistic models. Figure 2.2 graphi-
cally illustrates IRFs for three items based on the 3PLM.

Figure 2.2 Illustration of three ICCs from the 3PLM (Item 1: a = 1.7, b = −0.8, c = 0.15;
Item 2: a = 0.8, b = 0.2, c = 0.08; Item 3: a = 1.45, b = 1.1, c = 0.26).
The three IRFs shown in Figure 2.2 follow the 3PLM and differ with
respect to difficulty, discrimination, and the lower asymptote. The y axis
represents the probability of a correct response, P(θ), while the x axis rep-
resents the ability (θ) scale, sometimes called the proficiency scale. There are
several important features of the 3PLM IRFs shown in Figure 2.2. First, the
IRFs are monotonically increasing in that the probabilities increase as the
ability levels increase for each item. Second, the IRFs are located throughout
the ability distribution, indicating that the items differ in their difficulty. For
example, Item 1 is the easiest, Item 2 is moderately difficult, and Item 3 is
the hardest (b1 = −0.8, b2 = 0.2, and b3 = 1.1). Third, the inflection point of
the respective IRF is located at the b parameter value. Fourth, the a param-
eter value is proportional to the slope of the IRF at the b parameter value. In
addition, the IRF has maximum discriminatory power for examinees whose
θ-value is near the b parameter value. For example, Item 1 has maximum
discrimination for examinees with θ-values around b = −0.8. And fifth, the
items differ with respect to the c parameter value as indicated by the disparate
lower asymptotes. Interestingly, although Item 3 is generally more difficult
for most examinees throughout the scale, lower ability examinees have a
higher chance of answering the item correctly compared to Items 1 and 2.
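To make Equation 2.1 concrete, here is a minimal Python sketch of the 3PLM IRF. The function name and the small printing loop are ours, purely for illustration; the parameter values are those of Item 1 in Figure 2.2.

```python
import math

def p_3pl(theta, a, b, c, D=1.0):
    """Probability of a correct response under the 3PLM of Equation 2.1.

    Setting c = 0 gives the 2PLM; additionally setting a = 1 gives the
    1PLM/Rasch form. Passing D = 1.7 applies the scaling constant that
    some programs use.
    """
    z = D * a * (theta - b)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

# Item 1 from Figure 2.2: a = 1.7, b = -0.8, c = 0.15
for theta in (-3.0, -0.8, 0.0, 3.0):
    print(theta, round(p_3pl(theta, a=1.7, b=-0.8, c=0.15), 3))
```

At θ = b = −0.8 the sketch returns 0.575, which is exactly (1 + c)/2, consistent with the definition of the b parameter given above.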
When the c parameter is set equal to zero for an item, Equation 2.1 sim-
plifies to the two-parameter logistic model (2PLM) introduced by Birnbaum
(1957, 1958, 1968). This model is more mathematically tractable than the
two-parameter normal ogive model introduced by Lord (1952). Since the
lower asymptote is fixed at zero, the 2PLM implies that guessing is absent, or
at least, negligible for all practical purposes. A dichotomously scored, short-
answer item is an example of an item in which the 2PLM is commonly used.
Because c = 0, the b parameter is now the point on the ability scale at which
the probability of a correct response is 0.5.


Constraining the a parameter to be equal to one across all items
in Equation 2.1, as well as fixing c = 0, produces a third IRT model for
dichotomous data known as the one-parameter logistic model (1PLM) or
the Rasch model. Neither Lord nor Birnbaum showed any interest in a
one-parameter normal ogive or logistic model because of their belief that
multiple-choice test items needed at least two parameters in the model to
adequately account for actual item response data: one to account for item
difficulty and the other to account for item discriminating power. Lord was
sufficiently concerned about guessing (and omitted responses) on multiple-
choice items (as well as computational demands) that he discontinued his
dissertation research (1952, 1953a, 1953b) and pursued research with true
score theory instead, until about 1965 (Lord, 1965). He then became con-
vinced that computer power was going to be sufficient to allow him to work
with a three-parameter model with the third item parameter in the model
to account for the nonzero item performance of low-performing candidates,
even on hard test items.
Although the 1PLM produces some attractive statistical features (e.g.,
exponential family, simple sufficient statistics) when the model actually fits
test data, the advantages of the 1PLM come at the expense of assuming
the items are all equally discriminating (not to mention free from examinee
guessing behavior). Although these assumptions may hold for psychological
tests with narrowly defined constructs, they are generally problematic in edu-
cational testing. Therefore, the equal discrimination assumption is usually
checked closely prior to implementing the 1PLM.
Rasch (1960), on the other hand, developed his one-parameter psychomet-
ric model from a totally different perspective than either Lord or Birnbaum.
He began with the notion that the odds for an examinees success on an item
depended on the product of two factors: item easiness and examinee ability.
Obviously, the easier the item and the more capable the examinee, the higher
the odds for a successful response and correspondingly, the higher the prob-
ability for the examinees success on the item. From the definition of odds
for success, P/(1 − P), and setting the odds equal to the product of the model
parameters for item easiness and examinee ability, it was easy for Rasch to
produce a probability model similar to Equation 2.1 with c = 0.0 and a = 1.0,
though in his development, the concepts of item discrimination and guessing
were never considered. At the same time, failure to consider them allows
them to become possible sources for model misfit. Rasch certainly would
not have worried about guessing in his own work since he developed his
model for a longitudinal study with intelligence testing. Also, in 1960, the
multiple-choice item was not in use in Denmark.
The three logistic models for analyzing binary data are valuable, and are
receiving extensive use in testing practices, but there are several item types
or formats in education and psychology that are scored polytomously. For
example, many statewide assessments use constructed-response item formats
as part of the assessment in which a scoring rubric is implemented to pro-
vide partial credit. In addition, Likert type items that provide responses in a
graded fashion are commonly used, especially in surveys, questionnaires, and
attitudinal inventories. The goal of the IRT models for polytomous data is to
describe the probability that an individual responds to a particular category
given her or his level of ability and the item properties.
One of the most popular IRT models to address polytomous data, devel-
oped by Samejima (1969), is a simple yet elegant extension of the 2PLM
and is referred to as Samejima's graded response model (GRM). Samejima
was clearly motivated in her work by the fact that all of the modeling up to
1969 was applicable only to binary-scored data. These models were excellent
at the time for handling educational testing data where most of the IRT
model developers were working (Lord, Novick, Rasch, Wright). However,
Samejima was well aware of the use of rating scales (sometimes called ordered
response categories by psychologists) and wanted to extend the applicability
of item response modeling to these types of data. Samejima's GRM is appro-
priate for ordered polytomous item responses such as those used in Likert
type items or constructed-response items.
For the following explanation, we will consider a five-category (i.e., K = 5)
item with scores ranging from 0 to 4 (i.e., k = 0, …, 4). Samejima's work was
just the first of many models that followed, including several more of her
own (see, for example, Samejima, 1997).
The GRM uses a two-step process in order to obtain the probability that
an examinee responds to a particular category. The first step is to model the
probability that an examinee's response falls at or above a particular ordered
category given θ. The probabilities, denoted P*ik(θ), may be expressed as
follows:

P^*_{ik}(\theta) = \frac{\exp[a_i(\theta_j - b_{ik})]}{1 + \exp[a_i(\theta_j - b_{ik})]} \qquad (2.2)

In this equation P*ik(θ) is referred to as the operating or boundary character-
istic function of item i for category k, and indicates the probability of scor-
ing in the kth or higher category on item i (by definition, the probability of
responding in or above the lowest category is P*i0(θ) = 1.0). The ai parameter
refers to the discrimination for item i.


Notational Difference: The formula presented in Equation 2.2 is the same as that presented by
Samejima in Chapter 4 and shown below. Samejima's Equation 4.2 is structured differently, uses
different subscripts, and locates the category subscript (x) before (and at a higher level than) the
item subscript (g). It is nevertheless functionally and algebraically equivalent to Equation 2.2 in this
chapter except for the scaling factor D:

P^*_{x_g}(\theta) = \left\{ 1 + \exp\left[ -D a_g (\theta - b_{x_g}) \right] \right\}^{-1}

Note that each item has the same discrimination parameter across all cat-
egories in Equation 2.2. Samejima referred to Equation 2.2 as the homoge-
neous case of her model. She developed a variation on Equation 2.2 in which
the operating characteristic function for each score category could vary, and
she called this the heterogeneous form of the graded response model, but
this version of her model did not attract much attention.

Terminology Note: It is common to think of the heterogeneous case as the graded response
model with P*ik(θ) that vary in shape, as is suggested here. However, the boundary function P*ik(θ)
actually plays no role in defining heterogeneous models. A fundamental cause of this misunder-
standing is the way Samejima uses the term graded response model. In this chapter, and in com-
mon usage, graded response model refers to Samejima's logistic model in the homogeneous case.
However, Samejima uses graded response model to refer to a framework that covers all possible
polytomous IRT models in both the homogeneous case (including but not limited to the logistic model)
and the heterogeneous case, which itself includes many different specific models (see Chapter 4).
The most common examples of graded response models in the heterogeneous case are polytomous
Rasch models, such as the partial credit model. In such models P*ik(θ) is not (and typically cannot
be) explicitly modeled and can only be obtained empirically, by summing category probabilities. If
this is done, it does indeed transpire that the boundary functions for an item are not parallel. In the
terminology used later in this chapter, heterogeneous models are typically direct models.

In Equation 2.2 bik refers to the ability level at which the probability of
responding at or above the particular category equals 0.5 and is often referred
to as the threshold parameter. (The threshold parameter is analogous to the
item difficulty parameter in the 2PLM.) Since the probability of responding
in the first (i.e., lowest) category or higher is defined to be 1.0, the threshold
parameter for the first category is not estimated. Therefore, although there
are five categories (K = 5), there are only four threshold parameters estimated
(K − 1 = 4). Basically, the item is regarded as a series of K − 1 dichotomous
responses (e.g., 0 vs. 1, 2, 3, 4; 0, 1 vs. 2, 3, 4; 0, 1, 2 vs. 3, 4; and 0, 1, 2, 3
vs. 4); the 2PLM is used to estimate the IRF for each dichotomy with the
added constraint that the slopes are equal within an item.
Once the operating characteristic functions are estimated, the category
response functions, which indicate the probability of responding to a particu-
lar category given θ, are computed by subtracting adjacent P*ik(θ) as follows:

P_{ik}(\theta) = P^*_{ik}(\theta) - P^*_{i(k+1)}(\theta) \qquad (2.3)


By definition, the probability of responding above the highest category is
P^*_{i(K)}(\theta) = 0.0; therefore, the probability of responding in the highest
category is simply equal to the highest operating characteristic function. For the
present example, the category response functions are computed as follows:

P_{i0}(\theta) = 1.0 - P^*_{i1}(\theta)
P_{i1}(\theta) = P^*_{i1}(\theta) - P^*_{i2}(\theta)
P_{i2}(\theta) = P^*_{i2}(\theta) - P^*_{i3}(\theta) \qquad (2.4)
P_{i3}(\theta) = P^*_{i3}(\theta) - P^*_{i4}(\theta)
P_{i4}(\theta) = P^*_{i4}(\theta)

Figures 2.3 and 2.4 illustrate the operating characteristic and category
response functions, respectively, for an item with a = 1.25, b1 = −2.3, b2 = −1.1,
b3 = 0.1, and b4 = 1.15. The figures also highlight some important characteristics
of the GRM. First, the operating characteristic curves are ordered from small-
est to largest based on the threshold parameters (i.e., b1 < b2 < b3 < b4). Second,
the threshold parameters dictate the location of the operating curves. Third,
the slope is the same for each of the curves within an item (the slopes are free
to vary across items, however). Fourth, for any given value of θ, the sum of the
category response probabilities equals 1.0. Fifth, the first response category
curve is always monotonically decreasing while the last category curve is always
monotonically increasing. The middle categories will always be unimodal with
the peak located at the midpoint of the adjacent threshold categories.

Figure 2.3 Operating characteristic curves for a five-category GRM item with a = 1.25, b1 = −2.3,
b2 = −1.1, b3 = 0.1, and b4 = 1.15.

Figure 2.4 Category response curves for a five-category GRM item with a = 1.25, b1 = −2.3,
b2 = −1.1, b3 = 0.1, and b4 = 1.15.
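The two-step logic of Equations 2.2 through 2.4 is compact enough to sketch in a few lines of Python. The function names are ours, and the parameters are those of the item in Figures 2.3 and 2.4; the scaling constant D is omitted, as in Equation 2.2.

```python
import math

def grm_boundaries(theta, a, bs):
    """Boundary curves P*ik(theta) of Equation 2.2, one per threshold in bs."""
    return [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs]

def grm_category_probs(theta, a, bs):
    """Category response functions via the differences of Equations 2.3-2.4."""
    p_star = [1.0] + grm_boundaries(theta, a, bs) + [0.0]
    return [p_star[k] - p_star[k + 1] for k in range(len(bs) + 1)]

# The five-category item of Figures 2.3 and 2.4
probs = grm_category_probs(0.0, a=1.25, bs=[-2.3, -1.1, 0.1, 1.15])
print([round(p, 3) for p in probs])  # the K probabilities sum to 1.0
```

Because the boundary curves share one a parameter and ordered thresholds, the differences in Equation 2.3 are guaranteed to be nonnegative, so the subtraction always yields proper category probabilities.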
We will skip ahead a bit to consider a second commonly used IRT model
for polytomous data where the data are scored in terms of the number of steps
completed in solving a problem (e.g., constructed-response math items): the
generalized partial credit model (GPCM; Muraki, 1992, 1993). The model
was proposed independently by Yen at the same time, who referred to it as
the two-parameter partial credit (2PPC) model and used it in a number of
applications to show improved fit relative to the original partial credit model
(Fitzpatrick & Yen, 1995; Fitzpatrick et al., 1996).
In contrast to Samejima's GRM, which is considered an indirect model due to
the two-step process of obtaining the category response functions, the GPCM is
referred to as a direct IRT model because it models the probability of responding
to a particular category directly as a function of θ. As the model expression for the
GPCM is an exponential divided by a sum of exponentials, it is also classified as
a divide-by-total model for the aforementioned reason, while Samejima's GRM
is considered a difference model because the category probabilities are based
on the difference between the response functions (Thissen & Steinberg, 1986).
Muraki was motivated by earlier work from Masters (1982), but Masters did not
include a discrimination parameter in his model. This model by Muraki was
extensively used at ETS with the National Assessment of Educational Progress
(NAEP). With NAEP data, historically, items have varied substantially in their
difficulty and discriminating power, and so it was long felt that a two-parameter
polytomous response model was needed to fit the data well.
To illustrate the GPCM, we will consider a five-point partial credit item
(i.e., K = 5) ranging from 0 to 4 (i.e., k = 0, …, 4). The category response
functions, Pik(θ), for the GPCM may be expressed as follows:

P_{ik}(\theta_j) = \frac{\exp\left[\sum_{v=0}^{k} a_i(\theta_j - b_{iv})\right]}{\sum_{h=0}^{K-1} \exp\left[\sum_{v=0}^{h} a_i(\theta_j - b_{iv})\right]} \qquad (2.5)

where the term for v = 0 is defined to be zero, that is, \sum_{v=0}^{0} a_i(\theta_j - b_{iv}) \equiv 0.


Figure 2.5 Category response curves for a five-category GPCM item with a = 0.99, b1 = −1.9,
b2 = 0.2, b3 = −0.5, and b4 = 1.1.

Notational Difference: The GPCM is also presented in Chapter 3 using the following notation:

T(k) = \frac{\exp\left[\sum_{j=0}^{k} 1.7\,a_i(\theta - b_i + d_j)\right]}{\sum_{i=0}^{m-1} \exp\left[\sum_{j=0}^{i} 1.7\,a_i(\theta - b_i + d_j)\right]}

Note that in Chapter 3 the overall location parameter (bi) and the threshold parameters (dj) are
distinct, whereas in Equation 2.5 they are folded together (biv).

In the GPCM, the biv are referred to as step difficulty parameters. The
step difficulty parameters may be interpreted as representing the difficulty
in reaching step k given that the examinee has reached the previous step
(i.e., k − 1). As a result, the biv are not necessarily ordered from smallest to larg-
est, in contrast to Samejima's GRM. As an example, Figure 2.5 illustrates a
GPCM item in which the step difficulties are not ordered (a = 0.99, b1 = −1.9,
b2 = 0.2, b3 = −0.5, and b4 = 1.1).

Terminology Note: In this chapter the authors use the term step to describe the boundary param-
eters in the generalized partial credit model. This term is also used later in this chapter and in
Chapter 5, when Masters discusses the partial credit model. Although originally intended to refer to
the process of modeling sequential steps toward arriving at a response category, it has been shown
that neither model actually operates in this way mathematically (Masters, 1988; Molenaar, 1983).
Interpretation of the step difficulty parameters is also complex since each step is modeled in the
context of an entire item. A positive consequence of this is that these two models are not restricted to
constructed-response item data and can legitimately be used with any polytomous data, including
responses to rating scale items. The steps terminology has, however, proved enduring.

As seen from Figure 2.5, the step difficulty parameters represent the value
on the θ scale at which two consecutive category response curves intersect
(e.g., the curves for category 0 and 1 intersect at θ = −1.9). The relative order
of the step difficulty parameters (i.e., intersections) indicates that going from
0 to 1 (Step 1) or from 2 to 3 (Step 3) is relatively easy for examinees,
while going from 1 to 2 (Step 2) is moderately difficult, and going from 3 to
4 (Step 4) is the most difficult. Furthermore, the effect of the reversal (i.e.,
Step 2 being more difficult than Step 3) can also be seen in the lower prob-
ability of receiving a score of 2 relative to the other categories.
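The divide-by-total structure of Equation 2.5 can likewise be sketched directly in Python. The function name is ours, and the step difficulties are those of the Figure 2.5 item.

```python
import math

def gpcm_category_probs(theta, a, bs):
    """Category response functions for the GPCM (Equation 2.5).

    bs holds the K - 1 step difficulties b_i1, ..., b_i(K-1); the v = 0
    term is taken as zero, so the numerator for category 0 is exp(0) = 1.
    """
    cumulative = [0.0]  # running sums of a * (theta - b_iv)
    for b in bs:
        cumulative.append(cumulative[-1] + a * (theta - b))
    numerators = [math.exp(s) for s in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]

# The item of Figure 2.5, with unordered step difficulties
probs = gpcm_category_probs(0.0, a=0.99, bs=[-1.9, 0.2, -0.5, 1.1])
print([round(p, 3) for p in probs])  # category 2 dips below its neighbors
```

Evaluating the sketch across a range of θ values reproduces the intersection property just described: adjacent category curves cross exactly where θ equals the corresponding step difficulty.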
If the a parameter is constrained to be equal across items, the GPCM
expressed in Equation 2.5 simplifies to the partial credit model (PCM;
Masters, 1982; Masters & Wright, 2004). But the partial credit model was
conceived of prior to the generalized partial credit model and by a very
different line of reasoning. The partial credit model is an extension of the
1PLM and shares similar attractive statistical properties, such as belong-
ing to an exponential family and simple sufficient statistics. However, the
appealing properties of the PCM (and 1PLM/Rasch) can only be realized
if the model fits the data. Therefore, the equal item discrimination assump-
tion should be tested empirically before the model is implemented (see, for
example, Hambleton & Han, 2005).
Some researchers have been troubled by the fact that in the partial credit
model (and the generalized partial credit model) occasional reversals are seen
in the step difficulty parameters. Others have argued that the reversals can
be interpreted and have chosen not to worry about them (see, for example,
Masters, 1982).

Editor Note: In Chapter 6 Andrich discusses the potential pitfalls of step reversals and the advan-
tages of being able to model such reversals when they occur in data.

The rating scale model, which has been manifested in a number of ever-
increasing flexible forms, can be traced back to the original work of Rasch
(1961) and later his doctoral student, Erling Andersen. Then the model
became best known in a series of studies by Andrich beginning in 1978
and extended yet again, this time by Muraki in the 1990s with his general-
ized rating scale model. In the simplest version of this polytomous response
model, the thresholds for the score categories differ from each other by an
amount that is held constant across items. In addition, there is a shift in these
thresholds from one item to the next because the items, in principle, differ
in their difficulties. Most of the work associated with the rating scale model
has been done under the assumption that all test items are equally discrimi-
nating (see Andrich (1988) or Engelhard (2005) for excellent reviews). The
exception to this work is the generalized rating scale model introduced by
Muraki.
Wright and Masters (1982, chap. 3) provide a very useful comparison of all
of the models (rating scale, partial credit, and graded response) by relating
these models to the partial credit model and noting their similarities and
differences and deriving one model from another by placing constraints on
the models or making additional assumptions. A similar comparison, which
includes the generalized partial credit model, is made in Engelhard (2005).
A main difference is that several of the models allow test items to vary in
their discriminating powers (GRM and the GPCM include this model
parameter).
An interesting comparison between polytomous models is based on Agresti's
(1990) classification of response processes (see Mellenbergh, 1995). In adja-
cent-category models, the examinee is assumed to make his or her response
based on a comparison of adjacent categories, and the model represents these
comparisons. This type of process underlies the partial credit models. In the
graded response model, the basic process is a comparison between the cumu-
lative categories in Equation 2.4, and models of this nature are appropriately
called cumulative probability models. The final type of model is the continua-
tion ratio model, which assumes that the examinee basically chooses between
a current category and proceeding with one of the higher-order categories.
The belief that the last response process is a more adequate description of the
examinee's behavior on polytomous items was Verhelst, Glas, and de Vries's
(1997) and Tutz's (1997) reason for modifying the PCM.

Nonparametric IRT Models


The previously described models are referred to as parametric because they
require the IRF to follow a specific parametric expression. For example, when
using the 2PLM, the underlying IRF must follow the logistic expression in
Equation 2.2 with parameters ai and bi. However, for some items on an
educational or psychological assessment, the response data may not conform
to such an expression, and these items would be discarded from a test. For
such items, there is a class of nonparametric models based on
ordinal assumptions only that may be used to determine the IRF.
While there are several nonparametric methods for modeling the IRF,
one of the more popular methods, developed by Ramsay (1991), is kernel
regression. For another line of nonparametric modeling of test data, see the
work of Molenaar (1997) and Sijtsma and Molenaar (2002) for up-to-date
reviews. The essential principle underlying this approach is the replacement
of the regression function of the responses on some independent ability score
by a function obtained through a smoothing operation. The smoothing oper-
ation is based on local weighted averaging of the responses with a so-called
kernel as the weight function. More specifically, in order to obtain kernel-
smoothed estimates of the response function of item i, the following steps
are implemented:

1. Select Q evaluation points along the proposed ability scale, denoted as
xq (e.g., x1 = −3.00, x2 = −2.88, x3 = −2.76, …, x49 = 2.76, x50 = 2.88, x51 =
3.00, where Q = 51).
2. Obtain an ability score that is independent of item i and transform it to
have this scale. Usually, the rest score is taken (i.e., the total number-correct
score that excludes item i). Let Xj denote the score for examinee j.


3. Estimate the IRF using the following formula:

\hat{P}_i(x_q) = \sum_{j=1}^{N} w_{jq} u_{ij} \qquad (2.6)

where N indicates the number of examinees, uij is the response by
examinee j to item i (i.e., 0 or 1), and wjq represents the weight assigned
to examinee j at evaluation point xq. The weight for examinee j at evalu-
ation point q (i.e., xq) is calculated as follows:

w_{jq} = \frac{K\left(\frac{X_j - x_q}{h}\right)}{\sum_{j=1}^{N} K\left(\frac{X_j - x_q}{h}\right)} \qquad (2.7)

where xq refers to evaluation point q, and Xj is the adopted ability score
for examinee j.

Two important components of the formula are bandwidth parameter h
and kernel function K. The bandwidth parameter controls the amount of bias
and variation in the IRF. As h decreases, the amount of bias is reduced but
the variation is increased (i.e., less smoothness). The opposite occurs when
h is increased. As a rule of thumb, h is often set equal to 1.1N^(−0.2) (Ramsay,
1991) so as to produce a smoothed function with acceptable bias. Kernel
function K is chosen to be nonnegative and to approach zero as Xj moves
away from an evaluation point, xq. Two commonly used kernel functions
are the Gaussian [K(y) = exp(−y²)] and uniform kernels [K(y) = 1 if |y| ≤ 1,
else 0]. Given the previous information, it is apparent that the further an
examinee's ability score Xj is away from evaluation point xq, the less weight
that examinee has in determining P̂i(xq). For example, the Gaussian kernel
has this feature; its value is largest at y = 0, and it decreases monotonically
with the distance from this point.
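The kernel-smoothing steps above translate almost directly into code. The sketch below is ours, assuming Equations 2.6 and 2.7 with a Gaussian kernel and invented toy data; it is only an illustration, not TESTGRAF.

```python
import math

def kernel_smoothed_irf(responses, rest_scores, eval_points, h):
    """Kernel-regression IRF estimate (Equations 2.6 and 2.7).

    responses[j] is examinee j's 0/1 answer to the item; rest_scores[j] is
    an ability score computed without that item, already on the theta
    scale. A Gaussian kernel K(y) = exp(-y**2) supplies the weights.
    """
    curve = []
    for x_q in eval_points:
        weights = [math.exp(-((x_j - x_q) / h) ** 2) for x_j in rest_scores]
        total = sum(weights)  # normalizing constant of Equation 2.7
        curve.append(sum(w * u for w, u in zip(weights, responses)) / total)
    return curve

# Toy data for five examinees; h from the rule of thumb 1.1 * N ** -0.2
rest_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
responses = [0, 0, 1, 1, 1]
h = 1.1 * len(responses) ** -0.2
print([round(p, 2) for p in
       kernel_smoothed_irf(responses, rest_scores, [-1.0, 0.0, 1.0], h)])
```

Because each evaluation point is a weighted average of nearby responses only, the resulting curve can take any shape the data suggest, which is exactly the flexibility illustrated in Figure 2.6.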
The computer program TESTGRAF (Ramsay, 1992) is often used to
perform kernel smoothing on test data. Figure 2.6 illustrates a nonpara-
metrically estimated IRF using TESTGRAF for an item from a large-
scale assessment. The psychological meaning for the curve is not clear, but
with parametric models, the unusual shape of the curve would not even
be known.

Figure 2.6 A nonparametric ICC estimated using TESTGRAF for an item on a large-scale
assessment.
It is apparent from Figure 2.6 that the advantage of estimating the IRF
nonparametrically is that because it is not constrained to follow a monotonic
parametric shape, it provides a deeper understanding of its ordinal features.
For example, the 3PLM would never be able to show the local decrease in
response probability just above ability equal to 2 in Figure 2.6. A statistical
disadvantage of the nonparametric approach, however, is the large number of
parameters per item that has to be estimated, and therefore the large amount
of response data required for its application. For example, for the rest score,
the number of item parameters in a nonparametric approach is n − 1 (namely,
one response probability for each possible score minus 1), whereas the 3PLM
requires estimation of only three parameters per item. Further, if all items
on a test are modeled nonparametrically, then certain applications within
IRT cannot be performed. For example, the same ability parameter can-
not be estimated from different selection of items, and the nonparametric
approach consequently fails to support applications with structurally incom-
plete designs (e.g., adaptive testing and scale linking). Nevertheless, the non-
parametric approach to modeling the IRF has proven to be a useful tool for
examining the psychometric properties of an instrument. For example, nonparamet-
ric models have been used for item analysis (Ramsay, 1991; Santor, Zuroff,
Ramsay, Cervantes, & Palacios, 1995), testing for differential item function-
ing (Santor, Ramsay, & Zuroff, 1994; Shealy & Stout, 1993), testing possible
local dependencies, and even testing the fit of a parametric model (Douglas
& Cohen, 2001; Wells & Bolt, 2008). Therefore, the use of nonparametric
models in conjunction with parametric models appears to be a productive
strategy for building a meaningful score scale, and we expect to see more
applications of nonparametric models to handle polytomous response data in
the coming years. The software is available, and the few applications so far
appear very promising.

Other IRT Models


Samejima has been prolific over the years in her development of IRT models
for handling polytomous response data. Her creativity and insight placed
her 15 to 20 years ahead of the time when her models would be needed.
Following her work with the graded response model, Samejima (1972) made
the logical jump to the case where continuous response or free response data
replaced categorical data; she extended her work to multidimensional mod-
eling of polytomous response data; and in 1997 she extended her modeling
work again, this time to incorporate some subtleties in the estimation of
ability. Unlike the GRM, the free-response model has received very little
attention.
While the present chapter covers the major IRT models for dichotomous
and polytomous data, it does not provide a description of all available IRT
models. For a thorough description of several other popular IRT models
such as other polytomous models (e.g., nominal categories model, rating
Downloaded by [The University of Edinburgh] at 10:42 26 September 2017

scale model), multidimensional IRT models for both dichotomous and poly-
tomous data, other nonparametric models (e.g., the monotone homogeneity
model), and unfolding models for nonmonotone items, see Fischer (1974),
van der Linden and Hambleton (1997), Ostini and Nering (2006), Andrich
and Luo (1993), Roberts, Donoghue, and Laughlin (2000), and Verhelst and
Verstralen (1993).

Parameter Estimation for IRT Models


Perhaps a few words regarding parameter estimation would be useful. Much
of the work to date has been for models handling dichotomously scored data,
but the maximum likelihood and the Bayesian estimation principles apply
equally well to polytomous models, although the complexity is substantially
increased because of the increase in the number of model parameters.
The utility of IRT depends on accurately estimating the item parameters
and examinee ability. When the ability parameters are known, estimating the
item parameters is straightforward. Similarly, estimating the ability param-
eter is straightforward when the item parameters are known. The challenge,
however, is estimating the parameters when both sets are unknown. Due to
the complicated nature of the equations for IRT parameter estimation, only
a brief description of a few popular methods will follow. See Baker and Kim
(2004) for a detailed description of popular estimation methods.
Joint maximum likelihood estimation (JMLE) and marginal maximum
likelihood estimation (MMLE; Bock & Lieberman, 1970) are two estima-
tion methods well addressed in the literature. Although MMLE has become
the standard of the testing industry and JMLE lacks a basic statistical
requirement, the latter is simpler to describe and is therefore outlined first.
JMLE uses an iterative, two-stage procedure to estimate the item and person
parameters simultaneously. In Stage 1, the item parameters are estimated
assuming the ability parameters are known (simple functions of raw scores
may be used as initial estimates of θ) by maximizing the following likelihood
function for the responses of N examinees on item i:
L(a, b, c; u, \theta) = \prod_{j=1}^{N} \prod_{i=1}^{n} P_{ij}^{u_{ij}} (1 - P_{ij})^{(1 - u_{ij})} \qquad (2.8)


where u represents the matrix with the responses; θ, a, b, and c are the
vectors with the examinee and item parameters, respectively; Pij is the
model-based probability of a correct response; and uij is the response of
examinee j to item i. By taking derivatives and setting them equal to zero,
maximum likelihood estimates for each item parameter are obtained, for
example, via a Newton-Raphson procedure.
In Stage 2, the ability parameters are estimated treating the item param-
eters from Stage 1 as known by maximizing the likelihood function for a
response pattern for examinee j, which by assuming local independence may
be expressed as follows:

L(\theta; u, a, b, c) = \prod_{j=1}^{N} \prod_{i=1}^{n} P_{ij}^{u_{ij}} (1 - P_{ij})^{(1 - u_{ij})} \qquad (2.9)

where n indicates the number of items. Observe that the likelihood in
Equation 2.9 is now treated as a function of the unknown ability parameters
given the response data and all item parameters.
Before proceeding back to Stage 1, the updated ability parameter estimates
are renormed using the restrictions adopted to eliminate the indeterminacy
in the scale (usually, mean estimate equal to zero and standard deviation to
one). The two stages are repeated, using the updated estimates from each
subsequent stage, until a convergence criterion is met (e.g., estimates change
a minimal amount between iterations).
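As a concrete illustration of Stage 2, the Python sketch below maximizes the log of Equation 2.9 for a single examinee with known 3PLM item parameters. To keep the example short it uses a crude grid search in place of the Newton-Raphson step the chapter mentions, and the item parameters and responses are invented (they reuse the Figure 2.2 values).

```python
import math

def log_likelihood(theta, responses, items):
    """Log of the Equation 2.9 terms for one examinee; items = [(a, b, c), ...]."""
    ll = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
        ll += u * math.log(p) + (1 - u) * math.log(1.0 - p)
    return ll

def mle_theta(responses, items):
    """Grid-search stand-in for the Newton-Raphson maximization of Stage 2."""
    grid = [g / 100.0 for g in range(-400, 401)]  # theta from -4.00 to 4.00
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

items = [(1.7, -0.8, 0.15), (0.8, 0.2, 0.08), (1.45, 1.1, 0.26)]
print(mle_theta([1, 1, 0], items))
# For an all-correct or all-incorrect pattern the maximum sits at a grid
# boundary: the likelihood keeps increasing as theta heads to +/- infinity,
# echoing the point below that no MLE exists for such patterns.
```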
Unfortunately, because the model item and person parameters are being
estimated simultaneously in JMLE, the estimates are not consistent. Bock
and Lieberman (1970) developed MMLE to address the disadvantages of
JMLE by integrating the unknown ability parameters out of the likelihood
function so that only the item parameters are left to be estimated. Therefore,
the problem becomes one of maximizing a marginal likelihood function in
which the unknown ability parameters have been removed through integra-
tion. Bock and Aitkin (1981) implemented MMLE using an expectation-
maximization (EM) algorithm.
Once the item parameters have been estimated using MMLE, estimates
of ability can be derived by treating the item parameter estimates as if they
are known item parameters, that is, using the same method outlined in Stage
2 of JMLE. Although for applications with a single standard test without
missing data simple number-correct scores may be appropriate, there are sev-
eral advantages to estimating θs: the IRT-estimated θs are comparable when
items are added to or deleted from the test; they adjust the estimates for the
properties of the individual items (such as their difficulty and discrimination);
they produce more accurate standard errors; they provide better adjustments
for guessing than classical methods; and they are on the same scale as the
difficulty parameters. A primary disadvantage of MLE of θ is that for
examinees who answer all items correctly or incorrectly, no estimate can
be obtained.


An alternative approach for estimating ability is to use the Bayesian
approach, in which prior information is provided about the ability param-
eters in the form of a prior distribution and is incorporated into the likeli-
hood function. The prior distribution is updated by the response data into a
posterior distribution for θ, which is the Bayesian estimate of θ. The mean of
the posterior distribution may be used as a point estimate of θ known as the
expected a posteriori (EAP) estimate. The mode of the posterior distribution
may also be used as a point estimate of ability and is known as the maximum
a posteriori (MAP) estimate (see, e.g., Swaminathan, 2005).
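Here is a minimal sketch of the EAP computation, assuming a standard normal prior, a simple evenly spaced quadrature grid, and the 3PLM of Equation 2.1; the function name and data are ours, with item parameters again borrowed from Figure 2.2.

```python
import math

def eap_theta(responses, items, n_points=61):
    """Posterior mean of theta: prior times likelihood, averaged over a grid."""
    grid = [-3.0 + 6.0 * q / (n_points - 1) for q in range(n_points)]
    posterior = []
    for theta in grid:
        like = 1.0
        for u, (a, b, c) in zip(responses, items):
            p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
            like *= p if u == 1 else 1.0 - p
        # standard normal prior, up to a constant that cancels below
        posterior.append(like * math.exp(-0.5 * theta ** 2))
    total = sum(posterior)
    return sum(t * w for t, w in zip(grid, posterior)) / total

items = [(1.7, -0.8, 0.15), (0.8, 0.2, 0.08), (1.45, 1.1, 0.26)]
print(round(eap_theta([0, 0, 0], items), 2))  # finite even for an all-zero pattern
```

Unlike the MLE sketched earlier, the posterior mean exists for every response pattern, which is one practical reason EAP scoring is widely used.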
The next generation of IRT modeling for both dichotomous and polyto-
mous response data will likely proceed in a Bayesian way with new numerical
procedures for establishing the posterior distribution of model parameters
known as Markov chain Monte Carlo (MCMC) procedures. The common
feature of these procedures is that they explore the posterior distribution by
performing iterative random draws from the posterior distribution of one class
of parameters given the previous draws from those of all other parameters.
Because of its iterative approach, an MCMC procedure is particularly pow-
erful for models with a high-dimensional parameter space. MCMC procedures
also do not require the calculation of the first and second derivatives, which
makes MLE cumbersome for complex models. At the same time, MCMC
procedures can take several hours for a complex model with a larger data set
to produce proper parameter estimates. Still, from a modeling perspective,
this approach allows researchers to be very creative, and currently implementa-
tions of MCMC procedures for advanced IRT models are being intensively
investigated.

Conclusion
Today, though many polytomous IRT models have been developed, only
a handful are receiving frequent use in education and psychology: (1) the
nominal response model, (2) the graded response model, (3) the polytomous
Rasch model, (4) the partial credit model, and (5) the generalized par-
tial credit model. Of the unidimensional models for handling polytomous
response data, these are almost certainly the five most frequently used, and
the ones that will be described in greater detail in subsequent chapters.
One of the biggest challenges facing applications of polytomous item
response models has been the shortage of user-friendly software. Several
software packages are available for parameter estimation: BILOG-MG
(www.ssicentral.com) can be used with the one-, two-, and three-param-
eter logistic models; PARSCALE (www.ssicentral.com) with the graded
response model and the generalized partial credit model; MULTILOG
(www.ssicentral.com) with the nominal response model and the graded
response model; and WINSTEPS and FACETS (www.winsteps.com) and
CONQUEST with the dichotomous Rasch model, partial credit model,
and polytomous Rasch model. The Web site www.assess.com is particularly
helpful in locating IRT software. We are already aware, too, of several major
releases of software scheduled for 2010, including a new version of Multilog
and software to comprehensively address model fit. Still, in general, it is to
be hoped that software in the future can be made more user-friendly.
Approaches to model fit, too, remain technical challenges. Testing a
model with data is a losing proposition because with a sufficiently large sam-
ple size (which is desirable for obtaining stable model parameter estimates),
power to reject any IRT model is high. Investigating practical consequences
of any model misfit remains a more promising direction for studies of model
fit. Alternatively, it is becoming more standard to test one model against
another. This topic is generating considerable interest currently.

References
Agresti, A. (1990). Categorical data analysis. New York: Wiley.
Andersen, E. B. (1973). Conditional inferences for multiple-choice questionnaires.
British Journal of Mathematical and Statistical Psychology, 26, 31–44.
Andrich, D. (1978). A rating formulation for ordered response categories. Psycho-
metrika, 43, 561–573.
Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage Publications.
Andrich, D., & Luo, G. (1993). A hyperbolic cosine latent trait model for unfold-
ing dichotomous single-stimulus responses. Applied Psychological Measurement, 17,
253–276.
Baker, F. B. (1965). Origins of the item parameters X50 and β as a modern item analy-
sis technique. Journal of Educational Measurement, 2, 167–180.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques
(2nd ed.). New York: Marcel Dekker.
Birnbaum, A. (1957). Efficient design and use of mental ability for various decision-
making problems (Series Report No. 58-16). Randolph Air Force Base, TX:
USAF School of Aviation Medicine.
Birnbaum, A. (1958). On the estimation of mental ability (Series Report No. 15).
Randolph Air Force Base, TX: USAF School of Aviation Medicine.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an exam-
inee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test
scores (Chaps. 17 to 20). Reading, MA: Addison-Wesley.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are
scored in two or more nominal categories. Psychometrika, 37, 29–51.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item
parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously
scored items. Psychometrika, 35, 179–197.
Douglas, J., & Cohen, A. S. (2001). Nonparametric item response function estimation for
assessing parametric model fit. Applied Psychological Measurement, 25, 234–243.
Engelhard, G. (2005). IRT models for rating scale data. In B. Everitt & D. Howell
(Eds.), Encyclopedia of statistics in behavioral science (Vol. 2, pp. 995–1003). West
Sussex, UK: John Wiley & Sons.
Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests. Bern, Switzerland: Huber.


Fitzpatrick, A. R., Link, V. B., Yen, W. M., Burket, G. R., Ito, K., & Sykes, R. C.
(1996). Scaling performance assessments: A comparison of one-parameter and
two-parameter partial credit models. Journal of Educational Measurement, 33,
291–314.
Fitzpatrick, A. R., & Yen, W. M. (1995). The psychometric characteristics of choice
items. Journal of Educational Measurement, 32, 243–259.
Gulliksen, H. (1950). Theory of mental test scores. New York: Wiley.
Hambleton, R. K., & Han, N. (2005). Assessing the fit of IRT models to educational
and psychological test data: A five step plan and several graphical displays. In
W. R. Lenderking & D. Revicki (Eds.), Advances in health outcomes research
methods, measurement, statistical analysis, and clinical applications (pp. 57–78).
Washington, DC: Degnon Associates.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item
response theory. Thousand Oaks, CA: Sage Publications.
Jansen, P. G. W., & Roskam, E. E. (1986). Latent trait models and dichotomization of
graded responses. Psychometrika, 51, 69–91.
Lazarsfeld, P. F. (1950). The logical and mathematical foundation of latent struc-
ture analysis. In S. A. Stouffer, L. Guttman, E. A. Suchman, P. F. Lazarsfeld,
S. A. Star, & J. A. Clausen (Eds.), Measurement and prediction. Princeton, NJ:
Princeton University Press.
Lord, F. M. (1952). A theory of test scores. Psychometrika, Monograph No. 7.
Lord, F. M. (1953a). The relation of test score to the trait underlying the test.
Educational and Psychological Measurement, 13, 517–548.
Lord, F. M. (1953b). An application of confidence intervals and of maximum likeli-
hood to the estimation of an examinee's ability. Psychometrika, 18, 57–76.
Lord, F. M. (1965). An empirical study of item-test regression. Psychometrika, 30,
373–376.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores (with con-
tributions by Allen Birnbaum). Reading, MA: Addison-Wesley.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47,
149–174.
Masters, G. (1988). The analysis of partial credit scoring. Applied Measurement in
Education, 1, 279–298.
Masters, G. N., & Wright, B. D. (1984). The essential process in a family of measure-
ment models. Psychometrika, 49, 529–544.
Mellenbergh, G. J. (1995). Conceptual notes on models for discrete polytomous item
responses. Applied Psychological Measurement, 19, 91–100.
Molenaar, I. W. (1983). Item steps (Heymans Bulletins HB-83-630-EX). Groningen,
NL: Psychologisch Instituut RU Groningen.
Molenaar, I. (1997). Nonparametric models for polytomous responses. In W. J. van
der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory
(pp. 369–380). New York: Springer.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algo-
rithm. Applied Psychological Measurement, 16, 159–176.
Muraki, E. (1993). Information functions of the generalized partial credit model.
Applied Psychological Measurement, 17, 351–363.
Ostini, R., & Nering, M. L. (2006). Polytomous item response theory models. Thousand
Oaks, CA: Sage Publications.
Ramsay, J. (1991). Kernel smoothing approaches to nonparametric item characteris-
tic curve estimation. Psychometrika, 56, 611–630.


Ramsay, J. (1992). TESTGRAF: A program for the graphical item analysis of multiple-
choice test and questionnaire data (Technical Report). Montreal, Canada: McGill
University.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests.
Copenhagen, Denmark: Danish Institute for Educational Research.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology.
In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and
Probability (pp. 321–334). Berkeley: University of California Press.
Roberts, J. S., Donoghue, J. R., & Laughlin, J. S. (2000). A general item response
theory model for unfolding unidimensional polytomous responses. Applied
Psychological Measurement, 24, 3–32.
Roskam, E. E., & Jansen, P. G. W. (1986). Conditions of Rasch-dichotomizability of
the unidimensional polytomous Rasch model. Psychometrika, 54, 317–332.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded
scores. Psychometrika, Monograph 17, 34.
Samejima, F. (1972). A general model for free-response data. Psychometrika,
Monograph 18, 37.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K.
Hambleton (Eds.), Handbook of modern item response theory (pp. 85–100). New
York: Springer.
Santor, D. A., Ramsay, J. O., & Zuroff, D. C. (1994). Nonparametric item analyses
of the Beck Depression Inventory: Evaluating gender item bias and response
option weights. Psychological Assessment, 6, 255–270.
Santor, D. A., Zuroff, D. C., Ramsay, J. O., Cervantes, P., & Palacios, J. (1995).
Examining scale discriminability in the BDI and CES-D as a function of
depressive severity. Psychological Assessment, 7, 131–139.
Shealy, R., & Stout, W. (1993). A model-based standardization approach that sepa-
rates true bias/DIF from group ability differences and detects test bias/DIF as
well as item bias/DIF. Psychometrika, 58, 159–194.
Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response
theory. Thousand Oaks, CA: Sage.
Swaminathan, H. (2005). Bayesian item response theory estimation. In B. Everitt
& D. Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 1,
pp. 134–139). West Sussex, UK: John Wiley & Sons.
Thissen, D. (1981). MULTILOG: Item analysis and scoring with multiple category mod-
els. Chicago: International Educational Services.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models.
Psychometrika, 51, 567–577.
Thurstone, L. L. (1925). A method of scaling psychological and educational tests.
Journal of Educational Psychology, 16, 433–451.
Tucker, L. R. (1946). Maximum validity of a test with equivalent items. Psychometrika,
11, 1–13.
Tutz, G. (1997). Sequential models for ordered responses. In W. J. van der Linden &
R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 139–152).
New York: Springer.
van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item
response theory. New York: Springer.
Verhelst, N. D., Glas, C. A. W., & de Vries, H. H. (1997). A steps model to analyze
partial credit. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of
modern item response theory (pp. 123–138). New York: Springer.


Verhelst, N. D., & Verstralen, H. (1993). A stochastic unfolding model derived from
the partial credit model. Kwantitatieve Methoden, 42, 73–92.
Wells, C. S., & Bolt, D. M. (2008). Investigation of a nonparametric procedure for
assessing goodness-of-fit in item response theory. Applied Measurement in
Education, 21, 22–40.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement.
Chicago: MESA Press.
Zhao, Y., & Hambleton, R. K. (2009). Software for IRT analyses: Descriptions and
features (Center for Educational Assessment Research Report 652). Amherst,
MA: University of Massachusetts, Center for Educational Assessment.


Chapter 3
The Nominal Categories Item Response Model


David Thissen
The University of North Carolina at Chapel Hill

Li Cai
University of California, Los Angeles

with a contribution by
R. Darrell Bock
University of Illinois at Chicago

Introduction

Editor Introduction: This chapter elaborates the development of the most general polytomous
IRT model covered in this book. It is the only model in this book that does not assume ordered
polytomous response data and can therefore be used to measure traits and abilities with items that
have unordered response categories. It can be used to identify the empirical ordering of response
categories where that ordering is unknown a priori but of interest, or it can be used to check whether
the expected ordering of response categories is supported in data. The authors present a new
parameterization of this model that may serve to expand the model and to facilitate a more wide-
spread use of the model. Also discussed are various derivations of the model and its relationship to
other models. The chapter concludes with a special section by Bock, where he elaborates on the
background of the nominal model.

The Original Context


The nominal categories model (Bock, 1972, 1997) was originally pro-
posed shortly after Samejima (1969, 1997) described the first general item
response theory (IRT) model for polytomous responses. Samejima's graded
models (in normal ogive and logistic form) were designed for item responses
that have some a priori order as they relate to the latent variable being
measured (θ); the nominal model was designed for responses with no
predetermined order.
Samejima (1969) illustrated the use of the graded model with the analy-
sis of data from multiple-choice items measuring academic proficiency. The
weakness of the use of a graded model for that purpose arises from the fact
that the scoring order, or relative degree of correctness, of multiple-choice
response alternatives can only rarely be known a priori. That was part of the
motivation for the development of the nominal model. Bock's (1972) pre-
sentation of the nominal model also used multiple-choice items measuring
vocabulary to illustrate its application. Ultimately, neither Samejima's (1969,
1997) graded model nor Bock's (1972, 1997) nominal model has seen wide-
spread use as a model for the responses to multiple-choice items, because, in
addition to the aforementioned difficulty prespecifying order for multiple-
choice alternatives, neither the graded nor the nominal model makes any
provision for guessing. Elaborating a suggestion by Samejima (1979), Thissen
and Steinberg (1984) described a generalization of the nominal model that
does take guessing into account, and that multiple-choice model is preferable
if IRT analysis of all of the response alternatives for multiple-choice items
is required.

Current Uses
Nevertheless, the nominal model is in widespread use in item analysis and
test scoring. The nominal model is used for three purposes: (1) as an item
analysis and scoring method for items that elicit purely nominal responses, (2)
to provide an empirical check that items expected to yield ordered responses
have actually done so (Samejima, 1988, 1996), and (3) to provide a model for
the responses to testlets. Testlets are sets of items that are scored as a unit
(Wainer & Kiely, 1987); often testlet response categories are the patterns
of response to the constituent items, and those patterns are rarely ordered a
priori.

The Original Nominal Categories Model


Bock's (1972) original formulation of the nominal model was

T(u = k \mid \theta; \mathbf{a}, \mathbf{c}) = T(k) = \frac{\exp(z_k)}{\sum_i \exp(z_i)} \qquad (3.1)

in which T, the curve tracing the probability that the item response u is in
category k, is a function of the latent variable θ with vector parameters a and c.
In what follows we will often shorten the notation for the trace line to T(k),
and in this presentation we number the response alternatives k = 0, 1, …, m − 1
for an item with m response categories. The model itself is the so-called mul-
tivariate logistic function, with arguments

z_k = a_k \theta + c_k \qquad (3.2)

in which z_k is a response process (value) for category k, which is a (linear)
function of θ with slope parameter a_k and intercept c_k. Equations 3.1 and 3.2
can be combined and made more compact as

T(k) = \frac{\exp(a_k \theta + c_k)}{\sum_i \exp(a_i \theta + c_i)} \qquad (3.3)

As stated in Equation 3.3, the model is twice not identified: The addition
of any constant to either all of the a_k's or all of the c_k's yields different parameter
sets but the same values of T(k). As identification constraints, Bock (1972)
suggested

\sum_{k=0}^{m-1} a_k = \sum_{k=0}^{m-1} c_k = 0 \qquad (3.4)

implemented by reparameterizing, and estimating the parameter vectors α and γ
using

\mathbf{a} = \mathbf{T}\boldsymbol{\alpha} \quad \text{and} \quad \mathbf{c} = \mathbf{T}\boldsymbol{\gamma} \qquad (3.5)

in which deviation contrasts from the analysis of variance were used:

\mathbf{T}_{\mathrm{DEV}} = \begin{bmatrix}
-\frac{1}{m} & -\frac{1}{m} & \cdots & -\frac{1}{m} \\
1 - \frac{1}{m} & -\frac{1}{m} & \cdots & -\frac{1}{m} \\
-\frac{1}{m} & 1 - \frac{1}{m} & \cdots & -\frac{1}{m} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{m} & -\frac{1}{m} & \cdots & 1 - \frac{1}{m}
\end{bmatrix}_{m \times (m-1)} \qquad (3.6)

With the T matrices defined as in Equation 3.6, the vectors (of length
m − 1) α and γ may take any value and yield vectors a and c with elements
that sum to zero. As is the case in the analysis of variance, other contrast (T)
matrices may be used as well (see Thissen and Steinberg (1986) for examples);
for reasons that will become clear, in this presentation we will use systems
that identify the model with the constraints a_0 = c_0 = 0 instead of the original
identification constraints.
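As a concrete illustration of this identification machinery (our sketch, not part of the original chapter; function and variable names are ours), the deviation-contrast matrix of Equation 3.6 can be built and checked in a few lines of Python:

```python
import numpy as np

def t_dev(m):
    # Deviation contrasts (Equation 3.6): an m x (m - 1) matrix in which
    # every entry is -1/m except for a 1 added down the shifted diagonal,
    # so each column sums to zero.
    t = np.full((m, m - 1), -1.0 / m)
    for j in range(m - 1):
        t[j + 1, j] += 1.0
    return t

T = t_dev(4)
alpha = np.array([0.5, -1.2, 2.0])  # arbitrary contrast coefficients
a = T @ alpha                       # Equation 3.5: a = T(alpha)
print(np.isclose(a.sum(), 0.0))     # True: the constraint of Equation 3.4 holds
```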
Figure 3.1 shows four sets of trace lines that illustrate some of the range of
variability of item response functions that can be obtained with the nominal
model. The corresponding values of the parameter vectors a and c are shown
in Table 3.1.

Figure 3.1 Upper left: Trace lines for an artificially constructed four-alternative item. Upper right:
Trace lines for the Identify testlet described by Thissen and Steinberg (1988). Lower left: Trace
lines for the number correct on questions following a passage on a reading comprehension test, using
parameter estimates obtained by Thissen, Steinberg, and Mooney (1989). Lower right: Trace lines for
judge-scored constructed-response item M075101 from the 1996 administration of the NAEP math-
ematics assessment.
The curves in the upper left panel of Figure 3.1 artificially illustrate a
maximally ordered, centered set of item responses: As seen in the leftmost
two columns of Table 3.1 (for Item 1), the values of a_k increase by 1.0 as k
increases; as we will see in a subsequent section, that produces an ordered
variant of the nominal model. All of the values of c_k are identically 0.0, so
the trace lines all cross at that value of θ. The upper right panel of Figure 3.1

Table 3.1 Original Nominal Model Parameter Values for the Trace Lines Shown in Figure 3.1

Response       Item 1      Item 2      Item 3      Item 4
Category (k)   a     c     a     c     a     c     a     c
0              0.0   0.0   0.0   0.0   0.0   0.0   0.00  0.0
1              1.0   0.0   0.0   0.9   0.2   0.5   0.95  1.2
2              2.0   0.0   1.1   0.7   0.7   1.8   1.90  0.2
3              3.0   0.0   2.7   0.7   1.3   3.0   2.85  1.4
4                                      2.2   3.3   3.80  2.7


shows trace lines that correspond to parameter estimates (marked Item 2


in Table 3.1) obtained by Thissen and Steinberg (1988) (and subsequently
by Hoskens and De Boeck (1997); see Baker and Kim (2004) for the details of
maximum marginal likelihood parameter estimation) for a testlet compris-
ing two items from Bergan and Stone's (1985) data obtained with a test of
preschool mathematics proficiency. The two items required the child to iden-
tify the numerals 3 and 4; the curves are marked 0 for neither identified, 1 for
3 identified but not 4, 2 for 4 identified but not 3, and 3 for both identified
correctly. This is an example of a testlet with semiordered responses: The 0
and 1 curves are proportional because their a_k estimates are identical, indi-
cating that, except for an overall difference in probability of endorsement,
they have the same relation to proficiency: Both may be taken as incorrect. If
a child can identify 4 but not 3 (the 2 curve), that indicates a moderate, pos-
sibly developing, degree of mathematical proficiency, and both correct (the
3 curve) increases as θ increases.
The lower left panel of Figure 3.1 shows trace lines that correspond to
parameter estimates (marked Item 3 in Table 3.1) obtained by Thissen,
Steinberg, and Mooney (1989) fitting the nominal model to the number-
correct score for the questions following each of four passages on a read-
ing comprehension test. Going from left to right, the model indicates that
the responses are increasingly ordered for this number-correct scored testlet:
Summed scores of 0 and 1 have nearly the same trace lines, because 0 (of 4)
and 1 (of 4) are both scores that can be obtained with nearly equal probability
by guessing on five-alternative multiple-choice items. After that, the trace
lines look increasingly like those of a graded model. The lower right panel of
Figure 3.1 is for a set of graded responses: It shows the curves that correspond
to the parameter estimates for an extended constructed-response mathemat-
ics item administered as part of the National Assessment of Educational
Progress (NAEP) (Allen, Carlson, & Zelenak, 1999). The judged scores
(from 0 to 4) were fitted with Muraki's (1992, 1997) generalized partial
credit (GPC) model, which is a constrained version of the nominal model. In
Table 3.1, the parameters for this item (Item 4 in the two rightmost columns)
have been converted into values of a_k and c_k for comparability with the other
items' parameters. The GPC model is an alternative to Samejima's (1969,
1997) graded model for such ordered responses; the two models generally
yield very similar trace lines for the same data. In subsequent sections of this
chapter we will discuss the relation between the GPC and nominal models
in more detail.

Derivations of the Model


There are several lines of reasoning that lead to Equation 3.3 as an item
response model. In this section we describe three kinds of theoretical
argument that lead to the nominal model as the result, because they exist,
and because different lines of reasoning appeal to persons with different
backgrounds.


As Statistical Mechanics
Certainly the simplest development of the nominal model is essentially
atheoretical, treating the problem as abstract statistical model creation. To
do this, we specify only the most basic facts: that we have categorical item
responses in several (>2) categories, that we believe those item responses
depend on some latent variable (θ) that varies among respondents, and that
the mutual dependence of the item responses on that latent variable explains
their observed covariance. Then simple mathematical functions are used to
complete the model.
First, we assume that the dependence of some response process (value) for
each person, for each item response alternative, is a linear function of theta:

z_k = a_k \theta + c_k \qquad (3.7)

with unknown slope and intercept parameters a_k and c_k. Such a set of straight
lines for a five-category item is shown in the left panel of Figure 3.2, using
the parameters for Item 3 from Table 3.1.
To change those straight lines (z_k) into a model that yields probabilities
(between 0 and 1) for each response, as functions of θ, we use the so-called
multivariate logistic link function

\frac{\exp(z_k)}{\sum_i \exp(z_i)} \qquad (3.8)

This function (Equation 3.8) is often used in statistical models to transform
a linear model into a probability model for categorical data. It can be
characterized as simple mathematical mechanics: Exponentiation of the values
of z_k makes them all positive, and then division of each of those positive
line values by the sum of all of them is guaranteed to transform the straight
lines in the left panel of Figure 3.2 into curves such as those shown in the
right panel of Figure 3.2. The curves are all between 0 and 1, and sum to 1
at all values of θ, as required. (The curves in the right panel of Figure 3.2 are
those from the lower left panel of Figure 3.1; de Ayala (1992) has presented
a similar graphic as his Figure 1.)

Figure 3.2 Left panel: Linear regressions of the response process z_k on θ for five response alterna-
tives. Right panel: Multivariate logistic transformed curves corresponding to the five lines in the left
panel.
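To make these mechanics concrete, the following minimal Python sketch (ours, not part of the original text) carries the Item 3 parameters of Table 3.1, as printed there, through Equations 3.7 and 3.8:

```python
import numpy as np

# Item 3 parameters as printed in Table 3.1
a = np.array([0.0, 0.2, 0.7, 1.3, 2.2])
c = np.array([0.0, 0.5, 1.8, 3.0, 3.3])

theta = np.linspace(-3, 3, 7)

# Equation 3.7: one straight line z_k = a_k * theta + c_k per category
z = np.outer(theta, a) + c                # shape (len(theta), m)

# Equation 3.8: exponentiate, then divide by the sum (multivariate logistic)
T = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

print(np.allclose(T.sum(axis=1), 1.0))    # probabilities sum to 1 at every theta
```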
For purely statistically trained analysts, with no background in psycho-
logical theory development, this is a sufficient line of reasoning to use the
nominal model for data analysis. Researchers trained in psychology may
desire a more elaborated theoretical rationale, of which two are offered in the
two subsequent sections.
However, it is of interest to note at this point that the development in this
section, specifically Equation 3.7, invites the questions: Why linear? Why not
some higher-order polynomial, like quadratic? Indeed, quadratic functions of θ
have been suggested or used for special purposes as variants of the nominal
model: Upon hearing a description of the multiple-choice model (Thissen &
Steinberg, 1984) D. B. Rubin (personal communication, December 15, 1982)
suggested that an alternative to that model would be a nominal model with
quadratic functions replacing Equation 3.7. Ramsay (1995) uses a quadratic
term in Equation 3.7 for the correct response alternative for multiple-choice
items when the multivariate logistic is used to provide smooth information
curves for the nonparametric trace lines in the TestGraf system. Sympson
(1983) also suggested the use of quadratic, and even higher-order, polynomi-
als in a more complex model that never came into implementation or usage.
Nevertheless, setting aside multiple-choice items, for most uses of the
nominal model the linear functions in Equation 3.7 are sufficient.

Relations With Thurstone Models

Relationship to Other Models: The term Thurstone models in polytomous IRT typically refers
to models where response category thresholds characterize all responses above versus below a
given threshold. In contrast, Rasch-type models only characterize responses in adjacent categories.
However, the Thurstone case V model, which is related to the development of the nominal categories
model, is a very different type of Thurstone model, one without thresholds, highlighting the nominal
categories model's unique place among polytomous IRT models.

The original development of the nominal categories model by Bock (1972)
was based on an extension of Thurstone's (1927) case V model for binary
choices, generalized to become a model for the first choice among three or
more alternatives. Thurstone's model for choice made use of the concept of
a response process that followed a normal distribution, one value (process in
Thurstone's language) for each object. The idea was that the object or alter-
native selected was that with the larger value. In practice, a comparatal
process is computed as the difference between the two response processes,
and the first object is selected if the value of the comparatal process is greater
than zero.
Bock and Jones (1968) describe many variants and extensions of Thurstone's
models for choice, including generalizations to the first choice from
among several objects. The obvious generalization of Thurstone's binary


choice model to create a model for the first choice from among three or more
objects would use a multivariate normal distribution of m − 1 comparatal
processes for object or alternative j, each representing a comparison of object j
with one of the others of m objects. Then the probability of selection of alter-
native j would be computed as a multiple integral over that (m − 1)-dimensional
normal density, computing a value known as an orthant probability.
However, multivariate normal orthant probabilities are notoriously difficult
to compute, even for simplified special cases. Bock and Jones suggest sub-
stitution of the multivariate logistic distribution, showing that the bivariate
logistic yields probabilities similar to those obtained from a bivariate normal
(these would be used for the first choice of three objects). The substitution of
the logistic here is analogous with the substitution of the logistic function for
the normal ogive in the two-parameter logistic IRT model (Birnbaum, 1968).
Of course, the multivariate logistic distribution function is Equation 3.1.
In the appendix to this chapter, Bock provides an updated and detailed
description of the theoretical development of the nominal categories model as
an approximation to the multivariate generalization of Thurstones model for
choice. In addition, the appendix describes the development of the model that is
obtained by considering first choices among three or more objects as an extreme
value problem, citing the extension of Dubey's (1969) derivation of the logistic
distribution to the multivariate case that has been used and studied by Bock
(1970), McFadden (1974), and Malik and Abraham (1973). This latter develop-
ment also ties the nominal categories model to the so-called Bradley-Terry-Luce
(BTL) model for choice (Bradley & Terry, 1952; Luce & Suppes, 1965).
Thus, from the point of view of mathematical models for choice, the nom-
inal categories model is both an approximation to Thurstone (normal) mod-
els for the choice of one of three or more alternatives, and the multivariate
version of the BTL model.

The Probability of a Response in One of Two Categories


Another derivation of the nominal model involves its implications for the
conditional probability of a response in one category (say k) given that the
response is in one of two categories (k or k′). This derivation is analogous in
some respects to the development of Samejima's (1969, 1997) graded model,
which is built up from the idea that several conventional binary item response
models may be concatenated to construct a model for multiple responses. In
the case of the graded model, accumulation is used to transform the multiple
category model into a series of dichotomous models: The conventional nor-
mal ogive or logistic model is used to describe the probability that a response
is in category k or higher, and then those cumulative models are subtracted
to produce the model for the probability the response is in a particular cat-
egory. This development of the graded model rests, in turn, on the theoreti-
cal development of the normal ogive model as a model for the psychological
response process, as articulated by Lord and Novick (1968, pp. 370–373),
and then on Birnbaum's (1968) reiteration for test theory of Berkson's (1944,
1953) suggestion that the logistic function could usefully be substituted for


the normal ogive. (See Thissen and Orlando (2001, pp. 84–89) for a sum-
mary of the argument by Lord and Novick and the story behind the logistic
substitution.)
The nominal model may be derived in a parallel fashion, assuming that
the conditional probability of a response in one category (say k), given that
the response is in one of two categories (k or k′), can be modeled with the
two-parameter logistic (2PL). The algebra for this derivation frontwards
(from the 2PL for the conditional responses to the nominal model for all of
the responses) is algebraically challenging as test theory goes, but it is suf-
ficient to do it backwards, and that is what is presented here. (We note in
passing that Masters (1982) did this derivation frontwards for the simpler
route from the Rasch or one-parameter logistic (1PL) to the partial credit
model.)
If one begins with the nominal model as stated in Equation 3.3, and
writes the conditional probability for a response in category k given that the
response is in one of categories k or k′,

T(k \mid k, k') = \frac{T(k)}{T(k) + T(k')} \qquad (3.9)

then only a modest amount of algebra (cancel the identical denominators,
and then more cancellation to change the three exponential terms into one) is
required to show that this conditional probability is, in fact, a two-parameter
logistic function:

T(k \mid k, k', \theta) = \frac{1}{1 + \exp(-a_c^k \theta + c_c^k)} \qquad (3.10)

with

c_c^k = c_{k'} - c_k \qquad (3.11)

and

a_c^k = a_k - a_{k'} \qquad (3.12)
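A quick numerical check of this backwards derivation (our illustration, with arbitrary parameter values) confirms that the conditional probability of Equation 3.9 coincides with the 2PL of Equations 3.10 to 3.12:

```python
import numpy as np

a = np.array([0.0, 0.2, 0.7])  # arbitrary nominal-model slopes
c = np.array([0.0, 0.5, 1.8])  # arbitrary nominal-model intercepts
k, kp = 2, 1                   # the two categories being compared (k and k')
theta = 0.8

z = a * theta + c
T = np.exp(z) / np.exp(z).sum()

lhs = T[k] / (T[k] + T[kp])                     # Equation 3.9
a_c = a[k] - a[kp]                              # Equation 3.12
c_c = c[kp] - c[k]                              # Equation 3.11
rhs = 1.0 / (1.0 + np.exp(-a_c * theta + c_c))  # Equation 3.10

print(np.isclose(lhs, rhs))  # True
```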

Placing interpretation on the algebra, what this means is that the nomi-
nal model assumes that if we selected the subsample of respondents who
selected either alternative k or k′, setting aside respondents who made other
choices, and analyzed the resulting dichotomous item in that subset of the
data, we would use the 2PL model for the probability of response k in that
subset of the data. This choice, like the choice of the normal ogive or logistic
model for the cumulative probabilities in the graded model, then rests on


the theoretical development of the normal ogive model as a psychological
response process model as articulated by Lord and Novick (1968), and
Birnbaum's (1968) argument for the substitution of the logistic. The dif-
ference between the two ways of dividing multiple responses into a series
of dichotomies (cumulative vs. conditional) has been discussed by Agresti
(2002).
An interesting and important feature of the nominal model is obtained by
specializing the conditional probability for any pair of responses to adjacent
response categories (k or k − 1; adjacent is meaningful if the responses are
actually ordered); the same two-parameter logistic is obtained:

T(k \mid k, k-1) = \frac{1}{1 + \exp(-a_c^k \theta + c_c^k)} \qquad (3.13)

with

c_c^k = c_{(k-1)} - c_k \qquad (3.14)

and

a_c^k = a_k - a_{(k-1)} \qquad (3.15)

It is worth noting at this point that the threshold b_c^k for the slope-threshold
form of the conditional 2PL curve,

T(k \mid k, k-1) = \frac{1}{1 + \exp[-a_c^k(\theta - b_c^k)]} \qquad (3.16)

is

b_c^k = \frac{c_c^k}{a_c^k} = \frac{c_{k-1} - c_k}{a_k - a_{k-1}} \qquad (3.17)

which is also the crossing point of the trace lines for categories k and k − 1
(de Ayala, 1993; Bock, 1997). These values are featured in some parameter-
izations of the nominal model for ordered data.

This fact defines the concept of order for nominal response categories:
Response k is higher than response k − 1 if and only if a_k > a_{k−1}, which means
that a_c^k is positive, and so the conditional probability of selecting response k
(given that it is one of the two) increases as θ increases. Basically this means
that item analysis with the nominal model tells the data analyst the order of the
item responses. We have already made use of this fact in discussion of order and
the a_k parameters in Figure 3.1 and Table 3.1 in the introductory section.
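In practice that determination can be read directly off the slope estimates; a small sketch (ours, with hypothetical estimates) makes the point:

```python
import numpy as np

# Hypothetical slope estimates for a four-category item
labels = ["A", "B", "C", "D"]
a = np.array([0.0, 1.3, 0.9, 2.1])

# Response k is higher than k' whenever a_k > a_k', so sorting the a_k's
# yields the empirical scoring order of the categories.
order = np.argsort(a)
print([labels[i] for i in order])  # ['A', 'C', 'B', 'D']
```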



Figure 3.3 Trace lines corresponding to item parameters obtained by Huber (1993) in his analy-
sis of the item Count down from 20 by 3s on the Short Portable Mental Status Questionnaire
(SPMSQ).

Two additional examples serve to illustrate the use of the nominal model
to determine the order of response categories, and the way the model may be
used to provide trace lines that can be used to compute IRT scale scores (see
Thissen, Nelson, Rosa, and McLeod, 2001) using items with purely nominal
response alternatives.
Figure 3.3 shows the trace lines corresponding to item parameters
obtained by Huber (1993) in his analysis of the item Count down from 20
by 3s on the Short Portable Mental Status Questionnaire (SPMSQ), a brief
diagnostic instrument used to detect dementia. For this item, administered
to a sample of aging individuals, three response categories were recorded:
correct, incorrect (scored positively for this cognitive dysfunction scale),
and refusal (NA). Common practice scoring the SPMSQ in clinical and
research applications was to score NA as incorrect, based on a belief that
respondents who refused to attempt the task probably could not do it. Huber
fitted the three response categories with the nominal model and obtained
the parameters a = [0.0, 1.56, 1.92] and c = [0.0, 0.52, 0.85]; the cor-
responding curves are shown in Figure 3.3. As expected, the a_k parameter
for NA is much closer to the a_k parameter for the incorrect response, and
the curve for NA is nearly proportional to the + curve in Figure 3.3. This
analysis lends a degree of justification to the practice of scoring NA as incor-
rect. However, if the IRT model is used to compute scale scores, those scale
scores reflect the relative evidence of failure provided by the NA response
more precisely.
The SPMSQ also includes items that many item analysts would expect to
be locally dependent. One example involves a pair of questions that require
the respondent to state his or her age, and then his or her date of birth. Huber
(1993) combined those two items into a testlet with four response categories:
both correct (++), age correct and date of birth incorrect (+−), age incorrect
and date of birth correct (−+), and both incorrect (−−). Figure 3.4 shows the


Figure 3.4 Nominal model trace lines for the four response categories for Huber's (1993) SPMSQ
testlet scored as reporting both age and date of birth correctly (++), age correctly and date of birth
incorrectly (+−), age incorrectly and date of birth correctly (−+), and both incorrectly (−−).

nominal model trace lines for the four response categories for that testlet.
While one may confidently expect that the −− response reflects the highest
degree of dysfunction and the ++ response the lowest degree of dysfunction,
there is a real question about the scoring value of the +− and −+ responses.
The nominal model analysis indicates that the trace lines for +− and −+ are
almost exactly the same, intermediate between good and poor performance.
Thus, after the analysis with the nominal model one may conclude that this
testlet yields four response categories that collapse into three ordered scoring
categories: ++, [+− or −+], and −−.

Alternative Parameterizations, With Uses


Thissen and Steinberg (1986) showed that a number of other item response
models may be obtained as versions of the nominal model by imposing con-
straints on the nominal model's parameters, and further that the canonical
parameters of those other models may be made the αs and γs estimated for
the nominal model with appropriate choices of T matrices. Among those
other models are Masters' (1982) partial credit (PC) model (see also Masters
and Wright, 1997) and Andrich's (1978) rating scale (RS) model (see also
Andersen (1997) for relations with proposals by Rasch (1961) and Andersen
(1977)).
(1977)). Thissen and Steinberg (1986) also mentioned in passing that a ver-
sion of the nominal model like the PC model, but with discrimination
parameters that vary over items, is also within the parameter space of the
nominal model. That latter model was independently developed and used in
the 1980s by Muraki (1992) and called the generalized partial credit (GPC)
model, and by Yen (1993) and called the two-parameter partial credit (2PPC)
model.


More on Ordered Versions of the Nominal Model: Rating
Scale and (Generalized) Partial Credit Models

Notational Difference: Remember this model was presented slightly differently in Chapter 2:

P_{ik}(\theta_j) = \frac{\exp\left[\sum_{v=0}^{k} a_i(\theta_j - b_{iv})\right]}{\sum_{h=0}^{K-1} \exp\left[\sum_{v=0}^{h} a_i(\theta_j - b_{iv})\right]}
Downloaded by [The University of Edinburgh] at 10:42 26 September 2017

Muraki (1992, 1997) has used several parameterizations to describe the GPC
model, among them

T(k) = \frac{\exp\left[\sum_{j=0}^{k} 1.7a(\theta - b + d_j)\right]}{\sum_{i=0}^{m-1} \exp\left[\sum_{j=0}^{i} 1.7a(\theta - b + d_j)\right]} \qquad (3.18)

with the constraint that

\sum_{i=1}^{m-1} d_i = 0 \qquad (3.19)

and alternatively

T(k) = \frac{\exp[1.7a[T_k(\theta - b) + K_k]]}{\sum_{i=0}^{m-1} \exp[1.7a[T_i(\theta - b) + K_i]]} \qquad (3.20)

in which

K_k = \sum_{i=1}^{k} d_i \qquad (3.21)
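A short Python sketch (ours) shows how the cumulative sums inside the exponents of Equation 3.18 can be computed; the hypothetical d values sum to zero, as Equation 3.19 requires:

```python
import numpy as np

def gpc_trace(theta, a, b, d):
    # Equation 3.18: d holds (d_1, ..., d_{m-1}); d_0 = 0 is prepended so
    # the inner sums run over j = 0, ..., k as in the text.
    dj = np.concatenate(([0.0], d))
    z = np.cumsum(1.7 * a * (theta - b + dj))  # one value per category k
    ez = np.exp(z - z.max())                   # subtract the max for stability
    return ez / ez.sum()

p = gpc_trace(theta=0.5, a=1.2, b=-0.3, d=np.array([-0.4, 0.1, 0.3]))
print(p, p.sum())  # four category probabilities that sum to 1
```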

Muraki's parameterization of the GPC model is closely related to Masters'
(1982) specification of the PC model:

Notational Difference: Here the authors use θ to refer to the latent variable of interest where
Masters (see Equations 5.22 and 5.23 in Chapter 5) and Andrich (see Equations 6.24 and 6.25 in
Chapter 6) typically refer to the latent variable using β. This θ/β notational difference will be seen in
other chapters and is common in IRT literature.

T(k) = \frac{\exp\left[\sum_{j=0}^{k} (\theta - \delta_j)\right]}{\sum_{i=0}^{m-1} \exp\left[\sum_{j=0}^{i} (\theta - \delta_j)\right]} \qquad (3.22)

Y102002_Book.indb 55 3/3/10 6:57:28 PM


56 David Thissen, Li Cai, and R. Darrell Bock

with the constraint

\sum_{j=0}^{0} (\theta - \delta_j) \equiv 0 \qquad (3.23)

Andrich's (1978) RS model is

T(k) = \frac{\exp\left[\sum_{j=0}^{k} [\theta - (\delta + \tau_j)]\right]}{\sum_{i=0}^{m-1} \exp\left[\sum_{j=0}^{i} [\theta - (\delta + \tau_j)]\right]} \qquad (3.24)

with the constraints

\sum_{j=0}^{0} [\theta - (\delta + \tau_j)] \equiv 0 \qquad (3.25)

and

\sum_{j=1}^{m-1} \tau_j = 0 \qquad (3.26)

Thissen and Steinberg (1986)


Thissen and Steinberg (1986) described the use of alternative T matrices in
the formulation of the nominal model. For example, when formulated for
marginal estimation following Thissen (1982), Masters' (1982) PC model
and Andrich's (1978) RS model use a single slope parameter that is the coef-
ficient for a linear basis function:

\mathbf{T}_{a(\mathrm{PC})} = \begin{bmatrix} 0 \\ 1 \\ 2 \\ \vdots \\ m-1 \end{bmatrix}_{m \times 1} \qquad (3.27)

Masters' (1982) PC model used a parameterization for the threshold param-
eters that can be duplicated, up to proportionality, with this T matrix for the cs:

\mathbf{T}_{c(\mathrm{PC})} = \begin{bmatrix}
0 & 0 & \cdots & 0 \\
1 & 0 & \cdots & 0 \\
1 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \cdots & 1
\end{bmatrix}_{m \times (m-1)} \qquad (3.28)


Terminology Note: The authors use the term threshold here, whereas in other chapters these
parameters are sometimes referred to as step or boundary parameters.

Andrich's RS model separated an overall item location parameter from a
set of parameters describing the category boundaries for the item response
scale; the latter were constrained equal across items, and may be obtained,
again up to proportionality, with

\mathbf{T}_{c(\mathrm{RS\text{-}C})} = \begin{bmatrix}
0 & 0 & 0 & \cdots & 0 \\
1 & 1 & 0 & \cdots & 0 \\
2 & 1 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
(m-2) & 1 & 1 & \cdots & 1 \\
(m-1) & 0 & 0 & \cdots & 0
\end{bmatrix}_{m \times (m-1)} \qquad (3.29)

Andrich (1978, 1985) and Thissen and Steinberg (1986) described the use
of a polynomial basis for the cs as an alternative to T_c(RS-C) that smooths
the category boundaries; the overall item location parameter is the coefficient
of the first (linear) column, and the coefficients associated with the other
columns describe the response category boundaries:

\mathbf{T}_{c(\mathrm{RS\text{-}P})} = \begin{bmatrix}
0 & 0^2 & 0^3 & \cdots \\
1 & 1^2 & 1^3 & \cdots \\
2 & 2^2 & 2^3 & \cdots \\
\vdots & \vdots & \vdots & \\
(m-2) & (m-2)^2 & (m-2)^3 & \cdots \\
(m-1) & (m-1)^2 & (m-1)^3 & \cdots
\end{bmatrix}_{m \times (m-1)} \qquad (3.30)

Polynomial contrasts were used by Thissen et al. (1989) to obtain the trace
lines for summed score testlets for a passage-based reading comprehension
test; the trace lines for one of those testlets are shown as the lower left panel
of Figure 3.1 and the right panel of Figure 3.2. The polynomial contrast set
included only the linear term for the a_k's and the linear and quadratic terms
for the c_k's for that testlet; that was found to be a sufficient number of terms
to fit the data. This example illustrates the fact that, although the nomi-
nal model may appear to have many estimated parameters, in many situa-
tions a reduction of rank of the T matrix may result in much more efficient
estimation.


A New Parameterization for the Nominal Model


After three decades of experience with the nominal model and its applica-
tions, a revision to the parameterization of the model would serve several pur-
poses: Such a revision could be used first of all to facilitate the extension of the
nominal model to become a multidimensional IRT (MIRT) model, a first for
purely nominal responses. In addition, a revision could make the model easier
to explain. Further, by retaining features that have actually been used in data
analysis, and discarding suggestions (such as many alternative T matrices) that
have rarely or never been used in practice, the implementation of estimation
algorithms for the model in software could become more straightforward.


Thus, while the previous sections of this chapter have described the nomi-
nal model as it has been, and as it has been used, this section presents a new
parameterization that we expect will be implemented in the next generation
of software for IRT parameter estimation. This is a look into the future.

Desiderata
The development of the new parameterization for the nominal model was
guided by several goals, combining a new insight with experience gained
over the last 30 years of applications of the model:

1. The dominating insight is that a kind of multidimensional nominal


model can be created by separating the a parameterization into a single
overall (multiplicative) slope or discrimination parameter, that is then
expanded into vector form to correspond to vector θ, and a set of m − 2
contrasts among the a parameters that represent what Muraki (1992)
calls the scoring functions for the responses. This change has the added
benefit that, for the first time, the newly reparameterized nominal
model has a single discrimination parameter comparable to those of
other IRT models. That eases explanation of results of item analysis
with the model.
2. In the process of accomplishing Goal 1, it is desirable to parameterize
the model in such a way that the scoring function may be (smoothly)
made linear (0, 1, 2, …, m − 1) so that the multiplicative overall slope
parameter becomes the slope parameter for the GPC model, which,
constrained equally across items, also yields the PC and RS models.
In addition, with this scoring function the overall slope parameter may
meaningfully be set equal to the (also equal) slope for a set of 2PL items
to mimic Rasch family mixed models.
3. We have also found it useful at times in the past 20 years to use models
between the highly constrained GPC model and the full-rank nominal
model, as suggested by Thissen and Steinberg (1986), most often by
using polynomial bases for the a and c parameters and reducing the
number of estimated coefficients below full rank to obtain smoothly
changing values of the a and c parameters across response categories.
It is desirable to retain that option.


4. With other sets of data, we have found it useful to set equal subsets
of the a or c parameters within an item, modeling distinct response
categories as equivalent for scoring (the a parameters are equal) or alto-
gether equivalent (both the a and c parameters are equal).

Obtaining Goals 3 and 4 requires two distinct parameterizations, both


expressed as sets of T matrices; Goals 1 and 2 are maintained in both
parameterizations.

The New Parameterization


The new parameterization is

T(u = k \mid \theta; a^*, a_k^s, c_k) = T(k) = \frac{\exp(z_k)}{\sum_i \exp(z_i)} \qquad (3.31)

in which

z_k = a^* a_{k+1}^s \theta + c_{k+1} \qquad (3.32)

and a^* is the overall slope parameter, a_{k+1}^s is the scoring function for response
k, and c_{k+1} is the intercept parameter as in the original model. The following
restrictions for identification,

a_1^s = 0, \quad a_m^s = m - 1, \quad \text{and} \quad c_1 = 0 \qquad (3.33)

are implemented by reparameterizing, and estimating the parameters α and γ:

\mathbf{a}^s = \mathbf{T}\boldsymbol{\alpha} \quad \text{and} \quad \mathbf{c} = \mathbf{T}\boldsymbol{\gamma} \qquad (3.34)

The Fourier Version for Linear Effects and Smoothing


To accomplish Goals 1 to 3, we use a Fourier basis as the T matrix, aug-
mented with a linear column:

\mathbf{T}_F = \begin{bmatrix}
0 & 0 & \cdots & 0 \\
1 & f_{22} & \cdots & f_{2(m-1)} \\
2 & f_{32} & \cdots & f_{3(m-1)} \\
\vdots & \vdots & & \vdots \\
m-1 & 0 & \cdots & 0
\end{bmatrix}_{m \times (m-1)} \qquad (3.35)

in which f_{ki} is

f_{ki} = \sin[\pi(i-1)(k-1)/(m-1)] \qquad (3.36)

and α_1 = 1. Figure 3.5 shows graphs of the linear and Fourier functions for
four categories (left panel) and six categories (right panel). The Fourier-based
terms functionally replace quadratic and higher-order polynomial terms that
we have often used to smooth sequences of a_k and c_k parameters with a more
numerically stable, symmetrical orthogonal basis.

Figure 3.5 Graphs of the linear and Fourier basis functions for the new nominal model parameter-
ization, for four categories (left panel) and six categories (right panel); the values of T at integral values
on the Response axis are the elements of the T matrix of Equations 3.35 and 3.36.
The new parameterization, using the Fourier T matrix, provides several
useful variants of the nominal model: When a^*, {α_2, …, α_{m−1}}, and γ are
estimated parameters, this is the full-rank nominal model. If {α_2, …, α_{m−1}} are
restricted to be equal to zero, this is a reparameterized version of the GPC
model. The Fourier basis provides a way to create models between the GPC
and nominal model, as were used by Thissen et al. (1989), Wainer, Thissen,
and Sireci (1991), and others.
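The following sketch (ours) builds T_F per Equations 3.35 and 3.36 and confirms that zeroing α_2, …, α_{m−1} with α_1 = 1 returns the linear GPC scoring function 0, 1, …, m − 1:

```python
import numpy as np

def t_fourier(m):
    # Equations 3.35-3.36: the first column is linear (0, ..., m-1); the
    # remaining columns hold f_ki = sin(pi * (i - 1) * (k - 1) / (m - 1)),
    # with rows k = 1, ..., m and columns i = 1, ..., m-1 as in the text.
    t = np.zeros((m, m - 1))
    t[:, 0] = np.arange(m)
    for k in range(1, m + 1):
        for i in range(2, m):
            t[k - 1, i - 1] = np.sin(np.pi * (i - 1) * (k - 1) / (m - 1))
    return t

TF = t_fourier(5)
alpha = np.array([1.0, 0.0, 0.0, 0.0])  # alpha_1 = 1, the rest zero
print(TF @ alpha)                       # [0. 1. 2. 3. 4.]: linear scoring
```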

Useful Derived Parameters


When the linear-Fourier basis T_F is used for both

\mathbf{a}^s = \mathbf{T}_F\boldsymbol{\alpha} \quad \text{and} \quad \mathbf{c} = \mathbf{T}_F\boldsymbol{\gamma} \qquad (3.37)

with α_1 = 1 and α_2, …, α_{m−1} = 0, then the parameters of the GPC model

T(k) = \frac{\exp\left[\sum_{j=0}^{k} 1.7a(\theta - b + d_j)\right]}{\sum_{i=0}^{m-1} \exp\left[\sum_{j=0}^{i} 1.7a(\theta - b + d_j)\right]} \qquad (3.38)

may be computed as

a = \frac{a^*}{1.7} \qquad (3.39)

b = \frac{-c_m}{a^*(m-1)} = \frac{-\gamma_1}{a^*} \qquad (3.40)

and

d_k = \frac{c_k - c_{k-1}}{a^*} - \frac{c_m}{a^*(m-1)} \qquad (3.41)


for k = 1, …, m − 1 (noting that d_0 = 0 and c_0 = 0 as constraints for identifica-
tion). (Childs and Chen (1999) provided formulae to convert the parameters
of the original nominal model into those of the GPC model, but they used
the T matrices in the computations, which is not essential in the simpler
methods given here.)
Also note that if it is desired to constrain the GPC parameters d_k to be
equal across a set of items, that is accomplished by setting the parameter
sets γ_2, …, γ_{m−1} equal across those items. This kind of equality constraint
really only makes sense if the overall slope parameter a^* is also set equal
across those items, in which case b_i = −γ_{i,1}/a_i^* reflects the overall difference in
difficulty, which still varies over items i. (Another way to put this is that the
linear-Fourier basis separates the parameter space into a (first) component
for b_i = −γ_{i,1}/a_i^* and a remainder that parameterizes the spacing among the
thresholds or crossover points of the curves.)
The alternative parameterization of the GPC

T(k) = \frac{\exp[1.7a[T_k(\theta - b) + K_k]]}{\sum_{i=0}^{m-1} \exp[1.7a[T_i(\theta - b) + K_i]]} \qquad (3.42)

in which

K_k = \sum_{i=1}^{k} d_i \qquad (3.43)

simply substitutes K_k parameters that may be computed from the values of
d_i. Note that the multiplication of the parameter b by the scoring function
T_k provides another explanation of the fact that with the linear-Fourier basis
b_i = −γ_{i,1}/a_i^*.
To provide translations of the parameters for Rasch family models, some
accommodation must be made between the conventions that the scale of the
latent variable is usually set for more general models by specifying that θ is
distributed with mean zero and variance one, versus many implementations
of Rasch family models with the specification that some item's difficulty is
zero, or the average difficulty is zero, and the slope is one, leaving the mean
and variance of the θ distribution unspecified, and estimated.
If we follow the approach taken by Thissen (1982) that a version of Rasch
family models may be obtained with the specification that θ is distributed
with mean zero and variance one, estimating a single common slope param-
eter (a^* in this case) for all items, and all items' difficulty parameters, then the
δ parameters of Masters' PC model are

\delta_k = b - d_k \qquad (3.44)


(in terms of the parameters of Muraki's GPC model) up to a linear transfor-
mation of scale, and the δ and τ parameters of Andrich's RS model are

\delta = b \qquad (3.45)

and

\tau_k = -d_k \qquad (3.46)

again up to a linear transformation of scale.

The Identity-Based T Matrix for Equality Constraints


To accomplish Goals 1, 2, and 4, involving equality constraints, we use T
matrices for a^s of the form

\mathbf{T}_{Ia} = \begin{bmatrix}
0 & \mathbf{0}'_{m-2} \\
\mathbf{0}_{m-2} & \mathbf{I}_{m-2} \\
m-1 & \mathbf{0}'_{m-2}
\end{bmatrix}_{m \times (m-1)} \qquad (3.47)

with the constraint that α_1 = 1. If it is desirable to impose equality con-
straints in addition on the cs, we use the following T matrix:

\mathbf{T}_{Ic} = \begin{bmatrix}
\mathbf{0}'_{m-1} \\
\mathbf{I}_{m-1}
\end{bmatrix}_{m \times (m-1)} \qquad (3.48)

This arrangement provides for the following variants of the nominal
model, among others: When a^*, {α_2, …, α_{m−1}}, and γ are estimated param-
eters, this is again the full-rank nominal model. If α_i = i for i ∈ {2, …, m − 1},
this is a reparameterized version of the generalized partial credit model.
The restriction a_1^s = a_2^s is imposed by setting α_2 = 0. The restriction
a_{m−1}^s = a_m^s is imposed by setting α_{m−1} = m − 1. For the other values of a^s the
restriction a_k^s = a_{k′}^s is imposed by setting α_k = α_{k′}.

Illustrations
Table 3.2 shows the values of the new nominal model parameters for the
items with trace lines in Figure 3.1 and the original parameters in Table 3.1.
Note that the scoring parameters in a^s for Items 1 and 4 are [0, 1, 2, …, m − 1],
indicating that the nominal model for those two items is one for strictly
ordered responses. In addition, we observe that the lower discrimination
of Item 3 (with trace lines shown in the lower left panel of Figure 3.1) is
now clearly indicated by the relatively lower value of a^*; the discrimination

Table 3.2 Item Parameters for the New Parameterization of the Nominal Model, for the Same Items With the Original Model Parameters in Table 3.1

Parameter   Item 1       Item 2       Item 3       Item 4
a*          1.0          0.9          0.55         0.95
            a^s    c     a^s    c     a^s    c     a^s    c
a1s, c1     0.0    0.0   0.0    0.0   0.0    0.0   0.00   0.0
a2s, c2     1.0    0.0   0.0    0.9   0.36   0.5   1.00   1.2
a3s, c3     2.0    0.0   1.2    0.7   1.27   1.8   2.00   0.2
a4s, c4     3.0    0.0   3.0    0.7   2.36   3.0   3.00   1.4
a5s, c5                               4.00   3.3   4.00   2.7

parameter for Item 3 is only 0.55, relative to values between 0.9 and 1.0 for
the other three items. The values of the c parameters are unchanged from
Table 3.1. If the item analyst wishes to convert the parameters for Item 3 in
Table 3.2 to those previously used for the GPC model, Equations 3.39 to
3.41 may be used.
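For instance, a small script (ours) applying Equations 3.39 to 3.41 to the Item 3 values as printed in Table 3.2 yields GPC parameters whose d_k sum to approximately zero, as Equation 3.19 requires:

```python
import numpy as np

a_star = 0.55
c = np.array([0.0, 0.5, 1.8, 3.0, 3.3])  # c_1, ..., c_5 for Item 3, Table 3.2
m = len(c)

a = a_star / 1.7                                      # Equation 3.39
b = -c[-1] / (a_star * (m - 1))                       # Equation 3.40
d = np.diff(c) / a_star - c[-1] / (a_star * (m - 1))  # Equation 3.41

print(round(a, 3), round(b, 2))             # 0.324 -1.5
print(np.isclose(d.sum(), 0.0, atol=1e-2))  # True
```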

Multidimensionality and the Nominal Model


The new parameterization of the nominal model is designed to facilitate mul-
tidimensional item factor analysis (or MIRT analysis) for items with nomi-
nal responses, something that has not heretofore been available (Cai, Bock,
& Thissen, in preparation). A MIRT model has a vector-valued θ_p, with two or
more dimensions in the latent variable space that are used to explain the
covariation among the item responses. Making use of the separation of the
new nominal model parameterization of overall item discrimination param-
eter (a^*) from the scoring functions (in a^s), the multidimensional nominal
model has a vector of discrimination parameters a^*, one value indicating the
slope in each direction of the p-space. This vector of discrimination param-
eters taken together indicates the direction of highest discrimination of the
item, which may be along any of the axes or between them.
The parameters in a^s remain unchanged: Those represent the scoring func-
tions of the response categories and are assumed to be the same in all direc-
tions in the p-space. So the model remains nominal in the sense that the
scoring functions may be estimated from the data. The intercept parameter
c also remains unchanged, taking the place of the standard unitary intercept
parameter in a MIRT model.
Assembled in notation, the nominal MIRT model is

T(u = k \mid \boldsymbol{\theta}_p; \mathbf{a}^*, \mathbf{a}^s, \mathbf{c}) = T(k) = \frac{\exp(z_k)}{\sum_i \exp(z_i)} \qquad (3.49)

modified from Equation 3.31 with vector a^* and vector θ_p, in which

z_k = a_k^s\, \mathbf{a}^{*\prime} \boldsymbol{\theta}_p + c_k \qquad (3.50)
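A minimal sketch (ours) of Equations 3.49 and 3.50 for a two-dimensional θ_p; the vector a^* sets the direction of highest discrimination, while the scoring function in a^s is shared across directions (all parameter values below are hypothetical):

```python
import numpy as np

a_star = np.array([1.1, 0.4])         # slopes in each of p = 2 directions
a_s = np.array([0.0, 1.0, 2.0, 3.0])  # scoring function, shared across directions
c = np.array([0.0, -0.2, 0.1, -0.5])  # hypothetical intercepts

theta_p = np.array([0.3, -1.0])       # one respondent's latent vector

z = a_s * (a_star @ theta_p) + c      # Equation 3.50
T = np.exp(z) / np.exp(z).sum()       # Equation 3.49
print(T, T.sum())
```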


This is a nominal response model in the sense that, for any direction in
the θ_p space, a cross section of the trace surfaces may take the variety of shapes
provided by the unidimensional nominal model. Software to estimate the
parameters of this model is currently under development. When completed
this model will permit the empirical determination of response alternative
order in the context of multidimensional θ_p. If an ordered version of the model
is used, with scoring functions [0, 1, 2, …, m − 1], this model is equivalent to
the multidimensional partial credit model described by Yao and Schwarz
(2006).

Conclusion
Reasonable questions may be raised about why the new parameterization of
the nominal model has been designed as described in the preceding section;
we try to answer some of the more obvious of those questions here:
Why is the linear term of the T matrix scaled between zero and m − 1, as opposed
to some other norming convention? It is planned that the implementation of
estimation for this new version of the nominal model will be in general
purpose computer software that, among other features, can mix models,
for example, for binary and multiple-category models. We also assume that
the software can fix parameters to any specified value, or set equal any sub-
set of the parameters. Some users may want to use Rasch family (Masters
and Wright, 1984) models, mixing the original Rasch (1960) model for the
dichotomous items and the PC or RS models for the polytomous items. To
accomplish a close approximation of that in a marginal maximum likelihood
estimation system, with a N(0, 1) population distribution setting scale for
the latent variable, a common slope (equal across items) must be specified
for all items (Thissen, 1982). For the dichotomous items that slope param-
eter is for the items scored 0, 1; for the polytomous items it is for item scores
0, 1, …, (m − 1). Thus, scaling the linear component of the scoring function
with unit steps facilitates the imposition of the equality constraints needed
for mixed Rasch family analysis. It also permits meaningful equality con-
straints between discrimination parameters for different item response mod-
els that are not in the Rasch family.
In the MIRT version of the model, the a^* parameters may be rescaled
after estimation is complete, to obtain values that have the properties of fac-
tor loadings, much as has been done for some time for the dichotomous
model in the software TESTFACT (du Toit, 2003).
Why does the user need to prespecify both the lowest and highest response cat-
egory (to set up the T matrix) for a nominal model? This is not as onerous as it
may first appear: When fitting the full-rank nominal model, one does not
have to correctly specify highest and lowest response categories. If the data
indicate another order, estimated values of a_k^s may be less than zero or exceed
m − 1, indicating the empirical scoring order. It is only necessary that the


item analyst prespecify two categories that are differently related to θ, such
that one is relatively lower and the other relatively higher, but even which
one is which may be incorrect, and that will appear as a negative value of a^*.
Presumably, when fitting a restricted (ordered) version of the model, the user
would have already fitted the unrestricted nominal model to determine or
check the empirical order of the response categories, or the user would have
confidence from some other source of information about the order.
Why not parameterize the model in slope-threshold form, instead of
slope-intercept form? Aren't threshold parameters easier to interpret in IRT?
While we fully understand the attraction, in terms of interpretability, for
threshold-style parameters in IRT models, there are several good reasons
to parameterize with intercepts for estimation. The first (oldest historically)
reason is that the slope-intercept parameterization is a much more numeri-
cally stable arrangement for estimating the parameters of logistic models,
due to a closer approximation of the likelihood to normality and less error
correlation among the parameters. A second reason is that the threshold
parameterization does not generalize to the multidimensional case in any
event; there is no way in a MIRT model to split the threshold among
dimensions, rendering a threshold parameterization more or less meaning-
less. We note here that, for models for which it makes sense, we can always
convert the intercept parameters into the corresponding item location and
threshold values for reporting, and in preceding sections we have given for-
mulas for doing so for the GPC model.
Why not use polynomial contrasts to obtain intermediate models, as proposed by
Thissen and Steinberg (1986) and implemented in MULTILOG (du Toit, 2003),
instead of the Fourier basis? An equally compelling question is to ask: Why
polynomials? The purpose of either basis is to provide smooth trends in the as
or cs across a set of response categories. Theory is not sufficient at this time to
specify a particular mathematical formulation for smoothness across catego-
ries in the nominal model. The Fourier basis accomplishes that goal as well
as polynomials, and is naturally orthogonal, which (slightly) simplifies the
implementation of the estimation algorithm.
In this chapter we have reviewed the development of Bock's (1972) nomi-
nal model, described its relation with other commonly used item response
models, illustrated some of its unique uses, and provided a revised param-
eterization for the model that we expect will render it more useful for future
applications in item analysis and test scoring. As IRT has come to be used in
more varying contexts, expanding its domain of application from its origins
in educational measurement into social and personality psychology, and the
measurement of health outcomes and quality of life, the need to provide
item analysis for items with polytomous responses with unknown scoring
order has increased. The reparameterized nominal model provides a useful
response to that challenge. Combined with the development of multidimen-
sional nominal item analysis (Cai et al., in preparation), the nominal model
represents a powerful component among the methods of IRT.


Appendix 1: Background of the Nominal Categories Model


R. Darrell Bock
The first step in the direction of the nominal model was an extension of Thurstone's (1927) method of paired comparisons to first choices among three or more objects. The objects can be anything for which subjects could be expected to have preferences: opinions on public issues, competing consumer products, candidates in an election, and so on. The observations for a set of m objects consist of the number of subjects who prefer object j to object k and the number who prefer k to j. Any given subject does not necessarily have to respond to all pairs. Thurstone proposed a statistical model for choice in which differences in the locations of the objects on a hypothetical scale of preference value predict the observed proportions of choice in all $m(m-1)/2$ distinct pairs. He assumed that a subject's response to the task of choosing between the objects depended upon a subjective variable for, say, object j,

$v_j = \mu_j + \epsilon_j$  (3.51)

where, in the population of respondents, $\epsilon_j$ is a random deviation distributed normally with mean 0 and variance $\sigma^2$. He called this variable a response process and assumed that the subject chooses the object with the larger process. Although the distribution of $\epsilon_j$ might have different standard deviations for each object and nonzero correlations between objects, this greatly complicates the estimation of differences between the means. Thurstone therefore turned his attention to the case V model, in which the standard deviations were assumed equal and all correlations assumed zero in all comparisons. With this simplification, the so-called comparatal process

$v_{jk} = v_j - v_k$  (3.52)

has mean $\mu_j - \mu_k$ and standard deviation $\sigma\sqrt{2}$, and the comparatal processes $v_{jk}$, $v_{jl}$ for object j have constant correlation $\tfrac{1}{2}$. Thurstone's solution to the estimation problem was to convert the response proportions to normal deviates and estimate the location differences by unweighted least squares, which requires only $m^2$ additions and m divisions. With modern computing machinery, solutions with better properties (e.g., weighted least squares or maximum likelihood) are now accessible (see Bock & Jones, 1968, Section 6.4.1). From the estimated locations, the expected proportions for each comparison are given by the cumulative normal distribution function, $\Phi(y)$, at $y = \mu_j - \mu_k$. These proportions can be used in chi-square tests of the goodness of fit of the paired comparisons model (Bock, 1956; Bock & Jones, 1968, Section 6.7.1).
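As an illustration of how simple the unweighted least-squares solution is, the following Python sketch (ours; the function and the proportion matrix are hypothetical) converts a matrix of choice proportions to normal deviates and averages them to obtain scale values, up to an arbitrary origin and the unit $\sigma\sqrt{2}$:

from statistics import NormalDist

def case_v_scale(P):
    """Unweighted least-squares case V scaling (a sketch).

    P[j][k] is the observed proportion preferring object j to object k.
    Scale values are row means of the normal deviates z_jk = Phi^{-1}(p_jk),
    with the diagonal contributing z_jj = 0, hence the division by m."""
    ndist = NormalDist()
    m = len(P)
    mu = []
    for j in range(m):
        z = [ndist.inv_cdf(P[j][k]) for k in range(m) if k != j]
        mu.append(sum(z) / m)
    return mu

# Three objects; proportions satisfy P[j][k] + P[k][j] = 1
P = [[0.5, 0.7, 0.9],
     [0.3, 0.5, 0.8],
     [0.1, 0.2, 0.5]]
print(case_v_scale(P))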


Extension to First Choices

The natural extension of the paired comparison case V solution to what might be called the method of first choices, that is, the choice of one preferred object in a set of m objects, is simply to assume that the $m - 1$ comparatal processes for object j,

$v_{jk} = v_j - v_k, \quad k = 1, 2, \ldots, m; \; k \neq j$  (3.53)

are distributed $(m-1)$-variate normal with means $\mu_j - \mu_k$, constant variance $2\sigma^2$, and constant correlation $\rho_{jk}$ equal to $\tfrac{1}{2}$ (Bock, 1956; 1975, Section 8.1.3). Expected probabilities of first choice for a given object then correspond to the $(m-1)$-fold multiple integral of the $(m-1)$-variate normal density function in the orthant from minus infinity up to the limits equal to the comparatal means.

For general multivariate normal distributions of high dimensionality, evaluation of orthant probabilities is computationally challenging even with modern equipment. Computing formulae and tables exist for the bivariate case (National Bureau of Standards, 1956) and the trivariate case (Steck, 1958), but beyond that, Monte Carlo approximation of the positive orthant probabilities appears to be the only recourse at the present time. Fortunately, much simpler procedures based upon a multivariate logistic distribution are now available for estimating probabilities of first choice. By way of introduction, the following section gives essential results for the univariate and bivariate logistic distributions.

The Univariate Logistic Distribution

Applied to the case V paired comparisons model, the univariate logistic distribution function can be expressed either in terms of the comparatal process $z = v_{jk}$:

$\Psi(z) = \dfrac{1}{1 + e^{-z}}$  (3.54)

or in terms of the separate processes $z_1 = v_j$ and $z_2 = v_k$:

$\Psi(z_1) = \dfrac{e^{z_1}}{e^{z_1} + e^{z_2}}$  (3.55)

under the constraint $z_1 + z_2 = 0$. Then $z_1 = -z_2$ and

$\Psi(z_2) = 1 - \Psi(z_1) = \dfrac{e^{z_2}}{e^{z_1} + e^{z_2}}$  (3.56)


In either case the distribution is symmetric with mean z = 0, where $\Psi(z) = \tfrac{1}{2}$, and variance $\pi^2/3$. The deviate z is called a logit, and the pair $z_1, z_2$ could be called a binomial logit.

The corresponding density function can be expressed in terms of the distribution function:

$\psi(z) = \dfrac{e^{-z}}{(1 + e^{-z})^2} = \Psi(z)[1 - \Psi(z)]$  (3.57)

Although $\Psi(z)$ is heavier in the tails than $\Phi(z)$, $\Psi(z)$ closely resembles $\Phi(z/1.7)$. Using the scale factor 1.7 in place of the variance-matching factor 1.81379 will bring the logistic probabilities closer to the normal over the full range of the distribution, with a maximum absolute difference less than 0.01 (Johnson, Kotz, & Balakrishnan, 1995, p. 119).
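A small Python check (ours, added for illustration) of the 1.7 approximation just described:

import math
from statistics import NormalDist

Phi = NormalDist().cdf

def Psi(z):
    """Standard logistic distribution function (Equation 3.54)."""
    return 1.0 / (1.0 + math.exp(-z))

# Maximum absolute difference between Psi(z) and Phi(z / 1.7) on a grid
grid = [i / 100.0 for i in range(-600, 601)]
max_diff = max(abs(Psi(z) - Phi(z / 1.7)) for z in grid)
print(round(max_diff, 4))  # below 0.01, as cited above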
An advantage of the logistic distribution over the normal is that the deviate corresponding to an observed proportion, P, is simply the log odds,

$z(P) = \log \dfrac{P}{1 - P}$  (3.58)

For that reason, logit linear functions are frequently used in the analysis of binomially distributed data (see Anscombe, 1956).
Inasmuch as the prediction of first choices may be viewed as an extreme value problem, it is of interest that Dubey (1969) derived the logistic distribution from an extreme value distribution of the double exponential type with mixing variable $\alpha$. Then the cumulative extreme value distribution function, conditional on $\alpha$, is

$F(x|\alpha) = \exp[-\alpha \exp(-x)]$  (3.59)

where $\alpha$ has the exponential density function $g(\alpha) = \exp(-\alpha)$. The corresponding extreme value density function is

$f(x|\alpha) = \alpha \exp(-x) \exp[-\alpha \exp(-x)], \quad \alpha > 0$  (3.60)

Integrating the conditional distribution function over the range of $g$ gives the distribution function of x:

$F(x) = \int_0^\infty F(x|\alpha)\, g(\alpha)\, d\alpha = [1 + \exp(-x)]^{-1}$  (3.61)

which we recognize as the logistic distribution.
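For completeness, the integration in Equation 3.61 is elementary; the intermediate step, added here for the reader, is

$F(x) = \int_0^\infty \exp[-\alpha \exp(-x)]\, \exp(-\alpha)\, d\alpha = \int_0^\infty \exp[-\alpha\,(1 + \exp(-x))]\, d\alpha = \dfrac{1}{1 + \exp(-x)}.$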


A Bivariate Logistic Distribution

The natural extension of the logistic distribution to the bivariate case is

$\Psi(x_1, x_2) = [1 + e^{-x_1} + e^{-x_2}]^{-1}$  (3.62)

with marginal distributions $\Psi(x_1)$ and $\Psi(x_2)$. The density function is

$\psi(x_1, x_2) = 2\,\Psi^3(x_1, x_2)\, e^{-x_1 - x_2}$  (3.63)

and the regression equations and corresponding conditional variances are

$E(x_1|x_2) = 1 + \log \Psi(x_2)$  (3.64)

$E(x_2|x_1) = 1 + \log \Psi(x_1)$  (3.65)

$V(x_1|x_2) = V(x_2|x_1) = \dfrac{\pi^2}{3} - 1$  (3.66)

This distribution is the simplest of three bivariate logistic distributions studied in detail by Gumbel (1961). It is similar to the bivariate normal distribution in having univariate logistic distributions as margins, but unlike the normal, the bivariate logistic density is asymmetric and the regression lines are curved (see Figure 3.6). Nevertheless, the distribution function gives probability values reasonably close to bivariate normal values when the 1.7 scale correction is used (see Bock and Jones (1968, Section 9.1.1) for some comparisons of bivariate normal and bivariate logistic probabilities).
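Because the regression equations above are reconstructed here from a damaged original, a numerical check is reassuring. This Python sketch (ours; the helper names are hypothetical) integrates the conditional density to verify $E(x_1|x_2) = 1 + \log \Psi(x_2)$:

import math

def Psi2(x1, x2):
    """Bivariate logistic distribution function (Equation 3.62)."""
    return 1.0 / (1.0 + math.exp(-x1) + math.exp(-x2))

def psi2(x1, x2):
    """Bivariate logistic density (Equation 3.63)."""
    return 2.0 * Psi2(x1, x2) ** 3 * math.exp(-x1 - x2)

def cond_mean_x1(x2, lo=-40.0, hi=40.0, n=8000):
    """E(x1 | x2) by numerical integration of the conditional density."""
    h = (hi - lo) / n
    marg = lambda x: math.exp(-x) / (1.0 + math.exp(-x)) ** 2  # psi(x2)
    num = sum((lo + i * h) * psi2(lo + i * h, x2) * h for i in range(n))
    return num / marg(x2)

x2 = 0.5
lhs = cond_mean_x1(x2)
rhs = 1.0 + math.log(1.0 / (1.0 + math.exp(-x2)))  # 1 + log Psi(x2)
print(round(lhs, 3), round(rhs, 3))  # the two values agree (Equation 3.64)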
[Figure 3.6. Contours of the bivariate logistic density. The horizontal and vertical axes are $x_1$ and $x_2$, respectively, as in Equation 3.64.]

A Multivariate Logistic Distribution

The natural extension of the bivariate logistic distribution to higher dimensions is

$\Psi(z) = \dfrac{e^{z_k}}{e^{z_1} + e^{z_2} + \cdots + e^{z_m}}, \quad k = 1, 2, \ldots, m$  (3.67)

where the elements of the vector $z = [z_1, z_2, \ldots, z_m]$ are constrained to sum to zero. This vector is referred to as a multinomial logit.

Although this extension of the logistic distribution to dimensions greater than two has been applied at least since 1967 (Bock, 1970; McFadden, 1974), its first detailed study was by Malik and Abraham (1973). They derived the m-variate logistic distribution from the m-fold product of independent univariate marginal conditional distributions of the Dubey (1969) extreme value distribution with mixing variable $\alpha$. Integrating over $\alpha$ gives

$F(\mathbf{x}) = \int_0^\infty \prod_{j=1}^m F(x_j|\alpha)\, g(\alpha)\, d\alpha = \left[1 + \sum_{k=1}^m e^{-x_k}\right]^{-1}$  (3.68)

The corresponding density function is

$f(\mathbf{x}) = m!\, \exp\left(-\sum_{k=1}^m x_k\right) \left[1 + \sum_{k=1}^m e^{-x_k}\right]^{-(m+1)}$  (3.69)

McFadden (1974) arrived at the same result by essentially the same method, although he does not cite Dubey (1969). Gumbel's bivariate distribution (above) is included for m = 2, margins of all orders up to m − 1 are multivariate logistic, and all univariate margins have mean zero and variance $\pi^2/3$. No comparison of probabilities for high-dimensional normal and logistic distributions has as yet been attempted.
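In computational terms, first-choice probabilities under Equation 3.67 amount to the familiar softmax transformation. A minimal Python sketch (ours, with hypothetical scale values; centering the values corresponds to the sum-to-zero constraint on the logit vector):

import math

def first_choice_probs(mu):
    """Probabilities of first choice among m objects under the multivariate
    logistic model (Equation 3.67): a softmax of the scale values."""
    e = [math.exp(m_) for m_ in mu]
    s = sum(e)
    return [ek / s for ek in e]

print(first_choice_probs([0.6, 0.1, -0.7]))  # three probabilities summing to 1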

Estimating Binomial and Multinomial Response Relations

If we substitute functions of external variables for normal or logistic deviates, we can study the relationships of these variables to the probabilities of first choice among the objects presented. In the two-category case, we refer to these as binomial response relations, and with more than two categories, as multinomial response relations. The analytical problem becomes one of estimating the coefficients of these functions rather than the logit itself. If the relationship is less than perfect, some goodness of fit will be lost relative to direct estimation of the logit (which is equivalent to estimating the category expected probabilities). The difference in the Pearson or likelihood ratio chi-square provides a test of the statistical significance of the loss. Examples of weighted least squares estimation of binomial response relations in paired comparison data, when the external variables represent a factorial or response-surface design on the objects, are shown in Section 7.3 of Bock and Jones (1968). Examples of maximum likelihood estimation of multinomial response relations appear in Bock (1970), McFadden (1974), and Chapter 8 of Bock (1975).

An earlier application of maximum likelihood in estimating binomial response relations appears in Bradley and Terry (1952). They assume the model

$\dfrac{\pi_j}{\pi_j + \pi_k}$  (3.70)

for the probability that object j is preferred to object k, but they estimated $\pi_j$ and $\pi_k$ directly rather than exponentiating, in order to avoid introducing a Lagrange multiplier to constrain the estimates to sum to unity.

Luce and Suppes (1965) generalized the Bradley-Terry model to multinomial data,

$\dfrac{\pi_j}{\pi_1 + \pi_2 + \cdots + \pi_m}$  (3.71)

but did not make the exponential transformation to the multinomial logit and did not apply the model in estimating multinomial response relations.

Binomial and Multinomial Response Relations in the Context of IRT

In item response theory we deal with data arising from two-stage sampling: in the first stage we sample respondents from some identified population, and in the second stage we sample responses of each respondent to some number of items, usually items from some form of psychological or educational test. Thus, there are two sources of random variation in the data: between respondents and between item responses. When the response is scored dichotomously, right/wrong or yes/no, for example, the logistic distribution for binomial data applies. If the scoring is polytomous, as when the respondent is choosing among several alternatives, for instance, in a multiple-choice test with recording of each choice, the logistic distribution for multinomial data applies. If the respondent's level of performance is graded polytomously in ordered categories, the multivariate logistic can still apply, but its parameterization must be specialized to reflect the assumed order of the categories.

In IRT the external variable is not an observable quantity, but rather an unobservable latent variable, usually designated by $\theta$, that measures the respondent's ability or other propensity. The binomial or multinomial logit is expressed as linear functions of $\theta$ containing parameters specific to each item. We refer to the functions that depend on $\theta$ as item response models. Item response models now in use (see Bock & Moustaki, 2007) include, for item j, the two-parameter logistic model, based on the binomial logistic distribution,

$\Psi(\theta) = [1 + \exp\{-(a_j\theta + c_j)\}]^{-1}$  (3.72)

and the nominal categories model, based on the multinomial logistic distribution,

$\Psi(\theta) = \dfrac{\exp(a_{jk}\theta + c_{jk})}{\sum_{l=1}^n \exp(a_{jl}\theta + c_{jl})}$  (3.73)

under the constraints $\sum_{l=1}^n a_{jl} = 0$ and $\sum_{l=1}^n c_{jl} = 0$.
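A minimal Python sketch (ours) of Equation 3.73, evaluating the category probabilities of a hypothetical three-category item at several values of $\theta$, with the parameter vectors centered to satisfy the constraints:

import math

def nominal_probs(theta, a, c):
    """Nominal categories model (Equation 3.73): category probabilities at
    theta for one item, with slopes a and intercepts c each summing to zero."""
    assert abs(sum(a)) < 1e-9 and abs(sum(c)) < 1e-9
    z = [ak * theta + ck for ak, ck in zip(a, c)]
    e = [math.exp(zk) for zk in z]
    s = sum(e)
    return [ek / s for ek in e]

# Hypothetical three-category item parameters, centered to sum to zero
a = [-1.0, 0.2, 0.8]
c = [-0.5, 0.7, -0.2]
for theta in (-2.0, 0.0, 2.0):
    print(theta, [round(p, 3) for p in nominal_probs(theta, a, c)])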


In empirical applications, the parameters of the item response models must be estimated in large samples of the two-stage data. Estimation of these parameters is complicated, however, by the presence of the propensity variable $\theta$, which is random in the first-stage sample. Because there are potentially different values of this variable for every respondent, there is no way to achieve convergence in probability as the number of respondents increases. We therefore proceed in the estimation by integrating over an assumed or empirically derived distribution of the latent variable. If the first-stage sample is large enough to justify treating the parameter estimates so obtained as fixed values, we can then use Bayes or maximum likelihood estimation to locate each respondent on the propensity dimension, with a level of precision dependent on the number of items.

The special merit of the nominal categories item response model is that no assumption about the order or other structure of the categories is required. Given that the propensity variable is one-dimensional, an ordering of the categories that is implicit in the data is revealed by the order of the coefficients $a_{jk}$ in the nominal model (see Bock & Moustaki, 2007).

References

Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.
Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 technical report (NCES 1999-452). Washington, DC: National Center for Education Statistics, Office of Educational Research and Improvement, U.S. Department of Education.
Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69–81.
Andersen, E. B. (1997). The rating scale model. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 67–84). New York: Springer.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In N. Brandon-Tuma (Ed.), Sociological methodology (pp. 33–80). San Francisco: Jossey-Bass.
Anscombe, F. J. (1956). On estimating binomial response relations. Biometrika, 35, 246–254.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed., revised and expanded). New York: Marcel Dekker.
Bergan, J. R., & Stone, C. A. (1985). Latent class models for knowledge domains. Psychological Bulletin, 98, 166–184.
Berkson, J. (1944). Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39, 357–375.
Berkson, J. (1953). A statistically precise and relatively simple method of estimating the bio-assay with quantal response, based on the logistic function. Journal of the American Statistical Association, 48, 565–599.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Reading, MA: Addison-Wesley.
Bock, R. D. (1956). A generalization of the law of comparative judgment applied to a problem in the prediction of choice [Abstract]. American Psychologist, 11, 442.
Bock, R. D. (1970). Estimating multinomial response relations. In R. C. Bose (Ed.), Contributions to statistics and probability (pp. 111–132). Chapel Hill, NC: University of North Carolina Press.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.
Bock, R. D. (1975). Multivariate statistical methods in behavioral research. New York: McGraw-Hill.
Bock, R. D. (1997). The nominal categories model. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 33–50). New York: Springer.
Bock, R. D., & Jones, L. V. (1968). The measurement and prediction of judgment and choice. San Francisco: Holden-Day.
Bock, R. D., & Moustaki, I. (2007). Item response theory in a general framework. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (Vol. 26, pp. 469–513). Amsterdam: Elsevier.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs. I. Method of paired comparisons. Biometrika, 39, 324–345.
Childs, R. A., & Chen, W.-H. (1999). Software note: Obtaining comparable item parameter estimates in MULTILOG and PARSCALE for two polytomous IRT models. Applied Psychological Measurement, 23, 371–379.
de Ayala, R. J. (1992). The nominal response model in computerized adaptive testing. Applied Psychological Measurement, 16, 327–343.
de Ayala, R. J. (1993). An introduction to polytomous item response theory models. Measurement and Evaluation in Counseling and Development, 25, 172–189.
Dubey, S. D. (1969). A new derivation of the logistic distribution. Naval Research Logistics Quarterly, 16, 37–40.
du Toit, M. (Ed.). (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT. Lincolnwood, IL: Scientific Software International.
Gumbel, E. J. (1961). Bivariate logistic distributions. Journal of the American Statistical Association, 56, 335–349.
Hoskens, M., & De Boeck, P. (1997). A parametric model for local dependence among test items. Psychological Methods, 2, 261–277.
Huber, M. (1993). An item response theoretical approach to scoring the Short Portable Mental Status Questionnaire for assessing cognitive status of the elderly. Unpublished master's thesis, Department of Psychology, University of North Carolina, Chapel Hill.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1995). Continuous univariate distributions (2nd ed., Vol. 2). New York: Wiley.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Luce, R. D., & Suppes, P. (1965). Preference, utility, and subjective probability. In R. D. Luce & R. R. Bush (Eds.), Handbook of mathematical psychology (Vol. 3, pp. 249–410). New York: Wiley.
Malik, H., & Abraham, B. (1973). Multivariate logistic distributions. Annals of Statistics, 1, 588–590.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Masters, G. N., & Wright, B. D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529–544.
Masters, G. N., & Wright, B. D. (1997). The partial credit model. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 101–122). New York: Springer.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), Frontiers of econometrics (pp. 105–142). New York: Academic Press.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
Muraki, E. (1997). A generalized partial credit model. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 153–164). New York: Springer.
National Bureau of Standards. (1956). Tables of the bivariate normal distribution function and related functions. Applied Mathematics Series, Number 50.
Ramsay, J. O. (1995). Testgraf: A program for the graphical analysis of multiple-choice test and questionnaire data (Technical report). Montreal: McGill University, Department of Psychology.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 4, pp. 321–333). Berkeley: University of California Press.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph, No. 17.
Samejima, F. (1979). A new family of models for the multiple choice item (Research Report 79-4). Knoxville: University of Tennessee, Department of Psychology.
Samejima, F. (1988). Comprehensive latent trait theory. Behaviormetrika, 15, 1–24.
Samejima, F. (1996). Evaluation of mathematical models for ordered polychotomous responses. Behaviormetrika, 23, 17–35.
Samejima, F. (1997). Graded response model. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 85–100). New York: Springer.
Steck, G. P. (1958). A table for computing trivariate normal probabilities. Annals of Mathematical Statistics, 29, 780–800.
Sympson, J. B. (1983, June). A new IRT model for calibrating multiple choice items. Paper presented at the annual meeting of the Psychometric Society, Los Angeles.
Thissen, D. (1982). Marginal maximum likelihood estimation for the one-parameter logistic model. Psychometrika, 47, 175–186.
Thissen, D., Nelson, L., Rosa, K., & McLeod, L. D. (2001). Item response theory for items scored in more than two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (Chap. 4, pp. 141–186). Mahwah, NJ: Lawrence Erlbaum Associates.
Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (Chap. 3, pp. 73–140). Mahwah, NJ: Lawrence Erlbaum Associates.
Thissen, D., & Steinberg, L. (1984). A response model for multiple-choice items. Psychometrika, 49, 501–519.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567–577.
Thissen, D., & Steinberg, L. (1988). Data analysis using item response theory. Psychological Bulletin, 104, 385–395.
Thissen, D., Steinberg, L., & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple-categorical-response models. Journal of Educational Measurement, 26, 247–260.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 278–286.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185–201.
Wainer, H., Thissen, D., & Sireci, S. G. (1991). DIFferential testlet functioning: Definitions and detection. Journal of Educational Measurement, 28, 197–219.
Yao, L., & Schwarz, R. D. (2006). A multidimensional partial credit model with associated item and test statistics: An application to mixed-format tests. Applied Psychological Measurement, 30, 469–492.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–214.

Chapter 4
The General Graded Response Model

Fumiko Samejima
University of Tennessee

Editor Introduction: This chapter outlines a framework that encompasses most of the specific polytomous IRT models mentioned in this book. The place of the models within the framework is described with particular attention given to models that Samejima has developed. Prior to elaborating this framework, a set of criteria for evaluating different models is proposed.

Personal Reflections on the Origins of the Model


When item response theory (IRT) originated and was developed in psychology and sociology in the 1940s, 1950s, and the first half of the 1960s, the theory only dealt with dichotomous responses, where there are only two item score categories, for example, correct and incorrect in ability measurement, true and false in personality measurement. As a graduate student I was very much impressed by Fred Lord's (1952) Psychometric Monograph A Theory of Mental Test Scores and could foresee great potential for latent trait models. It seemed that the first thing to be done was to expand IRT to enable it to deal with ordered, multicategory responses and enhance its applicability, not only in psychology, sociology, and education, but also in many other social and natural science areas. That opportunity came when I was invited to spend one year as visiting research psychologist in the Psychometric Research Group of the Educational Testing Service (ETS), Princeton, New Jersey, in 1966. The essential outcomes of the research conducted during my first year in the United States were published in Samejima (1969).

A subsequent invitation to work in the psychometric laboratory at the University of North Carolina at Chapel Hill allowed continuation of the initial work. The essential outcomes of the research conducted in 1967–1968 were published in Samejima (1972). This second monograph is as important as the first, and the two monographs combined propose the fundamental tenets of the general graded response model framework.


In recent years, more and more researchers have started citing these two Psychometrika monographs in their research. In this chapter I will try to correct common misunderstandings among researchers, as well as introduce and explain further developments in the general graded response model.

Rationale

In the present chapter, unidimensional latent trait models are almost exclusively discussed, where the latent trait $\theta$ assumes any real number. The general graded response model is a comprehensive mathematical model that provides the general structure of latent trait models that deal with cases in which the response to item g is the smallest observable unit for measuring the latent trait and, subsequently, one of the graded item scores, or ordered polytomous item scores, $x_g = 0, 1, 2, \ldots, m_g$ ($m_g \geq 1$), is assigned to each response. The highest score, $m_g$, can be any positive integer, and the general graded response model does not require all items in a test or questionnaire to have the same values of $m_g$. This is a great advantage, and it makes it possible to mix dichotomous response items with those whose $m_g$s are greater than unity. Models that belong to the general framework discussed in this chapter include the normal ogive model, the logistic model, the graded response model expanded from the logistic positive exponent family of models (Samejima, 2008), the acceleration model, and the models expanded from Bock's nominal response model. Thus, the framework described here applies for any of these models.

The graded response model (GRM) was proposed by Samejima (1969, 1972) to provide a general theoretical framework for dealing with graded item scores, $0, 1, 2, \ldots, m_g$, in item response theory (IRT), whereas in the original IRT the item scores were limited to 0 and 1. As is explained later in this chapter, the logistic model is a specific model that belongs to the GRM. Because the logistic model was applied to empirical data in the early years, as exemplified by Roche, Wainer, and Thissen (1975), however, researchers started treating the logistic model as if it were the GRM. Reading this chapter, the reader will realize that the GRM is a very comprehensive concept that includes the normal ogive model, the logistic model, the model expanded from the logistic positive exponent family of models, the BCK-SMJ model, the acceleration model, and so on. Correct terminology is important; otherwise, correct research will become impossible.

The latent trait can be any construct that is hypothesized to be behind observable items, such as the way in which ability is behind performance on problem-solving questions, general attitude toward war is represented by responses to peace/war-oriented statements, maturity of human bodies is represented by experts' evaluations of x-ray films, and so on.

Throughout this paper, the latent trait is denoted by $\theta$, which assumes any real number in $(-\infty, \infty)$, except for the case where the multidimensional latent space is considered.


The general graded response model framework (Samejima, 1997, 2004) is based on the following five functions for each graded item score $x_g$:

1. Processing function (PRF), $M_{x_g}(\theta)$ ($x_g = 0, 1, 2, \ldots, m_g, m_g + 1$). This is a joint conditional probability, given $\theta$, and given that the individual has passed the preceding process. Specifically, $M_{x_g}(\theta) = 1$ for $x_g = 0$ and $M_{x_g}(\theta) = 0$ for $x_g = m_g + 1$ for all $\theta$, respectively, since there is no process preceding $x_g = 0$, and $m_g + 1$ is a nonexistent, imaginary graded score no one can attain.

2. Cumulative operating characteristic (COC), $P^*_{x_g}(\theta)$ ($x_g = 0, 1, 2, \ldots, m_g, m_g + 1$), defined by

$P^*_{x_g}(\theta) \equiv \mathrm{prob}[X_g \geq x_g \mid \theta] = \prod_{u \leq x_g} M_u(\theta).$  (4.1)

This is the conditional probability, given $\theta$, that the individual gets the graded score $x_g$ or greater. In particular, $P^*_{x_g}(\theta) = 1$ for $x_g = 0$ and $P^*_{x_g}(\theta) = 0$ for $x_g = m_g + 1$ for the entire range of $\theta$, since everyone obtains a score of 0 or greater, and no one gets a score of $m_g + 1$ or greater. Also from Equation 4.1, $P^*_{x_g}(\theta) = M_{x_g}(\theta)$ when $x_g = 0$ for the entire range of $\theta$.
Terminology Note: Elsewhere in this book and in the literature this function is called a category boundary function or a threshold function. In another difference, we and other authors in this book have indexed this and other types of polytomous response functions using i for items and k for the category response (e.g., $P^*_{ik}$), while the author uses g for items, following Lord & Novick (1968, Chap. 16). Note that in addition to using different letters to index item and category components, Samejima also indexes the category response (x) prior to the item index (g), where other authors index the item prior to the category response.

3. Operating characteristic (OC), $P_{x_g}(\theta)$ ($x_g = 0, 1, 2, \ldots, m_g$), defined by

$P_{x_g}(\theta) \equiv \mathrm{prob}[X_g = x_g \mid \theta] = P^*_{x_g}(\theta) - P^*_{x_g + 1}(\theta).$  (4.2)

This is the conditional probability, given $\theta$, that the individual obtains a specific graded score $x_g$.

Note that when $m_g = 1$, from Equations 4.1 and 4.2, both $P^*_{x_g}(\theta)$ and $P_{x_g}(\theta)$ for $x_g = 1$ become the item characteristic function (ICF; Lord & Novick, 1968, Chap. 16) for a dichotomous item. Thus, a specific graded response model, defined in this way, models dichotomous item responses as a special case, and so the general graded response model framework also applies to dichotomous response items.

4. Basic function (BSF), $A_{x_g}(\theta)$ ($x_g = 0, 1, 2, \ldots, m_g$), defined by

$A_{x_g}(\theta) \equiv \frac{\partial}{\partial\theta} \log P_{x_g}(\theta) = [P_{x_g}(\theta)]^{-1} \frac{\partial}{\partial\theta} P_{x_g}(\theta).$  (4.3)

It is obvious that the basic function exists as long as $P_{x_g}(\theta)$ is positive for the entire range of $\theta$, and is differentiable with respect to $\theta$.

5. Item response information function (IRIF), $I_{x_g}(\theta)$ ($x_g = 0, 1, 2, \ldots, m_g$), defined by

$I_{x_g}(\theta) \equiv -\frac{\partial^2}{\partial\theta^2} \log P_{x_g}(\theta) = -\frac{\partial}{\partial\theta} A_{x_g}(\theta)$  (4.4)

$= [P_{x_g}(\theta)]^{-2} \left\{ \frac{\partial}{\partial\theta} P_{x_g}(\theta) \right\}^2 - [P_{x_g}(\theta)]^{-1} \frac{\partial^2}{\partial\theta^2} P_{x_g}(\theta).$

One can conclude that the item response information function exists as far as $P_{x_g}(\theta)$ is positive for the entire range of $\theta$, and is twice differentiable with respect to $\theta$. (A numerical sketch of these five functions follows below.)
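To make these definitions concrete, here is a small numerical sketch in Python (ours, for illustration only). It evaluates the OC, BSF, and IRIF for a hypothetical item under the logistic COCs of Equation 4.22, which is introduced later in this chapter; derivatives are taken by central differences rather than analytically.

import math

def coc(theta, a, b, D=1.7):
    """Cumulative operating characteristic, logistic form (Equation 4.22)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def oc(theta, a, bs, x):
    """Operating characteristic (Equation 4.2): P*_x(theta) - P*_{x+1}(theta).
    bs lists b_1, ..., b_m; b_0 = -infinity and b_{m+1} = +infinity implicitly."""
    upper = coc(theta, a, bs[x - 1]) if x >= 1 else 1.0
    lower = coc(theta, a, bs[x]) if x < len(bs) else 0.0
    return upper - lower

def bsf(theta, a, bs, x, h=1e-5):
    """Basic function (Equation 4.3): d/dtheta of log P_x, by central differences."""
    return (math.log(oc(theta + h, a, bs, x))
            - math.log(oc(theta - h, a, bs, x))) / (2 * h)

def irif(theta, a, bs, x, h=1e-4):
    """Item response information function (Equation 4.4): -d/dtheta of the BSF."""
    return -(bsf(theta + h, a, bs, x) - bsf(theta - h, a, bs, x)) / (2 * h)

# A hypothetical item with m_g = 3
a, bs = 1.0, [-1.0, 0.0, 1.0]
for x in range(4):  # x_g = 0, 1, 2, 3
    print(x, round(oc(0.3, a, bs, x), 4), round(irif(0.3, a, bs, x), 4))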

Thissen and Steinberg (1986) called the normal ogive model and the logistic model for graded responses difference models, and models for graded responses expanded from Bock's (1972) nominal model divide-by-total models. The naming may be a little misleading, however, because the general framework for graded response models that has been introduced above accommodates both of Thissen's two categories. From Equation 4.2 we see that for any $x_g$ ($= 0, 1, 2, \ldots, m_g$),

$P^*_{x_g}(\theta) \geq P^*_{x_g + 1}(\theta)$  (4.5)

for the entire range of $\theta$, in order to satisfy the definition of the operating characteristic $P_{x_g}(\theta)$, because it is a conditional probability.

Item Information Function

Samejima (1973b) defined the item information function, $I_g(\theta)$, for the general graded response item g as the conditional expectation of the IRIF, given $\theta$, that was defined by Equation 4.4. Thus it can be written

$I_g(\theta) \equiv E[I_{x_g}(\theta) \mid \theta] = \sum_{x_g = 0}^{m_g} I_{x_g}(\theta)\, P_{x_g}(\theta).$  (4.6)

Note that this item information function (IIF) of the graded item g includes Birnbaum's (1968) item information function for dichotomous responses as a special case. To simplify the notation, let the item characteristic function (ICF) for dichotomous responses (Lord & Novick, 1968, Chapter 16) be represented as

$P_g(\theta) \equiv \mathrm{prob}[X_g = 1 \mid \theta] = P_{x_g}(\theta;\, x_g = m_g),$  (4.7)

where $m_g = 1$, and

$Q_g(\theta) \equiv \mathrm{prob}[X_g = 0 \mid \theta] = P_{x_g}(\theta;\, x_g = 0) = 1 - P_g(\theta).$  (4.8)

Let $P'_g(\theta)$ and $Q'_g(\theta)$ denote their first derivatives with respect to $\theta$, respectively. Due to the complementary relationship of Equations 4.7 and 4.8 we can see that

$Q'_g(\theta) = -P'_g(\theta)$  (4.9)

and from Equation 4.9,

$Q''_g(\theta) = -P''_g(\theta),$  (4.10)

where $P''_g(\theta)$ and $Q''_g(\theta)$ denote the second derivatives of $P_g(\theta)$ and $Q_g(\theta)$ with respect to $\theta$, respectively.

From Equations 4.4 and 4.7 to 4.10 we can now rewrite our IRIF as

$I_{u_g}(\theta) = [Q_g(\theta)]^{-2} [P'_g(\theta)]^2 + [Q_g(\theta)]^{-1} P''_g(\theta)$ for $u_g = 0$,
$I_{u_g}(\theta) = [P_g(\theta)]^{-2} [P'_g(\theta)]^2 - [P_g(\theta)]^{-1} P''_g(\theta)$ for $u_g = 1$.  (4.11)

Thus from Equations 4.6 and 4.11 the IIF for a dichotomous response item can be written as

$I_g(\theta) = I_{u_g}(\theta;\, u_g = 0)\, Q_g(\theta) + I_{u_g}(\theta;\, u_g = 1)\, P_g(\theta)$  (4.12)

$= [Q_g(\theta)]^{-1} [P'_g(\theta)]^2 + P''_g(\theta) + [P_g(\theta)]^{-1} [P'_g(\theta)]^2 - P''_g(\theta)$

$= [P'_g(\theta)]^2 \left[ \{Q_g(\theta)\}^{-1} + \{P_g(\theta)\}^{-1} \right]$

$= [P'_g(\theta)]^2 [P_g(\theta) + Q_g(\theta)] [P_g(\theta)\, Q_g(\theta)]^{-1}$

$= [P'_g(\theta)]^2 [P_g(\theta)\, Q_g(\theta)]^{-1}.$

The last expression of Equation 4.12 equals Birnbaum's (1968) IIF for the dichotomous response item (p. 454).
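A quick numerical check (ours) of the last expression of Equation 4.12, using a two-parameter logistic ICF with hypothetical parameters; for the 2PL the derivative satisfies $P'_g(\theta) = Da_g P_g(\theta) Q_g(\theta)$, so the result should equal $(Da_g)^2 P_g(\theta) Q_g(\theta)$:

import math

def P(theta, a=1.5, b=-0.3, D=1.7):
    """A two-parameter logistic ICF (hypothetical parameters)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def dP(theta, h=1e-5):
    """First derivative of the ICF by central differences."""
    return (P(theta + h) - P(theta - h)) / (2 * h)

theta = 0.8
info = dP(theta) ** 2 / (P(theta) * (1.0 - P(theta)))  # Equation 4.12
print(round(info, 4))
print(round((1.7 * 1.5) ** 2 * P(theta) * (1.0 - P(theta)), 4))  # same value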


Expansion of Latent Trait Models for Dichotomous Responses to Those for Graded Responses

It is noted from Equation 4.1 that the definition of the COC, $P^*_{x_g}(\theta)$, of the graded response score $x_g$ becomes the ICF, $P_g(\theta)$, that is defined by Equation 4.7, if $X_g$ is replaced by the binary item score $U_g$ and $m_g$ is 1. This implies that expansion of the general dichotomous response model to the general graded response model can be done straightforwardly, with the restriction of Equation 4.5.

For example, suppose that the final grade of a mathematics course that all math majors are required to pass is based on five letter grades, A, B, C, D, and F. For these graded responses, $m_g = 4$. When we reclassify all math majors into pass and fail, there are, in general, $m_g$ different ways to set the borderline of pass and fail: between (1) A and B, (2) B and C, (3) C and D, and (4) D and F. It is noted that Way 1 is the strictest in passing math majors, Way 4 is the most generous, Way 2 is moderately strict, and Way 3 is moderately generous.

Now the course grade has been changed to a set of two grade categories from five, in four different ways, and in each case the item characteristic function $P_g(\theta)$ that is defined by Equation 4.7 can be specified. Note that these four ICFs equal the COCs, that is, the $P^*_{x_g}(\theta)$s, for the letter grades A, B, C, and D, respectively.
[Figure 4.1. OCs of five letter grades A, B, C, D, and F shown as the differences of COCs.]

Figure 4.1 illustrates these $m_g = 4$ ICFs. Because Way 1 is the strictest in passing math majors and Way 4 is the most generous, it is natural that their ICFs are located at the right-most and left-most parts of Figure 4.1,

respectively, and the other two ICFs are positioned and ordered between the two with respect to their levels of generosity. These curves also satisfy Equation 4.5, as is obvious from the nature of the recategorizations. Thus it is clear from the definitions of pass and fail and Equation 4.2 that the OCs for A, B, C, D, and F are given as the differences of the two adjacent curves, given $\theta$, as indicated in Figure 4.1. Note that these curves do not have to be identical in shape or point symmetric, but should only satisfy Equation 4.5.

Terminology Note: As the author points out, it is not necessary for the curves in Figure 4.1 to have identical shapes. This is a key distinction between the heterogeneous and the homogeneous models. In the homogeneous case the COC forms are always parallel, whereas in the heterogeneous case they are not necessarily parallel.

The above explanation may be the easiest way to understand the transition from models for dichotomous responses to those for graded responses. Because of this close relationship between the ICFs for dichotomous responses and the COCs for graded responses, in the following sections specific mathematical models that belong to the general graded response model will be represented by their COCs in most cases.

Unique Maximum Condition

Terminology Note: The unique maximum condition is an important concept in the graded
response model framework and will be referred to throughout this chapter. In essence this condition
requires that for a given model and for a given likelihood function of the specific response pattern,
there exists a single maximum point that can be used as the estimate of the latent trait.

In IRT, the test score is practically useless in estimating the individual's latent trait level, even though asymptotically there is a one-to-one correspondence between the test score and the latent trait $\theta$. (Note that no tests have infinitely many items.)

The main reason is that the use of the test score will reduce the local accuracy of latent trait estimation (cf. Samejima, 1969, Chap. 6, pp. 43–45; 1996b), unless there exists a test score or some summary of the response pattern that is a sufficient statistic for the model, as in Rasch's (1960) model or the logistic model (Birnbaum, 1968) for dichotomous responses. In general, the direct use of the response pattern, or the sequence of item scores $x_g$, rather than a single test score, is therefore strongly encouraged for estimating individuals' latent traits.

Although the proposal of the logistic model as a substitute for the normal ogive model was a big contribution in the 1960s, in these days when most researchers have access to electronic computers, there is little need for the substitution of a model by another that has a sufficient statistic.


Let v be a specified response pattern, or a sequence of specified graded item scores, that includes a sequence of specified binary item scores as a special case, such that $v = (x_1, x_2, \ldots, x_n)$ for a set of n items. Because of local independence (Lord & Novick, 1968, Chap. 16), it can be written

$L_v(\theta) \equiv P_v(\theta) = \prod_{x_g \in v} P_{x_g}(\theta),$  (4.13)

where $L_v(\theta)$ is the likelihood function for the specific response pattern $V = v$, and $P_v(\theta)$ denotes the conditional probability of the response pattern v, given $\theta$. Using Equations 4.3 and 4.13, the likelihood equation is given by

$\frac{\partial}{\partial\theta} \log L_v(\theta) = \sum_{x_g \in v} \frac{\partial}{\partial\theta} \log P_{x_g}(\theta) = \sum_{x_g \in v} A_{x_g}(\theta) = 0.$  (4.14)

Thus, there exists a sufficient condition for a specific graded response model to provide a unique maximum for any likelihood function (i.e., for each and every response pattern); the condition is that both of the following requirements are met:

1. The basic function $A_{x_g}(\theta)$ of each and every graded score $x_g$ of each item g is strictly decreasing in $\theta$.
2. Its upper and lower asymptotes are nonnegative and nonpositive, respectively.

For brevity, this condition is called the unique maximum condition (Samejima, 1997, 2004).
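To illustrate how directly the unique maximum condition supports estimation, the following self-contained Python sketch (ours; the items and response pattern are hypothetical) solves the likelihood equation (4.14) for the logistic model of Equation 4.22 by bisection on the sum of basic functions, which is strictly decreasing when the condition holds:

import math

def coc(theta, a, b, D=1.7):
    """Logistic COC (Equation 4.22); infinite b handled by convention."""
    if b == float("-inf"):
        return 1.0
    if b == float("inf"):
        return 0.0
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def basic(theta, a, bs, x, D=1.7):
    """Basic function A_x(theta) for the logistic model. For logistic COCs,
    A_x(theta) = D * a * (1 - P*_x - P*_{x+1}), strictly decreasing in theta.
    (This closed form is our algebraic simplification, easily verified.)"""
    b_all = [float("-inf")] + list(bs) + [float("inf")]
    return D * a * (1.0 - coc(theta, a, b_all[x]) - coc(theta, a, b_all[x + 1]))

def mle(items, pattern, lo=-8.0, hi=8.0, tol=1e-8):
    """Solve Equation 4.14, sum of basic functions = 0, by bisection."""
    score = lambda t: sum(basic(t, a, bs, x) for (a, bs), x in zip(items, pattern))
    if score(lo) < 0 or score(hi) > 0:
        return None  # terminal maximum outside the search interval
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if score(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Two hypothetical items (a_g, [b_1, ..., b_m]) and a response pattern
items = [(1.0, [-1.0, 0.0, 1.0]), (1.5, [-0.5, 0.5, 1.5])]
print(mle(items, pattern=(2, 1)))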
On the dichotomous response level, such frequently used models as the normal ogive model, the logistic model, and all models that belong to the logistic positive exponent family of models satisfy the unique maximum condition. A notable exception is the three-parameter logistic model (3PL; Birnbaum, 1968). In that model, for $x_g = 1$, the unique maximum condition is not satisfied, and it is quite possible that for some response patterns the unique MLE does not exist (for details see Samejima, 1973b).

An algorithm for writing all the basic functions and finding the solutions of Equation 4.14 for all possible response patterns is easy and straightforward for most models that satisfy the unique maximum condition, so the unique local or terminal maximum likelihood estimate (MLE) of $\theta$ can be found easily without depending on the existence of a sufficient statistic.

It should be noted that, when a set of n items do not follow a single model, but follow several different models, as long as all of these models satisfy the unique maximum condition, a unique local or terminal maximum of the likelihood function of each and every possible response pattern is also assured to exist.


Criteria for Evaluating Specific Graded Response Models

Samejima (1996a) proposed five different criteria to evaluate a latent trait model from a substantive point of view:

1. The principle behind the model and the set of accompanying assumptions agree with the psychological nature that underlies the data.
2. Additivity 1; that is, if the existing graded response categories get finer (e.g., pass and fail are changed to A, B, C, D, and F), their OCs can still be specified in the same model.
3. Additivity 2; that is, following a combination of two or more adjacent response categories (e.g., A, B, C, D, and F to pass and fail), the OCs of the newly combined categories can still be specified in the same mathematical form. (If additivities 1 and 2 hold, the model can be naturally expanded to a continuous response model.)
4. The model satisfies the unique maximum condition (discussed above).
5. Modal points of the OCs of the $m_g + 1$ graded response categories are ordered in accordance with the graded item scores, $x_g = 0, 1, 2, \ldots, m_g$.

Of these five criteria, the first is related to the data to which the model is applied, but the other four can be used strictly mathematically.

Response Pattern and Test Information Functions

Because the specific response pattern v is the basis of ability estimation, it is necessary to consider the information function provided by v. Samejima (1973b) defined the response pattern information function (RPIF), $I_v(\theta)$, as

$I_v(\theta) \equiv -\frac{\partial^2}{\partial\theta^2} \log P_v(\theta).$  (4.15)

Equation 4.15 is analogous to the definition of the IRIF, $I_{x_g}(\theta)$, given earlier by Equation 4.4. Using Equations 4.13, 4.14, and 4.4, this can be changed to

$I_v(\theta) = -\sum_{x_g \in v} \frac{\partial^2}{\partial\theta^2} \log P_{x_g}(\theta) = \sum_{x_g \in v} I_{x_g}(\theta),$  (4.16)

indicating that the RPIF can be obtained as the sum of all IRIFs for $x_g \in v$.

The test information function (TIF), $I(\theta)$, in the general graded response model is defined by the conditional expectation, given $\theta$, of the RPIF, $I_v(\theta)$, as analogous to the relationship between the IIF and IRIFs. Thus,

$I(\theta) \equiv E[I_v(\theta) \mid \theta] = \sum_v I_v(\theta)\, P_v(\theta).$  (4.17)

Since it can be written that

$P_{x_g}(\theta) = \sum_{v \ni x_g} P_v(\theta),$  (4.18)

we obtain from Equations 4.17 to 4.18 and 4.6

$I(\theta) = \sum_v \sum_{x_g \in v} I_{x_g}(\theta)\, P_v(\theta) = \sum_{g=1}^n \sum_{x_g=0}^{m_g} I_{x_g}(\theta)\, P_{x_g}(\theta) = \sum_{g=1}^n I_g(\theta).$  (4.19)

Note that this outcome, that the test information function equals the sum total of the item information functions, is true only if the individual's ability estimation is based on that individual's response pattern and not on an aggregate of it, such as a test score, unless it is a simple sufficient statistic, as is the case with the Rasch model. Otherwise, the test information function assumes a value less than the sum total of the item information functions (Samejima, 1996b). The final outcome of Equation 4.19, that the TIF equals the sum total of IIFs over all items in the test, questionnaire, and so on, is the same as the outcome for the general dichotomous response model (cf. Birnbaum, 1968, Chap. 20).

It should be noted that, because of the simplicity of the above outcome, that is, the test information function equals the sum total of the item information functions, researchers tend to take it for granted. It is necessary, however, that the reader understand how this outcome was obtained based on the definitions of the TIF and the RPIF, in order to apply IRT properly and innovatively.

Latent Trait Models in the Homogeneous Case

All specific latent trait models that belong to the general graded response framework can be categorized into the homogeneous case and the heterogeneous case. To give some examples, such models as the normal ogive model and the logistic model belong to the former, and the graded response model expanded from the logistic positive exponent family of models (Samejima, 2008), the acceleration model (Samejima, 1995), and the graded response models expanded from Bock's (1972) nominal response model belong to the latter.

Terminology Note: As mentioned earlier, the essential difference between the homogeneous case and the heterogeneous case is whether or not the shapes of the COCs vary from one score category to the next. In the homogeneous case the COCs are parallel, whereas in the heterogeneous case they are not parallel.

Rationale Behind the Models in the Homogeneous Case

Lord set out a hypothetical relation between the dichotomous item score $u_g$ and the latent trait $\theta$ that leads to the normal ogive model (cf. Lord & Novick, 1968, Section 16.6). He assumes a continuous variable $Y_g$ behind the item score $u_g$ and a critical value $\gamma_g$, as well as the following:

1. An individual a will get $u_g = 1$ (e.g., pass) if $Y_{ga} \geq \gamma_g$, and if $Y_{ga} < \gamma_g$, the individual will obtain $u_g = 0$ (e.g., fail). This assumption may be reasonable if the reader considers the fact that within a group of individuals who get credit for solving problem g there are diversities of levels of ability; that is, some individuals may solve it very easily while some others may barely make it after having struggled a lot, and so on.

[Figure 4.2. Illustration of a hypothesized continuous variable $Y_g$ underlying graded response models in the homogeneous case, with critical values $\gamma_{g0}$ and $\gamma_{g1}$, the linear regression $E(Y_g \mid \theta)$, and the regions corresponding to $P_0(\theta)$, $P_1(\theta)$, and $P_2(\theta)$ plotted against the latent trait.]

2. The regression (conditional expectation) of $Y_g$ on $\theta$ is linear.
3. The conditional distribution of $Y_g$, given $\theta$, is normal.
4. The variance of these conditional distributions is the same for all $\theta$.

The figure that Lord used for the normal ogive model for dichotomous responses (Lord & Novick, 1968, Figure 16.6.1) is illustrated in Figure 4.2. In this figure, two critical values, $\gamma_{g0}$ and $\gamma_{g1}$ ($\gamma_{g0} < \gamma_{g1}$), are used instead of a single $\gamma_g$, as is also shown in Figure 4.2, and Hypothesis 1 is changed to Hypothesis 1*: Individual a will get $x_g = 2$ (e.g., honor pass) if $Y_{ga} \geq \gamma_{g1}$, $x_g = 1$ (e.g., pass) if $\gamma_{g0} \leq Y_{ga} < \gamma_{g1}$, and $x_g = 0$ (e.g., fail) if $Y_{ga} < \gamma_{g0}$. As is obvious from Equations 4.1 and 4.2, the shaded area for the interval $[\gamma_{g1}, \infty)$ indicates the OC for $x_g = 2$ of item g; for $[\gamma_{g0}, \gamma_{g1})$, the OC for $x_g = 1$; and for $(-\infty, \gamma_{g0})$, the OC for $x_g = 0$, at each of the two levels of $\theta$ shown in Figure 4.2.

The above example leads to the normal ogive model for graded responses when $m_g = 2$. By increasing the number of critical values, $\gamma_g$s, however, a similar rationale can be applied for any positive integer $m_g$.

It should also be noted that Hypotheses 3 and 4 can be replaced by any other conditional density functions, symmetric or asymmetric, insofar as their shapes are identical at all fixed values of $\theta$. All such models are said to belong to the homogeneous case. Thus, a model belonging to the homogeneous case does not imply that its COCs are point symmetric for $x_g = 1, 2, \ldots, m_g$, nor that its OCs provide symmetric curves for $x_g = 1, 2, \ldots, m_g - 1$, although both the normal ogive and logistic models have these properties.


From the above definition and observations, it is clear that any graded response model that belongs to the homogeneous case satisfies additivities 1 and 2, which were introduced earlier as criteria for evaluating mathematical models for graded responses.

From Equation 4.1 it can be seen that a common feature of the models in the homogeneous case is that the cumulative operating characteristics, $P^*_{x_g}(\theta)$s, for $x_g = 1, 2, \ldots, m_g$ are identical in shape except for their positions on the $\theta$ dimension, which are ordered in accordance with the graded score $x_g$.

Normal Ogive Model (NMLOG)

The rationale behind the normal ogive model was provided earlier as an example of the rationale behind any model that belongs to the homogeneous case. In the normal ogive model, the COC is specified by

$P^*_{x_g}(\theta) = [2\pi]^{-1/2} \int_{-\infty}^{a_g(\theta - b_{x_g})} \exp\left[-\frac{z^2}{2}\right] dz,$  (4.20)

where $a_g$ denotes the item discrimination parameter and $b_{x_g}$ is the item response difficulty parameter, the latter of which satisfies

$-\infty = b_0 < b_1 < \cdots < b_{m_g} < b_{m_g + 1} = \infty.$  (4.21)



Figures 4.3a and 4.3b illustrate the OCs in the normal ogive model for two different items, both with $m_g = 5$ but having different $a_g$ and $b_{x_g}$s: for the item in Figure 4.3a, $a_g = 1.0$ and $b_{x_g} = -1.50, -0.50, 0.00, 0.75, 1.25$, while for the item in Figure 4.3b, $a_g = 2.0$ and $b_{x_g} = -2.00, -1.00, 0.00, 1.00, 2.00$.

[Figure 4.3. Two examples of six-category operating characteristics when categories are (a) not equally spaced and (b) equally spaced. The horizontal axis is the latent trait $\theta$ and the vertical axis is probability.]

It is noted that for both items the OCs for $x_g = 0$ and $x_g = 5$ ($= m_g$) are strictly decreasing and increasing in $\theta$, respectively, with unity and zero as the two asymptotes in the former, and with zero and unity as the two asymptotes in the latter. They are also point symmetric, meaning that if each curve is rotated by 180 degrees around the point $\theta = b_1$ and $P_0(\theta) = 0.5$ when $x_g = 0$, or $\theta = b_5$ and $P_5(\theta) = 0.5$ when $x_g = 5$, then the rotated upper half of the curve overlaps the original lower half of the curve, and vice versa. It is also noted that in both figures the OCs for $x_g = 1, 2, 3, 4$ are all unimodal and symmetric.

These two sets of OCs provide substantially different impressions, because in Figure 4.3a the four bell-shaped curves have a variety of different heights. The heights are determined by the distance $b_{x_g + 1} - b_{x_g}$. In this figure, the distance for the curve for $x_g = 1$ equals $b_2 - b_1 = 1.00$, and its maximal OC is higher than any other, because $b_3 - b_2 = 0.50$, $b_4 - b_3 = 0.75$, and $b_5 - b_4 = 0.50$. Thus the second highest modal point belongs to $x_g = 3$, and the lowest is shared by $x_g = 2$ and $x_g = 4$. For the item in Figure 4.3b, it is noted that those distances ($b_{x_g + 1} - b_{x_g}$) are uniformly 1.00; thus the heights of the four bell-shaped curves are all equal.

The height of a bell-shaped curve also depends on the value of $a_g$. In Figure 4.3b, the common height of the four bell-shaped curves is higher than the one for $x_g = 1$ in Figure 4.3a, and this comes from the larger value of $a_g$ ($= 2.0$) for the item in Figure 4.3b than that of the item in Figure 4.3a, for which $a_g = 1.0$. It should also be noted that in each of the two examples the modal points of the OCs are ordered in accordance with the item score, $x_g = 1, 2, 3$, and 4.

The above characteristics of the NMLOG are also shared by the logistic model that will be introduced in the following section.

It has been observed (Samejima, 1969, 1972) that the BSFs of the $x_g$s are all strictly decreasing in $\theta$, with 0 and $-\infty$ as the two asymptotes for $x_g = 0$, with $\infty$ and $-\infty$ for $0 < x_g < m_g$, and with $\infty$ and 0 for $x_g = m_g$, respectively, indicating that the model satisfies the unique maximum condition discussed above. The IRIFs for all $0 \leq x_g \leq m_g$ are positive for the entire range of $\theta$. The processing functions (PRFs) are all strictly increasing in $\theta$ for all $0 < x_g \leq m_g$, with zero and unity as the two asymptotes. (For more details, cf. Samejima, 1969, 1972.)


Logistic Model (LGST)

Relationship to Other Models: This LGST model has been mentioned by other authors in this book, and is typically referred to in the broader literature as the graded response model. As mentioned earlier, however, Samejima uses the term graded response model to refer to her broader framework, which includes this (and the previous NMLOG) example of the homogeneous case as well as the later LPEFG and ACLR examples of the heterogeneous case.

In the logistic model, the cumulative operating characteristic is specified by


Downloaded by [The University of Edinburgh] at 10:42 26 September 2017

Px*g ( ) = [1 + exp{ Da g ( bx g )}]1 , (4.22)


where $a_g$ denotes the item discrimination parameter and $b_{x_g}$ is the item response
difficulty parameter that satisfies the inequality presented in Equation 4.21, as
is the case with the normal ogive model. $D$ is a scaling factor usually set equal
to 1.702 or 1.7 so that Equation 4.22 provides a very close curve to Equation
4.20, that is, the COC in the normal ogive model, when the same values
of the item discrimination parameter $a_g$ and item response difficulty parameters
$b_{x_g}$'s are used. As is expected, the set of COCs and the set of OCs are
similar to the corresponding sets in the normal ogive model, illustrated in
Figure 4.3a and b.
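To make the structure concrete, the following is a minimal numerical sketch (not from the chapter; NumPy and the equally spaced $b_{x_g}$ values are illustrative assumptions) of how the COCs of Equation 4.22 are differenced, as in Equation 4.2, to produce the OCs.

```python
import numpy as np

D = 1.7                                    # scaling factor
a_g = 2.0                                  # item discrimination parameter
b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # b_{x_g} for x_g = 1..5 (m_g = 5)

theta = np.linspace(-4.0, 4.0, 17)

# Equation 4.22: cumulative operating characteristics P*_{x_g}(theta).
P_star = 1.0 / (1.0 + np.exp(-D * a_g * (theta[:, None] - b[None, :])))

# Pad with P*_0 = 1 and P*_{m_g+1} = 0, then difference (Equation 4.2)
# to obtain the operating characteristics P_{x_g}(theta) for x_g = 0..5.
ones = np.ones((theta.size, 1))
zeros = np.zeros((theta.size, 1))
P_full = np.hstack([ones, P_star, zeros])
P = P_full[:, :-1] - P_full[:, 1:]

assert np.allclose(P.sum(axis=1), 1.0)  # the OCs sum to one at every theta
print(np.round(P[theta == 0.0], 3))     # category probabilities at theta = 0
```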
Notable differences are found in its PRFs and BSFs, however (Samejima,
1969, 1972, 1997). Although the PRFs are strictly increasing in $\theta$ for all
$0 < x_g \le m_g$ and their upper asymptotes are all unity, as is the case with
the NMLOG, their lower asymptotes equal $\exp[-D a_g (b_{x_g} - b_{x_g-1})]$
(cf. Samejima, 1972, p. 43), which is positive except for $x_g = 1$, where it
is zero. This indicates that, unlike in the NMLOG, in the LGST for all
$1 < x_g \le m_g$ the lower asymptotes are positive, and moreover, the closer the
item difficulty parameter $b_{x_g}$ is to that of the preceding item score, $b_{x_g-1}$, the
better the chances are of passing the current step $x_g$, giving favor to individuals
at lower levels of ability. This fact is worth taking into consideration when
model selection is considered. (A comparison of LGST processing functions
with those in the NMLOG is illustrated in Samejima (1972, Figure 5-2-1,
p. 43).)
Although the BSFs are strictly decreasing in $\theta$ for all $0 \le x_g \le m_g$, unlike
in the normal ogive model, its two asymptotes for $x_g = 0$ are zero and a finite
value, $-D a_g$; those for $x_g = m_g$ are $D a_g$ and zero; and for all other intermediate
$x_g$'s their asymptotes are the finite values $D a_g$ and $-D a_g$, respectively. The unique
maximum condition is also satisfied (Samejima, 1969, 1972), however, and the
IRIFs for all $0 \le x_g \le m_g$ are positive for the entire range of $\theta$.


An Example of the Application of the Logistic Model to Medical Science Research
It was personally delightful when, as early as 1975, Roche, Wainer, and
Thissen applied the logistic model for graded responses in medical science
research in the book Skeletal Maturity. The research is a fine combination of
medical expertise and a latent trait model for graded responses. It is obvious
that every child grows up to become an adolescent and then an adult, and its
skeletal maturity progresses with age. But there are many individual differ-
ences in the speed of that process, and a child's chronological age is not an
accurate indicator of his or her skeletal maturity. For example, if you take a
look at a group of sixth graders, in spite of the closeness of their chronologi-
cal ages, some of them are already over 6 feet tall and look like young adults,
while others still look like small children. Measuring the skeletal maturity of
each child accurately is important because certain surgeries have to be con-
ducted when a child or adolescents skeletal maturity has reached a certain
level, to give an example.
In Roche et al. (1975), x-ray films of the left knee joint that were taken
from different angles were mostly used as items, or skeletal maturity indica-
tors. The items were grouped into three categories: femur (12), tibia (16),
and fibula (6). The reference group of subjects for the skeletal maturity scale
consists of 273 girls and 279 boys of various ages. A graded item score was
assigned to each of those subjects for each item following medical experts'
evaluations of the x-ray film. The reader is strongly encouraged to read the
entire Roche et al. text to learn more about this valuable research and to see
how the LGST, a specific example of the graded response model framework,
has been applied in practice.

Further Observations of the Normal Ogive and Logistic Models


Both the normal ogive and logistic models satisfy the unique maximum
condition and the additivity 1 and 2 criteria (discussed above); the modal
points of their OCs are arranged in accordance with the item scores; and
they can be naturally expanded to respective continuous response models
(cf. Samejima, 1973a).
It is clear that in the NMLOG and LGST for graded responses (and also in
many other models in the homogeneous case) the COC can be expressed as

$$P^*_{x_g}(\theta) = \int_{-\infty}^{a_g(\theta - b_{x_g})} \psi_g(z)\, dz, \quad (4.23)$$

where $\psi_g(z)$ is replaced by the standard normal and logistic density functions,
respectively.
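As a quick numeric check (a sketch using only the standard library; not from the text) of the earlier claim that $D = 1.702$ makes the logistic COC of Equation 4.22 very close to the normal ogive COC of Equation 4.23, note that the integral in Equation 4.23 with the standard normal density is simply the standard normal CDF evaluated at $a_g(\theta - b_{x_g})$:

```python
import math

def normal_coc(theta, a=1.0, b=0.0):
    # Equation 4.23 with the standard normal density: the integral up to
    # a(theta - b) is the standard normal CDF at that point.
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))

def logistic_coc(theta, a=1.0, b=0.0, D=1.702):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))  # Equation 4.22

worst = max(abs(normal_coc(t / 100.0) - logistic_coc(t / 100.0))
            for t in range(-400, 401))
print(worst)   # roughly 0.01 at its largest
```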


It can be seen from Equation 4.23 that these models for graded responses
can be expanded to their respective models for continuous responses. Replacing
$x_g$ by $z_g$ in Equation 4.23, the operating density characteristic $H_{z_g}(\theta)$ (Samejima,
1973a) for a continuous response $z_g$ is defined by

$$H_{z_g}(\theta) \equiv \lim_{\Delta z_g \to 0} \frac{P^*_{z_g}(\theta) - P^*_{z_g + \Delta z_g}(\theta)}{\Delta z_g} = a_g\, \psi_g\{a_g(\theta - b_{z_g})\}\, \frac{d}{d z_g} b_{z_g},$$

where $a_g$ is the item discrimination parameter and $b_{z_g}$ is the item response
difficulty parameter, the latter of which is a continuous, strictly increasing,
and differentiable function of $z_g$.


In the normal ogive model for continuous responses, there exists a sufficient
statistic, $t(v)$, such that

$$t(v) = \sum_g a_g^2\, b_{z_{gv}}, \quad (4.24)$$

and the MLE of $\theta$ is provided by dividing $t(v)$ by the sum total of $a_g^2$ over
all $n$ items.
When the latent space is multidimensional, that is,

$$\boldsymbol{\theta} = \{\theta_1, \theta_2, \ldots, \theta_j, \ldots, \theta_r\},$$

in the NMLOG the sufficient statistic becomes a vector of order $r$, that is,

$$\mathbf{t}(v) = \sum_g \mathbf{a}_g \mathbf{a}_g'\, \mathbf{b}_{z_{gv}}, \quad (4.25)$$

where the bold letters indicate vectors of order $r$, and the MLE of $\boldsymbol{\theta}$ is given
by the inverse of the matrix $\sum_g \mathbf{a}_g \mathbf{a}_g'$ postmultiplied by $\mathbf{t}(v)$. It is noted that
Equation 4.24 is a special case of Equation 4.25 when $r = 1$ (for details, the
reader is directed to Samejima (1974)).
For graded response data, when $m_g$ is very large, a continuous response
model may be more appropriately applied instead of a graded response model,
as is often done in applying statistical methods. (Note that the test score is a set
of finite values, and yet it is sometimes treated as a continuous variable, for
example.) In such a case, if the normal ogive model fits our data, the MLE
of $\theta$ will be obtained more easily, taking advantage of the sufficient statistic,
when the latent space is multidimensional as well as unidimensional.
It has been observed (Samejima, 2000) that the normal ogive model for
dichotomous responses provides some contradictory outcomes in the orders
of MLEs of $\theta$, because of the point-symmetric nature of its ICF, which is
characterized by the right-hand side of Equation 4.20 with the replacement of the
item response difficulty parameter $b_{x_g}$ by the item difficulty parameter $b_g$.
To illustrate this fact, Table 1 of Samejima (2000) presents all 32 ($= 2^5$)
response patterns of five hypothetical dichotomous items following the
NMLOG, with $a_g = 1.0$ and $b_g = -3.0, -1.5, 0.0, 1.5, 3.0$, respectively, that
are arranged in the ascending order of the MLEs. When the model is
changed to the LGST, because all five items share the same item discrimination
parameter, $a_g = 1.0$, the simple number-correct test score becomes a


sufficient statistic, and a subset of response patterns that have the same number
of $u_g = 1$ shares the same value of the MLE.
The following are part of all 32 response patterns listed in that table and
their corresponding MLEs in the NMLOG:

Pattern 2 (10000)   -2.284
Pattern 7 (00001)   -0.866
Pattern 26 (01111)   0.866
Pattern 31 (11110)   2.284
It is noted that the first two response patterns share the same subresponse
pattern, 000, for Items 2 to 4, and the second two share the same subresponse
pattern, 111, for the same three items. In the first pair of response patterns,
the subresponse patterns for Items 1 and 5 are 10 and 01, respectively, while
in the second pair they are 01 and 10. Because the only difference in each
pair of response patterns is this subresponse pattern for Items 1 and 5, it is
contradictory that in the first pair (Patterns 2 and 7) success in answering the
most difficult item is more credited ($-0.866 > -2.284$) in the normal ogive
model, while in the second pair (Patterns 26 and 31) success in answering
the easiest item is more credited ($2.284 > 0.866$).
Observations like the above provided the motivation for proposing a
family of models for dichotomous responses, the logistic positive exponent
family (LPEF; Samejima, 2000), which arranges the MLEs consistently
following one principle concerning penalties or credits for failing or succeeding
in answering easier or more difficult items. This was later expanded to a
graded response model (LPEFG; Samejima, 2008) that will be introduced
later in this chapter.
In spite of some shortcomings of the normal ogive and logistic models
for dichotomous responses, they are useful models as working hypotheses,
and are effectively used, for example, in on-line item calibration in computerized
adaptive testing (cf. Samejima, 2001).

Models in the Heterogeneous Case


The heterogeneous case consists of all specific latent trait models for graded
responses that do not belong to the homogeneous case. In each of those models,
the COCs for $x_g = 1, 2, \ldots, m_g$ are not all identical in shape, unlike those
in the homogeneous case, and yet the relationship in Equation 4.5
holds for every pair of adjacent $x_g$'s. That is, even though adjacent functions
are not parallel, they never cross.
Two subcategories are conceivable for specific graded response models
in the heterogeneous case. One is a subgroup of those models that can
be naturally expanded to continuous response models. In this section, the
graded response model (LPEFG), which was expanded from the logistic
positive exponent family (LPEF) of models for dichotomous responses,
and the acceleration model, which was specifically developed for elaborate
cognitive diagnosis, are described and discussed. The other subcategory contains
those models that are discrete in nature, which are represented by those
models expanded from Bock's (1972) nominal response model (BCK-SMJ).

Models Expanded from the Logistic Positive Exponent Family of Models


Logistic Positive Exponent Family of Models for Dichotomous Responses (LPEF)
This family previously appeared in Samejima's (1969) Psychometrika monograph,
using the normal ogive function instead of the logistic function in
ICFs, although at that time it was premature for readers, and practically
impossible, to pursue the topic and publish in refereed journals.
As was exemplified earlier by the NMLOG, if the ICF is point symmetric,
there is no systematic principle of ordering the values of MLE obtained
on response patterns, as is seen in the example of the normal ogive model.
The logistic model, where a simple sufficient statistic $\sum_g a_g u_{gv}$ (Birnbaum,
1968) exists, is an exception, and there the ordering depends solely on the
discrimination parameters, $a_g$, without being affected by the difficulty parameters, $b_g$.
The strong motivation for the LPEF was to identify a model that arranges
the values of MLE of all possible response patterns consistently following
a single principle, that is, penalizing or crediting incorrect or correct
responses, respectively. This motivation was combined with an idea
to perceive Birnbaum's logistic model as a transition model in a family of
models. After all, the fact that in the logistic model MLEs are determined
from the sufficient statistic (Birnbaum, 1968) that disregards, totally, the
difficulty parameters, $b_g$'s, and is solely determined by the discrimination
parameters, $a_g$'s, is not easily acceptable to this researcher's intuition. This led
to the family of models called the logistic positive exponent family (LPEF)
(Samejima, 2000), where ICFs are defined by

$$P_g(\theta) = [\psi_g(\theta)]^{\xi_g}, \quad 0 < \xi_g < \infty, \quad (4.26)$$

where the third parameter, $\xi_g$, is called the acceleration parameter, and

$$\psi_g(\theta) = [1 + \exp\{-D a_g (\theta - b_g)\}]^{-1}, \quad (4.27)$$



the right-hand side of which is identical to the logistic ICF (Birnbaum, 1968),
where the scaling factor $D$ is usually set equal to 1.702. Note that Equation
4.26 also becomes the logistic ICF when $\xi_g = 1$, that is, a point-symmetric
curve; otherwise, it provides point-asymmetric curves, having a longer tail
on lower levels of $\theta$ as $\xi_g$ ($< 1$) gets smaller. Samejima (2000) explains that
when $\xi_g < 1$ the model arranges the values of MLE following the principle
that penalizes failure in solving an easier item, and when $\xi_g > 1$, following
the principle that gives credit for solving a more difficult item, provided
that the discrimination parameters assume the same value for all items
(cf. Samejima, 2000). Thus Birnbaum's logistic model can be considered to
represent the transition between the two opposing principles.


Figure 4.4 ICFs of seven items modeled with the LPEF, where the values of $\xi_g$ are (left to right) 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, and 3.0, respectively.

Figure 4.4 illustrates the ICFs of models that belong to the LPEF, with
the common item parameters $a_g = 1$ and $b_g = 0$, where the values of $\xi_g$ are
0.3, 0.5, 0.8, 1.0, 1.5, 2.0, and 3.0, respectively.
The characteristics of the LPEF are as follows:
1. If $0 < \xi_g < 1$, then the principle arranging the MLEs of $\theta$ is that failure
in answering an easier item correctly is penalized (success in answering
an easier item correctly is credited).
2. If $1 < \xi_g < \infty$, then the principle arranging the MLEs of $\theta$ is that success
in answering a more difficult item is credited (failure in answering
a more difficult item correctly is penalized).
3. When $\xi_g = 1$, both of the above principles degenerate, and neither of
the two principles works.
The reader is directed to Samejima (2000) for detailed explanations and
observations of the LPEF for dichotomous responses. It is especially important
to understand the role of the model/item feature function, $S_g(\theta)$, which is
defined in Equation 7 of that article, specified by Equation 28 for the LPEF,
and illustrated in Figures 4(a) to (c) (pp. 331–332).
It should be noted that the item parameters, $a_g$ and $b_g$, in the LPEF should
not be considered as the discrimination and difficulty parameters. (The same
is also true with the 3PL.) Actually, the original meaning of the difficulty
parameter is the value of $\theta$ at which $P_g(\theta) = 0.5$. These values are indicated
in Figure 4.4, where they are strictly increasing with $\xi_g$, not constant for all
items. Also, the original meaning of the discrimination parameter is a parameter
proportional to the slope of $P_g(\theta)$ at the level of $\theta$ where $P_g(\theta) = 0.5$, and
it is also strictly increasing with $\xi_g$, not a constant value for all seven items.
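As an illustration of this point, the following small sketch (assumed values, NumPy assumed; not from the text) evaluates the LPEF ICF of Equations 4.26 and 4.27 and locates the value of $\theta$ at which $P_g(\theta) = 0.5$, which drifts upward with $\xi_g$ even though $a_g$ and $b_g$ are held fixed.

```python
import numpy as np

D, a_g, b_g = 1.702, 1.0, 0.0
xis = [0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0]   # the xi_g values of Figure 4.4

def lpef_icf(theta, xi):
    psi = 1.0 / (1.0 + np.exp(-D * a_g * (theta - b_g)))  # Equation 4.27
    return psi ** xi                                      # Equation 4.26

theta = np.linspace(-5.0, 6.0, 2201)
for xi in xis:
    p = lpef_icf(theta, xi)
    # Locate the theta at which P_g(theta) = 0.5; it increases with xi_g.
    theta_half = theta[np.argmin(np.abs(p - 0.5))]
    print(f"xi_g = {xi:3.1f} -> P_g(theta) = 0.5 near theta = {theta_half:5.2f}")
```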


Graded Response Model Expanded from the Logistic Positive Exponent Family of Models (LPEFG)
It can also be seen in Figure 4.4 that whenever $\xi_g < \xi_h$ for a pair of arbitrary
items, $g$ and $h$, there is a relationship that $P_g(\theta) > P_h(\theta)$ for the entire range of
$\theta$. The reason is that for any value $0 < R < 1$, $R^s > R^t$ holds for any $s < t$.
The ICFs that are illustrated in Figure 4.4 can therefore be used as an
example of a set of COCs for graded item scores for a single graded response
item with $m_g = 7$ that satisfies Equation 4.5. More formally, the LPEFG is
characterized by

$$M_{x_g}(\theta) = [\psi_g(\theta)]^{\xi_{x_g} - \xi_{x_g-1}}, \quad x_g = 0, 1, 2, \ldots, m_g, \quad (4.28)$$

where $\psi_g(\theta)$ is given by Equation 4.27 and

$$\xi_{-1} = \xi_0 = 0 < \xi_1 < \cdots < \xi_{x_g} < \cdots < \xi_{m_g-1} < \xi_{m_g} < \xi_{m_g+1} \equiv \infty, \quad (4.29)$$

which leads to

$$P^*_{x_g}(\theta) = [\psi_g(\theta)]^{\xi_{x_g}}, \quad (4.30)$$

due to Equations 4.28 and 4.1. From Equations 4.2 and 4.28 through 4.30,
the OC in the LPEFG is defined by

$$P_{x_g}(\theta) = [\psi_g(\theta)]^{\xi_{x_g}} - [\psi_g(\theta)]^{\xi_{x_g+1}}, \quad (4.31)$$

and all the other functions, $A_{x_g}(\theta)$, $I_{x_g}(\theta)$, $I_g(\theta)$, and $I(\theta)$, can be obtained
by replacing $P_{x_g}(\theta)$ by the right-hand side of Equation 4.31 and evaluating
its derivatives in Equations 4.3, 4.4, 4.6, and 4.19, and substituting these
outcomes into Equations 4.2 through 4.4 (details in Samejima, 2008).
Figure 4.5a and b presents the OCs and BSFs (per Equation 4.3) of an
example of the LPEFG, with $m_g = 5$, $a_g = 1$, $b_g = 0$, and $\xi_{x_g} = 0.3, 0.8, 1.6,
3.1$, and $6.1$, respectively. Note that, for $x_g = 0$ and $x_g = m_g$, the OCs are strictly
decreasing and increasing in $\theta$, respectively, and for all the other graded item
scores they are unimodal; the BSFs are all strictly decreasing in $\theta$, with
the upper asymptotes zero for $x_g = 0$ and $D a_g \xi_{x_g}$ for $x_g = 1, 2, 3, 4, 5$,
respectively, and with the lower asymptotes $-D a_g$ for $x_g = 0, 1, 2, 3, 4$ and zero
for $x_g = 5$, indicating the satisfaction of the unique maximum condition.
The set of BSFs in Figure 4.5b is quite different from that of the NMLOG or
LGST (see Samejima, 1969, 1972) because the upper limit of the BSF is
largely controlled by the item response parameter $\xi_{x_g}$. Because of this fact, it


Figure 4.5 Operating characteristics (a) and basic functions (b) of an example six-category item modeled with the LPEFG. In (a), the modal points of OCs are ordered according to the $x_g$'s, i.e., the lowest is for $x_g = 0$ and the highest is for $x_g = 5$. In (b), the asymptotes when $\theta$ approaches $-\infty$ are ordered, i.e., lowest for $x_g = 0$ and highest for $x_g = 5$.


can also be seen that the amount of information shown in the IRIF becomes
larger as the item score $x_g$ gets larger. The value of $\theta$ at which each curve in
Figure 4.5b crosses the $\theta$-dimension for each $x_g$ indicates the modal point of
the corresponding OC that is seen in Figure 4.5a, and these modal points are
ordered in accordance with the $x_g$'s, with the terminal maximum at negative
infinity and positive infinity for $x_g = 0$ and $x_g = m_g$ ($= 5$), respectively.
Another set of PRFs, COCs, OCs, BSFs, and IRIFs for the LPEFG with
different parameter values can be found in Samejima (2008).
It was noted above that the LPEFG satisfies the unique maximum condition,
as illustrated in Figure 4.5b. In addition, both additivity and expandability
to a continuous response model (discussed above) are intrinsic in the
LPEFG (Samejima, 2008).
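A minimal computational sketch of the LPEFG (NumPy assumed; the parameter values mirror the illustrative ones of Figure 4.5): the OCs are built from Equations 4.27, 4.30, and 4.31, and the printed modal points are ordered with $x_g$, as described above.

```python
import numpy as np

D, a_g, b_g = 1.702, 1.0, 0.0
xi = np.array([0.0, 0.3, 0.8, 1.6, 3.1, 6.1])   # xi_0 = 0 prepended

theta = np.linspace(-6.0, 6.0, 4801)
psi = 1.0 / (1.0 + np.exp(-D * a_g * (theta - b_g)))   # Equation 4.27

# Equation 4.30: COCs P*_{x_g} = psi ** xi_{x_g}; the convention
# xi_{m_g+1} = infinity makes the final cumulative column zero.
P_star = psi[:, None] ** xi[None, :]
P_star = np.hstack([P_star, np.zeros((theta.size, 1))])

P = P_star[:, :-1] - P_star[:, 1:]   # Equation 4.31: OCs for x_g = 0..5

# For x_g = 0 and x_g = 5 the maxima sit at the ends of the grid (terminal
# maxima at -inf and +inf); the interior modal points increase with x_g.
modes = theta[P.argmax(axis=0)]
print(np.round(modes, 2))
```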

LPEFG as a Substantive Mathematical Model


It is noted that, unlike the normal ogive or the logistic model, the LPEFG is a
substantive mathematical model, in the sense that the principle and nature of
the model support, consistently, certain psychological phenomena. To give
an example, for answering a relatively difficult problem-solving question, we
must successfully follow a sequence of cognitive processes. The individual's
performance can be evaluated by the number of processes in the sequence
that he or she has successfully cleared, and one of the graded item scores, 0
through $m_g$, is assigned. It is reasonable to assume that passing each
successive cognitive process becomes progressively more difficult, represented
by the item response parameter $\xi_{x_g}$. Concrete examples of problem solving,
for which the LPEFG is likely to fit, are various geometric proofs (Samejima,
2008).
Usually, there is more than one way of proving a given geometry theorem.
Notably, it is said that there are 362 different ways to prove the Pythagorean
theorem! It would make an interesting project for a researcher to choose
a geometry theorem having several proofs, collect data, categorize subjects
into subgroups, each of which consists of those who choose one of the differ-
ent proofs, assign graded item scores to represent the degrees of attainment
for each subgroup, and apply LPEFG for the data of each subgroup. It is
most likely that separate proofs will have different values of mg, and it would
be interesting to observe the empirical outcomes.
Readers will be able to think of other substantive examples. It would
be most interesting to see applications of the LPEFG to data collected
for such examples to find out if the model works well. Any such feedback
would be appreciated.

Relationship to Other Chapters: Huang and Mislevy do something similar to what is suggested
here with responses to a physical mechanics exam. However, they use the polytomous Rasch model
to investigate response strategies rather than the LPEFG, and take a slightly different approach given
that Rasch models do not model processing functions.


Acceleration Model (ACLR)


Greater Opportunities for Applying Mathematical Models for Cognitive Psychology Data
For any research in the social sciences, mathematical models and methodologies
are important if one aims at truly scientific accomplishments. Because of
the intangible nature of the social sciences, however, there still is a long way to
go if the levels of scientific attainment in the natural sciences are one's goal.
Nonetheless, the research environment for behavioral science has
improved, especially during the past few decades. One of the big reasons
for the improvement is the advancement of computer technologies. To give


an example, in cognitive psychology it used to be typical that a researcher
invited a subject to an experimental room and gave him or her instructions,
which the subject followed and responded to accordingly. Because of its
time-consuming nature, it was very usual that research was based on a very
small group of subjects and quantitative analysis of the research data was
practically impossible.
With the rapid advancement of computer technologies, microcomputers
have become much more capable, smaller, and much less expensive. It is quite
possible to replace the old procedure by computer software that accommo-
dates all experimental procedures, including instructions, response formats,
and data collections. The software is easy to copy, and identical software can
be installed onto multiple laptops of the same type, each of which can be
taken by well-trained instructors to different geographical areas to collect
data for dozens of subjects each. Thus, data can be collected with a sample
size of several hundred relatively easily, in a well-controlled experimental
environment. Sampling can also be made closer to random sampling.
In return, the need for mathematical models for cognitive processes has
become greater, and one must propose mathematical models with the above
perspective.

Acceleration Model
Samejima (1995) proposed the acceleration model that belongs to the het-
erogeneous case with such a future need in mind. In general, cognitive diag-
nosis is complicated, so naturally models for cognitive diagnosis must be
more complicated than many other mathematical models that are applied,
for example, to test or questionnaire data.
In the acceleration model, the PRF is defined by

$$M_{x_g}(\theta) = [\Psi_{x_g}(\theta)]^{\xi^*_{x_g}}, \quad (4.32)$$

where $\xi^*_{x_g}$ ($> 0$) is also called the acceleration parameter in this model, and
$\Psi_{x_g}(\theta)$ is a member of the family of functions satisfying

$$\xi^*_{x_g} = 1 - \Psi_{x_g}(\theta)\, \Psi''_{x_g}(\theta)\, [\Psi'_{x_g}(\theta)]^{-2} \quad (4.33)$$


that includes the logistic function such that

$$\Psi_{x_g}(\theta) = [1 + \exp\{-D a_{x_g}(\theta - b_{x_g})\}]^{-1}. \quad (4.34)$$

In Samejima (1995), Equation 4.34 is mostly used in Equation 4.32. The
COC in this model is provided by

$$P^*_{x_g}(\theta) = \prod_{u \le x_g} [\Psi_u(\theta)]^{\xi^*_u}, \quad (4.35)$$

and all the other functions, such as the OC, BSF, and IRIF in this model, are
given by substituting Equation 4.35 into those formulas of the general graded
response model, Equations 4.2 through 4.4, respectively.
It should be noted that on the left-hand side of Equation 4.34 $\Psi_{x_g}(\theta)$ is
used instead of $\psi_g(\theta)$ in Equation 4.27, with $a_{x_g}$ and $b_{x_g}$ replacing $a_g$ and
$b_g$, respectively, on the right-hand side. This indicates that in the acceleration
model the logistic function is defined separately for each graded score
$x_g$, while in the LPEFG it is common to all the graded item scores of item
$g$. This difference makes the acceleration model more complicated than the
LPEFG, for the purpose of using it for cognitive diagnosis of more complicated
sequences of cognitive processes.
It is also noted that if $\Psi_{x_g}(\theta)$ in Equation 4.34 is replaced by $\psi_g(\theta)$ in
Equation 4.27, and we define $\xi^*_{x_g} \equiv \xi_{x_g} - \xi_{x_g-1}$ with $\xi_{-1} = \xi_0 = 0$, then the LPEFG
can be considered as a special, simplified case of the acceleration model. The
model is described in detail in Samejima (1995). It may be wise to collect
data to which the LPEFG substantively fits, and analyze them first, building
on that experience to analyze more elaborate cognitive data using the
acceleration model.

Bock's Nominal Model Expanded to a Graded Response Model (BCK-SMJ)

Bock's (1972) nominal response model is a valuable model for nominal
response items in that it discloses the implicit order of the nominal response
categories. Samejima (1972) proposed a graded response model expanded
from Bock's nominal model. When a model fits data that implicitly have
ordered response categories, it is easy to expand the model to a graded
response model for the explicit graded item scores.
Samejima did not pursue the BCK-SMJ much further, however, because an
intrinsic restriction was observed in the expanded model. Later, Masters
(1982) proposed a special case of the BCK-SMJ as the partial credit model, and
Muraki (1992) proposed the BCK-SMJ itself as a generalized partial credit
model, without realizing that the model had already been proposed in 1972.
Many researchers have applied those models. Practitioners using IRT in their
research should only use either model when their research data are within the
limit of the previously identified restriction, however.


The OC in the BCK-SMJ is given by

$$P_{x_g}(\theta) = \exp\{\alpha_{x_g}\theta + \beta_{x_g}\} \left[\sum_{u=0}^{m_g} \exp\{\alpha_u \theta + \beta_u\}\right]^{-1}, \quad (4.36)$$

with

$$\alpha_0 < \alpha_1 < \cdots < \alpha_{m_g} < \infty.$$

It is noted that the denominator of Equation 4.36 is common to all $x_g$'s.
This makes the conditional ratio, given $\theta$, of any pair of the OCs for $x_g = s$
and $x_g = t$ ($s \ne t$) such that

$$P_s(\theta)[P_t(\theta)]^{-1} = \exp[(\alpha_s - \alpha_t)\theta]\, \exp[\beta_s - \beta_t], \quad (4.37)$$


indicating the invariance of this conditional ratio, which characterizes
Bock's nominal response model. The same characteristic, however, becomes
a restriction for the BCK-SMJ model. When $s$ and $t$ are two arbitrary adjacent
graded item scores, for example, the combined graded response category
will have the OC

$$P_{s+t}(\theta) = [\exp\{\alpha_s\theta + \beta_s\} + \exp\{\alpha_t\theta + \beta_t\}] \left[\sum_{u=0}^{m_g} \exp\{\alpha_u \theta + \beta_u\}\right]^{-1}. \quad (4.38)$$

It is obvious that Equation 4.38 does not have the form of Equation 4.36, and thus
additivity 2 does not hold for the BCK-SMJ. It can also be seen that additivity
1 does not hold for the model either. Thus the BCK-SMJ is discrete in nature,
and cannot be naturally expanded to a continuous response model, unlike
the normal ogive model, logistic model, and LPEFG. It should be applied
strictly to data that are collected for a fixed set of graded response categories,
where no recategorizations are legitimate. This is a strong restriction.
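A numeric sketch of both points (NumPy assumed; the $\alpha$ and $\beta$ values are illustrative, not from the text): the log of the conditional ratio in Equation 4.37 is linear in $\theta$ with slope $\alpha_s - \alpha_t$, whatever the remaining categories do, while the merged-category numerator of Equation 4.38 is a sum of two exponentials and so cannot be rewritten in the single-exponential form of Equation 4.36.

```python
import numpy as np

alpha = np.array([0.0, 0.7, 1.3, 2.2])   # strictly increasing alpha_{x_g}
beta = np.array([0.0, 0.5, 0.3, -0.8])

def bck_ocs(theta):
    z = np.exp(alpha * theta + beta)
    return z / z.sum()                   # Equation 4.36

for theta in (-1.0, 0.0, 2.0):
    P = bck_ocs(theta)
    lhs = np.log(P[2] / P[1])                                # Equation 4.37
    rhs = (alpha[2] - alpha[1]) * theta + (beta[2] - beta[1])
    print(f"theta = {theta:4.1f}: log ratio = {lhs:6.3f} = {rhs:6.3f}")
```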
A summary of the characteristics of the five specific graded response
models discussed above, with respect to the four evaluation criteria that were
discussed earlier, is given in Table 4.1.

Table 4.1 Summary of the Characteristics of the Specific Graded Response Models With Respect to the Four Evaluation Criteria

                                     NMLOG  LGST  LPEFG  ACLR    BCK-SMJ
Additivity 1                         Yes    Yes   Yes    Yes     No
Additivity 2                         Yes    Yes   Yes    Robust  No
Expands to CRM                       Yes    Yes   Yes    Yes     No
Satisfies unique maximum condition   Yes    Yes   Yes    Yes     Yes
Ordered modal points                 Yes    Yes   Yes    Robust  Yes


The Importance of Nonparametric Estimation

Failure in Parametric Estimation of Item Parameters in the Three-Parameter Logistic Model
Quite often researchers, using simulated data for multiple-choice items,
adopt software for a parametric estimation of the three-parameter logistic
(3PL) model (Birnbaum, 1968), where the ICF is defined as

$$P_g(\theta) = c_g + (1 - c_g)[1 + \exp\{-D a_g(\theta - b_g)\}]^{-1} \quad (4.39)$$
and fail to recover the values of the three parameters, $a_g$, $b_g$, and $c_g$, within the
range of error. Sometimes all the item parameter estimates are outrageously
different from their true values. This is a predictable result, because in most
cases simulated data are based on a mound-shaped ability distribution
with very low densities at very high and very low levels of $\theta$. In Equation
4.39, estimating the third parameter, $c_g$, which is the lower asymptote of
the ICF, will naturally be inaccurate. This inaccuracy will also affect, negatively,
the accuracies in estimating the other two parameters. Moreover,
even if the ability distribution has large densities at lower levels of $\theta$,
when $\theta$ is treated as an individual parameter and if an EM algorithm is
used to estimate both the individual parameter and item parameters, then
the more hypothetical individuals at lower levels of $\theta$ are included, the
larger the amount of estimation error of the individual parameters will be,
influencing accuracy in estimating $c_g$ negatively and, consequently, in $a_g$
and $b_g$. Thus such an attempt is doomed to fail.
Even without making an additional effort to increase the number of subjects
at lower levels of the latent trait to more accurately recover the three
item parameters, which will not in any event be successful, if the true curve
and the estimated curve with outrageously wrong estimated parameter values
are plotted together, the fit of the curve with the estimated parameter values
to the true curve is usually quite good for the interval of $\theta$ at which the densities
of the ability distribution are high. We could say that, although the parametric
estimation method aims at the recovery of item parameters, it actually recovers
the shape of the true curve for that interval of $\theta$, as a well-developed nonparametric
estimation method does, not the item parameters themselves.

Nonparametric Estimation of OCs


From a truly scientific standpoint, parametric estimation of OCs is not
acceptable unless there is evidence to justify the adoption of the model in
question, because if the model does not fit the nature of our data, it molds the
research data into a wrong mathematical form and the outcomes of research
will become meaningless and misleading.
Thus, well-developed nonparametric estimation methods that will discover
the shapes of OCs will be valuable. Lord developed such a nonparametric
estimation method, and applied it for estimating the ICFs of Scholastic


Aptitude Test items (Lord, 1980; Figure 2.31 on page 16 shows example
outcomes). The method is appropriate for a large set of data, represented by
widely used tests that are developed and administered by the Educational
Testing Service, American College Testing, and Law School Admission
Council, for example, but it is not appropriate for data of relatively small
sizes, such as those collected in a college or university environment. The non-
parametric methods that were developed by Levine (1984), Ramsay (1991),
and Samejima (1998, 2001) will be more appropriate to use for data collected
on a relatively small number of individuals.
Figure 4.6 exemplifies the outcomes obtained by Samejima's (1998, 2001)
simple sum procedure (SSP) and differential weight procedure (DWP) of the
conditional probability density function (pdf) approach, based on the simulated
data of 1,202 hypothetical examinees in computerized adaptive testing
(CAT). The outcome of DWP1 (thin, solid line) was obtained by using the
outcome of SSP (dashed line) as the differential weight function, while the
fourth curve, DWP_True, is the result of DWP using the true curve (thick,
solid line) as the differential weight function. DWP_True is called the criterion
operating characteristic, indicating the limit of the closeness of
an estimated curve to the true curve; if they are not close enough, either the
procedures in the method of estimation should be improved, or the sample
size should be increased.
It should be noted that the nonmonotonicity of the true curve is detected
by both the SSP and DWP1 in Figure 4.6. Even if the true curve is nonmonotonic,
which is quite possible, especially for the item characteristic
function of a multiple-choice test item (Samejima, 1979), such detection

Figure 4.6 A non-monotonic ICF (TRUE), its two nonparametric estimates (SSP, DWP1), and the criterion ICF (DWP_True).


cannot be made by a parametric estimation. If, for example, the true curve
in Figure 4.6 is the ICF of a multiple-choice test item and a parametric
estimation method such as the 3PL is used, the estimated three item parameters
will provide, at best, an estimated curve with a monotonic tail.
There is no reason to throw away an item whose ICF is nonmonotonic,
as illustrated in Figure 4.6. It is noted that approximately for the interval
of $\theta$ of $(0.0, 1.5)$ the amount of item information at each value of $\theta$ is large,
so there is no reason why we should not take advantage of it. On the other
hand, at levels lower than this interval of $\theta$ the nonmonotonicity of the curve
will make the IRIFs negative, so this part of the curve should not be used.
Samejima (1973b) pointed out that the IRIF of the 3PL for $u_g = 1$ assumes
negative values, and for that reason the 3PL does not satisfy the unique maximum
condition. Using an item whose ICF is nonmonotonic, as illustrated by
Figure 4.6, is especially easy in CAT (Samejima, 2001), but a similar method
can be used in a paper-and-pencil test or questionnaire.
In Figure 4.6 it can be seen that (1) the outcome of DWP1 is a little
closer to the criterion operating characteristic than that of SSP, (2) the
outcomes of SSP and DWP1 are both very close to the criterion operating
characteristic, and (3) the criterion operating characteristic is very close
to the true curve. For more details, the reader is directed to Samejima
(1998, 2001).
Samejima (1994) also used the SSP on empirical data, for estimating the
conditional probability, given $\theta$, of each distractor of the multiple-choice items
of the Level 11 Vocabulary Test of the Iowa Tests of Basic Skills, and called
those functions of the incorrect answers plausibility functions. It turned out
that quite a few items proved to possess plausibility functions that have
differential information, and the use of those functions in addition to the ICFs
proved to be promising for increasing the accuracy of ability estimation.
In the example of Figure 4.6, nonparametric estimation of the ICFs for
dichotomous items, or that of the COCs for graded responses, was considered.
We could estimate the PRFs or OCs first, however. Equations 4.1 and 4.2
can be changed to

$$M_{x_g}(\theta) = P^*_{x_g}(\theta)[P^*_{x_g-1}(\theta)]^{-1} \quad \text{for } x_g = 1, 2, \ldots, m_g \quad (4.40)$$

and

$$P^*_{x_g}(\theta) = \sum_{u \ge x_g} P_u(\theta) \quad \text{for } x_g = 1, 2, \ldots, m_g, \quad (4.41)$$

respectively. Nonparametric estimation can be performed to discover the
PRFs first, and using those outcomes, the COCs and then the OCs can be
obtained through Equations 4.1 and 4.2. An alternative way is to estimate
the COCs first; using Equation 4.40, the PRFs can be obtained, and
then the OCs through Equation 4.2. It is possible to estimate the OCs first,


and then, using the outcomes, the COCs can be obtained through Equation
4.41, and then the PRFs through Equation 4.40. Note, however, that this last
method may include substantial amounts of error, because it is quite possible
that some graded scores may include only a small number of individuals,
unless the total sample size is large enough.
In any case, after the shapes of those functions are nonparametrically
estimated, it is wise to parameterize the nonparametrically discovered functions
by selecting a parametric model that is legitimate in principle and agrees with
the nature of the data. Otherwise, it is difficult to proceed in research using
functions with no mathematical forms.
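As a small sketch of these conversions (synthetic numbers, NumPy assumed; not from the text): starting from a table of estimated OCs on a grid of $\theta$ values, Equation 4.41 recovers the COCs by summing over scores at or above each $x_g$, and Equation 4.40 then recovers the PRFs.

```python
import numpy as np

# Synthetic OCs for an item with m_g = 3 (columns x_g = 0..3) on a coarse
# grid of theta values; each row sums to one. Purely illustrative numbers.
P = np.array([[0.70, 0.20, 0.08, 0.02],
              [0.40, 0.30, 0.20, 0.10],
              [0.10, 0.25, 0.35, 0.30]])

# Equation 4.41: the COC P*_{x_g} is the sum of the OCs for scores u >= x_g.
P_star = np.cumsum(P[:, ::-1], axis=1)[:, ::-1]   # column x_g = 0 is always 1

# Equation 4.40: the PRF M_{x_g} = P*_{x_g} [P*_{x_g-1}]^{-1}, x_g = 1..m_g.
M = P_star[:, 1:] / P_star[:, :-1]

print(np.round(P_star, 3))
print(np.round(M, 3))
```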
Limitation of Curve Fitting in Model Validation and Selection

While the goodness of fit of curves is important, it has its limitations,
especially when a model belongs to the heterogeneous case, where it has fewer
restrictions and more freedom for innovation. Samejima (1996a, 1997)
demonstrated two sets of OCs that belong to two specific graded response
models of quite different principles, the ACLR and BCK-SMJ, which are
nevertheless practically identical to each other. This means that if the OCs
that are discovered as the outcomes of a nonparametric estimation method
fit the OCs of the ACLR, they should also fit those of the BCK-SMJ.
Curve fitting alone cannot be a good enough criterion for model vali-
dation. In model selection, in addition to curve fitting, the most important
consideration should be how well the principle behind each model agrees
with the nature of the research data. Furthermore, considerations should be
made of whether each of the other four criteria listed in Table 4.1 fits the
model, as well as the research data. For example, if we know that the results
of our research may be compared with other research on the same or similar
contents, mathematical models that lack additivity should be avoided. If con-
tinuous responses are used, a model should be chosen that can be expanded
naturally from a graded response model in order to make future comparisons
possible with the outcomes of other research in which graded responses are
used.
An effort to select a substantive model is by far the most important criterion,
and curve fitting can be used as an additional criterion, to see if those curves
in a substantive model provide at least reasonably good fit to the data.

Conclusion
IRT has developed so much in the past few decades that it is hard to write
even just the essential elements of the general graded response model frame-
work as a handbook chapter. Many important and useful topics have been
omitted. An attempt has been made to include useful hints for researchers
and practitioners in applying IRT within this chapter, including suggested
readings. But even with reference to the original work cited in this chapter,
it may be difficult to identify ways to apply the models.


Face-to-face workshops may be a useful way to supplement the written


and cited material in this chapter. Such opportunities would make interac-
tive communications and deeper understanding possible.

References
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (Part 5: Chapters 17–20). Reading, MA: Addison-Wesley.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.


Levine, M. (1984). An introduction to multilinear formula scoring theory (Measurement Series 84-5). Champaign: University of Illinois, Department of Educational Psychology, Model-Based Measurement Laboratory.
Lord, F. M. (1952). A theory of test scores. Psychometric Monograph, No. 7.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores (Chap. 16). Reading, MA: Addison-Wesley.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56, 611–630.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests.
Copenhagen: Nielsen & Lydiche.
Roche, A. M., Wainer, H., & Thissen, D. (1975). Skeletal maturity. New York:
Plenum Medical.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, No. 17.
Samejima, F. (1972). A general model for free-response data. Psychometrika Monograph, No. 18.
Samejima, F. (1973a). Homogeneous case of the continuous response model. Psychometrika, 38, 203–219.
Samejima, F. (1973b). A comment on Birnbaum's three-parameter logistic model in the latent trait theory. Psychometrika, 38, 221–233.
Samejima, F. (1974). Normal ogive model on the continuous response level in the multidimensional latent space. Psychometrika, 39, 111–121.
Samejima, F. (1979). A new family of models for multiple-choice items (Office of Naval Research Report 79-4). Knoxville: University of Tennessee.
Samejima, F. (1994). Nonparametric estimation of the plausibility functions of the distractors of the Iowa Vocabulary items. Applied Psychological Measurement, 18, 35–51.
Samejima, F. (1995). Acceleration model in the heterogeneous case of the general graded response model. Psychometrika, 60, 549–572.
Samejima, F. (1996a). Evaluation of mathematical models for ordered polychotomous responses. Behaviormetrika, 23, 17–35.
Samejima, F. (1996b, April). Polychotomous responses and the test score. Paper presented at the 1996 annual meeting of the National Council on Measurement in Education, New York.


Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 85–100). New York: Springer-Verlag.
Samejima, F. (1998). Efficient nonparametric approaches for estimating the operating characteristics of discrete item responses. Psychometrika, 63, 111–130.
Samejima, F. (2000). Logistic positive exponent family of models: Virtue of asymmetric item characteristic curves. Psychometrika, 65, 319–335.
Samejima, F. (2001). Nonparametric on-line item calibration. Final report of research funded by the Law School Admission Council for 1999–2001.
Samejima, F. (2004). Graded response model. In K. Kempf-Leonard (Ed.), Encyclopedia of social measurement (Vol. 2, pp. 145–153). Amsterdam: Elsevier.
Samejima, F. (2008). Graded response model based on the logistic positive exponent family of models for dichotomous responses. Psychometrika, 73, 561–578.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567–577.

Chapter 5
The Partial Credit Model
Geoff N. Masters
Australian Council for Educational Research

Editor Introduction: This chapter demonstrates the elegant simplicity of the underlying concept on
which the partial credit model was built. This provides a valuable basis for understanding this highly
influential polytomous IRT model, including its relationship to other models and the reasons for its
widespread use.

The partial credit model (PCM) is a particular application of the model for
dichotomies developed by the Danish mathematician Georg Rasch. An
understanding of the partial credit model thus depends on an understanding of
Rasch's model for dichotomies, the properties of this model, and in particular,
Rasch's concept of specific objectivity.
Rasch used the term specific objectivity in relation to a property of the model
for tests he developed during the 1950s. He considered this property to be
especially useful in the attempt to construct numerical measures that do not
depend on the particulars of the instrument used to obtain them.
This property of Rasch's model can be understood by considering two
persons A and B with imagined abilities $\theta_A$ and $\theta_B$. If these two persons attempt
a set of test items, and a tally is kept of the number of items $N_{1,0}$ that person
A answers correctly but B answers incorrectly, and of the number of items
$N_{0,1}$ that person B answers correctly but A answers incorrectly, then under
Rasch's model, the difference, $\theta_A - \theta_B$, in the abilities of these two persons
can be estimated as

$$\ln(N_{1,0}/N_{0,1}) \quad (5.1)$$

What is significant about this fact is that this relationship between the
parameterized difference $\theta_A - \theta_B$ and the tallies $N_{1,0}$ and $N_{0,1}$ of observed
successes and failures applies to any selection of items when test data conform
to Rasch's model. In other words, provided that the responses of persons
A and B to a set of items are consistent with the model, the difference
$\theta_A - \theta_B$ can be estimated by simply counting successes and failures without


Table 5.1 Tallies of Four Possible Outcomes When Persons A and B Attempt a Set of Items

                          Person B
                     Wrong        Right
Person A   Right     $N_{1,0}$    $N_{1,1}$
           Wrong     $N_{0,0}$    $N_{0,1}$

having to know or estimate the difficulties of the items involved. Any subset
of items (e.g., a selection of easy items, hard items, even-numbered items,
odd-numbered items) can be used to obtain an estimate of the relative abili-
ties of persons A and B from a simple tally (Table 5.1).
The possibility of obtaining an estimate of the relative abilities of persons
A and B that is not dependent upon the details of the items used was referred
to by Rasch as the possibility of specifically objective comparison.

Rasch's Model
In its most general form, Rasch's model begins with the idea of a measurement
variable upon which two objects, A and B, have imagined locations,
$\delta_A$ and $\delta_B$ (Table 5.2).
The possibility of estimating the relative locations of objects A and B on
this variable depends on the availability of two observable events:

An event X indicating that B exceeds A
An event Y indicating that A exceeds B

Rasch's model relates the difference between objects A and B to the events X
and Y that they govern:

$$\delta_B - \delta_A = \ln(P_X / P_Y) \quad (5.2)$$

where $P_X$ is the probability of observing X and $P_Y$ is the probability of observing
Y. Notice that, under the model, the odds $P_X / P_Y$ of observing X rather
than Y is dependent only on the direction and distance of $\delta_B$ from $\delta_A$, and is
uninfluenced by any other parameter.

Table 5.2 Locations of Objects A and B on a Measurement Variable

$\delta_A$ ---------------- $\delta_B$


Table 5.3 Locations of Item i and Person n on a Measurement Variable

$\delta_i$ ---------------- $\theta_n$

In 1977 Rasch described the comparison of two objects as objective if


the result of the comparison was independent of everything else within the
frame of reference other than the two objects which are to be compared and
their observed reactions.
An estimate of the difference between objects A and B on the measurement
variable can be obtained if there are multiple independent opportunities
to observe either event X or event Y. Under these circumstances, $\delta_B - \delta_A$
can be estimated as

$$\ln(p_X / p_Y) = \ln(N_X / N_Y) \quad (5.3)$$

where $p_X$ and $p_Y$ are the proportions of occurrences of X and Y, and $N_X$
and $N_Y$ are the numbers of times X and Y occur in $N_X + N_Y$ observation
opportunities.

Dichotomous Test Items

The most common application of Rasch's model is to tests in which responses
to items are recorded as either wrong (0) or right (1). Each person n is imagined
to have an ability $\theta_n$, and each item i is imagined to have a difficulty $\delta_i$,
both of which can be represented as locations on the variable being measured
(Table 5.3).
In this case, observable event X is person n's success on item i, and observable
event Y is person n's failure on item i (Table 5.4).
Rasch's model applied to this situation is

$$\theta_n - \delta_i = \ln(P_1 / P_0) \quad (5.4)$$

If person n could have multiple independent attempts at item i, then the
difference, $\theta_n - \delta_i$, between person n's ability and item i's difficulty could be
estimated as

$$\ln(p_1 / p_0) = \ln(N_1 / N_0) \quad (5.5)$$


Table 5.4 Two Possible Outcomes of Person n's Attempt at Item i

                                                        Observable Event
B − A                  Observation Opportunity          X       Y
$\theta_n - \delta_i$  Person n attempts item i         1       0

Y102002_Book.indb 111 3/3/10 6:59:19 PM


112 Geoff N. Masters

Table 5.5 Locations of Persons m and n on a Measurement Variable

$\theta_m$ ---------------- $\theta_n$

Although this is true in theory, and this method could be useful in some
situations, it is not a practical method for estimating $\theta_n - \delta_i$ from test data,
because test takers are not given multiple attempts at the same item (and
if they were, they would not be independent attempts). To estimate the
difference, $\theta_n - \delta_i$, from test data, it is necessary to estimate $\theta_n$ from person
n's attempts at a number of items, and to estimate $\delta_i$ from a number of
persons' attempts at that item. In other words, the difficulties of a number
of test items and the abilities of a number of test takers must be estimated
simultaneously.

Comparing and Measuring Persons

In the application of Rasch's model to tests, every person has an imagined
location on the variable being measured. Two persons m and n have imagined
locations $\theta_m$ and $\theta_n$ (Table 5.5).
It follows from Equation 5.4 that if persons m and n attempt the same
item and their attempts at that item are independent of each other, then the
modeled difference between persons n and m is

$$\theta_n - \theta_m = \ln(P_{1,0} / P_{0,1}) \quad (5.6)$$

where $P_{1,0}$ is the model probability of person n succeeding but m failing the
item, and $P_{0,1}$ is the probability of person m succeeding but n failing that
item.
It can be seen that Equation 5.6 is Rasch's model (Equation 5.2) applied to
the comparison of two persons on a measurement variable. The two observable
events involve the success of one person but failure of the other in their
attempts at the same item (Table 5.6).
In this comparison of persons m and n, nothing was said about the difficulty
of the item being attempted by these two persons. This is because Equation 5.6
applies to every item. The odds of it being person n who succeeds, given that
one of these two persons succeeds and the other fails, is the same for every item
and depends only on the relative abilities of persons m and n.

Table 5.6 Two Possible Outcomes of Persons n and m Attempting the Same Item

                                                                  Observable Event
B − A                  Observation Opportunity                    X       Y
$\theta_n - \theta_m$  Persons n and m independently attempt      1,0     0,1
                       the same item


Because the modeled odds $P_{1,0} / P_{0,1}$ are the same for every item, the difference
$\theta_n - \theta_m$ can be estimated as

$$\ln(N_{1,0} / N_{0,1}) \quad (5.7)$$

where $N_{1,0}$ is the number of items that person n has right but m has wrong,
and $N_{0,1}$ is the number of items that person m has right but n has wrong.
When test data conform to the Rasch model, the relative abilities of two
persons can be estimated in this way using any selection of items without
regard to their difficulties (or any other characteristics). By making multiple
pairwise comparisons of this kind, it is possible to estimate the relative locations
of a number of persons on the same measurement variable.

Editor Note: This does assume that each person will get some items right and some items wrong.
This is not a feature of the model but rather a characteristic of the test and means that, to take advan-
tage of the pairwise comparison feature of the model, there must be some items in the test that even
test takers with low ability will get right and some items that even test takers with high ability will
get wrong. Low-ability test takers getting a difficult item right, for example, will not satisfy this need
because that would be data that does not conform to the model.

Comparing and Calibrating Items

In the application of Rasch's model to tests, every item has an imagined
location on the variable being measured. Two items i and j have imagined
locations $\delta_i$ and $\delta_j$ (Table 5.7).
It follows from Equation 5.4 that if items i and j are attempted by the same
person and this person's attempts at items i and j are independent of each
other, then the modeled difference between items i and j is

$$\delta_i - \delta_j = \ln(P_{0,1} / P_{1,0}) \quad (5.8)$$

where $P_{1,0}$ is the model probability of the person succeeding on item i but
failing item j, and $P_{0,1}$ is the probability of the person succeeding on item j
but failing item i.
It can be seen that Equation 5.8 is Rasch's model (Equation 5.2) applied
to the comparison of two items on a measurement variable. The two observable
events involve the person's success on one item but failure on the other
(Table 5.8).
In this comparison of items i and j, nothing was said about the ability of
the person attempting them. This is because Equation 5.8 applies to every

Table 5.7 Locations of Items i and j on a Measurement Variable

$\delta_j$ ---------------- $\delta_i$


Table 5.8 Two Possible Outcomes When the Same Person Attempts Items i and j

                                                              Observable Event
B − A                  Observation Opportunity                X       Y
$\delta_i - \delta_j$  Items i and j independently            0,1     1,0
                       attempted by the same person

person. The odds of success on item i, given success on one item but failure
on the other, is the same for every person and depends only on the relative
difficulties of items i and j.
Because the modeled odds $P_{0,1} / P_{1,0}$ are the same for every person, the
difference $\delta_i - \delta_j$ can be estimated as

$$\ln(n_{0,1} / n_{1,0}) \quad (5.9)$$

where $n_{1,0}$ is the number of persons with item i right but j wrong, and $n_{0,1}$ is
the number of persons with j right but i wrong.
When test data conform to the Rasch model, the relative difficulties of
two items can be estimated in this way using any group of persons without
regard to their abilities (or any other characteristics). By making multiple
pairwise comparisons of this kind, it is possible to estimate the relative loca-
tions of a number of items on the measurement variable.

Editor Note: This assumes that each item will be answered correctly by some respondents and
incorrectly by others. Again, this is not a feature of the model but is here a characteristic of the group
taking the test. It means that, to take advantage of the pairwise comparison feature of the model,
there must be some respondents that even get easy items wrong and some respondents that even
get difficult items right.
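A companion simulation sketch (an assumed setup using NumPy, not part of the chapter) of Equation 5.9: the relative difficulty of two items is recovered from the tally ratio over any group of persons, with the abilities never entering the estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
delta_i, delta_j = 0.8, -0.4
theta = rng.normal(0.0, 1.0, size=20000)   # abilities; generated but never
                                           # used by the estimator below

def rasch(delta):
    """Simulate all persons' dichotomous Rasch responses to one item."""
    p = 1.0 / (1.0 + np.exp(-(theta - delta)))
    return rng.random(theta.shape) < p

x_i, x_j = rasch(delta_i), rasch(delta_j)
n_10 = np.sum(x_i & ~x_j)    # persons with item i right but j wrong
n_01 = np.sum(~x_i & x_j)    # persons with j right but i wrong
print(np.log(n_01 / n_10))   # close to delta_i - delta_j = 1.2
```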

Application to Ordered Categories

The partial credit model applies Rasch's model for dichotomies to tests in
which responses to items are recorded in several ordered categories labeled
$0, 1, 2, \ldots, K_i$. Each person n is imagined to have an ability $\theta_n$, and each item i
is imagined to have a set of $K_i$ parameters $\delta_{i1}, \delta_{i2}, \ldots, \delta_{iK_i}$, each of which can be
represented as a location on the variable being measured ($\theta$). For example,
see Table 5.9, where $\delta_{ik}$ governs the probability of scoring k rather than $k - 1$
on item i (Table 5.10).

Table 5.9 Locations of Item and Person Parameters on a Measurement Variable

$\delta_{ik}$ ---------------- $\theta_n$


Table 5.10 Two Possible Outcomes When Person n Attempts Polytomous Item i

                                                           Observable Event
B − A                     Observation Opportunity          X       Y
$\theta_n - \delta_{ik}$  Person n attempts item i         k       k − 1

The Rasch model applied to this situation is

$$\theta_n - \delta_{ik} = \ln(P_k / P_{k-1}) \quad (5.10)$$

In polytomous test items, objective comparison (and thus objective measurement)
continues to depend on the modeling of the relationship between
two imagined locations on the variable and two observable events. This
comparison is independent of everything else within the frame of reference,
including other possible outcomes of the interaction of person n with item i.
The conditioning out of other possible outcomes to focus attention only on the
two observable events that provide information about the relative locations of
the two parameters of interest is a fundamental feature of Rasch's model.
The conditioning on a pair of adjacent response alternatives has parallels
with McFadden's (1974) assumption that a person's probability of choosing
to travel by car rather than by bus should be independent of the availability
of other options (e.g., train). McFadden refers to this as the assumption of
independence from irrelevant alternatives. In a similar way, it is assumed
in this application of Rasch's model that a person's probability of choosing or
scoring k rather than $k - 1$ is independent of all other possible outcomes.
When a person responds to an item with several ordered response categories, he or she must make a choice taking into account all available alternatives. The partial credit model makes no assumption about the response mechanism underlying a person's choice. It simply proposes that if category \( k \) is intended to represent a higher level of response than category \( k-1 \), then the probability of choosing or scoring \( k \) rather than \( k-1 \) should increase monotonically with the ability being measured.
As for dichotomously scored items, if person \( n \) could have multiple independent attempts at item \( i \), then the difference \( \theta_n - \delta_{ik} \) could be estimated from proportions or counts of occurrences of \( k \) and \( k-1 \):

\[ \hat{\theta}_n - \hat{\delta}_{ik} = \ln(p_k/p_{k-1}) \]  (5.11)


However, because multiple independent attempts at test items usually are not
possible, this method is not feasible in practice.
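
Although infeasible with real respondents, the thought experiment behind Equation 5.11 is easy to simulate. The sketch below, with assumed values for \( \theta_n \) and the item's step parameters, draws many independent attempts from the partial credit category probabilities (in the form given later as Equation 5.19) and recovers \( \theta_n - \delta_{ik} \) from the log-ratio of counts in adjacent categories.

```python
import math
import random

random.seed(1)
theta = 0.5                      # person ability (assumed)
deltas = [-1.0, 0.0, 1.2]        # step parameters delta_i1..delta_i3 (assumed)

def pcm_probs(theta, deltas):
    """Partial credit model category probabilities (Equation 5.19)."""
    numerators = [1.0]
    for k in range(1, len(deltas) + 1):
        numerators.append(math.exp(k * theta - sum(deltas[:k])))
    psi = sum(numerators)
    return [nu / psi for nu in numerators]

probs = pcm_probs(theta, deltas)
counts = [0] * len(probs)
for _ in range(100_000):
    counts[random.choices(range(len(probs)), weights=probs)[0]] += 1

# Equation 5.11: ln(p_k / p_{k-1}) recovers theta - delta_ik for each step
for k in range(1, len(probs)):
    estimate = math.log(counts[k] / counts[k - 1])
    print(f"k={k}: estimate {estimate:.3f} vs theta - delta = {theta - deltas[k-1]:.3f}")
```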

Comparing and Measuring Persons


In the application of Rasch's model to tests in which responses to items are recorded in several ordered categories, every person has an imagined location on the variable being measured (Table 5.11).


Table 5.11 Locations of Persons m and n on a Measurement Variable: a horizontal line representing the variable, with \( \theta_m \) located to the left of \( \theta_n \).

It follows from Equation 5.10 that if persons \( m \) and \( n \) attempt the same item and their attempts at that item are independent of each other, then the modeled difference between persons \( n \) and \( m \) is

\[ \theta_n - \theta_m = \ln(\pi_{k,k-1}/\pi_{k-1,k}) \]  (5.12)

where \( \pi_{k,k-1} \) is the model probability of person \( n \) scoring \( k \) but \( m \) scoring \( k-1 \), and \( \pi_{k-1,k} \) is the probability of person \( m \) scoring \( k \) but \( n \) scoring \( k-1 \) on that item.

It can be seen that Equation 5.12, which applies for all values of \( k \) \( (k = 1, 2, \ldots, K_i) \), is Rasch's model, Equation 5.2 (Table 5.12).
If one of persons \( m \) and \( n \) scores \( k \) on an item, and the other scores \( k-1 \), then the probability of it being person \( n \) who scores \( k \) is the same for every item and depends only on the relative abilities of persons \( m \) and \( n \).
Because the modeled odds \( \pi_{k,k-1}/\pi_{k-1,k} \) are the same for every item, the difference \( \theta_n - \theta_m \) can be estimated as

\[ \ln(N_{k,k-1}/N_{k-1,k}) \]  (5.13)

where \( N_{k,k-1} \) is the number of items on which person \( n \) scores \( k \) and \( m \) scores \( k-1 \), and \( N_{k-1,k} \) is the number of items on which person \( m \) scores \( k \) and \( n \) scores \( k-1 \).
Once again, when test data conform to Rasch's model, the relative abilities
of two persons can be estimated in this way using any selection of items. And
by making multiple pairwise comparisons of this kind, it is possible to estimate
the relative locations of a number of persons on the measurement variable.

Editor Note: Similarly to the issue described earlier, this assumes that each person sometimes scores k and sometimes scores k−1. Again, this is not a feature of the model but a characteristic of the respective response categories in the items in the test. It becomes apparent from this that modeling polytomous item responses can require large amounts of data if all possible pairwise comparisons are to be realized.

Table 5.12 Two Possible Outcomes of Persons n and m Attempting the Same Polytomous Item

                                                                   Observable Event
B                A                Observation Opportunity          X            Y
\( \theta_n \)   \( \theta_m \)   Persons n and m independently    (k, k−1)     (k−1, k)
                                  attempt the same item
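
A hypothetical tally illustrating Equation 5.13: the abilities of two persons are compared using only those items on which one scores \( k \) while the other scores \( k-1 \) (here \( k = 1 \); all scores are invented for illustration).

```python
import math

# (score of person n, score of person m) on each of seven items
item_scores = [(1, 0), (1, 0), (0, 1), (1, 1), (1, 0), (2, 1), (0, 0)]
k = 1

N_k_km1 = sum(1 for sn, sm in item_scores if (sn, sm) == (k, k - 1))  # N_{k,k-1}
N_km1_k = sum(1 for sn, sm in item_scores if (sn, sm) == (k - 1, k))  # N_{k-1,k}

# Equation 5.13: estimate of theta_n - theta_m; items on which the two
# persons score alike, or neither scores k or k-1, contribute nothing.
print(math.log(N_k_km1 / N_km1_k))  # 1.10: person n estimated more able
```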


Table 5.13 Location of Two Polytomous Item Parameters on the Same Measurement Variable: a horizontal line representing the variable, with \( \delta_{jk} \) located to the left of \( \delta_{ik} \).

Comparing and Calibrating Items


In polytomous items, each item parameter \( \delta_{ik} \) \( (k = 1, 2, \ldots, K_i) \) is a location on the variable being measured. The parameters \( \delta_{ik} \) and \( \delta_{jk} \) from two different items \( i \) and \( j \) can be compared on this variable (Table 5.13).

It follows from Equation 5.10 that if items \( i \) and \( j \) are attempted by the same person and this person's attempts at items \( i \) and \( j \) are independent of each other, then the modeled difference between parameters \( \delta_{ik} \) and \( \delta_{jk} \) is

\[ \delta_{ik} - \delta_{jk} = \ln(\pi_{k-1,k}/\pi_{k,k-1}) \]  (5.14)

where \( \pi_{k,k-1} \) is the probability of the person scoring \( k \) on item \( i \) but \( k-1 \) on item \( j \), and \( \pi_{k-1,k} \) is the probability of the person scoring \( k \) on item \( j \) but \( k-1 \) on item \( i \).

It can be seen that Equation 5.14, which applies for all values of \( k \) \( (k = 1, 2, \ldots, K_i) \), is Rasch's model, Equation 5.2 (Table 5.14).
In this comparison of items \( i \) and \( j \), nothing was said about the ability of the person attempting them. This is because Equation 5.14 applies to every person. When a person attempts items \( i \) and \( j \), the probability of the person scoring \( k \) on item \( i \), given that he or she scores \( k \) on one item and \( k-1 \) on the other, is the same for every person.
Because the modeled odds \( \pi_{k-1,k}/\pi_{k,k-1} \) are the same for every person, the difference \( \delta_{ik} - \delta_{jk} \) can be estimated as

\[ \ln(n_{k-1,k}/n_{k,k-1}) \]  (5.15)

where \( n_{k,k-1} \) is the number of persons scoring \( k \) on item \( i \) but \( k-1 \) on item \( j \), and \( n_{k-1,k} \) is the number of persons scoring \( k \) on item \( j \) but \( k-1 \) on item \( i \).

Table 5.14 Two Possible Outcomes When the Same Person Attempts Polytomous Items i and j

                                                                     Observable Event
B                   A                   Observation Opportunity      X            Y
\( \delta_{ik} \)   \( \delta_{jk} \)   Items i and j independently  (k−1, k)     (k, k−1)
                                        attempted by the same person


When test data conform to Rasch's model, the difference \( \delta_{ik} - \delta_{jk} \) can be estimated in this way using any group of persons without regard to their abilities (or any other characteristics).

Editor Note: In keeping with the previous editor notes, this assumes that the polytomous items elicit responses for each modeled score. Again, this is not a feature of the model but depends on the characteristics of the items in the test and the group of respondents taking the test.

Comparisons With Other Models


The partial credit model is one of a number of models that have been intro-
duced for the analysis of ordered response category data. To understand simi-
larities and differences between these models, it is useful to identify a couple
of broad classes of models.

Models With Discrimination Parameters


In some models proposed for the analysis of test data, in addition to a location \( \theta_n \) for each person \( n \) and a location \( \delta_i \) for each item \( i \), a discrimination parameter \( a_i \) is proposed for each item \( i \). Among models for ordered response categories that include a discrimination parameter are Samejima's (1969) graded response model and Muraki's (1992) generalized partial credit model.
These models differ from the partial credit model in that they do not enable specifically objective comparisons as described by Rasch. The reason for this can be seen most easily in the two-parameter dichotomous item response theory (IRT) model:

\[ a_i(\theta_n - \delta_i) = \ln(\pi_1/\pi_0) \]  (5.16)

If we follow the steps outlined earlier and consider independent attempts of two persons \( m \) and \( n \) at item \( i \), then for the two-parameter IRT model we obtain:

\[ a_i(\theta_n - \theta_m) = \ln(\pi_{1,0}/\pi_{0,1}) \]  (5.17)


where \( \pi_{1,0} \) is the probability of person \( n \) succeeding but \( m \) failing item \( i \), and \( \pi_{0,1} \) is the probability of person \( m \) succeeding but \( n \) failing.
It can be seen from Equation 5.17 that the odds of person n succeeding
but m failing given that one of these two persons succeeds and the other fails
is not the same for all items. Rather, the odds depend on the discrimination
of the item in question.
To compare the locations of persons \( m \) and \( n \) on the measurement variable, it is not possible to ignore the particulars of the items involved and simply tally occurrences of (1,0) and (0,1). The comparison of \( \theta_n \) and \( \theta_m \) on the measurement variable is dependent not only on the two observable events (1,0) and (0,1) that they govern, but also on the details (viz., the discriminations) of the items these two persons take. For this reason, the two-parameter IRT model does not permit objective comparison in the sense described by Rasch.
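
The dependence on discrimination can be checked numerically. In this sketch (all parameter values assumed), the log-odds of the discordant events for two persons is computed under the two-parameter model of Equation 5.16; it equals \( a_i(\theta_n - \theta_m) \) and so changes with \( a_i \), which is precisely why tallying (1,0) and (0,1) events across items fails here.

```python
import math

theta_n, theta_m = 1.0, 0.0  # hypothetical person locations

def p_success(theta, a, delta):
    """Two-parameter model probability of success (Equation 5.16, y = 1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))

for a in (0.5, 1.0, 2.0):  # three items differing only in discrimination
    pn = p_success(theta_n, a, 0.0)
    pm = p_success(theta_m, a, 0.0)
    odds = (pn * (1 - pm)) / (pm * (1 - pn))  # pi_{1,0} / pi_{0,1}
    # ln(odds) = a * (theta_n - theta_m): not the same across items
    print(f"a = {a}: ln(pi_10/pi_01) = {math.log(odds):.2f}")
```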

Models With Cumulative Thresholds


A second class of models for ordered response categories includes as parameters cumulatively defined thresholds. Each threshold parameter is intended to divide all ordered response alternatives to an item up to and including alternative \( k-1 \) from response alternatives \( k \) and above. L. L. Thurstone, who used the normal rather than the logistic function to model thresholds, referred to them as category boundaries.
The threshold notion is used as the basis for Samejima's graded response model. Her model also includes an item discrimination parameter, but that is ignored here for the sake of simplicity. Samejima's model takes the form:

\[ \theta_n - \gamma_{ik} = \ln[(\pi_k + \pi_{k+1} + \cdots + \pi_{K_i})/(\pi_0 + \pi_1 + \cdots + \pi_{k-1})] \]  (5.18)

In this model, the item threshold \( \gamma_{ik} \) governs the probability of scoring \( k \) or better on item \( i \).

Terminology Note: This form of Samejima's logistic model is very different from the way it is presented in either Samejima's own work (see Chapter 4) or the polytomous IRT literature generally. The way it is presented here is in keeping with the approach in this chapter of describing models in terms of specific comparisons; in this case, \( \theta_n \) and \( \gamma_{ik} \).

Table 5.15 compares Samejima's graded response model with the partial credit model for an item with four ordered response alternatives labeled 0, 1, 2, and 3.
From Table 5.15 it can be seen that the observable events in this model are compound events, for example:

Event X: Response in category 1 or 2 or 3
Event Y: Response in category 0

The consequence is that the elementary equations in this model are not independent because

\[ (\pi_1 + \pi_2 + \pi_3)/\pi_0 > (\pi_2 + \pi_3)/(\pi_0 + \pi_1) > \pi_3/(\pi_0 + \pi_1 + \pi_2) \]



As a result, thresholds are not independent, but are always ordered \( \gamma_{i1} < \gamma_{i2} < \gamma_{i3} \).


Table 5.15 Comparison of Samejima and Rasch Models for Polytomous Items

Elementary equations (person n, item i, \( K_i = 3 \)):
  Samejima: \( \theta_n - \gamma_{i1} = \ln[(P_1 + P_2 + P_3)/P_0] \); \( \theta_n - \gamma_{i2} = \ln[(P_2 + P_3)/(P_0 + P_1)] \); \( \theta_n - \gamma_{i3} = \ln[P_3/(P_0 + P_1 + P_2)] \)
  Rasch: \( \theta_n - \delta_{i1} = \ln[P_1/P_0] \); \( \theta_n - \delta_{i2} = \ln[P_2/P_1] \); \( \theta_n - \delta_{i3} = \ln[P_3/P_2] \)

Events being compared:
  Samejima: Compound (e.g., response in category 1 or 2 or 3 rather than 0)
  Rasch: Simple (comparison of adjacent response categories)

Item parameters:
  Samejima: Global/unconditional; each relates to all available response categories
  Rasch: Local/conditional; each relates to adjacent response categories only

Relationship of elementary equations:
  Samejima: Dependent, because \( (P_1 + P_2 + P_3)/P_0 > (P_2 + P_3)/(P_0 + P_1) > P_3/(P_0 + P_1 + P_2) \)
  Rasch: Independent (e.g., the odds of a response in category 1 rather than 0 are independent of the odds of a response in category 2 rather than 1)

Implications for item parameters:
  Samejima: \( \gamma_{i1} < \gamma_{i2} < \gamma_{i3} \)
  Rasch: The \( \delta \)s are unfettered and free to take any value

Model for ordered categories:
  Samejima: When brought together, the elementary equations provide a model for ordered response categories in which the person parameters cannot be conditioned out of the estimation procedure for the items
  Rasch: The elementary equations provide a model for ordered response categories in which the person parameters can be conditioned out of the estimation procedure for the items, and vice versa

Specific objectivity:
  Samejima: No
  Rasch: Yes

The elementary equations in Samejima's model lead to the following expressions for the probabilities of person \( n \) scoring 0, 1, 2, and 3 on item \( i \):

\[ \pi_{ni0} = 1 - \exp(\theta_n - \gamma_{i1})/[1 + \exp(\theta_n - \gamma_{i1})] \]
\[ \pi_{ni1} = \exp(\theta_n - \gamma_{i1})/[1 + \exp(\theta_n - \gamma_{i1})] - \exp(\theta_n - \gamma_{i2})/[1 + \exp(\theta_n - \gamma_{i2})] \]
\[ \pi_{ni2} = \exp(\theta_n - \gamma_{i2})/[1 + \exp(\theta_n - \gamma_{i2})] - \exp(\theta_n - \gamma_{i3})/[1 + \exp(\theta_n - \gamma_{i3})] \]
\[ \pi_{ni3} = \exp(\theta_n - \gamma_{i3})/[1 + \exp(\theta_n - \gamma_{i3})] \]

It is not possible to condition one set of parameters (either the person parameters or the item thresholds) out of the estimation procedures for the other in this model.


In contrast, the elementary equations for the Rasch model (see Table 5.15) lead to the following expressions for the probabilities of person \( n \) scoring 0, 1, 2, and 3 on item \( i \):

\[ \pi_{ni0} = 1/\Psi \]
\[ \pi_{ni1} = \exp(\theta_n - \delta_{i1})/\Psi \]
\[ \pi_{ni2} = \exp(2\theta_n - \delta_{i1} - \delta_{i2})/\Psi \]
\[ \pi_{ni3} = \exp(3\theta_n - \delta_{i1} - \delta_{i2} - \delta_{i3})/\Psi \]

where \( \Psi \) is the sum of the numerators.


In general, the partial credit model takes the form

\[ \pi_{nik} = \frac{\exp(k\theta_n - \delta_{i1} - \delta_{i2} - \cdots - \delta_{ik})}{\Psi} \]  (5.19)

It is possible to condition the person parameters out of the estimation procedures for the item parameters, and vice versa, in this model.
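
The contrast between the two sets of expressions can be made concrete in a few lines of code. This is only an illustrative sketch: \( \theta \) and the threshold values are assumed, and the graded response thresholds are deliberately ordered while the partial credit parameters are not, echoing the last rows of Table 5.15.

```python
import math

def grm_probs(theta, gammas):
    """Category probabilities built from cumulative (Samejima) thresholds."""
    # P(score >= k) for k = 1..K, flanked by P(>= 0) = 1 and P(>= K+1) = 0
    cum = [1.0] + [math.exp(theta - g) / (1 + math.exp(theta - g)) for g in gammas] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(gammas) + 1)]

def pcm_probs(theta, deltas):
    """Partial credit model category probabilities (Equation 5.19)."""
    nums = [math.exp(k * theta - sum(deltas[:k])) for k in range(len(deltas) + 1)]
    psi = sum(nums)
    return [nu / psi for nu in nums]

theta = 0.0
print(grm_probs(theta, [-1.0, 0.0, 1.0]))  # gammas must satisfy g1 < g2 < g3
print(pcm_probs(theta, [0.5, -0.5, 1.0]))  # deltas are free to be unordered
```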

Conclusion
As a member of the Rasch family of item response models, the partial credit model is closely related to other members of that family. Masters and Wright (1984) describe several members of this family and show how each has as its essential element Rasch's model for dichotomies. Andrich's (1978) model for rating scales, for example, can be thought of as a version of the partial credit model with the added expectation that the response categories are defined and function in the same way for each item in an instrument. With this added expectation, rather than modeling a set of \( m_i \) parameters for each item, a single parameter \( \delta_i \) is modeled for item \( i \), and a set of \( m \) parameters \( (\tau_1, \tau_2, \ldots, \tau_m) \) is proposed for the common response categories. To obtain the rating scale version of the PCM, each item parameter in the model is redefined as \( \delta_{ix} = \delta_i + \tau_x \). Wilson also has proposed a generalized version of the partial credit model (Wilson & Adams, 1993).
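
As a small illustration of this reparameterization (with invented values for the item locations and the common category parameters \( \tau_x \)), the rating scale version simply rebuilds every item's step parameters from one location plus the shared set of \( \tau \)s:

```python
# delta_ix = delta_i + tau_x: one location per item, one shared set of taus
item_locations = {"item 1": -0.5, "item 2": 0.8}   # delta_i (assumed)
taus = [-1.0, 0.2, 0.8]                            # tau_1..tau_3 (assumed)

step_parameters = {
    item: [delta + tau for tau in taus]
    for item, delta in item_locations.items()
}
print(step_parameters)
# {'item 1': [-1.5, -0.3, 0.3], 'item 2': [-0.2, 1.0, 1.6]}
```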

References

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Masters, G. N., & Wright, B. D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529–544.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), Frontiers in econometrics (pp. 105–142). New York: Academic Press.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
Rasch, G. (1977). On specific objectivity: An attempt at formalising the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58–94.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, Monograph Supplement 17.
Wilson, M., & Adams, R. J. (1993). Marginal maximum likelihood estimation for the ordered partition model. Journal of Educational Statistics, 18, 69–90.


Chapter 6

Understanding the Response Structure and Process in the Polytomous Rasch Model

David Andrich
The University of Western Australia

Editor Introduction: Rather than developing a specific polytomous IRT model, this chapter outlines and argues for the importance of the item response process that is modeled by all polytomous IRT models in the Rasch family of models. Modeling response processes in this way is argued to have an important advantage over the way the response process is modeled in what Hambleton, van der Linden, and Wells (Chapter 2) call indirect models and what are elsewhere referred to as divide-by-total (Thissen & Steinberg, 1986) or adjacent category (Mellenbergh, 1995) models.

The Rasch model for ordered response categories in standard formats was
derived from a sequence of theoretical propositions requiring invariance of
comparisons among item and among person parameter estimates. A model
with sufficient statistics is the consequence. The model was not derived to
describe any particular data (Andersen, 1977; Andrich, 1978; Rasch, 1961).
Standard formats involve only one response in one of the categories
deemed a priori to reflect increasing levels of the property and are common in
quantification of performance, status, and attitude in the social sciences. The
advantage of an item with more than two ordered categories is that, if the
categories work as intended, it gives more information than a dichotomous
item. Table 6.1 shows three common examples. Figure 6.1 shows a graphical counterpart using the first example in Table 6.1.
The ordered categories on the hypothesized continuum in Figure 6.1
are contiguous. They are separated on the continuum by successive points
termed thresholds. This is analogous to mapping a location of an object on a
line partitioned into equal units to obtain a physical measurement. Because
we do not have a fixed origin, there is no endpoint on the latent continuum of


Table6.1 Standard Response Formats for the Rasch Model

Fail Pass Credit Distinction

Never Sometimes Often Always

Strongly disagree Disagree Agree Strongly agree




Terminology Note: Thresholds are also sometimes referred to as category boundaries. What this chapter makes clear, and what should always be remembered, is that irrespective of which term is used (category boundary or threshold), these points on the trait continuum, which separate ordered response categories, are defined very differently in cumulative (e.g., Samejima's logistic graded response model) or adjacent category (e.g., Rasch) models. As a result, these category-separating points have a different meaning in these two most common types of polytomous IRT models.

In elementary analyses of data obtained from formats such as those in Table 6.1, and by analogy to physical measurement, successive integers are
assigned to the categories. In more advanced analyses, a probabilistic model
that accounts for the finite number of categories and for sizes of the catego-
ries is applied. The Rasch model is one such model, and in this model the
successive categories are scored with successive integers.

Identity of Rating and Partial Credit Models


In the examples in Table 6.1 it might be considered that if the same format is
used across all items, the sizes of the categories will also be the same across
all items. However, that is an empirical question, and it is possible that there
is an interaction between the content of the item and the response format,
so that the sizes of the categories are different for different items. It may also
be the case that different items have different formats that are natural to the
item with different numbers of categories, as, for example, when there are
different items in an achievement test with different maximum scores. The
Rasch model that has the same format across all items and has the same sized

Figure 6.1 Graphical representation of ordered categories: a line partitioned by three thresholds into four contiguous categories labeled Fail, Pass, Credit, and Distinction.


categories is referred to sometimes as the rating scale model. The model with
different sized categories or with different numbers of categories is referred
to sometimes as the partial credit model. The difference, as will be seen, is only
a matter of parameterization, and at the level of a single person responding
to a single item, the models are identical. The focus of this chapter is on
the response of one person to one item that covers both parameterizations.
Therefore, the model will be referred to simply as the polytomous Rasch
model (PRM). The dichotomous model is simply a special case of the PRM,
but where it is necessary to distinguish it as a dichotomous model in the
exposition, it will be referred to explicitly as the dichotomous RM.

Distinctive Properties of the PRM


The PRM has two properties that, when first disclosed, were considered somewhat counterintuitive: First, combining adjacent categories by summing the probabilities of responses in the categories, and in the related sense of summing their frequencies to form a single category, can only be done under very restricted circumstances (Andersen, 1977; Andrich, 1978; Jansen & Roskam, 1986; Rasch, 1966). Second, the thresholds analogous to those in Figure 6.1 that define the boundaries of the successive categories may take on values that are not in their natural order. In part because these properties are exactly opposite to those of the then prevailing model for ordered categories, that based on the work of Thurstone (Thurstone & Chave, 1929), they have been ignored, denied, circumvented, or generated debate and misunderstandings in the literature (Andrich, 2002). Known in psychometrics as the graded response model (GRM), the latter model has been developed further by Bock (1975), Samejima (1969, 1996, 1997), and McCullagh (1980).

Criterion That Ordered Categories Should Satisfy


One observation from these reactions to the model's properties is that in the development of response models for ordered category formats there is no a priori articulation of any criterion that data in ordered categories should satisfy; it seems it is simply assumed that if categories are deemed to be ordered, they will necessarily operate that way. One factor that immediately comes to mind as possibly violating the required order is respondents not being able to distinguish between two adjacent categories. This has been observed in data in Andrich (1979).
The theme of this chapter is that it is an empirical hypothesis whether or
not ordered categories work as intended. The chapter sets up a criterion that
must be met by data in an item response theory framework for it to be evi-
dent empirically that the categories are working as intended, and shows how
the PRM makes a unique contribution to providing the empirical evidence.
Meeting this requirement empirically is necessary because if the intended
ordering of the categories does not reflect successively more of the property,


then it puts into question the very understanding of what it means to have
more of the property and of any subsequent interpretations from the data.
The chapter is not concerned with issues of estimation and the tests of fit,
which are well covered in the literature, but in better understanding the dis-
tinctive properties of the model itself, and the opportunities it provides for
the empirical study of ordered polytomous response formats.
It is stressed that the criterion for ordered categories working as intended pertains to the data, and not to response models themselves irrespective of the data. The importance of distinguishing the properties of data from the procedures of models of analysis for ordered categories was recognized by R. A. Fisher, who had a method for analyzing data intended to be in ordered categories, and upon obtaining results for a particular data set noted: "It will be observed that the numerical values lie in the proper order for increasing reaction. This is not a consequence of the procedure by which they have been obtained, but a property of the data examined" (Fisher, 1958, p. 294). Any note from Fisher is worthy of substantial consideration and study (Wright, 1980).
This chapter demonstrates that the properties of the PRM are compatible
with treating the operation of the ordered categories as an empirical hypoth-
esis. In particular, it is demonstrated that the model has the remarkable
property that from a set of structurally dependent responses in an ordered
category format, it recovers information that would arise from compatible,
experimentally independent formats. This permits the inference regarding the
empirical ordering of categories. Thus, the chapter does not merely describe
the Rasch model for ordered categories from the perspective of modeling
data and for providing invariant comparisons, but presents a case that it is the
ideal model for characterizing the intended response process and for testing
empirically whether ordered categories are operating as required.
The chapter is organized as follows. We first describe an experiment
with independent responses at thresholds devised to assess unequivocally
the empirical ordering of the categories. We then analyze in detail three
response spaces, whose relationship needs to be understood. We also explain
why the probabilities and frequencies in adjacent categories cannot be
summed in the PRM except in special circumstances. Finally, we conclude
with a summary that includes a suggestion as to why over the long history
of the development and application of models for data in ordered categories,
and despite the lead from Fisher, no previous criteria have been articulated
that ordered categories must meet.

Criteria for Data in Ordered Response Categories


In preparation for developing and specifying a criterion for the empirical
ordering of categories, we consider some relationships between models and
data. These relationships are generally taken for granted, but they are made
explicit here because of their specific roles in relation to the PRM and the
theme of the chapter.


A Comment on the Uses of Models


One use of models is simply to summarize and describe data. Models describe
data in terms of a number of parameters that are generally substantially
smaller than the number of data points. It is of course necessary to check
the fit between the data and the model to be satisfied that the model does
describe the data.
A second use of models is to characterize the process by which data are gen-
erated. For example, the Poisson distribution arises from, among many other
circumstances, the cumulative effect of many improbable events (Feller, 1950, p. 282). This model is derived a priori to the data in characterizing a response process. If the data do not fit the model, then a question might be
asked about its characterization of the process. However, the fit of the data
to the model is only a necessary, not sufficient, condition to confirm that the
model characterizes the process in those data.
A third and much less conventional use of models is to express a priori
conditions that data are required to follow if they are to subscribe to some
principles. As indicated above, this is the case with the PRM. Following
a series of studies, Rasch articulated conditions of invariance of compari-
sons that data should have if they are to be useful in making quantitative
statements. Specifically, "(1) the comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison; (2) symmetrically, a comparison between two individuals should be independent of which particular stimuli within the class considered were instrumental for comparison" (Rasch, 1961, p. 332).
These conditions of invariance were not unique to Rasch; virtually identical conditions were articulated by Thurstone (1928) and Guttman (1950) before him. However, the distinctive contribution of Rasch was that Rasch expressed these conditions in terms of a probabilistic model. Rasch wrote his conditions for invariance in a general equation, which, in the probabilistic case for dichotomous responses, takes the form

\[ \Pr\{(Y_{ni} = y_{ni}, Y_{nj} = y_{nj});\, \beta_n, \delta_i, \delta_j \mid f(y_{ni}, y_{nj})\} = g(y_{ni}, y_{nj}, \delta_i, \delta_j) \]  (6.1)

where \( Y_{ni}, Y_{nj} \) are random variables whose responses \( (y_{ni}, y_{nj}) \) take the values {0,1}, \( \beta_n \) and \( \delta_i, \delta_j \) are location parameters of person \( n \) and items \( i \) and \( j \) respectively, and the right side of Equation 6.1 is independent of the person parameter \( \beta_n \). As indicated already, this leads to a class of models with sufficient statistics for the parameters, which generalizes to the PRM.
The key advantage of specifying the conditions in terms of a model is that
mathematical consequences, some of which might be initially counterintui-
tive, can be derived. This is the case with the Rasch model. However, when
the consequences follow mathematically from a specification as compelling
as that of making relatively invariant comparisons, then because they can
provide genuinely new insights that might not be apparent immediately intu-
itively, they should be understood. Another distinctive consequence of this


use of a model is that no amount of data analysis and demonstration of fit to the model or otherwise is relevant to the case for the model.

Definition of a Threshold in the Presence of Experimental Independence


Suppose that it is intended to assess the relative location of persons on some
construct that can be mapped on a linear continuum, for example, an achieve-
ment test. Items of successively increasing difficulty would be landmarks
of achievement (Thurstone, 1925) requiring successively increasing ability
for success.
Suppose further that the responses of persons to items are scored dichotomously, 1 for a successful and 0 for an unsuccessful response, respectively. From such responses, and in an arbitrary unit, the dichotomous RM (Fischer & Molenaar, 1995; Rasch, 1960, 1961; Wright, 1997; Wright & Panchapakesan, 1969) can be used to estimate the relative location of items on the continuum.
This model takes the form

\[ P\{Y_{ni} = y\} = \frac{e^{y(\beta_n - \delta_i)}}{1 + e^{\beta_n - \delta_i}} \]  (6.2)

where the variables are identical to those of Equation 6.1. The response function of Equation 6.2 for \( y = 1 \) is known as the item characteristic curve (ICC). Three ICCs for the dichotomous RM are illustrated in Figure 6.2. The data giving rise to the estimates in Figure 6.2 were simulated for 5,000 persons responding to six items independently (two sets of three items) with locations of −1.0, 0.0, and 1.0, respectively, in the first set. Only the responses of the first set of three items are shown in Figure 6.2. These data, together with those of the second set, are used later in the chapter to illustrate the derivations.
The responses {0,1} are ordered; the response \( y = 1 \) is deemed successful, and the response \( y = 0 \) unsuccessful. In achievement testing \( \delta_i \) is referred to as
[Figure 6.2 near here: three parallel ICCs plotting Pr{Y = 1} against person location in logits, for items located at −1.00, 0.03, and 0.94.]

Figure 6.2 ICCs for three items.

the difficulty of the item. In general terms, and following psychophysics (Bock & Jones, 1968), it is termed a threshold: it is the point at which a person with location \( \beta_n = \delta_i \) has an equal probability of being successful and unsuccessful:

\[ \Pr\{Y_{ni} = 1\} = \Pr\{Y_{ni} = 0\} = 0.5 \]  (6.3)


In the dichotomous RM, the ICCs are parallel (Wright, 1997), which is exploited in this chapter. We use the dichotomous RM to construct the PRM. However, to better understand the PRM for more than two ordered categories, the two-parameter logistic model (2PLM),

\[ P\{Y_{ni} = y\} = \frac{e^{y\alpha_i(\beta_n - \delta_i)}}{1 + e^{\alpha_i(\beta_n - \delta_i)}} \]  (6.4)

where \( \alpha_i \), known as the discrimination parameter, characterizes the slope of the ICC (Birnbaum, 1968), is also used in a later section of this chapter.

Notational Difference: The parameters in Equation 6.4 are presented in a format consistent with the rest of the chapter. Typically, however, the letter a is used to denote the discrimination parameter, which is denoted by \( \alpha \) in Equation 6.4. Similarly, the 2PLM typically uses \( \theta \) in place of \( \beta \), and b is used for the item difficulty parameter, which is \( \delta \) in Equation 6.4.
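
A brief numerical check of the parallelism claim (item locations taken from the simulation above; the person locations are arbitrary): under Equation 6.2 the log-odds of success is simply \( \beta_n - \delta_i \), so the gap between any two items' ICCs on the log-odds scale is constant at every ability.

```python
import math

def rasch_p(beta, delta):
    """Equation 6.2 with y = 1: probability of success."""
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

deltas = [-1.0, 0.0, 1.0]  # the three generating item locations
for beta in (-2.0, 0.0, 2.0):
    log_odds = [math.log(rasch_p(beta, d) / (1 - rasch_p(beta, d))) for d in deltas]
    # successive differences are 1.0 logit at every beta: parallel ICCs
    print(beta, [round(lo, 2) for lo in log_odds])
```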

The Guttman Structure and the Dichotomous Rasch Model


Another, more explicit view of the location of items of increasing difficulty,
equivalent to Thurstone's notion of landmarks, is that of Guttman (1950),
who enunciated an idealized deterministic response structure for unidimen-
sional items. The Guttman structure is central in understanding the PRM.
For \( I \) dichotomous items responded to independently, there are \( 2^I \) possible response patterns. These are shown in Table 6.2 for the case of three items. The top part of Table 6.2 shows the subset of patterns of responses according to the Guttman structure. The number of these patterns is \( I + 1 \).
The rationale for the Guttman structure in Table 6.2 (Guttman, 1950)
is that for unidimensional responses across items, if a person succeeds on
an item, then the person should succeed on all items that are easier than that
item, and that if a person fails on an item, then the person should fail on all
items more difficult than that item. The content of the items with different
difficulties operationalizes the continuum.
With experimentally independent items, it is possible that a deterministic
Guttman structure will not be observed in data. In that case, the dichoto-
mous RM may be used to locate items on a continuum. The dichotomous RM
is a probabilistic counterpart of the Guttman structure that is a deterministic
limiting case (Andrich, 1985). Specifically, for any person, the probability of


Table 6.2 The Guttman Structure With I = 3 Dichotomous Items in Threshold Order

Items                      1   2   3    Total Score x    Pr{(y_n1, y_n2, y_n3) | x}

I + 1 = 4 Guttman response patterns
                           0   0   0         0                1
                           1   0   0         1                0.667
                           1   1   0         2                0.678
                           1   1   1         3                1

2^I − I − 1 = 4 non-Guttman response patterns
                           0   1   0         1                0.248
                           0   0   1         1                0.085
                           1   0   1         2                0.235
                           0   1   1         2                0.087

success on an easier item will always be greater than the probability of suc-
cess on a more difficult item. This statement is evident from the parallel ICC
curves in Figure 6.2.
In the Guttman structure, as is evident in Table 6.2, the total score, \( x = \sum_{i=1}^{I} y_i \), completely characterizes the response pattern. In the dichotomous RM, the total score plays a similar role, though probabilistically; it is a sufficient statistic for the person parameter (Andersen, 1977; Rasch, 1961). If item thresholds are ordered in difficulty, then for a given total score, the Guttman pattern has the greatest probability of occurring (Andrich, 1985).

Furthermore, because of sufficiency, the probability of any pattern, given the total score \( x \), is independent of the person's ability. Thus, the probabilities of the patterns of responses for total scores of 1 and 2, shown in Table 6.2, are given by

\[ P\{(y_{n1}, y_{n2}, y_{n3}) \mid x = 1\} = \frac{e^{-y_{n1}\delta_1 - y_{n2}\delta_2 - y_{n3}\delta_3}}{e^{-\delta_1} + e^{-\delta_2} + e^{-\delta_3}} \]  (6.5)

\[ P\{(y_{n1}, y_{n2}, y_{n3}) \mid x = 2\} = \frac{e^{-y_{n1}\delta_1 - y_{n2}\delta_2 - y_{n3}\delta_3}}{e^{-\delta_1 - \delta_2} + e^{-\delta_2 - \delta_3} + e^{-\delta_1 - \delta_3}} \]  (6.6)

respectively, both of which are independent of the person ability \( \beta_n \) and are special cases of Equation 6.1. These equations are the basis of conditional estimation of the item parameters independently of the person parameters (Andersen, 1973).
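
The following sketch evaluates Equations 6.5 and 6.6 directly, using the generating thresholds −1.0, 0.0, and 1.0 from the simulation; the person parameter cancels and never appears. The results are close to Table 6.2, which used threshold estimates rather than the generating values.

```python
import math
from itertools import product

deltas = [-1.0, 0.0, 1.0]

def conditional_prob(pattern, deltas):
    """Pr{pattern | its total score}: Equations 6.5 and 6.6, free of beta_n."""
    x = sum(pattern)
    peers = [p for p in product((0, 1), repeat=len(deltas)) if sum(p) == x]
    w = lambda p: math.exp(-sum(y * d for y, d in zip(p, deltas)))
    return w(pattern) / sum(w(p) for p in peers)

for pattern in [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1)]:
    print(pattern, round(conditional_prob(pattern, deltas), 3))
# (1,0,0) -> 0.665, (0,1,0) -> 0.245, (0,0,1) -> 0.09, and likewise for x = 2
```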

Design of an Experiment to Assess the Empirical Ordering of Categories


We now consider the design of an experiment in which the empirical order-
ing of the categories can be investigated. The key feature of this experiment
is the empirical independence among the judgments.


Table 6.3 Operational Definitions of Ordered Classes for Judging Essays and a Response Structure Compatible With the Rasch Model

Fail (F): Inadequate setting. Insufficient or irrelevant information given for the story. Or, sufficient elements may be given, but they are simply listed from the task statement, and not linked or logically organized.
Pass (P): Discrete setting. Discrete setting as an introduction, with some details that also show some linkage and organization. May have an additional element to those listed that is relevant to the story.
Credit (C): Integrated setting. There is a setting that, rather than simply being at the beginning, is introduced throughout the story.
Distinction (D): Integrated and manipulated setting. In addition to the setting being introduced throughout the story, pertinent information is woven or integrated so that this integration contributes to the story.

Note: The labels of fail, pass, credit, and distinction have been added in this table for the purposes of this chapter (Harris, 1991).

To make the case concrete, consider the ordered category descriptors


shown in Table 6.3 that were used in assessing the abilities of students to
write a narrative in relation to a particular criterion. The responses among the
categories are not independent in the sense that if a response is made in one
category, it is not made in any other category. The task is to construct a design
compatible with Table6.3 in which independence of judgments prevails.
Clearly from the descriptors for each category, there is an intended order-
ing in the quality of performance with respect to the feature of setting. We
take it that the category descriptors operationalize the writing variable to be
measured and describe the qualities that reflect successively better writing
on this continuum. We further note that the first, and least demanding, cat-
egory is a complement to the second category, and that the other categories
show increasing quality of writing with respect to setting. We shall see how
this complementarity of the first category to the others plays out.
The experimental design involves taking the descriptors in Table 6.3 and
constructing independent dichotomous judgments at thresholds that are of
increasing difficulty.
Instead of one judge assigning an essay into one of four categories, con-
sider a design with three judges where each judge only declares whether each
essay is successful or not in achieving the standard at one of pass, credit,
or distinction. Thus, we have three independent dichotomous random vari-
ables. Although there are four categories, there are only three independent
responses. The F descriptor helps in understanding the variable in the region
of fail/pass, and helps the judge decide on the success or otherwise of the
essay at this standard. This is the role of the F descriptor in this design.
We now consider this experimental design, summarized in Table 6.4,
more closely.
The descriptors, as already indicated, describe the variable and what it takes
to reflect more of its property. The better the essay in terms of these characteris-
tics, the greater the probability that it will be deemed successful at each level.


Table 6.4 Experimental Design Giving Independent Responses

              Inadequate     Discrete      Integrated    Integrated and
              Setting (F)    Setting (P)   Setting (C)   Manipulated Setting (D)
Judgment 1    Not P          P             P             P
Judgment 2    Not C          Not C         C             C
Judgment 3    Not D          Not D         Not D         D

Two further specific cases may be highlighted in order to make clear the
operation of the experimental design. First, suppose the judge considers that
the essay does satisfy the P descriptor, but observes that it does not meet the
C and D descriptors. Then the judge should classify the essay as a success.
Second, suppose that the judge, still with respect to success at P, consid-
ers that an essay satisfies even the qualities described in C or D, or some
combination of these. Because of the structure of the descriptors as ordered
categories, which implies that C reflects more of the property to be measured
than P, and D even more of the property than C, the judge must classify it as
a success at P. The more of the properties of C and D an essay has, the greater
the probability of it being classified successful at the P level. Similar inter-
pretations follow for decisions at each of the other categories. It is stressed
that in each judgment it is the latent continuum that has been dichotomized
at each threshold, and not the categories as such.

Requirements of Data From the Experimental Design


In such an experimental design it would be required that the success rate at P
is greater than that at C, and that the success rate at C is in turn greater than
that at D. That is, it is required that it is more difficult to be successful at D
than at C, which in turn is more difficult than being successful at P.
If that were not the case (for example, if the success rate at C were the same as that at D for the same essays), then it would be inferred that the judges do not distinguish between the two levels consistently. This could arise, for
example, if the judge at C were harsher than intended and the judge at D
were more lenient than intended. Thus, it may be that the experiment did not
work, and it would need to be studied further to understand why this is the
case. But such evidence is central to treating the ordering of the categories as
an empirical hypothesis to be tested in the data.
Not only would we require that the thresholds increase in difficulty with
their a priori ordering, but we would want the same distance between them
irrespective of the location of the essays on the continuum. It seems unten-
able that these distances are different for essays of different quality. This uni-
formity of the relationships between these levels in probability is guaranteed
if the success rate curves at different levels of quality of the essays are paral-
lel, that is, if the dichotomous responses at the corresponding thresholds


follow the dichotomous RM. This is the essential justification for applying
the dichotomous RM to such data.
In summary, if \( \delta_P \), \( \delta_C \), \( \delta_D \) are the difficulties of the thresh