

Handbook of Polytomous Item Response Theory Models

Edited by

Michael L. Nering
Remo Ostini

Routledge
Taylor & Francis Group
270 Madison Avenue
New York, NY 10016

Routledge
Taylor & Francis Group
27 Church Road
Hove, East Sussex BN3 2FA

© 2010 by Taylor and Francis Group, LLC
Routledge is an imprint of Taylor & Francis Group, an Informa business

To purchase your own copy of this or any of Taylor & Francis or Routledge's collection of thousands of eBooks please go to www.eBookstore.tandf.co.uk.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Handbook of polytomous item response theory models / editors, Michael L. Nering, Remo Ostini.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-8058-5992-8 (hardcover : alk. paper)
1. Social sciences--Mathematical models. 2. Item response theory. 3. Psychometrics. 4. Social sciences--Statistical methods. I. Nering, Michael L. II. Ostini, Remo.

H61.25.H358 2010
150.287--dc22 2009046380

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the Psychology Press Web site at http://www.psypress.com

Contents

Preface vii

Contributors ix

Chapter 1 New Perspectives and Applications

Remo Ostini and Michael L. Nering

Chapter 2 IRT Models for the Analysis of Polytomously Scored Data: Brief and Selected History of Model Building Advances 21

Ronald K. Hambleton, Wim J. van der Linden, and Craig S. Wells

David Thissen, Li Cai, and R. Darrell Bock

Fumiko Samejima

Geoff N. Masters

Polytomous Rasch Model 123

David Andrich

Chapter 7 Factor Analysis of Categorical Item Responses 155

R. Darrell Bock and Robert Gibbons

Chapter 8 Testing Fit to IRT Models for Polytomously Scored Items 185

Cees A. W. Glas


Chapter 9 An Application of the Polytomous Rasch Model to Mixed Strategies 211

Chun-Wei Huang and Robert J. Mislevy

Aimee Boyd, Barbara Dodd, and Seung Choi


Seonghoon Kim, Deborah J. Harris, and Michael J. Kolen

Index 293

Preface

The Handbook of Polytomous Item Response Theory Models brings together leaders in the field to tell the story of polytomous item response theory (IRT). It is designed to be a valuable resource for researchers, students, and end-users of polytomous IRT models, bringing together in one book the primary actors in the development of the most important polytomous IRT models to describe their work in their own words. Through the chapters in the book, the authors show how these models originated and were developed, as well as how they have inspired or assisted applied researchers and measurement practitioners. It is hoped that hearing these stories and seeing what can be done with these models will inspire more researchers, who might not otherwise have considered using polytomous IRT models, to apply these models in their own work and thereby achieve the type of improved measurement that IRT models can provide.

This handbook is for measurement specialists, practitioners, and graduate students in psychological and educational measurement who want a comprehensive resource for polytomous IRT models. It will also be useful for those who want to use the models but do not want to wade through the fragmented mass of original literature, and who need a more comprehensive treatment of the topic than is available in the individual chapters that occasionally show up in textbooks on IRT. It will likewise be useful to specialists who are unfamiliar with polytomous IRT models but want to add them to their repertoire, particularly psychologists and assessment specialists in individual differences, social, and clinical psychology, who develop and use tests and measures in their work.

The handbook contains three sections. Part 1 is a comprehensive account of the development of the most commonly used polytomous IRT models and their location within two general theoretical frameworks. The context of the development of these models is presented within either an historical or a conceptual framework. Chapter 1 describes the contents of this book and discusses major issues that cut across different models. It also provides a model reference guide that introduces the major polytomous IRT models in a common notation and describes how to calculate information functions for each model. Chapter 2 outlines the historical context surrounding the development of influential models, providing a basis from which to investigate individual models more deeply in subsequent chapters. Chapters 1 and 2 also briefly introduce software that can be used to implement polytomous IRT models, providing readers with a practical resource when they are ready to use these models in their own work. In Chapters 3, 4, 5, and 6, the psychometricians responsible for important specific models describe the development of the models, outlining important underlying features of the models and how they relate to measurement with polytomous test items.

Part 2 contains two chapters that detail two very different approaches to evaluating how well specific polytomous IRT models work in a given measurement context. Although model-data fit is not the focus of Chapter 7 (while being very much the focus of Chapter 8), each of these chapters makes a substantial contribution to this difficult problem. Reminiscent of the earlier struggles in structural equation modeling, the lack of a strong fit-testing regimen is a serious impediment to the widespread adoption of polytomous IRT models. Careful appraisal of the properties of the evaluation procedures and fit tests outlined in the two chapters in this section, along with their routine implementation in accessible IRT software, would go far toward filling this need.

The final section demonstrates a variety of ways in which these models have been used. In Chapter 9 the authors investigate the different test-taking strategies of respondents using a multidimensional polytomous IRT model. Chapter 10 describes computerized adaptive testing (CAT) using polytomous IRT models and provides a review of CAT applications in both applied and research settings. Equating test scores across different testing contexts is an important practical challenge in psychological and educational testing. The theoretical and practical considerations in accomplishing this task with polytomous IRT models are the focus of the last chapter in this handbook.

Disparate elements of the book are linked through editorial sidebars that connect common ideas across chapters, compare and reconcile differences in terminology, and explain variations in mathematical notation. This approach allows the chapters to remain in the authors' own voices while drawing together commonalities that exist across the field.

Acknowledgements

This book is clearly a collaborative effort, and we first and foremost acknowledge the generosity of our contributing authors, particularly for sharing their expertise, but also for their ongoing support for this project and for giving us a glimpse into the sources of the inspirations and ideas that together form the field of polytomous item response theory. A project like this always has unheralded contributors working behind the scenes. That this book was ever completed is due in no small part to the determination, skills, abilities, hard work, forbearance, and good humor of Kate Weber. Thanks, Kate! We are also grateful for the assistance of the reviewers, including Mark D. Reckase of Michigan State University and Terry Ackerman of the University of North Carolina at Greensboro, as well as other unsung colleagues who contributed by casting a careful eye over different parts of this project. We especially thank Jenny Ostini, Wonsuk Kim, Liz Burton, Rob Keller, Tom Kesel, and Robin Petrowicz. A critical catalyst in bringing this project to fruition was a generous visiting fellowship from Measured Progress to RO, which we gratefully acknowledge. Finally, we thank the capable staff at Taylor and Francis, particularly Debra Riegert and Erin Flaherty, for their confidence in this project and for their skill in turning our manuscript into this handbook.

RO Ipswich, Queensland

Contributors

David Andrich, University of Western Australia
Aimee Boyd, Pearson
Li Cai, University of North Carolina at Chapel Hill
Seung Choi, Northwestern University
Barbara Dodd, University of Texas at Austin
Robert Gibbons, University of Illinois at Chicago
Cees A. W. Glas, University of Twente
Ronald K. Hambleton, University of Massachusetts at Amherst
Deborah J. Harris, ACT, Inc.
Chun-Wei Huang, WestEd
Seonghoon Kim, Keimyung University
Geoff N. Masters, Australian Council for Educational Research
Robert J. Mislevy, University of Maryland
Michael L. Nering, Measured Progress
Remo Ostini, Healthy Communities Research Centre, University of Queensland
Fumiko Samejima, University of Tennessee
David Thissen, University of North Carolina at Chapel Hill
Wim J. van der Linden, CTB/McGraw-Hill
Craig S. Wells, University of Massachusetts at Amherst

Part I

Development of Polytomous IRT Models

Chapter 1

New Perspectives and Applications

Remo Ostini
Healthy Communities Research Centre, University of Queensland

Michael L. Nering
Measured Progress

Polytomous item response theory (IRT) models are measurement models used to help us understand the interaction between examinees and test questions where the test questions have various response categories. These test questions are not scored in a simple dichotomous manner (i.e., correct/incorrect); rather, they are scored in a way that reflects the particular score category that an examinee has achieved, been classified into, or selected (e.g., a score point of 2 on an item that is scored from 0 to 4, or selecting somewhat agree on a survey).

Polytomous items have become omnipresent in the educational and psychological testing community because they offer a much richer testing experience for the examinee while also providing more psychometric information about the construct being measured. There are many terms used to describe polytomous items (e.g., constructed response items, survey items), and polytomous items can take on various forms (e.g., writing prompts, Likert-type items). Essentially, polytomous IRT models can be used for any test question where there are several response categories available.

The development of measurement models that are specifically designed around polytomous items is complex, spans several decades, and involves a variety of researchers and perspectives. In this book we intend to tell the story behind the development of polytomous IRT models, explain how model evaluation can be done, and provide some concrete examples of work that can be done with polytomous IRT models. Our goal in this text is to give the reader a broad understanding of these models and how they might be used for research and operational purposes.

This book is intended for anyone who wants to learn more about polytomous IRT models. Many of the concepts discussed in this book are technical in nature and will require an understanding of measurement theory and some familiarity with dichotomous IRT models. There are several excellent sources for learning more about measurement generally (Allen & Yen, 1979; Anastasi, 1988; Crocker & Algina, 1986; Cronbach, 1990) and dichotomous IRT models specifically (e.g., Embretson & Reise, 2000; Rogers, Swaminathan, & Hambleton, 1991). Throughout the book there are numerous references that are valuable resources for those interested in learning more about polytomous IRT.

This handbook is designed to bring together the major polytomous IRT models in a way that helps both students and practitioners of social science measurement understand where these state-of-the-art models come from, how they work, and how they can be used. As Hambleton, van der Linden, and Wells (Chapter 2) point out, the handbook is not an exhaustive catalogue of all polytomous IRT models, but the most commonly used models are presented in a comprehensive manner.

It speaks to the maturation of this field that there are now models that appear to have fallen by the wayside despite what could be considered desirable functional properties. Rost's (1988) successive intervals model might be an example of such a model, in that very little research has been focused on it. Polytomous IRT also has its share of obscure models that served their purpose as the field was finding its feet but which have been supplanted by more flexible models (e.g., Andrich's (1982) dispersion model has given way to the partial credit model) or by mathematically more tractable models (e.g., Samejima's (1969) normal ogive model is more difficult to use than her logistic model).

Perhaps the most prominent model not to receive separate treatment in this handbook is the generalized partial credit model (GPCM; Muraki, 1992). Fortunately, the structure and functioning of the model are well covered in a number of places in this book, including Hambleton and colleagues' survey of the major polytomous models (Chapter 2) and Kim, Harris, and Kolen's exposition of equating methods (Chapter 11).

Rather than focus on an exhaustive coverage of available models, this handbook tries to make polytomous IRT more accessible to a wider range of potential users in two ways. First, providing material on the origins and development of the most influential models brings together the historical and conceptual setting for those models in a way that is not easily found elsewhere. The appendix to Thissen, Cai, and Bock's chapter (Chapter 3) is an example of previously unpublished material on the development context for the nominal model.

Second, the handbook provides concrete examples of what can be done with these models, including the challenge of evaluating model functioning (Bock & Gibbons, Chapter 7; Glas, Chapter 8), applying the models in computerized adaptive testing (CAT; Boyd, Dodd, & Choi, Chapter 10), equating test scores derived from polytomous models (Kim et al., Chapter 11), and using a polytomous IRT model to investigate examinee test-taking strategies (Huang & Mislevy, Chapter 9).

Part 1: Development

The first part of the book presents the major polytomous IRT models, with the story of the development of each model told by the people whose work is most closely associated with the models. We begin with a chapter by Hambleton, van der Linden, and Wells (Chapter 2), which broadly outlines various influential polytomous models, introducing their mathematical form and providing some of the common historical setting for the models. Introducing a range of models in this consistent way forms a solid basis for delving into the more complex development and measurement issues addressed in later chapters. Hambleton and colleagues also introduce models that are not addressed in later model development chapters (e.g., the generalized partial credit model, nonparametric IRT models) and touch on parameter estimation issues and other challenges facing the field.

Thissen and Cai (Chapter 3) provide a succinct introduction to the nominal categories item response model (often known elsewhere as the nominal response model). They neatly describe derivations and alternative parameterizations of the model, as well as showing various applications of the model. Saving the best for last, Thissen and Cai provide a completely new parameterization for the nominal model. This new parameterization builds on 30 years of experience to represent the model in a manner that facilitates extensions of the model and simplifies the implementation of estimation algorithms for the model. The chapter closes by coming full circle with a special contribution by R. Darrell Bock, which provides previously unpublished insight into the background to the model's genesis and is one of the highlights of this book.

Samejima's chapter (Chapter 4) is in some ways the most ambitious in this book. It presents a framework for categorizing and cataloguing every possible unidimensional polytomous IRT model, including every specific model developed to date as well as future models that may be developed. Issues with nomenclature often arise in topics of a technical nature, and this is certainly the case in the world of polytomous IRT models. For example, Samejima typically used the term graded response model, or the related general graded response model, to refer to her entire framework of models. In effect, the graded response model is, for her, a model of models. In common usage, however, graded response model (GRM) refers to a specific model, which Samejima developed before fully expounding her framework. Samejima herself calls this specific model the logistic model in the homogeneous case, and never directly addresses this terminology conflict. Ultimately the reader simply needs to be aware that when Samejima refers to the graded response model, she is referring to her framework, while other authors are referring to her logistic model.

The other major terminological issue is the distinction between Samejima's homogeneous case and her heterogeneous case. This dichotomy refers to different types of models: the two major branches in her framework. Early researchers understood the heterogeneous case to simply be Samejima's logistic model (usually called the GRM) with a discrimination parameter that varied across categories. This is not correct. The simplest way to understand the distinction is that the homogeneous case is the term for models that are elsewhere called difference models (Thissen & Steinberg, 1986), cumulative models (Mellenbergh, 1995), or indirect models (Hambleton et al., Chapter 2), whereas models in the heterogeneous case are essentially all other polytomous IRT models, including the nominal response model and Rasch-type models. These issues will be highlighted on occasion as they arise throughout the book.

Prior to presenting the comprehensive framework in Chapter 4, Samejima outlines a set of criteria for evaluating the adequacy of any given model. These criteria are essentially an argument for assessing a model, not at the model-data fit level, but rather at the level of how the model operates: the structural and functional properties that determine how the model represents response data.

The chapter by Andrich (Chapter 6) also presents an argument about model functioning. Andrich argues that the feature of polytomous Rasch models that allows item category thresholds to be modeled in a different order from the response categories themselves provides an important diagnostic tool for testing data, and the items that produced it. Moving beyond the simple, intuitively appealing, but ultimately inaccurate representation of these thresholds as successive steps in a response process, Andrich argues that the presence of unordered category thresholds is an indication of an improperly functioning item. He notes that this diagnostic ability is not available in models belonging to Samejima's homogeneous case of graded response models. Thus, while Samejima's argument outlines how models should function to properly represent response data, Andrich's argument concerns the properties that should be present in response data if they are to be properly modeled.

Nested between these two chapters is a relatively unargumentative presentation of the logic behind the partial credit model (PCM), as told by the originator of that model (Masters, Chapter 5). The PCM is an extremely flexible polytomous model that can be applied to any polytomous response data, including data from tests that have items with different numbers of categories, from questionnaires using rating scales, or both. While the PCM is a model in the Rasch family of models, it was developed separately from Rasch's (1961) general polytomous model.

The presentation of the PCM in Masters's chapter (as well as its description in the Hambleton et al. chapter) focuses more narrowly on its use with the types of items that give rise to the model's name: ability test items for which it is possible to obtain partial credit. This focus can obscure the versatility of the model and also tends to reinforce the notion of item category thresholds being successive steps in an item response process. This notion glosses over the fact that these thresholds do not model responses to pairs of independent categories, as Masters notes toward the end of his chapter. Nor are the thresholds truly successive steps, because they do not take into account response probabilities beyond the categories being modeled (see, e.g., Tutz, 1990; Verhelst & Verstralen, 1993).

Differences over step terminology and the ambiguity surrounding what the difficulty parameter of PCM thresholds actually represents should not obscure the fact that this is a very versatile model with desirable statistical properties. Part of its flexibility derives from the fact that for polytomous models (including the PCM), the discriminating power of a specific category depends not only on a separately modeled parameter but also on the proximity of adjacent thresholds (Muraki, 1992, 1993; Muraki & Bock, 1999). The closer two adjacent thresholds are, the more discriminating the category that they bound.

Digging deeper into why the item step notion is often misunderstood in the PCM reveals a feature of polytomous IRT model item parameters that underlies a broader set of misunderstandings in polytomous IRT models generally. Simply put, the probability of traversing a category boundary threshold (i.e., of passing an item step) is not the same as the probability of responding in the next category, except in one trivial case. These probabilities are never the same in the case of polytomous Rasch models or other models in Samejima's heterogeneous case of graded models (also called divide-by-total models (Thissen & Steinberg, 1986), adjacent category models (Mellenbergh, 1995), or direct models (Hambleton et al., Chapter 2)).

What this means, and why these probabilities are not the same, is easiest to see for models in Samejima's homogeneous case of graded models (also called difference models (Thissen & Steinberg, 1986), cumulative models (Mellenbergh, 1995), or indirect models (Hambleton et al., Chapter 2)). As will be shown in later chapters, in this type of polytomous IRT model the probability of passing an item threshold is explicitly modeled as the probability of responding in any category beyond that threshold. This is clearly not the same as responding in the category immediately beyond the threshold, unless it is the final item threshold and there is only one category remaining beyond it. For example, the modeled probability of passing the threshold between the second and third categories in a five-category item (i.e., passing the second step in a four-step item) is explicitly modeled as the probability of responding in Categories 3, 4, or 5. Clearly, this cannot be the same probability as that of responding in Category 3.
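This cumulative structure is easy to verify numerically. The following is a minimal sketch, not the book's own example, assuming a logistic cumulative model with hypothetical values for the discrimination a and the thresholds b; later chapters present the formal models:

```python
import math

def grm_pass_probs(theta, a, b):
    """Cumulative ('passing') probabilities for each threshold b_k:
    P*(k) = P(responding in any category beyond threshold k)."""
    return [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b]

def grm_category_probs(theta, a, b):
    """Category probabilities as differences of adjacent cumulative curves."""
    p_star = [1.0] + grm_pass_probs(theta, a, b) + [0.0]
    return [p_star[k] - p_star[k + 1] for k in range(len(b) + 1)]

# Five-category item: four thresholds (illustrative values only).
a, b = 1.2, [-1.5, -0.5, 0.5, 1.5]
theta = 0.0
p = grm_category_probs(theta, a, b)   # p[0] is Category 1, ..., p[4] is Category 5

p_pass_second = grm_pass_probs(theta, a, b)[1]
# Passing the second threshold means responding in Categories 3, 4, or 5 ...
assert abs(p_pass_second - (p[2] + p[3] + p[4])) < 1e-9
# ... which is larger than the probability of responding in Category 3 alone.
assert p_pass_second > p[2]
```

The two assertions make the point in the text concrete: the threshold probability is a sum over all higher categories, not the probability of the single category immediately beyond the threshold.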

The distinction between passing a threshold and responding in a category is more difficult to appreciate (but equally real) in Rasch-type (divide-by-total, direct) polytomous IRT models, because the category boundary thresholds are only defined (and modeled) between pairs of adjacent categories. Thus, in this type of model, the probability of passing the threshold from the second to the third category in the aforementioned five-category item is modeled simply as the probability of responding in Category 3 rather than in Category 2. This is not the same probability as simply responding in Category 3, even though it might sound like it should be. In fact, the probability of just responding in Category 3 is a function of the probability of passing every threshold in the entire item. In simple terms, the probability of responding in Category 3 is not the same as that of responding in Category 3 rather than Category 2, because it is instead the probability of responding in Category 3 rather than responding in Categories 1, 2, 4, or 5.
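The divide-by-total case can be sketched the same way. In this minimal illustration (hypothetical threshold values; the formal model appears in later chapters), the threshold probability is the conditional probability of Category 3 rather than Category 2, which is not the marginal probability of Category 3:

```python
import math

def pcm_category_probs(theta, deltas):
    """Partial credit model in divide-by-total form: P(X = k) is
    exp(sum of (theta - delta_j) for j <= k) over the total across categories."""
    logits = [0.0]                        # empty sum for the first category
    for d in deltas:
        logits.append(logits[-1] + (theta - d))
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

deltas = [-1.0, 0.0, 0.5, 1.5]            # five categories, four thresholds
theta = 0.2
p = pcm_category_probs(theta, deltas)     # p[0] is Category 1, ..., p[4] is Category 5

# Passing the second threshold: Category 3 rather than Category 2 ...
p_3_rather_than_2 = p[2] / (p[1] + p[2])
# ... which is a logistic function of (theta - delta_2) alone ...
assert abs(p_3_rather_than_2 - 1 / (1 + math.exp(-(theta - deltas[1])))) < 1e-9
# ... and is not the marginal probability of responding in Category 3,
# which depends on every threshold in the item through the denominator.
assert abs(p_3_rather_than_2 - p[2]) > 0.1
```

Note how the denominator of p[2] sums over all five categories: this is the sense in which the probability of responding in Category 3 is a function of every threshold in the item.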

In practice, a manifestation of this distinction is that for a given set of data, modeled by a general direct and a general indirect polytomous model, the probabilities associated with the category boundaries (e.g., the difficulty or location parameters) will be quite different for the two models, whereas the probability of responding in a particular category will be almost identical across the range of the measurement scale (Ostini, 2001).

One thing that the foregoing discussion tells us is that category thresholds (specifically, their associated parameters) have different meanings in difference models compared to divide-by-total models, a situation that does not exist for dichotomous models. Put another way, whereas restricting a two-parameter dichotomous model to only one modeled item parameter effectively makes it a Rasch model, this is not at all the case for polytomous models. The modeled threshold parameters for the generalized partial credit model (GPCM) and Samejima's logistic model do not have the same meaning, even though they have the same number and type of parameters, and removing the discrimination parameter from the logistic model does not make it a polytomous Rasch model.

The failure to appreciate the distinction between passing a threshold and responding in a category is easy to understand considering that polytomous IRT models were extrapolated from dichotomous models, where this distinction does not exist. Passing the threshold between the two categories in a dichotomous item has the same probability as responding in the second category. In the dichotomous case, that is precisely what passing the threshold means. It means getting the item right, choosing yes instead of no, and getting a score of 1 rather than 0. As has hopefully been made clear, passing a polytomous item threshold is not nearly that simple.
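The dichotomous equivalence is easy to check numerically. A small sketch with hypothetical values, writing the same one-threshold item in both the cumulative (difference) form and the divide-by-total (Rasch) form:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

theta, delta = 0.4, -0.3   # hypothetical ability and threshold location

# Cumulative (difference-model) form: P(pass the single threshold).
p_pass = sigmoid(theta - delta)

# Divide-by-total (Rasch) form: P(second category) over both categories.
num = math.exp(theta - delta)
p_cat2 = num / (1.0 + num)

# With only two categories, the two probabilities coincide exactly:
# passing the threshold just is responding in the second category.
assert abs(p_pass - p_cat2) < 1e-12
```

This is the one trivial case mentioned earlier; as soon as a third category is added, the two forms part ways.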

Failing to make the distinction between the two probabilities (passing a threshold and responding in a category) with polytomous IRT models is perhaps understandable given that these models were largely developed and first applied in the context of tests of ability. In that context, passing the threshold between the second and the third category in an item is commonly understood to mean that you get the mark for the third category. What is ignored in this understanding is that it also means that you failed to pass subsequent thresholds on the item, even though this failure must be (and is) included in the modeled probability of responding in that third category.

The early context for polytomous IRT models, combined with a failure to clearly enunciate the semantic distinction between threshold probabilities and category probabilities, likely contributed to the misunderstanding surrounding the step notion for category thresholds in polytomous Rasch models. This misunderstanding leads to the belief that the difficulty of passing a category threshold is the same as the difficulty for that category when, as we have seen above, it is not the same probability.

The failure to rigorously distinguish between category threshold probabilities and the probability of responding in a category can lead to some looseness of terminology in discussing polytomous IRT models and their usage. In such cases, passing a threshold is spoken of as responding in a category. Examples of this sort of blurring of the distinction between the two types of probability can be seen by implication in parts of Chapters 2 and 11.

While blurring the distinction between the two types of probability is unlikely to have any adverse consequences for respondents' test scores, for example, it can lead to misunderstandings about the nature of, and relationships between, different types of models. It can also lead to misunderstandings about how polytomous models operate: what they can do, how they do it, and what they provide the test user. Equally importantly, this distinction has implications for the types of arguments that Andrich is making in Chapter 6. As a result, being clear about the distinction between the probability of passing a category threshold and the probability of responding in a category is important for discussions about choosing among the different models.

Part 2: Evaluation

On our journey to actually using the polytomous IRT models in the application section, we first provide two different methods for evaluating the use of polytomous models relative to the data at hand. The task of model evaluation is not an easy one, and there are several ways one might perform it. Our intent here is to provide an overview of a couple of approaches that might be considered.

Bock and Gibbons (Chapter 7) describe the development of an extension to full information factor analysis (FIFA), which not only brings multidimensional IRT a step closer, but also allows the dimensionality of an instrument to be evaluated through a confirmatory factor analysis procedure. A feature of this chapter is the worked example that clearly shows how to take advantage of the possibilities that this method provides. An interesting but uncommon form of confirmatory factor analysis, bifactor analysis, is also described and demonstrated in this chapter. This innovative model test process provides an elegant way to test models for data that contain one general factor and a number of group factors.

In Chapter 8, Glas focuses squarely on the problem of evaluating fit in

polytomous IRT models. Outlining both an innovative likelihood-based

framework and a Bayesian approach, Glas systematically addresses the chal-

lenges and complexities of evaluating both person and item fit in models

with a substantial number of estimated parameters. He shows how these

approaches can be applied to general versions of three broad types of poly-

tomous IRT modelsRasch type, Samejima homogeneous case type, and

Downloaded by [The University of Edinburgh] at 10:42 26 September 2017

tics can be calculated. Given that the lack of adequate fit tests might be

considered the Achilles' heel of polytomous IRT, the solutions that Glas

provides warrant enthusiastic investigation to determine whether they can

fulfill their promise. If these methods prove successful, their inclusion in

future polytomous IRT model estimation software would greatly enhance

their usability and reach.

Part 3: Applications

Rather than catalogue examples of areas where polytomous IRT models have

been used in practice, the approach in this book is to focus on a few key issues

that are important when using polytomous IRT models in applied settings. In

Chapter 9, Huang and Mislevy apply a multidimensional polytomous Rasch

model to the investigation of the different strategies that test takers bring to an

examination. The multidimensional model being used here is parameterized very

differently than the multidimensional model used in the FIFA chapter (Chapter

7), with different dimensions representing respondent differences rather than

item differences. This very flexible model is used to score student responses in

terms of their conceptions of a content domain, rather than in terms of the cor-

rectness of the response, which is typically the focus of ability measurement.

In their chapter on computerized adaptive testing (CAT) with polytomous

IRT models Boyd et al. (Chapter 10) provide a detailed and clear presenta-

tion of both the major issues that arise in CAT and how polytomous CAT

has been used. The presentation includes sections on CAT use in research

and applied settings and describes both the challenges and opportunities

associated with polytomous CAT.

In the final chapter, Kim, Harris, and Kolen (Chapter 11) provide a care-

ful and comprehensive survey of equating methods and how those methods

can be applied in the context of polytomous IRT models. The breadth and

depth of coverage in this chapter results in an excellent overview of different

equating methods, their advantages, their challenges, and issues specifically

associated with their use in polytomous IRT models. The figures and the

example provided in the chapter are a welcome feature and help to make

more concrete some of the distinctions between the three equating methods

that are described in the first part of the chapter. This chapter is particularly
timely given that polytomous IRT models are increasingly used operationally rather than primarily being presented and

studied from a theoretical framework.

Full information factor analysis (Chapter 7), the approach to model-data

fit developed by Glas (Chapter 8), and the investigation of mixed-response

strategies (Chapter 9) are features of polytomous IRT modeling that are still

largely confined to research settings. In contrast, CAT and equating have

moved beyond the research setting and are important elements in the routine

use of these models. It is a sign of the maturation that is occurring in this field

that most of the models described in the following chapters are being used to
solve practical measurement problems. The strengths
of polytomous IRT models are also being drawn on more in the construc-

tion of published measurement instruments, such as, for example, Hibbard,

Mahoney, Stockard, and Tusler's (2005) patient activation measure.

Integration

Rather than modify what authors have written, we have tried to enhance the

flow of the book by connecting chapters with notes that draw together rela-

tionships among different chapters. We do this in the first instance through

editor note sidebars at the beginning of chapters.

Additionally, we will make comments within a chapter to allow us to further

compare and contrast various aspects of the models presented in the book.

Below is an example of an editor note sidebar, highlighting relationships

among models, and printed in the style you will see throughout the text:

Relationship to Other Models: Often it is important to compare and contrast models to better

understand the intricate details of a model. Pay particular attention to pop-out boxes that focus on

model comparison so that you will have a comprehensive understanding of the various models used.

Terminology Note

We have tried to highlight the meaning of important elements of models and

applications, especially where similar concepts are represented differently, by

using terminology notes. Again, we will do this primarily through editor note

sidebars within chapters. The goal of using these special editor notes is to help

connect basic ideas from one chapter to the next, to help readers understand

some confusing concepts, or to offer an alternative explanation to a key con-

cept. Below is how we will highlight terminology notes throughout this book:

Terminology Note: We prefer to use the term boundary parameter to describe the statistical term

used for the functions that separate response categories. Different authors use this term or concept

differently, and we will highlight this throughout the text.


Notational Differences

Various elements in the field of polytomous IRT developed out of different

mathematical traditions, and consequently, a range of notational conventions

are used across the field. In another place (Ostini & Nering, 2006) we have

attempted to present a unified notational approach. In this handbook, how-

ever, we want authors to present what is often their life's work in their own

voice and have kept the preferred notation of each author. Retaining the

notation used by the contributing authors allows readers who follow up the

work of any of these authors to find consistent notation across the authors' publications. We provide
brief editor notes in sidebars throughout the text to highlight links between

differing notations and uses of terms across chapters.

Notational Differences: These will be highlighted so that comparisons between models can be

made and to help avoid confusion from one chapter to the next.

Below is a model reference guide that can be used while reading this text, or

while using the models for research or operational purposes. For this partic-

ular section we have highlighted what we believe to be the most commonly

used polytomous IRT models, or models that deserve special attention.

Within this reference guide we have used a common notational method that

allows the reader to more readily compare and contrast models, and we have

included information functions that the reader might find useful.

Note on Information

Polytomous IRT information can be represented in two ways. It can be eval-

uated at the item level or at the category level. Starting at the category level,

Samejima (1977, 1988, 1996, 1998) defines information as the negative second derivative of the log of the category response probability,

$$I_{ik}(\theta) = -\frac{\partial^2}{\partial\theta^2}\log P_{ik}(\theta) \qquad (1.1)$$

where I_ik(θ) is the information in category k of item i, evaluated across the range of θ, and P_ik(θ) is the probability of responding in category k of item i.

Category information can then be combined to produce item information, I_i(θ),

$$I_i(\theta) = \sum_k I_{ik}(\theta)P_{ik}(\theta) \qquad (1.2)$$

This amounts to describing item information as the expected value of category information. That is,

$$I_i = E[I_{ik}\,|\,\theta] \qquad (1.3)$$

where, broadly speaking, item information can be defined as the squared slope of the item response function divided by its conditional variance. Operationalizing this definition draws on the item response function, which is essentially a regression of the item score on the trait scale (Chang & Mazzeo, 1994; Lord, 1980), that is, the expected value of an item response as a function of the change in θ (Andrich, 1988). For polytomous items, the expected value of response x (where x = k = 0, 1, …, m) is

$$E[X_i] = \sum_k kP_{ik}(\theta) \qquad (1.4)$$

The slope of this regression at a given level of θ equals the conditional variance of the item response,

$$\frac{\partial E[X_i]}{\partial\theta} = V[X_i] = \sum_k k^2P_{ik}(\theta) - \left[\sum_k kP_{ik}(\theta)\right]^2 \qquad (1.5)$$

which can also be partitioned to give the contribution of item information due to a particular category.

This equation is in the normal ogive metric and is for models without a separate discrimination parameter. The logistic metric can be obtained by multiplying Equation 1.5 by the usual correction factor squared (i.e., D², where D = 1.702). Similarly, multiplying Equation 1.5 by the squared item discrimination (a²) takes account of separately modeled item discrimination in calculating information.
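To make Equation 1.5 concrete, here is a minimal sketch (our illustration, not from the chapter; the function and variable names are hypothetical) that computes item information as the variance of the item score, with the optional D and a scalings just described:

```python
def item_information(probs, a=1.0, D=1.0):
    # Equation 1.5: item information as the conditional variance of the
    # item score, sum_k k^2 P_ik - (sum_k k P_ik)^2.  Passing D = 1.702
    # applies the logistic-metric correction; a incorporates a separately
    # modeled discrimination parameter, as described in the text.
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum(k * k * p for k, p in enumerate(probs)) - mean ** 2
    return (D * a) ** 2 * variance

# A hypothetical four-category item at a trait level where the middle
# categories are most likely; the probabilities must sum to 1.
probs = [0.10, 0.40, 0.40, 0.10]
info = item_information(probs)                  # 0.65, normal ogive metric
info_logistic = item_information(probs, D=1.702)
```

For a dichotomous item (two categories) the same function reduces to the familiar P(1 − P) form of binary item variance.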

Information for all of the models for ordered response category data below

could be calculated using Equation 1.5. However, information functions

for the graded response model (Samejima's logistic model in the homogeneous case) have traditionally been obtained at the category level, and so that

approach will be shown below. Dodd and Koch (1994) found that the two

approaches to obtaining item information produce almost identical results

empirically. Matters are more complicated for the nominal model, and the

procedure described below draws heavily from both the logic and the math-

ematical derivations provided by Baker (1992). The first two information functions below are built up from category information, while the final three functions will be based on Equation 1.5. Note that even though most of the information functions are defined at the item level (I_i), the IRT models themselves describe category-level functions (P_ik).

Nominal Model

The Model

The nominal model gives the probability of responding in category k of item i as

$$P_{ik}(u = k\,|\,\theta; a, c) = P_{ik}(\theta) = \frac{\exp(a_k\theta + c_k)}{\sum_{v=1}^{m}\exp(a_v\theta + c_v)} \qquad (1.7)$$

for categories k (k = 1, …, m), as a function of the ability or trait continuum θ, with a category slope parameter a_k and category intercept parameter c_k, and with (a_kθ + c_k) ≡ Z_k.

Item Information

The most practical way to present item information for the nominal model

is through a three-step process. Firstly, a general equation is presented

(Equation 1.8). This contains two derivatives that require calculation. Each

part is described separately (Equations 1.9 and 1.10), and each is typically

calculated separately, with the appropriate values substituted back into

Equation 1.8 to obtain the information function for an item.

The general equation is

$$I_i(\theta) = \sum_{k=1}^{m}\left\{\frac{[P'_{ik}(\theta)]^2}{P_{ik}(\theta)} - P''_{ik}(\theta)\right\} \qquad (1.8)$$

where P_ik(θ) is defined in Equation 1.7.

The equation for the first derivative, P′_ik(θ), is

$$P'_{ik}(\theta) = \frac{\exp(Z_k)\sum_{v}\exp(Z_v)(a_k - a_v)}{\left[\sum_{v}\exp(Z_v)\right]^2} \qquad (1.9)$$

while the equation for the second derivative, P″_ik(θ), is

$$P''_{ik}(\theta) = \frac{\exp(Z_k)\left\{\sum_{v}\exp(Z_v)\sum_{v}\exp(Z_v)\left(a_k^2 - a_v^2\right) - 2\sum_{v}\exp(Z_v)(a_k - a_v)\sum_{v}a_v\exp(Z_v)\right\}}{\left[\sum_{v}\exp(Z_v)\right]^3} \qquad (1.10)$$

where Z_k is defined as (a_kθ + c_k), c_k is the category intercept parameter, and a_k is the category slope parameter.


This model was specifically designed for polytomous item types where the

response categories do not need to follow a specific order.
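The three-step calculation above can be sketched as follows (our illustration; function names are hypothetical). After dividing Equations 1.9 and 1.10 through by their common denominator, both derivatives can be written compactly in terms of P_ik and the probability-weighted means of the slope parameters:

```python
import math

def nominal_probs(theta, a, c):
    # Equation 1.7: P_ik = exp(a_k*theta + c_k) / sum_v exp(a_v*theta + c_v)
    z = [ak * theta + ck for ak, ck in zip(a, c)]
    zmax = max(z)                      # subtract max for numerical stability
    expz = [math.exp(zk - zmax) for zk in z]
    denom = sum(expz)
    return [e / denom for e in expz]

def nominal_item_information(theta, a, c):
    # Equation 1.8, with Equations 1.9 and 1.10 rewritten as
    #   P'_ik  = P_ik (a_k - abar)
    #   P''_ik = P_ik [(a_k - abar)^2 - (a2bar - abar^2)]
    # where abar = sum_v a_v P_iv and a2bar = sum_v a_v^2 P_iv.
    p = nominal_probs(theta, a, c)
    abar = sum(av * pv for av, pv in zip(a, p))
    a2bar = sum(av * av * pv for av, pv in zip(a, p))
    info = 0.0
    for pk, ak in zip(p, a):
        p1 = pk * (ak - abar)
        p2 = pk * ((ak - abar) ** 2 - (a2bar - abar ** 2))
        info += p1 * p1 / pk - p2
    return info
```

A useful by-product of this rewriting: the item information reduces to the variance of the slope parameters over the category probabilities, a2bar − abar².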

Graded Response Model

The Model

$$P_{ik}(\theta) = \frac{\exp[Da_i(\theta - b_{ik})]}{1 + \exp[Da_i(\theta - b_{ik})]} - \frac{\exp[Da_i(\theta - b_{ik+1})]}{1 + \exp[Da_i(\theta - b_{ik+1})]} \qquad (1.11)$$

which is summarized as P_ik = P*_ik − P*_ik+1, where P_ik(θ) is the probability of responding in category k (k = 0, 1, …, m) of item i, P*_ik represents the category boundary (threshold) function for category k of item i, a_i is the item discrimination parameter, and b_ik is the difficulty (location) parameter for category boundary (threshold) parameter k of item i.

Item Information

$$I_i(\theta) = \sum_k A_{ik} \qquad (1.12)$$

$$A_{ik} = \frac{D^2a_i^2\left[P^*_{ik}(1 - P^*_{ik}) - P^*_{ik+1}(1 - P^*_{ik+1})\right]^2}{P_{ik}(\theta)} \qquad (1.13)$$

where I_i(θ) is information evaluated across the range of θ for item i, A_ik(θ) is described as the basic function, P_ik(θ) is defined in Equation 1.11, P*_ik represents the category boundary (threshold) function for category k of item i, D is the scaling factor 1.702, and a_i is the item discrimination parameter.
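A sketch of Equations 1.11 through 1.13 (our code; names are hypothetical), treating the boundary curves P*_ik as logistic functions framed by P*_i0 = 1 and P*_i,m+1 = 0:

```python
import math

def grm_boundaries(theta, a, b, D=1.702):
    # P*_ik for ordered thresholds b_ik, framed by P*_i0 = 1, P*_i,m+1 = 0.
    star = [1.0 / (1.0 + math.exp(-D * a * (theta - bk))) for bk in b]
    return [1.0] + star + [0.0]

def grm_category_probs(theta, a, b, D=1.702):
    # Equation 1.11: P_ik = P*_ik - P*_ik+1
    star = grm_boundaries(theta, a, b, D)
    return [star[k] - star[k + 1] for k in range(len(star) - 1)]

def grm_item_information(theta, a, b, D=1.702):
    # Equations 1.12-1.13: item information as the sum of basic functions
    # A_ik = D^2 a^2 [P*_ik(1 - P*_ik) - P*_ik+1(1 - P*_ik+1)]^2 / P_ik
    star = grm_boundaries(theta, a, b, D)
    info = 0.0
    for k in range(len(star) - 1):
        pk = star[k] - star[k + 1]
        w = star[k] * (1.0 - star[k]) - star[k + 1] * (1.0 - star[k + 1])
        info += (D * a) ** 2 * w ** 2 / pk
    return info
```

With a single threshold the item is dichotomous and the sum of basic functions collapses to the familiar two-parameter logistic information, D²a²P*(1 − P*), which provides a quick check on the category-level approach.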

Rating Scale Model

The Model

$$P_{ik}(\theta) = \frac{\exp\sum_{j=0}^{k}\left[\theta - (\lambda_i + \delta_j)\right]}{\sum_{h=0}^{m}\exp\sum_{j=0}^{h}\left[\theta - (\lambda_i + \delta_j)\right]} \qquad (1.14)$$

where P_ik(θ) is the probability of responding in category k (k = 0, 1, …, m) of item i, λ_i is the item difficulty (location) parameter, and δ_k is the common category boundary (threshold) parameter for all the items using a particular rating scale. The δ_k define how far from any given item location a particular threshold for the scale is located.


Item Information

$$I_i(\theta) = \sum_k k^2P_{ik}(\theta) - \left[\sum_k kP_{ik}(\theta)\right]^2 \qquad (1.15)$$

where I_i(θ) is information evaluated across the range of θ for item i, summed across k categories (k = 0, 1, …, m), and P_ik(θ) is defined in Equation 1.14.
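A sketch of Equations 1.14 and 1.15 (our code; names are hypothetical). The term of the inner sum at j = 0 is defined as zero, so category 0 always contributes exp(0) = 1 to the denominator:

```python
import math

def rsm_probs(theta, location, thresholds):
    # Equation 1.14: the numerator for category k is the exponential of the
    # sum over j = 0..k of [theta - (location + threshold_j)], j = 0 term = 0.
    cumulative = [0.0]
    running = 0.0
    for delta in thresholds:           # thresholds shared across the scale
        running += theta - (location + delta)
        cumulative.append(running)
    denom = sum(math.exp(s) for s in cumulative)
    return [math.exp(s) / denom for s in cumulative]

def rsm_item_information(theta, location, thresholds):
    # Equation 1.15: the variance of the item score under the model.
    p = rsm_probs(theta, location, thresholds)
    mean = sum(k * pk for k, pk in enumerate(p))
    return sum(k * k * pk for k, pk in enumerate(p)) - mean ** 2
```

Because all items on the scale share the same thresholds, only the location argument changes from item to item.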


Partial Credit Model

The Model

$$P_{ik}(\theta) = \frac{\exp\sum_{j=0}^{k}(\theta - \delta_{ij})}{\sum_{h=0}^{m}\exp\sum_{j=0}^{h}(\theta - \delta_{ij})} \qquad (1.16)$$

where P_ik(θ) is the probability of responding in category k (k = 0, 1, …, m) of item i, and δ_ik is the difficulty (location) parameter for category boundary (threshold) parameter k of item i.

Item Information

$$I_i(\theta) = \sum_k k^2P_{ik}(\theta) - \left[\sum_k kP_{ik}(\theta)\right]^2 \qquad (1.17)$$

where this is identical to the equation for the rating scale model because

both equations are in the same metric and in both cases Pik () is calculated

without reference to a separate discrimination parameter. Information will

nevertheless differ across the two models, for a given set of data, because

P_ik(θ) will be different at any given level of θ for the two models.

Generalized Partial Credit Model

The Model

$$P_{ik}(\theta) = \frac{\exp\sum_{j=0}^{k}1.7a_i(\theta - b_i + d_j)}{\sum_{h=0}^{m}\exp\sum_{j=0}^{h}1.7a_i(\theta - b_i + d_j)} \qquad (1.18)$$

where P_ik(θ) is the probability of responding in category k (k = 0, 1, …, m) of item i, a_i is the item discrimination parameter, b_i is the item difficulty (location) parameter, and d_j is the category boundary (threshold) parameter for an item. The d_j define how far from an item location a threshold is located.


Item Information

$$I_i(\theta) = D^2a_i^2\left\{\sum_k k^2P_{ik}(\theta) - \left[\sum_k kP_{ik}(\theta)\right]^2\right\} \qquad (1.19)$$

where I_i(θ) is information evaluated across the range of θ for item i, summed across k categories (k = 0, 1, …, m), P_ik(θ) is defined in Equation 1.18, D is the scaling factor 1.702, and a_i is the item discrimination parameter. This is

the same as for the partial credit model with the addition of the squared item discrimination parameter. Generalized partial credit model parameters are typically reported in the logistic metric, hence the further addition of D².
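A sketch of Equations 1.18 and 1.19 (our code; names are hypothetical). Note that the 1.7a_i scaling sits inside the category probabilities, while Equation 1.19 multiplies the score variance by D²a_i², so the result is reported in the logistic metric:

```python
import math

def gpcm_probs(theta, a, b, d, D=1.7):
    # Equation 1.18: numerator for category k is the exponential of the
    # sum over j = 0..k of D*a*(theta - b + d_j), with the j = 0 term = 0.
    cumulative = [0.0]
    running = 0.0
    for dj in d:
        running += D * a * (theta - b + dj)
        cumulative.append(running)
    denom = sum(math.exp(s) for s in cumulative)
    return [math.exp(s) / denom for s in cumulative]

def gpcm_item_information(theta, a, b, d, D=1.7):
    # Equation 1.19: D^2 a^2 times the variance of the item score.
    p = gpcm_probs(theta, a, b, d, D)
    mean = sum(k * pk for k, pk in enumerate(p))
    variance = sum(k * k * pk for k, pk in enumerate(p)) - mean ** 2
    return (D * a) ** 2 * variance
```

Setting a_i such that D·a_i = 1 and b_i = 0 reduces the probabilities to those of a partial credit item, which is one way to check an implementation against the simpler model.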

A Word on Software

Below is a list of commonly used software for estimating polytomous IRT

model parameters, which the authors have used. Some of the older prod-

ucts listed now have GUIs. Most products give very similar item and person

parameter estimates where they estimate parameters for the same models.

However, different software products typically estimate parameters for dif-

ferent models. Most products also provide different fit statistics to each other.

None yet implement the kinds of fit analyses that Glas talks about in his

chapter. Table 1.1 indicates the specific polytomous IRT models that are

estimated by a particular program.

Parscale. Muraki, E., and Bock, R. D. Scientific Software International, 7383 North Lincoln Avenue, Suite 100,
Chicago IL, 60646. http://www.ssicentral.com

Multilog. Thissen, D. (2003). Version 7. Scientific Software

International, 7383 North Lincoln Avenue, Suite 100, Chicago IL,

60646. http://www.ssicentral.com

Table 1.1 Models Each Estimation Program Can Fit, by Estimation Procedure

Models    Parscale  Multilog  Rumm  WinMira  BigSteps  ConQuest  Quest

GRM X X

RS-GRM X

GPCM X X

PCM X X X X X X X

RSM X X X X X X

SIM X

DSLM X

DLM X X


Rumm2020. Andrich, D., Lyne, A., Sheridan, B., and Luo, G. (2003).

Windows version. Rumm Laboratory, 14 Dodonaea Court, Duncraig

6023, Western Australia, Australia. http://www.rummlab.com.au

WinMira2001. von Davier, M. (2000). Version 1.36 for Windows. http://winmira.von-davier.de

WinSteps. Linacre, J. M., and Wright, B. D. (2009). Version 3.68.1.

MESA Press, 5835 South Kimbark Avenue, Chicago IL, 60637. http://

www.winsteps.com

ACER ConQuest. Wu, M. L., Adams, R. J., and Wilson, M. R. (2000).

Build Date, August 22, for DOS and Windows. The Australian Council for Educational Research, 19 Prospect Hill Road, Camberwell,
Melbourne, Victoria, 3124, Australia. mailto:quest@acer.edu.au

Quest. Adams, R. J., and Khoo, S.-T. (1996). Version 2.1 for PowerPC

Macintosh. The Australian Council for Educational Research,

19 Prospect Hill Road, Camberwell, Melbourne, Victoria, 3124,

Australia. mailto:quest@acer.edu.au

Conclusion

This chapter introduces the subsequent chapters, highlighting their contri-

butions and discussing issues in polytomous IRT that cut across different

models. It organizes the handbook's content by describing what individual

chapters do and where they fit in relation to other chapters. This chapter also

explains how the editors have attempted to integrate content across chapters

through editor note sidebars.

The goal of this chapter is to make the handbook, and by extension polyto-

mous IRT generally, more accessible and useful to readers, emphasizing its value

and making it easier for readers to unlock that value. In addition to its organiz-

ing role, the chapter helps readers to consider how they might use polytomous

IRT most effectively, in part, by providing an overarching reference guide to

polytomous IRT models in a common notation. The information functions that

are included for each model in the reference guide provide an important practi-

cal tool for designing and evaluating tests and items. Access to polytomous IRT

is also improved by the inclusion of a brief section on the software that is avail-

able to implement the models in research or applied measurement settings.

References

Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Monterey,

CA: Brooks/Cole.

Anastasi, A. (1988). Psychological testing (6th ed.). Upper Saddle River, NJ: Prentice Hall.

Andrich, D. (1982). An extension of the Rasch model for ratings providing both

location and dispersion parameters. Psychometrika, 47, 105–113.

Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York:

Marcel Dekker.


Chang, H.-H., & Mazzeo, J. (1994). The unique correspondence of the item response

function and item category response functions in polytomously scored item

response models. Psychometrika, 59, 391–404.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Fort

Worth, TX: Harcourt Brace Jovanovich.

Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York:

Harper Collins.

Dodd, B. G., & Koch, W. R. (1994). Item and scale information functions for the

successive intervals Rasch model. Educational and Psychological Measurement,

54, 873–885.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.

Hibbard, J. H., Mahoney, E. R., Stockard, J., & Tusler, M. (2005). Development

and testing of a short form of the patient activation measure. Health Services

Research, 40, 1918–1930.

Lord, F. M. (1980). Applications of item response theory to practical testing problems.

Hillsdale, NJ: Lawrence Erlbaum Associates.

Mellenbergh, G. J. (1995). Conceptual notes on models for discrete polytomous item

responses. Applied Psychological Measurement, 19, 91–100.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algo-

rithm. Applied Psychological Measurement, 16, 159–176.

Muraki, E. (1993). Information functions of the generalized partial credit model.

Applied Psychological Measurement, 17, 351–363.

Muraki, E., & Bock, R. D. (1999). PARSCALE: IRT item analysis and test scoring for

rating-scale data, Version 3.5. Chicago: Scientific Software International.

Ostini, R. (2001). Identifying substantive measurement differences among a variety

of polytomous IRT models (Doctoral dissertation, University of Minnesota,

2001). Dissertation Abstracts International, 62-09, Section B, 4267.

Rogers, H. J., Swaminathan, H., & Hambleton, R. K. (1991). Fundamentals of item

response theory. Newbury Park, CA: Sage.

Rost, J. (1988). Measuring attitudes with a threshold model drawing on a traditional

scaling concept. Applied Psychological Measurement, 12, 397–409.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded

scores. Psychometrika, Monograph Supplement 17.

Samejima, F. (1977). A method of estimating item characteristic functions using the

maximum likelihood estimate of ability. Psychometrika, 42, 163–191.

Samejima, F. (1988). Comprehensive latent trait theory. Behaviormetrika, 24, 1–24.

Samejima, F. (1996). Evaluation of mathematical models for ordered polychotomous

responses. Behaviormetrika, 23, 17–35.

Samejima, F. (1998). Efficient nonparametric approaches for estimating the operat-

ing characteristics of discrete item responses. Psychometrika, 63, 111–130.

von Davier, M. (2000). WinMira2001 user manual: Version 1.36 for Windows. Author.

Andrich, D. (1978). A rating formulation for ordered response categories.

Psychometrika, 43, 561–573.

Andrich, D. (1988). A general form of Raschs extended logistic model for partial

credit scoring. Applied Measurement in Education, 1, 363–378.


R. Langeheine & J. Rost (Eds.), Latent traits and latent class models (pp. 11–29). New York: Plenum Press.


Molenaar, I. W. (1983). Item steps (Report HB-83-630-EX). Heymans Bulletins

Psychological Institute, University of Groningen.


Rasch, G. (1961). On general laws and the meaning of measurement in psychology.

Paper presented at the Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press.

Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models.

Psychometrika, 51, 567–577.

Chapter 2

IRT Models for the Analysis of Polytomously Scored Data: A Brief and Selected History of Model Building Advances

Ronald K. Hambleton
University of Massachusetts at Amherst

Wim J. van der Linden
CTB/McGraw Hill

Craig S. Wells
University of Massachusetts at Amherst

Introduction

Editor Introduction: This chapter places the models developed in later chapters into a common

historical context. The authors review dichotomous IRT models and lay an important foundation for the

concept of informationa key concept in IRT. They also discuss nonparametric IRT and provide an

introduction to the issue of parameter estimation. This provides an excellent starting point from which

to delve deeper into the specific models, issues surrounding these models, and uses of the models that

are provided in later chapters.

The 1950s saw the beginning of the transition from classical to modern test theory and practices, a response, in part, to Gulliksen's (1950) challenge to develop

invariant item statistics for test development. Modern test theory is charac-

terized by strong modeling of the data, and modeling examinee responses

or scores at the item level. Item response theory, originally called latent trait



theory and later item characteristic curve theory, is about linking examinee

item responses or scores via item response functions to the latent trait or

traits measured by the test. According to Baker (1965), Tucker (1946) may

have been the first to use the term item characteristic curve, and Lazarsfeld

(1950) appears to have been the first to use the term latent trait (Lord, per-

sonal communication, 1977). In his 1980 monograph, Lord coined the terms

item response function (IRF) and ability as alternatives for item characteristic

curve and trait.

Progress in model building and model parameter estimation was slow initially, almost certainly because of the mathematical complexity of the models and the computational demands of parameter estimation. There was also considerable skepticism among some researchers

about any measurement advantages that might accrue from the modeling.

This skepticism remained well into the 1970s. The skeptics were particularly

concerned about the strong model assumptions that needed to be made (e.g.,
unidimensionality), and secondarily the computational challenges that item

response theory (IRT) modeling posed. But the promise of these IRT mod-

els was great (e.g., model parameter invariance; possibility to deal with miss-

ing data designs, for instance, in equating studies or adaptive testing) and the

research continued at an exponential rate in the 1970s and 1980s.

The real turning point in the transition process came with the publication

of Statistical Theories of Mental Test Scores by Lord and Novick (1968). The

transition was helped along considerably by Rasch (1960) and work in the

late 1950s by Birnbaum (1957, 1958).

Today, there is widespread use of IRT models with both binary and poly-

tomously scored data, although the latter is less well developed and under-

stood. Hence the need for this book and others like it (e.g., van der Linden &

Hambleton, 1997). Until the mid-1980s, test developers, psychologists, and

researchers with polytomous response data tended to do classical scaling, as outlined by Thurstone (1925). This meant assuming a normal distribution for the latent variable, an assumption certainly unrealistic in many applications. Some researchers simply dichoto-

mized their data, even though one of the consequences may be loss of fit

of the model (Jansen & Roskam, 1983; Roskam & Jansen, 1986). Lack of

software limited IRT applications using polytomous response data, though

Andrich (1978) and later Thissen (1981) moved things forward with software

for selected polytomous response IRT models.

Interestingly, although Lord and Novick (1968) provided a formulation

of a general theory of multidimensional ability spaces, there had been no

development up to that time with specific models for the analysis of poly-

tomous response data. Perhaps this was because both authors were working

in the educational testing area, where in 1968, binary data were much more

common.

The purposes of this chapter will be (1) to introduce many of the polyto-

mous response IRT models that are available today, including several that are

the focus of this book, (2) to provide background for the motivations of model


developers, and (3) to highlight similarities and differences among the models,

and challenges that still remain to be addressed for successful applications.

It is interesting to note that polytomous response IRT models were intro-

duced long before they found any use in education and psychology, albeit

without the necessary software to implement the models. We will begin by

introducing Samejima's (1969, 1972) work in the late 1960s with the graded

response model, the free response model, and multidimensional models too.

Her work was followed by Bock (1972) and the nominal response model;
here, the multiple score categories did not even need to be ordered. Later,

advances came from Andrich (1978, 1988) with the rating scale model, and this model did receive some fairly quick use, in part because he made software available. Other models to follow Samejima's pioneering work included

the partial credit model and variations (e.g., Andersen, 1973; Masters &

Wright, 1984; Tutz, 1997; Verhelst, Glas, & de Vries, 1997), the generalized

partial credit model (e.g., Muraki, 1992), as well as models by Embretson,

Fischer, McDonald, and others for applying multidimensional IRT models

to polytomous response data. By 1997, when van der Linden and Hambleton

published their edited book Handbook of Modern Item Response Theory (1997),

they reported the existence of over 100 IRT models and organized them

into six categories: (1) models for items with polytomous response formats,

(2)nonparametric models, (3) models for response time or multiple attempts

on items, (4) models for multiple abilities or cognitive components, (5) models

for nonmonotone items, and (6) models with special assumptions about the

response models. Only models in the first two categories will be described in

this chapter. Readers are referred to van der Linden and Hambleton (1997)

for details on models in all six categories.

Some of the seminal contributions to the topic of IRT model development

are highlighted in Figure 2.1.

We will begin our selected history of IRT model development by focus-

ing first on those models that were developed to handle binary-scored data

such as multiple-choice items and short-answer items with achievement

and aptitude tests, and true-false, yes-no type items with personality tests.

These developments laid the foundation for those that followed for polyto-

mous response data, as the models are based on the same assumptions of

unidimensionality and statistical independence of item responses, and model

parameters are similar in their purpose and interpretation. Parameter esti-

mation methods were simply extended to handle the extra model parameters

with the polytomous response data.

A primary reason for IRT's attractive features is that explicit, falsifiable

models are used in developing a scale on which test items and examinees


1957–1958: two- and three-parameter logistic models (Birnbaum)
1960: one-parameter logistic model (Rasch)
1961: Rasch rating scale model (Rasch)
1967: normal-ogive multidimensional model (McDonald)
1969: two-parameter normal ogive and logistic graded response model (Samejima)
1969: multidimensional model (Samejima)
1972: continuous (free) response model (Samejima)
1972: nominal response model (Bock)
1973: Rasch rating scale model (Andersen)
1978: Rasch rating scale model (Andrich's full development, carried out independently of the work of Andersen)
1980: multi-component response models (Embretson)
1981: four-parameter logistic model (Barton & Lord)
1982: partial credit model (Masters)
1985: linear logistic multidimensional model (Reckase)
1988: unfolding model (Andrich)
1990: sequential step model (Tutz)
1991: non-parametric models (Ramsay)
1992: generalized partial credit model (Muraki)
1992: full information item bifactor model (Gibbons & Hedeker)
1993: steps model (Verhelst & Glas)

Figure 2.1 The most popular of the unidimensional and multidimensional models for analyzing binary-scored and polytomously scored data.

are placed (Baker, 1965; Hambleton, Swaminathan, & Rogers, 1991; van

der Linden & Hambleton, 1997). All IRT models define the probability of

a positive response as a mathematical function of item properties, such as

difficulty, and examinee properties, such as ability level. For example, one of

the popular models used with dichotomous item response data is the three-

parameter logistic model (3PLM), expressed as

$$P(u_{ij} = 1\,|\,\theta_j) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_j - b_i)]}{1 + \exp[a_i(\theta_j - b_i)]} \qquad (2.1)$$

where u_ij is the scored response of examinee j to item i. Hereafter, P(u_ij = 1|θ_j) will be written as P(θ). The ability param-

eter for person j is denoted θ_j and, although theoretically unbounded, ranges
from −3.0 to 3.0 for a typical population with ability estimates scaled to a

mean of zero and a standard deviation of 1.0, where larger positive values

indicate higher ability. The lower asymptote is denoted ci, also known as the

guessing parameter; in other words, the c parameter indicates the probability

of positively endorsing an item for examinees with very low ability levels.

The difficulty parameter for item i is denoted bi. The b parameter is on the

same scale as and is defined as the -value where P() is halfway between

IRT Models for the Analysis of Polytomously Scored Data 25

Figure 2.2 Item response functions for three 3PLM items: Item 1 (a = 1.7, b = −0.8, c = 0.15), Item 2 (b = 0.2, c = 0.08), and Item 3 (a = 1.45, b = 1.1, c = 0.26).

the c parameter value and 1.0 (i.e., the θ-value associated with P(θ) = (1 + c)/2). The values for the b parameter also typically range from −3.0 to 3.0, where larger positive values indicate harder items and larger negative values indicate easier items. The discrimination of item i is denoted ai and is proportional to the slope of the IRF at θ = b (see Figure 2.2). For good items, the a parameter typically ranges from 0.40 to 2.50. Some testing programs use a version of the model with a scaling constant D = 1.7, which was introduced by Birnbaum in 1957 to eliminate scale differences between the item parameters in the two-parameter normal ogive and logistic models. Figure 2.2 graphically illustrates IRFs for three items based on the 3PLM.
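As a concrete check on Equation 2.1 and the parameter descriptions above, the following Python sketch (ours, not part of the chapter) evaluates the 3PLM for the three items of Figure 2.2. Item 2's discrimination is not given in the text, so a = 1.0 there is a placeholder assumption.

```python
import math

def irf_3pl(theta, a, b, c, D=1.0):
    """Three-parameter logistic IRF (Equation 2.1).

    Setting c = 0 gives the 2PLM; additionally fixing a = 1 gives the
    1PLM/Rasch model. Pass D=1.7 to use Birnbaum's scaling constant.
    """
    z = D * a * (theta - b)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

# The items of Figure 2.2; Item 2's discrimination is not stated in the
# text, so a = 1.0 is a placeholder assumption.
items = [(1.70, -0.8, 0.15), (1.00, 0.2, 0.08), (1.45, 1.1, 0.26)]
for a, b, c in items:
    # At theta = b, the probability equals (1 + c)/2, the midpoint
    # between the lower asymptote c and 1.0.
    print(round(irf_3pl(b, a, b, c), 3), round((1 + c) / 2, 3))
```

Each printed pair matches, illustrating that b marks the point where P(θ) is halfway between c and 1.0.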

The three IRFs shown in Figure 2.2 follow the 3PLM and differ with respect to difficulty, discrimination, and the lower asymptote. The y axis represents the probability of a correct response, P(θ), while the x axis represents the ability (θ) scale, sometimes called the proficiency scale. There are several important features of the 3PLM IRFs shown in Figure 2.2. First, the IRFs are monotonically increasing in that the probabilities increase as the ability levels increase for each item. Second, the IRFs are located throughout the ability distribution, indicating that the items differ in their difficulty. For example, Item 1 is the easiest, Item 2 is moderately difficult, and Item 3 is the hardest (b1 = −0.8, b2 = 0.2, and b3 = 1.1). Third, the inflection point of the respective IRF is located at the b parameter value. Fourth, the a parameter value is proportional to the slope of the IRF at the b parameter value. In addition, the IRF has maximum discriminatory power for examinees whose θ-value is near the b parameter value. For example, Item 1 has maximum discrimination for examinees with θ-values around b = −0.8. And fifth, the items differ with respect to the c parameter value as indicated by the disparate lower asymptotes. Interestingly, although Item 3 is generally more difficult

26 Ronald K. Hambleton, Wim J. van der Linden, and Craig S. Wells

for most examinees throughout the scale, lower ability examinees have a

higher chance of answering the item correctly compared to Items 1 and 2.

When the c parameter is set equal to zero for an item, Equation 2.1 sim-

plifies to the two-parameter logistic model (2PLM) introduced by Birnbaum

(1957, 1958, 1968). This model is more mathematically tractable than the

two-parameter normal ogive model introduced by Lord (1952). Since the

lower asymptote is fixed at zero, the 2PLM implies that guessing is absent, or

at least, negligible for all practical purposes. A dichotomously scored, short-

answer item is an example of an item in which the 2PLM is commonly used.

Because c = 0, the b parameter is now the point on the ability scale at which the probability of a correct response equals 0.5.


Constraining the a parameter to be equal to one across all items

in Equation 2.1, as well as fixing c = 0, produces a third IRT model for

dichotomous data known as the one-parameter logistic model (1PLM) or

the Rasch model. Neither Lord nor Birnbaum showed any interest in a

one-parameter normal ogive or logistic model because of their belief that

multiple-choice test items needed at least two parameters in the model to

adequately account for actual item response data: one to account for item difficulty and the other to account for item discriminating power. Lord was

sufficiently concerned about guessing (and omitted responses) on multiple-

choice items (as well as computational demands) that he discontinued his

dissertation research (1952, 1953a, 1953b) and pursued research with true

score theory instead, until about 1965 (Lord, 1965). He then became con-

vinced that computer power was going to be sufficient to allow him to work

with a three-parameter model with the third item parameter in the model

to account for the nonzero item performance of low-performing candidates,

even on hard test items.

Although the 1PLM produces some attractive statistical features (e.g.,

exponential family, simple sufficient statistics) when the model actually fits

test data, the advantages of the 1PLM come at the expense of assuming

the items are all equally discriminating (not to mention free from examinee

guessing behavior). Although these assumptions may hold for psychological

tests with narrowly defined constructs, they are generally problematic in edu-

cational testing. Therefore, the equal discrimination assumption is usually

checked closely prior to implementing the 1PLM.

Rasch (1960), on the other hand, developed his one-parameter psychomet-

ric model from a totally different perspective than either Lord or Birnbaum.

He began with the notion that the odds for an examinee's success on an item depended on the product of two factors: item easiness and examinee ability. Obviously, the easier the item and the more capable the examinee, the higher the odds for a successful response and, correspondingly, the higher the probability of the examinee's success on the item. From the definition of odds for success, P/(1 − P), and setting the odds equal to the product of the model

parameters for item easiness and examinee ability, it was easy for Rasch to

produce a probability model similar to Equation 2.1 with c = 0.0 and a = 1.0,

though in his development, the concepts of item discrimination and guessing


were never considered. At the same time, failure to consider them allows

them to become possible sources for model misfit. Rasch certainly would

not have worried about guessing in his own work since he developed his

model for a longitudinal study with intelligence testing. Also, in 1960, the

multiple-choice item was not in use in Denmark.

The three logistic models for analyzing binary data are valuable, and are

receiving extensive use in testing practices, but there are several item types

or formats in education and psychology that are scored polytomously. For

example, many statewide assessments use constructed-response item formats

as part of the assessment in which a scoring rubric is implemented to provide partial credit. In addition, Likert-type items that provide responses in a

graded fashion are commonly used, especially in surveys, questionnaires, and

attitudinal inventories. The goal of the IRT models for polytomous data is to

describe the probability that an individual responds to a particular category

given her or his level of ability and the item properties.

One of the most popular IRT models to address polytomous data, devel-

oped by Samejima (1969), is a simple yet elegant extension of the 2PLM

and is referred to as Samejima's graded response model (GRM). Samejima

was clearly motivated in her work by the fact that all of the modeling up to

1969 was applicable only to binary-scored data. These models were excellent

at the time for handling educational testing data where most of the IRT

model developers were working (Lord, Novick, Rasch, Wright). However,

Samejima was well aware of the use of rating scales (sometimes called ordered

response categories by psychologists) and wanted to extend the applicability

of item response modeling to these types of data. Samejima's GRM is appropriate for ordered polytomous item responses such as those used in Likert-type items or constructed-response items.

For the following explanation, we will consider a five-category (i.e., K = 5)

item with scores ranging from 0 to 4 (i.e., k = 0, …, 4). Samejima's work was just the first of many models that followed, including several more of her

own (see, for example, Samejima, 1997).

The GRM uses a two-step process in order to obtain the probability that

an examinee responds to a particular category. The first step is to model the

probability that an examinee's response falls at or above a particular ordered category given θ. The probabilities, denoted P*ik(θ), may be expressed as follows:

P*ik(θ) = exp[ai(θj − bik)] / {1 + exp[ai(θj − bik)]}     (2.2)

where P*ik(θ) is the operating characteristic function of item i for category k, and indicates the probability of scoring in the kth or higher category on item i (by definition, the probability of responding in or above the lowest category is P*i0(θ) = 1.0). The ai parameter refers to the discrimination for item i.


Notational Difference: The formula presented in Equation 2.2 is the same as that presented by Samejima in Chapter 4 and shown below. Samejima's Equation 4.2 is structured differently, uses different subscripts, and locates the category subscript (x) before (and at a higher level than) the item subscript (g). It is nevertheless functionally and algebraically equivalent to Equation 2.2 in this chapter except for the scaling factor D:

P*xg(θ) = {1 + exp[−Dag(θ − bxg)]}^−1


Note that each item has the same discrimination parameter across all cat-

egories in Equation 2.2. Samejima referred to Equation 2.2 as the homoge-

neous case of her model. She developed a variation on Equation 2.2 in which

the operating characteristic function for each score category could vary, and

she called this the heterogeneous form of the graded response model, but

this version of her model did not attract much attention.

Terminology Note: It is common to think of the heterogeneous case as the graded response model with P*ik(θ) that vary in shape, as is suggested here. However, the boundary function P*ik(θ) actually plays no role in defining heterogeneous models. A fundamental cause of this misunderstanding is the way Samejima uses the term graded response model. In this chapter, and in common usage, graded response model refers to Samejima's logistic model in the homogeneous case. However, Samejima uses graded response model to refer to a framework that covers all possible polytomous IRT models in both the homogeneous case (including but not limited to the logistic model) and the heterogeneous case, which itself includes many different specific models (see Chapter 4). The most common examples of graded response models in the heterogeneous case are polytomous Rasch models, such as the partial credit model. In such models P*ik(θ) is not (and typically cannot be) explicitly modeled and can only be obtained empirically, by summing category probabilities. If this is done, it does indeed transpire that the boundary functions for an item are not parallel. In the terminology used later in this chapter, heterogeneous models are typically direct models.

In Equation 2.2, bik refers to the ability level at which the probability of responding at or above the particular category equals 0.5 and is often referred to as the threshold parameter. (The threshold parameter is analogous to the item difficulty parameter in the 2PLM.) Since the probability of responding in the first (i.e., lowest) category or higher is defined to be 1.0, the threshold parameter for the first category is not estimated. Therefore, although there are five categories (K = 5), there are only four threshold parameters estimated (K − 1 = 4). Basically, the item is regarded as a series of K − 1 dichotomous responses (e.g., 0 vs. 1, 2, 3, 4; 0, 1 vs. 2, 3, 4; 0, 1, 2 vs. 3, 4; and 0, 1, 2, 3 vs. 4); the 2PLM is used to estimate the IRF for each dichotomy with the added constraint that the slopes are equal within an item.

Once the operating characteristic functions are estimated, the category response functions, which indicate the probability of responding to a particular category given θ, are computed by subtracting adjacent P*ik(θ) as follows:


Pik(θ) = P*ik(θ) − P*i(k+1)(θ)     (2.3)

where, by definition, P*i(K)(θ) = 0.0; therefore, the probability of responding in the highest category is simply equal to the highest operating characteristic function. For the present example, the category response functions are computed as follows:

Pi0(θ) = 1.0 − P*i1(θ)
Pi1(θ) = P*i1(θ) − P*i2(θ)
Pi2(θ) = P*i2(θ) − P*i3(θ)     (2.4)
Pi3(θ) = P*i3(θ) − P*i4(θ)
Pi4(θ) = P*i4(θ)
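The two-step computation can be sketched in a few lines of Python. This is our illustration of the homogeneous logistic case (Equations 2.2 and 2.4, without the D constant), using the example item parameters from the figures that follow.

```python
import math

def boundary(theta, a, b):
    # Operating characteristic function P*_ik (Equation 2.2): the
    # probability of responding in category k or higher.
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def grm_probs(theta, a, thresholds):
    """Category response functions by adjacent differences (Eq. 2.4)."""
    # Pad with the definitional endpoints: P* = 1.0 below the lowest
    # category and P* = 0.0 above the highest.
    stars = [1.0] + [boundary(theta, a, b) for b in thresholds] + [0.0]
    return [stars[k] - stars[k + 1] for k in range(len(stars) - 1)]

# The five-category example item of Figures 2.3 and 2.4.
p = grm_probs(0.0, 1.25, [-2.3, -1.1, 0.1, 1.15])
print([round(x, 3) for x in p])  # five category probabilities
print(round(sum(p), 3))          # the probabilities sum to 1.0
```

Because the category probabilities are differences of adjacent boundaries, their sum telescopes to 1.0 at every θ, one of the GRM properties noted below.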

Figures 2.3 and 2.4 illustrate the operating characteristic and category response functions, respectively, for an item with a = 1.25, b1 = −2.3, b2 = −1.1, b3 = 0.1, and b4 = 1.15. The figures also highlight some important characteristics of the GRM. First, the operating characteristic curves are ordered from smallest to largest based on the threshold parameters (i.e., b1 < b2 < b3 < b4). Second, the threshold parameters dictate the location of the operating curves. Third, the slope is the same for each of the curves within an item (the slopes are free to vary across items, however). Fourth, for any given value of θ, the sum of the category response probabilities equals 1.0. Fifth, the first response category curve is always monotonically decreasing while the last category curve is always monotonically increasing. The middle category curves will always be unimodal, with each peak located at the midpoint of the adjacent threshold parameters.

We will skip ahead a bit to consider a second commonly used IRT model for polytomous data, in which the data are scored in terms of the number of steps completed: the


Figure 2.3 Operating characteristic curves for a five-category GRM item with a = 1.25, b1 = −2.3, b2 = −1.1, b3 = 0.1, and b4 = 1.15.



Figure 2.4 Category response curves for a five-category GRM item with a = 1.25, b1 = −2.3, b2 = −1.1, b3 = 0.1, and b4 = 1.15.

generalized partial credit model (GPCM; Muraki, 1992, 1993). The model

was proposed independently by Yen at the same time, who referred to it as

the two-parameter partial credit (2PPC) model and used it in a number of

applications to show improved fit relative to the original partial credit model

(Fitzpatrick & Yen, 1995; Fitzpatrick et al., 1996).

In contrast to Samejima's GRM, which is considered an indirect model due to

the two-step process of obtaining the category response functions, the GPCM is

referred to as a direct IRT model because it models the probability of responding

to a particular category directly as a function of θ. As the model expression for the

GPCM is an exponential divided by a sum of exponentials, it is also classified as

a divide-by-total model for the aforementioned reason, while Samejima's GRM

is considered a difference model because the category probabilities are based

on the difference between the response functions (Thissen & Steinberg, 1986).

Muraki was motivated by earlier work from Masters (1982), but Masters did not

include a discrimination parameter in his model. This model by Muraki was

extensively used at ETS with the National Assessment of Educational Progress

(NAEP). With NAEP data, historically, items have varied substantially in their

difficulty and discriminating power, and so it was long felt that a two-parameter

polytomous response model was needed to fit the data well.

To illustrate the GPCM, we will consider a five-point partial credit item

(i.e., K = 5) ranging from 0 to 4 (i.e., k = 0, …, 4). The category response

functions, Pik (), for the GPCM may be expressed as follows:

Pik(θj) = exp[Σ(v=0 to k) ai(θj − biv)] / Σ(h=0 to K−1) exp[Σ(v=0 to h) ai(θj − biv)]     (2.5)

where the v = 0 term of each inner sum is defined to be zero, that is, ai(θj − bi0) ≡ 0.
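Equation 2.5 can be made concrete with a short Python sketch (ours, not the chapter's); the item parameters below are those of Figure 2.5, and fixing a = 1 across items reduces the computation to the partial credit model discussed later.

```python
import math

def gpcm_probs(theta, a, steps):
    """GPCM category response functions (Equation 2.5).

    `steps` holds the step difficulties b_i1, ..., b_i(K-1); the v = 0
    term of each inner sum is defined to be zero.
    """
    exponents = [0.0]              # numerator exponent for category k = 0
    for b in steps:                # cumulative sums for k = 1, ..., K-1
        exponents.append(exponents[-1] + a * (theta - b))
    numerators = [math.exp(e) for e in exponents]
    total = sum(numerators)        # the divide-by-total denominator
    return [n / total for n in numerators]

# The Figure 2.5 item, with its step reversal (b2 > b3).
steps = [-1.9, 0.2, -0.5, 1.1]
print([round(x, 3) for x in gpcm_probs(0.0, 0.99, steps)])
# Adjacent category curves intersect at the step difficulty: at
# theta = b1 = -1.9 the curves for categories 0 and 1 cross.
q = gpcm_probs(-1.9, 0.99, steps)
print(abs(q[0] - q[1]) < 1e-12)  # True
```

The second check illustrates the intersection property described below: consecutive category response curves cross exactly at the corresponding step difficulty.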



Figure 2.5 Category response curves for a five-category GPCM item with a = 0.99, b1 = −1.9, b2 = 0.2, b3 = −0.5, and b4 = 1.1.

Notational Difference: The GPCM is also presented in Chapter 3 using the following notation:

Ti(k) = exp[Σ(j=0 to k) 1.7ai(θ − bi + dj)] / Σ(h=0 to m−1) exp[Σ(j=0 to h) 1.7ai(θ − bi + dj)]

Note that in Chapter 3 the overall location parameter (bi) and the threshold parameters (dj) are distinct, whereas in Equation 2.5 they are folded together (biv).

In the GPCM, the biv are referred to as step difficulty parameters. The step difficulty parameters may be interpreted as representing the difficulty in reaching step k given that the examinee has reached the previous step (i.e., k − 1). As a result, the biv are not necessarily ordered from smallest to largest, in contrast to Samejima's GRM. As an example, Figure 2.5 illustrates a GPCM item in which the step difficulties are not ordered (a = 0.99, b1 = −1.9, b2 = 0.2, b3 = −0.5, and b4 = 1.1).

Terminology Note: In this chapter the authors use the term step to describe the boundary param-

eters in the generalized partial credit model. This term is also used later in this chapter and in

Chapter 5, when Masters discusses the partial credit model. Although originally intended to refer to

the process of modeling sequential steps toward arriving at a response category, it has been shown

that neither model actually operates in this way mathematically (Masters, 1988; Molenaar, 1983).

Interpretation of the step difficulty parameters is also complex since each step is modeled in the

context of an entire item. A positive consequence of this is that these two models are not restricted to

constructed-response item data and can legitimately be used with any polytomous data, including

responses to rating scale items. The steps terminology has, however, proved enduring.

As seen from Figure 2.5, the step difficulty parameters represent the value on the θ scale at which two consecutive category response curves intersect (e.g., the curves for categories 0 and 1 intersect at θ = −1.9). The relative order of the step difficulty parameters (i.e., intersections) indicates that going from 0 to 1 (Step 1) is the easiest,


while going from 1 to 2 (Step 2) is moderately difficult, and going from 3 to

4 (Step 4) is the most difficult. Furthermore, the effect of the reversal (i.e., Step 2 being more difficult than Step 3) can also be seen in the lower probability of receiving a score of 2 relative to the other categories.

If the a parameter is constrained to be equal across items, the GPCM

expressed in Equation 2.5 simplifies to the partial credit model (PCM;

Masters, 1982; Masters & Wright, 2004). But the partial credit model was

conceived of prior to the generalized partial credit model and by a very

different line of reasoning. The partial credit model is an extension of the Rasch model to ordered response categories, and it shares the attractive statistical properties of that model, such as belonging to an exponential family and simple sufficient statistics. However, the

appealing properties of the PCM (and 1PLM/Rasch) can only be realized

if the model fits the data. Therefore, the equal item discrimination assump-

tion should be tested empirically before the model is implemented (see, for

example, Hambleton & Han, 2005).

Some researchers have been troubled by the fact that in the partial credit

model (and the generalized partial credit model) occasional reversals are seen

in the step difficulty parameters. Others have argued that the reversals can

be interpreted and have chosen not to worry about them (see, for example,

Masters, 1982).

Editor Note: In Chapter 4 Andrich discusses the potential pitfalls of step reversals and the advan-

tages of being able to model such reversals when they occur in data.

The rating scale model, which has been manifested in a number of increasingly flexible forms, can be traced back to the original work of Rasch (1961) and later his doctoral student, Erling Andersen. The model then became best known through a series of studies by Andrich beginning in 1978 and was extended yet again, this time by Muraki in the 1990s with his generalized rating scale model. In the simplest version of this polytomous response

model, the thresholds for the score categories differ from each other by an

amount that is held constant across items. In addition, there is a shift in these

thresholds from one item to the next because the items, in principle, differ

in their difficulties. Most of the work associated with the rating scale model

has been done under the assumption that all test items are equally discrimi-

nating (see Andrich (1988) or Engelhard (2005) for excellent reviews). The

exception to this work is the generalized rating scale model introduced by

Muraki.

Wright and Masters (1982, chap. 3) provide a very useful comparison of all of the models (rating scale, partial credit, and graded response) by relating these models to the partial credit model, noting their similarities and differences, and deriving one model from another by placing constraints on the models or making additional assumptions. A similar comparison,


(2005). A main difference is that several of the models allow test items to

vary in their discriminating powers (GRM and the GPCM include this

model parameter).

An interesting comparison between polytomous models is based on Agresti's (1990) classification of response processes (see Mellenbergh, 1995). In adjacent-category models, the examinee is assumed to make his or her response

based on a comparison of adjacent categories, and the model represents these

comparisons. This type of process underlies the partial credit models. In the

graded response model, the basic process is a comparison between the cumu-


lative categories in Equation 2.4, and models of this nature are appropriately

called cumulative probability models. The final type of model is the continuation ratio model, which assumes that the examinee basically chooses between

a current category and proceeding with one of the higher-order categories.

The belief that the last response process is a more adequate description of the examinee's behavior on polytomous items was Verhelst, Glas, and de Vries's (1997) and Tutz's (1997) reason for modifying the PCM.

The previously described models are referred to as parametric because they

require the IRF to follow a specific parametric expression. For example, when

using the 2PLM, the underlying IRF must follow the logistic expression in

Equation 2.2 with parameters ai and bi. However, for some items on an educational or psychological assessment, the response data may not conform to such an expression, and these items would be discarded from a test. For such items, there is a class of nonparametric models, based on ordinal assumptions only, that may be used to determine the IRF.

While there are several nonparametric methods for modeling the IRF,

one of the more popular methods, developed by Ramsay (1991), is kernel

regression. For another line of nonparametric modeling of test data, see the

work of Molenaar (1997) and Sijtsma and Molenaar (2002) for up-to-date

reviews. The essential principle underlying this approach is the replacement

of the regression function of the responses on some independent ability score

by a function obtained through a smoothing operation. The smoothing oper-

ation is based on local weighted averaging of the responses with a so-called

kernel as the weight function. More specifically, in order to obtain kernel-

smoothed estimates of the response function of item i, the following steps

are implemented:

1. Select a set of Q evaluation points xq along the ability scale (e.g., x1 = −3.00, x2 = −2.88, x3 = −2.76, …, x49 = 2.76, x50 = 2.88, x51 = 3.00, where Q = 51).

2. Obtain an ability score that is independent of item i and transform it to

have this scale. Usually, the rest score is taken (i.e., the total number-correct

score that excludes item i). Let Xj denote the score for examinee j.


3. Estimate the probability of a positive response to item i at each evaluation point as

P̂i(xq) = Σ(j=1 to N) wjq uij     (2.6)

where uij represents the response of examinee j to item i (i.e., 0 or 1), and wjq represents the weight assigned to examinee j at evaluation point xq. The weight for examinee j at evaluation point xq is given by

wjq = K[(Xj − xq)/h] / Σ(j′=1 to N) K[(Xj′ − xq)/h]     (2.7)

where Xj is the item-independent ability score for examinee j.

The behavior of the estimates depends on the choices of bandwidth parameter h and kernel function K. The bandwidth parameter controls the amount of bias

and variation in the IRF. As h decreases, the amount of bias is reduced but the variation is increased (i.e., less smoothness). The opposite occurs when h is increased. As a rule of thumb, h is often set equal to 1.1N^(−0.2) (Ramsay, 1991) so as to produce a smoothed function with acceptable bias. Kernel function K is chosen to be nonnegative and to approach zero as Xj moves away from an evaluation point, xq. Two commonly used kernel functions are the Gaussian [K(y) = exp(−y²)] and the uniform [K(y) = 1 if |y| ≤ 1, else 0]. Given the previous information, it is apparent that the further an examinee's ability score Xj is away from evaluation point xq, the less weight that examinee has in determining P̂i(xq). For example, the Gaussian kernel has this feature; its value is largest at y = 0, and it decreases monotonically with the distance from this point.
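The steps above can be sketched in Python as follows. This is our illustration, not TESTGRAF; for simplicity it uses simulated true abilities in place of the rest scores that would normally serve as the independent ability score.

```python
import math
import random

def kernel_irf(responses, ability, x_grid, h=None):
    """Kernel-smoothed IRF estimate (Equations 2.6 and 2.7)."""
    n = len(responses)
    if h is None:
        h = 1.1 * n ** (-0.2)   # Ramsay's rule-of-thumb bandwidth
    curve = []
    for xq in x_grid:
        # Gaussian kernel weights, normalized over examinees (Eq. 2.7).
        w = [math.exp(-((X - xq) / h) ** 2) for X in ability]
        total = sum(w)
        # Locally weighted average of the 0/1 responses (Eq. 2.6).
        curve.append(sum(wj * u for wj, u in zip(w, responses)) / total)
    return curve

# Simulated illustration: responses generated from a 2PL item should
# produce an increasing smoothed curve.
random.seed(1)
theta = [random.gauss(0, 1) for _ in range(2000)]
u = [1 if random.random() < 1 / (1 + math.exp(-1.5 * t)) else 0
     for t in theta]
grid = [-3.0 + 0.12 * q for q in range(51)]   # the 51 evaluation points
curve = kernel_irf(u, theta, grid)
print(round(curve[0], 2), round(curve[-1], 2))
```

Because the generating item here is monotone, the smoothed curve rises from near zero to near one; applied to real data, the same estimator is free to reveal nonmonotone shapes like the one in Figure 2.6.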

The computer program TESTGRAF (Ramsay, 1992) is often used to

perform kernel smoothing on test data. Figure 2.6 illustrates a nonpara-

metrically estimated IRF using TESTGRAF for an item from a large-

scale assessment. The psychological meaning for the curve is not clear, but

with parametric models, the unusual shape of the curve would not even

be known.

It is apparent from Figure 2.6 that the advantage of estimating the IRF

nonparametrically is that because it is not constrained to follow a monotonic

parametric shape, it provides a deeper understanding of its ordinal features.

For example, the 3PLM would never be able to show the local decrease in

response probability just above an ability of −2 in Figure 2.6. A statistical

disadvantage of the nonparametric approach, however, is the large number of

parameters per item that has to be estimated, and therefore the large amount



Figure 2.6 A nonparametric ICC estimated using TESTGRAF for an item on a large-scale

assessment.

of response data required for its application. For example, for the rest score,

the number of item parameters in a nonparametric approach is n − 1 (namely,

one response probability for each possible score minus 1), whereas the 3PLM

requires estimation of only three parameters per item. Further, if all items

on a test are modeled nonparametrically, then certain applications within

IRT cannot be performed. For example, the same ability parameter cannot be estimated from different selections of items, and the nonparametric

approach consequently fails to support applications with structurally incom-

plete designs (e.g., adaptive testing and scale linking). Nevertheless, the non-

parametric approach to modeling the IRF has proven to be a useful tool for investigating the psychometric properties of an instrument. For example, nonparamet-

ric models have been used for item analysis (Ramsay, 1991; Santor, Zuroff,

Ramsay, Cervantes, & Palacios, 1995), testing for differential item function-

ing (Santor, Ramsay, & Zuroff, 1994; Shealy & Stout, 1993), testing possible

local dependencies, and even testing the fit of a parametric model (Douglas

& Cohen, 2001; Wells & Bolt, 2008). Therefore, the use of nonparametric

models in conjunction with parametric models appears to be a productive

strategy for building a meaningful score scale, and we expect to see more

applications of nonparametric models to handle polytomous response data in

the coming years. The software is available, and the few applications so far

appear very promising.

Samejima has been prolific over the years in her development of IRT models

for handling polytomous response data. Her creativity and insight placed

her 15 to 20 years ahead of the time when her models would be needed.

Following her work with the graded response model, Samejima (1972) made


the logical jump to the case where continuous response or free response data replaced categorical data; she also extended her work to multidimensional modeling of polytomous response data; and in 1997 she extended her modeling work again, this time to incorporate some subtleties in the estimation of ability. Unlike the GRM, the free-response model has received very little

attention.

While the present chapter covers the major IRT models for dichotomous

and polytomous data, it does not provide a description of all available IRT

models. For a thorough description of several other popular IRT models

such as other polytomous models (e.g., nominal categories model, rating


scale model), multidimensional IRT models for both dichotomous and poly-

tomous data, other nonparametric models (e.g., the monotone homogeneity

model), and unfolding models for nonmonotone items, see Fischer (1974),

van der Linden and Hambleton (1997), Ostini and Nering (2006), Andrich

and Luo (1993), Roberts, Donoghue, and Laughlin (2000), and Verhelst and

Verstralen (1993).

Perhaps a few words regarding parameter estimation would be useful. Much

of the work to date has been for models handling dichotomously scored data, but the maximum likelihood and Bayesian estimation principles apply equally well to polytomous models, although the complexity is substantially increased because of the increase in the number of model parameters.

The utility of IRT depends on accurately estimating the item parameters

and examinee ability. When the ability parameters are known, estimating the

item parameters is straightforward. Similarly, estimating the ability param-

eter is straightforward when the item parameters are known. The challenge,

however, is estimating the parameters when both sets are unknown. Due to

the complicated nature of the equations for IRT parameter estimation, only

a brief description of a few popular methods will follow. See Baker and Kim

(2004) for a detailed description of popular estimation methods.

Joint maximum likelihood estimation (JMLE) and marginal maximum

likelihood estimation (MMLE; Bock & Lieberman, 1970) are two estima-

tion methods well addressed in the literature. Although MMLE has become

the standard in the testing field and JMLE lacks a basic statistical requirement,

the latter is simpler to describe and is therefore outlined first. JMLE uses an

iterative, two-stage procedure to estimate the item and person parameters

simultaneously. In Stage 1, the item parameters are estimated assuming the

ability parameters are known (simple functions of raw scores may be used as

initial estimates of θ) by maximizing the following likelihood function for

the responses of N examinees on item i:

\[
L(\mathbf{a}, \mathbf{b}, \mathbf{c};\, \mathbf{u}, \boldsymbol{\theta}) = \prod_{j=1}^{N} \prod_{i=1}^{n} P_j^{u_j} (1 - P_j)^{1 - u_j}
\tag{2.8}
\]

IRT Models for the Analysis of Polytomously Scored Data 37

where u represents the matrix with the responses, θ and a, b, and c are
the vectors with the examinee and item parameters, respectively, P_j is the
model-based probability, and u_j is the item response for examinee j. By taking

derivatives and setting them equal to zero, maximum likelihood estimates

for each item parameter are obtained, for example, via a Newton-Raphson

procedure.

In Stage 2, the ability parameters are estimated treating the item param-

eters from Stage 1 as known by maximizing the likelihood function for a

response pattern for examinee j, which by assuming local independence may

be expressed as follows:

\[
L(\boldsymbol{\theta};\, \mathbf{u}, \mathbf{a}, \mathbf{b}, \mathbf{c}) = \prod_{j=1}^{N} \prod_{i=1}^{n} P_j^{u_j} (1 - P_j)^{1 - u_j}
\tag{2.9}
\]

Equation 2.9 is now treated as a function of the unknown ability parameters

given the response data and all item parameters.

Before proceeding back to Stage 1, the updated ability parameter estimates

are renormed using the restrictions adopted to eliminate the indeterminacy

in the scale (usually, mean estimate equal to zero and standard deviation to

one). The two stages are repeated, using the updated estimates from each

subsequent stage, until a convergence criterion is met (e.g., estimates change

a minimal amount between iterations).
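The two-stage logic can be sketched for the simplest case, the Rasch (one-parameter) model, for which Stage 1 reduces to a one-dimensional search per item. The following is an illustrative sketch, not the authors' implementation: the simulated data, the grid search (standing in for Newton-Raphson), and all names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_rasch(theta, b):
    # Rasch probability of a correct response
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Simulated data: N examinees, n items with known true difficulties
N = 2000
b_true = np.array([-1.0, 0.0, 1.0])
theta = rng.standard_normal(N)
u = (rng.random((N, b_true.size)) < p_rasch(theta[:, None], b_true[None, :])).astype(int)

def estimate_b(theta, u_i, grid=np.linspace(-3, 3, 601)):
    # Stage 1 for a single item: maximize the likelihood of Equation 2.8
    # over b, treating the ability parameters as known (a grid search
    # stands in for the Newton-Raphson step described in the text)
    loglik = [float(np.sum(u_i * np.log(p_rasch(theta, bb)) +
                           (1 - u_i) * np.log(1 - p_rasch(theta, bb))))
              for bb in grid]
    return float(grid[int(np.argmax(loglik))])

b_hat = np.array([estimate_b(theta, u[:, i]) for i in range(b_true.size)])
print(b_hat)  # close to b_true when the abilities are treated as known
```

Stage 2 would reverse the roles, maximizing over each examinee's θ with the item parameters fixed, and the two stages would alternate until convergence.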

Unfortunately, because the item and person parameters are being

estimated simultaneously in JMLE, the estimates are not consistent. Bock

and Lieberman (1970) developed MMLE to address the disadvantages of

JMLE by integrating the unknown ability parameters out of the likelihood

function so that only the item parameters are left to be estimated. Therefore,

the problem becomes one of maximizing a marginal likelihood function in

which the unknown ability parameters have been removed through integra-

tion. Bock and Aitkin (1981) implemented MMLE using an expectation-

maximization (EM) algorithm.
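The word "marginal" can be made concrete with a few lines of numerical integration: the probability of a response pattern is averaged over an assumed N(0, 1) ability distribution, so θ no longer appears in the quantity being maximized. The item parameters and the simple grid quadrature below are illustrative stand-ins for the Gauss-Hermite quadrature usually used.

```python
import itertools
import numpy as np

def p2pl(theta, a, b):
    # Two-parameter logistic item response function
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def marginal_prob(pattern, a, b, n_points=61):
    # P(u) = integral over theta of the pattern likelihood times the
    # N(0,1) density, approximated on an equally spaced grid
    theta = np.linspace(-6, 6, n_points)
    weight = np.exp(-0.5 * theta**2)
    weight /= weight.sum()          # normalize the discrete prior
    P = p2pl(theta[:, None], a, b)  # n_points x n_items
    lik = np.prod(np.where(np.array(pattern), P, 1 - P), axis=1)
    return float(np.sum(lik * weight))

a = np.array([1.0, 1.5])
b = np.array([-0.5, 0.5])
# The marginal probabilities of all 2^n response patterns sum to one
total = sum(marginal_prob(p, a, b) for p in itertools.product([0, 1], repeat=2))
print(total)  # ≈ 1.0
```

MMLE then maximizes the product of such marginal probabilities over examinees with respect to the item parameters only.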

Once the item parameters have been estimated using MMLE, estimates

of ability can be derived by treating the item parameter estimates as if they

are known item parameters, that is, using the same method outlined in Stage

2 of JMLE. Although for applications with a single standard test without

missing data simple number-correct scores may be appropriate, there are sev-

eral advantages to estimating θs, including: the IRT-estimated θs are comparable
when items are added or deleted from the test, they adjust the estimates

for the properties of the individual items (such as their difficulty and dis-

crimination), they produce more accurate standard errors, they provide better

adjustments for guessing than classical methods, and they are on the same

scale as the difficulty parameters. A primary disadvantage of MLE of θ is

that for examinees who answer all items correctly or incorrectly, no estimate

can be obtained.
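The failure of MLE for perfect and zero scores can be seen directly: for an all-correct pattern the log-likelihood increases without bound in θ, so any search simply runs off the edge of its range. The item parameters below are illustrative.

```python
import numpy as np

def p2pl(theta, a, b):
    # Two-parameter logistic item response function
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def loglik(theta, u, a, b):
    # Log-likelihood of one response pattern, item parameters known
    P = p2pl(theta, a, b)
    return float(np.sum(u * np.log(P) + (1 - u) * np.log(1 - P)))

a = np.array([1.0, 1.2, 0.8])
b = np.array([-1.0, 0.0, 1.0])
grid = np.linspace(-4, 4, 801)

# A mixed pattern has an interior maximum: a finite ML estimate
mixed = np.array([1, 1, 0])
theta_hat = float(grid[np.argmax([loglik(t, mixed, a, b) for t in grid])])

# An all-correct pattern does not: the log-likelihood keeps rising,
# so the "maximum" is always at the boundary of the search range
perfect = np.array([1, 1, 1])
ll_perfect = [loglik(t, perfect, a, b) for t in grid]
print(theta_hat, grid[np.argmax(ll_perfect)])
```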

38 Ronald K. Hambleton, Wim J. van der Linden, and Craig S. Wells

An alternative is a Bayesian approach, in which prior information is provided
about the ability parameters in the form of a prior distribution and is
incorporated into the likelihood function. The prior distribution is updated
by the response data into a

posterior distribution for θ, which is the Bayesian estimate of θ. The mean of
the posterior distribution may be used as a point estimate of θ, known as the

expected a posteriori (EAP) estimate. The mode of the posterior distribution

may also be used as a point estimate of ability and is known as the maximum

a posteriori (MAP) estimate (see, e.g., Swaminathan, 2005).
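An EAP estimate can be computed in a few lines of quadrature; unlike the ML estimate, it is finite even for all-correct or all-incorrect patterns because the prior keeps the posterior proper. The item parameters and grid are illustrative assumptions.

```python
import numpy as np

def p2pl(theta, a, b):
    # Two-parameter logistic item response function
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap(u, a, b, n_points=81):
    # Posterior mean of theta under a standard normal prior,
    # computed by discrete quadrature over an equally spaced grid
    theta = np.linspace(-4, 4, n_points)
    prior = np.exp(-0.5 * theta**2)
    P = p2pl(theta[:, None], a, b)
    lik = np.prod(np.where(np.array(u), P, 1 - P), axis=1)
    post = prior * lik
    post /= post.sum()
    return float(np.sum(theta * post))

a = np.array([1.0, 1.2, 0.8])
b = np.array([-1.0, 0.0, 1.0])
# Finite point estimates even for perfect and zero scores:
print(eap([1, 1, 1], a, b), eap([0, 0, 0], a, b))
```

Replacing the posterior mean with the posterior mode (the grid point maximizing `post`) would give the MAP estimate instead.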

The next generation of IRT modeling for both dichotomous and polytomous
response data will likely proceed in a Bayesian way with new numerical

procedures for establishing the posterior distribution of model parameters

known as Markov chain Monte Carlo (MCMC) procedures. The common

feature of these procedures is that they explore the posterior distribution by
performing iterative random draws from the posterior distribution of one class

of parameters given the previous draws from those of all other parameters.

Because of this iterative approach, MCMC procedures are particularly powerful
for models with a high-dimensional parameter space. They also do not

require the calculation of the first and second derivatives, which makes MLE

cumbersome for complex models. At the same time, MCMC procedures can

take several hours for a complex model with a larger data set to produce

proper parameter estimates. Still, from a modeling perspective, this approach allows

researchers to be very creative, and currently implementations of MCMC

procedures for advanced IRT models are being intensively investigated.
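The basic mechanics of such a sampler can be sketched for the simplest possible case: random-walk Metropolis draws of a single examinee's θ with the item parameters treated as known. Everything here (the 2PL items, the N(0, 1) prior, the step size) is an assumption for illustration, not a production algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_posterior(theta, u, a, b):
    # Standard normal prior plus 2PL log-likelihood
    P = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    P = np.clip(P, 1e-12, 1 - 1e-12)      # guard the logarithms
    return -0.5 * theta**2 + float(np.sum(u * np.log(P) + (1 - u) * np.log(1 - P)))

def metropolis_theta(u, a, b, n_draws=5000, step=1.0, burn_in=1000):
    # Random-walk Metropolis draws from one examinee's posterior,
    # item parameters treated as known
    theta = 0.0
    lp = log_posterior(theta, u, a, b)
    draws = []
    for _ in range(n_draws):
        proposal = theta + step * rng.standard_normal()
        lp_prop = log_posterior(proposal, u, a, b)
        if np.log(rng.random()) < lp_prop - lp:   # accept/reject step
            theta, lp = proposal, lp_prop
        draws.append(theta)
    return np.array(draws[burn_in:])              # discard burn-in

a = np.array([1.0, 1.2, 0.8])
b = np.array([-1.0, 0.0, 1.0])
draws = metropolis_theta(np.array([1, 1, 0]), a, b)
print(draws.mean(), draws.std())  # posterior mean and spread for theta
```

A full MCMC procedure for an IRT model cycles draws of this kind over every class of parameters in turn, which is what makes the approach workable in high-dimensional parameter spaces.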

Conclusion

Today, though many polytomous IRT models have been developed, only

a handful are receiving frequent use in education and psychology: (1) the

nominal response model, (2) the graded response model, (3) the polytomous

Rasch model, (4) the partial credit model, and (5) the generalized par-

tial credit model. Of the unidimensional models for handling polytomous

response data, these are almost certainly the five most frequently used, and

the ones that will be described in greater detail in subsequent chapters.

One of the biggest challenges facing applications of polytomous item

response models has been the shortage of user-friendly software. Several

software packages are available for parameter estimation: BILOG-MG
(www.ssicentral.com) can be used with the one-, two-, and three-parameter
logistic models; PARSCALE (www.ssicentral.com) with the graded
response model and the generalized partial credit model; MULTILOG
(www.ssicentral.com) with the nominal response model and the graded
response model; and WINSTEPS and FACETS (www.winsteps.com) and
CONQUEST with the dichotomous Rasch model, partial credit model,
and polytomous Rasch model. The Web site www.assess.com is particularly

helpful in locating IRT software. We are already aware, too, of several major

releases of software scheduled for 2010, including a new version of Multilog

and software to comprehensively address model fit. Still, in general, it is to

be hoped that software in the future can be made more user-friendly.

Approaches to model fit too remain as technical challenges. Testing a

model with data is a losing proposition because with a sufficiently large sam-

ple size (which is desirable for obtaining stable model parameter estimates),

power to reject any IRT model is high. Investigating practical consequences

of any model misfit remains a more promising direction for studies of model

fit. Alternatively, it is becoming more standard to test one model against
another.

References

Agresti, A. (1990). Categorical data analysis. New York: Wiley.

Andersen, E. B. (1973). Conditional inferences for multiple-choice questionnaires.

British Journal of Mathematical and Statistical Psychology, 26, 31–44.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage Publications.
Andrich, D., & Luo, G. (1993). A hyperbolic cosine latent trait model for unfolding dichotomous single-stimulus responses. Applied Psychological Measurement, 17, 253–276.
Baker, F. B. (1965). Origins of the item parameters X50 and β as a modern item analysis technique. Journal of Educational Measurement, 2, 167–180.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques

(2nd ed.). New York: Marcel Dekker.

Birnbaum, A. (1957). Efficient design and use of mental ability for various decision-

making problems (Series Report 5816). Randolph Air Force Base, TX: USAF

School of Aviation Medicine.

Birnbaum, A. (1958). On the estimation of mental ability (Series Report No. 15)

Randolph Air Force Base, TX: USAF School of Aviation Medicine.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (Chaps. 17 to 20). Reading, MA: Addison-Wesley.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179–197.
Douglas, J., & Cohen, A. S. (2001). Nonparametric item response function estimation for assessing parametric model fit. Applied Psychological Measurement, 25, 234–243.
Engelhard, G. (2005). IRT models for rating scale data. In B. Everitt & D. Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 2, pp. 995–1003). West Sussex, UK: John Wiley & Sons.
Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests [Introduction to the theory of psychological tests]. Bern, Switzerland: Huber.


Fitzpatrick, A. R., Link, V. B., Yen, W. M., Burket, G. R., Ito, K., & Sykes, R. C. (1996). Scaling performance assessments: A comparison of one-parameter and two-parameter partial credit models. Journal of Educational Measurement, 33, 291–314.
Fitzpatrick, A. R., & Yen, W. M. (1995). The psychometric characteristics of choice items. Journal of Educational Measurement, 32, 243–259.

Gulliksen, H. (1950). Theory of mental test scores. New York: Wiley.

Hambleton, R. K., & Han, N. (2005). Assessing the fit of IRT models to educational

and psychological test data: A five step plan and several graphical displays. In

W. R. Lenderking & D. Revicki (Eds.), Advances in health outcomes research methods, measurement, statistical analysis, and clinical applications (pp. 57–78).

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item

response theory. Thousand Oaks, CA: Sage Publications.

Jansen, P. G. W., & Roskam, E. E. (1986). Latent trait models and dichotomization of graded responses. Psychometrika, 51, 29–91.

Lazarsfeld, P. F. (1950). The logical and mathematical foundation of latent struc-

ture analysis. In S. A. Stouffer, L. Guttman, E. A. Suchman, P. F. Lazarsfeld,

S. A. Star, & J. A. Clausen (Eds.), Measurement and prediction. Princeton, NJ:

Princeton University Press.

Lord, F. M. (1952). A theory of test scores. Psychometrika, Monograph 7, 17.

Lord, F. M. (1953a). The relation of test score to the trait underlying the test. Educational and Psychological Measurement, 13, 517–548.
Lord, F. M. (1953b). An application of confidence intervals and of maximum likelihood to the estimation of an examinee's ability. Psychometrika, 18, 57–76.
Lord, F. M. (1965). An empirical study of item-test regression. Psychometrika, 30, 373–376.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores (with con-

tributions by Allen Birnbaum). Reading, MA: Addison-Wesley.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Masters, G. (1988). The analysis of partial credit scoring. Applied Measurement in Education, 1, 279–298.
Masters, G. N., & Wright, B. D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529–544.
Mellenbergh, G. J. (1995). Conceptual notes on models for discrete polytomous item responses. Applied Psychological Measurement, 19, 91–100.

Molenaar, I. W. (1983). Item steps (Heymans Bulletins HB-83-630-EX). Groningen,

NL: Psychologisch Instituut RU Groningen.

Molenaar, I. (1997). Nonparametric models for polytomous responses. In W. J. van

der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory

(pp. 369–380). New York: Springer.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
Muraki, E. (1993). Information functions of the generalized partial credit model. Applied Psychological Measurement, 17, 351–363.

Ostini, R., & Nering, M. L. (2006). Polytomous item response theory models. Thousand

Oaks, CA: Sage Publications.

Ramsay, J. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56, 611–630.


Ramsay, J. (1992). TESTGRAF: A program for the graphical item analysis of multiple-

choice test and questionnaire data (Technical Report). Montreal, Canada: McGill

University.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests.

Copenhagen, Denmark: Danish Institute for Educational Research.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology.

In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (pp. 321–334). Berkeley: University of California Press.
Roberts, J. S., Donoghue, J. R., & Laughlin, J. S. (2000). A general item response theory model for unfolding unidimensional polytomous responses. Applied Psychological Measurement, 24, 3–32.

Roskam, E. E., & Jansen, P. G. W. (1989). Conditions for Rasch-dichotomizability of the unidimensional polytomous Rasch model. Psychometrika, 54, 317–332.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded

scores. Psychometrika, Monograph 17, 34.

Samejima, F. (1972). A general model for free-response data. Psychometrika,

Monograph 18, 37.

Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K.

Hambleton (Eds.), Handbook of modern item response theory (pp. 85–100). New

York: Springer.

Santor, D. A., Ramsay, J. O., & Zuroff, D. C. (1994). Nonparametric item analyses

of the Beck Depression Inventory: Evaluating gender item bias and response

option weights. Psychological Assessment, 6, 255–270.

Santor, D. A., Zuroff, D. C., Ramsay, J. O., Cervantes, P., & Palacios, J. (1995).

Examining scale discriminability in the BDI and CES-D as a function of

depressive severity. Psychological Assessment, 7, 131–139.

Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58, 159–194.

Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response

theory. Thousand Oaks, CA: Sage.

Swaminathan, H. (2005). Bayesian item response theory estimation. In B. Everitt

& D. Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 1,

pp. 134–139). West Sussex, UK: John Wiley & Sons.

Thissen, D. (1981). MULTILOG: Item analysis and scoring with multiple category mod-

els. Chicago: International Educational Services.

Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models.

Psychometrika, 51, 567–577.

Thurstone, L. L. (1925). A method of scaling psychological and educational tests.

Journal of Educational Psychology, 16, 433–451.

Tucker, L. R. (1946). Maximum validity of a test with equivalent items. Psychometrika, 11, 1–13.

Tutz, G. (1997). Sequential models for ordered responses. In W. J. van der Linden &

R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 139–152).

New York: Springer.

van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item

response theory. New York: Springer.

Verhelst, N. D., Glas, C. A. W., & de Vries, H. H. (1997). A steps model to analyze

partial credit. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of

modern item response theory (pp. 123–138). New York: Springer.


Verhelst, N. D., & Verstralen, H. (1993). A stochastic unfolding model derived from the partial credit model. Kwantitatieve Methoden, 42, 73–92.

Wells, C. S., & Bolt, D. M. (2008). Investigation of a nonparametric procedure for

assessing goodness-of-fit in item response theory. Applied Measurement in

Education, 21, 22–40.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement.

Chicago: MESA Press.

Zhao, Y., & Hambleton, R. K. (2009). Software for IRT analyses: Descriptions and

features (Center for Educational Assessment Research Report 652). Amherst,

MA: University of Massachusetts, Center for Educational Assessment.


Chapter 3

The Nominal Categories Item Response Model

David Thissen

The University of North Carolina at Chapel Hill

Li Cai

University of California, Los Angeles

with a contribution by

R. Darrell Bock

University of Illinois at Chicago

Introduction

Editor Introduction: This chapter elaborates the development of the most general polytomous

IRT model covered in this book. It is the only model in this book that does not assume ordered

polytomous response data and can therefore be used to measure traits and abilities with items that

have unordered response categories. It can be used to identify the empirical ordering of response

categories where that ordering is unknown a priori but of interest, or it can be used to check whether

the expected ordering of response categories is supported in data. The authors present a new

parameterization of this model that may serve to expand the model and to facilitate a more wide-

spread use of the model. Also discussed are various derivations of the model and its relationship to

other models. The chapter concludes with a special section by Bock, where he elaborates on the

background of the nominal model.

The nominal categories model (Bock, 1972, 1997) was originally pro-

posed shortly after Samejima (1969, 1997) described the first general item

response theory (IRT) model for polytomous responses. Samejima's graded
models (in normal ogive and logistic form) were designed for item responses

that have some a priori order as they relate to the latent variable being

measured (θ); the nominal model was designed for responses with no
predetermined order.

Samejima (1969) illustrated the use of the graded model with the analy-

sis of data from multiple-choice items measuring academic proficiency. The

weakness of the use of a graded model for that purpose arises from the fact

that the scoring order, or relative degree of correctness, of multiple-choice

response alternatives can only rarely be known a priori. That was part of the

motivation for the development of the nominal model. Bock's (1972)
presentation of the nominal model also used multiple-choice items measuring
vocabulary to illustrate its application. Ultimately, neither Samejima's (1969,
1997) graded model nor Bock's (1972, 1997) nominal model has seen
widespread use as a model for the responses to multiple-choice items, because, in

addition to the aforementioned difficulty prespecifying order for multiple-

choice alternatives, neither the graded nor the nominal model makes any

provision for guessing. Elaborating a suggestion by Samejima (1979), Thissen

and Steinberg (1984) described a generalization of the nominal model that

does take guessing into account, and that multiple-choice model is preferable

if IRT analysis of all of the response alternatives for multiple-choice items

is required.

Current Uses

Nevertheless, the nominal model is in widespread use in item analysis and

test scoring. The nominal model is used for three purposes: (1) as an item

analysis and scoring method for items that elicit purely nominal responses, (2)

to provide an empirical check that items expected to yield ordered responses

have actually done so (Samejima, 1988, 1996), and (3) to provide a model for

the responses to testlets. Testlets are sets of items that are scored as a unit

(Wainer & Kiely, 1987); often testlet response categories are the patterns

of response to the constituent items, and those patterns are rarely ordered a

priori.

Bock's (1972) original formulation of the nominal model was

\[
T(u = k \mid \theta;\, \mathbf{a}, \mathbf{c}) = T(k) = \frac{\exp(z_k)}{\sum_i \exp(z_i)}
\tag{3.1}
\]

in which T, the curve tracing the probability that the item response u is in

category k, is a function of the latent variable θ with vector parameters a and c.

In what follows we will often shorten the notation for the trace line to T(k),

and in this presentation we number the response alternatives k = 0, 1, ..., m − 1

for an item with m response categories. The model itself is the so-called mul-

tivariate logistic function, with arguments

\[
z_k = a_k \theta + c_k
\tag{3.2}
\]

in which each z_k is a linear function of θ with slope parameter a_k and intercept c_k. Equations 3.1 and 3.2

can be combined and made more compact as

\[
T(k) = \frac{\exp(a_k \theta + c_k)}{\sum_i \exp(a_i \theta + c_i)}
\tag{3.3}
\]

As stated in Equation 3.3, the model is twice not identified: The addition

of any constant to either all of the a_k's or all of the c_k's yields different parameter
sets but the same values of T(k). As identification constraints, Bock (1972)

suggested

\[
\sum_{k=0}^{m-1} a_k = \sum_{k=0}^{m-1} c_k = 0
\tag{3.4}
\]

and using

\[
\mathbf{a} = \mathbf{T}\boldsymbol{\alpha} \quad \text{and} \quad \mathbf{c} = \mathbf{T}\boldsymbol{\gamma}
\tag{3.5}
\]

with the m × (m − 1) deviation-contrast matrix, each of whose columns sums to zero,
\[
\mathbf{T}_{\mathrm{DEV}} =
\begin{bmatrix}
-\frac{1}{m} & -\frac{1}{m} & \cdots & -\frac{1}{m} \\
\frac{m-1}{m} & -\frac{1}{m} & \cdots & -\frac{1}{m} \\
-\frac{1}{m} & \frac{m-1}{m} & \cdots & -\frac{1}{m} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{m} & -\frac{1}{m} & \cdots & \frac{m-1}{m}
\end{bmatrix}
\tag{3.6}
\]

With the T matrices defined as in Equation 3.6, the vectors (of length
m − 1) α and γ may take any value and yield vectors a and c with elements

that sum to zero. As is the case in the analysis of variance, other contrast (T)

matrices may be used as well (see Thissen and Steinberg (1986) for examples);

for reasons that will become clear, in this presentation we will use systems

that identify the model with the constraints a_0 = c_0 = 0 instead of the original

identification constraints.
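These identification facts can be checked numerically. The sketch below (illustrative Python, not the authors' code) evaluates Equation 3.3, verifies that shifting all a_k's and all c_k's by constants leaves the trace lines unchanged, and builds one standard deviation-style contrast matrix whose columns sum to zero; the precise layout of the book's T_DEV may differ.

```python
import numpy as np

def nominal_trace(theta, a, c):
    # Equation 3.3: T(k) = exp(a_k*theta + c_k) / sum_i exp(a_i*theta + c_i)
    z = np.outer(theta, a) + c
    z -= z.max(axis=1, keepdims=True)       # numerical stability
    ez = np.exp(z)
    return ez / ez.sum(axis=1, keepdims=True)

theta = np.linspace(-3, 3, 13)
a = np.array([0.0, 1.0, 2.0, 3.0])          # Item 1 of Table 3.1
c = np.zeros(4)
T = nominal_trace(theta, a, c)

# "Twice not identified": adding one constant to all a_k and another to
# all c_k changes the parameters but leaves every trace line unchanged
T_shifted = nominal_trace(theta, a + 5.0, c - 2.0)

# A deviation-style contrast matrix: m x (m - 1), each column summing to
# zero, so a = T_DEV @ alpha satisfies Equation 3.4 for any alpha
m = 4
T_DEV = np.eye(m, m - 1, k=-1) - 1.0 / m
a_from_alpha = T_DEV @ np.array([0.3, -1.2, 2.0])

print(np.allclose(T.sum(axis=1), 1.0))   # True: rows are probabilities
print(np.allclose(T, T_shifted))         # True: identical trace lines
print(a_from_alpha.sum())                # ≈ 0, the sum-to-zero constraint
```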

Figure 3.1 shows four sets of trace lines that illustrate some of the range of

variability of item response functions that can be obtained with the nominal


Figure 3.1 Upper left: Trace lines for an artificially constructed four-alternative item. Upper right: Trace lines for the Identify testlet described by Thissen and Steinberg (1988). Lower left: Trace lines for the number correct on questions following a passage on a reading comprehension test, using parameter estimates obtained by Thissen, Steinberg, and Mooney (1989). Lower right: Trace lines for judge-scored constructed-response item M075101 from the 1996 administration of the NAEP mathematics assessment.

model. The corresponding values of the parameter vectors a and c are shown

in Table 3.1.

The curves in the upper left panel of Figure 3.1 artificially illustrate a

maximally ordered, centered set of item responses: As seen in the leftmost

two columns of Table 3.1 (for Item 1) the values of a_k increase by 1.0 as k
increases; as we will see in a subsequent section, that produces an ordered
variant of the nominal model. All of the values of c_k are identically 0.0, so
the trace lines all cross at the same value of θ. The upper right panel of Figure 3.1

Table 3.1 Original Nominal Model Parameter Values for the Trace Lines Shown in Figure 3.1

Response          Item 1      Item 2      Item 3      Item 4
Category (k)      a     c     a     c     a     c     a      c
0                 0.0   0.0   0.0   0.0   0.0   0.0   0.00   0.0
1                 1.0   0.0   0.0   0.9   0.2   0.5   0.95   1.2
2                 2.0   0.0   1.1   0.7   0.7   1.8   1.90   0.2
3                 3.0   0.0   2.7   0.7   1.3   3.0   2.85   1.4
4                 —     —     —     —     2.2   3.3   3.80   2.7

shows the trace lines that correspond to the parameter estimates (marked Item 2
in Table 3.1) obtained by Thissen and Steinberg (1988) (and subsequently

by Hoskens and De Boeck (1997); see Baker and Kim (2004) for the details of

maximum marginal likelihood parameter estimation) for a testlet compris-

ing two items from Bergan and Stone's (1985) data obtained with a test of

preschool mathematics proficiency. The two items required the child to iden-

tify the numerals 3 and 4; the curves are marked 0 for neither identified, 1 for

3 identified but not 4, 2 for 4 identified but not 3, and 3 for both identified

correctly. This is an example of a testlet with semiordered responses: The 0

and 1 curves are proportional because their a_k estimates are identical, indicating
they have the same relation to proficiency: Both may be taken as incorrect. If

a child can identify 4 but not 3 (the 2 curve), that indicates a moderate, pos-

sibly developing, degree of mathematical proficiency, and both correct (the

3 curve) increases as θ increases.

The lower left panel of Figure 3.1 shows trace lines that correspond to

parameter estimates (marked Item 3 in Table 3.1) obtained by Thissen,

Steinberg, and Mooney (1989) fitting the nominal model to the number-

correct score for the questions following each of four passages on a read-

ing comprehension test. Going from left to right, the model indicates that

the responses are increasingly ordered for this number-correct scored testlet:

Summed scores of 0 and 1 have nearly the same trace lines, because 0 (of 4)

and 1 (of 4) are both scores that can be obtained with nearly equal probability

by guessing on five-alternative multiple-choice items. After that, the trace

lines look increasingly like those of a graded model. The lower right panel of

Figure 3.1 is for a set of graded responses: It shows the curves that correspond

to the parameter estimates for an extended constructed response mathemat-

ics item administered as part of the National Assessment of Educational

Progress (NAEP) (Allen, Carlson, & Zelenak, 1999). The judged scores

(from 0 to 4) were fitted with Muraki's (1992, 1997) generalized partial

credit (GPC) model, which is a constrained version of the nominal model. In

Table 3.1, the parameters for this item (Item 4 in the two rightmost columns)
have been converted into values of a_k and c_k for comparability with the other
items' parameters. The GPC model is an alternative to Samejima's (1969,

1997) graded model for such ordered responses; the two models generally

yield very similar trace lines for the same data. In subsequent sections of this

chapter we will discuss the relation between the GPC and nominal models

in more detail.
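The sense in which the GPC model is a constrained nominal model can be made concrete: with a_k = k·a and c_k = −a·(b_1 + … + b_k), the nominal trace lines of Equation 3.3 reproduce the GPC probabilities exactly. The item parameters below are illustrative, and the reparameterization shown is one standard way of writing the correspondence.

```python
import numpy as np

def nominal_trace(theta, a_vec, c_vec):
    # Equation 3.3 evaluated for a grid of theta values
    z = np.outer(theta, a_vec) + c_vec
    ez = np.exp(z - z.max(axis=1, keepdims=True))
    return ez / ez.sum(axis=1, keepdims=True)

def gpc_trace(theta, a, b_steps):
    # Generalized partial credit model written directly:
    # P(k) proportional to exp(sum over v <= k of a*(theta - b_v))
    z = np.concatenate([[0.0], np.cumsum(a * (theta - b_steps))])
    ez = np.exp(z - z.max())
    return ez / ez.sum()

a = 0.95
b_steps = np.array([-1.0, 0.2, 0.8, 1.5])
theta = np.linspace(-3, 3, 13)

# GPC as a constrained nominal model: a_k = k*a, c_k = -a * cumsum(b)
a_vec = a * np.arange(5)
c_vec = np.concatenate([[0.0], -a * np.cumsum(b_steps)])

P_nom = nominal_trace(theta, a_vec, c_vec)
P_gpc = np.array([gpc_trace(t, a, b_steps) for t in theta])
print(np.max(np.abs(P_nom - P_gpc)))  # ≈ 0: the two forms coincide
```

The equally spaced slopes of Item 4 in Table 3.1 (0.00, 0.95, 1.90, 2.85, 3.80) are exactly this a_k = k·a pattern.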

There are several lines of reasoning that lead to Equation 3.3 as an item

response model. In this section we describe three kinds of theoretical

argument that lead to the nominal model as the result, because they exist,

and because different lines of reasoning appeal to persons with different

backgrounds.


As Statistical Mechanics

Certainly the simplest development of the nominal model is essentially

atheoretical, treating the problem as abstract statistical model creation. To

do this, we specify only the most basic facts: that we have categorical item

responses in several (>2) categories, that we believe those item responses

depend on some latent variable () that varies among respondents, and that

the mutual dependence of the item responses on that latent variable explains

their observed covariance. Then simple mathematical functions are used to

complete the model.


First, we assume that the dependence of some response process (value) for

each person, for each item response alternative, is a linear function of theta

\[
z_k = a_k \theta + c_k
\tag{3.7}
\]

with unknown slope and intercept parameters a_k and c_k. Such a set of straight
lines for a five-category item is shown in the left panel of Figure 3.2, using
the parameters for Item 3 from Table 3.1.

To change those straight lines (zk) into a model that yields probabilities

(between 0 and 1) for each response, as functions of θ, we use the so-called

multivariate logistic link function

\[
\frac{\exp(z_k)}{\sum_i \exp(z_i)}
\tag{3.8}
\]

This link function transforms a linear model into a probability model for categorical data. It can be

characterized as simple mathematical mechanics: Exponentiation of the val-

ues of zk makes them all positive, and then division of each of those positive

line values by the sum of all of them is guaranteed to transform the straight

lines in the left panel of Figure 3.2 into curves such as those shown in the

right panel of Figure3.2. The curves are all between 0 and 1, and sum to 1


Figure 3.2 Left panel: Linear regressions of the response process z_k on θ for five response alternatives. Right panel: Multivariate logistic transformed curves corresponding to the five lines in the left panel.


at all values of θ, as required. (The curves in the right panel of Figure 3.2 are
those from the lower left panel of Figure 3.1. de Ayala (1992) has presented
a similar graphic as his Figure 1.)

For purely statistically trained analysts, with no background in psycho-

logical theory development, this is a sufficient line of reasoning to use the

nominal model for data analysis. Researchers trained in psychology may

desire a more elaborated theoretical rationale, of which two are offered in the

two subsequent sections.

However, it is of interest to note at this point that the development in this

section, specifically Equation 3.7, invites the questions: Why linear? Why not
some other function of θ? Alternatives to the linear functions
have been suggested or used for special purposes as variants of the nominal

model: Upon hearing a description of the multiple-choice model (Thissen &

Steinberg, 1984) D. B. Rubin (personal communication, December 15, 1982)

suggested that an alternative to that model would be a nominal model with

quadratic functions replacing Equation 3.7. Ramsay (1995) uses a quadratic

term in Equation 3.7 for the correct response alternative for multiple-choice

items when the multivariate logistic is used to provide smooth information

curves for the nonparametric trace lines in the TestGraf system. Sympson

(1983) also suggested the use of quadratic, and even higher-order, polynomi-

als in a more complex model that never came into implementation or usage.

Nevertheless, setting aside multiple-choice items, for most uses of the

nominal model the linear functions in Equation 3.7 are sufficient.

Relationship to Other Models: The term Thurstone models in polytomous IRT typically refers

to models where response category thresholds characterize all responses above versus below a

given threshold. In contrast, Rasch type models only characterize responses in adjacent categories.

However, the Thurstone case V model, which is related to the development of the nominal categories
model, is a very different type of Thurstone model, one without thresholds, highlighting the nominal
categories model's unique place among polytomous IRT models.

The nominal categories model was based on an extension of Thurstone's (1927) Case V model for binary

choices, generalized to become a model for the first choice among three or

more alternatives. Thurstone's model for choice made use of the concept of
a response process that followed a normal distribution, one value ("process" in
Thurstone's language) for each object. The idea was that the object or alter-

native selected was that with the larger value. In practice, a comparatal

process is computed as the difference between the two response processes,

and the first object is selected if the value of the comparatal process is greater

than zero.

Bock and Jones (1968) describe many variants and extensions of
Thurstone's models for choice, including generalizations to the first choice from

among several objects.

50 David Thissen, Li Cai, and R. Darrell Bock

The obvious generalization of Thurstone's binary
choice model to create a model for the first choice from among three or more

objects would use a multivariate normal distribution of m − 1 comparatal pro-

cesses for object or alternative j, each representing a comparison of object j

with one of the others of m objects. Then the probability of selection of alter-

native j would be computed as a multiple integral over that (m − 1)-dimen-

sional normal density, computing a value known as an orthant probability.

However, multivariate normal orthant probabilities are notoriously difficult

to compute, even for simplified special cases. Bock and Jones suggest sub-

stitution of the multivariate logistic distribution, showing that the bivariate

logistic yields probabilities similar to those obtained from a bivariate normal


(these would be used for the first choice of three objects). The substitution of

the logistic here is analogous to the substitution of the logistic function for

the normal ogive in the two-parameter logistic IRT model (Birnbaum, 1968).

Of course, the multivariate logistic distribution function is Equation 3.1.

In the appendix to this chapter, Bock provides an updated and detailed

description of the theoretical development of the nominal categories model as

an approximation to the multivariate generalization of Thurstones model for

choice. In addition, the appendix describes the development of the model that is

obtained by considering first choices among three or more objects as an extreme

value problem, citing the extension of Dubey's (1969) derivation of the logistic

distribution to the multivariate case that has been used and studied by Bock

(1970), McFadden (1974), and Malik and Abraham (1973). This latter develop-

ment also ties the nominal categories model to the so-called Bradley-Terry-Luce

(BTL) model for choice (Bradley & Terry, 1952; Luce & Suppes, 1965).

Thus, from the point of view of mathematical models for choice, the nom-

inal categories model is both an approximation to Thurstone (normal) mod-

els for the choice of one of three or more alternatives, and the multivariate

version of the BTL model.

Another derivation of the nominal model involves its implications for the

conditional probability of a response in one category (say k) given that the

response is in one of two categories (k or k′). This derivation is analogous in

some respects to the development of Samejima's (1969, 1997) graded model,

which is built up from the idea that several conventional binary item response

models may be concatenated to construct a model for multiple responses. In

the case of the graded model, accumulation is used to transform the multiple

category model into a series of dichotomous models: The conventional nor-

mal ogive or logistic model is used to describe the probability that a response

is in category k or higher, and then those cumulative models are subtracted

to produce the model for the probability the response is in a particular cat-

egory. This development of the graded model rests, in turn, on the theoreti-

cal development of the normal ogive model as a model for the psychological

response process, as articulated by Lord and Novick (1968, pp. 370–373),

and then on Birnbaum's (1968) reiteration for test theory of Berkson's (1944,

1953) suggestion that the logistic function could usefully be substituted for

The Nominal Categories Item Response Model 51

the normal ogive. (See Thissen and Orlando (2001, pp. 84–89) for a sum-

mary of the argument by Lord and Novick and the story behind the logistic

substitution.)

The nominal model may be derived in a parallel fashion, assuming that

the conditional probability of a response in one category (say k), given that

the response is in one of two categories (k or k′), can be modeled with the

two-parameter logistic (2PL). The algebra for this derivation frontwards
(from the 2PL for the conditional responses to the nominal model for all of
the responses) is challenging as test theory goes, but it is suf-

ficient to do it backwards, and that is what is presented here. (We note in


passing that Masters (1982) did this derivation frontwards for the simpler

route from the Rasch or one-parameter logistic (1PL) to the partial credit

model.)

If one begins with the nominal model as stated in Equation 3.3, and

writes the conditional probability for a response in category k given that the

response is in one of categories k or k′,

$$T(k \mid k, k') = \frac{T(k)}{T(k) + T(k')} \qquad (3.9)$$

then a little algebra (cancellation of the common denominator of the two T terms,
and then more cancellation to change the three exponential terms into one) is
required to show that this conditional probability is, in fact, a two-parameter
logistic function:

$$T(k \mid k, k') = \frac{1}{1 + \exp[-(a_k^c \theta + c_k^c)]} \qquad (3.10)$$

with

$$c_k^c = c_k - c_{k'} \qquad (3.11)$$

and

$$a_k^c = a_k - a_{k'} \qquad (3.12)$$

Placing interpretation on the algebra, what this means is that the nomi-

nal model assumes that if we selected the subsample of respondents who

selected either alternative k or k′, setting aside respondents who made other

choices, and analyzed the resulting dichotomous item in that subset of the

data, we would use the 2PL model for the probability of response k in that

subset of the data. This choice, like the choice of the normal ogive or logistic

model for the cumulative probabilities in the graded model, then rests on
the theoretical development of the normal ogive as a psychologi-
cal response process model as articulated by Lord and Novick (1968), and

Birnbaum's (1968) argument for the substitution of the logistic. The dif-

ference between the two ways of dividing multiple responses into a series

of dichotomies (cumulative vs. conditional) has been discussed by Agresti

(2002).
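This backwards derivation is easy to check numerically. The sketch below, with hypothetical item parameters and function names of our own, confirms that the conditional probability of Equation 3.9, computed from the nominal model trace lines, matches the 2PL of Equation 3.10 built from the difference parameters of Equations 3.11 and 3.12:

```python
import math

def nominal_trace(theta, a, c):
    """Nominal model category probabilities T(k) for one item;
    a[0] = c[0] = 0 for identification."""
    z = [a_k * theta + c_k for a_k, c_k in zip(a, c)]
    mx = max(z)  # subtract the max for numerical stability
    ez = [math.exp(z_k - mx) for z_k in z]
    s = sum(ez)
    return [e / s for e in ez]

def conditional_2pl(theta, a, c, k, kp):
    """Equation 3.10, with the difference parameters of
    Equations 3.11 and 3.12."""
    a_c = a[k] - a[kp]
    c_c = c[k] - c[kp]
    return 1.0 / (1.0 + math.exp(-(a_c * theta + c_c)))

# Hypothetical four-category item
a = [0.0, 1.0, 2.0, 3.0]
c = [0.0, 0.5, 0.8, 0.2]
theta = 0.7

T = nominal_trace(theta, a, c)
direct = T[2] / (T[2] + T[0])          # Equation 3.9
via_2pl = conditional_2pl(theta, a, c, 2, 0)
print(direct, via_2pl)                 # the two values agree
```

The agreement holds for any pair of categories and any θ, because the common denominator of the two T terms cancels in the ratio.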

An interesting and important feature of the nominal model is obtained by

specializing the conditional probability for any pair of responses to adjacent

response categories (k or k − 1; adjacent is meaningful if the responses are

actually ordered); the same two-parameter logistic is obtained:

$$T(k \mid k, k-1) = \frac{1}{1 + \exp[-(a_k^c \theta + c_k^c)]} \qquad (3.13)$$

with

$$c_k^c = c_k - c_{k-1} \qquad (3.14)$$

and

$$a_k^c = a_k - a_{k-1} \qquad (3.15)$$

It is worth noting at this point that the threshold b_k^c for the slope-threshold
form of the conditional 2PL curve,

$$T(k \mid k, k-1) = \frac{1}{1 + \exp[-a_k^c(\theta - b_k^c)]} \qquad (3.16)$$

is

$$b_k^c = \frac{-c_k^c}{a_k^c} = \frac{c_{k-1} - c_k}{a_k - a_{k-1}} \qquad (3.17)$$

which is also the crossing point of the trace lines for categories k and k − 1

(deAyala, 1993; Bock, 1997). These values are featured in some parameter-

izations of the nominal model for ordered data.

This fact defines the concept of order for nominal response categories:

Response k is higher than response k − 1 if and only if a_k > a_{k−1}, which means
that a_k^c is positive, and so the conditional probability of selecting response k
(given that it is one of the two) increases as θ increases. Basically, this means
that item analysis with the nominal model tells the data analyst the order of the
item responses. We have already made use of this fact in the discussion of order and
the a_k parameters in Figure 3.1 and Table 3.1 in the introductory section.
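As a numerical check (again with hypothetical parameters and our own function name), the crossing points of Equation 3.17 can be verified directly: at θ = b_k^c the trace lines for categories k − 1 and k are equal.

```python
import math

def nominal_trace(theta, a, c):
    """Nominal model category probabilities (a[0] = c[0] = 0)."""
    z = [a_k * theta + c_k for a_k, c_k in zip(a, c)]
    mx = max(z)
    ez = [math.exp(z_k - mx) for z_k in z]
    s = sum(ez)
    return [e / s for e in ez]

a = [0.0, 1.0, 2.0, 3.0]   # hypothetical ordered item
c = [0.0, 0.5, 0.8, 0.2]

crossings = []
for k in range(1, len(a)):
    b_k = (c[k - 1] - c[k]) / (a[k] - a[k - 1])   # Equation 3.17
    crossings.append(b_k)
    T = nominal_trace(b_k, a, c)
    # adjacent categories are equally probable at the crossing point
    assert abs(T[k] - T[k - 1]) < 1e-12

print(crossings)   # approximately [-0.5, -0.3, 0.6]
```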


[Figure 3.3: Trace lines corresponding to item parameters obtained by Huber (1993) in his analysis of the item "Count down from 20 by 3s" on the Short Portable Mental Status Questionnaire (SPMSQ). The horizontal axis is cognitive dysfunction (θ), from −3 to 3; the vertical axis is T(item response), from 0 to 1.]

Two additional examples serve to illustrate the use of the nominal model

to determine the order of response categories, and the way the model may be

used to provide trace lines that can be used to compute IRT scale scores (see

Thissen, Nelson, Rosa, and McLeod, 2001) using items with purely nominal

response alternatives.

Figure 3.3 shows the trace lines corresponding to item parameters

obtained by Huber (1993) in his analysis of the item "Count down from 20
by 3s" on the Short Portable Mental Status Questionnaire (SPMSQ), a brief

diagnostic instrument used to detect dementia. For this item, administered

to a sample of aging individuals, three response categories were recorded:

correct, incorrect (scored positively for this cognitive dysfunction scale),

and refusal (NA). Common practice in scoring the SPMSQ in clinical and

research applications was to score NA as incorrect, based on a belief that

respondents who refused to attempt the task probably could not do it. Huber

fitted the three response categories with the nominal model and obtained

the parameters a = [0.0, 1.56, 1.92] and c = [0.0, 0.52, 0.85]; the cor-

responding curves are shown in Figure 3.3. As expected, the a_k parameter
for NA is much closer to the a_k parameter for the incorrect response, and
the curve for NA is nearly proportional to the curve for incorrect in Figure 3.3. This

analysis lends a degree of justification to the practice of scoring NA as incor-

rect. However, if the IRT model is used to compute scale scores, those scale

scores reflect the relative evidence of failure provided by the NA response

more precisely.
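Using Huber's parameter values as printed above, the trace lines can be evaluated at a few points on the cognitive dysfunction scale; this sketch (the function name is ours) shows the correct response dominating at low dysfunction and the incorrect/NA responses at high dysfunction:

```python
import math

def nominal_trace(theta, a, c):
    """Nominal model category probabilities (a[0] = c[0] = 0)."""
    z = [a_k * theta + c_k for a_k, c_k in zip(a, c)]
    mx = max(z)
    ez = [math.exp(z_k - mx) for z_k in z]
    s = sum(ez)
    return [e / s for e in ez]

# Huber's (1993) parameters as printed; categories are
# [correct, incorrect, NA], with theta oriented toward dysfunction
a = [0.0, 1.56, 1.92]
c = [0.0, 0.52, 0.85]

for theta in (-2.0, 0.0, 2.0):
    p = nominal_trace(theta, a, c)
    print(f"theta = {theta:+.1f}: correct = {p[0]:.3f}, "
          f"incorrect = {p[1]:.3f}, NA = {p[2]:.3f}")
```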

The SPMSQ also includes items that many item analysts would expect to

be locally dependent. One example involves a pair of questions that require

the respondent to state his or her age, and then his or her date of birth. Huber

(1993) combined those two items into a testlet with four response categories:

both correct (++), age correct and date of birth incorrect (+−), age incorrect
and date of birth correct (−+), and both incorrect (−−). Figure 3.4 shows the


[Figure 3.4: Nominal model trace lines for the four response categories for Huber's (1993) SPMSQ testlet scored as reporting both age and date of birth correctly (++), age correctly and date of birth incorrectly (+−), age incorrectly and date of birth correctly (−+), and both incorrectly (−−). The horizontal axis is cognitive dysfunction (θ), from −3 to 3; the vertical axis is T(item response).]

nominal model trace lines for the four response categories for that testlet.
While one may confidently expect that the −− response reflects the highest
degree of dysfunction and the ++ response the lowest degree of dysfunction,
there is a real question about the scoring value of the +− and −+ responses.
The nominal model analysis indicates that the trace lines for +− and −+ are
almost exactly the same, intermediate between good and poor performance.
Thus, after the analysis with the nominal model one may conclude that this
testlet yields four response categories that collapse into three ordered scoring
categories: ++, [+− or −+], and −−.

Thissen and Steinberg (1986) showed that a number of other item response

models may be obtained as versions of the nominal model by imposing con-

straints on the nominal model's parameters, and further that the canonical
parameters of those other models may be made the αs and γs estimated for

the nominal model with appropriate choices of T matrices. Among those

other models are Masters' (1982) partial credit (PC) model (see also Masters
and Wright, 1997) and Andrich's (1978) rating scale (RS) model (see also

Andersen (1997) for relations with proposals by Rasch (1961) and Andersen

(1977)). Thissen and Steinberg (1986) also mentioned in passing that a ver-

sion of the nominal model like the PC model, but with discrimination

parameters that vary over items, is also within the parameter space of the

nominal model. That latter model was independently developed and used in

the 1980s by Muraki (1992) and called the generalized partial credit (GPC)

model, and by Yen (1993) and called the two-parameter partial credit (2PPC)

model.


Rating Scale and (Generalized) Partial Credit Models

Notational Difference: Remember this model was presented slightly differently in Chapter 2:

$$P_{ik}(\theta_j) = \frac{\exp \sum_{v=0}^{k} a_i(\theta_j - b_{iv})}{\sum_{h=0}^{K-1} \exp \sum_{v=0}^{h} a_i(\theta_j - b_{iv})}$$


Muraki (1992, 1997) has used several parameterizations to describe the GPC

model, among them

$$T(k) = \frac{\exp \sum_{j=0}^{k} 1.7a(\theta - b + d_j)}{\sum_{i=0}^{m-1} \exp \sum_{j=0}^{i} 1.7a(\theta - b + d_j)} \qquad (3.18)$$

with

$$\sum_{i=1}^{m-1} d_i = 0 \qquad (3.19)$$

and alternatively

$$T(k) = \frac{\exp[1.7a[T_k(\theta - b) + K_k]]}{\sum_{i=0}^{m-1} \exp[1.7a[T_i(\theta - b) + K_i]]} \qquad (3.20)$$

in which

$$K_k = \sum_{i=1}^{k} d_i \qquad (3.21)$$

These may be compared with Masters' (1982) specification of the PC model:

Notational Difference: Here the authors use θ to refer to the latent variable of interest, where
Masters (see Equations 5.22 and 5.23 in Chapter 5) and Andrich (see Equations 6.24 and 6.25 in
Chapter 6) typically refer to the latent variable using β. This θ/β notational difference will be seen in
other chapters and is common in the IRT literature.

$$T(k) = \frac{\exp \sum_{j=0}^{k} (\theta - \delta_j)}{\sum_{i=0}^{m-1} \exp \sum_{j=0}^{i} (\theta - \delta_j)} \qquad (3.22)$$

in which

$$\sum_{j=0}^{0} (\theta - \delta_j) \equiv 0 \qquad (3.23)$$

and the RS model:

$$T(k) = \frac{\exp \sum_{j=0}^{k} [\theta - (\delta + \tau_j)]}{\sum_{i=0}^{m-1} \exp \sum_{j=0}^{i} [\theta - (\delta + \tau_j)]} \qquad (3.24)$$

in which

$$\sum_{j=0}^{0} [\theta - (\delta + \tau_j)] \equiv 0 \qquad (3.25)$$

and

$$\sum_{j=1}^{m-1} \tau_j = 0 \qquad (3.26)$$
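A computational sketch of the PC model as written in Equations 3.22 and 3.23 follows; the step values below are hypothetical and the function name is ours:

```python
import math

def pc_trace(theta, deltas):
    """Masters' PC model, Equation 3.22; deltas[j-1] holds delta_j for
    j = 1..m-1, and the empty sum for category 0 is zero (Equation 3.23)."""
    m = len(deltas) + 1
    z = [0.0]                         # category 0
    for k in range(1, m):
        z.append(z[-1] + (theta - deltas[k - 1]))
    mx = max(z)
    ez = [math.exp(z_k - mx) for z_k in z]
    s = sum(ez)
    return [e / s for e in ez]

# Hypothetical step parameters for a four-category item
probs = pc_trace(0.0, [-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])   # symmetric about the middle categories
```

With symmetric steps around θ = 0, the four category probabilities are symmetric in pairs, as expected.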

Thissen and Steinberg (1986) described the use of alternative T matrices in

the formulation of the nominal model. For example, when formulated for

marginal estimation following Thissen (1982), Masters' (1982) PC model
and Andrich's (1978) RS model use a single slope parameter that is the coef-

ficient for a linear basis function:

$$\mathbf{T}_{a(\mathrm{PC})} = \begin{bmatrix} 0 \\ 1 \\ 2 \\ \vdots \\ m-1 \end{bmatrix}_{m \times 1} \qquad (3.27)$$

The PC model also includes a set of location parameters that can be duplicated, up to proportionality, with this T matrix for the cs:

$$\mathbf{T}_{c(\mathrm{PC})} = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}_{m \times (m-1)} \qquad (3.28)$$


Terminology Note: The authors use the term threshold here, whereas in other chapters these

parameters are sometimes referred to as step or boundary parameters.

Andrich's (1978) RS model uses an overall location parameter for each item and a set of parameters describing the category boundaries for the item response

scale; the latter were constrained equal across items, and may be obtained,

again up to proportionality, with


$$\mathbf{T}_{c(\mathrm{RS\text{-}C})} = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 2 & 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ m-2 & 1 & 1 & \cdots & 1 \\ m-1 & 0 & 0 & \cdots & 0 \end{bmatrix}_{m \times (m-1)} \qquad (3.29)$$

Andrich (1978, 1985) and Thissen and Steinberg (1986) described the use

of a polynomial basis for the cs as an alternative to T_c(RS-C) that smooths

the category boundaries; the overall item location parameter is the coefficient

of the first (linear) column, and the coefficients associated with the other

columns describe the response category boundaries:

$$\mathbf{T}_{c(\mathrm{RS\text{-}P})} = \begin{bmatrix} 0 & 0^2 & 0^3 & \cdots \\ 1 & 1^2 & 1^3 & \cdots \\ 2 & 2^2 & 2^3 & \cdots \\ \vdots & \vdots & \vdots & \\ m-2 & (m-2)^2 & (m-2)^3 & \cdots \\ m-1 & (m-1)^2 & (m-1)^3 & \cdots \end{bmatrix}_{m \times (m-1)} \qquad (3.30)$$

Polynomial contrasts were used by Thissen et al. (1989) to obtain the trace

lines for summed score testlets for a passage-based reading comprehension

test; the trace lines for one of those testlets are shown as the lower left panel

of Figure3.1 and the right panel of Figure3.2. The polynomial contrast set

included only the linear term for the a k s and the linear and quadratic terms

for the c k s for that testlet; that was found to be a sufficient number of terms

to fit the data. This example illustrates the fact that, although the nomi-

nal model may appear to have many estimated parameters, in many situa-

tions a reduction of rank of the T matrix may result in much more efficient

estimation.


After three decades of experience with the nominal model and its applica-

tions, a revision to the parameterization of the model would serve several pur-

poses: Such a revision could be used first of all to facilitate the extension of the

nominal model to become a multidimensional IRT (MIRT) model, a first for

purely nominal responses. In addition, a revision could make the model easier

to explain. Further, by retaining features that have actually been used in data

analysis, and discarding suggestions (such as many alternative T matrices) that

have rarely or never been used in practice, the implementation of estimation
can be simplified.

Thus, while the previous sections of this chapter have described the nomi-

nal model as it has been, and as it has been used, this section presents a new

parameterization that we expect will be implemented in the next generation

of software for IRT parameter estimation. This is a look into the future.

Desiderata

The development of the new parameterization for the nominal model was

guided by several goals, combining a new insight with experience gained

over the last 30 years of applications of the model:

1. A more easily interpreted version of the nominal
model can be created by separating the a parameterization into a single
overall (multiplicative) slope or discrimination parameter, that is then

expanded into vector form to correspond to vector θ, and a set of m − 2

contrasts among the a parameters that represent what Muraki (1992)

calls the scoring functions for the responses. This change has the added

benefit that, for the first time, the newly reparameterized nominal

model has a single discrimination parameter comparable to those of

other IRT models. That eases explanation of results of item analysis

with the model.

2. In the process of accomplishing Goal 1, it is desirable to parameterize

the model in such a way that the scoring function may be (smoothly)

made linear (0, 1, 2, …, m − 1) so that the multiplicative overall slope

parameter becomes the slope parameter for the GPC model, which,

constrained equally across items, also yields the PC and RS models.

In addition, with this scoring function the overall slope parameter may

meaningfully be set equal to the (also equal) slope for a set of 2PL items

to mimic Rasch family mixed models.

3. We have also found it useful at times in the past 20 years to use models

between the highly constrained GPC model and the full-rank nominal

model, as suggested by Thissen and Steinberg (1986), most often by

using polynomial bases for the a and c parameters and reducing the

number of estimated coefficients below full rank to obtain smoothly

changing values of the a and c parameters across response categories.

It is desirable to retain that option.


4. With other sets of data, we have found it useful to set equal subsets

of the a or c parameters within an item, modeling distinct response

categories as equivalent for scoring (the a parameters are equal) or alto-

gether equivalent (both the a and c parameters are equal).

Goals 3 and 4 may be expressed as sets of T matrices; Goals 1 and 2 are maintained in both
parameterizations.


The reparameterized nominal model is

$$T(u = k \mid \theta; a^*, \mathbf{a}^s, \mathbf{c}) = T(k) = \frac{\exp(z_k)}{\sum_i \exp(z_i)} \qquad (3.31)$$

in which

$$z_k = a^* a^s_{k+1} \theta + c_{k+1} \qquad (3.32)$$

and a* is the overall slope parameter, a^s_{k+1} is the scoring function for response k, and c_{k+1} is the intercept parameter as in the original model. The following restrictions, for identification,

$$a^s_1 = 0, \quad a^s_m = m - 1, \quad \text{and} \quad c_1 = 0 \qquad (3.33)$$

are implemented by reparameterizing, and estimating the parameters α and γ:

$$\mathbf{a}^s = \mathbf{T}\boldsymbol{\alpha} \quad \text{and} \quad \mathbf{c} = \mathbf{T}\boldsymbol{\gamma} \qquad (3.34)$$

To accomplish Goals 1 to 3, we use a Fourier basis as the T matrix, aug-

mented with a linear column:

$$\mathbf{T}_F = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 1 & f_{22} & \cdots & f_{2(m-1)} \\ 2 & f_{32} & \cdots & f_{3(m-1)} \\ \vdots & \vdots & & \vdots \\ m-1 & 0 & \cdots & 0 \end{bmatrix}_{m \times (m-1)} \qquad (3.35)$$

in which

$$f_{ki} = \sin[\pi(i-1)(k-1)/(m-1)] \qquad (3.36)$$

and α_1 = 1. Figure 3.5 shows graphs of the linear and Fourier functions for

four categories (left panel) and six categories (right panel). The Fourier-based

terms functionally replace quadratic and higher-order polynomial terms that

have been used in the past, with a more numerically stable, symmetrical orthogonal basis.

[Figure 3.5: Graphs of the linear and Fourier basis functions for the new nominal model parameterization, for four categories (left panel) and six categories (right panel); the values of T at integral values on the Response axis are the elements of the T matrix of Equations 3.35 and 3.36.]
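The linear-Fourier basis of Equations 3.35 and 3.36 can be constructed directly. This sketch (function names are ours) builds T_F and confirms that with α = (1, 0, …, 0) the scoring function a^s = T_F α is the linear one, 0, 1, …, m − 1, of the GPC restriction:

```python
import math

def fourier_T(m):
    """Linear-Fourier basis of Equations 3.35-3.36: an m x (m-1) matrix
    whose first column is 0..m-1 and whose remaining columns hold
    f_ki = sin[pi*(i-1)(k-1)/(m-1)]."""
    T = []
    for k in range(1, m + 1):             # rows k = 1..m
        row = [float(k - 1)]              # the linear column
        for i in range(2, m):             # Fourier columns i = 2..m-1
            row.append(math.sin(math.pi * (i - 1) * (k - 1) / (m - 1)))
        T.append(row)
    return T

def scoring_function(T, alpha):
    """a^s = T alpha (Equation 3.34); alpha[0] is fixed at 1."""
    return [sum(t * a for t, a in zip(row, alpha)) for row in T]

T = fourier_T(4)
a_s = scoring_function(T, [1.0, 0.0, 0.0])  # alpha_2 = alpha_3 = 0: GPC case
print(a_s)                                  # prints [0.0, 1.0, 2.0, 3.0]
```

Note that the first and last rows of the Fourier columns are zero, so the identification restrictions a^s_1 = 0 and a^s_m = m − 1 hold for any values of α₂, …, α_{m−1}.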

The new parameterization, using the Fourier T matrix, provides several

useful variants of the nominal model: When a*, {α_2, …, α_{m−1}}, and γ are
estimated parameters, this is the full-rank nominal model. If {α_2, …, α_{m−1}} are
restricted to be equal to zero, this is a reparameterized version of the GPC

model. The Fourier basis provides a way to create models between the GPC

and nominal model, as were used by Thissen et al. (1989), Wainer, Thissen,

and Sireci (1991), and others.

When the linear-Fourier basis T_F is used for both

$$\mathbf{a}^s = \mathbf{T}_F \boldsymbol{\alpha} \quad \text{and} \quad \mathbf{c} = \mathbf{T}_F \boldsymbol{\gamma} \qquad (3.37)$$

with α_1 = 1 and α_2, …, α_{m−1} = 0, then the parameters of the GPC model

$$T(k) = \frac{\exp \sum_{j=0}^{k} 1.7a(\theta - b + d_j)}{\sum_{i=0}^{m-1} \exp \sum_{j=0}^{i} 1.7a(\theta - b + d_j)} \qquad (3.38)$$

may be computed as

$$a = \frac{a^*}{1.7} \qquad (3.39)$$

$$b = \frac{-c_m}{a^*(m-1)} = \frac{-\gamma_1}{a^*} \qquad (3.40)$$

and

$$d_k = \frac{c_k - c_{k-1}}{a^*} - \frac{c_m}{a^*(m-1)} \qquad (3.41)$$


(Childs and Chen (1999) provided formulae to convert the parameters

of the original nominal model into those of the GPC model, but they used

the T matrices in the computations, which is not essential in the simpler

methods given here.)

Also note that if it is desired to constrain the GPC parameters d_k to be
equal across a set of items, that is accomplished by setting the parameter
sets γ_2, …, γ_{m−1} equal across those items. This kind of equality constraint
really only makes sense if the overall slope parameter a* is also set equal
across those items, leaving as a free parameter only the overall
difficulty, which still varies over items i. (Another way to put this is that the

linear-Fourier basis separates the parameter space into a (first) component
for b_i = −γ_{i1}/a*_i and a remainder that parameterizes the spacing among the
thresholds or crossover points of the curves.)

The alternative parameterization of the GPC

$$T(k) = \frac{\exp[1.7a[T_k(\theta - b) + K_k]]}{\sum_{i=0}^{m-1} \exp[1.7a[T_i(\theta - b) + K_i]]} \qquad (3.42)$$

in which

$$K_k = \sum_{i=1}^{k} d_i \qquad (3.43)$$

cumulates the d_i. Note that the multiplication of the parameter b by the scoring function T_k provides another explanation of the fact that with the linear-Fourier basis b_i = −γ_{i1}/a*_i.

To provide translations of the parameters for Rasch family models, some

accommodation must be made between the conventions that the scale of the

latent variable is usually set for more general models by specifying that θ is

distributed with mean zero and variance one, versus many implementations

of Rasch family models with the specification that some item's difficulty is

zero, or the average difficulty is zero, and the slope is one, leaving the mean

and variance of the distribution unspecified, and estimated.

If we follow the approach taken by Thissen (1982) that a version of Rasch

family models may be obtained with the specification that θ is distributed
with mean zero and variance one, estimating a single common slope param-
eter (a* in this case) for all items, and all items' difficulty parameters, then the
δ parameters of Masters' PC model are

$$\delta_k = b - d_k \qquad (3.44)$$


up to a linear transformation of scale, and the δ and τ parameters of Andrich's RS model are

$$\delta = b \qquad (3.45)$$

and

$$\tau_k = -d_k \qquad (3.46)$$


To accomplish Goals 1, 2, and 4, involving equality constraints, we use T

matrices for the a^s of the form

$$\mathbf{T}_{Ia} = \begin{bmatrix} 0 & \mathbf{0}'_{m-2} \\ \mathbf{0}_{m-2} & \mathbf{I}_{m-2} \\ m-1 & \mathbf{0}'_{m-2} \end{bmatrix}_{m \times (m-1)} \qquad (3.47)$$

To impose equality constraints in addition on the cs, we use the following T matrix:

$$\mathbf{T}_{Ic} = \begin{bmatrix} \mathbf{0}'_{m-1} \\ \mathbf{I}_{m-1} \end{bmatrix}_{m \times (m-1)} \qquad (3.48)$$

These T matrices provide several variants of the nominal
model, among others: When a*, {α_2, …, α_{m−1}}, and γ are estimated param-
eters, this is again the full-rank nominal model. If α_k = k − 1 for k in {2, …, m − 1},
this is a reparameterized version of the generalized partial credit model.
The restriction a^s_1 = a^s_2 is imposed by setting α_2 = 0. The restriction
a^s_{m−1} = a^s_m is imposed by setting α_{m−1} = m − 1. Equality constraints among
other values of a^s may be imposed analogously on the corresponding αs.

Illustrations

Table 3.2 shows the values of the new nominal model parameters for the
items with trace lines in Figure 3.1 and the original parameters in Table 3.1.
Note that the scoring parameters in a^s for Items 1 and 4 are [0, 1, 2, …, m − 1],
indicating that the nominal model for those two items is one for strictly
ordered responses. In addition, we observe that the lower discrimination
of Item 3 (with trace lines shown in the lower left panel of Figure 3.1) is
now clearly indicated by the relatively lower value of a*; the discrimination


Table 3.2  Item Parameters for the New Parameterization of the Nominal Model, for the Same Items With the Original Model Parameters in Table 3.1

                 Item 1        Item 2        Item 3        Item 4
  Parameter     a^s    c      a^s    c      a^s    c      a^s    c
  a*            1.0           0.9           0.55          0.95
  a^s_1, c_1    0.0   0.0     0.0   0.0     0.0   0.0     0.00   0.0
  a^s_2, c_2    1.0   0.0     0.0   0.9     0.36  0.5     1.00   1.2
  a^s_3, c_3    2.0   0.0     1.2   0.7     1.27  1.8     2.00   0.2
  a^s_4, c_4    3.0   0.0     3.0   0.7     2.36  3.0     3.00   1.4


parameter for Item 3 is only 0.55, relative to values between 0.9 and 1.0 for

the other three items. The values of the c parameters are unchanged from

Table 3.1. If the item analyst wishes to convert the parameters for Item 3 in
Table 3.2 to those previously used for the GPC model, Equations 3.39 to

3.41 may be used.
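For example, the sketch below applies Equations 3.39 to 3.41 to Item 3's values as printed in Table 3.2 (a* = 0.55, c = [0.0, 0.5, 1.8, 3.0]); the conversion function is our own illustrative implementation:

```python
def to_gpc(a_star, c):
    """Convert new-parameterization values (a*, c) to Muraki's GPC
    parameters (a, b, d_k) via Equations 3.39-3.41."""
    m = len(c)                            # number of response categories
    a = a_star / 1.7                      # Equation 3.39
    b = -c[-1] / (a_star * (m - 1))       # Equation 3.40
    d = [(c[k] - c[k - 1]) / a_star - c[-1] / (a_star * (m - 1))
         for k in range(1, m)]            # Equation 3.41
    return a, b, d

a, b, d = to_gpc(0.55, [0.0, 0.5, 1.8, 3.0])
print(a, b, d)
# the resulting d_k satisfy the GPC constraint of Equation 3.19: sum(d) == 0
```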

The new parameterization of the nominal model is designed to facilitate mul-

tidimensional item factor analysis (or MIRT analysis) for items with nomi-

nal responses, something that has not heretofore been available (Cai, Bock,

& Thissen, in preparation). A MIRT model has a vector-valued θ (two or
more dimensions in the latent variable space) that is used to explain the

covariation among the item responses. Making use of the separation of the

new nominal model parameterization of overall item discrimination param-

eter (a *) from the scoring functions (in as), the multidimensional nominal

model has a vector of discrimination parameters a *, one value indicating the

slope in each direction of the p-space. This vector of discrimination param-

eters taken together indicates the direction of highest discrimination of the

item, which may be along any of the axes or between them.

The parameters in a^s remain unchanged: Those represent the scoring func-
tions of the response categories and are assumed to be the same in all direc-
tions in the θ-space. So the model remains nominal in the sense that the

scoring functions may be estimated from the data. The intercept parameter

c also remains unchanged, taking the place of the standard unitary intercept

parameter in a MIRT model.

Assembled in notation, the nominal MIRT model is

$$T(u = k \mid \boldsymbol{\theta}; \mathbf{a}^*, \mathbf{a}^s, \mathbf{c}) = T(k) = \frac{\exp(z_k)}{\sum_i \exp(z_i)} \qquad (3.49)$$

modified from Equation 3.31 with vector a* and vector θ, in which

$$z_k = a^s_k \, \mathbf{a}^{*\prime} \boldsymbol{\theta} + c_k \qquad (3.50)$$


This is a nominal response model in the sense that, for any direction in

the θ-space, a cross section of the trace surfaces may take the variety of shapes

provided by the unidimensional nominal model. Software to estimate the

parameters of this model is currently under development. When completed

this model will permit the empirical determination of response alternative

order in the context of multidimensional θ. If an ordered version of the model

is used, with scoring functions [0, 1, 2, …, m − 1], this model is equivalent to

the multidimensional partial credit model described by Yao and Schwarz

(2006).
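A computational sketch of Equations 3.49 and 3.50 follows, with hypothetical parameter values and our own function name: the scoring function multiplies a single projection a*′θ, so any direction through the θ-space yields a unidimensional nominal model cross section.

```python
import math

def mirt_nominal_trace(theta_vec, a_star_vec, a_s, c):
    """Multidimensional nominal model, Equations 3.49-3.50:
    z_k = a_s[k] * (a* . theta) + c[k]."""
    proj = sum(a_i * t_i for a_i, t_i in zip(a_star_vec, theta_vec))
    z = [a_sk * proj + c_k for a_sk, c_k in zip(a_s, c)]
    mx = max(z)
    ez = [math.exp(z_k - mx) for z_k in z]
    s = sum(ez)
    return [e / s for e in ez]

# Hypothetical two-dimensional item with linear scoring functions
probs = mirt_nominal_trace(theta_vec=[0.5, -0.3],
                           a_star_vec=[1.2, 0.4],
                           a_s=[0.0, 1.0, 2.0],
                           c=[0.0, 0.5, 0.2])
print([round(p, 3) for p in probs])
```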


Conclusion

Reasonable questions may be raised about why the new parameterization of

the nominal model has been designed as described in the preceding section;

we try to answer some of the more obvious of those questions here:

Why is the linear term of the T matrix scaled between zero and m − 1, as opposed

to some other norming convention? It is planned that the implementation of

estimation for this new version of the nominal model will be in general

purpose computer software that, among other features, can mix models,

for example, for binary and multiple-category models. We also assume that

the software can fix parameters to any specified value, or set equal any sub-

set of the parameters. Some users may want to use Rasch family (Masters

and Wright, 1984) models, mixing the original Rasch (1960) model for the

dichotomous items and the PC or RS models for the polytomous items. To

accomplish a close approximation of that in a marginal maximum likelihood

estimation system, with a N(0, 1) population distribution setting the scale for

the latent variable, a common slope (equal across items) must be specified

for all items (Thissen, 1982). For the dichotomous items that slope param-
eter is for the items scored 0, 1; for the polytomous items it is for item scores
0, 1, …, (m − 1). Thus, scaling the linear component of the scoring function

with unit steps facilitates the imposition of the equality constraints needed

for mixed Rasch family analysis. It also permits meaningful equality con-

straints between discrimination parameters for different item response mod-

els that are not in the Rasch family.

In the MIRT version of the model, the a* parameters may be rescaled

after estimation is complete, to obtain values that have the properties of fac-

tor loadings, much as has been done for some time for the dichotomous

model in the software TESTFACT (du Toit, 2003).

Why does the user need to prespecify both the lowest and highest response cat-

egory (to set up the T matrix) for a nominal model? This is not as onerous as it

may first appear: When fitting the full-rank nominal model, one does not

have to correctly specify highest and lowest response categories. If the data

indicate another order, estimated values of a^s_k may be less than zero or exceed
m − 1, indicating the empirical scoring order. It is only necessary that the


item analyst prespecify two categories that are differently related to θ, such
that one is relatively lower and the other relatively higher; but even which
one is which may be incorrect, and that will appear as a negative value of a*_i.

Presumably, when fitting a restricted (ordered) version of the model, the user

would have already fitted the unrestricted nominal model to determine or

check the empirical order of the response categories, or the user would have

confidence from some other source of information about the order.

Why not parameterize the model in slope-threshold form, instead of

slope-intercept form? Aren't threshold parameters easier to interpret in IRT?
While we fully understand the attraction, in terms of interpretability, of
threshold parameters, there are two reasons
to parameterize with intercepts for estimation. The first (oldest historically)

reason is that the slope-intercept parameterization is a much more numeri-

cally stable arrangement for estimating the parameters of logistic models,

due to a closer approximation of the likelihood to normality and less error

correlation among the parameters. A second reason is that the threshold

parameterization does not generalize to the multidimensional case in any

event; there is no way in a MIRT model to split the threshold among

dimensions, rendering a threshold parameterization more or less meaning-

less. We note here that, for models for which it makes sense, we can always

convert the intercept parameters into the corresponding item location and

threshold values for reporting, and in preceding sections we have given for-

mulas for doing so for the GPC model.

Why not use polynomial contrasts to obtain intermediate models, as proposed by

Thissen and Steinberg (1986) and implemented in MULTILOG (du Toit, 2003),

instead of the Fourier basis? An equally compelling question is to ask: Why

polynomials? The purpose of either basis is to provide smooth trends in the as

or cs across a set of response categories. Theory is not sufficient at this time to

specify a particular mathematical formulation for smoothness across catego-

ries in the nominal model. The Fourier basis accomplishes that goal as well

as polynomials, and is naturally orthogonal, which (slightly) simplifies the

implementation of the estimation algorithm.

In this chapter we have reviewed the development of Bock's (1972) nomi-

nal model, described its relation with other commonly used item response

models, illustrated some of its unique uses, and provided a revised param-

eterization for the model that we expect will render it more useful for future

applications in item analysis and test scoring. As IRT has come to be used in

more varying contexts, expanding its domain of application from its origins

in educational measurement into social and personality psychology, and the

measurement of health outcomes and quality of life, the need to provide

item analysis for items with polytomous responses with unknown scoring

order has increased. The reparameterized nominal model provides a useful

response to that challenge. Combined with the development of multidimen-

sional nominal item analysis (Cai et al., in preparation), the nominal model

represents a powerful component among the methods of IRT.

66 David Thissen, Li Cai, and R. Darrell Bock

R. Darrell Bock

The first step in the direction of the nominal model was an extension of

Thurstone's (1927) method of paired comparisons to first choices among three or more objects. The objects can be anything for which subjects could be expected to have preferences: opinions on public issues, competing con-

sumer products, candidates in an election, and so on. The observations for

a set of m objects consist of the number of subjects who prefer object j to


object k and the number who prefer k to j. Any given subject does not neces-

sarily have to respond to all pairs. Thurstone proposed a statistical model for

choice in which differences in the locations of the objects on a hypothetical

scale of preference value predict the observed proportions of choice in all

m(m − 1)/2 distinct pairs. He assumed that a subject's response to the task of choosing between the objects depended upon a subjective variable for, say, object j,

v_j = μ_j + ε_j,  (3.51)

where μ_j is the location of object j on the preference scale and the deviation ε_j is distributed normally with mean 0 and variance σ_j². He called this variable a response process and assumed that the subject chooses the object with the larger process. Although the distribution of ε_j might have different standard deviations for

each object and nonzero correlations between objects, this greatly compli-

cates the estimation of differences between the means. Thurstone therefore

turned his attention to the case V model in which the standard deviations

were assumed equal and all correlations assumed zero in all comparisons.

With this simplification, the so-called comparatal process

v_{jk} = v_j − v_k  (3.52)

is normally distributed with mean μ_j − μ_k and variance 2σ², and any two comparatal processes v_{jk}, v_{jl} for object j have constant correlation 1/2. Thurstone's solution to

the estimation problem was to convert the response proportions to normal

deviates and estimate the location differences by unweighted least squares,

which requires only m² additions and m divisions. With modern computing

machinery, solutions with better properties (e.g., weighted least squares or

maximum likelihood) are now accessible (see Bock & Jones, 1968, Section

6.4.1). From the estimated locations, the expected proportions for each com-

parison are given by the cumulative normal distribution function, Φ(y), at y = μ̂_j − μ̂_k. These proportions can be used in chi-square tests of the goodness of fit of the paired comparisons model (Bock, 1956; Bock & Jones, 1968, Section 6.7.1).


The natural extension of the paired comparison case V solution to what

might be called the method of first choices, that is, the choice of one pre-

ferred object in a set of m objects, is simply to assume the m − 1 comparatal

processes for object j,

v_{jk} = v_j − v_k,  k = 1, 2, …, m;  k ≠ j  (3.53)

have an (m − 1)-variate normal distribution with means μ_j − μ_k, common variance 2σ², and constant correlation ρ_{jk} equal to 1/2 (Bock, 1956; 1975, Section 8.1.3).

Expected probabilities of first choice for a given object then correspond to the

(m − 1)-fold multiple integral of the (m − 1)-variate normal density function

in the orthant from minus infinity up to the limits equal to the comparatal

means.

For general multivariate normal distributions of high dimensionality,

evaluation of orthant probabilities is computationally challenging even with

modern equipment. Computing formulae and tables exist for the bivariate

case (National Bureau of Standards, 1956) and the trivariate case (Steck,

1958), but beyond that, Monte Carlo approximation of the positive orthant

probabilities appears to be the only recourse at the present time. Fortunately,

much simpler procedures based upon a multivariate logistic distribution are

now available for estimating probabilities of first choice. By way of intro-

duction, the following section gives essential results for the univariate and

bivariate logistic distributions.

Applied to the case V paired comparisons model, the univariate logistic distribution function can be expressed either in terms of the comparatal process z = u_{jk}:

Ψ(z) = 1/(1 + e^{−z}),  (3.54)

or in terms of the two component processes z_1 and z_2:

Ψ(z_1) = e^{z_1}/(e^{z_1} + e^{z_2}),  (3.55)

Ψ(z_2) = 1 − Ψ(z_1) = e^{z_2}/(e^{z_1} + e^{z_2}).  (3.56)


The distribution Ψ(z) has mean 0 and variance π²/3. The deviate z is called a logit, and the pair z_1, z_2 could be called a binomial logit.

The corresponding density function can be expressed in terms of the distribution function:

ψ(z) = e^{−z}/(1 + e^{−z})² = Ψ(z)[1 − Ψ(z)].  (3.57)

The logistic distribution function Ψ(1.7z) closely approximates the normal distribution function Φ(z). Using the scale factor 1.7 in place of the variance-matching factor 1.81379 will bring the logistic probabilities closer to the normal over the full range of the distribution, with a maximum absolute difference less than 0.01 (Johnson, Kotz, & Balakrishnan, 1995, p. 119).
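That bound is easy to verify numerically; the sketch below (plain Python, using the error function for Φ) checks the maximum gap between Φ(z) and Ψ(1.7z) over a grid:

```python
import math

def norm_cdf(z):
    # Standard normal distribution function via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

# Maximum absolute difference between Phi(z) and Psi(1.7 z) on [-6, 6].
grid = [i / 100.0 for i in range(-600, 601)]
max_diff = max(abs(norm_cdf(z) - logistic_cdf(1.7 * z)) for z in grid)
```

The computed maximum is a bit below 0.01, as the text states.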

An advantage of the logistic distribution over the normal is that the deviate

corresponding to an observed proportion, P, is simply the log odds,

z(P) = log[P/(1 − P)].  (3.58)

For that reason, logit linear functions are frequently used in the analysis of binomially distributed data (see Anscombe, 1956).
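A minimal illustration of Equation 3.58 and its inverse, the logistic distribution function:

```python
import math

def logit(p):
    # Equation 3.58: the log odds corresponding to a proportion p.
    return math.log(p / (1.0 - p))

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

z = logit(0.75)        # log(0.75 / 0.25) = log 3
p_back = inv_logit(z)  # recovers 0.75
```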

Inasmuch as the prediction of first choices may be viewed as an extreme value problem, it is of interest that Dubey (1969) derived the logistic distribution from an extreme value distribution of the double exponential type with mixing variable α. The cumulative extreme value distribution function, conditional on α, is

F(x | α) = exp(−α e^{−x}),  (3.59)

the corresponding extreme value density function is

f(x | α) = α e^{−x} exp(−α e^{−x}),  (3.60)

and integrating over the exponential density of the mixing variable, g(α) = e^{−α}, gives the distribution function of x:

F(x) = ∫₀^∞ F(x | α) g(α) dα = [1 + e^{−x}]^{−1}.  (3.61)


The natural extension of the logistic distribution to the bivariate case is

Ψ(x_1, x_2) = [1 + e^{−x_1} + e^{−x_2}]^{−1},  (3.62)

with marginal distributions Ψ(x_1) and Ψ(x_2). The density function is

ψ(x_1, x_2) = 2Ψ³(x_1, x_2) e^{−x_1 − x_2},  (3.63)

and the regression equations (Equations 3.64 and 3.65) have corresponding conditional variances

V(x_1 | x_2) = V(x_2 | x_1) = π²/3 − 1.  (3.66)

This distribution is the simplest of three bivariate logistic distributions

studied in detail by Gumbel (1961). It is similar to the bivariate normal dis-

tribution in having univariate logistic distributions as margins, but unlike

the normal, the bivariate logistic density is asymmetric and the regression

lines are curved (see Figure 3.6). Nevertheless, the distribution function

gives probability values reasonably close to bivariate normal values when

the 1.7 scale correction is used (see Bock and Jones (1968, Section 9.1.1) for

some comparisons of bivariate normal and bivariate logistic probabilities).
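Equation 3.63 can be checked against Equation 3.62 directly, since the density is the mixed partial derivative of the distribution function; a small finite-difference sketch:

```python
import math

def Psi(x1, x2):
    # Bivariate logistic distribution function, Equation 3.62.
    return 1.0 / (1.0 + math.exp(-x1) + math.exp(-x2))

def psi(x1, x2):
    # Closed-form density, Equation 3.63: 2 * Psi^3 * exp(-x1 - x2).
    return 2.0 * Psi(x1, x2) ** 3 * math.exp(-x1 - x2)

def mixed_partial(f, x1, x2, h=1e-4):
    # Central-difference approximation to d^2 F / (dx1 dx2).
    return (f(x1 + h, x2 + h) - f(x1 + h, x2 - h)
            - f(x1 - h, x2 + h) + f(x1 - h, x2 - h)) / (4.0 * h * h)

err = max(abs(psi(u, v) - mixed_partial(Psi, u, v))
          for u in (-1.0, 0.0, 1.5) for v in (-0.5, 0.0, 2.0))
```

The closed form and the numerical derivative agree to well within finite-difference error at every grid point checked.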

Figure 3.6 Contours of the bivariate logistic density. The horizontal and vertical axes are x_1 and x_2, respectively, in Equation 3.64.

The natural extension of the bivariate logistic distribution to higher dimensions is

Ψ(z_k) = e^{z_k} / (e^{z_1} + e^{z_2} + ⋯ + e^{z_m}),  k = 1, 2, …, m,  (3.67)

where the elements of the vector z = (z_1, z_2, …, z_m) are identified only up to an additive constant, which may be fixed by constraining them to sum to zero. This vector is referred to as a multinomial logit.

Although this extension of the logistic distribution to dimensions greater

than two has been applied at least since 1967 (Bock, 1970; McFadden, 1974),

its first detailed study was by Malik and Abraham (1973). They derived the

m-variate logistic distribution from the m-fold product of independent uni-

variate marginal conditional distributions of the Dubey (1969) extreme value

distribution with mixing variable α. Integrating over α gives

F(X) = ∫₀^∞ ∏_{j=1}^{m} F(x_j | α) g(α) dα = [1 + Σ_{k=1}^{m} e^{−x_k}]^{−1},  (3.68)

and the corresponding m-variate logistic density function is

f(X) = m! exp(−Σ_{k=1}^{m} x_k) [1 + Σ_{k=1}^{m} e^{−x_k}]^{−(m+1)}.  (3.69)

method, although he does not cite Dubey (1969). Gumbel's bivariate distribution (above) is included for n = 2, margins of all orders up to n − 1 are multivariate logistic, and all univariate margins have mean zero and variance π²/3.

distributions has as yet been attempted.
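The extreme-value route to first choices can be illustrated by simulation (a sketch; the object locations mu below are arbitrary values of ours): adding independent double-exponential (Gumbel) deviates to the locations and choosing the object with the largest resulting process reproduces the multivariate logistic first-choice probabilities of Equation 3.67.

```python
import math
import random

random.seed(1)

mu = [0.5, 0.0, -0.5]                        # hypothetical object locations
den = sum(math.exp(m) for m in mu)
p_logit = [math.exp(m) / den for m in mu]    # multivariate logistic probabilities

# Simulate first choices: location plus an independent Gumbel deviate,
# generated by the inverse-CDF formula -log(-log(U)).
n = 200_000
counts = [0] * len(mu)
for _ in range(n):
    v = [m - math.log(-math.log(random.random())) for m in mu]
    counts[v.index(max(v))] += 1
p_sim = [c / n for c in counts]
```

With a sample this large, the simulated choice proportions match the closed-form probabilities to about two decimal places.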

If we substitute functions of external variables for normal or logistic devi-

ates, we can study the relationships of these variables to the probabilities

of first choice among the objects presented. In the two-category case, we

refer to these as binomial response relations, and with more than two cat-

egories, as multinomial response relations. The analytical problem becomes

one of estimating the coefficients of these functions rather than the logit

itself. If the relationship is less than perfect, some goodness of fit will be lost

relative to direct estimation of the logit (which is equivalent to estimating

the category expected probabilities). The difference in the Pearson or likeli-

hood ratio chi-square provides a test of statistical significance of the loss.

Examples of weighted least squares estimation of binomial response relations

in paired comparison data when the external variables represent a factorial or

response surface design on the objects are shown in Section 7.3 of Bock and

Jones (1968). Examples of maximum likelihood estimation of multinomial

response relations appear in Bock (1970), McFadden (1974), and Chapter 8

of Bock (1975).


An early use of binomial response relations appears in Bradley and Terry (1952). They assume the model

π_j / (π_j + π_k)  (3.70)

for the probability that object j is preferred to object k, but they estimated π_j and π_k directly rather than exponentiating, in order to avoid introducing an additional scale constant.

Luce and Suppes (1965) generalized the Bradley-Terry model to multi-

nomial data,

π_j / (π_1 + π_2 + ⋯ + π_m)  (3.71)

but did not make the exponential transformation to the multinomial logit

and did not apply the model in estimating multinomial response relations.

In item response theory we deal with data arising from two-stage sampling:

in the first stage we sample respondents from some identified population, and

in the second stage we sample responses of each respondent to some num-

ber of items, usually items from some form of psychological or educational

test. Thus, there are two sources of random variation in the databetween

respondents and between item responses. When the response is scored

dichotomously, right/wrong or yes/no, for example, the logistic distribution

for binomial data applies. If the scoring is polytomous, as when the respon-

dent is choosing among several alternatives, for instance, in a multiple-choice

test with recording of each choice, the logistic distribution for multinomial

data applies. If the respondent's level of performance is graded polytomously

in ordered categories, the multivariate logistic can still apply, but its parame-

terization must be specialized to reflect the assumed order of the categories.

In IRT the external variable is not an observable quantity, but rather an unobservable latent variable, usually designated by θ, that measures the respondent's ability or other propensity. The binomial or multinomial logit is expressed as linear functions of θ containing parameters specific to each item. We refer to the functions that depend on θ as item response models.

Item response models now in use (see Bock & Moustaki, 2007) include, for item j, the two-parameter logistic model, based on the binomial logistic distribution,

P_j(θ) = 1 / [1 + exp(−a_j θ − c_j)],  (3.72)

and the nominal categories model, based on the multinomial logistic distribution,

P_{jk}(θ) = exp(a_{jk} θ + c_{jk}) / Σ_{l=1}^{n} exp(a_{jl} θ + c_{jl}).  (3.73)

In empirical applications, the parameters of the item response models

must be estimated in large samples of the two-stage data. Estimation of these

parameters is complicated, however, by the presence of the propensity variable θ, which is random in the first-stage sample. Because there are poten-

tially different values of this variable for every respondent, there is no way

to achieve convergence in probability as the number of respondents increases.

We therefore proceed in the estimation by integrating over an assumed or

empirically derived distribution of the latent variable. If the first-stage sam-

ple is large enough to justify treating the parameter estimates so obtained as

fixed values, we can then use Bayes or maximum likelihood estimation to

locate each respondent on the propensity dimension, with a level of precision

dependent on the number of items.

The special merit of the nominal categories item response model is that no

assumption about the order or other structure of the categories is required.

Given that the propensity variable is one-dimensional, an ordering of the categories that is implicit in the data is revealed by the order of the coefficients a_{jk} in the nominal model (see Bock & Moustaki, 2007).
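A small sketch of Equation 3.73 illustrates this: with the slopes a in increasing order, the lowest category dominates at low θ and the highest at high θ (the parameter values below are hypothetical, chosen only for illustration):

```python
import math

def nominal_probs(theta, a, c):
    """Category probabilities of the nominal model (the form of Equation 3.73).

    a and c are lists of per-category slope and intercept parameters."""
    z = [ak * theta + ck for ak, ck in zip(a, c)]
    zmax = max(z)                      # subtract the max to keep exp() stable
    e = [math.exp(zk - zmax) for zk in z]
    total = sum(e)
    return [ek / total for ek in e]

# Hypothetical three-category item with slopes in increasing order.
a = [0.0, 1.0, 2.0]
c = [0.0, 0.5, -0.5]
p_low = nominal_probs(-4.0, a, c)
p_high = nominal_probs(4.0, a, c)
```

The probabilities sum to 1 at every θ, and the modal category swaps from the lowest-slope to the highest-slope category as θ increases.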

References

Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.
Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 technical report (NCES 1999-452). Washington, DC: National Center for Education Statistics, Office of Educational Research and Improvement, U.S. Department of Education.
Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69–81.
Andersen, E. B. (1997). The rating scale model. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 67–84). New York: Springer.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In N. Brandon-Tuma (Ed.), Sociological methodology (pp. 33–80). San Francisco: Jossey-Bass.
Anscombe, F. J. (1956). On estimating binomial response relations. Biometrika, 35, 246–254.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed., revised and expanded). New York: Marcel Dekker.
Bergan, J. R., & Stone, C. A. (1985). Latent class models for knowledge domains. Psychological Bulletin, 98, 166–184.

Berkson, J. (1944). Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39, 357–375.

Berkson, J. (1953). A statistically precise and relatively simple method of estimating the bio-assay with quantal response, based on the logistic function. Journal of the American Statistical Association, 48, 565–599.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Reading, MA: Addison-Wesley.
Bock, R. D. (1956). A generalization of the law of comparative judgment applied to a problem in the prediction of choice [Abstract]. American Psychologist, 11, 442.
Bock, R. D. (1970). Estimating multinomial response relations. In R. C. Bose (Ed.), Contribution to statistics and probability (pp. 111–132). Chapel Hill, NC: University of North Carolina Press.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.
Bock, R. D. (1975). Multivariate statistical methods in behavioral research. New York: McGraw-Hill.
Bock, R. D. (1997). The nominal categories model. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 33–50). New York: Springer.
Bock, R. D., & Jones, L. V. (1968). The measurement and prediction of judgment and choice. San Francisco: Holden-Day.
Bock, R. D., & Moustaki, I. (2007). Item response theory in a general framework. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (Vol. 26, pp. 469–513). Amsterdam: Elsevier.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs. I. Method of paired comparisons. Biometrika, 39, 324–345.
Childs, R. A., & Chen, W.-H. (1999). Software note: Obtaining comparable item parameter estimates in MULTILOG and PARSCALE for two polytomous IRT models. Applied Psychological Measurement, 23, 371–379.
de Ayala, R. J. (1992). The nominal response model in computerized adaptive testing. Applied Psychological Measurement, 16, 327–343.
de Ayala, R. J. (1993). An introduction to polytomous item response theory models. Measurement and Evaluation in Counseling and Development, 25, 172–189.
Dubey, S. D. (1969). A new derivation of the logistic distribution. Naval Research Logistics Quarterly, 16, 37–40.
du Toit, M. (Ed.). (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT. Lincolnwood, IL: Scientific Software International.
Gumbel, E. J. (1961). Bivariate logistic distributions. Journal of the American Statistical Association, 56, 335–349.
Hoskens, M., & De Boeck, P. (1997). A parametric model for local dependence among test items. Psychological Methods, 2, 261–277.
Huber, M. (1993). An item response theoretical approach to scoring the Short Portable Mental Status Questionnaire for assessing cognitive status of the elderly. Unpublished master's thesis, Department of Psychology, University of North Carolina, Chapel Hill.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1995). Continuous univariate distributions (2nd ed., Vol. 2). New York: Wiley.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.


Luce, R. D., & Suppes, P. (1965). Preference, utility, and subjective probability. In R. D. Luce & R. R. Bush (Eds.), Handbook of mathematical psychology (Vol. 3, pp. 249–410). New York: Wiley.
Malik, H., & Abraham, B. (1973). Multivariate logistic distributions. Annals of Statistics, 1, 588–590.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Masters, G. N., & Wright, B. D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529–544.
Masters, G. N., & Wright, B. D. (1997). The partial credit model. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory. New York: Springer.

McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), Frontiers in econometrics (pp. 105–142). New York: Academic Press.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
Muraki, E. (1997). A generalized partial credit model. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 153–164). New York: Springer.
National Bureau of Standards. (1956). Tables of the bivariate normal distribution function and related functions. Applied Mathematics Series, No. 50.
Ramsay, J. O. (1995). Testgraf: A program for the graphical analysis of multiple-choice test and questionnaire data (Technical report). Montreal: McGill University, Department of Psychology.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 4, pp. 321–333). Berkeley: University of California Press.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph, No. 18.
Samejima, F. (1979). A new family of models for the multiple choice item (Research Report 79-4). Knoxville: University of Tennessee, Department of Psychology.
Samejima, F. (1988). Comprehensive latent trait theory. Behaviormetrika, 15, 1–24.
Samejima, F. (1996). Evaluation of mathematical models for ordered polychotomous responses. Behaviormetrika, 23, 17–35.
Samejima, F. (1997). Graded response model. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 85–100). New York: Springer.
Steck, G. P. (1958). A table for computing trivariate normal probabilities. Annals of Mathematical Statistics, 29, 780–800.
Sympson, J. B. (1983, June). A new IRT model for calibrating multiple choice items. Paper presented at the annual meeting of the Psychometric Society, Los Angeles.
Thissen, D. (1982). Marginal maximum likelihood estimation for the one-parameter logistic model. Psychometrika, 47, 175–186.
Thissen, D., Nelson, L., Rosa, K., & McLeod, L. D. (2001). Item response theory for items scored in more than two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (Chap. 4, pp. 141–186). Mahwah, NJ: Lawrence Erlbaum Associates.
Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (Chap. 3, pp. 73–140). Mahwah, NJ: Lawrence Erlbaum Associates.


Thissen, D., & Steinberg, L. (1984). A response model for multiple-choice items. Psychometrika, 49, 501–519.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567–577.
Thissen, D., & Steinberg, L. (1988). Data analysis using item response theory. Psychological Bulletin, 104, 385–395.
Thissen, D., Steinberg, L., & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple-categorical-response models. Journal of Educational Measurement, 26, 247–260.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 278–286.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185–201.
Wainer, H., Thissen, D., & Sireci, S. G. (1991). DIFferential testlet functioning: Definitions and detection. Journal of Educational Measurement, 28, 197–219.
Yao, L., & Schwarz, R. D. (2006). A multidimensional partial credit model with associated item and test statistics: An application to mixed-format tests. Applied Psychological Measurement, 30, 469–492.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–214.


Chapter 4

The General Graded Response Model


Fumiko Samejima

University of Tennessee

Editor Introduction: This chapter outlines a framework that encompasses most of the specific

polytomous IRT models mentioned in this book. The place of the models within the framework is

described with particular attention given to models that Samejima has developed. Prior to elaborat-

ing this framework, a set of criteria for evaluating different models is proposed.

When item response theory (IRT) originated and was developed in psy-

chology and sociology in the 1940s, 1950s, and the first half of the 1960s,

the theory only dealt with dichotomous responses, where there are only

two item score categories, for example, correct and incorrect in ability mea-

surement, true and false in personality measurement. As a graduate student

I was very much impressed by Fred Lord's (1952) Psychometric Monograph A Theory of Mental Test Scores and could foresee great potential for latent

trait models. It seemed that the first thing to be done was to expand IRT

to enable it to deal with ordered, multicategory responses and enhance its

applicability, not only in psychology, sociology, and education, but also in

many other social and natural science areas. That opportunity came when

I was invited to spend one year as visiting research psychologist in the

Psychometric Research Group of the Educational Testing Service (ETS),

Princeton, New Jersey, in 1966. The essential outcomes of the research

conducted during my first year in the United States were published in

Samejima (1969).

A subsequent invitation to work in the psychometric laboratory at the

University of North Carolina at Chapel Hill allowed continuation of the

initial work. The essential outcomes of the research conducted in 19671968

were published in Samejima (1972). This second monograph is as important

as the first, and the two monographs combined propose the fundamental

tenets of the general graded response model framework.



In recent years, more and more researchers have started citing these two

Psychometrika monographs in their research. In this chapter I will try to cor-

rect common misunderstandings among researchers, as well as introduce and

explain further developments in the general graded response model.

Rationale

In the present chapter, uni-dimensional latent trait models are almost exclusively discussed, where the latent trait assumes any real number. The general graded response model provides the general structure of latent trait models that deal with cases in which item g is the smallest observable unit for measuring the latent trait and, subsequently, one of the graded item scores, or ordered polytomous item scores, x_g = 0, 1, 2, …, m_g (m_g ≥ 1), is assigned to each response. The highest score, m_g,

can be any positive integer, and the general graded response model does not

require all items in a test or questionnaire to have the same values of m_g. This is a great advantage, and it makes it possible to mix dichotomous response items with those whose m_g's are greater than unity. Models that belong to the gen-

eral framework discussed in this chapter include the normal ogive model, the

logistic model, the graded response model expanded from the logistic positive

exponent family of models (Samejima, 2008), the acceleration model, and the

models expanded from Bock's nominal response model. Thus, the framework

described here applies for any of these models.

The graded response model (GRM) was proposed by Samejima (1969, 1972) to provide a general theoretical framework for dealing with graded item scores, 0, 1, 2, …, m_g, in item response theory (IRT), whereas in the original IRT the item scores were limited to 0, 1. As is explained later in this chapter, the logistic model is a specific model that belongs to the GRM. Because the logistic model was applied to empirical data in early years, however, as exemplified by Roche, Wainer, and Thissen (1975), researchers started treating the logistic model as if it were the GRM. Reading this chapter, the reader will realize that the GRM is a very comprehensive concept that includes the normal ogive model, the logistic model, the model expanded from the logistic positive exponent family of models, the BCK-SMJ model, the acceleration model, etc. Correct terminology is important; otherwise, correct research will become impossible.

The latent trait can be any construct that is hypothesized to be behind

observable items, such as the way in which ability is behind performance

on problem-solving questions, general attitude toward war is represented by

responses to peace/war-oriented statements, maturity of human bodies is

represented by experts' evaluations of x-ray films, and so on.

Throughout this paper, the latent trait is denoted by θ, which assumes any real number in (−∞, ∞), except for the case where the multidimensional latent space is considered.


The general graded response model is based on the following five functions for each graded item score x_g:

1. Processing function (PRF), M_{x_g}(θ) (x_g = 0, 1, 2, …, m_g + 1), which is a joint conditional probability, given θ, that the individual completes the process leading to x_g, given that the individual has passed the preceding process. Specifically, M_{x_g}(θ) = 1 for x_g = 0 and M_{x_g}(θ) = 0 for x_g = m_g + 1 for all θ, respectively, since there is no process preceding x_g = 0, and m_g + 1 is a nonexistent, imaginary graded score no one can attain.

2. Cumulative operating characteristic, P*_{x_g}(θ) (x_g = 0, 1, 2, …, m_g + 1), defined by

P*_{x_g}(θ) = ∏_{u=0}^{x_g} M_u(θ).  (4.1)

This is the conditional probability, given θ, that the individual gets the graded score x_g or greater. In particular, P*_{x_g}(θ) = 1 for x_g = 0 and P*_{x_g}(θ) = 0 for x_g = m_g + 1 for the entire range of θ, since everyone obtains a score of 0 or greater, and no one gets a score of m_g + 1 or greater. Also from Equation 4.1, P*_{x_g}(θ) = M_{x_g}(θ) when x_g = 0 for the entire range of θ.

Terminology Note: Elsewhere in this book and in the literature this function is called a category boundary function or a threshold function. In another difference, we and other authors in this book have indexed this and other types of polytomous response functions using i for items and k for category response (e.g., P*_{ik}), while the author uses g for items, following Lord & Novick (1968, Chap. 16). Note that in addition to using different letters to index item and category components, Samejima also indexes the category response (x) prior to the item index (g), where other authors index the item prior to the category response.

3. Operating characteristic (OC), P_{x_g}(θ) (x_g = 0, 1, 2, …, m_g), defined by

P_{x_g}(θ) = P*_{x_g}(θ) − P*_{x_g+1}(θ).  (4.2)

This is the conditional probability, given θ, that the individual obtains a specific graded score x_g.

Note that when m_g = 1, from Equations 4.1 and 4.2, both P*_{x_g}(θ) and P_{x_g}(θ) for x_g = 1 become the item characteristic function (ICF; Lord & Novick, 1968, Chap. 16) for a dichotomous item. Thus, a specific graded response model, defined in this way, models dichotomous item responses as a special case, and so the general graded response model framework also applies to dichotomous response items.

4. Basic function (BSF), A_{x_g}(θ) (x_g = 0, 1, 2, ..., m_g), defined by

   A_{x_g}(θ) ≡ (∂/∂θ) log P_{x_g}(θ) = [P_{x_g}(θ)]^{-1} (∂/∂θ) P_{x_g}(θ).   (4.3)

80 Fumiko Samejima

One can conclude that the basic function exists as far as P_{x_g}(θ) is positive for the entire range of θ, and is differentiable with respect to θ.

5. Item response information function (IRIF), I_{x_g}(θ) (x_g = 0, 1, 2, ..., m_g), defined by

   I_{x_g}(θ) ≡ −(∂²/∂θ²) log P_{x_g}(θ) = −(∂/∂θ) A_{x_g}(θ)   (4.4)
             = [P_{x_g}(θ)]^{-2} {(∂/∂θ) P_{x_g}(θ)}^{2} − [P_{x_g}(θ)]^{-1} (∂²/∂θ²) P_{x_g}(θ).

One can conclude that the item response information function exists as far as P_{x_g}(θ) is positive for the entire range of θ, and is twice differentiable with respect to θ.
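To make the five functions concrete, here is a minimal numerical sketch for a hypothetical four-category item (m_g = 3) under the logistic homogeneous model; the parameter values and the finite-difference derivatives are illustrative assumptions, not values taken from the text:

```python
import math

D, a_g = 1.702, 1.5            # hypothetical scaling factor and discrimination
b = [-1.0, 0.0, 1.0]           # hypothetical b_{x_g} for x_g = 1, 2, 3 (m_g = 3)

def coc(x, theta):
    """Cumulative operating characteristic P*_{x_g}(theta): prob. of x_g or greater."""
    if x == 0:
        return 1.0             # everyone obtains a score of 0 or greater
    if x == len(b) + 1:
        return 0.0             # no one attains the imaginary score m_g + 1
    return 1.0 / (1.0 + math.exp(-D * a_g * (theta - b[x - 1])))

def oc(x, theta):
    """Operating characteristic P_{x_g}(theta): difference of adjacent COCs (Eq. 4.2)."""
    return coc(x, theta) - coc(x + 1, theta)

def bsf(x, theta, h=1e-5):
    """Basic function A_{x_g}(theta) = d/dtheta log P_{x_g}(theta) (Eq. 4.3)."""
    return (math.log(oc(x, theta + h)) - math.log(oc(x, theta - h))) / (2.0 * h)

def irif(x, theta, h=1e-4):
    """IRIF I_{x_g}(theta) = -d^2/dtheta^2 log P_{x_g}(theta) (Eq. 4.4)."""
    f = lambda t: math.log(oc(x, t))
    return -(f(theta + h) - 2.0 * f(theta) + f(theta - h)) / h ** 2

probs = [oc(x, 0.3) for x in range(4)]   # the four OCs at theta = 0.3
```

The OCs sum to one at every θ, the BSF of the lowest (highest) category is negative (positive), and the IRIFs are positive, as the chapter states for this model.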

Thissen and Steinberg (1986) called the normal ogive model and logistic model for graded responses difference models, and called models for graded responses expanded from Bock's (1972) nominal model divide-by-total models. The naming may be a little misleading, however, because the general framework for graded response models that has been introduced above accommodates both of Thissen's two categories. From Equation 4.2 we see that for any x_g (= 0, 1, 2, ..., m_g),

   P*_{x_g}(θ) ≥ P*_{x_g+1}(θ)   (4.5)

for the entire range of θ in order to satisfy the definition of the operating characteristic P_{x_g}(θ), because it is a conditional probability.

Samejima (1973b) defined the item information function, I_g(θ), for the general graded response item g as the conditional expectation, given θ, of the IRIF that was defined by Equation 4.4. Thus it can be written

   I_g(θ) ≡ E[I_{x_g}(θ) | θ] = Σ_{x_g=0}^{m_g} I_{x_g}(θ) P_{x_g}(θ).   (4.6)

Note that this item information function (IIF) of the graded item g includes Birnbaum's (1968) item information function for dichotomous responses


as a special case. To simplify the notation, let the item characteristic function (ICF) for dichotomous responses (Lord & Novick, 1968, Chap. 16) be represented as

   P_g(θ) ≡ prob[U_g = 1 | θ] = P_{x_g}(θ; x_g = m_g),   (4.7)

where m_g = 1, and

   Q_g(θ) ≡ prob[U_g = 0 | θ] = P_{x_g}(θ; x_g = 0) = 1 − P_g(θ).   (4.8)


Let P_g′(θ) and Q_g′(θ) denote their first derivatives with respect to θ, respectively. Due to the complementary relationship of Equations 4.7 and 4.8 we can see that

   Q_g′(θ) = −P_g′(θ)   (4.9)

and, from Equation 4.9,

   Q_g″(θ) = −P_g″(θ),   (4.10)

where Q_g″(θ) and P_g″(θ) denote the second derivatives with respect to θ, respectively.

From Equations 4.4 and 4.7 to 4.10 we can now rewrite our IRIF as

   I_{u_g}(θ) = [Q_g(θ)]^{-2} [P_g′(θ)]^{2} + [Q_g(θ)]^{-1} P_g″(θ),   u_g = 0,
              = [P_g(θ)]^{-2} [P_g′(θ)]^{2} − [P_g(θ)]^{-1} P_g″(θ),   u_g = 1.   (4.11)

Thus, from Equations 4.6 and 4.11, the IIF for a dichotomous response item can be written as

   I_g(θ) = I_{u_g}(θ; u_g = 0) Q_g(θ) + I_{u_g}(θ; u_g = 1) P_g(θ) = [P_g′(θ)]^{2} / [P_g(θ) Q_g(θ)].   (4.12)

The last expression of Equation 4.12 equals Birnbaum's (1968) IIF for the dichotomous response item (p. 454).
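As a numeric sanity check, the IIF assembled from the two IRIFs of Equation 4.11 weighted as in Equation 4.12 can be compared against Birnbaum's closed form [P_g′(θ)]² / [P_g(θ)Q_g(θ)]. The 2PL parameters and the finite-difference derivatives below are illustrative assumptions:

```python
import math

D, a_g, b_g = 1.702, 1.2, 0.5          # hypothetical 2PL item parameters

def P(theta):                           # logistic ICF
    return 1.0 / (1.0 + math.exp(-D * a_g * (theta - b_g)))

def d1(f, t, h=1e-5):                   # central first derivative
    return (f(t + h) - f(t - h)) / (2.0 * h)

def d2(f, t, h=1e-4):                   # central second derivative
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / h ** 2

def iif_from_eq_412(theta):
    """I_g = I(u=0) Q_g + I(u=1) P_g, with each IRIF built as in Eq. 4.11."""
    p, q = P(theta), 1.0 - P(theta)
    dp, ddp = d1(P, theta), d2(P, theta)
    i0 = q ** -2 * dp ** 2 + q ** -1 * ddp      # u_g = 0
    i1 = p ** -2 * dp ** 2 - p ** -1 * ddp      # u_g = 1
    return i0 * q + i1 * p

def birnbaum_iif(theta):
    """Birnbaum's (1968) closed form: [P'(theta)]^2 / [P(theta) Q(theta)]."""
    p = P(theta)
    return d1(P, theta) ** 2 / (p * (1.0 - p))
```

The second-derivative terms cancel in the weighted sum, which is exactly why the two expressions agree.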


Expanding Models for Dichotomous Responses to Those for Graded Responses

It is noted from Equation 4.1 that the definition of the COC, P*_{x_g}(θ), of the graded response score x_g becomes the ICF, P_g(θ), that is defined by Equation 4.7, if X_g is replaced by the binary item score U_g and m_g is 1. This implies that expansion of the general dichotomous response model to the general graded response model can be done straightforwardly, with the restriction of Equation 4.5.


For example, suppose that the final grade of a mathematics course that

all math majors are required to pass is based on five letter grades, A, B, C,

D, and F. For these graded responses, m_g = 4. When we reclassify all math majors into pass and fail, there are, in general, m_g different ways to set the borderline between pass and fail: between (1) A and B, (2) B and C, (3) C and D, and (4) D and F. It is noted that Way 1 is the strictest way of passing math majors, Way 4 is the most generous, Way 2 is moderately strict, and Way 3 is moderately generous.

Now the course grade has been changed to a set of two grade categories from five, in four different ways, and in each case the item characteristic function P_g(θ) that is defined by Equation 4.7 can be specified. Note that these four ICFs equal the COCs, that is, the P*_{x_g}(θ)'s, for the letter grades A, B, C, and D, respectively.

Figure 4.1 illustrates these m_g = 4 ICFs. Because Way 1 is the strictest way of passing math majors and Way 4 is the most generous, it is natural that their ICFs are located at the right-most and left-most parts of Figure 4.1,

[Figure 4.1 about here: four pass/fail ICFs plotted against the latent trait θ, with the regions between adjacent curves labeled A, B, C, D, and F.]

Figure 4.1 OCs of five letter grades A, B, C, D, and F shown as the differences of COCs.


respectively, and the other two ICFs are positioned and ordered between the two with respect to their levels of generosity. These curves also satisfy Equation 4.5, as is obvious from the nature of the recategorizations. Thus it is clear from the definitions of pass and fail and Equation 4.2 that the OCs for A, B, C, D, and F are given as the differences of the two adjacent curves, given θ, as indicated in Figure 4.1. Note that these curves do not have to be identical in shape or point symmetric, but should only satisfy Equation 4.5.

Terminology Note: As the author points out, it is not necessary for the curves in Figure 4.1 to have identical shapes. This is a key distinction between the heterogeneous and the homogeneous models. In the homogeneous case the COC forms are always parallel, whereas in the heterogeneous case they are not necessarily parallel.
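The pass/fail recategorization described above can be sketched as follows: for a hypothetical five-grade item under the logistic homogeneous model (the parameter values are illustrative assumptions), the ICF obtained by summing the OCs of all passing grades reproduces the COC at the chosen borderline:

```python
import math

D, a_g = 1.702, 1.0
b = [-1.5, -0.5, 0.0, 0.75]    # hypothetical b_{x_g} for x_g = 1..4 (grades D, C, B, A)

def coc(x, theta):
    """COC P*_{x_g}(theta) for a five-category (m_g = 4) logistic item."""
    if x == 0:
        return 1.0
    if x == 5:
        return 0.0
    return 1.0 / (1.0 + math.exp(-D * a_g * (theta - b[x - 1])))

def oc(x, theta):
    """OC of grade x_g (F, D, C, B, A <-> x_g = 0..4): difference of adjacent COCs."""
    return coc(x, theta) - coc(x + 1, theta)

def pass_icf(cut, theta):
    """ICF when 'pass' means scoring at least `cut`: P(X_g >= cut | theta)."""
    return sum(oc(x, theta) for x in range(cut, 5))
```

Each of the four possible borderlines (cut = 1, ..., 4) yields an ICF identical to the corresponding COC, and the four COCs never cross, which is Equation 4.5.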

The above explanation may be the easiest way to understand the transition

from models of dichotomous responses to those of graded responses. Because

of this close relationship between the ICFs for dichotomous responses and

the COCs for graded responses, in the following sections specific mathematical models that belong to the general graded response model will be

represented by their COCs in most cases.

Terminology Note: The unique maximum condition is an important concept in the graded

response model framework and will be referred to throughout this chapter. In essence this condition

requires that for a given model and for a given likelihood function of the specific response pattern,

there exists a single maximum point that can be used as the estimate of the latent trait.

A single test score is not an adequate basis for estimating an individual's latent trait level even though asymptotically there is a one-to-one correspondence between the test score and the latent trait θ. (Note that no tests have infinitely many items.)

The main reason is that the use of the test score will reduce the local accuracy of latent trait estimation (cf. Samejima, 1969, Chap. 6, pp. 43-45; 1996b), unless there exists a test score or other summary of the response pattern that is a sufficient statistic for the model, as in Rasch's (1960) model or the logistic model (Birnbaum, 1968) for dichotomous responses. In general, therefore, the direct use of the response pattern, the sequence of item scores x_g, rather than a single test score, is strongly encouraged for estimating individuals' latent traits.

Although the proposal of the logistic model as a substitute for the normal ogive model was a big contribution in the 1960s, these days, when most researchers have access to electronic computers, there is little need to substitute one model with another that has a sufficient statistic.


Let v denote a response pattern, a sequence of graded item scores that includes a sequence of specified binary item scores as a special case, such that v = (x_1, x_2, ..., x_n) for a set of n items. Because of local independence (Lord & Novick, 1968, Chap. 16), the likelihood function can be written

   L_v(θ) ≡ P_v(θ) = ∏_{x_g ∈ v} P_{x_g}(θ),   (4.13)

where the product is taken over all item scores x_g in v. Using Equations 4.3 and 4.13, the likelihood equation is given by

   (∂/∂θ) log L_v(θ) = Σ_{x_g ∈ v} (∂/∂θ) log P_{x_g}(θ) = Σ_{x_g ∈ v} A_{x_g}(θ) = 0.   (4.14)

There is a sufficient condition under which a model provides a unique maximum for any likelihood function (i.e., for each and every response pattern); the condition is that both of the following requirements are met:

1. The basic function A_{x_g}(θ) of each and every graded score x_g of each item g is strictly decreasing in θ.
2. Its upper and lower asymptotes are nonnegative and nonpositive, respectively.

For brevity, this condition is called the unique maximum condition (Samejima, 1997, 2004).

On the dichotomous response level, such frequently used models as the normal ogive model, the logistic model, and all models that belong to the logistic positive exponent family satisfy the unique maximum condition. A notable exception is the three-parameter logistic model (3PL; Birnbaum, 1968). In that model, for x_g = 1, the unique maximum condition is not satisfied, and it is quite possible that for some response patterns a unique MLE does not exist (for details see Samejima, 1973b).

An algorithm for writing all the basic functions and finding the solutions

of Equation 4.14 for all possible response patterns is easy and straightforward

for most models that satisfy the unique maximum condition, so the unique local or terminal maximum likelihood estimate (MLE) of θ can be found easily without depending on the existence of a sufficient statistic.

It should be noted that when the n items do not all follow a single model but instead follow several different models, as long as all of those models satisfy the unique maximum condition, a unique local or terminal maximum of the likelihood function of each and every possible response pattern is also assured to exist.
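For models satisfying the unique maximum condition, the MLE can be located directly from the response pattern. A minimal sketch, assuming three hypothetical graded items in the logistic homogeneous case and using a simple grid search in place of a root-finding algorithm for Equation 4.14:

```python
import math

D = 1.702
# Hypothetical test: three graded items (logistic homogeneous case), each m_g = 2
items = [
    {"a": 1.0, "b": [-1.0, 0.5]},
    {"a": 1.5, "b": [-0.5, 1.0]},
    {"a": 0.8, "b": [0.0, 1.5]},
]

def coc(item, x, theta):
    if x == 0:
        return 1.0
    if x == len(item["b"]) + 1:
        return 0.0
    return 1.0 / (1.0 + math.exp(-D * item["a"] * (theta - item["b"][x - 1])))

def oc(item, x, theta):
    return coc(item, x, theta) - coc(item, x + 1, theta)

def log_likelihood(pattern, theta):
    """log L_v(theta): sum of log OCs over the response pattern (local independence)."""
    return sum(math.log(oc(it, x, theta)) for it, x in zip(items, pattern))

def mle(pattern, lo=-6.0, hi=6.0, step=0.001):
    """Grid-search the maximum of the response-pattern likelihood."""
    grid = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    return max(grid, key=lambda t: log_likelihood(pattern, t))
```

Because every basic function here is strictly decreasing, the located maximum is the unique local or terminal maximum, and higher response patterns yield higher estimates.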


Samejima (1996a) proposed five different criteria to evaluate a latent trait

model from a substantive point of view:

1. The principle behind the model and the set of accompanying assumptions agree with the psychological nature that underlies the data.

2. Additivity 1, that is, if the existing graded response categories get finer

(e.g., pass and fail are changed to A, B, C, D, and F), their OCs can still

be specified in the same model.


3. Additivity 2, that is, if adjacent categories are combined into fewer response categories (e.g., A, B, C, D, and F to pass and fail), the OCs of the newly combined categories can still be specified in the same mathematical form. (If additivities 1 and 2 hold, the model can be naturally expanded to a continuous response model.)

4. The model satisfies the unique maximum condition (discussed above).

5. Modal points of the OCs of the m_g + 1 graded response categories are ordered in accordance with the graded item scores, x_g = 0, 1, 2, ..., m_g.

Of these five criteria, the first is related to the data to which the model is

applied, but the other four can be used strictly mathematically.

Because the specific response pattern v is the basis of ability estimation, it is necessary to consider the information function provided by v. Samejima (1973b) defined the response pattern information function (RPIF), I_v(θ), as

   I_v(θ) ≡ −(∂²/∂θ²) log P_v(θ).   (4.15)

Equation 4.15 is analogous to the definition of the IRIF, I_{x_g}(θ), given earlier by Equation 4.4. Using Equations 4.13, 4.14, and 4.4, this can be changed to

   I_v(θ) = −Σ_{x_g ∈ v} (∂²/∂θ²) log P_{x_g}(θ) = Σ_{x_g ∈ v} I_{x_g}(θ),   (4.16)

indicating that the RPIF can be obtained as the sum of the IRIFs for all x_g ∈ v.

The test information function (TIF), I(θ), in the general graded response model is defined by the conditional expectation, given θ, of the RPIF, I_v(θ), analogous to the relationship between the IIF and the IRIFs. Thus,

   I(θ) ≡ E[I_v(θ) | θ] = Σ_v I_v(θ) P_v(θ).   (4.17)

Since it can be written that

   P_{x_g}(θ) = Σ_{v ∋ x_g} P_v(θ),   (4.18)

we obtain

   I(θ) = Σ_v Σ_{x_g ∈ v} I_{x_g}(θ) P_v(θ) = Σ_{g=1}^{n} Σ_{x_g=0}^{m_g} I_{x_g}(θ) P_{x_g}(θ) = Σ_{g=1}^{n} I_g(θ).   (4.19)

Note that this outcome, that the test information function equals the sum total of the item information functions, is true only if the individual's ability estimation is based on that individual's response pattern and not on an aggregate such as a test score, unless that aggregate is a simple sufficient statistic, as is the case with the Rasch model. Otherwise, the test information function assumes a value less than the sum total of the item information functions (Samejima, 1996b).


The final outcome of Equation 4.19, that the TIF equals the sum total of IIFs

over all items in the test, questionnaire, and so on, is the same as the outcome of

the general dichotomous response model (cf. Birnbaum, 1968, Chap. 20).

It should be noted that, because of the simplicity of the above outcome (that the test information function equals the sum total of the item information functions), researchers tend to take it for granted. It is necessary, however, that the reader understand how this outcome was obtained from the definitions of the TIF and the RPIF, in order to apply IRT properly and innovatively.
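The identity in Equation 4.19 can be verified by brute force on a small hypothetical test: summing the RPIF over all 27 response patterns of three 3-category items, weighted by P_v(θ), reproduces the sum of the three IIFs. The parameters and the finite-difference IRIF are illustrative assumptions:

```python
import math
from itertools import product

D = 1.702
items = [{"a": 1.0, "b": [-1.0, 0.5]},
         {"a": 1.5, "b": [-0.5, 1.0]},
         {"a": 0.8, "b": [0.0, 1.5]}]

def coc(item, x, theta):
    if x == 0:
        return 1.0
    if x == len(item["b"]) + 1:
        return 0.0
    return 1.0 / (1.0 + math.exp(-D * item["a"] * (theta - item["b"][x - 1])))

def oc(item, x, theta):
    return coc(item, x, theta) - coc(item, x + 1, theta)

def irif(item, x, theta, h=1e-4):
    f = lambda t: math.log(oc(item, x, t))
    return -(f(theta + h) - 2.0 * f(theta) + f(theta - h)) / h ** 2

def iif(item, theta):
    """Eq. 4.6: conditional expectation of the IRIF over the item's categories."""
    return sum(irif(item, x, theta) * oc(item, x, theta) for x in range(3))

def tif_from_items(theta):
    """Right-hand side of Eq. 4.19: sum of the IIFs."""
    return sum(iif(it, theta) for it in items)

def tif_from_patterns(theta):
    """Eq. 4.17: expectation of the RPIF over all 27 response patterns."""
    total = 0.0
    for v in product(range(3), repeat=3):
        p_v = math.prod(oc(it, x, theta) for it, x in zip(items, v))
        i_v = sum(irif(it, x, theta) for it, x in zip(items, v))
        total += i_v * p_v
    return total
```

Exchanging the order of summation over patterns and items is exactly the step from Equation 4.17 to Equation 4.19.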

All specific latent trait models that belong to the general graded response framework can be categorized into the homogeneous case and the heterogeneous case. To give some examples, models such as the normal ogive model and logistic model belong to the former, while the graded response model expanded from the logistic positive exponent family of models (Samejima, 2008), the acceleration model (Samejima, 1995), and the graded response models expanded from Bock's (1972) nominal response model belong to the latter.

Terminology Note: As mentioned earlier, the essential difference between the homogeneous case and the heterogeneous case is whether or not the shapes of the COCs vary from one score category to the next. In the homogeneous case the COCs are parallel, whereas in the heterogeneous case they are not parallel.

Lord set a hypothetical relation between the dichotomous item score u_g and the latent trait θ that leads to the normal ogive model (cf. Lord & Novick, 1968, Section 16.6). He assumes a continuous variable Y_g behind the item score u_g and a critical value γ_g, as well as the following:

1. If an individual's value of Y_g is greater than or equal to γ_g, the individual will obtain u_g = 1 (e.g., pass); otherwise the individual will obtain u_g = 0 (e.g., fail). This assumption may be reasonable if the reader considers that within a group of individuals who get credit for solving problem g there are diversities of ability; that is, some individuals may solve it very easily while some others may barely make it after having struggled a lot, and so on.
2. The regression of Y_g on θ is linear.


[Figure 4.2 about here: the latent variable Y_g plotted against the latent trait θ, with critical values γ_{g0} and γ_{g1}, the regression line E(Y_g | θ), and shaded conditional densities giving P_0(θ), P_1(θ), and P_2(θ).]

Figure 4.2 Illustration of a hypothesized continuous variable underlying graded response models

in the homogeneous case.

3. The conditional distribution of Y_g, given θ, is normal.
4. The variance of these conditional distributions is the same for all θ.

The figure that Lord used for the normal ogive model for dichotomous responses (Lord & Novick, 1968, Figure 16.6.1) is adapted in Figure 4.2. In this figure, two critical values, γ_{g0} and γ_{g1} (γ_{g0} < γ_{g1}), are used instead of a single γ_g, as is also shown in Figure 4.2, and Hypothesis 1 is changed to Hypothesis 1*: Individual a will get x_g = 2 (e.g., honor pass) if Y_{ga} ≥ γ_{g1}, x_g = 1 (e.g., pass) if γ_{g0} ≤ Y_{ga} < γ_{g1}, and x_g = 0 (e.g., fail) if Y_{ga} < γ_{g0}. As is obvious from Equations 4.1 and 4.2, the shaded area for the interval [γ_{g1}, ∞) indicates the OC for x_g = 2 of item g; for [γ_{g0}, γ_{g1}), the OC for x_g = 1; and for (−∞, γ_{g0}), the OC for x_g = 0, at each of the two levels of θ shown in Figure 4.2.

The above example leads to the normal ogive model for graded responses when m_g = 2. By increasing the number of critical values, γ_g's, however, a similar rationale can be applied for any positive integer m_g.

It should also be noted that Hypotheses 3 and 4 can be replaced by any other conditional density functions, symmetric or asymmetric, in so far as their shapes are identical at all fixed values of θ. All such models are said to belong to the homogeneous case. Thus, a model that belongs to the homogeneous case does not imply that its COCs are point symmetric for x_g = 1, 2, ..., m_g, nor that its OCs provide symmetric curves for x_g = 1, 2, ..., m_g − 1, although both the normal ogive and logistic models do so.


From the above definition and observations, it is clear that any graded response model that belongs to the homogeneous case satisfies additivities 1 and 2, which were introduced earlier as criteria for evaluating mathematical models for graded responses.

From Equation 4.1 it can be seen that a common feature of the models in the homogeneous case is that the cumulative operating characteristics, P*_{x_g}(θ)'s, for x_g = 1, 2, ..., m_g are identical in shape except for their positions on the θ dimension, which are ordered in accordance with the graded score x_g.


The rationale behind the normal ogive model was provided earlier as an example of the rationale behind any model that belongs to the homogeneous case. In the normal ogive model, the COC is specified by

   P*_{x_g}(θ) = ∫_{−∞}^{a_g(θ − b_{x_g})} (2π)^{−1/2} exp(−z²/2) dz,   (4.20)

where a_g denotes the item discrimination parameter and b_{x_g} is the item response difficulty parameter, the latter of which satisfies

   b_{x_g} < b_{x_g+1}   for x_g = 1, 2, ..., m_g − 1.   (4.21)

Figures 4.3a and b illustrate the OCs in the normal ogive model for two different items, both with m_g = 5 but having different a_g and b_{x_g}'s: for the item in Figure 4.3a, a_g = 1.0 and b_{x_g} = −1.50, −0.50, 0.00, 0.75, 1.25, while for the item in Figure 4.3b, a_g = 2.0 and b_{x_g} = −2.00, −1.00, 0.00, 1.00, 2.00.

It is noted that for both items the OCs for x_g = 0 and x_g = 5 (= m_g) are strictly decreasing and increasing in θ, respectively, with unity and zero as the two asymptotes in the former, and with zero and unity as the two asymptotes in the latter. They are also point symmetric, meaning that if each curve is rotated by 180° around the point θ = b_1 and P_0(θ) = 0.5 when x_g = 0, or θ = b_5 and P_5(θ) = 0.5 when x_g = 5, then the rotated upper half of the curve overlaps the original lower half of the curve, and vice versa. It is also noted that in both figures the OCs for x_g = 1, 2, 3, 4 are all unimodal and symmetric.

These two sets of OCs provide substantially different impressions, because in Figure 4.3a the four bell-shaped curves have a variety of heights. The heights are determined by the distance b_{x_g+1} − b_{x_g}. In this figure, the modal point of the curve for x_g = 1 equals (b_1 + b_2)/2 = −1.00, and its maximal OC is higher than any of the others, because b_2 − b_1 = 1.00 while b_3 − b_2 = 0.50, b_4 − b_3 = 0.75, and b_5 − b_4 = 0.50. Thus the second highest maximum belongs to x_g = 3, and the lowest is shared by x_g = 2 and x_g = 4. For the item in Figure 4.3b, it is noted that those distances (b_{x_g+1} − b_{x_g}) are uniformly 1.00; thus the heights of the four bell-shaped curves are all equal. The height of a bell-shaped curve

[Figure 4.3 about here: panels (a) and (b) each plot the six OCs (x_g = 0, ..., 5) of an item against the latent trait θ.]

Figure 4.3 Two examples of six-category operating characteristics when categories are (a) not

equally spaced and (b) equally spaced.

also depends on the value of a_g. In Figure 4.3b, the common height of the four bell-shaped curves is higher than the one for x_g = 1 in Figure 4.3a, and this comes from the larger value of a_g (= 2.0) for the item in Figure 4.3b than that of the item in Figure 4.3a, for which a_g = 1.0. It should also be noted that in each of the two examples the modal points of the OCs are ordered in accordance with the item score, x_g = 1, 2, 3, and 4.
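These observations about heights and modal points can be checked numerically for the Figure 4.3a item (a_g = 1.0, b_{x_g} = −1.50, −0.50, 0.00, 0.75, 1.25); the grid search below is an implementation convenience, not part of the model:

```python
import math

a_g = 1.0
b = [-1.50, -0.50, 0.00, 0.75, 1.25]    # b_{x_g} for x_g = 1..5 (Figure 4.3a item)

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def coc(x, theta):
    if x == 0:
        return 1.0
    if x == 6:
        return 0.0
    return Phi(a_g * (theta - b[x - 1]))

def oc(x, theta):
    """OC of category x_g: difference of adjacent normal ogive COCs."""
    return coc(x, theta) - coc(x + 1, theta)

grid = [-4.0 + i * 0.001 for i in range(8001)]
height = {x: max(oc(x, t) for t in grid) for x in range(1, 5)}
mode = {x: max(grid, key=lambda t: oc(x, t)) for x in range(1, 5)}
```

The mode of each intermediate OC sits halfway between its two boundary difficulties, the modes are ordered with the item score, and the categories with equal spacing (x_g = 2 and x_g = 4, both spaced 0.50) share the lowest height.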

The above characteristics of the NMLOG are also shared by the logistic

model that will be introduced in the following section.

It has been observed (Samejima, 1969, 1972) that the BSFs of the x_g's are all strictly decreasing in θ, with 0 and −∞ as the two asymptotes for x_g = 0, with +∞ and −∞ for 0 < x_g < m_g, and with +∞ and 0 for x_g = m_g, respectively, indicating that the model satisfies the unique maximum condition discussed above. The IRIFs for all 0 ≤ x_g ≤ m_g are positive for the entire range of θ. The processing functions (PRFs) are all strictly increasing in θ for all 0 < x_g ≤ m_g, with zero and unity as the two asymptotes. (For more details, cf. Samejima, 1969, 1972.)


Relationship to Other Models: This LGST model has been mentioned by other authors in this book, and is typically referred to in the broader literature as the graded response model. As mentioned earlier, however, Samejima uses the term graded response model to refer to her broader framework, which includes this (and the previous NMLOG) example of the homogeneous case as well as the later LPEFG and ACLR examples of the heterogeneous case.


In the logistic model, the COC is specified by

   P*_{x_g}(θ) = {1 + exp[−Da_g(θ − b_{x_g})]}^{-1},   (4.22)

where a_g denotes the item discrimination parameter and b_{x_g} is the item response difficulty parameter that satisfies the inequality presented in Equation 4.21, as is the case with the normal ogive model. D is a scaling factor usually set equal to 1.702 or 1.7 so that Equation 4.22 provides a very close curve to Equation 4.20, that is, the COC in the normal ogive model, when the same values of the item discrimination parameter a_g and the item response difficulty parameters b_{x_g} are used. As is expected, the set of COCs and the set of OCs are similar to the corresponding sets in the normal ogive model, illustrated in Figure 4.3a and b.

Notable differences are found in its PRFs and BSFs, however (Samejima, 1969, 1972, 1997). Although the PRFs are strictly increasing in θ for all 0 < x_g ≤ m_g and their upper asymptotes are all unity, as is the case with the NMLOG, their lower asymptotes equal exp[−Da_g(b_{x_g} − b_{x_g−1})] (cf. Samejima, 1972, p. 43), which is positive except for x_g = 1, where it is zero. This indicates that, unlike in the NMLOG, in the LGST for all 1 < x_g ≤ m_g the lower asymptotes are positive; moreover, the closer the item response difficulty parameter b_{x_g} is to that of the preceding item score, b_{x_g−1}, the better the chances of passing the current step x_g, giving favor to individuals at lower levels of ability. This fact is worth taking into consideration when model selection is considered. (A comparison of LGST processing functions with those in the NMLOG is illustrated in Samejima (1972, Figure 5-2-1, p. 43).)
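The lower asymptote of an LGST processing function can be checked numerically; the parameters below are hypothetical, and the limit θ → −∞ is approximated by evaluating far to the left:

```python
import math

D, a_g = 1.702, 1.0
b = [-1.0, 0.0, 1.0]           # hypothetical b_{x_g} for x_g = 1, 2, 3

def coc(x, theta):
    """Logistic COC; P*_0(theta) = 1 by definition."""
    if x == 0:
        return 1.0
    return 1.0 / (1.0 + math.exp(-D * a_g * (theta - b[x - 1])))

def prf(x, theta):
    """Processing function M_{x_g} = P*_{x_g} / P*_{x_g-1} for 0 < x_g <= m_g."""
    return coc(x, theta) / coc(x - 1, theta)

# Predicted lower asymptote for x_g = 2: exp(-D a_g (b_2 - b_1))
predicted = math.exp(-D * a_g * (b[1] - b[0]))
observed = prf(2, -30.0)       # theta = -30 approximates theta -> -infinity
```

As the text states, the asymptote is zero for x_g = 1 (since there is no preceding finite difficulty) but strictly positive for every later step, and it grows as the adjacent difficulties move closer together.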

Although the BSFs are strictly decreasing in θ for all 0 ≤ x_g ≤ m_g, unlike in the normal ogive model, the two asymptotes for x_g = 0 are zero and the finite value −Da_g, those for x_g = m_g are Da_g and zero, and for all other intermediate x_g's the asymptotes are the finite values Da_g and −Da_g, respectively. The unique maximum condition is also satisfied (Samejima, 1969, 1972), however, and the IRIFs for all 0 ≤ x_g ≤ m_g are positive for the entire range of θ.


An Application of the Logistic Model to Medical Science Research

It was personally delightful when, as early as 1975, Roche, Wainer, and Thissen applied the logistic model for graded responses in medical science research in the book Skeletal Maturity. The research is a fine combination of medical expertise and a latent trait model for graded responses. It is obvious that every child grows up to become an adolescent and then an adult, and that skeletal maturity progresses with age. But there are many individual differences in the speed of that process, and a child's chronological age is not an

accurate indicator of his or her skeletal maturity. For example, if you take a

look at a group of sixth graders, in spite of the closeness of their chronologi-

cal ages, some of them are already over 6 feet tall and look like young adults,

while others still look like small children. Measuring the skeletal maturity of

each child accurately is important because certain surgeries have to be conducted when a child's or adolescent's skeletal maturity has reached a certain level, to give an example.

In Roche et al. (1975), x-ray films of the left knee joint taken from different angles were mostly used as items, or skeletal maturity indicators. The items were grouped into three categories: femur (12), tibia (16), and fibula (6). The reference group of subjects for the skeletal maturity scale consists of 273 girls and 279 boys of various ages. A graded item score was assigned to each of those subjects for each item following medical experts' evaluations of the x-ray film. The reader is strongly encouraged to read the entire Roche et al. text to learn more about this valuable research and to see how the LGST, a specific example of the graded response model framework, has been applied in practice.

Both the normal ogive and logistic models satisfy the unique maximum condition and the additivities 1 and 2 criteria (discussed above); the modal points of their OCs are arranged in accordance with the item scores, and they can be naturally expanded to respective continuous response models (cf. Samejima, 1973a).

It is clear that in the NMLOG and LGST for graded responses (and also in many other models in the homogeneous case) the COC can be expressed as

   P*_{x_g}(θ) = ∫_{−∞}^{a_g(θ − b_{x_g})} ψ_g(z) dz,   (4.23)

where ψ_g(·) denotes the normal and logistic density functions, respectively.
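Equation 4.23 can be illustrated for the normal ogive case by integrating the standard normal density numerically up to a_g(θ − b_{x_g}) and comparing against the closed form built from the error function; the parameters and the trapezoidal rule are illustrative choices:

```python
import math

a_g, b_x = 1.0, 0.5            # hypothetical a_g and b_{x_g}

def psi(z):
    """Standard normal density, the psi_g of Eq. 4.23 in the NMLOG."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def coc_integral(theta, lo=-10.0, n=20000):
    """Eq. 4.23: integrate the density up to a_g(theta - b_{x_g}), trapezoidal rule."""
    hi = a_g * (theta - b_x)
    h = (hi - lo) / n
    s = 0.5 * (psi(lo) + psi(hi)) + sum(psi(lo + i * h) for i in range(1, n))
    return s * h

def coc_closed_form(theta):
    """Normal ogive COC via the error function."""
    return 0.5 * (1.0 + math.erf(a_g * (theta - b_x) / math.sqrt(2.0)))
```

Swapping `psi` for the logistic density (with the scaling factor D) would give the LGST COC in the same way.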


It can be seen from Equation 4.23 that these models for graded responses can be expanded to their respective models for continuous responses. Replacing x_g by z_g in Equation 4.23, the operating density characteristic H_{z_g}(θ) (Samejima, 1973a) for a continuous response z_g is defined by

   H_{z_g}(θ) ≡ lim_{Δz_g → 0} [P_{z_g}(θ) − P_{z_g+Δz_g}(θ)] / Δz_g = a_g ψ_g{a_g(θ − b_{z_g})} (d/dz_g) b_{z_g},

where a_g is the item discrimination parameter and b_{z_g} is the item response difficulty parameter, the latter of which is a continuous, strictly increasing function of z_g.

In the normal ogive model for continuous responses, there exists a sufficient statistic, t(v), such that

   t(v) = Σ_{z_g ∈ v} a_g² b_{z_g},   (4.24)

and the MLE of θ is provided by dividing t(v) by the sum total of a_g² over all n items.
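A minimal sketch of this sufficient-statistic shortcut, assuming hypothetical (a_g, b_{z_g}) values and the quadratic normal-ogive log-likelihood kernel in θ; the grid search is only there to confirm the closed form:

```python
# Hypothetical continuous-response data: one (a_g, b_{z_g}) pair per item, with
# b_{z_g} already evaluated at the observed continuous response z_g.
data = [(1.0, -0.4), (1.5, 0.2), (0.8, 1.1)]

def t_v():
    """Sufficient statistic of Eq. 4.24: sum of a_g^2 * b_{z_g}."""
    return sum(a * a * bz for a, bz in data)

def mle_closed_form():
    """MLE of theta: t(v) divided by the sum total of a_g^2."""
    return t_v() / sum(a * a for a, bz in data)

def log_likelihood(theta):
    # Normal ogive continuous-response kernel: each response contributes a
    # quadratic term -[a_g (theta - b_{z_g})]^2 / 2 in theta (constants dropped).
    return sum(-0.5 * (a * (theta - bz)) ** 2 for a, bz in data)

def mle_grid(lo=-4.0, hi=4.0, step=0.0005):
    grid = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    return max(grid, key=log_likelihood)
```

The closed form is simply a precision-weighted average of the b_{z_g}'s, which is why no iterative search is needed in this model.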

When the latent space is multidimensional, that is, θ = (θ_1, θ_2, ..., θ_j, ..., θ_r), in the NMLOG the sufficient statistic becomes a vector of order r, that is,

   t(v) = Σ_{z_g ∈ v} a_g a_g′ b_{z_g},   (4.25)

where the bold letters indicate vectors of order r, and the MLE of θ is given by the inverse of the matrix Σ_{z_g ∈ v} a_g a_g′ postmultiplied by t(v). It is noted that Equation 4.24 is a special case of Equation 4.25 when r = 1 (for details, the reader is directed to Samejima (1974)).

For graded response data, when m_g is very large, a continuous response model may be more appropriate than a graded response model, as is often done in applying statistical methods. (Note that the test score takes a finite set of values, and yet it is sometimes treated as a continuous variable, for example.) In such a case, if the normal ogive model fits our data, the MLE of θ can be obtained more easily, taking advantage of the sufficient statistic, when the latent space is multidimensional as well as unidimensional.

It has been observed (Samejima, 2000) that the normal ogive model for dichotomous responses provides some contradictory outcomes in the ordering of the MLEs of θ, because of the point-symmetric nature of its ICF, which is characterized by the right-hand side of Equation 4.20 with the item response difficulty parameter b_{x_g} replaced by the item difficulty parameter b_g.

To illustrate this fact, Table 1 of Samejima (2000) presents all 32 (= 2⁵) response patterns of five hypothetical dichotomous items following the NMLOG, with a_g = 1.0 and b_g = −3.0, −1.5, 0.0, 1.5, 3.0, respectively, arranged in ascending order of the MLEs. When the model is changed to the LGST, because all five items share the same item discrimination parameter, a_g = 1.0, the simple number-correct test score becomes a


sufficient statistic, and a subset of response patterns that have the same number of u_g = 1 shares the same value of the MLE.

The following are part of all 32 response patterns listed in that table and their corresponding MLEs in the NMLOG:

Pattern 2 (10000) −2.284
Pattern 7 (00001) −0.866
Pattern 26 (01111) 0.866
Pattern 31 (11110) 2.284

It is noted that the first two response patterns share the same subresponse pattern, 000, for Items 2 to 4, and the second two share the same subresponse pattern, 111, for the same three items. In the first pair of response patterns, the subresponse patterns for Items 1 and 5 are 10 and 01, respectively, while in the second pair they are 01 and 10. Because the only difference within each pair of response patterns is this subresponse pattern for Items 1 and 5, it is contradictory that in the first pair (Patterns 2 and 7) success in answering the most difficult item is credited more (−0.866 > −2.284) in the normal ogive model, while in the second pair (Patterns 26 and 31) success in answering the easiest item is credited more (2.284 > 0.866).
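This order reversal can be reproduced numerically. The sketch below grid-searches the MLE of each response pattern under the NMLOG with a_g = 1.0 and b_g = −3.0, −1.5, 0.0, 1.5, 3.0; the probability clamp and the grid resolution are implementation conveniences, not part of the model:

```python
import math

a = 1.0
b = [-3.0, -1.5, 0.0, 1.5, 3.0]    # the five hypothetical item difficulties

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_likelihood(pattern, theta, eps=1e-12):
    total = 0.0
    for u, bg in zip(pattern, b):
        p = Phi(a * (theta - bg))
        p = min(max(p, eps), 1.0 - eps)   # guard against underflow at extremes
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def mle(pattern, lo=-5.0, hi=5.0, step=0.001):
    grid = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    return max(grid, key=lambda t: log_likelihood(pattern, t))

# The two pairs discussed in the text: 10000 vs. 00001 and 01111 vs. 11110.
m_10000, m_00001 = mle((1, 0, 0, 0, 0)), mle((0, 0, 0, 0, 1))
m_01111, m_11110 = mle((0, 1, 1, 1, 1)), mle((1, 1, 1, 1, 0))
```

The thin normal tails penalize a failure (or reward a success) quadratically in distance, which is what produces the inconsistent credit across the two pairs.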

Observations like the above provided the motivation for proposing a family of models for dichotomous responses, the logistic positive exponent family (LPEF; Samejima, 2000), which arranges the MLEs consistently following one principle concerning penalties or credits for failing or succeeding in answering easier or more difficult items. This was later expanded to a graded response model (LPEFG; Samejima, 2008) that will be introduced later in this chapter.

In spite of some shortcomings of the normal ogive and logistic models for dichotomous responses, they are useful models as working hypotheses, and are effectively used, for example, in online item calibration in computerized adaptive testing (cf. Samejima, 2001).

The heterogeneous case consists of all specific latent trait models for graded responses that do not belong to the homogeneous case. In each of those models the COCs for x_g = 1, 2, ..., m_g are not all identical in shape, unlike in the homogeneous case, and yet the relationship in Equation 4.5 holds for every pair of adjacent x_g's. That is, even though adjacent functions are not parallel, they never cross.

Two subcategories are conceivable for specific graded response models in the heterogeneous case. One is a subgroup of those models that can be naturally expanded to continuous response models. In this section, the graded response model (LPEFG), which was expanded from the logistic positive exponent family (LPEF) of models for dichotomous responses, and the acceleration model, which was specifically developed for elaborate cognitive diagnosis, are described and discussed. The other subcategory contains those models that are discrete in nature, which are represented by the models expanded from Bock's (1972) nominal response model (BCK-SMJ).

Logistic Positive Exponent Family of Models for Dichotomous Responses (LPEF)

This family previously appeared in Samejima's (1969) Psychometrika monograph,
using the normal ogive function instead of the logistic function in the
ICFs, although at that time the topic was premature for readers, and it was
practically impossible to pursue it and publish in refereed journals.

When ICFs are point-symmetric, there is no systematic principle of ordering the values of the MLE obtained
on response patterns, as is seen in the example of the normal ogive model.

The logistic model, where a simple sufficient statistic, t(V) = Σ_g a_g u_g (Birnbaum,
1968), exists, is an exception, and there the ordering depends solely on the
discrimination parameters, a_g, without being affected by the difficulty
parameters, b_g.
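The role of this sufficient statistic can be checked numerically. The sketch below is an illustrative Python fragment, not part of the original text, with invented item parameters: it grid-maximizes the log-likelihood of two response patterns that share the same value of Σ_g a_g u_g and shows that, under the logistic model, both patterns yield the same MLE of θ even though different items were answered correctly.

```python
import math

def logistic_icf(theta, a, b, D=1.702):
    """Logistic ICF: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, pattern, a_params, b_params):
    """Log-likelihood of a dichotomous response pattern at a given theta."""
    ll = 0.0
    for u, a, b in zip(pattern, a_params, b_params):
        p = logistic_icf(theta, a, b)
        ll += math.log(p) if u == 1 else math.log(1.0 - p)
    return ll

def mle_theta(pattern, a_params, b_params):
    """Crude grid-search MLE of theta on [-4, 4] (step .001)."""
    grid = [i / 1000.0 for i in range(-4000, 4001)]
    return max(grid, key=lambda t: log_likelihood(t, pattern, a_params, b_params))

# Three hypothetical items: equal discriminations, different difficulties.
a_params = [1.0, 1.0, 1.0]
b_params = [-1.0, 0.0, 1.0]

# Both patterns have the same sufficient statistic: sum(a_g * u_g) = 1.
theta_easy_only = mle_theta([1, 0, 0], a_params, b_params)  # solved only the easiest item
theta_hard_only = mle_theta([0, 0, 1], a_params, b_params)  # solved only the hardest item
print(theta_easy_only, theta_hard_only)  # same MLE for both patterns
```

Because the two log-likelihoods differ only by a constant in θ, the difficulty parameters drop out of the ordering, which is exactly the behavior the LPEF was proposed to remedy.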

The strong motivation for the LPEF was to identify a model that arranges
the values of the MLE for all possible response patterns consistently following
a single principle, that is, penalizing incorrect responses or crediting correct
responses. This motivation was combined with an idea
to perceive Birnbaum's logistic model as a transition model in a family of
models. After all, the fact that in the logistic model MLEs are determined
from the sufficient statistic (Birnbaum, 1968) that totally disregards the
difficulty parameters, b_g's, and is solely determined by the discrimination
parameters, a_g's, is not easily acceptable to this researcher's intuition. This led
to the family of models called the logistic positive exponent family (LPEF)
(Samejima, 2000), where ICFs are defined by

P_g(θ) = [Ψ_g(θ)]^{ξ_g},   0 < ξ_g < ∞,   (4.26)

where the third parameter, ξ_g, is called the acceleration parameter, and

Ψ_g(θ) = [1 + exp{−Da_g(θ − b_g)}]^{−1},   (4.27)

the right-hand side of which is identical to the logistic ICF (Birnbaum, 1968),
where the scaling factor D is usually set equal to 1.702. Note that Equation

4.26 also becomes the logistic ICF when ξ_g = 1, that is, a point-symmetric
curve, and otherwise, it provides point-asymmetric curves, having a long tail
on lower levels of θ as ξ_g (< 1) gets smaller. Samejima (2000) explains that
when ξ_g < 1 the model arranges the values of the MLE following the principle
that penalizes failure in solving an easier item, and when ξ_g > 1, following
the principle that gives credit for solving a more difficult item, provided
that the discrimination parameters assume the same value for all items
(cf. Samejima, 2000). Thus Birnbaum's logistic model can be considered to
represent the transition between the two opposing principles.

The General Graded Response Model 95

Figure 4.4 ICFs of seven items modeled with the LPEF, where the values of ξ_g are (left to right)
0.3, 0.5, 0.8, 1.0, 1.5, 2.0, and 3.0, respectively. (Horizontal axis: latent trait, −5 to 6; vertical axis: probability.)

Figure 4.4 illustrates the ICFs of models that belong to the LPEF, with
the common item parameters a_g = 1 and b_g = 0, where the values of ξ_g are
0.3, 0.5, 0.8, 1.0, 1.5, 2.0, and 3.0, respectively.

The characteristics of the LPEF are as follows:

1. If 0 < ξ_g < 1, then the principle arranging the MLEs of θ is that failure
in answering an easier item correctly is penalized (success in answering
an easier item correctly is credited).
2. If 1 < ξ_g < ∞, then the principle arranging the MLEs of θ is that success
in answering a more difficult item is credited (failure in answering
a more difficult item correctly is penalized).
3. When ξ_g = 1, both of the above principles degenerate, and neither of
the two principles works.

The reader is directed to Samejima (2000) for detailed explanations and
observations of the LPEF for dichotomous responses. It is especially important
to understand the role of the model/item feature function, S_g(θ), which is
defined in that article in Equation 7, specified by Equation 28 for the LPEF,
and illustrated in Figures 4(a) to (c) (pp. 331–332).

It should be noted that the item parameters, a_g and b_g, in the LPEF should
not be considered as the discrimination and difficulty parameters. (The same
is also true with the 3PL.) Actually, the original meaning of the difficulty
parameter is the value of θ at which P_g(θ) = 0.5. These values are indicated
in Figure 4.4, where they are strictly increasing with ξ_g, not constant for all
items. Also, the original meaning of the discrimination parameter is a parameter
proportional to the slope of P_g(θ) at the level of θ where P_g(θ) = 0.5, and
it is also strictly increasing with ξ_g, not a constant value for all seven items.
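The shift of the 0.5-crossing point with ξ_g can be verified directly. The following is an illustrative Python sketch, not from the original text, assuming a_g = 1 and b_g = 0 as in Figure 4.4: it solves P_g(θ) = 0.5 in closed form and shows the crossing point strictly increasing with ξ_g, equaling b_g only when ξ_g = 1.

```python
import math

D = 1.702  # usual scaling factor

def lpef_icf(theta, a, b, xi):
    """LPEF ICF: the logistic function raised to the acceleration parameter xi."""
    psi = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return psi ** xi

def theta_at_half(a, b, xi):
    """Solve P_g(theta) = 0.5: psi = 0.5**(1/xi), then invert the logistic."""
    return b - math.log(2.0 ** (1.0 / xi) - 1.0) / (D * a)

xis = [0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0]  # the Figure 4.4 values
crossings = [theta_at_half(1.0, 0.0, xi) for xi in xis]
print([round(t, 3) for t in crossings])  # strictly increasing with xi
```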


Logistic Positive Exponent Family of Graded Response Models (LPEFG)

It can also be seen in Figure 4.4 that whenever ξ_g < ξ_h for a pair of arbitrary
items, g and h, there is a relationship that P_g(θ) > P_h(θ) for the entire range of
θ. The reason is that for any value 0 < R < 1, R^s > R^t holds for any s < t.

The ICFs that are illustrated in Figure 4.4 can be used, therefore, as an
example of a set of COCs for graded item scores for a single graded response
item with m_g = 7 that satisfies Equation 4.5. More formally, the LPEFG is
characterized by


M_{x_g}(θ) = [Ψ_g(θ)]^{ξ_{x_g} − ξ_{x_g−1}},   x_g = 0, 1, 2, ..., m_g,   (4.28)

ξ_{−1} ≡ ξ_0 = 0 < ξ_1 < ... < ξ_{x_g} < ... < ξ_{m_g−1} < ξ_{m_g} < ξ_{m_g+1} ≡ ∞,   (4.29)

which leads to

P*_{x_g}(θ) = [Ψ_g(θ)]^{ξ_{x_g}},   (4.30)

due to Equations 4.28 and 4.1. From Equations 4.2 and 4.28 through 4.30,
the OC in the LPEFG is defined by

P_{x_g}(θ) = [Ψ_g(θ)]^{ξ_{x_g}} − [Ψ_g(θ)]^{ξ_{x_g+1}},   (4.31)

and the other functions are obtained by replacing P_{x_g}(θ) by the right-hand
side of Equation 4.22 and evaluating its derivatives in Equations 4.3, 4.4, 4.6,
and 4.19, and substituting these outcomes into Equations 4.2 through 4.4
(details in Samejima, 2008).
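Equations 4.28 through 4.31 translate directly into code. The sketch below is an illustrative Python fragment, not part of the original text, using the parameter values of the Figure 4.5 example: it builds the COCs of Equation 4.30 and the OCs of Equation 4.31, and checks that the OCs are nonnegative and sum to one at any θ.

```python
import math

D = 1.702

def psi(theta, a, b):
    """Common logistic function for the item (Equation 4.27)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def lpefg_ocs(theta, a, b, xi):
    """OCs of the LPEFG (Equation 4.31) for scores 0..m_g.

    xi: acceleration parameters xi_1 < ... < xi_{m_g}; xi_0 = 0 and
    xi_{m_g + 1} = infinity are handled implicitly (COC = 1 and 0).
    """
    p = psi(theta, a, b)
    cocs = [1.0] + [p ** x for x in xi] + [0.0]  # Equation 4.30, with endpoints
    return [cocs[k] - cocs[k + 1] for k in range(len(cocs) - 1)]  # Equation 4.31

# m_g = 5, a_g = 1, b_g = 0, xi values from the Figure 4.5 example.
xi = [0.3, 0.8, 1.6, 3.1, 6.1]
for theta in (-2.0, 0.0, 2.0):
    ocs = lpefg_ocs(theta, 1.0, 0.0, xi)
    assert all(p >= 0.0 for p in ocs)
    assert abs(sum(ocs) - 1.0) < 1e-12
    print(theta, [round(p, 3) for p in ocs])
```

The differences are nonnegative because the ξ's are strictly ordered, so the COCs of Equation 4.30 never cross, as required by Equation 4.5.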

Figure 4.5a and b presents the OCs and BSFs (per Equation 4.3) of an
example of the LPEFG, with m_g = 5, a_g = 1, b_g = 0, and ξ_{x_g} = 0.3, 0.8, 1.6,
3.1, and 6.1, respectively. Note that, for x_g = 0 and x_g = m_g, the OCs are strictly
decreasing and increasing in θ, respectively, and for all the other graded item
scores they are unimodal, and the BSFs are all strictly decreasing in θ, with
upper asymptotes of zero for x_g = 0 and Da_g ξ_{x_g} for x_g = 1, 2, 3, 4, 5,
respectively, and with lower asymptotes of −Da_g for x_g = 0, 1, 2, 3, 4 and zero
for x_g = 5, indicating the satisfaction of the unique maximum condition.

The set of BSFs in Figure 4.5b is quite different from that of the NMLOG or
LGST (see Samejima, 1969, 1972) because the upper limit of the BSF is
largely controlled by the item response parameter ξ_{x_g}. Because of this fact, it


Figure 4.5 Operating characteristics (a) and basic functions (b) of an example six-category item
modeled with the LPEFG. In (a), the modal points of the OCs are ordered according to the ξ_{x_g}'s, i.e., the
lowest is for x_g = 0, and the highest is for x_g = 5. In (b), the asymptotes as θ approaches −∞
are ordered, i.e., lowest for x_g = 0 and highest for x_g = 5. (Horizontal axes: latent trait; vertical axes: probability in (a), basic functions in (b).)


can also be seen that the amount of information shown in the IRIF becomes
larger as the item score x_g gets larger. The value of θ at which each curve in
Figure 4.5b crosses the θ-dimension for each x_g indicates the modal point of
the corresponding OC that is seen in Figure 4.5a, and these modal points are
ordered in accordance with the ξ_{x_g}'s, with the terminal maxima at negative
infinity and positive infinity for x_g = 0 and x_g = m_g (= 5), respectively.

Another set of PRFs, COCs, OCs, BSFs, and IRIFs for the LPEFG with

different parameter values can be found in Samejima (2008).

It was noted above that the LPEFG satisfies the unique maximum condition,
and additivity and natural expandability to a continuous response model
(discussed above) are intrinsic in the LPEFG (Samejima, 2008).

It is noted that, unlike the normal ogive or the logistic model, the LPEFG is a
substantive mathematical model, in the sense that the principle and nature of
the model consistently support certain psychological phenomena. To give
an example, to answer a relatively difficult problem-solving question, we
must successfully follow a sequence of cognitive processes. The individual's
performance can be evaluated by the number of processes in the sequence
that he or she has successfully cleared, and one of the graded item scores, 0
through m_g, is assigned. It is reasonable to assume that passing each
successive cognitive process becomes progressively more difficult, represented
by the item response parameter ξ_{x_g}. Concrete examples of problem solving
for which the LPEFG is likely to fit are various geometric proofs (Samejima,
2008).

Usually, there is more than one way of proving a given geometry theorem.

Notably, it is said that there are 362 different ways to prove the Pythagorean
theorem! It would make an interesting project for a researcher to choose

a geometry theorem having several proofs, collect data, categorize subjects

into subgroups, each of which consists of those who choose one of the differ-

ent proofs, assign graded item scores to represent the degrees of attainment

for each subgroup, and apply LPEFG for the data of each subgroup. It is

most likely that separate proofs will have different values of mg, and it would

be interesting to observe the empirical outcomes.

Readers will be able to think of other substantive examples. It would

be most interesting to see applications of the LPEFG to data collected

for such examples to find out if the model works well. Any such feedback

would be appreciated.

Relationship to Other Chapters: Huang and Mislevy do something similar to what is suggested

here with responses to a physical mechanics exam. However, they use the polytomous Rasch model

to investigate response strategies rather than the LPEFG, and take a slightly different approach given

that Rasch models do not model processing functions.


Greater Opportunities for Applying Mathematical

Models for Cognitive Psychology Data

For any research in the social sciences, mathematical models and methodologies
are important, if one aims at truly scientific accomplishments. Because of
the intangible nature of the social sciences, however, there is still a long way to
go if the levels of scientific attainment in the natural sciences are one's goal.

Nonetheless, the research environment for behavioral science has
improved, especially during the past few decades. One of the big reasons
is the advancement of computer technologies. As
an example, in cognitive psychology it used to be typical that a researcher

invited a subject to an experimental room and gave him or her instructions,

which the subject followed and responded to accordingly. Because of its
time-consuming nature, research was usually based on a very
small group of subjects, and quantitative analysis of the research data was

practically impossible.

With the rapid advancement of computer technologies, microcomputers

have become much more capable, smaller, and much less expensive. It is quite

possible to replace the old procedure by computer software that accommo-

dates all experimental procedures, including instructions, response formats,

and data collections. The software is easy to copy, and identical software can

be installed onto multiple laptops of the same type, each of which can be

taken by well-trained instructors to different geographical areas to collect

data for dozens of subjects each. Thus, data can be collected with a sample

size of several hundred relatively easily, in a well-controlled experimental

environment. Sampling can also be made closer to random sampling.

In turn, the need for mathematical models for cognitive processes has
become greater, and mathematical models must be proposed with the above
perspective in mind.

Acceleration Model

Samejima (1995) proposed the acceleration model, which belongs to the
heterogeneous case, with such a future need in mind. In general, cognitive
diagnosis is complicated, so naturally models for cognitive diagnosis must be

more complicated than many other mathematical models that are applied,

for example, to test or questionnaire data.

In the acceleration model, the PRF is defined by

M_{x_g}(θ) = [ψ_{x_g}(θ)]^{ξ*_{x_g}},   (4.32)

where ξ*_{x_g} (> 0) is also called the acceleration parameter in this model, and
ψ_{x_g}(θ) is a member of the family of functions satisfying

0 = lim_{θ→−∞} ψ_{x_g}(θ) < ψ_{x_g}(θ) < lim_{θ→∞} ψ_{x_g}(θ) = 1,   (4.33)


such as the logistic function

ψ_{x_g}(θ) = [1 + exp{−Da_{x_g}(θ − b_{x_g})}]^{−1}.   (4.34)

The COC in this model is provided by

P*_{x_g}(θ) = ∏_{u≤x_g} [ψ_u(θ)]^{ξ*_u},   (4.35)


and all the other functions, such as OC, BSF, and IRIF in this model, are

given by substituting Equation 4.35 into those formulas of the general graded

response model, Equations 4.2 through 4.4, respectively.
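Assuming the logistic choice of ψ in Equation 4.34, the acceleration-model COCs of Equation 4.35 and the resulting OCs can be sketched as follows. This is an illustrative Python fragment, not from the original text, and the per-step parameter values are invented for demonstration.

```python
import math

D = 1.702

def psi_step(theta, a, b):
    """Per-step logistic function (Equation 4.34)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def acceleration_ocs(theta, steps):
    """OCs of the acceleration model for scores 0..m_g.

    steps: list of (a, b, xi) for steps 1..m_g; the COC for score x_g is the
    product of [psi_u(theta)]**xi_u over u <= x_g (Equation 4.35), with the
    COC for x_g = 0 equal to 1.
    """
    cocs = [1.0]
    for a, b, xi in steps:
        cocs.append(cocs[-1] * psi_step(theta, a, b) ** xi)
    cocs.append(0.0)  # COC beyond the top score
    return [cocs[k] - cocs[k + 1] for k in range(len(cocs) - 1)]  # Equation 4.2

# Hypothetical item with m_g = 3: each step has its own a, b, and xi.
steps = [(1.2, -1.0, 0.6), (1.0, 0.0, 1.0), (0.8, 1.2, 1.8)]
for theta in (-2.0, 0.0, 2.0):
    ocs = acceleration_ocs(theta, steps)
    assert all(p >= 0.0 for p in ocs)
    assert abs(sum(ocs) - 1.0) < 1e-12
    print(theta, [round(p, 3) for p in ocs])
```

Because every factor lies strictly between 0 and 1, the cumulative products never cross, so the OCs stay nonnegative and telescope to one, unlike the LPEFG, here each step is free to have its own logistic shape.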

It should be noted that on the left-hand side of Equation 4.34 ψ_{x_g}(θ) is
used instead of Ψ_g(θ) in Equation 4.27, with a_{x_g} and b_{x_g} replacing a_g and
b_g, respectively, on the right-hand side. This indicates that in the acceleration
model the logistic function is defined separately for each graded score
x_g, while in the LPEFG it is common to all the graded item scores of item
g. This difference makes the acceleration model more complicated than the
LPEFG, for the purpose of using it for cognitive diagnosis of more complicated
sequences of cognitive processes.

It is also noted that if ψ_{x_g}(θ) in Equation 4.34 is replaced by Ψ_g(θ) in
Equation 4.27, and we define ξ*_{x_g} ≡ ξ_{x_g} − ξ_{x_g−1} with ξ_{−1} ≡ 0, then the LPEFG
can be considered as a special, simplified case of the acceleration model. The

model is described in detail in Samejima (1995). It may be wise to collect

data to which the LPEFG substantively fits, and analyze them first, build-

ing on that experience to analyze more elaborate cognitive data using the

acceleration model.

Bock's (1972) nominal response model is a valuable model for nominal
response items in that it discloses the implicit order of the nominal response
categories. Samejima (1972) proposed a graded response model expanded
from Bock's nominal model. When a model fits data that implicitly have
ordered response categories, it is easy to expand the model to a graded
response model for the explicit graded item scores.

Samejima did not pursue the BCK-SMJ much further, however, because an
intrinsic restriction was observed in the expanded model. Later, Masters
(1982) proposed a special case of the BCK-SMJ as the partial credit model, and
Muraki (1992) proposed the BCK-SMJ itself as a generalized partial credit
model, without realizing that the model had already been proposed in 1972.
Many researchers have applied those models. Practitioners using IRT in their
research should only use either model when their research data are within the
limits of the previously identified restriction, however.


The OC of the BCK-SMJ is defined by

P_{x_g}(θ) = exp{α_{x_g}θ + β_{x_g}} [Σ_{u=0}^{m_g} exp{α_u θ + β_u}]^{−1},   (4.36)

with

0 < α_0 < α_1 < ... < α_{m_g} < ∞.

It is noted that the denominator of Equation 4.36 is common for all x_g's.


This makes the conditional ratio, given θ, of any pair of the OCs for x_g = s
and x_g = t (s ≠ t) such that

P_s(θ)/P_t(θ) = exp{(α_s − α_t)θ + (β_s − β_t)},   (4.37)

free of the parameters of all the other response categories, a characteristic of
Bock's nominal response model. The same characteristic, however, becomes
a restriction for the BCK-SMJ model. When s and t are two arbitrary adjacent
graded item scores, for example, the combined graded response category
will have the OC

P_{s+t}(θ) = [exp{α_sθ + β_s} + exp{α_tθ + β_t}] [Σ_{u=0}^{m_g} exp{α_uθ + β_u}]^{−1}.   (4.38)

It is obvious that Equation 4.38 does not belong to Equation 4.36, and thus
additivity 2 does not hold for the BCK-SMJ. It can also be seen that additivity
1 does not hold for the model either. Thus the BCK-SMJ is discrete in nature,
and cannot be naturally expanded to a continuous response model, unlike
the normal ogive model, the logistic model, and the LPEFG. It should be applied
strictly to data that are collected for a fixed set of graded response categories,
where no recategorizations are legitimate. This is a strong restriction.
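The cancellation of the common denominator, which underlies both the virtue and the restriction discussed here, can be checked numerically. In the illustrative Python sketch below (not from the original text; the parameter values are invented), the ratio of any two OCs of Equation 4.36 reduces to exp{(α_s − α_t)θ + (β_s − β_t)}, free of every other category.

```python
import math

def bck_smj_ocs(theta, alphas, betas):
    """OCs under Bock's model / BCK-SMJ (Equation 4.36)."""
    terms = [math.exp(a * theta + b) for a, b in zip(alphas, betas)]
    total = sum(terms)
    return [t / total for t in terms]

# Hypothetical four-category item with strictly ordered alphas.
alphas = [0.2, 0.7, 1.3, 2.0]
betas = [0.5, 0.8, 0.2, -1.0]

for theta in (-2.0, 0.0, 2.0):
    ocs = bck_smj_ocs(theta, alphas, betas)
    s, t = 1, 3
    ratio = ocs[s] / ocs[t]
    # Equation 4.37: the ratio depends only on the parameters of s and t.
    predicted = math.exp((alphas[s] - alphas[t]) * theta + (betas[s] - betas[t]))
    assert abs(ratio - predicted) < 1e-9
    print(theta, round(ratio, 6))
```

The same computation shows the restriction: the sum of two such OCs, as in Equation 4.38, no longer has a single exp{αθ + β} numerator, so the combined category cannot be rewritten in the form of Equation 4.36.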

A summary of the characteristics of the five specific graded response
models discussed above, with respect to the four evaluation criteria that were
discussed earlier, is given in Table 4.1.

Table 4.1 Characteristics of the Five Specific Graded Response Models
With Respect to the Four Evaluation Criteria

                                      NMLOG   LGST   LPEFG   ACLR     BCK-SMJ
Additivity 1                          Yes     Yes    Yes     Yes      No
Additivity 2                          Yes     Yes    Yes     Robust   No
Expands to CRM                        Yes     Yes    Yes     Yes      No
Satisfies unique maximum condition    Yes     Yes    Yes     Yes      Yes
Ordered modal points                  Yes     Yes    Yes     Robust   Yes


Item Parameter Recovery in the Three-Parameter Logistic Model

Quite often researchers, using simulated data for multiple-choice items,
adopt software for a parametric estimation of the three-parameter logistic
(3PL) model (Birnbaum, 1968), where the ICF is defined as

P_g(θ) = c_g + (1 − c_g)[1 + exp{−Da_g(θ − b_g)}]^{−1},   (4.39)

and fail to recover the values of the three parameters, a_g, b_g, and c_g, within the

range of error. Sometimes all the item parameter estimates are outrageously
different from their true values. This is a predictable result, because in most
cases simulated data are based on a mound-shaped ability distribution
with very low densities for very high and very low levels of θ. In Equation
4.39, estimating the third parameter, c_g, which is the lower asymptote of
the ICF, will naturally be inaccurate. This inaccuracy will also negatively
affect the accuracy in estimating the other two parameters. Moreover,
even if the ability distribution has large densities at lower levels of θ,
when θ is treated as an individual parameter and an EM algorithm is
used to estimate both the individual parameter and the item parameters, then
the more hypothetical individuals at lower levels of θ are included, the
larger the amount of estimation error in the individual parameters will be,
negatively influencing the accuracy in estimating c_g and, consequently, a_g
and b_g. Thus such an attempt is doomed to fail.
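The scarcity of information about c_g can be made concrete with a small computation. The sketch below is an illustrative Python fragment, not from the original text; it assumes a standard normal ability distribution and an item with a_g = 1 and b_g = 0. It locates the region where the logistic component of Equation 4.39 falls below .05, so that responses there are dominated by the lower asymptote c_g, and shows that under a mound-shaped distribution only a few percent of examinees fall in that region.

```python
import math

D = 1.702

def icf_3pl(theta, a, b, c):
    """Three-parameter logistic ICF (Equation 4.39)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def normal_cdf(x):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

a, b = 1.0, 0.0
# Theta below which the logistic component is under .05, so observed
# responses mostly reflect the lower asymptote c_g.
theta_low = b + math.log(0.05 / 0.95) / (D * a)
fraction_informative = normal_cdf(theta_low)
print(round(theta_low, 3), round(fraction_informative, 4))
```

With roughly four percent of a standard normal sample in that region, estimates of c_g rest on very few responses, which is the inaccuracy described above.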

Even without making an additional effort to increase the number of subjects
at lower levels of the latent trait to more accurately recover the three
item parameters, which will not in any event be successful, if the true curve
and the estimated curve with outrageously wrong estimated parameter values
are plotted together, the fit of the curve with the estimated parameter values
to the true curve is usually quite good for the interval of θ at which the densities
of the ability distribution are high. We could say that, although the parametric
estimation method aims at the recovery of item parameters, it actually recovers
the shape of the true curve for that interval of θ, as a well-developed nonparametric
estimation method does, not the item parameters themselves.

From a truly scientific standpoint, parametric estimation of OCs is not

acceptable unless there is evidence to justify the adoption of the model in

question, because if the model does not fit the nature of our data, it molds the

research data into a wrong mathematical form and the outcomes of research

will become meaningless and misleading.

Thus, well-developed nonparametric estimation methods that discover
the shapes of OCCs will be valuable. Lord developed such a nonparametric
estimation method and applied it to estimating the ICFs of Scholastic
Aptitude Test items (Lord, 1980; Figure 2.31 on page 16, for example
outcomes). The method is appropriate for a large set of data, represented by

widely used tests that are developed and administered by the Educational

Testing Service, American College Testing, and Law School Admission

Council, for example, but it is not appropriate for data of relatively small

sizes, such as those collected in a college or university environment. The non-

parametric methods that were developed by Levine (1984), Ramsay (1991),

and Samejima (1998, 2001) will be more appropriate to use for data collected

on a relatively small number of individuals.

Figure 4.6 exemplifies the outcomes obtained by Samejima's (1998, 2001)
simple sum procedure (SSP) and differential weight procedure (DWP) of the
conditional probability density function (pdf) approach, based on the simulated
data of 1,202 hypothetical examinees in computerized adaptive testing
(CAT). The outcome of DWP1 was obtained by using the outcome of the
SSP as the differential weight function, while the fourth curve, DWP_True,
is the result of the DWP using the true curve as the differential weight
function. DWP_True is called the criterion operating characteristic,
indicating the limit of the closeness of an estimated curve to the true curve;
if they are not close enough, either the procedures in the method of estimation
should be improved, or the sample size should be increased.

It should be noted that the nonmonotonicity of the true curve is detected
by both the SSP and DWP1 in Figure 4.6. Even if the true curve is nonmonotonic,
which is quite possible, especially for the item characteristic
function of a multiple-choice test item (Samejima, 1979), such detection

Figure 4.6 A nonmonotonic ICF (TRUE), its two nonparametric estimates (SSP, DWP1), and
the criterion ICF (DWP_True). (Horizontal axis: latent trait, −4 to 3; vertical axis: probability.)


cannot be made by a parametric estimation. If, for example, the true curve
in Figure 4.6 is the ICF of a multiple-choice test item and a parametric estimation
method such as the 3PL is used, the estimated three item parameters
will provide, at best, an estimated curve with a monotonic tail.

There is no reason to throw away an item whose ICF is nonmonotonic,
as illustrated in Figure 4.6. It is noted that approximately for the interval
of θ ∈ (0.0, 1.5) the amount of item information at each value of θ is large,
so there is no reason why we should not take advantage of it. On the other
hand, at levels lower than this interval of θ the nonmonotonicity of the curve
will make the IRIFs negative, so this part of the curve should not be used.


Samejima (1973b) pointed out that the IRIF of the 3PL for u_g = 1 assumes
negative values, and for that reason the 3PL does not satisfy the unique maximum
condition. Using an item whose ICF is nonmonotonic, as illustrated by
Figure 4.6, is especially easy in CAT (Samejima, 2001), but a similar method
can be used in a paper-and-pencil test or questionnaire.

In Figure 4.6 it can be seen that (1) the outcome of DWP1 is a little
closer to the criterion operating characteristic than that of the SSP, (2) the
outcomes of the SSP and DWP1 are both very close to the criterion operating
characteristic, and (3) the criterion operating characteristic is very close
to the true curve. For more details, the reader is directed to Samejima
(1998, 2001).

Samejima (1994) also used the SSP on empirical data, estimating the
conditional probability, given θ, of each distractor of the multiple-choice items
of the Level 11 Vocabulary Test of the Iowa Tests of Basic Skills, and called
those functions of the incorrect answers plausibility functions. It turned out
that quite a few items proved to possess plausibility functions that have differential
information, and the use of those functions in addition to the ICFs
proved to be promising for increasing the accuracy of ability estimation.

In the example of Figure 4.6, nonparametric estimation of the ICFs for
dichotomous items, or that of the COCs for graded responses, was considered.
We could estimate the PRFs or OCs first, however. Equations 4.1 and 4.2
can be changed to

M_{x_g}(θ) = P*_{x_g}(θ)[P*_{x_g−1}(θ)]^{−1}   for x_g = 1, 2, ..., m_g   (4.40)

and

P*_{x_g}(θ) = Σ_{u≥x_g} P_u(θ).   (4.41)

Thus it is possible to estimate the
PRFs first, and using those outcomes the COCs and then the OCs can be
obtained through Equations 4.1 and 4.2. An alternative way is to estimate
the COCs first, and using Equation 4.40, the PRFs can be obtained, and
then the OCs through Equation 4.2. It is possible to estimate the OCs first,


and then using the outcomes, the COCs can be obtained through Equation

4.41, and then the PRFs through Equation 4.40. Note, however, that this last
method may include substantial amounts of error, because it is quite possible
that some graded scores may include only a small number of individuals,
unless the total sample size is large enough.
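The conversions among OCs, COCs, and PRFs described in this passage can be sketched as follows. This is an illustrative Python fragment, not from the original text; the input OCs are invented values at a single level of θ. The COCs are tail sums of the OCs (Equation 4.41), the PRFs are ratios of adjacent COCs (Equation 4.40), and differencing the COCs recovers the OCs (Equation 4.2).

```python
def cocs_from_ocs(ocs):
    """Equation 4.41: the COC for score x_g is the sum of OCs for scores >= x_g."""
    cocs = []
    running = 0.0
    for p in reversed(ocs):
        running += p
        cocs.append(running)
    return list(reversed(cocs))

def prfs_from_cocs(cocs):
    """Equation 4.40: the PRF for x_g = 1..m_g is COC(x_g) / COC(x_g - 1)."""
    return [cocs[x] / cocs[x - 1] for x in range(1, len(cocs))]

def ocs_from_cocs(cocs):
    """Equation 4.2: the OC is the difference of adjacent COCs (0 beyond m_g)."""
    ext = cocs + [0.0]
    return [ext[x] - ext[x + 1] for x in range(len(cocs))]

# Illustrative OCs at one value of theta for a four-category item.
ocs = [0.1, 0.3, 0.4, 0.2]
cocs = cocs_from_ocs(ocs)        # tail sums: [1.0, 0.9, 0.6, 0.2]
prfs = prfs_from_cocs(cocs)      # PRFs for x_g = 1, 2, 3
round_trip = ocs_from_cocs(cocs) # differencing recovers the original OCs
print(cocs, [round(m, 4) for m in prfs])
```

The round trip also illustrates the warning above: the top-score COC is a sum over the sparsest categories, so estimation error there propagates into every PRF ratio.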

In any case, after the shapes of those functions are nonparametrically estimated,
it is wise to parameterize the nonparametrically discovered functions
by selecting a parametric model that is legitimate in principle and agrees with
the nature of the data. Otherwise, it is difficult to proceed in research using
functions with no mathematical forms.


While the goodness of fit of curves is important, it has its limitations,
especially when a model belongs to the heterogeneous case, where it has fewer
restrictions and more freedom for innovation. Samejima (1996a, 1997) demonstrated
two sets of OCs that belong to two specific graded response
models with quite different principles, the ACLR and BCK-SMJ, which are
nevertheless practically identical to each other. This means that if the OCs
that are discovered as the outcomes of a nonparametric estimation method
fit the OCs of the ACLR, they should also fit those of the BCK-SMJ.

Curve fitting alone cannot be a good enough criterion for model validation.
In model selection, in addition to curve fitting, the most important
consideration should be how well the principle behind each model agrees
with the nature of the research data. Furthermore, consideration should be
given to whether each of the other four criteria listed in Table 4.1 fits the
model, as well as the research data. For example, if we know that the results
of our research may be compared with other research on the same or similar
content, mathematical models that lack additivity should be avoided. If continuous
responses are used, a model should be chosen that can be expanded
naturally from a graded response model in order to make future comparisons
possible with the outcomes of other research in which graded responses are
used.

An effort to select a substantive model is by far the most important criterion,
and curve fitting can be used as an additional criterion, to see if the curves
in a substantive model provide at least a reasonably good fit to the data.

Conclusion

IRT has developed so much in the past few decades that it is hard to cover
even just the essential elements of the general graded response model framework
in a handbook chapter. Many important and useful topics have been
omitted. An attempt has been made to include useful hints for researchers
and practitioners applying IRT within this chapter, including suggested
readings. But even with reference to the original work cited in this chapter,
it may be difficult to identify ways to apply the models.


The author would welcome questions and comments from readers about the content
and cited material in this chapter. Such opportunities would make interactive
communications and deeper understanding possible.

References

Birnbaum, A. (1968). Some latent trait models and their use in inferring an
examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental
test scores (Part 5, Chapters 17–20). Reading, MA: Addison-Wesley.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are
scored in two or more nominal categories. Psychometrika, 37, 29–51.

Levine, M. (1984). An introduction to multilinear formula scoring theory (Measurement

Series 84-5). Champaign: University of Illinois, Department of Educational

Psychology, Model-Based Measurement Laboratory.

Lord, F. M. (1952). A theory of test scores. Psychometric Monograph No. 7.

Lord, F. M. (1980). Applications of item response theory to practical testing problems.

Hillsdale, NJ: Lawrence Erlbaum Associates.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores (Chap. 16).

Reading, MA: Addison-Wesley.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47,
149–174.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm.
Applied Psychological Measurement, 16, 159–176.
Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic
curve estimation. Psychometrika, 56, 611–630.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests.

Copenhagen: Nielsen & Lydiche.

Roche, A. M., Wainer, H., & Thissen, D. (1975). Skeletal maturity. New York:

Plenum Medical.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded
scores. Psychometrika Monograph No. 17.
Samejima, F. (1972). A general model for free-response data. Psychometrika
Monograph No. 18.
Samejima, F. (1973a). Homogeneous case of the continuous response model.
Psychometrika, 38, 203–219.
Samejima, F. (1973b). A comment on Birnbaum's three-parameter logistic model in
the latent trait theory. Psychometrika, 38, 221–233.
Samejima, F. (1974). Normal ogive model on the continuous response level in the
multidimensional latent space. Psychometrika, 39, 111–121.
Samejima, F. (1979). A new family of models for the multiple-choice item (Office of Naval
Research Report 79-4). Knoxville: University of Tennessee.
Samejima, F. (1994). Nonparametric estimation of the plausibility function of the distractors
of the Iowa Vocabulary items. Applied Psychological Measurement, 18, 35–51.
Samejima, F. (1995). Acceleration model in the heterogeneous case of the general
graded response model. Psychometrika, 60, 549–572.
Samejima, F. (1996a). Evaluation of mathematical models for ordered polychotomous
responses. Behaviormetrika, 23, 17–35.
Samejima, F. (1996b, April). Polychotomous responses and the test score. Paper presented
at the annual meeting of the National Council on Measurement in Education, New York.


Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K.
Hambleton (Eds.), Handbook of modern item response theory (pp. 85–100). New
York: Springer-Verlag.

Samejima, F. (1998). Efficient nonparametric approaches for estimating the operating
characteristics of discrete item responses. Psychometrika, 63, 111–130.
Samejima, F. (2000). Logistic positive exponent family of models: Virtue of asymmetric
item characteristic curves. Psychometrika, 65, 319–335.
Samejima, F. (2001). Nonparametric on-line item calibration. Final report of research
funded by the Law School Admission Council for 1999–2001.
Samejima, F. (2004). Graded response model. In K. Kempf-Leonard (Ed.),
Encyclopedia of social measurement (Vol. 2, pp. 145–153). Amsterdam: Elsevier.


Samejima, F. (2008). Graded response model based on the logistic positive exponent
family of models for dichotomous responses. Psychometrika, 73, 561–578.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models.
Psychometrika, 51, 567–577.


Chapter 5

The Partial Credit Model


Geoff N. Masters

Australian Council for Educational Research

Editor Introduction: This chapter demonstrates the elegant simplicity of the underlying concept on

which the partial credit model was built. This provides a valuable basis for understanding this highly

influential polytomous IRT model, including its relationship to other models and the reasons for its

widespread use.

The partial credit model (PCM) is a particular application of the model for dichotomies developed by Danish mathematician Georg Rasch. An understanding of the partial credit model thus depends on an understanding of Rasch's model for dichotomies, the properties of this model, and in particular, Rasch's concept of specific objectivity.

Rasch used the term specific objectivity in relation to a property of the model for tests he developed during the 1950s. He considered this property to be especially useful in the attempt to construct numerical measures that do not depend on the particulars of the instrument used to obtain them.

This property of Rasch's model can be understood by considering two persons A and B with imagined abilities β_A and β_B. If these two persons attempt a set of test items, and a tally is kept of the number of items N_{1,0} that person A answers correctly but B answers incorrectly, and of the number of items N_{0,1} that person B answers correctly but A answers incorrectly, then under Rasch's model, the difference, β_A - β_B, in the abilities of these two persons can be estimated as

ln(N_{1,0}/N_{0,1})  (5.1)

What is significant about this fact is that this relationship between the parameterized difference β_A - β_B and the tallies N_{1,0} and N_{0,1} of observed successes and failures applies to any selection of items when test data conform to Rasch's model. In other words, provided that the responses of persons A and B to a set of items are consistent with the model, the difference β_A - β_B can be estimated by simply counting successes and failures without

having to know or estimate the difficulties of the items involved. Any subset of items (e.g., a selection of easy items, hard items, even-numbered items, odd-numbered items) can be used to obtain an estimate of the relative abilities of persons A and B from a simple tally (Table 5.1).

Table 5.1 Possible Outcomes When Persons A and B Attempt a Set of Items

                           Person B
                       Wrong      Right
Person A    Right      N_{1,0}    N_{1,1}
            Wrong      N_{0,0}    N_{0,1}

The possibility of obtaining an estimate of the relative abilities of persons

A and B that is not dependent upon the details of the items used was referred

to by Rasch as the possibility of specifically objective comparison.
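This invariance is easy to check by simulation. The sketch below is illustrative only (Python, with assumed abilities β_A = 1.0 and β_B = 0.0 and randomly invented item difficulties): it estimates the ability difference from the tally ln(N_{1,0}/N_{0,1}) separately for the easy items and the hard items.

```python
import numpy as np

rng = np.random.default_rng(0)

def rasch_prob(beta, delta):
    """P(correct) under Rasch's model for dichotomies."""
    return 1.0 / (1.0 + np.exp(-(beta - delta)))

beta_A, beta_B = 1.0, 0.0                # assumed abilities
deltas = rng.uniform(-2.0, 2.0, 20000)   # invented item difficulties, easy and hard

# Independent attempts by persons A and B at every item
x_A = rng.random(deltas.size) < rasch_prob(beta_A, deltas)
x_B = rng.random(deltas.size) < rasch_prob(beta_B, deltas)

def estimate_diff(mask):
    """ln(N_{1,0} / N_{0,1}) over the chosen subset of items (Equation 5.1)."""
    n10 = np.sum(x_A[mask] & ~x_B[mask])   # A right, B wrong
    n01 = np.sum(~x_A[mask] & x_B[mask])   # B right, A wrong
    return np.log(n10 / n01)

easy, hard = deltas < 0, deltas >= 0
print(estimate_diff(easy), estimate_diff(hard))   # both near beta_A - beta_B = 1.0
```

Because the data are simulated from the model, the two subset estimates differ only by sampling error: any selection of items gives a consistent estimate of β_A - β_B.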

Rasch's Model

In its most general form, Rasch's model begins with the idea of a measurement variable upon which two objects, A and B, have imagined locations, δ_A and δ_B (Table 5.2).

The possibility of estimating the relative locations of objects A and B on this variable depends on the availability of two observable events:

An event X indicating that B exceeds A

An event Y indicating that A exceeds B

Rasch's model relates the difference between objects A and B to the events X and Y that they govern:

δ_B - δ_A = ln(P_X/P_Y)  (5.2)

where P_X is the probability of observing X and P_Y is the probability of observing Y. Notice that, under the model, the odds P_X/P_Y of observing X rather than Y is dependent only on the direction and distance of δ_B from δ_A, and is uninfluenced by any other parameter.

Table 5.2 Locations of Objects A and B on a Measurement Variable (δ_A and δ_B shown as points on the line)


Table 5.3 Locations of Person n and Item i on the Measurement Variable (δ_i and β_n shown as points on the line)

Rasch described a comparison of this kind as specifically objective because the result of the comparison was independent of everything else within the frame of reference other than the two objects which are to be compared and their observed reactions.


An estimate of the relative locations of objects A and B on the measurement variable can be obtained if there are multiple independent opportunities to observe either event X or event Y. Under these circumstances, δ_B - δ_A can be estimated as

ln(p_X/p_Y) = ln(N_X/N_Y)  (5.3)

where p_X and p_Y are the proportions of occurrences of X and Y, and N_X and N_Y are the numbers of times X and Y occur in N_X + N_Y observation opportunities.

The most common application of Rasch's model is to tests in which responses to items are recorded as either wrong (0) or right (1). Each person n is imagined to have an ability β_n, and each item i is imagined to have a difficulty δ_i, both of which can be represented as locations on the variable being measured (Table 5.3).

In this case, observable event X is person n's success on item i, and observable event Y is person n's failure on item i (Table 5.4).

Rasch's model applied to this situation is

β_n - δ_i = ln(P_1/P_0)  (5.4)

If person n could have multiple independent attempts at item i, then the difference, β_n - δ_i, between person n's ability and item i's difficulty could be estimated as

ln(p_1/p_0) = ln(N_1/N_0)  (5.5)

Table 5.4 Two Possible Outcomes When Person n Attempts Item i

                                              Observable Event
B - A         Observation Opportunity           X        Y
β_n - δ_i     Person n attempts item i          1        0


Table 5.5 Locations of Two Persons m and n on the Same Measurement Variable (β_m and β_n shown as points on the line)

Although this is true in theory, and this method could be useful in some situations, it is not a practical method for estimating β_n - δ_i from test data because test takers are not given multiple attempts at the same item (and if they were, they would not be independent attempts). To estimate the difference β_n - δ_i in practice, it is necessary to estimate β_n from person n's attempts at a number of items, and to estimate δ_i from a number of persons' attempts at that item. In other words, the difficulties of a number of test items and the abilities of a number of test takers must be estimated simultaneously.

In the application of Rasch's model to tests, every person has an imagined location on the variable being measured. Two persons m and n have imagined locations β_m and β_n (Table 5.5).

It follows from Equation 5.4 that if persons m and n attempt the same item and their attempts at that item are independent of each other, then the modeled difference between persons n and m is

β_n - β_m = ln(P_{1,0}/P_{0,1})  (5.6)

where P_{1,0} is the model probability of person n succeeding but m failing the item, and P_{0,1} is the probability of person m succeeding but n failing that item.

It can be seen that Equation 5.6 is Rasch's model (Equation 5.2) applied to the comparison of two persons on a measurement variable. The two observable events involve the success of one person but failure of the other in their attempts at the same item (Table 5.6).

In this comparison of persons m and n, nothing was said about the difficulty

of the item being attempted by these two persons. This is because Equation 5.6

applies to every item. The odds of it being person n who succeeds, given that

one of these two persons succeeds and the other fails, is the same for every item

and depends only on the relative abilities of persons m and n.

Table 5.6 Two Possible Outcomes When Persons n and m Attempt the Same Item

                                                   Observable Event
B - A         Observation Opportunity                X        Y
β_n - β_m     Persons n and m independently         (1,0)    (0,1)
              attempt the same item


Because the modeled odds P_{1,0}/P_{0,1} are the same for every item, the difference β_n - β_m can be estimated as

ln(N_{1,0}/N_{0,1})  (5.7)

where N_{1,0} is the number of items that person n has right but m has wrong, and N_{0,1} is the number of items that person m has right but n has wrong.

When test data conform to the Rasch model, the relative abilities of two persons can be estimated in this way using any selection of items without regard to their difficulties (or any other characteristics). By making multiple pairwise comparisons of this kind, it is possible to estimate the relative locations of a number of persons on the same measurement variable.

Editor Note: This does assume that each person will get some items right and some items wrong.

This is not a feature of the model but rather a characteristic of the test and means that, to take advan-

tage of the pairwise comparison feature of the model, there must be some items in the test that even

test takers with low ability will get right and some items that even test takers with high ability will

get wrong. Low-ability test takers getting a difficult item right, for example, will not satisfy this need

because that would be data that does not conform to the model.

In the application of Rasch's model to tests, every item has an imagined location on the variable being measured. Two items i and j have imagined locations δ_i and δ_j (Table 5.7).

It follows from Equation 5.4 that if items i and j are attempted by the same person and this person's attempts at items i and j are independent of each other, then the modeled difference between items i and j is

δ_i - δ_j = ln(P_{0,1}/P_{1,0})  (5.8)

where P_{1,0} is the model probability of the person succeeding on item i but failing item j, and P_{0,1} is the probability of the person succeeding on item j but failing item i.

It can be seen that Equation 5.8 is Rasch's model (Equation 5.2) applied to the comparison of two items on a measurement variable. The two observable events involve the person's success on one item but failure on the other (Table 5.8).

In this comparison of items i and j, nothing was said about the ability of

the person attempting them. This is because Equation 5.8 applies to every

person. The odds of success on item i, given success on one item but failure on the other, is the same for every person and depends only on the relative difficulties of items i and j.

Table 5.7 Locations of Two Items i and j on the Same Measurement Variable (δ_j and δ_i shown as points on the line)

Table 5.8 Two Possible Outcomes When the Same Person Attempts Items i and j

                                                   Observable Event
B - A         Observation Opportunity                X        Y
δ_i - δ_j     Items i and j independently           (0,1)    (1,0)
              attempted by the same person

Because the modeled odds P_{0,1}/P_{1,0} are the same for every person, the difference δ_i - δ_j can be estimated as

ln(n_{0,1}/n_{1,0})  (5.9)

where n_{1,0} is the number of persons with item i right but j wrong, and n_{0,1} is the number of persons with j right but i wrong.

When test data conform to the Rasch model, the relative difficulties of two items can be estimated in this way using any group of persons without regard to their abilities (or any other characteristics). By making multiple pairwise comparisons of this kind, it is possible to estimate the relative locations of a number of items on the measurement variable.
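The corresponding check for items can be sketched in a few lines. The following is illustrative only (Python; the difficulties 0.5 and -0.5 and the ability distribution are invented): the difference δ_i - δ_j is estimated from the counts n_{0,1} and n_{1,0}, separately for low-ability and high-ability subgroups.

```python
import numpy as np

rng = np.random.default_rng(1)

def rasch_prob(beta, delta):
    """P(correct) under Rasch's model for dichotomies."""
    return 1.0 / (1.0 + np.exp(-(beta - delta)))

delta_i, delta_j = 0.5, -0.5            # assumed item difficulties
betas = rng.normal(0.0, 1.5, 30000)     # a broad, invented mix of person abilities

x_i = rng.random(betas.size) < rasch_prob(betas, delta_i)
x_j = rng.random(betas.size) < rasch_prob(betas, delta_j)

def estimate_diff(mask):
    """ln(n_{0,1} / n_{1,0}): estimate of delta_i - delta_j from a subgroup (Equation 5.9)."""
    n10 = np.sum(x_i[mask] & ~x_j[mask])   # item i right, j wrong
    n01 = np.sum(~x_i[mask] & x_j[mask])   # item j right, i wrong
    return np.log(n01 / n10)

low, high = betas < 0, betas >= 0
print(estimate_diff(low), estimate_diff(high))   # both near delta_i - delta_j = 1.0
```

The low-ability and high-ability subgroups return the same item comparison, apart from sampling error, which is the invariance claimed in the text.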

Editor Note: This assumes that each item will be answered correctly by some respondents and

incorrectly by others. Again, this is not a feature of the model but is here a characteristic of the group

taking the test. It means that, to take advantage of the pairwise comparison feature of the model,

there must be some respondents who get even easy items wrong and some respondents who get even difficult items right.

The partial credit model applies Rasch's model for dichotomies to tests in which responses to items are recorded in several ordered categories labeled 0, 1, 2, ..., K_i. Each person n is imagined to have an ability β_n, and each item i is imagined to have a set of K_i parameters δ_i1, δ_i2, ..., δ_iK_i, each of which can be represented as a location on the variable being measured. For example, see Table 5.9, where δ_ik governs the probability of scoring k rather than k - 1 on item i (Table 5.10).

Table 5.9 Locations of Person n and an Item Threshold Parameter on the Measurement Variable (δ_ik and β_n shown as points on the line)


Table 5.10 Two Possible Outcomes When Person n Attempts Polytomous Item i

                                               Observable Event
B - A          Observation Opportunity           X        Y
β_n - δ_ik     Person n attempts item i          k        k - 1

β_n - δ_ik = ln(P_k/P_{k-1})  (5.10)


In polytomous test items, objective comparison (and thus objective measurement) continues to depend on the modeling of the relationship between two imagined locations on the variable and two observable events. This comparison is independent of everything else within the frame of reference, including other possible outcomes of the interaction of person n with item i. The conditioning out of other possible outcomes, to focus attention only on the two observable events that provide information about the relative locations of the two parameters of interest, is a fundamental feature of Rasch's model.

The conditioning on a pair of adjacent response alternatives has parallels with McFadden's (1974) assumption that a person's probability of choosing to travel by car rather than by bus should be independent of the availability of other options (e.g., train). McFadden refers to this as the assumption of independence from irrelevant alternatives. In a similar way, it is assumed in this application of Rasch's model that a person's probability of choosing or scoring k rather than k - 1 is independent of all other possible outcomes.

When a person responds to an item with several ordered response categories, he or she must make a choice taking into account all available alternatives. The partial credit model makes no assumption about the response mechanism underlying a person's choice. It simply proposes that if category k is intended to represent a higher level of response than category k - 1, then the probability of choosing or scoring k rather than k - 1 should increase monotonically with the ability being measured.

As for dichotomously scored items, if person n could have multiple independent attempts at item i, then the difference β_n - δ_ik could be estimated from proportions or counts of occurrences of k and k - 1:

ln(p_k/p_{k-1}) = ln(N_k/N_{k-1})  (5.11)

However, because multiple independent attempts at test items usually are not

possible, this method is not feasible in practice.

In the application of Rasch's model to tests in which responses to items are recorded in several ordered categories, every person has an imagined location on the variable being measured (Table 5.11).


Table 5.11 Locations of Two Persons m and n on the Same Measurement Variable (β_m and β_n shown as points on the line)

It follows from Equation 5.10 that if persons m and n attempt the same item and their attempts at that item are independent of each other, then the modeled difference between persons n and m is

β_n - β_m = ln(P_{k,k-1}/P_{k-1,k})  (5.12)

where P_{k,k-1} is the model probability of person n scoring k but m scoring k - 1, and P_{k-1,k} is the probability of person m scoring k but n scoring k - 1 on that item.

It can be seen that Equation 5.12, which applies for all values of k (k = 1, 2, ..., K_i), is Rasch's model, Equation 5.2 (Table 5.12).

If one of persons m and n scores k on an item, and the other scores k - 1, then the probability of it being person n who scores k is the same for every item and depends only on the relative abilities of persons m and n.

Because the modeled odds P_{k,k-1}/P_{k-1,k} are the same for every item, the difference β_n - β_m can be estimated as

ln(N_{k,k-1}/N_{k-1,k})  (5.13)

where N_{k,k-1} is the number of items on which person n scores k and m scores k - 1, and N_{k-1,k} is the number of items on which person m scores k and n scores k - 1.

Once again, when test data conform to Rasch's model, the relative abilities

of two persons can be estimated in this way using any selection of items. And

by making multiple pairwise comparisons of this kind, it is possible to estimate

the relative locations of a number of persons on the measurement variable.
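A small simulation makes this concrete. The sketch below is illustrative only: it generates partial credit responses for two persons (assumed abilities 0.8 and 0.0) across many items with invented thresholds, then estimates the ability difference from the counts N_{k,k-1} and N_{k-1,k} at k = 2.

```python
import numpy as np

rng = np.random.default_rng(2)

def pcm_probs(beta, d):
    """Partial credit model category probabilities for an item with thresholds d."""
    num = np.exp([np.sum(beta - d[:k]) for k in range(len(d) + 1)])
    return num / num.sum()

beta_n, beta_m = 0.8, 0.0                          # assumed abilities
items = rng.uniform(-1.5, 1.5, size=(15000, 3))    # invented thresholds delta_i1..delta_i3

k = 2                                              # compare scores k and k - 1
N_k_km1 = N_km1_k = 0
for d in items:
    xn = rng.choice(4, p=pcm_probs(beta_n, d))     # person n's score on this item
    xm = rng.choice(4, p=pcm_probs(beta_m, d))     # person m's score on this item
    if xn == k and xm == k - 1:
        N_k_km1 += 1
    elif xm == k and xn == k - 1:
        N_km1_k += 1

est = np.log(N_k_km1 / N_km1_k)
print(est)   # near beta_n - beta_m = 0.8
```

Every item contributes to the same log-odds, so the pooled count ratio recovers β_n - β_m without any reference to the items' threshold values; the simulation also shows how many responses are needed before the relevant (k, k-1) pairs accumulate.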

Editor Note: Similarly to the issue described earlier, this assumes that each person sometimes

scores k and sometimes scores k - 1. Again, this is not a feature of the model but a characteristic

of the respective response categories in the items in the test. It becomes apparent from this that

modeling polytomous item responses can require large amounts of data if all possible pairwise

comparisons are to be realized.

Table 5.12 Two Possible Outcomes of Persons n and m Attempting the Same Polytomous Item

                                                   Observable Event
B - A         Observation Opportunity                X            Y
β_n - β_m     Persons n and m independently         (k, k-1)     (k-1, k)
              attempt the same item


Table 5.13 Location of Two Polytomous Item Parameters on the Same Measurement Variable (δ_jk and δ_ik shown as points on the line)

In polytomous items, each item parameter δ_ik (k = 1, 2, ..., K_i) is a location on the variable being measured. The parameters δ_ik and δ_jk from two different items i and j can be compared on this variable (Table 5.13).

It follows from Equation 5.10 that if items i and j are attempted by the same person and this person's attempts at items i and j are independent of each other, then the modeled difference between parameters δ_ik and δ_jk is

δ_ik - δ_jk = ln(P_{k-1,k}/P_{k,k-1})  (5.14)

where P_{k,k-1} is the model probability of the person scoring k on item i but k - 1 on item j, and P_{k-1,k} is the probability of the person scoring k on item j but k - 1 on item i.

It can be seen that Equation 5.14, which applies for all values of k (k = 1, 2, ..., K_i), is Rasch's model, Equation 5.2 (Table 5.14).

In this comparison of items i and j, nothing was said about the ability of the person attempting them. This is because Equation 5.14 applies to every person. When a person attempts items i and j, the probability of the person scoring k on item i, given that he or she scores k on one item and k - 1 on the other, is the same for every person.

Because the modeled odds P_{k-1,k}/P_{k,k-1} are the same for every person, the difference δ_ik - δ_jk can be estimated as

ln(n_{k-1,k}/n_{k,k-1})  (5.15)

where n_{k,k-1} is the number of persons scoring k on item i but k - 1 on item j, and n_{k-1,k} is the number of persons scoring k on item j but k - 1 on item i.

Table 5.14 Two Possible Outcomes When the Same Person Attempts Polytomous Items i and j

                                                   Observable Event
B - A           Observation Opportunity              X            Y
δ_ik - δ_jk     Items i and j independently         (k-1, k)     (k, k-1)
                attempted by the same person


Once again, when test data conform to Rasch's model, the relative locations of the parameters of two items can be estimated in this way using any group of persons without regard to their abilities (or any other characteristics).

Editor Note: In keeping with the previous editor notes, this assumes that the polytomous items

elicit responses for each modeled score. Again, this is not a feature of the model but depends on the

characteristics of the items in the test and the group of respondents taking the test.


The partial credit model is one of a number of models that have been introduced for the analysis of ordered response category data. To understand similarities and differences between these models, it is useful to identify a couple of broad classes of models.

In some models proposed for the analysis of test data, in addition to a location β_n for each person n and a location δ_i for each item i, a discrimination parameter a_i is proposed for each item i. Among models for ordered response categories that include a discrimination parameter are Samejima's (1969) graded response model and Muraki's (1992) generalized partial credit model.

These models differ from the partial credit model in that they do not enable specifically objective comparisons as described by Rasch. The reason for this can be seen most easily in the two-parameter dichotomous item response theory (IRT) model:

a_i(β_n - δ_i) = ln(P_1/P_0)  (5.16)

If this model is applied to the attempts of two persons m and n at item i, then for the two-parameter IRT model we obtain:

a_i(β_n - β_m) = ln(P_{1,0}/P_{0,1})  (5.17)

where P_{1,0} is the probability of person n succeeding but m failing item i, and P_{0,1} is the probability of person m succeeding but n failing.

It can be seen from Equation 5.17 that the odds of person n succeeding

but m failing given that one of these two persons succeeds and the other fails

is not the same for all items. Rather, the odds depend on the discrimination

of the item in question.
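This can be checked with a few lines of arithmetic. In the sketch below (all values invented), the log-odds ln(P_{1,0}/P_{0,1}) for the same pair of persons works out to a_i(β_n - β_m), so it changes with the discrimination of the item used.

```python
import math

def p_correct(beta, delta, a):
    """Two-parameter IRT model: P(correct) with discrimination a."""
    return 1.0 / (1.0 + math.exp(-a * (beta - delta)))

beta_n, beta_m, delta = 1.0, 0.0, 0.0    # invented person and item locations
for a in (0.5, 2.0):
    p_n = p_correct(beta_n, delta, a)
    p_m = p_correct(beta_m, delta, a)
    p10 = p_n * (1.0 - p_m)              # n right, m wrong
    p01 = (1.0 - p_n) * p_m              # m right, n wrong
    print(a, math.log(p10 / p01))        # a * (beta_n - beta_m): varies with a
```

With β_n - β_m = 1, the printed log-odds are 0.5 for a = 0.5 and 2.0 for a = 2.0: the comparison of the two persons cannot be separated from the discrimination of the item that carried it.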

To compare the locations of persons m and n on the measurement variable, it is not possible to ignore the particulars of the items involved: the comparison of the locations of these two persons on the measurement variable is dependent not only on the two observable events (1,0) and (0,1) that they govern, but also on the details (viz., the discriminations) of the items these two persons take. For this reason, the two-parameter IRT model does not permit objective comparison in the sense described by Rasch.

A second class of models for ordered response categories includes as parameters cumulatively defined thresholds. Each threshold parameter is intended to separate response alternatives k - 1 and below from response alternatives k and above. L. L. Thurstone, who used the normal rather than the logistic function to model thresholds, referred to them as category boundaries.

The threshold notion is used as the basis for Samejima's graded response model. Her model also includes an item discrimination parameter, but that is ignored here for the sake of simplicity. Samejima's model takes the form:

β_n - δ_ik = ln[(P_k + P_{k+1} + ... + P_{K_i})/(P_0 + P_1 + ... + P_{k-1})]  (5.18)

In this model, the item threshold δ_ik governs the probability of scoring k or better on item i.

Terminology Note: This form of Samejima's logistic model is very different from the way it is presented in either Samejima's own work (see Chapter 4) or the polytomous IRT literature generally. The way it is presented here is in keeping with the approach in this chapter of describing models in terms of specific comparisons: in this case, β_n and δ_ik.

Table 5.15 compares Samejima's graded response model with the partial credit model for an item with four ordered response alternatives labeled 0, 1, 2, and 3.

From Table 5.15 it can be seen that the observable events in this model are compound events, for example:

Event X: Response in category 1 or 2 or 3

Event Y: Response in category 0

The consequence is that the elementary equations in this model are not independent, because

(P_1 + P_2 + P_3)/P_0 > (P_2 + P_3)/(P_0 + P_1) > P_3/(P_0 + P_1 + P_2)

As a result, thresholds are not independent, but are always ordered δ_i1 < δ_i2 < δ_i3.


Table 5.15 Comparison of Samejima's Graded Response Model and the Partial Credit Model

                               Samejima                                    Rasch
Elementary equations           β_n - δ_i1 = ln[(P_1 + P_2 + P_3)/P_0]      β_n - δ_i1 = ln[P_1/P_0]
(person n, item i, K_i = 3)    β_n - δ_i2 = ln[(P_2 + P_3)/(P_0 + P_1)]    β_n - δ_i2 = ln[P_2/P_1]
                               β_n - δ_i3 = ln[P_3/(P_0 + P_1 + P_2)]      β_n - δ_i3 = ln[P_3/P_2]

Events being compared          Compound (e.g., response in category        Simple (comparison of adjacent
                               1 or 2 or 3 rather than 0); each            response categories); each relates
                               relates to all available response           to adjacent response categories only
                               categories

Relationship of elementary     Dependent:                                  Independent (e.g., odds of response
equations                      (P_1 + P_2 + P_3)/P_0 >                     in category 1 rather than 0 is
                               (P_2 + P_3)/(P_0 + P_1) >                   independent of odds of response in
                               P_3/(P_0 + P_1 + P_2)                       category 2 rather than 1)

Implications for item          δ_i1 < δ_i2 < δ_i3                          δ's are unfettered and free to take
parameters                                                                 any value

Model for ordered              When brought together, the elementary       The elementary equations provide a
categories                     equations provide a model for ordered       model for ordered response categories
                               response categories in which the person     in which the person parameters can be
                               parameters cannot be conditioned out of     conditioned out of the estimation
                               the estimation procedure for the items      procedure for the items and vice versa

Specific objectivity           No                                          Yes

Samejima's elementary equations, when brought together, lead to the following expressions for the probabilities of person n scoring 0, 1, 2, and 3 on item i:

P_{ni0} = 1 - exp(β_n - δ_i1)/[1 + exp(β_n - δ_i1)]

P_{ni1} = exp(β_n - δ_i1)/[1 + exp(β_n - δ_i1)] - exp(β_n - δ_i2)/[1 + exp(β_n - δ_i2)]

P_{ni2} = exp(β_n - δ_i2)/[1 + exp(β_n - δ_i2)] - exp(β_n - δ_i3)/[1 + exp(β_n - δ_i3)]

P_{ni3} = exp(β_n - δ_i3)/[1 + exp(β_n - δ_i3)]

It is not possible to condition either set of parameters (the person parameters or the item thresholds) out of the estimation procedures for the other in this model.


In contrast, the elementary equations for the Rasch model (see Table 5.15) lead to the following expressions for the probabilities of person n scoring 0, 1, 2, and 3 on item i:

P_{ni0} = 1/Ψ

P_{ni1} = exp(β_n - δ_i1)/Ψ

P_{ni2} = exp(2β_n - δ_i1 - δ_i2)/Ψ

P_{ni3} = exp(3β_n - δ_i1 - δ_i2 - δ_i3)/Ψ

where Ψ is the sum of the four numerators.
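Both sets of expressions are easy to evaluate numerically. The sketch below uses invented values (β_n = 0.7; thresholds -1.0, 0.0, 1.2, ordered so the graded response probabilities are nonnegative): it computes the four category probabilities under each model and shows that, for the Rasch version, the adjacent-category log-odds return β_n - δ_ik exactly.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

beta = 0.7                 # person location (assumed)
d = [-1.0, 0.0, 1.2]       # delta_i1, delta_i2, delta_i3 (invented, ordered for the GRM)

# Graded response model: cumulative probabilities P(score >= k), then differences
cum = [logistic(beta - dk) for dk in d]
grm = [1.0 - cum[0], cum[0] - cum[1], cum[1] - cum[2], cum[2]]

# Partial credit model: exp(k*beta - delta_i1 - ... - delta_ik), normalized
num = [math.exp(sum(beta - d[j] for j in range(k))) for k in range(4)]
pcm = [v / sum(num) for v in num]

print(sum(grm), sum(pcm))                                   # both sums are 1
print([math.log(pcm[k] / pcm[k - 1]) for k in (1, 2, 3)])   # beta - d[k-1] for each k
```

The printed adjacent-category log-odds for the partial credit model are exactly β_n - δ_ik; no such simple relation holds for the graded response probabilities, whose elementary equations involve sums over all categories.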

In general, the partial credit model takes the form

P_{nik} = exp(kβ_n - δ_i1 - δ_i2 - ... - δ_ik)/Ψ,   k = 0, 1, ..., K_i   (5.19)

where Ψ is the sum of all K_i + 1 possible numerators (the k = 0 term being 1). It is possible to condition the person parameters out of the estimation procedures for the item parameters, and vice versa, in this model.
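The conditioning property can be illustrated directly: under the partial credit model, the probability of a response pattern given the total score does not involve β_n. The sketch below (two hypothetical items with invented thresholds) computes this conditional distribution for two very different abilities and obtains the same result.

```python
import math
from itertools import product

def pcm_probs(beta, d):
    """Partial credit model category probabilities for one item with thresholds d."""
    num = [math.exp(sum(beta - d[j] for j in range(k))) for k in range(len(d) + 1)]
    return [v / sum(num) for v in num]

deltas = [[-0.5, 0.8], [0.2, 1.0]]   # two hypothetical items, three categories each

def pattern_probs_given_total(beta, total):
    """Distribution of response patterns among those with the given total score."""
    joint = {}
    for pattern in product(range(3), repeat=2):
        if sum(pattern) == total:
            p = 1.0
            for x, d in zip(pattern, deltas):
                p *= pcm_probs(beta, d)[x]
            joint[pattern] = p
    z = sum(joint.values())
    return {pat: p / z for pat, p in joint.items()}

a = pattern_probs_given_total(beta=-1.0, total=2)
b = pattern_probs_given_total(beta=2.0, total=2)
print(a)
print(b)   # the same distribution: beta has been conditioned out
```

Because every pattern with the same total score carries the same factor exp(rβ_n), the person parameter cancels from the conditional distribution, which is the basis of conditional estimation of the item parameters.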

Conclusion

As a member of the Rasch family of item response models, the partial credit model is closely related to other members of that family. Masters and Wright (1984) describe several members of this family and show how each has as its essential element Rasch's model for dichotomies. Andrich's (1978) model for rating scales, for example, can be thought of as a version of the partial credit model with the added expectation that the response categories are defined and function in the same way for each item in an instrument. With this added expectation, rather than modeling a set of m_i parameters for each item, a single parameter δ_i is modeled for item i, and a set of m parameters (τ_1, τ_2, ..., τ_m) is proposed for the common response categories. To obtain the rating scale version of the PCM, each item parameter in the model is redefined as δ_ix = δ_i + τ_x. Wilson also has proposed a generalized version of the partial credit model (Wilson & Adams, 1993).
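The rating scale reparameterization can be sketched in a few lines (illustrative Python; the τ values and item locations are invented): the thresholds for each item are built as δ_ix = δ_i + τ_x and passed to a generic partial credit probability function.

```python
import math

def pcm_probs(beta, thresholds):
    """Generic partial credit model category probabilities for one item."""
    num = [math.exp(sum(beta - t for t in thresholds[:k]))
           for k in range(len(thresholds) + 1)]
    return [v / sum(num) for v in num]

taus = [-0.7, 0.0, 0.7]        # shared category parameters tau_1..tau_m (invented)
item_locations = [-1.0, 0.3]   # one delta_i per item (invented)
beta = 0.5

results = []
for delta_i in item_locations:
    thresholds = [delta_i + tau for tau in taus]   # delta_ix = delta_i + tau_x
    probs = pcm_probs(beta, thresholds)
    results.append((thresholds, probs))
    print(delta_i, probs)
```

Nothing changes in the response model itself; the rating scale model simply constrains the threshold structure, so any partial credit implementation can compute it.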

References

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.

Masters, G. N., & Wright, B. D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529–544.

McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), Frontiers in econometrics (pp. 105–142). New York: Academic Press.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.

Rasch, G. (1977). On specific objectivity: An attempt at formalising the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58–94.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded

scores. Psychometrika, Monograph Supplement 17.

Wilson, M., & Adams, R. J. (1993). Marginal maximum likelihood estimation for the ordered partition model. Journal of Educational Statistics, 18, 69–90.


Chapter 6

Understanding the Response Structure and Process in the Polytomous Rasch Model

David Andrich

The University of Western Australia

Editor Introduction: Rather than developing a specific polytomous IRT model, this chapter out-

lines and argues for the importance of the item response process that is modeled by all polytomous

IRT models in the Rasch family of models. Modeling response processes in this way is argued to have

an important advantage over the way the response process is modeled in what Hambleton, van der Linden, and Wells (Chapter 2) call indirect models and what are elsewhere referred to as divide-by-total (Thissen & Steinberg, 1986) or adjacent category (Mellenbergh, 1995) models.

The Rasch model for ordered response categories in standard formats was

derived from a sequence of theoretical propositions requiring invariance of

comparisons among item and among person parameter estimates. A model

with sufficient statistics is the consequence. The model was not derived to

describe any particular data (Andersen, 1977; Andrich, 1978; Rasch, 1961).

Standard formats involve only one response in one of the categories

deemed a priori to reflect increasing levels of the property and are common in

quantification of performance, status, and attitude in the social sciences. The

advantage of an item with more than two ordered categories is that, if the

categories work as intended, it gives more information than a dichotomous

item. Table 6.1 shows three common examples. Figure 6.1 shows a graphical counterpart using the first example in Table 6.1.

The ordered categories on the hypothesized continuum in Figure 6.1

are contiguous. They are separated on the continuum by successive points

termed thresholds. This is analogous to mapping a location of an object on a

line partitioned into equal units to obtain a physical measurement. Because

we do not have a fixed origin, there is no endpoint on the latent continuum of




Figure 6.1 for the extreme categories, only partitions of the continuum into

four contiguous categories, which requires three thresholds.

Terminology Note: Thresholds are also sometimes referred to as category boundaries. What this

chapter makes clear, and what should always be remembered, is that irrespective of which term is

used (category boundary or threshold), these points on the trait continuum, which separate ordered

response categories, are defined very differently in cumulative (e.g., Samejima logistic graded

response model) or adjacent category (e.g., Rasch) models. As a result, these category-separating

points have a different meaning in these two most common types of polytomous IRT models.

In elementary analyses of responses in formats such as those in Table 6.1, and by analogy to physical measurement, successive integers are

assigned to the categories. In more advanced analyses, a probabilistic model

that accounts for the finite number of categories and for sizes of the catego-

ries is applied. The Rasch model is one such model, and in this model the

successive categories are scored with successive integers.

In the examples in Table 6.1 it might be considered that if the same format is

used across all items, the sizes of the categories will also be the same across

all items. However, that is an empirical question, and it is possible that there

is an interaction between the content of the item and the response format,

so that the sizes of the categories are different for different items. It may also

be the case that different items have different formats that are natural to the

item with different numbers of categories, as, for example, when there are

different items in an achievement test with different maximum scores. The

Rasch model that has the same format across all items and has the same sized


categories is referred to sometimes as the rating scale model. The model with

different sized categories or with different numbers of categories is referred

to sometimes as the partial credit model. The difference, as will be seen, is only

a matter of parameterization, and at the level of a single person responding

to a single item, the models are identical. The focus of this chapter is on

the response of one person to one item that covers both parameterizations.

Therefore, the model will be referred to simply as the polytomous Rasch

model (PRM). The dichotomous model is simply a special case of the PRM,

but where it is necessary to distinguish it as a dichotomous model in the

exposition, it will be referred to explicitly as the dichotomous RM.


The PRM has two properties that, when first disclosed, were considered

somewhat counterintuitive: First, combining adjacent categories by summing the probabilities of responses in the categories, and in the related sense of summing their frequencies to form a single category, can only be done under very restricted circumstances (Andersen, 1977; Andrich, 1978; Jansen

in Figure6.1 that define the boundaries of the successive categories may take

on values that are not in their natural order. In part because these properties

are exactly opposite to those of the then prevailing model for ordered catego-

ries, that based on the work of Thurstone (Thurstone & Chave, 1929), they

have been ignored, denied, circumvented, or generated debate and misunder-

standings in the literature (Andrich, 2002). Known in psychometrics as the

graded response model (GRM), the latter model has been developed further by

Bock (1975), Samejima (1969, 1997, 1996), and McCullagh (1980).

One observation from these reactions to the model's properties is that in the development of response models for ordered category formats there is no a priori articulation of any criterion that data in ordered categories should satisfy; it seems it is simply assumed that if categories are deemed to be

ordered, they will necessarily operate that way. One factor that immediately

comes to mind as possibly violating the required order is respondents not

being able to distinguish between two adjacent categories. This has been

observed in data in Andrich (1979).

The theme of this chapter is that it is an empirical hypothesis whether or

not ordered categories work as intended. The chapter sets up a criterion that

must be met by data in an item response theory framework for it to be evi-

dent empirically that the categories are working as intended, and shows how

the PRM makes a unique contribution to providing the empirical evidence.

Meeting this requirement empirically is necessary because if the intended ordering of the categories does not reflect successively more of the property, then it puts into question the very understanding of what it means to have more of the property and of any subsequent interpretations from the data.

126 David Andrich

The chapter is not concerned with issues of estimation and the tests of fit,

which are well covered in the literature, but with better understanding the dis-

tinctive properties of the model itself, and the opportunities it provides for

the empirical study of ordered polytomous response formats.

It is stressed that the criterion for ordered categories working as intended

pertains to the data, and not to response models themselves irrespective of the

data. The importance of distinguishing between the properties of data from

procedures of models of analysis for ordered categories was recognized by

Downloaded by [The University of Edinburgh] at 10:42 26 September 2017

categories, and upon obtaining results for a particular data set noted: It

will be observed that the numerical values lie in the proper order for

increasing reaction. This is not a consequence of the procedure by which

they have been obtained, but a property of the data examined (Fisher, 1958,

p. 294). Any note from Fisher is worthy of substantial consideration and

study (Wright, 1980).

This chapter demonstrates that the properties of the PRM are compatible

with treating the operation of the ordered categories as an empirical hypoth-

esis. In particular, it is demonstrated that the model has the remarkable

property that from a set of structurally dependent responses in an ordered

category format, it recovers information that would arise from compatible,

experimentally independent formats. This permits inference regarding the

empirical ordering of categories. Thus, the chapter does not merely describe

the Rasch model for ordered categories from the perspective of modeling

data and for providing invariant comparisons, but presents a case that it is the

ideal model for characterizing the intended response process and for testing

empirically whether ordered categories are operating as required.

The chapter is organized as follows. We first describe an experiment

with independent responses at thresholds devised to assess unequivocally

the empirical ordering of the categories. We then analyze in detail three

response spaces, whose relationship needs to be understood. We also explain

why the probabilities and frequencies in adjacent categories cannot be

summed in the PRM except in special circumstances. Finally, we conclude

with a summary that includes a suggestion as to why over the long history

of the development and application of models for data in ordered categories,

and despite the lead from Fisher, no previous criteria have been articulated

that ordered categories must meet.

In preparation for developing and specifying a criterion for the empirical

ordering of categories, we consider some relationships between models and

data. These relationships are generally taken for granted, but they are made

explicit here because of their specific roles in relation to the PRM and the

theme of the chapter.

Understanding the Response Structure and Process in the Polytomous Rasch Model 127

One use of models is simply to summarize and describe data. Models describe

data in terms of a number of parameters that are generally substantially

smaller than the number of data points. It is of course necessary to check

the fit between the data and the model to be satisfied that the model does

describe the data.

A second use of models is to characterize the process by which data are gen-

erated. For example, the Poisson distribution arises from, among many other circumstances, the cumulative effect of many improbable events (Feller, 1968). A model that fits a set of data may then be taken to characterize the response process. If the data do not fit the model, then a question might be

asked about its characterization of the process. However, the fit of the data

to the model is only a necessary, not sufficient, condition to confirm that the

model characterizes the process in those data.

A third and much less conventional use of models is to express a priori

conditions that data are required to follow if they are to subscribe to some

principles. As indicated above, this is the case with the PRM. Following

a series of studies, Rasch articulated conditions of invariance of compari-

sons that data should have if they are to be useful in making quantitative

statements. Specifically, (1) the comparison between two stimuli should be

independent of which particular individuals were instrumental for the com-

parison; (2) symmetrically, a comparison between two individuals should

be independent of which particular stimuli within the class considered were

instrumental for comparison (Rasch, 1961, p. 332).

These conditions of invariance were not unique to Rasch: virtually identical conditions were articulated by Thurstone (1928) and Guttman (1950) before him. The distinctive contribution of Rasch, however, was that he expressed these conditions in terms of a probabilistic model. Rasch wrote his

conditions for invariance in a general equation, which, in the probabilistic

case for dichotomous responses, takes the form

$$P\{(y_{ni}, y_{nj}) \mid y_{ni} + y_{nj} = 1\} = \frac{e^{-y_{ni}\delta_i - y_{nj}\delta_j}}{e^{-\delta_i} + e^{-\delta_j}} \qquad (6.1)$$

where $Y_{ni}$, $Y_{nj}$ are random variables whose responses $(y_{ni}, y_{nj})$ take the values $\{0,1\}$, $\beta_n$ and $\delta_i$, $\delta_j$ are location parameters of person $n$ and items $i$ and $j$ respectively, and the right side of Equation 6.1 is independent of the person parameter $\beta_n$. As indicated already, this leads to a class of models with suf-

ficient statistics for the parameters, which generalizes to the PRM.

The key advantage of specifying the conditions in terms of a model is that

mathematical consequences, some of which might be initially counterintui-

tive, can be derived. This is the case with the Rasch model. However, when

the consequences follow mathematically from a specification as compelling

as that of making relatively invariant comparisons, then because they can

provide genuinely new insights that might not be immediately apparent intuitively, they should be understood. Another distinctive consequence of this specification is that whether data fit the model or otherwise is relevant to the case for the model.

Suppose that it is intended to assess the relative location of persons on some

construct that can be mapped on a linear continuum, for example, an achieve-

ment test. Items of successively increasing difficulty would be landmarks

of achievement (Thurstone, 1925) requiring successively increasing ability

for success.

Suppose further that the responses of persons to items are scored dichotomously. Given such responses, and in an arbitrary unit, the dichotomous RM (Fischer & Molenaar, 1995; Rasch, 1960, 1961; Wright, 1997; Wright & Panchapakesan, 1969) can be used to estimate the relative location of items on the continuum.

This model takes the form

$$P\{Y_{ni} = y\} = \frac{e^{y(\beta_n - \delta_i)}}{1 + e^{\beta_n - \delta_i}} \qquad (6.2)$$

where the variables are identical to those of Equation 6.1. The response func-

tion of Equation 6.2 for y = 1 is known as the item characteristic curve

(ICC). Three ICCs for the dichotomous RM are illustrated in Figure 6.2.

The data giving rise to the estimates in Figure 6.2 were simulated for 5,000 persons responding to six items independently (two sets of three items), with locations of -1.0, 0.0, and 1.0, respectively, in the first set. Only the responses of the first set of three items are shown in Figure 6.2. These data,

together with those of the second set, are used later in the chapter to illus-

trate the derivations.
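As an illustration (the sketch below is ours, not code from the chapter; function names are invented), Equation 6.2 can be computed directly, and the "parallelism" of Rasch ICCs shows up as a constant gap between any two items on the log-odds scale, whatever the person location:

```python
import math

def rasch_p(beta, delta):
    """P{Y = 1} under the dichotomous Rasch model (Equation 6.2 with y = 1)."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

def logit(p):
    """Log-odds transform of a probability."""
    return math.log(p / (1.0 - p))

# Item locations matching the simulation described in the text.
deltas = [-1.0, 0.0, 1.0]

# Parallel ICCs: for any two items, the curves differ by the constant
# delta_j - delta_i on the logit scale at every person location beta.
for beta in (-2.0, 0.0, 2.0):
    gap = logit(rasch_p(beta, deltas[0])) - logit(rasch_p(beta, deltas[2]))
    assert abs(gap - (deltas[2] - deltas[0])) < 1e-9
```

The assertion holds identically because the logit of Equation 6.2 is simply $\beta_n - \delta_i$, so differences between items do not involve the person parameter.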

The responses {0,1} are ordered; response y = 1 is deemed successful, and

the response y = 0 unsuccessful. In achievement testing $\delta_i$ is referred to as the difficulty of the item; in the tradition of Thurstone (Bock & Jones, 1968), it is termed a threshold: it is the point at which the person with the same location $\beta_n = \delta_i$ has an equal probability of being successful and unsuccessful:

$$P\{Y_{ni} = 1\} = P\{Y_{ni} = 0\} = 0.5 \qquad (6.3)$$

[Figure 6.2: Three parallel ICCs, Pr{Y = 1} plotted against person location (logits), for items with estimated locations -1.00, 0.03, and 0.94.]

In the dichotomous RM, the ICCs are parallel (Wright, 1997), which

is exploited in this chapter. We use the dichotomous RM to construct the


PRM. However, to better understand the PRM for more than two ordered

categories, the two-parameter logistic model (2PLM),

$$P\{Y_{ni} = y\} = \frac{e^{y\alpha_i(\beta_n - \delta_i)}}{1 + e^{\alpha_i(\beta_n - \delta_i)}} \qquad (6.4)$$

in which $\alpha_i$ is the discrimination parameter governing the slope of the ICC (Birnbaum, 1968), is also used in a later section of this chapter.

Notational Difference: The parameters in Equation 6.4 are presented in a format consistent with the rest of the chapter. Typically, however, the letter a is used to denote the discrimination parameter, which is denoted by $\alpha$ in Equation 6.4. Similarly, the 2PLM typically uses $\theta$ in place of $\beta$, and b is used for the item difficulty parameter, which is $\delta$ in Equation 6.4.
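To make the relation between the two models concrete, here is a short sketch (ours, not the chapter's code) of Equation 6.4, with the dichotomous RM recovered as the special case $\alpha_i = 1$:

```python
import math

def two_pl(y, beta, delta, alpha):
    """2PLM (Equation 6.4): probability of response y in {0, 1}."""
    z = alpha * (beta - delta)
    return math.exp(y * z) / (1.0 + math.exp(z))

def rasch(y, beta, delta):
    """Dichotomous RM as the 2PLM with discrimination fixed at 1."""
    return two_pl(y, beta, delta, alpha=1.0)

# The two response probabilities always sum to 1.
assert abs(two_pl(0, 0.7, -0.2, 1.3) + two_pl(1, 0.7, -0.2, 1.3) - 1.0) < 1e-12
```

Because $\alpha_i$ multiplies $(\beta_n - \delta_i)$, items with different discriminations produce ICCs with different slopes, which is precisely what the parallel-ICC property of the RM rules out.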

Another, more explicit view of the location of items of increasing difficulty,

equivalent to Thurstone's notion of landmarks, is that of Guttman (1950),

who enunciated an idealized deterministic response structure for unidimen-

sional items. The Guttman structure is central in understanding the PRM.

For I dichotomous items responded to independently, there are $2^I$ possible response patterns. These are shown in Table 6.2 for the case of three items. The top part of Table 6.2 shows the subset of patterns of responses according

to the Guttman structure. The number of these patterns is I + 1.

The rationale for the Guttman structure in Table 6.2 (Guttman, 1950)

is that for unidimensional responses across items, if a person succeeds on

an item, then the person should succeed on all items that are easier than that

item, and that if a person fails on an item, then the person should fail on all

items more difficult than that item. The content of the items with different

difficulties operationalizes the continuum.
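The counts just described are easy to verify with a short enumeration (a sketch of ours, not code from the chapter):

```python
from itertools import product

I = 3
patterns = list(product([0, 1], repeat=I))   # all possible response patterns
assert len(patterns) == 2 ** I               # 2^I = 8 for three items

# With items ordered from easiest to most difficult, a Guttman pattern has
# all of its 1s before any 0: success on an item implies success on all
# easier items, and failure implies failure on all harder items.
guttman = [p for p in patterns if all(a >= b for a, b in zip(p, p[1:]))]
assert len(guttman) == I + 1                 # (0,0,0), (1,0,0), (1,1,0), (1,1,1)
```

The remaining $2^I - I - 1$ patterns are the non-Guttman patterns listed in the lower part of Table 6.2.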

With experimentally independent items, it is possible that a deterministic

Guttman structure will not be observed in data. In that case, the dichoto-

mous RM may be used to locate items on a continuum. The dichotomous RM is a probabilistic counterpart of the Guttman structure, which is its deterministic limiting case (Andrich, 1985). Specifically, for any person, the probability of


Table 6.2

Items                  1   2   3   Total Score x   Pr{(y_n1, y_n2, y_n3) | x}

I + 1 = 4 Guttman Response Patterns
                       0   0   0        0               1
                       1   0   0        1               0.667
                       1   1   0        2               0.678
                       1   1   1        3               1

2^I - I - 1 = 4 Non-Guttman Response Patterns
                       0   1   0        1               0.248
                       0   0   1        1               0.085
                       1   0   1        2               0.235
                       0   1   1        2               0.087

success on an easier item will always be greater than the probability of suc-

cess on a more difficult item. This statement is evident from the parallel ICC

curves in Figure 6.2.

In the Guttman structure, as is evident in Table 6.2, the total score, $x = \sum_{i=1}^{I} y_i$, completely characterizes the response pattern. In the dichotomous RM, the total score plays a similar role, though probabilistically; it is a sufficient statistic for the person parameter (Andersen, 1977; Rasch, 1961). If item thresholds are ordered in difficulty, then, for a given total score, the Guttman pattern has the greatest probability of occurring (Andrich, 1985).

Furthermore, because of sufficiency, the probability of any pattern, given

the total score x, is independent of the persons ability. Thus, the probabilities

of the patterns of responses for total scores of 1 and 2, shown in Table 6.2, are given by

$$P\{(y_{n1}, y_{n2}, y_{n3}) \mid x = 1\} = \frac{e^{-y_{n1}\delta_1 - y_{n2}\delta_2 - y_{n3}\delta_3}}{e^{-\delta_1} + e^{-\delta_2} + e^{-\delta_3}} \qquad (6.5)$$

$$P\{(y_{n1}, y_{n2}, y_{n3}) \mid x = 2\} = \frac{e^{-y_{n1}\delta_1 - y_{n2}\delta_2 - y_{n3}\delta_3}}{e^{-\delta_1 - \delta_2} + e^{-\delta_2 - \delta_3} + e^{-\delta_1 - \delta_3}} \qquad (6.6)$$

respectively, both of which are independent of the person ability $\beta_n$ and are

special cases of Equation 6.1. These equations are the basis of conditional

estimation of the item parameters independently of the person parameters

(Andersen, 1973).
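The two properties just described, independence of the conditional probabilities from $\beta_n$ and the maximal probability of the Guttman pattern, can be checked numerically. The sketch below (ours, not the chapter's code) uses the generating item locations $-1.0$, $0.0$, and $1.0$ from the simulation described earlier:

```python
import math
from itertools import product

deltas = [-1.0, 0.0, 1.0]   # generating item locations from the text's simulation

def pattern_prob(y, beta):
    """Joint probability of response pattern y under the dichotomous RM."""
    prob = 1.0
    for yi, di in zip(y, deltas):
        z = math.exp(beta - di)
        prob *= (z if yi == 1 else 1.0) / (1.0 + z)
    return prob

def cond_prob(y, beta):
    """P{pattern | total score x} -- the form of Equations 6.5 and 6.6."""
    x = sum(y)
    same_x = [p for p in product([0, 1], repeat=len(y)) if sum(p) == x]
    return pattern_prob(y, beta) / sum(pattern_prob(p, beta) for p in same_x)

# Sufficiency: the conditional probability does not depend on the person.
assert abs(cond_prob((1, 0, 0), -2.0) - cond_prob((1, 0, 0), 2.0)) < 1e-9

# For x = 1 the Guttman pattern (1,0,0) is the most probable; the value is
# close to the 0.667 of Table 6.2, which is based on estimated locations.
assert cond_prob((1, 0, 0), 0.0) > cond_prob((0, 1, 0), 0.0) > cond_prob((0, 0, 1), 0.0)
```

The cancellation of $\beta_n$ happens because every pattern with the same total score contributes the same factor $e^{x\beta_n}$ to numerator and denominator.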

We now consider the design of an experiment in which the empirical order-

ing of the categories can be investigated. The key feature of this experiment

is the empirical independence among the judgments.


Response Structure Compatible With the Rasch Model

Fail (F) Inadequate setting: Insufficient or irrelevant information given for the story. Or,

sufficient elements may be given, but they are simply listed from the task

statement, and not linked or logically organized.

Pass (P) Discrete setting: Discrete setting as an introduction, with some details that also

show some linkage and organization. May have an additional element to

those listed that is relevant to the story.

Credit (C) Integrated setting: There is a setting that, rather than simply being at the

beginning, is introduced throughout the story.


Distinction (D) Integrated and manipulated setting: In addition to the setting being introduced

throughout the story, pertinent information is woven or integrated so that this

integration contributes to the story.

The labels of fail, pass, credit, and distinction have been added in this table for the purposes of

this chapter (Harris, 1991).

Consider the ordered category descriptors shown in Table 6.3 that were used in assessing the abilities of students to write a narrative in relation to a particular criterion. The responses among the categories are not independent, in the sense that if a response is made in one category, it is not made in any other category. The task is to construct a design compatible with Table 6.3 in which independence of judgments prevails.

Clearly from the descriptors for each category, there is an intended order-

ing in the quality of performance with respect to the feature of setting. We

take it that the category descriptors operationalize the writing variable to be

measured and describe the qualities that reflect successively better writing

on this continuum. We further note that the first, and least demanding, cat-

egory is a complement to the second category, and that the other categories

show increasing quality of writing with respect to setting. We shall see how

this complementarity of the first category to the others plays out.

The experimental design involves taking the descriptors in Table6.3 and

constructing independent dichotomous judgments at thresholds that are of

increasing difficulty.

Instead of one judge assigning an essay into one of four categories, con-

sider a design with three judges where each judge only declares whether each

essay is successful or not in achieving the standard at one of pass, credit,

or distinction. Thus, we have three independent dichotomous random vari-

ables. Although there are four categories, there are only three independent

responses. The F descriptor helps in understanding the variable in the region

of fail/pass, and helps the judge decide on the success or otherwise of the

essay at this standard. This is the role of the F descriptor in this design.

We now consider this experimental design, summarized in Table 6.4,

more closely.

The descriptors, as already indicated, describe the variable and what it takes

to reflect more of its property. The better the essay in terms of these characteris-

tics, the greater the probability that it will be deemed successful at each level.


Table 6.4

              Inadequate    Discrete      Integrated    Integrated and
              Setting (F)   Setting (P)   Setting (C)   Manipulated Setting (D)

Judgment 1    Not P         P             P             P
Judgment 2    Not C         Not C         C             C
Judgment 3    Not D         Not D         Not D         D

Two further specific cases may be highlighted in order to make clear the

operation of the experimental design. First, suppose the judge considers that

the essay does satisfy the P descriptor, but observes that it does not meet the

C and D descriptors. Then the judge should classify the essay as a success.

Second, suppose that the judge, still with respect to success at P, consid-

ers that an essay satisfies even the qualities described in C or D, or some

combination of these. Because of the structure of the descriptors as ordered

categories, which implies that C reflects more of the property to be measured

than P, and D even more of the property than C, the judge must classify it as

a success at P. The more of the properties of C and D an essay has, the greater

the probability of it being classified successful at the P level. Similar inter-

pretations follow for decisions at each of the other categories. It is stressed

that in each judgment it is the latent continuum that has been dichotomized

at each threshold, and not the categories as such.

In such an experimental design it would be required that the success rate at P

is greater than that at C, and that the success rate at C is in turn greater than

that at D. That is, it is required that it is more difficult to be successful at D

than at C, which in turn is more difficult than being successful at P.

If that were not the case (for example, if the success rate at C were the same as that at D for the same essays), then it would be inferred that the judges

do not distinguish between the two levels consistently. This could arise, for

example, if the judge at C were harsher than intended and the judge at D

were more lenient than intended. Thus, it may be that the experiment did not

work, and it would need to be studied further to understand why this is the

case. But such evidence is central to treating the ordering of the categories as

an empirical hypothesis to be tested in the data.
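A small simulation of this design illustrates the criterion. The sketch below is ours, not the chapter's; the threshold values (-1, 0, 1) and the normal distribution of essay quality are assumed purely for illustration. When the three independent judgments follow the dichotomous RM with ordered thresholds, the required ordering of success rates emerges in the data:

```python
import math
import random

random.seed(1)

# Assumed threshold difficulties for the three independent judgments.
thresholds = {"P": -1.0, "C": 0.0, "D": 1.0}

def judged_success(beta, delta):
    """One independent dichotomous judgment generated under the Rasch model."""
    p = 1.0 / (1.0 + math.exp(-(beta - delta)))
    return 1 if random.random() < p else 0

# Essay qualities drawn from an assumed standard normal distribution.
essays = [random.gauss(0.0, 1.0) for _ in range(5000)]

rates = {label: sum(judged_success(b, d) for b in essays) / len(essays)
         for label, d in thresholds.items()}

# Empirical criterion: success should be most frequent at P, least at D.
assert rates["P"] > rates["C"] > rates["D"]
```

Had the simulated judge at C been made harsher than the judge at D, the assertion would fail, which is exactly the kind of empirical evidence against the intended ordering that the text describes.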

Not only would we require that the thresholds increase in difficulty with

their a priori ordering, but we would want the same distance between them

irrespective of the location of the essays on the continuum. It seems unten-

able that these distances are different for essays of different quality. This uni-

formity of the relationships between these levels in probability is guaranteed

if the success rate curves at different levels of quality of the essays are paral-

lel, that is, if the dichotomous responses at the corresponding thresholds


follow the dichotomous RM. This is the essential justification for applying

the dichotomous RM to such data.

In summary, if P , C , D are the difficulties of the thresh