
ALTERNATIVE APPROACHES TO THE

ASSESSMENT OF ACHIEVEMENT
Evaluation in Education and
Human Services

Editors:

George F. Madaus, Boston College, Chestnut


Hill, Massachusetts, U.S.A.
Daniel L. Stufflebeam, Western Michigan
University, Kalamazoo, Michigan, U.S.A.

Previously published books in the series:

Kellaghan, T., Madaus, G., and Airasian, P.:


The Effects of Standardized Testing
Madaus, G. (editor):
The Courts, Validity, and Minimum
Competency Testing
Brinkerhoff, R., Brethower, D., Hluchyj, T., and
Nowakowski, J.:
Program Evaluation, Sourcebook/ Casebook
Brinkerhoff, R., Brethower, D., Hluchyj, T., and
Nowakowski, J.:
Program Evaluation, Sourcebook
Brinkerhoff, R., Brethower, D., Hluchyj, T., and
Nowakowski, J.:
Program Evaluation, Design Manual
Madaus, G., Scriven, M., and Stufflebeam, D.:
Evaluation Models: Viewpoints on Educational
and Human Services Evaluation
Hambleton, R., Swaminathan, H.:
Item Response Theory
Stufflebeam, D., Shinkfield, A.:
Systematic Evaluation
Nowakowski, J.:
Handbook of Educational Variables:
A Guide to Evaluation
Stufflebeam, D.:
Conducting Educational Needs Assessments
Abrahamson, Stephen:
Evaluation of Continuing Education
in the Health Professions
Cooley, William, Bickel, William:
Decision-oriented Educational Research
Gable, Robert K.:
Instrument Development in the
Affective Domain
Sirotnik, Kenneth A., Oakes, Jeannie:
Critical Perspectives on the Organization
and Improvement of Schooling
Wick, John W.:
School-based Evaluation: A Guide for
Board Members, Superintendents, Principals,
Department Heads and Teachers
Worthen, Blaine R., White, Karl R.:
Evaluating Educational and Social Programs
ALTERNATIVE APPROACHES TO THE
ASSESSMENT OF ACHIEVEMENT

Edited by

David L. McArthur
UCLA Graduate School of Education
Center for the Study of Evaluation


Kluwer Academic Publishers


Boston/ Dordrecht/ Lancaster
Distributors for North America:
Kluwer Academic Publishers
101 Philip Drive
Assinippi Park
Norwell, MA 02061 USA

Distributors for the UK and Ireland:
Kluwer Academic Publishers
MTP Press Limited
Falcon House, Queen Square
Lancaster LA1 1RN UNITED KINGDOM

Distributors for all other countries:
Kluwer Academic Publishers Group
Distribution Centre
Post Office Box 322
3300 AH Dordrecht
The Netherlands

Library of Congress Cataloging-in-Publication Data

Alternative approaches to the assessment of achievement.

(Evaluation in education and human services)
Includes bibliographies and index.
1. Educational tests and measurements.
2. Educational tests and measurements--Mathematical models.
3. Academic achievement--Testing--Mathematical models.
I. McArthur, David. II. Series.
[DNLM: 1. Educational Measurement--methods.
2. Evaluation Studies. LB 3051 A466]
LB3051.A567 1986   371.2'6   86-27483
ISBN-13: 978-94-010-7961-7   e-ISBN-13: 978-94-009-3257-9
DOI: 10.1007/978-94-009-3257-9

Copyright © 1987 by Kluwer Academic Publishers

Softcover reprint of the hardcover 1st edition 1987
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, mechanical, photocopying, recording,
or otherwise, without the prior written permission of the publisher, Kluwer Academic
Publishers, Philip Drive, Assinippi Park, Norwell, Massachusetts 02061, USA.
CONTENTS

Contributors vii

Preface ix

Introduction xiii
Bruce H. Choppin and David L. McArthur

Chapter 1: Educational Assessment: A Brief History
David L. McArthur

Chapter 2: Toward More Sensible Achievement Measurement: A Retrospective 21
Kenneth A. Sirotnik

Chapter 3: Analysis of Patterns: The S-P Technique 79
David L. McArthur

Chapter 4: The Rasch Model for Item Analysis 99
Bruce H. Choppin

Chapter 5: The Three-Parameter Logistic Model 129
Ronald K. Hambleton

Chapter 6: Measuring Achievement with Latent Structure Models 159
Rand R. Wilcox

Chapter 7: Generalizability Theory and Achievement Testing 187
Noreen M. Webb

Chapter 8: Analysis of Reading Comprehension Data 233
David L. McArthur

Chapter 9: A Comparison of Models for Measuring Achievement 249
J. Ward Keesling

Index 267
LIST OF CONTRIBUTORS

Bruce H. Choppin, PhD


deceased; Dr. Choppin was Director of the Methodology Research
Program at the Center for the Study of Evaluation
Graduate School of Education
University of California Los Angeles

Ronald K. Hambleton, PhD


Division of Educational Policy, Research and
Administration
University of Massachusetts at Amherst
Amherst, MA 01003

J. Ward Keesling, PhD


Computer Resources Special Project
Advanced Technology, Inc.
Camarillo, CA 93010

David L. McArthur, PhD


Center for Student Testing, Evaluation and Standards
University of California Los Angeles
Graduate School of Education
Los Angeles, CA 90024

Kenneth A. Sirotnik, PhD


Policy, Governance and Administration
College of Education
University of Washington
Seattle, WA 98195

Noreen M. Webb, PhD


Research Methods in Education
Graduate School of Education
University of California Los Angeles
Los Angeles, CA 90024

Rand R. Wilcox, PhD


Quantitative Area
Department of Psychology
University of Southern California
Los Angeles, CA 90089
PREFACE

Ingrained for many years in the science of educational assessment
were a large number of "truths" about how to make sense out of testing
results, artful wisdoms that appear to have held sway largely by force
of habit alone. Practitioners and researchers only occasionally
agreed about how tests should be designed, and were even further apart
when they came to interpreting test responses by any means other than
categorically "right" or "wrong." Even the best innovations were
painfully slow to be incorporated into practice.
The traditional approach to testing was developed to accomplish
only two tasks: to provide ranking of students, or to select
relatively small proportions of students for special treatment. In
these tasks it was fairly effective, but it is increasingly seen as
inadequate for the broader spectrum of issues that educational
measurement is now called upon to address.
Today the range of questions being asked of educational test data
is itself growing by leaps and bounds. Fortunately, to meet this
challenge we have available a wide panoply of resource tools for
assessment which deserve serious attention. Many of them have
exceptionally sophisticated mathematical foundations, and succeed well
where older and less versatile techniques fail dismally. Yet no
single new tool can conceivably cover the entire arena.
Our intent in this book is to convey to the reader a
well-balanced appreciation for the diversity of alternative
methodologies for educational assessment. To do this we have
attempted to set a stage which portrays both a cross-section of
methods and an intersection of the principles which underlie them.
Included are evaluations of both the strengths and weaknesses of every
approach, as well as completely worked-out examples which demonstrate
their present utility, and informed speculations about future
developments and applications to which they could be put.
Most mathematical models, at their heart, are quite simple; the
statistical tricks are in place solely to make them work efficiently.
Of the reader who may be a trifle rusty in statistics, we ask only
patience if at times the various formulae for number-crunching make
for tough reading. Readers experienced in designing and interpreting
educational tests may have an advantage on this score, though the book
is also designed to be useful at introductory levels. Moreover, the
essential ingredients of this book are not the formulae themselves but
the competing rationales which precede them, and the variety of
conclusions which follow.
This book is the culmination of a project first set in motion by
the late Bruce Choppin. Its overall design is his, and it was his
imagination which first brought the contributors together to consider
detailing alternative models for the assessment of educational
achievement using a common framework. His desire to give full and
fair scrutiny to contrasting viewpoints and his ability to draw
connections across rival hypotheses with remarkable clarity, together
with his wide command of theory and practice, were a strong influence
on all who had the chance to work with him. One intent in this book
has been to capture Bruce's original sense of marvel at how good
questions can be raised--and answered well--in the field of
educational assessment.
Bruce suffered an untimely death while on the first leg of a
consulting trip around the world. He continues to be missed by his
many friends and colleagues.
My heartfelt thanks go to each of the authors for their
participation, and their willingness to develop each of the chapters
in this book into documents which integrate their views along a common
strand of critical issues. We hope that the reader will find this
means of presenting the several alternative methods to be one which
elucidates the essential philosophical bases beneath the measurement
process. We have also strived to provide a consistent technical
vocabulary necessary to convey statistical concepts, and frequent
guideposts to applying the several models in actual practice.

Support for work presented in this document was originally
provided by the National Institute of Education, Grant Number
NIE-G-80-112, P1. The opinions expressed herein do not necessarily
represent the position or policy of the Institute, now known as the
Office of Educational Research and Improvement, and no official
endorsement should be inferred. Additional support was generously
provided by the Center for Student Testing, Evaluation and Standards
at UCLA. Special appreciation is extended to Katharine Fry, Ruth
Paysen, and Dan Reich for diligent typing of each contribution.

David L. McArthur
INTRODUCTION

Fragmentation is occurring within the psychometric field.
Dissatisfaction with the limitations inherent in traditional forms of
mental test analysis, typified by the norm-referenced multiple-choice
test of achievement, has led in recent years to a variety of new
psychometric theories and procedures. Novel applications have
stimulated new psychometric models and methods, each shaped to deal
with the specific problems of the particular situation. The last two
decades have seen the development of new types of tests, new scoring
methods, new procedures for item analysis, and entirely new
conceptions of the mental measurement process.
A marked characteristic of the professional literature on these
novel approaches to educational measurement is its parochialism. Many
of the most prolific psychometricians display little interest in
models other than their own, and there have been few attempts to
integrate separate results. The proponents of different models
have different objectives which on first impression appear to be
mutually exclusive. Implicitly or explicitly they make widely
differing assumptions; frequently they use the same words and phrases
to mean different things (e.g., reliability, accuracy, guessing, error
and true-score). Separate methodologies based on different models
have diverged to a point where it is no longer possible to identify a
mainstream approach to educational measurement, and where informed and
balanced advice on the full range of alternative approaches is almost
impossible to obtain.
The primary goal of this book is to build a convergent
perspective on psychometrics by describing, fairly and
comprehensively, the rationales that underlie several of the leading
alternative approaches to the measurement of achievement. Five
distinct approaches are set forth in this work, with particular
attention in each case to the philosophy, assumptions, mathematical
procedures, advantages and limitations of each. Our aim is to provide
the reader with sufficient underpinnings within each approach to be
able to sift out what each views as the critical facets of the
measurement process.
We have tended to describe these five approaches as alternative
models of achievement measurement, and in the strict scientific sense
this is true, though a comprehensive mathematical formulation is
easier for some than for others. This detailed documentation enables
us to clarify our understanding of the similarities and differences
among the models so that we might explore the consequences of adopting
one analytic strategy rather than another.
The models we consider all belong to the class of latent
structure models in that their analysis is directed to the inferential
classification of test items and/or persons, based on theoretical
assumptions concerning the structure of test data and conceptual
theories of measurement. Within this framework, the different models
may be seen as attempts at the solution of a variety of measurement
problems. Sometimes, even when the models or procedures appear
similar, the issues of central concern to one may not be of any
particular interest to the other. In the measurement area, we meet
substantial variations in philosophy and value systems as well as in
statistical referents.
A good example of this can be found in the recent controversy
over latent trait models. Although the Rasch one-parameter model and
the three-parameter model developed by Birnbaum and Lord appear to
have much in common mathematically (the Rasch model is a special case
of Lord's model), they are conceptually quite distinct. Lord began
some thirty years ago with large quantities of item response data
which he wished to understand and explain. For him it was important
to find a model that fitted his data. Today his disciples see in the
Rasch model one that does not fit their data well. It is founded on
assumptions (e.g., no guessing) which are often not met in practice,
and such workers rightly discard the Rasch model in favor of a more
complex and more expensive analysis that better meets their need to
"fit" data. On the other hand, during the 1950s Rasch was developing
his model not on the basis of actual test data but rather on a series
of principles and axioms for measurement systems that he extracted
from other realms of scientific experience. He did not create his
model primarily to explain existing data sets, but instead to form the
basis for constructing new measurement systems. For his followers,
test items must "fit" the model if they are to be useful for
measurement. The goal is to find items that do fit the model so as to
permit the construction of test instruments with the optimal
properties that Rasch described.
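The mathematical relation behind this controversy can be stated directly. The following is the standard textbook form of the two models, given here as an illustration rather than drawn from the chapters below. In the three-parameter model, the probability that a person of ability theta answers item i correctly is:

```latex
% Three-parameter logistic model (Birnbaum/Lord), with
% a_i = item discrimination, b_i = item difficulty, and
% c_i = a lower asymptote reflecting guessing:
P_i(\theta) = c_i + (1 - c_i)\,
    \frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}

% The Rasch model is the special case with c_i = 0 and a_i
% held constant across items (conventionally a_i = 1):
P_i(\theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}}
```

Fixing the discrimination and guessing parameters is precisely what yields the measurement properties Rasch sought, and precisely what costs the model its fit wherever guessing or unequal item discrimination is present in the data.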
Unfortunately, many psychometricians in each camp have not been
able to appreciate the distinction between these two approaches.
There have been public debates during which Item Response Theorists
have condemned the Rasch model for not "fitting" real data, while the
Rasch practitioners attack Item Response Theory for dealing with
models whose parameters cannot be satisfactorily estimated and which
do not satisfy the requirements for "objective measurement." The
criticisms are sound in themselves, but in major respects they do not
relate to the issues that the other side holds to be important.
There are other, though perhaps less dramatic, examples of where
different priorities and different concerns have led to some breakdown
in communication. For example, Generalizability Theory is directly
concerned with measures and with analyzing "errors" associated with
them. However, it treats these on a grouped basis as "error variance"
and makes certain assumptions about their distribution. By contrast,
latent trait theorists use "standard error of measurement" on an
individual basis, holding it to be a more useful concept than the
conventional one of test reliability. Latent trait theorists also
make assumptions about the distribution of these errors, and in
general these assumptions are not compatible with those of
Generalizability Theory. It is clear that both approaches have much
to offer for solving specific measurement problems, though their
areas of application are very different. How the two approaches may
be regarded as complementary has gone unappreciated. This book
addresses these and other questions. We hope to bring some
illumination to previously dark and shadowy areas where models for
educational measurement come together.
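The contrast between the two treatments of error can be made concrete. In their standard textbook forms (again offered only as an illustration), Generalizability Theory summarizes error at the level of pooled variance components, while latent trait theory conditions the standard error on the ability estimate itself:

```latex
% Generalizability coefficient: person variance relative to
% person variance plus a pooled error variance:
\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2(\delta)}

% Latent trait standard error of measurement: varies with
% \theta through the test information function I(\theta):
SE(\hat{\theta}) = \frac{1}{\sqrt{I(\theta)}}
```

The first is a single number for a population of examinees; the second differs at every point on the ability scale, which is one reason the two sets of distributional assumptions are so difficult to reconcile.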
The first chapter of this book analyzes the history of mental
testing to show how conventional item analysis procedures were
developed and the variety of pressures and constraints over the years
since the problems of educational assessment were first recognized.
It shows how dissatisfaction with conventional methods has led to
fragmentation and the range of distinct conceptual and methodological
approaches to achievement testing that now exist. The second chapter
analyzes in depth the act of measurement. It is a central and
continuing problem in mental testing, one which not merely illustrates
the shortcomings of the traditional approach, but highlights the
differences between the modern alternatives.
There follow five chapters treating each of the selected
approaches individually but according to a standard format. These
"models" are: the Student-Problem technique, which may be viewed as a
simplified form of Guttman scaling; two latent trait logistic models
(Rasch with one item parameter and Lord with three item parameters)
given separate treatment because of the philosophical and conceptual
contrasts cited above; a latent class model within which the
estimation of true scores is central; and Generalizability Theory
which, though somewhat different in scope from those mentioned
earlier, offers another mathematical model for test data and some
powerful statistical procedures for interpreting them.
An empirical demonstration of different models using common sets
of data is presented in the next-to-last chapter. While each of the
authors of the separate methods has provided illustrations, we believe
that applying the different approaches to a single dataset can assist
in building an appreciation of the ways in which the methods converge.
In conclusion, a summary chapter provides a synthesis of the
present state of alternative models for measuring achievement and
conclusions about their applicability to different measurement
problems.
CHAPTER ONE

Educational Assessment: A Brief History

Educational assessment in the Western world has a long but very
irregular history. Two distinct threads are woven together: the
first is the variety of settings in which testing itself came to have
practical use while the second is the incorporation of increasingly
rigorous methods by which to make sense out of the results of that
testing. This chapter sets out some of the key developments in each
of these two areas, from their origins until the dawn of contemporary
psychometrics. For extended periods of time even the simplest
improvements in either testing or statistics fought long and hard
against tradition and inertia. It took many generations for the two
threads to finally merge into a full-fledged science of educational
measurement.
Seven centuries ago, one English college was deemed remiss in its
responsibilities because its founder had determined that its recent
graduates "...expressed themselves very inaccurately in the learned
languages..." (Sylvester, 1970, p.19); the method of such
determination was not described. A tradition of oral examinations
slowly built up over several centuries; though the evidence is
ambiguous, the earliest written exam may have been in place around
1510. By the time Isaac Newton attended college about 1660, however,
the tradition had already fallen, and fallen hard. Not only were
there no examinations but frequently lecturers themselves simply never
showed up for classes. Then, in a series of major reforms the
faculties of both Oxford and Cambridge, recognizing the deteriorated
situation, decided to improve their curriculum and instituted regular
examinations in a variety of topics. The first clear indication that
written examinations were relied on for purposes of determining
admission and graduation was 1702 (Burt, 1936). The exams of this era
were almost exclusively essay questions emphasizing factual recall;
one extant example shows eight questions each in history and
geography, and six in grammar, primarily Latin and Greek.
In the education of the younger pupils, examinations began to
become more prevalent as textbooks for the grammar school came to be
formulated into distinct grade levels.
The new sequences of textbooks allowed a more precise grading to
be implemented in schools in various parts of Europe... Within the
school a further step was the development and application of the
principle of a child's regular progression through grades at
various intervals of about a year (Bower, 1975, p.419).
The Jesuits, finding that such a procedure fit perfectly into their
concept of the systematically ordered body of knowledge, took up the
idea with vigor, and it rapidly spread across Europe. Examinations
were seen as the best way to focus the students' academic efforts, and
to set criteria.
Yet, as one might expect, examinations were not necessarily
viewed favorably by either faculty or students. Students at Yale
rebelled outright in 1762, writing to the trustees that until there
was an actual law on record they were unanimous in their refusal to
participate (Smallwood, 1935). "Cramming" for exams was recognized as
a major deterrent to good scholarship as early as 1786. Various
faculty committees, hoping to aid things along, attempted to establish
testing systems that would represent a proper balance of content,
memory and skills. Balance proved exceedingly elusive to achieve as
academic politics also had to be considered, so most of these
well-intentioned attempts foundered. Testing continued and grew only
because the schools continued to demand some formal way to assign
ranks and evaluate student progress.
Meanwhile, in China, civil service examinations were already
several millennia old. The earliest proficiency testing on record
dates from 2200 B.C., and formal procedures for examinations date from
1115 B.C. Despite a concentration on literary rather than managerial
skills, the system was to serve as the model for a number of efforts
at standardizing competition for civil service positions in Europe and
the U.S. during the 19th century. But the tradition proved fragile;
the testing system was abolished in China in sweeping reforms at the
beginning of the 20th century, as western technologies and educational
orientations intruded into the Orient (DuBois, 1964, 1967).
In the United States, it was not until 1845, following Horace
Mann's advocacy of written examinations, that testing was incorporated
into educational practice in primary and secondary schools. The first
examination was administered in Boston that year, and the concept took
hold quickly (Englehart, 1950). Within thirty-five years, promotion
from grade to grade, once made solely by personal recommendation, came
to be judged by success or failure, scored as a percentage, on a
written exam. Mann's viewpoint of testing, while not using the word
"objective," carried with it a decided bias towards objective
measurement and standard tests (Ruch, 1929).
The earliest objective educational tests extant are found in a
book complete with questions, answers and scales, by an English
schoolmaster, dated 1864 (Kelley, 1927). The "Scale-book" assigned
points on a five-point scale to represent degrees of proficiency. For
writing skills it provided model essays as a working standard by which
to compare the student's essay. Sample questions were also given for
tests in spelling, mathematics, navigation, Scripture knowledge,
grammar, French, history, drawing and science. For the first time,
the idea was put forward that statistical evidence -- in this case
simple score totals -- could be useful in evaluating the results of
educational training (Thorndike, 1913).
Objective tests in spelling and arithmetic were in place in many
schools in the U.S. by the 1870's. Then, in 1881, the superintendent
of schools in Chicago, expressing a strong sentiment against testing
in particular (if not against science in general) decreed that
advancement of students was to be carried out only by direct
recommendations of teachers and principals. Testing for purposes of
grade-level advancement was prohibited. His viewpoint turned out to
be widely shared; suddenly, the impetus for "objective" measurement
and assessment was on the wane. "Examinations for grade promotions
were gradually abolished in all the best schools," claimed the
superintendent's successor. "The person best qualified to judge of a
child's ability to go on is his teacher... To say that any other test
is necessary is a travesty on common sense" (Bright, 1895,
pp.274-275). By the end of the nineteenth century, educational
testing had indeed achieved a bad name. Where testing was still used
for advancement, teachers were "teaching on the test," devoting weeks
of preparation and drill to extant editions of upcoming exams, and the
public was not pleased.
Meanwhile, a completely separate thread in the fabric of
educational measurement was developing in the field of statistics.
In response to political pressures in Europe during the sixteenth
century, the rudimentary techniques of the ancient Greeks and Romans
for census-taking were rediscovered. The methodical
collection of information was viewed as "an adjunct of the police
power, as a safeguard against the arbitrary and usurious demands of
the tax farmers, and finally as a means of information as to the
number and position in life of the population" (Meitzen, 1891, p.25).
The census came into regular use in the century that followed, though
a number of countries chose to keep the results classified as state
secrets. Indeed, the word "statist" dates from that era; it occurs in
Hamlet, Act V, Scene 2, written in 1602, and may have referred both to
statesman or politician and keeper of state records. The first
lectures in statistics date from around 1660; the word "statistic"
derives from the Latin and is possibly linked to the Germanic
discipline called "Staatenkunde," or the study of governments
(Pearson, 1978).

Though extensive developments in mathematics were being made
during this time, the setting out of facts and figures in the social
sciences was strictly limited to tabulations for census taking and
actuarial tables. In 1693, Edmund Halley, for whom the comet is
named, was asked by the Royal Society to help make sense of tables of
births and deaths collected over a four-year period; he appears to
have been the first to devise numerical methods for smoothing data and
developing predictions. Insurance companies made the obvious
connection and quickly took up the statistical technique for their own
purposes. Interest continued for another century in statistical
enumerations of data, and a number of state-run statistical bureaus
were established.
Science was changing, however, and as fresh theories about
politics, geography, economy and insurance sprang into being, the role
played by the old statistics dwindled in importance. The statistics
profession suffered a decline during the eighteenth century as the old
teachers passed away, and the task of statistics was being made
increasingly narrow.
In 1806 and 1807 a passionate controversy arose against the
brainless bungling of the number statisticians, the slaves of the
tables, the skeleton-makers of statistics... The opponents in the
sharp attack were themselves, however, not sufficiently clear how
new and precise limits for their science should be determined.
(Meitzen, 1891, pp.49-50).
An International Statistical Congress began meeting in 1853 to attempt
to resolve the confusion; in certain areas it showed a surprising
degree of success. Even though its members chose to stay out of most
issues of statistical theory, in 1869 one of their resolutions
declared:
...that in all statistical researches it is important to know the
number of observations...; the qualitative value is to be
measured by the divergences of the numbers among themselves as
well as the average...; it is desirable to calculate... the
average deviations (Meitzen, 1891, p.80).

Major developments in both the analysis of probabilities and
the usefulness of statistics for improving the human condition
occurred during the nineteenth century. Commentaries by Quetelet, a
Belgian astronomer, on the theory of probability and the art of
statistical analysis, were available in English by the middle of the
century. He was the first to apply the normal curve to social data.
His views on the idea of the "average man" and the importance of
education to society in general played an important role in the
awakening social consciousness of that time. Towards this general
end, the task of statistics was focused increasingly on systematic
reduction of data into tables and the elimination of unwanted noise.
These principles guided educational statistics into the twentieth
century: one of the early texts (Rugg, 1917) devoted most of its
efforts to tabulation, averages, frequencies and variabilities.
In large measure the collection and analysis of data at this time was
confined to tabulations of school attendance and costs. The
statistical societies of the day, being deeply embroiled in social
problems -- especially the relations of education to crime -- spent no
time at all on assessing educational achievement beyond such indices
as the ability to sign one's own name (Cullen, 1975).
By the middle of the nineteenth century, considerable progress
had been made in the analysis of experimental data from psychophysics
and agricultural research. The work of Weber and Fechner in the first
half of the century had begun to demonstrate how mental attributes
could be analyzed and compared, the earliest groundwork for mental
testing (Clarke & Clarke, 1985). Good experimental designs, including
factorial and split-plot techniques, were in place about 1850. Galton
spent time investigating how mathematical solutions might best be
developed for data from studies of Charles Darwin, building a number
of statistical tools in the process, and was the first to attempt to
measure characteristics of individual intelligence (1883). Galton was
among the first to recognize the importance of the statistical concept
of variance as it directly impacts the interpretation of meaning from
measurements of and by humans. But it was not until Pearson's
chi-square test (1900), and "Student's" t-test (1908) that appropriate
quantification of the significance of variations could be developed,
although the latter, surprisingly, took a number of years to catch on
(Cochran, 1976). Fisher's analysis of variance (1924) drew heavily on
these precursors but it too was relatively slow in being incorporated
into the repertoire of educational statisticians. Guilford's text on
fundamental statistics in 1942, for example, awards analysis of
variance fewer than nine pages, embedded in a chapter on reliability.
During the nineteenth century, a recognition emerged in the
context of examinations for purposes of scholarships and advancement,
that tests in mathematics were easier to design and score than tests
in subjects like philosophy. Tests became increasingly skewed towards
number problems, with a result that faculties outside of mathematics
became increasingly jealous of their colleagues' heightened
influence. A solution for the time being was to attempt systems of
score weighting (Latham, 1877), but it is not clear that these
suggestions clarified matters in any practical sense. Statistical
solutions seem not to have been proposed to this problem. It was not
until 1890 that the first study of test reliability appeared
(Edgeworth, 1890), though it was largely ignored until after World War
I. In the same year a short article by Cattell (1890) marked the
first time the words "mental tests" were used together, and that
phrase had a more rapid impact on the world.
Following Galton's lead, several investigators in Germany had
begun to develop tests of mental skill, and in the U.S. there was
extensive interest in the relationship of mental capacities to
physical characteristics, particularly because of heightened concern for
the treatment of criminals and the retarded. The American
Psychological Association set up a standing committee in 1895 to
consider cooperative efforts in mental and physical statistics; the
American Association for the Advancement of Science did likewise the
following year. Binet, who had been working on problems in mental
reasoning since 1886, wrote an important article in 1898 on the
utility of measurement and scaling in the appraisal of human
intelligence. However, two major studies of testing around this time
(Sharp, 1899; Wissler, 1901) concluded that the available tests used
for psychological research fell far short of their claims, in both
content and method (Peterson, 1925). In education, Rice's (1897)
study of spelling attainment, using a single list of 50 words in a
test administered to 30,000 children, was a pioneering effort at
improving test content; it circulated widely but gained few supporters
(Wilds & Lottich, 1970).
At the turn of the century, the first survey of school facilities
and educational practice was conducted, the College Entrance
Examination Board was established, and in 1902 the first course in
educational measurement was taught (by Thorndike at Columbia) (Meyer,
1965). Concurrently, interest in the concept of general intelligence
was being pursued by a number of investigators, following a suggestion
by Galton in 1883 and a study of intelligence in 1,500 children
conducted in 1891 (Burt, 1909). From the analysis of results from
this investigation, however, came the explicit realization that
statistical methods for educational measurement were in desperate need
of thoughtful improvement. Burt (1936) speculated that the consistent
failures of research investigations in the area of general
intelligence before the turn of the century

... were largely due to their reliance for discovery of
correlations upon mere inspections of the data they obtained,
instead of upon quantitative determination and mathematical
deduction (pp. 94-95).
During the first decade of the twentieth century, the growing
impetus for increased statistical rigor could be felt in several
areas; measurement successes in anthropometry and biology provided
much needed support for such improvement. In 1904, Toulouse and
Pieron's two volume manual on laboratory experiments included sections
on intelligence and the measurement of individual differences. In
1906 the American Psychological Association created a permanent
committee charged with evaluating requirements for standard laboratory
technique and appraising both group and individual tests with
attention to practical applications. Binet's test for intelligence
(1905) and Thorndike's book on mental measurement (1904) had
particular significance during this time, as did Spearman's (1904)
paper on general intelligence.
Once it was out, the idea of applying the scale concept not only
to intelligence but to achievement caught on quickly. By 1910, a vast
number of tests in skills like English, spelling, handwriting, reading
and arithmetic had emerged, followed closely by more technical
articles on topics like numerical analysis, standardization, validity
and correlations (Cremin, 1961).
In 1913, the National Council of Education released a major
report on standards and tests for measuring school efficiency, and
expressed this sentiment:

We are only beginning to have measurement undertaken in terms of
standards or units which are, or may become, commonly
recognized. Such standards will undoubtedly be developed by
means of applying scientifically derived scales of measurement to
many systems of schools. From such measurements it will be
possible to describe accurately the accomplishment of children
and to derive a series of standards ... (Strayer, 1913, p.4).

Graves, reviewing the condition of education in 1913, expressed the
sentiment that the application of mathematics to measurements in
education was one of the most significant movements of that time.
Developments in objective measurement of intelligence and
educational achievement came to a head with the crisis of the Great
War. Work in Germany on the screening of inductees had been in
progress since 1905; Binet and Simon (1910) were directly involved in
the application of intelligence testing in the French army. In the
U.S., Terman's revision of the Binet scale was completed in 1917, and
was applied soon thereafter to the testing of 1.7 million recruits. A
small team of educational psychologists produced the Army Alpha and
Beta tests of intelligence between May 28 and June 10, 1917; a copy of
the examiner's manual was en route to the printer within a month.
Immediately after the war, as the Army was selling thousands of unused
test blanks, both educational specialists and the public began to
realize that objective test results had to be taken with some degree
of caution. One of the originators of the Army Alpha expressed the
sentiment unambiguously: "We do not know what intelligence is and it
is doubtful if we will ever know what knowledge is" (Goddard, 1922,
quoted in Spring, 1972, p.5). Even so, by 1920, objective testing
formed the core of educational assessment methods. The Journal of
Educational Measurement devoted several issues in 1921 to a symposium
on scientific measurement of intelligence.
During the decade that followed, the objective measurement of
intelligence "swept America, and to a lesser extent Canada, like an
educational crusade ... The critics were numerous but few in comparison
to the advocates ..." (Marks, 1977, p.10). McCall's (1922) book on
educational measurement and Monroe's (1923) the following year were
the first to set out the procedures for a "new type examination," the
multiple-choice and true-false test. Principles of test construction
began to earn chapters of their own, and the variety of uses and
interpretations of tests was becoming a major consideration for many
educators (Monroe, 1945).
Despite an eloquent demurral to the contrary, N.R. Campbell
(1920) appears to have been the first to set out a fully consistent
description of the essential requirements for measurement. Using the
field of physics as illustration, Campbell defined measurement as the
assigning of numbers "... to represent all of those properties of
systems which can be ordered by a single transitive asymmetrical
relation" (p. 321). Additionally, in the same context he put forward
the idea that the sole task of measurement was to state numerical
laws, that is, build models of reality to derive a systematic pattern
of underlying scores from observations which include both a true value
and multiple sources of error. All of this he called "fundamental
measurement." None of Campbell's mathematical derivations were new;
indeed Yule's (1910) text on the theory of statistics, published a
decade earlier, goes a great deal further in its formal treatment of
numerical methods but is far less adroit in explaining how such
methods can be used to form a cohesive sense of reality.
Campbell's exposition, and his work the following year for
general readers (What is Science?, 1921), were far more accessible
than any previous study of the concepts underlying measurement. From
this time onward, the old tradition of statistics as a science of
tabulations faded quickly. Models of behavior began to play
increasingly dominant roles in the social science literature of the
time. Then came the first contributions to what is now recognized as
classical test theory: Thurstone's (1925, 1926, 1927) articles on the
scoring of individual performance, Ruch and DeGraff's (1926) study of
corrections for guessing, Ruch's (1929) The Objective or New Type
Examination, and Thurstone's (1931) The Reliability and Validity of
Tests. Classical test theory was launched.
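The theory these works launched rests on a compact model: each observed score is taken to be a true score plus an independent error, and reliability is the share of observed-score variance attributable to true scores. A minimal simulation (in Python, with arbitrary illustrative variances rather than data from any study cited here) makes the definition concrete:

```python
import random

random.seed(1)

# Classical model: observed score X = T + E, with error independent of T.
# Reliability is var(T) / var(X). With true-score SD 10 and error SD 5,
# the theoretical value is 100 / (100 + 25) = 0.8.
true_scores = [random.gauss(50, 10) for _ in range(100_000)]
observed = [t + random.gauss(0, 5) for t in true_scores]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))
```

With these illustrative variances the simulated ratio lands very close to the theoretical 0.8; the point is only that reliability, in the classical formulation, is a variance ratio rather than a property of any single score.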
The concept of reliability is illustrative of the uneven
development of educational measurement. Because of its basis in
correlational method, which was already well advanced at the turn of
the century, a number of technical articles appeared quite early
concerning the statistical nature of reliability indices. By the time
that a major study was launched in the late 1920's by the American
Historical Association's Commission on the Social Studies into the
nature of testing in social sciences education, reliability measures
were regarded as essential by technical specialists but uniformly
disregarded by practitioners. Under the counsel of Truman Kelley, a
large-scale investigation was conducted on the use of tests for
determining overall class and school performance, recognizing
individual skill levels and individual differences, and appraising
attitudes and personality traits. It also studied the utility of the
"new-type" tests. Both social science specialists and educational
measurement technicians were disappointed in the results of the
study. The former were not pleased by the tendency of short-answer
and multiple-choice tests towards fragmentary presentation of simple
facts in the curriculum and the deletion of shades of meaning. The
latter felt that lack of objective terms, which they saw as essential
for objective measurement, obviated the study's conclusions. Kelley's
feelings were sufficiently strong on these issues that he wrote a
15-page appendix to the study report entitled "A Divergent Opinion as
to the Function of Tests and Testing." In this he excoriated the
opponents of testing with more than a dozen carefully reasoned
arguments regarding appropriate scientific use of educational tests,
plus one or two direct strikes to the more emotional nature of the
argument:
The opponents [of testing] show no awareness of the tests of
reliability and validity of measuring instruments, either
judgments of teachers or of test scores. We believe that such
awareness is essential to any educator who is not content to work
in the dark (p. 489).
In the areas of reliability and validity, technical proofs were
available as early as 1910 providing a rationale behind error
measurement (Spearman, 1910) and giving a definition of true score
(Brown, 1910). But it was some time before either term was given
serious treatment in the standard texts. We find only a half-dozen
index entries on reliability and validity in Rugg's 1917 text, 18
entries between the two in Ruch's 1929 text, four chapters in his 1942
book, and eight full chapters devoted to the two topics in Gulliksen's
1950 text. By the 1930's there had accumulated a variety of
estimation procedures and a great deal of confusion of terms (Adams,
1936; Barthelmess, 1931; Lincoln, 1932). An attempt to resolve the
issues was made in Thurstone's small book on the topic in 1931,
another in Kuder and Richardson's (1937) key article on test
reliability, followed by Guttman's (1945) reformulation and Cronbach's
(1947) discussion of the several different kinds of reliability
coefficients. The American Psychological Association tried to resolve
the various discrepancies by committee in 1954. Tryon (1957) provided
an extensive historical review of the reliability concept and a
domain-sampling reformulation. "The extraordinarily massive
literature in this topic," wrote Cattell, "... has never lacked
statistical finesse and mathematical virtuosity" (1964, p.1), but he,
too, felt a need to suggest substantial redefinitions for both terms.
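The internal-consistency coefficients whose proliferation caused this confusion reduce to a short calculation. The sketch below (in Python, with invented dichotomous item responses, not data from any study cited here) computes the coefficient now known as alpha, k/(k-1) times one minus the ratio of summed item variances to total-score variance; for 0/1 items, as here, it coincides with Kuder and Richardson's formula 20.

```python
def coefficient_alpha(scores):
    """scores: one list of k item scores per examinee.
    Returns k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = len(scores[0])

    def variance(xs):  # population variance, as in the classical formulas
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [variance([person[i] for person in scores]) for i in range(k)]
    total_var = variance([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Five hypothetical examinees, four dichotomous items, in a perfect
# Guttman pattern (each examinee passes all items up to some point).
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = coefficient_alpha(data)
print(round(alpha, 3))
```

For this Guttman-patterned data the coefficient works out to 0.8; items that varied independently of one another would drive it toward zero, which is the sense in which the index measures internal consistency.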
The first formulations of a "sample-free" approach to mental
measurement are found in Lawley's (1943) analysis of item selection.
Although the problem had been explored tangentially by Horst (1936)
and more directly by Ferguson (1942), Lawley's paper was among the earliest
to seek mathematically rigorous justifications for the selection of
maximally discriminating test items, and to examine in some detail the
concept of item characteristic curves. Tucker (1946) provided further
statistical support. Gulliksen (1950) summarized the early work in
true score theory, and Lord explored the application of latent trait
theory to test theory with his doctoral dissertation, published as
Theory of Test Scores (1952). Interestingly, he felt that the actual
utility of large portions of the theory would be limited in practice
by the difficulty in obtaining sufficiently large data sets, and did
not publish about the problem again for another ten years. At that
point he presented an important development, the beta-binomial model
of the frequency distribution of true scores and raw scores (Keats &
Lord, 1962), and further refined the definition of true scores in Lord
and Novick (1968). Birnbaum explored certain statistical properties
of normal and logistic characteristic functions in 1957 and 1958, but
few other papers on this topic appeared until the 1960's.
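The item characteristic curve at the center of this line of work gives, for each item, the probability of a correct response as a function of latent ability. A minimal sketch of the two-parameter logistic form Birnbaum studied, with invented parameter values:

```python
import math

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve:
    probability of a correct response at ability theta, for an item
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# For an item of average difficulty (b = 0) and moderate discrimination
# (a = 1), the probability rises smoothly with ability and passes
# through 0.5 exactly at theta = b.
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc(theta, a=1.0, b=0.0), 3))
```

Raising the discrimination parameter steepens the curve around the difficulty point, which is why maximally discriminating items, in Lawley's sense, correspond to large values of a.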
The sentiment has been expressed more than once that the science
of educational testing has progressed fitfully. Despite a plethora of
statistical developments, "most of the major theoretical and technical
distinctions and most of the principal points of dispute were in
existence by 1925" (Thompson & Sharp, 1983). This includes such
diverse topics as item analysis, test bias, the nature vs. nurture
arguments regarding individual intelligence, and at least the
beginnings of factor structure explanations for educational
assessment.
In the chapter which follows, we explore the merging of several
independent developments into the present-day science of achievement
measurement.

REFERENCES

Adams, H. F. (1936). Validity, reliability and objectivity.


In W.R. Miles (Ed.), Psychological studies of human variability.
Psychological Monographs, 57, 329-350.

Barthelmess, H. M. (1931). The validity of intelligence test


elements. New York: Teachers College.

Binet, A. (1898). La mesure en psychologie individuelle. Revue


Philosophique, 46, 113-123.

Binet, A., & Simon, T. (1905). Methodes nouvelles pour le


diagnostic scientifique des etats inferieurs de l'intelligence.
L'Annee Psychologique, 12, 163-190.

Binet, A., & Simon, T. (1910). Sur la necessite d'une methode


applicable au diagnostic des arrierees militaires. Annales
Medico-psychologique.

Birnbaum, A. (1957). An efficient design and use of tests of a


mental ability for various decision making problems. Series
Report No. 58-16, USAF School of Aviation Medicine, Randolph,
TX.

Birnbaum, A. (1958). On the estimation of mental ability. Series


Report No.15, USAF School of Aviation Medicine, Randolph, TX.

Bower, J. (1975). A history of western education. Civilization of


Europe sixth to sixteenth century, vol. 2. New York: St.
Martin's Press.

Bright, O. T. (1895). Changes - wise and unwise - in grammar and


high schools. Journal of Proceedings and Addresses, St. Paul:
National Education Association.

Brown, W. (1910). Some experimental results in the correlation of


mental abilities. British Journal of Psychology, 3, 296-322.

Brown, W., & Thompson, G. H. (1940). The essentials of mental
measurement. Cambridge: Cambridge University Press.

Brownless, V. T., & Keats, J. A. (1958). A retest method of studying


partial knowledge and other factors influencing item response.
Psychometrika, 23, 67-73.

Burt, C. L. (1909). Experimental tests of general intelligence.


British Journal of Psychology, 3, 94-177.

Burt, C. L. (1936). The use of psychological tests in England.


In Sadler, M. E., Abbott, A., Burts, C. L., Burns, C. D.,
Hartog, P., Spearman, C., and Stirk, S. D. Essays on
examinations. London: Macmillan.
16

Campbell, N.R. (1920). Physics, the elements. Cambridge: Cambridge


University Press.

Campbell, N.R. (1921). What is science? London: Methuen.

Cattell, J. M. (1890). Mental tests and measurements. Mind, 15,
373-381.

Cattell, R. B. (1964). Validity and reliability: A proposed more


basic set of concepts. Journal of Educational Psychology,
55, 1-22.

Clarke, A. D. B., and Clarke, A. M. (1985). Mental testing: origins,


evolution, and present status. History of Education, 14,
263-272.

Cochran, W. G. (1976). Early development of techniques in


experimentation. In D. B. Owen (Ed.), On the history of
statistics and probability. New York: Dekker.

Cremin, L. (1961). The transformation of the school. New York:


Knopf.

Cronbach, L. J. (1947). Test "reliability": Its meaning and
determination. Psychometrika, 12, 1-16.

Cronbach, L. J. (1975). Five decades of public controversy over


mental testing. American Psychologist, 30, 1-14.

Cullen, M. J. (1975). The statistical movement in early Victorian


Britain: The foundations of empirical social research. New
York: Barnes & Noble.

DuBois, P. H. (1964). A test-dominated society: China, 1115 B.C.-


1905 A.D. ETS Invitational conference on testing problems.
Princeton: Educational Testing Service.

DuBois, P. H. (1970). A history of psychological testing. Boston:


Allyn and Bacon.

Edgeworth, F. Y. (1890). The element of chance in competitive


examinations. Journal of the Royal Statistical Society,
53, 460-475, 644-673.

Englehart, M. D. (1950). Examinations. In W. S. Monroe (Ed.),


Encyclopedia of educational research. New York: MacMillan.

Ferguson, G. A. (1942). Item selection by the constant process.
Psychometrika, 7, 19-29.

Fisher, R. A. (1956). Statistical methods and scientific inference.


New York: Hafner.
17

Fisher, A. (1915). The mathematical theory of probabilities and its


application to frequency curves and statistical methods. New
York: Macmillan.

Freeman, F. N. (1926). Mental tests: Their history, principles and


applications. Boston: Houghton Mifflin.

Goodenough, F. L. (1936). A critical note on the use of the term


'reliability' in mental measurement. Journal of Educational
Psychology, 27, 173-178.

Graves, F. P. (1950). A history of education in modern times. New


York: MacMillan.

Guilford, J.P. (1936). Psychometric methods. New York: McGraw-Hill.

Gulliksen, H. (1961). Measurement of learning and mental abilities.


Psychometrika, 26, 93-107.

Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.

Guttman, L. (1944). A basis for scaling qualitative data. American


Sociological Review, 9, 139-150.

Hambleton, R. K., & Cook, L. L. (1977). Latent trait models and


their use in the analysis of educational test data. Journal of
Educational Measurement, 14, 75-96.

Horst, A. P. (1936). Item selection by means of a maximizing
function. Psychometrika, 1, 229-244.

Keats, J. A., & Lord, F. M. (1962). A theoretical distribution for


mental test scores. Psychometrika, 27, 59-72.

Kelley, T. L. (1927). Interpretation of educational measurements.


Yonkers-on-Hudson, NY: World.

Kelley, T. L., & Krey, A. C. (1934). Tests and measurements in the


social sciences. Report of the Commission on the Social Studies,
American Historical Association, Part IV. New York: Charles
Scribner's Sons.

Kuder, G. F., & Richardson, M. W. (1937). The theory of the


estimation of test reliability. Psychometrika, ~, 151-160.

Latham, H. (1877). On the action of examinations considered as a


means of selection. Cambridge: Deighton Bell.

Lawley, D. N. (1943). On problems connected with item selection and


test construction. Proceedings of the Royal Society of
Edinburgh, 61, Section A, 273-287.

Lazarsfeld, P. F. (1960). Latent structure analysis and test


theory. In H. Gulliksen and S. Messick (Eds.), Psychological
scaling: Theory and applications. New York: Wiley.
18

Lazarsfeld, P. F. (1950). The logical and mathematical foundations


of latent structure analysis. In S. A. Stouffer, et al. (Eds.),
Measurement and prediction. Princeton: Princeton University
Press.

Lentz, T. F., Hirshstein, B., & Finch, F. H. (1932). Evaluation of
methods of evaluating test items. Journal of Educational
Psychology, 23, 344-350.

Lincoln, E. A. (1932). The unreliability of reliability


coefficients. Journal of Educational Psychology, 23, 11-14.

Lord, F. M. (1952). A theory of test scores. Psychometric


Monographs, No. 7.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental


test scores. Reading, Mass.: Addison-Wesley.

Macready, G. B., & Dayton, C. M. (1977). The use of probabilistic


models in the assessment of mastery. Journal of Educational
Statistics, 2, 99-120.

Marks, R. (1977). Providing for individual differences: A history


of the intelligence testing movement in North America.
Interchange, 2, 3-16.

McCall, W. A. (1922). How to Measure in Education. New York:


Macmillan.

Meitzen, A. (1891). History, theory, and technique of statistics.


Annals of the American Academy of Political and Social Science,
1, 1-237.
Meyer, A. E. (1965). Educational history of the western world. New
York: McGraw Hill.

Monroe, W. S. (1923). Introduction to the theory of educational


measurement. Boston: Houghton Mifflin.

Monroe, W. S. (1945). Educational measurement in 1920 and 1945.


Journal of Educational Research, 38, 334-340.

Pearson, E. S. (Ed.) (1978). The history of statistics in the 17th


and 18th centuries, against the changing background of
intellectual, scientific and religious thought. Lectures by Karl
Pearson. London: Charles Griffin.

Peterson, J. (1925). Early conceptions and tests of intelligence.


Yonkers-on-Hudson, NY: World.

Quetelet, M.A. (1849). Letters on the theory of probabilities.


London: Charles and Edwin Layton.

Rasch, G. (1960). Probabilistic models for some intelligence and
attainment tests. Copenhagen, Denmark: Nielsen & Lydiche.
19

Rice, J. M. Forum, 1897. Cited in E. H. Wilds & K. V. Lottich,


(1970). Foundations of modern education. New York: Holt,
Rinehart & Winston.

Ruch, G. M. (1929). The objective or new-type examination, an


introduction to educational measurement. Chicago: Scott,
Foresman.

Ruch, G. M., & deGraff, M. H. (1926). Corrections for chance and


"guess" vs. "do not guess" instructions in multiple-response
tests. Journal of Educational Psychology, 17, 368-375.

Rugg, H. O. (1917). Statistical methods applied to education.


Boston: Houghton Mifflin.

Sadler, M. E. (1936). The scholarship system in England to 1890 and


some of its developments. In Sadler, M. E., Abbott, A., Burts,
C. L. Burns, C. D., Hartog, P., Spearman, C., and Stirk, S. D.
Essays on examinations. London: MacMillan.

Sharp, S. E. (1899). Individual psychology: A study in


psychological method. American Journal of Psychology, 10,
329-391.

Smallwood, M. L. (1935). An historical study of examinations and


grading systems in early American universities. Cambridge:
Harvard University Press (Harvard Studies in Education vol. 24).

Spearman, C. (1910). Correlation calculated from faulty data.


British Journal of Psychology, 3, 271-295.

Spearman, C. (1904). General intelligence objectively determined and


measured. American Journal of Psychology, 15, 201-292.

Spring, J. H. (1972). Psychologists and the war: The meaning of
intelligence and the Alpha and Beta tests. History of Education
Quarterly, 12, 3-15.

Strayer, G. D. (1913). Standards and tests for measuring the


efficiency of schools or systems of schools. Bulletin, United
States Bureau of Education. Whole No. 13: Report of the
Committee of the National Council of Education.

Sylvester, D. W. (1970). Educational documents 800-1816. London:


Methuen.

Thompson, G. O. B., & Sharp, S. (1983). History of mental testing.


In T. Husen & N. Postlethwaite (Eds.), International encyclopedia
of education: Research and studies, Oxford: Pergamon Press.

Thorndike, E. L. (1904). An introduction to the theory of mental and


social measurements. New York: Science Press.

Thorndike, E. L. (1913) • Educational measurements of fifty years


ago. Journal of Educational Psychology, 4, 551-552.
20

Thurstone, L. L. (1925). A method of scaling psychological and


educational tests. Journal of Educational Psychology, 16,
433-451.

Thurstone, L. L. (1931). The reliability and validity of tests. Ann


Arbor: Edwards.

Thurstone, L. L. (1926). The scoring of individual performance.


Journal of Educational Psychology, 17, 446-457.

Thurstone, L. L. (1927). The unit of measurement in educational
scales. Journal of Educational Psychology, 18, 505-524.

Toulouse, E., & Pieron, H. (1904). Technique de psychologie


experimentale. Paris: Doin.

Tryon, R. C. (1957). Reliability and behavior domain validity:


Reformulation and historical critique. Psychological Bulletin,
54, 229-249.

Tucker, L. R. (1946). Maximum validity of a test with equivalent


items. Psychometrika, 11, 1-13.

Wilds, E. H., & Lottich, K. V. (1970). Foundations of modern


education. New York: Holt, Rinehart & Winston.

Wissler, C. (1901). The correlation of mental and physical tests.


Psychological Review, Monograph Supplement vol. 8, No. 16.

Wright, B.D. (1984). Despair and hope for educational measurement.


Contemporary Education Review, 3, 281-288.

Yerkes, R. M. (Ed.) (1921). Psychological examining in the United
States Army. Memoirs of the National Academy of Sciences, 15,
1-890.

Yule, G.U. (1910). An introduction to the theory of statistics.


London: Charles Griffin.
