ASSESSMENT OF ACHIEVEMENT
Evaluation in Education and
Human Services
Editors:
Edited by
David L. McArthur
UCLA Graduate School of Education
Center for the Study of Evaluation
Contributors
Preface
Introduction
Bruce H. Choppin and David L. McArthur
Index
LIST OF CONTRIBUTORS
David L. McArthur
INTRODUCTION
complex and more expensive analysis that better meets their need to
"fit" data. On the other hand, during the 1950s Rasch was developing
his model not on the basis of actual test data but rather on a series
of principles and axioms for measurement systems that he extracted
from other realms of scientific experience. He did not create his
model primarily to explain existing data sets, but instead to form the
basis for constructing new measurement systems. For his followers,
test items must "fit" the model if they are to be useful for
measurement. The goal is to find items that do fit the model so as to
permit the construction of test instruments with the optimal
properties that Rasch described.
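The passage above does not write the model out. As a minimal sketch, assuming the usual published form of the Rasch model (the probability of a correct response is a logistic function of person ability minus item difficulty), the following illustrates one property its practitioners value: the comparison between two items comes out the same whichever person is used to make it. The function names and the numeric values are illustrative assumptions, not taken from the text.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model (assumed standard form)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def item_log_odds_gap(ability, difficulty_a, difficulty_b):
    """Difference in log-odds of success on item A versus item B for one person."""
    return (log_odds(rasch_probability(ability, difficulty_a))
            - log_odds(rasch_probability(ability, difficulty_b)))


if __name__ == "__main__":
    # Hypothetical abilities and item difficulties, chosen only for illustration.
    for ability in (-2.0, 0.0, 1.5):
        gap = item_log_odds_gap(ability, difficulty_a=-0.5, difficulty_b=1.0)
        print(f"ability={ability:+.1f}  log-odds gap between items={gap:.3f}")
```

Under this form the gap equals the difference in item difficulties and does not depend on ability. Models with additional item parameters generally lose this invariance, which is one way to read the disagreement between the two camps.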
Unfortunately, many psychometricians in each camp have not been
able to appreciate the distinction between these two approaches.
There have been public debates during which Item Response Theorists
have condemned the Rasch model for not "fitting" real data, while the
Rasch practitioners attack Item Response Theory for dealing with
models whose parameters cannot be satisfactorily estimated and which
do not satisfy the requirements for "objective measurement." The
criticisms are sound in themselves, but in major respects they do not
relate to the issues that the other side holds to be important.
There are other, though perhaps less dramatic, examples of where
different priorities and different concerns have led to some breakdown
in communication. For example, Generalizability Theory is directly
concerned with measures and with analyzing "errors" associated with
them. However, it treats these on a grouped basis as "error variance"
and makes certain assumptions about their distribution. By contrast,
latent trait theorists use "standard error of measurement" on an
individual basis, holding it to be a more useful concept than the
conventional one of test reliability. Latent trait theorists also
make assumptions about the distribution of these errors, and in
general these assumptions are not compatible with those of
Generalizability Theory. It is clear that both approaches have much
to offer for solving specific measurement problems, though their
areas of application are very different. How the two approaches may
be regarded as complementary has gone unappreciated. This book
admission and graduation was 1702 (Burt, 1936). The exams of this era
were almost exclusively essay questions emphasizing factual recall;
one extant example shows eight questions each in history and
geography, and six in grammar, primarily Latin and Greek.
In the education of the younger pupils, examinations began to
become more prevalent as textbooks for the grammar school came to be
formulated into distinct grade levels.
The new sequences of textbooks allowed a more precise grading to
be implemented in schools in various parts of Europe ... Within the
school a further step was the development and application of the
principle of a child's regular progression through grades at
various intervals of about a year (Bower, 1975, p.419).
The Jesuits, finding that such a procedure fit perfectly into their
concept of the systematically ordered body of knowledge, took up the
idea with vigor, and it rapidly spread across Europe. Examinations
were seen as the best way to focus the students' academic efforts, and
to set criteria.
Yet, as one might expect, examinations were not necessarily
viewed favorably by either faculty or students. Students at Yale
rebelled outright in 1762, writing to the trustees that until there
was an actual law on record they were unanimous in their refusal to
participate (Smallwood, 1935). "Cramming" for exams was recognized as
a major deterrent to good scholarship as early as 1786. Various
faculty committees, hoping to aid things along, attempted to establish
testing systems that would represent a proper balance of content,
memory and skills. Balance proved exceedingly elusive, as
academic politics also had to be considered, so most of these
well-intentioned attempts foundered. Testing continued and grew only
because the schools continued to demand some formal way to assign
ranks and evaluate student progress.
Meanwhile, in China, civil service examinations were already
several millennia old. The earliest proficiency testing on record
dates from 2200 B.C., and formal procedures for examinations date from
U.S., Terman's revision of the Binet scale was completed in 1917, and
was applied soon thereafter to the testing of 1.7 million recruits. A
small team of educational psychologists produced the Army Alpha and
Beta tests of intelligence between May 28 and June 10, 1917; a copy of
the examiner's manual was en route to the printer within a month.
Immediately after the war, as the Army was selling thousands of unused
test blanks, both educational specialists and the public began to
realize that objective test results had to be taken with some degree
of caution. One of the originators of the Army Alpha expressed the
sentiment unambiguously: "We do not know what intelligence is and it
is doubtful if we will ever know what knowledge is" (Goddard, 1922,
quoted in Spring, 1972, p.5). Even so, by 1920, objective testing
formed the core of educational assessment methods. The Journal of
Educational Measurement devoted several issues in 1921 to a symposium
on scientific measurement of intelligence.
During the decade that followed, the objective measurement of
intelligence "swept America, and to a lesser extent Canada, like an
educational crusade ... The critics were numerous but few in comparison
to the advocates ..." (Marks, 1977, p.10). McCall's (1922) book on
educational measurement and Monroe's (1923) the following year were
the first to set out the procedures for a "new type examination," the
multiple-choice and true-false test. Principles of test construction
began to earn chapters of their own, and the variety of uses and
interpretations of tests was becoming a major consideration for many
educators (Monroe, 1945).
Despite an eloquent demurral to the contrary, N.R. Campbell
(1920) appears to have been the first to set out a fully consistent
description of the essential requirements for measurement. Using the
field of physics as illustration, Campbell defined measurement as the
assigning of numbers "... to represent all of those properties of
systems which can be ordered by a single transitive asymmetrical
relation" (p. 321). Additionally, in the same context he put forward
the idea that the sole task of measurement was to state numerical
REFERENCES