Vous êtes sur la page 1sur 24

~~ffl~~ti~~ and Program Planning, Vol. 2, pp. 67-90, 1979 0149-7189/79/01006?-24$02.

Printed in the U.S.A. All rights reserved. Copyright Q 1979 Pergamon Press Ltd



Northwestern University


It is a spec~l characteristic ofall modern societies that we consciously decide on and plan
projects designed to improve our social systems. It is our universal predica~nent that our
projects do not always have their intended effects. Very probably we all share in the ex-
perience that often we cannot tell whether the project had any impact at all, so complex
is the flux of historical changes that would have been going on anyway, and so many are
the other projects that might be expected to modify the same indicators.
It seems inevitable that in most countries this common set of problems, combined
with the obvious relevance of social science research procedures, will have generated a
methodology and methodological specialists focused on the problem of assessing the im-
pact of planned social change. It is an assumption if this paper that, in spite of differences
in the forms of government and approaches to social planning and problem-solving,
much of this methodolo~ can be usefully shared - that social project evaluation meth-
odology is one of the fields of science that has enough universality to make sc~enti~c
sharing mutually beneficial. As a part of this sharing, this paper reports on program im-
pact assessment methodology as it is developing in the United States today.

The most common name in the U.S. for this developing Evaluation Quarterly, Studies in Educational Evaluation,
specialty is evaluation research, which now almost always Evaluation and the Health Professions). The number of
implies program evaluation (even though the term “evalu- newsletters is even larger, and program evaluation articles
ation” has a well-established usage in assessing the ade- appear in most of the many applied social science and
quacy of persons in the execution of specific social roles). policy journals.
Already there are many anthologies and textbooks in this The participants in this new area come from a variety
area (e.g., Anderson, Ball, & Murphy, 197.5; Bennett & of social science disciplines. Economists are well repre-
Lumsdaine, 1975; Boruch & Riecken, 1974; Caporaso & sented. Operations research and other forms of “scientific
ROOS, 1973;Caro, 1971,1977;Cook et al., 1978;Coursey management” contribute. Statisticians, sociologists, psy-
et al., 1977; Fairweather, 1967; Fairweather & Tornatzky, chologists, political scientists, social service administration
1977; Glaser, 1973; Glass, 1976;Guttentag& Saar, 1977; researchers, and educational researchers ail participate.
Rivlin, 1971; Rossi & Williams, 1972; Rutman, 1977; The similarity of what they all end up recommending and
Schulberg, Sheldon, & Baker, 1969; Segall, 1977; Streu- doing testifies to the rapid emergence of a new and sepa-
ning & Guttentag, 1975; Suchman, 1967; Weiss, 1972a, rate discipline that may soon have its own identity di-
1972b; Wholey et al., 1970). There are a number of newly vorced from this diverse parentage.
founded journals specifically devoted to program evalua- Since my own disciplinary background is social psy-
tion (e.g., Evaluation and Program Planning, Evaluation, chology, I feel some need to comment on the special con-

Reprinted, with minor revisions and additions, from Gene M. Lyons, (Ed.), Social Research and Public Policies, Hanover, New Hampshire:
University Press of New England, 1975, with the permission of the publisher. Preparation of this paper has been supported in part by
grants from the Russell Sage Foundation and the National Science Foundation Grants Numbers GSOC-7103704-03 and BNS76-23920. An
earlier version of this paper was presented to the Visegrad, Hungary, Conference on Social Psychology, May 5-10,1974.


tribution that this field can make even though I regard Having said something about U.S. evaluation research
what I am now doing as “applied social science” rather in its professional aspects, I would like to spend the rest
than social psychology. First, of all of the contributing of the time telling about the problems we have encountered
disciplines, psychology is the only one with a laboratory so far and the solutions we have proposed. It is with regret
experimental orientation. and social psychologists in par- that I note that we have not progressed very far from my
ticular have had the most cxperiencc in extending labora- earlier review (Campbell, 1969b): however, 1 will attempt
tory experimental design to social situations. Since the to provide new illustrations.
model of experimental science emerges as a major altcrna- The focus of what follows is so much on troubles and
live in reducing eqLlivt)~ality about what caused what in problems that I feel the necessity of warning and apologiz-
program evaluation ifi-om Suchman’s. 1967. founding ing. If we set out to be ~~l~t~l~~~)lo~ists,we set out to be
book onward). this is a very important contribution of experts in problems and, hopefully, inventors of solutions.
both general orientation and specific skills. The need for such :I specialty would not exist except for
Second, psychologists are best prepared with appropri- the problems. From this point of view. no apology is
ately critical and analytic measurement concepts. Econo- needed. But I would also like to be engaged in recruiting
mists have an admirably skeptical book on monetary I-ec- members to a new profession and in inspiring them to in-
ords (Morgenstern. 1963), but most economists treat the vest great effort in activities with only long-range payoff.
figures available as though they were perfect. Sociologists For potential recruits, or for those already engaged in it, a
have a literature on interviewer bias and underenumeration, full account of our difficulties, including the problem of
but usually treat census figures as though they were un- gerting our skills used in ways we can condone. is bound
biased. Psychology, through its long traditi~~tl of building to be discouraging. We cannot yet promise a set of profes-
and criticizing its own measures, has developed concepts sional skills guaranteed to make an irnp~~rtant difference.
and rnat~ie~~~ati~ai models of reliability and validity which In the few SLKXXSS stories of beneficial programs unequi-
arc greatly needed in program evaluation, even though vocally evaluated, society has gotten by. or could have
they are probably not yet adequate for the study of the gotten by, without our help. We still lack instances of im-
cognitive growth of groups differing in ability. The con- portant contributions to societal innovation which were
cept of hilrs, as developed in the older psychophysics in abetted by our methodological skills. The need for our
in the distinction between “constant error” (bias) and specialty. and the specific recommendations we make,
“variable error” (unreliability). and the more recent work must still be justified by promise rather than by past
in personality and attitude measurement on response sets. performance. They are a priori in that they represent
halo eifects, social desirability factors, index correlations. extrapolations into a new context not yet cross-validated
methods factors. etc. (Cronbach, 1946, 1950; Edwards, in that context. I myself believe that the importance of
1957: Jackson & Messick, 1967: CampbeIl, Siegman & the problem of social system reality-testing is so great that
Rees, 1967: Campbell & Fiske, 1959) is also very impot-- our efforts and professional ~orn?llitlneilt are fully justi-
tant and is apt to be missing in the concept of validity if fied by promise. I believe that the problems of equivocali-
that is defined in terms of a correlation coefficient with a ty of evidence for program effectiveness arc so akin to the
criterion. All this, of course, is nut our monopoly. indeed, general problems of scientific inference that our extrapo-
it is the qualitative sociologists who do studies of the con- lations in to recommendations about program evaluation
ditions under which social statistics get laid down (e.g., procedures can be. with proper mutual criticism, well-
Becker, et al., 1968, 1970; Douglas. 1967; Garfinkel, grounded, Nonetheless, motivated in part by the reflexive
1967; Kitsuse & Cicourel. 1963; Beck. 1970) who best consideration that promising too much turns out to be a
provide the needed skepticism of such measures as suicide major obstacle lo meaningful program evaluation, I aim,
rates and crime rates. But even here, it is psychologists however ambivalently. to present an honestly pessimistic
who have had the depth of experience sufficient to distin- picture of the problems.
0truish degrees of validity lving between total ~~~)rtl~essliess A second problem with the problem focus comes from
and utter perfection. and who have been willing to use, the fact that inevitably many of the ~net~~od~~I~~gi~a1 dif-
albeit critically, measures they knew were partially biased G&ties are generated from the interaction of aspects of
and errorful. the political situation surrounding programs and their
Third, many of the methodological problems of social evaluation. Thus the U.S. experience in evaluation re-
project implementation and impact measurement have to search. combined with the focus on problems, may make
do with the social psychology of the interaction between my presentation seem inappropriately and tactlessly crit-
citizens and projects, or between citizens and modes of ical of the U.S. system of government and our current po-
experimental implementation (randomization, control litical climate. Ideally, this would be balanced out by the
groups), or between citizens and the special measurement sharing experiences from many nations. In the absence of
procedures introduced as a part of the evaluation. These this. I can only ask you not to be misled by this by-pro-
are specia1 problems of attitude formation and of the ef- duct of the otherwise sensible approach of focusing on
fects of attitudes on responses. and are clearly within the problems.
domain of our intended competence. It is in the area of rnetliod(~l(~gi~al problems generated
Assessing the Impact of Planned Social Change 69

by political considerations that the assumptions of univer- novations. Whether this independence of ideology is pos-
sality for our methodological principles will fail as we sible, and even if it is, how it is to be integrated with the
compare experiences from widely differing social, eco- social scientist’s duty to participate in the development of
nomic, and political systems. You listeners will have to more authentic human consciousness and more humane
judge the extent, if any, to which the same politico-meth- forms of social life are questions I have not adequately
odological problems would emerge in your program-evalu- faced, to say nothing of resolved.
ation settings. Most international conferences of scientists In what follows, I have grouped our problems under
can avoid the political issues which divide nations by con- three general headings, but have made little effort to keep
centrating on the scientific issues which unite them as sci- the discussion segregated along these lines. First come
entists. On the topic of assessing the impact of planned so- issues that are internal to our scientific community and
cial change we do not have that luxury. Even so, I have would be present even if scientist program evaluators were
hopes for a technology that would be useful to any politi- running society solely for the purpose of unambiguous
cal system. I believe that much of the methodology of program evaluation. These are: 1) Metascientific issues:
program evaluation will be independent of the content of and 2) Statistical issues. The remaining heading involves
the program, and politically neutral in this sense. This interaction with the societal context. Under 3) Political
stance is augmented by emphasizing the social scientist’s system problems, I deal with issues that specifically in-
role in helping society keep track of the effects of changes volve political processes and governmental institutions,
that its political process has initiated and by playing down some of which are perhaps common to all large, bureau-
the role of the social scientist in the design of program in- cratic nations, others unique to the U.S. setting.


Quantitative vs. Qualitative Methodology are poor from an experimental design point of view; even
A controversy between “qualitative” vs. “quantitative” under well-controlled situations, experimentation is a pro-
modes of knowing, between geisteswissenschaftlich and foundly tedious and equivocal process; experimentation is
naturwissenschaftlich approaches, between “humanistic” too slow to be politically useful; etc. All these are true
and “scientific” approaches is characteristic of most of enough, often enough, to motivate a vigorous search for
the social sciences in the U.S.A. today. In fields such as alternatives. So far, the qualitative-knowing alternatives
sociology and social psychology, many of our ablest and suggested (e.g., Weiss & Rein, 1969, 1970; Guttentag,
most dedicated graduate students are increasingly opting 1971, 1973) have not been persuasive to me. Indeed, I be-
for the qualitative, humanistic mode. In political science, lieve that naturalistic observation of events is an intrinsic-
there has been a continuous division along these lines. ally equivocal arena for causal inference, by qualitative or
Only economics and geography seem relatively immune. quantitative means, because of the ubiquitous confound-
Inevitably, this split has spilled over into evaluation re- ing of selection and treatment. Any efforts to reduce that
search, taking the form of a controversy over the legiti- equivocality will have the effect of making conditions
macy of the quantitative-experimental paradigm for pro- more “experimental. ” “Experiments” are, in fact, just
gram evaluation (e.g., Weiss & Rein, 1969, 1970; Gutten- that type of contrived observational setting optimal for
tag, 1971, 1973; Campbell, 1970, 1973). The issue has causal inference. The problems of inference surrounding
not, to be sure, been argued in quite these terms. The program evaluation are intrinsic to program settings in on-
critics taking what 1 am calling the humanistic position are going social processes. Experimental designs do not cause
often well-trained in quantitative-experimental methods. these problems and, in fact, alleviate them, though often
Their specific criticisms are often well-grounded in the ex- only slightly so.
perimentalist’s own framework: experiments implement- In such protests, there often seems implicitly a plea for
ing a single treatment in a single setting are profoundly the substitution of qualitative clairvoyance for the in-
ambigous as to what caused what; there is a precarious ri- direct and presumptive processes of science. But while 1
gidity in the measurement system, limiting recorded out- must reject this aspect of the humanistic protest, there are
comes to those dimensions anticipated in advance; process other aspects of it that have motivated these critics in
is often neglected in an experimental program focused on which I can wholeheartedly join. These other criticisms
the overall effect of a complex treatment, and thus know- may be entitled “neglect of relevant qualitative contextual
ing such effects has only equivocal implications for pro- evidence” or “overdependence upon a few quantified ab-
gram replication or improvement; broad-guage programs stractions to the neglect of contradictory and supplement-
are often hopelessly ambiguous as to goals and relevant ary qualitative evidence.”
indicators; changes of treatment program during the course Too often quantitative social scientists, under the influ-
of an ameliorative experiment, while practically essential, ence of missionaries from logical positivism, presume that
make input-output experimental comparisons uninterpret- in true science, quantitative knowing replaces qualitative,
able; social programs are often implemented in ways that common-sense knowing. The situation is in fact quite dif-

cation of quantifiable atomic particles, in reverse of the

logical atomist’s intuition, still too widespread (Campbell,
1966). Such an epistemology is fallibilist, rather than
/ \\ / clairvoyant. emphasizing the presumptive error-proneness
of such pattern identification, rather than perception as a
\ // \ dependable ground of certainty. Hut it also recognizes this
-0 -b fallible. intuitive, presumptive, ordinary perception to be
the only route. This is nor to make perceptions uncriticiz-
FIGURE 1. able (Campbell, 1969a), but they are, as we have seen,
only criticizable by trusting many other perceptions of
ferent. Rather, science depends upon qualitative. com- the same epistemic level.
mon-sense knowing even though at best it goes beyond it. If we apply such an epistemology to evaluation re-
Science in the end contradicts some items of common search. it immediately legitimizes the “narrative history”
sense, but it only does so by trusting the great bulk of the portion of most reports and suggests that this activity be
rest of common-sense knowledge. Such revision of com- given formal recognition in the planning and execution of
mon sense by science is akin to the revision of common the study. rather than only receiving attention as an after-
sense by common sense which, paradoxically. can only be thought. Evaluation studies are uninterpretable without
done by trusting more common sense. Let us consider as this. and most would be better interpreted with more.
an example the Miiller-Lyer illustration (Figure 1). That this content is subjective and guilty of perspectival
If you ask the normal resident of a “carpentered” culture biases should lead us to better select those who are invited
(Segall. et al., 1966) which lint is longer, a or 1). he will rc- to record the events. and to prepare formal procedures
ply 0. If you supply him with a ruler. or allow him to use whereby all interested participants can offer additions and
the edge of another piece of paper as a makeshift ruler. he corrections to the official story. The use of professionally
will eventually convince himself that lie is wrong. and that trained historians, anthropologists, and qualitative soci-
line a is longer. In so deciding he will t-eject as inaccurate ologists should be considered. The narrative history is an
one product of visual perception by trusting a larger set 01‘ indispensable part of the final report. and the best qualita-
other visual perceptions. He will also have made man). pre- tive methods should he used in preparing it.
sumptions. inexplicit for the most part, including the as- WC should also recognize that participants and observ-
sumption that the lengths of the lines have remained rela- ers have been evaluating program innovations for centuries
tlvely constant durmg the measurement process. that the without benefit of quantification or scientific method.
ruler was rigid rather than elastic, that the heat and mois- This is the common-seilse knowing which our scientific
ture of his hand have not changed the ruler’s length in suck evidence should build upon and go beyond. not replace.
a coincidental way as to produce the differenl measure- But it is usually neglected in cluantitative evaluations. un-
ments. expanding it when approaching line a and ccmtract- less a few supporting anacdotcs haphar.ardly collected are
ing it when approaching line h, etc. inclu&d. Under the epistemology 1 ndwcate. one should
l.et us take as another example a scientific paper con- attempt to systematically tap all the qualitative common-
taining theory and experimental results demonstrating the sense pi-ogram critiques and evaluations that have been
particulate nature of light. in dramatic contrast to coni- generated among tllc pi-ogram staff, program clients and
mon-sense understanding. Or a scientific paper dcmoil- their families. and community observers. While quanti-
stratinp that what ordinary perception deems “solids” arc tative procedures such as cluestionnaircs and I-atmg scales
in fact open lattices. Were such a papel- to limit itself to will often hc introduceI. at this stage for reasons of con-
mathematical symbols and purely scientific terms. omit- venience in collecting and summarizing. non-quantitative
ting ordinal-y language. it would fail IO communicate to methods of collection and compiling should also bc con-
another scientist in such a way as to enable him to replicate sidered. such as hierarchically organized discussion groups.
the experiment and verify the observations. Instead. the Whet e such evaluaticlns at-e contrary to the quantitative Ire-
few scientific terms have been imbedded in ;I discourse of suits. the quantitative results should he regartlcd as sus-
elliptical prescientific ordinal-y language which the reader pect until the reasons for the discrepancy are well under-
is presumed to (and presumes to) understand. And in the stood. Neither is infallible. of course. But for many of us.
laboratory work of the original and replicating laboratory. what needs to be emphasized is that the quantitative re-
a coiii11ioi1-seiise. prescientific language and perception of sults may he as mistaken as the qualitative After all. in
objects. solids. and light was employed and trusted in physical science laboratories, the meters often work im-
coming to the conclusions that thus revise tlie ordinal-!; prot3erly. and it is usually qualitative knowing. plus as-
understanding. To challenge and correct the commoil- sumptions about what the meter ought to be showing.
sense understanding in one detail. commoil-sense under-- that discovers the malfunction. (This is a far cl-y from the
standing in general had to bc trusted. myth that meter readings operationally define theoretical
Related to this is the epistemologica1 emphasis on parameters.)
qualitative pattern identit’icatioll as prior to an idcntifi- It is with regret that 1 report that in U.S. prograni eval-
Assessing the Impact of Planned Social Change 71

uations, this sensible joint use of modes of knowing is not much more involved with extrascientific issues of human
yet practiced. Instead, there seems to be an all or none nature, social systems, and political process. I have classi-
flip-flop. Where, as in Model Cities evaluations, anthro- fled them here only because I judge that a first step in
pologists have been used as observers, this has often been their resolution would be developing a consensus among
in place of, rather than in addition to, quantitative indi- evaluation methodologists, and such a consensus would
cators, pretests, posttests, and control-group comparisons. involve agreement on metascientific issues rather than on
A current example of the use of anthropologists in the details of method.
“Experimental Schools” program started in the U.S.
Office of Education and now in the National Institute of ~~~~~ati~~n ~~~~lp~ernentati~nand E~a~~atiu~l
Education. In this program, school-system initiative is en- A well-established policy in those U.S. Government agen-
couraged, and winning programs receive substantial incre- cies most committed to program evaluation is to have pro-
ments to their budgets (say 25%) for use in implementing gram implementation organizationally separated from pro-
the innovations. To evaluate some of these programs, very gram evaluation. This recommendation comes from the
expensive contracts have been let for anthropological pro- academic community of scientific management theory,
cess evaluations of single programs. In one case, this was proliferated in the governmental circles of the late 1960’s
to involve a team of five anthropologists for five years, as “Programming, Planning, and Budgeting Systems,” or
studying the school system for a unique city with a popu- PPBS, in which these functions, plus program monitoring
lation of 100,000 persons. The anthropologists have had or evaluation, were to be placed in a separate organiza-
no prior experience with any other U.S. school system. tional unit independent of the operating agencies. (Wil-
They have been allowed no base-line period of study be- liam & Evans, 1969, provide one relevant statement of
fore the program was introduced; they arrived instead af- this policy.) This recommendation is based on an organi-
ter the program had started. They were not scheduled to zational control theory of checks and balances. It is sup-
study any other comparable school system not undergoing ported not only by general observations on human reluc-
this change. To believe that under these disadvantaged ob- tance to engage in self-criticism, but more particularly on
servational conditions, these qualitative observers could observations of a long standing self-defeating U.S. practice
infer what aspects of the processes they observe were due in which progress reports and other program evaluations
to the new program imrovation requires more faith than I are of necessity designed with the primary purpose of jus-
have, although I should withhold judgment until I see the tifying the following year’s budget. For the typical ad-
products. Furthermore, the emphasis of the study is on ministrator of an ameliorative program in the U.S.A., be it
the primary observations of the anthropologists them- a new experimental program or one of long standing,
selves, rather than on their role in using participants as in- budgets must be continually justified, and are usually on a
formants. As a result there is apt to be a neglect of the ob- year-to-year basis with six months or more lead time rare.
servations of other qualitative observers better placed than For such an administrator, program evaluations can hardly
the anthropologists. These include the parents who have be separated from this continual desperate battle. In this
had other children in the school prior to the change; the context, it makes excellent sense to turn program evalu-
teachers who have observed this one system before, ations over to a separate unit having no budgetary con-
during, and after the change; the teachers who have trans- straints on an honest evaluation. And so far, the policy is
ferred in with prior experience in otherwise comparable unchallenged.
systems; and the students themselves. Such observations My own observations, however, lead me to the conclu-
one would perhaps want to mass produce in the form of sion that this policy is not working either. The separation
questionnaires. if so, one would wish that appropriate works against modes of implementation that would opti-
questions had also been asked prior to the experimental mize interpretability of evaluation data. There are such,
program, and on both occasions in some comparable and low cost ones too, but these require advance planing
school system undergoi~lg no such reform, thus reestab- and close ilnplementor/evaluator cooperation. The exter-
lishing experimental design and quantitative summaries of nal evaluators also tend to lack the essential qualitative
qualitative judgments. (For a more extended discussion of knowledge of what happened. The chronic conflict be-
the qualitative-quantitative issues, see Campbell, 1974.) tween evaluators and implementors, which will be bad
While the issue of quantitative vs. qualitative orienta- enough under a unified local direction, tends to be exacer-
tions has important practical implications, it is still, as I bated. Particularly when combined with U.S. research
see it, primarily an issue among us social scientists and re- contracting procedures, the relevance of the measures to
latively independent of the larger political process. local program goals and dangers is weakened. Evaluation
Whether one or the other is used has pretty much been up becomes a demoralizing influence and a source of distract-
to the advice of the segment of the social science commu- ing conflict. It might be hoped that through specializa-
nity from which advice was sought, motivated in part by tion, more technically proficient methodologists would be
frustration with a previously used model. The issue, in employed. If there is such a gain, it is more than lost
other words is up to us to decide. through reduced experimental control.
The remaining issues in the metascientific group are These problems are, of course. not entirely due to the

separation of implementation and evaluation. And the rea- At present. the preference is for single, coordinated,
sons that argue for the separation remain strong. Yet the national evaluations. even where the program innovation
problems are troublesome and related enough to justify is implemented in many separate, discrete sites. If one
reconsidering the principle, particularly when it is rioted were to imitate science’s approach to objectivity, it would
that the separation seems totally lacking in experimental instead seem optimal to split up the big experiments and
science. This raises the metascientific issue of how ob- evaluations into two or more contracts with the same
jectivity in science is obtained in spite of the partisan bias mission so that some degree of simultaneous replication
of scientists, and of the relevance of this model for objec- would be achieved. Our major evaluations of compensa-
tivity in program evaluation. tory education programs (e.g., Head Start. Follow
In ordinary science, the one who designs the experi- Through) offer instances which were of such magnitude
ment also reads the meter. Comparably biasing motiva- that costs would not have been appreciably increased by
tional problems exist. Almost inevitably. the scientist is a this process. We could often. if we so planned. build in
partisan advocate of one particular outcome. Ambiguities some of the competitive replication that keeps science ob-
of interpretation present themselves. Fame and careers are jective.
at stake. Errors are made, and not all get corrected before One positive feature of the U.S. evaluation research
publication. with the hypothesis-supporting errors much scene in this regard is the widespread advocacy and occa-
less likely to be caught, etc. The puzzle of h(jw science sional practice of reanazysis by others of program evalua-
gets its objectivity (if any) is a metascientific issue still un- tion data. The Russell Sage Foundation has funded a se-
resolved. While scientists are probably more honest, cau- ries of these. including one on the “Sesame Street” pre-
tious, and self-critical than most groups, this is more apt school television programs (Cook. et al., 1975). The origi-
to be a by-product of the social forces that produce scien- nal governmental evaluation of the Head Start colnpensa-
tific objectivity than the source. Probably the tradition tory preschool program (Circirelli. 1969) has been reanal-
and possibility of independent rep!ication is a major fac- yzed by Smith and Bissell (1970) and Barnow (1973),
tor. As the philosophers and sociologists of science better Magidson, 1977. Similarly for several other classic bodies
clarify this issue, evaluation research methodologists of evaluation data, although this is still a rare activity, and
should be alert to the possibility of models applicable to many sets of data are not made available.
their area. Jumping ahead speculatively, I come to the fol- One needed change in research customs or ethics is to-
lowing tentative stance. ward the encouragement of “minority reports” from the
Ameliorative program implementation and evaluation research staff. The ethic that the data should be available
in the U.S.A. today need more zeal, dedication, and for critical reanalysis should be explicitly extended to in-
morale. These would be increased by adopting the scien- clude the staff members who did the data collection and
tist’s model of experimenter-evaluator. If the conditions analysis and who very frequently have the detailed insight
for cross-validating replication could be established, and if to see how the data might be assembled to support quite
budgetary jeopardy from negative evaluations could be re- different conclusions than the official report presents.
moved (for example, by allowing program implementors At present. any such activity would be seen as reprehen-
to shift to alternative programs in pursuit of the same sible organizational disloyalty. Because of this, a specially
goal). then the policy separation of implementation and competent source of criticism. and through this a source
evaluation should be abandoned. of objectivity, is lost. An official invitation by sponsor
The issue does not have to be one or the other. Exter- and administrator to every member of the professional
nal evaluations can be combined with in-house evalua- evaluation team to prepare minority reports would be of
tions. Certainly. even under present budgeting systems, considerable help in reducin g both guilt and censure in
program implementors silould be funded to do their own this regard.
evaluations and to argue their validity in competition with In this regard. we need to keep in mind two important
external evaluations. The organizational arrangement models of social experimentation. On the one hand there
separating evaluation from implementation is borrowed is the big-science model, exemplified in the Negative In-
from the model of external auditors, and it should be re- come Tax experiments to be discussed below (Kershaw &
membered that in accounting, auditors check on the inter- Fair. 1976: Watts & Rees, I977 ). On the other hand there
nal records, rather than creating new data. Perhaps some is the low-budget “administrative experiment” (Campbell.
such evaluation methodologist’s audit of internal evalua- 1967: Thompson, 1974), in which an administrative unit
tion records would be enough of an external evaluation. such as a city or state (or factory, or school) introduces a
new policy in such a way as to achieve an experimental or
Maximizing Replication arld Criticism quasi-experimental test of its efficacy. Wholey and his
Continuing the same metascience theme as in the pre- associates ( 1975) describe such studies and the Urban Insti-
vious section: a number of other recommendations about tute in general has pioneered in this regard. Hatry, Winnie.
policy research emerge, some of which run counter to cur- and Fisk’s Practical Program Evaluation jbr State and ID-
rent U.S. orthodoxy and practice. cal Govcrrzmrrzt Ojricials (19’73) exemplifies this empha-
Assessing the Impact of Planned Social Change 73

sis. For administrative experimentation to produce objec- already seen), I recognize much of value in what he says,
tivity, cross-validating diffusion is needed, in which those and some of it is relevant here. To summarize: There are
cities or states, etc., adopting a promising innovation con- normal periods of scientific growth during which there is
firm its efficacy by means of their own evaluation effort. general consensus on the rules for deciding which theory
Decentralization of decision-making has the advantage is more valid. There are extraordinary or revolutionary
of creating more social units that can replicate and cross- periods in science in which the choices facing scientists
validate social ameliorative inventions or that can explore have to be made on the basis of decision rules which are
a wide variety of alternative solutions simultaneously. not a part of the old paradigm. Initially, the choice of the
Even without planning, the existence in the U.S.A. of new dominant theory after such a revolution is unjustified
state governments creates quasi-experimental comparisons in terms of the decision rules of the prior period of nor-
that would be unavailable in a more integrated system. mal science.
Zeisel (1971) has argued this well, and it is illustrated in For evaluation research, the Kuhnian metaphor of
the study of Baldus (1973) cited more extensively below. revolution can be returned to the political scene. Evalua-
If factories, schools, and units of similar size, are allowed tion research is clearly something done by, or at least
independent choice of programs and if borrowed pro- tolerated by, a government in power. It presumes a stable
grams are evaluated as well as novel ones, the contagious social system generating social indicators that remain rela-
borrowing of the most promising programs would provide tively constant in meaning so that they can be used to
something of the validation of science. measure the program’s impact. The programs which are
implemented must be small enough not to seriously dis-
Evaluation Research as Normal Rather than turb the encompassing social system. The technology I am
Extraordinary Science discussing is not available to measure the social impact of
The metascientific points so far have shown little explicit a revolution. Even within a stable political continuity, it
reference to the hot metascientific issues in the U.S. may be limited .co the relatively minor innovations, as Zei-
today. Of these, the main focus of discussion is still se1 has argued in the case of experimentation with the
Thomas Kuhn’s Structure of Scientific Revolution U.S. legal system. (Needless to say, I do not intend this to
(1970). While I would emphasize the continuity and the constitute a valid argument against making changes of a
relative objectivity of science more than he (as you have magnitude that precludes evaluation.)


In this section I will get into down-to-earth issues where changes in the tolerance of citizens in lodging complaints
we quantitative evaluation methodologists feel most at and of policemen may have produced a drop in number of
home. Here are issues that clearly call for a professional reported offenses without a drop in actual offenses. By
skill. Here are issues that both need solving and give prom-
ise of being solvable. These statistical issues are ones that
assume a solution to the metascientific issues in favor of a 100
quantitative experimental approach. In this section, I will
start with a useful common-sense method - the interrupt-
ed time-series. Next will come some popular but unaccept-
able regression approaches to quasi-experimental design.
Following that, problems with randomization experiments
will be discussed, and following that, a novel compromise

The Interrupted Time-Series Design

By this term I cover the formalization of the widespread
common practice of plotting a time-series on some social
statistic and attempting to interpret it. This practice, the
problems encountered, and the solutions have been devel-
oped independently in many nations. I will start from
some non-U.S. examples.
Figure 2 shows data on sex crimes in Denmark and pos-
sibly the effect of removing restrictions on the sale of FIGURE 2. Incidence of all sex offenses (except incest)
pornography (Kutchinsky, 1973). Kutchinsky is cautious known to the police in Denmark during the period 194%
about drawing causal conclusions, emphasizing that 1970. (Kutchinsky, 1973, p. 164.)
74 DONALD T. ~A~~PB~~,l,

studying attitudes of citizens and police over time, and by

other subtle analyses, he concludes that for child mofesta-
tion these other explanations do not hold, and one must
conclude that a genuine drop in the frequency of this
250 crime occurred. However, the graphic portrayal of these
trends in Figure 3 is less convincing than Figure 2 because
of a marked downward trend prior to the increased avail-
ability of pornography. In both cases interpretation of ef-
fects is made more difficult by the problem of when to
define onset. In 1965 hard-core pornographic magazines
became readily available. In 1967 the penal law ban on
pornographic lieterature was repealed. In 1969 the sale of
pornog~~hic pictures to those 16 or older was legalized,
etc. Kutchinsky presentation is a model of good quasi-
exper~nlental analysis in its careful searching out of other
relevant data to evaluate plausible rival hypotheses.
Figure 4 shows the impact in Romania of the October
1966 population policy change which greatly restricted the
/ I : ! I
use of abortion, reduced the availability of contraceptives,
1960 1965 1970
and provided several new incentives for large families.
FIGURE 3. Child molestation (offenses against girls) in (David & Wright, 1971; David, 1970.) The conlbined ef-
the city of Copenhagen. (From Kutchinsky’s 1973 data.) fect is clear and convincing, with the change in the abor-

1 ! I 1 i I # 1
1966 1967 1968 1969 1970 i971

FIGURE 4. Monthly birth rates per 1,000 population, Romania: 1966-1970. (David &Wright. 1971. p. l%‘i’.)
Assessing the Impact of Planned Social Change 75

1935 1940 1946 1950 1955 1960 1965 1970

FIGURE 5. Live Births in Romania (193~-1970, excluding 1941-1945): total.

tion law probably the main factor, particularly for the onset of the October 1966 decrees also produced an im-
July-September 1967 peak. Presumably the subsequent mediate increase in stillbirths, many of which were no
decline represents a shift to other means of birth control. doubt substitutes for the newly outlawed abortions. Such
While clear visually, the data offer problems for the appli- compensation was, however, too minimal to prevent an
cation of tests of significance. The strong seasonal trend immediate increase in births.
rules out the application of the best statistical models Figure 6 shows the effect of the British Breath~yser
(Glass,WilIson, & Gottman, 1975) and there are not crackdown of 1967, illustrated here more dramatically
enough data points plotted here upon which to base a than it has yet appeared in any British publication. The
good seasonal adjustment. The point of onset for comput- British Ministry of Transport dutifully reported strong re-
ing purposes is also ambiguous. Is it October 1, 1966, or sults during the year folIowing. Their mode of report was
six months later as per the prior rule permitting abortions in terms of the percentage of decline in a given month
in the first three month? Or nine months later? A shift to compared with the same month one year earlier. This is
annual data obviates these two problems, but usually better than total neglect of seasonal effects, but it is an
there are too few years or too many other changes to per- inefficient method because unusual “effects” are often
mit the use of tests of significance. Figure 5 shows annual due as much to the eccentricity of the prior period as to
data and also provides an opportunity to look for the ef- that of the current one. It also precludes presentation of
fect of legalizing abortion in 1957. This occurred at a time the over-all picture. The newspapers duly noted the suc-
when the rate of use of all means of birth control, includ- cess, but interest soon faded, and today most British so-
ing abortion, was increasing, and there is no graphic evi- cial scientists are unaware of the program’s effectiveness.
dence that the 1957 law accelerated that trend. In other In Figure 6 the data have been adjusted to attempt to cor-
data not presented here, there is illustrated a chronic rect for seasonal trends, uneven number of days per
me~odologic~ problem with this design: Social systems month, uneven number of weekends per month, and, for
react to abrupt changes by using all of the discretionary the month of October 1969, the fact that the crackdown
decision points to minimize that change. Thus the abrupt did not begin until October 9. All of these adjustments


closed hours




1000 200-

0 ~,~~lll!llll,l,,,,,,,~,~ ,,,,/,,,/,,,,,,/,,,/,,,,,/,,,,,,/,,, I_
1966 1967 1968 1969 1970

FIGURE 6. British traffic casualties (fatalities plus serious injuries) before and after the British Breathalyser crackdown
of October 1967, seasonally adjusted. (Ross, 1973.)

have problems and alternative solutions. In this particular activity so far evaluated. Its highly beneficial effects are
case, the effects are so strong that any approach would no doubt specific to the particular time period, program
have shown them, but in many instances this will not be morale. low student attrition rate and other special
so. The data on commuting hours serves as a control for factors (for contrasting outcomes, see Director, 1979).
weekend nights. Ashenfelter’s complex regression analysis amply confirms
Figure 7 shows data from Baldus (1973) on the sub- the signi~cant and persisting effect. While his mode of
stantial effects of a law that Baldus believes to be evil just analysis shares the general probtem of underadjustmerlt
because it is effective. This law requires that, when a reci- referred to in the section to follow, (and in Director.
pient of old age assistance (charity to the poor from the 1979) this is minimized because he has 5 years of pre-
government) dies and leaves money or property, the gov- treatment earnings to use as covariates. Because the
ernment must be repaid. In our capitalist ideology, shared comparison groups are superior prior to treatment, any
even by the poor, many old people will starve themselves such underadjustment would lead to underestimates of
just to be able to leave their homes to their children. Bal- effect (or to apparent harmful effects in case of a true
dus has examined the effects of such laws in some 40 effect of zero).
cases where states have initiated them and in some 40 Figure 9 is a weak example of a time-series study be-
other cases where states have discontinued them. In each cause it has so few time periods. It has a compensator
case, he has sought out nearby, comparable states that did strength because the several comparison groups are them-
not change their laws to use as comparisons. One such in- selves constituted on a quantitative dimension. I include it
stance is shown in Figure 7. primarily because it seems to indicate that the U.S. Medic-
Figure 8 (based on data from Ashenfelter, 1978) shows aid legislation of 1964 has had a most dramatic effect on
dramatic effects for a national job training program the access to medical attention of the poorest group of
offered in 1964. This is surely the most promising out- U.S. citizens.
come of any U.S. government compensatory education The i~~terrupted time-series design is of the very great-
Assessing the Impact of Planned Social Change

I I I I I I II 1 I I I I I I I1
-30 I-

1950 1951 1952 1953

FIGURE 7. Effect of introducing a law in State A requiring repayment of welfare costs from the deceased recipient’s
estate on the old age assistance case loads. Modified from Baldus (1973, p. 204). Monthly data, with all values expressed
as a percentage of the case load 18 months prior to the change in the law.

est importance for program evaluation. It is available 3. There is a tendency for new administrations that ini-
where the new program affects everyone and where, tiate new programs to make other changes in the record-
therefore, a proper control group can usually not be keeping system. This often makes changes in indicators
constituted. If comparison group data are available, it uninterpretable (Campbell, 1969b, pp. 414-415). Where
ranks as the strongest of all quasi-experimental designs possible, this should be avoided.
(Campbell & Stanley, 1966). It can often be reconstructed 4. Where programs are initiated in response to an acute
from archival data. Graphically presented, it is readily problem (e.g., sudden change for the worse in a social in-
understood by administrators and legislators. Therefore, it dicator), ameliorative effects of the program are con-
is well worth the maximum of technical development. founded with “regression artifacts” due to the fact that in
Following is a brief list of its methodological problems as an unstable series, points following an extreme deviation
we have encountered them. tend to be closer to the general trend (Campbell, 1969b,
1. Tests of significance are still a problem. Ordinary pp. 413-414).
least squares estimation is usually inapplicable because of 5. Gradually introduced changes are usually impossible
autoregressive error; therefore moving-average models to detect by this design. If an administrator wants to op-
seem most appropriate. Glass, Willson, and Gottman timize evaluability using this design, program initiation
(1972) have assembled an analysis package which builds should be postponed until preparations are such that it
on the work of Box and Tiao (1965) and Box and Jenkins can be introduced abruptly. The British Breathalyser
(1970). McCleary and McCain (1979) present the “trans- crackdown exemplifies this optimal practice (see Figure 6,
fer function” appraoch of Box and Tiao (1975). These above).
models require that systematic cycles in the data be ab- 6. Because long series of observations are required, we
sent, but all methods of removing them tend to underad- tend to be limited to indicators that are already being re-
just. They also require large numbers of time-points, and corded for other purposes. While these are often relevant
will sometimes fail to confirm an effect which is compell- (e.g., births and deaths) and while even the most deliber-
ing visually as graphed. ately designed indicators are never completely relevant,
2. Removing seasonal trends remains a problem. this is a serious limitation. Particularly lacking are reports
Seasonal trends are themselves unstable and require a on the participants’ experiences and perceptions. On the
moving average model. The month-to-month change other hand, it seems both impossible and undesirable to
coincident with program change should not be counted as attempt to anticipate all future needs and to initiate
purely seasonal; thus the series has to be split at this point bookkeeping procedures for them. Some intermediate
for estimating the seasonal pattern. Therefore, the parts of compromise is desirable, even at the expense of adding to
the series just before and just after program initiation the forms to be filled out and the records to be kept. For
become series ends, and corrections for these are much institutional settings, it would be valuable to receive
poorer than for mid-series points. (Kepka, 197 1). from all participants “Annual Reports for Program Evalu-

- Trainees
--- Comporlson Group

White Males Black Males

I, L I
x 1959 60 61 62 63 64 65 66 67 68 t2
1959 60 61 62 63 64 ,65 66 67 60 69

2 .-

White Females Black Females

I b 1 ~
1959 60 61 62 63 64 65 66 67 68 ( 1959 60 61 62 63 64 65 66 67 68 6’

FIGURE 8. Impact on the 1964 trainee cohort trained under the U.S. Manpower Development and Training Act, based
upon earnings reported in Social Security Administration records (Ashenfelter, 1978. drawn from data presented in his
Tabie I. p. 5 I ).

ation” (Gordon & Campbell, 1971). In educatio~lal settings. in a~it~~inistering a novel program to 3 single intact institu-
teachers, students, and parents could file such a report. tion or ~~dnlinistrative unit. with measures before and
Note that at present the school system records how the after. While this leaves much to be desired in the way of
pupil is doing but never records the pupil’s report on how controls, it is still infortnative enough often enough to be
the school is doing. Teachers are annually rated for cffici- worth doing. Almost as frequently. this design is augment-
ency but never get a chance to systematically rate the ed by the addition of a comparison group which is also
policies they are asked to implement. Some first steps in measured before and after. This is typically another intact
this direction are being explored. (Weher, Cook, & Camp- social unit which does not receive the new program and is
bell, 1971: Anderson, 1973). In the U.S. social welfare judged comparable in other respects. It usually turns out
system. both social worker and welfare recipient would be that these two groups differ even before the treatment.
offered the opportunity to file reports (Gordon & Camp- and a natural tendency is to tl-y to adjust away that differ-
bell, 197 I ). ,411ratings would be restricted to the evalua- ence. In statistical practice in the I1.S. today, thf means
tion of programs and policies. not persons. for reasons to by which this is done are, in my opinion. almost always
be discussed below. wrong. What has happened is that a set of statistical tools
developed for and appropriate to prediction are applied to
Regressim Adjustments as Suhstitu tes for Katldornizatio~~ causal inference purposes for which they are inappropri-
The commonest evaluation design in I1.S. practice consists ate. Regression analysis. nlultivariate statistics. covarixtice
Assessing the Impact of Planned Social Change

A = $15,000 or more D = $5,000 - 6,999

I3 = 10,000 - 14,999 E= 3,000 - 4,999
c = 7,000 - 9,999 F- under 3,000

I I I I I I I I 1
1963 1964 t965 1966 1967 1968 1969 t970

FIGURE 9.Effect of Medicaid on number of visits to physician per year by persons in low-income families. (From Lohr,
1972, p. 5. Table B. )

analysis are some of the names of the statistical tools I In US. experience it has become important to distin-
have in mind. Whether from educational statistics or eco- guish two types of setting in which this type of quasi-ex-
nomics, the choice of methods seems to be the same. The perimental design and these types of adjustments are
economists have a phrase for the problem “error in vari- used, since the social implications of the underadjust-
ables” or, more specifically, “error in independent vari- ment are opposite. On the one hand, there are those spe-
ables.” But while theoretic~ly aware of the problem, they cial opportu~lity programs like university education which
are so used to regarding their indicators as essentially lack- are given to those who need them least, or as more usually
ing in error that they neglect the problem in practice. stated, who deserve them most or who are most likely to
What they forget is that irrelevant systematic components be able to profit from them. Let us call these “distribu-
of variance create the same problem as does random error, tive” programs in contrast with the “compensatory” pro-
leading to the same bias of underadjustment. Note that grams. or those special opportunities given to those who
the presence of error and unique variance has a systematic need them most.
effect, i.e., operates as a source of bias rather than as a For the regressive programs, the treatment group will
source of instability in estimates. This fact, too, the econ- usually be superior to the control group or the population
omists and others neglect. Thus efforts to correct for pre- from which the quasi-experimental controls are chosen.
treatment differences by “regression adjustments” on the In this setting the inevitable underadjustment due to
means or by “partialing out” pretest differences or by co- unique variance and error in the pretest and/or other cova-
variance adjustments all lead to underadjustment unless riates (the “regression artifacts”) works to make the treat-
the pretest (or other covariate) is a perfect measure of ment seem effective if it is actually worthless and to exag-
what pretest and posttest have in common. The older gerate its effectiveness in any case. For most of us this
technique of using only cases matched on pretest scores is seems a benign error, confirming our belief in treatments
well known to produce “regression artifacts” (Thorndike, we know in our hearts are good. (It may come as a sur-
1942; Campbell & Stanley, 1966). Covariance turns out prise, but the U.S. Sesame Street preschool educational
to produce the same bias, the same degree of underadjust- television program is “distributive,” in that children from
ment only with greater precision (Lord, 1960; 1969;Por- better-educated parents watch it more.) (Cook, et al.,
ter, 1967; Campbell & Erlebacher, 1970) and so also for 1975.)
multiple regression and partial correlation (Campbell & For compensatory programs usually, although not al-
Boruch, 1975). Essentially the same problem emerges in ways, the control groups start out superior to the treat-
ex post facto studies where, although there is no pretest, ment group, or are selected from a larger population
other covariates are available for adjustI~ent. A common whose average is superior. In this setting, the biases of
version of the probfem occurs where some persons have underadjustment, the regression artifacts, are in the direc-
received a treatment and there is a larger population of tion of underestimating program effectiveness and of mak-
untreated individuals from which ““controls” are sought ing our programs seem harmful when they are merely
and a comparison group assembled. worthless. This quasi~xperimental research setting has oc-

curred for our major evaluations of compensatory educa- family of four, under one of eight plans which differ as to
tion programs going under the names of Head Start., Fol- support level and incentive for increasing own earnings.
low Through. Performance Contracting, Job Corps (for Another 600 families received no income support but co-
(unemployed young men). and many others. In the major operated with the quarterly interviews. The experiment
Head Start evaluation (C’icirelli, 1969; Campbell & Erle- lasted for three years, and preliminary final results are
bather, 1970), this almost certainly accounts for the signi-now available. This study when completed will have cost
ficantly ~larmful effects shown in the short three-lnoiit~~,some S8,000,000 of which $3,00~,000 represented pay-
ten-i~ours-a-week program. I am persuaded that the over- ments to participants and necessary adl~inistr~~tive costs,
whelming prevalence of this quasi-experirllental setting
and $5,000,000 research costs, the costs of program cvalu-
and adjustment procedures is one of the major sources of ation. Before this study was half completed, three other
the pessimistic record for such programs of compensatory negative income tax experiments were started, some much
education efforts. The very few studies in compensatory bigger (rural North Caroline and Iowa, Gary, and Seattle
education which have avoided this problem by random and Denver). The total U.S. investment for these experi-
assignment of children to experimental and control con- ments now totals S65,000,000. It is to me amazing, and
ditions have shown much more optimistic results. inspiring, that our nation achieved, for a while at least,
In the compensatory education situation. there are sev- this great willingness to deliberately “experiment” with
eral other problems whiclt also work to make the program policy alternatives using the best of scientific methods.
look harmful in quasi-experimental studies. These include This requires a brief historical note. The key period is
tests that are too difficult, differential growth rates com-1964-68. L. B. Johnson was President and proclaimed a
bined with age-based, grade-equivalent, “Great Society” program and a “War on Poverty.” In
absolute. or raw
scores, and the fact that test reliabiIity is higher for theWashington. D.C. administrative circles, the spirit of scien-
posttests than for the pretests and higher for the control tific management (the PPBS I’ve already criticized and
group than for the experimental (Campbell 81 Borsch,will again) had already created a sophisticated interest in
1975). These require major revisions of our test score hardheaded program evaluation. Congress was already
practice. When various scoring models are applied to a writing into its legislation for new programs the requirc-
single pop~ilat~on on a single occasion, all scoring proce- ment that 1% {or some other proportion) of the program
dures correlate so highly that one might as well use the budget bc devoted to evaluation of effectiveness. In a new
simplest. But when two groups that differ- initially are agency, the Office of Economic Opportunity, a particular-
measured at two different times in a period of rapid ly creative and dedicated group of young economists were
growth, our standard test score practices have the effect recruited, and these scientist-evaluators were given an es-
of making the gap appear to increase. if. as is usual, test pecially strong role in guiding over-all agency policy. This
reliability is increasing. The use of a correction for guess-OEO initiated the first two of the Negative Income Tax
ing becomes important. The common model that assumes experiments. (Two others were initiated from the Depart-
“‘true score” and “error” are independent ment of Health Education and Welfare.) Under the first
needs to be
abandoned, substituting one that sees error and true score Nixon administration, 1968-72, OEO programs were con-
negatively correlated across persons (the larger the error tinued, although on a reduced scale. Under the second
component, the smaller the true score c~~rnpotlent). Reich- Nixon administration. OEO itself was dismantled, al-
ardt (1979), provides the most up-to-date recommenda- though several programs were transferred to different
tions on statistical analysis. agencies. I believe that all four of the Negative Income
Tax Experiments are still being carried out much as
planned. Initiation of new programs has not entirely
The focal example of a good social experiment in the ceased, but it is greatly reduced. My general advice to my
U.S. today is the New Jersey Negative Income Tax Experi- fellow U.S. social scientists about the attitude they should
ment. (Watts & Rees, 1973, 1977; T~~~o~rr~al of Hu~~l~i~ take toward this historical period is as follows: Let us use
Resourms, Vol.9,No. 2, Spring, 1974~Kersbaw, 1972, this experience so as to be ready when this political will
1975; Kershaw & Fair, 1976.) This is an experiment deal- returns again. In spite of the good example provided by
ing with a guaranteed annual income as an alternative to the New Jersey Negative Income Tax Experiment, over
present U.S. welfare systems. It gets its name from the no- all we were not ready iast time. Competent evaluation re-
tion that when incomes fall below a given level, the tax searchers were not available when Model Cities Programs,
should become negative, that is, the government should Job Corps Programs, etc. went to their local universities
pay the citizen rather than the citizen paying a tax to the for help. Perhaps 90% of the funds designated for program
government. It also proposes substituting income-taxlike evaluation was wasted: at any rate, 90% of the programs
procedures for citizen reports of income in place of the came out with no interpretable evidence of their effective-
present social worker supervision. In this experiment some ness. The available evaluation experts grossly overesti-
600 families with a working male head of household re- mated the usefulness of statistical adjustments as substi-
ceived income support payments bringing their income up tutes for good experimdntal design, including especially
to some level between $3,000 and $4,000 per year for a randomized ass~~ment of treatments.
Assessing the Impact of Planned Social Change 81

In this spirit I would like to use the experience of the the NJNITE shown a reduction in work effort, national
New Jersey Negative Income Tax Experiment to elucidate adoption of the policy would have been very unlikely. For
the methodological problems remaining to be solved in this reason alone, it was well worth doing and doing well.
the best of socialexperiments. In this spirit, my comments The details of the experiment draw attention to a num-
are apt to sound predominantly critical. My over-all atti- ber of problems of method that need detailed attention
tude, however, is one of highest approval. Indeed, in lec- from creative statisticians and social psychologists. These
tures in the U.S. I often try to shock audiences with the will only be mentioned here, but are being treated more
comment that the New Jersey experiment is the greatest extensively elsewhere (Riecken, et al., 1974). The issue of
example of applied social science since the Russian the unit of randomization has already been raised in pass-
Revolution. ing. Often there is a choice of randomizing larger social
The major finding of the NJNITE is that income guar- units than persons or families - residential blocks, census
antees do not reduce the effective work effort of employed tracts, classrooms, schools, etc. are often usable. For rea-
poor people. This finding, if believed, removes the princi- sons of statistical efficiency, the smaller, more numerous
pal argument against such a program - for on a purely units are to be preferred, maximizing the degrees of free-
cost basis, it would be cheaper than the present welfare dom and the efficacy of randomization. But the use of
system unless it tempted many persons now employed to larger units often increases construct validity. Problems of
cease working. The major methodological criticisms of the losses due to refusals and later attrition interact with the
study are focused on the credibility of the belief that this choice of the stage of respondent recruitment at which to
“laboratory” finding would continue to hold up if the randomize. NJNITE used census statistics on poverty
support program became standard, permanent U.S. policy. areas and sample survey approaches to locate eligible
These are questions of “external validity” (Campbell & participants. Rethinking their problem shows the useful-
Stanley, 1966) or of “construct validity,” as Cook (Cook ness of distinguishing two types of assent required, for
& Campbell, 1976) applies the term developed initially for measurement and for treatment; thus two separate stages
measurement theory. Two specific criticisms are promi- for refusals occur. There emerge three crucial alternative
nent: One, there was a “Hawthorne Effect” or a “Guinea- points at which randomization could be done:
pig effect.” The experimental families knew they were the
exceptional participants in an artificial arrangement and Alternative lnvltatlon to partlclpate in the
points for 2 aerlodlc measurement actlwty
that the spotlight of public attention was upon them.
randonlzed 3 lnwtatlon to participate in the
Therefore, they behaved in a “good” industrious respect- assignment < experlmental treatment
able way, producing the results obtained. Such motivation to treatment
would be lacking once the program was universal. Two
features in the NJNITE implementation can be supposed In NJNITE, alternative 1 was employed. Subsequently
to accentuate this. There was publicity about the experi- control group respondents were asked to participate in the
ment at its start, including television interviews with se- measurement, and experimental subjects were asked to
lected experimental subjects; and the random selection participate in both measurement and treatment. As a
was by families rather than neighborhoods, so each experi- result, there is the possibility that the experimental group
mental family was surrounded by equally poor neighbors contains persons who would not have put up with the
who were not getting this beneficence. The second com- bother of measurement had they by chance been invited
mon criticism, particularly among economists, might be into the control group. Staging the invitations separately,
called the time-limit effect. Participants were offered the and randomizing from among those who had agreed to the
support for exactly three years. It was made clear that the survey (i.e., to the control group condition) would have
experiment would terminate at that time. This being the ensured comparability. In NJNITE there were some refu-
case, prudent participants would hang on to their jobs un- sals to experimental treatment because of unwillingness to
less they could get better ones, so that they would be accept charity. This produces dissimilarity, but the bias
ready for a return to their normal financial predicament. due to such differential refusal can be estimated if those re-
It should be recognized that these two problems are in fusing treatment continue in the measurement activity.
no way specific to randomized experiments, and would Alternative 2 is the point we would now recommend.
also have characterized the most casual of pilot programs. One could consider deferring the randomization still
They can only be avoided by the evaluation of the adop- further, to alternative 3. Under this procedure, all poten-
tion of NIT as a national policy. Such an evaluation will tial participants would be given a description of each of
have to be quasi-experimental, as by time-series, perhaps the experimental conditions and his chances for each. He
using several Canadian cities as comparisons. Such evalua- would then be asked to agree to participate no matter
tions would be stronger on external, construct validity, what outcome he drew by chance. From those who agreed
but weaker on internal validity. It is characteristic of our to all this, randomized assignment would be made. This al-
national attitudes, however, that this quasi-experimental ternative is one that is bound to see increased use. The
evaluation is not apt to be done well, if at all - once opportunity for differential refusals is minimized (though
we’ve chosen a policy we lose interest in evaluating it. Had some will still refuse when they learn their lot). This

seems to tnaximize the “informed consent” required by

the U.S. National Institutes of Health for all of the medi-
cal and behavioral research funded by them. The U.S. So- C E
cial Science Research Council’s Committee on Experitnen- n=8 n-12

tation as a Method for Planning and Evaluating Social Pro-
grams (Riecken, et al., 1974) failed to recommend altcrna-
tive 3 however. In net, they judge informed consent to be n= lo
adequately achieved when the participant is fully informed
of the treatment he is to receive. Informing the control
group participants of the benefits that others were getting
and that they almost got would have caused discontent.
and have made the control treatment an unusual experi-
n =I0
El Cl
SES Level : Experimental Group
ence rather than the representative of the absence of treat-
ment. The tendency of control subjects to drop out more
frequently than experimental subjects would have been
accentuated. and thus this approach to greater cotnpara-
bility would be in the end self-defeating. These are reason-
able arguments on both sides. More discussion and re-
search are needed on the problem.
Attrition and in particular dij$mwtiul attritiotl become
major problems on which the work of inventive statisti-
cians is still needed. In the NJNITL;. attrition rates over
the three-year period ranged from 25.3X in the control I 2 3 4 5 6 7
group to only 6.5% in the most remunerative expcrimen-
tal group. These differences are large enough to create SES Level : Control Group
pseudo effects in posttest values. The availability of p~c-
FIGURE 10. Hypothetical data from two treatment
test scores provides some information on the probable di-
gl-otips in a social experiment 1 grouped by SES level and
rection of bias, but covariance on these values under-
given coded list designators A through Z.
adjusts and is thus not an adequate correction. Methods
for bracketing maximum and tninimutn biases under vari-
ous specified assumptions need to be developed. Where administrative records without revealing confidential data
there is an encompassing p eriodic tiieasurement frame- on individuals (Schwartz & Orleans. 1967: Campbell. Bo-
work that still retains persons who have ceased coopcrat- ruch, Schwartz & Steinberg. 1977).
ing with the experiment. other alternatives arc’ present Let 11s call this “mutually insulated interfile exchange.”
that need developing. It rcquit-es that the administrative file have the capacity
Such measurement frameworks appropriate to NJNITE fat internal statistical analysis of its own records. Without
would include the Social Security Adtiiitlistratiotl’s rfc- going into detail. 1 would nonetheless like to communicate
ords on earnings subject to social security withholding and the nub of the idea. Figure 9 shows a hypothetical ex-
claitns on unemployment insurance. rhc Internal Re- periment with one experimental group and one crontrol
venue Service records on withholding taxes. hospitalir.ation group. In this case there are enough cases to allow a fur-
insurance records. etc. These records are occasionally ther breakdown by socio-economic level. Front these data
usable in evaluation research (e.g.. Levenson & McDill. some 76 lists are pt-epared ranging from X to 14 persons in
1966; Bauman, David & Millet-, 1970; Fischct-. 1972: length. These lists arc assigned designations at random (in
Heller. 1972) but the facilities for so doing are not adc- this case A to Z) so that list designation communicates no
quatcly developed. and the concept of such usage may information to the administrative file. The list provides
seem to run counter to the cut rent U.S. emphasis on pre- the name of the person. his Social Security number. and
serving the confidentiality of administrative records (e.g.. pet-haps his birth date and birth place. These lists are then
Reubhausen 81 Brim. 1965; Sawyer & Schcchter. 1968: tui-ned over to the adtninistrative file which deletes one
Goslin. 1970; Miller. 1971 ; Westin, 1967;Wl1celcr, 1969). person at random from each list, retrieves the desired data
Because research access to administrative files would make from the files on each of the others fat- whotn it is avail-
possible so much valuable low cost follow-up on program able. and computes mean. variance. and number of casts
innovations. this issue is wortlty of discussion somewhere with data available for each list for each variable. These
in this paper. For convenience. if not organizational con- values for each list designation are then returned to the
sistenc,y 1I will insert lhe discussion here. evaluation researchers who reassemble them into statisti-
There is a way to statistically relate research data and cally tneaningful composites and then compute experi-
Assessing the Impact of Planned Social Change 83

mental and control group means and variances, correla- tion (Conner, 1978). While at one time lotteries were used
tions, interactions with socio-economic level, etc. Thus to “let God decide,” now a program administrator feels he
neither the research file nor the administrative file has is “playing God” himself when he uses a randomization
learned individual data from the other file, yet the statisti- procedure, but not when he is using his own incompetent
cal estimates of program effectiveness can be made. In the and partisan judgment based on inadequate and irrelevant
U.S. it would be a great achievement were this facility for information (Campbell, 1971). Participants also resist ran-
program evaluation to become feasible for regular use. domization, though less so when they themselves choose
To return to attrition problems in randomized experi- the capsule from the lottery bowl than when the adminis-
ments, not only do we need new statistical tools for the trator does the randomization in private (Wortman, et al.,
attrition problem, we also need social-spychological inven- 1976). Collecting a full list of eligible applicants and then
tions. In long-term experiments such as that of Ikeda, randomizing often causes burdensome delays, and it may
Yinger, and Laycock (1970) in which a university starts be better to offer a 50-50 lottery to each applicant as he
working with under-privileged twelve-year-olds during applies, closing off all applications when the program
summers, trying to motivate and guide their high school openings have been filled, at which time the controls
activities (ages 14 to 18) to be university preparatory, two would be approximately the same in number. For settings
of the reasons why attrition is so differential are that more like specially equipped old people’s homes, the control
recent home addresses are available for the experimentals group ceases to be representative of nonexperimental con-
(who have been in continuous contact) and that the ex- ditions if those losing the lottery are allowed to get on the
perimentals answer more follow-up inquiries because of waiting list ~ waiting for an opening forestalls normal
gratitude to the project. This suggests that controls in problem-solving. For such settings, a three-outcome lot-
long-term studies might be given some useful service on a tery is suggested: (1) admitted; (2) waiting list; (3) re-
continuing basis ~ less than the experimentals but enough jected. Group 3 would be the appropriate control. For
to motivate keeping the project informed of address agencies having a few new openings each week or so, spe-
changes and cooperating with follow-up inquiries (Ikeda, cial “trickle processing” procedures are needed rather
et al., 1970). If one recognizes that comparability be- than large-batch randomization. Where the program is in
tween experimental and control goups is more important genuinely short supply, one might think that the fact that
than completeness per se, it becomes conceivable that most people were going without it would reconcile control
comparability might be improved by deliberately degrad- group subjects to their lot; however, experimental proce-
ing experimental group data to the level of the control dures including randomization and measurement may cre-
group. In an exploration of the possibility, Ikeda, Rich- ate an acute focal deprivation, making control status itself
ardson, and I (in preparation) are conducting an extra fol- an unusual treatment. This may result in compensatory
low-up of this same study using five-year-old addresses, a striving or low morale (Cook & Campbell, 1976, 1979).
remote unrelated inquiring agency, and questions that do
not refer spcifically to the Ikeda, Yinger, and Laycok pro- Regression-Discontinuity Design
gram. (I offer this unpromising example to communicate The arguments against randomizing admissions to an
my feeling that we need wide-ranging explorations of pos- ameliorative program (one with more eligible applicants
sible solutions to this problem.) than there is space for) include the fact that there are de-
It is apparent that through refusal and attrition, true grees of eligibility, degrees of need or worthiness, and that
experiments tend to become quasi-experiments. Worse the special program should go to the most eligible, needy,
than that, starting with randomization makes the many or worthy. If eligibility can be quantified (e.g., through
potential sources of bias more troubling in that it focuses ranks, ratings, scores, or composite scores) and if admis-
awareness on them. I am convinced, however, that while sion for some or all of the applicants can be made on the
the biases are more obvious, they are in fact considerably basis of a strict application of this score, then a powerful
less than those accompanying more casual forms of select- quasi-experimental design, Regression-discontinuity, is
ing comparison groups. In addition, our ability to estimate made possible. General explanation and discussion of ad-
the biases is immeasurably greater. Thus we should, in my ministrative details are to be found in Campbell (1969b)
judgment, greatly increase our use of random assignment, and Riecken, et al. (1974). Sween (1971) has provided
including in regular admissions procedures in ongoing pro- appropriate tests of significance. Goldberger (1972),
grams, having a surplus of applicants. To do this requires working from an econometric background, has made an
that we develop practical procedures and rationales that essentially equivalent recommendation.
overcome the resistances to randomization met with in The application of quantified eligibility procedures
those settings. Just to communicate to you that there are usually involves at least as great a departure from ordinary
problems to be solved in these areas, I will briefly sketch admission procedures as does randomization. Developing
several of them. specific routines appropriate to the setting is necessary.
Administrators raise many objections to randomiza- But once instituted, their economic costs would be low

and would be more than compensated for by increased “reliability” for the quantified eligibility criterion is not
equity of the procedures. Resistance, however, occurs. required; indeed, as it approaches zero reliability. it be-
Administrators like the freedom to make exceptions even comes the equivalent of randomization.)
to the rules they themselves have designed. (“Validity” or


KesistanctJ to Evaluation lem. This task we have hardly begun, and it is one in
In the U.S. one of the pervasive reasons why interpretable which we may not succeed. Two minor suggestions will
program evaluations are so rare is the widespread resist- illustrate. I recommend that we evaluation research meth-
ance of institutions and administrators to having their pro- odologists should refuse to use our skills in ad hominem
grams evaluated. The methodology of evaluation research research. While the expensive machinery of social experi-
should include the reasons for this resistance and ways of mentation can be used to evaluate persons, it should not
overcoming it. be. Such results are of very limited generalizability. Our
A major source of this resistance in the U.S. is the iden- skills should be reserved for the evaluation of policies and
tification of the administrator and the administrative unit programs that can be applied in more than one setting and
with the program. An evaluation of a program under our that any well-intentioned administrator with proper fund-
political climate becomes an evaluation of the agency and ing could adopt. We should meticulously edit our opinion
its directors. In addition the machinery for evaluating pro- surveys so that only attitudes toward program alternatives
grams can be used deliberately to evaluate administrators. are collected and such topics as supervisory efficiency ex-
combined with this. there are a number of factors that cluded. This prohibition on ad hominem research should
lead administrators to correctly anticipate a disappointing also be extended to program clients. We should be evaluat-
outcome. As Rossi (1969) has pointed out. the special ing not students or welfare recipients but alternative poli-
programs that are the focus of evaluation interests have cies for dealing with their problems. It is clear that I felt
usually been assigned the chronically unsolvable problems. such a prohibition is morally justified. But I should also
those on which the usually successful standard institutions confess that in our U.S. settings it is also recommended
have failed. This in itself provides a pessimistic prognosis. out of cowardice. Program administrators and clients have
Furthermore. the funding is usually inadequate. both it in their power to sabotage our evaluation efforts, and
through the inevitable competition of many worthy they will attempt to do so if their own careers and inter-
causes for limited funds and because of a tendency on the ests are at stake. While such a policy on our part will not
part of our legislatures and executives to generate token entirely placate administrators’ fears, I do believe that if
or cosmetic efforts designed more to convince the public we conscientiously lived up to it. it would initiate a
that action is being taken than to solve the problem. Even change toward a less self-defeating political climate.
for genuinely valuable programs, the great effort required A second recommendation is for advocates to justify
to overcome institutional inertia in establishing any new new programs on the basis of the seriousness of the prob-
program leads to grossly exaggerated claims. This pt-o- lem rather than the certainty of any one answer and com-
duces the “overadvocay trap” (Campbell, 196?b, 1971 ). bine this with an einphasis on the need to go on to other
so that even good and effective programs fall short of attempts at solution should the first one fail (Campbell,
what has been promised. which intensifies fear of evalu- 1969b). Shaver and Staines (1971) have challenged this
ation. suggestion, arguing that for an adtninistrator to take this
The seriousness of these and related problems can hard- attitude of scientific tentativeness constttutes a default of
ly be exaggerated. While I have spent more time in this leadership. Conviction, zeal, enthusiasm, faith are required
presentation on more optimistic cases, the preceding para- for any effective effort to change traditional institutional
graph is more typical of evaluation reserach in the U.S. to- practice. To acknowledge only a tentative faith in the new
day. As methodologists, we in the U.S. are called upon to program is to guarantee a half’-hearted implementation of
participate in the political process in efforts to remedy it. But the problem remains: the overadvocacy trap con-
this situation. But before we do so. we should sit back in tinues to sabotage program evaluation. Clearly, social-psy-
our armchairs in our ivory towers and invent political/ chological and organization-theoretical skills are needed.
organizational alternatives which would avoid the prob-


Evaluation research in the U.S.A. is becoming a recog- the unemployment and cost-of-living indices of the Bu-
nized tool for social decision-making. Certain indicators, reau of Labor statistics. As regular parts of the political
collected through such social science methods as sample decision process, it seems useful to consider them as akin
surveys, have already achieved this status; for example, to voting in political elections (Gordon & Campbell, 1971 :
Assessing the Impact of Planned Sociaf Change 85

Campbell, 1971). From this enlarged perspective, which is ty of the police rather than changes in the number of
supported by qualitative sociological studies of how pub- criminal acts (Gardiner, 1969; Zeisel, 1971). It seems to
lic statistics get created, I come to the following pessimis- be well-documented that a well-publicized, deliberate ef-
tic laws (at least for the U.S. scene): The mure any quanti- fort at social change - Nixon’s crackdown on crime - had
tative social indicator is used for social decision-making, as its main effect the corruption of crime-rate indicators
the more subject it will be to corruption pressures and (Seidman & Couzens, 1972; Morrissey, 1972; Twigg,
the more apt it will be to distort and corrupt the social 1972): achieved through underrecording and by down-
processes it is intended to monitor. Let me illustrate grading the crimes to less serious classifications.
these two laws with some evidence which I take seriously, For other types of adnlinistrative records, similar use-
although it is predominantly anecdotal. related distortions are reported (Kitsuse & Cicourel. 1963;
Take, for example, a comparison between voting statis- Garfinkel, 1967). Blau (1963) provides a variety of exam-
tics and census data in the City of Chicago: Surrounding ples of how productivity standards set for workers in gov-
the voting process, there are elaborate precautionary ernment offices distort their efforts in ways deleterious to
devices designed to ensure its honesty; surrounding the program effectiveness. In an employment office, evalu-
census-taking process, there are few, and these could be ating staff members by the number of cases handled led to
so easily evaded. Yet, in our region, the voting statistics quick, ineffective interviews and placements. Rating the
are regarded with suspicion while the census statistics are staff by the number of persons placed led to concentra-
widely trusted (despite underenumeration of young adult, tion of efforts on the easiest cases, neglecting those most
black males), I believe this order of relative trust to be jus- needing the service, in a tactic known as “creaming” (Mil-
tified. The best explanation for it is that votes have con- ler, et al., 1970). Ridgway’s pessimistic essay on the dys-
tinually been used ~- have had real implications as far as functional effects of performance measures (1956) pro-
jobs, money, and power are concerned - and have there- vides still other examples.
fore been under great pressure from efforts to corrupt. On From an experimental program in compensatory edu-
the other hand, until recently our census data were un- cation comes a very clear-cut illustration of the principle.
used for political decision-making. (Even the constitution- In the Texarkana “performance contracting” experiment
al requirement that electoral districts be changed to match (Stake, 1971) supplementary teaching for undereducated
population distribution after every census was neglected children was provided by “contractors” who came to the
for decades.) schools with special teaching machines and individualized
Another example: In the spirit of scientific manage- instruction. The corruption pressures were high because
ment, accountability, the PPBS movement, etc., police de- the contractors were to be paid on the basis of the
partments in some jurisdictions have been evaluated by achievement test score gains of individual pupils. it turned
“clearance rates,” i.e., the proportion of crimes solved, out that the contractors were teaching the answers to spe-
and considerable administrative and public pressure is gen- cific test items that were to be used on the final pay-off
erated when that rate is low. Skolnick (1966) provides testing. Although they defended themselves with a logical-
illustrations of how this pressure has produced both cor- positivist, operational-definitionalist argument that their
ruption of the indicator itself and a corruption of the agreed-upon goal was defined as improving scores on that
criminal justice administered. Failure to record all citizens’ one test, this was generally regarded as scandalous. How-
complaints, or to postpone recording them unless solved, ever, the acceptability of tutoring the students on similar
are simple evasions which are hard to check. since there is items from other tests is still being debated. From my
no independent record of the complaints. A more compli- own point of view, achievement tests may well be valuable
cated corruption emerges in combination with “plea-bar- indicators of general school achievement under conditions
gaining.” sea-bargaining is a process whereby the prose- of normal teaching aimed at general competence. But
cutor and court bargain with the prisoner and agree on a when test scores become the goal of the teaching process,
crime and a punishment to which the prisoner is willing they both lose their value as indicators of educational
to plead guilty, thus saving the costs and delays of a trial. status and distort the educational process in undesirable
While this is only a semilegal custom, it is probably not ways. (Similar biases of course surround the use of objec-
undesirable in most instances. However, combined with tive tests in courses or as entrance examinations.) In com-
the clearance rates, Skolnick finds the following miscar- pensatory education in general there are rumors of other
riage of justice. A burglar who is caught in the act can end subversions of the measurement process, such as adminis-
up getting a lighter sentence the more prior unsolved burg- tering pretests in a way designed to make scores as low as
laries he is willing to confess to. In the bargaining, he is possible so that larger gains will be shown on the post test,
doing the police a great favor by improving the clearance or limiting treatment to those scoring lowest on the pre-
rate, and in return, they provide reduced punishment. test so that regression to the mean will provide apparent
Skolnick believes that in many cases the burglar is confess- gains. Stake (1971) lists still other problems. Achievement
ing to crimes he did not in fact commit. Crime rates are in tests are, in fact, highly corruptible indicators.
general very corruptible indicators. For many crimes, That this serious methodological problem may be a
changes in rates are a reflection of changes in the actitivi- universal one is demonstrated by the extensive U.S.S.R.

literature (reviewed in Granick, 1954; and Berliner, 1957) will get worse, the more common quantitative evaluations
on the harmful effects of setting quantitative industrial of social programs become. We must develop ways of
production goals. Prior to the use of such goals, several avoiding this problem if we are to move ahead. We should
indices were useful in suill~lari~ing factory productivity -” study the social processes through which corruption is be-
e.g.. monetary value of total product. total weight of all ing uncovered and try to design sociji systems that incor-
products produced. or number of items produced. Each of porate these features. In the Texarkana performance-con-
these, however, created dysfunctional distortions of pro- tracting study. it was an “outside evaluator” who uncov-
duction when used as the official goal in terms of which ered the problem. In a later U.S. p~rf~)rm~nce-col~tracting
factory production was evaluated. If m~3netary value, then study. the Seattle Teachers’ Union provided the watchdog
factories would tool up for and produce only one product role. WC must seek out and institutionalize such objec-
to avoid the production interruptions in retooling. If tivity-preserving features. We should also study the insti-
weight, then factories would produce only their heaviest tutional form of those indicator systems, such as the cen-
item (e.g.. the largest nails in a nail factory). If number of sus or the cost-of-living index in the U.S.. which seem rela-
items, then only their easiest item to produce (e.g.. the tively immune to distortion. Many com~ientat~)rs, includ-
smallest nails). All these distortions led to (3verproduction ing myself ( 1969b), assume that the use of ~llultiple indi-
of unneeded items and underproduction of much needed cators, all recognized as imperfect, will alleviate the prob-
ones. lem. although Ridgeway (1956) doubts this.
To return to the U.S. experience in a final example. There are further problems that can be anticipated in
During the first period of U.S. inv~3lvement in Viet Nam, the future. A very c~iallen~in~ group centers on the use of
the estimates of enemy casualties put out by both the public o~3~nio~~surveys, questionnaires, or attitude mea-
South Vietnamese and our own military were both unveri- sigres in program evaluation. Trends in the 1J.S. are such
fiable and unbelievably large. In the spirit of McNamara that before long, it will be required that all participants in
and PPBS. an effort was then instituted to substitute a such surveys. before they answer, will know the uses to
more conservative and verifiable form of reporting, even if which the survey will be put, and will receive copies of the
it ~inderesti~nated total enemy casualties. Thus the “body results. Participants will have the right to use the results
count” was introduced, an enumeration of only those for their own political purposes. (Where opinion surveys
bodies left by the enemy on the battlefield. This became are used by the U.S. Government. our present freedom of
used not only for overall reflection of the tides of war, information statutes should be sufficient to establish this
but also for evaluating the effectiveness of specific battal- right now.) Under these conditions, using opinion surveys
ions and other military units. There was thus created a to evaluate local government service programs can be ex-
new military goal. that of having bodies to count, a goal pected to produce the f~3llowing new problems when em-
that came to function instead of or in addition to more ployed in ~3olitically sophisticated conli~~urlities such as we
traditional goals, such as gainin p control over territory. find in some of our poorest urban rle~~lib[~rl?~3(3ds:There
Pressure to score well in this regard was passed down from will be political campaigns to get respondents to reply in
higher officers to field commanders. The realities of guer- the particular ways the local political ~3r~a~~~a~i~3ns see as
illa warfare ~3articipatioll by persons of a wide variety of desirrtble. just as there are campaigns to influence the
sexes and ages added ;I permissive ai~ibi~ujty to the situa- vote. There will be eff&ts co[liparabIe to ballot-I3ox stuff-
tion. Thus pool- Lt. Galley was merely engaged in getting ing. Interviewer bias will become even more a problem.
bodies to count for the weekly effectiveness report when Bandwagon effects ~~ i.e.. conformity influence from the
he participated in the tragedy at My Lai. His goals had published results or prior surveys -. must be anticipated.
been corrupted by the worship of a quant~tati~~e indicator. New biases. like exaggerated c(3iilplaint. will emerge.
leading both to a reduction in the validity of that indica- In my judgment, opinion surveys will still be useful if
tor for its original military purposes, and a corruption ot appropriate safeguards can be developed. Most of these
the social processes it was designed to reflect. are problems that we could be doing research on now in
I am convinced that this is one of the major problems anticipation of future need. (Gordon & Campbell I.19711
to be solved if we are to achieve meaningful evaluation of provide a detailed discussion of these problems in a social
our efforts at planned social change. It is a problem that welfare service program evaluation setting.)


This has been a condensed overview of the problems en- seemed predominantly pessimistic. While there are very
countered in the U.S. experience with assessing the impact serious problems, somehow the overall picture is not as
of planned social change. The sections of the paper deal- gloomy as this seems. Note that the sections on time-series
ing with problems related to political processes have and on randomized designs contain success stories worthy
Assessing the Impact of Planned Social Change 87

of emulation. And many of the quasi-experimental evalu- which emerge when we move experimentation out of the
ations that I have scolded could have been implemented in laboratory into social program evaluation. In solving these
better ways - had the social science methodological com- problems, we may need to make new social-organizational
munity insisted upon it ~ within the present political sys- inventions.
tern. There are, however, new methodological problems


ANDERSON, J. K. Evaluation as a process in social systems. Un- BOX, G. E. P., & TIAO. G. C. A change in level of non-stationary
published doctoral dissertation, Northwestern University, Depart- time series. Biometrika, 1965, 52.181-192.
ment of Industrial Engineering & Management Sciences, 1973.
BOX, G. E. P., & TIAO, G. C. Intervention analysis with applica-
tions to economic and environmental problems. Jourrtal of the
ARD T., & ASSOCIATES. Encyclopedia of educational evaluation. American Statistical Association, 1975, 70, 70-92.
San Francisco: Jossey-Bass Publishers, 1975.
BOX, G. E. P., & JENKINS. G. M. Time-series analysis: Forecast-
ing and control. San Francisco: Holden Day, 1970.
ASHENFELTER, ORLEY. Estimating the effect of training pro-
grams on earnings. The Rev&v of Economics and Statistics, 1978, CAMPBELL. D. T. Pattern matching as an essential in distal know-
60(l), 47-57. ing. In K. R. Hammond (Ed.), The psychology of Egon Brunswik.
New York: Holt. Rinehart &Winston, 1966, 81-106.
BALDUS. D. C. Welfare as a loan: An empirical study of the
recovery of public assistance payments in the United States. CAMPBELL, D. T. Administrative experimentation. institutional
records, and nonreactive measures. In J. C. Stanley 8c S. M. Elam
Stanford Law Review, 1973, 25, 123-250. (No. 2)
(Eds.), Improving experimental design and statistical analysis.
Chicago: Rand McNally, 1967, 257-291. Reprinted: In W.M. Evan
BARNOW, B. S. The effects of Head Start and socioeconomic sta-
(Ed.) Organizational experimerlts: Laboratory and field research.
tus on cognitive development of disadvantaged children. Unpub-
New York: Harper & Row, 1971. 169-179.
lished doctoral dissertation, University of Wisconsin, Department
of Economics, 1973, (252 pp.) CAMPBELL. Donald T. A phenomenology of the other one: Cor-
rigible. hypothetical and critical. In T. Mischel (Ed.), Human
BAUMAN, R. A., DAVID, M. H. & Miller, R. F. Working with action.. Co,lcepttral and empirical issues. New York: Academic
complex data files: Il. The Wisconsin assets and incomes studies Press. 1969a. 41-69.
archive. In R. L. Bisco (Ed.), Data bases, computers, and the social
sciences. New York: Wiley-Interscience, 1970, 112-136. CAMPBELL, D. T. Reforms as experiments. American psycholo-
gist, 1969b. 24.409-429. (No. 4. April)
BECK, B. Cooking the welfare stew. In R. W. Habenstein (Ed.),
Pathways to data: Field methods for studying ongoing social organ- CAMPBELL, D. T. Considering the case against experimental eval-
izations. Chicago: Aldine, 1970. uations of social innovations. Administrative Science Quarterly,
1970, 15.110-113 (No. 1, March).
BECKER, H. M.. GEER, B., & HUGHES, E. C. Making thegrade.
New York: Wiley, 1968. CAMPBELL. D. T. Methods for an esperimenting society. Paper
presented to the Eastern Psychological Association, April 17, 1971,
BECKER, H. W. Sociological work: Method and substance. Chi- and to the American Psychological Association. Sunday, Septem-
cago: Aldine, 1970. ber 5. 12:00, Diplomat Room. Shoreham Hotel. Washington, D.C.
To appear when revised in the American Psychologist.
CAMPBELL, D. T. Experimentation revisited: A conversation
Evaluation and experiment: Some critical issues in assessing social
with Donald T. Campbell (by S. Salasin). Evaluation, 1973. 1.
programs. New York: Academic Press, 1975.
7-13. (No. 1)

BERLINER, J. S. Factory and manager in the U.S.S.R. Cambridge. CAMPBELL, D. T. Qualitative knowing in action research. Kurt
Mass.: Harvard University Press, 1957. Lewin Award Lecture, American Psychological Association meet-
ing. 1974.
BLAU, P. M. The dynamics of bureaucracy. (Rev. ed.) Chicago:
University of Chicago Press, 1963. CAMPBELL, D. T., & BORUCH, R. I;. Making the cast for ran-
domized assignment to treatments by considering the alternatives:
BORUCH, R. l:., $ CAMPBELL, D. T. Preserving confidentiality Si\ ways in which quasi-experimental evaluations in compensator]
in evaluative social research: Intrafile and interfile data analysis. education tend to undercstimatc effects. In C. A. Bennett. & A.
Paper presented at the meeting of the American Association for Lumsdaine. (Eds.). El,aluation and experiments: Some critical
the Advancement of Science, San Francisco, February/March issues in assessing social programs. New York: Academic Press,
1974. 1975. pp. 195-296.

BORUCH, R. F.. & RIECKEN. H. W., (Eds.). Experimental testing CAMPBELL. D. T.. BORUCH. R. F.. SCHWARTZ, R. D.. &
of public policy.- The proceedings of the 1974 Social Science Re- STEINBERG. J. Confidentiality-preserving modes of access to filey
search Council Conference on Social Experiments. Boulder, Colo.: and to interfile exchange for useful statistical analysis. Evaluation
Westview Press, 1975. Quarterly, 1977, I (2), 269-299.
~A~lP%~LL. D, 7.. & ERLEBACHER, A. E. How regression arfi-
fxts in quasi-e~periitlent~l evaluations can lltist~k~rtly make cow-
pens;~tory educ:ttinn look harmful. In J. ~elln)littl (Ed.). IXSQ&
vattta& child. Vol. 3. Co~}t~e~7sa~o~~,cdric~~fion: A riatioiral de-
hare, New York: Brtinner/~~~~~l. 1970.

CAMPBELL, D. T.. & Fiske. D. W. Convergent xnd discrintin;mt

vafidation by the multitrait-ntultiInethod mafri\. !?~w/~&gicaI
Brrlli’rix. 1959. 56.8i-105.

CA~~~%~L~. D. T.. SIEGMAN. C. R.. & RECS, M. B. Direction-

oi-wordlnp effects in the relationships bctwcen scales. Ps~c~z~~~~?~j-
~*a/Rullrrir,. 196-l. 68.293-303. (No. 5)

CAMPBELL. D. T. & STANLEY, J. C. Experinterttat and ytlasi-

~~spc~rimenral desigrts fbr research. Chicago: R:md McNally. 1966,

C‘APORASO. J. A.. Xr ROOS. L. L., Jr. i&Is.). ~tcasi-~,~~pcritnerttal

approaclws: Twini: tkeory arid evalzratitfg fxZcy. Evanston. 111.:
Norrhwestern University Press, 1973.

CARO. I:. G. [Cd.) Rerzdittgs irr ~~~~~~~ri~~~t

resear~c’it(2nd e&f. New
York: Russell Sage l~~~~l?td~~~on,1977.

C’fflR1:LLI. V. C.. et al. The irnpaoi of Hclid Start: An evaluatitxt

of the effects of Head Start on children’s cognitive and affective
tleveiopment. A report presented to the Office of Economic Op-
portunity pursuant to Contrxt B89-4536. June 1969. Weating-
house Learning C’orporution. Ohio University. (Distributed by
Clwringhousc for l:ederal Scientific and Technical Information.
U.S. Dcp:lrtrrrent of Commerce. National Bureau of Standards.
Institute I’or Applied Technology. PR 184 318.)

CONNER, ROSS, b’;. Selecting ;t control group: An xtalysis trf’thc

randomization process in twelve so&l rcfornt programs. Ewlrra-
rioit QIMI~W&, 1977, I. 12t, 195-244.

COOK ~I+&. CfSd.). Kwizlarinn strr&s ri=~i<w amwai, VOX. 3.

Beverly Hills, Calif’.: Sage Publications, 1978.


TAMKIN’, G.. & WEBER. S. J. S~sa~~~~ Street revisited: A (‘axe
stud.v in cvaiuatiott research. New York: Russell Sapc I:oundafion,

COOK, T. I)., & CAMPBELL, D. T. The dcsipn and conduct of

qwsi-cxperimcnts and true expcrirncnts in field settings. In M. 11.
Dunnettc & 1. P. Campbell (Eds.), Hattdbook oj’ind~striai and or-
gattizalbwal rcsearck. Chicago: Rand McNally. 1975.

COOK. T. D., d CAMPBELL, D. T. Quasi-espcl-ii~ier~ralint~: De-

s&r? and atmlj:sis issucs for fidd seitirt,s_x Chicago: Rand McNally,
1979 fin prcs\.)


%., fFds.), Program et~ahation for mental itea~rk: M&o&. stratr-
@es. attdi~arr~ci~altts. New York: Grune & Sfrerton. 1977.

~RONBA~H. L. J. Response sets and tcsx validity. ~l,a~t{a?iott and


CRONBACH, L. J. Further cvidcncc on response sets and test de-

sign. ~d~ca~i~~~za~
a& ~s.~lcko~u~jcal ~~east~re~rte~~~,1950, 10, 3-3 1.

DAVXD, $3” P. Ramify pla?z~j~?~ a?726 abortiurt in tke soriaffst cotta-

f&w of Ccwral and Easrern Brtropr. New York: The Population
Assessing the Impact of Planned Social Change 89

GU?TENTAG, M. Evaluation of social intervention programs. LORD, I:. M. Statistical adjustments when comparinp preexisting
Annals of the New York Academy of Sciences, 1913.218-3-13. groups. Ps.vchological BuNetin, 1969, 72.336-337.

GUTTENTAG, M. Subjectivity and its use in evaluation research. MAGIDSON, J. Toward a causal model approach for adjusting for
Evaluation. 1973. 1.60&5. (No. 2) pre-existing differences in the nonequivalent control group situa-
tion. t;:valuation Quarterly, 1977. l(3). 399-420.
GUTTENTAG, M., & SAAR, S., (Eds.). Evaluation studies revieltf
am~uai. Beverly Hills. Calif.: Sage Publj~tions. 1977, Vol. 2. MCCLEARY, R., d MCCAIN, L. J. The statistical analysis of the
simple interrupted time-series quasi-e~pcriment. In Cook T. D., L
HATRY. H. P., WINNIE, R. E., & FISK, D. M. Practicaipro,qram Campbell, D. T. Quasi-exlierinlelttatiorl: Design and arla(vsis issues
evaluation for state and local fiovernment of;rcials. Washington 1 ftirfield settirtg. Chicago: Rand McNally, 1979 (in press).
D.C.: The Urban Institute (2100 M Street NW. Washington, DC.
20037). 1973. MILLT<R. A. R. T&e assault on privaqv. Ann Arbor. Mich.: Univcr-
sity of Micipan Press, 197 1,
I-fF.LLER, R. N. The uses of social security administration data. In
M. E. Borus (Ed.), Evaluating the impact of manpower programs. MILLER. S. M., ROBY, T.. & STEENWIJK, A. A. V. Creaming
Lexington, Mass.: D.C.Heath, 1972, 197-201. the poor. Truns-action, 1970, 7.3845. (No. 8, June)

IKEDA, K., YINGER. 3. M., & LAYCOK, F. Reforms as experi- MORGENSTFRN. 0. 0,~ the accuraq of economic obserlsations.
ments and experiments as reforms. Paper presented to the meeting (2nd ed.) Princeton. N.J.: Princeton University Press, 1963.
of the Ohio Valley Socioio~ical Society. Akron, May 19’70.

JACKSON, D. N., & MESSICK. S. Response styles on the MMPI: MORRISSEY, W. R. Nixon anti-crime plan undermines crime
Comparison of clinical and normal samples. Jounzal of Abnormal statistics, Justice Magazine. 197 2, 1.8-i 1.14. (No S/6. June/July)
andSocialPs~~‘crtolo~~, 1962,65.285-299. (Justice Mzgazine. 922 National Press Building.. Washington. D.C.
KEPKA, E. J. Model represcntatio~~ and the threat of instability in
the interrupted time series quasi-experiment. Doctoral disserta- PORTER, A. C. The effects of’ using fallible variables in the analy-
tion, Northwestern University, Department of Psychology, 1971. sis of covnriance. Unpublished doctoral dissertation. liniversity of
Wisconsin, 1967. (University Microfilms. Ann Arbor, Michigan,
KERSHAW, D. N. A negative income tax experiment. Scientific 1968.)
American, 1972,227, 19-25.
REICHARDT, CHARLES S. The statistical analysis of data from
KERSHAW, D. N. The New Jersey negative income tax experi- the nonequivalent control group design. In Cook T., D.. & Camp-
ment: A sLim~]ary of the design, operations and results of the first be II, U. T. C)uasi-e.wperitnelltation, Design and Analysis issues for
large-scale social science experiment. In Lyons, Gene M.. (Ed.) Sn- .~~e~dseft~n~s.Chicago: Rand McNally, 1979 (in press).
cial research and public policies. Dartmouth College, Hanover,
N.H.: The Public Affairs Center, 1975, pp. 87-116. RIDCEWAY, V. Dysfunctional consequences of performance mea-
sures. Admiraistratilje Scierzct Quarter&, 1956. 1.240-247. (No. 2,
KERSHAW. D., & FAIR, J. The NeMl Jersey inconze-mai~ltenance September)
experiment, Vol. 1, Oi~eratio~~s,surveys, and adnt~~iistrat~on.New
York: Aoadcmic Press. 1976. RIECKEN, Ii. W., & Boruch, R. F., (Eds.).Sociale,~peri~ne/ttatio
A method ,for planning and evaluating social intervention. New
KITSUSE, J. K., & CIRCOUREL, A. V. A note on the uses of offi- York: Academic Press, 1974.
cial statistics. Social Problems, 1963, 11, 131-139. (Fall).
RIVLIN, A. M. Systematic thinking for social action. Washington,
KUHN, T. S. The structure of scientific revolutions. (2nd ed.) Chi- D.C.: The Brookings Institution, 1971.
cago: University of Chicago Press, 1970. (Vol. 2. No. 2, fnterna-
tional ~~~c.yrlopediao,f ~fzjfied Science). ROSS, It. L. Law. science and accidents: The British Road Safety
Act of 1967. Jouritaf of l&aI Studies. 1973, 2.1-75. (No. 1;
KUTCHINSKY, 3. The effect of easy availability of pornography American Bar Foundation)
on the incident of sex crimes: The Danish experience. Journal of
Social Issues, 1973, 29.163-181. (No. 3). ROSSI. P. H. Practice, method. and theory in evaluating social-
action programs. In J. L. Sundquist (Ed.), 0,~ f$@ing poverty.
LEVENSON. B., & MCDILL, M. S. Vocationat graduates in auto New York: Basic Books. 1969, 217-234. (Chapter 10)
mechanics: A follow-up study of Negro and white youth.P!T?llon,
1966, 27.347-357. (No. 4). ROSSI. P. H., & WILLIAMS, W. (Eds.) Evaluating socialprograms.
Theory, practice, and politics. New York: Seminar Press, 1912.
LOHR, B. W. An historical view of the research on the factors re-
lated to the utilization of health services. Duplicated Research Re- RUEHAUSEN, 0. M., & BRIM, 0. G.. Jr. Privacy and behavioral
port. Bureau for Health Services Research and Evaluation. Social research. Columbia Law Review, 1965, 65.1184-121 I.
and Economic Analysis Division, Rockville, Md.. January 1972,34
pp. In press, U.S. Government Printing Office. RUTMAN, LEONARD. (Ed.). ~vaiuatiort research met~tods: A
bas~c~~ide. Beverly Hills, Calif.: Sage, 1977.

LORD, F. M. Large-scale covariance analysis when the ccntrol SAWYER, J., & Schechter, H. Computers, privacy, and the Na-
variable is fallible. Journal of the American StatisticalAssociation, tional Data Center: The responsibility of social scientists. Amer-
1960.55.307-321. ican Ps~c~z(~~o~ist.1968, 23.810-818.

SCHWARTZ. R. D. & ORLCANS, S. On Icg:tl wnctions. iJnivcr-

sify of Cltica,qo I,anr ReGw. 1961, 34.274-300.

SEGALL, MARSHALL H. Jfnma?f behavior and public policy

political Elmsford.
py~clzolog~~ N.Y.: Perpamon (General Psychol-
ogy Series. Vol. 39). 1977.


itf.fhierzce of culture 011 Idal perccptim. Indi;m;ipoh, Ind. :
Bobbs-Merrill. 1966.

SEIDMAN. D.. & COUZENS. M. Getting the crime rate down:

Political pressure and crime rcpurting. f,aw d: Socirty KcJ~~icu~.
1974,5(3). pp. 457-493.

SHAVER, P.. & STAINES. C. Problems fucinp C~mpbell’s “I:spcr-

irnentinp Society.” Urf>afollAj.Tairs Qrlarteri)~. ) 97 I. I. I 73-l 86.
(no. 2. December)

STAKE, R. E. Testing hazards in performance contracting. Phi

Dcita Kappan. 1971, 52.583-588. (No. 10. June)


Hardhook o.f claaluation research.Vo1umes 1 and 2. Beverly Hills,
Culif’.: Sage Publications. 1975.

SUCHMAN, E. E~duafim rcscarcl~. New York: Russel1 Sage

l:oundation, 1967.

SWf?EN. J. A. The experimental regression design: An inquiry into

the feasibility ot‘ non-random trcatmrnt allocation. Unpublished
doctornl dissertation. Northwestern University, 1971.

THOMPSON. <‘.W.N. Administrlftivc e\perimcnts: The c\pericnce

of fifty-eight engineers and engineering man;rpers. IFh’P Tramac-
tiorls ofI Eqirieeriif,q Management. 1974. EM-21. pp. 42-50. (No.
2. May)

TWIGG. R. l~~~n~r~din~ of crimes verified in t3altimore. Jnstict~

Mqqazirre, 1972. 1,15, 18. (No. 5/6, June/July) (JxsfI’w .Muga;iue.
922 National Press Building. Washington, D.C. 20004.