
# CHAPTER 32

TIME-SERIES DESIGNS
Richard McCleary and David McDowall
Time-series designs are distinguished from other
designs by the properties of time-series data and by
the necessary reliance on a statistical model to con-
trol threats to validity. In the long run, a time-series
is a realization of a latent causal process. Represent-
ing the complete time-series as
the observed series $(Y_1, \ldots, Y_N)$ is a probability
sample of the complete realization. The probability
sampling weights for $(Y_1, \ldots, Y_N)$ are specified in a
statistical model, which, for present purposes, is
written in a general linear form as
(2)
The $a_t$ term of this model is the tth observation of a
strictly exogenous innovation series with the white
noise property,
$$a_t \sim \text{iid } N(0, \sigma^2) \tag{3}$$
The $X_t$ term is the tth observation of a causal
time-series. Although $X_t$ is ordinarily a binary variable
coded for the presence or absence of an interven-
tion, it can also be a purely stochastic series. In
either case, the model is constructed by a set of rules
that allow for the solution,
(4)
Because $a_t$ has white noise properties, the solved
model satisfies the assumptions of all common tests
of statistical significance.
DOI: 10.1037/13620-032
THREE DESIGN CATEGORIES
We elaborate on the specific forms of the general
model and on the set of rules for building models at
a later point. For present purposes, the general
model allows three design variations: (a) descriptive
time-series designs, (b) correlational time-series
designs, and (c) experimental or quasi-experimental
time-series designs.
Descriptive Designs
The earliest time-series designs used observed cycles
or trends in a series to infer the nature of a latent
causal mechanism. Historical examples include
Wolf's (1848; Yule, 1927) investigation of sunspot
activity and Elton's (1924; Elton & Nicholson,
1942) investigation of lynx populations. In both
cases, time-series analyses revealed cycles or trends
that corroborated substantive interpretations of the
phenomenon.
Kroeber's (1919; Richardson & Kroeber, 1940)
analyses of cultural change illustrate the poor fit of
descriptive time-series designs to many social and
behavioral phenomena. Kroeber (1944) hypothe-
sized that women's fashions change in response to
political and economic variables. During stable peri-
ods of peace and prosperity, fashions changed
slowly; during wars, revolutions, and depressions,
fashions changed rapidly. Because political and eco-
nomic cataclysms were thought to recur in long his-
torical cycles, Kroeber tested his hypothesis by
searching for the same cycles in women's fashions.
Figure 32.1 plots one of Richardson and Kroe-
ber's (1940) annual fashion time series. Although
APA Handbook of Research Methods in Psychology: Vol. 2. Research Designs, H. Cooper (Editor-in-Chief)
FIGURE 32.1. Annual skirt widths, 1787 to 1936. (Vertical axis: skirt width index; horizontal axis: year.)
Kroeber believed that the long cycles in this series
corroborated his cultural unsettlement theory,
wholly random processes can generate identical patterns.
Whereas most time-series designs treat $f(a_t)$
as a nuisance function whose sole purpose is to control
the threats to statistical conclusion validity
posed by cycles and trends, the descriptive time-series
design infers the latent causal mechanism from
$f(a_t)$. Although the statistical models and methods
developed for the analysis of descriptive designs are
applied for exploratory purposes (see Mills, 1991),
they are currently not widely used for null hypothe-
sis tests.
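The point that wholly random processes can mimic long cycles can be checked with a small simulation. The sketch below is illustrative only: it assumes Gaussian shocks, a series length of 150 (the span of Figure 32.1), and hypothetical helper names, none of which come from the chapter. It compares how long a driftless random walk and a white noise series stay on one side of their own mean.

```python
import random


def random_walk(n, rng):
    """Driftless random walk: the running sum of n independent N(0, 1) shocks."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def white_noise(n, rng):
    """n independent N(0, 1) draws with no memory from one period to the next."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def longest_run(series):
    """Longest stretch of consecutive observations on one side of the series mean."""
    mean = sum(series) / len(series)
    best, run, prev_side = 0, 0, None
    for value in series:
        side = value > mean
        run = run + 1 if side == prev_side else 1
        prev_side = side
        best = max(best, run)
    return best


def average_longest_run(generator, n=150, reps=200, seed=1936):
    """Average longest run over many simulated series of length n."""
    rng = random.Random(seed)
    return sum(longest_run(generator(n, rng)) for _ in range(reps)) / reps


walk_runs = average_longest_run(random_walk)
noise_runs = average_longest_run(white_noise)
print(f"random walk: {walk_runs:.1f} periods, white noise: {noise_runs:.1f} periods")
```

Long one-sided runs in the random walk look like slow cycles to the eye even though no cyclical mechanism generated them.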
Correlational Designs
A second type of time-series design attempts to infer
a causal relationship between two series from their
covariance. Historical examples include Chree's
(1913) analyses of the temporal correlation between
sunspot activity and terrestrial magnetism and Bev-
eridge's (1922) analyses of the temporal correlation
between rainfall and wheat prices. The validity of
correlational inferences rests heavily on theory.
When theory can specify a single causal effect oper-
ating at discrete lags, as in these natural science
examples, correlational designs support unambigu-
ous causal interpretations. Lacking theoretical speci-
fication, however, correlational designs do not allow
strong causal inferences.
Analyses of the temporal correlation between
lynchings and cotton prices by Hovland and Sears
(1940) illustrate the inferential problem. To test the
frustration-aggression hypothesis of Dollard et al.
(1939), Hovland and Sears estimated a Pearson
product-moment correlation coefficient from the
annual time-series plotted in Figure 32.2. Assuming
that the correlation would be zero in the absence of
a causal relationship, Hovland and Sears interpreted
the statistically significant estimate as corroborating
evidence. Because of common stochastic time-series
properties, however, especially trend, causally inde-
pendent series will be correlated. Controlling for
trend, Hepworth and West (1988) reported a small,
significant correlation between the series but warned
against causal interpretations. The correlation is an
artifact of the war years 1914 to 1918, when the
demand for cotton and the civilian population
moved in opposite directions (McCleary, 2000). If
the war years are excluded, the correlation is not
statistically significant.
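The claim that causally independent series will be correlated when they share trend can be illustrated by simulation. This is a sketch under stated assumptions, not a reanalysis of the Hovland and Sears data: the series are independent random walks with drift, the length of 45 matches the 1886 to 1930 span, and the cutoff of 0.3 is an arbitrary illustrative threshold. All function names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def trending_series(n, drift, rng):
    """Random walk with drift: each value adds a constant plus an N(0, 1) shock."""
    level, path = 0.0, []
    for _ in range(n):
        level += drift + rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def first_difference(series):
    """Period-to-period changes; this removes the shared trend."""
    return [b - a for a, b in zip(series, series[1:])]


def share_of_large_correlations(reps=500, n=45, drift=0.5, cutoff=0.3, seed=1940):
    """Share of replications with |r| above cutoff, in levels and in differences."""
    rng = random.Random(seed)
    levels = diffs = 0
    for _ in range(reps):
        x = trending_series(n, drift, rng)   # generated independently of y
        y = trending_series(n, drift, rng)
        levels += abs(pearson_r(x, y)) > cutoff
        diffs += abs(pearson_r(first_difference(x), first_difference(y))) > cutoff
    return levels / reps, diffs / reps


share_levels, share_diffs = share_of_large_correlations()
print(f"levels: {share_levels:.2f}, first differences: {share_diffs:.2f}")
```

Large correlations are common in the levels and rare once the trend is removed, which is why a significant correlation between two trending series is weak evidence of a causal link.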
Where theory supports strong specification, cor-
relational time-series designs continue to be used.
FIGURE 32.2. Annual cotton prices and lynchings, 1886 to 1930. (Series plotted: cotton prices and lynchings; horizontal axis: year.)
Other than limited areas in economics and psychol-
ogy, however, social theories will not support the
required specification. Even in these areas, causal
inferences require the narrow definition of Granger
causality (Granger, 1969) to rule out plausible alter-
native interpretations.
Quasi-Experimental Designs
A third type of time-series design infers the
causal effect of a temporally discrete intervention or
treatment from discontinuities or interruptions in a
time series. Campbell and Stanley (1963, pp. 37-43)
called this design the "time-series experiment," and
its use is currently the major application of time-
series data for causal inference. Historical examples
of the general approach include investigations of
workplace interventions on health and productivity
by the British Industrial Fatigue Research Board
(Florence, 1923) and by the Hawthorne experiments
(Roethlisberger & Dickson, 1939). Fisher's (1921)
analyses of agricultural interventions on crop yields
also relied on variants of the time-series quasi-
experiment.
In the simplest case of the design, a discrete
intervention breaks a time-series into pre- and postintervention
segments of $N_{pre}$ and $N_{post}$ observations.
For pre- and postintervention means, $\mu_{pre}$ and $\mu_{post}$,
analysis of the quasi-experiment tests the null
hypothesis
$$H_0: \omega = 0, \text{ where } \omega = \mu_{post} - \mu_{pre}; \quad H_A: \omega \neq 0. \tag{5}$$
Rejecting $H_0$, $H_A$ attributes $\omega$ to the intervention. In
practice, however, treatment effects are almost
always more complex than the simple change in
level implied by this null hypothesis.
Figure 32.3 illustrates a typical example of the
time-series quasi-experimental design. The data are
50 daily self-injurious behavior counts for an insti-
tutionalized patient (McCleary, Touchette, Tay-
lor, & Barron, 1999). Beginning on the 26th day, the
patient is treated with Naltrexone, an opiate-
blocker. The plotted time series leaves the visual
impression that the opiate-blocker has reduced the
incidence of self-injurious behavior. Indeed, the difference
in means for the 25 pre- and 25 postintervention
days amounts to a 42% reduction. The value
of $F = 32.56$ associated with this difference occurs
FIGURE 32.3. Self-injurious behavior incidents for a single institutionalized patient before and after an opiate-blocker regimen. (Vertical axis: self-injurious behavior incidents; horizontal axis: 25 preintervention days followed by 25 postintervention days.)
by chance with p < .0001, and the null hypothesis
can be rejected in favor of the alternative: The medi-
cation is an effective treatment for self-injurious
behavior.
This conclusion ignores a serious threat to valid-
ity. Whereas the null hypothesis test assumes that
the daily counts are independent, in fact, the count
on any given day is predictable from the count on
the preceding day. The visual evidence of Figure
32.3 leaves the unambiguous impression that the
opiate-blocker reduced the rate of self-injurious
behavior for this patient. Visual evidence can be
deceiving, of course, and that is why statistical
hypothesis tests are conducted. Because of the day-
to-day dependence of these data, however, the value
of $F = 32.56$ cannot be interpreted.
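Why day-to-day dependence invalidates the ordinary F test can be seen in a simulation with no intervention effect at all. The sketch below assumes a first-order autoregressive series (the coefficient 0.6 is an arbitrary illustrative choice, not an estimate from the patient data), 25 pre- and 25 postintervention observations, and an approximate 5% critical value of 4.04 for F with 1 and 48 degrees of freedom. Function names are hypothetical.

```python
import math
import random


def ar1_series(n, phi, rng):
    """Stationary AR(1) series with unit-variance shocks and no intervention effect."""
    value = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - phi * phi))
    series = []
    for _ in range(n):
        series.append(value)
        value = phi * value + rng.gauss(0.0, 1.0)
    return series


def naive_f(series, n_pre):
    """F statistic for the pre/post difference in means, assuming independence."""
    pre, post = series[:n_pre], series[n_pre:]
    mean_pre = sum(pre) / len(pre)
    mean_post = sum(post) / len(post)
    ss_within = (sum((v - mean_pre) ** 2 for v in pre)
                 + sum((v - mean_post) ** 2 for v in post))
    pooled_var = ss_within / (len(series) - 2)
    return (mean_post - mean_pre) ** 2 / (pooled_var * (1.0 / len(pre) + 1.0 / len(post)))


def false_positive_rate(phi, reps=2000, n=50, n_pre=25, f_crit=4.04, seed=26):
    """Share of null series for which the naive test rejects H0."""
    rng = random.Random(seed)
    rejections = sum(naive_f(ar1_series(n, phi, rng), n_pre) > f_crit
                     for _ in range(reps))
    return rejections / reps


rate_independent = false_positive_rate(phi=0.0)
rate_dependent = false_positive_rate(phi=0.6)
print(f"independent: {rate_independent:.3f}, autocorrelated: {rate_dependent:.3f}")
```

With independent observations the rejection rate stays near the nominal 5%; with positive autocorrelation it is several times larger, so a large F value by itself says little.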
More generally, time-series experiments and
quasi-experiments present many challenges to
valid inferences about causal effects. A solid ratio-
nale for the design is mostly due to the work of
Donald T. Campbell and his collaborators (Camp-
bell & Stanley, 1963; Cook & Campbell, 1979;
Shadish, Cook, & Campbell, 2002). Campbell and
associates extensively considered threats to the
design's validity and proposed ways to address
them when they were plausible. A general conclu-
sion from Campbell's work is that experimental
and quasi-experimental time-series designs face
fewer threats to validity than do most other nonex-
perimental designs. This conclusion is largely
responsible for the current popularity of time-
series research.
FOUR TYPES OF VALIDITY
Campbell and Stanley (1963; Campbell, 1963)
divided the empirical threats to valid inference into
two categories. Threats to internal validity addressed
the question, "Did in fact the experimental treat-
ments make a difference in this specific experimen-
tal instance?" Threats to external validity addressed
the question, "To what populations, settings, treat-
ment variables, and measurement variables can this
effect be generalized?"
Recognizing the incompleteness of the dichotomy,
Cook and Campbell (1979) added two additional categories. Threats to statistical conclusion
validity addressed questions of confidence and
power that had previously been included implicitly
as threats to internal validity. Threats to construct
validity addressed questions of confounding that
had previously been included implicitly as threats to
external validity. Shadish et al. (2002) used the
same four categories but expanded the list of threats
to valid inference in each category.
Table 32.1 lists the threats to validity that are relevant
to time-series studies. Time-series designs differ
from other approaches in that common threats to
validity are controlled by a statistical model. This
applies not only to quasi-experimental designs but
also to designs that would be considered true experi-
ments. When a treatment or intervention can be
TABLE 32.1
Four Types of Validity From Shadish, Cook, and Campbell (2002)

| Type of validity | Threat to validity |
| --- | --- |
| Statistical conclusion validity | Low statistical power |
| | Violated assumptions of the test |
| Internal validity | History |
| | Maturation |
| | Regression artifacts |
| | Instrumentation |
| Construct validity | Reactivity |
| | Novelty and disruption |
| External validity | Interaction over treatment variations |
| | Interaction with settings |

manipulated experimentally, presumably to control
threats to internal validity, the manipulation raises
threats to construct and external validity that can be
controlled by the statistical model.
Trade-offs among the four validities are implicit
in our tradition. The salient flaw in the Campbell
and Stanley (1963) two-category system was that all
eight threats to internal validity and all four threats
to external validity could be controlled by design.
Campbell and his colleagues proposed the four-
validity system in large part to correct this miscon-
ception. The trade-off among validities is a crucial
consideration for time-series studies. Although
threats to internal validity can be controlled, in prin-
ciple, by experimental manipulation of the treat-
ment, in practice, experimental manipulation raises
near-fatal threats to construct and external validity.
Accordingly, we analyze time-series experiments as
if they were quasi-experiments.
Statistical Conclusion Validity
Shadish et al. (2002) identified nine threats to statis-
tical conclusion validity or "reasons why researchers
may be wrong in drawing valid inferences about the
existence and size of covariation between two vari-
ables" (p. 45). Although the consequences of any
particular threat will vary across settings, the threats
to statistical conclusion validity fall neatly into cate-
gories involving misstatements of the Type I and
Type II error rates.¹
Type I errors (also known as $\alpha$-errors or false-positive
errors) occur when a true $H_0$ is mistakenly
rejected; that is, when the intervention has no effect
but the test statistic suggests otherwise. Under a
convention established by Fisher (1925), the Type I
error rate is fixed at 0.05, corresponding to a
confidence level of at least 0.95 (or 95% confidence).
Type II errors (also known as $\beta$-errors or false-negative
errors) occur when a false $H_0$ is mistakenly
accepted; that is, when the intervention has an effect
but the test statistic suggests otherwise. Following
Neyman and Pearson (1928), the conventional Type
II error rate is fixed at $\beta \leq 0.2$, corresponding to a statistical
power level of at least 0.8 (or 80% statistical
power). Whereas the Type I error rate is set a priori,

authority on this topic is Kendall and Stuart (1979, Chapter 22). Cohen (1988) and Lipsey (1990) provided more accessible
the Type II error rate is conditioned on the Type I
error rate and a likely effect size.²
For both Type I and Type II errors, uncontrolled
threats to statistical conclusion validity distort the
nominal values of $\alpha$ and $\beta$, leading to invalid inferences.
The threats are controlled by a statistical
model. The most widely used models for that purpose
are the AutoRegressive Integrated Moving
Average (ARIMA) models of Box and Jenkins
(1970). Under $H_0$, an ARIMA model is written as
$$\phi(B) Z_t = \theta(B) a_t, \qquad Z_t = Y_t - \mu_Y, \tag{6}$$
where $Z_t$ and $a_t$ are the tth observations of a stationary
time-series and an iid $N(0, \sigma^2)$ error series,
respectively; and where $\phi(B)$ and $\theta(B)$ are polynomial
lag operators. If the parameters of $\phi(B)$ and
$\theta(B)$ are appropriately constrained, the ARIMA
model can be solved for $a_t$:
$$a_t = \theta(B)^{-1} \phi(B) Z_t \tag{7}$$
To test $H_0$, the intervention is represented by a
step function (or dummy variable) defined such that
$$X_t = 0 \text{ for } t \leq N_{pre}; \quad X_t = 1 \text{ thereafter}. \tag{8}$$
A transfer function of $X_t$ is then added to the right-hand
side of the ARIMA model:
(9)
where $\delta(B)$ is a polynomial lag operator and $\omega$ is the
effect of $X_t$ on $Z_t$. Because $a_t$ is an iid $N(0, \sigma^2)$ error
term, the null hypothesis
$$H_0: \omega = 0 \tag{10}$$
can be tested with ordinary test statistics, such as t
or F, effectively controlling all Type I threats to statistical
conclusion validity.
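To make the logic of the test concrete, the following sketch fits the simplest special case: first-order autoregressive noise and an abrupt, permanent step effect. It is a hedged illustration, not the estimation method of Box and Jenkins or of any software package. The grid search over the autoregressive parameter, the conditional least squares criterion, the simulated data, and every function name are assumptions made for the example.

```python
import math
import random


def simulate(n, n_pre, mu, omega, phi, seed):
    """Simulate Y_t = mu + omega * X_t + N_t with AR(1) noise N_t = phi * N_{t-1} + a_t."""
    rng = random.Random(seed)
    noise, y, x = 0.0, [], []
    for t in range(n):
        noise = phi * noise + rng.gauss(0.0, 1.0)
        step = 0.0 if t < n_pre else 1.0      # X_t = 0 through N_pre, 1 thereafter
        x.append(step)
        y.append(mu + omega * step + noise)
    return y, x


def fit_given_phi(y, x, phi):
    """Prewhiten with a trial phi, then solve the two-regressor least squares problem."""
    u = 1.0 - phi                              # the constant after prewhitening
    ys = [y[t] - phi * y[t - 1] for t in range(1, len(y))]
    vs = [x[t] - phi * x[t - 1] for t in range(1, len(x))]
    m = len(ys)
    suu, suv, svv = m * u * u, u * sum(vs), sum(v * v for v in vs)
    suy, svy = u * sum(ys), sum(v * w for v, w in zip(vs, ys))
    det = suu * svv - suv * suv
    mu_hat = (svv * suy - suv * svy) / det
    omega_hat = (suu * svy - suv * suy) / det
    sse = sum((w - mu_hat * u - omega_hat * v) ** 2 for v, w in zip(vs, ys))
    return {"phi": phi, "mu": mu_hat, "omega": omega_hat, "sse": sse,
            "omega_var_factor": suu / det, "n_used": m}


def fit_intervention(y, x):
    """Grid search over phi; report omega, its standard error, and the t statistic."""
    fits = (fit_given_phi(y, x, k / 100.0) for k in range(-95, 96))
    best = min(fits, key=lambda fit: fit["sse"])
    resid_var = best["sse"] / (best["n_used"] - 3)   # three estimated parameters
    best["se"] = math.sqrt(resid_var * best["omega_var_factor"])
    best["t"] = best["omega"] / best["se"]
    return best


y, x = simulate(n=400, n_pre=200, mu=10.0, omega=-2.0, phi=0.5, seed=32)
result = fit_intervention(y, x)
print(result["phi"], round(result["omega"], 2), round(result["t"], 2))
```

Because the residuals of the prewhitened regression are approximately white noise, the t statistic for the step coefficient can be read against the usual reference distribution, which is the sense in which the noise model controls the Type I threats.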
Because the $\phi(B)$ and $\theta(B)$ lag operators serve the
sole purpose of transforming $Z_t$ into the underlying
$a_t$ error, ARIMA models are not unique. Methods for
building an ARIMA model have been described in Glass,
Willson, and Gottman (1975); McCleary and Hay
(1980); and especially McDowall, McCleary,
Meidinger, and Hay (1980). The $\phi(B)$ and $\theta(B)$ lag
operators can be used for descriptive purposes, but
with respect to $H_0$, they are nuisance parameters. The
transfer function of $X_t$, in contrast, is a theoretical
construct, specified on purely theoretical grounds.
We return to this specification when threats to construct validity are considered.
Although the Type II threats to statistical conclusion
validity are straightforward, they are poorly
understood and often ignored. The failure to reject
$H_0$ does not imply that $H_0$ is true. The decision to
accept $H_0$ as true requires, first, a consensus decision
on the likely effect size (or value of $\omega$); and, second,
a demonstration that the time-series quasi-experiment
was designed to yield a Type II error
rate of $\beta \leq 0.2$ for the likely effect size. In sum, the
decision to accept $H_0$ as true requires that the research
begin with an analysis of statistical power.
Of the several factors that determine statistical
power of a time-series design, the likely effect size
is the most important. Small effects are difficult to
detect even under the best circumstances. In addi-
tion to the likely size of the effect, the statistical
power of a time-series quasi-experiment is a func-
tion of the series length, balance, and to a lesser
extent, the quality of the ARIMA model. Because
maximum likelihood estimates rely on large sample
properties, analyses of short time series will often
fail to achieve the nominal level of power. Most
authorities recommend $N_{pre} + N_{post} > 50$ as the minimum
length for a time series. Statistical power
increases in proportion to the square root of series
length. But for a given length, power is highest
when the design is balanced such that $N_{pre} = N_{post}$.
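The roles of series length and balance can be made concrete with a simplified power calculation. The sketch assumes that the noise model has already reduced the series to independent, unit-variance errors, so the test of the change in level behaves like a two-sided z test; the effect size of 0.8 standard deviations and the function names are illustrative assumptions. Under these assumptions the standardized effect, rather than power itself, grows with the square root of the series length.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def power(n_pre, n_post, effect, z_crit=1.959964):
    """Power of a two-sided 5% z test for a change in level of size `effect`."""
    se = math.sqrt(1.0 / n_pre + 1.0 / n_post)   # unit error variance assumed
    shift = effect / se
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)


# Same total length (50), different balance
balanced = power(25, 25, effect=0.8)
unbalanced = power(10, 40, effect=0.8)
# Same balance, doubled length
longer = power(50, 50, effect=0.8)
print(f"25/25: {balanced:.2f}, 10/40: {unbalanced:.2f}, 50/50: {longer:.2f}")
```

For a fixed total length the balanced split gives the smallest standard error and hence the highest power, and lengthening a balanced series raises power further.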
Even when the series is long and the design is
balanced, the statistical power of a time-series
quasi-experiment can be affected adversely by a
poor-fitting ARIMA model. In addition to the purpose
of transforming $Z_t$ into $a_t$, the ARIMA model
decomposes the time-series variance into stochastic
and deterministic components, $f(a_t)$ and $g(X_t)$. To
ensure an optimal decomposition, the ARIMA
model used to test $H_0$ should have the lowest residual
variance among the several statistically adequate
models.
²Cohen (1988, pp. 3-4) and Lipsey (1990, pp. 38-40) set the conventional Type I and Type II error rates at $\alpha = .05$ and $\beta = .2$, respectively. If the Type
I error rate is set lower, say, $\alpha = .01$, the Type II error rate is set at $\beta = .04$ to maintain a 4:1 ratio of Type II to Type I errors. The 4:1 convention dates
back at least to Neyman and Pearson (1928) and reflects the view that science should be conservative.
FIGURE 32.4. Annual property crimes for cities with commercial television broadcast service in 1951 and for cities with television in 1955. (Vertical axis: property crimes; horizontal axis: year, 1945 to 1960.)
Internal Validity
Under some circumstances, all nine of the threats to
internal validity identified by Shadish et al. (2002,
p. 55) might apply to the time-series quasi-
experiment. Typically, however, only four threats
are plausible enough to pose serious difficulties.
History and maturation, which arise from the use of
multiple temporal observations, can threaten virtu-
ally any application of the design. Instrumentation
and regression can also be problems, but only for
interventions that involve advanced planning. For
unplanned interventions-what Campbell (1963,
pp. 229-230) called "natural experiments"-these
threats are much less realistic.
The largest, most obvious, and most frequent
threats to internal validity involve the operation of
history. Historical threats come from changes in a
time series that occur coincidently with an interven-
tion but are due to other causes. A standard design-
based approach to making these threats less
plausible is to analyze one or more comparison
series. The comparisons can take many forms, and a
careful choice can substantially narrow the scope
within which history can operate. An analysis might
consider no-treatment control series that the inter-
vention should not have influenced, for example, or
study multiple periods during which the interven-
tion was and was not in operation. A consistent pat-
tern of results across different variables or time
periods reduces the plausibility of historical threats
and helps support the existence of an intervention
impact.
An illustration of the effective use of comparison
series comes from Hennigan et al. (1982), who
studied changes in property crime following the
introduction of commercial television. In 34 early
cities, broadcasting began in 1950, whereas in 34
late cities, television was not available until 1954.
The time series in Figure 32.4 show the annual log-
transformed levels of property crimes for both
groups of cities.
History is a plausible threat to inferences about
the effects of television when studying the early or
late cities alone. Other variables also changed during
the years in which each group adopted television,
and many of these might explain a change in crime.
The Korean War was under way in 1951, for exam-
ple, and an economic recession began in 1955. More
generally, criminological theories suggest multiple
factors that might influence property crimes and
that could have changed around the time that broad-
casting began in either of the groups.
Considering both time series together makes the
changes in crime much more difficult to dismiss as
artifacts of history. To be plausible, a historical
explanation would have to account for increases that
occurred at two different time points but affected
only one group of cities at each. Although not
impossible, constructing such an explanation would
be a difficult enterprise.
In contrast to history, methods for addressing
the other three threats to internal validity do not
heavily rely on design variations. Maturation, which
like history is plausible in all applications of the
FIGURE 32.5. Weekly productivity measures for the first bank wiring room (Hawthorne) experiment. Estimated interventions are plotted against the series. (Vertical axis: productivity; annotations: first, second, and third "punishment" effects.)
time-series quasi-experiment, requires a statistical
modeling approach. Maturation threats appear as
trends in the data and are due to developmental pro-
cesses that are independent of the intervention.
Time-series data often display such trends, and
trending patterns are especially common in the long
series that are most desirable for analysis. Matura-
tional trends are a problem for inference because
they can easily produce false evidence of an inter-
vention effect.
Figure 32.5 illustrates one of the best-known
examples of the maturation threat, the so-called
Hawthorne effect. The data consist of weekly pro-
ductivity levels for a group of five machine operators
in the bank wiring room of the Western Electric
Company's Hawthorne Works. Researchers manipu-
lated daily rest breaks during the study period and
claimed after a visual inspection of the series that
the breaks helped increase productivity. Question-
ing this conclusion, Franke and Kaul (1978) argued
that any productivity increases were instead due
solely to fear generated by the imposition of three
"punishment" regimes. Their statistical analysis,
which included interventions at the beginning of
each regime, supported this hypothesis.
Maturation provides an explanation that chal-
lenges the interpretations of both Franke and Kaul
(1978) and the original researchers. Figure 32.5
shows the presence of a systematic trend in produc-
tivity that could easily have resulted from increases
in worker experience. The trend closely follows the
patterns that each set of researchers observed, and
this makes maturation a plausible alternative expla-
nation of their findings. A reanalysis that controlled
for the trend in fact found a small effect of the rest
breaks and no effect at all for the punishment
regimes (McCleary, 2000).
Unlike other threats to internal validity, which
are ordinarily handled by design, maturation threats
are controlled by the statistical model. Under $H_0$, the
causal effect of $X_t$ vanishes, leaving a simple model
that represents the time series as a weighted sum of
past and present white noise innovations:
$$Z_t = \phi(B)^{-1} \theta(B) a_t \tag{11}$$
Proper solutions of this model are guaranteed by
constraining the parameters of $\phi(B)$ and $\theta(B)$ to the
bounds of stationarity-invertibility.³
This assumes a stationary time series, however,
and this assumption is unwarranted in many
instances. Kroeber's fashion time series (Figure
32.1), for example, shows the drifting pattern charac-
teristic of a nonstationary random walk process. Seg-
ments of Kroeber's series are indistinguishable from
the steady secular trend that poses the maturation
threat in the Hawthorne experiment (Figure 32.5).
Although nonstationary time series are com-
monly encountered in the social sciences, most are
³Although the bounds of stationarity and invertibility are identical, they are distinct properties of a time series. See Box and Jenkins (1970, pp. 53-54).
All modern time-series software packages report parameter estimates that are constrained to the stationarity-invertibility bounds.
FIGURE 32.6. Weekly productivity measures from Figure 32.5, differenced. (Vertical axis: productivity differences.)
stationary in first-differences. Using $\nabla$ to represent
the differencing operation
$$\nabla Y_t = Y_t - Y_{t-1}, \tag{12}$$
a nonstationary time series can be modeled as
(13)
or
(14)
Figure 32.6 plots the first-differences of the Haw-
thorne experiment time series. The differenced
series fluctuates around a constant level, and matu-
ration is no longer a plausible threat to internal
validity.⁴
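The effect of differencing on a trending series can be sketched in a few lines. The example assumes a random walk with drift as the nonstationary process (an illustrative choice, with an arbitrary drift of 0.5 per period) and uses hypothetical function names. It shows that the differenced series fluctuates around a constant level and that its mean estimates the secular trend.

```python
import random


def drifting_series(n, drift, seed):
    """Random walk with drift: Y_t = Y_{t-1} + drift + a_t, with N(0, 1) shocks."""
    rng = random.Random(seed)
    level, path = 0.0, []
    for _ in range(n):
        level += drift + rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def first_difference(series):
    """Apply the differencing operation: Y_t - Y_{t-1} for each consecutive pair."""
    return [b - a for a, b in zip(series, series[1:])]


def lag1_autocorrelation(series):
    """Sample correlation between each observation and the one before it."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den


levels = drifting_series(n=200, drift=0.5, seed=12)
diffs = first_difference(levels)
trend_estimate = sum(diffs) / len(diffs)
print(f"lag-1 autocorrelation, levels: {lag1_autocorrelation(levels):.2f}")
print(f"lag-1 autocorrelation, differences: {lag1_autocorrelation(diffs):.2f}")
print(f"mean of differences (trend estimate): {trend_estimate:.2f}")
```

The levels are almost perfectly predictable from their own past, which is the signature of trend, whereas the differences are not; a test for a change in level is then carried out on the differenced series.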
In addition to controlling maturation threats,
differencing removes the confounding effects of
cross-sectional fixed causes of $Y_t$. To illustrate,
suppose that $Y_t$ is the U.S. unemployment rate and
that $W$ represents the causes of $Y_t$ that vary cross-sectionally
but that are constant over short periods
of time. When consecutive observations are
differenced,
$$Y_t - Y_{t-1} = f(a_t) - f(a_{t-1}) + W - W \tag{15}$$
$$\nabla Y_t = f(a_t) - f(a_{t-1}), \tag{16}$$
the confounding effects of W vanish from the model.
This property of the difference equation model is the
motivation for the use of fixed-effects panel models
in economics (Greene, 2000).
Like the maturation threat to internal validity,
the regression threat is controlled by the statistical
model. Whereas the maturation threat is plausible
in both time-series experiments and quasi-
experiments, however, the regression threat is plau-
sible only in quasi-experiments involving planned
interventions. The regression threat arises whenever
the intervention is a reaction to an unusually high or
low level of the time series. Regardless of the inter-
vention's effects, regression to the mean is likely to
produce an increase or decrease in the series level.
In one of the earliest formal applications of the
time-series quasi-experiment, Campbell and Ross
(1968) studied the impact on highway fatalities of a
1955 speeding crackdown in Connecticut. Traffic
deaths dropped significantly after the crackdown
began, but Campbell and Ross showed that the
decrease was largely attributable to a regression arti-
fact. Fatalities were unusually high in 1955, and the
crackdown was a response intended to reduce them.
A drop was then predictable as deaths regressed
back toward their historically average levels.
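The regression artifact can be reproduced without any intervention effect. The sketch below is not a model of the Connecticut data: it assumes a stationary first-order autoregressive series around a mean of 100, with illustrative parameter values and hypothetical function names, and treats every unusually high observation as a moment when a reactive intervention might have been launched.

```python
import random


def stationary_series(n, mean, phi, shock_sd, seed):
    """Stationary AR(1) fluctuations around a fixed mean; nothing ever intervenes."""
    rng = random.Random(seed)
    deviation, series = 0.0, []
    for _ in range(n):
        deviation = phi * deviation + rng.gauss(0.0, shock_sd)
        series.append(mean + deviation)
    return series


def changes_after_peaks(series, threshold):
    """Next-period change following each observation above the threshold."""
    return [series[t + 1] - series[t]
            for t in range(len(series) - 1) if series[t] > threshold]


series = stationary_series(n=5000, mean=100.0, phi=0.3, shock_sd=10.0, seed=1955)
after_peaks = changes_after_peaks(series, threshold=115.0)
average_drop = sum(after_peaks) / len(after_peaks)
all_changes = [b - a for a, b in zip(series, series[1:])]
average_change = sum(all_changes) / len(all_changes)
print(f"{len(after_peaks)} peaks; mean change after a peak: {average_drop:.1f}")
print(f"mean change over all periods: {average_change:.2f}")
```

The series falls, on average, after every high point even though its overall level never changes, so an intervention timed to a peak will appear effective whether or not it has any impact.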
Regression becomes a less plausible threat to
internal validity as the length of a time series
increases. Introducing the intervention at an
unusually high (or low) point in the series will
create a transient bias in estimates of the pre- and
⁴The mean of the first-differenced time series is interpreted as the secular trend of $Y_t$.
FIGURE 32.7. Monthly burglaries for Tucson, 1975 to 1981. During a 24-month period, burglaries are assigned to detectives for investigation. (Vertical axis: burglaries; horizontal axis: year.)
postintervention means. As the pre- and postinter-
vention series grow longer, the bias becomes pro-
portionately smaller, and eventually it reaches zero.
The recommendation to use a total series length of
50 or more observations for the time-series quasi-
experiment is in part intended to reduce the plausi-
bility of regression threats (McCleary & Hay, 1980;
McDowall et al., 1980).
Instrumentation is also a plausible threat to
planned interventions because new methods for
measuring the outcome variable often accompany
the introduction of other changes. Figure 32.7 (from
McCleary, Nienstedt, & Erven, 1982) presents a
monthly plot of Tucson burglary counts. For 2 years
beginning in 1979, detectives replaced uniformed
officers in performing burglary investigations. In
1981, the investigative responsibility was returned
to the uniformed officers. Consistent with the
notion that detectives are more proficient in pre-
FIGURE 32.8. Daily disruptions caused by talking out before and after a behavioral intervention. (Vertical axis: daily "talking out" incidents.)
venting burglaries, the counts were lower when they
handled the cases.
Although the switching intervention feature of
the design effectively rules out history, all other
internal validity threats are still plausible. Additional
analysis showed that detectives and uniformed offi-
cers did not keep records in the same way, and this
difference reduced the number of burglaries that the
detectives recorded. Allowing for the influence of
the instrumentation change, burglary counts did not
vary significantly with the type of officer responsible
for the investigations.
Construct Validity
Shadish et al. (2002) identified 14 "reasons why
inferences about the constructs that characterize
study operations may be incorrect" (p. 73). One of the
14 threats to construct validity, novelty and disruption,
is relevant to experimental and quasi-experimental
time-series designs. Regardless of whether an inter-
vention has its intended effect, the time series is
likely to react to the novelty or disruption associated
with it. If the general form of the artifact is known, it
can be incorporated into the statistical model.
Figure 32.8 illustrates one aspect of this threat to
construct validity. Hall et al. (1971) counted the
number of talking out disruptions in a classroom for
20 consecutive days. When a behavioral interven-
tion is implemented on the 21st day, the time series
changes gradually, falling to a lower daily level of
disruption. If the gradual nature of the response is
not taken into account, the effectiveness of the intervention
may be misjudged. Because gradual
responses to interventions are a common feature of
behavioral research, the uncontrolled threat to con-
struct validity can have serious consequences.
Figure 32.9 illustrates the complementary aspect
of this threat to construct validity. Similar to Figure
32.3, these data are daily counts of self-injurious
behavior incidents but for a different patient
(McCleary et al., 1999). Instead of receiving an
opiate-blocker, beginning on the 26th day, this
patient received a placebo. The level of the time series
dropped immediately but then, within a few days,
returned to its preintervention level. Placebo effects
of this sort are common in time-series experiments.
Given a well-behaved time-series process and a
FIGURE 32.9. Self-injurious behavior incidents for a single institutionalized patient before and after a placebo regimen. (Vertical axis: self-injurious behavior incidents; horizontal axis: 25 preintervention days followed by 25 postintervention days.)
sufficient number of postintervention observations,
the threat to construct validity may be ignored. Under
more realistic circumstances, however, the placebo
effect must be incorporated into the analytical model.
Campbell and Stanley (1963, p. 43) recognized
the threat to validity posed by a dynamic response to
an intervention; still, their external validity assess-
ment of the time-series quasi-experiment seemed to
leave the issue open. Addressing the same threat
from a modeling perspective, Box and Jenkins
(1970; Box & Tiao, 1975) proposed a lagged polynomial
parameterization of $h(X_t)$ that allows for
hypothesis testing. The polynomial lag makes the
ARIMA model inherently nonlinear, complicating
the interpretation of analytic results. The polyno-
mial lag provides a straightforward method of con-
trolling novelty and disruption threats to construct
validity, however, and has become widely accepted
in the social sciences.
Figure 32.10 shows four variations of the same
general polynomial lag model. The model variations
in the top row depict permanent responses to the
intervention. The series may respond to the inter-
vention instantaneously or gradually but, in either
case, the response is permanent. A gradual, perma-
nent response model seems to capture the effect of
the behavioral intervention on the daily time series
of disruptive talking out incidents (Figure 32.8).
The model variations in the bottom row of Fig-
ure 32.10 depict temporary responses to the inter-
vention. Both responses model placebo artifacts,
spiking at the onset of the intervention but then
decaying over time to reveal the long-run effect of
the intervention or treatment. The gradual, tempo-
rary response model seems to capture the effect of a
placebo intervention on the daily time series of self-
injurious behavior (Figure 32.9).
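As a rough illustration of these response patterns, the following Python sketch generates three of them from a first-order transfer function of the kind reported in Table 32.2. The function names and the parameter values (omega = -1, delta = 0.6, onset after 5 periods) are hypothetical and chosen only to make the shapes visible; they are not taken from Figure 32.10.

```python
def step_series(n_pre, n_post):
    """Binary intervention series X_t: 0 before onset, 1 from onset onward."""
    return [0.0] * n_pre + [1.0] * n_post


def first_difference(x):
    """First difference of x, treating the value before the series as 0."""
    return [x[0]] + [x[t] - x[t - 1] for t in range(1, len(x))]


def first_order_response(x, omega, delta):
    """Recursive form of omega * x_t / (1 - delta * B):
    Z_t = delta * Z_{t-1} + omega * x_t, starting from Z = 0."""
    z, previous = [], 0.0
    for x_t in x:
        previous = delta * previous + omega * x_t
        z.append(previous)
    return z


# Hypothetical values, for illustration only.
OMEGA, DELTA = -1.0, 0.6
x = step_series(n_pre=5, n_post=10)

# delta = 0: the full effect appears at onset and stays.
abrupt_permanent = first_order_response(x, OMEGA, 0.0)
# 0 < delta < 1: the effect accumulates toward OMEGA / (1 - DELTA).
gradual_permanent = first_order_response(x, OMEGA, DELTA)
# Differenced input: a spike at onset that decays geometrically.
decaying_spike = first_order_response(first_difference(x), OMEGA, DELTA)

for name, z in [("abrupt, permanent", abrupt_permanent),
                ("gradual, permanent", gradual_permanent),
                ("temporary spike", decaying_spike)]:
    print(name, [round(v, 2) for v in z])
```

Setting delta to 0 collapses the gradual model into the abrupt one, which is one way to see that the patterns are variations of a single general model.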
Permanent and temporary responses can be com-
bined in a model. Figure 32.11 shows a time series
of divorce rates for Australia before and after the
1975 Family Law Act, which allowed for no-fault
divorce. Opponents of the act argued that its no-
fault provisions would lead to an increase in
divorces. An evaluation of the act by the Australian
government found that although divorces did rise
following the act, the divorce rate fell back to its
pre-1975 level after 3 years. The fact that post-1975
divorce rates were higher was attributed to
a secular trend.
Reanalyzing these data, McCleary and Riggs (1982)
hypothesized a complex response to the act, realized
as the sum of a permanent and a temporary increase
in divorce. The temporary spike in divorces decayed
rapidly in the years immediately following 1975.
Divorces never returned to their pre-1975 rates, how-
ever, and instead stabilized at a new higher level.
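A compound response of this kind can be sketched in Python as a level shift plus a geometrically decaying spike. The functional form assumed here, $Z_t = \omega_1 X_t + \omega_2 \nabla X_t (1 - \delta B)^{-1}$, and all parameter values are hypothetical; they illustrate the shape of the hypothesis only and are not the specification or the estimates reported by McCleary and Riggs (1982).

```python
def combined_response(n_pre, n_post, omega_permanent, omega_temporary, delta):
    """Permanent level shift plus a temporary spike, both starting at onset.

    In period j after onset (j = 0 is the first postintervention period) the
    response is omega_permanent + omega_temporary * delta**j.
    """
    z = [0.0] * n_pre
    for j in range(n_post):
        z.append(omega_permanent + omega_temporary * delta ** j)
    return z


# Hypothetical values in arbitrary units, for illustration only.
response = combined_response(n_pre=5, n_post=20,
                             omega_permanent=10.0,
                             omega_temporary=30.0,
                             delta=0.5)
print([round(v, 1) for v in response])
```

With these made-up values the series jumps by 40 units at onset and then settles at the permanent shift of 10 units, never returning to its preintervention level.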

FIGURE 32.10. Model responses to an intervention. The top row illustrates permanent response patterns. The bottom row illustrates temporary response patterns. (Four panels; vertical axes: percentage of effect.)

Whether or not they have permanent effects, new
laws often have temporary effects that are well mod-
eled as decaying spikes. Failure to allow for these
temporary effects can lead to invalid inferences. At
the individual level, placebo artifacts pose an analo-
gous threat to construct validity. Although these
threats are easily controlled with an explicit complex
response model, by allowing the possibility of
several responses, the model raises a potential threat
to statistical conclusion validity: fishing. To control
the fishing threat, the complex response model must
be fully specified before any hypothesis test.

FIGURE 32.11. Australian divorces before and after the 1975 Family Law Act. (Vertical axis: divorces, ticks from 0 to 80; horizontal axis: year, 1950 to 1995, with the Family Law Act of 1975 marked.)

four-validity system. In principle, all nine threats to
internal validity can be controlled by manipulating
the intervention or treatment experimentally. Inter-
nal validity is bought at the expense of construct
validity, however, which may be more threatening
in single-subject designs. Although the opiate-
blocker regimen appears to reduce self-injurious
behavior, implementation of the regimen provokes a
week-long reaction to the novelty and disruption.
Because none of the common threats to internal
validity seem plausible, trading construct for inter-
nal validity may be unwarranted.
External Validity
A time-series quasi-experiment typically considers
an intervention's influence on only one series, and
this makes it highly vulnerable to external validity
threats. External validity considers whether find-
ings hold "over variations in persons, settings,
treatments, and outcomes" (Shadish et al., 2002,
p. 83), and threats to it are always plausible when
analyzing a single series. Ruling out external valid-
ity threats necessarily requires replicating the
quasi-experiment over a diverse set of conditions.
The evaluation research literature shows
many cases in which effect estimates exist to sup-
port every possible conclusion about a program's
impact. Evaluations of gun control policies, for
example, include numerous instances of positive,
negative, and null effect estimates (Reiss & Roth,
1993, pp. 255-287). In these situations, the vari-
ance of the effects across replications can be more
informative than is a single-point estimate of the
average effect.
If several quasi-experimental replications exist,
they allow external validity to be assessed in one of
two ways. First, the set of individual time series can
be assembled to form a single vector series, $Y_t$:
(17)
A statistical analysis can then take advantage of the
variation in the replications to obtain estimates of
both the average impact and its expected variability
(e.g., McGaw & McCleary, 1985). Second, an analy-
sis can decompose the set of individual impact
estimates,
(18)
into components associated with various external
validity threats.
The second approach is a restricted case of the
first, and in theory, it is less desirable. Still, the first
and more general model makes highly demanding
assumptions, and time-series often do not conform
to them. Because the second approach places fewer
requirements on the data, it is generally
more practical to apply. Statistical models for the
second approach come from meta-analysis and
divide the effect variance into components caused
by the setting, intervention, and other potential
threats to external validity. McDowall, Loftin, and
Wiersema (1992) used this approach to estimate the
overall impact and variability of sentencing laws,
and McCleary (2000) used it to combine estimates
from the Hawthorne experiments.
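A minimal sketch of the second approach, assuming each replication supplies an impact estimate and its sampling variance. The DerSimonian-Laird method-of-moments estimator used here is one common meta-analytic choice and is an assumption of this example, not necessarily the model used in the studies cited above. It separates only sampling variance from between-replication variance and does not attribute the latter to particular settings or interventions. The function name and the input values are hypothetical.

```python
import math


def random_effects_summary(effects, variances):
    """DerSimonian-Laird summary of k impact estimates.

    effects   -- one impact estimate per replication
    variances -- the sampling variance of each estimate
    Returns the average impact, its standard error, and tau2, the estimated
    variance of the true impacts across replications.
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two estimates, each with a variance")
    w = [1.0 / v for v in variances]                      # inverse-variance weights
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                    # between-replication variance
    w_star = [1.0 / (v + tau2) for v in variances]
    mean = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {"mean": mean, "se": se, "tau2": tau2, "q": q}


# Hypothetical impact estimates from three replications.
summary = random_effects_summary(effects=[-2.0, 0.0, 2.0],
                                 variances=[0.1, 0.1, 0.1])
print(summary)
```

When the estimates disagree by more than their sampling error would predict, tau2 is positive, which is the sense in which the variance of effects across replications can be more informative than the average alone.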
DYNAMIC INTERVENTION ANALYSES
A salient advantage of time-series designs over other
before-after designs is the capability of modeling
dynamic responses to the intervention. Proper specifi-
cation of a dynamic response model, such as those
shown in Figure 32.10, requires a parsimonious the-
ory of the response. One theory that has proved use-
ful in psychological research restricts the response to
one of four types defined by dichotomizing onset and
duration. The response may be abrupt or gradual in
onset and permanent or temporary in duration.
Although general transfer function specifications
(Box & Jenkins, 1970; Box & Tiao, 1975) allow a
wider range of responses, the four-type theory is
more realistic for psychological interventions and
more appropriate for testing intervention null
hypotheses.
The talking out intervention plotted in Figure
32.8 is a typical gradual, permanent response to an
intervention. Though implemented on the 21st
day, the full effect of the intervention is not realized
immediately but, rather, accumulates gradually
over several days. If a time-series model is
written as the sum of stochastic and intervention
components,

$$Y_t = f(a_t) + g(X_t), \tag{19}$$

the stochastic component plays no meaningful role
in our explication of the intervention analysis. Procedures
for building statistically adequate ARIMA
models of $f(a_t)$ are described elsewhere (McCleary &
Hay, 1980) but are of little interest here. Subtracting
the stochastic component from the series,

$$Z_t = Y_t - f(a_t), \tag{20}$$

leaves the dynamic intervention component:

$$Z_t = g(X_t). \tag{21}$$

The simplest dynamic model of a gradual, permanent
response to $X_t$ is

$$Z_t = X_t\,\omega\,(1 - \delta B)^{-1}, \tag{22}$$

where $B$ is a backward lag operator defined such that
$B^k X_t = X_{t-k}$ for any integer $k$.
A Taylor series expansion of the right-hand side
yields the more useful series identity:

$$(1 - \delta B)^{-1} = 1 + \delta B + \delta^2 B^2 + \delta^3 B^3 + \cdots \tag{23}$$

$$Z_t = X_t\,\omega\,(1 + \delta B + \delta^2 B^2 + \delta^3 B^3 + \cdots) \tag{24}$$

$$Z_t = X_t\omega + \delta X_{t-1}\omega + \delta^2 X_{t-2}\omega + \delta^3 X_{t-3}\omega + \cdots \tag{25}$$

$X_t$ is defined as a binary variable such that

$$X_t = 0 \text{ for preintervention days } t = 1, \ldots, 20 \tag{26}$$

$$X_t = 1 \text{ for postintervention days } t = 21, \ldots, 40. \tag{27}$$

Before the intervention, $X_{t \le 20} = 0$, so

$$Z_{t \le 20} = 0. \tag{28}$$

Thereafter, $X_{t > 20} = 1$. On the $j$th postintervention
day,

$$Z_{20+j} = \omega + \delta\omega + \delta^2\omega + \cdots + \delta^{j-1}\omega. \tag{29}$$

Because $\delta$ is a fraction, $\delta^{j-1}$ is a very small number
for large $j$.
Parameter estimates for the talking out time
series are reported in the left-hand panel of
Table 32.2.⁵ Substituting the estimates of $\omega$ and $\delta$
into the series identity,

$$Z_{21} = -6.79(1) = -6.79 \tag{30}$$

$$Z_{22} = -6.79(1 + .56) = -10.59 \tag{31}$$

$$Z_{23} = -6.79(1 + .56 + .31) = -12.72 \tag{32}$$

$$Z_{24} = -6.79(1 + .56 + .31 + .18) = -13.91 \tag{33}$$

$$Z_{25} = -6.79(1 + .56 + .31 + .18 + .10) = -14.58. \tag{34}$$

Daily changes in talking out continue throughout
the postintervention segment, but reductions
become smaller and smaller. Eventually, the effect
will converge on

$$-6.79 / (1 - .56) = -15.43, \tag{35}$$

but by the end of the fifth postintervention day, 95%
of this effect has been realized.

TABLE 32.2
Parameter Estimates for Dynamic Intervention Models

Gradual response, permanent change                   Abrupt response, temporary change
$g(X_t) = X_t\,\omega\,(1 - \delta B)^{-1}$          $g(X_t) = (1 - B)X_t\,\omega\,(1 - \delta B)^{-1}$
Talking out (Figure 32.8)                            Self-injurious behavior (Figure 32.9)
$\hat{\omega} = -6.79$   $t(\hat{\omega}) = -6.81$   $\hat{\omega} = -3.65$   $t(\hat{\omega}) = -3.04$
$\hat{\delta} = 0.56$    $t(\hat{\delta}) = 8.48$    $\hat{\delta} = 0.57$    $t(\hat{\delta}) = 3.02$

⁵ Parameters estimated with the SCA Statistical System (Liu, 1999).

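The arithmetic in Equations 30 through 35 can be checked with a short Python sketch. It evaluates the finite geometric sum for the jth postintervention day and the long-run limit, using the talking out estimates quoted in the text (-6.79 and 0.56). The function and constant names are illustrative.

```python
OMEGA_HAT = -6.79   # estimate of omega used in the text
DELTA_HAT = 0.56    # estimate of delta used in the text


def cumulative_effect(j, omega=OMEGA_HAT, delta=DELTA_HAT):
    """Effect on the j-th postintervention day:
    omega * (1 + delta + ... + delta**(j - 1))."""
    return omega * sum(delta ** k for k in range(j))


def long_run_effect(omega=OMEGA_HAT, delta=DELTA_HAT):
    """Limit of the geometric sum as j grows: omega / (1 - delta)."""
    return omega / (1.0 - delta)


for j in range(1, 6):
    print(f"Z_{20 + j} = {cumulative_effect(j):.2f}")
print(f"long-run effect = {long_run_effect():.2f}")
print(f"share realized by day 5 = {cumulative_effect(5) / long_run_effect():.3f}")
```

The share realized after j days equals 1 - delta**j, so with delta = 0.56 it is about 0.945 on the fifth day, which the text rounds to 95%.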
The self-injurious behavior time series in Figure
32.9 presents a typical abrupt, temporary response to
an intervention that, in this case, is a placebo. On the
first postintervention day, this patient's rate of self-
injurious behavior drops abruptly but, in subsequent
days, returns to its preintervention level. The simplest
dynamic model of an abrupt, temporary effect is

$$Z_t = \nabla X_t\,\omega\,(1 - \delta B)^{-1}. \tag{36}$$

This model has the series identity:

$$Z_t = \nabla X_t\omega + \delta\nabla X_{t-1}\omega + \delta^2\nabla X_{t-2}\omega + \delta^3\nabla X_{t-3}\omega + \cdots \tag{37}$$

Whereas $X_t$ remains on throughout the postintervention
period, the first difference of $X_t$, $\nabla X_t$, turns
on in the first postintervention day and then turns
off again:

$$\nabla X_t = 0 \text{ for days } t = 1, \ldots, 25 \tag{38}$$

$$\nabla X_t = 1 \text{ for } t = 26 \tag{39}$$

$$\nabla X_t = 0 \text{ for days } t = 27, \ldots, 50. \tag{40}$$

Before the intervention, $\nabla X_{t \le 25} = 0$ and

$$Z_{t \le 25} = 0. \tag{41}$$

On the first day of the intervention, $\nabla X_{26} = 1$, so

$$Z_{26} = \omega. \tag{42}$$

But thereafter, $\nabla X_{t > 26} = 0$ again, and

$$Z_{26+j} = \delta^{j}\omega. \tag{43}$$

Again, as $\omega\delta^{j}$ approaches 0, the placebo effect
decays.
Parameter estimates for the placebo self-injurious
behavior time series are reported in the right-hand
column of Table 32.2. Substituting the estimates of
$\omega$ and $\delta$ into the series identity,

$$Z_{26} = -3.04, \tag{44}$$

$$Z_{27} = -3.04(.57) = -1.73, \tag{45}$$

$$Z_{28} = -3.04(.57)^2 = -0.99, \tag{46}$$

$$Z_{29} = -3.04(.57)^3 = -0.56, \text{ and} \tag{47}$$

$$Z_{30} = -3.04(.57)^4 = -0.32. \tag{48}$$

By the end of the fifth postintervention day, 90% of the
placebo effect has dissipated.
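The same kind of check applies to Equations 44 through 48. The sketch below uses the values substituted in the text above (-3.04 and 0.57) and treats day 26 as the onset; the function and constant names are illustrative.

```python
ONSET_DAY = 26
OMEGA = -3.04   # value substituted for omega in the text
DELTA = 0.57    # value substituted for delta in the text


def placebo_effect(day, onset=ONSET_DAY, omega=OMEGA, delta=DELTA):
    """Abrupt, temporary response: 0 before onset, omega on the onset day,
    then shrinking by a factor of delta each day."""
    if day < onset:
        return 0.0
    return omega * delta ** (day - onset)


for day in range(26, 31):
    print(f"Z_{day} = {placebo_effect(day):.2f}")

dissipated = 1.0 - placebo_effect(30) / placebo_effect(26)
print(f"share dissipated by day 30 = {dissipated:.3f}")
```

Four days after the spike the remaining share is 0.57**4, or roughly 0.11, which corresponds to the 90% dissipation noted in the text.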
In either of these two dynamic models, the
parameter $\delta$ determines the rate of postintervention
change in the time series. Intervention null hypotheses
can be devised around the value of $\delta$ to test
properties of the response.
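To make the role of the rate parameter concrete, the following Python sketch counts how many postintervention days a gradual, permanent response needs before a given share of its long-run effect is realized. It relies on the fact that the share realized after j days is 1 - delta**j. The function name is hypothetical.

```python
def days_to_realize(proportion, delta):
    """Smallest number of postintervention days j with 1 - delta**j >= proportion."""
    if not (0.0 < delta < 1.0) or not (0.0 < proportion < 1.0):
        raise ValueError("delta and proportion must lie strictly between 0 and 1")
    days, remaining = 0, 1.0
    while 1.0 - remaining < proportion:
        remaining *= delta   # share of the long-run effect not yet realized
        days += 1
    return days


for delta in (0.2, 0.56, 0.9):
    print(delta, days_to_realize(0.90, delta))
```

Values of delta near 0 give an almost immediate response and values near 1 give a slow one. For the temporary model the same count gives the number of days after the onset spike needed for that share of the effect to dissipate.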
CONCLUSION
Time-series data and time-series designs have a long
history in psychological research. Of the many uses
of time-series data, causal inferences from experi-
ments and quasi-experiments are currently their wid-
est application. Given a reasonably long time series,
balanced data, and an adequate ARIMA model, the
time-series quasi-experiment is among the most use-
ful and valid quasi-experimental designs.
The advantages of the time-series quasi-
experiment are especially apparent in the absence of
naturally defined control groups. To emphasize this
property, Campbell and Stanley (1963) cited the
hypothetical example of a chemist who, dipping an
iron bar into nitric acid, attributes the bar's loss of
weight to the acid bath: "There may well have been
'control groups' of iron bars remaining on the shelf
that lost no weight but the measurement and report-
ing of these weights would typically not be thought
necessary or relevant" (p. 37).
The design is also vulnerable to multiple threats
to validity, of course, and one would normally not
use it in situations in which randomized controlled
trials are possible. These cases aside, the time-series
quasi-experiment is a feasible and relatively strong
design across a wide range of circumstances.
References
Beveridge, W. H. (1922). Wheat prices and rainfall in western Europe. Journal of the Royal Statistical Society, 85, 412-475. doi:10.2307/2341183
Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. San Francisco, CA: Holden-Day.
Box, G. E. P., & Tiao, G. C. (1975). Intervention analysis with applications to economic and environmental problems. Journal of the American Statistical Association, 70, 70-79. doi:10.2307/2285379
Campbell, D. T. (1963). From description to experimentation: Interpreting trends as quasi-experiments. In C. W. Harris (Ed.), Problems in measuring change (pp. 212-243). Madison: University of Wisconsin Press.
Campbell, D. T., & Ross, H. L. (1968). The Connecticut crackdown on speeding: Time series data in quasi-experimental analysis. Law and Society Review, 3, 33-53. doi:10.2307/3052794
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand-McNally.
Chree, C. (1913). Some phenomena of sunspots and of terrestrial magnetism at Kew Observatory. Philosophical Transactions of the Royal Society of London, Series A, 212, 75-116. doi:10.1098/rsta.1913.0003
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Englewood Cliffs, NJ: Erlbaum.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton-Mifflin.
Dollard, J., Doob, L. W., Miller, N. E., Mowrer, O. H., & Sears, R. R. (1939). Frustration and aggression. New Haven, CT: Yale University Press.
Elton, C. S. (1924). Fluctuations in the numbers of animals: Their causes and effects. Journal of Experimental Biology, 2, 119-163.
Elton, C., & Nicholson, M. (1942). The ten-year cycle in numbers of the lynx in Canada. Journal of Animal Ecology, 11, 215-244. doi:10.2307/1358
Fisher, R. A. (1921). Studies in crop variation: An examination of the yield of dressed grain from Broadbalk. Journal of Agricultural Science, 11, 107-135. doi:10.1017/S0021859600003750
Fisher, R. A. (1925). Statistical methods for research workers. London, England: Oliver & Boyd.
Florence, P. S. (1923). Recent researches in industrial fatigue. Economic Journal, 33, 185-197. doi:10.2307/2222844
Franke, H. F., & Kaul, J. D. (1978). The Hawthorne experiments: First statistical interpretation. American Sociological Review, 43, 623-643. doi:10.2307/2094540
Glass, G. V., Willson, V. L., & Gottman, J. M. (1975). Design and analysis of time series experiments.
Granger, C. W. J. (1969). Investigating causal relationships by econometric models and cross-spectral methods. Econometrica, 37, 424-438. doi:10.2307/1912791
Greene, W. H. (2000). Econometric analysis (4th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Hall, R. V., Fox, R., Willard, D., Goldsmith, L., Emerson, M., Owen, M., ... Porcia, E. (1971). The teacher as observer and experimenter in the modification of disputing and talking-out behaviors. Journal of Applied Behavior Analysis, 4, 141-149. doi:10.1901/jaba.1971.4-141
Hennigan, K. M., Del Rosario, M. L., Heath, L., Cook, T. D., Wharton, J. D., & Calder, B. J. (1982). Impact of the introduction of television on crime in the United States: Empirical findings and theoretical implications. Journal of Personality and Social Psychology, 42, 461-477. doi:10.1037/0022-3514.42.3.461
Hepworth, J. T., & West, S. G. (1988). Lynchings and the economy: A time-series reanalysis of Hovland and Sears. Journal of Personality and Social Psychology, 55, 239-247. doi:10.1037/0022-3514.55.2.239
Hovland, C. I., & Sears, R. R. (1940). Minor studies of aggression IV. Correlation of lynchings with economic indices. Journal of Psychology, 9, 301-310. doi:10.1080/00223980.1940.9917696
Kendall, M., & Stuart, A. (1979). The advanced theory of statistics (4th ed., Vol. 2). London, England: Charles Griffin.
Kroeber, A. L. (1919). On the principle of order in civilization as exemplified by changes of fashion. American Anthropologist, 21, 235-263. doi:10.1525/aa.1919.21.3.02a00010
Kroeber, A. L. (1944). Configurations of cultural growth. Berkeley: University of California Press.
Lipsey, M. (1990). Design sensitivity: Statistical power for experimental research. Thousand Oaks, CA: Sage.
Liu, L.-M. (1999). Forecasting and time series analysis using the SCA statistical system. Villa Park, IL: Scientific Computing Associates.
McCleary, R. (2000). Evolution of the time series experiment. In L. Bickman (Ed.), Research design: Donald Campbell's legacy (pp. 215-234). Thousand Oaks, CA: Sage.
McCleary, R., & Hay, R. A., Jr. (1980). Applied time series analysis for the social sciences. Beverly Hills, CA: Sage.
McCleary, R., Nienstedt, B. C., & Erven, J. M. (1982). Uniform crime reports and organizational outcomes: Three time series quasi-experiments. Social Problems, 29, 361-372. doi:10.1525/sp.1982.29.4.03a00030
McCleary, R., & Riggs, J. E. (1982). The 1975 Australian Family Law Act: A model for assessing legal impacts. New Directions for Program Evaluation, 16, 7-18.
McCleary, R., Touchette, P., Taylor, D. V., & Barron, J. L. (1999, March). Contagious models for self-injurious behavior. Poster presentation, 32nd Annual Gatlinburg Conference on Research and Theory in Mental Retardation, Charleston, SC.
McDowall, D., Loftin, C., & Wiersema, B. (1992). A comparative study of the preventive effects of mandatory sentencing laws for gun crimes. Journal of Criminal Law and Criminology, 83, 378-394. doi:10.2307/1143862
McDowall, D., McCleary, R., Meidinger, E. E., & Hay, R. A., Jr. (1980). Interrupted time series analysis. Beverly Hills, CA: Sage.
McGaw, D. B., & McCleary, R. (1985). PAC spending, electioneering, and lobbying: A vector ARIMA time series analysis. Polity, 17, 574-585. doi:10.2307/3234659
Mills, T. C. (1991). Time series techniques for economists. New York, NY: Cambridge University Press.
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, 20A, 175-240.
Reiss, A. J., & Roth, J. A. (1993). Understanding and preventing violence. Washington, DC: National
Richardson, J., & Kroeber, A. L. (1940). Three centuries of women's dress fashions: A quantitative analysis. Anthropological Records, 5, 111-153.
Roethlisberger, F., & Dickson, W. J. (1939). Management and the worker. Cambridge, MA: Harvard University Press.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. New York, NY: Houghton Mifflin.
Wolf, J. R. (1848). Nachrichten uber die Sternwarte in Bern [News from the observatory in Berne]. Mittheilungen der naturforschenden Gesellschaft in Bern, Nr. 114-115. ETH-Bibliothek Zurich, Rar 4201. doi:10.3931/e-rara-2007
Yule, G. U. (1927). On a method of investigating periodicities in disturbed series with special reference to Wolfer's sunspot numbers. Philosophical Transactions of the Royal Society of London, Series A, 226, 267-298.