TIME-SERIES DESIGNS
Richard McCleary and David McDowall
Time-series designs are distinguished from other designs by the properties of time-series data and by the necessary reliance on a statistical model to control threats to validity. In the long run, a time series is a realization of a latent causal process. Representing the complete time series as
the observed series $(Y_1, \ldots, Y_N)$ is a probability sample of the complete realization. The probability sampling weights for $(Y_1, \ldots, Y_N)$ are specified in a statistical model, which, for present purposes, is written in a general linear form as
$$Y_t = f(a_t) + g(X_t) \qquad (2)$$
The $a_t$ term of this model is the $t$th observation of a strictly exogenous innovation series with the white noise property,
(3)
The $X_t$ term is the $t$th observation of a causal time series. Although $X_t$ is ordinarily a binary variable coded for the presence or absence of an intervention, it can also be a purely stochastic series. In either case, the model is constructed by a set of rules that allow for the solution,
(4)
Because $a_t$ has white noise properties, the solved model satisfies the assumptions of all common tests of statistical significance.
DOI: 10.1037/13620-032
THREE DESIGN CATEGORIES
We elaborate on the specific forms of the general
model and on the set of rules for building models at
a later point. For present purposes, the general
model allows three design variations: (a) descriptive
time-series designs, (b) correlational time-series
designs, and (c) experimental or quasi-experimental
time-series designs.
Descriptive Designs
The earliest time-series designs used observed cycles or trends in a series to infer the nature of a latent causal mechanism. Historical examples include Wolf's (1848; Yule, 1927) investigation of sunspot activity and Elton's (1924; Elton & Nicholson, 1942) investigation of lynx populations. In both cases, time-series analyses revealed cycles or trends that corroborated substantive interpretations of the phenomenon.
Kroeber's (1919; Richardson & Kroeber, 1940)
analyses of cultural change illustrate the poor fit of
descriptive time-series designs to many social and
behavioral phenomena. Kroeber (1944) hypothe
sized that women's fashions change in response to
political and economic variables. During stable peri
ods of peace and prosperity, fashions changed
slowly; during wars, revolutions, and depressions,
fashions changed rapidly. Because political and eco
nomic cataclysms were thought to recur in long his
torical cycles, Kroeber tested his hypothesis by
searching for the same cycles in women's fashions.
Figure 32.1 plots one of Richardson and Kroeber's (1940) annual fashion time series. Although Kroeber believed that the long cycles in this series corroborated his cultural unsettlement theory, wholly random processes can generate identical patterns. Whereas most time-series designs treat $f(a_t)$ as a nuisance function whose sole purpose is to control the threats to statistical conclusion validity posed by cycles and trends, the descriptive time-series design infers substantive explanations from $f(a_t)$. Although the statistical models and methods developed for the analysis of descriptive designs are applied for exploratory purposes (see Mills, 1991), they are currently not widely used for null hypothesis tests.

FIGURE 32.1. Annual skirt widths, 1787 to 1936. (Skirt width index plotted by year.)

APA Handbook of Research Methods in Psychology: Vol. 2. Research Designs, H. Cooper (Editor-in-Chief)
Copyright 2012 by the American Psychological Association. All rights reserved.
Correlational Designs
A second type of time-series design attempts to infer
a causal relationship between two series from their
covariance. Historical examples include Chree's
(1913) analyses of the temporal correlation between
sunspot activity and terrestrial magnetism and Bev
eridge's (1922) analyses of the temporal correlation
between rainfall and wheat prices. The validity of
correlational inferences rests heavily on theory.
When theory can specify a single causal effect oper
ating at discrete lags, as in these natural science
examples, correlational designs support unambigu
ous causal interpretations. Lacking theoretical speci
fication, however, correlational designs do not allow
strong causal inferences.
Analyses of the temporal correlation between lynchings and cotton prices by Hovland and Sears (1940) illustrate the inferential problem. To test the frustration-aggression hypothesis of Dollard et al. (1939), Hovland and Sears estimated a Pearson product-moment correlation coefficient from the annual time series plotted in Figure 32.2. Assuming that the correlation would be zero in the absence of a causal relationship, Hovland and Sears interpreted the statistically significant estimate as corroborating evidence. Because of common stochastic time-series properties, however, especially trend, causally independent series will be correlated. Controlling for trend, Hepworth and West (1988) reported a small, significant correlation between the series but warned against causal interpretations. The correlation is an artifact of the war years 1914 to 1918, when the demand for cotton and the civilian population moved in opposite directions (McCleary, 2000). If the war years are excluded, the correlation is not statistically significant.
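The claim that causally independent series will be correlated when they share stochastic trend can be checked with a small simulation. The sketch below is illustrative only: it assumes Gaussian random walks as the trending process, uses 45 observations to match the 1886 to 1930 span, and removes trend by first-differencing, which is a stand-in technique and not the detrending procedure used by Hepworth and West (1988). All function names are hypothetical.

```python
import random
import statistics

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def random_walk(n, rng):
    """Cumulative sum of independent N(0, 1) shocks (assumed trend process)."""
    level, series = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        series.append(level)
    return series

def difference(series):
    """First differences: Y_t - Y_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]

def mean_abs_correlations(n_obs=45, n_reps=2000, seed=1):
    """Average |r| between pairs of independent walks, in levels and in differences."""
    rng = random.Random(seed)
    raw, diffed = [], []
    for _ in range(n_reps):
        x, y = random_walk(n_obs, rng), random_walk(n_obs, rng)
        raw.append(abs(pearson(x, y)))
        diffed.append(abs(pearson(difference(x), difference(y))))
    return statistics.fmean(raw), statistics.fmean(diffed)

raw_r, diff_r = mean_abs_correlations()
print(f"mean |r| in levels: {raw_r:.2f}; in first differences: {diff_r:.2f}")
```

Although the two series in each pair are generated independently, the correlations computed on the levels are typically far from zero, whereas the correlations computed on the differenced series are small.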
Where theory supports strong specification, correlational time-series designs continue to be used. Other than limited areas in economics and psychology, however, social theories will not support the required specification. Even in these areas, causal inferences require the narrow definition of Granger causality (Granger, 1969) to rule out plausible alternative interpretations.

FIGURE 32.2. Annual cotton prices and lynchings, 1886 to 1930. (Cotton prices and lynchings plotted by year.)
Quasi-Experimental Designs
The third type of time-series design infers the latent causal effect of a temporally discrete intervention or treatment from discontinuities or interruptions in a time series. Campbell and Stanley (1963, pp. 37-43) called this design the "time-series experiment," and its use is currently the major application of time-series data for causal inference. Historical examples of the general approach include investigations of workplace interventions on health and productivity by the British Industrial Fatigue Research Board (Florence, 1923) and by the Hawthorne experiments (Roethlisberger & Dickson, 1939). Fisher's (1921) analyses of agricultural interventions on crop yields also relied on variants of the time-series quasi-experiment.
In the simplest case of the design, a discrete intervention breaks a time series into pre- and postintervention segments of $N_{pre}$ and $N_{post}$ observations. For pre- and postintervention means, $\mu_{pre}$ and $\mu_{post}$, analysis of the quasi-experiment tests the null hypothesis

$$H_0\colon \omega = 0, \quad \text{where } \omega = \mu_{post} - \mu_{pre}; \qquad H_A\colon \omega \neq 0. \qquad (5)$$

Rejecting $H_0$, $H_A$ attributes $\omega$ to the intervention. In practice, however, treatment effects are almost always more complex than the simple change in level implied by this null hypothesis.
Figure 32.3 illustrates a typical example of the time-series quasi-experimental design. The data are 50 daily self-injurious behavior counts for an institutionalized patient (McCleary, Touchette, Taylor, & Barron, 1999). Beginning on the 26th day, the patient is treated with Naltrexone, an opiate-blocker. The plotted time series leaves the visual impression that the opiate-blocker has reduced the incidence of self-injurious behavior. Indeed, the difference in means for the 25 pre- and 25 postintervention days amounts to a 42% reduction. The value of $F = 32.56$ associated with this difference occurs by chance with $p < .0001$, and the null hypothesis can be rejected in favor of the alternative: The medication is an effective treatment for self-injurious behavior.

FIGURE 32.3. Self-injurious behavior incidents for a single institutionalized patient before and after an opiate-blocker regimen. (Incidents plotted for 25 preintervention days and 25 postintervention days.)
This conclusion ignores a serious threat to validity. Whereas the null hypothesis test assumes that the daily counts are independent, in fact, the count on any given day is predictable from the count on the preceding day. The visual evidence of Figure 32.3 leaves the unambiguous impression that the opiate-blocker reduced the rate of self-injurious behavior for this patient. Visual evidence can be deceiving, of course, and that is why statistical hypothesis tests are conducted. Because of the day-to-day dependence of these data, however, the value of $F = 32.56$ cannot be interpreted.
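Why day-to-day dependence makes the $F$ statistic uninterpretable can be illustrated by simulation. The sketch below generates series of 50 observations with no intervention effect at all and counts how often the ordinary two-group $F$ test rejects the null hypothesis. The first-order autoregressive process, the value 0.5 for its parameter, and the function names are assumptions made for illustration; they are not estimates from the patient data.

```python
import random
import statistics

F_CRIT_1_48 = 4.04  # approximate .05 critical value of F with (1, 48) degrees of freedom

def ar1_series(n, phi, rng, burn_in=100):
    """Simulate y_t = phi * y_{t-1} + a_t with N(0, 1) white noise a_t."""
    y, out = 0.0, []
    for t in range(n + burn_in):
        y = phi * y + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            out.append(y)
    return out

def f_statistic(pre, post):
    """One-way ANOVA F for two groups, which assumes independent observations."""
    n1, n2 = len(pre), len(post)
    m1, m2 = statistics.fmean(pre), statistics.fmean(post)
    ss_within = sum((v - m1) ** 2 for v in pre) + sum((v - m2) ** 2 for v in post)
    ms_within = ss_within / (n1 + n2 - 2)
    ms_between = (m2 - m1) ** 2 / (1.0 / n1 + 1.0 / n2)
    return ms_between / ms_within

def false_positive_rate(phi, n_reps=4000, seed=7):
    """Share of null series (no intervention effect) in which F exceeds the critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        y = ar1_series(50, phi, rng)
        if f_statistic(y[:25], y[25:]) > F_CRIT_1_48:
            rejections += 1
    return rejections / n_reps

rate_independent = false_positive_rate(phi=0.0)
rate_dependent = false_positive_rate(phi=0.5)
print(f"Type I error rate, independent data: {rate_independent:.3f}")
print(f"Type I error rate, autocorrelated data: {rate_dependent:.3f}")
```

With independent observations the rejection rate stays near the nominal .05 level. With positively autocorrelated observations it is several times larger, so a large $F$ is no longer evidence of an intervention effect.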
More generally, time-series experiments and quasi-experiments present many challenges to valid inferences about causal effects. A solid rationale for the design is mostly due to the work of Donald T. Campbell and his collaborators (Campbell & Stanley, 1963; Cook & Campbell, 1979; Shadish, Cook, & Campbell, 2002). Campbell and associates extensively considered threats to the design's validity and proposed ways to address them when they were plausible. A general conclusion from Campbell's work is that experimental and quasi-experimental time-series designs face fewer threats to validity than do most other nonexperimental designs. This conclusion is largely responsible for the current popularity of time-series research.
FOUR TYPES OF VALIDITY
Campbell and Stanley (1963; Campbell, 1963)
divided the empirical threats to valid inference into
two categories. Threats to internal validity addressed
the question, "Did in fact the experimental treat
ments make a difference in this specific experimen
tal instance?" Threats to external validity addressed
the question, "To what populations, settings, treat
ment variables, and measurement variables can this
effect be generalized?"
Recognizing the incompleteness of the dichot
omy, Cook and Campbell (1979) added two addi
tional categories. Threats to statistical conclusion
validity addressed questions of confidence and
power that had previously been included implicitly
as threats to internal validity. Threats to construct
validity addressed questions of confounding that
had previously been included implicitly as threats to
external validity. Shadish et al. (2002) used the
same four categories but expanded the list of threats
to valid inference in each category.
Table 32.1 lists the threats to validity that are relevant to time-series studies. Time-series designs differ from other approaches in that common threats to validity are controlled by a statistical model. This applies not only to quasi-experimental designs but also to designs that would be considered true experiments. When a treatment or intervention can be manipulated experimentally, presumably to control threats to internal validity, the manipulation raises threats to construct and external validity that can be controlled by the statistical model.

TABLE 32.1
Four Types of Validity From Shadish, Cook, and Campbell (2002)

Type of validity                  Threat to validity
Statistical conclusion validity   Low statistical power
                                  Violated assumptions of the test
Internal validity                 History
                                  Maturation
                                  Regression artifacts
                                  Instrumentation
Construct validity                Reactivity
                                  Novelty and disruption
External validity                 Interaction over treatment variations
                                  Interaction with settings
Trade-offs among the four validities are implicit in our tradition. The salient flaw in the Campbell and Stanley (1963) two-category system was its implication that all eight threats to internal validity and all four threats to external validity could be controlled by design. Campbell and his colleagues proposed the four-validity system in large part to correct this misconception. The trade-off among validities is a crucial consideration for time-series studies. Although threats to internal validity can be controlled, in principle, by experimental manipulation of the treatment, in practice, experimental manipulation raises near-fatal threats to construct and external validity. Accordingly, we analyze time-series experiments as if they were quasi-experiments.
Statistical Conclusion Validity
Shadish et al. (2002) identified nine threats to statistical conclusion validity or "reasons why researchers may be wrong in drawing valid inferences about the existence and size of covariation between two variables" (p. 45). Although the consequences of any particular threat will vary across settings, the threats to statistical conclusion validity fall neatly into categories involving misstatements of the Type I and Type II error rates.¹
Type I errors (also known as $\alpha$ errors or false-positive errors) occur when a true $H_0$ is mistakenly rejected; that is, when the intervention has no effect but the test statistic suggests otherwise. Under a convention established by Fisher (1925), the Type I error rate is fixed at $\alpha = 0.05$, corresponding to a confidence level of at least 0.95 (or 95% confidence). Type II errors (also known as $\beta$ errors or false-negative errors) occur when a false $H_0$ is mistakenly accepted; that is, when the intervention has an effect but the test statistic suggests otherwise. Following Neyman and Pearson (1928), the conventional Type II error rate is fixed at $\beta = 0.2$, corresponding to a statistical power level of at least 0.8 (or 80% statistical power). Whereas the Type I error rate is set a priori, the Type II error rate is conditioned on the Type I error rate and a likely effect size.²

¹ ... authority on this topic is Kendall and Stuart (1979, Chapter 22). Cohen (1988) and Lipsey (1990) provided more accessible ...
For both Type I and Type II errors, uncontrolled threats to statistical conclusion validity distort the nominal values of $\alpha$ and $\beta$, leading to invalid inferences. The threats are controlled by a statistical model. The most widely used models for that purpose are the AutoRegressive Integrated Moving Average (ARIMA) models of Box and Jenkins (1970). Under $H_0$, an ARIMA model is written as

$$\phi(B) Z_t = \theta(B) a_t, \qquad Z_t = Y_t - \mu_Y, \qquad (6)$$

where $Z_t$ and $a_t$ are the $t$th observations of a stationary time series and an iid $N(0, \sigma^2)$ error series, respectively, and where $\phi(B)$ and $\theta(B)$ are polynomial lag operators. If the parameters of $\phi(B)$ and $\theta(B)$ are appropriately constrained, the ARIMA model can be solved for $a_t$:

$$a_t = \theta(B)^{-1} \phi(B) Z_t. \qquad (7)$$
To test $H_0$, the intervention is represented by a step function (or dummy variable) defined such that

$$X_t = 0 \text{ for } t \le N_{pre}; \qquad X_t = 1 \text{ thereafter.} \qquad (8)$$

A transfer function of $X_t$ is then added to the right-hand side of the ARIMA model:
(9)
where $\delta(B)$ is a polynomial lag operator and $\omega$ is the effect of $X_t$ on $Z_t$. Because $a_t$ is an iid $N(0, \sigma^2)$ error term, the null hypothesis

$$H_0\colon \omega = 0 \qquad (10)$$

can be tested with ordinary test statistics, such as $t$ or $F$, effectively controlling all Type I threats to statistical conclusion validity.
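The logic of testing $\omega$ against a white noise error can be sketched in a few lines. The example below is a simplified illustration under strong assumptions: the noise is first-order autoregressive, the intervention produces an abrupt and permanent change in level, and a two-step filtering estimate stands in for the maximum likelihood estimation that ARIMA software performs. The function names and parameter values are hypothetical.

```python
import random
import statistics

def step_function(n_pre, n_post):
    """Equation (8): X_t = 0 through N_pre and 1 thereafter."""
    return [0.0] * n_pre + [1.0] * n_post

def simulate(n_pre, n_post, phi, omega, rng, burn_in=100):
    """Series = omega * X_t plus AR(1) noise with parameter phi (assumed model)."""
    x = step_function(n_pre, n_post)
    noise, n_t = [], 0.0
    for t in range(burn_in + len(x)):
        n_t = phi * n_t + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            noise.append(n_t)
    return [omega * xi + ni for xi, ni in zip(x, noise)], x

def ols(y, x):
    """Slope, intercept, and slope t statistic for y = b0 + b1 * x + e."""
    n = len(y)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = (sse / (n - 2) / sxx) ** 0.5
    return b1, b0, b1 / se_b1

def lag1_autocorrelation(e):
    """Lag-1 autocorrelation of a zero-mean residual series."""
    return sum(a * b for a, b in zip(e[1:], e)) / sum(a * a for a in e)

def intervention_test(y, x):
    # Step 1: rough estimate of omega that ignores serial dependence.
    b1, b0, _ = ols(y, x)
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    # Step 2: estimate the AR(1) parameter from the residuals.
    phi_hat = lag1_autocorrelation(resid)
    # Step 3: apply (1 - phi_hat * B) to both sides so the error is roughly white noise.
    y_f = [y[t] - phi_hat * y[t - 1] for t in range(1, len(y))]
    x_f = [x[t] - phi_hat * x[t - 1] for t in range(1, len(x))]
    omega_hat, _, t_stat = ols(y_f, x_f)
    return omega_hat, phi_hat, t_stat

rng = random.Random(3)
y, x = simulate(n_pre=100, n_post=100, phi=0.5, omega=-2.0, rng=rng)
omega_hat, phi_hat, t_stat = intervention_test(y, x)
print(f"omega_hat = {omega_hat:.2f}, phi_hat = {phi_hat:.2f}, t = {t_stat:.2f}")
```

The $t$ statistic from the final regression is the one that can be referred to ordinary tables, because the filtered errors approximate the white noise series that the test assumes.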
Because the $\phi(B)$ and $\theta(B)$ lag operators serve the sole purpose of transforming $Z_t$ into the underlying $a_t$ error, ARIMA models are not unique. Methods for building a parsimonious, statistically adequate ARIMA model have been described in Glass, Willson, and Gottman (1975); McCleary and Hay (1980); and especially McDowall, McCleary, Meidinger, and Hay (1980). The $\phi(B)$ and $\theta(B)$ lag operators can be used for descriptive purposes, but with respect to $H_0$, they are nuisance parameters. The transfer function of $X_t$, in contrast, is a theoretical construct, specified on purely theoretical grounds. We return to this topic when the threats to construct validity are considered.
Although the Type II threats to statistical conclusion validity are straightforward, they are poorly understood and often ignored. The failure to reject $H_0$ does not imply that $H_0$ is true. The decision to accept $H_0$ as true requires, first, a consensus decision on the likely effect size (or value of $\omega$); and, second, a demonstration that the time-series quasi-experiment was designed to yield a Type II error rate of $\beta \le 0.2$ for the likely effect size. In sum, the decision to accept $H_0$ as true requires that the research begin with an analysis of statistical power.
Of the several factors that determine the statistical power of a time-series design, the likely effect size is the most important. Small effects are difficult to detect even under the best circumstances. In addition to the likely size of the effect, the statistical power of a time-series quasi-experiment is a function of the series length, balance, and to a lesser extent, the quality of the ARIMA model. Because maximum likelihood estimates rely on large sample properties, analyses of short time series will often fail to achieve the nominal level of power. Most authorities recommend $N_{pre} + N_{post} > 50$ as the minimum length for a time series. Statistical power increases in proportion to the square root of series length. But for a given length, power is highest when the design is balanced such that $N_{pre} = N_{post}$.
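The roles of series length and balance can be made concrete with a normal approximation. The sketch below assumes that the ARIMA model has already reduced the errors to white noise with a known standard deviation, so that the standard error of the change in level is $\sigma\sqrt{1/N_{pre} + 1/N_{post}}$. Under that assumption it is the standardized effect, $\omega$ divided by its standard error, that grows with the square root of series length. The effect size of half a standard deviation and the function name are illustrative assumptions.

```python
from statistics import NormalDist

def approximate_power(omega, sigma, n_pre, n_post, alpha=0.05):
    """Two-sided power for a change in level, assuming white noise errors with known sigma."""
    z = NormalDist()
    se = sigma * (1.0 / n_pre + 1.0 / n_post) ** 0.5
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(omega) / se
    # Probability that the test statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

balanced_50 = approximate_power(0.5, 1.0, n_pre=25, n_post=25)
unbalanced_50 = approximate_power(0.5, 1.0, n_pre=10, n_post=40)
balanced_100 = approximate_power(0.5, 1.0, n_pre=50, n_post=50)

for label, power in [("25 + 25", balanced_50), ("10 + 40", unbalanced_50), ("50 + 50", balanced_100)]:
    print(f"N_pre + N_post = {label}: approximate power = {power:.2f}")
```

For the same total length, the balanced design has more power than the unbalanced one, and doubling the length of a balanced design raises power further. None of the three designs reaches the conventional 0.8 level for an effect of this size, which is the kind of result a power analysis is meant to reveal before data are collected.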
Even when the series is long and the design is balanced, the statistical power of a time-series quasi-experiment can be affected adversely by a poor-fitting ARIMA model. In addition to the purpose of transforming $Z_t$ into $a_t$, the ARIMA model decomposes the time-series variance into stochastic and deterministic components, $f(a_t)$ and $g(X_t)$. To ensure an optimal decomposition, the ARIMA model used to test $H_0$ should have the lowest residual variance among the several statistically adequate models.
² Cohen (1988, pp. 3-4) and Lipsey (1990, pp. 38-40) set the conventional Type I and Type II error rates at $\alpha = .05$ and $\beta = .2$, respectively. If the Type I error rate is set lower, say, $\alpha = .01$, the Type II error rate is set at $\beta = .04$ to maintain a 4:1 ratio of Type II to Type I errors. The 4:1 convention dates back at least to Neyman and Pearson (1928) and reflects the view that science should be conservative.
FIGURE 32.4. Annual property crimes for cities with commercial television broadcast service in 1951 and for cities with television in 1955. (Property crimes plotted by year, 1945 to 1960.)
Internal Validity
Under some circumstances, all nine of the threats to internal validity identified by Shadish et al. (2002, p. 55) might apply to the time-series quasi-experiment. Typically, however, only four threats are plausible enough to pose serious difficulties. History and maturation, which arise from the use of multiple temporal observations, can threaten virtually any application of the design. Instrumentation and regression can also be problems, but only for interventions that involve advance planning. For unplanned interventions (what Campbell, 1963, pp. 229-230, called "natural experiments"), these threats are much less realistic.
The largest, most obvious, and most frequent threats to internal validity involve the operation of history. Historical threats come from changes in a time series that occur coincidently with an intervention but are due to other causes. A standard design-based approach to making these threats less plausible is to analyze one or more comparison series. The comparisons can take many forms, and a careful choice can substantially narrow the scope within which history can operate. An analysis might consider no-treatment control series that the intervention should not have influenced, for example, or study multiple periods during which the intervention was and was not in operation. A consistent pattern of results across different variables or time periods reduces the plausibility of historical threats and helps support the existence of an intervention impact.
An illustration of the effective use of comparison series comes from Hennigan et al. (1982), who studied changes in property crime following the introduction of commercial television. In 34 early cities, broadcasting began in 1950, whereas in 34 late cities, television was not available until 1954. The time series in Figure 32.4 show the annual log-transformed levels of property crimes for both groups of cities.
History is a plausible threat to inferences about
the effects of television when studying the early or
late cities alone. Other variables also changed during
the years in which each group adopted television,
and many of these might explain a change in crime.
The Korean War was under way in 1951, for exam
ple, and an economic recession began in 1955. More
generally, criminological theories suggest multiple
factors that might influence property crimes and
that could have changed around the time that broad
casting began in either of the groups.
Considering both time series together makes the
changes in crime much more difficult to dismiss as
artifacts of history. To be plausible, a historical
explanation would have to account for increases that
occurred at two different time points but affected
only one group of cities at each. Although not
impossible, constructing such an explanation would
be a difficult enterprise.
In contrast to history, methods for addressing the other three threats to internal validity do not heavily rely on design variations. Maturation, which like history is plausible in all applications of the time-series quasi-experiment, requires a statistical modeling approach. Maturation threats appear as trends in the data and are due to developmental processes that are independent of the intervention. Time-series data often display such trends, and trending patterns are especially common in the long series that are most desirable for analysis. Maturational trends are a problem for inference because they can easily produce false evidence of an intervention effect.

FIGURE 32.5. Weekly productivity measures for the first bank wiring room (Hawthorne) experiment. Estimated interventions are plotted against the series. (Productivity plotted by week, with the first, second, and third "punishment" effects marked.)
Figure 32.5 illustrates one of the best-known
examples of the maturation threat, the so-called
Hawthorne effect. The data consist of weekly pro
ductivity levels for a group of five machine operators
in the bank wiring room of the Western Electric
Company's Hawthorne Works. Researchers manipu
lated daily rest breaks during the study period and
claimed after a visual inspection of the series that
the breaks helped increase productivity. Question
ing this conclusion, Franke and Kaul (1978) argued
that any productivity increases were instead due
solely to fear generated by the imposition of three
"punishment" regimes. Their statistical analysis,
which included interventions at the beginning of
each regime, supported this hypothesis.
Maturation provides an explanation that chal
lenges the interpretations of both Franke and Kaul
(1978) and the original researchers. Figure 32.5
shows the presence of a systematic trend in produc
tivity that could easily have resulted from increases
in worker experience. The trend closely follows the
patterns that each set of researchers observed, and
this makes maturation a plausible alternative expla
nation of their findings. A reanalysis that controlled
for the trend in fact found a small effect of the rest
breaks and no effect at all for the punishment
regimes (McCleary, 2000).
Unlike other threats to internal validity, which are ordinarily handled by design, maturation threats are controlled by the statistical model. Under $H_0$, the causal effect of $X_t$ vanishes, leaving a simple model that represents the time series as a weighted sum of past and present white noise innovations:

$$Z_t = \phi(B)^{-1} \theta(B) a_t. \qquad (11)$$

Proper solutions of this model are guaranteed by constraining the parameters of $\phi(B)$ and $\theta(B)$ to the bounds of stationarity-invertibility.³
This assumes a stationary time series, however,
and this assumption is unwarranted in many
instances. Kroeber's fashion time series (Figure
32.1), for example, shows the drifting pattern charac
teristic of a nonstationary random walk process. Seg
ments of Kroeber's series are indistinguishable from
the steady secular trend that poses the maturation
threat in the Hawthorne experiment (Figure 32.5).
Although nonstationary time series are commonly encountered in the social sciences, most are stationary in first-differences. Using $\nabla$ to represent the differencing operation

$$\nabla Y_t = Y_t - Y_{t-1}, \qquad (12)$$

a nonstationary time series can be modeled as
(13)
or
(14)

³ Although the bounds of stationarity and invertibility are identical, they are distinct properties of a time series. See Box and Jenkins (1970, pp. 53-54). All modern time-series software packages report parameter estimates that are constrained to the stationarity-invertibility bounds.

FIGURE 32.6. Weekly productivity measures from Figure 32.5, differenced. (Productivity differences plotted by week.)
Figure 32.6 plots the first-differences of the Hawthorne experiment time series. The differenced series fluctuates around a constant level, and maturation is no longer a plausible threat to internal validity.⁴
In addition to controlling maturation threats, differencing removes the confounding effects of cross-sectional fixed causes of $Y_t$. To illustrate, suppose that $Y_t$ is the U.S. unemployment rate and that $W$ represents the causes of $Y_t$ that vary cross-sectionally but that are constant over short periods of time. When consecutive observations are differenced,

$$Y_t - Y_{t-1} = f(a_t) - f(a_{t-1}) + W - W \qquad (15)$$
$$\nabla Y_t = f(a_t) - f(a_{t-1}), \qquad (16)$$

the confounding effects of $W$ vanish from the model. This property of the difference equation model is the motivation for the use of fixed-effects panel models in economics (Greene, 2000).
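Both properties of differencing described above can be verified numerically. The sketch below assumes a random walk with drift as the nonstationary process and gives two hypothetical units the same innovations but different fixed levels $W$. The drift value, the values of $W$, and the function names are illustrative assumptions.

```python
import random
import statistics

def difference(series):
    """First differences: Y_t - Y_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]

def random_walk_with_drift(n, drift, w, rng):
    """Y_t = W + cumulative sum of (drift + a_t), with N(0, 1) innovations a_t."""
    level, series = 0.0, []
    for _ in range(n):
        level += drift + rng.gauss(0.0, 1.0)
        series.append(w + level)
    return series

# Two units share the same innovations (same seed) but differ in the fixed cause W.
unit_a = random_walk_with_drift(200, drift=0.3, w=5.0, rng=random.Random(11))
unit_b = random_walk_with_drift(200, drift=0.3, w=-40.0, rng=random.Random(11))

diff_a, diff_b = difference(unit_a), difference(unit_b)

# The levels differ by the gap in W, but the differenced series coincide.
level_gap = statistics.fmean(unit_a) - statistics.fmean(unit_b)
largest_diff_gap = max(abs(a - b) for a, b in zip(diff_a, diff_b))

# The mean of the differenced series estimates the drift (secular trend).
estimated_trend = statistics.fmean(diff_a)

print(f"gap in levels: {level_gap:.1f}")
print(f"largest gap in differences: {largest_diff_gap:.2e}")
print(f"mean of differenced series: {estimated_trend:.2f}")
```

The fixed cause $W$ shifts the level of each series but cancels from every difference, and the differenced series fluctuates around a constant that estimates the trend.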
Like the maturation threat to internal validity,
the regression threat is controlled by the statistical
model. Whereas the maturation threat is plausible
in both time-series experiments and quasi-experiments,
however, the regression threat is plausible
only in quasi-experiments involving planned
interventions. The regression threat arises whenever
the intervention is a reaction to an unusually high or
low level of the time series. Regardless of the inter
vention's effects, regression to the mean is likely to
produce an increase or decrease in the series level.
In one of the earliest formal applications of the
time-series quasi-experiment, Campbell and Ross
(1968) studied the impact on highway fatalities of a
1955 speeding crackdown in Connecticut. Traffic
deaths dropped significantly after the crackdown
began, but Campbell and Ross showed that the
decrease was largely attributable to a regression arti
fact. Fatalities were unusually high in 1955, and the
crackdown was a response intended to reduce them.
A drop was then predictable as deaths regressed
back toward their historically average levels.
⁴ The mean of the first-differenced time series is interpreted as the secular trend of $Y_t$.

FIGURE 32.7. Monthly burglaries for Tucson, 1975 to 1981. During a 24-month period, burglaries are assigned to detectives for investigation. (Burglaries plotted by year.)

Regression becomes a less plausible threat to internal validity as the length of a time series increases. Introducing the intervention at an unusually high (or low) point in the series will create a transient bias in estimates of the pre- and postintervention means. As the pre- and postintervention series grow longer, the bias becomes proportionately smaller, and eventually it approaches zero. The recommendation to use a total series length of 50 or more observations for the time-series quasi-experiment is in part intended to reduce the plausibility of regression threats (McCleary & Hay, 1980; McDowall et al., 1980).
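The way the regression artifact shrinks with series length can be illustrated with a simulation in which the intervention has no effect at all. The sketch below assumes a white noise series and an intervention that is triggered by a final preintervention observation more than 1.5 standard deviations above the mean. The threshold, the series lengths, and the function names are illustrative assumptions.

```python
import random
import statistics

def high_draw(rng, threshold):
    """Draw an N(0, 1) value conditional on exceeding the threshold."""
    while True:
        v = rng.gauss(0.0, 1.0)
        if v > threshold:
            return v

def mean_spurious_change(n_pre, n_post, threshold=1.5, n_reps=5000, seed=5):
    """Average post-minus-pre change in means when the intervention has no effect
    but is introduced right after an unusually high observation."""
    rng = random.Random(seed)
    changes = []
    for _ in range(n_reps):
        pre = [rng.gauss(0.0, 1.0) for _ in range(n_pre - 1)]
        pre.append(high_draw(rng, threshold))  # the observation that prompts the intervention
        post = [rng.gauss(0.0, 1.0) for _ in range(n_post)]
        changes.append(statistics.fmean(post) - statistics.fmean(pre))
    return statistics.fmean(changes)

short_series_bias = mean_spurious_change(n_pre=5, n_post=5)
long_series_bias = mean_spurious_change(n_pre=50, n_post=50)
print(f"spurious change, 5 + 5 observations: {short_series_bias:.3f}")
print(f"spurious change, 50 + 50 observations: {long_series_bias:.3f}")
```

In both cases the series appears to drop after the intervention, but the apparent drop in the long series is roughly one tenth the size of the drop in the short series, because the single extreme observation carries less weight in a longer preintervention mean.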
Instrumentation is also a plausible threat to
planned interventions because new methods for
measuring the outcome variable often accompany
the introduction of other changes. Figure 32.7 (from
McCleary, Nienstedt, & Erven, 1982) presents a
monthly plot of Tucson burglary counts. For 2 years
beginning in 1979, detectives replaced uniformed
officers in performing burglary investigations. In
1981, the investigative responsibility was returned
to the uniformed officers. Consistent with the
notion that detectives are more proficient in preventing burglaries, the counts were lower when they handled the cases.

FIGURE 32.8. Daily disruptions caused by talking out before and after a behavioral intervention. (Daily "talking out" incidents plotted before and after the intervention.)
Although the switching intervention feature of
the design effectively rules out history, all other
internal validity threats are still plausible. Additional
analysis showed that detectives and uniformed offi
cers did not keep records in the same way, and this
difference reduced the number of burglaries that the
detectives recorded. Allowing for the influence of
the instrumentation change, burglary counts did not
vary significantly with the type of officer responsible
for the investigations.
Construct Validity
Shadish et al. (2002) identified 14 "reasons why
inferences about the constructs that characterize
study operations may be incorrect" (p. 73). One of the
14 threats to construct validity, novelty and disruption,
is relevant to experimental and quasi-experimental
time-series designs. Regardless of whether an intervention
has its intended effect, the time series is
likely to react to the novelty or disruption associated
with it. If the general form of the artifact is known, it
can be incorporated into the statistical model.
Figure 32.8 illustrates one aspect of this threat to
construct validity. Hall et al. (1971) counted the
number of talking out disruptions in a classroom for
20 consecutive days. When a behavioral interven
tion is implemented on the 21st day, the time series
changes gradually, falling to a lower daily level of
disruption. If the gradual nature of the response is
not taken into account, the effectiveness of the inter
vention is underestimated. Because gradual
responses to interventions are a common feature of
behavioral research, the uncontrolled threat to con
struct validity can have serious consequences.
Figure 32.9 illustrates the complementary aspect of this threat to construct validity. Similar to Figure 32.3, these data are daily counts of self-injurious behavior incidents but for a different patient (McCleary et al., 1999). Instead of receiving an opiate-blocker, beginning on the 26th day, this patient received a placebo. The level of the time series dropped immediately but then, within a few days, returned to its preintervention level. Placebo effects of this sort are common in time-series experiments. Given a well-behaved time-series process and a sufficient number of postintervention observations, the threat to construct validity may be ignored. Under more realistic circumstances, however, the placebo effect must be incorporated into the analytical model.

FIGURE 32.9. Self-injurious behavior incidents for a single institutionalized patient before and after a placebo regimen. (Incidents plotted for 25 preintervention days and 25 postintervention days.)
Campbell and Stanley (1963, p. 43) recognized the threat to validity posed by a dynamic response to an intervention; still, their external validity assessment of the time-series quasi-experiment seemed to leave the issue open. Addressing the same threat from a modeling perspective, Box and Jenkins (1970; Box & Tiao, 1975) proposed a lagged polynomial parameterization of $h(X_t)$ that allows for hypothesis testing. The polynomial lag makes the ARIMA model inherently nonlinear, complicating the interpretation of analytic results. The polynomial lag provides a straightforward method of controlling novelty and disruption threats to construct validity, however, and has become widely accepted in the social sciences.
Figure 32.10 shows four variations of the same general
polynomial lag model. The model variations in the top row
depict permanent responses to the intervention. The series may
respond to the intervention instantaneously or gradually but,
in either case, the response is permanent. A gradual, permanent
response model seems to capture the effect of the behavioral
intervention on the daily time series of disruptive talking out
incidents (Figure 32.8).
The model variations in the bottom row of Figure 32.10 depict
temporary responses to the intervention. Both responses model
placebo artifacts, spiking at the onset of the intervention but
then decaying over time to reveal the long-run effect of the
intervention or treatment. The gradual, temporary response
model seems to capture the effect of a placebo intervention on
the daily time series of self-injurious behavior (Figure 32.9).
Permanent and temporary responses can be combined in a model.
Figure 32.11 shows a time series of divorce rates for Australia
before and after the 1975 Family Law Act, which allowed for
no-fault divorce. Opponents of the act argued that its no-fault
provisions would lead to an increase in divorces. An evaluation
of the act by the Australian government found that although
divorces did rise following the act, the divorce rate fell back
to its pre-1975 level after 3 years. The fact that post-1975
divorce rates were higher was attributed to secular trend.
Reanalyzing these data, McCleary and Riggs (1982) hypothesized
a complex response to the act, realized as the sum of a
permanent and a temporary increase in divorce. The temporary
spike in divorces decayed rapidly in the years immediately
following 1975. Divorces never returned to their pre-1975
rates, however, and instead stabilized at a new higher level.
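A compound response of this kind can be sketched as the sum of a step term and a decaying pulse term. The notation below is an assumption made for illustration, not the parameterization reported by McCleary and Riggs (1982): $Z_t$ denotes the intervention component of the series, $X_t$ is a binary variable that switches from 0 to 1 when the act takes effect, $B$ is the backward lag operator, $\omega_1$ is the permanent change in level, $\omega_2$ is the size of the temporary spike, and $0 < \delta < 1$ is its rate of decay.

```latex
Z_t = \omega_1 X_t + \omega_2 (1 - \delta B)^{-1} (1 - B) X_t
% j periods after onset (j = 0, 1, 2, ...):
Z_{\text{onset}+j} = \omega_1 + \delta^{j} \omega_2
  \;\longrightarrow\; \omega_1 \quad \text{as } j \to \infty
```

Under this form the series jumps by $\omega_1 + \omega_2$ at onset and then settles at $\omega_1$ above its earlier level, which matches the pattern described for the divorce series: a spike that decays without a return to the pre-1975 rate.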
FIGURE 32.10. Model responses to an intervention. The top row illustrates permanent response patterns. The bottom row illustrates temporary response patterns. Vertical axis of each panel: percentage of effect.
Whether or not they have permanent effects, new laws often have
temporary effects that are well modeled as decaying spikes.
Failure to allow for these temporary effects can lead to
invalid inferences. At the individual level, placebo artifacts
pose an analogous threat to construct validity. Although these
threats are easily controlled with an explicit complex response
model, by allowing the possibility of several responses, the
model raises a potential threat to statistical conclusion
validity: fishing. To control the fishing threat, the complex
response model must be fully specified before any hypothesis
test.
FIGURE 32.11. Australian divorces before and after the 1975 Family Law Act. Vertical axis: divorces; horizontal axis: year, 1950 to 1995.
Finally, we return to the trade-off implicit in the
four-validity system. In principle, all nine threats to
internal validity can be controlled by manipulating the
intervention or treatment experimentally. Internal validity is
bought at the expense of construct validity, however, which may
be more threatening in single-subject designs. Although the
opiate-blocker regimen appears to reduce self-injurious
behavior, implementation of the regimen provokes a weeklong
reaction to the novelty and disruption. Because none of the
common threats to internal validity seem plausible, trading
construct for internal validity may be unwarranted.
External Validity
A time-series quasi-experiment typically considers an
intervention's influence on only one series, and this makes it
highly vulnerable to external validity threats. External
validity considers whether findings hold "over variations in
persons, settings, treatments, and outcomes" (Shadish et al.,
2002, p. 83), and threats to it are always plausible when
analyzing a single series. Ruling out external validity threats
necessarily requires replicating the quasi-experiment over a
diverse set of conditions.
The evaluation research literature shows many cases in which
effect estimates exist to support every possible conclusion
about a program's impact. Evaluations of gun control policies,
for example, include numerous instances of positive, negative,
and null effect estimates (Reiss & Roth, 1993, pp. 255-287). In
these situations, the variance of the effects across
replications can be more informative than is a single-point
estimate of the average effect.
If several quasi-experimental replications exist, they allow
external validity to be assessed in one of two ways. First, the
set of individual time series can be assembled to form a single
vector series, $Y_t$:
(17)
A statistical analysis can then take advantage of the variation
in the replications to obtain estimates of both the average
impact and its expected variability (e.g., McGaw & McCleary,
1985). Second, an analysis can decompose the set of individual
impact estimates,
(18)
into components associated with various external validity
threats.
The second approach is a restricted case of the first, and in
theory, it is less desirable. Still, the first and more general
model makes highly demanding assumptions, and time series often
do not conform to them. Because the second approach places
fewer requirements on the data, it is generally more practical
to apply. Statistical models for the second approach come from
meta-analysis and divide the effect variance into components
caused by the setting, intervention, and other potential
threats to external validity. McDowall, Loftin, and Wiersema
(1992) used this approach to estimate the overall impact and
variability of sentencing laws, and McCleary (2000) used it to
combine estimates from the Hawthorne experiments.
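As a minimal sketch of the second approach, the code below pools a set of impact estimates and splits their observed variability into sampling error and between-replication variance. It uses the DerSimonian-Laird moment estimator, a standard meta-analytic technique chosen here for illustration; the chapter does not say which estimator the cited studies used. The function name and the example numbers are hypothetical, and the function assumes at least two replications.

```python
def pool_impact_estimates(estimates, variances):
    """Pool impact estimates with a DerSimonian-Laird random-effects model.

    estimates: one impact estimate per replication (hypothetical inputs).
    variances: squared standard error of each estimate.
    Returns (fixed_mean, q, tau2, random_mean).
    """
    k = len(estimates)
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    w_sum = sum(weights)
    fixed_mean = sum(w * e for w, e in zip(weights, estimates)) / w_sum

    # Cochran's Q: weighted squared deviations from the pooled mean.
    q = sum(w * (e - fixed_mean) ** 2 for w, e in zip(weights, estimates))

    # Moment estimate of the between-replication variance, truncated at zero.
    c = w_sum - sum(w * w for w in weights) / w_sum
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Re-weight with the between-replication variance added to each term.
    re_weights = [1.0 / (v + tau2) for v in variances]
    random_mean = (
        sum(w * e for w, e in zip(re_weights, estimates)) / sum(re_weights)
    )
    return fixed_mean, q, tau2, random_mean


# Hypothetical impact estimates from three replications of equal precision.
fixed_mean, q, tau2, random_mean = pool_impact_estimates(
    [-2.0, 0.5, 3.0], [0.25, 0.25, 0.25]
)
print(f"average impact = {random_mean:.2f}")
print(f"between-replication variance = {tau2:.2f}")
```

A between-replication variance well above zero suggests that the impact differs across settings or interventions, which is the situation in which the spread of effects can be more informative than the average.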
DYNAMIC INTERVENTION ANALYSES
A salient advantage of time-series designs over other
before-after designs is the capability of modeling dynamic
responses to the intervention. Proper specification of a
dynamic response model, such as those shown in Figure 32.10,
requires a parsimonious theory of the response. One theory that
has proved useful in psychological research restricts the
response to one of four types defined by dichotomizing onset
and duration. The response may be abrupt or gradual in onset
and permanent or temporary in duration. Although general
transfer function specifications (Box & Jenkins, 1970; Box &
Tiao, 1975) allow a wider range of responses, the four-type
theory is more realistic for psychological interventions and
more appropriate for testing intervention null hypotheses.
The talking out series plotted in Figure 32.8 shows a typical
gradual, permanent response to an intervention. Though
implemented on the 21st day, the full effect of the
intervention is not realized immediately but, rather,
accumulates gradually over several days. If a time-series model
is written as the sum of stochastic and intervention
components,
(19)
the stochastic component plays no meaningful role in our
explication of the intervention analysis. Procedures for
building statistically adequate ARIMA models of $j(a_t)$ are
described elsewhere (McCleary & Hay, 1980) but are of little
interest here. Subtracting the stochastic component from the
series
(20)
leaves the dynamic intervention component:
(21)
The simplest dynamic model of a gradual, permanent response to
$X_t$ is
$$Z_t = X_t\,\omega\,(1 - \delta B)^{-1} \qquad (22)$$
where $B$ is a backward lag operator defined such that
$B^k X_t = X_{t-k}$ for any integer $k$.
A Taylor series expansion of the right-hand side yields the
more useful series identity:
$$(1 - \delta B)^{-1} = 1 + \delta B + \delta^2 B^2 + \delta^3 B^3 + \cdots \qquad (23)$$
$$Z_t = X_t\,\omega\,(1 + \delta B + \delta^2 B^2 + \delta^3 B^3 + \cdots) \qquad (24)$$
$$\phantom{Z_t} = X_t\omega + \delta X_{t-1}\omega + \delta^2 X_{t-2}\omega + \delta^3 X_{t-3}\omega + \cdots \qquad (25)$$
$X_t$ is defined as a binary variable such that
$$X_t = 0 \text{ for preintervention days } t = 1, \ldots, 20 \qquad (26)$$
$$X_t = 1 \text{ for postintervention days } t = 21, \ldots, 40. \qquad (27)$$
Before the intervention, $X_{t \le 20} = 0$, so
$$Z_{t \le 20} = 0. \qquad (28)$$
Thereafter, $X_{t > 20} = 1$. On the $j$th postintervention day,
$$Z_{20+j} = \omega + \delta\omega + \delta^2\omega + \cdots + \delta^{j-1}\omega. \qquad (29)$$
Because $\delta$ is a fraction, $\delta^{j-1}$ is a very small
number for large $j$.

TABLE 32.2
Parameter Estimates for Dynamic Intervention Models

Left panel: Gradual response, permanent change
  Model: $g(X_t) = X_t\,\omega\,(1 - \delta B)^{-1}$
  Series: Talking out (Figure 32.8)
  $\hat{\omega} = 6.79$    $t(\hat{\omega}) = 6.81$
  $\hat{\delta} = 0.56$    $t(\hat{\delta}) = 8.48$

Right panel: Abrupt response, temporary change
  Model: $g(X_t) = (1 - B)\,X_t\,\omega\,(1 - \delta B)^{-1}$
  Series: Self-injurious behavior (Figure 32.9)
  $\hat{\omega} = 3.65$    $t(\hat{\omega}) = 3.04$
  $\hat{\delta} = 0.57$    $t(\hat{\delta}) = 3.02$

Note. Parameters estimated with the SCA Statistical System (Liu, 1999).

Parameter estimates for the talking out time series are
reported in the left-hand panel of Table 32.2. Substituting the
estimates of $\omega$ and $\delta$ into the series identity,
$$Z_{21} = 6.79(1) = 6.79 \qquad (30)$$
$$Z_{22} = 6.79(1 + .56) = 10.59 \qquad (31)$$
$$Z_{23} = 6.79(1 + .56 + .31) = 12.72 \qquad (32)$$
$$Z_{24} = 6.79(1 + .56 + .31 + .18) = 13.91 \qquad (33)$$
$$Z_{25} = 6.79(1 + .56 + .31 + .18 + .10) = 14.58. \qquad (34)$$
Daily changes in talking out continue throughout the
postintervention segment, but reductions become smaller and
smaller. Eventually, the effect will converge on
$$6.79 / (1 - .56) = 15.43, \qquad (35)$$
but by the end of the fifth postintervention day, about 95% of
this effect has been realized.
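The substitution can be checked numerically. The sketch below accumulates the terms of the series identity for a step input, using the magnitudes printed in the text ($\omega = 6.79$, $\delta = 0.56$). The function name is hypothetical, and the code is an illustration rather than the estimation procedure used by the authors.

```python
def gradual_permanent_response(omega, delta, n_days):
    """Return Z for the first n_days postintervention days.

    Implements Z_{onset+j-1} = omega * (1 + delta + ... + delta**(j-1)),
    the series identity for a step input that switches on at onset.
    """
    z = []
    running_total = 0.0
    for j in range(n_days):
        running_total += omega * delta ** j  # add the next term of the series
        z.append(running_total)
    return z


omega, delta = 6.79, 0.56          # magnitudes as printed in the text
z = gradual_permanent_response(omega, delta, 5)
asymptote = omega / (1 - delta)    # long-run level of the response

for day, value in zip(range(21, 26), z):
    print(f"Z_{day} = {value:.2f}")
print(f"asymptote = {asymptote:.2f}")
print(f"share realized by day 25 = {z[-1] / asymptote:.3f}")
```

The share realized after five days equals $1 - \delta^5$, which is roughly 0.945 for $\delta = 0.56$.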
The self-injurious behavior time series in Figure 32.9 presents
a typical abrupt, temporary response to an intervention that,
in this case, is a placebo. On the first postintervention day,
this patient's rate of self-injurious behavior drops abruptly
but, in subsequent days, returns to its preintervention level.
The simplest dynamic model of an abrupt, temporary effect is
$$Z_t = (1 - B)\,X_t\,\omega\,(1 - \delta B)^{-1}. \qquad (36)$$
This model has the series identity:
$$Z_t = \nabla X_t\omega + \delta\nabla X_{t-1}\omega + \delta^2\nabla X_{t-2}\omega + \delta^3\nabla X_{t-3}\omega + \cdots \qquad (37)$$
Whereas $X_t$ remains on throughout the postintervention
period, the first difference of $X_t$,
$\nabla X_t = (1 - B)X_t$, turns on in the first
postintervention day and then turns off again:
$$\nabla X_t = 0 \text{ for days } t = 1, \ldots, 25 \qquad (38)$$
$$\nabla X_t = 1 \text{ for } t = 26 \qquad (39)$$
$$\nabla X_t = 0 \text{ for days } t = 27, \ldots, 50. \qquad (40)$$
Before the intervention, $\nabla X_{t \le 25} = 0$ and
$$Z_{t \le 25} = 0. \qquad (41)$$
On the first day of the intervention, $\nabla X_{26} = 1$, so
$$Z_{26} = \omega. \qquad (42)$$
But thereafter, $\nabla X_{t > 26} = 0$ again, and
$$Z_{26+j} = \delta^{j}\omega. \qquad (43)$$
Again, as $\delta^{j}\omega$ approaches 0, the placebo effect
decays.
Parameter estimates for the placebo self-injurious behavior
time series are reported in the right-hand column of
Table 32.2. Substituting the estimates of $\omega$ and $\delta$
into the series identity,
$$Z_{26} = 3.04, \qquad (44)$$
$$Z_{27} = 3.04(.57) = 1.73, \qquad (45)$$
$$Z_{28} = 3.04(.57)^2 = 0.99, \qquad (46)$$
$$Z_{29} = 3.04(.57)^3 = 0.56, \text{ and} \qquad (47)$$
$$Z_{30} = 3.04(.57)^4 = 0.32. \qquad (48)$$
By the end of the fifth postintervention day, about 90% of the
placebo effect has dissipated.
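The same arithmetic can be reproduced with a short recursion. Because the abrupt, temporary model implies $(1 - \delta B) Z_t = \omega \nabla X_t$, each value is $\delta$ times the previous value plus $\omega$ times the differenced input. The sketch below is illustrative only: the function names are hypothetical, and $\omega = 3.04$ and $\delta = 0.57$ are the values used in the substitution in the text.

```python
def first_difference(x):
    """Return the first difference of x, taking the value before x[0] as 0."""
    return [x[0]] + [x[t] - x[t - 1] for t in range(1, len(x))]


def decaying_response(pulse, omega, delta):
    """Apply Z_t = delta * Z_{t-1} + omega * pulse_t, starting from Z = 0."""
    z, previous = [], 0.0
    for p in pulse:
        previous = delta * previous + omega * p
        z.append(previous)
    return z


# Step input: off for days 1-25, on for days 26-50 (index 0 is day 1).
step = [0] * 25 + [1] * 25
pulse = first_difference(step)      # equals 1 on day 26 only
z = decaying_response(pulse, omega=3.04, delta=0.57)

for day in range(26, 31):
    print(f"Z_{day} = {z[day - 1]:.2f}")
print(f"share dissipated by day 30 = {1 - z[29] / z[25]:.3f}")
```

Differencing the step input is what turns a permanent change into a one-day pulse; the recursion then spreads that pulse over later days at a rate set by $\delta$.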
In either of these two dynamic models, the parameter $\delta$
determines the rate of postintervention change in the time
series. Intervention null hypotheses can be devised around the
value of $\delta$ to test properties of the response.
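To see how $\delta$ governs the rate of change, note that under the gradual, permanent model the share of the long-run effect realized after $j$ postintervention periods is $1 - \delta^j$, while under the abrupt, temporary model the share of the initial spike still present $k$ periods after onset is $\delta^k$. The sketch below tabulates both for a few hypothetical values of $\delta$; the function names are illustrative.

```python
def share_realized(delta, j):
    """Share of the long-run effect reached after j periods (gradual model)."""
    return 1.0 - delta ** j


def share_remaining(delta, k):
    """Share of the onset spike left k periods after onset (temporary model)."""
    return delta ** k


for delta in (0.2, 0.5, 0.8):  # hypothetical rate parameters
    realized = [round(share_realized(delta, j), 3) for j in (1, 3, 5)]
    remaining = [round(share_remaining(delta, k), 3) for k in (1, 3, 5)]
    print(f"delta={delta}: realized={realized}, remaining={remaining}")
```

Values of $\delta$ near 0 give a fast approach to the new level (or a fast decay of the spike), and values near 1 give slow change.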
CONCLUSION
Time-series data and time-series designs have a long history in
psychological research. Of the many uses of time-series data,
causal inferences from experiments and quasi-experiments are
currently their widest application. Given a reasonably long
time series, balanced data, and an adequate ARIMA model, the
time-series quasi-experiment is among the most useful and valid
quasi-experimental designs.
The advantages of the time-series quasi-experiment are
especially apparent in the absence of naturally defined control
groups. To emphasize this property, Campbell and Stanley (1963)
cited the hypothetical example of a chemist who, dipping an
iron bar into nitric acid, attributes the bar's loss of weight
to the acid bath: "There may well have been 'control groups' of
iron bars remaining on the shelf that lost no weight but the
measurement and reporting of these weights would typically not
be thought necessary or relevant" (p. 37).
The design is also vulnerable to multiple threats to validity,
of course, and one would normally not use it in situations in
which randomized controlled trials are possible. These cases
aside, the time-series quasi-experiment is a feasible and
relatively strong design across a wide range of circumstances.
References
Beveridge, W. H. (1922). Wheat prices and rainfall in western Europe. Journal of the Royal Statistical Society, 85, 412-475. doi:10.2307/2341183
Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. San Francisco, CA: Holden-Day.
Box, G. E. P., & Tiao, G. C. (1975). Intervention analysis with applications to economic and environmental problems. Journal of the American Statistical Association, 70, 70-79. doi:10.2307/2285379
Campbell, D. T. (1963). From description to experimentation: Interpreting trends as quasi-experiments. In C. W. Harris (Ed.), Problems in measuring change (pp. 212-243). Madison: University of Wisconsin Press.
Campbell, D. T., & Ross, H. L. (1968). The Connecticut crackdown on speeding: Time series data in quasi-experimental analysis. Law and Society Review, 3, 33-53. doi:10.2307/3052794
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand-McNally.
Chree, C. (1913). Some phenomena of sunspots and of terrestrial magnetism at Kew Observatory. Philosophical Transactions of the Royal Society of London, Series A, 212, 75-116. doi:10.1098/rsta.1913.0003
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Englewood Cliffs, NJ: Erlbaum.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton-Mifflin.
Dollard, J., Doob, L. W., Miller, N. E., Mowrer, O. H., & Sears, R. R. (1939). Frustration and aggression. New Haven, CT: Yale University Press.
Elton, C. S. (1924). Fluctuations in the numbers of animals: Their causes and effects. Journal of Experimental Biology, 2, 119-163.
Elton, C., & Nicholson, M. (1942). The ten-year cycle in numbers of the lynx in Canada. Journal of Animal Ecology, 11, 215-244. doi:10.2307/1358
Fisher, R. A. (1921). Studies in crop variation: An examination of the yield of dressed grain from Broadbalk. Journal of Agricultural Science, 11, 107-135. doi:10.1017/S0021859600003750
Fisher, R. A. (1925). Statistical methods for research workers. London, England: Oliver & Boyd.
Florence, P. S. (1923). Recent researches in industrial fatigue. Economic Journal, 33, 185-197. doi:10.2307/2222844
Franke, H. F., & Kaul, J. D. (1978). The Hawthorne experiments: First statistical interpretation. American Sociological Review, 43, 623-643. doi:10.2307/2094540
Glass, G. V., Willson, V. L., & Gottman, J. M. (1975). Design and analysis of time series experiments. Boulder: Colorado Associated University Press.
Granger, C. W. J. (1969). Investigating causal relationships by econometric models and cross-spectral methods. Econometrica, 37, 424-438. doi:10.2307/1912791
Greene, W. H. (2000). Econometric analysis (4th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Hall, R. V., Fox, R., Willard, D., Goldsmith, L., Emerson, M., Owen, M., ... Porcia, E. (1971). The teacher as observer and experimenter in the modification of disputing and talking-out behaviors. Journal of Applied Behavior Analysis, 4, 141-149. doi:10.1901/jaba.1971.4-141
Hennigan, K. M., Del Rosario, M. L., Heath, L., Cook, T. D., Wharton, J. D., & Calder, B. J. (1982). Impact of the introduction of television on crime in the United States: Empirical findings and theoretical implications. Journal of Personality and Social Psychology, 42, 461-477. doi:10.1037/0022-3514.42.3.461
Hepworth, J. T., & West, S. G. (1988). Lynchings and the economy: A time-series reanalysis of Hovland and Sears. Journal of Personality and Social Psychology, 55, 239-247. doi:10.1037/0022-3514.55.2.239
Hovland, C. I., & Sears, R. R. (1940). Minor studies of aggression IV. Correlation of lynchings with economic indices. Journal of Psychology, 9, 301-310. doi:10.1080/00223980.1940.9917696
Kendall, M., & Stuart, A. (1979). The advanced theory of statistics (4th ed., Vol. 2). London, England: Charles Griffin.
Kroeber, A. L. (1919). On the principle of order in civilization as exemplified by changes of fashion. American Anthropologist, 21, 235-263. doi:10.1525/aa.1919.21.3.02a00010
Kroeber, A. L. (1944). Configurations of cultural growth. Berkeley: University of California Press.
Lipsey, M. (1990). Design sensitivity: Statistical power for experimental research. Thousand Oaks, CA: Sage.
Liu, L.-M. (1999). Forecasting and time series analysis using the SCA statistical system. Villa Park, IL: Scientific Computing Associates.
McCleary, R. (2000). Evolution of the time series experiment. In L. Bickman (Ed.), Research design: Donald Campbell's legacy (pp. 215-234). Thousand Oaks, CA: Sage.
McCleary, R., & Hay, R. A., Jr. (1980). Applied time series analysis for the social sciences. Beverly Hills, CA: Sage.
McCleary, R., Nienstedt, B. C., & Erven, J. M. (1982). Uniform crime reports and organizational outcomes: Three time series quasi-experiments. Social Problems, 29, 361-372. doi:10.1525/sp.1982.29.4.03a00030
McCleary, R., & Riggs, J. E. (1982). The 1975 Australian Family Law Act: A model for assessing legal impacts. New Directions for Program Evaluation, 16, 7-18.
McCleary, R., Touchette, P., Taylor, D. V., & Barron, J. L. (1999, March). Contagious models for self-injurious behavior. Poster presentation, 32nd Annual Gatlinburg Conference on Research and Theory in Mental Retardation, Charleston, SC.
McDowall, D., Loftin, C., & Wiersema, B. (1992). A comparative study of the preventive effects of mandatory sentencing laws for gun crimes. Journal of Criminal Law and Criminology, 83, 378-394. doi:10.2307/1143862
McDowall, D., McCleary, R., Meidinger, E. E., & Hay, R. A., Jr. (1980). Interrupted time series analysis. Beverly Hills, CA: Sage.
McGaw, D. B., & McCleary, R. (1985). PAC spending, electioneering, and lobbying: A vector ARIMA time series analysis. Polity, 17, 574-585. doi:10.2307/3234659
Mills, T. C. (1991). Time series techniques for economists. New York, NY: Cambridge University Press.
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, 20A, 175-240.
Reiss, A. J., & Roth, J. A. (1993). Understanding and preventing violence. Washington, DC: National Academies Press.
Richardson, J., & Kroeber, A. L. (1940). Three centuries of women's dress fashions: A quantitative analysis. Anthropological Records, 5, 111-153.
Roethlisberger, F., & Dickson, W. J. (1939). Management and the worker. Cambridge, MA: Harvard University Press.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. New York, NY: Houghton Mifflin.
Wolf, J. R. (1848). Nachrichten über die Sternwarte in Bern [News from the observatory in Berne]. Mittheilungen der naturforschenden Gesellschaft in Bern, Nr. 114-115. ETH-Bibliothek Zürich, Rar 4201. doi:10.3931/e-rara-2007
Yule, G. U. (1927). On a method of investigating periodicities in disturbed series with special reference to Wolfer's sunspot numbers. Philosophical Transactions of the Royal Society of London, Series A, 226, 267-298.