. reg y x, robust
Chapter 1
Reading Assignments:
• “Does the Vaccine Matter?” The Atlantic (November 2009)
The selection problem is one of the most important concepts in economics. To illustrate this
concept, we start with a health-related example. Suppose that we would like to empirically
estimate the effect of the flu shot on health. This is known as treatment evaluation,¹ where
health is the outcome and the flu shot is the treatment. Unfortunately, we don’t have a
flu shot experiment to analyze. In fact, there has never been a randomized controlled trial
to evaluate the effectiveness of the flu shot. Instead, imagine that we have survey data
from a large random sample of adults. For each individual i we observe if a flu shot was
administered as well as a measure of overall health. The flu shot treatment is given by:
\[
s_i =
\begin{cases}
1 & \text{if individual } i \text{ received the treatment} \\
0 & \text{if individual } i \text{ did not receive the treatment.}
\end{cases}
\]
The outcome of interest, yi, is a continuous measure of overall health where higher values
indicate better health.² We can use this data to calculate the mean of y when s = 1 and the
mean of y when s = 0. Empirical studies of the effect of the flu shot on health have found
that:
\[
E(y_i \mid s_i = 1) > E(y_i \mid s_i = 0). \tag{1.1}
\]
¹ If the treatment varies by type or intensity, we call this multiple treatment evaluation.
² In practice, objective measures of health are hard to come by, particularly objective measures that are
continuous. Many researchers just use mortality (an indicator variable for death) as their measure of health.
Does this mean that the flu shot is causing health to improve, perhaps by making contracting
the flu less likely and thus reducing the probability of other flu-related illnesses? No, it does
not. Eq. (1.1) simply declares that people who get a flu shot are healthier on average than
those who do not. Comparing the average health of those who received a flu shot with those
who did not does not tell us why there is a difference; it only tells us that the flu shot and
good health are positively correlated. I’ll repeat: the difference between the average health
of those who received a flu shot and the average health of those who did not receive a flu shot
does not tell us whether the flu shot is making people healthier, that is, causing people
to have better health. Ultimately, it is the causal effect of the flu shot on health that we
care about.
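To define this causal effect, let y1i denote the health of individual i with the flu shot and
let y0i denote the health of the same individual without the flu shot. These are the two
potential outcomes for individual i.
The unit-level treatment effect is the difference between the two potential outcomes:
\[
\tau_i = y_{1i} - y_{0i}.
\]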
The unit-level treatment effect, τi , is the causal effect of the treatment for individual i. If
τi is positive it means that the flu shot would improve individual i’s health. If τi is zero, it
means that the flu shot has no effect on individual i’s health.
The fundamental problem of causal inference is that it is impossible to observe
both y1i and y0i for the same individual. This makes it impossible to observe the unit-level
treatment effect, τi. For each individual i we only observe:
\[
y_i =
\begin{cases}
y_{1i} & \text{if } s_i = 1 \\
y_{0i} & \text{if } s_i = 0.
\end{cases}
\]
This potential outcome notation³ allows us to frame causal inference as a missing data problem
because we can’t know what effect the flu shot has unless we know what would have happened
had the individual not received the flu shot.
³ The potential outcome approach was developed by Rubin (1974, 1977, and 1978).
A large amount of homogeneity could solve the missing data problem. Assume that
everyone is identical except that some people choose s = 1 and others choose s = 0 randomly.
Everyone is identical so all individuals have the same potential outcomes. This implies that
if individual i selects si = 1 and therefore we observe yi = y1i we could use the health
outcome yj for individual j, who selected sj = 0, as a proxy for y0i . The reason this works is
that the homogeneity ensures that y0j = y0i . This enables us to calculate the causal effect,
y1i − y0i , by using observed health from different individuals. Alternatively, assume that y0i
and y1i are constant over time for individual i. Then if there are some periods in which y0i
is observed and others in which y1i is observed, we can calculate the causal effect, y1i − y0i ,
by using observed health from different time periods. However, in practice neither of these
approaches is likely to work as there is often a large degree of heterogeneity both across
individuals and over time.
There may also be heterogeneity in the unit-level treatment effect. The effect of a flu
shot may be larger for some individuals than others. There may be individuals for whom a
flu shot actually decreases health. Because we never observe both y1i and y0i for the same
person, we are just trying to learn about the average value of τ . The three most commonly
estimated treatment effects are the average treatment effect (ATE), the average treatment
effect on the treated (ATT), and the average treatment effect on the untreated (ATU) as
defined below:
Definition 1.1 (Average Treatment Effect) The average treatment effect (ATE) is
the population mean of all unit-level treatment effects:
\[
\text{ATE} = E[\tau] = E[y_1 - y_0]
\]
Definition 1.2 (Average Treatment Effect on the Treated) The average treatment
effect on the treated (ATT) is the population mean of unit-level treatment effects
for only those individuals who received the treatment:
\[
\text{ATT} = E[\tau \mid s = 1] = E[y_1 \mid s = 1] - E[y_0 \mid s = 1]
\]
Definition 1.3 (Average Treatment Effect on the Untreated) The average treatment
effect on the untreated (ATU) is the population mean of unit-level treatment effects
for only those individuals who did not receive the treatment:
\[
\text{ATU} = E[\tau \mid s = 0] = E[y_1 \mid s = 0] - E[y_0 \mid s = 0]
\]
We contrast these average treatment effects with the naïve average treatment effect
(NATE) that is commonly reported in newspapers and in unsophisticated academic articles
and reports. The naïve average treatment effect is what we obtain when we compare
the average value of y for those with s = 1 to the average value of y for those with s = 0. It
is also what we obtain when we run a simple regression of y on s.
Definition 1.4 (Naïve Average Treatment Effect) The naïve average treatment effect
(NATE) is the difference between the population mean of y for those who received the
treatment and the population mean of y for those who did not receive the treatment:
\[
\text{NATE} = E[y \mid s = 1] - E[y \mid s = 0]
\]
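The equivalence between the difference in means and the simple regression coefficient can be checked numerically. The following is a minimal sketch written as a do-file fragment; the simulated data, the seed, and the numbers 50, 5, 10, and 0.4 are hypothetical choices made only for illustration and do not come from any flu shot study.

```stata
* Hypothetical data: s is a 0/1 treatment indicator, y is a health index
clear
set seed 12345
set obs 1000
gen byte s = runiform() < 0.4
gen double y = 50 + 5*s + rnormal(0, 10)

* NATE computed as a difference in sample means
quietly summarize y if s == 1
scalar ybar1 = r(mean)
quietly summarize y if s == 0
scalar ybar0 = r(mean)
scalar nate = ybar1 - ybar0
display "Difference in means:  " %9.6f nate

* The slope from a simple regression of y on s gives the same number
quietly regress y s
display "OLS coefficient on s: " %9.6f _b[s]
```

The two displayed numbers should agree up to rounding, because with a binary regressor the OLS slope is exactly the difference between the two group means. The sketch says nothing about whether that number is causal.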
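The naïve average treatment effect is not the ATE, the ATT, or the ATU. So, what does it
estimate? We can decompose the NATE into two parts:
\[
\begin{aligned}
E[y \mid s = 1] - E[y \mid s = 0] &= E[y_1 \mid s = 1] - E[y_0 \mid s = 0] \\
&= \underbrace{E[y_1 \mid s = 1] - E[y_0 \mid s = 1]}_{\text{ATT}}
 + \underbrace{E[y_0 \mid s = 1] - E[y_0 \mid s = 0]}_{\text{selection bias}}
\end{aligned}
\tag{1.4}
\]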
Moving from the first to the second line above comes from adding and subtracting the
same term, E[y0 | s = 1]. This enables us to express the naïve average treatment effect as a
combination of the average treatment effect on the treated and selection bias. Consider
the selection bias term, E [y0 |s = 1] − E [y0 |s = 0]. This term represents the difference in
baseline health (in the hypothetical world in which there was no treatment) between those
who choose to get a flu shot and those who do not. If healthier people are more likely to
get the flu shot than those who are less healthy, this selection bias term will be positive.
This implies that the naïve average treatment effect is larger than the causal effect for the
treated. This positive selection into the treatment is what likely accounts for the findings
that those who get the flu shot are less likely to be in a car accident, get diabetes, or
become unemployed.
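A simulation makes the decomposition concrete, because in simulated data both potential outcomes are visible. The sketch below is a do-file fragment under stated assumptions: the data generating process is hypothetical (baseline health drawn from a normal distribution, a constant treatment effect of 2, and a selection rule under which people with better baseline health are more likely to choose the shot). None of these numbers come from the source.

```stata
clear
set seed 2024
set obs 100000

* Potential outcomes (both visible only because this is a simulation)
gen double y0 = rnormal(50, 10)
gen double y1 = y0 + 2

* Positive selection: better baseline health makes s = 1 more likely
gen byte s = (y0 + rnormal(0, 10)) > 50

* Observed outcome
gen double y = y1*s + y0*(1 - s)

quietly summarize y if s == 1
scalar m_y_s1 = r(mean)
quietly summarize y if s == 0
scalar m_y_s0 = r(mean)
quietly summarize y1 if s == 1
scalar m_y1_s1 = r(mean)
quietly summarize y0 if s == 1
scalar m_y0_s1 = r(mean)
quietly summarize y0 if s == 0
scalar m_y0_s0 = r(mean)

scalar nate    = m_y_s1  - m_y_s0
scalar att     = m_y1_s1 - m_y0_s1
scalar selbias = m_y0_s1 - m_y0_s0

display "NATE           = " %8.4f nate
display "ATT            = " %8.4f att
display "Selection bias = " %8.4f selbias
display "ATT + bias     = " %8.4f att + selbias
```

By construction the ATT is 2, the selection bias is positive, and the NATE equals their sum, so the naïve comparison overstates the effect of the treatment on the treated.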
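Alternatively, we can express the naïve average treatment effect as a combination of the
ATU and a different form of selection bias:
\[
\begin{aligned}
E[y \mid s = 1] - E[y \mid s = 0] &= E[y_1 \mid s = 1] - E[y_0 \mid s = 0] \\
&= \underbrace{E[y_1 \mid s = 0] - E[y_0 \mid s = 0]}_{\text{ATU}}
 + \underbrace{E[y_1 \mid s = 1] - E[y_1 \mid s = 0]}_{\text{selection bias}}
\end{aligned}
\tag{1.5}
\]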
This time, the selection bias term, E[y1 | s = 1] − E[y1 | s = 0], is not in terms of the baseline
health status, y0, but is instead in terms of the post-treatment health status, y1. This term
incorporates both the positive selection of healthier people into the flu shot treatment and
any difference in the causal effect of the treatment (even if not observed) between those
who chose to receive the flu shot and those who did not.
The average treatment effect (ATE) is simply the weighted average of the average treatment
effect on the treated (ATT) and the average treatment effect on the untreated (ATU):
\[
\text{ATE} = \pi \, \text{ATT} + (1 - \pi) \, \text{ATU}
\]
where π is the fraction of the population that selects into treatment. This means that we
can also decompose the naïve average treatment effect as follows:
\[
E[y \mid s = 1] - E[y \mid s = 0] = \text{ATE}
 + \underbrace{E[y_0 \mid s = 1] - E[y_0 \mid s = 0]}_{\text{selection bias}}
 + (1 - \pi)\,(\text{ATT} - \text{ATU})
\tag{1.8}
\]
which shows that the difference between the average treatment effect and the naïve average
treatment effect is the selection bias plus (1 − π) times the difference between the average
treatment effect on the treated and the average treatment effect on the untreated.
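A small worked example may help fix ideas. The numbers below are hypothetical and chosen only so that the arithmetic is easy to follow: 40 percent of the population is treated, the treated have better baseline health than the untreated, and the treatment helps the treated more than it would help the untreated.

```latex
\begin{aligned}
&\pi = 0.4, \quad E[y_0 \mid s=1] = 55, \quad E[y_0 \mid s=0] = 50, \quad
 \text{ATT} = 3, \quad \text{ATU} = 1 \\[4pt]
&\text{ATE} = 0.4(3) + 0.6(1) = 1.8 \\
&\text{selection bias} = 55 - 50 = 5 \\
&\text{NATE} = E[y_1 \mid s=1] - E[y_0 \mid s=0] = (55 + 3) - 50 = 8 \\[4pt]
&\text{Check: } \text{ATE} + \text{selection bias} + (1-\pi)(\text{ATT} - \text{ATU})
 = 1.8 + 5 + 0.6(2) = 8
\end{aligned}
```

In this illustration the naïve comparison (8) is more than four times the average treatment effect (1.8), and most of the gap comes from the difference in baseline health rather than from the treatment.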
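Random assignment of the treatment eliminates selection bias: when s is assigned at random,
the treated and untreated groups have the same average baseline outcome, so
E[y0 | s = 1] = E[y0 | s = 0].
In addition, random assignment assures us that there will be no difference between the ATT
and the ATU. The treated and untreated groups were randomly assigned so there should be
no underlying difference between the two groups in the average unit-level treatment effect
and thus E[y1 | s = 1] = E[y1 | s = 0]. Therefore Eq. (1.4), Eq. (1.5), and Eq. (1.8) simplify
down to a single equivalence:
\[
E[y \mid s = 1] - E[y \mid s = 0] = \text{ATE} = \text{ATT} = \text{ATU}
\tag{1.10}
\]
The condition under which Eq. (1.10) is true is written as {y1, y0} ⊥ s, indicating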
that the treatment is independent of the two potential outcomes. This assumption was first
stated by Rosenbaum and Rubin (1983) as unconfoundedness and we will elaborate on
the unconfoundedness assumption below.
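The effect of random assignment can be illustrated with simulated data in which both potential outcomes are visible. The following do-file fragment is a sketch under hypothetical assumptions (normally distributed baseline health, a heterogeneous treatment effect with mean 2, and treatment assigned by a fair coin flip); none of these values come from the source.

```stata
clear
set seed 2024
set obs 100000

* Potential outcomes with a heterogeneous unit-level effect
gen double y0  = rnormal(50, 10)
gen double tau = 2 + rnormal(0, 1)
gen double y1  = y0 + tau

* Treatment assigned at random, independently of y0 and y1
gen byte s = runiform() < 0.5
gen double y = y1*s + y0*(1 - s)

quietly summarize tau
scalar ate = r(mean)
quietly summarize tau if s == 1
scalar att = r(mean)
quietly summarize tau if s == 0
scalar atu = r(mean)
quietly regress y s
scalar nate = _b[s]

display "ATE  = " %8.4f ate
display "ATT  = " %8.4f att
display "ATU  = " %8.4f atu
display "NATE = " %8.4f nate
```

All four numbers should be close to 2, differing only by sampling noise. Replacing the coin flip with a rule that depends on y0 would break this agreement.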
Randomized trials are not common in economics, but they offer the most credible evidence.
One example is the 1962 Perry Preschool Project, an experiment in which some black
preschool students in Michigan were randomly assigned to receive high-quality schooling and
home visits. Another is the 1985 Tennessee STAR experiment, in which students in kindergarten
through third grade were randomly assigned to either small classes (13-17 students),
large classes (22-25 students), or large classes with a paid teacher aide. The Oregon
Medicaid Experiment randomized which families were allowed to participate in the Medicaid
expansion.
Randomized trials are the conceptual benchmark for observational or non-experimental
study designs. When trying to estimate a causal effect using observational data, the
economist should ask, what ideal experiment would yield the causal effect of interest? If you
cannot think of an experiment that would produce the desired causal effect estimate, this
implies that you do not have a well-defined research question. If there is no ideal experiment
that would estimate the causal effect, regression analysis will not be able to estimate the
causal effect either.
The first assumption is the stable unit treatment value assumption (SUTVA):
Definition 1.5 (Stable Unit Treatment Value Assumption) The observed outcomes
are realized as
\[
y_i = y_{1i} s_i + y_{0i} (1 - s_i)
\]
regardless of the mechanism used to assign the treatment and regardless of what treatments
the other units receive.
This assumption implies that the potential outcomes of individual i must be unaffected by
the treatment of any other individual j, which rules out all interference across units. In our flu shot
example, this assumption will not be satisfied if the treatment of some individuals directly
increases the potential health outcomes for untreated individuals.
The second assumption is the constant treatment effects assumption, which implies
that the treatment effect is the same for every individual. This assumption is stated formally
as:
Definition 1.6 (Constant Treatment Effects Assumption)
\[
y_{1i} - y_{0i} = \tau \quad \forall i
\]
Together, these two assumptions allow us to write observed health as a linear regression model:
\[
y_i = \alpha + \tau s_i + u_i. \tag{1.11}
\]
In this model, τ is the causal effect of the flu shot on health. A regression model is considered
a structural model if it represents a causal relationship rather than statistical correlation.
A structural equation may be derived from theory or obtained from informal reasoning. The
error term ui consists of everything other than si that determines yi . This includes both
omitted variables and measurement error.
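One way to see where Eq. (1.11) comes from is to combine Definition 1.5 with Definition 1.6. The sketch below is a standard derivation rather than one taken from the source; in particular, defining α as the mean of the untreated potential outcome and u_i as the deviation from that mean is an assumption made here for illustration.

```latex
\begin{aligned}
y_i &= y_{1i} s_i + y_{0i}(1 - s_i)
      && \text{(Definition 1.5)} \\
    &= y_{0i} + (y_{1i} - y_{0i})\, s_i \\
    &= y_{0i} + \tau s_i
      && \text{(Definition 1.6)} \\
    &= \underbrace{E[y_{0i}]}_{\alpha} + \tau s_i
       + \underbrace{\bigl(y_{0i} - E[y_{0i}]\bigr)}_{u_i}
\end{aligned}
```

Under this reading, the error term is the individual's deviation from average baseline health, so a correlation between s_i and u_i is the same thing as selection on baseline health.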
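We can derive the expected value of health conditional on flu shot status:
\[
E[y_i \mid s_i = 1] = \alpha + \tau + E[u_i \mid s_i = 1]
\]
\[
E[y_i \mid s_i = 0] = \alpha + E[u_i \mid s_i = 0]
\]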
Let’s turn now to estimating the model to obtain τ̂, the estimated causal effect of s on y.
In this simple linear regression model, the OLS estimate is given by:
\[
\hat{\tau} = \frac{\sum_{i=1}^{n} (s_i - \bar{s})\, y_i}{\sum_{i=1}^{n} (s_i - \bar{s})^2}
\tag{1.14}
\]
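The formula in Eq. (1.14) can be computed by hand and compared with the coefficient that regress reports. The following do-file fragment is a minimal sketch; the simulated data and the variable names sdev, num, and den are hypothetical and used only for this check.

```stata
clear
set seed 1
set obs 500
gen double s = runiform() < 0.5
gen double y = 1 + 2*s + rnormal()

* Deviations of s from its sample mean
quietly summarize s
gen double sdev = s - r(mean)

* Numerator and denominator terms of Eq. (1.14)
gen double num = sdev * y
gen double den = sdev^2
quietly summarize num
scalar sum_num = r(sum)
quietly summarize den
scalar sum_den = r(sum)
scalar tau_hat = sum_num / sum_den

quietly regress y s
display "By hand: " %9.6f tau_hat
display "regress: " %9.6f _b[s]
```

The two numbers should match up to rounding, since regress computes the same ratio of sums for a model with one regressor and a constant.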
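. clear
. set seed 123
. set obs 1000
. * u is the error term; s is drawn independently of u
. gen u = rnormal()
. gen s = rnormal()
. * data generating process: the true causal effect of s on y is 2
. gen y = 2*s + 4*u
. reg y s
. * fitted values two ways: from predict, and by hand from the stored coefficients
. * (the original run reported 0.0813193 for the intercept and 1.988262 for the slope)
. predict yhat, xb
. gen yline = _b[_cons] + _b[s]*s
. summarize y*
. set scheme s1mono
. twoway (lfit y s, lcolor(black) lwidth(medium)) (scatter y s, mcolor(black) msymbol(point)), legend(off) ytitle(y) xtitle(s)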
Run these commands in Stata and examine the results. In this simulation, we chose the
data generating process, y = 2s + 4u, and we generated s and u independently. Therefore,
u and s are uncorrelated in the population (and nearly so in the sample), and the regression
model that we specified is the correct
structural model producing causal estimates. The estimated coefficient on s is very close to
the true population value of 2, as we would expect.
[Figure 1.1: scatter plot of the simulated values of y against s, with the fitted regression line]
The figure plots each of the simulated data points as well as the regression line. The
intercept is very close to the true parameter value of 0 and the simulated data points appear to
have a constant variance with no unusual patterns. However, it is essential to understand that
no amount of statistical testing can tell us if the regression model we specified is producing
a causal estimate. Consider altering the example slightly to draw random values of s that
are correlated with u:
. clear
. set seed 123
. set obs 1000
. * target correlation between s and u
. local corr = 0.5
. gen u = rnormal()
. * scale the independent noise so that s and u have the target correlation
. gen s = u + sqrt((1/(`corr'^2)-1)) * rnormal()
. gen y = 2*s + 4*u
. reg y s
. twoway (lfit y s, lcolor(black) lwidth(medium)) (scatter y s, mcolor(black) msymbol(point)), legend(off) ytitle(y) xtitle(s)
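[Figure 1.2: scatter plot of the simulated values of y against s when s is correlated with u, with the fitted regression line]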
The true causal effect of s on y is still 2 as specified in the data generating process.
However, the correlation between u and s causes bias in our estimate of τ, as shown in Figure
1.2. Returning to Eq. (1.14) and substituting yi = α + τ si + ui into that equation,
we obtain the following:
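\[
\begin{aligned}
\hat{\tau} &= \frac{\sum_{i=1}^{n} (s_i - \bar{s})(\alpha + \tau s_i + u_i)}{\sum_{i=1}^{n} (s_i - \bar{s})^2} \\
&= \alpha \underbrace{\frac{\sum_{i=1}^{n} (s_i - \bar{s})}{\sum_{i=1}^{n} (s_i - \bar{s})^2}}_{=0}
 + \tau \underbrace{\frac{\sum_{i=1}^{n} (s_i - \bar{s})\, s_i}{\sum_{i=1}^{n} (s_i - \bar{s})^2}}_{=1}
 + \frac{\sum_{i=1}^{n} (s_i - \bar{s})\, u_i}{\sum_{i=1}^{n} (s_i - \bar{s})^2}
\end{aligned}
\]
\[
\hat{\tau} = \tau + \underbrace{\frac{\sum_{i=1}^{n} (s_i - \bar{s})\, u_i}{\sum_{i=1}^{n} (s_i - \bar{s})^2}}_{\text{Selection Bias}}
\tag{1.15}
\]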
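Eq. (1.15) can be applied to the second simulation to see roughly how large the bias should be. The calculation below is a large-sample sketch, not output from Stata: it replaces the sample sums with population moments, writes e for the independent standard normal draw used to build s, and treats 4u as the structural error term because the data generating process is y = 2s + 4u.

```latex
\begin{aligned}
s &= u + \sqrt{\tfrac{1}{0.5^2} - 1}\; e = u + \sqrt{3}\, e,
  \qquad \operatorname{Var}(s) = 1 + 3 = 4 \\
\operatorname{Cov}(s, 4u) &= 4 \operatorname{Var}(u) = 4 \\
\operatorname{plim} \hat{\tau} &= 2 + \frac{\operatorname{Cov}(s, 4u)}{\operatorname{Var}(s)}
  = 2 + \frac{4}{4} = 3
\end{aligned}
```

With 1,000 observations the estimated coefficient on s should therefore be near 3 rather than near the true value of 2, with the exact figure depending on the random draws.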