
Proxy Variables

William Matcham
February 29, 2016
Reference: Slide Set 1, 30-33 and Wooldridge 298-303 (Wooldridge adds an extra regressor in the
wage example, but the slides and the text below omit this term in order to make the discussion clearer.)

1 Introduction
A very common problem in econometrics is that we do not observe (i.e. do not have data on)
covariates that are considered important to the analysis.
Proxy variables provide one way to mitigate the problems that arise when we cannot include in
our regression a covariate that we would like to include.
Motivating example: suppose we wish to understand how education influences the wage of an
individual. One factor that affects the wage of an individual, and that is correlated with education
level, is inherent natural ability.
Ability is therefore a confounding factor in the model.
The regression we may consider is

$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + u \tag{1}$$

Inherent ability is very difficult, if not impossible, to measure, so without any better options we
may just leave out ability and run

$$\log(wage) = \beta_0 + \beta_1 educ + w, \qquad w = \beta_2 abil + u \tag{2}$$

OLS estimation on (2) leads to an inconsistent estimator of $\beta_1$, because educ is correlated with
abil, which is now part of the error term $w$.
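The inconsistency can be illustrated with a small simulation. This is only a sketch under assumed values: the coefficients, the way educ depends on abil, and the function name `simulate_omitted_ability` are all hypothetical and chosen for illustration, not taken from the slides or from Wooldridge. Under these assumed values the short-regression slope converges to 0.08 + 0.05 × 2/8 = 0.0925 rather than the true 0.08.

```python
import numpy as np

def simulate_omitted_ability(n=200_000, beta1=0.08, beta2=0.05, seed=0):
    """Simulate model (1), then fit the short regression (2) that omits abil.

    All numerical values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    abil = rng.normal(0.0, 1.0, n)                       # unobserved in practice
    educ = 12.0 + 2.0 * abil + rng.normal(0.0, 2.0, n)   # correlated with abil
    u = rng.normal(0.0, 0.3, n)                          # error of model (1)
    log_wage = 1.0 + beta1 * educ + beta2 * abil + u

    # OLS of log(wage) on a constant and educ only, as in (2)
    X = np.column_stack([np.ones(n), educ])
    coef, *_ = np.linalg.lstsq(X, log_wage, rcond=None)
    return coef[1]                                       # slope on educ

b1_short = simulate_omitted_ability()
print(f"true beta1 = 0.08, short-regression slope = {b1_short:.4f}")
```

Because ability raises both education and the wage in this setup, the slope on educ picks up part of the effect of ability and overstates the return to education, however large the sample.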


An attempt to deal with this inconsistency comes from using a proxy variable for ability. Loosely,
a proxy should be a variable correlated with the unobserved variable in question. It's essentially an
observable variable that provides a similar measure to the unobserved confounder.
In the motivating example above, one choice of a proxy for natural ability would be IQ.
The proxy need not be exactly equivalent as a variable to the unobserved quantity, but it should
be as highly correlated with it as possible. We know, for example, that IQ is far from a perfect
measure of ability, but it is somewhat related to it.
Given that we have a proxy, the most obvious way to use it to mitigate our inconsistency problem
is to plug the proxy variable in as a regressor, in the hope that it acts as a good replacement for
the unobserved confounder, i.e. that it does a good job of mimicking the unobserved variable.

The plug-in approach gives the regression model

$$\log(wage) = \alpha_0 + \beta_1 educ + \alpha_2 IQ + e \tag{3}$$

This seems logical, but how do we know that it works? The rest of this handout explains the
conditions that a proxy needs to satisfy in order for an OLS regression on (3) to provide
a consistent estimator of $\beta_1$.

2 Required Conditions for a Proxy


Let us work more generally than the wage regression example, so that we have a model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^* + u \tag{4}$$

where MLR1-5 hold on this model. We have data on $y$ and $x_1$. The variable $x_2^*$ is unobserved,
but we have data on a proxy for $x_2^*$, denoted $x_2$.

2.0 Requirement 0

The first requirement is that $x_2$ should have a relationship (correlation) with $x_2^*$. In other words,
in the regression

$$x_2^* = \delta_0 + \delta_2 x_2 + v \tag{5}$$

we should have $\delta_2 \neq 0$.

The $\delta_0$ term exists in (5) because the proxy $x_2$ and $x_2^*$ may have different units,
and the $v$ exists to represent the notion that the proxy and the unobserved variable are not exactly
the same.
So far, the long and short is that if $\delta_2 = 0$, then $x_2$ cannot be a proxy for $x_2^*$. This is a somewhat
obvious, but necessary, condition.

2.1 Requirement 1

The next requirement is that the proxy $x_2$ should not belong in the main regression (4), given that
$x_1$ and $x_2^*$ are already in the regression. In other words, $x_2^*$ is the factor that directly affects $y$,
and not $x_2$.
This is a bit like the instrumental variable exclusion restriction. The only channel from the proxy
$x_2$ into $y$ is through $x_2^*$.
In the wage regression example, we have to believe that a higher IQ score in itself will not lead
to a higher wage (people don't tend to put their IQ score on their CV anyway). We can (and must)
have, however, that a higher IQ will be associated with a higher innate ability, and then the higher
ability will be associated with a higher wage.
The mathematical way of stating the above is that we require

$$E(y \mid x_1, x_2^*) = E(y \mid x_1, x_2^*, x_2) \tag{6}$$

In words, this says that the explanatory power of $x_1$ and $x_2^*$ in explaining the mean of $y$ is exactly
the same as the explanatory power of $x_1$, $x_2^*$ and $x_2$ in explaining the mean value of $y$.
That is, once we control for $x_1$ and $x_2^*$, the proxy $x_2$ cannot improve our prediction of the mean value of $y$.

Footnote 1 (on the proxy regression (5)): You'll see why we skipped the index 1 later.
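NOTE: by substituting $y$ from (4) into both sides of (6), we obtain

$$E(\beta_0 + \beta_1 x_1 + \beta_2 x_2^* + u \mid x_1, x_2^*) = E(\beta_0 + \beta_1 x_1 + \beta_2 x_2^* + u \mid x_1, x_2^*, x_2)$$
$$\Updownarrow$$
$$\beta_0 + \beta_1 x_1 + \beta_2 x_2^* + E(u \mid x_1, x_2^*) = \beta_0 + \beta_1 x_1 + \beta_2 x_2^* + E(u \mid x_1, x_2^*, x_2)$$
$$\Updownarrow$$
$$E(u \mid x_1, x_2^*) = E(u \mid x_1, x_2^*, x_2)$$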
Note that since MLR1-5 hold on (4), $E(u \mid x_1, x_2^*) = 0$, and therefore the above derivation shows
us that

$$E(u \mid x_1, x_2^*, x_2) = 0$$

In other words, requirement 1 ensures that the error term $u$ is uncorrelated not only with $x_1$ and
$x_2^*$, but also with $x_2$.
By the tower property (footnote 2), the above result implies that $E(u \mid x_1, x_2) = 0$.
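The tower-property step can be written out explicitly. The following is a short worked version, using the notation assumed in this handout ($x_2^*$ for the unobserved variable, $x_2$ for the proxy): conditioning first on the larger information set $(x_1, x_2^*, x_2)$ and then averaging over $x_2^*$ gives the expectation conditional on $(x_1, x_2)$ only.

```latex
\begin{align*}
E(u \mid x_1, x_2)
  &= E\bigl[\, E(u \mid x_1, x_2^*, x_2) \,\bigm|\, x_1, x_2 \,\bigr]
     && \text{(tower property)} \\
  &= E\bigl[\, 0 \,\bigm|\, x_1, x_2 \,\bigr]
     && \text{(requirement 1 with MLR1-5)} \\
  &= 0 .
\end{align*}
```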

2.2 Requirement 2

Now go back and consider the proxy regression (5). The second requirement related to the proxy
is that once $x_2$ is controlled for, the mean value of $x_2^*$ shouldn't depend upon $x_1$.
In other words, $x_2^*$ should have no correlation with $x_1$, once $x_2$ is partialled out. Another way of
seeing this is that if we considered

$$x_2^* = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + v$$

then $\delta_1 = 0$ should hold, so we get back to (5).
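In mathematics, this requirement is given by

$$E(x_2^* \mid x_1, x_2) = E(x_2^* \mid x_2) \tag{7}$$

Similar to above, substituting (5) into (7) gives

$$E(\delta_0 + \delta_2 x_2 + v \mid x_1, x_2) = E(\delta_0 + \delta_2 x_2 + v \mid x_2)$$
$$\Updownarrow$$
$$\delta_0 + \delta_2 x_2 + E(v \mid x_1, x_2) = \delta_0 + \delta_2 x_2 + E(v \mid x_2)$$
$$\Updownarrow$$
$$E(v \mid x_1, x_2) = E(v \mid x_2)$$

Since $E(v \mid x_2) = 0$, we thus obtain $E(v \mid x_1, x_2) = 0$:
$v$ should be uncorrelated not just with $x_2$, but also with $x_1$.

Footnote 2: Tower property (law of iterated expectations).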

To cast this requirement in the wage regression example, we are saying that

$$E(abil \mid educ, IQ) = E(abil \mid IQ) = \delta_0 + \delta_2 IQ$$


That is, the expected value of ability changes only with IQ, and not with education. Some
may argue that this is not reasonable, but we may want to think of this ability variable as innate,
as something natural that cannot be taught or learnt. I will leave it up to you to decide whether
you believe such a variable exists.
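To see what is at stake in requirement 2, here is a minimal simulation sketch of the plug-in regression when $\delta_1$ is and is not zero. All numerical values and the function name `plug_in_slope_on_x1` are hypothetical, chosen only for illustration. Substituting $x_2^* = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + v$ into (4) shows that the coefficient on $x_1$ in the plug-in regression becomes $\beta_1 + \beta_2 \delta_1$, so it equals $\beta_1$ only when $\delta_1 = 0$.

```python
import numpy as np

def plug_in_slope_on_x1(delta1, n=200_000, seed=2):
    """Slope on x1 from regressing y on (1, x1, x2), where x2 is the proxy.

    delta1 controls whether requirement 2 holds (delta1 = 0) or fails.
    All numerical values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    beta0, beta1, beta2 = 1.0, 0.08, 0.05
    delta0, delta2 = -5.0, 0.05

    x2 = rng.normal(100.0, 15.0, n)                            # observed proxy
    x1 = 12.0 + 0.1 * (x2 - 100.0) + rng.normal(0.0, 2.0, n)   # observed regressor
    v = rng.normal(0.0, 1.0, n)                                # independent of x1, x2
    x2_star = delta0 + delta1 * x1 + delta2 * x2 + v           # unobserved variable
    u = rng.normal(0.0, 0.3, n)
    y = beta0 + beta1 * x1 + beta2 * x2_star + u               # model (4)

    X = np.column_stack([np.ones(n), x1, x2])                  # plug-in regression
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

slope_ok = plug_in_slope_on_x1(delta1=0.0)    # requirement 2 holds
slope_bad = plug_in_slope_on_x1(delta1=0.5)   # requirement 2 fails
print(f"delta1 = 0.0: slope on x1 = {slope_ok:.4f} (beta1 = 0.08)")
print(f"delta1 = 0.5: slope on x1 = {slope_bad:.4f} (beta1 + beta2*delta1 = 0.105)")
```

In the wage example, a nonzero $\delta_1$ would mean that education still predicts ability after IQ is controlled for, in which case the plug-in coefficient on educ mixes the return to education with part of the effect of ability.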

3 Consistency of OLS with a correct proxy


To finish, let's see why a proxy variable satisfying the three requirements above will allow for
consistent estimation of $\beta_1$.
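In the first case, where all the requirements are met, recall that we have

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^* + u \tag{8}$$

and

$$x_2^* = \delta_0 + \delta_2 x_2 + v \tag{9}$$

Substituting (9) into (8), we obtain

$$y = \beta_0 + \beta_1 x_1 + \beta_2 \delta_0 + \beta_2 \delta_2 x_2 + \beta_2 v + u$$

which leaves

$$y = \alpha_0 + \beta_1 x_1 + \alpha_2 x_2 + e \tag{10}$$

where
1. $\alpha_0 = \beta_0 + \beta_2 \delta_0$
2. $\alpha_2 = \beta_2 \delta_2$
3. $e = \beta_2 v + u$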
Note that now we cannot identify $\beta_2$, the marginal effect of $x_2^*$ on $y$, but in many settings, identifying
$\alpha_2$ will be more interesting anyway: in the wage regression example, we may be more interested
in the marginal effect of one more IQ point, rather than the marginal effect of one more unit of
innate ability, which is a vague notion at best.
Consider running OLS on (10). For unbiased estimation of $\beta_1$ and $\alpha_2$, we need $E(e \mid x_1, x_2) = 0$.
Using the two results derived above, $E(u \mid x_1, x_2) = 0$ (from requirement 1) and
$E(v \mid x_1, x_2) = 0$ (from requirement 2), observe that

$$E(e \mid x_1, x_2) = \beta_2 E(v \mid x_1, x_2) + E(u \mid x_1, x_2) = 0 + 0 = 0$$

Therefore we have found a way to obtain an unbiased estimator for $\beta_1$.
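The result can be checked numerically. The sketch below simulates (8) and (9) with a proxy that satisfies all three requirements, then compares the short regression (proxy omitted) with the plug-in regression (10). The parameter values and the function name `simulate_valid_proxy` are assumptions made purely for illustration.

```python
import numpy as np

def simulate_valid_proxy(n=200_000, seed=1):
    """Compare OLS without and with a valid proxy; all values are illustrative."""
    rng = np.random.default_rng(seed)
    beta0, beta1, beta2 = 1.0, 0.08, 0.05
    delta0, delta2 = -5.0, 0.05

    x2 = rng.normal(100.0, 15.0, n)                            # observed proxy
    v = rng.normal(0.0, 1.0, n)                                # E(v | x1, x2) = 0
    x2_star = delta0 + delta2 * x2 + v                         # equation (9)
    x1 = 12.0 + 0.1 * (x2 - 100.0) + rng.normal(0.0, 2.0, n)   # correlated with x2*
    u = rng.normal(0.0, 0.3, n)                                # E(u | x1, x2*, x2) = 0
    y = beta0 + beta1 * x1 + beta2 * x2_star + u               # equation (8)

    def ols(*regressors):
        # OLS of y on a constant plus the given regressors
        X = np.column_stack([np.ones(n), *regressors])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return coef

    short = ols(x1)         # omits both x2* and the proxy
    plug_in = ols(x1, x2)   # regression (10)
    return {
        "short_b1": short[1],
        "plug_in_a0": plug_in[0],
        "plug_in_b1": plug_in[1],
        "plug_in_a2": plug_in[2],
        "alpha0": beta0 + beta2 * delta0,
        "alpha2": beta2 * delta2,
    }

result = simulate_valid_proxy()
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed values the plug-in slope on $x_1$ should sit close to $\beta_1 = 0.08$, and the coefficient on the proxy close to $\alpha_2 = \beta_2 \delta_2 = 0.0025$, while the short regression drifts towards 0.089. Note that $\beta_2$ itself is not recovered, only the product $\beta_2 \delta_2$.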
