Vous êtes sur la page 1sur 52

Preliminary

Simple and Bias-Corrected Matching Estimators


for Average Treatment Effects

Alberto Abadie Harvard University and NBER


Guido Imbens UC Berkeley and NBER

This version: June 2002

We wish to thank Whitney Newey, Jack Porter, Jim Powell, Paul Rosenbaum, Ed Vytlacil, and
participants at seminars at Berkeley, Brown, Chicago, Harvard/MIT, McGill, Princeton, Yale,
and the 2001 EC2 conference in Louvain for comments, and Don Rubin for many discussions on
these topics.
Simple and Bias-Corrected Matching Estimators
for Average Treatment Effects
Alberto Abadie and Guido W. Imbens
Abstract
In this paper we analyze large sample properties of matching estimators. Such estimators have
found wide applicability in evaluation research despite the fact that their large sample properties
have not been established in many cases. We show that simple matching estimators have biases
in large samples that do not vanish under the standard normalization if the number of continuous
covariates is at least four, and in fact dominate the variance if the dimension of the continuous
covariates is at least five. In addition, we show that simple matching estimators do not reach
the semiparametric efficiency bound, although the efficiency loss is typically small. We then
propose a bias-corrected matching estimator that has no asymptotic bias. In simulations the
bias-corrected matching estimator performs well compared to both simple matching estimators
and to regression estimators in terms of bias and root-mean-squared-error.

Alberto Abadie Guido Imbens


John F. Kennedy School of Government Department of Economics
Harvard University University of California at Berkeley
79 John F. Kennedy Street 549 Evans Hall #3880
Cambridge, MA 02138 Berkeley, CA 94720-3880
and NBER and NBER
alberto abadie@harvard.edu imbens@econ.berkeley.edu
1. Introduction

Estimation of average treatment effects is an important goal of much evaluation research. Often,
analyses are based on the assumption that assignment to treatment is unconfounded, that is,
based on observable pretreatment variables only, and that there is sufficient overlap in the distri-
butions of the pretreatment variables (Barnow, Cain and Goldberger, 1980; Heckman and Robb,
1984; Rubin 1977). Under these assumptions one can estimate the average effect within sub-
populations defined by the pretreatment variables by differencing average treatment and control
outcomes. The population average treatment effect can then be estimated by averaging these
conditional average treatment effects over the appropriate distribution of the covariates. Meth-
ods implementing this in parametric forms have a long history. See for example Cochran and
Rubin (1973), Barnow, Cain, and Goldberger (1980), Rosenbaum and Rubin (1985), Rosenbaum
(1995), and Heckman and Robb (1984). Recently a number of nonparametric implementations
of this idea have been proposed. Hahn (1998) calculates the efficiency bound and proposes an
asymptotically efficient estimator based on nonparametric series estimation. Heckman, Ichimura,
and Todd (1997) and Heckman, Ichimura, Todd and Smith (1998) focus on the average effect
on the treated and consider estimators based on local linear regression. Robins and Rotnitzky
(1995) and Robins, Rotnitzky, and Zhao (1995), in the related setting of missing data problems,
propose efficient estimators that combine weighting and regression adjustment. Hirano, Imbens,
and Ridder (2000) propose an estimator that weights the units by the inverse of their assignment
probabilities, and show that nonparametric series estimation of this conditional probability, la-
beled the propensity score by Rosenbaum and Rubin (1983a), leads to an efficient estimator.
Ichimura and Linton (2001) consider higher order expansions of such estimators to analyze op-
timal bandwidth choices.
Here we study an estimator that matches each treated unit to a fixed number of control units
and each control unit to a fixed number of treated units. Various other versions of matching
estimators have been proposed in the literature. For example, Rosenbaum (1989, 1995), Gu and
Rosenbaum (1993) Rubin (1973a,b), and Dehejia and Wahba (1999) focus on the case where
only the treated units are matched, and with typically many controls relative to the number of

1
treated so that each control is used as a match only once. In contrast, we match all units, treated
as well as controls, so that the estimand is identical to the average treatment effect that is the
focus of the Hahn (1998), Robins and Rotnitzky (1995), and Hirano, Imbens and Ridder (2001)
estimators. In addition,as we match both treated and controls, we allow a unit to be used as a
match more than once. Matching estimators have great intuitive appeal as they do not require
the researcher to set any smoothing parameters other than the number of matches. However,
as we show, somewhat surprisingly given the widespread popularity of matching estimators, the
large sample properties of the simple matching estimators with a fixed number of matches are
not necessarily very attractive. First, if the dimension of the continuous pre-treatment variables
is equal to four, the simple matching estimator has a bias that does not vanish in the asymptotic
distribution under the standard normalization, and if the dimension is five or more the bias
will in fact dominate the large sample variance. This crucial role for the dimension also arises
in nonparametric differencing methods in regression models (e.g., Honore 1992, Yatchew 1999,
Estes and Honore 2001). Second, even if the dimension of the covariates is low enough for the
bias to vanish asymptotically, the simple matching estimator with a fixed number of matches is
not efficient.
We also consider combining matching with additional bias reductions based on a nonpara-
metric extension of the regression adjustment proposed in Rubin (1973b) and Quade (1982).
This regression adjustment removes the bias of the estimators, without affecting the variance.
Compared to estimators based on regression adjustment without matching (e.g., Hahn 1998;
Heckman, Ichimura, and Todd 1997; Heckman, Ichimura, Smith, and Todd 1998) or estimators
based on weighting by the inverse of the propensity score, (Hirano, Imbens, and Ridder 2000)
the proposed estimators are more robust as the matching ensures that they do not rely for con-
sistency on an accurate approximation to either the regression function or the propensity score
in finite samples.
Since the bias correction does not affect the variance, the bias corrected matching estimators
with a fixed number of matches still do not reach the semiparametric efficiency bound. Only
if the number of matches increases with the sample size do the matching estimators reach the

2
efficiency bound. However, as we shall show, the efficiency loss from using only a small number of
matches is very modest, and it can often be bounded. An advantage of the matching methods is
that the variance can be estimated conditional on the smoothing parameter (i.e., the number of
matches), whereas in the regression estimators often only estimators for the limiting variance are
available. An estimator of the variance that conditions on the value of the smoothing parameter
may be more accurate than one based on the limiting form of the variance. In simulations we
find that the proposed estimators perform well with relatively few matches.
In the next section we introduce the notation and define the estimators. In Section 3 we
discuss the large sample properties of matching estimators. In section 4 we analyze bias cor-
rections. In Section 5 we apply the estimators to data from an employment training program
previously analyzed by Lalonde (1986), Heckman and Hotz (1989), and Dehejia and Wahba
(1999). In Section 6 we carry out a small simulation study to investigate the properties of the
various estimators. Section 7 concludes. The appendix contains proofs of the theorems.

2. Notation and Basic Ideas

2.1. Notation

We are interested in estimating the average effect of a binary treatment on some outcome. Let
(Y (0), Y (1)) denote for a typical unit the two potential outcomes given the control treatment and
given the active treatment respectively. The variable W , for W {0, 1} indicates the treatment
received. We observe W and the outcome for this treatment,

Y (0) if W = 0,
Y =
Y (1) if W = 1,

as well as a vector of pretreatment variables X. The estimand of interest is the population


average treatment effect

= E[Y (1) Y (0)],

or the average effect for the treated

t = E[Y (1) Y (0)|W = 1].

3
See for some discussion of the various estimands Rubin (1977) and Heckman and Robb (1984).
We assume that assignment to treatment is unconfounded (Rosenbaum and Rubin 1983a),
and that the probability of assignment is bounded away from zero and one.

Assumption 1: Let X be a random vector of covariates distributed on Rk with compact and


convex support X . For almost every x X ,
(i) W is independent of (Y (0), Y (1)) conditional on X = x;
(ii) c < Pr(W = 1|X = x) < 1 c, for some c > 0.

The dimension of X, denoted by k, will be seen to play an important role in the proper-
ties of the matching estimators. We assume that all covariates have continuous distributions.
(Discrete covariates can be easily dealt with by analyzing estimating average treatment effects
within subsamples defined by their values. The number of discrete covariates does not affect the
asymptotic properties of the estimators.) The combination of the two conditions in Assump-
tion 1 is referred to as strong ignorability (Rosenbaum and Rubin 1983a). These conditions are
strong, and in many cases may not be satisfied. In many studies, however, researchers have
found it useful to consider estimators based on these or similar conditions. See, for example,
Cochran (1968), Cochran and Rubin (1973), Rubin (1973a,b), Barnow, Cain, and Goldberger
(1980), Heckman and Robb (1984), Rosenbaum and Rubin (1984), Ashenfelter and Card (1985),
Lalonde (1986), Card and Sullivan (1988), Manski, Sandefur, McLanahan, and Powers (1992),
Robins and Rotnitzky (1995), Robins, Rotnitzky, and Zhao (1995), Rosenbaum (1995), Heck-
man, Ichimura, and Todd (1997), Hahn (1998), Heckman, Ichimura, Smith, and Todd (1998),
Lechner (1998), Dehejia and Wahba (1999), Hotz, Imbens, and Mortimer (1999) and Hirano,
Imbens, and Ridder (2000). If the first condition, unconfoundedness, is deemed implausible in a
given application, methods allowing for selection on unobservables such as instrumental variables
(e.g., Heckman and Robb 1984; Angrist, Imbens and Rubin 1996, Abadie 2002), sensitivity anal-
yses (Rosenbaum and Rubin, 1983b), or bounds calculations (Manski 1990) may be considered.
See for general discussion of such issues the surveys in Heckman and Robb (1984), Angrist and
Krueger (2000) and Heckman, Lalonde, and Smith (2000). The importance of second part of the
assumption, the restriction on the probability of assignment has been discussed in Rubin (1977),

4
Heckman, Ichimura, and Todd (1997), Heckman, Ichimura, Smith and Todd (1998), and Dehejia
and Wahba (1999). Compactness and convexity of the support of the covariates are convenient
regularity conditions.
Under Assumption 1 the average treatment effect for the subpopulation with pretreatment
variables equal to X = x, (x) = E[Y (1) Y (0)|X = x], can be estimated from the data on
(Y, W, X) because

(x) = E[Y (1) Y (0)|X = x] = E[Y |W = 1, X = x] E[Y |W = 0, X = x].

To get the average effect of interest we average this conditional treatment effect over the marginal
distribution of X:

= E[ (X)],

or over the conditional distribution to get the average effect for the treated:

t = E[ (X)|W = 1].

Next we introduce some additional notation. For x X and w {0, 1}, let (x, w) = E[Y |X =
x, W = w] and 2 (x, w) = E[(Y (x, w)))2 |X = x, W = w]. Also let w (x) = E[Y (w)|X = x]
and w2 (x) = E[(Y (w) w (x)))2 |X = x]. By Assumption 1,

w (x) = E[Y (w)|X = x] = E[Y (w)|X = x, W = w]

= E[Y |X = x, W = w] = (x, w).

Similarly, w2 (x) = 2 (x, w). Let fw (x) be the conditional density of X given W = w, and let
e(x) = Pr(W = 1|X = x) be the propensity score (Rosenbaum and Rubin 1983a).We assume
that there is a random sample of size N from the distribution of (Y, W, X) available.

Assumption 2: {(Yi , Wi , Xi )}N


i=1 are independent draws from the distribution of (Y, W, X).

P P
The numbers of control and treated units are N0 = i (1 Wi ) and N1 = i Wi respectively,
with N = N0 + N1 . Let kxk = (x0 x)1/2 , for x X be the standard vector norm.1 Given a sample,
1
Alternative norms of the form kxkV = (x0 V x)1/2 for some positive definite symmetric matrix V are also
covered by the results below, since kxkV = ((P x)0 (P x))1/2 for P such that P 0 P = V .

5
{(Yi , Xi , Wi )}N
i=1 , let jm (i) be the index j that solves
X n o
1 kXl Xi k kXj Xi k = m,
Wl =1Wi

where 1{} is the indicator function, equal to one if the expression in brackets is true and zero
otherwise. In other words, jm (i) is the index of the unit that is the m-th closest to unit i in terms
of the distance measure based on the norm k k, among the units with the treatment opposite
to that of unit i. In particular, j1 (i), sometimes for notational convenience denoted by j(i), is
the nearest match for unit i. For notational simplicity and since we only consider continuous
covariates, we ignore the possibility of ties, which do not happen with probability one. Let JM (i)
denote the set of indices for the first M matches for unit i:

JM (i) = {j1 (i), . . . , jM (i)}.

Finally, let KM (i) denote the number of times unit i is used as a match given that M matches
per unit are done:
N
X
KM (i) = 1{i JM (l)}.
l=1

In many matching methods (e.g., Rosenbaum 1995), the matching is carried out without replace-
ment, so that every unit is used as a match at most once, and KM (i) 1. However, when both
treated and control units matched it is imperative that units can be used as matches more than
once. We show below that the distribution of KM (i) is important in terms of the variance of the
estimators.

2.2. Estimators

The unit level treatment effect is i = Yi (1) Yi (0). For the units in the sample one of the
potential outcomes is observed and the other is unobserved. All estimators we consider impute the
unobserved potential outcomes in some way. The first estimator, the simple matching estimator,
uses the following estimates for the potential outcomes:


Yi X if Wi = 0,

Yi (0) = 1
Yj if Wi = 1,
M
jJM (i)

6
and
X
1

Yj if Wi = 0,
Yi (1) = M
jJM (i)

Y
i if Wi = 1.

The simple matching estimator we shall study is

1 X 
N
sm
M = Yi (1) Yi (0) . (1)
N i=1

Consider the case with a single match (M = 1). The differences Yi (1) Yi (0) and Yj (1) Yj (0)
are not necessarily independent, and in fact one will be minus the other if i is matched to
j (that is, j(i) = j) and j is matched to i (that is, j(j) = i). This procedure differs from
standard pairwise matching procedures where one constructs a number of distinct pairs, without
replacement. Matching with replacement leads to a higher variance, but produces higher match
quality, and thus typically a lower bias.
In Table 1 the simple matching estimator is illustrated for an example with four units. In
this example unit 1 is matched to unit 3, units 2 and 3 are both matched to unit 1, and unit
4 is matched to unit 2. Hence unit 1 is used as a match twice, units 2 and 3 are used as a
match once, and unit 4 is never used as a match. The estimated average treatment effect is
P4
i=1
i /4 = (2 + 5 + 2 + 0)/4 = 9/4.
We shall compare the matching estimators to covariance-adjustment or regression imputation
estimators. Let (Xi , w) be a consistent estimator of (Xi , w). Let

Yi if Wi = 0,
Yi (0) = (2)

(Xi , 0) if Wi = 1,

and


(Xi , 1) if Wi = 0,
Yi (1) = (3)
Yi if Wi = 1.

The regression imputation estimator is defined by


N
1 X 
reg
= Yi (1) Yi (0) . (4)
N i=1

7
We called this estimator the regression imputation estimator because under unconfoundedness
(Xi , w) = w (Xi ) = E[Yi (w)|Xi ], and therefore an estimate of w (Xi ) = E[Yi (w)|Xi ] is used to
impute Yi (w). The distinction between regression imputation estimators and matching estimators
is not transparent in finite samples. If (Xi , w) is estimated using a nearest neighbor estimator
with a fixed number of neighbors, then the regression imputation estimator is identical to the
matching estimator with the same number of matches. However, the regression imputation and
matching estimators differ in the way they change with the number of observations. We classify
as matching estimators those estimators which use a finite and fixed number of matches. We
classify as regression imputation estimators those for which
(Xi , w) is a consistent estimator
for (Xi , w). The estimators considered by Hahn (1998), Heckman, Ichimura and Todd (1997)
and Heckman, Ichimura, Smith and Todd (1998) are regression imputation estimators. Hahn
shows that if nonparametric series estimation is used for E[Y W |X = x], E[Y (1 W )|X], and
E[W |X], and those are used to estimate
1 (x) as W |X]/E[W
(x, 1) = E[Y |X] and
0 (x) as
(1W )|X]/E[1W
(x, 0) = E[Y |X], then the regression imputation estimator is asymptotically
efficient for , under regularity conditions.
In addition we develop a bias-corrected matching estimator where the difference within the
matches is adjusted:


Yi if Wi = 0,
1 X
Yi (0) = (Yj + (Xi , 0)
(Xj , 0)) if Wi = 1, (5)

M
jJM (i)

and
X
1

(Yj +
(Xi , 1)
(Xj , 1)) if Wi = 0,
Yi (1) = M
jJM (i) (6)

Yi if Wi = 1,

with corresponding estimator

1 X 
N
bcm
M = Yi (1) Yi (0) . (7)
N i=1

Rubin (1979) discusses such estimators in the context of matching without replacement and
linear covariance adjustment.

8
To set the stage for some of the discussion below, note that conditional on {Xi , Wi }N
i=1 , the

bias for the simple matching estimator is (under Assumption 1):


" ! #
1 X 
N   

E Yi (1) Yi (0) Yi (1) Yi (0) {Xi , Wi }N
i=1
N i=1
( )
1 X 
N M
1 X
= (1 2Wi ) (Xi , Wjm (i) ) (Xjm (i) , Wjm (i) ) . (8)
N i=1 M m=1

That is, the bias consists of terms of the form (xi , wj ) (xj , wj ), for Xi = xi , Wi = wi , Xj(i) =
xj , and Wj(i) = wj . These terms are small when xi ' xj , as long as the regression functions are
continuous. Similarly, the bias of the regression imputation estimator consists of terms of the
form (xi , wj ) E[
(xi , wj )], which are small when E[
(xi , wj )] ' (xi , wj ). On the other hand,
the bias of the bias-corrected estimator consists of terms of the form (xi , wj ) (xj , wj )
E[
(xi , wj )
(xj , wj )], which are small if either xi ' xj or E[
(xi , wj )] ' (xi , wj ). The
bias-adjusted matching estimator combines some of the bias reductions from the matching, by
comparing units with similar values of the covariates, and the bias-reduction from the regression.
Compared to only regression imputation, the bias-corrected matching estimator relies less on the
accuracy of the estimator of the regression function since it only needs to adjust for relatively
small differences in the covariates.
We are interested in the properties of the simple and bias-corrected matching estimators in
large samples, that is, as N increases, for fixed M . The properties of interest include bias and
variance. Of particular interest is the dependence of these results on the dimension of the covari-
ates. Some of these properties will be considered conditional on the covariates. In particular,
we will propose an estimator for the conditional variance of an estimator given X1 , . . . , XN
P
viewed as an estimator of the sample average conditional treatment effect (Xi )/N , where
N N
1 X 1 X
(Xi ) = ((Xi , 1) (Xi , 0)) .
N i=1 N i=1

There are two reasons for focusing on the conditional variance. First, in many cases one is
interested in the average effect for the sample at hand, rather than for the hypothetical population
this sample is drawn from especially given that the former can typically be estimated more

9
precisely. The difference between the marginal variance and the conditional variance is the
P
variance of (X) divided by the sample size, that is, the expected value of ( i (Xi )/N )2 .
This variance represents the difference between the sample distribution of the covariates and
the population. The second reason is that there is an estimator for the conditional variance
that, in the spirit of the matching estimator, does not rely on additional choices for smoothing
parameters. Estimating the unconditional variance would require estimating the variance of
(X), which, in turn, as in Hirano, Imbens and Ridder (2001), requires choices regarding the
smoothing parameters in nonparametric estimation of the conditional means and variances.

3. Simple Matching Estimators

sm
In this section we investigate the properties of the basic matching estimator M defined in (1).
Let X and W be the matrices with i-th row equal to Xi0 , and Wi , respectively. Define the two
N N matrices A1 (X, W) and A0 (X, W), with typical element

1 if i = j, Wi = 1
A1,ij = 1/M if j JM (i), Wi = 0, (9)

0 otherwise,
and

1 if i = j, Wi = 0
A0,ij = 1/M if j JM (i), Wi = 1, (10)

0 otherwise,
and define A = A1 A0 . For the example in Table 1,

1 0 0 0 0 0 1 0 1 0 1 0
1 0 0 0 0 1 0 0 1 1 0 0
A1 =
1 0 0 0 , A0 = 0
and A= .
0 1 0 1 0 1 0
0 0 0 1 0 1 0 0 0 1 0 1

Notice that for any N 1 vector V = (V1 , . . . , VN )0 :


XN  
0 KM (i)
N AV = (2Wi 1) 1 + Vi , (11)
i=1
M


where N be the N -dimensional vector with all elements equal to one. Let Y, Y(0), Y(1), Y(0),

and Y(1) be the matrices with i-th row equal to Yi , Yi (0), Yi (1), Yi (0), and Yi (1), respectively.

10
Then


Y(1) = A1 Y(1) = A1 Y,

and


Y(0) = A0 Y(0) = A0 Y.

sm
We can now write the estimator M as
 .  .
sm
M
= 0N Y(1)
Y(0) N = 0N A1 Y 0N A0 Y N = 0N AY/N.

sm
Alternatively, from equation (11), M can be written as:
N  
sm 1 X KM (i)
M = (2Wi 1) 1 + Yi . (12)
N i=1 M

Define i = Yi (Xi , Wi ), so that

E[|X = x, W = w] = 0,

and

Var(|X = x, W = w) = 2 (x, w).

Let (X, W) and be the N 1 vectors with i-th element equal to (Xi , Wi ) and i , respectively.
Similarly, 0 (X) and 1 (X) are the N 1 vectors with i-th element equal to 0 (Xi ) and 1 (Xi ),
respectively. Then

sm
M = 0N A(X, W)/N + 0N A/N.

Using the fact that A1 (X, W) = A1 1 (X) and A0 (X, W) = A0 0 (X) we can write this as
 
sm
M = 0N 1 (X) 0 (X) /N + 0N A/N

+ 0N (A1 IN )1 (X)/N 0N (A0 IN )0 (X)/N. (13)

The sum of the last two terms is the conditional bias in equation (8). If the matching is exact,
and Xi = Xjm (i) , then the last two terms are equal to zero. In general, they are not and

11
their expectation constitutes the bias, which will be analyzed in Section 3.1. The first two
terms determine the large sample variance of the estimator. The first term depends only on the
covariates X and has variance equal to the variance of the treatment effect, E[( (X) )2 ]/N .
The variance of the second term is the conditional variance of the estimator. We will analyze
this term in Section 3.2.

3.1. Bias

The bias of the simple matching estimator is equal to the expectation of the expression in equation
(8). To investigate this term further, consider the i-th element of sample average on the right
hand side of equation (8). Suppose Wi = 0, then the i-th element is equal to

1 X 
M
1 (Xjm (i) ) 1 (Xi ) .
M m=1

Thus the components of the bias consist of the difference between expected value of Yi (1) given
Xi , and the average of the expected values of the matches. To investigate the nature of this bias
we expand the difference 1 (Xjm (i) ) 1 (Xi ) around Xi :

1
1 (Xjm (i) ) 1 (Xi ) = (Xjm (i) Xi )0 (Xi )
x
1 2 1
+ (Xjm (i) Xi )0 (Xi )(Xjm (i) Xi ) + O(kXjm (i) Xi k3 ).
2 xx0

In order to study the components of the bias it is therefore useful to consider the distribution of
the matching discrepancy, the difference between the value of the covariate Xi and the value of
the covariate for its m-th nearest match, Xjm (i) .
Fix the covariate value at X = z, and consider a sample X1 , ..., XN from some distribution
over the compact support of X (with density f (x) and distribution function F (x)). Now, consider
the closest match to z in the sample. Let

j1 = argminj=1,... ,N kXj zk,

and let U1 = Xj1 z be the matching discrepancy. We are interested in the moments of the
difference U1 . More generally, we are interested in the moments of the m-th closest match

12
discrepancy: Um = Xjm z, where Xjm is the m-th closest match to z from the random sample
of size N .

Theorem 1: (Matching Discrepancy)


Suppose that f (z) > 0 and that f is differentiable in a neighborhood of z. Then,
(i)

Um = Op (N 1/k ),

(ii)
  !2/k  
mk + 2 1 k/2 1 f 1 1
E [Um ] = f (z)  (z) 2/k + o ,
k (m 1)!k 1 + k2 f (z) x N N 2/k

(iii)
  !2/k  
0 mk + 2 1 k/2 1 1
E[Um Um ] = f (z)  Ik + o ,
k (m 1)!k 1 + k2 N 2/k N 2/k

and (iv)

E[kUm k3 ] = O N 3/k ,
R
where (y) = 0
et ty1 dt (for y > 0) is Eulers Gamma Function.

The theorem states that the order of the matching discrepancy increases with the number
of continuous covariates (result (i) of the theorem). As the number of continuous covariates
increases, it becomes more difficult to find close matches. In the proof of the theorem, it is shown
that the leading term in the stochastic expansion of the matching discrepancy has a rotation
invariant distribution around the origin. As a result, the order of the first and second moment
(O(N 2/k )) is lower than the order of the discrepancy itself (Op (N 1/k )), but still increasing in
the number of covariates (results (i) and (ii) of the theorem). Another implication of the rotation
invariant property is that the leading term in the stochastic expansion of covariance matrix of
the matching discrepancy is a scalar matrix (result (iii) of the theorem). The next theorem gives
the result for the bias of the matching estimator.

13
Theorem 2: (Bias)
Under assumptions 1 and 2 and if 1 (x) and 0 (x) have bounded third derivatives, then the bias
for the simple matching estimator is
M   !
sm sm 1 X mk + 2 1 1
B = E[b
M ] =
M m=1 k (m 1)!k N 2/k
!2/k
(1 p) Z k/2
 2 
1 f1 1 1 1
f1 (x)  (x) (x) + tr (x) f0 (x)dx
p2/k 1 + k2 f1 (x) X 0 X 2 X 0 X
!2/k
Z k/2
 2

p 1 f0 0 1 0
f0 (x)  (x) (x) + tr (x) f1 (x)dx
(1 p)2/k 1 + k2 f0 (x) X 0 X 2 X 0 X
 
1
+ o .
N 2/k

Theorem 2 shows that order of the bias of the simple matching estimator is O(N 2/k ). Con-
sider the distribution of N 1/2 (sm
M ). The order of the bias, normalized by N 1/2 , is N 1/22/k ,
which vanishes asymptotically if k 3. If k = 4, the normalized bias converges, and the asymp-
totic distribution will have a non-zero bias. If the number of continuous covariates is larger than
4, the bias will in large samples dominates the asymptotic variance. This role for the dimen-
sion of the covariates also arises in other applications of nonparametric estimation. See Honore
(1992), Yatchew (1999), and Estes and Honore (2001).
The simple matching estimator can easily be modified to estimate the average treatment
effect for the treated:
PN
Wi (Yi (1) Yi (0)) 1 X
smt
M = i=1 PN = Yi Yi (0).
i=1 W i N 1
Wi =1

1/2
In that case the bias will disappear under the standard normalization (N1 ) if treated and
controls are sampled separately and the number of controls increases sufficiently fast. The next
theorem gives the formal result.

Theorem 3: (Bias Treated)


Suppose that N1 = O(N0s ), with s < 4/k. Then the bias of the simple matching estimator for the
1/2
average treatment effect on the treated, normalized by N1 , vanishes asymptotically.

14
3.2. Variance

sm
In this section we investigate the variance of the simple matching estimator M . Consider
the representation of the estimator in (13). We investigate the variance of the first two terms.
Although the last two terms can contribute to the variance, we can only use the variance to
construct valid confidence intervals when these bias terms disappear in large samples and hence
we ignore them for the variance calculations. The first of the remaining two terms has a simple
variance, E[( (X) )2 ]. We therefore focus in this section on the variance of the second term,
0N A/N . Conditional on X and W, the variance of is

sm 1 0
Var(
M |X, W) = Var(0N A/N |X, W) = 2
N AA0 N , (14)
N
where
h i
= E 0 |X, W ,

a diagonal matrix with the ith diagonal element equal to the conditional variance of Yi given Xi
and Wi :

ii = 2 (Xi , Wi ).

Note that (14) gives the exact variance, not relying on large sample approximations. Using the
representation of the simple matching estimator in equation (12), we obtain:
N  2
sm 1 X KM (i)
Var(
M |X, W) = 2 1+ 2 (Xi , Wi ).
N i=1 M

The unconditional variance is:


( " 2 # )
sm 1 KM (i)
Var(
M )= E 1+ 2 (Xi , Wi ) + Var ( (X)) .
N M

This result demonstrates the critical role of the number of times a unit is used as a match for the
variance. Next theorem along with Markovs Inequality, and finite fourth moment for Y implies
root-N consistency of the simple matching estimator in absence of the bias term.

Theorem 4: The moments of KM (i) are bounded uniformly in N .

Asymptotic normality follows from applying the Linderberg-Feller central limit theorem.

15
3.3. Estimating the Variance

Estimating the conditional variance 0N AA0 N /N is complicated by the fact that it involves the
conditional variances 2 (x, w). We consider two approaches to estimating the variance. First,
can estimate the conditional variances 2 (x, w) consistently, first using nonparametric regression
to obtain (x, w), and then using nonparametric regression again to obtain 2 (x, w). Although
this leads to a consistent estimator for the conditional variance, it would require exactly the
type of nonparametric regression that the simple matching estimator allows one to avoid. We
therefore do not use this approach. Instead, we consider an approach that estimates the variance
consistently using matching estimators for the conditional variances that are themselves not
consistent, following an approach developed in Abadie and Imbens (2001) for pairwise randomized
experiments.
The conditional variance of the average treatment effect estimator depends on the unit-level
variances 2 (x, w) only through an average. To estimate these unit-level variances we use the
matching approach developed by Abadie and Imbens (2001) for pairwise randomized experiments.
This can be interpreted as a nonparametric estimator for 2 (x, w) with a fixed bandwidth, where
instead of the original matching of treated to control units, we now match treated units with
treated units and control units with control units. This leads to an approximately unbiased
estimate of 2 (x, w), although not a consistent one. However, the average of these inconsistent
variance estimators is consistent for the average of the variances. Suppose we have two pairs i
and j with the same covariates, Xi = Xj = x and the same treatment Wi = w, and consider the
squared difference between the within-pair differences:
h 2 i

E Yi Yj Xi = Xj = x, Wi = w = 2 2 (x, w).

In that case we can estimate the variance 2 (Xi , Wi ) as


2 (Xi , Wi ) = (Yi Yj )2 /2. This esti-
mator is unbiased, but it is not consistent as its variance does not go to zero with the sample
sm
size. However, this is not necessary for the estimator for the normalized variance of M to be
consistent.
In practice it may not be possible to find different pairs with the same value of the covariates.

16
Hence let us consider the nearest pair to pair i by solving

l(i) = argminl:l6=i,wi =wl ||Xi Xl ||,

and let

1 2
2 (Xi , Wi ) =
Yi Yl(i) ,
2

be an estimator for the conditional variance 2 (Xi , Wi ).


with ith
We use these pair-level variance estimates to estimate as the diagonal matrix
2 (Xi , Wi ). Uniform convergence of the pair-level bias ensures that
diagonal element equal to
the bias disappears as the sample size increases. Note that the pair-level variance estimates
2 (Xi , Wi ) are not all independent as the same pair may get used as a match more than once.

P
Nevertheless, as long as the same pair does not get used too often, that is as long as N
i=1 1{l(i) =

k} remains bounded for all k, the averaging implies the following theorem.

Theorem 5: Let A be the matrix defined in (9). Then

0 N /N 0N AA0 N /N = op (1).
0N AA

3.4. Efficiency

To compare the efficiency of the estimator considered here to previously proposed estimators
and in particularly the efficiency bound calculated by Hahn (1998), it is useful to go beyond the
conditional variance and compute the unconditional variance. In general the key to the efficiency
properties of the matching estimators is the distribution of KM (i), the number of times each unit
is used as a match. It is difficult to work out the limiting distribution of this variable for the
general case.2 Here we investigate the form of the variance for the special case with a scalar
covariate and a general M .
2
The key is the second moment of the volume of the catchment area AM (i), defined as the subset of X such
that each observation, j, with Wj = 1 Wi and Xj AM (i) is matched to i. In the single match case with M = 1
these objects are studied in statistical geometry where they are known as Poisson-Voronoi tesselations (Moller,
1994). The variance of the volume of such objects under uniform f0 (x) and f1 (x), normalized by the mean, has
been worked out numerically for the one, two, and three dimensional cases.

17
Theorem 6: Suppose k = 1. Then,
 
sm 12 (Xi ) 02 (Xi )
N Var(
M ) = E + + Var( (Xi ))
e(Xi ) 1 e(Xi )
    
1 1 2 1 2
+ E e(Xi ) 1 (Xi ) + (1 e(Xi )) 0 (Xi ) + o(1).
2M e(Xj ) 1 e(Xj )

Since the semiparametric efficiency bound for this problem is


 2 
1 (Xi ) 02 (Xi )
E + + Var( (Xi )),
e(Xi ) 1 e(Xi )

(Hahn 1998), the matching estimator is not efficient in general. However, the efficiency loss
disappears as one increases the number of matches. In practice the efficiency loss from using two
or three matches is very small. For example, the variance with a single match is less than 50%
higher than the variance of the efficient estimator, and with three matches the variance is less
than 17% higher.

4. Bias Corrected Matching

bcm
In this section we analyze the properties of the bias corrected matching estimator M . First we
introduce an infeasible version of the bias corrected estimator:


Yi if Wi = 0,
1 X
Y i (0) = (Yj + (Xi , 0) (Xj , 0)) if Wi = 1, (15)

M
jJM (i)

and
X
1

(Yj + (Xi , 1) (Xj , 1)) if Wi = 0,
Y i (1) = M (16)
jJM (i)

Yi if Wi = 1,

with corresponding estimator


N
ibcm 1 X
M = (Y (1) Y i (0)) . (17)
N i=1 i

The bias correction in the infeasible bias-corrected matching estimator uses the actual, rather
than the estimated regression functions. Using these it is obvious that the correction removes

18
all the bias. The correction term solely depends on the covariates and the treatment indicators,
so that conditional on X and W it has the same variance as the simple matching estimator. It
is therefore a clear improvement over the simple matching estimator in terms of mean-squared-
error.

Theorem 7: (Infeasible Bias Corrected Matching Estimator versus Simple Matching


Estimator)
ibcm
(i) E[
M ] = ,
ibcm sm
(ii) Var(
M |X, W) = Var(
M |X, W).

Proof: This follows immediately from the discussion. The conditional variance is not affected
by the addition of the bias correction terms that depend only on X and W. 
The second key result concerns the difference between the infeasible and feasible bias-corrected
matching estimator. Given sufficient smoothness the difference between the infeasible and feasible
bias-corrected matching estimators is of sufficiently low order that it can be ignored in large
sample approximations.
Following Newey (1995), we use a series estimator for the two regression functions (, 0) and
(, 1), with K terms. As the basis we use power series. Let = (1 , ..., k ) be a multi-index of
P
dimension k, that is, a k-dimensional vector of non-negative integers, with || = ki=1 i , and let
x = x1 1 . . . xk k . Consider a series {(r)}
r=1 containing all distinct such vectors and such that

|(r)| is nondecreasing. Let pr (x) = x(r) , where pK (x) = (p1 (x), ..., pK (x))0 . The nonparametric
series estimator of the regression function (, w) is given by:
!
X X
(x, w) = pK (x)0 pK (Xi )pK (Xi )0 pK (Xi )Yi ,
Wi =w Wi =w

where () denotes a generalized inverse.

Theorem 8: (Infeasible versus Feasible Bias Corrected Matching Estimator)


Suppose
(i) The support of X is X Rk , the Cartesian product of compact intervals, with absolutely
continuous density fX (x) that is bounded and bounded away from zero on X .

19
(ii) The first two moments of Y (1) and Y (0) given X = x are finite for all x X .
(iii) K 4 /N 0,
(iv) There is a C such that for each multi-index the -th partial derivative of (x, w) exist for
w = 0, 1 and are bounded by C || . Then

 d
ibcm
N 1/2 M bcm
M 0.

Thus, the feasible bias corrected matching estimator has the same normalized variance as the
infeasible bias corrected matching estimator.

5. An Application to the the Evaluation of a Labor Market Program

In this section we apply the estimators to a subset of the data first analyzed by Lalonde (1986)
and subsequently by Heckman and Hotz (1989), Dehejia and Wahba (1999) and Smith and Todd
(2001). We use data from a randomized evaluation of a job training program and a subsample
from the Panel Study of Income Dynamics. Using the experimental data we obtain an unbiased
estimate of the average effect of the training. We then see how well the non-experimental
matching estimates compare using the experimental trainees and the controls from the PSID.
Given the size of the trainee group and its difference from the PSID sample, and in line with
previous studies using these data, we focus on the average effect for the treated and therefore
only match the treated units.
Table 2 presents summary statistics for the three groups. The first two columns present the
summary statistics for the experimental trainees. The second pair of columns presents the results
for the experimental controls. The third pair of columns presents summary statistics for the non-
experimental control group constructed from the PSID. The last two columns present t-statistics
for the hypothesis that the population averages for the trainees and the experimental controls,
and for the trainees and the PSID controls, respectively, are zero. Note the large differences
in background characteristics between the trainees and the PSID sample. This is what makes
drawing causal inferences from comparisons between the PSID sample and the trainee group a
tenuous task. From the last two rows we can obtain an unbiased estimate of the effect of the

20
training on earnings in 1978 by comparing the averages for the trainees and the experimental
controls, 6.35 4.55 = 1.80 (t-stat. 2.7).
Table 3 presents estimates of the causal effect of training on earnings using various matching
and regression adjustment estimators. The top part of the table reports estimates for the ex-
perimental data (experimental trainees and experimental controls), and the bottom part reports
estimates based on the experimental trainees and the PSID controls. The first set of rows in each
case reports matching estimates, based on a number of matches including 1, 4, 16, 64 and 2490.
The matching estimates include simple matching with no bias adjustment, and bias-adjusted
matching. For the bias adjustment the regression uses all nine covariates linearly, but does not
include higher order terms. The regression function is estimated using on the matched control
units (185 for the single match case, and in general 185 M , but potentially with many control
units used more than once). The alternative is to use all control units, whether matched or not,
in this regression. The benefit of doing so is that the variance of the regression estimates would
be reduced. The potential cost is an increase in the bias if the regression function is not in fact
linear, exacerbated by the use of observations away from the region where we need to estimate
the regression function. Note that since we only match the treated units, there is no need to
estimate the regression function for the trainees.
The next part of the table reports estimates based on linear regression with no controls,
all covariates linearly and all covariates with quadratic terms and a full set of interactions. The
experimental estimates range from 1.17 (regression using the matched observations, with a single
match) to 2.27 (quadratic regression). The non-experimental estimates have a much wider range,
from -15.20 (simple difference) to 3.26 (quadratic regression). Using a single match, however,
there is little variation in the estimates, ranging only from 2.09 to 2.56. The regression adjusted
matching estimator does not vary much with the number of matches, with estimates of 2.45
(M = 1), 2.51 (M = 4), 2.48 (M = 16), and 2.26 (M = 16), and only with M = 2490 does the
estimate deteriorate to 0.84. The matching estimates all use the identity matrix as the weight
matrix, after normalizing the covariates to have zero mean and unit variance.
To see how well the matching performs in terms of balancing the covariates, Table 4 reports

21
average differences within the matched pairs. First all the covariates are normalized to have
zero mean and unit variance. The first two columns report the averages for the PSID controls
and the experimental trainees. One can see that before matching, the averages for some of the
variables are more than a standard deviation apart, e.g., the earnings and employment variables.
The next pair of columns reports the within-matched-pairs average difference and the standard
deviation of this within-pair difference. For all the indicator variables the matching is exact:
every trainee is matched to someone with the same ethnicity, marital status and employment
history for the years 1974 and 1975. The other, more continuously distributed variables are not
matched exactly, but the quality of the matches appears very high: the average difference within
the pairs is very small compared to the average difference between trainees and controls before
the matching, and it is also small compared to the standard deviations of these differences. If we
increase the number of matches the quality goes down, with even the indicator variables no longer
matched exactly, but in most cases the average difference is still far smaller than the standard
deviation till we get to 16 or more matches. The last row reports matching differences for logistic
estimates of the propensity score. Although the matching is not directly on the propensity score,
with single matches the average difference in the propensity score is only 0.21, whereas without
matching the difference between trainees and controls is 8.16, 40 times higher.

6. A Monte Carlo Study

In this section we discuss some simulations designed to assess the performance of the various
matching estimators in this context. To make the assessment more useful we simulate data sets
that closely resemble the Lalonde data set analyzed in the previous section.
In the simulation we have nine regressors, designed to match the following variables in the
Lalonde data set: age, education, black, hispanic, married, earnings1974, unemployed1974, earn-
ings1975, unemployed1975. For each simulated data set we sample with replacement 185 ob-
servations from the empirical covariate distribution of the trainees, and 2490 observations from
the empirical covariate distribution of the PSID controls. This gives us the joint distribution
of covariates and treatment indicators. For the conditional distribution of the outcome given

22
covariates, we estimated a two-part model on the PSID controls, where the probability of zero
earnings is a logistic function of the covariates with a full set of quadratic terms and interactions.
Conditional on being positive, the log of earnings is a function of the covariates with again a full
set of quadratic terms and interactions. We then assume a constant treatment effect of 2.0.
For each data set simulated in this way we report results for the same set of estimators. For
each estimator we report the mean and median bias, the root-mean-squared-error, the median-
absolute-error, the standard deviation, the average estimated standard error, and the coverage
rates for nominal 95% and 90% confidence intervals. The results are reported in Table 5.
In terms of rmse and mae the bias-adjusted matching estimator is best with 4 or 16 matches.
The simple matching estimator does not perform as well neither in terms of bias or rmse. The
pure regression adjustment estimators do not perform very well. They have high rmse and
substantial bias. Bias-corrected estimator also perform better in terms of coverage rates. In
terms of coverage rates the non-corrected matching estimators and the regression estimators
have lower than nominal coverage rates for any value of M .

7. Conclusion

In this paper we derive large sample properties of simple matching estimators that are widely used
in applied evaluation research. The formal large sample properties turn out to be surprisingly
poor. We show out that with more than three continuous covariates the bias of simple matching
estimators will not disappear, and with more than four continuous covariates actually dominate
the variance in large samples. We also show that matching estimators with a fixed number
of matches are not efficient. We suggest a nonparametric bias-adjustment that removes the
asymptotic bias. A simple implementation of this estimator where the bias-adjustment is based
on linear regression appears to perform well compared to both matching estimators without
bias-adjustment and regression-based estimators in simulations based on realistic settings for
nonexperimental program evaluations.

23
Appendix

Before proving Theorem 1, we collect some results on integration using polar coordinates that will be
useful. See for example Stroock (1999). Let Sk = { Rk : kk = 1} be the unit k-sphere, and Sk be
its surface measure. Then, the area of the unit k-sphere is:
Z
2 k/2
Sk (d) = .
Sk (k/2)

The volume of the unit k-ball is:


Z 1 Z
k1 2 k/2 k/2
r Sk (d) dr = = .
0 Sk k(k/2) (1 + k/2)

In addition,
Z
Sk (d) = 0,
Sk

and
Z
Z Sk (d)
0 Sk k/2
Sk (d) = Ik = Ik ,
Sk k (1 + k/2)

where Ik is the k-dimensional identity matrix. For any non-negative measurable function on Rk ,
Z Z Z 
k1
g(x) dx = r g(r) Sk (d) dr.
Rk 0 Sk

We will also use the following result on Laplace approximation of integrals.

Lemma 1: Let a(r) and b(r) be two real functions, a(r) is continuous in a neighborhood of zero and
b(r) has continuous first derivative in a neighborhood of zero. Let b(0) = 0, b(r) > 0 for r > 0, and
that for every r > 0 the infimum of b(r) over r r is positive. Suppose that there exist positive real
numbers a0 , b0 , , such that
db
lim a(r)r1 = a0 , lim b(r)r = b0 , and lim (r)r1 = b0 .
r0 r0 r0 dr
R
Suppose also that 0 a(r) exp(N b(r)) dr converges absolutely throughout its range for all sufficiently
large N . Then, for N
Z    
a0 1 1
a(r) exp(N b(r))dr = +o .
0 b/ N / N /
0

24
Proof: It follows from Theorem 7.1 in Olver (1997).
Proof of Theorem 1: First consider the conditional probability of unit i being the m-th closest
match to z, given Xi = x:

Pr(jm = i|Xi = x)

 
N 1
= (Pr(kX zk > kx zk))N m (Pr(kX zk kx zk))m1
m1

 
N 1
= (1 Pr(kX zk kx zk))N m (Pr(kX zk kx zk))m1 .
m1

Since the marginal probability of unit i being the m-th closest match to z is Pr(jm = i) = 1/N , and
the marginal density is f (x), the distribution of Xi , conditional on it being the m-th closest match, is:

fXi |jm =i (x) = N f (x) Pr(jm = i|Xi = x)

 
N 1
= N f (x) (1 Pr(kX zk kx zk))N m (Pr(kX zk kx zk))m1 ,
m1

and this is also the distribution of Xjm . Now transform to the matching discrepancy Um = Xjm z to
get
 
N 1
fUm (u) = N f (z + u) (1 Pr(kX zk kuk))N m
m1

(Pr(kX zk kuk))m1 . (A.1)

Transform to Vm = N 1/k Um with Jacobian N 1 to get:


     N m
N 1 v  kvk
fVm (v) = f z + 1/k 1 Pr kX zk 1/k
m1 N N
  m1
kvk
Pr kX zk 1/k
N
    N
1m N 1 kvk
= N (f (z) + o(1)) 1 Pr kX zk 1/k
m1 N
  m1
kvk
(1 + o(1)) N Pr kX zk 1/k .
N

Note that Pr(kX zk kvkN 1/k ) is


Z kvk/N 1/k Z 
k1
r f (z + r)Sk (d) dr,
0 Sk

25
where Sk = { Rk : kk = 1} is the unit k-sphere, and Sk is its surface measure. The derivative
w.r.t. N is
  Z  
1 kvkk kvkk
f z + 1/k Sk (d).
N2 k Sk N

Therefore,
Z
Pr(kX zk kvkN 1/k ) kvkk
lim = f (z) Sk (d).
N 1/N k Sk

In addition, it is easy to check that


 
N 1 1
N 1m = + o(1).
m1 (m 1)!

Therefore,
 Z m1  Z 
f (z) k f (z) k f (z)
lim fVm (v) = kvk Sk (d) exp kvk Sk (d) .
N (m 1)! k Sk k Sk

The previous equation shows that the density of Vm converges pointwise to a non-negative function
which is rotation invariant with respect to the origin. To check that this function defines a proper
distribution, transform to polar coordinates and integrate:
Z Z  Z m1  Z  !

f (z) k f (z) k f (z)
rk1 r (d) exp r (d) (d) dr
0 (m 1)! Sk k Sk k Sk
Z  Z m  Z 
krmk1 f (z) k f (z)
= (d) exp r (d) dr.
0 (m 1)! k Sk k Sk

Transform t = rk to get
Z m1  Z m  Z 
t f (z) f (z)
(d) exp t (d) dt,
0 (m 1)! k Sk k Sk

which is equal
R to one because is the integral of the density of a gamma random variable with parameters
(m, k(f (z) Sk (d))1 ) over its support. As a result, the matching discrepancy Um is Op (N 1/k ) and
the limiting distribution of N 1/k Um is rotation invariant with respect to the origin. This finishes the
proof of the first result.
Next, given fUm (u) in (A.1),
 
N 1
EUm = N Am ,
m1

where
Z
Am = u f (z + u) (1 Pr(kX zk kuk))N m (Pr(kX zk kuk))m1 du.
R k

26
Changing variables to polar coordinates gives:
Z Z 
Am = rk1 r f (z + r)Sk (d) (1 Pr(kX zk r))N m (Pr(kX zk r))m1 dr
0 Sk

Then rewriting the probability Pr(kX zk r) as


Z Z
f (x)1{kx zk r}dx = f (z + v)1{kvk r}dv
Rk Rk
Z r Z 
= sk1 f (z + s)Sk (d) ds
0 Sk

and substituting this into the expression for Am gives:


Z Z  Z r Z  N m
k1 k1
Am = r r f (z + r)Sk (d) 1 s f (z + s)Sk (d) ds
0 Sk 0 Sk

Z r Z  m1
k1
s f (z + s)Sk (d) ds dr
0 Sk

Z
= eN b(r) a(r) dr,
0

where
 Z r Z  
b(r) = log 1 sk1 f (z + s)Sk (d) ds ,
0 Sk

and
Z r Z
 m1
k1
Z  s f (z + s)Sk (d) ds
k 0 Sk
a(r) = r f (z + r)Sk (d)  Z r  Z  m .
Sk k1
1 s f (z + s)Sk (d) ds
0 Sk

That is, a(r) = q(r)p(r), q(r) = rk c(r), and p(r) = (g(r))m1 , where
Z
f (z + r)Sk (d)
Sk
c(r) = Z r Z  ,
k1
1 s f (z + s)Sk (d) ds
0 Sk

27
Z r Z 
k1
s f (z + s)Sk (d) ds
0 Sk
g(r) = Z r Z  .
1 sk1 f (z + s)Sk (d) ds
0 Sk

First note that


Z b(r) is continuous in a neighborhood of zero and b(0) = 0. By Theorem 6.20 in Rudin
(1976), sk1 f (z + s)Sk (d) is continuous, and
Sk
Z 
rk1 f (z + r)Sk (d)
db Sk
(r) = Z r Z  ,
dr
1 sk1 f (z + s)Sk (d) ds
0 Sk

which is also continuous. Using LHospitals rule:


Z
1 db 1
lim b(r)rk = lim k1 (r) = f (z) Sk (d).
r0 r0 kr dr k Sk

Similarly, c(r) is continuous in a neighborhood of zero, c(0) = 0, and


Z Z
1 dc f 0 1 f
lim c(r)r = lim (r) = (z) Sk (d) = (z) Sk (d).
r0 r0 dr x Sk k x Sk

Therefore,
Z
(k+1) dc 1 f
lim q(r)r = lim (r) = (z) Sk (d).
r0 r0 dr k x Sk

Similar calculations yield


Z
k 1 dg 1
lim g(r)r = lim k1 (r) = f (z) Sk (d).
r0 r0 kr dr k Sk

Therefore
 Z m1
(m1)k 1
lim p(r)r = f (z) Sk (d) .
r0 k Sk

Now, it is clear that


  
(mk+1) (m1)k (k+1)
lim a(r)r = lim p(r)r lim q(r)r
r0 r0 r0
 Z m1 Z
1 1 f
= f (z) Sk (d) (z) Sk (d)
k Sk k x Sk
 Z m
1 1 f
= f (z) Sk (d) (z).
k Sk f (z) x
Therefore, the conditions of Lemma 1 hold for = mk + 2, = k
 Z m
1 1 f
a0 = f (z) Sk (d) (z)
k Sk f (z) x
Z
1
b0 = f (z) Sk (d).
k Sk

28
Applying Lemma 1, we get
   
mk + 2 a0 1 1
Am = (mk+2)/k N (mk+2)/k
+ o
k kb N (mk+2)/k
0

   Z 2/k  
mk + 2 1 f (z) 1 df 1 1
= Sk (d) (z) (mk+2)/k + o .
k k k Sk f (z) dx N N (mk+2)/k

  !2/k  
mk + 2 1 k/2 1 df 1 1
= f (z)  (z) (mk+2)/k + o .
k k 1 + k2 f (z) dx N N (mk+2)/k

Now, since

N m /(m 1)!
lim   = 1,
N N 1
N
m1

we have that
 
N 1
E[Um ] = N Am
m1

  !2/k  
mk + 2 1 k/2 1 df 1 1
= f (z)  (z) 2/k + o ,
k (m 1)!k 1 + k2 f (z) dx N N 2/k

which finishes the proof for the second result of the theorem.
To get the result for E[Um Um 0 ], notice that

 
0 N 1
E[Um Um ]=N Bm ,
m1

where
Z
Bm = uu0 f (z + u) (1 Pr(kX zk kuk))N m (Pr(kX zk kuk))m1 du.
Rk
Transforming to polar coordinates again leads to
Z Z  Z r Z  N m
k1 2 0 k1
Bm = r r f (z + r)Sk (d) 1 s f (z + s)Sk (d) ds
0 Sk 0 Sk

Z r Z  m1
k1
s f (z + s)Sk (d) ds dr
0 Sk

29
Z
= eN b(r) a
(r) dr,
0

where, as before
 Z r Z  
k1
b(r) = log 1 s f (z + s)Sk (d) ds ,
0 Sk

and
Z r Z
 m1
k1
Z  s f (z + s)Sk (d) ds
k+1 0 0 Sk
a
(r) = r f (z + r)Sk (d)  Z r  Z  m .
Sk k1
1 s f (z + s)Sk (d) ds
0 Sk

That is, a(r) = q(r)p(r), q(r) = rk+1 c(r), and, as before, p(r) = (g(r))m1 , where
Z
0 f (z + r)Sk (d)
Sk
c(r) = Z r Z  ,
1 sk1 f (z + s)Sk (d) ds
0 Sk

Z r Z 
k1
s f (z + s)Sk (d) ds
0 Sk
g(r) = Z r Z  .
1 sk1 f (z + s)Sk (d) ds
0 Sk

Clearly,
Z
1
lim q(r)r(k+1) = lim c(r) = f (z) Sk (d)Ik .
r0 r0 k Sk

Hence,
  
(mk+1) (m1)k (k+1)
lim a
(r)r = lim p(r)r lim q(r)r
r0 r0 r0
 Z m1 Z
1 1
= f (z) Sk (d) f (z) Sk (d)Ik
k Sk k Sk
 Z m
1
= f (z) Sk (d) Ik .
k Sk

Therefore, the conditions of Lemma 1 hold for = mk + 2, = k


 Z m
1
a0 = f (z) Sk (d) Ik
k
Z Sk
1
b0 = f (z) Sk (d).
k Sk

30
Applying Lemma 1, we get
   
mk + 2 a0 1 1
Bm = (mk+2)/k N (mk+2)/k
+o
k kb N (mk+2)/k
0

   Z 2/k  
mk + 2 1 f (z) 1 1
= Sk (d) Ik + o .
k k k Sk N (mk+2)/k N (mk+2)/k

  !2/k  
mk + 2 1 k/2 1 1
= f (z)  Ik + o .
k k 1 + k2 N (mk+2)/k N (mk+2)/k

Hence, using the fact that


N m /(m 1)!
lim   = 1,
N N 1
N
m1

we have that
 
0 N 1
E[Um Um ] =N Bm
m1

  !2/k  
mk + 2 1 k/2 1 1
= f (z)  I +o ,
k (m 1)!k 1 + k2 N 2/k N 2/k

which finishes the third part of the theorem.


Using the same techniques as for the first two moments,
 
N 1
EkUm k3 = N Cm ,
m1

where
Z
Cm = eN b(r) a
(r) dr,
0

 Z r Z  
k1
b(r) = log 1 s f (z + s)Sk (d) ds ,
0 Sk

and
Z r Z
 m1
k1
Z  s f (z + s)Sk (d) ds
k+2 0 Sk
a
(r) = r f (z + r)Sk (d)  Z r Z  m .
Sk k1
1 s f (z + s)Sk (d) ds
0 Sk

31
That is, a(r) = q(r)p(r), q(r) = rk+2 c(r), and p(r) = (g(r))m1 , where
Z
f (z + r)Sk (d)
Sk
c(r) = Z r Z  ,
k1
1 s f (z + s)Sk (d) ds
0 Sk

Z r Z 
sk1f (z + s)Sk (d) ds
0
g(r) = Z r SZk  .
k1
1 s f (z + s)Sk (d) ds
0 Sk

Now,
Z
(k+2)
lim q(r)r = lim c(r) = f (z) Sk (d).
r0 r0 Sk

Hence,
  
(r)r(mk+2) =
lim a lim p(r)r(m1)k lim q(r)r(k+2)
r0 r0 r0
 Z m1 Z
1
= f (z) Sk (d) f (z) Sk (d)
k Sk Sk
 Z m
1
= f (z) Sk (d) k.
k Sk

Therefore, the conditions of Lemma 1 hold for = mk + 3, = k


 Z m
1
a0 = f (z) Sk (d) k
k Sk
Z
1
b0 = f (z) Sk (d).
k Sk

Applying Lemma 1, we get


   
mk + 3 a0 1 1
Cm = (mk+3)/k N (mk+3)/k
+ o
k kb N (mk+3)/k
0

  Z 3/k  
mk + 3 f (z) 1 1
= Sk (d) +o .
k k Sk N (mk+3)/k N (mk+3)/k

  !3/k  
mk + 3 k/2 1 1
= f (z)  +o .
k 1 + k2 N (mk+3)/k N (mk+3)/k

32
Hence, using the fact that
N m /(m 1)!
lim   = 1,
N N 1
N
m1

we have that
 
3 N 1
E[kUm k ] = N Cm
m1

  !3/k  
mk + 3 1 k/2 1 1
= f (z)  +o .
k (m 1)! 1 + k2 N 3/k N 3/k

Therefore
 
3 1
EkUm k = O .
N 3/k

Proof of Theorem 2:

Ebsm
M = E[Ybi (1) Ybi (0)] E[Yi (1) Yi (0)]

Z h i

= E (Ybi (1) Ybi (0)) (Yi (1) Yi (0)) Xi = x f (x) dx

Z h i
P
= (1 p) E (1/M ) jJM (i) 1 (Xj ) 1 (Xi ) Xi = x, Wi = 0 f0 (x) dx

Z h i
P
p E (1/M ) jJM (i) 0 (Xj ) 0 (Xi ) Xi = x, Wi = 1 f1 (x) dx. (A.2)

Applying a second order Taylor expansion

E[1 (Xjm (i) ) 1 (Xi )|Xi = x, Wi = 0, 0N W = N1 ]

1
= E[(Xjm (i) x)0 |Xi = x, Wi = 0, 0N W = N1 ] (x)
X

 
1 2 1 0 0
+ tr (x)E[(Xjm (i) x)(Xjm (i) x) |Xi = x, Wi = 0, N W = N1 ] + R(x),
2 X 0 X

where |R(x)| = O E[kXjm (i) xk3 |Xi = x, Wi = 0, 0N W = N1 ] . Applying Theorem 1, we get

E[1 (Xjm (i) ) 1 (Xi )|Xi = x, Wi = 0, 0N W = N1 ]

33
  !2/k
mk + 2 1 k/2
= f1 (x) 
k (m 1)!k 1 + k2

   !
1 f1 1 1 2 1 1 1
0
(x) (x) + tr (x) +o
f1 (x) X X 2 X 0 X N1
2/k
N1
2/k

Note that
NX
M
1 
2/k
Pr 0N W = N1 |Xi = x, Wi = 0
N1 =M N1

NX
M  
1 N
= pN1 (1 p)N N1
2/k N1
N1 =M N1

NXM    
1 p2/k N N1 N N1 1 1
= 2/k 2/k p (1 p) = 2/k 2/k + o ,
p N (N1 /N )2/k N1 p N N 2/k
N =M 1

since N1 /N = p + op (1). Therefore,


h P i
E (1/M ) jJM (i) 1 (Xj ) 1 (Xi )|Xi = x, Wi = 0

M   ! !2/k
1 X mk + 2 1 k/2 1
= f1 (x) 
M
m=1
k (m 1)!k 1 + k2 p2/k

    
1 f1 1 1 2 1 1 1
0
(x) (x) + tr (x) +o . (A.3)
f1 (x) X X 2 X 0 X N 2/k N 2/k

Similarly,
h P i
E (1/M ) jJM (i) 0 (Xj ) 0 (Xi )|Xi = x, Wi = 1

M   ! !2/k
1 X mk + 2 1 k/2 1
= f0 (x) 
M
m=1
k (m 1)!k 1 + k2 (1 p)2/k

    
1 f0 0 1 2 0 1 1
0
(x) (x) + tr (x) +o . (A.4)
f0 (x) X X 2 X 0 X N 2/k N 2/k

Combine equations (A.2), (A.3), and (A.4) to obtain the result. 

34
Proof of Theorem 3:
smt is O(N 2/k ).
Following the same lines as in the proof of theorem 2, it can be shown that the bias of M 0
s/22/k
Therefore, the normalized bias is O(N0 ), which vanishes asymptotically if s < 4/k. 

Proof of Theorem 4: Define \(\underline f = \inf_{x,w} f_w(x)\) and \(\bar f = \sup_{x,w} f_w(x)\), with \(\underline f > 0\) and \(\bar f\) finite. Let \(\mathbb X\) be a convex set of dimension equal to \(k\) and diameter equal to \(u^*\). Consider all the balls \(B(x,u)\) with centers \(x\in\mathbb X\) and radius \(u\le u^*\). Let \(c > 0\) be the infimum over such balls of the proportion of the ball's volume taken up by its intersection with \(\mathbb X\). (Note that, since \(\mathbb X\) is convex, this proportion can only increase for a smaller radius.)

First we bound the probability that the distance to a match exceeds some value. Let \(x\in\mathbb X\) and \(u < N_{1-W_i}^{1/k}\,u^*\). Then,
\[
\Pr\big(\|X_j - X_i\| > u\,N_{1-W_i}^{-1/k}\,\big|\,W_1,\ldots,W_N,\,W_j = 1-W_i,\,X_i = x\big)
= 1 - \int_0^{u N_{1-W_i}^{-1/k}}\int_{S^k} r^{k-1} f_{1-W_i}(x+r\omega)\,\lambda(d\omega)\,dr
\]
\[
\le 1 - c\underline f\int_0^{u N_{1-W_i}^{-1/k}}\int_{S^k} r^{k-1}\,\lambda(d\omega)\,dr
= 1 - c\underline f\,u^k N_{1-W_i}^{-1}\,\frac{\pi^{k/2}}{\Gamma(1+k/2)}.
\]
Similarly,
\[
\Pr\big(\|X_j - X_i\| \le u\,N_{1-W_i}^{-1/k}\,\big|\,W_1,\ldots,W_N,\,W_j = 1-W_i,\,X_i = x\big)
\le \bar f\,u^k N_{1-W_i}^{-1}\,\frac{\pi^{k/2}}{\Gamma(1+k/2)}.
\]

Thus,
\[
\Pr\big(\|X_j - X_i\| > u\,N_{1-W_i}^{-1/k}\,\big|\,W_1,\ldots,W_N,\,X_i = x,\,j\in J_M(i)\big)
\le \Pr\big(\|X_j - X_i\| > u\,N_{1-W_i}^{-1/k}\,\big|\,W_1,\ldots,W_N,\,X_i = x,\,j = j_M(i)\big)
\]
\[
= \sum_{m=0}^{M-1}\binom{N_{1-W_i}}{m}
\Pr\big(\|X_j - X_i\| > u\,N_{1-W_i}^{-1/k}\,\big|\,W_1,\ldots,W_N,\,W_j = 1-W_i,\,X_i = x\big)^{N_{1-W_i}-m}
\]
\[
\times\Pr\big(\|X_j - X_i\| \le u\,N_{1-W_i}^{-1/k}\,\big|\,W_1,\ldots,W_N,\,W_j = 1-W_i,\,X_i = x\big)^{m}.
\]
Notice that
\[
\binom{N_{1-W_i}}{m}
\Pr\big(\|X_j - X_i\| \le u\,N_{1-W_i}^{-1/k}\,\big|\,W_1,\ldots,W_N,\,W_j = 1-W_i,\,X_i = x\big)^{m}
\le \frac{1}{m!}\left(u^k\,\bar f\,\frac{\pi^{k/2}}{\Gamma(1+k/2)}\right)^{m}.
\]

Therefore,
\[
\Pr\big(\|X_j - X_i\| > u\,N_{1-W_i}^{-1/k}\,\big|\,W_1,\ldots,W_N,\,X_i = x,\,j\in J_M(i)\big)
\le \sum_{m=0}^{M-1}\frac{1}{m!}\left(u^k\bar f\,\frac{\pi^{k/2}}{\Gamma(1+k/2)}\right)^{m}
\left(1 - u^k c\underline f\,\frac{\pi^{k/2}}{\Gamma(1+k/2)}\,\frac{1}{N_{1-W_i}}\right)^{N_{1-W_i}-m}.
\]
Then, for some constant \(C_1 > 0\),
\[
\Pr\big(\|X_j - X_i\| > u\,N_{1-W_i}^{-1/k}\,\big|\,W_1,\ldots,W_N,\,X_i = x,\,j\in J_M(i)\big)
\le C_1\max\{1, u^{k(M-1)}\}\sum_{m=0}^{M-1}
\left(1 - u^k c\underline f\,\frac{\pi^{k/2}}{\Gamma(1+k/2)}\,\frac{1}{N_{1-W_i}}\right)^{N_{1-W_i}-m}
\]
\[
\le C_1 M\max\{1, u^{k(M-1)}\}\exp\left(-u^k c\underline f\,\frac{\pi^{k/2}}{\Gamma(1+k/2)}\right).
\]

(The last inequality holds because, for \(a > 0\), \(\log a \le a - 1\).) Note that this bound also holds for \(u \ge N_{1-W_i}^{1/k}\,u^*\), since in that case the probability that \(\|X_{j(i)} - X_i\| > u\,N_{1-W_i}^{-1/k}\) is zero. Since the bound does not depend on \(x\), this inequality also holds without conditioning on \(x\). Let \(B_{j(i)}\) be the volume of the catchment area \(A_{j(i)}\):
\[
B_{j(i)} = \int_{A_{j(i)}} dx.
\]
Notice that, conditional on \(W_1,\ldots,W_N\), the match \(j(i)\), and \(\{X_j\,|\,W_j = 1-W_i\}\), the distribution of \(X_i\) is proportional to \(f_{W_i}(x)\,1\{x\in A_{j(i)}\}\), where \(A_{j(i)}\) is the catchment area. Therefore, the same is true if we condition only on \(W_1,\ldots,W_N\), the match \(j(i)\), and \(A_{j(i)}\). Note that a ball with radius \((b/2)^{1/k}/(\pi^{k/2}/\Gamma(1+k/2))^{1/k}\) has volume \(b/2\). Therefore,
\[
\Pr\left(\|X_j - X_i\| > \frac{(b/2)^{1/k}}{(\pi^{k/2}/\Gamma(1+k/2))^{1/k}}\,\Big|\,W_1,\ldots,W_N,\,j\in J_M(i),\,B_j \ge b\right)
\ge \frac{\underline f}{2\bar f}.
\]
As a result, if
\[
\Pr\left(\|X_j - X_i\| > \frac{(b/2)^{1/k}}{(\pi^{k/2}/\Gamma(1+k/2))^{1/k}}\,\Big|\,W_1,\ldots,W_N,\,j\in J_M(i)\right)
\le \varepsilon\,\frac{\underline f}{2\bar f}, \tag{A.5}
\]
then it must be the case that \(\Pr(B_j \ge b\,|\,W_1,\ldots,W_N,\,j\in J_M(i)) \le \varepsilon\). In fact, the inequality in equation (A.5) has been established above for
\[
\frac{b}{2} = \frac{u^k}{N_{1-W_i}}\,\frac{\pi^{k/2}}{\Gamma(1+k/2)},
\qquad
\varepsilon\,\frac{\underline f}{2\bar f} = C_1 M\max\{1, u^{k(M-1)}\}
\exp\left(-u^k c\underline f\,\frac{\pi^{k/2}}{\Gamma(1+k/2)}\right).
\]

Let \(t = 2u^k\pi^{k/2}/\Gamma(1+k/2)\). Then
\[
\Pr(N_{1-W_i} B_j \ge t\,|\,W_1,\ldots,W_N,\,j\in J_M(i)) \le C_2\max\{1, C_3 t^{M-1}\}\exp(-C_4 t),
\]
for some positive constants \(C_2\), \(C_3\), and \(C_4\). This establishes a uniform exponential bound, so all the moments of \(N_{1-W_i}B_j\) exist conditional on \(W_1,\ldots,W_N\), \(j\in J_M(i)\) (uniformly in \(N\)). Finally, consider the distribution of \(K_M(j)\), the number of times unit \(j\) is used as a match. Conditional on the catchment area \(A_j\) and on \(W_1,\ldots,W_N\), this distribution is binomial with parameters \(N_{1-W_j}\) and \(p(A_j)\), where the probability of a catch is the integral of the density over the catchment area:
\[
p(A_j) = \int_{A_j} f_{1-W_j}(x)\,dx \le B_j\bar f.
\]
Therefore, conditional on \(A_j\), \(j\in J_M(i)\), and \(W_1,\ldots,W_N\), the \(r\)-th moment of \(K_M(j)\) is
\[
E[K_M(j)^r\,|\,A_j,\,W_1,\ldots,W_N,\,j\in J_M(i)]
= \sum_{n=1}^{r} S(r,n)\,\frac{N_{1-W_j}!}{(N_{1-W_j}-n)!}\,p(A_j)^n
\le \sum_{n=1}^{r} S(r,n)\,\bar f^{\,n}\,(N_{1-W_j}B_j)^n,
\]
where the \(S(r,n)\) are Stirling numbers of the second kind. Then,
\[
E[K_M(j)^r\,|\,j\in J_M(i)]
\le \sum_{n=1}^{r} S(r,n)\,\bar f^{\,n}\,
E\left[\left(\frac{N_{1-W_j}}{N_{W_j}}\right)^{n}(N_{W_j}B_j)^n\,\Big|\,j\in J_M(i)\right]
\]
is uniformly bounded in \(N\) (by the Law of Iterated Expectations and Hölder's Inequality). Since conditioning on one match only increases the moments of \(K_M(j)\), we conclude that all the moments of \(K_M(j)\) are bounded uniformly in \(N\). \(\square\)
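The boundedness of these moments is easy to see in a simulation. The sketch below is our own illustration under simplifying assumptions (scalar X uniform on (0,1), equal group sizes, M = 1): the second moment of \(K_1(j)\) settles down rather than growing with N.

import numpy as np

# Sketch: moments of K_M(j), the number of times a unit is used as a match,
# stay bounded as N grows. Assumptions (ours): scalar uniform X, M = 1.
rng = np.random.default_rng(2)

def second_moment_K(N, reps=100):
    vals = []
    for _ in range(reps):
        X1, X0 = rng.random(N), rng.random(N)
        j = np.abs(X0[:, None] - X1[None, :]).argmin(axis=1)  # match of each control
        K = np.bincount(j, minlength=N)                       # K_1(j) for each treated j
        vals.append(np.mean(K.astype(float) ** 2))
    return np.mean(vals)

for N in (100, 400, 1600):
    print(N, second_moment_K(N))    # bounded, not growing in N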

Before proving Theorem 6, we give some preliminary results. Define
\[
A_M(i) = \Big\{x\ \Big|\ \sum_{j\,|\,W_j = W_i} 1\{\|X_j - x\| \le \|X_i - x\|\} \le M\Big\}.
\]
The exact conditional distribution of \(K_M(i)\) is
\[
K_M(i)\,\Big|\,W,\{X_j\}_{W_j=1},\,W_i = 1\ \sim\ \mathrm{Binomial}\left(N_0,\ \int_{A_M(i)} f_0(z)\,dz\right),
\]
and
\[
K_M(i)\,\Big|\,W,\{X_j\}_{W_j=0},\,W_i = 0\ \sim\ \mathrm{Binomial}\left(N_1,\ \int_{A_M(i)} f_1(z)\,dz\right).
\]
Let us describe the set \(A_M(i)\) in more detail for the special case in which \(X\) is a scalar. First, let \(r_w(x)\) be the number of units with \(W_i = w\) and \(X_i \le x\). Then, define \(X_{(i,k)} = X_j\) if \(r_{W_i}(X_i) - r_{W_i}(X_j) = -k\) and \(r_{W_i}(X_i) - \lim_{x\uparrow X_j} r_{W_i}(x) = -k+1\). Then the set \(A_M(i)\) is equal to the interval
\[
A_M(i) = \big((X_i + X_{(i,-M)})/2,\ (X_i + X_{(i,M)})/2\big),
\]
with width \((X_{(i,M)} - X_{(i,-M)})/2\).

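This interval characterization can be checked directly. The following sketch (our own illustration, assuming W_i = 1, X uniform on (0,1), M = 1, and no ties) counts the controls matched to a given interior treated unit in two ways: by explicit nearest-neighbor matching, and by counting controls falling in the interval above. The two counts agree.

import numpy as np

# Sketch: controls caught by treated unit i are exactly those in A_M(i).
rng = np.random.default_rng(3)
N1, N0, M = 50, 200, 1
X1 = np.sort(rng.random(N1))                 # treated, sorted for convenience
X0 = rng.random(N0)                          # controls

i = N1 // 2                                  # an interior treated unit
lo = (X1[i] + X1[i - M]) / 2                 # (X_i + X_(i,-M)) / 2
hi = (X1[i] + X1[i + M]) / 2                 # (X_i + X_(i,M)) / 2

d = np.abs(X0[:, None] - X1[None, :])
K = np.sum(d.argsort(axis=1)[:, :M] == i)    # controls matched to unit i
print(K, np.sum((X0 > lo) & (X0 < hi)))      # the two counts agree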
Lemma 2: Given \(X_i = x\) and \(W_i = 1\),
\[
2N_1\,\frac{f_1(x)}{f_0(x)}\int_{A_M(i)} f_0(z)\,dz\ \overset{d}{\longrightarrow}\ \mathrm{Gamma}(2M, 1),
\]
and given \(X_i = x\) and \(W_i = 0\),
\[
2N_0\,\frac{f_0(x)}{f_1(x)}\int_{A_M(i)} f_1(z)\,dz\ \overset{d}{\longrightarrow}\ \mathrm{Gamma}(2M, 1).
\]
Proof: We only prove the first part of the lemma; the second part follows from exactly the same argument. First we establish that
\[
2N_1\int_{A_M(i)} f_1(x)\,dz = N_1 f_1(x)\big(X_{(i,M)} - X_{(i,-M)}\big)
\ \overset{d}{\longrightarrow}\ \mathrm{Gamma}(2M, 1).
\]
Let \(F_1(x)\) be the distribution function of \(X\) given \(W = 1\). Then \(D = F_1(X_{(i,M)}) - F_1(X_{(i,-M)})\) is the difference of order statistics of the uniform distribution, \(2M\) orders apart. Hence the exact distribution of \(D\) is Beta with parameters \(2M\) and \(N_1 - 2M + 1\). For large \(N_1\), the distribution of \(N_1 D\) is then approximately Gamma with parameters \(2M\) and 1. Now approximate \(N_1 D\) as
\[
N_1 D = N_1\big(F_1(X_{(i,M)}) - F_1(X_{(i,-M)})\big)
= N_1 f_1(\tilde X_i)\big(X_{(i,M)} - X_{(i,-M)}\big),
\]
for some \(\tilde X_i \in (X_{(i,-M)}, X_{(i,M)})\). The first claim follows because \(\tilde X_i\) converges almost surely to \(x\). Second, we show that
\[
2N_1\,\frac{f_1(x)}{f_0(x)}\int_{A_M(i)} f_0(z)\,dz - 2N_1\int_{A_M(i)} f_1(x)\,dz = o_p(1).
\]

This difference can be written as
\[
2N_1\,\frac{f_1(x)}{f_0(x)}\int_{A_M(i)} f_0(z)\,dz
- 2N_1\,\frac{f_1(x)}{f_0(x)}\int_{A_M(i)} f_0(x)\,dz
= 2N_1\,\frac{f_1(x)}{f_0(x)}\left(\int_{A_M(i)}\big(f_0(z) - f_0(x)\big)\,dz\right).
\]
Notice that
\[
\left|\int_{A_M(i)}\big(f_0(z) - f_0(x)\big)\,dz\right|\Big/\int_{A_M(i)} dz
\le \sup_z\left|\frac{\partial f_0}{\partial z}\right|
\left(\int_{A_M(i)} |z - x|\,dz\right)\Big/\int_{A_M(i)} dz
\le \sup_z\left|\frac{\partial f_0}{\partial z}\right|\int_{A_M(i)} dz = o_p(1),
\]
because \(|\partial f_0/\partial z|\) is bounded and \(A_M(i)\) vanishes asymptotically. Thus,
\[
\left|2N_1\,\frac{f_1(x)}{f_0(x)}\int_{A_M(i)}\big(f_0(z) - f_0(x)\big)\,dz\right|
\le 2N_1\,\frac{f_1(x)}{f_0(x)}\left(\int_{A_M(i)} dz\right)
\left|\int_{A_M(i)}\big(f_0(z) - f_0(x)\big)\,dz\right|\Big/\int_{A_M(i)} dz = o_p(1). \qquad\square
\]

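The Gamma(2M, 1) limit is visible in a small simulation. The sketch below is our own illustration, assuming X given W = 1 is uniform on (0,1), so f_1 = 1, at an interior unit; both the mean and the variance of the rescaled interval width should be close to 2M.

import numpy as np

# Sketch: N_1 f_1(x) (X_(i,M) - X_(i,-M)) is approximately Gamma(2M, 1).
rng = np.random.default_rng(4)
N1, M, reps = 2000, 2, 5000
draws = np.empty(reps)
for r in range(reps):
    X1 = np.sort(rng.random(N1))
    i = N1 // 2                                  # an interior unit
    draws[r] = N1 * (X1[i + M] - X1[i - M])
print(draws.mean(), draws.var())                 # both close to 2M = 4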

Proof of Theorem 6:
Consider
\[
E\left[\left(1+\frac{K_M(i)}{M}\right)^2\sigma^2(X_i, W_i)\right]
= E\left[\left(1+\frac{K_M(i)}{M}\right)^2\sigma_1^2(X_i)\,\Big|\,W_i = 1\right] p
+ E\left[\left(1+\frac{K_M(i)}{M}\right)^2\sigma_0^2(X_i)\,\Big|\,W_i = 0\right](1-p).
\]
We know that
\[
E[K_M(i)\,|\,W,\{X_j\}_{W_j=1},\,W_i = 1] = N_0\,p(A_i),
\]
and
\[
E[K_M(i)^2\,|\,W,\{X_j\}_{W_j=1},\,W_i = 1] = N_0\,p(A_i) + N_0(N_0-1)\,p(A_i)^2.
\]
Therefore,
\[
E\left[\left(1+\frac{K_M(i)}{M}\right)^2\sigma_1^2(X_i)\,\Big|\,W_i = 1\right]
= E\left[\left(1 + \frac{1}{M^2}\big(N_0\,p(A_i) + N_0(N_0-1)\,p(A_i)^2\big)
+ \frac{2}{M}\,N_0\,p(A_i)\right)\sigma_1^2(X_i)\,\Big|\,W_i = 1\right].
\]
From the previous results, this expectation is equal to
\[
E\left[\left(1 + \frac{1}{M}\left(\frac{1-p}{p}\,\frac{f_0(X_i)}{f_1(X_i)}
+ (2M+1)\,\frac{(1-p)^2}{2p^2}\,\frac{f_0(X_i)^2}{f_1(X_i)^2}\right)
+ 2\,\frac{1-p}{p}\,\frac{f_0(X_i)}{f_1(X_i)}\right)\sigma_1^2(X_i)\,\Big|\,W_i = 1\right] + o(1).
\]

Rearranging terms, we get
\[
E\left[\left(1+\frac{K_M(i)}{M}\right)^2\sigma_1^2(X_i)\,\Big|\,W_i = 1\right]
= E\left[\left(1 + \frac{1-p}{p}\,\frac{f_0(X_i)}{f_1(X_i)}\right)^2\sigma_1^2(X_i)\,\Big|\,W_i = 1\right]
\]
\[
+ \frac{1}{M}\,E\left[\left(\frac{1-p}{p}\,\frac{f_0(X_i)}{f_1(X_i)}
+ \frac{(1-p)^2}{2p^2}\,\frac{f_0(X_i)^2}{f_1(X_i)^2}\right)\sigma_1^2(X_i)\,\Big|\,W_i = 1\right] + o(1).
\]
Notice that
\[
E\left[\left(1 + \frac{1-p}{p}\,\frac{f_0(X_i)}{f_1(X_i)}\right)^2\sigma_1^2(X_i)\,\Big|\,W_i = 1\right] p
= E\left[\frac{\sigma_1^2(X_i)}{e(X_i)^2}\,\Big|\,W_i = 1\right] p
= \int\frac{\sigma_1^2(x)}{e(x)}\,\frac{p f_1(x)}{e(x) f(x)}\,f(x)\,dx
= E\left[\frac{\sigma_1^2(X_i)}{e(X_i)}\right].
\]

In addition,
\[
E\left[\left(\frac{1-p}{p}\,\frac{f_0(X_i)}{f_1(X_i)}
+ \frac{(1-p)^2}{2p^2}\,\frac{f_0(X_i)^2}{f_1(X_i)^2}\right)\sigma_1^2(X_i)\,\Big|\,W_i = 1\right] p
= \int\left((1-p) f_0(x) + \frac{(1-p)^2}{2p}\,\frac{f_0(x)^2}{f_1(x)}\right)\sigma_1^2(x)\,dx
\]
\[
= \int\frac{1}{2}(1-p) f_0(x)\left(2 + \frac{1-p}{p}\,\frac{f_0(x)}{f_1(x)}\right)\sigma_1^2(x)\,dx
= \int\frac{1}{2}(1-p) f_0(x)\left(1 + \frac{1}{e(x)}\right)\sigma_1^2(x)\,dx
\]
\[
= \frac{1}{2}\,E\left[(1 - e(X_i))\left(1 + \frac{1}{e(X_i)}\right)\sigma_1^2(X_i)\right]
= \frac{1}{2}\,E\left[\frac{\sigma_1^2(X_i)}{e(X_i)} - e(X_i)\,\sigma_1^2(X_i)\right].
\]
Therefore,
\[
E\left[\left(1+\frac{K_M(i)}{M}\right)^2\sigma_1^2(X_i)\,\Big|\,W_i = 1\right] p
= \left(1 + \frac{1}{2M}\right) E\left[\frac{\sigma_1^2(X_i)}{e(X_i)}\right]
- \frac{1}{2M}\,E[e(X_i)\,\sigma_1^2(X_i)] + o(1).
\]

The analogous result holds conditioning on \(W_i = 0\); therefore,
\[
E\left[\left(1+\frac{K_M(i)}{M}\right)^2\sigma^2(X_i, W_i)\right]
= \left(1 + \frac{1}{2M}\right)
E\left[\frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1-e(X_i)}\right]
- \frac{1}{2M}\,E\big[e(X_i)\,\sigma_1^2(X_i) + (1-e(X_i))\,\sigma_0^2(X_i)\big] + o(1).
\]
As a result, since \(\mathrm{Var}(\varepsilon) = E[e(X_i)\sigma_1^2(X_i) + (1-e(X_i))\sigma_0^2(X_i)]\) for \(\varepsilon_i = Y_i - \mu_{W_i}(X_i)\),
\[
N\,\mathrm{Var}(\hat\tau^{sm}_M)
= \left(1 + \frac{1}{2M}\right)
E\left[\frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1-e(X_i)}\right]
+ \mathrm{Var}(\tau(X_i)) - \frac{1}{2M}\,\mathrm{Var}(\varepsilon) + o(1). \qquad\square
\]

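The penultimate display can be checked by simulation. The sketch below is our own illustration under assumptions we choose for convenience: scalar X uniform on (0,1) in both arms (so e(x) = p), \(\sigma_1^2(x) = 1 + x\), M = 1, and p = 0.4; the simulated left-hand side should be close to the population right-hand side.

import numpy as np

# Sketch: Monte Carlo check of E[(1 + K_M(i)/M)^2 sigma_1^2(X_i) | W_i=1] p.
rng = np.random.default_rng(5)
p, M, N, reps = 0.4, 1, 2000, 300
lhs = 0.0
for _ in range(reps):
    W = rng.random(N) < p
    X = rng.random(N)
    X1, X0 = X[W], X[~W]
    j = np.abs(X0[:, None] - X1[None, :]).argmin(axis=1)  # match of each control
    K = np.bincount(j, minlength=X1.size)                 # K_1(i) for treated i
    lhs += np.mean((1 + K / M) ** 2 * (1 + X1)) * X1.size / N
lhs /= reps
# population side: E[sigma_1^2] = 1.5, e(x) = p
rhs = (1 + 1 / (2 * M)) * (1.5 / p) - (1 / (2 * M)) * (1.5 * p)
print(lhs, rhs)                                           # the two should be close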
Before proving Theorem 8, we state some auxiliary lemmas.

Lemma 3: (Uniform Convergence of Series Estimators of Regression Functions, Newey 1995)
Suppose the conditions in Theorem 8 hold. Then, for any \(\alpha > 0\) and non-negative integer \(d\),
\[
|\hat\mu(\cdot, w) - \mu(\cdot, w)|_d
= O_p\Big(K^{1+2k}\big((K/N)^{1/2} + K^{-\alpha}\big)\Big),
\]
for \(w = 0, 1\).

Proof: Assumptions 3.1, 4.1, 4.2, and 4.3 in Newey (1995) are satisfied, implying that Newey's Theorem 4.4 applies. \(\square\)

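To make the role of the slowly growing number of series terms concrete, the following sketch (our own illustration, not Newey's construction) fits a polynomial series of degree K = N^0.2 to a smooth regression function and reports the sup-norm error on an inner grid, which shrinks as N grows.

import numpy as np

# Sketch: uniform error of a polynomial series fit with K = N^nu, nu = 0.2.
# Assumptions (ours): mu(x) = sin(2 pi x), X ~ U(0,1), homoskedastic noise.
rng = np.random.default_rng(6)
grid = np.linspace(0.05, 0.95, 200)
for N in (200, 2000, 20000):
    K = max(2, int(N ** 0.2))
    X = rng.random(N)
    Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.3, N)
    coef = np.polynomial.polynomial.polyfit(X, Y, K)
    fit = np.polynomial.polynomial.polyval(grid, coef)
    print(N, K, np.max(np.abs(fit - np.sin(2 * np.pi * grid))))  # sup-norm error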
Lemma 4: (Infeasible versus Feasible Bias Correction)
Suppose the conditions in Theorem 8 hold. Then
\[
\hat\mu(X_{j_m(i)}, w) - \hat\mu(X_i, w)
- \big(\mu(X_{j_m(i)}, w) - \mu(X_i, w)\big) = o_p(N^{-1/2}),
\]
for \(w = 0, 1\).

Proof:
Fix the non-negative integer \(L > (k-2)/2\). Let \(\lambda\) be a multi-index of dimension \(k\), that is, a \(k\)-dimensional vector of non-negative integers, with \(|\lambda| = \sum_{i=1}^k \lambda_i\), and let \(\Lambda_l\) be the set of such \(\lambda\) with \(|\lambda| = l\). Furthermore, let \(x^\lambda = x_1^{\lambda_1}\cdots x_k^{\lambda_k}\), and let \(\partial^\lambda g(x) = \partial^{|\lambda|} g(x)/\partial x_1^{\lambda_1}\cdots\partial x_k^{\lambda_k}\). Finally, let \(U_m(X_i) = X_{j_m(i)} - X_i\). Use a Taylor series expansion around \(X_i\) to write
\[
\hat\mu(X_{j_m(i)}, w) = \hat\mu(X_i, w)
+ \sum_{l=1}^{L}\sum_{\lambda\in\Lambda_l}\partial^\lambda\hat\mu(X_i, w)\,U_m(X_i)^\lambda
+ \sum_{\lambda\in\Lambda_{L+1}}\partial^\lambda\hat\mu(\tilde x, w)\,U_m(X_i)^\lambda.
\]
First consider the last sum, \(\sum_{\lambda\in\Lambda_{L+1}}\partial^\lambda\hat\mu(\tilde x, w)\,U_m(X_i)^\lambda\). By Condition 4 in Theorem 8, the first factor in each term is bounded by \(C^{|\lambda|} = C^{L+1}\). The second factor in each term is of the form \(\prod_{j=1}^k U_m(X_i)_j^{\lambda_j}\). The factor \(U_m(X_i)_j^{\lambda_j}\) is of order \(O_p(N^{-\lambda_j/k})\), so that the product is of the order \(O_p(N^{-\sum_{j=1}^k\lambda_j/k}) = O_p(N^{-(L+1)/k}) = o_p(N^{-1/2})\), because \(L > (k-2)/2\). Hence, we can write
\[
\hat\mu(X_{j_m(i)}, w) - \hat\mu(X_i, w)
= \sum_{l=1}^{L}\sum_{\lambda\in\Lambda_l}\partial^\lambda\hat\mu(X_i, w)\,U_m(X_i)^\lambda + o_p(N^{-1/2}).
\]
Using the same argument as we used for the estimated regression function \(\hat\mu(x, w)\), we have for the true regression function \(\mu(x, w)\),
\[
\mu(X_{j_m(i)}, w) - \mu(X_i, w)
= \sum_{l=1}^{L}\sum_{\lambda\in\Lambda_l}\partial^\lambda\mu(X_i, w)\,U_m(X_i)^\lambda + o_p(N^{-1/2}).
\]
Now consider the difference between these two expressions:
\[
\hat\mu(X_{j_m(i)}, w) - \hat\mu(X_i, w)
- \big(\mu(X_{j_m(i)}, w) - \mu(X_i, w)\big)
= \sum_{l=1}^{L}\sum_{\lambda\in\Lambda_l}
\big(\partial^\lambda\hat\mu(X_i, w) - \partial^\lambda\mu(X_i, w)\big)\,U_m(X_i)^\lambda + o_p(N^{-1/2}).
\]
Consider for a particular \(l\) the term \(\big(\partial^\lambda\hat\mu(X_i, w) - \partial^\lambda\mu(X_i, w)\big)\,U_m(X_i)^\lambda\). The second factor is, using the same argument as before, of order \(O_p(N^{-l/k})\). Since \(l \ge 1\), the second factor is at most \(O_p(N^{-1/k})\). Now consider the first factor. By Lemma 3, this is of order \(O_p\big(K^{1+2k}((K/N)^{1/2} + K^{-\alpha})\big)\). With \(K = N^\nu\), this is \(O_p\big(N^{\nu(3/2+2k)-1/2} + N^{\nu(1+2k-\alpha)}\big)\). We can choose \(\alpha\) large enough so that for any given \(\nu\) the first term dominates. Hence the order of the product is \(O_p(N^{\nu(3/2+2k)-1/2})\,O_p(N^{-1/k})\). Given that, by Condition (iii) in Theorem 8, \(\nu < 2/(3k+4k^2)\), we have \(\nu(3/2+2k) - 1/2 < 1/k - 1/2\), and therefore the order is \(o_p(N^{-1/2})\). \(\square\)

Proof of Theorem 8:
The difference \(\hat\tau^{ibcm} - \hat\tau^{bcm}\) can be written as
\[
\hat\tau^{ibcm} - \hat\tau^{bcm}
= \frac{1}{N}\sum_{i\,|\,W_i = 0}
\Big[\hat\mu(X_i, 1) - \hat\mu(X_{j_m(i)}, 1)
- \big(\mu(X_i, 1) - \mu(X_{j_m(i)}, 1)\big)\Big]
\]
\[
+ \frac{1}{N}\sum_{i\,|\,W_i = 1}
\Big[\hat\mu(X_i, 0) - \hat\mu(X_{j_m(i)}, 0)
- \big(\mu(X_i, 0) - \mu(X_{j_m(i)}, 0)\big)\Big].
\]
Each of these terms is of order \(o_p(N^{-1/2})\) by Lemma 4. There are \(N\) of them, with a factor \(1/N\). Hence the sum is of order \(o_p(N^{-1/2})\). \(\square\)

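To fix ideas, here is a minimal, self-contained sketch of a bias-corrected matching estimator in the spirit of the one analyzed above. All specifics are our own simplifications: scalar X, a single match (M = 1), and a simple linear fit standing in for the series estimator of mu(x, w); the adjustment mu_hat(X_i, w) - mu_hat(X_j(i), w) is the feasible correction that Theorem 8 compares with its infeasible counterpart.

import numpy as np

def bias_corrected_match(Y, W, X):
    X1, Y1 = X[W == 1], Y[W == 1]
    X0, Y0 = X[W == 0], Y[W == 0]
    b1 = np.polyfit(X1, Y1, 1)     # linear stand-in for the series fit of mu(., 1)
    b0 = np.polyfit(X0, Y0, 1)     # and of mu(., 0)
    tau = np.empty(Y.size)
    for i in range(Y.size):
        if W[i] == 1:
            j = np.abs(X0 - X[i]).argmin()
            # impute Y_i(0), correcting for the matching discrepancy
            y0 = Y0[j] + np.polyval(b0, X[i]) - np.polyval(b0, X0[j])
            tau[i] = Y[i] - y0
        else:
            j = np.abs(X1 - X[i]).argmin()
            y1 = Y1[j] + np.polyval(b1, X[i]) - np.polyval(b1, X1[j])
            tau[i] = y1 - Y[i]
    return tau.mean()

# toy usage (our own data-generating process)
rng = np.random.default_rng(7)
N = 500
X = rng.random(N)
W = (rng.random(N) < 0.5).astype(int)
Y = 2 * X + W * (1 + X) + rng.normal(0, 0.1, N)    # true ATE = 1 + E[X] = 1.5
print(bias_corrected_match(Y, W, X))               # close to 1.5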
References

Abadie, A. (2002), "Semiparametric Instrumental Variable Estimation of Treatment Response Models," Journal of Econometrics (forthcoming).

Abadie, A., and G.W. Imbens (2001), "Estimation of the Conditional Variance in Paired Experiments," mimeo.

Angrist, J.D., G.W. Imbens, and D.B. Rubin (1996), "Identification of Causal Effects Using Instrumental Variables," Journal of the American Statistical Association, 91, 444-472.

Angrist, J.D., and A.B. Krueger (2000), "Empirical Strategies in Labor Economics," in O. Ashenfelter and D. Card, eds., Handbook of Labor Economics, vol. 3. New York: Elsevier Science.

Ashenfelter, O., and D. Card (1985), "Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs," Review of Economics and Statistics, 67, 648-660.

Barnow, B.S., G.G. Cain, and A.S. Goldberger (1980), "Issues in the Analysis of Selectivity Bias," in E. Stromsdorfer and G. Farkas, eds., Evaluation Studies, vol. 5. San Francisco: Sage.

Card, D., and D. Sullivan (1988), "Measuring the Effect of Subsidized Training Programs on Movements In and Out of Employment," Econometrica, 56(3), 497-530.

Cochran, W. (1968), "The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies," Biometrics, 24, 295-314.

Cochran, W., and D. Rubin (1973), "Controlling Bias in Observational Studies: A Review," Sankhya, 35, 417-446.

Dehejia, R., and S. Wahba (1999), "Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs," Journal of the American Statistical Association, 94, 1053-1062.

Estes, E.M., and B.E. Honoré (2001), "Partially Linear Regression Using One Nearest Neighbor," unpublished manuscript, Princeton University.

Gu, X., and P. Rosenbaum (1993), "Comparison of Multivariate Matching Methods: Structures, Distances and Algorithms," Journal of Computational and Graphical Statistics, 2, 405-420.

Hahn, J. (1998), "On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects," Econometrica, 66(2), 315-331.

Heckman, J., and J. Hotz (1989), "Alternative Methods for Evaluating the Impact of Training Programs" (with discussion), Journal of the American Statistical Association.

Heckman, J., and R. Robb (1984), "Alternative Methods for Evaluating the Impact of Interventions," in J. Heckman and B. Singer, eds., Longitudinal Analysis of Labor Market Data. Cambridge: Cambridge University Press.

Heckman, J., H. Ichimura, and P. Todd (1997), "Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program," Review of Economic Studies, 64, 605-654.

Heckman, J., H. Ichimura, J. Smith, and P. Todd (1998), "Characterizing Selection Bias Using Experimental Data," Econometrica, 66, 1017-1098.

Heckman, J.J., R.J. Lalonde, and J.A. Smith (2000), "The Economics and Econometrics of Active Labor Markets Programs," in O. Ashenfelter and D. Card, eds., Handbook of Labor Economics, vol. 3. New York: Elsevier Science.

Hirano, K., G. Imbens, and G. Ridder (2000), "Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score," NBER Working Paper.

Hotz, J., G. Imbens, and J. Mortimer (1999), "Predicting the Efficacy of Future Training Programs Using Past Experiences," NBER Working Paper.

Ichimura, H., and O. Linton (2001), "Trick or Treat: Asymptotic Expansions for some Semiparametric Program Evaluation Estimators," unpublished manuscript, London School of Economics.

Lalonde, R.J. (1986), "Evaluating the Econometric Evaluations of Training Programs with Experimental Data," American Economic Review, 76, 604-620.

Lechner, M. (1998), "Earnings and Employment Effects of Continuous Off-the-job Training in East Germany After Unification," Journal of Business and Economic Statistics.

Manski, C. (1990), "Nonparametric Bounds on Treatment Effects," American Economic Review Papers and Proceedings, 80, 319-323.

Manski, C., G. Sandefur, S. McLanahan, and D. Powers (1992), "Alternative Estimates of the Effect of Family Structure During Adolescence on High School Graduation," Journal of the American Statistical Association, 87(417), 25-37.

Møller, J. (1994), Lectures on Random Voronoi Tessellations. New York: Springer-Verlag.

Newey, W.K. (1995), "Convergence Rates for Series Estimators," in G.S. Maddala, P.C.B. Phillips, and T.N. Srinivasan, eds., Statistical Methods of Economics and Quantitative Economics: Essays in Honor of C.R. Rao. Cambridge: Blackwell.

Okabe, A., B. Boots, K. Sugihara, and S. Nok Chiu (2000), Spatial Tessellations: Concepts and Applications of Voronoi Diagrams, 2nd edition. New York: Wiley.

Olver, F.W.J. (1974), Asymptotics and Special Functions. New York: Academic Press.

Quade, D. (1982), "Nonparametric Analysis of Covariance by Matching," Biometrics, 38, 597-611.

Robins, J.M., and A. Rotnitzky (1995), "Semiparametric Efficiency in Multivariate Regression Models with Missing Data," Journal of the American Statistical Association, 90, 122-129.

Robins, J.M., A. Rotnitzky, and L-P. Zhao (1995), "Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data," Journal of the American Statistical Association, 90, 106-121.

Rosenbaum, P. (1984), "Conditional Permutation Tests and the Propensity Score in Observational Studies," Journal of the American Statistical Association, 79, 565-574.

Rosenbaum, P. (1989), "Optimal Matching in Observational Studies," Journal of the American Statistical Association, 84, 1024-1032.

Rosenbaum, P. (1995), Observational Studies. New York: Springer-Verlag.

Rosenbaum, P. (2000), "Covariance Adjustment in Randomized Experiments and Observational Studies," Statistical Science (forthcoming).

Rosenbaum, P., and D. Rubin (1983a), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41-55.

Rosenbaum, P., and D. Rubin (1983b), "Assessing the Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome," Journal of the Royal Statistical Society, Series B, 45, 212-218.

Rosenbaum, P., and D. Rubin (1984), "Reducing the Bias in Observational Studies Using Subclassification on the Propensity Score," Journal of the American Statistical Association, 79, 516-524.

Rosenbaum, P., and D. Rubin (1985), "Constructing a Control Group Using Multivariate Matched Sampling Methods that Incorporate the Propensity Score," American Statistician, 39, 33-38.

Rubin, D. (1973a), "Matching to Remove Bias in Observational Studies," Biometrics, 29, 159-183.

Rubin, D. (1973b), "The Use of Matched Sampling and Regression Adjustments to Remove Bias in Observational Studies," Biometrics, 29, 185-203.

Rubin, D. (1977), "Assignment to Treatment Group on the Basis of a Covariate," Journal of Educational Statistics, 2(1), 1-26.

Rubin, D. (1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318-328.

Smith, J.A., and P.E. Todd (2001a), "Reconciling Conflicting Evidence on the Performance of Propensity-Score Matching Methods," American Economic Review, Papers and Proceedings, 91, 112-118.

Stoyan, D., W. Kendall, and J. Mecke (1995), Stochastic Geometry and its Applications, 2nd edition. New York: Wiley.

Stroock, D.W. (1994), A Concise Introduction to the Theory of Integration. Boston: Birkhäuser.

Yatchew, A. (1999), "Differencing Methods in Nonparametric Regression: Simple Techniques for the Applied Econometrician," Working Paper, Department of Economics, University of Toronto.
Table 1
A Matching Estimator with Four Observations

 i   Wi   Xi   Yi   j(i)   K1(i)   Ŷi(0)   Ŷi(1)   τ̂i

 1   1    6    10    3      2       8       10      2
 2   0    4     5    1      1       5       10      5
 3   0    7     8    1      1       8       10      2
 4   1    1     5    2      0       5        5      0

                                            τ̂ = 9/4

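The numbers in Table 1 can be reproduced with a few lines of code. The sketch below (our own illustration) carries out single nearest-neighbor matching with replacement on the four observations; ties do not occur in this data.

import numpy as np

# Reproduce Table 1: W, X, Y as given; match each unit to the closest
# unit with the opposite treatment and impute the missing potential outcome.
W = np.array([1, 0, 0, 1])
X = np.array([6, 4, 7, 1])
Y = np.array([10, 5, 8, 5])

tau = np.empty(4)
for i in range(4):
    opp = np.where(W != W[i])[0]
    j = opp[np.abs(X[opp] - X[i]).argmin()]   # j(i), the closest opposite unit
    y1, y0 = (Y[i], Y[j]) if W[i] == 1 else (Y[j], Y[i])
    tau[i] = y1 - y0
print(tau, tau.mean())                        # [2, 5, 2, 0] and 9/4, as in the table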
Table 2
Summary Statistics, Lalonde Data

                        Experimental Data                       PSID             T-statistic
              Trainees (N=185)    Controls (N=260)    Controls (N=2490)    Train/    Train/
              mean     (s.d.)     mean     (s.d.)     mean     (s.d.)      Contr     PSID

Age           25.8     (7.16)     25.05    (7.06)     34.85    (10.44)     [1.1]     [-16.0]
Education     10.4     (2.01)     10.09    (1.61)     12.12    (3.08)      [1.4]     [-11.1]
Black         0.84     (0.36)     0.83     (0.38)     0.25     (0.43)      [0.5]     [21.0]
Hispanic      0.06     (0.24)     0.11     (0.31)     0.03     (0.18)      [-1.9]    [1.5]
Married       0.19     (0.39)     0.15     (0.36)     0.87     (0.34)      [1.0]     [-22.8]
Earnings '74  2.10     (4.89)     2.11     (5.69)     19.43    (13.41)     [-0.0]    [-38.6]
Unempl. '74   0.71     (0.46)     0.75     (0.43)     0.09     (0.28)      [-1.0]    [18.3]
Earnings '75  1.53     (3.22)     1.27     (3.10)     19.06    (13.60)     [0.9]     [-48.6]
Unempl. '75   0.60     (0.49)     0.68     (0.47)     0.10     (0.30)      [-1.8]    [13.76]
Earnings '78  6.35     (7.87)     4.55     (5.48)     21.55    (15.56)     [2.7]     [-23.1]
Unempl. '78   0.24     (0.43)     0.35     (0.48)     0.11     (0.32)      [-2.65]   [4.0]

Note: The first two columns give the average and standard deviation for the 185 trainees from the experimental data set. The second pair of columns gives the average and standard deviation for the 260 controls from the experimental data set. The third pair of columns gives the averages and standard deviations for the 2490 controls from the nonexperimental PSID sample. The seventh column gives t-statistics for the differences between the averages for the experimental trainees and controls. The last column gives the t-statistics for the differences between the averages for the experimental trainees and the PSID controls. The last two variables, earnings '78 and unemployed '78, are post-training; all the others are pre-training variables. Earnings data are in thousands of dollars.
Table 3
Experimental and Non-experimental Estimates
of Average Treatment Effects for Lalonde Data

                          M=1           M=4           M=16          M=64          All Controls
                          est  (s.e.)   est  (s.e.)   est  (s.e.)   est   (s.e.)  est     (s.e.)

Panel A: Experimental Estimates
simple matching           1.23 (0.89)   1.99 (0.79)   1.76 (0.80)   2.20  (0.75)   1.79   (0.75)
bias-adjusted matching    1.17 (0.89)   1.84 (0.79)   1.55 (0.80)   1.74  (0.75)   1.72   (0.75)
Regression Estimates
dif                       1.79 (0.67)
linear                    1.72 (0.65)
quadratic                 2.27 (0.73)

Panel B: Non-experimental Estimates
simple matching           2.09 (1.00)   1.62 (0.88)   0.47 (0.86)  -0.11  (0.75)  -15.20  (0.62)
bias-adjusted matching    2.45 (1.00)   2.51 (0.88)   2.48 (0.86)   2.26  (0.75)    0.84  (0.62)
Regression Estimates
dif                      -15.20 (0.66)
linear                     0.84 (0.86)
quadratic                  3.26 (0.98)

Note: Panel A reports the results for the experimental data (experimental controls and trainees), and Panel B the results for the nonexperimental data (PSID controls with experimental trainees). In each panel the top part reports results for the matching estimators, with the number of matches equal to 1, 4, 16, 64, and 2490 (all controls). The second part reports results for three regression adjustment estimates, based on no covariates, all covariates entering linearly, and all covariates entering with a full set of quadratic terms and interactions. The outcome is earnings in 1978 in thousands of dollars.
Table 4
Mean Covariate Differences in Matched Groups

              Average                M=1            M=4            M=16           M=64           M=2490
              PSID     Trainees     dif   (s.d.)   dif   (s.d.)   dif   (s.d.)   dif   (s.d.)   dif   (s.d.)

Age            0.06    -0.80       -0.02  (0.65)  -0.06  (0.60)  -0.30  (0.41)  -0.57  (0.57)  -0.86  (0.68)
Education      0.04    -0.54       -0.10  (0.44)  -0.20  (0.48)  -0.25  (0.39)  -0.24  (0.42)  -0.58  (0.66)
Black         -0.09     1.21       -0.00  (0.00)   0.09  (0.32)   0.35  (0.47)   0.70  (0.66)   1.30  (0.80)
Hispanic      -0.01     0.14       -0.00  (0.00)   0.00  (0.00)   0.00  (0.00)   0.01  (0.03)   0.15  (1.30)
Married        0.12    -1.64        0.00  (0.00)  -0.06  (0.30)  -0.33  (0.46)  -0.90  (0.85)  -1.76  (1.02)
Earnings '74   0.09    -1.18       -0.01  (0.10)  -0.01  (0.12)  -0.05  (0.17)  -0.15  (0.30)  -1.26  (0.36)
Unempl. '74   -0.13     1.72        0.00  (0.00)   0.02  (0.17)   0.24  (0.40)   0.41  (0.72)   1.85  (1.36)
Earnings '75   0.09    -1.18       -0.04  (0.17)  -0.07  (0.15)  -0.11  (0.19)  -0.19  (0.26)  -1.26  (0.23)
Unempl. '75   -0.10     1.36        0.00  (0.00)   0.00  (0.05)   0.03  (0.28)   0.10  (0.41)   1.46  (1.44)
Log Odds
Prop. Score   -7.08     1.08        0.21  (0.99)   0.56  (1.13)   1.70  (1.14)   3.20  (1.49)   8.16  (2.13)

Note: In this table all covariates have been normalized to have mean zero and unit variance. The first two columns present the averages for the experimental trainees and the PSID controls. The remaining pairs of columns present the average difference within the matched pairs and the standard deviation of this difference, for matching based on 1, 4, 16, 64, and 2490 matches. For the last variable, the logarithm of the odds ratio of the propensity score is used. This log odds ratio has mean -6.52 and standard deviation 3.30 in the sample.
Table 5
Simulation Results

  M     Estimator             mean     median    rmse     mae     s.d.    mean    coverage     coverage
                              bias     bias                               s.e.    (nom. 95%)   (nom. 90%)

  1     simple matching       -0.49    -0.45     0.87     0.55    0.72    0.84    0.93         0.88
        bias-adjusted          0.04     0.06     0.74     0.47    0.74    0.84    0.96         0.92

  4     simple matching       -0.85    -0.84     1.03     0.84    0.58    0.59    0.72         0.60
        bias-adjusted          0.05     0.06     0.60     0.39    0.60    0.59    0.94         0.89

  16    simple matching       -1.80    -1.78     1.89     1.78    0.57    0.52    0.07         0.04
        bias-adjusted          0.17     0.16     0.62     0.40    0.60    0.52    0.90         0.83

  64    simple matching       -3.27    -3.25     3.32     3.25    0.59    0.52    0.00         0.00
        bias-adjusted          0.15     0.16     0.65     0.43    0.63    0.52    0.87         0.81

  All   simple matching      -19.06   -19.06    19.07    19.06    0.61    0.43    0.00         0.00
 (2490) bias-adjusted         -2.04    -2.04     2.28     2.04    1.00    0.37    0.09         0.07

        difference           -19.06   -19.06    19.07    19.06    0.61    0.61    0.00         0.00
        linear regression     -2.04    -2.04     2.28     2.04    1.00    0.98    0.44         0.33
        quadratic regression   2.70     2.64     3.02     2.64    1.34    1.24    0.40         0.27

Note: For each estimator, summary statistics are provided for 10,000 replications of the data set. Results are reported for five values of the number of matches (M = 1, 4, 16, 64, 2490) and for two estimators: the simple matching estimator and the bias-adjusted matching estimator, the latter based on regression using only the matched treated and control units. The last three rows report results for the simple average treatment-control difference, the ordinary least squares estimator, and the ordinary least squares estimator using a full set of quadratic terms and interactions. For each estimator we report the mean and median bias, the root-mean-squared-error, the median-absolute-error, the standard deviation of the estimator, the average estimate of the standard error, and the coverage rates of the nominal 95% and 90% confidence intervals.

