
Economics 420

Introduction to Econometrics
Professor Woodbury
Fall Semester 2015
Simple Regression (Regression with One Regressor)
1. Introduction to linear regression
2. Defining the linear regression model
3. Estimating the linear regression model: method of moments
4. Algebraic properties and measures of fit of OLS
5. Sampling distribution of the OLS estimator
6. Hypothesis testing and confidence intervals for β₀ and β₁

4. Algebraic properties and measures of fit of OLS


Here are some definitions.

Fitted values or predicted values:

    Ŷᵢ = β̂₀ + β̂₁Xᵢ

This can also be written

    Ŷᵢ = b₀ + b₁Xᵢ (just different notation)

Actual values of Y:

    Yᵢ = β₀ + β₁Xᵢ + uᵢ

Residuals (deviations from the regression line):

    ûᵢ = Yᵢ − Ŷᵢ = Yᵢ − (β̂₀ + β̂₁Xᵢ)

And here are some algebraic properties.

The deviations from the regression line (residuals) sum to zero:

    Σᵢ₌₁ⁿ ûᵢ = 0

The correlation between the deviations and the regressors is zero:

    Σᵢ₌₁ⁿ Xᵢûᵢ = 0

The sample averages of Y and X lie on the regression line:

    Ȳ = β̂₀ + β̂₁X̄
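These properties are easy to check numerically. Here is a minimal Python sketch (the data are simulated for illustration only, not from the course) that computes the OLS estimates from the method-of-moments formulas and verifies all three properties:

    import numpy as np

    # Simulated data, for illustration only
    rng = np.random.default_rng(42)
    x = rng.uniform(10, 30, size=100)
    y = 700 - 2.3 * x + rng.normal(0, 15, size=100)

    # Method-of-moments (OLS) estimates
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    yhat = b0 + b1 * x   # fitted values
    uhat = y - yhat      # residuals

    print(uhat.sum())                        # ~0: residuals sum to zero
    print(np.sum(x * uhat))                  # ~0: residuals uncorrelated with X
    print(y.mean() - (b0 + b1 * x.mean()))   # ~0: the point (X-bar, Y-bar) lies on the line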

Again, the difference between the population regression line and the actual observations is an error (uᵢ).

[Figure (simple regression model): the population regression line Y = β₁ + β₂X, with observations P₁, ..., P₄ at X₁, ..., X₄; the vertical distance between each observation Pᵢ and the corresponding point Qᵢ on the population line (height β₁ + β₂Xᵢ) is the error uᵢ]

But the difference between the fitted regression line and the actual observations is a residual (eᵢ or ûᵢ).

[Figure (simple regression model): the fitted line Ŷ = b₁ + b₂X, with actual values Y at points P₁, ..., P₄ and fitted values Ŷ at points R₁, ..., R₄; the vertical distance Y − Ŷ = e at each Xᵢ is the residual eᵢ]

The discrepancies between the actual and fitted values of Y are known as the residuals.

The population regression line and the fitted regression line in one graph:

[Figure (simple regression model): the fitted line Ŷ = b₁ + b₂X and the population line Y = β₁ + β₂X drawn together; at X₄, the actual value at P₄ differs from the population line (point Q₄, height β₁ + β₂X₄) by the error u₄, and from the fitted line by the residual e₄]

Note: the values of the residuals ≠ the values of the errors!

CEO example again

Table 2.2 — Fitted values and residuals for the first 15 CEOs
(salary and salaryhat in $1,000s; roe in percent; uhat = salary − salaryhat)

obsno     roe    salary    salaryhat        uhat
    1    14.1      1095     1224.058   −129.0581
    2    10.9      1001     1164.854   −163.8542
    3    23.5      1122     1397.969   −275.9692
    4     5.9       578     1072.348   −494.3484
    5    13.8      1368     1218.508    149.4923
    6    20.0      1145     1333.215   −188.2151
    7    16.4      1078     1266.611   −188.6108
    8    16.3      1094     1264.761   −170.7606
    9    10.5      1237     1157.454    79.54626
   10    26.3       833     1449.773   −616.7726
   11    25.9       567     1442.372   −875.3721
   12    26.8       933     1459.023   −526.0231
   13    14.8      1339     1237.009    101.9911
   14    22.3       937     1375.768   −438.7678
   15    56.3      2011     2004.808    6.191895

Source: Wooldridge, Table 2.2 (© Cengage Learning, 2013)


Other Regression Statistics

A natural question is how well the regression line fits, or explains, the data.

Two regression statistics provide complementary measures of the quality of fit:

The regression R-squared measures the fraction of the variance of Y that is explained by X
o It is unit-free and ranges between zero (no fit) and one (perfect fit)

The standard error of the regression measures the fit: the typical size of a regression residual, in the units of Y

Goodness-of-fit measure (R-squared)

The total sum of squares represents the total variation in the dependent variable:

    SST = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²

The explained sum of squares represents the variation explained by the regression:

    SSE = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)²

The residual sum of squares represents the variation not explained by the regression:

    SSR = Σᵢ₌₁ⁿ ûᵢ²

The total variation can be decomposed:

    SST = SSE + SSR

So the total variation is the sum of the explained and unexplained variation.

Think of this as a generalization of the definition of the residual:

    ûᵢ = Yᵢ − Ŷᵢ  implies  Yᵢ = Ŷᵢ + ûᵢ

    Yᵢ = Ŷᵢ + ûᵢ

    SST = SSE + SSR

    Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² + Σᵢ₌₁ⁿ ûᵢ²

This gives us the R² statistic:

    R² = SSE/SST = 1 − SSR/SST

So R-squared measures the fraction of the total variation in (or variance of) Yᵢ that is explained by the regression, that is, by X.

But beware!
o A high R-squared does not mean that the regression has a causal interpretation
o And a low R-squared doesn't mean the regression is useless

What does it mean?

R² = 0 means SSE = 0, so X explains none of the variation of Y.

R² = 1 means SSE = SST, so Ŷᵢ = Yᵢ and X explains all of the variation of Y.

Almost always, 0 < R² < 1.

For regression with a single regressor (the case here), R² is the square of the correlation coefficient between X and Y.
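As a quick numerical check, here is a short Python sketch (continuing the simulated data from the earlier example) confirming that SST = SSE + SSR and that R² equals the squared correlation between X and Y in the single-regressor case:

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.uniform(10, 30, size=100)
    y = 700 - 2.3 * x + rng.normal(0, 15, size=100)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    yhat = b0 + b1 * x
    uhat = y - yhat

    sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
    sse = np.sum((yhat - y.mean()) ** 2)  # explained sum of squares
    ssr = np.sum(uhat ** 2)               # residual sum of squares

    r2 = sse / sst
    print(np.isclose(sst, sse + ssr))                    # True: SST = SSE + SSR
    print(np.isclose(r2, np.corrcoef(x, y)[0, 1] ** 2))  # True: R-squared = corr(X, Y)^2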

The Standard Error of the Regression (SER)

The standard error of the regression is (almost) the sample standard deviation of the regression errors (the OLS residuals):

    SER = s_û = √( Σᵢ₌₁ⁿ ûᵢ² / (n − 2) ) = √( SSR / (n − 2) )

This formula is similar to the formula for the sample standard deviation of Y.

It is, approximately, the square root of the average squared residual.

The Standard Error of the Regression (SER)

The SER:
o represents a typical deviation from the regression line
o has the units of u, which are the same as the units of Y
o measures the spread of the distribution of u
o measures the average size of the OLS residual: the average mistake made by the OLS regression line

Root mean squared error (RMSE)

This is closely related to the SER:

    RMSE = √( Σᵢ₌₁ⁿ ûᵢ² / n ) = √( SSR / n )

This measures the same thing as the SER; the minor difference is division by n instead of (n − 2).

In fact, Wooldridge doesn't even bother to distinguish between SER and RMSE (still another name for this is the standard error of the estimate).
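A short Python sketch (same simulated data as above) makes the distinction concrete; the ratio SER/RMSE is √(n/(n − 2)), about 1.01 when n = 100, so the two are nearly identical in large samples:

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.uniform(10, 30, size=100)
    y = 700 - 2.3 * x + rng.normal(0, 15, size=100)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    uhat = y - (b0 + b1 * x)

    n = len(y)
    ssr = np.sum(uhat ** 2)
    ser = np.sqrt(ssr / (n - 2))  # divides by n - 2 (degrees-of-freedom correction)
    rmse = np.sqrt(ssr / n)       # divides by n
    print(ser, rmse)              # nearly identical for n = 100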


Example of R² and SER

    TestScorehat = 698.9 − 2.28·STR
                   (10.4)  (0.52)

    R² = 0.05, SER = 18.6

In this case, STR explains only a small fraction of the variation in test scores, and the regression has a large standard error.

But the slope coefficient is large in a policy sense.

5. Sampling distribution of the OLS estimator

The population objects (parameters) β₀ and β₁ are unknown.

We estimate them by drawing a sample from a population and applying the estimators we have derived.

But the estimators themselves are random variables with a probability distribution (the sampling distribution) that describes the values they could take when we draw different random samples.

But to derive the sampling distributions of β̂₀ and β̂₁, we first need to make some assumptions.

Assumption SLR.1 (Linear in parameters)

In the population, the relationship between X and Y is linear:

    Yᵢ = β₀ + β₁Xᵢ + uᵢ

This seems more restrictive than it really is.

For example, if we choose Y to be the natural logarithm of earnings (rather than earnings), and X to be years of schooling, we will estimate a nonlinear relationship between schooling and earnings (examples to follow).

That is why the assumption is "linear in parameters."

Assumption SLR.2 (Random sampling)

We will choose n units of observation at random from the population of interest, and observe X and Y for each unit or entity.

Simple random sampling implies that

    (Xᵢ, Yᵢ), i = 1, ..., n

are independently and identically distributed (i.i.d.).

The population might consist of all workers in the country.

In the population, a linear relationship between earnings (or log of earnings) and years of education holds.

Draw a worker at random from the population.

Throw the worker back into the population and repeat: draw another worker at random ... and so on.

The result is twofold:
o each worker is drawn independently: there is no reason to suppose that, if a high-wage worker was drawn the first time, a high-wage worker will also be drawn the second time
o each worker is drawn from the same distribution

When would we NOT have random sampling?

The main place we will encounter non-i.i.d. sampling is when data are recorded over time (time series data):
o Time-series observations are not independent
o GDP this quarter (or this year) is related to GDP last quarter (or last year)
o The unemployment rate this month is related to the unemployment rate last month
o This is called autocorrelation: a variable is correlated with itself over time, and it will introduce some extra complications

Sample selection that is non-random is another case of non-random sampling, and it is very difficult to handle:
o For example, if we want to estimate the response of married women to a change in the wage, we are forced to use the sample of women who are already working (because they have wages we can observe)
o But working women may be different from non-working women in unobserved ways, so the estimates we obtain won't be applicable generally

Assumption SLR.3 (Sample variation in the explanatory variable)

The values of the explanatory variable are not all the same: there needs to be variation in X.

If there is no variation in X, then how could X possibly explain variation in Y?

If this assumption fails, it is impossible to study how different values of X lead to different values of the dependent variable.

Assumption SLR.4 (Zero conditional mean: E(u|X) = 0)

The value of the explanatory variable must contain no information about the mean of the unobserved factors.

We have already talked about this a lot ...

In the class size example, E(u|X) = 0 means

    E(Family Income|STR) = constant

so family income and STR are unrelated.

Assumption SLR.5 (Homoskedasticity: var(u|X) = σ²)

This is the constant variance assumption.

All it means is that the error term has the same variance for any value of X, the explanatory variable.

A regression of earnings (or wages) on education is a good example of a case that violates this assumption.

Why? Because earnings vary far more widely for highly educated workers than for less-educated workers.

See the figures (from Wooldridge).

The simple regression model under homoskedasticity

[Wooldridge, Figure 2.8: the conditional density f(y|x) has the same spread around the regression line E(y|x) = β₀ + β₁x at every value of x (shown at x₁, x₂, x₃)]

Heteroskedasticity: var(wage|educ) increases with educ

[Wooldridge, Figure 2.9: the conditional density f(wage|educ) spreads out as educ increases (shown at educ = 12 and 16) around E(wage|educ) = β₀ + β₁educ]

Here is a scatterplot of average hourly earnings vs. years of education from the 1999 Current Population Survey.

Homoskedasticity is often violated

So we need to have ways of handling heteroskedasticity.

For years, this was an active research question.
o We now have straightforward ways of handling it, and we will see these later

Applied researchers also handle it (even today) by transforming one or more of the variables in the model so the assumption is satisfied.
o For example, although var(wage|educ) increases with educ, var(log(wage)|educ) does not
o This is one reason labor economists have used the log transformation for decades

One additional assumption

One reputable text (Stock and Watson) adds the assumption that large outliers are unlikely:

    E(X⁴) < ∞ and E(Y⁴) < ∞

which means X and Y have finite kurtosis.

This assumption is plausible: standardized test scores, STR, and family income all have a finite domain.

For our purposes, this is important because outliers can wreak havoc with estimating a model, and you should always inspect your data to make sure some wild values aren't hiding there.

Data-entry errors are often a problem, but you may also learn something about the data (or where it came from) by asking questions about odd values.

Theorem 2.1 (Unbiasedness of OLS)

If Assumptions SLR.1–SLR.4 hold, the OLS estimators are unbiased for β₀ and β₁:

    E(β̂₀) = β₀  and  E(β̂₁) = β₁

This carries over from what we learned about estimating the population mean μ_Y:

    E(Ȳ) = μ_Y

so Ȳ is an unbiased estimator of μ_Y.

Two notes

First, this result does not require SLR.5 (homoskedasticity).

Second, it does not depend on the sample size.
o That is, β̂₁ is unbiased for β₁ even in a small sample
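To see unbiasedness in action, here is a small Monte Carlo sketch (all numbers are invented for illustration): fix β₀ = 1 and β₁ = 0.5, draw many random samples, and check that the average of the OLS slope estimates across samples is close to the true β₁, even with n as small as 50:

    import numpy as np

    rng = np.random.default_rng(0)
    beta0, beta1 = 1.0, 0.5   # true parameters (chosen arbitrarily)
    n, reps = 50, 10_000

    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 10, size=n)
        u = rng.normal(0, 2, size=n)   # errors with E(u|X) = 0
        y = beta0 + beta1 * x + u
        slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

    print(slopes.mean())   # close to 0.5: unbiased even in small samples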

Estimation: the sampling distributions of β̂₀ and β̂₁

We have estimators of β₀ and β₁, and we know that those estimators are unbiased.

But how much do these estimators deviate from their mean values, β₀ and β₁?

β̂₁ has a sampling distribution, just like Ȳ.

It is a random variable, and it will differ each time we draw a different sample from the population: remember, all we have are samples.

So we have two questions about β̂₁

What is its variance, var(β̂₁)? (This is a measure of sampling uncertainty.)

What is its sampling distribution in large samples?
o We aren't going to worry about small samples: we don't use them in economics, and it is harder to figure out the sampling distribution

We need to know this because we will need to do hypothesis tests about β₀ and β₁ using β̂₀ and β̂₁.

Start with the definition of variance:

    var(β̂₁) = E{[β̂₁ − E(β̂₁)]²} = E[(β̂₁ − β₁)²]

where the second equality is true because we know the OLS estimator of β₁ is unbiased.

Under Assumptions SLR.1–SLR.5, it turns out that

    var(β̂₁) = σ²_β̂₁ = var(uᵢ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = var(Yᵢ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

The last equality holds because

    var(uᵢ) = var(Yᵢ)

Why? Conditional on Xᵢ, the term β₀ + β₁Xᵢ is fixed, so all of the variation in Yᵢ comes from uᵢ.

Note that

    var(uᵢ) = σᵤ² = σ²  and  var(Yᵢ) = σ_Y² = σ²

This is just notation, but it is important, because σ² is often called the error variance or disturbance variance.

So we have another theorem.

Theorem 2.2 (Sampling variances of the OLS estimators)

Under Assumptions SLR.1–SLR.5,

    var(β̂₁) = σ²_β̂₁ = σ² / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = σ² / SSTₓ

This is easy! It just says the variance of β̂₁ is the error variance divided by the total sum of squares of X.

What does this tell us?

    var(β̂₁) = var(uᵢ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = var(Yᵢ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

First, var(β̂₁) is smaller when var(Yᵢ) or var(uᵢ) or σ² is smaller.

This makes sense: the smaller the variance of Y, the less likely it is we will draw extreme samples.

Why? Because extreme samples are simply less likely to occur.

Also, smaller errors mean the data will be closer to the population regression line, so the estimate of the slope will be more precise.

    var(β̂₁) = var(uᵢ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = var(Yᵢ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

Second, the larger the variation in X (the explanatory variable), the smaller the variance of the sampling distribution of β̂₁.

This also makes sense: if the denominator of the expression is larger, then the expression is smaller.

The denominator is larger if X varies more, that is, if we observe X over a wider range of values, so that we have more information about the relationship between X and Y.
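A quick simulation sketch (invented numbers, for illustration) shows both effects at once: the spread of β̂₁ across repeated samples shrinks when the error variance is smaller and when X varies more:

    import numpy as np

    rng = np.random.default_rng(1)

    def slope_sd(x_spread, error_sd, n=50, reps=5000):
        """Standard deviation of the OLS slope across simulated samples."""
        slopes = np.empty(reps)
        for r in range(reps):
            x = rng.uniform(0, x_spread, size=n)
            y = 1.0 + 0.5 * x + rng.normal(0, error_sd, size=n)
            slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        return slopes.std()

    print(slope_sd(x_spread=10, error_sd=2))  # baseline
    print(slope_sd(x_spread=10, error_sd=1))  # smaller error variance: smaller spread
    print(slope_sd(x_spread=20, error_sd=2))  # more variation in X: smaller spread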

Estimating the error variance

We're almost there, but we need an estimate of the error variance

    σ² = var(uᵢ) = var(Yᵢ)

because σ² is an unknown population parameter.

Solution: use the residuals.

By definition,

    σ² = E(u²)

We don't observe the errors uᵢ, but we can use the residuals to compute an estimator of σ²:

    σ̂² = (1/n) Σᵢ₌₁ⁿ ûᵢ²

We can also write this in terms of the sum of squared residuals:

    σ̂² = (1/n) Σᵢ₌₁ⁿ ûᵢ² = SSR/n

Footnote: we actually should divide by (n − 2), but again, in large samples it makes no difference.

The main thing is that we now have a third theorem.

Theorem 2.3 (Unbiased estimation of σ²)

Under Assumptions SLR.1–SLR.5,

    E(σ̂²) = σ²

That is,

    σ̂² = (1/(n − 2)) Σᵢ₌₁ⁿ ûᵢ² = SSR/(n − 2)

is an unbiased estimator of σ².

(Exact unbiasedness requires the divisor (n − 2), not n, but in a large sample this is not an issue.)

So we know the variance of β̂₁ and how to calculate it:

    var(β̂₁) = σ²_β̂₁ = σ² / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

To calculate this, we plug in

    σ̂² = (1/n) Σᵢ₌₁ⁿ ûᵢ² = SSR/n

for σ², which gives the estimated variance

    σ̂²_β̂₁ = σ̂² / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

So we have an estimator for var(β̂₁).
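Continuing the simulated data from earlier, a minimal sketch of this plug-in calculation (using the large-sample divisor n, as on the slide) might look like:

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.uniform(10, 30, size=100)
    y = 700 - 2.3 * x + rng.normal(0, 15, size=100)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    uhat = y - (b0 + b1 * x)

    sigma2_hat = np.sum(uhat ** 2) / len(y)   # SSR/n (use n - 2 in small samples)
    var_b1_hat = sigma2_hat / np.sum((x - x.mean()) ** 2)
    se_b1 = np.sqrt(var_b1_hat)               # standard error of the slope
    print(var_b1_hat, se_b1)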

The sampling distribution of β̂₁ is normal

How is β̂₁ distributed?

It turns out that, if the assumptions we have made hold, then in large samples,

    β̂₁ ~ N(β₁, σ²_β̂₁)

which just says that the OLS estimator of β₁ is normally distributed with mean β₁ and the variance we have worked out so laboriously.

Theorem 4.1 (The sampling distribution of β̂₁ is normal)

    β̂₁ ~ N(β₁, σ²_β̂₁)

Theorem 4.1 in Wooldridge is a big deal because it means we can do hypothesis testing and construct confidence intervals in the familiar way.

What have we learned from all this?

1. If the first four assumptions hold, the OLS estimators are unbiased for β₀ and β₁:

    E(β̂₀) = β₀  and  E(β̂₁) = β₁

2. The variance of β̂₁ is given by the formula above:

    var(β̂₁) = σ²_β̂₁ = σ² / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

3. The distribution of β̂₁ is normal.

6. Hypothesis testing and confidence intervals for β₀ and β₁

Suppose an angry parent says that reducing the number of students in a class has no effect on learning (or test scores), and spending money on more teachers is a waste.

The null hypothesis is

    H₀: β₁ = 0

where β₁ is the coefficient on STR in the regression of test scores on STR.

How do we test this hypothesis using data?

Null hypothesis with a two-sided alternative:

    H₀: β₁ = 0  vs.  H₁: β₁ ≠ 0

or, more generally,

    H₀: β₁ = β₁,₀  vs.  H₁: β₁ ≠ β₁,₀

where β₁,₀ is the hypothesized value of β₁ under the null.

Null hypothesis and one-sided alternative:

    H₀: β₁ = β₁,₀  vs.  H₁: β₁ > β₁,₀
    OR  H₀: β₁ = β₁,₀  vs.  H₁: β₁ < β₁,₀

In economics, it is almost always possible to come up with stories in which an effect could go either way, so it is standard to focus on two-sided alternatives.

Two-sided tests are more conservative than one-sided tests, in that it is harder to reject the null.

Recall hypothesis testing for the population mean using Ȳ:

    t = (Ȳ − μ_Y,₀) / (s_Y / √n)

We then reject the null hypothesis if |t| > 1.96.

In general, the t-statistic has the following form:

    t = (estimator − hypothesized value) / (standard error of the estimator)

where the standard error of the estimator is the square root of an estimator of the variance of the estimator.

Apply this to a hypothesis about β₁ and we have:

    t = (β̂₁ − β₁,₀) / SE(β̂₁)

where β₁,₀ is the value of β₁ hypothesized under the null.

More often than not, the null value is zero, so β₁,₀ = 0.

These are the t-statistics that Stata (and all other statistical software) spits out.

But you should think about your question: the null could often be something other than 0.

What is SE(β̂₁)?

SE(β̂₁) is the square root of our estimator of the variance of the sampling distribution of β̂₁:

    SE(β̂₁) = √(σ̂²_β̂₁)

That is why we went to all that effort earlier!

Two-sided alternative

Return to the calculation of the t-statistic:

    t = (β̂₁ − β₁,₀) / SE(β̂₁) = (β̂₁ − β₁,₀) / √(σ̂²_β̂₁)

With a two-sided alternative (and a large sample), we reject at the 5% significance level if |t| > 1.96:

    H₀: β₁ = 0  vs.  H₁: β₁ ≠ 0

In the figure, c (the critical value) = 1.96.

Rejection region for a 5% significance level test against the two-sided alternative

[Wooldridge, Figure C.6: a standard normal density with rejection regions in both tails; area = .025 in each tail beyond ±c, area = .95 between −c and c]

One-sided alternative

    t = (β̂₁ − β₁,₀) / SE(β̂₁) = (β̂₁ − β₁,₀) / √(σ̂²_β̂₁)

With a one-sided alternative (and a large sample), we reject at the 5% significance level if t > 1.65 when we are testing

    H₀: β₁ = β₁,₀  vs.  H₁: β₁ > β₁,₀

OR if t < −1.65 when we are testing

    H₀: β₁ = β₁,₀  vs.  H₁: β₁ < β₁,₀

In the figure, c (the critical value) = 1.65.

Rejection region for a 5% significance level test against the one-sided alternative β₁ > β₁,₀

[Wooldridge, Figure C.5: a standard normal density with a single rejection region in the right tail; area = .05 beyond the critical value c, area = .95 below c]

Example: Test Scores and STR (California data)

Estimated regression line:

    TestScorehat = 698.9 − 2.28·STR

Regression software reports the standard errors:

    SE(β̂₀) = 10.4    SE(β̂₁) = 0.52

so the t-statistic testing β₁,₀ = 0 is:

    t = (β̂₁ − β₁,₀) / SE(β̂₁) = (−2.28 − 0) / 0.52 = −4.38

The 1% two-sided critical value is 2.58, and |−4.38| > 2.58, so we reject the null at the 1% significance level.

Alternatively, we can compute the p-value.

Remember, the p-value is the smallest significance level at which the null hypothesis would be rejected, given the observed t-statistic.

For a two-sided hypothesis:

    p-value = Pr{|t| > |t(observed)|}
            = probability in the tails of the normal distribution outside |t(observed)|

Again, we will always assume a large sample, and typically n = 50 is large enough.

Stata gives you the p-value for each estimated coefficient.
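As a check on this arithmetic, here is a minimal Python sketch (using scipy's standard normal CDF, with the estimates from the regression above):

    from scipy.stats import norm

    t = (-2.28 - 0) / 0.52          # t-statistic for H0: beta1 = 0
    p = 2 * (1 - norm.cdf(abs(t)))  # two-sided p-value, large-sample normal
    print(round(t, 2), p)           # about -4.38, p near 0.00001 (reported as 0.000)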


Confidence intervals

The estimated β̂₁ is called a point estimate of β₁.

A confidence interval is called an interval estimate of β₁.

In general, if the sampling distribution of an estimator is normal for large n, then a 95% confidence interval can be constructed as

    estimator ± 1.96 × standard error

So a 95% confidence interval for β₁ is:

    {β̂₁ ± 1.96 × SE(β̂₁)}

Example: Test Scores and STR (California data)

Estimated regression line:

    TestScorehat = 698.9 − 2.28·STR

    SE(β̂₀) = 10.4    SE(β̂₁) = 0.52

95% confidence interval for β₁:

    {β̂₁ ± 1.96 × SE(β̂₁)} = {−2.28 ± 1.96 × 0.52} = (−3.30, −1.26)

Equivalent statements:
o The 95% confidence interval does not include zero
o The hypothesis β₁ = 0 is rejected at the 5% level
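The same interval can be reproduced in one line (a quick arithmetic check, nothing more):

    b1, se = -2.28, 0.52
    print((b1 - 1.96 * se, b1 + 1.96 * se))  # 95% CI: about (-3.30, -1.26)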

The convention for reporting estimated regressions

Put standard errors in parentheses below the estimates:

    TestScorehat = 698.9 − 2.28·STR
                   (10.4)  (0.52)

This expression means that:
o The estimated regression line is TestScorehat = 698.9 − 2.28·STR
o The standard error of β̂₀ is 10.4
o The standard error of β̂₁ is 0.52

OLS regression: Stata output

    regress testscr str, robust

    Regression with robust standard errors        Number of obs =     420
                                                  F(  1,   418) =   19.26
                                                  Prob > F      =  0.0000
                                                  R-squared     =  0.0512
                                                  Root MSE      =  18.581

    -------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         str |  -2.279808   .5194892   -4.38   0.000     -3.300945   -1.258671
       _cons |    698.933   10.36436   67.44   0.000      678.5602    719.3057
    -------------------------------------------------------------------------

so:

    TestScorehat = 698.9 − 2.28·STR
                   (10.4)  (0.52)

    t (for β₁ = 0) = −4.38
    p-value = 0.000
    95% confidence interval for β₁ is (−3.30, −1.26)
