
Economics 420

Introduction to Econometrics
Professor Woodbury
Fall Semester 2015
Simple Regression (Regression with One Regressor)
1. Introduction to linear regression
2. Defining the linear regression model
3. Estimating the linear regression model: method of moments
4. Algebraic properties and measures of fit of OLS
5. Sampling distribution of the OLS estimator
6. Hypothesis testing and confidence intervals for β₀ and β₁

4. Algebraic properties and measures of fit of OLS


Here are some definitions.

Fitted values or predicted values:

    Ŷᵢ = β̂₀ + β̂₁Xᵢ

This can also be written

    Ŷᵢ = b₀ + b₁Xᵢ (just different notation)

Actual values of Y:

    Yᵢ = β₀ + β₁Xᵢ + uᵢ

Residuals (deviations from the regression line):

    ûᵢ = Yᵢ − Ŷᵢ = Yᵢ − (β̂₀ + β̂₁Xᵢ)

And here are some algebraic properties.

The deviations from the regression line (residuals) sum to zero:

    Σᵢ₌₁ⁿ ûᵢ = 0

The correlation between the deviations and the regressors is zero:

    Σᵢ₌₁ⁿ Xᵢûᵢ = 0

The sample averages of Y and X lie on the regression line:

    Ȳ = β̂₀ + β̂₁X̄
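These properties are easy to check numerically. Here is a minimal Python sketch (the data are simulated for illustration only, not from the course) that computes the OLS estimates from the method-of-moments formulas and verifies all three properties:

    import numpy as np

    # Simulated data, for illustration only
    rng = np.random.default_rng(42)
    x = rng.uniform(10, 30, size=100)
    y = 700 - 2.3 * x + rng.normal(0, 15, size=100)

    # Method-of-moments (OLS) estimates
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    yhat = b0 + b1 * x   # fitted values
    uhat = y - yhat      # residuals

    print(uhat.sum())                        # ~0: residuals sum to zero
    print(np.sum(x * uhat))                  # ~0: residuals uncorrelated with X
    print(y.mean() - (b0 + b1 * x.mean()))   # ~0: the point (X-bar, Y-bar) lies on the line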

Again, the difference between the population regression line and the actual observations is an error (uᵢ).

[Figure (simple regression model): the population regression line Y = β₁ + β₂X, with observations P₁, ..., P₄ at X₁, ..., X₄; the vertical distance between each observation Pᵢ and the corresponding point Qᵢ on the population line (height β₁ + β₂Xᵢ) is the error uᵢ]

But the difference between the fitted regression line and the actual observations is a residual (eᵢ or ûᵢ).

[Figure (simple regression model): the fitted line Ŷ = b₁ + b₂X, with actual values Y at points P₁, ..., P₄ and fitted values Ŷ at points R₁, ..., R₄; the vertical distance Y − Ŷ = e at each Xᵢ is the residual eᵢ]

The discrepancies between the actual and fitted values of Y are known as the residuals.

The population regression line and the fitted regression line in one graph:

[Figure (simple regression model): the fitted line Ŷ = b₁ + b₂X and the population line Y = β₁ + β₂X drawn together; at X₄, the actual value at P₄ differs from the population line (point Q₄, height β₁ + β₂X₄) by the error u₄, and from the fitted line by the residual e₄]

Note: the values of the residuals ≠ the values of the errors!

CEO example again

Table 2.2 — Fitted values and residuals for the first 15 CEOs
(salary and salaryhat in $1,000s; roe in percent; uhat = salary − salaryhat)

obsno     roe    salary    salaryhat        uhat
    1    14.1      1095     1224.058   −129.0581
    2    10.9      1001     1164.854   −163.8542
    3    23.5      1122     1397.969   −275.9692
    4     5.9       578     1072.348   −494.3484
    5    13.8      1368     1218.508    149.4923
    6    20.0      1145     1333.215   −188.2151
    7    16.4      1078     1266.611   −188.6108
    8    16.3      1094     1264.761   −170.7606
    9    10.5      1237     1157.454    79.54626
   10    26.3       833     1449.773   −616.7726
   11    25.9       567     1442.372   −875.3721
   12    26.8       933     1459.023   −526.0231
   13    14.8      1339     1237.009    101.9911
   14    22.3       937     1375.768   −438.7678
   15    56.3      2011     2004.808    6.191895

Source: Wooldridge, Table 2.2 (© Cengage Learning, 2013)


Other Regression Statistics

A natural question is how well the regression line fits, or explains, the data.

Two regression statistics provide complementary measures of the quality of fit:

The regression R-squared measures the fraction of the variance of Y that is explained by X
o It is unit-free and ranges between zero (no fit) and one (perfect fit)

The standard error of the regression measures the fit: the typical size of a regression residual, in the units of Y

Goodness-of-fit measure (R-squared)

The total sum of squares represents the total variation in the dependent variable:

    SST = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²

The explained sum of squares represents the variation explained by the regression:

    SSE = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)²

The residual sum of squares represents the variation not explained by the regression:

    SSR = Σᵢ₌₁ⁿ ûᵢ²

The total variation can be decomposed:

    SST = SSE + SSR

So the total variation is the sum of the explained and unexplained variation.

Think of this as a generalization of the definition of the residual:

    ûᵢ = Yᵢ − Ŷᵢ  implies  Yᵢ = Ŷᵢ + ûᵢ

    Yᵢ = Ŷᵢ + ûᵢ

    SST = SSE + SSR

    Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² + Σᵢ₌₁ⁿ ûᵢ²

This gives us the R² statistic:

    R² = SSE/SST = 1 − SSR/SST

So R-squared measures the fraction of the total variation in (or variance of) Yᵢ that is explained by the regression, that is, by X.

But beware!
o A high R-squared does not mean that the regression has a causal interpretation
o And a low R-squared doesn't mean the regression is useless

What does it mean?

R² = 0 means SSE = 0, so X explains none of the variation of Y.

R² = 1 means SSE = SST, so Ŷᵢ = Yᵢ and X explains all of the variation of Y.

Almost always, 0 < R² < 1.

For regression with a single regressor (the case here), R² is the square of the correlation coefficient between X and Y.
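As a quick numerical check, here is a short Python sketch (continuing the simulated data from the earlier example) confirming that SST = SSE + SSR and that R² equals the squared correlation between X and Y in the single-regressor case:

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.uniform(10, 30, size=100)
    y = 700 - 2.3 * x + rng.normal(0, 15, size=100)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    yhat = b0 + b1 * x
    uhat = y - yhat

    sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
    sse = np.sum((yhat - y.mean()) ** 2)  # explained sum of squares
    ssr = np.sum(uhat ** 2)               # residual sum of squares

    r2 = sse / sst
    print(np.isclose(sst, sse + ssr))                    # True: SST = SSE + SSR
    print(np.isclose(r2, np.corrcoef(x, y)[0, 1] ** 2))  # True: R-squared = corr(X, Y)^2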

The Standard Error of the Regression (SER)

The standard error of the regression is (almost) the sample standard deviation of the regression errors (the OLS residuals):

    SER = s_û = √( Σᵢ₌₁ⁿ ûᵢ² / (n − 2) ) = √( SSR / (n − 2) )

This formula is similar to the formula for the sample standard deviation of Y.

It is, approximately, the square root of the average squared residual.

The Standard Error of the Regression (SER)

The SER:
o represents a typical deviation from the regression line
o has the units of u, which are the same as the units of Y
o measures the spread of the distribution of u
o measures the average size of the OLS residual: the average mistake made by the OLS regression line

Root mean squared error (RMSE)

This is closely related to the SER:

    RMSE = √( Σᵢ₌₁ⁿ ûᵢ² / n ) = √( SSR / n )

This measures the same thing as the SER; the minor difference is division by n instead of (n − 2).

In fact, Wooldridge doesn't even bother to distinguish between SER and RMSE (still another name for this is the standard error of the estimate).
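A short Python sketch (same simulated data as above) makes the distinction concrete; the ratio SER/RMSE is √(n/(n − 2)), about 1.01 when n = 100, so the two are nearly identical in large samples:

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.uniform(10, 30, size=100)
    y = 700 - 2.3 * x + rng.normal(0, 15, size=100)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    uhat = y - (b0 + b1 * x)

    n = len(y)
    ssr = np.sum(uhat ** 2)
    ser = np.sqrt(ssr / (n - 2))  # divides by n - 2 (degrees-of-freedom correction)
    rmse = np.sqrt(ssr / n)       # divides by n
    print(ser, rmse)              # nearly identical for n = 100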


Example of R² and SER

    TestScorehat = 698.9 − 2.28·STR
                   (10.4)  (0.52)

    R² = 0.05, SER = 18.6

In this case, STR explains only a small fraction of the variation in test scores, and the regression has a large standard error.

But the slope coefficient is large in a policy sense.

5. Sampling distribution of the OLS estimator

The population objects (parameters) β₀ and β₁ are unknown.

We estimate them by drawing a sample from a population and applying the estimators we have derived.

But the estimators themselves are random variables with a probability distribution (the sampling distribution) that describes the values they could take when we draw different random samples.

But to derive the sampling distributions of β̂₀ and β̂₁, we first need to make some assumptions.

Assumption SLR.1 (Linear in parameters)

In the population, the relationship between X and Y is linear:

    Yᵢ = β₀ + β₁Xᵢ + uᵢ

This seems more restrictive than it really is.

For example, if we choose Y to be the natural logarithm of earnings (rather than earnings), and X to be years of schooling, we will estimate a nonlinear relationship between schooling and earnings (examples to follow).

That is why the assumption is "linear in parameters."

Assumption SLR.2 (Random sampling)

We will choose n units of observation at random from the population of interest, and observe X and Y for each unit or entity.

Simple random sampling implies that

    (Xᵢ, Yᵢ), i = 1, ..., n

are independently and identically distributed (i.i.d.).

The population might consist of all workers in the country.

In the population, a linear relationship between earnings (or log of earnings) and years of education holds.

Draw a worker at random from the population.

Throw the worker back into the population and repeat: draw another worker at random ... and so on.

The result is twofold:
o each worker is drawn independently: there is no reason to suppose that, if a high-wage worker was drawn the first time, a high-wage worker will also be drawn the second time
o each worker is drawn from the same distribution

When would we NOT have random sampling?

The main place we will encounter non-i.i.d. sampling is when data are recorded over time (time series data):
o Time-series observations are not independent
o GDP this quarter (or this year) is related to GDP last quarter (or last year)
o The unemployment rate this month is related to the unemployment rate last month
o This is called autocorrelation: a variable is correlated with itself over time, and it will introduce some extra complications

Sample selection that is non-random is another case of non-random sampling, and it is very difficult to handle:
o For example, if we want to estimate the response of married women to a change in the wage, we are forced to use the sample of women who are already working (because they have wages we can observe)
o But working women may be different from non-working women in unobserved ways, so the estimates we obtain won't be applicable generally

Assumption SLR.3 (Sample variation in the explanatory variable)

The values of the explanatory variable are not all the same: there needs to be variation in X.

If there is no variation in X, then how could X possibly explain variation in Y?

If this assumption fails, it is impossible to study how different values of X lead to different values of the dependent variable.

Assumption SLR.4 (Zero conditional mean: E(u|X) = 0)

The value of the explanatory variable must contain no information about the mean of the unobserved factors.

We have already talked about this a lot ...

In the class size example, E(u|X) = 0 means

    E(Family Income|STR) = constant

so family income and STR are unrelated.

Assumption SLR.5 (Homoskedasticity: var(u|X) = σ²)

This is the constant variance assumption.

All it means is that the error term has the same variance for any value of X, the explanatory variable.

A regression of earnings (or wages) on education is a good example of a case that violates this assumption.

Why? Because earnings vary far more widely for highly educated workers than for less-educated workers.

See the figures (from Wooldridge).

The simple regression model under homoskedasticity

[Wooldridge, Figure 2.8: the conditional density f(y|x) has the same spread around the regression line E(y|x) = β₀ + β₁x at every value of x (shown at x₁, x₂, x₃)]

Heteroskedasticity: var(wage|educ) increases with educ

[Wooldridge, Figure 2.9: the conditional density f(wage|educ) spreads out as educ increases (shown at educ = 12 and 16) around E(wage|educ) = β₀ + β₁educ]

Here is a scatterplot of average hourly earnings vs. years of education from the 1999 Current Population Survey.

Homoskedasticity is often violated

So we need to have ways of handling heteroskedasticity.

For years, this was an active research question.
o We now have straightforward ways of handling it, and we will see these later

Applied researchers also handle it (even today) by transforming one or more of the variables in the model so the assumption is satisfied.
o For example, although var(wage|educ) increases with educ, var(log(wage)|educ) does not
o This is one reason labor economists have used the log transformation for decades

One additional assumption

One reputable text (Stock and Watson) adds the assumption that large outliers are unlikely:

    E(X⁴) < ∞ and E(Y⁴) < ∞

which means X and Y have finite kurtosis.

This assumption is plausible: standardized test scores, STR, and family income all have a finite domain.

For our purposes, this is important because outliers can wreak havoc with estimating a model, and you should always inspect your data to make sure some wild values aren't hiding there.

Data-entry errors are often a problem, but you may also learn something about the data (or where it came from) by asking questions about odd values.

Theorem 2.1 (Unbiasedness of OLS)

If Assumptions SLR.1–SLR.4 hold, the OLS estimators are unbiased for β₀ and β₁:

    E(β̂₀) = β₀  and  E(β̂₁) = β₁

This carries over from what we learned about estimating the population mean μ_Y:

    E(Ȳ) = μ_Y

so Ȳ is an unbiased estimator of μ_Y.

Two notes

First, this result does not require SLR.5 (homoskedasticity).

Second, it does not depend on the sample size.
o That is, β̂₁ is unbiased for β₁ even in a small sample
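To see unbiasedness in action, here is a small Monte Carlo sketch (all numbers are invented for illustration): fix β₀ = 1 and β₁ = 0.5, draw many random samples, and check that the average of the OLS slope estimates across samples is close to the true β₁, even with n as small as 50:

    import numpy as np

    rng = np.random.default_rng(0)
    beta0, beta1 = 1.0, 0.5   # true parameters (chosen arbitrarily)
    n, reps = 50, 10_000

    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 10, size=n)
        u = rng.normal(0, 2, size=n)   # errors with E(u|X) = 0
        y = beta0 + beta1 * x + u
        slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

    print(slopes.mean())   # close to 0.5: unbiased even in small samples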

Estimation: the sampling distributions of β̂₀ and β̂₁

We have estimators of β₀ and β₁, and we know that those estimators are unbiased.

But how much do these estimators deviate from their mean values, β₀ and β₁?

β̂₁ has a sampling distribution, just like Ȳ.

It is a random variable, and it will differ each time we draw a different sample from the population: remember, all we have are samples.

So we have two questions about β̂₁

What is its variance, var(β̂₁)? (This is a measure of sampling uncertainty.)

What is its sampling distribution in large samples?
o We aren't going to worry about small samples: we don't use them in economics, and it is harder to figure out the sampling distribution

We need to know this because we will need to do hypothesis tests about β₀ and β₁ using β̂₀ and β̂₁.

Start with the definition of variance:

    var(β̂₁) = E{[β̂₁ − E(β̂₁)]²} = E[(β̂₁ − β₁)²]

where the second equality is true because we know the OLS estimator of β₁ is unbiased.

Under Assumptions SLR.1–SLR.5, it turns out that

    var(β̂₁) = σ²_β̂₁ = var(uᵢ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = var(Yᵢ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

The last equality holds because

    var(uᵢ) = var(Yᵢ)

Why? Conditional on Xᵢ, the term β₀ + β₁Xᵢ is fixed, so all of the variation in Yᵢ comes from uᵢ.

Note that

    var(uᵢ) = σᵤ² = σ²  and  var(Yᵢ) = σ_Y² = σ²

This is just notation, but it is important, because σ² is often called the error variance or disturbance variance.

So we have another theorem.

Theorem 2.2 (Sampling variances of the OLS estimators)

Under Assumptions SLR.1–SLR.5,

    var(β̂₁) = σ²_β̂₁ = σ² / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = σ² / SSTₓ

This is easy! It just says the variance of β̂₁ is the error variance divided by the total sum of squares of X.

What does this tell us?

    var(β̂₁) = var(uᵢ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = var(Yᵢ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

First, var(β̂₁) is smaller when var(Yᵢ) or var(uᵢ) or σ² is smaller.

This makes sense: the smaller the variance of Y, the less likely it is we will draw extreme samples.

Why? Because extreme samples are simply less likely to occur.

Also, smaller errors mean the data will be closer to the population regression line, so the estimate of the slope will be more precise.

    var(β̂₁) = var(uᵢ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = var(Yᵢ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

Second, the larger the variation in X (the explanatory variable), the smaller the variance of the sampling distribution of β̂₁.

This also makes sense: if the denominator of the expression is larger, then the expression is smaller.

The denominator is larger if X varies more, that is, if we observe X over a wider range of values, so that we have more information about the relationship between X and Y.
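A quick simulation sketch (invented numbers, for illustration) shows both effects at once: the spread of β̂₁ across repeated samples shrinks when the error variance is smaller and when X varies more:

    import numpy as np

    rng = np.random.default_rng(1)

    def slope_sd(x_spread, error_sd, n=50, reps=5000):
        """Standard deviation of the OLS slope across simulated samples."""
        slopes = np.empty(reps)
        for r in range(reps):
            x = rng.uniform(0, x_spread, size=n)
            y = 1.0 + 0.5 * x + rng.normal(0, error_sd, size=n)
            slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        return slopes.std()

    print(slope_sd(x_spread=10, error_sd=2))  # baseline
    print(slope_sd(x_spread=10, error_sd=1))  # smaller error variance: smaller spread
    print(slope_sd(x_spread=20, error_sd=2))  # more variation in X: smaller spread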

Estimating the error variance

We're almost there, but we need an estimate of the error variance

    σ² = var(uᵢ) = var(Yᵢ)

because σ² is an unknown population parameter.

Solution: use the residuals.

By definition,

    σ² = E(u²)

We don't observe the errors uᵢ, but we can use the residuals to compute an estimator of σ²:

    σ̂² = (1/n) Σᵢ₌₁ⁿ ûᵢ²

We can also write this in terms of the sum of squared residuals:

    σ̂² = (1/n) Σᵢ₌₁ⁿ ûᵢ² = SSR/n

Footnote: we actually should divide by (n − 2), but again, in large samples it makes no difference.

The main thing is that we now have a third theorem.

Theorem 2.3 (Unbiased estimation of σ²)

Under Assumptions SLR.1–SLR.5,

    E(σ̂²) = σ²

That is,

    σ̂² = (1/(n − 2)) Σᵢ₌₁ⁿ ûᵢ² = SSR/(n − 2)

is an unbiased estimator of σ².

(Exact unbiasedness requires the divisor (n − 2), not n, but in a large sample this is not an issue.)

So we know the variance of β̂₁ and how to calculate it:

    var(β̂₁) = σ²_β̂₁ = σ² / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

To calculate this, we plug in

    σ̂² = (1/n) Σᵢ₌₁ⁿ ûᵢ² = SSR/n

for σ², which gives the estimated variance

    σ̂²_β̂₁ = σ̂² / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

So we have an estimator for var(β̂₁).
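Continuing the simulated data from earlier, a minimal sketch of this plug-in calculation (using the large-sample divisor n, as on the slide) might look like:

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.uniform(10, 30, size=100)
    y = 700 - 2.3 * x + rng.normal(0, 15, size=100)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    uhat = y - (b0 + b1 * x)

    sigma2_hat = np.sum(uhat ** 2) / len(y)   # SSR/n (use n - 2 in small samples)
    var_b1_hat = sigma2_hat / np.sum((x - x.mean()) ** 2)
    se_b1 = np.sqrt(var_b1_hat)               # standard error of the slope
    print(var_b1_hat, se_b1)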

The sampling distribution of β̂₁ is normal

How is β̂₁ distributed?

It turns out that, if the assumptions we have made hold, then in large samples,

    β̂₁ ~ N(β₁, σ²_β̂₁)

which just says that the OLS estimator of β₁ is normally distributed with mean β₁ and the variance we have worked out so laboriously.

Theorem 4.1 (The sampling distribution of β̂₁ is normal)

    β̂₁ ~ N(β₁, σ²_β̂₁)

Theorem 4.1 in Wooldridge is a big deal because it means we can do hypothesis testing and construct confidence intervals in the familiar way.

What have we learned from all this?

1. If the first four assumptions hold, the OLS estimators are unbiased for β₀ and β₁:

    E(β̂₀) = β₀  and  E(β̂₁) = β₁

2. The variance of β̂₁ is given by the formula above:

    var(β̂₁) = σ²_β̂₁ = σ² / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

3. The distribution of β̂₁ is normal.

6. Hypothesis testing and confidence intervals for β₀ and β₁

Suppose an angry parent says that reducing the number of students in a class has no effect on learning (or test scores), and spending money on more teachers is a waste.

The null hypothesis is

    H₀: β₁ = 0

where β₁ is the coefficient on STR in the regression of test scores on STR.

How do we test this hypothesis using data?

Null hypothesis with a two-sided alternative:

    H₀: β₁ = 0  vs.  H₁: β₁ ≠ 0

or, more generally,

    H₀: β₁ = β₁,₀  vs.  H₁: β₁ ≠ β₁,₀

where β₁,₀ is the hypothesized value of β₁ under the null.

Null hypothesis and one-sided alternative:

    H₀: β₁ = β₁,₀  vs.  H₁: β₁ > β₁,₀
    OR  H₀: β₁ = β₁,₀  vs.  H₁: β₁ < β₁,₀

In economics, it is almost always possible to come up with stories in which an effect could go either way, so it is standard to focus on two-sided alternatives.

Two-sided tests are more conservative than one-sided tests, in that it is harder to reject the null.

Recall hypothesis testing for the population mean using Ȳ:

    t = (Ȳ − μ_Y,₀) / (s_Y / √n)

We then reject the null hypothesis if |t| > 1.96.

In general, the t-statistic has the following form:

    t = (estimator − hypothesized value) / (standard error of the estimator)

where the standard error of the estimator is the square root of an estimator of the variance of the estimator.

Apply this to a hypothesis about β₁ and we have:

    t = (β̂₁ − β₁,₀) / SE(β̂₁)

where β₁,₀ is the value of β₁ hypothesized under the null.

More often than not, the null value is zero, so β₁,₀ = 0.

These are the t-statistics that Stata (and all other statistical software) spits out.

But you should think about your question: the null could often be something other than 0.

What is SE(β̂₁)?

SE(β̂₁) is the square root of our estimator of the variance of the sampling distribution of β̂₁:

    SE(β̂₁) = √(σ̂²_β̂₁)

That is why we went to all that effort earlier!

Two-sided alternative

Return to the calculation of the t-statistic:

    t = (β̂₁ − β₁,₀) / SE(β̂₁) = (β̂₁ − β₁,₀) / √(σ̂²_β̂₁)

With a two-sided alternative (and a large sample), we reject at the 5% significance level if |t| > 1.96:

    H₀: β₁ = 0  vs.  H₁: β₁ ≠ 0

In the figure, c (the critical value) = 1.96.

Rejection region for a 5% significance level test against the two-sided alternative

[Wooldridge, Figure C.6: a standard normal density with rejection regions in both tails; area = .025 in each tail beyond ±c, area = .95 between −c and c]

One-sided alternative

    t = (β̂₁ − β₁,₀) / SE(β̂₁) = (β̂₁ − β₁,₀) / √(σ̂²_β̂₁)

With a one-sided alternative (and a large sample), we reject at the 5% significance level if t > 1.65 when we are testing

    H₀: β₁ = β₁,₀  vs.  H₁: β₁ > β₁,₀

OR if t < −1.65 when we are testing

    H₀: β₁ = β₁,₀  vs.  H₁: β₁ < β₁,₀

In the figure, c (the critical value) = 1.65.

Rejection region for a 5% significance level test against the one-sided alternative β₁ > β₁,₀

[Wooldridge, Figure C.5: a standard normal density with a single rejection region in the right tail; area = .05 beyond the critical value c, area = .95 below c]

Example: Test Scores and STR (California data)

Estimated regression line:

    TestScorehat = 698.9 − 2.28·STR

Regression software reports the standard errors:

    SE(β̂₀) = 10.4    SE(β̂₁) = 0.52

so the t-statistic testing β₁,₀ = 0 is:

    t = (β̂₁ − β₁,₀) / SE(β̂₁) = (−2.28 − 0) / 0.52 = −4.38

The 1% two-sided critical value is 2.58, and |−4.38| > 2.58, so we reject the null at the 1% significance level.

Alternatively, we can compute the p-value.

Remember, the p-value is the smallest significance level at which the null hypothesis would be rejected, given the observed t-statistic.

For a two-sided hypothesis:

    p-value = Pr{|t| > |t(observed)|}
            = probability in the tails of the normal distribution outside |t(observed)|

Again, we will always assume a large sample, and typically n = 50 is large enough.

Stata gives you the p-value for each estimated coefficient.
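As a check on this arithmetic, here is a minimal Python sketch (using scipy's standard normal CDF, with the estimates from the regression above):

    from scipy.stats import norm

    t = (-2.28 - 0) / 0.52          # t-statistic for H0: beta1 = 0
    p = 2 * (1 - norm.cdf(abs(t)))  # two-sided p-value, large-sample normal
    print(round(t, 2), p)           # about -4.38, p near 0.00001 (reported as 0.000)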


Confidence intervals

The estimated β̂₁ is called a point estimate of β₁.

A confidence interval is called an interval estimate of β₁.

In general, if the sampling distribution of an estimator is normal for large n, then a 95% confidence interval can be constructed as

    estimator ± 1.96 × standard error

So a 95% confidence interval for β₁ is:

    {β̂₁ ± 1.96 × SE(β̂₁)}

Example: Test Scores and STR (California data)

Estimated regression line:

    TestScorehat = 698.9 − 2.28·STR

    SE(β̂₀) = 10.4    SE(β̂₁) = 0.52

95% confidence interval for β₁:

    {β̂₁ ± 1.96 × SE(β̂₁)} = {−2.28 ± 1.96 × 0.52} = (−3.30, −1.26)

Equivalent statements:
o The 95% confidence interval does not include zero
o The hypothesis β₁ = 0 is rejected at the 5% level
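The same interval can be reproduced in one line (a quick arithmetic check, nothing more):

    b1, se = -2.28, 0.52
    print((b1 - 1.96 * se, b1 + 1.96 * se))  # 95% CI: about (-3.30, -1.26)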

The convention for reporting estimated regressions

Put standard errors in parentheses below the estimates:

    TestScorehat = 698.9 − 2.28·STR
                   (10.4)  (0.52)

This expression means that:
o The estimated regression line is TestScorehat = 698.9 − 2.28·STR
o The standard error of β̂₀ is 10.4
o The standard error of β̂₁ is 0.52

OLS regression: Stata output

    regress testscr str, robust

    Regression with robust standard errors        Number of obs =     420
                                                  F(  1,   418) =   19.26
                                                  Prob > F      =  0.0000
                                                  R-squared     =  0.0512
                                                  Root MSE      =  18.581

    -------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         str |  -2.279808   .5194892   -4.38   0.000     -3.300945   -1.258671
       _cons |    698.933   10.36436   67.44   0.000      678.5602    719.3057
    -------------------------------------------------------------------------

so:

    TestScorehat = 698.9 − 2.28·STR
                   (10.4)  (0.52)

    t (for β₁ = 0) = −4.38
    p-value = 0.000
    95% confidence interval for β₁ is (−3.30, −1.26)
