Empirical Methods MW24.1

[Title-page figure: scatterplot of testscr against str]
Call:
lm(formula = testscr ~ str)

Residuals:
    Min      1Q  Median      3Q     Max
-47.727 -14.251   0.483  12.822  48.540

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)
(Intercept) 698.9330     9.4675  73.825    < 2e-16 ***
str          -2.2798     0.4798  -4.751 0.00000278 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124,  Adjusted R-squared: 0.04897
F-statistic: 22.58 on 1 and 418 DF,  p-value: 0.000002783

Oliver Kirchkamp
© Oliver Kirchkamp, 6 February 2015 09:49:38

This handout is a summary of the slides we use in the lecture. The handout is
perhaps not very helpful unless you also attend the lecture. This handout is also
not supposed to replace a book. The principal text for the lecture is Stock and
Watson's book. All formulas we use in the lecture can be found there (with fewer
mistakes). Please expect slides and small parts of the lecture to change from time
to time, and print only the material you currently need.
Homepage: http://www.kirchkamp.de/oekonometrie/
Schedule: Lecture: Fri, 10:15-11:45, HS5
Exercise: Fri, 14:15-15:45, SR207
Mon, 16:15-17:45, HS4
Exam: Wed. 18.2., 8-10 (please check homepage!)
Literature:
* Stock and Watson; Introduction to Econometrics, Pearson, 2006
* Studenmund; Using Econometrics, Pearson, 2006
* Barreto and Howland; Introductory Econometrics, Cambridge, 2006


Software:
* R
  - free
  - wide range of applications
  - Helpful hints, links to the documentation, etc., can be found on the homepage.
  In the lecture we will illustrate many things with R. You should try these
  examples on your own computer. Use the online help to look up unknown commands.
* SAS, STATA, EViews, TSP, SPSS, . . .
  - expensive
  - more specialised
  - more heterogeneous syntax

Contents

1 Introduction
  1.1 What is the purpose of economics
  1.2 Econometrics uses data to measure causal relationships
  1.3 Learning aims
  1.4 Example
  1.5 Plan
  1.6 Exercises

2 Statistical theory
  2.1 Population
  2.2 Sample
  2.3 Random variables and distributions
    2.3.1 Conditional expected value and conditional variance
  2.4 Samples of a population
  2.5 Estimations
    2.5.1 The distribution of Ȳ
    2.5.2 Characteristics of sampling distributions
    2.5.3 Why should we use Ȳ to estimate μY?
    2.5.4 Testing hypotheses
    2.5.5 Estimating the variance of Ȳ
    2.5.6 Calculating the p-value with the help of an estimated σ²Y
    2.5.7 Relation between p-value and the level of significance
    2.5.8 What happened to the t table and the degrees of freedom?
    2.5.9 A comment
    2.5.10 Another problem
  2.6 Confidence intervals
  2.7 An alternative: The Bayesian Approach
  2.8 Summary
  2.9 Exercises

3 Linear regression with a single regressor
  3.1 Measures of determination
  3.2 OLS assumptions
    3.2.1 Digression: The Existence of Moments
  3.3 The distribution of the OLS estimator
  3.4 Distribution of β̂1
  3.5 Distribution of β̂0
  3.6 Hypothesis tests for β̂1
  3.7 Confidence intervals and p-values
  3.8 Bayesian Regression
  3.9 Reporting estimation results
  3.10 Continuous and nominal variables
  3.11 Heteroscedastic and homoscedastic error terms
    3.11.1 An example from labour market economics
    3.11.2 Back to Caschool
    3.11.3 What do we get from homoscedasticity?
    3.11.4 Summary: Homo-/Heteroskedasticity
  3.12 Extended OLS assumptions
  3.13 OLS problems
    3.13.1 Alternatives to OLS
    3.13.2 Robust regression
    3.13.3 A Bayesian approach to robust regression
  3.14 Exercises

4 Models with more than one independent variable (multiple regression)
  4.1 Matrix notation
    4.1.1 How to do calculations with matrices
    4.1.2 Calculations with matrices in R
  4.2 Deriving the OLS estimator in matrix notation
  4.3 Specification errors
    4.3.1 Examples
    4.3.2 Specification errors: generalization
  4.4 Assumptions for the multiple regression model
  4.5 The distribution of the OLS estimator in a multiple regression
  4.6 Multicollinearity
    4.6.1 Example 2
    4.6.2 Example 3
    4.6.3 Which regressor is responsible for the multicollinearity?
    4.6.4 Multicollinearity of dummy variables
  4.7 Specification Errors: Summary
    4.7.1 The variance of β̂
    4.7.2 Imperfect multicollinearity
    4.7.3 Hypothesis tests
    4.7.4 Digression: Multiplication
    4.7.5 Extending the estimation equation by adding expenditure per student
  4.8 Joint Hypotheses
    4.8.1 F statistic for two restrictions
    4.8.2 More than two restrictions
    4.8.3 Special cases
    4.8.4 Special case: Homoscedastic error terms
  4.9 Restrictions with more than one coefficient
  4.10 Model specification
    4.10.1 Measure R²
    4.10.2 Measure contribution to R²
    4.10.3 Information criteria
    4.10.4 t-statistic for individual coefficients
    4.10.5 Bayesian Model Comparison
    4.10.6 Comparing models
    4.10.7 Discussion
  4.11 Exercises

5 Non-linear regression functions
  5.1 Functional forms
    5.1.1 Polynomials
    5.1.2 Logarithmic Models
    5.1.3 Logarithmic Models: linear-log
    5.1.4 Logarithmic Models: log-linear
    5.1.5 Logarithmic Models: log-log
    5.1.6 Comparison of the three logarithmic models
    5.1.7 Generalization: Box-Cox
    5.1.8 Other non-linear functions
    5.1.9 Non-linear least squares
  5.2 Interactions
    5.2.1 Interactions between binary variables
    5.2.2 Interaction between a binary and a continuous variable
    5.2.3 Application: Gender gap
    5.2.4 Interaction between two continuous variables
  5.3 Non-linear interaction terms
    5.3.1 Non-linear interaction terms
    5.3.2 Summary
  5.4 Exercises

6 Evaluating multiple regressions
  6.1 Introduction
    6.1.1 Can we evaluate multiple regressions systematically?
    6.1.2 Internal and external validity
  6.2 Internal validity - Problems
    6.2.1 Omitted Variable Bias
    6.2.2 Incorrect specification of the functional form
    6.2.3 Errors in the variables
    6.2.4 Sample selection bias
    6.2.5 Simultaneous causality
    6.2.6 Heteroscedasticity and correlation of error terms
  6.3 OLS and prediction
  6.4 Comparison of Caschool and MCAS
    6.4.1 Internal validity
    6.4.2 External validity
    6.4.3 Result

1 Introduction
What is an interesting economic theory?
Claim:
For each economic theory there is an alternative theory predicting the opposite.
Economic theories often suggest relationships, frequently with policy implications, but these relationships are rarely quantified.
How large is the increase in the performance of students when courses are
smaller?
How large is the increase in your income when you study for another year?
What is the elasticity of demand for cigarettes?
By how much does the GDP increase if the ECB raises interest rates by 1%?

1.1 What is the purpose of economics


Developing theories
Testing theories
Using theories for prediction

1.2 Econometrics uses data to measure causal relationships


Ideal approach: controlled experiment (control group/treatment group)

* By how much does the performance of students increase when courses become smaller?
* How much more do you earn if you decide to spend an additional year at university?
* What is the price elasticity of cigarettes?
* By how much does the GDP increase if the ECB reduces the interest by one percentage point?

hard to do


Most of the data we have are from uncontrolled processes.


Student test scores

Incomes of alumni
Time series data about monetary policy
Problems related to data from uncontrolled processes
Unobserved factors

Simultaneous causality
Coincidence ≠ causality

1.3 Learning aims

Application of econometric methods

Quantifying causal effects using observational data from uncontrolled


processes
Extrapolating times series

Evaluating the econometric work of others

1.4 Example
How does learning success change when class size is reduced by one student? What if class size is reduced by eight students?
Can we answer this question without using data?
E.g. test scores from 420 school districts in California from 1998-1999

str = student-teacher ratio = number of students in the district / full-time equivalent teachers
testscr = 5th-grade test score (Stanford-9 achievement test)

For our examples we use the statistical software R.


Some components of R are contained in so-called libraries. Together these libraries cover a huge
functional range. For our introductory examples we use only a few of these libraries. Additional libraries
can be loaded with the library command. RSiteSearch and the R Site Search Extension for Firefox
help us to determine which library offers a certain functionality. Here we are going to use Ecdat, which
contains several econometric data sets, and the library car, which offers a number of handy econometric
functions.
library(Ecdat)
library(car)


The command data enables access to the data set contained in a library.

data(Caschool)

We can now access this data set. summary displays an overview of the statistical characteristics of the data
set.

names(Caschool)
 [1] "distcod"  "county"   "district" "grspan"   "enrltot"  "teachers"
 [7] "calwpct"  "mealpct"  "computer" "testscr"  "compstu"  "expnstu"
[13] "str"      "avginc"   "elpct"    "readscr"  "mathscr"

summary(Caschool$str)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  14.00   18.58   19.72   19.64   20.87   25.80

It is quite cumbersome to write the name of a data set - here Caschool - time and time again. Whenever
we intend to work with the same data set for a while we can use attach(Caschool). This will tell R to
look at Caschool first, whenever we ask for a variable.

attach(Caschool)

Using summary is much easier now.

summary(str)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  14.00   18.58   19.72   19.64   20.87   25.80

hist draws a histogram.

hist(str)

[Figure: Histogram of str, with Frequency on the vertical axis and str from 14 to 26 on the horizontal axis]

scatterplot draws a scatterplot.

library(car)
scatterplot(testscr ~ str)

[Figure: scatterplot of testscr (620 to 700) against str (14 to 26)]


Test results testscr seem to be getting worse as student teacher ratios str are
getting higher.
Is it possible to show that districts with low student teacher ratios str have
higher test scores testscr?
Compare average test scores in districts with small str to test scores in districts with high str (estimation)
Test the null hypothesis that mean test scores are the same against the alternative hypothesis that they are not (hypothesis testing)
Estimate an interval for the difference of the mean test scores (confidence
interval)
Is the difference large enough
* for a school reform?
* to convince parents?
* to convince the school authority?
In the following example we want to split up the data set into two pieces: schools with a student teacher
ratio above and below 20. In other words, we will introduce a nominal variable. In R a nominal variable
is called a factor, and factor converts a continuous variable (str) into a factor.
t.test performs a Student-t test to compare mean values. We write testscr ~ large.
The variable to be tested is given before the tilde. The factor describing the two groups to be compared is
given after the tilde.
large <- str>20
t.test(testscr ~ large)

Welch Two Sample t-test

data:  testscr by large
t = 3.9231, df = 393.721, p-value = 0.0001031
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  3.584445 10.785813
sample estimates:
mean in group FALSE  mean in group TRUE
           657.1846            649.9994

This simple test tells us that there is a significant difference of the test scores
testscr between large and small school groups.
We can estimate the difference between the two groups, we can test a hypothesis,
and we can calculate a confidence interval.
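The two group means reported by t.test can also be computed directly; a minimal sketch, assuming Caschool is attached as above:

```r
# Sketch: mean of testscr by group (small vs. large student teacher ratio)
large <- str > 20
tapply(testscr, large, mean)   # FALSE: ~657.18, TRUE: ~650.00
```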


1.5 Plan
You already know estimates, hypothesis tests and confidence intervals.
We will generalize these concepts for regressions.
Before we do this, we will take a brief look at the underlying theory.

1.6 Exercises
1. Econometrics
What is econometrics?

What is the aim of econometrics?

Which kind of questions can be answered with econometric tools? Give
some examples of questions that can possibly be answered with econometrics.
Give examples from different fields, e.g. policy advice, science, marketing, . . .
2. Data sources
What are typical data sources for econometric analysis?

Which problems might be related to the different data sources?

2 Statistical theory
Population, random variable, distribution
Moments of a distribution (mean value, variance, standard deviation, covariance, correlation)
Conditional distribution, conditional mean values
Distribution of a random sample

2.1 Population
The set of all entities which could theoretically be observed (e.g. all imaginable school districts at all points in time under all imaginable conditions)
Quite often we assume that the population is of infinite size (or at least very
large)
Usually we know something about A (our sample) and we want to say
something about B. We can do this, if we assume that both A and B are
drawn from the same population.


2.2 Sample
A part of the population that we observe (e.g. Californian school districts in
1998 (and under the conditions of this year))

2.3 Random variables and distributions


random variable (RV) = numerical summary of random event
discrete (categorial, factor) variable / continuous variable
one-dimensional variable / multi-dimensional variable
Describing random variables using distributions
Probabilities of events P(x)
(when dealing with discrete RV)
Cumulative distribution function F(x)
(when dealing with one-dimensional RV)
Probability density function f(x)
(when dealing with continuous RV)
Properties of random variables
* Expected value E(X), μX: (theoretical) mean value of X,
  the mean value of X for the entire population
* Variance E((X − μX)²) = σ²X
  Measure of the mean of the squared deviation from the mean of the distribution
* Standard deviation √variance = σX

Generally,
  X ~ F(θ1, θ2, . . .)
where θ1, θ2, . . . are parameters of the distribution.
Some distributions (e.g. the normal distribution) are characterised by μ and σ²:
  X ~ N(μ, σ²)

STATISTICAL THEORY

13

Joint distribution of random variables

* Random variables X and Z have a joint distribution
* Covariance of X and Z: cov(X, Z) = E((X − μX)(Z − μZ)) = σXZ
* Covariance is a measure of the linear dependence between X and Z
* Positive covariance = positive dependence between X and Z
* If X and Z are independently distributed, then cov(X, Z) = 0 (but not vice versa!)
* The covariance between a random variable and itself is its variance:
  cov(X, X) = E((X − μX)(X − μX)) = E((X − μX)²) = σ²X
* The correlation coefficient can be written as a fraction of covariances:
  cor(X, Z) = cov(X, Z) / √(var(X) · var(Z)) = σXZ / (σX σZ)

cov calculates covariances; cor calculates correlation coefficients.


cov(str,testscr)
[1] -8.159324
cov(str,str)
[1] 3.578952
var(str)
[1] 3.578952
cor(str,testscr)
[1] -0.2263628

cor(X, Z) ∈ [−1, +1]

* cor(X, Z) = +1: perfectly positive linear dependence
* cor(X, Z) = −1: perfectly negative linear dependence
* cor(X, Z) = 0: no linear dependence


Conditional distributions and conditional mean values


Conditional distributions
The distribution of Y given the value of another random variable X
E.g. the distribution of test scores testscr given that student teacher
ratio str< 20
library(lattice)
densityplot(~testscr, groups=large, auto.key=list(columns=2,cex=.7))

[Figure: conditional densities of testscr for large = FALSE and large = TRUE]

E.g. wages of men and women

data(Wages)

The data set Wages contains, among others, the following two variables:
exp
years of full-time work experience
lwage logarithm of wage
xyplot(lwage ~ exp,group=sex,data=Wages,auto.key=list(corner=c(1,1)))

[Figure: scatterplot of lwage against exp, grouped by sex (female, male)]

2.3.1 Conditional expected value and conditional variance


Now, we have two data sets in memory. There are several possibilities of telling R which data set we want
to use, when we type a command:
Above, we have already used attach(Caschool). It tells R to first look for a variable in Caschool. We
can use detach to remove this search directive and issue attach(Wages) to tell R to search in Wages from
now on.
Alternatively, we can use statements like Caschool$large to call the variable large in the data set
Caschool and Wages$exp to call the variable exp in the data set Wages.
Furthermore, there is with(Wages,...) which tells R that we want to use Wages for everything we
write in parentheses after with.
In this example, we learn to use subset to select a part of a data set; for example, we select all schools
that meet the condition str<20.

Conditional expected value: E(X|Y = y)  (important notation)

Conditional variance: variance of the conditional distribution


Examples
E(testscr|str < 20) = expected mean value of the test scores of all districts
with small group size
with(subset(Caschool,str<20),mean(testscr))
[1] 657.3513


with(subset(Caschool,str>=20),mean(testscr))
[1] 649.9788

Wage of female workers (X=wage, Y=sex)


with(subset(Wages,sex=="female"),mean(lwage))
[1] 6.255308
with(subset(Wages,sex=="male"),mean(lwage))
[1] 6.729774

Recovery rate of all patients who have received a certain drug (X=recovery,
Y=drug)
If E(X|Y = y) = const for all values of y (does not depend on y), then cor(X, Y) = 0
(not vice versa!!!)

2.4 Samples of a population


Consider the sample Y1 . . . Yn of a population Y
Before the sample is drawn the Y1 . . . Yn are random.
After the sample is drawn, the values of Y1 . . . Yn are realised; they are
fixed numbers and are not random anymore.
Y1 . . . Yn is the data set. Yi is the value of Y for observation i (person i,
district i etc.)
If we draw a sample randomly, it is true that

Two observations are drawn randomly, thus the value of Yi contains no


information about the value of Yj .

Yi and Yj are independently distributed


Yi and Yj were drawn from the same distribution. Thus, they are identically distributed
We say that Yi and Yj are independently and identically distributed (=
i.i.d.).
More generally speaking: Yi are i.i.d. for i = 1, . . . , n.
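Random sampling of this kind is easy to mimic in R; a minimal sketch, assuming Caschool is loaded as above:

```r
# Sketch: draw a random sample of n = 10 districts; each run of sample()
# without set.seed() would give a different realisation of Y1...Yn
set.seed(1)   # only to make the example reproducible
Caschool[sample(nrow(Caschool), 10), "testscr"]
```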


2.5 Estimations
In econometrics we often estimate unknown quantities. Let's suppose we have a
sample Y1 . . . Yn of a random variable Y. We start with a simple problem: How
can we estimate the mean value of Y (not the mean value of Y1 . . . Yn )?
Idea:
* We could simply use the mean value Ȳ of the sample Y1 . . . Yn
* We could simply use the first observation Y1
* We could use the median of the sample Y1 . . . Yn

2.5.1 The distribution of Ȳ

* The observations of the sample are drawn randomly.
* Thus, the values of Y1 . . . Yn are random.
* Thus, functions of Y1 . . . Yn are random (e.g. the mean value).
  If we had drawn a different sample, the function (e.g. the mean value)
  would have a different value.
* We call the distribution of Ȳ over several possible samples the sampling distribution.
* The mean value and the variance of Ȳ are the mean value and the variance
  of the sampling distribution, E(Ȳ) and var(Ȳ).


2.5.2 Characteristics of sampling distributions

Expected value of Ȳ:
  E(Ȳ) = μY, i.e. Ȳ is an unbiased estimator of μY

Variance of Ȳ: how does the variance depend on the size of the sample n?
  var(Ȳ) = σ²Y / n

Question: Does Ȳ converge to μY if n is large?
Law of large numbers:


Ȳ is a consistent estimator of μY.
Formally: If Y1 , . . . , Yn are i.i.d. and σ²Y < ∞, then Ȳ is a consistent estimator of μY,
i.e.

  ∀ ε > 0 : lim (n→∞) Pr(|Ȳ − μY| < ε) = 1

We can also say: Ȳ →p μY.
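Both unbiasedness and the 1/n shrinkage of the variance are easy to check by simulation; a minimal sketch with hypothetical population parameters mu and sigma:

```r
# Sketch: simulate the sampling distribution of the mean.
# mu and sigma are assumed (hypothetical) population parameters.
set.seed(123)
mu <- 654; sigma <- 19; n <- 420
ybar <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
mean(ybar)    # close to mu (unbiasedness)
var(ybar)     # close to sigma^2/n, here about 0.86
```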

Central Limit Theorem:
If Y1 , . . . , Yn are i.i.d. and 0 < σ²Y < ∞ and n is large, then the distribution of Ȳ
approximates a normal distribution:

  Ȳ ≈ N(μY, σ²Y / n)

Of course, R knows distributions, too. In the following example we draw two density functions of a binomially distributed variable using dbinom.

x <- 1:10
plot(x/10, dbinom(x, size=10, prob=0.8))

x <- 750:850
plot(x/1000, dbinom(x, size=1000, prob=0.8))

[Figure: the two binomial densities side by side; left: size = 10, right: size = 1000]

The distribution on the left, which is based on a small sample size, does not
quite look like a normal distribution. The distribution on the right is based on a
much larger sample size (n = 1000) and has a lot more similarities with a normal
distribution.
2.5.3 Why should we use Ȳ to estimate μY?

* Ȳ is unbiased: E(Ȳ) = μY
* Ȳ is consistent: Ȳ →p μY
* Ȳ is the least squares estimator for μY:
  Ȳ is the solution of min_x Σ_{i=1}^n (Yi − x)²
* Ȳ has a smaller variance than all other linear unbiased estimators:
  For any estimator μ̂Y = (1/n) Σ_{i=1}^n ai Yi with {ai} such that μ̂Y is unbiased,
  it is true that var(Ȳ) ≤ var(μ̂Y)
* However, there are non-linear estimators, too. . .
2.5.4 Testing hypotheses
Is 652 the mean testscr?
t.test(Caschool$testscr, mu=652)

One Sample t-test


data: Caschool$testscr
t = 2.3196, df = 419, p-value = 0.02084
alternative hypothesis: true mean is not equal to 652
95 percent confidence interval:
652.3291 655.9840
sample estimates:
mean of x
654.1565

H0 : E(Y) = μY,0 versus H1 : E(Y) ≠ μY,0  (two-sided test)

H0 : E(Y) = μY,0 versus H1 : E(Y) > μY,0  (one-sided test)

H0 : E(Y) = μY,0 versus H1 : E(Y) < μY,0  (one-sided test)

Level of significance of a test = predefined probability of rejecting the null
hypothesis, despite it being true.



= Probability of drawing a sample Y1 , . . . , YN ,
p-value of a statistic (e.g. for Y)
that is at least as averse to the null hypothesis as our data given that the
null hypothesis is true.
p-value = PrH (|Y Y,0 | > |Y sample Y,0 |)
e.g. with Y:
0

with Y sample being the value Y for our data.

To calculate the p-value we have to know the sampling distribution of Y.


That is complicated if n is small.
If n is large, we can use the normal distribution to approximate the sampling
(Central Limit Theorem)
distribution of Y.

  p-value = Pr_H0( |Ȳ − μY,0| > |Ȳ^sample − μY,0| )                             (1)
          = Pr_H0( |(Ȳ − μY,0)/(σY/√n)| > |(Ȳ^sample − μY,0)/(σY/√n)| )         (2)
          = Pr_H0( |(Ȳ − μY,0)/σȲ| > |(Ȳ^sample − μY,0)/σȲ| )                   (3)

[Figure: standard normal density; the tail areas F(−|g|) beyond ±|g| are shaded]

If n is large: p-value ≈ the probability that an N(0, 1)-distributed random
variable is outside of ±|(Ȳ^sample − μY,0)/σȲ|.

Statistic: g = (x̄ − μ0)/(σ/√n)

In practice σY is unknown; it must be estimated.


2.5.5 Estimating the variance of Ȳ

  s²Y = 1/(n−1) · Σ_{i=1}^n (Yi − Ȳ)² = sample variance of Y

If Y1 , . . . , Yn are i.i.d. and E(Y⁴) < ∞, then s²Y →p σ²Y.

Why does the law of large numbers apply?
* s²Y is the mean value of a sample.
* We demand E(Y⁴) < ∞ because the mean value is not calculated from the Yi
  but from their squares.
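R's var() uses exactly this 1/(n−1) formula; a minimal sketch with made-up numbers:

```r
# Sketch: sample variance by hand vs. var()
y <- c(2, 4, 6, 8)
sum((y - mean(y))^2) / (length(y) - 1)   # 20/3 = 6.6667
var(y)                                   # the same value
```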
2.5.6 Calculating the p-value with the help of an estimated σ²Y

  p-value = Pr_H0( |Ȳ − μY,0| > |Ȳ^sample − μY,0| )                             (4)
          = Pr_H0( |(Ȳ − μY,0)/(σY/√n)| > |(Ȳ^sample − μY,0)/(σY/√n)| )         (5)
          ≅ Pr_H0( |(Ȳ − μY,0)/(sY/√n)| > |(Ȳ^sample − μY,0)/(sY/√n)| )         (6)

where t = (Ȳ − μY,0)/(sY/√n) and t^sample is the value of t for our data.

H0 is rejected if p < α.
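The t statistic and its p-value can be computed step by step; a minimal sketch, assuming Caschool is attached (compare with t.test(testscr, mu=652) above):

```r
# Sketch: t statistic for H0: E(testscr) = 652, with estimated variance
n <- length(testscr)
t <- (mean(testscr) - 652) / (sd(testscr) / sqrt(n))
2 * pnorm(-abs(t))            # large-sample (normal) approximation
2 * pt(-abs(t), df = n - 1)   # using the Student-t distribution
```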
2.5.7 Relation between p-value and the level of significance

The level of significance is given. E.g. if the given level of significance is 5%, . . .
* . . . the null hypothesis is rejected if |t| > 1.96,
* . . . equivalently, the null hypothesis is rejected if p < 0.05.
The p-value is also called the marginal level of significance.



In many situations we will provide much more information to others by
telling them the p-value we calculated, than by telling them whether we
rejected the null hypothesis or not.

2.5.8 What happened to the t table and the degrees of freedom?

If Y1 , . . . , Yn are i.i.d. and normally distributed N(μY, σ²Y), the t-statistic follows the Student-t distribution with n − 1 degrees of freedom.
The most important values of the t-distribution can be found in all old statistics books. The recipe is simple:
1. Calculate the t statistic
2. Calculate the degrees of freedom n − 1
3. Look up the 5% critical value
4. If the t statistic is greater (in absolute terms) than the critical value, reject
the null hypothesis.
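Instead of looking up a table, the critical value can be computed in R; a minimal sketch:

```r
# Sketch: 5% two-sided critical value of the t-distribution with 419 df
qt(0.975, df = 419)   # close to 1.96 for large n
```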

2.5.9 A comment
The theory of the t-distribution is a mathematically beautiful and interesting
result.
If Y is i.i.d. and normally distributed, we know the exact distribution of the
t statistic.
But
If the Y are not exactly normally distributed, this does not help us at all.

data(OFP, package="Ecdat")
hist(OFP[["faminc"]],breaks=40)

[Figure: Histogram of OFP[["faminc"]], with Frequency on the vertical axis and family income from 0 to about 50 on the horizontal axis]

However, this is not as bad as it seems:


No matter how Y is distributed: if n is large enough, Y converges to the
normal distribution anyway.
2.5.10 Another problem

When we want to compare two groups, we look at

  t = (ȲA − ȲB) / √( s²A/nA + s²B/nB )

This statistic only follows the t-distribution if
* Y is normally distributed and i.i.d.,
* and the (population) variance σ²A = σ²B is the same in both groups. This can
  be a heroic assumption (e.g. wages of men vs wages of women).
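R lets us compare both variants; a minimal sketch, assuming testscr and large are defined as above (R's default is Welch's test, which drops the equal-variance assumption):

```r
# Sketch: pooled-variance t-test vs. Welch's t-test
t.test(testscr ~ large, var.equal = TRUE)  # assumes sigma_A == sigma_B
t.test(testscr ~ large)                    # Welch (default), no such assumption
```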

2.6 Confidence intervals

A 95% confidence interval for μY is the interval that contains the true value of μY in
95% of all repeated samples.


[Figure: number line showing the confidence interval for μY, from Ȳ − (sY/√n)·Q to Ȳ + (sY/√n)·Q, with μ0 lying outside]

H0 : μY = μ0 is rejected if μ0 is outside the confidence interval.


Note: The confidence interval is based on the random sample Y1 , . . . , Yn .
Thus, the confidence interval is random itself.
The parameter Y of the population is not random but we do not know
it.
confint(lm(testscr ~ 1))
2.5 % 97.5 %
(Intercept) 652.3291 655.984
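The same interval can be computed by hand; a minimal sketch, assuming Caschool is attached:

```r
# Sketch: 95% confidence interval for the mean of testscr
n <- length(testscr)
mean(testscr) + c(-1, 1) * qt(0.975, n - 1) * sd(testscr) / sqrt(n)
```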

2.7 An alternative: The Bayesian Approach

  P(θ|X) = P(θ) · P(X|θ) · 1/P(X)

with P(θ|X) the posterior, P(θ) the prior, and P(X|θ) the likelihood.

Here we use a numerical approximation to calculate the Bayesian posterior distribution for the mean of testscr. We employ the Gibbs sampler jags (which is
similar to Bugs).
The first lines specify the stochastic process (y[i] ~ dnorm(mu,tau)), the next
lines specify the priors. Here we use uninformed priors, mu ~ dnorm (0,.0001)
means that mu could take almost any value. The precision of the normal distribution (0.0001) is very small.
library(runjags)
modelX <- "model {
  for (i in 1:length(y)) {
    y[i] ~ dnorm(mu,tau)
  }
  mu  ~ dnorm(0,.0001)
  tau ~ dgamma(.01,.01)
  sd <- sqrt(1/tau)
}"
bayesX <- run.jags(model=modelX, data=list(y=testscr), monitor=c("mu","sd"))


Compiling rjags model and adapting for 1000 iterations...


Calling the simulation using the rjags method...
Burning in the model for 4000 iterations...
Running the model for 10000 iterations...
Simulation complete
Calculating the Gelman-Rubin statistic for 2 variables....
The Gelman-Rubin statistic is below 1.05 for all parameters
Finished running the simulation


Digression: an uninformed prior for $\mu$, mu ~ dnorm(0,.0001):


precision<-.0001
x<-seq(-10000,10000,500)
xyplot(pnorm(x,0,1/precision) ~ x,type="l",xlab="$\\mu$",ylab="$F(\\mu)$")

[Figure: prior CDF $F(\mu)$ plotted against $\mu$ for $\mu \in [-10000, 10000]$]

The prior distribution for $\mu$ (i.e. dnorm(0,.0001)) assigns (more or less) the same a priori probability to any reasonable value of $\mu$.
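An aside (not in the original slides): JAGS's dnorm takes a precision $\tau$ rather than a standard deviation, which is why the model above recovers the standard deviation as sd <- sqrt(1/tau). A precision of .0001 therefore corresponds to a standard deviation of 100 — in any case a very diffuse prior:

```r
# JAGS parameterisation: dnorm(mean, precision), with sd = 1/sqrt(precision)
tau <- .0001
sd.prior <- 1/sqrt(tau)   # standard deviation implied by precision .0001
```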

Digression: an uninformed prior for $\tau = 1/\sigma^2$, tau ~ dgamma(.01,.01):

s<-10^seq(-1,4.5,.1)
x<-1-pgamma(1/s^2,.01,.01)
xyplot(x ~ s, scales=list(x=list(log=T)), xscale.components=xscale.components.fractions, xlab="$\\sigma$")


[Figure: prior CDF $F(\sigma)$ plotted against $\sigma$ on a log scale from 1/10 to 10000]

JAGS now gives us a posterior distribution for $\mu$ and for the standard deviation of testscr.

plot(bayesX,var="mu",type=c("trace","density"))

[Figure: trace and posterior density of mu, roughly between 651 and 658]

plot(bayesX,var="sd",type=c("trace","density"))

[Figure: trace and posterior density of sd, roughly between 17 and 22]
summary(bayesX)

Iterations = 5001:15000
Thinning interval = 1
Number of chains = 2
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

     Mean     SD Naive SE Time-series SE
mu 654.10 0.9329 0.006597       0.006387
sd  19.09 0.6630 0.004688       0.004688

2. Quantiles for each variable:

     2.5%    25%    50%    75%  97.5%
mu 652.28 653.46 654.11 654.73 655.93
sd  17.84  18.64  19.07  19.53  20.44

Comparison with the frequentist approach: The credible interval which can
be obtained from the last line of the summary is very similar to the confidence
interval from the frequentist approach.

c Oliver Kirchkamp

c Oliver Kirchkamp

28

6 February 2015 09:49:38

Credible interval:
    2.5%    97.5%
652.2768 655.9267

Confidence interval:
                2.5 %  97.5 %
(Intercept) 652.3291 655.984

Also the estimated mean and its standard deviation are very similar to mean
and standard error of the mean from the frequentist approach.
Priors: uninformed / mildly informed / informed. Are uninformed priors reasonable?

Example:
You measure the eye colour of your fellow students. You sample 5 students and
they all have blue eyes.
100% of your sample has blue eyes. You have no variance. How many of the
remaining students will have blue eyes? Can you give a confidence interval?
Informed priors. Above we used (similar to the frequentist approach) an uninformed prior. Here we will assume that we already know something. Actually, we will pretend that we already did a similar study. That study gave us results of similar precision but with a different mean. Here we pretend that our prior distribution for $\mu$ is dnorm(664,1). Everything else remains the same.
library(runjags)
modelI <- "model {
  for (i in 1:length(y)) {
    y[i] ~ dnorm(mu,tau)
  }
  mu  ~ dnorm (664,1)
  tau ~ dgamma(.01,.01)
}"
bayesI<-run.jags(model=modelI,data=list(y=testscr),monitor="mu")
Compiling rjags model and adapting for 1000 iterations...
Calling the simulation using the rjags method...
Burning in the model for 4000 iterations...
Running the model for 10000 iterations...
Simulation complete
Calculating the Gelman-Rubin statistic for 1 variables....
The Gelman-Rubin statistic is below 1.05 for all parameters
Finished running the simulation


plot(bayesI,var="mu",type=c("trace","density"))

[Figure: trace and posterior density of mu, now roughly between 656 and 661]

Iterations = 5001:15000
Thinning interval = 1
Number of chains = 2
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

    Mean     SD Naive SE Time-series SE
mu 658.9 0.7124 0.005037       0.005344

2. Quantiles for each variable:

    2.5%   25%   50%   75% 97.5%
mu 657.5 658.4 658.9 659.4 660.3

We see that the informed prior has shifted the posterior away from the previous results. The new results are now somewhere between the ones we got with an uninformed prior and the new prior.

Comparison: Frequentist versus Bayesian approach

Frequentist: Null Hypothesis Significance Testing (Ronald A. Fisher, Statistical Methods for Research Workers, 1925, p. 43)



- $X \leftarrow \theta$: $X$ is random, $\theta$ is fixed.
- Confidence intervals and p-values are easy to calculate.
- Interpretation of confidence intervals and p-values is awkward.
- p-values depend on the intention of the researcher.
- We can test null hypotheses (but where do these null hypotheses come from?).
- Not good at accumulating knowledge.
- More restrictive modelling.

Bayesian: (Thomas Bayes, 1702-1761; Metropolis et al., Equations of State Calculations by Fast Computing Machines, Journal of Chemical Physics, 1953.)
- $\theta \leftarrow X$: $X$ is fixed, $\theta$ is random.
- Requires more computational effort.
- Credible intervals are easier to interpret.
- Can work with uninformed priors (similar results as with frequentist statistics).
- Efficient at accumulating knowledge.
- Flexible modelling.
Most people are still used to the frequentist approach. Although the Bayesian approach might have clear advantages, it is important that we are able to understand research that is done in the context of the frequentist approach.
How the intention of the researcher affects p-values. Example: multiple testing (this is not the only example).
Assume a researcher obtains the following confidence intervals for different groups ($H_0: \mu = 0$):

[Figure: 95% confidence intervals for 20 groups; only the interval for group 16 does not contain 0]
A researcher who a priori only suspects group 16 to have $\mu \ne 0$ will find (correctly) a significant effect.
A researcher who does not have this a priori hypothesis, but who just carries out 20 independent tests, must correct for multiple testing and will find no significant effect. After all, it is not surprising to find in 5% of all samples a 95% confidence interval which does not include the null-hypothetical value.
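The arithmetic behind this, as a short sketch: with 20 independent tests at the 5% level, the chance of at least one false rejection when $H_0$ is true everywhere is $1 - 0.95^{20}$, i.e. about 64%.

```r
# P(at least one of 20 independent 5%-level tests rejects | H0 true everywhere)
p.any <- 1 - 0.95^20   # about 0.64
```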

2.8 Summary
Having started from these assumptions:
- single random samples of a population ($Y_1, \dots, Y_n$ are i.i.d.)
- $E(Y^4) < \infty$
- the sample is large ($n$ is large)
we now know how to
- estimate $\mu$ (sampling distribution of $\bar Y$)
- test hypotheses ($\bar Y$ is t-distributed and approximately normally distributed; this allows us to calculate the p-value)
- calculate confidence intervals
Do these assumptions make sense?


2.9 Exercises
1. Revision I
In the following task we will refresh some basic concepts:
You have the following data on the age (a) and the weekly pocket money (pm) that children in elementary school receive from their parents.

age in years (a)   pocket money/week in $ (pm)
        6                    0
        7                    2
        6                    2
        7                    3
        8                    4
        8                    2
        9                    5
       10                    4
        9                    2
       10                    4

Compute the following items:
- Mean (pm)
- Median (pm)
- First/third quartile (pm)
- Variance (pm)
- Standard deviation (pm)
- Correlation (a, pm)

2. Revision II
Define the following items:
- Confidence interval
- Histogram
- Scatter plot
- Box plot

3. First steps in R: I
Do the following tasks using R and the data from the exercise above on children's pocket money.


- Read the data into R, assigning the names age and pm to the variables age and pocket money, respectively.
- Compute in R the descriptive statistics that you calculated in the exercise above.
- Visualize the data with a scatter plot.
- Do you see a problem in visualizing the two variables with a scatter plot?
4. First steps in R: II
Do the following tasks using R and the library Ecdat, which contains economic data sets. Use the data set Schooling on wage and education:
- What is the data set about? What does it contain?
- Give a summary statistic of the hourly wage in cents in 1976 (wage76).
- Draw a histogram of the hourly wage in 1976 (wage76).
- Draw a box plot of the years of education in 1976 (ed76).
- Draw a scatter plot of the hourly wage (wage76) and the years of education (ed76), both in the year 1976.
- Are the wage (wage76) and the years of education received (ed76) correlated? What does this result mean?
5. Female labor supply
Do the following tasks using R and the library Ecdat. Use the data set Workinghours on female labor supply:
- Are the hours worked by wives (hours) related to the other income of the household (income)? Calculate the answer and illustrate it with a graph.
- Are the hours worked by wives (hours) related to the education they received (education)? Calculate the answer and illustrate it with a graph.
- Does the number of hours worked by wives (hours) who have at least one child below 6 differ compared to wives without children under 6? Illustrate your answer with a graph.
- Do wives who live in a home owned by the household work more hours?
- How many hours do wives below 26 years of age work on average? Is this significantly more than wives of age 26 and above work?



6. Students' test scores
Do the following tasks using R and the library Ecdat. Use the data set Caschool on results of test scores in Californian schools:
- Is the average test score in math (mathscr) equal to 652? First, phrase your hypothesis and your alternative hypothesis, then do the computation in R.
- Are the average scores for math and reading (mathscr and readscr) equal? First, phrase your hypothesis and your alternative hypothesis, then do the computation in R.
- Are the results of the two test scores (mathscr and readscr) related to each other? Compute the answer and illustrate it with a graph.
- Look at the outputs you obtained in R. Explain the information you have received from R.
7. Sampling distributions
- Have another look at Workinghours and study the sampling distribution of the mean of hours, i.e. take a large number of samples and study the distribution of all these means.
- Given your (approximate) sampling distribution, how likely is it to obtain a mean value for hours smaller than or equal to 1100? Compare your result with a t.test.
- Now consider the difference in hours separately for wives with and without children under 5. Use your sampling distribution to calculate a 95% confidence interval. Compare this interval with the interval you get from a t.test.

3 Linear regression with a single regressor


draw a line through two-dimensional data Y and X
estimate a causal dependence between Y and X
Lines have a slope and an axis intercept.
Estimation, hypothesis test, confidence interval
testscr = 1 str + 0

0 and 1 are parameters of the population


we do not not know them hence, we have to estimate them (like )


In the following diagram such a line is drawn through a scatterplot

data(Caschool)
attach(Caschool)
scatterplot(testscr~str)

[Figure: scatter plot of testscr against str, with a regression line]

$$Y_i = \beta_1 X_i + \beta_0 + u_i, \qquad i = 1, \dots, n$$
- $Y$: dependent variable
- $X$: independent variable
- $\beta_1$: slope
- $\beta_0$: axis intercept
- $u$: error term (other factors that impact $Y$)
How can we estimate $\beta_0$ and $\beta_1$?
Remember: $\bar Y$ is the least squares estimator for $\mu_Y$; $\bar Y$ is the solution of $\min_m \sum_{i=1}^{n} (Y_i - m)^2$.
Try the same approach for $\beta_0$ and $\beta_1$:

lm estimates an OLS regression. The result is saved to a variable. (est1 in this case)


To take a look at the result, we have to tell R to display it. E.g. we can use summary(est1).
R can also display the result graphically. For example we could type abline(est1).

Of course, we do not have to calculate these results manually. R can do that for
us.
lm(testscr~str, data=Caschool)

Call:
lm(formula = testscr ~ str, data = Caschool)
Coefficients:
(Intercept)          str
     698.93        -2.28

Approximating $Y$:
$$\hat Y_i = \hat\beta_1 X_i + \hat\beta_0, \qquad i = 1, \dots, n$$
Residuals:
$$\hat u_i = Y_i - \hat Y_i, \qquad i = 1, \dots, n$$
$$\widehat{testscr} = -2.2798 \cdot str + 698.93$$
$$\frac{\Delta \widehat{testscr}}{\Delta str} = -2.2798$$
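A minimal sketch of these identities with simulated toy data (not the Caschool data): the fitted values plus the residuals reproduce every $Y_i$ exactly.

```r
set.seed(1)
x <- runif(20)
y <- 3 + 2*x + rnorm(20)      # assumed toy model, for illustration only
est <- lm(y ~ x)
# Y_i = Yhat_i + uhat_i holds exactly for every observation
gap <- max(abs(fitted(est) + residuals(est) - y))
```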

3.1 Measures of determination


$R^2$: fraction of the variance of $Y$ which is explained by $X$.
$$Y_i = \hat Y_i + \hat u_i = \text{OLS approximation} + \text{OLS residual}$$
$$\mathrm{var}(Y_i) = \mathrm{var}(\hat Y_i) + \mathrm{var}(\hat u_i)$$
$$R^2 = \frac{\sum (\hat Y_i - \bar{\hat Y})^2}{\sum (Y_i - \bar Y)^2}, \qquad 0 \le R^2 \le 1$$
In the case of a regression with a single independent variable $X$, $R^2$ is the same as the square of the correlation coefficient of $X$ and $Y$.
Here are two ways to find the R2 in our example:
summary(lm(testscr~str))


Call:
lm(formula = testscr ~ str)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.727 -14.251   0.483  12.822  48.540 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept) 698.9330     9.4675  73.825    < 2e-16 ***
str          -2.2798     0.4798  -4.751 0.00000278 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124,  Adjusted R-squared: 0.04897 
F-statistic: 22.58 on 1 and 418 DF,  p-value: 0.000002783
cor(testscr,str)^2
[1] 0.0512401

What does it mean if $R^2$ is only 0.05, as in this example?

Standard error of the residuals:
$$SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(\hat u_i - \bar{\hat u})^2} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\hat u_i^2}$$
(The second equality holds because OLS residuals from a regression with an intercept have mean zero.)

The SER is not only a part of summary. We can also calculate it manually:
est <- lm(testscr~str)
sqrt(with(est,sum(residuals^2)/df.residual))
[1] 18.58097
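The same manual calculation can be checked against the "Residual standard error" that summary reports; a sketch with simulated toy data:

```r
set.seed(2)
x <- runif(30)
y <- 1 - x + rnorm(30)        # assumed toy model, for illustration only
est <- lm(y ~ x)
ser <- sqrt(sum(residuals(est)^2)/est$df.residual)
# identical to the value stored in summary(est)$sigma
```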

3.2 OLS assumptions


$$Y_i = \beta_1 X_i + \beta_0 + u_i, \qquad i = 1, \dots, n$$

1. E(ui |Xi = x) = 0
2. (Xi , Yi ) are i.i.d.
3. Large outliers in X and Y are rare (the fourth moments of X and Y exist)


3.2.1 Digression The Existence of Moments


There are distributions for which some moments do not exist.
Example: the Cauchy distribution:
$$f(x) = \frac{1}{\pi (1 + x^2)}$$
$$F(x) = \frac{1}{\pi}\arctan(x) + \frac{1}{2}$$
$$\sigma_X^2 = \int_{-\infty}^{\infty} \frac{x^2}{\pi (1 + x^2)}\, dx = \infty$$

plot(dnorm,from=-4,to=4,ylab="density")
plot(dcauchy,from=-4,to=4,add=TRUE,lty=2)
legend("topleft",c("Normal","Cauchy"),lty=1:2)

[Figure: densities of the normal (solid) and Cauchy (dashed) distribution]
The sampling distribution of $\hat\sigma^2$ converges if $X$ is a normally distributed random variable. This is not the case for the Cauchy distribution.
In the following example we draw two plots side by side. This is done using the command par(mfrow=c(1,2)). We can always get back to the original state (one diagram at a time) by typing par(mfrow=c(1,1)).
rnorm creates a vector of normally distributed (pseudo-)random variables.
rcauchy creates a vector of Cauchy distributed (pseudo-)random variables.


set.seed(127)
N <- 1000
z <- rnorm(N)
plot(1:N,sapply(1:N,function(x) {var(z[1:x]) }),ylab="$\\sigma^2$",main="normal",t="l")
z <- rcauchy(N)
plot(1:N,sapply(1:N,function(x) {var(z[1:x]) }),ylab="$\\sigma^2$",main="Cauchy",t="l")

[Figure: running sample variance $\sigma^2$ against sample size 1:N; it converges for the normal sample (left) but not for the Cauchy sample (right)]

3.3 The distribution of the OLS estimator


Example: We approximate the distribution of the estimator under the null hypothesis in our example. To do so, we repeatedly estimate using ever-new permutations of the independent variable.
sample draws a sample of a specified size out of a vector. If no size is given, it produces a random
permutation of the whole vector.
coef extracts the coefficients from a regression.
replicate executes a command multiple times.
density estimates a density function.

strH0dist <- replicate(100,coef(lm(testscr ~ sample(str)))[2])


strCIdist<-replicate(100,coef(lm(unlist(simulate(est)) ~ str))[2])
densityplot(~strCIdist + strH0dist,auto.key=list(columns=2),xlab="$\\hat\\beta_1$")


[Figure: density estimates of strCIdist and strH0dist over $\hat\beta_1$]

sd(strH0dist)
[1] 0.4673529
sd(strCIdist)
[1] 0.4957402
sqrt(diag(vcov(est)))["str"]
str
0.4798256
coef(est)["str"]
str
-2.279808
coef(est)["str"]/sd(strH0dist)
str
-4.87813

$\hat\beta_0$ and $\hat\beta_1$ are calculated by means of the sample. A different sample results in different values for $\hat\beta_0$ and $\hat\beta_1$.
Just as there is a distribution for $\bar Y$, there is a distribution for $\hat\beta_0$ and $\hat\beta_1$.


- Is $E(\hat\beta_1) = \beta_1$? (Is OLS unbiased?)
- Is $\mathrm{var}(\hat\beta_1)$ small?
- How do we test hypotheses? (e.g. $\beta_1 = 0$)
- How do we calculate a confidence interval for $\beta_0$ and $\beta_1$?
Mean value and variance of $\hat\beta_1$. We are interested in $\beta_1$ and $\hat\beta_1$. We know:
$$Y_i = \beta_0 + \beta_1 X_i + u_i \qquad \text{and} \qquad \bar Y = \beta_0 + \beta_1 \bar X + \bar u$$
hence
$$Y_i - \bar Y = \beta_1 (X_i - \bar X) + (u_i - \bar u)$$
Therefore
$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n}(X_i - \bar X)^2}
= \frac{\sum_{i=1}^{n}(X_i - \bar X)\left(\beta_1 (X_i - \bar X) + (u_i - \bar u)\right)}{\sum_{i=1}^{n}(X_i - \bar X)^2}
= \beta_1 \frac{\sum_{i=1}^{n}(X_i - \bar X)(X_i - \bar X)}{\sum_{i=1}^{n}(X_i - \bar X)^2} + \frac{\sum_{i=1}^{n}(X_i - \bar X)(u_i - \bar u)}{\sum_{i=1}^{n}(X_i - \bar X)^2}$$
so that
$$\hat\beta_1 - \beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar X)(u_i - \bar u)}{\sum_{i=1}^{n}(X_i - \bar X)^2}$$
Now
$$\sum_{i=1}^{n}(X_i - \bar X)(u_i - \bar u)
= \sum_{i=1}^{n}(X_i - \bar X)\, u_i - \sum_{i=1}^{n}(X_i - \bar X)\, \bar u
= \sum_{i=1}^{n}(X_i - \bar X)\, u_i - \left(\sum_{i=1}^{n} X_i - n \bar X\right)\bar u
= \sum_{i=1}^{n}(X_i - \bar X)\, u_i$$
Hence,
$$\hat\beta_1 - \beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar X)(u_i - \bar u)}{\sum_{i=1}^{n}(X_i - \bar X)^2} = \frac{\sum_{i=1}^{n}(X_i - \bar X)\, u_i}{\sum_{i=1}^{n}(X_i - \bar X)^2}$$

Now we can calculate $E(\hat\beta_1) - \beta_1$:
$$E\left(\hat\beta_1 - \beta_1\right)
= E\left(\frac{\sum_{i=1}^{n}(X_i - \bar X)\, u_i}{\sum_{i=1}^{n}(X_i - \bar X)^2}\right)
= E\left(E\left(\left.\frac{\sum_{i=1}^{n}(X_i - \bar X)\, u_i}{\sum_{i=1}^{n}(X_i - \bar X)^2}\,\right|\, X_1, \dots, X_n\right)\right) = 0$$
since by assumption 1: $E(u_i \mid X_i = x) = 0$. Hence $E(\hat\beta_1) = \beta_1$, i.e. $\hat\beta_1$ is an unbiased estimator for $\beta_1$.
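Unbiasedness can also be illustrated by simulation; a sketch with an assumed true slope of 2 (toy data, not from the lecture):

```r
set.seed(3)
slopes <- replicate(1000, {
  x <- runif(50)
  y <- 0.5 + 2*x + rnorm(50)   # true beta1 = 2 in this simulated model
  coef(lm(y ~ x))[2]
})
m <- mean(slopes)   # close to the true value 2
```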
Next we calculate the variance $\sigma^2_{\hat\beta_1}$:
$$\hat\beta_1 - \beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar X)\, u_i}{\sum_{i=1}^{n}(X_i - \bar X)^2}$$
Call $v_i = (X_i - \bar X)\, u_i$; furthermore we have $s_X^2 = \frac{\sum_{i=1}^{n}(X_i - \bar X)^2}{n-1}$, so
$$\hat\beta_1 - \beta_1 = \frac{\sum_{i=1}^{n} v_i}{(n-1)\, s_X^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} v_i}{\frac{n-1}{n}\, s_X^2}$$
For large $n$ it holds that $s_X^2 \approx \sigma_X^2$ and $\frac{n-1}{n} \approx 1$, therefore
$$\hat\beta_1 - \beta_1 \approx \frac{\frac{1}{n}\sum_{i=1}^{n} v_i}{\sigma_X^2}$$
and we have
$$\mathrm{var}(\hat\beta_1) = \mathrm{var}(\hat\beta_1 - \beta_1)
\approx \mathrm{var}\left(\frac{\frac{1}{n}\sum_{i=1}^{n} v_i}{\sigma_X^2}\right)
= \frac{\mathrm{var}(v_i)/n}{(\sigma_X^2)^2}
= \frac{1}{n}\,\frac{\mathrm{var}\left((X_i - \mu_X)\, u_i\right)}{\sigma_X^4}$$

Summary. If the three OLS assumptions are true, ...
1. $E(u_i \mid X_i = x) = 0$
2. $(X_i, Y_i)$ are i.i.d.
3. Large outliers in $X$ and $Y$ are rare (the fourth moments of $X$ and $Y$ exist)
... then it is also true that
$$E(\hat\beta_1) = \beta_1 \qquad (\hat\beta_1 \text{ is unbiased})$$
$$\mathrm{var}(\hat\beta_1) = \frac{1}{n}\,\frac{\mathrm{var}\left((X_i - \mu_X)\, u_i\right)}{\sigma_X^4}$$

3.4 Distribution of $\hat\beta_1$
Mean value: $E(\hat\beta_1) = \beta_1$
$$\hat\beta_1 - \beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar X)\, u_i}{(n-1)\, s_X^2}$$
If $n$ is large, then $\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)\, u_i$ is approximately normally distributed (Central Limit Theorem).
$$\mathrm{var}(\hat\beta_1) = \frac{1}{n}\,\frac{\mathrm{var}\left((X_i - \bar X)\, u_i\right)}{\sigma_X^4}$$
$$\hat\beta_1 \sim N\!\left(\beta_1,\; \frac{\mathrm{var}\left((X_i - \bar X)\, u_i\right)}{n\, \sigma_X^4}\right)$$


The larger the variance of $X$, the smaller the variance of $\hat\beta_1$.
Mathematically: $\hat\beta_1 \sim N\!\left(\beta_1, \frac{\mathrm{var}((X_i - \bar X)\, u_i)}{n\, \sigma_X^4}\right)$.
Intuitively: we determine the regression line for two cases. First, we use the entire sample.

est1<-lm(testscr~str)
summary(est1)

Call:
lm(formula = testscr ~ str)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.727 -14.251   0.483  12.822  48.540 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept) 698.9330     9.4675  73.825    < 2e-16 ***
str          -2.2798     0.4798  -4.751 0.00000278 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124,  Adjusted R-squared: 0.04897 
F-statistic: 22.58 on 1 and 418 DF,  p-value: 0.000002783

Then we use only those observations where str does not deviate much from the mean value.
lowVar <- str>19 & str<21
est2<-lm(testscr~str,subset=lowVar)
summary(est2)

Call:
lm(formula = testscr ~ str, subset = lowVar)

Residuals:
   Min     1Q Median     3Q    Max 
-46.98 -13.39   2.82  12.74  42.40 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  677.205     44.538  15.205   <2e-16 ***
str           -1.204      2.233  -0.539     0.59    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.68 on 187 degrees of freedom
Multiple R-squared: 0.001552,  Adjusted R-squared: -0.003787 
F-statistic: 0.2906 on 1 and 187 DF,  p-value: 0.5904

Now we take a sample of the same size as the low variance sample:

lowSize <- sample(str>19 & str<21)


est2b<-lm(testscr~str,subset=lowSize)
summary(est2b)

Call:
lm(formula = testscr ~ str, subset = lowSize)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.022 -13.591   0.844  12.196  48.722 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 704.4589    15.7465  44.737  < 2e-16 ***
str          -2.5993     0.7956  -3.267  0.00129 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.99 on 187 degrees of freedom
Multiple R-squared: 0.054,  Adjusted R-squared: 0.04894 
F-statistic: 10.67 on 1 and 187 DF,  p-value: 0.001292

We observe that standard errors and p-values are larger in the second case.
The next diagram clarifies the problem:

plot(testscr ~ str)
points(testscr ~ str,col=2,subset=lowVar)
points(testscr ~ str,col=3,subset=lowSize)
abline(est1)
abline(est2,col=2)
abline(est2b,col=3)

[Figure: testscr against str with regression lines for the full sample, the low-variance subsample (red), and the random subsample (green)]

For large $n$ we have
$$\hat\beta_1 \xrightarrow{p} \beta_1 \qquad (\hat\beta_1 \text{ is consistent})$$
$$\frac{\hat\beta_1 - E(\hat\beta_1)}{\sqrt{\mathrm{var}(\hat\beta_1)}} \sim N(0, 1)$$
(Central Limit Theorem, just as for $\bar Y$.)

3.5 Distribution of $\hat\beta_0$
For large $n$, $\hat\beta_0$, too, is normally distributed, with $\hat\beta_0 \sim N\!\left(\beta_0, \sigma^2_{\hat\beta_0}\right)$, where
$$\sigma^2_{\hat\beta_0} = \frac{1}{n}\,\frac{\mathrm{var}(H_i u_i)}{\left(E(H_i^2)\right)^2} \qquad \text{and} \qquad H_i = 1 - \frac{\mu_X}{E(X_i^2)}\, X_i$$
With the distributions of $\hat\beta_1$ and $\hat\beta_0$ we can construct hypothesis tests and confidence intervals for $\beta_1$ and $\beta_0$.

3.6 Hypothesis tests for $\hat\beta_1$
Two-sided test: $H_0: \beta_1 = \beta_{1,0}$ versus $H_1: \beta_1 \ne \beta_{1,0}$


One-sided test: $H_0: \beta_1 = \beta_{1,0}$ versus $H_1: \beta_1 > \beta_{1,0}$, or $H_0: \beta_1 = \beta_{1,0}$ versus $H_1: \beta_1 < \beta_{1,0}$, with $\beta_{1,0}$ being the hypothetical value under the null hypothesis.
Approach: build the t statistic and determine the p-value (or compare the t statistic with the critical $N(0, 1)$ value).
Generally:
$$t = \frac{\text{estimator} - \text{hypothetical value}}{\text{standard error of the estimator}}$$
The standard error of the estimator is derived from the estimated variance of the estimator.

In order to test the mean value of $Y$:
$$t = \frac{\bar Y - \mu_{Y,0}}{\sigma_{\bar Y}}$$
In order to test the regression coefficient $\beta_1$:
$$t = \frac{\hat\beta_1 - \beta_{1,0}}{\hat\sigma_{\hat\beta_1}}$$

Recall the theoretical variance of $\hat\beta_1$:
$$\mathrm{var}(\hat\beta_1) = \frac{1}{n}\,\frac{\mathrm{var}\left((X_i - \mu_X)\, u_i\right)}{\sigma_X^4}$$
Similarly the sample variance, with $\hat v_i = \left(X_i - \bar X\right) \hat u_i$:
$$\hat\sigma^2_{\hat\beta_1} = \frac{1}{n} \cdot \frac{\text{estimate for } \mathrm{var}\left((X_i - \bar X)\, u_i\right)}{\left(\text{estimate for } \sigma_X^2\right)^2}
= \frac{1}{n} \cdot \frac{\frac{1}{n-2}\sum_{i=1}^{n} \hat v_i^2}{\left(\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar X\right)^2\right)^2}$$
Summary: to test the hypothesis $H_0: \beta_1 = \beta_{1,0}$ versus $H_1: \beta_1 \ne \beta_{1,0}$, calculate the t statistic:
$$t = \frac{\hat\beta_1 - \beta_{1,0}}{\hat\sigma_{\hat\beta_1}}$$


Reject at a significance level of 5% if $|t| > 1.96$.
The p-value is $p = \Pr\left(|t| > |t^{\text{sample}}|\right)$.

[Figure: standard normal density with the two rejection regions beyond $-|t|$ and $|t|$]

We need the assumption that $n$ is large ($n = 50$ is large).


confint calculates confidence intervals for an estimated model.
pnorm and pt calculate the distribution function of the normal distribution and the t distribution.
qnorm and qt calculate the quantiles for a given value of the distribution.
est1<-lm(testscr~str)
confint(est1)
                2.5 %     97.5 %
(Intercept) 680.32313 717.542779
str          -3.22298  -1.336637

3.7 Confidence intervals and p-values

Needless to say, we can also calculate the confidence intervals manually, under the assumption of homoscedastic residuals.
vcov calculates the variance-covariance matrix for $\hat\beta$.
diag extracts the diagonal of a matrix. In the case of the variance-covariance matrix the diagonal contains the variances of the coefficients. sqrt calculates square roots.

coef(est1) + qnorm(.025) * sqrt(diag(vcov(est1)))
(Intercept)         str 
 680.377010   -3.220249 

coef(est1) - qnorm(.025) * sqrt(diag(vcov(est1)))
(Intercept)         str 
 717.488895   -1.339367 

confint(est1)
                2.5 %     97.5 %
(Intercept) 680.32313 717.542779
str          -3.22298  -1.336637


We have already seen the p-value in the summary above. But we can also calculate it manually:

2 * pnorm (- abs(coef(est1) / sqrt(diag(vcov(est1)))))
   (Intercept)            str 
0.000000000000 0.000002020858 

We have just used the approximation to the normal distribution. R uses the t distribution in the summary command:

2 * pt (- abs(coef(est1) / sqrt(diag(vcov(est1)))),est1$df.resid)
  (Intercept)           str 
6.569925e-242  2.783307e-06 

We find that the two values are slightly different.


The following two statements are equivalent:
The 95% confidence interval does not contain the zero
The hypothesis 1 = 0 is rejected at the 5% level of significance
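A sketch of this equivalence with simulated toy data: since the p-value and the interval come from the same t distribution, the two decisions always agree.

```r
set.seed(4)
x <- runif(40)
y <- 1 + 0.2*x + rnorm(40)    # assumed toy model, for illustration only
est <- lm(y ~ x)
p  <- summary(est)$coefficients["x", "Pr(>|t|)"]
ci <- confint(est)["x", ]
# reject at 5% exactly when 0 lies outside the 95% interval
agree <- (p < 0.05) == (ci[1] > 0 | ci[2] < 0)
```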

[Figure: density estimates of strCIdist and strH0dist over $\hat\beta_1$]

3.8 Bayesian Regression


Of course, we use the Bayesian approach also in the context of regressions. All we
have to do is adjust the model from section 2.7 slightly.


modelR <- "model {
  for (i in 1:length(y)) {
    y[i] ~ dnorm(beta0 + beta1*x[i],tau)
  }
  beta0 ~ dunif (0,1200)
  beta1 ~ dnorm (0,.0001)
  tau   ~ dgamma(.01,.01)
}"
bayesR<-run.jags(model=modelR,data=list(y=testscr,x=str),
monitor=c("beta0","beta1"))
Compiling rjags model and adapting for 1000 iterations...
Calling the simulation using the rjags method...
Burning in the model for 4000 iterations...
Running the model for 10000 iterations...
Simulation complete
Calculating the Gelman-Rubin statistic for 2 variables....
Convergence may have failed for this run for 2 parameters after 10000
iterations (multi-variate psrf = 1.164)
Finished running the simulation

JAGS returns a posterior distribution for $\beta_1$:

plot(bayesR,var="beta1",type=c("trace","density"))

[Figure: trace and posterior density of beta1, roughly between -4 and -1]


summary(bayesR)

Iterations = 5001:15000
Thinning interval = 1
Number of chains = 2
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

         Mean     SD Naive SE Time-series SE
beta0 700.611 9.9421 0.070301        1.10111
beta1  -2.365 0.5038 0.003563        0.05573

2. Quantiles for each variable:

         2.5%     25%     50%     75%   97.5%
beta0 682.074 693.408 700.372 707.782 719.225
beta1  -3.305  -2.732  -2.353  -1.999  -1.423

Comparison with the frequentist approach:

coef(est1)
(Intercept)         str 
 698.932952   -2.279808 

sqrt(diag(vcov(est1)))
(Intercept)         str 
  9.4674914   0.4798256 

confint(est1)
                2.5 %     97.5 %
(Intercept) 680.32313 717.542779
str          -3.22298  -1.336637

As in section 2.7 we see that the credible intervals are similar to the frequentist confidence intervals. The interpretation, however, is quite different. The credible intervals make a direct statement about the probability that $\beta_1$ is in a certain interval. Confidence intervals make a much more indirect statement which is harder to interpret.

3.9 Reporting estimation results

The summary table is not very concise:


summary(lm(testscr ~ str))

Call:
lm(formula = testscr ~ str)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.727 -14.251   0.483  12.822  48.540 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept) 698.9330     9.4675  73.825    < 2e-16 ***
str          -2.2798     0.4798  -4.751 0.00000278 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124,  Adjusted R-squared: 0.04897 
F-statistic: 22.58 on 1 and 418 DF,  p-value: 0.000002783

$$\widehat{testscr} = \underset{(9.4675)}{698.933} - \underset{(0.4798)}{2.2798} \cdot str, \qquad R^2 = 0.05,\; SER = 18.58$$
Standard errors are often shown in parentheses below the estimated coefficients.
- The estimated regression line is $\widehat{testscr} = 698.933 - 2.2798 \cdot str$.
- The standard error of $\hat\beta_0$ is 9.4675.
- The standard error of $\hat\beta_1$ is 0.4798.
- $R^2 = 0.05$; the standard error of the residuals is $SER = 18.58$.
These are almost all of the numbers we need to perform a hypothesis test and calculate confidence intervals.

3.10 Continuous and nominal variables


Continuous:
- gross domestic product
- income in Euro
- str
Nominal / discrete:
- sex
- profession
- sector of a firm
- income in categories
Binary variables / dummy variables are a special case of nominal variables:
- sex male/female
- income higher than 40 000 Euro per year yes/no
- unemployed yes/no
- university degree yes/no
In
$$testscr = \beta_0 + \beta_1 \cdot str + u$$
we used a continuous independent variable str. But what if we only had binary data for str?
$$large = \begin{cases} 1 & \text{if } str > 20 \\ 0 & \text{else} \end{cases}$$
Now estimate
$$testscr = \beta_0 + \beta_1 \cdot large + u$$

large <- Caschool$str>20


est<-lm(testscr ~ large)
summary(est)

Call:
lm(formula = testscr ~ large)

Residuals:
    Min      1Q  Median      3Q     Max 
-50.435 -14.071  -0.285  12.778  49.565 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  657.185      1.202  546.62  < 2e-16 ***
largeTRUE     -7.185      1.852   -3.88 0.000121 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.74 on 418 degrees of freedom
Multiple R-squared: 0.03476,  Adjusted R-squared: 0.03245 
F-statistic: 15.05 on 1 and 418 DF,  p-value: 0.0001215

[Figure: testscr against str (left) and testscr against the binary variable large (right)]

In general (when $X$ is a binary / dummy variable):
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$
Interpretation:
- If $X_i = 0$: $Y_i = \beta_0 + u_i$; the mean value of $Y$ is $\beta_0$, i.e. $E(Y_i \mid X_i = 0) = \beta_0$.
- If $X_i = 1$: $Y_i = \beta_0 + \beta_1 + u_i$; the mean value of $Y$ is $\beta_0 + \beta_1$, i.e. $E(Y_i \mid X_i = 1) = \beta_0 + \beta_1$.
$\beta_1 = E(Y_i \mid X_i = 1) - E(Y_i \mid X_i = 0)$ is the difference between the mean values of the two groups of the population.
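A minimal sketch with made-up numbers: the OLS coefficients on a dummy reproduce the two group means exactly.

```r
# Toy data (hypothetical)
y <- c(10, 12, 11, 14, 16, 15)
d <- c( 0,  0,  0,  1,  1,  1)
b <- coef(lm(y ~ d))
# intercept = mean of the d==0 group; intercept + slope = mean of the d==1 group
means <- c(b[1], b[1] + b[2])
```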
t.test performs a t-test to compare two mean values.
tapply applies a function (here the mean mean and the standard deviation sd) to individual groups of a data set. Here, these groups are defined by the variable large.
est1<-lm(testscr ~ large)
summary(est1)

Call:
lm(formula = testscr ~ large)

Residuals:
    Min      1Q  Median      3Q     Max 
-50.435 -14.071  -0.285  12.778  49.565 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  657.185      1.202  546.62  < 2e-16 ***
largeTRUE     -7.185      1.852   -3.88 0.000121 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.74 on 418 degrees of freedom
Multiple R-squared: 0.03476,  Adjusted R-squared: 0.03245 
F-statistic: 15.05 on 1 and 418 DF,  p-value: 0.0001215

confint(est1)
                2.5 %      97.5 %
(Intercept) 654.82130  659.547833
largeTRUE   -10.82554   -3.544715

t.test(testscr ~ large)

	Welch Two Sample t-test

data:  testscr by large
t = 3.9231, df = 393.721, p-value = 0.0001031
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  3.584445 10.785813
sample estimates:
mean in group FALSE  mean in group TRUE 
           657.1846            649.9994 

tapply(testscr,large,mean)
   FALSE     TRUE 
657.1846 649.9994 

tapply(testscr,large,sd)
   FALSE     TRUE 
19.28629 17.96589 



It does not matter whether we use a Student t-test to compare the mean values of two groups, or whether we calculate a regression with a single binary variable.
A regression can be useful if we want to control for additional regressors.

3.11 Heteroscedastic and homoscedastic error terms

Recall the three OLS assumptions:
1. $E(u_i \mid X_i = x) = 0$
2. $(X_i, Y_i)$ are i.i.d.
3. Large outliers in $X$ and $Y$ are rare (the fourth moments of $X$ and $Y$ exist)
Now we add an additional assumption: $\mathrm{var}(u \mid X = x)$ is constant, i.e. $u$ is homoscedastic.
runif creates a vector of uniformly distributed (pseudo-)random variables. Here we need this vector to simulate an estimation model.

x <- runif(1000)
u <- rnorm(1000)
y <- 10 - 3.1*x + u
plot(y ~ x)

[Figure: scatter plot of y against x; the scatter around the line is the same for all x]


est<-lm(y ~ x)
plot(est,which=1:2)

[Figure: residuals vs fitted and normal Q-Q plot for est; no systematic pattern in the residuals]

In the following example the error terms are no longer independent of x

u2 <- rnorm(1000)*x
y2 <- 10 - 3.1*x + u2
plot(y2 ~ x)


est2<-lm(y2 ~ x)
plot(est2,which=1:2)

[Figure: diagnostic plots "Residuals vs Fitted" and "Normal Q-Q" for est2]
In both examples it is true that E(ui |Xi = x) = 0.
In the first example u is homoscedastic; in the second example u is heteroscedastic.

3.11.1 An example from labour market economics


wage   weekly wages for US male workers from the Current Population Survey 1988
educ   years of education

data(uswages,package="faraway")
plot(wage ~ educ,data=uswages)
plot(lm(wage ~ educ,data=uswages),which=1:2)

[Figure: scatter plot of wage against educ, and "Residuals vs Fitted" / "Normal Q-Q" diagnostic plots]

3.11.2 Back to Caschool

data(Caschool,package="Ecdat")
attach(Caschool)
est <- lm(testscr ~ str)
plot(testscr ~ str)
plot(est,which=1:2)

[Figure: scatter plot of testscr against str, and "Residuals vs Fitted" / "Normal Q-Q" diagnostic plots for est]

3.11.3 What do we get from homoscedasticity?


OLS has the smallest variance of all estimators which are linear in Y (Gauss-Markov theorem).

It is easier to calculate var(β̂₁).

Recall (for independent X and Y): var(X·Y) = (E(X))²·var(Y) + (E(Y))²·var(X) + var(X)·var(Y)
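A quick Monte-Carlo check of this product formula (a sketch with arbitrarily chosen moments; the identity holds exactly for population moments of independent X and Y):

```r
## var(X*Y) for independent X and Y:
## var(XY) = E(X)^2 var(Y) + E(Y)^2 var(X) + var(X) var(Y)
set.seed(42)
n <- 1e6
X <- rnorm(n, mean = 2, sd = 1)   # E(X) = 2, var(X) = 1
Y <- rnorm(n, mean = 3, sd = 2)   # E(Y) = 3, var(Y) = 4
var(X * Y)                        # simulated value, close to ...
2^2 * 4 + 3^2 * 1 + 1 * 4         # ... the formula: 29
```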



var(β̂₁) = (1/n) · var((Xᵢ − μ_X)·uᵢ) / σ_X⁴

Since E(Xᵢ − μ_X) = 0 and E(uᵢ) = 0, the first two terms of the product formula are zero:

var(β̂₁) = (1/n) · [ (E(Xᵢ − μ_X))²·var(uᵢ) + (E(uᵢ))²·var(Xᵢ − μ_X) + var(Xᵢ − μ_X)·var(uᵢ) ] / σ_X⁴
         = (1/n) · σ_X²·σ_u² / σ_X⁴
         = σ_u² / (n·σ_X²)

We can see that var(β̂₁) decreases when var(X) increases.
Above we assumed homoscedasticity:

σ̂_β̂₁ = √[ (1/n) · ( (1/(n−2)) Σᵢ₌₁ⁿ ûᵢ² ) / ( (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)² ) ]

This formula for the standard deviation of β̂₁ is the standard setting of statistical software and often the only choice we have in office software.
But if we do not assume that var(u|X = x) is constant, then we have:

σ̂²_β̂₁ = (1/n) · var((Xᵢ − μ_X)·uᵢ) / σ_X⁴

What if (Xᵢ − μ_X) and var(uᵢ) are not independent?

homoscedasticity:

σ̂_β̂₁ = √[ (1/n) · ( (1/(n−2)) Σᵢ₌₁ⁿ ûᵢ² ) / ( (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)² ) ]

heteroscedasticity (always correct):

σ̂_β̂₁ = √[ (1/n) · ( (1/(n−2)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²·ûᵢ² ) / ( (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)² )² ]

The formula for the case of homoscedastic error terms is simpler, but it is only correct if the assumption of homoscedastic error terms is actually satisfied.
Since the formulas are different, we usually get different results.
Homoscedasticity is the standard setting of the software (if not the only possible setting). Typically, it will give us smaller standard errors for β̂ than the setting for heteroscedasticity.
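The difference can be made visible on simulated data. The following sketch computes both formulas by hand (the data-generating process is invented for the illustration):

```r
## Classical vs. heteroscedasticity-robust standard error of beta1-hat,
## computed directly from the two formulas above.
set.seed(7)
n <- 10000
x <- runif(n)
u <- rnorm(n) * x                  # error variance grows with x
y <- 10 - 3.1 * x + u

est  <- lm(y ~ x)
uhat <- resid(est)
dx   <- x - mean(x)

## homoscedasticity-only formula
se.hom <- sqrt((1/n) * (sum(uhat^2) / (n - 2)) / (sum(dx^2) / n))
## robust formula (valid under heteroscedasticity)
se.het <- sqrt((1/n) * (sum(dx^2 * uhat^2) / (n - 2)) / (sum(dx^2) / n)^2)

se.hom                              # identical to summary(est)
summary(est)$coefficients["x", "Std. Error"]
se.het                              # larger here, since var(u) rises with x
```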
hccm calculates the variance-covariance matrix for β̂ under the assumption of heteroscedastic residuals.
est <- lm(testscr ~ str)
summary(est)


Call:
lm(formula = testscr ~ str)

Residuals:
    Min      1Q  Median      3Q     Max
-47.727 -14.251   0.483  12.822  48.540

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)
(Intercept) 698.9330     9.4675  73.825    < 2e-16 ***
str          -2.2798     0.4798  -4.751 0.00000278 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared:  0.05124,  Adjusted R-squared:  0.04897
F-statistic: 22.58 on 1 and 418 DF,  p-value: 0.000002783

By default, R uses homoscedastic standard errors for p-values and confidence intervals. To obtain the summary with heteroscedasticity-robust standard errors, we use summaryR from the package tonymisc.
library(tonymisc)
summaryR(est)

Call:
lm(formula = testscr ~ str)

Residuals:
    Min      1Q  Median      3Q     Max
-47.727 -14.251   0.483  12.822  48.540

Coefficients:
            Estimate Std. Error t value  Pr(>|t|)
(Intercept) 698.9330    10.4605  66.816   < 2e-16 ***
str          -2.2798     0.5244  -4.348 0.0000173 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared:  0.05124,  Adjusted R-squared:  0.04897
F-statistic:  18.9 on 1 and 418 DF,  p-value: 0.00001729

R can extract the homoscedastic and the heteroscedastic standard errors.


sqrt(vcov(est))

            (Intercept)       str
(Intercept)    9.467491       NaN
str                 NaN 0.4798256

sqrt(hccm(est))

            (Intercept)       str
(Intercept)    10.46053       NaN
str                 NaN 0.5243585

3.11.4 Summary: Homo-/Heteroscedasticity

If the data is homoscedastic and we assume heteroscedasticity, we are in the clear (if the data is homoscedastic, both formulas produce the same result for large n).
If the data is heteroscedastic and we assume homoscedasticity, we get wrong standard errors. (The estimator of the standard errors is not consistent in this case.)
We should therefore always use standard errors that are robust to heteroscedasticity.

What we know about OLS:

OLS is unbiased.
OLS is consistent.
We can calculate confidence intervals for β̂.
We can test hypotheses about β̂.

A large amount of econometric analysis is presented in the form of OLS. One reason for this is that many people understand how OLS works. Whenever we use a different estimator we run the risk of not being understood by others.
Is that enough of an explanation to use OLS?
Are there better estimators? Estimators with a lower variance?
To answer this question we will make additional assumptions.

3.12 Extended OLS assumptions


1. E(ui |Xi = x) = 0
2. (Xi , Yi ) are i.i.d.
3. Large outliers in X and Y are rare (the fourth moments of X and Y exist)



4. var(u|X = x) is constant, u is homoscedastic
5. u is normally distributed: u ~ N(0, σ²)

Assumptions 4 and 5 are more restrictive; they are warranted less often.

Gauss-Markov
Assuming 1-4, β̂₁ has the smallest variance of all linear estimators (of all estimators which are linear functions of Y).

Efficiency of OLS-II
Assuming 1-5, β̂₁ has the smallest variance of all consistent estimators if n → ∞ (regardless of whether the estimators are linear or non-linear).

3.13 OLS problems

Gauss-Markov:
The assumptions of the Gauss-Markov theorem (homoscedasticity) are often not fulfilled.
The result is only valid for linear estimators. But linear estimators represent only a small share of all possible estimators.
"Smallest variance of all consistent estimators" requires homoscedastic, normally distributed residuals; more often than not this is not plausible.

Outliers: OLS is more sensitive to outliers than many other estimators.
Recall the discussion about estimating the mean value: the median is less sensitive to outliers than the sample mean.
We can do similar things when we estimate linear equations:

OLS:  min over b₀, b₁ of  Σᵢ₌₁ⁿ (Yᵢ − (b₀ + b₁Xᵢ))²

LAD:  min over b₀, b₁ of  Σᵢ₌₁ⁿ |Yᵢ − (b₀ + b₁Xᵢ)|

However, OLS is used in most cases; we will do the same thing here.


3.13.1 Alternatives to OLS


Identifying and eliminating outliers
Quantile regression
Robust regression
The following dataset shows the relation between income and food expenditure.

library(quantreg)
data(engel)
attach(engel)

[Figure: scatter plot of foodexp against income; observation 138 marked]
The estimation result depends on the inclusion or exclusion of observation 138:
lm(foodexp ~ income)

Call:
lm(formula = foodexp ~ income)

Coefficients:
(Intercept)       income
   147.4754       0.4852

lm(foodexp ~ income,data=engel[-138,])


Call:
lm(formula = foodexp ~ income, data = engel[-138, ])

Coefficients:
(Intercept)       income
    91.3330       0.5465

plot(foodexp ~ income)
text(engel[138,1],engel[138,2],138,pos=2)
est <- lm(foodexp ~ income)
abline(est)
abline(lm(foodexp ~ income,data=engel[-138,]),lty=2)
legend("bottomright",c("all","138 dropped"),lty=1:2,cex=.5)
plot(est,which=2)

[Figure: scatter plot of foodexp against income with regression lines for all observations and with observation 138 dropped; Normal Q-Q plot of est with observations 59, 105, and 138 marked]

Is observation 138 an outlier?


3.13.2 Robust regression

Until now we have been minimizing least squares:

Σᵢ (yᵢ − (β₀ + β₁xᵢ))²

More generally, we minimize the sum of any function ρ of the residuals:

Σᵢ ρ(yᵢ − (β₀ + β₁xᵢ))

where

1. ρ(x) = x²                                  (OLS)
2. ρ(x) = |x|                                 (LAD, quantile regression)
3. ρ(x) = x²/2 if |x| ≤ c, and c·|x| − c²/2 otherwise
   (Huber's method; c is an estimated value for σ_u)

[Figure: the functions ρ(x) for OLS, LAD, and Huber]
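The three ρ functions can be written directly in R. This is a sketch; the value c = 1.345 is just a commonly used choice of the tuning constant, not taken from the text:

```r
## rho functions for OLS, LAD, and Huber's method
rho.ols   <- function(x) x^2
rho.lad   <- function(x) abs(x)
rho.huber <- function(x, c = 1.345)
  ifelse(abs(x) <= c, x^2 / 2, c * abs(x) - c^2 / 2)

## Near zero, Huber's rho behaves like OLS (quadratic);
## in the tails it behaves like LAD (linear), so large
## residuals are penalised less than under OLS:
rho.ols(3)
rho.huber(3)
```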
rq performs a quantile regression, minimizing the sum of the absolute values of the residuals. rlm performs a robust regression.

LAD:
library(quantreg)
summary(rq(foodexp ~ income))

Call: rq(formula = foodexp ~ income)

tau: [1] 0.5

Coefficients:
            coefficients lower bd  upper bd
(Intercept) 81.48225     53.25915 114.01156
income       0.56018      0.48702   0.60199

Huber:


library(MASS)
summary(rlm(foodexp ~ income))

Call: rlm(formula = foodexp ~ income)

Residuals:
     Min       1Q   Median       3Q      Max
-933.748  -54.995    4.768   53.714  418.020

Coefficients:
            Value   Std. Error t value
(Intercept) 99.4319 12.1244     8.2010
income       0.5368  0.0109    49.1797

Residual standard error: 81.45 on 233 degrees of freedom


plot(foodexp ~ income)
abline(lm(foodexp ~ income))
abline(rq(foodexp ~ income),lty=2)
abline(rlm(foodexp ~ income),lty=3)
legend("bottomright",c("OLS","LAD","Huber"),lty=1:3)

[Figure: scatter plot of foodexp against income with OLS, LAD, and Huber regression lines]

3.13.3 A Bayesian approach to robust regression


Of course, there is also a Bayesian approach to outliers. The idea is as follows: usually, we assume that our dependent variable follows a Normal distribution.

Let us estimate this as follows (this is still the non-robust approach):


modelR <- "model {
  for (i in 1:length(y)) {
    y[i] ~ dnorm(beta0 + beta1*x[i], tau)
  }
  beta0 ~ dnorm(0, .0001)
  beta1 ~ dnorm(0, .0001)
  tau   ~ dgamma(.01, .01)
}"
bayesR<-run.jags(model=modelR,data=list(y=foodexp,x=income),
monitor=c("beta0","beta1"))
Compiling rjags model and adapting for 1000 iterations...
Calling the simulation using the rjags method...
Burning in the model for 4000 iterations...
Running the model for 10000 iterations...
Simulation complete
Calculating the Gelman-Rubin statistic for 2 variables....
The Gelman-Rubin statistic is below 1.05 for all parameters
Finished running the simulation

plot(bayesR,var="beta1",type=c("trace","density"),newwindows=FALSE)

[Figure: trace and density plot for beta1]


summary(bayesR)

Iterations = 5001:15000
Thinning interval = 1
Number of chains = 2
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

         Mean       SD  Naive SE Time-series SE
beta0 143.872 15.91473 0.1125341      0.3198132
beta1   0.488  0.01429 0.0001011      0.0002814

2. Quantiles for each variable:

          2.5%      25%     50%      75%    97.5%
beta0 112.6957 133.1380 143.882 154.6010 174.6390
beta1   0.4602   0.4785   0.488   0.4976   0.5164

So far our results are very similar to the (non-robust) OLS results from above.

The Normal distribution does not put much weight on its tails. In other words, observations (outliers) which are several standard deviations away from the expected value are very unlikely. When we use the Normal distribution above, we implicitly shift the estimator closer to these observations, so that the distance between the posterior estimate and the observation becomes smaller.

A distribution which may (but need not) put more weight on its tails, and which
still contains the Normal distribution as a special case, is the t-distribution. If
the degrees of freedom are large, the t-distribution is very close to the Normal
distribution. If the degrees of freedom are small, the t-distribution has very fat
tails.
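The tail probabilities can be compared directly in R:

```r
## P(|T| > 3) under the t-distribution for several degrees of
## freedom, against the Normal distribution.
k <- c(1, 5, 20, 100)
2 * pt(-3, df = k)     # decreases towards the Normal value as df grows
2 * pnorm(-3)          # about 0.0027
```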

[Figure: densities of the t-distribution with 20 and with 1 degree of freedom]

Bayesian, k = 20 Let us start with a large value for degrees of freedom:


modelRR20 <- "model {
  for (i in 1:length(y)) {
    y[i] ~ dt(beta0 + beta1*x[i], tau, 20)
  }
  beta0 ~ dnorm(0, .0001)
  beta1 ~ dnorm(0, .0001)
  tau   ~ dgamma(.01, .01)
}"
set.seed(123)
bayesRR20<-run.jags(model=modelRR20,data=list(y=foodexp,x=income),
monitor=c("beta0","beta1"))
Compiling rjags model and adapting for 1000 iterations...
Calling the simulation using the rjags method...
Burning in the model for 4000 iterations...
Running the model for 10000 iterations...
Simulation complete
Calculating the Gelman-Rubin statistic for 2 variables....
The Gelman-Rubin statistic is below 1.05 for all parameters
Finished running the simulation

summary(bayesRR20)

Iterations = 5001:15000
Thinning interval = 1
Number of chains = 2
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

         Mean       SD  Naive SE Time-series SE
beta0 97.8757 15.84981 0.1120751      0.4986725
beta1  0.5389  0.01606 0.0001136      0.0004964

2. Quantiles for each variable:

         2.5%     25%    50%      75%    97.5%
beta0 66.5581 87.4478 97.828 108.3922 129.4566
beta1  0.5067  0.5284  0.539   0.5495   0.5705

Bayesian, k = 1 Let us compare this with a small value for degrees of freedom:
modelRR1 <- "model {
  for (i in 1:length(y)) {
    y[i] ~ dt(beta0 + beta1*x[i], tau, 1)
  }
  beta0 ~ dnorm(0, .0001)
  beta1 ~ dnorm(0, .0001)
  tau   ~ dgamma(.01, .01)
}"
bayesRR1<-run.jags(model=modelRR1,data=list(y=foodexp,x=income),
monitor=c("beta0","beta1"))
Compiling rjags model and adapting for 1000 iterations...
Calling the simulation using the rjags method...
Burning in the model for 4000 iterations...
Running the model for 10000 iterations...
Simulation complete
Calculating the Gelman-Rubin statistic for 2 variables....
The Gelman-Rubin statistic is below 1.05 for all parameters
Finished running the simulation

summary(bayesRR1)

Iterations = 5001:15000
Thinning interval = 1
Number of chains = 2
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

         Mean       SD  Naive SE Time-series SE
beta0 62.3640 14.47629 0.1023629      0.5464357
beta1  0.5892  0.01818 0.0001286      0.0006682

2. Quantiles for each variable:

         2.5%     25%     50%    75%   97.5%
beta0 35.2204 52.2198 62.0430 72.192 90.9461
beta1  0.5533  0.5767  0.5897  0.602  0.6234

Bayesian, more robust:

Let us next endogenise the degrees of freedom (k):

modelRR <- "model {
  for (i in 1:length(y)) {
    y[i] ~ dt(beta0 + beta1*x[i], tau, k)
  }
  beta0 ~ dnorm(0, .0001)
  beta1 ~ dnorm(0, .0001)
  tau   ~ dgamma(.01, .01)
  k     ~ dexp(1/30)
}"
bayesRR<-run.jags(model=modelRR,data=list(y=foodexp,x=income),
monitor=c("beta0","beta1","k"))
Compiling rjags model and adapting for 1000 iterations...
Calling the simulation using the rjags method...
Burning in the model for 4000 iterations...
Running the model for 10000 iterations...
Simulation complete
Calculating the Gelman-Rubin statistic for 3 variables....
The Gelman-Rubin statistic is below 1.05 for all parameters
Finished running the simulation

Digression: The exponential distribution:

k<-10^(seq(-.5,2,.1))
xyplot(pexp(k,1/30) ~ k,type="l",scales=list(x=list(log=T)),xscale.components = xscale.compone

[Figure: cumulative distribution function pexp(k, 1/30) against k, log scale]
A t-distribution with 20 degrees of freedom is very similar to a Normal distribution. If our prior for k is exponential with rate 1/30, then the probability for k > 20 is (slightly) larger than 1/2, i.e. we are giving the traditional model (of an almost Normal distribution) a very good chance.

plot(bayesRR,var="k",type=c("trace","density"))

[Figure: trace and density plot for k]

plot(bayesRR,var="beta1",type=c("trace","density"))

[Figure: trace and density plot for beta1]

summary(bayesRR)

Iterations = 5001:15000
Thinning interval = 1
Number of chains = 2
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

         Mean       SD  Naive SE Time-series SE
beta0 79.4320 15.32741 0.1083812      0.5138090
beta1  0.5622  0.01785 0.0001262      0.0006043
k      3.5468  0.86702 0.0061308      0.0131400

2. Quantiles for each variable:

         2.5%     25%     50%    75%    97.5%
beta0 48.6728 69.2156 79.5799 89.814 108.5751
beta1  0.5284  0.5498  0.5619  0.574   0.5984
k      2.2482  2.9405  3.4172  4.001   5.6210

We see that our estimate for the slope (based on the t-distribution) is larger, similar to the LAD and Huber regressions.
Different from LAD and Huber, here the regression itself has chosen the optimal way (i.e. the optimal value of k) to accommodate outliers.

3.14 Exercises
1. Regressions I
Define the following items:
Regression

Independent variable
Dependent variable
Give the formula for a linear regression with a single regressor.
What does β₁ indicate?
What does u indicate?

2. Regressions II
Use the data set Crime of the library Ecdat in R.
How do you interpret positive (negative) coefficients in simple linear
models?
What is the influence of the number of policemen per capita (polpc) on
the crime rate in crimes committed per person (crmrte)? Interpret your
result. Do you have an explanation for this result?


How can we visualize regressions? Draw the respective graph in R.

What is the correlation coefficient of the number of policemen per capita (polpc) and the crime rate (crmrte)? Interpret the result.
What do the standard errors tell you?
What does R2 indicate?

What do the p-values indicate?

3. Regressions III
List and explain the assumptions that have to be fulfilled to be able to use
OLS.
4. Classes of variables
Explain and give examples of the following types of variables:
continuous
discrete
binary
5. Dummies
Use the data set BudgetFood of the library Ecdat in R.
What is a dummy variable? Which values can it take?

Name some typical examples of variables which are often coded as


dummy variables.
You have heard that older people care more about the quality of food
they eat and thus spend more on food than younger people. Test whether
Spanish citizens from the age of 30 on spend a higher percentage of
their available income on food (wfood) than younger do. Do you have
an explanation for these findings?
Interpret the result of your regression. What does 1 specify in this
case?
Could you apply a different test for the same question? Use this test in
R and compare the results.
Check the relation of age (age) and percentage of income spent on food
(wfood) with a graph.
6. Reading data into R
Create a data file with the data on pocket money and age that we used
in chapters 1 and 2. Use the headers age and pm and save it in the
format .csv under the name pocketmoney.

c Oliver Kirchkamp

c Oliver Kirchkamp

78

6 February 2015 09:49:38


Read the file pocketmoney into R.

Draw a scatter plot with age on the x-axis and pm on the y-axis.

Draw the same scatter plot, this time without box plots and without
the lowess spline, but with a linear regression line.
Label your scatter plot with age on the x-axis and pocket money on the
y-axis. Give your graph the title "Children's pocket money".
7. Exam 28.7.2007, exercise 5a+d
Your task is to work on a hypothetical data set in R.
The variable names A, B, C, D, E, and Year are in the header of your data file
file.csv. The data set contains 553 observations in the format .csv (comma
separated values). Explain what the following commands do and choose
the correct one (with explanation).
First, read your data set into R.

a) daten = read.csv("file.csv", header=YES, sep=";")

b) daten = read.csv("file.csv", header=TRUE, sep=";")

c) daten = read.table("file.csv")
d) daten = read.table("file.csv", header=YES, sep=",")
Further, you would like to know the correlation between the variables
B, C, and D. How can you find this out?
a) corr(B,C, D)
b) corr(daten)
c) cor(daten)
d) corr(B,C, D)
8. Using and generating dummy variables
Use the data set Fatality of the library Ecdat in R.
What is the data set about?

Which of the variables are nominal variables?


Which of the variables are discrete variables?

Which of the variables are continuous variables?


Which of the variables are dummy variables?

Create a dummy variable which takes the value 1 if the size of the sales
floor is > 120 sqm and 0 otherwise.


Draw a graph with separate box plots for large and small sales floors
on the sales per square meter.
Measure the influence of the size of the sales floor on the sales per
square meter.
Do the same task as above, this time using your variable for large sales
floors.
9. Heteroscedasticity
What is heteroscedasticity?

Give an example for data where residual variances of differ along the
dimension of a second variable.
What is homoscedasticity?

Which advantage does homoscedasticity have for econometric analysis?


Imagine you had data with heteroscedastic error terms. You perform a
data analysis under the assumption of homoscedasticity. Is your estimator consistent?
Now you have data with homoscedastic error terms and you perform
a data analysis under the assumption of heteroscedasticity. Is your estimator consistent?
How would the answers to the last two questions be if you had a large
sample?
10. Advantages and disadvantages of OLS
What are the advantages and disadvantages of OLS?

What can you do to fix these problems in OLS estimations?


11. Prices for houses
Use the data set Housing of the library Ecdat in R.
Draw a scatter plot of the lot size and the price of a house. Look at
the graph. From your visual impression, would you say that the error
terms are homo- or heteroscedastic?
What does the data set contain?

Look at the variables the data set contains. Formulate some sensible
hypotheses and test them in R.
12. Wages
Use the data set Wages1 of the library Ecdat in R.



What does the data set contain?

Do you think that gender (sex) matters when it comes to wages (wage)?
Check your assumption in R.
Are years of education received (school) and gender correlated with
each other?
Do you think that experience (exper) or years of schooling matter more
for wage? Check your assumption in R. Which tests could you use to
test it?
Do employees with a college education (more than 12 years of education) earn more than those without? Test this in R. Which type of
variable do you use to answer this question?
Do you think that our models above are well specified? How would
you change the model if you could?

4 Models with more than one independent variable (multiple regression)

testscr = β₁·str + β₀ + u

  testscr: test score
  str: student / teacher ratio

How can we include more than one factor at the same time?
Keep one factor constant by only looking at a small group (e.g. all students with a very similar elpct (English learner percentage)).
The subset option of the command lm limits the estimation to a certain part of the dataset.
data(Caschool)
attach(Caschool)
summary(elpct)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.941   8.778  15.770  22.970  85.540

lm(testscr ~ str ,subset=(elpct<9))

Call:
lm(formula = testscr ~ str, subset = (elpct < 9))

Coefficients:
(Intercept)          str
    680.252       -0.835

81

lm(testscr ~ str ,subset=(elpct>=9 & elpct<23))

Call:
lm(formula = testscr ~ str, subset = (elpct >= 9 & elpct < 23))

Coefficients:
(Intercept)          str
    696.445       -2.231

lm(testscr ~ str ,subset=(elpct>=23))

Call:
lm(formula = testscr ~ str, subset = (elpct >= 23))

Coefficients:
(Intercept)          str
   653.0746      -0.8656

[Figure: testscr against str with separate regression lines for elpct in [0,9], (9,23], and (23,100]]

Depending on elpct the estimated relationships are very different.

Extend the regression model:

testscr = β₁·str + β₂·elpct + β₀ + u


Generally:

y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + u

For every observation:

y₁ = β₀ + β₁x₁₁ + β₂x₁₂ + … + βₖx₁ₖ + u₁
y₂ = β₀ + β₁x₂₁ + β₂x₂₂ + … + βₖx₂ₖ + u₂
y₃ = β₀ + β₁x₃₁ + β₂x₃₂ + … + βₖx₃ₖ + u₃
⋮
yₙ = β₀ + β₁xₙ₁ + β₂xₙ₂ + … + βₖxₙₖ + uₙ

lm(testscr ~ str + elpct)

Call:
lm(formula = testscr ~ str + elpct)

Coefficients:
(Intercept)          str        elpct
   686.0322      -1.1013      -0.6498

4.1 Matrix notation

4.1.1 How to do calculations with matrices

y = Xβ + u

where y = (y₁, y₂, y₃, …, yₙ)′ is the vector of observations of the dependent variable, X is the n×(k+1) matrix whose i-th row is (1, xᵢ₁, xᵢ₂, …, xᵢₖ), β = (β₀, β₁, β₂, …, βₖ)′ is the vector of coefficients, and u = (u₁, u₂, u₃, …, uₙ)′ is the vector of error terms.

Addition (elementwise; both matrices must be n×m):

(A + B)ᵢⱼ = aᵢⱼ + bᵢⱼ

Multiplication (A is n×m, B is m×k; the result is n×k):

(A·B)ᵢⱼ = Σₗ₌₁ᵐ aᵢₗ·bₗⱼ

4.1.2 Calculations with matrices in R

We define vectors with c(...). We can then stack vectors horizontally or vertically with rbind(...) or cbind(...).
A <- rbind(c(1,2,3),c(4,5,6))

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

B <- cbind(c(2,2,2),c(3,3,3))

     [,1] [,2]
[1,]    2    3
[2,]    2    3
[3,]    2    3

For the transpose of a matrix we use t(...):

t(B)

     [,1] [,2] [,3]
[1,]    2    2    2
[2,]    3    3    3

+ adds matrices elementwise. This requires that the matrices have the same dimensions. In this example we cannot calculate A+B, but we can calculate A+t(B).

A + t(B)

     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    7    8    9

* multiplies the elements of two matrices. This is not the usual matrix multiplication:

A * t(B)

     [,1] [,2] [,3]
[1,]    2    4    6
[2,]   12   15   18


A

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

B

     [,1] [,2]
[1,]    2    3
[2,]    2    3
[3,]    2    3

%*% performs the usual matrix multiplication:

A %*% B

     [,1] [,2]
[1,]   12   18
[2,]   30   45

4.2 Deriving the OLS estimator in matrix notation


y1

y=

y1
y2
y3
..
.

=
..
.

X=

0 + 1 x11 + 2 x12 + + k x1k + u1

y = X + u

1 x11
1 x21
1 x31
..
.

x12
x22
x32

yn
1 xn1 xn2
y = X + u: Now, the residuals are

x1k
x2k
x3k
..
..
.
.
xnk

; =

0
1
2
..
.
k

; u =

u = y X
The sum of squares of the residuals
S() =

n
X

u2i = u u

(y X) (y X)

y y y X X y + X X

y y 2 X y + X X

i=1

u1
u2
u3
..
.
un

85

(recall: (AB)′ = B′A′)

To minimize S(β), we take the first derivative with respect to β:

∂S(β)/∂β = −2X′y + 2X′Xβ = 0

Normal equations:

X′X β̂ = X′y                                                    (7)

Now, X′X is a (k+1)×(k+1) matrix. If this matrix is nonsingular, we calculate its inverse:

(X′X)⁻¹ X′X β̂ = (X′X)⁻¹ X′y
β̂ = (X′X)⁻¹ X′y

The matrix X⁺ = (X′X)⁻¹X′ is a pseudoinverse of X. As an exercise, show that

X X⁺ X = X
X⁺ X X⁺ = X⁺
(X X⁺)′ = X X⁺
X⁺ X = I

Call ŷ = Xβ̂ and û = y − ŷ.

Orthogonality:

X′û = X′(y − ŷ) = X′y − X′X β̂ = X′y − X′X (X′X)⁻¹ X′y = 0
ŷ′û = β̂′X′û = 0

Multiplying the normal equations X′X β̂ = X′y by β̂′ yields

β̂′X′X β̂ = β̂′X′y


then

û′û = (y − Xβ̂)′(y − Xβ̂)
    = y′y − 2β̂′X′y + β̂′X′Xβ̂
    = y′y − 2β̂′X′Xβ̂ + β̂′X′Xβ̂
    = y′y − β̂′X′Xβ̂
    = y′y − (Xβ̂)′(Xβ̂)
    = y′y − ŷ′ŷ

Quadratic analysis (variance analysis):

y′y = ŷ′ŷ + û′û

In the case of an inhomogeneous regression y = Xβ + u (the first column of X is a column of ones), we divide by n and subtract the squared mean ȳ² on both sides:

(1/n)·y′y = (1/n)·ŷ′ŷ + (1/n)·û′û
(1/n)·y′y − ȳ² = (1/n)·ŷ′ŷ − ȳ² + (1/n)·û′û

s²_y  =  s²_ŷ  +  s²_û
(TSS)    (ESS)    (SSR)

TSS: total sum of squares
ESS: explained sum of squares
SSR: sum of squares of residuals

R² = s²_ŷ / s²_y = 1 − s²_û / s²_y
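These matrix formulas can be verified numerically in R. The following is a sketch on simulated data (variable names and coefficients are made up):

```r
## beta-hat from the normal equations, orthogonality X'u-hat = 0,
## and the decomposition TSS = ESS + SSR.
set.seed(1)
n  <- 100
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                     # design matrix with intercept

beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # solves X'X b = X'y
cbind(beta.hat, coef(lm(y ~ x1 + x2)))     # identical columns

u.hat <- y - X %*% beta.hat
max(abs(t(X) %*% u.hat))                   # essentially zero

TSS <- sum((y - mean(y))^2)
ESS <- sum((X %*% beta.hat - mean(y))^2)
SSR <- sum(u.hat^2)
c(TSS, ESS + SSR)                          # equal
ESS / TSS                                  # the R-squared of lm(y ~ x1 + x2)
```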

4.3 Specification errors

What can happen if we forget to include a variable in our model?
Let us take another look at our simple estimation equation:

testscr = β₁·str + β₀ + u

  testscr: test score
  str: student / teacher ratio
What else could have an influence on testscr?

                                  corr. with str   influence on testscr
  percent of English learners           x                  x
  time of day of the test                                  x
  parking lot space per student         x

If we do not include a variable in our estimation equation, but this variable
  is correlated with the regressor
  and has an influence on the dependent variable,
our estimate of β is biased (omitted variable bias).
The assumption E(uᵢ|Xᵢ) = 0 is no longer satisfied.
4.3.1 Examples:

Classical music → intelligence of children (Rauscher, Shaw, Ky; Nature; 1993)
(missing variable: income)

French paradox: red wine, foie gras → fewer illnesses of the coronary blood vessels (Samuel Black, 1819)
(missing variables: percentage of fish and sugar in the diet, …)

Storks in Lower Saxony → birth rate
(missing variable: industrialisation)

[Figure: birth rate against frequency of nesting storks, by year (1953-1977)]

Gabriel, K. R. and Odoroff, C. L. (1990). Biplots in biomedical research. Statistics in Medicine 9(5): 469-485.
[Figure: mortality due to coronary heart disease against wine consumption, by country]

Mortality due to coronary heart disease (per 1000 men, 55-64 years). St. Leger, A.S., Cochrane, A.L. and Moore, F. (1979). Factors Associated with Cardiac Mortality in Developed Countries with Particular Reference to the Consumption of Wine, Lancet: 1017-1020.
4.3.2 Specification errors: generalization

Let the true model be

y = X₁β₁ + X₂β₂ + u

What happens if we forget to include X₂ in the specification of our model? We then estimate b₁ = (X₁′X₁)⁻¹X₁′y:

b₁ = (X₁′X₁)⁻¹X₁′y
   = (X₁′X₁)⁻¹X₁′(X₁β₁ + X₂β₂ + u)
   = β₁ + (X₁′X₁)⁻¹X₁′X₂β₂ + (X₁′X₁)⁻¹X₁′u

E(b₁) = β₁ + (X₁′X₁)⁻¹X₁′X₂β₂

Hence, E(b₁) = β₁ only if
  β₂ = 0,
  or X₁′X₂ = 0, i.e. X₁ and X₂ are orthogonal.

Specification errors: another example

The correct model:

Y = β₀ + β₁X₁ + β₂X₂ + u

X₂ is correlated with X₁, e.g.:

X₂ = X₁ + ν

Omitting X₂ means:

Y = β₀ + β₁X₁ + β₂(X₁ + ν) + u
  = β₀ + (β₁ + β₂)X₁ + β₂ν + u

We overestimate β₁ if β₂ > 0.
We underestimate β₁ if β₂ < 0.
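A small simulation illustrates the bias. This is a sketch; the coefficients and the factor 0.8 linking X2 to X1 are chosen arbitrarily for the illustration:

```r
## Omitted variable bias: true model Y = 1 + 2*X1 + 3*X2 + u,
## with X2 = 0.8*X1 + nu. Omitting X2 shifts the estimate of
## beta1 by cov(X1,X2)/var(X1) * beta2 = 0.8 * 3 = 2.4.
set.seed(1)
n  <- 10000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

coef(lm(y ~ x1 + x2))["x1"]   # close to the true beta1 = 2
coef(lm(y ~ x1))["x1"]        # close to 2 + 2.4 = 4.4
```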


4.4 Assumptions for the multiple regression model


1. E(ui |Xi = x) = 0
2. (Xi , Yi ) are i.i.d.
3. Large outliers in X and Y are rare (the fourth moments of X and Y exist)
4. X has the same rank as its number of columns (no multicollinearity)
5. var(u|X = x) is constant, u is homoscedastic
6. u is normally distributed: u ~ N(0, σ²)

4.5 The distribution of the OLS estimator in a multiple regression

Model with a single regressor: the OLS estimators β̂₀ and β̂₁ are unbiased and consistent. For large samples β̂₀ and β̂₁ are normally distributed.

Multiple regression: under assumptions 1-4 (given above) the OLS estimator β̂ = (X′X)⁻¹X′y is unbiased and consistent. For large samples β̂ is jointly normally distributed.

4.6 Multicollinearity

Example:

testscr = β₁·str + β₂·elpct + β₀

Now, we extend the model by adding another variable, the fraction of English learners FracEL = elpct/100:

testscr = β₁·str + β₂·elpct + β₃·FracEL + β₀
FracEL<-elpct/100

summary(lm(testscr ~ str + elpct))

Call:
lm(formula = testscr ~ str + elpct)
Residuals:
    Min      1Q  Median      3Q     Max 
-48.845 -10.240  -0.308   9.815  43.461 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 686.03225    7.41131  92.566  < 2e-16 ***
str          -1.10130    0.38028  -2.896  0.00398 ** 
elpct        -0.64978    0.03934 -16.516  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.46 on 417 degrees of freedom
Multiple R-squared: 0.4264,  Adjusted R-squared: 0.4237
F-statistic: 155 on 2 and 417 DF,  p-value: < 2.2e-16

summary(lm(testscr ~ str + elpct + FracEL))

Call:
lm(formula = testscr ~ str + elpct + FracEL)
Residuals:
    Min      1Q  Median      3Q     Max 
-48.845 -10.240  -0.308   9.815  43.461 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 686.03225    7.41131  92.566  < 2e-16 ***
str          -1.10130    0.38028  -2.896  0.00398 ** 
elpct        -0.64978    0.03934 -16.516  < 2e-16 ***
FracEL             NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.46 on 417 degrees of freedom
Multiple R-squared: 0.4264,  Adjusted R-squared: 0.4237
F-statistic: 155 on 2 and 417 DF,  p-value: < 2.2e-16

FracEL<-elpct/100+rnorm(4)*.0000001
summary(lm(testscr ~ str + elpct + FracEL))

Call:
lm(formula = testscr ~ str + elpct + FracEL)
Residuals:
    Min      1Q  Median      3Q     Max 
-48.608 -10.063  -0.152   9.613  43.857 

Coefficients:


                 Estimate   Std. Error t value Pr(>|t|)    
(Intercept)      685.9887       7.4172  92.486  < 2e-16 ***
str               -1.1040       0.3806  -2.901  0.00392 ** 
elpct         -53520.1940   87264.2711  -0.613  0.54001    
FracEL       5351954.3231 8726426.9534   0.613  0.54001    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.48 on 416 degrees of freedom
Multiple R-squared: 0.4269,  Adjusted R-squared: 0.4228
F-statistic: 103.3 on 3 and 416 DF,  p-value: < 2.2e-16

We notice that R detects the multicollinearity on its own and simplifies the model
accordingly. But this does not always work.
We slightly perturb the variable. This can happen accidentally (e.g. through
rounding errors). The multicollinearity between the variables is then no longer
perfect. We get essentially the same coefficients, but any (ever so slight)
perturbation changes the result considerably, and the standard errors get very
large.
elpct[1:5]/100
[1] 0.00000000 0.04583333 0.30000002 0.00000000 0.13857677
elpct[1:5]/100+rnorm(4)*.0000001
[1]  0.00000001292877  0.04583350642929  0.30000006516511 -0.00000012650612
[5]  0.13857678752594
elpct[1:5]/100+rnorm(4)*.0000001
[1] -0.00000006868529  0.04583329035659  0.30000014148167  0.00000003598138
[5]  0.13857670591188
elpct[1:5]/100+rnorm(4)*.0000001
[1] 0.00000004007715 0.04583334599106 0.29999996348937 0.00000017869131
[5] 0.13857681467431

perturbedEstimate <- function (x) {
  FracEL <- elpct/100+rnorm(4)*.0000001
  est <- lm(testscr ~ str + elpct + FracEL)
  coef(est)[3:4]
}
perturbedEstimate(1)
     elpct     FracEL 
 -53520.19 5351954.32 
perturbedEstimate(1)
      elpct      FracEL 
   56766.73 -5676738.56 
perturbedEstimate(1)
     elpct     FracEL 
 -64347.34 6434669.12 

estList <- sapply(1:100,perturbedEstimate)
plot(t(estList),main="multicollinearity, estimated coefficients")

[Figure: "multicollinearity, estimated coefficients" — scatter plot of the
estimated FracEL coefficients against the estimated elpct coefficients]

Large coefficients for elpct are balanced by small coefficients for FracEL. What
happened? What is the true relationship?

    testscr = 686.0322 - 1.1013 str - 0.6498 elpct

Since FracEL = elpct/100, for any a:

    testscr = 686.0322 - 1.1013 str + (a - 0.6498) elpct - 100a · elpct/100
            = 686.0322 - 1.1013 str + (a - 0.6498) elpct - 100a · FracEL

The coefficients cannot be identified anymore.


4.6.1 Example 2
A dummy variable NVS ("not very small") assumes the value 1 if str > 12:
NVS <- str>12
lm(testscr ~ str + elpct + NVS)

Call:
lm(formula = testscr ~ str + elpct + NVS)

Coefficients:
(Intercept)          str        elpct      NVSTRUE  
   686.0322      -1.1013      -0.6498           NA  

The coefficient of NVS cannot be estimated. Why?


table(NVS)
NVS
TRUE 
 420 

The new variable NVS is always TRUE and, hence, it is perfectly correlated with
the constant term. Explanation: there are no districts with str ≤ 12, therefore
we cannot assess the effect of such a small group size.
4.6.2 Example 3
ESpct = 100 - elpct

ESpct <- 100 - elpct
lm(testscr ~ str + elpct + ESpct)

Call:
lm(formula = testscr ~ str + elpct + ESpct)

Coefficients:
(Intercept)          str        elpct        ESpct  
   686.0322      -1.1013      -0.6498           NA  

Again, R detects the collinearity. We can perform a small perturbation to get a
result. However, this result is not exactly helpful.

set.seed(123)
perturbedEstimate2 <- function (x) {
  ESpct <- 100 - elpct + rnorm(4)*.01
  est <- lm(testscr ~ str + elpct + ESpct)
  coef(est)[3:4]
}
estList <- sapply(1:100,perturbedEstimate2)
plot(t(estList),main="multicollinearity 2, estimated coefficients")

[Figure: "multicollinearity 2, estimated coefficients" — scatter plot of the
estimated ESpct coefficients against the estimated elpct coefficients]

The true relationship is:

    testscr = 686.0322 - 1.1013 str - 0.6498 elpct

Now let ESpct = 100 - elpct. For any a:

    testscr = 686.0322 + a·100 - 1.1013 str - (0.6498 + a) elpct - a (100 - elpct)
            = 686.0322 + a·100 - 1.1013 str - (0.6498 + a) elpct - a ESpct

The coefficients cannot be identified anymore.
4.6.3 Which regressor is responsible for the multicollinearity?

We perform regressions using each of the k regressors as dependent variable and
all other k−1 regressors as independent variables:

    xᵢ = γ0 + γ1 x1 + … + γᵢ₋₁ xᵢ₋₁ + γᵢ₊₁ xᵢ₊₁ + … + γk xk + u

A large R²ᵢ is a sign of collinearity.


We consider the variance inflation factor

    VIFᵢ = 1 / (1 − R²ᵢ)
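The auxiliary regression and the VIF formula can be computed by hand. This is a minimal sketch with simulated data (the variable names x1, x2, x3 are illustrative, not from the Caschool data):

```r
## Sketch: computing the VIF of x2 from its auxiliary regression by hand
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x3 <- rnorm(n)
x2 <- 0.9 * x1 + 0.1 * rnorm(n)   # x2 is almost collinear with x1

## auxiliary regression: x2 on all other regressors
R2.aux <- summary(lm(x2 ~ x1 + x3))$r.squared
VIF.x2 <- 1 / (1 - R2.aux)
VIF.x2                            # far above 10: x2 is strongly collinear
```

The vif() function of the car package, used below, performs exactly this sequence of auxiliary regressions for every regressor.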

In the following example we build an (almost) linearly dependent variable elpct2.
Additionally, we add an obviously pointless regressor to the equation: the number
of the school district.
set.seed(123)
elpct2 <- elpct + rnorm(4)
est <- lm (testscr ~ str + elpct2 + elpct + as.numeric(district))
summaryR(est)

Call:
lm(formula = testscr ~ str + elpct2 + elpct + as.numeric(district))
Residuals:
    Min      1Q  Median      3Q     Max 
-48.328 -10.212  -0.168   9.518  43.872 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)          685.454379   8.923996  76.810   <2e-16 ***
str                   -1.096242   0.438310  -2.501   0.0128 *  
elpct2                 0.544466   0.853474   0.638   0.5239    
elpct                 -1.196071   0.857852  -1.394   0.1640    
as.numeric(district)   0.001910   0.005937   0.322   0.7478    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.49 on 415 degrees of freedom
Multiple R-squared: 0.4271,  Adjusted R-squared: 0.4216
F-statistic: 109 on 4 and 415 DF,  p-value: < 2.2e-16

We notice that there are some factors with a high variance. We calculate the
variance inflation factor to test for collinearity:
library(car)
elpct2 <- elpct + rnorm(4)
est <- lm (testscr ~ str + elpct2 + elpct + as.numeric(district))
vif(est)

                 str               elpct2                elpct as.numeric(district) 
            1.040984           512.718019           512.761866             1.008520 

vif(lm(testscr ~ str + elpct + mealpct + calwpct))

     str    elpct  mealpct  calwpct 
1.044388 1.962265 3.870485 2.476509 

We notice (at least we would if we had not known it in advance) that the number
of the school district is not significant, but neither is it collinear. The two versions
of elpct are collinear. If we remove one, the variance of the other gets smaller.

summaryR(lm (testscr ~ str + elpct + as.numeric(district)))

4.6.4 Multicollinearity of dummy variables

    constant  group 1  group 2  group 3
        1        1        0        0
        1        0        1        0
        1        0        0        1
        1        1        0        0
        1        0        1        0
        1        0        0        1

This matrix does not have full column rank.
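The rank deficiency of the matrix above can be checked directly in R, for example with qr() (a small sketch; the column names are illustrative):

```r
## Sketch: the dummy matrix from above is rank deficient because
## constant = group1 + group2 + group3
X <- cbind(constant = 1,
           group1 = c(1,0,0,1,0,0),
           group2 = c(0,1,0,0,1,0),
           group3 = c(0,0,1,0,0,1))
qr(X)$rank    # 3, although the matrix has 4 columns
```

The usual remedy is to drop either the constant or one of the dummies; R's model formulas do this automatically by omitting one factor level.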

4.7 Specification Errors: Summary

- underspecified model, a regressor β2 is missing:
  β̂1 is unbiased only if β2 = 0 or X1'X2 = 0
- overspecified model, regressors are collinear:
  β̂ cannot be estimated (X'X cannot be inverted)
- overspecified model, regressors are almost collinear:
  β̂ can only be estimated inexactly

The distribution of β̂


4.7.1 The variance of β̂

For the simple regression (as a reminder):

Homoscedasticity:

    σ̂²_β̂1 = (1/n) · [ 1/(n−2) Σᵢ ûᵢ² ] / [ (1/n) Σᵢ (Xᵢ − X̄)² ]

Heteroscedasticity (always correct):

    σ̂²_β̂1 = (1/n) · [ 1/(n−2) Σᵢ v̂ᵢ² ] / [ (1/n) Σᵢ (Xᵢ − X̄)² ]²

    (with v̂ᵢ = (Xᵢ − X̄) ûᵢ)

For the multiple regression:

Homoscedasticity:

    Σ̂_β̂ = σ̂²_u (X'X)⁻¹

Heteroscedasticity (always correct):

    Σ̂_β̂ = (X'X)⁻¹ X' I_û² X (X'X)⁻¹

(where I_û² is the diagonal matrix of the squared residuals)
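The homoscedastic formula σ̂²_u (X'X)⁻¹ can be verified against R's vcov(). A minimal sketch with simulated data (illustrative names, not the Caschool data):

```r
## Sketch: sigma_u^2 (X'X)^{-1} computed by hand, compared with vcov()
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
est <- lm(y ~ x)

X    <- cbind(1, x)                      # design matrix including the constant
s2u  <- sum(residuals(est)^2) / (n - 2)  # estimate of sigma_u^2
Vhat <- s2u * solve(t(X) %*% X)
all.equal(unname(Vhat), unname(vcov(est)))   # TRUE
```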

4.7.2 Imperfect multicollinearity

Where X is almost multicollinear, (X'X)⁻¹ is very large and the estimate of β̂
is rather imprecise.
4.7.3 Hypothesis tests

Testing the hypothesis H0: βⱼ = βⱼ,₀ against H1: βⱼ ≠ βⱼ,₀:
Determine the t statistic:

    t = (β̂ⱼ − βⱼ,₀) / σ̂_β̂ⱼ

The p-value is p = Pr(|t| > |t^sample|) = 2 Φ(−|t^sample|)

[Figure: standard normal density with the two tails beyond −|t| and +|t|
shaded; their total probability is the p-value 2 Φ(−|t|)]


est <- lm(testscr ~ str + elpct)


diag(X) extracts the diagonal of X if X is a matrix. If x is a vector, diag(x)
constructs the diagonal matrix. coef extracts the estimated coefficients from a model.

Homoscedastic standard deviation of β̂:  Σ̂_β̂ = σ̂²_u (X'X)⁻¹

(stddevh <- sqrt(diag(vcov(est))))
(Intercept)         str       elpct 
 7.41131248  0.38027832  0.03934255 

coef(est) / stddevh
(Intercept)         str       elpct 
  92.565554   -2.896026  -16.515879 

round(2*pnorm(- abs(coef(est)) / stddevh),5)
(Intercept)         str       elpct 
    0.00000     0.00378     0.00000 

summary(est)

Call:
lm(formula = testscr ~ str + elpct)
Residuals:
    Min      1Q  Median      3Q     Max 
-48.845 -10.240  -0.308   9.815  43.461 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 686.03225    7.41131  92.566  < 2e-16 ***
str          -1.10130    0.38028  -2.896  0.00398 ** 
elpct        -0.64978    0.03934 -16.516  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.46 on 417 degrees of freedom
Multiple R-squared: 0.4264,  Adjusted R-squared: 0.4237
F-statistic: 155 on 2 and 417 DF,  p-value: < 2.2e-16


summaryR(est)

Call:
lm(formula = testscr ~ str + elpct)
Residuals:
    Min      1Q  Median      3Q     Max 
-48.845 -10.240  -0.308   9.815  43.461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 686.0322     8.8122   77.85   <2e-16 ***
str          -1.1013     0.4371   -2.52   0.0121 *  
elpct        -0.6498     0.0313  -20.76   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.46 on 417 degrees of freedom
Multiple R-squared: 0.4264,  Adjusted R-squared: 0.4237
F-statistic: 220.1 on 2 and 417 DF,  p-value: < 2.2e-16

Heteroscedastic variance-covariance matrix of β̂:

    Σ̂_β̂ = (X'X)⁻¹ X' I_û² X (X'X)⁻¹    (heteroscedasticity, always correct)

Now we perform the same steps using heteroscedasticity-consistent standard
errors:

sqrt(diag(vcov(est)))
(Intercept)         str       elpct 
 7.41131248  0.38027832  0.03934255 

(stddev <- sqrt(diag(hccm(est))))
(Intercept)         str       elpct 
 8.81224085  0.43706612  0.03129693 

coef(est) / stddev
(Intercept)         str       elpct 
  77.849920   -2.519747  -20.761681 

round(2*pnorm(- abs(coef(est)) / stddev),5)
(Intercept)         str       elpct 
    0.00000     0.01174     0.00000 


4.7.4 Digression: Multiplication

Inner product of vectors using %*%:

    (a₁, a₂, …, a_k) %*% (b₁, b₂, …, b_k)' = Σᵢ aᵢ bᵢ

Elementwise product *:

    (a₁, a₂, …, a_k) * (b₁, b₂, …, b_k) = (a₁b₁, a₂b₂, …, a_k b_k)

Outer product A %o% B:

    (a₁, a₂, …, a_k)' %o% (b₁, b₂, …, b_m) is the k × m matrix with entries aᵢ bⱼ

In particular,

    (a₁, a₂, …, a_k)' %o% (−1, +1) is the k × 2 matrix with columns
    (−a₁, …, −a_k)' and (a₁, …, a_k)'
Confidence interval for β̂:

qnorm(.975)
[1] 1.959964

coef(est) + qnorm(.975) * stddev %o% c(-1,1)
                   [,1]        [,2]
(Intercept) 668.7605740 703.3039234
str          -1.9579298  -0.2446620
elpct        -0.7111176  -0.5884359

What happens if group size is reduced? (e.g. by 2)


-2* (coef(est) + qnorm(.975) * stddev %o% c(-1,1))["str",]
[1] 3.9158595 0.4893241

4.7.5 Extending the estimation equation by adding expenditure per student

(est <- lm(testscr ~ str + elpct))

Call:
lm(formula = testscr ~ str + elpct)

Coefficients:
(Intercept)          str        elpct  
   686.0322      -1.1013      -0.6498  

(stddev <- sqrt(diag(hccm(est))))
(Intercept)         str       elpct 
 8.81224085  0.43706612  0.03129693 

coef(est) / stddev
(Intercept)         str       elpct 
  77.849920   -2.519747  -20.761681 

round(2*pnorm(-abs(coef(est) / stddev)),5)
(Intercept)         str       elpct 
    0.00000     0.01174     0.00000 

(est <- lm(testscr ~ str + elpct + expnstu))

Call:
lm(formula = testscr ~ str + elpct + expnstu)

Coefficients:
(Intercept)          str        elpct      expnstu  
 649.577947    -0.286399    -0.656023     0.003868  

(stddev <- sqrt(diag(hccm(est))))
 (Intercept)          str        elpct      expnstu 
15.668622170  0.487512918  0.032114291  0.001607407 

coef(est) / stddev
(Intercept)         str       elpct     expnstu 
 41.4572475  -0.5874701 -20.4277485   2.4062993 

round(2*pnorm(-abs(coef(est) / stddev)),5)
(Intercept)         str       elpct     expnstu 
    0.00000     0.55689     0.00000     0.01612 

Compare the standard error of the coefficient of str in the different estimation
equations.
sqrt(diag(hccm(lm(testscr ~ str ))))["str"]
str
0.5243585
sqrt(diag(hccm(lm(testscr ~ str + elpct))))["str"]
str
0.4370661
sqrt(diag(hccm(lm(testscr ~ str + elpct + expnstu))))["str"]
str
0.4875129

In case of multicollinearity: σ̂_β̂ increases.

In case of an omitted variable: σ̂_β̂ may decrease.

4.8 Joint Hypotheses

e.g. β_str = 0 and β_expnstu = 0, or β1 = 0 and β2 = 0.

Formally H0: β1 = β1,0 ∧ β2 = β2,0   versus   H1: β1 ≠ β1,0 ∨ β2 ≠ β2,0

Idea: We could just test β1 = 0 and β2 = 0 independently of each other:

    t1 = (β̂1 − β1,0) / σ̂_β̂1        t2 = (β̂2 − β2,0) / σ̂_β̂2

In that case the null hypothesis H0: β1 = β1,0 ∧ β2 = β2,0 would be rejected if
either β1 = 0 or β2 = 0 is rejected.
We can easily see that this does not even work for uncorrelated β̂s:


set.seed(100)
N<-1000
p<-0.05
qcrit<- -qnorm(p/2)
b1<-rnorm(N)
mean(abs(b1)>qcrit)*100
[1] 5.9
b2<-rnorm(N)
mean(abs(b2)>qcrit)*100
[1] 4.6
reject<-abs(b1)>qcrit | abs(b2)>qcrit
mean(reject)*100
[1] 10.3

In the example 10.3 % of the values are rejected by the joint test, not 5%. This is
not a coincidence. The next diagram shows that we are not only cutting off on the
left and on the right, but also at the top and at the bottom.

plot(b2 ~ b1,cex=.7)
points(b2 ~ b1,subset=reject,col="red",pch=7,cex=.5)
abline(v=c(qcrit,-qcrit),h=c(qcrit,-qcrit))
dataEllipse(b1,b2,levels=1-p,plot.points=FALSE)
legend("topleft",c("naive rejection","95\\% region"),pch=c(7,NA),col="red",lty=c(NA,1),cex=.7)

[Figure: scatter plot of b2 against b1 with the naive rejection region
(vertical and horizontal lines at ±qcrit) and the 95% confidence ellipse]

Additionally we can see that this naive approach only takes the maximum deviation
of the variables into account. It would be more sensible to reject all
observations outside of the red ellipse.
The second problem becomes even more annoying if the random variables are
correlated:

set.seed(100)
b1<-rnorm(N)
b2<-.3* rnorm(N) + .7*b1
reject<-abs(b1)>qcrit | abs(b2)>qcrit
plot(b2 ~ b1,cex=.5)
points(b2 ~ b1,subset=reject,col="red",pch=7,cex=.5)
abline(v=c(qcrit,-qcrit),h=c(qcrit,-qcrit))
dataEllipse(b1,b2,levels=1-p,plot.points=FALSE)
text(-1,1,"A")
legend("topleft",c("naive rejection","95\\% region"),pch=c(7,NA),col="red",lty=c(NA,1),cex=.7)

[Figure: scatter plot of the correlated b1 and b2 with the naive rejection
region and the 95% confidence ellipse; the point labelled "A" lies outside the
ellipse]

For example, "A" in the diagram is clearly outside the confidence ellipse, but
none of its single coordinates are conspicuous.
4.8.1 F statistic for two restrictions

    t1 = (β̂1 − β1,0) / σ̂_β̂1        t2 = (β̂2 − β2,0) / σ̂_β̂2

    F = (1/2) · (t1² + t2² − 2 ρ̂_{t1,t2} t1 t2) / (1 − ρ̂²_{t1,t2})

with ρ̂_{t1,t2} being the estimated correlation between t1 and t2.

If ρ̂_{t1,t2} = 0:

    F = (1/2) (t1² + t2²)

Recall:

    N(0,1) / sqrt(χ²_n / n) ~ t_n
    Σᵢ₌₁ⁿ (N(0,1))² ~ χ²_n
    (χ²_{n1}/n1) / (χ²_{n2}/n2) ~ F_{n1,n2}
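The formula above is easy to evaluate. A small sketch with illustrative numbers (t1, t2 and rho are made up, not taken from an estimation):

```r
## Sketch: the F statistic for two restrictions from the two t statistics
## and their estimated correlation rho
Ftwo <- function(t1, t2, rho) {
  (t1^2 + t2^2 - 2 * rho * t1 * t2) / (2 * (1 - rho^2))
}
Ftwo(2, 3, 0)     # with rho = 0 this reduces to (t1^2 + t2^2)/2 = 6.5
Ftwo(2, 3, 0.5)   # correlation between the t's changes the joint statistic
```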

4.8.2 More than two restrictions

Write restrictions as

    R β = r

e.g. a single restriction β1 = 0:

    (0, 1, 0, …, 0) · (β0, β1, β2, …, βk)' = 0

or two restrictions, e.g. β1 = 0 and β2 = 7:

    ( 0 1 0 … 0 )                            ( 0 )
    ( 0 0 1 … 0 ) · (β0, β1, β2, …, βk)'  =  ( 7 )

The test statistic is

    F = (R β̂ − r)' (R Σ̂_β̂ R')⁻¹ (R β̂ − r) / q

with q being the number of restrictions.
If assumptions 1-4 (see 4.4) are satisfied, F converges in distribution to F_{q,∞}.
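The Wald statistic above can be computed by hand. A minimal sketch with simulated data (illustrative names; with the homoscedastic vcov() this reproduces the classical F test exactly):

```r
## Sketch: F = (R b - r)' (R V R')^{-1} (R b - r) / q computed by hand
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.2 * x1 + rnorm(n)
est <- lm(y ~ x1 + x2)

R <- rbind(c(0,1,0),          # restriction beta1 = 0
           c(0,0,1))          # restriction beta2 = 0
r <- c(0,0)
b <- coef(est)
V <- vcov(est)                # homoscedastic variance-covariance matrix
q <- nrow(R)
Fwald <- drop(t(R %*% b - r) %*% solve(R %*% V %*% t(R)) %*% (R %*% b - r)) / q
Fwald
anova(lm(y ~ 1), est)$F[2]    # the classical F test gives the same value
```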

[Figure: densities of the F_{q,∞} distribution for q = 2, 3, 5, 10; the p-value
is p = Pr(F(q,∞) > F^sample), the area under the density to the right of the
sample statistic]

4.8.3 Special cases

The last line of a regression output:

summaryR(lm(testscr ~ str + elpct + expnstu))

Call:
lm(formula = testscr ~ str + elpct + expnstu)
Residuals:
    Min      1Q  Median      3Q     Max 
-51.340 -10.111   0.293  10.318  43.181 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 649.577947  15.668622  41.457   <2e-16 ***
str          -0.286399   0.487513  -0.587   0.5572    
elpct        -0.656023   0.032114 -20.428   <2e-16 ***
expnstu       0.003868   0.001607   2.406   0.0166 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.35 on 416 degrees of freedom
Multiple R-squared: 0.4366,  Adjusted R-squared: 0.4325
F-statistic: 144.3 on 3 and 416 DF,  p-value: < 2.2e-16

Restrictions for the F-statistic of an estimation (β0 is not being tested):

    ( 0 1 0 0 … 0 )   ( β0 )     ( 0 )
    ( 0 0 1 0 … 0 )   ( β1 )     ( 0 )
    ( 0 0 0 1 … 0 ) · ( β2 )  =  ( 0 )
    (     …       )   (  … )     (  …)
    ( 0 0 0 0 … 1 )   ( βk )     ( 0 )

Testing a single coefficient:

    (0, 1, 0, …, 0) · (β0, β1, …, βk)' = 0

corresponds to the t statistic

    t1 = (β̂1 − β1,0) / σ̂_β̂1

Recall: if X ~ t(k), then X² ~ F(1, k).

c constructs a vector by joining the arguments together. rbind joins the arguments
of the function (vectors, matrices) row-wise. cbind joins the arguments of the
function (vectors, matrices) column-wise. linearHypothesis tests linear hypotheses.
pf calculates the distribution function of the F-distribution, df


calculates the density function, qf calculates quantiles of the F-distribution, rf calculates an F-distributed
random variable.

H0: β_str = 0 and β_expnstu = 0

est <- lm(testscr ~ str + elpct + expnstu)
R <- rbind(c(0,1,0,0),c(0,0,0,1))
r <- c(0,0)

    ( 0 1 0 0 )   ( β0 )     ( 0 )
    ( 0 0 0 1 ) · ( β1 )  =  ( 0 )
                  ( β2 )
                  ( β3 )
linearHypothesis(est, R, r)
Linear hypothesis test

Hypothesis:
str = 0
expnstu = 0

Model 1: restricted model
Model 2: testscr ~ str + elpct + expnstu

  Res.Df   RSS Df Sum of Sq      F   Pr(>F)    
1    418 89000                                 
2    416 85700  2    3300.3 8.0101 0.000386 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

testh<-linearHypothesis(est, R, r)
pf(testh$F[2],2,Inf,lower.tail=FALSE)
[1] 0.0003320828

linearHypothesis(est, R, r, vcov=hccm)

Linear hypothesis test

Hypothesis:
str = 0
expnstu = 0

Model 1: restricted model
Model 2: testscr ~ str + elpct + expnstu

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F   Pr(>F)   
1    418                      
2    416  2 5.2617 0.005537 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

test<-linearHypothesis(est, R, r, vcov=hccm)
pf(test$F[2],2,Inf,lower.tail=FALSE)
[1] 0.005186642

linearHypothesis(est,c("str=0","expnstu=0"),vcov=hccm)

Linear hypothesis test

Hypothesis:
str = 0
expnstu = 0

Model 1: restricted model
Model 2: testscr ~ str + elpct + expnstu

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F   Pr(>F)   
1    418                      
2    416  2 5.2617 0.005537 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.8.4 Special case: Homoscedastic error terms

Recall: in the case of heteroscedasticity:

    F = (R β̂ − r)' (R Σ̂_β̂ R')⁻¹ (R β̂ − r) / q

In the case of homoscedasticity, if we test β1 = β2 = … = βk = 0:

    F = (SSR_restricted − SSR_unrestricted)/q · (n − k − 1)/SSR_unrestricted

with
    SSR_restricted     Σᵢ ûᵢ² of the restricted model
    SSR_unrestricted   Σᵢ ûᵢ² of the unrestricted model
    k                  number of regressors of the unrestricted model
    q                  number of restrictions

Recall:

    R² = 1 − s²_û/s²_y = 1 − SSR/TSS

Divide numerator and denominator of F by TSS:

    F = (R²_unrestricted − R²_restricted)/q · (n − k − 1)/(1 − R²_unrestricted)

F is distributed according to F_{q, n−k−1}.
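The SSR version of the F statistic can be checked against anova(). A minimal sketch with simulated data (illustrative names, not the Caschool data):

```r
## Sketch: the homoscedastic F statistic from the two sums of squared
## residuals, compared with anova()
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + rnorm(n)
unrestricted <- lm(y ~ x1 + x2)
restricted   <- lm(y ~ 1)       # H0: beta1 = beta2 = 0
k <- 2                          # regressors in the unrestricted model
q <- 2                          # number of restrictions
Fstat <- (deviance(restricted) - deviance(unrestricted)) /
         deviance(unrestricted) * (n - k - 1) / q
Fstat
anova(restricted, unrestricted)$F[2]   # the same value
```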

4.9 Restrictions with more than one coefficient

e.g. in RetSchool we estimate

    wage76 ~ grade76 + age76 + black + daded + momed

Hypothesis: β_daded = β_momed

1st approach (F-test): R = (0 0 0 0 1 −1), r = 0

2nd approach (t-test, rearranging the equation):

    y = β0 + β1 X1 + β2 X2 + u
      = β0 + (β1 − β2) X1 + β2 (X2 + X1) + u
data(RetSchool,package="Ecdat")
attach(RetSchool)
summary(wage76)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   1.377   1.683   1.658   1.957   3.180    2147 

table(grade76)
grade76
   0    1    2    3    4    5    6    7    8    9   10   11   12 
   3    2    2    4    6   13   22   42   90   92  148  194 1213 
  13   14   15   16   17   18 
 332  314  209  539  182  264 

est <- lm(wage76 ~ grade76 + age76 + black + daded + momed)

summaryR(est)

Call:
lm(formula = wage76 ~ grade76 + age76 + black + daded + momed)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.75969 -0.25153  0.02054  0.25961  1.36709 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.002560   0.078653   0.033   0.9740    
grade76      0.039486   0.003060  12.902   <2e-16 ***
age76        0.039229   0.002317  16.930   <2e-16 ***
black       -0.218286   0.017866 -12.218   <2e-16 ***
daded        0.000465   0.002732   0.170   0.8648    
momed        0.007247   0.003009   2.408   0.0161 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3894 on 3053 degrees of freedom
  (2166 observations deleted due to missingness)
Multiple R-squared: 0.2274,  Adjusted R-squared: 0.2261
F-statistic: 177.9 on 5 and 3053 DF,  p-value: < 2.2e-16

Although the coefficient of momed is significantly different from zero and the
coefficient of daded is not, they are not significantly different from each other:
linearHypothesis(est,c("daded=momed"),vcov=hccm)

Linear hypothesis test

Hypothesis:
daded - momed = 0

Model 1: restricted model
Model 2: wage76 ~ grade76 + age76 + black + daded + momed

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F Pr(>F)
1   3054                 
2   3053  1 1.9809 0.1594

alternatively:

momdaded <- momed+daded
est2<-lm(wage76 ~ grade76 + age76 + black + momed + momdaded)
linearHypothesis(est2,"momed=0",vcov=hccm)

Linear hypothesis test

Hypothesis:
momed = 0

Model 1: restricted model
Model 2: wage76 ~ grade76 + age76 + black + momed + momdaded

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F Pr(>F)
1   3054                 
2   3053  1 1.9809 0.1594

or even simpler:
summaryR(lm(wage76 ~ grade76 + age76 + black + momed + momdaded))

Call:
lm(formula = wage76 ~ grade76 + age76 + black + momed + momdaded)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.75969 -0.25153  0.02054  0.25961  1.36709 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.002560   0.078653   0.033    0.974    
grade76      0.039486   0.003060  12.902   <2e-16 ***
age76        0.039229   0.002317  16.930   <2e-16 ***
black       -0.218286   0.017866 -12.218   <2e-16 ***
momed        0.006782   0.004818   1.407    0.159    
momdaded     0.000465   0.002732   0.170    0.865    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3894 on 3053 degrees of freedom
  (2166 observations deleted due to missingness)
Multiple R-squared: 0.2274,  Adjusted R-squared: 0.2261
F-statistic: 177.9 on 5 and 3053 DF,  p-value: < 2.2e-16

confidence.ellipse draws the confidence area for coefficients of a linear model.

confidence.ellipse(est,c("daded","momed"),levels=c(.9,.95,.975,.99))
abline(v=0,h=0,a=0,b=1)

[Figure: confidence ellipses (90%, 95%, 97.5%, 99%) for the daded and momed
coefficients, together with the axes daded = 0 and momed = 0 and the line
daded = momed]

linearHypothesis(est,c("daded=0","momed=0"),vcov=hccm)

Linear hypothesis test

Hypothesis:
daded = 0
momed = 0

Model 1: restricted model
Model 2: wage76 ~ grade76 + age76 + black + daded + momed

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F  Pr(>F)  
1   3055                    
2   3053  2 3.6955 0.02495 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

linearHypothesis(est,c("daded=0","momed=0.01"),vcov=hccm)

Linear hypothesis test

Hypothesis:
daded = 0
momed = 0.01

Model 1: restricted model
Model 2: wage76 ~ grade76 + age76 + black + daded + momed

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F Pr(>F)
1   3055                 
2   3053  2 0.4433  0.642

linearHypothesis(est,c("daded=0.01","momed=0"),vcov=hccm)

Linear hypothesis test

Hypothesis:
daded = 0.01
momed = 0

Model 1: restricted model
Model 2: wage76 ~ grade76 + age76 + black + daded + momed

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F   Pr(>F)   
1   3055                      
2   3053  2 6.6741 0.001282 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bayesian… One possibility to study the different effects of the education of
dads and of moms on wages in a Bayesian framework would be to look at the
difference of the two coefficients. In the following we estimate the same
regression as above, but concentrate on the difference between momed and daded.
mData<-with(est$model,list(wage=wage76,grade=grade76,age=age76,black=black,
                           mom=momed,dad=daded))
modelR<-'model {
  for (i in 1:length(wage)) {
    wage[i] ~ dnorm(beta0 + bGrade*grade[i]+bAge*age[i]+
                    bBlack*black[i]+bMom*mom[i]+bDad*dad[i],tau)
  }
  beta0 ~ dnorm (0,.0001)
  bGrade~ dnorm (0,.0001)
  bAge  ~ dnorm (0,.0001)
  bBlack~ dnorm (0,.0001)
  bMom  ~ dnorm (0,.0001)
  bDad  ~ dnorm (0,.0001)
  tau   ~ dgamma(.01,.01)
  MomMinusDad <- bMom-bDad
}'
bayesR<-run.jags(model=modelR,data=mData,monitor=c("MomMinusDad"))
Compiling rjags model and adapting for 1000 iterations...
Calling the simulation using the rjags method...
Burning in the model for 4000 iterations...
Running the model for 10000 iterations...
Simulation complete
Calculating the Gelman-Rubin statistic for 1 variables....
The Gelman-Rubin statistic is below 1.05 for all parameters
Finished running the simulation

JAGS gives us a distribution for this difference:

plot(bayesR,var="MomMinusDad",type=c("trace","density"))

[Figure: trace plot and posterior density of MomMinusDad; most of the posterior
mass lies between -0.01 and 0.02]
The credible interval of this difference contains zero.
summary(bayesR)

Iterations = 5001:15000
Thinning interval = 1
Number of chains = 2
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                Mean       SD   Naive SE Time-series SE
MomMinusDad 0.006633 0.004661 0.00003296      0.0001848

2. Quantiles for each variable:

                 2.5%      25%      50%      75%   97.5%
MomMinusDad -0.002547 0.003548 0.006701 0.009747 0.01569

How likely is it that the difference MomMinusDad is actually negative?

100*mean(unlist(bayesR$mcmc)<0)
[1] 8.105

How likely is it that the difference MomMinusDad is positive?

100*mean(unlist(bayesR$mcmc)>0)
[1] 91.895

(The first number is about half of the p-value we got for daded=momed above.

linearHypothesis(est,c("daded=momed"))

Linear hypothesis test

Hypothesis:
daded - momed = 0

Model 1: restricted model
Model 2: wage76 ~ grade76 + age76 + black + daded + momed

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1   3054 463.32                           
2   3053 463.01  1   0.31293 2.0634  0.151

This is to be expected since above we did a two-sided test, while here the
alternative is one-sided.)

4.10 Model specification


Model specification: which coefficients to include?

testscr ~ distcod + county + district + grspan + enrltot + teachers
  + calwpct + mealpct + computer + compstu + expnstu + str + avginc
  + elpct + readscr + mathscr
This model contains perhaps too many coefficients (→ multicollinearity).

testscr ~ str
This model contains perhaps too few coefficients (→ omitted variable bias).

Omitted variable bias:

    E(b1) = β1 + (X1'X1)⁻¹ X1'X2 β2

Only when X1 is orthogonal to X2, or β2 is zero, do we have no bias.

Overfitting (multicollinearity):

    Σ̂_β̂ = (X'X)⁻¹ X' I_û² X (X'X)⁻¹

When X is (almost) collinear, then (X'X)⁻¹ is large, and then Σ̂_β̂ is large;
hence our estimates are not precise.
- Start with a base specification.
- Building on this, develop alternative specifications.
- When coefficients change in an alternative specification, this can be a sign
  of omitted variable bias.

Scaling coefficients. How should we add a new variable to the regression?
Scaling might simplify readability.

coef(lm(testscr ~ str + elpct + expnstu))
  (Intercept)           str         elpct       expnstu 
649.577947257  -0.286399240  -0.656022660   0.003867902 

elratio <- elpct/100
coef(lm(testscr ~ str + elratio + expnstu))
  (Intercept)           str       elratio       expnstu 
649.577947257  -0.286399240 -65.602266008   0.003867902 

expnstuTSD <- expnstu/1000
coef(lm(testscr ~ str + elpct + expnstuTSD))
(Intercept)         str       elpct  expnstuTSD 
649.5779473  -0.2863992  -0.6560227   3.8679018 

What is the benefit of adding another variable?

- measure R²
- measure the contribution to R²
- look at the p-value of the t-statistic
- look at the p-value of the variance analysis
- measure AIC
Perhaps it is helpful to control for wealth in the school district. Which variables
could, in the Caschool example, be a good indicator for wealth?
plot(testscr ~ elpct,main="English learner percentage")
plot(testscr ~ mealpct,main="percentage qualifying for reduced price lunch")
plot(testscr ~ calwpct,main="percentage qualifying for income assistance")

[Figure: three scatter plots of testscr against elpct ("English learner
percentage"), mealpct ("percentage qualifying for reduced price lunch"), and
calwpct ("percentage qualifying for income assistance")]

4.10.1 Measure R²

    R² = 1 − SSR/TSS

R² only measures the fit of the regression:

- R² does not measure causality (e.g. parking lots → testscr)
- R² does not measure the absence of omitted variable bias
- R² does not measure the correctness of the specification


4.10.2 Measure contribution to R²

There are different approaches to do this.
We can take a look at R² when the variable we are analysing is the only one in
the model, or when it is the last one, or we can look at all other feasible
sequences.

library(relaimpo)
est <- lm(testscr ~ str + elpct + mealpct + calwpct)
calc.relimp(est,type=c("first","last","lmg","pmvd"),rela=TRUE)

Response variable: testscr
Total response variance: 363.0301
Analysis based on 420 observations

4 Regressors:
str elpct mealpct calwpct
Proportion of variance explained by model: 77.49%
Metrics are normalized to sum to 100% (rela=TRUE).

Relative importance metrics:
               lmg         pmvd        last      first
str     0.03119231 0.0148176134 0.059126952 0.03175031
elpct   0.22371548 0.0242703918 0.048159854 0.25708495
mealpct 0.53343971 0.9600101671 0.890678586 0.46768098
calwpct 0.21165250 0.0009018276 0.002034608 0.24348376

Average coefficients for different model sizes:
                1X        2Xs        3Xs         4Xs
str     -2.2798083 -1.4612232 -1.1371224 -1.01435328
elpct   -0.6711562 -0.4347537 -0.2510901 -0.12982189
mealpct -0.6102858 -0.5922408 -0.5645062 -0.52861908
calwpct -1.0426750 -0.5863541 -0.2639020 -0.04785371

4.10.3 Information criteria

Analysis of variance. Instead of looking at the p-value of a coefficient, we
compare the variance of the residuals of two different models: model 2 (with
the coefficient) and model 1 (without).

    F(k2−k1, n−k2) = (RSS1 − RSS2)/RSS2 · (n − k2)/(k2 − k1)

We can, hence, use the F statistic to compare two models.


Let L be the log-likelihood of the estimated model.


One can show that for linear models

    −2·L = n·log(RSS/n) + C

But then

    2·(L2 − L1) = n·(log(RSS1/n) − log(RSS2/n)) = n·log(RSS1/RSS2) ∼ χ²(k2−k1)

We can, thus, also use the χ² statistic to compare two models.
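The same comparison via the χ² approximation can be checked in plain Python; for one degree of freedom the survival function of χ² equals erfc(√(x/2)), so no stats library is needed (values taken from the R session below):

```python
import math

# Likelihood-ratio chi-square statistic n*log(RSS1/RSS2), df = k2 - k1 = 1.
rss1, rss2, n = 35450.8, 34247.46, 420
x = n * math.log(rss1 / rss2)      # chi-square statistic, about 14.50
p = math.erfc(math.sqrt(x / 2))    # survival function of chi^2 with 1 df
```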


est2 <- lm(testscr ~ str + elpct + mealpct + calwpct)
est1 <- lm(testscr ~ str + mealpct + calwpct)
sum(est2$residuals^2)
[1] 34247.46
sum(est1$residuals^2)
[1] 35450.8

An easier way to obtain the SSR is deviance:

deviance(est2)
[1] 34247.46

RSS2 <- deviance(est2)
RSS1 <- deviance(est1)
L2 <- logLik(est2)
L1 <- logLik(est1)
n  <- length(est2$residuals)
k2 <- est2$rank
k1 <- est1$rank
pchisq(2*(L2 - L1), k2-k1, lower=FALSE)
'log Lik.' 0.0001398651 (df=6)
pchisq(n * log(RSS1/RSS2), k2-k1, lower=FALSE)
[1] 0.0001398651

anova(est1, est2, test="Chisq")
Analysis of Variance Table

Model 1: testscr ~ str + mealpct + calwpct
Model 2: testscr ~ str + elpct + mealpct + calwpct
  Res.Df   RSS Df Sum of Sq  Pr(>Chi)    
1    416 35451                           
2    415 34247  1    1203.3 0.0001342 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

pf((RSS1 - RSS2)/RSS2 * (n-k2)/(k2-k1), k2-k1, n-k2, lower=FALSE)
[1] 0.0001547027

anova(est1, est2)
Analysis of Variance Table

Model 1: testscr ~ str + mealpct + calwpct
Model 2: testscr ~ str + elpct + mealpct + calwpct
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    416 35451                                  
2    415 34247  1    1203.3 14.582 0.0001547 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

F-test for ANOVA

summary(est2)

Call:
lm(formula = testscr ~ str + elpct + mealpct + calwpct)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.179  -5.239  -0.185   5.171  31.308 

Coefficients:
             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 700.39184    4.69797 149.084   < 2e-16 ***
str          -1.01435    0.23974  -4.231 0.0000286 ***
elpct        -0.12982    0.03400  -3.819  0.000155 ***
mealpct      -0.52862    0.03219 -16.422   < 2e-16 ***
calwpct      -0.04785    0.06097  -0.785  0.432974    
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 9.084 on 415 degrees of freedom
Multiple R-squared: 0.7749, Adjusted R-squared: 0.7727
F-statistic: 357.1 on 4 and 415 DF, p-value: < 2.2e-16

Comparing two models is different from testing one coefficient.

Information criteria use a similar approach. Goal: find a model that explains the data well but has as few parameters as possible (prevent overfitting).
Let L be the log-likelihood of the estimated model.
Hirotugu Akaike (1971): An Information Criterion:

    AIC = −2·L + 2·k

Gideon E. Schwarz (1978): Bayesian Information Criterion:

    BIC = −2·L + k·log n

est <- lm(testscr ~ str + elpct + mealpct + calwpct + enrltot)
extractAIC(est)
[1]    6.000 1859.498
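For linear models extractAIC drops constant terms and effectively computes n·log(RSS/n) + 2k. A plain-Python check with the RSS of the full model (34169, as rounded in the step output below) recovers the reported value up to rounding:

```python
import math

# extractAIC for a linear model: n*log(RSS/n) + 2*k (constants dropped).
n, k, rss = 420, 6, 34169.0   # k counts the 5 regressors plus the intercept
aic = n * math.log(rss / n) + 2 * k   # close to the reported 1859.498
```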

step now looks automatically for a model with a good AIC:

step(est)
Start:  AIC=1859.5
testscr ~ str + elpct + mealpct + calwpct + enrltot

          Df Sum of Sq   RSS    AIC
- calwpct  1      70.1 34239 1858.4
- enrltot  1      78.9 34247 1858.5
<none>                 34169 1859.5
- elpct    1    1262.6 35431 1872.7
- str      1    1552.8 35721 1876.2
- mealpct  1   20702.3 54871 2056.4

Step:  AIC=1858.36
testscr ~ str + elpct + mealpct + enrltot

          Df Sum of Sq   RSS    AIC
- enrltot  1        60 34298 1857.1
<none>                 34239 1858.4
- elpct    1      1208 35446 1870.9
- str      1      1496 35734 1874.3
- mealpct  1     51150 85388 2240.2

Step:  AIC=1857.09
testscr ~ str + elpct + mealpct

          Df Sum of Sq   RSS    AIC
<none>                 34298 1857.1
- elpct    1      1167 35465 1869.1
- str      1      1441 35740 1872.4
- mealpct  1     52947 87245 2247.2

Call:
lm(formula = testscr ~ str + elpct + mealpct)

Coefficients:
(Intercept)          str        elpct      mealpct  
   700.1500      -0.9983      -0.1216      -0.5473  

Why does the AIC make sense?

Within sample prediction: including more variables always improves the likelihood.

Out of sample prediction: including more variables may decrease the likelihood.
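The first point can be verified on a toy data set in plain Python (illustrative numbers, not the Caschool data): for OLS, moving from an intercept-only model to a model with one regressor can never increase the in-sample residual sum of squares.

```python
# In-sample fit never deteriorates when a regressor is added (toy example).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 2.0, 3.0, 5.0]
n = len(ys)
xbar, ybar = sum(xs) / n, sum(ys) / n

rss0 = sum((y - ybar) ** 2 for y in ys)             # intercept-only model
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar) ** 2 for x in xs)             # OLS slope
b0 = ybar - b1 * xbar                               # OLS intercept
rss1 = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
assert rss1 <= rss0                                 # always true for OLS
```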

set.seed(123)
N<-nrow(Caschool)
mySamp<-sample(1:N,N/2)
CaIn<-Caschool[mySamp,]
CaOut<-Caschool[-mySamp,]
est <-lm(testscr ~ str + elpct + mealpct + calwpct + enrltot,data=CaIn)
estSm<-lm(testscr ~ str + elpct + mealpct
+ enrltot,data=CaIn)
deviance(est)
[1] 15996.78
deviance(estSm)
[1] 16128.62

Within sample a smaller model must have a larger deviance.

Now do the same out of sample:

sum((CaOut$testscr-predict(est,newdata=CaOut))^2)
[1] 18658.43

125

c Oliver Kirchkamp

4 MODELS WITH MORE THAN ONE INDEPENDENT VARIABLE


(MULTIPLE REGRESSION)

6 February 2015 09:49:38

sum((CaOut$testscr-predict(estSm,newdata=CaOut))^2)
[1] 18513.59

Out of sample a smaller model can have a smaller deviance.

newdata <- list(avginc=5:55)
plot(testscr ~ avginc)
lines(predict(lm(testscr ~ poly(avginc,2)),newdata=newdata)~newdata$avginc,lty=1,lwd=4)
lines(predict(lm(testscr ~ poly(avginc,5)),newdata=newdata)~newdata$avginc,lty=2,lwd=4)
lines(predict(lm(testscr ~ poly(avginc,15)),newdata=newdata)~newdata$avginc,lty=3,lwd=4)
legend("bottomright",c("$r=2$","$r=5$","$r=15$"),lty=1:3,lwd=4)

[Figure: testscr against avginc with fitted polynomials of degree r = 2, 5, 15]

Of course, the above graph depends on the specific sample of CaIn and CaOut.
We could repeat this exercise for many samples.

The same idea with polynomial functions

Within sample deviance:

plot(sapply(1:15,function(r) deviance(lm(testscr~poly(avginc,r),data=CaIn))),
     xlab="degree of polynomial $r$",ylab="within sample deviance")

[Figure: within sample deviance as a function of the degree of the polynomial r]
Out of sample deviance:

plot(sapply(1:9,function(r) sum((CaOut$testscr-predict(lm(testscr~poly(avginc,r),data=CaIn),
     newdata=CaOut))^2)),
     xlab="degree of polynomial $r$",ylab="out of sample deviance")

[Figure: out of sample deviance as a function of the degree of the polynomial r]

Of course, the above graph depends on the specific sample of CaIn and CaOut.
We could repeat this exercise for many samples.
AIC tries to capture the quality of out of sample prediction:

plot(sapply(1:15,function(r) AIC(lm(testscr~poly(avginc,r)))),
     xlab="degree of polynomial $r$",ylab="AIC")

[Figure: AIC as a function of the degree of the polynomial r]

If we want to model the relation between avginc and testscr, perhaps the best
is a polynomial of degree 5:

newdata <- list(avginc=5:55)
plot(testscr ~ avginc)
lines(predict(lm(testscr ~ poly(avginc,5)),newdata=newdata)~newdata$avginc,lty=2,lwd=4)

[Figure: testscr against avginc with a fitted polynomial of degree 5]

4.10.4 t-statistic for individual coefficients

    t_i = (β̂_i − β_{i,0}) / σ̂_{β̂_i}

library(car)   # provides hccm, the heteroscedasticity-consistent covariance matrix
est <- lm(testscr ~ str + elpct + mealpct + calwpct)
coef(est)/sqrt(diag(hccm(est)))
(Intercept)         str       elpct     mealpct     calwpct 
124.7256881  -3.7184527  -3.5192705 -13.5610567  -0.7778974 

round(2*pnorm(-abs(coef(est)/sqrt(diag(hccm(est))))),5)
(Intercept)         str       elpct     mealpct     calwpct 
    0.00000     0.00020     0.00043     0.00000     0.43663 
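The two-sided p-value 2·Φ(−|t|) can be reproduced in plain Python via erfc. Using the str coefficient (−1.01435) and its heteroscedasticity-consistent standard error (0.27279), both reported in this section as rounded values, the assumption being that this rounding is precise enough:

```python
import math

# t statistic and two-sided normal p-value for one coefficient.
beta_hat, se = -1.01435, 0.27279   # rounded values for str from the text
t = beta_hat / se                  # about -3.7184
p = math.erfc(abs(t) / math.sqrt(2))   # = 2 * (1 - Phi(|t|)), about 0.00020
```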

Instead of always calculating heteroscedasticity-consistent standard errors manually, as we did above, we can also use the function summaryR from the library tonymisc.

summaryR(lm(testscr ~ str + elpct + mealpct + calwpct))

Call:
lm(formula = testscr ~ str + elpct + mealpct + calwpct)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.179  -5.239  -0.185   5.171  31.308 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 700.39184    5.61546 124.726  < 2e-16 ***
str          -1.01435    0.27279  -3.718 0.000228 ***
elpct        -0.12982    0.03689  -3.519 0.000481 ***
mealpct      -0.52862    0.03898 -13.561  < 2e-16 ***
calwpct      -0.04785    0.06152  -0.778 0.437073    
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 9.084 on 415 degrees of freedom
Multiple R-squared: 0.7749, Adjusted R-squared: 0.7727
F-statistic: 349.4 on 4 and 415 DF, p-value: < 2.2e-16

4.10.5 Bayesian Model Comparison

Idea: A binary process selects (randomly) among the two models we want to compare.

1. testscr = β0 + β1·elpct + β2·str + u
2. testscr = β0 + β1·elpct + u

Problem: While one of the two models is not selected, parameters of this model can take any value (and do not reduce likelihood). Convergence is slow!

Solution: Pseudopriors (the binary process already has informed priors about the two models).
library(runjags)
myData1 <- list(str=str, testscr=testscr, elpct=elpct)
mod1 <- 'model {
  for (i in 1:length(testscr)) {
    testscr[i] ~ dnorm(beta[1]+beta[2]*elpct[i]+beta[3]*str[i],tau)
  }
  # prior:
  for (j in 1:3) {
    beta[j] ~ dnorm(0,.0001)
  }
  tau ~ dgamma(.1,.1)
}'
mod1.jags <- run.jags(model=mod1, data=myData1, monitor=c("beta","tau"))

Compiling rjags model and adapting for 1000 iterations...
Calling the simulation using the rjags method...
Burning in the model for 4000 iterations...
Running the model for 10000 iterations...
Simulation complete
Calculating the Gelman-Rubin statistic for 4 variables....
The Gelman-Rubin statistic is below 1.05 for all parameters
Finished running the simulation

mod1.sum <- summary(mod1.jags)[["statistics"]][,1:2]
                 Mean           SD
beta[1] 682.925223960 7.5409541182
beta[2]  -0.651991336 0.0397160356
beta[3]  -0.942712391 0.3874377114
tau       0.004783592 0.0003318526
myData2 <- list(testscr=testscr, elpct=elpct)
mod2 <- 'model {
  for (i in 1:length(testscr)) {
    testscr[i] ~ dnorm(beta[1] + beta[2]*elpct[i],tau)
  }
  for (j in 1:2) {
    beta[j] ~ dnorm(0,.0001)
  }
  tau ~ dgamma(.1,.1)
}'
mod2.jags <- run.jags(model=mod2, data=myData2, monitor=c("beta","tau"))
Compiling rjags model and adapting for 1000 iterations...
Calling the simulation using the rjags method...
Burning in the model for 4000 iterations...
Running the model for 10000 iterations...
Simulation complete
Calculating the Gelman-Rubin statistic for 3 variables....
The Gelman-Rubin statistic is below 1.05 for all parameters
Finished running the simulation

mod2.sum <- summary(mod2.jags)[["statistics"]][,1:2]
                 Mean           SD
beta[1] 664.691032674 0.9445561702
beta[2]  -0.669906945 0.0389640021
tau       0.004698533 0.0003245831

Remember: the uninformed priors we used above were as follows:

    βi ∼ dnorm(0, 0.0001)
    τ ∼ dgamma(.1, .1)

Now we want to give some (informed) help to the Bayesian model:


The pseudoprior for β is trivial:

The mean of β can be taken from the estimate above.
The precision is τ = 1/σ², and we know σ from the estimate above.

To obtain the pseudoprior for τ we use properties of the Γ-distribution:
If τ ∼ Γ(α, β) then E(τ) = α/β and var(τ) = α/β².
We estimate the parameters of the Γ-distribution as follows:

    α̂ = E(τ)²/var(τ),   β̂ = E(τ)/var(τ)
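This moment matching, which the dgamma(...) lines in the JAGS code below implement, can be sketched in plain Python: given the posterior mean and SD of τ from model 1, solve for shape α and rate β and check the round trip.

```python
# Moment matching for the Gamma pseudoprior of tau:
# if tau ~ Gamma(alpha, beta) then E = alpha/beta, Var = alpha/beta^2.
mean_tau, sd_tau = 0.004783592, 0.0003318526   # posterior summary from the text
var_tau = sd_tau ** 2
alpha = mean_tau ** 2 / var_tau    # shape
beta  = mean_tau / var_tau         # rate
# round trip: the implied mean and variance reproduce the inputs
assert abs(alpha / beta - mean_tau) < 1e-12
assert abs(alpha / beta ** 2 - var_tau) < 1e-12
```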
myData12 <- list(str=str, testscr=testscr, elpct=elpct, sum=mod1.sum, sumX=mod2.sum)
mod12 <- 'model {
  for (i in 1:length(testscr)) {
    testscr[i] ~ dnorm(ifelse(equals(mI,1),
                              beta[1]+beta[2]*elpct[i]+beta[3]*str[i],
                              betaX[1]+betaX[2]*elpct[i]),
                       ifelse(equals(mI,1),tau,tauX))
  }
  for (j in 1:3) {
    beta[j] ~ dnorm(sum[j,1],1/sum[j,2]^2)
  }
  tau ~ dgamma(sum[4,1]^2/sum[4,2]^2,sum[4,1]/sum[4,2]^2)
  for (j in 1:2) {
    betaX[j] ~ dnorm(sumX[j,1],1/sumX[j,2]^2)
  }
  tauX ~ dgamma(sumX[3,1]^2/sumX[3,2]^2,sumX[3,1]/sumX[3,2]^2)
  mI ~ dcat(mProb[])
  mProb[1] <- .5
  mProb[2] <- .5
}'
mod12.jags <-run.jags(model=mod12,data=myData12,
monitor=c("beta","tau","betaX","tauX","mI"))
Compiling rjags model and adapting for 1000 iterations...
Calling the simulation using the rjags method...
Burning in the model for 4000 iterations...
Running the model for 10000 iterations...
Simulation complete
Calculating the Gelman-Rubin statistic for 8 variables....
The Gelman-Rubin statistic is below 1.05 for all parameters
Finished running the simulation

summary(mod12.jags)

Iterations = 5001:15000
Thinning interval = 1
Number of chains = 2
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

               Mean        SD    Naive SE Time-series SE
beta[1]  683.257424 4.8295929 0.034150379    0.245753589
beta[2]   -0.651894 0.0303106 0.000214328    0.000329866
beta[3]   -0.964436 0.2439432 0.001724939    0.012495793
betaX[1] 664.697289 0.9011985 0.006372436    0.008693537
betaX[2]  -0.669097 0.0374353 0.000264708    0.000340590
mI         1.169200 0.3749378 0.002651211    0.012153832
tau        0.004784 0.0002554 0.000001806    0.000002364
tauX       0.004700 0.0003110 0.000002199    0.000002849

2. Quantiles for each variable:

               2.5%        25%        50%        75%      97.5%
beta[1]  673.389990 680.253051 683.486755 686.347180 692.408889
beta[2]   -0.711881  -0.671751  -0.652033  -0.632004  -0.592018
beta[3]   -1.426208  -1.122639  -0.972609  -0.809592  -0.483269
betaX[1] 662.923675 664.096016 664.693130 665.295605 666.456273
betaX[2]  -0.742853  -0.693643  -0.669193  -0.644838  -0.594024
mI         1.000000   1.000000   1.000000   1.000000   2.000000
tau        0.004296   0.004615   0.004778   0.004950   0.005300
tauX       0.004106   0.004492   0.004693   0.004898   0.005332

The interesting result is the one for mI. This variable has the value 1 or 2 for model 1 or model 2. An average of 1.1692 means that in 0.8308 of all cases we have model 1 and in 0.1692 of all cases we have model 2.
In other words, model 1 is 4.91 times more likely than model 2.
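A plain-Python sanity check of this arithmetic: since mI takes only the values 1 and 2, its posterior mean determines the model probabilities and the posterior odds.

```python
# Posterior model probabilities from the posterior mean of mI (values 1 or 2).
mean_mI = 1.1692
p_model2 = mean_mI - 1.0       # share of draws with mI = 2
p_model1 = 2.0 - mean_mI       # share of draws with mI = 1
odds = p_model1 / p_model2     # posterior odds in favour of model 1, about 4.91
```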
Compare with p-values from the frequentist analysis:

summaryR(lm(testscr ~ elpct + str))

Call:
lm(formula = testscr ~ elpct + str)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.845 -10.240  -0.308   9.815  43.461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 686.0322     8.8122   77.85   <2e-16 ***
elpct        -0.6498     0.0313  -20.76   <2e-16 ***
str          -1.1013     0.4371   -2.52   0.0121 *  
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 14.46 on 417 degrees of freedom
Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
F-statistic: 220.1 on 2 and 417 DF, p-value: < 2.2e-16

Why are these numbers so different from p-values?

p-value = P(data|H0). Here H0 is "model 2 is true", hence

    p-value = P(data|model 2) = 0.0121166

In our Bayesian model comparison we have calculated

    P(model 1|data) = 0.8308
    P(model 2|data) = 0.1692
Model comparison: Words of caution  Whenever we compare models, we have to ask ourselves: Are the models we are comparing plausible?
Example: Returning from the lecture I find on my desk a book which I have long wanted to read.
Model 1: The book has been provided by Santa Claus.
Model 2: The book has been provided by the Tooth Fairy.
A careful model comparison finds Santa Claus 50 times more likely than the Tooth Fairy. Does this mean that Santa Claus, indeed, brought the book?
The same applies, of course, to interpreting p-values from hypothesis tests.
4.10.6 Comparing models
If we want to see a table of different models (but using heteroscedasticity-consistent
standard errors), we can adjust getSummary.lm. Mainly we use the version of
getSummary.lm from the memisc package, but we replace the standard error by
the heteroscedasticity-consistent standard error:
library(memisc)
getSummary.lm <- function (est,...) {
z <- memisc::getSummary.lm(est,...)
z$coef[,"se"] <- sqrt(diag(hccm(est)))
z
}

Now let us estimate a few models:

est1 <- lm(testscr ~ str)
est2 <- lm(testscr ~ str + elpct)
est3 <- lm(testscr ~ str + elpct + mealpct)
est4 <- lm(testscr ~ str + elpct + calwpct)
est5 <- lm(testscr ~ str + elpct + mealpct + calwpct)

mtable("(1)"=est1,"(2)"=est2,"(3)"=est3,"(4)"=est4,"(5)"=est5,
       summary.stats=c("R-squared","AIC","N"))

                  (1)        (2)        (3)        (4)        (5)
(Intercept)   698.933    686.032    700.150    697.999    700.392
              (10.461)    (8.812)    (5.641)    (7.006)    (5.615)
str            -2.280     -1.101     -0.998     -1.308     -1.014
               (0.524)    (0.437)    (0.274)    (0.343)    (0.273)
elpct                     -0.650     -0.122     -0.488     -0.130
                          (0.031)    (0.033)    (0.030)    (0.037)
mealpct                              -0.547                -0.529
                                     (0.024)               (0.039)
calwpct                                         -0.790     -0.048
                                                (0.070)    (0.062)
R-squared       0.051      0.426      0.775      0.629      0.775
AIC          3650.499   3441.123   3050.999   3260.656   3052.376
N             420        420        420        420        420

4.10.7 Discussion

Controlling for the student characteristics shrinks the coefficient of str to half its size.
The student characteristics are good predictors.
The sign of the coefficients of the student characteristics is consistent with the pictures.
Not all control variables are significant: calwpct appears to be redundant in this context.

4.11 Exercises

1. Multiple regression I
   What is a multiple regression?
   Give the formula for a regression with multiple regressors. Explain what each part of the formula means.
   Which other methods do you know to include more than one factor into an analysis?

2. Multiple regressors
   You would like to find out which factors influence the sales of downhill skis in your friend's store.
   How would you specify your model? Which regressors would you include?
   Which signs would you expect for the coefficients of the different regressors?

3. Track and field
   You are the trainer of a mixed team in track and field of children aged 10 to 13. You have estimated the following model for the time (in seconds) that it takes your team members to run 100 meters:

       $\widehat{time}_i$ = 29.8 − 0.8·male − 1.2·age + 0.5·rain

   where male=1 if it is a boy, age denotes the child's age in years and rain=1 if it has been raining during the day so that the ground is muddy.
   Which of the variables are dummy variables? Explain the meaning of the coefficients of the dummy variables.
   How would the estimated model look if we dropped the variable male (1 if male, 0 otherwise) and added a variable female (1 if female, 0 otherwise) instead?
   You would like to predict the time that it takes the next girl to run 100 meters. She is 10 years old. You know that it has not been raining today. What is the predicted time?
   What would be the prediction for a 13-year-old boy on a day when it has been raining?
   Assume that you use the estimation above for a 20-year-old man on a rainy day. What would be the estimated time? Is that realistic? Why?
   In your prediction of the time that the members of your team need to run 100 meters you would like to add a measure for ability. Unfortunately, you have never classified your team members according to ability. Which other measures could help you to approximate the ability of each child?
   Do you think the model is well specified? What would you change if you could? Why might you not be able to specify the model the way you want?

4. Determinants of income
   You would like to conduct a survey to find out which factors influence the income of employees.
   Which variables do you think have an influence on income and should be included in your model?
   You are only allowed to ask your respondents about their age, their gender, and the number of years of education they have obtained. Build a model with these variables as regressors. Which signs do you expect for the coefficients of each of these variables?
   Your assistant has estimated the following equation for monthly incomes: Î = 976.9 + 38.2·a + 80.5·b − 350.7·c with N = 250. Unfortunately, he has not noted which variables indicate what. Look at the regression. Can you tell which variable stands for which factor?
   What is the estimated income of a woman aged 27 who has obtained 17 years of education?
   One employee wonders whether she should pursue an MBA program. The one-year program costs 6 000 in tuition fees. During this year she will forego a monthly salary of 3 200 (assume 12.5 salaries per year; for simplicity assume that you live in a world without taxation and where future income is not discounted). Will this degree pay off during the next ten years?
   Do you think that the above model is a good one? What would you change if you could?
5. Multiple regressors in R, I
   Use the data set Icecream of the library Ecdat in R.
   You want to estimate the effect of average weekly family income (income), price of ice cream (price), and the average temperature (temp) on ice cream consumption (cons). Formulate your model.
   Which variables do you expect to have an influence on ice cream consumption? In which direction do you expect the effect?
   Check your assumptions in R.

6. Multiple regressors in R, II
   Use the data set Computers of the library Ecdat in R.
   What does the data set contain? Classify the variables of the data set.
   Estimate the price of a computer (price). Which regressors would you include? Why? Which sign would you expect for the coefficients of your regressors?
   Interpret your results.

7. Subsets in R
   Use the data set RetSchool of the library Ecdat in R.
   You want to estimate the effect of grades (grade76) and experience (exp76) on wages (wage76) separately for younger (≤ 30 years) and older (> 30 years) employees. Is the effect of grades and experience different for these two age groups?
   Now you estimate the same model as above, but this time using a dummy variable for employees who are older than 30 years in 1976. Explain the difference between this model and the model above. How do you interpret the results?

8. Multiple regressors in R, III
   Use the data set Wages1 of the library Ecdat in R.
   Estimate the effect of experience (exper), years of schooling (school), and gender (sex) on hourly wages (wage).
   Has experience or education a higher impact on wage?
   Is the impact of one more year of education higher for employees who have never attended college (school ≤ 12) or for those who have received some college education (school > 12)? Visualize this with a graph.
   Do you see a similar effect for experience?

9. Specification error I
   What happens if you forget to include a variable? When is this problematic? What is this specification error called? Give some examples for this problem.
   What could you do to fix this problem? Why might one not be able to fix this problem?

10. Multiple regression II
    What are the assumptions that have to be fulfilled for using multiple regressions? List and explain them.

11. Exam 21.7.2007, exercise 2
    Multicollinearity is one of the problems that might occur in an estimation.
    Describe what multicollinearity means.
    How can one detect multicollinearity?
    How can one avoid multicollinearity?

12. Rents in Jena
    A company rents apartments for students in Jena. The manager would like to estimate a model for rents for apartments. He has information on the size of the apartment, the number of bedrooms, the number of bathrooms, whether the kitchen is large, whether the apartment has a balcony, whether there is a tub in the bathroom, and the location measured as the distance to the ThULB.
    Specify a model to estimate the rents in Jena. Which of the above variables would you include? Which signs do you expect the coefficients of these variables to take? Explain your answer.
    Do you think the model is well specified? Are there any other variables you would like to add?
13. Exam 21.7.2007, exercise 3
    Product Z of your company has been advertised during the last year on two different TV channels: EURO1 and SAT5. Prices for spots are the same on both channels. A study with data on the last available periods has provided the following model (standard errors in parentheses):

        Ŷi = 300 + 10·X1 + 20·X2
                   (1.0)   (2.5)

    You have 44 observations, R² = 0.9. Y stands for the sales amount of your product Z (in 1,000 Euros), X1 stands for expenses on commercials at EURO1 (in 1,000 Euros), X2 for expenses on SAT5 (in 1,000 Euros).
    Which advertisement method should you prefer according to your regression results (all other factors constant)? Explain your answer.
14. Detecting multicollinearity in R
    Use the data set Housing of the library Ecdat in R.
    Build a model predicting the price of a house (price) depending on the lotsize (lotsize), the number of bathrooms (bathrms), the number of bedrooms (bedrooms), whether the house has air conditioning (airco), and whether the house is located in a preferred neighbourhood (prefarea). Estimate this model in R.
    Create a dummy which takes the value 1 if the house has at least one bathroom. Estimate the same model as above, this time using the dummy for bathroom instead of the number of bathrooms. What happens? Why?
    Construct a variable which indicates the prices in Euros. Assume an exchange rate of 0.74 Euros for each Canadian Dollar. Estimate the same model as above, this time estimating the price of the house in Euros. Interpret your result.
    Create a dummy for houses with fewer than 3 bedrooms. Estimate the same model as above, this time including the dummy variable for a small house. Explain your result.

15. Detecting specification errors in R
    Use the data set BudgetItaly of the library Ecdat in R.
    Estimate the effect of the price of food (pfood) and of the size of the household (size) on the share of expenditures for food (wfood) in R.
    Look at the results of your estimation and interpret them. Do you think you could estimate a better model?
    Now add the share of expenditures for housing and fuel (whouse). Interpret your result.
    Do you think there is any multicollinearity in your model? Test this.
    Create a dummy variable for the price of food in Euros (Europrice). Every Lira is worth 0.0005 Euros. Add this variable to your second model. What happens? Why?

16. Specification error II
    What specification errors do you know? List and explain them.
    What does this mean for your estimations?
    How should you proceed in modeling?

5 Non-linear regression functions

Until now:

    y = β0 + β1·x1 + β2·x2 + . . . + βk·xk

[Figure: three panels: a linear relation; a non-linear relation without interaction; an interaction of two variables (x2 = 1 versus x2 = 2)]

If the dependence between Y and X is non-linear . . .

the marginal effect of X is not constant
a linear regression would be specified incorrectly
the estimated effect would be biased

. . . hence, we estimate a regression that is non-linear in X.

Approach: non-linear functions of a single independent variable

Polynomials in X
Logarithmic transformation
Interactions
data(Caschool)
attach(Caschool)
est1 <- lm(testscr ~ avginc)
summaryR(est1)

Call:
lm(formula = testscr ~ avginc)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.574  -8.803   0.603   9.032  32.530 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 625.3836     1.9290  324.20   <2e-16 ***
avginc        1.8785     0.1188   15.82   <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 13.39 on 418 degrees of freedom
Multiple R-squared: 0.5076, Adjusted R-squared: 0.5064
F-statistic: 250.2 on 1 and 418 DF, p-value: < 2.2e-16

plot(testscr ~ avginc,main="district average income")
abline(est1)

[Figure: district average income: testscr against avginc with the fitted regression line]

The diagnostic plot confirms that the residuals in this linear model are not independent of avginc.

par(mfrow=c(1,2))
plot(est1,which=1:2)

[Figure: diagnostic plots for est1: Residuals vs Fitted and Normal Q-Q]

Here we have two options: Either we specify a precise functional form for the relation between avginc and testscr or we leave the precise form open, requiring only a smooth relationship, like the following:

A semiparametric fit of the relation between avginc and testscr:

lo <- loess(testscr ~ avginc,data=Caschool[order(avginc),])
plot(lo,xlab="avginc",ylab="testscr")
spline.lo<-predict(lo,se=TRUE)
lines(lo$x,lo$fitted,lwd=3,col="blue")
with(spline.lo,{
  lines(lo[["x"]],fit+qnorm(.025)*se.fit,col="red")
  lines(lo[["x"]],fit+qnorm(.975)*se.fit,col="red")
})

[Figure: loess fit of testscr against avginc with 95% confidence bands]

Semiparametric: does not restrict the shape of the fitted function.
Great to get an idea of the general shape.
Difficult to interpret.

Parametric: may follow a specific economic model.
May completely misrepresent the data.
Easier to interpret (but also easy to be misled).
A quadratic model:

order calculates a permutation vector. We can use it for sorting purposes. If we only want to sort a single vector, we can use sort. fitted calculates the ŷ of a regression.

avginc2 <- avginc*avginc
est2 <- lm(testscr ~ avginc + avginc2)
summaryR(est2)

Call:
lm(formula = testscr ~ avginc + avginc2)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.416  -9.048   0.440   8.348  31.639 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 607.301735   2.924223 207.680   <2e-16 ***
avginc        3.850995   0.271104  14.205   <2e-16 ***
avginc2      -0.042308   0.004881  -8.668   <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 12.72 on 417 degrees of freedom
Multiple R-squared: 0.5562, Adjusted R-squared: 0.554
F-statistic: 400.9 on 2 and 417 DF, p-value: < 2.2e-16

Let's also take a look at the diagnostic plot of this regression:

plot(est2,which=1:2)

[Figure: diagnostic plots for est2: Residuals vs Fitted and Normal Q-Q]

or <- order(avginc)
plot(testscr ~ avginc,main="district average income")
abline(est1,col="blue",lwd=3)
lines(avginc[or],fitted(est2)[or],col="red",lwd=3)
legend("bottomright",c("linear","quadratic"),lwd=3,col=c("blue","red"))

[Figure: district average income: testscr against avginc with the linear and the quadratic fit]

The coefficient of avginc2 is significantly different from zero.
R² is larger than in the linear regression.

    testscr = 607.3 + 3.851·avginc − 0.042308·avginc²

Marginal effect of a change of avginc?

    ∂testscr/∂avginc = 3.851 − 0.042308 · 2 · avginc

coef(est2)["avginc"] + 2*10*coef(est2)["avginc2"]
  avginc 
3.004826 
coef(est2) %*% c(0,1,2*10)
         [,1]
[1,] 3.004826
coef(est2) %*% c(0,1,2*40)
          [,1]
[1,] 0.4663179
coef(est2) %*% c(0,1,2*60)
          [,1]
[1,] -1.226021
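The same marginal-effect arithmetic in plain Python, using the rounded coefficients from summaryR(est2) above (3.850995 and −0.042308; the small deviations from the R output come from this rounding):

```python
# Marginal effect of avginc in the quadratic model:
# d testscr / d avginc = b1 + 2*b2*avginc  (rounded coefficients from the text).
b1, b2 = 3.850995, -0.042308
me = lambda x: b1 + 2 * b2 * x
# me(10), me(40), me(60) are close to 3.0048, 0.4663, -1.2260
```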


Confidence interval for the marginal effect:

In the case of the linear model we need the confidence interval of β̂i.
In the example we need the confidence interval of ΔY = β1 + 2·β2·x.

To find this, we need σ̂_ΔŶ for β1 + 2·β2·x. We know that with H0: ΔY = 0 we have

    (ΔŶ / σ̂_ΔŶ)² ∼ F(1,∞),   hence   σ̂_ΔŶ = |ΔŶ| / √F,

hence the confidence interval is

    [ ΔŶ − 1.96·|ΔŶ|/√F ,  ΔŶ + 1.96·|ΔŶ|/√F ]

(lhtest <- linearHypothesis(est2,"avginc + 20 * avginc2",vcov=hccm))
Linear hypothesis test

Hypothesis:
avginc + 20 avginc2 = 0

Model 1: restricted model
Model 2: testscr ~ avginc + avginc2

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F    Pr(>F)    
1    418                        
2    417  1 288.38 < 2.2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

sqrt(lhtest$F)
[1]       NA 16.98182

coef(est2) %*% c(0,1,2*10) * (1 + qnorm(.975)/sqrt(lhtest$F)[2])
         [,1]
[1,] 3.351629
coef(est2) %*% c(0,1,2*10) * (1 - qnorm(.975)/sqrt(lhtest$F)[2])
         [,1]
[1,] 2.658022
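The interval formula above in plain Python, using ΔŶ = 3.004826 and F = 288.38 from the linearHypothesis output (1.959964 is qnorm(.975) to six decimals):

```python
import math

# CI for the marginal effect: dY * (1 +/- z/sqrt(F))  (values from the text).
dY, F = 3.004826, 288.38
z = 1.959964                       # 97.5% quantile of the standard normal
half = z / math.sqrt(F)
lo, hi = dY * (1 - half), dY * (1 + half)   # about 2.658 and 3.352
```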

Procedure:

1. theoretical motivation for non-linear dependencies
2. specify the functional form
3. test whether the non-linear function is justified
4. visual check
5. marginal effects

5.1 Functional forms

5.1.1 Polynomials

    Yi = β0 + β1·Xi + β2·Xi² + β3·Xi³ + . . . + βr·Xi^r + ui

(r = 2: quadratic model, r = 3: cubic model, . . . )

Testing whether the regression function is linear:

    H0: β2 = 0 ∧ β3 = 0 ∧ . . . ∧ βr = 0
    versus
    H1: at least one βj ≠ 0, j ∈ {2, . . . , r}

What is the best r?

large r: more flexibility, better fit
small r: more precise estimation of the individual coefficients

Sequential hypothesis test in the case of polynomial models:

1. Choose the largest sensible value of r and estimate a polynomial regression.
2. Test H0: βr = 0. If H0 is rejected, use a polynomial of the r-th degree.
3. Else, reduce r by 1, re-estimate, and test again (step 2).
avginc3 <- avginc*avginc*avginc
est3 <- lm(testscr ~ avginc + avginc2 + avginc3)
(lhtest <- linearHypothesis(est3,"avginc3",vcov=hccm))

Linear hypothesis test
Hypothesis:
avginc3 = 0
Model 1: restricted model
Model 2: testscr ~ avginc + avginc2 + avginc3
Note: Coefficient covariance matrix supplied.
  Res.Df Df      F Pr(>F)
1    417                 
2    416  1 2.4615 0.1174

est2 <- lm(testscr ~ avginc + avginc2)
(lhtest <- linearHypothesis(est2,"avginc2",vcov=hccm))

Linear hypothesis test
Hypothesis:
avginc2 = 0
Model 1: restricted model
Model 2: testscr ~ avginc + avginc2
Note: Coefficient covariance matrix supplied.
  Res.Df Df      F    Pr(>F)    
1    418                        
2    417  1 75.136 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(estp <- lm(testscr ~ poly(avginc,10,raw=TRUE)))

Call:
lm(formula = testscr ~ poly(avginc, 10, raw = TRUE))
Residuals:
    Min      1Q  Median      3Q     Max 
-42.435  -9.159   0.424   8.764  33.066 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)  
(Intercept)                     1.004e+03  5.773e+02   1.739   0.0828 .
poly(avginc, 10, raw = TRUE)1  -2.714e+02  3.245e+02  -0.837   0.4034  
poly(avginc, 10, raw = TRUE)2   7.436e+01  7.721e+01   0.963   0.3361  
poly(avginc, 10, raw = TRUE)3  -1.073e+01  1.026e+01  -1.046   0.2961  
poly(avginc, 10, raw = TRUE)4   9.349e-01  8.449e-01   1.106   0.2692  
poly(avginc, 10, raw = TRUE)5  -5.202e-02  4.519e-02  -1.151   0.2503  
poly(avginc, 10, raw = TRUE)6   1.888e-03  1.594e-03   1.184   0.2370  
poly(avginc, 10, raw = TRUE)7  -4.444e-05  3.678e-05  -1.208   0.2276  
 [ reached getOption("max.print") -- omitted 3 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.67 on 409 degrees of freedom
Multiple R-squared: 0.5686,  Adjusted R-squared: 0.5581
F-statistic: 53.91 on 10 and 409 DF,  p-value: < 2.2e-16

t(sapply(1:10,function(i) extractAIC(lm(testscr ~ poly(avginc,i,raw=TRUE)))))
      [,1]     [,2]
 [1,]    2 2181.164
 [2,]    3 2139.508
 [3,]    4 2139.384
 [4,]    5 2136.448
 [5,]    6 2135.989
 [6,]    7 2137.467
 [7,]    8 2139.372
 [8,]    9 2141.179
 [9,]   10 2143.160
[10,]   11 2143.578

r<-2:50
plot(sapply(r,function(x) extractAIC(lm(testscr ~ poly(avginc,x,raw=TRUE)))[2]) ~ r,t="l",
     ylab="AIC")

[Figure: AIC as a function of the polynomial degree r]

r<-1:7
plot(sapply(r,function(x) extractAIC(lm(testscr ~ poly(avginc,x,raw=TRUE)))[2]) ~ r,t="l",
ylab="AIC")

plot(testscr ~ avginc,main="district average income")


abline(est1,col="blue",lwd=3)
lines(avginc[or],fitted(est2)[or],col="red",lwd=3)
lines(avginc[or],fitted(est3)[or],col="green",lwd=3)
smooth<-list(avginc=seq(5,70,.5))
lines(smooth$avginc,predict(estp,newdata=smooth),col="magenta",lwd=3)
legend("bottomright",c("linear","quadratic","cubic","10th-deg"),lwd=3,col=c("blue","red","green","magenta"))

[Figure "district average income": testscr against avginc with linear, quadratic, cubic, and 10th-degree fits]

5.1.2 Logarithmic Models

curve(log(x))

[Figure: log(x) for x ∈ (0, 1]]

Yi = β0 + β1 ln Xi + ui        linear-log

ln Yi = β0 + β1 Xi + ui        log-linear

ln Yi = β0 + β1 ln Xi + ui     log-log

5.1.3 Logarithmic Models: linear-log

Yi = β0 + β1 ln Xi + ui

Marginal effect:

    ∂Yi/∂Xi = β1 (1/Xi) ,   i.e.   ∆Yi ≈ β1 ∆Xi/Xi

If Xi changes by 1% (∆Xi = 0.01 Xi) . . . Yi changes by 0.01 β1.

(estL <- lm(testscr ~ log(avginc)))

Call:
lm(formula = testscr ~ log(avginc))
Coefficients:
(Intercept)  log(avginc)  
     557.83        36.42  

plot(testscr ~ avginc,main="district average income")
abline(est1,col="blue",lwd=3)
lines(avginc[or],fitted(est2)[or],col="red",lwd=3)
lines(avginc[or],fitted(estL)[or],col="green",lwd=3)
legend("bottomright",c("linear","quadratic","linear-log"),lwd=3,col=c("blue","red","green"))

[Figure "district average income": testscr against avginc with linear, quadratic, and linear-log fits]

coef(estL)[2]/10
log(avginc) 
   3.641968 
coef(estL)[2]/40
log(avginc) 
  0.9104921 
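The two numbers above are just β1/X; a Python one-liner check with the coefficient copied from estL:

```python
b1 = 36.42  # coefficient of log(avginc) copied from estL

# effect of a one-unit change of avginc on testscr, b1/X:
print(b1 / 10)  # ~ 3.642  (at avginc = 10)
print(b1 / 40)  # ~ 0.910  (at avginc = 40)
```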

5.1.4 Logarithmic Models: log-linear

ln Yi = β0 + β1 Xi + ui

Marginal effect:

    ∂ ln Yi/∂Xi = β1 ,   i.e.   ∆ ln Yi ≈ β1 ∆Xi

With ∆ ln Yi ≈ ∆Yi/Yi we have

    ∆Yi/Yi ≈ β1 ∆Xi

A change of Xi by one unit translates into a relative change of Yi by the share β1.

(estLL <- lm(log(testscr) ~ avginc))

Call:
lm(formula = log(testscr) ~ avginc)
Coefficients:
(Intercept)       avginc  
   6.439362     0.002844  

plot(testscr ~ avginc,main="district average income")
abline(est1,col="blue",lwd=3)
lines(avginc[or],fitted(est2)[or],col="red",lwd=3)
lines(avginc[or],fitted(estL)[or],col="green",lwd=3)
lines(avginc[or],exp(fitted(estLL))[or],col="black",lwd=3)
legend("bottomright",c("linear","quadratic","linear-log","log-lin"),lwd=3,col=c("blue","red","green","black"))

[Figure "district average income": testscr against avginc with linear, quadratic, linear-log, and log-linear fits]

coef(estLL)[2]
    avginc 
0.00284407 
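The interpretation "one more unit of avginc raises testscr by about 0.28%" can be checked numerically; a Python sketch with the coefficient copied from estLL (β1 is the approximate relative change, exp(β1) − 1 the exact one):

```python
import math

b1 = 0.002844  # coefficient of avginc copied from estLL
approx = b1                 # approximate relative change per unit of avginc
exact  = math.exp(b1) - 1   # exact relative change for a one-unit change
print(approx, exact)        # both ~ 0.00284, i.e. about 0.28%
```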

Example: What is the effect of work experience on wages?

exp     years of full-time work experience
lwage   logarithm of wage

library(lattice)
data(Wages, package="Ecdat")
lm(lwage ~ exp,data=Wages)

Call:
lm(formula = lwage ~ exp, data = Wages)
Coefficients:
(Intercept)          exp  
    6.50143      0.00881  

5.1.5 Logarithmic Models: log-log

ln Yi = β0 + β1 ln Xi + ui

Yi = e^β0 · Xi^β1 · e^ui

Marginal effect:

    ∂Yi/∂Xi = e^β0 β1 Xi^(β1 − 1) = β1 Yi/Xi

    (∂Yi/∂Xi) · (Xi/Yi) = β1

β1 is the elasticity of Yi with respect to Xi.
(estLLL<- lm(log(testscr) ~ log(avginc)))

Call:
lm(formula = log(testscr) ~ log(avginc))
Coefficients:
(Intercept)  log(avginc)  
    6.33635      0.05542  

(estLL <- lm(log(testscr) ~ avginc))

Call:
lm(formula = log(testscr) ~ avginc)
Coefficients:
(Intercept)       avginc  
   6.439362     0.002844  
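The elasticity interpretation of the log-log coefficient can be verified numerically; a Python sketch with β1 copied from estLLL:

```python
b1 = 0.05542  # elasticity estimate copied from estLLL

# In Y = e^b0 * X^b1, a 1% increase of X multiplies Y by 1.01^b1:
rel = 1.01**b1 - 1
print(rel)  # ~ 0.00055, i.e. Y rises by about 0.01*b1
```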

plot(testscr ~ avginc,main="district average income")
abline(est1,col="blue",lwd=3)
lines(avginc[or],fitted(est2)[or],col="red",lwd=3)
lines(avginc[or],fitted(estL)[or],col="green",lwd=3)
lines(avginc[or],exp(fitted(estLL))[or],col="black",lwd=3)
lines(avginc[or],exp(fitted(estLLL))[or],col="orange",lwd=3)
legend("bottomright",c("linear","quadratic","linear-log","log-lin","log-log"),lwd=3,col=c("blue","red","green","black","orange"))

[Figure "district average income": testscr against avginc with linear, quadratic, linear-log, log-linear, and log-log fits]

5.1.6 Comparison of the three logarithmic models

X and/or Y are transformed.
The regression equation is linear in the transformed variables.
Hypothesis tests and confidence intervals can be calculated in the usual way.
The interpretation of β1 is different in each case.
R² and AIC can be used to compare log-log and log-linear models.
R² and AIC can be used to compare linear-log and a linear model.
Comparing a model for ln Yi with a model for Yi in this way is impossible.
We need economic theory to motivate one of the four specifications.

5.1.7 Generalization: Box-Cox

The logarithmic model:
    log Y = Xβ + u
Now, take a look at
    gλ(Y) = Xβ + u
where
    gλ(Y) = (Y^λ − 1)/λ    if λ ≠ 0
    gλ(Y) = log Y          if λ = 0

λ is calculated using maximum likelihood.
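The transformation itself is easy to implement; a minimal Python sketch (the λ → 0 limit of (Y^λ − 1)/λ is log Y, which is why the case λ = 0 is defined that way):

```python
import math

def g(y, lam):
    """Box-Cox transformation g_lambda(y)."""
    if lam == 0:
        return math.log(y)
    return (y**lam - 1) / lam

# for lambda -> 0 the transformation converges to log(y):
print(g(2.0, 1e-8), math.log(2.0))  # both ~ 0.6931
print(g(2.0, 1.0))                  # y - 1 = 1.0
```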

library(MASS)
est <- lm (testscr ~ avginc)
plot(boxcox(est,lambda=seq(-2,10,by=.5),plotit=FALSE),t="l",
     xlab="$\\lambda$",ylab="log-Likelihood")
est2 <- lm(testscr^8 ~ avginc)
plot(testscr ~ avginc)
lines(fitted(est2)[or]^(1/8) ~ avginc[or],col="blue")
lines(exp(fitted(estLL)[or]) ~ avginc[or],col="orange")

[Figure: log-Likelihood as a function of the Box-Cox parameter λ; testscr against avginc with the λ = 8 fit (blue) and the log-linear fit (orange)]

5.1.8 Other non-linear functions

Problems with the above models:
Polynomial model: potentially not monotonic.
linear-log: testscr rises monotonically in avginc, but is not bounded above.
Is there a specification that satisfies both conditions: monotonicity and boundedness?

    Y = β0 (1 − e^(−β1 X))     (negative exponential growth curve)

curve(1-exp(-x),xlim=c(1,10))

[Figure: the curve 1 − e^(−x)]

Estimate the parameters of
    Yi = β0 (1 − e^(−β1 Xi)) + ui
or (shifting X by β2)
    Yi = β0 (1 − e^(−β1 (Xi − β2))) + ui

Compare this model to the linear-log or polynomial model:

    Yi = β0 + β1 ln Xi + ui

    Yi = β0 + β1 Xi + β2 Xi² + β3 Xi³ + ui

Linearizing Yi = β0 (1 − e^(−β1 (Xi − β2))) + ui is not possible anymore.

5.1.9 Non-linear least squares

Models which are linear in their parameters can be estimated using OLS.
Models which are non-linear in one or more parameters can be estimated
using non-linear methods (but cannot be estimated using OLS).

    min over β0, β1, β2 of   Σᵢ₌₁ⁿ ( Yi − β0 (1 − e^(−β1 (Xi − β2))) )²

(nest<-nls(testscr ~ b0 * (1 - exp(-1 * b1 * (avginc - b2))),
           start=c(b0=730,b1=0.1,b2=0),trace=TRUE))
7485378  :  730.0   0.1   0.0
1392009  :  695.41660291   0.09260118  -8.24400488
233046.7 :  696.82401675   0.07926959 -17.01056369
98541.84 :  699.44123343   0.06495656 -25.58741750
69931.36 :  702.08095456   0.05753086 -31.61313399
67013.75 :  703.05049452   0.05552973 -33.72483709
66988.4  :  703.20479789   0.05525916 -33.98665686
66988.4  :  703.22077199   0.05523597 -34.00235084
66988.4  :  703.22210309   0.05523406 -34.00353793
Nonlinear regression model
  model: testscr ~ b0 * (1 - exp(-1 * b1 * (avginc - b2)))
   data: parent.frame()
       b0        b1        b2 
703.22210   0.05523 -34.00354 
 residual sum-of-squares: 66988
Number of iterations to convergence: 8 
Achieved convergence tolerance: 0.0000007268

summary(nest)

Formula: testscr ~ b0 * (1 - exp(-1 * b1 * (avginc - b2)))

Parameters:
      Estimate Std. Error t value      Pr(>|t|)    
b0  703.222103   6.697451 104.998       < 2e-16 ***
b1    0.055234   0.009101   6.069 0.00000000289 ***
b2  -34.003538   5.676787  -5.990 0.00000000454 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.67 on 417 degrees of freedom

Number of iterations to convergence: 8 
Achieved convergence tolerance: 0.0000007268
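The fitted curve is indeed monotonic and bounded above by b0; a Python sketch evaluating the fitted function with the nls estimates copied from the output above:

```python
import math

# nls estimates copied from the output above
b0, b1, b2 = 703.222, 0.055234, -34.0035

def fit(avginc):
    return b0 * (1 - math.exp(-b1 * (avginc - b2)))

print(fit(10), fit(50))  # increasing in avginc, bounded above by b0
```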

plot(testscr ~ avginc,main="district average income")
lines(avginc[or],fitted(nest)[or],col="red",lwd=3)
lines(avginc[or],fitted(estL)[or],col="blue",lwd=3)
legend("bottomright",c("nls","log"),lwd=3,col=c("red","blue"))

[Figure "district average income": testscr against avginc with the nls fit (red) and the linear-log fit (blue)]

5.2 Interactions
Maybe the effect of group sizes on test scores depends on further circumstances.
Maybe small group sizes have a particularly large effect if groups have a
lot of foreign students.

    ∂testscr/∂str  depends on elpct.

Generally:

    ∂Y/∂X1  depends on X2.

How can we include this interaction into a model?

First look at binary X, later consider continuous X.


Example 1:
    testscr = β1 str + β2 elpct + β0 + u
In this model the effect of str is independent of elpct.

Example 2:
    lwage = β1 ed + β0 + u

library(lattice)
attach(Wages)
summary(ed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   4.00   12.00   12.00   12.85   16.00   17.00 

lm(lwage ~ ed)

Call:
lm(formula = lwage ~ ed)
Coefficients:
(Intercept)           ed  
     5.8388       0.0652  

Example 3:
    lwage = β1 college + β2 sex + β0 + u

college=ed>16
lm(lwage ~ college + sex)

Call:
lm(formula = lwage ~ college + sex)
Coefficients:
(Intercept)  collegeTRUE      sexmale  
     6.2254       0.3340       0.4626  

5.2.1 Interactions between binary variables

lwage = β0 + β1 college + β2 sex + β3 sex · college + u
        (β0 ≈ 6.21, β1 ≈ 0.55, β2 ≈ 0.49, β3 ≈ −0.24)

(est<- lm(lwage ~ college + sex + sex:college))

Call:
lm(formula = lwage ~ college + sex + sex:college)
Coefficients:
        (Intercept)          collegeTRUE              sexmale  collegeTRUE:sexmale  
             6.2057               0.5543               0.4850              -0.2412  

Instead of regression coefficients we can calculate mean values for the individual categories:
mean(lwage[college==FALSE & sex=="female"])
[1] 6.205665
mean(lwage[college==TRUE & sex=="female"])
[1] 6.760007
mean(lwage[college==FALSE & sex=="male"])
[1] 6.690634
mean(lwage[college==TRUE & sex=="male"])
[1] 7.003751
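The four group means are exactly what the dummy coefficients predict; a Python sketch rebuilding them from the coefficients printed above:

```python
# coefficients copied from the regression above
b0, b_college, b_male, b_inter = 6.2057, 0.5543, 0.4850, -0.2412

def mean_lwage(college, male):
    return b0 + b_college*college + b_male*male + b_inter*college*male

print(mean_lwage(0, 0))  # ~ 6.2057  (female, no college)
print(mean_lwage(1, 0))  # ~ 6.7600  (female, college)
print(mean_lwage(0, 1))  # ~ 6.6907  (male, no college)
print(mean_lwage(1, 1))  # ~ 7.0038  (male, college)
```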

mean(lwage):

                       college
                 FALSE              TRUE
sex  female   6.21 = β0         6.76 = β0 + β1
     male     6.69 = β0 + β2    7.00 = β0 + β1 + β2 + β3

Effect of college education for women: β1
Effect of college education for men:   β1 + β3
Histr<-str>=20
Hiel<-elpct>=10
table(Histr,Hiel)
       Hiel
Histr   FALSE TRUE
  FALSE   149   89
  TRUE     79  103


(est<- lm(testscr ~ Histr*Hiel))

Call:
lm(formula = testscr ~ Histr * Hiel)
Coefficients:
       (Intercept)           HistrTRUE            HielTRUE  HistrTRUE:HielTRUE  
           664.143              -1.908             -18.163              -3.494  

mean(testscr[Hiel==FALSE & Histr==FALSE])
[1] 664.1433
mean(testscr[Hiel==TRUE & Histr==FALSE])
[1] 645.9803
coef(est) %*% c(1,0,1,0)
         [,1]
[1,] 645.9803
mean(testscr[Hiel==FALSE & Histr==TRUE])
[1] 662.2354
coef(est) %*% c(1,1,0,0)
         [,1]
[1,] 662.2354
mean(testscr[Hiel==TRUE & Histr==TRUE])
[1] 640.5782
coef(est) %*% c(1,1,1,1)
         [,1]
[1,] 640.5782

If we do not want to calculate mean values for the individual categories, we can
leave that job to R:
library(memisc)
aggregate(mean(testscr)~Hiel+Histr)
   Hiel Histr mean(testscr)
1 FALSE FALSE      664.1433
3  TRUE FALSE      645.9803
2 FALSE  TRUE      662.2354
6  TRUE  TRUE      640.5782

testscr = β0 + β1 Hiel + β2 Histr + β3 Histr · Hiel + u
          (β0 ≈ 664.14, β1 ≈ −18.16, β2 ≈ −1.91, β3 ≈ −3.49)

mean(testscr):

                         Hiel
                  FALSE                   TRUE
Histr  FALSE   664.1433 = β0           645.9803 = β0 + β1
       TRUE    662.2354 = β0 + β2      640.5782 = β0 + β1 + β2 + β3

5.2.2 Interaction between a binary and a continuous variable

attach(Wages)
plot(lwage ~ ed)
abline(lm(lwage ~ ed,subset=(sex=="female")))
abline(lm(lwage ~ ed,subset=(sex=="male")),col="red",lty=2)
legend("topright",c("male","female"),lty=2:1,col=c("red","black"))

[Figure: lwage against ed with separate regression lines for females (solid black) and males (dashed red)]

(lm(lwage ~ ed,subset=(sex=="female")))

Call:
lm(formula = lwage ~ ed, subset = (sex == "female"))
Coefficients:
(Intercept)           ed  
    5.04207      0.09452  

(lm(lwage ~ ed,subset=(sex=="male")))

Call:
lm(formula = lwage ~ ed, subset = (sex == "male"))
Coefficients:
(Intercept)           ed  
    5.93060      0.06221  

(est<- lm(lwage ~ sex*ed))

Call:
lm(formula = lwage ~ sex * ed)
Coefficients:
(Intercept)      sexmale           ed   sexmale:ed  
    5.04207      0.88854      0.09452     -0.03231  

detach(Wages)

lwage = β0 + β1 ed + β2 sex + β3 sex · ed + u

β3 = 0: regression lines are parallel
β2 = 0: regression lines have the same axis intercept

Does the effect of group sizes on test scores depend on the share of English learners?

Hiel=elpct>=10
plot(testscr ~ str)
abline(lm(testscr ~ str,subset=(Hiel==FALSE)))
abline(lm(testscr ~ str,subset=(Hiel==TRUE)),col="red")
legend("topright",c("few el","many el"),lwd=3,col=c("black","red"))

[Figure: testscr against str with separate regression lines for districts with few (black) and many (red) English learners]

(lm(testscr ~ str,subset=(Hiel==FALSE)))

Call:
lm(formula = testscr ~ str, subset = (Hiel == FALSE))
Coefficients:
(Intercept)          str  
   682.2458      -0.9685  

(lm(testscr ~ str,subset=(Hiel==TRUE)))

Call:
lm(formula = testscr ~ str, subset = (Hiel == TRUE))
Coefficients:
(Intercept)          str  
    687.885       -2.245  

lm(testscr ~ str*Hiel)

Call:
lm(formula = testscr ~ str * Hiel)
Coefficients:
 (Intercept)           str      HielTRUE  str:HielTRUE  
    682.2458       -0.9685        5.6391       -1.2766  


(est<- lm(testscr ~ str*Hiel))

Call:
lm(formula = testscr ~ str * Hiel)
Coefficients:
 (Intercept)           str      HielTRUE  str:HielTRUE  
    682.2458       -0.9685        5.6391       -1.2766  

testscr = β0 + β1 Hiel + β2 str + β3 str · Hiel + u
          (β0 ≈ 682.2458, β1 ≈ 5.6391, β2 ≈ −0.9685, β3 ≈ −1.2766)

Effect of a change in group size str:

if elpct < 10:  −0.9685
if elpct ≥ 10:  −0.9685 − 1.2766 = −2.245
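The slope for the high-elpct group is just the sum of the str coefficient and the interaction coefficient; a Python check with the numbers copied from the regression above:

```python
# coefficients copied from the regression above
b_str, b_inter = -0.9685, -1.2766
print(b_str)            # slope of str when Hiel is FALSE
print(b_str + b_inter)  # slope of str when Hiel is TRUE, ~ -2.245
```

The sum matches the slope from the separate regression on the Hiel==TRUE subsample.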
Are the two lines parallel?

linearHypothesis(est,"str:HielTRUE=0",vcov=hccm)
Linear hypothesis test
Hypothesis:
str:HielTRUE = 0
Model 1: restricted model
Model 2: testscr ~ str * Hiel
Note: Coefficient covariance matrix supplied.
Res.Df Df
F Pr(>F)
1
417
2
416 1 1.6778 0.1959

Are the two lines identical?

linearHypothesis(est,c("str:HielTRUE=0","HielTRUE=0") ,vcov=hccm)
Linear hypothesis test
Hypothesis:
str:HielTRUE = 0
HielTRUE = 0
Model 1: restricted model
Model 2: testscr ~ str * Hiel
Note: Coefficient covariance matrix supplied.
  Res.Df Df      F    Pr(>F)    
1    418                        
2    416  2 88.806 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Do both lines have the same axis intercept?

linearHypothesis(est,"HielTRUE=0",vcov=hccm)
Linear hypothesis test
Hypothesis:
HielTRUE = 0
Model 1: restricted model
Model 2: testscr ~ str * Hiel
Note: Coefficient covariance matrix supplied.
  Res.Df Df      F Pr(>F)
1    417                 
2    416  1 0.0804 0.7769

5.2.3 Application: Gender gap

exp      years of full-time work experience
wks      weeks worked
bluecol  blue collar?
ind      works in a manufacturing industry?
south    resides in the south?
smsa     resides in a standard metropolitan statistical area?
married  married?
sex      a factor with levels (male, female)
union    individual's wage set by a union contract?
ed       years of education
black    is the individual black?
lwage    logarithm of wage

ifelse returns either the second or the third argument; which one is returned depends on the first
argument. as.data.frame converts its argument (e.g. a matrix) into a data frame. Here, this is helpful
because the returned structure mixes numbers and strings. colnames provides access to column names.
attach(Wages)
est1 <- lm(lwage ~ ed)
est2 <- lm(lwage ~ ed + sex)
est3 <- lm(lwage ~ ed * sex)
est4 <- lm(lwage ~ ed * sex + exp + black + union + south + wks + married + smsa + ind)
mtable("(1)"=est1,"(2)"=est2,"(3)"=est3,"(4)"=est4,summary.stats=c("R-squared","N"))

                          (1)       (2)       (3)       (4)
(Intercept)             5.839     5.419     5.042     4.666
                       (0.032)   (0.034)   (0.087)   (0.107)
ed                      0.065     0.065     0.095     0.086
                       (0.002)   (0.002)   (0.007)   (0.006)
sex: male/female                  0.474     0.889     0.552
                                 (0.018)   (0.093)   (0.092)
ed x sex: male/female                      -0.032    -0.016
                                           (0.007)   (0.007)
exp                                                   0.011
                                                     (0.001)
black: yes/no                                        -0.168
                                                     (0.022)
union: yes/no                                         0.063
                                                     (0.012)
south: yes/no                                        -0.055
                                                     (0.013)
wks                                                   0.005
                                                     (0.001)
married: yes/no                                       0.066
                                                     (0.022)
smsa: yes/no                                          0.161
                                                     (0.012)
ind                                                   0.043
                                                     (0.012)
R-squared               0.155     0.260     0.264     0.387
N                        4165      4165      4165      4165

5.2.4 Interaction between two continuous variables

Example:
    lwage = β0 + β1 ed + β2 exp + β3 ed · exp + u

est1 <- lm(lwage ~ ed + exp)
est2 <- lm(lwage ~ ed * exp)

mtable("(1)"=est1,"(2)"=est2,summary.stats=c("R-squared","N"))

                 (1)       (2)
(Intercept)    5.436     5.446
              (0.037)   (0.075)
ed             0.076     0.076
              (0.002)   (0.005)
exp            0.013     0.013
              (0.001)   (0.003)
ed x exp                 0.000
                        (0.000)
R-squared      0.247     0.247
N               4165      4165


Yi = β0 + β1 X1i + β2 X2i + β3 (X1i · X2i) + ui

Marginal effects:

    ∂Yi/∂X1 = β1 + β3 X2

    ∂Yi/∂X2 = β2 + β3 X1

What happens, if X1 changes by ∆X1 and X2 changes by ∆X2?

∆Y = β0 + β1 (X1 + ∆X1) + β2 (X2 + ∆X2) + β3 ((X1 + ∆X1) · (X2 + ∆X2))
     − (β0 + β1 X1 + β2 X2 + β3 (X1 · X2))
   = β1 ∆X1 + β2 ∆X2 + β3 (X1 X2 + X1 ∆X2 + X2 ∆X1 + ∆X1 ∆X2 − X1 X2)
   = (β1 + β3 X2) ∆X1 + (β2 + β3 X1) ∆X2 + β3 ∆X1 ∆X2
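This decomposition can be verified numerically; a Python sketch with arbitrary (made-up) parameter values:

```python
# Numeric check of the identity
#   dY = (b1 + b3*X2)*dX1 + (b2 + b3*X1)*dX2 + b3*dX1*dX2
# with arbitrary (made-up) parameter values:
b0, b1, b2, b3 = 1.0, 2.0, -0.5, 0.3
X1, X2, d1, d2 = 4.0, 7.0, 0.1, -0.2

def Y(x1, x2):
    return b0 + b1*x1 + b2*x2 + b3*x1*x2

lhs = Y(X1 + d1, X2 + d2) - Y(X1, X2)
rhs = (b1 + b3*X2)*d1 + (b2 + b3*X1)*d2 + b3*d1*d2
print(lhs, rhs)  # identical up to rounding
```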

testscr = β0 + β1 str + β2 elpct + β3 str · elpct + u
          (β0 ≈ 686.34, β1 ≈ −1.1170, β2 ≈ −0.6729, β3 ≈ 0.001162)
attach(Caschool)
summaryR(lm(testscr ~ str * elpct ))

Call:
lm(formula = testscr ~ str * elpct)
Residuals:
    Min      1Q  Median      3Q     Max 
-48.836 -10.226  -0.343   9.796  43.447 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 686.338525  11.937855  57.493   <2e-16 ***
str          -1.117018   0.596515  -1.873   0.0618 .  
elpct        -0.672911   0.386538  -1.741   0.0824 .  
str:elpct     0.001162   0.019158   0.061   0.9517    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.48 on 416 degrees of freedom
Multiple R-squared: 0.4264,  Adjusted R-squared: 0.4223
F-statistic: 150.3 on 3 and 416 DF,  p-value: < 2.2e-16


What is the effect of the group size str for a group with the median share of foreigners?

median calculates the median of a vector. quantile calculates quantiles of a vector. The smallest
observation equals a quantile of 0, the largest observation equals a quantile of 1. mean calculates
the arithmetic mean.

median(elpct)
[1] 8.777634
est<- lm(testscr ~ str * elpct)
coef(est)
  (Intercept)           str         elpct     str:elpct 
686.338524629  -1.117018345  -0.672911392   0.001161752 

(eff1=coef(est)["str"] + coef(est)["str:elpct"] * median(elpct))
      str 
-1.106821 

What does this effect look like for a group with a share of foreigners at the 75%
quantile?
quantile(elpct,.5)
     50% 
8.777634 
quantile(elpct,.75)
  75% 
22.97 
(eff2=coef(est)["str"] + coef(est)["str:elpct"] * quantile(elpct,.75))
      str 
-1.090333 
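Both effects are instances of β1 + β3 X2; a Python check with the coefficients and quantiles copied from the output above:

```python
# coefficients and quantiles copied from the output above
b_str, b_int = -1.117018345, 0.001161752
med, q75 = 8.777634, 22.97
print(b_str + b_int*med)  # ~ -1.1068
print(b_str + b_int*q75)  # ~ -1.0903
```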

Is the interaction term significant?

linearHypothesis(est,"str:elpct=0",vcov=hccm)
Linear hypothesis test
Hypothesis:
str:elpct = 0
Model 1: restricted model
Model 2: testscr ~ str * elpct
Note: Coefficient covariance matrix supplied.
  Res.Df Df      F Pr(>F)
1    417                 
2    416  1 0.0037 0.9517

5.3 Non-linear interaction terms

Hiel=elpct>=10
est1 <- lm(testscr ~ str + elpct + mealpct)
est2 <- lm(testscr ~ str + elpct + mealpct + log(avginc))
est3 <- lm(testscr ~ str * Hiel )
est4 <- lm(testscr ~ str * Hiel + mealpct + log(avginc))
est5 <- lm(testscr ~ str + I(str^2) + I(str^3) + Hiel + mealpct + log(avginc))
est6 <- lm(testscr ~ (str + I(str^2) + I(str^3))*Hiel + mealpct + log(avginc))
est7 <- lm(testscr ~ str + I(str^2) + I(str^3) + elpct + mealpct + log(avginc))

mtable("(1)"=est1,"(2)"=est2,"(3)"=est3,"(4)"=est4,"(5)"=est5,"(6)"=est6,"(7)"=est7,summary.stats=c("R-squared","N"))
                    (1)        (2)        (3)        (4)        (5)
(Intercept)      700.150    658.552    682.246    653.666    252.051
                 (5.641)    (8.749)   (12.071)   (10.053)  (179.724)
str               -0.998     -0.734     -0.968     -0.531     64.339
                 (0.274)    (0.261)    (0.599)    (0.350)   (27.295)
elpct             -0.122     -0.176
                 (0.033)    (0.034)
mealpct           -0.547     -0.398                -0.411     -0.420
                 (0.024)    (0.034)               (0.029)    (0.029)
log(avginc)                 11.569                12.124     11.748
                            (1.841)              (1.823)    (1.799)
Hiel                                    5.639      5.498     -5.474
                                      (19.889)   (10.012)    (1.046)
str x Hiel                             -1.277     -0.578
                                       (0.986)    (0.507)
str^2                                                        -3.424
                                                             (1.373)
str^3                                                         0.059
                                                             (0.023)
str^2 x Hiel
str^3 x Hiel
R-squared          0.775      0.796      0.310      0.797      0.801
N                    420        420        420        420        420

It looks like there is a non-linear effect of str on testscr. Let's take another
look at model 6 and let's estimate the marginal effect of str.
estC <- coef(est6)
mEffstr <- function (str,Hiel) {
  estC %*% c(0,1,2*str,3*str^2,0,0,0,Hiel,Hiel*2*str,Hiel*3*str^2)
}
mEffstr(20,0)
          [,1]
[1,] -1.622543
mEffstr(20,1)
           [,1]
[1,] -0.7771982

sapply applies a function to every element of a vector. Within an equation I() prevents a term from
being interpreted as an interaction.

noHitest<-sapply(str,function(x) {mEffstr(x,0)})
Hitest<-sapply(str,function(x) {mEffstr(x,1)})
plot(noHitest ~ str,ylim=c(-6,6))
points(Hitest ~ str,col="red")
abline(h=0)
legend("bottomright",c("noHitest","Hitest"),pch=1,col=c("black","red"))

[Figure: marginal effect of str (noHitest in black, Hitest in red) against str]

5.3.1 Non-linear interaction terms

Linear interaction between str and elpct or str and Hiel: no significant effect.
Significant non-linear effect of str on testscr.
Significant non-linear interaction of str and Hiel.
Effect of a change in group sizes: it depends.

5.3.2 Summary
Non-linear transformations (log, polynomials) enable us to write non-linear
models as multiple regressions.
Estimation works in the same way as for OLS.
We have to take the transformations into account when we interpret the
coefficients.
A large number of non-linear specifications is possible. Consider. . .
What non-linear effects are we interested in?
What makes sense with respect to the problem we are trying to solve?

5.4 Exercises
1. Polynomial Regression I
Give the formula of a quadratic regression.
How do you interpret the coefficients and the dependent variable of
this quadratic regression?
Draw a graph illustrating the impact of X1 on Y in a quadratic regression.
Give some real world examples where the use of a quadratic regression
function could be useful.
2. Polynomial Regression II
You want to estimate the time that it takes the members of your running
team to run 1km. You have information on age, gender, and whether the
members are, generally speaking, in good physical condition.
Suggest a model to estimate the time it takes your club members to run
1km.
Which sign do you expect for each of the coefficients?
Draw a graph that illustrates the relation between time needed to run
1km and age.
3. Polynomial Regression III
Use the data set Bwages of the library Ecdat in R on wages in Belgium.


Estimate the effect of years of experience (exper) and level of education
(educ) on hourly wages in Euro (wage). Do you think it makes sense to
use a non-linear model? Why? Which regression model would you
use? Write down the regression function. Which signs do you expect the coefficients to take?
Estimate the model in R. Interpret the output.
What should the relationship between wage and experience look like
graphically? Verify this with a graph using R.

4. Logarithmic Regression I
What is a logarithm?
Where do logarithms occur in nature or science?
Give some examples for the use of logarithmic functions in economic
contexts.
5. Logarithmic Regression II
Which different types of logarithmic regressions do you know? Give
the formulas for each of them.
How do you interpret the coefficients of the different logarithmic models?
Give economic examples for each of them.
6. Logarithmic Regression III
Use the data set Wages of the library Ecdat in R on wages in the United
States.
Estimate the effect of years of experience (exp), whether the employee
has a blue collar job (bluecol), whether the employee lives in a standard metropolitan area (smsa), gender (sex), years of education (ed),
and whether the employee is black (black) on the logarithm of wage
(lwage). Do you think it makes sense to use this model? Would you
rather suggest a different model? Which one would you suggest?
Estimate both models in R. Interpret and compare the outputs.
Visualize the relationship of experience and wage with a graph. Does
this graph support the choice of your model?
7. Polynomial Regression IV
Use the data set Bwages of the library Ecdat in R on wages in Belgium.


You want to estimate the wage increase per year of job experience (exper).
You use the level of education (educ) as an additional control. You do
not have information on wage increases, but only on absolute wages
(wage). Solve this problem using R.
8. Linear and Non-linear Regressions I
You want to estimate different models for the following problem sets of one
dependent and one independent variable (assume that the models are otherwise specified correctly, i.e. all other important variables are included, no
high correlation between two independent variables). Name the appropriate model (linear, quadratic, log-lin, lin-log, or log-log) for each of the problems and explain your choice (exercise adapted from Studenmund's "Using
Econometrics", chapter 7, exercise 2).
Dependent variable: time it takes to walk from A to B
independent variable: distance from A to B
Dependent variable: total amount spent on food
independent variable: income
Dependent variable: monthly wage
independent variable: age
Dependent variable: number of ski lift tickets sold
independent variable: whether there is snow
Dependent variable: GDP growth rate
independent variable: years passed since beginning of transformation
to an industrialized country
Dependent variable: CO2 emission
independent variable: kilometers driven with car
Dependent variable: hourly wage
independent variable: number of years of job experience
Dependent variable: physical ability
independent variable: age
9. Linear and Non-linear Regressions II
How do you decide which model (linear model or one of the nonlinear
models) to use?
10. Interaction terms I
What is an interaction term?

How do you construct an interaction term?


Write down a regression function including an interaction term.
How do you interpret interaction terms?

Why do you have to include not only the interaction of two variables
into your regression function, but also each of the individual variables?
What would happen if you would not include the individual variables?
Give some examples of situations where you think that interactions
play a role.
11. Exam BW 24.1, 26.5.2010, exercise 19
A group of athletes prepares for a competition. You have the following information about the athletes: age (A), gender (G; 1 if female, 0 otherwise),
daily training (T; 1 if true, 0 otherwise), healthy diet (E; 1 if true, 0 otherwise), and ranking list scores (R). Age and gender are not correlated with
the other variables. You assume that athletes only do especially well in the
ranking list if they practice daily and if they follow a healthy diet; a daily
training is only effective in combination with a healthy diet. What would
be possible specifications of your model to test your assumption? (Here we
don't ask for the "best" specification.)
a) R = β0 + β1 T + β2 E + u

b) R = β0 + β1 A + β2 G + β3 T + β4 E + β5 T · E + u
c) R = β0 + β1 T + β2 E + β3 T · E + u

d) T · E = β0 + β1 R + u

e) R = β0 + β1 A + β2 G + β3 T + β4 E + u

12. Interaction terms II


You are a teacher at a cross country skiing school. Each year you teach the
free style technique (also called skating technique) to students who have never
done cross country skiing before. You realize that your students differ in how
fast they learn this new technique. You think that the two things that matter
are whether a student knows how to ice skate and whether a student is familiar
with downhill skiing.

You estimate the following model for the number of days it takes them to
learn the new technique so well that they are able to do their first tours:
daysˆi = 8 − 1 · iceskating − 2.5 · alpineskiing − 1.5 · iceskating · alpineskiing
How many days does a person need . . . to learn the skating technique with
cross country skis?
who has never done any ice skating nor downhill skiing
who has never done any ice skating, but some downhill skiing


who knows how to ice skate, but has never done any downhill
skiing
who is familiar with both ice skating and downhill skiing
13. Interaction terms in R I
You would like to estimate students school achievements measured in test
scores (testscore) in a developing country. You think that gender (female)
and educational background of the parents (eduparents; measured in years)
have an impact. In particular, you think that poor people cannot afford that
their children spend all their time on learning, because they also need their
help to earn money. This might be especially true for girls, because their
parents might think that education is less important for them. How would
you test this assumption in R? Which of the following commands is correct
(multiple correct answers possible)?
a) summary(lm(testscore ~ female+eduparents+eduparents*female))
b) summary(lm(testscore=female+eduparents+eduparents:female))
c) summary(lm(testscore ~ female*eduparents))
d) summary(lm(testscore ~ female+eduparents+eduparents:female))
e) eduparentsfemale <- eduparents*female
   summary(lm(testscore ~ female+eduparents+eduparentsfemale))
f) summary(lm(testscore <- eduparents*female))
g) summary(lm(testscore ~ eduparents:female))
14. Interaction terms in R II
Use the data set RetSchool of the library Ecdat in R on returns to schooling
in the United States.
You are interested whether people considered as "black" (black) and
people living in the south (south76) earn less than others. Further, you
are interested whether Afro Americans (black) who live in the south
(south76) earn even less. You control for years of experience (exp76)
and grades (grade76). Solve this problem using R.
15. Interaction terms in R III
Use the data set DoctorContacts of the library Ecdat in R on contacts to medical doctors.
How do gender (sex), age (age), income (linc), the education of the
head of the household (educdec), health (health), physical limitations
(physlim), and the number of chronic diseases (ndisease) affect the

number of visits to a medical doctor? In which direction do you expect
the effects to go?
Is there an interaction between gender and physical limitations?

16. Non-linear functions


Which non-linear functions do you know? List them.
Note the formula for each of them.
Give examples for each of them.

6 Evaluating multiple regressions

6.1 Introduction
Strengths of multiple regression models
Problems

e.g.: What can we really say about the effect of group sizes on test scores?

6.1.1 Can we evaluate multiple regressions systematically?
Advantages (in comparison to the simple regression model):
- Marginal effects of X on Y can be estimated.
- Omitted variable bias can sometimes be prevented (if the variable can be
measured).
- Non-linear effects (which depend on X) can be analysed.
Still: OLS can be a biased estimator of the true effect.

6.1.2 Internal and external validity

Internal validity: statistical inferences about causal dependencies apply to the
population / to the model we are studying.
- The estimator is unbiased and consistent.
- Hypothesis tests have the desired levels of significance and confidence
intervals have the desired levels of confidence.

External validity: statistical inferences about causal dependencies can be carried
over to other populations and other circumstances.

To what extent can our results regarding Californian schools be generalized?


Different populations

California 1998/99
Massachusetts 1997/98
Mexico 1997/98

e.g. when testing drugs: Generalization from rats to humans.


Different circumstances

Legal circumstances of subsidy programs


Different handling of bilingual education
Different characteristics of teachers

Testing external validity by comparing different populations and circumstances.

6.2 Internal validity - Problems


Omitted variable bias
Incorrect functional form
Errors in the variables
Selection bias
Simultaneous causality
Heteroscedasticity and correlation of error terms
If E(uᵢ | Xᵢ) ≠ 0, OLS is biased and inconsistent.

6.2.1 Omitted Variable Bias

- A variable has an effect on Y.
- A variable is correlated with one of the explaining variables X.

    E(β̂₁) = β₁ + (X₁′X₁)⁻¹ X₁′X₂ β₂

Remedy:
- If we can measure the variable: add it to the regression.
- Identifying the important coefficients (a priori)


- Searching actively for sources of omitted variable bias (a priori):
  base specification
- Extending the base specification by adding variables:
  - Test whether the estimated coefficients are zero.
  - Do the coefficients which we have estimated before change when
    we add another variable?
- Overview of the different estimated specifications.
data(Caschool)
attach(Caschool)
est1 <- lm(testscr ~ str)
est2 <- lm(testscr ~ str + elpct)
est3 <- lm(testscr ~ str + elpct + mealpct)
est4 <- lm(testscr ~ str + elpct + calwpct)
est5 <- lm(testscr ~ str + elpct + mealpct + calwpct)

mtable("(1)"=est1,"(2)"=est2,"(3)"=est3,"(4)"=est4,"(5)"=est5,
       summary.stats=c("R-squared","N"))
                 (1)        (2)        (3)        (4)        (5)
(Intercept)  698.933    686.032    700.150    697.999    700.392
             (10.461)    (8.812)    (5.641)    (7.006)    (5.615)
str           -2.280     -1.101     -0.998     -1.308     -1.014
              (0.524)    (0.437)    (0.274)    (0.343)    (0.273)
elpct                    -0.650     -0.122     -0.488     -0.130
                         (0.031)    (0.033)    (0.030)    (0.037)
mealpct                             -0.547                -0.529
                                    (0.024)               (0.039)
calwpct                                        -0.790     -0.048
                                               (0.070)    (0.062)
R-squared      0.051      0.426      0.775      0.629      0.775
N                420        420        420        420        420

If we cannot measure the variable:
- If the variable does not change over time: regression using panel data
- If the variable is correlated with another variable which we can measure:
  regression using instruments
- Randomized controlled experiment to eliminate the effect on average
  (if X is random, it is in particular independent of u, hence E(u|X = x) = 0)
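The direction and size of the omitted variable bias can be checked in a small simulation. This sketch is not part of the handout and uses Python/numpy rather than R; all coefficients and variable names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# x2 affects y AND is correlated with x1 -> omitting x2 biases the slope on x1
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)          # cov(x1, x2) = 0.8
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# short regression (x2 omitted) vs. long regression (x2 included)
b_short = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0]
b_long = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0]

# bias of the short slope: beta2 * cov(x1, x2) / var(x1) = 3 * 0.8 = 2.4
print(b_short[1])   # about 4.4, not 2
print(b_long[1])    # about 2
```

Adding the omitted variable to the regression removes the bias, exactly as the remedy above suggests.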
6.2.2 Incorrect specification of the functional form
- Including interaction terms
- Logarithmic / polynomial specification
- In case of a discrete (e.g. binary) dependent variable: extending multiple
regression models (probit, logit)
6.2.3 Errors in the variables
What if we cannot measure our X precisely?
Typos
Imprecise recollections (When did you start working on this project?)
Imprecise questions (What was your income last year?)
Conscious lying (Alcohol intake / sexual preferences)

Example: Let the true specification be

Yi = 0 + 1 Xi + ui
for this specification it holds that E(ui |Xi ) = 0.
Let Xi be the true value of X and X i the imprecisely measured value of X.
We estimate
Yi

=
=
=

0 + 1 Xi + ui

0 + 1 X i + 1 (Xi X i ) + ui
0 + 1 X i + vi

where vi = 1 (Xi X i ) + ui . If (Xi X i ) is correlated with X i , then X i is corre^ 1 is biased and inconsistent.
lated with vi , and
Example: Let X i = Xi + wi , where wi is a random variable with a mean value
of zero and a variance of 2w and wi is uncorrelated with Xi and ui
vi

1 (Xi X i ) + ui

1 (Xi Xi wi ) + ui

1 wi + ui

According to the assumption


cov(Xi , ui ) =
cov(X i , wi ) =
=
hence cov(X i , vi ) =
=

0
cov(Xi + wi , wi )
2w
1 cov(X i , wi ) + cov(X i , ui )
1 2w


Recall

    β̂₁ = β₁ + [ (1/n) Σᵢ (Xᵢ − X̄) uᵢ ] / [ (1/n) Σᵢ (Xᵢ − X̄)² ]  →p  β₁ + cov(uᵢ, Xᵢ)/σ²_X

In our example (regressing Y on the mismeasured X̃ᵢ, whose variance is
σ²_X̃ = σ²_X + σ²_w):

    β̂₁  →p  β₁ + cov(vᵢ, X̃ᵢ)/σ²_X̃
         = β₁ − β₁ σ²_w / (σ²_X + σ²_w)
         = β₁ (σ²_X + σ²_w − σ²_w) / (σ²_X + σ²_w)
         = β₁ · σ²_X / (σ²_X + σ²_w)

Since σ²_X / (σ²_X + σ²_w) < 1, β̂₁ is biased towards zero.

- Extreme case 1: wᵢ is large enough, so that X̃ᵢ contains virtually no
  information about Xᵢ: β̂₁ →p 0
- Extreme case 2: wᵢ = 0: β̂₁ →p β₁

Remedy in case of errors in the variables
- Measure X more precisely.
- We can estimate an instrumental variable regression, if there is a variable
  (instrument) which at the same time is correlated with Xᵢ, but not with uᵢ.
- Alternatively: develop a model of the measurement error and use it to
  correct the error (e.g. on the basis of σ²_w and σ²_X).
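The attenuation factor σ²_X/(σ²_X + σ²_w) derived above can be checked numerically. A short Python/numpy sketch (illustrative only, not from the handout; the handout itself works in R, and the parameter values here are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta1 = 2.0
var_x, var_w = 4.0, 1.0   # attenuation factor: 4 / (4 + 1) = 0.8

x = rng.normal(scale=np.sqrt(var_x), size=n)              # true regressor
x_tilde = x + rng.normal(scale=np.sqrt(var_w), size=n)    # measured with error
y = 1.0 + beta1 * x + rng.normal(size=n)

# OLS slope of y on the mismeasured regressor
b1 = np.cov(x_tilde, y)[0, 1] / np.var(x_tilde, ddof=1)
print(b1)   # close to beta1 * 0.8 = 1.6, i.e. biased towards zero
```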
6.2.4 Sample selection bias
What if the selection of the data (sampling) is affected by the dependent variable?
Random sampling helps to avoid selection bias.
Example: Explaining employees' wages by their education. Unemployed
individuals are not included in the sample.
Example: Performance of equity funds.
- Draw 100 equity funds today and observe their average performance
  over the past 10 years.
  Performance will be overestimated.
- Draw 100 equity funds ten years ago and observe their average
  performance over the past 10 years.
  Equity funds do not beat the market.
  High performance in the past does not explain high performance
  in the future.
6.2.5 Simultaneous causality
Causality runs in both directions: from X to Y, but also from Y to X.
Example: What if the government gives additional funds to schools with low
test scores, so that they can hire additional teachers?

    str → testscr    and    testscr → str

    Yᵢ = β₀ + β₁ Xᵢ + uᵢ
    Xᵢ = γ₀ + γ₁ Yᵢ + vᵢ

Problem: Xᵢ is correlated with the error term uᵢ.
Remedy:
- Instrumental variable regressions
- Randomized controlled experiment
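The simultaneity problem can also be illustrated by simulation: solving the two structural equations for the reduced form and then running OLS of Y on X recovers neither β₁ nor γ₁. A Python/numpy sketch with invented coefficients (not from the handout):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
b1, g1 = 0.5, 0.5   # structural: y = b1*x + u  and  x = g1*y + v (intercepts 0)

u = rng.normal(size=n)
v = rng.normal(size=n)

# solve the two simultaneous equations for the reduced form
d = 1 - b1 * g1
x = (g1 * u + v) / d   # x depends on u  ->  cov(x, u) != 0
y = (u + b1 * v) / d

# OLS of y on x does NOT recover b1
b1_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(b1_ols)   # plim is (b1 + g1) / (1 + g1^2) = 0.8, not b1 = 0.5
```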
6.2.6 Heteroscedasticity and correlation of error terms
- Heteroscedasticity: use heteroscedasticity-consistent standard errors.
- Correlation of error terms across observations:
  - Does not occur if observations are drawn randomly.
  - Does happen in panels (the same observational unit is drawn many
    times over time) and in time series: serial correlation.
  - Geographical effects
- OLS is still consistent, but the estimators for OLS standard errors are
  inconsistent.
- Alternative formulae for standard errors of panel data, time series data and
  data which has correlated groups.


6.3 OLS and prediction
- Unbiased estimation of β̂. Example: What happens to testscr, if str is
  reduced by two units?
- Unbiased estimation of Ŷ. Example: What is a likely value of testscr in a
  district with str=20?

    testscr = 698.933 − 2.2798 · str

We know that the coefficient of str is biased. However, for prediction
purposes, this is irrelevant.
- Here, R² is important.
- Omitted variable bias is not a problem anymore.
- The interpretation of the coefficients is not important. What we care
  about is a good "fit".
- External validity is important: the model, which we estimated with
  data from the past, has to be valid for the future.
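That a biased coefficient need not hurt prediction can be seen in a simulation: a short regression with an omitted variable has a badly biased slope, yet still fits Y well for observations drawn from the same population. A Python/numpy sketch (illustrative only, not from the handout):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# short model omits x2: the slope is biased (about 4.4 instead of 2) ...
X = np.column_stack([np.ones(n), x1])
b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b

# ... but for prediction what matters is the fit, summarized by R^2
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)   # a sizeable R^2 despite the biased slope
```

The prediction is only as good as the external validity of the estimated relation: it has to remain stable out of sample.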

6.4 Comparison of Caschool and MCAS

Caschool (California):
distcod    district code
county     county
district   district
grspan     grade span of district
enrltot    total enrollment
teachers   number of teachers
calwpct    percent qualifying for CalWorks
mealpct    percent qualifying for reduced-price lunch
computer   number of computers
testscr    average test score (read.scr+math.scr)/2
compstu    computers per student
expnstu    expenditure per student
str        student teacher ratio
avginc     district average income
elpct      percent of English learners
readscr    average reading score
mathscr    average math score
Source: California Department of Education

MCAS (Massachusetts):
code       district code (numerical)
municipa   municipality (name)
district   district name
regday     spending per pupil, regular
specneed   spending per pupil, special needs
bilingua   spending per pupil, bilingual
occupday   spending per pupil, occupational
totday     spending per pupil, total
spc        students per computer
speced     special education students
lnchpct    eligible for free or reduced price lunch
tchratio   students per teacher
percap     per capita income
totsc4     4th grade score (math+english+science)
totsc8     8th grade score (math+english+science)
avgsalary  average teacher salary
pctel      percent english learners
Source: Massachusetts Comprehensive Assessment System (MCAS), Massachusetts Department of Education, 1990 U.S. Census

Datensatz$variable denotes a variable (column) in a data set. Alternatively we can write Datensatz[,"variable"],
or Datensatz[,c("variable1","variable2")] for more than one column.
data(MCAS)
MCAS<-within(MCAS,{
type<-"MA"
str<-tchratio
testscr<-totsc4
elpct<-pctel
avginc<-percap
mealpct<-lnchpct})
Caschool$type<-"CA"
head(Caschool[,c("type","str","testscr","elpct","avginc","mealpct")])

  type      str testscr     elpct    avginc mealpct
1   CA 17.88991  690.80  0.000000 22.690001  2.0408
2   CA 21.52466  661.20  4.583333  9.824000 47.9167
3   CA 18.69723  643.60 30.000002  8.978000 76.3226
4   CA 17.35714  647.70  0.000000  8.978000 77.0492
5   CA 18.67133  640.85 13.857677  9.080333 78.4270
6   CA 21.40625  605.55 12.408759 10.415000 86.9565

head(MCAS[,c("type","str","testscr","elpct","avginc","mealpct")])

1
2
3
4
5
6

type
MA
MA
MA
MA
MA
MA

str testscr
elpct avginc mealpct
19.0
714 0.0000000 16.379
11.8
22.6
731 1.2461059 25.792
2.5
19.3
704 0.0000000 14.040
14.1
17.9
704 0.3225806 16.111
12.1
17.5
701 0.0000000 15.423
17.4
15.7
714 3.9215686 11.144
26.8

merge merges two data sets. Let one data set contain the matriculation numbers and grades of students and
let another data set contain matriculation numbers and names. merge assigns the correct matriculation
numbers, grades and names to each other. If the two data sets have no observations in common, merge
(with all=TRUE) can also append one data set to the other.
cama=merge(Caschool[,c("type","str","testscr","elpct","avginc",
"mealpct")],MCAS[,c("type","str","testscr","elpct","avginc",
"mealpct")],all=TRUE)

The new dataframe cama contains now data from both regions, CA and MA.
First comes the CA data, followed by the MA data.
head(cama)

  type      str testscr    elpct avginc mealpct
1   CA 14.00000  635.60 0.000000 10.656 68.8235
2   CA 14.20176  656.50 0.000000 13.712 20.0000
3   CA 14.54214  695.30 3.765690 35.342  0.0000
4   CA 14.70588  666.85 2.500000 11.826 53.5032
5   CA 15.13898  698.25 2.807284 35.810  0.0000
6   CA 15.22436  646.40 0.000000 10.268 76.2774

tail(cama)

    type  str testscr     elpct avginc mealpct
635   MA 21.9     691  2.816901 15.905    27.1
636   MA 22.0     706  0.000000 14.471    18.3
637   MA 22.0     711  0.000000 15.603    12.4
638   MA 22.6     731  1.246106 25.792     2.5
639   MA 23.5     699  0.000000 16.189     6.8
640   MA 27.0     664 10.798017 15.581    70.0

aggregate can be used to split a data set into parts and apply a function to each of the parts.


aggregate(cama[,2:6],list(cama$type),mean)

  Group.1      str  testscr     elpct   avginc  mealpct
1      CA 19.64043 654.1565 15.768155 15.31659 44.70524
2      MA 17.34409 709.8273  1.117676 18.74676 15.31591

aggregate(cama[,2:6],list(cama$type),sd)

  Group.1      str  testscr    elpct   avginc  mealpct
1      CA 1.891812 19.05335 18.28593 7.225890 27.12338
2      MA 2.276666 15.12647  2.90094 5.807637 15.06007

aggregate(cama[,2:6],list(cama$type),length)

  Group.1 str testscr elpct avginc mealpct
1      CA 420     420   420    420     420
2      MA 220     220   220    220     220
subset selects a subset of a data set. If a function supports the parameter data, we can supply it with an
appropriate subset. Many functions also have a parameter subset which selects a subset directly. ylim
defines the scale of the y-axis and pch defines which symbol is used to depict a point.

attach(cama)
plot(testscr ~ avginc,subset=(type=="MA"),col="blue",pch=3,ylim=c(600,750))
points(testscr ~ avginc,subset=(type=="CA"),col="red")
legend("bottomright",c("MASS","CA"),pch=c(3,1),col=c("blue","red"))

[Figure: testscr against avginc; MA districts as blue crosses, CA districts as red circles]

We write a small function which produces a number of plots of this kind.


myPlot <- function (var) {
plot(testscr ~ var,subset=(type=="MA"),ylim=c(600,750),
col="blue",pch=3)
points(testscr ~ var,subset=(type=="CA"),col="red",pch=1)
legend("bottomright",c("MASS","CA"),pch=c(3,1),col=c("blue","red"))
}

myPlot(avginc)
myPlot(str)
myPlot(elpct)
myPlot(mealpct)

[Figure: four panels, testscr against avginc, str, elpct and mealpct, for MA and CA]
Test scores and income in Massachusetts


(estC<- lm(testscr ~ avginc,data=subset(cama,type=="CA")))

Call:
lm(formula = testscr ~ avginc, data = subset(cama, type == "CA"))

Coefficients:
(Intercept)       avginc
    625.384        1.879

(estM<- lm(testscr ~ avginc,data=subset(cama,type=="MA")))

Call:
lm(formula = testscr ~ avginc, data = subset(cama, type == "MA"))

Coefficients:
(Intercept)       avginc
    679.387        1.624

myPlot(avginc)
abline(estC,col="red")
abline(estM,col="blue")

[Figure: testscr against avginc for MA and CA, with the two fitted regression lines]

If a function has the ... parameter in its definition, we can supply additional parameters when calling
the function. These additional parameters are substituted when we use ... within the function.
ePlot <- function(model,data,...) {
  est<- lm(model,data)
  stdev<-sqrt(diag(hccm(est)))
  pvalue<-round(2*pnorm(-abs(coef(est)/stdev)),4)
  stars<-ifelse(pvalue<.001,"***",ifelse(pvalue<.01,"**",
    ifelse(pvalue<.05,"*",ifelse(pvalue<.1,".",""))))
  a<-as.data.frame(cbind(coef(est),stdev,pvalue))
  a$stars=stars
  colnames(a)[1]="beta"
  print(a,digits=3)
  sum<-summary(est)
  cat("R2=",round(sum$r.squared,2),"\n")
  or<-order(data$avginc)
  if (substr(model[2],1,4)=="log(") {
    lines(data$avginc[or],exp(fitted(est)[or]),...)
  } else
    lines(data$avginc[or],fitted(est)[or],...)
  est
}

myPlot(avginc)
est<-ePlot(testscr ~ avginc + I(avginc^2),data=subset(cama,type=="CA"))

                beta   stdev pvalue stars
(Intercept) 607.3017 2.92422      0   ***
avginc        3.8510 0.27110      0   ***
I(avginc^2)  -0.0423 0.00488      0   ***
R2= 0.56

est<-ePlot(testscr ~ avginc + I(avginc^2),data=subset(cama,type=="MA"))

                beta  stdev pvalue stars
(Intercept) 638.3711 8.0401      0   ***
avginc        5.4703 0.6893      0   ***
I(avginc^2)  -0.0808 0.0136      0   ***
R2= 0.48

est<-ePlot(testscr ~ avginc + I(avginc^2)+ I(avginc^3),
           data=subset(cama,type=="CA"),col="red")

                  beta    stdev pvalue stars
(Intercept) 600.078985 5.462310 0.0000   ***
avginc        5.018677 0.787290 0.0000   ***
I(avginc^2)  -0.095805 0.034052 0.0049    **
I(avginc^3)   0.000685 0.000437 0.1167
R2= 0.56

est<-ePlot(testscr ~ avginc + I(avginc^2)+ I(avginc^3),
           data=subset(cama,type=="MA"),col="red")

                 beta    stdev pvalue stars
(Intercept) 600.39853 26.96057 0.0000   ***
avginc       10.63538  3.63075 0.0034    **
I(avginc^2)  -0.29689  0.15614 0.0572     .
I(avginc^3)   0.00276  0.00214 0.1968
R2= 0.49

est<-ePlot(testscr ~ log(avginc),data=subset(cama,type=="CA"),
           col="blue")

             beta stdev pvalue stars
(Intercept) 557.8  3.86      0   ***
log(avginc)  36.4  1.41      0   ***
R2= 0.56

est<-ePlot(testscr ~ log(avginc),data=subset(cama,type=="MA"),
           col="blue")

             beta stdev pvalue stars
(Intercept) 600.8  9.36      0   ***
log(avginc)  37.7  3.17      0   ***
R2= 0.46

est<-ePlot(log(testscr) ~ log(avginc),data=subset(cama,type=="CA"),
           col="green")

              beta   stdev pvalue stars
(Intercept) 6.3363 0.00596      0   ***
log(avginc) 0.0554 0.00216      0   ***
R2= 0.56

est<-ePlot(log(testscr) ~ log(avginc),data=subset(cama,type=="MA"),
           col="green")

              beta   stdev pvalue stars
(Intercept) 6.4107 0.01343      0   ***
log(avginc) 0.0533 0.00454      0   ***
R2= 0.46

est<-ePlot(log(testscr) ~ avginc,data=subset(cama,type=="CA"),
           col="yellow")

               beta    stdev pvalue stars
(Intercept) 6.43936 0.002987      0   ***
avginc      0.00284 0.000183      0   ***
R2= 0.5

options(scipen=5)
est<-ePlot(log(testscr) ~ avginc,data=subset(cama,type=="MA"),
           col="yellow")

               beta    stdev pvalue stars
(Intercept) 6.52186 0.005388      0   ***
avginc      0.00229 0.000276      0   ***
R2= 0.38

[Figure: testscr against avginc for MA and CA, with the fitted non-linear specifications]
Multiple regression:
myPlot(avginc)
est<-ePlot(testscr ~ str,data=subset(cama,type=="CA"))

              beta  stdev pvalue stars
(Intercept) 698.93 10.461      0   ***
str          -2.28  0.524      0   ***
R2= 0.05

est<-ePlot(testscr ~ str,data=subset(cama,type=="MA"))

              beta stdev pvalue stars
(Intercept) 739.62 8.882 0.0000   ***
str          -1.72 0.516 0.0009   ***
R2= 0.07

myPlot(avginc)
est<-ePlot(testscr ~ str + elpct + mealpct + log(avginc),
           data=subset(cama,type=="CA"))

               beta  stdev pvalue stars
(Intercept) 658.552 8.7489 0.0000   ***
str          -0.734 0.2606 0.0048    **
elpct        -0.176 0.0342 0.0000   ***
mealpct      -0.398 0.0336 0.0000   ***
log(avginc)  11.569 1.8413 0.0000   ***
R2= 0.8

est<-ePlot(testscr ~ str + elpct + mealpct + log(avginc),
           data=subset(cama,type=="MA"))

               beta   stdev pvalue stars
(Intercept) 682.432 12.0943 0.0000   ***
str          -0.689  0.2779 0.0131     *
elpct        -0.411  0.3512 0.2422
mealpct      -0.521  0.0834 0.0000   ***
log(avginc)  16.529  3.3010 0.0000   ***
R2= 0.68

[Figure: testscr against avginc for MA and CA, with fitted values from the multiple regressions]

The effect of str is significant.


Additional variables reduce the coefficient of str
The effect of avginc is significant.
myPlot(avginc)
est<-ePlot(testscr ~ str + elpct + mealpct + avginc + I(avginc^2) +
           I(avginc^3),data=subset(cama,type=="MA"))

                 beta    stdev pvalue stars
(Intercept) 744.02504 23.18585 0.0000   ***
str          -0.64091  0.27642 0.0204     *
elpct        -0.43712  0.35908 0.2235
mealpct      -0.58182  0.10781 0.0000   ***
avginc       -3.06669  2.53398 0.2262
I(avginc^2)   0.16369  0.09172 0.0743     .
I(avginc^3)  -0.00218  0.00104 0.0370     *
R2= 0.69

linearHypothesis(est,c("I(avginc^2)=0","I(avginc^3)=0"),vcov=hccm)

Linear hypothesis test

Hypothesis:
I(avginc^2) = 0
I(avginc^3) = 0

Model 1: restricted model
Model 2: testscr ~ str + elpct + mealpct + avginc + I(avginc^2) + I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df     F   Pr(>F)
1    215
2    213  2 6.227 0.002354 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

The effect of str is significant.

myPlot(avginc)
est<-ePlot(testscr ~ str + I(str^2) + I(str^3) + elpct + mealpct +
           avginc + I(avginc^2) + I(avginc^3),data=subset(cama,type=="CA"))

                  beta      stdev pvalue stars
(Intercept) 330.079169 173.293815 0.0568     .
str          55.618325  26.486559 0.0357     *
I(str^2)     -2.914810   1.340721 0.0297     *
I(str^3)      0.049866   0.022437 0.0262     *
elpct        -0.196440   0.035054 0.0000   ***
mealpct      -0.411538   0.033874 0.0000   ***
avginc       -0.912858   0.587802 0.1204
I(avginc^2)   0.067430   0.022781 0.0031    **
I(avginc^3)  -0.000826   0.000262 0.0016    **
R2= 0.81

est<-ePlot(testscr ~ str + I(str^2) + I(str^3) + elpct + mealpct +
           avginc + I(avginc^2) + I(avginc^3),data=subset(cama,type=="MA"))

                 beta     stdev pvalue stars
(Intercept) 665.49605 116.07834 0.0000   ***
str          12.42598  20.27945 0.5401
I(str^2)     -0.68030   1.12956 0.5470
I(str^3)      0.01147   0.02081 0.5814
elpct        -0.43417   0.36722 0.2371
mealpct      -0.58722   0.11724 0.0000   ***
avginc       -3.38154   2.74013 0.2172
I(avginc^2)   0.17410   0.09819 0.0762     .
I(avginc^3)  -0.00229   0.00111 0.0398     *
R2= 0.69

linearHypothesis(est,c("str=0","I(str^2)","I(str^3)"),vcov=hccm)

Linear hypothesis test

Hypothesis:
str = 0
I(str^2) = 0
I(str^3) = 0

Model 1: restricted model
Model 2: testscr ~ str + I(str^2) + I(str^3) + elpct + mealpct + avginc +
    I(avginc^2) + I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F  Pr(>F)
1    214
2    211  3 2.3364 0.07478 .
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

linearHypothesis(est,c("I(str^2)=0","I(str^3)=0"),vcov=hccm)

Linear hypothesis test

Hypothesis:
I(str^2) = 0
I(str^3) = 0

Model 1: restricted model
Model 2: testscr ~ str + I(str^2) + I(str^3) + elpct + mealpct + avginc +
    I(avginc^2) + I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F Pr(>F)
1    213
2    211  2 0.3396 0.7124

linearHypothesis(est,c("I(avginc^2)=0","I(avginc^3)=0"),vcov=hccm)

Linear hypothesis test

Hypothesis:
I(avginc^2) = 0
I(avginc^3) = 0

Model 1: restricted model
Model 2: testscr ~ str + I(str^2) + I(str^3) + elpct + mealpct + avginc +
    I(avginc^2) + I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F   Pr(>F)
1    213
2    211  2 5.7043 0.003866 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

[Figure: testscr against avginc for MA and CA, with fitted values from the cubic specifications]
The effect of str is significant.
str has a significant effect in California, but not in Massachusetts.
aggregate(elpct,list(type),median)

  Group.1        x
1      CA 8.777634
2      MA 0.000000

cama$HiEL=cama$elpct>0
est<-ePlot(testscr ~ str + HiEL + HiEL:str + mealpct + avginc +
           I(avginc^2) + I(avginc^3),data=subset(cama,type=="CA"))

                   beta     stdev pvalue stars
(Intercept)   658.16110 16.404309 0.0000   ***
str             1.35471  0.810345 0.0946     .
HiELTRUE       36.11583 16.262280 0.0264     *
mealpct        -0.50670  0.027027 0.0000   ***
avginc         -0.90724  0.616350 0.1410
I(avginc^2)     0.05912  0.023531 0.0120     *
I(avginc^3)    -0.00068  0.000272 0.0123     *
str:HiELTRUE   -2.18763  0.864613 0.0114     *
R2= 0.8

est<-ePlot(testscr ~ str + HiEL + HiEL:str + mealpct + avginc +
           I(avginc^2) + I(avginc^3),data=subset(cama,type=="MA"))

                   beta    stdev pvalue stars
(Intercept)   759.91422 25.28938 0.0000   ***
str            -1.01768  0.38182 0.0077    **
HiELTRUE      -12.56073 10.22789 0.2194
mealpct        -0.70851  0.09894 0.0000   ***
avginc         -3.86651  2.71955 0.1551
I(avginc^2)     0.18412  0.09930 0.0637     .
I(avginc^3)    -0.00234  0.00115 0.0414     *
str:HiELTRUE    0.79861  0.58020 0.1687
R2= 0.69

linearHypothesis(est,c("str=0","str:HiELTRUE=0"),vcov=hccm)

Linear hypothesis test

Hypothesis:
str = 0
str:HiELTRUE = 0

Model 1: restricted model
Model 2: testscr ~ str + HiEL + HiEL:str + mealpct + avginc + I(avginc^2) +
    I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F Pr(>F)
1    214
2    212  2 3.7663 0.0247 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

linearHypothesis(est,c("I(avginc^2)=0","I(avginc^3)=0"),vcov=hccm)

Linear hypothesis test

Hypothesis:
I(avginc^2) = 0
I(avginc^3) = 0

Model 1: restricted model
Model 2: testscr ~ str + HiEL + HiEL:str + mealpct + avginc + I(avginc^2) +
    I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F  Pr(>F)
1    214
2    212  2 3.2201 0.04191 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

[Figure: testscr against avginc for MA and CA, with fitted values from the HiEL interaction specifications]
The effect of str is significant.
There is no noteworthy interaction between HiEL and str in Massachusetts, but
there is one in California.
est<-ePlot(testscr ~ str + mealpct + avginc + I(avginc^2) +
           I(avginc^3),data=subset(cama,type=="MA"))

                 beta    stdev pvalue stars
(Intercept) 747.36389 21.67952 0.0000   ***
str          -0.67188  0.27679 0.0152     *
mealpct      -0.65308  0.07859 0.0000   ***
avginc       -3.21795  2.46635 0.1920
I(avginc^2)   0.16479  0.09113 0.0706     .
I(avginc^3)  -0.00216  0.00106 0.0415     *
R2= 0.68

linearHypothesis(est,c("I(avginc^2)=0","I(avginc^3)=0"),vcov=hccm)

Linear hypothesis test

Hypothesis:
I(avginc^2) = 0
I(avginc^3) = 0

Model 1: restricted model
Model 2: testscr ~ str + mealpct + avginc + I(avginc^2) + I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F  Pr(>F)
1    216
2    214  2 4.2776 0.01508 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

[Figure: testscr against avginc for MA and CA, with the final fitted specification]

The effect of str is significant.

Discussion:
- The data set from California is larger, so it is easier to find significant
  results.
- Comparing the mean values and standard deviations of str in California and
  Massachusetts.


(mmean<-aggregate(cama$testscr,list(type),mean))

  Group.1        x
1      CA 654.1565
2      MA 709.8273

colnames(mmean)<-c("type","testmean")
(msd=aggregate(cama$testscr,list(type),sd))

  Group.1        x
1      CA 19.05335
2      MA 15.12647

colnames(msd)<-c("type","testsd")
cama2=merge(merge(cama,mmean),msd)
head(cama2)

  type      str testscr    elpct avginc mealpct  HiEL testmean   testsd
1   CA 14.00000  635.60 0.000000 10.656 68.8235 FALSE 654.1565 19.05335
2   CA 14.20176  656.50 0.000000 13.712 20.0000 FALSE 654.1565 19.05335
3   CA 14.54214  695.30 3.765690 35.342  0.0000  TRUE 654.1565 19.05335
4   CA 14.70588  666.85 2.500000 11.826 53.5032  TRUE 654.1565 19.05335
[ reached getOption("max.print") -- omitted 2 rows ]

tail(cama2)
type str testscr
elpct avginc mealpct HiEL testmean
testsd
635
MA 21.9
691 2.816901 15.905
27.1 TRUE 709.8273 15.12647
636
MA 22.0
706 0.000000 14.471
18.3 FALSE 709.8273 15.12647
637
MA 22.0
711 0.000000 15.603
12.4 FALSE 709.8273 15.12647
638
MA 22.6
731 1.246106 25.792
2.5 TRUE 709.8273 15.12647
[ reached getOption("max.print") -- omitted 2 rows ]
cama2$testnorm=(cama2$testscr - cama2$testmean) / cama2$testsd
detach(cama)
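The within-state standardization above can also be written without merge(); a minimal sketch with a made-up toy data frame (ave() is base R; this is an alternative, not the handout's own code):

```r
## ave() applies a function within each group, here a z-score per state.
d <- data.frame(type    = c("CA", "CA", "MA", "MA"),
                testscr = c(640, 660, 700, 720))
d$testnorm <- ave(d$testscr, d$type,
                  FUN = function(x) (x - mean(x)) / sd(x))
d$testnorm    # -0.7071068  0.7071068 -0.7071068  0.7071068
```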

myPlot(avginc)
est<-ePlot(testscr ~ str + I(str^2) + I(str^3) + elpct + mealpct +
           avginc + I(avginc^2) + I(avginc^3),data=subset(cama,type=="CA"))
                  beta      stdev pvalue stars
(Intercept) 330.079169 173.293815 0.0568     .
str          55.618325  26.486559 0.0357     *
I(str^2)     -2.914810   1.340721 0.0297     *
I(str^3)      0.049866   0.022437 0.0262     *
elpct        -0.196440   0.035054 0.0000   ***
mealpct      -0.411538   0.033874 0.0000   ***
avginc       -0.912858   0.587802 0.1204
I(avginc^2)   0.067430   0.022781 0.0031    **
I(avginc^3)  -0.000826   0.000262 0.0016    **
R2= 0.81

est<-ePlot(testscr ~ str + I(str^2) + I(str^3) + elpct + mealpct +
           avginc + I(avginc^2) + I(avginc^3),data=subset(cama,type=="MA"))
                 beta     stdev pvalue stars
(Intercept) 665.49605 116.07834 0.0000   ***
str          12.42598  20.27945 0.5401
I(str^2)     -0.68030   1.12956 0.5470
I(str^3)      0.01147   0.02081 0.5814
elpct        -0.43417   0.36722 0.2371
mealpct      -0.58722   0.11724 0.0000   ***
avginc       -3.38154   2.74013 0.2172
I(avginc^2)   0.17410   0.09819 0.0762     .
I(avginc^3)  -0.00229   0.00111 0.0398     *
R2= 0.69
linearHypothesis(est,c("str=0","I(str^2)","I(str^3)"),vcov=hccm)
Linear hypothesis test

Hypothesis:
str = 0
I(str^2) = 0
I(str^3) = 0

Model 1: restricted model
Model 2: testscr ~ str + I(str^2) + I(str^3) + elpct + mealpct + avginc +
    I(avginc^2) + I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F  Pr(>F)
1    214
2    211  3 2.3364 0.07478 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
linearHypothesis(est,c("I(str^2)=0","I(str^3)=0"),vcov=hccm)
Linear hypothesis test

Hypothesis:
I(str^2) = 0
I(str^3) = 0

Model 1: restricted model
Model 2: testscr ~ str + I(str^2) + I(str^3) + elpct + mealpct + avginc +
    I(avginc^2) + I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F Pr(>F)
1    213
2    211  2 0.3396 0.7124

linearHypothesis(est,c("I(avginc^2)=0","I(avginc^3)=0"),vcov=hccm)
Linear hypothesis test

Hypothesis:
I(avginc^2) = 0
I(avginc^3) = 0

Model 1: restricted model
Model 2: testscr ~ str + I(str^2) + I(str^3) + elpct + mealpct + avginc +
    I(avginc^2) + I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F   Pr(>F)
1    213
2    211  2 5.7043 0.003866 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
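What linearHypothesis computes can be sketched by hand: fit the restricted and the unrestricted model and compare them with an F test. The sketch below uses simulated data (variable names are made up) and plain anova(), i.e. the homoskedasticity-only version of the test; the handout's calls with vcov=hccm use the heteroscedasticity-robust covariance instead.

```r
## Joint F test of I(inc^2)=0 and I(inc^3)=0: restricted vs. unrestricted model.
set.seed(1)
n   <- 200
inc <- runif(n, 5, 50)                           # made-up income variable
y   <- 700 - 3*inc + 0.1*inc^2 - 0.001*inc^3 + rnorm(n, sd = 5)
unres <- lm(y ~ inc + I(inc^2) + I(inc^3))
res   <- lm(y ~ inc)                             # both non-linear terms dropped
anova(res, unres)                                # homoskedasticity-only F test
```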
aggregate(elpct,list(type),median)
  Group.1        x
1      CA 8.777634
2      MA 0.000000
cama$HiEL=cama$elpct>0
est<-ePlot(testscr ~ str + HiEL + HiEL:str + mealpct + avginc +
           I(avginc^2) + I(avginc^3),data=subset(cama,type=="CA"))
                   beta     stdev pvalue stars
(Intercept)   658.16110 16.404309 0.0000   ***
str             1.35471  0.810345 0.0946     .
HiELTRUE       36.11583 16.262280 0.0264     *
mealpct        -0.50670  0.027027 0.0000   ***
avginc         -0.90724  0.616350 0.1410
I(avginc^2)     0.05912  0.023531 0.0120     *
I(avginc^3)    -0.00068  0.000272 0.0123     *
str:HiELTRUE   -2.18763  0.864613 0.0114     *
R2= 0.8
est<-ePlot(testscr ~ str + HiEL + HiEL:str + mealpct + avginc +
           I(avginc^2) + I(avginc^3),data=subset(cama,type=="MA"))
                  beta    stdev pvalue stars
(Intercept)  759.91422 25.28938 0.0000   ***
str           -1.01768  0.38182 0.0077    **
HiELTRUE     -12.56073 10.22789 0.2194
mealpct       -0.70851  0.09894 0.0000   ***
avginc        -3.86651  2.71955 0.1551
I(avginc^2)    0.18412  0.09930 0.0637     .
I(avginc^3)   -0.00234  0.00115 0.0414     *
str:HiELTRUE   0.79861  0.58020 0.1687
R2= 0.69

linearHypothesis(est,c("str=0","str:HiELTRUE=0"),vcov=hccm)
Linear hypothesis test

Hypothesis:
str = 0
str:HiELTRUE = 0

Model 1: restricted model
Model 2: testscr ~ str + HiEL + HiEL:str + mealpct + avginc + I(avginc^2) +
    I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F Pr(>F)
1    214
2    212  2 3.7663 0.0247 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
linearHypothesis(est,c("I(avginc^2)=0","I(avginc^3)=0"),vcov=hccm)
Linear hypothesis test

Hypothesis:
I(avginc^2) = 0
I(avginc^3) = 0

Model 1: restricted model
Model 2: testscr ~ str + HiEL + HiEL:str + mealpct + avginc + I(avginc^2) +
    I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F  Pr(>F)
1    214
2    212  2 3.2201 0.04191 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
est<-ePlot(testscr ~ str + mealpct + avginc + I(avginc^2) +
           I(avginc^3),data=subset(cama,type=="MA"))
                 beta    stdev pvalue stars
(Intercept) 747.36389 21.67952 0.0000   ***
str          -0.67188  0.27679 0.0152     *
mealpct      -0.65308  0.07859 0.0000   ***
avginc       -3.21795  2.46635 0.1920
I(avginc^2)   0.16479  0.09113 0.0706     .
I(avginc^3)  -0.00216  0.00106 0.0415     *
R2= 0.68
linearHypothesis(est,c("I(avginc^2)=0","I(avginc^3)=0"),vcov=hccm)
Linear hypothesis test

Hypothesis:
I(avginc^2) = 0
I(avginc^3) = 0

Model 1: restricted model
Model 2: testscr ~ str + mealpct + avginc + I(avginc^2) + I(avginc^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F  Pr(>F)
1    216
2    214  2 4.2776 0.01508 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
est<-ePlot(testnorm ~ str*type,data=cama2)
                beta  stdev pvalue stars
(Intercept)  2.35005 0.5490  0.000   ***
str         -0.11965 0.0275  0.000   ***
typeMA      -0.38040 0.8038  0.636
str:typeMA   0.00609 0.0438  0.889
R2= 0.06

est<-ePlot(testscr ~ str*type + mealpct + avginc + I(avginc^2) +
           I(avginc^3),data=cama2)
                  beta    stdev pvalue stars
(Intercept) 697.172818 7.309650 0.0000   ***
str          -0.747646 0.264140 0.0046    **
typeMA       38.005721 7.229002 0.0000   ***
mealpct      -0.542059 0.024022 0.0000   ***
avginc       -1.279304 0.587298 0.0294     *
I(avginc^2)   0.075675 0.022051 0.0006   ***
I(avginc^3)  -0.000909 0.000257 0.0004   ***
str:typeMA   -0.070309 0.378909 0.8528
R2= 0.92

[Figure: testscr (600-750) plotted against var (10-40) for Massachusetts (MASS)
and California (CA) districts]

6.4.1 Internal validity

Omitted variables: we control for
  - Income
  - Some characteristics of the students (language)
  What is missing? E.g. str could be correlated with
  - Quality of teachers
  - Extracurricular activities
  - Attention of parents
  Alternative: an experiment in which pupils are randomly assigned to groups of
  different sizes.

Functional form:
  Different non-linear specifications lead to similar results in the example.
  Non-linearities are not very large. ✓

Errors in the variables:
  str is an average across the whole district. The true variance of
  student-teacher ratios is therefore underestimated by str, and hence the
  estimated coefficient of str is underestimated (attenuated) as well.
  Ideally we would like to have data for individual students.
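A small simulation (not from the handout) shows this attenuation from classical measurement error: when the regressor is observed with noise, the OLS slope is biased towards zero by the factor var(x)/(var(x)+var(noise)).

```r
## Classical errors-in-variables: a noisy regressor attenuates the slope.
set.seed(1)
n      <- 10000
x.true <- rnorm(n)                  # true regressor, variance 1
x.obs  <- x.true + rnorm(n)         # observed with noise of variance 1
y      <- 2 * x.true + rnorm(n)     # true slope is 2
coef(lm(y ~ x.obs))["x.obs"]        # around 2 * 1/(1+1) = 1, not 2
```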
Selection bias:
  Both data sets are based on a full census. ✓

Simultaneous causality: testscr → str (e.g. compensatory measures)
  Massachusetts: no such measures. ✓
  California: compensatory funding, but independent of students' success. ✓

Heteroscedasticity and correlation of error terms:
  Heteroscedasticity: heteroscedasticity-consistent variance-covariance
  matrices. ✓
  Correlation of error terms: no random drawing of observations.
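The heteroscedasticity-consistent covariance matrix used throughout (hccm from the car package) can be combined with coeftest from the lmtest package for robust t tests; a minimal sketch on simulated, deliberately heteroscedastic data (variable names are made up):

```r
## Robust t tests with car::hccm on heteroscedastic data.
library(car)                        # provides hccm()
library(lmtest)                     # provides coeftest()
set.seed(1)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.5 + 2 * x)  # error variance grows with x
fit <- lm(y ~ x)
coeftest(fit, vcov = hccm(fit))     # heteroscedasticity-consistent s.e.
```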

6.4.2 External validity

Comparison of California and Massachusetts.

6.4.3 Result

Reducing group sizes by one unit:
  + testscr increases by 0.08 standard deviations
  - costs (classrooms, teachers' pay)
