
Multiple Regression Analysis: Applications

Lecture 5

Erik Lindqvist

Department of Economics, SSE

Lindqvist (Institute) 651: Lecture 5 1 / 28


Today

The main objective of today's lecture is to illustrate a few key points
from the first four lectures by use of two different applications
Another objective is to illustrate how to use multiple regression in
practice
Toward the end of the lecture, I give a few suggestions for what to
think about when writing an empirical paper (e.g., a B.Sc. thesis)



Application 1: Youth care (based on Lindqvist 2012)

In Sweden, teenagers with different types of behavioral problems like
drug addiction or violent behavior are often placed in residential youth
care
The objective of youth care is that teenagers should receive treatment
for their problems and readjust to society, not to punish misdemeanors
About 80 % of youth care facilities are private, most of which are
small firms
Facilities are compensated per day of treatment, creating a powerful
financial incentive for owners of private facilities to prolong treatment
periods in case the facility does not run at full capacity



Question and data

Question: Do private facilities prolong treatment periods in youth
care?
Data: Information on 357 teenagers in youth care collected from the
teenagers' files at the social services (Vinnerljung et al. 2001)
The data set contains information on teenager and facility characteristics,
and the treatment history of each teenager
Average duration of treatment is about 10 months longer in private
facilities
But does the difference reflect a causal effect of ownership on
duration? This is what we want to know from a policy perspective



Selection problem

Main concern: Type of facility is chosen by the social services,
implying that the treatment needs of teenagers may vary
systematically between private and public facilities
A look at Table 1 indicates that teenagers in private facilities tend to have
more severe problems

Table 1. Proportion of teenagers (%)

                         Public   Private
Drug addiction             26.9      41.5
Violent behavior           16.7      22.2
Criminal behavior          33.3      46.8
Psychological disorder     17.7      26.9



Dealing with the selection problem I

In principle, we could solve the selection problem using multiple
regression if the data set contained information on all teenager
characteristics that determine facility ownership
In reality, this is unlikely to be the case. Some factors unobserved to
us will probably affect choice of ownership
What we can do is to test how sensitive our results are to controlling
for the things that we can observe



Dealing with the selection problem II

We estimate a regression of the form

$$\text{Duration}_{ij} = \beta_0 + \beta_1 \text{Private}_j + \beta_2 x_{i1} + \dots + \beta_k x_{i,k-1} + u_{ij}$$

where $\text{Duration}_{ij}$ is the duration of stay for teenager $i$ in facility $j$,
$\text{Private}_j$ is a "dummy" variable equal to one in case facility $j$ is private
and $x_{i1}, \dots, x_{i,k-1}$ is a set of $k-1$ teenager or treatment characteristics
$\beta_1$ gives the difference in average duration of treatment between
private and public facilities, conditional on the $x$s
Adding sets of control variables stepwise, we get a sense of the
importance and direction of the selection problem
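
The stepwise logic can be tried out on simulated data. The sketch below is only an illustration under made-up assumptions: the variable `severity`, the selection rule, and all coefficient values are hypothetical and are not taken from the youth care data. It estimates the coefficient on the private dummy first without and then with a control, using a small hand-written OLS routine.

```python
import random

def ols(X, y):
    """Return OLS coefficients by solving the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

rng = random.Random(1)
rows = []
for _ in range(2000):
    severity = rng.gauss(0, 1)                              # hypothetical observed control
    private = 1 if severity + rng.gauss(0, 1) > 0 else 0    # placement depends on severity
    duration = 10 + 8 * private + 0.5 * severity + rng.gauss(0, 5)  # made-up coefficients
    rows.append((private, severity, duration))

y = [d for _, _, d in rows]
b_short = ols([[1.0, p] for p, _, _ in rows], y)        # no controls
b_long = ols([[1.0, p, s] for p, s, _ in rows], y)      # control added

print("Private, no controls:  ", round(b_short[1], 3))
print("Private, with controls:", round(b_long[1], 3))
```

If the two printed coefficients are close, selection on the added control matters little for the estimate; a large gap would signal that the control picks up part of the raw difference.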



Results

Table 2. Duration of treatment

                                                 (1)        (2)        (3)        (4)        (5)
Private facility                              8.250***   7.990***   9.131***   9.681***   8.962***
                                              (4.842)    (4.604)    (5.012)    (4.586)    (4.047)
Observations                                    337        337        319        291        291
Baseline covariates (3 variables)                X          X          X          X          X
Pre-treatment characteristics (12 variables)                X          X          X          X
Treatment history (8 variables)                                        X          X          X
Geographical variables (3 variables)                                              X          X
Type of treatment (5 variables)                                                              X

Standard errors in parentheses. Three stars denote statistical significance at the 1 % level in a two-sided test.



Interpreting the results

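Table 2 shows that the results are insensitive to controlling for a host
of teenager and treatment characteristics
Why? Remember the formula for how the simple regression estimate
relates to the multiple regression estimates

$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1 \qquad (3.23)$$

Here $\tilde{\beta}_1$ is the simple regression slope, $\hat{\beta}_1$ and $\hat{\beta}_2$ are the
multiple regression slopes, and $\tilde{\delta}_1$ is the slope from regressing $x_2$ on $x_1$
(3.23) works "approximately" when we have more than two
explanatory variables in the multiple regression
Simplifying, we can think of the entire set of teenager characteristics
as a single variable, $x_2$, reflecting the severity of a teenager's problems
While severity of teenager problems is significantly correlated with
ownership ($\tilde{\delta}_1 > 0$), teenager and treatment characteristics are not
strongly related to the duration of treatment ($\hat{\beta}_2 \approx 0$),
so $\hat{\beta}_2 \tilde{\delta}_1 \approx 0$ and $\tilde{\beta}_1 \approx \hat{\beta}_1$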
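
With exactly two explanatory variables, the relation in (3.23) holds exactly in any sample, which can be checked numerically. The sketch below uses simulated data with made-up coefficients (the names `private`, `severity` and `duration` are only illustrative stand-ins) and obtains the multiple regression slopes by partialling out.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def residuals(x, y):
    """Residuals from a simple regression of y on x (with an intercept)."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - a - b * xi for xi, yi in zip(x, y)]

def partial_slope(x1, x2, y):
    """Coefficient on x1 when y is regressed on x1 and x2 (partialling out x2)."""
    r1 = residuals(x2, x1)  # the part of x1 not explained by x2
    return sum(r * yi for r, yi in zip(r1, y)) / sum(r * r for r in r1)

rng = random.Random(0)
n = 500
private = [rng.randint(0, 1) for _ in range(n)]              # x1
severity = [0.5 * p + rng.gauss(0, 1) for p in private]      # x2, correlated with x1
duration = [10 + 8 * p + 0.1 * s + rng.gauss(0, 5)           # made-up coefficients
            for p, s in zip(private, severity)]

beta1_tilde = slope(private, duration)                   # simple regression
beta1_hat = partial_slope(private, severity, duration)   # multiple regression
beta2_hat = partial_slope(severity, private, duration)
delta1_tilde = slope(private, severity)                  # x2 regressed on x1

print(beta1_tilde, beta1_hat + beta2_hat * delta1_tilde)
```

The two printed numbers should agree up to floating-point error, whatever the seed or the chosen coefficients.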



Conclusion

Our results suggest that the longer duration of treatment in private
facilities reflects a causal effect
However, some doubt will always remain, as unobservable factors
may affect both choice of facility and duration of treatment



Application 2: Maternal longevity and sex of offspring

Is the longevity of mothers affected by the sex of the child?
Physiological cost of childbearing higher for sons, for example due to
higher birth weights
Mothers carrying male foetuses have elevated levels of testosterone,
which may affect the immune system
Helle, Lummaa and Jokela (Science, 2002) collected data from church
books on 375 Sami mothers from northern Finland who lived between
1640 and 1870 to investigate this issue.
They estimated the regression

$$\text{Mother\_long}_i = \beta_0 + \beta_1 \text{Sons}_i + \beta_2 \text{Daughters}_i + u_i$$

restricting the sample to women who reached the age of 50.



Results

Estimating their regression, Helle and coauthors got $\hat{\beta}_1 = -0.65$ and
$\hat{\beta}_2 = 0.44$
Interpreted as a causal effect, these estimates imply that each son
reduces maternal old-age longevity by 0.65 years while each daughter
increases longevity by 0.44 years
Note the size of the relative effect: If a child born is a son instead of
a daughter, then the model predicts that the mother's life will be 0.65
+ 0.44 = 1.09 years shorter, a very large effect, if true



Issues I

1. Selection on the dependent variable (women above the age of 50)
In general, this is a big "no-no" since it may create a correlation
between the error term and the explanatory variables
2. Precision of the estimates
Both $\hat{\beta}_1$ and $\hat{\beta}_2$ are imprecisely estimated
For example, the 95 % CI for $\beta_1$ is (roughly) $-0.65 \pm (0.29 \times 2)$, i.e.,
between $-1.23$ (a huge effect) and $-0.07$ (almost no effect)
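
The interval arithmetic can be written out as a short sketch. It assumes the point estimate of -0.65 and the standard error of 0.29 quoted on the slide, and uses the same rough "plus or minus two standard errors" rule rather than an exact critical value.

```python
beta1_hat = -0.65   # estimated effect of one more son, in years
se = 0.29           # standard error quoted on the slide

# Rough 95 % interval: estimate plus/minus two standard errors
lower = beta1_hat - 2 * se
upper = beta1_hat + 2 * se

print(f"Approximate 95% CI: ({lower:.2f}, {upper:.2f})")
```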



Issues II
3. Omitted variable bias
A causal interpretation of $\hat{\beta}_1$ and $\hat{\beta}_2$ requires that we assume that
fertility is uncorrelated with other factors that affect longevity, e.g.,
health. This does not seem reasonable.
Yet the difference between $\hat{\beta}_1$ and $\hat{\beta}_2$ arguably does not suffer from an
omitted variable bias since the sex of the child, given that a child is
born, is quite unlikely to be correlated with the mother's health status
To see this, we use what we learned in lecture 4. Defining
$\theta = \beta_1 - \beta_2$, we get

$$\text{Mother\_long}_i = \beta_0 + (\theta + \beta_2) \text{Sons}_i + \beta_2 \text{Daughters}_i + u_i$$

and so

$$\text{Mother\_long}_i = \beta_0 + \theta \text{Sons}_i + \beta_2 (\text{Sons}_i + \text{Daughters}_i) + u_i$$

giving us

$$\text{Mother\_long}_i = \beta_0 + \theta \text{Sons}_i + \beta_2 \text{Children}_i + u_i$$

where $\text{Children}_i = \text{Sons}_i + \text{Daughters}_i$
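
The reparametrization can be verified numerically: regressing on sons and total children should return a coefficient on sons equal to the difference between the sons and daughters coefficients from the original regression. The sketch below uses simulated data with invented coefficients (nothing here comes from the Helle et al. data) and a small hand-written OLS routine.

```python
import random

def ols(X, y):
    """Return OLS coefficients by solving the normal equations (X'X) b = X'y."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

rng = random.Random(2)
sons, daughters, longevity = [], [], []
for _ in range(400):
    s, d = rng.randint(0, 5), rng.randint(0, 5)
    sons.append(s)
    daughters.append(d)
    longevity.append(70 - 0.6 * s + 0.4 * d + rng.gauss(0, 8))  # made-up coefficients

children = [s + d for s, d in zip(sons, daughters)]

# Original form: regress on Sons and Daughters
b0, b1, b2 = ols([[1.0, s, d] for s, d in zip(sons, daughters)], longevity)
# Re-formulated form: regress on Sons and Children
c0, theta, c2 = ols([[1.0, s, c] for s, c in zip(sons, children)], longevity)

print("b1 - b2:", round(b1 - b2, 6), " theta:", round(theta, 6))
```

Because the two regressions span the same set of regressors, the fit is identical; only the interpretation of the coefficient on sons changes.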



Replication

Cesarini, Lindqvist and Wallace (2009) replicate Helle et al. using a
data set of 930 Sami women living on the Swedish side of the border
during the same period
The point in using data from a similar population is to rule out that
differences in results are due to differences in the characteristics of the
studied samples
We estimate the re-formulated version of Helle et al.'s model

$$\text{Mother\_long}_i = \beta_0 + \theta \text{Sons}_i + \beta_2 \text{Children}_i + u_i$$

where we argue that $\hat{\theta}$ is an unbiased estimate of the causal
effect of giving birth to a son instead of a daughter
We get $\hat{\theta} = 0.23$, implying a small positive effect of giving birth to a
son instead of a daughter on maternal longevity, but the effect is
not statistically significant.



What do we learn from this?

The differing results could be due to the Sami populations in
northern Sweden and Finland being distinctively different
Alternatively, the differences in results reflect sampling variation
That is, the underlying relationship between sex of offspring and
mothers' longevity is the same in both populations, but we just
happened to collect samples where the results differ
But which sample should we believe?



My 10 cents

This is an example of "publication bias" in action
It is very hard to publish papers with statistically insignificant results in
good journals
This induces researchers to search for statistical significance - projects
that give no statistical significance are dropped
The problem is that statistical significance sometimes occurs by pure
chance
With small samples, standard errors are large, implying that statistical
significance is only reached when the estimated coefficients are large
(Helle et al.'s results are only marginally significant despite the huge effects)
Searching for statistical significance in small samples is therefore likely
to lead to the publication of spurious results which are sensationally
large
Statistical significance arises by chance equally often in large samples
(this is true by definition of statistical significance), but, when it
happens, the estimated coefficients will be smaller
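
The argument can be illustrated with a small simulation. As a simplified stand-in for a regression coefficient, the sketch below estimates a mean whose true value is zero, repeats the "study" many times for a small and a large sample size, and keeps only the estimates that come out significant at roughly the 5 % level. The sample sizes, the number of repetitions and the 1.96 cutoff are assumptions chosen for illustration.

```python
import random
import statistics

def significant_estimates(n, reps, rng):
    """Simulate `reps` studies of size n with a true effect of zero.

    Returns the estimates that come out 'significant' (|t| > 1.96).
    """
    kept = []
    for _ in range(reps):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        est = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        if abs(est / se) > 1.96:
            kept.append(est)
    return kept

rng = random.Random(3)
reps = 2000
results = {}
for n in (25, 400):
    kept = significant_estimates(n, reps, rng)
    share = len(kept) / reps                               # how often chance gives significance
    avg_size = statistics.fmean([abs(e) for e in kept])    # typical size when it does
    results[n] = (share, avg_size)
    print(f"n={n}: share significant={share:.3f}, mean |estimate|={avg_size:.3f}")
```

In both cases the share of spuriously significant results should be in the neighborhood of 5 %, while the average size of the "published" estimates should be several times larger in the small samples.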



Writing an empirical paper

What follows is some advice on how to write an empirical paper
In particular, I hope some of this advice could be useful for those
writing a BSc thesis in the spring
On a side note: You should turn to your supervisor, not me, in case
you run into problems with your thesis...



Manage your time

Writing good papers takes time...
Actual time = Estimated time × 3
Persistence and planning are as important as an understanding of
econometrics
Most common mistakes students make when writing a thesis:
Deciding on a topic too late
Starting to collect data too late
Overestimating their writing skills

Sensible to start thinking about a topic for your bachelor thesis
already at this point



Ask well-defined questions

The general rule is that the narrower the question, the better
You should be able to provide some sort of answer to the question,
so open-ended questions are bad
Bad question: "What factors explain Y?"
Good question: "Does X affect Y?"



Getting ideas

Blogs
Ekonomistas (in Swedish)
Economic Logician
Marginal Revolution
Freakonomics
...and many more
Journal of Economic Perspectives
Summarizes research in a non-technical fashion
News, literature, daily life,...
Choose a topic that interests you!



Empirical work is great...

...but don't forget the theory


What is the relevant theory?
Try to motivate your empirical investigation from a theoretical
perspective
Which theories are consistent with the patterns in the data?
Which theories are inconsistent with the patterns in the data?
Referring to theory could, but does not have to, imply that you use
math or graphs
A verbal account is often enough



Finding data

Sometimes a time-consuming process, and something that you are
often not able to fully control
Hard for undergraduates to get hold of Swedish individual-level
register data
Much easier to get aggregated data
Lots of data on municipalities available from Statistics Sweden
Country data available from the World Bank
Don't forget that you could collect data yourselves
Surveys, experiments, or directly in the field



Example of a BSc thesis

Remuneration systems for local politicians vary in Sweden


Some municipalities pay politicians per meeting, others adjust pay
depending on the duration of the meeting
Max Rylander and Lukas Kvissberg collected data on the number and
duration of meetings in Swedish municipality councils
Question: Do politicians game the system in order to increase their
pay?
Answer: No. Swedish politicians seem a remarkably honest bunch



More examples of BSc theses

Larissa Haspel and Sofie Mler used a survey from Romanian
households to test theories of savings behavior
Adeline Sterner and Shu Sheng designed their own survey to test how
social stigma affects fare evasion in public transport
Hanna-Maria Nordlander and Emanuel Welander used register data to
test how immigration affects wages
Niklas Lyttkens and Alexandra Tham did a lab experiment to test
how altruism changes with age
Björn Beckman and Markus Ederwall used lab experiments and
surveys to test whether healthcare students are more pro-social than
business students (yes, they are)



Get to know the data

Look at the data


Make histograms for the key variables
No data set is perfect
Look out for outliers and unreasonable values
For example: The military draft data contains information about one
man with a recorded height of 3.18 meters

Think about how the data was generated


What are the relevant institutional rules?
Put yourselves in the shoes of the people in the data: What trade-offs
do they face?
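
A first look at the data for outliers and unreasonable values can be as simple as the sketch below. The height values and the plausibility bounds are invented for illustration (only the 3.18 meter record echoes the example above); in practice the bounds depend on what the variable measures.

```python
from collections import Counter

# Hypothetical recorded heights in centimeters, including one impossible value
heights_cm = [178, 182, 169, 318, 175, 191, 173]

def flag_implausible(values, low, high):
    """Return (position, value) pairs that fall outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]

def text_histogram(values, bin_width):
    """Print a crude histogram: one row of '#' per non-empty bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    for start in sorted(bins):
        print(f"{start:>4}-{start + bin_width - 1:<4} {'#' * bins[start]}")

text_histogram(heights_cm, 10)
flagged = flag_implausible(heights_cm, 120, 230)   # assumed plausible range
print("Check these records:", flagged)
```

Flagged records are candidates for a closer look at the source, not for automatic deletion.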



Remember what you have done

Save do-files!
Keep track of all relevant information regarding the data



Presenting the results

Avoid "senseless regressions"
Every regression tells a piece of a story. You need to know what that
piece is, and it has to be relevant for the overall story of the paper
Take care to lay out the tables and figures
Use informative titles
Figures and tables should be self-contained; one should be able to
understand them without searching in the text
Use a combination of figures and tables
The reader should be able to follow what you have done
Refer to tables and figures in the text

