Fiona Steele1
Centre for Multilevel Modelling
Contents
Introduction ............................................................................... 3
C3.1 (C3.1.1 – C3.1.6)
C3.2 (C3.2.1 – C3.2.3)
C3.3 Regression with More than One Explanatory Variable (Multiple Regression) .... 26
     C3.3.1 Statistical control .......................................................... 26
     C3.3.2 The multiple regression model ................................................ 27
     C3.3.3 Using multiple regression to model a non-linear relationship ................ 31
     C3.3.4 Adding further predictors .................................................... 33
C3.4 (C3.4.1 – C3.4.5) ............................................................. 36–41
C3.5 (C3.5.1 – C3.5.3)
1 With additional material from Kelvyn Jones. Comments from Sacha Brostoff, Jon Rasbash and
Rebecca Pillinger on an earlier draft are gratefully acknowledged.
All of the sections within this module have online quizzes for you to
test your understanding.

All of the sections within this module have practicals so you can
learn how to perform this kind of analysis in MLwiN or other
software packages.

Introduction

Pre-requisites

Conditioning

Online resources:
http://www.sportsci.org/resource/stats/
http://www.socialresearchmethods.net/
http://www.animatedsoftware.com/statglos/statglos.htm
http://davidmlane.com/hyperstat/index.html

Motivation
- A quantitative assessment of the size of the effect; e.g. the difference in salary
  between women and men is 5000 per annum;
- A quantitative assessment after taking account of other variables; e.g. a female
  worker earns 6500 less after taking account of years of experience. This
  conditioning on other variables distinguishes multiple regression modelling from
  simple "testing for differences" analyses;
- A measure of uncertainty for the size of the effect; e.g. we can be 95%
  confident that the female–male difference in salary in the population from
  which our sample was drawn is likely to lie between 4500 and 5500.
The key feature that distinguishes multiple regression from simple regression is
that more than one predictor variable is involved. Even if we are interested in the
effect of just one variable (gender) on another (salary), we need to take account of
other variables as they may compromise the results. We can recognise three
distinct cases where it is important to control or adjust for the effects of other
variables:

i)

ii)

iii)
Introduction to Dataset

The ideas of multiple regression will be introduced using data from the 2002
European Social Survey (ESS). Measures of ten human values have been
constructed for 20 countries in the European Union. According to value theory,
values are defined as desirable, trans-situational goals that serve as guiding
principles in people's lives. Further details on value theory and how it is
operationalised in the ESS can be found on the ESS education net
(http://essedunet.nsd.uib.no/cms/topics/1/). The variables we will consider are:

- Age in years
- Gender (coded 0 for male and 1 for female)
- Country (coded 1 for the UK, 2 for Germany and 3 for France)
- Years of education.
        Hedonism    Age    Gender    Country    Education
1         1.55       25      0          2          10
2         0.76       30      0          2          11
3        -0.26       59      0          2           9
4        -1.00       47      1          3          10
.          .          .      .          .           .
.          .          .      .          .           .
5845      0.74       65      0          1           9
We will study one of the ten values, hedonism, defined as "the pleasure and
sensuous gratification for oneself". The measure we use is based on responses to
the question "How much like you is this person?":
Some individuals will tend to select responses from one side of the scale ("very much like me")
for any item, while others will select from the other side ("not like me at all"). If we ignore these
differences in response tendency we might incorrectly infer that the first type of individual
believes that all values are important, while the second believes that all values are unimportant.
We will first consider age as an explanatory variable for hedonism. The age range
in our sample is 14 to 98 years with a mean of 46.7 and standard deviation of 18.1.
C3.1.1 Relationship between X and Y
We will begin with a description of simple linear regression for studying the
relationship between a pair of continuous variables, which we denote by Y and X.
Simple regression is also commonly known as bivariate regression because only two
variables are involved.
Y is the outcome variable (also called a response or dependent variable)
X is the explanatory variable (also called a predictor or independent variable).
In its simplest form, a regression analysis assumes that the relationship between X
and Y is linear, i.e. that it can be reasonably approximated by a straight line. If
the relationship is nonlinear, it may be possible to transform one of the variables
to make the relationship linear or the regression model can be modified (see
C3.3.3). The relationship between two variables can be viewed in a scatterplot. A
scatterplot can also reveal outliers.
Before carrying out a regression analysis, it is important to look at your data first.
There are various assumptions made when we fit a regression model, which we will
consider later, but there are two checks that should always be carried out before
fitting any models: i) examine the distribution of the variables and check that the
values are all valid, and ii) look at the nature of the relationship between X and Y.
Distribution of Y
We can examine the distribution of a continuous variable using a histogram. At
this stage, we are checking that the values appear reasonable. Are there any
outliers, i.e. observations outside the general pattern? Are there any values of 99 in the data that should be declared as missing values? We also look at the
shape of the distribution: is it a symmetrical bell-shaped distribution (normal), or
is it skewed? Although it is the residuals3 that are assumed to be normally
distributed in a multiple regression model, rather than the dependent variable, a
skewed Y will often produce skewed residuals. If the residuals turn out to be non-normal, it may be possible to transform Y to obtain a normally distributed
variable. For example, a positively skewed distribution (with a long tail to the
right) will often look more symmetrical after taking logarithms.
Figure 3.1 shows the distribution of the hedonism scores. It appears approximately
normal with no obvious outliers. The mean of the hedonism score is -0.15 and the
standard deviation is 0.97.
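The remark about log transforms can be checked numerically. Below is a minimal sketch in Python with simulated data (not the module's ESS sample, which is analysed in MLwiN); `skewness` is a hypothetical helper implementing the usual moment-based formula:

```python
import numpy as np

rng = np.random.default_rng(0)

# A positively skewed variable (long right tail), e.g. income-like data.
y = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

def skewness(v):
    """Moment-based sample skewness: mean of cubed standardised deviations."""
    z = (v - v.mean()) / v.std()
    return (z ** 3).mean()

# Taking logarithms often makes a right-skewed distribution more symmetrical.
log_y = np.log(y)

print(skewness(y), skewness(log_y))
```

With this seed the raw variable is strongly right-skewed, while its logarithm is close to symmetric.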
Distribution of X
For a regression analysis the distribution of the explanatory variable is
unimportant, but it is sensible to look at descriptive statistics for any variables
that we analyse to check for unusual values.
Figure 3.2 shows a scatterplot of hedonism versus age, where the size of the
plotting symbol is proportional to the number of respondents represented by a
particular data point. Also shown is what is commonly called the line of best fit,
which we will come back to in a moment. The scatterplot shows a negative
relationship: as age increases then hedonism decreases. The Pearson correlation
coefficient for the linear relationship is -0.34.
3 The residual for each observation is the difference between the observed value of Y and the value
of Y predicted by the model. See C3.1.2 for further details.
C3.1.2

A straight-line relationship between Y and X can be written as

y = c + mx                                                          (3.1)

where m is the gradient or slope of the line, and c is the intercept, the point at
which the line cuts the Y-axis (i.e. the value of y when x = 0). The gradient is
interpreted as the change in y expected for a 1-unit change in x. In statistics, we
often refer to m and c as coefficients. A coefficient of a variable is a quantity that
multiplies it. The slope m is the coefficient of the predictor x, and the intercept c
is the coefficient of a variable which equals 1 for each observation (usually
referred to as the constant).

Because we will soon be adding more explanatory variables (Xs), it is convenient to
use a more general notation with coefficients represented by Greek betas (β).
Thus (3.1) becomes

y = β0 + β1x                                                        (3.2)

so that the intercept is now denoted by β0 and the slope by β1. The subscripts on
the βs indicate the variable to which each coefficient is attached. We could have
written (3.2) as y = β0x0 + β1x1, where x0 = 1 for every observation and x1 = x.
Later we will be adding further explanatory variables (x2, x3, etc.) with
coefficients β2, β3, etc.

For a given individual i (i = 1, 2, 3, ..., n), we denote their value on Y by yi and
their value on X by xi. (Note that when we consider more than one explanatory
variable, we will introduce a second subscript to index the variable. For example,
x2i will denote the value on variable x2 for individual i.)

For individual i, the linear relationship between Y and X may be expressed as:

yi = β0 + β1xi + ei                                                 (3.3)

ei is called the residual and is the difference between the ith individual's actual
y-value and that predicted by their x-value. We know that we cannot perfectly
predict an individual's value on Y from their value on X; the points in a scatterplot
of x and y will never lie perfectly on a straight line (see Figure 3.2, for example).
The residuals represent the (vertical) scatter of points about the regression line.

The residuals are assumed to satisfy the following:

i)   The residuals are normally distributed with zero mean and variance σ²
     (spoken as "sigma-squared"). This assumption is often written in shorthand
     as ei ~ N(0, σ²).

ii)  The variance of the residuals is constant, whatever the value of x, i.e. the
     residuals are homoskedastic.

iii) The residuals are not correlated with one another, i.e. they are
     independent. Correlations might arise if some individuals contribute more
     than one observation (e.g. repeated measures) or if individuals are
     clustered in some way (e.g. in schools). If it is suspected that residuals are
     correlated, the regression model needs to be modified, e.g. to a multilevel
     model (see Module 5).

If these assumptions are not met, the estimates of β0 and, more importantly, β1
may be biased and imprecise.

C3.1.3

In linear regression analysis, β0 and β1 are estimated from the data using a
method called least squares, in which the sum of the squared residuals is
minimized5. (Responses with other scales of measurement require other
techniques, but all of them are based on the same underlying principle of
minimizing the poorness of fit between the actual data points and the fitted
model.)

By applying the method of least squares to our sample data, we obtain an estimate
of the underlying population value of the intercept and of the slope. These
estimates are denoted by β̂0 and β̂1 (spoken as "beta-0-hat" and "beta-1-hat").
The predicted value of y for individual i is denoted by ŷi and is calculated as:

ŷi = β̂0 + β̂1xi                                                     (3.4)

The equation (3.4) is the equation of the estimated or fitted regression line. The
predicted value ŷi is the point on the fitted line corresponding to xi.

If we regress hedonism on age we obtain β̂0 = 0.712 and β̂1 = −0.018, and the fitted
regression line is written (substituting HED for y and AGE for x) as:

HEDi = 0.712 − 0.018 AGEi

The slope estimate tells us that for every extra year of age, hedonism is predicted
to decrease by 0.018. Importantly, the decrease in hedonism expected for an
increase from 14 to 15 years old is the same as for an increase from 54 to 55 years
old. This is a direct consequence of assuming that the underlying functional form
of the model is linear and fitting a linear equation.

Most statistical packages will report the results of a regression analysis in tabular
form, e.g. as in Table 3.1.

Table 3.1. Regression of hedonism on age

           Coefficient
Constant      0.712
Age          -0.018

We can use the fitted line to predict an individual's hedonism based on their age.
So, for example, for an individual of age 25 we would predict a hedonism score of
0.712 − (0.018 × 25) = 0.262. In contrast, we would predict a score of −0.188 for
someone of age 50. The regression line is the "line of best fit" shown in Figure 3.2.

Continuous variables are often centred about the mean so that the intercept has a
more meaningful interpretation. For example, we would centre the variable AGE
by subtracting the sample mean of 46 years from each of its values. If we then
repeat the regression analysis replacing AGE by AGE−46, the intercept becomes the
predicted value of Y when AGE−46 = 0, i.e. when AGE = 46. Rather than a prediction
for a baby of 0 years, which is well outside the age range in the sample, the
intercept now gives a prediction for a 46-year-old adult.

The intercept in the analysis based on centred AGE is estimated as −0.139, which is
the predicted hedonism score for a 46-year-old. Centring does not affect the
estimate of the slope because only the origin of X has been shifted; its scale
(standard deviation) has not changed.
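The estimation, prediction, and centring steps can be sketched in Python with simulated data (illustrative values only; the module's analyses use MLwiN):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated age/hedonism-style data; coefficients are illustrative,
# not the module's ESS estimates.
age = rng.uniform(14, 98, size=2000)
hed = 0.712 - 0.018 * age + rng.normal(0, 0.9, size=2000)

# Least squares estimates of slope (beta1) and intercept (beta0);
# np.polyfit returns the highest-degree coefficient first.
beta1, beta0 = np.polyfit(age, hed, deg=1)

# Predicted value for a 25-year-old: beta0 + beta1 * 25.
pred_25 = beta0 + beta1 * 25

# Centring age shifts the intercept (now the prediction at the mean age)
# but leaves the slope unchanged.
slope_c, intercept_c = np.polyfit(age - age.mean(), hed, deg=1)

print(beta0, beta1, pred_25, slope_c, intercept_c)
```

Note that the centred intercept equals the prediction at the sample mean age, exactly as described above.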
Standardisation and standardised coefficients

Sometimes X is standardised, which involves subtracting the sample mean and then
dividing the result by the standard deviation:

(X − mean of X) / SD of X

C3.1.4
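In simple regression, the slope obtained after standardising both X and Y equals the Pearson correlation coefficient (this identity is restated in C3.3.4's recap of standardised coefficients). A small numerical check with simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)

x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)

def standardise(v):
    """Subtract the sample mean and divide by the standard deviation."""
    return (v - v.mean()) / v.std()

# Slope from regressing standardised Y on standardised X...
slope_std = np.polyfit(standardise(x), standardise(y), 1)[0]

# ...equals the Pearson correlation coefficient in simple regression.
r = np.corrcoef(x, y)[0, 1]

print(slope_std, r)
```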
C3.1.5 Hypothesis testing

We must bear in mind that the estimates of the intercept and slope are subject to
sampling variability, as is any statistic calculated from a sample. While we have
established that there is a negative relationship between hedonism and age in our
sample, we are really interested in their relationship in the population from which
our sample was drawn (the combined populations of France, Germany and the UK).
In other words, is the relationship statistically significant, or could we have got
such a result by chance?

The null hypothesis (H0) for our test is that there is no relationship between
hedonism and age in the population, in which case β1 = 0.

Alternatively, but equivalently, we can calculate the test statistic (often called
the Z or t-ratio)

Z = β̂1 / SE(β̂1)

For the regression of hedonism on age, |Z| = 27.9 (using β̂1 = −0.018 and its
standard error), and the p-value is less than 0.001.

6 −1.96 and +1.96 are the 2.5% and 97.5% points of a standard normal distribution (one with a mean
of zero and a standard deviation of one). The middle 95% of the distribution lies between these
points.

C3.1.6 Model checking

Recall the assumptions made about the residuals ei:

i)   The residuals are normally distributed.

ii)  The variance of the residuals is constant, whatever the value of x, i.e. the
     residuals are homoskedastic.

iii) The residuals are not correlated with one another, i.e. they are
     independent.

We can check the validity of assumptions i) and ii) by examining plots of the
estimated residuals. If it is suspected that residuals might be correlated because
the data are clustered in some way, we can test assumption iii) by comparing a
multilevel model, which accounts for clustering, with a multiple regression model
which ignores clustering (see Module 5).

To check assumptions about ei, we use the estimated residuals, which are the
differences between the observed and predicted values of y:

êi = yi − ŷi

We usually work with the standardized residuals ri, which we obtain by dividing êi
by their standard deviation.
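A minimal sketch of the Z-ratio and standardized-residual calculations on simulated data, using the classical formula SE(β̂1) = √(σ̂² / Σ(xi − x̄)²) (the formula is standard but not quoted in the text above):

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.uniform(14, 98, size=5000)
y = 0.7 - 0.018 * x + rng.normal(0, 0.97, size=5000)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Classical standard error of the slope: sqrt(sigma^2 / sum((x - xbar)^2)),
# with sigma^2 estimated from the residual sum of squares on n - 2 d.f.
n = len(x)
sigma2 = (resid ** 2).sum() / (n - 2)
se_b1 = np.sqrt(sigma2 / ((x - x.mean()) ** 2).sum())

# Z (t) ratio for H0: beta1 = 0.
z = b1 / se_b1

# Standardised residuals: roughly 95% should fall between -1.96 and +1.96.
r_std = resid / resid.std()
frac_within = np.mean(np.abs(r_std) < 1.96)

print(z, frac_within)
```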
Figure 3.3 and Figure 3.4 show a histogram and normal probability plot of residuals
from a simple regression model with age. Both plots suggest that the normal
distribution assumption is reasonable here.

[Figure 3.3: histogram of standardized residuals (frequency versus standardized residual)]

[Figure 3.4: normal probability plot of standardized residuals]

Figure 3.5 shows a plot of ri versus xi. The vertical spread of the points appears
fairly equal across different values of X, so we conclude that the assumption of
homoskedasticity is reasonable.

[Figure 3.5: standardized residuals versus age]
Outliers

We can also check for outliers using any of the above residual plots. An outlier is a
point with a particularly large residual. We would expect approximately 95% of
the residuals to lie between −2 and +2.

Of major interest, however, is whether an outlier has undue influence on our
results. For example, in simple regression, an outlier with very large values on X
and Y could push up a positive slope. A straightforward way to judge the
influence of an outlier is to refit the regression line after excluding it. If the
results are very similar to those based on all observations, we would conclude that
the outlier does not have undue influence. An observation's influence can also be
measured by a statistic called Cook's D (see C3.5.3).
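The refit-without-the-outlier check can be illustrated as follows (simulated data; the outlier's coordinates are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

x = rng.normal(50, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 2, size=200)

slope_clean = np.polyfit(x, y, 1)[0]

# Add a single extreme point with very large values on both X and Y.
x_out = np.append(x, 150.0)
y_out = np.append(y, 300.0)
slope_with = np.polyfit(x_out, y_out, 1)[0]

# Refit after excluding the outlier: a simple check of its influence.
slope_refit = np.polyfit(x_out[:-1], y_out[:-1], 1)[0]

print(slope_clean, slope_with, slope_refit)
```

The single high-leverage point pushes the slope up noticeably; excluding it recovers the original fit.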
Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)
C3.2.1

Please read P3.1, which is available in online form or as part of a pdf file.

               Women    Men
Sample size     2747    3098
C3.2.2

yi = β0 + β1xi + ei

where yi is the hedonism score of individual i, and xi = 1 if the individual is a
woman, and 0 if the respondent is male7.

The regression output is given in Table 3.3, from which we obtain the fitted
regression equation:

HEDi = −0.069 − 0.156 SEXi

Table 3.3. Regression of hedonism on sex

           Coefficient   Standard error
Constant     -0.069          0.019
Sex          -0.156          0.025

We can use this equation to predict HED for men and women:

For men (SEX = 0), HED = −0.069 − (0.156 × 0) = −0.069
For women (SEX = 1), HED = −0.069 − (0.156 × 1) = −0.225

Notice that these predicted values are just the mean hedonism scores for men and
women, and that the coefficient of SEX is the difference between these means
(women's mean − men's mean, since SEX is coded 1 for women here).

The null hypothesis that there is no difference between the mean score for men
and women in the population can be expressed as H0: β1 = 0. The standard error
of β̂1 is 0.025 and the Z-ratio is therefore −0.156 / 0.025 = −6.12. The 95%
confidence interval for β1 is (−0.206, −0.106).

Note that these results are exactly the same as those for the independent samples
comparison of means test given earlier. So if SEX is the only explanatory variable,
a regression analysis gives exactly the same results as a t-test. But only in a
regression analysis can we include other explanatory variables.

           Sample size   Mean hedonism score
UK            1748           -0.384
Germany       2785           -0.128
France        1312            0.108
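The identity between the dummy-variable regression and the group means can be verified on simulated data (the coefficients below are used only to generate the illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# sex = 1 for women, 0 for men (a dummy variable), illustrative data only.
sex = rng.integers(0, 2, size=1000).astype(float)
hed = -0.069 - 0.156 * sex + rng.normal(0, 0.97, size=1000)

b1, b0 = np.polyfit(sex, hed, 1)

mean_men = hed[sex == 0].mean()
mean_women = hed[sex == 1].mean()

# Intercept = men's mean; slope = women's mean minus men's mean.
print(b0, b1, mean_men, mean_women - mean_men)
```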
                    Sum of squares    d.f.   Mean square   F statistic   p-value
Between countries        184.5           2       92.3          100.4     <0.001
Within countries        5370.7        5842        0.9
Total                   5555.2        5844

The tiny p-value suggests that we can reject the null hypothesis and conclude that
there are significant between-country differences in hedonism.
The statistical model behind ANOVA is in fact a multiple regression model. But
rather than including country as an explanatory variable10, we create dummy
variables for two of the three countries and include these. Suppose we create
three variables which indicate whether a respondent is from a particular country,
i.e.
UK     = 1 if respondent is from the UK, = 0 if from Germany or France
GERM   = 1 if respondent is from Germany, = 0 if from UK or France
FRANCE = 1 if respondent is from France, = 0 if from UK or Germany
These variables are called dummy variables11. In fact we do not need all three of
these variables because if we know a respondent's value on two of them, we can
infer their value on the third. E.g. if we know that UK=0 and GERM=1, then we
know that FRANCE=0. (A respondent can only be living in one country at the time
of the survey, so only one of UK, GERM and FRANCE can equal 1 for any given
individual.) By the same argument, when we have a categorical variable with only
two categories (e.g. our SEX variable in C3.2.1) we do not need to create any
additional variables. SEX is already a dummy variable, and can therefore be
included directly in the model as an explanatory variable.
To allow for differences between the UK, Germany and France, we choose
(arbitrarily) two of the country dummy variables and include those as explanatory
variables. Suppose we choose GERM and FRANCE; then the multiple regression
model is:

yi = β0 + β1GERMi + β2FRANCEi + ei

Table 3.6. Regression of hedonism on country (UK as reference category)

            Coefficient   Standard error   Z-ratio   p-value
Constant      -0.384          0.023
Country
  Germany      0.256          0.029          8.765    <0.001
  France       0.492          0.035         14.052    <0.001

β̂1 = 0.256 is the difference between the means for Germany and the UK
β̂2 = 0.492 is the difference between the means for France and the UK

The UK is the reference category. (If we had included the UK and FRANCE dummy
variables in the model, then Germany would have been the reference.)

10 We could not simply include COUNTRY itself and fit yi = β0 + β1COUNTRYi + ei, because
the coding of COUNTRY is arbitrary (i.e. COUNTRY is a nominal variable). In such a model, β1
would be interpreted as the effect on HED of a 1-unit change in COUNTRY, but a 1-unit change in
COUNTRY has no meaning!

11 This is the most common way of coding dummy variables for a categorical variable and is often
called simple coding, but other types of coding are possible depending on which comparisons are of
interest. A comprehensive discussion of alternative coding systems can be found at
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter5/statareg5.htm
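A sketch of the dummy-coding construction with the UK as reference, on simulated data that reuses the country means quoted above (the fitted coefficients recover the sample mean differences exactly):

```python
import numpy as np

rng = np.random.default_rng(6)

# country: 1 = UK, 2 = Germany, 3 = France (coding as in the dataset).
country = rng.integers(1, 4, size=1500)
means = {1: -0.384, 2: -0.128, 3: 0.108}  # module's country means, reused for simulation
y = np.array([means[int(c)] for c in country]) + rng.normal(0, 0.95, size=1500)

# Dummy variables with the UK as the reference category.
germ = (country == 2).astype(float)
france = (country == 3).astype(float)

# Least squares via a design matrix: constant, GERM, FRANCE.
X = np.column_stack([np.ones_like(y), germ, france])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# b[0] = UK sample mean; b[1] = Germany - UK; b[2] = France - UK.
print(b)
```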
C3.2.3

The null hypothesis for testing whether there is a difference between the mean
hedonism scores in the populations of Germany and the UK can be expressed as
H0: β1 = 0. Similarly, the null for testing whether there is a difference between
France and the UK is H0: β2 = 0. The simplest way to compare Germany and France
would be to refit the model, making one of these countries the reference category.
Table 3.7 shows the results when Germany is taken as the reference, i.e. when the
UK and FRANCE dummies are included in the model. The difference between the
means for France and Germany is now obtained directly (from the coefficient of
the FRANCE dummy) as 0.236.

All coefficients in Table 3.6 and Table 3.7 are significantly different from zero (all
p-values are <0.001), so we conclude that all pairwise differences between
countries are significant.

Table 3.7. Regression of hedonism on country (Germany as reference category)

            Coefficient   Standard error   Z-ratio   p-value
Constant      -0.128          0.018
Country
  UK          -0.256          0.029         -8.765    <0.001
  France       0.236          0.032          7.341    <0.001
The above tests are for comparing pairs of countries. In ANOVA, the null
hypothesis is that the means for all three countries are equal. In most statistical
packages, an ANOVA table is given as part of the standard regression analysis
output. The only difference between the regression ANOVA table and the one-way
ANOVA table is that the between-country sum of squares would usually be called
the "regression" sum of squares, and "within-country" would be replaced by
"residual". All numerical results would be exactly the same. In regression terms,
the null hypothesis being tested is that all coefficients are zero, which in this case
is that β1 = β2 = 0. This is just another way of saying that all three country means
are equal.
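The arithmetic in the ANOVA table above can be reproduced directly from its sums of squares and degrees of freedom:

```python
# Sums of squares and degrees of freedom from the one-way ANOVA table.
ss_between, df_between = 184.5, 2
ss_within, df_within = 5370.7, 5842

ms_between = ss_between / df_between   # mean square between countries
ms_within = ss_within / df_within      # mean square within countries

f_stat = ms_between / ms_within        # the reported F statistic (~100.4)
print(ms_between, ms_within, f_stat)
```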
Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)

Please read P3.2, which is available in online form or as part of a pdf file.

Don't forget to take the online quiz! (see page 2 for details of how
to find the quiz questions)

The advantage of multiple regression over one-way ANOVA is that regression can
allow for the effects of several explanatory variables simultaneously12.

12 A multiple regression analysis with two categorical explanatory variables is sometimes called a
two-way ANOVA, while a regression with a mixture of categorical and continuous variables is called
an analysis of covariance (ANCOVA).
C3.3.1 Statistical control

So far we have used simple regression to assess the linear relationship between
two variables. In reality there will be a number of factors that are potential
predictors of the outcome variable. The advantage of using a regression
framework is that we can straightforwardly account for the effects of multiple
variables simultaneously.

Examples

i) Suppose we compare two secondary schools on their age 16 exam performance,
e.g. we might compare the percentage of students who achieve a pass in five or
more subjects. Suppose we find that school 1 has a higher percentage with 5+
passes than school 2. Would we conclude that school 1's performance was better
than school 2's? What other factors would we like to take into account? An obvious
candidate would be a measure of students' achievement when they entered
secondary school, so that school effects are "value-added".

ii) Comparisons of men's and women's salaries often reveal that women earn less.
Explanations that are commonly put forward for this discrepancy are that women
tend to work in jobs that have been traditionally lower paid, or that women have
taken time out of paid employment to raise children. To determine whether there
are salary differences between men and women who have been working in the
same job for the same amount of time, we would wish to account for occupation
and number of years of full-time employment, as well as other factors such as
education level. Using multiple regression we can test whether these other factors
explain gender differences in salary, i.e. does any gender difference disappear
when we adjust for the effects of these other variables?

We can use multiple regression to take into account or adjust for other factors
that might predict the response variable. Sometimes the effects of these other
factors are of interest in themselves, e.g. predictors of age 16 attainment other
than the school attended. Other times the effects of other factors are not of
major interest, but it is important to adjust for their effects to obtain more
meaningful estimates of effects that we are interested in. Such factors are often
called controls.

C3.3.2 The multiple regression model

In simple regression we have a single predictor or explanatory variable (X), and the
linear regression model is

yi = β0 + β1xi + ei.

In multiple regression, we have more than one predictor. Suppose that we have
two predictors, denoted by X1 and X2, which may be continuous or categorical. We
have in fact already used a multiple regression model to analyse country
differences in hedonism (in C3.2.2). Although there was just one predictor,
country, it was represented by two dummy variables. More generally, we can
include several predictors and any of these may be represented by a set of dummy
variables.

The multiple (linear) regression model for two continuous (or dichotomous)
explanatory variables is written

yi = β0 + β1x1i + β2x2i + ei

where β0 is the value of y that would be expected when x1 = 0 and x2 = 0.
The coefficients β1 and β2 are interpreted as follows: β1 is the change in y
expected for a 1-unit change in x1, holding x2 constant, and β2 is the change in y
expected for a 1-unit change in x2, holding x1 constant.
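A sketch of fitting the two-predictor model by least squares on simulated data (the coefficients and the correlation between the predictors are chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(7)

# Correlated predictors, loosely like age and education (negative correlation).
x1 = rng.normal(0, 1, size=3000)
x2 = -0.25 * x1 + rng.normal(0, 1, size=3000)
y = 1.0 - 0.4 * x1 - 0.1 * x2 + rng.normal(0, 1, size=3000)

# Design matrix with a constant, then least squares via lstsq.
X = np.column_stack([np.ones_like(y), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# b[1] estimates the effect of x1 holding x2 constant, and vice versa.
print(b)
```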
Example

We will begin with the case where both X1 and X2 are continuous. Let's consider
the effects of age (X1) and education (X2) on hedonism. We will ignore gender and
country differences for now. We have already examined the bivariate relationship
between hedonism and age and found that older respondents tend to be less
hedonistic (in C3.1). This relationship may change when we account for education
if education is related to both hedonism and age. For example, we would expect
older respondents to have fewer years of education, and a higher level of education
might be associated with less hedonistic beliefs if the more career-minded choose
study over having a good time!

Figure 3.6 shows the relationship between hedonism and education (see Figure 3.2
for a plot of hedonism versus age). The relationship between the two explanatory
variables, age and education, is shown in Figure 3.7. The correlation between
hedonism and education is very weak; the Pearson coefficient is only 0.024. As
expected, there is a negative correlation between age and education (r = −0.242).
Because of the weak correlation between hedonism and education, however, we
would not expect the addition of education in a multiple regression to have much
impact on the coefficient of age.

In C3.1.3 the fitted equation from a simple regression of hedonism on age was
found to be:

HEDi = 0.712 − 0.018 AGEi

Notice that, as expected, there is little change in the coefficient of age when
education is added, but the relationship between hedonism and education is now
negative after accounting for age. Both relationships are significantly different
from zero at the 0.1% level. The relationship between hedonism and education
should be interpreted with some caution, however. We should hesitate to
conclude that education affects or causes hedonism. It is likely that hedonism and
education are both influenced by variables that we have not accounted for in this
model.
C3.3.3 Using multiple regression to model a non-linear relationship

Suppose a scatterplot of Y versus X resembles Figure 3.8. The relationship is
non-linear, so it would not be appropriate to fit the straight-line relationship
implied by a linear regression model. We should fit a curve through the points
rather than a line. The simplest curve is a quadratic function (or a second-order
polynomial):

yi = β0 + β1xi + β2xi² + ei

Note that the above is an example of a multiple regression model with x1 = x and
x2 = x². Also shown in Figure 3.8 is the fitted quadratic curve, which turns out to
have equation ŷi = 1.00 + 1.02xi − 0.47xi².
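The quadratic fit can be reproduced in the same multiple-regression framework; a sketch with data simulated from a curve close to the one quoted above:

```python
import numpy as np

rng = np.random.default_rng(8)

# Simulated curved relationship, roughly matching the fitted curve in the
# text: y = 1.00 + 1.02 x - 0.47 x^2 plus noise (illustrative only).
x = rng.uniform(-2, 2, size=2000)
y = 1.00 + 1.02 * x - 0.47 * x ** 2 + rng.normal(0, 0.5, size=2000)

# Fitting a quadratic is multiple regression with x1 = x and x2 = x^2.
X = np.column_stack([np.ones_like(x), x, x ** 2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b)  # close to the generating coefficients [1.00, 1.02, -0.47]
```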
Standardised coefficients
Standardisation and standardised coefficients were introduced in C3.1.3. To recap,
the standardised coefficient for a predictor X is the estimate of the slope that
would be obtained if X and Y were both standardised before the regression
analysis. In simple regression, the standardised coefficient of X is equal to the
Pearson correlation coefficient. In multiple regression, with two predictors X1 and
X2, the standardised coefficient of X1 is interpreted as the change in standardised Y
for a 1-unit change in standardised X1, holding X2 constant. (Recall that 1 unit of a
standardised variable corresponds to 1 standard deviation.) For example, in a
multiple regression model of hedonism on age and education, the standardised
coefficient for AGE is -0.358. Thus we can say that a 1 standard deviation change
in age predicts a 0.358 standard deviation decrease in hedonism. Note that if all
variables (Y and the Xs) had been standardised prior to the analysis, then the
unstandardised and standardised coefficients would be equal.
[Figure 3.9]
The results from fitting a quadratic curve to the relationship between hedonism
and age are given in Table 3.8. Note that this analysis is based on standardised
age and its square. This is because, for older respondents (remember the oldest is
98), age2 takes very large values; this may cause computational difficulties and the
coefficient of age2 would be very small.
Table 3.8. Regression with quadratic effects for age
Coeff.
S.E.
-0.222
-0.348
0.072
0.017
0.012
0.011
Constant
Standardised age
Standardized age-squared
Z-ratio
-28.669
6.288
C3.3.4
Suppose that we have p predictors, which we denote by X1, X2, X3, . . ., Xp. Then
the multiple regression model is
y i = 0 + 1 x 1i + 2 x 2i + 3 x 3i + ... + p x pi + ei
Variation explained: R2
p-value
<0.001
<0.001
The R2 for the regression model with age and education effects is 0.121, so 12.1%
of the variance in hedonism scores is due to variation in age and education. The
correlation between the predicted and observed hedonism scores is 0.348
= 0 . 121 . As suggested by the low bivariate correlation between hedonism and
education, education has little explanatory power; when education is removed the
model R2 decreases only slightly to 0.118.
A problem with R2 is that it always increases even if irrelevant variables are added
to the model. Therefore in multiple regression a measure called the adjusted R2 is
usually quoted. The adjusted R2 takes into account the number of variables in the
model. It is therefore a goodness-of-fit measure that is penalised by the
complexity of the model. With such a measure, the value will only increase if the
additional predictors are accounting for some of the variability in the response. In
this simple example with only two explanatory variables, age and education, the
adjusted R2 turns out to be the same as the unadjusted value.
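The adjustment described above follows the usual formula 1 − (1 − R²)(n − 1)/(n − p − 1). A quick sketch (assuming n = 5845, the sample size suggested by the data extract in Table 3.11) shows why the adjustment is negligible with only two predictors and a large sample:

```python
# Sketch of the adjusted R-squared formula, using the values reported in the
# text (R2 = 0.121, p = 2 predictors). n = 5845 is an assumption taken from
# the respondent numbering in Table 3.11.
def adjusted_r2(r2, n, p):
    """Penalise R2 for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2 = 0.121
n, p = 5845, 2
print(round(adjusted_r2(r2, n, p), 3))  # 0.121: unchanged at 3 d.p., as noted
```

With many predictors or a small sample, the penalty term (n − 1)/(n − p − 1) grows and the adjusted value can fall noticeably below the unadjusted R².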
Multicollinearity
Before carrying out a regression analysis, we should always look at the correlation
between each pair of predictor variables. If the correlation between a pair is very
high (>0.8 say), the estimates of the coefficients of those variables may be
unstable and imprecise (large standard errors). If the two variables are really
measuring the same thing, we should consider dropping one. Otherwise, we might
replace the two variables by a new variable which is a combination of the two.15

Figure 3.9. Plot of hedonism versus standardised age with fitted quadratic curve

15 Principal components analysis or factor analysis can be used to reduce a set of correlated
variables into a smaller set of uncorrelated variables.
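The pairwise-correlation check described above might be sketched as follows (made-up predictors; x2 is deliberately close to a rescaled copy of x1, while x3 is unrelated):

```python
import statistics

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical predictors: x2 is nearly a rescaled copy of x1
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2]
x3 = [5, 1, 4, 2, 8, 3, 7, 6]

# Flag pairs whose correlation exceeds the rule-of-thumb threshold of 0.8
for name, other in (("x2", x2), ("x3", x3)):
    r = corr(x1, other)
    flag = "collinear: consider dropping or combining" if abs(r) > 0.8 else "ok"
    print(name, round(r, 3), flag)
```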
Effect sizes

The size of the coefficient for predictor variable Xk will depend on the scales of Xk
and the response variable. For example, suppose we multiply each value of AGE
by 12 to give age in months rather than years and refit the multiple regression
model with age and education effects. We obtain the results shown in Table 3.9.

Table 3.9. Regression of hedonism on age and education for different age scales

              Age in years           Age in months
              Coeff.   Z-ratio       Coeff.   Z-ratio
Constant       0.971                  0.971
Age           -0.019   -28.206       -0.002   -28.206
Education     -0.017    -4.915       -0.017    -4.915

The coefficient of age in months (AGE*12) is -0.002, which is the coefficient of age
in years (AGE) divided by 12. This is because 1 unit on the scale of AGE*12 is equal
to 1/12 of a unit on the scale of AGE. Notice that the intercept does not change
because AGE=0 means the same whether the measurement is in months or years.
The coefficient of education is unaffected by transformations in age.

Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)

Please read P3.3, which is available in online form or as part of a pdf file.

Don't forget to take the online quiz for this section! (see page 2 for
details of how to find the quiz questions)
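The rescaling result in Table 3.9 can be verified numerically (a sketch with made-up data): multiplying a predictor by 12 divides its slope by 12 and leaves the intercept unchanged.

```python
import statistics

# Made-up data to illustrate the rescaling result (years -> months).
age_years = [25, 30, 59, 47, 65, 18, 72, 41]
hed = [1.2, 0.8, -0.4, 0.1, -0.9, 1.5, -1.1, 0.3]

def simple_ols(x, y):
    """Return (intercept, slope) from a simple least-squares fit."""
    mx, my = statistics.mean(x), statistics.mean(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

a1, b1 = simple_ols(age_years, hed)
age_months = [a * 12 for a in age_years]   # rescale the predictor
a2, b2 = simple_ols(age_months, hed)

print(round(a1, 9) == round(a2, 9))        # intercepts match
print(round(b1 / 12, 9) == round(b2, 9))   # slope divided by 12
```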
C3.4.1
Suppose we fit a multiple regression model with age and gender effects:

HEDi = β0 + β1AGEi + β2SEXi + ei        (3.7)
Figure 3.10. Regression lines for men and women, fixed slopes
Note: The age range in the sample is 14 to 98 years. The software used to draw the plot
has extrapolated the regression lines beyond the observed range, which is not generally
recommended.
So the lines for men and women have different intercepts, but the same slope, i.e.
the regression lines are parallel (see Figure 3.10). There are two equivalent ways
of interpreting Figure 3.10. We can say that the effect of age on hedonism is the
same for men and women. Alternatively, we can say that the gender difference in
hedonism is the same at all ages.

C3.4.2

Is it reasonable to assume that the gender difference in hedonism is the same for
all ages? One way of allowing men and women to have different slopes for the
relationship between hedonism and age is to fit a separate regression line for each
sex. We do this by splitting the sample by gender,16 and fitting a simple regression
of HED on AGE for each sex. If we do this, we obtain the results shown in Table
3.10.

16 This is often done using a select if command or menu option, or by requesting an analysis that is
stratified by gender.
Table 3.10. Regression of hedonism on age with separate models fitted for men and women

                Coeff.    S.E.    Z-ratio
Men
  Constant       0.839    0.047
  Age (years)   -0.019    0.001   -20.854
Women
  Constant       0.597    0.047
  Age (years)   -0.018    0.001   -18.910

For men the slope of age is -0.019, compared to -0.018 for women. So the slope is
slightly steeper for men. Because women have a lower intercept than men, a
steeper slope for men implies that the gender difference is greater among younger
respondents (see Figure 3.11 later).

While splitting the sample into groups is a simple way of allowing for different
slopes for each group, there are several problems with this approach:

i) There may be more than one categorical predictor, and therefore more than
one way of grouping the data.
ii) The effects of the other predictors may vary across each grouping, e.g.
hedonism may vary by sex and by country.
iii) Splitting the data into groups defined by sex and country will lead to a large
number of groups; in this dataset, the sample sizes in each group remain
large, but this will often not be the case.
iv) …

C3.4.3

Rather than fitting a separate model for each sex, we will fit a single model to the
whole pooled sample. We create a new variable which is the product of AGE and
SEX:

AGE_SEX = AGE × SEX

The new variable AGE_SEX is added as another predictor variable to model (3.7) to
give:

HEDi = β0 + β1AGEi + β2SEXi + β3AGE_SEXi + ei        (3.8)

Table 3.11 gives an extract of the analysis data file to which (3.8) could be fitted.

Table 3.11. Example of hedonism dataset with age by sex interaction variable

Respondent   Hedonism   AGE   SEX   AGE_SEX
1             1.55      25    0     0
2             0.76      30    0     0
3            -0.26      59    0     0
4            -1.00      47    1     47
.             .         .     .     .
.             .         .     .     .
5845          0.74      65    0     0

The inclusion of AGE_SEX, called the interaction between AGE and SEX, allows the
effect of AGE on HED to differ for men and women (or, equivalently, the effect of
sex on HED to depend on AGE). If the effect of age differs by sex, we say that
there is an interaction effect. To see how an interaction effect works, we will
look at the regression model for each value of SEX.

For SEX=0 (men), AGE_SEX=0 so the regression model (3.8) becomes:

HEDi = β0 + β1AGEi + ei        (3.9)

For SEX=1 (women), AGE_SEX=AGE and the regression model (3.8) becomes:

HEDi = (β0 + β2) + (β1 + β3)AGEi + ei        (3.10)
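The construction of the interaction variable can be sketched in a few lines (using the rows shown in the Table 3.11 extract): AGE_SEX equals AGE for women (SEX=1) and 0 for men (SEX=0).

```python
# Rows taken from the Table 3.11 extract: (respondent, hedonism, AGE, SEX).
rows = [
    (1, 1.55, 25, 0),
    (2, 0.76, 30, 0),
    (3, -0.26, 59, 0),
    (4, -1.00, 47, 1),
    (5845, 0.74, 65, 0),
]

# The interaction variable is simply the product of AGE and the 0/1 dummy
for respondent, hed, age, sex in rows:
    age_sex = age * sex
    print(respondent, hed, age, sex, age_sex)
```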
Table 3.12. Regression of hedonism on age, sex and their interaction

                Coeff.    S.E.    Z-ratio   p-value
Constant         0.839    0.048
Age (years)     -0.019    0.001   -20.075   <0.001
Female          -0.242    0.066    -3.649   <0.001
Age × Female     0.002    0.001     1.461    0.144
C3.4.4
Is the slope in the regression of hedonism on age significantly different for men
and women?
From Table 3.12, we see that the Z-ratio for this test is 1.461 and the p-value is
0.144. So we cannot reject the null hypothesis, and we conclude that there is no
evidence that the slope of age differs for men and women. We would then return
to the simpler model (3.7) with the fixed slope.
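As a check on this interpretation, the group-specific lines implied by models (3.9) and (3.10) can be recovered from the Table 3.12 estimates; the women's line closely reproduces the separate-sample estimates in Table 3.10 (0.597 and -0.018), up to rounding.

```python
# Estimates from Table 3.12: b0 = constant, b1 = age, b2 = female,
# b3 = age-by-female interaction.
b0, b1, b2, b3 = 0.839, -0.019, -0.242, 0.002

# Men (SEX=0): intercept b0, slope b1. Women (SEX=1): intercept b0 + b2,
# slope b1 + b3, as in equations (3.9) and (3.10).
men_intercept, men_slope = b0, b1
women_intercept, women_slope = b0 + b2, b1 + b3

print(men_intercept, men_slope)                           # 0.839 -0.019
print(round(women_intercept, 3), round(women_slope, 3))   # 0.597 -0.017
```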
C3.4.5

We have concluded that the effect of age on hedonism is the same for men and
women. Or, equivalently, we can conclude that the gender difference in hedonism
is the same for all ages. We will now test whether the effect of age is the same in
each of the three countries. We create an interaction variable between age and
each of the two country dummies:

AGE_GERM = AGE × GERM
AGE_FRANCE = AGE × FRANCE

The interaction model has the form:

HEDi = β0 + β1AGEi + β2GERMi + β3FRANCEi + β4AGE_GERMi + β5AGE_FRANCEi + ei

The results from fitting this model are given in Table 3.13.

Table 3.13. Regression of hedonism on age, country and their interactions

                 Coeff.    S.E.    Z-ratio   p-value
Constant          0.604    0.061
Age (years)      -0.021    0.001   -17.210   <0.001
Country
  Germany        -0.007    0.078    -0.085    0.932
  France          0.386    0.090     4.277   <0.001
Age × Germany     0.005    0.002     3.207    0.001
Age × France      0.001    0.002     0.570    0.569

To test whether all three countries have the same slope (a joint test), we need to
test the null hypothesis that β4 and β5 are both (simultaneously) equal to zero. We
can do this using an F-test for comparing nested models: the model in which β4 and
β5 are freely estimated (the interaction model) versus the model with both β4 and
β5 fixed at zero (the main effects model, i.e. without interaction terms). The
p-value for this test turns out to be 0.040, so there is evidence at the 5% level that
the interaction model is a significantly better fit to the data: at least one of the
age-by-country interaction coefficients is non-zero. We therefore conclude that
the age effect differs between countries.

Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)

Please read P3.4, which is available in online form or as part of a pdf file.

Don't forget to take the online quiz for this section! (see page 2 for
details of how to find the quiz questions)
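The nested-model F statistic used in the joint test above is computed from the residual sums of squares of the two models. A sketch follows; the RSS values here are hypothetical (the document reports only the resulting p-value, 0.040), and n = 5845 is an assumption taken from the Table 3.11 extract.

```python
# Nested-model F-test sketch: compare a restricted model (interactions fixed
# at zero) with the full interaction model. RSS values are hypothetical.
def f_statistic(rss_restricted, rss_full, n_restrictions, df_full):
    """F = ((RSS0 - RSS1) / q) / (RSS1 / df1), with q restrictions."""
    return ((rss_restricted - rss_full) / n_restrictions) / (rss_full / df_full)

n = 5845                  # assumed sample size
rss_main = 4890.0         # hypothetical RSS, main-effects model
rss_inter = 4884.6        # hypothetical RSS, interaction model
q = 2                     # restrictions: beta4 = beta5 = 0
df_full = n - 6           # 6 estimated coefficients in the interaction model

F = f_statistic(rss_main, rss_inter, q, df_full)
print(round(F, 2))        # compare to the F(2, n-6) critical value
```

The p-value then comes from the F(q, df_full) distribution; with q = 2 and a large sample, values of F above about 3.0 are significant at the 5% level.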
C3.5.1
C3.5.2
For simple regression, we check that the variance of the residuals is fairly constant
across the range of X in a plot of the standardised residuals against the explanatory
variable X. In multiple regression, it is useful to start with a plot of the
standardised residuals ri against the predicted values ŷi because, for any
individual, the predicted value of y is a linear function of their values on all X
variables in the model. This should be followed by an examination of plots of the
standardised residuals against each explanatory variable X in turn. For each plot
we are looking for indications of funnelling, where the vertical scatter of the
residuals differs across values of xi or ŷi, in which case the assumption of
homoskedasticity is not met.
A common reason for funnelling (or heteroskedasticity) is the existence of groups
in the data among which the relationship between Y and one or more X differs, i.e.
unmodelled interaction effects. To illustrate the idea of funnelling, suppose that
the relationship between Y and a continuous variable X1 is different for two
subgroups defined by a binary variable X2: the relationship between Y and X1 is
positive for both groups, but stronger for X2=0 than for X2=1. The predicted
regression lines from a multiple regression of Y on X1, X2 and their interaction X1*X2
are shown in Figure 3.14.
Figure 3.14. Prediction lines from a multiple regression with an interaction effect

Now suppose we mistakenly fit a simple regression of Y on X1, so we ignore the fact
that there are two groups with different relationships between Y and X1. Figure
3.15 shows the residual plot for this misspecified model. (The data points for the
groups defined by X2 are distinguished, but remember that X2 is not included in the
model.) The plot shows evidence of heteroskedasticity because the vertical spread
of the residuals gets smaller as X1 increases; this is an example of what we mean
by funnelling. Why has this happened? Instead of fitting two regression lines
with different intercepts and slopes for each group, we have fitted a single
average line which would lie somewhere in between the lines in Figure 3.14. At
small values of X1, where we have the largest difference in the predicted value of
Y for the two groups, the residuals about this line are large and positive for X2=1
and large and negative for X2=0. The difference between groups becomes smaller
as X1 increases, so the average line will lie close to the individual group lines and
the residuals are smaller.
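The funnelling mechanism just described can be reproduced in a small simulation (a sketch with made-up intercepts and slopes, not the actual data behind Figures 3.14 and 3.15): Y depends on X1 differently in the two groups defined by X2, a single line is fitted ignoring X2, and the residual spread shrinks as the group lines converge.

```python
import random

# Simulate two groups whose lines diverge at low x1 and converge at high x1
random.seed(1)
data = []
for _ in range(400):
    x1 = random.uniform(-5, 5)
    x2 = random.randint(0, 1)
    slope = 1.0 if x2 == 0 else 0.4      # stronger relationship when x2 = 0
    intercept = -2.0 if x2 == 0 else 2.0
    y = intercept + slope * x1 + random.gauss(0, 0.3)
    data.append((x1, y))

# Fit the misspecified simple regression of y on x1 (x2 omitted)
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
b = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
a = my - b * mx
resid = [y - (a + b * x) for x, y in data]

# Funnelling: residual spread at low x1 exceeds the spread at high x1
low = [r for (x, _), r in zip(data, resid) if x < -2.5]
high = [r for (x, _), r in zip(data, resid) if x > 2.5]

def spread(rs):
    return (sum(r * r for r in rs) / len(rs)) ** 0.5

print(spread(low) > spread(high))  # wider scatter where the group lines diverge
```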
Returning to the hedonism data, Figure 3.16 shows a plot of the standardised
residuals ri versus the standardised predicted values ŷi from the model with age,
education, gender and country included as explanatory variables. The vertical
spread of the points appears fairly equal across different values of standardised
ŷi, so we conclude that the assumption of homoskedasticity is reasonable.
C3.5.3

Outliers

We can also check for outliers using any of the residual plots. An outlier is a point
with a particularly large residual. We would expect approximately 95% of the
standardised residuals to lie between -2 and +2.

Figure 3.15. Plot of ri versus X1 from fitting a misspecified regression without X2 or its
interaction with X1

Of major interest, however, is whether an outlier has undue influence on our
results. An influence statistic called Cook's D (where D is for distance) measures
how different our estimated regression coefficients would have been if a sample
observation were omitted. Cook's D is calculated for every observation. The
higher the value of D, the more likely it is that an observation exerts influence on
the estimates of the coefficients. However, D does not have a fixed range and so
we focus on those values of D which are considerably greater than, say, the 90th
percentile.
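For simple regression, Cook's D can be computed directly from each observation's residual and leverage (a sketch with made-up data; statistical software computes the same quantity for the full multiple regression):

```python
import statistics

# Made-up data: the last point combines a large residual with high leverage
x = [25, 30, 59, 47, 65, 18, 72, 41, 33, 90]
y = [1.2, 0.8, -0.4, 0.1, -0.9, 1.5, -1.1, 0.3, 0.6, 3.0]

n, p = len(x), 2                      # p = number of estimated coefficients
mx, my = statistics.mean(x), statistics.mean(y)
sxx = sum((a - mx) ** 2 for a in x)
b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
a0 = my - b * mx

resid = [c - (a0 + b * a) for a, c in zip(x, y)]
s2 = sum(e * e for e in resid) / (n - p)          # residual variance

def cooks_d(i):
    """Cook's D for observation i: residual scaled by its leverage."""
    h = 1 / n + (x[i] - mx) ** 2 / sxx            # leverage of observation i
    return resid[i] ** 2 * h / (p * s2 * (1 - h) ** 2)

d = [cooks_d(i) for i in range(n)]
print(max(range(n), key=lambda i: d[i]))  # index of the most influential point
```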
For a regression of hedonism on age, education, sex and country, we find that the
90th percentile of the distribution of Cook's D is 0.000046. A boxplot of Cook's D
is given in Figure 3.17. Two observations have relatively large values of D: case
numbers 3225 and 2948. However, removing these observations from the analysis
has negligible impact on our results (see Table 3.14).

Table 3.14. Regression of hedonism on age, education, sex and country: full sample and
with cases 3225 and 2948 removed

                       Full sample           Cases 3225 and 2948 removed
                       Coeff.   Z-ratio      Coeff.   Z-ratio
Constant                0.790                 0.789
Age (years)            -0.019   -27.611      -0.019   -27.537
Education (years)      -0.015    -4.281      -0.015    -4.350
Female                 -0.160    -6.752      -0.160    -6.774
Country
  Germany               0.222     8.068       0.222     8.090
  France                0.436    13.145       0.441    13.322
Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)

Please read P3.5, which is available in online form or as part of a pdf file.

Don't forget to take the online quizzes for this module if you
haven't already done so! (see page 2 for details of how to find the
quizzes)
Centre for Multilevel Modelling, 2008