Fiona Steele1
Centre for Multilevel Modelling
Contents
Introduction ............................................................................... 3
C3.1 (C3.1.1 – C3.1.6)
C3.2 (C3.2.1 – C3.2.3)
C3.3 Regression with More than One Explanatory Variable (Multiple Regression) .... 26
     C3.3.1 Statistical control .......................................................... 26
     C3.3.2 The multiple regression model ................................................ 27
     C3.3.3 Using multiple regression to model a non-linear relationship ................ 31
     C3.3.4 Adding further predictors .................................................... 33
C3.4 (C3.4.1 – C3.4.5) ............................................................. 36–41
C3.5 (C3.5.1 – C3.5.3)
1 With additional material from Kelvyn Jones. Comments from Sacha Brostoff, Jon Rasbash and
Rebecca Pillinger on an earlier draft are gratefully acknowledged.
All of the sections within this module have online quizzes for you to
test your understanding.

All of the sections within this module have practicals so you can
learn how to perform this kind of analysis in MLwiN or other
software packages.

Introduction

Pre-requisites

Conditioning

Online resources:
http://www.sportsci.org/resource/stats/
http://www.socialresearchmethods.net/
http://www.animatedsoftware.com/statglos/statglos.htm
http://davidmlane.com/hyperstat/index.html

Motivation
- A quantitative assessment of the size of the effect; e.g. the difference in salary
  between women and men is 5000 per annum;
- A quantitative assessment after taking account of other variables; e.g. a female
  worker earns 6500 less after taking account of years of experience. This
  conditioning on other variables distinguishes multiple regression modelling from
  simple "testing for differences" analyses;
- A measure of uncertainty for the size of the effect; e.g. we can be 95%
  confident that the female–male difference in salary in the population from
  which our sample was drawn is likely to lie between 4500 and 5500.
The key feature that distinguishes multiple regression from simple regression is
that more than one predictor variable is involved. Even if we are interested in the
effect of just one variable (gender) on another (salary), we need to take account of
other variables as they may compromise the results. We can recognise three
distinct cases where it is important to control or adjust for the effects of other
variables:

i)

ii)

iii)
Introduction to Dataset

The ideas of multiple regression will be introduced using data from the 2002
European Social Survey (ESS). Measures of ten human values have been
constructed for 20 countries in the European Union. According to value theory,
values are defined as desirable, trans-situational goals that serve as guiding
principles in people's lives. Further details on value theory and how it is
operationalised in the ESS can be found on the ESS education net
(http://essedunet.nsd.uib.no/cms/topics/1/). The variables we will consider are:

- Age in years
- Gender (coded 0 for male and 1 for female)
- Country (coded 1 for the UK, 2 for Germany and 3 for France)
- Years of education.
        Hedonism    Age    Gender    Country    Education
1         1.55       25      0          2          10
2         0.76       30      0          2          11
3        -0.26       59      0          2           9
4        -1.00       47      1          3          10
.          .          .      .          .           .
.          .          .      .          .           .
5845      0.74       65      0          1           9
We will study one of the ten values, hedonism, defined as "the pleasure and
sensuous gratification for oneself". The measure we use is based on responses to
the question "How much like you is this person?":
Some individuals will tend to select responses from one side of the scale ("very much like me")
for any item, while others will select from the other side ("not like me at all"). If we ignore these
differences in response tendency we might incorrectly infer that the first type of individual
believes that all values are important, while the second believes that all values are unimportant.
We will first consider age as an explanatory variable for hedonism. The age range
in our sample is 14 to 98 years with a mean of 46.7 and standard deviation of 18.1.
C3.1.1 Relationship between X and Y
We will begin with a description of simple linear regression for studying the
relationship between a pair of continuous variables, which we denote by Y and X.
Simple regression is also commonly known as bivariate regression because only two
variables are involved.
Y is the outcome variable (also called a response or dependent variable)
X is the explanatory variable (also called a predictor or independent variable).
In its simplest form, a regression analysis assumes that the relationship between X
and Y is linear, i.e. that it can be reasonably approximated by a straight line. If
the relationship is nonlinear, it may be possible to transform one of the variables
to make the relationship linear or the regression model can be modified (see
C3.3.3). The relationship between two variables can be viewed in a scatterplot. A
scatterplot can also reveal outliers.
Before carrying out a regression analysis, it is important to look at your data first.
There are various assumptions made when we fit a regression model, which we will
consider later, but there are two checks that should always be carried out before
fitting any models: i) examine the distribution of the variables and check that the
values are all valid, and ii) look at the nature of the relationship between X and Y.
Distribution of Y
We can examine the distribution of a continuous variable using a histogram. At
this stage, we are checking that the values appear reasonable. Are there any
outliers, i.e. observations outside the general pattern? Are there any values of 99 in the data that should be declared as missing values? We also look at the
shape of the distribution: is it a symmetrical bell-shaped distribution (normal), or
is it skewed? Although it is the residuals3 that are assumed to be normally
distributed in a multiple regression model, rather than the dependent variable, a
skewed Y will often produce skewed residuals. If the residuals turn out to be non-normal, it may be possible to transform Y to obtain a normally distributed
variable. For example, a positively skewed distribution (with a long tail to the
right) will often look more symmetrical after taking logarithms.
Figure 3.1 shows the distribution of the hedonism scores. It appears approximately
normal with no obvious outliers. The mean of the hedonism score is -0.15 and the
standard deviation is 0.97.
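The remark about log transforms can be checked numerically. Below is a minimal sketch in Python with simulated data (not the module's ESS sample, which is analysed in MLwiN); `skewness` is a hypothetical helper implementing the usual moment-based formula:

```python
import numpy as np

rng = np.random.default_rng(0)

# A positively skewed variable (long right tail), e.g. income-like data.
y = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

def skewness(v):
    """Moment-based sample skewness: mean of cubed standardised deviations."""
    z = (v - v.mean()) / v.std()
    return (z ** 3).mean()

# Taking logarithms often makes a right-skewed distribution more symmetrical.
log_y = np.log(y)

print(skewness(y), skewness(log_y))
```

With this seed the raw variable is strongly right-skewed, while its logarithm is close to symmetric.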
Distribution of X
For a regression analysis the distribution of the explanatory variable is
unimportant, but it is sensible to look at descriptive statistics for any variables
that we analyse to check for unusual values.
Figure 3.2 shows a scatterplot of hedonism versus age, where the size of the
plotting symbol is proportional to the number of respondents represented by a
particular data point. Also shown is what is commonly called the line of best fit,
which we will come back to in a moment. The scatterplot shows a negative
relationship: as age increases then hedonism decreases. The Pearson correlation
coefficient for the linear relationship is -0.34.
3 The residual for each observation is the difference between the observed value of Y and the value
of Y predicted by the model. See C3.1.2 for further details.
C3.1.2

A straight-line relationship between Y and X can be written as

y = c + mx                                                          (3.1)

where m is the gradient or slope of the line, and c is the intercept, the point at
which the line cuts the Y-axis (i.e. the value of y when x = 0). The gradient is
interpreted as the change in y expected for a 1-unit change in x. In statistics, we
often refer to m and c as coefficients. A coefficient of a variable is a quantity that
multiplies it. The slope m is the coefficient of the predictor x, and the intercept c
is the coefficient of a variable which equals 1 for each observation (usually
referred to as the constant).

Because we will soon be adding more explanatory variables (Xs), it is convenient to
use a more general notation with coefficients represented by Greek betas (β).
Thus (3.1) becomes

y = β0 + β1x                                                        (3.2)

so that the intercept is now denoted by β0 and the slope by β1. The subscripts on
the βs indicate the variable to which each coefficient is attached. We could have
written (3.2) as y = β0x0 + β1x1, where x0 = 1 for every observation and x1 = x.
Later we will be adding further explanatory variables (x2, x3, etc.) with
coefficients β2, β3, etc.

For a given individual i (i = 1, 2, 3, ..., n), we denote their value on Y by yi and
their value on X by xi. (Note that when we consider more than one explanatory
variable, we will introduce a second subscript to index the variable. For example,
x2i will denote the value on variable x2 for individual i.)

For individual i, the linear relationship between Y and X may be expressed as:

yi = β0 + β1xi + ei                                                 (3.3)

ei is called the residual and is the difference between the ith individual's actual
y-value and that predicted by their x-value. We know that we cannot perfectly
predict an individual's value on Y from their value on X; the points in a scatterplot
of x and y will never lie perfectly on a straight line (see Figure 3.2, for example).
The residuals represent the (vertical) scatter of points about the regression line.

The residuals are assumed to satisfy the following:

i)   The residuals are normally distributed with zero mean and variance σ²
     (spoken as "sigma-squared"). This assumption is often written in shorthand
     as ei ~ N(0, σ²).

ii)  The variance of the residuals is constant, whatever the value of x, i.e. the
     residuals are homoskedastic.

iii) The residuals are not correlated with one another, i.e. they are
     independent. Correlations might arise if some individuals contribute more
     than one observation (e.g. repeated measures) or if individuals are
     clustered in some way (e.g. in schools). If it is suspected that residuals are
     correlated, the regression model needs to be modified, e.g. to a multilevel
     model (see Module 5).

If these assumptions are not met, the estimates of β0 and, more importantly, β1
may be biased and imprecise.

C3.1.3

In linear regression analysis, β0 and β1 are estimated from the data using a
method called least squares, in which the sum of the squared residuals is
minimized5. (Responses with other scales of measurement require other
techniques, but all of them are based on the same underlying principle of
minimizing the poorness of fit between the actual data points and the fitted
model.)

By applying the method of least squares to our sample data, we obtain an estimate
of the underlying population value of the intercept and of the slope. These
estimates are denoted by β̂0 and β̂1 (spoken as "beta-0-hat" and "beta-1-hat").
The predicted value of y for individual i is denoted by ŷi and is calculated as:

ŷi = β̂0 + β̂1xi                                                     (3.4)

The equation (3.4) is the equation of the estimated or fitted regression line. The
predicted value ŷi is the point on the fitted line corresponding to xi.

If we regress hedonism on age we obtain β̂0 = 0.712 and β̂1 = −0.018, and the fitted
regression line is written (substituting HED for y and AGE for x) as:

HEDi = 0.712 − 0.018 AGEi

The slope estimate tells us that for every extra year of age, hedonism is predicted
to decrease by 0.018. Importantly, the decrease in hedonism expected for an
increase from 14 to 15 years old is the same as for an increase from 54 to 55 years
old. This is a direct consequence of assuming that the underlying functional form
of the model is linear and fitting a linear equation.

Most statistical packages will report the results of a regression analysis in tabular
form, e.g. as in Table 3.1.

Table 3.1. Regression of hedonism on age

           Coefficient
Constant      0.712
Age          -0.018

We can use the fitted line to predict an individual's hedonism based on their age.
So, for example, for an individual of age 25 we would predict a hedonism score of
0.712 − (0.018 × 25) = 0.262. In contrast, we would predict a score of −0.188 for
someone of age 50. The regression line is the "line of best fit" shown in Figure 3.2.

Continuous variables are often centred about the mean so that the intercept has a
more meaningful interpretation. For example, we would centre the variable AGE
by subtracting the sample mean of 46 years from each of its values. If we then
repeat the regression analysis replacing AGE by AGE−46, the intercept becomes the
predicted value of Y when AGE−46 = 0, i.e. when AGE = 46. Rather than a prediction
for a baby of 0 years, which is well outside the age range in the sample, the
intercept now gives a prediction for a 46-year-old adult.

The intercept in the analysis based on centred AGE is estimated as −0.139, which is
the predicted hedonism score for a 46-year-old. Centring does not affect the
estimate of the slope because only the origin of X has been shifted; its scale
(standard deviation) has not changed.
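The estimation, prediction, and centring steps can be sketched in Python with simulated data (illustrative values only; the module's analyses use MLwiN):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated age/hedonism-style data; coefficients are illustrative,
# not the module's ESS estimates.
age = rng.uniform(14, 98, size=2000)
hed = 0.712 - 0.018 * age + rng.normal(0, 0.9, size=2000)

# Least squares estimates of slope (beta1) and intercept (beta0);
# np.polyfit returns the highest-degree coefficient first.
beta1, beta0 = np.polyfit(age, hed, deg=1)

# Predicted value for a 25-year-old: beta0 + beta1 * 25.
pred_25 = beta0 + beta1 * 25

# Centring age shifts the intercept (now the prediction at the mean age)
# but leaves the slope unchanged.
slope_c, intercept_c = np.polyfit(age - age.mean(), hed, deg=1)

print(beta0, beta1, pred_25, slope_c, intercept_c)
```

Note that the centred intercept equals the prediction at the sample mean age, exactly as described above.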
Standardisation and standardised coefficients

Sometimes X is standardised, which involves subtracting the sample mean and then
dividing the result by the standard deviation:

(X − mean of X) / SD of X

C3.1.4
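In simple regression, the slope obtained after standardising both X and Y equals the Pearson correlation coefficient (this identity is restated in C3.3.4's recap of standardised coefficients). A small numerical check with simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)

x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)

def standardise(v):
    """Subtract the sample mean and divide by the standard deviation."""
    return (v - v.mean()) / v.std()

# Slope from regressing standardised Y on standardised X...
slope_std = np.polyfit(standardise(x), standardise(y), 1)[0]

# ...equals the Pearson correlation coefficient in simple regression.
r = np.corrcoef(x, y)[0, 1]

print(slope_std, r)
```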
C3.1.5 Hypothesis testing

We must bear in mind that the estimates of the intercept and slope are subject to
sampling variability, as is any statistic calculated from a sample. While we have
established that there is a negative relationship between hedonism and age in our
sample, we are really interested in their relationship in the population from which
our sample was drawn (the combined populations of France, Germany and the UK).
In other words, is the relationship statistically significant, or could we have got
such a result by chance?

The null hypothesis (H0) for our test is that there is no relationship between
hedonism and age in the population, in which case β1 = 0.

Alternatively, but equivalently, we can calculate the test statistic (often called
the Z or t-ratio)

Z = β̂1 / SE(β̂1)

For the regression of hedonism on age, |Z| = 27.9 (using β̂1 = −0.018 and its
standard error), and the p-value is less than 0.001.

6 −1.96 and +1.96 are the 2.5% and 97.5% points of a standard normal distribution (one with a mean
of zero and a standard deviation of one). The middle 95% of the distribution lies between these
points.

C3.1.6 Model checking

Recall the assumptions made about the residuals ei:

i)   The residuals are normally distributed.

ii)  The variance of the residuals is constant, whatever the value of x, i.e. the
     residuals are homoskedastic.

iii) The residuals are not correlated with one another, i.e. they are
     independent.

We can check the validity of assumptions i) and ii) by examining plots of the
estimated residuals. If it is suspected that residuals might be correlated because
the data are clustered in some way, we can test assumption iii) by comparing a
multilevel model, which accounts for clustering, with a multiple regression model
which ignores clustering (see Module 5).

To check assumptions about ei, we use the estimated residuals, which are the
differences between the observed and predicted values of y:

êi = yi − ŷi

We usually work with the standardized residuals ri, which we obtain by dividing êi
by their standard deviation.
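A minimal sketch of the Z-ratio and standardized-residual calculations on simulated data, using the classical formula SE(β̂1) = √(σ̂² / Σ(xi − x̄)²) (the formula is standard but not quoted in the text above):

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.uniform(14, 98, size=5000)
y = 0.7 - 0.018 * x + rng.normal(0, 0.97, size=5000)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Classical standard error of the slope: sqrt(sigma^2 / sum((x - xbar)^2)),
# with sigma^2 estimated from the residual sum of squares on n - 2 d.f.
n = len(x)
sigma2 = (resid ** 2).sum() / (n - 2)
se_b1 = np.sqrt(sigma2 / ((x - x.mean()) ** 2).sum())

# Z (t) ratio for H0: beta1 = 0.
z = b1 / se_b1

# Standardised residuals: roughly 95% should fall between -1.96 and +1.96.
r_std = resid / resid.std()
frac_within = np.mean(np.abs(r_std) < 1.96)

print(z, frac_within)
```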
Figure 3.3 and Figure 3.4 show a histogram and normal probability plot of residuals
from a simple regression model with age. Both plots suggest that the normal
distribution assumption is reasonable here.

[Figure 3.3: histogram of standardized residuals (frequency versus standardized residual)]

[Figure 3.4: normal probability plot of standardized residuals]

Figure 3.5 shows a plot of ri versus xi. The vertical spread of the points appears
fairly equal across different values of X, so we conclude that the assumption of
homoskedasticity is reasonable.

[Figure 3.5: standardized residuals versus age]
Outliers

We can also check for outliers using any of the above residual plots. An outlier is a
point with a particularly large residual. We would expect approximately 95% of
the residuals to lie between −2 and +2.

Of major interest, however, is whether an outlier has undue influence on our
results. For example, in simple regression, an outlier with very large values on X
and Y could push up a positive slope. A straightforward way to judge the
influence of an outlier is to refit the regression line after excluding it. If the
results are very similar to those based on all observations, we would conclude that
the outlier does not have undue influence. An observation's influence can also be
measured by a statistic called Cook's D (see C3.5.3).
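The refit-without-the-outlier check can be illustrated as follows (simulated data; the outlier's coordinates are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

x = rng.normal(50, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 2, size=200)

slope_clean = np.polyfit(x, y, 1)[0]

# Add a single extreme point with very large values on both X and Y.
x_out = np.append(x, 150.0)
y_out = np.append(y, 300.0)
slope_with = np.polyfit(x_out, y_out, 1)[0]

# Refit after excluding the outlier: a simple check of its influence.
slope_refit = np.polyfit(x_out[:-1], y_out[:-1], 1)[0]

print(slope_clean, slope_with, slope_refit)
```

The single high-leverage point pushes the slope up noticeably; excluding it recovers the original fit.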
Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)
C3.2.1

Please read P3.1, which is available in online form or as part of a pdf file.

               Women    Men
Sample size     2747    3098
C3.2.2

yi = β0 + β1xi + ei

where yi is the hedonism score of individual i, and xi = 1 if the individual is a
woman, and 0 if the respondent is male7.

The regression output is given in Table 3.3, from which we obtain the fitted
regression equation:

HEDi = −0.069 − 0.156 SEXi

Table 3.3. Regression of hedonism on sex

           Coefficient   Standard error
Constant     -0.069          0.019
Sex          -0.156          0.025

We can use this equation to predict HED for men and women:

For men (SEX = 0), HED = −0.069 − (0.156 × 0) = −0.069
For women (SEX = 1), HED = −0.069 − (0.156 × 1) = −0.225

Notice that these predicted values are just the mean hedonism scores for men and
women, and that the coefficient of SEX is the difference between these means
(women's mean − men's mean, since SEX is coded 1 for women here).

The null hypothesis that there is no difference between the mean score for men
and women in the population can be expressed as H0: β1 = 0. The standard error
of β̂1 is 0.025 and the Z-ratio is therefore −0.156 / 0.025 = −6.12. The 95%
confidence interval for β1 is (−0.206, −0.106).

Note that these results are exactly the same as those for the independent samples
comparison of means test given earlier. So if SEX is the only explanatory variable,
a regression analysis gives exactly the same results as a t-test. But only in a
regression analysis can we include other explanatory variables.

           Sample size   Mean hedonism score
UK            1748           -0.384
Germany       2785           -0.128
France        1312            0.108
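The identity between the dummy-variable regression and the group means can be verified on simulated data (the coefficients below are used only to generate the illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# sex = 1 for women, 0 for men (a dummy variable), illustrative data only.
sex = rng.integers(0, 2, size=1000).astype(float)
hed = -0.069 - 0.156 * sex + rng.normal(0, 0.97, size=1000)

b1, b0 = np.polyfit(sex, hed, 1)

mean_men = hed[sex == 0].mean()
mean_women = hed[sex == 1].mean()

# Intercept = men's mean; slope = women's mean minus men's mean.
print(b0, b1, mean_men, mean_women - mean_men)
```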
                    Sum of squares    d.f.   Mean square   F statistic   p-value
Between countries        184.5           2       92.3          100.4     <0.001
Within countries        5370.7        5842        0.9
Total                   5555.2        5844

The tiny p-value suggests that we can reject the null hypothesis and conclude that
there are significant between-country differences in hedonism.
The statistical model behind ANOVA is in fact a multiple regression model. But
rather than including country as an explanatory variable10, we create dummy
variables for two of the three countries and include these. Suppose we create
three variables which indicate whether a respondent is from a particular country,
i.e.
UK     = 1 if respondent is from the UK, = 0 if from Germany or France
GERM   = 1 if respondent is from Germany, = 0 if from UK or France
FRANCE = 1 if respondent is from France, = 0 if from UK or Germany
These variables are called dummy variables11. In fact we do not need all three of
these variables because if we know a respondent's value on two of them, we can
infer their value on the third. E.g. if we know that UK=0 and GERM=1, then we
know that FRANCE=0. (A respondent can only be living in one country at the time
of the survey, so only one of UK, GERM and FRANCE can equal 1 for any given
individual.) By the same argument, when we have a categorical variable with only
two categories (e.g. our SEX variable in C3.2.1) we do not need to create any
additional variables. SEX is already a dummy variable, and can therefore be
included directly in the model as an explanatory variable.
To allow for differences between the UK, Germany and France, we choose
(arbitrarily) two of the country dummy variables and include those as explanatory
variables. Suppose we choose GERM and FRANCE; then the multiple regression
model is:

yi = β0 + β1GERMi + β2FRANCEi + ei

Table 3.6. Regression of hedonism on country (UK as reference category)

            Coefficient   Standard error   Z-ratio   p-value
Constant      -0.384          0.023
Country
  Germany      0.256          0.029          8.765    <0.001
  France       0.492          0.035         14.052    <0.001

β̂1 = 0.256 is the difference between the means for Germany and the UK
β̂2 = 0.492 is the difference between the means for France and the UK

The UK is the reference category. (If we had included the UK and FRANCE dummy
variables in the model, then Germany would have been the reference.)

10 We could not simply include COUNTRY itself and fit yi = β0 + β1COUNTRYi + ei, because
the coding of COUNTRY is arbitrary (i.e. COUNTRY is a nominal variable). In such a model, β1
would be interpreted as the effect on HED of a 1-unit change in COUNTRY, but a 1-unit change in
COUNTRY has no meaning!

11 This is the most common way of coding dummy variables for a categorical variable and is often
called simple coding, but other types of coding are possible depending on which comparisons are of
interest. A comprehensive discussion of alternative coding systems can be found at
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter5/statareg5.htm
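A sketch of the dummy-coding construction with the UK as reference, on simulated data that reuses the country means quoted above (the fitted coefficients recover the sample mean differences exactly):

```python
import numpy as np

rng = np.random.default_rng(6)

# country: 1 = UK, 2 = Germany, 3 = France (coding as in the dataset).
country = rng.integers(1, 4, size=1500)
means = {1: -0.384, 2: -0.128, 3: 0.108}  # module's country means, reused for simulation
y = np.array([means[int(c)] for c in country]) + rng.normal(0, 0.95, size=1500)

# Dummy variables with the UK as the reference category.
germ = (country == 2).astype(float)
france = (country == 3).astype(float)

# Least squares via a design matrix: constant, GERM, FRANCE.
X = np.column_stack([np.ones_like(y), germ, france])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# b[0] = UK sample mean; b[1] = Germany - UK; b[2] = France - UK.
print(b)
```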
C3.2.3

The null hypothesis for testing whether there is a difference between the mean
hedonism scores in the populations of Germany and the UK can be expressed as
H0: β1 = 0. Similarly, the null for testing whether there is a difference between
France and the UK is H0: β2 = 0. The simplest way to compare Germany and France
would be to refit the model, making one of these countries the reference category.
Table 3.7 shows the results when Germany is taken as the reference, i.e. when the
UK and FRANCE dummies are included in the model. The difference between the
means for France and Germany is now obtained directly (from the coefficient of
the FRANCE dummy) as 0.236.

All coefficients in Table 3.6 and Table 3.7 are significantly different from zero (all
p-values are <0.001), so we conclude that all pairwise differences between
countries are significant.

Table 3.7. Regression of hedonism on country (Germany as reference category)

            Coefficient   Standard error   Z-ratio   p-value
Constant      -0.128          0.018
Country
  UK          -0.256          0.029         -8.765    <0.001
  France       0.236          0.032          7.341    <0.001
The above tests are for comparing pairs of countries. In ANOVA, the null
hypothesis is that the means for all three countries are equal. In most statistical
packages, an ANOVA table is given as part of the standard regression analysis
output. The only difference between the regression ANOVA table and the one-way
ANOVA table is that the between-country sum of squares would usually be called
the "regression" sum of squares, and "within-country" would be replaced by
"residual". All numerical results would be exactly the same. In regression terms,
the null hypothesis being tested is that all coefficients are zero, which in this case
is that β1 = β2 = 0. This is just another way of saying that all three country means
are equal.
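The arithmetic in the ANOVA table above can be reproduced directly from its sums of squares and degrees of freedom:

```python
# Sums of squares and degrees of freedom from the one-way ANOVA table.
ss_between, df_between = 184.5, 2
ss_within, df_within = 5370.7, 5842

ms_between = ss_between / df_between   # mean square between countries
ms_within = ss_within / df_within      # mean square within countries

f_stat = ms_between / ms_within        # the reported F statistic (~100.4)
print(ms_between, ms_within, f_stat)
```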
Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)

Please read P3.2, which is available in online form or as part of a pdf file.

Don't forget to take the online quiz! (see page 2 for details of how
to find the quiz questions)

The advantage of multiple regression over one-way ANOVA is that regression can
allow for the effects of several explanatory variables simultaneously12.

12 A multiple regression analysis with two categorical explanatory variables is sometimes called a
two-way ANOVA, while a regression with a mixture of categorical and continuous variables is called
an analysis of covariance (ANCOVA).
C3.3.1 Statistical control

So far we have used simple regression to assess the linear relationship between
two variables. In reality there will be a number of factors that are potential
predictors of the outcome variable. The advantage of using a regression
framework is that we can straightforwardly account for the effects of multiple
variables simultaneously.

Examples

i) Suppose we compare two secondary schools on their age 16 exam performance,
e.g. we might compare the percentage of students who achieve a pass in five or
more subjects. Suppose we find that school 1 has a higher percentage with 5+
passes than school 2. Would we conclude that school 1's performance was better
than school 2's? What other factors would we like to take into account? An obvious
candidate would be a measure of students' achievement when they entered
secondary school, so that school effects are "value-added".

ii) Comparisons of men's and women's salaries often reveal that women earn less.
Explanations that are commonly put forward for this discrepancy are that women
tend to work in jobs that have been traditionally lower paid, or that women have
taken time out of paid employment to raise children. To determine whether there
are salary differences between men and women who have been working in the
same job for the same amount of time, we would wish to account for occupation
and number of years of full-time employment, as well as other factors such as
education level. Using multiple regression we can test whether these other factors
explain gender differences in salary, i.e. does any gender difference disappear
when we adjust for the effects of these other variables?

We can use multiple regression to take into account or adjust for other factors
that might predict the response variable. Sometimes the effects of these other
factors are of interest in themselves, e.g. predictors of age 16 attainment other
than the school attended. Other times the effects of other factors are not of
major interest, but it is important to adjust for their effects to obtain more
meaningful estimates of effects that we are interested in. Such factors are often
called controls.

C3.3.2 The multiple regression model

In simple regression we have a single predictor or explanatory variable (X), and the
linear regression model is

yi = β0 + β1xi + ei.

In multiple regression, we have more than one predictor. Suppose that we have
two predictors, denoted by X1 and X2, which may be continuous or categorical. We
have in fact already used a multiple regression model to analyse country
differences in hedonism (in C3.2.2). Although there was just one predictor,
country, it was represented by two dummy variables. More generally, we can
include several predictors and any of these may be represented by a set of dummy
variables.

The multiple (linear) regression model for two continuous (or dichotomous)
explanatory variables is written

yi = β0 + β1x1i + β2x2i + ei

where β0 is the value of y that would be expected when x1 = 0 and x2 = 0.
The coefficients β1 and β2 are interpreted as follows: β1 is the change in y
expected for a 1-unit change in x1, holding x2 constant, and β2 is the change in y
expected for a 1-unit change in x2, holding x1 constant.
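A sketch of fitting the two-predictor model by least squares on simulated data (the coefficients and the correlation between the predictors are chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(7)

# Correlated predictors, loosely like age and education (negative correlation).
x1 = rng.normal(0, 1, size=3000)
x2 = -0.25 * x1 + rng.normal(0, 1, size=3000)
y = 1.0 - 0.4 * x1 - 0.1 * x2 + rng.normal(0, 1, size=3000)

# Design matrix with a constant, then least squares via lstsq.
X = np.column_stack([np.ones_like(y), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# b[1] estimates the effect of x1 holding x2 constant, and vice versa.
print(b)
```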
Example

We will begin with the case where both X1 and X2 are continuous. Let's consider
the effects of age (X1) and education (X2) on hedonism. We will ignore gender and
country differences for now. We have already examined the bivariate relationship
between hedonism and age and found that older respondents tend to be less
hedonistic (in C3.1). This relationship may change when we account for education
if education is related to both hedonism and age. For example, we would expect
older respondents to have fewer years of education, and a higher level of education
might be associated with less hedonistic beliefs if the more career-minded choose
study over having a good time!

Figure 3.6 shows the relationship between hedonism and education (see Figure 3.2
for a plot of hedonism versus age). The relationship between the two explanatory
variables, age and education, is shown in Figure 3.7. The correlation between
hedonism and education is very weak; the Pearson coefficient is only 0.024. As
expected, there is a negative correlation between age and education (r = −0.242).
Because of the weak correlation between hedonism and education, however, we
would not expect the addition of education in a multiple regression to have much
impact on the coefficient of age.

In C3.1.3 the fitted equation from a simple regression of hedonism on age was
found to be:

HEDi = 0.712 − 0.018 AGEi

Notice that, as expected, there is little change in the coefficient of age when
education is added, but the relationship between hedonism and education is now
negative after accounting for age. Both relationships are significantly different
from zero at the 0.1% level. The relationship between hedonism and education
should be interpreted with some caution, however. We should hesitate to
conclude that education affects or causes hedonism. It is likely that hedonism and
education are both influenced by variables that we have not accounted for in this
model.
C3.3.3 Using multiple regression to model a non-linear relationship

Suppose a scatterplot of Y versus X resembles Figure 3.8. The relationship is
non-linear, so it would not be appropriate to fit the straight-line relationship
implied by a linear regression model. We should fit a curve through the points
rather than a line. The simplest curve is a quadratic function (or a second-order
polynomial):

yi = β0 + β1xi + β2xi² + ei

Note that the above is an example of a multiple regression model with x1 = x and
x2 = x². Also shown in Figure 3.8 is the fitted quadratic curve, which turns out to
have equation ŷi = 1.00 + 1.02xi − 0.47xi².
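The quadratic fit can be reproduced in the same multiple-regression framework; a sketch with data simulated from a curve close to the one quoted above:

```python
import numpy as np

rng = np.random.default_rng(8)

# Simulated curved relationship, roughly matching the fitted curve in the
# text: y = 1.00 + 1.02 x - 0.47 x^2 plus noise (illustrative only).
x = rng.uniform(-2, 2, size=2000)
y = 1.00 + 1.02 * x - 0.47 * x ** 2 + rng.normal(0, 0.5, size=2000)

# Fitting a quadratic is multiple regression with x1 = x and x2 = x^2.
X = np.column_stack([np.ones_like(x), x, x ** 2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b)  # close to the generating coefficients [1.00, 1.02, -0.47]
```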
Standardised coefficients
Standardisation and standardised coefficients were introduced in C3.1.3. To recap,
the standardised coefficient for a predictor X is the estimate of the slope that
would be obtained if X and Y were both standardised before the regression
analysis. In simple regression, the standardised coefficient of X is equal to the
Pearson correlation coefficient. In multiple regression, with two predictors X1 and
X2, the standardised coefficient of X1 is interpreted as the change in standardised Y
for a 1-unit change in standardised X1, holding X2 constant. (Recall that 1 unit of a
standardised variable corresponds to 1 standard deviation.) For example, in a
multiple regression model of hedonism on age and education, the standardised
coefficient for AGE is -0.358. Thus we can say that a 1 standard deviation change
in age predicts a 0.358 standard deviation decrease in hedonism. Note that if all
variables (Y and the Xs) had been standardised prior to the analysis, then the
unstandardised and standardised coefficients would be equal.
[Figure 3.9]
The results from fitting a quadratic curve to the relationship between hedonism
and age are given in Table 3.8. Note that this analysis is based on standardised
age and its square. This is because, for older respondents (remember the oldest is
98), age2 takes very large values; this may cause computational difficulties and the
coefficient of age2 would be very small.
Table 3.8. Regression with quadratic effects for age
Coeff.
S.E.
-0.222
-0.348
0.072
0.017
0.012
0.011
Constant
Standardised age
Standardized age-squared
Z-ratio
-28.669
6.288
C3.3.4
Suppose that we have p predictors, which we denote by X1, X2, X3, . . ., Xp. Then
the multiple regression model is
y i = 0 + 1 x 1i + 2 x 2i + 3 x 3i + ... + p x pi + ei
Variation explained: R2
p-value
<0.001
<0.001
The R2 for the regression model with age and education effects is 0.121, so 12.1%
of the variance in hedonism scores is due to variation in age and education. The
correlation between the predicted and observed hedonism scores is 0.348
= 0 . 121 . As suggested by the low bivariate correlation between hedonism and
education, education has little explanatory power; when education is removed the
model R2 decreases only slightly to 0.118.
A problem with R2 is that it always increases even if irrelevant variables are added
to the model. Therefore in multiple regression a measure called the adjusted R2 is
usually quoted. The adjusted R2 takes into account the number of variables in the
model. It is therefore a goodness-of-fit measure that is penalised by the
complexity of the model. With such a measure, the value will only increase if the
additional predictors are accounting for some of the variability in the response. In
this simple example with only two explanatory variables, age and education, the
adjusted R2 turns out to be the same as the unadjusted value.
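The adjustment described above follows the usual formula 1 − (1 − R²)(n − 1)/(n − p − 1). A quick sketch (assuming n = 5845, the sample size suggested by the data extract in Table 3.11) shows why the adjustment is negligible with only two predictors and a large sample:

```python
# Sketch of the adjusted R-squared formula, using the values reported in the
# text (R2 = 0.121, p = 2 predictors). n = 5845 is an assumption taken from
# the respondent numbering in Table 3.11.
def adjusted_r2(r2, n, p):
    """Penalise R2 for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2 = 0.121
n, p = 5845, 2
print(round(adjusted_r2(r2, n, p), 3))  # 0.121: unchanged at 3 d.p., as noted
```

With many predictors or a small sample, the penalty term (n − 1)/(n − p − 1) grows and the adjusted value can fall noticeably below the unadjusted R².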
Multicollinearity
Before carrying out a regression analysis, we should always look at the correlation
between each pair of predictor variables. If the correlation between a pair is very
high (>0.8 say), the estimates of the coefficients of those variables may be
unstable and imprecise (large standard errors). If the two variables are really
measuring the same thing, we should consider dropping one. Otherwise, we might
replace the two variables by a new variable which is a combination of the two.15

Figure 3.9. Plot of hedonism versus standardised age with fitted quadratic curve

15 Principal components analysis or factor analysis can be used to reduce a set of correlated
variables into a smaller set of uncorrelated variables.
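The pairwise-correlation check described above might be sketched as follows (made-up predictors; x2 is deliberately close to a rescaled copy of x1, while x3 is unrelated):

```python
import statistics

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical predictors: x2 is nearly a rescaled copy of x1
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2]
x3 = [5, 1, 4, 2, 8, 3, 7, 6]

# Flag pairs whose correlation exceeds the rule-of-thumb threshold of 0.8
for name, other in (("x2", x2), ("x3", x3)):
    r = corr(x1, other)
    flag = "collinear: consider dropping or combining" if abs(r) > 0.8 else "ok"
    print(name, round(r, 3), flag)
```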
Effect sizes

The size of the coefficient for predictor variable Xk will depend on the scales of Xk
and the response variable. For example, suppose we multiply each value of AGE
by 12 to give age in months rather than years and refit the multiple regression
model with age and education effects. We obtain the results shown in Table 3.9.

Table 3.9. Regression of hedonism on age and education for different age scales

              Age in years           Age in months
              Coeff.   Z-ratio       Coeff.   Z-ratio
Constant       0.971                  0.971
Age           -0.019   -28.206       -0.002   -28.206
Education     -0.017    -4.915       -0.017    -4.915

The coefficient of age in months (AGE*12) is -0.002, which is the coefficient of age
in years (AGE) divided by 12. This is because 1 unit on the scale of AGE*12 is equal
to 1/12 of a unit on the scale of AGE. Notice that the intercept does not change
because AGE=0 means the same whether the measurement is in months or years.
The coefficient of education is unaffected by transformations in age.

Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)

Please read P3.3, which is available in online form or as part of a pdf file.

Don't forget to take the online quiz for this section! (see page 2 for
details of how to find the quiz questions)
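The rescaling result in Table 3.9 can be verified numerically (a sketch with made-up data): multiplying a predictor by 12 divides its slope by 12 and leaves the intercept unchanged.

```python
import statistics

# Made-up data to illustrate the rescaling result (years -> months).
age_years = [25, 30, 59, 47, 65, 18, 72, 41]
hed = [1.2, 0.8, -0.4, 0.1, -0.9, 1.5, -1.1, 0.3]

def simple_ols(x, y):
    """Return (intercept, slope) from a simple least-squares fit."""
    mx, my = statistics.mean(x), statistics.mean(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

a1, b1 = simple_ols(age_years, hed)
age_months = [a * 12 for a in age_years]   # rescale the predictor
a2, b2 = simple_ols(age_months, hed)

print(round(a1, 9) == round(a2, 9))        # intercepts match
print(round(b1 / 12, 9) == round(b2, 9))   # slope divided by 12
```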
C3.4.1
Suppose we fit a multiple regression model with age and gender effects:

HEDi = β0 + β1AGEi + β2SEXi + ei        (3.7)
Figure 3.10. Regression lines for men and women, fixed slopes
Note: The age range in the sample is 14 to 98 years. The software used to draw the plot
has extrapolated the regression lines beyond the observed range, which is not generally
recommended.
So the lines for men and women have different intercepts, but the same slope, i.e.
the regression lines are parallel (see Figure 3.10). There are two equivalent ways
of interpreting Figure 3.10. We can say that the effect of age on hedonism is the
same for men and women. Alternatively, we can say that the gender difference in
hedonism is the same at all ages.

C3.4.2

Is it reasonable to assume that the gender difference in hedonism is the same for
all ages? One way of allowing men and women to have different slopes for the
relationship between hedonism and age is to fit a separate regression line for each
sex. We do this by splitting the sample by gender,16 and fitting a simple regression
of HED on AGE for each sex. If we do this, we obtain the results shown in Table
3.10.

16 This is often done using a select if command or menu option, or by requesting an analysis that is
stratified by gender.
Table 3.10. Regression of hedonism on age with separate models fitted for men and women

                Coeff.    S.E.    Z-ratio
Men
  Constant       0.839    0.047
  Age (years)   -0.019    0.001   -20.854
Women
  Constant       0.597    0.047
  Age (years)   -0.018    0.001   -18.910

For men the slope of age is -0.019, compared to -0.018 for women. So the slope is
slightly steeper for men. Because women have a lower intercept than men, a
steeper slope for men implies that the gender difference is greater among younger
respondents (see Figure 3.11 later).

While splitting the sample into groups is a simple way of allowing for different
slopes for each group, there are several problems with this approach:

i) There may be more than one categorical predictor, and therefore more than
one way of grouping the data.
ii) The effects of the other predictors may vary across each grouping, e.g.
hedonism may vary by sex and by country.
iii) Splitting the data into groups defined by sex and country will lead to a large
number of groups; in this dataset, the sample sizes in each group remain
large, but this will often not be the case.
iv) …

C3.4.3

Rather than fitting a separate model for each sex, we will fit a single model to the
whole pooled sample. We create a new variable which is the product of AGE and
SEX:

AGE_SEX = AGE × SEX

The new variable AGE_SEX is added as another predictor variable to model (3.7) to
give:

HEDi = β0 + β1AGEi + β2SEXi + β3AGE_SEXi + ei        (3.8)

Table 3.11 gives an extract of the analysis data file to which (3.8) could be fitted.

Table 3.11. Example of hedonism dataset with age by sex interaction variable

Respondent   Hedonism   AGE   SEX   AGE_SEX
1             1.55      25    0     0
2             0.76      30    0     0
3            -0.26      59    0     0
4            -1.00      47    1     47
.             .         .     .     .
.             .         .     .     .
5845          0.74      65    0     0

The inclusion of AGE_SEX, called the interaction between AGE and SEX, allows the
effect of AGE on HED to differ for men and women (or, equivalently, the effect of
sex on HED to depend on AGE). If the effect of age differs by sex, we say that
there is an interaction effect. To see how an interaction effect works, we will
look at the regression model for each value of SEX.

For SEX=0 (men), AGE_SEX=0 so the regression model (3.8) becomes:

HEDi = β0 + β1AGEi + ei        (3.9)

For SEX=1 (women), AGE_SEX=AGE and the regression model (3.8) becomes:

HEDi = (β0 + β2) + (β1 + β3)AGEi + ei        (3.10)
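The construction of the interaction variable can be sketched in a few lines (using the rows shown in the Table 3.11 extract): AGE_SEX equals AGE for women (SEX=1) and 0 for men (SEX=0).

```python
# Rows taken from the Table 3.11 extract: (respondent, hedonism, AGE, SEX).
rows = [
    (1, 1.55, 25, 0),
    (2, 0.76, 30, 0),
    (3, -0.26, 59, 0),
    (4, -1.00, 47, 1),
    (5845, 0.74, 65, 0),
]

# The interaction variable is simply the product of AGE and the 0/1 dummy
for respondent, hed, age, sex in rows:
    age_sex = age * sex
    print(respondent, hed, age, sex, age_sex)
```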
Table 3.12. Regression of hedonism on age, sex and their interaction

                Coeff.    S.E.    Z-ratio   p-value
Constant         0.839    0.048
Age (years)     -0.019    0.001   -20.075   <0.001
Female          -0.242    0.066    -3.649   <0.001
Age × Female     0.002    0.001     1.461    0.144
C3.4.4
Is the slope in the regression of hedonism on age significantly different for men
and women?
From Table 3.12, we see that the Z-ratio for this test is 1.461 and the p-value is
0.144. So we cannot reject the null hypothesis, and we conclude that there is no
evidence that the slope of age differs for men and women. We would then return
to the simpler model (3.7) with the fixed slope.
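As a check on this interpretation, the group-specific lines implied by models (3.9) and (3.10) can be recovered from the Table 3.12 estimates; the women's line closely reproduces the separate-sample estimates in Table 3.10 (0.597 and -0.018), up to rounding.

```python
# Estimates from Table 3.12: b0 = constant, b1 = age, b2 = female,
# b3 = age-by-female interaction.
b0, b1, b2, b3 = 0.839, -0.019, -0.242, 0.002

# Men (SEX=0): intercept b0, slope b1. Women (SEX=1): intercept b0 + b2,
# slope b1 + b3, as in equations (3.9) and (3.10).
men_intercept, men_slope = b0, b1
women_intercept, women_slope = b0 + b2, b1 + b3

print(men_intercept, men_slope)                           # 0.839 -0.019
print(round(women_intercept, 3), round(women_slope, 3))   # 0.597 -0.017
```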
C3.4.5

We have concluded that the effect of age on hedonism is the same for men and
women. Or, equivalently, we can conclude that the gender difference in hedonism
is the same for all ages. We will now test whether the effect of age is the same in
each of the three countries. We create an interaction variable between age and
each of the two country dummies:

AGE_GERM = AGE × GERM
AGE_FRANCE = AGE × FRANCE

The interaction model has the form:

HEDi = β0 + β1AGEi + β2GERMi + β3FRANCEi + β4AGE_GERMi + β5AGE_FRANCEi + ei

The results from fitting this model are given in Table 3.13.

Table 3.13. Regression of hedonism on age, country and their interactions

                 Coeff.    S.E.    Z-ratio   p-value
Constant          0.604    0.061
Age (years)      -0.021    0.001   -17.210   <0.001
Country
  Germany        -0.007    0.078    -0.085    0.932
  France          0.386    0.090     4.277   <0.001
Age × Germany     0.005    0.002     3.207    0.001
Age × France      0.001    0.002     0.570    0.569

To test whether all three countries have the same slope (a joint test), we need to
test the null hypothesis that β4 and β5 are both (simultaneously) equal to zero. We
can do this using an F-test for comparing nested models: the model in which β4 and
β5 are freely estimated (the interaction model) versus the model with both β4 and
β5 fixed at zero (the main effects model, i.e. without interaction terms). The
p-value for this test turns out to be 0.040, so there is evidence at the 5% level that
the interaction model is a significantly better fit to the data: at least one of the
age-by-country interaction coefficients is non-zero. We therefore conclude that
the age effect differs between countries.

Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)

Please read P3.4, which is available in online form or as part of a pdf file.

Don't forget to take the online quiz for this section! (see page 2 for
details of how to find the quiz questions)
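The nested-model F statistic used in the joint test above is computed from the residual sums of squares of the two models. A sketch follows; the RSS values here are hypothetical (the document reports only the resulting p-value, 0.040), and n = 5845 is an assumption taken from the Table 3.11 extract.

```python
# Nested-model F-test sketch: compare a restricted model (interactions fixed
# at zero) with the full interaction model. RSS values are hypothetical.
def f_statistic(rss_restricted, rss_full, n_restrictions, df_full):
    """F = ((RSS0 - RSS1) / q) / (RSS1 / df1), with q restrictions."""
    return ((rss_restricted - rss_full) / n_restrictions) / (rss_full / df_full)

n = 5845                  # assumed sample size
rss_main = 4890.0         # hypothetical RSS, main-effects model
rss_inter = 4884.6        # hypothetical RSS, interaction model
q = 2                     # restrictions: beta4 = beta5 = 0
df_full = n - 6           # 6 estimated coefficients in the interaction model

F = f_statistic(rss_main, rss_inter, q, df_full)
print(round(F, 2))        # compare to the F(2, n-6) critical value
```

The p-value then comes from the F(q, df_full) distribution; with q = 2 and a large sample, values of F above about 3.0 are significant at the 5% level.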
C3.5.1
C3.5.2
For simple regression, we check that the variance of the residuals is fairly constant
across the range of X in a plot of the standardised residuals against the explanatory
variable X. In multiple regression, it is useful to start with a plot of the
standardised residuals ri against the predicted values ŷi because, for any
individual, the predicted value of y is a linear function of their values on all X
variables in the model. This should be followed by an examination of plots of the
standardised residuals against each explanatory variable X in turn. For each plot
we are looking for indications of funnelling, where the vertical scatter of the
residuals differs across values of xi or ŷi, in which case the assumption of
homoskedasticity is not met.
A common reason for funnelling (or heteroskedasticity) is the existence of groups
in the data among which the relationship between Y and one or more X differs, i.e.
unmodelled interaction effects. To illustrate the idea of funnelling, suppose that
the relationship between Y and a continuous variable X1 is different for two
subgroups defined by a binary variable X2: the relationship between Y and X1 is
positive for both groups, but stronger for X2=0 than for X2=1. The predicted
regression lines from a multiple regression of Y on X1, X2 and their interaction X1*X2
are shown in Figure 3.14.
Figure 3.14. Prediction lines from a multiple regression with an interaction effect

Now suppose we mistakenly fit a simple regression of Y on X1, so we ignore the fact
that there are two groups with different relationships between Y and X1. Figure
3.15 shows the residual plot for this misspecified model. (The data points for the
groups defined by X2 are distinguished, but remember that X2 is not included in the
model.) The plot shows evidence of heteroskedasticity because the vertical spread
of the residuals gets smaller as X1 increases; this is an example of what we mean
by funnelling. Why has this happened? Instead of fitting two regression lines
with different intercepts and slopes for each group, we have fitted a single
average line which would lie somewhere in between the lines in Figure 3.14. At
small values of X1, where we have the largest difference in the predicted value of
Y for the two groups, the residuals about this line are large and positive for X2=1
and large and negative for X2=0. The difference between groups becomes smaller
as X1 increases, so the average line will lie close to the individual group lines and
the residuals are smaller.
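The funnelling mechanism just described can be reproduced in a small simulation (a sketch with made-up intercepts and slopes, not the actual data behind Figures 3.14 and 3.15): Y depends on X1 differently in the two groups defined by X2, a single line is fitted ignoring X2, and the residual spread shrinks as the group lines converge.

```python
import random

# Simulate two groups whose lines diverge at low x1 and converge at high x1
random.seed(1)
data = []
for _ in range(400):
    x1 = random.uniform(-5, 5)
    x2 = random.randint(0, 1)
    slope = 1.0 if x2 == 0 else 0.4      # stronger relationship when x2 = 0
    intercept = -2.0 if x2 == 0 else 2.0
    y = intercept + slope * x1 + random.gauss(0, 0.3)
    data.append((x1, y))

# Fit the misspecified simple regression of y on x1 (x2 omitted)
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
b = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
a = my - b * mx
resid = [y - (a + b * x) for x, y in data]

# Funnelling: residual spread at low x1 exceeds the spread at high x1
low = [r for (x, _), r in zip(data, resid) if x < -2.5]
high = [r for (x, _), r in zip(data, resid) if x > 2.5]

def spread(rs):
    return (sum(r * r for r in rs) / len(rs)) ** 0.5

print(spread(low) > spread(high))  # wider scatter where the group lines diverge
```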
Returning to the hedonism data, Figure 3.16 shows a plot of the standardised
residuals ri versus the standardised predicted values ŷi from the model with age,
education, gender and country included as explanatory variables. The vertical
spread of the points appears fairly equal across different values of standardised
ŷi, so we conclude that the assumption of homoskedasticity is reasonable.
C3.5.3

Outliers

We can also check for outliers using any of the residual plots. An outlier is a point
with a particularly large residual. We would expect approximately 95% of the
standardised residuals to lie between -2 and +2.

Figure 3.15. Plot of ri versus X1 from fitting a misspecified regression without X2 or its
interaction with X1

Of major interest, however, is whether an outlier has undue influence on our
results. An influence statistic called Cook's D (where D is for distance) measures
how different our estimated regression coefficients would have been if a sample
observation were omitted. Cook's D is calculated for every observation. The
higher the value of D, the more likely it is that an observation exerts influence on
the estimates of the coefficients. However, D does not have a fixed range and so
we focus on those values of D which are considerably greater than, say, the 90th
percentile.
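For simple regression, Cook's D can be computed directly from each observation's residual and leverage (a sketch with made-up data; statistical software computes the same quantity for the full multiple regression):

```python
import statistics

# Made-up data: the last point combines a large residual with high leverage
x = [25, 30, 59, 47, 65, 18, 72, 41, 33, 90]
y = [1.2, 0.8, -0.4, 0.1, -0.9, 1.5, -1.1, 0.3, 0.6, 3.0]

n, p = len(x), 2                      # p = number of estimated coefficients
mx, my = statistics.mean(x), statistics.mean(y)
sxx = sum((a - mx) ** 2 for a in x)
b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
a0 = my - b * mx

resid = [c - (a0 + b * a) for a, c in zip(x, y)]
s2 = sum(e * e for e in resid) / (n - p)          # residual variance

def cooks_d(i):
    """Cook's D for observation i: residual scaled by its leverage."""
    h = 1 / n + (x[i] - mx) ** 2 / sxx            # leverage of observation i
    return resid[i] ** 2 * h / (p * s2 * (1 - h) ** 2)

d = [cooks_d(i) for i in range(n)]
print(max(range(n), key=lambda i: d[i]))  # index of the most influential point
```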
For a regression of hedonism on age, education, sex and country, we find that the
90th percentile of the distribution of Cook's D is 0.000046. A boxplot of Cook's D
is given in Figure 3.17. Two observations have relatively large values of D: case
numbers 3225 and 2948. However, removing these observations from the analysis
has negligible impact on our results (see Table 3.14).

Table 3.14. Regression of hedonism on age, education, sex and country: full sample and
with cases 3225 and 2948 removed

                       Full sample           Cases 3225 and 2948 removed
                       Coeff.   Z-ratio      Coeff.   Z-ratio
Constant                0.790                 0.789
Age (years)            -0.019   -27.611      -0.019   -27.537
Education (years)      -0.015    -4.281      -0.015    -4.350
Female                 -0.160    -6.752      -0.160    -6.774
Country
  Germany               0.222     8.068       0.222     8.090
  France                0.436    13.145       0.441    13.322
Don't forget to do the practical for this section! (see page 2 for
details of how to find the practical)

Please read P3.5, which is available in online form or as part of a pdf file.

Don't forget to take the online quizzes for this module if you
haven't already done so! (see page 2 for details of how to find the
quizzes)
Centre for Multilevel Modelling, 2008