Correlation coefficients reveal the magnitude and direction of relationships.
Illustration of Direction:
Positive Correlation: family income vs. household food expenditures
Negative Correlation: prices of products and services in relation to their scarcity or availability
SCATTERPLOTS
They are essential for
understanding the relationship
between variables.
They provide a means for visual
inspection of data that a list of
values for two variables cannot.
Correlation Analysis
Used to measure and interpret the strength of association (linear relationship) between two numerical variables.
It is concerned only with the strength of the relationship; no causal effect is implied.
Session 12.7
Scatter Diagram
When there appears to be a linear relationship between x and y, attempt to fit a line to the scatter diagram.
Linear Correlation
[Scatter diagrams of y versus x illustrating: high linear correlation (positive and negative), non-linear correlation, and little or no linear correlation]
What type of correlation is expected?
Height and weight
IQ and height
Linear correlation coefficient
−1 ≤ r ≤ +1
Table of Interpretation
Pearson r Qualitative Interpretation
1.00 Perfect Correlation
0.91 - 0.99 Very High Correlation
0.71 - 0.90 High Correlation
0.41 - 0.70 Marked Correlation
0.21 - 0.40 Slight/Low Correlation
0 - 0.20 Negligible Correlation
If r = 0, the scatter diagram shows no linear pattern (a random cloud of points).
If r = +1, all points lie on the least squares line.
For −1 < r < 0 the points scatter about a downward-sloping line; for 0 < r < 1, about an upward-sloping line.
[Scatter diagrams illustrating r = 0, r = +1, −1 < r < 0, and 0 < r < 1]
Find the Correlation Coefficient

 x (Miles)   y (Min.)    x²      y²      xy
     2           6        4      36      12
     5           9       25      81      45
    12          23      144     529     276
     7          18       49     324     126
     7          15       49     225     105
    15          28      225     784     420
    10          19      100     361     190
 Σx = 58    Σy = 118   Σx² = 596   Σy² = 2340   Σxy = 1174

The Correlation Coefficient:
r = (nΣxy − ΣxΣy) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)] = 0.9753643
r ≈ 0.98
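The arithmetic above can be checked with a short script; this is a sketch in plain Python using the computational formula for Pearson's r with the column sums from the table:

```python
import math

# Paired observations from the table: distance (miles) and time (minutes).
x = [2, 5, 12, 7, 7, 15, 10]
y = [6, 9, 23, 18, 15, 28, 19]
n = len(x)

# Column sums used in the computational formula.
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# r = (n*Sxy - Sx*Sy) / sqrt((n*Sx2 - Sx^2) * (n*Sy2 - Sy^2))
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 7))  # 0.9753643
```

The same column sums appear in the table above, so the script doubles as a check on the tabulated Σx, Σy, Σx², Σy², and Σxy.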
Testing the Correlation Coefficient
Hypotheses to Test
Assume that both variables x and y are normally distributed. To test if the (x, y) values are correlated in the population, set up the null hypothesis that they are not correlated:
H₀: ρ = 0 versus H₁: ρ ≠ 0
Assumptions
The data are a bivariate random sample.
Disadvantages
1. Calculations may become tedious. Additionally, ties are important and must be factored into the computation.
Steps in Calculating Spearman's Rho
1. Convert the observed values to ranks
(accounting for ties)
2. Find the difference between the ranks, square
them and sum the squared differences.
3. Set up hypothesis, carry out test and conclude
based on findings.
Spearman's Rho
Takes values between −1 and +1.
Example 1

Team           Test Rank   ODI Rank    d    d²
Australia          1           1       0    0
India              2           3       1    1
South Africa       3           2       1    1
Sri Lanka          4           7       3    9
England            5           6       1    1
Pakistan           6           4       2    4
New Zealand        7           5       2    4
West Indies        8           8       0    0
Bangladesh         9           9       0    0
Total                                      20
Answer:
Σd² = 20, n = 9
rₛ = 1 − 6Σd² / (n(n² − 1)) = 1 − 6(20) / (9 × 80) = 1 − 120/720 = 0.8333
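The Spearman calculation for Example 1 can be reproduced in a few lines; a sketch assuming the ranks above and no ties:

```python
# Test and ODI rankings of the nine teams from Example 1.
test_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9]
odi_rank = [1, 3, 2, 7, 6, 4, 5, 8, 9]
n = len(test_rank)

# Sum of squared rank differences.
sum_d2 = sum((t - o) ** 2 for t, o in zip(test_rank, odi_rank))

# Spearman's rho for untied ranks: r_s = 1 - 6*sum(d^2) / (n*(n^2 - 1))
r_s = 1 - 6 * sum_d2 / (n * (n * n - 1))
print(sum_d2, round(r_s, 4))  # 20 0.8333
```

With ties, the shortcut formula no longer applies exactly; ranks would be averaged within tied groups and Pearson's r computed on the ranks instead.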
Example 2
A composite rating is given by executives to
each college graduate joining a plastic
manufacturing firm. The executive ratings
represent the future potential of the college
graduate. The graduates then enter an in-plant
training programme and are given another
composite rating. The executive ratings and the
in-plant ratings are as follows:
Purpose of Regression
Analysis
Regression analysis is used primarily to establish a linear relationship between variables and to provide predictions.
It predicts the value of a dependent (response) variable based on the value of at least one independent (explanatory) variable.
It explains the effect of the independent variables on the dependent variable.
Session 13.56
Types of Regression
Models
[Diagrams: a positive linear relationship and a relationship that is NOT linear]
Simple Linear
Regression
Relationship between variables
is described by a linear function
This function relates how much
change in the dependent variable
is associated with a unit increase
(or decrease) in the independent
variable.
Population Linear Regression: Simple Linear Regression Model

The population regression line is a straight line that describes the relationship of the average value of one variable to the other:

Yᵢ = β₀ + β₁Xᵢ + εᵢ

where Yᵢ is the dependent (response) variable, Xᵢ the independent (explanatory) variable, β₀ the population Y-intercept, β₁ the population slope coefficient, and εᵢ the random error term. The population regression line is μY|X = β₀ + β₁Xᵢ.
Random Error Term
It represents the effect of other factors, apart
from X, which are omitted from the model
but do affect the response variable to some
extent
Population Linear Regression: Simple Linear Regression Model
[Diagram: an observed value of Y, Yᵢ = β₀ + β₁Xᵢ + εᵢ, scattered about the population regression line (conditional mean) μY|X = β₀ + β₁Xᵢ; εᵢ is the random error]
Interpretation of the Slope and the Intercept

β₀ = E(Y | X = 0) is the average value of Y when the value of X is zero.

β₁ = ΔE(Y | X) / ΔX measures the change in the average value of Y as a result of a one-unit change in X.
Steps in Doing a Simple Linear
Regression Analysis
1. Obtain the equation that best fits the data;
2. Evaluate the equation to determine the strength
of the relationship for estimation and prediction;
3. Determine if the assumptions on the error terms are satisfied and if the model fits the data adequately;
4. Use the equation for prediction and description.
Sample Linear Regression

The sample regression line provides an estimate of the population regression line as well as a predicted value of Y:

Yᵢ = b₀ + b₁Xᵢ + eᵢ

where b₀ is the sample Y-intercept, b₁ the sample slope coefficient, and eᵢ the residual. The sample regression line (fitted line, giving the predicted value) is Ŷ = b₀ + b₁X.
Estimation using Method of Least Squares

b₀ and b₁ are chosen to minimize the sum of squared residuals over i = 1 to n:

Σ(Yᵢ − Ŷᵢ)² = Σeᵢ²

b₀ provides an estimate of β₀
b₁ provides an estimate of β₁
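As a sketch of the least-squares computation, the normal-equation formulas can be applied to the miles/minutes data from the earlier correlation example (the choice of data here is ours, purely for illustration):

```python
# Paired sample reused from the correlation example.
x = [2, 5, 12, 7, 7, 15, 10]
y = [6, 9, 23, 18, 15, 28, 19]
n = len(x)

# Normal-equation solutions that minimize the sum of squared residuals.
b1 = (n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)) / (
    n * sum(v * v for v in x) - sum(x) ** 2
)
b0 = sum(y) / n - b1 * sum(x) / n
print(round(b0, 4), round(b1, 4))  # 2.7673 1.7005
```

These b₀ and b₁ are the unique minimizers of Σeᵢ² for this sample, estimating β₀ and β₁.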
Sample Linear Regression
[Diagram comparing the sample regression line Ŷᵢ = b₀ + b₁Xᵢ (with residuals eᵢ) to the population regression line μY|X = β₀ + β₁Xᵢ (with errors εᵢ) for an observed value of Y]
Interpretation of the Slope and the Intercept

b₀ = Ê(Y | X = 0) is the estimated average value of Y when the value of X is zero.

b₁ estimates the change in the average value of Y as a result of a one-unit change in X.
EXAMPLE

Ŷᵢ = b₀ + b₁Xᵢ = 1636.415 + 1.487Xᵢ

From Excel Printout:

               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657
EXAMPLE
[Scatter plot of Annual Sales ($000) versus Square Feet, with the fitted line Ŷᵢ = 1636.415 + 1.487Xᵢ]
EXAMPLE
Yi = 1636.415 +1.487 X i
The slope of 1.487 means that for each increase of one
unit in X, we predict the average of Y to increase by an
estimated 1.487 units.
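The interpretation can be seen directly from the fitted equation; the 2000-square-feet input below is a made-up value for illustration only:

```python
# Fitted equation from the example: Yhat = 1636.415 + 1.487 * X
def predict(x):
    return 1636.415 + 1.487 * x

# Predictions one unit of X apart differ by exactly the slope.
print(round(predict(2001) - predict(2000), 3))  # 1.487
```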
RESIDUAL ANALYSIS

Purposes
Examine linearity
Evaluate assumptions to see if any are violated

Graphical Analysis of Residuals
Plot residuals vs. Xᵢ and Ŷᵢ (and time, if necessary)
[Residual plots of e versus X: a curved pattern indicates the relationship is not linear; a random scatter about zero indicates a linear fit]
Residual Analysis for Homoscedasticity
[Plots of Y versus X and of standardized residuals (SR) versus X: residual spread that changes with X indicates heteroscedasticity; constant spread indicates homoscedasticity]
Residual Analysis: Excel Output

Observation   Predicted Y    Residuals
1             4202.344417    -521.3444173
2             3928.803824    -533.8038245
3             5822.775103     830.2248971
4             9894.664688    -351.6646882
5             3557.14541     -239.1454103
6             4918.90184      644.0981603
7             3588.364717     171.6352829
Residual Plot
Standard Error of the Slope

S_b₁ = √[ MSE / (ΣXᵢ² − (ΣXᵢ)²/n) ]

where MSE = SSE / (n − 2) = Σ(Yᵢ − Ŷᵢ)² / (n − 2)
Example: Produce Store

Data for Seven Stores:

Store   Square Feet   Annual Sales ($000)
1          1,726            3,681
2          1,542            3,395
3          2,816            6,653
4          5,555            9,543
5          1,292            3,318
6          2,208            5,563
7          1,313            3,760

Estimated Regression Equation: Ŷᵢ = 1636.415 + 1.487Xᵢ
The slope of this model is 1.487.
Is the square footage of the store affecting its annual sales?
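The reported coefficients can be reproduced from the seven stores' data; a sketch using the same least-squares formulas:

```python
# Square footage and annual sales ($000) for the seven produce stores.
sqft = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n = len(sqft)

# Least-squares slope and intercept.
b1 = (n * sum(x * y for x, y in zip(sqft, sales)) - sum(sqft) * sum(sales)) / (
    n * sum(x * x for x in sqft) - sum(sqft) ** 2
)
b0 = sum(sales) / n - b1 * sum(sqft) / n
print(round(b0, 3), round(b1, 3))  # 1636.415 1.487
```

The output matches the Excel printout coefficients (Intercept 1636.414726, X Variable 1 1.486633657).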
Inferences about the Slope: t-test

H₀: β₁ = 0 (no linear relationship)
H₁: β₁ ≠ 0 (linear relationship exists)

Test Statistic: t = b₁ / S_b₁ with n − 2 degrees of freedom
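A sketch of the t statistic for the slope, computed from the produce store data (here n − 2 = 5 degrees of freedom); the critical value or p-value would come from a t table:

```python
import math

# Square footage and annual sales ($000) for the seven produce stores.
sqft = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n = len(sqft)
xbar = sum(sqft) / n
ybar = sum(sales) / n

# Least-squares fit in deviation form.
sxx = sum((x - xbar) ** 2 for x in sqft)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(sqft, sales)) / sxx
b0 = ybar - b1 * xbar

# Standard error of the slope: S_b1 = sqrt(MSE / Sxx), MSE = SSE / (n - 2).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, sales))
s_b1 = math.sqrt(sse / (n - 2) / sxx)

t = b1 / s_b1
print(round(t, 2))  # 9.01
```

A t statistic this large leads to rejecting H₀ at any conventional significance level, supporting a linear relationship between square footage and annual sales.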
Pitfalls of Regression
Analysis
Strategies for Avoiding the
Pitfalls of Regression
Problem Set