Session 4

Correlation
Correlation and Regression

Correlation involves calculating
an index to measure the nature
of the relationship between
variables.
With regression, an equation is
developed to predict the values
of a dependent variable.
Pearson Product Moment Coefficient r
The Pearson correlation coefficient r varies over a range
of +1 through 0 to -1.
The symbol r denotes the coefficient's estimate of linear
association based on sample data. The coefficient ρ
represents the population correlation.
Pearson Product Moment Coefficient r
Correlation
coefficients reveal the
magnitude and
direction of
relationships.
Illustration of Direction:
Positive Correlation
Family income vs. household food
expenditures
Negative Correlation
Prices of products and services in
relation to their scarcity or
availability.
SCATTERPLOTS
They are essential for
understanding the relationship
between variables.
They provide a means for visual
inspection of data that a list of
values for two variables cannot.
Correlation Analysis
Used to measure and interpret the
strength of association (linear
relationship) between two numerical
variables
Only concerned with strength of the
relationship
No causal effect is implied
Session 12.7
Scatter Diagram
a plot of paired data to

determine or show a
relationship between two
variables
Paired Data
When there
appears to be a
linear relationship
between x and y:
attempt to fit a line to the
scatter diagram.
Linear Correlation
The general trend of the points

seems to follow a straight line
segment.
Linear Correlation
Non-Linear Correlation
No Linear Correlation
High Linear
Correlation
Points lie close to a straight line.
High Linear Correlation

Moderate Linear Correlation
Low Linear Correlation

Perfect Linear
Correlation
The Sample Correlation

Coefficient, r
A measurement of the strength of the

linear association between two
variables
Also called the Pearson product-moment correlation coefficient
Positive Linear
Correlation
High values of x are paired with

high values of y and low values
of x are paired with low values
of y.
Negative Linear Correlation
High values of x are paired with

low values of y and low values of x
are paired with high values of y.
Little or No Linear Correlation
Both high and low values of x are

sometimes paired with high values
of y and sometimes with low values
of y.
[Figure: three scatter diagrams of y against x, showing Positive Correlation, Negative Correlation, and Little or No Linear Correlation]
What type of
correlation is
expected?
Height and weight
Mileage on tires and remaining tread
IQ and height
Years of driving experience and insurance rates
Linear correlation coefficient
$-1 \le r \le +1$
Table of Interpretation
Pearson r       Qualitative Interpretation
1.00            Perfect Correlation
0.91 - 0.99     Very High Correlation
0.71 - 0.90     High Correlation
0.41 - 0.70     Marked Correlation
0.21 - 0.40     Slight/Low Correlation
0 - 0.20        Negligible Correlation
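If r = 0, the scatter diagram might look like:
[Figure: scatter diagram of y against x with no linear trend]
If r = +1, all points lie on the least squares line
If r = -1, all points lie on the least squares line
[Figure: scatter diagrams of y against x for -1 < r < 0 and 0 < r < 1]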
Find the Correlation Coefficient
x (Miles)   y (Min.)   x^2    y^2    xy
2           6          4      36     12
5           9          25     81     45
12          23         144    529    276
7           18         49     324    126
7           15         49     225    105
15          28         225    784    420
10          19         100    361    190
Σx = 58     Σy = 118   Σx^2 = 596   Σy^2 = 2340   Σxy = 1174
The Correlation Coefficient,
r = 0.9753643
r ≈ 0.98
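The value of r above can be checked from the column sums. The sketch below assumes the usual computational formula for the sample correlation coefficient, r = (nΣxy - ΣxΣy) / sqrt((nΣx^2 - (Σx)^2)(nΣy^2 - (Σy)^2)), which the slide does not print; the function name `pearson_r` is chosen here for illustration only.

```python
import math

# Paired data from the table: x = miles, y = minutes.
x = [2, 5, 12, 7, 7, 15, 10]
y = [6, 9, 23, 18, 15, 28, 19]

def pearson_r(x, y):
    """Sample correlation coefficient from the column sums (computational formula)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(x, y)
print(round(r, 7))  # should agree with the value reported on the slide
```

The intermediate sums (58, 118, 596, 2340, 1174) are the same ones listed in the last row of the table, so each can be compared by hand.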
Warning
The correlation coefficient (r)
measures the strength of the linear
relationship between two variables.
Just because two variables are related
does not imply that there is a cause-
and-effect relationship between them.
Testing the
Correlation
Coefficient
Determining whether a value of

the sample correlation
coefficient, r, is far enough from
zero to indicate correlation in
the population.
The Population Correlation Coefficient
ρ = Greek letter rho
Hypotheses to Test Rho
Assume that both variables x and y are
normally distributed.
To test if the (x, y) values are correlated in
the population, set up the null hypothesis
that they are not correlated:
H0: x and y are not correlated, so ρ = 0.
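As an illustration of testing H0: ρ = 0, the sketch below applies the standard t statistic t = r·sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom to the miles/minutes example. The statistic itself is not printed on the slide, and the two-tailed alternative and the 5% level are assumptions made for this example; the critical value 2.5706 is the t-table value for 5 degrees of freedom.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.9753643   # sample correlation from the miles/minutes example
n = 7           # number of (x, y) pairs in that example
t_stat = correlation_t_statistic(r, n)

# Two-tailed critical value for alpha = 0.05 and df = 5 (from a t table).
t_critical = 2.5706
reject_h0 = abs(t_stat) > t_critical
print(round(t_stat, 2), reject_h0)
```

If `reject_h0` is true, the sample value of r is far enough from zero to indicate correlation in the population at the chosen level.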

Spearman Rank Correlation
A measure of Rank Correlation
The Spearman Correlation
Spearman's correlation is designed to measure
the relationship between variables measured on
an ordinal scale of measurement.
It is similar to Pearson's correlation, but it
uses ranks rather than actual values.
Assumptions
The data is a bivariate random variable.
The measurement scale is at least ordinal.

Advantages
1. Less sensitive to bias due to the effect of

outliers
- Can be used to reduce the weight of outliers (large distances
get treated as a one-rank difference)
2. Does not require assumption of normality.
3. When the intervals between data points are

problematic, it is advisable to study the
rankings rather than the actual values.
Disadvantages
1. Calculations may become tedious. Additionally
ties are important and must be factored into
computation.
Steps in Calculating Spearman's Rho
1. Convert the observed values to ranks
(accounting for ties)
2. Find the differences between the ranks, square
them and sum the squared differences.
3. Set up the hypotheses, carry out the test and conclude
based on the findings.
4. If the null is rejected, then calculate the
Spearman correlation coefficient to measure
the strength of the relationship between the
variables.
Hypothesis: I
A. (Two-Tailed)
H0: There is no correlation between the Xs and the Ys.
(there is mutual independence between the Xs and the Ys)
H1: There is a correlation between the Xs and the Ys.
(there is mutual dependence between the Xs and the Ys)
Spearman's Rho
Assumes values between -1 and +1
-1 = Perfectly Negative Correlation
+1 = Perfectly Positive Correlation
Example 1
The ICC rankings for One Day International (ODI) and
Test matches for nine teams are shown below.
Team            Test Rank   ODI Rank
Australia       1           1
India           2           3
South Africa    3           2
Sri Lanka       4           7
England         5           6
Pakistan        6           4
New Zealand     7           5
West Indies     8           8
Bangladesh      9           9
Test whether there is correlation between the ranks
Example 1
Team            Test Rank   ODI Rank   |d|   d^2
Australia       1           1          0     0
India           2           3          1     1
South Africa    3           2          1     1
Sri Lanka       4           7          3     9
England         5           6          1     1
Pakistan        6           4          2     4
New Zealand     7           5          2     4
West Indies     8           8          0     0
Bangladesh      9           9          0     0
Total                                        20
Answer:
$T = \sum d_i^2 = 20$
$r_s = 1 - \frac{6T}{n(n^2 - 1)} = 1 - \frac{6(20)}{9(9^2 - 1)} = 0.8333.$
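The calculation in Example 1 can be sketched in a few lines. This assumes the no-ties formula 1 - 6Σd^2 / (n(n^2 - 1)), which reproduces the 0.8333 reported in the answer; the function name `spearman_rho_no_ties` is made up for this illustration.

```python
# Ranks from Example 1 (no ties), in the order the teams are listed.
test_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9]
odi_rank = [1, 3, 2, 7, 6, 4, 5, 8, 9]

def spearman_rho_no_ties(rank_x, rank_y):
    """Spearman's rho from rank differences; valid only when there are no ties."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
    return sum_d2, rho

sum_d2, rho = spearman_rho_no_ties(test_rank, odi_rank)
print(sum_d2, round(rho, 4))  # sum of squared differences, then rho
```

The formula is only exact when no two teams share a rank; with ties, the ranks need to be averaged first.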
Example 2
A composite rating is given by executives to
each college graduate joining a plastic
manufacturing firm. The executive ratings
represent the future potential of the college
graduate. The graduates then enter an in-plant
training programme and are given another
composite rating. The executive ratings and the
in-plant ratings are as follows:
Graduate Executive rating (X) Training rating (Y)

A 8 4
B 10 4
C 9 4
D 4 3
E 12 6
F 11 9
G 11 9
H 7 6
I 8 6
J 13 9
K 10 5
L 12 9
A) At the 5% level of significance, determine if there

is a positive correlation between the variables
B) Find the rank correlation coefficient if the null is
rejected
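Example 2 contains tied ratings, so the ranks have to be averaged within each tie (step 1 of the procedure). The sketch below is one way to do that, under the assumption that Spearman's rho is then computed as Pearson's r applied to the averaged ranks, a common treatment of ties that the slides do not spell out. The helper names `average_ranks` and `pearson_r` are illustrative, and the significance test in part A is not included because it needs a table of critical values.

```python
import math

# Ratings from Example 2, graduates A through L.
x = [8, 10, 9, 4, 12, 11, 11, 7, 8, 13, 10, 12]   # executive rating
y = [4, 4, 4, 3, 6, 9, 9, 6, 6, 9, 5, 9]          # training rating

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1   # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson_r(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    s_aa = sum((p - mean_a) ** 2 for p in a)
    s_bb = sum((q - mean_b) ** 2 for q in b)
    return s_ab / math.sqrt(s_aa * s_bb)

rank_x = average_ranks(x)
rank_y = average_ranks(y)
sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
rho = pearson_r(rank_x, rank_y)   # Spearman's rho as Pearson's r on the ranks
print(rank_x)
print(rank_y)
print(sum_d2, round(rho, 4))
```

Ranking from smallest to largest or from largest to smallest gives the same coefficient, as long as both variables are ranked in the same direction.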
Regression
Analysis
Purpose of Regression
Analysis
Regression analysis is used primarily to
establish a linear relationship between
variables and to provide predictions
It predicts the value of a dependent (response)
variable based on the value of at least one
independent (explanatory) variable
It explains the effect of the independent
variables on the dependent variable
Types of Regression Models
[Figure: four scatter plots labeled Positive Linear Relationship, Relationship NOT Linear, Negative Linear Relationship, and No Relationship]
Simple Linear
Regression
Relationship between variables
is described by a linear function
This function relates how much
change in the dependent variable
is associated with a unit increase
(or decrease) in the independent
variable.
Population Linear Regression:
Simple Linear Regression Model
The population regression line is a straight line that describes the
relationship of the average value of one variable to the other
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
$Y_i$ = dependent (response) variable
$X_i$ = independent (explanatory) variable
$\beta_0$ = population Y intercept
$\beta_1$ = population slope coefficient
$\varepsilon_i$ = random error
$\mu_{Y|X} = \beta_0 + \beta_1 X_i$ = population regression line
Random Error Term
$\varepsilon_i$ is the random error term for the ith
observation,
where the $\varepsilon_i$'s are independently normally
distributed with mean 0 and variance $\sigma^2$
for i = 1, ..., n; n is the number of
observations
Random Error Term
It represents the effect of other factors, apart
from X, which are omitted from the model
but do affect the response variable to some
extent
It may also account for errors of observation

or measurements in recording the response
variable
4 Assumptions Made on the Random Error Term
1. The error terms are independent of
one another;
2. The error terms are normally
distributed;
3. The error terms all have a mean of 0;
and
4. The error terms have constant
variance, $\sigma^2$.
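Population Linear Regression: Simple Linear Regression Model
[Figure: plot of Y against X showing an observed value of Y, $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, the population regression line (conditional mean) $\mu_{Y|X} = \beta_0 + \beta_1 X_i$ with intercept $\beta_0$ and slope $\beta_1$, and the random error $\varepsilon_i$ as the vertical distance between them]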
Interpretation of the Slope and the Intercept
$\beta_0 = E(Y \mid X = 0)$ is the average value of Y
when the value of X is zero.
$\beta_1 = \frac{\Delta E(Y \mid X)}{\Delta X}$ measures the change in the
average value of Y as a result of a one-unit
change in X.
Steps in Doing a Simple Linear
Regression Analysis
1. Obtain the equation that best fits the data;
2. Evaluate the equation to determine the strength
of the relationship for estimation and prediction;
3. Determine if the assumptions on the error terms
are satisfied and if model fits the data adequately;
4. Use the equation for prediction and description.
Sample Linear Regression
The sample regression line provides an estimate of the
population regression line as well as a predicted value of Y
$$Y_i = b_0 + b_1 X_i + e_i$$
$b_0$ = sample Y intercept
$b_1$ = sample slope coefficient
$e_i$ = residual
$\hat{Y}_i = b_0 + b_1 X_i$ = sample regression line (fitted regression line, predicted value)
Estimation using Method of Least Squares
The estimates for the parameters $\beta_0$
and $\beta_1$ are obtained by minimizing the
sum of the squared errors
$$\sum_{i=1}^{n} (Y_i - \mu_{Y|X})^2 = \sum_{i=1}^{n} \varepsilon_i^2$$
$b_0$ provides an estimate of $\beta_0$
$b_1$ provides an estimate of $\beta_1$
Sample Linear Regression
As a result of LS estimation, the
values of $b_0$ and $b_1$ also minimize
the sum of the squared residuals.
$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$$
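For reference, minimizing the sum of squared residuals has a well-known closed-form solution. The slides do not show it, so the block below is offered as a supplementary sketch; $\bar{X}$ and $\bar{Y}$ denote the sample means of the $X_i$ and $Y_i$, notation assumed here rather than taken from the slides.

```latex
b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
b_0 = \bar{Y} - b_1 \bar{X}
```

The second expression implies that the fitted line always passes through the point $(\bar{X}, \bar{Y})$.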
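Sample Linear Regression
[Figure: plot of Y against X showing an observed value, the sample regression line $\hat{Y}_i = b_0 + b_1 X_i$ (intercept $b_0$, slope $b_1$, residual $e_i$, so that $Y_i = b_0 + b_1 X_i + e_i$) and the population regression line $\mu_{Y|X} = \beta_0 + \beta_1 X_i$ (intercept $\beta_0$, slope $\beta_1$, error $\varepsilon_i$, so that $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$)]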
Interpretation of the Slope and the Intercept
$b_0 = \hat{E}(Y \mid X = 0)$ is the estimated
average value of Y when the value of X
is zero.
$b_1 = \frac{\Delta \hat{E}(Y \mid X)}{\Delta X}$ is the estimated
change in the average value of Y as a
result of a one-unit change in X.
EXAMPLE
Examine the linear relationship of the annual sales of produce stores on
their size in square footage. Find the equation of the straight line that
fits the data best.
Store   Square Feet   Annual Sales ($1000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
EXAMPLE
$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$
From Excel Printout:
               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657
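The Excel coefficients can be reproduced without a spreadsheet. The sketch below assumes the usual closed-form least-squares estimates (slope = SSXY / SSX, intercept = mean of Y minus slope times mean of X); the function name `least_squares_line` is illustrative only.

```python
# Store data from the example: size in square feet and annual sales in $1000.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

def least_squares_line(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    b1 = ss_xy / ss_x            # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

b0, b1 = least_squares_line(square_feet, annual_sales)
print(round(b0, 3), round(b1, 3))  # compare with the Intercept and X Variable 1 rows
```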
EXAMPLE
[Figure: scatter plot of Annual Sales ($000) against Square Feet with the fitted line $\hat{Y}_i = 1636.415 + 1.487 X_i$]
EXAMPLE
$\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of 1.487 means that for each increase of one
unit in X, we predict the average of Y to increase by an
estimated 1.487 units.
Because sales are recorded in $1000, the model estimates that for each
increase of one square foot in the size of the store, the expected
annual sales are predicted to increase by $1,487.
RESIDUAL ANALYSIS
Purposes
Examine linearity
Evaluate assumptions to see if any
is violated
Graphical Analysis of Residuals
Plot residuals vs. $X_i$, $\hat{Y}_i$ (and time
if necessary)
Residual Analysis for Linearity
[Figure: plots of Y against X and of residuals e against X for two cases, labeled Not Linear and Linear]
Residual Analysis for Homoscedasticity
[Figure: plots of Y against X and of SR against X for two cases, labeled Heteroscedasticity and Homoscedasticity]
Residual Analysis: Excel Output
Observation   Predicted Y    Residuals
1             4202.344417    -521.3444173
2             3928.803824    -533.8038245
3             5822.775103    830.2248971
4             9894.664688    -351.6646882
5             3557.14541     -239.1454103
6             4918.90184     644.0981603
7             3588.364717    171.6352829
[Figure: Residual Plot of the residuals against Square Feet]
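The predicted values and residuals in the Excel output follow directly from the fitted equation. The sketch below takes the coefficients from the Excel printout as given and recomputes both columns for the seven stores; the variable names are illustrative.

```python
# Store data from the example: size in square feet and annual sales in $1000.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

# Coefficients as reported in the Excel printout.
b0 = 1636.414726
b1 = 1.486633657

# Predicted value for each store, and residual = observed - predicted.
predicted = [b0 + b1 * x for x in square_feet]
residuals = [y - y_hat for y, y_hat in zip(annual_sales, predicted)]

for store, (y_hat, e) in enumerate(zip(predicted, residuals), start=1):
    print(store, round(y_hat, 3), round(e, 3))
```

Plotting `residuals` against `square_feet` gives the residual plot; for a least-squares fit with an intercept, the residuals sum to zero up to rounding.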

Inference about the Slope: t Test
t test for a population slope
Is there a linear relationship of Y on X?
Null and alternative hypotheses
$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \ne 0$ (linear relationship)
Test statistic
$$t = \frac{b_1}{S_{b_1}} \quad \text{where} \quad S_{b_1} = \sqrt{\frac{MSE}{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}}$$
$$\text{where} \quad MSE = \frac{SSE}{n - 2} = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}$$
Example: Produce Store
Data for Seven Stores:
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Estimated Regression Equation:
$\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of this model is 1.487.
Is square footage of the store affecting its annual sales?
Inferences about the Slope: t-test
$H_0: \beta_1 = 0$
$H_1: \beta_1 \ne 0$
$\alpha = .05$
df = 7 - 2 = 5
Critical Values: -2.5706 and 2.5706 (rejection region of .025 in each tail)
From Excel Printout:
            Coefficients   Standard Error   t Stat   P-value
Intercept   1636.4147      451.4953         3.6244   0.01515
Footage     1.4866         0.1650           9.0099   0.00028
Test Statistic: $t = b_1 / S_{b_1} = 1.4866 / 0.1650 = 9.0099$
Decision: Reject H0
Conclusion: There is evidence that square footage affects annual sales.
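The t test for the slope can be traced step by step. The sketch below refits the line, computes SSE, MSE and the standard error of the slope, and compares the statistic with the critical value 2.5706 quoted for df = 5. It uses the deviation form of the denominator, Σ(X - mean of X)^2, which is algebraically equal to ΣX^2 - (ΣX)^2 / n; the variable names are illustrative.

```python
import math

# Store data from the example: size in square feet and annual sales in $1000.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(square_feet)
mean_x = sum(square_feet) / n
mean_y = sum(annual_sales) / n

# Least-squares estimates of the slope and intercept.
ss_x = sum((x - mean_x) ** 2 for x in square_feet)   # equals sum X^2 - (sum X)^2 / n
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, annual_sales))
b1 = ss_xy / ss_x
b0 = mean_y - b1 * mean_x

# Error sum of squares, mean square error, and standard error of the slope.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, annual_sales))
mse = sse / (n - 2)
s_b1 = math.sqrt(mse / ss_x)

# Test statistic and two-tailed decision at alpha = 0.05 with df = n - 2 = 5.
t_stat = b1 / s_b1
t_critical = 2.5706
reject_h0 = abs(t_stat) > t_critical
print(round(s_b1, 4), round(t_stat, 4), reject_h0)
```

The printed standard error and t statistic can be compared with the Footage row of the Excel printout.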
Pitfalls of Regression
Analysis
Lacking an awareness of the assumptions

underlying least-squares regression
Not knowing how to evaluate assumptions
Not knowing the alternatives to classical
regression if some assumption is violated
Using a regression model without
knowledge of the subject matter
Strategies for Avoiding the Pitfalls
Start with a scatter plot of Y against X to
observe a possible relationship
Perform residual analysis to check the
assumptions
Use a histogram, stem-and-leaf
display, box-and-whisker plot, or
normal probability plot of the
residuals to uncover possible non-normality
Strategies for Avoiding the Pitfalls
If any assumption is violated,
use alternative methods to least-squares
regression or alternative least-squares
models (e.g., curvilinear or
multiple regression)
If there is no evidence of assumption
violation, then test for the significance of
the regression coefficients
Problem Set
c) Find the equation of the regression model that will predict

cost per day given the length of services in days.
