The Simple Linear Regression Model

Purpose of Regression and Correlation Analysis

Regression Analysis is Used Primarily for Prediction
A statistical model used to predict the values of a dependent or response variable based on values of at least one independent or explanatory variable

Correlation Analysis is Used to Measure Strength of the Association Between Numerical Variables

Types of Regression Models
[Figure: four scatter plots showing a Positive Linear Relationship, a Negative Linear Relationship, a Relationship NOT Linear, and No Relationship.]

The Scatter Diagram
Plot of all (Xi, Yi) pairs
[Figure: example scatter diagram.]

Simple Linear Regression Model
Relationship Between Variables Is a Linear Function
The Straight Line that Best Fits the Data

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

$Y_i$ = Dependent (Response) Variable
$\beta_0$ = Y intercept
$\beta_1$ = Slope
$X_i$ = Independent (Explanatory) Variable
$\varepsilon_i$ = Random Error

Error Variable: Required Conditions
The error $\varepsilon$ is a critical part of the regression model.
Four requirements involving the distribution of $\varepsilon$ must be satisfied.
The probability distribution of $\varepsilon$ is normal.
The mean of $\varepsilon$ is zero: $E(\varepsilon) = 0$.
The standard deviation of $\varepsilon$ is $\sigma_\varepsilon$ for all values of x.
The set of errors associated with different values of y are all independent.
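The four conditions on the error variable can be made concrete with a small simulation. The sketch below is only an illustration: the function name `simulate_model` and the parameter values (beta0 = 10, beta1 = 2, sigma = 3) are assumptions chosen for the example, not values from the slides. It draws independent, normally distributed errors with mean zero and a constant standard deviation, then builds Y from the linear model.

```python
import random
import statistics


def simulate_model(xs, beta0, beta1, sigma, seed=0):
    """Draw Y_i = beta0 + beta1 * X_i + eps_i with independent N(0, sigma^2) errors.

    beta0, beta1 and sigma are hypothetical population values for illustration.
    """
    rng = random.Random(seed)
    # Each error is drawn independently from the same normal distribution,
    # so the mean is zero and the spread does not depend on x.
    errors = [rng.gauss(0.0, sigma) for _ in xs]
    ys = [beta0 + beta1 * x + e for x, e in zip(xs, errors)]
    return ys, errors


xs = [float(x) for x in range(1, 2001)]
ys, errors = simulate_model(xs, beta0=10.0, beta1=2.0, sigma=3.0)

# The sample mean of the errors should be close to 0 and the sample
# standard deviation close to sigma.
print(statistics.mean(errors), statistics.stdev(errors))
```

With a large sample the printed mean and standard deviation should sit near 0 and 3; real data can only be checked against these conditions through the residuals of a fitted line.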

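Population Linear Regression Model

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

$\mu_{Y|X} = \beta_0 + \beta_1 X_i$

$\varepsilon_i$ = Random Error
[Figure: population regression line $\mu_{Y|X}$ plotted against X, with observed values of Y and the random error marked.]

Sample Linear Regression Model

$\hat{Y}_i = b_0 + b_1 X_i$

$\hat{Y}_i$ = Predicted Value of Y for observation i
$X_i$ = Value of X for observation i
$b_0$ = Sample Y intercept used as estimate of the population $\beta_0$
$b_1$ = Sample Slope used as estimate of the population $\beta_1$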

REGRESSION COEFFICIENTS
The regression equation that estimates the equation of the first order linear model is:

$\hat{y} = b_0 + b_1 x$

To calculate the estimates of the coefficients that minimize the sum of squared differences between the data points and the line, use the formulas (least squares method):

$b_1 = \frac{\mathrm{cov}(X, Y)}{s_x^2} = \frac{n \sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n \sum X_i^2 - (\sum X_i)^2}$ and $b_0 = \bar{Y} - b_1 \bar{X}$

EXCEL offers several approaches to regression, including trendlines, regression functions and the regression analysis tool.

You wish to examine the relationship between the square footage of produce stores and their annual sales. Sample data for 7 stores were obtained. Find the equation of the straight line that fits the data best.

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
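The least squares formulas for b1 and b0 can be applied directly to the seven-store sample. The following is a minimal sketch in plain Python; the function name `least_squares` and the variable names are assumptions for illustration, and Excel's regression tool remains the approach the slides themselves use.

```python
# Produce-store sample: square footage and annual sales ($000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def least_squares(x, y):
    """Return (b0, b1) from the sum-based least squares formulas."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # b1 = [n*sum(XY) - sum(X)*sum(Y)] / [n*sum(X^2) - (sum(X))^2]
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # b0 = Ybar - b1 * Xbar
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = least_squares(square_feet, annual_sales)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

The printed intercept and slope should agree with the coefficients in the Excel printout for the same data.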

Scatter Diagram Example
[Figure: Excel Output, scatter plot of Annual Sales ($000) against Square Feet.]

Simple Linear Regression Equation: Example

Equation for the Best Straight Line

$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$

From Excel Printout:
               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657

Graph of the Best Straight Line
[Figure: scatter plot of Annual Sales ($000) against Square Feet with the fitted line.]

Interpreting the Results

$\hat{Y}_i = 1636.415 + 1.487 X_i$

The slope of 1.487 means that for each increase of one unit in X, Y is estimated to increase by 1.487 units.
For each increase of 1 square foot in the size of the store, the model predicts that the expected annual sales increase by $1,487.

Inferences about the Slope: t Test
t Test for a Population Slope
Is there a Linear Relationship Between X and Y?

Null and Alternative Hypotheses
H0: $\beta_1 = 0$ (No Linear Relationship)
H1: $\beta_1 \neq 0$ (Linear Relationship)

Test Statistic:

$t = \frac{b_1 - \beta_1}{S_{b_1}}$, where $S_{b_1} = \frac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$ and df = n - 2

Standard Error of Estimate

$S_{YX} = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}}$

The standard deviation of the variation of observations around the regression line

Graph of the Best Straight Line
[Figure: scatter plot of Annual Sales ($000) against Square Feet with the fitted line.]
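The standard error of the estimate and the t statistic for the slope can be computed from the same sample. The sketch below is an illustration rather than a reproduction of Excel's internals; `slope_t_test` is a hypothetical helper name, and the null value of the slope defaults to 0 as in the hypotheses for the t test.

```python
import math

# Produce-store sample: square footage and annual sales ($000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def slope_t_test(x, y, beta1_null=0.0):
    """Return (S_YX, S_b1, t, df) for the test of H0: beta1 = beta1_null."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    # SSE: squared vertical distances between observations and the fitted line.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))   # standard error of the estimate
    s_b1 = s_yx / math.sqrt(ssx)      # standard error of the slope
    t = (b1 - beta1_null) / s_b1
    return s_yx, s_b1, t, n - 2


s_yx, s_b1, t_stat, df = slope_t_test(square_feet, annual_sales)
print(f"S_YX = {s_yx:.2f}, S_b1 = {s_b1:.4f}, t = {t_stat:.3f}, df = {df}")
```

The decision then comes from comparing the printed t value with the critical value for df = n - 2 taken from a t table.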

Inferences about the Slope: t Test Example

Example: Produce Stores
Data for 7 Stores: the square footage and annual sales sample tabulated above.

Regression Model Obtained:
$\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of this model is 1.487.
Is there a linear relationship between the square footage of a store and its annual sales?

H0: $\beta_1 = 0$
H1: $\beta_1 \neq 0$
$\alpha = .05$
df = 7 - 2 = 5
Critical Value(s): -2.5706 and 2.5706
[Figure: t distribution with rejection regions of area .025 in each tail, beyond -2.5706 and 2.5706.]

Test Statistic: From Excel Printout
               t Stat      P-value
Intercept      3.6244333   0.0151488
X Variable 1   9.009944    0.0002812

Decision:
Reject H0

Conclusion:
There is evidence of a linear relationship.

Inferences about the Slope: Confidence Interval Example

Confidence Interval Estimate of the Slope

$b_1 \pm t_{n-2} S_{b_1}$

Excel Printout for Produce Stores
               Lower 95%    Upper 95%
Intercept      475.810926   2797.01853
X Variable 1   1.06249037   1.91077694

At the 95% level of confidence, the confidence interval for the slope is (1.062, 1.911). It does not include 0.
Conclusion: There is a significant linear relationship between annual sales and the size of the store.

Measures of Variation: The Sum of Squares

SST = Total Sum of Squares
measures the variation of the Yi values around their mean $\bar{Y}$

SSR = Regression Sum of Squares
explained variation attributable to the relationship between X and Y

SSE = Error Sum of Squares
variation attributable to factors other than the relationship between X and Y

$SST = \sum (Y_i - \bar{Y})^2$
$SSR = \sum (\hat{Y}_i - \bar{Y})^2$
$SSE = \sum (Y_i - \hat{Y}_i)^2$
[Figure: SST, SSR and SSE shown as deviations at a given Xi, relative to the regression line and the mean $\bar{Y}$.]

Measures of Variation
The Sum of Squares: Example

Excel Output for Produce Stores
             SS
Regression   30380456.12
Residual     1871199.595
Total        32251655.71
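The three sums of squares and the confidence interval for the slope can be checked numerically on the produce-store sample. This is a sketch under stated assumptions: `sums_of_squares` is a hypothetical helper, and the critical value 2.5706 (df = 5, .025 in each tail) is taken from the slides rather than computed.

```python
import math

# Produce-store sample: square footage and annual sales ($000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
T_CRIT = 2.5706  # t value for df = 5 at the 95% level, as quoted in the slides


def sums_of_squares(x, y):
    """Return (b1, SSX, SST, SSR, SSE) for the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained variation
    return b1, ssx, sst, ssr, sse


b1, ssx, sst, ssr, sse = sums_of_squares(square_feet, annual_sales)

# Confidence interval for the slope: b1 +/- t * S_b1.
s_b1 = math.sqrt(sse / (len(square_feet) - 2)) / math.sqrt(ssx)
lower, upper = b1 - T_CRIT * s_b1, b1 + T_CRIT * s_b1

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"95% CI for slope: ({lower:.3f}, {upper:.3f})")
```

The printed SSR and SSE should add up to SST, and an interval that excludes 0 leads to the same conclusion as the t test.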

Testing the validity of the model
We pose the question:
Is there at least one independent variable linearly related to the dependent variable?
To answer the question we test the hypothesis
H0: $\beta_1 = 0$
H1: At least one $\beta_i$ is not equal to 0
If at least one $\beta_i$ is not equal to zero, the model is valid.

To test these hypotheses we perform an analysis of variance procedure.
The F test
Construct the F statistic

$F = \frac{MSR}{MSE}$, where $MSR = \frac{SSR}{k - 1}$ and $MSE = \frac{SSE}{n - k}$

Here k is the number of estimated regression coefficients, so k = 2 for simple linear regression and the error degrees of freedom are n - 2.
[Variation in y] = SSR + SSE.
Large F results from a large SSR. Then, much of the variation in y is explained by the regression model. The null hypothesis should be rejected; thus, the model is valid.
Rejection region: $F > F_{\alpha,\, k-1,\, n-k}$
Required conditions must be satisfied.

ANOVA - Summary Table

Source of Variation   Degrees of Freedom   Sum of Squares    Mean Square (Variance)   F Test Statistic
Explained (Factor)    k - 1                SSR               MSR = SSR/(k - 1)        F = MSR/MSE
Within (Error)        n - k                SSE               MSE = SSE/(n - k)
Total                 n - 1                SST = SSR + SSE

The Coefficient of Determination

$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$

Measures the proportion of variation that is explained by the independent variable X in the regression model

Coefficients of Determination (r2) and Correlation (r)
[Figure: four scatter plots, each with the fitted line $\hat{Y}_i = b_0 + b_1 X_i$: r2 = 1, r = +1; r2 = 1, r = -1; r2 = .8, r = +0.9; r2 = 0, r = 0.]

Measures of Variation: Example
Excel Output for Produce Stores

Regression Statistics
Multiple R          0.9705572
R Square            0.94198129
Adjusted R Square   0.93037754
Standard Error      611.751517   (Syx)
Observations        7

r2 = .94
94% of the variation in annual sales can be explained by the variability in the size of the store as measured by square footage
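The ANOVA quantities follow from the two sums of squares in the Excel output. The sketch below assumes that k counts the estimated coefficients (k = 2 here), which matches the k - 1 and n - k degrees of freedom in the summary table; `anova_summary` is a hypothetical helper name.

```python
import math


def anova_summary(ssr, sse, n, k):
    """Return (MSR, MSE, F, r2) from the regression and error sums of squares.

    k is assumed to be the number of estimated coefficients (intercept and slope).
    """
    msr = ssr / (k - 1)       # mean square for regression
    mse = sse / (n - k)       # mean square error
    f_stat = msr / mse
    r2 = ssr / (ssr + sse)    # SST = SSR + SSE
    return msr, mse, f_stat, r2


# Sums of squares from the Excel output for the produce stores.
msr, mse, f_stat, r2 = anova_summary(ssr=30380456.12, sse=1871199.595, n=7, k=2)
print(f"MSR = {msr:.2f}, MSE = {mse:.2f}, F = {f_stat:.3f}, r2 = {r2:.4f}")
print(f"sqrt(MSE) = {math.sqrt(mse):.2f}")  # equals the standard error of the estimate
```

In simple linear regression the F statistic is the square of the slope's t statistic, so both tests reach the same decision.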

Estimation of Predicted Values

Confidence Interval Estimate for $\mu_{Y|X}$
The Mean of Y given a particular Xi

$\hat{Y}_i \pm t_{n-2} \, S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$

$S_{YX}$ = Standard error of the estimate
$t_{n-2}$ = t value from table with df = n - 2
Size of interval varies according to distance away from mean, $\bar{X}$.

Estimation of Predicted Values

Confidence Interval Estimate for Individual Response Yi at a Particular Xi

$\hat{Y}_i \pm t_{n-2} \, S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$

Addition of this 1 increases the width of the interval from that for the mean of Y.

Interval Estimates for Different Values of X
[Figure: the fitted line with the Confidence Interval for the mean of Y and the wider Confidence Interval for an individual Yi, drawn at a given X relative to $\bar{X}$.]

Example: Produce Stores

Data for 7 Stores: the square footage and annual sales sample tabulated above.

Regression Model Obtained:
$\hat{Y}_i = 1636.415 + 1.487 X_i$

Predict the annual sales for a store with 2000 square feet.
Predicted Sales $\hat{Y}_i = 1636.415 + 1.487(2000) = 4610.42$ ($000)

Estimation of Predicted Values: Example
Confidence Interval Estimate for $\mu_{Y|X}$
Find the 95% confidence interval for the average annual sales for stores of 2,000 square feet.

$\bar{X} = 2350.29$, $S_{YX} = 611.75$, $t_{n-2} = t_5 = 2.5706$

$\hat{Y}_i \pm t_{n-2} \, S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.42 \pm 612.66$
Confidence interval for mean Y

Estimation of Predicted Values: Example
Confidence Interval Estimate for Individual Y
Find the 95% confidence interval for annual sales of one particular store of 2,000 square feet.

$\bar{X} = 2350.29$, $S_{YX} = 611.75$, $t_{n-2} = t_5 = 2.5706$

$\hat{Y}_i \pm t_{n-2} \, S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.42 \pm 1687.70$
Confidence interval for individual Y
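Both interval formulas can be evaluated for a store of 2,000 square feet from the raw sample. This is a sketch, not the slides' own calculation: `interval_half_widths` is a hypothetical helper, the critical value 2.5706 is taken from the slides, and the coefficients are left unrounded, so the point prediction differs slightly from one computed with coefficients rounded to three decimals.

```python
import math

# Produce-store sample: square footage and annual sales ($000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
T_CRIT = 2.5706  # t value for df = 5 at the 95% level, as quoted in the slides


def interval_half_widths(x, y, x_new, t_crit):
    """Return (y_hat, half-width for mean of Y, half-width for an individual Y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))
    # Term under the square root shared by both intervals.
    h = 1 / n + (x_new - x_bar) ** 2 / ssx
    y_hat = b0 + b1 * x_new
    mean_hw = t_crit * s_yx * math.sqrt(h)
    # The extra 1 accounts for the variation of a single observation.
    indiv_hw = t_crit * s_yx * math.sqrt(1 + h)
    return y_hat, mean_hw, indiv_hw


y_hat, mean_hw, indiv_hw = interval_half_widths(square_feet, annual_sales, 2000, T_CRIT)
print(f"prediction = {y_hat:.2f}")
print(f"mean of Y:    +/- {mean_hw:.2f}")
print(f"individual Y: +/- {indiv_hw:.2f}")
```

Calling the function with values of X farther from the sample mean shows both intervals widening, which is the behaviour described for the interval estimates.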
