
Introduction to Linear Regression and Correlation Analysis

Scatter Diagrams
A scatter plot, also referred to as a scatter diagram, is a graph used to represent the relationship between two variables.

Dependent and Independent Variables


A dependent variable is the variable to be predicted or explained in a regression model. This variable is assumed to be functionally related to the independent variable.

Dependent and Independent Variables


An independent variable is the variable related to the dependent variable in a regression equation. The independent variable is used in a regression model to estimate the value of the dependent variable.

Two Variable Relationships

(Figure 11-1: scatter plots showing (a) Linear, (b) Linear, (c) Curvilinear, (d) Curvilinear, and (e) No Relationship.)

Correlation
The correlation coefficient is a quantitative measure of the strength of the linear relationship between two variables. The correlation ranges from -1.0 to +1.0. A correlation of +1.0 or -1.0 indicates a perfect linear relationship, whereas a correlation of 0 indicates no linear relationship.

Correlation
SAMPLE CORRELATION COEFFICIENT

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} $$

where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable

Correlation
SAMPLE CORRELATION COEFFICIENT or the algebraic equivalent:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$

Correlation
(Example 11-1)

(Table 11-1)

Sales (y)   Years (x)   yx        y²           x²
487         3           1,461     237,169      9
445         5           2,225     198,025      25
272         2           544       73,984       4
641         8           5,128     410,881      64
187         2           374       34,969       4
440         6           2,640     193,600      36
346         7           2,422     119,716      49
238         1           238       56,644       1
312         4           1,248     97,344       16
269         2           538       72,361       4
655         9           5,895     429,025      81
563         6           3,378     316,969      36

Totals:
4,855       55          26,091    2,240,687    329

Correlation
(Example 11-1)

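$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$

$$ r = \frac{12(26{,}091) - 55(4{,}855)}{\sqrt{\left[12(329) - (55)^2\right]\left[12(2{,}240{,}687) - (4{,}855)^2\right]}} = 0.8325 $$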
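The calculation in Example 11-1 can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function names `correlation_deviation_form` and `correlation_sum_form` are illustrative choices, not part of the original material. It evaluates both the deviation form and the algebraic equivalent on the Table 11-1 data so the two can be compared.

```python
import math

# Data from Table 11-1: years with Midwest (x) and sales (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]


def correlation_deviation_form(x, y):
    """Sample correlation computed from deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_sum_form(x, y):
    """Sample correlation computed from the algebraically equivalent raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r_dev = correlation_deviation_form(years, sales)
r_sum = correlation_sum_form(years, sales)
print(round(r_dev, 4), round(r_sum, 4))  # 0.8325 0.8325
```

Both forms should agree up to floating-point rounding; the raw-sum form is the one used in the hand calculation because it needs only the column totals.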

Correlation
(Example 11-1)

                      Sales          Years with Midwest
Sales                 1
Years with Midwest    0.832534056    1

The value 0.832534056 is the correlation between Years and Sales.

Excel Correlation Output (Figure 11-5)

Correlation
Spurious correlation occurs when there is a correlation between two otherwise unrelated variables.

Simple Linear Regression Analysis


Simple linear regression analysis analyzes the linear relationship that exists between a dependent variable and a single independent variable.

Simple Linear Regression Analysis


SIMPLE LINEAR REGRESSION MODEL (POPULATION MODEL)

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

where:
y = Value of the dependent variable
x = Value of the independent variable
$\beta_0$ = Population's y-intercept
$\beta_1$ = Slope of the population regression line
$\varepsilon$ = Error term, or residual

Simple Linear Regression Analysis


The simple linear regression model has four assumptions:
1. Individual values of the error terms, $\varepsilon_i$, are statistically independent of one another.
2. The distribution of all possible values of $\varepsilon$ is normal.
3. The distributions of possible $\varepsilon_i$ values have equal variances for all values of x.
4. The means of the dependent variable, y, for all specified values of the independent variable can be connected by a straight line called the population regression model.

Simple Linear Regression Analysis


REGRESSION COEFFICIENTS In the simple regression model, there are two coefficients: the intercept and the slope.

Simple Linear Regression Analysis


The regression slope coefficient gives the average change in the dependent variable for a unit increase in the independent variable. The slope coefficient may be positive or negative, depending on the relationship between the two variables.

Simple Linear Regression Analysis


The least squares criterion is used for determining a regression line that minimizes the sum of squared residuals.

Simple Linear Regression Analysis


A residual is the difference between the actual value of the dependent variable and the value predicted by the regression model.

$$ y - \hat{y} $$

Simple Linear Regression Analysis


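(Figure: Sales in Thousands (Y) plotted against Years with Company (X), with the line $\hat{y} = 150 + 60x$. For the highlighted point, the actual value is 312 and the predicted value is 390.)

Residual = 312 - 390 = -78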
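The residual in the figure can be checked with a few lines of code. This is a small sketch: the helper names `predict` and `residual` are hypothetical, and the x value of 4 is an assumption inferred from the fact that 150 + 60x = 390 only when x = 4 (the slide itself does not state the x value).

```python
def predict(x, b0=150.0, b1=60.0):
    """Predicted y from the line y_hat = 150 + 60x shown in the figure."""
    return b0 + b1 * x


def residual(actual_y, predicted_y):
    """Residual = actual value minus predicted value."""
    return actual_y - predicted_y


x_value = 4  # assumption: inferred from 150 + 60x = 390
predicted = predict(x_value)
res = residual(312, predicted)
print(predicted, res)  # 390.0 -78.0
```

A negative residual means the observed point lies below the regression line.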

Simple Linear Regression Analysis


ESTIMATED REGRESSION MODEL (SAMPLE MODEL)

$$ \hat{y}_i = b_0 + b_1 x $$

where:
$\hat{y}$ = Estimated, or predicted, y value
$b_0$ = Unbiased estimate of the regression intercept
$b_1$ = Unbiased estimate of the regression slope
x = Value of the independent variable

Simple Linear Regression Analysis


LEAST SQUARES EQUATIONS

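$$ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} $$

algebraic equivalent:

$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} $$

and

$$ b_0 = \bar{y} - b_1 \bar{x} $$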

Simple Linear Regression Analysis


SUM OF SQUARED ERRORS

$$ SSE = \sum y^2 - b_0 \sum y - b_1 \sum xy $$
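As a quick check that this shortcut formula matches the definition of SSE as the sum of squared residuals, the sketch below evaluates both on the Midwest sales data. It is an illustration only; the variable names are arbitrary, and the coefficients are computed inline with the least squares equations so the block runs on its own.

```python
# Midwest data: years with the company (x) and sales (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]

n = len(years)
sum_x = sum(years)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(years, sales))
sum_x2 = sum(x * x for x in years)
sum_y2 = sum(y * y for y in sales)

# Least squares slope and intercept (algebraic form).
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n

# Shortcut formula: SSE = sum(y^2) - b0 * sum(y) - b1 * sum(xy).
sse_shortcut = sum_y2 - b0 * sum_y - b1 * sum_xy

# Definition: sum of squared differences between actual and predicted y.
sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(years, sales))

print(round(sse_shortcut, 2), round(sse_direct, 2))
```

The shortcut subtracts large numbers of similar size, so with rounded coefficients (for example four decimals) it can drift noticeably from the direct sum; using full-precision coefficients avoids that.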

Simple Linear Regression Analysis


(Midwest Example)

(Table 11-3)

Sales (y)   Years (x)   xy        y²           x²
487         3           1,461     237,169      9
445         5           2,225     198,025      25
272         2           544       73,984       4
641         8           5,128     410,881      64
187         2           374       34,969       4
440         6           2,640     193,600      36
346         7           2,422     119,716      49
238         1           238       56,644       1
312         4           1,248     97,344       16
269         2           538       72,361       4
655         9           5,895     429,025      81
563         6           3,378     316,969      36

Totals:
4,855       55          26,091    2,240,687    329

Simple Linear Regression Analysis


(Table 11-3)

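$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{26{,}091 - \dfrac{55(4{,}855)}{12}}{329 - \dfrac{(55)^2}{12}} = 49.9101 $$

$$ b_0 = \bar{y} - b_1 \bar{x} = 404.5833 - 49.9101(4.5833) = 175.8288 $$

The least squares regression line is:

$$ \hat{y} = 175.8288 + 49.9101(x) $$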
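The hand calculation can be reproduced in code. The following is a minimal sketch in plain Python; `fit_simple_regression` is a hypothetical helper name, and it applies the algebraic form of the least squares equations to the Table 11-3 data.

```python
# Data from Table 11-3: years with Midwest (x) and sales (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]


def fit_simple_regression(x, y):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Slope: (sum(xy) - sum(x)sum(y)/n) / (sum(x^2) - (sum(x))^2/n)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: y_bar - b1 * x_bar
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = fit_simple_regression(years, sales)
print(f"y_hat = {b0:.4f} + {b1:.4f} x")  # y_hat = 175.8288 + 49.9101 x
```

Once fitted, a prediction for a given number of years is just `b0 + b1 * x`. As with any regression line, predictions are most trustworthy inside the range of x values in the sample (here, 1 to 9 years).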

Simple Linear Regression Analysis


(Figure 11-11)
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.832534056
R Square             0.693112955
Adjusted R Square    0.662424251
Standard Error       92.10553441
Observations         12

ANOVA
              df    SS             MS             F              Significance F
Regression    1     191600.622     191600.622     22.58527906    0.000777416
Residual      10    84834.29469    8483.429469
Total         11    276434.9167

                      Coefficients    Standard Error    t Stat         P-value        Lower 95%      Upper 95%      Lower 95.0%    Upper 95.0%
Intercept             175.8288191     54.98988674       3.197475563    0.00953244     53.30369475    298.3539434    53.30369475    298.3539434
Years with Midwest    49.91007584     10.50208428       4.752397191    0.000777416    26.50996978    73.3101819     26.50996978    73.3101819

Excel Midwest Distribution Results

Least Squares Regression Properties


1. The sum of the residuals from the least squares regression line is 0.
2. The sum of the squared residuals is a minimum.
3. The simple regression line always passes through the mean of the y variable and the mean of the x variable.
4. The least squares coefficients are unbiased estimates of $\beta_0$ and $\beta_1$.
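The first three properties can be checked numerically on the Midwest data. The sketch below is illustrative only: the helper `sse` and the perturbation sizes are arbitrary choices, and comparing against a few nearby lines demonstrates the minimum property without proving it. Unbiasedness is a statement about repeated sampling and cannot be verified from a single sample, so it is not checked here.

```python
# Midwest data: years with the company (x) and sales (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(sales) / n

# Least squares coefficients (deviation form).
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, sales)) / sum(
    (x - x_bar) ** 2 for x in years
)
b0 = y_bar - b1 * x_bar


def sse(intercept, slope):
    """Sum of squared residuals for the line y_hat = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, sales))


# Property 1: the residuals sum to zero (up to floating-point rounding).
sum_of_residuals = sum(y - (b0 + b1 * x) for x, y in zip(years, sales))

# Property 2: nearby lines have a larger sum of squared residuals.
best = sse(b0, b1)
nearby = [sse(b0 + 1, b1), sse(b0 - 1, b1), sse(b0, b1 + 0.5), sse(b0, b1 - 0.5)]

# Property 3: the line passes through the point (x_bar, y_bar).
y_hat_at_mean = b0 + b1 * x_bar

print(abs(sum_of_residuals) < 1e-8)
print(all(best < other for other in nearby))
print(abs(y_hat_at_mean - y_bar) < 1e-9)
```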

Simple Linear Regression Analysis


SUM OF RESIDUALS

$$ \sum (y - \hat{y}) = 0 $$

SUM OF SQUARED RESIDUALS

$$ \sum (y - \hat{y})^2 $$

Simple Linear Regression Analysis


TOTAL SUM OF SQUARES

$$ TSS = \sum (y - \bar{y})^2 $$

where:
TSS = Total sum of squares
n = Sample size
y = Values of the dependent variable
$\bar{y}$ = Average value of the dependent variable

Simple Linear Regression Analysis


SUM OF SQUARES ERROR (RESIDUALS)

$$ SSE = \sum (y - \hat{y})^2 $$

where:
SSE = Sum of squares error
n = Sample size
y = Values of the dependent variable
$\hat{y}$ = Estimated value for the average of y for the given x value

Simple Linear Regression Analysis


SUM OF SQUARES REGRESSION

$$ SSR = \sum (\hat{y} - \bar{y})^2 $$

where:
SSR = Sum of squares regression
$\bar{y}$ = Average value of the dependent variable
y = Values of the dependent variable
$\hat{y}$ = Estimated value for the average of y for the given x value

Simple Linear Regression Analysis


SUMS OF SQUARES

$$ TSS = SSE + SSR $$
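The decomposition can be verified on the Midwest data. This is a minimal sketch with arbitrary variable names; it computes each sum of squares from its definition and compares the total against the sum of the two parts.

```python
# Midwest data: years with the company (x) and sales (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(sales) / n

# Least squares coefficients.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, sales)) / sum(
    (x - x_bar) ** 2 for x in years
)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in years]

tss = sum((y - y_bar) ** 2 for y in sales)               # total variation
sse = sum((y - f) ** 2 for y, f in zip(sales, fitted))   # unexplained variation
ssr = sum((f - y_bar) ** 2 for f in fitted)              # explained variation

print(f"TSS = {tss:.2f}")
print(f"SSE = {sse:.2f}")
print(f"SSR = {ssr:.2f}")
print(f"SSE + SSR = {sse + ssr:.2f}")
```

The three values should match the SS column of the ANOVA table in the Excel output (Total, Residual, and Regression rows respectively).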

Simple Linear Regression Analysis


The coefficient of determination is the portion of the total variation in the dependent variable that is explained by its relationship with the independent variable. The coefficient of determination is also called R-squared and is denoted as $R^2$.

Simple Linear Regression Analysis


COEFFICIENT OF DETERMINATION ($R^2$)

$$ R^2 = \frac{SSR}{TSS} $$

Simple Linear Regression Analysis


(Midwest Example)

COEFFICIENT OF DETERMINATION ($R^2$)

$$ R^2 = \frac{SSR}{TSS} = \frac{191{,}600.62}{276{,}434.90} = 0.6931 $$

69.31% of the variation in the sales data for this sample can be explained by the linear relationship between sales and years of experience.

Simple Linear Regression Analysis


COEFFICIENT OF DETERMINATION SINGLE INDEPENDENT VARIABLE CASE

$$ R^2 = r^2 $$

where:
$R^2$ = Coefficient of determination
r = Simple correlation coefficient
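For the single independent variable case, this identity can be confirmed numerically. The sketch below computes $R^2$ as SSR/TSS and, separately, the correlation coefficient r from the Midwest data, then compares $R^2$ with the square of r. Variable names are illustrative.

```python
import math

# Midwest data: years with the company (x) and sales (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(sales) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, sales))
sxx = sum((x - x_bar) ** 2 for x in years)
syy = sum((y - y_bar) ** 2 for y in sales)  # this is also TSS

# Simple correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

# Coefficient of determination from the sums of squares.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
ssr = sum((b0 + b1 * x - y_bar) ** 2 for x in years)
r_squared = ssr / syy

print(round(r_squared, 4), round(r ** 2, 4))  # 0.6931 0.6931
```

The equality holds only for simple regression with one independent variable, which is the case this slide covers.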

Simple Regression Steps


1. Develop a scatter plot of y and x. You are looking for a linear relationship between the two variables.
2. Calculate the least squares regression line for the sample data.
3. Calculate the correlation coefficient and the simple coefficient of determination, $R^2$.
4. Conduct one of the significance tests.

Residual Analysis
Before using a regression model for description or prediction, check whether the assumptions concerning the normal distribution and constant variance of the error terms have been satisfied. One way to do this is through the use of residual plots.
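The values needed for a residual plot can be produced as follows. This is a rough sketch on the Midwest data, not a full diagnostic procedure: it lists each residual next to its fitted value, along with the residual divided by the standard error of the estimate, $\sqrt{SSE/(n-2)}$. That scaling is a simplification chosen here for illustration. Plotting the residuals against x or against the fitted values (with any plotting tool) and looking for a random, evenly spread band around zero is the usual visual check.

```python
import math

# Midwest data: years with the company (x) and sales (y).
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(sales) / n

# Least squares coefficients.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, sales)) / sum(
    (x - x_bar) ** 2 for x in years
)
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in years]
residuals = [y - f for y, f in zip(sales, fitted)]

# Standard error of the estimate: sqrt(SSE / (n - 2)).
std_error = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Residuals scaled by the standard error (a simple, informal standardization).
scaled = [e / std_error for e in residuals]

# Print rows ordered by x so patterns across x are easier to spot.
for x, f, e, z in sorted(zip(years, fitted, residuals, scaled)):
    print(f"x={x:2d}  fitted={f:8.2f}  residual={e:8.2f}  scaled={z:6.2f}")
```

A funnel shape in the residuals (spread growing or shrinking with x) would suggest non-constant variance, and a curved pattern would suggest the relationship is not linear.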

Key Terms
Coefficient of Determination
Correlation Coefficient
Dependent Variable
Independent Variable
Least Squares Criterion
Regression Coefficients
Regression Slope Coefficient
Residual
Scatter Plot
Simple Linear Regression Analysis
Spurious Correlation