
REGRESSION

1.2 REGRESSION ANALYSIS

1.2.1 Introduction to Regression Analysis

• Regression analysis is a statistical tool that enables us to obtain an equation
that represents the best prediction of a dependent variable (Y) from one or more
independent variables (X).

• The goal in regression analysis is to develop a statistical model that can be used
to predict the values of a dependent variable based on the values of at least one
independent variable.

• Regression analysis is used when Y is correlated with X.

• At least one X must be continuous.

• The X variables can be a mixture of qualitative and quantitative variables.

1.2.2 Objective or Use of Regression Analysis

Regression analysis serves 3 major objectives. It is used for:


• Description
• Control
• Prediction

Example:
• To describe the effect of income on expenditure.
• To increase the export of rubber by controlling other factors such as price.
• To predict the price of houses based on lot size and location.

1.2.3 Examples of Application of Regression Analysis

• To determine expenditure (response variable) based on income ($X_1$),
gender ($X_2$) and number of children ($X_3$).
• To predict the Final Exam Score (Y) based on Test ($X_1$) and Quiz ($X_2$)
marks.
• To predict the export of rubber (Y) based on production ($X_1$), stock price
($X_2$), domestic usage ($X_3$) and import of rubber ($X_4$).
• To predict the price of houses (Y) based on lot size ($X_1$), number of
bedrooms ($X_2$), number of bathrooms ($X_3$), and location ($X_4$).

1.2.4 SIMPLE LINEAR REGRESSION


• Simple linear regression (SLR) consists of one dependent variable (Y) and one independent variable (X).
• Simple: there is only one predictor variable (X).
• Linear: the equation represents a straight line.
• Linear in the parameters: no parameter appears as an exponent or is
multiplied or divided by another parameter (β2, β3).
• Linear in the predictor variable: the predictor variable appears only in the first
power.

• The SLR model is a basic regression model in which there is only one independent
variable and one response variable.

• Modeling
  o Simple Linear Regression Model: $\hat{Y} = a + bX$
  o Multiple Linear Regression Model: $\hat{Y} = a + bX_1 + cX_2$
  o Polynomial Regression Model: $\hat{Y} = a + bX + cX^2$

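• The SLR model can be stated as follows:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$

where
$Y_i$ is the value of the response variable in the ith trial,
$\beta_0$ and $\beta_1$ are regression parameters,
$X_i$ is a known constant (the value of the independent variable in the ith
trial),
$\varepsilon_i$ is a random error with mean $E(\varepsilon_i) = 0$ and variance $V(\varepsilon_i) = \sigma^2$,
$\varepsilon_i$ and $\varepsilon_j$ are uncorrelated (i.e., $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \ne j$).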

• The parameters $\beta_0$ and $\beta_1$ are called the regression coefficients.
  o $\beta_0$ is the Y-intercept of the regression line.
  o $\beta_1$ is the slope of the regression line. It indicates the change in E(Y)
    (the mean of the response variable) for every unit increase in X.
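
To make the model statement concrete, the following minimal Python sketch generates responses from $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$. The parameter values, the X values and the function name `simulate_slr` are assumptions chosen only for illustration. The errors are drawn from a normal distribution purely for convenience; the model itself only requires mean 0 and constant variance.

```python
import random

def simulate_slr(xs, beta0, beta1, sigma, seed=0):
    """Generate Y_i = beta0 + beta1 * X_i + eps_i for known constants X_i."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    ys = []
    for x in xs:
        eps = rng.gauss(0.0, sigma)  # random error: mean 0, variance sigma**2
        ys.append(beta0 + beta1 * x + eps)
    return ys

# Hypothetical settings, chosen only for illustration.
xs = [10, 20, 30, 40, 50]
ys = simulate_slr(xs, beta0=25.0, beta1=0.5, sigma=2.0)

# With sigma = 0 there is no error, so each Y_i equals E(Y_i) = beta0 + beta1 * X_i.
means = simulate_slr(xs, beta0=25.0, beta1=0.5, sigma=0.0)
print(ys)
print(means)
```

Comparing `ys` with `means` shows the role of the error term: the observed responses scatter around the regression line rather than falling exactly on it.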

1.2.5 COEFFICIENT OF DETERMINATION ($r^2$)

How well does the regression model fit the data?

• To answer this question, we can use the correlation coefficient (r).

• One way to measure the strength of the relationship between Y and X is to
calculate the coefficient of determination ($r^2$).

• The coefficient of determination is the ratio of the explained variation to the total
variation.

• This coefficient helps a researcher to evaluate a regression line. If the line fits
the data well, we can use the regression line for forecasting.

• The bigger the value of $r^2$, the better the model fits the data. (However, you should
not use $r^2$ as the only criterion in determining how well the model fits the data;
the adjusted $R^2$, for example, can also be considered.)

• $r^2$ is the square of the correlation coefficient (r).

• Usually $r^2$ is expressed as a percentage.

Coefficient of Determination

$$
r^2 = (r)^2 = \left( \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \right)^2
= \frac{\left( \sum xy - \dfrac{\sum x \sum y}{n} \right)^2}{\left( \sum x^2 - \dfrac{\left(\sum x\right)^2}{n} \right) \left( \sum y^2 - \dfrac{\left(\sum y\right)^2}{n} \right)}
$$

$$0 \le r^2 \le 1$$
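
As a rough check on this definition, the sketch below computes $r^2$ in two ways for the sales and earnings data of Example 2: from $S_{xy}$, $S_{xx}$ and $S_{yy}$, and as the ratio of explained variation to total variation. The function names are illustrative assumptions, and the fitted line inside `r_squared_from_variation` uses the usual least-squares slope $b = S_{xy} / S_{xx}$.

```python
def corrected_sums(xs, ys):
    """Return S_xx, S_yy and S_xy computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy

def r_squared_from_sums(xs, ys):
    """r^2 = S_xy^2 / (S_xx * S_yy)."""
    s_xx, s_yy, s_xy = corrected_sums(xs, ys)
    return s_xy ** 2 / (s_xx * s_yy)

def r_squared_from_variation(xs, ys):
    """r^2 = explained variation / total variation."""
    n = len(xs)
    s_xx, _, s_xy = corrected_sums(xs, ys)
    b = s_xy / s_xx                      # least-squares slope
    y_bar = sum(ys) / n
    a = y_bar - b * sum(xs) / n          # least-squares intercept
    explained = sum((a + b * x - y_bar) ** 2 for x in xs)
    total = sum((y - y_bar) ** 2 for y in ys)
    return explained / total

# Data from Example 2: Sales (X) and Earnings (Y), in RM million.
sales = [89.2, 18.6, 18.2, 71.7, 58.6, 46.8, 17.5, 11.9]
earnings = [4.9, 4.4, 1.3, 8.0, 6.6, 4.1, 2.6, 1.7]

print(round(r_squared_from_sums(sales, earnings), 4))
print(round(r_squared_from_variation(sales, earnings), 4))
```

Both routes should give the same value, which should match the $r^2$ reported in Example 2 up to rounding.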

What does $r^2$ imply?

For example, suppose $r^2 = 0.82$ for monthly income (X) and entertainment expenditure (Y).
This indicates that the monthly income (X) can explain 82% of the total variation in
entertainment expenditure (Y).

Is X a good indicator in predicting Y?

Yes, because $r^2$ is large. (Note: $r^2$ is considered large if it lies between 0.60 and 0.99.)

Format: How to explain $r^2$

This indicates that ________ (X) can explain ($r^2 \times 100$)% of the total variation in _______ (Y).
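Example 1

[Figure: the regression line E(Y) = 25 + 0.5X, with Test Marks on the horizontal axis and Final Marks on the vertical axis; the line meets the vertical axis at 25.]

$\beta_0 = 25$ (Y-intercept)
$\beta_1 = 0.5$ (slope)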

a = 25: The Y-intercept of the regression line is at y = 25.

b = 0.5: The slope of the regression line is 0.5.

For every 1-mark increase in Test Marks, the average Final Marks increases by
0.5.

Or

For every 10-mark increase in Test Marks, the average Final Marks increases
by 5.
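
The interpretation of the slope and intercept can be verified numerically. The short sketch below evaluates the line E(Y) = 25 + 0.5X at a few Test Marks; the function name `mean_final_marks` and the starting mark of 60 are assumptions chosen only for illustration.

```python
def mean_final_marks(test_marks, a=25.0, b=0.5):
    """E(Y) = a + b * X for the regression line in Example 1."""
    return a + b * test_marks

# A 1-mark increase in Test Marks changes the mean Final Marks by b.
one_mark_change = mean_final_marks(61) - mean_final_marks(60)

# A 10-mark increase changes the mean Final Marks by 10 * b.
ten_mark_change = mean_final_marks(70) - mean_final_marks(60)

# At X = 0 the line is at the Y-intercept a.
at_zero = mean_final_marks(0)

print(one_mark_change, ten_mark_change, at_zero)
```

The same differences are obtained from any starting mark, because the slope of a straight line is constant.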

• Formula for the Regression Line

$$\hat{Y} = a + bX$$

$$
b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}
\quad \text{OR} \quad
b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}
$$

$$
a = \bar{y} - b\,\bar{x}
\quad \text{OR} \quad
a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}
$$

where a is the y-intercept and b is the slope of the line.
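
The two forms of the slope formula are algebraically the same (the second is the first with numerator and denominator divided by n). The sketch below computes both on a small hypothetical data set that lies exactly on the line y = 1 + 2x, so the expected slope and intercept are known in advance. The function names and the data are assumptions for illustration only.

```python
def slope_form_1(xs, ys):
    """b = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - (Sum(x))^2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

def slope_form_2(xs, ys):
    """b = (Sum(xy) - Sum(x)*Sum(y)/n) / (Sum(x^2) - (Sum(x))^2/n)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)

def intercept(xs, ys, b):
    """a = Sum(y)/n - b * Sum(x)/n, i.e. y-bar minus b times x-bar."""
    n = len(xs)
    return sum(ys) / n - b * sum(xs) / n

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

b1 = slope_form_1(xs, ys)
b2 = slope_form_2(xs, ys)
a = intercept(xs, ys, b1)
print(b1, b2, a)  # both slopes are 2.0 and the intercept is 1.0
```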

Example 2

A recent magazine article listed the "Best Small and Medium Companies" for the year 2006. A random
sample of 8 companies was selected, and the sales and earnings, in million ringgit, are reported below:

Sales (RM million)    Earnings (RM million)
89.2                  4.9
18.6                  4.4
18.2                  1.3
71.7                  8.0
58.6                  6.6
46.8                  4.1
17.5                  2.6
11.9                  1.7

i) Construct a scatter diagram. Hence, describe the relationship that might exist
between earnings and sales based on the scatter diagram.

[Scatter plot: Earnings (Y) against Sales (X)]

Comment: There exists a strong positive linear relationship between Sales (X) and Earnings
(Y).

ii) Compute the Pearson coefficient of correlation.

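$\sum X = 332.5$, $\sum Y = 33.6$, $n = 8$

$\sum X^2 = 19846.79$, $\sum Y^2 = 179.08$, $\sum XY = 1760.55$

$$
r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[ n \sum X^2 - \left(\sum X\right)^2 \right] \left[ n \sum Y^2 - \left(\sum Y\right)^2 \right]}}
$$

$$
r = \frac{8(1760.55) - (332.5)(33.6)}{\sqrt{\left[ 8(19846.79) - (332.5)^2 \right] \left[ 8(179.08) - (33.6)^2 \right]}}
= \frac{2912.4}{\sqrt{(48218.07)(303.68)}}
$$

$$r = 0.7611$$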
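
The arithmetic in part (ii) can be reproduced directly from the six summary quantities. The following sketch assumes nothing beyond those sums; the function name `pearson_r_from_sums` is an illustrative choice.

```python
import math

def pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson correlation coefficient computed from summary sums."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Summary sums from Example 2.
r = pearson_r_from_sums(
    n=8, sum_x=332.5, sum_y=33.6,
    sum_x2=19846.79, sum_y2=179.08, sum_xy=1760.55,
)
print(round(r, 4))
```

Working from the sums rather than the raw data mirrors the hand calculation, so each intermediate quantity (2912.4, 48218.07, 303.68) can be checked against the printed steps.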

iii) Compute the coefficient of determination and interpret the value obtained.

$$r^2 = 0.5793$$

This indicates that Sales (X) can explain 57.93% of the total variation in
Earnings (Y).

iv) Determine the regression equation by using the least squares method.

$$
b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - \left(\sum X\right)^2}
= \frac{8(1760.55) - (332.5)(33.6)}{8(19846.79) - (332.5)^2} = 0.0604
$$

$$
a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}
= \frac{33.6}{8} - 0.0604 \left( \frac{332.5}{8} \right) = 1.6897
$$

$$\hat{Y} = a + bX$$

$$\hat{Y} = 1.6897 + 0.0604X$$

v) Estimate the earnings of a company with RM50 million in sales.

$$X = 50$$

$$\hat{Y} = 1.6897 + 0.0604(50) = 4.7097$$

The estimated earnings are RM 4.7097 million.
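
Parts (iv) and (v) can be cross-checked from the raw data with a short Python sketch. The function name `fit_line` is an assumption for illustration. Because the hand calculation rounds b to four decimal places before computing a, the unrounded results here may differ from the figures above in the last decimal place.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

# Data from Example 2: Sales (X) and Earnings (Y), in RM million.
sales = [89.2, 18.6, 18.2, 71.7, 58.6, 46.8, 17.5, 11.9]
earnings = [4.9, 4.4, 1.3, 8.0, 6.6, 4.1, 2.6, 1.7]

a, b = fit_line(sales, earnings)

# Estimated earnings for a company with RM50 million in sales.
predicted = a + b * 50
print(round(a, 4), round(b, 4), round(predicted, 4))
```

Note that RM50 million lies inside the range of observed sales (11.9 to 89.2), so this estimate is an interpolation rather than an extrapolation.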

Steps to Build a Simple Linear Regression


1. Draw a scatter plot (is the relationship strong or weak, negative or positive, linear or
curvilinear?).
2. Compute the correlation matrix / r (to confirm the relationship, we calculate r). Proceed to
the next step if and only if the correlation (r) is significant. Otherwise, stop.
3. Construct the SLR model.
4. Write the estimated SLR model: $\hat{y} = a + bx$
5. Interpret the coefficients (a and b).
6. Interpret $R^2$.
7. Check whether the regression model is appropriate:
a. Q-Q plot (to check the normality assumption)
b. Scatter plot of residuals vs. fitted values (to check the constant variance assumption)
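
The numerical part of these steps can be sketched in Python using the Example 2 data. This is only an outline under stated assumptions: the plots in steps 1 and 7 and the significance test in step 2 are not implemented, and the variable names are illustrative. The sums of squares are written in deviation form, which is equivalent to the raw-sum formulas used earlier.

```python
import math

# Data from Example 2: Sales (X) and Earnings (Y), in RM million.
sales = [89.2, 18.6, 18.2, 71.7, 58.6, 46.8, 17.5, 11.9]
earnings = [4.9, 4.4, 1.3, 8.0, 6.6, 4.1, 2.6, 1.7]
n = len(sales)

x_bar = sum(sales) / n
y_bar = sum(earnings) / n
s_xx = sum((x - x_bar) ** 2 for x in sales)
s_yy = sum((y - y_bar) ** 2 for y in earnings)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sales, earnings))

# Step 2: correlation coefficient r.
r = s_xy / math.sqrt(s_xx * s_yy)

# Steps 3-4: estimated model y-hat = a + b * x.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Step 6: coefficient of determination.
r_squared = r ** 2

# Step 7: fitted values and residuals. These are the quantities that would be
# drawn in the Q-Q plot (sorted residuals) and the residuals-vs-fitted plot.
fitted = [a + b * x for x in sales]
residuals = [y - f for y, f in zip(earnings, fitted)]

print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
print(f"y-hat = {a:.4f} + {b:.4f} x")
for f, e in sorted(zip(fitted, residuals)):
    print(f"fitted = {f:.3f}, residual = {e:+.3f}")
```

Least-squares residuals always sum to zero (up to floating-point error), which gives a quick sanity check before any plots are drawn.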

TUTORIAL

1. The effect of hours of mixing on the temperature of wood pulp is shown below.

Hours of Mixing    Temperature of wood pulp
2                  21
4                  27
6                  29
8                  64
10                 86
12                 92

a. State the dependent variable and independent variable.


b. Draw a scatter diagram of the above data and comment on the
relationship between the two variables.
c. Calculate the product moment correlation coefficient between the
temperature of wood pulp and hours of mixing. Comment on the result
obtained.
d. Obtain the linear regression equation and interpret the equation
obtained.
e. Estimate the temperature of wood pulp if 7 hours of mixing are needed.

2. The following data represent birth weights (oz) of babies and their percentage
increase between 70 and 100 days after birth.

Birth Weight (oz)   % Increase   Birth Weight (oz)   % Increase
72                  68           125                 27
112                 63           126                 60
111                 66           122                 71
107                 72           126                 88
119                 52           127                 63
92                  75           86                  88
126                 76           142                 53
80                  118          132                 50
81                  120          87                  111
84                  114          123                 59
115                 29           133                 76
118                 42           106                 72
128                 48           103                 90
128                 50           118                 68
123                 68           114                 93
116                 59           94                  91

a. State the dependent variable and independent variable.


b. Draw a scatter diagram of the above data and comment on the
relationship between the two variables.
c. Calculate the product moment correlation coefficient between the
birth weight and the percentage increase. Comment on the result
obtained.
d. Obtain the linear regression equation and interpret the equation
obtained.
e. Calculate the coefficient of determination and explain its meaning.
f. Estimate the percentage increase for a baby with a birth weight of
100 oz.
3. The data in table below illustrate 10 major universities ranked for research
dollars awarded in health sciences and their football team’s conference ranking
in 2001. Calculate the Spearman’s Rank Correlation Coefficient and interpret
your result.

School             A   B   C   D   E   F   G   H   I   J
Research dollars   1   2   4   6   3   5   9   7   10  8
Football rankings  4   5   3   1   9   7   6   8   2   10

4. The data in the table below show the results of 10 students in two tests. Calculate the
Spearman’s Rank Correlation Coefficient and interpret your result.

Student        1   2   3   4   5   6   7   8   9   10
English Test   A+  B-  C   A   C   A-  B+  B   C   A
Mandarin Test  B   A   C   B   B+  C   A+  A-  B-  A