
Regression Analysis

Scatter Plots and Correlation

A scatter plot (or scatter diagram) is used to show the relationship between two variables.
Correlation analysis is used to measure the strength of the linear association between two variables.
It is concerned only with the strength of the relationship.
No causal effect is implied.
Examples

Salary and number of years of experience
Household income and expenditure
Price and supply of commodities
Amount of rainfall and yield of crops
Price and demand of goods
Weight and blood pressure
Sales and GDP
Scatter Plot Examples
[Figure: example scatter plots of y against x illustrating linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship]
Correlation Coefficient
The population correlation coefficient (ρ) measures the strength of the linear association between the variables.

The sample correlation coefficient (r) is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations.
Calculating sample Correlation Coefficient

$$r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x s_y}$$

$$\operatorname{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

$$s_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2}$$
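As a rough illustration, the sample correlation coefficient can be computed directly from these formulas. The data below are hypothetical, chosen only to demonstrate the calculation:

```python
import numpy as np

# Hypothetical data: years of experience (x) vs. salary in thousands (y)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([30, 35, 37, 42, 45, 50, 52, 58], dtype=float)
n = len(x)

# Covariance and standard deviations with the 1/n convention used above
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n
s_x = np.sqrt(np.sum((x - x.mean()) ** 2) / n)
s_y = np.sqrt(np.sum((y - y.mean()) ** 2) / n)

# The 1/n factors cancel, so r agrees with np.corrcoef(x, y)[0, 1]
r = cov_xy / (s_x * s_y)
print(r)
```

Note that whether 1/n or 1/(n-1) is used in the covariance and standard deviations does not matter for r, because the factors cancel in the ratio.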
Features of correlation coefficient
Unit free
Ranges between -1.00 and +1.00
-1 ≤ r < 0 implies that as X increases, Y decreases
0 < r ≤ 1 implies that as X increases, Y increases
The closer to -1.00, the stronger the negative linear relationship
The closer to +1.00, the stronger the positive linear relationship
The closer to 0.00, the weaker the linear relationship
r = 0 implies that X and Y are not linearly associated
Significance Test for Correlation

Hypotheses
H0: ρ = 0 (no linear correlation)
H1: ρ ≠ 0 (linear correlation)
Significance test for Correlation

Test statistic:

$$t_{obs} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} \quad \text{under } H_0$$

Critical region:

{t_obs ≥ t_{α; n-2}} (upper-tailed test)
{t_obs ≤ -t_{α; n-2}} (lower-tailed test)
{|t_obs| ≥ t_{α/2; n-2}} (two-tailed test)
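A sketch of the two-tailed version of this test in Python, on hypothetical data; SciPy's built-in `pearsonr` is used only as a cross-check:

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([30, 35, 37, 42, 45, 50, 52, 58], dtype=float)
n = len(x)

r = np.corrcoef(x, y)[0, 1]

# Test statistic for H0: rho = 0
t_obs = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)

# Two-tailed p-value from the t distribution with n-2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)
print(t_obs, p_value)
```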
What is Regression
Regression is a tool for establishing an association between a dependent variable (Y) and one or more independent variables (X1, X2, …, Xk) in a study.
The relationship can be linear or non-linear.
Mathematical vs Statistical Relationship
A mathematical relationship is exact:

$$y = \beta_0 + \beta_1 x$$

A statistical relationship is not exact:

$$y = \beta_0 + \beta_1 x + \varepsilon$$
Linear Regression Assumptions

Distribution of error: $\varepsilon_i \sim N(0, \sigma_e^2)$
Error values (ε) are statistically independent
Error values are normally distributed for any given value of x
$E(\varepsilon_i) = 0$, $\operatorname{var}(\varepsilon_i) = \sigma_e^2$, $\operatorname{cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
The probability distribution of the errors has constant variance
Nomenclature in Regression
A dependent variable (response variable)
measures an outcome of a study (also called
outcome variable).
An independent variable (explanatory
variable) explains changes in a response
variable.
Regression often sets values of the explanatory variable to see how it affects the response variable (i.e. to predict the response variable).
Caution

A regression model establishes the existence of an association between two variables, but not causation.

The terms "dependent" and "independent" do not necessarily imply a causal relationship between the two variables.

The purpose of regression is to predict the value of the dependent variable given the value(s) of the independent variable(s).
Steps in regression analysis
Statement of the problem under consideration
Identify the explanatory variable.
Specify the nature of relationship between
dependent variable and explanatory variables
Collection of data on relevant variables
Choice of method for fitting the data
Fitting of model
Model validation and criticism
Using the chosen model(s) for the solution of
the posed problem and forecasting
Population Linear Regression

The population regression model:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where y is the dependent variable, x is the independent variable, β0 is the population y-intercept, β1 is the population slope coefficient, and ε is the random error (residual) term. β0 + β1x is the linear (systematic) component and ε is the random error component.
Population Linear Regression

[Figure: the population regression line y = β0 + β1x, showing the intercept β0, the slope β1, and the random error εi for a given xi as the vertical distance between the observed value of y and the predicted value of y on the line]
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:

$$\hat{y}_i = b_0 + b_1 x_i$$

where ŷi is the estimated (or predicted) y value, b0 is the estimate of the regression intercept, b1 is the estimate of the regression slope, and xi is the independent variable.
Estimation of parameters
Least squares method of estimation
Confidence interval
Prediction interval
p-value
Interpretation of the Slope and the Intercept

b0 is the estimated average value of y when the value of x is zero
b1 is the estimated change in the average value of y as a result of a one-unit change in x
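The least-squares estimates can be sketched directly from the standard formulas b1 = SSxy/SSx and b0 = ȳ - b1·x̄. The data below are hypothetical:

```python
import numpy as np

# Hypothetical data: advertising spend (x) vs. sales (y)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.2])

ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
ss_x = np.sum((x - x.mean()) ** 2)

b1 = ss_xy / ss_x              # estimated slope
b0 = y.mean() - b1 * x.mean()  # estimated intercept

y_hat = b0 + b1 * x            # fitted (predicted) values
print(b0, b1)
```

The same estimates can be obtained with `np.polyfit(x, y, 1)`; the explicit formulas are shown here to mirror the derivation above.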
Assessing Model Accuracy

R²
Residual Standard Error (interpretation?)
F Statistic
Coefficient of Determination
Relationship Among SST, SSR, SSE
$$\text{SST} = \text{SSR} + \text{SSE}$$

$$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$$

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Goodness of fit of regression
Coefficient of Determination
It can be noted that a fitted model can be said to be good when the residuals are small. Since SSE is based on the residuals, a measure of the quality of the fitted model can be based on SSE; equivalently, R² measures relative fit by comparing SSR with SST:

R² = r² = SSR/SST
where:
SSR = sum of squares due to regression
SST = total sum of squares
A value of R² closer to 1 indicates a better fit, and a value closer to zero indicates a poor fit.
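The decomposition SST = SSR + SSE and the value of R² can be checked numerically. Continuing with the hypothetical data and least-squares fit from before:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.2])

# Least-squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # sum of squares due to regression
sse = np.sum((y - y_hat) ** 2)         # sum of squares due to error

r2 = ssr / sst  # coefficient of determination
print(r2)
```

In simple linear regression, R² equals the square of the sample correlation coefficient, which makes a convenient cross-check.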
Coefficient of Determination (example)

R2 = SSR/SST = 2.963146 / 13.45859 = 0.220168

The regression relationship is weak; only 22.02% of the variability in the demand for the item is explained by the linear model relating demand to price.
Residual Standard Error

$$\text{RSE} = \sqrt{\frac{\text{SSE}}{n-2}}$$
Interpretation: suppose RSE = 3.26, with sales measured in thousands of units. Then actual sales in each market deviate from the true regression line by approximately 3,260 units, on average. Another way to think about this: even if the model were correct and the true values of the unknown coefficients were known exactly, any prediction of sales on the basis of TV advertising would still be off by about 3,260 units on average.
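A minimal sketch of computing RSE for a simple linear fit, again on hypothetical data:

```python
import numpy as np

# Hypothetical data: advertising spend (x) vs. sales (y)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.2])
n = len(x)

# Least-squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)  # sum of squared residuals
rse = np.sqrt(sse / (n - 2))    # residual standard error
print(rse)
```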
Advertising data
The Advertising data set consists of the sales (in thousands of
units) of a particular product in 200 different markets, along with
advertising budgets (in thousands of dollars) for the product in
each of those markets for three different media: TV, radio, and
newspaper.

Suppose that in our role as statistical consultants we are asked to suggest, on the basis of these data, a marketing plan for next year that will result in high product sales. What information would be useful in order to provide such a recommendation?

Let us first check how sales are related to advertising expenditure.


Simple Linear Regression

Is there a relationship between advertising budget and sales?

How strong is the relationship between advertising budget and sales?
Scatter Diagram
Correlation coefficient
Test of correlation coefficient
Interpretation of regression coefficients and
corresponding s.e.
Confidence interval of parameters
p-value of t-tests
Test for Significance

To test for a significant regression relationship, we test the intercept parameter β0, the slope parameter β1, and the predicted y.

The test commonly used is the t test.

The t test requires an estimate of σe², the variance of the error in the regression model.
Testing for Significance
An estimate of σe²: the mean square error (MSE) provides an unbiased estimator of σe², given as

$$s^2 = \text{MSE} = \frac{\text{SSE}}{n-2}$$

where

$$\text{SSE} = \sum (y_i - \hat{y}_i)^2 = SS_y - \frac{(SS_{xy})^2}{SS_x} = SS_y - b_1 SS_{xy}$$
Testing for slope parameter
Hypotheses:

$$H_0: \beta_1 = \beta_{10}, \qquad H_1: \beta_1 \neq \beta_{10}$$

Test statistic, under H0:

$$t_{obs} = \frac{b_1 - \beta_{10}}{s_{b_1}}, \quad \text{where } s_{b_1} = \frac{s}{\sqrt{\sum_i (x_i - \bar{x})^2}}$$
Testing for intercept parameter
Hypotheses:

$$H_0: \beta_0 = \beta_{00}, \qquad H_1: \beta_0 \neq \beta_{00}$$

Test statistic, under H0:

$$t_{obs} = \frac{b_0 - \beta_{00}}{s_{b_0}}, \quad \text{where } s_{b_0} = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$$
Testing for Significance: t Test
Critical Region

Reject H0 if p-value < α, or if t_obs < -t_{α/2; n-2} or t_obs > t_{α/2; n-2}

where t_{α/2; n-2} is based on a t distribution with n - 2 degrees of freedom.
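A sketch of the slope t test (H0: β1 = 0) on hypothetical data, cross-checked against `scipy.stats.linregress`:

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.2])
n = len(x)

# Least-squares fit
ss_x = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / ss_x
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

s = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))  # sqrt of MSE
s_b1 = s / np.sqrt(ss_x)                         # standard error of b1

t_obs = b1 / s_b1                                # test statistic under H0
p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)   # two-tailed p-value
print(t_obs, p_value)
```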
Testing for Significance: Example

1. Determine the hypotheses: H0: β1 = 0, Ha: β1 ≠ 0
2. Specify the level of significance: α = .05
3. Select the test statistic: t = b1 / s_b1
4. State the rejection rule: reject H0 if p-value < .05 or |t| > 3.182 (with 3 degrees of freedom)
Testing for Significance: t Test

5. Compute the value of the test statistic:

   t = b1 / s_b1 = 5 / 1.08 = 4.63

6. Determine whether to reject H0: t = 4.541 provides an area of .01 in the upper tail, so the two-sided p-value is less than .02. (Also, t = 4.63 > 3.182.) We can reject H0.
Confidence Interval for 1
The form of a confidence interval for β1 is

$$b_1 \pm t_{\alpha/2}\, s_{b_1}$$

where b1 is the point estimator, t_{α/2} s_b1 is the margin of error, and t_{α/2} is the t value providing an area of α/2 in the upper tail of a t distribution with n - 2 degrees of freedom.
Confidence Interval for 0
The form of a confidence interval for β0 is

$$b_0 \pm t_{\alpha/2}\, s_{b_0}$$

where b0 is the point estimator, t_{α/2} s_b0 is the margin of error, and t_{α/2} is the t value providing an area of α/2 in the upper tail of a t distribution with n - 2 degrees of freedom.
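Both intervals can be sketched in a few lines on hypothetical data; `scipy.stats.t.ppf` supplies the critical value t_{α/2; n-2}:

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.2])
n = len(x)

# Least-squares fit and standard errors
ss_x = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / ss_x
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

s_b1 = s / np.sqrt(ss_x)
s_b0 = s * np.sqrt(1 / n + x.mean() ** 2 / ss_x)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # upper alpha/2 point, n-2 df

ci_b1 = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)  # 95% CI for beta1
ci_b0 = (b0 - t_crit * s_b0, b0 + t_crit * s_b0)  # 95% CI for beta0
print(ci_b0, ci_b1)
```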
Multiple Regression
Example
Suppose that we are statistical consultants hired
by a client to provide advice on how to improve
sales of a particular product.

The Advertising data set consists of the sales of


that product in 200 different markets, along with
advertising budgets for the product in each of
those markets for three different media: TV,
radio, and newspaper

Response or dependent variable?


Predictors or independent variable(s)?
Common questions in regression

Which predictors are associated with the response?

What is the relationship between the response and each predictor?

Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Advertising data
One may be interested in answering questions
such as:
Which media contribute to sales?
Which media generate the biggest boost in
sales? or
How much increase in sales is associated with
a given increase in TV advertising?
Multiple Regression
The model is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

Scatter matrix
Correlation matrix
Test of correlation coefficients
Interpretation of regression coefficients and
corresponding s.e.
Confidence interval of parameters
p-value of t-tests
Assessing Model Accuracy

R²
Adjusted R²
Residual Standard Error
F Statistic
Residual Standard Error

$$\text{RSE} = \sqrt{\frac{\text{SSE}}{n-k-1}}$$

(for a model with k explanatory variables)

Adjusted R²
If more explanatory variables are added to the model, then R² increases. If the added variables are irrelevant, R² will still increase and give an overly optimistic picture.
To correct this overly optimistic picture, the adjusted R², denoted R̄² or R²adj, is used, defined as

$$\bar{R}^2 = 1 - \frac{n-1}{n-k-1}\,(1 - R^2)$$

Note: R² will never decrease when a variable is added, but the same is not true for R²adj. If the model fits poorly, R²adj can even be negative.
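Adjusted R² can be computed from this formula on simulated data; here the third predictor has a true coefficient of zero, so it inflates R² without helping the fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3

X = rng.normal(size=(n, k))
# Third predictor is irrelevant (true coefficient zero)
y = 2.0 + X @ np.array([1.5, -0.7, 0.0]) + rng.normal(scale=0.5, size=n)

Xd = np.column_stack([np.ones(n), X])          # add intercept column
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # least-squares fit
y_hat = Xd @ beta

sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)

r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalized for k predictors
print(r2, adj_r2)
```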
Testing For Overall Significance of Model: F Test
Is there a relationship between the response and the predictors?

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$

$$H_1: \text{at least one } \beta_j \neq 0$$

Test statistic:

$$F = \frac{\text{SSR}/k}{\text{SSE}/(n-k-1)} \sim F_{k,\, n-k-1} \quad \text{under } H_0$$

Relation between R² and F

$$F = \frac{\text{SSR}/k}{\text{SSE}/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-k-1)}$$

When R² = 0, F = 0.
When R² = 1, F = ∞. So F and R² vary directly: a larger R² implies a greater F value.
That is why the F test under analysis of variance is termed the measure of overall significance of the estimated regression.
It is also a test of significance of R². If F is highly significant, it implies that we can reject H0, i.e. y is linearly related to the Xs.
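The equivalence of the two expressions for F can be checked numerically on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3

X = rng.normal(size=(n, k))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=n)

Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ beta

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
ssr = sst - sse
r2 = ssr / sst

# The two forms of the F statistic agree
f_from_sums = (ssr / k) / (sse / (n - k - 1))
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))
print(f_from_sums)
```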
Model Adequacy Checking
The fitting of the linear regression model, estimation of parameters, testing of hypotheses, and properties of the estimators are based on the following major assumptions:
The relationship between the study variable and
explanatory variables is linear, at least
approximately.
The errors are normally distributed
The error term has constant variance.
There is no Multi-collinearity (no perfect linear
relationship among explanatory variables).
Residuals Plot
The graphical analysis of residuals is a very
effective way to investigate the adequacy of
the fit of a regression model and to check the
underlying assumptions.
Typically, the residuals are plotted against the
fitted values.
Plots of residuals against the fitted values (heteroscedasticity)

If the plot is such that the residuals can be contained in a horizontal band (and the residuals fluctuate more or less randomly inside the band), then there are no obvious model defects.

If the plot is such that the residuals can be contained in an outward-opening funnel, then such a pattern indicates that the variance of the errors is not constant.

Presence of heteroscedasticity
Plots of residuals against the fitted values (heteroscedasticity)

If the plot is such that the residuals can be accommodated in an inward-opening funnel, then such a pattern indicates that the variance of the errors is not constant.

Presence of heteroscedasticity

If the plot is such that the residuals are contained inside a curved band, then it indicates nonlinearity: the assumed relationship between y and the Xs is non-linear. This could also mean that some other explanatory variables are needed in the model; for example, a squared term may be necessary. Transformations of the explanatory variables and/or the study variable may also be helpful in these cases.
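A numeric sketch of this diagnostic: fitting a straight line to simulated data with a true quadratic component leaves residuals that average to zero (as least squares with an intercept guarantees) but are strongly correlated with the squared predictor, which is the numerical signature of the curved band described above:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 40)
# True relationship is quadratic; we deliberately fit only a straight line
y = 1.0 + 0.5 * x + 0.3 * x ** 2 + rng.normal(scale=1.0, size=x.size)

# Simple linear least-squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

resid = y - y_hat

# OLS with an intercept forces the residuals to average to zero...
print(resid.mean())
# ...but a plot of resid against y_hat would show a clear U shape;
# numerically, the residuals correlate strongly with the squared term
curvature = np.corrcoef(resid, (x - x.mean()) ** 2)[0, 1]
print(curvature)
```

In practice one would plot `resid` against `y_hat` (e.g. with matplotlib) rather than rely on a single correlation, but the computation above captures the idea.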