
# Introduction to Regression and Correlation

MS (Statistics & Scientific Computing)

Outline

Introduction to statistics
Regression

Simple Linear Regression
Linear Correlation

Using EViews 6
Model/Formulas
Definition of Statistics

The study of the collection, organization, analysis, interpretation and
presentation of data. It deals with all aspects of data including the
planning of data collection in terms of the design
of surveys and experiments.
A mathematical body of science that pertains to the collection, analysis,
interpretation or explanation, and presentation of data.
Regression and Correlation

Regression provides a functional relationship
(Y=f(x)) between the variables; the function
represents the average relationship.
Correlation tells us the direction and the strength
of the relationship.

Introduction to Regression Analysis
Regression analysis is used to:
Predict the value of a dependent variable
based on the value of at least one
independent variable
Explain the impact of changes in an
independent variable on the dependent
variable
Regression models involve the following:
The unknown parameters, denoted as β, which may represent a scalar or
a vector.
The independent variables, X.
The dependent variable, Y.

The Linear Regression Model

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where: y = Dependent variable
x = Independent variable
β0 = Population y intercept
β1 = Population slope coefficient
ε = Random error term, or residual

β0 + β1x is the linear component; ε is the random error component.
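
To make the two components concrete, here is a minimal sketch that generates a few observations from this model. The parameter values (β0 = 100, β1 = 0.1, an error standard deviation of 40) and the x values are made up for illustration only; they are not estimates from any data set.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population parameters, chosen only for illustration
beta0 = 100.0   # population y intercept
beta1 = 0.1     # population slope coefficient
sigma = 40.0    # standard deviation of the random error

xs = [1000, 1500, 2000, 2500]                      # independent variable
errors = [random.gauss(0, sigma) for _ in xs]      # random error component
linear = [beta0 + beta1 * x for x in xs]           # linear component
ys = [lin + e for lin, e in zip(linear, errors)]   # dependent variable

for x, lin, e, y in zip(xs, linear, errors, ys):
    print(f"x={x}: linear={lin:.1f}, error={e:+.1f}, y={y:.1f}")
```

Each y differs from the point on the line only by its own random error, which is what the regression later tries to separate out.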
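The Linear Regression by graph

[Figure: the line y = β0 + β1x plotted against x, with intercept β0 and slope β1. For a given x_i, the vertical gap between the observed value of y and the predicted value of y on the line is the random error ε_i for that x value.]
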
Linear Regression Assumptions
1. Error values (ε) are statistically
independent
2. Error values are normally distributed
for any given value of x
3. The probability distribution of the
errors is normal
4. The probability distribution of the
errors has constant variance
5. The underlying relationship between
the x variable and the y variable is
linear
Sample Data for House Price Model

| House Price in \$1000s (y) | Square Feet (x) |
|---|---|
| 245 | 1400 |
| 312 | 1600 |
| 279 | 1700 |
| 308 | 1875 |
| 199 | 1100 |
| 219 | 1550 |
| 405 | 2350 |
| 324 | 2450 |
| 319 | 1425 |
| 255 | 1700 |

Manual method

| House price in \$1000s (y) | Square feet (x) | xy | y² | x² |
|---|---|---|---|---|
| 245 | 1400 | 343000 | 60025 | 1960000 |
| 312 | 1600 | 499200 | 97344 | 2560000 |
| 279 | 1700 | 474300 | 77841 | 2890000 |
| 308 | 1875 | 577500 | 94864 | 3515625 |
| 199 | 1100 | 218900 | 39601 | 1210000 |
| 219 | 1550 | 339450 | 47961 | 2402500 |
| 405 | 2350 | 951750 | 164025 | 5522500 |
| 324 | 2450 | 793800 | 104976 | 6002500 |
| 319 | 1425 | 454575 | 101761 | 2030625 |
| 255 | 1700 | 433500 | 65025 | 2890000 |
| Σy = 2865 | Σx = 17150 | Σxy = 5085975 | Σy² = 853423 | Σx² = 30983750 |

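Manual method

$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \quad\text{and}\quad a = \bar{y} - b\bar{x}$$

$$b = \frac{10(5085975) - (17150)(2865)}{10(30983750) - (17150)^2} \quad\text{and}\quad a = 286.5 - 0.10976(1715)$$

b = 0.10976
a = 98.2616

The fitted line has the form $\hat{y} = a + bx$:

$$\hat{y} = 98.2616 + 0.10976\,(\text{square feet})$$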
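
As a quick check on the hand arithmetic, the sketch below plugs the column totals into the same two formulas. The variable names are illustrative. Without intermediate rounding, b comes out at about 0.10977 and a at about 98.248; the slide's 98.2616 arises from rounding b to 0.10976 before computing a.

```python
# Column totals taken from the manual-method table above
n = 10
sum_x = 17150
sum_y = 2865
sum_xy = 5085975
sum_x2 = 30983750

# Least-squares slope and intercept from the summation formulas
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)

print(f"b = {b:.6f}")
print(f"a = {a:.4f}")
```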
By the software method

Dependent Variable: PRICE
Method: Least Squares
Date: 01/03/14 Time: 13:00
Sample: 1 10
Included observations: 10

Variable Coefficient Std. Error t-Statistic Prob.

C 98.24833 58.03348 1.692960 0.1289
SQUARE 0.109768 0.032969 3.329378 0.0104

R-squared 0.580817 Mean dependent var 286.5000
Adjusted R-squared 0.528419 S.D. dependent var 60.18536
S.E. of regression 41.33032 Akaike info criterion 10.45793
Sum squared resid 13665.57 Schwarz criterion 10.51844
Log likelihood -50.28963 Hannan-Quinn criter. 10.39154
F-statistic 11.08476 Durbin-Watson stat 3.221657
Prob(F-statistic) 0.010394
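
Several figures in this output can be reproduced with nothing more than the textbook formulas for simple least squares. The following is a sketch under that assumption, not a description of how EViews computes them internally; the list names `price` and `square` simply mirror the variable names in the output.

```python
import math

# Sample data: house price in $1000s and size in square feet
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
square = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
n = len(price)

mean_x = sum(square) / n
mean_y = sum(price) / n
sxx = sum((x - mean_x) ** 2 for x in square)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(square, price))

# Coefficients (rows C and SQUARE)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals and summary statistics
residuals = [y - (intercept + slope * x) for x, y in zip(square, price)]
sse = sum(e ** 2 for e in residuals)               # Sum squared resid
sst = sum((y - mean_y) ** 2 for y in price)
r_squared = 1 - sse / sst                          # R-squared
se_regression = math.sqrt(sse / (n - 2))           # S.E. of regression
se_slope = se_regression / math.sqrt(sxx)          # Std. Error of SQUARE
t_slope = slope / se_slope                         # t-Statistic of SQUARE
durbin_watson = sum(
    (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, n)
) / sse                                            # Durbin-Watson stat

print(f"C = {intercept:.5f}, SQUARE = {slope:.6f}")
print(f"R-squared = {r_squared:.6f}, S.E. of regression = {se_regression:.5f}")
print(f"Std. Error = {se_slope:.6f}, t = {t_slope:.6f}, DW = {durbin_watson:.6f}")
```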

Interpretation of the Intercept, $b_0$

$$\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$$

$b_0$ is the estimated average value of Y when the
value of X is zero (if x = 0 is in the range of
observed x values).
Here, no houses had 0 square feet, so $b_0 = 98.24833$
just indicates that, for houses
within the range of sizes observed,
\$98,248.33 is the portion of the house price
not explained by square feet.
Interpretation of the Slope Coefficient, $b_1$

$$\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$$

$b_1$ measures the estimated change
in the average value of Y as a
result of a one-unit change in X.
Here, $b_1 = 0.10977$ tells us that the estimated average
value of a house increases by
0.10977 × \$1000 = \$109.77 for
each additional square foot of size.
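
A small sketch of how the fitted equation is used: it evaluates the line at one size and shows that moving one square foot changes the estimate by exactly the slope. The 2,000 square foot house is a hypothetical example chosen inside the observed range of sizes, and the function name is illustrative.

```python
# Coefficients as rounded on the slide
b0 = 98.24833   # intercept, in $1000s
b1 = 0.10977    # slope, in $1000s per square foot

def predicted_price(square_feet):
    """Estimated average house price in $1000s for a given size."""
    return b0 + b1 * square_feet

price_2000 = predicted_price(2000)
change_per_sqft = predicted_price(2001) - predicted_price(2000)

print(f"Estimated price at 2000 sq ft: {price_2000:.2f} ($1000s)")
print(f"Change for one extra sq ft: {change_per_sqft * 1000:.2f} dollars")
```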
Coefficient of Determination, $R^2$

Portion of the total variation in the dependent variable
that is explained by variation in the independent variable.

It is also called R-squared and is denoted as $R^2$.

$$R^2 = \frac{SSR}{SST} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}}, \qquad 0 \le R^2 \le 1$$
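
For the house price data, the ratio can be worked out from numbers already on the slides. This sketch assumes the standard identity SST = SSR + SSE, where SSE is the "Sum squared resid" figure reported in the software output; the variable names are illustrative.

```python
n = 10
sum_y = 2865       # total of house prices
sum_y2 = 853423    # total of squared house prices
sse = 13665.57     # "Sum squared resid" from the software output

sst = sum_y2 - sum_y ** 2 / n   # total sum of squares
ssr = sst - sse                 # sum of squares explained by regression
r_squared = ssr / sst

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, R-squared = {r_squared:.6f}")
```

The result agrees with the R-squared line of the output: about 58% of the variation in house price is explained by variation in square feet.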
Scatter Plots and Correlation
Correlation analysis is used to measure strength of the
association (linear relationship) between two variables
A scatter plot (or scatter diagram) is used to show the
relationship between two variables
Calculating the Correlation Coefficient

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2]\,[n\sum y^2 - (\sum y)^2]}}$$
where: r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
Column totals for the house price data (y = house price in \$1000s, x = square feet; the same values used in the manual method), with n = 10:
Σy = 2865, Σx = 17150, Σxy = 5085975, Σy² = 853423, Σx² = 30983750
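$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2]\,[n\sum y^2 - (\sum y)^2]}} = \frac{10(5085975) - (17150)(2865)}{\sqrt{[10(30983750) - (17150)^2]\,[10(853423) - (2865)^2]}} = 0.762$$

r = 0.762 indicates a fairly strong positive
linear association between x and y. This agrees with the regression output, since r² = 0.762² ≈ 0.581 matches the reported R-squared.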
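
The same substitution can be checked numerically. This is a minimal sketch using only the column totals; the variable names are illustrative.

```python
import math

# Column totals for the house price data
n = 10
sum_x, sum_y = 17150, 2865
sum_xy = 5085975
sum_x2, sum_y2 = 30983750, 853423

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r = numerator / denominator

print(f"r = {r:.3f}, r squared = {r ** 2:.6f}")
```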
[Figure: scatter plot of the sample data, with house price in \$1000s on the horizontal axis (150 to 450) and square feet of house on the vertical axis (1,000 to 2,600)]

Summary
Simple regression is a statistical tool that attempts to
fit a straight line relationship between X (independent variable)
and Y (dependent variable).

The scatter plot gives us a visual clue about the nature
of the relationship between X and Y.

EViews or other statistical software is used to fit the model;
a good model will be statistically valid and will have a
reasonably high R-squared value.

A good model is then used to make predictions; when
making predictions, be sure to confine them within the
domain of Xs used to fit the model (i.e. interpolate);
we should avoid extrapolation.
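
One way to enforce the interpolation rule in practice is to have the prediction routine refuse inputs outside the fitted range. The sketch below assumes the house price model from the software output and the smallest and largest sizes in the sample (1,100 and 2,450 square feet); the function and constant names are hypothetical.

```python
# Range of square feet observed in the sample data
X_MIN, X_MAX = 1100, 2450

# Fitted coefficients from the software output (price in $1000s)
B0, B1 = 98.24833, 0.109768

def predict_price(square_feet):
    """Predict house price in $1000s, refusing to extrapolate."""
    if not X_MIN <= square_feet <= X_MAX:
        raise ValueError(
            f"{square_feet} sq ft is outside the fitted range "
            f"{X_MIN} to {X_MAX}; prediction would be an extrapolation"
        )
    return B0 + B1 * square_feet

print(f"{predict_price(1500):.2f}")   # inside the range: allowed
try:
    predict_price(3000)               # outside the range: rejected
except ValueError as err:
    print(err)
```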

Thank you