Q.1 Systematic note on simple regression analysis, multiple regression analysis, and correlation analysis.
Simple linear regression
Regression analysis is used when two or more variables are thought to be systematically
connected by a linear relationship. In simple regression, we have only two variables; let us
designate them x and y, and we suppose that they are related by an expression of the form
y = b0 + b1x + e. We'll leave aside for a moment the nature of the variable e and focus on the
x-y relationship. y = b0 + b1x is the equation of a straight line; b0 is the intercept (or constant)
and b1 is the x coefficient, which represents the slope of the straight line the equation describes.
There are many examples in the life sciences of such a situation: height to predict weight, dose
of an algaecide to predict algae growth, skin fold measurements to predict total body fat, etc.
Oftentimes, several predictors are used to make a prediction of one variable (e.g., height, weight,
age, smoking status, and gender can all be used to predict blood pressure).
We focus on the special case of using one predictor variable for a response, where the
relationship is linear.
Example: In a study of a free-living population of the snake Vipera bertis, researchers caught
and measured nine adult females. The goal is to predict weight (Y) from length (X).
The data and a scatter plot of the data are below. Notice these data come in pairs; for example,
(x1, y1) = (60, 136).




Simple Linear Model (Regression Equation)
The simple linear model relating Y and X is Y = b0 + b1X + e, where b0 is the intercept, the point
where the line crosses the Y axis, and b1 is the slope, the change in Y over the change in X (rise over run).

In regression models, the independent variables are also referred to as predictor variables.
The dependent variable, Y, is also referred to as the response. The slope, b1, and the
intercept, b0, of the line are called regression coefficients. The slope, b1, can be interpreted
as the change in the mean value of Y for a unit change in X. The random error term, e, is
assumed to follow the normal distribution with a mean of 0 and variance of σ². Since Y is the
sum of this random term and the mean value, b0 + b1X, which is constant, the variance of Y at
any given value of X is also σ². Therefore, at any given value of X, say x0, the dependent
variable Y follows a normal distribution with a mean of b0 + b1x0 and a standard deviation of σ.

The true regression line is usually never known. However, the regression line can be
estimated by estimating the coefficients b0 and b1 from an observed data set. The estimates
are calculated using least squares: they are the values of b0 and b1 that minimize the sum of
squared residuals, Σ(yi - b0 - b1xi)². The estimated regression line, obtained using these
values, is called the fitted line.

We find the coefficients for the regression equation from:

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²,   b0 = ȳ - b1·x̄

Once b0 and b1 are known, the fitted regression line can be written as:

ŷ = b0 + b1·x

Multiple Linear Regression Analysis:
Multiple regression analysis is a powerful technique used for predicting the unknown value of a
variable from the known values of two or more other variables, also called the predictors.
More precisely, multiple regression analysis helps us to predict the value of Y for given values of
X1, X2, ..., Xk.

Multiple regression is a statistical technique that allows us to predict someone's score on one
variable on the basis of their scores on several other variables. For example, suppose we were
interested in predicting how much an individual enjoys their job. Variables such as salary, extent
of academic qualifications, age, sex, number of years in full-time employment and socioeconomic
status might all contribute towards job satisfaction. If we collected data on all of these variables,
perhaps by surveying a few hundred members of the public, we would be able to see how many
and which of these variables gave rise to the most accurate prediction of job satisfaction. We
might find that only a few of these variables predict job satisfaction well. When using multiple
regression in psychology, many researchers use the term independent variables to identify those
variables that they think will influence some other dependent variable. We prefer to use the term
predictor variables for those variables that may be useful in predicting the scores on another
variable that we call the criterion variable.
Theory for multiple linear regression
In multiple linear regression, there are p explanatory variables, and the relationship between the
dependent variable and the explanatory variables is represented by the following equation:

y = β0 + β1x1 + β2x2 + ... + βpxp + e

Where:
β0 is the constant term and
β1 to βp are the coefficients relating the p explanatory variables to the variable of interest.
So, multiple linear regression can be thought of as an extension of simple linear regression, where
there are p explanatory variables, or simple linear regression can be thought of as a special case
of multiple linear regression, where p = 1. The term linear is used because in multiple linear
regression we assume that y is directly related to a linear combination of the explanatory
variables.
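As a sketch of how such a model might be fitted in practice, the following Python example uses NumPy's least squares routine on invented data with p = 2 explanatory variables; the data and variable names are assumptions for illustration, and this is only one of several ways to obtain the coefficients.

import numpy as np

# Invented data: 5 observations, p = 2 explanatory variables.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = np.array([4.1, 4.9, 9.2, 9.8, 14.1])

# Add a column of ones so the first coefficient is the constant term beta_0.
X_design = np.column_stack([np.ones(len(y)), X])

# Solve the least squares problem y ~ X_design @ beta.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("constant (beta_0):", beta[0])
print("coefficients (beta_1..beta_p):", beta[1:])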
Assumption of Linearity:
First of all, as is evident in the name multiple linear regression, it is assumed that the
relationship between variables is linear. In practice this assumption can virtually never be
confirmed; fortunately, multiple regression procedures are not greatly affected by minor
deviations from this assumption. However, as a rule it is prudent to always look at bivariate
scatter plots of the variables of interest. If curvature in the relationships is evident, you may
consider either transforming the variables or explicitly allowing for nonlinear components.
Normality Assumption:
It is assumed in multiple regression that the residuals are distributed normally. Again, even
though most tests are quite robust with regard to violations of this assumption, it is always a
good idea, before drawing final conclusions, to review the distributions of the major variables of
interest. You can produce histograms for the residuals as well as normal probability plots in
order to inspect the distribution of the residual values.
Usually increasing variance and a skewed distribution go together. A histogram of the residuals
with a superimposed normal distribution will reveal such problems, for example residuals
extending far to the right (positive skew).
Limitations:
The major conceptual limitation of all regression techniques is that you can only ascertain
relationships, but can never be sure about the underlying causal mechanism. For example, you would
find a strong positive relationship (correlation) between the damage that a fire does and the
number of firemen involved in fighting the blaze. That does not mean the firemen cause the
damage; both variables are driven by the size of the fire.
Multiple regression is a seductive technique: plug in as many predictor variables as you can
think of and usually at least a few of them will come out significant. This is because you are
capitalizing on chance when simply including as many variables as you can think of as predictors
of some other variable of interest. This problem is compounded when, in addition, the number of
observations is relatively low.
When should we use multiple regression:

1. You can use this statistical technique when exploring linear relationships between the
predictor and criterion variables- that is, when the relationship follows a straight line.
2. The criterion variable that you are seeking to predict should be measured on a continuous
scale. There is a separate regression method called logistic regression that can be used for
dichotomous dependent variables.
3. The predictor variables that you select should be measured on a ratio, interval, or ordinal
scale. A nominal predictor variable is legitimate but only if it is dichotomous. The term
dummy variable is used to describe this type of dichotomous variable.
4. Multiple regression requires a large number of observations. The number of cases must
substantially exceed the number of predictor variables you are using in your regression.
The absolute minimum is that you have five times as many participants as predictor
variables.

Correlation Analysis:
Given two n-element sample populations, X and Y, it is possible to quantify the degree of fit to a
linear model using the correlation coefficient. The correlation coefficient, r, is a scalar quantity
in the interval [-1.0, 1.0], and is defined as the ratio of the covariance of the sample populations
to the product of their standard deviations.

r = cov(X, Y) / (sX · sY)

Or:

The correlation between two securities is a statistical measure of the relationship between the
price movements of the two securities. This relationship, which is expressed by what is known as
the correlation coefficient, is represented by a value within the range of -1.00 to +1.00. A
correlation coefficient of +1.00 indicates that two securities move in the same direction at all
times. If security A gains in value, we would expect security B to gain as well. A correlation
coefficient of 0 indicates that the price movements are totally random. A gain by security A
provides no insight into the expected movement of security B.
The correlation coefficient is a direct measure of how well two sample populations vary jointly.
A value of r = +1 or r = -1 indicates a perfect fit to a positive or negative linear model, respectively.
The simplest way to find out qualitatively whether there is correlation is to plot the data. In the
case of our example, a strong positive correlation between y and x would be evident from such a
scatter plot. The coefficient r is defined as:
r = [Σ(xi - x̄)(yi - ȳ) / (n - 1)] / (sx · sy)        (1)

where x̄ and sx denote the sample mean and the sample standard deviation respectively for
the variable x, and ȳ and sy denote the sample mean and the sample standard deviation
respectively for the variable y.
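The following Python sketch computes r exactly as in equation (1) on a small invented data set, and uses np.corrcoef only as a cross-check; the numbers themselves are placeholders.

import numpy as np

# Invented sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

n = len(x)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)  # sample covariance
s_x = x.std(ddof=1)                                       # sample std dev of x
s_y = y.std(ddof=1)                                       # sample std dev of y

r = s_xy / (s_x * s_y)
print("r =", r)
print("cross-check:", np.corrcoef(x, y)[0, 1])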
The correlation transformer calculates any of the following correlation-related statistics on one or
more pairs of columns:

Correlation coefficient r:
The correlation coefficient r is a measure of the linear relationship between two attributes of
data. The correlation coefficient is also known as the Pearson product-moment correlation
coefficient. The value of r can range from -1 to +1 and is independent of the units of
measurement. A value of r near 0 indicates little correlation between attributes; a value near +1
or -1 indicates a high level of correlation.
Consider two variables x and y:
- If r = 1, then x and y are perfectly positively correlated. The possible values of x
and y all lie on a straight line with a positive slope in the (x,y) plane.
- If r = 0, then x and y are not correlated. They do not have an apparent linear
relationship. However, this does not mean that x and y are statistically
independent.
- If r = -1, then x and y are perfectly negatively correlated. The possible values of x
and y all lie on a straight line with a negative slope in the (x,y) plane.
Covariance:

Covariance is a measure of the linear relationship between two attributes or columns of
data. The value of the covariance can range from -infinity to +infinity. However, if the
value of the covariance is too small or too large to be represented by a number, the value
is represented by NULL.
T-value:
T-value is the observed value of the T-statistic that is used to test the hypothesis that two
attributes are correlated. The T-value can range between -∞ and +∞. A T-value near 0 is
evidence for the null hypothesis that there is no correlation between the attributes. A T-
value far from 0 (either positive or negative) is evidence for the alternative hypothesis
that there is correlation between the attributes.
The definition of T-statistic is:
T = r * SQRT ((n-2) / (1 - r*r))
Where r is the correlation coefficient, n is the number of input value pairs, and SQRT is
the square root function.
If the correlation coefficient r is either -1 or +1, the T-value is represented by NULL. If
the T-value is too small or too large to be represented by a number, the value is
represented by NULL.
P-value:
P-value is the probability, when the null hypothesis is true, that the absolute value of the
T-statistic would equal or exceed the observed value (T-value). A small P-value is
evidence that the null hypothesis is false and the attributes are, in fact, correlated.
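A sketch of how the T-value and P-value described above could be computed from r and n, assuming SciPy is available; the values of r and n are placeholders.

from scipy import stats
import math

r = 0.46   # placeholder correlation coefficient
n = 35     # placeholder number of input value pairs

# T = r * sqrt((n - 2) / (1 - r*r))
t_value = r * math.sqrt((n - 2) / (1 - r * r))

# Two-sided P-value from the t distribution with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_value), df=n - 2)

print("T-value:", t_value)
print("P-value:", p_value)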








1. Solve the following problems:

11.10 Given the estimated regression equation ŷ = 100 + 21x:
a: Y changes by +105.
b: Y changes by -147.
c: ŷ = 100 + 21(14) = 394
d: ŷ = 100 + 21(27) = 667
e: Regression results do not prove that increased values of X cause increased values of
Y. Theory is needed to establish a conclusion of causation.
11.16 The constant represents an adjustment for the estimated model and not the number sold
when the price is zero.

11.18 Compute the coefficients for the least squares regression equation.

A: b1 = 1.80, b0 = 10; ŷ = 10 + 1.80x

B: b1 = 1.30, b0 = 210 - 1.30(60) = 132; ŷ = 132 + 1.30x

C: b1 = .975, b0 = 100 - .975(20) = 80.5; ŷ = 80.5 + .975x

D: b1 = .30, b0 = 50 - .30(10) = 47; ŷ = 47 + .30x

E: b1 = .525, b0 = 200 - .525(90) = 152.75; ŷ = 152.75 + .525x
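A small Python check of these intercept calculations, using the pattern b0 = ȳ - b1·x̄; the tuples below simply restate the slope and the two means given in parts B-E.

# (b1, y_bar, x_bar) as given in parts B-E
cases = {"B": (1.30, 210, 60),
         "C": (0.975, 100, 20),
         "D": (0.30, 50, 10),
         "E": (0.525, 200, 90)}

for label, (b1, y_bar, x_bar) in cases.items():
    b0 = y_bar - b1 * x_bar   # intercept = y-bar minus slope times x-bar
    print(f"{label}: b0 = {b0}, fitted line: y-hat = {b0} + {b1} x")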


11.20 a: n = 20, x̄ = 25.4/20 = 1.27, ȳ = 22.6/20 = 1.13
b1 = [Σxiyi - n·x̄·ȳ] / [Σxi² - n·x̄²] = 1.0737
b0 = 1.13 - 1.0737(1.27) = -.2336
b: For a one-unit increase in the rate of return of the S&P 500 index, we estimate that the rate
of return of the corporation's stock will increase by 1.07%.
c: When the rate of return of the S&P 500 index is 0, we estimate that the
corporation's rate of return will be -.2336%.


11.27
a: yi - ȳ = (ŷi - ȳ) + (yi - ŷi)
b: Σ(yi - ŷi)(ŷi - ȳ) = 0, since the residuals are uncorrelated with the fitted values.
c: Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)² + 2Σ(ŷi - ȳ)(yi - ŷi) = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²
d: Since ŷi = a + b·xi and ȳ = a + b·x̄, we have ŷi - ȳ = b(xi - x̄).
e: SSE = Σ(yi - ŷi)² = Σ(yi - ȳ)² - Σ(ŷi - ȳ)² = SST - SSR, and therefore
SST = SSR + SSE
11.28 a, b, c: algebraic derivations based on the least squares formulas (working omitted).

11.36 For a simple regression problem, test the hypothesis H0: β1 = 0 vs H1: β1 ≠ 0.
Given that R² = rxy², R² = SSR/SST = 1 - SSE/SST, and se² = SSE/(n - 2):

a. n = 35, SST = 100,000, r = .46
R² = (.46)² = .2116. SSR/100,000 = .2116, so SSR = 21,160.
Therefore, given that SST = SSR + SSE, 100,000 = 21,160 + SSE, so SSE = 78,840.
se² = 78,840/(35 - 2) = 2,389.091
F = MSR/MSE = SSR/se² = 21,160/2,389.091 = 8.857
F(α, 1, n-2) = t(n-2, α/2)² = (2.042)² = 4.170
Therefore, at the .05 level, reject H0.

b. n = 61, SST = 123,000, r = .65
R² = (.65)² = .4225. SSR/123,000 = .4225, so SSR = 51,967.5.
Given that SST = SSR + SSE, 123,000 = 51,967.5 + SSE, so SSE = 71,032.5.
se² = 71,032.5/(61 - 2) = 1,203.941
F = MSR/MSE = SSR/se² = 51,967.5/1,203.941 = 43.165
F(α, 1, n-2) = t(n-2, α/2)² = (2.000)² = 4.00
Therefore, at the .05 level, reject H0.

c. n = 25, SST = 128,000, r = .69
R² = (.69)² = .4761. SSR/128,000 = .4761, so SSR = 60,940.8.
Given that SST = SSR + SSE, 128,000 = 60,940.8 + SSE, so SSE = 67,059.2.
se² = 67,059.2/(25 - 2) = 2,915.617
F = MSR/MSE = SSR/se² = 60,940.8/2,915.617 = 20.902
F(α, 1, n-2) = t(n-2, α/2)² = (2.069)² = 4.281
Therefore, at the .05 level, reject H0.
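The arithmetic in parts a-c follows one pattern, sketched below in Python; SciPy is assumed only for the critical value, and otherwise the function mirrors the formulas above.

from scipy import stats

def f_test_from_r(n, sst, r, alpha=0.05):
    r2 = r ** 2
    ssr = r2 * sst                 # SSR = R^2 * SST
    sse = sst - ssr                # SST = SSR + SSE
    s2_e = sse / (n - 2)           # s_e^2 = SSE / (n - 2)
    f = ssr / s2_e                 # F = MSR / MSE with 1 and n-2 df
    f_crit = stats.f.ppf(1 - alpha, 1, n - 2)
    return f, f_crit, f > f_crit

for n, sst, r in [(35, 100_000, 0.46), (61, 123_000, 0.65), (25, 128_000, 0.69)]:
    f, f_crit, reject = f_test_from_r(n, sst, r)
    print(f"n={n}: F={f:.3f}, F_crit={f_crit:.3f}, reject H0: {reject}")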


11.38 a. n = 8, Σxi = 52, x̄ = 52/8 = 6.5, Σxi² = 494,
Σyi = 54.4, ȳ = 54.4/8 = 6.8, Σyi² = 437.36, Σxiyi = 437.7
b1 = [437.7 - 8(6.5)(6.8)] / [494 - 8(6.5)²] = .5391
b0 = 6.8 - .5391(6.5) = 3.2958

b. Σei² = [437.36 - 8(6.8)²] - (.5391)²[494 - 8(6.5)²] = 22.1019
se² = 22.1019/6 = 3.6836
sb1² = 3.6836 / [494 - 8(6.5)²] = .0236
t(6, .05) = 1.943
Therefore, the 90% confidence interval is .5391 ± 1.943√.0236,
.2406 up to .8376
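A Python sketch of the same calculation from the summary sums; this is a restatement of the arithmetic above rather than a general routine, and the t critical value is taken as given.

import math

n = 8
sum_x, sum_x2 = 52, 494
sum_y, sum_y2 = 54.4, 437.36
sum_xy = 437.7

x_bar, y_bar = sum_x / n, sum_y / n
sxx = sum_x2 - n * x_bar ** 2                 # sum of (x - x_bar)^2
b1 = (sum_xy - n * x_bar * y_bar) / sxx       # slope
b0 = y_bar - b1 * x_bar                       # intercept

sse = (sum_y2 - n * y_bar ** 2) - b1 ** 2 * sxx
s2_e = sse / (n - 2)                          # error variance estimate
s2_b1 = s2_e / sxx                            # variance of b1
t_crit = 1.943                                # t(6, .05) as given above

half_width = t_crit * math.sqrt(s2_b1)
print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
print(f"90% CI for beta_1: ({b1 - half_width:.4f}, {b1 + half_width:.4f})")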

11.42 Given a simple regression:
ŷ(n+1) = 12 + 5(13) = 77

95% Prediction Interval:
ŷ(n+1) ± t(n-2, α/2) · se · √[1 + 1/n + (x(n+1) - x̄)² / Σ(xi - x̄)²]
= 77 ± 2.042(9.67)√[1 + 1/32 + (13 - 8)²/500]
= 77 ± 20.533, i.e. (56.467, 97.533)

95% Confidence Interval:
ŷ(n+1) ± t(n-2, α/2) · se · √[1/n + (x(n+1) - x̄)² / Σ(xi - x̄)²]
= 77 ± 2.042(9.67)√[1/32 + (13 - 8)²/500]
= 77 ± 5.629, i.e. (71.371, 82.629)
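The two intervals differ only by the leading 1 under the square root. A short Python sketch of the arithmetic for this problem, with all values taken from above (the t value is used as given):

import math

y_hat = 12 + 5 * 13          # point prediction y-hat_(n+1) = 77
t_crit = 2.042               # t(n-2, .025) as used above
s_e = 9.67
n = 32
x_new, x_bar, sxx = 13, 8, 500

pred_half = t_crit * s_e * math.sqrt(1 + 1/n + (x_new - x_bar) ** 2 / sxx)
conf_half = t_crit * s_e * math.sqrt(    1/n + (x_new - x_bar) ** 2 / sxx)

print(f"95% prediction interval: ({y_hat - pred_half:.3f}, {y_hat + pred_half:.3f})")
print(f"95% confidence interval: ({y_hat - conf_half:.3f}, {y_hat + conf_half:.3f})")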
11.44 Given a simple regression:
ŷ(n+1) = 22 + 8(17) = 158

95% Prediction Interval:
ŷ(n+1) ± t(n-2, α/2) · se · √[1 + 1/n + (x(n+1) - x̄)² / Σ(xi - x̄)²]
= 158 ± 2.086(3.45)√[1 + 1/22 + (17 - 11)²/400]
= 158 ± 7.669, i.e. (150.331, 165.669)

95% Confidence Interval:
ŷ(n+1) ± t(n-2, α/2) · se · √[1/n + (x(n+1) - x̄)² / Σ(xi - x̄)²]
= 158 ± 2.086(3.45)√[1/22 + (17 - 11)²/400]
= 158 ± 2.649, i.e. (155.351, 160.649)

11.46 se² = 80.6/23 = 3.5043, sb1² = 3.5043/130 = .027
a. H0: β1 = 0 vs H1: β1 < 0
t = -1.2/√.027 = -7.303
Therefore, reject H0 at the 1% level since t = -7.303 < -2.807 = -t(23, .005).
b. ŷ(n+1) = 12.6 - 1.2(4) = 7.8
7.8 ± 1.714(1.872)√[1 + 1/25 + (4 - 6)²/130]
= 7.8 ± 3.3203, i.e. (4.4798, 11.1203)

11.60 a. Compute the sample correlation

 xi    yi    (xi - x̄)   (xi - x̄)²   (yi - ȳ)   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
  2     5      -1.8        3.24       -2.4        5.76          4.32
  5     8       1.2        1.44        0.6        0.36          0.72
  3     7      -0.8        0.64       -0.4        0.16          0.32
  1     2      -2.8        7.84       -5.4       29.16         15.12
  8    15       4.2       17.64        7.6       57.76         31.92
 19    37       0         30.8         0         93.2          52.4

x̄ = 19/5 = 3.8, ȳ = 37/5 = 7.4
sx = √[Σ(xi - x̄)²/(n - 1)] = √(30.8/4) = 2.7749
sy = √[Σ(yi - ȳ)²/(n - 1)] = √(93.2/4) = 4.827
sxy = Σ(xi - x̄)(yi - ȳ)/(n - 1) = 52.4/4 = 13.1
r = sxy/(sx·sy) = 13.1/[(2.7749)(4.827)] = .97802

b. Compute the sample correlation

 xi    yi    (xi - x̄)   (xi - x̄)²   (yi - ȳ)   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
  7     5      -1.8        3.24       -2.4        5.76          4.32
 10     8       1.2        1.44        0.6        0.36          0.72
  8     7      -0.8        0.64       -0.4        0.16          0.32
  6     2      -2.8        7.84       -5.4       29.16         15.12
 13    15       4.2       17.64        7.6       57.76         31.92
 44    37       0         30.8         0         93.2          52.4

x̄ = 44/5 = 8.8, ȳ = 37/5 = 7.4
sx = √(30.8/4) = 2.7749, sy = √(93.2/4) = 4.827, sxy = 52.4/4 = 13.1
r = sxy/(sx·sy) = 13.1/[(2.7749)(4.827)] = .97802





c. Compute the sample correlation

 xi    yi    (xi - x̄)   (xi - x̄)²   (yi - ȳ)   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
 12     4      -3.6       12.96       -1.8        3.24          6.48
 15     6      -0.6        0.36        0.2        0.04         -0.12
 16     5       0.4        0.16       -0.8        0.64         -0.32
 21     8       5.4       29.16        2.2        4.84         11.88
 14     6      -1.6        2.56        0.2        0.04         -0.32
 78    29       0         45.2         0          8.8          17.6

x̄ = 78/5 = 15.6, ȳ = 29/5 = 5.8
sx = √(45.2/4) = 3.36155, sy = √(8.8/4) = 1.48324, sxy = 17.6/4 = 4.4
r = sxy/(sx·sy) = 4.4/[(3.36155)(1.48324)] = .88247
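A Python cross-check of the three correlations, using the data sets from parts a-c above; np.corrcoef computes the same Pearson r as the hand calculation.

import numpy as np

datasets = {
    "a": ([2, 5, 3, 1, 8],      [5, 8, 7, 2, 15]),
    "b": ([7, 10, 8, 6, 13],    [5, 8, 7, 2, 15]),
    "c": ([12, 15, 16, 21, 14], [4, 6, 5, 8, 6]),
}

for label, (x, y) in datasets.items():
    r = np.corrcoef(x, y)[0, 1]
    print(f"{label}: r = {r:.5f}")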

11.84 a. For a one-unit change in the inflation rate, we estimate that the actual spot rate will
change by .7916 units.
b. R² = 9.7%: 9.7% of the variation in the actual spot rate can be explained by the
variation in the spot rate predicted by the inflation rate.
c. H0: β1 = 0 vs H1: β1 > 0
t = .7916/.2759 = 2.8692. Reject H0 at the .5% level since t = 2.8692 > 2.66 = t(77, .005).
d. H0: β1 = 1 vs H1: β1 ≠ 1
t = (.7916 - 1)/.2759 = -.7553
Do not reject H0 at any common level of significance.





11.86 a. For each unit increase in the diagnostic statistics test score, we estimate that the final
student score at the end of the course will increase by .2875 points.
b. 11.58% of the variation in the final student score can be explained by the variation in
the diagnostic statistics test score.
c. The two methods are 1) the test of the significance of the population regression slope
coefficient (β1) and 2) the test of the significance of the population correlation
coefficient (ρ).
1) H0: β1 = 0 vs H1: β1 > 0
t = .2875/.04566 = 6.2965
Therefore, reject H0 at any common level of alpha.
2) H0: ρ = 0 vs H1: ρ > 0
r = √R² = √.1158 = .3403
t = r√(n - 2) / √(1 - r²) = .3403√304 / √(1 - .3403²) = 6.3098
Reject H0 at any common level.
2. Compute the correlation coefficient

 xi    yi    (xi - x̄)   (xi - x̄)²   (yi - ȳ)   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
  2     8      -1.8        3.24        -5          25            9
  5    12       1.2        1.44        -1           1           -1.2
  3    14      -0.8        0.64         1           1           -0.8
  1     9      -2.8        7.84        -4          16           11.2
  8    22       4.2       17.64         9          81           37.8
 19    65       0         30.8          0         124           56

x̄ = 19/5 = 3.8, ȳ = 65/5 = 13
sx = √(30.8/4) = 2.77488, sy = √(124/4) = 5.56776, sxy = 56/4 = 14
r = sxy/(sx·sy) = 14/[(2.77488)(5.56776)] = .90615

3. Take the data from Table 11.2 on page 462 and reproduce the output on page 463
presented in Table 11.7.
Obs   Income (x)   Retail Sales (y)   Predicted Retail Sales   Residual   Observed Dev.   Predicted Dev.
  1      55641          21886                 21787                99          -550            -649
  2      55681          21934                 21803               131          -502            -633
  3      55637          21699                 21786               -87          -737            -650
  4      55825          21901                 21858                43          -535            -578
  5      55772          21812                 21837               -25          -624            -599
  6      55890          21714                 21882              -168          -722            -554
  7      56068          21932                 21950               -18          -504            -486
  8      56299          22086                 22038                48          -350            -398
  9      56825          22265                 22239                26          -171            -197
 10      57205          22551                 22384               167           115             -52
 11      57562          22736                 22520               216           300              84
 12      57850          22301                 22630              -329          -135             194
 13      57975          22518                 22678              -160            82             242
 14      57992          22580                 22684              -104           144             248
 15      58240          22618                 22779              -161           182             343
 16      58414          22890                 22845                45           454             409
 17      58561          23112                 22902               211           676             465
 18      59066          23315                 23094               221           879             658
 19      58596          22865                 22915               -50           429             479
 20      58631          22788                 22928              -140           352             492
 21      58758          22949                 22977               -28           513             541
 22      59037          23149                 23083               -66           713             647
Sum of squared values:   Residual² = 436127,  Observed Dev.² = 5397565,  Predicted Dev.² = 4961438


SUMMARY OUTPUT

Regression Statistics
Multiple R 0.958748803
R Square 0.919199267
Adjusted R Square 0.91515923
Standard Error 147.6697181
Observations 22


ANOVA
                  df          SS              MS              F         Significance F
Regression         1     4961434.406     4961434.406     227.5225       2.17134E-12
Residual          20      436126.9127     21806.34563
Total             21     5397561.318



               Coefficients     S.E.            t Stat          P-value    Lower 95%       Upper 95%     Lower 95.0%   Upper 95.0%
Intercept      559.4600137     1450.69753      0.385648974     0.703828   -2466.641999    3585.56203    -2466.64      3585.562
X Variable 1   0.38151672      0.02529306     15.08384918      2.17E-12    0.328756319    0.43427712     0.328756     0.434277
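The output above can be reproduced with a few lines of Python; this is a sketch assuming the statsmodels package is available, with the 22 income/retail-sales pairs taken from the table listed earlier.

import numpy as np
import statsmodels.api as sm

income = [55641, 55681, 55637, 55825, 55772, 55890, 56068, 56299, 56825, 57205,
          57562, 57850, 57975, 57992, 58240, 58414, 58561, 59066, 58596, 58631,
          58758, 59037]
sales = [21886, 21934, 21699, 21901, 21812, 21714, 21932, 22086, 22265, 22551,
         22736, 22301, 22518, 22580, 22618, 22890, 23112, 23315, 22865, 22788,
         22949, 23149]

X = sm.add_constant(np.array(income, dtype=float))   # adds the intercept column
model = sm.OLS(np.array(sales, dtype=float), X).fit()

# summary() reports R-squared, the ANOVA F statistic, coefficients,
# standard errors, t statistics, p-values and 95% confidence intervals.
print(model.summary())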
