
REGRESSION MODELS

By: Ayush Sharma 09 Mickey Haldia 19 Prerna Makhijani 29 Sanoj George 39 Sushant Jaggi 49 Nitish Dorle 59

Example

Year    Population on Farm (in millions)
1935    32.1
1940    30.5
1945    24.4
1950    23.0
1955    19.1
1960    15.6
1965    12.5

Scatter Plot

[Figure: scatter plot of Population (in millions) against Year, 1930 to 1970]

Correlation Coefficient (r)




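It is a measure of the strength of the linear relationship between two variables and is calculated using the following formula:

r = Σ(X - Xbar)(Y - Ybar) / sqrt[ Σ(X - Xbar)^2 * Σ(Y - Ybar)^2 ]

where Xbar and Ybar are the sample means of X and Y.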

Interpretation


After calculating, we find r = -0.993. There is a strong negative correlation.
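As a quick check of the value above, the correlation coefficient can be recomputed from the farm population table. This is a minimal sketch in plain Python; the function name `pearson_r` is illustrative and not part of the original slides.

```python
import math

# Farm population example from the slides
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]  # in millions


def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(years, population)
print(round(r, 3))  # -0.993
```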

Coefficient of Determination


 

Squaring the correlation coefficient (r) gives the proportion of the variation in the y-variable that is explained by the variation in the x-variable.

To relate x and y, the Regression Equation is calculated using the Least Squares technique.

Regression Equation: Y = a + bX

Slope of the regression line: b = Σ(X - Xbar)(Y - Ybar) / Σ(X - Xbar)^2, where Xbar and Ybar are the sample means.

To continue with the example




We found r = -0.993. Squaring the unrounded value of r gives the Coefficient of Determination (R^2) = 0.987.


Regression

[Figure: Population on Farm (in millions) against Year, 1930 to 1970, with the fitted regression line]

y = -0.671x + 1,330.350
R^2 = 0.987
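The fitted line and R^2 for the farm example can be reproduced with a few lines of plain Python. This is a sketch of ordinary least squares for one predictor; the helper names `fit_line` and `r_squared` are illustrative only.

```python
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]  # in millions


def fit_line(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


def r_squared(xs, ys, a, b):
    """Share of the variation in ys explained by the fitted line."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


a, b = fit_line(years, population)
r2 = r_squared(years, population, a, b)
print(f"y = {b:.3f} x + {a:.3f}")
print(f"R^2 = {r2:.3f}")
```

The printed slope, intercept, and R^2 should agree with the values shown on the regression slide after rounding.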

Interpretation


We conclude that 98.7% of the variation in farm population is explained by the passage of time (the year). Here population is the dependent variable (y-axis) and year is the independent variable (x-axis).

Assumptions of the Regression Model


The following assumptions are made about the errors:
a) The errors are independent
b) The errors are normally distributed
c) The errors have a mean of zero
d) The errors have a constant variance (regardless of the value of X)


Patterns Indicating Errors

[Figure: error (residual) plots; vertical axis: Error]

Estimating the Variance


The error variance is measured by the MSE:

s^2 = MSE = SSE / (n - k - 1)

where
n = number of observations in the sample
k = number of independent variables

Therefore the standard deviation will be s = sqrt(MSE)
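To make the formula concrete, the sketch below applies it to the farm population example (n = 7 observations, k = 1 independent variable). The slides do not report these figures, so the resulting MSE and s are computed here only for illustration.

```python
import math

years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]  # in millions

n = len(years)  # number of observations
k = 1           # number of independent variables

# Least squares fit Y = a + bX
x_bar = sum(years) / n
y_bar = sum(population) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, population))
s_xx = sum((x - x_bar) ** 2 for x in years)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Sum of squared errors around the fitted line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(years, population))

mse = sse / (n - k - 1)  # estimate of the error variance, s^2
s = math.sqrt(mse)       # standard deviation of the errors

print(f"SSE = {sse:.4f}, MSE = {mse:.4f}, s = {s:.4f}")
```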

Testing the Model for Significance

MSE and the coefficient of determination (r^2) do not provide a good measure of accuracy when the sample size is small. In this case, it is necessary to test the model for significance.

The linear model is given by:

Y = β0 + β1X + ε

Null Hypothesis: β1 = 0 (there is no linear relationship between X and Y)
Alternate Hypothesis: β1 ≠ 0 (there is a linear relationship)

Steps in Hypothesis Test for a Significant Regression Model

1. Specify the null and alternative hypotheses.
2. Select the level of significance (α). Common values are between 0.01 and 0.05.
3. Calculate the value of the test statistic using the formula: F = MSR/MSE
4. Make a decision using one of the following methods:
   a) Reject the null hypothesis if F calculated > F table
   b) Reject the null hypothesis if p-value < α
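The steps above can be sketched on the farm population example. The code assumes α = 0.05 and takes the critical value 6.61 from a standard F table (1 and 5 degrees of freedom); both are assumptions made for this illustration, not figures from the slides. MSR is computed as SSR / k, with SSR = SST - SSE.

```python
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]  # in millions

n = len(years)  # observations
k = 1           # independent variables

# Least squares fit Y = a + bX
x_bar = sum(years) / n
y_bar = sum(population) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, population))
s_xx = sum((x - x_bar) ** 2 for x in years)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Step 3: test statistic F = MSR / MSE
sst = sum((y - y_bar) ** 2 for y in population)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(years, population))
ssr = sst - sse
msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse

# Step 4a: compare with the table value.
# Assumed: F table value for alpha = 0.05 with df1 = k = 1, df2 = n - k - 1 = 5.
F_TABLE = 6.61
reject_null = f_stat > F_TABLE

print(f"F = {f_stat:.2f}, reject null hypothesis: {reject_null}")
```

Here the calculated F is far above the table value, so the null hypothesis of no linear relationship would be rejected.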

Multiple regression Analysis


More than one independent variable:

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

where
Y = dependent variable (response variable)
Xi = ith independent variable (predictor variable or explanatory variable)
β0 = intercept (value of Y when all Xi = 0)
βi = coefficient of the ith independent variable
k = number of independent variables
ε = random error

To estimate the values of these coefficients, a sample is taken and the following equation is developed:

Ŷ = b0 + b1X1 + b2X2 + ... + bkXk

where
Ŷ = predicted value of Y
b0 = sample intercept (an estimate of β0)
bi = sample coefficient of the ith variable (an estimate of βi)

Selling Price ($)   Square Footage   Age   Condition
95000               1926             30    Good
119000              2069             40    Excellent
124800              1720             30    Excellent
135000              1396             15    Good
142800              1706             32    Mint
145000              1847             38    Mint
159000              1950             27    Mint
165000              2323             30    Excellent
182000              2285             26    Mint
183000              3752             35    Good
200000              2300             18    Good
211000              2525             17    Good
215000              3800             40    Excellent
219000              1740             12    Mint

Jenny Wilson Realty

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.819680305
R Square            0.671875802   (the coefficient of determination r^2)
Adjusted R Square   0.612216857
Standard Error      24312.60729
Observations        14

ANOVA
             df    SS             MS        F        Significance F
Regression    2    13313936968    6.7E+09   11.262   0.002178765
Residual     11    6502131603     5.9E+08
Total        13    19816068571

             Coefficients   Standard Error   t Stat     P-value   Lower 95%      Upper 95%
Intercept    146630.89      25482.08287      5.75427    0.0001    90545.20735    202717
SF           43.819366      10.28096507      4.26218    0.0013    21.19111495    66.448
AGE          -2898.686      796.5649421      -3.639     0.0039    -4651.91386    -1145.5

The Coefficients column gives the regression coefficients.
The p-values are used to test the individual variables for significance.
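The coefficients, r^2, and F value in the summary output can be reproduced from the data table. The sketch below is a plain-Python illustration that solves the normal equations (X'X) b = X'y by Gaussian elimination; it is not how the summary output itself was generated, and the helper names `solve` and `ols` are made up for this example.

```python
# (selling price, square footage, age) rows from the Jenny Wilson Realty table
data = [
    (95000, 1926, 30), (119000, 2069, 40), (124800, 1720, 30),
    (135000, 1396, 15), (142800, 1706, 32), (145000, 1847, 38),
    (159000, 1950, 27), (165000, 2323, 30), (182000, 2285, 26),
    (183000, 3752, 35), (200000, 2300, 18), (211000, 2525, 17),
    (215000, 3800, 40), (219000, 1740, 12),
]


def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, ys):
    """Least squares coefficients [b0, b1, ..., bk] from the normal equations."""
    xs = [[1.0] + [float(v) for v in row] for row in rows]  # add intercept column
    p = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(p)]
    return solve(xtx, xty)


prices = [row[0] for row in data]
features = [row[1:] for row in data]
coef = ols(features, prices)

n, k = len(data), 2
y_bar = sum(prices) / n
fitted = [coef[0] + sum(b * x for b, x in zip(coef[1:], row)) for row in features]
sse = sum((y - f) ** 2 for y, f in zip(prices, fitted))
sst = sum((y - y_bar) ** 2 for y in prices)
r2 = 1 - sse / sst
f_stat = ((sst - sse) / k) / (sse / (n - k - 1))

print("b0, b1 (SF), b2 (AGE):", [round(c, 3) for c in coef])
print("r^2 =", round(r2, 4), " F =", round(f_stat, 3))
```

In practice a statistics package would be used instead of hand-written elimination; the manual version is shown only to keep the example free of external dependencies.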

Binary or Dummy Variables


  

Indicator Variable: assigned a value of 1 if a particular condition is met, 0 otherwise.
The number of dummy variables must equal one less than the number of categories of a qualitative variable.

The Jenny Wilson Realty example:
X3 = 1 for excellent condition, 0 otherwise
X4 = 1 for mint condition, 0 otherwise

Selling Price ($)   Square Footage   Age   X3 (Exc.)   X4 (Mint)   Condition
95000               1926             30    0           0           Good
119000              2069             40    1           0           Excellent
124800              1720             30    1           0           Excellent
135000              1396             15    0           0           Good
142800              1706             32    0           1           Mint
145000              1847             38    0           1           Mint
159000              1950             27    0           1           Mint
165000              2323             30    1           0           Excellent
182000              2285             26    0           1           Mint
183000              3752             35    0           0           Good
200000              2300             18    0           0           Good
211000              2525             17    0           0           Good
215000              3800             40    1           0           Excellent
219000              1740             12    0           1           Mint

Jenny Wilson Realty

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.94762
R Square            0.89798
Adjusted R Square   0.85264
Standard Error      14987.6
Observations        14

ANOVA
             df    SS             MS       F         Significance F
Regression    4    17794427451    4E+09    19.8044   0.000174421
Residual      9    2021641120     2E+08
Total        13    19816068571

             Coefficients   Standard Error   t Stat    P-value   Lower 95%      Upper 95%
Intercept    121658         17426.61432      6.9812    6.5E-05   82236.71393    161080
SF           56.4276        6.947516792      8.122     2E-05     40.71122594    72.144
AGE          -3962.82       596.0278736      -6.6487   9.4E-05   -5311.12866    -2614.5
X3 (Exc.)    33162.6        12179.62073      2.7228    0.0235    5610.432651    60714.9
X4 (Mint)    47369.2        10649.26942      4.4481    0.0016    23278.92699    71459.6

The coefficient of AGE is negative, indicating that the price decreases as a house gets older.
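The dummy-variable coding and the fitted equation can be illustrated as follows. This sketch encodes the Condition column into X3 and X4 and then predicts a price with the coefficients reported in the summary output; the function names `encode_condition` and `predict_price` are hypothetical.

```python
# Condition column from the Jenny Wilson Realty table, in row order
conditions = [
    "Good", "Excellent", "Excellent", "Good", "Mint", "Mint", "Mint",
    "Excellent", "Mint", "Good", "Good", "Good", "Excellent", "Mint",
]


def encode_condition(condition):
    """Return (X3, X4): X3 = 1 for excellent, X4 = 1 for mint, both 0 for good."""
    label = condition.strip().lower()
    return (1 if label == "excellent" else 0, 1 if label == "mint" else 0)


x3 = [encode_condition(c)[0] for c in conditions]
x4 = [encode_condition(c)[1] for c in conditions]

# Coefficients as reported in the summary output
B0, B_SF, B_AGE, B_X3, B_X4 = 121658, 56.4276, -3962.82, 33162.6, 47369.2


def predict_price(square_feet, age, condition):
    """Predicted selling price from the fitted equation with dummy variables."""
    d3, d4 = encode_condition(condition)
    return B0 + B_SF * square_feet + B_AGE * age + B_X3 * d3 + B_X4 * d4


# Same house in three conditions: only the dummy terms change the prediction
for cond in ("Good", "Excellent", "Mint"):
    print(cond, round(predict_price(2000, 20, cond), 2))
```

Three categories need only two dummy variables: a house in good condition is the baseline with X3 = X4 = 0, and the X3 and X4 coefficients measure the price difference relative to that baseline.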

Model Building
 

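The value of r^2 can never decrease when more variables are added to the model.
Adjusted r^2 is often used to determine if an additional independent variable is beneficial.

The adjusted r^2 is

Adjusted r^2 = 1 - [SSE / (n - k - 1)] / [SST / (n - 1)]

where SSE is the error (residual) sum of squares, SST is the total sum of squares, n is the number of observations, and k is the number of independent variables.

A variable should not be added to the model if it causes the adjusted r^2 to decrease.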
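As a worked check, adjusted r^2 = 1 - [SSE / (n - k - 1)] / [SST / (n - 1)] can be evaluated with the sums of squares from the two Jenny Wilson Realty ANOVA tables. The function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(sse, sst, n, k):
    """Adjusted r^2 from error and total sums of squares."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


SST = 19816068571  # total sum of squares (same for both models)
N = 14             # observations

# Model with SF and AGE (k = 2)
adj_two = adjusted_r2(6502131603, SST, N, 2)

# Model with SF, AGE, X3 and X4 (k = 4)
adj_four = adjusted_r2(2021641120, SST, N, 4)

print(round(adj_two, 5), round(adj_four, 5))
```

The adjusted r^2 rises from about 0.612 to about 0.853 when the condition dummies are added, so by the rule above the extra variables are worth keeping.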

Multiple Regression
Sales/Decision to buy = B0 + B1 * Price

Sales/Decision to buy = B0 + B1 * (Price)^3 + B2 * (Design)^2 + B3 * (Performance)

Let L = (Price)^3, M = (Design)^2, N = (Performance). Then:

Sales/Decision to buy = B0 + B1 * L + B2 * M + B3 * N
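The substitution turns a nonlinear relationship into one that is linear in the new variables L, M and N, so ordinary multiple regression can be applied to them. A minimal sketch of the transformation step follows; the coefficient values and inputs are made-up numbers used only to show the mechanics.

```python
def transform(price, design, performance):
    """Map raw inputs to the linearized predictors (L, M, N)."""
    return price ** 3, design ** 2, performance


def predict(coefficients, price, design, performance):
    """Evaluate B0 + B1*L + B2*M + B3*N for one observation."""
    b0, b1, b2, b3 = coefficients
    l, m, n = transform(price, design, performance)
    return b0 + b1 * l + b2 * m + b3 * n


# Hypothetical coefficients and inputs, for illustration only
example_coefficients = (1.0, 2.0, 3.0, 4.0)
print(transform(2, 3, 4))                      # (8, 9, 4)
print(predict(example_coefficients, 2, 3, 4))  # 60.0
```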

Pitfalls In Regression
A high correlation does not mean one variable is causing a change in another. (For example, some regressions have shown a significantly positive relation between individuals' college GPA and future salary.)

The regression equation should not be used with values of the independent variable that are above or below the ones from the sample.

The number of independent variables that should be used in the model is limited by the number of observations.
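The farm population line from earlier in the deck shows why predicting outside the sample range is risky: pushed to the year 2000, it predicts a negative population. The sketch below contrasts a naive prediction with one that refuses to extrapolate; the function names and the choice to raise `ValueError` are illustrative.

```python
# Fitted line from the farm population example: y = -0.671 x + 1,330.350
SLOPE = -0.671
INTERCEPT = 1330.350

# Range of years present in the sample
MIN_YEAR, MAX_YEAR = 1935, 1965


def naive_predict(year):
    """Apply the regression line with no range check."""
    return INTERCEPT + SLOPE * year


def guarded_predict(year):
    """Apply the regression line only inside the sampled range of years."""
    if not MIN_YEAR <= year <= MAX_YEAR:
        raise ValueError(
            f"year {year} is outside the sample range {MIN_YEAR}-{MAX_YEAR}"
        )
    return naive_predict(year)


print(round(guarded_predict(1950), 2))  # inside the sample range
print(round(naive_predict(2000), 2))    # negative: not a meaningful population
```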
