
LINEAR REGRESSION

BY- ROHIT ARORA 10113057 IPE-FINAL YEAR

REGRESSION
A technique used for the modeling and analysis of numerical data. It exploits the relationship between two or more variables so that we can gain information about one of them through knowing the values of the others. Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships.

LINEAR REGRESSION
Simple linear regression is used for three main purposes: to describe the linear dependence of one variable on another; to predict values of one variable from values of another, for which more data are available; and to correct for the linear dependence of one variable on another, in order to clarify other features of its variability. Linear regression determines the best-fit line through a scatterplot of data, such that the sum of squared residuals is minimized; equivalently, it minimizes the error variance. The fit is "best" in precisely that sense: the sum of squared errors is as small as possible. That is why it is also termed "ordinary least squares" (OLS) regression.

ASSUMPTIONS IN REGRESSION
The relationship between the response y and the regressors is linear. The error term has zero mean. The error term has constant variance σ². The errors are uncorrelated. The errors are normally distributed.

CALCULATION OF PARAMETERS
The fitted regression line is ŷ = β̂₀ + β̂₁x, where β̂₀ is the intercept and β̂₁ is the slope.
β̂₁ = Sxy / Sxx
β̂₀ = ȳ − β̂₁x̄
n = number of observations
Sxy = Σxy − (1/n)(Σx)(Σy)
Sxx = Σx² − (1/n)(Σx)²
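For concreteness, these formulas can be evaluated directly in a few lines of NumPy; this is a minimal sketch and the data values are purely illustrative.

```python
import numpy as np

# Illustrative data: any paired observations (x_i, y_i) will do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # Sxy = Σxy − (1/n)(Σx)(Σy)
Sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n         # Sxx = Σx² − (1/n)(Σx)²

beta1 = Sxy / Sxx                    # slope estimate
beta0 = y.mean() - beta1 * x.mean()  # intercept estimate
print(beta0, beta1)
```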

TESTING OF HYPOTHESIS ON SIGNIFICANCE OF REGRESSION


[Figure: for the model Y = β₀ + β₁X + ε, the deviation of an observed yᵢ from ȳ is partitioned into segments, where B is the explained variation and C is the unexplained variation.]

TESTING OF HYPOTHESIS ON SIGNIFICANCE OF REGRESSION


One should check whether the estimates from the regression model represent the real-world data. The regression model is Y = β₀ + β₁X + ε. The total variation is made up of two parts:

SST = SSR + SSE
where
SST = total sum of squares = Σ(Y − Ȳ)²
SSR = regression sum of squares = Σ(Ŷ − Ȳ)²
SSE = error sum of squares = Σ(Y − Ŷ)²
and
Ȳ = average value of the dependent variable
Ŷ = predicted value of Y for a given X
Y = observed value of the dependent variable
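As a concrete check of the partition SST = SSR + SSE, here is a short NumPy sketch; the data are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta1, beta0 = np.polyfit(x, y, 1)       # least-squares slope and intercept
y_hat = beta0 + beta1 * x                # predicted values Ŷ

SST = np.sum((y - y.mean()) ** 2)        # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)    # regression (explained) sum of squares
SSE = np.sum((y - y_hat) ** 2)           # error (unexplained) sum of squares
assert np.isclose(SST, SSR + SSE)        # the partition SST = SSR + SSE
```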

TESTING OF HYPOTHESIS ON SIGNIFICANCE OF REGRESSION


H₀: the slope β₁ = 0 (there is no linear relationship between Y and X in the model)
H₁: the slope β₁ ≠ 0 (there is a linear relationship between Y and X in the model)

The ANOVA table used to test the null hypothesis is:

Source of variation    Sum of squares    Degrees of freedom    Mean sum of squares       F-ratio
Due to regression      SSR               1                     MSSR = SSR / 1            F = MSSR / MSSE
Due to error           SSE               n − 2                 MSSE = SSE / (n − 2)
Total                  SST               n − 1
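The ANOVA F-test above can be reproduced numerically as in the following sketch; the data are made up, and scipy.stats is used only for the F distribution.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n = len(x)

beta1, beta0 = np.polyfit(x, y, 1)
y_hat = beta0 + beta1 * x
SSR = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)

MSSR = SSR / 1              # regression mean square (1 degree of freedom)
MSSE = SSE / (n - 2)        # error mean square (n − 2 degrees of freedom)
F = MSSR / MSSE
p_value = stats.f.sf(F, 1, n - 2)   # reject H0: β1 = 0 when the p-value is small
print(F, p_value)
```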

COEFFICIENT OF DETERMINATION
Coefficient of determination: R² = SSR / SST. The coefficient of determination is the proportion of the variation in Y explained by the regression model relative to the total variation; the higher it is, the stronger the relationship between Y and X. The range of R² is from 0 to 1. The statistic R² should be used with caution, since it is always possible to make R² large by adding enough terms to the model. The magnitude of R² also depends on the range of variability of the regressor variable: generally R² will increase as the spread of the x's increases and decrease as the spread of the x's decreases, provided the assumed model form is correct. Approximately, E(R²) ≈ β₁²Sxx / (β₁²Sxx + σ²), so the expected value of R² increases as Sxx increases; a large value of R² may therefore result simply because x has been varied over an unrealistically large range. Finally, R² does not measure the appropriateness of the linear model: R² will often be large even though y and x are nonlinearly related.

COEFFICIENT OF CORRELATION
Coefficient of correlation: r = ±[SSR / SST]^(1/2) = ±√R². The sign of r is the same as that of the slope in the regression model. A zero correlation indicates that there is no linear relationship between the variables. A correlation of −1 indicates a perfect negative relationship, and a correlation of +1 indicates a perfect positive relationship. It is very dangerous to conclude that there is no association between x and y just because r is close to zero, as shown in the figure below. The correlation coefficient is of value only when the relationship between x and y is linear.
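A short sketch, with illustrative data, showing that the sample correlation coefficient carries the sign of the slope and that r² equals SSR/SST in simple regression.

```python
import numpy as np

# Data with a clear negative trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.5, 8.1, 6.2, 3.9, 2.2])

r = np.corrcoef(x, y)[0, 1]        # sample correlation coefficient
slope = np.polyfit(x, y, 1)[0]
print(r, slope)                    # r is negative, matching the sign of the slope
print(r ** 2)                      # equals SSR / SST for simple linear regression
```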

Scatter plot between temperature and sales of ice cream

The calculated value of the coefficient of correlation is zero, but the data follow a smooth curve.

SITUATION WHERE THE HYPOTHESIS H₀: β₁ = 0 IS NOT REJECTED

The failure to reject H₀: β₁ = 0 suggests that there is no linear relationship between y and x. Figure 11.8 illustrates the implications of this result: it may mean either that x is of little value in explaining the variation in y (a), or that the true relationship between x and y is not linear (b).

SITUATION WHERE THE HYPOTHESIS H₀: β₁ = 0 IS REJECTED

If H₀: β₁ = 0 is rejected, this implies that x is of value in explaining the variability in y. However, rejecting H₀: β₁ = 0 could mean either that the straight-line model is adequate (fig. a) or that, even though there is a linear effect of x, better results could be obtained with the addition of higher-order polynomial terms (fig. b).

HYPOTHESIS TESTING OF SLOPE


We wish to test the hypothesis that the slope equals a constant β₁₀. The hypotheses are:
H₀: β₁ = β₁₀
H₁: β₁ ≠ β₁₀
For this test we assume that the errors are normally and independently distributed. The test statistic is
t₀ = (β̂₁ − β₁₀) / (MSSE / Sxx)^(1/2)
β̂₁ is a linear combination of the observations; it is normally distributed with mean β₁ and variance σ²/Sxx. The degrees of freedom associated with t₀ are those associated with MSSE, namely n − 2. The procedure rejects the null hypothesis if |t₀| > t(α/2, n−2). The standard error of the slope is se(β̂₁) = (MSSE / Sxx)^(1/2).

CONFIDENCE INTERVALS OF SLOPE


The sampling distribution of the slope estimator is given by t = (β̂₁ − β₁) / (MSSE / Sxx)^(1/2), which follows a t distribution with n − 2 degrees of freedom. A 100(1 − α) percent confidence interval for the slope is given by
β̂₁ − t(α/2, n−2) se(β̂₁) ≤ β₁ ≤ β̂₁ + t(α/2, n−2) se(β̂₁)
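The slope t-test and its confidence interval can be computed as in the following sketch; the data are illustrative, the null value β₁₀ is taken to be 0, and α = 0.05.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n = len(x)

beta1, beta0 = np.polyfit(x, y, 1)
MSSE = np.sum((y - (beta0 + beta1 * x)) ** 2) / (n - 2)
Sxx = np.sum((x - x.mean()) ** 2)

se_beta1 = np.sqrt(MSSE / Sxx)                   # standard error of the slope
t0 = (beta1 - 0.0) / se_beta1                    # test H0: β1 = β10 with β10 = 0
p_value = 2 * stats.t.sf(abs(t0), n - 2)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
ci = (beta1 - t_crit * se_beta1, beta1 + t_crit * se_beta1)   # 95% CI for β1
print(t0, p_value, ci)
```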

HYPOTHESIS TESTING OF INTERCEPT


We wish to test the hypothesis that the intercept equals a constant β₀₀. The hypotheses are:
H₀: β₀ = β₀₀
H₁: β₀ ≠ β₀₀
For this test we assume that the errors are normally and independently distributed. The test statistic is
t₀ = (β̂₀ − β₀₀) / [MSSE (1/n + x̄²/Sxx)]^(1/2)
β̂₀ is a linear combination of the observations; it is normally distributed with mean β₀. The degrees of freedom associated with t₀ are those associated with MSSE, namely n − 2. The procedure rejects the null hypothesis if |t₀| > t(α/2, n−2). The standard error of the intercept is se(β̂₀) = [MSSE (1/n + x̄²/Sxx)]^(1/2).

CONFIDENCE INTERVALS OF INTERCEPT


The sampling distribution of the intercept estimator is given by t = (β̂₀ − β₀) / [MSSE (1/n + x̄²/Sxx)]^(1/2), which follows a t distribution with n − 2 degrees of freedom. A 100(1 − α) percent confidence interval for the intercept is given by
β̂₀ − t(α/2, n−2) se(β̂₀) ≤ β₀ ≤ β̂₀ + t(α/2, n−2) se(β̂₀)
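A corresponding sketch for the intercept, using the standard error se(β̂₀) = [MSSE(1/n + x̄²/Sxx)]^(1/2); the data and the null value β₀₀ = 0 are illustrative.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n = len(x)

beta1, beta0 = np.polyfit(x, y, 1)
MSSE = np.sum((y - (beta0 + beta1 * x)) ** 2) / (n - 2)
Sxx = np.sum((x - x.mean()) ** 2)

se_beta0 = np.sqrt(MSSE * (1.0 / n + x.mean() ** 2 / Sxx))   # se of the intercept
t0 = (beta0 - 0.0) / se_beta0                                # test H0: β0 = β00 with β00 = 0
t_crit = stats.t.ppf(0.975, n - 2)
ci = (beta0 - t_crit * se_beta0, beta0 + t_crit * se_beta0)  # 95% CI for β0
print(t0, ci)
```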

Confidence interval for mean and individual response of y at a specified x

ŷ₀ = β̂₀ + β̂₁x₀, where n is the number of observations.

For the mean response at x = x₀:

ŷ₀ − t(α/2, n−2) [MSSE (1/n + (x₀ − x̄)²/Sxx)]^(1/2) ≤ E(y | x₀) ≤ ŷ₀ + t(α/2, n−2) [MSSE (1/n + (x₀ − x̄)²/Sxx)]^(1/2)

For an individual (future) observation at x = x₀:

ŷ₀ − t(α/2, n−2) [MSSE (1 + 1/n + (x₀ − x̄)²/Sxx)]^(1/2) ≤ y₀ ≤ ŷ₀ + t(α/2, n−2) [MSSE (1 + 1/n + (x₀ − x̄)²/Sxx)]^(1/2)
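The two intervals can be computed as follows; x₀ = 3.5 and the data are purely illustrative.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n = len(x)

beta1, beta0 = np.polyfit(x, y, 1)
MSSE = np.sum((y - (beta0 + beta1 * x)) ** 2) / (n - 2)
Sxx = np.sum((x - x.mean()) ** 2)

x0 = 3.5                                  # point at which to predict
y0_hat = beta0 + beta1 * x0
t_crit = stats.t.ppf(0.975, n - 2)        # 95% intervals

half_mean = t_crit * np.sqrt(MSSE * (1 / n + (x0 - x.mean()) ** 2 / Sxx))        # mean response
half_indiv = t_crit * np.sqrt(MSSE * (1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx))   # individual response
print(y0_hat - half_mean, y0_hat + half_mean)     # confidence interval for the mean
print(y0_hat - half_indiv, y0_hat + half_indiv)   # prediction interval for an individual y
```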

Residual Analysis
Definition of residuals. Residual: eᵢ = yᵢ − ŷᵢ, i = 1, …, n
The residuals are the deviations between the data and the fit: a measure of the variability in the response variable not explained by the regression model, and the realized or observed values of the model errors.

Residual Analysis
Residual Plot
Graphical analysis is a very effective way to investigate the adequacy of the fit of a regression model and to check the underlying assumptions.

Normal probability plot: a simple way to check the normality assumption. If the errors come from a distribution with thicker or heavier tails than the normal, the least-squares fit may be sensitive to a small subset of the data; heavy-tailed error distributions often generate outliers that pull the least-squares fit too much in their direction. Rank the residuals e[1] ≤ … ≤ e[n] and plot e[i] against the cumulative probability Pᵢ = (i − 1/2)/n, or sometimes against Φ⁻¹[(i − 1/2)/n]. If the residuals are normal, the plot is nearly a straight line for large samples (n > 32); small samples (n ≤ 16) may deviate from a straight line even when the residuals are normal. Usually at least 20 points are required for a useful normal probability plot.
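A normal probability plot of the residuals can be produced with scipy's probplot, which plots the ranked residuals against theoretical normal quantiles; the data below are illustrative and matplotlib is assumed to be available.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

beta1, beta0 = np.polyfit(x, y, 1)
residuals = y - (beta0 + beta1 * x)

# probplot ranks the residuals and plots them against normal quantiles,
# which is equivalent to plotting e[i] against Φ⁻¹[(i − 1/2)/n].
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()
```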

Residual Analysis

Fitting the parameters tends to destroy the evidence of nonnormality in the residuals, so we cannot always rely on the normal probability plot to detect departures from normality. A common defect is the occurrence of one or two large residuals; sometimes this is an indication that the corresponding observations are outliers.

Residual Analysis
Plot of Residuals against the Fitted Values

(a) Satisfactory. (b) The variance is an increasing function of ŷ. (c) Often occurs when y is a proportion between 0 and 1. (d) Indicates nonlinearity.

Residual Analysis
Plot of Residuals in Time Sequence: The time sequence plot of residuals may indicate that the errors at one time period are correlated with those at other time periods. Autocorrelation: The correlation between model errors at different time periods. (a) positive autocorrelation (b) negative autocorrelation
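The two diagnostic plots described above (residuals against fitted values, and residuals in time order) can be sketched as follows; the data are illustrative and matplotlib is assumed to be available.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
beta1, beta0 = np.polyfit(x, y, 1)
fitted = beta0 + beta1 * x
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.scatter(fitted, residuals)      # should look like a structureless horizontal band
ax1.axhline(0, color="grey")
ax1.set(xlabel="fitted values", ylabel="residuals", title="Residuals vs fitted values")

ax2.plot(residuals, marker="o")     # runs of same-signed residuals suggest autocorrelation
ax2.axhline(0, color="grey")
ax2.set(xlabel="time order", ylabel="residuals", title="Residuals in time sequence")

plt.show()
```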

LACK OF FIT TEST


In this test we test the following hypotheses:
H₀: the model adequately fits the data
H₁: the model does not fit the data
The test involves partitioning the error sum of squares into two components, SSE = SSPE + SSLOF, where SSPE is the sum of squares attributable to pure error and SSLOF is the sum of squares attributable to lack of fit of the model. To compute SSPE we must have repeated observations on y for at least one level of x:
y₁₁, y₁₂, …, y₁ₙ₁ : repeated observations at x₁
…
yₐ₁, yₐ₂, …, yₐₙₐ : repeated observations at xₐ
Note that there are a distinct levels of x. The total sum of squares for pure error is obtained as
SSPE = Σᵢ Σᵤ (yᵢᵤ − ȳᵢ)²

LACK OF FIT TEST


The sum of squares for lack of fit is simply SSLOF = SSE − SSPE. The F-test statistic, with a − 2 and n − a degrees of freedom, is F₀ = [SSLOF / (a − 2)] / [SSPE / (n − a)]. We reject the null hypothesis of an adequate fit if F₀ > F(α, a−2, n−a).
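A sketch of the lack-of-fit computation, assuming a small illustrative data set with repeated observations at several x levels.

```python
import numpy as np
from scipy import stats

# Repeated observations of y at some x levels are needed to estimate pure error.
x = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 3.0, 4.0, 5.0, 5.0])
y = np.array([2.3, 2.1, 3.8, 6.4, 6.0, 6.3, 8.2, 9.9, 10.3])
n = len(x)

beta1, beta0 = np.polyfit(x, y, 1)
SSE = np.sum((y - (beta0 + beta1 * x)) ** 2)

levels = np.unique(x)
a = len(levels)                                                  # distinct x levels
SSPE = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)
SSLOF = SSE - SSPE                                               # lack-of-fit sum of squares

F0 = (SSLOF / (a - 2)) / (SSPE / (n - a))
p_value = stats.f.sf(F0, a - 2, n - a)   # small p-value => the straight line does not fit
print(F0, p_value)
```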

A plot for lack of fit test

CONSIDERATIONS IN USE OF REGRESSIONS


Regression models are intended as interpolation equations over the range of the regressor variable used to fit the model; we must be careful if we extrapolate outside this range. The disposition of the x values plays an important role in the least-squares fit: the slope is more strongly influenced by the remote values of x.

In the figure, the slope depends heavily on the remote points A and B; the remaining data would give a very different estimate of the slope if A and B were deleted. Situations like this often require corrective action, such as further analysis or estimation of the model parameters with some other technique that is less influenced by these points.

CONSIDERATIONS IN USE OF REGRESSIONS


In the following situation the slope is largely determined by the extreme point; if this point is deleted, the slope estimate is probably zero. Because of the gap between the two clusters of points, we really have only two distinct units of information with which to fit the model, so we should be aware that a small cluster of points may control key model parameters.

Just because a regression analysis has indicated a strong relationship between two variables, this does not imply that the variables are related in a causal sense. Causality implies a necessary correlation, but regression analysis cannot address the issue of necessity.

CONSIDERATIONS IN USE OF REGRESSIONS


Outliers can seriously disturb the least-squares fit, as shown in the figure. However, data point A may not be a bad value; it may be a highly useful piece of evidence concerning the process under investigation.

In some applications the value of the regressor variable x required to predict y is itself unknown; for example, to forecast the load on an electric power generation system we first need to forecast the temperature.

POLYNOMIAL REGRESSION MODELS


Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-order polynomial. In many settings a linear relationship may not hold. For example, if we are modeling the yield of a chemical synthesis in terms of the temperature at which the synthesis takes place, we may find that the yield improves by increasing amounts for each unit increase in temperature. In this case, we might propose a quadratic model of the form

y = β₀ + β₁x + β₂x² + ε

In general, we can model the expected value of y as an nth-order polynomial, yielding the general polynomial regression model

y = β₀ + β₁x + β₂x² + … + βₙxⁿ + ε
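A quadratic polynomial can be fitted with numpy.polyfit; the temperature and yield numbers below are invented purely for illustration.

```python
import numpy as np

# Hypothetical temperature (°C) and yield (%) data for a quadratic fit
# y = β0 + β1·x + β2·x².
temperature = np.array([50.0, 60.0, 70.0, 80.0, 90.0, 100.0])
yield_pct = np.array([35.0, 44.0, 56.0, 71.0, 89.0, 110.0])

coeffs = np.polyfit(temperature, yield_pct, 2)   # [β2, β1, β0], highest order first
model = np.poly1d(coeffs)

print(coeffs)
print(model(75.0))                               # predicted yield at 75 °C
```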

CONSIDERATIONS FOR FITTING A POLYNOMIAL IN ONE VARIABLE


Order of the model: it is important to keep the order of the model as low as possible. Transformations should be tried to keep the model first order; a low-order model in a transformed variable is almost always preferable to a high-order model in the original metric. Extrapolation: extrapolation with polynomial models can be very hazardous. If we extrapolate beyond the range of the original data, the fitted polynomial may behave in ways the data do not support; in the illustrating figure, for example, the predicted response turns downward.

CONSIDERATIONS FOR FITTING A POLYNOMIAL IN ONE VARIABLE

Hierarchy: the regression model y = β₀ + β₁x + β₂x² + β₃x³ + ε is said to be hierarchical because it contains all terms of order three and lower. Model-building strategy: various strategies for choosing the order of an approximating polynomial have been suggested; the two main strategies are forward selection and backward elimination.

SELECTING REGRESSION MODELS


INTRODUCTION
In complex regression situations, where there is a large number of explanatory variables which may or may not be relevant for making predictions about the response variable, it is useful to be able to reduce the model so that it contains only the variables which provide important information about the response variable. First of all, we need to define the maximum model, that is, the model containing all explanatory variables which could possibly be present in the final model. Let k denote the maximum number of feasible explanatory variables; the maximum model is then given by

Yᵢ = β₀ + β₁xᵢ,₁ + β₂xᵢ,₂ + … + βₖxᵢ,ₖ + εᵢ

where x₁, x₂, …, xₖ are the explanatory variables, and the εᵢ are independent, normally distributed random error terms with zero mean and common variance.

SELECTION CRITERIA
When the maximum model has been defined, the next point to consider is how to determine whether one model is 'better' than the rest: which criterion should we use to compare the possible models? A selection criterion is a rule which orders all possible models from 'best' to 'worst'. Many different criteria have been suggested over time; some are better than others, but there is no single criterion which is universally preferred.

SELECTING REGRESSION MODELS


The purpose of a selection criterion is to compare the maximum model with a reduced model

Yᵢ = β₀ + β₁xᵢ,₁ + β₂xᵢ,₂ + … + βₘxᵢ,ₘ + εᵢ

which is a restriction of the maximum model. If the reduced model provides (almost) as good a fit to the data as the maximum model, then we prefer the reduced model.

The Ra² criterion: due to the way R² is defined, the largest model (the one with the most explanatory variables) will always have the largest R², whether or not the extra variables provide any important information about the response variable. A common way to avoid this problem is to use an adjusted version of R² instead of R² itself. The adjusted R² statistic, for a model with k explanatory variables fitted to n observations, is given by

Ra² = 1 − (1 − R²)(n − 1) / (n − k − 1)

Note that Ra² does not necessarily increase when the number of explanatory variables increases. According to the Ra² criterion, one should choose the model which has the largest Ra².
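The adjusted R² criterion is easy to compute directly; the R², n and k values below are hypothetical.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R²: Ra² = 1 − (1 − R²)(n − 1)/(n − k − 1),
    for k explanatory variables and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding variables that barely raise R² can lower the adjusted R².
print(adjusted_r2(0.80, n=30, k=3))   # ≈ 0.777
print(adjusted_r2(0.81, n=30, k=6))   # ≈ 0.760
```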

SELECTING REGRESSION MODELS


The F-test criterion: a different, but equally intuitive, selection criterion is the F-test criterion. The idea is to test the significance of the k − m explanatory variables xₘ₊₁, …, xₖ in the maximum model, in order to obtain the reduced model. That is, we test the null hypothesis

H₀: βₘ₊₁ = βₘ₊₂ = … = βₖ = 0

The F-test statistic for testing the significance of xₘ₊₁, …, xₖ is given by

F = [(SSE_reduced − SSE_max) / (k − m)] / [SSE_max / (n − k − 1)]

If H₀ is not rejected, the reduced model provides as good a fit to the data as the maximum model, so we can use the reduced model instead of the maximum model.
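The F-test above can be evaluated from the two error sums of squares; the numbers below are hypothetical, and the helper function partial_f_test is introduced here only for illustration.

```python
from scipy import stats

def partial_f_test(sse_reduced: float, sse_max: float, n: int, k: int, m: int):
    """F-test comparing the maximum model (k regressors) with a reduced
    model (m regressors), both fitted with an intercept."""
    F = ((sse_reduced - sse_max) / (k - m)) / (sse_max / (n - k - 1))
    p = stats.f.sf(F, k - m, n - k - 1)
    return F, p

# Hypothetical error sums of squares: dropping x_{m+1}, …, x_k barely worsens the fit.
print(partial_f_test(sse_reduced=52.0, sse_max=50.0, n=40, k=5, m=3))
```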

SELECTION PROCEDURES
All possible models procedure: The most careful selection procedure is the all possible models procedure in which all possible models are fitted to the data, and the selection criterion is used on all the models in order to find the model which is preferable to all others.

SELECTING REGRESSION MODELS


BACKWARD ELIMINATION: the backward elimination procedure is basically a sequence of tests for significance of the explanatory variables. Starting out with the maximum model Yᵢ = β₀ + β₁xᵢ,₁ + β₂xᵢ,₂ + … + βₖxᵢ,ₖ + εᵢ, we remove (or eliminate) the variable with the highest p-value for the test of significance of that variable, provided the p-value is bigger than some pre-determined level (say, 0.10). Next, we fit the reduced model (having removed that variable from the maximum model), and remove from the reduced model the variable with the highest p-value for the test of significance of that variable (if p > 0.10), and so on. The procedure ends when no more variables can be removed from the model at the 10% significance level. Note that we use the F-test criterion in this procedure; a code sketch is given after this slide. FORWARD SELECTION: the forward selection procedure is a reversed version of the backward elimination procedure. Instead of starting with the maximum model and eliminating variables one by one, we start with an 'empty' model with no explanatory variables, and add variables one by one until we cannot improve the model significantly by adding another variable.
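A minimal sketch of backward elimination based on p-values, assuming statsmodels and pandas are available; the function name backward_elimination, the 0.10 threshold, and the simulated data are all illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_elimination(X: pd.DataFrame, y: pd.Series, alpha: float = 0.10) -> list:
    """Repeatedly drop the regressor with the largest p-value above alpha."""
    variables = list(X.columns)
    while variables:
        model = sm.OLS(y, sm.add_constant(X[variables])).fit()
        pvalues = model.pvalues.drop("const")       # p-values of the regressors only
        worst = pvalues.idxmax()
        if pvalues[worst] > alpha:
            variables.remove(worst)                 # eliminate it and refit
        else:
            break
    return variables

# Simulated data in which only x1 is truly related to the response.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x1", "x2", "x3"])
y = 2 + 3 * X["x1"] + rng.normal(size=50)
print(backward_elimination(X, y))                   # typically ['x1']
```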

SELECTING REGRESSION MODELS


STEPWISE REGRESSION PROCEDURE :The stepwise regression procedure modifies the forward selection procedure in the following way. Each time a new variable is added to the model, the significance of each of the variables already in the model is re-examined. That is, at each step in the forward selection procedure, we test for significance of each of the variables currently in the model, and remove the one with the highest p-value (if the p-value is above some threshold value, say 0.10). The model is then re-fitted without this variable, before going to the next step in the forward selection procedure. The stepwise regression procedure continues until no more variables can be added or removed.

THANK YOU
