
Regression Analysis: How to Interpret the Constant (Y Intercept)

The constant term in linear regression analysis seems to be such a simple thing. Also known as the y intercept, it is simply the value at which the fitted line crosses the y-axis. While the concept is simple, I've seen a lot of confusion about interpreting the constant. That's not surprising because the value of the constant term is almost always meaningless! Paradoxically, while the value is generally meaningless, it is crucial to include the constant term in most regression models! In this post, I'll show you everything you need to know about the constant in linear regression analysis. I'll use fitted line plots to illustrate the concepts because they really bring the math to life. However, a 2D fitted line plot can only display the results from simple regression, which has one predictor variable and the response. The concepts hold true for multiple linear regression, but I can't graph the higher dimensions that are required.

Zero Settings for All of the Predictor Variables Are Often Impossible

I've often seen the constant described as the mean response value when all predictor variables are set to zero. Mathematically, that's correct. However, a zero setting for all predictors in a model is often an impossible/nonsensical combination, as it is in the following example. In my last post about the interpretation of regression p-values and coefficients, I used a fitted line plot to illustrate a weight-by-height regression analysis. Below, I've changed the scale of the y-axis on that fitted line plot, but the regression results are the same as before.

If you follow the blue fitted line down to where it intercepts the y-axis, it is a fairly negative value. From the regression equation, we see that the intercept value is -114.3. If height is zero, the regression equation predicts that weight is -114.3 kilograms! Clearly this constant is meaningless and you shouldn't even try to give it meaning. No human can have zero height or a negative weight! Now imagine a multiple regression analysis with many predictors. It becomes even more unlikely that ALL of the predictors can realistically be set to zero. If all of the predictors can't be zero, it is impossible to interpret the value of the constant. Don't even try!
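
If you fit this kind of simple regression outside of Minitab, the constant is simply the first entry in the coefficient vector. Here is a minimal Python sketch with statsmodels, using a small invented height/weight sample (not the article's worksheet), purely to show where the intercept appears and what it represents:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical height (m) and weight (kg) values, roughly in the range the
# article describes; this is NOT the article's Minitab worksheet.
height = np.array([1.30, 1.40, 1.45, 1.50, 1.55, 1.60, 1.65, 1.70])
weight = np.array([32.0, 36.5, 41.0, 45.0, 49.5, 52.0, 58.0, 62.5])

X = sm.add_constant(height)        # adds the column that estimates the constant
fit = sm.OLS(weight, X).fit()

print("constant (y intercept):", fit.params[0])   # the prediction at height = 0
print("slope for height:      ", fit.params[1])
```

Whatever number the constant comes out to be, it is a prediction at height = 0, which is not a height any subject can actually have.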

Zero Settings for All of the Predictor Variables Can Be Outside the Data Range
Even if it's possible for all of the predictor variables to equal zero, that data point might be outside the range of the observed data.

You should never use a regression model to make a prediction for a point that is outside the range of your data because the relationship between the variables might change. The value of the constant is a prediction for the response value when all predictors equal zero. If you didn't collect data in this all-zero range, you can't trust the value of the constant. The height-by-weight example illustrates this concept. These data are from middle school girls and we can't estimate the relationship between the variables outside of the observed weight and height range. However, we can get a sense that the relationship changes by marking the average weight and height for a newborn baby on the graph. That's not quite zero height, but it's as close as we can get.

I drew the red circle near the origin to approximate the newborn's average height and weight. You can clearly see that the relationship must change as you extend the data range! So the relationship we see for the observed data is locally linear, but it changes beyond that. That's why you shouldn't predict outside the range of your data...and another reason why the regression constant can be meaningless.

The Constant Is the Garbage Collector for the Regression Model


Even if a zero setting for all predictors is a plausible scenario, and even if you collect data within that all-zero range, the constant might still be meaningless! The constant term is in part estimated by the omission of predictors from a regression analysis. In essence, it serves as a garbage bin for any bias that is not accounted for by the terms in the model. You can picture this by imagining that the regression line floats up and down (by adjusting the constant) to a point where the mean of the residuals is zero, which is a key assumption for residual analysis. This floating is not based on what makes sense for the constant, but rather what works mathematically to produce that zero mean. The constant guarantees that the residuals don't have an overall positive or negative bias, but also makes it harder to interpret the value of the constant because it absorbs the bias.
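
You can see this zero-mean property directly by fitting the same data with and without a constant. A minimal Python/statsmodels sketch on invented data (any data whose true intercept is nonzero behaves the same way):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1.3, 1.7, size=50)               # invented predictor values
y = 25.0 + 60.0 * x + rng.normal(0, 3, size=50)  # true intercept is nonzero

with_const = sm.OLS(y, sm.add_constant(x)).fit()
no_const = sm.OLS(y, x[:, None]).fit()           # regression through the origin

print("mean residual, with constant:   ", with_const.resid.mean())  # ~0 by construction
print("mean residual, without constant:", no_const.resid.mean())    # generally nonzero
```

With the constant in the model, the residual mean is zero by construction; drop the constant and whatever bias it was absorbing shows up in the residuals instead.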

Why Is It Crucial to Include the Constant in a Regression Model?


Immediately above, we saw a key reason why you should include the constant in your regression model. It guarantees that your residuals have a mean of zero. Additionally, if you don't include the constant, the regression line is forced to go through the origin. This means that all of the predictors and the response variable must equal zero at that point. If your fitted line doesn't naturally go through the origin, your regression coefficients and predictions will be biased if you don't include the constant. I'll use the height and weight regression example to illustrate this concept. First, I'll use General Regression in Minitab statistical software to fit the model without the constant. In the output below, you can see that there is no constant, just a coefficient for height.

Next, I'll overlay the line for this equation on the previous fitted line plot so we can compare the model with and without the constant.

The blue line is the fitted line for the regression model with the constant while the green line is for the model without the constant. Clearly, the green line just doesn't fit. The slope is way off and the predicted values are biased. For the model without the constant, the weight predictions tend to be too high for shorter subjects and too low for taller subjects. In closing, the regression constant is generally not worth interpreting. Despite this, it is almost always a good idea to include the constant in your regression analysis. In the end, the real value of a regression model is the ability to understand how the response variable changes when you change the values of the predictor variables. Don't worry too much about the constant!

How to Interpret Regression Analysis Results: P-values and Coefficients


Regression analysis generates an equation to describe the statistical relationship between one or more predictor variables and the response variable. After you use Minitab Statistical Software to fit a regression model, and verify the fit by checking the residual plots, you'll want to interpret the results. In this post, I'll show you how to interpret the p-values and coefficients that appear in the output for linear regression analysis.

How Do I Interpret the P-Values in Linear Regression Analysis?


The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable. Conversely, a larger (insignificant) p-value suggests that changes in the predictor are not associated with changes in the response. In the output below, we can see that the predictor variables of South and North are significant because both of their p-values are 0.000. However, the p-value for East (0.092) is greater than the common alpha level of 0.05, which indicates that it is not statistically significant.

Typically, you use the coefficient p-values to determine which terms to keep in the regression model. In the model above, we should consider removing East.
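
Outside of Minitab, the same check is a one-liner once the model is fit. Here is a hedged Python/statsmodels sketch on invented data; the column names mirror the article's example, but the numbers, and therefore the p-values, are not the article's:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Invented illustrative data with column names borrowed from the article's output.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "South": rng.normal(size=100),
    "North": rng.normal(size=100),
    "East": rng.normal(size=100),
})
df["Response"] = 3.0 + 2.0 * df["South"] - 1.5 * df["North"] + rng.normal(size=100)

X = sm.add_constant(df[["South", "North", "East"]])
fit = sm.OLS(df["Response"], X).fit()

print(fit.pvalues)                       # one p-value per term, including the constant
pvals = fit.pvalues.drop("const")
print("terms below alpha = 0.05:", list(pvals[pvals < 0.05].index))
```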

How Do I Interpret the Regression Coefficients for Linear Relationships?


Regression coefficients represent the mean change in the response variable for one unit of change in the predictor variable while holding other predictors in the model constant. This statistical control that regression provides is important because it isolates the role of one variable from all of the others in the model. The key to understanding the coefficients is to think of them as slopes, and they're often called slope coefficients. I'll illustrate this in the fitted line plot below, where I'll use a person's height to model their weight. First, Minitab's session window output:

The fitted line plot shows the same regression results graphically.

The equation shows that the coefficient for height in meters is 106.5 kilograms. The coefficient indicates that for every additional meter in height you can expect weight to increase by an average of 106.5 kilograms. The blue fitted line graphically shows the same information. If you move left or right along the x-axis by an amount that represents a one meter change in height, the fitted line rises or falls by 106.5 kilograms. However, these heights are from middle-school aged girls and range from 1.3 m to 1.7 m. The relationship is only valid within this data range, so we would not actually shift up or down the line by a full meter in this case. If the fitted line was flat (a slope coefficient of zero), the expected value for weight would not change no matter how far up and down the line you go. So, a low p-value suggests that the slope is not zero, which in turn suggests that changes in the predictor variable are associated with changes in the response variable. I used a fitted line plot because it really brings the math to life. However, fitted line plots can only display the results from simple regression, which is one predictor variable and the response. The concepts hold true for multiple linear regression, but I would need an extra spatial dimension for each additional predictor to plot the results. That's hard to show with today's technology!
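
As a quick numeric check of what a slope coefficient means, here is a hedged Python/statsmodels sketch on invented height/weight data (the true slope is set near, but is not, the article's 106.5). Predictions made 0.1 m apart differ by exactly one-tenth of the estimated slope:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
height = rng.uniform(1.3, 1.7, size=40)                        # invented heights (m)
weight = -110.0 + 105.0 * height + rng.normal(0, 4, size=40)   # invented weights (kg)

fit = sm.OLS(weight, sm.add_constant(height)).fit()
slope = fit.params[1]

# A 0.1 m change in height shifts the prediction by 0.1 * slope, nothing more.
preds = fit.predict([[1.0, 1.4], [1.0, 1.5]])
print(slope * 0.1, preds[1] - preds[0])   # the two numbers agree
```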

How Do I Interpret the Regression Coefficients for Curvilinear Relationships and Interaction Terms?
In the above example, height is a linear effect; the slope is constant, which indicates that the effect is also constant along the entire fitted line. However, if your model requires polynomial or interaction terms, the interpretation is a bit less intuitive.

As a refresher, polynomial terms model curvature in the data, while interaction terms indicate that the effect of one predictor depends on the value of another predictor. The next example uses a data set that requires a quadratic (squared) term to model the curvature. In the output below, we see that the p-values for both the linear and quadratic terms are significant.

The residual plots (not shown) indicate a good fit, so we can proceed with the interpretation. But, how do we interpret these coefficients? It really helps to graph it in a fitted line plot.

You can see how the relationship between the machine setting and energy consumption varies depending on where you start on the fitted line. For example, if you start at a machine setting of 12 and increase the setting by 1, you'd expect energy consumption to decrease. However, if you start at 25, an increase of 1 should increase energy consumption. And if you're around 20, energy consumption shouldn't change much at all. A significant polynomial term can make the interpretation less intuitive because the effect of changing the predictor varies depending on the value of that predictor. Similarly, a significant interaction term indicates that the effect of the predictor varies depending on the value of a different predictor.
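
To make that concrete, here is a small Python sketch with made-up coefficients chosen only so that the quadratic bottoms out near a setting of 20, like the article's plot; they are not the article's estimates. The effect of a one-unit increase changes sign as you move along the curve:

```python
# Illustrative coefficients only: energy = b0 + b1*setting + b2*setting**2,
# with the minimum near a setting of 20 (vertex at -b1 / (2*b2) = 20).
b0, b1, b2 = 120.0, -8.0, 0.2

def energy(setting):
    return b0 + b1 * setting + b2 * setting ** 2

for start in (12, 20, 25):
    change = energy(start + 1) - energy(start)   # effect of a one-unit increase
    print(f"setting {start} -> {start + 1}: energy changes by {change:+.1f}")

# Output is negative below the minimum, near zero around it, and positive above it:
# the "effect" of the predictor depends on where you start.
```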

Take extra care when you interpret a regression model that contains these types of terms. You can't just look at the main effect (linear term) and understand what is happening! Unfortunately, if you are performing multiple regression analysis, you won't be able to use a fitted line plot to graphically interpret the results. This is where subject area knowledge is extra valuable! Particularly attentive readers may have noticed that I didn't tell you how to interpret the constant. I'll cover that in my next post!

Multiple Regression Analysis: Use Adjusted R-Squared and Predicted R-Squared to Include the Correct Number of Variables

Multiple regression can be a beguiling, temptation-filled analysis. It's so easy to add more variables as you think of them, or just because the data are handy. Some of the predictors will be significant. Perhaps there is a relationship, or is it just by chance? You can add higher-order polynomials to bend and twist that fitted line as you like, but are you fitting real patterns or just connecting the dots? All the while, the R-squared (R2) value increases, teasing you, and egging you on to add more variables! Previously, I showed how R-squared can be misleading when you assess the goodness-of-fit for linear regression analysis. In this post, we'll look at why you should resist the urge to add too many predictors to a regression model, and how the adjusted R-squared and predicted R-squared can help!

Some Problems with R-squared


In my last post, I showed how R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots. However, R-squared has additional problems that the adjusted R-squared and predicted R-squared are designed to address.

Problem 1: Every time you add a predictor to a model, the R-squared increases, even if due to chance alone. It never decreases. Consequently, a model with more terms may appear to have a better fit simply because it has more terms.

Problem 2: If a model has too many predictors and higher order polynomials, it begins to model the random noise in the data. This condition is known as overfitting the model and it produces misleadingly high R-squared values and a lessened ability to make predictions.
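
Problem 1 is easy to demonstrate outside Minitab. In this hedged Python/statsmodels sketch, both the response and the predictors are pure random noise, yet R-squared still climbs with every predictor added:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 30
y = rng.normal(size=n)                       # pure-noise response
noise_predictors = rng.normal(size=(n, 8))   # predictors unrelated to y

for k in range(1, 9):
    X = sm.add_constant(noise_predictors[:, :k])
    r2 = sm.OLS(y, X).fit().rsquared
    print(f"{k} random predictors: R-squared = {r2:.3f}")

# R-squared creeps upward with every added term, even though none of these
# predictors has any real relationship with the response.
```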

What Is the Adjusted R-squared?


The adjusted R-squared compares the explanatory power of regression models that contain different numbers of predictors. Suppose you compare a five-predictor model with a higher R-squared to a one-predictor model. Does the five-predictor model have a higher R-squared because it's better? Or is the R-squared higher because it has more predictors? Simply compare the adjusted R-squared values to find out! The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it's usually not. It is always lower than the R-squared. In the simplified Best Subsets Regression output below, you can see where the adjusted R-squared peaks, and then declines. Meanwhile, the R-squared continues to increase.
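
Minitab reports adjusted R-squared directly; for reference, the usual formula behind it is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of predictors. A minimal Python sketch of that penalty, using made-up values:

```python
def adjusted_r_squared(r2: float, n_obs: int, n_predictors: int) -> float:
    """Standard adjustment: penalizes R-squared for each predictor added."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# A higher raw R-squared from a bigger model can still lose on adjusted R-squared:
print(adjusted_r_squared(0.80, n_obs=30, n_predictors=3))   # about 0.777
print(adjusted_r_squared(0.82, n_obs=30, n_predictors=8))   # about 0.751
```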

You might want to include only three predictors in this model. In my last blog, we saw how an under-specified model (one that was too simple) can produce biased estimates. However, an overspecified model (one that's too complex) is more likely to reduce the precision of coefficient estimates and predicted values. Consequently, you don't want to include more terms in the model than necessary. (Read an example of using Minitab's Best Subsets Regression.)

What Is the Predicted R-squared?


The predicted R-squared indicates how well a regression model predicts responses for new observations. This statistic helps you determine when the model fits the original data but is less capable of providing valid predictions for new observations. (Read an example of using regression to make predictions.) Minitab calculates predicted R-squared by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation. Like adjusted R-squared, predicted R-squared can be negative and it is always lower than R-squared. Even if you don't plan to use the model for predictions, the predicted R-squared still provides crucial information. A key benefit of predicted R-squared is that it can prevent you from overfitting a model. As mentioned earlier, an overfit model contains too many predictors and it starts to model the random noise. Because it is impossible to predict random noise, the predicted R-squared must drop for an overfit model. If you see a predicted R-squared that is much lower than the regular R-squared, you almost certainly have too many terms in the model.
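
That leave-one-out procedure is the PRESS statistic. If you want to reproduce the idea outside Minitab, here is a hedged Python sketch of the standard hat-matrix shortcut on invented data; it mirrors the usual textbook formula, and I am not claiming it matches Minitab's output digit for digit:

```python
import numpy as np
import statsmodels.api as sm

def predicted_r_squared(y, X):
    """Predicted R-squared from the PRESS statistic (leave-one-out shortcut)."""
    fit = sm.OLS(y, X).fit()
    hat = X @ np.linalg.inv(X.T @ X) @ X.T          # hat (projection) matrix
    press_resid = fit.resid / (1.0 - np.diag(hat))  # leave-one-out residuals
    press = np.sum(press_resid ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_total

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=40)
y = 2.0 + 0.8 * x + rng.normal(0, 1, size=40)       # invented, well-behaved data
X = sm.add_constant(x)

print(predicted_r_squared(y, X))   # close to the ordinary R-squared for a sane model
```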

Examples of Overfit Models and Predicted R-squared


You can try these examples for yourself using this Minitab project file that contains two worksheets. If you want to play along and you don't already have it, please download the free 30-day trial of Minitab Statistical Software! There's an easy way for you to see an overfit model in action. If you analyze a linear regression model that has one predictor for each degree of freedom, you'll always get an R-squared of 100%! In the random data worksheet, I created 10 rows of random data for a response variable and nine predictors. Because there are nine predictors and nine degrees of freedom, we get an R-squared of 100%.
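
The same demonstration takes only a few lines anywhere. Here is a Python/statsmodels sketch with freshly generated random numbers (so not the article's worksheet): ten observations, nine noise predictors plus a constant, and the fit is "perfect."

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
y = rng.normal(size=10)            # 10 random responses
X = rng.normal(size=(10, 9))       # 9 random, unrelated predictors

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.rsquared)                # essentially 1.0 -- a "perfect" fit to pure noise
```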

It appears that the model accounts for all of the variation. However, we know that the random predictors do not have any relationship to the random response! We are just fitting the random variability. That's an extreme case, but let's look at some real data in the President's ranking worksheet.

These data come from my post about great Presidents. I found no association between each President's highest approval rating and the historians' ranking. In fact, I described that fitted line plot (below) as an exemplar of no relationship, a flat line with an R-squared of 0.7%!

Let's say we didn't know better and we overfit the model by including the highest approval rating as a cubic polynomial.

Wow, both the R-squared and adjusted R-squared look pretty good! Also, the coefficient estimates are all significant because their p-values are less than 0.05. The residual plots (not shown) look good too. Great! Not so fast...all that we're doing is excessively bending the fitted line to artificially connect the dots rather than finding a true relationship between the variables. Our model is too complicated and the predicted R-squared gives this away. We actually have a negative predicted R-squared value. That may not seem intuitive, but if 0% is terrible, a negative percentage is even worse! The predicted R-squared doesn't have to be negative to indicate an overfit model. If you see the predicted R-squared start to fall as you add predictors, even if they're significant, you should begin to worry about overfitting the model.

Closing Thoughts about Adjusted R-squared and Predicted R-squared


All data contain a natural amount of variability that is unexplainable. Unfortunately, R-squared doesn't respect this natural ceiling. Chasing a high R-squared value can push us to include too many predictors in an attempt to explain the unexplainable. In these cases, you can achieve a higher R-squared value, but at the cost of misleading results, reduced precision, and a lessened ability to make predictions. Both adjusted R-squared and predicted R-squared provide information that helps you assess the number of predictors in your model:

Use the adjusted R-squared to compare models with different numbers of predictors.
Use the predicted R-squared to determine how well the model predicts new observations and whether the model is too complicated.

Regression analysis is powerful, but you don't want to be seduced by that power and use it unwisely!

Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?


After you have fit a linear model using regression analysis, ANOVA, or design of experiments (DOE), you need to determine how well the model fits the data. To help you out, Minitab statistical software presents a variety of goodness-of-fit statistics. In this post, we'll explore the R-squared (R2) statistic, some of its limitations, and uncover some surprises along the way. For instance, low R-squared values are not always bad and high R-squared values are not always good!

What Is Goodness-of-Fit for a Linear Model?

Definition: Residual = Observed value - Fitted value

Linear regression calculates an equation that minimizes the distance between the fitted line and all of the data points. Technically, ordinary least squares (OLS) regression minimizes the sum of the squared residuals. In general, a model fits the data well if the differences between the observed values and the model's predicted values are small and unbiased. Before you look at the statistical measures for goodness-of-fit, you should check the residual plots. Residual plots can reveal unwanted residual patterns that indicate biased results more effectively than numbers. When your residual plots pass muster, you can trust your numerical results and check the goodness-of-fit statistics.

What Is R-squared?
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. The definition of R-squared is fairly straight-forward; it is the percentage of the response variable variation that is explained by a linear model. Or: R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data. However, there are important conditions for this guideline that I'll talk about both in this post and my next post.
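
That ratio is easy to verify by hand. Here is a short Python/statsmodels sketch on invented data that computes R-squared from the residual and total sums of squares and compares it with the packaged value:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=50)    # invented data

fit = sm.OLS(y, sm.add_constant(x)).fit()
ss_residual = np.sum(fit.resid ** 2)             # unexplained variation
ss_total = np.sum((y - y.mean()) ** 2)           # total variation around the mean

r2_by_hand = 1.0 - ss_residual / ss_total        # explained / total variation
print(r2_by_hand, fit.rsquared)                  # the two values agree
```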

Graphical Representation of R-squared


Plotting fitted values by observed values graphically illustrates different R-squared values for regression models.

The regression model on the left accounts for 38.0% of the variance while the one on the right accounts for 87.4%. The more variance that is accounted for by the regression model, the closer the data points will fall to the fitted regression line. Theoretically, if a model could explain 100% of the variance, the fitted values would always equal the observed values and, therefore, all the data points would fall on the fitted regression line.

A Key Limitation of R-squared


R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots. R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!

Are Low R-squared Values Inherently Bad?


No! There are two major reasons why it can be just fine to have low R-squared values. In some fields, it is entirely expected that your R-squared values will be low. For example, any field that attempts to predict human behavior, such as psychology, typically has R-squared values lower than 50%. Humans are simply harder to predict than, say, physical processes.

Furthermore, if your R-squared value is low but you have statistically significant predictors, you can still draw important conclusions about how changes in the predictor values are associated with changes in the response value. Regardless of the R-squared, the significant coefficients still represent the mean change in the response for one unit of change in the predictor while holding other predictors in the model constant. Obviously, this type of information can be extremely valuable. A low R-squared is most problematic when you want to produce predictions that are reasonably precise (have a small enough prediction interval). How high should the R-squared be for prediction? Well, that depends on your requirements for the width of a prediction interval and how much variability is present in your data. While a high R-squared is required for precise predictions, it's not sufficient by itself, as we shall see.

Are High R-squared Values Inherently Good?


No! A high R-squared does not necessarily indicate that the model has a good fit. That might be a surprise, but look at the fitted line plot and residual plot below. The fitted line plot displays the relationship between semiconductor electron mobility and the natural log of the density for real experimental data.

The fitted line plot shows that these data follow a nice tight function and the R-squared is 98.5%, which sounds great. However, look closer to see how the regression line systematically over and under-predicts the data (bias) at different points along the curve. You can also see patterns in the Residuals versus Fits plot, rather than the randomness that you want to see. This indicates a bad fit, and serves as a reminder as to why you should always check the residual plots. This example comes from my post about choosing between linear and nonlinear regression. In this case, the answer is to use nonlinear regression because linear models are unable to fit the specific curve that these data follow. However, similar biases can occur when your linear model is missing important predictors, polynomial terms, and interaction terms. Statisticians call this specification bias, and it is caused by an underspecified model. For this type of bias, you can fix the residuals by adding the proper terms to the model.

Closing Thoughts on R-squared


R-squared is a handy, seemingly intuitive measure of how well your linear model fits a set of observations. However, as we saw, R-squared doesn't tell us the entire story. You should evaluate R-squared values in conjunction with residual plots, other model statistics, and subject area knowledge in order to round out the picture (pardon the pun). In my next blog, we'll continue with the theme that R-squared by itself is incomplete and look at two other types of R-squared: adjusted R-squared and predicted R-squared. These two measures overcome specific problems in order to provide additional information by which you can evaluate your regression model's explanatory power.

Why You Need to Check Your Residual Plots for Regression Analysis: Or, To Err is Human, To Err Randomly is Statistically Divine

Anyone who has performed ordinary least squares (OLS) regression analysis knows that you need to check the residual plots in order to validate your model. Have you ever wondered why? There are mathematical reasons, of course, but I'm going to focus on the conceptual reasons. The bottom line is that randomness and unpredictability are crucial components of any regression model. If you don't have those, your model is not valid. Why? To start, let's break down and define the two basic components of a valid regression model:

Response = (Constant + Predictors) + Error

Another way we can say this is:

Response = Deterministic + Stochastic

The Deterministic Portion


This is the part that is explained by the predictor variables in the model. The expected value of the response is a function of a set of predictor variables. All of the explanatory/predictive information of the model should be in this portion.

The Stochastic Error


Stochastic is a fancy word that means random and unpredictable. Error is the difference between the expected value and the observed value. Putting this together, the differences between the expected and observed values must be unpredictable. In other words, none of the explanatory/predictive information should be in the error. The idea is that the deterministic portion of your model is so good at explaining (or predicting) the response that only the inherent randomness of any real-world phenomenon is left over for the error portion. If you observe explanatory or predictive power in the error, you know that your predictors are missing some of the predictive information. Residual plots help you check this!

Statistical caveat: Regression residuals are actually estimates of the true error, just like the regression coefficients are estimates of the true population coefficients.

Using Residual Plots


Using residual plots, you can assess whether the observed error (residuals) is consistent with stochastic error. This process is easy to understand with a die-rolling analogy. When you roll a die, you shouldn't be able to predict which number will show on any given toss. However, you can assess a series of tosses to determine whether the displayed numbers follow a random pattern. If the number six shows up more frequently than randomness dictates, you know something is wrong with your understanding (mental model) of how the die actually behaves. If a gambler looked at the analysis of die rolls, he could adjust his mental model, and playing style, to factor in the higher frequency of sixes. His new mental model better reflects the outcome.

The same principle applies to regression models. You shouldn't be able to predict the error for any given observation. And, for a series of observations, you can determine whether the residuals are consistent with random error. Just like with the die, if the residuals suggest that your model is systematically incorrect, you have an opportunity to improve the model.

So, what does random error look like for OLS regression? The residuals should not be either systematically high or low. So, the residuals should be centered on zero throughout the range of fitted values. In other words, the model is correct on average for all fitted values. Further, in the OLS context, random errors are assumed to produce residuals that are normally distributed. Therefore, the residuals should fall in a symmetrical pattern and have a constant spread throughout the range. Here's how residuals should look:
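
The article draws these plots in Minitab; if you wanted to produce the same residuals-versus-fits check in Python, a minimal sketch on invented, well-behaved data looks like this:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 1.2 * x + rng.normal(0, 1, size=100)   # invented data with purely random error

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals versus fitted values: should look like structureless noise around zero.
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.title("Residuals versus fits")
plt.show()
```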

Now let's look at a problematic residual plot. Keep in mind that the residuals should not contain any predictive information.

In the graph above, you can predict non-zero values for the residuals based on the fitted value. For example, a fitted value of 8 has an expected residual that is negative. Conversely, a fitted value of 5 or 11 has an expected residual that is positive. The non-random pattern in the residuals indicates that the deterministic portion (predictor variables) of the model is not capturing some explanatory information that is leaking into the residuals. The graph could represent several ways in which the model is not explaining all that is possible. Possibilities include:

A missing variable
A missing higher-order term of a variable in the model to explain the curvature
A missing interaction between terms already in the model

Identifying and fixing the problem so that the predictors now explain the information that they missed before should produce a good-looking set of residuals! In addition to the above, here are two more specific ways that predictive information can sneak into the residuals:

The residuals should not be correlated with another variable. If you can predict the residuals with another variable, that variable should be included in the model. In Minitab's regression, you can plot the residuals by other variables to look for this problem.

Adjacent residuals should not be correlated with each other (autocorrelation). If you can use one residual to predict the next residual, there is some predictive information present that is not captured by the predictors. Typically, this situation involves time-ordered observations. For example, if a residual is more likely to be followed by another residual that has the same sign, adjacent residuals are positively correlated. You can include a variable that captures the relevant time-related information, or use a time series analysis. In Minitab's regression, you can perform the Durbin-Watson test to test for autocorrelation.
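
The Durbin-Watson statistic is also available outside Minitab. Here is a hedged Python sketch on invented time-ordered data; a value near 2 suggests no autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(8)
x = np.arange(60, dtype=float)                   # invented time-ordered predictor
y = 5.0 + 0.3 * x + rng.normal(0, 1, size=60)

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid = fit.resid

print(durbin_watson(resid))                              # library computation
print(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))  # equivalent hand computation
```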

Are You Seeing Non-Random Patterns in Your Residuals?


I hope this gives you a different perspective and a more complete rationale for something that you are already doing, and that it's clear why you need randomness in your residuals. You must explain everything that is possible with your predictors so that only random error is left over. If you see non-random patterns in your residuals, it means that your predictors are missing something.
