
Running head: ASSUMPTIONS IN MR

EDPS 607: Assumptions in Multiple Regression
Angela Chiasson
University of Calgary
June 22, 2012


Multiple regression, according to Kerlinger and Lee (2000), is probably the single most useful of the multivariate methods, analyzing the common and separate influences of two or more independent variables on a dependent variable. The method has been used in hundreds of studies because of its flexibility, power, and general applicability to many different kinds of research problems (Kerlinger & Lee, 2000, p. 209). The current paper strives to answer the following questions: 1) What is multiple regression?; 2) What are the assumptions in multiple regression and why are they important?; and 3) What happens when there is a violation of assumptions?

What is multiple regression?

In order to understand the importance of assumptions in the multivariate statistical method of multiple regression (MR), one must first have a basic understanding of the uses of MR. MR can be classified as an intermediate method, falling somewhere between the bivariate methods of correlation and linear regression and multivariate methods such as canonical correlation (Harlow, 2005). Most notably, the objective in multiple regression is to predict a dependent variable from a set of independent variables. Unlike ordinary bivariate regression, MR allows the use of an entire set of variables to predict another (as seen in Figure 1). In MR, there are two or more, usually continuous, independent variables and one continuous dependent variable. A researcher's preference would be that the independent variables are relatively uncorrelated with one another (low multicollinearity), while each independent variable is correlated with the dependent variable.

Figure 1. Example of a Multiple Regression Model (Hindes, June 20th, 2012)
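In its general textbook form (a standard formulation rather than one drawn verbatim from the sources cited here), the prediction equation for k independent variables can be written as

\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k

where b_0 is the intercept, b_1 through b_k are the partial regression coefficients, and \hat{y} is the predicted score; the coefficients are estimated so that the sum of squared differences between the observed y and \hat{y} is minimized.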


There are three common approaches to MR, depending on the circumstance. Standard MR is useful when predicting or explaining a single phenomenon with a set of predictors; for example, predicting the amount of condom use with a set of attitudinal, interpersonal, and behavioural variables (Harlow, 2005). Next, hierarchical MR allows assessment of whether a set of variables substantially adds to prediction, over and above one or more other variables already in the analysis; for instance, assessing whether attitudinal variables increase prediction over and above behavioural predictors of condom use (Harlow, 2005). Finally, stepwise MR has the computer select variables for entry based on the independent variable (IV) that has the strongest partial correlation with the dependent variable (DV), after controlling for the effect of variables already in the equation. Stepwise MR is not often recommended because it capitalizes on chance variation much more than the other two methods. An example of stepwise MR would be a researcher assessing which predictors, whether behavioural, attitudinal, or environmental, are most important for an outcome variable designating a new form of disease (Harlow, 2005).
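As a rough illustration of the hierarchical approach, the sketch below compares a model containing only behavioural predictors with one that adds attitudinal predictors and tests the R-squared change. The file name and all variable names (condom_use, partner_communication, past_use, attitude_toward_condoms, perceived_risk) are hypothetical stand-ins, not variables taken from Harlow (2005).

import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("condom_use_study.csv")  # hypothetical data set

# Step 1: behavioural predictors only.
step1 = smf.ols("condom_use ~ partner_communication + past_use", data=data).fit()

# Step 2: attitudinal predictors entered over and above the behavioural block.
step2 = smf.ols("condom_use ~ partner_communication + past_use + "
                "attitude_toward_condoms + perceived_risk", data=data).fit()

# R-squared change, plus an F test of whether the added block improves prediction.
print("R squared change:", step2.rsquared - step1.rsquared)
print(step2.compare_f_test(step1))  # returns (F statistic, p value, df difference)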

MR would be the statistical method of choice when a researcher is looking to answer any of the following questions: 1) How well does the set of predictors estimate y?; 2) What is the relative contribution of each variable in predicting y?; 3) What is the incremental validity of each predictor over every other?; or 4) What is the best subset of predictor variables from the overall set? It is important to keep in mind that regression estimates are more reliable if there is a large data set (an N × p data matrix, with a sample size of at least 100), the sample variance of the explanatory variables is high, the variance of the error term is small, and the explanatory variables are not too closely related. Before moving on to the assumptions in MR, presented below is a brief overview of the steps of an MR analysis in the Statistical Package for the Social Sciences (SPSS).

Figure 2.1. Starting SPSS

Figure 2.1. First, open the data set, then click on Analyze, Regression, and Linear. Next, choose one continuous dependent variable and two or more independent variables. As seen in Figure 2.2, math enjoyment was chosen as the dependent variable, while hours of math homework per month and score on general stress scale were chosen as the independent variables.

Figure 2.2. Choosing Dependent and Independent Variables

Figure 2.3 demonstrates the next step in SPSS. Click on Statistics and then check the boxes that are important for the study. The Model fit box reports how well the model fits, R squared change is useful when additional variables are entered, Descriptives gives an overview of descriptive statistics, and Collinearity diagnostics is useful for determining whether there is multicollinearity between variables.

Figure 2.3. Statistics

Figure 2.4 is important because this is where the statistician uses plots to determine whether assumptions have been met, which will be discussed in the next section. One must click on Plots and then move *ZRESID into the Y box and *ZPRED into the X box. Remember to check Histogram and Normal probability plot to assess the normality of the residuals and to see whether there is linearity.

Figure 2.4. Plots


The final dialog presented in this section is used to set the probability criteria and to specify what to do with missing values. For the present example, Options was clicked, the default probability of .05 was retained, and Exclude cases pairwise was selected, which means that a case is deleted only from analyses involving the variables on which it has missing values.

Figure 2.5. Options
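For readers working outside of SPSS, a roughly equivalent analysis can be sketched in Python with the statsmodels library. This is a sketch under assumptions: the file name math_survey.csv and the column names math_enjoyment, math_homework_hours, and stress_score are hypothetical stand-ins for the variables shown in Figures 2.1 to 2.5.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data file standing in for the SPSS data set used above.
data = pd.read_csv("math_survey.csv")

# Ordinary least squares regression of math enjoyment on the two predictors.
model = smf.ols("math_enjoyment ~ math_homework_hours + stress_score",
                data=data).fit()

# Model fit (R squared), coefficients, and significance tests, comparable to
# the output requested under the Statistics dialog in Figure 2.3.
print(model.summary())

The later Python sketches in this paper reuse this fitted model object when computing residual diagnostics.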

What are the assumptions in multiple regression and why are they important?

Assumptions in statistics can impact the validity and accuracy of results. If there is a violation of assumptions, then the interpretation of the data can change, which highlights the importance of knowing how to check assumptions and determine whether they have been met. In a linear regression, it is assumed that 1) errors are independent; 2) errors follow a normal distribution; and 3) errors have a constant variance. Thus, multiple regression makes assumptions of normality, linearity, and homoscedasticity about the nature of the relationships between variables. Furthermore, the assumptions concerning independence, multicollinearity, and outliers will be discussed. Assumptions allow researchers to make inferences, and the following section reviews these assumptions in detail.

Multiple regression assumes linear bivariate relationships between each x and y, and also between y and the predicted scores (y-hat). The assumption of normality can be assessed with histograms, probability-probability plots (P-P plots), and descriptive statistics tables. To check for normally distributed errors, one would be looking for a linear relationship between the observed and predicted values; thus, one would want to analyze a P-P plot of the standardized regression residuals. Simply put, the closer the dots are to the line, the closer one comes to satisfying the assumption of normally distributed errors. The following plot from SPSS demonstrates that the assumption of normally distributed errors has been met because the dots hug the line.

Figure 3.1. P-P Plot Demonstrating Assumption of Normality
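A comparable check can be sketched in Python. Note that scipy's probplot produces a quantile-quantile (Q-Q) plot rather than a true P-P plot, but it serves the same visual purpose of comparing the standardized residuals to a normal distribution; the model object is assumed to come from the statsmodels sketch shown earlier.

import matplotlib.pyplot as plt
from scipy import stats

# Standardized (internally studentized) residuals from the fitted model.
resid_std = model.get_influence().resid_studentized_internal

# Q-Q plot against the normal distribution; points hugging the diagonal line
# support the assumption of normally distributed errors.
stats.probplot(resid_std, dist="norm", plot=plt)
plt.title("Normal Q-Q plot of standardized residuals")
plt.show()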


The following diagrams demonstrate that, as the dots deviate from the line, the assumption of normality becomes less tenable.

Figure 3.2. P-P Plots Demonstrating Violation of Normality Assumption

Multiple regression assumes that both the univariate and the multivariate distributions of residuals (actual scores minus predicted scores) are normally distributed. As previously mentioned, the normality assumption can be checked through a histogram of the standardized residuals, as can be seen in Figure 4.1. If the bars follow a normal curve, one can say that the assumption of normality has been met.

Figure 4.1. Histogram Illustrating Assumption of Normality
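The same check can be sketched in Python, again assuming the fitted model object from the earlier statsmodels example.

import matplotlib.pyplot as plt

# Histogram of standardized residuals; an approximately bell-shaped pattern
# supports the assumption of normally distributed errors.
plt.hist(model.get_influence().resid_studentized_internal, bins=20,
         edgecolor="black")
plt.xlabel("Standardized residual")
plt.ylabel("Frequency")
plt.show()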

Homoscedasticity refers to the assumption of equal variances. To check for homoscedasticity, one must examine a bivariate scatterplot of the standardized residuals against the standardized predicted values. If the assumptions of the linear regression model are tenable, then the standardized residuals should scatter randomly about a horizontal line defined by ri = 0, as seen in Figure 5.1. When the scatterplot is fairly evenly scattered, the assumption of homoscedasticity would be met because the variance appears to be fairly equal. If the scatterplot shows clumping of dots, then the assumption of equal variances is violated because the variances do not appear to be equal across the variables. In other words, any systematic pattern or clustering of the residuals suggests a model violation (Stevens, 2009). Figure 5.1 demonstrates variance that is fairly equal, thus indicating that the assumption of homoscedasticity has been met, while Figure 5.2 demonstrates a violation of the assumption of homoscedasticity.

Figure 5.1. Scatterplot Meeting the Assumption of Homoscedasticity

Figure 5.1. This diagram illustrates the assumption of equal variances because the dots are dispersed evenly about the horizontal line defined by ri = 0.
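A residual-versus-predicted scatterplot of this kind can be sketched in Python as follows, continuing from the earlier statsmodels example; the standardization of the predicted values mirrors what SPSS plots as *ZPRED.

import matplotlib.pyplot as plt

# Standardized predicted values and standardized residuals.
fitted = model.fittedvalues
fitted_std = (fitted - fitted.mean()) / fitted.std()
resid_std = model.get_influence().resid_studentized_internal

# Random scatter about the horizontal line at 0 supports homoscedasticity;
# funnels, curves, or clusters suggest a violation.
plt.scatter(fitted_std, resid_std)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted value")
plt.ylabel("Standardized residual")
plt.show()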

Figure 5.2. Scatterplot Demonstrating a Violation of Homoscedasticity


Figure 5.2. A violation of homoscedasticity is evident in this diagram because most of the dots are scattered above the horizontal line defined by ri = 0.

One must also keep in mind the independence assumption, which implies that the subjects are responding independently of one another. If independence is violated even mildly, then the probability of a Type I error will be several times greater than the level at which the researcher thinks he or she is working (Stevens, 2009). For instance, instead of falsely rejecting 5% of the time, the researcher may be falsely rejecting up to 30% of the time.

Finally, multicollinearity can be an issue if not analyzed properly. Multicollinearity occurs when there are moderate to high intercorrelations among the predictor variables (Stevens, 2009). In MR, multicollinearity can severely limit the size of the multiple correlation (R) and makes determining the importance of a given predictor difficult. Furthermore, it can increase the variances of the regression coefficients, leading to more unstable regression equations. To check for multicollinearity, one could examine the variance inflation factors (VIF) to see whether they are under 10. As shown in the current example (Figure 6.1), the VIF is 1.0, which is well under 10, indicating that multicollinearity is likely not present between the predictor variables, hours of math homework per month and score on general stress scale.

Figure 6.1. Variance Inflation Factors to Check for Multicollinearity
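VIF values can also be computed directly; the sketch below uses statsmodels' variance_inflation_factor and continues with the hypothetical column names from the earlier Python example.

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix of predictors with an intercept column added.
X = sm.add_constant(data[["math_homework_hours", "stress_score"]])

# VIF for each predictor (skipping the constant); values near 1 suggest little
# multicollinearity, while values above roughly 10 are a common warning sign.
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))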

Outliers can alter regression results by drawing the regression line toward themselves. Multivariate outliers can be assessed with Mahalanobis distance and Cook's distance, as seen in Figures 7.1 and 7.2. Cook's distance can be used to identify influential points, as any case with a Cook's distance above one can be assumed to be an influential data point (Stevens, 2009). The results indicate that the highest Cook's distance is .04, which is below one; therefore, it can be assumed that there are no scores that can be considered influential data points, and the assumption regarding multivariate outliers has been met.

Figure 7.1. Residual Statistics


Figure 7.2. Mahalanobis and Cook's Distance
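Both diagnostics can be computed directly as well. The sketch below again continues from the earlier statsmodels example, and the Mahalanobis distances are computed by hand from the two hypothetical predictors.

import numpy as np

influence = model.get_influence()

# Cook's distance for each case; values above 1 are commonly flagged as
# influential data points.
cooks_d = influence.cooks_distance[0]
print("Largest Cook's distance:", cooks_d.max())

# Mahalanobis distance of each case from the centroid of the predictors.
X = data[["math_homework_hours", "stress_score"]].to_numpy()
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
mahalanobis = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))
print("Largest Mahalanobis distance:", mahalanobis.max())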

What happens when there is a violation of assumptions?

A common mistake that many researchers make is to assume that all data are normally distributed and to apply the relevant parametric test to show differences between groups (Harlow, 2005). This approach has no validity if the underlying assumptions about the distributions are not met. Realistically, it is not always possible to uphold these assumptions and obtain accurate probabilities regarding the hypotheses of all studies. In that case, one must turn to non-parametric, or distribution-free, tests that do not have such stringent assumptions. There are three cases in which the use of non-parametric tests would be necessary: 1) the data for the independent variable are not interval; 2) the distribution of the data for the dependent variable is highly skewed; and 3) there are severely unequal variances between groups. It is important to remember that non-parametric tests exist for most situations that are commonly tested; thus, most parametric tests have non-parametric analogues.

One other way of managing data that do not appear to conform to the assumptions of linearity, normality, or homoscedasticity is to consider making transformations on the relevant variables. For instance, a variable such as substance abuse could be highly positively skewed, indicating that most individuals reported low or no substance abuse, whereas few reported moderate to high substance use. In this example, it could be beneficial to transform the scores to readjust the variable into one that more closely follows the assumptions. Logarithm or square root transformations could be considered to even out the distribution of the substance abuse variable (Harlow, 2005, p. 46); a brief sketch of such a transformation appears at the end of this section. While non-parametric tests make fewer assumptions regarding the nature of the distributions, they are usually less powerful than their parametric counterparts. However, in cases where assumptions are violated, not only are non-parametric tests more appropriate, they can also have more power. In summary, only when assumptions have been violated are non-parametric tests the method of choice.
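The sketch below gives a brief, hedged illustration of such a transformation in Python; the file name and the substance_abuse variable are hypothetical stand-ins rather than data from Harlow (2005).

import numpy as np
import pandas as pd

data = pd.read_csv("survey.csv")  # hypothetical data set

# Log transformation of a positively skewed variable; adding 1 keeps scores of
# zero (no reported use) defined under the logarithm.
data["substance_abuse_log"] = np.log(data["substance_abuse"] + 1)

# A square-root transformation is a milder alternative.
data["substance_abuse_sqrt"] = np.sqrt(data["substance_abuse"])

# Compare skewness before and after transforming.
print(data[["substance_abuse", "substance_abuse_log",
            "substance_abuse_sqrt"]].skew())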

Summary

Assumptions are important in scientific research because they guide data collection and can be tested to determine whether they are probably true or false. They are powerful tools in the advancement of knowledge in that they give the researcher confidence that conclusions are not based on personal values and opinions. Through defined mathematical models, assumptions help researchers gain a better understanding of whether or not results are significant.


The difficulty arises when assumptions such as normality are not met, leaving the researcher the task of choosing a non-parametric test that does not make the normality assumption. Assumptions in multivariate statistics are challenging to ascertain because, when working with multiple variables, it is difficult to gauge whether or not the assumptions have been met. This leaves the researcher with the difficult decision of whether or not the tests used are valid and reliable. Unfortunately, some conclusions can be based on invalid results, thus stressing the importance of knowing how to check for assumptions masterfully.



References

Boslaugh, S., & Watters, P. A. (2008). Statistics in a nutshell. Sebastopol, CA: O'Reilly Media.

Harlow, L. L. (2005). The essence of multivariate thinking: Basic themes and methods. Mahwah, NJ: Lawrence Erlbaum Associates.

Kerlinger, F. N., & Lee, H. B. (2000). Foundations of behavioral research (4th ed.). Harcourt.

Stevens, J. P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New York, NY: Routledge.
