Regression Analysis
We have previously studied the Pearson r correlation coefficient and the r2 coefficient of determination as measures of association for evaluating the relationship between an interval level independent variable and an interval level dependent variable. These statistics are components of a broader set of statistical techniques for evaluating the relationship between two interval level variables, called regression analysis (sometimes referred to in combination as correlation and regression analysis).
Our purpose now is to use a hypothesis test to conclude that there is a relationship between two interval level variables in the population represented by our sample data. We could use a chi-square test of independence to determine whether or not a relationship exists between two variables in the population represented by our data, provided we grouped the values of both variables to create a bivariate table. However, it is preferable to test for the presence of a relationship retaining the variables as interval level data, because this strategy is more effective at detecting the existence of a relationship. We might find a relationship using interval level statistics that we do not find using nominal level statistics because the nominal level statistics are less precise.
We will first review previous material on regression and correlation:
- The scatterplot or scattergram
- The regression equation
Then, we will examine the statistical evidence to determine whether or not the relationships found in our sample data are applicable to the population represented by the sample, using a hypothesis test.
The purpose of regression analysis is to answer the same three questions that have been identified as requirements for understanding the relationships between variables:
Is there a relationship between the two variables? How strong is the relationship? What is the direction of the relationship?
Scatterplots - 1
The relationship between two interval variables can be graphed as a scatterplot or a scatter diagram, which shows the position of all of the cases in an x-y coordinate system. The independent variable is plotted on the x-axis, or the horizontal axis. The dependent variable is plotted on the y-axis, or the vertical axis. A dot in the body of the chart represents the intersection of the data on the x-axis and the y-axis.
Scatterplots - 2
The trendline or regression line is plotted on the chart in a contrasting color. The overall pattern of the dots, or data points, succinctly summarizes the nature of the relationship between the two variables.
The clarity of the pattern formed by the dots can be enhanced by drawing a straight line through the cluster such that the line touches every dot or comes as close to doing so as possible. This summarizing line is called the regression line.
We will see later how this line is obtained, but for now, we will look at how it helps us understand the scatterplot.
Scatterplots - 3
The pattern of the points on the scatterplot gives us information about the relationship between the variables. The regression line, drawn in red, makes it easier for us to understand the scatterplot.
Scatterplots give us information about our three questions about the relationship between two interval variables: Is there a relationship between the two variables? How strong is the relationship? What is the direction of the relationship? In addition, the regression line on the scatterplot can be used to estimate the value of the dependent variable for any value of the independent variable.
When there is no relationship between two variables, the regression line is parallel to the horizontal axis.
When there is a relationship between two variables, the regression line lies at an angle to the horizontal axis, sloping either upward or downward.
In this scatterplot (SES scale scores), the points are very spread out around the regression line. The relationship is weak.
In this scatterplot (vocabulary aptitude scores), the spread of the points around the regression line is narrow, indicating a stronger relationship.
We should check the scale of the vertical axis to make sure the narrow band is not the result of an excessively large scale.
In this scatterplot, the regression line slopes upward to the right, indicating a positive or direct relationship (x-axis: vocabulary aptitude score). The values of both variables increase and decrease at the same time.
In this scatterplot, the regression line slopes downward to the right, indicating a negative or inverse relationship (x-axis: self-concept scale score). The values of the variables move in opposite directions.
For a given value of the independent variable on the horizontal axis, e.g. 52, we draw a perpendicular line upward from the value on the x-axis to the regression line.
The estimate for the dependent variable is obtained by drawing a line parallel to the x-axis from the regression line to the vertical y-axis and reading the value where this line crosses the y-axis, e.g. 50.
In this plot, I have narrowed the range of the y-axis scale to 25 to 75, spreading the points and making the relationship appear weaker.
In this plot, I doubled the range of the y-axis scale to 0 to 160, drawing the points closer together and making the relationship appear stronger.
An underlying assumption of regression analysis is that the relationship between the variables is linear, meaning that the points in the scatterplot must form a pattern that can be approximated with a straight line.
While we could test the assumption of linearity with a test of statistical significance of the correlation coefficient, we will make a visual assessment of the scatterplots. If the scatterplot indicates that the points do not follow a linear pattern, the techniques of linear correlation and regression should not be applied.
The regression equation is the algebraic formula for the regression line, which states the mathematical relationship between the independent and the dependent variable.
We can use the regression line to estimate the value of the dependent variable for any value of the independent variable. The stronger the relationship between the independent and dependent variables, the closer these estimates will come to the actual score that each case had on the dependent variable.
The regression equation has two components. The first component is a number called the y-intercept that defines where the line crosses the vertical y-axis. The second component is called the slope of the line, and is a number that multiplies the value of the independent variable. These two elements are combined in the general form for the regression equation: the estimated score on the dependent variable = the y-intercept + (the slope × the score on the independent variable).
The standard form for the regression equation or formula is:
Y = a + bX
where:
- Y is the estimated score for the dependent variable
- X is the score for the independent variable
- b is the slope of the regression line, or the multiplier of X
- a is the intercept, or the point on the vertical axis where the regression line crosses the vertical y-axis
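The least-squares values of a and b can be computed directly from the data. A minimal sketch in Python (illustrative only; this is not the SPSS procedure):

```python
def least_squares(x, y):
    """Return (a, b) for the least-squares regression line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # b = sum of cross-product deviations / sum of squared x deviations
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x   # the line passes through (mean_x, mean_y)
    return a, b

# Points that lie exactly on Y = 1 + 2X recover a = 1.0 and b = 2.0
a, b = least_squares([1, 2, 3], [3, 5, 7])
```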
The slope is the multiplier of x. It is the amount of change in y for a change of one unit in x.
The y-intercept is the point on the vertical y-axis where the regression line crosses the axis, i.e. 1.0.
If x changes one unit from 2.0 to 3.0, depicted by the blue arrow, y will change by 0.5 units, from 2.0 to 2.5 as depicted by the red arrow.
y = 0.8 + 0.6 x
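Given an equation such as y = 0.8 + 0.6x, predicting y for any x is a single substitution. A quick sketch:

```python
def predict(x, a=0.8, b=0.6):
    """Estimated y on the regression line y = 0.8 + 0.6x."""
    return a + b * x

# A one-unit increase in x raises the estimate by the slope, 0.6 units;
# at x = 0 the estimate equals the intercept, 0.8.
```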
The intercept is the point on the vertical axis where the regression line crosses the axis. It is the predicted value for the dependent variable when the independent variable has a value of zero. This may or may not be useful information depending on the context of the problem.
The slope is interpreted as the amount of change in the predicted value of the dependent variable associated with a one unit change in the value of the independent variable. If the slope has a negative sign, the direction of the relationship is negative or inverse, meaning that the scores on the two variables move in opposite directions. If the slope has a positive sign, the direction of the relationship is positive or direct, meaning that the scores on the two variables move in the same direction.
If there is no relationship between two variables, the slope of the regression line is zero and the regression line is parallel to the horizontal axis. A slope of zero means that the predicted value of the dependent variable will not change, no matter what value of the independent variable is used.
If there is no relationship, using the regression equation to predict values of the dependent variable is no improvement over using the mean of the dependent variable.
The assumptions required for utilizing a regression equation are the same as the assumptions for the test of significance of a correlation coefficient:
- Both variables are interval level.
- Both variables are normally distributed.
- The relationship between the two variables is linear.
- The variance of the values of the dependent variable is uniform for all values of the independent variable (equality of variance).
Assumption of Normality
Strictly speaking, the test requires that the two variables be bivariate normal, meaning that the combined distribution of the two variables is normal. It is usually assumed that the variables are bivariate normal if each variable is normally distributed, so this assumption is tested by checking the normality of each variable. Each variable will be considered normal if its skewness and kurtosis statistics fall between -1.0 and +1.0, or if the sample size is sufficiently large to apply the Central Limit Theorem.
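As a sketch, moment-based skewness and (excess) kurtosis can be computed as below. Note that SPSS reports small-sample-corrected versions (G1 and G2), which differ slightly from these population-moment formulas for small n:

```python
def skewness(scores):
    """Moment-based skewness: m3 / m2**1.5; 0 for a symmetric distribution."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(scores):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    return m4 / m2 ** 2 - 3

# A variable passes the rule of thumb if both statistics fall in [-1.0, +1.0].
```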
Assumption of Linearity
Linearity means that the pattern of the points in a scatterplot forms a band, like the pattern in the chart on the right (a plot of female life expectancy).
When the pattern of the points follows a curve, like the scatterplot on the right, the correlation coefficient will not accurately measure the relationship.
Test of Linearity
The test of linearity is a diagnostic statistical test of the null hypothesis that the linear model is an appropriate fit for the data points. The desired outcome for this test is to fail to reject the null hypothesis. If the probability for the test statistic is less than or equal to the level of significance for the problem, we reject the null hypothesis, concluding that the data is not linear and that regression analysis is not appropriate for the relationship between the two variables. If the probability for the test of linearity statistic is greater than the level of significance for the problem, we fail to reject the null hypothesis and conclude that we satisfy the assumption of linearity.
Assumption of Homoscedasticity
Homoscedasticity (equality of variances) means that the points are evenly dispersed on either side of the regression line for the linear relationship.
In this scatterplot of infant mortality rate by birth rate, the spread of the points around the regression line is narrower at the left end of the regression line than at the right end. This funnel shape is typical of a scatterplot showing violations of the assumption of homoscedasticity.
In this scatterplot, the points extend about the same distance above and below the regression line for most of the length of the regression line. This scatterplot meets the assumption of homoscedasticity.
Test of Homoscedasticity
When we compared groups, we used the Levene test of population variances to test the assumption that the group variances were equal. In order to use this test for the assumption of homoscedasticity, we will convert the interval level independent variable into a dichotomous variable with low scores in one group and high scores in the other group. We can then compare the variances of the two groups derived from the independent variable.
The Levene test of equality of population variances tests whether or not the variances for the two groups are equal. It is a test of the research hypothesis that the variance (dispersion) of the group with low scores is different from the variance of the group with high scores. The null hypothesis states that the variances (dispersion) of both groups are equal. If the probability of the test statistic is greater than 0.05, we do not reject the null hypothesis and conclude that the variances are equal. This is the desired outcome. If the probability of the test statistic is less than or equal to 0.05, we conclude the variances are different and that regression analysis is not an appropriate test for the relationship between the two variables.
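The Levene statistic itself is essentially an ANOVA on the absolute deviations of the scores from their group means. Below is a minimal sketch of the procedure described above (a median split of the independent variable, then Levene's W); it is illustrative only, not the SPSS implementation:

```python
def levene_w(groups):
    """Levene's W for k groups of scores (here, k = 2 from a median split)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviations of each score from its own group mean
    z = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((x - zm) ** 2 for zi, zm in zip(z, z_means) for x in zi)
    return (n_total - k) * between / ((k - 1) * within)

def median_split(x, y):
    """Dichotomize x at its median; return the two groups of y scores."""
    med = sorted(x)[len(x) // 2]
    low = [yi for xi, yi in zip(x, y) if xi < med]
    high = [yi for xi, yi in zip(x, y) if xi >= med]
    return low, high

# Equal spread in both groups gives W = 0; W is then compared to an
# F(k - 1, N - k) distribution to obtain the probability.
```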
The purpose of the hypothesis test of r2 is a test of the applicability of our findings to the population represented by the sample. When we studied association between two interval variables, we stated that the Pearson r correlation coefficient and its square, the coefficient of determination measure the strength of the relationship between two interval variables. When the correlation coefficient and coefficient of determination are zero (0), there is no relationship. The hypothesis test of r2 is a test of whether or not r2 is larger than zero in the population.
The research hypothesis states that r2 is larger than zero (a relationship exists). The null hypothesis states that r2 is equal to zero (no relationship). Recall that we interpreted the coefficient of determination r2 as the reduction in error attributable to the relationship between the variables.
The test statistic is an ANOVA F-test which tests whether or not the reduction in error associated with using the regression equation is really greater than zero.
We are interested in the relationship between family size and number of credit cards.
Without taking into account the independent variable, our best guess for the number of credit cards for any subject is the mean, 7.0.
Errors are measured by computing the difference between the mean and each Y value, squaring the differences, and then summing them. When we compute the answer in SPSS, it will tell us that the total amount of error is 22.0.
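The slides do not reproduce the raw data, but the quoted values (mean = 7.0, total error = 22.0) are consistent with, for example, the following hypothetical eight-family data set:

```python
# Hypothetical data chosen to reproduce the quoted values; the actual
# data set behind the slides is not shown.
family_size = [2, 2, 4, 4, 5, 5, 6, 6]
credit_cards = [4, 6, 6, 7, 8, 7, 8, 10]

mean_cards = sum(credit_cards) / len(credit_cards)
total_error = sum((y - mean_cards) ** 2 for y in credit_cards)
```

With these numbers, mean_cards is 7.0 and total_error is 22.0, matching the SPSS output described above.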
The regression line minimizes the error (the best fitting or least squares line).
SPSS will give us the formula for the regression line in the form Y = a + bX, or for these variables:
- Error using mean only (total)
- Error using regression line
- Reduction in error associated with the regression
- PRE measure (r2)
The F statistic is calculated as the ratio of the error reduced by the regression divided by the error remaining. If the ratio were 1 and these two numbers were the same, we would not have reduced any error, there would be no relationship, and the p-value would not let us reject the null hypothesis.
In this problem, the amount of error reduced by the regression is large relative to the amount remaining, so the F statistic is large and the p-value (0.005) is smaller than the alpha level of significance, so we reject the null hypothesis.
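As a sketch with hypothetical data consistent with the values quoted in this example (mean 7.0, total error 22.0, slope about 0.971), the F statistic can be assembled from the error decomposition:

```python
# Hypothetical data consistent with the quoted values; the actual data
# behind the slides is not shown.
family_size = [2, 2, 4, 4, 5, 5, 6, 6]
credit_cards = [4, 6, 6, 7, 8, 7, 8, 10]

n = len(credit_cards)
mean_x = sum(family_size) / n
mean_y = sum(credit_cards) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(family_size, credit_cards))
     / sum((x - mean_x) ** 2 for x in family_size))
a = mean_y - b * mean_x

ss_total = sum((y - mean_y) ** 2 for y in credit_cards)
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(family_size, credit_cards))
ss_regression = ss_total - ss_error          # error reduced by the regression

# F = (reduction in error / df_regression) / (remaining error / df_residual)
f_stat = (ss_regression / 1) / (ss_error / (n - 2))
```

With these numbers b rounds to 0.971 and f_stat works out to about 18.1, which for F(1, 6) gives a probability near the 0.005 quoted above.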
To interpret the direction of the relationship between the variables, we look at the coefficient for the independent variable. In this example, the coefficient of 0.971 is positive, so we would interpret this relationship as: Families with more members had more credit cards.
The process of testing assumptions can easily overwhelm the task of testing the significance of the relationship. Since our emphasis here is testing the hypothesis that the relationship is generalizable to the population represented by the sample data, we will assume that our data satisfies the assumptions without explicitly testing assumptions.
The question in the homework problems requires us to look at three things: Does the hypothesis test support the existence of a relationship in the population? Is the strength of the relationship characterized correctly? Is the direction of the relationship between the variables correctly stated?
Practice Problem 1
This question asks you to use linear regression to examine the relationship between [marital] and [age]. Linear regression requires that the dependent variable and the independent variables be interval. Ordinal variables may be included as interval variables if a caution is added to any true findings. The dependent variable [marital] is nominal level which does not satisfy the requirement for a dependent variable. The independent variable [age] is interval level, satisfying the requirement for an independent variable.
Practice Problem 2
This question asks you to use linear regression to examine the relationship between [fund] and [attend]. The level of measurement requirements for linear regression are satisfied: [fund] is ordinal level, and [attend] is ordinal level. A caution is added because ordinal level variables are included in the analysis. Given the assumption that the distributional requirements for linear regression are satisfied, you can conduct a linear regression using SPSS without examining distributional assumptions for the variables.
Move the dependent variable to the Dependent: box and the independent variable to the Independent(s): box, and then click the OK button.
Based on the ANOVA table for the linear regression (F(1, 604) = 70.579, p<0.001), there was a relationship between the dependent variable "degree of religious fundamentalism" and the independent variable "frequency of attendance at religious services". Since the probability of the F statistic (p<0.001) was less than or equal to the level of significance (0.05), the null hypothesis that the correlation coefficient (R) was equal to 0 was rejected. The research hypothesis that there was a relationship between the variables was supported.
Given the significant F-test result, the correlation coefficient (R) can be interpreted. The correlation coefficient for the relationship between the independent variable and the dependent variable was 0.323, which would be characterized as a weak relationship using the rule of thumb that a correlation between 0.0 and 0.20 is very weak; 0.20 to 0.40 is weak; 0.40 to 0.60 is moderate; 0.60 to 0.80 is strong; and greater than 0.80 is very strong. The relationship between the independent variable and the dependent variable was incorrectly characterized in the problem as a moderate relationship; it should have been characterized as a weak relationship. The answer to the problem is false.
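The rule of thumb above can be expressed as a small helper. (Boundary values such as exactly 0.20 are assigned to the higher category here; this is one possible convention, which the slides do not specify.)

```python
def describe_strength(r):
    """Label the strength of a correlation using the rule of thumb above."""
    r = abs(r)
    if r < 0.20:
        return "very weak"
    if r < 0.40:
        return "weak"
    if r < 0.60:
        return "moderate"
    if r < 0.80:
        return "strong"
    return "very strong"

# For this problem: describe_strength(0.323) yields "weak".
```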
Practice Problem 3
This question asks you to use linear regression to examine the relationship between [educ] and [age]. [educ] and [age] are interval level, satisfying the level of measurement requirements for regression. Given the assumption that the distributional requirements for linear regression are satisfied, you can conduct a linear regression using SPSS without examining distributional characteristics of variables.
Move the dependent variable to the Dependent: box and the independent variable to the Independent(s): box, and then click the OK button.
Based on the ANOVA table for the linear regression (F(1, 659) = 9.983, p=0.002), there was a relationship between the dependent variable "highest year of school completed" and the independent variable "age". Since the probability of the F statistic (p=0.002) was less than or equal to the level of significance (0.05), the null hypothesis that the correlation coefficient (R) was equal to 0 was rejected. The research hypothesis that there was a relationship between the variables was supported.
Given the significant F-test result, the correlation coefficient (R) can be interpreted. The correlation coefficient for the relationship between the independent variable and the dependent variable was 0.122, which can be characterized as a very weak relationship.
The b coefficient for the independent variable "age" was -.021, indicating an inverse relationship with the dependent variable. Higher numeric values for the independent variable "age" [age] are associated with lower numeric values for the dependent variable "highest year of school completed" [educ]. The statement in the problem that "survey respondents who were older had completed more years of school" is incorrect. The direction of the relationship is stated incorrectly.
Practice Problem 4
This question asks you to use linear regression to examine the relationship between [sei] and [age]. [sei] and [age] are interval level, satisfying the level of measurement requirements for regression. Given the assumption that the distributional requirements for linear regression are satisfied, you can conduct a linear regression using SPSS without examining distributional characteristics of variables.
Move the dependent variable to the Dependent: box and the independent variable to the Independent(s): box, and then click the OK button.
Based on the ANOVA table for the linear regression (F(1, 629) = .266, p=0.606), there was no relationship between the dependent variable "socioeconomic index" and the independent variable "age". Since the probability of the F statistic (p=0.606) was greater than the level of significance (0.05), the null hypothesis that the correlation coefficient (R) was equal to 0 was not rejected. The research hypothesis that there was a relationship between the variables was not supported.
Decision checklist for the homework problems:
1. Make sure that the assumption that the distributional requirements for linear regression are satisfied has been made; otherwise, you have to check the assumptions first. Our regression problems will assume that the assumptions are met.
2. Is the p-value in the ANOVA table for the F ratio test <= alpha? If no, the answer is False.
3. Is the strength of the relationship characterized correctly? If no, the answer is False.
4. Is the direction of the relationship stated correctly? If no, the answer is False.
5. Are either of the variables ordinal level? If yes, the answer is True with caution; if no, the answer is True.