Regression Analysis
We have previously studied the Pearson r correlation coefficient and the r2 coefficient of determination as measures of association for evaluating the relationship between an interval level independent variable and an interval level dependent variable. These statistics are components of a broader set of statistical techniques for evaluating the relationship between two interval level variables, called regression analysis (sometimes referred to in combination as correlation and regression analysis).
Our purpose now is to use a hypothesis test to conclude that there is a relationship between two interval level variables in the population represented by our sample data. We could use a chi-square test of independence to determine whether or not a relationship exists between two variables in the population represented by our data, provided we grouped the values of both variables to create a bivariate table. However, it is preferable to test for the presence of a relationship retaining the variables as interval level data, because this strategy is more effective at detecting the existence of a relationship. We might find a relationship using interval level statistics that we do not find using nominal level statistics because the nominal level statistics are less precise.
We will first review previous material on regression and correlation:
- The scatterplot or scattergram
- The regression equation
Then, we will examine the statistical evidence to determine whether or not the relationships found in our sample data are applicable to the population represented by the sample, using a hypothesis test.
The purpose of regression analysis is to answer the same three questions that have been identified as requirements for understanding the relationships between variables:
Is there a relationship between the two variables? How strong is the relationship? What is the direction of the relationship?
Scatterplots - 1
The relationship between two interval variables can be graphed as a scatterplot or a scatter diagram, which shows the position of all of the cases in an x-y coordinate system. The independent variable is plotted on the x-axis, or the horizontal axis. The dependent variable is plotted on the y-axis, or the vertical axis. A dot in the body of the chart represents the intersection of the data on the x-axis and the y-axis.
Scatterplots - 2
The trendline or regression line is plotted on the chart in a contrasting color. The overall pattern of the dots, or data points, succinctly summarizes the nature of the relationship between the two variables.
The clarity of the pattern formed by the dots can be enhanced by drawing a straight line through the cluster such that the line touches every dot or comes as close to doing so as possible. This summarizing line is called the regression line.
We will see later how this line is obtained, but for now, we will look at how it helps us understand the scatterplot.
Scatterplots - 3
The pattern of the points on the scatterplot gives us information about the relationship between the variables. The regression line, drawn in red, makes it easier for us to understand the scatterplot.
Scatterplots give us information about our three questions about the relationship between two interval variables: Is there a relationship between the two variables? How strong is the relationship? What is the direction of the relationship? In addition, the regression line on the scatterplot can be used to estimate the value of the dependent variable for any value of the independent variable.
When there is no relationship between two variables, the regression line is parallel to the horizontal axis.
When there is a relationship between two variables, the regression line lies at an angle to the horizontal axis, sloping either upward or downward.
In this scatterplot (SES scale scores), the points are very spread out around the regression line. The relationship is weak.
In this scatterplot (vocabulary aptitude scores), the spread of the points around the regression line is narrow, indicating a stronger relationship.
We should check the scale of the vertical axis to make sure the narrow band is not the result of an excessively large scale.
In this scatterplot, the regression line slopes upward to the right, indicating a positive or direct relationship (x-axis: vocabulary aptitude score). The values of both variables increase and decrease at the same time.
In this scatterplot, the regression line slopes downward to the right, indicating a negative or inverse relationship (x-axis: self-concept scale score). The values of the variables move in opposite directions.
For a given value of the independent variable on the horizontal axis, e.g. 52, we draw a perpendicular line upward from the value on the x-axis to the regression line.
The estimate for the dependent variable is obtained by drawing a line parallel to the x-axis from the regression line to the vertical y-axis and reading the value where this line crosses the y-axis, e.g. 50.
In this plot, I have narrowed the range of the y-axis scale to 25 to 75, spreading the points and making the relationship appear weaker.
In this plot, I doubled the range of the y-axis scale to 0 to 160, drawing the points closer together and making the relationship appear stronger.
An underlying assumption of regression analysis is that the relationship between the variables is linear, meaning that the points in the scatterplot must form a pattern that can be approximated with a straight line.
While we could test the assumption of linearity with a test of statistical significance of the correlation coefficient, we will make a visual assessment of the scatterplots. If the scatterplot indicates that the points do not follow a linear pattern, the techniques of linear correlation and regression should not be applied.
The regression equation is the algebraic formula for the regression line, which states the mathematical relationship between the independent and the dependent variable.
We can use the regression line to estimate the value of the dependent variable for any value of the independent variable. The stronger the relationship between the independent and dependent variables, the closer these estimates will come to the actual score that each case had on the dependent variable.
The regression equation has two components. The first component is a number called the y-intercept that defines where the line crosses the vertical y-axis. The second component is called the slope of the line, and is a number that multiplies the value of the independent variable. These two elements are combined in the general form for the regression equation: the estimated score on the dependent variable = the y-intercept + (the slope × the score on the independent variable).
The standard form for the regression equation or formula is:
Y = a + bX
where:
- Y is the estimated score for the dependent variable
- X is the score for the independent variable
- b is the slope of the regression line, or the multiplier of X
- a is the intercept, or the point on the vertical axis where the regression line crosses the vertical y-axis
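The least-squares values of a and b can be computed directly from the data. A minimal sketch in Python (illustrative only; this is not the SPSS procedure):

```python
def least_squares(x, y):
    """Return (a, b) for the least-squares regression line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # b = sum of cross-product deviations / sum of squared x deviations
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x   # the line passes through (mean_x, mean_y)
    return a, b

# Points that lie exactly on Y = 1 + 2X recover a = 1.0 and b = 2.0
a, b = least_squares([1, 2, 3], [3, 5, 7])
```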
The slope is the multiplier of x. It is the amount of change in y for a change of one unit in x.
The y-intercept is the point on the vertical y-axis where the regression line crosses the axis, i.e. 1.0.
If x changes one unit from 2.0 to 3.0, depicted by the blue arrow, y will change by 0.5 units, from 2.0 to 2.5 as depicted by the red arrow.
y = 0.8 + 0.6 x
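Given an equation such as y = 0.8 + 0.6x, predicting y for any x is a single substitution. A quick sketch:

```python
def predict(x, a=0.8, b=0.6):
    """Estimated y on the regression line y = 0.8 + 0.6x."""
    return a + b * x

# A one-unit increase in x raises the estimate by the slope, 0.6 units;
# at x = 0 the estimate equals the intercept, 0.8.
```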
The intercept is the point on the vertical axis where the regression line crosses the axis. It is the predicted value for the dependent variable when the independent variable has a value of zero. This may or may not be useful information depending on the context of the problem.
The slope is interpreted as the amount of change in the predicted value of the dependent variable associated with a one unit change in the value of the independent variable. If the slope has a negative sign, the direction of the relationship is negative or inverse, meaning that the scores on the two variables move in opposite directions. If the slope has a positive sign, the direction of the relationship is positive or direct, meaning that the scores on the two variables move in the same direction.
If there is no relationship between two variables, the slope of the regression line is zero and the regression line is parallel to the horizontal axis. A slope of zero means that the predicted value of the dependent variable will not change, no matter what value of the independent variable is used.
If there is no relationship, using the regression equation to predict values of the dependent variable is no improvement over using the mean of the dependent variable.
The assumptions required for utilizing a regression equation are the same as the assumptions for the test of significance of a correlation coefficient:
- Both variables are interval level.
- Both variables are normally distributed.
- The relationship between the two variables is linear.
- The variance of the values of the dependent variable is uniform for all values of the independent variable (equality of variance).
Assumption of Normality
Strictly speaking, the test requires that the two variables be bivariate normal, meaning that the combined distribution of the two variables is normal. It is usually assumed that the variables are bivariate normal if each variable is normally distributed, so this assumption is tested by checking the normality of each variable. Each variable will be considered normal if its skewness and kurtosis statistics fall between -1.0 and +1.0, or if the sample size is sufficiently large to apply the Central Limit Theorem.
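As a sketch, moment-based skewness and (excess) kurtosis can be computed as below. Note that SPSS reports small-sample-corrected versions (G1 and G2), which differ slightly from these population-moment formulas for small n:

```python
def skewness(scores):
    """Moment-based skewness: m3 / m2**1.5; 0 for a symmetric distribution."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(scores):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    return m4 / m2 ** 2 - 3

# A variable passes the rule of thumb if both statistics fall in [-1.0, +1.0].
```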
Assumption of Linearity
Linearity means that the pattern of the points in a scatterplot forms a band, like the pattern in the chart on the right (a plot of female life expectancy).
When the pattern of the points follows a curve, like the scatterplot on the right, the correlation coefficient will not accurately measure the relationship.
Test of Linearity
The test of linearity is a diagnostic statistical test of the null hypothesis that the linear model is an appropriate fit for the data points. The desired outcome for this test is to fail to reject the null hypothesis. If the probability for the test statistic is less than or equal to the level of significance for the problem, we reject the null hypothesis, concluding that the data is not linear and that regression analysis is not appropriate for the relationship between the two variables. If the probability for the test of linearity statistic is greater than the level of significance for the problem, we fail to reject the null hypothesis and conclude that we satisfy the assumption of linearity.
Assumption of Homoscedasticity
Homoscedasticity (equality of variances) means that the points are evenly dispersed on either side of the regression line for the linear relationship.
In this scatterplot of infant mortality rate by birth rate, the spread of the points around the regression line is narrower at the left end of the regression line than at the right end. This funnel shape is typical of a scatterplot showing violations of the assumption of homoscedasticity.
In this scatterplot, the points extend about the same distance above and below the regression line for most of the length of the regression line. This scatterplot meets the assumption of homoscedasticity.
Test of Homoscedasticity
When we compared groups, we used the Levene test of population variances to test the assumption that the group variances were equal. In order to use this test for the assumption of homoscedasticity, we will convert the interval level independent variable into a dichotomous variable with low scores in one group and high scores in the other group. We can then compare the variances of the two groups derived from the independent variable.
The Levene test of equality of population variances tests whether or not the variances for the two groups are equal. It is a test of the research hypothesis that the variance (dispersion) of the group with low scores is different from the variance of the group with high scores. The null hypothesis states that the variances (dispersion) of both groups are equal. If the probability of the test statistic is greater than 0.05, we do not reject the null hypothesis and conclude that the variances are equal. This is the desired outcome. If the probability of the test statistic is less than or equal to 0.05, we conclude the variances are different and that regression analysis is not an appropriate test for the relationship between the two variables.
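The Levene statistic itself is essentially an ANOVA on the absolute deviations of the scores from their group means. Below is a minimal sketch of the procedure described above (a median split of the independent variable, then Levene's W); it is illustrative only, not the SPSS implementation:

```python
def levene_w(groups):
    """Levene's W for k groups of scores (here, k = 2 from a median split)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviations of each score from its own group mean
    z = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((x - zm) ** 2 for zi, zm in zip(z, z_means) for x in zi)
    return (n_total - k) * between / ((k - 1) * within)

def median_split(x, y):
    """Dichotomize x at its median; return the two groups of y scores."""
    med = sorted(x)[len(x) // 2]
    low = [yi for xi, yi in zip(x, y) if xi < med]
    high = [yi for xi, yi in zip(x, y) if xi >= med]
    return low, high

# Equal spread in both groups gives W = 0; W is then compared to an
# F(k - 1, N - k) distribution to obtain the probability.
```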
The purpose of the hypothesis test of r2 is a test of the applicability of our findings to the population represented by the sample. When we studied association between two interval variables, we stated that the Pearson r correlation coefficient and its square, the coefficient of determination measure the strength of the relationship between two interval variables. When the correlation coefficient and coefficient of determination are zero (0), there is no relationship. The hypothesis test of r2 is a test of whether or not r2 is larger than zero in the population.
The research hypothesis states that r2 is larger than zero (a relationship exists). The null hypothesis states that r2 is equal to zero (no relationship). Recall that we interpreted the coefficient of determination r2 as the reduction in error attributable to the relationship between the variables.
The test statistic is an ANOVA F-test which tests whether or not the reduction in error associated with using the regression equation is really greater than zero.
We are interested in the relationship between family size and number of credit cards.
Without taking into account the independent variable, our best guess for the number of credit cards for any subject is the mean, 7.0.
Errors are measured by computing the difference between the mean and each Y value, squaring the differences, and then summing them. When we compute the answer in SPSS, it will tell us that the total amount of error is 22.0.
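The slides do not reproduce the raw data, but the quoted values (mean = 7.0, total error = 22.0) are consistent with, for example, the following hypothetical eight-family data set:

```python
# Hypothetical data chosen to reproduce the quoted values; the actual
# data set behind the slides is not shown.
family_size = [2, 2, 4, 4, 5, 5, 6, 6]
credit_cards = [4, 6, 6, 7, 8, 7, 8, 10]

mean_cards = sum(credit_cards) / len(credit_cards)
total_error = sum((y - mean_cards) ** 2 for y in credit_cards)
```

With these numbers, mean_cards is 7.0 and total_error is 22.0, matching the SPSS output described above.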
The regression line minimizes the error (the best fitting or least squares line).
SPSS will give us the formula for the regression line in the form Y = a + bX, or for these variables:
- Error using mean only (total)
- Error using regression line
- Reduction in error associated with the regression
- PRE measure (r2)
The F statistic is calculated as the ratio of the error reduced by the regression divided by the error remaining. If the ratio were 1 and these two numbers were the same, we would not have reduced any error, there would be no relationship, and the p-value would not let us reject the null hypothesis.
In this problem, the amount of error reduced by the regression is large relative to the amount remaining, so the F statistic is large and the p-value (0.005) is smaller than the alpha level of significance, so we reject the null hypothesis.
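As a sketch with hypothetical data consistent with the values quoted in this example (mean 7.0, total error 22.0, slope about 0.971), the F statistic can be assembled from the error decomposition:

```python
# Hypothetical data consistent with the quoted values; the actual data
# behind the slides is not shown.
family_size = [2, 2, 4, 4, 5, 5, 6, 6]
credit_cards = [4, 6, 6, 7, 8, 7, 8, 10]

n = len(credit_cards)
mean_x = sum(family_size) / n
mean_y = sum(credit_cards) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(family_size, credit_cards))
     / sum((x - mean_x) ** 2 for x in family_size))
a = mean_y - b * mean_x

ss_total = sum((y - mean_y) ** 2 for y in credit_cards)
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(family_size, credit_cards))
ss_regression = ss_total - ss_error          # error reduced by the regression

# F = (reduction in error / df_regression) / (remaining error / df_residual)
f_stat = (ss_regression / 1) / (ss_error / (n - 2))
```

With these numbers b rounds to 0.971 and f_stat works out to about 18.1, which for F(1, 6) gives a probability near the 0.005 quoted above.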
To interpret the direction of the relationship between the variables, we look at the coefficient for the independent variable. In this example, the coefficient of 0.971 is positive, so we would interpret this relationship as: Families with more members had more credit cards.
The process of testing assumptions can easily overwhelm the task of testing the significance of the relationship. Since our emphasis here is testing the hypothesis that the relationship is generalizable to the population represented by the sample data, we will assume that our data satisfies the assumptions without explicitly testing assumptions.
The question in the homework problems requires us to look at three things: Does the hypothesis test support the existence of a relationship in the population? Is the strength of the relationship characterized correctly? Is the direction of the relationship between the variables correctly stated?
Practice Problem 1
This question asks you to use linear regression to examine the relationship between [marital] and [age]. Linear regression requires that the dependent variable and the independent variables be interval. Ordinal variables may be included as interval variables if a caution is added to any true findings. The dependent variable [marital] is nominal level which does not satisfy the requirement for a dependent variable. The independent variable [age] is interval level, satisfying the requirement for an independent variable.
Practice Problem 2
This question asks you to use linear regression to examine the relationship between [fund] and [attend]. The level of measurement requirements for linear regression are satisfied: [fund] is ordinal level, and [attend] is ordinal level. A caution is added because ordinal level variables are included in the analysis. Given the assumption that the distributional requirements for linear regression are satisfied, you can conduct a linear regression using SPSS without examining distributional assumptions for the variables.
Move the dependent variable to the Dependent: box and the independent variable to the Independent(s): box, and then click the OK button.
Based on the ANOVA table for the linear regression (F(1, 604) = 70.579, p<0.001), there was a relationship between the dependent variable "degree of religious fundamentalism" and the independent variable "frequency of attendance at religious services". Since the probability of the F statistic (p<0.001) was less than or equal to the level of significance (0.05), the null hypothesis that the correlation coefficient (R) was equal to 0 was rejected. The research hypothesis that there was a relationship between the variables was supported.
Given the significant F-test result, the correlation coefficient (R) can be interpreted. The correlation coefficient for the relationship between the independent variable and the dependent variable was 0.323, which would be characterized as a weak relationship using the rule of thumb that a correlation between 0.0 and 0.20 is very weak; 0.20 to 0.40 is weak; 0.40 to 0.60 is moderate; 0.60 to 0.80 is strong; and greater than 0.80 is very strong. The relationship between the independent variable and the dependent variable was incorrectly characterized in the problem as a moderate relationship; it should have been characterized as a weak relationship. The answer to the problem is false.
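The rule of thumb above can be expressed as a small helper. (Boundary values such as exactly 0.20 are assigned to the higher category here; this is one possible convention, which the slides do not specify.)

```python
def describe_strength(r):
    """Label the strength of a correlation using the rule of thumb above."""
    r = abs(r)
    if r < 0.20:
        return "very weak"
    if r < 0.40:
        return "weak"
    if r < 0.60:
        return "moderate"
    if r < 0.80:
        return "strong"
    return "very strong"

# For this problem: describe_strength(0.323) yields "weak".
```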
Practice Problem 3
This question asks you to use linear regression to examine the relationship between [educ] and [age]. [educ] and [age] are interval level, satisfying the level of measurement requirements for regression. Given the assumption that the distributional requirements for linear regression are satisfied, you can conduct a linear regression using SPSS without examining distributional characteristics of variables.
Move the dependent variable to the Dependent: box and the independent variable to the Independent(s): box, and then click the OK button.
Based on the ANOVA table for the linear regression (F(1, 659) = 9.983, p=0.002), there was a relationship between the dependent variable "highest year of school completed" and the independent variable "age". Since the probability of the F statistic (p=0.002) was less than or equal to the level of significance (0.05), the null hypothesis that the correlation coefficient (R) was equal to 0 was rejected. The research hypothesis that there was a relationship between the variables was supported.
Given the significant F-test result, the correlation coefficient (R) can be interpreted. The correlation coefficient for the relationship between the independent variable and the dependent variable was 0.122, which can be characterized as a very weak relationship.
The b coefficient for the independent variable "age" was -.021, indicating an inverse relationship with the dependent variable. Higher numeric values for the independent variable "age" [age] are associated with lower numeric values for the dependent variable "highest year of school completed" [educ]. The statement in the problem that "survey respondents who were older had completed more years of school" is incorrect. The direction of the relationship is stated incorrectly.
Practice Problem 4
This question asks you to use linear regression to examine the relationship between [sei] and [age]. [sei] and [age] are interval level, satisfying the level of measurement requirements for regression. Given the assumption that the distributional requirements for linear regression are satisfied, you can conduct a linear regression using SPSS without examining distributional characteristics of variables.
Move the dependent variable to the Dependent: box and the independent variable to the Independent(s): box, and then click the OK button.
Based on the ANOVA table for the linear regression (F(1, 629) = .266, p=0.606), there was no relationship between the dependent variable "socioeconomic index" and the independent variable "age". Since the probability of the F statistic (p=0.606) was greater than the level of significance (0.05), the null hypothesis that the correlation coefficient (R) was equal to 0 was not rejected. The research hypothesis that there was a relationship between the variables was not supported.
Decision checklist for the homework problems:
1. Make sure that the assumption that the distributional requirements for linear regression are satisfied has been made; otherwise, you have to check the assumptions first. Our regression problems will assume that the assumptions are met.
2. Is the p-value in the ANOVA table for the F ratio test <= alpha? If no, the answer is False.
3. Is the strength of the relationship characterized correctly? If no, the answer is False.
4. Is the direction of the relationship stated correctly? If no, the answer is False.
5. Are either of the variables ordinal level? If yes, the answer is True with caution; if no, the answer is True.