$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$
[Figure: Y and the predictors X1, X2, X3]
b. Compare Means (e.g., Analysis of Variance)
2. Examine Strength and Direction of Relationships
a. Bivariate (e.g., Pearson Correlation r): between one variable and another, $Y = a + b_1 x_1$
The DV (Y) must be metric. The IVs (Xs) must be either metric or dummy variables.
Central Questions Addressed:
Is Y a function of X1, X2, etc.? How?
Is there a relationship between Y and X1, X2, etc. (in each case, after controlling for the effects of all other Xs)? In what way?
What is the relative impact of each X on Y, holding all other Xs constant (that is, all other Xs being equal)? If so, by how much?
How strong is the connection/relationship between the Xs and Y? That is, what % of the differences/variations in Y values (e.g., income) among study subjects can be explained by (or attributed to) differences in X values (e.g., years of education, years of experience, etc.)?
[Figure: Y and the predictors X1, X2, X3]
Therefore, regression analysis, in a sense, is about ESTIMATING values of Y, using information about values of the Xs. Estimation, by definition, involves error. The objective? To minimize error in estimation, that is, to compute estimates that are as close to the true/actual values as possible.
Family (i)   Actual # of Credit Cards ($y_i$)
1            4
2            6
3            6
4            7
5            8
6            7
7            8
8            10
Total        56

Estimate? $\hat{y} = \bar{y} = \frac{\sum y_i}{n} = \frac{56}{8} = 7$

QUESTION: Can we determine how much error in estimation we are committing by using $\bar{y} = 7$ as our estimate for each of these households?
* This example was adapted from Hair, Black, Babin, Anderson, & Tatham (2006). Multivariate Data Analysis, 6th ed., Prentice Hall.
Family (i)   Actual # of Credit Cards ($y_i$)   Error in Estimation ($y_i - \bar{y}$)
1            4                                  -3
2            6                                  -1
3            6                                  -1
4            7                                   0
5            8                                  +1
6            7                                   0
7            8                                  +1
8            10                                 +3

$\sum y_i = 56$, $\bar{y} = \frac{56}{8} = 7$
[Figure: families F1 to F8 plotted by their distance from the estimate $\bar{y}$]
Let's spread the dots away from each other to see things more clearly!
[Figure: the same plot with the points spread apart; each family's vertical distance from the estimate $\bar{y}$ is its estimation error]
Can we determine the total estimation error for all 8 families?
What would be the total estimation error for all 8 families combined?

$\sum (y_i - \bar{y}) = (-3) + (-1) + (-1) + 0 + 1 + 0 + 1 + 3 = 0$

The positive and negative errors cancel out. Solution? Square each error before summing.

Family (i)   $y_i$   Error ($y_i - \bar{y}$)   Error Squared ($(y_i - \bar{y})^2$)
1            4       -3                        9
2            6       -1                        1
3            6       -1                        1
4            7        0                        0
5            8       +1                        1
6            7        0                        0
7            8       +1                        1
8            10      +3                        9
Total        56       0                        22

$\sum (y_i - \bar{y}) = 0$, $\sum (y_i - \bar{y})^2 = 22$
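The baseline calculation can be reproduced with a short Python sketch. The variable names are illustrative and not taken from the slides; only the eight credit card counts come from the example.

```python
# Actual number of credit cards for families 1 to 8 (from the example).
cards = [4, 6, 6, 7, 8, 7, 8, 10]

# Baseline estimate: the sample mean of y.
mean_cards = sum(cards) / len(cards)

# Error in estimation for each family when the mean is used as the estimate.
errors = [y - mean_cards for y in cards]

# Raw errors cancel out, so the squared errors are summed instead.
total_error = sum(errors)
sst = sum(e ** 2 for e in errors)

print("mean:", mean_cards)
print("errors:", errors)
print("sum of errors:", total_error)
print("sum of squared errors:", sst)
```

The sum of squared errors around the mean is the baseline against which any regression model is later judged.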
Family Number (i)   Actual # of Credit Cards (y)   Family Size (x)
1                   4                              2
2                   6                              2
3                   6                              4
4                   7                              4
5                   8                              5
6                   7                              5
7                   8                              6
8                   10                             6
We can now attempt to estimate # of credit cards from the information on family size, rather than from its own mean. Let's first see this graphically!
[Figure: # of credit cards (0 to 8) against family size (up to 7) for families F1 to F8, e.g. F1 at x = 2, y = 4, with the horizontal line $\hat{y} = \bar{y}$]
QUESTION: Does the mean ($\bar{y}$) appear to represent the closest estimate of the actual numbers of credit cards for our sample families? That is, is the green line the best line to represent the location of the estimates of # of CCs for these families?
[Figure: # of credit cards against family size for F1 to F8, with three candidate lines $\hat{y} = a_1 + b_1 x$, $\hat{y} = a_2 + b_2 x$, $\hat{y} = a_3 + b_3 x$ and the original (baseline) estimate $\hat{y} = \bar{y}$, i.e. $\hat{y} = a + 0x$. The regression line (line of best fit) is the new, improved location for the CC estimates.]
[Figure: regression line (line of best fit) $\hat{y} = a + bx$ against the original (baseline) estimate $\bar{y}$; the estimation error for each family is $(y - \hat{y})$.]
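$\hat{y} = a + bx$

$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

$a = \bar{y} - b\bar{x}$

Let's use the above formulas to compute the values of a and b for the regression line in our example. We will need: $\bar{y}$, $\bar{x}$, $\sum (x - \bar{x})(y - \bar{y})$, and $\sum (x - \bar{x})^2$.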
Family   y    x   $x - \bar{x}$   $y - \bar{y}$   $(x - \bar{x})(y - \bar{y})$   $(x - \bar{x})^2$
1        4    2   -2.25           -3              6.75                           5.0625
2        6    2   -2.25           -1              2.25                           5.0625
3        6    4   -0.25           -1              0.25                           0.0625
4        7    4   -0.25            0              0                              0.0625
5        8    5    0.75            1              0.75                           0.5625
6        7    5    0.75            0              0                              0.5625
7        8    6    1.75            1              1.75                           3.0625
8        10   6    1.75            3              5.25                           3.0625
Total    56   34                                  17                             17.5

$\bar{y} = \frac{56}{8} = 7$, $\bar{x} = \frac{34}{8} = 4.25$

$\sum (x - \bar{x})(y - \bar{y}) = 17$, $\sum (x - \bar{x})^2 = 17.5$
$\hat{y} = a + bx$

$b = \frac{17}{17.5} \approx 0.97$ (Regression Coefficient)
$a = 7 - 0.971 \times 4.25 \approx 2.87$ (Y-Intercept)

$\hat{y} = 2.87 + 0.97x$
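A minimal Python sketch of the slope and intercept formulas, applied to the eight families. The variable names (`sxy`, `sxx`, and so on) are illustrative choices, not notation from the slides.

```python
# Example data: credit cards (y) and family size (x) for families 1 to 8.
cards = [4, 6, 6, 7, 8, 7, 8, 10]
size = [2, 2, 4, 4, 5, 5, 6, 6]

n = len(cards)
mean_x = sum(size) / n
mean_y = sum(cards) / n

# Sum of cross-products of deviations, and sum of squared x deviations.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(size, cards))
sxx = sum((x - mean_x) ** 2 for x in size)

# Least-squares slope and intercept.
b = sxy / sxx
a = mean_y - b * mean_x

print("slope b:", round(b, 2))
print("intercept a:", round(a, 2))
```

Computing the intercept from the unrounded slope avoids the small rounding drift that appears when 0.97 is plugged in by hand.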
[Figure: # of credit cards against family size for F1 to F8, showing the new, improved estimates on the regression line $\hat{y} = 2.87 + 0.97x$ and the original (baseline) estimate $\bar{y}$]
Can we tell how much estimation error we have committed by using the new regression line? Yes: examine the differences between our households' actual # of CCs and their new/regression estimates.
Family   Actual # of Credit Cards (y)   Family Size (x)   $\hat{y}$   Error ($y - \hat{y}$)   Error Squared ($(y - \hat{y})^2$)
1        4                              2                 4.81        -0.81                   0.66
2        6                              2                 4.81         1.19                   1.42
3        6                              4                 6.76        -0.76                   0.58
4        7                              4                 6.76         0.24                   0.06
5        8                              5                 7.73         0.27                   0.07
6        7                              5                 7.73        -0.73                   0.53
7        8                              6                 8.70        -0.70                   0.49
8        10                             6                 8.70         1.30                   1.69

$\sum (y - \hat{y})^2 = 5.486$

Error eliminated by the regression line: $22 - 5.486 = 16.514 \approx 16.5$
QUESTION: What % of estimation error have we explained (eliminated) by using the regression model?

$R^2 = \frac{16.514}{22} = 0.75$

Note: When dealing with only two variables (a single X and Y): $r = \sqrt{R^2} = \sqrt{0.75} = 0.866$
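The explained share of the error can be checked with a short Python sketch. It refits the line from the example data rather than using the rounded coefficients; the names `sse`, `ssr`, and `r_squared` are illustrative.

```python
import math

# Example data: credit cards (y) and family size (x) for families 1 to 8.
cards = [4, 6, 6, 7, 8, 7, 8, 10]
size = [2, 2, 4, 4, 5, 5, 6, 6]

n = len(cards)
mean_x = sum(size) / n
mean_y = sum(cards) / n

# Least-squares line, fitted as in the formulas for a and b.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(size, cards)) / sum(
    (x - mean_x) ** 2 for x in size
)
a = mean_y - b * mean_x

# Baseline error (around the mean) and remaining error (around the line).
sst = sum((y - mean_y) ** 2 for y in cards)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(size, cards))

# Explained error, its share of the baseline error, and the simple correlation.
ssr = sst - sse
r_squared = ssr / sst
r = math.sqrt(r_squared)  # valid as |r| for a single X; the sign follows the slope

print(round(sse, 3), round(ssr, 3), round(r_squared, 2), round(r, 3))
```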
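[Figure: for family F1, the original baseline error $(y - \bar{y})$ is split into the part explained by the regression model, $(\hat{y} - \bar{y})$, and the remaining error, $(y - \hat{y})$. The plot shows # of credit cards against family size with the regression line $\hat{y} = 2.87 + 0.97x$ and the original (baseline) estimate $\bar{y}$.]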
Family   Actual # of Credit Cards ($y_i$)   Family Size ($x_1$)   Family Income ($x_2$, in $1,000s)
1        4                                  2                     14
2        6                                  2                     16
3        6                                  4                     14
4        7                                  4                     17
5        8                                  5                     18
6        7                                  5                     21
7        8                                  6                     17
8        10                                 6                     25
We can now attempt to estimate # of CCs from our information on family size and family income! Our regression model will now be a plane, rather than a straight line!
$\hat{y} = a + b_1 x_1 + b_2 x_2$
[Figure: regression plane over X1 = Family Size and X2 = Family Income, showing the actual values and the regression estimates]
Let's now see how much error in estimation we are committing by using this multiple regression model.
Family   y    $x_1$ (Family Size)   $x_2$ (Family Income)   $\hat{y}$   Error (Residual) ($y - \hat{y}$)   Error Squared ($(y - \hat{y})^2$)
1        4    2                     14                      4.77        -0.77                              0.59
2        6    2                     16                      5.20         0.80                              0.64
3        6    4                     14                      6.03        -0.03                              0.00
4        7    4                     17                      6.68         0.32                              0.10
5        8    5                     18                      7.53         0.47                              0.22
6        7    5                     21                      8.18        -1.18                              1.39
7        8    6                     17                      7.95         0.05                              0.00
8        10   6                     25                      9.67         0.33                              0.11

$\sum (y - \hat{y})^2 = 3.05$
b1 and b2 = Regression Coefficients
0.63: Among families of the same income, an increase in family size by one person would, on average, result in 0.63 more credit cards.
0.21: Among families of the same size, an income increase of $1,000 results in an average increase of 0.21 credit cards.
The bs represent the effect of each X on Y when all other Xs are controlled for/held constant/taken into account, i.e., after the impacts of all other variables are accounted for (remember the high blood pressure-hearing problem connection?).
Y-Intercept, a
(NOTE: The intercept has a meaningful/direct/practical interpretation only when all Xs can meaningfully take on a value of zero. Otherwise, it is simply an aid in increasing the accuracy of estimation.)
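As a hedged illustration, the two-predictor model can be fitted by hand in Python by solving the normal equations in deviation form. This is a sketch for exactly two predictors, not the procedure SPSS uses internally; the helper names `mean` and `cross` are hypothetical.

```python
# Example data for families 1 to 8; income is assumed to be in $1,000s.
cards = [4, 6, 6, 7, 8, 7, 8, 10]            # y
size = [2, 2, 4, 4, 5, 5, 6, 6]              # x1
income = [14, 16, 14, 17, 18, 21, 17, 25]    # x2


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v))


s11, s22, s12 = cross(size, size), cross(income, income), cross(size, income)
s1y, s2y = cross(size, cards), cross(income, cards)

# Solve the 2x2 normal equations for the partial regression coefficients.
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = mean(cards) - b1 * mean(size) - b2 * mean(income)

# Residual sum of squares for the fitted plane.
predicted = [a + b1 * x1 + b2 * x2 for x1, x2 in zip(size, income)]
sse = sum((y - p) ** 2 for y, p in zip(cards, predicted))

print(round(a, 3), round(b1, 3), round(b2, 3), round(sse, 2))
```

With these data the family-size coefficient comes out near 0.63 and the income coefficient near 0.216, which is the value quoted as 0.21 above.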
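SST = 22
[Figure: Venn diagram of Y = # of CC overlapping with X1 = Family Size and X2 = Family Income. Area a is shared by Y and X1 only, area b by Y and X2 only, area c by Y, X1, and X2, and area d is the part of Y outside both Xs.]
Total Variation/Error in Y = SS Total = a + b + c + d = 22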
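Y on X1 = Family Size only: $\hat{y} = 2.87 + 0.97 x_1$, $r^2 = ?$
SSR = a + c = 16.5
$R^2 = \frac{a + c}{a + b + c + d} = \frac{16.5}{22} = 0.75$
Pearson/simple correlation of Y with X1 (not controlling for X2):
$r_{yx_1} = \sqrt{\frac{a + c}{a + b + c + d}} = \sqrt{\frac{16.5}{22}} = \sqrt{0.75} = 0.866$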
Y on X2 = Family Income only: $\hat{y} = -0.063 + 0.398 x_2$
SSR = b + c = 15.12
Pearson/simple correlation of Y with X2 (not controlling for X1):
$r_{yx_2} = \sqrt{\frac{b + c}{a + b + c + d}} = \sqrt{\frac{15.12}{22}} = 0.829$
Y on both X1 (Family Size) and X2 (Family Income):
SSR = a + b + c = 18.95
SST = a + b + c + d = 22
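The four areas of the diagram can be recovered from the three regression sums of squares by simple subtraction. The Python sketch below only uses the figures quoted above; the `area_*` names are illustrative.

```python
# Sums of squares quoted in the example.
sst = 22.0          # a + b + c + d
ssr_size = 16.5     # a + c  (Y on family size only)
ssr_income = 15.12  # b + c  (Y on family income only)
ssr_both = 18.95    # a + b + c  (Y on both predictors)

# Unexplained variation, and the parts unique to each predictor.
area_d = sst - ssr_both
area_a = ssr_both - ssr_income   # explained by family size alone
area_b = ssr_both - ssr_size     # explained by family income alone
area_c = ssr_size - area_a       # explained jointly by both predictors

# Share of variation explained by the two-predictor model.
r_squared_both = ssr_both / sst

print(round(area_a, 2), round(area_b, 2), round(area_c, 2), round(area_d, 2))
print(round(r_squared_both, 3))
```

Area d matches the sum of squared residuals of the two-predictor model (3.05), and most of the explained variation falls in the shared area c because family size and family income overlap.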
Remember:
Use the gss_2 data file and conduct the appropriate analysis.
NOTE: satjob_2 is coded as: 1 = Very Dissatisfied, 2 = A Little Dissatisfied, 3 = Pretty Satisfied, 4 = Very Satisfied.
hapmar_2 is coded as: 1 = Not Too Happy, 2 = Pretty Happy, 3 = Very Happy.
2. Which independent variable(s) have significant relationships with the dependent variable? In the Coefficients table, look up the result of the t-test for each independent variable's regression coefficient (b). H0 for the t-test of a given variable hypothesizes that the coefficient b = 0, that is, that there is no relationship between the corresponding independent variable and the dependent variable. If a t-test's significance level (p-value) is < 0.05, reject the null and conclude that the corresponding variable has a significant relationship with the dependent variable.
3. Look up the sign of the regression coefficient (b) ONLY FOR those independent variables that are found to have a significant relationship with the dependent variable (i.e., those with p < 0.05), and state your conclusions accordingly.
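SPSS reports each coefficient's t statistic and significance level directly. To show what lies behind those numbers, here is a hedged Python sketch that computes the t statistics by hand for the small credit card example (not for gss_2). It assumes the usual two-predictor formulas for the standard errors and, because the standard library has no t distribution, compares |t| with the tabulated two-tailed 0.05 critical value for 5 degrees of freedom (2.571) instead of computing a p-value. All names are illustrative.

```python
import math

cards = [4, 6, 6, 7, 8, 7, 8, 10]            # y
size = [2, 2, 4, 4, 5, 5, 6, 6]              # x1
income = [14, 16, 14, 17, 18, 21, 17, 25]    # x2 (assumed $1,000s)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v))


s11, s22, s12 = cross(size, size), cross(income, income), cross(size, income)
s1y, s2y, syy = cross(size, cards), cross(income, cards), cross(cards, cards)

det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

# Residual variance: SSE divided by n - k - 1 degrees of freedom (k = 2).
n, k = len(cards), 2
sse = syy - (b1 * s1y + b2 * s2y)
mse = sse / (n - k - 1)

# Standard errors of the coefficients and their t statistics (H0: b = 0).
se_b1 = math.sqrt(mse * s22 / det)
se_b2 = math.sqrt(mse * s11 / det)
t_b1, t_b2 = b1 / se_b1, b2 / se_b2

T_CRIT = 2.571  # two-tailed 0.05 critical value of t with 5 df (from a t table)
significant = {"size": abs(t_b1) > T_CRIT, "income": abs(t_b2) > T_CRIT}

print(round(t_b1, 3), round(t_b2, 3), significant)
```

Comparing |t| with a critical value is equivalent to comparing the p-value with 0.05; with real survey data the Sig. column in the SPSS output serves the same purpose.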
Independent = 3 Other = 4
Dummy variables (variables representing only two groups, such as gender when coded as 0 and 1) can be used as independent variables in regression analysis. The reason is that a dummy variable's values (0, 1) can go up or down by only 1 unit, signifying a change from one group to another.
Note: A dummy variable's regression coefficient represents the average difference in the value of the dependent variable between the two groups represented by the dummy variable.
EXAMPLE 1: Income = 24000 + 1400 gender (gender coded 0 = female, 1 = male)
Meaning?
The average income of females is $24,000. Males, on average, make $1,400 more than females.
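A tiny Python sketch makes the reading of the dummy coefficient concrete. It assumes the coding implied by the interpretation above (0 = female, 1 = male); the function name is hypothetical.

```python
def predicted_income(gender: int) -> int:
    """Income = 24000 + 1400 * gender, with gender coded 0 = female, 1 = male."""
    return 24000 + 1400 * gender


female_income = predicted_income(0)   # the intercept alone
male_income = predicted_income(1)     # intercept plus the dummy coefficient
difference = male_income - female_income

print(female_income, male_income, difference)
```

Because the dummy can only move from 0 to 1, its coefficient is exactly the gap between the two groups' predicted values.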
Meaning?
Among people of the same gender, every additional year of education results in an average additional income of $1,000. Males make, on average, $800 more than females who have the same number of years of education.
Exercise 3: Suppose we are interested in knowing what role, if any, the following demographic characteristics play in determining one's income (rincmdol): age, Sex_Dummy (0 = male, 1 = female), age first married (agewed), years of education completed (educ), and political party affiliation (republic: 0 = Democrat, 1 = Republican).
Use the gss_2 data file and conduct the appropriate analysis.
Exercise 4: Suppose we are interested in knowing what role, if any, demographic characteristics (i.e., age, sex_Dummy, educ, sibs, agewed, incomdol), as well as job satisfaction (satjob_2) and marriage satisfaction (hapmar_2), play in determining one's overall happiness in life (happy_2). Use the gss_2 data file and conduct the appropriate analysis.
Assignment 5
Data file Salary.sav contains information about 474 employees hired by a Midwestern bank between 1969 and 1971 (NOTE: Due to SPSS site license restrictions, this hyperlink will not work if you are off campus). Of the 474 employees, 258 were men and 216 were women; 370 were white and 104 were non-white. The bank was subsequently involved in EEOC litigation, in which it was accused of gender and race discrimination in its hiring and compensation practices. The two issues of particular interest in the litigation were alleged gender and racial inequalities not only in the bank's beginning salaries (variable salbeg), but also in its later salaries (variable salnow).
1. Print, examine, and interpret correlation coefficients between beginning salary (salbeg) and age in years (age), education in years (edlevel), employment category or job classification level, rated from 1 = lowest to 8 = highest (jobcat), and work experience in months (work).
2. Conduct the appropriate analysis to see:
(a) What role did each of the variables age, education (edlevel), employment category (jobcat), and work experience (work) play, holding all other variables constant, in determining the bank's beginning salaries? For example, what was the differential pay for one additional year of education among new hires who otherwise had the same age, employment category, and work experience?
(b) Which of the above characteristics had the strongest influence on beginning pay? How can you tell?
(c) What percent of the differences in employees' beginning salaries can be explained by/attributed to differences in all of the above characteristics?
3. Now conduct the appropriate analysis to indicate, holding all other variables constant, what role gender (sex, male = 0, female = 1) played in determining beginning salaries at the bank. That is, what was the differential beginning pay between male and female employees who otherwise had the same age, education, employment category, and work experience? Does this evidence support the charges of gender discrimination in the bank's practices regarding initial compensation?
4. During litigation, it was charged that the bank's unfair compensation practices had continued beyond its initial salary decisions. That is, the prosecution claimed that, with time, the beginning salary disparities between men and women not only failed to shrink but widened further. Conduct the appropriate analysis to indicate:
(a) Everything else being equal, what role did gender play in determining employees' later salaries at the bank (salnow)? That is, what was the average differential pay between male and female employees who otherwise had the same age, education, employment category, work experience, and job seniority (variable time represents seniority in terms of number of months employed at the bank)?
(b) Compare the later pay disparities you have just identified with the beginning pay disparities you found in question 3 above, and explain whether the evidence supports the prosecution's charges of continued gender discrimination beyond initial salary decisions, resulting in widening disparities in later pay.
NOTE: For each question, provide thorough explanations on the corresponding pages and parts of your printout.
QUESTIONS OR COMMENTS?