
George Han

03/06/13
Regression and Multivariate Data Analysis STAT-UB 17
Homework 1
Professor Simonoff
Consistency of Women's Marital Statuses among Different Age Groups in U.S. Territories
Life in each territory within the United States is different. From the pleasant suburbs of
Silicon Valley in the Bay Area to fast-paced metropolitan New York City, differences in living
conditions, policies, and more give each territory its own way of life. Likewise, each territory
yields different statistics concerning the marital statuses of its residents. While it is
difficult to name explicitly the factors that influence marital status, one might expect that
within each individual territory the way people live is at least somewhat consistent, and
therefore that marital statuses are consistent among different age groups as well. Whether this
consistency actually holds is interesting because it could reveal how people of different age
groups think about marriage. In this report, this speculated consistency is tested statistically
for women.
In the following, a regression analysis is run on data obtained from the United States Census
Bureau's Report C2KBR-30, "Marital Status: 2000," Table 3: "Marital Status for Women 15 Years
and Over by Age for the United States, Regions, States, and Puerto Rico: 2000." The data can be
found here: http://www.census.gov/population/www/cen2000/briefs/phc-t27/index.html. Data from
the year 2000 for 52 regions within United States territory (the 50 states, the District of
Columbia, and Puerto Rico) are used. The predictor variable is the percentage of women aged
15-24 years whose marital status was "married" in the year 2000. The target variable is the
percentage of women aged 25-34 years whose marital status was "married" in the year 2000. If
women's marital statuses in each individual territory are consistent among different age groups,
then the target variable should be positively correlated with the predictor variable. The
expectation is that, in any given year, where marriage is more desirable it should be so within
all age groups, and where marriage is less desirable it should likewise be so within all age
groups.
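
The simple linear regression described above can be sketched in a few lines of Python. This is only a minimal illustration of the least-squares formulas, not the software used for this report. The function name `fit_line` and the five data points are made up for the example and are not Census values.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; also returns R^2."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical percentages for five made-up territories (NOT the Census data).
pct_married_15_24 = [10.0, 13.0, 16.0, 20.0, 24.0]  # predictor
pct_married_25_34 = [55.0, 58.5, 59.0, 63.0, 66.5]  # target

slope, intercept, r_squared = fit_line(pct_married_15_24, pct_married_25_34)
print(f"slope={slope:.4f}  intercept={intercept:.4f}  R^2={r_squared:.4f}")
```

A positive slope in such a fit corresponds to the positive correlation that the consistency argument predicts.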

[Figure: scatterplot "Women's Marriage Statuses By State Compared By Age". x-axis: Percentage of Women Married Age 15-24; y-axis: Percentage of Women Married Age 25-34.]

Looking at this scatterplot, the positive correlation we expected does appear to be present, but
one outlier near the x-axis goes against the apparent relationship and lies far from the other
data points. This outlier is the District of Columbia. According to Newsweek
(http://www.thedailybeast.com/newsweek/blogs/the-gaggle/2009/10/20/why-so-few-d-cresidents-are-married.html), D.C.'s marriage rate is artificially low because of several
demographic factors, some of which are likely due to D.C.'s being the U.S. capital city. First,
in the year 2000, same-sex marriage was illegal in D.C., yet D.C. was home to a relatively high
percentage of LGBT people (compared to other territories). Second, D.C. residents [tend to]
marry at an older age (Newsweek). Third, the city is split into two main classes, so to speak
(the wealthy white minority and the poorer Hispanic and African-American masses), which do not
interact well, especially not for purposes of marriage. Fourth, it has been observed that
marriage rates for African-Americans are steadily decreasing over time, an effect made more
prominent by D.C.'s large African-American population. Because of all this, and because D.C. is
more of a city than the other territories in this data set are, all data pertaining to D.C. are
omitted from the rest of the report for the sake of the regression. This trims the usable data
down to 51 points: the 50 states and Puerto Rico. We do not omit the data for Puerto Rico
because the territory is more state-like than D.C. is and does not appear to create any
problematic outliers.

[Figure: scatterplot "Women's Marriage Statuses By State Compared By Age (Without District of Columbia)". x-axis: Percentage of Women Married Age 15-24; y-axis: Percentage of Women Married Age 25-34.]

Performing a least-squares regression on these data, we get:

Percentage of Women Married Age 25-34 = 48.3417 + 0.7216 × (Percentage of Women Married Age 15-24)

In order to test whether the consistency of women's marital status among different age
groups in various U.S. states is statistically significant, it is necessary to analyze the above
statistical output. The regression is of moderate strength with R² = 0.4156, meaning that 41.56%
of the variability in the target variable can be accounted for by the predictor. The regression
slope for the predictor, 0.7216, means that an increase of one percentage point in the
percentage of women aged 15-24 years whose marital status was "married" in the year 2000 is
associated with an estimated expected increase of 0.7216 percentage points in the percentage of
women aged 25-34 years whose marital status was "married" in the year 2000. The t-statistic for
the regression slope, 5.903, is associated with a P-value of 3.31e-7, meaning that if the true
slope were zero, a t-statistic at least this far from zero would occur by chance with a
probability of only about 3.31e-7. A hypothesis test of the regression slope:
Ho: the regression slope is less than or equal to 0 (U.S. women's marital statuses in individual
states are not necessarily consistent among different age groups)
Ha: the regression slope is greater than 0 (U.S. women's marital statuses in individual states are
indeed consistent among different age groups)
Since the P-value is far below 0.001, we reject Ho in favor of Ha at the 0.001 significance
level and conclude that the regression slope is statistically significantly greater than 0. The
reported P-value appears to be two-sided, since it matches the P-value of the F-test for the
regression. The one-sided P-value appropriate to Ha is half as large, which only strengthens
this conclusion.
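
As a rough check on this test, the tail probability of the t-statistic can be computed directly. The sketch below uses only the Python standard library and integrates the Student's t density numerically with Simpson's rule. It assumes 49 residual degrees of freedom (51 observations minus 2 estimated coefficients). The helper names `t_pdf` and `t_upper_tail` are made up for this illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, upper=200.0, steps=100_000):
    """Approximate P(T > t) by Simpson's rule on [t, upper] (steps must be even).

    The probability beyond `upper` is negligible for the values used here.
    """
    h = (upper - t) / steps
    total = t_pdf(t, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return total * h / 3

t_stat = 5.903   # slope t-statistic reported in the text
df = 51 - 2      # assumption: 51 observations, 2 estimated coefficients

one_sided = t_upper_tail(t_stat, df)  # P-value for Ha: slope > 0
two_sided = 2 * one_sided             # P-value for a two-sided alternative
print(f"one-sided P = {one_sided:.3g}, two-sided P = {two_sided:.3g}")
```

Under these assumptions the two-sided value should land near the reported 3.31e-7, and the one-sided value is half of it by symmetry of the t distribution.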
The F-statistic for the regression, 34.84, is associated with a P-value of 3.306e-7, meaning
that if the predictor had no linear relationship with the target, an F-statistic at least this
large would occur by chance with a probability of only about 3.306e-7. This low P-value means
that at the 0.001 significance level, the regression as a whole is also statistically
significant. The intercept, 48.3417, is not meaningful to interpret because it describes a
territory in which 0% of women aged 15-24 are married, which is extremely unlikely anywhere in
U.S. territory. In conclusion, from the results of this regression analysis, it can be said that
women's marital statuses among different age groups in various U.S. states are indeed
consistent, in that the positive association between the two age groups is statistically
significant.
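
The reported statistics can be cross-checked against one another with simple arithmetic. In a regression with a single predictor, the F-statistic equals the square of the slope's t-statistic, and R² equals F / (F + residual degrees of freedom). The sketch below uses only the figures quoted in the text. The 15% input to `predict` is a hypothetical value chosen to lie within the plotted range, and the variable and function names are made up for this illustration.

```python
# Values reported in the text.
t_slope = 5.903
f_stat = 34.84
r_squared = 0.4156
intercept = 48.3417
slope = 0.7216
n = 51            # 50 states plus Puerto Rico
df_resid = n - 2  # residual degrees of freedom with one predictor

# With one predictor, F is the square of the slope's t-statistic.
f_from_t = t_slope ** 2

# R^2 can be recovered from F and the residual degrees of freedom.
r2_from_f = f_stat / (f_stat + df_resid)

def predict(pct_married_15_24):
    """Fitted percentage married at age 25-34 for a given percentage at 15-24."""
    return intercept + slope * pct_married_15_24

print(f"t^2 = {f_from_t:.3f}  (reported F = {f_stat})")
print(f"F / (F + df) = {r2_from_f:.4f}  (reported R^2 = {r_squared})")
print(f"fitted value at a hypothetical 15% = {predict(15.0):.2f}")
```

Small discrepancies in the last digit are expected because the reported values are rounded.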

[Figure: "Residuals vs. Fitted Values". x-axis: fitted Percentage of Women Married Age 25-34; y-axis: Residuals.]

This residual plot appears to show a slight upward pattern between the residuals and the fitted
values. Because least-squares residuals are uncorrelated with the fitted values by construction,
such a pattern most likely reflects the data being somewhat clumped, which could be a matter of
different states being in different regions of the U.S., for example the Northeast, Midwest,
South, and West. Other than this, the residual plot does not seem to show any abnormalities,
suggesting that otherwise the errors are homoscedastic and approximately normally distributed
with expected value approximately 0, and that the relationship is approximately linear.
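
When reading a residuals-versus-fitted plot, it helps to remember that a least-squares fit forces the residuals to sum to zero and to have zero sample covariance with the fitted values. Any trend that the eye picks up therefore reflects clustering or curvature rather than a linear correlation. The sketch below demonstrates this property. The data are made up for the illustration and are not Census values, and the helper names are hypothetical.

```python
from statistics import mean

def ols_residuals(x, y):
    """Return (fitted, residuals) from a least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * a for a in x]
    residuals = [b - f for b, f in zip(y, fitted)]
    return fitted, residuals

def covariance(u, v):
    """Sample covariance of two equal-length sequences."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / (len(u) - 1)

# Hypothetical percentages (NOT the Census data).
x = [10.0, 13.0, 16.0, 20.0, 24.0]
y = [55.0, 58.5, 59.0, 63.0, 66.5]

fitted, residuals = ols_residuals(x, y)
print("sum of residuals:", sum(residuals))
print("cov(fitted, residuals):", covariance(fitted, residuals))
```

Both printed quantities should be zero up to floating-point rounding, whatever data are supplied.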

[Figure: "Normal Probability Plot of the Residuals". x-axis: Normal Scores; y-axis: Standardized Residuals.]

This normal probability plot shows that the residuals are approximately normally distributed
because the points lie close to the displayed line.

[Figure: "Residuals Versus the Order of the Data". x-axis: Observation Order; y-axis: Residual.]

This residual plot shows little evidence of autocorrelation because the points look randomly
scattered around the horizontal line residual = 0.

[Figure: "Histogram of the Residuals". x-axis: Residuals; y-axis: Frequency.]

This histogram shows that the residuals are very roughly normally distributed.
From the four residual plots shown above, it can be seen that all four of the assumptions
of linear regression (homoscedasticity, normally distributed errors, linearity, and error
expected value 0) seem to hold roughly. This is good enough for the purpose of this regression
analysis because the issue at hand is only whether the two variables are positively correlated.
Before this report concludes, it is worth noting the scope of the conclusion that is about to be
reached. The data used in this regression analysis are specifically from the year 2000 and
concern only the marital statuses of women aged 15-34 in the United States who responded to the
United States Census 2000. While it is possible and even likely that these data resemble similar
data sets for other years, other nations, other genders (i.e., men), other ages, or the like,
that is extrapolation, and there is no guarantee that such extrapolation would be accurate to
any degree. With this in mind, according to this regression analysis, it can finally be stated
that women's marital statuses among different age groups in U.S. territories indeed appear to be
consistent because, as statistically shown, changes in the percentage of women in the U.S. aged
15-24 years who hold the marital status "married" in the year 2000 are indeed associated with
changes in the same direction
of the percentage of women in the U.S. aged 25-34 years who hold the marital status
"married" in the same year. This conclusion suggests that, within a given territory, people of
different age groups may indeed view marriage in similar ways.
Sources:
http://www.census.gov/population/www/cen2000/briefs/phc-t27/index.html
http://www.thedailybeast.com/newsweek/blogs/the-gaggle/2009/10/20/why-so-few-d-cresidents-are-married.html
