
2014 New York Health Data Report:

Multiple Linear Regression Analysis


Prepared by: Ann Steward

Spring 2014
In partial fulfillment of the course requirements for MGMT 305-03

School of Business
SUNY Geneseo

I. INTRODUCTION

The purpose of this project, as directed by Dr. Farooq Sheikh, is to demonstrate our understanding of
building multiple linear regression models and of the reasoning behind eliminating variables from them.
Our objective is to refine our regression model and to establish confidence in it.
In this study, we analyze data to construct five regression models that estimate how strongly
independent characteristics affect the potential years of life lost by the average New Yorker.
The independent factors covered in our models include:
% Obese
% Smokers
% Physically Inactive
% Excessive Drinking
% Had Low Birth Weight
% Uninsured [1]

By using multiple-linear regression, it is our intent to calculate which of these characteristics are
most significant in their direct impact upon potential years of life lost. The following objectives identify our
interest in each multiple-regression model and our initial hypotheses, described in respective order:
Model 1
The purpose of Model 1 is to show a regression analysis with all six independent x-variables. Our
interest in this model is to identify which of those variables are most likely to affect the dependent
variable and to remove the variables that are insignificant. The larger the absolute value of a variable's
t-statistic, the stronger the evidence that it influences the dependent variable. We will remove variables
with excessively high P-values in a step-wise manner. Our initial hypothesis is that the % Smokers category
will have the greatest influence on our YLPL (Years of Life Potentially Lost) rate [2].
Model 2
The purpose of Model 2 is to show a regression analysis with five of the x-variables, excluding the
one that had the highest P-value in Model 1 (% Uninsured). Our interest in this model is to find a better
fit. Our initial hypothesis is that repeating the regression without % Uninsured will improve the fit of
our model.
Model 3
The purpose of Model 3 is to show a regression analysis with four of the independent x-variables.
We will exclude a second variable (% Excessive Drinking) in this model, based on its P-value, in an effort to
create a more accurate final model. We believe that removing % Excessive Drinking and repeating the
regression test will yield a more accurate model.

Model 4
The purpose of Model 4 is to eliminate one more x-variable in an effort to reach a best-fit
version of this model. Our interest is ultimately to identify a more accurate final model. We hypothesize
that all remaining variables will be significant (P < .05).
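The step-wise removal described for Models 1 through 4 can be sketched as a backward-elimination loop. The following is a minimal illustration, not the Excel or StatPlus procedure used in this report. Within one fitted model, every coefficient's P-value is computed from the same t distribution, so the variable with the highest P-value is also the one with the smallest absolute t-statistic; the sketch relies on that to avoid computing P-values. The function names, the cutoff t_crit = 2.0 (a rough stand-in for the two-sided 5% critical value in moderately large samples), and the toy data are all assumptions made for illustration.

```python
import math
from itertools import product


def ols_t_stats(columns, y):
    """Fit y = b0 + b1*x1 + ... by least squares; return {name: t-statistic}."""
    names = list(columns)
    n, p = len(y), len(names) + 1  # p includes the intercept
    rows = [[1.0] + [columns[name][i] for name in names] for i in range(n)]

    # Normal equations: (X'X) b = X'y
    xtx = [[sum(row[a] * row[b] for row in rows) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(rows, y)) for a in range(p)]

    # Invert X'X by Gauss-Jordan elimination on the augmented matrix [X'X | I].
    aug = [xtx[a] + [1.0 if a == b else 0.0 for b in range(p)] for a in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    inverse = [row[p:] for row in aug]

    beta = [sum(inverse[a][b] * xty[b] for b in range(p)) for a in range(p)]
    residuals = [yi - sum(bj * xj for bj, xj in zip(beta, row))
                 for row, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    return {name: beta[j + 1] / math.sqrt(sigma2 * inverse[j + 1][j + 1])
            for j, name in enumerate(names)}


def backward_eliminate(columns, y, t_crit=2.0):
    """Repeatedly drop the predictor with the smallest |t| until all pass t_crit."""
    kept = dict(columns)
    dropped = []
    while kept:
        t_stats = ols_t_stats(kept, y)
        weakest = min(t_stats, key=lambda name: abs(t_stats[name]))
        if abs(t_stats[weakest]) >= t_crit:
            break
        dropped.append(weakest)
        del kept[weakest]
    return list(kept), dropped


# Toy data (not from the report): a 2x2x2 factorial design in which x3 has
# no effect on y by construction; the a*b*c term plays the role of noise.
design = list(product([-1.0, 1.0], repeat=3))
toy_columns = {
    "x1": [a for a, _, _ in design],
    "x2": [b for _, b, _ in design],
    "x3": [c for _, _, c in design],
}
toy_y = [10 + 5 * a + 3 * b + a * b * c for a, b, c in design]

kept, dropped = backward_eliminate(toy_columns, toy_y)
print("kept:", kept, "dropped:", dropped)
```

On the toy data the loop drops x3 and keeps x1 and x2. With real county data the cutoff should come from the t distribution for the model's actual degrees of freedom rather than the fixed 2.0 assumed here.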
II. METHODOLOGY

Data Source Background


Our data were drawn from the University of Wisconsin Population Health Institute's 2014 New York
County Health Data report. The institute's efforts are sponsored by the Robert Wood Johnson Foundation,
the largest U.S. charity focused on public health [3].
Testing Software
According to Princeton's online archive, regression testing is "any type of software testing that
seeks to uncover software errors after changes to the program have been made, by retesting the
program" [4]. We estimated the relationship between our regressors and their beta values using the
Data Analysis add-in for Microsoft Excel as well as StatPlus, a statistics program for Macintosh.
Model Refinement Procedure
As defined by author James R. Evans, regression analysis is "a tool for building statistical models
that characterize relationships among a dependent variable and one or more independent variables, all
of which are numerical" [5].
To determine whether our regression variables have plausible explanatory power, we examined each
characteristic that had the potential to influence our dependent variable. Isolating each
characteristic reinforced our working assumption that the observations are independent and identically
distributed (IID) [6].
We worked under the strict assumption that any finding regarding one characteristic would not, in any
way, affect another. We also assumed that each x-variable was itself independent of the others.

III. RESULTS

Statistical Analysis Outcome: Final Regression Model

Final Model:
ŷ = 76.12 X1 + 347.07 X2 + 75.73 X3 - 69.9
ŷ = predicted Years of Life Potentially Lost (YLPL)
X1 = % Smokers
X2 = % Had Low Birth Weight
X3 = % Physically Inactive
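As a hedged illustration of how the final model could be applied, the sketch below evaluates the fitted equation for one set of inputs. It assumes the three predictors are entered as percentage points (for example, 18 for 18%), which this report does not state explicitly. The function name and the input values are hypothetical and are not taken from the data set.

```python
def predict_ylpl(pct_smokers, pct_low_birth_weight, pct_inactive):
    """Evaluate the report's final regression equation.

    Assumption: inputs are percentage points (18 means 18%).
    """
    return (76.12 * pct_smokers
            + 347.07 * pct_low_birth_weight
            + 75.73 * pct_inactive
            - 69.9)


# Hypothetical county, not a row from the data set.
example = predict_ylpl(pct_smokers=18, pct_low_birth_weight=8, pct_inactive=25)
print(round(example, 2))
```

Read this way, each additional percentage point of low birth weight is associated with about 347 more units of YLPL, holding the other two predictors fixed.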

Our final model reveals that the categories of % Smokers, % Had Low Birth Weight, and
% Physically Inactive were the most influential factors on Years of Life Potentially Lost. The categories of
% Uninsured, % Excessive Drinking, % Obese, and % Physically Inactive produced the highest P-values, in that
order. This means that those four x-variables were the least significant predictors in the regression model;
the first three were removed, while % Physically Inactive was retained in the final model.
The adjusted R-squared for this model was 0.54, which suggests that approximately 54% of the
variation in y is explained by the x-variables in our regression. The t-statistics are high, which tells
us that each individual x-variable is significant (assuming a 95% confidence level is used).
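For reference, the standard definition of adjusted R-squared for a model with an intercept is given below. The symbols n (number of observations, here counties) and k (number of x-variables in the model) are notation introduced for this note and do not appear in the Excel output.

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}
```

Because the penalty factor (n - 1)/(n - k - 1) grows with k, adjusted R-squared is never larger than the unadjusted R-squared, and it can fall when a variable is added or rise when one is removed.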

IV. DISCUSSION

Comprehensive Analysis
Correlation Metrics Table

Prior to creating our multiple-regression models, we chose to examine the individual x-variables using the
Data Analysis tools in Microsoft Excel. The table was produced through Excel's Correlation option. In the
table, the independent variables % Obese and % Smokers show the strongest correlations.
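A correlation table of this kind can be reproduced outside Excel. The sketch below computes Pearson correlations for every pair of columns using only the Python standard library. The column names and the five rows of values are made up for illustration and are not the county data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for illustration only.
toy = {
    "pct_obese": [24.0, 28.0, 31.0, 26.0, 33.0],
    "pct_smokers": [15.0, 19.0, 22.0, 16.0, 24.0],
    "pct_inactive": [22.0, 21.0, 27.0, 25.0, 26.0],
}

matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

The diagonal of the matrix is always 1, and the matrix is symmetric, so only the entries below the diagonal need to be read.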

Generated Intuitions
After observing that % Obese and % Smokers had the strongest positive correlations, we first
agreed that those rankings made sense. Obesity and habitual smoking are detrimental to any individual's
health, causing diseases and conditions that have the potential to shorten a person's lifespan.
The closest runners-up for highest correlation were % Physically Inactive vs. % Smokers and
% Physically Inactive vs. % Obese. These correlations are also logical, as an inactive individual is
likely to also be obese or to smoke tobacco, although we expected these correlations to be much
greater.
Counter-intuitive Findings
We removed the variable % Excessive Drinking because of its high P-value of .197. Although in theory doing so
should improve the model, the adjusted R-squared value decreased from .58 to .57, a decrease of one percentage
point in the share of the variance of the dependent variable that can be attributed to the independent variables.
Furthermore, removing % Obese because of its comparatively high P-value of .02 caused adjusted R-squared to drop
from .58 to .54. Our final model includes the variable % Physically Inactive despite its somewhat high P-value of
.006, because removing it caused adjusted R-squared to drop below 50% (see Appendix: Figure 6).
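One way to see why adjusted R-squared can fall after a variable with a high P-value is removed: for an ordinary least-squares model with an intercept, dropping a single predictor raises adjusted R-squared only when that predictor's absolute t-statistic is below 1, which in large samples corresponds to a P-value of roughly .32, well above the usual .05 cutoff. The sketch below checks this numerically. All inputs (62 observations, five predictors, an R-squared of 0.60, and the two t-statistics) are hypothetical and chosen only for illustration; they are not values from our regression output.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def r2_after_dropping(r2_full, t_stat, n_obs, n_predictors):
    """R-squared after removing one predictor from the full model.

    Uses the identity t^2 = (R2_full - R2_reduced) / ((1 - R2_full) / df),
    where df = n_obs - n_predictors - 1 and t_stat is that predictor's
    t-statistic in the full model.
    """
    df = n_obs - n_predictors - 1
    return r2_full - t_stat ** 2 * (1 - r2_full) / df


# Hypothetical inputs, not taken from the report's output.
n_obs, n_predictors, r2_full = 62, 5, 0.60
full = adjusted_r2(r2_full, n_obs, n_predictors)

for t_stat in (0.5, 1.3):
    reduced_r2 = r2_after_dropping(r2_full, t_stat, n_obs, n_predictors)
    reduced = adjusted_r2(reduced_r2, n_obs, n_predictors - 1)
    change = "rises" if reduced > full else "falls"
    print(f"|t| = {t_stat}: adjusted R-squared {change} ({full:.4f} -> {reduced:.4f})")
```

Under these assumed inputs, removing a predictor with |t| = 0.5 raises adjusted R-squared, while removing one with |t| = 1.3 lowers it even though neither would pass a .05 significance test.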
These findings could be the result of one underlying issue among our independent variables. Although we
assumed that the variables are strictly independent of one another, we recognized a slight correlation
between % Obese and % Smokers. This correlation could result in multicollinearity; in other words, more than
one independent variable may be conveying the same information, making it hard to separate their individual
influence on the dependent variable.
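A common way to gauge multicollinearity is the variance inflation factor (VIF), which was not computed in this project. The sketch below covers only the simplest case, a model with exactly two predictors, where the VIF of either predictor depends only on their pairwise correlation r. The function name and the sample correlations are hypothetical; with more than two predictors, each VIF would instead require regressing that predictor on all of the others.

```python
def vif_two_predictors(r):
    """VIF of either predictor in a two-predictor model, given their correlation r."""
    if abs(r) >= 1:
        raise ValueError("perfectly correlated predictors have an unbounded VIF")
    return 1 / (1 - r * r)


# Hypothetical correlations, not values from the report's correlation table.
for r in (0.2, 0.5, 0.9):
    print(f"r = {r}: VIF = {vif_two_predictors(r):.2f}")
```

A VIF near 1 indicates that the two predictors share little information. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, although that threshold is a convention rather than a formal test.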

Limits of Analysis
Although we are content overall with our results, we realize that this analysis is greatly limited in the scope
of its findings. First of all, the data cover only counties within New York State and ignore environmental
factors that could affect health in different regions of the state, e.g., exposure to pollution, hazardous living
conditions, and exposure to natural disasters. The adjusted R-squared we recorded tells us that only about 54% of
the variation in Years of Life Potentially Lost can be attributed to the independent variables, which suggests
that other variables not included in the data set also have a substantial influence on our dependent variable.
Furthermore, additional research showed that a high P-value alone is not a sufficient reason to remove a
potentially significant variable from a regression analysis, and that doing so can in theory result in omitted
variable bias. There are a number of additional methods for determining significant variables, but they are beyond
the scope of this course, leaving us no option but to determine our final regression model by removing
independent variables on the basis of their P-values.

V. REFERENCES

About RWJF. Robert Wood Johnson Foundation. http://www.rwjf.org/en/about-rwjf.html
Evans, James R. Statistics, Data Analysis, and Decision Modeling. Pearson Education, Inc., 2013.
Income, Poverty, and Health Insurance Coverage in the United States: 2012.
http://www.census.gov/prod/2013pubs/p60-245.pdf
Regression Testing. Princeton University.
https://www.princeton.edu/~achaney/tmve/wiki100k/docs/Regression_testing.html

VI. APPENDIX
The following information, data, charts, and calculations were used in a collaborative effort to determine
the relevance, if any, of each beta; that is, the relationship between the dependent variable, Years of Life
Potentially Lost (y), and its independent variables. All resources were referenced for the sole purpose of the
project assigned by Dr. Farooq Sheikh and are cited in the preceding References section.
Figure 1: Data Table of YLPL (Years of Life Potentially Lost) for New York Counties
Source: http://www.countyhealthrankings.org/sites/default/files/state/downloads/2014%20County%20Health%20Rankings%20New%20York%20Data%20-%20v1.xls

Figure 2: Model 1
(With all independent x-variables)

Figure 3: Model 2
(With all independent x-variables excluding % Uninsured)

Figure 4: Model 3
(With all independent x-variables excluding % Uninsured, % Excessive Drinking)

Figure 5: Model 4
(With all independent x-variables excluding % Uninsured, % Excessive Drinking, % Obese)

Figure 6: Model 5
(With all independent x-variables excluding % Uninsured, % Excessive Drinking, % Obese, % Physically Inactive)
