
Mingyo Lee

Linear Regression Report


I.

II.

Here in the US, the internet consumes my life at home through schoolwork, social media, and
news. I am on my computer or my phone for the majority of my day, and if the internet were
ever destroyed, my life would be in pieces. In countries like India, however, many people do
not have internet access. Broadband and internet provider service are lacking, and the
internet service and internet-capable devices that do exist are expensive and reserved for the
wealthy. People like Mark Zuckerberg have therefore taken it upon themselves to popularize
and cheapen the internet in poorer countries in order to establish a truly worldwide
connection. I thought that the GDP per Capita and Internet Users per 100 People of different
countries were linearly related because I thought that the higher a country's economic output
per person, the more the country's citizens would consume, spend, and have access to luxuries
like the Internet. On a scatterplot, I plotted points for 49 different countries by GDP per
Capita (as the explanatory variable) and Internet Users per 100 People (as the response
variable), based on 2014 data from the World Bank. I found that the two variables had a
moderate, positive, linear correlation.

The least squares regression line equation that models my data is y hat = 33.30492 + (0.00105)x.
In this equation, y hat is the predicted number of Internet Users per 100 People, while x is
the GDP per Capita in US dollars. A slope of 0.00105 means that as the GDP per Capita
increases by 1 dollar, the Internet Users per 100 People is predicted to increase by 0.00105
users (about 1.05 users per 100 people for every additional $1,000). The y-intercept has no
practical meaning here: x = 0 would mean GDP per Capita = GDP/Population = 0, and therefore
GDP = 0. An economy that produces nothing, not even food or shelter, could not survive, and
every country has a positive GDP. The r2 value is 0.62125, which means that about 62.1% of the
variance in Internet Users per 100 People can be explained by the linear relationship with GDP
per Capita. The r value is 0.7881964, which indicates a moderate, positive, linear relationship
between the two variables.
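
As a cross-check on how these numbers relate to one another, the same quantities can be read
from base R's lm() and cor(). The sketch below uses a small made-up data set (the values are
hypothetical, not the World Bank data), so only the relationships between the outputs matter,
not the numbers themselves.

```r
# Hypothetical toy data, NOT the World Bank figures used in this report.
gdp   <- c(1000, 5000, 12000, 20000, 35000, 50000)   # GDP per capita (US$)
users <- c(15, 30, 55, 60, 80, 90)                    # Internet users per 100 people

fit <- lm(users ~ gdp)             # least squares regression line

intercept <- unname(coef(fit)[1])  # predicted users when gdp = 0
slope     <- unname(coef(fit)[2])  # change in predicted users per extra dollar
r2        <- summary(fit)$r.squared
r         <- cor(gdp, users)

# For a simple linear regression, r^2 equals the squared correlation,
# and r carries the same sign as the slope.
print(c(intercept = intercept, slope = slope, r2 = r2, r = r))
```

Because r2 is just r squared here, taking the square root of a rounded r2 gives a slightly
different r than calling cor() directly.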

The residual plot of the complete data set is shown below. Most of the points are clustered
close to the y-axis, which suggests that there is an outlier in the x direction.

To confirm the presence of an outlier, I made a boxplot for each variable. A point is an
outlier if it falls outside the range from the 25th Percentile - (1.5 x IQR) to the 75th
Percentile + (1.5 x IQR). As illustrated by the boxplots below, there was an outlier in the x
direction, but not in the y direction. That point, Qatar, is an influential point: it is
extreme in the x direction and, on the scatterplot, lies away from the least squares
regression line, so it can noticeably change the line's slope and intercept.
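
The 1.5 x IQR rule can also be applied directly, without reading it off a boxplot. The sketch
below is a minimal illustration on made-up numbers (hypothetical, not the report's data). It
uses quantile() with its default settings; boxplot() computes its hinges slightly differently,
so the two can disagree for borderline points.

```r
# Hypothetical toy values, NOT the report's data.
x <- c(2, 4, 5, 7, 8, 9, 11, 12, 40)

q1  <- unname(quantile(x, 0.25))   # 25th percentile
q3  <- unname(quantile(x, 0.75))   # 75th percentile
iqr <- q3 - q1                     # interquartile range

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Any value outside [lower_fence, upper_fence] is flagged as an outlier.
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
```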

If I remove the outlier, the r2 value increases to 0.68104 and the r value increases to
0.8252515. The intercept decreases to 30.25639 Internet Users per 100 People, while the slope
increases to 0.00128 Internet Users per 100 People per dollar of GDP per Capita.
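
The effect of a single influential point can be illustrated by fitting the line with and
without it. The sketch below uses made-up numbers (hypothetical, not the report's data): five
points that lie exactly on y = 2x, plus one point that is extreme in the x direction.

```r
# Hypothetical toy data, NOT the report's data: five points on the line
# y = 2x plus one point that is extreme in the x direction.
x <- c(1, 2, 3, 4, 5, 20)
y <- c(2, 4, 6, 8, 10, 12)

fit_all     <- lm(y ~ x)              # all six points
keep        <- x < 20                 # drop the extreme point
fit_without <- lm(y[keep] ~ x[keep])  # remaining five points

slope_all     <- unname(coef(fit_all)[2])
slope_without <- unname(coef(fit_without)[2])

# A large change in slope after removing one point marks it as influential.
print(c(slope_all = slope_all, slope_without = slope_without))
```

As in the report, removing the extreme point makes the fitted slope steeper.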

Prediction: Arbitrarily selected explanatory data point: Hungary, with a GDP per Capita of
$13902.7 and 76.1 Internet Users per 100 People.
y hat = 33.30492 + (0.00105)x.
x = GDP per Capita = $13902.7

y hat = 33.30492 + (0.00105 * 13902.7) = 47.90275


My predicted Internet Users per 100 People for a country with a GDP per Capita of
$13902.7 is 47.90275 Users.
My prediction is not accurate: the predicted y value of 47.90275 Users/100 People is well
below the actual y value of 76.1 Users/100 People. The residual (actual - predicted) is
76.1 - 47.90275 = 28.19725, and because it is positive, the regression line underestimates
the actual response for Hungary.
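
The prediction and residual above can be reproduced in a few lines of R. The sketch uses only
the coefficients and the Hungary data point reported in this section; the variable names are
my own.

```r
# Coefficients and data point as reported in this section.
intercept      <- 33.30492
slope          <- 0.00105
gdp_hungary    <- 13902.7   # GDP per capita (US$)
actual_hungary <- 76.1      # Internet users per 100 people

predicted <- intercept + slope * gdp_hungary   # y hat
residual  <- actual_hungary - predicted        # actual minus predicted

# A positive residual means the line underestimates the observed value.
print(c(predicted = predicted, residual = residual))
```
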
I think that a logistic regression, rather than a linear one, is the best fit. Below are the
exponential and logistic regressions for the data. For the exponential regression,
r2 = 0.42055. For the logistic regression, r2 = 0.78219. The logistic regression has a higher
r2 value than both the exponential (0.42055) and linear (0.62125) regressions, which means
that it explains a higher proportion of the variance in Internet Users/100 People.
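
One common way to compare models like these is to compute r2 = 1 - SSE/SST from each model's
predictions. The sketch below assumes that definition; the fitting helpers used in this report
may compute r2 differently. The data are made up (hypothetical), and the logistic curve uses
hand-picked parameters rather than fitted ones, so the printed values only illustrate the
calculation.

```r
# R^2 = 1 - SSE/SST, computed from any model's predictions.
r_squared <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)       # unexplained variation
  sst <- sum((actual - mean(actual))^2)    # total variation
  1 - sse / sst
}

# Hypothetical toy data, NOT the World Bank figures used in this report.
gdp   <- c(1000, 5000, 12000, 20000, 35000, 50000)
users <- c(15, 30, 55, 60, 80, 90)

# Linear model fitted by least squares.
linear_fit  <- lm(users ~ gdp)
linear_pred <- unname(fitted(linear_fit))

# Logistic curve with hand-picked (not fitted) parameters, for illustration only.
logistic_pred <- 95 / (1 + 5 * exp(-0.0002 * gdp))

print(c(linear   = r_squared(users, linear_pred),
        logistic = r_squared(users, logistic_pred)))
```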

Business analysts would use this type of analysis in their jobs. They would be identifying
trends in sales and performance as well as analyzing risks and consumer behavior in order to
maximize profits for a business. These analysts are the reason businesses are able to make
educated decisions about when and where to expand, which products to stop manufacturing,
and what products to introduce.

III.

IV.

GDP per Capita and Internet Users per 100 People have a moderate, positive, linear
relationship: when GDP per Capita rises, the number of Internet Users per 100 People is
predicted to rise. The two variables are nevertheless better represented by a logistic
regression than by a linear regression. From the logistic regression, we can see that about
78.2% (r2 = 0.78219) of the variance in Internet Users per 100 People can be explained by GDP
per Capita, which is high for real-life data. We cannot, however, conclude that changes in GDP
per Capita cause changes in Internet Users per 100 People, because there is the possibility of
a lurking variable.
Works Cited:
"World Development Indicators." World Databank. The World Bank, 2014. Web. 14 Nov. 2015.
<http://databank.worldbank.org/data/reports.aspx?Code=SL.UEM.TOTL.ZS&id=af3ce82b&report_name=Popular_indicators&populartype=series&ispopular=y#advancedDownloadOptions>.

V.

R Code:
> `Lee,M_Project2DataVersion2` <- read.csv("~/Lee,M_Project2DataVersion2.csv")
> lee<-`Lee,M_Project2DataVersion2`
> plot(lee$GDP.per.Capita, lee$Internet.Users, main = "GDP per Capita vs Internet Users",
xlab = 'GDP per Capita (Current US$)', ylab = 'Internet Users per 100 People')
> abline(lm(lee$Internet.Users~lee$GDP.per.Capita))
> boxplot(lee$GDP.per.Capita, horizontal = TRUE, xlab = 'GDP per Capita (Current US$)',
main = 'GDP per Capita Boxplot')
> boxplot(lee$Internet.Users, horizontal = TRUE, xlab = 'Internet Users per 100 people',
main = 'Internet Users Boxplot')
> linFit(lee$GDP.per.Capita, lee$Internet.Users)
Intercept = 33.30492
Slope = 0.00105
R-squared = 0.62125
> plot(lee$GDP.per.Capita, lee$Internet.Users - ((.00105*lee$GDP.per.Capita)+33.30492),
main = "Residual Plot of GDP per Capita vs Internet Users",
xlab = 'GDP per Capita (Current US$)', ylab = 'Residual (Internet Users per 100 People)')
> abline(a=0, b=0)
> graph<- lm(lee$Internet.Users ~ lee$GDP.per.Capita)
> resid(graph)
          1           2           3           4           5           6           7
-27.5983442  18.1699904  -6.1025075 -24.8546974   2.2902637   0.8982033  23.8079308
          8           9          10          11          12          13          14
  8.0045302 -20.3315690  -1.1063954 -24.5042024   3.0425488 -32.7454368   6.9661224
         15          16          17          18          19          20          21
  5.5304562   2.7799897 -15.9230839  -0.9726725  28.1662073 -16.9839702 -28.6677631
         22          23          24          25          26          27          28
 19.2101612   4.9892421   8.6774526  21.5636142  25.6195862  30.8118299   0.3305303
         29          30          31          32          33          34          35
-10.6500111 -18.6382238   5.6104141  16.5768926  -0.9779413   8.0608414 -44.4171408
         36          37          38          39          40          41          42
 10.2762265  23.7939519   4.9720930   8.8788214  -4.2126139 -28.2730906   6.8515709
         43          44          45          46          47          48          49
 10.3098131  -3.3878458   6.2127638  12.8355806  17.2742612 -17.8164462 -14.3479337

> plot(lee$GDP.per.Capita, resid(graph))
> cor(lee$GDP.per.Capita, lee$Internet.Users)
[1] 0.7881964
> expFit(lee$GDP.per.Capita, lee$Internet.Users)
a = 27.28568
b = 1.00002
R-squared = 0.42055
> logisticFit(lee$GDP.per.Capita, lee$Internet.Users)
Logistic Fit
C = 86.30719
a = 3.16746
b = 1.00014
R-squared = 0.78219
> lee2 <- lee[lee$Country != 'Qatar',]
> plot(lee2$GDP.per.Capita, lee2$Internet.Users,
main = "GDP per Capita vs Internet Users Without Outliers",
xlab = 'GDP per Capita (Current US$)', ylab = 'Internet Users per 100 People')
> abline(lm(lee2$Internet.Users~lee2$GDP.per.Capita))
> linFit(lee2$GDP.per.Capita, lee2$Internet.Users)
Intercept = 30.25639
Slope = 0.00128
R-squared = 0.68104
> sqrt(0.68104)
[1] 0.8252515
> 33.30492 + (0.00105 * 13902.7)
[1] 47.90275
> 76.1 - 47.90275
[1] 28.19725
