
Gianni Gorgoglione

Figure 1 Eps
Figure 2 Original data x & y
Simple Linear Regression in R
PART 1

1. Start R
2. ## Generate data
x<-c(1:100)
a<-15
b<-5
eps<-rnorm(length(x),0,10)
y<-a+b*x+eps
3.
plot(eps)
plot(x,y)











[Plot: eps vs. Index]

Figure 3 Original data and Linear Regression
4.
## Linear Model lm() with x as predictor and y as response
4.a
lm(y~x)
lm1<-lm(y~x)
4.b
## Original data
plot(x,y)
## Regression line
abline(lm1)













4.c
## Extract residuals from the dataset lm1
res<-(lm1$residuals)
## Standard deviation
sd(res)
## standard deviation = 10.29069
## mean = 0
plot(res)

Figure 4 Histogram of residuals of the linear regression

Yes. The mean of the residuals should be a number close to zero, and
the standard deviation should follow what we assigned before
in our rnorm distribution (sd = 10).


4.d
coef(lm1)
## a <- (Intercept) 13.12010
## b <- x            5.03321
In our original data we assigned a = 15 and b = 5, so these coefficients of the linear
regression are consistent with the values assigned when generating the data. The prediction
should match the original data.

5.
The assumptions of linear regression must be satisfied before applying it:
1) The relationship between x and y is linear.
2) The errors have mean zero and constant variance.
3) The residuals are independent of one another.
4) The errors are normally distributed, centred on the regression line (difficult to check).
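These assumptions can be checked informally with base R. The sketch below regenerates the data from step 2 (the set.seed call is our own addition, for reproducibility) and inspects the residuals:

```r
## Simulate data as in step 2, then check the assumptions informally
set.seed(1)                 # our own addition, for reproducibility
x   <- 1:100
eps <- rnorm(length(x), 0, 10)
y   <- 15 + 5*x + eps
fit <- lm(y ~ x)

mean(residuals(fit))        # assumption 2: ~0 (exactly 0 up to rounding for OLS with intercept)
sd(residuals(fit))          # close to the sd (10) used in rnorm
shapiro.test(residuals(fit))$p.value  # assumption 4: a large p-value gives no evidence against normality
```

plot(fit) would additionally draw the standard diagnostic plots (residuals vs fitted, normal Q-Q, scale-location).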







[Plot: Histogram of res]

6.
6.a
plot(x,y)
plot(res,y)

Yes: the predicted and observed values form a linear relationship, and the plot of the
residuals shows a mean of zero, with most values lying between -10 and 10, which reflects
the standard deviation we assigned.

6.b
y2 <- a + 0.1*b*x^2 + eps
lm(y2~x)->lm2
res2<-(lm2$residuals)
plot(x,y2)
plot(res2,y2)

plot(x,y2)
abline(lm2)


By plotting the variables and the linear regression it is not possible to observe a good fit
between the two functions.
[Plot: y2 vs. x with fitted regression line]



By plotting the histogram of the residuals it is not possible to see an even distribution; in
other words, the variance does not appear constant. This confirms that we are not in the
linear-regression setting: the second assumption shows that the linear model is not
appropriate for these non-linear data.


7.
7.a
plot(res,y)
[Plot: Histogram of res2]


The residuals become smaller and cluster closer to 0.
7.b
y3 <- a + b * x + rnorm( length ( x ) , 0 , 10 + 2 * x )
lm(y3~x)->lm3
res3<-(lm3$residuals)
plot(x,y3)
plot(res3,y3)

The residuals in this case show a non-constant variance: there is no constant magnitude of
dispersion across the values of x. In other words, there is heteroscedasticity.
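The growing spread can also be quantified without extra packages: regressing the squared residuals on x (a hand-rolled Breusch-Pagan-style check, our own addition) gives a clearly positive slope when the variance increases with x:

```r
set.seed(42)  # our own seed, for reproducibility
x  <- 1:100
y3 <- 15 + 5*x + rnorm(length(x), 0, 10 + 2*x)  # sd grows with x, as in 7.b
fit3 <- lm(y3 ~ x)

## If the variance were constant, the squared residuals would not depend on x
aux <- lm(I(residuals(fit3)^2) ~ x)
coef(aux)["x"]  # positive slope: the dispersion increases with x
```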

[Plot: y vs. res]
[Plot: y3 vs. res3]
[Plot: Histogram of res3]

[Figure: levelplot of the bird.clim brick: prec_DJF, prec_JJA, temp_DJF, temp_JJA, presence]
PART 2

1.

install.packages("sp")
library(sp)
install.packages("raster")
library(raster)
install.packages("rasterVis")
library(rasterVis)

2.
bird.clim.sp<-SpatialPixelsDataFrame(bird.clim[,8:7],bird.clim[,2:6])
bird.clim.br<-brick(bird.clim.sp)
levelplot(bird.clim.br)












3.
train<-sort(sample(1:nrow(bird.clim),size=nrow(bird.clim)/100*75))
## Randomly samples 75% of the row indices of bird.clim and sorts them

test<-which(! 1:dim(bird.clim)[1]%in%train)
## The rows that are not in train are assigned to test

train.data<-bird.clim[train,c(5,7:8)]
## The 75% of rows selected as training data, keeping only columns 5, 7 and 8
test.data<-bird.clim[test,c(5,7:8)]
## Creates test.data from the remaining rows, with the same columns 5, 7 and 8

4.

temp_jja.trend1<-lm(train.data$temp_JJA ~ train.data$RT90.O + train.data$RT90.N)


temp_jja.trend2<-lm(train.data$temp_JJA ~ train.data$RT90.O + train.data$RT90.N +
I(train.data$RT90.O*train.data$RT90.N) + I(train.data$RT90.O^2) + I(train.data$RT90.N^2))
## Note: squared terms must be wrapped in I(); a bare (...)^2 inside a formula is interpreted as a main effect
[Plot: Residuals vs Fitted, lm(train.data$temp_JJA ~ train.data$RT90.O + train.data$RT90.N)]






train.data$RT90.O->tx
train.data$RT90.N->ty
train.data$temp_JJA->tdjja
temp_jja.trend2<-lm(tdjja ~ tx+ty+I(tx*ty)+I(tx^2)+I(ty^2))
temp_jja.trend3<-lm(tdjja ~ tx+ty+I(tx*ty)+I(tx^2)+I(ty^2)+I(tx^3)+I(tx^2*ty)+I(tx*ty^2)+I(ty^3))
AIC(temp_jja.trend1, k=2)
AIC(temp_jja.trend2, k=2)
AIC(temp_jja.trend3, k=2)

Trend1 AIC = 1401.354
Trend2 AIC = 1379.886
Trend3 AIC= 1143.866


A trend surface with a higher-order polynomial usually gives a better representation of the
interpolated data. This is confirmed here by the AIC goodness-of-fit comparison: the smaller
the AIC, the better the model fits. In other words, the 3rd-order polynomial fits the
training data best.
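Spelling out every I() term is error-prone; poly() with raw = TRUE expands all monomials up to a given total degree. A sketch on simulated coordinates (not the bird.clim data; the names tx, ty and z here are our own):

```r
set.seed(7)  # simulated stand-in for the trend-surface data
d <- data.frame(tx = runif(200), ty = runif(200))
d$z <- 2 + d$tx + 0.5*d$ty^2 + rnorm(200, 0, 0.1)

## poly(tx, ty, degree = k, raw = TRUE) builds all monomial terms up to total degree k
t1 <- lm(z ~ poly(tx, ty, degree = 1, raw = TRUE), data = d)
t2 <- lm(z ~ poly(tx, ty, degree = 2, raw = TRUE), data = d)
t3 <- lm(z ~ poly(tx, ty, degree = 3, raw = TRUE), data = d)
sapply(list(t1, t2, t3), AIC)  # AIC drops sharply once the ty^2 term is captured
```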


[Plot: Residuals vs Fitted, second-order trend surface temp_jja.trend2]

5.
The function predict takes the fitted model and, from its coefficients, calculates the
predicted values for new data.

temp_jja.trend1.pred<-data.frame(test.data,predicted=predict(temp_jja.trend1,type='response',newdata=test.data[,2:3]))
## The same is done for temp_jja.trend2.pred and temp_jja.trend3.pred
temp.trend1 <- temp_jja.trend1.pred[1:179,1:4]
temp.trend2 <- temp_jja.trend2.pred[1:179,1:4]
temp.trend3 <- temp_jja.trend3.pred[1:179,1:4]

sum((temp.trend1$temp_JJA-temp.trend1$predicted)^2)/179->MSE1
## MSE1 = 6.958501
sum((temp.trend2$temp_JJA-temp.trend2$predicted)^2)/179->MSE2
## MSE2 = 6.666005
sum((temp.trend3$temp_JJA-temp.trend3$predicted)^2)/179->MSE3
## MSE3 = 6.731361
Table 1 MSE for cross-validation of 1st-, 2nd- and 3rd-order trend surfaces
MSE RESULTS
MSE1 6.958501
MSE2 6.666005
MSE3 6.731361

According to the mean squared error cross-validation, the 2nd-order trend surface fits the
model best, since its MSE is the lowest.
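The cross-validation criterion used above is simply the mean squared error on the held-out rows; as a small reusable helper (our own wrapper, not part of the exercise):

```r
## Mean squared error between observed and predicted values
mse <- function(observed, predicted) mean((observed - predicted)^2)

obs <- c(1, 2, 3, 4)
mse(obs, obs)      # 0: a perfect prediction
mse(obs, obs + 2)  # 4: a constant offset of 2, squared
```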








PART 3


1.
http://www.oikos.ekol.lu.se/appendixdown/snouterdata.txt
2.
snout<-read.table("K:/NGEN11_SPATIAL ANALYSIS/REGRESSION/snouterdata2.txt",sep="",dec=".",header=T)
head(snout)
3.
##extract snouter1.1 and rain, djungle, x and y
snout1.1<-snout[1:5]
##Linear model with snouter1.1 as response and djungle+rain as predictors
lm(snout1.1$snouter1.1~snout1.1$rain+snout1.1$djungle)->lm.snout
str(lm.snout)

lm.snout

Call:
lm(formula = snout1.1$snouter1.1 ~ snout1.1$rain + snout1.1$djungle)
Coefficients:
(Intercept) snout1.1$rain snout1.1$djungle
81.48315 -0.02184 0.04546

The linear model can identify the predictors. We obtained 81.48 for the intercept, which is
near the true value of 80, and -0.021 for rain, which is close to the -0.015 in our function.

4.
install.packages("pgirmess")
library(pgirmess)
snout.res<-(lm.snout$residuals)
coords<-cbind(snout1.1$X,snout1.1$Y)
correlog(coords,snout.res,method='Moran')->snout.corr
plot(snout.corr)

The assumption of independent residuals is violated when the distribution of the residuals
shows a spatial pattern. From the Moran's I values it is possible to observe that, for most
distance classes, the statistic is close to or below zero, which indicates the absence of a
pattern in the residual distribution. At larger lag distances Moran's I tends towards -1,
where it is rare to find a pattern.
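Moran's I can be computed directly from its definition, which makes the correlogram values easier to interpret: I = (n/S0) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2, where S0 is the sum of all weights. A base-R sketch on a toy chain of four points (not the snouter data; the function name moran_i is our own):

```r
## Moran's I from its definition; W is a (here binary) spatial weights matrix
moran_i <- function(x, W) {
  n   <- length(x)
  dev <- x - mean(x)
  num <- sum(W * outer(dev, dev))  # sum_ij w_ij * dev_i * dev_j
  (n / sum(W)) * num / sum(dev^2)
}

## Four points on a line with binary adjacency (each point neighbours the next)
x <- c(1, 2, 3, 4)
W <- matrix(0, 4, 4)
W[cbind(c(1, 2, 2, 3, 3, 4), c(2, 1, 3, 2, 4, 3))] <- 1
moran_i(x, W)  # 1/3: a smoothly increasing sequence is positively autocorrelated
```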





[Plot: Moran I statistic vs. distance classes, lm.snout residuals]


5.
5.A
install.packages("spdep")
library(spdep)
## Create neighbour lists for four distance bands
dnearneigh(coords, 0, 1, row.names = NULL, longlat = NULL)->w0
dnearneigh(coords, 1, 1.5, row.names = NULL, longlat = NULL)->w1
dnearneigh(coords, 1.5, 2, row.names = NULL, longlat = NULL)->w2
dnearneigh(coords, 2, 10, row.names = NULL, longlat = NULL)->w3
nb2listw(w0)
nb2listw(w1)
Characteristics of weights list object:
Neighbour list object:
Number of regions: 1108
Number of nonzero links: 4186
Percentage nonzero weights: 0.3409728
Average number of links: 3.777978

Weights style: W
Weights constants summary:
n nn S0 S1 S2
W 1108 1227664 1108 599.7917 4461.944
nb2listw(w2)
Characteristics of weights list object:
Neighbour list object:
Number of regions: 1108
Number of nonzero links: 4084
Percentage nonzero weights: 0.3326643
Average number of links: 3.685921

Weights style: W
Weights constants summary:
n nn S0 S1 S2
W 1108 1227664 1108 618.75 4472.778


nb2listw(w3)
Characteristics of weights list object:
Neighbour list object:
Number of regions: 1108
Number of nonzero links: 249036
Percentage nonzero weights: 20.28535
Average number of links: 224.7617

Weights style: W
Weights constants summary:
n nn S0 S1 S2
W 1108 1227664 1108 10.45354 4468.041

nb2listw(w0)->sar0
nb2listw(w1)->sar1
nb2listw(w2)->sar2
nb2listw(w3)->sar3
errorsarlm(snout1.1$snouter1.1~snout1.1$rain+snout1.1$djungle,listw=sar0)
Call:
errorsarlm(formula = snout1.1$snouter1.1 ~ snout1.1$rain + snout1.1$djungle,
listw = sar0)
Type: error

Coefficients:
lambda (Intercept) snout1.1$rain snout1.1$djungle
0.87922189 79.91106965 -0.01964262 0.02931540
Log likelihood: -3514.855
errorsarlm(snout1.1$snouter1.1~snout1.1$rain+snout1.1$djungle,listw=sar1)

Call:
errorsarlm(formula = snout1.1$snouter1.1 ~ snout1.1$rain + snout1.1$djungle,
listw = sar1)
Type: error

Coefficients:
lambda (Intercept) snout1.1$rain snout1.1$djungle
0.81297579 81.12541938 -0.02054332 -0.01565557

Log likelihood: -3702.271

errorsarlm(snout1.1$snouter1.1~snout1.1$rain+snout1.1$djungle,listw=sar2)


Call:
errorsarlm(formula = snout1.1$snouter1.1 ~ snout1.1$rain + snout1.1$djungle,
listw = sar2)
Type: error

Coefficients:
lambda (Intercept) snout1.1$rain snout1.1$djungle
0.73214746 80.38066616 -0.02022954 0.02246367

Log likelihood: -3853.828



errorsarlm(snout1.1$snouter1.1~snout1.1$rain+snout1.1$djungle,listw=sar3)

Call:
errorsarlm(formula = snout1.1$snouter1.1 ~ snout1.1$rain + snout1.1$djungle,
listw = sar3)
Type: error

Coefficients:
lambda (Intercept) snout1.1$rain snout1.1$djungle
0.98325221 101.73633693 -0.01773770 0.08609926

Log likelihood: -4034.266


5.B

##Create SAR LINEAR MODEL for each distance w1,w2,w3
errorsarlm(snout1.1$snouter1.1~snout1.1$rain+snout1.1$djungle,listw=sar0)->SARLM0
errorsarlm(snout1.1$snouter1.1~snout1.1$rain+snout1.1$djungle,listw=sar1)->SARLM1
errorsarlm(snout1.1$snouter1.1~snout1.1$rain+snout1.1$djungle,listw=sar2)->SARLM2
errorsarlm(snout1.1$snouter1.1~snout1.1$rain+snout1.1$djungle,listw=sar3)->SARLM3


AIC(SARLM0) ## 7039.71
AIC(SARLM1) ## 7414.543
AIC(SARLM2) ## 7717.656
AIC(SARLM3) ## 8078.533

The lowest AIC value is for SARLM0, where the distance band between neighbours is 0 to 1;
this is the best model.

5.C
correlog(coords,SARLM0$residuals,method=c('Moran'))->SARLM0CORR
correlog(coords,SARLM1$residuals,method=c('Moran'))->SARLM1CORR
correlog(coords,SARLM2$residuals,method=c('Moran'))->SARLM2CORR
correlog(coords,SARLM3$residuals,method=c('Moran'))->SARLM3CORR

plot(SARLM0CORR)
plot(SARLM1CORR)
plot(SARLM2CORR)
plot(SARLM3CORR)






Figure 1 SARLM0 correlogram d=0-1.0    Figure 2 SARLM1 correlogram d=1.0-1.5



Figure 3 SARLM2 correlogram d=1.5-2.0    Figure 4 SARLM3 correlogram d=2.0-10.0


Both the AIC results and the SAR correlograms show that the best model is the first one,
SARLM0. In the SARLM0 correlogram (Figure 1) the Moran's I values are very close to 0, which
depicts the absence of a spatial pattern. In SARLM3 (Figure 4) the residuals tend towards 1
at distances between 0 and 6, so there is a tendency to form a pattern; this would not be
acceptable under the independence assumption of linear regression.
