Student Information
Jessica A. Eddy S179533 Dashari Colon-Maldonado S1785494
Experiment Description
Fun description: We are a company named JaDa Bio Inc. that intends to achieve world
domination, but first we want to know in which of the four seasons (spring, summer, fall, and
winter) the highest number of microorganism species can be found, and in which of our two
chosen countries. The locations where we will start our takeover of the world (and thus where
we need to know the number of microbial species) are the United States of America and China,
because literature reviews suggest that the difference in microbes is greatest between these two
countries. There will be 5 locations within each country, chosen randomly by a number
generator constrained to the latitude and longitude ranges of each country. We send our
trusted, unpaid henchman Napoleon to take 5 samples in each country once per season using an
idiot-proof, step-by-step guide (for cost effectiveness, reduction of technician differences,
bias reduction, efficiency, and standardization).
Second, we must create numerical versions of each variable to represent the value
expectations for the number of microbial species. This way, we end up with two vectors.
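The definitions of the factors and their numerical counterparts are not shown in this report. The following is a minimal sketch of how they could be built, assuming 5 samples per season per country (40 observations). The level order and all numeric means are hypothetical placeholders, chosen only to be of the same order as the estimates reported later; they are not the authors' values.

```r
# Hypothetical reconstruction: names follow the report, values are assumed
Country <- factor(rep(c("USA", "China"), each = 20), levels = c("USA", "China"))
Season <- factor(rep(c("Spring", "Summer", "Fall", "Winter"), each = 5, times = 2),
                 levels = c("Spring", "Summer", "Fall", "Winter"))

# Assumed expected species counts per level (placeholders, not the original means)
country_means <- c(USA = 6900, China = 2300)
season_means <- c(Spring = 4200, Summer = 2200, Fall = 1800, Winter = 1050)

# Numerical versions: one expected value per observation
Country_N <- unname(country_means[as.character(Country)])
Season_N <- unname(season_means[as.character(Season)])

# Balanced design check: 5 observations in every Season x Country cell
print(table(Season, Country))
```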
Then, we visually check our empirical models.
plot(Country_N~Country)
plot(Season_N~Season)
Statistical Model
Now we add the variance to our working models and check each visually.
set.seed(362)
Residuals_S<-rnorm(40, 0, 256)
Species_S<-Season_N+Residuals_S
plot(Species_S~Season)
points(Species_S~Season)
set.seed(362)
Residuals_C<-rnorm(40, 0, 256)
Species_C<-Country_N+Residuals_C
plot(Species_C~Country)
points(Species_C~Country)
The plots look like what a real experiment could plausibly produce, so our models seem to be
working properly. Next, we check whether we can recover our means for each variable.
t.test(Species_C~Country)
##
## Welch Two Sample t-test
##
## data: Species_C by Country
## t = 66.969, df = 37.881, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4423.568 4699.374
## sample estimates:
## mean in group USA mean in group China
## 6872.182 2310.711
Upon inspection, the t-test estimates are very close to our population means.
one_anova<-lm(Species_S~Season-1)
summary(one_anova)
##
## Call:
## lm(formula = Species_S ~ Season - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -325.37 -163.47 3.07 139.94 521.55
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## SeasonSpring 4196.1 67.2 62.44 <2e-16 ***
## SeasonSummer 2180.7 67.2 32.45 <2e-16 ***
## SeasonFall 1810.0 67.2 26.93 <2e-16 ***
## SeasonWinter 1060.2 67.2 15.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 212.5 on 36 degrees of freedom
## Multiple R-squared: 0.994, Adjusted R-squared: 0.9933
## F-statistic: 1482 on 4 and 36 DF, p-value: < 2.2e-16
Upon inspection, we see that the estimated values of the means match our original means.
Model
Here we define our explanatory variables.
set.seed(362)
Country_Int<-rep(c(1,2), each=20)
Seasons_Int<-rep(c(1,2,3,4), each=5, times=2)
Now we make a dataframe to easily check if the countries and the seasons are distributed
properly.
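A minimal sketch of such a check, using the integer codes defined above; the data frame name `design` is introduced here for illustration only.

```r
# Integer codes as defined in the report
Country_Int <- rep(c(1, 2), each = 20)
Seasons_Int <- rep(c(1, 2, 3, 4), each = 5, times = 2)

# 'design' is a hypothetical name used only for this check
design <- data.frame(Country_Int, Seasons_Int)
print(head(design))

# Every Country x Season combination should contain exactly 5 samples
cell_counts <- table(design$Country_Int, design$Seasons_Int)
print(cell_counts)
```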
Then, we plot our model in order to determine the season in which the highest number of
species can be found, and in which country this occurs.
library(lattice)
Species<-Country_N+Country_Int*Season_N
xyplot(Species~Season|Country, type=c("p", "r"))
# shows the season with the highest number of species per country
Now, we need to add the residuals to our model in order to make our statistical model. We
will also look at different plots that best show our data.
set.seed(362)
Residuals_1<-rnorm(length(Season), 0, 300)
Species_1<-Country_N+Country_Int*Season_N+Residuals_1
xyplot(Species_1~Season|Country, type=c("p", "r"), xlab="Seasons",
ylab="Species", main="Microbial Species in USA and China Across the Seasons")
# nicely shows the relationship between the means of each season per country
The last plot, the interaction plot, was run to show that up to this moment, there is no clear
indication of interaction between our variables.
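The interaction plot itself is not reproduced in this report. A minimal sketch of how such a plot could be drawn with base R's `interaction.plot()`, on simulated data whose means are hypothetical placeholders rather than the authors' values:

```r
set.seed(362)
# Hypothetical design and means (assumed, not taken from the report)
Country <- factor(rep(c("USA", "China"), each = 20), levels = c("USA", "China"))
Season <- factor(rep(c("Spring", "Summer", "Fall", "Winter"), each = 5, times = 2),
                 levels = c("Spring", "Summer", "Fall", "Winter"))
Country_N <- unname(c(USA = 6900, China = 2300)[as.character(Country)])
Season_N <- unname(c(Spring = 4200, Summer = 2200, Fall = 1800,
                     Winter = 1050)[as.character(Season)])
Country_Int <- rep(c(1, 2), each = 20)

# Same structure as Species_1 in the report
Species_sim <- Country_N + Country_Int * Season_N + rnorm(length(Season), 0, 300)

# One line per country; non-parallel lines suggest an interaction
interaction.plot(x.factor = Season, trace.factor = Country, response = Species_sim,
                 xlab = "Season", ylab = "Mean number of species",
                 trace.label = "Country")

# The cell means that the plot displays
cell_means <- tapply(Species_sim, list(Season, Country), mean)
print(round(cell_means))
```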
Now we can finally make our full ANOVA model. Then we will proceed with data inspection
and model selection. Afterwards, we will run diagnostics on our chosen model.
anova<-aov(Species_1~Season+Country+Season:Country)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Season 3 123123873 41041291 641.80 < 2e-16 ***
## Country 1 52081052 52081052 814.44 < 2e-16 ***
## Season:Country 3 13265645 4421882 69.15 4.47e-14 ***
## Residuals 32 2046309 63947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(anova)
Before we run the diagnostics, we will make a model without the interaction term, even
though the summary of the anova model indicates significance of the interaction between
Country and Season.
anova1<-aov(Species_1~Season+Country)
summary(anova1)
## Df Sum Sq Mean Sq F value Pr(>F)
## Season 3 123123873 41041291 93.81 < 2e-16 ***
## Country 1 52081052 52081052 119.05 8.25e-13 ***
## Residuals 35 15311955 437484
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(anova1)
The plots show a worse distribution of the residuals than for the prior model.
Nevertheless, we will compare the models next to justify keeping the interaction term. In this
second model the p-value for Country (8.25e-13) is no longer below the smallest value that R
prints (2e-16), while the p-value for Season still is. A p-value is not a measure of effect
size, but the sums of squares point the same way: Season accounts for more of the variation in
the number of microbial species (about 1.23e8) than Country (about 5.2e7).
Model comparison using RSS and AIC values and model selection:
anova(anova, anova1)
## Analysis of Variance Table
##
## Model 1: Species_1 ~ Season + Country + Season:Country
## Model 2: Species_1 ~ Season + Country
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 32 2046309
## 2 35 15311955 -3 -13265645 69.149 4.47e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(anova, anova1)
## df AIC
## anova 9 565.2218
## anova1 6 639.7257
The output of the anova comparison shows that the RSS for our first model is much smaller
than for the second, and the p-value indicates that dropping the interaction term makes the fit
significantly worse. The negative value in the Df column (-3) is not a defect of the second
model: it is the difference in the number of parameters between the two models, because the
second model omits the three interaction terms. The AIC comparison also favours the first
model (565.2 versus 639.7), even though it estimates three more parameters and therefore
leaves fewer residual degrees of freedom (32 versus 35).
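Both comparisons can be recomputed by hand from the residual sums of squares printed above. The sketch below assumes n = 40 observations and uses the standard Gaussian log-likelihood form of the AIC for a linear model, with the parameter counts (9 and 6) taken from the `AIC()` output; the variable names are introduced here for illustration.

```r
n <- 40                      # observations (5 samples x 4 seasons x 2 countries)
rss_full <- 2046309          # Season + Country + Season:Country
rss_reduced <- 15311955      # Season + Country
df_full <- 32                # residual degrees of freedom, full model
df_reduced <- 35             # residual degrees of freedom, reduced model

# F statistic for the nested comparison
f_stat <- ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)
p_value <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

# AIC of a Gaussian linear model: -2 * logLik + 2 * k,
# where k counts the coefficients plus the residual variance
aic_from_rss <- function(rss, n, k) {
  n * log(rss / n) + n * (1 + log(2 * pi)) + 2 * k
}
aic_full <- aic_from_rss(rss_full, n, k = 9)
aic_reduced <- aic_from_rss(rss_reduced, n, k = 6)

print(c(F = f_stat, p = p_value))
print(c(full = aic_full, reduced = aic_reduced))
```

The results should agree with the tables above up to rounding: an F of about 69.15, and AIC values of about 565.2 and 639.7.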
In order to check the assumptions, we have to generate a full linear model with the interaction
term:
anova.model<-lm(Species_1~Season+Country+Season:Country)
Homogeneity of variance:
library(car)
## Warning: package 'car' was built under R version 3.2.5
residualPlots(anova.model)
## Warning in residualPlots.default(model, ...): No possible lack-of-fit
## tests
ncvTest(anova.model)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 3.679056 Df = 1 p = 0.05510005
The residual plots show that the variance is fairly homogeneous along the line, and the
p-value from ncvTest (0.055) is above 0.05, although only narrowly, so we cannot reject the
null hypothesis of constant variance. Taken together, these suggest that the assumption of
homogeneity of variance is met.
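The p-value reported by `ncvTest` is the upper tail of a chi-square distribution with 1 degree of freedom, so it can be checked directly from the printed statistic. A small sketch, using only the numbers shown above (the variable names are illustrative):

```r
chisq_stat <- 3.679056   # Chisquare value from the ncvTest output
p_ncv <- pchisq(chisq_stat, df = 1, lower.tail = FALSE)
print(p_ncv)

# Decision at the conventional 5% level
reject_constant_variance <- p_ncv < 0.05
print(reject_constant_variance)
```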
qqPlot(anova.model$residuals)
shapiro.test(anova.model$residuals)
##
## Shapiro-Wilk normality test
##
## data: anova.model$residuals
## W = 0.95667, p-value = 0.1287
The qqPlot shows that the residuals follow a normal distribution, with no visual indication of
extreme points or divergence from normality. The Shapiro-Wilk test gives a p-value of 0.13,
so we cannot reject the null hypothesis that the residuals are normally distributed. Taken
together, these suggest that the assumption of normally distributed residuals is met.
outlierTest(anova.model)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 22 3.182989 0.0033078 0.13231
The influence index plots show that only one of the points gets close to, but doesn't
reach, 0.5 in Cook's distance, which makes it slightly suspicious even if it is not clearly
an outlier. The Studentized residuals panel shows an overall even distribution of the residuals
along the line. The only point that gets close to, but doesn't reach, a Bonferroni p-value of 0
is the same one as above, so visually it again looks like a possible outlier, though not
numerically. Then we check the hat values, which show that all the points have similar
influence on the final results, but that the suspicious point we've been tracking has a
comparatively lower hat value. Therefore, even if it were clearly an outlier (which it
isn't), it would not really affect the final results if we keep it in. The outlierTest gives a
Bonferroni-adjusted p-value of 0.13 for the largest Studentized residual (observation 22),
which is over 0.05, so there is no numerical indication of outliers.
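The two p-values in the `outlierTest` output are linked by simple arithmetic: the unadjusted p-value is a two-sided t-test on the Studentized residual, and the Bonferroni p-value multiplies it by the number of observations. A sketch using the printed numbers, assuming n = 40 observations and 31 degrees of freedom (the 32 residual degrees of freedom of the model minus one):

```r
rstudent_max <- 3.182989   # largest |rstudent| from the output (observation 22)
n_obs <- 40                # number of observations tested
df_t <- 32 - 1             # residual df of the model minus one

# Two-sided p-value for the largest Studentized residual
p_unadjusted <- 2 * pt(abs(rstudent_max), df = df_t, lower.tail = FALSE)

# Bonferroni adjustment: multiply by the number of tests, capped at 1
p_bonferroni <- min(1, n_obs * p_unadjusted)

print(c(unadjusted = p_unadjusted, bonferroni = p_bonferroni))
```

Both values should be close to those printed by `outlierTest` above.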
Now, for the final inspection of the model:
summary.aov(anova.model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Season 3 123123873 41041291 641.80 < 2e-16 ***
## Country 1 52081052 52081052 814.44 < 2e-16 ***
## Season:Country 3 13265645 4421882 69.15 4.47e-14 ***
## Residuals 32 2046309 63947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output shows how both variables and the interaction terms are significant for explaining
the results of our data.
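Next, we run a post-hoc test to check whether 0 lies within the confidence interval of each
pairwise difference. When 0 is outside the interval, the difference between the two group
means is significant.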
(Post.Hoc<-TukeyHSD(aov(Species_1~Season+Country+Season:Country)))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Species_1 ~ Season + Country + Season:Country)
##
## $Season
## diff lwr upr p adj
## Summer-Spring -3031.9720 -3338.3747 -2725.5693 0.0e+00
## Fall-Spring -3628.4751 -3934.8778 -3322.0723 0.0e+00
## Winter-Spring -4734.3452 -5040.7479 -4427.9424 0.0e+00
## Fall-Summer -596.5031 -902.9058 -290.1003 5.1e-05
## Winter-Summer -1702.3732 -2008.7759 -1395.9704 0.0e+00
## Winter-Fall -1105.8701 -1412.2729 -799.4674 0.0e+00
##
## $Country
## diff lwr upr p adj
## China-USA -2282.127 -2445.015 -2119.24 0
##
## $`Season:Country`
## diff lwr upr p adj
## Summer:USA-Spring:USA -2021.6009 -2539.6747 -1503.5271 0.0000000
## Fall:USA-Spring:USA -2306.8172 -2824.8910 -1788.7434 0.0000000
## Winter:USA-Spring:USA -3252.9931 -3771.0669 -2734.9193 0.0000000
## Spring:China-Spring:USA -375.4368 -893.5106 142.6370 0.3006021
## Summer:China-Spring:USA -4417.7799 -4935.8537 -3899.7061 0.0000000
## Fall:China-Spring:USA -5325.5698 -5843.6436 -4807.4960 0.0000000
## Winter:China-Spring:USA -6591.1341 -7109.2079 -6073.0603 0.0000000
## Fall:USA-Summer:USA -285.2163 -803.2900 232.8575 0.6351262
## Winter:USA-Summer:USA -1231.3922 -1749.4659 -713.3184 0.0000002
## Spring:China-Summer:USA 1646.1641 1128.0903 2164.2379 0.0000000
## Summer:China-Summer:USA -2396.1790 -2914.2527 -1878.1052 0.0000000
## Fall:China-Summer:USA -3303.9688 -3822.0426 -2785.8951 0.0000000
## Winter:China-Summer:USA -4569.5332 -5087.6070 -4051.4594 0.0000000
## Winter:USA-Fall:USA -946.1759 -1464.2497 -428.1021 0.0000350
## Spring:China-Fall:USA 1931.3804 1413.3066 2449.4542 0.0000000
## Summer:China-Fall:USA -2110.9627 -2629.0365 -1592.8889 0.0000000
## Fall:China-Fall:USA -3018.7526 -3536.8264 -2500.6788 0.0000000
## Winter:China-Fall:USA -4284.3169 -4802.3907 -3766.2431 0.0000000
## Spring:China-Winter:USA 2877.5563 2359.4825 3395.6301 0.0000000
## Summer:China-Winter:USA -1164.7868 -1682.8606 -646.7130 0.0000007
## Fall:China-Winter:USA -2072.5767 -2590.6505 -1554.5029 0.0000000
## Winter:China-Winter:USA -3338.1410 -3856.2148 -2820.0672 0.0000000
## Summer:China-Spring:China -4042.3431 -4560.4169 -3524.2693 0.0000000
## Fall:China-Spring:China -4950.1330 -5468.2068 -4432.0592 0.0000000
## Winter:China-Spring:China -6215.6973 -6733.7711 -5697.6235 0.0000000
## Fall:China-Summer:China -907.7899 -1425.8637 -389.7161 0.0000694
## Winter:China-Summer:China -2173.3542 -2691.4280 -1655.2804 0.0000000
## Winter:China-Fall:China -1265.5643 -1783.6381 -747.4905 0.0000001
plot(Post.Hoc)
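The output shows that in most cases the difference in means is significant and the
confidence interval doesn't contain 0. The two exceptions are Spring:China versus Spring:USA
(p = 0.30) and Fall:USA versus Summer:USA (p = 0.64). This supports keeping both factors and
their interaction in the model.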
LETTERS[1:5]
## [1] "A" "B" "C" "D" "E"
Locations<-factor(rep(c(LETTERS[1:10]), times=4))
# Location for random effect factor, labeled by letters
Locations
## [1] A B C D E F G H I J A B C D E F G H I J A B C D E F G H I J A B C D E
## [36] F G H I J
## Levels: A B C D E F G H I J
set.seed(362)
Random_EF<-rep(rnorm(10, 0, 256), each=4) #Random effect factor
Random_EF
## [1] 144.42256 144.42256 144.42256 144.42256 -38.69105 -38.69105
## [7] -38.69105 -38.69105 -171.77653 -171.77653 -171.77653 -171.77653
## [13] -216.21228 -216.21228 -216.21228 -216.21228 206.39975 206.39975
## [19] 206.39975 206.39975 -25.44847 -25.44847 -25.44847 -25.44847
## [25] -146.71013 -146.71013 -146.71013 -146.71013 -262.36423 -262.36423
## [31] -262.36423 -262.36423 182.05818 182.05818 182.05818 182.05818
## [37] 264.92312 264.92312 264.92312 264.92312
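As written, `Random_EF` repeats each draw four times in a row (`each=4`), while `Locations` cycles through A to J four times (`times=4`), so a given location does not receive the same random effect in all of its observations. A minimal sketch of one way to line the two up, by indexing the draws with the location label; the names `location_effect`, `Random_EF_aligned` and `Random_EF_each` are introduced here for illustration, and this is not the code that produced the output in this report:

```r
set.seed(362)
# Location labels as in the report: A..J repeated four times
Locations <- factor(rep(LETTERS[1:10], times = 4))

# One draw per location, same distribution as the report's Random_EF
location_effect <- rnorm(nlevels(Locations), mean = 0, sd = 256)
names(location_effect) <- levels(Locations)

# Index the draws by label so every observation gets its own location's effect
Random_EF_aligned <- unname(location_effect[as.character(Locations)])

# For comparison, the each = 4 layout used in the report
Random_EF_each <- unname(rep(location_effect, each = 4))

# Count distinct effect values per location: 1 means consistent
count_distinct <- function(x) length(unique(x))
distinct_aligned <- tapply(Random_EF_aligned, Locations, count_distinct)
distinct_each <- tapply(Random_EF_each, Locations, count_distinct)

print(rbind(aligned = distinct_aligned, each4 = distinct_each))
```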
First we add the random effect to the intercept. Like the residuals, a random effect has a
mean of 0 and some standard deviation, so every location gets its own deviation from the
country mean. We plot the data to visually check that the random effect factor has been
incorporated correctly.
Intercept<-(Country_N+Random_EF)
Slope<-(Country_Int)
Residuals_2<-rnorm(length(Season_N), 0, 256)
Species_2<-Intercept+Slope*Season_N+Residuals_2
xyplot(Species_2~Season_N|Country, type=c("p", "r"), groups=Locations,
main="Microbial Species in Different Seasons")
The plot output was as expected, where the slopes seem to be quite unaffected while the
intercept of each location is different.
library(lme4)
## Warning: package 'lme4' was built under R version 3.2.5
## Loading required package: Matrix
MModel1<-lmer(Species_2~Country*Season_N+(1+Season_N|Locations))
## Warning: Some predictor variables are on very different scales: consider
## rescaling
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 1.11185
## (tol = 0.002, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?;Model is nearly unidentifiable: large eigenvalue
## ratio
## - Rescale variables?
summary(MModel1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Species_2 ~ Country * Season_N + (1 + Season_N | Locations)
##
## REML criterion at convergence: 562.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.9904 -0.7119 -0.1293 0.6226 2.0829
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Locations (Intercept) 1.225e+05 349.9772
## Season_N 1.258e-02 0.1122 -1.00
## Residual 9.716e+04 311.7090
## Number of obs: 40, groups: Locations, 10
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 6.584e+03 1.865e+02 35.30
## CountryChina -4.068e+03 2.112e+02 -19.26
## Season_N 1.101e+00 6.793e-02 16.21
## CountryChina:Season_N 8.202e-01 8.193e-02 10.01
##
## Correlation of Fixed Effects:
## (Intr) CntryC Sesn_N
## CountryChin -0.566
## Season_N -0.912 0.533
## CntryCh:S_N 0.501 -0.884 -0.603
## fit warnings:
## Some predictor variables are on very different scales: consider
## rescaling
## convergence code: 0
## Model failed to converge with max|grad| = 1.11185 (tol = 0.002,
## component 1)
## Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
## Model is nearly unidentifiable: large eigenvalue ratio
## - Rescale variables?
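In the output examination, we see that the random effect factor does affect the intercept
(a standard deviation of about 350 between locations). We also check our estimates in the
fixed effects and can see that they resemble the original means for USA and China. However,
the model failed to converge, and the random slope for Season_N has a variance close to 0 and
a correlation of -1.00 with the intercept, so the random-slope term is not supported by the
data and these estimates should be read with caution.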
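The warnings point to two possible remedies: rescaling the numeric predictor and simplifying the random-effects structure. The following is a minimal sketch of both, not the model used in this report. It simulates its own data with hypothetical means and a hypothetical location layout (locations A to E in the USA, F to J in China), standardises the season predictor, and fits a random intercept only. The fit is wrapped in `requireNamespace()` so the block still runs where lme4 is not installed.

```r
set.seed(362)
season_levels <- c("Spring", "Summer", "Fall", "Winter")

# Hypothetical layout: each of 10 locations is sampled once per season
sim <- expand.grid(Season = factor(season_levels, levels = season_levels),
                   Locations = factor(LETTERS[1:10]))
sim$Country <- factor(ifelse(sim$Locations %in% LETTERS[1:5], "USA", "China"),
                      levels = c("USA", "China"))

# Assumed expected values (placeholders, not the report's means)
season_means <- c(Spring = 4200, Summer = 2200, Fall = 1800, Winter = 1050)
country_means <- c(USA = 6900, China = 2300)
sim$Season_N <- unname(season_means[as.character(sim$Season)])
Country_N <- unname(country_means[as.character(sim$Country)])
Country_Int <- as.integer(sim$Country)            # 1 = USA, 2 = China

# One random intercept per location, indexed by the location label
location_effect <- rnorm(10, 0, 256)
names(location_effect) <- LETTERS[1:10]
Random_EF <- unname(location_effect[as.character(sim$Locations)])

sim$Species <- Country_N + Random_EF + Country_Int * sim$Season_N +
  rnorm(nrow(sim), 0, 256)

# Standardise the numeric predictor so it is on a unit scale
sim$Season_z <- as.numeric(scale(sim$Season_N))

if (requireNamespace("lme4", quietly = TRUE)) {
  # Random intercept only: no random slope for the season predictor
  fit <- lme4::lmer(Species ~ Country * Season_z + (1 | Locations), data = sim)
  print(lme4::VarCorr(fit))
  print(lme4::fixef(fit))
}
```

Whether a random intercept alone is adequate would still need to be checked against the real data; the sketch only shows the mechanics.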
We can run the diagnostics on this model. First, we check homogeneity of variance:
plot(residuals(MModel1)~fitted(MModel1))
abline(h=0)
library(car)
qqPlot(residuals(MModel1))
library(lattice)
dotplot(ranef(MModel1, condVar = TRUE))
## $Locations
We can see that the Locations only had an effect on the Intercept, as we wanted.
We can now choose our best model between the ANOVA and the mixed model by running an
AIC test.
AIC(anova.model, MModel1)
## df AIC
## anova.model 9 565.2218
## MModel1 8 578.5030
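The output shows that the ANOVA model has the lower AIC (565.2 versus 578.5), so we select
it as the best model for our data. Two caveats apply: the ANOVA was fitted to Species_1 and
the mixed model to Species_2, and the mixed model was fitted by REML, so the two AIC values
are not strictly comparable. The difference could also be due to the fact that the ANOVA
doesn't include the random effect factor found in our mixed model.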
Conclusions
Our best model is the ANOVA including the interaction term. It looks like we need to start
taking over the world from the United States during spring, because spring has the highest
overall number of species and the United States has more species than China.