
Statistics with R —

a practical guide for beginners

Uppsala University

Fall 2014

Table of Contents
INTRODUCTION
  WHO SHOULD USE THIS SCRIPT?
  HOW TO USE THIS SCRIPT
1 GETTING STARTED WITH STATISTICS
  1.1 BASIC STATISTICAL CONCEPTS
  1.2 DESCRIPTIVE STATISTICS
  1.3 EXERCISES: GETTING STARTED WITH STATISTICS
2 GETTING STARTED WITH R
  2.1 WHAT IS R?
  2.2 DOWNLOADING AND INSTALLING R
  2.3 HOW TO WORK WITH R
  2.4 HANDLING DATA
  2.5 DEALING WITH MISSING VALUES
  2.6 UNDERSTANDING HELP() FUNCTIONS
  2.7 EXERCISES
  2.8 WEB RESOURCES AND BOOKS ON R
3 BASIC STATISTICS WITH R
  3.1 TYPES OF DATA
  3.2 EXPLORING DATA WITH TABLES
  3.3 EXPLORING DATA GRAPHICALLY
  3.4 DESCRIPTIVE STATISTICS
  3.5 COMPARING TWO GROUPS OF MEASUREMENTS
  3.6 USING T-TESTS WITH R
  3.7 NON-PARAMETRIC ALTERNATIVES
  3.8 CORRELATION ANALYSIS
  3.9 CROSS-TABULATION AND THE χ² TEST
  3.10 SUMMARY
  3.11 EXERCISES
4 LINEAR MODELS
  4.1 OVERVIEW - IN THIS SECTION YOU WILL
  4.2 CLASSES OF LINEAR MODELS
  4.3 WORKFLOW FOR LINEAR MODELS
  4.4 DEFINING THE MODEL
  4.5 ANALYZING AND INTERPRETING THE MODEL
  4.6 WORKED EXAMPLES
  4.7 SUMMARY
  4.8 EXERCISES
5 BASIC GRAPHS WITH R
  5.1 BAR-PLOTS
  5.2 GROUPED SCATTER PLOT WITH REGRESSION LINES
  5.3 EXERCISES
6 LOGISTIC REGRESSION
  6.1 GOALS
  6.2 HOW TO DO IT
  6.3 SUMMARY
  6.4 EXERCISES
7 R PROGRAMMING STRUCTURES
  7.1 FLOW CONTROL
  7.2 WRITE YOUR OWN FUNCTION
  7.3 SUMMARY
8 APPENDIX: CODE REFERENCES
Introduction

Welcome to our statistics with R script!

Who should use this script?

This introduction is directed at advanced BSc and beginning MSc students in Biology at Uppsala University and may, of course, also be helpful for anyone interested. We have kept this script concise and applied in order to allow people to conduct their own statistical analyses in R after a short time. Note that this "quick-start" guide does not replace a full-fledged course. However, we hope that successfully using R for statistical analyses with the help of this script will generate interest in learning more about statistics and R!

You may find this script helpful if you are:

 an incoming Master student at Uppsala University with little or no previous education in either Statistics or R
 a student who wishes to freshen up knowledge in statistics and/or R for courses, project work or research (Master or Doctoral thesis)
 a prospective student who wants to check out the level and content of statistics with R at Uppsala University
 anyone interested in a quick-start guide to beginner-level statistics with R

How to use this script

We wrote this script for flexible use, such that you can direct your attention to the parts that you
want to focus on, given your background and current interest.

You will get the most use out of this enhanced .pdf file if you read it electronically using a pdf reader that provides a content sidebar and allows hyperlinks as well as attachments. Recent versions of the free program Adobe Reader for Macintosh and for PC have these functions (do not use Preview). The script contains internal links and links to webpages. You can navigate between sections and subsections using the bookmarks pane (Figure 0-0-1). Some solutions and data files are provided as attachments to the .pdf document that are accessible through the attachment pane (click the paperclip symbol). Note that Adobe Reader also allows you to add notes, highlight text and set bookmarks on your own.

Figure 0-0-1 Screenshot of this script in Acrobat Reader. Use bookmarks to navigate between sections and subsections. The attachment pane (paperclip symbol) will display a list of included files (datasets and exercise solutions). You can also insert your own marks, such as the yellow “1996”, as well as comments using Adobe Reader's tools.

Please browse the overview and summary sections that are present in all chapters in order to find out just where you need to start reading. All sections have exercises with solutions so that you can practice and test your knowledge.

We hope that you will find our script useful and fun!

August 2014,

Sophie Karrenberg, Andres Cortes, Elodie Chapurlat, Xiaodong Liu, Matthew Tye

1 Getting started with statistics

1.1 Basic statistical concepts

Goals:
 learn about why and how statistics are used in Biology
 be introduced to basic statistical concepts such as distributions and probabilities
 become familiar with the normal distribution
 get an idea about how statistical tests work

Why do we need (so much) statistics in biology?


Organisms that biologists study are influenced by a multitude of factors including the genetic
makeup of the organisms, their developmental stage and the environmental conditions at various
scales ranging from microscopic to world-wide. This brings about the need to detect the most
important factors and to isolate certain factors experimentally for further analysis. In biology, it is
also usually impossible to work on all the units of the group or species that you are interested in.
Instead, biologists often have to work on a subset of units taken at random and make inferences
from this subset. The whole group of units is called a “population”, while the subset is a
“sample” (Figure 1-1 Population and sample).

The problem is that all these units are often different, even though they belong to the same
population. By chance, your random sample may not be very representative of the population.
Thus, even two samples taken from two similar populations may differ greatly, just by chance. It
is also possible that two samples taken from two very different populations may be very similar,
misleading you to conclude that the two populations are similar. Also, the natural variation
among units within your samples may obscure the effect of an experimental treatment. Thus,
working with samples means that we have to deal with all these uncertainties in some way. If
there is really a difference between the samples, you need to know what differences you can
expect by chance, and how to deal with the variation within samples. Statistics help you to make
these decisions. In other words, statistical tests are methods to use samples to make inferences
about the populations.

Figure 1-1 Population and sample

Biological questions such as which genes affect certain traits or how climate change affects the biosphere can only be solved using statistical analyses on massive datasets. But even comparatively small questions, for example to what extent men are taller than women, are in need of statistical treatment. Thus, as soon as you formulate a study question, you should start thinking about statistics.

Statistical analyses have a central place in biological studies and in many other sciences (Figure 1-2 The role of statistical analyses in the biological sciences):

Figure 1-2 The role of statistical analyses in the biological sciences

Hypothesis testing
Many statistical tests evaluate a strict null hypothesis H0 against an alternative hypothesis HA.
For example:

H0: mean dispersal distance does NOT differ between male and female butterflies

HA: mean dispersal distance differs between male and female butterflies

Testing these hypotheses involves test statistics, distributions and probabilities.

For this reason, the next parts of this lesson first review the concepts of distributions and probability, after which we come back to statistical testing.

Distributions
A distribution refers to how often different values occur in a set of data. In the graph below you see a common representation of a distribution: a histogram (Figure 1-3 Histogram of normally distributed data). In histograms, the horizontal x-axis represents the values occurring in the data, separated into groups (columns), and the y-axis shows how many values fall into each group. The vertical y-axis of a histogram can also be given as the percentage of the data that each group represents.

[Figure: histogram; x-axis “Values” (2 to 8), y-axis “Frequency” (0 to 30)]

Figure 1-3 Histogram of normally distributed data

Probability and probability density functions


Probability refers to how likely an event is. For example, when you throw a coin it is equally likely that it lands on heads or tails. The probability of a coin landing on heads is 0.5.

Nonetheless, for a single throw of the coin you cannot predict how it will land! If you, however, throw the coin very many times, you expect it to land on heads about half of the time, corresponding to the probability of 0.5.
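If you are curious, you can try this out in R after working through section 2. A minimal sketch, using the base R function rbinom() (which simulates such coin throws but is not otherwise covered in this script):

throws <- rbinom(10000, size = 1, prob = 0.5)  # 10000 throws, 1 = heads, 0 = tails
mean(throws)  # proportion of heads, expected to be close to 0.5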

The coin example concerned an outcome with two categories, heads and tails. For continuous (measurement) values, probability density functions can be derived (Figure 1-4 Probability density for a standard normal distribution). Note the similarity in shape to the histogram above (Figure 1-3 Histogram of normally distributed data). For each value on the x-axis, the y-axis displays the value of the probability density function for that value. The value -1, for example, has a probability density of 0.242, i.e. it is expected to occur in about 24 of 100 cases. Probability density functions of test statistics are used for the evaluation of statistical tests.
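In R, densities of the standard normal distribution are given by the base function dnorm(); the value 0.242 used above can be reproduced like this:

dnorm(-1)

[1] 0.2419707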

[Figure: probability density curve; x-axis “Standard deviations” (-5 to 5), y-axis “Density” (0 to 0.4)]

Figure 1-4 Probability density for a standard normal distribution

The normal distribution


The normal distribution, also referred to as the bell curve, was described by Gauss and others in the early 19th century, providing the basis for many statistical analyses in use today. The parameters of the normal distribution are the mean, corresponding to the center of the distribution, and the standard deviation, referring to the spread of the distribution. Below you see normal distributions with different means and standard deviations (Figure 1-5):

Figure 1-5 Probability density for normal distributions with various means and standard
deviations (sd)

Mean and standard deviation of the normal distribution are intricately linked to how common
values are. In fact, the probability of obtaining values in a certain range corresponds to the area
under the curve in this range. The entire area under the curve sums to 1.

Values within one standard deviation of the mean represent 34.1% of the data on either side (pink), 13.6% of the values occur between one and two standard deviations from the mean on either side (yellow), 2.1% of the values occur between two and three standard deviations from the mean (green) and 0.1% of the data occur beyond three standard deviations from the mean on either side (white, Figure 1-6).
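Such areas under the normal curve can also be computed in R with the base function pnorm(), which returns the area to the left of a given value. For example, the area within one standard deviation of the mean, corresponding to the two pink 34.1% areas in Figure 1-6:

pnorm(1) - pnorm(-1)

[1] 0.6826895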

Figure 1-6 Standard normal distribution with percentage of values occurring in a certain
range indicated

How statistical tests work


Let us now come back to statistical hypothesis testing and our example:

H0: mean dispersal distance does NOT differ between male and female butterflies

HA: mean dispersal distance differs between male and female butterflies

Note that by simply saying that dispersal distance differs, we imply that female dispersal distance
could be either higher or lower than male dispersal distance. Because one group can differ from
the other in either direction this is called a two-sided hypothesis and is followed by a two-tailed
test. We could also phrase the alternative hypothesis in either of the following ways:

HA: female butterfly dispersal distance is longer than that of male butterflies.

HA: female butterfly dispersal distance is shorter than that of male butterflies.

Here we hypothesize that the female butterflies differ from the male butterflies in a specific direction. This is referred to as a one-sided hypothesis and is followed by a one-tailed test.

To evaluate either one- or two-sided hypotheses, statistical tests calculate a test statistic from the data to find out how likely the obtained result would be under the null hypothesis. To do so, a probability distribution of the test statistic is theoretically derived assuming the null hypothesis. The probability of obtaining a test statistic at least as extreme as the one calculated from the data, given that the null hypothesis is true, is then found using this theoretical distribution. This probability is termed the P-value. Common statistical tests usually have an outcome of "significant", meaning that the alternative hypothesis is accepted, or "not significant", meaning that the alternative hypothesis is discarded and the null hypothesis accepted.

How is this decision made?

If the test statistic calculated from the data happens to be a value that is very rare under the null hypothesis, usually occurring with a probability of less than 5% (P-value < 0.05), the null hypothesis is discarded and the alternative hypothesis accepted. If the test statistic happens to have a commonly occurring value under the null hypothesis, the alternative hypothesis is discarded instead. This is illustrated for one- and two-tailed tests in Figure 1-7. Importantly, all statistical tests make assumptions about the data and are only valid if these are met. You will come across these assumptions in the section Basic analyses with R.
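As a minimal sketch of this logic, assume a hypothetical test whose statistic follows a standard normal distribution under the null hypothesis (a z-test); the P-values for an observed statistic of 2.1 could then be computed in R as:

z <- 2.1                # hypothetical observed test statistic
2 * pnorm(-abs(z))      # two-tailed P-value

[1] 0.03572884

pnorm(-abs(z))          # one-tailed P-value

[1] 0.01786442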

Figure 1-7 Illustration of significance (P< 0.05) ranges in one- and two-tailed tests.

Test outcomes, error types, significance and power
A common statistical test can have four potential outcomes: two are correct and two are false (Table 1-1).

Table 1-1 Possible outcomes of statistical tests with a significance level of 0.05.

Test result                   | Reality: H0 true              | Reality: HA true
------------------------------|-------------------------------|-------------------------------
P-value >= 0.05:              | Correct!                      | Type II error (false negative)
H0 accepted, HA discarded     |                               |
------------------------------|-------------------------------|-------------------------------
P-value < 0.05:               | Type I error (false positive) | Correct!
H0 discarded, HA accepted     |                               |

Note that the significance level corresponds to the accepted probability of a Type I error (false positive), i.e. of accepting the alternative hypothesis when it is not true. The significance level is commonly set to 0.05 in biological studies, and P < 0.01 or P < 0.001 are regarded as highly significant.

Importantly, the choice of significance level has direct implications for the two error types.

Let's look at this further. The two probability density curves in the graph below represent theoretical normal distributions of measurements for our example of dispersal distance: female (black) and male (red) butterflies. When the significance level is set to 0.05 (P-values < 0.05 taken as significant, upper graph, Figure 1-8), the black areas under the black curve for females represent the type I error, i.e. erroneously accepting the alternative hypothesis. The same cut-off applies to the red curve for males: here the area in red represents the type II error, erroneously accepting the null hypothesis when the alternative hypothesis is true.

If the significance level, and thus the type I error, is decreased to 0.01 (lower graph), the type II error inevitably increases! This means that even though you can be more certain that the alternative hypothesis is true when you do accept it, you also have to live with a higher chance of missing cases where the alternative hypothesis is true.

[Figure: two probability density plots, panel titles “Significance level = 0.05” and “Significance level = 0.01”; x-axis “Values” (-4 to 8), y-axis “Density” (0 to 0.5)]

Figure 1-8 Trade-off between type I and type II errors, for the example of female (black)
vs. male (red) dispersal distance in butterflies.

Summary

 Distributions can be displayed as histograms and show how often different values (or classes of values) occur.
 Probabilities express how likely events or outcomes are. Probability density functions show how likely it is to obtain values under a certain distribution.
 The normal distribution is fundamentally linked to many common statistical tests. Normal distributions are described by their mean and standard deviation.
 Statistical tests use samples to make inferences about larger populations and generally evaluate a null hypothesis (usually no difference) against an alternative hypothesis (a difference). They do so by comparing a test statistic calculated from the data against a theoretical distribution of this statistic under the null hypothesis.
 The significance level used in statistical testing is related to both type I errors (false positives) and type II errors (false negatives).

1.2 Descriptive statistics

Goals
In this section you will learn how to describe your data in terms of

 range, quartiles and median


 mean and standard deviation
 standard error of the mean

Range, median and quartiles


Once you obtain data you often wish to gain an overview before you start conducting analyses.

One of the most basic measures of data series is their range. The range refers to the interval
between the smallest value, the minimum, and the largest value, the maximum.

Indeed, looking at the range is highly recommended as it allows you to conduct a first check of
the data: are the values actually in the expected (or reasonable) range?

In addition to the range, it is also a good idea to calculate the median, the value that is exactly in the middle of the data: 50% of the values are larger than the median and the other 50% are smaller than the median. Whether the median is more or less in the middle of the range will show you whether your data is distributed symmetrically. For example, data with a range of 1 to 10 and a median of 2 is NOT symmetrically distributed. The data in the illustration below is symmetrically distributed (Figure 1-9).

Figure 1-9 Median and quartiles on a histogram of a normal distribution

Similar to the median, you can also calculate the 25% and 75% quartiles, values that are larger than 25% or 75% of the (ordered) data, respectively. The example data above with a range of 1 to 10 and a median of 2 has a 25% quartile of 2 and a 75% quartile of 3; this would indicate that there probably are some outliers causing the range to extend to 10.
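To make this concrete, here is a hypothetical data vector chosen to match the example above, together with the corresponding base R commands (R itself is introduced in section 2):

x <- c(1, 1, 2, 2, 2, 3, 3, 4, 10)   # hypothetical example data
range(x)                             # [1]  1 10
median(x)                            # [1] 2
quantile(x, probs = c(0.25, 0.75))   # 25% quartile: 2, 75% quartile: 3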

Mean and standard deviation


Very common descriptive statistics are the mean and the standard deviation. They make most sense for symmetrically distributed data. The mean is calculated as the sum of all values $x_i$ divided by the number of values $n$ (Figure 1-10):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The standard deviation s (also sd) is a measure of the spread of the data and is calculated as:

$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$

Figure 1-10 Histogram of a normal distribution showing mean, median, quartiles and
standard deviation

Standard error and confidence interval of the mean
The standard error of the mean (SE or se) gives a measure of the precision of the estimate of the mean. It is calculated as:

$$SE = \frac{s}{\sqrt{n}}$$

The standard error can be used to calculate a confidence interval (CI) for the mean. The 95% confidence interval is a range around the sample mean that should contain the mean of the population with 95% probability. It is (approximately) calculated as:

$$95\%\ CI = \bar{x} \pm 1.96 \times SE$$

Note that the standard error and the confidence interval of the mean become smaller the larger the sample is. This reflects the greater trust you can have in a mean calculated from a large sample as opposed to a mean calculated from a small sample.
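As an illustration, assuming the six leaf-width measurements used as example data in section 2.4, the standard error and an approximate 95% confidence interval can be computed in R as follows (1.96 is the normal approximation; for small samples a t quantile from qt() would be more accurate):

x <- c(4.0, 4.7, 2.8, 4.1, 3.5, 3.7)   # example measurements
n <- length(x)
se <- sd(x) / sqrt(n)                  # standard error of the mean
mean(x) + c(-1.96, 1.96) * se          # approximate 95% CI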

How to deal with data that is not normally distributed


Biological data is often not normally distributed, especially size measurements. It is, for example, not rare that there are many small measurement values and fewer and fewer larger values, such that the data has a distribution as in the histogram below.

In this case the mean and standard deviation are not such good measures of center and spread. In the graph below, the mean is rather far from the bulk of the data. A range of one standard deviation around the mean does not contain the same number of measurements on each side. Median and quartiles are better descriptive statistics for such data: the median indeed is in the center of the data and the quartiles nicely reflect the asymmetry in the distribution, i.e. the distance between the 25% quartile and the median is smaller than the distance between the 75% quartile and the median (Figure 1-11). Alternatively, you can use data transformations and calculate mean and standard deviation from transformed data (see Basic analyses with R).

Figure 1-11 Descriptive statistics on a right-skewed distribution

Summary

 Descriptive statistics are important to check data and are used to summarize data.
 Range, quartiles and median are basic descriptive statistics for data with any distribution.
 Mean and standard deviation are more useful for symmetrically distributed data.

1.3 Exercises: Getting started with statistics

1-A
Please have a look at the two distributions below, A and B. They correspond to commonly
observed distributions of biological data. Read the statements below and select whether they are
true for distribution A, distribution B, both distributions or neither of the distributions.

 Most values are between 3 and 7

 Values are between zero and 9


 Corresponds to about 1000 values in total

 The distribution is symmetric


 The distribution is asymmetric

 Corresponds to about 350 values in total


 Less than 200 values are smaller than 4

 Many small values occur


 Many large values and few small values.

 Few extreme values occur.


 Most values are smaller than 4.

1-B
You and a friend wonder if it is "normal" that some bottles of your favourite beer contain more beer than others although the volume is stated as 0.33 L. You find out from the manufacturer that the volume of beer in a bottle has a mean of 0.33 L and a standard deviation of 0.03 L. If you now measure the beer volume in the next 100 bottles that you drink with your friend, how many of those 100 bottles are expected to contain more than 0.39 L, given that the information from the manufacturer is correct?

1-C

Your data is distributed as shown below. Where do you expect the median to be?

Select one:
To the left of the mean.
To the right of the mean.
At the same place as the mean.

1-D
Check all answers that are correct.
A P-value of 0.051 for a t-test …
(a) …means that 0.949 % of the data are greater than the mean.
(b) …indicates a 5.1% probability of Type I error
(c) …proves that there is no difference between the groups.
(d) …shows that the difference would be significant if more data were used.
(e) …is regarded significant in biostatistical analyses.
(f) …is regarded non-significant in biostatistical analyses.

Solutions to Exercises Getting started with statistics

1-A
Most values are between 3 and 7 – True for A
Values are between zero and 9 – True for A and B
Corresponds to about 1000 values in total – True for A and B
Symmetric – True for A
Asymmetric – True for B
Corresponds to about 350 values in total – True for neither
Less than 200 values are smaller than 4 – True for A
Many small values occur – True for B
Many large values and few small values – True for neither
Few extreme values occur – True for A
Most values are smaller than 4 – True for B
1-B
Correct answer:
0.39 L corresponds to the mean plus two standard deviations (0.33 + 2 × 0.03), and values larger than 0.39 L are thus expected to occur in 2.3% of the cases. 2.3% of 100 is 2.3, i.e. about 2 bottles.
The correct answer is: 2.3
1-C
The correct answer is: To the right of the mean
1-D
(b), (f)

2 Getting started with R

2.1 What is R?

R is a versatile and powerful statistical programming language developed by the statistics professors Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand. Unlike many other statistical programs, R is free and its source code is available. R was released in 1996, is maintained by the R development core team (http://www.R-project.org/) with a very large number of international contributions, and continues to develop at a fast pace. R is among the most widely used statistical programs at universities today.

For more information, please check out the list of web resources and books on R at the end of this chapter.

2.2 Downloading and installing R

R is available for Macintosh, PC and Linux operating systems and is easy to install. To download and install R, please go to the Comprehensive R Archive Network (CRAN) page and follow the instructions there.

2.3 How to work with R

Goal
In this section, you will learn:

1. What R scripts, R commands and the R console are


2. How to work with RStudio
3. Basic data types in R and how to assign them
4. Steps to follow when working with R (work flow)

Elements of R: Script, Console & Co.
Using R mostly involves writing commands (or “code”) rather than clicking on menus. The commands are usually assembled in a script that can be saved and reused. The R console receives the commands either from the script or by direct typing, shows the progress of analyses and displays the output. Graphical output will open in a separate window. This means that working with R can involve quite a lot of windows and files: the script, the console, graphical output and then, of course, your data files and other user-specified output. You can assign a working directory where all of these files and outputs are saved by default. Nonetheless, it can be difficult to keep track of all these windows on the screen when working with R! (Figure 2-1)

Figure 2-1 An overview of workflow of R

An excellent way of organizing and manipulating your R windows and files is to use the free and powerful interface RStudio (see below). We highly recommend that you use this program.

Using RStudio to work with R


RStudio incorporates the console, script, graphical output and various other elements in an
accessible and easy-to-manipulate form. RStudio is free and available for both Windows and
Macintosh operating systems and can be downloaded from
http://www.rstudio.com/products/rstudio/. Note that the R studio menu differs slightly
between PC and Mac versions.

The RStudio screen is divided into four resizable parts (Figure 2-2). The upper left part contains a script editor where commands are written and saved. The various tabs in the upper left part can contain multiple scripts and also data files. Commands are sent to the console in the lower left part using the key combination cmd + Return (Macintosh) or Control+R (PC). On the right side, RStudio displays a workspace tab listing all objects in the current analysis and a history tab providing a record of executed commands. The lower right partition hosts a plots tab where graphical output is stored, a packages tab where packages can be viewed and installed, a files tab to manipulate files and a help tab where R help information can be searched and displayed.

[Figure annotations: script editor (top left, write your commands here), console (bottom left, execute your script and view output), Workspace/History tabs (top right), Files/Plots/Packages/Help tabs (bottom right)]

Figure 2-2 A screenshot of RStudio

In RStudio, you can bundle your analyses into projects using the project drop-down menu on the
top (through “file” for PC and through “project” for Mac) or the pop-up menu in the top right
corner of RStudio (both versions). Projects will contain all elements of analyses allowing you to
continue a session exactly where you ended the previous time.

For help with RStudio, you can go to https://support.rstudio.com.

You can set a new working directory following the menu options Session > Set working directory
(Figure 2-3).

Figure 2-3 Setting the working directory in RStudio
To create a new script, you can follow “File > New File > R script”, or use the shortcut Ctrl +
Shift + N. Save your scripts regularly. A file that has been modified but not saved again will
show with a red title and a * at the end.

You can navigate between different plots produced during a session using the blue arrows at the
top left corner of the Plots tab. You can save your graphs by clicking on “Export”.


Typing commands and Writing Code


R commands always have the same structure (Figure 2-4). A command name is followed by parentheses without a space. Command names are often closely related to what the command does; for example, the command mean() will calculate the mean. The parentheses contain the arguments, separated by commas, that the command will use. For example, such arguments tell R what data to use. Arguments are also used to select options for analyses. Which arguments are needed differs between commands and is explained in the help for each command.

Figure 2-4 Structure of R command
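For example, in the command below, round is the command name, the number to round is the first argument, and the named argument digits selects an option (the number of decimal places):

round(3.14159, digits = 2)

[1] 3.14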

In R, you can create scripts that are saved as an .R file, can be re-used and serve as a
documentation of your analysis. Scripts are written and manipulated in the script editor window,
and should be saved to the working directory with a .R file extension. It is also possible to enter
code directly into the console. Typing in the console can be faster when trying out code.
However, any analysis that you need to keep a record of should be created as a script and saved.

Important notes for writing and executing code:

 You send commands from the script to the console such that they are executed by
highlighting one or multiple lines and pressing the keyboard shortcuts cmd + Return
(Macintosh) or Control+R (PC).
 You can enter comments preceded by #. Everything written after # will be ignored and
not executed in the console (Figure 2-5).
 In RStudio, you can use 4 of these symbols #### after a title to organize the scripts and
mark specific points that you can find easily at the bottom left of the script window
indicated by an orange # (Figure 2-5).

Figure 2-5 Screen shot of RStudio script tab

 Observe the ">" sign in the console. This is the command prompt, the place where you
can type in your commands and execute them by pressing return. The command prompt
indicates that R is ready to receive commands and has finished executing previous
commands. R will display a "+" if a command is longer than a single line or incomplete.
On the console, you can cycle through previous commands using the "arrow up" and
“arrow down” keys on your keyboard.

Object assignment and the workspace


R stores data, analyses and outputs as objects in the current workspace, which can be saved. Note that the workspace is a collection of R objects with properties assigned through R commands, whereas the working directory basically is a folder on your computer that contains various files of any type.

You assign information to a variable using the assignment operator, an "arrow" composed of a less-than sign and a minus sign (<-) pointing to the name of the variable. There are certain rules for naming your variables.

 The first character must be an English letter or a dot (a dot must not be followed by a number).
 You can use uppercase or lowercase letters.
 Blanks and special characters, except for the underscore and the dot, are not allowed in variable names.

For instance, the following command

greeting <- "hej"

will assign "hej" to an object named greeting. All text has to be in quotes (" "), otherwise R will
look for an object with this name and create an error message, for example,

greeting <- hej

will result in the error message

Error: object 'hej' not found

To call an object and see what it contains, you enter its name on the command line, as below.

greeting

will result in the output

[1] "hej"

The [1] indicates that this is the first (and only) element of this object.

You can check what variables have been created in R using the command

ls()

This will result in a list of the current objects in the workspace. In RStudio you can view and
manipulate current objects in the workspace tab. You can also remove (delete) objects using
rm(), for example

rm(greeting)

Variables are overwritten without notice whenever something else is assigned to the same name.
If you want to delete all current objects, use

rm(list=ls())

Data types
R deals with numbers, characters (entered in quotes, " ", as the “hej” in the example above) and
logical statements (TRUE or FALSE).

The following types of data are commonly used by beginners:

Vector: one-dimensional, contains a sequence of one type of data, i.e. numbers OR categories (letters, group names) OR logical statements. Vectors can be created using c(element1, element2, element3, ...), which concatenates (connects one after the other) the different elements into a vector. Note that the elements can themselves be vectors. For example,

c("population1","population2","population3","population4")

will generate a vector as follows:

[1] "population1" "population2" "population3" "population4"

Number sequences can be created using the operator ‘:’. For instance,

x <- 1:7

creates the vector x that contains a sequence of numbers from 1 to 7:


[1] 1 2 3 4 5 6 7

In addition, there are a number of other functions for creating vectors, for example seq() for user-defined sequences and rep() for repeated elements, as shown below. You can find out more about these functions using the R help.
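Two brief examples (both functions are part of base R):

seq(from = 0, to = 1, by = 0.25)   # user-defined sequence

[1] 0.00 0.25 0.50 0.75 1.00

rep("control", times = 3)          # repeated elements

[1] "control" "control" "control"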

Factor: similar to vectors but also contains information on levels. Entries of a factor that are
equal belong to the same factor level, or in other words, to the same category. Factors can be
created from vectors using factor(). For example, you can create a factor named sex using
the code below:

sex <- c(rep("male",25), rep("female", 35))


sex <- factor(sex)
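You can inspect the resulting factor, for example with the base R functions levels() and table(); for the sex factor created above this yields:

levels(sex)

[1] "female" "male"

table(sex)

sex
female   male
    35     25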

Data frame: collection of vectors and factors of the same length that can contain different data types. This is the format commonly used for data analysis, where each row corresponds to an observation and each column corresponds to a variable (vector or factor). The section Getting data into R explains how to create data frames from your data, and the sections Accessing and changing individual entries, Accessing and changing entire rows or columns, and Adding and deleting columns explain how to handle and manipulate the contents of data frames.

List: collection of elements of any format and type; can be created using list(). Outputs of statistical analyses are often lists. Other types of objects include matrices and arrays.

Work flow in R
1. Define/create a folder on your computer that is to be used as a working directory.
2. Open RStudio and create a new file in the script editor (see R scripts in RStudio).
3. Add a title preceded by a hash sign (#), for example,

# My first R session, my name, today's date

The hash sign (#) is used in scripts to identify text that is NOT a command, i.e. titles and comments, and prevents R from trying to execute such text.
4. Set the working directory to your prepared folder (see 1). The working directory can be changed at any time. If you want to make sure that the working directory is correct, use the command getwd() to obtain the path to the current working directory. A minimal example script follows this list.
5. Save the script file now, and regularly later on, using the menu or symbol on the script window or the shortcut Ctrl + S. The script file has the extension .R and will be saved in the working directory by default. This script can be re-used and shared.
6. Load data from the working directory into an object (see Loading data files).
7. Conduct analyses, produce output and graphs (creating further objects) and save the script and outputs/graphs.
8. Quit R using the command q() typed into the console, or by closing the RStudio window. This will result in a question of whether you want to save the workspace. You can safely answer no to that; save the script and the data instead. Saving the workspace is only recommended for analyses that take a long time to complete.
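Here is a minimal sketch of what such a script could look like; the folder path is hypothetical and needs to be adapted to your computer:

# My first R session, my name, today's date
setwd("~/statistics_with_R")   # hypothetical working directory; adjust the path
getwd()                        # check the current working directory
my.data <- read.table(file = file.choose(), header = TRUE, sep = ";", dec = ",")
str(my.data)                   # check that the data were loaded correctly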

Basic calculations in R
R is sometimes referred to as an "overgrown" calculator.

You can perform calculations directly in R, for example the line

3 + 4

will result in
[1] 7

Other common arithmetic operators are: - (minus), / (division), * (multiplication) and ^ (exponentiation). Calculations can also be done on vectors. Basic calculations (+, -, /, *, ^, etc.) are conducted on each element, and vectors of the same length will be combined element-wise in calculations. This also applies to columns or rows of data frames.

numbers.1 <- c(1,2,3)

will assign the numbers 1, 2, and 3 to a vector called numbers.1.

numbers.1 * 2

will multiply each element of the vector by 2.

Creating a second vector with 3 elements, numbers.2, and adding the two vectors:

numbers.2 <- c(2,2,2)


numbers.1 + numbers.2

You can use this to establish functional relationships of interest, plot and examine them. Try out
this code! Note that c(1:100) creates a vector with the numbers from 1 to 100.

plot(c(1:100)^2)

Summary

 In R, you use a script window to enter the commands. Commands are transferred to the R
console for execution. Scripts can be saved and re-used.
 Data, output and scripts are saved in a designated working directory.
 R stores data and analysis outputs as objects that often are vectors, factors, data frames or
lists. They contain data as numbers, characters or logical statements. Factor levels or other
text must be in quotes "".
 Objects are given names with the assignment arrow <-.
 The workflow in R involves setting the working directory at the beginning and saving the
script file repeatedly.
 In R, you can conduct basic mathematical calculations directly and element-wise, for
example on vectors and on columns of data frames.

2.4 Handling data

Goal
In this section you will learn how to

 enter and load data into R


 check data structure
 explore data graphically
 change and subset data

In R, you can input data either by directly typing within the script or by loading data files.

Entering data in the script


You can enter data using the functions c() and data.frame().

Below is an example from plant experiments. You measured the width and the length of six
leaves in the plant species Silene dioica in cm. The first three plants were flowering and the last
three were not flowering.

You can enter the data directly as arguments to the function data.frame().

Silene.leaves <- data.frame(plant.number = c(1, 2, 3, 4, 5, 6),
    leaf.width = c(4.0, 4.7, 2.8, 4.1, 3.5, 3.7),
    leaf.length = c(5.3, 4.9, 5.7, 5.0, 5.5, 4.3),
    flowering.state = c("flowering", "flowering", "flowering",
        "vegetative", "vegetative", "vegetative"))

Each column is set by an argument: column.name = c(data). Note that the data entries
within c() must be separated by commas and the arguments (columns in this case) within
data.frame() must also be separated by commas. The function c() produces a vector
from the data (compare data types). You can choose column names freely. Note that data for the
flowering state is entered in quotes because flowering state is a category.

Alternatively, the data can be assigned to vectors first and then combined into a data frame.
Vectors can also be used for analysis by themselves.

width.vector <- c(4.0, 4.7, 2.8, 4.1, 3.5, 3.7)
length.vector <- c(5.3, 4.9, 5.7, 5.0, 5.5, 4.3)
flowering.vector <- c("flowering", "flowering", "flowering",
    "vegetative", "vegetative", "vegetative")
Silene.leaves <- data.frame(leaf.width = width.vector,
    leaf.length = length.vector, flowering.state = flowering.vector)

Preparing data files


If you want to load data from a file instead, you first need to prepare a suitable file from another
program, for example Excel. Follow these points:
1. Arrange the variables (measurements) in columns and observations in rows (see example).
2. Make sure that column headings and table entries contain only letters from the English alphabet or numbers; in particular, they should have no spaces or slashes. Headings should start with a letter and must not contain commas. If you want to make your headings clearer, you can use points or underscores, for example, height_august. This is advisable even though some newer versions of R (sometimes) tolerate spaces and more letters.
3. Save your table as .csv.
4. Open your file in a text editor. Observe two things: Firstly, what is the decimal separator, i.e. is it 1,5 or 1.5 (comma or point)? Secondly, how are the entries separated? This can for example be a space ( ), tabulator (\t), comma (,) or semicolon (;).

You need this information to ensure correct loading of your data in R, explained on the next
page.

Loading data files


There are many ways to load data into R. In RStudio you can load data through the workspace tab using the import data pull-down menu. This allows you to view the files and directly specify the separator between entries, the decimal separator and whether or not you have headings. Note that using this menu option in RStudio will also produce code in the console that you can copy into your script for later use if desired. You can also view the data in a tab in the top left partition.

If you want to use a direct command to read data we recommend read.table() because it is
universally applicable. Within the read.table() command you can specify to browse your
computer for the file to load using the argument file = file.choose(). Or you can
enter the path of your file: file = "document_name.txt" for example. You indicate
whether your data contains a header (a row of titles) with header = TRUE (has a header) or
header = FALSE (no header). The table separator is set using sep, for example, sep =
";" (semicolon), sep = "," (comma), or sep = "\t" (tabulator). The decimal
separator is specified by argument dec, for example dec = "," (comma) or dec = "."
(point).

The input file needs to be assigned to an object using the arrow <-. In most cases, this will
automatically be a data frame object.

For a .csv file with header, semicolon-separated entries and decimal commas (as usually used for
Swedish settings for .csv files saved from Excel) the command looks like this:

my.data <- read.table(file = file.choose(), header = TRUE, sep = ";",
    dec = ",")

For a .csv file with header, comma -separated entries and decimal points (as common in North
America) the command looks like this:

my.data <- read.table(file = file.choose(), header = TRUE, dec = ".",
    sep = ",")

Note: when you execute these commands, nothing visible happens; the data is simply stored in the object. See the next page for how to access the data.

Common problems when loading data


Additional entries. Loading a file with additional entries (sometimes invisible ones such as
spaces) in cells outside the data will yield an error message similar to this one:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  line 148 did not have 5 elements

Remedy: copy your data (and only your data!) to a new sheet in Excel, save it as .csv and reload.

Non-English characters and signs. Non-English characters (ä, ö, é, etc.), signs (for example * ! & / | \ + - > < $ ? =) or spaces in the column names will produce an error message similar to this:

Error in make.names(col.names, unique = TRUE) :
  invalid multibyte string 1

Remedy: change the names in an Excel file or directly in a .csv file, save it as csv and reload.

Note: Non-English characters elsewhere in the data might also lead to an error message and thus
should be avoided.

Checking data structure


To check what sort of object your data is stored in and whether it has the correct structure, you can use the structure command str(). For example, after loading the example file on Silene leaves, we use

str(Silene.leaves)

This yields the following output:

'data.frame': 6 obs. of 4 variables:
 $ plant.number   : num 1 2 3 4 5 6
 $ leaf.width     : num 4 4.7 2.8 4.1 3.5 3.7
 $ leaf.length    : num 5.3 4.9 5.7 5 5.5 4.3
 $ flowering.state: Factor w/ 2 levels "flowering","vegetative": 1 1 1 2 2 2

This indicates that the object Silene.leaves is a data frame with six observations, i.e. six
rows in our input file, and four variables, i.e. four columns in our input file. This matches well
with the six plants and four columns in our input file.

The column names and the types of the columns are also given, together with the first few values (in this case all six values). The columns plant.number, leaf.width and leaf.length are numeric (continuous numbers) and you will be able to do calculations with these numbers. The flowering.state column is a factor with the two levels "flowering" and "vegetative"; you will be able to use this factor as a grouping variable. All of this appears as expected and correct.

If you want to change the type of the column, for example changing plant.number from
numeric to factor (because the number is a "name" in this case), use the following command

Silene.leaves$plant.number <- as.factor(Silene.leaves$plant.number)

You can control whether this has been successful by calling the structure command again.
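For the converted column, str() should now report a factor; a sketch of the expected output:

str(Silene.leaves$plant.number)

Factor w/ 6 levels "1","2","3","4",..: 1 2 3 4 5 6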

To look at the data you can also just type the name of the object. This is advisable only for small
data frames. For example

Silene.leaves

will yield

  plant.number leaf.width leaf.length flowering.state
1            1        4.0         5.3       flowering
2            2        4.7         4.9       flowering
3            3        2.8         5.7       flowering
4            4        4.1         5.0      vegetative
5            5        3.5         5.5      vegetative
6            6        3.7         4.3      vegetative

Accessing and changing data


Sometimes you need to check or change individual elements of your data. In R, the elements of a vector are always internally numbered, and we use this numbering to access and change the data.

For example, the vector

width.vector <- c(4.0, 4.7, 2.8, 4.1,3.5, 3.7)

has six elements. We can access its third element using the vector name followed by square
brackets and the element number. This line will bring up the third element:

width.vector[3]

[1] 2.8

Should we realize that this element needs to be changed from 2.8 to 3.0 we can do that using the
assignment arrow:

width.vector[3] <- 3.0

calling width.vector again shows that this has happened

width.vector

[1] 4.0 4.7 3.0 4.1 3.5 3.7

We can access the elements of data frames in the same way, except that data frames have two
dimensions, rows and columns, such that two numbers separated by comma are needed with the
square brackets. The first number always refers to rows, the second to columns. To access the
element in the third row and the second column in our Silene.leaves data frame we use

Silene.leaves[3, 2]

[1] 2.8

Incidentally this is the same measurement as in the vector example above. To change it to 3.0 in
the data frame we use the same kind of assignment operation:

Silene.leaves[3, 2] <- 3.0

We can check whether this has happened by

Silene.leaves[3,2]

[1] 3.0

Entire rows and columns of data frames can be accessed by leaving the column (or row) number empty in the square brackets. Note that the comma must always be entered because data frames have two dimensions. Accessing rows and columns is needed to conduct analyses and to make changes or calculations. For example,

Silene.leaves[ ,2]

[1] 4.0 4.7 3.0 4.1 3.5 3.7

brings up entries in the entire second column and

Silene.leaves[3, ]

  plant.number leaf.width leaf.length flowering.state
3            3        3.0         5.7       flowering

brings up the entire third row, for example to check that plant's measurements. The first 3 is the row number.

Column names can be used in place of the numbers. R has a special notation for columns involving the dollar sign, as you may have noticed in the output of the structure command. The following line will also bring up the second column (leaf.width).

Silene.leaves$leaf.width

Alternatively, the column names can be entered in quotes directly within the square brackets (note the comma!).

Silene.leaves[ ,"leaf.width"]

Should you now realize that the width measurements all need to be increased by 0.2, you can do that using

Silene.leaves$leaf.width <- Silene.leaves$leaf.width + 0.2

OR

Silene.leaves[,"leaf.width"] <- Silene.leaves[,"leaf.width"] + 0.2

OR

Silene.leaves[,2] <- Silene.leaves[,2] + 0.2

Which of these options is most convenient depends on your column names, the size of your data
file and your preferences.

Adding and deleting columns


Additional columns can be assigned at any time. For example, you may wish to create a column with the ratio of leaf width to leaf length in our Silene.leaves data frame.

Silene.leaves$width.length.ratio <- Silene.leaves$leaf.width /
    Silene.leaves$leaf.length

Calling the structure command shows that the column has been added and is numeric.

str(Silene.leaves)

'data.frame': 6 obs. of 5 variables:
 $ plant.number      : num 1 2 3 4 5 6
 $ leaf.width        : num 4 4.7 2.8 4.1 3.5 3.7
 $ leaf.length       : num 5.3 4.9 5.7 5 5.5 4.3
 $ flowering.state   : Factor w/ 2 levels "flowering","vegetative": 1 1 1 2 2 2
 $ width.length.ratio: num 0.755 0.959 0.491 0.82 0.636 ...

Deleting one or several columns can be done using the minus sign within the square brackets. This only works with column numbers, not with column names. This line removes the newly added width-length ratio column:

Silene.leaves <- Silene.leaves[ ,-5]
str(Silene.leaves)

'data.frame': 6 obs. of 4 variables:
 $ plant.number   : num 1 2 3 4 5 6
 $ leaf.width     : num 4 4.7 2.8 4.1 3.5 3.7
 $ leaf.length    : num 5.3 4.9 5.7 5 5.5 4.3
 $ flowering.state: Factor w/ 2 levels "flowering","vegetative": 1 1 1 2 2 2

Removing rows, for example, when you realize that measurements of an entire row are faulty,
works in the same way. This line removes the first row (observe the placement of the comma!).

Silene.leaves <- Silene.leaves[-1, ]
str(Silene.leaves)

'data.frame': 5 obs. of 4 variables:
 $ plant.number   : num 2 3 4 5 6
 $ leaf.width     : num 4.7 2.8 4.1 3.5 3.7
 $ leaf.length    : num 4.9 5.7 5 5.5 4.3
 $ flowering.state: Factor w/ 2 levels "flowering","vegetative": 1 1 2 2 2

If you need to remove more than one row or column, use the c() command within the square brackets:

Silene.leaves[-c(1:3), ]

will remove rows one to three.

Silene.leaves[, -c(1,4)]

will remove columns one and four.

Subsetting data
There are many situations where only a specific subset of the data needs to be accessed. In R this is done by entering logical statements into the square brackets for row and column selection. If you want to select, for example, only the flowering plants in the Silene.leaves data frame, you use the statement Silene.leaves$flowering.state == "flowering" for row selection.

Silene.flowering <- Silene.leaves[Silene.leaves$flowering.state ==
    "flowering", ]

will produce a new data frame named Silene.flowering containing only the flowering plants.

str(Silene.flowering)

'data.frame': 3 obs. of 4 variables:
 $ plant.number   : num 1 2 3
 $ leaf.width     : num 4 4.7 2.8
 $ leaf.length    : num 5.3 4.9 5.7
 $ flowering.state: Factor w/ 2 levels "flowering","vegetative": 1 1 1

Let's have a closer look at the logical statement:

Silene.leaves$flowering.state == "flowering"

In words, this statement means something like "check for each element of Silene.leaves$flowering.state whether it reads "flowering" or not". If you execute only the logical statement, you create a vector with six elements; the first three are TRUE (corresponding to the flowering plants) and the last three are FALSE (corresponding to the vegetative plants).

Silene.leaves$flowering.state == "flowering"

[1] TRUE TRUE TRUE FALSE FALSE FALSE

When you use such statements for row selection, all rows corresponding to TRUE will be
selected, in this case the first three rows.

Note that R does not assume that you will use only columns of the same data frame for the logical statements; in fact, you can also use columns from other data frames or vectors. For this reason you need to write Silene.leaves$flowering.state == "flowering" and not only flowering.state == "flowering".

You can use the following logical operators:

== identical
!= not identical
> greater than
< smaller than

and combine conditions using

| logical OR, one of the conditions fulfilled


& logical AND, all conditions fulfilled

Here are some more examples:

Selecting plants with leaf width over 4.0

Silene.leaves[Silene.leaves$leaf.width > 4.0, ]

Selecting plants with either width or length under 3.5

Silene.leaves[Silene.leaves$leaf.width < 3.5 |
Silene.leaves$leaf.length < 3.5, ]

Selecting plants with both width and length over 4.0

Silene.leaves[Silene.leaves$leaf.width > 4.0 &
Silene.leaves$leaf.length > 4.0, ]

Further, subset() is a useful function to perform these kinds of selections and subset a data
set. The first argument specifies the data frame to subset. The second argument is a logical
expression, as explained above, used to select specific rows in the data frame, and the third
argument indicates the columns to be selected by their names (if several columns are selected, the
names have to be in a vector). If you only want to omit one column, use - in front of the column
name: for example,

New.Silene.leaves <- subset(Silene.leaves, flowering.state=="flowering",
select=-flowering.state)

will create a new data frame containing only the rows concerning the flowering plants, and all
columns except the flowering.state column (which is not needed any longer since we know all the
plants in the data set are flowering).

Since you specify the data frame to subset in the first argument of the subset() function, you
can directly refer to the different variables (columns) by their names, without using the $ sign.
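
For example, a minimal sketch (assuming the original Silene.leaves data frame from above; the object name Silene.wide is made up for illustration):

Silene.wide <- subset(Silene.leaves, leaf.width > 4,
select=c(leaf.width, leaf.length))

will return only the leaf.width and leaf.length columns for plants with a leaf width over 4.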

Summary

 Data can be entered in the script using the data.frame() command.


 Loading data from files involves preparing a .csv file with observations as rows and
measurements and grouping factors as columns. These files should only contain numbers
and letters from the English alphabet. In grouping variables and headers, points and
underscores can also be used.
 Data can be loaded through the menu in RStudio.
 Alternatively, data is read and then assigned to an object using the command
read.table().
 Data structure can be checked using the str(data.name) command.
 Exploring data graphically can involve pair-wise plots of all variables with plot(),
histograms with hist() and boxplots with boxplot().
 Individual data entries, row and columns can be accessed and changed using their row and
column subscripts.
 Data can be subset using logical statements involving ==, !=, |, and &.

2.5 Dealing with missing values

Goal
In this section you will learn how to

1. Interpret different types of missing values indicators in R


2. Handle missing values in common functions
3. Identify, count and set missing values

Types of missing values


 NA (not available) codes missing data in R. When preparing a data file it is good practice to
enter NA into "empty cells" in your Excel table.
 NA also appears as a result when a command cannot be executed, for example because
the data contains NA and the command is not prepared to handle NA.
 NaN (not a number) appears when a calculation does not yield a mathematically defined
answer. R often gives a warning when NaN are generated as in

> log(-1)

[1] NaN
Warning message:
In log(-1) : NaNs produced

Handling missing values in common commands


R is very cautious: most of the basic commands return NA as soon as an NA is present in the
data. However, they usually have an optional argument to tell R to ignore NA, but this differs
between commands. For example,

mean(c(1,2,3, NA))

[1] NA

yields NA. Setting the optional argument na.rm (for NA remove) to TRUE tells R to consider
only non-NA values in the calculation; thus,

mean(c(1,2,3,NA), na.rm=TRUE)

[1] 2

yields the mean of the three non-NA values.

This also works for range(), sd(), var(), sum(), median(), max(), min() and
many other commands.

An exception is the command length( ). It gives the number of cases regardless of the
presence of NA. Thus,

length(c(NA, NA, NA))

[1] 3

The commands cor() for correlation and cov() for covariance ignore NA with the argument
use="complete.obs":

cor(n.1, n.2, use="complete.obs")

Here, n.1 and n.2 are two vectors of the same length.
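
A minimal sketch with made-up vectors:

n.1 <- c(1, 2, NA, 4, 5)
n.2 <- c(2, 4, 6, NA, 10)
cor(n.1, n.2, use="complete.obs")

[1] 1

Only the complete pairs (elements 1, 2 and 5) enter the calculation; since n.2 is exactly twice n.1 in those pairs, the correlation is 1.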

Other commands such as lm() for calculating linear models ignore NA in the default setting.

Consult the help files to find out how NA is dealt with for specific commands (see lesson on
interpreting help files).

Finding and counting missing values


To find out whether your data contains NA use

is.na(data.name) or more specifically


is.na(data.name$column.name)

This command can be applied to any data structure or part thereof. is.na() returns logical
statements for each element of the data with TRUE for both NA and NaN and FALSE for other
entries. To find only NaN use is.nan(). For example,

is.na(c(1,NA,3,NA,5)) returns

[1] FALSE TRUE FALSE TRUE FALSE
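
To see the difference between the two commands, compare (NaN counts as NA, but NA does not count as NaN):

is.na(c(1, NA, NaN))

[1] FALSE TRUE TRUE

is.nan(c(1, NA, NaN))

[1] FALSE FALSE TRUE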

Vectors of logical statements can be summed, because TRUE is automatically converted to 1 and
FALSE to 0. This way the number of missing values can be obtained. For example,

sum(is.na(c(1,NA,3,NA,5))) will yield the answer

[1] 2

indicating the number of NA in the data.

To access rows that have no NA in any of the columns use

complete.cases(data.name)
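
For example,

data.name[complete.cases(data.name), ]

will return only the rows of the data frame that contain no missing values.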

Note that the function summary(data.name) will also provide the number of NA in each
column.

To find out where the NA are in the data use the command which(), for example

which(is.na(c(1,NA,3,NA,5))==TRUE)

returns

[1] 2 4

because elements 2 and 4 are NA.

Setting missing values


To set certain data points as NA, for example when you realize that there is a problem with them,
access the elements of the data frame using row and column numbers and assign NA to those, for
example

numbers.1[1] <- NA

will set the first element of the vector to NA.

data.name[2,3] <- NA

will set row 2, column 3 to NA.
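
This can be combined with a logical statement; as a sketch (assuming a column column.name that should not contain negative values),

data.name$column.name[data.name$column.name < 0] <- NA

will set all negative entries of that column to NA.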

Note that these changes are made to the data frame object stored in R’s current workspace NOT
to your original data file.

Summary

 Missing data types in R are NA (not available, to be used in data tables) and NaN (not a
number).
 Many commands have optional arguments to deal with missing values, for example
na.rm=TRUE will tell R to ignore missing values in mean(), range(), sum() and
other basic functions.
 The command is.na(data.name) is used to identify NA and NaN.
 sum(is.na(data.name)) will return the number of missing values in the data and
which(is.na(data.name)) will return subscript numbers of the elements that are
NA or NaN.
 Data entries can be set to NA with the assignment arrow as in numbers[1] <- NA.

2.6 Understanding help() functions

R provides a large number of standardized help functionalities and web resources.

You can

1. find information on the use of commands (that you know the name of),
2. search for terms or words, perhaps related to an analysis you want to do, or
3. use web-based search functions that allow you to find commands but also packages,
tutorials and forum entries.

Typing a question mark followed by a command, for example

?t.test

will open a help file, try it out!

At the top of the page, the package that the command originates from is given in braces. Here,
t.test{stats} shows that the command t.test() originates from the package stats.
Further sections contain a description, the usage, the arguments and the value or object returned
by the command. The help file for t.test() indicates that its arguments include x,
y, alternative and mu. It further explains that the t.test() command returns a list
object including the value of the t-statistic, the estimated mean or difference in means, the
degrees of freedom and the P-value.

The help pages end with references, similar commands ("see also" section) and importantly,
examples. Example code can be directly copy-pasted into the console and only internal data is
used. Running example code is a very good way to examine how to work with a command.
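
Alternatively, the command example() runs the example code of a help page directly, for instance:

example(t.test)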

When looking for a command or term, you use two question marks followed by the term you are
looking for. Information on correlation analyses, for example, is found by typing

??correlation

This command will open a table of commands related to correlation in some way. The table lists
the commands and the packages they originate from, as well as a short description. Clicking on
these entries will take you to the help files for these commands.

A number of webpages are dedicated to informing about the use of R. For beginning
users, a search in the R help archives can be very helpful. They collect questions on R, and answers
are often given by well-known authors of R books and packages.

The Namazu R search page is accessible directly or from the R console using the
command RSiteSearch("search.word"); make sure to enter the search word in quotes
(" "). This page often leads to newer and more advanced topics. Try it out!

2.7 Exercises

2-A Vector creation


Write R code to generate the following vectors, explore the functions seq() and rep() using
the help on commands:
1.0 1.3 1.6 1.9 2.2 2.5 2.8 3.1 3.4 3.7 4.0 4.3 4.6 4.9
1 2 3 4 1 2 3 4 1 2 3 4
1 2 3 4 1 2 3 4 1 2 3 4 85
14 12 10 8 6 4 2 0
5 5 12 12 13 13 20 20

2-B
What is the correct code to load the data in a .csv file that looks like this?
Fertilizer;plant.biomass;plant.height;seed.weight
high;0,180476248;21,31200596;0,029829762

Select one:
(a) read.table(file=file.choose(), header=T, sep=",", dec=",")

(b) read.table(file=file.choose(), header=T, sep=",", dec=";")

(c) read.table(file=file.choose(), header=F, sep=";", dec=",")

(d) read.table(file=file.choose(), header=T, sep=";", dec=",")

(e) read.table(file=file.choose(), header=T, sep=",", dec=".")

2-C Loading and exploring data structure

Load the iris data that R provides internally by typing data(iris)


A. What sort of data type is iris?

B. How many rows (observations) and columns (variables) does the iris dataset have?

C. Which variable of the data frame iris is a factor and how many levels does it have?
Select one:
(a) The variable Species is a factor and it has 5 levels.

(b) The variable Species is a factor and it has 3 levels.

(c) The variable 'data.frame' is a factor and it has 150 levels.

(d) The variable 'data.frame' is a factor and it has 5 levels.

2-D Loading and graphical exploration of data
Please download this file and load it into R: sunflower fertilizer file
A. Is there any indication in the graphs that plant height or seed weight differ between plants
subjected to the two fertilizer treatments?
Select one:
(a) Plant height appears to be considerably larger in plants treated with high nutrient
fertilizer, whereas seed mass appears to be similar in plants from both treatments.
(b) Plant height appears to be considerably lower in plants treated with high nutrient fertilizer,
whereas seed mass appears to be similar in plants from both treatments.
(c) Plants from the two treatments do not appear to differ in height or seed weight.
(d) Plant height and seed weight appear to be lower in plants from the low nutrient fertilizer
treatment.
B. Create a new data frame containing only the rows of the “low” treatment.

2-E Subsetting data


Explore the dataset mtcars in R. You can get the structure and column names of the data by typing the
commands str(mtcars) and names(mtcars) respectively. Write your code to subset the
dataset mtcars according to the following requirements (NOTE: each requirement is
independent).
A. Select cars whose cyl (a column in the dataset) value is no smaller than 5.
B. Show all the fields (columns) of the first 10 cars.

2-F Data manipulation


Which of the following lines of script multiplies column 3 of the data frame my.data with 1.5?
(a) my.data[,3] *1.5 <- my.data[,3]

(b) my.data[3,] *1.5 <- my.data[3,]

(c) my.data[,3] <- my.data[,3] *1.5

(d) my.data[3] <- my.data[3] *1.5

Solutions:
2-A
A. v1=seq(1,4.9,0.3): you have to create a regular sequence, thus indicating the use of the function seq(), from 1 (first
argument) to 4.9 (second argument) using an increment of 0.3 (third argument)
B. v2=rep(1:4,3): this time you need to repeat the sequence of the 1 to 4 integers three times, so you should use the rep()
function, giving the vector to be repeated (first argument) and the number of repeats (second argument). As the vector to be
repeated is simply the sequence of integers from 1 to 4, you can use the : symbol to save time.
C. v3=c(v2,85): the vector you have to create is basically the same as the vector in question b), with just 85 added at the end.
So you should use the function c() that concatenates several vectors into one, in the order specified.
D. v4=seq(14,0,-2): again a sequence, but this time in decreasing order. You can generate that by using seq() with a negative
increment (-2 here).
E. v5=rep(c(5,12,13,20),each=2): this time we do not want to repeat a whole vector as in b), but we want to repeat each element
of a vector twice. This is done by using the argument each in the function rep(). The vector (5,12,13,20) is not an obvious
sequence so we just use c() to provide the vector to the rep() function.
2-B
d
2-C
A. data.frame
B. 150 observations, 5 columns
C. Species. It is a factor with three levels: setosa, versicolor, virginica.
2-D
A. a
Read the table into R using this command: t <- read.table("Downloads/sunflower.csv",sep=",",header=T)
Make the boxplots by typing: boxplot(t$plant.height ~ t$Fertilizer); boxplot(t$seed.weight ~ t$Fertilizer)
B. t.low <- t[t$Fertilizer=="low",]
2-E
A. mtcars[mtcars$cyl >= 5, ]
B. mtcars[1:10, ]
2-F
c

2.8 Web resources and books on R

Web resources
CRAN page

 Main R page for downloading and information

R-Studio

 Effective user interface for R, free download.

Quick R

 Useful web-page for beginners and slightly more advanced users, related to the book
R in action mentioned below

R reference card - print it and fold it!

 The perfect pocket card. Great for refreshing your memory or pointing you in the right
direction

Introduction to R - Webinar in Youtube

 Some of the presentations from the famous Paul H. Geissler's (now retired) web-based R
introductory course

Stackoverflow (forum)

 Every time that you type an R related question in google, this is one of the best hits to
follow

R-help info page

 Questions and answers on R use. Note that the domain stat.ethz.ch often has good
information.

R tutor

 Very informative with both beginner and selected advanced topics

Teaching with data simulations

 Inspiration for teachers

Books
Click on the title to see the book in the Swedish libraries

The R book (Crawley)

 Large overview with many biological examples, suitable for beginning users with some
statistical knowledge.

Getting started with R (Beckerman & Petchey)

 Very good and concise introduction to R for people with experience in statistical analyses
from other programs.

25 recipes for getting started with R (Teetor)

 A quite brief (44p) list of basic tips on how to use R for data exploration. Only
recommended for beginners. Most of those tips are found in Quick R

Introductory statistics with R (Dalgaard)

 A popular and well-written compendium of basic statistics in R. Suitable for beginners.

R in action (Kabacoff)

 From the author of Quick R; this book follows a case-study approach with many
practical data sets. Previous experience with R is desirable.

Data analysis and graphics using R (Maindonald & Braun)

 A long compendium of case-studies, some of them a bit outdated and quite technical.
Ideal for more advanced students.

R graphics (Murrell)

 A popular guide on how to make perfect graphs. Intended for those who want to
improve their artwork in R.

The art of R programming (Matloff)

 Teaches you how to use R for programming efficiently. Previous experience with R and
programming concepts is required.

Modern applied statistics with S (Venables & Ripley)

 Rather technical explanations on how to use the S language (the one R is based on) for
statistics. For more advanced learners.

Check more books and free PDFs here!

3 Basic Statistics with R

3.1 Types of data

Most data types can be broadly classified into two categories:

Continuous data, also known as numeric data, is any form of data in which data points can be
any numbers within a given range. Common examples of this include measurements such as
height, weight, etc. and many mathematical solutions (e.g. integration, slope, etc.).

Categorical data, also known as factor data, is any form of data in which data is grouped into
multiple categories. Examples of this include species type, hair color, etc. Binary data is a subset
of categorical data in which the data can only be one of two groups (e.g. dead or alive, heads or
tails, etc.).

Being able to distinguish between these types of data is extremely important, because as we will
see later, the type of data being used is an important factor in deciding the appropriate way to
analyze the data statistically.

3.2 Exploring data with tables

One of the simplest diagnostics used in R is the table() command. This command allows for
the creation of contingency tables that report the counts of cases (rows) in different categories of
another variable or several variable combinations. These tables are extremely useful for
determining if your data is in the correct format.

To demonstrate this we will use the example dataset warpbreaks, which records the number
of times different wools at different levels of tension break. The researchers set up this
experiment so that each combination of wool and tension had an equal sample size. In order to
test this we could simply call the data and count each measurement. However, this would
become extremely tedious in larger datasets. We could instead use the table() command to
answer this question:

data(warpbreaks)

table(warpbreaks$wool,warpbreaks$tension)

L M H
A 9 9 9
B 9 9 9

We can thus quickly confirm that each combination of treatments does indeed have 9
measurements associated with it. Further exploration of using statistics to analyze contingency
tables will be discussed in an upcoming section.

3.3 Exploring data graphically


In most situations you will benefit from first looking at your data graphically. This is in order to

 Find out whether your data is reasonable


 Detect any large outliers (for example data entry mistakes)
 Assess what the approximate distribution of the data is
 See the main patterns in the data.

The plot() command


A quick graphical check of the data is provided by the simple
command plot(data.name) that will open a new window displaying plots of all pair-wise
variable combinations (Figure 3-1).

plot(iris)

Figure 3-1 Pairwise plot for Iris dataset

Histograms
The command hist() produces a histogram displaying data values on the x-axis against their
frequencies on the y-axis allowing you to judge the distribution of the data. The command
hist() is applied to individual variables (columns) of the data, that are given by the name of
the data frame followed by a dollar sign and the name of the variable (column). The output is
shown in Figure 3-2.

hist(iris$Sepal.Length)

Figure 3-2 Histogram of sepal length of Iris

Boxplots
Boxplots display continuous data separated into the levels (groups) of a factor (grouping
variable). In the default settings, the command boxplot() shows medians as thick black lines
and quartiles as a box around the median. The t-bars ("whiskers") are the range of the data that
is within 1.5 times the inter-quartile distance from the median. Data points outside that range are
regarded as outliers and are displayed as circles. The main argument in the boxplot()
command is a formula statement relating the continuous variable on the left side to the grouping
variable on the right side with a tilde symbol (~), as in continuous.variable.name ~
factor.variable.name. Boxplots (Figure 3-3) can be used to get an idea on whether
there are large differences between groups, whether the data is distributed symmetrically within
groups and whether there are outliers.

boxplot(Sepal.Length~Species, data=iris)

Figure 3-3 Boxplot of sepal length of Iris
You can learn how to produce nicer looking graphs in R in the section Basic graphs in R.

3.4 Descriptive statistics

Another quick way to explore data is to use the command


summary(name.of.data.frame). This command gives you a number of descriptive
statistics for each continuous variable (range, quartiles, mean, median) (see getting started with
statistics section). For factor variables the command tabulates the number of observations for
each factor level. In our example

summary(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width


Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

You can calculate these and further descriptive statistics directly using the following commands
(Table 3), that are all applied to vectors (or data frame columns) of continuous variables:

Table 3. Commands to calculate descriptive statistics


Statistic             Command

Mean                  mean(variable.name)
Median                median(variable.name)
Range                 range(variable.name)
Standard deviation    sd(variable.name)
No. observations      length(variable.name)
Variance              var(variable.name)
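
For example, applied to the iris data used above:

mean(iris$Sepal.Length)

[1] 5.843333

sd(iris$Sepal.Length)

[1] 0.8280661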

Calculating descriptive statistics for groups of data


Often you will wish to apply these commands over groups of the data that are defined in a
grouping factor. You can do this using tapply().

The main arguments of tapply() are X, the variable that you want to summarize, INDEX, one
or more grouping variable(s), and FUN, the command you want to apply. The INDEX variable is
given as a list object, typically created within the tapply() command
using INDEX=list(variable.name).

tapply(X=variable.name, INDEX=list(variable.name),
FUN=command.name)

You can state all kinds of functions in the function argument.

tapply(X=iris$Sepal.Length, INDEX=list(iris$Species), FUN=mean)

setosa versicolor virginica


5.006 5.936 6.588
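
Any suitable command can be supplied to FUN; for example,

tapply(X=iris$Sepal.Length, INDEX=list(iris$Species), FUN=sd)

returns the standard deviation of sepal length for each species.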

3.5 Comparing two groups of measurements

Identifying the type of test

One-sample test

One-sample tests are used when a single sample is to be compared with a
specific hypothesized value for the mean. Examples of this
include fixed value comparisons such as whether average human height is
1.77m.

Two independent sample test

Here, measurements on two samples from two different populations are


compared. Examples include comparisons of males and females,
comparisons of plant/animals subjected to two different treatments, and
comparisons of two different species/localities.

Paired-sample test

Paired sample tests are used when two different measurements were taken
on the SAME experimental units. Examples are before and after studies on
the effect of medical treatments.

Work-flow for group comparisons (one or two groups)


Assume that we ask the question whether males run faster than females. We suggest the
following workflow (Figure 3-4) for group comparisons: First the data is plotted. The assumption
of t-tests is that the data is normally distributed and this needs to be assessed (see below) before
proceeding further. If the data is normally distributed you can proceed to the t-test, otherwise
data transformations are needed (i.e. transform the speed at which males and females run). If that
does not work, then a non-parametric test should be conducted.

Figure 3-4 Workflow for one- and two-group comparisons

Assessing normality using quantile-quantile-plots


The main assumption for t-tests is that the data is normally distributed. Depending on the type of
group comparison you need to assess normality for the following vector(s):

 one sample test: data of the sample


 two independent sample test: data vector of each group separately
 paired-sample test: vector of differences between the two treatments for each
experimental unit

You can view the distribution of your data using the command hist() as described above.
The command qqnorm(your.vector) produces a quantile-quantile plot (qq-plot) that is
regarded as the best graphical assessment of whether or not data conforms to the normal
distribution. A quantile is a value of the data that is just larger than a certain percentage of the
data. The median, for example, is the 50% quantile and the quartiles are the 25% and 75%
quantiles.

The qq-plot displays two different types of quantiles. On the y-axis sample quantiles, i.e. each
data point, are indicated. You can check this out by comparing the histogram and the qq-plots
below; the histogram and the y-axis of the qq-plot have the same range.

The x-axis of the qq-plot represents the standardized theoretical quantiles for a normal
distribution corresponding to each data point. The qqnorm(your.vector) command first
calculates the quantile of each value in the data, i.e. what percentage of the data is smaller or
equal to that value. It then looks up the corresponding quantile (i.e. value) of the standard normal
distribution with a mean of 0 and standard deviation of 1. Thus, points with values around zero
for the theoretical quantile should be close to the mean of the data on the y-axis.

qq-plots are evaluated with the aid of the command qqline(your.vector), producing a
line from the first to the third quartile of the data. You expect the points to be close to this line if
the data is normally distributed.
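
A minimal sketch using simulated data:

x <- rnorm(100) #100 values sampled from a standard normal distribution
qqnorm(x)
qqline(x)

Since x is normally distributed, the points should fall close to the line.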

Below you see examples of histograms and qq-plots (Figure 3-5) for 250 data points that are
normally distributed, left skewed and right skewed. Data that look like the left
skewed or right skewed examples below should be transformed before analysis. If this does not work
a non-parametric test should be used.

Figure 3-5 Histogram and QQ plot for normally-distributed, right skewed, left skewed
data

For smaller datasets, considerable deviations from the line are expected even in normally
distributed data, especially at the extremes. Below you see examples of histograms and
corresponding qq-plots (Figure 3-6) for five and ten values sampled from a normal distribution.
If your data looks like this you can still use the parametric tests!

Figure 3-6 Histogram and QQ plot for small datasets

Transformations
To obtain normally distributed data for further analysis the following transformations (Table 3-2
Useful transformations) are recommended:

Table 3-2 Useful transformations


Data                     Transformation     R code

Right-skewed (common!)   log_e(Y)           log(your.data)
                         log_10(Y)          log10(your.data)
                         1/Y                1/your.data
                         √(Y)               sqrt(your.data)

Left-skewed data (rare)  Y^a                your.data^a

Percent data             arcsin(√(Y))       asin(sqrt(your.data))

Remember that most of these commands, with the exception of the power transformation for
left-skewed data, are not defined for values smaller than zero (and the logarithms not for zero).
You may need to add a constant value to all values in order to perform the transformation.

You can transform your data, assign it to a new vector or data frame column and plot it again.
You may need to try out several different transformations. If you are still not satisfied with the
distribution, please use a non-parametric test.

It is also possible to apply the transformations directly with other commands, for example
hist(log(your.data)).

Once you have established that the distribution of your data is normal, you are ready to conduct the
appropriate t-test; otherwise, proceed to non-parametric alternatives.

3.6 Using t-tests with R

T-tests are calculated using the command t.test(). The arguments to this command identify
the type of test to be conducted.

One-sample t-test
We use our example in which we want to test whether average human height is 1.77m. This is
our data:

height <- c(1.43,1.75,1.85,1.74,1.65,1.83,1.91,1.52,1.92,1.83)

Be aware that the alternative hypothesis is that average human height is significantly different
from 1.77m. We declare the model in the following way, to obtain the following output:

t.test(height, mu = 1.77)

One Sample t-test


data: height
t = -0.5205, df = 9, p-value = 0.6153
alternative hypothesis: true mean is not equal to 1.77
95 percent confidence interval:
1.625646 1.860354
sample estimates:
mean of x
1.743

Therefore, we cannot reject the null hypothesis that average human height is 1.77m.

Two independent sample t-test


We use our example of dispersal distance in male and female butterflies. This is your data:

distance <- c(3,5,5,4,5,3,1,2,2,3)


sex <- c("male","male","male","male","male",
"female","female","female","female","female")

Before running the test it is important to consider your alternative hypothesis, specifically
whether you want to run a one-tailed or two-tailed test. If no alternative hypothesis is specified,
the command will assume a two-tailed test.

The two-sample t-test has a second assumption in addition to the normality of the data: equal
variance in the two samples. If the variances are assumed to be equal, this must be specified using
the argument var.equal = TRUE; otherwise Welch's t-test, which does not assume equal
variances, is used by default. Here we assume equal variances and perform a
two-tailed test.

t.test(distance ~ sex, var.equal = TRUE)

Two Sample t-test


data: distance by sex
t = -4.0166, df = 8, p-value = 0.003859
alternative hypothesis: true difference in means is not equal
to 0
95 percent confidence interval:
-3.4630505 -0.9369495
sample estimates:

mean in group female mean in group male
2.2 4.4

Thus, male butterfly dispersal distance is significantly different from that of female butterflies. We can also
specify a one-sided alternative hypothesis by adding the argument alternative="less"
or alternative="greater" depending on which tail is to be tested:

t.test(distance ~ sex, var.equal = TRUE, alternative="greater")

Two Sample t-test

data: distance by sex


t = -4.0166, df = 8, p-value = 0.9981
alternative hypothesis: true difference in means is greater
than 0
95 percent confidence interval:
-3.218516 Inf
sample estimates:
mean in group female mean in group male
2.2 4.4

The results of these tests state that female dispersal distance is not significantly greater than male
dispersal distance.

Paired sample t-test


As an example of a paired test, you investigate whether the sleep of students is affected by an
exam. You ask 6 students how long they sleep the night before an exam and the night after an
exam. These are the answers you get:

sleep.before <- c(4,2,7,4,3,2)


sleep.after <- c(5,1,3,6,2,1)

Here you simply add the argument paired=TRUE to the command from the two-sample
test above.

t.test(sleep.before, sleep.after, paired=TRUE)

Paired t-test
data: sleep.before and sleep.after
t = 0.7906, df = 5, p-value = 0.465
alternative hypothesis: true difference in means is not equal
to 0
95 percent confidence interval:
-1.501038 2.834372
sample estimates:
mean of the differences
0.6666667

Well, this test is NOT significant, and thus the data does not support an effect of exams on
students' sleeping time. But maybe you forgot about the party after the exam?
3.7 Non-parametric alternatives

In certain cases it is impossible to meet the assumption of normality required for standard t-tests,
even after transformations. In such cases a non-parametric alternative such as the Wilcoxon
family of tests may be appropriate. These include one-sample, two-sample (also named Mann-
Whitney U test) and paired alternatives, all available through the command wilcox.test().
The syntax of wilcox.test() is similar to that of t.test(), see ?wilcox.test.
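
For example, the butterfly dispersal data from the section on t-tests could be analyzed non-parametrically with:

wilcox.test(distance ~ sex)

(with tied values, as in this data, R will warn that an exact p-value cannot be computed).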

3.8 Correlation analysis


In some cases, you may wish to assess the relationship between two variables. One way to do this
is by using correlation analysis. The goal of correlation analysis is to determine how related two
variables are. This differs from regression analysis, which seeks to determine a line of best fit
for the relationship and assumes that the predictor variable directly affects (causes) the
outcomes of the response variable.

Pearson Correlation
One of the most common correlation analyses for parametric data is the Pearson product-
moment correlation coefficient, commonly called Pearson's r. This test seeks to determine the
level of relatedness between two variables using a score that runs from -1 (perfect negative
correlation) to 1 (perfect positive correlation). A value of zero indicates no correlation. Since
Pearson's r is parametric, it is advisable to test the assumption of normality before running this
test.

In R, Pearson's r can be calculated using the cor.test() command. Here we once again use
the sample data set iris to assess the correlation between sepal length and petal length:

cor.test(iris$Sepal.Length, iris$Petal.Length)

Pearson's product-moment correlation

data: iris$Sepal.Length and iris$Petal.Length


t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8270363 0.9055080
sample estimates:
cor
0.8717538

This data shows a highly-significant (P-value < 2.2e-16) and strongly positive (0.87) correlation
between these two variables. Note that in this case, the P-value is used to reject the null
hypothesis that the true correlation is equal to zero.

Spearman Correlation
A non-parametric alternative to Pearson’s r is Spearman's rank correlation coefficient, or
Spearman’s rho. Like Pearson’s r, Spearman’s rho determines the level of correlation of two
variables ranging from -1 to 1. The difference between the two measures is that Spearman uses
the rank-order of the data rather than the raw values.

Spearman's rho can also be calculated using the cor.test() command. Below we repeat the
previous correlation analysis, this time using Spearman's rho.

cor.test(iris$Sepal.Length, iris$Petal.Length, method="spearman")

Spearman's rank correlation rho

data: iris$Sepal.Length and iris$Petal.Length


S = 66429.35, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8818981

Note that this test produced similar, but not identical, results compared with Pearson's r.

3.9 Cross-tabulation and the χ2 test

Basic contingency tables have two categorical variables. In many cases we may wish to test
whether the two grouping variables are independent. One of the most common ways to analyze
contingency tables is with the χ2 test (chi-square test). The χ2 test works by first comparing
the observed values to the values expected under independence:

X2 = Σ (observed - expected)^2 / expected

The result of this calculation, the so-called X2 score, is then compared to the χ2 distribution to
calculate a p-value to determine if the observed values differ significantly from the expected
values.

In order to demonstrate the usage of χ2 tests we will use an example of eye color counts in two
different groups of flies. The dataset can be found in the attachments as 3.9_flies_eyes_color.csv.
We begin by loading the data and creating a contingency table.

flyeyes <- read.csv("3.9_flies_eyes_color.csv", header = T)

tab <- table(flyeyes$Eyecolor, flyeyes$Group)
tab

A B
Red 34 41
White 16 9

Note that the ratio between red and white eyes differs between group A and group B. We will use
chi-squared test to determine whether the data is more compatible with the null hypothesis that
the variables of eye color and group are independent of each other or with the alternative
hypothesis that eye color and group are not independent.

chisq.test(tab)

Pearson's Chi-squared test with Yates' continuity correction

data: tab
X-squared = 1.92, df = 1, p-value = 0.1659

In this case, based on a chi-squared value of 1.92 and 1 degree of freedom, we calculated a P-
value of 0.1659. Thus, we cannot reject the null hypothesis that eye color and group are independent.

Another useful command is the prop.test() command.

prop.test(tab)

data: tab
X-squared = 1.92, df = 1, p-value = 0.1659
alternative hypothesis: two.sided
95 percent confidence interval:
-0.43264180 0.05930846
sample estimates:
prop 1 prop 2
0.4533333 0.6400000

The first two lines return the same values as we observed with the chi-squared test. However, two
additional groups of numbers are present in this output. The first is the 95% confidence interval.
These two numbers represent the lower and upper estimates of the difference between the two
groups. Note that because the interval includes zero, we are not confident that there is a
difference between the two groups at all. The sample estimates show the estimated proportion of
group A individuals among red-eyed flies and among white-eyed flies, respectively.

3.10 Summary

 It is important to distinguish between continuous and categorical data when determining
which statistical test is most appropriate.
 Exploring data through the use of tables and graphs is critical to understanding the data.
 Many statistical tests rely on an assumption of a normal distribution.
 It is sometimes possible to transform non-normal data into normal data.
 Non-parametric tests should be used on data that is not normal even after transformation.
 T-tests compare the means of one or two data groups.
 Correlation analysis determines whether two variables are associated.
 Chi-squared tests are used to determine whether two categorical variables are independent.

3.11 Exercises

3-A
Below you find a number of descriptions of experiments.
Please assign the appropriate test: one-sample test, two independent groups or paired samples
A. You read that it costs on average 600 SEK you go to the hairdresser in Uppsala and you want
to find out whether that is actually true. You walk through the city and obtain prices from 10
hairdressers.
B. You investigate whether the flower color of your grandmother's orchids becomes more
intense after applying fertilizer. You score color intensity in 10 orchids before fertilizing and one
week after fertilizing.
C. You want to investigate whether arrival times at lectures differs between male and female
students. You come 15 minutes early to large lectures and record arrival time and sex of the
students. You obtain data from 60 women and 57 men.
D. You study two species of plants, red and white campion. You want to know which species has
larger flowers and measure flower size in 50 individuals of each species.
E. You want to study whether the hand people write with is stronger than their other hand. You ask
25 people to participate in your experiments and measure how well they can squeeze balloons
with either hand.

3-B
A nonparametric test is applied when:
a. There are not parameters
b. The variables are not independent
c. The groups do not have the same sample size
d. The assumptions of parametric tests are not met

3-C
Below you find data on the snow-melt times in two different habitats, snow-beds and ridges.
Use a t-test to find out whether there is a significant difference in snowmelt times between these
two habitats. Assume that the variances are equal. What is the P-value?
Generate the data using this code:
snowmelt <- c(110, 120,109,101,105,99,106,108,95,98)
habitat <- c(rep("snowbed",times=5), rep("ridge", times=5))

3-D

You are investigating two different nature conservation areas (area 1 and area 2). You would like
to know if interactions between poplar trees and leaf-eating insects differ between these two
reserves. For this purpose, you measured the leaf area that has been eaten (in %) on 10-20 year
old poplars. 52 trees were sampled in each reserve. The data is available in the
attachments as 3-D.
Does the consumed leaf surface differ significantly between the two reserves?
Hint, Conduct the appropriate statistical test, produce an appropriate graph and interpret the
results critically.

3-E
You want to understand interactions between insects and oaks. Do older trees support more
insects, and if so, how much more? For this purpose, you set up insect traps in 20 oaks trees of
different ages and measure the total dry weight of insects collected in one month (July). The data
is available in the attachments as 3-E.
Hint, conduct the appropriate statistical test, produce a graph and interpret the results.

3-F
You want to study willow shrub responses to herbivory – do willows produce tannins (that are
known to act as defense compounds) in response to herbivory?
For this purpose, you selected 10 willow shrub pairs with similar ages and sizes and growing
close to each other. In May and June you spray one shrub in each pair with insecticide each
week and the other one with water. At the end of June you measure tannin concentration in 50
leaves per tree. The data is available in the attachments as 3-F.
Hint, Conduct the appropriate statistical test, produce an appropriate graph and interpret the
results.

3-G
Explore the swiss data frame and indicate whether the following statements are true or false:
A. Fertility positively correlates with religiosity and agriculture, and correlates negatively with
education and examination
B. Education and examination are not strongly positive correlated
C. Swiss cities with better education are usually more catholic
D. More extensive agriculture causes more fertility
E. The more catholic you are the more fertile you will be

3-H

Using Spearman correlation and Pearson correlation, the correlation between fertility and
mortality is:
(a) 0.44 and 0.42 (b) 0.42 and 0.44 (c) 0.43 and 0.41 (d) 0.41 and 0.43

3-I
I am counting people who arrive at the bank on a sunny summer afternoon and I record the
color of their clothes. I found that 35 men wore a white t-shirt, 22 a blue one, and 7 a black one. On
the other hand, 14 women wore a white dress, 12 a light-blue one and 8 a dark one. Answer the following
questions:
A. Is the color being randomly chosen?
B. Assuming that the female/male ratio in the city where the bank is situated is 50:50, do men
prefer go more often to the bank than women?
C. Is color selection gender-biased?

3-J
You receive the following data regarding behavior and colour of crabs:
Blue Red
Aggressive 36 24
Passive 32 28
Are colour and behavior independent?

Solutions
3-A
A. one-sample test, B. paired samples, C. two-independent groups, D. two-independent groups, E. paired samples
3-B
d
3-C
0.09112: t.test(snowmelt~habitat). Data indicates that there is no significant difference between the two habitats.
3-D
#Load data, check object and data
poplar <- read.table(file.choose(), sep=";", header=T, dec=",")
str(poplar)
#convert area to a factor
poplar$area <- as.factor(poplar$area)
#plot data
boxplot(consumed_surface~area, data=poplar, xlab="Area", ylab="Consumed surface")
#check normality
par(mfrow=c(2,2))
hist(poplar$consumed_surface[poplar$area=="1"], main="Area 1")
hist(poplar$consumed_surface[poplar$area=="2"], main="Area 2")
qqnorm(poplar$consumed_surface[poplar$area=="1"]); qqline(poplar$consumed_surface[poplar$area=="1"])
qqnorm(poplar$consumed_surface[poplar$area=="2"]); qqline(poplar$consumed_surface[poplar$area=="2"])
#The data show only minor deviations from normality, so a two-sample t-test is appropriate
t.test(consumed_surface~area, data=poplar)
#Output and graphs
means <- tapply(poplar$consumed_surface, poplar$area, mean)
se <- tapply(poplar$consumed_surface, poplar$area, function(x) sd(x)/sqrt(length(x)))
par(mfrow=c(1,1))
mp <- barplot(means, ylim=c(0,20), las=1, xlab="Area", ylab="Consumed surface")
arrows(mp, means-se, mp, means+se, angle=90, length=0.2, code=3, col="black", lty=1, lwd=1)
Interpretation: The consumed surface is around 17.5% in both areas and does not differ significantly between areas.
3-E
#Load data, check object and data
grazing <- read.table(file.choose(), sep=";", header=T, dec=","); str(grazing)
par(mfrow=c(1,1))
plot(grazing$insect ~ grazing$age, xlab="Age of the tree (year)", ylab="Occurrence of insects")
#Analysis: linear regression
model <- lm(grazing$insect ~ grazing$age)
#Checking assumptions
par(mfrow=c(2,2)); plot(model)
#Output
summary(model); anova(model)
#This graphic is optional
p_conf1 <- predict(model, interval="confidence"); p_pred1 <- predict(model, interval="prediction")
par(mfrow=c(1,1))
plot(grazing$insect ~ grazing$age, xlab="Age of the tree (year)", ylab="Occurrence of insects"); abline(model)
matlines(grazing$age, p_conf1[,c("lwr","upr")], col=2, lty=1, type="b", pch="+")
matlines(grazing$age, p_pred1[,c("lwr","upr")], col=2, lty=2, type="b", pch=1)
Interpretation: Insects occur more frequently on older trees; on average, an increase of one year in age is related to an increase of 16 mg in the dry insect biomass supported.
3-F
#Load data, check object and data
tannin <- read.table(file.choose(), sep=";", dec=",", header=T); str(tannin)
#Testing assumptions on the pair-wise differences
tannin$diff <- tannin$water - tannin$insecticide
qqnorm(tannin$diff); qqline(tannin$diff)
#Analysis: paired t-test
t.test(tannin$water, tannin$insecticide, paired=TRUE)
mean.diff <- mean(tannin$diff); mean.diff
se <- sd(tannin$diff)/sqrt(length(tannin$diff)); se
Interpretation: Insecticide application significantly reduced the tannin content as compared to the water treatment, suggesting that the presence of insects induces tannin production.
3-G A. True, B. False, C. False, D. False, E. False
3-H a
3-I
A. No, p-value = 0.0001381, chisq.test(c(35+14,22+12,7+8))
B. Yes, p-value = 0.002442, chisq.test(c(35 + 22 + 7, 14 + 12 + 8))
C. No, p-value = 0.2105,summer <- as.table(rbind(c(35, 22, 7), c(14, 12, 8))); dimnames(summer) <- list(gender=c("M","F"),
color = c("White","Blue", "Black"));chisq.test(summer)
3-J Yes, P=.5805

4 Linear models

4.1 Overview - In this section you will

 learn how to design and interpret linear models


 assess whether models meet assumptions using analysis of residuals
 learn how to define and interpret models including interaction terms

4.2 Classes of linear models


Linear models are a large family of statistical analyses that relate a continuous response to one
or several explanatory variables. Explanatory variables can be grouping factors or continuous variables or
a combination of both.

One-way analysis of variance (ANOVA)

Tests whether means of more than two groups are the same, for example
whether fruit production differs among five populations of a plant species.
If there are only two groups, a t-test is the way to proceed. ANOVA relates
variance within groups to variance between groups. The analysis does not,
however, tell you which groups are significantly different from each other.
For this purpose a Tukey test can be applied.
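
As a minimal sketch (using the airquality data that appears later in this chapter; aov() fits the same one-way ANOVA as lm()):

airquality$Month <- as.factor(airquality$Month) #turns Month into a factor
TukeyHSD(aov(Ozone ~ Month, data = airquality))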

Two-way ANOVA
This analysis assesses the influence of two grouping factors on group
means, for example, whether irrigation and fertilization have an effect on
plant growth. Importantly, two-way ANOVA can also analyze whether the
two factors interact, in the example, whether the effect of irrigation depends
on the fertilizer level (or the other way around). This is called a statistical
interaction. The same methods can also be applied to studies with more
than two grouping factors (multi-way ANOVA).

Linear regression

Linear regression analyzes to what extent changes in a continuous


explanatory variable result in changes in the response variable, for example
whether larger females cause longer male courtship behavior. If a causal
relationship cannot be assumed a correlation analysis should be used. This
type of analysis can also be conducted with more than one continuous
explanatory variable (multiple regression).

Analysis of covariance (ANCOVA)

ANCOVA allows more complicated analyses that involve effects of


grouping factors, continuous explanatory variables and their interactions. An example is
an analysis of whether the response to different doses of a medication
differs between male and female patients. Such more complicated linear
models can also include more than two explanatory factors.

4.3 Workflow for linear models

First, start by exploring the data through a basic plot (check plot section). Based on this, define
the model and analyze the residuals. If the residuals are normally distributed, obtain and
interpret the results. Otherwise, try transforming the data and re-checking the residuals or use a
different model (Figure 4-1).

[Flowchart: Plot data -> Define model -> Analysis of residuals; if OK, obtain and interpret results; if not OK, try a transformation or a different model]

Figure 4-1 Flowchart of linear model

4.4 Defining the model

First you must define the linear model that you want to use, using the lm() function. Within
this function a so-called formula statement defines the relationship of the variables to each other.
The response variable is always on the left side of the tilde symbol (~) and the explanatory
variable(s) are on the right side, as in lm(response ~ explanatory.variables). For
instance, if one uses the airquality R internal dataset and we want to make a model to predict
ozone content in the atmosphere using wind speed, the model definition would be as follows:

My.model <- lm(Ozone ~ Wind, data = airquality)

Observe that we are assigning the model to an object, here My.model. This is
good practice since later you can just use the object for testing assumptions and for
extracting results.

Whether variables are categorical or continuous is defined in the data itself and can be
checked with str() (check data section); R will calculate the appropriate model by itself. Thus, the
following formula statement will yield a one-way ANOVA:

airquality$Month <- as.factor(airquality$Month)


#turns Month into a factor
lm(Ozone ~ Month, data = airquality)

While the following formula statement will yield a regression analysis:

lm(Ozone ~ Temp, data = airquality)

Formula statements are further used to combine explanatory variables and to define interactions.
If variables should be considered only by themselves (additive effects), for example in a multiple
regression without interaction you connect the variables by a plus sign as in:

lm(Ozone ~ Temp + Wind, data = airquality)

On the other hand, if you want to consider interactions in addition to the additive effects use an
asterisk (*) between the explanatory variables, as in:

lm(Ozone ~ Temp * Wind, data = airquality)
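
The asterisk is shorthand for the main effects plus their interaction: Temp * Wind expands to Temp + Wind + Temp:Wind, so the model above is equivalent to:

lm(Ozone ~ Temp + Wind + Temp:Wind, data = airquality)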

Checking assumptions with diagnostic plots


All linear models including regression, one-way ANOVA and ANCOVA have the following
assumptions:

1. The experimental units are independent and sampled at random. The


independence assumption depends heavily on the experimental design.
2. The residuals have constant variance across values of explanatory variables.

3. The residuals, i.e. the differences between the observed values of a response
variable and the values fitted by the model, are normally distributed with a mean
of zero.

Analysis of residuals is thus a key step when conducting linear model analyses.

We can for now concentrate on two diagnostic plots for the analysis of residuals. The first is the Tukey-
Anscombe plot, which displays residuals vs. fitted values. Fitted values are those predicted by the
model.

The other diagnostic plot is the qq-plot that was explained in the section on basic statistical analyses.

You obtain both of these graphs from your model object using the plot() command.

In the following example, one is interested in analyzing whether wind speed is a good predictor
of ozone levels using the airquality internal dataset from R, as in the previous section
(defining the model). The output is presented in Figure 4-2.

My.model <- lm(Ozone ~ Wind, data = airquality)
par(mfrow=c(1,2))
plot(My.model, which = c(1,2))

Figure 4-2 Residuals and Q-Q plot for a situation when assumptions are not fulfilled.

In the first graph, the Tukey-Anscombe plot, we expect a random scatter of points around zero.
If there is any pattern in the graph, such as a funnel shape, the model fit is not good; in our
example, the residuals at higher fitted values are much larger than those at low values, violating
assumption 2 above. This is very common, especially in measurement data, and can be related to a
larger variation at higher values. A log-transformation of the response variable often improves
model fit, as in this case (see Figure 4-3). The corresponding qq-plot, testing assumption 3
above, also improves after log-transformation, and thus this analysis
should definitely be based on log-transformed values.

My.model2 <- lm(log(Ozone) ~ Wind, data = airquality)
plot(My.model2, which = c(1,2))

Figure 4-3 Residuals and Q-Q plot for a situation when assumptions are fulfilled.

4.5 Analyzing and interpreting the model


An ANOVA table shows how much variation in the response is explained by the explanatory
factors. To get the ANOVA table, use the command anova(My.model), where My.model is
the object that stores the defined model (i.e. the model in the previous section that used the
airquality internal dataset). Below you find the command and its corresponding output.

airquality$Month <- as.factor(airquality$Month)


#turns Month into a factor
My.Model <- lm(Ozone ~ Month, data = airquality)
anova(My.Model)

Analysis of Variance Table

Response: Ozone
Df Sum Sq Mean Sq F value Pr(>F)
Month 4 29438 7359.5 8.5356 4.827e-06 ***
Residuals 111 95705 862.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In the output, Df stands for degrees of freedom, which is the number of values in the final
calculation of a statistic that are free to vary; the F value is the test statistic, calculated as
the ratio between the explained and the unexplained variance; and the corresponding p-value for
this F statistic is the probability of obtaining an F value at least as large as the observed one if the
null hypothesis (no effect of the explanatory variable on the response variable) is true.

The command summary(model) will show the parameters estimated by the model, for
example the slope of the regression for regression analyses or the difference between group
means for ANOVA.

My.model <- lm(log(Ozone) ~ Wind, data = airquality)

summary(My.model)

lm(formula = log(Ozone) ~ Wind, data = airquality)

Residuals:
Min 1Q Median 3Q Max
-3.4219 -0.4662 0.0663 0.5021 1.4035

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.75331 0.21879 21.726 < 2e-16 ***
Wind -0.13726 0.02153 -6.376 4.39e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.723 on 110 degrees of freedom


(37 observations deleted due to missingness)
Multiple R-squared: 0.2699, Adjusted R-squared: 0.2632
F-statistic: 40.65 on 1 and 110 DF, p-value: 4.389e-09

The first part of the output indicates the model that was run and the distribution of the residuals by
means of quantiles (see basic stats section). In the first row of the table under the title
"Coefficients" one can find the estimate of the intercept of the regression line, a t-value, which is
the test statistic that tests whether the estimate for the intercept is different from zero, and the
p-value for this test.

In the second row of the same table one can find the slope, which is the effect of the explanatory
variable on the response variable (a.k.a. the magnitude of the effect), and the same test statistic and
p-value as before, but this time for the test that the slope is significantly different from zero. If
the slope is different from zero, then there is an appreciable effect. The effect can either be
positive or negative (gray and black lines, respectively, in the case III plot in Figure 4-5). Notice
that for these two tests there is a significance coding that is shown at the bottom of the table.

Finally, R2, the percentage of variance explained by the model, is presented at the
bottom of the output. This indicates how much of the variation in the data is explained by the
model.

Interpreting interactions
Sometimes the effect of an explanatory variable on a response variable depends on another
explanatory variable; this is termed an interaction. Below you will find some examples of what
interactions can look like and how to interpret them. In the following graphs, different colors
symbolize different grouping variables. The x-axis is the explanatory variable, categorical in the

case of the bar-plots and continuous in the case of the dot-plots with tendency line. The Y-axis is
the continuous response variable.

In the following graph (Figure 4-4), you can assume that the continuous response variable is seed
productivity, the white and dark-gray colors correspond to watering treatments (irrigation and
drought) and A and B are different populations. A two-way ANOVA is the right test to be
applied in a scenario such as this, a graphical representation of which is shown in Figure 4-4.
The first two figures show cases where the interaction term of a two-way ANOVA is not
significant, while in the last two cases the interaction term of the ANOVA is significant. In the
first case, the response variable (i.e. seed productivity) differs between populations but not
between treatments. In the second case it also differs between treatments. Observe that in the
fourth case the overall mean between the populations is the same. In this context the interaction
cases (III & IV) can be interpreted as follows: the effect of the treatment (water availability) on
the response variable (i.e. seed productivity) depends on the population.

Figure 4-4 Bar plots for two treatments (white and grey) and two populations (A and B), cases I to IV
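In R, such a two-way ANOVA with an interaction could be specified as follows (a sketch with
hypothetical names: a data frame seeds with columns seed_prod, treatment and population):

model <- lm(seed_prod ~ treatment * population, data = seeds)
anova(model)  # the treatment:population row tests the interaction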
Now, assume that instead of having the treatment as a categorical variable, there is a spectrum of
different values. For instance, instead of drought and watered treatments, we measured the
amount of water naturally available in the soil (x-axis in Figure 4-5) in the two different
populations (black and empty dots). ANCOVA is the test to apply in this particular case. To the
left of Figure 4-5 you see a case where the interaction term of an ANCOVA is not significant
(case I), while in the next two cases the interaction term is significant. In other words, if the
interaction term is significant, then how the response variable (i.e. seed productivity) varies as a
function of the continuous explanatory variable (i.e. water availability) depends on the
population. As discussed in the section on interpreting the model, this effect can be assessed by
looking at the slope. Observe that in the first two cases the response variable also varies between
populations, which is not the case in the third panel (mentally project the empty and filled dots
onto the y-axis).
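The corresponding ANCOVA sketch, with water as a continuous explanatory variable (again
hypothetical names):

model <- lm(seed_prod ~ water * population, data = seeds)
anova(model)  # a significant water:population term means the slopes differ between populations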

Figure 4-5 Dot plots for two populations (filled and empty dots), cases I to III

4.6 Worked Examples

The following examples follow the workflow structure (see here): make an exploratory plot,
define the model, check the assumptions, and analyze and interpret the summary and test
statistics.

One-way ANOVA
To test whether fruit production differs between populations of Lythrum, fruits were counted on
10 individuals in each of 3 populations.

fruits <- data.frame(fruits = c(24, 19, 21, 20, 23, 19, 17, 20,
23, 20, 11, 15, 11, 9, 10, 14, 12, 12, 15, 13, 13, 11,
19, 12, 15, 15, 13, 18, 17, 13), pop = c(rep(c(1), 10),
rep(c(2), 10), rep(c(3), 10)))
fruits$pop <- as.factor(fruits$pop)
plot(fruits ~ pop, data = fruits)
Figure 4-6 Boxplot showing the distribution of fruits per population

model<-lm(fruits~pop,data=fruits)

par(mfrow=c(1,2)); plot(model, which=c(1,2))

Figure 4-7 Distribution and QQ plot for residuals of the one-way ANOVA

anova(model)

Analysis of Variance Table

Response: fruits
Df Sum Sq Mean Sq F value Pr(>F)
pop 2 420.0 210.000 19.104 6.767e-06 ***
Residuals 27 296.8 10.993
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The analysis shows that fruit production differs among populations. You can also make a bar
plot of these data; for that, please refer to the section on how to make plots.
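As a follow-up, you may want to know which populations differ from each other. One common
sketch of a post-hoc test (not covered elsewhere in this script) uses Tukey's HSD on the
equivalent aov() fit:

TukeyHSD(aov(fruits ~ pop, data = fruits))  # pairwise differences between populations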

Two-way ANOVA
In a study on pea cultivation methods, pea production was assessed under two irrigation
treatments (normal irrigation and drought) and three radiation treatments (low, medium and
high), with 10 plants in each of the six combinations.

plants <- data.frame(
seeds = c(39,39,39,40,40,39,41,42,40,40,39,38,41,41,40,41,40,40,41,40,
38,40,40,39,42,40,39,41,39,40,39,40,41,40,41,39,40,41,40,39,
42,40,39,39,42,40,39,39,39,39,41,38,40,39,41,42,40,40,40,41),
irrigation = c(rep(1, 30), rep(2, 30)),
radiation = rep(c(1, 2, 3), 20))
plants$irrigation <- as.factor(plants$irrigation)
plants$radiation <- as.factor(plants$radiation)
par(mfrow=c(1,2)); plot(seeds ~ irrigation * radiation, data = plants)

Figure 4-8 Boxplots for seed numbers (response variable) categorized by irrigation and
radiation (explanatory variables)

model<-lm(seeds~irrigation*radiation,data=plants)
par(mfrow=c(1,2));plot(model, which=c(1,2))
Figure 4-9 Distribution and QQ plot for residuals of the two-way ANOVA

anova(model)

Analysis of Variance Table

Response: seeds
Df Sum Sq Mean Sq F value Pr(>F)
irrigation 1 0.067 0.0667 0.0747 0.785671
radiation 2 2.233 1.1167 1.2510 0.294370
irrigation:radiation 2 11.433 5.7167 6.4046 0.003192 **
Residuals 54 48.200 0.8926
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This analysis shows that the effect of irrigation depends on the level of radiation, as the
interaction term is significant. To find out more, this analysis should be followed by separate
analyses within the irrigation or radiation treatments (split the dataset), as sketched below.
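A minimal sketch of such a follow-up, testing the radiation effect separately within each
irrigation treatment:

anova(lm(seeds ~ radiation, data = subset(plants, irrigation == "1")))
anova(lm(seeds ~ radiation, data = subset(plants, irrigation == "2")))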

Linear regression
In an experiment testing whether the duration of male courtship behavior depends on female
size, 16 pairs of earwigs were observed.

sex <- data.frame(pair = 1:16, fem_size = c(58.84, 60.37, 57, 59.86,
61.42, 60.34, 60.1, 59.63, 58.06, 61, 58.61, 60.94, 60.83,
57.7, 60, 59.09), male_court_hrs = c(8.37, 9.88, 10.12,
8.39, 9.93, 9.69, 8.68, 11.74, 11.07, 8.69, 10.53, 10.38,
10.12, 11.14, 8.6, 11.26))
plot(male_court_hrs~ fem_size,data=sex)
Figure 4-10 Plotting duration of male courtship against female size

model<-lm(male_court_hrs~ fem_size,data=sex)
par(mfrow=c(1,2));plot(model, which=c(1,2))

Figure 4-11 Distribution and QQ plot for residuals of the linear regression

summary(model)

Call:
lm(formula = male_court_hrs ~ fem_size, data = sex)

Residuals:
Min 1Q Median 3Q Max
-1.7679 -0.8837 0.2574 0.6771 1.8334

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.3698 12.8304 2.133 0.0511 .
fem_size -0.2929 0.2152 -1.361 0.1950
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.071 on 14 degrees of freedom


Multiple R-squared: 0.1168, Adjusted R-squared: 0.05376
F-statistic: 1.852 on 1 and 14 DF, p-value: 0.195

This analysis suggests that female size does not affect the duration of male courtship behavior
(the slope does not differ significantly from zero, p = 0.195).
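If you nevertheless want to visualize the fitted relationship, abline() can add the
(non-significant) regression line to the scatterplot:

plot(male_court_hrs ~ fem_size, data = sex)
abline(model)  # draws the fitted line from the lm object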

4.7 Summary

 Linear models specify a linear relationship between a response variable and one or more
explanatory variables.
 When working with linear models, the workflow involves exploring the data with a plot,
defining the model, analyzing the residuals and, depending on the outcome of this step,
interpreting the results or transforming variables and/or trying another model.
 The assumptions of linear models are that the experimental units are independent and
sampled at random, that the residuals have constant variance across values of the explanatory
variables, and that the residuals are normally distributed with a mean of zero.
 Interactions are cases in which the effect of an explanatory variable on a response variable
depends on another explanatory variable.
 Use the command model<-lm() to define the model, the command plot(model) to
check the assumptions, and the commands summary(model) and anova(model) to
retrieve the model estimates and an ANOVA table of the analysis.
 When defining a model with two or more explanatory variables, use * to include both direct
and interaction effects, + to include only direct (additive) effects (see the sketch below).
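For example, with a hypothetical response y and two explanatory variables a and b:

lm(y ~ a * b)   # direct effects of a and b plus their interaction (same as y ~ a + b + a:b)
lm(y ~ a + b)   # direct (additive) effects only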

4.8 Exercises

Exercise 4-A
R has several internal datasets that do not need to be loaded. What sort of linear model is
conducted by each line of code below? Choose between one-way ANOVA, two-way ANOVA, linear
regression, multiple regression, and ANCOVA. In order to decide, use the command
str(…) to understand the structure of each internal dataset (e.g. str(ChickWeight)). You can
also plot the data and run the code.
A. lm(weight~Time,ChickWeight)

B. lm(weight~Time*Chick,ChickWeight)

C. lm(GNP.deflator~Unemployed + Population,longley)

D. lm(weight~Diet+Chick,ChickWeight)

E. lm(weight~Diet,ChickWeight)

Exercise 4-B
Seven economic indicators were collected over 16 years (1947 to 1962) and are available in the
data frame called "longley". GNP.deflator is the GNP implicit price deflator, a measure of
inflation. GNP is the gross national product, Unemployed/Employed is the number of
unemployed/employed people, Armed.Forces is the number of people in the armed forces and
Population is the 'noninstitutionalized' population over 14 years of age. Answer the following
questions:
A. Does unemployment imply a reduction in the gross national product? Does employment
increase the gross national product?
(a) YES / YES (b) NO / YES (c) NO / NO (d) YES / NO
B. What percentage of the variation in the gross national product is explained by the employment
rate? (Hint: refer to the concept of percentage of explained variance covered here)
C. How much (euros) does the gross national product increase for every person that is newly
employed? (Hint: refer to the concept of magnitude of effect and slope covered here)

Exercise 4-C
Are the following models adequate in terms of the distribution of their residuals? (Use the
"longley" internal dataset.)
A. lm(GNP.deflator~Unemployed+Armed.Forces+Population,longley)

B. lm(GNP~Employed,longley)

Exercise 4-D
Forests are sometimes fertilized with nitrogen compounds to increase their growth. However, this
could lead to a change in herbivory. 42 three-year-old birch trees were used in a greenhouse
experiment. They were divided into six groups of seven trees each and subjected to two
fertilization treatments (yes and no) and three herbivory treatments (none, low, high), resulting in
six combinations of treatments. One tree died, so one treatment combination is missing a
replicate. The data are available in the attachments as 4-D. How do trees react to fertilization and
herbivory? Are these effects independent? Does fertilization increase herbivory risk?
Hint: conduct the appropriate statistical test, produce an appropriate graph and interpret the
results.

Exercise 4-E
In this exercise you are going to use linear models to perform a selection analysis in the
orchid Gymnadenia conopsea. A selection analysis is actually “simply” a multiple regression
analysis in which the response variable is fitness and the explanatory variables are the
different quantitative phenotypic traits. The idea is that if there is a relationship between a
phenotypic trait and fitness, some values of the trait are favored, i.e. the trait is under
selection. The strength of selection is represented by the slope of the regression line between
fitness and the phenotypic trait (see next figure).

The data and full exercise are available in the attachments as 4Ea and 4Eb.
Hint: You will have to practice your skills on handling a data set (checking for outliers, subsetting,
computing and adding new variables), getting and interpreting descriptive statistics (means,
correlations), graphics (exploration, bar plots with error bars) and linear models (multiple
regression), transforming and standardizing variables, and extracting values from a statistical
output for plotting.

Solutions
4-A
A. linear regression, B. ANCOVA, C. multiple regression, D. two-way ANOVA, E. one-way ANOVA.
4-B
A. b: summary(lm(GNP~Unemployed,longley)); summary(lm(GNP~Unemployed,longley))$coefficients["Unemployed",];
summary(lm(GNP~Employed,longley)); summary(lm(GNP~Employed,longley))$coefficients["Employed",]. B. 97%:
summary(lm(GNP~Employed,longley))$r.squared. C. ~28 EUR: summary(lm(GNP~Employed,longley)) (the slope for Employed)
4-C
A. T:plot(lm(GNP.deflator~Unemployed+Armed.Forces+Population,longley)), B. T:plot(lm(GNP~Employed,longley))
4-D
# Load data and check the object
fertilization <- read.table(file.choose(), sep=";", header=T, dec=",")
str(fertilization)
# Make sure that the response is numeric
fertilization$growth <- as.numeric(fertilization$growth)
interaction.plot(x.factor=fertilization$fertilization, trace.factor=fertilization$herbivory,
response=fertilization$growth, cex.axis=1)
# Analysis: two-way ANOVA
lm.fert.lab <- lm(growth ~ fertilization*herbivory, data=fertilization)
# Testing the assumptions: residual analysis
par(mfrow=c(2,2)); plot(lm.fert.lab)
# The assumptions are met: residuals are normally distributed and the model fit is satisfactory
# Result output
anova(lm.fert.lab)
# Means, standard errors and barplot
means <- tapply(fertilization$growth, list(fertilization$fertilization, fertilization$herbivory), mean)
se <- tapply(fertilization$growth, list(fertilization$fertilization, fertilization$herbivory),
function(x) sd(x)/sqrt(length(x)))
mp <- barplot(means, beside=T, ylim=c(0,55), las=1, xlab="Herbivory", ylab="Growth (cm)", col=c(0,8))
legend(5, 55, legend=c("No","Yes"), fill=c(0,8), bty="n", title="Fertilization", horiz=T)
arrows(mp, means-se, mp, means+se, angle=90, length=0.05, code=3, col="black", lty=1, lwd=1)
Interpretation: The interaction between herbivory and fertilization has a significant effect on growth, indicating that the effect
of fertilizer depends on whether or not there is grazing, and vice versa. It is difficult to interpret the main effects (i.e. fertilizer
and grazing) when the interaction is significant; if this is desired, the data need to be split into the levels of one of the factors
and reanalyzed with one-way ANOVAs. From the graph of means it is clear that fertilization generally increases growth, and the
higher the grazing intensity, the more pronounced the growth response to fertilization.
4-E
Get the solution with detailed explanations from the attachments as 4Ec

5 Basic graphs with R

5.1 Bar-plots

Goal
In this section, you will learn how to script a grouped barplot of means with standard errors
indicated as T-bars (Figure 5-1), and how to adjust the layout of this type of graph.

Figure 5-1 Example of a grouped barplot of means with standard errors (as T-bars)

How to do it
We are going to use the internal dataset ToothGrowth (available with R installation), which
contains measurements of tooth length in guinea pigs that received three levels of vitamin C and
two supplement types (Figure 5-1). To explore this dataset you can use ?ToothGrowth,
str(ToothGrowth) and summary(ToothGrowth).

We want to produce a barplot of the mean tooth length for all six combinations of the two
factors (supplement type: 2 levels, dose: 3 levels). We first need to calculate the mean tooth
length for each of the combinations. For this, we use the command tapply(). tapply() can
return a table with mean tooth lengths for all six combinations, and this table will be the input
for the barplot. Importantly, tapply() will create a matrix with two rows and three columns
corresponding to the factor levels in the dataset, as you can see below. This structure is needed
to produce a grouped barplot.

mean.tg <- tapply(ToothGrowth$len, list(ToothGrowth$supp,
ToothGrowth$dose), FUN = mean)
mean.tg

0.5 1 2
OJ 13.23 22.70 26.06
VC 7.98 16.77 26.14

You are now ready to use the command barplot(). The first argument is the data to use, in
this case our mean.tg matrix. Using the argument beside = T indicates that the bars
should be plotted beside each other instead of on top of each other.

barplot(mean.tg, beside = T)

We can customize the layout of the barplot using further arguments; otherwise default options
will be used. The labels of the axes can be specified with the arguments xlab and ylab, and the
labels below each group of bars are controlled with the argument names. The font size of these
labels can be changed with cex.lab and cex.names. These arguments are set to 1 by
default and changes are relative to this default: for example, cex.lab = 2 will double the
font size. The limits of the y-axis are specified with ylim; here we use zero and the maximum
of the dataset. The orientation of the axis labels can be altered with the argument las, which
has four options (0, 1, 2, 3); here, las = 1 produces horizontal axis labels. The colors of
the bars are determined by col, in our example by a vector of length two for the two
groups, specifying 0 (white) and 8 (grey). Colors can be specified either with numbers
or with color names. To get an overview of all available color names, type
colors(). You can further explore colors at http://research.stowers-
institute.org/efg/R/Color/Chart/.

barplot(mean.tg, beside = T, xlab = "Dose (mg)", ylab = "Tooth
length (cm)", names = c("0.5", "1.0", "2.0"), cex.lab =
1.3, cex.names = 1.2, col = c(0, 8), ylim = c(0,
max(ToothGrowth$len)), las = 1)

The next step is to add error bars to the barplot. There is no standard command to add error
bars; instead, we have to draw them ourselves with the command arrows(). First, we need
the standard error of the mean for all six groups. We do this in the same way as calculating the
means: we use tapply() but ask for the standard error. Besides the length of the error bars,
we also need the horizontal positions of the bars, such that the error bars end up in the middle
of the bars. These midpoints, in the same matrix format as the means above, can be extracted
from a basic barplot(). We assign the barplot(…) command to an object, here named
midpoints, and use plot = F to suppress the plotting, as we want to use the improved
barplot we produced above.

sem.tg <- tapply(ToothGrowth$len, list(ToothGrowth$supp,
ToothGrowth$dose), FUN = function(x) sd(x) /
sqrt(length(x)))
midpoints <- barplot(mean.tg, beside = T, plot = F)

Now we are ready to draw the error bars using the command arrows(). Within this command
we first state the position of the error bars by two sets of x and y coordinates corresponding to
the start and end of the error bars. The coordinates are given in our matrix format and
correspond to the six bars on the graph. The starting coordinate set, midpoints, mean.tg
- sem.tg, identifies the six midpoints as x coordinates and the means minus the standard
errors as the six y coordinates. Likewise, midpoints, mean.tg + sem.tg is used for the
end of the error bars. We further use the arguments code = 3 and angle = 90 such that we
get bars with T's on both ends rather than arrowheads. The arguments length and lwd set
the size of the T's and the line width of the entire error bars.

arrows(midpoints, mean.tg - sem.tg, midpoints, mean.tg + sem.tg,
angle = 90, length = 0.1, code = 3, lwd = 1.5)

We can further use the command legend() to add a legend with our groups to the graph. We
can specify the place of the legend either with coordinates (here, 0.75, 30) or with options
such as "topright" or "topleft" (see the help for legend()). The fill argument
produces boxes with the specified colors next to the legend text. bty determines whether a
box is drawn around the legend (default: bty = "o", with box); here, bty = "n" removes the
box. The font size in the legend is determined by cex, as explained above.

legend(0.75, 30, legend = c("Orange juice", "Ascorbic acid"),
fill = c(0, 8), bty = "n", cex = 1.1)

There are many more details of the plot that can be controlled and changed. For an overview of
the graphical parameters that can be changed by arguments use ?par.

Summary

 A barplot can be made with the command barplot(), a higher-level plotting command
that creates a new graph. Mean values to be plotted should be calculated first with the
command tapply().
 Error bars, calculated with tapply(), and a legend can be added with the lower-level
plotting commands arrows() and legend() that add extra features to an existing graph.
 A large number of graphical parameters can be passed as arguments to customize plots.

5.2 Grouped scatter plot with regression lines

Goal
In this section, you will learn how to script a grouped scatterplot with regression lines
(Figure 5-2) and how to adjust the layout of this type of graph.

Figure 5-2 Example of a grouped scatterplot with regression lines

How to do it
To produce a scatterplot, we will use the plot() command. plot() is a higher-level plotting
command that creates a new graph.

We are going to use part of the internal dataset iris (available with the R installation) as an
example (Figure 5-2). iris contains flower measurements of three different Iris species. You can
explore the dataset with ?iris, summary(iris) and str(iris). To reduce the dataset
to two species and to plot all the data points use:

iris.short <- iris[1:100, ]
plot(iris.short$Sepal.Length, iris.short$Sepal.Width)

We can now assign two different plotting symbols to the species by creating a new column in
the data frame iris.short, named iris.short$pch, that contains the number of the
plotting symbol to be used. There are 26 different plotting symbols, numbered 0 to 25. Here
we use symbol 1 for Iris setosa and symbol 16 for Iris versicolor. The same procedure can be used
to assign different colors to the two species (see above). We then set the axis labels, ranges and
orientation as well as the font sizes using xlab, ylab, xlim, ylim, las, cex.axis and
cex.lab as explained above.

iris.short$pch[iris.short$Species == "setosa"] <- 1
iris.short$pch[iris.short$Species == "versicolor"] <- 16
plot(iris.short$Sepal.Length, iris.short$Sepal.Width,
xlab = "Sepal length (mm)", ylab = "Sepal width (mm)",
xlim = c(4, 7.5), las = 1, cex.axis = 1.2, cex.lab = 1.3,
pch = iris.short$pch)

The next step is to add a regression line for each species, assuming that sepal length causes
changes in sepal width (which may or may not be reasonable). For this, we first have to model
the regression lines. Subsequently, we plot lines corresponding to these models with the lower-
level plotting command lines(). Each line is specified by x and y coordinates, both vectors:
the x-vector contains the sepal lengths and the y-vector contains the sepal widths predicted by
the model. We increase the line width using lwd = 1.5.

model.seto <- lm(Sepal.Width ~ Sepal.Length, data =
iris.short[iris.short$Species == "setosa", ])
lines(iris.short$Sepal.Length[iris.short$Species == "setosa"],
predict(model.seto), lwd = 1.5)
model.versi <- lm(Sepal.Width ~ Sepal.Length, data =
iris.short[iris.short$Species == "versicolor", ])
lines(iris.short$Sepal.Length[iris.short$Species == "versicolor"],
predict(model.versi), lwd = 1.5)

We should also add a legend to the figure. This is similar to above, and we can produce species
names in italics using the command expression(italic()) for each legend entry.

legend("topright", legend = c( expression( italic("Iris setosa")),


expression( italic( "Iris versicolor"))), pch = c(1, 16),
cex = 1.1, bty = "n")

There are many more details of the plot that can be customized. An overview of the graphical
parameters that can be changed can be viewed using ?par.

Summary

 Scatter plots can be created with the higher-level plotting command plot()
 A new vector in the data frame can be used to specify plotting symbols and colors
 The lower-level plotting command lines() can be used to add regression lines from a
linear model

5.3 Exercises

Exercise 5-A
Please use the dataset below to produce a scatterplot where each point has a different
color and symbol.
x <- c(2,3,4,5,7,8,9,10)
y <- c(10,14,14,17,18,22,23,26)

Exercise 5-B
Starting from the graph produced by the code below, change the symbols to filled red
triangles that are twice as large.
plot(iris$Sepal.Length, iris$Sepal.Width,
xlab="Sepal length (mm)", ylab="Sepal width (mm)")

Exercise 5-C

Please use the internal dataset CO2 and create the barplot below. Hint: use
str(CO2) and ?CO2 to find out more about the dataset; the argument horiz = T to
legend will place legend entries horizontally.

Solutions
5-A
x<-c(2,3,4,5,7,8,9,10)
y<-c(10,14,14,17,18,22,23,26)
plot(x,y, las=1, cex.lab=1.5, cex.axis=1.5,cex=1.5, pch=c(1:8), col=c(1:8))

5-B
plot(iris$Sepal.Length, iris$Sepal.Width, xlab="Sepal length (mm)", ylab="Sepal width (mm)", pch=17, col="red",
cex=2)

5-C
means <- tapply(CO2$uptake, INDEX = list(CO2$Type, CO2$Treatment), FUN = mean)
se <- tapply(CO2$uptake, INDEX = list(CO2$Type, CO2$Treatment), FUN = function(x) sd(x)/sqrt(length(x)))
uptake <- barplot(means, beside = T, col = c(0,2), las = 1, ylim = c(0,50))
arrows(uptake, means-se, uptake, means+se, angle = 90, length = 0.1, code = 3)
legend(x = "top", fill = c(0, 2), legend = c("Quebec", "Mississippi"), bty = "n", horiz = T)

6 Logistic Regression

6.1 Goals

In this section you will learn

 how and when logistic regression analyses should be used
 how to create and interpret logistic regression models in R
 how to create simple plots of logistic regression models

6.2 How to do it

Background
Logistic regression models are used in situations where we want to know how a binary
response variable is affected by one or more continuous variables. Common biological
examples include assessing the probability of survival, the probability of reproducing, or
the probability of an individual possessing a certain allele. On the natural scale, logistic
regression is non-linear and cannot be analyzed using linear models. This problem is
circumvented by using the logit transformation to linearize the model:

logit(p) = log(p / (1 - p))
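A quick numerical check of the transformation (qlogis() and plogis() are R's built-in
logit and inverse-logit functions; this is just an illustration, not part of the analysis below):

p <- 0.8
log(p / (1 - p))   # logit by hand: 1.386294
qlogis(p)          # same value with the built-in logit
plogis(qlogis(p))  # inverse logit recovers 0.8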

Creating and analyzing the model

In R, logistic regression models are created using the generalized linear model function
glm(). This takes the general form of:

Model <- glm(probability_data ~ continuous_predictor,
family = "binomial")

The argument family = "binomial" tells the function to create a binomial logistic
regression model. As with the lm() function, we can use summary() to obtain a
summary of the model.

To demonstrate this, we will use survival data collected by Quintana-Ascencio et al. on
Hypericum cumulicola, a plant endemic to the southeastern United States. This dataset,
available in the attachments as Hypericum, contains both a binary response variable
(survival) and continuous predictor variables (log-transformed number of fruits produced
and height in the previous year). First we create the generalized linear model and use the
summary() function to obtain a summary:

LModel <- glm(survival ~ height, family = binomial,
data = Hypericum)
summary(LModel)

Call:
glm(formula = survival ~ height, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4470 -1.2380 0.6166 0.8544 1.2199
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.9931 0.4201 9.505 < 2e-16 ***
height -2.1885 0.2912 -7.515 5.71e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1
‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1018.68 on 878 degrees of freedom
Residual deviance: 947.12 on 877 degrees of freedom
AIC: 951.12
Number of Fisher Scoring iterations: 4

As can be seen from this summary, height has a significant negative effect on survival.
Note that the intercept in this case is far larger than 1. This is because summary()
presents values on the logit scale, where values are no longer bound between 0 and 1.
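A sketch of back-transforming coefficients (using LModel from above): plogis() converts
a value on the logit scale back to a probability.

plogis(coef(LModel)["(Intercept)"])  # predicted survival probability when the predictor is zero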

There are several ways to determine the goodness of fit and the significance of the overall
model rather than of the individual parameters. One way is to use the G2 statistic (similar to
a chi-squared statistic) to compare the null and residual deviance. This technique compares the
unexplained variation in a null model (that is, one that has no predictive value) to the
unexplained variation in the model being tested (i.e. the residual deviance). A greater
difference between null and residual deviance indicates a better model fit. This difference is
then tested against the chi-squared distribution to determine a p-value.

G_sq <- LModel$null.deviance - LModel$deviance
pchisq(G_sq, 1, lower.tail = F)

This technique produces a p-value of about 2.7e-17, so the overall model in this case is
highly significant.
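The same likelihood-ratio test can, as an alternative, be obtained with anova(); the
resulting p-value should match the one computed above:

anova(LModel, test = "Chisq")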

Plotting Logistic Regression Models


Logistic regressions can be plotted either with survival presented on the linearized logit
scale or on the natural logistic curve. The plotted logit can be useful for diagnostic
purposes to determine the quality of fit of the model. However, the logistic curve is often
more intuitive to present as a result.

In order to plot on the logit scale, you must first define a sequence using the seq() function.

Sequence<-seq(0,4,.1)

The first and second numbers define the minimum and maximum values of the sequence
and the third value specifies the interval. Next we generate predicted values from the model
using the predict() function:

PLMlogit <- predict(LModel, list(height = Sequence))
plot(Sequence, PLMlogit, type = "l", xlab = "Log(height)",
ylab = "Logit Survival")

The resulting figure should look like this (Figure 6-1):

Figure 6-1 Relationship between height and survival on the logit scale

In order to plot a logistic curve we again use the predict() and plot() functions,
but we add an additional argument to predict(): type = "response".

PLMcurve <- predict(LModel, list(height = Sequence),
type = "response")
plot(Sequence, PLMcurve, type = "l", xlab = "Log(height)",
ylab = "Survival")

This argument tells the predict function to output the response variable (survival) on its
original scale rather than on the transformed scale. The resulting figure should look like
this (Figure 6-2):

Figure 6-2 Relationship between log(height) and survival on the natural scale.

6.3 Summary

 Logistic regressions are used when you have a binary outcome (a probability) as a
response variable and a continuous predictor variable
 Logistic curves are analyzed as generalized linear models with glm() through the use of
the logit transformation.
 Logistic regressions can be plotted either as a logistic curve or as a linear function on the
logit scale.

6.4 Exercises

6-A
Repeat the example above using fruits as the predictor variable rather than height.
A. Compare the overall significances of both models. Which predictor variable is a better
fit for the data? Why?
B. Compare the graphs of the two predictor variables. How are they similar? How are
they different?

6-B
(Yes/No) Which of the following questions could be answered using logistic regression?
A. Is the probability of getting a head in a coin flip affected by wind speed?
B. Is there a correlation between the rate of coffee consumption and hours worked?
C. Is the probability of successfully building a nest related to body size in doves?
D. Are yellow or red crabs more likely to occupy holes at the beach?
E. Does advertising spending affect the proportion of people who receive flu shots?
F. How does increasing the concentration of a drug affect mortality rates in mice?
G. Do different species of bears produce differing numbers of offspring?
H. Is the ratio of body length to width similar throughout the family Mustelidae?

6-C
You created a logistic model which has a null deviance of 432 and a residual deviance
of 425. Is the overall model significant?

Solutions:
6-A
a) Fruits is a better fit because the difference between null and residual deviance is larger.
b) Similar: both effects are negative and both logit plots are linear. Different: height has a stronger negative effect and the
shapes of the logistic curves differ.
6-B
Y;N;Y;N;Y;Y;N;N
6-C
Yes, P = 0.008

7 R programming structures

7.1 Flow control

Goal
As a fully-fledged programming language, R comes with various looping and
conditional constructs. In this section, we will briefly discuss iteration and
conditionals. R provides three basic C-style constructs for writing explicit loops: for(),
while() and repeat(). Conditional evaluation can be carried out using the
functions if() and ifelse().

How to do it
The syntax of the looping functions is listed below.

for (VAR in SEQ) {EXPR}
while (COND) {EXPR}
repeat {EXPR}

 VAR is the abbreviation of variable.
 SEQ is the abbreviation of sequence, which is equivalent to a vector (including lists) in R.
 COND is the abbreviation of conditional, which evaluates to TRUE or FALSE.
 EXPR is the abbreviation of expression.

The first one, for(), iterates through each component VAR of the sequence SEQ: in the
first iteration VAR = SEQ[1], in the second iteration VAR = SEQ[2], and so on. The
following code uses the for() structure to print the square of each component of a vector.

for ( i in 1:5 ) {
print( paste('square of', i, '=', i^2) )
}

[1] "square of 1 = 1"


[1] "square of 2 = 4"
[1] "square of 3 = 9"
[1] "square of 4 = 16"
[1] "square of 5 = 25"

The other two loop structures, while() and repeat(), rely on a change in the state
of a condition, or on the use of break, to leave the loop. The function break halts the
execution of the innermost loop and passes control to the first statement outside it.
Similarly, next exits the processing of the current iteration of the loop and causes the
execution of the next iteration. When using repeat() or while(), special attention
should be paid to avoiding infinite loops, that is, loops which iterate without end. Below
is an example showing two different ways of accomplishing the same job.

i_w <- 1
while ( i_w <=10 ) {
i_w <- i_w + 5
}
i_w

[1] 11

i_r <-1
repeat {
i_r <- i_r + 1
if (i_r > 10) {break}
}
i_r

[1] 11

Note that excessive use of loops can make your R code slow and hard to read. Although
loops in R are straightforward and convenient, you should sometimes avoid them due to
their high computational cost, especially when working on long vectors. A better
alternative is to use vectorized functions, for example which(), any(), all(), etc.
For matrix computations, you can use rowSums(), colSums(), and so on.
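For example, locating and checking elements of a vector needs no explicit loop:

v <- c(3, 8, 1, 9)
which(v > 5)   # positions of the elements larger than 5: 2 4
all(v > 0)     # are all elements positive? TRUE
v^2            # arithmetic is vectorized too: 9 64 1 81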

Now it is time to move on to conditionals. The syntax of the if() statement looks like this:

if ( COND ) {EXPR1} else {EXPR2}

The conditional COND is evaluated first; if it is TRUE, then expression EXPR1 is
executed, and if COND evaluates to FALSE, then EXPR2 is executed. When COND
evaluates to a numeric value of zero, R treats it as FALSE; any non-zero number is
treated as TRUE. We can also extend or shrink if() structures by adding or removing
else clauses, as they are optional. Note that the order of the conditional clauses is vital:
once a condition is satisfied, R ignores the rest of the if-else structure and jumps out
of it. Here is a simple example:

x <- 3
if ( ! is.numeric(x) ) {
stop( paste(x, 'is not numeric') )
} else if ( x%%2 == 1) {
print ( paste(x, 'is an odd') )
} else if ( x == round(x) ) {
print ( paste(x, 'is an integer') )
} else {
print ( paste(x, 'is a number') )
}
[1] "3 is an odd"

You can assign other values to x, for example x <- 1.3 or x <- 'abc', and then copy
and execute the if-else structure to check the result.
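The vectorized counterpart ifelse(), mentioned in the goal above, applies a condition
element-wise to a whole vector:

x <- c(1, 2, 3, 4)
ifelse(x %% 2 == 1, "odd", "even")

[1] "odd"  "even" "odd"  "even"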

7.2 Write your own function

Goal
In this section, you will learn how to develop your own R function.

How to do it
R provides a convenient way to define custom functions and make good use of them. All
functions read and parse input, referred to as arguments, and then return output.
Functions are first-class objects in R. A function is created with the command
function(), followed by a comma-separated list of formal arguments enclosed in
parentheses, and then the expressions that form the body
of the function. If the body consists of a single statement, it can be entered directly;
when there are multiple expressions, they have to be enclosed in braces {}. The
value returned by an R function can either be given explicitly with the built-in function
return() or is simply the value of the last evaluated expression.

Here is an example of a function which returns x to the power of n (x^n):

expon <- function(x,n) {
if ( x%%1 != 0 ) {
stop('x must be an integer!')
} else if ( n==0 ) {
return(1)
} else {
prod <- x
while( n>1 ) {
prod <- x*prod
n <- n-1
}
return(prod)
} # end of else
} # end of the function

Now let it calculate 3 to the power of 4.

expon(3,4)

[1] 81

The formal arguments and the body of the function expon() can later be accessed via
the R functions formals() and body(), as follows:

formals(expon)

$x

$n

body(expon)

{
if (x%%1 != 0) {
stop("x must be an integer!")
} (…)

Another point worth mentioning is that you can print out a built-in R function when
you are not sure what it does. By looking at the code, you may get a better idea. For
instance, if you are curious about the details of the function cat(), you can glance over
its code by typing its name without parentheses.

cat

function (..., file = "", sep = " ", fill = FALSE,
labels = NULL, append = FALSE)
{
if (is.character(file))
if (file == "")
file <- stdout() (…)

It is very handy to view a lengthy function via the command page().

page(cat)

However, as many fundamental functions in R are implemented directly in C, they are
not viewable in this manner. There is a lot more to writing custom functions in R than
what is shown here; nonetheless, you won't need it unless you move to the advanced level.
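One more feature worth knowing even at this level: arguments can have default values
(a small sketch):

pow <- function(x, n = 2) x^n
pow(5)     # n falls back to the default of 2, returns 25
pow(5, 3)  # returns 125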

7.3 Summary

 R programs are made up of expressions. Basic control-flow constructs are needed to
combine compound expressions. R provides for(), while() and repeat()
to write loops. If-else statements are available for choosing between two
or more expressions. But sometimes loops should be avoided due to their low
efficiency.
 One prominent advantage of R over other statistical languages is its extensibility.
You can easily add new functionality to the system by defining new functions.

8 Appendix: Code References
The following appendix provides a list and general descriptions of the commands used in
each section of this document. Clicking on a command name links to the corresponding
CRAN reference page.

Getting Started with R
Command      Description                                               Example
rm           removes an object                                         rm(x)
c            creates a vector                                          a <- c(1,2,3,4)
seq          creates a sequence                                        seq(1, 10, by=2)
rep          replicates vector elements                                rep(1:10, 2)
factor       defines a vector as a categorical variable                factor(x)
list         constructs a list                                         z <- list(x, y)
q            quits the R session                                       q()
data.frame   creates a data frame                                      data.frame(x)
read.table   reads an external file and converts it to a data frame    read.table("example.txt", header=T)
str          displays the structure of an object                       str(x)
?            place in front of a function to get its description       ?lm
subset       creates a subset of a vector or matrix                    subset(data, x>10)
library      loads a package that is already installed                 library(ggplot2)
attach       attaches a data frame                                     data <- read.table("example.txt", header=T); attach(data)

Basic Statistics with R
table        creates a contingency table                               table(x,y)
summary      produces summaries of model objects                       ab <- lm(x~y); summary(ab)
tapply       applies a function across an array of categories          tapply(x, y, mean)
qqnorm       creates a QQ plot to visualize deviations from normality  qqnorm(x)
t.test       runs t-tests (one-sample, paired or two-sample)           t.test(y, mu=value); t.test(y1, y2, paired=TRUE); t.test(y ~ group)
wilcox.test  runs non-parametric Wilcoxon tests                        wilcox.test(y, mu=value); wilcox.test(y1, y2, paired=TRUE); wilcox.test(y ~ group)
cor.test     tests for correlations among variables                    cor.test(x,y)
chisq.test   runs a chi-squared test                                   chisq.test(x,y)
prop.test    tests for equal proportions                               prop.test(x,y)

Linear Models
lm           constructs a linear model                                 lm(x~y)
anova        calculates an analysis-of-variance table for a model      x <- lm(x~y); anova(x)

Basic Graphs with R
barplot      creates barplots (see link for plotting arguments)        barplot(x)
arrows       adds arrows or error bars to a plot                       arrows(midpoints, mean.tg-sem.tg, midpoints, mean.tg+sem.tg, angle=90, length=0.1, code=3, lwd=1.5)
legend       adds a legend to a plot                                   legend(0.75, 30, legend=c("Orange juice", "Ascorbic acid"), fill=c(0,8), bty="n", cex=1.1)
plot         creates an x,y scatterplot (see link for arguments)       plot(x,y)
lines        adds lines to a plot                                      lines(x,y)
expression   creates an unevaluated expression (e.g. italic text)      expression(x)
par          sets and edits graphical parameters (see link)            par(mfrow=c(2,2))

Logistic Regression
glm          creates a generalized linear model                        glm(y~x, family=gaussian)
predict      generates predictions from a model                        a <- glm(y~x, family=gaussian); predict(a)
