
Downloaded from http://sp.lyellcollection.org/ at Pennsylvania State University on April 8, 2016

Compositional data analysis with 'R' and the package 'compositions'

K. G. VAN DER BOOGAART¹ & R. TOLOSANA-DELGADO²
¹Institut für Mathematik und Informatik, Ernst-Moritz-Arndt-Universität Greifswald, Greifswald D-17487, Germany (e-mail: boogaart@uni-greifswald.de)
²Departament Informàtica i Matemàtica Aplicada, Universitat de Girona, Girona E-17071, Spain

Abstract: This paper is a hands-on introduction and shows how to perform basic tasks in the
analysis of compositional data following Aitchison's philosophy, within the statistical package
'R' and using a contributed package (called 'compositions'), which is devoted specially to com-
positional data analysis. The studied tasks are: descriptive statistics and plots (ternary diagrams,
boxplots), principal component analysis (using biplots), cluster analysis with Aitchison distance,
analysis of variance (ANOVA) of a dependent composition, some transformations and operations
between compositions in the simplex.

This paper will show how the basic tasks of compositional data analysis (Aitchison et al. 2002) can be performed with the package 'compositions' in the free statistical environment 'R' (R Development Core Team 2003). The paper aims to be useful for a wide spectrum of 'R' users: for this reason, it is suggested that the experienced skip these first steps, whereas those who have never heard of 'R' should begin with Appendix A before continuing with the text. It is strongly recommended that the reader be in front of the computer, typing the examples outlined here: thus, text output of these instructions is kept to a minimum, and almost all figures are not included, although they are described briefly (with a few exceptions).

First steps

'R' is a powerful computer environment for multi-purpose statistics and data analysis. It is available for all computer platforms and can be downloaded from 'http://www.cran.R-project.org'. 'Compositions' is a contributed package for 'R', devoted specially to the analysis of compositional data; it can be downloaded from 'http://www.stat.boogaart.de/compositions'. 'R' and 'compositions' are both distributed and developed under the GNU public license, hence they are available free of charge. Further instructions on downloading, installation and getting started with the software can be found in Appendix A.
'R' is classically based on a command line interface, but various graphical user interfaces are available from 'http://www.cran.R-project.org'. When compared with other compositional software, the 'R' package provides a maximum of flexibility. However, being based on a computer language, it demands that its users not be afraid of reading manuals or of typing at a command line any command found there. It should also be remembered that 'R' and its packages are a living project permanently adapted to the development of the field. More instructions can be found at 'http://www.stat.boogaart.de/compositions/'.
After starting 'R' (either by clicking on the appropriate icon, selecting the entry 'R' in the start menu or by typing the command 'R' to a console or command window, after installing the software) a command window appears where commands can be given to 'R'. The following appears:

R : Copyright 2004, The R Foundation for Statistical Computing
Version 2.0.1 (2004-11-15), ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.
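The start-up banner reports the installed version, and this paper asks for at least version 2.0.0. As a minimal sketch, the same check can be made from the command line. The threshold string is simply the requirement quoted in this paper, and the variable names are illustrative.

```r
# Minimum version quoted in this paper for running 'compositions'
# (an assumption taken from the text, not queried from the package).
required <- "2.0.0"

# getRversion() returns the version of the running R session.
version_ok <- getRversion() >= required

cat("Running R version:", as.character(getRversion()), "\n")
cat("Meets the stated requirement:", version_ok, "\n")
```

If the second line prints FALSE, 'R' should be updated before installing the package.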

From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 119-127.
0305-8719/06/$15.00 © The Geological Society of London 2006.

The version number should be checked, since at least version 2.0.0 is required for running compositions. The '>' mark shows that 'R' is willing to accept commands. This character should not be typed with the commands. To see how 'R' works, type '3*7', and hit the ENTER-key to make 'R' execute this command:

> 3*7
[1] 21
>

'R' executes the command by multiplying 3 and 7 and then prints the result 21. For the moment, ignore the '[1]'. 'R' can in this way be used as an (extremely powerful) calculator. To prepare 'R' for compositional data analysis the library compositions must be loaded with the library command:

> library(compositions)
Attaching package 'compositions':
The following object(s) are masked from package:stats:
cor cov dist var
The following object(s) are masked from package:base:
%*%
>

Either such output, or no output at all, informs the user that the package has loaded properly. When an error such as this appears:

> library(compositions)
Error in library(compositions):
There is no package called 'compositions'

this means that the package is not properly downloaded or installed. Instructions for downloading and installation of the package can be found in Appendix A.
After loading the package, some example data from the package should be loaded with the 'data' command:

> data(SimulatedAmounts) # Load example data (no output)
> ? SimulatedAmounts # Show help about example data

Note that a hash mark '#' denotes the beginning of a comment: after it, the rest of the line is ignored by 'R'. Therefore, it is not necessary to type the comments.
When working in a terminal, the help can be closed by typing 'q' for Quit. In a windows-based environment the help window can simply be closed.

> ls() # Show names of all variables/datasets
[1] "sa.dirichlet" "sa.dirichlet.dil" "sa.dirichlet.mix"
[4] "sa.dirichlet5" "sa.dirichlet5.dil" "sa.dirichlet5.mix"
... (lines omitted)

The other commands show a typical usage of 'R': use '?' to get help information, or 'ls()' to show all variables/datasets defined previously. Just type the name of a dataset to show its content, which in this case is a set of simulated amounts of three different chemical elements in ppm:

> sa.lognormals # Show one of the datasets
Cu Zn Pb
[1,] 8.8043262 35.1671810 45.895025
[2,] 0.8115227 2.6547329 47.804310
[3,] 1.2836130 12.4472047 40.553628
... (lines omitted)
[60,] 3.9854998 6.1301909 40.579417
>

To edit or just to inspect the dataset in a spreadsheet-like environment the command 'fix(sa.lognormals)' may be used. Appendix A contains instructions on how to load datasets.

Basic compositional data analysis

The zero step when using the package is to mark your data explicitly as a set of elements from a simplex under Aitchison geometry (Aitchison et al. 2002). This is done by converting the dataset to an Aitchison compositional set through the function 'acomp', and storing it into a new variable by using the assignment sign '<-'.

> data(SimulatedAmounts) # just in case you start here
> cdata <- acomp(sa.lognormals)
> cdata
Cu Zn Pb
[1,] 0.097971136 0.391326782 0.51070208
[2,] 0.015828238 0.051778890 0.93239287
[3,] 0.023646054 0.229295970 0.74705798
... (lines omitted)
[60,] 0.078617049 0.120922730 0.80046022
attr(,"class")
[1] "acomp"
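The closure that 'acomp' applies to the data can be sketched in base 'R' without the package. The helper name 'closure_rows' is hypothetical, and the three rows are the first rows of 'sa.lognormals' printed in this paper. This only illustrates dividing each row by its sum. It is not the package's implementation.

```r
# Hypothetical helper: rescale every row of a matrix so that it sums to one.
closure_rows <- function(x) x / rowSums(x)

# First three rows of 'sa.lognormals' as printed in the text (ppm).
amounts <- rbind(c(8.8043262, 35.1671810, 45.895025),
                 c(0.8115227,  2.6547329, 47.804310),
                 c(1.2836130, 12.4472047, 40.553628))
colnames(amounts) <- c("Cu", "Zn", "Pb")

closed <- closure_rows(amounts)
print(round(closed, 6))
print(rowSums(closed))   # every row now sums to one
```

The rounded rows can be compared with the first three rows of 'cdata' shown in the text.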

The dataset is now closed. Note that the closure constant is automatically considered as one. In this case, the resulting object is stored in 'cdata' and marked as an Aitchison composition (having the attribute class 'acomp') such that it is automatically treated in an adequate way in further commands. For example, the 'plot' command will automatically draw a ternary diagram:

> plot(cdata) # Ternary diagram
> ? plot.acomp # Help on the acomp-specific plot function

One quits the help by closing the help window or by typing q for 'quit'.¹ There is always an example of the command at the end of each help page. Try this out to see what happens.
Another graphical display of compositional data, related closely to the Aitchison geometry of the simplex and displaying the Aitchison distance in a visual way, is the boxplot of log-ratios:

> boxplot(cdata) # Box-plots of pairwise ratios in log scale
> ? boxplot.acomp # Help on compositional box-plots
> boxplot(cdata,log=FALSE) # use normal scale

As a result a square table of boxplots appears, displaying the ratios of the row and the column parts in log scale. Various types of descriptive statistics can be computed by intuitive commands:

> mean(cdata)
Cu Zn Pb
0.08918175 0.23949922 0.67131903
attr(,"class")
[1] "acomp"

The result is the mean in Aitchison geometry (i.e. closed geometric mean), which is again a composition. This single composition can be displayed by a pie-chart or a barplot:

> pie(mean(cdata))
> barplot(mean(cdata))

A barplot can also be used to display the whole dataset:

> barplot(cdata) # Display the whole data set

The variation of compositions can be summarized in several ways (Aitchison et al. 2002; Pawlowsky-Glahn & Egozcue 2001):

> variation(cdata) # Variation matrix
Cu Zn Pb
Cu 0.0000000 0.4046994 2.938182
Zn 0.4046994 0.0000000 2.910539
Pb 2.9381816 2.9105389 0.000000
> var(cdata) # Variance matrix of the
    clr-transform
Cu Zn Pb
Cu 0.4194692 0.2125124 -0.6319817
Zn 0.2125124 0.4102550 -0.6227674
Pb -0.6319817 -0.6227674 1.2547491
> mvar(cdata) # metric variance
[1] 2.084473
> msd(cdata) # metric standard
    deviation = sqrt(mvar/(D-1))
[1] 1.0209
> summary(cdata) # multiple information about
    pairwise ratios
...

Indented lines are continuations of the previous line.

¹ For some commands (barplot, boxplot, cdt, cor, cov, idt, mean, names, perturbe, plot, power, princomp, qqnorm, rnorm, runif, scale, segments, split, summary, var, +, -, *, /, %*%) you need to add '.acomp' to see the Aitchison compositional specific help.

Fig. 1. Biplot of a three-part composition (Cu, Zn, Pb).

A graphical way to display the variability of a composition is the biplot, based on principal

component analysis, which uses the clr transforms (Aitchison 2002):

> pca <- princomp(cdata) # perform PCA and store the result in pca
> pca # display results as text
Call:
princomp.acomp(x = cdata)

Standard deviations:
Comp.1 Comp.2
1.3604382 0.4460269

3 variables and 60 observations.
Mean (compositional):
Cu Zn Pb
0.08918175 0.23949922 0.67131903
attr(,"class")
[1] "acomp"
+Loadings (compositional):
Cu Zn Pb
Comp.1 0.5533583 0.5570883 1.8895534
Comp.2 0.4207858 1.7307697 0.8484445
attr(,"class")
[1] "acomp"
-Loadings (compositional):
Cu Zn Pb
Comp.1 1.312246 1.3034604 0.3842932
Comp.2 1.725060 0.4193976 0.8555428
attr(,"class")
[1] "acomp"
> screeplot(pca) # display importance of components
> biplot(pca) # display direction of components

The last component always has no importance. The first component, giving the highest variation, corresponds here to the balance of Pb against Zn and Cu (as can be seen in Fig. 1). It explains 90% of the variability, as can be obtained by dividing the variance of the first component by the metric variance of 'cdata'.

Working with compositions of four or more parts

To analyse a different dataset or a subcomposition you might assign something different to 'cdata' or any other variable representing your compositional dataset, e.g. by

> data(SimulatedAmounts) # Load the example datasets
> sa.groups5 # One of these
> cdata <- acomp(sa.groups5,parts=c("Cd", "Pb", "Co", "Cu"))
> cdata

The optional parameter 'parts=' allows you to select the parts to be used in the subcomposition. Optional parameters are a typical way for 'R' to provide additional functionality beyond the default behaviour of a command. The possible optional parameters and their effects are documented in the help to each command, which can be invoked by '? nameoffunction'. The 'c()' function is just here to Concatenate the variable names.
In principle, every one of the aforementioned commands can now be applied to the new dataset. Try:

> plot(cdata)

Since a ternary diagram can display only three parts at the same time, a table of multiple ternary diagrams, containing subcompositions or marginal compositions (Fig. 2), must be displayed. As a default, two parts are determined by the row and the column occupied by each plot, and the geometric mean of the remaining components is taken as the third component. Alternatively one can specify a component, by using the optional parameter 'margin':

> plot(cdata,margin="Cd")

Performing a cluster analysis with Aitchison distance

A hierarchical cluster analysis can be performed with the following instructions. First the clustering must be computed and the result stored in a variable:

> Clusters <- hclust(dist(cdata), method="complete") # compute clustering

Since 'cdata' is marked as an Aitchison composition, 'dist' automatically computes the Aitchison distance. The linkage method of clustering (here 'complete') can be replaced by any other method (e.g. 'single') as described in the help '? hclust'. Now the results should be displayed (Fig. 3):

> plot(Clusters) # shows the dendrogram

When the user has decided on the number of groups to interpret, maybe four in this case, a new variable containing the groups assigned to each case can be generated and the group membership

Fig. 2. Matrix of ternary diagrams of a four-part composition (Cd, Pb, Co, Cu).

can be displayed in ternary diagrams, boxplots and biplots.

> group <- cutree(Clusters,4)
> group
[1] 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 2
    2 1 3 2 3 2 4 3 3 2 2 4 3 4 2 2 3 2
    2 2
[39] 3 1 3 3 2 4 4 3 4 4 4 3 4 4 3 4 4
    4 3 3 4 4
> plot(cdata,col=group)
> plot(cdata,pch=group)
> plot(cdata, pch=as.character(group))
> plot(cdata,col=group,center=T) # display centered data
> plot(Clusters,labels=group)
> biplot(princomp(cdata), xlabs=group)
> boxplot(cdata,factor(group))

Compositional computation

Various mathematical transformations and operations are defined for the Aitchison simplex (Aitchison et al. 2002). These are interesting mainly for developers of new statistical methods. The perturbation and the power transform are considered as addition and scalar multiplication in a vector space structure of the simplex:

> a <- acomp(c(1,2,2)) # a single composition
> a
[1] 0.2 0.4 0.4
attr(,"class")
[1] "acomp"
> b <- acomp(c(8,1,1)) # a second composition
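The vector-space operations named in this section can be sketched in base 'R', using the same two compositions 'a' and 'b' as in the listing. All helper names below ('closure', 'perturb', 'powering', 'clr_vec', 'ainner') are hypothetical and written only for illustration. They are not the functions of the 'compositions' package, which overloads '+', '*' and '%*%' instead.

```r
# Hypothetical base-R helpers, for illustration only.
closure  <- function(x) x / sum(x)              # rescale to sum one
perturb  <- function(x, y) closure(x * y)       # "addition" in the simplex
powering <- function(k, x) closure(x^k)         # "scalar multiplication"
clr_vec  <- function(x) log(x) - mean(log(x))   # centred log-ratio
ainner   <- function(x, y) sum(clr_vec(x) * clr_vec(y))  # scalar product

a <- closure(c(1, 2, 2))
b <- closure(c(8, 1, 1))

ab      <- perturb(a, b)      # perturbation of a by b
a2      <- powering(2, a)     # power transform with exponent 2
# (a + a) / 2 - a in simplex terms: the neutral element (1/3, 1/3, 1/3)
neutral <- perturb(powering(0.5, perturb(a, a)), powering(-1, a))

print(round(ab, 7))
print(round(a2, 7))
print(round(neutral, 7))
print(ainner(a, a))           # squared Aitchison norm of a
print(sqrt(ainner(a, a)))     # Aitchison norm of a
```

The neutral element plays the role of zero: perturbing any composition by it leaves the composition unchanged.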

Fig. 3. Dendrogram of groups in a four-part composition (Cd, Pb, Co, Cu), defined by the Aitchison distance.

> b
[1] 0.8 0.1 0.1
attr(,"class")
[1] "acomp"
> a+b # adding is perturbation
[1] 0.6666667 0.1666667 0.1666667
attr(,"class")
[1] "acomp"
> 2*a # multiplication is power transform
[1] 0.1111111 0.4444444 0.4444444
attr(,"class")
[1] "acomp"
> (a+a)/2-a # inverse operations
[1] 0.3333333 0.3333333 0.3333333
attr(,"class")
[1] "acomp"
> xx <- (cdata-mean(cdata))/msd(cdata) # parallel on the whole data set
> xx
... (lines omitted)
> mean(xx) # xx is centered
Cd Pb Co Cu
0.25 0.25 0.25 0.25
attr(,"class")
[1] "acomp"
> msd(xx) # and normalized
[1] 1
> a %*% a # scalar product
[1] 0.320302
> norm(a) # norm
[1] 0.5659523
> cdata %*% acomp(c(1,2,3,4)) # scalar products
[1] 0.43526714 -1.53700304 2.40682769
    1.61307672 2.64960822 2.38350333
... (lines omitted)
> yy <- chol(solve(clrvar2ilr(var(cdata)))) %*% cdata # matrix operation
> var(yy) # var^-0.5 * cdata
[,1] [,2] [,3] [,4]
[1,] 0.75 -0.25 -0.25 -0.25
[2,] -0.25 0.75 -0.25 -0.25
[3,] -0.25 -0.25 0.75 -0.25
[4,] -0.25 -0.25 -0.25 0.75

The standard transforms can be computed by

> CenteredLogRatio <- clr(cdata)
> IsometricLogRatio <- ilr(cdata)
> AdditiveLogRatio <- alr(cdata)
> OriginalData <- clr.inv(CenteredLogRatio)
> OriginalData <- ilr.inv(IsometricLogRatio)
> OriginalData <- alr.inv(AdditiveLogRatio)
> CenteredLogRatio
Cd Pb Co Cu
[1,] -1.62815906 2.82253578 -0.46785026 -0.72652645
[2,] -0.28013715 2.27945644 1.20048072 -3.19980002
... (lines omitted)
[60,] 1.72468264 -2.56743127 -0.16183132 1.00457995
attr(,"class")
[1] "rmult"

Working with grouped data

Sometimes a group membership is available. In the case of the 'sa.groups5' dataset, it represents a sector of an imaginary river where the sample was originally collected. This information is stored in the variable 'sa.groups5.area':

> plot(acomp(sa.groups5), col=sa.groups5.area)
> boxplot(acomp(sa.groups5), sa.groups5.area)
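The centred and additive log-ratio transforms listed among the standard transforms, together with their inverses, can be sketched in base 'R' for a single composition. The helper names are hypothetical, and the choice of the last part as the alr denominator is an assumption of this sketch. The input vector is the first row of 'CenteredLogRatio' as printed in this paper. The ilr transform is left out because it depends on a choice of basis.

```r
# Hypothetical single-vector helpers, for illustration only.
closure     <- function(x) x / sum(x)
clr_vec     <- function(x) log(x) - mean(log(x))   # centred log-ratio
clr_inv_vec <- function(z) closure(exp(z))         # back to the simplex
# Additive log-ratio, assuming the last part is the denominator.
alr_vec     <- function(x) log(x[-length(x)] / x[length(x)])
alr_inv_vec <- function(z) closure(c(exp(z), 1))

# First row of the printed clr coordinates (Cd, Pb, Co, Cu).
z <- c(Cd = -1.62815906, Pb = 2.82253578, Co = -0.46785026, Cu = -0.72652645)

x <- clr_inv_vec(z)               # composition recovered from clr coordinates
print(round(x, 6))
print(sum(z))                     # clr coordinates sum to (almost) zero
print(round(clr_vec(x) - z, 6))   # round trip reproduces the coordinates
print(round(alr_inv_vec(alr_vec(x)) - x, 6))
```

Because clr coordinates always sum to zero, their covariance matrix is singular, which is one reason the ilr transform is used for the standard multivariate routines in this paper.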

When the user wants to analyse only one of the groups, a subset of the data is selected based on a criterion:

> sa.groups5.area
[1] Upper Upper Upper Upper Upper Upper Upper Upper Upper Upper
[11] Upper Upper Upper Upper Upper Upper Upper Upper Upper Upper
[21] Middle Middle Middle Middle Middle Middle Middle Middle Middle Middle
[31] Middle Middle Middle Middle Middle Middle Middle Middle Middle Middle
[41] Lower Lower Lower Lower Lower Lower Lower Lower Lower Lower
[51] Lower Lower Lower Lower Lower Lower Lower Lower Lower Lower
Levels: Lower Middle Upper
> upper <- split(cdata,sa.groups5.area)[["Upper"]]
> plot(upper)
> mean(upper)

A parallel analysis of all groups is possible through the 'lapply' or 'sapply' function of 'R':

> sapply(split(cdata,sa.groups5.area), mean)
Lower Middle Upper
Cd 0.064366655 0.006801803 0.001636658
Pb 0.059178242 0.572889919 0.957191909
Co 0.009509995 0.002658064 0.004323912
Cu 0.866945108 0.417650214 0.036847521

However, the grouping information could be used to check whether the groups are really different in a Multivariate Analysis of Variance (manova), which can be done by 'R' standard routines based on the ilr transform:

> m <- manova(ilr(cdata)~sa.groups5.area)
> summary(m)
Df Pillai approx F num Df den Df Pr(>F)
sa.groups5.area 2 1.0872 22.2312 6 112 < 2.2e-16 ***
Residuals 57
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> plot(ilr.inv(residuals(m)),col=sa.groups5.area)
> plot(ilr.inv(predict(m)),col=sa.groups5.area)
> qqnorm(ilr.inv(residuals(m)))
> mvar(predict(m))/(mvar(residuals(m)+predict(m))) # "R"^2
[1] 0.3980416
> diag(ilrvar2clr(var(predict(m)))/ilrvar2clr(var(residuals(m)+predict(m))))
[1] 0.4001846 0.5670027 0.1392320 0.2654141

Here one sees a highly significant influence of the group, given by a p-value stated as '< 2.2e-16'. If this example is run, a series of plots results. The first shows the residuals with substantial spread. The second shows the location of the predicted group means in ternary diagrams. Unfortunately, the variable names are lost during the ilr transform, such that the plots are drawn without labels. The third shows qqnorm-plots of the pairwise log-ratios, in order to check the normality assumption used in the manova. The last two calculations give the total R^2 of the model, about 40%, and the individual R^2 values for the four parts of the composition.
In a similar way a discriminant analysis can be performed based on the ilr transform and standard functionality of 'R':

> library(MASS) # Loading appropriate library
> # Generating example data
> subsample <- sample(1:60,45)
> TrainingData <- acomp(cdata[subsample,])
> TrainingGroups <- sa.groups5.area[subsample]
> ControlData <- acomp(cdata[-subsample,])
> ControlGroups <- sa.groups5.area[-subsample]
> ControlGroups
[1] Upper Upper Upper Middle Middle Middle Middle Middle Middle Middle
[11] Lower Lower Lower Lower Lower
Levels: Lower Middle Upper
> # Performing the discriminant analysis
> dscr <- lda(TrainingGroups~., ilr(TrainingData)) # Discriminant Analysis
> dscr
... (output omitted)
> predict(dscr,newdata=ilr(ControlData)) # Classify ControlData
$class
[1] Upper Upper Upper Middle Middle Lower Middle Middle Middle Middle
[11] Lower Lower Lower Lower Lower
Levels: Lower Middle Upper
$posterior
Lower Middle Upper
1 3.626286e-16 1.851031e-07 9.999998e-01
2 7.991869e-12 8.473827e-05 9.999153e-01
... (lines omitted)
> table(ControlGroups, predict(dscr, newdata=ilr(ControlData))$class)
ControlGroups Lower Middle Upper
Lower 5 0 0
Middle 1 6 0
Upper 0 0 3
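The cross-classification table of true and predicted groups can be condensed into a single success rate. The sketch below re-enters the printed counts by hand into a matrix whose name ('confusion') and dimension labels are illustrative. Since the training subsample is drawn at random, a fresh run of the example will generally give different counts.

```r
# Counts copied from the printed table (rows: true group, columns: predicted).
groups <- c("Lower", "Middle", "Upper")
confusion <- matrix(c(5, 0, 0,
                      1, 6, 0,
                      0, 0, 3),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(ControlGroups = groups, Predicted = groups))

n_total       <- sum(confusion)         # number of control samples
n_correct     <- sum(diag(confusion))   # samples on the diagonal
accuracy      <- n_correct / n_total
misclassified <- n_total - n_correct

print(confusion)
cat("Correctly classified:", n_correct, "of", n_total, "\n")
cat("Accuracy:", round(accuracy, 3), "\n")
```

With these counts, 14 of the 15 control samples lie on the diagonal.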

The calculated classification of the 15 control samples based on 45 training samples was thus correct, with one exception. More detailed information about discriminant analysis and the 'lda' function can be found in the 'R' help.

Conclusions

For the beginner, this approach immediately provides all basic compositional plots, summaries and transformations in the form of the simple standard commands given in this publication. More helpful reading can be found in Using the R package 'compositions', available at http://www.stat.boogaart.de/compositions. Users can perform advanced analysis using the package in combination with the statistical sub-routines of 'R', as exemplified in the later chapters, and experts can even extend the functionality through the programming interface of 'R'. The authors are open to suggestions to include more functionality and convenience in the package.

Appendix A: Help with technical details

Downloading and installing 'R'

On 'http://www.cran.R-project.org' one can find detailed instructions on downloading and installing 'R', as well as the downloadable packages themselves. Users must download the setup program of the base part of a precompiled binary distribution of 'R' for their platform. For example, for windows users it is sufficient to download the 'rw????.exe' file from 'http://www.cran.R-project.org' and to double click the downloaded file to start the installation process.

Downloading and installing 'compositions'

The package is available from 'http://www.stat.boogaart.de/compositions/' or as a contributed package from 'http://www.cran.R-project.org'. In a windows system ('R' v.2.0.1 or later) the package can be installed through the 'Packages' menu option 'Install package(s) from local zip file...'. On Unix/Linux systems it is done by the command:

R CMD INSTALL DownloadedPackage.tar.gz

with 'DownloadedPackage.tar.gz' replaced by the actual filename of the downloaded package.

Importing data to 'R'

The simplest way to provide data to 'R' is to store them in a plain text file. The first row should contain the variable names separated by semicolons. The following lines contain the data, again separated by semicolons.

Cd;Zn;Pb;Cd;Co
1.2;2.6;4.9;0.2;5
23.4;11;0.2;0.002;6.2
...

The data can then be loaded by the 'R' commands:

> mydata <- read.csv("C:/mydirectory/myfile.txt", sep=";", dec=".")
> fix(mydata) # you must close the window afterwards

One should always check with the fix command that the data are properly loaded before using them. Directories must always be separated by a forward slash '/' in pathnames. All spreadsheet programs can export to this format when instructed to store as '.csv'. The separator and the decimal symbol can vary with the local configuration of the computer. Note that 'R' only uses a dot as the decimal symbol in any output, although the import procedure accepts the optional parameter 'dec=","' to deal with a decimal comma (see '? read.csv' or '? read.table'). To use the tabulator as the separating character use 'sep="\t"' in the 'read.csv' command.

Newbee problems and solutions

• 'R' is not exactly made for beginners: it's not your fault. Try to find someone to help you. Next year you will be the expert.
• When 'R' does not find your file, give the whole path and separate the directories with '/', not with a backslash. Don't forget the extension (e.g. '.txt').
• Wear glasses when copying and typing commands.
• When you get neither a plot nor an error, the plot window is probably iconified.
• When 'R' answers with '+' instead of '>', you have made a typing error and 'R' thinks that the command is not yet finished. Type ';', the ENTER-key, and try again.
• When you are bored by retyping commands again and again, try the up and down arrow keys or copy and paste from your favourite editor or a script window.

• It doesn't work in the second session: have you loaded all necessary libraries and prepared all variables?
• 'R' comes with plenty of help. Type the 'help.start()' command and start with 'Introduction to R'.
• Type 'q()' for quit and the ENTER-key to leave 'R'. Save your workspace, when asked.

References

AITCHISON, J. 2002. Simplicial inference. In: VIANA, M. A. G. & RICHARDS, D. S. P. (eds) Algebraic Methods in Statistics and Probability. Contemporary Mathematics Series, 287, American Mathematical Society, Providence, Rhode Island, 1-22.
AITCHISON, J., BARCELO-VIDAL, C., EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2002. A concise guide to the algebraic geometric structure of the simplex, the sample space for compositional data analysis. In: BAYER, U., BURGER, H. & SKALA, W. (eds) Proceedings of the 8th Annual Conference of the International Association for Mathematical Geology, Berlin, Germany, 387-392.
PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2001. Geometric approach to statistical analysis on the simplex. Stochastic Environmental Research and Risk Assessment, 15 (5), 384-398.
R Development Core Team 2003. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (http://www.R-project.org).
