
R News
The Newsletter of the R Project
Volume 7/2, October 2007

Editorial
by Torsten Hothorn page 74. As I’m writing this, Uwe Ligges and his
team at Dortmund university are working hard to
Welcome to the October 2007 issue of R News, organize next year’s useR! 2008; Uwe’s invitation to
the second issue for this year! Last week, R 2.6.0 Dortmund Germany appears on the last page of this
was released with many new useful features: issue.
dev2bitmap allows semi-transparent colours to be Three of the nine papers contained in this issue
used, plotmath offers access to all characters in were presented at useR! 2006 in Vienna: Peter Dal-
the Adobe Symbol encoding, and three higher-order gaard reports on new functions for multivariate anal-
functions Reduce, Filer, and Map are now available. ysis, Heather Turner and David Firth describe their
In addition, the startup time has been substantially new package gnm for fitting generalized nonlinear
reduced when starting without the methods pack- models, and Olivia Lau and colleagues use ecologi-
age. All the details about changes in R 2.6.0 are given cal inference models to infer individual-level behav-
on pages 54 ff. ior from aggregate data.
The Comprehensive R Archive Network contin- Further topics are the analysis of functional mag-
ues to reflect a very active R user and developer com- netic resonance imaging and environmental data,
munity. Since April, 187 new packages have been matching for observational studies, and machine
uploaded, ranging from ADaCGH, providing facili- learning methods for survival analysis as well as for
ties to deal with array CGH data, to yest for model categorical and continuous data. In addition, an ap-
selection and variance estimation in Gaussian inde- plication to create web interfaces for R scripts is de-
pendence models. Kurt Hornik presents this impres- scribed, and Fritz Leisch completes this issue with a
sive list starting on page 61. review of Michael J. Crawley’s The R Book.
The first North American R user conference took
place in Ames Iowa last August and many colleagues Torsten Hothorn
and friends from all over the world enjoyed a well- Ludwig–Maximilians–Universität München, Germany
organized and interesting conference. Duncan Mur- Torsten.Hothorn@R-project.org
doch and Martin Maechler report on useR! 2007 on
Contents of this issue:

Editorial . . . 1
New Functions for Multivariate Analysis . . . 2
gnm: A Package for Generalized Nonlinear Models . . . 8
fmri: A Package for Analyzing fmri Data . . . 13
Optmatch: Flexible, Optimal Matching for Observational Studies . . . 18
Random Survival Forests for R . . . 25
Rwui: A Web Application to Create User Friendly Web Interfaces for R Scripts . . . 32
The np Package: Kernel Methods for Categorical and Continuous Data . . . 35
eiPack: R × C Ecological Inference and Higher-Dimension Data Management . . . 43
The ade4 Package — II: Two-table and K-table Methods . . . 47
Review of "The R Book" . . . 53
Changes in R 2.6.0 . . . 54
Changes on CRAN . . . 61
R Foundation News . . . 72
News from the Bioconductor Project . . . 73
Past Events: useR! 2007 . . . 74
Forthcoming Events: useR! 2008 . . . 75
Forthcoming Events: R Courses in Munich . . . 76

New Functions for Multivariate Analysis


by Peter Dalgaard

R (and S-PLUS) used to have limited support for multivariate tests. We had the manova function, which extended the features of aov to multivariate responses, but like aov, this effectively assumed a balanced design, and was not capable of dealing with the within-subject transformations that are commonly used in repeated measurement modelling.

Although the methods encoded in procedures available in SAS and SPSS can seem somewhat old-fashioned, they do have some added value relative to analysis by mixed model methodology, and they have a strong tradition in several applied areas. It was thus worthwhile to extend R's capabilities to handle contrast tests, as well as Greenhouse-Geisser and Huynh-Feldt epsilons. The extensions also provide flexible ways of dealing with linear models with a multivariate response.

Theoretical setting

The general setup is given by

Y ∼ N(ΞB, I ⊗ Σ)

Here, Y is an N × p matrix and Σ is a p × p covariance matrix. The rows yi of Y are independent with the same covariance Σ. Ξ is an N × k design matrix (the reader will have to apologise that I am not using the traditional X, but that symbol is reserved for other purposes later on) and B is a k × p matrix of regression coefficients. Thus, we have the same linear model for all p response coordinates, with separate parameters contained in the columns of B.

Standard test procedures

From classical univariate and multivariate theory, a number of standard tests are available. We shall focus on three of them:

1. Testing hypotheses of simplified mean value structure: This reduction is required to be the same for all coordinates, effectively replacing the design matrix Ξ with one spanning a subspace. Such tests take the form of generalized F tests, replacing the variance ratio by

   R = MSres⁻¹ MSeff

   in which the MS terms are matrices which generalize the mean square terms from analysis of variance. Under the hypothesis, R should be distributed around the unit matrix (in the sense that the two MS matrices both have mean Σ; notice, however, that MSeff will be rank deficient whenever the degrees of freedom for the effect is less than p), but for a test statistic we need to reduce R to a scalar measure. Four such measures are cited in the literature, namely Wilks's Λ, the Pillai trace, the Hotelling-Lawley trace, and Roy's greatest root. These are all based on combinations of the eigenvalues of R. Details can be found in, e.g., Hand and Taylor (1987). Wilks's Λ is equivalent to the likelihood ratio test, but R and S-PLUS have traditionally favoured the Pillai trace based on the (rather vague) recommendations cited in Hand and Taylor (1987). Each test can be converted to an approximately F distributed statistic. If the tests are for a single-degree-of-freedom hypothesis, the matrix R has only one non-zero eigenvalue and all four tests are equivalent.

2. Testing whether Σ is proportional to a given matrix, say Σ0 (which is usually the unit matrix I): This is known as Mauchly's test of sphericity. It is based on a comparison of the determinant and the trace of U = Σ0⁻¹S, where S is the SSD (deviation sum-of-squares-and-products) matrix (MSres instead of S is equivalent). Specifically,

   W = det(U) / tr(U/p)^p

   is close to 1 if U is close to a diagonal matrix of dimension p with a constant value along the diagonal. The test statistic −f log W is asymptotically χ² distributed on p(p + 1)/2 − 1 degrees of freedom, where f is the degrees of freedom for the covariance matrix. An improved approximation is found in Anderson (1958). (A direct computation of W is sketched after this list.)

3. Testing linear hypotheses as in point 1 above, but assuming that sphericity holds. In this case, we are assuming that the covariance is known up to a constant, and thus are effectively in a univariate setting of (weighted) least squares theory. The relevant F statistic will have (p f1, p f2) degrees of freedom if the coordinatewise tests have (f1, f2) degrees of freedom.
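As promised above, the W statistic from point 2 is straightforward to compute directly. The following self-contained sketch (the data are simulated here purely for illustration, and only the plain asymptotic χ² approximation is used, not the improved one) mirrors what the mauchly.test function described later does for the case Σ0 = I:

# Illustrative direct computation of Mauchly's W (simulated data;
# mauchly.test performs this, with refinements, in practice)
set.seed(1)
Y <- matrix(rnorm(20 * 3), 20, 3)   # 20 observations, p = 3
fit <- lm(Y ~ 1)
S <- crossprod(residuals(fit))      # SSD matrix with f = 19 df
f <- nrow(Y) - 1
p <- ncol(Y)
U <- S                              # Sigma0 = I, so U = S
W <- det(U) / (sum(diag(U)) / p)^p
stat <- -f * log(W)                 # asymptotically chi-squared
pchisq(stat, p * (p + 1)/2 - 1, lower.tail = FALSE)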


Within-subject transformations

It is often necessary to consider a transformation of responses: If, for instance, coordinates are repeated measures, we might wish to test for "no change over time" or "same changes over time in different groups" (profile analysis). This leads to analysis of within-subjects contrasts, rather than of the observations themselves.

If we assume a covariance structure with random effects both between and within subject, the contrasts will cancel out the between-subject variation. They will effectively behave as if the between-subject term wasn't there and satisfy a sphericity condition. Hence we consider the transformed response YT′ (which has rows Tyi). In many cases T is chosen to annihilate another matrix X, i.e. it satisfies TX = 0; by rotation invariance, different such choices of T will lead to the same inference as long as they have maximal rank. In such cases it may well be more convenient to specify X rather than T.

For profile analysis, X could be a p-vector of ones. In that case, the hypothesis of compound symmetry implies sphericity of TΣT′ w.r.t. TT′ (but sphericity does not imply compound symmetry), and the hypothesis EYT′ = 0 is equivalent to a flat (constant level) profile.

More elaborate within-subject designs exist. As an example, consider a complete two-way layout, in which case we might want to look at the sets of (a) contrasts between row means, (b) contrasts between column means, and (c) interaction contrasts.

These contrasts connect to the theory of balanced mixed-effects analysis of variance. A model in which all interactions with subject are considered random, i.e. random effects of subjects, as well as random interaction between subject and rows and between subjects and columns, implies sphericity of each of the above sets of contrasts. The proportionality constants will be different linear combinations of the random effect variances.

A common pattern for such sets of contrasts is as follows: Define two subspaces of R^p, given by matrices X and M, such that span(X) ⊂ span(M). The transformation is calculated as T = PM − PX, where P denotes orthogonal projection. Actually, T defined like this would be singular and therefore T is thinned by deletion of linearly dependent rows (it can fairly easily be seen that it does not matter exactly how this is done). Putting M = I (the identity matrix) will make T satisfy TX = 0, which is consistent with the situation described earlier.

The choice of X and M is most easily done by viewing them as design matrices for linear models in p-dimensional space. Consider again a two-way intrasubject design. To make T a transformation which calculates contrasts between column means, choose M to describe a model with different means in each column and X to describe a single mean for all data. Preferably, but equivalently for a balanced design, let M describe an additive model in rows and columns and let X be a model in which there is only a row effect.

There is a hidden assumption involved in the choice of the orthogonal projection onto span(M). This calculates (for each subject) an ordinary least squares (OLS) fit, and for some covariance structures a weighted fit may be more appropriate. However, OLS is not actually biased, and the efficiency that we might gain is offset by the need to estimate a rather large number of parameters to find the optimal weights.

The epsilons

Assume that the conditions for a balanced analysis of variance are roughly valid. We may decide to compute, say, column means for each subject and test whether the contrasts are zero on average. There could be a loss of efficiency in the use of unweighted means, but the critical issue for an F test is whether the sphericity condition holds for the contrasts between column means.

If the sphericity condition is not valid, then the F test is in principle wrong. Instead, we could use one of the multivariate tests, but they will often have low power due to the estimation of a large number of parameters in the empirical covariance matrix.

For this reason, a methodology has evolved in which "near-sphericity" data are analyzed by F tests, but applying the so-called epsilon corrections to the degrees of freedom. The theory originates in Box (1954a), in which it is shown that F is approximately distributed as F(ε f1, ε f2), where

ε = (∑ λi / p)² / (∑ λi² / p)

and the λi are the eigenvalues of the true covariance matrix (after contrast transformation, and with respect to a similarly transformed identity matrix). One may notice that 1/p ≤ ε ≤ 1; the upper limit corresponds to sphericity (all eigenvalues equal) and the lower limit corresponds to the case where there is one dominating eigenvalue, so that data are effectively one-dimensional. Box (1954b) details the result as a function of the elements of the covariance matrix in the two-way layout.

The Box correction requires knowledge of the true covariance matrix. Greenhouse and Geisser (1959) suggest the use of εGG, which is simply the empirical version of ε, inserting the empirical covariance matrix for the true one. However, it is easily seen that this estimator is biased, at least when the true epsilon is close to 1, since εGG < 1 almost surely when p > 1.

The Huynh-Feldt correction (Huynh and Feldt, 1976) is

εHF = ((f + 1) p εGG − 2) / (p (f − p εGG))

where f is the number of degrees of freedom for the empirical covariance matrix. (The original paper has N instead of f + 1 in the numerator for the split-plot design. A correction was published 15 years later


(Lecoutre, 1991), but another 15 years later, this error is still present in SAS and SPSS.)

Notice that εHF can be larger than one, in which case you should use the uncorrected F test. Also, εHF is obtained by bias-correcting the numerator and denominator of εGG, which is not guaranteed to be helpful, and simulation studies in Huynh and Feldt (1976) indicate that it actually makes the approximations worse when εGG < 0.7 or so.
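Both epsilons are simple functions of the eigenvalues of the covariance estimate for the transformed response. A minimal sketch, where V and f are placeholders for an estimated covariance matrix of p orthonormalized contrasts and its degrees of freedom (the private sphericity function described below performs a computation of this kind):

# Sketch: Greenhouse-Geisser and Huynh-Feldt epsilons from a
# covariance matrix V of p contrasts with f degrees of freedom
# (V and f are placeholders, not package objects)
epsilons <- function(V, f) {
    p <- nrow(V)
    lambda <- eigen(V, symmetric = TRUE)$values
    eGG <- (sum(lambda)/p)^2 / (sum(lambda^2)/p)  # = (sum l)^2/(p sum l^2)
    eHF <- ((f + 1) * p * eGG - 2) / (p * (f - p * eGG))
    c(GG = eGG, HF = eHF)
}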
Implementation in R

Basics

Quite a lot of the calculations could be copied from the existing manova and summary.manova code for balanced designs. Also, objects of class mlm, inheriting from lm, were already defined as the output from lm(Y~....) when Y is a matrix. It was necessary to add the following new functions:

• SSD creates an object of (S3) class SSD, which is the sums of squares and products matrix augmented by degrees of freedom and information about the call. This is a generic function with a method for mlm objects.

• estVar calculates the estimated variance-covariance matrix. It has methods for SSD and mlm objects. In the former case, it just normalizes the SSD by the degrees of freedom, and in the latter case it calculates SSD(object) and then applies the SSD method.

• anova.mlm is used to compare two multivariate linear models or partition a single model in the usual cumulative way. The various multivariate tests can be specified using test="Pillai", etc., just as in summary.manova. In addition, it is possible to specify test="Spherical" which gives the F test under assumption of sphericity.

• mauchly.test tests for sphericity of a covariance matrix with respect to another.

• sphericity calculates εGG and εHF based on the SSD matrix and its degrees of freedom. This function is private to the stats namespace (at least currently, as of R version 2.6.0) and is used to generate the headings and adjusted p-values in anova.mlm.

Representing transformations

As previously described, sphericity tests and tests of linear hypotheses may make sense only for a transformed response. Hence we need to be able to specify transformations in anova.mlm and mauchly.test. The code has several ways to deal with this:

• The transformation matrix can be given directly using the argument T.

• It is possible to specify the arguments X and M as the matrices described previously. The defaults are set so that the default transformation is the identity.

• It is also possible to specify X and/or M using model formulas. In that case, they usually need to refer to a data frame which describes the intra-subject design. This can be given in the idata argument. The default for idata is a data frame consisting of the single variable index=1:p, which may be used to specify, e.g., a polynomial response for equispaced data.

Example

These data from the book by Maxwell and Delaney (1990) are also used by Baron and Li (2006). They show reaction times where ten subjects respond to stimuli in the absence and presence of ambient noise, and using stimuli tilted at three different angles.

> reacttime <- matrix(c(
+ 420, 420, 480, 480, 600, 780,
+ 420, 480, 480, 360, 480, 600,
+ 480, 480, 540, 660, 780, 780,
+ 420, 540, 540, 480, 780, 900,
+ 540, 660, 540, 480, 660, 720,
+ 360, 420, 360, 360, 480, 540,
+ 480, 480, 600, 540, 720, 840,
+ 480, 600, 660, 540, 720, 900,
+ 540, 600, 540, 480, 720, 780,
+ 480, 420, 540, 540, 660, 780),
+ ncol = 6, byrow = TRUE,
+ dimnames = list(subj = 1:10,
+   cond = c("deg0NA", "deg4NA", "deg8NA",
+            "deg0NP", "deg4NP", "deg8NP")))

The same data are used by example(estVar) and example(anova.mlm), so you can load the reacttime matrix just by running the examples. The following is mainly an expanded explanation of those examples.

First let us calculate the estimated covariance matrix:

> mlmfit <- lm(reacttime~1)
> estVar(mlmfit)
        cond
cond     deg0NA deg4NA deg8NA deg0NP deg4NP deg8NP
  deg0NA   3240   3400   2960   2640   3600   2840
  deg4NA   3400   7400   3600    800   4000   3400
  deg8NA   2960   3600   6240   4560   6400   7760
  deg0NP   2640    800   4560   7840   8000   7040
  deg4NP   3600   4000   6400   8000  12000  11200
  deg8NP   2840   3400   7760   7040  11200  13640


In this case there is no between-subjects structure, except for a common mean, so the result is equivalent to var(reacttime).

Next we consider tests for whether the response depends on the design at all. We generate a contrast transformation which is orthogonal to an intercept-only within-subject model; this is equivalent to any full-rank set of within-subject contrasts. A test based on multivariate normal theory is performed as follows.

> mlmfit0 <- update(mlmfit, ~0)
> anova(mlmfit, mlmfit0, X=~1)
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrasts orthogonal to
~1

  Res.Df Df Gen.var. Pillai approx F num Df den Df   Pr(>F)
1      9    1249.57
2     10  1 2013.16   0.95    17.38      5      5 0.003534 **

This gives the default Pillai test, but actually you get the same result with the other multivariate tests, since the degrees of freedom (per coordinate) only change by one.

To perform the same test, but assuming sphericity of the covariance matrix, just add an argument to the anova call:

> anova(mlmfit, mlmfit0, X=~1, test="Spherical")
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrasts orthogonal to
~1

Greenhouse-Geisser epsilon: 0.4855
Huynh-Feldt epsilon: 0.6778

  Res.Df Df Gen.var.      F num Df den Df    Pr(>F)    G-G Pr    H-F Pr
1      9   1249.6
2     10  1  2013.2 38.028      5     45 4.471e-15 2.532e-08 7.393e-11

It can be noticed that the epsilon corrections in this case are rather strong, but that the corrected p-values nevertheless are considerably more significant with this approach, reflecting the low power of the multivariate test. This is commonly the case when the number of replications is low.

To test the hypothesis of sphericity, we employ Mauchly's criterion:

> mauchly.test(mlmfit, X=~1)

        Mauchly's test of sphericity
        Contrasts orthogonal to
        ~1

data:  SSD matrix from lm(formula = reacttime ~ 1)
W = 0.0311, p-value = 0.04765

Accordingly, the hypothesis of sphericity is rejected at the 0.05 significance level.

However, the analysis of contrasts completely disregards the two-way intrasubject design. To generate tests for overall effects of deg and noise, as well as interaction between the two, we do as follows: First we need to set up the idata data frame to describe the design.

> idata <- expand.grid(
+   deg=c("0", "4", "8"),
+   noise=c("A", "P"))
> idata
  deg noise
1   0     A
2   4     A
3   8     A
4   0     P
5   4     P
6   8     P

Then we can specify the tests using model formulas to specify the X and M matrices. To test for an effect of deg, we let M specify an additive model in deg and noise and let X be the model with noise alone.

> anova(mlmfit, mlmfit0, M = ~ deg + noise,
+       X = ~ noise,
+       idata = idata, test="Spherical")
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrasts orthogonal to
~noise

Contrasts spanned by
~deg + noise

Greenhouse-Geisser epsilon: 0.9616
Huynh-Feldt epsilon: 1.2176

  Res.Df Df Gen.var.      F num Df den Df    Pr(>F)    G-G Pr    H-F Pr
1      9   1007.0
2     10  1  2703.2 40.719      2     18 2.087e-07 3.402e-07 2.087e-07

It might at this point be helpful to explain the roles of X and M a bit further. To do this, we need to access an internal function in the stats namespace, namely proj.matrix. This function calculates


the projection matrix for a given design matrix, i.e. PX = X(X′X)⁻¹X′. In the above context, we have

M <- model.matrix(~deg+noise, data=idata)
P1 <- stats:::proj.matrix(M)
X <- model.matrix(~noise, data=idata)
P2 <- stats:::proj.matrix(X)

The two design matrices are (eliminating attributes)

> M
  (Intercept) deg4 deg8 noiseP
1           1    0    0      0
2           1    1    0      0
3           1    0    1      0
4           1    0    0      1
5           1    1    0      1
6           1    0    1      1
> X
  (Intercept) noiseP
1           1      0
2           1      0
3           1      0
4           1      1
5           1      1
6           1      1

For printing the projection matrices, it is useful to include the fractions function from MASS.

> library("MASS")
> fractions(P1)
     1    2    3    4    5    6
1  2/3  1/6  1/6  1/3 -1/6 -1/6
2  1/6  2/3  1/6 -1/6  1/3 -1/6
3  1/6  1/6  2/3 -1/6 -1/6  1/3
4  1/3 -1/6 -1/6  2/3  1/6  1/6
5 -1/6  1/3 -1/6  1/6  2/3  1/6
6 -1/6 -1/6  1/3  1/6  1/6  2/3
> fractions(P2)
    1   2   3   4   5   6
1 1/3 1/3 1/3   0   0   0
2 1/3 1/3 1/3   0   0   0
3 1/3 1/3 1/3   0   0   0
4   0   0   0 1/3 1/3 1/3
5   0   0   0 1/3 1/3 1/3
6   0   0   0 1/3 1/3 1/3

Here, P2 is readily recognized as the operator that replaces each of the first three values by their average, and similarly the last three. The other one, P1, is a little harder, but if you look long enough, you will recognize the formula x̂ij = x̄i· + x̄·j − x̄·· from two-way ANOVA.

The transformation T is the difference between P1 and P2:

> fractions(P1-P2)
     1    2    3    4    5    6
1  1/3 -1/6 -1/6  1/3 -1/6 -1/6
2 -1/6  1/3 -1/6 -1/6  1/3 -1/6
3 -1/6 -1/6  1/3 -1/6 -1/6  1/3
4  1/3 -1/6 -1/6  1/3 -1/6 -1/6
5 -1/6  1/3 -1/6 -1/6  1/3 -1/6
6 -1/6 -1/6  1/3 -1/6 -1/6  1/3

This is the representation of x̄i· − x̄··, where i and j index levels of deg and noise, respectively. The matrix has (row) rank 2; for the actual computations, only the first two rows are used.

The important aspect of this matrix is that it assigns equal weights to observations at different levels of noise and that it constructs a maximal set of within-deg differences. Any other such choice of transformation leads to equivalent results, since it will be a full-rank transformation of T. For instance:

> T1 <- rbind(c(1,-1,0,1,-1,0),c(1,0,-1,1,0,-1))
> anova(mlmfit, mlmfit0, T=T1,
+       idata = idata, test = "Spherical")
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrast matrix
1 -1  0 1 -1  0
1  0 -1 1  0 -1

Greenhouse-Geisser epsilon: 0.9616
Huynh-Feldt epsilon: 1.2176

  Res.Df Df Gen.var.      F num Df den Df    Pr(>F)    G-G Pr    H-F Pr
1      9    12084
2     10  1  32438 40.719      2     18 2.087e-07 3.402e-07 2.087e-07

This differs from the previous analysis only in the Gen.var. column, which is not invariant to scaling and transformation of data.

Returning to the interpretation of the results, notice that the epsilons are much closer to one than before. This is consistent with a covariance structure induced by a mixed model containing a random interaction between subject and deg. If we perform the similar procedure for noise we get epsilons of exactly 1, because we are dealing with a single-df contrast.

> anova(mlmfit, mlmfit0, M = ~ deg + noise,
+       X = ~ deg,
+       idata = idata, test="Spherical")
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrasts orthogonal to
~deg

Contrasts spanned by
~deg + noise

Greenhouse-Geisser epsilon: 1
Huynh-Feldt epsilon: 1

  Res.Df Df Gen.var.      F num Df den Df    Pr(>F)    G-G Pr    H-F Pr
1      9     1410


2     10  1   6030 33.766      1      9 0.0002560 0.0002560 0.0002560

However, we really shouldn't be doing these tests on individual effects of deg and noise if there is interaction between the two, and as the following output shows, there is:

> anova(mlmfit, mlmfit0, X = ~ deg + noise,
+       idata = idata, test = "Spherical")
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrasts orthogonal to
~deg + noise

Greenhouse-Geisser epsilon: 0.904
Huynh-Feldt epsilon: 1.118

  Res.Df Df Gen.var.     F num Df den Df    Pr(>F)    G-G Pr    H-F Pr
1      9   316.58
2     10  1  996.34 45.31      2     18 9.424e-08 3.454e-07 9.424e-08

Finally, to test between-within interactions, rephrase them as effects of between-factors on within-contrasts. To illustrate this, we introduce a fake grouping f of the reacttime data into two groups. The interaction between f and the (implied) column factor is obtained by testing whether the contrasts orthogonal to ~1 depend on f.

> f <- factor(rep(1:2, 5))
> mlmfit2 <- update(mlmfit, ~f)
> anova(mlmfit2, X = ~1, test = "Spherical")
Analysis of Variance Table

Contrasts orthogonal to
~1

Greenhouse-Geisser epsilon: 0.4691
Huynh-Feldt epsilon: 0.6758

            Df       F num Df den Df    Pr(>F)    G-G Pr    H-F Pr
(Intercept)  1 34.9615      5     40 1.382e-13 2.207e-07 8.254e-10
f            1  0.2743      5     40   0.92452   0.79608   0.86456
Residuals    8

More detailed tests of interactions between f and deg, f and noise, and the three-way interaction can be constructed using M and X settings as described previously.

Bibliography

T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1958.

J. Baron and Y. Li. Notes on the use of R for psychology experiments and questionnaires, 2006. URL http://www.psych.upenn.edu/~baron/rpsych/rpsych.html.

G. E. P. Box. Some theorems on quadratic forms applied in the study of analysis of variance problems, I. Effect of inequality of variance in the one-way classification. Annals of Mathematical Statistics, 25(2):290–302, 1954a.

G. E. P. Box. Some theorems on quadratic forms applied in the study of analysis of variance problems, II. Effect of inequality of variance and of correlation between errors in the two-way classification. Annals of Mathematical Statistics, 25(3):484–498, 1954b.

S. W. Greenhouse and S. Geisser. On methods in the analysis of profile data. Psychometrika, 24:95–112, 1959.

D. J. Hand and C. C. Taylor. Multivariate Analysis of Variance and Repeated Measures. Chapman and Hall, 1987.

H. Huynh and L. S. Feldt. Estimation of the Box correction for degrees of freedom from sample data in randomized block and split-plot designs. Journal of Educational Statistics, 1(1):69–82, 1976.

B. Lecoutre. A correction for the ε̃ approximate test in repeated measures designs with two or more independent groups. Journal of Educational Statistics, 16(4):371–372, 1991.

S. E. Maxwell and H. D. Delaney. Designing Experiments and Analyzing Data: A Model Comparison Perspective. Brooks/Cole, Pacific Grove, CA, 1990.

Peter Dalgaard
Department of Biostatistics
University of Copenhagen
Peter.Dalgaard@R-project.org


gnm: A Package for Generalized Nonlinear Models
by Heather Turner and David Firth

In a generalized nonlinear model, the expectation of a response variable Y is related to a predictor η — a possibly nonlinear function of parameters — via a link function g:

g[E(Y)] = η(β)

and the variance of Y is equal to, or proportional to, a known function v[E(Y)]. This class of models may be thought of as extending the class of generalized linear models by allowing nonlinear terms in the predictor, or extending the class of nonlinear least squares models by allowing the variance of the response to depend on the mean.

The gnm package provides facilities for the specification, estimation and inspection of generalized nonlinear models. Models are specified via symbolic formulae, with nonlinear terms specified by functions of class "nonlin". The fitting procedure uses an iterative weighted least squares algorithm, adapted to work with over-parameterized models. Identifiability constraints are not essential for model specification and estimation, and so the difficulty of automatically defining such constraints for the entire family of generalized nonlinear models is avoided. Constraints may be pre-specified, if desired; otherwise, functions are provided in the package for conducting inference on identifiable parameter combinations after a model in an over-parameterized representation has been fitted.

Specific models which the package may be used to fit include models with multiplicative interactions, such as row-column association models (Goodman, 1979), UNIDIFF (uniform difference) models for social mobility (Xie, 1992; Erikson and Goldthorpe, 1992), GAMMI (generalized additive main effects and multiplicative interaction) models (e.g. van Eeuwijk, 1995), and Lee-Carter models for trends in age-specific mortality (Lee and Carter, 1992); diagonal-reference models for dependence on a square or hyper-square classification (Sobel, 1981, 1985); Rasch-type logit or probit models for legislative voting (e.g. de Leeuw, 2006); and stereotype multinomial regression models for ordinal response (Anderson, 1984). A comprehensive manual is distributed with the package (see vignette("gnmOverview", package = "gnm")) and this manual may also be downloaded from http://go.warwick.ac.uk/gnm. Here we give an introduction to the key functions and provide some illustrative applications.

Key Functions

The primary function defined in package gnm is the model-fitting function of the same name. This function is patterned after glm (the function included in the standard stats package for fitting generalized linear models), taking similar arguments and returning an object of class c("gnm", "glm", "lm").

A model formula is specified to gnm as the first argument. The conventional symbolic form is used to specify linear terms, whilst nonlinear terms are specified using functions of class "nonlin". The nonlin functions currently exported in gnm are summarized in Figure 1. These functions enable the specification of basic mathematical functions of predictors (Exp, Inv and Mult) and some more specialized nonlinear terms (MultHomog, Dref). Often arguments of nonlin functions may themselves be parametric functions described in symbolic form. For example, an exponential decay model with additive errors,

y = α + exp(β + γx) + e    (1)

is specified using the Exp function as

gnm(y ~ Exp(1 + x))

These "sub-formulae" (to which intercepts are not added by default) can include other nonlin functions, allowing more complex nonlinear terms to be built up. Users may also define custom nonlin functions, as we shall illustrate later.

Dref        to specify a diagonal reference term
Exp         to specify the exponential of a predictor
Inv         to specify the reciprocal of a predictor
Mult        to specify a product of predictors
MultHomog   to specify a multiplicative interaction with homogeneous effects

Figure 1: "nonlin" functions in the gnm package.

The remaining arguments to gnm are mainly control parameters and are of secondary importance. However, the eliminate argument — which implements a feature seen previously in the GLIM 4 statistical modelling system — can be extremely useful when a model contains the additive effect of a (typically 'nuisance') factor with a large number of levels; in such circumstances the use of eliminate can substantially improve computational efficiency, as well as allowing more convenient model summaries with the 'nuisance' factor omitted.


Most of the methods and accessor functions implemented for "glm" or "lm" objects are also implemented for "gnm" objects, such as print, summary, plot, coef, and so on. Since "gnm" models may be over-parameterized, care is needed when interpreting the output of some of these functions. In particular, for parameters that are not identified, an arbitrary parameterization will be returned and the standard error will be displayed as NA.

For estimable parameters, methods for profile and confint enable inference based on the profile likelihood. These methods are designed to handle the asymmetric behaviour of the log-likelihood function that is a well-known feature of many nonlinear models.

Estimates and standard errors of simple "sum-to-zero" contrasts can be obtained using getContrasts, if such contrasts are estimable. For more general linear combinations of parameters, the lower-level se function may be used instead.

Example

We illustrate here the use of some of the key functions in gnm through the analysis of a contingency table from Goodman (1979). The data are a cross-classification of a sample of British males according to each subject's occupational status and his father's occupational status. The contingency table is distributed with gnm as the data set occupationalStatus.

Goodman (1979) analysed these data using row-column association models, in which the interaction between the row and column factors is represented by components of a multiplicative interaction. The log-multiplicative form of the RC(1) model — the row-column association model with one component of the multiplicative interaction — is given by

log μrc = αr + βc + γr δc,    (2)

where μrc is the cell mean for row r and column c. Before modelling the occupational status data, Goodman (1979) first deleted cells on the main diagonal. Equivalently we can separate out the diagonal effects as follows:

log μrc = αr + βc + θrc + γr δc    (3)

where

θrc = φr if r = c, and 0 otherwise.

In addition to the "nonlin" functions provided by gnm, the package includes a number of functions for specifying structured linear interactions. The diagonal effects in model 3 can be specified using the gnm function Diag. Thus we can fit model 3 as follows:

> set.seed(1)
> RC <- gnm(Freq ~ origin + destination +
+     Diag(origin, destination) +
+     Mult(origin, destination),
+     family = poisson,
+     data = occupationalStatus)

An abbreviated version of the output given by summary(RC) is shown in Figure 2. Default constraints (as specified by options("contrasts")) are applied to linear terms: in this case the first levels of the origin and destination factors have been set to zero. However the "main effects" are not estimable, since they are aliased with the unconstrained multiplicative interaction. If the gnm call were re-evaluated from a different random seed, the parameterization for the main effects and multiplicative interaction would differ from that shown in Figure 2. On the other hand the diagonal effects, which are essentially contrasts with the off-diagonal elements, are identified here. This is immediately apparent from Figure 2, since standard errors are available for these parameters.

One could consider extending the model by adding a further component of the multiplicative interaction, giving an RC(2) model:

log μrc = αr + βc + θrc + γr δc + θr φc.    (4)

Most of the "nonlin" functions, including Mult, have an inst argument to allow the specification of multiple instances of a term, as here:

Freq ~ origin + destination +
    Diag(origin, destination) +
    Mult(origin, destination, inst = 1) +
    Mult(origin, destination, inst = 2)

Multiple instances of a term with an inst argument may be specified in shorthand using the instances function provided in the package:

Freq ~ origin + destination +
    Diag(origin, destination) +
    instances(Mult(origin, destination), 2)

The formula is then expanded by gnm before the model is fitted.

In the case of the occupational status data, however, the RC(1) model is a good fit (a residual deviance of 29.149 on 28 degrees of freedom). So rather than extending the model, we shall see if it can be simplified by using homogeneous scores in the multiplicative interaction:

log μrc = αr + βc + θrc + γr γc    (5)

This model can be obtained by updating RC using update:

> RChomog <- update(RC, . ~ .
+     - Mult(origin, destination)
+     + MultHomog(origin, destination))

We can compare the two models using anova, which shows that there is little gained by allowing heterogeneous row and column scores in the association of the fathers' occupational status and the sons' occupational status:


Call:
gnm(formula = Freq ~ origin + destination + Diag(origin, destination) +
Mult(origin, destination), family = poisson, data = occupationalStatus)

Deviance Residuals: ...

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.11881 NA NA NA
origin2 0.49005 NA NA NA
...
origin8 1.60214 NA NA NA
destination2 0.95334 NA NA NA
...
destination8 1.42813 NA NA NA
Diag(origin, destination)1 1.47923 0.45401 3.258 0.00112
...
Diag(origin, destination)8 0.40731 0.21930 1.857 0.06327
Mult(., destination).origin1 -1.74022 NA NA NA
...
Mult(., destination).origin8 1.65900 NA NA NA
Mult(origin, .).destination1 -1.32971 NA NA NA
...
Mult(origin, .).destination8 0.66730 NA NA NA

...
Residual deviance: 29.149 on 28 degrees of freedom
...

Figure 2: Abbreviated summary of model RC.

> anova(RChomog, RC)
Analysis of Deviance Table

Model 1: Freq ~ origin + destination +
    Diag(origin, destination) +
    MultHomog(origin, destination)
Model 2: Freq ~ origin + destination +
    Diag(origin, destination) +
    Mult(origin, destination)
  Resid. Df Resid. Dev Df Deviance
1        34     32.561
2        28     29.149  6    3.412

All of the parameters in model 5 can be made identifiable by constraining one level of the homogeneous multiplicative factor to zero. This can be achieved by using the constrain argument to gnm, but this would require re-fitting the model. As we are only really interested in the parameters of the multiplicative interaction, we investigate simple contrasts of these parameters using getContrasts. The gnm function pickCoef enables us to select the parameters of interest using a regular expression to match against the parameter names:

> contr <- getContrasts(RChomog,
+     pickCoef(RChomog, "MultHomog"))

A summary of contr is shown in Figure 3. By default, getContrasts sets the first level of the homogeneous multiplicative factor to zero. The quasi standard errors and quasi variances are independent of parameterization; see Firth (2003) and Firth and de Menezes (2004) for more detail. In this example the reported relative-error diagnostics indicate that the quasi-variance approximation is rather inaccurate — in contrast to the high accuracy found in many other situations by Firth and de Menezes (2004).

Users may run through the example covered in this section by calling demo(gnm).

Custom "nonlin" functions

The small number of "nonlin" functions currently distributed with gnm allow a large class of generalized nonlinear models to be specified, as exemplified through the package documentation. Nevertheless, for some models a custom "nonlin" function may be necessary or desirable. As an illustration, we shall consider a logistic predictor of the form given by SSlogis (a selfStart model for use with the stats function nls),

Asym / (1 + exp((xmid − x)/scal)).    (6)

This predictor could be specified to gnm as

~ -1 + Mult(1, Inv(Const(1) +
    Exp(Mult(1 + offset(-x), Inv(1)))))


Model call: gnm(formula = Freq ~ origin + destination + Diag(origin, destination)
    + MultHomog(origin, destination), family = poisson, data = occupationalStatus)

                                estimate    SE quasiSE quasiVar
MultHomog(origin, destination)1    0.000 0.000  0.1573  0.02473
MultHomog(origin, destination)2    0.218 0.235  0.1190  0.01416
MultHomog(origin, destination)3    0.816 0.167  0.0611  0.00374
MultHomog(origin, destination)4    1.400 0.160  0.0518  0.00269
MultHomog(origin, destination)5    1.418 0.172  0.0798  0.00637
MultHomog(origin, destination)6    1.929 0.157  0.0360  0.00129
MultHomog(origin, destination)7    2.345 0.173  0.0796  0.00634
MultHomog(origin, destination)8    2.589 0.189  0.1095  0.01200
Worst relative errors in SEs of simple contrasts (%): -19 4.4
Worst relative errors over *all* contrasts (%): -17.2 51.8

Figure 3: Simple contrasts of homogeneous row and column scores in model RChomog.

where Mult, Inv and Exp are "nonlin" functions, offset is the usual stats function and Const is a gnm function specifying a constant offset. However, this specification is rather convoluted and it would be preferable to define a single "nonlin" function that could specify a logistic term.

Our custom "nonlin" function only needs a single argument, to specify the variable x, so the skeleton of our definition might be as follows:

Logistic <- function(x){
}
class(Logistic) <- "nonlin"

Now we need to fill in the body of the function. The purpose of a "nonlin" function is to convert its arguments into a list of arguments for the internal function nonlinTerms. This function considers a nonlinear term as a mathematical function of predictors, the parameters of which need to be estimated, and possibly also variables, which have a coefficient of 1. In this terminology, Asym, xmid and scal in Equation 6 are parameters of intercept-only predictors, whilst x is a variable. Our Logistic function must specify the corresponding predictors and variables arguments of nonlinTerms. We can define the predictors argument as a named list of symbolic expressions

predictors = list(Asym = 1, xmid = 1,
                  scal = 1)

and pass the user-specified variable x as the variables argument:

variables = list(substitute(x))

We must then define the term argument of nonlinTerms, which is a function that creates a deparsed mathematical expression of the nonlinear term from labels for the predictors and variables. These labels should be passed through the arguments predLabels and varLabels respectively, so we can specify the term argument as

term = function(predLabels, varLabels){
    paste(predLabels[1], "/(1 + exp((",
          predLabels[2], "-", varLabels[1], ")/",
          predLabels[3], "))")
}

Our Logistic function need not specify any further arguments of nonlinTerms. However, we do have an idea of useful starting values, so we can also specify the start argument. This should be a function which takes a named vector of the parameters in the term and returns a vector of starting values for these parameters. The following function will set the initial scale parameter to one:

start = function(theta){
    theta[3] <- 1
    theta
}

Putting these components together, we have:

Logistic <- function(x){
    list(predictors = list(Asym = 1, xmid = 1,
                           scal = 1),
         variables = list(substitute(x)),
         term = function(predLabels, varLabels){
             paste(predLabels[1],
                   "/(1 + exp((", predLabels[2],
                   "-", varLabels[1], ")/",
                   predLabels[3], "))")
         },
         start = function(theta){
             theta[3] <- 1
             theta
         })
}
class(Logistic) <- "nonlin"

Having defined this function, we can reproduce the example from ?SSlogis as shown below.

> Chick.1 <-
+     ChickWeight[ChickWeight$Chick == 1, ]
R News ISSN 1609-3631


Vol. 7/2, October 2007 12

> fm1gnm <- gnm(weight ~ Logistic(Time) - 1,
+     data = Chick.1, trace = FALSE)
> summary(fm1gnm)

Call:
gnm(formula = weight ~ Logistic(Time) - 1,
    data = Chick.1)

Deviance Residuals:
   Min     1Q Median     3Q    Max
-4.154 -2.359  0.825  2.146  3.745

Coefficients:
     Estimate Std. Error t value Pr(>|t|)
Asym 937.0205   465.8569   2.011  0.07516
xmid  35.2228     8.3119   4.238  0.00218
scal  11.4052     0.9052  12.599 5.08e-07

(Dispersion parameter for gaussian family
taken to be 8.51804)

Residual deviance: 76.662 on 9 degrees of freedom
AIC: 64.309

Number of iterations: 14

In general, whenever a nonlinear term can be represented as an expression that is differentiable using deriv, it should be possible to define a "nonlin" function to specify the term. Additional arguments of nonlinTerms allow for the specification of factors with homologous effects and control over the way in which parameters are automatically labelled.

Summary

The functions distributed in the gnm package enable a wide range of generalized nonlinear models to be estimated. Nonlinear terms that are not (easily) represented by the "nonlin" functions provided may be implemented as custom "nonlin" functions, under fairly weak conditions.

There are several features of gnm that we have not covered in this paper. Some facilitate model inspection, such as the ofInterest argument to gnm, which allows the user to customize model summaries and simplify parameter selection. Other features are designed for particular models, such as the residSVD function for generating good starting values for a multiplicative interaction. Still others relate to particular types of data, such as the expandCategorical function for expressing categorical data as counts. These features complement the facilities we have already described, to produce a substantial and flexible package for working with generalized nonlinear models.

Acknowledgments

This work was supported by the Economic and Social Research Council (UK) through Professorial Fellowship RES-051-27-0055.

Bibliography

J. A. Anderson. Regression and ordered categorical variables. J. R. Statist. Soc. B, 46(1):1–30, 1984.

J. de Leeuw. Principal component analysis of binary data by iterated singular value decomposition. Comp. Stat. Data Anal., 50(1):21–39, 2006.

R. Erikson and J. H. Goldthorpe. The Constant Flux. Oxford: Clarendon Press, 1992.

D. Firth. Overcoming the reference category problem in the presentation of statistical models. Sociological Methodology, 33:1–18, 2003.

D. Firth and R. X. de Menezes. Quasi-variances. Biometrika, 91:65–80, 2004.

L. A. Goodman. Simple models for the analysis of association in cross-classifications having ordered categories. J. Amer. Statist. Assoc., 74:537–552, 1979.

R. D. Lee and L. Carter. Modelling and forecasting the time series of US mortality. Journal of the American Statistical Association, 87:659–671, 1992.

M. E. Sobel. Diagonal mobility models: A substantively motivated class of designs for the analysis of mobility effects. Amer. Soc. Rev., 46:893–906, 1981.

M. E. Sobel. Social mobility and fertility revisited: Some new models for the analysis of the mobility effects hypothesis. Amer. Soc. Rev., 50:699–712, 1985.

F. A. van Eeuwijk. Multiplicative interaction in generalized linear models. Biometrics, 51:1017–1032, 1995.

Y. Xie. The log-multiplicative layer effect model for comparing mobility tables. American Sociological Review, 57:380–395, 1992.

Heather Turner
University of Warwick, UK
Heather.Turner@warwick.ac.uk

David Firth
University of Warwick, UK
David.Firth@warwick.ac.uk


fmri: A Package for Analyzing fmri Data


by J. Polzehl and K. Tabelow

This article describes the usage of the R package fmri to analyze single time series BOLD fMRI (blood-oxygen-level dependent functional magnetic resonance imaging) data using structure adaptive smoothing procedures (the Propagation-Separation approach) as described in Tabelow et al. (2006). See Polzehl and Tabelow (2006) for an extended documentation.

Analysing fMRI data with the fmri package

The approach implemented in the fmri package is based on a linear model for the hemodynamic response and structural adaptive spatial smoothing of the resulting Statistical Parametric Map (SPM). The package requires R (version ≥ 2.2). 3D visualization needs the R package tkrplot.

The statistical modeling implemented with this software is described in Tabelow et al. (2006).

NOTE! This software comes with absolutely no warranty! It is not intended for clinical use, but for evaluation purposes only. Absence of bugs can not be guaranteed!

We start with a basic script for using the fmri package in a typical fMRI analysis. In the following sections we describe the consecutive steps of the analysis.

# read the data
data <- read.AFNI("afnifile")
# or read a sequence of ANALYZE files
# analyze031file.hdr ... analyze137file.hdr
# data <- read.ANALYZE("analyze",
#     numbered = TRUE, "file", 31, 107)

# create expected BOLD signal and design matrix
hrf <- fmri.stimulus(107, c(18, 48, 78), 15, 2)
x <- fmri.design(hrf)

# generate parametric map from linear model
spm <- fmri.lm(data, x)

# adaptive smoothing with maximum bandwidth hmax
spmsmooth <- fmri.smooth(spm, hmax = 4)

# calculate p-values for smoothed parametric map
pvalue <- fmri.pvalue(spmsmooth)

# write slicewise results into a file or ...
plot(pvalue, maxpvalue = 0.01, device = "jpeg",
     file = "result.jpeg")
# ... use interactive 3D visualization
plot(pvalue, maxpvalue = 0.01, type = "3d")

Reading the data

The fmri package can read ANALYZE (Mayo Foundation, 2001), AFNI HEAD/BRIK (Cox, 1996), NIFTI and DICOM files. Use

data <- read.AFNI(<filename>)
data <- read.ANALYZE(<filename>)

to create the object data from the data in file 'filename' (drop the extension in 'filename'). While AFNI data is generally given as a four-dimensional datacube in one file, ANALYZE format data often comes in a series of numbered files. A sequence of images in ANALYZE format can be imported by

data <- read.ANALYZE(prefix = "", numbered = TRUE,
                     postfix = "", picstart = 1,
                     numbpic = 1)

Setting the argument numbered to TRUE allows a list of image files to be read. The image files are assumed to be ordered by a number contained in the filename. picstart is the number of the first file, and numbpic the number of files to be read. The increment between consecutive numbers is assumed to be 1. prefix specifies a string preceding the number, whereas postfix identifies a string following the number in the filename. Note that read.ANALYZE() requires the file names to be of the form '<prefix>number<postfix>.hdr/img'.

Figure 1: View of a typical fMRI dataset. Data dimension is 64x64x26 with a total of 107 scans. A high noise level is typical. The time series is described by a linear model containing the experimental stimulus (red) and quadratic drift. Data: courtesy of H. Voss, Weill Medical College of Cornell University.

Both functions return lists of class "fmridata" with components containing the datacube ('ttt'). See Figure 1 for an illustration. The data cube is stored


in 'raw' format to save memory. The data can be extracted from the list using extract.data(). Additionally the list contains basic information like the size of the data cube ('dim') and the voxel size ('delta'), as well as the complete header information ('header'), which is itself a list with components depending on the data format read. A head mask is defined by simply using the 75% quantile of the data grey levels as cut-off. This is only used to provide improved spatial correlation estimates for the head in fmri.lm().

Expected BOLD response

In voxels affected by the experiment, the observed signal is assumed to follow the expected Blood Oxygenation Level Dependent (BOLD) signal. This signal depends on both the experimental stimulus, described by a task indicator function, and a hemodynamic response function h(t). We define h(t) as the difference of two gamma functions,

h(t) = (t/d1)^a1 exp(−(t − d1)/b1) − c (t/d2)^a2 exp(−(t − d2)/b2)

with a1 = 6, a2 = 12, b1 = 0.9, b2 = 0.9, di = ai bi (i = 1, 2), and c = 0.35, where t is the time in seconds; see Glover (1999). The expected BOLD response is given as a discrete convolution of this function with the task indicator function.
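Readers who want to inspect the shape of h(t) can transcribe the formula directly; the function name hrf.fun and the plotting call below are ours, and fmri.stimulus() computes the expected response itself, so this is purely illustrative:

# Direct transcription of h(t) above, with t in seconds
hrf.fun <- function(t, a1 = 6, a2 = 12, b1 = 0.9, b2 = 0.9, c = 0.35) {
  d1 <- a1 * b1
  d2 <- a2 * b2
  (t/d1)^a1 * exp(-(t - d1)/b1) - c * (t/d2)^a2 * exp(-(t - d2)/b2)
}
curve(hrf.fun(x), from = 0, to = 30,
      xlab = "time (s)", ylab = "h(t)")  # peak near 5 s, later undershoot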
Use

hrf <- fmri.stimulus(107, c(18, 48, 78), 15, 2)

to create the expected BOLD response for a design with 107 scans, onset times at the 18th, 48th, and 78th scan, a duration of the stimulus of 15 scans, and a time TR = 2s between two scans. In case of multiple stimuli, the results of fmri.stimulus() for the different stimuli should be arranged as columns of a matrix hrf using cbind().

The hemodynamic response function may have an unknown latency (Worsley and Taylor, 2005). This can be modeled by creating an additional explanatory variable as the first numerical derivative of any experimental stimulus:

dhrf <- (c(0,diff(hrf)) + c(diff(hrf),0))/2

See the next section for how to include this in the linear model.

Construction of the SPM

We adopt the common view of a linear model for the time series Yi = (Yit) in each voxel i after reconstruction of the raw data and motion correction,

Yi = X βi + εi,    (1)

where X denotes the design matrix. The design matrix is created by

x <- fmri.design(hrf)

This will include polynomial drift terms up to quadratic order. To deviate from this default, the order of the polynomial drift can be specified by a second argument. Use cbind() to combine several stimulus vectors into a matrix of stimuli before calling fmri.design().

The first q columns of X contain values of the expected BOLD response for the different stimuli evaluated at scan acquisition times. The other p − q columns are chosen to be orthogonal to the expected BOLD responses and to account for a slowly varying drift and possibly other external effects. The error vector εi has zero expectation and is assumed to be correlated in time. In order to assess the variability of the estimates of βi correctly, we have to take the correlation structure of the error vector εi into account. We assume an AR(1) model to be sufficient for commonly used MRI scanners. The autocorrelation coefficients ρi are estimated from the residual vector ri = (ri1, ..., riT) of the fitted model (1) as

ρ̄i = ∑_{t=2}^{T} rit ri(t−1) / ∑_{t=1}^{T} rit².

This estimate of the correlation coefficient is biased due to fitting the linear model (1). We therefore apply the bias correction given by Worsley et al. (2002), leading to an estimate ρ̃i.

We then use prewhitening to transform model (1) into a linear model with approximately uncorrelated errors. The prewhitened linear model is obtained by multiplying the terms in (1) with some matrix Ai depending on ρ̃i. The prewhitening procedure thus results in a new linear model

Ỹi = X̃i βi + ε̃i    (2)

with Ỹi = Ai Yi, X̃i = Ai X, and ε̃i = Ai εi. In the new model the errors ε̃i = (ε̃it) are approximately uncorrelated in time t, such that var ε̃i = σi² IT. Finally, least squares estimates β̃i are obtained from model (2) as

β̃i = (X̃iᵀ X̃i)⁻¹ X̃iᵀ Ỹi.

The error variance σi² is estimated from the residuals r̃i of the linear model (2) as

σ̃i² = ∑_{t=1}^{T} r̃it² / (T − p),

leading to the estimated covariance matrices

var β̃i = σ̃i² (X̃iᵀ X̃i)⁻¹.

Estimates β̃i and their estimated covariance matrices are, in the simplest case, obtained by

spm <- fmri.lm(data, x)


where data is the data object read by read.AFNI() or read.ANALYZE(), and x is the design matrix created with fmri.design(). See Figure 1 for an example of a typical fMRI dataset together with the result of the fit of the linear model to a time series.
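To make the prewhitening step of the previous section concrete, the following minimal sketch shows the transformation for a single voxel time series, given an already bias-corrected autocorrelation estimate rho (all names here are invented; fmri.lm() performs this voxelwise and more efficiently):

# Sketch of AR(1) prewhitening for one voxel (illustrative only)
prewhiten <- function(y, X, rho) {
    n <- length(y)
    # scale the first observation, then take quasi-differences
    y.w <- c(sqrt(1 - rho^2) * y[1], y[-1] - rho * y[-n])
    X.w <- rbind(sqrt(1 - rho^2) * X[1, ], X[-1, ] - rho * X[-n, ])
    lm.fit(X.w, y.w)  # ordinary least squares in the transformed model (2)
}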
To consider more than one stimulus and to esti- problem. Adaptive smoothing as implemented in
mate an effect this package also allows to improve the specificity of
γ̃ = c T β̃ (3) signal detection, see figure 2 for an illustration.
The parameter map is smoothed with
defined by a vector of contrasts c set the argument
contrast of the function fmri.lm() correspondingly spmsmooth <- fmri.smooth(spm, hmax = hmax)

hrf1 <- fmri.stimulus(214, c(18, 78, 138), 15, 2) where spm is the result of the function fmri.lm().
hrf2 <- fmri.stimulus(214, c(48, 108, 168), 15, 2) hmax is the maximum bandwidth for the smoothing
x <- fmri.design(cbind(hrf1, hrf2)) algorithm. For lkern="Gaussian" the bandwidth is
given in units of FWHM, for any other localization
# stimulus 1 only kernel the unit is voxel. hmax should be chosen as
spm1 <- fmri.lm(data, x, contrast = c(1,0))
the expected maximum size of the activation areas.
As adaptive smoothing automatically adapts to dif-
# stimulus 2 only
spm2 <- fmri.lm(data, x, contrast = c(0,1)) ferent sizes and shapes of the activation areas, over-
smoothing is not to be expected.
# contrast between both In (Tabelow et al., 2006) the use of a spa-
spm3 <- fmri.lm(data, x, contrast = c(1,-1)) tial adaptive smoothing procedure derived from
the Propagation- Separation approach (Polzehl and
If the argument vvector is set, the component Spokoiny, 2006) has been proposed in this context.
”cbeta” of the object returned by fmri.lm() con- The approach focuses, for each voxel i, on simultane-
tains a vector with the parameters corresponding to ously identifying a region where the unknown pa-
the non- zero elements in vvector in each voxel. rameter γ is approximately constant and to obtain
This may be used to include unknown latency of an optimal estimate γ̂i employing this structural in-
the hemodynamic response function in the analysis. formation. This is achieved by an iterative proce-
First define the expected BOLD response for a given dure. Local smoothing is restricted to local vicini-
stimulus and its derivative and then combine them ties of each voxel, that are characterized by a weight-
into the design matrix ing scheme. Smoothing and characterization of local
vicinities are alternated. Weights for a pair of voxels
hrf <- fmri.stimulus(107, c(18, 48, 78), 15, 2)
i and j are constructed as a product of kernel weights
dhrf <- (c(0,diff(hrf)) + c(diff(hrf),0))/2
x <- fmri.design(cbind(hrf, dhrf)) Kloc (δ (i, j)/h), depending on the distance δ (i, j) be-
spm <- fmri.lm(data, x, vvector = c(1,1)) . tween the two voxels and a bandwidth h, and a factor
reflecting the difference of the estimates γ̂i and γ̂ j ob-
The specification of vvector in the last statement re- tained within the last iteration. The bandwidth h is
sults in a vector of length 2 containing the two pa- increased with iterations up to a maximal bandwidth
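Schematically, the weight given to voxel j when re-estimating at voxel i thus has a product form like the one sketched below. The particular kernels and the penalty parameter lambda shown here are hypothetical stand-ins; the exact definitions are given in Polzehl and Spokoiny (2006).

# Schematic PS weight for one voxel pair (illustration only;
# Kloc, Kstat and lambda are placeholders, not the package's choices)
ps.weight <- function(dist.ij, h, gamma.i, gamma.j, s2.i, lambda) {
  Kloc  <- function(u) pmax(1 - u^2, 0)  # localization kernel
  Kstat <- function(u) pmax(1 - u, 0)    # statistical penalty kernel
  Kloc(dist.ij/h) * Kstat((gamma.i - gamma.j)^2/(lambda * s2.i))
}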
The name Propagation-Separation is a synonym for the two main properties of this algorithm. In case of a completely homogeneous array Γ̃, that is, IE γ̃ ≡ const., the algorithm delivers essentially the same result as a nonadaptive kernel smoother employing the bandwidth hmax. In this case the procedure selects the best of a sequence of almost nonadaptive estimates, that is, it propagates to the one with maximum bandwidth. Separation means that as soon as significant differences of γ̂i and γ̂j are observed within one iteration step, the corresponding weight is decreased to zero and the information from voxel j is no longer used to estimate γi. Voxels i and j then belong to different regions of homogeneity and are therefore separated. As a consequence, smoothing is restricted to regions with approximately constant values of γ, bias at the border of such regions is avoided, and the spatial structure of activated regions is preserved.

Figure 2: A numerical phantom for studying the performance of PS vs. Gaussian smoothing. (a) Signal locations within one slice. Eight different signal-to-noise ratios, increasing clockwise, are coded by gray values. The spatial extent of activations varies in the radial direction. The data cube contains 15 slices with activation alternated with 9 empty slices. (b) Smoothing with a conventional Gaussian filter. (c) Smoothing with PS. In both (b) and (c), the bandwidth is 3 voxels, which corresponds to FWHM = 10 mm for typical voxel size. In (b) and (c) the proportion, over slices containing activations, of voxels that are detected in a given position is rendered by gray values. Black corresponds to the absence of a detection.

For a formal description of this algorithm, a discussion of its properties, and theoretical results we refer to Polzehl and Spokoiny (2006) and Tabelow et al. (2006). Numerical complexity, as well as smoothness within homogeneous regions, is controlled by the maximum bandwidth hmax.

If the argument object contains a parameter vector for each voxel (for example to include latency, see section 4), these will be smoothed according to their estimated variance ratio; see Tabelow et al. (2006) for details on the smoothing procedure.

Figure 2 provides a comparison of signal detection employing a Gaussian filter and structural adaptive smoothing.

Signal detection

Smoothing leads to variance reduction and thus signal enhancement. It leaves us with three-dimensional arrays Γ̂, Ŝ containing the estimated effects γ̂i = cᵀβ̂i and their estimated standard deviations ŝi = (cᵀ var β̂i c)^{1/2}, obtained from time series of smoothed residuals. The voxelwise quotient θ̂i = γ̂i/ŝi of both arrays forms a statistical parametric map (SPM) Θ̂. The SPM as well as a map of p-values are generated by

pvalue <- fmri.pvalue(spmsmooth)

Under the hypothesis, that is, in the absence of activation, this SPM behaves approximately like a Gaussian Random Field, see Tabelow et al. (2006). We therefore use the theory of Gaussian Random Fields to assign appropriate p-values as a prerequisite for signal detection. Such p-values can be defined (Worsley et al., 1996) as

p_i = \sum_{d=0}^{3} R_d(V(r_x, r_y, r_z)) \, \rho_d(\hat{\theta}_i) \qquad (4)

where R_d(V) is the resel count of the search volume V and ρ_d(θ̂i) is the EC density depending only on the parameter θ̂i. Here r_x, r_y, r_z denote the effective FWHM bandwidths that measure the smoothness (in resel space, see Worsley et al. (1996)) of the random field generated by a Gaussian filter that employs the bandwidth from the last iteration of the PS procedure (Tabelow et al., 2006). R_d(V) and ρ_d are given in Worsley et al. (1996). A signal will be detected in all voxels where the observed p-value is less than or equal to a specified threshold.

Finally, we provide a statistical analysis including unknown latency of the hemodynamic response function. If spmsmooth contains a vector (see fmri.lm() and fmri.smooth()), a χ² statistic is calculated from the first two parameters and used for the p-value calculation. If delta is given, a cone statistic is used (Worsley and Taylor, 2005).

The parameter mode allows for different kinds of p-value calculation. "basic" corresponds to a global definition based on the amount of smoothness achieved by an equivalent Gaussian filter; the propagation condition ensures that under the hypothesis Θ̂ = 0 the adaptive smoothing performs like a nonadaptive filter with the same kernel function. "local" corresponds to a more conservative setting, where the p-values are derived from the estimated local resel counts that have been achieved by the adaptive smoothing. "global" takes a global median of these resel counts for the calculation.
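For instance, the stricter settings described above would be requested as follows; the numerical value for delta is purely illustrative:

pvalue.local  <- fmri.pvalue(spmsmooth, mode = "local")
pvalue.global <- fmri.pvalue(spmsmooth, mode = "global")
pvalue.cone   <- fmri.pvalue(spmsmooth, delta = 0.3)  # cone statistic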

Figure 3 provides the results of signal detection for an experimental fMRI data set.

Viewing results

Results can be displayed by a generic plot function:

plot(object, anatomic,
     device="jpeg", file="result.jpeg")

Here object is an object of class "fmridata" (and "fmrispm" or "fmripvalue") as returned by fmri.pvalue(), fmri.smooth(), fmri.lm(), read.ANALYZE() or read.AFNI(). anatomic is an anatomic underlay of the same dimension as the functional data. The argument type="3d" provides an interactive display to produce 3-dimensional illustrations on the screen (requires the R package tkrplot). The default for type is "slice", which can create output to the screen and image files.

We also provide generic functions summary() and print() for objects with class attribute "fmridata".
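For example, an interactive 3-dimensional view of the p-value map and a quick textual overview of it could be obtained with the objects created above by

plot(pvalue, anatomic, type = "3d")  # requires tkrplot
summary(pvalue)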
Figure 3: Signal detection in 4 consecutive slices using the adaptive smoothing procedure. The shape and extent of the activation areas are conserved much better than with traditional Gaussian filtering. Data: courtesy of H. Voss, Weill Medical College of Cornell University.

Writing the results to files

Finally, we provide functions to write data in standard medical image formats such as HEAD/BRIK, ANALYZE, and NIFTI:

write.AFNI("afnitest", data, c("signal"),
    note="random data", origin=c(0,0,0),
    delta=c(4,4,5), idcode="unique ID")
write.ANALYZE(data, list(pixdim=c(4,4,4,5)),
    file="analyzetest")

Some basic header information can be provided, in the case of ANALYZE files as a list with several elements (see the documentation for syntax). Any datacube created during the analysis can be written.

Bibliography

Biomedical Imaging Resource. Analyze Program. Mayo Foundation, 2001.

R. W. Cox. AFNI: Software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomed. Res., 29:162-173, 1996.

G. H. Glover. Deconvolution of impulse response in event-related BOLD fMRI. NeuroImage, 9:416-429, 1999.

J. Polzehl and V. Spokoiny. Propagation-separation approach for local likelihood estimation. Probab. Theory and Relat. Fields, 135:335-362, 2006.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007. ISBN 3-900051-07-0.

K. Tabelow, J. Polzehl, H. U. Voss, and V. Spokoiny. Analyzing fMRI experiments with structural adaptive smoothing procedures. NeuroImage, 33:55-62, 2006.

J. Polzehl and K. Tabelow. Analysing fMRI experiments with the fmri package in R. Version 1.0 - a user's guide. WIAS Technical Report No. 10, 2006.

K. J. Worsley. Spatial smoothing of autocorrelations to control the degrees of freedom in fMRI analysis. NeuroImage, 26:635-641, 2005.

K. J. Worsley, C. Liao, J. A. D. Aston, V. Petre, G. H. Duncan, F. Morales, and A. C. Evans. A general statistical analysis for fMRI data. NeuroImage, 15:1-15, 2002.

K. J. Worsley, S. Marrett, P. Neelin, K. J. Friston, and A. C. Evans. A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping, 4:58-73, 1996.

K. J. Worsley and J. E. Taylor. Detecting fMRI activation allowing for unknown latency of the hemodynamic response. NeuroImage, 29:649-654, 2006.

Jörg Polzehl & Karsten Tabelow
Weierstrass Institute for Applied Analysis
and Stochastics, Berlin, Germany
polzehl@wias-berlin.de
tabelow@wias-berlin.de

Optmatch: Flexible, Optimal Matching for Observational Studies

Ben B. Hansen

Observational studies compare subjects who received a specified treatment to others who did not, without controlling assignment to treatment and comparison groups. When the groups differ at baseline in ways that are relevant to the outcome, the study has to adjust for the differences. An old and particularly direct method of making these adjustments is to match treated subjects to controls who are similar in terms of their pretreatment characteristics, then conduct an outcome analysis conditioning upon the matched sets. Adjustments of this type enjoy properties of robustness (Rubin, 1979) and transparency not shared with purely model-based adjustments, such as covariance adjustment without matching or stratification; and with the introduction of propensity scores to matching (Rosenbaum and Rubin, 1985), the approach was shown to be more broadly applicable than was previously thought. Arguably, the reach of techniques based on matching now exceeds that of purely model-based adjustment (Hansen, 2004).

To achieve these benefits, matched adjustment requires the analyst to articulate a distinction between desirable and undesirable potential matches, and then to match treated and control subjects in such a way as to favor the more desirable pairings. Propensity scoring fits under the first of these tasks, as do the construction of Mahalanobis matching metrics (Rosenbaum and Rubin, 1985), prognostic scoring (Hansen, 2006b), and the distance metric optimization of Diamond and Sekhon (2006). The second task, matching itself, is less statistical in nature, but doing it well can substantially improve the power and robustness of matched inference (Hansen and Klopfer, 2006; Hansen, 2004). The main purpose of optmatch is to relieve the analyst of responsibility for this important, if potentially tedious, undertaking, freeing attention for other aspects of the analysis. Given discrepancies between each treatment and control subject that might potentially be matched, optmatch places them into non-overlapping matched sets, in the process solving the discrete optimization problems needed to make sums of matched discrepancies as small as possible; after this, the analysis can proceed using permutation inference (Rosenbaum, 2002; Hothorn et al., 2006; Bowers and Hansen, 2006), conditional inference (Breslow and Day, 1980; Cox and Snell, 1989; Hansen, 2004; Lumley and Therneau, 2006), approximately conditional inference (Pierce and Peters, 1992; Brazzale, 2005; Brazzale et al., 2006), or multilevel models (Smith, 1997; Raudenbush and Bryk, 2002; Gelman and Hill, 2006).

Optimal matching of two groups

To illustrate the meaning of optimal matching, consider Cox and Snell's (1981, p. 81) study of costs of nuclear power. Of 26 light water reactor plants constructed in the U.S. between 1967 and 1972, seven had been built on the site of existing plants. The problem is to estimate the cost benefit (or penalty) of building on an existing site as opposed to a new one. A matched analysis seeks to adjust for background characteristics determinative of cost, such as the date of construction and the capacity of the plant, by linking similar refurbished and new plants: plants of about the same capacity and constructed at about the same time, for example. To highlight the analogy with intervention studies, I refer to existing-site plants as "treatments" and new-site plants as "controls."

Consider the problem of arranging the plants in disjoint triples, each containing one treatment and two controls, placing each treatment and 14 of the 19 controls into some matched triple or another. A straightforward way to create such a match is to move down the list of treatments, pairing each to the two most similar controls that have not yet been matched; this is nearest-available matching. Figure 1 shows the 26 plants, their capacities and dates of construction, and a 1:2 matching constructed in this way. First A was matched to I and J, then B to L and N, and so forth. This example is discussed by Rosenbaum (2002, ch. 10).

  Existing site           New site
     date capacity           date capacity
  A   2.3    660          H   3.6    290
  B   3.0    660          I   2.3    660
  C   3.4    420          J   3.0    660
  D   3.4    130          K   2.9    110
  E   3.9    650          L   3.2    420
  F   5.9    430          M   3.4     60
  G   5.1    420          N   3.3    390
                          O   3.6    160
                          P   3.8    390
                          Q   3.4    130
                          R   3.9    650
                          S   3.9    450
                          T   3.4    380
                          U   4.5    440
                          V   4.2    690
                          W   3.8    510
                          X   4.7    390
                          Y   5.4    140
                          Z   6.1    730

"date" is date of construction, in years after 1965; "capacity" is net capacity of the power plant, in MWe above 400.

Figure 1: 1:2 matching by a nearest-available algorithm.

How might this process be improved?

To complete step i, the nearest-available algorithm requires a ranking of potential matches for treatment unit i, an ordering of available controls accounting for their differences with plant i in generating capacity and year of construction. Typically controls j are ordered in terms of a numeric discrepancy, d[i, j], from i; Figure 1's match follows Rosenbaum (2002, ch. 10) in using the sum of rank differences on the two covariates (after restricting to a subset of the plants, pt!=1):

> data("nuclear", package="boot")
> attach(nuclear[nuclear$pt!=1,])
> drk <- rank(date)
> d <- outer(drk[pr==1], drk[pr!=1], "-")
> d <- abs(d)
> crk <- rank(cap)
> d <- d +
+   abs(outer(crk[pr==1], crk[pr!=1], "-"))

(where pr==1 indicates the treatment group). The d that results from these operations is shown (after rounding) in Figure 3. Having calculated this d, one can pose the task of matching as a discrete optimization problem: find the match M = {(i, j)} minimizing ∑_M d(i, j) among all sets of pairs (i, j) in which each treatment i appears twice and each control j appears at most once.

Optimal matching refers to algorithms guaranteed to find matches attaining this minimum, or falling within a specified tolerance of it, given an nt × nc discrepancy matrix. Optimal matching's performance advantage over heuristic, non-optimal algorithms can be striking. For example, in the problem of Figure 1 optimal matching reduces nearest-available matching's sum of discrepancies by 23%. This optimal solution, found by optmatch's pairmatch function, is shown in Figure 2.

By evaluating potential matches all together rather than sequentially, optimal matching (blue lines) reduces the sum of distances by 23%.

Figure 2: Optimal vs. greedy 1:2 matching.
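As noted in the following section, this match is produced by a call of the form (the object name is arbitrary):

> pm2 <- pairmatch(d, controls = 2)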
An optimal match is optimal relative to given structural requirements, here that all treatments and a corresponding number of controls be arranged in 1:2 matched sets, and a given distance, here d. It is best-possible for the purposes of the analyst only insofar as the given distance and structural stipulations best represent the analyst's goals; in this sense the "optimal" in "optimal matching" is analogous to the "maximum" of "maximum likelihood," which is never better than the chosen likelihood model it maximizes.

For example, in the problem just discussed the structural stipulation that the matched sets all be 1:2 triples may be poorly tailored to the goal of reducing baseline differences between the groups. It is appropriate if these differences stem entirely from a small minority of controls being too unlike any treatment subjects to bear comparison to them, since it does exclude 5/19 of potential controls from the final match; but differences on a larger number of controls, even small differences, require the techniques to be described under Generalizations of pair matching, below, which may also give similar or better bias reduction without excluding as many control observations. (See also Discussion, below.)

Growing your own discrepancy matrix

Figures 1 and 2 both illustrate multivariate distance matching, aligning units so as to minimize a sum of rank discrepancies. Optmatch is entirely flexible about the form of the distance on which matches are to be made. To propensity-score match nuclear plants, for example, one would prepare a propensity distance using

> pscr <- glm(pr ~ . -(pr+cost),
+             family = binomial,
+             data = nuclear)$linear.predictors
> PR <- nuclear$pr==1
> pdist <- outer(pscr[PR], pscr[!PR], "-")
> pscr.v <- (var(pscr[PR])*(sum(PR)-1) +
+            var(pscr[!PR])*(sum(!PR)-1))/
+           (length(PR)-2)
> pdist <- abs(pdist)/sqrt(pscr.v)

or, more simply and reliably,

> pmodel <- glm(pr ~ . -(pr+cost),
+               family = binomial, data = nuclear)
> pdist <- pscore.dist(pmodel)

Then pdist is passed to pairmatch or fullmatch as its first argument. Other discrepancies on which one might match include Mahalanobis distances (which can be produced using mahal.dist) and combinations of Mahalanobis and propensity-based distances (Rosenbaum and Rubin, 1985; Gu and Rosenbaum, 1993; Rubin and Thomas, 2000). Many special requirements, such as that matches be made only within given subclasses, or that specific matches be avoided, are also introduced through the discrepancy matrix.

First consider the case where matches are to be made within subclasses only.

In the nuclear dataset, plants with and without partial turnkey (pt) guarantees should be compared separately, since the meaning of the outcome variable, cost, changes with pt. Figure 2 shows only the pt!=1 plants, and its match is generated with the command pairmatch(d, controls=2), where d is the matrix in Figure 3. To match also partial turnkey plants to each other, one gathers into a list, dl, both d and a distance matrix dpt comparing new- and existing-site partial turnkey plants, then feeds dl to pairmatch as its first argument. For propensity or Mahalanobis matching, pscore.dist or mahal.dist would do this if given the formula pr~pt as their optional argument 'structure.fmla'.

More generally, the helper function makedist (which is called by pscore.dist and mahal.dist) eases the application of a (user-defined) discrepancy-matrix-producing function to each of a sequence of strata in order to generate a list of distance matrices. For separate summed-rank distances by pt subclass, one would write a function that extracts portions of relevant variables from a given data frame, looking to a treatment-group variable to decide what portions to take, as in

> capdatediffs <- function(trt, dat) {
+   crk <- rank(dat[names(trt), "cap"])
+   names(crk) <- names(trt)
+   dmt <- outer(crk[trt], crk[!trt], "-")
+   dmt <- abs(dmt)
+
+   drk <- rank(dat[names(trt), "date"])
+   dmt <- dmt +
+     abs(outer(drk[trt], drk[!trt], "-"))
+   dmt
+ }

Then one would use makedist to apply the function separately within levels of pt:

> dl <- makedist(pr ~ pt, nuclear,
+                capdatediffs)

The result of this is a list of two distance matrices, both submatrices of the d created above: one comparing pt!=1 treatments and controls and a smaller one for pt==1 plants.

In larger problems, matching can be substantially faster if preceded by a division of the sample into subclasses; see Under the hood, below. The use of pscore.dist, mahal.dist, and makedist carries another advantage: the lists of distances they generate carry metadata to prevent fullmatch or pairmatch from getting confused about the order of observations in the data frame from which the distances were generated.

Another common aim is to forbid unwanted matches. With optmatch, this is done by placing NA's, NaN's or Inf's at the relevant places in a distance matrix. Consider matching nuclear plants within calipers of three years on date of construction. Pairings of plants that would violate this requirement are indicated in red in Figure 3. To enforce the caliper, one could generate a matrix of discrepancies dy on year of construction, then replace the distance matrix of Figure 3, d, with d/(dy<=3); this new matrix has an Inf at each entry currently shown in red in Figure 3, and otherwise is the same as in Figure 3.

Operations of these types, division and logical comparison, are compatible with subclassification prior to matching, despite the fact that the operations seem to require matrices while subclassification demands lists of matrices. Assuming one has defined a function datediffs as

> datediffs <- function(trt, data) {
+   sclr <- data[names(trt), 'date']
+   names(sclr) <- names(trt)
+   abs(outer(sclr[trt], sclr[!trt], '-'))
+ }

then the command

> dly <- makedist(pr ~ pt, nuclear,
+                 datediffs)

tabulates absolute differences on date of construction, separately for the pt==1 and pt!=1 strata. With optmatch, the expression dly<=3 returns a list of indicators of whether potential matches were built within three years of one another. Furthermore, dl/(dly<=3) is a list imposing the three-year caliper upon the distances coded in dl. To pair match on these distances, but with a 3-year date-of-construction caliper, one would use pairmatch(dl/(dly<=3)).

Generalizations of pair matching

Matching with a varying number of controls

In Figures 1 and 2, both non-optimal and optimal matches insist on precisely two controls per treatment. If one's aim is to match as closely as possible, this is a limitation. To optimally match 14 of the 19 controls, as Figure 2 does, but without requiring that they always be matched two-to-one to treatments, one would use the command fullmatch, with options 'min.controls=1' and 'omit.fraction=5/19'. The flexibility this adds improves matching even more than the switch from greedy to optimal matching did; while optimal pair matching reduced greedy pair matching's net discrepancy from 82 to 63, optimal matching with a varying number of controls brings it to 44, just over half its original value.
brings it to 44, just over half its original value.
tances were generated.
If mvnc is the match created in this way, then the
Another common aim is to forbid unwanted
structure of mvnc is returned by
matches. With optmatch, this is done by placing
NA’s, NaN’s or Inf’s at the relevant places in a distance > stratumStructure(mvnc)
matrix. Consider matching nuclear plants within stratum treatment:control ratios
calipers of three years on date of construction. Pair- 1:1 1:2 1:3 1:5
ings of plants that would violate this requirement are 4 1 1 1

Existing   New sites
site        H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z
A          28   0   3  22  14  30  17  28  26  28  20  22  23  26  21  18  34  40  28
B          24   3   0  22  10  27  14  26  24  24  16  19  20  23  18  16  31  37  25
C          10  18  14  18   4  12   6  11   9  10  14  12   6  14  22  10  16  22  28
D           7  28  24   8  14   2  10   6  12   0  24  22   4  24  32  20  18  16  38
E          17  20  16  32  18  26  20  18  12  24   0   2  20   6   8   4  14  20  14
F          20  31  28  35  20  29  22  20  14  26  12   9  22   5  15  12   9  11  12
G          14  32  29  30  18  24  17  16  10  22  12  10  17   6  16  14   4   8  17

Figure 3: Rank discrepancies of new- and existing-site nuclear plants without partial turnkey guarantees. New- and existing-site plants which differ by more than 3 years in date of construction are indicated in red.

This means it consists of four matched pairs and a matched triple, quadruple, and sextuple, all containing precisely one treatment. Ming and Rosenbaum (2000) discuss matching with a varying number of controls, implementing it in their example with a somewhat different algorithm.

Full matching

A propensity score is the conditional probability, p(x), of falling in the treatment group given covariates x (or an increasing transformation of it, such as its logit). Because subjects with large propensity scores more frequently fall in the treatment group, and subjects with low propensity scores are more frequently controls, propensity-score matching is fundamentally at odds with matching treatments and controls in fixed ratios, such as 1:1 or 1:k. These ratios must be allowed to adapt, so that 1:1 matches can be made where p(x)/(1 − p(x)) ≈ 1 while 1:k matches are made where p(x)/(1 − p(x)) ≈ 1/k; otherwise either some subjects will have to go unmatched or some subjects are bound to be poorly matched on their propensity scores. Matching with multiple controls addresses part of this problem, but only full matching (Rosenbaum, 1991) addresses it in its entirety, by also permitting l:1 matches for when p(x)/(1 − p(x)) ≈ l ≥ 2.

In general, full matching is useful when there are some regions of covariate space in which controls outnumber treatments but others in which treatments outnumber controls. This pattern emerges most clearly when matching on a propensity score, but such patterns influence the quality of matches even without propensity scores. The rank discrepancies of new- and existing-site pt==1 plants, shown in Figure 4, show it; earlier dates of construction, and smaller capacities, are more common among controls (d and e) than treatments (b only), and this is reflected in Figure 4's sums of discrepancies on rank. As a consequence, full matching achieves a net rank discrepancy (3) that is half of the minimum possible (6) with techniques that don't permit both 1:2 and 2:1 matched sets.

Existing   New sites
site        d   e   f
a           6   6   0
b           0   3   6
c           6   6   0

Figure 4: Rank discrepancies of new- and existing-site nuclear plants with partial turnkey guarantees. Boxes indicate the optimal full matching of these plants.

Gu and Rosenbaum (1993) compare full and other forms of matching in an extensive simulation study, while Hansen (2004) and Hansen and Klopfer (2006) present applications. This literature emphasizes the importance, when full matching, of using structural restrictions (upper limits on K in K:1 matched sets and on L in 1:L matched sets) in order to control the variance of matched estimation. With fullmatch, an upper limit K:1 on treatment:control ratios is conveyed using 'min.controls=1/K', while a lower limit of 1:L on the treatment:control ratio would be given with 'max.controls=L'. Hansen (2004) and Hansen and Klopfer (2006) give strategies to optimize these tuning parameters. In the context of a specific application, Hansen (2004) finds (min.controls, max.controls) = (1/2, 2)·(1 − p̂)/p̂ to work best, where p̂ represents the proportion treated in the stratum being matched. In an unrelated application, Stuart and Green (2006) find these values to work well; they may be a good starting point for general use.
point for general use.
trols outnumber treatments but others in which treat-
ments outnumber controls. This pattern emerges A somewhat related technique is matching “with
most clearly when matching on a propensity score, replacement,” in which overlap between matched
but they influence the quality of matches even with- sets is permitted in the interests of achieving closer
out propensity scores. The rank discrepancies of matches. Because of the overlap, methods appropri-
new- and existing-site pt==1 plants, shown in Fig- ate to stratified data are not generally appropriate for
ure 4, show it; earlier dates of construction, and samples matched with replacement. The technique
smaller capacities, are more common among controls forces one to resort to specialized techniques, such
(d and e) than treatments (b only), and this is re- as those of Abadie and Imbens (2006). On the other
flected in Figure 4’s sums of discrepancies on rank. hand, with-replacement matching would appear to
As a consequence, full matching achieves a net rank offer the possibility of closer matches, since its pair-
discrepancy (3) that is half of the minimum possible ing of one treatment unit in no way limits its pairing
(6) with techniques that don’t permit both 1:2 and 2:1 of the next treatment unit.
matched sets. However, it is a surprising, and evidently

However, it is a surprising, and evidently little-known, fact that with-replacement matching achieves no closer matches than full matching, a without-replacement matching technique. As discussed by Rosenbaum (1991) and (more explicitly) by Hansen and Klopfer (2006), given any criterion for a potential pairing of subjects to be acceptable, full matching matches all subjects with at least one suitable match in their comparison group, matching them only to acceptable counterparts. So one might insist, in particular, that each treatment unit be matched only to one of its nearest neighbors; by sharing controls among treated units where necessary, omitting controls who are not the nearest neighbor of some treatment, and matching to multiple controls where that can be done, full matching finds a way to honor this requirement. Since the matched sets produced by full matching never overlap, it has the advantage over with-replacement matching of combining with any method of estimation appropriate to finely stratified data.

Under the hood

Hansen and Klopfer (2006) describe the network-flows algorithm on which optmatch relies in some detail, establishing its optimality for full matching and matching with a fixed or varying number of controls. They also give upper bounds for the time complexity of the algorithm: roughly, O(n³ log(nC)), where n is the size of the sample and C is the quotient of the largest discrepancy in the distance matrix and the matching tolerance. This is comparable to the time complexity of squaring an n × n matrix. More precisely, the algorithm requires O(n·nt·nc·log(nC)) floating-point operations, where nt and nc are the sizes of the treatment and control groups.

These bounds have two practical consequences for optmatch. First, computational costs grow steeply with the size of the discrepancy matrix. Just as squaring two (n/2) × (n/2) submatrices of an n × n matrix is about four times faster than squaring the full n × n matrix, matching is made much faster by subdividing large matching problems into smaller ones. For this reason makedist is written so as to facilitate subclassification prior to matching, the effect of which is to split larger matching problems into a sequence of smaller ones. Second, the C-factor contributes secondarily to computational cost; its contribution is reduced by increasing the value of the 'tol' argument to fullmatch or pairmatch.
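For instance, a coarser tolerance can be requested as follows; the value shown is illustrative only:

> fm.fast <- fullmatch(pdist, tol = 0.01)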
Discussion

When and how matching reduces systematic differences between groups

Matching can address bias in observational studies in either of two ways. In matched sampling, it is used to select a subset of control subjects most like treatments, with the remainder of subjects excluded from analysis; in matched adjustment, it is used to force treatment-control comparisons to be based on individualized comparisons made within matched sets, which will have been so engineered that matched counterparts are more like one another than are treatments and controls on the whole. Matched sampling is typically followed by matched adjustment, but matched adjustment can be useful even when not preceded by matched sampling.

Because it excludes some five control subjects, the match depicted in Figures 1 and 2 might be used in a context of matched sampling, although it differs in important respects from typical matched samples. In archetypal cases, matched sampling is used when for cost reasons the number of controls to be followed up for outcome data has to be reduced anyway, not when outcome data is already available for the entire sample; and in archetypal cases, the reservoir of potential controls is many times larger than the size of the desired control group. See, e.g., Althauser and Rubin (1970) or Rosenbaum and Rubin (1985). The matches in Figures 1 and 2 are typical of matched sampling in matching a fixed number of controls to each treatment subject. When bias can be addressed by being very selective in the choice of controls, flexibility in the structure of matched sets becomes less important.

When there is no additional data to be collected, there may be little use for matched sampling per se, while matched adjustment may still be attractive. In these cases, it is important to recognize that matching, even optimal matching, does not in itself reduce systematic differences between treatment and control groups unless it is specifically given the flexibility to do so. Suppose, for instance, that adjustment for the variable t2 is needed in the comparison of new- and existing-site plants. This variable, which represents the time between the issue of an operating permit and a construction permit, differs markedly in its distribution among "treatments" and "controls," as seen in Figure 5: treatments have systematically larger values of it, although the two distributions overlap considerably. When two groups differ in this way, no fixed-ratio matching of them can reduce their overall discrepancy. Some of the observations will have to be set aside; or, better yet, one could match the two groups in varying ratios, using matching with multiple controls or full matching. These techniques have surprising power to reconcile differences between treatments and controls while setting aside few or even no subjects because they lack suitable counterparts; the reader is referred to Hansen (2004, § 2) for general discussion and a case study.

Figure 5: New and existing sites' differences on the variable t2 (vertical axis: "Days to Permit Issue"; groups: Controls, Treatments). To reduce these differences, one has either to drop observations or to use flexible matching techniques.

In practice, one should look critically at an optimal match before moving ahead with it toward outcome analysis, refining its generating distance and structural requirements as appropriate, just as a careful analyst deploys various diagnostics in the process of developing and refining a likelihood-based analysis. Diagnostics for matching are discussed in various methodological papers, many of them recent (Rosenbaum and Rubin, 1985; Rubin, 2001; Lee, 2006; Hansen, 2006a; Sekhon, 2007).

optmatch output

Matched pairs are often analyzed by methods particular to that structure, for example the paired t-test. However, matching with multiple controls and full matching require methods that treat the matched sets as strata. With these uses in mind, matching functions in optmatch give factor objects as their output, creating unique identifiers for each matched set and tagging them as such in the factor. (Strictly speaking, the value of a call to fullmatch or pairmatch is of the class c("optmatch", "factor"), but it is safe to treat it as a factor.) If one in fact has produced a pair match, then one can recover the paired differences using the split command:

> pm <- pairmatch(dl)
> attach(nuclear)
> unlist(split(cost[PR], pm[PR])) -
+   unlist(split(cost[!PR], pm[!PR]))

the result of which is the vector of differences

   0.1    0.2    0.3  ...    1.2   1.3
 -9.77 -10.09 184.75  ... -17.77 -4.52
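These differences can then feed an analysis suited to paired data; for example, a t-test on the matched-pair differences, sketched here with the objects just created:

> dcost <- unlist(split(cost[PR], pm[PR])) -
+          unlist(split(cost[!PR], pm[!PR]))
> t.test(dcost)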
For matched comparisons after full matching or matching with a varying number of controls, one uses such commands as

> fm <- fullmatch(dl)
> tapply(cost[PR], fm[PR], mean) -
+   tapply(cost[!PR], fm[!PR], mean)

to return differences of treatment and control means by matched set. The sizes of the matched sets, in terms of treatment units, controls, or both, can be tabulated by

> tapply(PR, fm, sum)
> tapply(!PR, fm, sum)
> tapply(fm, fm, length)

respectively. Unmatched units are automatically dropped, and split and tapply return matched-set-specific results in a common ordering (that of the levels of the match object, e.g., pm or fm).

Summary

Optmatch offers a comprehensive implementation of matching of two groups, such as treatments and controls or cases and controls, including optimal pair matching, optimal matching with k controls, optimal matching with a varying number of controls, and full matching, with and without structural restrictions.

Bibliography

A. Abadie and G. W. Imbens. Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1):235-267, 2006.

R. Althauser and D. Rubin. The computerized construction of a matched sample. American Journal of Sociology, 76(2):325-346, 1970.

J. Bowers and B. B. Hansen. Attributing effects to a cluster randomized get-out-the-vote campaign. Technical Report 448, Statistics Department, University of Michigan, October 2006. URL http://www-personal.umich.edu/%7Ejwbowers/PAPERS/bowershansen2006-10TechReport.pdf.

A. R. Brazzale. hoa: An R package bundle for higher order likelihood inference. R News, 5/1:20-27, May 2005. URL ftp://cran.r-project.org/doc/Rnews/Rnews_2005-1.pdf. ISSN 1609-3631.

A. R. Brazzale, A. C. Davison, and N. Reid. Applied Asymptotics. Cambridge University Press, 2006.

N. E. Breslow and N. E. Day. Statistical Methods in Cancer Research (Vol. 1) - The Analysis of Case-control Studies. Number 32 in IARC Scientific Publications. International Agency for Research on Cancer, 1980.

D. Cox and E. Snell. Applied Statistics. Chapman and Hall, 1981.

D. R. Cox and E. J. Snell. Analysis of Binary Data. Chapman & Hall Ltd, 1989.

A. Diamond and J. S. Sekhon. Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Technical report, Travers Department of Political Science, University of California, Berkeley, 2006. Version 1.2.

A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.

X. Gu and P. R. Rosenbaum. Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2(4):405-420, 1993.

B. B. Hansen. Appraising covariate balance after assignment to treatment by groups. Technical Report 436, University of Michigan, Statistics Department, 2006a.

B. B. Hansen. Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467):609-618, September 2004.

B. B. Hansen. Bias reduction in observational studies via prognosis scores. Technical Report 441, University of Michigan, Statistics Department, April 2006b. URL http://www.stat.lsa.umich.edu/%7Ebbh/rspaper2006-06.pdf.

B. B. Hansen and S. O. Klopfer. Optimal full matching and related designs via network flows. Journal of Computational and Graphical Statistics, 15(3):609-627, 2006. URL http://www.stat.lsa.umich.edu/%7Ebbh/hansenKlopfer2006.pdf.

T. Hothorn, K. Hornik, M. van de Wiel, and A. Zeileis. A lego system for conditional inference. The American Statistician, 60(3):257-263, 2006.

W.-S. Lee. Propensity score matching and variations on the balancing test, 2006. URL http://papers.ssrn.com/sol3/papers.cfm?abstract_id=936782#.

T. Lumley and T. Therneau. survival: Survival analysis, including penalised likelihood, 2006. R package version 2.26.

K. Ming and P. R. Rosenbaum. Substantial gains in bias reduction from matching with a variable number of controls. Biometrics, 56:118-124, 2000.

D. Pierce and D. Peters. Practical use of higher order asymptotics for multiparameter exponential families. Journal of the Royal Statistical Society. Series B (Methodological), 54(3):701-737, 1992.

S. W. Raudenbush and A. S. Bryk. Hierarchical Linear Models: Applications and Data Analysis Methods. Sage Publications Inc, 2002.

P. R. Rosenbaum. A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society, 53:597-610, 1991.

P. R. Rosenbaum. Observational Studies. Springer-Verlag, second edition, 2002.

P. R. Rosenbaum and D. B. Rubin. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39:33-38, 1985.

D. B. Rubin. Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74:318-328, 1979.

D. B. Rubin. Using propensity scores to help design observational studies: application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3):169-188, 2001.

D. B. Rubin and N. Thomas. Combining propensity score matching with additional adjustments for prognostic covariates. Journal of the American Statistical Association, 95(450):573-585, 2000.

J. S. Sekhon. Alternative balance metrics for bias reduction in matching methods for causal inference. Survey Research Center, University of California, Berkeley, 2007. URL http://sekhon.berkeley.edu/papers/SekhonBalanceMetrics.pdf.

H. Smith. Matching with multiple controls to estimate treatment effects in observational studies. Sociological Methodology, 27:325-353, 1997.

E. A. Stuart and K. M. Green. Using full matching to estimate causal effects in non-experimental studies: Examining the relationship between adolescent marijuana use and adult outcomes. Technical report, Johns Hopkins University, 2006.

Ben B. Hansen
Department of Statistics
University of Michigan
bbh@umich.edu

Random Survival Forests for R


Hemant Ishwaran and Udaya B. Kogalur

Introduction

In this article we introduce Random Survival Forests, an ensemble tree method for the analysis of right-censored survival data. As is well known, constructing ensembles from base learners, such as trees, can significantly improve learning performance. Recently, Breiman showed that ensemble learning can be further improved by injecting randomization into the base learning process, a method called Random Forests (Breiman, 2001). Random Survival Forests is closely modeled after Breiman's approach. In Random Forests, randomization is introduced in two forms. First, a randomly drawn bootstrap sample of the data is used for growing the tree. Second, the tree learner is grown by splitting nodes on randomly selected predictors. While at first glance Random Forests might seem an unusual procedure, considerable empirical evidence has shown it to be highly effective. Extensive experimentation, for example, has shown it compares favorably to state-of-the-art ensemble methods such as bagging (Breiman, 1996) and boosting (Schapire et al., 1998).

Random Survival Forests, being closely patterned after Random Forests, naturally inherits many of its good properties. Two features especially worth emphasizing are: (1) it is user-friendly in that only three, fairly robust, parameters need to be set (the number of randomly selected predictors, the number of trees grown in the forest, and the splitting rule to be used); (2) it is highly data adaptive and virtually free of model assumptions. This last property is especially helpful in survival analysis. Standard analyses often rely on restrictive assumptions such as proportional hazards. Also, with such methods there is always the concern whether associations between predictors and hazards have been modeled appropriately, and whether or not non-linear effects or higher-order interactions for predictors should be included. In contrast, such problems are handled seamlessly and automatically within a Random Forests approach.

While R currently has a Random Forests package for classification and regression problems (the randomForest() package ported by Andy Liaw and Matthew Wiener), there is currently no version available for analyzing survival data¹. The need for a Random Forests procedure separate from one that handles classification and regression problems is well motivated, as survival data possesses unique features not handled within a CART (Classification and Regression Tree) paradigm. In particular, the notion of what constitutes a good node split for growing a tree, what prediction means, and how to measure prediction performance pose unique problems in survival analysis.

Moreover, while a survival tree can in some instances be reformulated as a classification tree, thereby making it possible to use CART software for a Random Forests analysis, we believe such approaches are merely stop-gap measures that will be difficult for the average user to implement. For example, Ishwaran et al. (2004) show under a proportional hazards assumption that one can grow survival trees using the splitting rule of LeBlanc and Crowley (1992) using the rpart() algorithm (Therneau and Atkinson, 1997), hence making it possible to implement a relative risk forests analysis in R. However, this requires extensive coding on the user's part, is limited to proportional hazards settings, and the splitting rule used is only approximate.

The algorithm

It is clear that a comprehensive method with accompanying software is needed. To fill this need we introduce randomSurvivalForest, an R software package for implementing Random Survival Forests. The algorithm used by randomSurvivalForest is broadly described as follows:

1. Draw ntree bootstrap samples from the original data.

2. Grow a tree for each bootstrapped data set. At each node of the tree randomly select mtry predictors (covariates) for splitting on. Split on a predictor using a survival splitting criterion. A node is split on that predictor which maximizes survival differences across daughter nodes.

3. Grow the tree to full size under the constraint that a terminal node should have no less than nodesize unique deaths.

4. Calculate an ensemble cumulative hazard estimate by combining information from the ntree trees. One estimate for each individual in the data is calculated.

5. Compute an out-of-bag (OOB) error rate for the ensemble derived using the first b trees, where b = 1, ..., ntree.
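In terms of the R interface discussed under Usage in R below, the tuning parameters in steps 1-3 correspond directly to arguments of the fitting function rsf(). The following call is a sketch; the mtry and nodesize values shown are illustrative, not package defaults:

v.out <- rsf(Survrsf(time, status) ~ ., data = veteran,
             ntree = 1000,      # step 1: bootstrap samples/trees
             mtry = 2,          # step 2: predictors tried at each split
             nodesize = 3,      # step 3: minimum unique deaths per node
             splitrule = "logrank")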
Splitting rules

Node splits are a crucial ingredient to the algorithm.

¹We are careful to distinguish Random Forests procedures following Breiman's methodology from other approaches. Readers, for example, should be aware of the R party() package, which implements a random forests style analysis using conditional tree base learners (Hothorn et al., 2006).


vides four different survival splitting rules for the split is. In particular, the best split at node h is deter-
user. These are: (i) a log-rank splitting rule, mined by finding the predictor x∗ and split value c∗
the default splitting rule, invoked by the option such that | L( x∗ , c∗ )| ≥ | L( x, c)| for all x and c.
splitrule="logrank"; (ii) a conservation of events
splitting rule, splitrule="conserve"; (iii) a logrank Conservation of events splitting
score rule, splitrule="logrankscore"; (iv) and a
fast approximation to the logrank splitting rule, The log-rank test for splitting survival trees is a
splitrule="logrankapprox". well established concept (Segal, 1988), having been
shown to be robust in both proportional and non-
proportional hazard settings (LeBlanc and Crowley,
Notation
1993). However, one criticism often heard is that it
Assume we are at node h of a tree during its growth tends to favor continuous predictors and often suf-
and that we seek to split h into two daughter nodes. fers from an end-cut preference (favoring uneven
We introduce some notation to help discuss how the splits). However, in our experience with Random
the various splitting rules work to determine the best Survival Forests we have not found this to be a seri-
split. Assume that within h are n individuals. Denote ous deficiency. Nevertheless, to address this poten-
their survival times and 0-1 censoring information by tial problem we introduce another important class
( T1 , δ1 ), . . . , ( Tn , δn ). An individual l will be said to of test statistics for splitting that are related to con-
be right censored at time Tl if δl = 0, otherwise the servation of events; a concept introduced in Naftel
individual is said to have died at Tl if δl = 1. In the et al. (1985) (our simulations have indicated these
case of death, Tl will be refered to as an event time, tests may be much less susceptible to the aforemen-
and the death as an event. An individual l who is tioned problems).
right censored at Tl simply means the individual is Under fairly general conditions, conservation of
known to have been alive at Tl , but the exact time of events asserts that the sum of the estimated cumula-
death is unknown. tive hazard function over the observed time points
A proposed split at node h on a given predictor x (deaths and censored values) must equal the total
is always of the form x ≤ c and x > c. Such a split number of deaths. This applies to a wide collection
forms two daughter nodes (a left and right daugh- of estimates including the the Nelson-Aalen estima-
ter) and two new sets of survival data. A good split tor. The Nelson-Aalen cumulative hazard estimator
maximizes survival differences across the two sets of for daughter j is
data. Let t1 < t2 < · · · < t N be the distinct death di, j
times in the parent node h, and let di, j and Yi, j equal Ĥ j (t) = ∑
the number of deaths and individuals at risk at time ti, j ≤t Yi, j
ti in the daughter nodes j = 1, 2. Note that Yi, j is the where ti, j are the ordered death times for daughter j
number of individuals in daughter j who are alive at (note: we define 0/0 = 0).
time ti , or who have an event (death) at time ti . More Let ( Tl, j , δl, j ), for l = 1, . . . , n j , denote all survival
precisely, times and censoring indicator pairs for daughter j.
Conservation of events asserts that
Yi,1 = #{ Tl ≥ ti , xl ≤ c}, Yi,2 = #{ Tl ≥ ti , xl > c},
nj nj

where xl is the value of x for individual l = 1, . . . , n. ∑ Ĥ j ( Tl, j ) = ∑ δl, j . (1)


Finally, define Yi = Yi,1 + Yi,2 and di = di,1 + di,2 . Let l =1 l =1

n j be the total number of observations in daughter j. In other words, the total number of deaths is con-
Thus, n = n1 + n2 . Note that n1 = #{l : xl ≤ c} and served in each daughter.
n 2 = # { l : x l > c }. The conservation of events splitting rule is moti-
vated by (1). First, order the time points within each
daughter node such that
Log-rank splitting
T(1), j ≤ T(2), j ≤ · · · ≤ T(n j ), j .
The log-rank test for a split at the value c for predic-
tor x is Let δ(l ), j be the censoring indicator function for the
N   ordered value T(l ), j . Define
di
∑ di,1 − Yi,1 Yi k k
L( x, c) = s i =1 Mk, j = ∑ Ĥ j (T(l), j ) − ∑ δ(l), j , k = 1, . . . , n j .
N Y  .
Yi − di
 
i,1 Yi,1 l =1 l =1
∑ Yi 1 − Yi Yi − 1
di
One can think of Mk, j as “residuals” that measure ac-
i =1
curacy of conservation of events. The proposed test
The value | L( x, c)| is the measure of node separation. statistic takes the sum of the absolute values of Mk, j
The larger the value for | L( x, c)|, the greater the dif- for k = 1, . . . , n j for each daughter j, and weights
ference between the two groups, and the better the these values by the number of individuals at risk


within each group. Observe that Mn j , j = 0, but noth- approximation, first rewrite the numerator of L( x, c)
ing can be said about Mk, j for k < n j . Thus, by in a form that uses the Nelson-Aalen estimator for
considering Mk, j for each k, the proposed test mea- the parent node. The Nelson-Aalen estimator is
sures how evenly distributed conservation of events
di
is over all deaths. The measure of conservation of Ĥ (t) = ∑ .
events for the split on x at the value c is Y
t ≤t i
i

2 n j −1 As shown in LeBlanc and Crowley (1993) one can


1
Conserve( x, c) = ∑ Y1, j ∑ |Mk, j |. write
Y1,1 + Y1,2 j=1 k=1
N   n
di
This value is small if the two groups are well sepa- ∑ di,1 − Yi,1 Yi = D1 − ∑ I{xl ≤ c} Ĥ (Tl ),
i =1 l =1
rated. Because we want to maximize survival differ-
ences due to a split, we use the transformed value where D j = ∑iN=1 di, j for j = 1, 2. Because the
1/(1 + Conserve( x, c)) as our measure of node sep- Nelson-Aalen estimator is computed on the parent
aration. node, and not daughter nodes, this yields an efficient
The preceding expression for Conserve( x, c) can way to compute the numerator of L( x, c).
be quite expensive to compute as it involves sum- Now to simplify the denominator, we approxi-
ming over all survival times within the daughter mate the variance of the numerator of L( x, c) as in
nodes. However, we can greatly reduce the amount Section 7.7 of Cox and Oakes (1988) (this approxi-
of work by compressing the sums to involve only mation was suggested to us by Michael LeBlanc in
event times. With some work, one can show that personal communication). Setting D = ∑iN=1 di , we
Conserve( x, c) is equivalent to: get the following approximation to the log-rank test
2 N −1
(
k d
) L( x, c):
1 l, j
Y1, j ∑ Nk, j Yk+1, j ∑ ,
Y1,1 + Y1,2 j∑
!
Y n
=1 k=1 l =1 l, j
D 1/2 D1 − ∑ I{xl ≤ c} Ĥ (Tl )
l =1
where Ni, j = Yi, j − Yi+1, j equals the number of ob- v( )( ).
servations in daughter j with observed time falling n n
u
u
within the interval [ti , ti+1 ) for i = 1, . . . , N where
t
∑ I{xl ≤ c} Ĥ (Tl ) D− ∑ I{xl ≤ c} Ĥ (Tl )
t N +1 = ∞. l =1 l =1

Log-rank score splitting Ensemble estimation


Another useful splitting rule available within the The randomSurvivalForest package produces an
randomSurvivalForest package is the log-rank ensemble estimate for the cumulative hazard func-
score test of Hothorn and Lausen (2003). To describe tion. This is our predictor and key deliverable. Error
this rule, assume the predictor x has been ordered so rate performance is calculated based on this value.
that x1 ≤ x2 ≤ · · · ≤ xn . Now, compute the “ranks” The ensemble is derived as follows. First, for each
for each survival time Tl , tree grown from a bootstrap data set we estimate the
Γl cumulative hazard function for the tree. This is ac-
δk complished by grouping hazard estimates by termi-
al = δl − ∑
k=1
n − Γk + 1 nal nodes. Consider a specific node h. Let {tl,h } be
the distinct death times in h and let dl,h and Yl,h equal
where Γk = #{t : Tt ≤ Tk }. The log-rank score test is the number of deaths and individuals at risk at time
defined as tl,h . The cumulative hazard estimate for node h is de-
∑ x ≤c al − n1 a fined as
S( x, c) = r l  dl,h
n  Ĥh (t) = ∑ .
n1 1 − 1 s2a t ≤t Yl,h
n l,h

Each tree provides a sequence of such estimates,


where a and s2a are the sample mean and sample vari-
Ĥh (t). If there are M terminal nodes in the tree,
ance of { al : l = 1, . . . , n}. Log-rank score splitting
then there are M such estimates. To compute Ĥ (t|xi )
defines the measure of node separation by | S( x, c)|.
for an individual i with predictor xi , simply drop xi
Maximizing this value over x and c yields the best
down the tree. The terminal node for i yields the de-
split.
sired estimator. More precisely,

Approximate logrank splitting Ĥ (t|xi ) = Ĥh (t), if xi ∈ h. (2)

An approximate log-rank test can be used in place of Note this value is computed for all individuals i in the
L( x, c) to greatly reduce computations. To derive the data.


The estimate (2) is based on one tree. To produce our ensemble we average (2) over all ntree trees. Let Ĥb(t|x) denote the cumulative hazard estimate (2) for tree b = 1, ..., ntree. Define Ii,b = 1 if i is an OOB point for b; otherwise set Ii,b = 0. The OOB ensemble cumulative hazard estimator for i is

\hat{H}^*_e(t | x_i) = \frac{\sum_{b=1}^{ntree} I_{i,b} \, \hat{H}_b(t | x_i)}{\sum_{b=1}^{ntree} I_{i,b}}.

Observe that the estimator is obtained by averaging over only those bootstrap samples in which i is excluded (i.e., those datasets in which i is an OOB value). The OOB estimator is in contrast to the ensemble cumulative hazard estimator that uses all samples:

\hat{H}_e(t | x_i) = \frac{1}{ntree} \sum_{b=1}^{ntree} \hat{H}_b(t | x_i).

Concordance error rate

Given the OOB estimator Ĥ*e(t|x), it is a simple matter to compute the error rate. We measure error using Harrell's concordance index (Harrell et al., 1982). Unlike other measures of survival performance, Harrell's C-index does not depend on choosing a fixed time for evaluation of the model and specifically takes into account censoring of individuals (May et al., 2004). The method has quickly become quite popular in the literature as a means for assessing prediction performance in survival analysis settings. See Kattan et al. (1998) and references therein.

To compute the concordance index we must define what constitutes a worse predicted outcome. We take the following approach. Let t*1, ..., t*N denote all unique event times in the data. Individual i is said to have a worse outcome than j if

\sum_{k=1}^{N} \hat{H}^*_e(t^*_k | x_i) > \sum_{k=1}^{N} \hat{H}^*_e(t^*_k | x_j).

The concordance error rate is computed as follows:

1. Form all possible pairs of observations over all the data.

2. Omit those pairs where the shorter event time is censored. Also, omit pairs i and j if Ti = Tj unless δi = 1 and δj = 0 or δi = 0 and δj = 1. The last restriction only allows ties if one of the observations is a death and the other a censored observation. Let Permissible denote the total number of permissible pairs.

3. Count 1 for each permissible pair in which the shorter event time had the worse predicted outcome. Count 0.5 if the predicted outcomes are tied. Let Concordance denote the total sum over all permissible pairs.

4. Define the concordance index C as

C = Concordance / Permissible.

5. The error rate is Error = 1 − C. Note that 0 ≤ Error ≤ 1 and that Error = 0.5 corresponds to a procedure doing no better than random guessing, whereas Error = 0 indicates perfect accuracy.
Observe that the estimator is obtained by averag-
ing over only those bootstrap samples in which i is
excluded (i.e., those datasets in which i is an OOB Usage in R
value). The OOB estimator is in contrast to the
ensemble cumulative hazard estimator that uses all The user interface to randomSurvivalforest is sim-
samples: ilar in many aspects to randomForest and as the
reader may have already noticed, many of the argu-
1 ntree ment names are also the same. This was done de-
Ĥe (t|xi ) = Ĥb (t|xi ).
ntree b∑
=1
liberately in order to promote compatibility between
the two packages. The primary R function call to the
randomSurvivalforest package is rsf(). The on-
Concordance error rate line documentation describes rsf() in great detail
and there is no reason to repeat this information here.
Given the OOB estimator Ĥe∗ (t|x), it is a simple mat- Different R wrapper functions are provided with the
ter to compute the error rate. We measure error us- randomSurvivalforest package to aid in interpret-
ing Harrell’s concordance index (Harrell et al., 1982). ing the object produced by rsf(). The examples
Unlike other measures of survival performance, Har- given below illustrate how some of these wrappers
rell’s C-index does not depend on choosing a fixed work, and also indicate how rsf() might be used in
time for evaluation of the model and specifically practice.
takes into account censoring of individuals (May
et al., 2004). The method has quickly become quite Lung-vet data
popular in the literature as a means for assessing
prediction performance in survival analysis settings. For our first example, we use the well known
See Kattan et al. (1998) and references therein. veteran’s administration lung cancer data from
To compute the concordance index we must de- Kalbfleisch and Prentice (Kalbfleisch and Prentice,
fine what constitutes a worse predicted outcome. We 1980). This is an example data set available within
take the following approach. Let t∗1 , . . . , t∗N denote all the package. In total there are 6 predictors in the
unique event times in the data. Individual i is said to data. We first focus on analysis that includes only
have a worse outcome than j if Karnofsky score as a predictor:

N N > library("randomSurvivalForest")
∑ Ĥe∗ (t∗k |xi ) > ∑ Ĥe∗ (t∗k |x j ). > data(veteran,package="randomSurvivalForest")
k=1 k=1 > ntree <- 1000
> v.out <- rsf(Survrsf(time,status) ~ karno,
The concordance error rate is computed as follows:
veteran, ntree=ntree, forest=T)
1. Form all possible pairs of observations over all > print(v.out)
the data.
Call:
2. Omit those pairs where the shorter event time rsf.default(formula = Survrsf(time, status)
is censored. Also, omit pairs i and j if Ti = Tj ~ karno, data = veteran, ntree = ntree)
unless δi = 1 and δ j = 0 or δi = 0 and δ j = 1.
The last restriction only allows ties if one of Sample size: 137
the observations is a death and the other a cen- Number of deaths: 128
sored observation. Let Permissible denote the Number of trees: 1000
total number of permissible pairs. Minimum terminal node size: 3
3. Count 1 for each permissible pair in which the Average no. of terminal nodes: 8.437
shorter event time had the worse predicted out- No. of variables tried at each split: 1
come. Count 0.5 if the predicted outcomes are Total no. of variables: 1
tied. Let Concordance denote the total sum Splitting rule: logrank
over all permissible pairs. Estimate of error rate: 36.28%

R News ISSN 1609-3631


Vol. 7/2, October 2007 29

The error rate is significantly smaller than 0.5, the The analysis shows that logrankscore has the best
benchmark value associated with a procedure no bet- predictive performance (logrank is a close second).
ter than flipping a coin. This is very strong evidence Standard deviations in all cases are reasonably small.
that Karnofsky score is predictive. It is interesting to observe that the mean error rates
We can investigate the effect of Karnofsky score are not substantially smaller than our previous anal-
more closely by considering how the ensemble esti- ysis which used only Karnofsky score, thus indicat-
mated mortality varies as a function of the predictor: ing the predictor is highly influential. Our next ex-
ample illustrates further techniques for studying the
> plot.variable(v.out, partial=T)
informativeness of a predictor.
Figure 1, produced by the above command, is a par-
tial plot of Karnofsky score. The vertical axis repre- Primary biliary cirrhosis (PBC) of the liver
sents expected number of deaths.
Next we consider the PBC data set found in appendix
● ● ● ●
D.1 of Fleming and Harrington (Fleming and Har-
rington, 1991). This is also an example data set avail-
120



able in the package. Similar to the previous analysis
we analyzed the data by running a forest analysis for


each of the four splitting rules, repeating the analy-
sis 100 times independently (as before ntree was set
80


to 1000). The R code is similar as before and sup-
pressed:
60

logrank conserve logrankscore logrankapx


40

mean 0.1703 0.1677 0.1719 0.1602


std 0.0014 0.0014 0.0015 0.0020
20

● ●

20 40 60 80 100 As can be seen, the error rates are between 16-17%


karno with logrankapprox having the lowest value.
Figure 1: Partial plot of Karnofsky score. Vertical axis
is mortality ∑kN=1 Nk Ĥe (t∗k | x) for a given Karnofsky
0.30

age ●

value x and represents expected number of deaths. bili ●

prothrombin ●

copper ●

Now we run an analysis with all six predictors edema


albumin ●

0.25

under each of the four splitting rules. For each split- chol ●
Error Rate

ascites ●
ting rule we run 100 replications and record the mean spiders ●

and standard deviation of the concordance error rate stage


hepatom

(as before ntree equals 1000): sex ●


0.20

platelet ●

sgot ●
> splitrule <- c("logrank", "conserve", trig ●

"logrankscore", "logrankapprox") treatment


alk ●

> nrep <- 100


> err.rate <- matrix(0, 4, nrep) 0 200 600 1000 0.000 0.010
> names(err.rate) <- splitrule Number of Trees Importance
> v.f <-
as.formula("Survrsf(time,status) ~ .") Figure 2: Error rate for PBC data as a function of trees
> for (j in 1:4) { (left-side) and out-of-bag importance values for pre-
> for (k in 1:nrep) { dictors (right-side).
> err.rate[j,k] <- rsf(v.f,
veteran,ntree=ntree, We now consider the informativeness of each pre-
splitrule=splitrule[j])$err.rate[ntree] dictor under the logrankapprox splitting rule:
> }
> } > data("pbc",package="randomSurvivalForest")
> err.rate <- rbind( > pbc.f <- as.formula("Survrsf(days,status)~.")
mean=apply(err.rate, 1, mean), > pbc.out <- rsf(pbc.f, pbc, ntree=ntree,
std=apply(err.rate, 1, sd)) splitrule = "logrankapprox", forest=T)
> colnames(err.rate) <- splitrule > plot(pbc.out)
> print(round(err.rate,4)) Figure 2 depicts the importance values for all 17 pre-
dictors. From the plot we see that “age” and “bili”
logrank conserve logrankscore logrankapx are clearly predictive and have substantially larger
mean 0.2982 0.3239 0.2951 0.3170 importance values than all other predictors. The par-
std 0.0027 0.0034 0.0027 0.0046 tial plots for the top six predictors are displayed in

R News ISSN 1609-3631


Vol. 7/2, October 2007 30

Figure 3. The figure was produced using the com- Large scale problems
mand:
In terms of computationally challenging problems,
> plot.variable(pbc.out,3,partial=T,n.pred=6) we have applied randomSurvivalForest success-
fully to several large survival datasets. For exam-
We now consider the incremental effect of each ple, we have considered data collected at the Cleve-
predictor using a nested analysis. We sort predictors land Clinic involving over 20,000 records and well
by their importance values and consider the nested over 60 predictors. We have also analyzed a data set
sequence of models starting with the top variable, containing 1,000 records and with almost 250 predic-
followed by the model with the top 2 variables, then tors. Our success with these applications is consis-
the model with the top three variables, and so on: tent with that seen for Random Forests: namely, that
> imp <- pbc.out$importance the methodology has been shown to scale up very
> pnames <- pbc.out$predictorNames nicely, even in very large predictor spaces and with
> pnames.order <- pnames[rev(order(imp))] large sample sizes. In terms of computational speed,
> n.pred <- length(pnames) we have found that logrankapprox is almost always
> pbc.err <- rep(0, n.pred) fastest. After that, conserve is second fastest. For
> for (k in 1:n.pred){ very large datasets, discretizing continuous predic-
> rsf.f <- "Survrsf(days,status)~" tors and/or the observed survival times can greatly
> rsf.f <- as.formula(paste(rsf.f, speed up computational times. Discretization does
> paste(pnames.order[1:k],collapse="+"))) not have to be overly granular for substantial gains
> pbc.err[k] <- rsf(rsf.f, pbc, ntree=ntree, to be seen.
> splitrule="logrankapprox")$err.rate[ntree]
> }
> pbc.imp.out <- as.data.frame( Acknowledgements
> cbind(round(rev(sort(imp)),4),
> round(pbc.err,4), The authors are extremely grateful to Eugene H.
> round(-diff(c(0.5,pbc.err)),4)), Blackstone and Michael S. Lauer for their tremen-
> row.names=pnames.order) dous support, feedback, and generous time given to
>colnames(pbc.imp.out) <- us throughout this project. Without their help this
> c("Imp","Err","Drop Err") project would surely never have been completed. We
> print(pbc.imp.out) also thank the referee of the paper and Paul Murrell
for their constructive and helpful comments. This re-
Imp Err Drop Err search was supported by the National Institutes of
age 0.0130 0.3961 0.1039 Health RO1 grant HL-072771.
bili 0.0081 0.1996 0.1965
prothrombin 0.0052 0.1918 0.0078
copper 0.0042 0.1685 0.0233
Bibliography
edema 0.0030 0.1647 0.0038
L. Breiman. Random forests. Machine Learning, 45:
albumin 0.0026 0.1569 0.0078
5–32, 2001.
chol 0.0022 0.1606 -0.0037
ascites 0.0014 0.1570 0.0036 L. Breiman. Bagging predictors. Machine Learning,
spiders 0.0013 0.1601 -0.0030 26:123–140, 1996.
stage 0.0013 0.1557 0.0043
hepatom 0.0008 0.1570 -0.0013 D.R. Cox and D. Oakes. Analysis of Survival Data.
sex 0.0006 0.1549 0.0021 Chapman and Hall, London, 1998.
platelet 0.0000 0.1565 -0.0016
T. Fleming and D. Harrington. Counting Processes and
sgot -0.0012 0.1538 0.0027
Survival Analysis. Wiley, New York, 1991.
trig -0.0017 0.1545 -0.0007
treatment -0.0019 0.1596 -0.0052 F. Harrell, R. Califf, D. Pryor, K. Lee, and R. Rosati.
alk -0.0040 0.1565 0.0032 Evaluating the yield of medical tests. J. Amer. Med.
The first column is the importance value of a predic- Assoc., 247:2543–2546, 1982.
tor in the full model. The kth value in the second col- T. Hothorn, K. Hornik, and A. Zeileis. Unbiased
umn is the error rate for the kth nested model, while recursive partitioning: A conditional inference
the kth value in the third column is the difference be- framework. J. Comp. Graph. Statist., 15:651–674,
tween the error rate for the kth and (k − 1)th nested 2006.
model, where the error rate for the null model, k = 0,
is 0.5. One can see not much is gained by using more T. Hothorn, and B. Lausen. On the exact distribu-
than 6-7 predictors and that the top 3-4 predictors ac- tion of maximally selected rank statistics. Comput.
count for much of the predictive power. Statist. Data Analysis, 43:121–137, 2003.

R News ISSN 1609-3631


Vol. 7/2, October 2007 31

110

120

110

● ●

● ● ●
● ●
●●
●●● ●●●● ● ●●
● ●




●●
● ●
100

95 100
● ●
● ●

90 100

● ●


●●
●●
● ●● ● ● ●●● ● ●
● ●● ●● ●●● ●
● ●●
● ● ● ●● ●

●●●
90

90
10000 20000 0 5 10 15 20 25 10 12 14 16
age bili prothrombin

115
120

100 105




110

● ● ●

105


● ●
● ●●

●●●● ●● ●


100



95

● ●
●●

90 95

● ●
●● ● ●●
●●● ●● ●
90


●●●●●●
90

0 200 400 600 0.0 0.5 1.0 2.0 2.5 3.0 3.5 4.0
copper edema albumin

Figure 3: Partial plots for top six predictors from PBC data. Values on the vertical axis represent expected
number of deaths for a given predictor, after adjusting for all other predictors. Dashed red lines for continuous
predictors are ± 2 standard error bars.

H. Ishwaran, E. Blackstone, M. Lauer, and C. Pothier. D. Naftel, E. Blackstone, and M. Turner. Conserva-
Relative risk forests for exercise heart rate recovery tion of events, 1985. Unpublished notes.
as a predictor of mortality. J. Amer. Stat. Assoc., 99:
591–600, 2004. R. Schapire, Y. Freund, P. Bartlett, and W. Lee. Boost-
ing the margin: a new explanation for the effective-
J. Kalbfleisch and R. Prentice. The Statistical Analysis ness of voting methods. Ann. Statist., 26(5):1651–
of Failure Time Data. Wiley, New York, 1980. 1686, 1998.
M. Kattan, K. Hess, and J. Beck. Experiments to
determine whether recursive partitioning (CART) M.R. Segal. Regression trees for censored data. Bio-
or an artificial neural network overcomes theoreti- metrics, 44:35–47, 1988.
cal limitations of Cox proportional hazards regres-
sion. Computers and Biomedical Research, 31:363– T. Therneau and E. Atkinson. An introduction to re-
373, 1998. cursive partitioning using the rpart routine. Tech-
nical Report 61, 1997. Section of Biostatistics, Mayo
M. LeBlanc and J. Crowley. Relative risk trees for cen- Clinic, Rochester.
sored survival data. Biometrics, 48:411–425, 1992.
M. LeBlanc and J. Crowley. Survival trees by good-
ness of split. J. Amer. Stat. Assoc., 88:457–467, 1993. Hemant Ishwaran
Department of Quantitative Health Sciences
M. May, P. Royston, M. Egger, A.C. Justice and Cleveland Clinic, Cleveland, U.S.A.
J.A.C. Sterne. Development and validation of a hemant.ishwaran@gmail.com
prognostic model for survival time data: applica- Udaya B. Kogalur
tion to prognosis of HIV positive patients treated Kogalur Shear Corporation
with antiretroviral therapy. Statist. Medicine., 23: Clemmons, U.S.A.
2375–2398, 2004. ubk@kogalur-shear.com

R News ISSN 1609-3631


Vol. 7/2, October 2007 32

Rwui: A Web Application to Create User


Friendly Web Interfaces for R Scripts
by Richard Newton and Lorenz Wernisch a fully functional web interface for an R script.
A web interface for an R script means that the
script can be used by anyone, even if they have no
Summary knowledge of R. Instead of using the R script directly,
The web application Rwui is used to create web inter- values for variables and data files for processing are
faces for running R scripts. All the code is generated first submitted by the user on a web form. The appli-
automatically so that a fully functional web interface cation then runs the R script on a server, out of sight
for an R script can be downloaded and up and run- of the user, and returns the results of the analysis to
ning in a matter of minutes. the user’s web page. Because the web application
Rwui is aimed at R script writers who have runs on a server it can be accessed remotely and the
scripts that they want people unversed in R to use. user does not need to have R installed on their ma-
The script writer uses Rwui to create a web appli- chine. And updates to the script need only be made
cation that will run their R script. Rwui allows the to the copy on the server.
script writer to do this without them having to do Although of general applicability, Rwui has been
any web application programming, because Rwui designed with bioinformatics applications in mind.
generates all the code for them. To this end the web applications created by Rwui
The script writer designs the web application to can include features often required in bioinformat-
run their R script by entering information on a se- ics data analysis, such as a page for uploading repli-
quence of web pages. The script writer then down- cate data files and their group identifiers. Rwui is
loads the application they have created and installs typically aimed at bioinformaticians who have de-
it on their own server. This is a simple matter of veloped an R script for analysing experimental data
copying one file from the download onto the server. and who want to make their method immediately
The script writer now has a web application on their accessible to collaborators who are unfamiliar with
server that runs their R script, that other people can R. Rwui enables bioinformaticians to do this quickly
use over the web. and simply, without having to concern themselves
Although of general applicability, Rwui was de- with any aspects of web application programming.
signed primarily with bioinformatics applications in The completed web applications run on Tomcat
mind; aimed at bioinformaticians who are develop- servers (http://tomcat.apache.org/). Tomcat is free
ing a statistical analysis of experimental data for col- and widely used server software, very easy to in-
laborators and who want to automate their analysis stall on both Unix and Windows machines, and with
in a user friendly way. Rwui may also be of use for an impeccable security record. There have been no
creating teaching applications. known instances of the safety of data on Tomcat
Rwui may be found at http://rwui.cryst.bbk.ac.uk servers being compromised despite its widespread
use. But if data cannot be sent over the internet
then either the web applications created by Rwui
Introduction can be installed on a local Tomcat server for inter-
nal use, situated behind a firewall, or Tomcat and the
R is widely used in the field of bioinformatics. web applications created by Rwui can be installed on
The Bioconductor project (Gentleman et al., 2004) individual stand-alone machines, in which case the
contains R packages specifically designed for this web applications are accessed in a browser on the
field. However many potential users of bioinformat- machine via the ‘localhost’ URL. Instructions on in-
ics programs written in R, who come from a non- stalling and running Tomcat can be accessed from a
bioinformatics background, are unfamiliar with the link on the ‘Help’ page of Rwui.
language. One solution to this problem is for devel- A number of approaches to providing web in-
opers of R scripts to provide user-friendly web inter- terfaces to R scripts exist. A list with links
faces for their scripts. can be found on the R website (http://cran.r-
Rwui (R Web User Interface) is a web application project.org/doc/FAQ/R-FAQ.html) in the section ‘R
that the developer of an R script can use to create Web Interfaces’. These fall into two main categories.
a web application for running their script. All the Firstly there are projects which allow the user to sub-
code for the web application is generated automat- mit R code to a remote server. With these methods,
ically. This is the key feature of Rwui and means the end user must handle R code. In contrast, users
that it only takes a few minutes for someone who of applications created by Rwui have no contact with
is entirely unfamiliar with web application program- R code.
ming to design, download and install on their server Secondly there are projects which facilitate pro-

R News ISSN 1609-3631


Vol. 7/2, October 2007 33

Figure 1: Screenshot showing the web page of Rwui where input items are added.

gramming web applications that run R code. With 1 shows the web page of Rwui on which the input
these methods, whoever is creating the application items that will appear in the application are added.
must write web application code. In contrast, when Section headings can also be added if required.
using Rwui no web application code needs to be Rwui displays a facsimile of the web page that has
written; a web interface for an R script is created been created as items are added to the page. This
simply by selecting options from a sequence of web can be seen in the lower part of the screenshot shown
pages. in Figure 1. Input items are given a number so that
items can be deleted and new items inserted between
existing items. After uploading the R script, Rwui
Using the Application generates the web application, which can be down-
loaded as a zip or tgz file.
The information that Rwui requires in order to create Rwui creates Java based applications
a web application for running an R script is entered that use the Apache Struts framework
on a sequence of forms. After entering a title and in- (http://struts.apache.org/). Struts is open source
troductory text for the application, the input items and a popular framework for constructing well-
that will appear on the application’s web page are organised, stable and extensible web applications.
selected. Input items may be Numeric or Text entry The completed applications will run on a Tom-
boxes, Checkboxes, Drop-down lists, Radio Buttons, cat server. All that needs to be done to use the
File Upload boxes and a Multiple/Replicate File Up- downloaded web application is to place the appli-
load page. Each of the input variables of the R script, cation’s ‘web application archive’ file, which is con-
that is, those variables in the script that require a tained in the download, into a subdirectory of the
value supplied by the user, must have a correspond- Tomcat server. In addition, the file permissions of an
ing input item on the application’s web page. Figure included shell script must be changed to executable.

R News ISSN 1609-3631


Vol. 7/2, October 2007 34

Figure 2: Screenshot showing an example of a results web page of an application created using Rwui.

An application description file, written in XML, is R script Requirements


included in the download. This is useful if the appli-
cation requires modification at a later date. The de-
tails of the application can be re-entered into Rwui by
uploading the application description file. The appli- An R script needs little or no modification in order
cation can then be edited and rebuilt within Rwui. to be run from a web application created by Rwui.
There are three areas where the script may require
Further information on using Rwui can be found
attention.
from links on the application’s web pages. The
‘Quick Tour’ provides a ten minute introductory ex-
The R script receives input from the user via R
ample of using Rwui. The ‘Technical Report’ gives
variables, which we term the input variables of the
a technical overview of the application. The ‘Help’
script. The values of input variables are entered by
link accesses the manual for Rwui which contains de-
the user on the web page of the application. The in-
tailed information for users.
put variables must be named according to the rules
of both R and Java variable naming. These rules are
given on the ‘Help’ page of Rwui. But those variables
System Requirements in the R script that are not input variables do not
need to conform to the rules of Java variable naming.
In order to use the web applications created by Rwui
a machine is required with Tomcat version 5.0 or Secondly, in order to make the results of the anal-
later, Java version 1.5 and an R version compatible ysis available to the user, the R script must save
with the R script(s). Although a server running a the results to files. Thirdly, and optionally, the R
Unix operating system is preferable, the applications script may also write progress information into a file
will work without modification on a Tomcat server which can be continously displayed for the user in a
running Windows XP. Javascript pop-up window.

R News ISSN 1609-3631


Vol. 7/2, October 2007 35

Using applications created by Rwui with their associated results files. Any uploaded data
files are also linked to on the Results page, giving the
A demonstration application created by Rwui can be user the opportunity to check that the correct data
accessed from the ‘Help’ page of Rwui. has been submitted and has been uploaded correctly.
Applications created by Rwui can include a lo- The applications include a SessionListener which
gin page. Access can be controlled either by a single detects when a user’s session is about to expire. The
password, or by username/password pairs. SessionListener then removes all the working direc-
If an application created by Rwui includes a Mul- tories created during the session from the server, to
tiple/Replicate File upload page, then the applica- prevent it from becoming clogged with data. By de-
tion consists of two web pages on which the user en- fault a session expires 30 minutes after it was last ac-
ters information. On the first web page the user up- cessed by the user, but the delay can be changed in
loads multiple files one at a time. Once completed, the server configuration.
a button takes the user to a second web page where All the code of a complete demonstration appli-
singleton data files and values for all other variables cation created by Rwui can be downloaded from a
are entered. The ‘Analyse’ button on this page sub- link on the ‘Help’ page of Rwui.
mits the values of variables, uploads any data files
and runs the R script. If a Multiple/Replicate File
upload page is not included, the application consists Acknowledgment
of this second web page only.
Before running the R script the application first RN was funded as part of a Wellcome Trust Func-
checks the validity of the values that the user has tional Genomics programme grant.
entered and returns an error message to the page if
any are invalid. During the analysis, progress infor-
mation can be displayed for the user. To enable this Bibliography
the R script must append information to a text file at
stages during its execution. This text file is displayed Gentleman,R.C., Carey,V.J., Bates,D.M., Bolstad,B.,
for the user in a JavaScript pop-up window which Dettling,M., Dudoit,S., Ellis,B., Gautier,L., Ge,Y.,
refreshes at fixed intervals. Gentry,J., Hornik,K., Hothorn,T., Huber,W., Ia-
On completion of the analysis, a link to a Results cus,S., Irizarry,R., Leisch,F., Li,C., Maechler,M.,
page appears at the bottom of the web page. The user Rossini,A.J., Sawitzki,G., Smith,C., Smyth,G., Tier-
can change data files and/or the values of any of the ney,L., Yang,J.Y.H., Zhang,J. (2004) Bioconductor:
variables and re-analyse, and the new results will ap- Open software development for computational bi-
pear as a second link at the bottom of the page, and ology and bioinformatics, Genome Biology, 5, R80.
so on. Clicking on a link brings up the Results page
for the corresponding analysis. Figure 2 shows an
example of a results page. Richard Newton
The user can download individual results files by School of Crystallography
clicking on the name of the appropriate file on a Re- Birkbeck College, University of London, UK
sults page. Alternatively, each Results page also con- r.newton@mail.cryst.bbk.ac.uk
tains a link which will download all the results files Lorenz Wernisch
from the page and the html of the page itself. In this MRC Biostatistics Unit, Cambridge, UK
way the user can view offline saved Results pages lorenz.wernisch@mrc-bsu.cam.ac.uk

The np Package: Kernel Methods for


Categorical and Continuous Data
by Tristen Hayfield and Jeffrey S. Racine mial regression, to name but two applications of ker-
nel methods. Nonparametric kernel approaches are
Readers of R News are certainly aware of a num- appealing to applied researchers because they often
ber of nonparametric kernel1 methods that exist in reveal features in the data that might be missed by
R base (e.g., density) and in certain R packages (e.g., classical parametric methods. However, traditional
locpoly in the KernSmooth package). Such func- nonparametric kernel methods presume that the un-
tionality allows R users to nonparametrically model derlying data is continuous, which is frequently not
a density or to conduct nonparametric local polyno- the case.
1A ‘kernel’ is simply a weighting function.

R News ISSN 1609-3631


Vol. 7/2, October 2007 36

Practitioners often encounter a mix of categori- results. For this reason, we advise the reader to in-
cal and continuous data types, but may still wish terrogate their bandwidth objects with the summary
to proceed in a nonparametric direction. The tradi- command which produces a table of the bandwidths
tional nonparametric approach is called a ‘frequency’ for the continuous variables scaled by an appropriate
approach, whereby data is broken up into subsets constant (σ x nα where α depends on the kernel order
(‘cells’) corresponding to the values assumed by the and number of continuous variables, e.g. α = −1/5
categorical variables, and only then do you apply for one continuous variable and a second order ker-
say density or locpoly to the continuous data re- nel), which some readers may find helpful. Also,
maining in each cell. Nonparametric frequency ap- the admissible range for the bandwidths for the cat-
proaches are widely acknowledged to be unsatisfac- egorical variables is provided when summary is used,
tory. Recent theoretical developments offer practi- which some readers may also find helpful.
tioners a variety of kernel-based methods for cat- We have tried to make np flexible enough to be
egorical (i.e., unordered and ordered factors) data of use to a wide range of users. All options can
only or for a mix of continuous and categorical data; be tweaked by users (kernel function, kernel order,
see Li and Racine (2007) and the references therein bandwidth type, estimator type and so forth). One
for an in-depth treatment of these methods, and also function, npksum, allows you to create your own es-
see the articles listed in the bibliography below. timators, tests, etc. The function npksum is simply a
The np package implements recently developed call to highly optimized C code, so you get the bene-
kernel methods that seamlessly handle the mix of fits of compiled code along with the power and flex-
continuous, unordered, and ordered factor data ibility of the R language. We hope that incorporating
types often found in applied settings. The package the npksum function renders the package suitable for
also allows users to create their own routines us- teaching and research alike.
ing high-level function calls rather than writing their There are two ways in which you can interact
own C or Fortran code.2 The design philosophy un- with functions in np: i) using data frames, or ii) using
derlying np is simply to provide an intuitive, flex- a formula interface, where appropriate.
ible, and extensible environment for applied kernel To some, it may be natural to use the data frame
estimation. interface. The R data.frame function preserves a
Currently, a range of methods can be found in the variable’s type once it has been cast (unlike cbind,
np package including unconditional (Li and Racine, which we avoid for this reason). If you find this most
2003; Ouyang et al., 2006) and conditional (Hall et al., natural for your project, you first create a data frame
2004; Racine et al., 2004) density estimation and casting data according to their type (i.e., one of con-
bandwidth selection, conditional mean and gradi- tinuous (default), factor, ordered), as in
ent estimation (local constant (Racine and Li, 2004; data.object <- data.frame(x1=factor(x1),
Hall et al., forthcoming) and local polynomial (Li x2, x3=ordered(x3))
and Racine, 2004)), conditional quantile and gradi-
where x1 is, say, a binary factor, x2 continuous, and
ent estimation (Li and Racine, forthcoming), model
x3 an ordered factor. Then you could pass this data
specification tests (regression (Hsiao et al., forthcom-
frame to the appropriate np function, say
ing), quantile, significance (Racine et al., forthcom-
ing)), semiparametric regression (partially linear, in- npudensbw(dat=data.object)
dex models, average derivative estimation, vary- To others, however, it may be natural to use the
ing/smooth coefficient models), etc. formula interface that is used for the regression ex-
Before proceeding, we caution the reader that ample outlined below. For nonparametric regression
data-driven bandwidth selection methods can be nu- functions such as npregbw, you would proceed just
merically intensive, which is the reason underlying as you would using lm, e.g.,
the development of an MPI-aware3 version of the
np package that uses some of the functionality of npregbw(y ~ factor(x1) + x2)
the Rmpi package, which we have tentatively called except that you would of course not need to specify,
the npRmpi package. The functionality of np and e.g., polynomials in variables or interaction terms.
npRmpi will be identical; however, using npRmpi Every function in np supports both interfaces, where
you could take advantage of a cluster computing en- appropriate.
vironment or a multi-core/multi-cpu desktop ma- Before proceeding, a few words on the formula
chine thereby alleviating the computational burden interface are in order. We use the standard for-
associated with the nonparametric analysis of large mula interface as it provided capabilities for han-
datasets. We ought also to point out that data- dling missing observations and so forth. This in-
driven (i.e., automatic) bandwidth selection proce- terface, however, is simply a convenient device for
dures are not guaranteed always to produce good telling a routine which variable is, say, the outcome
2 The high-level functions found in the package in turn call compiled C code, allowing users to focus on the application rather than the

implementation details.
3 MPI is a library specification for message-passing, and has emerged as a dominant paradigm for portable parallel programming.

R News ISSN 1609-3631


Vol. 7/2, October 2007 37

and which are, say, the covariates. That is, just be- > data("wage1")
cause one writes x1 + x2 in no way means or is >
meant to imply that the model will be linear and > model.ols <- lm(lwage ~ factor(female) +
additive (why use fully nonparametric methods to + factor(married) + educ + exper + tenure,
estimate such models in the first place?). It sim- + data=wage1)
ply means that there are, say, two covariates in the >
model, the first being x1 and the second x2, we are > summary(model.ols)
passing them to a routine with the formula interface, ....
and nothing more is presumed or implied.
This model is, however, restrictive in a number of
Regardless of whether you use the data frame or
ways. First, the analyst must specify the functional
formula interface, workflow for nonparametric esti-
form (in this case linear) for the continuous vari-
mation in np typically proceeds as follows:
ables (educ, exper, and tenure). Second, the analyst
must specify how the qualitative variables (female
1. compute appropriate bandwidths;
and married) enter the model (in this case they affect
the model’s intercepts only). Third, the analyst must
2. estimate an object and extract fitted or pre-
specify the nature of any interactions among all vari-
dicted values, standard errors, etc.;
ables, quantitative and qualitative (in this case, there
3. optionally, plot the object. are none). Should any of these assumptions be in-
correct, then the estimated model will be biased and
In order to streamline the creation of a set of com- inconsistent, potentially leading to faulty inference.
plicated graphics objects, plot (which calls npplot) One might next test the null hypothesis that this
is dynamic; i.e., you can specify, say, bootstrapped parametric linear model is correctly specified using
error bounds and the appropriate routines will be the consistent model specification test found in Hsiao
called in real time. et al. (forthcoming) that admits both categorical and
continuous data (this example will likely take a few
minutes on a desktop computer as it uses bootstrap-
ping and cross-validated bandwidth selection):
Efficient nonparametric regression
> data("wage1")
in the presence of qualitative data > attach(wage1)
We begin with a motivating example that demon- >
strates the potential benefits arising from the use > model.ols <- lm(lwage ~ factor(female) +
of kernel smoothing methods that smooth both the + factor(married) + educ + exper + tenure,
qualitative and quantitative variables in a particu- + x=TRUE, y=TRUE)
lar manner; see the subsequent section for more de- >
tailed information regarding the method itself. For > X <- data.frame(factor(female),
what follows, we consider an application taken from + factor(married), educ, exper, tenure)
Wooldridge (2003, pg. 226) that involves multiple re- >
gression analysis with qualitative information. > output <- npcmstest(model=model.ols,
+ xdat=X, ydat=lwage, tol=.1, ftol=.1)
We consider modeling an hourly wage equa-
> summary(output)
tion for which the dependent variable is log(wage)
(lwage) while the explanatory variables include three
Consistent Model Specification Test
continuous variables, namely educ (years of educa-
....
tion), exper (the number of years of potential ex-
IID Bootstrap (399 replications)
perience), and tenure (the number of years with
Test Statistic 'Jn': 5.594745e+00
their current employer) along with two qualitative
P Value: 0.000000e+00
variables, female (“Female”/“Male”) and married
....
(“Married”/“Notmarried”). For this example there
are n = 526 observations. This naïve linear model is rejected by the data
The classical parametric approach towards es- (the P-value for the null of correct specification is
timating such relationships requires that one first < 0.001), hence one might proceed instead to model
specify the functional form of the underlying rela- this relationship using kernel methods.
tionship. We start by first modelling this relationship As noted, the traditional nonparametric approach
using a simple parametric linear model. By way of towards modeling relationships in the presence of
example, Wooldridge (2003, pg. 222) presents the fol- qualitative variables requires that you first split your
lowing model:4 data into subsets containing only the continuous
4 We would like to thank Jeffrey Wooldridge for allowing us to use his data. Also, we would like to point out that Wooldridge starts

out with this naïve linear model, but quickly moves on to a more realistic model involving nonlinearities in the continuous variables and
so forth.

R News ISSN 1609-3631


Vol. 7/2, October 2007 38

variables of interest (lwage, exper, and tenure). For you need to pass to npreg as it encapsulates the
instance, we would have four such subsets, a) n = kernel types, regression method, and so forth. You
132 observations for married females, b) n = 120 ob- could, of course, use npreg with a formula and per-
servations for single females, c) n = 86 observations haps manually specify the bandwidths using a band-
for single males, and d) n = 188 observations for width vector if you so chose. We have tried to make
married males. One would then construct smooth each function as flexible as possible to meet the needs
nonparametric regression models for each of these of a variety of users.
subsets and proceed with the analysis in this fashion. The goodness of fit of the nonparametric model
However, this may lead to a loss in efficiency due to (R2 = 51.5%) is better than that for the paramet-
a reduction in the sample size leading to overly vari- ric model (R2 = 40.4%). In order to investigate
able estimates of the underlying relationship. whether this apparent improvement reflects overfit-
Instead, however, we could construct smooth ting or simply that the nonparametric model is in fact
nonparametric regression models by i) using a more faithful to the underlying data generating pro-
smoothing function that is appropriate for the qual- cess, we shuffled the data and created two indepen-
itative variables such as that proposed by Aitchi- dent samples, one of size n1 = 400 and one of size
son and Aitken (1976), and ii) modifying the non- n2 = 126. We fit the models on the n1 training ob-
parametric regression model as was done by Li and servations then evaluate the models on the n2 inde-
Racine (2004). One can then conduct sound nonpara- pendent hold-out observations using the predicted
1 n2
metric estimation based on the n = 526 observa- square error criterion, namely n− 2
2 ∑i =1 ( yi − ŷi ) ,
tions rather than resorting to sample splitting. The where the yi s are the lwage values for the hold-out
rationale for this lies in the fundamental concept that observations and the ŷi s are the predicted values. Fi-
doing so may introduce potential bias, but it will nally, we compare the parametric model, the non-
always reduce variability, thus leading to potential parametric model that smooths both the qualitative
finite-sample efficiency gains. Our experience has and quantitative variables, and the traditional fre-
been that the potential benefits arising from this ap- quency nonparametric model that breaks the data
proach more than offset the potential costs in finite- into subsets and smooths the quantitative data only.
sample settings.
Next, we consider using the local-linear nonpara-
> data("wage1")
metric method described in Li and Racine (2004).
> set.seed(123)
For the reader’s convenience we supply precom-
>
puted cross-validated bandwidths which are auto-
> # Shuffle the data and create two datasets...
matically loaded when one reads the wage1 dataset.
>
The commented out code that generates the cross-
> ii <- sample(seq(nrow(wage1)),replace=FALSE)
validated bandwidths is also provided should the
>
reader wish to fully replicate the example (recall be-
> wage1.train <- wage1[ii[1:400],]
ing cautioned about the computational burden as-
> wage1.eval <- wage1[ii[401:nrow(wage1)],]
sociated with multivariate data-driven bandwidth
>
methods).
> # Compute the parametric model for the
> data("wage1") > # training data...
> >
> #bw.all <- npregbw(lwage ~ factor(female) + > model.ols <- lm(lwage ~ factor(female) +
> # factor(married) + educ + exper + tenure, + factor(married) + educ + exper + tenure,
> # regtype="ll", bwmethod="cv.aic", + data=wage1.train)
> # data=wage1) >
> > # Obtain the out-of-sample predictions...
> model.np <- npreg(bws=bw.all) >
> > fit.ols <- predict(model.ols,
> summary(model.np) + data=wage1.train, newdata=wage1.eval)
>
Regression Data: 526 training points, > # Compute the predicted square error...
in 5 variable(s) >
.... > pse.ols <- mean((wage1.eval$lwage -
Kernel Regression Estimator: Local-Linear + fit.ols)^2)
Bandwidth Type:Fixed >
Residual standard error: 1.371530e-01 > #bw.subset <- npregbw(
R-squared: 5.148139e-01 > # lwage~factor(female) + factor(married) +
.... > # educ + exper + tenure, regtype="ll",
> # bwmethod="cv.aic", data=wage1.train)
Note that the bandwidth object is the only thing >

R News ISSN 1609-3631


Vol. 7/2, October 2007 39

> model.np <- npreg(bws=bw.subset) tive variables increases, the difference between the
> estimator that smooths both continuous and discrete
> # Obtain the out-of-sample predictions... variables in a particular manner and the traditional
> estimator that relies upon sample splitting will be-
> fit.np <- predict(model.np, come even more pronounced.
+ data=wage1.train, newdata=wage1.eval) We now briefly outline the essence of kernel
> smoothing of categorical data for the interested
> # Compute the predicted square error... reader.
>
> pse.np <- mean((wage1.eval$lwage -
+ fit.np)^2) Categorical data kernel methods
>
> # Do the same for the frequency estimator... To introduce the R user to the basic concept of ker-
> # We do this by setting lambda=0 for the nel smoothing of categorical random variables, con-
> # qualitative variables... sider a kernel estimator of a probability function,
> i.e. p( xd ) = P( X d = xd ), where X d is an unordered
> bw.freq <- bw.subset (i.e., discrete) categorical variable assuming values
> bw.freq$bw[1] <- 0 in, say, {0, 1, 2, . . . , c − 1} where c is the number of
> bw.freq$bw[2] <- 0 possible outcomes of the discrete random variable
> X d . Aitchison and Aitken (1976) proposed a kernel
> model.np.freq <- npreg(bws=bw.freq) estimator of p( xd ) given by
> n
1
> # Obtain the out-of-sample predictions... p̂( xd ) = ∑ L(Xid = xd ),
n i =1
>
> fit.np.freq <- predict(model.np.freq, where L(·) is a kernel function defined by, say,
+ data=wage1.train, newdata=wage1.eval)
if Xid = xd

d d 1−λ
> L( Xi = x ) =
λ /(c − 1) otherwise,
> # Compute the predicted square error...
> and where λ is a ‘smoothing’ parameter that can be
> pse.np.freq <- mean((wage1.eval$lwage - selected via a variety of data-driven methods; see
+ fit.np.freq)^2) also Ouyang et al. (2006) for theoretical underpin-
> nings of least-squares cross-validation in this con-
> # Compare performance for the text.
> # hold-out data... The smoothing parameter λ is restricted to lie in
> [0, (c − 1)/c] for the kernel given above. Note that
> pse.ols when λ = 0, then L( Xid = xd ) becomes an indi-
[1] 0.1469691 cator function, and hence p̂( xd ) would be the stan-
> pse.np dard frequency probability estimator (i.e., the sam-
[1] 0.1344028 ple proportion of Xid = xd ). If, on the other hand,
> pse.np.freq λ = (c − 1)/c, its upper bound, then L( Xid = xd )
[1] 0.1450111
becomes a constant function and p̂( xd ) = 1/c for all
The predicted square error on the hold-out data xd , which is the discrete uniform distribution. Data-
was 0.147 for the parametric linear model, 0.145 for driven methods for selecting λ are based on minimiz-
the traditional nonparametric estimator that splits ing expected square error loss, among other criteria.
the data into subsets, and 0.134 for the nonparamet- A number of issues surrounding this estimator
ric estimator that uses the full sample but smooths are noteworthy: i) this approach extends trivially to a
both the qualitative and quantitative data. We there- general multivariate setting, and in fact is best suited
fore conclude that the nonparametric model that to such settings; ii) to deal with ordered variables
smooths both the continuous and qualitative vari- you simply use an ‘ordered kernel’; iii) there is no
ables appears to provide a better description of the “curse of dimensionality” associated with smooth-
underlying data generating process than either the ing √ discrete probability functions – the estimators
nonparametric model that uses sample splitting or are n consistent, just like their parametric counter-
the naïve linear model. parts; and iv) you can “mix-and-match” data types
Note that for this example we have only four using “generalized product kernels” as we describe
cells. If one used all qualitative variables included in the next section.
in the dataset (16 in total), one would have 65, 536 Why would anyone want to smooth a probabil-
cells, many of which would be empty, and most hav- ity function in the first place? For finite-sample effi-
ing far too few observations to provide meaningful ciency reasons of course. That is, smoothing a prob-
nonparametric estimates. As the number of qualita- ability function may introduce some finite-sample

R News ISSN 1609-3631


Vol. 7/2, October 2007 40

bias, but, it always reduces variability. In settings > # conditional density estimator and compute
where you are modeling a mix of categorical and > # the conditional mode...
continuous data or where one has sparse multivari- >
ate categorical data, the efficiency gains can be sub- > bw <- npcdensbw(factor(low) ~
stantial indeed. + factor(smoke) + factor(race) +
+ factor(ht) + factor(ui)+ ordered(ftv) +
+ age + lwt, tol=.1, ftol=.1)
Mixed data kernel methods > model.np <- npconmode(bws=bw)
>
Estimating a joint density function defined over > # Finally, we compare confusion matrices
mixed data (i.e., both categorical and continuous) > # from the logit and nonparametric models...
follows naturally using generalized product kernels. >
For example, for one discrete variable xd and one > # Parametric...
continuous variable xc , our kernel estimator of the >
joint PDF would be > cm <- table(low,
+ ifelse(fitted(model.logit) > 0.5, 1, 0))
n Xic − xc
 
1 > cm
fˆ( xd , xc ) = ∑ L( Xid = xd )W ,
nh x i =1
h xc
low 0 1
where L( Xid = xd ) is a categorical data kernel func- 0 119 11
tion, while W [( Xic − xc )/h xc ] is a continuous data ker- 1 34 25
nel function (e.g., Epanechnikov, Gaussian, or uni- >
form) and h xc is the bandwidth for the continuous > # Nonparametric...
variable; see Li and Racine (2003) for further details. >
Once we can consistently estimate a joint density > summary(model.np)
function defined over mixed data, we can then pro-
ceed to estimate a range of statistical objects of inter- Conditional Mode data: 189 training points,
est to practitioners. Some mainstays of applied data in 8 variable(s)
analysis include estimation of regression functions ....
and their derivatives, conditional density functions Bandwidth Type: Fixed
and their quantiles, conditional mode functions (i.e.,
count data models, probability models), and so forth, Confusion Matrix
each of which is currently implemented. Predicted
Actual 0 1
0 127 3
Nonparametric estimation of binary out- 1 27 32
come and count data models
....
For what follows, we adopt the conditional proba-
bility estimator proposed in Hall et al. (2004) to es-
For this example the nonparametric model is bet-
timate a nonparametric model of a binary outcome
ter able to predict low birthweight infants than its
when there exist a number of categorical covariates.
parametric counterpart, correctly predicting 159/189
For this example, we use the birthwt data taken
birthweights compared with 144/189 for the para-
from the MASS package, and compute a parametric
metric model.
Logit model and a nonparametric conditional mode
model. We then compare their confusion matrices
and assess their classification ability. The outcome Visualizing and summarizing nonpara-
is an indicator of low infant birthweight (0/1). metric results
> library("MASS") Summarizing nonparametric results and comparing
> attach(birthwt) them with parametric ones is perhaps one of the
> main challenges faced by the practitioner. We have
> # First, we model this with a simple attempted to provide a range of summary measures
> # parametric Logit model... and plotting methods to facilitate these tasks.
> The plot function (which calls npplot) can be
> model.logit <- glm(low ~ factor(smoke) + used on most nonparametric objects. In order to fur-
+ factor(race) + factor(ht) + factor(ui)+ ther facilitate comparison of parametric with non-
+ ordered(ftv) + age + lwt, family=binomial) parametric results, we also compute a number of
> summary measures of goodness of fit, normalized
> # Now we model this with the nonparametric bandwidths (‘scale factors’) and so forth that are ac-

R News ISSN 1609-3631


Vol. 7/2, October 2007 41

cessed by the summary command, while objects gen- Many of the accessor functions such as predict,
erated by np also contain various alternative mea- fitted, and residuals work with most non-
sures of goodness of fit and the like. We again con- parametric objects, where appropriate. As well,
sider the approach detailed in Li and Racine (2004) gradients is a generic function which extracts gra-
which conducts local linear kernel regression with dients from objects. The R function coef can be
a mix of discrete and continuous data types on a used for extracting the parametric coefficients from
popular macroeconomic dataset. The original study semiparametric models such as those created with
assessed the impact of Organisation for Economic npplreg.
Co-operation and Development (OECD) member-
ship on a country’s growth rate when controlling for
Creating mixed data kernel objects via
a range of factors such as population growth rates,
stock of capital (physical and human) and so forth. npksum
For this example we shall use the formula interface Rather than being limited to only those kernel meth-
rather than the data frame interface. For this exam- ods that exist in np, you could instead create your
ple, we have already computed the cross-validated own mixed data kernel objects. npksum is a func-
bandwidths which are loaded when one reads the tion that computes kernel sums on evaluation data,
oecdpanel dataset. given a set of training data, data to be weighted (op-
> data("oecdpanel") tional), and a bandwidth specification (any band-
> attach(oecdpanel) width object). npksum uses highly-optimized C code
> that strives to minimize its memory footprint, while
> #bw <- npregbw(growth ~ factor(oecd) + there is low overhead involved when using repeated
> # ordered(year) + initgdp + popgro+ inv + calls to this function. The convolution kernel op-
> # humancap, bwmethod="cv.aic", tion would allow you to create, say, the least squares
> # regtype="ll") cross-validation function for kernel density estima-
> tion. You can choose powers to which the kernels
> summary(bw) constituting the sum are raised (default=1), whether
or not to divide by bandwidths, whether or not
Regression Data (616 observations, to generate leave-one-out sums, and so on. Three
6 variable(s)): classes of kernel methods for the continuous data
types are available: fixed, adaptive nearest-neighbor,
Regression Type: Local-Linear and generalized nearest-neighbor. Adaptive nearest-
neighbor bandwidths change with each sample real-
.... ization in the set, xi , when estimating the kernel sum
at the point x. Generalized nearest-neighbor band-
> widths change with the point at which the sum is
> model <- npreg(bw) computed, x. Fixed bandwidths are constant over
> the support of x. A variety of kernels may be spec-
> summary(model) ified by users. Kernels implemented for continu-
ous data types include the second, fourth, sixth, and
Regression Data: 616 training points, eighth order Gaussian and Epanechnikov kernels,
in 6 variable(s) and the uniform kernel. We have tried to anticipate
.... the needs of a variety of users when designing this
> function. A number of semiparametric estimators
> plot(bw, and nonparametric tests in the np package in fact
+ plot.errors.method="bootstrap", make use of npksum.
+ plot.errors.boot.num=25) In the following example, we conduct local-
> constant (i.e., Nadaraya-Watson) kernel regression.
We shall use cross-validated bandwidths drawn
We display partial regression plots in Figure 1.5
from npregbw for this example. Note that we extract
We also plot bootstrapped variability bounds, where
the kernel sum from npksum via the ‘$ksum’ return
the bootstrapping is done via the boot package
value in both the numerator and denominator of the
thereby facilitating a variety of bootstrap methods.
resulting object. For this example, we use the data
If you prefer gradients rather than levels, you can
frame interface rather than the formula interface, and
simply use the argument gradients=TRUE in plot
output from this example is plotted in Figure 2.
to indicate that you wish to plot partial derivatives
rather than the conditional mean function itself.6
5 A “partial regression plot” is simply a 2D plot of the outcome y versus one covariate x when all other covariates are held constant at
j
their respective medians (this can be changed by users).
6 plot(bw, gradients=TRUE, plot.errors.method="bootstrap", plot.errors.boot.num=25, common.scale=FALSE) will work

for this example.

R News ISSN 1609-3631


Vol. 7/2, October 2007 42

0.08

0.08
0.04

0.04
growth

growth
0.00

0.00
−0.04

−0.04
0 1 1965 1970 1975 1980 1985 1990 1995

factor(oecd) ordered(year)
0.08

0.08
0.04

0.04
growth

growth
0.00

0.00
−0.04

−0.04
6 7 8 9 −3.4 −3.2 −3.0 −2.8 −2.6 −2.4

initgdp popgro
0.08

0.08
0.04

0.04
growth

growth
0.00

0.00
−0.04

−0.04

−4.5 −4.0 −3.5 −3.0 −2.5 −2.0 −1.5 −1.0 −2 −1 0 1 2

inv humancap

Figure 1: Partial regression plots with bootstrapped error bounds based on a cross-validated local linear kernel
estimator.

> data("cps71")
> attach(cps71)
....
15

> fit.lc <- npksum(txdat=age,tydat=logwage,


+ bws=1.892169)$ksum/
14

+ npksum(txdat=age,bws=1.892169)$ksum
log(wage)

13

>
12

20 30 40 50 60

Age

Note that the arguments ‘txdat’ and ‘tydat’ refer to


‘training’ data for the regressors X and for the depen-
dent variable y, respectively. One can also specify Figure 2: A local constant kernel estimator created
‘evaluation’ data if you wish to evaluate the function with npksum using the Gaussian kernel (default) and
on a set of points that differ from the training data. a bandwidth of 1.89.
In such cases, you would also specify ‘exdat’ (and
‘eydat’ if so desired).
Acknowledgments
We direct the interested reader to see, by way of
illustration, the example available via ?npksum that We would like to thank but not implicate Achim
conducts leave-one-out cross-validation for a local Zeileis and John Fox, whose many and varied sug-
constant regression estimator via calls to the R func- gestions led to a much improved version of the pack-
tion nlm, and compares this to the npregbw function. age and of this review.

R News ISSN 1609-3631


Vol. 7/2, October 2007 43

Acknowledgments

We would like to thank but not implicate Achim Zeileis and John Fox, whose many and varied suggestions led to a much improved version of the package and of this review.

Bibliography

J. Aitchison and C. G. G. Aitken. Multivariate binary discrimination by the kernel method. Biometrika, 63(3):413–420, 1976.

P. Hall, J. S. Racine, and Q. Li. Cross-validation and the estimation of conditional probability densities. Journal of the American Statistical Association, 99(468):1015–1026, 2004.

P. Hall, Q. Li, and J. S. Racine. Nonparametric estimation of regression functions in the presence of irrelevant regressors. The Review of Economics and Statistics, forthcoming.

C. Hsiao, Q. Li, and J. S. Racine. A consistent model specification test with mixed categorical and continuous data. Journal of Econometrics, 140(2):802–826, 2007.

Q. Li and J. S. Racine. Nonparametric Econometrics: Theory and Practice. Princeton University Press, 2007.

Q. Li and J. S. Racine. Nonparametric estimation of distributions with categorical and continuous data. Journal of Multivariate Analysis, 86:266–292, August 2003.

Q. Li and J. S. Racine. Cross-validated local linear nonparametric regression. Statistica Sinica, 14(2):485–512, April 2004.

Q. Li and J. S. Racine. Nonparametric estimation of conditional CDF and quantile functions with mixed categorical and continuous data. Journal of Business and Economic Statistics, forthcoming.

D. Ouyang, Q. Li, and J. S. Racine. Cross-validation and the estimation of probability distributions with categorical data. Journal of Nonparametric Statistics, 18(1):69–100, 2006.

J. S. Racine and Q. Li. Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics, 119(1):99–130, 2004.

J. S. Racine, Q. Li, and X. Zhu. Kernel estimation of multivariate conditional distributions. Annals of Economics and Finance, 5(2):211–235, 2004.

J. S. Racine, J. D. Hart, and Q. Li. Testing the significance of categorical predictor variables in nonparametric regression models. Econometric Reviews, 25:523–544, 2007.

J. M. Wooldridge. Introductory Econometrics. Thompson South-Western, 2003.

Tristen Hayfield and Jeffrey S. Racine
McMaster University
Hamilton, Ontario, Canada
hayfietj@mcmaster.ca
racinej@mcmaster.ca

eiPack: R × C Ecological Inference and Higher-Dimension Data Management
by Olivia Lau, Ryan T. Moore, and Michael Kellermann

Introduction

Ecological inference (EI) models allow researchers to infer individual-level behavior from aggregate data when individual-level data is unavailable. Table 1 shows a typical unit of ecological analysis: a contingency table with observed row and column marginals and unobserved interior cells.

           col1     col2     ...   colC
  row1     N_11i    N_12i    ...   N_1Ci    N_1·i
  row2     N_21i    N_22i    ...   N_2Ci    N_2·i
  ...      ...      ...      ...   ...
  rowR     N_R1i    N_R2i    ...   N_RCi    N_R·i
           N_·1i    N_·2i    ...   N_·Ci    N_i

Table 1: A typical R × C unit in ecological inference; the interior cell counts (printed in red in the original) are typically unobserved.

In ecological inference, challenges arise because information is lost when aggregating across individuals, a problem that cannot be solved by collecting more aggregate-level data. Thus, EI models are unusually sensitive to modeling assumptions. Testing these assumptions is difficult without access to individual-level data, and recent years have witnessed a lively discussion of the relative merits of various models (Wakefield, 2004).

Nevertheless, there are many applied problems in which ecological inferences are necessary, either because individual-level data is unavailable or because the aggregate-level data is considered more authoritative. The latter is true in the voting rights context in the United States, where federal courts often base decisions on evidence derived from one or more EI models (Cho and Yoon, 2001). While packages such as MCMCpack (Martin and Quinn, 2006) and eco (Imai and Lu, 2005) provide tools for 2 × 2 inference, this is insufficient in many applications.


In eiPack, we implement three existing methods for the general case in which the ecological units are R × C tables.

Methods and Data in eiPack

The methods currently implemented in eiPack are the method of bounds (Duncan and Davis, 1953), ecological regression (Goodman, 1953), and the Multinomial-Dirichlet model (Rosen et al., 2001). The functions that implement these models share several attributes. The ecological tables are defined using a common formula of the form cbind(col1, ..., colC) ~ cbind(row1, ..., rowR). The row and column marginals can be expressed as either proportions or counts. Auxiliary functions renormalize the results for some subset of columns taken from the original ecological table, and appropriate print, summary, and plot functions conveniently summarize the model output.

In the following section, we demonstrate the features of eiPack using the (included) senc dataset, which contains individual-level party affiliation data for Black, White, and Native American voters in 8 counties in southeastern North Carolina. These counties include 212 precincts, which form the ecological units in this dataset. Because the data are observed at the individual level, the interior cell counts are known, allowing us to benchmark the estimates generated by each method.

Method of Bounds

The method of bounds (Duncan and Davis, 1953) uses the observed row and column marginals to calculate upper and lower bounds for functions of the interior cells of each ecological unit. The method of bounds is not a statistical procedure in the traditional sense; the bounds implied by the row and column marginals are deterministic and there is no probabilistic model for the data-generating process.

As implemented in eiPack, the method of bounds allows the user to calculate, for a specified column k' ∈ k = {1, ..., C}, the deterministic bounds on the proportion of individuals in each row who belong in that column. For each unit being considered, let j be the row of interest, k index columns, k' be the column of interest, k'' be the set of other columns considered, and k̃ be the set of columns excluded. For example, if we want the bounds on the proportion of Native American two-party registrants who are Democrats, j is Native American, k' is Democrat, k'' is Republican, and k̃ is No Party. The unit-level quantity of interest is

$$\frac{N_{jk'i}}{N_{jk'i} + \sum_{k \in k''} N_{jki}}$$

The lower and upper bounds on this quantity given by the observed marginals are, respectively,

$$\frac{\max\big(0,\; N_{ji} - \sum_{k \neq k'} N_{ki}\big)}{\max\big(0,\; N_{ji} - \sum_{k \neq k'} N_{ki}\big) + \min\big(N_{ji},\; \sum_{k \in k''} N_{ki}\big)}$$

and

$$\frac{\min\big(N_{ji},\, N_{k'i}\big)}{\min\big(N_{ji},\, N_{k'i}\big) + \max\big(0,\; N_{ji} - N_{k'i} - \sum_{k \in \tilde{k}} N_{ki}\big)}$$

The intervals generated by the method of bounds can be analyzed in a variety of ways. Grofman (2000) suggests calculating the intersection of the unit-level bounds. If this intersection (calculated by eiPack) is non-empty, it represents the range of values that are consistent with the observed marginals in each of the ecological units.

Researchers and practitioners may also choose to restrict their attention to units in which one group dominates, since the bounds will typically be more informative in those units. eiPack allows users to set row thresholds to conduct this extreme case analysis (known as homogeneous precinct analysis in the voting context). For example, suppose the user is interested in the proportion of two-party White registrants registered as Democrats in precincts that are at least 90% White. eiPack calculates the desired bounds:

> out <- bounds(cbind(dem, rep, non)
+     ~ cbind(black, white, natam),
+     data = senc, rows = "white",
+     column = "dem", excluded = "non",
+     threshold = 0.9, total = NULL)

These calculated bounds can then be represented graphically. Segments cover the range of possible values (the true value for each precinct is the red dot, not included in the standard bounds plot). In this example, the intersection of the precinct-level bounds is empty.

> plot(out, row = "white", column = "dem")
> # add true values to plot
> idx <- as.numeric(rownames(out$bounds$white.dem))
> truth <- senc$whdem[idx]/(senc$white[idx]
+     - senc$non[idx])
> plot((1:length(idx)) / (length(idx) + 1), truth)


[Figure 1 appears here: numbered segments give the deterministic bounds on the Proportion Democratic for each precinct at least 90% White, with dots marking the true values added by the code above.]

Figure 1: A plot of deterministic bounds.

Ecological Regression

In ecological regression (Goodman, 1953), observed row and column marginals are expressed as proportions and each column is regressed separately on the row proportions, thus performing C regressions. Regression coefficients then estimate the population internal cell proportions. For a given unit i, define

• X_ri, the proportion of individuals in row r,

• T_ci, the proportion of individuals in column c, and

• β_rci, the proportion of row r individuals in column c.

The following identities hold:

$$T_{ci} = \sum_{r=1}^{R} \beta_{rci} X_{ri} \qquad \text{and} \qquad \sum_{c=1}^{C} \beta_{rci} = 1$$

Defining the population cell fractions β_rc such that Σ_{c=1}^{C} β_rc = 1 for every r, ecological regression assumes that β_rc = β_rci for all i, and estimates the regression equations

$$T_{ci} = \sum_{r=1}^{R} \beta_{rc} X_{ri} + \epsilon_{ci}.$$

Under the standard linear regression assumptions, including E[ε_ci] = 0 and Var[ε_ci] = σ²_c for all i, these regressions recover the population parameters β_rc. eiPack implements frequentist and Bayesian regression models (via ei.reg and ei.reg.bayes, respectively).

In the Bayesian implementation, we offer two options for the prior on β_rc. As a default, truncate = FALSE uses an uninformative flat prior that provides point estimates approaching the frequentist estimates (even when those estimates are outside the feasible range) as the number of draws m → ∞. In cases where the cell estimates are near the boundaries, choosing truncate = TRUE imposes a uniform prior over the unit hypercube such that all cell fractions are restricted to the range [0, 1].

Output from ecological regression can be summarized numerically just as in lm, or graphically using density plots. We also include functions to calculate estimates and standard errors of shares of a subset of columns in order to address questions such as, "What is the Democratic share of 2-party registration for each group?" For the Bayesian model, densities represent functions of the posterior draws of the β_rc; for the frequentist model, densities reflect functions of regression point estimates and standard errors calculated using the δ-method.

> out.reg <- ei.reg(cbind(dem, rep, non)
+     ~ cbind(black, white, natam), data = senc)
> lreg <- lambda.reg(out.reg,
+     columns = c("dem", "rep"))
> density.plot(lreg)

[Figure 2 appears here: two density panels, Proportion Democratic and Proportion Republican, one curve per group.]

Figure 2: Density plots of ecological regression output.
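For intuition, the C separate frequentist regressions that ei.reg performs can be sketched directly with lm() (a sketch only; the proportions are computed here from the senc counts, and ei.reg handles this bookkeeping itself):

## one of the C regressions: Democratic column share on row shares,
## with no intercept, so coefficients estimate beta_r,dem
Tdem <- senc$dem / senc$total
X    <- with(senc, cbind(black, white, natam) / total)
coef(lm(Tdem ~ X - 1))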


Multinomial-Dirichlet (MD) model

In the Multinomial-Dirichlet model proposed by Rosen et al. (2001), the data is expressed as counts and a hierarchical Bayesian model is fit using a Metropolis-within-Gibbs algorithm implemented in C. Level 1 models the observed column marginals as multinomial (and independent across units); the choice of the multinomial corresponds to sampling with replacement from the population. Level 2 models the unobserved row cell fractions as Dirichlet (and independent across rows and units); Level 3 models the Dirichlet parameters as i.i.d. Gamma. More formally, without a covariate, the model is

$$(N_{\cdot 1 i}, \ldots, N_{\cdot C i}) \overset{\perp}{\sim} \mathrm{Multinomial}\Big(N_i;\ \sum_{r=1}^{R} \beta_{r1i} X_{ri},\ \ldots,\ \sum_{r=1}^{R} \beta_{rCi} X_{ri}\Big)$$

$$(\beta_{r1i}, \ldots, \beta_{rCi}) \overset{\perp}{\sim} \mathrm{Dirichlet}(\alpha_{r1}, \ldots, \alpha_{rC})$$

$$\alpha_{rc} \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Gamma}(\lambda_1, \lambda_2)$$

With a unit-level covariate Z_i in the second level, the model becomes

$$(N_{\cdot 1 i}, \ldots, N_{\cdot C i}) \overset{\perp}{\sim} \mathrm{Multinomial}\Big(N_i;\ \sum_{r=1}^{R} \beta_{r1i} X_{ri},\ \ldots,\ \sum_{r=1}^{R} \beta_{rCi} X_{ri}\Big)$$

$$(\beta_{r1i}, \ldots, \beta_{rCi}) \overset{\perp}{\sim} \mathrm{Dirichlet}\big(d_r e^{(\gamma_{r1} + \delta_{r1} Z_i)}, \ldots, d_r e^{(\gamma_{r(C-1)} + \delta_{r(C-1)} Z_i)}, d_r\big)$$

$$d_r \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Gamma}(\lambda_1, \lambda_2)$$

In the model with a covariate, users have two options for the priors on γ_rc and δ_rc. They may assume an improper uniform prior, as was suggested by Rosen et al. (2001), or they may specify normal priors for each γ_rc and δ_rc as follows:

$$\gamma_{rc} \sim N(\mu_{\gamma_{rc}}, \sigma^2_{\gamma_{rc}}) \qquad \delta_{rc} \sim N(\mu_{\delta_{rc}}, \sigma^2_{\delta_{rc}})$$

As Wakefield (2004) notes, the weak identification that characterizes hierarchical models in the EI context is likely to make the results sensitive to the choice of prior. Users should experiment with different assumptions about the prior distribution of the upper-level parameters in order to gauge the robustness of their inferences.

The parameterization of the prior on each (β_r1i, ..., β_rCi) implies that the following log-odds ratio of expected fractions is linear with respect to the covariate Z_i:

$$\log\left(\frac{E(\beta_{rci})}{E(\beta_{rCi})}\right) = \gamma_{rc} + \delta_{rc} Z_i$$
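This linearity is easy to verify by simulation (a sketch with invented parameter values; rdirichlet is borrowed from MCMCpack and is unrelated to eiPack's own sampler):

library(MCMCpack)
gam <- c(0.5, -0.2); del <- c(0.3, 0.1)   # gamma_rc, delta_rc (invented)
dr  <- 2; Zi <- 1                         # d_r and one covariate value
alpha <- c(dr * exp(gam + del * Zi), dr)  # Dirichlet parameters, C = 3
beta  <- rdirichlet(10000, alpha)
## approximates gamma + delta * Zi, i.e. c(0.8, -0.1):
log(colMeans(beta)[1:2] / colMeans(beta)[3])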


Conducting an analysis using the MD model requires two steps. First, tuneMD calibrates the tuning parameters used for Metropolis-Hastings sampling:

> tune.nocov <- tuneMD(cbind(dem, rep, non)
+     ~ cbind(black, white, natam), data = senc,
+     ntunes = 10, totaldraws = 100000)

Second, ei.MD.bayes fits the model by calling C code to generate MCMC draws:

> out.nocov <- ei.MD.bayes(cbind(dem, rep, non)
+     ~ cbind(black, white, natam),
+     covariate = NULL, data = senc,
+     tune.list = tune.nocov)

The output of this function can be returned as mcmc objects or arrays; in the former case, the standard diagnostic tools in coda (Plummer et al., 2006) can be applied directly. The MD implementation includes lambda and density.plot functions, usage for which is analogous to ecological regression:

> lmd <- lambda.MD(out.nocov,
+     columns = c("dem", "rep"))
> density.plot(lmd)

If the precinct-level parameters are returned or saved, cover.plot plots the central credible intervals for each precinct. The segments represent the 95% central credible intervals and their medians for each unit (the true value for each precinct is the red dot, not included in the standard cover.plot).

> cover.plot(out.nocov, row = "white",
+     column = "dem")
> # add true values to plot
> points(senc$white/senc$total,
+     senc$whdem/senc$white)

[Figure 3 appears here: credible-interval segments for the Proportion of White Democrats against the Proportion White in precinct, with dots marking the true values.]

Figure 3: Coverage plot for MD model output.

Data Management

In the MD model, reasonable-sized problems produce unreasonable amounts of data. For example, a model for voting in Ohio includes 11000 precincts, 3 racial groups, and 4 parties. Implementing 1000 iterations yields about 130 million parameter draws. These draws occupy about 1GB of RAM, and this is almost certainly not enough iterations. We provide a few options to users in order to make this model tractable for large EI problems.

The unit-level parameters present the most significant data management problem. Rather than storing unit-level parameters in the workspace, users can save each chain as a .tar.gz file on disk using the option ei.MD.bayes(..., ret.beta = "s"), or discard the unit-level draws entirely using ei.MD.bayes(..., ret.beta = "d"). To reconstruct the chains, users can select the row marginals, column marginals, and units of interest, without reconstructing the entire matrix of unit-level draws:

> read.betas(rows = c("black", "white"),
+     columns = "dem", units = 1:150,
+     dir = getwd())

If users are interested in some function of the unit-level parameters, the implementation of the MD model allows them to define a function in R that will be called from within the C sampling algorithm, in which case the unit-level parameters need not be saved for post-processing.

Acknowledgments

eiPack was developed with the support of the Institute for Quantitative Social Science at Harvard University. Thanks to John Fox, Gary King, Kevin Quinn, D. James Greiner, and an anonymous referee for suggestions and Matt Cox and Bob Kinney for technical advice. For further information, see http://www.olivialau.org/software.

Bibliography

W. T. Cho and A. H. Yoon. Strange bedfellows: Politics, courts and statistics: Statistical expert testimony in voting rights cases. Cornell Journal of Law and Public Policy, 10:237–264, 2001.

O. D. Duncan and B. Davis. An alternative to ecological correlation. American Sociological Review, 18:665–666, 1953.

L. Goodman. Ecological regressions and the behavior of individuals. American Sociological Review, 18:663–664, 1953.

B. Grofman. A primer on racial bloc voting analysis. In N. Persily, editor, The Real Y2K Problem: Census 2000 Data and Redistricting Technology. Brennan Center for Justice, New York, 2000.

K. Imai and Y. Lu. eco: R Package for Fitting Bayesian Models of Ecological Inference in 2x2 Tables, 2005. URL http://imai.princeton.edu/research/eco.html.

A. D. Martin and K. M. Quinn. Applied Bayesian inference in R using MCMCpack. R News, 6:2–7, 2006.

M. Plummer, N. Best, K. Cowles, and K. Vines. CODA: Convergence diagnostics and output analysis for MCMC. R News, 6:7–11, 2006.

O. Rosen, W. Jiang, G. King, and M. A. Tanner. Bayesian and frequentist inference for ecological inference: The R × C case. Statistica Neerlandica, 55(2):134–156, 2001.

J. Wakefield. Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society, 167:385–445, 2004.

Olivia Lau, Ryan T. Moore, Michael Kellermann
Institute for Quantitative Social Science
Harvard University, Cambridge, MA
olivia.lau@post.harvard.edu
ryantmoore@post.harvard.edu
kellerm@fas.harvard.edu

The ade4 Package — II: Two-table and K-table Methods
by Stéphane Dray, Anne B. Dufour and Daniel Chessel

Introduction

The ade4 package proposes a great variety of explanatory methods to analyse multivariate datasets. As suggested by the acronym ade4 (Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods), the package is devoted to ecologists but it could be useful in many other fields (e.g., Goecke, 2005). Methods available in the package are particular cases of the duality diagram (Escoufier, 1987; Holmes, 2006; Dray and Dufour, 2007) and the implementation of the functions follows the description of this unifying mathematical tool (class dudi). The main functions of the package for one-table analysis methods have been presented in Chessel et al. (2004). This new paper presents a short summary of two-table and K-table methods available in the package.

Ecological illustration

In order to illustrate the methods, we used the dataset jv73 (Verneaux, 1973) which is available in the package. This dataset concerns 12 rivers.

For each river, a number of sites have been sampled. The number of sites per river is not constant. jv73$poi is a data.frame and contains presence/absence data for 19 fish species (columns) in 92 sites (rows). jv73$fac.riv is a factor indicating the river corresponding to each site. jv73$morpho contains the measurements of six environmental variables (altitude (m), distance between the site and the source (km), slope (per thousand), wetted cross section (m²), average flow (m³/s) and average speed (m/s)) for the same sites. Several ecological questions are related to these data:

1. Are there groups of fish species living together (i.e. species communities)?

2. Is there a relation between the composition of fish communities and the environmental variations?

3. Does the composition of fish communities vary (or not) among rivers?

4. Do the species-environment relationships vary (or not) among rivers?

Multivariate analyses help to answer these different questions: one-table methods for the first question, two-table methods for the second one and K-table methods for the last two.

Matching two tables

The main purpose of ecological data analysis is the matching of two data tables: a sites-by-environmental-variables table and a sites-by-species table, to study the relationships between the composition of species communities and their environment. The ade4 package contains the main variants of these methods (procrustean rotation, co-inertia analysis and principal component analyses with respect to instrumental variables).

The first approach is procrustean rotation (Gower, 1971), introduced in ecology by Digby and Kempton (1987, p. 116).

data(jv73)
pca1 <- dudi.pca(jv73$morpho, scannf = FALSE)
pca2 <- dudi.pca(jv73$poi, scale = FALSE,
    scannf = FALSE)
plot(procuste(pca1$tab, pca2$tab, nf = 2))

[Figure 1 appears here: panels "Loadings 1", "Loadings 2", "Eigenvalues" and "Common projection" of the Procrustes plot.]

Figure 1: Plot of a Procrustes analysis: loadings for environmental variables and species, eigenvalues screeplot, scores of sites for the two data sets, and projection of the two sets of sites after rotation (arrows link environment site score to the species site score) (Dray et al., 2003a).

Two randomization procedures are available to test the association between two tables: PROTEST (Jackson, 1995) and RV (Heo and Gabriel, 1998).

plot(procuste.randtest(pca1$tab,
    pca2$tab), main = "PROTEST")

plot(RV.rtest(pca1$tab, pca2$tab),
    main = "RV")

[Figure 2 appears here: two panels, PROTEST and RV, each a histogram of simulated values with the observed value marked.]

Figure 2: Plots of PROTEST and RV tests: histograms of simulated values and observed value (vertical line).
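The test objects themselves can also be inspected rather than plotted; a short sketch (component names follow ade4's randtest class, see ?randtest):

prt <- procuste.randtest(pca1$tab, pca2$tab)
prt$obs      # observed PROTEST statistic
prt$pvalue   # Monte-Carlo p-value over the permutations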


Co-inertia analysis (Dolédec and Chessel, 1994; Dray et al., 2003b) is a general approach that can be applied to any pair of duality diagrams having the same row weights. This method is symmetric and seeks a common structure between two datasets. It extends psychometricians' inter-battery analysis (Tucker, 1958), canonical analysis on qualitative variables (Cazes, 1980), and ecological profiles analysis (Montaña and Greig-Smith, 1990; Mercier et al., 1992). Co-inertia analysis of the pair of triplets (X₁, Q₁, D) and (X₂, Q₂, D) leads to the triplet (X₂ᵗDX₁, Q₁, Q₂). Note that the two triplets must have the same row weights. For a comprehensive definition of the statistical triplet of matrices X, Q, D, the reader could consult Chessel et al. (2004).

coa1 <- dudi.coa(jv73$poi, scannf = FALSE)
pca3 <- dudi.pca(jv73$morpho,
    row.w = coa1$lw, scannf = FALSE)
plot(coinertia(coa1, pca3, scannf = FALSE))

[Figure 3 appears here: panels "X axes", "Y axes", "Eigenvalues", the canonical weights, and a joint display of the sites.]

Figure 3: Plot of a co-inertia analysis: projection of the principal axes of the two tables (species and environment) on co-inertia axes, eigenvalues screeplot, canonical weights of species and environmental variables, and joint display of the sites.

For each coupling method, a generic plot function presents the various elements required to interpret the results. However, the quality of graphs could vary according to the data set. It is consequently impossible to manage relevant graphical outputs for all cases. That is why these generic plot functions use graphical functions of ade4 which can be directly called by the user. A brief description of some of these functions is given in Table 1.

Another two-table matching strategy is principal component analyses with respect to instrumental variables (pcaiv, Rao, 1964). This approach consists in explaining a triplet (X₂, Q₂, D) by a table of independent variables X₁ and leads to the triplet (P_X₁X₂, Q₂, D) where P_X₁ = X₁(X₁ᵗDX₁)⁻X₁ᵗD. This family of methods corresponds to constrained ordinations, among which redundancy analysis (van den Wollenberg, 1977) and canonical correspondence analysis (Ter Braak, 1986) are the most frequently used in ecology. Note that canonical correspondence analysis can also be performed using the cca wrapper function which takes two tables as arguments. The example given below is then exactly equivalent to plot(cca(jv73$poi, jv73$morpho, scannf = FALSE)). While the cca function of ade4 is a particular case of pcaiv, the cca function of the package vegan is a more traditional implementation of the method which could be preferred by ecologists.

Function      Objective
s.arrow       cloud of points with vectors
s.chull       cloud of points with groups by convex hulls
s.class       cloud of points with groups by stars or ellipses
s.corcircle   correlation circle
s.distri      cloud of points with frequency distribution by stars and ellipses
s.hist        cloud of points with two marginal histograms
s.image       grid of gray-scale rectangles with contour lines
s.kde2d       cloud of points with kernel density estimation
s.label       cloud of points with labels
s.logo        cloud of points with pictures
s.match       matching two clouds of points with vectors
s.traject     cloud of points with trajectories
s.value       cloud of points with numerical variable

Table 1: Objectives of some graphical functions.

plot(pcaiv(coa1, jv73$morpho, scannf = FALSE))

[Figure 4 appears here: panels "Loadings", "Correlation", "Inertia axes", "Scores and predictions", "Variables" and "Eigenvalues".]

Figure 4: Plot of a CCA seen as a particular case of PCAIV: environmental variables loadings and correlations with CCA axes, projection of principal axes on CCA axes, species scores, eigenvalues screeplot, and joint display of the rows of the two tables (position of the sites by averaging (points) and by regression (arrow tips)).
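For the common case of uniform row weights (D = (1/n)I), the projector P_X₁ above reduces to the usual hat matrix, which makes the construction easy to sketch by hand (a sketch only; pcaiv handles centring and general weights internally):

X1 <- as.matrix(jv73$morpho)   # explanatory table
X2 <- as.matrix(jv73$poi)      # response table
## P_X1 = X1 (X1' X1)^-1 X1' when D is uniform
P  <- X1 %*% solve(crossprod(X1)) %*% t(X1)
X2.fit <- P %*% X2             # the projected table analysed by pcaiv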


Orthogonal analysis (pcaivortho) removes the effect of independent variables and corresponds to the triplet (P⊥_X₁X₂, Q₂, D) where P⊥_X₁ = I − P_X₁. Between-class (between) and within-class (within) analyses (see Chessel et al., 2004, for details) are particular cases of PCAIV and orthogonal PCAIV when there is only one categorical variable (i.e. factor) in X₁. Within-class analyses take into account a partition of individuals into groups and focus on structures which are common to all groups. They can be seen as a first step to K-table methods.

wit1 <- within(coa1, fac = jv73$fac.riv,
    scannf = FALSE)
plot(wit1)

[Figure 5 appears here: panels "Canonical weights", "Variables", "Scores and classes", "Eigenvalues", "Inertia axes" and "Common centring".]

Figure 5: Plot of a within-class analysis: species loadings, species scores, eigenvalues screeplot, projection of principal axes on within-class axes, sites scores (common centring), projections of sites and groups (i.e. rivers in this example) on within-class axes.

The K-table class

Class ktab corresponds to collections of more than two duality diagrams, for which the internal structures are to be compared. Three formats of these collections can be considered:

• (X₁, Q₁, D), (X₂, Q₂, D), ..., (X_K, Q_K, D)

• (X₁, Q, D₁), (X₂, Q, D₂), ..., (X_K, Q, D_K) stored in the form of (X₁ᵗ, D₁, Q), (X₂ᵗ, D₂, Q), ..., (X_Kᵗ, D_K, Q)

• (X₁, Q, D), (X₂, Q, D), ..., (X_K, Q, D) which can also be stored in the form of (X₁ᵗ, D, Q), (X₂ᵗ, D, Q), ..., (X_Kᵗ, D, Q)

Each statistical triplet corresponds to a separate analysis (e.g., principal component analysis, correspondence analysis, ...). The common dimension of the K statistical triplets is the rows of the tables, which can represent individuals (samples, statistical units) or variables. Utilities for building and manipulating ktab objects are available. A K-table can be constructed from a list of tables (ktab.list.df), a list of dudi objects (ktab.list.dudi), a within-class analysis (ktab.within) or by splitting a table (ktab.data.frame). Generic functions to transpose (t.ktab), combine (c.ktab) or extract elements ([.ktab) are also available; a small sketch follows below. The sepan function can be used to compute automatically the K separate analyses.

kt1 <- ktab.within(wit1)
sep1 <- sepan(kt1)
kplot.sepan.coa(sep1, permute.row.col = TRUE)

[Figure 6 appears here: twelve panels, one correspondence analysis per river (Doubs, Drugeon, Dessoubre, Allaine, Audeux, Cusancin, Loue, Lison, Furieuse, Cuisance, Doulonnes, Clauge).]

Figure 6: Kplot of 12 separate correspondence analyses (same species, different sites).

When the ktab object is built, various statistical methods can be used to analyse it. The foucart function can be used to analyse K tables of positive numbers having the same rows and the same columns that can be analysed by a CA (Foucart, 1984; Pavoine et al., 2007). Partial triadic analysis (Tucker, 1966) is a first step toward three-mode principal component analysis (Kroonenberg, 1989) and can be computed with the pta function. It must be used on K triplets having the same row and column weights. The pta function can be used to perform the STATICO method (Simier et al., 1999; Thioulouse et al., 2004). This makes it possible to analyse a pair of ktab objects which have been combined by the ktab.match2ktabs function.
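As a quick illustration of the generic utilities mentioned above (a sketch; kt1 as created above, and str(kt1) reveals the underlying list-of-data-frames layout):

tkt1 <- t(kt1)   # transpose every table of the collection at once
class(tkt1)      # still a "ktab"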


Multiple factor analysis (mfa, Escofier and Pagès, 1994), multiple co-inertia analysis (mcoa, Chessel and Hanafi, 1996) and the STATIS method (statis, Lavit et al., 1994) can be used to compare K triplets having the same row weights. The STATIS method can also be used to compare K triplets having the same column weights, which is a first step toward Common PCA (Flury, 1988).

sta1 <- statis(kt1, scannf = FALSE)
plot(sta1)

[Figure 7 appears here: panels "Interstructure", "Compromise", "Component projection" and "Typological value" (Cos 2 against tables weights, labelled by river).]

Figure 7: Plot of STATIS analysis: interstructure, typological value of each table, compromise and projection of principal axes of separate analyses onto STATIS axes.

The kplot generic function is associated with the foucart, mcoa, mfa, pta, sepan, sepan.coa and statis methods, giving adapted collections of graphics.

kplot(sta1, traj = TRUE, arrow = FALSE,
    unique = TRUE, clab = 0)

[Figure 8 appears here: twelve panels, one per river.]

Figure 8: Kplot of the projection of the sites of each table on the principal axes of the compromise of STATIS analysis.
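The numerical building blocks of these plots are stored in the returned object; for instance (a sketch; the component name is taken from ?statis and should be checked there):

round(sta1$RV, 2)   # pairwise RV coefficients between the 12 tables,
                    # the basis of the interstructure step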

Conclusion

The ade4 package provides many methods to analyse multivariate ecological data sets. This diversity of tools is a methodological answer to the great variety of questions and data structures associated with biological questions. Specific methods dedicated to the analysis of biodiversity, spatial, genetic or phylogenetic data are also available in the package. The adehabitat brother-package contains tools to analyse habitat selection by animals, while the ade4TkGUI package provides a graphical interface to ade4. More resources can be found on the ade4 website (http://pbil.univ-lyon1.fr/ADE-4/).

Bibliography

P. Cazes. L'analyse de certains tableaux rectangulaires décomposés en blocs : généralisation des propriétés rencontrées dans l'étude des correspondances multiples. I. Définitions et applications à l'analyse canonique des variables qualitatives. Les Cahiers de l'Analyse des Données, 5:145–161, 1980.

D. Chessel and M. Hanafi. Analyse de la co-inertie de K nuages de points. Revue de Statistique Appliquée, 44(2):35–60, 1996.

D. Chessel, A.-B. Dufour, and J. Thioulouse. The ade4 package — I: One-table methods. R News, 4:5–10, 2004.

P. G. N. Digby and R. A. Kempton. Multivariate Analysis of Ecological Communities. Chapman and Hall, Population and Community Biology Series, London, 1987.

S. Dolédec and D. Chessel. Co-inertia analysis: an alternative method for studying species-environment relationships. Freshwater Biology, 31:277–294, 1994.

S. Dray, D. Chessel, and J. Thioulouse. Procrustean co-inertia analysis for the linking of multivariate datasets. Ecoscience, 10:110–119, 2003a.

S. Dray, D. Chessel, and J. Thioulouse. Co-inertia analysis and the linking of ecological tables. Ecology, 84(11):3078–3089, 2003b.

S. Dray and A. Dufour. The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4):1–20, 2007.

B. Escofier and J. Pagès. Multiple factor analysis (AFMULT package). Computational Statistics and Data Analysis, 18:121–140, 1994.

Y. Escoufier. The duality diagram: a means of better practical applications. In P. Legendre and L. Legendre, editors, Development in Numerical Ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin, 1987.

B. Flury. Common Principal Components and Related Multivariate Models. Wiley and Sons, New York, 1988.

T. Foucart. Analyse factorielle de tableaux multiples. Masson, Paris, 1984.

R. Goecke. 3D lip tracking and co-inertia analysis for improved robustness of audio-video automatic speech recognition. In Proceedings of the Auditory-Visual Speech Processing Workshop AVSP 2005, pages 109–114, 2005.

J. Gower. Statistical methods of comparing different multivariate analyses of the same data. In F. Hodson, D. Kendall, and P. Tautu, editors, Mathematics in the Archaeological and Historical Sciences, pages 138–149. University Press, Edinburgh, 1971.

M. Heo and K. Gabriel. A permutation test of association between configurations by means of the RV coefficient. Communications in Statistics - Simulation and Computation, 27:843–856, 1998.

S. Holmes. Multivariate analysis: The French way. In D. Nolan and T. Speed, editors, Festschrift for David Freedman. IMS, Beachwood, OH, 2006.

D. Jackson. PROTEST: a PROcrustean randomization TEST of community environment concordance. Ecoscience, 2:297–303, 1995.

P. Kroonenberg. The analysis of multiple tables in factorial ecology. III. Three-mode principal component analysis: "analyse triadique complète". Acta Oecologica, Oecologia Generalis, 10:245–256, 1989.

C. Lavit, Y. Escoufier, R. Sabatier, and P. Traissac. The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119, 1994.

P. Mercier, D. Chessel, and S. Dolédec. Complete correspondence analysis of an ecological profile data table: a central ordination method. Acta Oecologica, 13:25–44, 1992.

C. Montaña and P. Greig-Smith. Correspondence analysis of species by environmental variable matrices. Journal of Vegetation Science, 1:453–460, 1990.

S. Pavoine, J. Blondel, M. Baguette, and D. Chessel. A new technique for ordering asymmetrical three-dimensional data sets in ecology. Ecology, 88:512–523, 2007.

C. Rao. The use and interpretation of principal component analysis in applied research. Sankhya A, 26:329–359, 1964.

M. Simier, L. Blanc, F. Pellegrin, and D. Nandris. Approche simultanée de K couples de tableaux : application à l'étude des relations pathologie végétale-environnement. Revue de Statistique Appliquée, 47:31–46, 1999.

C. Ter Braak. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology, 67:1167–1179, 1986.

J. Thioulouse, M. Simier, and D. Chessel. Simultaneous analysis of a sequence of pairs of ecological tables with the STATICO method. Ecology, 85:272–283, 2004.

L. Tucker. An inter-battery method of factor analysis. Psychometrika, 23:111–136, 1958.

L. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31:279–311, 1966.

A. van den Wollenberg. Redundancy analysis, an alternative for canonical analysis. Psychometrika, 42(2):207–219, 1977.

J. Verneaux. Cours d'eau de Franche-Comté (Massif du Jura). Recherches écologiques sur le réseau hydrographique du Doubs. Essai de biotypologie. Thèse de doctorat, Université de Besançon, Besançon, 1973.

Stéphane Dray, Anne-Béatrice Dufour, Daniel Chessel
Laboratoire de Biométrie et Biologie Evolutive (UMR 5558); CNRS
Université de Lyon; Université Lyon 1
43, Boulevard du 11 Novembre 1918
69622 Villeurbanne Cedex, France
dray@biomserv.univ-lyon1.fr
dufour@biomserv.univ-lyon1.fr
chessel@biomserv.univ-lyon1.fr


Review of “The R Book”


Michael J. Crawley, Wiley, 2007

by Friedrich Leisch

The back cover of this physically impressive 1000-page volume advertises it as "... the first comprehensive reference manual for the R language ..." which "... introduces all the statistical models covered by R ...". Considering (a) that the R Core Team considers its own language manual (R Development Core Team, 2007b) a draft, and only the source code the ultimate reference in more cases than we like, and (b) the multitude of models implemented by R packages on CRAN or Bioconductor, I thought I would be in for an interesting read.

The book has 27 chapters. Chapters 1–8 give an introduction to R, starting with where to obtain and how to install the software, and describing the language, data input and data manipulation, graphics, tables, mathematical calculations, and classical statistical tests. Chapters 9–20 on statistical modelling form the main part of the book, with a detailed coverage of the linear regression model and its extensions like GLMs, GAMs, mixed effects and non-linear least squares. Chapters 20–27 show "other" topics like trees, time series, multivariate and spatial statistics, survival analysis, using R for simulation models and low-level graphics commands.

The preface states that the book is "aimed at beginners and intermediate users" and can be used "as a text ... as well as a reference manual". I find that the book in its present form is not optimal for either purpose. The first section on "getting started" has a few minor problems, like using many functions without quoted character arguments. In some cases this is a matter of style (like library(foo) or help(foo)), but some instances simply do not work (find(foo), apropos(foo)). Packages are often called libraries (which is a different thing), input lines can be longer than 128 characters (the current limit is 8192), and recommending MS Word as a source code editor is at least debatable even for Windows users. I personally find the R code throughout the book hard to read: it is typeset in a proportional font, uses no spaces around the assignment operator <-, no line indentation for nested code blocks, and path names sometimes contain erroneous spaces, especially after backslashes.

The chapter on "essentials of the R language" gives an introduction to the language and many non-statistical functions like string processing and regular expressions. What I found very confusing is the lack of clear structure. The book uses only one level of numbering (chapters), and this chapter is 100 pages long. E.g., on page 47 there are two headings: "The match function" and "Writing functions in R". Both seem to have the same font size and hence are on equal level. However, as is to be expected given the two topics, the section on the match function is 2 paragraphs long, while how to write functions takes the next 20 pages, with many intermezzos and headings in two different sizes. The author also jumps around a lot, many concepts are discussed or introduced as a side note for a different theme, and it is often unclear where examples end. E.g., how formal and actual arguments are matched in a function call is the first paragraph in the section on "Saving data to disc". All of this will be confusing for beginners and makes it hard to use the book as a reference manual.

In the chapter on mathematics a dozen pages are used on introducing the OLS estimate (typo in several equations: β̂ = X′X − 1X′y), including a stepwise implementation of (the correct version of) this formula. Although the next page in the book starts with solving linear equations via solve(), it is not even mentioned that it is numerically not the best idea to compute regression coefficients using the formula above.

The quality of the book increases considerably in the chapters on statistical modelling. A minor drawback is that it sometimes gives the impression that linear models are the only ones available; even discriminant analysis is not considered a model, because response variables cannot be multi-level categorical according to the cookbook recipe on page 324. However, there is a nice general introduction to statistical modelling and model selection, and linear modelling is covered in depth with many examples.

Once the author leaves the territory of linear models (and their extensions), quality decreases again. The chapter on trees uses package tree, although even the author of tree recommends using package rpart (Venables and Ripley, 2002, p. 266). The chapter on multivariate statistics basically recommends not doing multivariate statistics at all, because one is too likely to shoot oneself in the foot.

In summary, the book fails to meet the high expectations that the title and cover texts raise. In this review I could list only a selection of problems I found, and of course there are good things too, like the detailed explanation of how to enter data into a spreadsheet to form a proper data frame. The book is definitely not a reference manual for the R system or R language, but a book on applied linear modelling with (many pages of) lower-quality additional material to give the impression of universal coverage. There are better introductory books for beginners, and Venables and Ripley (2002) is still "the R book" when it comes to a reference text for applied statistics.


A (symptomatic) end note: The author gives detailed credit to the R core team and wider R community in the acknowledgements (thanks!). On page one he recommends the citation() function to users to give credit to developers (yes!); however, he seems not to have used the function too often, because R Development Core Team (2007a,b) and many others are missing from the references, which cover only 4 of 1000 pages.

Bibliography

R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2007a. URL http://www.R-project.org. ISBN 3-900051-07-0.

R Development Core Team. R Language Definition. R Foundation for Statistical Computing, Vienna, Austria, 2007b. URL http://www.R-project.org. ISBN 3-900051-13-5.

W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Fourth Edition. Springer, 2002. URL http://www.stats.ox.ac.uk/pub/MASS4/. ISBN 0-387-95457-0.

Friedrich Leisch
Ludwig-Maximilians-Universität München, Germany
Friedrich.Leisch@R-project.org

Changes in R 2.6.0
by the R Core Team

User-visible changes

• integrate(), nlm(), nlminb(), optim(), optimize() and uniroot() now have ... much earlier in their argument list. This reduces the chances of unintentional partial matching but means that the later arguments must be named in full.

• The default type for nchar() is now "chars". This is almost always what was intended, and differs from the previous default only for non-ASCII strings in a MBCS locale. There is a new argument allowNA, and the default behaviour is now to throw an error on an invalid multibyte string if type = "chars" or type = "width". (A short example appears below.)

• Connections will be closed if there is no R object referring to them. A warning is issued if this is done, either at garbage collection or if all the connection slots are in use.

New features

• abs(), sign(), sqrt(), floor(), ceiling(), exp() and the gamma, trig and hyperbolic trig functions now only accept one argument even when dispatching to a Math group method (which may accept more than one argument for other group members).

• abbreviate() gains a method argument with a new option "both.sides" which can make shorter abbreviations.

• aggregate.data.frame() no longer changes the group variables into factors, and leaves alone the levels of those which are factors. (Inter alia grants the wish of PR#9666.)

• The default max.names in all.names() and all.vars() is now -1 which means unlimited. This fixes PR#9873.

• as.vector() and the default methods of as.character(), as.complex(), as.double(), as.expression(), as.integer(), as.logical() and as.raw() no longer duplicate in most cases where the object is unchanged. (Beware: some code has been written that invalidly assumes that they do duplicate, often when using .C/.Fortran(DUP = FALSE).)

• as.complex(), as.double(), as.integer(), as.logical() and as.raw() are now primitive and internally generic for efficiency. They no longer dispatch on S3 methods for as.vector() (which was never documented). as.real() and as.numeric() remain as alternative names for as.double().

  expm1(), log(), log1p(), log2(), log10(), gamma(), lgamma(), digamma() and trigamma() are now primitive. (Note that logb() is not.)

  The Math2 and Summary groups (round, signif, all, any, max, min, sum, prod, range) are now primitive.

  See under Section "methods Package" below for some consequences for S4 methods.

• apropos() now sorts by name and not by position on the search path.
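For instance, the nchar() change above means character counts no longer depend on the encoding width (a quick sketch; the byte count assumes a UTF-8 locale):

nchar("résumé")                  # 6: characters, the new default
nchar("résumé", type = "bytes")  # 8: each accented letter uses 2 bytes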


• attr() gains an exact = TRUE argument to disable partial matching.

• bxp() now allows xlim to be specified. (PR#9754)

• C(f, SAS) now works in the same way as C(f, treatment), etc.

• chol() is now generic.

• dev2bitmap() has a new option to go via PDF and so allow semi-transparent colours to be used.

• dev.interactive() regards devices with the displaylist enabled as interactive, and packages can register the names of their devices as interactive via deviceIsInteractive().

• download.packages() and available.packages() (and functions which use them) now support in repos or contriburl either 'file:' plus a general path (including drives on a UNC path on Windows) or a 'file:///' URL in the same way as url().

• dQuote() and sQuote() are more flexible, with rendering controlled by the new option useFancyQuotes. This includes the ability to have TeX-style rendering and directional quotes (the so-called "smart quotes") on Windows. The default is to use directional quotes in UTF-8 locales (as before) and in the Rgui console on Windows (new).

• duplicated() and unique() and their methods in base gain an additional argument fromLast.

• fifo() no longer has a default description argument. fifo("") is now implemented, and works in the same way as file("").

• file.edit() and file.show() now tilde-expand file paths on all interfaces (they used to on some and not others).

• The find() argument is now named numeric and not numeric.: the latter was needed to avoid warnings about name clashes many years ago, but partial matching was used.

• stats:::.getXlevels() confines attention to factors since some users expected R to treat unclass(a_factor) as a numeric vector.

• grep(), strsplit() and friends now warn if incompatible sets of options are used, instead of silently using the documented priority.

• gsub()/sub() with perl = TRUE now preserves attributes from the argument x on the result.

• is.finite() and is.infinite() are now S3 and S4 generic.

• jpeg(), png(), bmp() (Windows), dev2bitmap() and bitmap() have a new argument units to specify the units of width and height.

• levels() is now generic (levels<-() has been for a long time).

• Loading serialized raw objects with load() is now considerably faster.

• New primitive nzchar() as a faster alternative to nchar(x) > 0 (and avoids having to convert to wide chars in a MBCS locale and hence consider validity).

• The way old.packages() and hence update.packages() handle packages with different versions in multiple package repositories has been changed. Previously the first package encountered was selected; now it is the one with the highest version number.

• optim(method = "L-BFGS-B") now accepts zero-length parameters, like the other methods. Also, method = "SANN" no longer attempts to optimize in this case.

• New options showWarnCalls and showErrorCalls to give a concise traceback on warnings and errors. showErrorCalls = TRUE is the default for non-interactive sessions. Option showNCalls controls how abbreviated the call sequence is.

• New options warnPartialMatchDollar, warnPartialMatchArgs and warnPartialMatchAttr to help detect the unintended use of partial matching in $, argument matching and attr() respectively.

• A device named as a character string in options(device =) is now looked for in the grDevices name space if it is not visible from the global environment.

• pmatch(x, y, duplicates.ok = TRUE) now uses hashing and so is much faster for large x and y when most matches are exact.

• qr() is now generic.

• It is now a warning to have a non-integer object for .Random.seed: this indicates a user had been playing with it, and it has always been documented that users should only save and restore it.

• New higher-order functions Reduce(), Filter() and Map(); see the examples below.
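The new higher-order functions behave as their names suggest; for example:

Reduce(`+`, 1:5)                        # 15
Filter(function(x) x %% 2 == 0, 1:10)   # 2 4 6 8 10
Map(`*`, 1:3, 4:6)                      # list(4, 10, 18)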


• regexpr() and gregexpr() gain an • sweep() gains an argument check.margin =


ignore.case argument for consistency with TRUE which warns about mismatched dimen-
grep(). (This does change the positional sions.
matching of arguments, but no instances of
• The mathematical annotation facility
positional matching beyond the second were
(plotmath()) now recognises a symbol() func-
found.)
tion which forces the font to be a symbol
• relist() utility, an S3 generic with several font. This allows access to all characters in
methods, providing an inverse for unlist(); the Adobe Symbol encoding within plotmath
thanks to a code proposal from Andrew expressions.
Clausen.
• For OSes that cannot unset environment vari-
• require() now returns invisibly. ables, Sys.unsetenv() sets the value to "",
with a warning.
• The interface to reshape() has been revised,
allowing some simplified forms that did not • New function Sys.which(), an interface to
work before, and somewhat improved error which on Unix-alikes and an emulation on Win-
handling. A new argument sep has been intro- dows.
duced to replace simple usages of split (the
• On Unix-alikes, system(, intern = TRUE) re-
old features are retained).
ports on very long lines that may be truncated,
• rmultinom() uses a high-precision accumula- giving the line number of the content being
tor where available, and so is more likely to read.
give the same result on different platforms (al-
• termplot() has a default for ask that uses
though it is still possible to get different results,
dev.interactive().
and the result may differ from previous ver-
sions of R). It allows ylim to be set, or computed to cover
all the plots to be made (the new default) or
• row() and col() now work on matrix-like ob- computed for each plot (the previous default).
jects such as data frames, not just matrices.
• uniroot(f, *) is slightly faster for non-
• Rprof() allows smaller values of interval on trivial f() because it computes f(lower) and
machines that support it: for example modern f(upper) only once, and it has new optional
Linux systems support interval = 0.001. arguments f.lower and f.upper by which the
caller can pass these.
• unlink() is now internal, using common POSIX code on all platforms.

• unsplit() now works with lists of data frames.

• The vcov() methods for classes "gls" and "nlme" have migrated to package nlme.

• vignette() has a new argument all to choose between showing vignettes in attached packages or in all installed packages.

• New function within(), which is like with(), except that it returns modified versions of lists and data frames.
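A short illustration of within() (a minimal sketch):

    df <- data.frame(x = 1:3, y = c(2, 4, 6))
    within(df, z <- x + y)   # returns a copy of df with an added column z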
• X11(), postscript() (and hence bitmap()), xfig(), jpeg(), png() and the Windows devices win.print(), win.metafile() and bmp() now warn (once at first use) if semi-transparent colours are used (rather than silently treating them as fully transparent).

• New function xspline() to provide base graphics support of X-splines (cf. grid.xspline()).

• New function xyTable() does the 2D gridding “computations” used by sunflowerplot().
• Rd conversion to HTML and CHM now makes use of classes, which are set in the stylesheets. Editing ‘R.css’ will change the styles used for \env, \option, \pkg etc. (CHM styles are set at compilation time.)

• The documented arguments of %*% have been changed to be x and y, to match S and the implicit S4 generic.

• If members of the Ops group (the arithmetic, logical and comparison operators) and %*% are called as functions, e.g., `>`(x, y), positional matching is always used. (It used to be the case that positional matching was used for the default methods, but names would be matched for S3 and S4 methods, and in the case of ! the argument name differed between S3 and S4 methods.)

• Imports environments of name spaces are named (as "imports:foo"), and so are known e.g. to environmentName().

• Package stats4 uses lazy-loading, not SaveImage (which is now deprecated).

• Installing help for a package now parses the ‘.Rd’ file only once, rather than once for each type.

• PCRE has been updated to version 7.2.

• bzip2 has been updated to version 1.0.4.

• gettext has been updated to version 0.16.1.

• There is now a global CHARSXP cache, R_StringHash. CHARSXPs are no longer duplicated and must not be modified in place. Developers should strive to only use mkChar (and mkString) for creating new CHARSXPs and avoid use of allocString. A new macro, CallocCharBuf, can be used to obtain a temporary char buffer for manipulating character data. This patch was written by Seth Falcon.

• The internal equivalents of as.complex(), as.double(), as.integer() and as.logical() used to handle length-1 arguments now accept character strings (rather than report that this is “unimplemented”).

• Lazy-loading a package is now substantially more efficient (in memory saved and load time).

• Various performance improvements lead to a 45% reduction in the startup time without methods (and one-sixth with; methods now takes 75% of the startup time of a default session).

• The [[ subsetting operator now has an argument exact that allows programmers to disable partial matching (which will in due course become the default). The default value is exact = NA, which causes a warning to be issued when partial matching occurs. When exact = TRUE, no partial matching will be performed. When exact = FALSE, partial matching can occur and no warning will be issued. This patch was written by Seth Falcon.
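The three settings of exact behave as follows (a minimal sketch):

    l <- list(alpha = 1)
    l[["alp", exact = FALSE]]   # 1: partial matching is allowed silently
    l[["alp", exact = TRUE]]    # NULL: no partial matching
    l[["alp"]]                  # 1, with a warning under the exact = NA default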
more user-friendly.
• Many of the C-level warning/error messages (e.g., from subscripting) have been re-worked to give more detailed information on either the location or the cause of the problem.

• The S3 and S4 Math groups have been harmonized. Functions log1p(), expm1(), log10() and log2() are members of the S3 group, and sign(), log1p(), expm1(), log2(), cummax(), cummin(), digamma(), trigamma() and trunc() are members of the S4 group. gammaCody() is no longer in the S3 group. They are now all primitive.

• The initialization of the random-number stream makes use of the sub-second part of the current time where available. Initialization of the 1997 Knuth TAOCP generator is now done in R code, avoiding some C code whose licence status has been questioned.

• The reporting of syntax errors has been made more user-friendly.

methods Package

• Packages using methods have to have been installed in R 2.4.0 or later (when various internal representations were changed).

• Internally generic primitives no longer dispatch S4 methods on S3 objects.

• load() and restoring a workspace attempt to detect and warn on the loading of pre-2.4.0 S4 objects.

• Making functions primitive changes the semantics of S4 dispatch: these no longer dispatch on classes based on types but do dispatch whenever the function in the base name space is called. This applies to as.complex(), as.integer(), as.logical(), as.numeric(), as.raw(), expm1(), log(), log1p(), log2(), log10(), gamma(), lgamma(), digamma() and trigamma(), as well as the Math2 and Summary groups.
Because all members of the group generics are now primitive, they are all S4 generic, and setting an S4 group generic does at last apply to all members and not just those already made S4 generic.

as.double() and as.real() are identical to as.numeric(), and now remain so even if S4 methods are set on any of them. Since as.numeric is the traditional name used in S4, currently methods must be exported from a ‘NAMESPACE’ for as.numeric only.

• The S4 generic for ! has been changed to have signature (x) (was (e1)) to match the documentation and the S3 generic. setMethod() will fix up methods defined for (e1), with a warning.

• The "structure" S4 class now has methods that implement the concept of structures as described in the Blue Book: that element-by-element functions and operators leave structure intact unless they change the length. The informal behavior of R for vectors with attributes was inconsistent.

• The implicitGeneric() function and relatives have been added to specify how a function in a package should look when methods are defined for it. This will be used to ensure that generic versions of functions in R core are consistent. See ?implicitGeneric.

• Error messages generated by some of the functions in the methods package provide the name of the generic to provide more contextual information.

• It is now possible to use setGeneric(useAsDefault = FALSE) to define a new generic with the name of a primitive function (but having no connection with the primitive).

Deprecated & defunct

• $ on an atomic vector now gives a warning that it is “invalid”. It remains deprecated, but may be removed in R ≥ 2.7.0.

• storage.mode(x) <- "real" and storage.mode(x) <- "single" are defunct: use instead storage.mode(x) <- "double" and mode(x) <- "single".
• In package installation, ‘SaveImage: yes’ is deprecated in favour of ‘LazyLoad: yes’.

• seemsS4Object (methods package) is deprecated in favour of isS4().

• It is planned that [[exact = TRUE]] will become the default in R 2.7.0.

Utilities

• checkS3methods() (invoked by R CMD check) now checks the arguments of methods for primitive members of the S3 group generics.

• R CMD check now does a recursive copy on the ‘tests’ directory.

• R CMD check now warns on non-ASCII ‘.Rd’ files without an \encoding field, rather than just on ones that are definitely not from an ISO-8859 encoding. This agrees with the long-standing stipulation in “Writing R Extensions”, and catches some packages with UTF-8 man pages.

• R CMD check now warns on DESCRIPTION files with a non-portable Encoding field, or with non-ASCII data and no Encoding field.

• R CMD check now loads all the Suggests and Enhances dependencies to reduce warnings about non-visible objects, and also emulates standard functions (such as shell()) on alternative R platforms.

• R CMD check now (by default) attempts to latex the vignettes rather than just weave and tangle them: this will give a NOTE if there are latex errors.

• R CMD check computations no longer ignore Rd \usage entries for functions for extracting or replacing parts of an object, so S3 methods should use the appropriate \method{} markup.

• R CMD check now checks for CR (as well as CRLF) line endings in C/C++/Fortran source files, and for non-LF line endings in ‘Makefile[.in]’ and ‘Makevars[.in]’ in the package ‘src’ directory. R CMD build will correct non-LF line endings in source files and in the make files mentioned.

• Rdconv now warns about unmatched braces rather than silently omitting sections containing them. (Suggestion by Bill Dunlap, PR#9649)

Rdconv now renders (rather than ignores) \var{} inside \code{} markup in LaTeX conversion.

R CMD Rdconv gains a ‘--encoding’ argument to set the default encoding for conversions.

• The list of CRAN mirrors now has a new (manually maintained) column "OK" which flags mirrors that seem to be OK; only those are used
by chooseCRANmirror(). The now exported function getCRANmirrors() can be used to get all known mirrors or only the ones that are OK.
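For example (a minimal sketch; it assumes the mirror list carries a Name column alongside the new OK column):

    m <- getCRANmirrors()         # all known mirrors
    head(m[, c("Name", "OK")])    # inspect the new OK flag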
• R CMD SHLIB gains arguments ‘--clean’ and ‘--preclean’ to clean up intermediate files after and before building.

• R CMD config now knows about FC and FCFLAGS (used for F9x compilation).

• R CMD Rdconv now does a better job of rendering quotes in titles in HTML, and \sQuote and \dQuote into text on Windows.

C-level facilities

• New utility function alloc3DArray similar to allocMatrix.

• The entry point R_seemsS4Object in ‘Rinternals.h’ has not been needed since R 2.4.0 and has been removed. Use IS_S4_OBJECT instead.

• Applications embedding R can use R_getEmbeddingDllInfo() to obtain DllInfo for registering symbols present in the application itself.

• The instructions for making and using standalone libRmath have been moved to the R Installation and Administration manual.

• CHAR() now returns (const char *) since CHARSXPs should no longer be modified in place. This change allows compilers to warn or error about improper modification. Thanks to Herve Pages for the suggestion.

• acopy_string is a (provisional) new helper function that copies character data and returns a pointer to memory allocated using R_alloc. This can be used to create a copy of a string stored in a CHARSXP before passing the data on to a function that modifies its arguments.

• asLogical, asInteger, asReal and asComplex now accept STRSXP and CHARSXP arguments, and asChar accepts CHARSXP.

• New R_GE_str2col() exported via ‘R_ext/GraphicsEngine.h’ for external device developers.

• doKeybd and doMouseevent are now exported in ‘GraphicsDevice.h’.

• R_alloc now has first argument of type size_t to support 64-bit platforms (e.g., Win64) with a 32-bit long type.

• The type of the last two arguments of getMatrixDimnames (non-API but mentioned in ‘R-exts.texi’ and in ‘Rinternals.h’) has been changed to const char ** (from char **).

• R_FINITE now always resolves to the function call R_finite in packages (rather than sometimes substituting isfinite). This avoids some issues where R headers are called from C++ code using features tested on the C compiler.

• The advice to include R headers from C++ inside extern "C" has been changed. It is nowadays better not to wrap the headers, as they include other headers which on some OSes should not be wrapped.

• ‘Rinternals.h’ no longer includes a substantial set of C headers. All but ‘ctype.h’ and ‘errno.h’ are included by ‘R.h’, which is supposed to be used before ‘Rinternals.h’.

• Including C system headers can be avoided by defining NO_C_HEADERS before including R headers. This is intended to be used from C++ code, and you will need to include C++ equivalents such as <cmath> before the R headers.

Installation

• The test-Lapack test is now part of make check.

• The stat system call is now required, along with opendir (which had long been used but not tested for). (make check would have failed in earlier versions without these calls.)

• evince is now considered as a possible PDF viewer.

• make install-strip now also strips the DLLs in the standard packages.

• Perl 5.8.0 (released in July 2002) or later is now required. (R 2.4.0 and later have in fact required 5.6.1 or later.)

• The C function finite is no longer used: we expect a C99 compiler which will have isfinite. (If that is missing, we test separately for NaN, Inf and -Inf.)

• A script/executable texi2dvi is now required on Unix-alikes: it is part of the texinfo distribution.

• Files ‘texinfo.tex’ and ‘txi-en.tex’ are no longer supplied in doc/manual (as the latest versions have an incompatible licence). You will need to ensure that your texinfo and/or TeX installations supply them.
• wcstod is now required for MBCS support.

• There are some experimental provisions for building on Cygwin.

Package Installation

• The encoding declared in the ‘DESCRIPTION’ file is now used as the default encoding for ‘.Rd’ files.

• A standard for specifying package license information in the ‘DESCRIPTION’ License field was introduced, see “Writing R Extensions”. In addition, files ‘LICENSE’ or ‘LICENCE’ in a package top-level source directory are now installed (so putting copies into the ‘inst’ subdirectory is no longer necessary).

• install.packages() on a Unix-alike now updates ‘doc/html/packages.html’ only if packages are installed to ‘.Library’ (by that exact name).

• R CMD INSTALL with option ‘--clean’ now runs R CMD SHLIB with option ‘--clean’ to do the clean up (unless there is a ‘src/Makefile’), and this will remove $(OBJECTS) (which might have been redefined in ‘Makevars’).

R CMD INSTALL with ‘--preclean’ cleans up the sources after a previous installation (as if that had used ‘--clean’) before attempting to install.

R CMD INSTALL will now run R CMD SHLIB in the ‘src’ directory if ‘src/Makevars’ is present, even if there are no source files with known extensions.

• If there is a file ‘src/Makefile’, ‘src/Makevars’ is now ignored (it could be included by ‘src/Makefile’ if desired), and it is preceded by ‘etc/Makeconf’ rather than ‘R_HOME/share/make/shlib.mk’. Thus the makefiles read are ‘R_HOME/etc/Makeconf’, ‘src/Makefile’ in the package and then any personal ‘Makevars’ files.

• R CMD SHLIB used to support the use of OBJS in ‘Makevars’, but this was changed to OBJECTS in 2001. The undocumented alternative of OBJS has finally been removed.

• R CMD check no longer issues a warning about no data sets being present if a lazyload db is found (as determined by the presence of ‘Rdata.rdb’, ‘Rdata.rds’, and ‘Rdata.rdx’ in the ‘data’ subdirectory).

Bug fixes

• charmatch() and pmatch() used to accept non-integer values for nomatch even though the return value was documented to be integer. Now nomatch is coerced to integer (rather than the result being coerced to the type of nomatch).

• match.call() no longer “works” outside a function unless definition is supplied. (Under some circumstances it used to “work”, matching itself.)

• The formula methods of boxplot, cdplot, pairs and spineplot now attach stats so that model.frame() is visible where they evaluate it.

• Date-time objects are no longer regarded as numeric by is.numeric().

• methods("Math") did not work if methods was not attached.

• readChar() read an extra empty item (or more than one) beyond the end of the source; in some conditions it would terminate early when reading an item of length 0.

• Added a promise evaluation stack so interrupted promise evaluations can be restarted.

• R.version[1:10] now prints nicely.

• In the methods package, prototypes are now inherited for the .Data “slot”; i.e., for classes that contain one of the basic data types.

• data_frame[[i, j]] now works if i is character.
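For instance (a minimal sketch):

    df <- data.frame(x = 1:3, y = 4:6, row.names = c("a", "b", "c"))
    df[["b", "y"]]   # 5: a character row index now works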
• write.dcf() no longer writes NA fields (PR#9796), and works correctly on empty descriptions.

• pbeta(x, log.p = TRUE) now has improved accuracy in many cases, and so have functions depending on it such as pt(), pf() and pbinom().

• mle() had problems with the L-BFGS-B method in the no-parameter case and consequently also when profiling 1-parameter models (fix thanks to Ben Bolker).

• Two bugs fixed in methods that involve the ... argument in the generic function: previously it failed to catch methods that just dropped the ...; and use of callGeneric() with no arguments failed in some circumstances when ... was a formal argument.

• sequence() now behaves more reasonably, although not back-compatibly for zero or negative input.
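For example, under the new behaviour (a minimal sketch):

    sequence(c(3, 2))      # 1 2 3 1 2
    sequence(c(3, 0, 2))   # a zero count now contributes nothing: 1 2 3 1 2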
• nls() now allows more peculiar but reasonable ways of being called, e.g., with data = list(uneven_lengths) or a model without variables.

• match.arg() was not behaving as documented when several.ok = TRUE (PR#9859), gave spurious warnings when arg had the wrong length, and was incorrectly documented (exact matches are returned even when there is more than one partial match).

• The data.frame method for split<-() was broken.

• The test for -D__NO_MATH_INLINES was badly broken and returned true on all non-glibc platforms and false on all glibc ones (whether they were broken or not).

• LF was missing after the last prompt when ‘--quiet’ was used without ‘--slave’. Use ‘--slave’ when no final LF is desired.

• Fixed a bug in the initialisation code of the grid package for determining the boundaries of shapes. Problem reported by Hadley Wickham; the symptom was the error message ‘Polygon edge not found’.

• str() is no longer slow for large POSIXct objects. Its output is also slightly more compact for such objects; implementation via new optional argument give.head.

• strsplit(*, fixed = TRUE), potentially iconv() and internal string formatting are now faster for large strings, thanks to report PR#9902 by John Brzustowski.

• de.restore() gave a spurious warning for matrices. (Ben Bolker)

• plot(fn, xlim = c(a, b)) would not set from and to properly when plotting a function. The argument lists to curve() and plot.function() have been modified slightly as part of the fix.

• julian() was documented to work with POSIXt origins, but did not work with POSIXlt ones. (PR#9908)

• Dataset HairEyeColor has been corrected to agree with Friendly (2000): the change involves the breakdown of the Brown hair / Brown eye cell by Sex, and only totals over Sex are given in the original source.

• Trailing spaces are now consistently stripped from \alias{} entries in ‘.Rd’ files, and this is now documented. (PR#9915)

• .find.packages(), packageDescription() and sessionInfo() assumed that attached environments named "package:foo" were package environments, although misguided users could use such a name in attach().

• spline() and splinefun() with method = "periodic" could return incorrect results when length(x) was 2 or 3.

• getS3method() could fail if the method name contained a regexp metacharacter such as "+".

• help(a_character_vector) now uses the name and not the value of the vector unless it has length exactly one, so e.g. help(letters) now gives help on letters. (Related to PR#9927)

• Ranges in chartr() now work better in CJK locales.

Changes on CRAN
by Kurt Hornik

New contributed packages

ADaCGH Analysis and plotting of array CGH data. Allows usage of Circular Binary Segmentation, wavelet-based smoothing, ACE method (CGH Explorer), HMM, BioHMM, GLAD, CGHseg, and Price’s modification of Smith & Waterman’s algorithm. Most computations are parallelized. Figures are imagemaps with links to IDClight (http://idclight.bioinfo.cnio.es). By Ramon Diaz-Uriarte and Oscar M. Rueda. Wavelet-based aCGH smoothing code from Li Hsu and Douglas Grove, imagemap code from Barry Rowlingson.

AIS Tools to look at the data (“Ad Inidicia Spectata”). By Micah Altman.

AcceptanceSampling Creation and evaluation of Acceptance Sampling Plans. Plans can be single, double or multiple sampling plans. By Andreas Kiermeier.

Amelia Amelia II: A Program for Missing Data. Amelia II “multiply imputes” missing data in a single cross-section (such as a survey), from a time series (like variables collected for each
year in a country), or from a time-series-cross-sectional data set (such as collected by years for each of several countries). Amelia II implements a bootstrapping-based algorithm that gives essentially the same answers as the standard IP or EMis approaches, is usually considerably faster than existing approaches, and can handle many more variables. The program also generalizes existing approaches by allowing for trends in time series across observations within a cross-sectional unit, as well as priors that allow experts to incorporate beliefs they have about the values of missing cells in their data. The program works from the R command line or via a graphical user interface that does not require users to know R. By James Honaker, Gary King, and Matthew Blackwell.

BiodiversityR A GUI (via Rcmdr) and some utility functions (often based on the vegan package) for statistical analysis of biodiversity and ecological communities, including species accumulation curves, diversity indices, Renyi profiles, GLMs for analysis of species abundance and presence-absence, distance matrices, Mantel tests, and cluster, constrained and unconstrained ordination analysis. By Roeland Kindt.

CORREP Multivariate correlation estimator and statistical inference procedures. By Dongxiao Zhu and Youjuan Li.

CPGchron Create radiocarbon-dated depth chronologies, following the work of Parnell and Haslett (2007, submitted to JRSSC). By Andrew Parnell.

Cairo Provides a Cairo graphics device that can be used to create high-quality vector (PDF, PostScript and SVG) and bitmap output (PNG, JPEG, TIFF), and high-quality rendering in displays (X11 and Win32). Since it uses the same back-end for all output, copying across formats is WYSIWYG. Files are created without the dependence on X11 or other external programs. This device supports the alpha channel (semi-transparent drawing) and resulting images can contain transparent and semi-transparent regions. It is ideal for use in server environments (file output) and as a replacement for other devices that don’t have Cairo’s capabilities such as alpha support or anti-aliasing. Backends are modular such that any subset of backends is supported. By Simon Urbanek and Jeffrey Horner.

CarbonEL Carbon Event Loop: hooks a Carbon event loop handler into R. This is useful for enabling UI from a console R (such as using the Quartz device from Terminal or ESS). By Simon Urbanek.

DAAGbio Data sets and functions useful for the display of microarray data and for demonstrations with microarray data. By John Maindonald.

Defaults Set, get, and import global function defaults. By Jeffrey A. Ryan.

Devore7 Data sets and sample analyses from Jay L. Devore (2008), “Probability and Statistics for Engineering and the Sciences (7th ed)”, Thomson. Original by Jay L. Devore, modifications by Douglas Bates, modifications for the 7th edition by John Verzani.

G1DBN Perform Dynamic Bayesian Network inference using 1st order conditional dependencies. By Sophie Lebre.

GSA Gene Set Analysis. By Brad Efron and R. Tibshirani.

GSM Gamma Shape Mixture. Implements a Bayesian approach for estimation of a mixture of gamma distributions in which the mixing occurs over the shape parameter. This family provides a flexible and novel approach for modeling heavy-tailed distributions, is computationally efficient, and only requires specifying a prior distribution for a single parameter. By Sergio Venturini.

GeneF Implements several generalized F-statistics, including ones based on the flexible isotonic/monotonic regression or order restricted hypothesis testing. By Yinglei Lai.

HFWutils Functions for Excel connections, garbage collection, string matching, and passing by reference. By Felix Wittmann.

ICEinfer Incremental Cost-Effectiveness (ICE) Statistical Inference. Given two unbiased samples of patient-level data on cost and effectiveness for a pair of treatments, make head-to-head treatment comparisons by (i) generating the bivariate bootstrap resampling distribution of ICE uncertainty for a specified value of the shadow price of health, λ, (ii) forming the wedge-shaped ICE confidence region with specified confidence fraction within [0.50, 0.99] that is equivariant with respect to changes in λ, (iii) coloring the bootstrap outcomes within the above confidence wedge with economic preferences from an ICE map with specified values of the λ, β and γ parameters, (iv) displaying VAGR and ALICE acceptability curves, and (v) displaying indifference (iso-preference) curves from an ICE map with specified values of λ, β and γ or η parameters. By Bob Obenchain.
ICS ICS/ICA computation based on two scatter matrices. Implements Oja et al.’s method of two different scatter matrices to obtain an invariant coordinate system or independent components, depending on the underlying assumptions. By Klaus Nordhausen, Hannu Oja, and Dave Tyler.

ICSNP Tools for multivariate nonparametrics, such as location tests based on marginal ranks, spatial median and spatial signs computation, Hotelling’s T-test, and estimates of shape. By Klaus Nordhausen, Seija Sirkia, Hannu Oja, and Dave Tyler.

LLAhclust Hierarchical clustering of variables or objects based on the likelihood linkage analysis method. The likelihood linkage analysis is a general agglomerative hierarchical clustering method developed in France by Lerman in a long series of research articles and books. Initially proposed in the framework of variable clustering, it has been progressively extended to allow the clustering of very general object descriptions. The approach mainly consists in replacing the value of the estimated similarity coefficient by the probability of finding a lower value under the hypothesis of “absence of link”. Package LLAhclust contains routines for computing various types of probabilistic similarity coefficients between variables or object descriptions. Once the similarity values between variables/objects are computed, a hierarchical clustering can be performed using several probabilistic and non-probabilistic aggregation criteria, and indices measuring the quality of the partitions compatible with the resulting hierarchy can be computed. By Ivan Kojadinovic, Israël-César Lerman, and Philippe Peter.

LLN Learning with Latent Networks. A new framework in which graph-structured data are used to train a classifier in a latent space, and then classify new nodes. During the learning phase, a latent representation of the network is first learned and a supervised classifier is then built in the learned latent space. In order to classify new nodes, the positions of these nodes in the learned latent space are estimated using the existing links between the new nodes and the learning set nodes. It is then possible to apply the supervised classifier to assign each new node to one of the classes. By Charles Bouveyron & Hugh Chipman.

LearnBayes Functions helpful in learning the basic tenets of Bayesian statistical inference. Contains functions for summarizing basic one and two parameter posterior distributions and predictive distributions, MCMC algorithms for summarizing posterior distributions defined by the user, functions for regression models, hierarchical models, Bayesian tests, and illustrations of Gibbs sampling. By Jim Albert.

LogConcDEAD Computes the maximum likelihood estimator from an i.i.d. sample of data from a log-concave density in any number of dimensions. Plots are available for 1- and 2-d data. By Madeleine Cule, Robert Gramacy, and Richard Samworth.

MLDS Maximum Likelihood Difference Scaling. Difference scaling is a method for scaling perceived supra-threshold differences. The package contains functions that allow the user to design and run a difference scaling experiment, to fit the resulting data by maximum likelihood and to test the internal validity of the estimated scale. By Kenneth Knoblauch and Laurence T. Maloney, based in part on C code written by Laurence T. Maloney and J. N. Yang.

MLEcens MLE for bivariate (interval) censored data. Contains functions to compute the nonparametric maximum likelihood estimator for the bivariate distribution of (X, Y), when realizations of (X, Y) cannot be observed directly. More precisely, the MLE is computed based on a set of rectangles (“observation rectangles”) that are known to contain the unobservable realizations of (X, Y). The methods can also be used for univariate censored data, and for censored data with competing risks. The package contains the functionality of bicreduc, which will no longer be maintained. By Marloes Maathuis.

MiscPsycho Miscellaneous Psychometrics: statistical analyses useful for applied psychometricians. By Harold C. Doran.

ORMDR Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions. By Eun-Kyung Lee, Sung Gon Yi, Yoojin Jung, and Taesung Park.

PET Simulation and reconstruction of PET images. Implements different analytic/direct and iterative reconstruction methods of Peter Toft, and offers the possibility to simulate PET data. By Joern Schulz, Peter Toft, Jesper James Jensen, and Peter Philipsen.

PSAgraphics Propensity Score Analysis (PSA) Graphics. Includes functions which test balance within strata of categorical and quantitative covariates, give a representation of the estimated effect size by stratum, provide a graphic and loess based effect size estimate, and various balance functions that provide measures of the balance achieved via a PSA in a categorical
covariate. By James E. Helmreich and Robert M. Pruzek.

PerformanceAnalytics Econometric tools for performance and risk analysis. Aims to aid practitioners and researchers in utilizing the latest research in analysis of non-normal return streams. In general, the package is most tested on return (rather than price) data on a monthly scale, but most functions will work with daily or irregular return data as well. By Peter Carl and Brian G. Peterson.

QRMlib Code to examine Quantitative Risk Management concepts, accompanying the book “Quantitative Risk Management: Concepts, Techniques and Tools” by Alexander J. McNeil, Rüdiger Frey and Paul Embrechts. S-Plus original by Alexander McNeil; R port by Scott Ulman.

R.cache Fast and light-weight caching of objects. Methods for memoization, that is, caching arbitrary R objects in persistent memory. Objects can be loaded and saved stratified on a set of hashing objects. By Henrik Bengtsson.

R.huge Methods for accessing huge amounts of data. Provides a class representing a matrix where the actual data is stored in a binary format on the local file system. This way the size limit of the data is set by the file system and not the memory. Currently in an alpha/early-beta version. By Henrik Bengtsson.

RBGL Interface to the Boost C++ graph library. Demo of interface with full copy of all hpp defining Boost. By Vince Carey, Li Long, and R. Gentleman.

RDieHarder An interface to the dieharder test suite of random number generators and tests that was developed by Robert G. Brown, extending earlier work by George Marsaglia and others. By Dirk Eddelbuettel.

RLRsim Exact (Restricted) Likelihood Ratio tests for mixed and additive models. Provides rapid simulation-based tests for the presence of variance components/nonparametric terms with a convenient interface for a variety of models. By Fabian Scheipl.

ROptEst Optimally robust estimation using S4 classes and methods. By Matthias Kohl.

ROptRegTS Optimally robust estimation for regression-type models using S4 classes and methods. By Matthias Kohl.

RSVGTipsDevice An R SVG graphics device with dynamic tips and hyperlinks using the w3.org XML standard for Scalable Vector Graphics. Supports tooltips with 1 to 3 lines and line styles. By Tony Plate, based on RSvgDevice by T. Jake Luciani.

Rcapture Loglinear Models in Capture-Recapture Experiments. Estimation of abundance and other demographic parameters for closed populations, open populations and the robust design in capture-recapture experiments using loglinear models. By Sophie Baillargeon and Louis-Paul Rivest.

RcmdrPlugin.TeachingDemos Provides an Rcmdr “plug-in” based on the TeachingDemos package, and is primarily for illustrative purposes. By John Fox.

Reliability Functions for estimating parameters in software reliability models. Only infinite failure models are implemented so far. By Andreas Wittmann.

RiboSort Rapid classification of (TRFLP & ARISA) microbial community profiles, eliminating the laborious task of manually classifying community fingerprints in microbial studies. By automatically assigning detected fragments and their respective relative abundances to appropriate ribotypes, RiboSort saves time and greatly simplifies the preparation of DNA fingerprint data sets for statistical analysis. By Úna Scallan & Ann-Kathrin Liliensiek.

Rmetrics Rmetrics: Financial Engineering and Computational Finance. Environment for teaching “Financial Engineering and Computational Finance”. By Diethelm Wuertz and many others.

RobLox Optimally robust influence curves in case of normal location with unknown scale. By Matthias Kohl.

RobRex Optimally robust influence curves in case of linear regression with unknown scale and standard normal distributed errors where the regressor is random. By Matthias Kohl.

Rsac Seismic analysis tools in R. Mostly functions to reproduce some of the Seismic Analysis Code (SAC, http://www.llnl.gov/sac/) commands in R. This includes reading standard binary ‘.SAC’ files, plotting arrays of seismic recordings, filtering, integration, differentiation, instrument deconvolution, and rotation of horizontal components. By Eric M. Thompson.

Runuran Interface to the UNU.RAN library for Universal Non-Uniform RANdom variate generators. By Josef Leydold and Wolfgang Hörmann.
Ryacas An interface to the yacas computer algebra system. By Rob Goedman, Gabor Grothendieck, Søren Højsgaard, Ayal Pinkus.

SRPM Shared Reproducibility Package Management. A package development and management system for distributed reproducible research. By Roger D. Peng.

SimHap A comprehensive modeling framework for epidemiological outcomes and a multiple-imputation approach to haplotypic analysis of population-based data. Can perform single SNP and multi-locus (haplotype) association analyses for continuous Normal, binary, longitudinal and right-censored outcomes measured in population-based samples. Uses estimation maximization techniques for inferring haplotypic phase in individuals, and incorporates a multiple-imputation approach to deal with the uncertainty of imputed haplotypes in association testing. By Pamela A. McCaskie.

SpatialNP Multivariate nonparametric methods based on spatial signs and ranks. Contains tests and estimates of location, tests of independence, tests of sphericity, and several estimates of shape and regression, all based on spatial signs, symmetrized signs, ranks and signed ranks. By Seija Sirkia, Jaakko Nevalainen, Klaus Nordhausen, and Hannu Oja.

TIMP Problem solving environment for fitting superposition models. Measurements often represent a superposition of the contributions of distinct sub-systems resolved with respect to many experimental variables (time, temperature, wavelength, pH, polarization, etc). TIMP allows parametric models for such superpositions to be fit and validated. The package has been extensively applied to modeling data arising in spectroscopy experiments. By Katharine M. Mullen and Ivo H. M. van Stokkum.

TwslmSpikeWeight Normalization of cDNA microarray data with the two-way semilinear model (TW-SLM). It incorporates information from control spots and data quality in the TW-SLM to improve normalization of cDNA microarray data. Huber’s and Tukey’s bisquare weight functions are available for robust estimation of the TW-SLM. By Deli Wang and Jian Huang.

Umacs Universal Markov chain sampler. By Jouni Kerman.

WilcoxCV Functions to perform fast variable selection based on the Wilcoxon rank sum test in the cross-validation or Monte-Carlo cross-validation settings, for use in microarray-based binary classification. By Anne-Laure Boulesteix.

YaleToolkit Tools for the graphical exploration of complex multivariate data developed at Yale University. By John W. Emerson and Walton Green.

adegenet Genetic data handling for multivariate analysis using ade4. By Thibaut Jombart.

ads Spatial point patterns analysis. Performs first- and second-order multi-scale analyses derived from Ripley’s K-function, for univariate, multivariate and marked mapped data in rectangular, circular or irregular shaped sampling windows, with tests of statistical significance based on Monte Carlo simulations. By R. Pelissier and F. Goreaud.

argosfilter Functions to filter animal satellite tracking data obtained from Argos. Especially indicated for telemetry studies of marine animals, where Argos locations are predominantly of low quality. By Carla Freitas.

arrayImpute Missing imputation for microarray data. By Eun-kyung Lee, Dankyu Yoon, and Taesung Park.

ars Adaptive Rejection Sampling. By Paulino Perez Rodriguez; original C++ code from Arnost Komarek.

arulesSequences Add-on for arules to handle and mine frequent sequences. Provides interfaces to the C++ implementation of cSPADE by Mohammed J. Zaki. By Christian Buchta and Michael Hahsler.

asuR Functions and data sets for a lecture in Advanced Statistics using R. Especially the functions mancontr() and inspect() may be of general interest. With the former, it is possible to specify your own contrasts and give them useful names. Function inspect() shows a wide range of inspection plots to validate model assumptions. By Thomas Fabbro.

bayescount Bayesian analysis of count distributions with JAGS. A set of functions to apply a zero-inflated gamma Poisson (equivalent to a zero-inflated negative binomial), zero-inflated Poisson, gamma Poisson (negative binomial) or Poisson model to a set of count data using JAGS (Just Another Gibbs Sampler). Returns information on the possible values for mean count, overdispersion and zero inflation present in count data such as faecal egg count data. By Matthew Denwood.
benchden Full implementation of the 28 distributions introduced as benchmarks for nonparametric density estimation by Berlinet and Devroye (1994). Includes densities, cdfs, quantile functions and generators for samples. By Thoralf Mildenberger, Henrike Weinert, and Sebastian Tiemeyer.

biOps Basic image operations and image processing. Includes arithmetic, logic, look up table and geometric operations. Some image processing functions, for edge detection (several algorithms including Roberts, Sobel, Kirsch, Marr-Hildreth, Canny) and operations by convolution masks (with predefined as well as user defined masks), are provided. Supported file formats are jpeg and tiff. By Matias Bordese and Walter Alini.

binMto Asymptotic simultaneous confidence intervals for comparison of many treatments with one control, for the difference of binomial proportions; allows for Dunnett-like adjustment, Bonferroni or unadjusted intervals. Simulation of power of the above interval methods, approximate calculation of any-pair power, and sample size iteration based on approximate any-pair power. Exact conditional maximum test for many-to-one comparisons to a control. By Frank Schaarschmidt.

bio.infer Compute biological inferences. Imports benthic count data, reformats this data, and computes environmental inferences from this data. By Lester L. Yuan.

blockTools Block, randomly assign, and diagnose potential problems between units in randomized experiments. Blocks units into experimental blocks, with one unit per treatment condition, by creating a measure of multivariate distance between all possible pairs of units. Maximum, minimum, or an allowable range of differences between units on one variable can be set. Randomly assigns units to treatment conditions. Diagnoses potential interference problems between units assigned to different treatment conditions. Writes outputs to ‘.tex’ and ‘.csv’ files. By Ryan T. Moore.

bootStepAIC Model selection by bootstrapping the stepAIC() procedure. By Dimitris Rizopoulos.

brew A templating framework for mixing text and R code for report generation. Template syntax is similar to PHP, Ruby’s erb module, Java Server Pages, and Python’s psp module. By Jeffrey Horner.

ca Computation and visualization of simple, multiple and joint correspondence analysis. By Michael Greenacre and Oleg Nenadic.

catmap Case-control And Tdt Meta-Analysis Package. Conducts fixed-effects (inverse variance) and random-effects (DerSimonian and Laird, 1986) meta-analyses of case-control or family-based (TDT) genetic data; in addition, performs meta-analyses combining these two types of study designs. The fixed-effects model was first described by Kazeem and Farrell (2005); the random-effects model is described in Nicodemus (submitted for publication). By Kristin K. Nicodemus.

celsius Retrieve Affymetrix microarray measurements and metadata from Celsius web services, see http://genome.ucla.edu/projects/celsius. By Allen Day, Marc Carlson.

cghFLasso Spatial smoothing and hot spot detection using the fused lasso regression. By Robert Tibshirani and Pei Wang.

choplump Choplump tests: permutation tests for comparing two groups with some positive but many zero responses. By M. P. Fay.

clValid Statistical and biological validation of clustering results. By Guy Brock, Vasyl Pihur, Susmita Datta, and Somnath Datta.

classGraph Construct directed graphs of S4 class hierarchies and visualize them. Typically, these graphs are DAGs (directed acyclic graphs) in general, though often trees. By Martin Maechler, partly based on code from Robert Gentleman.

clinfun Clinical Trial Design and Data Analysis Functions. Utilities to make your clinical collaborations easier if not fun. By E. S. Venkatraman.

clusterfly Explore clustering interactively using R and GGobi. Contains both general code for visualizing clustering results and specific visualizations for model-based, hierarchical and SOM clustering. By Hadley Wickham.

clv Cluster Validation Techniques. Contains most of the popular internal and external cluster validation methods, ready to use for most of the outputs produced by functions from the package cluster. By Lukasz Nieweglowski.

codetools Code analysis tools for R. By Luke Tierney.

colorRamps Builds single and double gradient color maps. By Tim Keitt.

contrast Contrast methods, in the style of the Design package, for fit objects produced by the lm, glm, gls, and geese functions. By Steve Weston, Jed Wing and Max Kuhn.
coxphf Cox regression with Firth’s penalized likelihood. R by Meinhard Ploner, Fortran by Georg Heinze.

crosshybDetector Identification of probes potentially affected by cross-hybridizations in microarray experiments. Includes functions for diagnostic plots. By Paolo Uva.

ddesolve Solver for Delay Differential Equations, interfacing numerical routines written by Simon N. Wood, with contributions by Benjamin J. Cairns. These numerical routines first appeared in Simon Wood’s solv95 program. By Alex Couture-Beil, Jon T. Schnute, and Rowan Haigh.

desirability Desirability Function Optimization and Ranking. S3 classes for multivariate optimization using the desirability function by Derringer and Suich (1980). By Max Kuhn.

dplR Dendrochronology Program Library in R. Contains functions for performing some standard tree-ring analyses. By Andy Bunn.

dtt Discrete Trigonometric Transforms. Functions for 1D and 2D Discrete Cosine Transform (DCT), Discrete Sine Transform (DST) and Discrete Hartley Transform (DHT). By Lukasz Komsta.

earth Multivariate Adaptive Regression Spline Models. Build regression models using the techniques in Friedman’s papers “Fast MARS” and “Multivariate Adaptive Regression Splines”. (The term “MARS” is copyrighted and thus not used as the name of the package.) By Stephen Milborrow, derived from code in package mda by Trevor Hastie and Rob Tibshirani.

eigenmodel Semiparametric factor and regression models for symmetric relational data. Estimates the parameters of a model for symmetric relational data (e.g., the above-diagonal part of a square matrix) using a model-based eigenvalue decomposition and regression. Missing data is accommodated, and a posterior mean for missing data is calculated under the assumption that the data are missing at random. The marginal distribution of the relational data can be arbitrary, and is fit with an ordered probit specification. By Peter Hoff.

epibasix Elementary tools for the analysis of common epidemiological problems, ranging from sample size estimation, through 2 × 2 contingency table analysis and basic measures of agreement (kappa, sensitivity/specificity). Appropriate print and summary statements are also written to facilitate interpretation wherever possible. The target audience includes biostatisticians and epidemiologists who would like to apply standard epidemiological methods in a simple manner. By Michael A. Rotondi.

experiment Various statistical methods for designing and analyzing randomized experiments. One main functionality is the implementation of randomized-block and matched-pair designs based on possibly multivariate pre-treatment covariates. Also provides the tools to analyze various randomized experiments including cluster randomized experiments, randomized experiments with noncompliance, and randomized experiments with missing data. By Kosuke Imai.

fCopulae Rmetrics: Dependence Structures with Copulas. Environment for teaching “Financial Engineering and Computational Finance”. By Diethelm Wuertz and many others.

forensic Statistical Methods in Forensic Genetics. Statistical evaluation of DNA mixtures, DNA profile match probability. By Miriam Marusiakova.

fractal Insightful Fractal Time Series Modeling and Analysis. Software to accompany the book in development entitled “Fractal Time Series Analysis in S-PLUS and R” by William Constantine and Donald B. Percival, Springer. By William Constantine and Donald Percival.

fso Fuzzy Set Ordination: a multivariate analysis used in ecology to relate the composition of samples to possible explanatory variables. While differing in theory and method, in practice the use is similar to “constrained ordination”. Contains plotting and summary functions as well as the analyses. By David W. Roberts.

gWidgetstcltk Toolkit implementation of the gWidgets API for the tcltk package. By John Verzani.

gamlss.mx A GAMLSS add-on package for fitting mixture distributions. By Mikis Stasinopoulos and Bob Rigby.

gcl Computes a fuzzy rules classifier (Vinterbo, Kim & Ohno-Machado, 2005). By Staal A. Vinterbo.

geiger Analysis of evolutionary diversification. Features running macroevolutionary simulation and estimating parameters related to diversification from comparative phylogenetic data. By Luke Harmon, Jason Weir, Chad Brock, Rich Glor, and Wendell Challenger.

ggplot An implementation of the grammar of graphics in R. Combines the advantages of
both base and lattice graphics: conditioning and shared axes are handled automatically, and one can still build up a plot step by step from multiple data sources. Also implements a more sophisticated multidimensional conditioning system and a consistent interface to map data to aesthetic attributes. By Hadley Wickham.

ghyp Univariate and multivariate generalized hyperbolic distributions and their special cases (Hyperbolic, Normal Inverse Gaussian, Variance Gamma and skewed Student t distributions): fitting procedures, computation of the density, quantile, probability, random variates, expected shortfall, and some portfolio optimization and plotting routines. Also contains the generalized inverse Gaussian distribution. By Wolfgang Breymann and David Luethi.

glmmAK Generalized Linear Mixed Models. By Arnost Komarek.

granova Graphical Analysis of Variance. Provides distinctive graphics for display of ANOVA results. Functions were written to display data for any number of groups, regardless of their sizes (however, very large data sets or numbers of groups are likely to be problematic), using a specialized approach to construct data-based contrast vectors with respect to which ANOVA data are displayed. By Robert M. Pruzek and James E. Helmreich.

graph Implements some simple graph handling capabilities. By R. Gentleman, Elizabeth Whalen, W. Huber, and S. Falcon.

grasp Generalized Regression Analysis and Spatial Prediction for R. GRASP is a general method for making spatial predictions of response variables (RV) using point surveys of the RV and spatial coverages of predictor variables (PV). Originally, GRASP was developed to analyse, model and predict vegetation distribution over New Zealand. It has been used in all sorts of applications since then. By Anthony Lehmann, Fabien Fivaz, John Leathwick and Jake Overton, with contributions from many specialists from around the world.

hdeco Hierarchical DECOmposition of Entropy for Categorical Map Comparisons. A flexible and hierarchical framework for comparing categorical map composition and configuration (spatial pattern) along spatial, thematic, or external grouping variables. Comparisons are based on measures of mutual information between thematic classes (colors) and location (spatial partitioning). Results are returned in textual, tabular, and graphical forms. By Tarmo K. Remmel, Sandor Kabos, and Ferenc (Ferko) Csillag.

heatmap.plus Heatmaps with more sensible behavior. Allows the heatmap matrix to have non-identical x and y dimensions, and multiple tracks of annotation for RowSideColors and ColSideColors. By Allen Day.

hydrosanity Graphical user interface for exploring hydrological time series. Designed to work with catchment surface hydrology data (rainfall and streamflow), but could also be used with other time series data. Hydrological time series typically have many missing values and varying accuracy of measurement (indicated by data quality codes). Furthermore, the spatial coverage of data varies over time. Much of the functionality of this package attempts to understand these complex data sets, detect errors and characterize sources of uncertainty. Emphasis is on graphical methods. The GUI is based on rattle. By Felix Andrews.

identity Calculate identity coefficients, based on Mark Abney’s C code. By Na Li.

ifultools Insightful Research Tools. By William Constantine.

inetwork Network Analysis and Plotting. Implements a network partitioning algorithm to identify communities (or modules) in a network. A network plotting function then utilizes the identified community structure to position the vertices for plotting. Also contains functions to calculate the assortativity and transitivity of a vertex. By Sun-Chong Wang.

inline Functionality to dynamically define R functions and S4 methods with in-lined C, C++ or Fortran code, supporting .C and .Call calling conventions. By Oleg Sklyar, Duncan Murdoch, and Mike Smith.

irtoys A simple common interface to the estimation of item parameters in IRT models for binary responses with three different programs (ICL, BILOG-MG, and ltm), and a variety of functions useful with IRT models. By Ivailo Partchev.

kin.cohort Analysis of kin-cohort studies. Provides estimates of age-specific cumulative risk of a disease for carriers and noncarriers of a mutation. The cohorts are retrospectively built from relatives of probands for whom the genotype is known. Currently the method of moments and marginal maximum likelihood are implemented. Confidence intervals are calculated from bootstrap samples. Most of the code is
a translation from previous MATLAB code by Chatterjee. By Victor Moreno, Nilanjan Chatterjee, and Bhramar Mukherjee.

kzs Kolmogorov-Zurbenko Spline. A collection of functions utilizing splines to smooth a noisy data set in order to estimate its underlying signal. By Derek Cyr and Igor Zurbenko.

lancet.iraqmortality Surveys of Iraq mortality published in The Lancet. The Lancet has published two surveys on Iraq mortality before and after the US-led invasion. The package facilitates access to the data and a guided tour of some of their more interesting aspects. By David Kane, with contributions from Arjun Navi Narayan and Jeff Enos.

ljr Fits and tests logistic joinpoint models. By Michal Czajkowski, Ryan Gill, and Greg Rempala.

luca Likelihood Under Covariate Assumptions (LUCA). Likelihood inference in case-control studies of a rare disease under independence or simple dependence of genetic and non-genetic covariates. By Ji-Hyung Shin, Brad McNeney, and Jinko Graham.

meifly Interactive model exploration using GGobi. By Hadley Wickham.

mixPHM Mixtures of proportional hazard models. Fits multiple variable mixtures of various parametric proportional hazard models using the EM algorithm. Proportionality restrictions can be imposed on the latent groups and/or on the variables. Several survival distributions can be specified. Missing values are allowed. Independence is assumed over the single variables. By Patrick Mair and Marcus Hudec.

mlmmm Computational strategies for multivariate linear mixed-effects models with missing values (Schafer and Yucel, 2002). By Recai Yucel.

modeest Provides estimators of the mode of univariate unimodal data or univariate unimodal distributions. Also allows computing the Chernoff distribution. By Paul Poncet.

modehunt Multiscale Analysis for Density Functions. Given independent and identically distributed observations X(1), ..., X(n) from a density f, this package provides five methods to perform a multiscale analysis about f as well as the necessary critical values. The first method, introduced in Duembgen and Walther (2006), provides simultaneous confidence statements for the existence and location of local increases (or decreases) of f, based on all intervals I(all) spanned by any two observations X(j), X(k). The second method approximates the latter approach by using only a subset of I(all) and is therefore computationally much more efficient, but asymptotically equivalent. Omitting the additive correction term Gamma in either method offers another two approaches which are more powerful on small scales and less powerful on large scales, however not asymptotically minimax optimal anymore. Finally, the block procedure is a compromise between adding Gamma or not, having intermediate power properties. The latter is again asymptotically equivalent to the first and was introduced in Rufibach and Walther (2007). By Kaspar Rufibach and Guenther Walther.

monomvn Estimation of multivariate normal data of arbitrary dimension where the pattern of missing data is monotone. Through the use of parsimonious/shrinkage regressions (plsr, pcr, lasso, ridge, etc.), where standard regressions fail, the package can handle an (almost) arbitrary amount of missing data. The current version supports maximum likelihood inference. Future versions will provide a means of sampling from a Bayesian posterior. By Robert B. Gramacy.

mota Mean Optimal Transformation Approach for detecting nonlinear functional relations. Originally designed for the identifiability analysis of nonlinear dynamical models. However, the underlying concept is very general and allows one to detect groups of functionally related variables whenever there are multiple estimates for each variable. By Stefan Hengl.

nlstools Tools for assessing the quality of fit of a Gaussian nonlinear model. By Florent Baty and Marie-Laure Delignette-Muller, with contributions from Sandrine Charles, Jean-Pierre Flandrois.

oc OC Roll Call Analysis Software. Estimates Optimal Classification scores from roll call votes supplied through a rollcall object from package pscl. By Keith Poole, Jeffrey Lewis, James Lo and Royce Carroll.

oce Analysis of Oceanographic data. Supports CTD measurements, sea-level time series, coastline files, etc. Also includes functions for calculating seawater properties such as density, and derived properties such as buoyancy frequency. By Dan Kelley.

pairwiseCI Calculation of parametric and nonparametric confidence intervals for the difference or ratio of location parameters and for the difference, ratio and odds-ratio of binomial proportions for comparison of independent samples. CIs are not adjusted for multiplicity. A by
R News ISSN 1609-3631


Vol. 7/2, October 2007 70

statement allows calculation of CI separately proftools Profile output processing tools. By Luke
for the levels of one further factor. By Frank Tierney.
Schaarschmidt.
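As a brief illustration of the by argument just described (a minimal sketch: the data frame and factor names are invented for illustration, and default method settings are assumed from the package documentation):

  library(pairwiseCI)
  # Simulated example data: a response under three treatments at two centers
  dat <- data.frame(resp      = rnorm(60),
                    treatment = rep(c("A", "B", "C"), each = 20),
                    center    = rep(c("I", "II"), 30))
  # 95% CIs for all pairwise treatment differences, separately per center
  pairwiseCI(resp ~ treatment, data = dat, by = "center", conf.level = 0.95)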
paran An implementation of Horn's technique for evaluating the components retained in a principal components analysis (PCA). Horn's method contrasts eigenvalues produced through a PCA on a number of random data sets of uncorrelated variables with the same number of variables and observations as the experimental or observational data set to produce eigenvalues for components that are adjusted for the sample error-induced inflation. Components with adjusted eigenvalues greater than one are retained. The package may also be used to conduct parallel analysis following Glorfeld's (1995) suggestions to reduce the likelihood of over-retention. By Alexis Dinno.
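For example, a parallel analysis of a small numeric data set might look as follows (a sketch; the iterations and centile arguments are assumed from the package documentation, with centile = 95 corresponding to Glorfeld's more conservative criterion):

  library(paran)
  # Horn's parallel analysis of the four numeric variables in USArrests;
  # 5000 random data sets are generated to estimate the eigenvalue inflation
  paran(USArrests, iterations = 5000, centile = 95)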
pcse Estimation of panel-corrected standard errors. Data may contain balanced or unbalanced panels. By Delia Bailey and Jonathan N. Katz.

penalized L1 (lasso) and L2 (ridge) penalized estimation in Generalized Linear Models and in the Cox Proportional Hazards model. By Jelle Goeman.

plRasch Fit log linear by linear association models. By Zhushan Li & Feng Hong.

plink Separate Calibration Linking Methods. Uses unidimensional item response theory methods to compute linking constants and conduct chain linking of tests for multiple groups under a nonequivalent groups common item design. Allows for mean/mean, mean/sigma, Haebara, and Stocking-Lord calibrations of single-format or mixed-format dichotomous (1PL, 2PL, and 3PL) and polytomous (graded response, partial credit/generalized partial credit, nominal, and multiple-choice model) common items. By Jonathan Weeks.

plotAndPlayGTK A GUI for interactive plots using GTK+. When wrapped around plot calls, a window with the plot and a tool bar to interact with it pops up. By Felix Andrews, with contributions from Graham Williams.

pomp Inference methods for partially-observed Markov processes. By Aaron A. King, Ed Ionides, and Carles Breto.

poplab Population Lab: a tool for constructing a virtual electronic population of related individuals evolving over calendar time, by using vital statistics, such as mortality and fertility, and disease incidence rates. By Monica Leu, Kamila Czene, and Marie Reilly.

proftools Profile output processing tools. By Luke Tierney.

proj4 A simple interface to lat/long projection and datum transformation of the PROJ.4 cartographic projections library. Allows transformation of geographic coordinates from one projection and/or datum to another. By Simon Urbanek.

proxy Distance and similarity measures. An extensible framework for the efficient calculation of auto- and cross-proximities, along with implementations of the most popular ones. By David Meyer and Christian Buchta.
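A small sketch of the proxy framework (the measure names used here are assumed to be among those registered by the package):

  library(proxy)
  x <- matrix(rnorm(20), ncol = 4)
  dist(x, method = "Euclidean")   # auto-distances between the rows of x
  simil(x, method = "cosine")     # auto-similarities instead of distances
  summary(pr_DB)                  # registry of all available measures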
psych Routines for personality, psychometrics and experimental psychology. Functions are primarily for scale construction and reliability analysis, although others are basic descriptive stats. By William Revelle.

psyphy Functions that could be useful in analyzing data from psychophysical experiments, including functions for calculating d′ from several different experimental designs, links for mafc to be used with the binomial family in glm (and possibly other contexts) and self-Start functions for estimating gamma values for CRT screen calibrations. By Kenneth Knoblauch.
butions from Graham Williams. By Xianhong Xie.
pomp Inference methods for partially-observed rgr Geological Survey of Canada (GSC) functions
Markov processes. By Aaron A. King, Ed Ion- for exploratory data analysis with applied geo-
ides, and Carles Breto. chemical data, with special application to the
estimation of background ranges to support
poplab Population Lab: a tool for constructing a vir- both environmental studies and mineral explo-
tual electronic population of related individu- ration. By Robert G. Garrett.
als evolving over calendar time, by using vi-
tal statistics, such as mortality and fertility, and rindex Indexing for R. Index structures allow
disease incidence rates. By Monica Leu, Kamila quickly accessing elements from large collec-
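A minimal sketch of a typical quantmod session (assumes an internet connection; getSymbols retrieves data from Yahoo! Finance by default):

  library(quantmod)
  getSymbols("IBM")   # creates an object IBM holding the price series
  chartSeries(IBM)    # draw a financial chart of the series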
rateratio.test Exact rate ratio test. By Michael Fay.

regtest Functions for unary and binary regression tests. By Jens Oehlschlägel.

relations Data structures for k-ary relations with arbitrary domains, predicate functions, and fitters for consensus relations. By Kurt Hornik and David Meyer.

rgcvpack Interface to the GCVPACK Fortran package for thin plate spline fitting and prediction. By Xianhong Xie.

rgr Geological Survey of Canada (GSC) functions for exploratory data analysis with applied geochemical data, with special application to the estimation of background ranges to support both environmental studies and mineral exploration. By Robert G. Garrett.

rindex Indexing for R. Index structures allow quickly accessing elements from large collections. With btree optimized for disk databases and ttree for RAM databases, uses hybrid static indexing which is quite optimal for R. By Jens Oehlschlägel.
rjson JSON for R. Converts R objects into JSON objects and vice-versa. By Alex Couture-Beil.
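For example, a round trip between an R list and its JSON representation:

  library(rjson)
  json <- toJSON(list(pkg = "rjson", versions = c(1, 2)))
  json             # a JSON string
  fromJSON(json)   # back to an R list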
rsbml R support for SBML. Links R to libsbml for SBML parsing and output, provides an S4 SBML DOM, converts SBML to R graph objects, and more. By Michael Lawrence.

sapa Insightful Spectral Analysis for Physical Applications. Software for the book “Spectral Analysis for Physical Applications” by Donald B. Percival and Andrew T. Walden, Cambridge University Press, 1993. By William Constantine and Donald Percival.

scagnostics Calculates Tukey's scagnostics, which describe various measures of interest for pairs of variables, based on their appearance on a scatterplot. They are a useful tool for weeding out interesting or unusual scatterplots from a scatterplot matrix, without having to look at every individual plot. By Heike Hofmann, Lee Wilkinson, Hadley Wickham, Duncan Temple Lang, and Anushka Anand.

schoolmath Functions and data sets for math used in school. A main focus is set on prime calculation. By Joerg Schlarmann and Josef Wienand.

sdcMicro Statistical Disclosure Control methods for the generation of public- and scientific-use files. Data from statistical agencies and other institutions are mostly confidential. The package can be used for the generation of safe (micro)data, i.e., for the generation of public- and scientific-use files. By Matthias Templ.

seriation Infrastructure for seriation with an implementation of several seriation/sequencing techniques to reorder matrices, dissimilarity matrices, and dendrograms. Also contains some visualization techniques based on seriation. By Michael Hahsler, Christian Buchta and Kurt Hornik.

signalextraction Real-Time Signal Extraction (Direct Filter Approach). The Direct Filter Approach (DFA) provides efficient estimates of signals at the current boundary of time series in real-time. For that purpose, one-sided ARMA-filters are computed by minimizing customized error criteria. The DFA can be used for estimating either the level or turning-points of a series, knowing that both criteria are incongruent. In the context of real-time turning-point detection, various risk-profiles can be operationalized, which account for the speed and/or the reliability of the one-sided filter. By Marc Wildi & Marcel Dettling.

simba Functions for similarity calculation of binary data (for instance presence/absence species data). Also contains wrapper functions for reshaping species lists into matrices and vice versa and some other functions for further processing of similarity data (Mantel-like permutation procedures) as well as some other useful stuff. By Gerald Jurasinski.

simco A package to import Structure files and deduce similarity coefficients from them. By Owen Jones.

snpXpert Tools to analyze SNP data. By Eun-kyung Lee, Dankyu Yoon, and Taesung Park.

spam SPArse Matrix: functions for sparse matrix algebra (used by fields). Differences with SparseM and Matrix are: (1) support for only one sparse matrix format, (2) based on transparent and simple structure(s) and (3) S3 and S4 compatible. By Reinhard Furrer.

spatgraphs Graphs, graph visualization and graph based summaries to be used with spatial point pattern analysis. By Tuomas Rajala.

splus2R Insightful package providing missing S-PLUS functionality in R. Currently there are many functions in S-PLUS that are missing in R. To facilitate the conversion of S-PLUS modules and libraries to R packages, this package helps to provide missing S-PLUS functionality in R. By William Constantine.

sqldf Manipulate R data frames using SQL. By G. Grothendieck.
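For example, a small sketch of querying a data frame directly (the translation of the column name Sepal.Length to Sepal_Length is assumed, since dots are not valid in SQL identifiers):

  library(sqldf)
  # Average sepal length per species, computed by an SQL query on iris
  sqldf("SELECT Species, AVG(Sepal_Length) AS mean_sl
         FROM iris GROUP BY Species")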
sspline Smoothing splines on the sphere. By Xianhong Xie.

stochasticGEM Fitting Stochastic General Epidemic Models: Bayesian inference for partially observed stochastic epidemics. The general epidemic model is used for estimating the parameters governing the infectious and incubation period length, and the parameters governing susceptibility. In real-life epidemics the infection process is unobserved, and the data consist of the times individuals are detected, usually via appearance of symptoms. The package fits several variants of the general epidemic model, namely the stochastic SIR with Markovian and non-Markovian infectious periods. The estimation is based on a Markov chain Monte Carlo algorithm. By Eugene Zwane.

surveyNG An experimental revision of the survey package for complex survey samples (featuring a database interface and sparse matrices). By Thomas Lumley.
termstrc Term Structure and Credit Spread Estimation. Offers several widely-used term structure estimation procedures, i.e., the parametric Nelson and Siegel approach, the Svensson approach and cubic splines. By Robert Ferstl and Josef Hayden.

tframe Time Frame coding kernel. Functions for writing code that is independent of the way time is represented. By Paul Gilbert.

timsac TIMe Series Analysis and Control package. By The Institute of Statistical Mathematics.

trackObjs Track Objects. Stores objects in files on disk so that files are automatically rewritten when objects are changed, and so that objects are accessible but do not occupy memory until they are accessed. Also tracks times when objects are created and modified, and caches some basic characteristics of objects to allow for fast summaries of objects. By Tony Plate.

tradeCosts Post-trade analysis of transaction costs. By Aaron Schwartz and Luyi Zhao.

tripEstimation Metropolis sampler and supporting functions for estimating animal movement from archival tags and satellite fixes. By Michael Sumner and Simon Wotherspoon.

twslm A two-way semilinear model for normalization and analysis of cDNA microarray data. Huber's and Tukey's bisquare weight functions are available for robust estimation of the two-way semilinear models. By Deli Wang and Jian Huang.

vbmp Variational Bayesian multinomial probit regression with Gaussian process priors. By Nicola Lama and Mark Girolami.

vrtest Variance ratio tests for weak-form market efficiency. By Jae H. Kim.

waved The WaveD Transform in R. Makes available code necessary to reproduce figures and tables in recent papers on the WaveD method for wavelet deconvolution of noisy signals. By Marc Raimondo and Michael Stewart.

wikibooks Functions and datasets of the German WikiBook “GNU R”, which introduces R to new users. By Joerg Schlarmann.

wmtsa Insightful Wavelet Methods for Time Series Analysis. Software for the book “Wavelet Methods for Time Series Analysis” by Donald B. Percival and Andrew T. Walden, Cambridge University Press, 2000. By William Constantine and Donald Percival.

wordnet An interface to WordNet using the Jawbone Java API to WordNet. By Ingo Feinerer.

yest Model selection and variance estimation in Gaussian independence models. By Petr Simecek.

Other changes

• CRAN's Devel area is gone.

• Package write.snns was moved up from Devel.

• Package anm was resurrected from the Archive.

• Package Rcmdr.HH was renamed to RcmdrPlugin.HH.

• Package msgcop was renamed to sbgcop.

• Package grasper was moved to the Archive.

• Package tapiR was removed from CRAN.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

R Foundation News
by Kurt Hornik

Donations and new members

Donations

AT&T Research (USA)
Jaimison Fargo (USA)
Stavros Panidis (Greece)
Saxo Bank (Denmark)
Julian Stander (United Kingdom)

New benefactors

Shell Statistics and Chemometrics, Chester, UK

New supporting institutions

AT&T Labs, New Jersey, USA
BC Cancer Agency, Vancouver, Canada
Black Mesa Capital, Santa Fe, USA
Department of Statistics, University of Warwick, Coventry, UK
New supporting members

Michael Bojanowski (Netherlands)
Caoimhin ua Buachalla (Ireland)
Jake Bowers (UK)
Sandrah P. Eckel (USA)
Charles Fleming (USA)
Hui Huang (USA)
Jens Oehlschlägel (Germany)
Marc Pelath (UK)
Tony Plate (USA)
Julian Stander (UK)

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

News from the Bioconductor Project


by Hervé Pagès and Martin Morgan

Bioconductor 2.1 was released on October 8, 2007 and is designed to be compatible with R 2.6.0, released five days before Bioconductor. This release contains 23 new software packages, 97 new annotation (or metadata) packages, and many improvements and bug fixes to existing packages.

New packages

The 23 new software packages provide exciting analytic and interactive tools. Packages address Affymetrix array quality assessment (e.g., arrayQualityMetrics, AffyExpress), error assessment (e.g., OutlierD, MCRestimate), particular application domains (e.g., comparative genomic hybridization, CGHcall, SMAP; SNP arrays, oligoClasses, RLMM, VanillaICE; SAGE, sagenhaft; gene set enrichment, GSEABase; protein interaction, Rintact) and sophisticated modeling tools (e.g., timecourse, vbmp, maigesPack). exploRase uses the GTK toolkit to provide an integrated user interface for systems biology applications.

Several packages benefit from important infrastructure developments in R. Recent changes allow consolidation of C code common to several packages into a single location (preprocessCore), greatly simplifying code maintenance and improving reliability. Many new packages use the S4 class system. A number of these extend classes provided in Biobase, facilitating more seamless interoperability. The ability to access web resources, including convenient XML parsing, allows Bioconductor packages such as GSEABase to access important curated resources.

SQLite-based annotations

This release provides SQLite-based annotations in addition to the traditional environment-based ones. Annotations contain maps between information from microarray manufacturers, standard naming conventions (e.g., Entrez gene identifiers) and resources such as the Gene Ontology consortium. Eighty-six SQLite-based annotation packages are currently available. The names of these packages end with “.db” (e.g., hgu95av2.db). For an example of different metadata packages related to specific chips, view the annotations available for the hgu95av2 chip: http://bioconductor.org/packages/2.1/hgu95av2.html

New genome wide metadata packages provide a more complete set of maps, similar to those provided in the chip-based annotation packages. Genome wide annotations have an “org.” prefix in their name, and are available as SQLite-based packages only. Five organisms are currently supported: human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), fly (org.Dm.eg.db) and yeast (org.Sc.sgd.db). The *LLMappings packages will be deprecated in Bioconductor 2.2.

Environment-based (e.g., hgu95av2) and SQLite-based (e.g., hgu95av2.db) packages contain the same data. For the end user, moving from hgu95av2 to hgu95av2.db is transparent because the objects (or maps) in hgu95av2.db can be manipulated as if they were environments (i.e., functions ls, mget, get, etc. still work on them). Using SQLite allows considerable flexibility in querying maps and in performing complex joins between tables, in addition to placing the burden of memory management and optimized query construction in sqlite. As with the implementation of operations like ls and mget, the intention is to recognize common use cases and to code these so that R users do not need to know the underlying SQL query.
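As a small illustration of this environment-style access (a sketch assuming the hgu95av2.db package is installed; hgu95av2SYMBOL is one of its probe-to-gene-symbol maps):

  library(hgu95av2.db)
  ids <- ls(hgu95av2SYMBOL)[1:3]   # first few probe set identifiers
  mget(ids, hgu95av2SYMBOL)        # look up the corresponding gene symbols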
Looking ahead

For the next release (BioC 2.2, April 2008) all our annotations will be SQLite-based and we will deprecate the environment-based versions.

We anticipate increasing emphasis on sequence-based technologies like Solexa (http://www.illumina.com) and 454 (http://www.454.com). The volume of data these technologies generate is very large (a three day Solexa run produces almost a terabyte of raw data, with tens of gigabytes appropriate for numerical analysis). This volume of data poses significant challenges, but graphical, statistical, and interactive abilities and web-based integration make R a strong candidate for making sense of, and making fundamental new contributions to, understanding and critically assessing this data.

The best Bioconductor packages are contributed by our users, based on their practical needs and sophisticated experience. We look forward to receiving your contributions over the next several months.

Hervé Pagès
Computational Biology Program
Fred Hutchinson Cancer Research Center
hpages@fhcrc.org

Martin Morgan
Computational Biology Program
Fred Hutchinson Cancer Research Center
mtmorgan@fhcrc.org

Past Events: useR! 2007


by Duncan Murdoch and Martin Maechler

The first North American useR! meeting took place over three hot days at Iowa State University in Ames this past August.

The program started with John Chambers talking about his philosophy of programming: our mission is to enable effective and rapid exploration of data, and the prime directive is to provide trustworthy software and “tell no lies”. This was followed by a varied and interesting program of presentations and posters, many of which are now online at http://useR2007.org. A panel on the use of R in clinical trials may be noteworthy because of the related publishing by the R Foundation of a document (http://www.r-project.org/certification.html) on “Regulatory Compliance and Validation” issues. In particular, “21 CFR part 11” compliance is very important in the pharmaceutical industry.

The meeting marked the tenth anniversary of the formation of the R Core group, and an enormous blue R birthday cake was baked for the occasion (Figure 1). Six of the current R Core members were present for the cutting, and Thomas Lumley gave a short after-dinner speech at the banquet.

Figure 1: R Core turns ten.

There were two programming competitions. The first requested submissions in advance, to produce a package useful for large data sets. This was won by the team of Daniel Adler, Oleg Nenadić, Walter Zucchini and Christian Gläser from Göttingen. They wrote the ff package to use paged memory-mapping of binary files to handle very large datasets. Daniel, Oleg and Christian attended the meeting and presented their package.

The second competition was a series of short R programming tasks to be completed within a time limit at the conference. The tasks included relabelling, working with ragged longitudinal data, and writing functions on functions. There was a tie for first place between Elaine McVey and Olivia Lau.

Three kinds of T-shirts with variations on the “R” theme were available: the official conference T-shirt (also worn by the local staff) sold out within a day, so the two publishers' free T-shirts gained even more attraction.

Local arrangements for the meeting were handled by Di Cook, Heike Hofmann, Michael Lawrence, Hadley Wickham, Denise Riker, Beth Hagenman and Karen Koppenhaver at Iowa State; the Program Committee also included Doug Bates, Dave Henderson, Olivia Lau, and Luke Tierney. Thanks are due to all of them, and to the team of Iowa State students who made numerous trips to the Des Moines airport carrying meeting participants back and forth, and who kept the equipment and facilities running smoothly. useR! 2007 was an excellent meeting.

Duncan Murdoch, University of Western Ontario
Duncan.Murdoch@R-project.org
Martin Maechler, ETH Zürich
Martin.Maechler@R-project.org


Forthcoming Events: useR! 2008


The international R user conference ‘useR! 2008’ will take place at the Universität Dortmund, Dortmund, Germany, August 12-14, 2008.

This world-meeting of the R user community will focus on

• R as the ‘lingua franca’ of data analysis and statistical computing,
• providing a platform for R users to discuss and exchange ideas on how R can be used to do statistical computations, data analysis, visualization and exciting applications in various fields,
• giving an overview of the new features of the rapidly evolving R project.

The program comprises invited lectures, user-contributed sessions and pre-conference tutorials.

Invited Lectures

R has become the standard computing engine in more and more disciplines, both in academia and the business world. How R is used in different areas will be presented in invited lectures addressing hot topics. Speakers will include Peter Bühlmann, John Fox, Andrew Gelman, Kurt Hornik, Gary King, Duncan Murdoch, Jean Thioulouse, Graham J. Williams, and Keith Worsley.

User-contributed Sessions

The sessions will be a platform to bring together R users, contributors, package maintainers and developers in the S spirit that ‘users are developers’. People from different fields will show us how they solve problems with R in fascinating applications. The scientific program is organized by members of the program committee, including Micah Altman, Roger Bivand, Peter Dalgaard, Jan de Leeuw, Ramón Díaz-Uriarte, Spencer Graves, Leonhard Held, Torsten Hothorn, François Husson, Christian Kleiber, Friedrich Leisch, Andy Liaw, Martin Mächler, Kate Mullen, Ei-ji Nakama, Thomas Petzold, Martin Theus, and Heather Turner. The program will cover topics of current interest such as

• Applied Statistics & Biostatistics
• Bayesian Statistics
• Bioinformatics
• Chemometrics and Computational Physics
• Data Mining
• Econometrics & Finance
• Environmetrics & Ecological Modeling
• High Performance Computing
• Machine Learning
• Marketing & Business Analytics
• Psychometrics
• Robust Statistics
• Sensometrics
• Spatial Statistics
• Statistics in the Social and Political Sciences
• Teaching
• Visualization & Graphics
• and many more.

Call for pre-conference Tutorials

Before the start of the official program, half-day tutorials will be offered on Monday, August 11.

We invite R users to submit proposals for three-hour tutorials on special topics on R. The proposals should give a brief description of the tutorial, including goals, detailed outline, justification why the tutorial is important, background knowledge required and potential attendees. The proposals should be sent before 2007-10-31 to useR-2008@R-project.org.

Call for Papers

We invite all R users to submit abstracts on topics presenting innovations or exciting applications of R. A web page offering more information on ‘useR! 2008’ is available at:

http://www.R-project.org/useR-2008/

Abstract submission and registration will start in December 2007.

We hope to meet you in Dortmund!

The organizing committee:
Uwe Ligges, Achim Zeileis, Claus Weihs, Gerd Kopp, Friedrich Leisch, and Torsten Hothorn
useR-2008@R-project.org


Forthcoming Events: R Courses in Munich


by Friedrich Leisch

The department of statistics at Ludwig–Maximilians–Universität München, Germany, is offering a range of R courses to practitioners from industry, universities and all others interested in using R for data analysis, starting with R basics (November 8–9, by Torsten Hothorn and Friedrich Leisch) followed by R programming (December 12–13, by Torsten Hothorn and Friedrich Leisch), machine learning (January 23–24, by Torsten Hothorn, Friedrich Leisch and Carolin Strobl), econometrics (spring 2008, by Achim Zeileis), and generalized regression (spring 2008, by Thomas Kneib).

The regular courses are taught in German; more information is available from http://www.statistik.lmu.de/R/.

Friedrich Leisch
Ludwig-Maximilians-Universität München, Germany
Friedrich.Leisch@R-project.org


Editor-in-Chief:
Torsten Hothorn
Institut für Statistik
Ludwig–Maximilians–Universität München
Ludwigstraße 33, D-80539 München
Germany

Editorial Board: John Fox and Vincent Carey.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor and all other submissions to the editor-in-chief or another member of the editorial board. More detailed submission instructions can be found on the R homepage.

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
