Editorial

by Torsten Hothorn

Welcome to the October 2007 issue of R News, the second issue for this year! Last week, R 2.6.0 was released with many new useful features: dev2bitmap allows semi-transparent colours to be used, plotmath offers access to all characters in the Adobe Symbol encoding, and three higher-order functions Reduce, Filter, and Map are now available. In addition, the startup time has been substantially reduced when starting without the methods package. All the details about changes in R 2.6.0 are given on pages 54 ff.

The Comprehensive R Archive Network continues to reflect a very active R user and developer community. Since April, 187 new packages have been uploaded, ranging from ADaCGH, providing facilities to deal with array CGH data, to yest for model selection and variance estimation in Gaussian independence models. Kurt Hornik presents this impressive list starting on page 61.

The first North American R user conference took place in Ames, Iowa, last August, and many colleagues and friends from all over the world enjoyed a well-organized and interesting conference. Duncan Murdoch and Martin Maechler report on useR! 2007 on page 74. As I'm writing this, Uwe Ligges and his team at Dortmund university are working hard to organize next year's useR! 2008; Uwe's invitation to Dortmund, Germany, appears on the last page of this issue.

Three of the nine papers contained in this issue were presented at useR! 2006 in Vienna: Peter Dalgaard reports on new functions for multivariate analysis, Heather Turner and David Firth describe their new package gnm for fitting generalized nonlinear models, and Olivia Lau and colleagues use ecological inference models to infer individual-level behavior from aggregate data.

Further topics are the analysis of functional magnetic resonance imaging and environmental data, matching for observational studies, and machine learning methods for survival analysis as well as for categorical and continuous data. In addition, an application to create web interfaces for R scripts is described, and Fritz Leisch completes this issue with a review of Michael J. Crawley's The R Book.

Torsten Hothorn
Ludwig–Maximilians–Universität München, Germany
Torsten.Hothorn@R-project.org
Contents of this issue:

Editorial . . . 1
New Functions for Multivariate Analysis . . . 2
gnm: A Package for Generalized Nonlinear Models . . . 8
fmri: A Package for Analyzing fmri Data . . . 13
Optmatch: Flexible, Optimal Matching for Observational Studies . . . 18
Random Survival Forests for R . . . 25
Rwui: A Web Application to Create User Friendly Web Interfaces for R Scripts . . . 32
The np Package: Kernel Methods for Categorical and Continuous Data . . . 35
eiPack: R × C Ecological Inference and Higher-Dimension Data Management . . . 43
The ade4 Package — II: Two-table and K-table Methods . . . 47
Review of "The R Book" . . . 53
Changes in R 2.6.0 . . . 54
Changes on CRAN . . . 61
R Foundation News . . . 72
News from the Bioconductor Project . . . 73
Past Events: useR! 2007 . . . 74
Forthcoming Events: useR! 2008 . . . 75
Forthcoming Events: R Courses in Munich . . . 76
Vol. 7/2, October 2007 2
within-subjects contrasts, rather than of the observations themselves.

If we assume a covariance structure with random effects both between and within subject, the contrasts will cancel out the between-subject variation. They will effectively behave as if the between-subject term wasn't there and satisfy a sphericity condition.

Hence we consider the transformed response YT′ (which has rows Tyi). In many cases T is chosen to annihilate another matrix X, i.e. it satisfies TX = 0; by rotation invariance, different such choices of T will lead to the same inference as long as they have maximal rank. In such cases it may well be more convenient to specify X rather than T.

For profile analysis, X could be a p-vector of ones. In that case, the hypothesis of compound symmetry implies sphericity of TΣT′ w.r.t. TT′ (but sphericity does not imply compound symmetry), and the hypothesis E YT′ = 0 is equivalent to a flat (constant level) profile.

More elaborate within-subject designs exist. As an example, consider a complete two-way layout, in which case we might want to look at the sets of (a) contrasts between row means, (b) contrasts between column means, and (c) interaction contrasts.

These contrasts connect to the theory of balanced mixed-effects analysis of variance. A model in which "all interactions with subject are considered random", i.e. random effects of subjects, as well as random interaction between subjects and rows and between subjects and columns, implies sphericity of each of the above sets of contrasts. The proportionality constants will be different linear combinations of the random effect variances.

A common pattern for such sets of contrasts is as follows: Define two subspaces of R^p, given by matrices X and M, such that span(X) ⊂ span(M). The transformation is calculated as T = PM − PX, where P denotes orthogonal projection. Actually, T defined like this would be singular, and therefore T is thinned by deletion of linearly dependent rows (it can fairly easily be seen that it does not matter exactly how this is done). Putting M = I (the identity matrix) will make T satisfy TX = 0, which is consistent with the situation described earlier.

The choice of X and M is most easily done by viewing them as design matrices for linear models in p-dimensional space. Consider again a two-way intrasubject design. To make T a transformation which calculates contrasts between column means, choose M to describe a model with different means in each column and X to describe a single mean for all data. Preferably, but equivalently for a balanced design, let M describe an additive model in rows and columns and let X be a model in which there is only a row effect.

There is a hidden assumption involved in the choice of the orthogonal projection onto span(M). This calculates (for each subject) an ordinary least squares (OLS) fit, and for some covariance structures a weighted fit may be more appropriate. However, OLS is not actually biased, and the efficiency that we might gain is offset by the need to estimate a rather large number of parameters to find the optimal weights.

The epsilons

Assume that the conditions for a balanced analysis of variance are roughly valid. We may decide to compute, say, column means for each subject and test whether the contrasts are zero on average. There could be a loss of efficiency in the use of unweighted means, but the critical issue for an F test is whether the sphericity condition holds for the contrasts between column means.

If the sphericity condition is not valid, then the F test is in principle wrong. Instead, we could use one of the multivariate tests, but they will often have low power due to the estimation of a large number of parameters in the empirical covariance matrix.

For this reason, a methodology has evolved in which "near-sphericity" data are analyzed by F tests, but applying the so-called epsilon corrections to the degrees of freedom. The theory originates in Box (1954a), in which it is shown that F is approximately distributed as F(ε f1, ε f2), where

    ε = (∑ λi / p)² / (∑ λi² / p)

and the λi are the eigenvalues of the true covariance matrix (after contrast transformation, and with respect to a similarly transformed identity matrix). One may notice that 1/p ≤ ε ≤ 1; the upper limit corresponds to sphericity (all eigenvalues equal) and the lower limit corresponds to the case where there is one dominating eigenvalue, so that data are effectively one-dimensional. Box (1954b) details the result as a function of the elements of the covariance matrix in the two-way layout.

The Box correction requires knowledge of the true covariance matrix. Greenhouse and Geisser (1959) suggest the use of εGG, which is simply the empirical version of ε, inserting the empirical covariance matrix for the true one. However, it is easily seen that this estimator is biased, at least when the true epsilon is close to 1, since εGG < 1 almost surely when p > 1.

The Huynh-Feldt correction (Huynh and Feldt, 1976) is

    εHF = ((f + 1) p εGG − 2) / (p (f − p εGG))

where f is the number of degrees of freedom for the empirical covariance matrix. (The original paper has N instead of f + 1 in the numerator for the split-plot design. A correction was published 15 years later
(Lecoutre, 1991), but another 15 years later, this error is still present in SAS and SPSS.)

Notice that εHF can be larger than one, in which case you should use the uncorrected F test. Also, εHF is obtained by bias-correcting the numerator and denominator of εGG, which is not guaranteed to be helpful, and simulation studies in (Huynh and Feldt, 1976) indicate that it actually makes the approximations worse when εGG < 0.7 or so.

Implementation in R

Basics

Quite a lot of the calculations could be copied from the existing manova and summary.manova code for balanced designs. Also, objects of class mlm, inheriting from lm, were already defined as the output from lm(Y~....) when Y is a matrix. It was necessary to add the following new functions:

• SSD creates an object of (S3) class SSD which is the sums of squares and products matrix augmented by degrees of freedom and information about the call. This is a generic function with a method for mlm objects.

• estVar calculates the estimated variance-covariance matrix. It has methods for SSD and mlm objects. In the former case, it just normalizes the SSD by the degrees of freedom, and in the latter case it calculates SSD(object) and then applies the SSD method.

• anova.mlm is used to compare two multivariate linear models or partition a single model in the usual cumulative way. The various multivariate tests can be specified using test="Pillai", etc., just as in summary.manova. In addition, it is possible to specify test="Spherical" which gives the F test under assumption of sphericity.

• mauchly.test tests for sphericity of a covariance matrix with respect to another.

• sphericity calculates the εGG and εHF based on the SSD matrix and its degrees of freedom. This function is private to the stats namespace (at least currently, as of R version 2.6.0) and used to generate the headings and adjusted p-values in anova.mlm.

Representing transformations

As previously described, sphericity tests and tests of linear hypotheses may make sense only for a transformed response. Hence we need to be able to specify transformations in anova.mlm and mauchly.test. The code has several ways to deal with this:

• The transformation matrix can be given directly using the argument T.

• It is possible to specify the arguments X and M as the matrices described previously. The defaults are set so that the default transformation is the identity.

• It is also possible to specify X and/or M using model formulas. In that case, they usually need to refer to a data frame which describes the intra-subject design. This can be given in the idata argument. The default for idata is a data frame consisting of the single variable index=1:p which may be used to specify, e.g., a polynomial response for equispaced data.

Example

These data from the book by Maxwell and Delaney (1990) are also used by Baron and Li (2006). They show reaction times where ten subjects respond to stimuli in the absence and presence of ambient noise, and using stimuli tilted at three different angles.

> reacttime <- matrix(c(
+ 420, 420, 480, 480, 600, 780,
+ 420, 480, 480, 360, 480, 600,
+ 480, 480, 540, 660, 780, 780,
+ 420, 540, 540, 480, 780, 900,
+ 540, 660, 540, 480, 660, 720,
+ 360, 420, 360, 360, 480, 540,
+ 480, 480, 600, 540, 720, 840,
+ 480, 600, 660, 540, 720, 900,
+ 540, 600, 540, 480, 720, 780,
+ 480, 420, 540, 540, 660, 780),
+ ncol = 6, byrow = TRUE,
+ dimnames = list(subj = 1:10,
+   cond = c("deg0NA", "deg4NA", "deg8NA",
+            "deg0NP", "deg4NP", "deg8NP")))

The same data are used by example(estVar) and example(anova.mlm), so you can load the reacttime matrix just by running the examples. The following is mainly an expanded explanation of those examples.

First let us calculate the estimated covariance matrix:

> mlmfit <- lm(reacttime~1)
> estVar(mlmfit)
        cond
cond     deg0NA deg4NA deg8NA deg0NP deg4NP deg8NP
  deg0NA   3240   3400   2960   2640   3600   2840
  deg4NA   3400   7400   3600    800   4000   3400
  deg8NA   2960   3600   6240   4560   6400   7760
  deg0NP   2640    800   4560   7840   8000   7040
  deg4NP   3600   4000   6400   8000  12000  11200
  deg8NP   2840   3400   7760   7040  11200  13640
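For an intercept-only mlm fit like this, estVar() reduces to the ordinary sample covariance matrix (the SSD divided by its n − 1 degrees of freedom). A minimal sketch, using a small made-up response matrix in the spirit of reacttime rather than the full data:

```r
# For Y ~ 1, the residuals are deviations from the column means, so
# estVar(fit) = crossprod(residuals)/(n - 1) = cov(Y).
Y <- matrix(c(420, 420, 480,
              420, 480, 480,
              480, 480, 540,
              420, 540, 540), ncol = 3, byrow = TRUE)
fit <- lm(Y ~ 1)  # an "mlm" object, since Y is a matrix
all.equal(estVar(fit), cov(Y), check.attributes = FALSE)
```

This is exactly the sense in which the SSD method "normalizes the SSD by the degrees of freedom".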
the projection matrix for a given design matrix, i.e. PX = X(X′X)⁻¹X′. In the above context, we have

M <- model.matrix(~deg+noise, data=idata)
P1 <- stats:::proj.matrix(M)
X <- model.matrix(~noise, data=idata)
P2 <- stats:::proj.matrix(X)

The two design matrices are (eliminating attributes)

> M
  (Intercept) deg4 deg8 noiseP
1           1    0    0      0
2           1    1    0      0
3           1    0    1      0
4           1    0    0      1
5           1    1    0      1
6           1    0    1      1
> X
  (Intercept) noiseP
1           1      0
2           1      0
3           1      0
4           1      1
5           1      1
6           1      1

For printing the projection matrices, it is useful to include the fractions function from MASS.

> library("MASS")
> fractions(P1)
     1    2    3    4    5    6
1  2/3  1/6  1/6  1/3 -1/6 -1/6
2  1/6  2/3  1/6 -1/6  1/3 -1/6
3  1/6  1/6  2/3 -1/6 -1/6  1/3
4  1/3 -1/6 -1/6  2/3  1/6  1/6
5 -1/6  1/3 -1/6  1/6  2/3  1/6
6 -1/6 -1/6  1/3  1/6  1/6  2/3
> fractions(P2)
    1   2   3   4   5   6
1 1/3 1/3 1/3   0   0   0
2 1/3 1/3 1/3   0   0   0
3 1/3 1/3 1/3   0   0   0
4   0   0   0 1/3 1/3 1/3
5   0   0   0 1/3 1/3 1/3
6   0   0   0 1/3 1/3 1/3

Here, P2 is readily recognized as the operator that replaces each of the first three values by their average, and similarly the last three. The other one, P1, is a little harder, but if you look long enough, you will recognize the formula x̂ij = x̄i· + x̄·j − x̄·· from two-way ANOVA.

The transformation T is the difference between P1 and P2:

> fractions(P1-P2)
     1    2    3    4    5    6
1  1/3 -1/6 -1/6  1/3 -1/6 -1/6
2 -1/6  1/3 -1/6 -1/6  1/3 -1/6
3 -1/6 -1/6  1/3 -1/6 -1/6  1/3
4  1/3 -1/6 -1/6  1/3 -1/6 -1/6
5 -1/6  1/3 -1/6 -1/6  1/3 -1/6
6 -1/6 -1/6  1/3 -1/6 -1/6  1/3

This is the representation of x̄i· − x̄·· where i and j index levels of deg and noise, respectively. The matrix has (row) rank 2; for the actual computations, only the first two rows are used.

The important aspect of this matrix is that it assigns equal weights to observations at different levels of noise and that it constructs a maximal set of within-deg differences. Any other such choice of transformation leads to equivalent results, since it will be a full-rank transformation of T. For instance:

> T1 <- rbind(c(1,-1,0,1,-1,0),c(1,0,-1,1,0,-1))
> anova(mlmfit, mlmfit0, T=T1,
+       idata = idata, test = "Spherical")
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrast matrix
1 -1  0 1 -1  0
1  0 -1 1  0 -1

Greenhouse-Geisser epsilon: 0.9616
Huynh-Feldt epsilon: 1.2176

  Res.Df Df Gen.var.      F num Df den Df    Pr(>F)    G-G Pr    H-F Pr
1      9       12084
2     10  1    32438 40.719      2     18 2.087e-07 3.402e-07 2.087e-07

This differs from the previous analysis only in the Gen.var. column, which is not invariant to scaling and transformation of data.

Returning to the interpretation of the results, notice that the epsilons are much closer to one than before. This is consistent with a covariance structure induced by a mixed model containing a random interaction between subject and deg. If we perform the similar procedure for noise we get epsilons of exactly 1, because we are dealing with a single-df contrast.

> anova(mlmfit, mlmfit0, M = ~ deg + noise,
+       X = ~ deg,
+       idata = idata, test = "Spherical")
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrasts orthogonal to
~deg

Contrasts spanned by
~deg + noise

Greenhouse-Geisser epsilon: 1
Huynh-Feldt epsilon: 1

  Res.Df Df Gen.var. F num Df
1      9     1410
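The epsilons reported in these tables can be sketched directly from their definitions (hypothetical helpers, not the internal stats:::sphericity): Box's ε evaluated at a covariance matrix, and the Huynh-Feldt correction applied to εGG.

```r
# Box's epsilon from the eigenvalues of the (contrast-transformed)
# covariance matrix: eps = (mean of lambda)^2 / (mean of lambda^2).
boxEps <- function(S) {
  lam <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
  mean(lam)^2 / mean(lam^2)
}
# Huynh-Feldt correction, given epsilon-GG, p and f (df of the SSD matrix).
epsHF <- function(eGG, p, f) ((f + 1) * p * eGG - 2) / (p * (f - p * eGG))

boxEps(diag(4))              # sphericity (all eigenvalues equal): 1
boxEps(diag(c(1, 0, 0, 0)))  # one dominating eigenvalue: 1/p = 0.25
epsHF(0.9616, p = 2, f = 9)  # approx. 1.2175, cf. the first output above
```

With ten subjects, f = 9 and p = 2 after the contrast transformation, which recovers (up to rounding of εGG) the Huynh-Feldt epsilon of 1.2176 printed by anova.mlm.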
> f <- factor(rep(1:2, 5))
> mlmfit2 <- update(mlmfit, ~f)
> anova(mlmfit2, X = ~1, test = "Spherical")
Analysis of Variance Table

H. Huynh and L. S. Feldt. Estimation of the Box correction for degrees of freedom from sample data in randomized block and split-plot designs. Journal of Educational Statistics, 1(1):69–82, 1976.
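The projection operators P1 and P2 used in the example above rely on the private helper stats:::proj.matrix; the underlying hat-matrix formula PX = X(X′X)⁻¹X′ is easy to sketch independently (the proj() helper here is hypothetical, shown only to reproduce the block-averaging pattern of P2):

```r
# Orthogonal projection onto the column span of a design matrix.
proj <- function(X) X %*% solve(crossprod(X), t(X))

# Intra-subject design: 3 angles x 2 noise conditions, as in reacttime.
idata <- data.frame(deg = gl(3, 1, 6), noise = gl(2, 3))
P2 <- proj(model.matrix(~ noise, data = idata))
round(P2, 3)  # averages within each noise block (entries 1/3 and 0)
```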
over-parameterized, care is needed when interpreting the output of some of these functions. In particular, for parameters that are not identified, an arbitrary parameterization will be returned and the standard error will be displayed as NA.

For estimable parameters, methods for profile and confint enable inference based on the profile likelihood. These methods are designed to handle the asymmetric behaviour of the log-likelihood function that is a well-known feature of many nonlinear models.

Estimates and standard errors of simple "sum-to-zero" contrasts can be obtained using getContrasts, if such contrasts are estimable. For more general linear combinations of parameters, the lower-level se function may be used instead.

Example

We illustrate here the use of some of the key functions in gnm through the analysis of a contingency table from Goodman (1979). The data are a cross-classification of a sample of British males according to each subject's occupational status and his father's occupational status. The contingency table is distributed with gnm as the data set occupationalStatus.

Goodman (1979) analysed these data using row-column association models, in which the interaction between the row and column factors is represented by components of a multiplicative interaction. The log-multiplicative form of the RC(1) model — the row-column association model with one component of the multiplicative interaction — is given by

    log µrc = αr + βc + γr δc ,    (2)

where µrc is the cell mean for row r and column c. Before modelling the occupational status data, Goodman (1979) first deleted cells on the main diagonal. Equivalently we can separate out the diagonal effects as follows:

    log µrc = αr + βc + θrc + γr δc    (3)

where

    θrc = φr if r = c, and 0 otherwise.

In addition to the "nonlin" functions provided by gnm, the package includes a number of functions for specifying structured linear interactions. The diagonal effects in model 3 can be specified using the gnm function Diag. Thus we can fit model 3 as follows:

> set.seed(1)
> RC <- gnm(Freq ~ origin + destination +
+     Diag(origin, destination) +
+     Mult(origin, destination),
+     family = poisson,
+     data = occupationalStatus)

An abbreviated version of the output given by summary(RC) is shown in Figure 2. Default constraints (as specified by options("contrasts")) are applied to linear terms: in this case the first level of the origin and destination factors have been set to zero. However the "main effects" are not estimable, since they are aliased with the unconstrained multiplicative interaction. If the gnm call were re-evaluated from a different random seed, the parameterization for the main effects and multiplicative interaction would differ from that shown in Figure 2. On the other hand the diagonal effects, which are essentially contrasts with the off-diagonal elements, are identified here. This is immediately apparent from Figure 2, since standard errors are available for these parameters.

One could consider extending the model by adding a further component of the multiplicative interaction, giving an RC(2) model:

    log µrc = αr + βc + θrc + γr δc + θr φc .    (4)

Most of the "nonlin" functions, including Mult, have an inst argument to allow the specification of multiple instances of a term, as here:

Freq ~ origin + destination +
    Diag(origin, destination) +
    Mult(origin, destination, inst = 1) +
    Mult(origin, destination, inst = 2)

Multiple instances of a term with an inst argument may be specified in shorthand using the instances function provided in the package:

Freq ~ origin + destination +
    Diag(origin, destination) +
    instances(Mult(origin, destination), 2)

The formula is then expanded by gnm before the model is fitted.

In the case of the occupational status data however, the RC(1) model is a good fit (a residual deviance of 29.149 on 28 degrees of freedom). So rather than extending the model, we shall see if it can be simplified by using homogeneous scores in the multiplicative interaction:

    log µrc = αr + βc + θrc + γr γc    (5)

This model can be obtained by updating RC using update:

> RChomog <- update(RC, . ~ .
+     - Mult(origin, destination)
+     + MultHomog(origin, destination))

We can compare the two models using anova, which shows that there is little gained by allowing heterogeneous row and column scores in the association of the fathers' occupational status and the sons' occupational status:
Call:
gnm(formula = Freq ~ origin + destination + Diag(origin, destination) +
Mult(origin, destination), family = poisson, data = occupationalStatus)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.11881 NA NA NA
origin2 0.49005 NA NA NA
...
origin8 1.60214 NA NA NA
destination2 0.95334 NA NA NA
...
destination8 1.42813 NA NA NA
Diag(origin, destination)1 1.47923 0.45401 3.258 0.00112
...
Diag(origin, destination)8 0.40731 0.21930 1.857 0.06327
Mult(., destination).origin1 -1.74022 NA NA NA
...
Mult(., destination).origin8 1.65900 NA NA NA
Mult(origin, .).destination1 -1.32971 NA NA NA
...
Mult(origin, .).destination8 0.66730 NA NA NA
...
Residual deviance: 29.149 on 28 degrees of freedom
...
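The Diag(origin, destination) coefficients in Figure 2 correspond to the θrc term of model 3: one parameter per diagonal cell, zero off the diagonal. A small sketch of such a diagonal-effects factor (illustrative only; gnm's Diag() is the real interface):

```r
# theta_rc = phi_r if r == c, 0 otherwise: label each diagonal cell by
# its level and pool all off-diagonal cells into one reference level.
origin      <- gl(3, 3, labels = 1:3)     # row factor r
destination <- gl(3, 1, 9, labels = 1:3)  # column factor c
diagfac <- factor(ifelse(origin == destination,
                         as.character(origin), "off"))
levels(diagfac)  # "1" "2" "3" "off"
```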
> anova(RChomog, RC)
Analysis of Deviance Table

Model 1: Freq ~ origin + destination +
    Diag(origin, destination) +
    MultHomog(origin, destination)
Model 2: Freq ~ origin + destination +
    Diag(origin, destination) +
    Mult(origin, destination)
  Resid. Df Resid. Dev Df Deviance
1        34     32.561
2        28     29.149  6    3.412

All of the parameters in model 5 can be made identifiable by constraining one level of the homogeneous multiplicative factor to zero. This can be achieved by using the constrain argument to gnm, but this would require re-fitting the model. As we are only really interested in the parameters of the multiplicative interaction, we investigate simple contrasts of these parameters using getContrasts. The gnm function pickCoef enables us to select the parameters of interest using a regular expression to match against the parameter names:

> contr <- getContrasts(RChomog,
+     pickCoef(RChomog, "MultHomog"))

A summary of contr is shown in Figure 3. By default, getContrasts sets the first level of the homogeneous multiplicative factor to zero. The quasi standard errors and quasi variances are independent of parameterization; see Firth (2003) and Firth and de Menezes (2004) for more detail. In this example the reported relative-error diagnostics indicate that the quasi-variance approximation is rather inaccurate — in contrast to the high accuracy found in many other situations by Firth and de Menezes (2004).

Users may run through the example covered in this section by calling demo(gnm).

Custom "nonlin" functions

The small number of "nonlin" functions currently distributed with gnm allow a large class of generalized nonlinear models to be specified, as exemplified through the package documentation. Nevertheless, for some models a custom "nonlin" function may be necessary or desirable. As an illustration, we shall consider a logistic predictor of the form given by SSlogis (a selfStart model for use with the stats function nls),

    Asym / (1 + exp((xmid − x)/scal)) .    (6)

This predictor could be specified to gnm as

~ -1 + Mult(1, Inv(Const(1) +
    Exp(Mult(1 + offset(-x), Inv(1)))))
Figure 3: Simple contrasts of homogeneous row and column scores in model RChomog.
where Mult, Inv and Exp are "nonlin" functions, offset is the usual stats function and Const is a gnm function specifying a constant offset. However, this specification is rather convoluted and it would be preferable to define a single "nonlin" function that could specify a logistic term.

Our custom "nonlin" function only needs a single argument, to specify the variable x, so the skeleton of our definition might be as follows:

Logistic <- function(x){
}
class(Logistic) <- "nonlin"

Now we need to fill in the body of the function. The purpose of a "nonlin" function is to convert its arguments into a list of arguments for the internal function nonlinTerms. This function considers a nonlinear term as a mathematical function of predictors, the parameters of which need to be estimated, and possibly also variables, which have a coefficient of 1. In this terminology, Asym, xmid and scal in Equation 6 are parameters of intercept-only predictors, whilst x is a variable. Our Logistic function must specify the corresponding predictors and variables arguments of nonlinTerms. We can define the predictors argument as a named list of symbolic expressions

predictors = list(Asym = 1, xmid = 1, scal = 1)

and pass the user-specified variable x as the variables argument:

variables = list(substitute(x))

We must then define the term argument of nonlinTerms, which is a function that creates a deparsed mathematical expression of the nonlinear term from labels for the predictors and variables. These labels should be passed through arguments predLabels and varLabels respectively, so we can specify the term argument as

term = function(predLabels, varLabels){
    paste(predLabels[1], "/(1 + exp((",
          predLabels[2], "-", varLabels[1], ")/",
          predLabels[3], "))")
}

Our Logistic function need not specify any further arguments of nonlinTerms. However, we do have an idea of useful starting values, so we can also specify the start argument. This should be a function which takes a named vector of the parameters in the term and returns a vector of starting values for these parameters. The following function will set the initial scale parameter to one:

start = function(theta){
    theta[3] <- 1
    theta
}

Putting these components together, we have:

Logistic <- function(x){
    list(predictors =
             list(Asym = 1, xmid = 1, scal = 1),
         variables = list(substitute(x)),
         term = function(predLabels, varLabels){
             paste(predLabels[1],
                   "/(1 + exp((", predLabels[2],
                   "-", varLabels[1], ")/",
                   predLabels[3], "))")
         },
         start = function(theta){
             theta[3] <- 1
             theta
         })
}
class(Logistic) <- "nonlin"

Having defined this function, we can reproduce the example from ?SSlogis as shown below:

> Chick.1 <-
+     ChickWeight[ChickWeight$Chick == 1, ]
The approach implemented in the fmri package is based on a linear model for the hemodynamic response and structural adaptive spatial smoothing of the resulting Statistical Parametric Map (SPM). The package requires R (version ≥ 2.2). 3D visualization needs the R package tkrplot.

The statistical modeling implemented with this software is described in (Tabelow et al., 2006).

NOTE! This software comes with absolutely no warranty! It is not intended for clinical use, but for evaluation purposes only. Absence of bugs cannot be guaranteed!

We first start with a basic script for using the fmri package in a typical fMRI analysis. In the following sections we describe the consecutive steps of the analysis.

# read the data
data <- read.AFNI("afnifile")
# or read sequence of ANALYZE files
# analyze031file.hdr ... analyze137file.hdr
# data <- read.ANALYZE("analyze",
#     numbered = TRUE, "file", 31, 107)

data <- read.ANALYZE(prefix = "", numbered = TRUE,
                     postfix = "", picstart = 1,
                     numbpic = 1)

Setting the argument numbered to TRUE allows reading a list of image files. The image files are assumed to be ordered by a number contained in the filename. picstart is the number of the first file, and numbpic the number of files to be read. The increment between consecutive numbers is assumed to be 1. prefix specifies a string preceding the number, whereas postfix identifies a string following the number in the filename. Note that read.ANALYZE() requires the file names in the form '<prefix>number<postfix>.hdr/img'.
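The filename convention '<prefix>number<postfix>.hdr' can be sketched for the script's example files (analyze031file.hdr and following). The sprintf-based helper below is hypothetical, not part of fmri, and assumes the three-digit numbering used in the example:

```r
# Build the file names read.ANALYZE() would look for, given
# prefix "analyze", postfix "file", picstart 31 and numbpic 3.
analyzeNames <- function(prefix, postfix, picstart, numbpic) {
  sprintf("%s%03d%s.hdr", prefix, picstart + seq_len(numbpic) - 1, postfix)
}
analyzeNames("analyze", "file", 31, 3)
```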
in ('raw') format to save memory. The data can be extracted from the list using extract.data(). Additionally the list contains basic information like the size of the data cube ('dim') and the voxel size ('delta'), as well as the complete header information ('header'), which is itself a list with components depending on the data format read. A head mask is defined by simply using a 75% quantile of the data grey levels as cut-off. This is only used to provide improved spatial correlation estimates for the head in fmri.lm().

Expected BOLD response

In voxels affected by the experiment the observed signal is assumed to follow the expected Blood Oxygenation Level Dependent (BOLD) signal. This signal depends on both the experimental stimulus, described by a task indicator function, and a hemodynamic response function h(t). We define h(t) as the difference of two gamma functions

    h(t) = (t/d1)^a1 exp(−(t − d1)/b1) − c (t/d2)^a2 exp(−(t − d2)/b2)

with a1 = 6, a2 = 12, b1 = 0.9, b2 = 0.9, di = ai bi (i = 1, 2), and c = 0.35, where t is the time in seconds, see (Glover, 1999). The expected BOLD response is given as a discrete convolution of this function with the task indicator function. Use

hrf <- fmri.stimulus(107, c(18, 48, 78), 15, 2)

to create the expected BOLD response for a stimulus with 107 scans, onset times at the 18th, 48th, and 78th scan, a duration of the stimulus of 15 scans, and a time TR = 2s between two scans. In case of multiple stimuli the results of fmri.stimulus() for the different stimuli should be arranged as columns of a matrix hrf using cbind().

The hemodynamic response function may have an unknown latency (Worsley and Taylor, 2005). This can be modeled by creating an additional explanatory variable as the first numerical derivative of any experimental stimulus:

dhrf <- (c(0,diff(hrf)) + c(diff(hrf),0))/2

See the next section for how to include this into the linear model.

Construction of the SPM

We adopt the common view of a linear model for the time series Yi = (Yit) in each voxel i after reconstruction of the raw data and motion correction, where X denotes the design matrix. The design matrix is created by

x <- fmri.design(hrf)

This will include polynomial drift terms up to quadratic order. To deviate from this default, the order of the polynomial drift can be specified by a second argument. Use cbind() to combine several stimulus vectors into a matrix of stimuli before calling fmri.design().

The first q columns of X contain values of the expected BOLD response for the different stimuli evaluated at scan acquisition times. The other p − q columns are chosen to be orthogonal to the expected BOLD responses and to account for a slowly varying drift and possible other external effects. The error vector εi has zero expectation and is assumed to be correlated in time. In order to assess the variability of the estimates of βi correctly we have to take the correlation structure of the error vector εi into account. We assume an AR(1) model to be sufficient for commonly used MRI scanners. The autocorrelation coefficients ρi are estimated from the residual vector ri = (ri1, . . . , riT) of the fitted model (1) as

    ρ̄i = ∑_{t=2}^{T} rit ri(t−1) / ∑_{t=1}^{T} rit² .

This estimate of the correlation coefficient is biased due to fitting the linear model (1). We therefore apply the bias correction given by (Worsley et al., 2002), leading to an estimate ρ̃i.

We then use prewhitening to transform model (1) into a linear model with approximately uncorrelated errors. The prewhitened linear model is obtained by multiplying the terms in (1) with some matrix Ai depending on ρ̃i. The prewhitening procedure thus results in a new linear model

    Ỹi = X̃i βi + ε̃i    (2)

with Ỹi = Ai Yi, X̃i = Ai X, and ε̃i = Ai εi. In the new model the errors ε̃i = (ε̃it) are approximately uncorrelated in time t, such that var ε̃i = σi² IT. Finally least squares estimates β̃i are obtained from model (2) as

    β̃i = (X̃i′ X̃i)⁻¹ X̃i′ Ỹi .

The error variance σi² is estimated from the residuals r̃i of the linear model (2) as σ̃i² = ∑_{t=1}^{T} r̃it² / (T − p), leading to estimated covariance matrices

    var β̃i = σ̃i² (X̃i′ X̃i)⁻¹ .

Estimates β̃i and their estimated covariance matrices are, in the simplest case, obtained by

spm <- fmri.lm(data, x)
where data is the data object read by read.AFNI() or read.ANALYZE(), and x is the design matrix created with fmri.design(). See figure 1 for an example of a typical fMRI dataset together with the result of the fit of the linear model to a time series.

To consider more than one stimulus and to estimate an effect

γ̃ = cᵀ β̃    (3)

defined by a vector of contrasts c, set the argument contrast of the function fmri.lm() correspondingly:

hrf1 <- fmri.stimulus(214, c(18, 78, 138), 15, 2)
hrf2 <- fmri.stimulus(214, c(48, 108, 168), 15, 2)
x <- fmri.design(cbind(hrf1, hrf2))

# stimulus 1 only
spm1 <- fmri.lm(data, x, contrast = c(1,0))

# stimulus 2 only
spm2 <- fmri.lm(data, x, contrast = c(0,1))

# contrast between both
spm3 <- fmri.lm(data, x, contrast = c(1,-1))

If the argument vvector is set, the component "cbeta" of the object returned by fmri.lm() contains a vector with the parameters corresponding to the non-zero elements of vvector in each voxel. This may be used to include an unknown latency of the hemodynamic response function in the analysis. First define the expected BOLD response for a given stimulus and its derivative, then combine them into the design matrix:

hrf <- fmri.stimulus(107, c(18, 48, 78), 15, 2)
dhrf <- (c(0, diff(hrf)) + c(diff(hrf), 0))/2
x <- fmri.design(cbind(hrf, dhrf))
spm <- fmri.lm(data, x, vvector = c(1,1))

The specification of vvector in the last statement results in a vector of length 2 containing the two parameter estimates for the expected BOLD response and its derivative in each voxel. Furthermore, the ratio of the variance estimates for these parameters is calculated as a prerequisite for the smoothing procedure. See fmri.smooth() for details about smoothing this parametric map.

The function returns an object with class attributes "fmridata" and "fmrispm". This is again a list with components containing the estimated parameter contrast ("cbeta") and its estimated variance ("var"), as well as estimated spatial correlations in all directions.

Structure Adaptive Smoothing (PS)

Smoothing of SPMs is applied in this context to improve the sensitivity of signal detection. This is expected to work since neural activations extend over regions of adjacent voxels. Averaging over such regions allows us to reduce the variance of the parameter estimates without compromising their mean. Introducing a spatial correlation structure also reduces the number of independent decisions made for signal detection and therefore eases the multiple testing problem. Adaptive smoothing as implemented in this package also allows us to improve the specificity of signal detection; see figure 2 for an illustration.

The parameter map is smoothed with

spmsmooth <- fmri.smooth(spm, hmax = hmax)

where spm is the result of the function fmri.lm(). hmax is the maximum bandwidth for the smoothing algorithm. For lkern="Gaussian" the bandwidth is given in units of FWHM; for any other localization kernel the unit is voxels. hmax should be chosen as the expected maximum size of the activation areas. As adaptive smoothing automatically adapts to different sizes and shapes of the activation areas, oversmoothing is not to be expected.

In Tabelow et al. (2006) the use of a spatial adaptive smoothing procedure derived from the Propagation-Separation approach (Polzehl and Spokoiny, 2006) was proposed in this context. The approach focuses, for each voxel i, on simultaneously identifying a region where the unknown parameter γ is approximately constant and obtaining an optimal estimate γ̂i employing this structural information. This is achieved by an iterative procedure. Local smoothing is restricted to local vicinities of each voxel, which are characterized by a weighting scheme. Smoothing and characterization of local vicinities are alternated. Weights for a pair of voxels i and j are constructed as a product of kernel weights Kloc(δ(i, j)/h), depending on the distance δ(i, j) between the two voxels and a bandwidth h, and a factor reflecting the difference of the estimates γ̂i and γ̂j obtained within the last iteration. The bandwidth h is increased with iterations up to a maximal bandwidth hmax.

The name Propagation-Separation stands for the two main properties of this algorithm. In case of a completely homogeneous array Γ̃, that is E γ̃ ≡ const., the algorithm delivers essentially the same result as a nonadaptive kernel smoother employing the bandwidth hmax. In this case the procedure selects the best of a sequence of almost nonadaptive estimates, that is, it propagates to the one with maximum bandwidth. Separation means that as soon as significant differences of γ̂i and γ̂j are observed within one iteration step, the corresponding weight is decreased to zero and the information from voxel j is no longer used to estimate γi. Voxels i and j belong to different regions of homogeneity and are therefore separated. As a consequence, smoothing is restricted to regions with approximately constant values of γ, bias at the border of such regions is avoided, and the spatial structure of activated regions is conserved.

Figure 2: A numerical phantom for studying the performance of PS vs. Gaussian smoothing. (a) Signal locations within one slice. Eight different signal-to-noise ratios, increasing clockwise, are coded by gray values. The spatial extent of activations varies in the radial direction. The data cube contains 15 slices with activation alternated with 9 empty slices. (b) Smoothing with a conventional Gaussian filter. (c) Smoothing with PS. In both (b) and (c), the bandwidth is 3 voxels, which corresponds to FWHM = 10 mm for typical voxel size. In (b) and (c) the proportion, over slices containing activations, of voxels that are detected in a given position is rendered by gray values. Black corresponds to the absence of a detection.
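The iterative weighting scheme described above can be illustrated with a self-contained toy in plain R. This is our own one-dimensional sketch of the propagation-separation idea, not the fmri package's implementation; the kernel, the penalty constant, and the bandwidth sequence are all chosen for illustration only:

```r
# Toy 1-D "parameter map": a jump from 0 to 1, observed with noise.
set.seed(42)
gamma.true <- rep(c(0, 1), each = 20)
gamma.obs  <- gamma.true + rnorm(40, sd = 0.2)

est <- gamma.obs
for (h in c(2, 4, 8)) {            # bandwidth grows over the iterations
  est <- sapply(seq_along(est), function(i) {
    d <- abs(seq_along(est) - i)
    # product of a location kernel and a penalty on differences of the
    # current estimates: large differences get weight near zero, so
    # smoothing does not cross the jump (separation)
    w <- pmax(1 - d / h, 0) * exp(-(est[i] - est)^2 / 0.1)
    sum(w * gamma.obs) / sum(w)
  })
}
```

Within each half of the vector, the estimates are pulled toward a common value (propagation), while the boundary between the halves stays sharp; this is the qualitative behavior that adaptive smoothing aims for on real parameter maps.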
plot(object, anatomic,
     device="jpeg", file="result.jpeg")

object is an object of class "fmridata" (and "fmrispm" or "fmripvalue") as returned by fmri.pvalue(), fmri.smooth(), fmri.lm(), read.ANALYZE(), or read.AFNI(). anatomic is an anatomic underlay of the same dimension as the functional data. The argument type="3d" provides an interactive display for producing 3-dimensional illustrations on the screen (requires the R package tkrplot). The default for type is "slice", which can create output to the screen and to image files.

We also provide generic functions summary() and print() for objects with class attribute "fmridata".

Figure 3: Signal detection in 4 consecutive slices using the adaptive smoothing procedure. The shape and extent of the activation areas are conserved much better than with traditional Gaussian filtering. Data: courtesy of H. Voss, Weill Medical College of Cornell University.

Writing the results to files

Finally, we provide functions to write data in standard medical image formats such as HEAD/BRIK, ANALYZE, and NIFTI.

write.AFNI("afnitest", data, c("signal"),
           note="random data", origin=c(0,0,0),
           delta=c(4,4,5), idcode="unique ID")

Bibliography

Biomedical Imaging Resource. Analyze Program. Mayo Foundation, 2001.

R. W. Cox. AFNI: Software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomed. Res., 29:162-173, 1996.

G. H. Glover. Deconvolution of impulse response in event-related BOLD fMRI. NeuroImage, 9:416-429, 1999.

J. Polzehl and V. Spokoiny. Propagation-separation approach for local likelihood estimation. Probab. Theory and Relat. Fields, 135:335-362, 2006.

J. Polzehl and K. Tabelow. Analysing fMRI experiments with the fmri package in R. Version 1.0 - A user's guide. WIAS Technical Report No. 10, 2006.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007. ISBN 3-900051-07-0.

K. Tabelow, J. Polzehl, H. U. Voss, and V. Spokoiny. Analyzing fMRI experiments with structural adaptive smoothing procedures. NeuroImage, 33:55-62, 2006.

K. J. Worsley. Spatial smoothing of autocorrelations to control the degrees of freedom in fMRI analysis. NeuroImage, 26:635-641, 2005.

K. J. Worsley, C. Liao, J. A. D. Aston, V. Petre, G. H. Duncan, F. Morales, and A. C. Evans. A general statistical analysis for fMRI data. NeuroImage, 15:1-15, 2002.

K. J. Worsley, S. Marrett, P. Neelin, K. J. Friston, and A. C. Evans. A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping, 4:58-73, 1996.

K. J. Worsley and J. E. Taylor. Detecting fMRI activation allowing for unknown latency of the hemodynamic response. NeuroImage, 29:649-654, 2006.

Jörg Polzehl & Karsten Tabelow
Weierstrass Institute for Applied Analysis
and Stochastics, Berlin, Germany
polzehl@wias-berlin.de
tabelow@wias-berlin.de
a ranking of potential matches for treatment unit i, an ordering of available controls accounting for their differences with plant i in generating capacity and in year of construction. Typically controls j are ordered in terms of a numeric discrepancy, d[i, j], from i; Figure 1's match follows Rosenbaum (2002, ch. 10) in using the sum of rank differences on the two covariates (after restricting to a subset of the plants, pt!=1):

> data("nuclear", package="boot")
> attach(nuclear[nuclear$pt!=1,])
> drk <- rank(date)
> d <- outer(drk[pr==1], drk[pr!=1], "-")
> d <- abs(d)
> crk <- rank(cap)
> d <- d +
    abs(outer(crk[pr==1], crk[pr!=1], "-"))

(where pr==1 indicates the treatment group). The d that results from these operations is shown (after rounding) in Figure 3. Having calculated this d, one can pose the task of matching as a discrete optimization problem: find the match M = {(i, j)} minimizing ∑M d(i, j) among all sets of pairs (i, j) in which each treatment i appears twice and each control j appears at most once.

Optimal matching refers to algorithms guaranteed to find matches attaining this minimum, or falling within a specified tolerance of it, given an nt × nc discrepancy matrix. Optimal matching's performance advantage over heuristic, non-optimal algorithms can be striking. For example, in the problem of Figure 1, optimal matching reduces nearest-available's sum of discrepancies by 23%. This optimal solution, found by optmatch's pairmatch function, is shown in Figure 2.

  Existing site          New site
  date  capacity         date  capacity
A  2.3   660           H  3.6   290
B  3.0   660           I  2.3   660
C  3.4   420           J  3.0   660
D  3.4   130           K  2.9   110
E  3.9   650           L  3.2   420
F  5.9   430           M  3.4    60
G  5.1   420           N  3.3   390
                       O  3.6   160
                       P  3.8   390
                       Q  3.4   130
                       R  3.9   650
                       S  3.9   450
                       T  3.4   380
                       U  4.5   440
                       V  4.2   690
                       W  3.8   510
                       X  4.7   390
                       Y  5.4   140
                       Z  6.1   730

Figure 2: Optimal vs. greedy 1:2 matching. By evaluating potential matches all together rather than sequentially, optimal matching (blue lines) reduces the sum of distances by 23%.

An optimal match is optimal relative to given structural requirements, here that all treatments and a corresponding number of controls be arranged in 1:2 matched sets, and a given distance, here d. It is best-possible for the purposes of the analyst only insofar as the given distance and structural stipulations best represent the analyst's goals; in this sense the "optimal" in "optimal matching" is analogous to the "maximum" of "maximum likelihood," which is never better than the chosen likelihood model it maximizes.

For example, in the problem just discussed the structural stipulation that the matched sets all be 1:2 triples may be poorly tailored to the goal of reducing baseline differences between the groups. It is appropriate if these differences stem entirely from a small minority of controls being too unlike any treatment subjects to bear comparison to them, since it does exclude 5/19 of the potential controls from the final match; but differences on a larger number of controls, even small differences, require the techniques to be described under Generalizations of pair matching, below, which may also give similar or better bias reduction without excluding as many control observations. (See also Discussion, below.)

Growing your own discrepancy matrix

Figures 1 and 2 both illustrate multivariate distance matching, aligning units so as to minimize a sum of rank discrepancies. Optmatch is entirely flexible about the form of the distance on which matches are to be made. To propensity-score match nuclear plants, for example, one would prepare a propensity distance using

> pscr <- glm(pr ~ . -(pr+cost),
              family = binomial,
              data = nuclear)$linear.predictors
> PR <- nuclear$pr==1
> pdist <- outer(pscr[PR], pscr[!PR], "-")
> pscr.v <- (var(pscr[PR])*(sum(PR)-1) +
             var(pscr[!PR])*(sum(!PR)-1))/
            (length(PR)-2)
> pdist <- abs(pdist)/sqrt(pscr.v)

or, more simply and reliably,

> pmodel <- glm(pr ~ . -(pr+cost),
                family = binomial, data = nuclear)
> pdist <- pscore.dist(pmodel)

Then pdist is passed to pairmatch or fullmatch as its first argument. Other discrepancies on which one might match include Mahalanobis distances (which can be produced using mahal.dist) and combinations of Mahalanobis and propensity-based distances (Rosenbaum and Rubin, 1985; Gu and Rosenbaum, 1993; Rubin and Thomas, 2000). Many special requirements, such as that matches be made only within given subclasses, or that specific matches be avoided, are also introduced through the discrepancy matrix.

First consider the case where matches are to be made within subclasses only. In the nuclear dataset, plants
with and without partial turnkey (pt) guarantees should be compared separately, since the meaning of the outcome variable, cost, changes with pt. Figure 2 shows only the pt!=1 plants, and its match is generated with the command pairmatch(d, controls=2), where d is the matrix in Figure 3. To match partial turnkey plants to each other as well, one gathers into a list, dl, both d and a distance matrix dpt comparing new- and existing-site partial turnkey plants, then feeds dl to pairmatch as its first argument. For propensity or Mahalanobis matching, pscore.dist or mahal.dist would do this if given the formula pr~pt as their optional argument 'structure.fmla'.

More generally, the helper function makedist (which is called by pscore.dist and mahal.dist) eases the application of a (user-defined) discrepancy-matrix-producing function to each of a sequence of strata in order to generate a list of distance matrices. For separate summed-rank distances by pt subclass, one would write a function that extracts portions of relevant variables from a given data frame, looking to a treatment-group variable to decide what portions to take, as in

> capdatediffs <- function(trt, dat) {
    crk <- rank(dat[names(trt), "cap"])
    names(crk) <- names(trt)
    dmt <- outer(crk[trt], crk[!trt], "-")
    dmt <- abs(dmt)
    drk <- rank(dat[names(trt), "date"])
    dmt <- dmt +
      abs(outer(drk[trt], drk[!trt], "-"))
    dmt
  }

Then one would use makedist to apply the function separately within levels of pr:

> dl <- makedist(pr ~ pt, nuclear,
                 capdatediffs)

The result of this is a list of two distance matrices, both submatrices of the d created above, one comparing pt!=1 treatments and controls and a smaller one for the pt==1 plants.

In larger problems, matching can be substantially faster if preceded by a division of the sample into subclasses; see Under the hood, below. The use of pscore.dist, mahal.dist, and makedist carries another advantage: the lists of distances they generate carry metadata to prevent fullmatch or pairmatch from getting confused about the order of observations in the data frame from which the distances were generated.

Another common aim is to forbid unwanted matches. With optmatch, this is done by placing NA's, NaN's, or Inf's at the relevant places in a distance matrix. Consider matching nuclear plants within calipers of three years on date of construction. Pairings of plants that would violate this requirement are indicated in red in Figure 3. To enforce the caliper, one could generate a matrix of discrepancies dy on year of construction, then replace the distance matrix of Figure 3, d, with d/(dy<=3); this new matrix has an Inf at each entry currently shown in red in Figure 3, and otherwise is the same as in Figure 3.

Operations of these types, division and logical comparison, are compatible with subclassification prior to matching, despite the fact that the operations seem to require matrices while subclassification demands lists of matrices. Assuming one has defined a function datediffs as

> datediffs <- function(trt, data) {
    sclr <- data[names(trt), 'date']
    names(sclr) <- names(trt)
    abs(outer(sclr[trt], sclr[!trt], '-'))
  }

then the command

> dly <- makedist(pr ~ pt, nuclear,
                  datediffs)

tabulates absolute differences on date of construction, separately for the pr==1 and pr!=1 strata. With optmatch, the expression dly<=3 returns a list of indicators of whether potential matches were built within three years of one another. Furthermore, dl/(dly<=3) is a list imposing the three-year caliper upon the distances coded in dl. To pair match on propensity scores, but with a 3-year date-of-construction caliper, one would use pairmatch(dl/(dly<=3)).

Generalizations of pair matching

Matching with a varying number of controls

In Figures 1 and 2, both the non-optimal and the optimal matches insist on precisely two controls per treatment. If one's aim is to match as closely as possible, this is a limitation. To optimally match 14 of the 19 controls, as Figure 2 does, but without requiring that they always be matched two-to-one to treatments, one would use the command fullmatch, with options 'min.controls=1' and 'omit.fraction=5/19'. The flexibility this adds improves matching even more than the switch from greedy to optimal matching did; while optimal pair matching reduced greedy pair matching's net discrepancy from 82 to 63, optimal matching with a varying number of controls brings it to 44, just over half its original value.

If mvnc is the match created in this way, then the structure of mvnc is returned by

> stratumStructure(mvnc)
stratum treatment:control ratios
1:1 1:2 1:3 1:5
  4   1   1   1
Figure 3: Rank discrepancies of new- and existing-site nuclear plants without partial turnkey guarantees. New-
and existing-site plants which differ by more than 3 years in date of construction are indicated in red.
This means it consists of four matched pairs and a matched triple, quadruple, and sextuple, all containing precisely one treatment. Ming and Rosenbaum (2000) discuss matching with a varying number of controls, implementing it in their example with a somewhat different algorithm.

Full matching

A propensity score is the conditional probability, p(x), of falling in the treatment group given covariates x (or an increasing transformation of it, such as its logit). Because subjects with large propensity scores more frequently fall in the treatment group, and subjects with low propensity scores are more frequently controls, propensity score matching is fundamentally at odds with matching treatments and controls in fixed ratios, such as 1:1 or 1:k. These ratios must be allowed to adapt, so that 1:1 matches can be made where p(x)/(1 − p(x)) ≈ 1 while 1:k matches are made where p(x)/(1 − p(x)) ≈ 1/k; otherwise either some subjects will have to go unmatched or some subjects are bound to be poorly matched on their propensity scores. Matching with multiple controls addresses part of this problem, but only full matching (Rosenbaum, 1991) addresses it in its entirety, by also permitting l:1 matches for when p(x)/(1 − p(x)) ≈ l ≥ 2.

In general, full matching is useful when there are some regions of covariate space in which controls outnumber treatments but others in which treatments outnumber controls. This pattern emerges most clearly when matching on a propensity score, but such imbalances influence the quality of matches even without propensity scores. The rank discrepancies of new- and existing-site pt==1 plants, shown in Figure 4, illustrate this: earlier dates of construction, and smaller capacities, are more common among controls (d and e) than treatments (b only), and this is reflected in Figure 4's sums of discrepancies on rank. As a consequence, full matching achieves a net rank discrepancy (3) that is half of the minimum possible (6) with techniques that don't permit both 1:2 and 2:1 matched sets.

            New sites
  Existing   d  e  f
     a       6  6  0
     b       0  3  6
     c       6  6  0

Figure 4: Rank discrepancies of new- and existing-site nuclear plants with partial turnkey guarantees. Boxes indicate the optimal full matching of these plants.

Gu and Rosenbaum (1993) compare full and other forms of matching in an extensive simulation study, while Hansen (2004) and Hansen and Klopfer (2006) present applications. This literature emphasizes the importance of using structural restrictions, upper limits on K in K:1 matched sets and on L in 1:L matched sets, when full matching, in order to control the variance of matched estimation. With fullmatch, an upper limit of K:1 on treatment:control ratios is conveyed using 'min.controls=1/K', while a lower limit of 1:L on the treatment:control ratio would be given with 'max.controls=L'. Hansen (2004) and Hansen and Klopfer (2006) give strategies to optimize these tuning parameters. In the context of a specific application, Hansen (2004) finds (min.controls, max.controls) = (1/2, 2) · (1 − p̂)/p̂ to work best, where p̂ represents the proportion treated in the stratum being matched. In an unrelated application, Stuart and Green (2006) find these values to work well; they may be a good starting point for general use.

A somewhat related technique is matching "with replacement," in which overlap between matched sets is permitted in the interests of achieving closer matches. Because of the overlap, methods appropriate to stratified data are not generally appropriate for samples matched with replacement; the technique forces one to resort to specialized methods, such as those of Abadie and Imbens (2006). On the other hand, with-replacement matching would appear to offer the possibility of closer matches, since its pairing of one treatment unit in no way limits its pairing of the next treatment unit. However, it is a surprising, and evidently
little-known, fact that with-replacement matching achieves no closer matches than full matching, a without-replacement matching technique. As discussed by Rosenbaum (1991) and (more explicitly) by Hansen and Klopfer (2006), given any criterion for a potential pairing of subjects to be acceptable, full matching matches all subjects with at least one suitable match in their comparison group, matching them only to acceptable counterparts. So one might insist, in particular, that each treatment unit be matched only to one of its nearest neighbors; by sharing controls among treated units where necessary, omitting controls that are not the nearest neighbor of some treatment, and matching to multiple controls where that can be done, full matching finds a way to honor this requirement. Since the matched sets produced by full matching never overlap, it has the advantage over with-replacement matching of combining with any method of estimation appropriate to finely stratified data.

Under the hood

Hansen and Klopfer (2006) describe in some detail the network-flows algorithm on which optmatch relies, establishing its optimality for full matching and for matching with a fixed or varying number of controls. They also give upper bounds for the time complexity of the algorithm: roughly O(n³ log(nC)), where n is the size of the sample and C is the quotient of the largest discrepancy in the distance matrix and the matching tolerance. This is comparable to the time complexity of squaring an n × n matrix. More precisely, the algorithm requires O(n nt nc log(nC)) floating-point operations, where nt and nc are the sizes of the treatment and control groups.

These bounds have two practical consequences for optmatch. First, computational costs grow steeply with the size of the discrepancy matrix. Just as squaring two (n/2) × (n/2) submatrices of an n × n matrix is about four times faster than squaring the full n × n matrix, matching is made much faster by subdividing large matching problems into smaller ones. For this reason makedist is written so as to facilitate subclassification prior to matching, the effect of which is to split larger matching problems into a sequence of smaller ones. Second, the C-factor contributes secondarily to computational cost; its contribution is reduced by increasing the value of the 'tol' argument to fullmatch or pairmatch.

Discussion

When and how matching reduces systematic differences between groups

Matching can address bias in observational studies in either of two ways. In matched sampling, it is used to select a subset of control subjects most like treatments, with the remainder of subjects excluded from analysis; in matched adjustment, it is used to force treatment-control comparisons to be based on individualized comparisons made within matched sets, which will have been so engineered that matched counterparts are more like one another than are treatments and controls on the whole. Matched sampling is typically followed by matched adjustment, but matched adjustment can be useful even when not preceded by matched sampling.

Because it excludes some five control subjects, the match depicted in Figures 1 and 2 might be used in a context of matched sampling, although it differs in important respects from typical matched samples. In archetypal cases, matched sampling is used when for cost reasons the number of controls to be followed up for outcome data has to be reduced anyway, not when outcome data is already available for the entire sample; and in archetypal cases, the reservoir of potential controls is many times larger than the size of the desired control group. See, e.g., Althauser and Rubin (1970) or Rosenbaum and Rubin (1985). The matches in Figures 1 and 2 are typical of matched sampling in matching a fixed number of controls to each treatment subject. When bias can be addressed by being very selective in the choice of controls, flexibility in the structure of matched sets becomes less important.

When there is no additional data to be collected, there may be little use for matched sampling per se, while matched adjustment may still be attractive. In these cases, it is important to recognize that matching, even optimal matching, does not in itself reduce systematic differences between treatment and control groups unless it is specifically given the flexibility to do so. Suppose, for instance, that adjustment for the variable t2 is needed in the comparison of new- and existing-site plants. This variable, which represents the time between the issue of an operating permit and a construction permit, differs markedly in its distribution between "treatments" and "controls," as seen in Figure 5: treatments have systematically larger values of it, although the two distributions overlap substantially. When two groups compare in this way, no fixed-ratio matching of them can reduce their overall discrepancy. Some of the observations will have to be set aside; or, better yet, one could match the two groups in varying ratios, using matching with multiple controls or full matching. These techniques have surprising power to reconcile differences between treatments and controls while setting aside few or even no subjects because they lack suitable counterparts; the reader is referred to Hansen (2004, §2) for general discussion and a case study.
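The greedy-versus-optimal contrast that runs through this article can be reproduced in miniature at the console. The 2 × 2 discrepancy matrix below is hand-made for illustration (it is not derived from the nuclear data): nearest-available matching commits to the single cheapest pairing and pays for it, while optimal matching weighs both complete one-to-one assignments.

```r
# Hypothetical discrepancy matrix: rows are treatments, columns controls.
d <- rbind(t1 = c(c1 = 1.0, c2 = 2.0),
           t2 = c(c1 = 1.1, c2 = 100))

# Greedy, nearest-available matching: t1 takes its nearest control, c1,
# leaving t2 with the distant c2.
greedy <- d["t1", "c1"] + d["t2", "c2"]    # 1 + 100 = 101

# Optimal 1:1 matching: enumerate both complete assignments and take the
# one minimizing the summed discrepancy (what pairmatch does, via network
# flows, on problems far too large to enumerate).
assignments <- list(c(1, 2), c(2, 1))
costs <- sapply(assignments,
                function(p) d[1, p[1]] + d[2, p[2]])
optimal <- min(costs)                      # 2 + 1.1 = 3.1
```

Here the greedy total is 101 and the optimal total is 3.1; on realistic problems the gap is smaller, but, as in the nuclear-plant example, it can still be substantial.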
A. R. Brazzale. hoa: An R package bundle for higher order likelihood inference. R News, 5/1:20-27, May 2005. URL ftp://cran.r-project.org/doc/Rnews/Rnews_2005-1.pdf. ISSN 1609-3631.

A. R. Brazzale, A. C. Davison, and N. Reid. Applied Asymptotics. Cambridge University Press, 2006.

N. E. Breslow and N. E. Day. Statistical Methods in Cancer Research (Vol. 1) — The Analysis of Case-control Studies. Number 32 in IARC Scientific Publications. International Agency for Research on Cancer, 1980.

D. Cox and E. Snell. Applied Statistics. Chapman and Hall, 1981.

D. R. Cox and E. J. Snell. Analysis of Binary Data. Chapman & Hall Ltd, 1989.

A. Diamond and J. S. Sekhon. Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Technical report, Travers Department of Political Science, University of California, Berkeley, 2006. Version 1.2.

A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.

X. Gu and P. R. Rosenbaum. Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2(4):405-420, 1993.

B. B. Hansen. Appraising covariate balance after assignment to treatment by groups. Technical Report 436, University of Michigan, Statistics Department, 2006a.

B. B. Hansen. Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467):609-618, September 2004.

B. B. Hansen. Bias reduction in observational studies via prognosis scores. Technical Report 441, University of Michigan, Statistics Department, April 2006b. URL http://www.stat.lsa.umich.edu/%7Ebbh/rspaper2006-06.pdf.

B. B. Hansen and S. O. Klopfer. Optimal full matching and related designs via network flows. Journal of Computational and Graphical Statistics, 15(3):609-627, 2006. URL http://www.stat.lsa.umich.edu/%7Ebbh/hansenKlopfer2006.pdf.

T. Hothorn, K. Hornik, M. van de Wiel, and A. Zeileis. A lego system for conditional inference. The American Statistician, 60(3):257-263, 2006.

W.-S. Lee. Propensity score matching and variations on the balancing test, 2006. URL http://papers.ssrn.com/sol3/papers.cfm?abstract_id=936782#.

T. Lumley and T. Therneau. survival: Survival analysis, including penalised likelihood, 2006. R package version 2.26.

K. Ming and P. R. Rosenbaum. Substantial gains in bias reduction from matching with a variable number of controls. Biometrics, 56:118-124, 2000.

D. Pierce and D. Peters. Practical use of higher order asymptotics for multiparameter exponential families. Journal of the Royal Statistical Society, Series B (Methodological), 54(3):701-737, 1992.

S. W. Raudenbush and A. S. Bryk. Hierarchical Linear Models: Applications and Data Analysis Methods. Sage Publications, 2002.

P. R. Rosenbaum. A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society, 53:597-610, 1991.

P. R. Rosenbaum. Observational Studies. Springer-Verlag, second edition, 2002.

P. R. Rosenbaum and D. B. Rubin. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39:33-38, 1985.

D. B. Rubin. Using propensity scores to help design observational studies: application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3):169-188, 2001.

D. B. Rubin. Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74:318-328, 1979.

D. B. Rubin and N. Thomas. Combining propensity score matching with additional adjustments for prognostic covariates. Journal of the American Statistical Association, 95(450):573-585, 2000.

J. S. Sekhon. Alternative balance metrics for bias reduction in matching methods for causal inference. Survey Research Center, University of California, Berkeley, 2007. URL http://sekhon.berkeley.edu/papers/SekhonBalanceMetrics.pdf.

H. Smith. Matching with multiple controls to estimate treatment effects in observational studies. Sociological Methodology, 27:325-353, 1997.

E. A. Stuart and K. M. Green. Using full matching to estimate causal effects in non-experimental studies: Examining the relationship between adolescent marijuana use and adult outcomes. Technical report, Johns Hopkins University, 2006.

Ben B. Hansen
Department of Statistics
University of Michigan
bbh@umich.edu
...example, should be aware of the R party() package, which implements a random forests style analysis using conditional tree base learners (Hothorn et al., 2006).
...vides four different survival splitting rules for the user. These are: (i) a log-rank splitting rule, the default splitting rule, invoked by the option splitrule="logrank"; (ii) a conservation of events splitting rule, splitrule="conserve"; (iii) a logrank score rule, splitrule="logrankscore"; and (iv) a fast approximation to the logrank splitting rule, splitrule="logrankapprox".

Notation

Assume we are at node h of a tree during its growth and that we seek to split h into two daughter nodes. We introduce some notation to help discuss how the various splitting rules work to determine the best split. Assume that within h are n individuals. Denote their survival times and 0-1 censoring information by (T_1, δ_1), ..., (T_n, δ_n). An individual l will be said to be right censored at time T_l if δ_l = 0; otherwise the individual is said to have died at T_l if δ_l = 1. In the case of death, T_l will be referred to as an event time, and the death as an event. An individual l who is right censored at T_l simply means the individual is known to have been alive at T_l, but the exact time of death is unknown.

A proposed split at node h on a given predictor x is always of the form x ≤ c and x > c. Such a split forms two daughter nodes (a left and right daughter) and two new sets of survival data. A good split maximizes survival differences across the two sets of data. Let t_1 < t_2 < · · · < t_N be the distinct death times in the parent node h, and let d_{i,j} and Y_{i,j} equal the number of deaths and individuals at risk at time t_i in the daughter nodes j = 1, 2. Note that Y_{i,j} is the number of individuals in daughter j who are alive at time t_i, or who have an event (death) at time t_i. More precisely,

  Y_{i,1} = #{T_l ≥ t_i, x_l ≤ c},   Y_{i,2} = #{T_l ≥ t_i, x_l > c}.

Let n_j be the total number of observations in daughter j. Thus, n = n_1 + n_2. Note that n_1 = #{l : x_l ≤ c} and n_2 = #{l : x_l > c}.

Log-rank splitting

The log-rank test for a split at the value c for predictor x is

  L(x, c) = [ ∑_{i=1}^{N} ( d_{i,1} − Y_{i,1} d_i / Y_i ) ] / sqrt( ∑_{i=1}^{N} (Y_{i,1}/Y_i) (1 − Y_{i,1}/Y_i) ((Y_i − d_i)/(Y_i − 1)) d_i ).

The value |L(x, c)| is the measure of node separation. The larger the value for |L(x, c)|, the greater the difference between the two groups, and the better the split is. In particular, the best split at node h is determined by finding the predictor x* and split value c* such that |L(x*, c*)| ≥ |L(x, c)| for all x and c.

Conservation of events splitting

The log-rank test for splitting survival trees is a well established concept (Segal, 1988), having been shown to be robust in both proportional and non-proportional hazard settings (LeBlanc and Crowley, 1993). However, one criticism often heard is that it tends to favor continuous predictors and often suffers from an end-cut preference (favoring uneven splits). In our experience with Random Survival Forests, however, we have not found this to be a serious deficiency. Nevertheless, to address this potential problem we introduce another important class of test statistics for splitting that are related to conservation of events, a concept introduced in Naftel et al. (1985) (our simulations have indicated these tests may be much less susceptible to the aforementioned problems).

Under fairly general conditions, conservation of events asserts that the sum of the estimated cumulative hazard function over the observed time points (deaths and censored values) must equal the total number of deaths. This applies to a wide collection of estimates including the Nelson-Aalen estimator. The Nelson-Aalen cumulative hazard estimator for daughter j is

  Ĥ_j(t) = ∑_{t_{i,j} ≤ t} d_{i,j} / Y_{i,j},

where t_{i,j} are the ordered death times for daughter j (note: we define 0/0 = 0).

Let (T_{l,j}, δ_{l,j}), for l = 1, ..., n_j, denote all survival times and censoring indicator pairs for daughter j. Conservation of events asserts that

  ∑_{l=1}^{n_j} Ĥ_j(T_{l,j}) = ∑_{l=1}^{n_j} δ_{l,j}.    (1)

In other words, the total number of deaths is conserved in each daughter.

The conservation of events splitting rule is motivated by (1). First, order the time points within each daughter node such that

  T_{(1),j} ≤ T_{(2),j} ≤ · · · ≤ T_{(n_j),j}.

Let δ_{(l),j} be the censoring indicator function for the ordered value T_{(l),j}. Define

  M_{k,j} = ∑_{l=1}^{k} Ĥ_j(T_{(l),j}) − ∑_{l=1}^{k} δ_{(l),j},   k = 1, ..., n_j.

One can think of M_{k,j} as "residuals" that measure accuracy of conservation of events. The proposed test statistic takes the sum of the absolute values of M_{k,j} for k = 1, ..., n_j for each daughter j, and weights these values by the number of individuals at risk
within each group. Observe that M_{n_j,j} = 0, but nothing can be said about M_{k,j} for k < n_j. Thus, by considering M_{k,j} for each k, the proposed test measures how evenly distributed conservation of events is over all deaths. The measure of conservation of events for the split on x at the value c is

  Conserve(x, c) = (1 / (Y_{1,1} + Y_{1,2})) ∑_{j=1}^{2} Y_{1,j} ∑_{k=1}^{n_j} |M_{k,j}|.

An approximate log-rank test can be used in place of L(x, c) to greatly reduce computations. To derive the approximation, first rewrite the numerator of L(x, c) in a form that uses the Nelson-Aalen estimator for the parent node. The Nelson-Aalen estimator is

  Ĥ(t) = ∑_{t_i ≤ t} d_i / Y_i.

Note this value is computed for all individuals i in the data.
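The identity (1) is easy to verify numerically. The sketch below is ours, not code from the randomSurvivalForest package, and the data are made up: it computes the Nelson-Aalen cumulative hazard at every observed time point and checks that its sum equals the total number of deaths.

```r
## Numerical check of conservation of events (illustration only; toy data).
time   <- c(2, 3, 3, 5, 7, 8, 10, 12)
status <- c(1, 0, 1, 1, 0, 1, 0, 1)   # 1 = death, 0 = right censored

## Nelson-Aalen cumulative hazard evaluated at time t
na.cumhaz <- function(t) {
  td <- sort(unique(time[status == 1]))   # distinct death times
  td <- td[td <= t]
  if (length(td) == 0) return(0)
  sum(sapply(td, function(ti)
    sum(time == ti & status == 1) / sum(time >= ti)))
}

lhs <- sum(sapply(time, na.cumhaz))   # sum of H-hat over all observed times
rhs <- sum(status)                    # total number of deaths
all.equal(lhs, rhs)                   # TRUE: conservation of events
```

The identity holds exactly, ties and censoring included, because each term d_i/Y_i is counted once for every individual still at risk at t_i.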
The estimate (2) is based on one tree. To produce our ensemble we average (2) over all ntree trees. Let Ĥ_b(t|x) denote the cumulative hazard estimate (2) for tree b = 1, ..., ntree. Define I_{i,b} = 1 if i is an OOB point for b, otherwise set I_{i,b} = 0. The OOB ensemble cumulative hazard estimator for i is

  Ĥ*_e(t|x_i) = ∑_{b=1}^{ntree} I_{i,b} Ĥ_b(t|x_i) / ∑_{b=1}^{ntree} I_{i,b}.

Observe that the estimator is obtained by averaging over only those bootstrap samples in which i is excluded (i.e., those datasets in which i is an OOB value). The OOB estimator is in contrast to the ensemble cumulative hazard estimator that uses all samples:

  Ĥ_e(t|x_i) = (1/ntree) ∑_{b=1}^{ntree} Ĥ_b(t|x_i).

Concordance error rate

Given the OOB estimator Ĥ*_e(t|x), it is a simple matter to compute the error rate. We measure error using Harrell's concordance index (Harrell et al., 1982). Unlike other measures of survival performance, Harrell's C-index does not depend on choosing a fixed time for evaluation of the model and specifically takes into account censoring of individuals (May et al., 2004). The method has quickly become quite popular in the literature as a means for assessing prediction performance in survival analysis settings. See Kattan et al. (1998) and references therein.

To compute the concordance index we must define what constitutes a worse predicted outcome. We take the following approach. Let t*_1, ..., t*_N denote all unique event times in the data. Individual i is said to have a worse outcome than j if

  ∑_{k=1}^{N} Ĥ*_e(t*_k|x_i) > ∑_{k=1}^{N} Ĥ*_e(t*_k|x_j).

The concordance error rate is computed as follows:

1. Form all possible pairs of observations over all the data.

2. Omit those pairs where the shorter event time is censored. Also, omit pairs i and j if T_i = T_j unless δ_i = 1 and δ_j = 0 or δ_i = 0 and δ_j = 1. The last restriction only allows ties if one of the observations is a death and the other a censored observation. Let Permissible denote the total number of permissible pairs.

3. Count 1 for each permissible pair in which the shorter event time had the worse predicted outcome. Count 0.5 if the predicted outcomes are tied. Let Concordance denote the total sum over all permissible pairs.

4. Define the concordance index C as

  C = Concordance / Permissible.

5. The error rate is Error = 1 − C. Note that 0 ≤ Error ≤ 1 and that Error = 0.5 corresponds to a procedure doing no better than random guessing, whereas Error = 0 indicates perfect accuracy.

Usage in R

The user interface to randomSurvivalForest is similar in many aspects to randomForest and, as the reader may have already noticed, many of the argument names are also the same. This was done deliberately in order to promote compatibility between the two packages. The primary R function call to the randomSurvivalForest package is rsf(). The online documentation describes rsf() in great detail and there is no reason to repeat this information here. Different R wrapper functions are provided with the randomSurvivalForest package to aid in interpreting the object produced by rsf(). The examples given below illustrate how some of these wrappers work, and also indicate how rsf() might be used in practice.

Lung-vet data

For our first example, we use the well known veteran's administration lung cancer data from Kalbfleisch and Prentice (Kalbfleisch and Prentice, 1980). This is an example data set available within the package. In total there are 6 predictors in the data. We first focus on analysis that includes only Karnofsky score as a predictor:

> library("randomSurvivalForest")
> data(veteran, package = "randomSurvivalForest")
> ntree <- 1000
> v.out <- rsf(Survrsf(time,status) ~ karno,
+     veteran, ntree = ntree, forest = T)
> print(v.out)

Call:
rsf.default(formula = Survrsf(time, status)
    ~ karno, data = veteran, ntree = ntree)

                         Sample size: 137
                    Number of deaths: 128
                     Number of trees: 1000
          Minimum terminal node size: 3
       Average no. of terminal nodes: 8.437
No. of variables tried at each split: 1
              Total no. of variables: 1
                      Splitting rule: logrank
              Estimate of error rate: 36.28%
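Steps 1-5 can be made concrete with a small self-contained sketch. This is our illustration, not the package's internal code: `mort` stands in for the predicted outcome (larger = worse), and for tied observed times we treat the death as having the shorter event time, which is an assumption on our part.

```r
## Harrell's concordance error rate, following steps 1-5 above
## (illustration only; 'mort' is a predicted mortality, larger = worse).
concordance.error <- function(time, status, mort) {
  conc <- 0; perm <- 0
  n <- length(time)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (time[i] == time[j]) {
        ## tied times permissible only if exactly one is a death;
        ## the death is treated as the shorter event time (our assumption)
        if (status[i] + status[j] != 1) next
        s <- if (status[i] == 1) i else j
        l <- if (status[i] == 1) j else i
      } else {
        s <- if (time[i] < time[j]) i else j   # shorter observed time
        l <- if (time[i] < time[j]) j else i
        if (status[s] == 0) next               # shorter time censored: omit
      }
      perm <- perm + 1                         # step 2: permissible pair
      if (mort[s] > mort[l]) conc <- conc + 1  # step 3
      else if (mort[s] == mort[l]) conc <- conc + 0.5
    }
  }
  1 - conc / perm                              # steps 4 and 5
}

## a perfect predictor gives error 0; reversing it gives error 1
time   <- c(1, 2, 4, 6)
status <- c(1, 1, 1, 1)
concordance.error(time, status, mort = c(9, 7, 4, 2))   # 0
concordance.error(time, status, mort = c(2, 4, 7, 9))   # 1
```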
The error rate is significantly smaller than 0.5, the benchmark value associated with a procedure no better than flipping a coin. This is very strong evidence that Karnofsky score is predictive.

We can investigate the effect of Karnofsky score more closely by considering how the ensemble estimated mortality varies as a function of the predictor:

> plot.variable(v.out, partial=T)

Figure 1, produced by the above command, is a partial plot of Karnofsky score. The vertical axis represents expected number of deaths.

(Figure 1: partial plot of expected number of deaths against Karnofsky score; image not reproduced.)

...under each of the four splitting rules. For each splitting rule we run 100 replications and record the mean

> splitrule <- c("logrank", "conserve",

The analysis shows that logrankscore has the best predictive performance (logrank is a close second). Standard deviations in all cases are reasonably small. It is interesting to observe that the mean error rates are not substantially smaller than our previous analysis, which used only Karnofsky score, thus indicating the predictor is highly influential. Our next example illustrates further techniques for studying the informativeness of a predictor.

Primary biliary cirrhosis (PBC) of the liver

Next we consider the PBC data set found in appendix D.1 of Fleming and Harrington (Fleming and Harrington, 1991). This is also an example data set available in the package. Similar to the previous analysis we analyzed the data by running a forest analysis for each of the four splitting rules, repeating the analysis 100 times independently (as before ntree was set to 1000). The R code is similar as before and suppressed:

(Figure: plot of error rate for the PBC predictors age, prothrombin, copper, chol, ascites, spiders, platelet, sgot, and trig; image not reproduced.)
...Figure 3. The figure was produced using the command:

> plot.variable(pbc.out, 3, partial=T, n.pred=6)

We now consider the incremental effect of each predictor using a nested analysis. We sort predictors by their importance values and consider the nested sequence of models starting with the top variable, followed by the model with the top 2 variables, then the model with the top three variables, and so on:

> imp <- pbc.out$importance
> pnames <- pbc.out$predictorNames
> pnames.order <- pnames[rev(order(imp))]
> n.pred <- length(pnames)
> pbc.err <- rep(0, n.pred)
> for (k in 1:n.pred) {
+   rsf.f <- "Survrsf(days,status)~"
+   rsf.f <- as.formula(paste(rsf.f,
+     paste(pnames.order[1:k], collapse="+")))
+   pbc.err[k] <- rsf(rsf.f, pbc, ntree=ntree,
+     splitrule="logrankapprox")$err.rate[ntree]
+ }
> pbc.imp.out <- as.data.frame(
+   cbind(round(rev(sort(imp)), 4),
+     round(pbc.err, 4),
+     round(-diff(c(0.5, pbc.err)), 4)),
+   row.names=pnames.order)
> colnames(pbc.imp.out) <-
+   c("Imp", "Err", "Drop Err")
> print(pbc.imp.out)

                Imp    Err Drop Err
age          0.0130 0.3961   0.1039
bili         0.0081 0.1996   0.1965
prothrombin  0.0052 0.1918   0.0078
copper       0.0042 0.1685   0.0233
edema        0.0030 0.1647   0.0038
albumin      0.0026 0.1569   0.0078
chol         0.0022 0.1606  -0.0037
ascites      0.0014 0.1570   0.0036
spiders      0.0013 0.1601  -0.0030
stage        0.0013 0.1557   0.0043
hepatom      0.0008 0.1570  -0.0013
sex          0.0006 0.1549   0.0021
platelet     0.0000 0.1565  -0.0016
sgot        -0.0012 0.1538   0.0027
trig        -0.0017 0.1545  -0.0007
treatment   -0.0019 0.1596  -0.0052
alk         -0.0040 0.1565   0.0032

The first column is the importance value of a predictor in the full model. The kth value in the second column is the error rate for the kth nested model, while the kth value in the third column is the difference between the error rate for the kth and (k − 1)th nested model, where the error rate for the null model, k = 0, is 0.5. One can see not much is gained by using more than 6-7 predictors and that the top 3-4 predictors account for much of the predictive power.

Large scale problems

In terms of computationally challenging problems, we have applied randomSurvivalForest successfully to several large survival datasets. For example, we have considered data collected at the Cleveland Clinic involving over 20,000 records and well over 60 predictors. We have also analyzed a data set containing 1,000 records and with almost 250 predictors. Our success with these applications is consistent with that seen for Random Forests: namely, the methodology has been shown to scale up very nicely, even in very large predictor spaces and with large sample sizes. In terms of computational speed, we have found that logrankapprox is almost always fastest; after that, conserve is second fastest. For very large datasets, discretizing continuous predictors and/or the observed survival times can greatly speed up computation. Discretization does not have to be overly granular for substantial gains to be seen.

Acknowledgements

The authors are extremely grateful to Eugene H. Blackstone and Michael S. Lauer for their tremendous support, feedback, and generous time given to us throughout this project. Without their help this project would surely never have been completed. We also thank the referee of the paper and Paul Murrell for their constructive and helpful comments. This research was supported by the National Institutes of Health RO1 grant HL-072771.

Bibliography

L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.

L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.

D. R. Cox and D. Oakes. Analysis of Survival Data. Chapman and Hall, London, 1998.

T. Fleming and D. Harrington. Counting Processes and Survival Analysis. Wiley, New York, 1991.

F. Harrell, R. Califf, D. Pryor, K. Lee, and R. Rosati. Evaluating the yield of medical tests. J. Amer. Med. Assoc., 247:2543–2546, 1982.

T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional inference framework. J. Comp. Graph. Statist., 15:651–674, 2006.

T. Hothorn and B. Lausen. On the exact distribution of maximally selected rank statistics. Comput. Statist. Data Analysis, 43:121–137, 2003.
(Figure 3: six partial-plot panels for age, bili, prothrombin, copper, edema, and albumin; images not reproduced.)
Figure 3: Partial plots for top six predictors from PBC data. Values on the vertical axis represent expected
number of deaths for a given predictor, after adjusting for all other predictors. Dashed red lines for continuous
predictors are ± 2 standard error bars.
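The partial-plot values in Figure 3 are, in essence, predictions averaged over the empirical distribution of the remaining predictors. The following generic sketch is ours (the function and variable names are made up, and an lm fit stands in for the forest; it is not the randomSurvivalForest internals): for each grid value v, set the chosen predictor to v for every observation, predict, and average.

```r
## Generic partial-plot computation (illustration only).
partial.values <- function(fit, data, pred.name, grid) {
  sapply(grid, function(v) {
    d <- data
    d[[pred.name]] <- v                 # fix the predictor at v everywhere
    mean(predict(fit, newdata = d))     # average prediction over the data
  })
}

## stand-in model: an exact linear relationship fit by lm
d <- data.frame(x = 1:20, z = rnorm(20))
d$y <- 2 * d$x + d$z
fit <- lm(y ~ x + z, data = d)
pv <- partial.values(fit, d, "x", grid = c(5, 10))
pv   # for this exact linear fit, the two values differ by 10 (= 2 * 5)
```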
H. Ishwaran, E. Blackstone, M. Lauer, and C. Pothier. Relative risk forests for exercise heart rate recovery as a predictor of mortality. J. Amer. Stat. Assoc., 99:591–600, 2004.

J. Kalbfleisch and R. Prentice. The Statistical Analysis of Failure Time Data. Wiley, New York, 1980.

M. Kattan, K. Hess, and J. Beck. Experiments to determine whether recursive partitioning (CART) or an artificial neural network overcomes theoretical limitations of Cox proportional hazards regression. Computers and Biomedical Research, 31:363–373, 1998.

M. LeBlanc and J. Crowley. Relative risk trees for censored survival data. Biometrics, 48:411–425, 1992.

M. LeBlanc and J. Crowley. Survival trees by goodness of split. J. Amer. Stat. Assoc., 88:457–467, 1993.

M. May, P. Royston, M. Egger, A. C. Justice, and J. A. C. Sterne. Development and validation of a prognostic model for survival time data: application to prognosis of HIV positive patients treated with antiretroviral therapy. Statist. Medicine, 23:2375–2398, 2004.

D. Naftel, E. Blackstone, and M. Turner. Conservation of events, 1985. Unpublished notes.

R. Schapire, Y. Freund, P. Bartlett, and W. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist., 26(5):1651–1686, 1998.

M. R. Segal. Regression trees for censored data. Biometrics, 44:35–47, 1988.

T. Therneau and E. Atkinson. An introduction to recursive partitioning using the rpart routine. Technical Report 61, Section of Biostatistics, Mayo Clinic, Rochester, 1997.

Hemant Ishwaran
Department of Quantitative Health Sciences
Cleveland Clinic, Cleveland, U.S.A.
hemant.ishwaran@gmail.com

Udaya B. Kogalur
Kogalur Shear Corporation
Clemmons, U.S.A.
ubk@kogalur-shear.com
Figure 1: Screenshot showing the web page of Rwui where input items are added.
...gramming web applications that run R code. With these methods, whoever is creating the application must write web application code. In contrast, when using Rwui no web application code needs to be written; a web interface for an R script is created simply by selecting options from a sequence of web pages.

Using the Application

The information that Rwui requires in order to create a web application for running an R script is entered on a sequence of forms. After entering a title and introductory text for the application, the input items that will appear on the application's web page are selected. Input items may be Numeric or Text entry boxes, Checkboxes, Drop-down lists, Radio Buttons, File Upload boxes and a Multiple/Replicate File Upload page. Each of the input variables of the R script, that is, those variables in the script that require a value supplied by the user, must have a corresponding input item on the application's web page. Figure 1 shows the web page of Rwui on which the input items that will appear in the application are added. Section headings can also be added if required.

Rwui displays a facsimile of the web page that has been created as items are added to the page. This can be seen in the lower part of the screenshot shown in Figure 1. Input items are given a number so that items can be deleted and new items inserted between existing items. After uploading the R script, Rwui generates the web application, which can be downloaded as a zip or tgz file.

Rwui creates Java based applications that use the Apache Struts framework (http://struts.apache.org/). Struts is open source and a popular framework for constructing well-organised, stable and extensible web applications. The completed applications will run on a Tomcat server. All that needs to be done to use the downloaded web application is to place the application's 'web application archive' file, which is contained in the download, into a subdirectory of the Tomcat server. In addition, the file permissions of an included shell script must be changed to executable.
Figure 2: Screenshot showing an example of a results web page of an application created using Rwui.
Using applications created by Rwui

A demonstration application created by Rwui can be accessed from the 'Help' page of Rwui.

Applications created by Rwui can include a login page. Access can be controlled either by a single password, or by username/password pairs.

If an application created by Rwui includes a Multiple/Replicate File upload page, then the application consists of two web pages on which the user enters information. On the first web page the user uploads multiple files one at a time. Once completed, a button takes the user to a second web page where singleton data files and values for all other variables are entered. The 'Analyse' button on this page submits the values of variables, uploads any data files and runs the R script. If a Multiple/Replicate File upload page is not included, the application consists of this second web page only.

Before running the R script the application first checks the validity of the values that the user has entered and returns an error message to the page if any are invalid. During the analysis, progress information can be displayed for the user. To enable this the R script must append information to a text file at stages during its execution. This text file is displayed for the user in a JavaScript pop-up window which refreshes at fixed intervals.

On completion of the analysis, a link to a Results page appears at the bottom of the web page. The user can change data files and/or the values of any of the variables and re-analyse, and the new results will appear as a second link at the bottom of the page, and so on. Clicking on a link brings up the Results page for the corresponding analysis. Figure 2 shows an example of a results page.

The user can download individual results files by clicking on the name of the appropriate file on a Results page. Alternatively, each Results page also contains a link which will download all the results files from the page and the html of the page itself. In this way the user can view offline saved Results pages with their associated results files. Any uploaded data files are also linked to on the Results page, giving the user the opportunity to check that the correct data has been submitted and has been uploaded correctly.

The applications include a SessionListener which detects when a user's session is about to expire. The SessionListener then removes all the working directories created during the session from the server, to prevent it from becoming clogged with data. By default a session expires 30 minutes after it was last accessed by the user, but the delay can be changed in the server configuration.

All the code of a complete demonstration application created by Rwui can be downloaded from a link on the 'Help' page of Rwui.

Acknowledgment

RN was funded as part of a Wellcome Trust Functional Genomics programme grant.

Richard Newton
School of Crystallography
Birkbeck College, University of London, UK
r.newton@mail.cryst.bbk.ac.uk

Lorenz Wernisch
MRC Biostatistics Unit, Cambridge, UK
lorenz.wernisch@mrc-bsu.cam.ac.uk
Practitioners often encounter a mix of categorical and continuous data types, but may still wish to proceed in a nonparametric direction. The traditional nonparametric approach is called a 'frequency' approach, whereby data is broken up into subsets ('cells') corresponding to the values assumed by the categorical variables, and only then do you apply, say, density or locpoly to the continuous data remaining in each cell. Nonparametric frequency approaches are widely acknowledged to be unsatisfactory. Recent theoretical developments offer practitioners a variety of kernel-based methods for categorical (i.e., unordered and ordered factors) data only, or for a mix of continuous and categorical data; see Li and Racine (2007) and the references therein for an in-depth treatment of these methods, and also see the articles listed in the bibliography below.

The np package implements recently developed kernel methods that seamlessly handle the mix of continuous, unordered, and ordered factor data types often found in applied settings. The package also allows users to create their own routines using high-level function calls rather than writing their own C or Fortran code.² The design philosophy underlying np is simply to provide an intuitive, flexible, and extensible environment for applied kernel estimation.

² The high-level functions found in the package in turn call compiled C code, allowing users to focus on the application rather than the implementation details.

Currently, a range of methods can be found in the np package, including unconditional (Li and Racine, 2003; Ouyang et al., 2006) and conditional (Hall et al., 2004; Racine et al., 2004) density estimation and bandwidth selection, conditional mean and gradient estimation (local constant (Racine and Li, 2004; Hall et al., forthcoming) and local polynomial (Li and Racine, 2004)), conditional quantile and gradient estimation (Li and Racine, forthcoming), model specification tests (regression (Hsiao et al., forthcoming), quantile, significance (Racine et al., forthcoming)), semiparametric regression (partially linear, index models, average derivative estimation, varying/smooth coefficient models), etc.

Before proceeding, we caution the reader that data-driven bandwidth selection methods can be numerically intensive, which is the reason underlying the development of an MPI-aware³ version of the np package that uses some of the functionality of the Rmpi package, which we have tentatively called the npRmpi package. The functionality of np and npRmpi will be identical; however, using npRmpi you could take advantage of a cluster computing environment or a multi-core/multi-cpu desktop machine, thereby alleviating the computational burden associated with the nonparametric analysis of large datasets. We ought also to point out that data-driven (i.e., automatic) bandwidth selection procedures are not guaranteed always to produce good results. For this reason, we advise the reader to interrogate their bandwidth objects with the summary command, which produces a table of the bandwidths for the continuous variables scaled by an appropriate constant (σ_x n^α, where α depends on the kernel order and number of continuous variables, e.g., α = −1/5 for one continuous variable and a second order kernel), which some readers may find helpful. Also, the admissible range for the bandwidths for the categorical variables is provided when summary is used, which some readers may also find helpful.

³ MPI is a library specification for message-passing, and has emerged as a dominant paradigm for portable parallel programming.

We have tried to make np flexible enough to be of use to a wide range of users. All options can be tweaked by users (kernel function, kernel order, bandwidth type, estimator type and so forth). One function, npksum, allows you to create your own estimators, tests, etc. The function npksum is simply a call to highly optimized C code, so you get the benefits of compiled code along with the power and flexibility of the R language. We hope that incorporating the npksum function renders the package suitable for teaching and research alike.

There are two ways in which you can interact with functions in np: i) using data frames, or ii) using a formula interface, where appropriate.

To some, it may be natural to use the data frame interface. The R data.frame function preserves a variable's type once it has been cast (unlike cbind, which we avoid for this reason). If you find this most natural for your project, you first create a data frame casting data according to their type (i.e., one of continuous (default), factor, ordered), as in

data.object <- data.frame(x1=factor(x1),
                          x2, x3=ordered(x3))

where x1 is, say, a binary factor, x2 continuous, and x3 an ordered factor. Then you could pass this data frame to the appropriate np function, say

npudensbw(dat=data.object)

To others, however, it may be natural to use the formula interface that is used for the regression example outlined below. For nonparametric regression functions such as npregbw, you would proceed just as you would using lm, e.g.,

npregbw(y ~ factor(x1) + x2)

except that you would of course not need to specify, e.g., polynomials in variables or interaction terms. Every function in np supports both interfaces, where appropriate.

Before proceeding, a few words on the formula interface are in order. We use the standard formula interface as it provides capabilities for handling missing observations and so forth. This interface, however, is simply a convenient device for telling a routine which variable is, say, the outcome
and which are, say, the covariates. That is, just because one writes x1 + x2 in no way means or is meant to imply that the model will be linear and additive (why use fully nonparametric methods to estimate such models in the first place?). It simply means that there are, say, two covariates in the model, the first being x1 and the second x2; we are passing them to a routine with the formula interface, and nothing more is presumed or implied.

Regardless of whether you use the data frame or formula interface, workflow for nonparametric estimation in np typically proceeds as follows:

1. compute appropriate bandwidths;

2. estimate an object and extract fitted or predicted values, standard errors, etc.;

3. optionally, plot the object.

In order to streamline the creation of a set of complicated graphics objects, plot (which calls npplot) is dynamic; i.e., you can specify, say, bootstrapped error bounds and the appropriate routines will be called in real time.

Efficient nonparametric regression in the presence of qualitative data

We begin with a motivating example that demonstrates the potential benefits arising from the use of kernel smoothing methods that smooth both the qualitative and quantitative variables in a particular manner; see the subsequent section for more detailed information regarding the method itself. For what follows, we consider an application taken from Wooldridge (2003, pg. 226) that involves multiple regression analysis with qualitative information.

We consider modeling an hourly wage equation for which the dependent variable is log(wage) (lwage) while the explanatory variables include three continuous variables, namely educ (years of education), exper (the number of years of potential experience), and tenure (the number of years with their current employer), along with two qualitative variables, female ("Female"/"Male") and married ("Married"/"Notmarried"). For this example there are n = 526 observations.

The classical parametric approach towards estimating such relationships requires that one first specify the functional form of the underlying relationship. We start by first modelling this relationship using a simple parametric linear model. By way of example, Wooldridge (2003, pg. 222) presents the following:

> data("wage1")
>
> model.ols <- lm(lwage ~ factor(female) +
+     factor(married) + educ + exper + tenure,
+     data=wage1)
>
> summary(model.ols)
....

This model is, however, restrictive in a number of ways. First, the analyst must specify the functional form (in this case linear) for the continuous variables (educ, exper, and tenure). Second, the analyst must specify how the qualitative variables (female and married) enter the model (in this case they affect the model's intercepts only). Third, the analyst must specify the nature of any interactions among all variables, quantitative and qualitative (in this case, there are none). Should any of these assumptions be incorrect, then the estimated model will be biased and inconsistent, potentially leading to faulty inference.

One might next test the null hypothesis that this parametric linear model is correctly specified using the consistent model specification test found in Hsiao et al. (forthcoming) that admits both categorical and continuous data (this example will likely take a few minutes on a desktop computer as it uses bootstrapping and cross-validated bandwidth selection):

> data("wage1")
> attach(wage1)
>
> model.ols <- lm(lwage ~ factor(female) +
+     factor(married) + educ + exper + tenure,
+     x=TRUE, y=TRUE)
>
> X <- data.frame(factor(female),
+     factor(married), educ, exper, tenure)
>
> output <- npcmstest(model=model.ols,
+     xdat=X, ydat=lwage, tol=.1, ftol=.1)
> summary(output)

Consistent Model Specification Test
....
IID Bootstrap (399 replications)
Test Statistic 'Jn': 5.594745e+00
P Value: 0.000000e+00
....

This naïve linear model is rejected by the data (the P-value for the null of correct specification is < 0.001), hence one might proceed instead to model this relationship using kernel methods.

As noted, the traditional nonparametric approach towards modeling relationships in the presence of qualitative variables requires that you first split your
lowing model:4 data into subsets containing only the continuous
4 We would like to thank Jeffrey Wooldridge for allowing us to use his data. Also, we would like to point out that Wooldridge starts
out with this naïve linear model, but quickly moves on to a more realistic model involving nonlinearities in the continuous variables and
so forth.
variables of interest (lwage, exper, and tenure). For instance, we would have four such subsets: a) n = 132 observations for married females, b) n = 120 observations for single females, c) n = 86 observations for single males, and d) n = 188 observations for married males. One would then construct smooth nonparametric regression models for each of these subsets and proceed with the analysis in this fashion. However, this may lead to a loss in efficiency due to a reduction in the sample size, leading to overly variable estimates of the underlying relationship.

Instead, however, we could construct smooth nonparametric regression models by i) using a smoothing function that is appropriate for the qualitative variables, such as that proposed by Aitchison and Aitken (1976), and ii) modifying the nonparametric regression model as was done by Li and Racine (2004). One can then conduct sound nonparametric estimation based on the n = 526 observations rather than resorting to sample splitting. The rationale for this lies in the fundamental concept that doing so may introduce potential bias, but it will always reduce variability, thus leading to potential finite-sample efficiency gains. Our experience has been that the potential benefits arising from this approach more than offset the potential costs in finite-sample settings.

Next, we consider using the local-linear nonparametric method described in Li and Racine (2004). For the reader's convenience we supply precomputed cross-validated bandwidths which are automatically loaded when one reads the wage1 dataset. The commented out code that generates the cross-validated bandwidths is also provided should the reader wish to fully replicate the example (recall being cautioned about the computational burden associated with multivariate data-driven bandwidth methods).

> data("wage1")
>
> #bw.all <- npregbw(lwage ~ factor(female) +
> # factor(married) + educ + exper + tenure,
> # regtype="ll", bwmethod="cv.aic",
> # data=wage1)
>
> model.np <- npreg(bws=bw.all)
>
> summary(model.np)

Regression Data: 526 training points,
in 5 variable(s)
....
Kernel Regression Estimator: Local-Linear
Bandwidth Type: Fixed
Residual standard error: 1.371530e-01
R-squared: 5.148139e-01
....

Note that the bandwidth object is the only thing you need to pass to npreg as it encapsulates the kernel types, regression method, and so forth. You could, of course, use npreg with a formula and perhaps manually specify the bandwidths using a bandwidth vector if you so chose. We have tried to make each function as flexible as possible to meet the needs of a variety of users.

The goodness of fit of the nonparametric model (R² = 51.5%) is better than that for the parametric model (R² = 40.4%). In order to investigate whether this apparent improvement reflects overfitting or simply that the nonparametric model is in fact more faithful to the underlying data generating process, we shuffled the data and created two independent samples, one of size n₁ = 400 and one of size n₂ = 126. We fit the models on the n₁ training observations, then evaluate the models on the n₂ independent hold-out observations using the predicted square error criterion, namely

    n₂⁻¹ ∑_{i=1}^{n₂} (y_i − ŷ_i)²,

where the y_i are the lwage values for the hold-out observations and the ŷ_i are the predicted values. Finally, we compare the parametric model, the nonparametric model that smooths both the qualitative and quantitative variables, and the traditional frequency nonparametric model that breaks the data into subsets and smooths the quantitative data only.

> data("wage1")
> set.seed(123)
>
> # Shuffle the data and create two datasets...
>
> ii <- sample(seq(nrow(wage1)),replace=FALSE)
>
> wage1.train <- wage1[ii[1:400],]
> wage1.eval <- wage1[ii[401:nrow(wage1)],]
>
> # Compute the parametric model for the
> # training data...
>
> model.ols <- lm(lwage ~ factor(female) +
+ factor(married) + educ + exper + tenure,
+ data=wage1.train)
>
> # Obtain the out-of-sample predictions...
>
> fit.ols <- predict(model.ols,
+ data=wage1.train, newdata=wage1.eval)
>
> # Compute the predicted square error...
>
> pse.ols <- mean((wage1.eval$lwage -
+ fit.ols)^2)
>
> #bw.subset <- npregbw(
> # lwage~factor(female) + factor(married) +
> # educ + exper + tenure, regtype="ll",
> # bwmethod="cv.aic", data=wage1.train)
>
> model.np <- npreg(bws=bw.subset)
>
> # Obtain the out-of-sample predictions...
>
> fit.np <- predict(model.np,
+ data=wage1.train, newdata=wage1.eval)
>
> # Compute the predicted square error...
>
> pse.np <- mean((wage1.eval$lwage -
+ fit.np)^2)
>
> # Do the same for the frequency estimator...
> # We do this by setting lambda=0 for the
> # qualitative variables...
>
> bw.freq <- bw.subset
> bw.freq$bw[1] <- 0
> bw.freq$bw[2] <- 0
>
> model.np.freq <- npreg(bws=bw.freq)
>
> # Obtain the out-of-sample predictions...
>
> fit.np.freq <- predict(model.np.freq,
+ data=wage1.train, newdata=wage1.eval)
>
> # Compute the predicted square error...
>
> pse.np.freq <- mean((wage1.eval$lwage -
+ fit.np.freq)^2)
>
> # Compare performance for the
> # hold-out data...
>
> pse.ols
[1] 0.1469691
> pse.np
[1] 0.1344028
> pse.np.freq
[1] 0.1450111

The predicted square error on the hold-out data was 0.147 for the parametric linear model, 0.145 for the traditional nonparametric estimator that splits the data into subsets, and 0.134 for the nonparametric estimator that uses the full sample but smooths both the qualitative and quantitative data. We therefore conclude that the nonparametric model that smooths both the continuous and qualitative variables appears to provide a better description of the underlying data generating process than either the nonparametric model that uses sample splitting or the naïve linear model.

Note that for this example we have only four cells. If one used all qualitative variables included in the dataset (16 in total), one would have 65,536 cells, many of which would be empty, and most having far too few observations to provide meaningful nonparametric estimates. As the number of qualitative variables increases, the difference between the estimator that smooths both continuous and discrete variables in a particular manner and the traditional estimator that relies upon sample splitting will become even more pronounced.

We now briefly outline the essence of kernel smoothing of categorical data for the interested reader.

Categorical data kernel methods

To introduce the R user to the basic concept of kernel smoothing of categorical random variables, consider a kernel estimator of a probability function, i.e. p(x^d) = P(X^d = x^d), where X^d is an unordered (i.e., discrete) categorical variable assuming values in, say, {0, 1, 2, ..., c − 1} where c is the number of possible outcomes of the discrete random variable X^d. Aitchison and Aitken (1976) proposed a kernel estimator of p(x^d) given by

    p̂(x^d) = (1/n) ∑_{i=1}^{n} L(X_i^d = x^d),

where L(·) is a kernel function defined by, say,

    L(X_i^d = x^d) = 1 − λ         if X_i^d = x^d,
                     λ/(c − 1)     otherwise,

and where λ is a 'smoothing' parameter that can be selected via a variety of data-driven methods; see also Ouyang et al. (2006) for theoretical underpinnings of least-squares cross-validation in this context.

The smoothing parameter λ is restricted to lie in [0, (c − 1)/c] for the kernel given above. Note that when λ = 0, then L(X_i^d = x^d) becomes an indicator function, and hence p̂(x^d) would be the standard frequency probability estimator (i.e., the sample proportion of X_i^d = x^d). If, on the other hand, λ = (c − 1)/c, its upper bound, then L(X_i^d = x^d) becomes a constant function and p̂(x^d) = 1/c for all x^d, which is the discrete uniform distribution. Data-driven methods for selecting λ are based on minimizing expected square error loss, among other criteria.

A number of issues surrounding this estimator are noteworthy: i) this approach extends trivially to a general multivariate setting, and in fact is best suited to such settings; ii) to deal with ordered variables you simply use an 'ordered kernel'; iii) there is no "curse of dimensionality" associated with smoothing discrete probability functions – the estimators are √n consistent, just like their parametric counterparts; and iv) you can "mix-and-match" data types using "generalized product kernels" as we describe in the next section.

Why would anyone want to smooth a probability function in the first place? For finite-sample efficiency reasons of course. That is, smoothing a probability function may introduce some finite-sample
bias, but it always reduces variability. In settings where you are modeling a mix of categorical and continuous data or where one has sparse multivariate categorical data, the efficiency gains can be substantial indeed.

Mixed data kernel methods

Estimating a joint density function defined over mixed data (i.e., both categorical and continuous) follows naturally using generalized product kernels. For example, for one discrete variable x^d and one continuous variable x^c, our kernel estimator of the joint PDF would be

    f̂(x^d, x^c) = (1/(n h_{x^c})) ∑_{i=1}^{n} L(X_i^d = x^d) W((X_i^c − x^c)/h_{x^c}),

where L(X_i^d = x^d) is a categorical data kernel function, while W[(X_i^c − x^c)/h_{x^c}] is a continuous data kernel function (e.g., Epanechnikov, Gaussian, or uniform) and h_{x^c} is the bandwidth for the continuous variable; see Li and Racine (2003) for further details.

Once we can consistently estimate a joint density function defined over mixed data, we can then proceed to estimate a range of statistical objects of interest to practitioners. Some mainstays of applied data analysis include estimation of regression functions and their derivatives, conditional density functions and their quantiles, conditional mode functions (i.e., count data models, probability models), and so forth, each of which is currently implemented.

Nonparametric estimation of binary outcome and count data models

For what follows, we adopt the conditional probability estimator proposed in Hall et al. (2004) to estimate a nonparametric model of a binary outcome when there exist a number of categorical covariates. For this example, we use the birthwt data taken from the MASS package, and compute a parametric Logit model and a nonparametric conditional mode model. We then compare their confusion matrices and assess their classification ability. The outcome is an indicator of low infant birthweight (0/1).

> library("MASS")
> attach(birthwt)
>
> # First, we model this with a simple
> # parametric Logit model...
>
> model.logit <- glm(low ~ factor(smoke) +
+ factor(race) + factor(ht) + factor(ui)+
+ ordered(ftv) + age + lwt, family=binomial)
>
> # Now we model this with the nonparametric
> # conditional density estimator and compute
> # the conditional mode...
>
> bw <- npcdensbw(factor(low) ~
+ factor(smoke) + factor(race) +
+ factor(ht) + factor(ui)+ ordered(ftv) +
+ age + lwt, tol=.1, ftol=.1)
> model.np <- npconmode(bws=bw)
>
> # Finally, we compare confusion matrices
> # from the logit and nonparametric models...
>
> # Parametric...
>
> cm <- table(low,
+ ifelse(fitted(model.logit) > 0.5, 1, 0))
> cm

low    0   1
  0  119  11
  1   34  25
>
> # Nonparametric...
>
> summary(model.np)

Conditional Mode data: 189 training points,
in 8 variable(s)
....
Bandwidth Type: Fixed

Confusion Matrix
      Predicted
Actual   0   1
     0 127   3
     1  27  32
....

For this example the nonparametric model is better able to predict low birthweight infants than its parametric counterpart, correctly predicting 159/189 birthweights compared with 144/189 for the parametric model.

Visualizing and summarizing nonparametric results

Summarizing nonparametric results and comparing them with parametric ones is perhaps one of the main challenges faced by the practitioner. We have attempted to provide a range of summary measures and plotting methods to facilitate these tasks.

The plot function (which calls npplot) can be used on most nonparametric objects. In order to further facilitate comparison of parametric with nonparametric results, we also compute a number of summary measures of goodness of fit, normalized bandwidths ('scale factors') and so forth that are accessed by the summary command, while objects generated by np also contain various alternative measures of goodness of fit and the like. We again consider the approach detailed in Li and Racine (2004) which conducts local linear kernel regression with a mix of discrete and continuous data types on a popular macroeconomic dataset. The original study assessed the impact of Organisation for Economic Co-operation and Development (OECD) membership on a country's growth rate when controlling for a range of factors such as population growth rates, stock of capital (physical and human) and so forth. For this example we shall use the formula interface rather than the data frame interface. For this example, we have already computed the cross-validated bandwidths which are loaded when one reads the oecdpanel dataset.

> data("oecdpanel")
> attach(oecdpanel)
>
> #bw <- npregbw(growth ~ factor(oecd) +
> # ordered(year) + initgdp + popgro + inv +
> # humancap, bwmethod="cv.aic",
> # regtype="ll")
>
> summary(bw)

Regression Data (616 observations,
6 variable(s)):

Regression Type: Local-Linear
....

>
> model <- npreg(bw)
>
> summary(model)

Regression Data: 616 training points,
in 6 variable(s)
....
>
> plot(bw,
+ plot.errors.method="bootstrap",
+ plot.errors.boot.num=25)
>

We display partial regression plots in Figure 1.5 We also plot bootstrapped variability bounds, where the bootstrapping is done via the boot package thereby facilitating a variety of bootstrap methods. If you prefer gradients rather than levels, you can simply use the argument gradients=TRUE in plot to indicate that you wish to plot partial derivatives rather than the conditional mean function itself.6

Many of the accessor functions such as predict, fitted, and residuals work with most nonparametric objects, where appropriate. As well, gradients is a generic function which extracts gradients from objects. The R function coef can be used for extracting the parametric coefficients from semiparametric models such as those created with npplreg.

Creating mixed data kernel objects via npksum

Rather than being limited to only those kernel methods that exist in np, you could instead create your own mixed data kernel objects. npksum is a function that computes kernel sums on evaluation data, given a set of training data, data to be weighted (optional), and a bandwidth specification (any bandwidth object). npksum uses highly-optimized C code that strives to minimize its memory footprint, while there is low overhead involved when using repeated calls to this function. The convolution kernel option would allow you to create, say, the least squares cross-validation function for kernel density estimation. You can choose powers to which the kernels constituting the sum are raised (default=1), whether or not to divide by bandwidths, whether or not to generate leave-one-out sums, and so on. Three classes of kernel methods for the continuous data types are available: fixed, adaptive nearest-neighbor, and generalized nearest-neighbor. Adaptive nearest-neighbor bandwidths change with each sample realization in the set, x_i, when estimating the kernel sum at the point x. Generalized nearest-neighbor bandwidths change with the point at which the sum is computed, x. Fixed bandwidths are constant over the support of x. A variety of kernels may be specified by users. Kernels implemented for continuous data types include the second, fourth, sixth, and eighth order Gaussian and Epanechnikov kernels, and the uniform kernel. We have tried to anticipate the needs of a variety of users when designing this function. A number of semiparametric estimators and nonparametric tests in the np package in fact make use of npksum.

In the following example, we conduct local-constant (i.e., Nadaraya-Watson) kernel regression. We shall use cross-validated bandwidths drawn from npregbw for this example. Note that we extract the kernel sum from npksum via the '$ksum' return value in both the numerator and denominator of the resulting object. For this example, we use the data frame interface rather than the formula interface, and output from this example is plotted in Figure 2.

5 A "partial regression plot" is simply a 2D plot of the outcome y versus one covariate x_j when all other covariates are held constant at their respective medians (this can be changed by users).
6 plot(bw, gradients=TRUE, plot.errors.method="bootstrap", plot.errors.boot.num=25, common.scale=FALSE) will work
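The two limiting cases of the Aitchison and Aitken (1976) categorical kernel discussed earlier (the frequency estimator at λ = 0 and the discrete uniform at λ = (c − 1)/c) are easy to verify numerically. Since the article's own examples are R, the following is only a self-contained language-neutral sketch (Python, synthetic data); it is not code from the np package.

```python
import numpy as np

def aitchison_aitken(xd, data, lam, c):
    """Kernel estimate p_hat(xd) for an unordered variable with support
    {0, ..., c-1}: weight 1 - lam on matches, lam/(c - 1) otherwise."""
    L = np.where(data == xd, 1.0 - lam, lam / (c - 1))
    return L.mean()

rng = np.random.default_rng(1)
c = 4
data = rng.integers(0, c, 500)          # synthetic categorical sample

p_freq = aitchison_aitken(2, data, lam=0.0, c=c)          # frequency estimator
p_unif = aitchison_aitken(2, data, lam=(c - 1) / c, c=c)  # discrete uniform, 1/c
print(p_freq, p_unif)
```

At λ = 0 the kernel is an indicator, so the estimate equals the sample proportion of the outcome; at the upper bound every observation gets the same weight and the estimate collapses to 1/c.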
Figure 1: Partial regression plots with bootstrapped error bounds based on a cross-validated local linear kernel estimator. (Panels plot growth against factor(oecd), ordered(year), initgdp, popgro, inv, and humancap.)
> data("cps71")
> attach(cps71)
....
+ npksum(txdat=age,bws=1.892169)$ksum
>

[Figure 2: local-constant kernel regression of log(wage) on age for the cps71 data.]
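What the two npksum calls compute are the numerator and denominator of the local-constant (Nadaraya-Watson) estimator, ĝ(x) = ∑_i y_i K((x_i − x)/h) / ∑_i K((x_i − x)/h), here with a generalized product kernel so that a categorical regressor can enter as well. A hedged stand-alone sketch of that ratio-of-kernel-sums construction (Python, synthetic data in place of cps71; bandwidths fixed by hand rather than cross-validated):

```python
import numpy as np

def product_kernel(x, z, xd, zd, h, lam, c):
    """Generalized product kernel: Gaussian in the continuous coordinate,
    Aitchison-Aitken in the unordered categorical one."""
    W = np.exp(-0.5 * ((z - x) / h) ** 2)
    L = np.where(zd == xd, 1.0 - lam, lam / (c - 1))
    return W * L

rng = np.random.default_rng(3)
n, c = 1000, 2
zd = rng.integers(0, c, n)                    # categorical regressor
z = rng.uniform(0.0, 1.0, n)                  # continuous regressor
y = np.sin(2 * np.pi * z) + 0.5 * zd + rng.normal(0.0, 0.1, n)

# Local-constant estimate at (x, xd) as a ratio of two kernel sums,
# mirroring the role of the numerator and denominator npksum calls.
x, xd = 0.25, 1
K = product_kernel(x, z, xd, zd, h=0.05, lam=0.05, c=c)
ghat = np.sum(K * y) / np.sum(K)
print(ghat)   # roughly sin(2*pi*0.25) + 0.5 = 1.5
```

The denominator sum is the (unnormalized) density estimate at the evaluation point; dividing the weighted sum by it gives the conditional mean estimate.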
[Figure: scatterplot with Proportion Democratic on the vertical axis; point labels are unit numbers.]

Ecological Regression

    T_ci = ∑_{r=1}^{R} β_rci X_ri   and   ∑_{c=1}^{C} β_rci = 1

Defining the population cell fractions β_rc such that ∑_{c=1}^{C} β_rc = 1 for every r, ecological regression assumes that β_rc = β_rci for all i, and estimates the regression equations T_ci = ∑_{r=1}^{R} β_rc X_ri + ε_ci. Under the standard linear regression assumptions, including E[ε_ci] = 0 and Var[ε_ci] = σ_c² for all i, these regressions recover the population parameters β_rc. eiPack implements frequentist and Bayesian regression models (via ei.reg and ei.reg.bayes, respectively).

In the Bayesian implementation, we offer two options for the prior on β_rc. As a default, truncate = FALSE uses an uninformative flat prior that provides point estimates approaching the frequentist estimates (even when those estimates are outside the boundaries); choosing truncate = TRUE imposes a uniform prior over the unit hypercube such that all cell fractions are restricted to the range [0, 1].

Output from ecological regression can be summarized numerically just as in lm, or graphically using density plots. We also include functions to calculate estimates and standard errors of shares of a subset of columns in order to address questions such as, "What is the Democratic share of 2-party registration for each group?" For the Bayesian model, densities ....

Figure 2: Density plots of ecological regression output.

Multinomial-Dirichlet (MD) model

In the Multinomial-Dirichlet model proposed by Rosen et al. (2001), the data is expressed as counts and a hierarchical Bayesian model is fit using a Metropolis-within-Gibbs algorithm implemented in C. Level 1 models the observed column marginals as multinomial (and independent across units); the choice of the multinomial corresponds to sampling with replacement from the population. Level 2 models the unobserved row cell fractions as Dirichlet (and independent across rows and units); Level 3 models the Dirichlet parameters as i.i.d. Gamma. More formally, without a covariate, the model is ....
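Under the constancy assumption β_rc = β_rci, the regression equations T_ci = ∑_r β_rc X_ri + ε_ci above are an ordinary no-intercept least-squares problem. A minimal sketch that simulates data satisfying the assumption and recovers β (Python, simulated 2-row data; ei.reg handles the full R × C bookkeeping in eiPack itself):

```python
import numpy as np

# Two rows (e.g. two groups), one column share (e.g. Democratic).
rng = np.random.default_rng(4)
n = 150
beta = np.array([0.7, 0.3])          # true beta_rc, constant across units

x1 = rng.uniform(0.1, 0.9, n)        # row-1 fraction X_1i in each unit
X = np.column_stack([x1, 1.0 - x1])  # row fractions sum to one
T = X @ beta + rng.normal(0.0, 0.01, n)  # observed column marginal T_ci

# No-intercept least squares recovers the population beta_rc.
bhat, *_ = np.linalg.lstsq(X, T, rcond=None)
print(bhat)
```

With small error variance the two coefficients come back close to the population cell fractions, which is exactly the sense in which the regressions "recover" β_rc.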
by Rosen et al. (2001), or they may specify normal priors for each γ_rc and δ_rc as follows:

    γ_rc ∼ N(µ_{γ_rc}, σ²_{γ_rc})
    δ_rc ∼ N(µ_{δ_rc}, σ²_{δ_rc})

As Wakefield (2004) notes, the weak identification that characterizes hierarchical models in the EI con- ....
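The three levels of the MD model described above (Gamma hyperparameters, Dirichlet row fractions, multinomial column counts) can be simulated directly, which is a useful way to see what the hierarchy assumes. A generative sketch only (Python; this is not eiPack's Metropolis-within-Gibbs sampler, and the shape/scale values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
R, C, n = 2, 3, 50                    # rows, columns, units (illustrative)

# Level 3: Dirichlet parameters drawn i.i.d. Gamma.
alpha = rng.gamma(shape=4.0, scale=0.5, size=(R, C))

counts = np.empty((n, C))
for i in range(n):
    # Level 2: unobserved row cell fractions, Dirichlet within each row.
    beta_i = np.vstack([rng.dirichlet(alpha[r]) for r in range(R)])
    x_i = rng.dirichlet(np.ones(R))   # observed row fractions for unit i
    theta_i = x_i @ beta_i            # implied cell probabilities
    # Level 1: observed column marginals are multinomial counts.
    counts[i] = rng.multinomial(1000, theta_i)

print(counts[:3])
```

Because each row of beta_i sums to one and the row fractions sum to one, the implied probabilities theta_i are a proper distribution, so every unit's simulated counts sum to the unit total.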
disk using the option ei.MD.bayes(..., ret.beta = "s"), or discard the unit-level draws entirely using ei.MD.bayes(..., ret.beta = "d"). To reconstruct the chains, users can select the row marginals, column marginals, and units of interest, without reconstructing the entire matrix of unit-level draws:

> read.betas(rows = c("black", "white"),
+ columns = "dem", units = 1:150,
+ dir = getwd())

If users are interested in some function of the unit-level parameters, the implementation of the MD model allows them to define a function in R that will be called from within the C sampling algorithm, in which case the unit-level parameters need not be saved for post-processing.

Acknowledgments

eiPack was developed with the support of the Institute for Quantitative Social Science at Harvard University. Thanks to John Fox, Gary King, Kevin Quinn, D. James Greiner, and an anonymous referee for suggestions and Matt Cox and Bob Kinney for technical advice. For further information, see http://www.olivialau.org/software.

Bibliography

W. T. Cho and A. H. Yoon. Strange bedfellows: Politics, courts and statistics: Statistical expert testimony in voting rights cases. Cornell Journal of Law and Public Policy, 10:237–264, 2001.

O. D. Duncan and B. Davis. An alternative to ecological correlation. American Sociological Review, 18:665–666, 1953.

L. Goodman. Ecological regressions and the behavior of individuals. American Sociological Review, 18:663–664, 1953.

B. Grofman. A primer on racial bloc voting analysis. In N. Persily, editor, The Real Y2K Problem: Census 2000 Data and Redistricting Technology. Brennan Center for Justice, New York, 2000.

K. Imai and Y. Lu. eco: R Package for Fitting Bayesian Models of Ecological Inference in 2x2 Tables, 2005. URL http://imai.princeton.edu/research/eco.html.

A. D. Martin and K. M. Quinn. Applied Bayesian inference in R using MCMCpack. R News, 6:2–7, 2006.

M. Plummer, N. Best, K. Cowles, and K. Vines. CODA: Convergence diagnostics and output analysis for MCMC. R News, 6:7–11, 2006.

O. Rosen, W. Jiang, G. King, and M. A. Tanner. Bayesian and frequentist inference for ecological inference: The R × C case. Statistica Neerlandica, 55(2):134–156, 2001.

J. Wakefield. Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society, 167:385–445, 2004.

Olivia Lau, Ryan T. Moore, Michael Kellermann
Institute for Quantitative Social Science
Harvard University, Cambridge, MA
olivia.lau@post.harvard.edu
ryantmoore@post.harvard.edu
kellerm@fas.harvard.edu
.... for 19 fish species (columns) in 92 sites (rows). jv73$fac.riv is a factor indicating the river corre- ....

.... average flow (m³/s) and average speed (m/s)) for the ....

1. Are the groups of fish species living together (i.e. species communities)?
....
4. Do the species-environment relationships vary (or not) among rivers?

Two randomization procedures are available to test the association between two tables: PROTEST (Jackson, 1995) and RV (Heo and Gabriel, 1998).

[Figure: permutation null distributions (histograms of simulated values, sim) for the PROTEST and RV tests.]

.... table, to study the relationships between the com- ....
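The RV randomization test just mentioned compares an observed matrix correlation between two tables with its distribution under row permutations. A self-contained sketch of the RV coefficient and a permutation p-value (Python, simulated linked tables; ade4's own test functions operate on dudi objects and are not reproduced here):

```python
import numpy as np

def rv(X, Y):
    """RV coefficient between two column-centred tables with matched rows."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sxy = X.T @ Y
    num = np.sum(Sxy * Sxy)                                  # trace(Sxy Syx)
    den = np.sqrt(np.sum((X.T @ X) ** 2) * np.sum((Y.T @ Y) ** 2))
    return num / den

rng = np.random.default_rng(5)
n = 92                                  # e.g. 92 sites
X = rng.normal(size=(n, 3))
Y = X @ rng.normal(size=(3, 4)) + 0.5 * rng.normal(size=(n, 4))  # linked tables

obs = rv(X, Y)
# Permuting the rows of one table breaks the site-by-site link.
null = np.array([rv(X, Y[rng.permutation(n)]) for _ in range(199)])
pval = (1 + np.sum(null >= obs)) / (1 + len(null))
print(obs, pval)
```

Because the two simulated tables share a common signal, the observed RV sits far in the upper tail of the permutation distribution and the test rejects independence.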
s.corcircle    correlation circle
s.distri       cloud of points with frequency distribution by stars and ellipses
s.hist         cloud of points with two marginal histograms
s.image        grid of gray-scale rectangles with contour lines
s.kde2d        cloud of points with kernel density estimation
s.label        cloud of points with labels
....           cloud of points with vectors
s.traject      cloud of points with trajectories

[Figure: panels labelled X axes, Y axes, X and Y canonical weights, eigenvalues, and loadings.]
.... terpret the results. However, the quality of graphs .... sequently impossible to manage relevant graphical outputs for all cases. That is why these generic plot ....

.... pal component analyses with respect to instrumental ....

.... sists in explaining a triplet (X2, Q2, D) by a table ....

This family of methods comprises constrained ordinations, among which redundancy analysis (van den Wollenberg, 1977) and canonical correspondence analysis (Ter Braak, 1986) are the most frequently used in ecology. Note that canonical correspondence analysis can also be performed using the cca wrapper function which takes two tables as arguments. The example given below is then exactly equivalent to ....

Figure 4: Plot of a CCA seen as a particular case of PCAIV: environmental variables loadings and correlations with CCA axes, projection of principal axes on CCA axes, species scores, eigenvalues screeplot, and joint display of the rows of the two tables (position of the sites by averaging (points) and by regression (arrow tips)).
Orthogonal analysis (pcaivortho) removes the effect of independent variables and corresponds to the triplet (P⊥X1 X2, Q2, D) where P⊥X1 = I − PX1. Between-class (between) and within-class (within) analyses (see Chessel et al., 2004, for details) are particular cases of PCAIV and orthogonal PCAIV when there is only one categorical variable (i.e. factor) in X1. Within-class analyses take into account a partition of individuals into groups and focus on structures which are common to all groups. They can be seen as a first step to K-table methods.

wit1 <- within(coa1, fac = jv73$fac.riv,
    scannf = FALSE)
plot(wit1)

Each statistical triplet corresponds to a separate analysis (e.g., principal component analysis, correspondence analysis ...). The common dimension of the K statistical triplets is the rows of the tables, which can represent individuals (samples, statistical units) or variables. Utilities for building and manipulating ktab objects are available. K-tables can be constructed from a list of tables (ktab.list.df), a list of dudi objects (ktab.list.dudi), a within-class analysis (ktab.within) or by splitting a table (ktab.data.frame). Generic functions to transpose (t.ktab), combine (c.ktab) or extract elements ([.ktab) are also available. The sepan function can be used to compute automatically the K separate analyses.

kt1 <- ktab.within(wit1)
sep1 <- sepan(kt1)
kplot.sepan.coa(sep1, permute.row.col = TRUE)
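The projector P⊥X1 = I − PX1 used by pcaivortho above is the standard orthogonal projector onto the complement of the column space of X1, with PX = X(XᵀX)⁻¹Xᵀ. A minimal numerical sketch of the two projectors (Python; X1 stands in for the instrumental variables, X2 for the table to be explained):

```python
import numpy as np

def projector(X):
    """Orthogonal projector onto the column space of X."""
    return X @ np.linalg.solve(X.T @ X, X.T)

rng = np.random.default_rng(6)
X1 = rng.normal(size=(92, 2))      # e.g. instrumental variables
X2 = rng.normal(size=(92, 5))      # table to be explained

P = projector(X1)
residual = (np.eye(92) - P) @ X2   # the "P-orthogonal X2" part of the triplet

# P is idempotent, and the residual is orthogonal to the columns of X1.
print(np.max(np.abs(X1.T @ residual)))  # close to zero
```

Applying I − PX1 before the analysis is exactly what "removes the effect of the independent variables": whatever structure remains in X2 is uncorrelated with X1.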
[Figure caption (partial): ... species scores, eigenvalues screeplot, projection of principal axes on within-class axes, sites scores (common centring), projections of sites and groups (i.e. rivers in this example) on within-class axes.]

Figure 6: Kplot of 12 separate correspondence analyses (same species, different sites).
The K-table class

Class ktab corresponds to collections of more than two duality diagrams, for which the internal structures are to be compared. Three formats of these collections can be considered:

• (X1, Q1, D), (X2, Q2, D), ..., (XK, QK, D)

• (X1, Q, D1), (X2, Q, D2), ..., (XK, Q, DK), stored in the form of (Xt1, D1, Q), (Xt2, D2, Q), ..., (XtK, DK, Q)

• (X1, Q, D), (X2, Q, D), ..., (XK, Q, D), which can also be stored in the form of (Xt1, D, Q), (Xt2, D, Q), ..., (XtK, D, Q)

When the ktab object is built, various statistical methods can be used to analyse it. The foucart function can be used to analyse K tables of positive numbers that have the same rows and the same columns and can be analysed by a CA (Foucart, 1984; Pavoine et al., 2007). Partial triadic analysis (Tucker, 1966) is a first step toward three-mode principal component analysis (Kroonenberg, 1989) and can be computed with the pta function; it must be used on K triplets having the same row and column weights. The pta function can also be used to perform the STATICO method (Simier et al., 1999; Thioulouse et al., 2004): this makes it possible to analyse a pair of ktab objects which have been combined by the ktab.match2ktabs function. Multiple factor analysis (mfa, Escofier and Pagès, 1994), multiple co-inertia analysis (mcoa, Chessel and Hanafi, 1996) [...]
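A STATIS analysis of such a K-table can be computed with the statis function. A minimal sketch, assuming the kt1 object built earlier; the sta1 object plotted with kplot later in the text is presumably obtained this way:

```r
sta1 <- statis(kt1, scannf = FALSE)  # compromise analysis of the K tables
plot(sta1)  # interstructure, compromise and typological value displays
```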
[...] column weights, which is a first step toward Common [...]

Figure 8: Kplot of the projection of the sites of each table on the principal axes of the compromise of STATIS analysis.
Figure 7: Plot of STATIS analysis: interstructure, typological value of each table, compromise and projection of principal axes of separate analyses onto STATIS axes.

The kplot generic function is associated with the foucart, mcoa, mfa, pta, sepan, sepan.coa and statis methods, giving adapted collections of graphics.

kplot(sta1, traj = TRUE, arrow = FALSE,
    unique = TRUE, clab = 0)

Conclusion

The ade4 package provides many methods to analyse multivariate ecological data sets. This diversity of tools is a methodological answer to the great variety of questions and data structures associated with biological questions. Specific methods dedicated to the analysis of biodiversity, spatial, genetic or phylogenetic data are also available in the package. The adehabitat companion package contains tools to analyse habitat selection by animals, while the ade4TkGUI package provides a graphical interface to ade4. More resources can be found on the ade4 website (http://pbil.univ-lyon1.fr/ADE-4/).

Bibliography

P. Cazes. L'analyse de certains tableaux rectangulaires décomposés en blocs : généralisation des propriétés rencontrées dans l'étude des correspondances multiples. I. Définitions et applications à l'analyse canonique des variables qualitatives. Les Cahiers de l'Analyse des Données, 5:145–161, 1980.

D. Chessel and M. Hanafi. Analyse de la co-inertie de K nuages de points. Revue de Statistique Appliquée, 44(2):35–60, 1996.

D. Chessel, A.-B. Dufour, and J. Thioulouse. The ade4 package - I: One-table methods. R News, 4:5–10, 2004.

P. G. N. Digby and R. A. Kempton. Multivariate Analysis of Ecological Communities. Chapman and Hall, Population and Community Biology Series, London, 1987.

S. Dolédec and D. Chessel. Co-inertia analysis: an alternative method for studying species-environment relationships. Freshwater Biology, 31:277–294, 1994.

S. Dray, D. Chessel, and J. Thioulouse. Procrustean co-inertia analysis for the linking of multivariate datasets. Ecoscience, 10:110–119, 2003a.

S. Dray, D. Chessel, and J. Thioulouse. Co-inertia analysis and the linking of ecological tables. Ecology, 84(11):3078–3089, 2003b.

S. Dray and A. Dufour. The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4):1–20, 2007.
B. Escofier and J. Pagès. Multiple factor analysis (AFMULT package). Computational Statistics and Data Analysis, 18:121–140, 1994.

Y. Escoufier. The duality diagram: a means of better practical applications. In P. Legendre and L. Legendre, editors, Developments in Numerical Ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin, 1987.

B. Flury. Common Principal Components and Related Multivariate Models. Wiley and Sons, New York, 1988.

T. Foucart. Analyse factorielle de tableaux multiples. Masson, Paris, 1984.

R. Goecke. 3D lip tracking and co-inertia analysis for improved robustness of audio-video automatic speech recognition. In Proceedings of the Auditory-Visual Speech Processing Workshop AVSP 2005, pages 109–114, 2005.

J. Gower. Statistical methods of comparing different multivariate analyses of the same data. In F. Hodson, D. Kendall, and P. Tautu, editors, Mathematics in the Archaeological and Historical Sciences, pages 138–149. University Press, Edinburgh, 1971.

M. Heo and K. Gabriel. A permutation test of association between configurations by means of the RV coefficient. Communications in Statistics - Simulation and Computation, 27:843–856, 1998.

S. Holmes. Multivariate analysis: The French way. In D. Nolan and T. Speed, editors, Festschrift for David Freedman. IMS, Beachwood, OH, 2006.

D. Jackson. PROTEST: a PROcustean randomization TEST of community environment concordance. Ecoscience, 2:297–303, 1995.

P. Kroonenberg. The analysis of multiple tables in factorial ecology. III. Three-mode principal component analysis: "analyse triadique complète". Acta Oecologica, Oecologia Generalis, 10:245–256, 1989.

C. Lavit, Y. Escoufier, R. Sabatier, and P. Traissac. The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119, 1994.

P. Mercier, D. Chessel, and S. Dolédec. Complete correspondence analysis of an ecological profile data table: a central ordination method. Acta Oecologica, 13:25–44, 1992.

C. Montaña and P. Greig-Smith. Correspondence analysis of species by environmental variable matrices. Journal of Vegetation Science, 1:453–460, 1990.

S. Pavoine, J. Blondel, M. Baguette, and D. Chessel. A new technique for ordering asymmetrical three-dimensional data sets in ecology. Ecology, 88:512–523, 2007.

C. Rao. The use and interpretation of principal component analysis in applied research. Sankhya A, 26:329–359, 1964.

M. Simier, L. Blanc, F. Pellegrin, and D. Nandris. Approche simultanée de K couples de tableaux : application à l'étude des relations pathologie végétale-environnement. Revue de Statistique Appliquée, 47:31–46, 1999.

C. Ter Braak. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology, 67:1167–1179, 1986.

J. Thioulouse, M. Simier, and D. Chessel. Simultaneous analysis of a sequence of pairs of ecological tables with the STATICO method. Ecology, 85:272–283, 2004.

L. Tucker. An inter-battery method of factor analysis. Psychometrika, 23:111–136, 1958.

L. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31:279–311, 1966.

A. van den Wollenberg. Redundancy analysis, an alternative for canonical analysis. Psychometrika, 42(2):207–219, 1977.

J. Verneaux. Cours d'eau de Franche-Comté (Massif du Jura). Recherches écologiques sur le réseau hydrographique du Doubs. Essai de biotypologie. Thèse de doctorat, Université de Besançon, Besançon, 1973.

Stéphane Dray, Anne-Béatrice Dufour, Daniel Chessel
Laboratoire de Biométrie et Biologie Evolutive (UMR 5558), CNRS
Université de Lyon; Université Lyon 1
43, Boulevard du 11 Novembre 1918
69622 Villeurbanne Cedex, France
dray@biomserv.univ-lyon1.fr
dufour@biomserv.univ-lyon1.fr
chessel@biomserv.univ-lyon1.fr
Changes in R 2.6.0
by the R Core Team

User-visible changes

• integrate(), nlm(), nlminb(), optim(), optimize() and uniroot() now have ... much earlier in their argument list. This reduces the chances of unintentional partial matching but means that the later arguments must be named in full.

• The default type for nchar() is now "chars". This is almost always what was intended, and differs from the previous default only for non-ASCII strings in a MBCS locale. There is a new argument allowNA, and the default behaviour is now to throw an error on an invalid multibyte string if type = "chars" or type = "width".

• Connections will be closed if there is no R object referring to them. A warning is issued if this is done, either at garbage collection or if all the connection slots are in use.

New features

• abs(), sign(), sqrt(), floor(), ceiling(), exp() and the gamma, trig and hyperbolic trig functions now only accept one argument even when dispatching to a Math group method (which may accept more than one argument for other group members).

• abbreviate() gains a method argument with a new option "both.sides" which can make shorter abbreviations.

• aggregate.data.frame() no longer changes the group variables into factors, and leaves alone the levels of those which are factors. (Inter alia, this grants the wish of PR#9666.)

• The default max.names in all.names() and all.vars() is now -1, which means unlimited. This fixes PR#9873.

• apropos() now sorts by name and not by position on the search path.

• as.vector() and the default methods of as.character(), as.complex(), as.double(), as.expression(), as.integer(), as.logical() and as.raw() no longer duplicate in most cases where the object is unchanged. (Beware: some code has been written that invalidly assumes that they do duplicate, often when using .C/.Fortran(DUP = FALSE).)

• as.complex(), as.double(), as.integer(), as.logical() and as.raw() are now primitive and internally generic for efficiency. They no longer dispatch on S3 methods for as.vector() (which was never documented). as.real() and as.numeric() remain as alternative names for as.double().
expm1(), log(), log1p(), log2(), log10(), gamma(), lgamma(), digamma() and trigamma() are now primitive. (Note that logb() is not.)
The Math2 and Summary groups (round, signif, all, any, max, min, sum, prod, range) are now primitive.
See under Section "methods package" below for some consequences for S4 methods.

• attr() gains an exact = TRUE argument to disable partial matching.

• bxp() now allows xlim to be specified. (PR#9754)

• C(f, SAS) now works in the same way as C(f, treatment), etc.

• chol() is now generic.

• dev2bitmap() has a new option to go via PDF and so allow semi-transparent colours to be used.

• dev.interactive() regards devices with the displaylist enabled as interactive, and packages can register the names of their devices as interactive via deviceIsInteractive().

• download.packages() and available.packages() (and functions which use them) now support in repos or contriburl either 'file:' plus a general path (including drives on a UNC path on Windows) or a 'file:///' URL in the same way as url().

• dQuote() and sQuote() are more flexible, with rendering controlled by the new option useFancyQuotes. This includes the ability to have TeX-style rendering and directional quotes (the so-called "smart quotes") on Windows. The default is to use directional quotes in UTF-8 locales (as before) and in the Rgui console on Windows (new).

• is.finite() and is.infinite() are now S3 and S4 generic.

• jpeg(), png(), bmp() (Windows), dev2bitmap() and bitmap() have a new argument units to specify the units of width and height.

• levels() is now generic (levels<-() has been for a long time).

• Loading serialized raw objects with load() is now considerably faster.

• New primitive nzchar() as a faster alternative to nchar(x) > 0 (which avoids having to convert to wide chars in a MBCS locale and hence consider validity).

• The way old.packages() and hence update.packages() handle packages with different versions in multiple package repositories has been changed: formerly the first package encountered was selected, now it is the one with the highest version number.

• optim(method = "L-BFGS-B") now accepts zero-length parameters, like the other methods. Also, method = "SANN" no longer attempts to optimize in this case.

• New options showWarnCalls and showErrorCalls to give a concise traceback on warnings and errors. showErrorCalls = TRUE is the default for non-interactive sessions. Option showNCalls controls how abbreviated the call sequence is.

• duplicated() and unique() and their methods in base gain an additional argument fromLast.
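The new fromLast argument of duplicated() and unique() treats the last occurrence of each value as the original; for example:

```r
x <- c(1, 2, 1)
duplicated(x)                   # FALSE FALSE TRUE: the second 1 is the duplicate
duplicated(x, fromLast = TRUE)  # TRUE FALSE FALSE: the first 1 is now the duplicate
unique(x, fromLast = TRUE)      # 2 1: keeps the last occurrence of each value
```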
• New options warnPartialMatchDollar, warnPartialMatchArgs and warnPartialMatchAttr to help detect the unintended use of partial matching in $, argument matching and attr() respectively.

• fifo() no longer has a default description argument. fifo("") is now implemented, and works in the same way as file("").

• Rd conversion to HTML and CHM now makes use of classes, which are set in the stylesheets. Editing 'R.css' will change the styles used for \env, \option, \pkg etc. (CHM styles are set at compilation time.)

• The documented arguments of %*% have been changed to be x and y, to match S and the implicit S4 generic.

• The [[ subsetting operator now has an argument exact that allows programmers to disable partial matching (which will in due course become the default). The default value is exact = NA, which causes a warning to be issued when partial matching occurs. When exact = TRUE, no partial matching will be performed. When exact = FALSE, partial matching can occur and no warning will be issued. This patch was written by Seth Falcon.
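For example, with a list that has a single component named alpha, the exact argument of [[ behaves as follows:

```r
x <- list(alpha = 1)
x[["alp", exact = FALSE]]  # partial matching allowed: returns 1
x[["alp", exact = TRUE]]   # no partial matching: returns NULL
```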
• If members of the Ops group (the arithmetic, logical and comparison operators) and %*% are called as functions, e.g., '>'(x, y), positional matching is always used. (It used to be the case that positional matching was used for the default methods, but names would be matched for S3 and S4 methods, and in the case of ! the argument name differed between S3 and S4 methods.)

• Imports environments of name spaces are named (as "imports:foo"), and so are known e.g. to environmentName().

• Package stats4 uses lazy-loading, not SaveImage (which is now deprecated).

• Installing help for a package now parses the '.Rd' file only once, rather than once for each type.

• Many of the C-level warning/error messages (e.g., from subscripting) have been re-worked to give more detailed information on either the location or the cause of the problem.

• The S3 and S4 Math groups have been harmonized. Functions log1p(), expm1(), log10() and log2() are members of the S3 group, and sign(), log1p(), expm1(), log2(), cummax(), cummin(), digamma(), trigamma() and trunc() are members of the S4 group. gammaCody() is no longer in the S3 group. They are now all primitive.

• The initialization of the random-number stream makes use of the sub-second part of the current time where available. Initialization of the 1997 Knuth TAOCP generator is now done in R code, avoiding some C code whose licence status has been questioned.

• The reporting of syntax errors has been made more user-friendly.

• PCRE has been updated to version 7.2.

• bzip2 has been updated to version 1.0.4.

• gettext has been updated to version 0.16.1.

methods package

• Because all members of the group generics are now primitive, they are all S4 generic, and setting an S4 group generic does at last apply to all members and not just those already made S4 generic.

• as.double() and as.real() are identical to as.numeric(), and now remain so even if S4 methods are set on any of them. Since as.numeric is the traditional name used in S4, currently methods must be exported from a 'NAMESPACE' for as.numeric only.

• The S4 generic for ! has been changed to have signature (x) (was (e1)) to match the documentation and the S3 generic. setMethod() will fix up methods defined for (e1), with a warning.

• The "structure" S4 class now has methods that implement the concept of structures as described in the Blue Book: element-by-element functions and operators leave structure intact unless they change the length. The informal behavior of R for vectors with attributes was inconsistent.

• The implicitGeneric() function and relatives have been added to specify how a function in a package should look when methods are defined for it. This will be used to ensure that generic versions of functions in R core are consistent. See ?implicitGeneric.

• Error messages generated by some of the functions in the methods package now provide the name of the generic, to give more contextual information.

• It is now possible to use setGeneric(useAsDefault = FALSE) to define a new generic with the name of a primitive function (but having no connection with the primitive).

Deprecated & defunct

• $ on an atomic vector now gives a warning that it is "invalid". It remains deprecated, but may be removed in R >= 2.7.0.

• storage.mode(x) <- "real" and storage.mode(x) <- "single" are defunct: use instead storage.mode(x) <- "double" and mode(x) <- "single".

• In package installation, 'SaveImage: yes' is deprecated in favour of 'LazyLoad: yes'.

• seemsS4Object (methods package) is deprecated in favour of isS4().

• It is planned that [[exact = TRUE]] will become the default in R 2.7.0.

Utilities

• checkS3methods() (invoked by R CMD check) now checks the arguments of methods for primitive members of the S3 group generics.

• R CMD check now does a recursive copy on the 'tests' directory.

• R CMD check now warns on non-ASCII '.Rd' files without an \encoding field, rather than just on ones that are definitely not from an ISO-8859 encoding. This agrees with the long-standing stipulation in "Writing R Extensions", and catches some packages with UTF-8 man pages.

• R CMD check now warns on DESCRIPTION files with a non-portable Encoding field, or with non-ASCII data and no Encoding field.

• R CMD check now loads all the Suggests and Enhances dependencies to reduce warnings about non-visible objects, and also emulates standard functions (such as shell()) on alternative R platforms.

• R CMD check now (by default) attempts to latex the vignettes rather than just weave and tangle them: this will give a NOTE if there are latex errors.

• R CMD check computations no longer ignore Rd \usage entries for functions for extracting or replacing parts of an object, so S3 methods should use the appropriate \method{} markup.

• R CMD check now checks for CR (as well as CRLF) line endings in C/C++/Fortran source files, and for non-LF line endings in 'Makefile[.in]' and 'Makevars[.in]' in the package 'src' directory. R CMD build will correct non-LF line endings in source files and in the make files mentioned.

• Rdconv now warns about unmatched braces rather than silently omitting sections containing them. (Suggestion by Bill Dunlap, PR#9649)
Rdconv now renders (rather than ignores) \var{} inside \code{} markup in LaTeX conversion.
R CMD Rdconv gains a '--encoding' argument to set the default encoding for conversions.

• The list of CRAN mirrors now has a new (manually maintained) column "OK" which flags mirrors that seem to be OK; only those are used by chooseCRANmirror(). The now-exported function getCRANmirrors() can be used to get all known mirrors or only the ones that are OK.

• R CMD SHLIB gains arguments '--clean' and '--preclean' to clean up intermediate files after and before building.

• R CMD config now knows about FC and FCFLAGS (used for F9x compilation).

• R CMD Rdconv now does a better job of rendering quotes in titles in HTML, and \sQuote and \dQuote into text on Windows.

C-level facilities

• New utility function alloc3DArray, similar to allocMatrix.

• The entry point R_seemsS4Object in 'Rinternals.h' has not been needed since R 2.4.0 and has been removed. Use IS_S4_OBJECT instead.

• Applications embedding R can use R_getEmbeddingDllInfo() to obtain DllInfo for registering symbols present in the application itself.

• The instructions for making and using standalone libRmath have been moved to the R Installation and Administration manual.

• CHAR() now returns (const char *) since CHARSXPs should no longer be modified in place. This change allows compilers to warn or error about improper modification. Thanks to Herve Pages for the suggestion.

• acopy_string is a (provisional) new helper function that copies character data and returns a pointer to memory allocated using R_alloc. This can be used to create a copy of a string stored in a CHARSXP before passing the data on to a function that modifies its arguments.

• asLogical, asInteger, asReal and asComplex now accept STRSXP and CHARSXP arguments, and asChar accepts CHARSXP.

• New R_GE_str2col() exported via 'R_ext/GraphicsEngine.h' for external device developers.

• doKeybd and doMouseevent are now exported in 'GraphicsDevice.h'.

• R_alloc now has a first argument of type size_t to support 64-bit platforms (e.g., Win64) with a 32-bit long type.

• The type of the last two arguments of getMatrixDimnames (non-API but mentioned in 'R-exts.texi' and in 'Rinternals.h') has been changed to const char ** (from char **).

• R_FINITE now always resolves to the function call R_finite in packages (rather than sometimes substituting isfinite). This avoids some issues where R headers are called from C++ code using features tested on the C compiler.

• The advice to include R headers from C++ inside extern "C" has been changed. It is nowadays better not to wrap the headers, as they include other headers which on some OSes should not be wrapped.

• 'Rinternals.h' no longer includes a substantial set of C headers. All but 'ctype.h' and 'errno.h' are included by 'R.h', which is supposed to be used before 'Rinternals.h'.

• Including C system headers can be avoided by defining NO_C_HEADERS before including R headers. This is intended to be used from C++ code, and you will need to include C++ equivalents such as <cmath> before the R headers.

Installation

• The test-Lapack test is now part of make check.

• The stat system call is now required, along with opendir (which had long been used but not tested for). (make check would have failed in earlier versions without these calls.)

• evince is now considered as a possible PDF viewer.

• make install-strip now also strips the DLLs in the standard packages.

• Perl 5.8.0 (released in July 2002) or later is now required. (R 2.4.0 and later have in fact required 5.6.1 or later.)

• The C function finite is no longer used: we expect a C99 compiler which will have isfinite. (If that is missing, we test separately for NaN, Inf and -Inf.)

• A script/executable texi2dvi is now required on Unix-alikes: it is part of the texinfo distribution.

• Files 'texinfo.tex' and 'txi-en.tex' are no longer supplied in doc/manual (as the latest versions have an incompatible licence). You will need to ensure that your texinfo and/or TeX installations supply them.
Bug fixes

• nls() now allows more peculiar but reasonable ways of being called, e.g., with data = list(uneven_lengths) or a model without variables.

• match.arg() was not behaving as documented when several.ok = TRUE (PR#9859), gave spurious warnings when arg had the wrong length, and was incorrectly documented (exact matches are returned even when there is more than one partial match).

• The data.frame method for split<-() was broken.

• The test for -D__NO_MATH_INLINES was badly broken and returned true on all non-glibc platforms and false on all glibc ones (whether they were broken or not).

• LF was missing after the last prompt when '--quiet' was used without '--slave'. Use '--slave' when no final LF is desired.

• Fixed a bug in the initialisation code of the grid package for determining the boundaries of shapes. Problem reported by Hadley Wickham; the symptom was the error message 'Polygon edge not found'.

• str() is no longer slow for large POSIXct objects. Its output is also slightly more compact for such objects; this is implemented via the new optional argument give.head.

• strsplit(*, fixed = TRUE), potentially iconv(), and internal string formatting are now faster for large strings, thanks to report PR#9902 by John Brzustowski.

• de.restore() gave a spurious warning for matrices. (Ben Bolker)

• plot(fn, xlim = c(a, b)) would not set from and to properly when plotting a function. The argument lists to curve() and plot.function() have been modified slightly as part of the fix.

• julian() was documented to work with POSIXt origins, but did not work with POSIXlt ones. (PR#9908)

• Dataset HairEyeColor has been corrected to agree with Friendly (2000): the change involves the breakdown of the Brown hair / Brown eye cell by Sex, and only totals over Sex are given in the original source.

• Trailing spaces are now consistently stripped from \alias{} entries in '.Rd' files, and this is now documented. (PR#9915)

• .find.packages(), packageDescription() and sessionInfo() assumed that attached environments named "package:foo" were package environments, although misguided users could use such a name in attach().

• spline() and splinefun() with method = "periodic" could return incorrect results when length(x) was 2 or 3.

• getS3method() could fail if the method name contained a regexp metacharacter such as "+".

• help(a_character_vector) now uses the name and not the value of the vector unless it has length exactly one, so e.g. help(letters) now gives help on letters. (Related to PR#9927)

• Ranges in chartr() now work better in CJK locales, thanks to Ei-ji Nakama.
Changes on CRAN
by Kurt Hornik from Li Hsu and Douglas Grove, imagemap
code from Barry Rowlingson.
New contributed packages AIS Tools to look at the data (“Ad Inidicia Spec-
tata”). By Micah Altman.
ADaCGH Analysis and plotting of array CGH data.
Allows usage of Circular Binary Segmentation, AcceptanceSampling Creation and evaluation of
wavelet-based smoothing, ACE method (CGH Acceptance Sampling Plans. Plans can be sin-
Explorer), HMM, BioHMM, GLAD, CGHseg, gle, double or multiple sampling plans. By An-
and Price’s modification of Smith & Water- dreas Kiermeier.
man’s algorithm. Most computations are par-
allelized. Figures are imagemaps with links Amelia Amelia II: A Program for Missing Data.
to IDClight (http://idclight.bioinfo.cnio. Amelia II “multiply imputes” missing data in
es). By Ramon Diaz-Uriarte and Oscar M. a single cross-section (such as a survey), from
Rueda. Wavelet-based aCGH smoothing code a time series (like variables collected for each
year in a country), or from a time-series-cross- Quartz device from Terminal or ESS). By Simon
sectional data set (such as collected by years Urbanek.
for each of several countries). Amelia II im-
plements a bootstrapping-based algorithm that DAAGbio Data sets and functions useful for the
gives essentially the same answers as the stan- display of microarray and for demonstrations
dard IP or EMis approaches, is usually con- with microarray data. By John Maindonald.
siderably faster than existing approaches and
Defaults Set, get, and import global function de-
can handle many more variables. The program
faults. By Jeffrey A. Ryan.
also generalizes existing approaches by allow-
ing for trends in time series across observations Devore7 Data sets and sample analyses from Jay L.
within a cross-sectional unit, as well as priors Devore (2008), “Probability and Statistics for
that allow experts to incorporate beliefs they Engineering and the Sciences (7th ed)”, Thom-
have about the values of missing cells in their son. Original by Jay L. Devore, modifications
data. The program works from the R command by Douglas Bates, modifications for the 7th edi-
line or via a graphical user interface that does tion by John Verzani.
not require users to know R. By James Honaker,
Gary King, and Matthew Blackwell. G1DBN Perform Dynamic Bayesian Network infer-
ence using 1st order conditional dependencies.
BiodiversityR A GUI (via Rcmdr) and some util- By Sophie Lebre.
ity functions (often based on the vegan) for
statistical analysis of biodiversity and ecolog- GSA Gene Set Analysis. By Brad Efron and R. Tib-
ical communities, including species accumulation curves, diversity indices, Renyi profiles, GLMs for analysis of species abundance and presence-absence, distance matrices, Mantel tests, and cluster, constrained and unconstrained ordination analysis. By Roeland Kindt.

CORREP Multivariate correlation estimator and statistical inference procedures. By Dongxiao Zhu and Youjuan Li.

CPGchron Create radiocarbon-dated depth chronologies, following the work of Parnell and Haslett (2007, submitted to JRSSC). By Andrew Parnell.

Cairo Provides a Cairo graphics device that can be used to create high-quality vector (PDF, PostScript and SVG) and bitmap output (PNG, JPEG, TIFF), and high-quality rendering in displays (X11 and Win32). Since it uses the same back-end for all output, copying across formats is WYSIWYG. Files are created without the dependence on X11 or other external programs. This device supports the alpha channel (semi-transparent drawing), and resulting images can contain transparent and semi-transparent regions. It is ideal for use in server environments (file output) and as a replacement for other devices that don't have Cairo's capabilities, such as alpha support or anti-aliasing. Backends are modular, such that any subset of backends is supported. By Simon Urbanek and Jeffrey Horner.

CarbonEL Carbon Event Loop: hooks a Carbon event loop handler into R. This is useful for enabling UI from a console R (such as using the

shirani.

GSM Gamma Shape Mixture. Implements a Bayesian approach for estimation of a mixture of gamma distributions in which the mixing occurs over the shape parameter. This family provides a flexible and novel approach for modeling heavy-tailed distributions; it is computationally efficient and only requires specifying a prior distribution for a single parameter. By Sergio Venturini.

GeneF Implements several generalized F-statistics, including ones based on the flexible isotonic/monotonic regression or order restricted hypothesis testing. By Yinglei Lai.

HFWutils Functions for Excel connections, garbage collection, string matching, and passing by reference. By Felix Wittmann.

ICEinfer Incremental Cost-Effectiveness (ICE) Statistical Inference. Given two unbiased samples of patient-level data on cost and effectiveness for a pair of treatments, makes head-to-head treatment comparisons by (i) generating the bivariate bootstrap resampling distribution of ICE uncertainty for a specified value of the shadow price of health, λ, (ii) forming the wedge-shaped ICE confidence region with specified confidence fraction within [0.50, 0.99] that is equivariant with respect to changes in λ, (iii) coloring the bootstrap outcomes within the above confidence wedge with economic preferences from an ICE map with specified values of the λ, β and γ parameters, (iv) displaying VAGR and ALICE acceptability curves, and (v) displaying indifference (iso-preference) curves from an ICE map with specified values of λ, β and γ or η parameters. By Bob Obenchain.
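The bootstrap step the ICEinfer entry describes — resampling patient-level (cost, effectiveness) pairs within each arm to build the bivariate distribution of the incremental differences — is easy to sketch generically. The following Python sketch is illustrative only (the function name is invented; ICEinfer's actual interface is in R):

```python
import random

def ice_bootstrap(treat, control, n_boot=1000, seed=1):
    """treat/control: lists of (cost, effectiveness) per patient.
    Returns bootstrap pairs (delta_cost, delta_effect), i.e. a sketch of
    the bivariate resampling distribution described for ICEinfer."""
    rng = random.Random(seed)

    def arm_means(sample):
        costs = [c for c, _ in sample]
        effs = [e for _, e in sample]
        return sum(costs) / len(costs), sum(effs) / len(effs)

    out = []
    for _ in range(n_boot):
        # resample each arm with replacement, keeping (cost, eff) pairs intact
        bt = [treat[rng.randrange(len(treat))] for _ in treat]
        bc = [control[rng.randrange(len(control))] for _ in control]
        (ct, et), (cc, ec) = arm_means(bt), arm_means(bc)
        out.append((ct - cc, et - ec))
    return out
```

Each returned pair is one bootstrap realization of (ΔC, ΔE); the confidence wedge and acceptability curves in the entry are then functions of this cloud and the shadow price λ.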
ICS ICS/ICA computation based on two scatter matrices. Implements Oja et al.'s method of two different scatter matrices to obtain an invariant coordinate system or independent components, depending on the underlying assumptions. By Klaus Nordhausen, Hannu Oja, and Dave Tyler.

ICSNP Tools for multivariate nonparametrics, such as location tests based on marginal ranks, spatial median and spatial sign computation, Hotelling's T-test, and estimates of shape. By Klaus Nordhausen, Seija Sirkia, Hannu Oja, and Dave Tyler.

LLAhclust Hierarchical clustering of variables or objects based on the likelihood linkage analysis method. Likelihood linkage analysis is a general agglomerative hierarchical clustering method developed in France by Lerman in a long series of research articles and books. Initially proposed in the framework of variable clustering, it has been progressively extended to allow the clustering of very general object descriptions. The approach mainly consists in replacing the value of the estimated similarity coefficient by the probability of finding a lower value under the hypothesis of "absence of link". Package LLAhclust contains routines for computing various types of probabilistic similarity coefficients between variables or object descriptions. Once the similarity values between variables/objects are computed, a hierarchical clustering can be performed using several probabilistic and non-probabilistic aggregation criteria, and indices measuring the quality of the partitions compatible with the resulting hierarchy can be computed. By Ivan Kojadinovic, Israël-César Lerman, and Philippe Peter.

LLN Learning with Latent Networks. A new framework in which graph-structured data are used to train a classifier in a latent space, and then to classify new nodes. During the learning phase, a latent representation of the network is first learned and a supervised classifier is then built in the learned latent space. In order to classify new nodes, the positions of these nodes in the learned latent space are estimated using the existing links between the new nodes and the learning-set nodes. It is then possible to apply the supervised classifier to assign each new node to one of the classes. By Charles Bouveyron & Hugh Chipman.

LearnBayes Functions helpful in learning the basic tenets of Bayesian statistical inference. Contains functions for summarizing basic one- and two-parameter posterior distributions and predictive distributions, MCMC algorithms for summarizing posterior distributions defined by the user, functions for regression models, hierarchical models, Bayesian tests, and illustrations of Gibbs sampling. By Jim Albert.

LogConcDEAD Computes the maximum likelihood estimator from an i.i.d. sample of data from a log-concave density in any number of dimensions. Plots are available for 1- and 2-d data. By Madeleine Cule, Robert Gramacy, and Richard Samworth.

MLDS Maximum Likelihood Difference Scaling. Difference scaling is a method for scaling perceived supra-threshold differences. The package contains functions that allow the user to design and run a difference scaling experiment, to fit the resulting data by maximum likelihood, and to test the internal validity of the estimated scale. By Kenneth Knoblauch and Laurence T. Maloney, based in part on C code written by Laurence T. Maloney and J. N. Yang.

MLEcens MLE for bivariate (interval) censored data. Contains functions to compute the nonparametric maximum likelihood estimator for the bivariate distribution of (X, Y), when realizations of (X, Y) cannot be observed directly. More precisely, the MLE is computed based on a set of rectangles ("observation rectangles") that are known to contain the unobservable realizations of (X, Y). The methods can also be used for univariate censored data, and for censored data with competing risks. The package contains the functionality of bicreduc, which will no longer be maintained. By Marloes Maathuis.

MiscPsycho Miscellaneous Psychometrics: statistical analyses useful for applied psychometricians. By Harold C. Doran.

ORMDR Odds-ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions. By Eun-Kyung Lee, Sung Gon Yi, Yoojin Jung, and Taesung Park.

PET Simulation and reconstruction of PET images. Implements different analytic/direct and iterative reconstruction methods of Peter Toft, and offers the possibility to simulate PET data. By Joern Schulz, Peter Toft, Jesper James Jensen, and Peter Philipsen.

PSAgraphics Propensity Score Analysis (PSA) Graphics. Includes functions which test balance within strata of categorical and quantitative covariates, give a representation of the estimated effect size by stratum, provide a graphic and loess-based effect size estimate, and various balance functions that provide measures of the balance achieved via a PSA in a categorical
covariate. By James E. Helmreich and Robert M. Pruzek.

PerformanceAnalytics Econometric tools for performance and risk analysis. Aims to aid practitioners and researchers in utilizing the latest research in analysis of non-normal return streams. In general, the package is most tested on return (rather than price) data on a monthly scale, but most functions will work with daily or irregular return data as well. By Peter Carl and Brian G. Peterson.

QRMlib Code to examine Quantitative Risk Management concepts, accompanying the book "Quantitative Risk Management: Concepts, Techniques and Tools" by Alexander J. McNeil, Rüdiger Frey and Paul Embrechts. S-Plus original by Alexander McNeil; R port by Scott Ulman.

R.cache Fast and light-weight caching of objects. Methods for memoization, that is, caching arbitrary R objects in persistent memory. Objects can be loaded and saved stratified on a set of hashing objects. By Henrik Bengtsson.

R.huge Methods for accessing huge amounts of data. Provides a class representing a matrix where the actual data is stored in a binary format on the local file system. This way the size limit of the data is set by the file system and not the memory. Currently in an alpha/early-beta version. By Henrik Bengtsson.

RBGL Interface to boost C++ graph library. Demo of interface with full copy of all hpp defining boost. By Vince Carey, Li Long, and R. Gentleman.

RDieHarder An interface to the dieharder test suite of random number generators and tests that was developed by Robert G. Brown, extending earlier work by George Marsaglia and others. By Dirk Eddelbuettel.

RLRsim Exact (Restricted) Likelihood Ratio tests for mixed and additive models. Provides rapid simulation-based tests for the presence of variance components/nonparametric terms with a convenient interface for a variety of models. By Fabian Scheipl.

ROptEst Optimally robust estimation using S4 classes and methods. By Matthias Kohl.

ROptRegTS Optimally robust estimation for regression-type models using S4 classes and methods. By Matthias Kohl.

RSVGTipsDevice An R SVG graphics device with dynamic tips and hyperlinks using the w3.org XML standard for Scalable Vector Graphics. Supports tooltips with 1 to 3 lines and line styles. By Tony Plate, based on RSvgDevice by T Jake Luciani.

Rcapture Loglinear Models in Capture-Recapture Experiments. Estimation of abundance and other demographic parameters for closed populations, open populations and the robust design in capture-recapture experiments using loglinear models. By Sophie Baillargeon and Louis-Paul Rivest.

RcmdrPlugin.TeachingDemos Provides an Rcmdr "plug-in" based on the TeachingDemos package, and is primarily for illustrative purposes. By John Fox.

Reliability Functions for estimating parameters in software reliability models. Only infinite failure models are implemented so far. By Andreas Wittmann.

RiboSort Rapid classification of (TRFLP & ARISA) microbial community profiles, eliminating the laborious task of manually classifying community fingerprints in microbial studies. By automatically assigning detected fragments and their respective relative abundances to appropriate ribotypes, RiboSort saves time and greatly simplifies the preparation of DNA fingerprint data sets for statistical analysis. By Úna Scallan & Ann-Kathrin Liliensiek.

Rmetrics Rmetrics — Financial Engineering and Computational Finance. Environment for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others.

RobLox Optimally robust influence curves in case of normal location with unknown scale. By Matthias Kohl.

RobRex Optimally robust influence curves in case of linear regression with unknown scale and standard normal distributed errors where the regressor is random. By Matthias Kohl.

Rsac Seismic analysis tools in R. Mostly functions to reproduce some of the Seismic Analysis Code (SAC, http://www.llnl.gov/sac/) commands in R. This includes reading standard binary '.SAC' files, plotting arrays of seismic recordings, filtering, integration, differentiation, instrument deconvolution, and rotation of horizontal components. By Eric M. Thompson.

Runuran Interface to the UNU.RAN library for Universal Non-Uniform RANdom variate generators. By Josef Leydold and Wolfgang Hörmann.
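The memoization idea behind R.cache — store each result under a digest of the call's inputs, so a repeated call with the same arguments is served from the cache — is language-independent. A minimal in-memory sketch (in Python; the names are invented and this is not R.cache's API, which additionally persists objects to disk):

```python
import hashlib
import pickle

_cache = {}  # in-memory stand-in for a persistent store

def memoize(fn):
    """Cache fn's results keyed by a hash of its pickled arguments."""
    def wrapper(*args, **kwargs):
        key = hashlib.sha1(
            pickle.dumps((fn.__name__, args, sorted(kwargs.items())))
        ).hexdigest()
        if key not in _cache:
            _cache[key] = fn(*args, **kwargs)
        return _cache[key]
    return wrapper

calls = []  # records which arguments actually triggered a computation

@memoize
def slow_square(x):
    calls.append(x)
    return x * x
```

Calling `slow_square(4)` twice runs the body only once; the second call is answered from `_cache`.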
Ryacas An interface to the yacas computer algebra system. By Rob Goedman, Gabor Grothendieck, Søren Højsgaard, Ayal Pinkus.

SRPM Shared Reproducibility Package Management. A package development and management system for distributed reproducible research. By Roger D. Peng.

SimHap A comprehensive modeling framework for epidemiological outcomes and a multiple-imputation approach to haplotypic analysis of population-based data. Can perform single-SNP and multi-locus (haplotype) association analyses for continuous Normal, binary, longitudinal and right-censored outcomes measured in population-based samples. Uses estimation maximization techniques for inferring haplotypic phase in individuals, and incorporates a multiple-imputation approach to deal with the uncertainty of imputed haplotypes in association testing. By Pamela A. McCaskie.

SpatialNP Multivariate nonparametric methods based on spatial signs and ranks. Contains tests and estimates of location, tests of independence, tests of sphericity, and several estimates of shape and regression, all based on spatial signs, symmetrized signs, ranks and signed ranks. By Seija Sirkia, Jaakko Nevalainen, Klaus Nordhausen, and Hannu Oja.

TIMP Problem-solving environment for fitting superposition models. Measurements often represent a superposition of the contributions of distinct sub-systems resolved with respect to many experimental variables (time, temperature, wavelength, pH, polarization, etc.). TIMP allows parametric models for such superpositions to be fit and validated. The package has been extensively applied to modeling data arising in spectroscopy experiments. By Katharine M. Mullen and Ivo H. M. van Stokkum.

TwslmSpikeWeight Normalization of cDNA microarray data with the two-way semilinear model (TW-SLM). It incorporates information from control spots and data quality in the TW-SLM to improve normalization of cDNA microarray data. Huber's and Tukey's bisquare weight functions are available for robust estimation of the TW-SLM. By Deli Wang and Jian Huang.

Umacs Universal Markov chain sampler. By Jouni Kerman.

WilcoxCV Functions to perform fast variable selection based on the Wilcoxon rank sum test in the cross-validation or Monte-Carlo cross-validation settings, for use in microarray-based binary classification. By Anne-Laure Boulesteix.

YaleToolkit Tools for the graphical exploration of complex multivariate data developed at Yale University. By John W. Emerson and Walton Green.

adegenet Genetic data handling for multivariate analysis using ade4. By Thibaut Jombart.

ads Spatial point pattern analysis. Performs first- and second-order multi-scale analyses derived from Ripley's K-function, for univariate, multivariate and marked mapped data in rectangular, circular or irregularly shaped sampling windows, with tests of statistical significance based on Monte Carlo simulations. By R. Pelissier and F. Goreaud.

argosfilter Functions to filter animal satellite tracking data obtained from Argos. Especially indicated for telemetry studies of marine animals, where Argos locations are predominantly of low quality. By Carla Freitas.

arrayImpute Missing-value imputation for microarray data. By Eun-kyung Lee, Dankyu Yoon, and Taesung Park.

ars Adaptive Rejection Sampling. By Paulino Perez Rodriguez; original C++ code from Arnost Komarek.

arulesSequences Add-on for arules to handle and mine frequent sequences. Provides interfaces to the C++ implementation of cSPADE by Mohammed J. Zaki. By Christian Buchta and Michael Hahsler.

asuR Functions and data sets for a lecture in Advanced Statistics using R. Especially the functions mancontr() and inspect() may be of general interest. With the former, it is possible to specify your own contrasts and give them useful names. Function inspect() shows a wide range of inspection plots to validate model assumptions. By Thomas Fabbro.

bayescount Bayesian analysis of count distributions with JAGS. A set of functions to apply a zero-inflated gamma Poisson (equivalent to a zero-inflated negative binomial), zero-inflated Poisson, gamma Poisson (negative binomial) or Poisson model to a set of count data using JAGS (Just Another Gibbs Sampler). Returns information on the possible values for mean count, overdispersion and zero inflation present in count data such as faecal egg count data. By Matthew Denwood.
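The zero-inflated Poisson model that bayescount fits (via JAGS) mixes a point mass at zero with an ordinary Poisson: with probability p a structural zero is emitted, otherwise a Poisson(λ) count. A generative sketch in Python (illustrative only — bayescount itself solves the inverse problem, estimating p and λ from observed counts):

```python
import math
import random

def rpois(lam, rng):
    """Poisson draw via Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def rzip(lam, pzero, rng):
    """Zero-inflated Poisson: structural zero with probability pzero,
    otherwise an ordinary Poisson(lam) count."""
    return 0 if rng.random() < pzero else rpois(lam, rng)

rng = random.Random(42)
sample = [rzip(3.0, 0.3, rng) for _ in range(2000)]
```

For this choice the mean is (1 - p)·λ = 2.1 and the zero probability is p + (1 - p)·e^(-λ) ≈ 0.335, visibly above the plain Poisson's e^(-3) ≈ 0.05 — the excess-zero signature the entry refers to in faecal egg count data.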
benchden Full implementation of the 28 distributions introduced as benchmarks for nonparametric density estimation by Berlinet and Devroye (1994). Includes densities, cdfs, quantile functions and generators for samples. By Thoralf Mildenberger, Henrike Weinert, and Sebastian Tiemeyer.

biOps Basic image operations and image processing. Includes arithmetic, logic, look-up table and geometric operations. Some image processing functions, for edge detection (several algorithms including Roberts, Sobel, Kirsch, Marr-Hildreth, Canny) and operations by convolution masks (with predefined as well as user-defined masks), are provided. Supported file formats are jpeg and tiff. By Matias Bordese and Walter Alini.

binMto Asymptotic simultaneous confidence intervals for the comparison of many treatments with one control, for the difference of binomial proportions; allows for Dunnett-like adjustment, Bonferroni or unadjusted intervals. Simulation of the power of the above interval methods, approximate calculation of any-pair power, and sample size iteration based on approximate any-pair power. Exact conditional maximum test for many-to-one comparisons to a control. By Frank Schaarschmidt.

bio.infer Compute biological inferences. Imports benthic count data, reformats this data, and computes environmental inferences from this data. By Lester L. Yuan.

blockTools Block, randomly assign, and diagnose potential problems between units in randomized experiments. Blocks units into experimental blocks, with one unit per treatment condition, by creating a measure of multivariate distance between all possible pairs of units. Maximum, minimum, or an allowable range of differences between units on one variable can be set. Randomly assigns units to treatment conditions. Diagnoses potential interference problems between units assigned to different treatment conditions. Writes outputs to '.tex' and '.csv' files. By Ryan T. Moore.

bootStepAIC Model selection by bootstrapping the stepAIC() procedure. By Dimitris Rizopoulos.

brew A templating framework for mixing text and R code for report generation. Template syntax is similar to PHP, Ruby's erb module, Java Server Pages, and Python's psp module. By Jeffrey Horner.

ca Computation and visualization of simple, multiple and joint correspondence analysis. By Michael Greenacre and Oleg Nenadic.

catmap Case-control And Tdt Meta-Analysis Package. Conducts fixed-effects (inverse variance) and random-effects (DerSimonian and Laird, 1986) meta-analyses of case-control or family-based (TDT) genetic data; in addition, performs meta-analyses combining these two types of study designs. The fixed-effects model was first described by Kazeem and Farrell (2005); the random-effects model is described in Nicodemus (submitted for publication). By Kristin K. Nicodemus.

celsius Retrieve Affymetrix microarray measurements and metadata from Celsius web services, see http://genome.ucla.edu/projects/celsius. By Allen Day, Marc Carlson.

cghFLasso Spatial smoothing and hot spot detection using the fused lasso regression. By Robert Tibshirani and Pei Wang.

choplump Choplump tests: permutation tests for comparing two groups with some positive but many zero responses. By M. P. Fay.

clValid Statistical and biological validation of clustering results. By Guy Brock, Vasyl Pihur, Susmita Datta, and Somnath Datta.

classGraph Construct directed graphs of S4 class hierarchies and visualize them. These graphs are DAGs (directed acyclic graphs) in general, though often trees. By Martin Maechler, partly based on code from Robert Gentleman.

clinfun Clinical Trial Design and Data Analysis Functions. Utilities to make your clinical collaborations easier if not fun. By E. S. Venkatraman.

clusterfly Explore clustering interactively using R and GGobi. Contains both general code for visualizing clustering results and specific visualizations for model-based, hierarchical and SOM clustering. By Hadley Wickham.

clv Cluster Validation Techniques. Contains most of the popular internal and external cluster validation methods, ready to use for most of the outputs produced by functions from package cluster. By Lukasz Nieweglowski.

codetools Code analysis tools for R. By Luke Tierney.

colorRamps Builds single and double gradient color maps. By Tim Keitt.

contrast Contrast methods, in the style of the Design package, for fit objects produced by the lm, glm, gls, and geese functions. By Steve Weston, Jed Wing and Max Kuhn.
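The erb-like idea behind brew — literal text with embedded code that is evaluated at render time — can be shown in a few lines. The `<%= ... %>` delimiter below follows the erb convention the entry mentions, but this function is a toy sketch in Python, not brew's engine or syntax specification:

```python
import re

def render(template, env):
    """Replace each <%= expr %> with the evaluated expression, erb-style.
    env is a mapping of names available to the embedded expressions."""
    return re.sub(
        r"<%=\s*(.+?)\s*%>",
        lambda m: str(eval(m.group(1), {}, dict(env))),
        template,
    )

report = render("Mean yield: <%= total / n %> kg.", {"total": 42.0, "n": 4})
```

Here `report` becomes `"Mean yield: 10.5 kg."`; brew works the same way conceptually, except that the embedded code is R and the output is typically a full report.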
coxphf Cox regression with Firth's penalized likelihood. R by Meinhard Ploner, Fortran by Georg Heinze.

crosshybDetector Identification of probes potentially affected by cross-hybridization in microarray experiments. Includes functions for diagnostic plots. By Paolo Uva.

ddesolve Solver for Delay Differential Equations, interfacing numerical routines written by Simon N. Wood, with contributions by Benjamin J. Cairns. These numerical routines first appeared in Simon Wood's solv95 program. By Alex Couture-Beil, Jon T. Schnute, and Rowan Haigh.

desirability Desirability Function Optimization and Ranking. S3 classes for multivariate optimization using the desirability function of Derringer and Suich (1980). By Max Kuhn.

dplR Dendrochronology Program Library in R. Contains functions for performing some standard tree-ring analyses. By Andy Bunn.

dtt Discrete Trigonometric Transforms. Functions for the 1D and 2D Discrete Cosine Transform (DCT), Discrete Sine Transform (DST) and Discrete Hartley Transform (DHT). By Lukasz Komsta.

earth Multivariate Adaptive Regression Spline Models. Builds regression models using the techniques in Friedman's papers "Fast MARS" and "Multivariate Adaptive Regression Splines". (The term "MARS" is copyrighted and thus not used as the name of the package.) By Stephen Milborrow, derived from code in package mda by Trevor Hastie and Rob Tibshirani.

eigenmodel Semiparametric factor and regression models for symmetric relational data. Estimates the parameters of a model for symmetric relational data (e.g., the above-diagonal part of a square matrix) using a model-based eigenvalue decomposition and regression. Missing data are accommodated, and a posterior mean for missing data is calculated under the assumption that the data are missing at random. The marginal distribution of the relational data can be arbitrary, and is fit with an ordered probit specification. By Peter Hoff.

epibasix Elementary tools for the analysis of common epidemiological problems, ranging from sample size estimation, through 2 × 2 contingency table analysis and basic measures of agreement (kappa, sensitivity/specificity). Appropriate print and summary statements are also written to facilitate interpretation wherever possible. The target audience includes biostatisticians and epidemiologists who would like to apply standard epidemiological methods in a simple manner. By Michael A Rotondi.

experiment Various statistical methods for designing and analyzing randomized experiments. One main functionality is the implementation of randomized-block and matched-pair designs based on possibly multivariate pre-treatment covariates. Also provides the tools to analyze various randomized experiments, including cluster randomized experiments, randomized experiments with noncompliance, and randomized experiments with missing data. By Kosuke Imai.

fCopulae Rmetrics — Dependence Structures with Copulas. Environment for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others.

forensic Statistical Methods in Forensic Genetics. Statistical evaluation of DNA mixtures, DNA profile match probability. By Miriam Marusiakova.

fractal Insightful Fractal Time Series Modeling and Analysis. Software to accompany the book in development "Fractal Time Series Analysis in S-PLUS and R" by William Constantine and Donald B. Percival, Springer. By William Constantine and Donald Percival.

fso Fuzzy Set Ordination: a multivariate analysis used in ecology to relate the composition of samples to possible explanatory variables. While differing in theory and method, in practice its use is similar to "constrained ordination". Contains plotting and summary functions as well as the analyses. By David W. Roberts.

gWidgetstcltk Toolkit implementation of the gWidgets API for the tcltk package. By John Verzani.

gamlss.mx A GAMLSS add-on package for fitting mixture distributions. By Mikis Stasinopoulos and Bob Rigby.

gcl Computes a fuzzy rules classifier (Vinterbo, Kim & Ohno-Machado, 2005). By Staal A. Vinterbo.

geiger Analysis of evolutionary diversification. Features running macroevolutionary simulations and estimating parameters related to diversification from comparative phylogenetic data. By Luke Harmon, Jason Weir, Chad Brock, Rich Glor, and Wendell Challenger.

ggplot An implementation of the grammar of graphics in R. Combines the advantages of
both base and lattice graphics: conditioning and shared axes are handled automatically, and one can still build up a plot step by step from multiple data sources. Also implements a more sophisticated multidimensional conditioning system and a consistent interface to map data to aesthetic attributes. By Hadley Wickham.

ghyp Provides comprehensive functionality for univariate and multivariate generalized hyperbolic distributions and their special cases (hyperbolic, normal inverse Gaussian, variance gamma and skewed Student t distributions): especially fitting procedures; computation of the density, quantile, probability and random variates; expected shortfall; and some portfolio optimization and plotting routines. Also contains the generalized inverse Gaussian distribution. By Wolfgang Breymann and David Luethi.

glmmAK Generalized Linear Mixed Models. By Arnost Komarek.

granova Graphical Analysis of Variance. Provides distinctive graphics for the display of ANOVA results. Functions were written to display data for any number of groups, regardless of their sizes (however, very large data sets or numbers of groups are likely to be problematic), using a specialized approach to construct data-based contrast vectors with respect to which ANOVA data are displayed. By Robert M. Pruzek and James E. Helmreich.

graph Implements some simple graph handling capabilities. By R. Gentleman, Elizabeth Whalen, W. Huber, and S. Falcon.

grasp Generalized Regression Analysis and Spatial Prediction for R. GRASP is a general method for making spatial predictions of response variables (RV) using point surveys of the RV and spatial coverages of predictor variables (PV). Originally, GRASP was developed to analyse, model and predict vegetation distribution over New Zealand. It has been used in all sorts of applications since then. By Anthony Lehmann, Fabien Fivaz, John Leathwick and Jake Overton, with contributions from many specialists from around the world.

hdeco Hierarchical DECOmposition of Entropy for Categorical Map Comparisons. A flexible and hierarchical framework for comparing categorical map composition and configuration (spatial pattern) along spatial, thematic, or external grouping variables. Comparisons are based on measures of mutual information between thematic classes (colors) and location (spatial partitioning). Results are returned in textual, tabular, and graphical forms. By Tarmo K. Remmel, Sandor Kabos, and Ferenc (Ferko) Csillag.

heatmap.plus Heatmaps with more sensible behavior. Allows the heatmap matrix to have non-identical x and y dimensions, and multiple tracks of annotation for RowSideColors and ColSideColors. By Allen Day.

hydrosanity Graphical user interface for exploring hydrological time series. Designed to work with catchment surface hydrology data (rainfall and streamflow), but could also be used with other time series data. Hydrological time series typically have many missing values and varying accuracy of measurement (indicated by data quality codes). Furthermore, the spatial coverage of data varies over time. Much of the functionality of this package attempts to understand these complex data sets, detect errors and characterize sources of uncertainty. Emphasis is on graphical methods. The GUI is based on rattle. By Felix Andrews.

identity Calculate identity coefficients, based on Mark Abney's C code. By Na Li.

ifultools Insightful Research Tools. By William Constantine.

inetwork Network Analysis and Plotting. Implements a network partitioning algorithm to identify communities (or modules) in a network. A network plotting function then utilizes the identified community structure to position the vertices for plotting. Also contains functions to calculate the assortativity and transitivity of a vertex. By Sun-Chong Wang.

inline Functionality to dynamically define R functions and S4 methods with in-lined C, C++ or Fortran code, supporting the .C and .Call calling conventions. By Oleg Sklyar, Duncan Murdoch, and Mike Smith.

irtoys A simple common interface to the estimation of item parameters in IRT models for binary responses with three different programs (ICL, BILOG-MG, and ltm), and a variety of functions useful with IRT models. By Ivailo Partchev.

kin.cohort Analysis of kin-cohort studies. Provides estimates of the age-specific cumulative risk of a disease for carriers and noncarriers of a mutation. The cohorts are retrospectively built from relatives of probands for whom the genotype is known. Currently the method of moments and marginal maximum likelihood are implemented. Confidence intervals are calculated from bootstrap samples. Most of the code is
a translation from previous MATLAB code by Chatterjee. By Victor Moreno, Nilanjan Chatterjee, and Bhramar Mukherjee.

kzs Kolmogorov-Zurbenko Spline. A collection of functions utilizing splines to smooth a noisy data set in order to estimate its underlying signal. By Derek Cyr and Igor Zurbenko.

lancet.iraqmortality Surveys of Iraq mortality published in The Lancet. The Lancet has published two surveys on Iraq mortality before and after the US-led invasion. The package facilitates access to the data and a guided tour of some of their more interesting aspects. By David Kane, with contributions from Arjun Navi Narayan and Jeff Enos.

ljr Fits and tests logistic joinpoint models. By Michal Czajkowski, Ryan Gill, and Greg Rempala.

luca Likelihood Under Covariate Assumptions (LUCA). Likelihood inference in case-control studies of a rare disease under independence or simple dependence of genetic and non-genetic covariates. By Ji-Hyung Shin, Brad McNeney, and Jinko Graham.

meifly Interactive model exploration using GGobi. By Hadley Wickham.

mixPHM Mixtures of proportional hazard models. Fits multiple-variable mixtures of various parametric proportional hazard models using the EM algorithm. Proportionality restrictions can be imposed on the latent groups and/or on the variables. Several survival distributions can be specified. Missing values are allowed. Independence is assumed over the single variables. By Patrick Mair and Marcus Hudec.

mlmmm Computational strategies for multivariate linear mixed-effects models with missing values (Schafer and Yucel, 2002). By Recai Yucel.

modeest Provides estimators of the mode of univariate unimodal data or univariate unimodal distributions. Can also compute the Chernoff distribution. By Paul Poncet.

modehunt Multiscale Analysis for Density Functions. Given independent and identically distributed observations X(1), ..., X(n) from a density f, this package provides five methods to perform a multiscale analysis about f as well as the necessary critical values. The first method, introduced in Duembgen and Walther (2006), provides simultaneous confidence statements for the existence and location of local increases (or decreases) of f, based on all intervals I(all) spanned by any two observations X(j), X(k). The second method approximates the latter approach by using only a subset of I(all) and is therefore computationally much more efficient, but asymptotically equivalent. Omitting the additive correction term Gamma in either method offers another two approaches which are more powerful on small scales and less powerful on large scales, but no longer asymptotically minimax optimal. Finally, the block procedure is a compromise between adding Gamma or not, with intermediate power properties. The latter is again asymptotically equivalent to the first and was introduced in Rufibach and Walther (2007). By Kaspar Rufibach and Guenther Walther.

monomvn Estimation of multivariate normal data of arbitrary dimension where the pattern of missing data is monotone. Through the use of parsimonious/shrinkage regressions (plsr, pcr, lasso, ridge, etc.), where standard regressions fail, the package can handle an (almost) arbitrary amount of missing data. The current version supports maximum likelihood inference. Future versions will provide a means of sampling from a Bayesian posterior. By Robert B. Gramacy.

mota Mean Optimal Transformation Approach for detecting nonlinear functional relations. Originally designed for the identifiability analysis of nonlinear dynamical models. However, the underlying concept is very general and makes it possible to detect groups of functionally related variables whenever there are multiple estimates for each variable. By Stefan Hengl.

nlstools Tools for assessing the quality of fit of a Gaussian nonlinear model. By Florent Baty and Marie-Laure Delignette-Muller, with contributions from Sandrine Charles and Jean-Pierre Flandrois.

oc OC Roll Call Analysis Software. Estimates Optimal Classification scores from roll call votes supplied through a rollcall object from package pscl. By Keith Poole, Jeffrey Lewis, James Lo and Royce Carroll.

oce Analysis of Oceanographic data. Supports CTD measurements, sea-level time series, coastline files, etc. Also includes functions for calculating seawater properties such as density, and derived properties such as buoyancy frequency. By Dan Kelley.

pairwiseCI Calculation of parametric and nonparametric confidence intervals for the difference or ratio of location parameters, and for the difference, ratio and odds ratio of binomial proportions, for the comparison of independent samples. CIs are not adjusted for multiplicity. A by
statement allows calculation of CIs separately for the levels of one further factor. By Frank Schaarschmidt.

paran An implementation of Horn's technique for evaluating the components retained in a principal components analysis (PCA). Horn's method contrasts eigenvalues produced through a PCA on a number of random data sets of uncorrelated variables with the same number of variables and observations as the experimental or observational data set to produce eigenvalues for components that are adjusted for the sample error-induced inflation. Components with adjusted eigenvalues greater than one are retained. The package may also be used to conduct parallel analysis following Glorfeld's (1995) suggestions to reduce the likelihood of over-retention. By Alexis Dinno.

pcse Estimation of panel-corrected standard errors. Data may contain balanced or unbalanced panels. By Delia Bailey and Jonathan N. Katz.

penalized L1 (lasso) and L2 (ridge) penalized estimation in Generalized Linear Models and in the Cox Proportional Hazards model. By Jelle Goeman.

plRasch Fit log linear by linear association models. By Zhushan Li & Feng Hong.

plink Separate Calibration Linking Methods. Uses unidimensional item response theory methods to compute linking constants and conduct chain linking of tests for multiple groups under a nonequivalent groups common item design. Allows for mean/mean, mean/sigma, Haebara, and Stocking-Lord calibrations of single-format or mixed-format dichotomous (1PL, 2PL, and 3PL) and polytomous (graded response, partial credit/generalized partial credit, nominal, and multiple-choice model) common items. By Jonathan Weeks.

plotAndPlayGTK A GUI for interactive plots using GTK+. When wrapped around plot calls, a window with the plot and a tool bar to interact with it pops up. By Felix Andrews, with contributions from Graham Williams.

pomp Inference methods for partially-observed Markov processes. By Aaron A. King, Ed Ionides, and Carles Breto.

poplab Population Lab: a tool for constructing a virtual electronic population of related individuals evolving over calendar time, by using vital statistics, such as mortality and fertility, and disease incidence rates. By Monica Leu, Kamila Czene, and Marie Reilly.

proftools Profile output processing tools. By Luke Tierney.

proj4 A simple interface to lat/long projection and datum transformation of the PROJ.4 cartographic projections library. Allows transformation of geographic coordinates from one projection and/or datum to another. By Simon Urbanek.

proxy Distance and similarity measures. An extensible framework for the efficient calculation of auto- and cross-proximities, along with implementations of the most popular ones. By David Meyer and Christian Buchta.

psych Routines for personality, psychometrics and experimental psychology. Functions are primarily for scale construction and reliability analysis, although others are basic descriptive stats. By William Revelle.

psyphy Functions that could be useful in analyzing data from psychophysical experiments, including functions for calculating d' from several different experimental designs, links for mafc to be used with the binomial family in glm (and possibly other contexts), and self-Start functions for estimating gamma values for CRT screen calibrations. By Kenneth Knoblauch.

qualV Qualitative methods for the validation of models. By K.G. van den Boogaart, Stefanie Jachner and Thomas Petzoldt.

quantmod Specify, build, trade, and analyze quantitative financial trading strategies. By Jeffrey A. Ryan.

rateratio.test Exact rate ratio test. By Michael Fay.

regtest Functions for unary and binary regression tests. By Jens Oehlschlägel.

relations Data structures for k-ary relations with arbitrary domains, predicate functions, and fitters for consensus relations. By Kurt Hornik and David Meyer.

rgcvpack Interface to the GCVPACK Fortran package for thin plate spline fitting and prediction. By Xianhong Xie.

rgr Geological Survey of Canada (GSC) functions for exploratory data analysis with applied geochemical data, with special application to the estimation of background ranges to support both environmental studies and mineral exploration. By Robert G. Garrett.

rindex Indexing for R. Index structures allow quickly accessing elements from large collections. With btree optimized for disk databases
and ttree for RAM databases, uses hybrid static indexing which is quite optimal for R. By Jens Oehlschlägel.

rjson JSON for R. Converts R objects into JSON objects and vice-versa. By Alex Couture-Beil.

rsbml R support for SBML. Links R to libsbml for SBML parsing and output, provides an S4 SBML DOM, converts SBML to R graph objects, and more. By Michael Lawrence.

sapa Insightful Spectral Analysis for Physical Applications. Software for the book "Spectral Analysis for Physical Applications" by Donald B. Percival and Andrew T. Walden, Cambridge University Press, 1993. By William Constantine and Donald Percival.

scagnostics Calculates Tukey's scagnostics, which describe various measures of interest for pairs of variables, based on their appearance on a scatterplot. They are a useful tool for weeding out interesting or unusual scatterplots from a scatterplot matrix, without having to look at every individual plot. By Heike Hofmann, Lee Wilkinson, Hadley Wickham, Duncan Temple Lang, and Anushka Anand.

schoolmath Functions and data sets for math used in school. A main focus is set on prime calculation. By Joerg Schlarmann and Josef Wienand.

sdcMicro Statistical Disclosure Control methods for the generation of public- and scientific-use files. Data from statistical agencies and other institutions are mostly confidential. The package can be used for the generation of safe (micro)data, i.e., for the generation of public- and scientific-use files. By Matthias Templ.

seriation Infrastructure for seriation, with an implementation of several seriation/sequencing techniques to reorder matrices, dissimilarity matrices, and dendrograms. Also contains some visualization techniques based on seriation. By Michael Hahsler, Christian Buchta and Kurt Hornik.

signalextraction Real-Time Signal Extraction (Direct Filter Approach). The Direct Filter Approach (DFA) provides efficient estimates of signals at the current boundary of time series in real-time. For that purpose, one-sided ARMA-filters are computed by minimizing customized error criteria. The DFA can be used for estimating either the level or turning-points of a series, knowing that both criteria are incongruent. In the context of real-time turning-point detection, various risk-profiles can be operationalized, which account for the speed and/or the reliability of the one-sided filter. By Marc Wildi & Marcel Dettling.

simba Functions for similarity calculation of binary data (for instance presence/absence species data). Also contains wrapper functions for reshaping species lists into matrices and vice versa, and some other functions for further processing of similarity data (Mantel-like permutation procedures) as well as some other useful stuff. By Gerald Jurasinski.

simco A package to import Structure files and deduce similarity coefficients from them. By Owen Jones.

snpXpert Tools to analyze SNP data. By Eun-kyung Lee, Dankyu Yoon, and Taesung Park.

spam SPArse Matrix: functions for sparse matrix algebra (used by fields). Differences with SparseM and Matrix are: (1) support for only one sparse matrix format, (2) based on transparent and simple structure(s), and (3) S3 and S4 compatible. By Reinhard Furrer.

spatgraphs Graphs, graph visualization and graph-based summaries to be used with spatial point pattern analysis. By Tuomas Rajala.

splus2R Insightful package providing missing S-PLUS functionality in R. Currently there are many functions in S-PLUS that are missing in R. To facilitate the conversion of S-PLUS modules and libraries to R packages, this package helps to provide missing S-PLUS functionality in R. By William Constantine.

sqldf Manipulate R data frames using SQL. By G. Grothendieck.

sspline Smoothing splines on the sphere. By Xianhong Xie.

stochasticGEM Fitting Stochastic General Epidemic Models: Bayesian inference for partially observed stochastic epidemics. The general epidemic model is used for estimating the parameters governing the infectious and incubation period length, and the parameters governing susceptibility. In real-life epidemics the infection process is unobserved, and the data consist of the times individuals are detected, usually via appearance of symptoms. The package fits several variants of the general epidemic model, namely the stochastic SIR with Markovian and non-Markovian infectious periods. The estimation is based on a Markov chain Monte Carlo algorithm. By Eugene Zwane.

surveyNG An experimental revision of the survey package for complex survey samples (featuring a database interface and sparse matrices). By Thomas Lumley.
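The listing above is purely descriptive; as a quick illustration of one of the entries, the JSON round-trip that the rjson blurb describes can be sketched as follows. This is a minimal sketch, not from the original article: it assumes only the package's documented toJSON()/fromJSON() pair, and the example data are invented.

```r
# Round-trip an R list through a JSON string with rjson (illustrative only).
library(rjson)

x <- list(pkg = "rjson", year = 2007)
json <- toJSON(x)     # serialize the R list to a JSON string
y <- fromJSON(json)   # parse the string back into an R list

stopifnot(identical(y$pkg, "rjson"), y$year == 2007)
```

Note that the round-trip recovers the list structure, though numeric values come back as plain doubles rather than integers.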
termstrc Term Structure and Credit Spread Estimation. Offers several widely-used term structure estimation procedures, i.e., the parametric Nelson and Siegel approach, the Svensson approach, and cubic splines. By Robert Ferstl and Josef Hayden.

tframe Time Frame coding kernel. Functions for writing code that is independent of the way time is represented. By Paul Gilbert.

timsac TIMe Series Analysis and Control package. By The Institute of Statistical Mathematics.

trackObjs Track Objects. Stores objects in files on disk so that files are automatically rewritten when objects are changed, and so that objects are accessible but do not occupy memory until they are accessed. Also tracks times when objects are created and modified, and caches some basic characteristics of objects to allow for fast summaries of objects. By Tony Plate.

tradeCosts Post-trade analysis of transaction costs. By Aaron Schwartz and Luyi Zhao.

tripEstimation Metropolis sampler and supporting functions for estimating animal movement from archival tags and satellite fixes. By Michael Sumner and Simon Wotherspoon.

twslm A two-way semilinear model for normalization and analysis of cDNA microarray data. Huber's and Tukey's bisquare weight functions are available for robust estimation of the two-way semilinear models. By Deli Wang and Jian Huang.

vbmp Variational Bayesian multinomial probit regression with Gaussian process priors. By Nicola Lama and Mark Girolami.

vrtest Variance ratio tests for weak-form market efficiency. By Jae H. Kim.

waved The WaveD Transform in R. Makes available code necessary to reproduce figures and tables in recent papers on the WaveD method for wavelet deconvolution of noisy signals. By Marc Raimondo and Michael Stewart.

wikibooks Functions and datasets of the German WikiBook "GNU R", which introduces R to new users. By Joerg Schlarmann.

wmtsa Insightful Wavelet Methods for Time Series Analysis. Software to accompany the book "Wavelet Methods for Time Series Analysis" by Donald B. Percival and Andrew T. Walden, Cambridge University Press, 2000. By William Constantine and Donald Percival.

wordnet An interface to WordNet using the Jawbone Java API to WordNet. By Ingo Feinerer.

yest Model selection and variance estimation in Gaussian independence models. By Petr Simecek.

Other changes

• CRAN's Devel area is gone.

• Package write.snns was moved up from Devel.

• Package anm was resurrected from the Archive.

• Package Rcmdr.HH was renamed to RcmdrPlugin.HH.

• Package msgcop was renamed to sbgcop.

• Package grasper was moved to the Archive.

• Package tapiR was removed from CRAN.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Kurt Hornik

Donations and new members

Donations

AT&T Research (USA)
Jaimison Fargo (USA)
Stavros Panidis (Greece)
Saxo Bank (Denmark)
Julian Stander (United Kingdom)

New benefactors

Shell Statistics and Chemometrics, Chester, UK

New supporting institutions

AT&T Labs, New Jersey, USA
BC Cancer Agency, Vancouver, Canada
Black Mesa Capital, Santa Fe, USA
Department of Statistics, University of Warwick, Coventry, UK
Invited Lectures

R has become the standard computing engine in more and more disciplines, both in academia and the business world. How R is used in different areas will be presented in invited lectures addressing hot topics. Speakers will include Peter Bühlmann, John Fox, Andrew Gelman, Kurt Hornik, Gary King, Duncan Murdoch, Jean Thioulouse, Graham J. Williams, and Keith Worsley.

User-contributed Sessions

The sessions will be a platform to bring together R users, contributors, package maintainers and developers in the S spirit that 'users are developers'. People from different fields will show us how they solve problems with R in fascinating applications. The scientific program is organized by members of the program committee, including Micah Altman, Roger Bivand, Peter Dalgaard, Jan de Leeuw, Ramón Díaz-Uriarte, Spencer Graves, Leonhard Held, Torsten Hothorn, François Husson, Christian Kleiber, Friedrich Leisch, Andy Liaw, Martin Mächler, Kate Mullen, Ei-ji Nakama, Thomas Petzoldt, Martin Theus, and Heather Turner. The program will cover topics of current interest such as

• Applied Statistics & Biostatistics

• Bayesian Statistics

Call for pre-conference Tutorials

Before the start of the official program, half-day tutorials will be offered on Monday, August 11. We invite R users to submit proposals for three-hour tutorials on special topics on R. The proposals should give a brief description of the tutorial, including goals, detailed outline, justification why the tutorial is important, background knowledge required and potential attendees. The proposals should be sent before 2007-10-31 to useR-2008@R-project.org.

Call for Papers

We invite all R users to submit abstracts on topics presenting innovations or exciting applications of R. A web page offering more information on 'useR! 2008' is available at:

http://www.R-project.org/useR-2008/

Abstract submission and registration will start in December 2007. We hope to meet you in Dortmund!

The organizing committee:
Uwe Ligges, Achim Zeileis, Claus Weihs, Gerd Kopp, Friedrich Leisch, and Torsten Hothorn
useR-2008@R-project.org