
Introductory Guide

to
S-Plus

Final Version

B.D. Ripley
Professor of Applied Statistics,
University of Oxford
e-mail: 
 
 

24 August 1994
Preface

This guide was originally written for graduate students in Statistics at the University of
Oxford. The first versions were based closely on notes by Dr. Bill Venables of the Department
of Statistics at the University of Adelaide, but have been updated to reflect later versions of S,
the extensions of S-Plus and local facilities. Several sections, in particular 4, 6 and 11, remain
close to Dr. Venables’ original material. This guide will no longer be updated, following the
publication of Venables & Ripley (1994). [See p. 1. Where that takes a significantly better
approach than earlier editions of these notes, the material formerly here has been dropped.]
The guide is to S-Plus, but much of it will be relevant to users of the underlying S. Extensions
which are only in S-Plus include dynamic graphics (§6.3, brush and spin) and the classical
statistics functions (§9). The terminology of this guide is intended to be precise, only referring
to S-Plus rather than S for features unique to S-Plus.
These notes were written for a particular environment, S-Plus 3.2 on Sun SparcStations running
the Open Windows windowing system. You will find a number of differences depending on
your local environment. It will help to have the library ripley available; it should be in the
same source as these notes. It can also be obtained by anonymous ftp from
markov.stats.ox.ac.uk (163.1.20.1)
in file pub/S/ripley.sh.Z. It is available from statlib (see Section A.2) as
send ripley from S
Alternatively, library(MASS) from Venables & Ripley (1994) can be used.
This guide may be freely copied and redistributed for any educational purpose (including
commercial courses) provided its authorship (B.D. Ripley and W.N. Venables) is clearly stated.
Where appropriate, a small charge to cover the costs of production and distribution, only, may
be made.

B.D. Ripley,
University of Oxford,
24th August, 1994.


Contents

1 Introduction
1.1 Starting and Finishing
1.2 Getting Help
1.3 Hardcopy Output

2 Datasets

3 A First Session

4 Simple Data Manipulation
4.1 Vectors
4.2 Vector Arithmetic
4.3 Generating Regular Sequences of Numbers.
4.4 Logical Vectors. Missing Values
4.5 Character Vectors
4.6 Index Vectors. Selecting and Modifying Subsets of a Data Set
4.7 Arrays
4.8 Lists
4.9 Data Frames

5 Reading data into S
5.1 Writing out data

6 Graphics
6.1 Graphical Parameters
6.2 Some Basic Plotting Functions
6.3 Interaction with Plots
6.4 Brush and Spin
6.5 Equally-scaled plots

7 Statistical Summaries
7.1 Arithmetical Summaries
7.2 Histograms and Stem-and-Leaf Plots
7.3 Boxplots

8 Distributions
8.1 Q-Q Plots

9 Classical Statistics

10 Handling Categorical Data
10.1 The Function tapply(...) and Ragged Arrays

11 Loops and Conditional Execution

12 Writing Your Own Functions

13 Statistical Models
13.1 Model Formulas
13.2 One-way Layouts
13.3 Designed Experiments
13.4 Generalized Linear Models
13.5 Updating and Selecting Models

14 Multivariate Analysis

Appendix

A Libraries
A.1 Library ripley
A.2 Sources of Libraries

1 Introduction

S is a statistical language developed at AT&T’s Bell Laboratories. S-Plus is a binary
distribution of S, with added functions, produced by the StatSci Division of MathSoft in Seattle.
The S system was radically re-designed in the 1988 release and known as ‘New S’. In August
1991 a new release of what is once again called S consisted of a moderate revision of ‘New
S’ together with far-ranging extensions. S-Plus 3.0 was introduced in late 1991, based on that
release of S, with numerous additional features. S-Plus 3.1 was released at the very end of
1992, and S-Plus 3.2 in very early 1994.
The main references are:
R.A. Becker, J.M. Chambers and A.R. Wilks (1988) The NEW S language. Wadsworth &
Brooks/Cole.
J.M. Chambers and T.J. Hastie (1992) Statistical Models in S. Wadsworth & Brooks/Cole.
It is not the intention of this guide to replace the books. Rather these notes are intended as
a brief introduction to the capabilities of the S programming language and to how to perform
some common statistical procedures within S. Users of S-Plus will need to consult both books,
probably frequently. Both books contain some reference documentation, but the on-line
versions (see §1.2) are later and definitive.
There are also manuals for S-Plus itself, whose organization differs from release to release.
Other books include
W.N. Venables and B.D. Ripley (1994) Modern Applied Statistics with S-Plus. New York:
Springer. ISBN 0-387-94350-1
which goes far beyond the coverage of this guide, including many topics (such as robust
statistics, non-linear regression, modern regression, survival analysis, tree-based models, time
series and spatial statistics) not covered here, and treats in greater depth what is covered.

1.1 Starting and Finishing

To start S-Plus, type the command
machine% Splus
After a short while (and, the first time, an initialization message) you get the S-Plus prompt
(which can be changed, but the default is assumed here):
>
This is waiting for input from you.
Technically S is a function language with a very simple syntax. Like most Unix based packages
it is case sensitive, so A and a are different variables. Elementary commands consist of either
expressions or assignments. If an expression is given as a command, it is evaluated, printed,
and the value is discarded. (In fact it is kept in the hidden variable .Last.value and so can
be retrieved from the ‘bin’.) An assignment also evaluates an expression and passes the value
to a variable but the result is not printed automatically. An expression can be as simple as
2 + 3 or a complex function call. Assignments are indicated by the assignment operator <- or _.
(As the first needs two keystrokes, lazy typists use the second. However, the first is easier to
read.) For example,
> 2 + 3
[1] 5
> mean(hstart)
[1] 137
> m <- mean(hstart); v <- var(hstart)
> m/sqrt(v)
[1] 3.17142
The [1] states that the answer is starting at the first element of a vector.
Commands are separated either by a semi-colon, ;, or by a newline. If a command is not
complete at the end of a line, S will give a different prompt, namely
+
on second and subsequent lines and continue to read input until the command is syntactically
complete.
S can be extended by writing new functions, which then can be used exactly as built-in func-
tions (and can even replace them). How to write your own functions is covered in section 12.

1.2 Getting Help

S has an inbuilt help facility similar to the man facility of Unix. To get more information on
any specific named function or dataset, for example mean, the command is
> help(mean)
For a feature specified by special characters, and in a few other cases (one is "swiss"), the
argument must be enclosed in double quotes, making it a ‘character string’:
> help("[[")
Help uses a window which overlays your main window. The pager accepts a number of options,
including space, for the next page and q to quit. (Other useful options are < to go to the top
and control-b to go back a page.) If you prefer, a separate help window (which can be left
up) can be obtained by the argument window=T. Another way to get help is by
> ?mean
Short help is given by the function args.
S-Plus also has a window-based help facility, started by
> help.start(gui="openlook")
Click with the left mouse button on items to select categories and items. The help window can
be left up, or removed by
> help.off()
It is not advisable to quit S-Plus windows from the frame menu.

1.3 Hardcopy Output

Graphics are printed by holding down the right button on the graph menu in an openlook()
window (see §6) and releasing over the print item. This will print on the nearest laser printer
(or that selected by your PRINTER environment variable).
To record a session, cut-and-paste to a textedit window, then remove your mistakes (if any)
and save as a Unix file.

2 Datasets

Datasets are stored in a directory ~/.Data. They are permanent, so all the objects you create
are retained until explicitly deleted. (As the directory name .Data begins with . it will
normally be hidden in file listings from Unix by ls.) If there is a .Data directory in the current
directory when S is invoked, that directory is used rather than ~/.Data. This provides one
way to organize your S work, using separate directories for each project.
In S, to get a list of names of the objects currently defined use the command
> objects()
Your own functions are also stored in .Data. To find out whether an object is a function or
dataset, and what is in it, just type its name at the prompt, e.g.
> stack.x
> plot
This prints out the function, dataset, ... . In the later versions of S it may print a short summary
of the object. To get the full details, use
> print.default(object)
When S looks for an object, it searches in turn through a sequence of directories known as the
search list. Usually the first entry in the search list is the .Data sub-directory of the current
working directory. The names of the directories currently on the search list can be found by
the function
> search()
The names of the objects held in any directory on the search list can be displayed by giving the
objects function an argument. For example objects(2) lists the contents of the second directory
in the search list. Normally the second, third and fourth directories hold built-in functions, and
the fifth, sixth and seventh contain standard datasets.
Extra search directories can be added to this list with the attach(...) function and removed
with the detach(...) function, details of which can be found in the manuals or the help
facility. Note that attached directories are searched after the .Data directory in the order last
attached to first attached.
To remove objects permanently the function rm is available:
> rm(x, y, z, ink, junk, temp)
The function remove(...) can be used to remove objects with non-standard names.

Warning

Objects in your .Data directory will take precedence over system objects of the same name.
This is a frequent cause of rather obscure errors, and can cause apparently correct behaviour but
erroneous results. Avoid using names such as c, s, t, glm, range, tree for your own
objects. If you get peculiar errors, clean up your .Data directory and try again!

S keeps a record of commands in the .Audit file in the .Data directory. This is a hidden file
and can grow rather large. Use (from the Unix command line)
Splus TRUNC_AUDIT 0
occasionally to clean out the audit file entirely (or omit the 0 to keep the last 0.5Mb).

3 A First Session

The sample session given below is intended to show by example some of the capabilities of the
system. Work through the session given by the commands on the left of the page. Some clues
as to what is going on are given at the right hand side of the page.
machine% Splus                          Start the session.
> openlook()                            Open the graphics window.
> library(ripley)                       Add a library of functions and datasets.
> help(trees)                           use q to quit
> trees                                 Print out a data frame of the trees data
> attach(trees)                         so that we can use names diam etc
> hist(diam)                            Histogram as counts.
> hist(diam, nclass=10, probability=T)  as probability density
> help(hist)
> stem(diam)                            Stem-and-leaf plot.
> plot(diam, volume)                    Scatter plot.
> trees.lm <- lm(volume ~ diam)         linear regression
> summary(trees.lm)                     summary of fit
> anova(trees.lm)                       analysis of variance table
> abline(trees.lm)                      plot line on scatter plot
> identify(diam, volume, height)        Move mouse to plot and click with left button
                                        to see what height is. Click middle button to
                                        quit.
> par(mfrow=c(1,2))                     set up 1 row, 2 cols for plots
> plot(trees.lm)                        plots of fitted values and |residuals| vs fitted value.
> par(mfrow=c(1,1))                     one plot again.
> qqnorm(residuals(trees.lm))           normal probability plot of residuals
> qqnorm(studres(trees.lm))             and of Studentized residuals
> qqline(studres(trees.lm))             line through quartiles
> pairs(trees)                          all pair-wise scatter plots
> brush(cbind(diam, height, volume))    rotate points in 3D, select and de-select points.
                                        Click on quit to end
> trees.lm2 <- lm(volume ~ diam + height)
> trees.lm3 <- lm(log(volume) ~ log(diam) + log(height))
                                        multiple regression. Try functions as before
> detach("trees")                       to avoid any confusion
> help(road)
> attach(road)
> plot(drivers, deaths)
> plot(drivers, deaths, log="xy")
> state <- row.names(road)
> identify(drivers, deaths, state)      Find the ‘odd’ states.
> plot(fuel, deaths, log="xy")
> identify(fuel, deaths, state)
> road.mat <- cbind(drivers, fuel, deaths)
                                        Set up a matrix
> pairs(road.mat)
> brush(road.mat, rowlab=state, spin=F) Look at pattern of all three
                                        Use mouse to highlight points and check their
                                        identity. Then click on quit
> q()                                   Finish session

4 Simple Data Manipulation

The basic data objects in S are vectors, arrays, lists and data frames.

4.1 Vectors

S operates on named data structures. The simplest such structure is the vector, which is a
single entity consisting of an ordered collection of numbers. To set up a vector named x, say,
consisting of five numbers, namely 10.4, 5.6, 3.1, 6.4 and 21.7, use the S command
> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
This is an assignment statement using the function c(...), which takes an arbitrary number of
vector arguments and whose value is the vector of its arguments.
A number occurring by itself in an expression is taken as a vector of length one.
Assignments can also be made in the other direction, using the obvious change in the
assignment operator. So the same assignment could be made using
> c(10.4, 5.6, 3.1, 6.4, 21.7) -> x
If an expression is used as a complete command, the value is printed and lost. So now if we
were to use the command
> 1/x
the reciprocals of the five values would be printed (and, of course, the value of x would be
unchanged).
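
The commands of this section can be tried together. The following is a minimal sketch in S; the name recip is introduced here only for illustration and does not come from the guide.

```r
# Build the vector from the text by assignment; nothing is printed.
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# An expression used as a command is printed and then discarded.
length(x)
1/x

# To keep the reciprocals they must be assigned ('recip' is a made-up name).
recip <- 1/x

# c() also accepts whole vectors and concatenates them end to end.
longer <- c(x, 0, x)
length(longer)
```

After these commands x still holds the original five values, since only an assignment changes an object.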

4.2 Vector Arithmetic

Vectors can be used in arithmetic expressions, in which case the operations are performed
element-by-element. Vectors occurring in the same expression need not all be of the same
length. If they are not, the value of the expression is a vector with the same length as the longest
vector which occurs in the expression. Shorter vectors in the expression are recycled as often
as need be (perhaps fractionally) until they match the length of the longest vector. In particular
a constant is simply repeated. So with x as above and y a vector of length 11, the command
> v <- 2*x + y + 1
generates a new vector v of length 11 constructed by adding together, element-by-element, 2*x
repeated 2.2 times, y repeated just once, and 1 repeated 11 times.
The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to a power. In
addition all of the common arithmetic functions are available. log, log10, exp, sin, cos,
tan, sqrt, and so on, all have their usual meaning. max and min select the largest and
smallest elements of a vector respectively. range is a function whose value is a vector of length
two, namely c(min(x), max(x)). The element-by-element maximum and minimum of two or
more vectors are given by pmax and pmin. length(x) is the number of elements in x, sum(x)
gives the total of the elements in x and prod(x) their product.
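
The recycling rule is easiest to see on a small case. The sketch below assumes a y of length 11 built as c(x, 0, x); that particular choice of y is an assumption made for illustration, not something fixed by the text.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
y <- c(x, 0, x)        # assumed y of length 11, so x has to be recycled

# x is reused 2.2 times and the constant 1 is reused 11 times.
# (The system may warn that 11 is not a multiple of 5.)
v <- 2*x + y + 1
length(v)

# Element 6 pairs y[6] (which is 0) with the recycled x[1].
v[6]

range(x)               # the same as c(min(x), max(x))
pmax(x, 6)             # element-by-element maximum against a constant
```

Here v[6] works out as 2*10.4 + 0 + 1, which shows that the sixth position has wrapped round to the first element of x.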

Two statistical functions are mean(x), which evaluates to sum(x)/length(x) and var(x),
which gives the value sum((x-mean(x))^2)/(length(x)-1), the sample variance. If the
argument to var(...) is an n × p matrix the value is a p × p sample covariance matrix obtained
from regarding the rows as independent p-variate sample vectors.
sort(x) returns a vector of the same size as x with the elements arranged in increasing order.
Other, more flexible, sorting facilities are available (see order(...) which produces a
permutation to do the sorting, and sort.list).
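
The two formulas above can be checked against the built-in functions. This is a small sketch; the names n, m, s2 and o are chosen here for illustration.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
n <- length(x)

m  <- sum(x)/n                  # the formula given for mean(x)
s2 <- sum((x - m)^2)/(n - 1)    # the formula given for var(x)
all.equal(m, mean(x))
all.equal(s2, var(x))

sort(x)                         # elements in increasing order
o <- order(x)                   # the permutation that does the sorting
x[o]                            # indexing by that permutation sorts x
```

Keeping the permutation o rather than the sorted values is what makes order(...) more flexible: the same o can be used to rearrange other vectors of the same length.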

4.3 Generating Regular Sequences of Numbers.

S has a number of facilities for generating commonly used sequences of numbers. For
example 1:30 is the vector c(1, 2, ..., 29, 30). The colon operator has highest priority within
an expression, so, for example 2*1:15 is the vector c(2, 4, 6, ..., 28, 30). Put n <- 10 and
compare the sequences 1:n-1 and 1:(n-1).
The construction 30:1 may be used to generate a backwards sequence.
The function seq(...) is a more general facility for generating sequences. It has five
arguments, only some of which may be specified in any one call. The first two arguments, if given,
specify the beginning and end of the sequence, and if these are the only two arguments given
the result is the same as the colon operator. That is, seq(2,10) is the same vector as 2:10.
Parameters to seq(...), and to many other S functions, can also be given in named form, in
which case the order in which they appear is irrelevant. The first two parameters may be named
from=value and to=value; thus seq(1,30), seq(from=1, to=30) and seq(to=30, from=1)
are all the same as 1:30. The next two parameters to seq(...) may be named by=value and
length=value, which specify a step size and a length for the sequence respectively. If neither
of these is given, the default by=1 is assumed.
For example
> seq(-5, 5, by=.2) -> s3
generates in s3 the vector c(-5.0, -4.8, -4.6, ..., 4.6, 4.8, 5.0). Similarly
> s4 <- seq(length=51, from=-5, by=.2)
generates the same vector in s4.
The fifth parameter may be named along=vector, which if used must be the only parameter,
and creates a sequence 1, 2, ..., length(vector), or the empty sequence if the vector is
empty (as it can be).
A related function is rep(...) which can be used for replicating a structure in various
complicated ways. The simplest form is
> s5 <- rep(x, times=5)
which will put five copies of x end-to-end in s5.
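
The sequence constructions of this section can be compared side by side. A minimal sketch using the guide's own names s3, s4 and s5:

```r
n <- 10
1:n-1              # ':' binds tighter than '-', so this is (1:n) - 1
1:(n-1)            # parentheses give the sequence up to n-1 instead
2*1:15             # likewise (1:15) doubled

s3 <- seq(-5, 5, by=0.2)
s4 <- seq(length=51, from=-5, by=0.2)   # named arguments, in any order
all.equal(s3, s4)

x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
s5 <- rep(x, times=5)                   # five copies of x end to end
length(s5)
seq(along=x)                            # 1, 2, ..., length(x)
```

The first two lines differ in both length and starting value, which is the point of the comparison suggested in the text.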

4.4 Logical Vectors. Missing Values

As well as numerical vectors, S allows manipulation of logical quantities. The elements of a
logical vector have just two possible values, represented formally as F (for ‘false’) and T (for
‘true’). (TRUE and FALSE are also valid representations.)
Logical vectors are generated by conditions. For example
> temp <- x > 13
sets temp as a vector of the same length as x with values F corresponding to elements of x where
the condition is not met and T where it is.
The logical operators are <, <=, >, >=, == for exact equality and != for inequality. In addition if
c1 and c2 are logical expressions, then c1 & c2 is their intersection (and), c1 | c2 is their union
(or) and !c1 is the negation of c1.
Logical vectors may be used in ordinary arithmetic, in which case they are coerced into numeric
vectors, F becoming 0 and T becoming 1. However there are situations where logical vectors
and their coerced numeric counterparts are not equivalent.
In some cases the components of a vector may not be completely known. When an element
or value is “not available” or a “missing value” in the statistical sense, a place within a vector
may be reserved for it by assigning it the special value NA. In general any operation on an NA
becomes an NA. The motivation for this rule is simply that if the specification of an operation
is incomplete, the result cannot be known and hence is not available.
The function is.na(x) gives a logical vector of the same size as x with value T if and only if
the corresponding element in x is NA.
> ind <- is.na(z)
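
A short sketch of conditions, coercion and missing values follows. The vector z is made-up data for illustration, and count and eq are names introduced here.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
temp <- x > 13           # one logical value per element of x
count <- sum(temp)       # logicals are coerced to 0 and 1, so this counts the T values

z <- c(1, NA, 3)         # hypothetical vector with one missing value
z + 1                    # the missing element stays missing
ind <- is.na(z)          # T exactly where z is NA

# Comparing with NA is itself an operation on NA, so every result is NA;
# is.na() is the way to test for missing values.
eq <- (z == NA)
```

The last line shows why is.na(...) exists as a separate function: the rule that operations on NA give NA applies to comparisons too.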

4.5 Character Vectors

Character quantities and character strings are used frequently in S, for example as plot labels.
They are denoted by a sequence of characters delimited by the double quote character. E.g.
"x-values", "New iteration results". Single quotes can also be used, in matching pairs.
Character strings may be collected into a vector by the c(...) function; examples of their use
will emerge frequently.
The paste(...) function takes an arbitrary number of character string arguments and
concatenates them into a single character string. Any numbers given among the arguments are coerced
into character strings in the same way they would be if they were printed. The arguments are
by default separated in the result by a single blank character, but this can be changed by the
named parameter, sep=string, which changes it to string, possibly empty.
For example
> labs <- paste(c("X","Y"), 1:10, sep="")
makes labs the character vector c("X1", "Y2", "X3", ..., "X9", "Y10"). Note in particular
that recycling of short vectors takes place here too; thus c("X","Y") is repeated 5
times to match the sequence 1:10.
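
The paste(...) example can be run as it stands. The second call below, with the made-up arguments "Run" and 3, is added only to show the default separator.

```r
# Two-element character vector recycled against 1:10, with no separator.
labs <- paste(c("X", "Y"), 1:10, sep="")
labs
length(labs)

# With the default separator a single blank is inserted, and the
# number is coerced to a character string.
run.label <- paste("Run", 3)
run.label
```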


The elements of a vector can be named (as well as numbered) by assigning a character vector
to its names attribute, e.g.
> costs <- c(26, 45, 67, 33, 51)
> names(costs) <- c("banana", "apple", "orange", "fig", "kiwi")
> costs
banana apple orange fig kiwi
    26    45     67  33   51

4.6 Index Vectors. Selecting and Modifying Subsets of a Data Set


Elements of a vector may be extracted by specifying the element in square brackets, e.g. x[5].
More generally, subsets of a vector (or any expression that evaluates to a vector) may be
selected by appending to the name of the vector an index vector in square brackets. Such index
vectors can be any of four distinct types:

1. A logical vector. In this case the index vector must be of the same length as the vector from
which elements are to be selected. Values corresponding to T in the index vector are
selected and those corresponding to F omitted. For example
> y <- x[!is.na(x)]
creates (or re-creates) an object y which will contain the non-missing values of x, in the
same order. Note that if x has missing values, y will be shorter than x. Also
> (x+1)[(!is.na(x)) & x>0] -> z
creates an object z and places in it the values of the vector x+1 for which the
corresponding value in x was both non-missing and positive.
2. A vector of positive integral quantities. In this case the values in the index vector must
lie in the set {1, 2, ..., length(x)}. The corresponding elements of the vector
are selected and concatenated, in that order, in the result. The index vector can be of any
length and the result is of the same length as the index vector. For example x[6] is the
sixth component of x and
> x[1:10]
selects the first 10 elements of x (assuming length(x) ≥ 10). Also
> c("x","y")[rep(c(1,2,2,1), times=4)]
(an admittedly unlikely thing to do) produces a character vector of length 16 consisting
of "x", "y", "y", "x" repeated four times.
3. A vector of negative integral quantities. In this case the index vector specifies the values
to be excluded rather than included. Thus
> y <- x[-(1:5)]
gives y all but the first five elements of x.

4. A vector of character strings. This possibility only applies where an object has a names
attribute to identify its components. In this case a subvector of the names vector may be
used in the same way as the positive integral labels in 2.
> lunch <- fruit[c("apple","orange")]
This option is particularly useful in connection with data frames (see §4.9).

An indexed expression can also appear on the receiving end of an assignment, in which case
the assignment operation is performed only on those elements of the vector. The expression
must be of the form vector[index vector] as having an arbitrary expression in place of the
vector name would not make sense.
The vector assigned must match the length of the index vector, and in the case of a logical index
vector it must again be the same length as the vector it is indexing.
For example
> x[is.na(x)] <- 0
replaces any missing values in x by zeros and
> y[y<0] <- -y[y<0]
has the same effect as
> y <- abs(y)
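
The four kinds of index vector and indexed assignment can be seen together on a small example. The data values below, and the contents of fruit, are invented for illustration.

```r
x <- c(10.4, NA, 3.1, -6.4, 21.7, 0.5, 7)   # made-up data with one NA

y <- x[!is.na(x)]                   # logical index: drop the missing value
z <- (x + 1)[(!is.na(x)) & x > 0]   # x+1 where x is non-missing and positive
first3 <- x[1:3]                    # positive index: the first three elements
rest <- x[-(1:5)]                   # negative index: all but the first five

fruit <- c(26, 45, 67)              # hypothetical named vector
names(fruit) <- c("banana", "apple", "orange")
lunch <- fruit[c("apple", "orange")]   # character index selects by name

x[is.na(x)] <- 0                    # indexed assignment: NA replaced by zero
y[y < 0] <- -y[y < 0]               # same effect as y <- abs(y)
```

Note that y is shorter than x because the missing value was dropped, while the assignment to x[is.na(x)] keeps the length of x and changes only the selected element.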

4.7 Arrays

An array can be considered as a multiply subscripted collection of data entries of the same type,
for example numeric, logical or character string.
An array is defined by having a dimension vector, a vector of positive integers. If its length is
k then the array is k-dimensional. The values in the dimension vector give the upper limits for
each of the k subscripts. The lower limits are always 1. Suppose, for example, z is a vector of
1500 elements. The assignment
> dim(z) <- c(3, 5, 100)
allows z to be treated as a 3 × 5 × 100 array.
Other functions such as matrix(...) and array(...) are available for simpler and more
natural looking assignments in special cases, e.g.
> z <- array(z, c(3, 5, 100))
> z <- matrix(z, 3, 500)
The values in the data vector give the values in the array in the same order as they would occur in
Fortran, that is, with the first subscript moving fastest and the last subscript slowest. For
example if the dimension vector for an array, say a, is c(3,4,2) then there are 3 × 4 × 2 = 24 entries
in a and the data vector holds them in the order a[1,1,1], a[2,1,1], ..., a[2,4,2],
a[3,4,2]. To make life easier, matrix has a byrow=T parameter for data presented by row
rather than by column.
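
The storage order can be checked by filling an array with its own positions. This is a sketch in which z holds 1, 2, ..., 1500, so each entry reports where it sits in the data vector; the names a, m.col and m.row are introduced here.

```r
z <- 1:1500
dim(z) <- c(3, 5, 100)      # z is now treated as a 3 x 5 x 100 array

z[2, 1, 1]                  # position 2: the first subscript moves fastest
z[1, 2, 1]                  # position 1 + 3 = 4
z[1, 1, 2]                  # position 1 + 3*5 = 16

a <- array(1:24, c(3, 4, 2))
dim(a[2, , ])               # an empty index position takes the full range

m.col <- matrix(1:6, 2, 3)               # filled by column (the default)
m.row <- matrix(1:6, 2, 3, byrow=TRUE)   # filled by row
```

Comparing m.col[1,2] with m.row[1,2] shows the effect of byrow: the same data vector lands in different cells.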

Individual elements of an array may be referenced by giving the name of the array followed
by the subscripts in square brackets, separated by commas. More generally, subsections of an
array may be specified by giving a sequence of index vectors in place of subscripts; however if
any index position is given an empty index vector, then the full range of that subscript is taken.
Thus a[2,,] is a 4 × 2 array with dimension vector c(4,2) and data vector
a[2,1,1], a[2,2,1], a[2,3,1], a[2,4,1], a[2,1,2], a[2,2,2], a[2,3,2], a[2,4,2],
in that order. a[,,] stands for the entire array, which is the same as omitting the subscripts
entirely and using a alone.
Arrays may be used in arithmetic expressions and the result is an array formed by
element-by-element operations on the data vector. The dimension vectors of operands generally need to be
the same, and this becomes the dimension vector of the result. So if A, B and C are all similar
arrays, then
> D <- 2*A*B + C + 1
makes D a similar array with data vector the result of the evident element-by-element
operations. The matrix multiplication operator is %*%.
There are extensive matrix manipulation facilities, including transposes and eigenvalue,
Cholesky, QR and singular-value decompositions. See help on t, eigen, chol, qr and svd.
Any dimension of an array can be given a set of names using dimnames, but it is usually easier
to use the facilities of data frames.
Matrices can be built up from given vectors and matrices by the functions cbind(...) and
rbind(...). Informally, cbind(...) forms matrices by binding together vectors or matrices
horizontally, or column-wise, and rbind(...) vertically, or row-wise.
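
The difference between * and %*%, and between cbind(...) and rbind(...), shows up on 2 × 2 matrices. The matrices A and B below are invented for illustration.

```r
A <- matrix(1:4, 2, 2)             # columns are (1,2) and (3,4)
B <- matrix(c(0, 1, 1, 0), 2, 2)   # a permutation matrix

elementwise <- A * B               # element-by-element product
product <- A %*% B                 # matrix product: swaps the columns of A
t(A)                               # transpose

wide <- cbind(A, c(9, 9))          # a new column on the right: 2 x 3
tall <- rbind(A, c(9, 9))          # a new row underneath: 3 x 2
```

Multiplying on the right by B exchanges the two columns of A, whereas A * B simply zeroes the diagonal entries, so the two results differ even though the operands are the same.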

4.8 Lists

An S list is an object consisting of an ordered collection of objects known as its components.
There is no particular need for the components to be of the same mode or type, and, for example,
a list could consist of a numeric vector, a logical value, a matrix, a character array, a function,
and so on.
Components are always numbered and may always be referred to as such. If trees is a list,
then the function length(trees) gives the number of (top level) components it has, specified
as trees[[1]], trees[[2]] and so on.
Components of lists may also be named, and in this case the component may be referred to
either by giving the component name as a character string in place of the number in double
square brackets, or, more conveniently, by giving an expression of the form
> name$component name
for the same thing. This is a very useful convention as it makes it easier to get the right
component if you forget the number, and is strongly advised. You can find out the names of the
components by
> names(name)
and this generates much less output than printing the object, which will achieve the same
purpose.
The names of components may be abbreviated down to the minimum number of letters needed
to identify them uniquely. Most of the datasets are in fact lists (or can be treated as lists), so we
could refer to the component diam of the trees data as trees$d. Similarly, many S functions
return lists of results.
It is important to distinguish trees[[1]] from trees[1]. “[[...]]” is the operator used to
select a single element of a list, whereas “[...]” is a general subscripting operator for vectors.
Fortunately, numbered components are needed very rarely.
New lists may be formed from existing objects by the function list(...). An assignment of
the form
> trees <- list(diam=tree.d, height=tree.h, volume=tree.v)
sets up a list trees of 3 components using the existing objects tree.d, tree.h
and tree.v
for the components and giving them names as specified by the argument names (which can be
chosen freely). If these names are omitted, the components are numbered only.
Lists can be attach-ed as well as directories, and this allows their components to be accessed
as if they were stand-alone entities. Thus in the trees example we could have
> attach(trees)
> mean(height)
It is wise to detach("trees") after use to avoid any nasty surprises.
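
The difference between the two subscripting operators, and the abbreviation of component names, can be seen in a short sketch that also runs in R. The list lst and its values are hypothetical and stand in for the trees data.

```r
# A small list with named components (hypothetical values).
lst <- list(diam = c(8.3, 8.6, 8.8), height = c(70, 65, 63), label = "cherry")

length(lst)     # number of top-level components
names(lst)      # the component names

lst[[1]]        # the first component itself: a numeric vector
lst[1]          # a list of length one that contains that vector
lst$height      # a component selected by name
lst$h           # abbreviated name; "h" identifies height uniquely
```

Using the full name is the safer habit in scripts, since an abbreviation stops being unique as soon as another component with the same prefix is added.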

4.9 Data Frames

Data frames were introduced in the August 1991 release of S, and can be thought of as closely-
coupled lists of data vectors of the same length. Unlike matrices, the data vectors can be of
different types, including character data. Both the rows and columns can be labelled. Consider
the data frame road from library(ripley):
> road
        deaths drivers popden rural temp  fuel
Alabama    968     158   64.0  66.0   62 119.0
Alaska      43      11    0.4   5.9   30   6.2
.............................................
Mo        1289     234   63.0 100.0   40 180.0
Mont       259      38    4.6  72.0   29  31.0
which has both row and column labels. The columns can be treated as components of a list:
> road$rural

¼  ÀF½ — À ÐIÐÀF½ IÐÀF½H¼/¼—Àƽ /ЗÀƽ —À8¼ ÐÀ ½ÀF½  —ÀF½ IÐÀF½ $ ½ÀF½

¼pÐ †¼p½IÊÀF½ — Àƽ†¼’½I½ÀF½H¼pÊæÀF½ —Àƽ A½—Àƽ ¼ —ÀF½ ÊÀF½ ¼ —Àƽ  —ÀF½ë¼I¼’½ÀF½  ÀF½

Ê  †¼p½I½ÀF½ I—Ê Àƽ
and the structure can be treated as a two-dimensional array:

> road[2, 4]
Alaska
   5.9
> road["Mo", "temp"]
Mo
40
> road["Mo", ]
   deaths drivers popden rural temp fuel
Mo   1289     234     63   100   40  180
Note how the row label is carried along.
Data frames can be attach-ed just as lists can, and this allows their columns to be accessed as
if they were named vectors.
A data frame can be created from vectors and matrices by the data.frame function. For example:
> treeframe <- data.frame(diam=tree.d, height=tree.h, volume=tree.v)
If the columns are not named, they pick up the names of the vectors, so
> treeframe <- data.frame(tree.d, tree.h, tree.v)
gives
  tree.d tree.h tree.v
1    8.3     70   10.3
2    8.6     65   10.3
3    8.8     63   10.2
4   10.5     72   16.4
5   10.7     81   18.8
......
Character vectors given to data.frame are automatically treated as factors (see §10), unless
specified within a I() function.
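
As a minimal sketch of the three ways of addressing a data frame described above (as a list, by position, and by label), the following also runs in R. The frame df, its values and its row labels are hypothetical.

```r
# A small data frame with labelled rows (hypothetical values).
df <- data.frame(diam = c(8.3, 8.6, 8.8),
                 height = c(70, 65, 63),
                 row.names = c("t1", "t2", "t3"))

df$height        # a column treated as a list component
df[2, "diam"]    # a single element by row number and column label
df["t3", ]       # a whole row; the row label is carried along
```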

5 Reading data into S

Data objects will usually be read as values from external files. This is done most conveniently
with the scan(...) function. To read a vector from the keyboard we can use
Z < 4/#-)=9W%Þl-m < pK iæ” k ” k M” x” k ”Ui ” o ” k ”Uç” o-o ”UÜæ”UÜ ” v ” o i” o$o ” $o o ”
o=o v o o v r o$o
j ” i ”:x U” i=i”tç 8” w$ç ”Mx k ”i$æ” wæ”Ui-$wæ”Ü/ç”Mx k ”8Ü v ”Uw=wæF” xBÜæ” k-k O
or
< I4 #=)-9W%×l=mê% =< 1 ) KTO
k k x k i o k ç o-o s
io=o s Ü Ü v o i o=o†o-o
v o i o x v i=iHç r $w çHx k =i  -o o Î
w i==wÎÜAçÞx k Ü v w=wÞxÜ k-k
Input is terminated by a blank input line (from the terminal only, despite the documentation) or
by EOF (ctrl-D in Unix). To read in a character vector we specify the vector type by the second
argument:
> diet <- scan(, "")
ŽêŒÎNsžHìéžÃŽ†ìÎNsŒ
ì†NéžÎŒH Ž Ξ†ŒHNsìΎ
ŒHì ÎêžHNs N HŽ†ìéŒÎž
To read from a file specify its name as the first argument, for example
> counts <- scan("chd.dat")
Now suppose that multiple data vectors of equal length are to be read in parallel. For example
suppose that there are three vectors, the first of mode character and the remaining two of mode
numeric, and the file is input.dat. Use scan(...) to read in the three vectors as a list, as
follows
> inp <- scan("input.dat", list(id="", x=0, y=0))
The second argument is a dummy list structure that establishes the mode of the three vectors to
be read. The result, held in inp, is a list whose (named) components are the three vectors read
in.
Matrices are usually read by row, as follows
> X <- matrix(scan("light.dat"), ncol=5, byrow=T)
The argument skip= to scan can be used to skip header rows of files.

Data frames can be read from a file by the read.table function. The data file should be a
table in one of a number of formats:

1. A file such as rotifer.dat (page 39) which has a first row naming the columns, followed
by the table of numeric data can be read by
> rotifer <- read.table("rotifer.dat", header=T)

2. A file laid out like the listing of a data frame. This has a first header line, and rows which
contain the row label followed by the data for the columns, such as
        deaths drivers popden rural temp fuel
Alabama    968     158     64    66   62  119
Alaska      43      11    0.4   5.9   30  6.2
.............................................
Note that the header has one fewer entry than subsequent rows. This format is read by
> road <- read.table("road.dat")
3. A table without any header. The row and column labels are then 1, ..., n and V1,
..., Vp. However, if there exists a character column without duplicates, the first such is
taken as the row labels and removed as a column.
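
The second format can be tried out with a short sketch that also runs in R. It writes the first two rows of the listing above to a temporary file and reads them back; tempfile and writeLines are R functions used here for convenience, and the object name road2 is hypothetical.

```r
# Write a small file whose header has one fewer entry than the data rows.
f <- tempfile()
writeLines(c("deaths drivers popden rural temp fuel",
             "Alabama 968 158 64 66 62 119",
             "Alaska 43 11 0.4 5.9 30 6.2"), f)

# No header argument is needed: the short first line is taken as the
# column names and the first field of each row as the row label.
road2 <- read.table(f)
road2
road2["Alaska", "rural"]
```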

Sometimes it is necessary to read in character strings which contain spaces. This can be done
by separating the fields in the file by, for example, tabs or commas:
> usroad <- scan("road.dat", sep="\t", list(state="", deaths=0,
+     drivers=0, popden=0, rural=0, jantemp=0, fuel=0))
where \t is the usual Unix abbreviation for a tab character. This device also applies to read.table.

5.1 Writing out data

There are many ways to write out data from S, for example the print, cat and format commands.
To write directly to a file, there are cat, write, and, from S-Plus 3.2, write.table
which is usually the simplest method. This can write a dataframe, matrix or vector, with syntax
> write.table(data, file="", sep=",")
and further arguments can be found in the help page. By default it writes out comma-separated
items on rows, but the separator can be changed to space or tab ("\t" in Unix).
The function write writes a vector, with syntax
> write(data, file="data", ncolumns=5)
for numeric data, and in one column for character data. To write out a matrix m, use
> write(t(m), file="data", ncolumns=ncol(m))
The function format converts data to a line of characters, and can be used with write or cat
to construct custom reports.
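
The reason for the transpose in the matrix case can be checked with a round trip, sketched here so that it also runs in R. The matrix m is a hypothetical example; tempfile and the quiet argument of scan are R conveniences.

```r
m <- matrix(1:6, nrow = 2)      # 2 x 3 matrix, stored column by column

f <- tempfile()
# write() emits the elements in storage order, so the transpose is written
# to get one row of m per line of the file.
write(t(m), file = f, ncolumns = ncol(m))

# Reading the file by row recovers the original matrix.
m2 <- matrix(scan(f, quiet = TRUE), ncol = ncol(m), byrow = TRUE)
```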

6 Graphics

The graphical facilities are central to S. The steps involved are as follows:

1. The type of terminal, or device, is declared to S at the beginning of the session:


> openlook()

2. A command is issued to construct a plot from data. For example


> plot(x, y)
specifies a simple point plot where x and y are vectors giving the x- and y-coordinates
of the points respectively. (The command includes a default automatic choice of axes,
scales, titles and plotting characters, all of which can be overridden with additional graph-
ical parameters that could be included as named arguments in the command.)

6.1 Graphical Parameters

Functions producing graphical output usually have optional additional named arguments that
can be specified to override some default parameter settings and hence modify the character-
istics of a plot. A short list of the main ones is as follows:
axes=L          If FALSE all axes are suppressed. Default TRUE, axes are automatically constructed.
type="c"        Type of plot desired. Values for c are:
                p for points only (the default for function plot),
                l for lines only,
                b for both points and lines (the lines miss the points),
                s, S for step functions (s specifies to change now, S to change just before the
                next point),
                o for overlaid points and lines,
                h for high density vertical line plotting, and
                n for no plotting (but axes are still found and set).
xlab="string"   Give labels for the x- and/or y-axes (default: the names, including suffices, of
ylab="string"   the x and y coordinate vectors).
sub="string"    sub specifies a title to appear under the x-axis label and main a title for the top
main="string"   of the plot in larger letters (default: both empty).
xlim=c(lo, hi)  Approximate minimum and maximum values for x- and/or y-axes settings. These
ylim=c(lo, hi)  values are automatically rounded to make them “pretty” for axis labelling.

Other graphical parameters control the background characteristics of all subsequent plots and
are usually specified by a call to the function par(...). There are a great number of these
parameters and the command
> help(par)
gives a complete list of them and their meanings. Some of the more commonly adjusted ones
are as follows:

lty=n           Line type is n. If lines are being plotted, a variety of line types is available;
                n=1 means a solid line, n=2,3,... indicates a variety of broken line forms.
pch="c"         Specify the character to be used for plotting points (the default differs between
                graphics terminals and PostScript).
mfrow=c(m, n)   Multiple frames on the one plot. Instead of plotting just one graph per screen,
mfcol=c(m, n)   each screen (or page) will contain an array of m by n graphs forming an m by n
                grid. If mfrow is used the screen is filled row-by-row and if mfcol is used it is
                filled column-by-column. Useful if many graphs are to be inspected simultaneously
                and high resolution is not necessary.
pty="c"         Specify the type of plotting region currently in effect. Possible values for c are
                s to generate a square plotting region;
                m (the default) to generate a maximal size plotting region.

6.2 Some Basic Plotting Functions

The elementary plotting functions are as follows:

plot(x, y, ...)          Scatter plot of points with x- and y-coordinates given by the two
                         main parameters. The pair x, y may be replaced by a single list with
                         components labeled x and y, called a ‘plot list’.
                         Graphical parameters are particularly useful.
points(x, y, ...)        Add points to an existing plot (possibly using a different plotting
                         character). Follows on from a plot(...) command.
lines(x, y, ...)         Add lines to an existing plot. Similar to points.
                         Note
                         > plot(x, y); lines(spline(x, y))
                         will join the points of a plot by a cubic spline interpolation function.
                         (See help(spline) for further information.)
text(x, y,               Add text to a plot at points given by x, y. Normally labels is an
  labels, ...)           integer or character vector in which case labels[i] is plotted at point
                         (x[i], y[i]). The default is 1:length(x).
                         Note: This function is often used in the sequence
                         > plot(x, y, type="n"); text(x, y)
                         The graphics parameter type="n" suppresses the plotting of points
                         but sets up the axes, and the text(...) function supplies special
                         characters (in this case just the integers by default) for the points.
abline(a, b, ...)        Draw a line in intercept and slope form, (a, b), across an existing plot.
abline(h=c, ...)         h=c may be used to specify y-coordinates for the heights of horizontal
abline(v=c, ...)         lines to go across a plot, and v=c similarly for the x-coordinates
abline(lmobject, ...)    for vertical lines.

6.3 Interaction with Plots

S-Plus allows users to interact with plots, by identifying points and by adding information at
places selected by mouse clicks.

identify(x, y, labels)   On a current plot of x, y, clicking the LEFT mouse button places
                         the appropriate string from labels near the point which has been
                         clicked on. Click the MIDDLE mouse button to finish. If labels
                         is omitted uses index numbers, and always returns the indices of
                         selected points.
locator()                Returns a list of vector coordinates of points clicked by the LEFT
                         mouse button. Click the MIDDLE mouse button to finish.
locator(, "p")           ditto, but plots the points as in plot.
legend(locator(), ...)   Add a legend box at a mouse-selected point (one LEFT click). See
                         help page for the box contents and other options.

locator() is often used with text to add annotation to plots, e.g.
> text(locator(), "controls"); text(locator(), "cases")

6.4 Brush and Spin

These are S-Plus enhancements to allow dynamic manipulation of graphs. Spin allows three
columns chosen from a matrix of data vectors to be rotated in space.
> help("state")
> spin(state.x77)
Use the left mouse button to select three of the variables, then use the cross-shaped pad to rotate
the point cloud. Finally click on quit.
> brush(state.x77, hist=T)
includes spin and a pairs plot. Additionally one can ‘brush’ by selecting points with the left
mouse button, and de-selecting them with the middle button. One can mark points in different
ways, with the four symbols, and even label points if label is selected.
> brush(rbind(iris[,,1], iris[,,2], iris[,,3]))
Now select the first 50 points with one symbol and the last fifty with another. The intermediate
nature of the middle 50 then stands out.

6.5 Equally-scaled plots

It is sometimes necessary to make geometrically-square plots, for example so that distances
can be assessed accurately. This is somewhat tricky, but is done by the function eqscplot in
library(ripley), which adjusts the axis scales to be equal within the current window shape.

Figure 1: Screen dump of an openlook() window displaying brush on the iris data, with
different highlights for the three groups.

7 Statistical Summaries

7.1 Arithmetical Summaries

Standard summaries such as mean, median and var are available. The var function will take a
data matrix and give the variance-covariance matrix, and cor computes the correlation matrix,
either from two vectors or a data matrix.
There are also standard functions max, min, range and quantile. The functions mean and cor
will compute trimmed summaries. More sophisticated robust summaries are available, such as
location.m and scale.tau as well as via the robust library.
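
The effect of trimming can be seen in a minimal sketch that also runs in R. The vectors x and y are hypothetical, with one gross outlier in each; only the trim argument of mean is shown, since the trimming of cor is an S-Plus facility that R does not share.

```r
# Six observations, the last one an outlier (hypothetical values).
x <- c(2.1, 2.4, 2.2, 2.5, 2.3, 9.9)
y <- c(1.0, 1.3, 1.1, 1.6, 1.2, 5.0)

m  <- mean(x)               # pulled upwards by the outlier
md <- median(x)
mt <- mean(x, trim = 0.2)   # drops the smallest and largest value first

V <- var(cbind(x, y))       # 2 x 2 variance-covariance matrix
R <- cor(x, y)              # correlation of the two vectors
range(x)
quantile(x, c(0.25, 0.75))
```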

7.2 Histograms and Stem-and-Leaf Plots

The standard histogram function is hist(x, ...) which plots a conventional histogram. More
control is available via the extra parameters. The parameter probability=T gives a plot of unit
area rather than cell counts, and nclass sets the number of bins.
Densities can be estimated via the function density:
hist(hstart, nclass=20, probability=T, ylim=c(0, 0.02))
lines(density(hstart))
lines(density(hstart, width=20), lty=3)
See figure 2.

Figure 2: A histogram of hstart with two density estimates overlaid.



A stem-and-leaf plot is an enhanced histogram:


> stem(hstart)
o 1 )†ƒ o -k k M6 ç r
‹ê ƒ A
 ç

L
9 # 1 !39W($+-,"%¸ƒ o  r 6diæ” oIr ç6MçA
, B
G (

o
Ž, < ( 0W1 +œ'4()-9é(=% '+ 1=< | , 9B4î9-&,î!(*‡=&-9s4/J†9-&, < 4=+=4I)
r åÝr
å
Üv åÝi=r=ir kv w
å w-r w
ç å¸io k-kk r v=Ü vv
ow å i r w
o=o å =Ix r Ü
o å /x k Ü$ç v-v$v=v
o ki å =io-o xBkÜ=Ü=Ü-Ü r v ç-w-w
o å¸o i k=x=k=x-k x vÜ w=w
oIx r å i-i=o ki r x=x w=w=w
o å =oAr x ç
o vÜ å  w
o å Ü=vÜ
o ç å v=i v
w å o k=k-k r v
i=o å k x=x Ü-Ü
i å ç
i=ki å¸ÜAo ç
i x
Apart from giving a visual picture of the data, this gives more detail. The actual data, in sorted
order, is roughly 55, 62, 62, 63, 69, ... and this can be read off the plot. Sometimes the
pattern of numbers (all odd?) gives clues. Quantiles can be computed (roughly) from the plot.

7.3 Boxplots

A boxplot is a way to look at the overall shape of a set of data. The central box shows the data
between the quartiles, with the median represented by a line. ‘Whiskers’ go out to the extremes
of the data, and very extreme points are shown by themselves. It is also possible to plot boxplot
for groups side-by-side:
> library(ripley)
> boxplot(split(nottem, cycle(nottem)), names=month.abb)
divides a time-series into months, and plots the boxplots for each month on one plot. See fig-
ure 3. Other styles of boxplot are available—see the help page.

Figure 3: Boxplots for months of nottem data.

8 Distributions

S has functions built in to compute (approximately) the density, cumulative distribution function
and quantile function (the inverse of the CDF) for many standard distributions. There are also
functions to simulate samples from these distributions. The first letter of the name indicates the
function, e.g. dnorm, pnorm, qnorm, rnorm respectively.
Distributions available are:

Distribution        S name     parameters

beta                beta       shape1, shape2
binomial            binom      size, prob
Cauchy              cauchy     location, scale
chisquare           chisq      df
exponential         exp        rate
F                   f          df1, df2
gamma               gamma      shape
geometric           geom       prob
hypergeometric      hyper      m, n, k
log-normal          lnorm      meanlog, sdlog
logistic            logis      location, scale
negative binomial   nbinom     size, prob
normal              norm       mean, sd
normal range        nrange     size, sd
Poisson             pois       lambda
stable              stab       index, skewness
T                   t          df
uniform             unif       min, max
Weibull             weibull    shape, scale
Wilcoxon            wilcox     m, n

The function sample re-samples from a data vector, with or without replacement.
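
The naming convention can be sketched for the normal distribution as follows; the code also runs in R. The seed and the sample size are arbitrary choices made for the illustration.

```r
p  <- pnorm(1.96)     # CDF: probability that a standard normal is below 1.96
q  <- qnorm(p)        # quantile function: inverts the CDF
d0 <- dnorm(0)        # density at zero

set.seed(1)           # arbitrary seed, so that the sketch is repeatable
z <- rnorm(5)         # simulate five standard normal values

boot <- sample(z, replace = TRUE)   # re-sample with replacement
perm <- sample(z)                   # without replacement: a random permutation
```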

8.1 Q-Q Plots

One of the best ways to compare the distribution of a sample x with a distribution is to use a
Q-Q plot, of which the normal probability plot is the best-known example. Q-Q plots can also
be used to compare two samples. For a sample x the quantile function is the inverse of the
empirical CDF, that is

$$\mathrm{quantile}(p) = \min\{\, z : \text{proportion } p \text{ of the data} \le z \,\}$$

The function qqplot(x, y, ...) plots the quantile functions of two samples x and y against
each other, and so compares two samples. The function qqnorm(x) replaces one of the samples
by a sample at the quantiles of a standard normal distribution. This idea can be applied quite
generally. For example, to test a sample against a $t_9$ distribution, we use
plot(qt(ppoints(x), 9), sort(x))
where ppoints computes the appropriate set of probabilities for the plot.
The function qqline helps assess how straight a qqnorm plot is by plotting a straight line
through the upper and lower quartiles. (See the example in §3.)
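
The coordinates behind such a plot can be computed without a graphics device, as in this sketch, which also runs in R. The sample x is simulated here purely for illustration, with an arbitrary seed.

```r
set.seed(1)                   # arbitrary seed
x <- rt(50, df = 9)           # a simulated sample, standing in for real data

theo <- qt(ppoints(x), 9)     # theoretical t_9 quantiles, one per observation
emp  <- sort(x)               # empirical quantiles of the sample

# With a graphics device open the Q-Q plot would then be drawn by
#   plot(theo, emp); abline(0, 1)
```

If the sample really follows the reference distribution the points lie close to the line through the origin with unit slope.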

9 Classical Statistics

S-Plus 3.1 has a section on classical statistics. The same functions are used to perform tests
and to calculate confidence intervals.
The table shows the amount of wear in a shoe experiment with 10 boys, an experiment reported
in Box, Hunter & Hunter (1977), Statistics for Experimenters. There were two materials (A and
B) that were randomly assigned to the left or right shoe.

boy       A           B
  1    13.2 (L)    14.0 (R)
  2     8.2 (L)     8.8 (R)
  3    10.9 (R)    11.2 (L)
  4    14.3 (L)    14.2 (R)
  5    10.7 (R)    11.8 (L)
  6     6.6 (L)     6.4 (R)
  7     9.5 (L)     9.8 (R)
  8    10.8 (L)    11.3 (R)
  9     8.8 (R)     9.3 (L)
 10    13.3 (L)    13.6 (R)
We can use these data to illustrate one-sample and paired and unpaired two-sample tests. The
rather voluminous output has been edited:
> shoes <- scan(, list(A=0, B=0))
1: 13.2 14.0
3: 8.2 8.8
5: 10.9 11.2
7: 14.3 14.2
9: 10.7 11.8
11: 6.6 6.4
13: 9.5 9.8
15: 10.8 11.3
17: 8.8 9.3
19: 13.3 13.6
21:
> attach(shoes)
> t.test(A, mu=10)
        One-sample t-Test
data:  A
t = 0.8127, df = 9, p-value = 0.4373
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
  8.876427 12.383573
sample estimates:
 mean of x
     10.63
> t.test(A)$conf.int
[1]  8.876427 12.383573
attr(, "conf.level"):
[1] 0.95
> wilcox.test(A, mu=10)
        Exact Wilcoxon signed-rank test
data:  A
signed-rank statistic V = 34, n = 10, p-value = 0.5566
alternative hypothesis: true mu is not equal to 10
> t.test(A, B)
        Standard Two-Sample t-Test
data:  A and B
t = -0.3689, df = 18, p-value = 0.7165
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.744924  1.924924
sample estimates:
 mean of x mean of y
     10.63     11.04
> t.test(A, B, var.equal=F)
        Welch Modified Two-Sample t-Test
data:  A and B
t = -0.3689, df = 17.987, p-value = 0.7165
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.745046  1.925046
sample estimates:
 mean of x mean of y
     10.63     11.04
> t.test(A, B, paired=T)
        Paired t-Test
data:  A and B
t = -3.3489, df = 9, p-value = 0.0085
alternative hypothesis: true mean of differences is not equal to 0
95 percent confidence interval:
 -0.6869539 -0.1330461
....
> wilcox.test(A, B, paired=T)
        Wilcoxon signed-rank test
data:  A and B
signed-rank normal statistic with correction Z = -2.4495, p-value = 0.0143
The sample size is rather small, and one might wonder about the validity of the t-distribution.
An alternative for a randomized experiment such as this is to base inference on the permutation
distribution of d. Figure 4 shows that the agreement is very good. (As the computation of this
figure uses some subtle ideas in S, it is omitted: see Venables & Ripley (1994, Chapter 5).)
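
The idea of a permutation distribution can nevertheless be sketched in a few lines. The following is not the computation used for the figure: it is a simplified version, also runnable in R, which assumes that under the null hypothesis each boy's difference is equally likely to have either sign, enumerates all 2^10 sign patterns, and works with the mean difference rather than the t statistic. The names signs, perm.means and p.perm are hypothetical.

```r
# Shoe wear for the ten boys, as in the table of this section.
A <- c(13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3)
B <- c(14.0, 8.8, 11.2, 14.2, 11.8, 6.4, 9.8, 11.3, 9.3, 13.6)
d <- A - B
n <- length(d)

# All 2^n patterns of signs, one pattern per row.
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))

# Mean difference under each pattern: the permutation distribution.
perm.means <- as.vector(signs %*% abs(d)) / n

# Two-sided p-value: the proportion of patterns at least as extreme as the
# observed mean (a small tolerance guards against rounding error).
p.perm <- mean(abs(perm.means) >= abs(mean(d)) - 1e-12)
p.perm
```

Complete enumeration is feasible here because there are only 1024 patterns; for larger experiments a random sample of sign patterns would be the usual substitute.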


Figure 4: Histogram and empirical CDF of the permutation distribution of the t-test in the shoes
example. The density and CDF of $t_9$ are shown overlaid.

The list of classical tests is:

binom.test       chisq.test      cor.test           fisher.test
friedman.test    kruskal.test    mantelhaen.test    mcnemar.test
prop.test        t.test          var.test           wilcox.test

Many of these have alternative methods; for cor.test there are methods "pearson",
"kendall" and "spearman".

10 Handling Categorical Data

Consider a (fictitious) survey of shoppers in Britain. Amongst the variables collected for each
person surveyed are sex, age, TV area (1), social class (2), transport used for this trip to the shops,
and total spend at supermarkets. The possible values of these variables are

sex: M, F
age: –24, 25–44, 45–59, 60+
TV area: 1, ..., 12
social: A, B, C1, C2
transport: car, bus, cycle, foot
spend: positive continuous

This provides examples of each of S’s types of categorical data structure. There are two main
structures, categories and factors. The latter were introduced in the August 1991 release, and
have almost entirely superseded the use of categories. A factor is regarded as a vector over the
set of levels which have no implied order. Thus sex, TV area and transport are all factors. How-
ever, TV area is coded by number rather than by the names of the companies. These variables
can be declared as
sex <- factor(sex.data)
TV.area <- factor(TV.data)
transport <- factor(transport.data)
Internally in S levels are numbered in alphabetical order, and when factors are used as treat-
ments in designed experiments, the order of levels may matter. For example, if we want to
contrast females with males (rather than vice versa) we need to specify the levels of the factor
explicitly:
> sex <- factor(sex.data, levels=c("M","F"))
Social class is an ordered factor in that the classes are perceived as ordered, with “A”
(professionals) regarded as highest. We can declare an order by
social <- ordered(factor(social.data))
levels(social) <- levels(social)[4:1]
age <- ordered(factor(age.data),
    levels=c("-24", "25-44", "45-59", "60+"))
The first line orders the levels by the default (alphabetical) order. The second shows how the
set of levels may be changed, in this case by reversing the existing ordering. Age is an ordered
category for which it is necessary to specify the levels explicitly. Had age.data been specified
as a continuous variable, it could have been categorized using cut (whose help page gives other
ways to produce the categories):
age.cdata <- cut(age.data, c(0, 25, 45, 60, 99))
age <- ordered(factor(age.cdata),
    levels=c("-24", "25-44", "45-59", "60+"))
(1) Britain is covered by 12 commercial TV companies, so this provides a simple geographical variable.
(2) Derived from occupation.

Some of the functions for statistical models treat ordered factors in appropriate special ways.
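
Categorizing a continuous age into an ordered factor can be sketched as follows, in a form that runs in R. The ages are hypothetical, and the labels, right and ordered_result arguments are those of R's cut, which does the labelling and ordering in one step; they are not taken from this guide.

```r
age.data <- c(19, 31, 47, 62, 25, 58)   # hypothetical ages in years

# Intervals are closed on the left, so that 25 falls in "25-44".
age <- cut(age.data, breaks = c(0, 25, 45, 60, 99),
           labels = c("-24", "25-44", "45-59", "60+"),
           right = FALSE, ordered_result = TRUE)

table(age)           # counts per age group, in level order
age < "45-59"        # comparisons respect the declared order
```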

10.1 The Function tapply(...) and Ragged Arrays

To continue the previous example, suppose we want to summarize spend by some of the
factors. To calculate the sample mean spend for each age-group we can now use the special
function tapply(...):
> spend.means <- tapply(spend, age, mean)
giving a means vector with the components labeled by the levels
> spend.means
   -24  25-44  45-59    60+
 27.20  35.53  33.42  17.65
Suppose further we needed to calculate the standard errors of the mean spends. To do this we
need to write an S function to calculate the standard error for any given vector. We discuss
functions more fully in §12, but since there is an inbuilt function var(...) to calculate the
sample variance, such a function is a very simple one-liner, specified by the assignment:
> stderr <- function(x) sqrt(var(x)/length(x))
After this assignment, the standard errors are calculated by
> spend.stderr <- tapply(spend, age, stderr)
and the values calculated are then
> spend.stderr
   -24  25-44  45-59    60+
  3.70   2.33   4.55   2.70
The function tapply(...) can be used to handle more complicated indexing of a vector by
multiple factors. For example, we might wish to split the spend by both age and sex:
> tapply(spend, list(age, sex), mean)
The combination of a vector and a labelling factor is an example of what is called a ragged
array, since the subclass sizes are possibly irregular. When the subclass sizes are all the same
the indexing may be done implicitly and much more efficiently by using arrays. The function
apply is the analogue of tapply for arrays.
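
A small self-contained version of these calculations is sketched below; it also runs in R. The six observations are hypothetical, and the helper is called std.err rather than stderr only to avoid a clash with an existing R function of that name.

```r
# Hypothetical survey of six shoppers.
spend <- c(20, 25, 35, 45, 30, 40)
age <- factor(c("-24", "-24", "25-44", "25-44", "45-59", "45-59"))
sex <- factor(c("M", "F", "M", "F", "M", "F"))

# Standard error of the mean of a vector.
std.err <- function(x) sqrt(var(x) / length(x))

spend.means <- tapply(spend, age, mean)       # one mean per age group
spend.se    <- tapply(spend, age, std.err)    # one standard error per group
by.both     <- tapply(spend, list(age, sex), mean)   # age by sex table of means
```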
The pattern of our survey can be seen by the table function, which takes a listing of factors
and returns the contingency table as an array, e.g.
> table(sex, age, TV.area, social, transport)

11 Loops and Conditional Execution

Commands may be grouped together in braces, {expr1; expr2; ...; exprm}. The value of


the group is the result of the last expression in the group evaluated. Since such a group is also
an expression it may, for example, be itself included in parentheses and used as part of an even
larger expression, and so on. This facility is most often used with the control statements of this
section.
The control statements are very close in spirit to those of the C programming language, and
only a few are mentioned here. There is a conditional construction of the form
> if (expr1) expr2 else expr3
where expr1 must evaluate to a logical value and the result of the entire expression is then
evident.
There is also a for-loop construction which has the form
> for (name in expr1) expr2
where name is a dummy, expr1 is a vector expression (often a sequence like 1:20), and expr2
is often a grouped expression with its sub-expressions written in terms of the dummy name.
expr2 is repeatedly evaluated as name ranges through the values in the vector result of expr1.
As an example, suppose ind is a vector of class indicators and we wish to produce separate
plots of y versus x within classes. Use the help facility to understand the following:
> yc <- split(y, ind); xc <- split(x, ind)
> for (i in 1:length(yc)) {plot(xc[[i]], yc[[i]]);
+     abline(lsfit(xc[[i]], yc[[i]]))}
(Note the function split(...) which produces a list of vectors got by splitting a larger vector
according to the classes specified by a factor.)
Other looping facilities include the
> repeat expr
statement and the
> while (condition) expr
statement. The break statement can be used to terminate any loop abnormally, and next can
be used to discontinue one particular cycle.
Loops in S are often memory-hungry, and care may be needed not to use up all of your com-
puter’s memory. Expert advice is necessary on work-arounds.
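
The constructions of this section can be tried without a graphics device, as in the following sketch, which also runs in R. The data and the names totals and n are hypothetical.

```r
y   <- c(1, 2, 3, 10, 20, 30)
ind <- c("a", "a", "a", "b", "b", "b")   # class indicators

yc <- split(y, ind)                      # a list with one vector per class

# for loop over the classes, storing one total per class.
totals <- numeric(length(yc))
for (i in 1:length(yc)) totals[i] <- sum(yc[[i]])

# while loop ended by break: the smallest n with 2^n above 100.
n <- 0
while (TRUE) {
    n <- n + 1
    if (2^n > 100) break
}
```

Pre-allocating totals before the loop, rather than growing it inside the loop, is one simple way of keeping memory use down.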

12 Writing Your Own Functions

As we have seen informally in §10.1, the S language allows the user to create his or her own
functions. These are true S functions that are stored in a special internal form and may be used
in further expressions and so on. In the process the language gains enormously in power,
convenience and elegance. Most of the functions supplied as part of the S system, such as mean(...)
and var(...) and so on, are themselves written in S and thus do not differ materially from user
written functions. (However, increasingly such functions are being re-written as internal func-
tions to gain efficiency.) Listing these functions (by printing their name without parentheses)
is a very fruitful way to gain hints for writing your own functions.
A function is defined by an assignment of the form
> name <- function(arg1, arg2, ...) expression
The expression is an S expression (usually a grouped expression) that uses the arguments,
argi, to calculate a value. The value of the expression is the value returned for the function. A
call to the function then takes the form name(expr1, expr2, ...) and may occur anywhere a
function call is legitimate.
For example, the IQR function in library(robust) is defined as:
IQR <- function(y)
{
    r <- quantile(y, c(.25, .75))
    r[2] - r[1]
}
This first computes the quartiles, then returns the last value computed, their difference.
Note that any ordinary assignments done within the function are temporary and lost after exit
from the function. Thus r is not left behind, and does not affect any other object r.

If global and permanent assignments are intended within a function, then the ‘superassignment’
operator, ‘<<-’ can be used. See the help documentation for details, and see also the
synchronize() function.
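
The difference between ordinary and permanent assignment inside a function can be checked with a small sketch. The names iqr.sketch, bump and counter are hypothetical and are not part of any library mentioned in this guide.

```r
r <- "outer value"            # an object r outside the function

iqr.sketch <- function(y)     # hypothetical name, mirrors the IQR example
{
    r <- quantile(y, c(0.25, 0.75))   # local r, discarded on exit
    as.vector(r[2] - r[1])
}
width <- iqr.sketch(1:9)      # the outer r is left untouched

counter <- 0
bump <- function() counter <<- counter + 1   # deliberately permanent
bump()
bump()                        # counter has now been changed twice
```

After these calls r still holds its original value, whereas counter has been modified by the function.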

As a second example of a useful function, consider a function to evaluate the ‘Huber proposal
2’ robust estimator(s) of location and/or scale:
hubers <- function(y, k=1.5, mu, s, initmu=median(y), tol=1.0e-6)
{
    y <- y[!is.na(y)]
    n <- length(y)
    if(missing(mu)) {
        mu0 <- initmu
        n1 <- n-1
    } else {
        mu0 <- mu
        mu1 <- mu
        n1 <- n
    }
    if(missing(s)) {
        s0 <- mad(y)
    } else {
        s0 <- s
        s1 <- s
    }
    th <- 2*pnorm(k)-1
    beta <- th + k^2*(1-th) - 2*k*dnorm(k)
    repeat {
        yy <- pmin(pmax(mu0-k*s0, y), mu0+k*s0)
        if(missing(mu)) mu1 <- sum(yy)/n
        if(missing(s)) {
            ss <- sum((yy-mu1)^2)/n1
            s1 <- sqrt(ss/beta)
        }
        if((abs(mu0-mu1) < tol*s0) && abs(s0-s1) < tol*s0)
            break
        mu0 <- mu1
        s0 <- s1
    }
    list(mu=mu0, s=s0)
}
This allows either of the location mu and scale s to be specified. Optional arguments are the
parameter k, the initial value for mu and a convergence tolerance. The first line removes all
missing values. The missing() function checks if a parameter is supplied. Two constants are
then calculated as functions of k. The rest of the function is a loop. In general loops are
inefficient in S and should be avoided if at all possible, but here we have no choice as the calculation
is iterative. Finally the function returns two components, the location and scale.
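
The use of missing() to make arguments optional can be seen in isolation in a much smaller function. This is a sketch only: standardise is a hypothetical helper, not a function from the libraries described here, and it uses the ordinary mean and standard deviation where the function above uses robust estimates.

```r
# Hypothetical helper: centre and scale y, estimating whatever is not supplied.
standardise <- function(y, mu, s)
{
    y <- y[!is.na(y)]                    # drop missing values first
    if (missing(mu)) mu <- mean(y)       # location not supplied: estimate it
    if (missing(s))  s  <- sqrt(var(y))  # scale not supplied: estimate it
    (y - mu) / s
}
z1 <- standardise(c(1, 2, 3, NA))            # both estimated from the data
z2 <- standardise(c(1, 2, 3, NA), mu = 0)    # location fixed, scale estimated
```

Supplying mu or s in the call skips the corresponding estimation step, which is the same pattern used for the location and scale in the Huber function.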
It is sometimes useful to be able to time commands:
> cpu.time <- function(x) sum(unix.time(x)[-3])
> elapsed <- function(x) unix.time(x)[3]
which return the total cpu time and the elapsed time taken by a command or sequence of
commands enclosed in {...}. Note: as these are functions, assignments inside them are in the
frame of the function rather than permanent. Alternatively, use proc.time() before and after
a group of commands.
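
The proc.time() approach might look like the following sketch. The loop being timed is arbitrary, and the assumption that the third element of the result is the elapsed time follows the layout used by unix.time above.

```r
start <- proc.time()
total <- 0
for (i in 1:10000) total <- total + sqrt(i)   # the commands being timed
used <- proc.time() - start                   # difference of the two readings
elapsed.secs <- as.vector(used[3])            # assumed: third element is elapsed time
```

Because the commands are run at top level rather than inside a function, any assignments they make (here total) are permanent.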

13 Statistical Models

These facilities form the heart of the 1991 version of S. They are based on object-oriented
extensions, so that generic functions such as print know what to do with the results of various
models. The two most basic notions are a data frame (§ 4.9) and a model formula.

13.1 Model Formulas

A model formula couples a y-vector with a model expressed in a terminology very similar to
that of GLIM and GENSTAT. The form is
> loss ~ hardness + tens
for the linear regression of loss on hardness and tens. Factors are replaced by a set of
indicator variables for the regression, and can interact via the : operator (not . as this is a valid
character in a variable name). Thus we can have all the following constructs:
> time ~ poison + treatment + poison:treatment     equivalent to
> time ~ poison * treatment
> strength ~ yarns / bobbins                       nested layout
> gain ~ group + initial                           parallel lines
> conc ~ -1 + reading                              line through the origin
> conc ~ poly(reading, 2)                          quadratic polynomial
> conc ~ ns(reading, 4, intercept=T)               natural spline
> conc ~ s(reading)                                smooth function, for gam
The syntax of a linear-model fit is

> lm(model formula, data frame)

where the names in the model formula refer to columns of the data frame, which can be omitted
if it has already been attached. For example
> library(ripley)
> attach(rubber)
> tyres.lm <- lm(loss ~ hard + tens)
> summary(tyres.lm)
> anova(tyres.lm)
> coefficients(tyres.lm)
> plot(fitted(tyres.lm), resid(tyres.lm))
This shows how to extract information from a fit by the use of ancillary functions. There are no
standard ancillary functions for standardized and Studentized residuals, but I have added them
as stdres() and studres() in library(ripley).
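
For readers without library(ripley) to hand, the same sequence of fit and extractor functions can be tried on made-up numbers. The data frame toy below is invented for this sketch and has nothing to do with the rubber data.

```r
# Made-up data purely for illustration (not the rubber data).
toy <- data.frame(x1 = c(1, 2, 3, 4, 5, 6),
                  x2 = c(2, 1, 4, 3, 6, 5),
                  y  = c(3.1, 3.9, 7.2, 7.8, 11.1, 11.9))
toy.lm <- lm(y ~ x1 + x2, toy)   # names in the formula refer to columns of toy
coefficients(toy.lm)             # estimated coefficients
anova(toy.lm)                    # sequential analysis of variance table
res <- resid(toy.lm)             # residuals
fit <- fitted(toy.lm)            # fitted values
```

The fitted values and residuals add back up to the response, which is a quick check that the extractors refer to the same fit.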

13.2 One-way Layouts

The analysis of a one-way layout is best illustrated by an example. The table gives data on
observed concentrations (ng/ml) of a chemical in groups of 10 patients after oral administration
of almitrine bismesylate:
ƒPœ    ƒ „µ € ½
 x«þÿ†}
€  §Šq¡I9 ‘q“ “P– š –P– ‘q–P–
š .™ ’ P” ‘ ‘q“P• ‘q‘P”
‘ ’\• š“q– ‘˜š ‘™ž‘
™ “q–  š ‘˜ž– ‘
’ \’ ” š“q“ š‘P– š”P“
“ ‘žš ž“ ™q™P™ ™Š“’
• “q‘ ”P“ š” ‘
˜ ™Š– ”P“ š–P” ‘
 ‘q” ž‘ š|’E– šb˜ž–
” ‘˜ šPš– š|’Š˜ “q‘P‘
š – “žš ”P” š”P• ‘q”P•

> stdev <- function(x) sqrt(var(x))                 Function to compute st. dev.
> chemical <- scan("chemical.dat")
> dose <- rep(c(25, 50, 100, 200), 10)              Label the observations by dose
> group <- factor(dose)                             Make a factor from the doses
> boxplot(split(chemical, dose))
> tapply(chemical, dose, mean)
> tapply(chemical, dose, stdev)
> chems <- data.frame(group, chemical)
> chems.aov <- aov(chemical ~ group, chems)         set up for AOV
> summary(chems.aov)                                print out table
> coefficients(chems.aov)                           and the parameters
> chems.aov <- aov(log(chemical) ~ group, chems)    on log scale
> summary(chems.aov)
> summary(aov(log(chemical) ~ log(dose) + group, chems))
                                                    test for linearity of response
which gives
­’¬ÑÛÛEÞ{ÈÔaÒ Í Å Ç[Û­åÙ&ÞÑÐ
„ß
Ë Ѭ ÛÐË õ PÇÞÌ õ PÞ«¬„Ç ÈaÒqß
èÈЬ/ª  Úáà ŽÙ"ÚGØØá ŽÙZëXÙŠØà ÃØEÙàáëŠØÇÊÑà
Ç/.­ Ï{ÝŠ¬ Þ«3!­ á Ø"##áë_Ù žØà ŽÙ"Ú
ÍbÐ.Ç|Ë˄ÏÍÏÇÌÎ-ž ­ Ò Í Å Ç[Ûå­ Ù&ÞÑÐ
„ß
Ò"$ÌÎPÇ|ÈqÍÇp.ª ÎŠß èÈЬ/†ª Ø èÈЬ/qª ë èÈЬ/%ª 
ØÚ_Ù#Ú  ëXÙ#Ú&ŒÙ[ØØáá#&ëXÙ"áà
­’¬ÑÛÛEÞ{ÈÔaÒ Í Å Ç[Û­åÙ&ÞÑÐ
„ß
Ë Ѭ ÛÐË õ PÇÞ
Ì '
õ !PÞ «D¬ŠÇ ÈåÒqß

èÈЬ/ª  ëëXÙàŠØ(#_Ù"á.áàà#ŒÙ#ëëáGØEÙÚÚŠØëÇÊqØÚ
/Ç ­.Ï{ݬŠÞ«3­!á XÙá#ŠØ$à_Ù[ØàëëÚ
­’¬ÑÛÛEÞ{ÈÔaÒ ÞbÐ
åÒt«ÐèaÒ Í Å Ç[ۆÏÍÞ«.ß*)¸«ÐèaÒºÝÐ3­Çß î èÈ.ÐD¬IªŽÕÍ Å Ç[Û­.ßß
Ë ¬ÑÛÐ
Ë õ PÇÞÌ õ PÞ«¬„Ç ÈaÒPß
«ÐbèåÒ~Ý3Ð ­Ç.ß Ø ë„ØEÙ ##Zë„ØEÙ##Zë„Ø_Ù#áë àXÙàààààààà
èÈÐ ¬/ª ë ØEÙà á  à  àXÙÚ ëàë ÚXÙ[Ø ZàXÙàŠØà #Ú
/Ç ­.Ï{Ý ¬ŠÞ «3­!á XÙá #ŠØ àXÙ«Øàë 

The parameterization of linear models for designed experiments is a little tricky. The usual
parameterization is to impose a ‘sum to zero’ constraint on the parameters for a factor. GLIM
sets the parameter for the first level to zero, so that parameters for the other levels are
differences between that level and the first. By default S uses the Helmert parameterization, which
compares the second and subsequent levels to the average of lower levels. The usual
parameterization can be obtained as default by setting
> options(contrasts=c("contr.sum", "contr.poly"))
and the GLIM parameterization by
> options(contrasts=c("contr.treatment", "contr.poly"))
Of course, the parameterization only affects the coefficients, not the fitted values, residuals,
.... The contrasts for a particular term in a fit can be changed by the C() function, e.g.
C(group, sum) or using contrasts.
There is a ‘clever’ way to test for linearity using a re-parameterization of the factor group as an
ordered factor, for which the default parameterization is polynomial in the level numbers
1, ..., (number of levels).
(This relies on log(dose) having levels in an arithmetic progression. One could always use
poly(log(dose), 3) in place of ldose.)
> ldose <- ordered(factor(log(dose)))
> summary.lm(aov(log(chemical) ~ ldose, chems))
(As far as I can see the use of summary.lm is necessary to get results for the individual
coefficients.) This shows that the response can be regarded as quadratic in log(dose):
’­ ¬ÑÛÛEÞ{ÈԎÙF«ÛŽÒ ÞÑÐ
åÒt«ÐbèåÒ Í Å Ç«ÛÏ|ÍÞ«Pß*)Ï«ÝÐ3­ÇÕ Í Å Ç[Û­.ßß
2 Þ«I«43ÞbÐ
åÒ&ËÐÈ{Û3¬=«.Þ ×œ«ÐèåÒ[Í Å Ç[ېÏbÍÞ«Pß!)˜«ÝÐ$­ÇÕ ÝÞ|ÎqÞ ×ðÍ Å Ç[ÛB­ß
Ç/.­ Ï{ÝŠ¬ Þ«3+­ 3
„ Ï Ì Ø65 qÇbÝPÏbÞÌ 75 PÞ{ñ
ÊÑàXÙÚ#àá5Êbà_Ù"ë„Ø#^ÊbàXÙàà„Øàë±à_Ù"ëàá÷àXÙ"á„Ø
2 Ð.Ç|ËˊÏÍÏÇÌ.Î-­+3
PÞ«¬ŠÇ8 Î.Ý_Ù:9ÈÈ.ÐbÈ Î
PÞ«¬„Ç*ÈaÒ 4; Î ; ß
Ò<$Ì.ÎqÇ|ÈPÍ’Ç ªÎŠß ŽÙ##Úà à_Ù"àÚàá ŒÙ=à àXÙ"àààà
«Ý3Ð ­Çå=Ù > Ø\Ù=#à à_Ù[ØàŠØë Ø"ŒÙ"áëà àXÙ"àààà
«Ý3Ð ­Çå?Ù 5 Êbà_ÙëÚ  à_Ù[ØàŠØë Ê@XÙ"ë„Øá àXÙ"ààë#
«Ý3Ð ­ÇåÙ 2 à_Ù"àëë  à_Ù[ØàŠØë àXÙ"ëëÚÚ àXÙëë
/Ç ­.Ï{Ý ¬ŠÞ «×­|ÎqÞÌÝÞ|ÈÝ5Ç|ÈÈ.ÐbAÈ 3]à_Ù ŠØ ZÐ|Ìá^Ý.Ç|èÈqÇÇ/­ ÐËZËÈPÇÇbÝÐÛ
*¬3«΄Ï ª=«.Ç @Ê Dõ ¬ŠÞ|ÈqÇb4 Ý 3à_Ù á„Øá

Ê/­|ÎqÞ|ΊÏ­|΄ÏÍ-3B# ŽÙ#ëZÐÑÌCðÞÌÝá^ÝÇ{èÈqÇÇI­éÐËZËÈPÇÇÑÝÐÛyÕ
Î Å ÇϪŠÊDP
Þ«D¬ŠÇ^Ï­jØEÙÚÚP ÇÊØÚ
2 ÐbÈÈqÇ«Þ{ΊÏ{ÐÑ̈ÐË 2 Ð.Ç|Ë˄ÏÍÏÇÌÎ-­+3
<Ò $ÌÎPÇ{ÈPÍÇpª.΄ßÝ«ÝÐ3­ÇåÙE>|«ÝÐ3­ÇaÙ?5
«Ý3Ð ­Çå=Ù > à
«Ý3Ð ­Çå?Ù 5 à à
«Ý3Ð ­ÇåÙ 2 à à à

13.3 Designed Experiments

The central concept for designed experiments is a factor. Consider the famous Box-Cox
poisons data (survival times (in hours) of animals with 3 poisons and 4 antidotes, from Box & Cox
(1964), J. Roy. Statist. Soc. B26, 211–252 and Box, Hunter & Hunter (1977), Statistics for
Experimenters). The function fac.design generates the rows, columns and so on; consult its
help page for full details.
stimes <- scan("poison.dat")                                data in hours
fnames <- list(treat=LETTERS[1:4], repl=1:4, poison=c("I", "II", "III"))
poisons <- data.frame(fac.design(c(4,4,3), fnames), stimes)
par(mfrow=c(3,2))
plot.design(poisons)                                        plot main effects
plot.design(poisons, fun=median)                            and using medians
attach(poisons)
plot.factor(stimes ~ treat + poison, data=poisons)          box plots
interaction.plot(treat, poison, stimes)
interaction.plot(treat, poison, stimes, fun=median)
poisons.aov <- aov(stimes ~ treat * poison)                 full fit
fits <- fitted(aov(stimes ~ treat + poison))                additive fit for the 1 d.o.f. test of non-additivity
summary(poisons.aov)
par(mfrow=c(2,2))
hist(resid(poisons.aov))
qqnorm(resid(poisons.aov))
plot(fitted(poisons.aov), resid(poisons.aov))
summary(aov(stimes ~ treat + poison + I(fits^2) + treat:poison))
which gives
­’¬ÑÛÛEÞ{ÈÔaҏªqÐqÏ*­ÑÐÑÌ­aÙ&ÞÑÐ
Šß
Ë Db¬ ÛGÐbËN õ qÇÞO
Ì õ qÞ «¬„Ç ÈaÒ qß
Î ÈqÇÞ{Î  ë_Ù[Øëàá!à_Ù#àáGØ_ÙàÚÚZàXÙààààà
ªqÐqÏ*­ÑÐÑÌ ë Ø à_ÙàŠØë ڊØ\Ù"áÚàáëZë_Ù"ëëŠØ#^àXÙàààààà
ÎÈqÇÞ{ÎA3 ªÐPÏ­bÐÑÌ á ë Ú_Ù"à„Ø ŽÙ[Øáá Ø\Ù# ZàXÙ«ØØëëÚàá
/Ç ­.Ï{Ý ¬ŠÞ «3­ á à_Ù"à#ëÚ ë_Ù"ëë .ë
Figure 5: Plots for Poison data
(Panels: design plots of the mean and median of stimes against the factors treat, repl and poison;
box plots of stimes by treat and by poison; interaction plots of the mean and median of stimes for
treat by poison; histogram and normal quantile plot of resid(poisons.aov); resid(poisons.aov)
against fitted(poisons.aov); log likelihood against lambda with the 95% limit marked.)

­’¬ÑÛÛEÞ{ÈÔaÒ ÞbÐ
åÒ8­|ΊÏ~ÛEÇ/­!)éÎÈPÇÞ|Î î ªqÐqÏ*­ÑÐÑÌ î ˊÏÎ-­ï|ë î ÎÈPÇÞ|ÎA3`ªqÐqÏ*­bÐ|Ì\ßß
Ë Db¬ ÛGÐbËN õ qÇÞÌO õ qÞ«¬„Ç ÈaÒqß
ÎÈqÇÞ{Î  ë_Ù[Øëàá!à_Ù#àáGØ_ÙàÚÚZàXÙààààà
ªqÐqÏ*­ÑÐÑÌ ë Øà_ÙàŠØë ڊØ\Ù"áÚàáëZë_Ù"ëëŠØ#^àXÙàààààà
$\ÒSˊÏÎ-­.ï{ëPß Ø ØÚ_Ù # ë ÃØÚ_Ù#ë ë á_Ù„ØØëZàXÙàŠØëڊØÚ
ÎÈqÇÞ{ÎA3 ª ÐPÏ­bÐÑÌ Ú XÙ á \Ø  Ø\Ùëë# à_ÙááZàXÙڊØë#ëÚ
Ç/.­ Ï{ÝŠ¬ Þ«3­ á à_Ù"à #ëÚ ë_Ù"ë ë .ë 

indicating the need for transformation. The I(...) function protects the argument from
expansion; (treat + poison)^2 is equivalent to treat + poison + treat:poison and generally
(factors)^n gives up to n-th order interactions.
There is no direct Box-Cox function, but we can do the operations by hand. They are quite
slow (25 secs on a SparcStation IPC), due to the overhead of calling the aov function:
xl <- seq(-2, 1, by=0.1)
loglik <- as.vector(xl)
n <- length(stimes)
nlngm <- log(prod(stimes))
for(i in 1:length(xl)) {
    if(abs(xl[i]) > 0.01)
    {
        ss <- sum((aov(stimes^xl[i] ~ treat + poison)$resid)^2)
        loglik[i] <- n*log(abs(xl[i])) - n/2*log(ss) + (xl[i]-1)*nlngm
    }
    else
    {
        ss <- sum((aov(log(stimes) ~ treat + poison)$resid)^2)
        loglik[i] <- -n/2*log(ss) - nlngm
    }
}
plot(xl, loglik, xlab="Lambda", ylab="Log Likelihood", type="l")
lambdahat <- loglik[loglik == max(loglik)]
limit <- lambdahat - 0.5 * qchisq(0.95, 1)
abline(limit, 0)
scal <- (par("usr")[4] - par("usr")[3])/par("pin")[2]
text(c(xl[1]), limit + 0.1*scal, "95%")
A more efficient way (4 secs) is to use the function BoxCox in the library ripley:
> library(ripley)
> BoxCox(stimes ~ treat + poison)

Now consider a Latin square. Six litters of six piglets were ranked in order of birthweight,
providing a 6 x 6 table, and each piglet given one of 6 dietary supplements in a Latin square.
The weight gain (in kg) over 12 weeks is given in the table.

ÝPÏbÇ|ÎðÉÊ×­ÍÞÌ_ÒÕ<FFbß
 9 d 2e 
2  e d'9
d 2 9 'e
e2 9d
9 e 2 d
d e 9 2
K ÎèPÞ.Ï[ÌGÉÊ°­ÍÞÌ_Òß
ÚXÙ# #_Ù"ë„ØéÚ_ÙëZáXÙÚ#_Ù"ë ŽÙ
ÚXÙ ëZÚ_Ù#N_# ÙZ  áXÙÚZ  á_Ù#^  á_Ù"àÚ
XÙ«ØZ  Ú_Ù5# Ú_Ù=ž
 Ø(X # ÙE_ # Ù=j  á_Ù=
XÙÚZ  Ú_Ù"ë„ØéÚ_Ù"á„ØfŒ Ù_ # Ù"á^  Ú_Ù=
áXÙàÚ_ Ù[Øá5á_Ù"ëZ # ÚXÙ ^  á_ÙŠ# ØeÚ_Ù##
ŒÙ ÚZá_Ù Ú5Ú_Ù"ÚáX # ÙÚà_ # Ù"à j
 á_Ù"ëë
ÝPÏbÇ|ÎðÉʱËqÞÍ{Î.ÐÈaÒºÝqÏÇ|΄ß
«Þ{ÎŠÏ ÌÃÉÊ Ý.Þ|ÎPÞaÙËÈPÞ[Û\Ç\ÒSËPÞÍaÙ"Ý.ÇI­.ÏèÌÓÒ Í\Ò~áÓÕ áPßEÕ «PÏ­|ÎåÒ"Æ.ÈqÞÌ.Ö׊Ø^3"áÓՏ«PÏÎÎPÇ{È.ׄØI3"áqßß\Õ
î ÝqÏÇ{ÎXg
Õ KÎèPÞÏ Ìžß
ª=«ÐΌÙÝ/Ç ­.ÏèbÌ_tÒ «Þ|Î„Ï Ìžß
ÏbÇ|ÎðÉÊ 2 ÒºÝqÏÇ|Î_Õ]ÎÈqÇÞ{Î|ÛEÇÌ.΄ß
«Þ{ÎŠÏ ÌÙ&ÞÑ Ð
^ÉÊ^ÞbÐ
å?Ò KÎèqÞ.Ï C Ì )$Æ.ÈqÞÌ.Ö î «qÏÎÎPÇ|È îh ÏÇ{ÎXÕÒ«.Þ|ΊÏ[Ì\ß
­’¬ÑÛÛEÞ{ÈÔaUÒ «Þ{ΊÏ[Ì£Ù&ÞÑ Ð
„ß
­’¬ÑÛÛEÞ{ÈԎFÙ «ێUÒ «.Þ|ΊÏ[Ì£ÙSÞb Ð
„ß

The last command gives t-values for the contrasts (diet ? - diet A).
­’¬ÑÛÛEÞ{ÈÔaÒU«Þ{ΊÏ[Ì£Ù&ÞÑÐ
„ß
Ë Ѭ ÛÐË õ PÇÞ
Ì õ PÞ «¬„Ç ÈaÒPß
Æ.ÈqÞÌÖ Ú #XÙë àÚGØ\Ù"Ú àZë_Ù#á# àXÙà .Ú.Úá
«PÏÎÎqÇ|È Ú #XÙ#ëà\ØZØ\Ù"Ú.àZë_Ù#„Ø#àë àXÙà áë##
ÏbÇ|Î Ú ØØEÙáŠØ#ڊØ$ë_ÙëÚà&ŽÙ"àڄØ àXÙàŠØà„ØÚà
/Ç ­.Ï{Ý ¬ŠÞ«3­ ëà ØØEÙáÚZà_Ù"Úáë
­’¬ÑÛÛEÞ{ÈԎÙF«ێÒU«.Þ|ΊÏ[Ì£ÙSÞbÐ
„ß
2 Þ «I«43ÞbÐ
åÒ&ËÐÈ{3Û ¬=«.Þ & × KÎèPÞÏ Ì\)$ÆÈPÞÌÖ î «PÏÎÎPÇ{È î ÏbÇ|ÎXÕ ÝÞ|ÎqÞ ×|«Þ{ÎŠÏ Ìžß
/Ç ­.Ï{Ý ¬ŠÞ «3­+3
ŠÏ[Ì @Ø 5!qÇbÝPÏbÞÌ 5 qÞ|ñ
ÊÑëXÙàڊØeÊbàXÙë àáZà_Ù[Øë„ØØ$à_Ù #ŠØÚ÷àXÙ àá„Ø
2 Ð.Ç|ËˊÏÍÏÇÌ.-Î ­+3
PÞ «¬Š8 Ç Î.Ý_:Ù 9ÈÈ.ÐbÈ  Î
PÞ «¬„* Ç ÈaÒ 4; Î ; ß
<Ò $Ì.ÎqÇ|ÈPÍ’Ç ªÎŠß Ú_Ù"áàÚà à_Ù à # Ø XÙ"ë„Øëë àXÙ"àààà
ÙÙÙÙÙÙÙÙÙÙÙÙÙÙÙ
ÏbÇ|Î e à_=Ù áŠØ # à_=Ù Úë ØEÙ"àáà # àXÙ àŠØÚ
ÏbÇ|Î 2 à_=Ù à  à_=Ù Úë àXÙ ëá # àXÙ áڄØ
ÏbÇ|Î à_Ù ÚÚà à_=Ù Úë àXÙ „ØÚá àX=Ù  ë 
ÏbÇ|Î 9 à_Ù #àà à_=Ù Úë ëXÙ"ëë # àXÙ"à #Ú
ÏbÇ|Î  Ø\Ù #Ú  à_=Ù Úë ŒÙ"à  àXÙ"àààá
ÙÙÙÙÙÙÙÙÙÙÙÙÙÙ

13.4 Generalized Linear Models


The functions lm and aov have extensions glm which fits generalized linear models, and gam
which further extends this to allow semi-parametric smooth functions in the explanatory
variables. We can, for example, fit the poisons data by a gamma GLM:
> attach(poisons)
> poisons.glm <- glm(stimes ~ treat + poison, family=Gamma)     note the ‘G’
> summary(poisons.glm)
> anova(poisons.glm)                                            analysis of deviance table

Once again there is a whole range of ancillary functions such as deviance, predict and
residuals. The latter will produce four types of residuals, but uses deviance residuals by
default.
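
A gamma fit and its ancillary functions can be tried on a small invented data set. The numbers below are made up for this sketch and are not the poisons data; the default (inverse) link of the Gamma family is assumed.

```r
# Made-up positive responses, purely for illustration.
toy <- data.frame(x = 1:8,
                  y = c(9.8, 5.3, 3.2, 2.6, 2.1, 1.6, 1.5, 1.2))
toy.glm <- glm(y ~ x, family = Gamma, data = toy)
dev.res   <- residuals(toy.glm)                    # deviance residuals by default
pear.res  <- residuals(toy.glm, type = "pearson")  # one of the other types
total.dev <- deviance(toy.glm)                     # residual deviance of the fit
```

The squared deviance residuals sum to the residual deviance, which is why they are the natural default for a GLM.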
The family argument is also used to specify other aspects of the fit such as the link function.
For example, one can have family=binomial(link=probit). With the binomial the
response can either be a factor (taken as first level vs the rest) or a matrix with two columns giving
the number of successes and failures. There is a quasi family allowing user-defined models,
and a robust family generator allowing robust fitting. The scope for ingenuity is unlimited!

Binary Data

The following example is taken from D. Collett (1991) Modelling Binary Data, page 217.
Numbers of rotifers falling out of suspension for two species (Polyartha major and Keratella
cochlearis) are given for different fluid densities in the table, as file rotifer.dat:

klmn6o"p6qsrDt-uvqwr6t-uvp@xpwyzuvqwyzu{pxp
|uE}|"~ |6| 6€ |"*|"‚|
|uE}Dƒ6} „ €6‚ |…,ƒ…@€
|uE}Dƒ| |"} „6‚ 6}†ƒ6…
|uE}D6} |"~ €6 |"}†ƒ6€D
|uE}D6} ~ 6‚ |…!|"ƒD~
|uE}D6} ƒ| „6 6*|"‚|
|uE}D| |" ƒ6~ ƒ6‚*|"‚D„
|uE}…} … …6… 6ƒ†ƒ6€D‚
|uE}…} |"} | ƒ6ƒ*|6|‡„
|uE}…| 6‚ 6‚ ƒ6*|"‚Dƒ
|uE}…€ ƒ6} ƒ6„ „ …@ƒ
|uE}…~ … 6~ ƒ6ƒ …@€
|uE}D6} ƒ6} ƒ6ƒ ~ …@~
|uE}D6} ~ |… …!|"‚D}
|uE}D‚6} |… |"„ „| „…
|uE}D‚| |"} ƒ6ƒ ƒ6 …@
|uE}D‚6 ‚… ‚6‚ ~…!|"}|
|uE}D„6} ‚6€ €6‚ ‚6 ‚D€
|uE}D„6}w…€6€w…~6ƒ*|"„6€*|"~D}
|uE}D„6} €6€ €6~*|"…!|"…

An annotated session follows. Several points need further explanation.

The parametrizations need careful consideration. By default S uses a linear-model
parameterization, contrasting each level with the average of the previous levels. This is less useful for
GLMs. The first way out below is to remove the overall mean (the -1 term) which forces
separate means for each species. We can also change to the GLIM parameterization by the options
line.
There is a catch here. By default factor() numbers the factor levels in alphabetical order, so
we have to force the order we want (see § 10).
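
The alphabetical default and the forced ordering can be compared directly. The short vector below is invented for the sketch; the labels pm and kc simply stand for the two species.

```r
species <- c(rep("pm", 3), rep("kc", 3))
f.default <- factor(species)                          # levels sorted alphabetically
f.forced  <- factor(species, levels = c("pm", "kc"))  # levels in the order we want
levels(f.default)
levels(f.forced)
```

With the forced ordering pm is the first level, so any ‘first level’ parameterization compares kc with pm rather than the other way round.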
Figure 6: Plots for Rotifer data (pm.prop against density). The square symbols and dashed line
indicate species Polyartha major.
È.ÐbΊÏËPÇ|È=­ZÉÊíÈPÇÞÑÝXÙ!ÎPÞÆ3«ǞÒ6F ÈÐ΄ÏËPÇ{ȌÙÝÞ|ÎGFžÕ Å ÇÞbÝ.Ç|È× HŠß
È.ÐbΊÏËPÇ|=È ­
Þ|ÎÎPÞÍ Å ÒSÈ.ÐbΊÏËqÇ|=È ­.ß list the data frame

ÖPÍaŠÙ ‰È.@Ð ‰ÃÉÊeÖqÍåÙÔqô|ÖqÍåÙÎÐÎ


‰bÛ¨ŠÙ ‰È.@Ð ‰ÃÉ( Ê ‰ÑÛ¸ÙÔqô ‰ÑÛ¸ÙÎÐÎ compute the proportions

‰=«ÐÎaÒºÝÇÌ ­ÏÎÔ_‹ Õ ‰ÑÛ¸vÙ ‰.È.6Ð ‰ŒÕ]Î Ô ‰„Çb× FºÌ FŠÕAÔ «PϺÛ×ðÍ\Ò~àÓÕ Øßß and plot them
‰qÐqÏ Ì-Î ­\Ò~ÝÇÌ ­.ÏÎÔ_‹ Õ ‰„Í Å ×àPß
Õ ‰ÑÛ¸ŠÙ ‰È.6Ð ‰Œ:
‰qÐqÏ Ì-Î ­\Ò~ÝÇÌ ­.ÏÎÔ_ÕÖqÍåŠÙ ‰È.6Ð ‰\ß

è$«Û¸Ùv‰bÛüÉʱèA«ÛŽÒ ÍÆžÏ ÌݐҌ‰ÑÛ¸Ù!ÔXՍ‰bÛ¨ÙÎÐÎPʉbÛ¨ÙԊßM)eÝÇÌ­.ÏÎfitÔ_Õseparate Æ\Ï[ÌqÐۆÏmodels


Þ«ÒU«Ðè„forÏΊeach
ßß species
è$«Û¸Ù!ÖPÍZÉʱAè «ÛŽÒ ÍÆžÏ ÌݐÒ&ÖqÍåÙ!ÔXÕóÖPÍaÙÎÐÎPÊ{ÖPÍaÙԊM ß )eÝÇÌ ­.ÏÎÔ_ÕÆ\Ï[ÌqÐۆÏÞ «UÒ «Ðè„ÏΊßß
è$«Û¸vÙ ‰bÛ Ž $è «Û¨ÙÖqÍ bare summaries

­‰„ÇÍÏIÇ ­÷ÉÊeËqÞÍ|ÎÐÈaÒ Í\ÒSÈPÇ ‰_6Ò F‰b+Û FžÕSëàPß\ÕóÈPÇ ‰Ó6Ò F[ÖP%Í FŠÕSëàPßß\Õ


Now combine the two species

«DÇ
PÇ «3­b×^Í\6Ò F‰b+Û Fž,
Õ F[ÖP%Í Fbßß
È.ÐbΊÏËPÇ|Èë5ÉÊéÝÞ|ÎqÞåÙ!ËÈPÞ«ÛEǞҺÝÇÌ ­í×ðÍ\ÒºÝ.Ç"Ì ­.ÏÎÔXÕ ÝÇÌ ­.ÏÎԄßEÕ
ÔP/Ç ­÷×Ã͞ŒÒ ‰ÑÛ¸ÙÔ_Õ]ÖPÍaÙԄßEÕ]ÎÐÎ^×ðÍ\Ò ‰bÛ¨ÙÎ.ÐbÎXÕfÖPÍaÙÎ.ÐbΊß\› Õ ­‰„ÇÍ.ÏbIÇ ­ß
Þ|ÎÎPÞÍ Å ÒSÈ.ÐbΊÏËqÇ|ÈëPß
è$«Û¸Ù!È.ÐÎjÉÊeAè «یÒ[ÍÆžÏ Ìq݆Ò&ÔqIÇ ­ÕfÎ.ÐbÎPÊ{ÔPIÇ ­ ß ) ÝÇÌ ­Z| ì ­‰„ÇÍÏIÇ ­†ÕÆ\Ï[ÌqÐېÏÞ «tÒ «Ðè„Ï΄ßß

è$«Û¸Ù!È.ÐÎ
è$«Û¸Ù!È.ÐÎjÉÊeèA«یÒ[ÍÆžÏ Ìq݆Ò&ÔqÇI­ÕfÎ.ÐbÎPÊ{ÔPÇI­ß)ZÊqØ î ÝÇÌ­Zì|Note ­‰ŠÇÍÏÇ/­ÕóÆ\Ï ÌÐۆÏÞ«†ÒU«ÐèŠÏΊßß
the parameterization used

è$«Û¸Ù!È.ÐÎ
@Ð ‰ΊÏ{ÐÑÌ ­žÒ ÍÑÐÑÌ.ÎÈP/Þ ­|-Î ­Ñ×͞6Ò FÑÍÑÐÑÌÎÈ_ÙÎÈPÇÞ{Î|ۊÇÌ^ Î FŠ
Õ FÑÍbÐ|Ì.ÎȌي‰separate
ÐI«bÔ^Fßß means for each species
è$«Û¸Ù!È.ÐÎjÉÊeAè «یÒ[ÍÆžÏ Ìq݆Ò&ÔqIÇ ­ÕfÎ.ÐbÎPÊ{ÔPIÇ ­ ß ) ÝÇÌ ­ì/­‰„ÇÍ.ÏbÇI­†Õ ÆžÏ ÌqÐېÏbÞ«Òt«ÐbèŠÏ΄ßß
è$«Û¸Ù!È.ÐÎ
­’¬ÑÛÛ\Þ|ÈÔaÒ&Aè «Û¸Ù!È.ÐbΊß
ÞÌ Ð
qÞ\Ò&Aè «Û¨ÙÈ.ÐbÎŠß over-dispersion, but a common slope
è$«Û¸Ù!È.ÐÎjÉÊeAè «یÒ[ÍÆžÏ Ìq݆Ò&ÔqIÇ ­ÕfÎ.ÐbÎPÊ{ÔPIÇ ­ ß ) ÝÇÌ ­ î ­‰„Çlooks ÍÏIÇ ­†OK
ÕÆ\Ï[ÌqÐېÏÞ «tÒ «Ðè„Ï΄ßß
«PÏ[̊/Ç ­\ÒºÝ.Ç"Ì ­.ÏÎÔXÕfˊÏÎÎPÇÑݐÒS$è «Û¨ÙÈÐΊ߆:ã ­‰ŠÇÍÏ/Ç ­b×]× FºÖP%Í F[çŠß
«PÏ[̊/Ç ­\ÒºÝ.Ç"Ì ­.ÏÎÔXÕfˊÏÎÎPÇÑݐÒS$è «Û¨ÙÈÐΊ߆:ã ­‰ŠÇÍÏ/Ç ­b×]× F?‰b-Û F[çXR Õ «bÎÔ.× Pß
ñ.Ý.ÇÌGÉ| Ê ­Çbõ†Ò|Ø\Ù"àëåÕÜØEÙà #åÕYàXÙààŠØbß these lines are rather crude, so try harder!

Ô.Ý.ÇÌGÉ* Ê ‰.ÈPÇÑÝPÏbÍ|ÎåÒS$è «Û¸ÙÈÐÎ_Õ]Ý.Þ|ÎPÞaÙËÈPÞ[Û\Ç\Ò~ÝÇ"Ì ­b×bÈPÇ ‰ÓÒ&ñÝÇ[ÌŒÕ ëPßEÕ


­‰„ÇÍÏIÇ ­Ñ×ËqÞÍ|ÎÐÈaÒ Í\ÒSÈPÇ ‰_6Ò F‰b+Û FžÕfڊØbßEÕ]ÈqÇ ‰Ó6Ò F Öq%Í FŠÕYڊØbßßEÕ
«DÇ
PÇ «3­b×^Í\6Ò F‰b+Û Fž,Õ F[ÖP%Í FbßßßEÕ]Î Ô ‰ŠÇÑ× F[ÈPIÇ ­‰qÐ|Ì ­7Ç Fß
«PÏ[̊/Ç ­\Ò&ñÝÇ̌ÕfÔ.ÝÇÌ£ã«IØ 3"Ú„Ø ç_Ò Õ «ÎÔ.× qß
«PÏ[̊/Ç ­\Ò&ñÝÇ̌ÕfÔ.ÝÇÌ£ãÚ‘ë 3«ØàëçXÕ «ÎÔ.× Pß

Poisson Data

We consider the log-linear analysis of a contingency table. As this has two ‘history’ factors
and two levels of the response, it could also be treated as binomial data. The response is
the occurrence of coronary heart disease. The table is of the form:

blood pressure
serum
chd cholesterol 1 2 3 4

yes 1 2 3 3 4
2 3 2 1 3
3 8 11 6 6
4 7 12 11 11

no 1 117 121 47 22
2 85 98 43 20
3 119 209 68 43
4 67 99 46 33

with log-linear analysis:


‚‘à …™™ò
‡P‰€ ¡qt‚£x|}
’™‘ š5™\ šPšð••Ã˜ š‘šqššPš
šqš˜ š‘žš±’„˜ ‘q‘OŠ“”G’Š™‘q–šPš”ü‘q–q”ü•𒊙•˜ ”q”ð’\•Ã™P™
Ÿ E‚ tb… €5‡q‰ vŠ¢ € 9£x~uŠœŠ €.€ ý šå· ’ ‹€ €  œ   …žý’ _š ’ · ’ ‘ ‹I¡¦„Ÿ ƒqýž¡_x|€ w+‹6|  ‚}q}
¾q¾ q‡ ‰ Šƒ t/9Et6 Ÿ œŠt…\_x Ÿ tq¡6"ƒž ¢.‚£x¡åx ‹ ‹ }_‹ ‚\t|… }_‹ ‚ …å}

¾q¾7µ 6"Evb… ‡q‰ Evb…¸x~‚   … € ù€ œ   …†ûu„œŠ €P€ û.¡¦„ƒŒ‹ Ÿ tb…a¢vwŠý.u µ ¢ €.€µ ‚£‹IƒŠt/9Etý¾q¾†}
t‚ 5Et_x~¾Pšˆ¾C‡q6 ‰ð
Švb …7‹Õ9\ 9„ý’..¦¢]} ù ‰„€   · €q€_·
¾q7¾ 6"Evb…Ÿ µ¼ uŠš ƒŠ/t‘ 9E_xº¾q¾C6Eµ vb…¨‹î6 š6 œ … u„œP ¡b¦Šƒ\}
u\t.œŒx&… œ ýP¡_x ‹ }q}X¯]u\v 9£xº¾q¾76"\v|… }
µ
The anova command gives an analysis of deviance for glm objects:
ÞÌÐ
PޞÒ&Ö֌ÙèA«Û£ÕÎqÇI­|Î×F 2ÑÅ ÏFß
d̄Þ«b-Ô ­.Ï ­ ÐË 6Ç
„ÏÞ̊Í“ Ç HPÞ3Æ «Ç
.Ðq*Ï ­/­bÐÑÌéۊÐÝ.Ç «

/Ç ­‰ÐÑÌ ­-Ç 3 /Ì ¬bÛ


HPÇ{È| Û ­ZÞbÝÝÇb| Ý ­ÇbDõ ¬ŠÇÌ.ΊÏbÞ «/«ÔÒSˊÏÈ=­|ÎZÎМ«Þ/­|΄ß
Ë DÇ
ŠÏÞ̊ÍÇ /Ç ­.Ï|Ý_Ù Ë IÇ ­.Ï{ÝXÙ Ç6
ÈåÒ 2|Å Ïß
”• >> ŠØ ØáŽÙ"ëë#
­Ç|È ¬bÛ  ##XÙ #à ë  ØÚáá_Ù Úá±à_Ù"ààààààà
‰ÈP/Ç ­I­  „Ø XÙë # ëÚ Øë _Ù"Ú #à±à_Ù"ààààààà
Í Å Ý ØZØØá XÙáà  ë  #XÙ áàZà_Ù"ààààààà
­Ç{*È ¬Ñ–Û 3Š‰ÈP/Ç ­I­  ë ŒEÙ  ØÚ Ú ŒÙ"ڄØØ$à_Ù"àà á .Ú 
Å
­Ç{*È ¬b—Û 3&Í Ý  àXEÙ .Úë Øë ë ŒÙ"àÚ Zà_Ù"àààààŠØØ
‰.ÈqIÇ ­I­+3&Í Å Ý  Ø XÙë   ŒÙ ##ÚZà_Ù"àààë 
­Ç{*È ¬Ñ– Û 3Š‰ÈP/Ç ­I­-3SÍ Å Ý  ŒÙ ##Ú à àXÙ"àààZà_Ù Ú  ëà

13.5 Updating and Selecting Models

There are a number of facilities to update models. The update function takes a result of a
previous fit and changes the model in some way. add1 and drop1 show the (approximate) effects
of adding and dropping single terms, and step runs a fairly general stepwise fitting procedure.
(Note that S-Plus 3.x has a separate stepwise function for multiple regression.)
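
A minimal sketch of update and drop1 on invented data follows. The data frame toy is made up for the illustration, and the ‘. ~ . - x2’ form of the new formula (keep everything except x2) is the usual way of expressing a change relative to the previous fit.

```r
# Made-up data purely for illustration.
toy <- data.frame(x1 = c(1, 2, 3, 4, 5, 6),
                  x2 = c(2, 1, 4, 3, 6, 5),
                  y  = c(3.1, 3.9, 7.2, 7.8, 11.1, 11.9))
full    <- lm(y ~ x1 + x2, data = toy)
reduced <- update(full, . ~ . - x2)   # same fit with the x2 term removed
drop1(full)                           # effect of dropping each term in turn
```

Refitting through update keeps the data and other arguments of the original call, so only the change to the model needs to be written down.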
˜

14 Multivariate Analysis

S-Plus is particularly rich in functions for exploratory multivariate analysis, such as
pairs, brush and spin. There are also functions for classical multivariate analysis.

Clustering

The workhorses here are dist which computes distance matrices (also used in cmdscale) and
hclust which computes a cluster tree by single-, average- or complete linkage.
dist        Distance matrix calculations
hclust      Hierarchical clustering
cutree      Create groups from a cluster tree
plclust     Plot a cluster tree
labclust    Label a cluster tree plot
clorder     Re-order leaves of a cluster tree
subtree     Extract part of a cluster tree
mclust      "model-based" clustering
mclass      auxiliary functions
mreloc
Graphical Methods

This is a varied collection of functions for displaying multivariate data.

cmdscale    Classical multi-dimensional scaling
faces       Chernoff’s faces
mstree      Minimal spanning tree
stars       Star plots
biplot      Biplot (v 3.2)
Two analyses of socio-economic data on Swiss cantons:
library(ripley)
d <- dist(swiss.x)
x <- cmdscale(d)
c1 <- x[,1]; c2 <- x[,2]
eqscplot(c1, c2, type="n")    # from library(ripley)
text(c1, c2, seq(c1))
h <- hclust(d)
plclust(h)
cutree(h, 3)
plclust(clorder(h, cutree(h, 3)))    # re-order tree into three groups
˜

Matrix Methods

The classical methods based on variance-covariance matrices.

mahalanobis    Mahalanobis distances
cancor         Canonical correlation analysis
discr          Discriminant analysis
prcomp         Principal components analysis
princomp       Principal components analysis (v 3.2)
factanal       Factor analysis (v 3.2)
An example of discriminant analysis with Fisher’s iris data:
> iris.var <- rbind(iris[,,1], iris[,,2], iris[,,3])
> species <- rep(1:3, rep(50, 3))
> iris.dis <- discr(iris.var, 3)
> iris.dv <- iris.var %*% iris.dis$vars          # find discriminant variables
> brush(cbind(iris.dv, species))
> iris.x <- iris.dv[,1]; iris.y <- iris.dv[,2]
> iris.lab <- c(rep("s", 50), rep("c", 50), rep("v", 50))
> plot(iris.x, iris.y, type="n", xlab="first discriminant variable",
       ylab="second discriminant variable")
> text(iris.x, iris.y, iris.lab, cex=0.7)

Figure 7: Discriminant analysis
(Second discriminant variable plotted against first discriminant variable, with points labelled
s, c and v by species.)


A Libraries

Libraries are a mechanism to add ‘packages’ of extra objects (functions and datasets) to S. To
find out which libraries are available type
> library()
which on one of my systems gave:
The following sections are available in the library:
SECTION        BRIEF DESCRIPTION

chron          Functions to handle dates and times.
examples       Functions and objects from The New S Language.
external       Handle external (large) objects.
image          Display images.
maps           Display of maps with projections.
progdraw       draw example from Programmer's Manual.
progexam       Examples from Programmer's Manual.
semantics      Functions from chapter 11 of The New S Language.

ripley         B. D. Ripley's teaching functions

For more information on each library section see the README file in
each library section directory or in S-PLUS run:
        library(help = <section name>)
Library sections from Venables & Ripley (1994)
        Modern Applied Statistics with S-Plus
MASS           main library
nnet           neural networks
spatial        spatial statistics
To find out more about a section, use
> library(help=name)
e.g.
library(help=robust)
Functions for robust statistics
IQR(y)                        inter-quartile range
huber(y, k=1.5)               Huber location with MAD scale
hubers(y, k = 1.5, mu, s)     Huber proposal 2 [with mu known, s known]
hreg(x, y, k=1.5)             Huber robust regression

datasets
chem        copper in wholemeal flour
abbey       nickel in syenite rock
milk        lead in milk powder
phones      Belgian 'phone calls 1950-1973
To use the library, invoke it by
> library(name)
which attaches it as a data directory at the end of the search list. Thus libraries cannot override
standard functions nor your own functions. To make a library override the system functions,
use
> library(name, first=T)
which attaches it at position 2 (after the .Data directory).

A.1 Library ripley

This is a collection of useful functions and datasets for teaching at Oxford.

BoxCox(x, y)             Box-Cox plot for transformations.
gl(from, to, size, x)    replacement for GLIM %GL.
eqscplot                 equally-scaled plot function.
stdres(object)           calculate standardized residuals from a fit.
studres(object)          calculate Studentized residuals from a fit.

Datasets in the library are:

accdeaths           US accidental deaths 1973-8
cement              dataset on heat evolved in setting cements
cpus                dataset on performance of cpus
deaths              time series on UK lung deaths 1974-9 from Diggle
mdeaths, fdeaths    as above, for males and females
gehan               remission times on leukaemia patients (censored)
forbes              Forbes’ dataset on boiling points, from Atkinson
hills               dataset on times of Scottish hill races
leuk                (uncensored) survival times on leukaemia patients
lh                  time series on luteinizing hormone from Diggle
mammals             body weight (kg) and brain weight (g) of mammals, from Weisberg
mcycle              motorcycle impact data, Silverman JRSS B 1985
motorette           accelerated life testing on motorettes
nottem              time-series of temperatures in Nottingham, 1920-1939
road                dataset on road deaths in the US
rock                dataset relating permeability to physical measurements
rubber              dataset on rubber wear
ships               ship damage incidents, from McCullagh & Nelder
trees               Black Cherry trees heights, diameters and volumes

A.2 Sources of Libraries

Many S users have generously collected their functions and datasets together into
libraries and made them publicly available. An archive of libraries is maintained at
Carnegie-Mellon as a service to the statistical profession by Mike Meyer. To obtain details of its contents
by e-mail send a message to
statlib@lib.stat.cmu.edu
with body
send index
send index from S
Ftp to lib.stat.cmu.edu with user statlib is also available.
