
Data Analysis and Data Management Using R

(Version 0.98)

Tom Backer Johnsen

Faculty of Psychology, University of Bergen

March 28, 2008

Copyright © 2008 by Tom Backer Johnsen
http://www.galton.uib.no/johnsen
ISBN 82-91713-40-5
Universitetet i Bergen
Det psykologiske fakultet
Christies gt. 12
5015 Bergen, Norway
Tel: +47 55 58 31 90
Fax: +47 55 58 98 79
URL: http://www.uib.no/psyfa/isp

Contents

Preface  xi
  Overview  xii
  Caution  xiv
  Acknowledgment  xiv

1 Introduction  1
  1.1 What is R?  1
  1.2 Inexperienced users  3
  1.3 Experienced users  3
  1.4 Finally  3

2 Starting and Stopping R  5
  2.1 Opening a session  5
  2.2 For the impatient  5
  2.3 Closing a session  7
  2.4 Documentation  7
  2.5 Getting help  8
  2.6 Fine-tuning the installation  9

3 Basic stuff  11
  3.1 Simple expressions  11
  3.2 Vectors  12
  3.3 Simple univariate plots  13
  3.4 Functions  13
  3.5 Objects  13
    3.5.1 Naming objects  14
    3.5.2 Object contents  15
  3.6 Other information on the session  15
  3.7 The Workspace  15
    3.7.1 Listing all objects in the workspace  16
    3.7.2 Deleting objects in the workspace  16
  3.8 Directories and the workspace  16
  3.9 Basic rules for commands  17

4 Data sets / Frames  19
  4.1 Sources of data sets  19
  4.2 Data frames  20
  4.3 Entering data  21
    4.3.1 Variable types in frames  22
  4.4 Reading data frames  22
    4.4.1 Reading frames from text files  22
    4.4.2 Selecting the data file in a window  23
    4.4.3 Reading data from the clipboard  23
    4.4.4 Reading data from the net  24
  4.5 Inspecting and editing data  24
  4.6 Missing data  24
    4.6.1 Identification of missing values  25
    4.6.2 What to do with missing data  26
    4.6.3 Advanced handling of missing data  26
  4.7 Attaching data sets  27
  4.8 Detaching data sets  27
  4.9 Use of the with () function  28

5 Data analysis  29
  5.1 Classification of techniques  29
  5.2 Univariate statistics  31
    5.2.1 Simple counts  31
    5.2.2 Continuous measurements  31
    5.2.3 Computing SS  32
    5.2.4 Plots  33
  5.3 Bivariate techniques  34
    5.3.1 Correlations  34
    5.3.2 Scatterplots  35
    5.3.3 t-test, independent means  35
    5.3.4 t-test, dependent means  37
    5.3.5 Two-way frequency tables  38
  5.4 Multivariate techniques  38
    5.4.1 Factor analysis  39
    5.4.2 Rotation  40
    5.4.3 Principal component analysis  40
    5.4.4 Principal factor analysis  43
    5.4.5 Final comments on factor analysis  44
    5.4.6 Reliability and Item analysis  45
    5.4.7 Reliability  45
    5.4.8 Item Analysis  47
    5.4.9 Factorial Analysis of Variance  49
    5.4.10 Multiple regression  52
  5.5 On Differences and Similarities  56
    5.5.1 For the Courageous  57

6 Resampling, Permutations and Bootstrapping  59
  6.1 Resampling  60
    6.1.1 Permutation (Randomization) tests  60
    6.1.2 Bootstrapping  64
    6.1.3 Using the boot package  65
    6.1.4 Random numbers  66
    6.1.5 Final Comments  67

7 Data Management  69
  7.1 Handling data sets  69
    7.1.1 Editing data frames  70
    7.1.2 List the data set  70
    7.1.3 Other useful commands  71
    7.1.4 Selecting subsets of columns in a frame  71
    7.1.5 Row subsets  72
    7.1.6 Repeated measures  72
  7.2 Handling commands  72
    7.2.1 Saving commands  73
    7.2.2 Using saved commands  73
  7.3 Transformations  74
  7.4 Input and Output  76
    7.4.1 File Names  76
    7.4.2 Input  76
  7.5 More on workspaces  77
  7.6 Transfer of output to MS Word  77
    7.6.1 Table 1  78
    7.6.2 Table 2  79
    7.6.3 Conclusion  81
  7.7 Final comments: The Power of Plain Text  82

8 Scripts, functions and R  85
  8.1 Writing functions  85
    8.1.1 Editing functions  85
    8.1.2 Sample function 1: Hello world  86
    8.1.3 General comments on functions  88
    8.1.4 Sample function 2: Compute an SS value  88
    8.1.5 Sample function 3: Improved version of the SS function  89
    8.1.6 Things to remember  90
    8.1.7 Sample function 4: Administrative tasks  90
  8.2 Scripts  90
    8.2.1 Sample script 1: Saving data frames  91
    8.2.2 Sample script 2: Simple computations  93
    8.2.3 Sample script 3: Formatted output  95
    8.2.4 Sample script 4: ANOVA with simulated data  96
    8.2.5 Sample script 5: A more general version  96
    8.2.6 Nested scripts  97

A Data Transfer  99
  A.1 Why is this important?  101
  A.2 The audit trail  102
  A.3 Data Sources  102
  A.4 Data Transfer  102
    A.4.1 Manual data entry  102
    A.4.2 Scanning forms  104
    A.4.3 Filters  104
    A.4.4 Checks  104
    A.4.5 The final data set  105
  A.5 Final comments  105

B Installation and Fine-tuning  107
  B.1 Installation  107
  B.2 Editors  107
    B.2.1 Tinn-R  108
    B.2.2 WinEdt and RWinEdt  108
    B.2.3 Notepad Plus  108
    B.2.4 vim  109
  B.3 GUI Interfaces to R  109
    B.3.1 R Commander  109
    B.3.2 SciView-R  109
  B.4 Installing packages  109
    B.4.1 Normal installation  110
    B.4.2 Updating packages  110
    B.4.3 Failed installation  110

C Other tools  113
  C.0.4 Spreadsheets  113
  C.0.5 Managing text  114
  C.0.6 Bibliographies  116
  C.0.7 Presentations  116
  C.0.8 Portable Applications  117
  C.0.9 Combining Statistical Output and Authoring  117

Reference card  119

References  121

List of Tables

4.1 Data stored in a text file  21
4.2 Clipboard data  24
4.3 Missing observations in frame  26
5.1 Rearranged sleep  37
5.2 Data for Item Analysis  45
5.3 The elements used for Cronbach's alpha  46
6.1 Permutations of four values  61
7.1 File output.txt  74
7.2 Means, Standard Deviations and Intercorrelations  78
7.3 Regression Analysis Summary  80
8.1 File Input.data  92
8.2 Simple data set  93
8.3 Contents of Tiny.txt  94

List of Figures

2.1 R Opening Window  5
2.2 Closing R  7
2.3 Sample Help Window  9
3.1 1000 normally distributed random values  13
5.1 Histogram with density estimate  33
5.2 Scatterplot with regression line  35
5.3 Scree plot  41
7.1 Editing the attitude data set  70
A.1 The data entry loop  100

Preface
For quite some time, several colleagues of mine and I have been looking for alternatives to the
standard statistical packages like SPSS and Statistica used by most of our colleagues in the social
sciences. This search has been triggered by several trends we have seen in the past few years.
Suppliers of commercial statistical software have become more and more paranoid about the
installation of the software in respect to licenses, installation periods, etc. Consequently, I
cannot be sure that I as a researcher at all times have access to one of the essential tools of
our trade, a set of statistical routines. For instance, one piece of mathematical software I had
installed on my portable required that I be logged on to the Internet in order to use it. It was
immediately uninstalled. As I am writing this, my installation of both Statistica and SPSS
will not work due to a missing time code; my licenses for the programs are only valid for
one year at a time.
The prices of the licenses have gone up, which, together with increased pressure on the funding of universities, increases the risk of running into potentially interesting collaborators
who simply cannot afford to use these tools. We have had that problem for a long time with
researchers from the third world; now the problem is spreading into the west as well. In
other words, if a data file in SPSS format (.sav) is sent to a colleague, you cannot be sure that
it can be read by the recipient without problems. The solution is simple: drop SPSS and use
a more general format instead, one based on plain text.
In addition, and as a matter of principle, all parts of the research process should be open to
peer review. This includes both documentation of details about the data analysis itself and,
at least in principle, the software used for the data analysis as well. In contrast, commercial
and proprietary programs like the ones mentioned above do not encourage documentation
of all steps in the data analysis. Of course, it is possible to use scripts in these programs,
but this is not enforced in any way, and it seems that few users actually do so. In addition,
the source for these programs is definitely not open for scrutiny. This is problematic for the
people selling the products, for the researchers who insist on using them, and for the
readers of what the same researchers produce. To reiterate: All parts of the research process
should be open to peer review.
When using open-source software, these problems are at least reduced. There are
several alternatives for doing statistical operations in that world, but one of them turns out to
be far superior in very many respects compared to the standard ones. This is a system simply
called R. It is also based on a different type of user interface, a command or script oriented
approach, which for experienced users is far superior to what is popularly preferred.
The standard (base) version of R as installed by default is very powerful in itself. In addition,
there are a very large number of additions (so-called packages) which may be added to the
system for all kinds of more specialized techniques. In other words, you may tailor your system
to fit your own needs. If you have programming skills you may even add your own components
in various ways.
Now, if you have a piece of software which is open, available at no cost, customizable with
an immensely rich set of available methods and procedures which seems to be optimal for both
novices and experienced users, what do you do? The conclusion is simple: You make the change.
In any case, R is obviously a statistical tool worth exploring. As a latecomer, I have just started
that process for myself, and writing this document is part of that exploration.

Overview
For one thing, the focus in this document is on elementary data analysis rather than statistics
as such. In addition, the focus is more on how things may be done rather than why one might
want to do a particular analysis. In other words, there is little methodology and little statistics.
I will assume that basic information about statistics and methodology is known to the reader
and available from other sources. In addition, the scope is not limited to R alone; in the three
appendices I touch on themes that are for the most part less oriented on R and more toward
project management as such. This includes tools for entering large amounts of data etc., as well
as other useful tools needed for the production of high-quality typesetting of documents.
The main problem has been to decide on what to include in the text and the sequence to treat
the various themes. After all, a text structured like a book is very much a linear affair, at least
in respect to what follows what. The only thing the author can do to break the strict sequence of
the text is to include aids like a table of contents, an index, and a lot of cross references to other
relevant parts of the text as direct references or footnotes. In any case, you have to make a decision
on what you regard as a sensible sequence. That has given me a lot of headaches. I finally
decided on the following:
Of the first two chapters, the Introduction and Starting and stopping R, the first includes
some background information and the second the basics on starting and stopping the system, as well as pointing out where to find documentation and help from the installation.
The second chapter also includes something for the really impatient user.
The chapter on Basic stuff contains fundamental information on expressions, functions
and objects as well as the essentials of a workspace. It ends with a summary of the basic
rules for commands.
Datasets / frames: One can start using and experimenting with the system using nothing
but the built-in data sets. But, the real fun starts with the analysis of real data, either data
generated according to one's own specifications or your own data in an empirical sense. Here
the basics of data management are described.
With an initial coverage of these themes, the foundation for a central theme of the document,
data analysis, has been laid:
Data analysis: Here six types of methods are described: (a) simple univariate statistics,
(b) some simple bivariate operations, (c) variants of factor analysis, (d) reliability and item
analysis, (e) factorial analysis of variance and finally, (f) multiple regression. This subject is
discussed in chapter 5.
Following the chapter on data analysis, the next chapter is oriented towards modern and
more general alternatives to the classical tests of significance, a class of techniques that fall
into the general category called resampling techniques. These are computer-intensive
methods in the sense that they are for all practical purposes impossible to perform without computers. It should be added that these techniques are not present in the so-called
standard packages, but are easily accessible in R.
However, every time you start working on a new project, it implies much more than the ability to start a data analysis of a particular type. That part is often quite simple in itself, often almost
trivial compared with the necessary preparations for the data analysis. The really important parts
of a project do not involve the statistical tool directly, but include activities like data collection
(constructing forms, scales etc.), transfer of the data into a format you can use, e.g. data entry,
strategies for getting the results you want into documents in the correct format, etc.
In addition, after you have obtained the results you need, you have to transform the results
from the software into a more readable format, i.e. to generate proper tables and figures for a
paper or report. These parts need to be documented as well. So, some of these subjects need to
be covered as well, at least briefly, towards the end. In addition to the focus leading towards
data analysis in the chapters mentioned above, some other subjects therefore need to be covered:
The first of these additional themes is a discussion in appendix A on the proper way to enter
data sets with a minimum of errors, if possible using dedicated software. Here the idea of
maintaining an audit trail for the project is introduced. This is a subject that is becoming
increasingly important for several reasons. You have to be able to transfer the recorded data
to a format suitable for reading into R and at the same time be able to record what has been
done, e.g., to document operations. So, more information on data transfer is needed.
This will require information about the data handling beyond what is defined in part 4 on
page 19, e.g. the ability to read data stored in other formats, and after possible transformations, to store them in a suitable format (as well as a recording of what was done).
To be able to do the needed transformations (where it is impossible to supply more than a
few suggestions), the main thing is again to be able to document what you actually have
done, and that is where the next part is important.
In general, it is not a trivial matter to transfer output from a statistical program to your
favorite word processor. For one thing, SPSS and Statistica generate too much output while
R by default generates too little. In both cases the elements are in the wrong sequence for
the APA standard. The part called Transfer of output to MS Word covers the transfer of
output from R to a properly formatted table in a document formatted according to the APA
Publication manual (American Psychological Association, 2001) and (Nicol & Pexman, 1999). That part may be of
interest to users of other statistical systems as well.
One very important aspect of the R language is that it is command oriented. Normally, the
commands are entered one by one at the keyboard. But it is also very useful to have sets of
R commands (scripts) stored and run from simple text files, or alternatively, as functions
stored in the workspace. This is a great advantage in respect to documentation of your
project as well as reusing sets of commands. This subject is briefly touched upon in chapter
8.
Finally, the last chapter in the third part is called Fine-tuning your installation which covers the basics of the installation process itself, as well as alternatives in respect to GUI interfaces and editors for the R system. That section also includes some very brief comments on
other software that may be useful.


Font conventions
This document is set in a font called Palatino Roman, and that holds for most of the text. However, when commands for the R language and function names are printed, a monospaced Courier
type of font is used.

Caution
This document is at best a very short introduction to a tool of vast capabilities, covering only
a few themes. For an impression about how superficial this introduction really is, consider the
most comprehensive textbook on R I know of. This is (Crawley, 2007), which runs to 942 pages, a
massive book packed with information and examples. And that one is still far from complete; one
quite large subject I know a little about is called Social Network Analysis or SNA. There are
a large number of useful functions in R for analysis of that type of data (e.g. found in the package
called sna) which are not covered in that book at all. There are probably many, many more such
subjects.
So, as an introductory text this is naturally quite superficial and only covers what I consider the
most important aspects of the language. It can, as a matter of course, become much, much better.
The next version will (hopefully) be better than this one. In addition I welcome any suggestions
in respect to errors and/or improvements.

Acknowledgment

Professor Hans-Rüdiger Pfister from the University of Lüneburg, Germany, has been very generous with many suggestions about the use of R, as well as some of the material presented here,
especially the very nice example on factorial ANOVA in chapter 5 which is also used in the sample
scripts in the discussion of that subject in chapter 8.

Chapter 1

Introduction
The purpose of this chapter is simple: to tell what the R language is and to cover some
aspects of what is called open-source software, together with some
comments on the needs of different types of users.

1.1 What is R?

R 1 is a computer language oriented towards statistical computing, similar to the S language originally developed at Bell Laboratories in the US. This language has been implemented in a first-class
piece of software which is nice to use both for learning statistics and for problems of more experienced researchers. One reason is simple; in contrast with more conventional packages like SPSS
and Statistica, you have to know something about what you are doing when using R. And that is
not a bad thing, at least for researchers.
To quote from the introduction to one of the manuals for the system (included when the system
is downloaded and installed):
... R is an integrated suite of software facilities for data manipulation, calculation and
graphical display. Among other things it has:
an effective data handling and storage facility,
a suite of operators for calculations on single values, arrays, and matrices,
a large, coherent, integrated collection of intermediate tools for data analysis,
graphical facilities for data analysis and display of results either directly on the
screen or stored on file for later use.
And:
a well developed, simple and effective programming language (called S) which
includes conditionals, loops, user defined recursive functions and input and output facilities. (Indeed, most of the system supplied functions are themselves written in the S language.)
1

The language for controlling S is said to be essentially the same as R; it is claimed that most procedures available for
S may be used in R without modification. Also, one may see references to the S language even when R is discussed.
In other words, one may regard S and R to be two different implementations of the same language.

The term environment is intended to characterize it as a fully planned and coherent
system, rather than a collection of very specific and often somewhat inflexible tools,
as is frequently the case with other software for data analysis.
R is very much a vehicle for newly developing methods of interactive data analysis.
It has developed rapidly, and has been extended by a large collection of packages.
However, most programs written in R are essentially ephemeral, written for a single
piece of data analysis.

My feeling is that this quote represents a series of understatements. R is in many ways much
better suited for research purposes than the commonly used commercial products. For one thing,
it has to a very large extent been created by scientists for scientists. In particular, the graphics
functions in R are really very advanced and flexible.
Compared with the popular GUI (Graphical User Interface) systems, some of the characteristics of R are:
It is command oriented, controlled by entering commands at a console or window (a special screen). Persons who are used to a conventional GUI system may regard this as an
old-fashioned, slow and cumbersome interface. That is true, at least to begin with and
for novices. However, once you know the basic commands, it is much faster and easier
in use than any GUI interface you can think of. However, the real gain is in flexibility. You
have many more degrees of freedom than in conventional statistical packages controlled by
menus and the mouse (MM type programs). The wonderful part of GUI interfaces is what
has been called WYSIWYG (What You See Is What You Get), the drawback is simply the
reverse: WYSIAYG (What You See Is ALL You Get).
The commands may be stored in a file for later editing and reused. This is very handy for
documentation of anything beyond a trivial set of operations in respect to data analysis in a
research project as well as testing out variants of the same analysis.
It is object oriented, so you can have anything from a minimal amount of output (the default)
to a lot more if you so wish.
It is a very good example of an open source program. This means that it is downloadable
for free from the net, and the same holds for a very large number (hundreds) of packages
for more specialized functions which can be added to your version of the system. If you
want to (but you do not have to of course), you are even able to read the source to see what
any part of the system really does. In other words, this type of software is subjected to peer
review like any other part of a research process. In contrast, the code for conventional
packages is definitely not open for inspection and may have (and probably has) errors no
researcher will ever know about.
When first installed, it has a base; a basic set of functions and data sets are included,
enough for most users. This base can be expanded by downloading packages of your
own selection (among hundreds of alternatives) covering YOUR particular needs. The available packages cover everything from general functions to highly specialized themes. In other
words, you may tailor the system to your own requirements.
And according to (Verzani, 2004):
R is excellent software to use while first learning statistics. It provides a coherent,
flexible system for data analysis that can be extended as needed.


It is perfectly possible to start the students with the bare basics without having them cope
with a confusing menu structure that includes much more than even normally experienced users
would ever need. You only have to observe the problems a group of fresh users have with their
first encounters with a so-called user friendly point-and-click program to have serious doubts
about the user friendliness of the same programs. It can even be argued that these programs
encourage the learning of habits that are very much less than optimal for researchers.

1.2 Inexperienced users

Users not acquainted with conventional programs should have no problems other than learning
about statistics at the same time as getting acquainted with R. They have no preconceptions to
get rid of, and the main problem is to have them hooked on to the optimal starting point and to
have a reasonable progression from that point.

1.3 Experienced users

Users experienced with conventional statistical systems like SPSS and Statistica may have some
handicaps when starting with R for a number of reasons. For one thing, the menu structure in
these programs includes much more than most users will ever need. The initial problem with
these systems is therefore to learn what can (or should) be ignored, rather than what is actually
needed and useful for the problem at hand. Furthermore, conventional statistical systems are
rather inflexible in respect to how the data for the analysis is organized, where the main unit for the
analysis is a data set. In other words, with other systems you tend to work with one (often very
large) data set or file which contains everything you conceivably might need. In contrast, with
R, you are encouraged to work with subsets of your complete data set, which contains what you
need for one particular analysis or a class of related techniques. In particular, my understanding
is that the concept of a data set in conventional systems is quite rigid.
In any case, the closest equivalent to the data set concept in R is a data frame which is
almost the same as the conventional data set, but somewhat looser. The differences in terminology may also take some getting used to, and in a few cases things are very different. 2 One could (or
rather should) regard R (or S) as a programming language dedicated to statistical and mathematical
operations, since all the elements of more conventional programming languages are there as well
(loops, conditional execution, functions etc.), although I have to a large extent avoided that part
of the system in this exposition.

1.4 Finally

To quote (Crawley, 2005):


Learning R is not easy, but you will not regret investing the effort to master the basics.

For instance: The basic rule in conventional packages is that all observations from the same individual or unit are
placed in the same row of the data matrix. With R (and S), the basic unit seems to be the variable, or perhaps even the
observation as such. This shows up in the use of repeated observations, where all observations of the same variable are
assumed to be in the same vector or column, with additional factors identifying the repeat for each observation and
the individual case to which the observation belongs. The effect is that imported data of this type from these systems
may have to be rearranged before analysis; a simple import is not sufficient. See section 7.1.6 on page 72 for more details.


Chapter 2

Starting and Stopping R


This chapter covers little more than the really elementary information on how
to start and stop the system. In addition, some help is presented for the really
impatient readers who can generate some instructive output with only a few
keystrokes.
An additional subject is to present the basics of getting help from the system,
plus information on where more documentation for the system may be located.
If the program is not installed already, the first thing to do if you want to use the program is of
course to get hold of it. See the Installation and fine-tuning, part B.1 on page 107.

2.1 Opening a session

After an installation of R on Windows, you will probably find an icon on the desktop, a blue R.
Click that once to start the program and the screen will look like the one in Figure 2.1. An
alternative is to click on a copy of a file called .Rdata (or a shortcut to a file with that name).
See part 3.7: The workspace on page 15 below for details. When the program is started,
the current workspace is loaded and a prompt appears (normally a >). This is an invitation to
type something (a command). If you do so and end the line by pressing the Enter key,
the program responds with some output and a new prompt appears. In other words, R is an
interactive language.

[Figure 2.1: R Opening Window]

2.2 For the impatient

Whenever I come across a new program, I always want to have a try at generating some results
more or less immediately. I always assume that others do the same thing.
With statistical software, the first question is: How do I get some data into the system? With
R, that is simple. You already have some data at your fingertips once the program is started. So,

there are a number of data sets that may be used without any operations; they are part of the
installation.
The next question is: How do I get some results? Enter the following command when your
computer displays the window shown in Figure 2.1:
> mean (attitude)

As a result, you would see:


    rating complaints privileges   learning     raises   critical    advance
  64.63333   66.60000   53.13333   56.36667   64.63333   74.76667   42.93333

What does this mean? First, there is a data set (a frame in R terminology) called attitude
which is part of the installation of R and is available at all times (there are many others). Details
about this data set are obtained by entering the name with a ? in front of the name of the data set,
i.e. ?attitude (there are also a large number of other data sets, which are listed if you enter the
command data ()). You have access to a list of the variable names in the attitude data set by a
names () command, i.e.:
> names (attitude)

The function mean () returns an object which contains the means for the columns in the
data set named in the call on the function. It is possible to assign a name to the object (or rather
the other way around), as in the next command:
> xbar <- mean (attitude)

The combination <- is used for the assignment, and should be read as becomes; in this case,
the line should read: the name xbar becomes a pointer to the container for the means of the
variables in the data set attitude. You will sometimes see an equal sign (=) used instead of the <- combination, but that is not recommended. 1
> xbar
    rating complaints privileges   learning     raises   critical    advance
  64.63333   66.60000   53.13333   56.36667   64.63333   74.76667   42.93333

Here we first assign the results of the function mean () (the means for each of the columns
in the attitude data set) to the name xbar; the contents of the object are printed when the
name of the object is entered on its own as the next command. Once the object is assigned a name,
the contents can be referred to by that name for other purposes. For instance, we might (for an
unknown reason) want to have a vector containing the difference of the variable means from the
mean of the variable means. Here it is:
> xxbar <- xbar - mean(xbar)
> xxbar
    rating complaints privileges   learning     raises   critical    advance
  4.195238   6.161905  -7.304762  -4.071429   4.195238  14.328571 -17.504762

Finally, enter the command:


1

The use of = for assignment is a potential source of problems when programming with R, for the simple reason
that it is easily confused with the relational operator == used in logical tests. For that reason alone, it should be avoided.


> barplot(xbar)

And you get a nice-looking bar plot of the variable means in a separate window. Not perfect:
some of the longer variable names underneath the bars have been omitted for some reason, and
you may for instance want to add some colors, but it is a good start towards something that might
be useful. Also, note that you can reuse commands: use the up and down arrows to locate a
command you would like to use again, possibly edit it, and press the enter key. This is handy for
correcting errors in commands without having to retype the whole line or for trying out variants
of commands.
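
As a small illustration of that kind of tuning (not part of the original text, just a sketch using standard barplot () arguments), the labels can be rotated and a colour added:

> barplot(xbar, col = "lightblue", las = 2, cex.names = 0.8)

Here col sets the colour of the bars, las = 2 turns the labels perpendicular to the axis so that the longer names fit, and cex.names shrinks the label text slightly; see ?barplot (and ?par for las) for the full set of options.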

2.3 Closing a session

When you want to end the session, either close the window or type the command q () at the
prompt and then press the Enter key. In both cases the window shown in Figure 2.2 (Closing R)
appears.
If the Yes button is clicked, the workspace containing all the objects that have been generated
in this session (and possibly those from previous sessions) is saved in the active directory. This
information is stored in a file called .Rdata. When the system is started for the first time, this
file is placed in the same directory as the program itself by default. This is not recommended in
any case. With some installations (large networks etc.) you may not even be permitted to do so.
The solution is to use the option Change directory found in the File menu before closing
down and then pick a working directory within your private area. If you then look in that
location with Windows Explorer you will find the .Rdata file there. Double-click the file name to
start the program.

[Figure 2.2: Closing R]
The general theme of workspaces is discussed in more detail in part 3.7 on page 15.
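
As a supplement to the menu route just described (not mentioned at this point in the original text), the working directory can also be inspected and changed with two ordinary commands; the path below is only a hypothetical example:

> getwd()                          # show the current working directory
> setwd("C:/Users/me/R-projects")  # change it to another (existing) directory

The first command prints the current working directory, the second one changes it.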

2.4 Documentation

In general, the documentation for the system is very nice. When installed, several documents in
PDF format are included and are available from the Help menu item in the main window seen when
the program is started. Two of these manuals are potentially
useful for novices:
An Introduction to R: This is not very introductory in my eyes, but may be useful nevertheless,
especially after you have had some experience with the language.
R Data Import/Export: This covers most of the options in respect to importing data into the
system, some very advanced.
Documentation of the more technical kind and not for beginners include:
R language definition: For more experienced users and for people who have some background
in programming, this is a very interesting document. After using R for a while, it is recommended to at least look at this one now and then.
Parts of the same documentation are also available in quite readable HTML format from the
same menu, which is nice if you do not have a PDF reader installed.


There are also a number of very active mailing lists. For details first click Help and then R
project home page when you are logged on to the net. On that page locate the Mailing lists
entry in the list on the left side. For beginners, follow the instructions on the R-help part of the
page. It is always instructive to read the questions and the answers in this list; the top people
in a number of fields are very active there, and the tolerance for newbies is high. But, read the
Posting guide before posting a question to the list!
Apart from the documentation included with the installation and the mailing lists, there are
a large number of downloadable documents that cover many different fields. A list of Contributed documentation is maintained on the main site for the R project:
http://cran.r-project.org/
Click on the Contributed item in the list on the left. These documents are in general very
good (some are in reality early versions of printed books in PDF format). One of the more useful
ones from my point of view is Notes on the use of R for psychology experiments and questionnaires by Jonathan Baron and Yuelin Li, but there are several others that may be of interest
to psychologists as well. Another nice downloadable document is called R for beginners and
is found at:
http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
The author (Emmanuel Paradis) goes into some aspects of R in a quite detailed manner, but
the exposition is clear. The chapter on graphics is very nice, and so is the chapter on programming
with R. Well worth looking at.
In addition to the manuals included with the installation and downloadable material from
the Internet there are a number of published books as well, which may be useful. Of the latter
type, (Dalgaard, 2002) is a very readable introduction and strongly recommended; the same holds
for (Verzani, 2004). Read the Dalgaard one first; it is very instructive. Books like (Everitt, 2005),
(Crawley, 2005), and (Maindonald & Braun, 2003) are more advanced and intended for more experienced users. The real and more recent bible (more than 950 pages, packed with information)
is (Crawley, 2007), an extremely nice book, quite expensive, but really very nice and very useful. A
reference with a more narrow focus oriented towards resampling as covered in part 6.1 is (Good,
2005). Also very good.
However, books like the ones mentioned above may omit entire fields of research. One example of the latter is what is called Social Network Analysis 2 which really is a rapidly growing
field of interest for (among others) researchers within organizational and social psychology. In
other words, given your particular interests you may have to look closely at the downloadable
packages on CRAN and their documentation and at the same time be aware that variations in the
terminology used in the documentation may not be obvious. Posing polite questions at the user
list for R may be a very good start.
Another source of information that is potentially useful is a Wiki for R:
http://wiki.r-project.org

2.5 Getting help

When using the system, help on any function may be obtained either by writing the name of the
function preceded by a "?", e.g.:
2

I recently attended a conference for the INSNA organization (called Sunbelt conferences). A very large part of
the contributions I saw there used R for the graphics (one third?) and very many had used LaTeX for the generation of
the slides.


> ?lm

Or alternatively:
> help (lm)

The lm () function covers what is also called multiple regression (in general: linear models)
as well as analysis of variance. The commands above result in the opening of a new window
(Figure 2.3) with the documentation for that function, covering all the options and an example.
Note: there is much more information in that particular window than is shown here; you need
to scroll! If you want to look at examples of the usage of a particular function, use the example ()
command. This is especially nice for the graphical functions, where there are many options. Try
example (dotchart), example (plot), example (contour).
In addition to the example () function, there are a number of demo ()s.
See demo (graphics), demo (persp) and
demo (image).
As mentioned above, help on the preinstalled data sets (frames) is handled in
the same manner, e.g.
> ?attitude

To get information on installed packages you use the library () command, e.g.:

> library (help=utils)

where "utils" is one of the many packages installed by default.

[Figure 2.3: Sample Help Window]
There is also a function called RSiteSearch () in the R language itself that can be used to
search for terms on the net. So, if you enter the command RSiteSearch ("repeated measures")
(and are connected to the net) you get a very large number of links which include the term entered
as the argument. There are also a number of options (see the help for the function) that can limit
the search somewhat.
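
Two related standard functions may be mentioned here as a supplement (they are not part of the original example): help.search () searches the documentation of the packages installed on your own machine, and apropos () lists the objects whose names contain a given string:

> help.search("linear model")   # search the local documentation
> apropos("mean")               # list objects with "mean" in the name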

2.6 Fine-tuning the installation

You may very well want to change the installation in several ways, e.g. to install a better editor
than Notepad, or perhaps install one of several possible GUI interfaces for R, maybe add packages
for access to more specialized techniques, etc. See part B.1 on page 107 for details about these
subjects.


Chapter 3

Basic stuff
This chapter covers the bare basics of R, just to give you an initial idea of what
the system can do. The subjects include:
Using R as an overgrown calculator and simple expressions.
Basic functions.
A very basic discussion of objects, a very important aspect of R.

3.1 Simple expressions

Let us start with some really simple-minded stuff without using any data sets. Type the following
at the > prompt:
> 2 + 2

And the answer or response appears on the next line:


[1] 4

The conclusion so far is that the system can add. It can do more. In general, this type of
operation works for almost any type of formula, where the basic operators are +, -, /, *, and ^
(plus, minus, division, multiplication, and raising to a power). All the normal rules for expressions
hold, e.g. multiplication and division are done before addition and subtraction. Parentheses may
be used to control the sequence of the operations.
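
A minimal illustration (not in the original text) of the precedence rules and the effect of parentheses:

> 2 + 3 * 4
[1] 14
> (2 + 3) * 4
[1] 20
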
Other operators are available, like relational and logical operators, plus matrix
operations. Functions like log (), exp (), sin (), cos (), tan (), sqrt () etc. have the normal
meaning, e.g.:
> log10 (2)
[1] 0.30103

The latter is what is called a function, which in this case returns the logarithm base 10 of 2,
again a single value. However, sometimes R prints values that look more impressive than they
really are, e.g.:
> sin (pi)
[1] 1.224606e-16


This result is written in a so-called scientific notation. With a negative exponent (-16), this
result means: Move the decimal point 16 places to the left, in other words, this corresponds to
the value of 0.0000000000000001224606. For all practical purposes this value is equal to zero
(not exactly, but very close). This kind of result is not uncommon when using computers; these
types of results may be hidden in some way or another, but they are there anyhow, behind the
scenes so to speak. R is explicit in this respect.
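
If such tiny floating-point remainders are unwanted in output, they can simply be rounded away; a minimal sketch (not part of the original text):

> round(sin(pi), digits = 10)   # round to 10 decimals; the remainder disappears
[1] 0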

3.2 Vectors

In contrast with single values, sometimes one wants to operate on or with a group of values at the
same time, e.g. all the observations on one variable in a data matrix. Imagine data for a sample of
5 persons, where the values for one of the variables consist of the numbers 1 to 5. Anticipating
the discussion on objects, we can generate such a vector with the c () function and assign a
name to it:
> x <- c (1, 2, 3, 4, 5)

Or the equivalent:
> x <- c (1:5)

Then the contents of that vector could be printed out by simply writing the name of the
vector followed by an Enter:
> x
[1] 1 2 3 4 5

This is a vector, a collection of values of the same type. This is not a very advanced one, but
we can do arithmetical operations on such vectors just as easily as for simple (single) values. We
can for instance print this vector with the constant of 10 added to all the values in the vector:
> x + 10
[1] 11 12 13 14 15

Or print the squares of all the values:


> x ^ 2
[1]  1  4  9 16 25

Logical expressions return one of two possible values, either TRUE or FALSE. An example might
be:

> x < 3
[1]  TRUE  TRUE FALSE FALSE FALSE

Regard the x < 3 part as a statement "x is less than 3". Each of the five values in the vector x is
then examined and compared with the constant 3. The first two values (1 and 2) are smaller than 3,
so the first two values printed are TRUE while the remainder are FALSE.
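
Such logical vectors are useful in themselves; for instance they can be counted or used for selecting elements. A small sketch (not in the original text) using the vector x from above:

> sum(x < 3)    # TRUE counts as 1, so this counts the values below 3
[1] 2
> x[x < 3]      # select the elements for which the condition is TRUE
[1] 1 2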

3.3 Simple univariate plots

However, a very different kind of result is obtained by typing the command (also a function):
> plot (rnorm (1000))

And then press Enter. In this case a new window opens with something like Figure 3.1.
The command generating this plot is really a two-step affair: first a vector (an ordered collection
of values) containing 1000 random numbers is generated, where the values have a normal
distribution with a mean of zero and standard deviation of one (by the call on the rnorm () function).
Then each of the 1000 values in this vector is plotted (the plot () command) with the
value on the vertical axis and the number of the value within the sequence on the horizontal axis.
Since the values tend towards a mean of zero, the points are concentrated along a horizontal zone
in the middle of the plot with fewer points at the upper and lower extremes. If you want to use
this or any other figure in a document or presentation, right-click the window for the figure and a
menu appears with some alternatives for a copy operation.
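
An alternative to the copy operation (a supplementary sketch, not described in the original text) is to send the figure directly to a file using one of the graphics devices, for instance png (); the file name is of course just an example:

> png("random-plot.png", width = 600, height = 400)
> plot(rnorm(1000))
> dev.off()

The file is written to the current working directory when dev.off () closes the device.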

3.4 Functions

The use of log10 (), rnorm () and plot () in the examples above represents the use of functions
in the R language. In some cases (as with log10 ()) the result is returned as a single value; in other
cases the result may be a vector of a specified length (as with the call on rnorm () or the call on
c ()). An example of the results from an even more complex function in this respect is lm (),
which returns all the results from a multiple regression stored in an object (a linear model, see
part 5.4 below on this subject). Some functions in R are magical in the sense that the result
obtained from them depends on what is thrown at them.
The plot () function is typical in this respect. If the argument is a single vector, the result is a
plot like the one in Figure 3.1; if there are two vectors as arguments (or a data frame with two
columns) you will get a standard scatterplot similar to Figure 5.2 on page 35, and if the argument
refers to a data frame with more than two variables (columns), you will get a combined plot
with subplots consisting of scatter plots for all pairs of columns.
[Figure 3.1: 1000 normally distributed random values]

The function summary () is similar, where the obtained output is dependent on the nature of the
object used as the argument to the function.
There are a number of other plotting procedures worth looking into: one is hist (), which is
used for the generation of histograms, another is boxplot (), which generates standard box plots.
All of them have different options for tailoring the plots to what you may want.
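
A minimal sketch (not in the original text) of the two functions just mentioned, applied to the same kind of random data as above:

> y <- rnorm(1000)
> hist(y)       # histogram of the 1000 values
> boxplot(y)    # box plot of the same values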

3.5 Objects

Most of the operations above are not really useful, for the simple reason that the results for the
most part are not kept anywhere and therefore cannot be used at a later stage in the session. In
order to be able to do so, the results have to be stored in some manner. The trick is to assign the
results of an operation to a name:
> a <- 5 + 1



or:

> b <- rnorm (15)

or still another one:


> x <- c (1, 2, 3, 4, 5) + 10

The name on the left of the <- (assignment) is the name of an object which contains the
information and which may be referred to at a later stage.
The contents of an object after assignment may be very simple or very complex. An object
may be regarded as a container for information, which may be of several different kinds
and different complexity. Think of an object as a box in memory where the content includes
information about what the contents are in addition to the data itself. In the examples above, the
content of a is a single number, b contains a vector or array of 15 random numbers, while the
object x contains the integers from 11 to 15.
The results from an analysis are usually stored in objects as well. For instance, the object
returned by the lm () function used for the computation of a linear model, e.g. a multiple regression, contains a large number of objects, all oriented towards that particular model (weights,
intercept, predicted scores, etc.). Some other alternatives in respect to objects are frames or matrices. If you assign an object to a name that already refers to an object, the contents of the original
or first one are lost.
Normally, you can inspect at least part of the contents of the object by entering the name alone.
Therefore at least some of the values in the object named b generated above can be inspected by
writing the name of the object. Hence:
> b
 [1] -1.3953210 -1.3465854  0.1281261 -0.5984524 -0.1481564 -0.9600024
 [7] -0.8005043 -0.7293967  0.8819378  1.4281142  0.4284926 -0.1139720
[13]  0.3936527  0.2116120  0.8039520

Using the function print () has the same effect as using the name of the object alone, e.g.
print (b), but with the added advantage that you have more control, e.g. the number of digits in
the output of some types of objects using that function:
> print (b, digits=3)
 [1] -1.395 -1.347  0.128 -0.598 -0.148 -0.960 -0.801 -0.729  0.882  1.428
[11]  0.428 -0.114  0.394  0.212  0.804

Other useful functions in respect to inspecting the contents of objects are the functions summary
() and str ().

3.5.1 Naming objects

Details on the rules for naming of objects are found in the help page for make.names (). To summarize, a name starts with either a letter or a dot (.); otherwise letters and numbers can be used
together with a dot (.) or underscore (_) as you wish. Some examples are:
X <- 2
n <- 25
a.really.long.number <- 123456789
a_very_small_number <- 0.000000001


For obvious reasons, operators like +, -, / and * are not permitted in an object name, nor are
blanks or quotes permitted (use a period or an underscore instead). In general, the name of an
object contains a pointer or address to where the object is stored in the memory of the machine,
which means that it is a simple matter to change the object that a particular name is assigned to.
Just enter a new command with an assignment to that name. On the other hand, this also means
that it is easy to lose something by assigning a new object to a name that is already a
pointer to something.
Certain one-letter names should be avoided, such as c, q or t, since there are useful functions
with those names (c (): combine values into a vector or list, q (): quit the session, and t ():
transpose a matrix, respectively). Some names are reserved as well, like return, break, if, TRUE,
and FALSE; they are all part of the R programming language. You should also avoid using object
names that are names of functions you use.
There are other conventions in the use of names. Very often n is used for the length of a vector
or the number of cases in a sample, x and y are normally used as symbols for data vectors, the latter
as a symbol for dependent variables, and i and j are often used for indices, i.e. for numbering or
referring to things.
Also note that the case of letters in a name is important (as in all Unix/Linux based systems):
An object called ab is not the same as one called Ab, AB, or aB.

3.5.2 Object contents

Since the actual contents of an object may vary from a simple value to a large number of different
types of information, e.g. the results from a multiple regression (part 5.4.10) or a factor analysis
(part 5.4.4), it may be useful to be able to inspect what elements are included in the object. This is
achieved with the str () function. Use that function with the name of the object you are interested in, and
you get a list of the contents of the object.
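As a hedged sketch, using the built-in attitude data set and a fitted linear model (the object name fit is just an illustration):
> fit <- lm (rating ~ complaints, data=attitude)   # a simple linear model
> str (fit)       # lists the elements stored in the object
> summary (fit)   # a condensed overview of the same object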

3.6 Other information on the session

There are other functions that might be useful, giving information on the session and what is
available. In addition to the ls () function mentioned above, another one is sessionInfo ().
The command data () lists all the data sets or frames included in the session. The command
library () provides a list of all the installed packages that therefore are available for use. See
also the part on Installing packages in part B.4 on page 109 below.
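A minimal sketch of these session commands (all part of base R):
> sessionInfo ()   # R version and loaded packages
> data ()          # data sets available in the installation
> library ()       # installed packages
> ls ()            # objects in the current workspace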

3.7 The Workspace

In addition to the object concept, the workspace concept is very important when using R.
When you start the system by clicking on the .Rdata file in a directory (or a shortcut to a file with
that name), you start the system and at the same time load all the objects stored in that workspace,
i.e. the contents of the .Rdata file. So, if you ended the previous session by clicking on the Yes
button in the final window (see Figure 2.2: Closing R on page 7), all the objects in that workspace
were saved, and you can continue the next session where you left off.
In other words, just as an object can be regarded as a container for information on something
(including where the information comes from), the workspace may in itself be regarded as a container for objects. Every time you start R, you start the system in the state that you saved that
particular workspace in.
This is one of the major differences between R and more conventional systems for statistical
computations. With systems like SPSS you normally start a session with one data set alone and


the system has no memory of what has been done before, nor does it have the tools for maintaining
such a memory of operations. In contrast, with R, a workspace may contain any number of data
sets (frames), results stored in objects, functions, etc., accumulated over several sessions. This is
very nice, but potentially problematic as well. If a workspace contains a lot of objects with obscure
names you may run into problems. So, assigning meaningful names to objects is important. See
part 3.5.1 on page 14 above.
For that reason it is also good practice to keep things apart that should be kept apart, i.e. to
have workspaces in separate catalogs or directories for each project you are working on.
More information on this subject is found in part 7.5 on page 77 below.

3.7.1 Listing all objects in the workspace

In order to get a list of all the names used for objects in a workspace, the command ls () is used.
Alternatively, the command objects () can be used. In the GUI interface, there is an option for this
operation in the Misc menu entry as well.

3.7.2 Deleting objects in the workspace

Specific objects in the workspace can be removed with the rm () command, e.g.:
> rm (height, weight)

All the objects in the workspace can also be removed with the same command combined with
a list:
> rm (list=ls ())

Strictly speaking, you should use rm (list = ls (all.names = TRUE)) instead, since the
simple version of the ls () command does not list variables or objects with names starting with
a period. Normally, this should not be necessary, for the simple reason that names starting with a
period should not be used by normal users in any case. Alternatively (if you are using a Windows
machine), there is a Remove all objects entry in the Misc menu on the GUI interface.
However, this is a dangerous operation, for the simple reason that it is likely that you will
want to keep some objects. So list them before removing anything.

3.8 Directories and the workspace

By default, R reads from and writes files to the active directory, and this is the catalog or directory where the program is started. More precisely, this directory is the one where the .Rdata file
used for starting the program resides. Since it is perfectly OK (and often very useful) to operate
with more than one workspace (i.e. different .Rdata files) it may be convenient to have separate
directories for each project, each with a .Rdata file. This is how to do it:
Look in the File menu and select Change dir., and when you have located the directory
you want to use with that option, save your workspace there with the Save Workspace option.
To simplify later use of that workspace, locate the .Rdata file using the
File Explorer, right-click on the file name and create a shortcut to the file. Drag the shortcut to
the desktop, and give it a suitable name.

3.9 Basic rules for commands

The basic rules for commands are:


Blank lines are ignored.
The parts of a line following a # are also ignored; this is handy for adding comments or
explanatory text (annotations) in scripts, functions and data files.
Strings are enclosed in either double quotes, e.g. "Extroversion score", or single quotes,
e.g. 'Extroversion score'. The two types work the same way, but you have to be consistent:
you cannot start a string with a double quote and close it with a single quote.
Blanks are also ignored within commands, with the exception of blanks within strings. It
is recommended to use blanks to increase readability. However, there is one very specific
situation where blanks outside strings matter. If you write x <- 7, the object x will contain
the value of 7. On the other hand, if you write x < - 7 (with one or more blanks between
the less-than sign and the minus), that is a different thing: it is read as a comparison, asking
whether the value stored in x is less than minus 7.
All object names are case sensitive, so Sqrt is NOT the same as sqrt (the latter is the
correct one if you want to refer to the built-in function for obtaining a square root in the language
R).
All the previously typed commands are kept by the system in a stack. You can walk up and
down this stack by using the up and down arrow keys, edit any command, and send it
off for execution with the enter key again. In other words, any previous command may
be edited and reused.
Normally, there is only one command per line. However, if you want to have more than one
command on a line, separate them with a semicolon ;.
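A hedged sketch of some of these rules in practice (the object names are illustrations only):
> x <- 7                 # assignment: x now contains 7
> x < - 7                # comparison, not assignment: is x less than -7?
> a <- 1; b <- 2; a + b  # several commands on one line, separated by semicolons
> sqrt (2)               # sqrt is the built-in function; Sqrt would not be found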


Chapter 4

Data sets / Frames


The subject of this chapter is a special case of the object concept introduced in the
last part of the previous chapter: a particular type of object, the data frame.
Handling data is a large subject in itself, and therefore only the basic input operations plus a discussion of missing data are covered in this chapter. More details
on this subject are found in the chapter on Project Management.
Apart from the data sets that are included with R when it was installed (e.g. the attitude data
set used above), you will probably want to get hold of some data of your own at a very early stage.
Therefore, learning the basics of how to handle data sets, or frames as they are called in the R
context, has to be one of the first and really very important things to learn.

4.1 Sources of data sets

For any type of analysis, some data are needed. In this context, there are three possible sources.
One may:
Read data sets from file (see part 4.4 on page 22), where there are a number of different
alternatives, both in respect to where the data sets are read from, as well as how the data
are formatted. When read, the data set(s) may be saved in your workspace; if so, they are
immediately available the next time you start R.
The reading of data sets from files is part of the normal use of R, usually performed with
some variant of the read.table () function. You may obtain a list of the objects (which
includes any data sets or frames) in your workspace with the objects () function. Alternatively, you may use the data () function to obtain a list of the data frames installed by
default.
Alternatively, use any of the data sets included when the base version of R is installed, or
added when other packages are added to the installation. These data sets are available at all
times, like the attitude data set used above. Unlike the attitude data set, some data sets
are part of packages. Then you need to use the library () command to get hold of them,
but they are nevertheless available with a minimal effort. Few of these data sets come from
fields related to psychology, but they may nevertheless be useful, at least for demonstrations.
In addition, one may generate data with specific properties in a number of different ways.
This is a very powerful aspect of R, and very useful in many contexts, not the least for


exploring what the different techniques actually do. This approach is used in the discussion
of ANOVA in part 5.4.9 on page 49 as well as a script version of this procedure in part 8.2.3
on page 95.
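As a small illustration of this third approach, here is a hedged sketch that generates a data set with a known group difference (all names and values are made up for the example):
> set.seed (1)                                            # makes the random numbers reproducible
> group <- rep (c ("A", "B"), each=50)                    # two groups of 50 cases
> score <- rnorm (100, mean=ifelse (group == "A", 50, 55), sd=10)
> sim <- data.frame (group=factor (group), score=score)
> summary (sim)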

Or, any combination of these three.1 All three approaches are used in the chapter on data
analysis. The important point to keep in mind is that once data have been read into memory, they
can be saved in the workspace (and you may have more than one workspace, see part 3.7 on page
15 and 7.5 on page 77). In that case the data sets are available at all times once that particular
workspace has been opened.
These data sets are all in the public domain, and are commonly used for demonstrations. One
of the data sets in the list is called attitude as mentioned above and is used in a number of
contexts here. Information about this data set is obtained in the same manner as other types of
help:
> ?attitude

This opens a window with information on the data set. The same holds for any of the other
data sets included in the installation.

4.2 Data frames

A frame (i.e. a data set in the R sense) is an object with rows and columns: one row for each person
or case, and one column for each variable. The first (top) row will usually
contain the names for the variables, and (less often) the first (leftmost) column contains names for
the cases.
The concept of a data frame in R is the closest approximation to the data set as used in
conventional statistical programs. It differs from an R type vector or a matrix in the sense
that the values in columns within the frame may be of different types, but the data within one
column are always of the same type.
In principle, one should distinguish between the representation of the frame (e.g. normally as
a text file) and the frame itself as stored in the workspace. An example of a text type of representation of a frame is seen in the box containing table 4.3.
There are a number of different ways to import a frame (data set). The most basic ones are
to read the data set from a text file or from the clipboard. On the other hand, there is also the
possibility of reading (importing) data from other types of files as well, including SPSS type .SAV
files. For more details beyond what is mentioned here, see the R Data Import/Export manual,
available from the Help menu on the R interface. The most important of the functions are included
in the foreign package (library).
Normally, a row contains all the data from one subject and a column all observations on the
same variable, although this convention is somewhat less strict in R (one exception is a repeated
measures type of analysis). The columns must have unique names, where the name is either
created when the frame is read from file with one of the functions of the read.xxx () type, or
added afterwards with the names () function combined with the c () function.
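A hedged sketch of the second alternative, assigning names after reading a file without a header line (the file name and variable names are just illustrations):
> mydata <- read.table ("small.data", header=FALSE)
> names (mydata) <- c ("hori", "h1", "h2", "v1", "v2", "path", "gender")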
1 Actually, there is at least one other category: very general statistical techniques based on using real data, but with
variations on where each observation belongs. One class within this category is called randomization or permutation tests (Edgington, 1995), where some of the values in an analysis are systematically shuffled; another is bootstrapping (Efron & Tibshirani, 1993), where each observation is kept in place but where the same results are computed
for a large number of subsamples. A third variant is jackknifing (Efron, 1979; Efron & Tibshirani, 1993), which
explores the effect of systematic exclusion of a subset of the cases in an analysis. A common label for these techniques
is resampling or computer intensive. Relevant R packages are boot and bootstrap. See part 6.1 on page 60.

4.3 Entering data

My preferred tool for manual data entry when the data sets are really small is a spreadsheet (Excel2 , Gnumeric, or Calc from OpenOffice). Write the variable names as the first row in the sheet,
and corresponding values for each case in the rows below.
However, if the project involves anything more than a (very) small number of variables and cases, it is
strongly recommended to use a dedicated tool for data entry. A recommended strategy is to maintain a complete data set in one file, and to split the data set into
several subsets of variables for each type of analysis as needed. See part A on Data Transfer on page 99.

Table 4.1: Data stored in a text file

#This a test data set
hori h1 h2 v1 v2 path gender
148 29 29 8 5 7.0 male
115 19 27 3 5 12.5 male
107 15 27 4 6 1.5 female
134 21 29 4 6 2.0 male
.......
105 24 20 1 3 1.0 female
96 21 22 3 4 25.0 female
129 22 28 2 4 8.0 male
85 18 20 4 4 18.0 male

The basic rules are:

Avoid having blanks (spaces) in the variable names or in the variable values. If you need more
than one word, separate the words with periods or underscores, e.g. Neo.Score or Tom_Johnsen.

All the values of the same variable must go in the same column.

And of course:
All the values from the same person must (normally) be in the same row. There is one
important exception, when you have a repeated measures type of design. This is where R
differs from programs like SPSS and Statistica. The structure of the data file used with this
type of design is covered in section 7.1.6 on page 72.
The nice thing with spreadsheets is that you can have several related data sets (perhaps from
the same project) in different sheets in the same file, and that it is easy to transfer the data in each
sheet to R, either using the clipboard (which also works from other sources as well, see below), or
via a plain text file.
In the latter case you may have to export the spreadsheet as a text file. With the relevant
sheet active, select File -> Save as. In the next window select the .CSV option in the Save as
type dropdown list and change the name of the file if that is needed. Alternatively, use the tab
delimited option when saving the file.
The generated file will have the contents of the spreadsheet stored as text, where all the values
on one row are separated with semicolons ;.
However, and dependent on the setup of your system (imposed by the so-called locale), 3
the values in the file may be separated with commas. Also, some locales use commas as decimal
2 Using Excel may be problematic if the data set has more than 256 variables (the upper limit for the number of
columns in that program, at least for the version I am using at the moment). In that case the solution is either to
use another spreadsheet program (there are several free ones available; a very popular one is called Gnumeric:
http://www.gnome.org/projects/gnumeric/) or the Calc program in OpenOffice, and yes, both are able to read and
write Excel files.
3 Some locales (including the Norwegian one which most of my students are subjected to) use commas as decimal
separators in text files. This is in my opinion against nature for researchers and should be changed. Permanently.
Open the Control panel for Windows and find the icon for Regional and Language options. With the Regional option,
click on the Customize button. My advice is to set the decimal separator to a period and the list separator to a
semicolon.


separators, which may cause problems. Therefore it may be necessary to either change some options in your setup of Windows or to edit your file with NotePad or a similar editing program
oriented towards plain text. This does NOT include the files generated by programs like MS
Word; for most versions of MS Word the .doc file format contains a lot of junk that has nothing to do with the text content of the file (and is probably irrelevant in respect to anything
else as well). Also note that files with the extension .CSV will by default be opened by the
spreadsheet program, which brings you (partially) back to where you started. It may therefore be
an advantage to change the file extension into something that does not invoke anything else than
an editor.

4.3.1 Variable types in frames

One of the differences between R vectors and matrices on the one hand and data frames on the
other, is that the columns in a frame do not have to be of the same type. The basic types are:
Numeric: a value represented as a number. Anything represented by numbers, whether nominal,
ordinal, or higher measurements.
A string, e.g. a name.
A category or factor type variable, normally used as an independent variable in multiple regression, in ANOVA designs, or as category names in frequency tables. Factors are
assumed to have a limited set of possible values, either represented as numbers or text, e.g.
codes like 0 and 1 or words like Male and Female for gender. In general, this type
may be simply categorical or ordered (ordinal).
However, all the observations within a column are assumed to be of the same type.
In respect to the factor type, R is normally quite tolerant in respect to what kind of values are
accepted as factors in an analysis. So a variable or data column in a frame containing a series of
1s and 2s (or 20s and 30s for that matter) works just as well as a variable or column containing
strings like YES or NO. However, in some contexts you have to be explicit about telling the
function that the vector is a factor, especially when the values are not a simple dichotomy. This is
especially important in respect to multiple regression and ANOVA types of analysis.
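A hedged sketch of being explicit about a factor (the frame and column names are illustrations only):
> mydata$group <- factor (mydata$group)                 # numeric codes treated as categories
> mydata$grade <- factor (mydata$grade, ordered=TRUE)   # an ordered (ordinal) factor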

4.4 Reading data frames

Given that there are a number of different ways of storing potential frames in memory or files,
R permits a large number of options in respect to importing frames, including reading data from
foreign formats like SPSS or Excel. Only two are covered here, reading text files in two variants,
plus reading data from the clipboard.

4.4.1 Reading frames from text files

There are several methods of creating frames for data analysis. One of the most common methods
is a variation on the read.table () command for reading data from simple text files, for instance:
> mydata <- read.table ("small.data", header=TRUE)


Where it is assumed that the data are stored in a simple type of file called small.data,
looking something like the one in Table 4.3. With this command it is assumed that the elements (variable names in the first row (line) and data values in the successive rows or lines)
are separated with blanks (spaces) or tabs. If they are separated by something else, e.g.
semicolons (which could be the case if the data set was saved from a spreadsheet program as
a CSV type file), you have to use:

> mydata <- read.table ("small.data", header=TRUE, sep=";")

Note: If you have a very large number of variables in your data set, some editors may
generate problems for the read.table () operation due to line wrapping. It is therefore advisable to carefully check the data matrix after it has been read by R with the edit () or the fix () commands. In particular,
check that the number of rows in the data set is correct, and have a look at the values for
the last (rightmost) variables. That is in any case wise to do whenever a data set is transferred from one format to another. In particular, check that the missing observations
have come across correctly.

The first line in table 4.3 is a comment. The
next line (above the data matrix itself) contains the headers, i.e. the variable or column names. It is strongly recommended to include this information. However, if the variable names are omitted in the (text) data file, drop the header=TRUE part of the command above;
the variables are then assigned the names V1, V2, etc. by default. All values on one line in a
file of this type are regarded as belonging to the same case or individual, where the values are
normally separated by blanks or tabs (other separators are possible, see the example above). Each
line or row in the data matrix is terminated by a carriage return (generated by pressing the Enter
key).
The most important option used for reading data frames is the header=TRUE option, and by
default, missing observations in the data set are identified with the string NA. See part 4.6 below
for details about missing information.
There are a number of other options that may be useful for this function. One is to specify the
number of text lines to skip at the beginning of the file in case there is text there. Have a
look at the help for read.table () for details.
Finally, it should be mentioned that the first argument may very well be a URL where a data
set may be found on or downloaded from the Internet. This works perfectly well if you are logged
on to the net in some manner.

4.4.2 Selecting the data file in a window

Another handy variant of the read.table () command is:


> mydata <- read.table (file.choose(), header=TRUE, sep=";")

Which opens a window where you select the file to be read.

4.4.3 Reading data from the clipboard

The absolutely simplest way to get data into the system is to read the data set from the clipboard.
For instance, imagine that someone sent you an email with a small data set. The text of the email
contains the lines found in table 4.2:
If you want to use this data set in R, mark the part of the text with the data in the document,
and copy it to the clipboard (e.g. using Ctrl-C). Then activate R, and enter a command like:
> z <- read.table ("clipboard", header=TRUE)


The data set is then in the workspace with the name "z". Very simple, very efficient, and works from almost any
source: a file in plain text, a page on the web, a spreadsheet, any statistical program I can think of. If you want to save the
data, you of course have to at least save the workspace when you exit the system. Alternatively, and more permanently, save
the data set to file with a write.table () command as described in part 8.2.1 on page 91 below.

Table 4.2: Clipboard data

# Data on clipboard
A B C
1 1 29
1 2 16
1 3 55
2 1 198
2 2 107
2 3 181

Missing observations are signalled by the text NA as the value. As with any import of data to R it is important to check that these values are coded correctly.

4.4.4 Reading data from the net

If the data you are interested in are available from some location on the net as a text file, you can
replace the name of the file in the examples above with the URL for the file, e.g.:
> z <- read.table ("http://www.galton.uib.no/psyk353/candlo.dat", header=TRUE)

This particular dataset is one I use for teaching R. In any case, this alternative may be handy
for sharing data between researchers provided you have access to a server.

4.5 Inspecting and editing data

Quite often you will want to ensure that you actually have the correct data, that the number of
cases or variables is correct, check that the values are plausible, etc. One simple (and first) test is
to print the variable names, which you get when you use the names () function. Another trick
is to print the complete object, which may be large. Still another way to inspect the contents of an
object is to use the edit () command. However, remember that this function returns the entire
dataset, suitable for assignment of the (possibly) modified dataset to an object name. For direct
modification of a dataset, use the fix () function.
Alternatively one may use the head () or the tail () command, which prints the first
or last part of the data frame named in the argument.
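A hedged sketch of a quick first check on a freshly read frame (assuming a frame called mydata):
> names (mydata)   # variable names
> dim (mydata)     # number of rows and columns
> head (mydata)    # the first few rows
> tail (mydata)    # the last few rows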

4.6 Missing data

It is a fact of life that some observations may be missing in a data set. Laboratory animals may
die, people may leave out questions in a survey, recording instruments may fail, etc.. There may
be logical reasons for observations to be missing as well, e.g. some questions may be meaningless
for some subgroups in the sample. In any case the effect is that there may be holes in a data set
and that problem must be handled in some way or another.
Whatever you do with missing data, it will have both theoretical and technical implications.
With conventional packages, handling of missing observations is largely automatic. That is convenient, but also problematic, for the simple reason that the problem is swept under the carpet
in some sense. Typically, when using R, one has to be more explicit.
In any case, when we have missing data,4 there are in principle two very different strategies
that may be used. We can either:

4 Actually, there is more than one type of missing data that may occur when doing analysis on a data set. The one
discussed here (NA) is part of the data set itself; another is NaN (Not a Number), which is the result of a meaningless
or impossible value in respect to computations. Examples are the result of a division of zero by zero, or the square root
of a negative number. In addition, division of a non-zero value by zero in R generates still another value, Inf (infinity).


Fill in or plug all or a subset of the holes in the data set with more or less plausible
values. This is called imputing, and there are many alternative strategies for doing this. R is
very strong in this respect.
Statistically speaking, this strategy is problematic if the values to be imputed are not missing completely at random (often abbreviated to MCAR).
Alternatively we may:
Filter or ignore parts of the data set when doing the analysis, or operate with subsets of the
rows in data set.
There are two main variants of the latter strategy, listwise (called complete in an R context)
and pairwise deletion of cases:
With listwise deletion one essentially says: if, for any case (row) in the data set, any of
the variables involved in a particular analysis is missing, the entire case is dropped from the
analysis.
Pairwise deletion is somewhat less restrictive; it essentially says that as long as there
are values available for computation, they are used as far as possible.
Once you start thinking about them, both these alternatives are problematic in one way or
another. For instance, when using pairwise deletion when computing a correlation matrix, each
correlation in the matrix may be based on different subsamples of the cases. This may easily have
an influence on the possible conclusions to be drawn from the results. In the extreme case one
may even run into mathematical problems when applying some types of analysis to this type of
correlation or covariance matrix5 . On the other hand, a strict regime of listwise deletion may have
the effect that the computations are superficially OK, but that they are based on a very odd subset
of the cases in the data set. You may then have problems comparing two different models for the
simple reason that they are based on different subsets of the cases in the data set.
In any case there are two different problems at the technical level:
You have to tell the system which observations in a data set are to be regarded as missing,
and
When doing a particular analysis, you normally have to tell the system what to do with the
cases or rows containing missing values.

5 Some types of analysis (e.g. multiple regression) use a mathematical procedure called inversion as part of the
computations. With a data matrix containing missing data one may in the extreme case obtain a covariance or correlation matrix which is not positive semidefinite or is ill-conditioned, which again means that a solution cannot be
found. Three possible reasons may be that (a) when using a pairwise strategy, the samples used for each correlation
or covariance are VERY different, (b) the final n after a listwise deletion is lower than the number of predictors
or independent variables, or (c) there are linear dependencies between the dependent variables. The effect in any
case is that you will not get the requested results.

4.6.1 Identification of missing values


When reading a data matrix in text format, one needs to be able to indicate that a particular
observation is missing in one sense or another, i.e. not available for statistical computations.
In a text file to be read by R, the missing values are signalled by the text NA. So, if there were missing
data in a text file, it might look like the data set in table 4.3. In other words, where an observation is
missing in the data set, the string NA is inserted. NA is a symbol, meaning Not Available.

Table 4.3: Missing observations in frame

#This a test data set
hori h1 h2 v1 v2 path gender
148 29 29 8 5 7.0 male
115 19 NA 3 5 12.5 male
107 15 27 4 6 1.5 female
134 21 29 4 6 2.0 male
.......
105 24 20 1 3 1.0 female
96 21 22 3 4 25.0 female
NA 22 28 2 4 8.0 male
85 18 20 4 4 18.0 male

4.6.2 What to do with missing data

If one has a data set with missing data like the one in table 4.3, the available functions in R behave somewhat
differently. In general one has to have a look at the help for a function before using it. For
instance:
> mydata <- read.table ("mydata.data", header=TRUE)
> mean (mydata)

The function mean () will by default return the value NA for all the columns in the frame where
one or more observations are missing, which simply means that the value cannot be computed.
The mean is reported for the remainder, i.e. the columns where there are no missing observations.
To get useful results for all the variables, you have to explicitly tell the function to handle the
missing values in a sensible manner. This is done by adding an option to the call on the function:
> mean (mydata, na.rm=TRUE), where the na.rm part says remove the NAs. The same holds
for other functions like sum (), sd () etc.. On the other hand, some functions like plot () handle
the problem differently, at least in some variants where missing values are simply ignored. When
computing single correlations or a correlation matrix with the cor () function, the simplest call
on the function would be:
> cor (mydata)

Which prints a correlation matrix for all the columns in mydata. However, if there are any
missing data in the data set at all, you will not get useful results for the columns involved (the
affected correlations are reported as NA). You have to
add an option telling the function how to handle these values:
> cor (mydata, use="pairwise")

If you want a listwise type of results from this procedure, replace the pairwise option
with complete.
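For reference, a minimal sketch of the two strategies using the full option names (both are documented values of the use argument to cor ()):
> cor (mydata, use="pairwise.complete.obs")   # pairwise deletion
> cor (mydata, use="complete.obs")            # listwise (complete cases only)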

4.6.3 Advanced handling of missing data

The default procedure for handling missing data in standard commercial statistical packages
like SPSS and Statistica is to ignore single values or whole cases in the computations when missing
observations are encountered, almost without warning. This is in many cases very problematic and
may seriously distort the results.
Attempting to avoid this problem by replacement of the missing values with variable means
will very often seriously distort the variable distributions (variance etc.), confidence intervals, and
relationships among variables (e.g. the correlations and covariances).


Usually this solution does more harm than good and is not to be recommended. See Schafer
and Graham (2002) for a discussion of this subject. When using R, you have access to a large number
of functions oriented towards handling missing data in various ways, e.g. when writing functions
(na.fail (), na.omit (), etc.).
In addition, there are a number of options to use when one wants to replace missing observations with more or less plausible ones (technically called imputing), found in several places,
among them the package Hmisc, as well as functions for creating useful plots (see naclus () and
naplot () in Hmisc). These are state of the art methods, much more advanced than the ones normally found in the commercial packages. On the other hand, you really have to understand what
you are doing when using them, so if you really need something like those functions, proceed
with caution.

4.7 Attaching data sets

To use the attitude data set in a convenient way, the attach () command may be used:
> attach (attitude)

Note that this command is not really necessary, but convenient. When a data set or frame is
attached, the names (headers) of the variables or columns in the data set are included in the
search path for R. This means that the data in the named columns in the attached data set are
available as single objects (vectors) as well. Since two of the variables in the data set are called
learning and raises one may for instance refer to these columns or variables directly after an
attach () command, for instance to generate a scatterplot:
> plot (learning, raises)

If the attach command above was not used, one would have to use a somewhat more cumbersome notation to refer to a single variable within the data set or frame such as:
> plot (attitude$learning, attitude$raises)

Alternatively, one could use still another notation with exactly the same effect:

> plot (attitude[,4], attitude[,5])

Where the reference in this case is to column numbers 4 and 5 of the frame attitude, which
are the variables learning and raises. To verify this, use the command names (attitude),
which yields a list of the variable names in the frame.6 Note also that the attach () option is
also useful with other data or frames, not only the built-in data sets.

4.8 Detaching data sets

There is a possible trap when you have attached more than one data set or frame at the same time.
If there is some kind of overlap between data sets (frames) in respect to variable names, referring
to them may be ambiguous. You will get a message in that case, referring to the columns that are
masked. In any case it is wise to detach the data set as soon as you are finished with it, e.g.:
6 R has an extremely rich and flexible set of techniques for defining subsets of variables (columns) and cases (rows)
in data frames. See part 7 for some examples.


> detach (attitude)

In any case, if the attach () function is used at all, remember to use detach () sooner rather
than later.

4.9 Use of the with () function

Experienced R users tend to avoid the use of the attach () / detach () pair of commands, for
the simple reason that it is too easy to run into problems when variables in an attached data set
are masked by objects with the same names already in the workspace. In order to be sure that one
actually is using the variable one intends to use, the alternative may be the with () function:
> with (attitude, plot (learning, raises))

This is a kind of very local attach, which eliminates all kinds of masking and several other
potential problems as well.
Here the first argument is the name of the frame or data set to which the variables belong,
and the second is the command or function one wants to use. Regard this variant as a very local
variant of the attach () command without the drawbacks. Recommended.

Chapter 5

Data analysis
This chapter starts with a discussion of various alternatives in respect to the
classification of techniques for data analysis, and continues to cover:
Basic univariate and bivariate techniques
Factor analysis
Item analysis
Two-way factorial analysis of variance
To put it mildly, what is included in this part is a very small subset of what is
possible using R. There are myriads of other techniques.
As a start, it may be an advantage to classify the techniques discussed in this part in order to
give an overview of the field. There are several possible approaches.

5.1 Classification of techniques

One possible classification of the techniques used for data analysis is to classify the techniques as
to whether they are oriented towards the relationships within one set of variables, or between
two sets of variables. This approach is to some extent based on the mathematical operations
used in the various types of analysis.
Within one set of variables. In this category, the simplest possible situation is when the
variable set consists of one single variable. In that case only simple distribution statistics
(e.g. central values, estimates of variation and corresponding plots) are possible. If the set
of variables is expanded to two variables, the set of possible techniques is expanded somewhat,
including correlations and related techniques.
If there are more than two variables, variants of factor analysis may be the right thing
to use. A somewhat related set of techniques is represented by variants of clustering techniques.
Between two sets of variables: The central technique here is multiple regression (often abbreviated to MR), where one operates with one dependent variable or response variable, and
one or more independent variables.


Formally speaking, t-tests and analysis of variance (ANOVA) are a special case of MR; they
belong to the same category as well. The same holds for discriminant analysis.

A related, but alternative categorization1 is less based on the mathematical operations and
more upon which techniques researchers in different fields are actually using.
The field researcher's view: Cross (frequency) tables, linear regression, logistic regression.
The experimentalist's view: Various types of ANOVA, discriminant analysis in controlled
experiments.
The psychometrician's view: For a large part oriented towards test construction, where reliability and factor (component) analysis are important tools. Structural equation modeling
may be used as well.
The structuralist's view: Includes variants of factor analysis, correspondence analysis, structural equation modeling (SEM).
These views may be categorized in several different ways, but the most important is that the
first two are oriented towards the examination of the relations between two different sets of variables with different status as mentioned above. Typical for these two views is that one often employs data analysis techniques that are variants of multiple regression (linear models), i.e. examines the relation between what is called dependent and independent variables, also called response
and criterion variables.
Of the first two perspectives, the second typically employs variants of analysis of variance or
ANOVA. The important point to keep in mind is that all ANOVA techniques are really variants of
multiple regression (Cohen, 1968). However, even if the actual techniques used in these two views
are formally equivalent, the context of their use is clearly different, where the first is normally used
in a general class of studies where a survey is a typical example, while the second is normally
characterized by a controlled experiment.
In contrast, the last two views in general have their focus on the relations within one set of
variables, where the variables normally belong to the same domain, and essentially are of the
same type. For instance, typical techniques employed in these views are variants of a general
class of data analysis called factor analysis (of which there are many variants). In practice, the
differentiation between the last two views is somewhat blurred compared with the first two.
From a data analysis point of view (as distinct from a design point of view) the distinction
between these views is to a large extent based on history and has become less clear given developments since the advent of the use of computers in research.
A third possible categorization is to place techniques in one of three groups:
Univariate techniques, essentially computing distributions like means, standard deviations,
quartiles for one variable at a time, as well as plots of single variables like histograms and
piecharts.
Bivariate techniques. These are oriented towards the expression of the relation between two
variables at a time, where simple correlations including simple regression is one example,
the computation of a t-test for the difference between two groups is another, as well as a number of different types of plots describing the relations between the two variables. A normal
scatterplot is only one possible alternative in respect to graphics.
1 The views approach has been suggested by Hans-Rüdiger Pfister from the University of Lüneburg.


Multivariate techniques. In principle, this covers anything involving more than two variables at a time, which covers almost anything else. Component analysis, Factor analysis,
Multiple regression, any ANOVA design with more than two groups or repeated measures,
as well as SEM as normally used.
The last classification is the one that will be used for the remainder of the chapter.

5.2 Univariate statistics

This section is concerned with the description of one set of numbers in a data set at a time, either
stored as one of the columns in a frame, or as a set of values stored in a vector. In data analysis
using R, data can be considered to be one of three different types: categorical, discrete numeric,
and continuous numeric. The methods for summarizing and viewing each type vary. The simplest
is to summarize the values as a set of numerical values:

5.2.1 Simple counts

Imagine a small survey, where the answers to one of the questions was recorded as a simple yes
or no. For eight of the respondents the data could be:
> res <- c ("Yes", "Yes", "No", "No", "No", "Yes", "No", "No")

A simple frequency count of these data is obtained with the table () function:
> table (res)
res
 No Yes
  5   3

It would of course be faster to do a simple count like this with few values without a computer,
but the example is a good illustration nevertheless. The function table () works just as well with
numbers. If we replace the YESes and NOs in the observations above with 1 and 0 respectively,
we get the same counts, but with different labels:
> res <- c (1, 1, 0, 0, 0, 1, 0, 0)
> table (res)
res
0 1
5 3

5.2.2 Continuous measurements

What the function mean () returns depends on the type of the object used as the first argument.
If it is a single vector or matrix, the returned value is the mean of all the values. If the argument
is a frame, a vector containing the mean for each of the data columns is returned. It can be used
on any type of numeric variable, and may in certain cases yield something that is meaningful for
categorical data as with the last version of res in the previous example:
> mean (res)
[1] 0.375

Since the last version of the res object contained only zeros and ones, the mean in this case
is the proportion of the count of the 1s to the total number of observations, i.e. 3 divided by 8.
Since there are 7 variables in the attitude data set or frame, mean (attitude) returns a vector
with the seven means:


> mean (attitude)
    rating complaints privileges   learning     raises   critical    advance
  64.63333   66.60000   53.13333   56.36667   64.63333   74.76667   42.93333

while the sd () command yields the standard deviations:

> sd (attitude)
    rating complaints privileges   learning     raises   critical    advance
 12.172562  13.314757  12.235430  11.737013  10.397226   9.894908  10.288706

Technically, this is an estimate of the population standard deviation from the sample and is
based on the sum-of-squares divided by n minus 1. The variances, i.e. the squared values of
the sds, are obtained by:

> sd (attitude) ^ 2
    rating complaints privileges   learning     raises   critical    advance
  148.1713   177.2828   149.7057   137.7575   108.1023    97.9092   105.8575

Note: If the object contains a single vector or column, you get the same thing with the var ()
function. However, if you have a frame with more than one column, the var () command yields a
covariance matrix, something completely different (apart from the diagonal values). More useful
details are obtained with the summary () command:
> summary (attitude)
     rating        complaints      privileges       learning
 Min.   :40.00   Min.   :37.0    Min.   :30.00   Min.   :34.00
 1st Qu.:58.75   1st Qu.:58.5    1st Qu.:45.00   1st Qu.:47.00
 Median :65.50   Median :65.0    Median :51.50   Median :56.50
 Mean   :64.63   Mean   :66.6    Mean   :53.13   Mean   :56.37
 3rd Qu.:71.75   3rd Qu.:77.0    3rd Qu.:62.50   3rd Qu.:66.75
 Max.   :85.00   Max.   :90.0    Max.   :83.00   Max.   :75.00
     raises         critical        advance
 Min.   :43.00   Min.   :49.00   Min.   :25.00
 1st Qu.:58.25   1st Qu.:69.25   1st Qu.:35.00
 Median :63.50   Median :77.50   Median :41.00
 Mean   :64.63   Mean   :74.77   Mean   :42.93
 3rd Qu.:71.00   3rd Qu.:80.00   3rd Qu.:47.75
 Max.   :88.00   Max.   :92.00   Max.   :72.00

Other useful functions in this category include max (), which returns the maximum value in
the vector or frame named in the argument, the corresponding function min (), as well as sum (),
median (), and a number of other functions.
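A hedged sketch of these, applied to a single column of the built-in attitude frame:
> max (attitude$rating)      # largest rating
> min (attitude$rating)      # smallest rating
> sum (attitude$rating)      # sum of all ratings
> median (attitude$rating)   # median rating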

5.2.3 Computing SS

In some cases it is convenient to use what is called an SS value, i.e. the sum of the squared
deviations from the mean or sum-of-squares. A command to obtain that value is:
> q <- c (1, 2, 3, 4, 5, NA)
> SSq <- sum (scale (q, scale=FALSE) ^ 2, na.rm=TRUE)
> SSq
[1] 10


The first line just places the 6 values (one of them missing) into q, a vector. This is a simple
replacement for reading a set of data from file. The values are deliberately simple: the mean is 3,
the deviations from the mean are then (-2, -1, 0, 1, 2), while the corresponding squared values are
(4, 1, 0, 1, 4). Hence, the sum of the values is 10.
The function scale () is very useful. As it is used here it removes the mean, but does nothing
with the variation, since the option scale=FALSE is defined. After that, all the values are squared.
The sum () adds the squared values without any protests about missing values (since the na.rm=TRUE
option was included). Finally the sum is stored in SSq and printed. Note that the scale ()
function handles missing observations without having to be told to do so, while you have to
be explicit about that when using sum (). Since there is more than one way of doing most things
in R, here is an alternative:
> q <- c (1, 2, 3, 4, 5, NA)
> SSq <- sum ( (q - mean(q, na.rm=TRUE)) ^ 2, na.rm=TRUE)
> SSq
[1] 10

The only difference is that the second one avoids the use of scale (). A third alternative is:
> SSq <- (var (q, na.rm=TRUE)) * (sum (table (q)) - 1)
> SSq
[1] 10

Which is based on the function var () (variance). The sum (table (q)) part is a simple
way to get the number of valid (non-missing) values in the vector. When computing the SS
value using the var () function, remember that this function returns the SS divided by N - 1.
That is why the last version multiplies the value computed by var () by (sum (table (q)) - 1),
i.e. the number of valid observations minus one.

5.2.4 Plots

R is extremely flexible in respect to plots. An example is Figure 5.1 on page 33. This plot is
generated with the hist () function, combined with the lines () function, using the variable
waiting in the data set called faithful (the waiting times between eruptions of Old Faithful,
a famous geyser in a national park in the US, available in R in the same way as the attitude
frame). This data set is famous for showing a natural bimodality. The function hist () generates
the histogram itself, while the lines () function adds the density curve to the plot.

Figure 5.1: Histogram with density estimate
The commands used to generate this plot are:
> attach (faithful)
> hist (waiting, breaks="scott", prob=TRUE, main="", ylab="")
> lines (density(waiting))
> detach (faithful)


The breaks argument in the call on the hist () function specifies the number of bins (or classes)
for the histogram, either by number or by the algorithm to use when computing the
number of bins. This set of commands is also an example of how several commands are combined
number of bins. This set of commands is also an example of how several commands are combined
to generate plots. In this case the plot itself is generated with the hist () command, and the
density curve is added with the lines () command.
In addition it is relatively simple to construct your own plot with exactly the properties you
want from a set of basic commands.

5.3 Bivariate techniques

This category of analysis techniques include among many others correlations, scatter-plots, and
t-tests for independent samples as well as two-way frequency tables.

5.3.1 Correlations

The function cor () by default computes the correlation(s) of the product-moment type between two or more variables.
If the call on the function includes only two variables you get a single correlation of the
product-moment type, e.g.:
> attach (attitude)
> cor(rating, complaints)
[1] 0.8254176

If there are more than two variables in the call on cor (), you do not get a single correlation,
but a correlation matrix. There are also other variants of correlations, look at the method option
for the cor () function. If you want a statistical test of a single correlation you can use the cor.test
() function, as with this pair of variables from the attitude data set:
> cor.test (rating, privileges)
Pearson's product-moment correlation
data:  rating and privileges
t = 2.4924, df = 28, p-value = 0.01888
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.07778967 0.68172921
sample estimates:
      cor
0.4261169

Note that the confidence interval is quite large for the simple reason that the sample is rather
small. In any case, have a look at the help page for the function.
If there are more variables (or a complete data frame) in the call, you get an object containing a complete correlation
matrix. So, if you apply the command:
> print (cor(attitude), digits=3)

You get the correlation matrix for the relations between the variables (columns) in the attitude
frame printed with a reasonable number of digits:

           rating complaints privileges learning raises critical advance
rating      1.000      0.825      0.426    0.624  0.590    0.156   0.155
complaints  0.825      1.000      0.558    0.597  0.669    0.188   0.225
privileges  0.426      0.558      1.000    0.493  0.445    0.147   0.343
learning    0.624      0.597      0.493    1.000  0.640    0.116   0.532
raises      0.590      0.669      0.445    0.640  1.000    0.377   0.574
critical    0.156      0.188      0.147    0.116  0.377    1.000   0.283
advance     0.155      0.225      0.343    0.532  0.574    0.283   1.000

The call on cor () also permits the computation of rank correlations by specifying the method
argument. Type ?cor for an explanation of the options.
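As a hedged sketch of that option (both method values are documented for cor ()):
> cor (attitude$rating, attitude$complaints, method="spearman")   # rank correlation
> cor (attitude$rating, attitude$complaints, method="kendall")    # Kendall's tau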

5.3.2 Scatterplots

Scatter-plots are simple to obtain as well: use the plot () function with the two variables (frame
columns) you want to plot.

Figure 5.2: Scatterplot with regression line
As with the cor () function, if there are only
two variables in the call on plot () function
you get a simple scatter-plot if both variables are
continuous or numerical, and if there are more
than two variables you get a combined plot with
one scatter-plot for each variable pair. Lines may
be added to the plot, e.g. the regression line(s) by
the use of the abline () function.
This is illustrated by the plot in figure 5.2
which contains four elements (a) the scatterplot
itself between two variables from the attitude
data frame (rating and learning), (b) one horizontal line for the mean of the y variable, (c) one
vertical line for the mean of the x variable, and
finally (d) the regression line for the prediction of y from x. This is achieved with the following
commands:
> attach (attitude)
> plot (rating, learning)        # Scatterplot
> abline (v=mean(rating))        # Place the mean for the x variable
> abline (h=mean(learning))      # Corresponding for y
> abline (lm(learning~rating))   # Then the regression line
> detach (attitude)

As you see, plots like this are built up from basic elements, one element after the other.

5.3.3 t-test, independent means

Many users of statistical software are not aware of the fact that simple and multiple correlations
on the one hand and the t-test and F-test on the other are closely related to each other. So, apart from
showing how to compute a t-test for independent means, another purpose of this section is to
demonstrate some aspects of this relationship. When using this type of test, one typically has
two independent groups of subjects which are different in some way, e.g. in respect to attributes
like sex, or how they have been treated in an experiment.


The data used to demonstrate this type of analysis is a famous one, originally published by
Student (a pseudonym for William S. Gosset (Student, 1908)), which is why the test is also called
Student's t-test. This is one of the built-in data sets in R, and is called sleep.

A look at the documentation for this data set (use ?sleep) and the contents of the frame
itself shows that it contains two equally large groups of subjects with a total n of 20. So this data
set fits the independent groups type of design. For this demonstration, I'll use the t.test ()
function:
> t.test(extra ~ group, data = sleep, var.equal=T)

Which yields the following output:


        Two Sample t-test

data:  extra by group
t = -1.8608, df = 18, p-value = 0.07919
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.3638740  0.2038740
sample estimates:
mean in group 1 mean in group 2 
           0.75            2.33 

From these results we see that there is not a significant difference between the two groups.
Note that this function has a data= argument, which eliminates the need for an attach command.
In this data set there are two variables, one dependent called extra, and one independent
called group. What happens if we correlate the two variables and then test that correlation?
The following input line does the trick (after attaching the sleep data frame):
> cor.test(extra, group)

Where the cor.test () does the test of significance. This gives the following output:
        Pearson's product-moment correlation

data:  extra and group
t = 1.8608, df = 18, p-value = 0.07919
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.04969034  0.71678001
sample estimates:
      cor 
0.4016626 

If we compare this output with the output from the t-test above, we see that the values for
the t are the same as in the previous test (apart from the sign), and so are the degrees of freedom and
the p value. The confidence intervals are different, only because the one is expressed in respect to
the difference between the means and the other in respect to the correlation. You could generate
an instructive scatterplot by using plot.default (group, extra) (using plot () by default gives
a box and whisker type plot for the two groups when one of the variables is a factor, while
plot.default () forces the plot to be a scatterplot), combined with abline(lm(extra ~ group))
to generate the regression line.
You will then see that the regression line goes through the respective means for the two groups.
Why are the results so similar? The answer is simply that the same model is the basis for both.
It is simple to recompute one from the other. The formulas are:

r = \frac{t}{\sqrt{t^2 + df}}    (5.1)

And in the other direction:

t = \frac{r \sqrt{df}}{\sqrt{1 - r^2}}    (5.2)
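As a quick check of formula (5.1), the t and df reported by the two tests above can be plugged in
directly (the object name tval is only used for the illustration):

> tval <- 1.8608
> df <- 18
> tval / sqrt(tval^2 + df)    # approximately 0.40, the correlation reported above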
Warning: This only holds when there are only TWO groups, i.e. when the between degrees of
freedom is equal to one. If that is the case, there are other ways of obtaining the same results as
from t.test () and cor.test () as well. One is to compute a multiple regression using the lm
() function, which is essentially the same as computing and testing the correlation between the
dependent and the independent variable. If we attach the sleep frame and enter the command
summary (lm (extra ~ group)) we get the multiple R² equal to 0.1613, which is the same as
the square of the correlation above (0.4016626). The F value for the multiple correlation is 3.463,
which is equal to t². The other alternative is to do an analysis of variance, using the aov ()
function with the command summary(aov(extra ~ group)), which gives the same value for F
as obtained from the lm () function. The reason for the similarity between the results from the
latter two functions is simply that the aov () function is based on calls on the lm () function.
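Entered as commands, the two alternatives mentioned in this paragraph would look like this (a
small sketch, assuming the sleep frame has been attached):

> summary (lm (extra ~ group))     # multiple R-squared equals the squared correlation
> summary (aov (extra ~ group))    # same F value as from lm ()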
Again, the identity between the results from four different functions is partially dependent on
the fact that there are only two groups of subjects in this data set, but there are nevertheless basic
identities between analysis of variance techniques on the one hand and multiple regression on the
other. See for instance (Cohen, 1968) for more details.

5.3.4 t-test, dependent means

Actually, the assumption that the sleep data set represents an independent groups design is
not correct in a historical sense. If you look at the original article (search for the title of the
article on the net), you will see that the original data set consists of data from one group of 10
subjects, with data on changes in sleep using two different drugs. Hence, the correct design to
use in this case is a repeated measures or dependent means one. This means that the data
have to be rearranged so that the two values from the same person appear in the same row, and
are paired. This is what Gosset (Student, 1908) originally used. This revised data set (called
repsleep) is shown in table 5.1. In this case the first value of extra is paired with the value in
row 11 of the sleep frame, the second with the observation in row 12, and so on.

Table 5.1: Rearranged sleep

     first second
1      0.7    1.9
2     -1.6    0.8
3     -0.2    1.1
4     -1.2    0.1
5     -0.1   -0.1
6      3.4    4.4
7      3.7    5.5
8      0.8    1.6
9      0.0    4.6
10     2.0    3.4
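One way of constructing such a rearranged frame from the built-in sleep data could look like this
(a minimal sketch; the column names first and second simply follow table 5.1):

> first <- sleep$extra[sleep$group == "1"]
> second <- sleep$extra[sleep$group == "2"]
> repsleep <- data.frame (first, second)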
Given this rearrangement, the command to do this analysis is:
> t.test(first, second, var.equal=T, data=repsleep, paired=T)

Which gives the following results:

        Paired t-test

data:  first and second
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean of the differences 
                  -1.58 

The difference between the two drugs in their effect on the dependent variable is significant.

Note that this rearrangement of the data works for a very simple dependent or paired
type of t-test, but not as a general case. One cannot have a proper repeated measures kind of
design with this type of arrangement. In that case one needs something close to the arrangement
in the original sleep frame, but where the group variable is replaced by a person variable. In this
variable, rows 1 and 11 contain a 1, rows 2 and 12 a 2, etc., since these pairs of observations originate
from the same person. In other words, the reason that the description of this technique is placed in the
bivariate section is not that there are two columns in the data set. There are two columns, but
both refer to the same variable, measured at different times or under different conditions. The other
variable is hidden in the fact that there are pairs of observations belonging to the same person.

Also note that the simple relation between these results and a correlation type measure, as
shown in respect to the t-test for independent groups, is not as obvious here; it is there, but in a more
complicated manner which involves multiple regression.

5.3.5 Two-way frequency tables

In the section on univariate techniques above, a simple count of a limited set of values within
one variable (normally a single column in a frame) was covered. Very often one will want to
compute a contingency table, also called a cross table, giving the frequencies for each combination
of values for two categorical variables. As an example, consider the data set in table 5.2, and
construct a frequency or contingency table for two of the variables or columns in this data set.
Using the table () command gives the following results:
> table(X1, X5)
   X5
X1  1 2 3 5 6
  1 0 0 0 1 0
  2 0 2 1 1 1
  3 0 1 1 0 0
  4 1 2 0 0 0

This table has the labels (values) for the first variable (X1) in the first column of the table and the
labels (values) for the second variable (X5) in the first row. The rest of the table gives the
frequencies for the combinations of values. Applying the summary () command to the table gives
the following results:
> summary(table(X1, X5))
Number of cases in table: 11 
Number of factors: 2 
Test for independence of all factors:
        Chisq = 11.11, df = 12, p-value = 0.5195
        Chi-squared approximation may be incorrect
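The same test can also be obtained directly with the chisq.test () function (just an alternative to
the summary () call above; it will typically give the same warning about the approximation):

> chisq.test (table(X1, X5))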

5.4 Multivariate techniques

The multivariate techniques covered in this part include the computation of component and factor
analysis, reliability estimates, multiple regression (MR), and factorial analysis of variance
(ANOVA).

5.4.1 Factor analysis

The term factor analysis really covers several different techniques, where the common term is
Exploratory Factor Analysis (EFA). Among them, the more important variants are two types called
Principal Component analysis (CA or PCA) and (Principal) Factor analysis (FA or PFA) respectively.
The former is one of the oldest techniques within multivariate data analysis, starting with
(Pearson, 1901). What they have in common is that they are both exploratory and what is called
unsupervised² techniques.

The question addressed with this type of procedure is simple: Given a set of variables and
the correlations between the variables (normally expressed as a correlation matrix), is it possible
to represent these relations as a smaller number of (normally) uncorrelated factors³ or latent
variables, regarding the remainder of the variance as noise?

Factors are often thought to cause the scores for the variable: the underlying construct (the
factor) is what (at least partially) produces the scores for the variable.
The basic steps in this type of analysis are:

1. Compute a correlation matrix for the variables you want to include in the analysis.

2. If you are doing an FA, place some initial estimates of the communalities (an expression
of how much of the variance for the variable it has in common with one or more of the
other variables in the analysis) in the diagonal of the correlation matrix. Iterate until a stable
solution has been reached.

3. The next step is to extract the factors. This is done by finding a factor (component, dimension,
latent variable; the use of the terms varies) that explains as much as possible of the
total variance represented in the correlation matrix. That is the first component or factor. This
variance is then extracted from the correlation matrix.
Then the process is repeated: find a new factor that explains as much as possible of the
remaining variance in the (now reduced by the preceding extractions) correlation matrix.
Repeat this process until you have as many factors as you want or there is no variance left
to be extracted.

4. The solution reached by the above process is characterized by (normally) very unequal explanations
of the variance in the correlation matrix by the factors. Therefore the normal
thing to do, if you have extracted more than one factor, is to rotate the structure (the matrix of
loadings obtained by the previous step) to something that may be easier to interpret. This
rotation is either orthogonal (which means that the resulting factors are uncorrelated) or
oblique (yielding correlated factors).
The main difference between the two types mentioned above is that the first (CA) uses the
correlation matrix as it is, with ones on the diagonal. The other (FA) replaces the values on the
diagonal with what is called communalities, by starting with some estimate and then iterating
until some convergence criterion is satisfied.
² This is in contrast to a more general class of supervised data analysis techniques like SEM (Structural Equation
Modeling), where the researcher specifies a model based on substantial theory to start the analysis. The program
then attempts to fit the data to the model. This is a very general approach where very many of the standard
data analysis techniques are special cases. The output includes various measures describing how successful the fitting
has been. The method within SEM most related to PCA and PFA is called Confirmatory Factor Analysis (CFA).

³ In respect to terminology, strictly speaking PCA produces components, while PFA yields factors. In actual
usage, the distinction is blurred.


The first of these is essentially mathematical in nature, the other is more based on statistical
theory. The classical references in this field are (Harman, 1967) and two articles by Henry F. Kaiser,
(Kaiser, 1960) and (Kaiser & Rice, 1974). Especially the title of the first of the latter two has a very
nice flavor to it, at least for someone who has used computers since the mid-60s.
In both cases the options are really quite limited. Apart from the operations on the correlation matrix before the analysis is started, you may influence:
1. The data set to be subjected to the analysis, i.e. the combination of variables and cases used
to compute the correlation matrix.
2. The number of factors to extract. There are some suggestions about criteria to use, but no
fixed rules.
3. What type of rotation to use.
And this is not really very much; anything else is automatic. The modern approach is to use
what is called SEM (an acronym for Structural Equation Modeling). This is a supervised and
theory-based type of technique.

5.4.2 Rotation

The loadings produced in this type of analysis are really coordinates for points representing the
variables in a space with as many dimensions as are considered useful. When the matrix of
loadings is rotated, the relative positions of each of the variables is unchanged, it is the axes that
are changed in respect to direction. In the unrotated version of the output, the first dimension or
factor is located where the most variation is found, the second orthogonal to the first and where
most of the remainder of the variation is found, the third orthogonal to both the first two, etc.. The
purpose of the rotation procedure is to spread the variation as evenly as possible on a (normally)
small subset of the factors.
This is clearly seen when comparing a plot of the first two dimensions before and after a rotation
process. In other words, the important thing to keep in mind is that the rotation procedure
used in this class of techniques does not change the relative positions of the variables, only the
direction of the axes. Therefore the coordinates used to represent the variables change.

5.4.3 Principal component analysis

This is the oldest of the techniques in this class, and also the most common one. Technically
speaking, and in contrast to principal factor analysis (see 5.4.4 on page 43 below) this is more a set
of mathematical operations, which have little statistical content. It is the default option for several
of the other statistical systems in this respect, including SPSS. However, it is not necessarily the
best.
The main tool for using principal components analysis in R is the princomp () function. The
following command uses the attitude data set:
Results <- princomp (attitude, cor=TRUE)

A simple printing of the Results object after using this function shows that the procedure
extracts all the possible factors, usually (depending on the rank⁴ of the correlation matrix) the
same number as the number of variables in the correlation matrix.

⁴ The rank is normally the same as the order (number of variables) of the correlation matrix. However, if there
are linear dependencies between some of the variables, the rank will be less than the order. This has mathematical
consequences, see for instance footnote 5 on page 25 in the part on missing data.

A simple summary of the results is obtained by writing the name of the object containing the
results, which prints it on the screen:
Call:
princomp(x = attitude, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7 
1.9277904 1.0681395 0.9204301 0.7828599 0.5689225 0.4674726 0.3747503 

 7  variables and  30 observations.

[Figure 5.3: Scree plot]

The standard deviations in the output are the same as the square roots of the eigenvalues (computed
during the extraction of the factors, see below). A normal criterion or stopping rule for the
number of factors to use in an analysis is that the smallest eigenvalue should be larger than 1.0.
This is the so-called Kaiser's criterion. Since the square root of something larger than 1.0 is
also larger than 1.0, these results indicate that the first two factors should be used. Another
commonly used rule for determining the number of factors is the so-called scree test: inspect
a line plot of the standard deviations (or, more commonly, the eigenvalues given below), and
see if there is a clear break in the plot. The latter values are the ones used for generating the plot
in figure 5.3. The commands for generating this plot are:
> zz <- eigen(cor(attitude))
> plot(c(1:7), zz$values, type="b", col="blue", ylab="Eigenvalue", xlab="Factor")

What is done here is to plot the integers from 1 to 7 (the number of variables and the number
of components extracted) against the values in the object zz, i.e. the eigenvalues. The type of plot
to generate is "b" (for both lines and points). As seen from the figure, there is a clear change in
the direction of the curve after the second eigenvalue, which suggests that two components should
be used, which supports the conclusion above based on the sizes of the eigenvalues.
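As an alternative to building the plot by hand, the built-in screeplot () function produces a similar
plot directly from the princomp () results (it plots the variances, i.e. the eigenvalues, rather than
the standard deviations):

> screeplot (Results, type="lines")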
The use of the summary () function on the Results object yields more information:
> summary (Results)
Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
Standard deviation     1.9277904 1.0681395 0.9204301 0.7828599 0.56892250
Proportion of Variance 0.5309108 0.1629888 0.1210274 0.0875528 0.04623897
Cumulative Proportion  0.5309108 0.6938997 0.8149270 0.9024798 0.94871881
                           Comp.6     Comp.7
Standard deviation     0.46747256 0.37475026
Proportion of Variance 0.03121866 0.02006254
Cumulative Proportion  0.97993746 1.00000000

The first line represents the same values as obtained by printing the Results object alone,
which again are based on the eigenvalues below. The Proportion of Variance in the second
line is based on these values, and is equal to the eigenvalue divided by the number of variables,
which again gives the proportion of the variance in the correlation matrix explained by the
factor or dimension. In this case the first component or factor explains 53.1% of the variance and
the second 16.3% (the proportion multiplied by 100). As seen in the third row, these two factors
together account for 69.4% of the variance.
The eigenvalues used as the basis for the Standard deviations are not printed here in either
of these two outputs, but are easily obtained by the use of the eigen () command, e.g. eigen
(cor(attitude))$values, which for these data gives:
[1] 3.7163758 1.1409219 0.8471915 0.6128697 0.3236728 0.2185306 0.1404378

It is easy to see the relationship between these values and the standard deviations above. The
sum of all the eigenvalues is equal to the sum of the diagonal elements of the matrix used as the
basis for the analysis, which, when a correlation matrix is used, is equal (when the rank of the
correlation matrix is the same as its order) to the number of variables (since the values on the
diagonal of a correlation matrix are all ones). In order to have the loadings printed, you have to
use either the function loadings(Results) or the command Results$loadings:
Loadings:
rating
complaints
privileges
learning
raises
critical
advance

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7


-0.413 0.397 -0.263 0.234 0.143 0.412 0.598
-0.441 0.334 -0.226
-0.278 0.228 -0.717
-0.355
0.188 -0.891
0.174
-0.429
0.325 0.239 0.698 -0.353 -0.200
-0.447 -0.179
0.243 -0.556 -0.585 0.234
-0.185 -0.603 -0.701 -0.149 0.293
-0.303 -0.570 0.496 0.115 -0.142 0.552

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143  0.143
Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857  1.000

For one thing, these are the unrotated factor loadings, and in addition, the smallest values have
been omitted from the output (by default, values less than 0.1; see the help file ?loadings for
details).

Also, these loadings are normalized: the squared column values add up to 1.0. This makes
comparisons between columns in respect to loadings simpler. This is different from the comparable
results from a program like SPSS, where the squared column values add up to the corresponding
eigenvalue. To get the same type of results as in SPSS, multiply the values in each column
in the R output by the square root of the corresponding eigenvalue. To get the same results from
SPSS as in R, you have to divide the values in the SPSS loadings by the corresponding value. This is
simple to do in R, but much less so in SPSS. In that case the simplest solution is probably to import
the results from SPSS into a spreadsheet.

This difference indicates that, as often is the case in data analysis, there are no absolute rules
in the choice of operations for this type of analysis; the preferences may even be dictated by
something close to ideology.
This holds for the unrotated factor loadings, but not for the rotated ones.
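The rescaling described above can be done in a couple of lines; a minimal sketch (the object name
ev is only used for the illustration):

> ev <- eigen (cor(attitude))$values
> Results$loadings %*% diag (sqrt(ev))    # SPSS-style scaling of the loadings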
To obtain one variant of an (orthogonal) rotation of the loadings from PCA, the varimax ()
function may be used, in this case on the first two columns of the loadings (since Kaiser's criterion
indicated that the first two components should be used):
> varimax(Results$loadings[,1:2])


Loadings:
       [,1]   [,2]  
[1,] -0.546  0.173
[2,] -0.543  0.104
[3,] -0.361        
[4,] -0.404 -0.149
[5,] -0.322 -0.359
[6,]  0.100 -0.622
[7,]        -0.645

                [,1]  [,2]
SS loadings    1.000 1.000
Proportion Var 0.143 0.143
Cumulative Var 0.143 0.286

$rotmat
           [,1]      [,2]
[1,]  0.8968073 0.4424213
[2,] -0.4424213 0.8968073

The objective of the rotation is to simplify the interpretation of the results by spreading out the
variance across the factors.

As with the output from the unrotated solution, the smallest values are omitted in order to
simplify the reading of the table, and the values are normalized by columns as was the case for
the unrotated solution.

5.4.4 Principal factor analysis

One of the problems with PCA is that the variables involved in the analysis always have equal
weights, and they have a corresponding equal influence on the results. The basic question is then,
is this reasonable? The influence of a given variable on the results should (perhaps) be a function
of how much that variable has in common with the other variables in the analysis.
Other variants of this type of analysis address this question by altering the diagonal values of
the correlation matrix before the extraction of the factors.
There are several variants of doing so. One of the more common ones is represented in R by the
factanal () function, which does the computations by maximum likelihood. As in
the preceding example we use the attitude data set:
result <- factanal (attitude, factors=2, rotation="varimax")

We could have omitted the specification of the rotation method, as the default is to use varimax. Note also that in this case you have to specify the number of factors to extract. This means
that you may have to do a bit of exploring to find the number of factors you prefer.
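As an aside, one way of doing that exploring is to fit models with different numbers of factors and
compare the p-values of the sufficiency test reported by factanal () (a small sketch; the range 1:3
is arbitrary):

> for (k in 1:3) print (factanal (attitude, factors=k)$PVAL)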
The result object gives the following output:
Call:
factanal(x = attitude, factors = 2, rotation = "varimax")

Uniquenesses:
    rating complaints privileges   learning     raises   critical    advance 
     0.210      0.132      0.641      0.396      0.318      0.897      0.037 

Loadings:
           Factor1 Factor2
rating      0.882   0.111 
complaints  0.914   0.180 
privileges  0.505   0.323 
learning    0.587   0.509 
raises      0.613   0.554 
critical    0.152   0.283 
advance             0.980 

               Factor1 Factor2
SS loadings      2.614   1.756
Proportion Var   0.373   0.251
Cumulative Var   0.373   0.624

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 5.47 on 8 degrees of freedom.
The p-value is 0.706 

The so-called Uniquenesses part of the output is related to the communalities as generated
by programs like SPSS or Statistica, but represents the opposite idea. Communalities describe the
amount of variance that a variable has in common with the others, and in this output the vector
describes the extent to which the variables are unique. Hence, subtracting the values for the
uniquenesses from 1.0 yields the communalities as generated by many other programs. In this case,
the value for the critical variable is very high (and therefore correspondingly low in respect to
communality), indicating that this variable has very little in common with the other variables.
Perhaps this one should be left out of the analysis. This point is related to one of the objectives of
the item analysis discussed in part 5.4.8 below.
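Using the uniquenesses element of the object returned by factanal (), the communalities for the
output above are therefore easily obtained with:

> 1 - result$uniquenesses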

5.4.5 Final comments on factor analysis

When comparing the output from different programs, the following quote from the documentation
for the factanal () function in R is quite telling:

    There are so many variations on factor analysis that it is hard to compare output from
    different programs. Further, the optimization in maximum likelihood factor analysis
    is hard, and many other examples we compared had less good fits than produced by
    this function. In particular, solutions which are Heywood cases (with one or more
    uniquenesses essentially zero or communalities close to one) are much more common
    than most texts and some other programs would lead one to believe.
Note that this comment only refers to PFA and not PCA. Given the number of factors to extract
and the method of rotation, the results from PCA programs should be very similar. Any differences in this respect should be minor, and only caused by differences in computational accuracy,
mainly influenced by the compiler used to generate the program used for the computation, as well
as the algorithms used for computing the correlation matrix and the extraction of factors.
As to PFA, it is not at all obvious which of the many variants, with which options, one
should use in a given situation. In addition, different programs often give somewhat different
results on seemingly identical data and specifications. When reporting the results it is therefore
important to include details on the program as well as the options used to generate them.
The results from different programs may vary considerably, and it is not even obvious to
specialists which one is the best. Personally, I would be inclined to trust the results from the R
functions more than the output from many other programs.

5.4.6 Reliability and Item analysis

This subject is an important part of test theory, where the methods discussed in this context
belong to the general field of classical test theory. However, the subject is not limited to what
may be called tests, but is relevant whenever a number of items or variables of the same kind are
added to generate an index, scale, score, etc.
The problem here is two-fold: (a) first find an overall measure which describes how well the
items fit together, and then (b) have a look at the attributes of each item in respect to their
contribution to the score, i.e. the sum of all items. Since R does not have these computations built
in, as it has for most of the other things used here (e.g. standard deviation, variance, multiple
regression, etc.), this theme is a nice opportunity to demonstrate the power of the language, where
a few lines of code add a powerful method to your repertoire.

The end product of the discussion is two functions, one called Alpha, which computes
Cronbach's alpha, and the other called ItemAnalysis, directed towards the item analysis part as
discussed in the second part of this section.⁵ Of course, you do not have to type in the commands
listed below, the final version of the functions can be downloaded.

⁵ In respect to the writing of functions (a theme that has not been covered yet), I am convinced that it is possible
to understand the operations by careful reading. However, it might be a good idea to have a look at the beginning of
chapter 8, Scripts, functions and R.

For the analysis, we need a data set similar to the one in Table 5.2 on page 45. This is a very
much constructed data set. Imagine 11 persons responding to a questionnaire with five questions,
where the response to each question is recorded on a 7-point scale ranging from Strongly disagree
to Strongly agree. The contents of the data set are based on a coding of the possible response
categories with the numbers from 1 to 7, with Strongly agree corresponding to the highest value.

Table 5.2: Data for Item Analysis

     X1 X2 X3 X4 X5
1     3  6  5  5  3
2     2  2  3  2  2
3     2  2  5  5  5
4     2  4  5  4  3
5     2  5  3  5  6
6     4  3  2  2  2
7     4  3  4  2  2
8     1  7  4  6  5
9     2  3  3  3  2
10    3  4  5  5  2
11    4  1  2  1  1

5.4.7 Reliability

The most common measure of reliability within a set of items is called Cronbach's alpha. The
formula is relatively simple:

\alpha = \frac{k}{k-1} \cdot \frac{s_t^2 - \sum s_i^2}{s_t^2}    (5.3)

The terms used in the formula are explained in table 5.3 together with the corresponding R
functions.
Table 5.3: The elements used for Cronbach's alpha

Formula        R code                               What it means
k              cols <- dim(x)[2]                    Get hold of the number of items in the
                                                    test, i.e. the number of columns.
s_t^2          tvar <- var (apply (x, 1, sum))      The variance in the scores for each person,
                                                    i.e. first compute the row sum for
                                                    each person and then find the variance
                                                    for that sum.
sum of s_i^2   sums2i <- sum (apply (x, 2, var))    The sum of the variances for each item,
                                                    i.e. compute the variance for each column,
                                                    and then find the sum of the values.

The necessary computations are not complex, but tedious if you have to do them by hand. A
spreadsheet handles the problem fairly well, but R is a much better alternative.

As you see from the second column in the table above, each of the three elements in the formula
for Cronbach's alpha has a corresponding command in the R language. The major new thing here
is the function apply () as used in the second and third row. When translated into English, the
first one could be read as: Using the matrix or data frame x, apply the function sum () to each row
(the 1 as the second argument) to obtain the sum. So, the result of this operation is a vector with
one value for each row (person), the sum of all five ratings done by that person. The outermost
call on var () computes the variance of the row sums. The same function apply () is used in the
third command, this time using the function var () to find the variance for each column (since
the value of the second argument is 2); the results are then added with the outermost sum ().
So, with the three rows in the table we have all the values needed to compute the value of
alpha:
> alpha <- (cols / (cols - 1)) * ((tvar - sums2i) / tvar)

If these four lines are entered, the value of alpha is computed as equal to 0.6162407, which is
the correct answer for this data set (but with an absurd number of significant digits).

As entered above, the four lines could easily be regarded as an only used once set of commands.
That is not very smart. If something is good and correct, why not consider saving the commands
for later (more general) use? The four commands are:
> cols <- dim(x)[2]
> tvar <- var (apply (x, 1, sum))
> sums2i <- sum (apply (x, 2, var))
> alpha <- (cols / (cols - 1)) * ((tvar - sums2i) / tvar)

The really smart thing would be to save and later use these commands as a function (see part 8
on page 85) which returns Cronbach's alpha. Such a function could be called Alpha and stored
in an object. So, start the definition of the function by entering:
> Alpha <- edit ()

Where the contents of the editing window could then look like this⁶:

function (x) {
    x <- na.omit (x)
    cols <- dim(x)[2]
    tvar <- var (apply (x, 1, sum))
    sums2i <- sum (apply (x, 2, var))
    alpha <- (cols / (cols - 1)) * ((tvar - sums2i) / tvar)
    return (alpha)
}
⁶ Have a look at the section on editing functions in part 8.1.1, page 85, before doing serious work on functions.


Apart from the wrappings (the first and the last line), the new thing here is the call on na.omit
(), a function. The line x <- na.omit (x) has the effect that all rows in the matrix with one
or more missing values (NAs) are removed when the object is returned. In other words, the
remainder of the operations are based on a listwise handling of missing data. When the function
is saved, use it by entering the command:
> Alpha (mydata)

With the name of the frame you want the results for replacing the mydata part of the
command, the computed value of Cronbach's alpha is printed on the screen. If you save the
workspace, this function may be used again and again, on any suitable data set.

5.4.8 Item Analysis

In contrast to the computation of Cronbachs alpha, which is oriented toward the reliability of the
scale as a whole, item analysis is oriented toward the inspection of the contribution of each item
to the properties of the sum of the items, i.e. the final score. The obvious question is then: what
attributes of the items are of interest? The output from standard software for item analysis usually
contains all or some of:
1. The item mean.

2. The variance of the item (s_i^2). If the variance is low, the item does not contribute to the
discrimination between persons. If that is the case, the item should be excluded.

3. The correlation between the item (item i) and the sum of all the items (t), i.e. the score (r_it).
If that correlation is low (or even worse, negative), including the item does not contribute to
the discrimination between the cases, and therefore the item should either be eliminated or
reversed.

4. The correlation between the item (item i) and the sum of all the items except the one in
question (t - i): r_i(t-i). This is an even stronger indicator than the previous one.

5. The reliability index for the item, defined as the product of the second and either the third or
fourth indicator above. If either the variance of the item or the correlation of the item with the
sum of the other items (or both) is low, this index will be low. If so, one should consider
excluding the item. If the correlation with the total is negative, one should consider reversing
that item (see below).

6. The value of Cronbach's alpha if the item in question is excluded from the test. One could
consider omitting the item if the value of alpha increases when that item is excluded.

7. The difference between the alpha for the whole set of items and the alpha when the item is
excluded. If there is a real increase in the alpha when the item is excluded, that is not a good
sign. All items should at least contribute to an increase in the reliability.
The first step is to get hold of the data and to compute the mean and the variance for the columns.
The final step is to compute the scores (i.e. the row sums) for each respondent in the vector Total.
We will need this later on. It is a good idea to add the lines, as they are tested, to the body of a
function defined with the edit () command in the same manner as was done for Alpha (), but
give the function a meaningful name like ItemAnalysis (in one word).


x <- na.omit (x)              # Remove all cases with missing data
Mean <- apply (x, 2, mean)    # Column means
s2i <- apply (x, 2, var)      # Column variances
Total <- apply (x, 1, sum)    # All person scores (row sums)

The next step is to allocate vectors for the information we want to compute for each item. We
start by getting hold of the number of columns, and use this value to tell R how many elements
there should be in the vectors.
cols <- dim(x)[2]    # Get hold of the number of columns or items
Index <- vector (mode="numeric", length=cols)
Rit <- vector (mode="numeric", length=cols)
AlphaReduced <- vector (mode="numeric", length=cols)

Now we need to do the computations in a loop; by that is meant a mechanism for repeating the
same operations, once for each of the five items. The loop is defined by the for () command,
which specifies the name of the index (in this case i, used to identify the item number) and the range
to be used. The construction 1:cols says: let the object (a simple integer) i take the values
starting with 1 and ending with the value of cols. The commands to be included in the loop are
enclosed in curly brackets, { and }.
for (i in 1:cols) {
    AlphaReduced[i] <- Alpha (x[,-i])
    Rit[i] <- cor(x[,i], Total)
    Index[i] <- Rit[i] * s2i[i]
}

Note that the function uses the function called Alpha () defined above. Within the loop there
are some rather nifty uses of the index i. One of them is in the third line, where the correlation
between one column in the data frame and the values in the Total vector is computed. x[,i]
refers to the ith column of x. An even more elegant one is in the second line, where the value of
Cronbach's alpha is computed for the data frame when excluding the ith column. The exclusion
part is signalled by the minus in front of the i. Finally, we want to generate some pretty output⁷
of the results we have computed.
cat ("\nReliability indices:\n\n")
cat ("\nCronbachs alpha = ",
format(Alpha (x), digits=4), "\n\n")
Res <- data.frame (rbind (Mean, s2i, Rit, Index, AlphaReduced))
print (Res, digits=4)

The first line prints a header, and the second the value of Alpha () for the complete data set.
The operations in the third line generate a data frame with one row for each of the computed
indices, and one column for each variable. The output in this case would be:
Cronbach's alpha =  0.6162

Reliability indices:

                  X1     X2     X3     X4     X5
Mean          2.6364 3.6364 3.7273 3.6364 3.0000
s2i           1.0545 3.2545 1.4182 2.8545 2.6000
Rit          -0.5071 0.8563 0.6957 0.9521 0.7659
Index        -0.5347 2.7868 0.9867 2.7179 1.9914
AlphaReduced  0.8511 0.3547 0.5037 0.2004 0.4568

⁷ The functions used here, cat (), print () and format (), are oriented towards the generation of formatted output
and have a double purpose in this context. For one thing, anything inside functions is normally silent; normal output
does not show up anywhere. Secondly, they are used to prettify the output, among other things to keep the number
of digits after the decimal point down to a reasonable level. There are a number of other functions in R intended for
formatting output in various ways. For the use of the \n strings to control the output, see part 8.2.3 on page 95 for
more information.

Which includes more or less the information we would like to have. From these results it is
clear that the first item (X1) is the weakest one: it has a negative correlation with the total, the
variance is relatively low, and the product of the correlation and the variance is negative. Most
important, if the item is excluded, the value of Cronbach's alpha rises from the original 0.6162 to
0.8511. Clearly that item should be excluded from the test or index. However, an alternative that
could be worth considering is to reverse the first column of the frame. Since the data originated
from a scale with the possible values of 1 to 7, the values of that column could be replaced by
subtracting the original values from 8 (the upper or highest value for the variable plus one). Then
all 7s would become 1s, all 6s would become 2s, etc. Hence:
> x$X1 <- 8 - x$X1

Or:
> x[,1] <- 8 - x[,1]

If we do that, and run the ItemAnalysis () function again on the revised version of the frame,
we get the following results:
Cronbach's alpha =  0.8623

Reliability indices:

                  X1     X2     X3     X4     X5
Mean          5.3636 3.6364 3.7273 3.6364 3.0000
s2i           1.0545 3.2545 1.4182 2.8545 2.6000
Rit           0.7386 0.8258 0.6586 0.9705 0.8263
Index         0.7789 2.6876 0.9339 2.7703 2.1483
AlphaReduced  0.8511 0.8398 0.8693 0.7556 0.8283

These results are radically different from the previous output; it is very obvious that when the
first item is reversed, all the items pull in the same direction. However, one cannot do a reversal
like this blindly, there must be some kind of rationale or support for doing so, for instance by
looking at the text or task of the item and comparing it with the other items. In any case, and as
the final step, save the lines above in a text file called ItemAnalysis.R, add function (x) { at the
top (first line) and return (Res) followed by } at the bottom. Then you have the function stored
away, and it can be read into your workspace at any time and reused on any appropriate set of
data or frame.

As a final comment, there is a package (library) called psych which contains a number of
useful functions for researchers within the field of psychology.

5.4.9 Factorial Analysis of Variance

Another basic example is analysis of variance, in this case demonstrated with artificial data. With
R, it is easy to generate (or simulate) artificial data to demonstrate how a statistical procedure
works (and with data you made up yourself you always know what the true situation is). It is
often very instructive to do so, and leads to an understanding of what the analysis really is about
(also, have a look at the scripts in part 8 on page 85 for doing this type of analysis with a script).


Generation of data
Let us consider a simple hypothetical experiment with two factors f1 and f2, each factor with two
levels, and a dependent variable x; also assume we have 80 participants and a balanced factorial
design. First, we create x, the dependent variable:
> x <- rnorm (80, 10, 2)

This creates a random variable x with 80 observations, normally distributed, with a mean of
10 and a standard deviation of 2. With:
> hist (x)

you can easily check the distribution of this variable. Now we create two factors to be used as
independent variables:
> f1 <- factor (rep (c (1, 2), each=40))
> f2 <- factor (rep (c (1, 2, 1, 2), each=20))

The function rep (c(1,2), each=40) creates a variable with values 1 and 2, each repeated 40
times, so we get a variable with 80 observations, the first 40 observations are 1s and the second
40 are 2s. The function factor () tells R that this is a factor (or a categorical variable). Factor f2
is similar, we tell R to create a variable with values 1 and 2, but this time the values alternate with
20 replications each.
Strictly speaking, the use of factor () is not necessary, but it may add meaning to what
is done. A factor type object (or column in a data frame) is one that contains a limited set of
values, i.e. what may be called a categorical variable. In this case the f1 and f2 factors only
contain the numbers 1 and 2, which are used to identify the group the corresponding values
in x belong to, generated by the rep () (repeat) function. You could generate a vector with
names replacing the numbers for the groups instead, as in the alternative:
> f1 <- factor (rep (c ("A1", "A2"), each=40))
> f2 <- factor (rep (c ("B1", "B2", "B1", "B2"), each=20))

However, if the factor vector contains names, the factor () part is necessary. Whatever you
do, you can easily check how the observations distribute across factor levels:
> table (f1, f2)

All groups should be equally large (n of 20). Now we modify the x variable a little bit by
adding 1.5 to all observations starting with number 41 and ending with the last one:
> x[41:80] <- x[41:80] + 1.5

To the uninitiated, this command may be a bit cryptic, but it is still readable. The 41:80 part
is a shorthand for a sequence of the integers starting with 41 and ending with 80. Therefore
x[41:80] refers to a subset of the 80 values in the vector x (which was generated with the rnorm
() command above), starting with number 41 and ending with number 80. So this term refers to
a sequence of 40 values in a very compact manner. The part to the right of the <- says: take these
40 values and add 1.5 to each of them. The left part tells R where the results of the additions are to
be stored, and that is exactly where they came from. So, the value of 1.5 is added to a subset (the
last half) of the values in the vector x. In any case, this simulates a treatment effect of 1.5 for factor
f1, i.e., all observations under level 2 of factor f1 get an increment of 1.5 units. We also apply the
same trick to factor f2, we add an increment of 2 to all observations under level 1 of factor f2:


> x[1:20] <- x[1:20] + 2


> x[41:60] <- x[41:60] + 2

Here the notation is the same as the previous one, except that the constant (2.0 rather than 1.5)
is different, and we also are referring to different subsets of values in the vector x. If you repeat
the use of hist () on the vector x, you will see what has happened to the values in this vector.
> hist (x)

Now we have three vectors called x, f1, and f2 respectively. The first one is to be used as
the dependent variable, and the other two are independent variables or factors identifying the
conditions for each of the xs. In other words, we have data for a two-way analysis of variance,
with two levels on each factor. It is balanced since we have the same number of subjects or
observations in all four groups.
The analysis
Now we are finished with the generation of data, and are ready for an ANOVA. This is done with
the lm () command:
> result <- lm (x ~ f1*f2)

It is very similar to multiple regression (in fact, analysis of variance is a special case of multiple
regression, where the classic reference within the field of psychology is (Cohen, 1968)). Therefore,
it is not an accident that the same function lm () is used for the computation of both the factorial
ANOVA above and the multiple regression below. Note how the factors f1 and f2 are written.
The argument f1*f2 does not mean multiplication, but something like: test for all effects,
including all interactions. The result is stored in the object result, and a typical ANOVA type
table with the results is printed with the anova () command:⁸
> anova (result)
Analysis of Variance Table

Response: x
          Df Sum Sq Mean Sq F value   Pr(>F)   
f1         1  48.20   48.20  9.4606 0.002917 **
f2         1  39.19   39.19  7.6926 0.006973 **
f1:f2      1   0.01    0.01  0.0026 0.959611   
Residuals 76 387.17    5.09                    
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you probably had expected from how x was constructed in the first place, the analysis
shows two significant main effects but no significant interaction, for the simple reason that no
interaction effect was built into the data (see part 5.4.10 on page 54 below). A quick graphical plot
of the situation is produced by:
> interaction.plot (f1, f2, x)

and the means are printed with


⁸ The output from the lm () command is strictly utilitarian and not suited for any kind of report. Have a look at part
7.6 below for the generation of more elegant output.


> tapply (x, list(f1,f2), mean)

The tapply () function looks somewhat complicated, but is very useful. The first argument
is the dependent variable, the second argument is a list of factors with respect to which we want
the vector x to be split up in groups. The third argument mean tells R what to do with
each combination of f1 and f2, namely to compute the mean of x. Instead of the mean, the name
of any sensible function might be inserted, such as the standard deviation sd, the sum, etc. Final
note: instead of using the lm () command, one can use the aov () command, which of course
produces identical results:
> result <- aov(x ~ f1*f2)
> anova (result)

It sounds like this is redundant, but we will need the aov () command for more complicated
analyses of variance, especially for repeated measurement designs.

5.4.10 Multiple regression

As an example of another relatively advanced type of analysis, consider a standard multiple regression, in this case not using generated data, but again using the attitude data set included
with the installation of R:
> attach (attitude)
> Results <- lm (rating ~ complaints + privileges + learning)

Referring to the variables or columns in the attitude data set is simple as long as the attach
() command is used somewhere above the lm () command. Here the variable rating is used as
the dependent variable, while complaints, privileges and learning are used as the predictors
or independent variables. The ~ (tilde)⁹ symbol is used to separate the dependent variable
(the first variable reference) from the independent ones (predictors). As with other objects, you
get a listing of the basic results by writing the name of the object alone.
> Results

Call:
lm(formula = rating ~ complaints + privileges + learning)

Coefficients:
(Intercept)   complaints   privileges     learning  
    11.2583       0.6824      -0.1033       0.2380  

Note that the output from a simple listing of this object is very brief, and only includes the
intercept and the unstandardized coefficients (the values for the b's).¹⁰ More detail is obtained by the
summary () command:
⁹ On most keyboards (at least mine, with a Norwegian layout) the tilde (~) is rather special: when the key is pressed
(together with the Alt key), the character does not appear on the screen before the next key (e.g. a space or something
else) is pressed as well.

¹⁰ One slightly puzzling thing with the use of the lm () function in R is that you do not get the standardized regression
coefficients (the beta values), not in the output, nor as an attribute of the object returned by the function. If you
need them, one trick is to standardize the input to the function, for instance by: Results <- lm (scale(rating) ~
scale(complaints) + scale(privileges) + scale(learning)). There are several other ways to do the same. See
the scripts in part 8 and the section below for more details.


> summary (Results)

Call:
lm(formula = rating ~ complaints + privileges + learning)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2012  -5.7478   0.5599   5.8226  11.3241 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.2583     7.3183   1.538   0.1360    
complaints    0.6824     0.1288   5.296 1.54e-05 ***
privileges   -0.1033     0.1293  -0.799   0.4318    
learning      0.2380     0.1394   1.707   0.0997 .  
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.863 on 26 degrees of freedom
Multiple R-Squared: 0.715,     Adjusted R-squared: 0.6821 
F-statistic: 21.74 on 3 and 26 DF,  p-value: 2.936e-07 

Some useful plots can be obtained, simply by applying the plot () function to the Results
object, e.g.:
> plot (Results)

Alternatively, the results from the analysis can be printed in an ANOVA format:
> anova (Results)
Analysis of Variance Table

Response: rating
           Df  Sum Sq Mean Sq F value    Pr(>F)    
complaints  1 2927.58 2927.58 62.1559 2.324e-08 ***
privileges  1    7.52    7.52  0.1596   0.69276    
learning    1  137.25  137.25  2.9139   0.09974 .  
Residuals  26 1224.62   47.10                      
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Even more detailed results are also available for the object containing the results, e.g. a single
vector with the coefficients for the equation:
> coefficients (Results)
(Intercept)  complaints  privileges    learning 
 11.2583051   0.6824165  -0.1032843   0.2379762 

This is the same information as printed when asking for the results in Results alone, but with
more accuracy. Another command is:
> predict (Results)
       1        2        3        4        5        6        7        8 
52.24409 62.51618 68.42449 60.78764 74.40930 54.20124 65.96894 70.36402 
       9       10       11       12       13       14       15       16 
75.72440 59.42280 55.75493 56.63001 57.67593 70.03521 75.36131 84.64586 
      17       18       19       20       21       22       23       24 
79.07387 63.33804 67.84103 56.66585 43.23778 62.26946 62.82582 45.97240 
      25       26       27       28       29       30 
55.19372 71.98012 74.05930 56.32047 78.82684 77.22897 

Which is a vector with the predicted values for the dependent variable, which may be handy
for other computations. The function fitted () yields the same results as predict ().


Factor variables
The independent variables used in the examples above are all of type numeric, i.e. it is assumed that
the numbers in the corresponding columns in the data set contain values representing points on
some dimension or scale. That does not have to be the case with multiple regression, but if
not, some precautions are necessary.

If the values for an independent variable represent categories and there is no underlying
continuum, for instance when the values indicate experimental conditions and there are more
than two categories, it is necessary to define that variable as a factor. This is achieved with the
factor () function. For instance, if one variable contains level of education with three categories
(have a look at the first variable called education in the infert data set) you should tell R that
this variable should be interpreted as a factor.
> education <- factor (education)

This is not really necessary in respect to the variable education in the infert data set, for the
simple reason that R will already interpret that variable as a factor, since the entries for that variable
are text labels. However, if this variable had been coded with numerals for the three categories
(e.g. 1, 2 and 3), then a command like the one above would be necessary.
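A minimal sketch of what that could look like with such a numeric coding (the labels used here are
invented for the illustration):

> education <- factor (education, levels=c(1, 2, 3), labels=c("low", "medium", "high"))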
Handling interactions
So far the regression models are relatively simple. However, there are several other symbols that
may be used in the specification of a model (the combination of a dependent variable and one or
more independent or explanatory variables). Note that these symbols are the same as in arithmetical expressions, but their meaning is different. As noted above a + indicates the inclusion of an
independent variable in the model, not addition. The use of a colon (:) specifies that you want to
include the interaction between two or more independent variables, and an asterisk (*) that you
want to include both the explanatory variables and the interactions between them. A minus (-)
indicates that you want to exclude a term from the model.
So, suppose that you have a dependent variable called Y and three independent variables A, B,
and C. Then a few of the possible combinations are:
A + B + C    is the simple model, no interactions (and no addition)
A:B          inclusion of the (two-way) interaction between A and B
A * B * C    is the same as A + B + C + A:B + A:C + B:C + A:B:C, where the last
             term is a three-way interaction (not multiplication)
A + B * C    defines a model consisting of all three variables plus the
             interaction between the last two

It is also possible to indicate nesting (/) of variables as well as conditioning (|).
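For instance, a model using two of the predictors from the attitude data set together with their
interaction could be specified like this (just an illustration; the object name and the choice of
predictors are arbitrary):

> Results2 <- lm (rating ~ complaints * learning, data=attitude)
> summary (Results2)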


Standardized regression weights
One of the slightly surprising things about the output from lm () is that standardized regression
weights are not included.¹¹ From some of the comments I have seen on the R discussion list, there
are other reasons as well, which may be of the more ideological kind. But if you really need the
weights, they are easy to compute.

¹¹ The probable reason is that it is the same function that is used both for typical multiple regression problems and
for analysis of variance. In the latter case the independent variables will normally be of a factor type, and then the
betas will be at least meaningless and perhaps confusing as well.

The formula for computing the betas from the b's is:

\beta_j = b_j s_j / s_y    (5.4)

To make things simpler when computing the betas, two assumptions are made here: (a) the
dependent variable is in the first column of the data set, and (b) that the sequence of the variables
in the data set is exactly the same as in the model. Then the commands would be:
> b <- coef (model1)
> s <- sd (mydata)
> beta <- b * s / s[1]

If the variables in the model are a subset of the variables in the data set and/or the sequence
of the variables is not the same as in the model, some indexing of mydata will be needed when
generating the vector s, which needs to contain the standard deviations of the variables in the
same sequence as the variables in the model (the same sequence as the values in the vector
returned by coef () for the model). For instance:
> s <- sd (mydata[,c(5, 2, 1)])

Where the fifth variable in the data set in this example is used as the dependent variable, while
the first two are the independent variables. Alternatively, you may refer to the variables by their
names in quotes, e.g.:
> s <- sd (mydata[,c("v05", "v02", "v01")])

Which may be simpler.


Differences between models in hierarchical regression
If you do a hierarchical regression, you will have a sequence of models where the dependent
variable is the same for all models, and the independent variables in the smaller models are true
subsets of the independent variables in the larger models. For instance, assume that you have used
the lm () function to compute two objects model1 and model2. Then the difference between
the two is computed with the anova () function:
> anova (model1, model2)

However, note that by default lm () does its computations in a listwise fashion. This means that if there are any missing observations in the variables added for the larger model, the larger model will be computed on a smaller subset of the cases than the smaller one, and a call on anova () like the one above will produce an error message. This is formally correct. With
R, the safest thing in that situation is to generate a data frame which only contains the variables
used in the largest model and then remove all cases with any missing data from that data set. This
is achieved with something like this:
> newdata <- na.omit (mydata[, c(<list of variables>)])

The na.omit () removes all rows with missing data. If you then run the hierarchical models
using the data set newdata you are safe. Alternatively, some kind of careful imputation may be
used.
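A minimal sketch of the whole sequence (the frame mydata and the variable names y, x1, x2, and x3 are hypothetical):

newdata <- na.omit (mydata[, c("y", "x1", "x2", "x3")])   # keep complete cases only
model1 <- lm (y ~ x1, data = newdata)                     # the smaller model
model2 <- lm (y ~ x1 + x2 + x3, data = newdata)           # the larger model
anova (model1, model2)                                    # test of the difference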


5.5 On Differences and Similarities

In all the sections above, the different types of correlations are used to get an expression of the
degree of similarities between two sets of variables, while the t and F values are (often) used to
test for differences between means.
As demonstrated in the section on the t-test for independent means above (page 35), there is a close relationship between the product-moment correlation on the one hand and the t-test for independent means on the other. This relationship is so close that there is a simple formula to convert from one to the other. There is more to that relationship. In fact one may talk about two classes of conventional statistics, one called τ (the Greek letter for t, called tau, representing the word test) and the other called ρ (the corresponding Greek letter for r, called rho or relation) statistics. The last type is used to express the relationship between variables, and the other to test that relationship. There are clear relationships across these two classes, something that might be useful.
For instance, in older textbooks you will often find three apparently different types of correlation coefficients, with formulas that looked very different. This would include formulas for:
A phi (φ) coefficient, used for the computation of the correlation between two dichotomous variables, or something that could be expressed as a 2 × 2 table.
A point-biserial (r_pb) type of correlation, used for the correlation between one dichotomous variable on the one hand and a continuous 12 variable on the other.
The formula for the standard product-moment (r_pm) correlation (in two variants, one computational and one definition formula).
The important point is simply: The first two formulas for correlations are special cases of the
third, constructed for the pre-computer and pre-calculator days in the 1920s. At the time, you
needed formulas that were simple to use with the equipment you had. In contrast, a modern computer will not bother about such niceties, and compute a correlation coefficient between whatever
pair of variables is thrown at it. So, if both the variables are dichotomous, then the output from
the cor () function is the same as a φ coefficient. If one is dichotomous and the other is continuous, then you get a point-biserial correlation. If both are continuous, then you get the standard product-moment correlation, all from the same cor () function, which really computes an r_pm correlation by default.
That shows that there are basic similarities between several of the most common types of
correlations. There are basic similarities between the statistics used to test the same correlations,
as well as simple formulas to transform one to the other. So, the first formula is the one to use to compute a φ value from a χ² (chi-square):

\phi = \sqrt{\chi^2 / N} \qquad (5.5)

Note that this relationship is the one to use when the χ² has been computed on a 2 × 2 table. In all other cases (larger tables) the value is not a φ coefficient, but something else (but similar). In the opposite direction:

\chi^2 = N \phi^2 \qquad (5.6)

12 In this context, continuous in the worst case simply means: a variable with more than two response categories. Not very continuous, but that is how it is.
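As a small illustration of equation 5.5, here is a sketch with two made-up dichotomous variables (in a toy example this size, chisq.test () will warn about small expected counts, which can be ignored):

x <- c(0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0)
y <- c(0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1)
chi2 <- chisq.test (table (x, y), correct = FALSE)$statistic
sqrt (chi2 / length (x))      # phi computed from the chi-square
cor (x, y)                    # the same value from the ordinary correlation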

5.5. ON DIFFERENCES AND SIMILARITIES

57

The second pair involves the relation between a point-biserial correlation r_pb or any product-moment correlation (r_pm) and a τ-type t value:

r = \frac{t}{\sqrt{t^2 + df}} \qquad (5.7)

And its complement:

t = \frac{r \sqrt{df}}{\sqrt{1 - r^2}} \qquad (5.8)

So, if you in R use the function cor.test (raises, rating) (after attaching the attitude
frame) you get both the t and the r values, plus the confidence intervals for the correlation. The
degrees of freedom (df ) are equal to the number of cases or variable pairs minus 2.
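A quick way of checking equation 5.8 against the output from cor.test () (a sketch, again using the attitude data set):

attach (attitude)
r <- cor (raises, rating)
df <- length (raises) - 2              # number of pairs minus 2
r * sqrt (df) / sqrt (1 - r^2)         # t computed from r, equation 5.8
cor.test (raises, rating)$statistic    # the t value reported by cor.test ()
detach (attitude)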
Finally, the corresponding relationship between the multiple correlation (R²) and F values:

R^2 = \frac{F k}{F k + (N - k - 1)} \qquad (5.9)

F = \frac{R^2 / k}{(1 - R^2) / (N - k - 1)} \qquad (5.10)
The last pair of formulas are written in the conventional notation for multiple regression,
where k is the number of independent variables, and N is the number of cases. The same two
formulas can be expressed in the notation used in the analysis of variance world:
R^2 = \frac{F \, df_b}{F \, df_b + df_w} \qquad (5.11)

and:

F = \frac{R^2 / df_b}{(1 - R^2) / df_w} \qquad (5.12)

As a general observation, if you test a correlation with the cor.test () function or use the summary () function on the output from a multiple regression (the lm () function), you get both types of values; in the other world you get only the χ², t, and F values.
As a matter of fact, there is not only a clear relationship between each pair of values, but one may regard the other formulas as special cases of the R² and F values 13 . If for instance you used a multiple regression function to get results on the relationship between two variables (one dependent and one independent variable), then the R² is nothing else than the squared r or ρ value. The same holds for the corresponding τ (tau) value: F = t² when there is one dependent variable (dependent means) or two groups (independent means).
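The same kind of check can be done for equations 5.9 and 5.10 (a sketch using the attitude data set):

model <- lm (rating ~ complaints + privileges + learning, data = attitude)
R2 <- summary (model)$r.squared
k <- 3                                     # number of independent variables
N <- nrow (attitude)                       # number of cases
(R2 / k) / ((1 - R2) / (N - k - 1))        # F computed from R-squared, equation 5.10
summary (model)$fstatistic                 # the F value (and dfs) reported by summary ()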

5.5.1 For the Courageous

The relation between the different techniques becomes very obvious when one looks at the computation of the multiple correlation as expressed in terms of matrix algebra. For instance, the formula for computing the regression coefficient b in the bivariate case may be written as b = SS_{xy} / SS_x if the two variables contain deviations from the mean. If there is more than one independent variable, the formula for the computation of the unstandardized regression coefficients is b = (X^T X)^{-1} X^T y, again assuming that the columns of X contain deviations from the mean, where ^T and ^{-1} refer to the transpose and the inverse of a matrix respectively. The (X^T X) part is the same as the SS, and the ^{-1} means the inverse of the SS (one cannot do division in matrix algebra; instead one multiplies by the inverse). The expression then becomes b = SS_X^{-1} SS_{Xy}. The similarity is obvious.

13 The same relationship holds if you increase complexity by having more than one dependent or response variable. This covers canonical analysis and discriminant analysis.
If you want to test this yourself, consider using the scale () function on the data set with
the center argument set to TRUE and the scale argument to FALSE. That gives a matrix with
deviations from the column means. The function t () gives the transpose of a matrix, while the
ginv () function in the MASS library computes the (general) inverse of a matrix. To do matrix
and vector multiplication, use the operator %*%. Try it and compare the results you get with the
output from the lm () function with the same variables.
When I try to reproduce the results on page 52, i.e. when using rating as the response variable and complaints, privileges and learning from the attitude data set as independent or predictor variables, I get the following results:

            [,1]
[1,]  0.6824165
[2,] -0.1032843
[3,]  0.2379762

Can you do the same?
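One possible way of doing this is sketched below; the exact commands may of course be varied:

library (MASS)                                    # for the ginv () function
X <- scale (attitude[, c("complaints", "privileges", "learning")],
            center = TRUE, scale = FALSE)         # deviations from the column means
y <- scale (attitude[, "rating"], center = TRUE, scale = FALSE)
b <- ginv (t (X) %*% X) %*% (t (X) %*% y)         # b = (X'X)^-1 X'y
b
coef (lm (rating ~ complaints + privileges + learning, data = attitude))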

Chapter 6

Resampling, Permutations and Bootstrapping

This chapter covers what are called computer intensive techniques that are alternatives to the classical tests of significance. This covers three subclasses called (a) randomization tests, (b) bootstrapping, and (c) jackknifing, where the first two are covered here.

In the foreword I said that this book is less about statistics than data analysis. That is still true,
but statistics cannot be avoided altogether. What is covered in this section are some techniques
that can be regarded at the very least as supplements to the classical tests of significance, and in
some cases as replacements for these tests. Here the capabilities of R as a dedicated programming
language are very useful.
Everybody knows that at the foundation of the classical tests of significance there are assumptions about the distributions of the variables in the populations, for instance that the population distributions are normal or Gaussian, or at least known. What everybody also knows is that this assumption is not reasonable in very many cases; there may be all kinds of good reasons to expect other types of distributions. In other words, the normal distribution is and has been a useful fiction, no more and no less.
Furthermore, the need for maintaining this fiction was based on a very obvious fact: the lack of computational aids when researchers started using statistical tools. You simply had to employ mathematical and computational shortcuts, and this resulted in the classical parametric methods. However, given modern computers, there are opportunities to supplement the classical tests with more flexible techniques based on less rigorous assumptions, and these techniques are often called non-parametric. In addition, this class of techniques gains practical importance when you want to say something about a statistic where the sampling distribution is unknown, e.g. the difference between two medians or Cronbach's α.
So, this section is about techniques that are often called computer intensive methods or resampling techniques. The computer intensive label is fair. These are methods where computers are programmed to perform the same operations thousands of times on different subsamples of the data set.

This section contains several functions and scripts. If you are unsure about how these work, it might be a good idea to have a look at the chapter called Scripts, functions and R below first.

6.1 Resampling

There are at least three classes of techniques within this category of procedures:
Permutation (also called randomization) tests
Bootstrapping
Jackknifing
In simple terms, the basic assumption for all three seems to be the following:
The observed sample is representative of whatever population from which the sample is drawn.
This apparently simplistic assumption moves much of the sampling problem from the computational domain well into the design part of the study. Classical references include (Edgington, 1995) and (Manly, 1991).

6.1.1 Permutation (Randomization) tests

Complete Permutations
This technique is illustrated with a test of a correlation, and to simplify matters to begin with, let
us start with an example of a really trivial size. Consider the following data:
case  y  x
   1  5  1
   2  8  2
   3  6  3
   4  9  4

The (product-moment) correlation between the x and the y variables in this data set is equal to 0.7071. If we want to do a significance test on this correlation, we could do the conventional test (even though the sample in this case is way too small), where the assumption is that the two variables have a (bivariate) normal distribution in the population from which the sample is drawn.
The alternative is to ask a somewhat different and naive question:
Of all the possible correlations one can compute on these data, how many are larger than or equal to the correlation for the observed data?
After all, a null hypothesis would say (or at least imply) that (a) there is no systematic relation
between the two variables, therefore (b) it does not matter which value in x is paired with which
value in y. Let us therefore look at all the possible combinations or permutations of the numbers
and compute the correlation for each of them. If H0 is false, the observed correlation should be
unusual compared with what is possible to generate from the observed data.
When there are 4 pairs of numbers, there are only 24 different sequences. We already have one,
and table 6.1 contains the remaining 23 permutations:
In order to observe what has happened with each permutation, the x variable in this case
consists of the numbers from 1 to 4. Of course, in a more realistic setting this will not be the case,
I have only done so in this very fictitious data set to make it easy to see that I have gone through
every possible sequence of the values in the x variable.
If we do so and then compute the correlation for each of the sequences, we see that a total of 5 of the 24 possible sequences (the observed one plus the four marked with an asterisk in table 6.1) give a correlation equal to or larger than the observed value.

No.  Permutation       r        No.  Permutation       r
 2   1 2 4 3      0.2828        14   3 1 4 2     -0.7071
 3   1 3 2 4      0.9900 *      15   3 2 1 4      0.4243
 4   1 3 4 2      0.1414        16   3 2 4 1     -0.8485
 5   1 4 2 3      0.8485 *      17   3 4 1 2      0.1414
 6   1 4 3 2      0.4243        18   3 4 2 1     -0.2828
 7   2 1 3 4      0.2828        19   4 1 2 3     -0.4243
 8   2 1 4 3     -0.1414        20   4 1 3 2     -0.8485
 9   2 3 1 4      0.8485 *      21   4 2 1 3     -0.1414
10   2 3 4 1     -0.4243        22   4 2 3 1     -0.9900
11   2 4 1 3      0.7071 *      23   4 3 1 2     -0.2828
12   2 4 3 1     -0.1414        24   4 3 2 1     -0.7071
13   3 1 2 4      0.1414

Table 6.1: Permutations of four values

Therefore the probability of getting the observed correlation under null conditions is 5/24 = 0.2083. This probability is one-tailed, as we have only considered values with the same sign as the observed one. Clearly the correlation is not significant by normal criteria.
The main advantage with this approach is that we have not made any assumptions about
distributions of the variables at all, these are the numbers, and we have looked at all the possible
combinations, as simple as that. So, this is an exact result, not an estimate. If it is reasonable to
apply a particular statistic (in this case a correlation) to a particular set of values at all, this is a neat way of getting a probability for the observed value.
Doing the permutations and the computations of the correlations in this case is a dull, but not
impossible job, at least with data sets as small as this 1 .
However, the number of possible permutations rises very fast with an increase in the number of cases or variable pairs. The number of permutations is equal to the factorial of n, written n!, which means that with an n of 5 we have 120 permutations, with 6 there are 720, and with 7 there are 5040 possible sequences. When n equals 10, there are more than three million (3628800 to be precise). So, for anything more than very trivial data sets we have to use a computer to even think about this approach. Even then we need a somewhat different strategy that could be used in more realistic settings, for the very simple reason that we cannot (nor do we need to) run all the possible permutations.
The solution is simple: extract the results for a random subset of the possible permutations, of a reasonable size, e.g. 1000 random permutations. If we want to give
a name to this approach we could call it random permutations, while the one described above
would be called complete permutations.
Random permutation tests using R
The trick to use in R is the sample () function:
sample (x, size, replace, prob)

The first argument to the sample () function, x, is a vector containing the data to be resampled, or the indices of the data to be resampled. The size option specifies the sample size, with the default being the size of the population being resampled, i.e. the length of the vector x. The replace option determines the type of sample to be returned; if FALSE the sampling is without replacement, and this is the default. In other words, for a permutation test the value should be FALSE, and TRUE for the bootstrapping approach described below. The prob option takes a vector of length equal to the length of the first argument, containing the probability of selection for each element of x. If omitted, the default is to have an equal probability for all observations in the vector x, which would be the normal thing.

1 If you look at the computational formula you will see that you do not have to do all the computations for the correlation coefficient. The only thing that varies between the permutations is the value of SS_xy.
This is of course a very useful function to use in programming this task. Here is a primitive
attempt at using R to define a function to get a test of significance of this type:
perm1 <- function (xx, y, iterations) {
    x <- xx                       # copy the data, to avoid returning a shuffled vector
    r.obs <- cor (x, y)           # the observed correlation
    count <- 0
    for (i in 1:iterations) {
        x <- sample (x)           # shuffle the values in x
        if (cor (x, y) >= r.obs)
            count <- count + 1
    }
    p <- count / iterations
    cat (count, "correlations of", iterations, "iterations, p =", p, "\n")
    return (p)
}

The function itself is quite simpleminded 2 and far from optimal, but illustrates the basic idea.
The first set of operations is to (a) copy the data to a new vector (to avoid returning a shuffled
vector when the whole thing finishes), (b) compute the observed correlation between the two
vectors, and (c) set a counter equal to zero (not really necessary, but I like things to be explicit).
The next stage is to repeat a small set of operations as many times as specified in the third
argument on the call of the function: (a) shuffle the values in the x vector with the sample ()
function 3 , (b) compute the correlation between the two vectors with the cor () function, and (c)
test if the new correlation is larger or equal to the observed correlation, if so, increase the value of
the counter by one.
The final stage is also simple, divide the counter by the number of iterations, and that is the
value for the significance test. If I then use this function with two of the variables in the attitude
data set with the call on the function above, perm1 (raises, critical, 1000), I get a value for p of 0.018. Since the function only counts the larger values of a positive correlation, this is a one-tailed test.
One thing to observe is that this value is not equal to the one obtained by the cor.test ()
function using the same two variables:
> cor.test (raises, critical, alternative="greater")
Pearson's product-moment correlation

data:  raises and critical
t = 2.153, df = 28, p-value = 0.02004
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.07969994 1.00000000
sample estimates:
     cor
0.376883

2 For one thing, the function above will only give a correct result for a one-tailed test of a positive correlation. To get a two-tailed test you will have to operate on the absolute values of the computed correlations. If you want a one-tailed test of a negative correlation you have to include a line below the computation of the p value like: if (r.obs < 0.0) p <- 1.0 - p.
3 Since the sampling is without replacement, we can reshuffle the previous sequence; all the values are present in all sequences. In other words, we do not have to maintain a copy of the original sequence.

The difference is not large, but it is there. The cor.test () function performs a parametric
test and is based on a comparison of the results with a theoretical distribution. The test above is
non-parametric in the sense that the result is compared with a distribution based on the original
set of numbers. What is perhaps more disturbing is that if you repeat the call on perm1 () with the
same two variables, you will probably get a different result. How is that possible? The answer is
simple: the results from this type of procedure are dependent on pseudo-random numbers. See part 6.1.4 on page 66 below for an explanation.
What about a confidence interval? If you want that, the function has to be expanded a bit, where the most important change is that all the correlations from the shuffled samples have to be saved in a vector, in order to be able to look at the distribution of the values when the iterations in the for loop have finished.
In that case, what needs to be done to the function above is:
1. Add a line before the for () loop: samplecor <- numeric (iterations). This defines a vector where all the correlations can be stored.
2. Replace the if (cor(x, y) >= r.obs) line inside the loop with three lines: (1) r <- cor (x, y), (2) samplecor[i] <- r, and (3) if (r >= r.obs). This has the effect that (a) a new value for r is computed, (b) the new value for r is added to the vector for each iteration of the loop, and (c) we have the same counter for correlations larger than the observed one as before.
3. After the loop is finished, we have all the generated values for the correlation in the vector called samplecor. The final operation is to find the upper and lower limits for the interval we want. If we want a 95% confidence interval, we would place the following line before the return () command: print (quantile (samplecor, c(0.025, 0.975))).
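Putting those three changes into the function gives something like this sketch (the name perm2 is just a suggestion):

perm2 <- function (xx, y, iterations) {
    x <- xx
    r.obs <- cor (x, y)
    count <- 0
    samplecor <- numeric (iterations)      # room for all the generated correlations
    for (i in 1:iterations) {
        x <- sample (x)
        r <- cor (x, y)
        samplecor[i] <- r                  # store the correlation for this shuffle
        if (r >= r.obs)
            count <- count + 1
    }
    p <- count / iterations
    cat (count, "correlations of", iterations, "iterations, p =", p, "\n")
    print (quantile (samplecor, c(0.025, 0.975)))
    return (p)
}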
The outcome
This type of approach can be used to obtain a significance test for any summary statistic for pairs
of variables you might be interested in, not only for a standard correlation. 4 The outcome of
the procedure is the distribution of the statistic you are interested in, where it is possible to find
the probability of getting your results, given your data. Since the observed value is compared with
a distribution generated from your own data, the test is distribution free, not dependent on any
assumptions about population distributions like normality, etc.
Another important point is that you are not dependent on a known distribution for the statistic
you are interested in. The procedure works with whatever statistic you are able to define in the R
language, and always in a distribution free fashion.
4 However, we cannot use this strategy for grouped data where one of the two variables represents group membership, for the simple reason that we then should only consider permutations which involve shuffling of cases between groups. That gives a much smaller number of permutations, equal to the factorial of the total n, n_t!, divided by the product of the factorials of all the group sizes, e.g. n_1! n_2! n_3!.


The drawback of permutation tests is that they become more complex as you want to apply
resampling to situations beyond simple correlations. That is probably the reason that using permutation tests is not as common as the next type of resampling tests, bootstrapping. So, that
theme is the subject of the next section.
In any case, the basic advantage of this type of test is that it is possible for a researcher to
construct their own significance tests. To quote (Edgington, 1995), page 16:
The idea of a researcher possessing both the expertise and the time to develop a new
statistical test to meet his special research requirements may sound farfetched, as indeed it would be if a special significance table had to be constructed for the new test.
The original derivation of the sampling distribution of a new test statistic is time-consuming and demands a high level of mathematical ability. On the other hand, if
the researcher determines significance by data permutation, not much time or mathematical ability is necessary, because the significance table is generated by data permutation.
That is a very interesting statement. In addition he quotes (Bradley, 1968):
Eminent statisticians have stated that the randomization test is the truly correct one
and the corresponding parametric test is valid only to the extent that it results in the
same statistical decision.
One of these eminent statisticians is Sir Ronald Fisher; another is Oscar Kempthorne. Neither of them unknown.

6.1.2 Bootstrapping

The common element between this and the permutation test is that the original data are used to
arrive at a result, rather than a comparison with a theoretical distribution.
In many situations it is obviously somewhat dubious to assume that we have a normal distribution in the population. As one example, consider getting an IQ score from a small sample of
students in a particular class at a university. It is (perhaps) reasonable to assume that the actual
distribution would be skewed to the right, as it is more likely that students with high IQs are
members of the class than students with relatively low IQs. Admittance to a university is normally based on grades, and grades are positively correlated with measures of IQ. In other words,
the correct sampling distribution for a particular student population may not be a normal distribution at all. In addition one could easily imagine that there would be large differences in this
respect between different fields; at most universities it is much more difficult to be admitted into, say, medicine than into fields like literature. This would affect the sampling population for variables
like IQ. In other words, using standard parametric statistics may not be the correct thing, at least
with that type of data.
For that reason alone it may be better to look into alternatives. What was done in permutation
tests was based on systematic or random shuffles of the data for one or more of the variables
involved in the computations, changing the data for each case.
In that respect, bootstrapping is different. Here the data for each case are kept intact, but the
computations are done on a large number of random subsets of the cases in the data set. Typically,
with bootstrapping each sample is of the same size as the original, and therefore we are talking
about sampling with replacement. In other words, what we do is to treat our sample as the population and use random sampling from that population to obtain a sampling distribution for the
statistic we are interested in.


The same sample () function as used above is used in this procedure, this time with an additional argument. As an illustration consider the following command for R and the resulting
output:
> sample (15, replace=TRUE)
[1] 14 10 14 11  8  3 12 12 10 11  9 15 15 11

The call on sample () generates a random sample with replacement from the sequence of
numbers from 1 to 15 (Note: If you try the command in R, the probability that you get the same
sequence of numbers as in the output above is very low.). All the 15 values have the same probability of being extracted. In this case some values are not present in the generated subsample, and
some are repeated. However, the size of the subsample is the same as the original.
Confidence Interval for the Mean
The purpose of a confidence interval is to show the likely interval within which the observed mean
would fall if the sampling were to be repeated. But the big question is simply: what do we mean
by confidence? We can generate confidence intervals by defining different widths of the intervals
we need. The higher the confidence, the wider the interval.
The parametric approach to obtaining the confidence interval is based on (a) the mean of the
sample, (b) the variance of the sample, (c) the size of the sample, and finally (d) the distribution of
the statistic in question in the population, where we normally cannot be sure.
In contrast, the bootstrap approach is to generate a large number of subsamples using the
sample () function above and compute the mean for each. The commands needed for a very basic
analysis of this type are simple:
> a <- numeric (1000)
> for (i in 1:1000) {
+    a[i] <- mean (sample (attitude$advance, replace=TRUE)) }

The first statement defines a vector with room for 1000 values. The next step is to define a loop
which is repeated 1000 times. For each repeat the mean is computed for a random subsample of
the same size (N=30) as the original and placed in the vector a. So, when the loop is completed,
we have 1000 means in this vector. The next step is to have a look at the distribution of these
values:
> mean(a)
[1] 42.85657
> quantile (a, c(0.025, 0.975))
    2.5%    97.5%
39.43167 46.50167

Note that if you repeat the commands above, you will not get exactly the same results as above,
but something that is very close. This needs to be explained, and that explanation is found in
section 6.1.4 below.

6.1.3 Using the boot package

There is a package or library in R called boot which contains a function for bootstrapping with
the same name. This is a very advanced package, but may be used in a simple manner. Since you
might want to experiment with this function, it is a good idea to operate with a script in a small
text file. Therefore imagine that we have a small file called bootstrap.r in the same directory as
your workspace containing the following lines:


library (boot)
attach (attitude)
# set.seed (1234)
start <- Sys.time()
d.fun <- function (d, i)
    mean (d[i])
x <- boot (advance, statistic=d.fun, R=1000)
print (x)
cat ("Bootstrap mean = ", mean(x$t), "\n")
print (quantile (x$t, c(0.025, 0.975)))
difftime (Sys.time(), start, units="secs")
detach (attitude)

Then this script is used with the source () command, e.g. source (file="bootstrap.r").
The heart of the script is the command x <- boot(advance, statistic=d.fun, R=1000), where
all the results from the bootstrap are stored in an object named x. The data to be used is found
in the vector advance from the attitude data set (the same one as in the previous example).
For each of the 1000 samples (defined by the R= argument), the function d.fun () is used as
named in the statistic argument. In this case it is assumed that the function takes two arguments, where the first is the data and the second is a vector with the indices of the rows (cases) for the
subsample. Now that function may do whatever you want, but the value returned by the function
is treated as the value to be bootstrapped. In this case it is very simple, the function returns the
mean of the generated subsample. The output obtained from the commands above is:
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = advance, statistic = d.fun, R = 1000)

Bootstrap Statistics :
    original        bias    std. error
t1* 42.93333  -0.04966667    1.860136

    2.5%    97.5%
39.46583 46.70167

Note also that if you set the seed in these commands to the same value as used in the code
snippet on page 65, you get exactly the same values from the quantile function as above.
In other words, you get a confidence interval for whatever the function d.fun () returns,
which may be something as simple as in this case, a mean, or something much more complex, like
the value of a multiple correlation.
When the function boot () is finished, all the values returned by your function are stored in the t vector, a component of the bootstrap object x. These values may then be summarized any way you
want. In this case the bootstrap mean is printed, together with the 95% confidence interval.
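As an example of a more complex statistic, here is a sketch of the same approach applied to the squared multiple correlation from a regression on the attitude data (the function name rsq.fun is made up here):

library (boot)
rsq.fun <- function (d, i) {
    # d is the data frame, i the row indices for the current bootstrap sample
    summary (lm (rating ~ complaints + learning, data = d[i, ]))$r.squared
}
x <- boot (attitude, statistic = rsq.fun, R = 1000)
print (quantile (x$t, c(0.025, 0.975)))    # 95% bootstrap interval for R-squared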

6.1.4 Random numbers

The reason for the variation in the results with resampling techniques is that the operations in the
function sample () are based on what is called a quasi-random or a pseudo-random number
generator. By quasi-random is meant that each time the random number generator is called
or activated, the result of the previous call on the function is used to generate the next random
number. In other words, a particular sequence of random numbers is reproducible, provided that


the start value for the sequence is set. The start value is called a seed, and it will normally be different every time you call a function that depends on random numbers (e.g. rnorm ()), since by default it is based on the internal clock of the computer.
However, if you need to reproduce a sequence of random numbers, you can set the start value
with the set.seed () function, e.g. set.seed (12345). It is easy to try that out. First set the value
for the seed, and then call rnorm () twice with a small number as the argument, e.g. rnorm (5).
The two sets of random numbers will be different. If you then call set.seed () again with the
same value as the previous call and then rnorm (5) again, the last sequence of numbers will be
the same as obtained from the first call on rnorm (5).
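In commands, the little experiment described above looks like this (any seed value will do):

> set.seed (12345)
> rnorm (5)          # a first set of random numbers
> rnorm (5)          # a different set
> set.seed (12345)   # reset the seed to the same value
> rnorm (5)          # identical to the first set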
The idea of a reproducible random sequence of values may seem to be a contradiction in terms,
nevertheless the generated values themselves behave as if they were completely random. Controlling the seed may in many situations be very useful, e.g. if you want to try out variants of one of
these tests and at the same time be sure that variations in the results are caused by your variations
and not by the generated numbers.

6.1.5 Final Comments

In respect to the example above for obtaining the confidence interval for the mean, it is simple to obtain a confidence interval based on the classical normal distribution. If we do so, we would get somewhat different results from the bootstrap procedure. However, I would prefer the bootstrap confidence interval over the classical one, for the simple reason that there are fewer assumptions involved. In addition, if the distribution of the data from which the confidence interval is computed is very skewed or differs from a symmetric distribution in other ways, normal theory results can be directly misleading.
Furthermore, the general approach used in the bootstrap procedure for obtaining confidence intervals is very general, and can be applied to any statistic you can think of, even those for which there is no known sampling distribution. An example of the latter type is Cronbach's alpha, where bootstrapping is the only way to get a confidence interval.
Bootstrapping is often used in more complex techniques, e.g. most programs for Structural
Equation Modeling (SEM) have resampling as an option. 5
There are several packages in R oriented toward bootstrapping, one is the library boot which
is part of the standard installation. This is an extensive package (the documentation for the package in PDF format is 118 pages long) and contains all or most of the procedures described in
(Davison & Hinkley, 1997). For more references in this field, see the R help for the boot package
(first library (boot), then ?boot).

5 Using bootstrapping in SEM can in some cases be very informative in other ways than for obtaining confidence intervals for the parameter estimates. In the data analysis for one fairly recent study I had three alternative SEM type models, all theoretically meaningful, where it turned out that the most attractive alternative also was the one that was numerically speaking the most unstable in a mathematical sense. To obtain 500 bootstrap samples, more than 800 subsamples had to be generated. In other words, this was a highly unstable model and was rejected on that ground alone, in spite of having the best fit of the three models when the complete sample was used.


Chapter 7

Data Management
Every time you initiate a project, it implies much more than the ability to read
a data set and start a data analysis of a particular type. Those operations are
usually quite simple, often almost trivial. The really important part does not
involve the statistical tools directly, nor the collection and management of the
initial data as discussed in the previous chapter, but a number of smaller tasks
oriented towards managing things. This and the next chapters are somewhat
less oriented towards R as such, and more towards how to support your project in various
ways. This chapter will cover:
Reusing commands.
Basic transformations
More on input and output of data.
Some more comments on Workspaces
How to transfer results from R into a word processor like Microsoft Word.
Some comments on the Power of Plain Text
In other words, chapter 4 covered little more than the basic reading operations
in respect to data frames and the basics of handling missing data (NAs). The
objective in this part is to add some flexibility to the handling of data.

7.1 Handling data sets

Given that a data set is read into memory (is part of the workspace), it is of course possible to manipulate the data set in a number of different ways.

Figure 7.1: Editing the attitude data set

7.1.1 Editing data frames

Once available in the workspace, the contents of a table or data set may be inspected or manually
edited by using the edit () command, e.g.: 1
> revised.mydata <- edit (mydata)

This interface is a bit rough, but works well enough for small data sets. For large data sets it
might be better to edit a data set in a spreadsheet and import the data afterwards, or even better,
use something like the software mentioned in the section on data transfer, combined with notes of
what you have done. Note that the result of the above edit operation is assigned to an object called
revised.mydata. When the edit window is closed, the result (after the editing) is stored in this
object. NB! If the assignment part is omitted from the command, none of the editing operations
are saved. The result of using the edit command will be similar to the window shown in Figure 7.1.
An alternative when entering a completely new data set is to first generate an empty data
frame, and then to fix it:
> dd <- data.frame()
> fix (dd)

When using the fix() command any changes are stored without any assignment.

7.1.2 List the data set

Of course, the contents of the object may be simply listed by writing the name of the object, although that is not really as useful with larger data sets or frames. With a small data set / frame,
this is how it might appear:
> mydata
# This is a test data set
  hori h1 h2 v1 v2 path gender
   148 29 29  8  5  7.0   male
   115 19 27  3  5 12.5   male
   107 15 27  4  6  1.5 female
   134 21 29  4  6  2.0   male
  .......
   105 24 20  1  3  1.0 female
    96 21 22  3  4 25.0 female
   129 22 28  2  4  8.0   male
    85 18 20  4  4 18.0   male

1 Actually, edit () is another of the magical functions, as it can be used to edit almost any kind of object, including the text defining functions. It is only when the argument is a frame that the opened window will look like Figure 7.1.

If the data set is large, it may be impractical to list all of the rows. If that is the case, the head
() and the tail () functions may be useful.

7.1.3 Other useful commands

If you only want a list of the names of the columns or variables in a frame or data set, use the
names ()command instead, e.g.:
> names (attitude)
[1] "rating"     "complaints" "privileges" "learning"   "raises"
[6] "critical"   "advance"

7.1.4 Selecting subsets of columns in a frame

Quite often one needs to operate on a subset of the values in a data set. There are a number of
very flexible features of R in this respect, where square brackets are used to define subscripts. So,
x[2,6] refers to the value in the second row and the 6th column of the frame or object x, while
x[,6] refers to all the values in the 6th column and x[3,] refers to all the values in the third
row. This use of square brackets contrasts with the use of round brackets or parentheses in the R
language, which are used for functions. As an example, suppose one wants to create an object z
consisting of columns 2, 3, and 4 from the attitude data set. This can be done in a number of
different ways. One is to refer to the columns by number:
> z <- attitude[,c(2, 3, 4)]

An alternative is to operate with numbers defined as a sequence, e.g.:


> z <- attitude[,c(2:4)]

Or, even simpler:


> z <- attitude[,2:4]

Of the three variants, the c () function has to be used whenever the list contains a comma as a separator. Using names for the variables works as well, e.g.:
> attach (attitude)
> z <- attitude[,c("complaints", "privileges", "learning")]
> detach (attitude)

In some contexts it may be handy to define the object, not in terms of what is to be included,
but in terms of what should be excluded. Negative indices do the trick:
> z <- attitude[,c(-1,-5:-7)]

This excludes columns 1, 5, 6 and 7. What is left is the same as in the examples above.


7.1.5 Row subsets

The same tricks can be applied to the rows of a frame. Suppose one wants an object containing
the first 10 rows of the frame attitude. Enter the following command:
> z <- attitude[1:10,]

The object z is now a frame which includes the first 10 rows of the object attitude.

7.1.6 Repeated measures

The most obvious difference between R and other systems like SPSS and Statistica in respect to
the organization of data sets is in respect to repeated measures types of design. Suppose you have
collected data on four subjects at three different points of time:
caseno TimeA TimeB TimeC
     1   3.1   3.4   4.2
     2   5.2   5.6   6.0
     3   2.8   3.3   3.4
     4   4.1   4.1   4.4

This would correspond to the structure if the rule of one row per case was followed strictly.
This is, however, not the correct format to use for a data frame in R. In that case the correct structure of the frame would be:
caseno  time measure
     1 TimeA     3.1
     1 TimeB     3.4
     1 TimeC     4.2
     2 TimeA     5.2
     2 TimeB     5.6
     2 TimeC     6.0
     3 TimeA     2.8
     3 TimeB     3.3
     3 TimeC     3.4
     4 TimeA     4.1
     4 TimeB     4.1
     4 TimeC     4.4

The variable time would be a factor type variable, while the variable called measure would be a normal numerical one. The analysis of this set of data could be to use the lm () function, e.g.: lm (measure ~ time) 2 .

2 If you have only two repeated measures, the t.test () function combined with the paired option can handle frames where the two measures are in the same row.
In other words, in this respect the concept of a data frame in R differs from the traditional
GUI programs, and is perhaps more oriented towards the single observations or measures in the
design rather than a strict cases or units type of setup.
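One way of getting from the first layout to the second is the reshape () function in base R; here is a minimal sketch (the name wide for the frame is arbitrary):

wide <- data.frame (caseno = 1:4,
                    TimeA = c(3.1, 5.2, 2.8, 4.1),
                    TimeB = c(3.4, 5.6, 3.3, 4.1),
                    TimeC = c(4.2, 6.0, 3.4, 4.4))
long <- reshape (wide, direction = "long",
                 varying = c("TimeA", "TimeB", "TimeC"),
                 v.names = "measure", timevar = "time",
                 times = c("TimeA", "TimeB", "TimeC"), idvar = "caseno")
long$time <- factor (long$time)            # time should be a factor
long[order (long$caseno), ]                # one row per observation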

7.2 Handling commands

Here there are two basic problems:

How to save the commands as they are entered, and

How to save the generated output.

These basic problems are covered below.

7.2.1 Saving commands

When you exit the system, either by just closing the window or by giving the command:
> q ()

You are asked the question Save Workspace image (see the section on the Workspace above). If
you respond with a yes, two things happen:
All the objects generated in the session are saved in the .Rdata file in the active directory and
will be restored the next time you open this file.
All the commands written during the session are stored in a file called .Rhistory.
The latter file may be copied to another file or simply renamed. The basic effect is the same; you have a list of all the commands issued during the session. This includes everything: false starts, simple errors, plain stupidity, absolutely everything. So, from there on you have two alternatives.
You can either reuse the set of commands without modification (not very probable), or edit it with
an editor first and keep the important / correct operations in a log.

7.2.2 Using saved commands

Once a set of commands has been saved to a file, it can be edited and used again. This type of file is called a script. For instance, if you have a set of commands stored in a file called commands.r, you can run these commands by the use of the source () command:
> source ("commands.r")

In this version, the command is very silent; nothing is printed on the screen. But if the file includes the sink () command, the output is written to the file named in the sink () command, in this case the file called output.text. So, if you have a file called commands.r which
contains the following lines:
sink ("output.text")
small <- read.table ("small.text")
small
sink()

and you issue the command:


> source ("commands.r", echo=TRUE)

All the commands in that file will be executed without anything appearing on the screen 3 .
3

Older computer users will recognize the combination of sink () with the source () commands as a kind of
batch operation, executing a set of predefined commands. This mechanism is a handy tool if a set of operations
is complex and needs to be checked carefully, or if the operations take a long time to complete. Besides, both the
command file and the output file represent documentation for the operations. My impression is that most researchers
in psychology ignore the need for documenting ALL parts of the research process, which includes the data analysis part
as well. It is fairly obvious that there is a trend towards having to be able to include this information when publishing
research. This mechanism is a possible solution to that problem.


However, since the file contains the sink () command with a file name as the argument, all
output is redirected to that file. The storage of the results in this file is ended by the second call
(without a file name). The contents of the file output.text would in this case be what is seen in
table 7.1.
If the echo part of the source () command is omitted, only the output itself is included in the file, not the commands. That is less than useful for most purposes.
In any case, the source () command can be very useful, for instance for loading all data and doing basic transformations at the start of a session. Keep the acronym DRY in mind: Don't Repeat Yourself.
More information about this subject is found in the discussion of scripts and functions in part 8.

Table 7.1: File output.text

> small <- read.table ("small.text")
> small
     V1 V2 V3 V4 V5   V6
1  hori h1 h2 v1 v2 path
2   148 29 29  8  5  7.0
3   115 19 27  3  5 12.5
4   107 15 27  4  6  1.5
...
18  105 24 20  1  3  1.0
19   96 21 22  3  4 25.0
20  129 22 28  2  4  8.0
21   85 18 20  4  4 18.0
> sink ()

7.3 Transformations

One of the things to keep in mind is that R in one sense is a calculator oriented towards matrices as
well as statistical calculations, and as such it is very nice for simple or complex transformations of
data sets using simple commands. However, as with all statistical systems it is a VERY good idea
to have a plan and to keep track of what you have done to the data set or frame before starting the
analysis. For that reason it is a good idea to keep certain rules in mind:
Always have a copy of the original data, do the operations on a copy. If you do so, you can
always backtrack.
If you are not 100% sure about what a command really does, make up a small data set if
possible, and try it out.
For anything but a trivial set of operations, use a script combined with a source () command as described in part 12, then you have a record of what you have done at the same
time.
Always check that the operations work as intended. One trick is to start with a small subset (e.g. the first 10 cases or rows in a frame, e.g. x <- attitude [1:10,]) and carefully
print the intermediate results during the process. When everything is OK, delete the print
commands in the script and use it on the full data set.
Include comments in the script (any text preceded by a #) explaining the intention of the operations. This is part of the documentation for the research project. That is what
good programmers do, and the same should hold for researchers.
Save the script in a file with a meaningful name.
There are things to be aware of. One is that if you have used attach () on a data set, you
have access to the named columns in the attached object as single objects. But note that these are copies of the columns from the frame, not the columns in the frame itself. This may be handy, but in
order to avoid surprises, you might want to ensure that any operations or transformations refer
to the columns in the frame itself if that is what you want. To show this, try something like the
commands below:

> attach (attitude)         # Attach a data set
> raises                    # Print "raises", a variable in the data set
> raises <- raises + 100    # Add 100 to all the elements in raises
> raises                    # See what has happened to "raises"
> edit (attitude)           # Inspect the data set or frame
> detach (attitude)         # Cleanup

You will see that the corresponding column called raises in the dataset (frame) itself has
NOT been changed, although the values in the vector raises have. If you want the frame itself
to be updated you have to replace the line:
> raises <- raises + 100

with something like:


> attitude[,5] <- attitude[,5] + 100

Or:
> attitude$raises <- attitude$raises + 100

to have the desired effect of changing the values in the frame itself (raises is the fifth column in
the frame). On the other hand, you might want to ensure that transformations are done on copies
of the columns, precisely to avoid changing the frame itself. Either way, you should know what
you are doing.
Column operations
Since the R system is oriented towards matrix operations, doing operations on all the values in
data columns at the same time is done with very compact commands. For instance, obtaining a
vector containing an elementwise sum of two of the columns in the attitude frame is simple:
> scores <- learning + raises

Assuming that you have used the attach () command with the attitude data set previously. Print out the two vectors as well as the scores vector to ascertain that the results are correct.
Similarly, assume that you have a vector containing reaction times or latencies called times.
These types of data are often quite skewed, and you may want to pull in the tail of the distribution, reducing the influence of the larger values. A common trick is to transform the values by
using log (logarithm) values instead. When using a command like:
> logtimes <- log10 (times)

You now have a new vector containing log values of the original. As still another example, you
may want to have a column tacked on to a copy of the frame attitude named newattitude,
containing rank values of the raises column. The following commands could then be used:
> rankraises <- rank (raises)
> newattitude <- cbind (attitude, rankraises)

The first line ranks the values in the raises column (including handling of ties), and the
second line binds the column generated by the first line on to the attitude frame and
stores the result in a new frame called newattitude. Using the command names (newattitude)


shows that there is a now a new column in the new frame or data set, and a print of the vector
rankraises shows the new values (nicely corrected for ties). Another useful command is scale
() which scales (adjusts the standard deviation / variance to 1.0 4 ) and/or centers the values in the
argument (subtracts the mean, by default both). So, scale (x) returns a vector with the z-scores
of the values in the vector x. Another even more general function is sweep () which is worth
having a look at for more specialized tasks.
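A quick check of what scale () does (a sketch using one of the attitude columns):

z <- scale (attitude$raises)     # centre and standardize
mean (z)                         # practically zero
sd (as.vector (z))               # exactly 1, based on division by N - 1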
These operations demonstrate only a few of the possibilities. There are a huge number of
useful functions and possible operations as well as basic concepts used in programming, such as
loops (for, while and repeat), and the construction of functions. However, you have to be sure
that the operations really work the way you think they do. For that reason alone it may be a good
strategy to start with a small data set (a subset of the data) and check the results carefully.

7.4 Input and Output

Normally, all input and output to the system are based on text files which are simple to handle,
either with the editor included with R, or with simple text editors (e.g. Notepad or Wordpad which
are available on any Windows machine). There are other alternatives. The main drawback with
programs like NotePad is that you normally have only one file open at a time. Another drawback
is that you do not have aids like syntax highlighting, very useful when entering scripts.
See part B.2 on page 107 for alternatives in respect to editors.

7.4.1 File Names

There seem to be three basically different types of files used by R, each with a different purpose. It is a good habit to differentiate between them by using different file extensions.
Files containing scripts, i.e. small files with reusable sets of commands. I prefer having an extension of .r for these files, e.g. Anova.r.
Files containing data or frames for R. I use the extension .data for this type of file, e.g.
Twoway.data.
Files containing output from the system. My preference in this case is to use an extension of
.text, e.g. Anova.text.
Your choice in respect to extensions does not really matter, but it is a good idea to be consistent
in your usage. As you work, you will probably accumulate a fair number of files, and it is usually
convenient to be able to infer something about the contents of the file from the file name. It is
also a good idea to instruct the text editor you are using to open files with these extensions
automatically when you double-click the file name in File Explorer.

7.4.2 Input

Input to the system is normally either as (a) data frames stored as text, either in files or in the
clipboard, or (b) scripts, i.e. collections of commands stored as text files.
4 Note that the scale () function adjusts these values based on the column's SS divided by N - 1, and not N.


Importing data frames


This theme is covered in part 4.4 on page 22 above, and needs no expansion.

7.5 More on workspaces

The option Change directory in the File menu is used to change the location where the
workspace (the file .Rdata) is stored when the session is ended. Any directory where a file with
that name is found is a potential work directory for R, and it is perfectly OK to have more than
one of these directories.
By having more than one work directory, you can keep things together that belong together, or perhaps more importantly, keep things apart that do NOT belong together. Basic
elements for a particular project can be stored in the same place, including anything from the
documentation of the project to data sets. This is in general a good strategy for most researchers.
There are at least two ways to do this.
1. One alternative is to use the R shortcut that was placed on the desktop as part of the installation. First, if necessary, create the directory you want to use as a work directory. Then
right-click on the R icon, and select the Properties item (the last one). Set the Start in
field to the correct work directory, and click the OK button to save the change. When you
exit R and respond with a Yes to the save workspace image question, you will have a fresh copy of the .Rdata file in that directory.
Remember that you can have more than one copy of the R icon, each pointing to different
work directories. This method has the advantage that you can control startup parameters
necessary for using additions like R Commander as well. For that system, and before you
exit the Properties page above, add --sdi after the end of the text (the name of the .exe
file) in the Target field with a preceding space.
2. Another method is simply to copy or save an .Rdata file to a working directory for each of your projects. If you then (on a Windows system) double-click the .Rdata file in that directory (or use a shortcut to that file), the R system is started with that particular workspace and the
default directory being the same as the one where the .Rdata file you are using is located.
At the same time as the workspace is saved, all the commands used during the session are
stored in the file called .Rhistory, located in the same directory as the .Rdata file.
Also remember that the file menu item in the GUI interface contains several options for
saving and loading all the control files with the names you want.

7.6 Transfer of output to MS Word

The default output from R is very simple text, with absolutely no frills. For instance, all formatting
of columns is managed with spaces or blanks, no tabs, nothing extra. This means that transferring
output, like the summary of a multiple regression, to Word or any other word processor by a simple copy and paste operation would work, but would be much less than optimal. To make a decent
table with results in a paper, we need a table in the word processing sense, an arrangement of
things in rows and columns. In other words, with output as plain as in R, a lot of fiddling would
be necessary after a direct copy and paste into MS Word (it is definitely NOT recommended to
reenter the results, as this is an invitation to introduce errors). We need a better solution, and it
turns out that the basic trick is to transfer the information to the document via a spreadsheet. The
steps involved are quite simple:



Table 7.2: Means, Standard Deviations and Intercorrelations

                  M     SD     1      2      3
Rating           65.0  12.0   0.83   0.43   0.62
Predictors
1. Complaints    67.0  13.0          0.56   0.60
2. Privileges    53.0  12.0                 0.49
3. Learning      56.0  12.0
1. Write the output to the clipboard in HTML format (that is, the same format as used for writing web pages).
2. Paste the contents of the clipboard into a spreadsheet (e.g. Excel); this automatically reformats the HTML into something that the word processor (e.g. MS Word) can handle.
3. Edit the output in the spreadsheet into something closer to what you want. Then copy and paste what you need from the spreadsheet to the document.
Actually, the last two steps are the same as the recommended procedure for transferring output from Statistica or SPSS to a document, e.g. the manuscript for a paper. These programs normally generate much more information and formatting than you would want to have in the document, and in the wrong places as well. A spreadsheet is a nice tool for reorganizing the information towards something which you actually need for the generation of a manuscript in APA format. So, the problem is in reality quite general. In this respect, R is at one extreme and SPSS and Statistica are at the other. What we really need is somewhere in between, or in some cases even a combination of output from different methods.
The main point is simple: No statistical program I know of can produce the tables exactly the way I want them to be; the output has in any case to be edited. The most efficient tool for doing so (as long as you are using MS Word) is some type of spreadsheet, e.g. MS Excel, the Calc program in OpenOffice, or Gnumeric.

7.6.1 Table 1

With SPSS and Statistica, the first step is really simple: mark the relevant parts of the output from the programs and paste the information into different areas of the same worksheet (Excel, Calc, Gnumeric etc.). Then drag the different elements you need for the table into place. When all the elements are in place and in the correct format, paste the relevant parts of the spreadsheet into the document. After the paste operation, you then have a table which will only need a bit of polishing, like adding a caption.
When using R, the steps are a bit more complex, but not really more difficult. The trick is to write the output from R to the clipboard in a format that Excel recognizes as something with columns and rows when the clipboard is pasted into a worksheet. From that point on, the procedure is essentially the same as when using the other programs. In this demonstration I will use the attitude data set which is included in R during installation. To simplify matters, and at the same time to be absolutely sure that the variables are in the correct and identical sequence, the first step is to make a data frame consisting of only the variables we want to use. In addition, we take care to place the variable we want to use as the dependent variable in the first column:


> attach (attitude)
> DataMatrix <- na.omit (data.frame (rating, complaints, privileges, learning))
> detach (attitude)

This step is also a smart move in case there are missing observations in the data frame and
you want to apply a hierarchical regression (which is not the case here). Then you should include
only the variables used in the largest model in the frame and apply the na.omit () function to
the results from the data.frame () function before the assignment to DataMatrix to ensure that
all models are computed on the same subset of the cases. So, we use columns from this frame to
make a new frame:
> C <- data.frame (mean(DataMatrix), sd(DataMatrix), cor(DataMatrix))
> library (R2HTML)
> HTML (C, file("clipboard", "w"), digits=2)

The first of these commands results in a new data frame called C with the means of the variables in the first column, the standard deviations in the second, followed by the columns of the correlation matrix. This is the contents of the body of the table above. The second command makes the library R2HTML available. The third command uses one of the functions in that library to print the frame to the clipboard. So, if we now open a worksheet in Excel, we can click on where we want the table to be, and press Ctrl-V.
This is the place for some minor editing:
1. Delete the rating column from the correlation matrix part.
2. Insert a row in the table below the rating row and write the word Predictors in the first
cell of that row.
3. Select the complete table and paste it into your document where you want it.
In Word, do the following:
1. Select the entire table and center it on the page (Table -> Table properties and click on the
Center option).
2. Remove all borders and then add the borders where you want them (one horizontal line at the top of the table, one below the last row, and one below the headers).
3. Delete the contents of the lower triangular part of the correlation matrix and replace the 1s
with -- .
4. Do some minor operations like capitalizing the first letter of the variable names, adding the numbering of the predictors, etc.
You should then have a table like Table 7.2 above.5
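As an aside, and only as a sketch: on a Windows system the same clipboard route can also be taken with write.table (), which writes plain tab-separated text that Excel will split into columns. The result is less polished than the HTML produced by R2HTML, but it requires no extra package:

> write.table (C, "clipboard", sep="\t", col.names=NA)   # Then press Ctrl-V in the worksheet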

7.6.2 Table 2

The next table to generate is the one containing the basic results from a multiple regression, and
we want it to look like table 7.3.
5 This table is based on sample table 15.1 in (Nicol & Pexman, 1999), which expands on the recommendations found in the APA Publication Manual (American Psychological Association, 2001).



Table 7.3: Regression Analysis Summary

Variable         b      SE B      β
Complaints      0.68    0.13     0.75  ***
Privileges     -0.10    0.13    -0.10
Learning        0.24    0.14     0.23

This requires the same basic steps as the one above, where the first step is to compute the
regression model 6 :
> Regression <- lm (DataMatrix)
> HTML (summary (Regression), file ("clipboard", "w"))

By default, when the argument to the lm () function is a frame, the first column is used as the
dependent variable, and the remainder as the independent variables. Since we took care of that
problem when the DataMatrix object was generated, this is what we want.
Now, paste the results generated by the second command to the top of an empty sheet. You
then have more than you need, but let that be for now. As you see from the table we need the
standardized regression weights as well, and the lm () function does not supply these. These
weights have to be computed. The formula is simple, see formula 5.4 on page 55.
So, in this case we enter:
> Beta <- coef(Regression) * sd(DataMatrix) / sd(DataMatrix)[1]

Important: This command will only give a correct result if the sequence and the number of the variables are the same in both the data frame and the regression model. Different sequence, and you get silly results. Different numbers, and you get an error message. You have to be very careful. The last term in the expression is a reference to the standard deviation of the first variable, the dependent one, which we took care to place as the first variable when the object DataMatrix was generated. We do not need the first value in this result (a vector); it belongs to the intercept (constant). The remainder are the beta weights we want. A better solution is to use the following function (submitted to one of the R discussion lists by Thomas D. Fletcher as a response to my post of a question about standardized regression weights):
compute.beta <- function (MOD) {
   b <- summary (MOD)$coef[-1,1]                             # Unstandardized coefficients, intercept dropped
   tmp <- model.matrix (MOD)[,-1]                            # The predictors, intercept column dropped
   sx <- apply (tmp, 2, sd)                                  # Standard deviations of the predictors
   sy <- sqrt (sum(anova(MOD)[,2]) / sum (anova(MOD)[,1]))   # Standard deviation of the dependent variable
   Beta <- b * sx / sy                                       # The standardized weights
   return (Beta)
}

Where MOD is the object generated by the lm () command, and the object returned by the
function is a vector with the beta values, e.g.:
6 The format of the lm () command above may be a bit confusing. However, if the function lm () is called with a frame as the argument, the first column in the frame is used as the dependent variable and all the remaining variables as the independent variables. So have a look at the generation of the DataMatrix object above.


> Beta <- compute.beta (Regression)

If you use this function, you do not have to be as careful about the number and sequence of the variables in the data set (frame). In any case, what we then need for the body of the table is a column with the betas. If the vector is printed directly we get it as a row, but if converted to a data frame, it is printed as a column. So we write this one to the clipboard with the following command:
> HTML(data.frame(Beta), file("clipboard", "w"), digits=3)

And we get the contents of the Beta vector converted to a column. Paste the clipboard to the
spreadsheet in the same manner as above. Now we go back to the spreadsheet for the next steps:
1. Delete the rows above the table with the regression coefficients.
2. Copy and paste the values for the betas (including the header) into the coefficient table,
overwriting the t-values (we do not need them for the table).
3. Delete the column with the p-values, but keep the column to the right of that one, that is
where the asterisks for the significance of the regression weights should appear if any of
them are significant.
4. Adjust the number of digits after the decimal point to what you want (in this case 2).
5. Delete the first row below the headers.
6. Remove all borders and then insert new ones where they should be.
7. Select the whole table and paste it into the document.
In Word:
1. Do the final adjustments, e.g. capitalization, centering the contents of the cells, etc.
2. Write the comments on the results and insert the values for the multiple R2, the value for F, and the degrees of freedom into the text (a small sketch of how to extract these values in R follows below).
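A small sketch of how these values can be pulled out of the regression object in R, so that they can be copied into the text rather than retyped (the object name Regression is the one used above):

> s <- summary (Regression)
> round (s$r.squared, 2)    # Multiple R squared
> s$fstatistic              # F value plus numerator and denominator degrees of freedom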

7.6.3 Conclusion

No statistical package generates tables formatted to the APA standards. Besides, there are a large
number of permissible variants for each type of table. Therefore, the recommended procedure is
to:
1. Paste the results from the statistical program you are using into a spreadsheet as the first
step,
2. Rearrange and edit things as required, and finally:
3. Copy and paste the table into the document.


7.7 Final comments: The Power of Plain Text

The phrase in the title of this section is from (Hunt & Thomas, 2000). What are we talking about?
Plain text is made up of printable characters in a form that can be read and directly
understood by people.
The argument there is simple:
Human readable forms of data, and self-describing data, will outlive all other forms of
data and applications that created them.
Consequently, storing your notes and data as plain text in this sense reduces the potential problems that you or others may have in accessing that information, today and in the future. It is difficult to imagine an operating system which cannot handle this type of information.
As researchers, we should keep in mind that software is normally quite short-lived. When a particular program vanishes from the market, the files that the program generates may gradually become more and more unreadable, unless the files are really written as some variant of a plain text file. As one who has been using computers since the mid-sixties and since then has gone through several changes of computers, operating systems, types of storage media, and software updates, I know perfectly well that any change of this type may easily have the side effect of making parts of the stored information for all practical purposes unreadable.
I am often shocked at the implicit and extremely naive beliefs my colleagues have about how permanent and safe the information they store on their computers is. A book printed hundreds of years ago is still readable today, and the same holds for printed music or mathematics, while computer files written in a dedicated format by some programs only a few years ago may be more or less unreadable today. The one exception is files that essentially are written as text, which can always be recovered in some way, even without access to the original program.
This holds for document files written by some word processors, e.g. WordPerfect and Notabene (they are stored as plain text files with embedded tags), as well as for modern XML file types. The same holds for LaTeX (a very advanced open source text formatting or typesetting program, popular in the sciences), and LyX, a frontend for LaTeX. You may lose the formatting of the text, but you will not lose the most important part, the text itself.
This definitely does not hold for the file formats used by programs like Microsoft Word or Excel, nor for the dedicated file formats used by SPSS or Statistica. All of the latter types are popular with researchers in psychology and the social sciences.
What is worse, the file formats these programs employ are not even very robust. If one of these file types becomes partially corrupted in some way, e.g. by a hard disk error, the normal consequence is to lose all the contents. With a text based file format, you can normally rescue at least part of the information.
Researchers should worry about this state of affairs. Data and text are valuable, in some cases
very much so, and often irreplaceable. If the data files are stored as plain text (at the very least as
a backup), it is very probable that they can be read and used by computers in say, 20 years from
now. A printout of the contents of the file on paper may even be useful as well, as you can scan in
printed sheets and at the very least use them for proofreading 7 .
7 An example: I still have the data I used in my thesis (Johnsen, 1968) (submitted in 1967) and a later article (Johnsen, 1970) on about 2000 punched cards (anybody seen that kind of data storage lately?). Forty years later I have some potentially interesting ideas about reanalyzing the data from that study (changes towards balance in sociometric group structures over time). In other words, the data are unreadable, for the simple reason that I do not have access to a reader for punched cards; the original storage medium is useless now. That problem is probably possible to solve, at least in principle (a technical museum perhaps?). But what I do have is printouts of the data set, which can be scanned. And I have even saved the original forms filled in by the responders, which may be used for a manual data entry. Paper is still useful, and will be so in the future as well. I do not believe that an SPSS file will be readable 40 years from now, but it is very likely that text files like R data and script files will be usable without much trouble, even if the programs themselves have evolved. In other words, using proprietary file formats like .SAV files for long-term storage, i.e. archiving, is not a very good idea.

In contrast, if a data set is stored as, for instance, an SPSS file alone, the probability that it will be unreadable in the quite near future is unacceptably high. In other words, these file formats are worse than useless in an archival sense.
One of the main advantages of R is that it encourages the use of text files both for the scripts and for the data the scripts are working with. That is, in itself, a very good reason for using R for scientific research.


Chapter 8

Scripts, functions and R


One of the really powerful aspects of R is that it is simple to reuse code or sets
of commands, especially when the code is stored in files. If you start exploring
some of these possibilities you will find that there are many situations where
scripts and functions are very useful. Ill start with functions using a very simpleminded example and from there go on to scripts. The common element is
that both are often stored in files.

8.1 Writing functions

One of the big advantages of R is that at its core it is a programming language, and permits you to
write your own functions which may range from the very simple to the complex using the same
language as you are used to for other operations.

8.1.1 Editing functions

With anything but the simplest functions you will need an editor. There is one that is part of the R system. This is accessed by the command:
> myfunction <- edit (myfunction)

where myfunction is the name of the function you are editing, or:
> fix (myfunction)

The difference between the two is that the first needs an object to assign the edited text of the function to, while the second is used to fix an existing function in place. Of course, this presupposes that you already have a function in your workspace called myfunction which needs modification. If you do not have one, you can make a start by entering the command:
> myfunction <- function () {}

where any arguments go between the parentheses, and the commands or source code in the body of the function go between the braces, the { and }. The next step would be to use one of the commands above for adding substance to the function. An alternative is to use a command like:
> myfunction <- edit ()


This starts a new function from scratch. In any case, there is one snag with the editor in R
as used above. If there are syntactic errors in the source, nothing is returned. You only get a
warning giving the line number of the first error found. Therefore it is important to save the text
to a suitable file before returning, especially to begin with. Also, remember to save the workspace
when you exit R.

8.1.2 Sample function 1: Hello world

Let us start at a very simple level. The classical first problem in the programming world when starting on a new language is to write a Hello world program as the very first step, so let's do just that. As the first step, write a command that prints these two words on the console. So, with R active, enter the command:
> print ("Hello World")

The system responds with:
[1] "Hello World"

So far, so good. But now suppose that we want to repeat this operation more than once, without having to write the whole command every time. So, write a command like the following:
> Hello <- function () { cat ("Hello World\n") }

Nothing seems to happen. By the way, the \n part of the string says: when this string has been printed, change to a new line, so everything printed after this command starts on the following line. If you then enter the following command:
> Hello()

The system responds with the same output as above, i.e. Hello World. What is important, every time you enter this command, the same result is obtained. In addition, if you close R and answer Yes to the Save Workspace question, the next time you open R and enter the Hello () command, the same thing happens. So, what have we done so far?
We have written a function, a special type of object, which has a name, in this case Hello ().
When the workspace is saved, the function is remembered between sessions, and can be reused.
Any number of commands placed between the curly brackets are repeated every time the name of the function is entered in the same manner as a command.
A reasonable reaction to this is: So what? We need more than this:
We need to have arguments, something for the function to act on or to treat in some manner.
We need to be able to return some results from the function in some manner, e.g. the result of some operations.
Let us start with the first one. We start with the fix () command, which permits us to edit the contents of objects:


> fix (Hello)

The effect is that an editor window opens with the current version of the function, where we may enter additions or changes. We add the word name between the parentheses after function (this is an argument to the function), and add the argument to the call on cat (), followed by a comma. It should look like this:
function (name) { cat (name, ": Hello World\n") }

Close the editor window. You are then asked if you want to save the changes to the function in the editor window. If you answer Yes, the modified function is returned to R. Now, if we call the revised function with something like Hello ("Alice"), we get the following result on the console:
Alice : Hello World

What we have done is to include in the definition of the function a so-called argument called name, placed between the parentheses. So when the function is used, that particular name is used in the output, and this holds for any name, Alice, Tom, Dick or Harry you may care to try. In other words, we have defined something that the function can act on. This will also work with anything that the cat () function can take: strings like names in quotes, numbers, constants like pi, vectors like mean (attitude), but not objects like the output from a function like lm (). We may also add other commands to the function. Any commands included between the two curly brackets are included in the action of the function; one quite meaningless addition would be something like a count of the number of characters in the name argument. So, we enter the command:
> fix (Hello)

And in the editor window we change the contents of the function to something like this:
function (name) {
   cat (name, ": Hello World\n")
   length <- nchar(name)
   print (length)
   return (length)
}

The output from this version of the function is two lines, one with the previous text, the other with the number of letters in the argument, i.e. the value of name in the call on the function. The next question was: is it possible to return some result from the function, so that when we use the function as follows:
x <- Hello ("Harry")

we have something potentially useful returned by the function and stored in the object called x? This is achieved by the last command, return (length) (in this case the value of 5, which is the number of letters in the name used in the call on the function), which may then be used in other contexts. These are the essential points of functions:
Any number of commands may be subsumed under one name.
One (or more) values may be returned from the function.



They may be saved in the workspace and reused any time.

We have also shown that we may edit the body of the function with the fix () command (we could also have used edit (), which behaves a little differently, being more oriented towards storing the commands in the function in a file, see part 8.1.1 above). However, you should be very aware of at least two things:
So far, this function ONLY exists in the workspace. With anything more serious than this one, it should be stored in a file as well.
There is much more to functions than is covered here, especially in respect to the use of arguments (a small sketch of one such possibility follows below). You should consult the documentation for the system.
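As one small illustration of what can be done with arguments (a sketch, not part of the example above): an argument can be given a default value, so the function can be called both with and without it:

Hello <- function (name = "World") {
   cat (name, ": Hello World\n")
   return (nchar (name))
}
> Hello ()          # Uses the default and prints "World : Hello World"
> Hello ("Alice")   # Prints "Alice : Hello World" and returns 5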

8.1.3 General comments on functions

The roots of R are found in the Unix world, an operating system where the concept of a filter is central.
A filter is a subprogram that takes some input, does something with it according to some rules, and returns the processed information. The function mean () is typical in this respect. It takes the object named in the call on the function (normally a vector, matrix or frame), computes the (column) means, and returns the mean(s) as a vector or a single value. This general type of function is the building block of the language, where the basic rule is that a function should normally do one thing, but do that one thing very efficiently. Very often a seemingly simple function will be written in R, using calls on other, even more basic functions. It is possible to inspect the commands in a function by entering the name of the function alone (without the parentheses). Then you can see what other elements are used by the function and how. Try that with a function like sd ().
However, functions are useful for a lot of other things. Use your imagination and try!

8.1.4 Sample function 2: Compute an SS value

As a simple example of a function, consider the problem of computing an SS (Sum of Squares, or Sum of Squared deviations from the mean). We would like to have a function for this computation to avoid having to write a complex expression every time we need that value for a vector.
The definition of an SS is:
SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2    (8.1)

To obtain the SS, we start by subtracting the mean from all the values. The second step is to square the values, and the final step is to find the sum of all the squared values. First, we define a function:
> SS <- edit ()

The effect is that a window opens for editing, and the function defined in that process is assigned the name SS when the window is closed. As with other objects, it is smart to be careful with what names you use: your name will mask anything else in the workspace with the same name, and that holds for functions as well. In the editor window, type something like:
# Compute sum of squared deviations from the mean for a vector
function (x) {
   x <- x - mean(x, na.rm=TRUE)   # Deviation from the mean for all xs
   x <- x ^ 2                     # Square the deviations
   r <- sum (x, na.rm=TRUE)       # Find the sum
   return (r)                     # Return result
}

Alternatively, and more compact:

# Compute sum of squared deviations from the mean for a vector
function (x) {
   r <- sum ( (x - mean(x, na.rm=TRUE)) ^ 2, na.rm=TRUE)
   return (r)
}

This piece of software starts with the word function. Then follows what is to be the argument(s). Finally we have the body of the function, the part between the braces, the { and the }, where the three steps in the computation of the value for SS are performed. Formally, the braces are not needed when the body of the function consists of only one command, but it does not hurt to include them. When the editor window is closed, the function is ready for use. We can test it by entering:
> x <- c (1, 2, 3, 4, NA, 5)
> SS (x)

And the correct value of 10 for this set of values is printed after the last line. If you need to edit
the function, use the command fix (function name), e.g.:
> fix (SS)

8.1.5 Sample function 3: Improved version of the SS function

Note that this function only gives a correct result for vectors, not for frames nor matrices. With the
wrong kind of data, the result obtained by the function is simply incorrect, without any warnings
or anything. In that sense it is not very good, and it needs improvement. The function should
therefore be changed to something like this:
# Compute sum of squared deviations from the mean for a vector
function (x) {
   if (is.vector(x)) {   # Check if the argument is a vector object
      r <- sum ( (x - mean(x, na.rm=TRUE)) ^ 2, na.rm=TRUE)   # Yes, compute
      return (r)
   }
   else {
      cat ("Illegal input")   # No, generate an error message and return a missing value
      return (NA)
   }
}

Due to the first test in the line after the start of the function (the one starting with if), you get both an error message and a returned value of NA if the function is called with something other than a vector. Also, remember that once the function is written, it is stored in the workspace for the session. If the workspace is gone, the function is gone. For that reason alone, it is a good idea to store the function in a file with an appropriate name as well (e.g. SS.r). Once a function is defined, you may use it again and again with a simple call (the name plus an argument, something to be treated).
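A quick way to convince yourself that the check works is to call the function both with a vector and with something that is not a vector, e.g. a data frame (a sketch only; the exact layout of the output may differ on your system):

> SS (c(1, 2, 3, 4, NA, 5))   # A vector: works as before and returns 10
> SS (attitude)               # A data frame: triggers the error branch and returns NA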


8.1.6 Things to remember

There are a few things to remember when writing functions:

All assignments made inside the function (in the body) are invisible and not available after the call has been completed. Assignments to variables within the function are what programmers call local. This means that you cannot inspect the results of operations inside the function from outside of the function (see the small demonstration below). But that also means that you need not be afraid of changing critical information in your workspace by careless naming of objects in the body of the function.
Normal automatic printing does not work inside a function quite as expected. To produce output from within a function, you have to use one or more output functions to get the effect you want, e.g. print () or cat (), possibly combined with format (). See the output of results from the item analysis in part 5.4.8: Item Analysis for an example.
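A minimal demonstration of the first point (the function and object names here are made up for the occasion):

double.mean <- function (x) {
   tmp <- mean (x, na.rm=TRUE)   # "tmp" exists only inside the function
   return (tmp * 2)
}
> double.mean (c(1, 2, 3))       # Returns (and prints) 4
> tmp                            # Gives an error: the object does not exist in the workspace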

8.1.7 Sample function 4: Administrative tasks

The demonstration above defines a function that is potentially useful in a large number of different settings. That is not necessary; functions may be as specific as you want. You might want to use a function to load a library, change the working directory, simplify the use of a complex command, or generate a particular type of plot for a specific paper. Rather than entering variations of the same commands again and again, a function can be defined and gradually refined towards what you want by successive editing and testing.
Whatever you do, do not forget to include annotation with the function unless the operations really are very trivial. For instance, suppose you have defined a function called Do, which looks like this:
# Startup for project "MyProject"
Do <- function () {
   library (RWinEdt)                                   # Load the editor add-on
   x <- read.table ("reliability.data", header=TRUE)   # Read a fresh copy of the data
   return (x)
}

This is a very simple one which could be used for starting up the data analysis for one project. First it loads a library (in this case the one associated with the WinEdt editor) and then reads a fresh copy of a data set (frame). So, when you start a new session with the command:
> dataset <- Do()

you have the session set up the way you want it, without typing all the commands in the function every time you return to R: the libraries are loaded, and you have a fresh copy of your data set (after possible transformations, added data, extraction of a subset of the columns in the data frame, etc.).
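A hypothetical variant for another kind of administrative task (the directory name is made up, and the foreign library, used for reading SPSS and other file formats, is just one possible choice):

# Switch to the directory for the thesis project and load the library needed there
GoThesis <- function () {
   setwd ("D:/Projects/Thesis")
   library (foreign)
   cat ("Now working in", getwd(), "\n")
}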

8.2 Scripts

With the exception of the functions discussed above, most of the sample commands in the previous chapters represent what programmers would call one-liners, i.e. commands entered one by one in the window controlling the program. This is perfectly OK for simple operations, but much less than optimal for anything more complex.


For instance: Suppose you are doing the data analysis for a paper, and have anything from a small to a complex set of operations, involving for instance (a) reading a data set, (b) doing a set of transformations, and (c) generating some results from a particular kind of analysis of the transformed data set, including some graphics. Since it is real research, you want to both (a) be sure that the operations at all steps are correct, and (b) be able to document what you have done. Neither is simple to solve when using an MM (Mouse and Menu) program.
The latter point is important both for your own purposes (Question: Will you be able to repeat the operations later and get the same results?), and in respect to communication about your research to other persons. If your supervisor or a journal editor asks for details about what you have done to obtain the results, you want to be able to provide them. Then, the best thing is to be able to refer to some kind of record of what has been done, and the optimal thing is to operate with what is sometimes called a script, in other contexts a program. This is where systems like R really shine. Some of the arguments in favor of scripts written as text files are:
Scripts are part of the project documentation and useful for any project beyond the really trivial (will you remember details about what operations you did for a particular paper after, say, a year?).
They help you to adhere to the DRY principle (Don't Repeat Yourself) (Hunt & Thomas, 2000). Write a script or function once and use it several times. Repeating yourself is an invitation for errors.
Scripts are useful for small tests, like the one in Sample script 1 below, used mainly to convince yourself (or somebody else) about something. If you discover something else at a later stage, you can go back to that script, refine it, and test it again.
They are also useful for simulations, like the ANOVA script below and its more general variation. With scripts like these, you are able to repeat operations with different parameters in order to learn more about what the analysis does in different circumstances.
Divide and rule: In many cases it is a good idea to split a complex set of operations into smaller parts which can be tested separately. When all the steps are regarded to be OK, the different parts may be combined into one script, and, with a bit of planning, the operations in each part are reproduced exactly in the correct order.
Some types of operations may take a long time on large data sets, even on fast computers. Test the procedure on a small data set, and when you are convinced that you have what you want, start the script on the real data and do something else while the computer is sweating.
You may want to try out variants of things, as for instance transformations, and keep the rest unchanged in order to study the effect of each variant.
Collecting the data may be a slow process, and you may want to try out the analysis before you have the full data set.
These are only a few of the arguments. As has been said before, one of the very nice things about R is that it encourages the use of scripts, i.e. collections of predefined commands stored in a file.

8.2.1 Sample script 1: Saving data frames

Sometimes it is useful to be able to check how things work, i.e. to do a simple exploration. In this example I am interested in exploring the write.table () command and how the generated files look for later use or backup. 1


To test the operations, it is convenient to have a small script that (a) reads a small file like the one in Table 8.1, (b) prints the contents, (c) saves the data set to a file, (d) reads it again, and (e) prints the final version. The two printouts should be the same. The script to be used is called IO test.R and looks like this:
# Test input and saving data
x <- read.table ("Input.data", header=TRUE)    # Read the table from the first file
x                                              # Print it
write.table (x, file="Output.data")            # Write the contents to a second file
y <- read.table ("output.data", header=TRUE)   # Read the second one
y                                              # Print it

Table 8.1: File Input.data

AAA BBB CCC DDD
 10  11  12 YES
 20  NA  22 YES
 30  31  32  NO
 40  41  42  NO

Then we run this script with the command:

> source ("IO test.R", echo=TRUE)

which yields the following output on the console (due to the use of the echo=TRUE option):

> x <- read.table ("Input.data", header = TRUE)
> x
  AAA BBB CCC DDD
1  10  11  12 YES
2  20  NA  22 YES
3  30  31  32  NO
4  40  41  42  NO
> write.table(x, file = "Output.data")
> y <- read.table ("output.data", header = TRUE)
> y
  AAA BBB CCC DDD
1  10  11  12 YES
2  20  NA  22 YES
3  30  31  32  NO
4  40  41  42  NO

It evidently works correctly. The contents of the file Output.data now look like this:
"AAA" "BBB" "CCC" "DDD"
"1" 10 11 12 "YES"
"2" 20 NA 22 "YES"
"3" 30 31 32 "NO"
"4" 40 41 42 "NO"

This is different from the original in respect to formatting, but clearly equivalent. What has been added is quotes around all strings (basically those values considered to be non-numbers), and one extra column (the row names).
1 In order to see how things really work, it is often a good idea to take the trouble to construct a small test like this, using very simple data or an example from a textbook.

8.2.2 Sample script 2: Simple computations

For the next example, start with a text file with the commands defining the operations you need,
e.g. something like the script below.
# Contents of file "Tiny.r"
sink ("Tiny.txt")
x <- read.table ("Tiny.Data", header=TRUE)
mean (x)
sd (x)
sink ()

The file Tiny.r contains commands for (a) reading a small data set (see Table 8.2), (b) printing the means for the variables in that data set, and (c) doing the same for the standard deviations. That is all, three commands, plus one additional command at the start and another at the end. These two sink () commands are very important. The first one says "write all output from now on to the file called Tiny.txt", and the last line simply says "stop doing so". To use this small file, you start R and enter the command:

> source ("Tiny.r", echo=TRUE)

Table 8.2: Simple data set

  X1 X2 X3 X4 X5
1  3  6  5  5  3
2  2  2  3  2  2
3  2  2  5  5  5
4  2  4  5  4  3
5  2  5  3  5  6
6  4  3  2  2  2
7  4  3  4  2  2
8  1  7  4  6  5

You will see very little output on the screen (apart from a reminder about the first sink () command), but if you look into the working directory on your machine you will find a file called Tiny.txt which contains the results or output from the operations, shown in Table 8.3 below.
What is so sensational about that?
The reason is so obvious that it is easy to overlook. This small and really trivial example demonstrates a very important principle: With one single command (the call on source ()):
1. The set of commands stored in the file Tiny.R is executed when it is referred to by the source () command, and:
2. Since this set of commands included the use of sink () commands, the results from that set of commands are all stored in one place, the file Tiny.txt.
3. Since the commands are stored in a file, they are in other words reusable; the operations can be repeated with very little trouble on this or other data sets.
4. By simple editing, it is possible to add to the commands and produce additional results.
The argument you probably would hear from an experienced SPSS or Statistica user is simple: So what? It is cumbersome to set this thing up in the first place. By using the mouse and clicking in the menus, I could have the same results long before you have anything at all.
The argument may perhaps be true. The very first time. But if you want to repeat the same operations for some reason or another (which happens VERY often in real life), the user of the point-and-click program will normally have to repeat all of the operations from the start, with a high probability of introducing errors, 2 while the R user at most has to write one single command. Then the script user will win, every time.
2 Of course, you have the possibility to use scripts in both SPSS and Statistica (and other packages as well), but this is not very common, and besides, I find the languages in these systems to be much less readable than the commands used in R. Besides, they are proprietary languages.


Besides, the contents of the script file Tiny.R may be expanded in any number of ways; every time a change is applied to the set of commands, it is reflected in the generated output.
For instance, suppose you want to have a matrix of variable intercorrelations in addition to the means and standard deviations in the output. Open the script file (Tiny.R) with an editor (e.g. the Open Script item in the File menu of R, Notepad, or RWinEdt, see below), add a single line with the text cor (x) before the last sink () command, and save the file. The next time you run it (remember that you can use the up arrow to reenter the source () command), the correlation matrix is added to the output file Tiny.txt.
Watch a user of a point-and-click program like SPSS and see what operations he or she has to do before there is a correlation matrix on the screen, and you will see that all the arguments about the speed of this type of software are less convincing. If you want more complexity, add more commands to the script. Repeat the same operations on a different data set with the same variables? Change the name of the data set to be read in the line after the first sink ().

Table 8.3: Contents of Tiny.txt

> x <- read.table("Tiny.Data", header = TRUE)
> mean(x)
   X1    X2    X3    X4    X5
2.500 4.000 3.875 3.875 3.500
> sd(x)
      X1       X2       X3       X4       X5
1.069045 1.851640 1.125992 1.642081 1.603567
> sink()
Situation 1: Suppose you have an assignment in a course on data analysis which involves a number of different steps, and part of the assignment is to produce a report with all the results. As you work through the steps and see the effect of the operations, copy each operation to the correct place in the script file. When you have worked your way through the assignment, the final step is to run the script with the source () command, and you have your report. If it needs polishing, edit the script file and run it again.
Situation 2: You have a complex set of operations to be done on a large data set. Write the
operations as a script file, test it on a subset of the data, and when everything is the way it should
be, run it on the full data set. The script and the output it produces is part of the documentation
for the project which can be returned to with minimal effort as well. Or, reused with modifications
in another context.
Remember:
Normally, the operations in a script are carried out in the current workspace, so objects generated inside the script are available in the workspace afterwards.
If you drop the sink () commands in the script and keep the echo=TRUE option in the source () command, the output is sent to the console, i.e. the same window as used for entering commands.
If you drop the echo option in the source () command as well, all operations are very silent; no output appears, neither on the console nor on file (but all generated objects are in the workspace nevertheless).
If you answer Yes to the Save workspace question when you exit from R, the commands from the session are stored in the file .Rhistory in the work directory. This file may be the basis for a script file after editing.


You should ALWAYS include comments in the script containing explanations of what is
done.

8.2.3 Sample script 3: Formatted output

In many ways the default output from the R system is very primitive. For one thing the number
of decimals you normally get from R is often excessive, and one might like to have prettier output.
So, start a new file with an editor and call it Test.R. Enter something like the following lines:
sink ("Test.txt")
attach (attitude)
cat ("Correlations\n\n")
print (cor(attitude), digits=3)
detach (attitude)
set sink ()

#
#
#
#
#
#

File to send output to


Attach the "attitude" data
Print a string plus two "newlines"
Print correlations with 3 digits after decimal point
Free the data
Stop printing to the file.

The third point under the Remember heading above is relevant here: when you use the source () command with the file name alone to activate a script, the actions are very silent, and no output is generated. So, you have to explicitly order R to print something to the output file, i.e. the file named in the first sink () command. In this case, the output contains two elements:
1. A string: Correlations. The two \n's at the end of the string (see below) have the effect that the next output element will start on a new line. Having two of them yields one blank line between the end of the text string and the first line of the correlation matrix.
2. A correlation matrix, printed with three significant digits.
The output (found in the file Test.txt) looks like this:
Correlations

           rating complaints privileges learning raises critical advance
rating      1.000      0.825      0.426    0.624  0.590    0.156   0.155
complaints  0.825      1.000      0.558    0.597  0.669    0.188   0.225
privileges  0.426      0.558      1.000    0.493  0.445    0.147   0.343
learning    0.624      0.597      0.493    1.000  0.640    0.116   0.532
raises      0.590      0.669      0.445    0.640  1.000    0.377   0.574
critical    0.156      0.188      0.147    0.116  0.377    1.000   0.283
advance     0.155      0.225      0.343    0.532  0.574    0.283   1.000

The \n part of the string sent to the cat () function is what is called an escape sequence
used to control the output. The most important ones are:
\n New line
\f Form feed (new page)
\r Carriage return
\t Tab
These letters are entered after a backslash. On my system, the \f sequence does not work
properly when Notepad is used, but it works fine when printing the file with RWinEdt, provided
R is in sdi mode. See page 77.
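A small sketch of how the \t and \n sequences can be combined in a call on cat () (the values are made up):

> cat ("Variable\tMean\tSD\n")
> cat ("X1\t2.50\t1.07\n")

Each \t moves the following item to the next tab stop, so the two lines line up as columns.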

8.2.4 Sample script 4: ANOVA with simulated data

So, imagine that you want to repeat the ANOVA from part 5, Factorial Analysis of Variance, with different constants just to see what happens. You then start with a text file with the necessary commands. This text file may be typed in an editor, or you may do the analysis in R and get the commands afterwards from the .Rhistory file. In any case, you then have a text file, for instance called ANOVA.r:
sink ("ANOVA.txt")
output x <- rnorm (80, 10, 2)
f1 <- factor (rep(c(1,2), each=40))
f2 <- factor (rep(c(1,2,1,2), each=20))
table(f1,f2)
x[41:80] <- x[41:80] + 1.5
x[1:20] <- x[1:20] + 2
x[41:60] <- x[41:60] + 2
result <- lm (x ~ f1*f2)
anova (result)
interaction.plot (f1,f2,x)
tapply (x, list(f1,f2), mean)
sink ()

# Define file for the


# Generate a dependent variable, 80 values
# Generate the factors
# Check on the group sizes
# Add effects to the dependent variable

# Do and ANOVA with lm ()


# Print the results in ANOVA format

# Turn off output to the file

These commands, stored in a file, are called a script; I have added comments in order to explain what the commands are supposed to do. Now, start R and enter the command:
> source ("anova.r", echo=TRUE)

After the operations are finished, all the output (including the commands, due to the use of the echo argument) will be found in the file named in the initial sink () command. For one thing, you now have a record of what has been done. In addition, you can repeat the analysis at any time without having to write all the commands again. You may for instance want to experiment with different constants added to the dependent variable, do some other transformations, add other analyses, or simply correct errors. Just edit the commands in the anova.r file and run the script again with the source () command. In other words, a good example of the DRY principle (Don't Repeat Yourself).

8.2.5 Sample script 5: A more general version

The simulation of the two-way ANOVA above can be made more general, which illustrates some of the possibilities with scripts. Here a variable gsize has been added to the script, and this variable is used to replace the constants controlling the generation of the data vector x and the two factors with simple expressions, e.g. replacing 41 with (gsize*2+1):
sink ("ANOVA.txt")
gsize <- 20
x <- rnorm (gsize*4), 10, 2)
f1 <- factor(rep(c("A1","A2"), each=gsize*2))
f2 <- factor(rep(c(B1","B2","B1","B2"), each=gsize))
table (f1,f2)
x[(gsize*2+1):(gsize*4)] <- x[(gsize*2+1):(gsize*4)] + 1.5
x[1:(gsize*2)] <- x[1:(gsize*2)] + 2
x[(gsize*2+1):(gsize*3)] <- x[(gsize*2+1):(gsize*3)] + 2
result <- lm (x ~ f1*f2)
anova (result)

8.2. SCRIPTS

97

interaction.plot (f1, f2, x)


tapply (x, list(f1,f2), mean)
sink ()

Now you have the possibility of experimenting with different group sizes (change the value
of gsize on the second line) as well as different effect sizes. This could have been even more
general by rewriting the script into a function, but that would be beyond the intended scope of
this document.
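Purely as an illustration, a minimal sketch of what such a function might look like (the function name and the argument names are made up and not part of the script above):

SimANOVA <- function (gsize = 20, eff1 = 1.5, eff2 = 2) {
   x <- rnorm (gsize*4, 10, 2)
   f1 <- factor (rep(c("A1","A2"), each=gsize*2))
   f2 <- factor (rep(c("B1","B2","B1","B2"), each=gsize))
   x[(gsize*2+1):(gsize*4)] <- x[(gsize*2+1):(gsize*4)] + eff1
   x[1:(gsize*2)] <- x[1:(gsize*2)] + eff2
   x[(gsize*2+1):(gsize*3)] <- x[(gsize*2+1):(gsize*3)] + eff2
   anova (lm (x ~ f1*f2))
}
> SimANOVA (gsize = 50)   # One run with larger groups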

8.2.6 Nested scripts

Of course, one script can contain source () commands that refer to other scripts. That way you can put together a complex series of operations that are tested in steps. The only thing you have to be careful about is circular references: if you have a script called a.r which contains a source () command referring to b.r, the file b.r should not contain a source () command referring back to a.r. That would in most cases not be very smart.
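As a sketch (the file names are hypothetical), a top-level script could look like this:

# master.r -- runs the whole analysis in the correct order
source ("read.r")        # Reads and cleans the data
source ("transform.r")   # Does the transformations
source ("analysis.r")    # Produces the results and the graphics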


Appendix A

Data Transfer
Most research projects include a stage where the data to be used in the analysis is transformed from one format to another, more usable, format. This transformation may be quite simple, as when a data file is imported or exported, or very much more complex, as when the data is extracted from some other source and used to form a data set suitable for data analysis.
The basic assumption discussed in this appendix is simple: In any transformation of data from one format to another there is a risk of introducing new errors.
In addition, this part of the research process merits as much attention as the other parts. For some reason, it does not seem to get it. It is strange, but few introductory texts, e.g. (Ballinger, 2007), even mention anything close to the topic. The design of the study is of course important, and that is usually a central theme in texts on methodology, while themes like reporting on results, handling references, correct use of styles, etc. are at best secondary if not trivial. But normally, nothing is found on what is essentially quality control: on being sure that the actual foundation for the paper, the data itself, is essentially as correct as possible and therefore something that can be trusted and is worth reporting on.
This part of the process needs to be documented, and the need to do so is increasing.
A normal activity at an early stage of a new project is to transfer the data to be used from one format to another, where the final format is something that is readable by your preferred software. 1
My impression is that this stage of a project often gets much less attention than it deserves compared with the other stages. Many researchers seem, strangely enough, to be less careful with this phase, in spite of the simple fact that this part of the research process is the foundation for the rest. Now, why is it important? Some of the reasons (which will be elaborated on below) are:
One will of course want to operate with a data set that is as correct as possible.
1 One of the very few documents I have found on the net which discusses this theme is: http://www.folkesundhed.au.dk/uddannelse/software/takecare.pdf



It is sometimes necessary to be able to backtrack when inconsistencies or errors are discovered in the data set. It is smart to take that possibility into account in the planning phase.
It is becoming more and more common for journals to require that raw data, instructions, and all tools used in the establishment of the data set must be archived when an article is published (for all APA journals one has to keep records for five years).

As a start, it is convenient (and in most cases not difficult) to distinguish between two different representations of data: (a) the source, and (b) the final data set used for the data analysis. The current theme is therefore to look at some of the principles that should govern the transfer of information from the source to the data matrix used as the basis for any reports on the results.
Essentially, this is (or should be) a loop with the elements shown in Figure A.1. The transfer itself may be very simple, like the import of a data set from a spreadsheet into R, or it may be very complex and time-consuming, like a manual transfer of responses from questionnaires to a machine-readable format, or the coding of, say, films according to some criteria.

Figure A.1: The data entry loop

In any case, with any project involving more than a trivial amount of data, an important aspect of the research process is to ensure that the data used for the data analysis is as correct as possible, that is, that the data set used for the analysis is as faithful a representation as possible of whatever was in the original source. It is probably safe to assume that any transfer from one format to another includes a risk of introducing errors in addition to those already present in the source. The problem is then to avoid the introduction of new errors as far as possible. One tool is to use a setup that permits simple checks, another is to use the right type of software for data entry.
After all, if the quality of the data is low, the results cannot be good. Remember the acronym GIGO (Garbage In, Garbage Out). In addition, unless the transfer is really trivial (like the transfer of a spreadsheet to a text file), we also have to document the details of that process so we (or somebody else) can, if necessary, replicate the process and arrive at something close to the original.
In other words, this part of the process deserves as much attention as the data analysis itself, for the simple reason that it is the foundation for the data analysis, which in turn is the basis for the results presented in a paper or report.
A colleague once conducted a survey where the returned forms from the respondents were scanned in. When asked about the error rate (some expression of the differences between the original source and the final data set), his response was that he had not bothered about that. He did not regard the possibility of errors as important. That was not very smart, and in my opinion borders on the unethical for a researcher as well. When you consider the expenses and time already invested in the project, both by him and the respondents, it is a relatively minor matter to be reasonably sure that the data used in the analysis matches the source as well as possible.
Before going into details, it might be a good idea to examine the basic question:

A.1 Why is this important?

The primary objective has already been mentioned: one wants the final data set used as the basis for the research to be as faithful a representation of the source as possible. That is obvious, although often ignored. However, there are other situations where a careful transfer and documentation may be useful. One should be at least slightly paranoid, and imagine scenarios like the following:
A jealous colleague accuses you of fraud, and you are forced to document what your sources are and what you have done so far. Are you able to do so?
You have submitted an article to a journal, and after a long wait (while you were working with other things) one of the reviewers spots some inconsistencies in your analysis. The editor asks you to go through the analysis and document what happened between the collection of the data and the results you presented in the paper. Are you sure that you are able to do so, and arrive at the same results as the first time?
After a few years you want to go back to an old project. In doing so you have to reconstruct things. Accurate records of what was done in the first place may save months of work (for an example, see footnote 7 on page 82). Do you have the necessary records?
Your main computer is a portable (something that is becoming quite common and which in itself implies a number of risks) which is stolen while you are attending a conference, or is damaged in an accident. You have to reconstruct your work of the past year (or more) to be able to continue your career. If you can do so without losing more than a few hours or at most a few days of work, you are being reasonably careful. More than that, and you have been a fool. If we are talking about losing weeks of work or more, you should consider a change of career; you are being too naive. 2
Apart from the last one, which requires a backup 3 of your files, all of them require a reasonably complete log of what has been done, combined with an audit trail as described below. One should never have to rely on memory in respect to what has been done to the data, but have records of all the steps in the process of arriving at the results. The consistent use of scripts, combined with a log for everything but really trivial operations, is the most important tool.
My impression is that maintaining proper procedures for the documentation of a project is becoming more and more important for journals, as well as for those bodies and institutions that provide funding for research. Recent, widely publicized cases of scientific fraud will only reinforce that trend.
2 I must admit that I sometimes wonder how many of my colleagues have lost important data sets or documents. I am sure that it has happened to most of us, hopefully with less critical stuff. If they have, most people know how stupid they have been and tell no one.
3 Having and maintaining a backup is very important, and has saved me at least once, when the hard disk on my portable failed without warning. Hard disks will fail sooner or later, and when that happens you may lose years of work unless you are prepared for the worst. It is therefore sensible to be careful. One solution is to have an external hard disk where an incremental backup (only the files that have changed since the last backup) is kept. To simplify this process, it is important to keep the program directories on your machine strictly apart from the data directories where your work is stored (preferably organized by subject, not file type). The former may be deleted without warning when the system is updated, which is routine in many institutions. So the default setup for Windows is NOT optimal for researchers. In addition, really important stuff (like data sets copied to CDs) should be deposited in a different place from where you work.


A.2 The audit trail

In accounting, every amount of money passing in or out of the accounts must have a corresponding record, normally on paper, explaining the nature of the transaction. The same term is often used in the handling of data: each record in the data set should contain a unique id making it possible to go back to the source for checks.
A data set will very often represent a combination of information on the same respondents from different sources: different tests or forms, files or databases, different points in time, etc. Having a plan for an audit trail which includes unique ids for each case in the data set is therefore an important tool in quality control.
The main point is of course, as with all other parts of research, accountability. One has to be able to produce documentation on all the steps that have been taken in a project, not only the data collection and analysis, but the data preparation as well.
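As a small illustration of how such ids support quality control once the data are in R, the following is a minimal sketch. It assumes a frame called survey with a column called id (both names are hypothetical):
> any (duplicated (survey$id))          # TRUE means at least one id occurs more than once
> survey$id [duplicated (survey$id)]    # lists the offending ids, if any
If duplicates turn up, the corresponding forms can be located and checked against the source.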

A.3 Data Sources

Some of the more common sources of data in psychology and the social sciences are:
- Forms on paper, like questionnaires or tests.
- Material like filmed behavior or recordings that has to be coded in some way or another.
- Sources that are already in a machine-readable format, which then has to be converted to a format that can be used by the preferred tool for data analysis.
As mentioned in the last point above, quite a number of data collection procedures do not involve paper forms (e.g. web-based data collection, or recording instruments of various kinds), but there should nevertheless always be an audit trail, making it possible to go back to the original source to check on any single observation in the data set.
One important tool in maintaining such a trail is to have a unique label on each form (record) and to enter that label into the data set together with the other information from the form. This label should of course be neutral, in the sense that it contains nothing sensitive. If sensitive identifiers are necessary for the project, a separate list should be maintained linking the neutral ids to the sensitive information, and that list should be kept apart from the data.
Note that there are normally quite strict regulations on how to store and treat sensitive information in research projects.

A.4 Data Transfer

My very first teacher in a course on programming once said: Computers are completely neutral, in a slightly malicious way. She was right. It pays to be slightly paranoid and assume that errors will creep into whatever you are doing with computers, no matter how experienced you are. The basic task is then simple: find tools that reduce the number of errors as much as possible.

A.4.1 Manual data entry

If the original source of the data is on paper, e.g. some type of form, one normally has to enter the data by hand. In that case, it is NOT recommended to use a spreadsheet (e.g. Excel, Calc, or Gnumeric) for data entry unless the project is really trivial. With that kind of tool it is too easy to introduce errors. It is much better to use a dedicated program for this task. This type of program should at least be able to:


- Set up a convenient form on the screen for data entry, one record at a time.
- Encourage the establishment of a codebook for the project.
- Possibly add unique ids for each case in the data set (which should be added to the form as well) as the data are entered. If you do, you can go back and check the file against the originals if necessary.
- Hinder entry of illegal values through the definition of permissible ranges of values, and automatically insert the correct missing data codes for each variable where no data are entered.
- Permit the definition of conditions in case subsets of the questions are given to subsets of the respondents, e.g. if only the females in a sample are given some questions, the entry procedure should skip these questions by default for the males.
- Automatically date each record during data entry.
- Permit checks on the data entry process, e.g. the option to enter the data twice followed by a comparison of the two files to see where the discrepancies are.
- Enable simple logical tests to spot impossible cases in the data set, like pregnant males. These types of errors are normally easy to spot with simple tools like frequency tables. However, one should keep in mind that this possibility normally covers only a few of the variables in a data set.
- Permit export of the data to various archival formats as well as to common file formats (e.g. text files, Excel, Stata, and SPSS among others). This includes encryption procedures, which may be useful, especially when transferring data sets between researchers.
One dedicated data entry program that has these features (and more) is downloadable for free from the net and is strongly recommended for any serious project (and we are all serious, right?). See:
http://www.epidata.dk
To quote from their website:
EpiData Entry is used for simple or programmed data entry and data documentation. Entry handles simple forms or related systems. Optimized documentation and error detection features e.g. double entry verification, list of ID numbers in several files, codebook overview of data, date added to backup and encryption procedures.
It is a minor effort to set up the program for the first time on a new project, but it is well worth the investment in terms of error reduction. As mentioned above, this type of software also enables you to enter the data twice (preferably by two different persons) and to compare the two data sets for discrepancies. Whenever I suggest double data entry to colleagues, I normally hear protests: it is too expensive, takes too much time, is not worth the trouble, a few errors do not matter, etc. On the other hand, considering the expenses often involved in data collection (planning, printing forms, postage, sending out reminders, etc.), I do not think that it is excessive to want to be sure that the data one actually uses for analysis (and later publication) are as correct as possible.
Besides, my experience is that miracles are not very common, at least not when using computers. To believe that manually entered data are without errors is really to believe in miracles. The objective must be to reduce the errors as much as possible by using the best tools.

A.4.2 Scanning forms

A quite common alternative is to use a scanner to transfer data from paper to data files. I must admit that I am quite sceptical of that alternative; my experience is that the scanning process introduces an unacceptably high number of errors, especially if handwritten responses have to be interpreted. However, it is easy to test that assumption, and you really should do so if you plan to use scanners. Enter a small subset of the data set twice, both times manually. Check the two against each other and make corrections until you are sure that you have a correct representation of the subset of the records in the source.
As the next step, scan the records from the same subset and compare the scanned file with the file from the manual data entry. My guess is that you will be disappointed.

A.4.3 Filters

In computerese, a filter (a Unix/Linux term) is a program for transferring data from one format to another. So, when a data set is transformed, e.g. from Excel into an SPSS format, you are using a filter: the original is changed in some way or another, with a possible loss of information as well as the introduction of new errors. Therefore, whenever that kind of operation is applied to your data, you should do some simple checks to convince yourself that everything is correct. Since the most common problems occur with missing observations, it is a good idea to at least compare simple things like counts of valid observations and variable means between the before and after data sets.
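A minimal sketch of such a check, assuming the two versions have been read into frames called before and after (the names are hypothetical) and that the variables are numeric:
> colSums (!is.na (before))             # valid (non-missing) observations per variable
> colSums (!is.na (after))
> sapply (before, mean, na.rm = TRUE)   # variable means
> sapply (after, mean, na.rm = TRUE)
Any difference between the two sets of counts or means is a signal that something happened in the transfer and should be investigated.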

A.4.4 Checks

The critical element in figure A.1 is the back loop involving the Check label. Of course, the data must be checked in some way or another before advancing to the next step. Not to do so is to believe in miracles, which we all know do not happen very often. There will be errors, and there are several possible strategies to use when looking for them.
Possibly the best method is to do the transfer twice and then use some kind of software to compare the two versions and investigate all differences. The tool to use could be the EpiData program or a similar tool as mentioned above, or a spreadsheet combined with appropriate macros. Of course, the basic assumption is that one does not commit the same errors twice.
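Such a comparison can also be done directly in R. A minimal sketch, assuming the two independently entered versions are stored in the (hypothetical) files entry1.txt and entry2.txt and contain the same variables in the same order:
> first <- read.table ("entry1.txt", header = TRUE)
> second <- read.table ("entry2.txt", header = TRUE)
> which (first != second, arr.ind = TRUE)   # row and column of every discrepancy
Note that cells where either file has a missing value are not flagged by this comparison and have to be checked separately.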
My guess is that it makes little sense to do this comparison if both copies are scanned, for the simple reason that the scanning software will tend to produce the same errors from the same raw material. So, comparing the scanned data with a manually entered data set is the only sensible thing.
A far less satisfactory solution is to manually compare the data in the file with all or a subsample of the source elements. The likelihood of overlooking errors is too large.
Finally, one way of spotting some kinds of errors is to apply logical tests, e.g. to look for pregnant males, or very unlikely things, like 25-year-old grandmothers. That should be done in any case, regardless of the method of transfer, since respondents may make errors as well. In other words, there may be errors in the source itself (see footnote 4). However, this type of error check cannot be used for all types of data. For instance, it is impossible to spot logical errors of this type in data like items in personality or attitude tests. In that case (which covers most projects), the only reasonable thing to do is to enter the data twice and compare the two.
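A minimal sketch of such logical tests in R, assuming a frame called mydata with the (hypothetical) variables sex, pregnant, age, and grandmother:
> subset (mydata, sex == "male" & pregnant == "yes")
> subset (mydata, age < 35 & grandmother == "yes")
Any rows returned by these commands point to cases that should be checked against the source.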
Footnote 4: One type of systematic error is quite amusing as well as being a nuisance. When doing surveys within organizations it is quite common to ask the respondents about their rank or place in the hierarchy. In that case you always end up with more leaders than there really are. There is a clear tendency towards self-promotion.


Another form of checking may be used when transferring from one machine-readable form to another, as when dedicated instrumentation combined with specialized software is used to generate the source data. This type of software should produce some type of record that can be included in a log.
When using any type of filter it is very important to check the result against the original as far as possible. This may be very simple: look at the number of cases, compare variable means, and see whether the codes for missing data work the same way in both versions.

A.4.5 The final data set

In a project, the outcome of the process above is the basis for the next stage, the data analysis, and this is the data set that should be archived together with all necessary information.
Also note that some types of information should NOT be included in the final data set in any case. This includes sensitive data, like id numbers of various types. Such information may be necessary in order to join information from different sources and/or when data are collected from the same individuals on several occasions. The solution is to have neutral ids in the data file and to have a separate list containing the same ids together with the sensitive information. Of course, this list should be kept strictly apart from the data used in the data analysis.
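Joining files on such a neutral id is simple in R. A minimal sketch, assuming two (hypothetical) files wave1.txt and wave2.txt that both contain a column called id:
> wave1 <- read.table ("wave1.txt", header = TRUE)
> wave2 <- read.table ("wave2.txt", header = TRUE)
> combined <- merge (wave1, wave2, by = "id")
The merge () function matches the records on the id column, so the sensitive list linking ids to persons never has to enter the analysis files.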

A.5 Final comments

Much of what we do as researchers can be reconstructed in one way or another. We can buy a new computer, software can be bought or upgraded, and manuscripts can be reconstructed, and may even improve as a result of the rewriting. The loss is mainly one of time and money: serious, but not critical. On the other hand, data sets can very often only be reconstructed with great difficulty, if at all. For that reason alone it is better to be prepared for the worst.
Therefore, some of the things to consider are:
- If your original data are on paper forms, e.g. questionnaires, do not store the original forms in your office, but in a different place altogether, preferably in a different building. Use copies for the data entry procedure. The same holds for recordings and other types of electronic media. However, if the data contain sensitive information, take that into account when selecting a storage place.
- Once you are sure that the data are entered as completely and correctly as possible, store a copy of the data set in a different place as well, in a program-independent format (plain text is the best, see part 7.7 on page 82) if at all possible, together with notes on how it was collected, including codebooks, any scripts used for modification of the data, etc. This is the archive for the project, which should also be updated after any important operations later on.
- Each record in the data set should ALWAYS include some information enabling you to go back to the original for checks.
- NEVER operate with only one single copy of a data set. Have backups, preferably incremental. Then you are able to backtrack if something goes wrong in one way or another. Remember that high capacity USB type hard disks are cheap, and they usually come with the necessary and efficient software for establishing backups. It may be a good idea to have more than one of those in different places, but remember to keep copies of the original installation disks for the backup software as well.


The first two items are about archiving, saving your work for a more distant future. What is called backup is more of a short-term process, perhaps even on a daily basis, and normally incremental. Both are important.
On archiving data sets, see for instance the Publication manual of the APA (Association, 2001), section 3.53 (page 137):
Authors are responsible for the statistical method selected and for all supporting data. Access to computer analyses of data does not relieve the author of responsibility for selecting the appropriate data analytic techniques. To permit interested readers to verify the statistical analysis, an author should retain the raw data after publication of the research. Authors of manuscripts accepted for publication in APA journals are required to have available their raw data throughout the editorial review process and for at least 5 years after the date of publication.
And on page 354 from the same source:
Other information related to the research (e.g. instructions, treatment manuals, software, details of procedures) should be kept for the same period. This information is necessary if others are to attempt replication. Authors are expected to comply promptly and in a spirit of cooperation with such requests.
In addition, see part 7.7 above.

Appendix B

Installation and Fine-tuning


This part is an attempt to cover some of the more technical aspects of an installation: things like editors to use together with R, possible GUI interfaces, etc. The final and very brief section covers some of the other tools needed to be able to produce documents with a professional look worthy of a scientist.
To start using R you really only need to look at the first part on the installation below. The remainder contains refinements, plus a mention of other software tools in the open source category.

B.1 Installation

The home page for the R project is: http://www.r-project.org/. The Windows version of R is downloaded via that URL. Look for a file called R-2.3.1-win32.exe (or similar; the version number, the 2.3.1 part, may be different in the file name), and download it to a suitable location. When the download is finished, run the program to install R on your machine.
If your preferred operating system is Linux, you have to download the correct RPM package from the same source and install it, as well as resolving dependencies in respect to other packages. The latter point will be much less of a problem if you are using a Debian based variant of Linux (where the Ubuntu distribution is extremely popular, and not without reason). However, my experience with Linux is limited; therefore I will ignore the finer details of that possibility (at least for the time being).
A new version of the R package is published every 6 months or so. There is also an option in the R interface (Packages -> Update Packages) for updating the packages you have installed if any of them have been revised recently. This is a largely automatic process, where the only requirement (of course) is that you are logged on to the net.

B.2 Editors

Since R is very much text based, a good text editor is a very useful tool. The one included with R is quite good, but not good enough.
In this context it is very important to be aware of the fact that MS Word is NOT a text editor, for the simple reason that far more than the text is stored in the file; the file is padded with a lot of incomprehensible junk. Notepad is OK for very simple jobs, but somewhat cumbersome in this context, for the simple reason that you can only work with one file at a time. For instance, when using scripts, you need to have at least two text files easily available (the script itself and the resulting output), and it is often handy to have the data file open as well. Alternative editors are very much better for working with R; see the R site on the web for recommended ones. In other words, the basic requirements are the ability to have more than one file open at the same time, to run a script from the editor, and if possible to have syntax highlighting as well, which is nice when working with scripts.

B.2.1 Tinn-R

Tinn-R is installed when you download and install SciViews-R, but you can also install it as a separate tool. As far as I can see, this is a very nice editor, and it is also free. See:
http://www.sciviews.org/Tinn-R/
Note that newer versions of the editor are said to be not completely compatible with R, so the safe thing is to download it from the SciViews location, not from SourceForge. From my point of view, this is a very nice program, partially because it is written in Object Pascal (Delphi), the same language that I am using for writing computer programs.

B.2.2 WinEdt and RWinEdt

Another very good editor is WinEdt, a shareware program (free to try for 31 days, and after that available for a very modest price) which can be very nicely integrated with R. The WinEdt program installer can be downloaded from: http://www.winedt.com/. When downloaded, run the installation file in order to install the program. As a final step, you need to install a companion package for the editor in your installation of R: start R and write install.packages ("RWinEdt"). When this is completed, whenever you start a new session you can enter the command library (RWinEdt), and you are in business (see footnote 1).
If that does not work, have a look at the site for the editor: http://www.winedt.com. Note that the full version is not completely free, but the sum to pay for it is modest (currently USD 30 for students and USD 40 for Educational), and you may try it for 31 days for free. I regard this editor as very good and useful for other things as well.
This editor is also geared towards LaTeX, as discussed below in part C.0.5 on page 114. That is a typesetting system that is far superior to what can be produced with standard word processors.
In any case, working with a good editor has many advantages, where perhaps the most important one is to have all the files you have used recently at your fingertips, each visible with its own tab in the editor window. When the package is installed, you start the editor by writing:
> library (RWinEdt)
Then the editor opens in a separate window.

B.2.3 Notepad Plus

This is not the same program as the one included in a standard Windows installation. It is a nice editor where several different documents may be viewed in separate tabs. It is freeware and downloadable from the net at no cost. The main disadvantage is that this is a general editor without any link to R (like the next one) for running scripts etc.
Footnote 1: In order to have the RWinEdt editor work properly, R has to be working in sdi mode. This can be achieved in several ways. The one I normally use is to open the desktop shortcut to R by right-clicking on it and selecting the Properties option. In the Target field, add the string " --sdi" after the name and location of the program, i.e. ONE space followed by two minuses and sdi (without the quotes of course). In addition, set the Start in field to the name of the work directory.

B.2.4 vim

Another recommended editor is vim: http://www.vim.org/. This is a full-fledged text editor, very popular in the Linux world, and really intended as an interface to a lot of different things, like open source compilers and typesetting systems such as TeX or LaTeX. In this context it is perhaps a bit of an overkill, and in addition quite scary for unmotivated users.

B.3 GUI Interfaces to R

In general it is not recommended to use a graphical user interface (a GUI) with R (apart from the one generated by an installation of R on a Windows system), even if they do exist. You will be much better off by mastering the more basic commands and advancing from there. Among other things, you lose considerable amounts of flexibility without the command approach, for the simple reason that no set of menus or other types of graphical controls can cover the richness of the R language.
In other words, it is a much better strategy to invest in an editor that is capable of working together with R in an efficient manner (see above). For complete novices it is normally recommended to avoid GUI interfaces, at least at the very start. At least wait with installing that kind of tool until you have a feel for the system. See the part on editors above.
But, if you insist, here are some alternatives:

B.3.1 R Commander

If you really insist on having a graphical user interface to R, there are several free ones, and at least one of them can be installed as a package. It seems to be quite popular, for the simple reason that it is also oriented towards the use of scripts (Google for R Commander for details). It is simple to install: while on the net, start R and enter the command install.packages ("Rcmdr", dependencies=TRUE). Select a location for the download and wait; the process takes time. When the install is complete, start the package by entering the command library (Rcmdr).
The people behind the package are quite prominent in the R world, so it should be quite good. Play with it and see if you like it.

B.3.2 SciViews-R

A more advanced alternative is called SciViews-R, and it is the best I have seen so far, even if it is not quite complete; it is still only in a beta version. It is downloadable from http://www.sciviews.org. Install R first, and then download the correct .exe file and run it.

B.4 Installing packages

The functions and data sets included when the system is installed cover the needs of most users. However, one of the great advantages of R is that it is possible to tailor your installation of R to your own purposes. The concept of packages is important in this context. A package is a collection of more specialized functions with corresponding documentation, usually with additional data sets. There is a very large number of available packages; some are official in the sense that they are available from one of the R sites. Of these, some are recommended, and are normally made available when the R system is installed. To obtain a list of all the installed packages in your installation, use the following command:
> installed.packages ()


As to the available packages, look in the Help item in the R GUI interface and click on the CRAN Home Page item. Find the Packages entry on the left side and click on that. There you will find a list of all the available packages. If you find one that might be useful (have a look at the documentation, which is usually found as a .PDF file when clicking the entry), install it and try it. See the Normal installation part below. You can also use the Install package(s) item in the Packages menu.
To put it mildly, the contents of the packages vary a lot. Some are oriented towards particular types of analysis, sometimes very specialized; others contain examples and data sets used in a particular textbook. For instance, the ISwR package (installed by default) contains all the data sets used in (Dalgaard, 2002). Another variant of the same type is used for the (Crawley, 2005) book; the data sets and the code for all the examples, including the code for all the figures in the book, are found on a site named in the introduction.

B.4.1 Normal installation

Use the Packages option on the Windows form to obtain the ones you want. For instance, the one called rgl is useful for three-dimensional graphics, while the one called lattice is oriented towards very nice graphics beyond what the plot () function and its relatives manage. The boot and the bootstrap packages give access to alternative methods for significance tests and confidence intervals. Rcmdr is a graphical user interface for R which is quite popular (see above).
Alternatively, you can use the install command for this purpose, e.g.:
> install.packages ("rgl", dependencies=TRUE)
This will install the rgl package (specialized graphical functions) together with any other packages that the named package depends upon. Another one is:
> install.packages ("Rcmdr", dependencies=TRUE)
which installs the R Commander package, a GUI interface for R.

B.4.2 Updating packages

It is a good idea to update your installation periodically by running the command
> update.packages (dependencies=TRUE)
If you are logged on to the net, this operation compares the packages you have installed on your machine with one of the repositories for R in respect to versions. If any of the packages have been updated, they are downloaded and reinstalled automatically. To ensure that you are using the latest version of things you should run this command occasionally.

B.4.3 Failed installation

If this method of installing packages fails, it is usually because you do not have the privileges to install things in the default locations on the machine you are using. This is becoming quite common at research institutions where normal users do not have complete control over their computers. Strange, but true, and contrary to the ideology behind R, where it is possible to tailor your installation to your needs by installing packages.
Consequently, you have to instruct R to use a directory for extra libraries where you do have that permission. At the University of Bergen, your private area is found on o:. For instance, if you create a directory on o: called Rlib, you may use an expanded version of the install.packages command:


> install.packages ("ellipse", lib="o:/Rlib", dependencies=TRUE)

The same option is used to control the call to library (), e.g.
> library ("ellipse", lib="o:/Rlib")

Currently this version of the install.packages () function works, at least in respect to the basic contents of the package. You will get some warnings about the inability to update the help files; I will try to get that problem solved. Having to include the lib="o:/Rlib" part every time you want to use one of the libraries you have installed is a bit cumbersome, but a workaround exists.
What you need is to generate a file with the name .Renviron, and that file should contain a line telling R where to search for libraries. This file should either be located in your home directory or in the work directory (where R is started). To generate this file, do the following:
1. Start Notepad or a similar text editor and enter one line with the text R_LIBS=o:/Rlib, where you substitute the name of the directory you are using for installing packages.
2. Select File and then Save as, and in that window locate the correct directory for the file, set the file type to all files, and the name to .Renviron (see footnote 2).
3. Close the current session of R (if there are any active ones), and start the system again.
4. Test whether the simplified library command works. If not, either the command in the file is incorrect (you will get no warning if that is the case), or the file is in the wrong location. The safest thing is to place the .Renviron file in the same directory as the .Rdata file and to start R with a shortcut to that file.
5. You may have to make a shortcut to the .Rdata file and start the program with that shortcut.
The alternative to using the install.packages () command is to download the package file directly as a .zip file (rather than installing it with the install.packages () command) and store it somewhere where you are permitted to do so. A package is simply a .ZIP file under Windows. Once downloaded, the package can be installed from the menu bar under Windows. However, this method may be considerably more complicated than the other one, for the simple reason that the packages have to be installed one at a time. If there are dependencies on other packages, you may have to install any missing parts in separate operations.

Footnote 2: Under Windows, it is not a trivial task to generate a file with a name like .Renviron; formally this is a filename with no name part, only an extension. You are not allowed to change a file name to something like this in the Explorer. The alternative is either to use the old DOS commands, or to force Notepad to save a file with that name.


Appendix C

Other tools
This part is about other tools used by researchers who are, in principle, not very different from any other craftsman. An old-fashioned carpenter needs one or more hammers, a set of chisels, planes, etc. in order to do a good job. Researchers also need a set of tools in their toolbox, some very specialized, and some which are useful regardless of the field.
The main topic of this document is one of the tools that researchers will normally need in the toolbox, a tool for statistical computations. But other tools are also needed, so let us make a list of what we really need:
preceding chapters.
2. A spreadsheet for the handling and storage of data. In addition, spreadsheet programs are
useful for the generation of simple graphics.
3. A tool for entering text, an editor for plain text. Several such editors are discussed above.
4. A typesetting tool used for the generation of nice-looking documents.
5. A tool for collecting and using references.
6. A presentation tool.
The discussion in this part will be mainly oriented toward open source programs for the latter three points, where the ambition is to point at tools which, in addition to R, are used for the production of scientific documents. In many cases this type of software is written for scientists by scientists, and in most cases it is downloadable for free from the Internet.

C.0.4 Spreadsheets

Spreadsheets as such represent a fantastic invention. Indeed, it has been said that one of the reasons why personal computers really took off in the late seventies and early eighties was the availability of early versions of spreadsheet programs (e.g. Visicalc).
Today, the Excel program in the Microsoft Office package is by far the most common in the Windows world, but there are several others. Besides, being the most common does not mean that it is the best. A very good alternative to Microsoft Office is the OpenOffice package; that one is completely free and is available for all the common operating systems. The commercial (but still quite cheap) version of OpenOffice is called StarOffice. The spreadsheet program within the latter two systems is called Calc.
One alternative spreadsheet program outside any office type package is called Gnumeric.

Advantages
The advantage of spreadsheets is that you can collect information that belongs together in one file. You may for instance have several different data sets belonging to the same project stored in the same workbook, but in different sheets. In other words, spreadsheet workbooks are a very nice repository for the data in a project.
In addition, if you are using MS Word to write your article, there are functions in R that simplify the transfer of output from that system to the word processor of your choice, as long as you use a spreadsheet as an intermediate stage to rearrange things according to what you want. This is something you would want to do regardless of the statistical system of your preference.
Furthermore, spreadsheets are also useful for the generation of primitive graphics. Normally these are somewhat standardized and not very elegant, so if you need some unusual kinds of plots you have to look elsewhere. In that respect R is a good alternative.
Disadvantages
The main disadvantage of spreadsheets is the file format. It is not optimal for archiving data: you want your data to be readable after, say, twenty or more years, and that is not certain with this kind of software. You will therefore normally want to store two versions of your data, one for archival purposes and data analysis as plain text (e.g. .csv files), and the other for storage and handling in a spreadsheet.
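Writing such a plain text copy from R is a one-line operation. A minimal sketch, assuming the data are in a frame called mydata (the frame and file names are hypothetical):
> write.table (mydata, "project-archive.txt", sep = "\t", row.names = FALSE)
> check <- read.table ("project-archive.txt", header = TRUE, sep = "\t")   # verify that it reads back
A file written this way can be read by practically any program, now and in twenty years.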

C.0.5 Managing text

The most common understanding of the term word processing really covers a combination of three very different kinds of activities: (a) the authoring process itself, the creation of a (hopefully) readable text on some theme, (b) entering the text in some manner, and (c) formatting of the text for printing, also called typesetting.
Ignoring the first of these (which for authors is of course the most important one), the more common programs for this purpose combine the latter two into what are called WYSIWYG programs (What You See Is What You Get). The problem is that these programs do not really do the combination of the two tasks very well. Yes, what you see on the screen and then send off to the printer is what you get, but what you can see and send off to the printer is really quite primitive and, compared with what you can get from a professional typesetting program, sometimes downright ugly. In addition, these programs have become so user friendly that important functionality is lost or hidden quite effectively.
When we start thinking about this process, we see that we really have at least two very different tasks.
On the one hand we have the problem of writing the text and structuring it (dividing the text into paragraphs, sections, chapters, etc.). This is the real authoring.
The other part of the process is to take care of the appearance of the final document. This is a typesetting problem: the setting of the text should be elegant as well as consistent.
We therefore have two very different concerns, content and structure vs. style.
The main point is that these two parts of generating a document are all mixed up in normal WYSIWYG programs. In addition, in order to manage the appearance of a document in this type of GUI program, one needs to use a functionality usually called styles, something few users can handle.
Scientific writing is really quite strict in respect to the structure of the document, and we want the different elements which belong to the same category within a given type of document to look the same with a minimum of effort. In other words, as researchers we tend to write a very basic and limited set of document types, either a letter, an article, a book, or a report, each governed by a very strict set of formatting rules, where one example of such a set of rules is found in the APA style manual (Association, 2001). When using the standard word processing tools to write something like an article, you really have to work hard with styles etc. to come close to what you want.
What we really want is a tool that can aid us in respect to what can be included in the document
and how it should appear on the page.
LaTeX
If you really want the final version of your document to have a professional look in respect to typesetting, you cannot use programs like MS Word, the Writer program in OpenOffice, or WordPerfect; they are simply not good enough. You have to use real typesetting software. There may be many such programs on the market, but there is only one that is really good. In other words, what you do when using this type of software is to enter the text with one type of program and typeset the text with another.
The solution to these problems is something called LaTeX (see the TeX User Group, TUG, on the net, http://www.tug.org/, (Diller, 1992), or (Kopka & Daly, 2004)). Like R, this is open source, and it has long been the main tool for writing in the sciences, especially when you have a document with a lot of mathematics in it. But it is also suitable for other things. One example is this document, which is produced with LaTeX. The real advantage is that, unless you really know what you are doing AND work hard at it, you simply cannot get a document that has inconsistent formatting. When using other programs you really have to know the program to achieve that effect; most users never manage to do so. Another possibility is to generate presentations, where the slides are usually much nicer than you can produce with PowerPoint. When generating a document using these tools you normally work in two stages:
- You use a (possibly dedicated) editor to enter the text, with the commands defining the formatting of the text inserted in the text. The main thing is that you then concentrate on what you want to say, rather than on how it will look in the final document.
- You compile the text file using the LaTeX software. The output is a (hopefully) nice-looking document in PDF format. If not, you go back to the text file and change the formatting commands.
I have to admit that starting with LaTeX is not very easy. The initial learning curve is quite steep, but in my opinion well worth the effort. However, there are a number of different tools you can use to simplify operations.
Lyx
This is probably the simplest starting point for novices using LaTeX. When the full package is installed you get all the necessary tools at the same time. The interface you get when starting the program is simpler to work with than a plain editor like WinEdt, though not as advanced as one of the more common word processors; still, it is perfectly usable.
This program is really a preprocessor for LaTeX, and the end product when the file is printed is correspondingly nice, very much better than what you are able to produce with programs like OpenOffice or MS Word. But note that Lyx has its own file format, close to, but not quite, LaTeX (though still plain text), so if you want to try LaTeX itself you have to export the document from Lyx.


LaTeX editors
For advanced use of the LaTeX tool you need an editing tool that is suitable for the task. One editor tailored to LaTeX is called Led, another is called TeXnicCenter, and there is also one called Texmaker. All of these are open source and can be downloaded for free. One of the advantages of at least TeXnicCenter for Norwegians is that it is possible to install a Norwegian spellchecker, which is not available for the WinEdt editor mentioned below.
An alternative tool is WinEdt (see above). Strictly speaking, this is not an open source program. However, it is shareware, quite cheap, especially for students, and can be downloaded for free. A possible drawback with WinEdt is that it is a very general editor, and therefore contains more than you will ever need for LaTeX, which may be a bit confusing. On the other hand, there is a nice package (library) for R called RWinEdt, based on WinEdt, which gives the basic functionality for handling R functions and data.

C.0.6 Bibliographies

A very important function needed in scientific authoring is the handling of references. This involves two different tasks: (a) to accumulate a set of potentially relevant references in a database, and (b) to cite the references in a text as well as generating a bibliography toward the end of the document. The latter operation should be automatic, ensuring that the bibliography includes nothing more and nothing less than the works actually cited in the text. The BibTex format combined with JabRef is the LaTeX (open source) world's answer to programs like EndNote and Procite in the Windows world. The LaTeX package apacite formats references in documents according to APA standards.
In any case, it is simply foolish not to use that kind of tool when writing any kind of research report. Handling references in any other way is an invitation to introduce errors into the manuscript.

BibTex
BibTex is not just a program for adding references to your document, but also a (text) file format oriented toward the storage of references. It is extremely popular in large parts of the sciences. The nice thing about it is that LaTeX as well as Lyx are geared toward this format and, given a correctly formatted file, can generate a correctly formatted list of references in the style of your choice, including the APA style.

JabRef
JabRef is a nice GUI type of program (open source and free) for maintaining a bibliography (with an interface somewhat similar to EndNote) in BibTex format (see above). This program is platform independent, since it is written in Java. For Windows systems there are several others, where some require that databases like MySQL are installed on your machine. In my eyes that is a bit of an overkill.

C.0.7 Presentations

One important task for the student, teacher, and researcher is to be able to generate a presentation, a sequence of slides on a particular subject. The most popular tool is Microsoft Powerpoint. I must admit that I do not really like that program; I have always regarded it as somewhat inflexible (see footnote 1).

The main competitor is the program called Impress from the OpenOffice suite (downloadable for free from the Internet). This program reads and writes files in the Powerpoint format. In most respects it is essentially very similar to Powerpoint, but slightly better in respect to the formatting of text on the slides, and much better in respect to the generation of graphics.
In case you are worried about being able to present whatever you have wherever you are going, there are two arguments in favor of Impress. For one thing, that program generates PDF files right out of the box, and the probability of a conference site having a PDF reader is actually higher than that of having Powerpoint or a corresponding program installed. The second argument is the portable version discussed below.

C.0.8 Portable Applications

A very interesting alternative to some of the programs mentioned above are the so-called portable applications, see:
http://portableapps.com
These are versions of standard software that do not have to be installed on the machine they are run on. In other words, you can carry a copy of the programs with you on a minidisk, an iPod or similar device, or a memory stick, and run the programs on any Windows machine without any installation procedure at all. The list of programs in this category is growing, and includes systems like the complete OpenOffice suite. As long as the machine is running a reasonably modern version of Windows and has a USB connector, you are in business. Plug in your device, locate the file and start your presentation. This also eliminates another potential problem, as when the Windows/Powerpoint machines at a conference site do not have the fonts you prefer. If that is the case, and your presentation is in Powerpoint, you will have a real problem; at least parts of the presentation might be gibberish, which is not a very nice situation. That is never a problem with a PDF file, nor with a portable version of Impress.
If you are making a presentation of any kind for a professional audience, few would leave home without a portable computer. Given what I have mentioned above, it is easy to argue for dropping the portable machine as well, and leaving home with a portable application on at least two memory sticks, one in your pocket and one in your suitcase.
Another alternative for presentations should be mentioned, and that is the same LaTeX as mentioned above combined with a package called Beamer. The output is in PDF format, which is more portable than anything else. The Portable applications site above includes a PDF viewer called Sumatra, not perfect, but useable.
Finally, since the programs in this category run off a USB memory device without any required installation, they can be used on any machine, including machines where you have no administrative rights.

Footnote 1: One of the problems with PowerPoint is that the file format is to a large extent incompatible with that of the word processor in the same package. Working on a paper together with a presentation for the same project therefore tends to become two separate and parallel processes, which means that you are repeating yourself, which in itself is an invitation to errors. Not so with LaTeX: in that case you are essentially writing text which can, with a few keystrokes, be changed from an article into a presentation. It is even possible to have your document include instructions for the generation of the statistical results you are reporting on, using a combination of LaTeX and R. See the comments on Sweave below (page 117).

C.0.9 Combining Statistical Output and Authoring

R is a programming language: you can write a script which contains instructions about what computations are needed to produce the results you need. LaTeX is also a programming language, used for typesetting a document. What about combining the two? That is possible. In several places, including chapter A on page 99, the need for documentation of all parts of a project has been discussed. Now consider the following scenario:
What if the manuscript for your article or presentation contained not only the text and the instructions on how to format the text, but also the commands needed to generate and include the results (figures, tables, graphics) in the report itself as it is printed?
Now, that is an interesting question. It could for instance mean:
- If somebody asks: What did you really do to generate the numbers in the report? The answer is simple: the file used to generate the article tells the story with all details included; it is part of your documentation for the project. Apart from the data, the command file for the article contains everything needed to generate the article, including all results. Little other information is required. That is, if you are using R.
- You discover an error in the data set. There are two alternatives: you can either run all the analysis again and manually cut and paste all figures and results into the document for the new version; the probability of generating new errors (or of failing to correct old ones) is large. The alternative is simple: correct the data itself and run the article again. The printing process invokes the necessary software to generate the results presented in the article. If the scripts are correct, everything is correct, using the new data. Again, if you are using R.
- You want to try out the effect of potentially different transformations on the final results. You then have a version of the article for each variant. It is your choice.
This is called literate programming, a term coined by Donald Knuth (Knuth, 1992), the original author of the TeX system (which is the basis for LaTeX). His notes on the subject are found on the net (http://www-cs-faculty.stanford.edu/knuth/lp.html).
For someone who has experienced polishing papers and presentations at the last minute before conferences or other deadlines, sometimes late into the night, occasionally with a component of panic, as well as having a small nagging voice in the back of the mind asking the question: Is this really correct?, the very idea sounds very much like music.
The basic idea is simple, and involves three steps:
1. In the file defining the text for the article, embed the necessary code to generate the results (figures, tables) as so-called chunks (see footnote 2). Save the file with the extension .rnw.
2. Start R and use the function Sweave () (the initial capital S is important) to process the file. This generates a regular .tex file which includes the output from the R commands.
3. Typeset the article in the normal manner. Voila! You have a nicely typeset article where the source (the .rnw file) is a complete documentation of what was done as well as the text itself.
For more details, make a search on the net using the term sweave.
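To make the idea concrete, here is a minimal sketch of what such an .rnw file might contain (the file name, the data file, and the variable are all hypothetical):
\documentclass{article}
\begin{document}
The mean score was computed directly from the data:
<<>>=
mydata <- read.table ("mydata.txt", header = TRUE)
mean (mydata$score)
@
\end{document}
Running Sweave ("example.rnw") in R turns this into example.tex, with the R output inserted where the chunk was, and that file is then typeset in the usual way.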
Remember, the entire project, including the analysis, should be reproducible. In real science this is hard; redoing all the data collection, the field work, or whatever, is asking a lot. However, one should try to document the steps as far as possible, and this is one tool that might help.
Footnote 2: By a chunk is meant a block of text starting with <<>>= (where there may be arguments between the delimiters) and ending with a @, both on separate lines. Place the R commands inside this block.

Reference card
This is a small subset of the commands and operations available in R and S. Parentheses are for functions, square brackets indicate the position of items in a vector or matrix. Names like x1 are user-supplied variables. Strings are enclosed in single or double quotes.

q () : Quit the session
<- : Assign
frame$objectname : One object within an object.
m1[,2] : Column 2 of matrix or frame m1.
f1:f2 : Sequence of numbers from f1 to f2.
m1[,2:5] or m1[,c(2,3,4,5)] : Columns 2 to 5 of matrix m1. Values before the first comma refer to rows, in this case all rows, since none are specified.
NA : Missing data
is.na : True if value is missing
library (<name>) : Load a package (e.g. RWinEdt or mva).
c (1, 2, 3, 4) : Creates a vector, in this case containing the numbers from 1 to 4. A variant with the same effect is c (1:4)
t () : Transpose a matrix, switch rows and columns.

Help
help (<command>) : Get detailed help with command.
help.start () : Start browser help.
apropos (<topic>) : Other commands relevant to the named topic.
example (<command>) : Examples of the command.

Input and output
source (<filename>) : Run the commands in the named file. Useful option: echo.
read.table (<filename>) : Read data from file <filename>. Useful options: file.choose (), sep= and header=TRUE.
edit (<framename>) : Edit a data set or frame. Alternatively: data.entry (<framename>).
sink (<filename>) : Send all output to the file <filename> until sink ().
write.table (dataframe, "<filename>") : Writes a table or frame to <filename>.
print (), cat (), format (), list () etc. : Additional output functions.
na.omit (<framename>) : Remove all rows (cases) from the frame where one or more of the values are NA (missing).

Managing variables and objects
attach (x1) : Put the objects (usually variables or columns) in the frame x1 in the search path. Note that references to the variables are copies of the variables.
detach (x1) : Reverse an attach (), remove variables (columns) in the frame x1 from the search path.
ls () : List all objects in workspace.
rep (x1, n1) : Repeats the vector x1 n1 times.
rm (<objectname>) : Remove the object with <objectname> from workspace.
dim (m1) : Dimensions of matrix m1.
names (<framename>) : Names of the variables in a frame.
merge (frame1, frame2) : Merge data frames.
cbind (a1, a2, a3) or rbind (a1, a2, a3) : Bind the named vectors as columns or rows into a matrix.
data.frame (v1, v2) : Make a data frame from vectors v1 and v2.

Control flow
for : Repeat what follows enclosed in curly brackets { and }.
while (condition) { ... } : Repeat the operation(s) within the brackets until the condition is met.
if (condition) ... else ... : Conditional execution of the command(s).

Arithmetic
%*% : Matrix multiplication
+, -, /, * : Standard operations in expressions.
^ : Power.
log (), cos (), sqrt () etc. : Standard functions.

Statistics
max (), mean (), median (), sum (), var () : Useful option: na.rm=TRUE.
summary () : Expand on the standard output from a statistical function, where the resulting output depends on the type of the argument.

Miscellaneous
rank () and sort () : Rank and sort.
scale () : Center and / or standardize a vector.
sweep () : Remove effects from a vector.
ave (x1, y1) : Averages of x1 grouped by factor y1.
table () : Make a table of frequencies.
tabulate () : Tabulate a vector.

Basic data analysis
aov (), anova (), lm (), glm () : Various linear and nonlinear models.
t.test () : t-test
chisq.test (x1) : Chi-square test on matrix x1.
fisher.test () : Fisher exact probability test.
cor () : Various correlations. Useful options are: use="pairwise" or "complete", and correlation type: method="pearson" (default), "spearman" or "kendall".
cor.test () : Test correlations.

Some statistics in the mva package
prcomp () : Principal components analysis.
kmeans () : kmeans cluster analysis.
factanal () : Factor analysis.
cancor () : Canonical correlation.

Graphics
Plot commands: plot (), barplot (), boxplot (), stem (), hist (), abline ().
matplot () : Matrix plot.
pairs (<matrix>) : Scatterplots.
coplot () : Conditional plot.
stripplot () : Strip plot.
qqplot () : Quantile-quantile plot.
qqnorm (), qqline () : Fit normal distribution.

References
Association, A. P. (2001). Publication manual of the American Psychological Association (5th ed.). Washington DC: Author.
Ballinger, B. (2007). The curious researcher: A guide to writing research papers. New York, USA: Pearson Longman.
Bradley, J. V. (1968). Distribution free statistical tests. Englewood Cliffs, New Jersey: Prentice-Hall.
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin,
70(6), 426-443.
Crawley, M. J. (2005). Statistics: An introduction using R. Chichester: John Wiley and Sons.
Crawley, M. J. (2007). The R book. Wiley.
Dalgaard, P. (2002). Introductory statistics with R. London: Springer-Verlag.
Davison, A., & Hinkley, D. (1997). Bootstrap Methods and Their Application. Cambridge: Cambridge
University Press.
Diller, A. (1992). Latex line by line (2nd ed.). New York, NY, USA: John Wiley & Sons.
Edgington, E. S. (1995). Randomization tests (3rd ed.). New York: Marcel Dekker.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Everitt, B. (2005). An R and S-plus companion to multivariate analysis. London: Springer-verlag.
Good, P. I. (2005). Introduction to statistics through resampling methods and R/S Plus. Wiley.
Harman, H. H. (1967). Modern factor analysis (3rd ed.). Chicago: University of Chicago Press.
Hunt, A., & Thomas, D. (2000). The pragmatic programmer. Reading, Massachusetts: AddisonWesley.
Johnsen, T. B. (1968). Strukturell balanse i sosiale systemer; En applikasjon av graph-teori pa utviklingen
av sosiale strukturer.
Johnsen, T. B. (1970). Balance tendencies in sociometric group structures. Scand. J. Psychol., 11,
80-88.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and
Psychological Measurement, 20, 141-151.
Kaiser, H. F., & Rice, J. (1974). Little Jiffy mark IV. Educational and Psychological Measurement, 34, 111-117.
Knuth, D. E. (1992). Literate programming. Stanford, CA, USA: Center for the Study of Language
and Information (CSLI Lecture Notes, no. 27.).
Kopka, H., & Daly, P. W. (2004). Guide to Latex: Tools and Techniques for Computer Typesetting (4th
ed.). Boston, USA: Addison-Wesley.
Maindonald, J., & Braun, J. (2003). Data analysis and graphics using R - an example-based approach.
Cambridge: Cambridge University Press.
Manly, B. F. (1991). Randomization and Monte Carlo methods in biology. London: Chapman & Hall.
Nicol, A. A. M., & Pexman, P. M. (1999). Presenting your findings: A practical guide for creating tables.
Washington DC: American Psychological Association.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559-572.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological
Methods, 7(2), 147-177.
Student. (1908). The probable error of the mean. Biometrika, 6, 1-25.
Verzani, J. (2004). Using R for Introductory Statistics. Baton Roca, USA: Chapman and Hall.


Index
.RHistory, 73
.Rdata, 7, 15, 16, 77, 111
.Renviron, 111
.SAV files, 20
<-, 6, 14, 119
%*%, 58, 119
abline (), 35, 120
anova (), 51, 55, 120
aov (), 37, 52, 120
Archiving, 106
attach (), 27, 28
ave (x1, y1), 120
Backup, 106
barplot (), 7, 120
Beamer, 117
Bibliography
Bibliographies, 116
BibTex, 116
JabRef, 116
boot (), 65
Bootstrapping, 59
Boxplot, 13
boxplot (), 13, 120
break, 15
c (), 12, 13, 15, 71
CA, 39
Calc, 78, 102, 113
cat (), 48
cbind (), 75
CFA, 39
chisq.test (x1), 120
Clipboard, 23
Clustering, 29
Codebook, 103
coef (), 55
coefficients (), 53
complete, 26
Component analysis, 39
coplot (), 120
cor (), 26, 34, 35, 56, 120
cor.test (), 34, 36, 37, 57, 63, 120
Correlations, 34
complete, 120
Kendall, 120
pairwise, 120
Pearson, 120
cos (), 11, 119
Cronbach, 67
Cronbach's alpha, 45
data (), 6, 15, 19
Data entry, 21
data.frame (), 79
Delphi, 108
demo (), 9
detach (), 28
echo=, 73
edit (), 23, 24, 70
Editors, 107
EFA, 39
eigen (), 42
EndNote, 116
EpiData, 103, 104
example (), 9
Excel, 77, 78, 102
exp (), 11
FA, 39
factanal (), 43
Factor, 54
factor, 72
factor (), 50, 54
Factor analysis, 39
FALSE, 15
filter, 104, 105
filters, 88
fisher.test (), 120
fitted (), 53
fix (), 23, 24, 70
for, 76, 119
foreign, 20
format (), 48
Function
arguments, 85
body, 85
function Sweave (), 118
Functions, 85, 86
GIGO, 100
ginv (), 58
glm (), 120
Gnumeric, 102, 113
GnuMerics, 78
GUI, 2
head (), 24, 71
header=TRUE, 23
hist (), 13, 34, 50, 51, 120
Histogram, 13
Hmisc, 27
HTML, 79
if, 15, 119
Imputing, 25
Inf, 24
infert, 54
install.packages (), 111
Installation, 107
interaction.plot, 51
Interactions, 54
is.na, 119
Item analysis, 45

LaTex, 82, 115
Led, 116
library (), 9, 15, 119
lines (), 34
Linux, 107
lm (), 9, 13, 14, 37, 51, 52, 54, 55, 57, 58, 72, 87, 120
loadings(Results), 42
log (), 11, 119
log10, 75
log10 (), 11, 13
Logical expressions, 12
ls (), 15, 16
Lyx, 82, 115
Mailing lists, 8
make.names (), 14
matplot (), 120
max (), 32, 119
mean (), 6, 26, 119
median (), 32, 119
min (), 32
Missing data, 24
MS Word, 77, 107
Multiple regression, 52
NA, 24, 119
na.fail (), 27
na.omit (), 27, 47, 55, 79
na.rm, 26
naclus (), 27
names (), 6, 24, 71
NaN, 24
naplot (), 27
Notabene, 82
NotePad, 107, 111
Notepad Plus, 108
objects (), 16
omit.na (), 79
OpenOffice, 113, 115, 117
Operators, 15
Package, 109
Hmisc, 27
Iswr, 110
R2HTML, 79
Rcmdr, 109
rgl, 110
RWinEdt, 108
utils, 9
Packages, 107
pairs (<matrix>), 120
pairwise, 26
PCA, 39
Permutations, 59
PFA, 39
pi, 12
plot (), 13, 26, 27, 33, 120
plot.default (), 36
plot.default (group, extra), 36
Portable Applications, 117
predict (), 53
Presentations, 116
princomp (), 40
print (), 14, 48

q (), 15, 73, 119
qqline (), 120
qqnorm (), 120
qqplot (), 120
quantile (), 63
R, xiii, 107
R2HTML (), 79
Random numbers, 66
rank (), 75, 120
read.table
file.choose, 23
header=, 22
sep=, 23
read.table (), 19, 23
Reliability, 45
repeat, 76
Repeated measures, 21
Resampling, 20
boot package, 20
bootstrap package, 20
Bootstrapping, 20
Jacknifing, 20
Permutation test, 20
resampling, 59
return, 15
rgl, 110
rm (), 16
rnorm (), 13, 50, 67
RSiteSearch (), 9
RWinEdt, 108
sample (), 61, 65
scale (), 58, 120
Scatter-plots, 35
Scatterplot, 13
SciViews, 108
Scripts, 85
sd (), 26, 32
SEM, 39, 40, 67
sessionInfo (), 15
set.seed (), 67
sin (), 11, 12
sink (), 73, 74

sort (), 120
source (), 73, 119
SourceForge, 108
Spreadsheets, 113
SPSS, xi, 3, 15, 20, 21, 26, 40, 42, 72, 78, 82, 83, 93
sqrt (), 11, 119
SS, 32
Statistica, xi, 3, 21, 26, 72, 78, 82, 93
stem (), 120
str (), 14, 15
stripplot (), 120
Student, 36, 37
Styles, 114
sum (), 26, 32, 119
Sum-of-squares, 32
summary (), 13, 14, 32, 57, 119
SuperOffice, 113
Sweave, 117
sweep (), 76, 120
t (), 15, 58, 119
t.test (), 36, 37, 72, 120
table (), 50, 120
Tables, 77
tabulate (), 120
tail (), 24, 71
tan (), 11
tapply (), 52
Texmaker, 116
TeXnicCenter, 116
Tinn-R, 108
Tools, 113
TRUE, 15
Typesetting, 115
var (), 32, 119
vim, 109
while, 76, 119
WinEdt, 108, 116
with (), 28
WordPerfect, 82, 115
Workspace, 15, 77
write.table (), 24
