10/2002
The R language
R is GNU S: a re-implementation of S. The two languages are superficially similar but do not have identical underpinnings (one big difference is in the scoping rules). Furthermore, some packages (libraries) for certain statistical analyses are written only for R (or only for S), and because of the differences between the two languages, such a package runs error-free only in the language it was written for.
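To make the scoping difference concrete, here is a minimal sketch in R. The names make.counter and counter are made up for this illustration. R uses lexical scoping, so the inner function can see and update count in the function that created it. S does not resolve free variables this way, so I would not expect this idiom to carry over to S-PLUS unchanged.

```r
# Hypothetical example: a counter built from a closure (R only).
make.counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates count in the enclosing function's environment
    count
  }
}

counter <- make.counter()
first <- counter()
second <- counter()
print(c(first, second))
```

Each call to make.counter() creates a fresh count, so two counters made this way do not interfere with each other.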
There is a program called Scompile that will compile S functions. I believe this may be similar to the compiling done in Mathematica.
Thompson
10/2002
R also comes with an optional GUI, and one can enter data from a simple spreadsheet environment. But R does not have the editable graphics facilities that S-PLUS does. Indeed, there is no R-PLUS. However, there is an XGobi interface accessible from CRAN (see below) for both Unix and Windows. There is a tremendous amount of documentation on R (the language, data import information, extensions, etc.) on the website (see below) and downloadable with the distribution (which is at version 1.6 as of this writing). As with all GNU software, there are no guarantees of full functionality, and the distribution gets updated much more frequently than the commercial S-PLUS does, although user-contributed libraries for both applications are updated and added frequently.
If there is time, I will try to include R in the S-PLUS lesson. It is occasionally discussed in V&R, and contrasted with S. If you can use S-PLUS via the command line, you should have no problems using R. However, if you are going to be developing software in R or porting functions from S to R, you need to know the differences between R and S (if you are lucky, they will show up for you).
This write-up is meant to show you around S-PLUS and point out where to find things. There is not the space to go into detail about everything (and many other books and manuals have already done that; see footnote 2), but at least you'll know where things are, and you can explore them on your own later. There is a lot to S-PLUS. I can say with all honesty that I have never been unable to do something (statistical or programming-wise) that I've wanted to do using S or S-PLUS. Of course, sometimes figuring out how to do it can take time. But I usually learn something in the process.
Getting the MASS library that goes with Venables and Ripley (2002) (V&R)
See the Appendices of V&R for information on how to download and install the MASS library (http://www.stats.ox.ac.uk/pub/MASS4/Winlibs). You can use these instructions to download any S-PLUS library. Many libraries are stored at the web site http://lib.stat.cmu.edu/S.
A list of books on S-PLUS can be found in the User's Guide that comes with S-PLUS.
However, note that when you start S-PLUS for the first time, it will ask you what directory you want to use for your working directory. As long as you don't check the box that says "don't ask me this again", it will continue to ask you for the working directory.
BATCH mode is convenient for calling S-PLUS and running S-PLUS commands from other applications (like R). See my example on the STAT 5537 course website.
Each instance you start up should be invoked from this icon. Beware, though, that objects you change or add in one instance can affect what the other instance sees.
Interacting with the operating system (DOS or Windows for S-PLUS 2000)
One can send DOS commands while in S-PLUS using the dos function, if you are using a pre-XP version of Windows. One can invoke any Windows application using the system function, with the appropriate executable statement.
system("Notepad", multi=T)
dos("dir", output.to.S = T)
Also, see examples on the STAT 5537 course website. I used system to invoke an S-PLUS BATCH file from R. You can also do the reverse: i.e., run an R BATCH file from S-PLUS.
Search List
When S-PLUS searches for objects or files, it searches databases or lists in a particular order. The search() command returns this order: the directories or lists that S-PLUS searches for objects. The database in position 1 is also where S-PLUS will write any new objects you create. The default first position is the working directory, by definition. Positions can be changed or replaced via the attach and detach commands. Thus, changing the working directory entails manipulating the search list. Here is an example of the result of search():
 [1] "C:\\Program Files\\Insightful\\splus61\\users\\default" "default"
 [3] "Laura"                                                  "splus"
 [5] "stat"                                                   "data"
 [7] "trellis"                                                "nlme3"
 [9] "menu"                                                   "sgui"
To find where on the search list an object appears, use the function find. For example, to find the object (a function) rm0,
> find("rm0")
[1] "C:\\Program Files\\Insightful\\splus61\\users\\default"
Any S-PLUS object can be attached to the search list as long as it consists of named components (as does a list or data frame). See the help file for more information. To list the paths of the databases on the search list, use searchPaths.
> matrix(searchPaths(),nc=1)
     [,1]
[1,] "C:\\Program Files\\Insightful\\splus61\\users\\default"
[2,] "C:\\Program Files\\Insightful\\splus6\\users\\default"
[3,] "C:\\Program Files\\Insightful\\splus6\\users\\Laura"
[4,] "c:\\progra~1\\insightful\\splus61/library/splus"
[5,] "c:\\progra~1\\insightful\\splus61/library/stat"
[6,] "c:\\progra~1\\insightful\\splus61/library/data"
[7,] "c:\\progra~1\\insightful\\splus61/library/trellis"
[8,] "c:\\progra~1\\insightful\\splus61\\library\\nlme3"
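The same ideas can be tried out in R with a small sketch. Here I attach a list with one named component and ask where an object is found. The names demo.db and demo.value are made up for this illustration, and the positions reported by search() in R will not match the S-PLUS listing above.

```r
# Attach a list with named components; by default it lands in position 2.
attach(list(demo.value = 42), name = "demo.db")

# The workspace stays in position 1 and the attached list follows it.
print(search()[1:2])

# find() reports which entry on the search list holds the object.
where.found <- find("demo.value")
print(where.found)

# Remove the entry again and check that it is gone.
detach("demo.db", character.only = TRUE)
still.attached <- "demo.db" %in% search()
```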
Installing libraries
Libraries contain functions, objects, and sometimes data sets (which are stored as objects) to do specific types of tasks. Many libraries have functions that perform statistical analyses. For example, the tree library contains functions associated with fitting regression and classification trees. S-PLUS comes with some libraries already. Other libraries (contributed by users) have to be downloaded from the statlib website. See Appendix C.2 of V&R.
To load a library that comes with S-PLUS, do one of the following:
  At the command line, type library(libname)   # libname is the name of the library
  In S-PLUS, from the File menu, select Load Library, and follow the instructions
Loading a library places its objects onto the search list.
To have a library load every time you open a chapter (project) in S-PLUS, create a text file called S.chapters in the working directory for the project. For each library you want to load at startup of that project, type the path to the library. For example, to load MASS each time I start up in a particular chapter, the file S.chapters would look like
c:/program files/insightful/splus61/library/mass
A similar thing can be done in the Rprofile file (c:\Program Files\R\rw1061\etc\Rprofile) in R. Alternatively, issuing a library or require command within a .First() function within the working directory will load the library at start-up in that directory.
Getting Help
A help file exists for most S-PLUS commands and objects that come built-in with the system. One can access help using one of the following:
help(topic) ?topic
Within a script window: highlight the object and press F1
Help Menu, including pdf manuals
Many user-contributed libraries also have help files for their functions. To invoke help for a library, type help(function.name, library=libname) at the command line, where function.name is the name of the function on which you want help. See Appendix A of V&R. To get a description of a particular library, use library with the help argument set to the name of the library. For example, to get a description of the MASS library, do
> library(help=MASS)
History Log
A history of the commands you have given S-PLUS via the command line or the GUI is stored in text form in the history log. To see the contents of this log, do one of the following:
  Press the history log button (this opens a script file containing the history log)
  Access History log under the Window menu
  Press the history interactive button
In R, the command history is saved to a file called .Rhistory when the image is saved to a .RData file using the command save.image. (See below)
To change the prompt to STAT_5537>:
options(prompt="STAT_5537>")
To save old options:
options.old<-options()   # assign old options to options.old
options(options.old)     # change back to old options
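The save-and-restore pattern can be sketched in R as follows. This example changes digits rather than the prompt, simply because the effect is easy to check in a script. The names old.options, short and restored are made up for this illustration.

```r
# options() returns the previous values of whatever it changes,
# so the result can be used later to put things back.
old.options <- options(digits = 3)
short <- format(pi)        # formatted with 3 significant digits

options(old.options)       # change back to the old setting
restored <- format(pi)     # formatted with the original number of digits

print(c(short, restored))
```

Saving only the options you change, rather than the whole list, keeps the restore step from undoing unrelated changes made in between.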
Options menu: Graph Options (later), General Settings, Command Line
Create a .First function:
.First<-function(){
   library(MASS, first=T)
}
You may also have a .Last function for cleanup on exit. The getenv() function lists current settings for environment variables. S_CWD is the current working directory.
new.database("c:\\project1\\dirname")
attach(dirname)          # move dirname to position 2 in search list (the default)
attach(dirname, pos=1)   # move dirname to position 1 in search list
detach(dirname)
See also mkdir and rmdir and links in the help files. Try
junk<-new.database(paste(getenv("S_CWD"),"\\STAT_5537",sep=""))
attach(junk, pos=1)
The result of a call to attach is an object of class "attached". Assigning the call to a name gives a way to refer to that attached database using the name. For example,
mydb<- attach(my.fuel.frame)
identical(mydb, database.attached("fuel.frame"))   # are these identical databases?
[1] T
mydb2<- attach(directory.name)   # attach a directory with .data folder
assign("x", 0, w=mydb2)          # assign x with value 0 to database mydb2
Not all attached databases have to appear in the search list (And if name argument is not specified, then they appear as in the list). The purpose argument of attach allows other ways of attaching. For example, purpose="data" implies that the new database is to be used only for explicit requests, and that it will not appear in the search list (Chambers, 1998). A database attached for purpose="data" will not be used in the standard search for objects, and so you will never accidentally get an object from that database instead of the intended database.
Temporary Databases
To make a workspace that you can throw away at the end of a session type at command line
attach(pos=1, new.database(work.dir<-tempfile()))
There is also the option of saving objects to the session database or session frame. This is frame number 0. To save objects to the session database, use
assign(name, value, fr=0) # assigns to frame 0 (disappears when you quit S+)
So that the above assign statement and the following are now equivalent
name%<-%value
In R, all objects are deleted when you quit R unless they have been saved to an image via save.image() (this can be done in Rgui as well). This will save the database into a file
Removing Objects
To remove objects from your working directory, type rm(object) at command line. To remove objects from other databases, type remove(name of object, where=database.position) at the command line, where database.position is the position of database in search list. A convenient way to remove objects from the session database uses the following function
rm0<-function(x){
   remove(deparse(substitute(x)), where=0)
}
Now, rm0(e) removes the object e from the session database. (The name is given unquoted, because deparse(substitute(x)) turns the argument as typed into a character string.)
Important Note: With the assignment function, <-, you cannot directly modify any objects in databases other than those in position 1. So, you can't accidentally overwrite built-in S functions! If you try to overwrite an object in one of those databases (via the assignment function), S-PLUS will make a copy of the modified object in your working directory. This modified object will mask the original until you remove it.
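Masking works the same way in R, and a short sketch shows the whole cycle. Assigning to the name of a built-in function creates a copy in the workspace, which hides the original until the copy is removed. The result names here are made up for this illustration.

```r
builtin.result <- mean(c(1, 2, 3))     # uses the built-in mean

mean <- function(x) "masked"           # a workspace object that masks the built-in
masked.result <- mean(c(1, 2, 3))

rm(mean)                               # removes only the workspace copy
unmasked.result <- mean(c(1, 2, 3))    # the built-in is visible again

print(list(builtin.result, masked.result, unmasked.result))
```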
If you use a file in a different directory, specify backslashes with double \s. For example
> read.table("c:\\my documents\\fuel.txt", header=T, row.names="Car", sep=" ")
You can also use forward slashes: "c:/my documents/fuel.txt" The argument na.strings changes the label for a missing value from NA to any number of other values.
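Here is a small R sketch of the na.strings argument. To keep it self-contained it reads from a character string through the text argument of R's read.table instead of from a file. The data values and the name fuel.demo are made up for this illustration, and "." is assumed to be the missing-value label in the input.

```r
raw.text <- "Car Weight Mileage
Alpha 2560 33
Beta 2345 .
Gamma 1845 37"

# Treat "." as the missing-value label and use the Car column as row names.
fuel.demo <- read.table(text = raw.text, header = TRUE,
                        row.names = "Car", na.strings = ".")
print(fuel.demo)
```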
with records spanning many lines, and has more flexibility. The help file describes its features. With the what argument, you can specify the structure of the data to be read in, and each variable type or class. In addition, you do not have to read all of the variables or all of
the observations. Here, I read in the first three observations of a fictitious data set where the observations are labeled X1, X2, ..., X15. After these character id variables, there are 20 numeric measurements spanning three lines per observation. I will read in only the first three observations, and only the middle 7 variables per observation.
X1 23.39138 31.03014 37.01177 26.36072 23.30528 24.43292 30.70835
X1 28.19607 18.46590 29.90596 25.15313 32.04329 27.00844 22.18513
X1 18.73721 25.03833 22.83223 17.65843 28.81536 NA
X2 25.05268 21.22678 25.25613 22.19962 23.02498 25.13215 32.20992
X2 26.20181 26.16561 30.77551 21.34217 25.13774 22.48904 21.76138
X2 22.12924 24.26280 21.94572 22.83311 30.93580 19.99993
X3 17.30856 24.10714 25.21277 10.54811 25.55041 31.21184 23.46410
X3 21.09150 30.82490 24.76578 27.28496 29.46369 21.13336 27.47658
X3 26.94256 28.56601 25.85390 26.00526 31.82780 26.66346
test<-scan("a:/TextData/scanExample.txt",
   what=list(NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
             "",0,0,0,0,0,0,0,
             NULL,NULL,NULL,NULL,NULL,NULL,NULL),
   multi=T, n=23*3)
test<-test[unlist(lapply(test, function(x) !is.null(x)))]
as.data.frame(test[-1], row.names = test[[1]])
        X.1      X.2      X.3      X.4      X.5      X.6      X.7
X1 28.19607 18.46590 29.90596 25.15313 32.04329 27.00844 22.18513
X2 26.20181 26.16561 30.77551 21.34217 25.13774 22.48904 21.76138
X3 21.09150 30.82490 24.76578 27.28496 29.46369 21.13336 27.47658
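The same technique can be sketched in R on a much smaller made-up input: each record spans two lines and has five fields, of which only the id, the third and the fourth are kept. This uses the text and multi.line arguments of R's scan, so the spelling differs from the S-PLUS call above, and the names fields and scan.demo are made up for this illustration.

```r
raw.text <- "A 1 2
3 4
B 5 6
7 8"

# One list entry per field of a record; NULL means "skip this field".
fields <- scan(text = raw.text,
               what = list(id = "", NULL, x = 0, y = 0, NULL),
               multi.line = TRUE, quiet = TRUE)

# Drop the skipped (NULL) components, then use the id as row names.
fields <- fields[!sapply(fields, is.null)]
scan.demo <- as.data.frame(fields[-1], row.names = fields[[1]])
print(scan.demo)
```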
See also the S Commands file on the STAT 5537 course website. Look up scan for a fairly complex example.
scan can also be used for reading interactively from standard input (i.e., the console). For example,
V&R discuss this on page 47 of their text. I did something similar to what they describe on pages 47-48 to extract relevant information out of a complicated subject ID string (see Part II of this manual).
Data, and specify the format. Consult the options available for each format. Data can be read in sequential blocks using the readNextDataRows function on a data handle object. A data handle object is created using the openData command on an external file. After reading all the data you want, close the handle using the closeData function. A simple example is as follows. Suppose we have an EXCEL file called mydata.xls. We can open it using openData, and specify the number of rows to read with each call to readNextDataRows by giving the argument rowsToRead. We can get information on the variable names from the call to getDataInfo.
handle.mydata <- openData("mydata.xls", type="EXCEL", rowsToRead=100)
getDataInfo(handle.mydata)
# Read the first 100 observations.
mydata.100 <- readNextDataRows(handle.mydata)
# Close the external data file.
closeData(handle.mydata)
or
menuSelectData(data.source = "Existing Data", existing.name = "my.data.set")
In order for a data set to be included in the list of existing data, it must be either a data frame or data sheet. To find out the class of an object, use the class function.
class(my.data.set)
[1] "data.frame"
R also has a data editor window. Call it up with menu item Edit->Data Editor, then select the name of a data frame. Or, use fix(x) where x is a data frame.
Check what happens when you put dimnames.write=F. Consult the functions write and write.matrix for other forms of writing to external files.
There are several other options for save, such as whether to save in ascii format or compressed format. See online help via ?save.
Set preferences for an Explorer using Format->Object Explorer or the right-click shortcut menu (right click in the white space in the right pane). Consult the User's Guide for descriptions of the options. When you are done customizing, you can save your preferences as the new default Object Explorer. All new Object Explorers will then have these preferences.
junk3 <- menuStackColumns(target = junk3,
   target.col.spec = list("Measurement"),
   source = junk2,
   source.col.spec = list("Group1", "Group2", "Group3"),
   group.col.p = T, group.col.name = "Group")
Note that a command equivalent exists for categorizing a continuous variable (cut function). Subsetting a data set and merging two datasets can also be accomplished via the GUI.
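As a sketch of the command equivalent, here is cut in R applied to a few made-up weights. The break points and the labels light, medium and heavy are arbitrary choices for this illustration. By default the intervals are closed on the right, so a value equal to a break point falls in the lower category.

```r
weight <- c(1845, 2500, 3110, 3855)

# Three categories: (1500, 2500], (2500, 3500], (3500, 4000]
weight.class <- cut(weight,
                    breaks = c(1500, 2500, 3500, 4000),
                    labels = c("light", "medium", "heavy"))
print(table(weight.class))
```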
You can see what each plot resembles from the Insert Menu->Graph within any graphsheet. If you insert a graph into the current graphsheet, any existing graphs on that sheet move to make room for the new graph when it is finished being created. The arrangement of the graphs on a sheet can also be changed by the user, either through the Format->Arrange Graphs Menu or just by moving with the mouse.
Trellis graphs are series of graphs conditioned on a set of categorical values. For example,
[Figure: trellis graph with Weight (1500 to 4000) on the x-axis.]
Exporting GraphSheets
The File Menu has a selection called Export Graphs that allows you to export a graph in a variety of formats. In addition, there is a clipboard button that by default sends your graph to the clipboard, but can be customized to send it to another application. Finally, there is a function called pdf.graph which will send your graphical output to a pdf file. I believe also that the Design or Hmisc libraries have a function that transforms your graph into LaTeX format. I can give more information at a later date.
a factor giving the general type of car. The levels are: Small, Sporty, Compact, Medium, Large, Van.
an order statistic giving the relative weights of the cars; 1 is the lightest and 111 is the heaviest.
a numeric vector giving the engine displacement in liters.
numeric vector of list price with standard equipment, in dollars.
Country: a factor giving the country in which the car was manufactured. The levels are: Brazil, England, France, Germany, Japan, Japan/USA, Korea, Mexico, Sweden, USA.
Reliability: an ordered factor; Much worse < worse < average < better < Much better; contains NAs.
Fuel - ?
First, we merge the two data sets by their row names (the names of the cars). The command to do this is given here.
fuel5537.fr<- merge(fuel.frame, cu.summary,
   by=c("row.names",intersect(names(fuel.frame),names(cu.summary))))
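Merging by row names can be sketched in R with two tiny made-up data frames. This simplified version merges on the row names only, not also on shared column names as the command above does. The names specs, prices and both are made up for this illustration.

```r
specs  <- data.frame(Weight = c(2560, 1845),
                     row.names = c("Alpha", "Gamma"))
prices <- data.frame(Price = c(8895, 6488, 9999),
                     row.names = c("Alpha", "Gamma", "Delta"))

# Only cars present in both data frames are kept; in R the row names
# come back as a column called Row.names.
both <- merge(specs, prices, by = "row.names")
print(both)
```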
We can get summaries of some of the variables by Country and Type of car from the Statistics Menu -> Data Summaries:
We will get summaries of Weight, Disp., Fuel, Price, and Reliability by Country and Type of car. The descriptive statistics to compute are chosen in the Statistics tab.
Pressing OK gives an awful lot of information (and some Country/Type combinations contain no data). The output will be given in a Report window, which you can copy or save. The choice of output routing can be changed via the Options menu. To reduce the amount of information, let's get summaries only by Type of car. Also, to be able to fit everything on the page, we'll only get summaries of the numeric variables, not the factors. We will do so using the command by. First, which columns are factors? The following expression will tell you.
> sapply(names(fuel5537.fr),function(x) is.factor(eval(parse(text=x))))
[1] F T F F F F T T
You might think that the command apply(fuel5537.fr, 2, is.factor) would work, but it does not: apply first coerces the data frame to a matrix, so the columns it passes to is.factor are no longer factors. So, we can do
> check.factors<-sapply(names(fuel5537.fr),function(x) is.factor(eval(parse(text=x))))
> by(fuel5537.fr[,!check.factors], fuel5537.fr$Type, summary)
fuel5537.fr$Type:Compact
 Mileage        Weight         Disp.           Fuel              Price
 Min.:21.00     Min.:2575      Min.:116.0      Min.:3.703704     Min.: 9483
 1st Qu.:23.00  1st Qu.:2663   1st Qu.:127.0   1st Qu.:3.923077  1st Qu.:10755
 Median:24.00   Median:2780    Median:135.0    Median:4.166667   Median:11588
 Mean:24.13     Mean:2821      Mean:140.4      Mean:4.167655     Mean:12853
 3rd Qu.:25.50  3rd Qu.:2928   3rd Qu.:148.5   3rd Qu.:4.347826  3rd Qu.:14195
 Max.:27.00     Max.:3110      Max.:181.0      Max.:4.761905     Max.:18900
--------------------------------------------------------------------------------
fuel5537.fr$Type:Large
 Mileage        Weight         Disp.           Fuel              Price
 Min.:18.00     Min.:3325      Min.:231.0      Min.:4.347826     Min.:14525
 1st Qu.:19.00  1st Qu.:3588   1st Qu.:266.5   1st Qu.:4.673913  1st Qu.:15335
 Median:20.00   Median:3850    Median:302.0    Median:5.000000   Median:16145
 Mean:20.33     Mean:3677      Mean:279.3      Mean:4.967794     Mean:15976
 3rd Qu.:21.50  3rd Qu.:3853   3rd Qu.:303.5   3rd Qu.:5.277778  3rd Qu.:16701
 Max.:23.00     Max.:3855      Max.:305.0      Max.:5.555556     Max.:17257
--------------------------------------------------------------------------------
fuel5537.fr$Type:Medium
 Mileage        Weight         Disp.           Fuel              Price
 Min.:20.00     Min.:2765      Min.:143.0      Min.:4.347826     Min.: 9999
 1st Qu.:21.00  1st Qu.:2975   1st Qu.:153.0   1st Qu.:4.545455  1st Qu.:13150
 Median:22.00   Median:3200    Median:180.0    Median:4.545455   Median:14980
 Mean:21.77     Mean:3196      Mean:175.8      Mean:4.601413     Mean:16201
 3rd Qu.:22.00  3rd Qu.:3450   3rd Qu.:182.0   3rd Qu.:4.761905  3rd Qu.:17899
 Max.:23.00     Max.:3610      Max.:232.0      Max.:5.000000     Max.:24760
--------------------------------------------------------------------------------
fuel5537.fr$Type:Small
 Mileage        Weight         Disp.           Fuel              Price
 Min.:25        Min.:1845      Min.: 73.00     Min.:2.702703     Min.:5866
 1st Qu.:28     1st Qu.:2260   1st Qu.: 91.00  1st Qu.:3.030303  1st Qu.:6599
 Median:32      Median:2295    Median: 97.00   Median:3.125000   Median:7399
 Mean:31        Mean:2258      Mean: 97.31     Mean:3.273380     Mean:7682
 3rd Qu.:33     3rd Qu.:2350   3rd Qu.:109.00  3rd Qu.:3.571429  3rd Qu.:8748
 Max.:37        Max.:2560      Max.:114.00     Max.:4.000000     Max.:9995
--------------------------------------------------------------------------------
fuel5537.fr$Type:Sporty
 Mileage        Weight         Disp.           Fuel              Price
 Min.:19        Min.:2170      Min.: 97.0      Min.:3.030303     Min.: 9410
 1st Qu.:24     1st Qu.:2695   1st Qu.:109.0   1st Qu.:3.571429  1st Qu.:10855
 Median:27      Median:2775    Median:133.0    Median:3.703704   Median:11545
 Mean:26        Mean:2799      Mean:164.1      Mean:3.957606     Mean:11717
 3rd Qu.:28     3rd Qu.:2885   3rd Qu.:153.0   3rd Qu.:4.166667  3rd Qu.:13071
 Max.:33        Max.:3320      Max.:305.0      Max.:5.263158     Max.:13945
--------------------------------------------------------------------------------
fuel5537.fr$Type:Van
 Mileage        Weight         Disp.           Fuel              Price
 Min.:18.00     Min.:3185      Min.:143.0      Min.:5.000000     Min.:12267
 1st Qu.:18.00  1st Qu.:3305   1st Qu.:146.0   1st Qu.:5.131579  1st Qu.:13972
 Median:19.00   Median:3665    Median:151.0    Median:5.263158   Median:14799
 Mean:18.86     Mean:3517      Mean:164.4      Mean:5.313283     Mean:14325
 3rd Qu.:19.50  3rd Qu.:3713   3rd Qu.:181.5   3rd Qu.:5.555556  3rd Qu.:14937
 Max.:20.00     Max.:3735      Max.:202.0      Max.:5.555556     Max.:15395
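The factor check and the grouped summary can be sketched in R on a small made-up data frame. In R, sapply can be applied to the data frame directly because a data frame is a list of columns; I have not checked whether the same shortcut behaves identically in S-PLUS. The name cars.demo and its values are made up for this illustration.

```r
cars.demo <- data.frame(
  Type    = factor(c("Small", "Small", "Van", "Van")),
  Weight  = c(1845, 2260, 3185, 3735),
  Mileage = c(37, 33, 20, 18)
)

# Column-by-column check: a data frame is a list of its columns.
is.fac <- sapply(cars.demo, is.factor)

# apply() coerces to a matrix first, so every column looks like a non-factor.
via.apply <- apply(cars.demo, 2, is.factor)

# Means of the numeric columns within each level of Type.
by.type <- by(cars.demo[, !is.fac], cars.demo$Type,
              function(d) sapply(d, mean))
print(by.type)
```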
Now, we get graphical summaries. We'll use the 2D palette and select the variables we want to use from the data in the data editor.
[Figure: 2D plots; recoverable axis labels: Mileage, Weight, Fuel, Price.]
Edit -> Select All Axes (or Plots, or Graphs, or Lines, etc.)
Format -> Selected Objects
Make modifications. These modifications apply to all selected objects.
Another tip: the right-click menu for any selected object has an option to save those preferences as the default. For example, I rarely like all four borders on a graph. So, once I remove the top and right side borders, I save the x-axis and y-axis as the default.
For a trellis display of histograms of Price by Type and Country, first set the conditioning mode on, then select the number of variables to condition on (2, here). We'll use Log(Price), though.
What is the problem? Too few observations within these groups. So, we only use Type as a grouping variable.
[Figure: trellis histograms of Log(Price) conditioned on Type; panels: Compact, Large, Medium, Small, Sporty, Van.]
Side-by-Side Boxplots of LogPrice by Reliability score: Turn off conditioning mode. Select Reliability first, then LogPrice. Press Boxplot button on 2D palette.
[Figure: side-by-side boxplots of LogPrice by Reliability.]
[Figure: scatterplot matrix; recoverable panel labels: Weight, Disp., Fuel.]
Next time we will see how to add a loess curve to each plot.
[Figure: scatterplot matrix of LogPrice, Price, Fuel, Disp., Weight, and Mileage.]
I should add that for all these graphs, I was able to format them within Microsoft Word 2000 by double clicking on each graph. That is because S-PLUS supports OLE automation for object linking and embedding.
A second point is that if any of these graphs were saved (as .sgr files in my S-PLUS working directory), they would be dynamically linked to the data sets used to create them. So, when the data change, so does the graph. This can be good or bad. If you don't want this, then embed the data into the graph. To do this, open your graphsheet (or put it in focus), go to the Graph Menu, and select Embed Data. A graphsheet can also be linked with Excel data. See the information on using S-PLUS within Excel in the User's Manuals.
3D Plots: S-PLUS has the ability to do lots of types of 3D plots, including contour plots. Here is a 3D scatter plot.
If you want to see the result of an assignment with only one command, then enclose the whole assignment statement within parentheses. For example,
(a<-4)
[1] 4
Naming Conventions
Object names must start with a letter or period and may contain letters, numbers, and periods. S is case sensitive. Spaces are not allowed between characters. Some names are already in use by the language. Examples are s, c, C, T, and F. The last two are examples of reserved names, which cannot be assigned (an error will result). For the first three, you can assign an expression to these names, but your object will mask the original objects with those names. That doesn't mean the original objects are gone, because your assignment (via <-) will only make changes to your working directory. To unmask the original, copy your object to a new name, then remove your object from the working directory by typing rm(c), for example, at the command line. The command masked() will return any masked names. Related to masking is the function find, which finds where an object exists in your search list.
Mode
The mode of an object describes the type of object that it is. Examples of atomic modes are logical, numeric, character, complex and NULL. These describe data objects. Some modes that describe language objects are: list, function, graphics, expression, call, <-, etc. Most objects also have a class. Beginning with S-PLUS 6.0 for Windows, all objects will have a class. The class of an
object defines how methods (special types of functions) are applied to it. Classes and methods are discussed later.
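The mode function reports the mode of any object. The following R sketch collects the modes of a few objects of the kinds just listed. The name modes is made up for this illustration, and S-PLUS may report some language objects differently.

```r
modes <- c(mode(TRUE),            # logical
           mode(3.5),             # numeric
           mode("a"),             # character
           mode(2+1i),            # complex
           mode(NULL),            # NULL
           mode(list(1, "a")),    # list
           mode(sum))             # function
print(modes)
```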
Length
The meaning of the length of an object depends on its mode. The length of an array is the number of elements it has. So, the length of a two or three dimensional array is the product of its dimension sizes. The length of a list is the number of components it has. The length of a data frame is the number of columns it has (although the function ncol will return that information as well). Zero-length objects exist. The vector numeric(0) is an empty vector, like an empty container. Also, the NULL object has length zero, and, according to V&R, is like no container. The length of a function is one plus the number of arguments. The additional 1 is from the function body.
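A few of these cases can be checked with a short R sketch. The object names are made up for this illustration. Note that R reports 1 for the length of a function, so the rule about functions above is specific to S and is left out here.

```r
arr <- array(0, dim = c(2, 3, 4))             # 2 * 3 * 4 = 24 elements
lst <- list(a = 1:10, b = "x")                # 2 components
df  <- data.frame(u = 1:5, v = letters[1:5])  # 2 columns, 5 rows

lengths.demo <- c(length(arr), length(lst), length(df),
                  length(numeric(0)), length(NULL))
print(lengths.demo)
```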
Attributes
Objects may have attributes, other objects that are attached to the main object, but are of a subordinate status. For example, a list object (a collection of objects with different modes) has a names attribute, which can be retrieved using the function call names(my.list.object). The attributes of an object can be listed (in the S sense of the word) using the function call attributes(my.object). Attributes can be changed by the user. For example, to replace the row.names attribute of a data frame (row names are the labels for each of the rows) with the row names of another data frame, type
attr(data.frame.1, "row.names")<-row.names(data.frame.2)
OR
row.names(data.frame.1)<- row.names(data.frame.2)
The second instance is an example of a replacement function. See p. 16 of V&R. It is safer to use the replacement function if it exists because it does consistency checking. The function structure can be called to create an object along with its attributes. For example, to create a simple 2x2 matrix with attribute dimnames, type
structure(matrix(c(1,2,3,4),nr=2,byrow=T), dimnames=list(c("row1","row2"),c("col1", "col2")))
     col1 col2
row1    1    2
row2    3    4
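Here is a small R sketch of setting and reading attributes, using the replacement function names and the structure function. The objects scores.demo and m are made up for this illustration.

```r
scores.demo <- c(5, 3, 4)
names(scores.demo) <- c("ann", "bob", "cy")   # replacement function
attr.names <- attr(scores.demo, "names")      # the same attribute, read directly

# structure() attaches attributes at creation; a dim attribute makes a matrix.
m <- structure(1:4, dim = c(2, 2))

print(attributes(scores.demo))
print(m)
```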
Types of Objects
Functions
We have seen examples of functions since page 1. The structure of an S function is
name.of.function<-function(formal arg1 = actual arg1, formal arg2 = actual arg2, etc) {
   function body
}
Braces {} are only required when the function body contains multiple statements. The function is called by typing its name and inserting actual arguments for (at least) the required formal arguments. When functions are called, their arguments may be given names or determined by their order. S lets you omit argument names or supply just enough of the name to match the formal argument (but, I rarely use this feature). For example, to compute the 10% trimmed mean of the whole numbers from 1 to 5, we call the function mean, which takes a numeric object (like a numeric vector).
mean(c(1,2,3,4,5), tr=.1)   # the c function denotes combine or concatenate
                            # c(1,2,3,4,5) is a numeric vector
mean(1:5, tr=.1)            # alternatively
The formal trim argument is given the actual argument, .1. Note that we did not have to spell out trim completely, and we also did not explicitly state the formal argument, x for which 1:5 is the value. mean has another argument called na.rm, which determines how missing values (NA) are handled. This argument is not required to be given a value, and, in this case, has a default value, which is F. The function body may or may not contain a last expression, which would be the returned value from the call to the function. Functions may have side effects as well, which are not returned with the call of the function, but happen as a result of calling the function. A function is allowed to do nothing. Indeed, the shortest.s function,
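Argument matching can be seen directly with a small R sketch. The function show.args is made up for this illustration; it simply returns the values its formal arguments received, so the effect of abbreviated names, positional matching and defaults is visible.

```r
show.args <- function(x, trim = 0, na.rm = FALSE) {
  list(trim = trim, na.rm = na.rm)
}

by.prefix   <- show.args(1:5, tr = 0.1)      # tr is enough to identify trim
by.position <- show.args(1:5, 0.1, TRUE)     # matched by order: x, trim, na.rm
by.default  <- show.args(1:5)                # defaults are used

print(by.prefix)
```

Spelling names out in full costs a few keystrokes but protects the call if a function later gains another argument with the same prefix.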
shortest.s<-function(){}
is valid. To create a function, just decide on a name for it, and fill in the details of what it should do. For example, to create our own function to compute a trimmed mean
mean.vector<-function(x, trim=0, na.rm=F) {
   if(na.rm) x<-x[!is.na(x)]                  # remove NAs if na.rm=T
   x<-sort(x)                                 # trimming applies to the ordered values
   elements.to.trim<-floor(trim*length(x))    # number of elements in x to trim from each end
   if(elements.to.trim > 0)
      x<-x[-c(1:elements.to.trim,(length(x)-elements.to.trim+1):length(x))]   # trim the ends
   sum(x)/length(x)
}
mean.vector(c(1,2,3,4,5,NA),trim=.2,T)
[1] 3
Try again:
> now <- proc.time()
> mean.vector(c(1, 2, 3, 4, 5, NA), trim = 0.2, T)
[1] 3
> proc.time() - now
[1] 0.05 0.00 0.05 0.00 0.00
> now <- proc.time()
> mean(c(1, 2, 3, 4, 5, NA), trim = 0.2, T)
[1] 3
> proc.time() - now
[1] 0.06 0.00 0.06 0.00 0.00
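Because the test vector 1 to 5 gives 3 whether or not any trimming happens, a less symmetric input is a better check of a hand-written trimmed mean. The R sketch below compares a simplified version against the built-in mean. The function trimmed.mean.sketch is made up for this illustration and assumes trim is less than 0.5.

```r
trimmed.mean.sketch <- function(x, trim = 0, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]                # drop missing values on request
  x <- sort(x)                                # trimming applies to ordered values
  k <- floor(trim * length(x))                # number to drop from each end
  if (k > 0) x <- x[(k + 1):(length(x) - k)]
  sum(x) / length(x)
}

x <- c(10, 1, 3, 2, 100, NA)
ours    <- trimmed.mean.sketch(x, trim = 0.2, na.rm = TRUE)
builtin <- mean(x, trim = 0.2, na.rm = TRUE)
print(c(ours, builtin))
```

With these values one element is dropped from each end of the sorted data, leaving 2, 3 and 10.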
More will be said about debugging functions later. S-PLUS comes with over 2,000 built-in functions, not including the functions from contributed libraries. Descriptions of the functions grouped by purpose or type can be found in the S-PLUS manuals or from the help utility. (See extra handout on statistical and mathematical functions)
A list of logical and comparison operators can be obtained via the help file. Search on logical. The elements of a vector can be named and accessed by name:
names(scores)<-paste("Name",1:length(scores),sep="")
names(scores)
 [1] "Name1"  "Name2"  "Name3"  "Name4"  "Name5"  "Name6"  "Name7"  "Name8"  "Name9"  "Name10"
A matrix is a vector of elements with a dim attribute that gives the dimensions of the matrix.
> dim(scores)<-c(2,5)    # fills the matrix by column
> scores
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    5    6    4    3
[2,]    5    3    2    5    4
> class(scores)
[1] "matrix"
To fill a matrix by row, use the function matrix with byrow=T. matrix takes a vector and creates a matrix by filling by column (the default) or by row. Multi-way arrays are extensions of this concept, with a dim attribute whose length is the number of dimensions. Dimensions are given names using the dimnames attribute.
In the second instance, dim(res) is c(1, m). In the third instance, dim(res) is NULL. Also,
Replace the (i, j)th element of an n x m matrix named my.matrix:
my.matrix[i, j]<-2
Replace the ith row of my.matrix:
my.matrix[i, ]<-rep(2, m)
Vectors and multi-way arrays are indexed in the logically analogous way. For example, my.vec[c(3,5)] gives the third and fifth elements of my.vec. The functions ncol and nrow return the number of columns and rows for a matrix (they also work for a data frame). rbind and cbind will bind row-wise or columnwise, several matrices, vectors, or data frames. Indexing in general is treated in more detail later.
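To make the fill order and the dropping of dimensions concrete, here is a small sketch in R. The vector scores is an assumption: it was chosen so that assigning dim reproduces the 2 x 5 matrix printed earlier.

```r
# Assumed data, chosen so that the column-wise fill matches the matrix shown above.
scores <- c(5, 5, 5, 3, 6, 2, 4, 5, 3, 4)

m1 <- scores
dim(m1) <- c(2, 5)                            # dim<- fills column by column
m2 <- matrix(scores, nrow = 2, byrow = TRUE)  # matrix() can fill row by row

row.vec <- m1[1, ]                 # length-one dimension dropped: a plain vector
row.mat <- m1[1, , drop = FALSE]   # kept as a 1 x 5 matrix

print(m1)
print(m2)
print(dim(row.mat))
```

The last two lines correspond to the "dim(res) is NULL" and "dim(res) is c(1, m)" cases.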
Lists
Lists are actually special cases of vectors. The elements of the vector are the components of the list. A list is a hierarchical data structure composed of different data objects, called components. Lists are used to collect different types of items together in one structure. The elements of a list do not have to be the same length. The components of a list are always numbered (and sometimes named, as well). As such, they can always be extracted by their number or position, and by their name, if available. Because lists can contain lists as components, which further can contain lists as components, etc, lists are sometimes called recursive objects.
Creating lists
A list can be created using the list function. For example, suppose we randomly select some people coming out of an ice cream store and ask them what they bought and how they would rate it (1 to 5). Here is a list of the data.
> x<-list(ice.cream.choice=c("vanilla", "chocolate", "strawberry", "chocolate"),
      topping.choice=c("chocolate","strawberry","chocolate","marshmallow"),
      rating=c(5,4,4,5))
> x
$ice.cream.choice:
[1] "vanilla"    "chocolate"  "strawberry" "chocolate"

$topping.choice:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

$rating:
[1] 5 4 4 5
Lists are concatenated using the c function. Let's add the number of people we actually approached as well. So, six people chose not to answer the survey.
> x<-c(x, number.surveyed=10)
> x
$ice.cream.choice:
[1] "vanilla"    "chocolate"  "strawberry" "chocolate"

$topping.choice:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

$rating:
[1] 5 4 4 5

$number.surveyed:
[1] 10
If we had set the recursive argument to TRUE, prior to concatenation with number.surveyed, the list x would have been unlisted. Thus, the result would be
> x<-c(x, number.surveyed=10,recursive=T)
> x
ice.cream.choice1 ice.cream.choice2 ice.cream.choice3 ice.cream.choice4
        "vanilla"       "chocolate"      "strawberry"       "chocolate"
  topping.choice1   topping.choice2   topping.choice3   topping.choice4
      "chocolate"      "strawberry"       "chocolate"     "marshmallow"
          rating1           rating2           rating3           rating4   number.surveyed
              "5"               "4"               "4"               "5"              "10"
Note the coercion to character strings for the ratings and number surveyed. Thus, it is probably wise only to recursively concatenate when all elements are numeric. The function unlist exists by itself as well, and can be quite useful (See discussion of lapply and sapply). If we had not given names to the list components, the components would be labeled by position:
> names(x)<-NULL    # note the use of the replacement function, names
> x
[[1]]:
[1] "vanilla"    "chocolate"  "strawberry" "chocolate"

[[2]]:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

[[3]]:
[1] 5 4 4 5

[[4]]:
[1] 10
However, to set a component to a value, we use [[ ]]. For example, to set the fourth component to 9
> x[[4]]<-9
> x
$ice.cream.choice:
[1] "vanilla"    "chocolate"  "strawberry" "chocolate"

$topping.choice:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

$rating:
[1] 5 4 4 5

$number.surveyed:
[1] 9
To set a component to NULL, use list.name[component.name]<-list(NULL) or list.name[[component.name]]<-list(NULL). So, to set number.surveyed to NULL use
x["number.surveyed"]<-list(NULL)
list(NULL) is required because x[component.name]
Or equivalently,
x[["number.surveyed"]]<-NULL
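The two forms are worth trying side by side. The sketch below is for R only (I have not checked it against S-PLUS), and the short list is a cut-down, hypothetical version of x. In R, assigning list(NULL) through [ ] keeps the component with a NULL value, while assigning NULL through [[ ]] removes the component altogether.

```r
# Cut-down, hypothetical version of the ice cream list.
x <- list(rating = c(5, 4, 4, 5), number.surveyed = 10)

y <- x
y["number.surveyed"] <- list(NULL)   # component stays, its value becomes NULL

z <- x
z[["number.surveyed"]] <- NULL       # in R, this deletes the component

print(names(y))    # both names still present
print(names(z))    # only "rating" remains
```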
The last few commands allude to easier ways of selecting components from a named list. We can use the name information instead of the position information. So, to select the topping choices from x, type
> x$topping.choice    # selects the component
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"
> x[["topping.choice"]]    # does the same thing
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"
> x["topping.choice"]    # but, this command returns a list itself
$topping.choice:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"
As lists are vectors of list components, the vector selection ([ ]) returns the vector component. This also illustrates another way to set or change components of a list.
x$topping.preferred<-c("orange pineapple", "strawberry banana", "chocolate mint", "cherry jubilee")
x$topping.preferred[2]<-"chocolate banana"    # change second topping preferred
There is a caveat to selecting components using their names. Partial matching will be done on the quoted component name.
> x["topping"]
$topping.choice:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"
Naturally, problems arise if we have two components whose names contain the word topping. We will get the first partial match. To get around this, one can select components using the numerical indexes only, finding them via exact matching:
> x[[match("topping.choice", names(x))]]
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"
But,
> x[[match("topping", names(x))]]
NULL
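Partial matching is one of the places where R and S-PLUS can differ, so it is worth checking on your own system. The following sketch shows what R does; the two-component list is hypothetical. In R, $ matches partially, [[ ]] matches exactly unless exact = FALSE is given, and match always compares whole names.

```r
# Hypothetical two-component list.
x <- list(topping.choice = c("chocolate", "strawberry"), rating = c(5, 4))

by.dollar  <- x$topping                        # partial match succeeds
by.exact   <- x[["topping"]]                   # exact by default: NULL
by.partial <- x[["topping", exact = FALSE]]    # partial matching on request
by.match   <- x[[match("topping.choice", names(x))]]   # exact, via position

print(by.dollar)
print(is.null(by.exact))
```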
Initializing a list with n components (useful for simulations) can be done using the command:
my.list<-vector("list", n)
Factors
Factors are similar to character vectors. However, factors possess an attribute called levels, which indicates which character strings are allowed in the factor. Internally, a factor is stored as a set of integer codes (retrievable using the codes function or the levelsIndex function). The names of the levels or categories are stored in the levels attribute. (In a data window, entries which are character strings are automatically converted to factors. You can change this option to have them remain as character using the Options menu ->General Settings->Data. When creating a data frame using data.frame, enclose character columns in I() to prevent them from being converted automatically to factors). Here we factor topping.preferred.
> x$topping.preferred<-factor(x$topping.preferred)
> x
$ice.cream.choice:
[1] "vanilla"    "chocolate"  "strawberry" "chocolate"
. . .
$topping.preferred:
[1] "orange pineapple" "chocolate banana" "chocolate mint"   "cherry jubilee"
$topping.preferred: Levels:
[1] "cherry jubilee"   "chocolate banana" "chocolate mint"   "orange pineapple"
> levelsIndex(x$topping.preferred)
[1] 4 2 3 1
> attributes(x$topping.preferred)
$levels:
[1] "cherry jubilee"   "chocolate banana" "chocolate mint"   "orange pineapple"
$class:
[1] "factor"
Note that the numerical codes follow alphabetical order. To specify a different order, use the levels argument, which takes a character vector. The levels argument can include character strings not occurring in the factor.
> x$topping.preferred<-factor(x$topping.preferred, levels=c(levels(x$topping.preferred), "peanut butter"))
> x
...
$topping.preferred:
[1] "orange pineapple" "chocolate banana" "chocolate mint"   "cherry jubilee"
$topping.preferred: Levels:
[1] "orange pineapple" "chocolate banana" "chocolate mint"   "cherry jubilee"
[5] "peanut butter"
The function unclass will now give the correct codes; codes will re-code using alphabetical order.
> unclass(x$topping.preferred)
[1] 1 2 3 4
attr(, "levels"):
[1] "orange pineapple" "chocolate banana" "chocolate mint"   "cherry jubilee"
[5] "peanut butter"
> mode(unclass(x$topping.preferred))
[1] "numeric"
Factors are treated specially in many statistical modeling functions (e.g., lm, aov, glm, gam, loess, nls) and in some plotting functions. For example, using the built-in S-PLUS data frame fuel.frame, a boxplot of Mileage by Type (a factor) can be given automatically using plot:
> attach(fuel.frame)
> plot(Type, Mileage)
[Figure: boxplots of Mileage (vertical axis, ticks at 20, 25, 30, 35) by Type (horizontal axis: Compact, Large, Medium, Small, Sporty, Van)]
Ordered Factors
An ordered factor is a factor with an ordering of the levels. To create an ordered factor use the ordered function with first argument a vector. The default ordering of levels is alphabetical for characters and numerical for numbers. To change the default ordering, use the levels argument.
> ordered(x$topping.choice,levels=c("marshmallow","strawberry", "chocolate"))
[1] chocolate   strawberry  chocolate   marshmallow

marshmallow < strawberry < chocolate
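For readers following along in R, the sketch below repeats the factor examples with R's own tools. The functions codes and levelsIndex are S-PLUS functions; as.integer is the R equivalent I am assuming here. The object names are hypothetical.

```r
toppings <- c("orange pineapple", "chocolate banana",
              "chocolate mint", "cherry jubilee")

f1 <- factor(toppings)          # levels sorted alphabetically
codes1 <- as.integer(f1)        # integer codes behind the factor

# Give the levels explicitly, including one that never occurs in the data.
f2 <- factor(toppings, levels = c(toppings, "peanut butter"))
codes2 <- as.integer(f2)

# An ordered factor: comparisons respect the stated order of the levels.
o <- ordered(c("chocolate", "strawberry", "chocolate", "marshmallow"),
             levels = c("marshmallow", "strawberry", "chocolate"))
strawberry.below.chocolate <- o[2] < o[1]

print(codes1)
print(codes2)
print(table(f2))                # the unused level appears with count 0
```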
One can create an ordered factor from a numeric variable by using the cut function. For example, to categorize the Weight variable from the fuel.frame data set into quartiles, use
> cut(fuel.frame$Weight, breaks=quantile(fuel.frame$Weight, seq(0,1,.25)), include.lowest=T)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 4 2 4 2 1 2 2 2 1 2 2 2 3 2 3 2 2 3 3 2 2 3 3 3 4 2 3 4 3 3 4 2 4 3 2 3 4 4 4 4 3 4 4 4 4 3 4
attr(, "levels"):
[1] "1845.00+ thru 2571.25" "2571.25+ thru 2885.00" "2885.00+ thru 3231.25" "3231.25+ thru 3855.00"
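The same idea can be tried in R. fuel.frame is an S-PLUS data set, so this sketch substitutes the weight column of R's built-in mtcars data; that substitution, and the names w and w.cat, are my assumptions. The argument ordered_result = TRUE asks R's cut for an ordered factor.

```r
w <- mtcars$wt                                  # car weights, built-in R data
breaks <- quantile(w, seq(0, 1, 0.25))          # quartile cut points
w.cat <- cut(w, breaks = breaks,
             include.lowest = TRUE,             # keep the minimum in the first bin
             ordered_result = TRUE)             # return an ordered factor
print(table(w.cat))
```

Without include.lowest = TRUE the smallest weight would fall outside the first interval and become NA.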
Subsetting Factors
A factor can be subsetted using a logical selection vector. For example, to select the types of cars with mileage over 30 mpg in the fuel.frame data frame, type
> fuel.frame$Type[fuel.frame$Mileage>30]
[1] Small  Small  Small  Small  Small  Small  Small  Small  Sporty
Levels (first 5 out of 6):
[1] "Compact" "Large"   "Medium"  "Small"   "Sporty"
This retains the full set of levels, even though not all of them appear in the subset. If you only want levels that appear in the subset, type
> fuel.frame$Type[fuel.frame$Mileage>30,drop=T]
[1] Small  Small  Small  Small  Small  Small  Small  Small  Sporty
The levels function does the conversion to character. So, in the second example, we only convert length(levels(my.factor)) characters to numeric instead of length(my.factor), where the former is usually shorter than the latter (though not in this example).
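Both points can be tried in R with small made-up factors (the names below are hypothetical). The first part shows the effect of drop=T on the levels. The second part shows the conversion trick just described: convert the distinct levels to numeric once, then index that short vector by the factor.

```r
type <- factor(c("Small", "Sporty", "Van", "Small"))
sub1 <- type[type != "Van"]                # all three levels are retained
sub2 <- type[type != "Van", drop = TRUE]   # only levels that occur are kept

f <- factor(c("10", "2.5", "10", "7"))     # numbers stored as a factor
wrong <- as.numeric(f)                     # gives the integer codes, not the values
right <- as.numeric(levels(f))[f]          # converts each level once, then indexes

print(levels(sub1))
print(levels(sub2))
print(right)
```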
default becomes a data sheet. A data frame or data sheet is a table of data in rows and columns. Typically, the rows correspond to observations and the columns correspond to variables. Data frames are special cases of lists. They are lists with vector components. So, all the extraction techniques one can do with lists also apply to data frames. However, all of their component columns must be the same length, but not necessarily the same mode. Not all the columns of a data sheet must be the same length. A data sheet is like a spreadsheet. The attributes of a data frame include row.names (the names of the rows), names (the column or variable names), and class (data.frame). Its mode is list.
> names(fuel.frame)
[1] "Weight"  "Disp."   "Mileage" "Fuel"    "Type"
> row.names(fuel.frame)
 [1] "Eagle Summit 4"      "Ford Escort 4"       "Ford Festiva 4"      "Honda Civic 4"
 [5] "Mazda Protege 4"     "Mercury Tracer 4"    "Nissan Sentra 4"     "Pontiac LeMans 4"
 [9] "Subaru Loyale 4"     "Subaru Justy 3"      "Toyota Corolla 4"    "Toyota Tercel 4"
[13] "Volkswagen Jetta 4"  "Chevrolet Camaro V8" "Dodge Daytona"       "Ford Mustang V8"
...
If two or more data frames are collected together, the row names corresponding to the first data frame in the list are the row names for the new data frame. Character and logical vectors are automatically converted to factors unless they are enclosed in an identity function, I(). See V&R (2000, p. 16) for details. The functions cbind and rbind bind column-wise or row-wise, data frames that have the same number of columns or rows. This is useful when updating a data frame with new data:
rbind(old.data.frame, new.data.frame)
However, note that cbinding a mixture of numeric and character variables results in a character matrix. To see this,
> cbind(1:26, letters[1:26])
character matrix: 26 rows, 2 columns.
     [,1] [,2]
[1,] "1"  "a"
[2,] "2"  "b"
[3,] "3"  "c"
Thus, cbind is only useful for data frames with all numeric (or all character) columns.
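A short R sketch of the contrast (the object names are hypothetical): cbind on bare vectors of mixed mode coerces everything to character, whereas data.frame keeps each column in its own mode.

```r
m  <- cbind(1:3, letters[1:3])     # a character matrix: the numbers become strings
df <- data.frame(id = 1:3, code = letters[1:3],
                 stringsAsFactors = FALSE)   # columns keep their own modes

print(mode(m))
print(sapply(df, class))
```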
One can convert a list or a matrix to a data frame by using the function data.frame. For example, we convert the ice cream list, x, to a data frame.
> x.fr<-data.frame(x)
> x.fr
  ice.cream.choice topping.choice rating topping.preferred
1          vanilla      chocolate      5  orange pineapple
2        chocolate     strawberry      4  chocolate banana
3       strawberry      chocolate      4    chocolate mint
4        chocolate    marshmallow      5    cherry jubilee
Also,
> x.fr$topping.choice
> x.fr[, "topping.choice"]
> x.fr[["topping.choice"]]
[1] chocolate   strawberry  chocolate   marshmallow
If we first attach the data frame, we don't need the $ to subset. That's because attaching a list brings the list to the second position of the search list. So, unless there are other objects in our working directory with the same names as the names in the attached list, using a name in the data frame brings up the associated column. For example,
> attach(fuel.frame)
> Type
 [1] Small Small Small Small Small Small Small Small Small Small Small Small Small Sporty Sporty Sporty Sporty
[18] Sporty Sporty Sporty Sporty Sporty Compact Compact Compact Compact Compact Compact Compact Compact Compact Compact Compact Compact
Instead of seq, we could have used rep(c(T,rep(F,9)), 10). However, seq is faster. We can sample from the vector of integers representing the row names to get a pseudorandom sample.
> fuel.frame[sample(1:nrow(fuel.frame), 10), ] # sample 10 cases
> duplicated(my.fuel.frame)
[1] F F F [68] T T T F F T T F F T T F F T T
F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T
frames.
You can seq along another vector by using the along argument.
> seq(along=seq(-1, 1, .2))
 [1]  1  2  3  4  5  6  7  8  9 10 11
A colon can be used for incrementing or decrementing sequences with by=1: e.g., 1:5, 3:5, -10:(-1), 1:(-10). However, note that : will take precedence over arithmetical expressions, so that 1:5-1 results in c(0, 1, 2, 3, 4), which is different from 1:(5-1), which gives c(1, 2, 3, 4). The rep function repeats its first argument the number of times given by its second argument. Here are some simple examples:
> rep("hi", times=5)
[1] "hi" "hi" "hi" "hi" "hi"
> rep(1:3, 4)
 [1] 1 2 3 1 2 3 1 2 3 1 2 3
> rep(letters[1:3], 3:1)
[1] "a" "a" "a" "b" "b" "c"
> rep(letters[1:3], times=rep(4, 3))
 [1] "a" "a" "a" "a" "b" "b" "b" "b" "c" "c" "c" "c"
With S-PLUS 6.1 and R, we don't need the additional rep expression for the times argument. We can use the each argument.
> rep(letters[1:3], each=4)
 [1] "a" "a" "a" "a" "b" "b" "b" "b" "c" "c" "c" "c"
> rep(rep(1:4, each = 2), 2)
 [1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4
R has a function gl (that's an el) that does the above. gl(n, k) generates a factor of length n*k consisting of the numbers 1, 2, ..., n, each repeated k times.
> gl(n=4,k=2)
[1] 1 1 2 2 3 3 4 4
Levels: 1 2 3 4
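The sequence tools above can be checked together in R with the short sketch below. The variable names are hypothetical. Note in particular the precedence of the colon operator over subtraction.

```r
a <- 1:5 - 1                  # parsed as (1:5) - 1, giving 0 1 2 3 4
b <- 1:(5 - 1)                # gives 1 2 3 4

s   <- seq(-1, 1, by = 0.2)   # 11 values from -1 to 1
idx <- seq(along.with = s)    # 1, 2, ..., 11: one index per element of s

r <- rep(letters[1:3], each = 2)   # "a" "a" "b" "b" "c" "c"
g <- gl(n = 4, k = 2)              # a factor, not a numeric vector

print(a); print(b); print(idx); print(r); print(g)
```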
which can be convenient for specifying the intercept in the X matrix for a regression problem.
There are functions any, all, and all.equal which return scalar T or F values:
> all.equal(z, y)    # default tolerance is .Machine$double.eps
[1] "Mean relative difference: 0.3333333"
> all(z > y)
[1] T
> any(y > z)
[1] F
The function identical tests for exact equality, i.e., without a tolerance. If a logical vector is used in an arithmetical expression, then T is converted to 1 and F is converted to 0. Thus, sum(c(T,T,F)) is the number 2.
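A classic case where the distinction matters is floating-point arithmetic. The sketch below is for R; the values are hypothetical. Because all.equal returns a character string rather than F when the objects differ, it is usually wrapped in isTRUE when a single logical is needed.

```r
z <- 0.1 + 0.2
y <- 0.3

exact     <- identical(z, y)          # FALSE: the doubles differ in the last bits
tolerant  <- isTRUE(all.equal(z, y))  # TRUE: equal up to a small tolerance
how.many  <- sum(c(TRUE, TRUE, FALSE))  # logicals count as 1 and 0

print(c(exact = exact, tolerant = tolerant))
print(how.many)
```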
takes the midpoint of each ordered pair in the regular grid of z and y values. The function crossprod(x, y) gives the (matrix) cross-product x^T y, where x and/or y may be a matrix. With a single argument, crossprod(x) gives x^T x. The function diag will create a diagonal matrix or extract the diagonal of a square matrix, depending on the mode of the first argument. Examples:
The last command above is equivalent to the function scale. scale will standardize the values in each column (subtract its respective column mean and divide by column standard deviation). Also note the related functions, rowSums, colSums, rowVars, colVars. Section 4.3 of V&R contains more information on matrix functions like QR decomposition, eigen decomposition, etc.
Many of these functions have methods for other classes, like data frames or lists. See if you can find out which. Note that the function match(x, table, nomatch=NA) returns a vector of the positions in table of the elements of x, but only for the first matches. For example,
> match(1,c(1,2,3,1,4))
[1] 1
So, match reports only the first position, even though 1 appears in both the first and fourth positions. To find all the positions in table of each element in x, where x is a vector, you can use sapply (I think there is something easier, but I forgot). For example, to find all the positions of 1s, all the positions of 2s, etc., in the vector c(1,2,3,1,4):
> sapply(1:4,function(x,y) x==y,y=c(1,2,3,1,4))
     [,1] [,2] [,3] [,4]
[1,]    T    F    F    F
[2,]    F    T    F    F
[3,]    F    F    T    F
[4,]    T    F    F    F
[5,]    F    F    F    T
What would happen if I used lapply? Could I have used apply? Note the following matches of cars between the two data sets cu.summary and fuel.frame:
> match(row.names(cu.summary),row.names(fuel.frame))
 [1] NA NA NA  1  2  3 NA NA  4 NA  5  6 NA  7  8  9 10 11 12 NA NA 13 NA 14 NA 15 NA NA NA NA 16 17 NA NA 18 19 NA NA NA 20 NA NA 21 NA 22 NA
[47] NA NA 23 NA NA 24 25 NA 26 NA 27 28 29 NA 30 31 32 33 34 NA NA 35 36 37 38 NA NA NA 39 NA NA NA NA 40 41 NA 42 NA 43 44 45 NA NA NA NA 46
[93] 47 48 49 NA NA NA 50 NA NA 51 NA NA 52 53 NA NA 54 NA 55 56 57 58 59 60 NA
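One simpler route to all the positions, sketched here in R with hypothetical names, is which applied to a logical comparison. Wrapping it in lapply gives one vector of positions per value, which also hints at the answer to the lapply question above.

```r
tab <- c(1, 2, 3, 1, 4)

first.pos <- match(1, tab)        # first match only
all.pos   <- which(tab == 1)      # every position holding a 1

# One vector of positions for each value 1..4; lengths differ, so a list is natural.
pos.list <- lapply(1:4, function(v) which(tab == v))

print(first.pos)
print(all.pos)
print(pos.list)
```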
Sorting
The function sort sorts a vector in ascending order or alphabetical order. rev will then reverse the order. The function order returns the indices of a vector that will sort the original vector in ascending order. To order a vector by one variable within another, use further arguments to order. For example, to order cars by Price within Type, use
order(Type, Price)
The function sort.list will quickly sort a list (data frame) by a single column. For example,
my.data.frame[sort.list(row.names(my.data.frame)),]
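Here is a small sketch in R of ordering one variable within another. The data frame cars.df and its values are made up for illustration.

```r
# Made-up data for illustration only.
cars.df <- data.frame(Type  = c("Small", "Van", "Small", "Van"),
                      Price = c(9000, 15000, 7000, 12000))

by.type.price <- cars.df[order(cars.df$Type, cars.df$Price), ]   # Price within Type
by.price.desc <- cars.df[rev(order(cars.df$Price)), ]            # largest Price first

print(by.type.price)
print(by.price.desc)
```

order returns row indices, so the sorted data frame is obtained by using those indices as the row subscript.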
To
Any operation on an NA becomes an NA. So, the expression below gives NAs.
> c(1, NA, 1)== NA
[1] NA NA NA
In S-PLUS, character vectors cannot have missing values. The expression NA will be interpreted as the string "NA". However, in R, NA in a character vector is a missing value. See the documentation on the functions is.finite, is.infinite, is.nan, and is.number. The symbol NaN means "Not a Number" (undefined). The symbol Inf stands for infinity.
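Since comparing with NA never gives T or F, missing values have to be detected with is.na. The sketch below shows the R behavior, including a missing value in a character vector; the variable names are hypothetical.

```r
v <- c(1, NA, 1)

cmp     <- v == NA        # NA NA NA: the comparison itself is missing
missing <- is.na(v)       # FALSE TRUE FALSE: the right way to test
m       <- mean(v, na.rm = TRUE)

ch <- c("a", NA)          # in R, a character vector can hold a missing value
ch.missing <- is.na(ch)

not.a.number <- 0 / 0     # NaN
infinite     <- 1 / 0     # Inf

print(cmp); print(missing); print(ch.missing)
```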
The function nchar takes in a vector of character strings and returns a vector with the number of characters in each string. The function paste takes an arbitrary number of arguments (coercing them to character strings if necessary) and joins them together element by element. By default, the joined elements are separated by a blank. The separator can be changed using the sep argument. For example,
> paste("Round",1:10)
 [1] "Round 1"  "Round 2"  "Round 3"  "Round 4"  "Round 5"  "Round 6"  "Round 7"
 [8] "Round 8"  "Round 9"  "Round 10"
The function substring takes as first argument a vector (which is coerced to character vector). The next two arguments indicate the first and last positions of each string in the character vector. This defines a segment of each string to be extracted. For example, to extract the Round number from the paste("Round",1:10) vector created above, we start in the 7th position, then continue until the end of each string (thus, we don't need to give a value for the last argument).
> substring(paste("Round",1:10), first=7)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
charmatch(input, target, nomatch=0) returns a vector matched by input. Ambiguous matches return a 0.
From the help file, charmatch is useful for processing the arguments to functions. It is very similar to the pmatch function. However, pmatch does not allow a distinction between no match and an ambiguous match. The pmatch function doesn't allow a match to the empty string, while charmatch does. See the help file for examples.
regexpr(pattern, text) matches one regular expression, pattern, to a character vector, text. Basic use of regular expressions is covered in the help file. An example: Suppose we want to find all functions in the working directory that contain the letters pow as a whole word. regexpr will match the characters pow to the directory listing. What is returned is the position of the p in each directory file listed. If pow does not occur as a whole word, then a -1 is returned.
> tmp<-objects()
> tmp[ regexpr("\\<pow\\>", tmp)>0 ]    # >0 eliminates the -1 nomatches, so we get the matches only
[1] "pow.matrix"    "pow.structure"
Regular expressions can be used in the pattern argument to objects too. R has the functions sub and gsub which resemble the sed commands
s/pattern/replacement/     # replace first instance of pattern with replacement
s/pattern/replacement/g    # g stands for global replacement (replace all instances)
in Unix. For example, to replace the R's in our Round vector with lower-case r's,
> gsub("R", "r", paste("Round",1:10))
 [1] "round 1"  "round 2"  "round 3"  "round 4"  "round 5"  "round 6"  "round 7"
 [8] "round 8"  "round 9"  "round 10"
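The difference between sub and gsub only shows up when the pattern occurs more than once in a string, which the Round example does not exercise. A small sketch in R, with made-up strings:

```r
s <- "Round robin"

first.only <- sub("o", "0", s)    # replaces the first "o" only
all.of.them <- gsub("o", "0", s)  # replaces every "o"

# regexpr reports where the first match starts in each string, or -1 for no match.
hits <- regexpr("pow", c("pow.matrix", "mypower", "other"))

print(first.only)
print(all.of.them)
print(as.integer(hits))
```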
Here are some good examples of the use of character vector operations and searching and matching. I have a data set with 548 records, of which 258 are unique individuals. Several individuals contributed more than one record to this data set. Each record has a unique ID associated with it. The first few digits of the ID correspond to the individual. The last two digits correspond to the record contributed by the individual (I call these events). So, a record with ID=11209 corresponds to the 09th event from an individual identified with number 112. I want to create an event variable that indicates which event corresponds to the record for that individual. Here is how I did it. ID contains the 548 ID tags.
# First I find the largest number of events:
num.events<-max(as.numeric(substring(ID,nchar(ID)-1)))
# Now, I create a vector of strings with event labels
event<-if(num.events > 9) c(paste("*0",1:9,sep=""), paste("*", 10:num.events, sep="")) else paste("*0",1:num.events,sep="")
# Now, create the event variable
for(i in 1:num.events) event[grep(event[i],ID)]<-i
# Get the first records in the database of each individual
my.data[match(unique(individual),individual), ]
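For comparison, here is an alternative sketch in R that gets the event number directly with substring and nchar, without looping over patterns. It is not the method used above. The five IDs are made up, and the sketch assumes, as described, that the last two digits of every ID are the event and the leading digits are the individual.

```r
# Made-up IDs: leading digits identify the individual, last two digits the event.
ID <- c("11201", "11202", "11209", "4501", "4510")

n <- nchar(ID)
event      <- as.numeric(substring(ID, n - 1, n))   # last two characters
individual <- substring(ID, 1, n - 2)               # everything before them

# Row index of the first record contributed by each individual.
first.rows <- match(unique(individual), individual)

print(data.frame(ID, individual, event))
print(first.rows)
```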
The function format coerces input to character strings, then formats it using specified number of digits or significant digits. It outputs the result in quotes, which can be removed by encasing the format output in cat or print, with argument quote=F. For example,
> format(pi^(-3:3), digits=5)
[1] " 0.032252" " 0.10132 " " 0.31831 " " 1 " " 3.1416 " " 9.8696 " "31.006 "
If you want to reduce the number of significant digits of a number (say, called, number) to that of the default (found in options()$digits), then use format(number).
is not valid. One can, however, use unlist to prevent using the operator on a list:
x[unlist(x)>3] # Note that sapply can be used as well
Vectors of positive and negative integers can be used to index all four data structures. Decimal representations are truncated toward zero. A zero subscript is allowed, but will return an empty structure (empty list, empty numeric vector, etc). However, using a zero subscript to select the zeroth column of a data frame returns an error. If a subscript extends outside the range of the length or dimensions of the object, then for a vector, the extracted component is NA, for a list it is NULL, and for arrays and data frames it is an error. Out of range subscripts on the left hand side of assignments return a lengthened vector or list with intervening components set to NA and NULL, respectively. The function is.element can be used as an index vector. For example, to create missing values (NAs) in a vector out of the values 99 or 999, use
x[is.element(x, c(99, 999))] <- NA
Note that the concatenation cannot contain elements of mixed types. Thus, c(99, 999, ".") is not allowed. The command x[x=="."]<-NA would have to be used. For an object with a names component, extraction by name can be done. The following are examples using data frames.
# R
> data(iris)
> iris[,c("Sepal.Length","Sepal.Width")]
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
...
# S-PLUS
> fuel.frame["Ford Escort",]
              Weight Disp. Mileage     Fuel  Type
Ford Escort 4   2345   114      33 3.030303 Small
The empty selection, [], returns the entire object (for all data structures in the heading). Replacements to subsets of vectors, arrays, data frames, and lists can also be done, via a replacement function, such as x[unlist(x)>3]<-0. Negative subscripts work on the left hand side as well. For example, to assign a 9 to all but the 4th element of a vector x, x[-4]<-9.
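The subscripting rules in this section can be checked quickly in R with the sketch below. All the vectors are made up for illustration.

```r
x <- c(5, 99, 7, 999, 3)
x[is.element(x, c(99, 999))] <- NA   # recode the sentinel values as missing

y <- 1:6
y[-4] <- 9L                          # assign to everything except the 4th element

v <- 1:3
out.of.range <- v[5]                 # NA for a vector
zero.sub     <- v[0]                 # an empty vector

w <- 1:3
w[6] <- 10L                          # lengthens w; positions 4 and 5 become NA

print(x); print(y); print(w)
```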
Indexing an array
Arrays are indexed in column-major order, meaning that the first index moves fastest. For a 3-dimensional array, this means that filling an array will start with the first matrix and fill down columns. One can index a 3-dimensional array using 3 subscripts or a single subscript. If a single subscript is used, then the selected element will be the element in the position of the index, when we start counting positions in column-major order. Compare the following ways to set two values to NA in an array. They yield the same answer. (Note the method for setting NA. We use a replacement function).
x<-array(1:50, dim=c(2,5,5))
is.na(x)<-c(1,3)    # set the 1st and 3rd values of x to NA
x<-array(1:50, dim=c(2,5,5))
is.na(x)<-x[1,1:2,1]    # set the 1st and 2nd columns in the 1st row of the 1st matrix of x to NA
> x
, , 1
     [,1] [,2] [,3] [,4] [,5]
[1,]   NA   NA    5    7    9
[2,]    2    4    6    8   10

, , 2
     [,1] [,2] [,3] [,4] [,5]
[1,]   11   13   15   17   19
[2,]   12   14   16   18   20

, , 3
     [,1] [,2] [,3] [,4] [,5]
[1,]   21   23   25   27   29
[2,]   22   24   26   28   30

, , 4
     [,1] [,2] [,3] [,4] [,5]
[1,]   31   33   35   37   39
[2,]   32   34   36   38   40

, , 5
     [,1] [,2] [,3] [,4] [,5]
[1,]   41   43   45   47   49
[2,]   42   44   46   48   50
We can select submatrices from a 3-dimensional array. For example, to select the first 3 columns of each matrix in the array, x, above:
> x[, 1:3, ]
, , 1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2
     [,1] [,2] [,3]
[1,]   11   13   15
[2,]   12   14   16

, , 3
     [,1] [,2] [,3]
[1,]   21   23   25
[2,]   22   24   26

, , 4
     [,1] [,2] [,3]
[1,]   31   33   35
[2,]   32   34   36

, , 5
     [,1] [,2] [,3]
[1,]   41   43   45
[2,]   42   44   46
Normally, in selecting from an array dimensions are dropped unless you tell S not to drop them. This is done using the drop=F argument as an additional index. Compare the following:
> x[1, , ]    # select first row from each matrix in the array
     [,1] [,2] [,3] [,4] [,5]
[1,]    1   11   21   31   41
[2,]    3   13   23   33   43
[3,]    5   15   25   35   45
[4,]    7   17   27   37   47
[5,]    9   19   29   39   49
> x[1, , ,drop = F]    # don't drop length-one dimensions
, , 1
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9

, , 2
     [,1] [,2] [,3] [,4] [,5]
[1,]   11   13   15   17   19

, , 3
     [,1] [,2] [,3] [,4] [,5]
[1,]   21   23   25   27   29

, , 4
     [,1] [,2] [,3] [,4] [,5]
[1,]   31   33   35   37   39

, , 5
     [,1] [,2] [,3] [,4] [,5]
[1,]   41   43   45   47   49
Note the function drop, too. We can index an array with a matrix. Here we extract the antidiagonal of a square matrix.
> x <- diag(4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1
> n <- nrow(x)
> x[matrix(c(1:n, n:1), nr = n, nc = 2)]
[1] 0 0 0 0
To extract the off-diagonal, see the matrix.coords function in S commands on the website (http://math.cl.uh.edu/~thompsonla/5537).
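Because the antidiagonal of diag(4) is all zeros, the result above does not show which elements were picked. The sketch below, in R with hypothetical names, uses a matrix with distinct entries so the selection is visible. Each row of the index matrix is one (row, column) pair.

```r
x <- matrix(1:16, nrow = 4)      # filled by column: x[i, j] = i + 4 * (j - 1)
n <- nrow(x)

index <- cbind(1:n, n:1)         # pairs (1,4), (2,3), (3,2), (4,1)
anti  <- x[index]                # one element per row of the index matrix

print(x)
print(anti)
```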
Vectorized calculations
Vectorized functions are functions that return a vector if the argument is a vector. These calculations operate on the entire vector instead of the individual elements in turn. Many mathematical functions and transformations are vectorized. Actually, vectorized functions will usually also return a structure like the argument if the argument is a structure. A structure in S is a class of object that adds to an ordinary vector some notion of the values being organized in space and time (Chambers, 1998). A structure can be even more general, taking a basic object and turning it into something else by adding attributes. For example,
a matrix is a vector with a dim attribute
an array is a vector with a dim attribute
a factor is a character vector with a levels attribute
a data frame is a list with a names and row.names attribute
So, for a vectorized function, f, if x is a vector, f(x) is again a vector, where the function is applied to each element in x. Furthermore, f(x) is a matrix if x is a matrix. Examples:
> log(1:10)
 [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101 2.0794415 2.1972246 2.3025851
> log(diag(4))
     [,1] [,2] [,3] [,4]
[1,]    0 -Inf -Inf -Inf
[2,] -Inf    0 -Inf -Inf
[3,] -Inf -Inf    0 -Inf
[4,] -Inf -Inf -Inf    0
> diag(3) + diag(3)
     [,1] [,2] [,3]
[1,]    2    0    0
[2,]    0    2    0
[3,]    0    0    2
> 1:10 + 1
 [1]  2  3  4  5  6  7  8  9 10 11
> 1:10 + 1:20
 [1]  2  4  6  8 10 12 14 16 18 20 12 14 16 18 20 22 24 26 28 30
What happened in the third example is that S coerces the 1 to a vector: rep(1, length(1:10)). This coercion will happen whenever the length of the longer vector is a multiple of the length of the shorter. So, in the fourth example, 1:10 is coerced to rep(1:10, 2). If the length of the longer vector is not a multiple of the length of the shorter, then an error results (R instead recycles anyway and issues a warning). Functions that are vectorized include: mathematical transformations and operations (+, -, *, /, log, exp, sqrt, etc), logical operations (==, >, <, !, etc), and functions related to probability distributions and random number generation.
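The recycling rule can be verified directly in R, as in this sketch (names hypothetical). The last line captures what R does when the lengths do not divide evenly; this is R behavior and I have not checked it against S-PLUS.

```r
a <- 1:10 + 1                      # the 1 is recycled ten times
b <- 1:10 + 1:20                   # 1:10 is recycled twice
same <- identical(b, rep(1:10, 2) + 1:20)

# Lengths 3 and 2: R still recycles, but signals a warning.
outcome <- tryCatch(1:3 + 1:2, warning = function(w) "warning")

print(a)
print(b)
print(outcome)
```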
One way around a function that is not vectorized is to use sapply with the vector as the first argument and the function as the second.
Here are the results from 100,000 simulations, and their times in seconds.
> pi1.f(100000)
[1] 3.15316
> pi2.f(100000)
[1] 3.15028
> dos.time(pi1.f(100000))
[1] 153.07
> dos.time(pi2.f(100000))
[1] 0.54
The first function takes 2.5 minutes, whereas the second takes half a second. The next section describes functions that operate on whole objects such as lists, arrays, data frames. These can sometimes be used in place of explicit loops. However, in R, apply still uses an internal loop.
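To give a feel for the kind of comparison being timed, here is a sketch in R of a Monte Carlo estimate of pi written two ways. These are not the pi1.f and pi2.f functions referred to above; pi.loop and pi.vec are hypothetical stand-ins that I wrote for illustration, and the sample size is kept small.

```r
# Hypothetical loop version: one random point per iteration.
pi.loop <- function(n) {
  hits <- 0
  for (i in 1:n) {
    u <- runif(2)                       # a point in the unit square
    if (sum(u^2) <= 1) hits <- hits + 1 # inside the quarter circle?
  }
  4 * hits / n
}

# Hypothetical vectorized version: all points at once.
pi.vec <- function(n) {
  u <- runif(n)
  v <- runif(n)
  4 * mean(u^2 + v^2 <= 1)
}

set.seed(1)
est.loop <- pi.loop(10000)
est.vec  <- pi.vec(10000)
print(c(loop = est.loop, vectorized = est.vec))
```

Wrapping each call in system.time() in R gives the timing comparison.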
, , array2
   a  b  c  d
A 11 13 15 17
B 12 14 16 18

, , array3
    a   b   c   d
A 111 113 115 117
B 112 114 116 118

> apply(newarray, 3, rowSums)
  array1 array2 array3
A     16     56    456
B     20     60    460
> apply(newarray, 3, colSums)
  array1 array2 array3
a      3     23    223
b      7     27    227
c     11     31    231
d     15     35    235
But, what happens when we want a matrix result returned when we apply a function to each matrix of an array?
> (newarray.ginv<-apply(newarray, 3, ginverse))
             array1 array2 array3
[1,] -1.000000e+000  -2.50 -17.50
[2,] -5.000000e-001  -1.00  -6.00
[3,] -6.730727e-016   0.50   5.50
[4,]  5.000000e-001   2.00  17.00
[5,]  8.500000e-001   2.35  17.35
[6,]  4.500000e-001   0.95   5.95
[7,]  5.000000e-002  -0.45  -5.45
[8,] -3.500000e-001  -1.85 -16.85
We could put the results into a matrix using matrix, but it takes a little dirty work. Another way is to transform the array to a list, then lapply the same function to the list. Finally, we can transform back to an array.
> newlist <- apply(newarray, 3, as.data.frame)
> (newlist.ginv<-lapply(newlist, ginverse))
$array1:
               [,1]  [,2]
[1,] -1.000000e+000  0.85
[2,] -5.000000e-001  0.45
[3,] -6.730727e-016  0.05
[4,]  5.000000e-001 -0.35
attr($array1, "rank"):
[1] 2

$array2:
     [,1]  [,2]
[1,] -2.5  2.35
[2,] -1.0  0.95
[3,]  0.5 -0.45
[4,]  2.0 -1.85
attr($array2, "rank"):
[1] 2

$array3:
      [,1]   [,2]
[1,] -17.5  17.35
[2,]  -6.0   5.95
[3,]   5.5  -5.45
[4,]  17.0 -16.85
attr($array3, "rank"):
[1] 2

# back to an array
> array(unlist(newlist.ginv),dim=c(4,2,3))
, , 1
               [,1]  [,2]
[1,] -1.000000e+000  0.85
[2,] -5.000000e-001  0.45
[3,] -6.730727e-016  0.05
[4,]  5.000000e-001 -0.35

, , 2
     [,1]  [,2]
[1,] -2.5  2.35
[2,] -1.0  0.95
[3,]  0.5 -0.45
[4,]  2.0 -1.85

, , 3
      [,1]   [,2]
[1,] -17.5  17.35
[2,]  -6.0   5.95
[3,]   5.5  -5.45
[4,]  17.0 -16.85
However, the easiest way is to use the dim and dimnames arguments of array on newarray.ginv:
array(newarray.ginv, dim=dim(newarray)[c(2,1,3)], dimnames= dimnames(newarray)[c(2,1,3)])
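The reshaping step can be tried in R with a stand-in for ginverse. This is only a sketch: it assumes newarray holds the values implied by the printed output (1:8, 11:18, 111:118), and it uses t() as a placeholder for ginverse, because t() also turns each 2 x 4 slice into a 4 x 2 matrix.

```r
# Assumed reconstruction of newarray from the printed output above.
newarray <- array(c(1:8, 11:18, 111:118),
                  dim = c(2, 4, 3),
                  dimnames = list(c("A", "B"),
                                  c("a", "b", "c", "d"),
                                  c("array1", "array2", "array3")))

# apply() flattens each 4 x 2 result into one column of length 8.
flat <- apply(newarray, 3, t)   # t() is a placeholder for ginverse

# Restore the matrix shape: rows and columns swap, the third dimension stays.
reshaped <- array(flat,
                  dim = dim(newarray)[c(2, 1, 3)],
                  dimnames = dimnames(newarray)[c(2, 1, 3)])

dim(flat)       # 8 3
dim(reshaped)   # 4 2 3
```

Each slice of reshaped is then the matrix result for the corresponding slice of newarray, with its row and column labels restored.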
One can also sum the matrix components of an array. We show three functions for doing so. Try out the first function on newarray.
sum.array<-function(array){
  res<-aperm(array,c(3,2,1))
  apply(res,3,colSums)
}
> sum.array(newarray)
    A   B
a 123 126
b 129 132
c 135 138
d 141 144
A quicker way uses matrix multiplication instead of the apply function. The apply function is not as efficient as matrix multiplication when the latter can be done.
sum.array2<-function(array){
  res<-aperm(array,c(3,2,1))
  d<-dim(res)
  matrix(rep(1,d[1])%*%matrix(res, nr=d[1]), nr=d[2], dimnames=dimnames(res)[-1])
}
> sum.array2(newarray)
    A   B
a 123 126
b 129 132
c 135 138
d 141 144
A third way uses the function colSums, which has the efficiency of sum.array2 and the simplicity of sum.array.
sum.array3<-function(array){
  res<-aperm(array,c(3,2,1))
  colSums(res)
}
> sum.array3(newarray)
    A   B
a 123 126
b 129 132
c 135 138
d 141 144
Comparing them, we get the following (the resources function appears in Venables and Ripley (2000) and is available from their Sprog scripts online):
> resources(sum.array(newarray))
  CPU Elapsed % CPU Child Cache Working
 0.11    0.11   100     0     0    4540
> resources(sum.array2(newarray))
  CPU Elapsed % CPU Child Cache Working
 0.05    0.05   100     0     0    1713
> resources(sum.array3(newarray))
  CPU Elapsed % CPU Child Cache Working
 0.06    0.06   100     0     0    1001
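As a rough check in R, the three versions can be run side by side to confirm that they agree. The sketch assumes newarray holds the values implied by the earlier printed output, and it writes nrow in full rather than relying on the partial match nr. The resources function is specific to the S-PLUS scripts; in R, system.time could be used for a similar comparison.

```r
# Assumed reconstruction of newarray from the printed output.
newarray <- array(c(1:8, 11:18, 111:118),
                  dim = c(2, 4, 3),
                  dimnames = list(c("A", "B"),
                                  c("a", "b", "c", "d"),
                                  c("array1", "array2", "array3")))

# Version 1: apply colSums to each slice of the permuted array.
sum.array <- function(array) {
  res <- aperm(array, c(3, 2, 1))
  apply(res, 3, colSums)
}

# Version 2: multiply a row vector of ones into the flattened array.
sum.array2 <- function(array) {
  res <- aperm(array, c(3, 2, 1))
  d <- dim(res)
  matrix(rep(1, d[1]) %*% matrix(res, nrow = d[1]),
         nrow = d[2], dimnames = dimnames(res)[-1])
}

# Version 3: colSums sums over the first dimension of the permuted array.
sum.array3 <- function(array) {
  res <- aperm(array, c(3, 2, 1))
  colSums(res)
}

r1 <- sum.array(newarray)
r2 <- sum.array2(newarray)
r3 <- sum.array3(newarray)
all.equal(r1, r2)   # TRUE
all.equal(r1, r3)   # TRUE
```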
See the examples in V&R. Note how they apply a function to the diagonals of a matrix.
We can find the maximum values for each column using the apply function:
apply(state.x77, 2, max)
But this gives us only the numbers, not the states that produced them. Also, using the function which on the result of apply will return NULL unless there is a single state that produces the maximum values for each variable. sapply can help.
We can sapply over the names of the variables in this matrix. The function we apply to each variable name is
function(x) which(apply(state.x77,2,max)[x]== state.x77[,x])
The variable names are given in dimnames(state.x77)[[2]]. So we use sapply with this as our first argument:
> sapply(dimnames(state.x77)[[2]],
    function(x,ref.table) which(apply(ref.table,2,max)[x]==ref.table[,x]),
    ref.table=state.x77)
[1]  5  2 18 11  1 44 28  2
I chose to pass in the data matrix state.x77 instead of accessing it directly. I did this to show how arguments can be passed to the function used in sapply. Now, we can see which states have the maximum values on each variable.
> ind<-sapply(dimnames(state.x77)[[2]],
    function(x,ref.table) which(apply(ref.table,2,max)[x]==ref.table[,x]),
    ref.table=state.x77)
> structure(dimnames(state.x77)[[1]][ind], names = dimnames(state.x77)[[2]])
  Population   Income  Illiteracy Life Exp    Murder HS Grad    Frost     Area
"California" "Alaska" "Louisiana" "Hawaii" "Alabama"  "Utah" "Nevada" "Alaska"
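A shorter route in R, offered here as an alternative sketch and not as the approach used above, is to let which.max return the row index of each column's maximum directly. Note that which.max reports only the first maximum, so a tie between states would go unnoticed. The state.x77 matrix ships with R's datasets package.

```r
# Row index of the maximum in each column (first one if there are ties).
ind <- apply(state.x77, 2, which.max)

# Look up the state names and label them by variable.
top.states <- structure(rownames(state.x77)[ind],
                        names = colnames(state.x77))
top.states
```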
A use of lapply that we have not yet seen is to replace explicit looping. Here is an example of using lapply as a replacement for a for loop:
res<-vector("list", B) # res is where we will keep the B results of the loop
# do the loop
args.list gives a list of arguments for my.function. Using lapply instead of for sometimes helps keep memory use manageable.
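A minimal sketch of the replacement in R follows. The contents of args.list and the body of my.function are hypothetical stand-ins chosen only so that the code runs; the point is that the explicit loop and the lapply call produce the same list.

```r
B <- 5
args.list <- as.list(seq_len(B))             # hypothetical arguments, one per iteration
my.function <- function(n) sum(seq_len(n))   # hypothetical function

# Explicit loop: preallocate the list, then fill it in.
res <- vector("list", B)
for (b in seq_len(B)) res[[b]] <- my.function(args.list[[b]])

# The same computation with lapply.
res2 <- lapply(args.list, my.function)

identical(res, res2)   # TRUE
```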
To get a list of all the levels of the factors in the cu.summary data frame, do
> lapply(cu.summary[sapply(cu.summary, is.factor)], levels)
$Country:
[1] "Brazil" "England" "France" "Germany" "Japan" "Japan/USA" "Korea" "Mexico" "Sweden" "USA"

$Reliability:
[1] "Much worse" "worse" "average" "better" "Much better"

$Type:
[1] "Compact" "Large" "Medium" "Small" "Sporty" "Van"
One can use lapply on a data frame, since a data frame is a list. For example, suppose we want to apply a function that operates on vectors to all numeric columns of a data frame. We can use either lapply or apply for this:
my.data.frame[]<-lapply(my.data.frame, function(x) if(is.numeric(x)) my.function(x) else x)
The left-hand side ensures that the row.names remain intact. The function split(data, group) takes in a vector, matrix, or data frame (data) and splits it by an index (group), a vector or factor giving the indices, returning a list. This list can then be passed to lapply. Using the combination of split then lapply is usually equivalent to using tapply alone, but frequently much faster (see discussion in V&R, p. 107).
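Both idioms can be sketched in R on a small made-up example. The data frame, the vectors price and type, and my.function below are hypothetical and serve only as an illustration.

```r
# Hypothetical data frame and function.
my.data.frame <- data.frame(type = c("Small", "Van", "Small", "Van"),
                            price = c(10, 20, 30, 40),
                            row.names = c("r1", "r2", "r3", "r4"))
my.function <- function(x) x - mean(x)   # centers a numeric vector

# Assigning into my.data.frame[] keeps the data frame structure and row names.
my.data.frame[] <- lapply(my.data.frame,
                          function(x) if (is.numeric(x)) my.function(x) else x)
my.data.frame$price        # -15  -5   5  15
row.names(my.data.frame)   # "r1" "r2" "r3" "r4"

# split() followed by sapply() gives the same group means as tapply().
price <- c(10, 20, 30, 40)
type <- c("Small", "Van", "Small", "Van")
by.split <- sapply(split(price, type), mean)
by.tapply <- tapply(price, type, mean)
by.split                   # Small 20, Van 30
```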
27    Mexico     average  Small  8695.00       NA
28     Korea      better  Small  6319.00 37.00000
29    Mexico      better  Small  8672.00 26.00000
30       USA      better  Small  8895.00 33.00000
31     Japan Much better  Small  8659.00 30.66667
32 Japan/USA Much better  Small  8226.75 31.33333
33       USA  Much worse Sporty 14111.29 22.00000
34     Japan     average Sporty 12749.00 24.00000
35       USA     average Sporty 13098.00 30.00000
36     Japan      better Sporty 22860.00       NA
37     Japan Much better Sporty 13745.00 30.00000
38 Japan/USA Much better Sporty 12279.00       NA
39       USA  Much worse    Van 13790.00       NA
40       USA     average    Van 13219.00 18.00000
41     Japan Much better    Van 14944.00 19.00000
> aggregate(cu.summary$Price, by = list(Type = cu.summary$Type), mean)
           Type         x
Compact Compact 15201.909
Large     Large 21499.714
Medium   Medium 21622.867
Small     Small  7736.591
Sporty   Sporty 15308.115
Van         Van 14014.300
> aggregate(cu.summary$Price, by = list(Type = cu.summary$Type), mean)$x
[1] 15201.909 21499.714 21622.867  7736.591 15308.115 14014.300
Control Structures
The function ifelse
S has ordinary if-then-else structures, as well as a vectorized ifelse function. ifelse(test, yes, no) takes a vector (test) and returns yes[i] if test[i]==T and no[i] otherwise. If yes or no are not as long as test, they will be repeated cyclically.
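ifelse evaluates all three arguments, so in the first call below log(y) is computed for every element of y, including the negative one, and a warning results. Applying log to the result of ifelse avoids this.
> y <- c(-1, 0, 2, 3)
> ifelse(y > 0, log(y), 0)
[1] 0.0000000 0.0000000 0.6931472 1.0986123
Warning messages:
  NAs generated
> log(ifelse(y > 0, y, 1))
[1] 0.0000000 0.0000000 0.6931472 1.0986123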
If test evaluates to a character string, the value of the switch expression is that of the matching named argument, or the default if none matches. The default is usually the last unnamed argument.
example:
switch(test.choice,
  Levene=, levene=levene(y, f),
  Cochran=, cochran=cochran(y, f),
  Bartlett=, bartlett=, bartlett(y,f))
If test.choice evaluates to "Levene" or "levene", the expression evaluates to levene(y, f). If no option is matched, the expression evaluates to bartlett(y,f), the final unnamed argument.
o To allow abbreviated names (e.g., test.choice evaluates to "Lev" or "lev"), use the pmatch function with nomatch="". example:
switch(pmatch(test.choice, c("Levene", "levene", "Cochran", "cochran"), nomatch=""),
  # result of pmatch is coerced to character mode
  "1"=, "2"= levene(y, f),
  "3"=, "4"= cochran(y, f),
  bartlett(y,f))
If test evaluates to a number (which is coerced to an integer using trunc), the argument evaluated is the one whose position in the list of arguments matches the integer. There is no default argument (nonmatching evaluations of test give the result NULL). So, we select an argument by its position in the list of arguments. example:
switch(1, T, F)        returns T
switch(2, T, F)        returns F
switch(4, T, F, T, ,)  returns a result with mode missing
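The behaviour of switch described above can be tried in R with the following sketch. The functions choose.test and choose.abbrev are hypothetical names, and the strings they return are stand-ins for the calls to levene, cochran, and bartlett, so that the example runs without those functions. The abbreviation handling uses a numeric switch with an explicit NA check, which is an R-oriented variation and not the nomatch approach shown above.

```r
# Character selection with fall-through: an empty argument passes control
# to the next non-empty one; the unnamed last argument is the default.
choose.test <- function(test.choice) {
  switch(test.choice,
         Levene = , levene = "levene",
         Cochran = , cochran = "cochran",
         "bartlett")
}
choose.test("Levene")    # "levene"
choose.test("cochran")   # "cochran"
choose.test("other")     # "bartlett"

# Abbreviations: pmatch() gives the position of the unique partial match,
# or NA when nothing matches.
choose.abbrev <- function(test.choice) {
  idx <- pmatch(test.choice, c("Levene", "levene", "Cochran", "cochran"))
  if (is.na(idx)) return("bartlett")
  switch(idx, "levene", "levene", "cochran", "cochran")
}
choose.abbrev("Lev")     # "levene"
choose.abbrev("coch")    # "cochran"
choose.abbrev("xyz")     # "bartlett"

# Numeric selection has no default: a position past the end gives NULL.
no.match <- switch(4, "first", "second")
is.null(no.match)        # TRUE
```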
Scalar logical operators (only deal with single values)
o && right-hand expression is evaluated only if the left-hand one is true
o || right-hand expression is evaluated only if the left-hand one is false
o example: if(any(y <0) || any(x<0)) stop("none of the data should be negative")
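The short-circuit behaviour is easy to see in R with a small sketch. The variable names below are made up for the illustration; in both cases the right-hand expression is never evaluated.

```r
# && stops at a false left-hand side, so x[1] is never inspected.
x <- NULL
safe <- !is.null(x) && x[1] > 0
safe   # FALSE

# || stops at a true left-hand side, so stop() is never called.
y <- c(-1, 0, 2, 3)
msg <- if (any(y < 0) || stop("never evaluated")) "negative values found" else "ok"
msg    # "negative values found"
```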
Looping
Typical looping structures exist: for, while, and repeat loops. These are used in the usual way (see p. 58 of V&R). However, in S, one can loop over the items in a list. For example,
x<-list(ice.cream.choice=c("vanilla", "chocolate", "strawberry", "chocolate"), topping.choice=rev(c("vanilla", "chocolate", "strawberry", "chocolate")))
> x
$ice.cream.choice:
[1] "vanilla"    "chocolate"  "strawberry" "chocolate"

$topping.choice:
[1] "chocolate"  "strawberry" "chocolate"  "vanilla"
> for(i in x) print(sort(i))
[1] "chocolate"  "chocolate"  "strawberry" "vanilla"
[1] "chocolate"  "chocolate"  "strawberry" "vanilla"
Later, I will discuss the For() loop, which starts a new S-PLUS process at each iteration of the loop so that each step is run as a top-level expression. This causes S-PLUS to release memory after each iteration of the loop. However, a new process must be called at each step, so a For loop will be useful only for very large computational tasks done at each iteration. A short loop is better left as a regular for() loop. Actually, what For does is call the program Sqpe.exe which starts up a terminal S-PLUS process. It does this while in one S-PLUS process using the MULTIPLE_INSTANCES switch (see Chapter 1). There are several arguments one can use with the For loop to change how the process is run (background or foreground) or how the commands in the loop are issued (one by one or in blocks of expressions).
Method Dispatch
var(x) # x is a vector
var(X) # X is a matrix
Debugging functions
trace
debugger
browser
inspect
find.calls
fix
Edit
To output graphics in Windows to a Windows metafile, call the command wmf.graph (in S-PLUS). To output graphics to a PDF file, use pdf.graph (in S-PLUS) or pdf (in R).
You can have several graphics windows or devices open simultaneously. As they open, they are assigned a number. The current device (that is, the one to which graphics commands will be sent) is the most recently opened device by default. To change the current device to, say, the graphics window in focus, select the Make Current item in the window's system menu. Alternatively, use dev.set(which), where which is the number of the device to be made current. To turn off one of the devices, issue the command dev.off(which), with which being the device number. To close all devices, use graphics.off().
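The device bookkeeping can be sketched in R as follows. This assumes R's pdf device opened with file = NULL, which creates no output file, so the example runs without a display; the variable names are illustrative. Device numbering in S-PLUS may differ in detail.

```r
# Open two devices; each new device becomes the current one.
pdf(file = NULL)
first <- dev.cur()
pdf(file = NULL)
second <- dev.cur()
newest.is.current <- dev.cur() == second   # TRUE

# Make the first device current again.
dev.set(first)
switched <- dev.cur() == first             # TRUE

# Close one device by number, then close whatever is left.
dev.off(second)
graphics.off()
all.closed <- dev.cur() == 1               # TRUE: only the null device remains
```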
To draw the first plot in the upper left corner (screen 3), make screen 3 the focus, and call the plotting function:
screen(3)                  # ready to draw on screen 3 (upper left corner)
plot(<put arguments here>)
Do the same for the remaining screens. If you later want to change what was drawn on screen 3, you can erase it, and redraw:
erase.screen(3)            # erase screen 3
screen(3)                  # ready to draw on screen 3 (upper left corner)
plot(<put arguments here>) # draw something else
If you forget to issue close.screen(all=T), then the next set of screens you create will start with number 7, which can become very confusing.
graphsheet can be used to output a graph in a format that is not a screen device. For example, set the argument format = "JPG" together with file = "mygraph.jpg" to output a graph as a JPEG file. Other formats are EPS, WMF, and TIF. To send output to the printer, use format = "printer".
You can also name a graphsheet for use with the guiModify function later. guiModify can be used to make changes to an editable graphsheet that has already been created. One of its arguments (Name) is the name of the graphsheet. If you don't set Name, the default name is GSD<number>, starting with number 2.