
A short introduction to R

Luc Hens Vrije Universiteit Brussel & Vesalius College 6 April 2011

Abstract Aimed at students and teachers in an introductory statistics class or in an economics, business, or other social science class with an empirical component, this paper introduces the open-source statistical software R with the help of simple examples. (Econlit A220, C800)

What is R?

R is freely available open-source statistical software comparable to commercial (and often expensive) statistical software packages. R runs on Windows, MacOS, and Linux. I have used R for applied statistical work and graphing functions in a standard introductory statistics class and in economics classes at the undergraduate and graduate level. The aim of this paper is to give a self-contained introduction to R that demonstrates its capabilities as a teaching and research tool. This paper suffices to get started with R.

To download R, go to the R Project home page (r-project.org) and follow the Download link. Select a mirror site nearby (in Belgium, for instance). Select your operating system. For MacOS 10.5 or higher, download the setup program by clicking on the link R-2.11.1.pkg; the version number (here 2.11.1) may be more recent. For Windows, follow the link to the base distribution and download the setup program. Once downloaded, double-click the setup program to install R on your hard disk. You can also run R "in the cloud": if you are using a smartphone, a tablet, or a computer where you can't install R (for instance, because the computer is not yours), point the web browser to Rweb (http://bayes.math.montana.edu/Rweb/); you can store your R scripts and the results of the computations using free services like Simplenote or Dropbox.

R is not menu-driven but uses command lines (objects, to be exact). When you start R, a console window opens showing a prompt (>) at the bottom. In command-line software, the user writes a command after the prompt and presses enter to execute the command. Try this: after the prompt, type:

    citation()

and press enter. The console will show the bibliographical information needed to cite R in research papers (including homework assignments; you still have to adapt the bibliographical information to the required citation style). Type date() and press enter to display the date and time.

I would like to thank my students and Camille Vanderhoeft for comments on earlier versions of this paper.

First, R is a calculator: to compute 2 + 5 type:

    2+5

and press return. The arithmetic operators are +, -, / (division), * (multiplication) and ^ (power: $2^3$ is 2^3). To take the square root of 2 do sqrt(2). Calculators and computer programs sometimes report numbers using scientific notation: 2.345e+04, where e (exponent) means "times ten raised to the power ...". For instance, 2.345e+04 means 2.345 times 10 to the power 4:

$$\texttt{2.345e+04} = 2.345 \times 10^{4} = 2.345 \times 10000 = 23450$$

Similarly, 2.345e-04 means:

$$\texttt{2.345e-04} = 2.345 \times 10^{-4} = 2.345 \times \frac{1}{10000} = 0.0002345$$
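A quick way to convince yourself that scientific notation changes only how a number is written, not its value, is to let R compare the two spellings. The following is a minimal sketch; the object names x, y, same.x and same.y are arbitrary choices for this illustration.

```r
# Two numbers written in scientific notation
x <- 2.345e+04
y <- 2.345e-04

# all.equal() compares numbers while tolerating tiny rounding differences
same.x <- isTRUE(all.equal(x, 23450))
same.y <- isTRUE(all.equal(y, 0.0002345))
same.x
same.y

# format() lets you request either notation explicitly
format(x, scientific = TRUE)
format(y, scientific = FALSE)
```

Both comparisons should report TRUE, which shows that the e notation is purely a display convention.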

R internally uses full precision numbers but typically displays only seven significant digits on screen. For instance, display the mathematical constant π = 3.14...:

    print(pi)

You can display values with more than seven digits by adding the argument digits= followed by the desired number of digits:

    print(pi,digits=22)

The assignment operator <- assigns something (a value, a list of values, a formula) to an object. To create an object called shoesize and assign to it the value 42, do

    shoesize <- 42

To display the contents of the object shoesize type

    print(shoesize)

(Just typing the object's name and pressing return displays the object, too; try this.) Lists of values (vectors, in R's terminology) are defined by the concatenate function c(). To create an object called my.grades containing a list of last semester's grades, do something like:

    my.grades <- c(3.0, 3.3, 2.7, 2.3, 3.7)

Display the list by typing print(my.grades) or just my.grades. Lists can contain descriptive phrases or words rather than numbers:

    my.favorite.vegetables <- c("broccoli","carrots","red beans")

R is case-sensitive: it makes a difference between upper-case and lower-case letters. If you defined an object as shoesize (with a lower-case s) and give the command print(Shoesize), R will respond that it doesn't know the variable Shoesize (with an upper-case S).

Comments are useful to annotate R code you'd like to re-use or to document the sources of data files. You can insert comments in R code and data files by starting the comment with a hash (#):

    # The following line computes 2 times 3:
    2*3

You can put a comment after a command in the same line:

    2*3 # the preceding code computes 2 times 3
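The points about assignment, the c() function, and case sensitivity can be tried out together. This is only a small sketch built on the objects introduced above; exists() is a base R function that checks whether an object with a given name is defined.

```r
# Create the objects used in the text
shoesize  <- 42
my.grades <- c(3.0, 3.3, 2.7, 2.3, 3.7)

# A vector created with c() has a length and can be indexed by position
length(my.grades)   # number of grades
my.grades[2]        # the second grade
mean(my.grades)     # the average grade

# R is case-sensitive: shoesize exists, Shoesize does not
exists("shoesize")
exists("Shoesize")
```

Using exists() avoids the error message you would get from print(Shoesize), which makes the check convenient inside a script.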

R is well documented. The official documentation is Venables et al. (2006) and there is plenty of additional information geared towards novices on the R web site (www.r-project.org, follow the link Documentation). If you want to use R more intensively (for instance, for a master's thesis) I recommend buying one of the following books: Adler (2010), Crawley (2002, 2007), Dalgaard (2002), Fox (2002), or Verzani (2005). All are accessible to novices; I found Adler (2010) the most congenial and Crawley (2007) the most comprehensive.

You can invoke R's built-in help by typing help.search followed by a description of the topic, enclosed in parentheses and quotes. For example, if you want to find out how to draw a box plot in R, type:

    help.search("box plot")

To ask R's help regarding a specific command (for instance, to know more details about how to use the date command) type:

    help(date)

The menu bar of R gives access to a built-in text editor. You can use this text editor to write R code or to create and edit data files. The text editor also can open, edit, and save plain text files. A plain text file contains only standard text characters (the characters on your keyboard), no formatting information (such as font type or font size), and often has the extension .txt. To open an existing text file, do: File > Open Document .... To create a new text file, do: File > New Document. To save the file, do: File > Save. R will save the text file with an .R extension. If you want R to execute a line of code in the text editor, set the cursor to the line and choose: Edit > Execute. To execute several lines of code, highlight the code and choose: Edit > Execute. By saving code in a separate text file, you can adapt and re-use it later. Add comment lines (starting with #) to annotate code in the text file. Teachers may want to distribute the R code used in the course to their students. For an example see the file STA101_R_code.R posted on Hens (2007).

Entering data
In order to be readable by R, data should follow a number of rules:

Variable names should either contain no blank spaces and commas (year.of.birth) or be enclosed in quotes ("year of birth"). If the variable names are enclosed in quotes, the read.table command (see below) will automatically replace all characters that are not allowed (such as blank spaces and commas) by periods. For instance, the variable name "year of birth" will be replaced by year.of.birth.

Numbers should use decimal points (not commas), have no separator for thousands, and have no currency symbol. For example, $1,234,567.89 should appear as: 1234567.89 (see below what to do if your spreadsheet data have decimal commas).

Values of qualitative data that are character strings (not numbers) can contain blank spaces or commas but should be enclosed in quotation marks (same left as right): the variable major can take the value "International Affairs".

Non-available values should be coded as NA, not as blanks or as the spreadsheet code #N/A (with a hash). NA should be capitalized: use NA, not na.

Small data sets can be entered in a command line. Suppose in an economy the primary sector accounts for 5 percent, the secondary sector for 35 percent, and the tertiary sector for 60 percent. A quick way to enter the data as lists (the blank spaces are optional and improve readability):

    sector <- c("Primary","Secondary","Tertiary")
    share  <- c( 0.05 , 0.35 , 0.60 )

The operator <- assigns values to the objects (variables) sector and share. You can now use the objects sector and share by calling their name. Type the name of each object and press return to display its contents:

    sector
    share

It is a good idea to bind the values for the variables together by case (that is, to make clear that the share 0.05 belongs to the primary sector, the share 0.35 belongs to the secondary sector, and the share 0.60 belongs to the tertiary sector). The data.frame command does exactly this:

    my.data <- data.frame(sector,share)

Display the data frame my.data by typing its name and pressing enter:

    my.data

You can now call the variables in the data frame my.data by typing the name of the data frame followed by a dollar sign and the name of the variable (name.of.data.frame$name.of.variable):

    my.data$sector
    my.data$share

I strongly recommend importing large data sets or data sets involving more than one variable from an external file (which will often be a spreadsheet file) rather than using the method just described. The command to import a data set from an external file is read.table(). In what follows, I'll use a data set consisting of 30 students' sex, height (cm), weight (kg), and major shown in appendix B. If you are on-line, you can load this data set using:

    students <- read.table(
      "http://homepages.vub.ac.be/~lmahens/students.csv",
      header=TRUE,sep=",")
    attach(students)
    students         # displays the data
    names(students)  # shows the names of variables
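The data-frame steps described above can be collected in one short script. This is a sketch rather than a prescribed workflow; it uses the sector shares stated in the text (5, 35, and 60 percent), and the sanity check on the sum and the sorting step are additions for illustration.

```r
# Sector names and shares (5, 35 and 60 percent, as stated in the text)
sector <- c("Primary", "Secondary", "Tertiary")
share  <- c(0.05, 0.35, 0.60)

# Bind the two variables together by case
my.data <- data.frame(sector, share)
my.data

# Sanity check: the shares of all sectors should add up to 1
total.share <- sum(my.data$share)
isTRUE(all.equal(total.share, 1))

# Pick out the share that belongs to one sector
tertiary.share <- my.data$share[my.data$sector == "Tertiary"]
tertiary.share

# Display the rows sorted from the largest to the smallest share
my.data[order(-my.data$share), ]
```

Checking that shares sum to 1 is a cheap way to catch typing mistakes when data are entered by hand.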

A simple way to enter small numeric data sets stored in a spreadsheet file is to use the scan() command. Assume you collected the values of two stock market indices (DJIA and S&P 500) for five days. Use your spreadsheet program to create a new spreadsheet (in OpenOffice.org Calc: File > New > Spreadsheet). Enter the data with variables as columns, elements (cases) as rows:

            column A    column B
    row 1   "DJIA"      "SP500"
    row 2   10425       1387
    row 3   10220       1346
    row 4   9862        1333
    row 5   10367       1409
    row 6   99929       1395

Save the spreadsheet file. Select the DJIA data in cells A2:A6 (don't include the header with the variable name) and copy. In the R console, type after the prompt:

    DJIA <- scan()

and press enter. After the prompt 1:, paste the column of data. Press enter twice. R has now loaded the DJIA data. Inspect whether the values were correctly read: type DJIA and press enter to display the data. Repeat for SP500 (column B): copy cells B2:B6, go to the R console and type SP500 <- scan(), and paste the column of data. Inspect whether the values were correctly read: type SP500 and press enter to display the data. Bind the data in a data frame:

    stock.market.data <- data.frame(DJIA,SP500)

Although the scan() method is simple, the risk of making mistakes is large and mistakes are hard to spot. I recommend using the read.table() command to import any external data file. To learn more about importing data from external files, read Appendix A. As this topic is rather complicated you may want to skip the appendix for the time being.
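To see how read.table() applies the data rules of this section without needing a file on disk, you can pass the file contents through its text argument. The sketch below is an illustration only: the three records in csv.text are made up (they are not the students.csv data), and example.data is a hypothetical name.

```r
# Made-up file contents; in practice these lines would sit in a .csv file
csv.text <- 'sex,height,weight,major
"Male",180,75,"Economics"
"Female",165,NA,"International Affairs"
"Female",170,60,"Economics"'

# header=TRUE: the first line holds the variable names; sep=",": comma-separated
example.data <- read.table(text = csv.text, header = TRUE, sep = ",")

names(example.data)            # the variable names taken from the first line
example.data                   # displays the data

# The value coded as NA is recognized as missing
is.na(example.data$weight)

# Many functions need to be told to skip missing values
mean.weight <- mean(example.data$weight, na.rm = TRUE)
mean.weight
```

The quoted value "International Affairs" is read as a single value even though it contains a blank space, which is the reason for the quoting rule above.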

Descriptive statistics

The sort() function sorts values from small to large and displays them:

    sort(height)

To display just the heights of male students do:

    height[sex=="Male"]

The function quantile() computes the five-number summary (minimum, first quartile, median, third quartile, maximum); summary() computes the five-number summary and the mean:

    quantile(height)
    summary(height)

To compute descriptive statistics (mean, median, variance, and standard deviation) use the following functions:

    mean(height)
    median(height)
    var(height)
    sd(height)

The variance and standard deviation commands (var and sd) use the formula for sample data (with $n - 1$ in the denominator). This means that the R command sd produces SD+ from Freedman, Pisani, & Purves (2007, p. 74), not SD. To compute SD for the height do:

    number.of.entries <- length(height)
    sqrt((number.of.entries-1)/number.of.entries)*sd(height)

To obtain a specific quantile (say, the 90th percentile) use:

    quantile(height,.90)

There are different ways to compute quantiles (see help(quantile)), leading to (usually) slightly different outcomes. To get the 90th percentile as defined in Anderson et al. (2007, p. 69) use:

    quantile(height,.90,type=2)

All of the functions above can also be applied to the height of just the male (or just the female) students, e.g.:

    summary(height[sex=="Male"])

To cross-tabulate two variables use table(variable1,variable2). For example, to cross-tabulate sex and major from the students.csv data set, do:

    table(sex,major)
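The relation between the two versions of the standard deviation can be checked numerically. The sketch below uses five made-up heights (not the students data) and compares the rescaled sd() with a direct computation that divides by n.

```r
# Five made-up heights in cm (for illustration only)
h <- c(160, 165, 170, 175, 180)
n <- length(h)

# sd() divides the sum of squared deviations by n - 1 (sample formula)
sd.sample <- sd(h)

# Rescaling as in the text gives the version that divides by n
sd.population <- sqrt((n - 1) / n) * sd(h)

# Direct computation: root of the mean squared deviation from the average
sd.direct <- sqrt(mean((h - mean(h))^2))

sd.sample
sd.population
isTRUE(all.equal(sd.population, sd.direct))
```

The sample version is always somewhat larger than the version with n in the denominator; the gap shrinks as the number of entries grows.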

Displaying data and distributions

To get a density histogram of the heights use:

    hist(height, right=FALSE, freq=FALSE)

The convention in histograms is that class intervals include the left endpoint but not the right, that is, the 150-to-160 cm interval includes 150 cm and does not include 160 cm. This is achieved with the argument right=FALSE. The argument freq=FALSE is needed to get a density histogram: in a density histogram, the vertical scale is a density scale showing the proportion per horizontal unit ("crowding") and the relative frequencies are the areas of the bars (Freedman, Pisani, & Purves, 2007, Chapter 3). To plot a frequency histogram where the frequencies (counts) are the heights of the bars, as in Anderson et al. (2007, p. 33), use the argument freq=TRUE. To omit the title use the argument main="" (usually one types the title of a figure in a word processor or layout program). You can control the labels of plots with the xlab and ylab arguments. The label should be a string delimited by quotes:

    hist(height, right=FALSE, freq=FALSE, xlab = "Height (cm)")

R chooses appropriate class intervals, but you can control the number of classes (say, five) with the argument nclass=5 or the limits between class intervals (breaks) with the argument breaks=c(150,160,170,180,190,200):

    hist(height, right=FALSE, freq=FALSE, xlab = "Height (cm)",
         breaks=c(150,160,170,180,190,200))

To retrieve the information needed to make a frequency table (class limits, frequencies) use:

    hist(height, right=FALSE, freq=FALSE, plot=FALSE)$breaks
    hist(height, right=FALSE, freq=FALSE, plot=FALSE)$counts

You can compute the relative and percentage relative frequencies as:

    hist(height, right=FALSE, freq=FALSE, plot=FALSE)$counts/length(height)
    hist(height, right=FALSE, freq=FALSE, plot=FALSE)$counts/length(height)*100

R often surrounds graphs by a box. Omit the box, as recommended by Tufte (1983, p. 127), by adding the argument frame.plot=FALSE to the command to generate the graph.

To obtain a stem-and-leaf display of heights use:

    stem(height)

To obtain a box plot of heights use:

    boxplot(height,frame.plot=FALSE)

To compare the distributions of two (or more) variables in parallel box plots (only meaningful if all variables are measured on the same scale with the same units of measurement) use boxplot(variable1,variable2,...). Let us compare the distributions of male and female students' heights:

    boxplot(height[sex=="Male"],height[sex=="Female"])

or, more elegantly:

    boxplot(height ~ sex,frame.plot=FALSE)

Pie charts are a poor way to display counts of nominal data; use a bar chart instead or (usually the best choice) display the frequency table. To generate a bar chart or a pie chart of the sector shares do:

    barplot(share, names = sector)
    pie(share, labels = sector)
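Rather than calling hist() once for every column of the frequency table, you can store its result in an object and reuse it. The following sketch assumes fourteen made-up heights (the vector h) and class limits from 150 to 200 cm; the names hh and freq.table are arbitrary.

```r
# Fourteen made-up heights in cm (for illustration only)
h <- c(152, 158, 161, 163, 167, 168, 171, 174, 176, 179, 183, 188, 191, 195)

# Store the histogram information without drawing the plot
hh <- hist(h, right = FALSE, breaks = seq(150, 200, by = 10), plot = FALSE)

# Assemble a frequency table from the stored class limits and counts
freq.table <- data.frame(
  lower   = head(hh$breaks, -1),          # left endpoint (included)
  upper   = tail(hh$breaks, -1),          # right endpoint (excluded)
  count   = hh$counts,                    # frequency
  percent = 100 * hh$counts / length(h)   # percentage relative frequency
)
freq.table

# In a density histogram the areas of the bars add up to 1
total.area <- sum(hh$density * diff(hh$breaks))
total.area
```

The last line illustrates the density scale: each bar's area (density times class width) is the relative frequency of its class.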

Scatter plots and the line of best fit

The workhorse command to make plots from data is plot(): depending on the context, it generates scatter plots, time series diagrams, line diagrams, and more. By adding arguments you can control the appearance of the plot: label the axes, control ticks and tick marks, use colors, add annotations, and much more. To make a scatter plot of heights on the x-axis and weights on the y-axis use:

    plot(height,weight)

To control the labels (for instance, to include the units of measurement of the variables) use the xlab and ylab arguments. The labels should be strings delimited by quotes:

    plot(height,weight,xlab="Height (cm)",ylab="Weight (kg)",
         frame.plot=FALSE)

To find the covariance and the coefficient of correlation use:

    cov(height, weight)
    cor(height, weight)

If there are missing values, use:

    cor(height, weight, use="complete.obs")

To find the line of best fit or regression line between weight (as the dependent variable) and height (as the independent variable) use:

    fitted.model1 <- lm(weight ~ height)

In the lm command, put the dependent (y-axis) variable first, then a tilde (~) meaning "is modelled by", then the independent (x-axis) variable. lm stands for "linear model"; fitted.model1 is the name you assigned to the fitted model and can be any name. To display the coefficients, their standard errors, t-statistics and p-values, and the coefficient of determination ($R^2$) use:

    summary(fitted.model1)

To display just the coefficients use:

    coef(fitted.model1)

To store and display the fitted values ($\hat{y}$) use:

    weight.fitted <- fitted(fitted.model1)
    weight.fitted

To make a scatter plot of heights on the x-axis and weights on the y-axis displaying the line of best fit, first make the scatter plot:

    plot(height,weight, xlab="Height (cm)",ylab="Weight (kg)",
         frame.plot=FALSE)

then add the line of best fit (regression line):

    lines(height,weight.fitted)

or:

    abline(fitted.model1)
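The coefficients reported by lm() can be reproduced from the descriptive statistics introduced earlier: the slope equals the covariance divided by the variance of the x-variable, and the line passes through the point of averages. The sketch below checks this on six made-up observations; the names h, w, fit, a and b are arbitrary.

```r
# Made-up heights (cm) and weights (kg), for illustration only
h <- c(160, 165, 170, 175, 180, 185)
w <- c(55, 60, 63, 70, 74, 80)

# Line of best fit: weight modelled by height
fit <- lm(w ~ h)
coef(fit)

# The same coefficients computed by hand
b <- cov(h, w) / var(h)        # slope
a <- mean(w) - b * mean(h)     # intercept
c(a, b)

# For a single x-variable, R-squared is the squared correlation coefficient
r.squared <- summary(fit)$r.squared
isTRUE(all.equal(r.squared, cor(h, w)^2))
```

The check also clarifies why the order of the variables in lm(y ~ x) matters: swapping them changes the denominator of the slope from var(x) to var(y).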

Pasting a graph in a word processor document

If you want to paste a graph in a word processor document, do the following. In Windows, bring the R window with the graph to the front. Do Edit > Copy. Go to your word processor document and do Edit > Paste. In MacOS, bring the R window with the graph to the front. Choose Edit > Copy. Start the Preview application and in Preview do File > New From Clipboard. Save as a .png file. You can now copy and paste the graph from Preview into a word processor document. To save a graph, bring the window with the graph to the front and choose in the R menu File > Save as .... In Windows, save in the .png format. In MacOS, save as .pdf; that's the only option, but you can open the .pdf in Preview and save in .png.

Probability distributions

To randomly draw five numbers between 1 and 50 with and without replacement type:

    sample(1:50,5,replace=TRUE)
    sample(1:50,5,replace=FALSE)

The binomial formula (Freedman, Pisani & Purves, 2007, p. 259) is computed using dbinom(k,n,p). More concretely, consider a binomial experiment with 10 trials and a probability that the event of interest occurs ("success") of 0.4. To find the probability of 6 successes do:

    dbinom(6,10,0.4)

To get a list of all probabilities (of 0, 1, 2, ..., 10 successes) do:

    dbinom(0:10,10,0.4)

To plot the normal curve do:

    x <- seq(-4,+4,length=400)
    y <- dnorm(x)
    plot(x,y,ylab="f(x)",type="l",frame.plot=FALSE)

The cumulative normal distribution pnorm computes the area in the left tail of the normal curve with a mean of 0 and a standard deviation of 1 (the standard normal curve). To find the area under the normal curve to the left of -2 do:

    pnorm(-2)

To find the area under the normal curve to the right of 2 do:

    pnorm(2, lower.tail=FALSE)

To find an area under the normal curve between some lower boundary (say, -1) and some upper boundary (say, +2) do:

    pnorm(2) - pnorm(-1)

To compute this area and plot the normal curve with the shaded area do:

    xlo <- -1 # the lower boundary
    xup <- 2  # the upper boundary
    # you don't have to change anything in the code below:
    pnorm(xup) - pnorm(xlo)
    cord.x <- c(xlo,seq(xlo,xup,0.01),xup)
    cord.y <- c(0,dnorm(seq(xlo,xup,0.01)),0)
    curve(dnorm(x),xlim=c(-4,+4), ylab="f(x)",frame.plot=FALSE)
    polygon(cord.x,cord.y,col="grey",lty="blank")

Verify the empirical rule by computing:

    pnorm(+1) - pnorm(-1)
    pnorm(+2) - pnorm(-2)
    pnorm(+3) - pnorm(-3)

The procedures to find areas under the Student t-distribution curve are similar. The cumulative t-distribution is pt, and the arguments of the function are the upper boundary and the degrees of freedom. For example, to find the area to the left of +2 under a t-curve with 18 degrees of freedom do:

    pt(+2, df = 18)

To find the area between -2 and +2 under a t-curve with 18 degrees of freedom do:

    pt(+2, df = 18) - pt(-2, df = 18)

To compute the area and plot the t density curve with the shaded area do:

    tdf <- 18
    tlo <- -2
    tup <- +2
    # you don't have to change anything in the code below:
    pt(tup, df = tdf) - pt(tlo, df = tdf)
    cord.x <- c(tlo,seq(tlo,tup,0.01),tup)
    cord.y <- c(0,dt(seq(tlo,tup,0.01),df=tdf),0)
    curve(dt(x,df=tdf),xlim=c(-3,+3),xlab="t", ylab="f(t)",
          frame.plot=FALSE)
    polygon(cord.x,cord.y,col="grey",lty="blank")

To find quantiles of a variable that follows the normal distribution, use the qnorm function. To find the 25th percentile of a variable that follows the normal distribution with a mean of 20 and a standard deviation of 5 do:

    qnorm(.25, mean = 20, sd = 5)

To find the 25th percentile of the t-distribution with 18 degrees of freedom do:

    qt(.25, df = 18)
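A few identities tie the distribution functions of this section together, and checking them is a useful exercise. The sketch below compares dbinom with the binomial formula written out using choose(), and shows that qnorm is the inverse of pnorm; the object names are arbitrary.

```r
# Binomial formula by hand: n!/(k!(n-k)!) * p^k * (1-p)^(n-k)
k <- 6; n <- 10; p <- 0.4
by.hand <- choose(n, k) * p^k * (1 - p)^(n - k)
isTRUE(all.equal(by.hand, dbinom(k, n, p)))

# The probabilities of 0, 1, ..., 10 successes add up to 1
total.prob <- sum(dbinom(0:n, n, p))
total.prob

# qnorm undoes pnorm: the area to the left of the 25th percentile is 0.25
q25 <- qnorm(0.25, mean = 20, sd = 5)
area.left <- pnorm(q25, mean = 20, sd = 5)
area.left

# A percentile of any normal curve follows from the standard normal curve
isTRUE(all.equal(q25, 20 + 5 * qnorm(0.25)))
```

The last check is the conversion from standard units back to original units: average plus the standard-units value times the standard deviation.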


Confidence intervals and hypothesis tests

Freedman, Pisani, & Purves (2007) use z-tests rather than t-tests most of the time. That's OK if the sample is sufficiently large: in that case the test statistics and P-values for the z-test and t-test will be very similar. R has no z-test procedure; use a t-test instead. Here's how the t.test procedure works. Assuming that the variable weight is normally distributed and the sample is a simple random sample, the 95 percent confidence interval for the population mean is obtained by:

    t.test(weight, conf.level = .95)$conf.int

Suppose you would like to test hypotheses concerning the average weight of the population. Assuming that weights are normally distributed and the sample is a simple random sample, a t-test is the appropriate method. To test the null hypothesis that the population average (usually denoted by μ, the Greek letter mu) is 64 kg against the two-sided alternative that the population average differs from 64 kg (in either way) do:

    t.test(weight, mu = 64)

To test the null hypothesis that the population average is 64 kg against the one-sided alternative that the population average is less than 64 kg do:

    t.test(weight, alternative = "less", mu = 64)

To test the null hypothesis that the population average is 64 kg against the one-sided alternative that the population average is greater than 64 kg do:

    t.test(weight, alternative = "greater", mu = 64)

To find a confidence interval or do a hypothesis test on a proportion or a percentage use binom.test. Suppose that you took a simple random sample of 1,600 persons, of which 917 people are Democrats (Freedman, Pisani, & Purves, 2007, example 2 p. 382). To get the 95 percent confidence interval for the proportion use:

    binom.test(917, 1600, conf.level = .95)$conf.int

To get the 95 percent confidence interval for the percentage multiply by 100%:

    100*binom.test(917, 1600, conf.level = .95)$conf.int

To test the null hypothesis that the percentage of Democrats in the population is 55% against the one-sided alternative that the percentage is greater than 55%, do:

    binom.test(917, 1600, 0.55, alternative = "greater")
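To see what t.test does behind the scenes, the confidence interval and the two-sided P-value can be rebuilt from the mean, the standard error, and the t-distribution functions qt and pt. This is a sketch on eight made-up weights (the vector w), not the students data; the other names are arbitrary.

```r
# Eight made-up weights in kg (for illustration only)
w <- c(58, 62, 65, 70, 71, 74, 77, 80)
n <- length(w)

# Standard error of the average, based on the sample standard deviation
se <- sd(w) / sqrt(n)

# 95 percent confidence interval: average plus or minus t-multiplier times SE
ci.by.hand <- mean(w) + c(-1, 1) * qt(0.975, df = n - 1) * se
ci.t.test  <- t.test(w, conf.level = 0.95)$conf.int
ci.by.hand
isTRUE(all.equal(ci.by.hand, as.numeric(ci.t.test)))

# Two-sided test of the null hypothesis that the population average is 64 kg
t.by.hand   <- (mean(w) - 64) / se
p.two.sided <- 2 * pt(-abs(t.by.hand), df = n - 1)
isTRUE(all.equal(p.two.sided, t.test(w, mu = 64)$p.value))
```

Replacing qt and pt by qnorm and pnorm in this sketch would give the corresponding z-based results, which is one way to work with z-tests by hand.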

Time series

If you work with annual time series and don't need specific time series operations such as lagging a variable or making time series diagrams, the commands above will get you a long way. Suppose you have an annual time series of a price index for a country starting in 1993. The numbers can be entered as follows:

    year <- c(1993,1994,1995,1996,1997,1998,1999)
    price <- c(90.1,92.3,95.3,96.1,97.2,99.0,100.0)

You can now make a scatter plot, compute a time trend and so on. However, it's often useful and sometimes necessary to create a time-series object. The command ts() creates a time series (usually called price.ts or so) from an existing variable price as follows:

    price.ts <- ts(price,start=c(1993,1))
    price.ts # displays the time series

Suppose your data have four observations per year, one for each quarter. Here's a time series of quarterly real GDP starting in the first quarter of 1991 (expressed as an index, 2000 = 100):

    gdp <- c(86.0,85.3,84.9,86.1,87.7,87.0,87.0,87.0,86.0,86.1,86.8)

Create a time series object as follows:

    gdp.ts <- ts(gdp,start=c(1991,1),frequency=4)
    gdp.ts # displays the time series

frequency = 4 means that there are four observations (quarters) per unit of time (year). To plot the time series do:

    plot(gdp.ts,ylab="Real GDP (index, 2000 = 100)",frame.plot=FALSE)

To compute and plot a linear time trend do:

    fitted.time.trend <- lm(gdp.ts ~ time(gdp.ts))
    abline(fitted.time.trend, lty="dashed")

lty stands for "line type" (in this case, a dashed line). To display the coefficients and other statistics of the time trend use:

    summary(fitted.time.trend)

To lag the time series by one period (in this case, one quarter) do:

    lag(gdp.ts, k = -1)

To obtain the quarter-on-quarter growth rate of GDP do:

    (gdp.ts-lag(gdp.ts,k=-1))/lag(gdp.ts,k=-1)

To obtain the year-on-year growth rate of quarterly GDP do:

    (gdp.ts-lag(gdp.ts,k=-4))/lag(gdp.ts,k=-4)

Multiply by 100 to obtain percentage growth rates.

R can also work with multivariate time series. The relationship between unemployment and inflation is known as the Phillips curve. The following data are annual time series of the unemployment rate and the inflation rate for a country starting in 1993:

    unemployment <- c(10.6,9.5,8.2,8.2,8.3,7.7,6.9,6.3,6.8,6.4,6.1)
    inflation <- c( 1.4,0.8,1.5,2.2,1.5,0.2,0.5,4.3,3.8,2.9,2.8)

Create the time series variables:

    inflation.ts <- ts(inflation ,start=c(1993,1),frequency=1)
    unemployment.ts <- ts(unemployment,start=c(1993,1),frequency=1)

Plotting both series in a single diagram makes sense because they are measured on the same scale and use the same units (percent). To plot both series over time in a single time series diagram do:

    ts.plot(inflation.ts,unemployment.ts,lty=c(1:2),frame.plot=FALSE)
    legend("topright",legend=c("Inflation","Unemployment"),lty=c(1:2))

The argument lty=c(1:2) uses two different line types for the two series; to use different colors do:

    ts.plot(inflation.ts,unemployment.ts, col=c(1:2),frame.plot=FALSE)
    legend("topright",legend=c("Inflation","Unemployment"),lty=1,col=c(1:2))

You can explore the relationship between two time series by making a scatter plot and drawing the line of best fit. The command

    plot(unemployment.ts,inflation.ts,frame.plot=FALSE)

will generate a scatter plot with points labeled by time and connected by lines showing the time path. To obtain a plain vanilla scatter plot (with points indicating the observations, no time labels, no time path), use:

    plot(unemployment.ts,inflation.ts,frame.plot=FALSE,xy.labels=FALSE)

(I omitted the xlab and ylab arguments for clarity.) To add the line of best fit and compute the coefficient of correlation, do:

    Phillips.curve.fitted <- lm(inflation.ts ~ unemployment.ts)
    abline(Phillips.curve.fitted)
    cor(unemployment.ts,inflation.ts, use="pairwise.complete.obs")

(The following section is quite technical. You may want to skip it.) Working with the linear model object (lm) is tricky with time series objects, especially when there are missing values (NA) or time lags (see the R Help entry on lm). Suppose you would like to estimate the relationship between unemployment and the change in the inflation rate (a simplified form of the expectations-augmented Phillips curve). First, create and display a new time series of the change in the inflation rate:

    change.in.inflation.ts <- inflation.ts - lag(inflation.ts, k = -1)
    change.in.inflation.ts

The time series change.in.inflation.ts starts one period later than inflation.ts. If you try to estimate the linear relationship between unemployment and the change in the inflation rate, R returns an error message because the two series are not of the same length. To make both series of the same length, use the command ts.union, which binds the two time series by padding the shorter series with NAs to the total time coverage. The resulting new data frame (data.Phillips in the example below) will have the same start and end period for all variables:

    data.Phillips <- ts.union(change.in.inflation.ts,unemployment.ts,dframe=TRUE)
    data.Phillips

Attaching the new data frame won't work because the variable names are already in use, but you can call the new variables by:

    data.Phillips$change.in.inflation.ts
    data.Phillips$unemployment.ts

Here's how to make the scatter plot, add the line of best fit, and compute the coefficient of correlation (again, I omitted the frame.plot=FALSE, xlab and ylab arguments for clarity):

    augmented.Phillips.curve.fitted <-
      lm(data.Phillips$change.in.inflation.ts ~ data.Phillips$unemployment.ts)
    summary(augmented.Phillips.curve.fitted)
    cor(data.Phillips$unemployment.ts,data.Phillips$change.in.inflation.ts,
        use="pairwise.complete.obs")
    plot(data.Phillips$unemployment.ts,data.Phillips$change.in.inflation.ts,
         xy.labels=FALSE)
    abline(augmented.Phillips.curve.fitted)

To learn more about using R for advanced time series work, consult Pfaff (2006) or Shumway & Stoffer (2006).
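Two features of time series objects used in this section are easy to verify on the data given above: arithmetic with lag() keeps only the periods that both series share, and ts.union() pads the shorter series with NA. The following sketch uses the GDP and inflation figures from the text; the names growth.qoq, growth.by.hand, change.ts and both are arbitrary.

```r
# Quarterly GDP index from the text, starting in the first quarter of 1991
gdp <- c(86.0,85.3,84.9,86.1,87.7,87.0,87.0,87.0,86.0,86.1,86.8)
gdp.ts <- ts(gdp, start = c(1991, 1), frequency = 4)

# Quarter-on-quarter growth in percent, using lag() as in the text
growth.qoq <- 100 * (gdp.ts - lag(gdp.ts, k = -1)) / lag(gdp.ts, k = -1)

# The same computation on the plain vector: (current - previous) / previous
previous <- gdp[-length(gdp)]
current  <- gdp[-1]
growth.by.hand <- 100 * (current - previous) / previous
isTRUE(all.equal(as.numeric(growth.qoq), growth.by.hand))
start(growth.qoq)   # the growth series starts one quarter after gdp.ts

# Annual inflation from the text and its year-to-year change
inflation <- c(1.4,0.8,1.5,2.2,1.5,0.2,0.5,4.3,3.8,2.9,2.8)
inflation.ts <- ts(inflation, start = c(1993, 1), frequency = 1)
change.ts <- inflation.ts - lag(inflation.ts, k = -1)

# ts.union() pads the shorter series with NA so that both cover all years
both <- ts.union(inflation.ts, change.ts)
both
```

The first row of both shows NA for the change in inflation, because no change can be computed for the first year of the sample.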


Plotting functions

R is an excellent tool to plot mathematical functions. Consider the following supply and demand schedules:

$$Q_s = 10 + 40P$$
$$Q_d = 100 - 20P$$

Solve both equations for the y-axis variable ($P$):

$$P = -\frac{10}{40} + \frac{1}{40} Q_s$$
$$P = \frac{100}{20} - \frac{1}{20} Q_d$$

Use seq to define the desired range of Q (here, 0 to 150 is fine, but the appropriate range depends on the problem at hand; you may have to experiment a bit):

    Q <- seq(0,150)

Then define the supply and demand functions:

    Ps <- -(10/40) + (1/40)*Q
    Pd <- (100/20) - (1/20)*Q

Use plot to draw the supply schedule:

    plot(Q,Ps,xlab="Quantity (kg)",
         ylab="Price (euros per kg)",type="l")

Use lines to add the demand schedule to the plot you just created:

    lines(Q,Pd)

Another method uses the curve(expression,from,to,...) command:

    curve(-(10/40)+(1/40)*x,0,150,
          xlab="Quantity (kg)", ylab="Price (euros per kg)")
    curve((100/20)-(1/20)*x,add=TRUE) # adds the demand curve

To add text (such as the labels Supply and Demand) first locate the coordinates of the places where you want to position the text. Enter the instruction locator(2), press enter, and then click with the left mouse button on the two positions in the graph window. The console returns the coordinates. Then use text:

    text(130,2.6,"Supply")
    text(120,0,"Demand")

The instruction locator(1) will also give the (approximate) location of the intersection between the supply and demand curve. To get the exact equilibrium (Q = 70, P = 1.5) you have to solve the system of equations. The R command to solve systems of equations (solve) requires you to write the system of equations in matrix form and is beyond the scope of this paper. To plot the equilibrium point (70, 1.5) and dashed lines showing the coordinates do:

    points(70,1.5)
    segments(70, 0 , 70, 1.5, lty="dashed", col="grey") # vertical
    segments( 0, 1.5, 70, 1.5, lty="dashed", col="grey") # horizontal

A simple way to add a straight line to an existing plot uses the abline function. Read abline as "a-b line": a is the intercept and b is the slope or gradient of the line with equation y = a + bx. For instance, suppose that the demand schedule shifts to:

$$P = \frac{120}{20} - \frac{1}{20} Q_d$$

To add the new demand schedule as a dashed line do:

    abline(a = 120/20, b = -1/20, lty = "dashed")

lty stands for "line type"; if you omit the lty argument, abline and lines plot a solid line. Alternatively, you can use:

    curve(120/20-(1/20)*x, lty = "dashed", add=TRUE)
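The equilibrium quoted above can be checked with plain arithmetic, without the matrix-based solve command. Setting the supply price equal to the demand price and solving for Q by hand gives the expression used below. This is a sketch; the function and object names (supply.price, demand.price, Q.star and so on) are illustrative.

```r
# Supply and demand schedules solved for the price P
supply.price <- function(Q) -(10/40) + (1/40) * Q
demand.price <- function(Q) (100/20) - (1/20) * Q

# Setting supply.price(Q) equal to demand.price(Q) and solving for Q:
#   -(10/40) + (1/40) Q = (100/20) - (1/20) Q
Q.star <- (100/20 + 10/40) / (1/40 + 1/20)
P.star <- supply.price(Q.star)
c(Q.star, P.star)

# At the equilibrium quantity both schedules give the same price
isTRUE(all.equal(supply.price(Q.star), demand.price(Q.star)))

# Equilibrium after the demand schedule shifts to P = 120/20 - (1/20) Q
Q.new <- (120/20 + 10/40) / (1/40 + 1/20)
P.new <- supply.price(Q.new)
c(Q.new, P.new)
```

The coordinates computed this way can be passed to points() and segments() to mark the new equilibrium in the plot.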


Troubleshooting R

The first thing you should do when you have a problem is to check this document and use R's built-in help function (for example, if you have a problem with the cor command, type help(cor)). Here are some frequent problems and possible solutions.

Problem 1. The read.table command doesn't work. I get the error message:

    Error in file(file, "r") : unable to open connection
    In addition: Warning message:
    cannot open file ....., reason No such file or directory

Solution: You probably made an error in the location of the data file. Carefully check the location of the data file and compare with the location in the read.table command. Make sure you used forward leaning slashes (/), not backslashes, and the proper identical quotes ("...", not “...”). Make sure that the data file is a plain text file with the .csv or .txt extension.

Problem 2. The read.table command doesn't work. I get the error message:

    Error in read.table("...", header = TRUE, :
      more columns than column names

or:

    Error in scan(file, ..., na.strings, :
      line 1 did not have 10 elements

Solution: Open the file in a text editor (Notepad, TextEdit). Check that the data file has the required format: the first line contains variable names separated by blank spaces or commas, and subsequent lines contain the data values separated by blank spaces or commas. There should be as many data values per line as there are variable names in line one. In a comma-separated file, verify that there are no redundant commas; remove any you find.

Problem 3. I loaded my data set using read.table. When I try to manipulate one of the variables (print, plot, compute descriptive statistics, ...) I get the error message:

    Error: object "..." not found

Solution: Type ls() (list) and press enter to see a list of the objects in your workspace. The data set you created with read.table should appear in that list. Variable names made available with the attach(...) command are not listed by ls(); type search() to check whether the data set is attached. If it isn't, you probably forgot to attach the database. If the problem still occurs after you attached the database, check whether you included the separator (sep) argument in the read.table command.

Problem 4. I typed a command and pressed enter. Nothing happened, except that the R console returns a + prompt instead of the usual > prompt.

Solution: The + prompt indicates that R expects additional input. Probably your previous command was incomplete; you may have forgotten a closing bracket or other input. Type the missing part of the command after the + prompt and press enter.

Problem 5. The cor command doesn't work. I get the error message:

    Error in cor(variable1,variable2) : missing observations in cov/cor

Solution: Your variable1 or variable2 contains missing observations (coded as NA). In that case, you have to specify what cor should do with missing observations:

    cor(variable1,variable2, use = "complete.obs")

Check help(cor) for details.

Problem 6. I tried to plot a line of best fit in the scatter plot. I don't get an error message, but the line of best fit doesn't appear in the scatter plot.

Solution: You probably confused the y-axis and x-axis variables in the lm(y ~ x) command: in the lm(y ~ x) command the y-axis variable should come first, while in the plot(x,y) command the x-axis variable should come first. y ~ x means "y modelled by x".
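Problems 5 and 6 can be reproduced with a few made-up numbers. The sketch below is illustrative only: x, y, r, and fit are hypothetical names and the data are invented. Note also that recent versions of R return NA from cor(x, y) when observations are missing rather than stopping with the error message shown above.

```r
# Made-up data for illustration; y has one missing observation
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, NA, 8.2, 9.8, 12.1)

# Problem 5: tell cor() to use only the complete pairs of observations
r <- cor(x, y, use = "complete.obs")
r

# Problem 6: plot(x, y) takes the x-axis variable first,
# lm(y ~ x) takes the y-axis variable first
plot(x, y)
fit <- lm(y ~ x)     # lm() leaves out the incomplete pair by default
abline(fit)          # adds the line of best fit to the scatter plot

# Swapping the variables gives a different line, which is why
# the line can fall outside the plotting region
coef(fit)
coef(lm(x ~ y))
```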

Appendix: Importing data from an external file

This is an advanced topic. You can skip it if you just want to get a quick introduction to R, but you have to read it if you want to use R for more serious work.

A.1 The read.table() command

It's a good idea to store your data in an external file and import them into R with the read.table() command. For beginners, importing data from an external file is the most challenging part of getting started with R (or any other statistical software). R is flexible and can import data from files in many different ways; I'll explain just one method (importing from a comma-separated values file).

I recommend storing all your R data in the same location. Create a new directory R Data in the Documents directory (MacOS) or My Documents directory (Windows). Download the data file students.csv from my Statistics 101 home page (Hens, 2007):

    http://homepages.vub.ac.be/~lmahens/students.csv

Store students.csv in the R Data directory on your hard disk.

It is important that your operating system shows the complete data file names including the extensions such as .txt and .csv. To view the extension of a file do the following:

MacOS: select the relevant file(s) in the Finder and choose File > Get Info; then click Name & Extension and deselect the Hide extension checkbox(es).

Windows: open the R Data directory in My Computer. In the menu bar of the R Data directory, choose Tools > Folder options... > View. Uncheck the box Hide extensions (...). This will display the extensions of all the files in the R Data directory.

The data should be in a plain text file (see above) with a specific format:

Variables should be in columns, elements (cases) in rows.

The first line should contain the variable names. The subsequent lines contain the data values. It's possible to include comment lines to document the source and units of measurement of the data; comment lines should start with a hash (#).

Variable names and values should be separated by commas. That's why such a text file is called a comma-separated values file and gets the extension .csv rather than the extension .txt. Many databases offer the option to export data to a csv formatted file. Section A.2 explains how to export spreadsheet files to csv files.

Open the file students.csv with a text editor (like TextEdit in MacOS or NotePad in Windows) to see an example of a data file in plain text format. This data set shows the sex, height (in cm), weight (in kg), and major for 30 students. It is a file in comma-separated values (csv) text format. I vertically aligned the columns to improve readability, but this is not required.

    "sex"   ,"height","weight","major"
    "Female",   172  ,   63   ,"Business"
    "Female",   170  ,   70   ,"International Affairs"
    "Female",   170  ,   52   ,"Other"
    "Female",   171  ,   52   ,"Communications"
    "Male"  ,   186  ,   90   ,"Business"
    "Male"  ,   183  ,   79   ,"Business"
    "Male"  ,   170  ,   66   ,"Communications"
    "Female",   169  ,   56   ,"Business"
    "Male"  ,   175  ,   75   ,"International Affairs"
    "Female",   175  ,   65   ,"Communications"
    "Male"  ,   195  ,   94   ,"Business"
    "Female",   176  ,   51   ,"International Affairs"
    "Male"  ,   188  ,   76   ,"International Affairs"
    "Male"  ,   192  ,   82   ,"Business"
    "Male"  ,   172  ,   70   ,"International Affairs"
    "Female",   169  ,   53   ,"Business"
    "Female",   172  ,   52   ,"International Affairs"
    "Male"  ,   178  ,   85   ,"Business"
    "Female",   177  ,   59   ,"Communications"
    "Male"  ,   178  ,   72   ,"International Affairs"
    "Female",   160  ,   54   ,"Business"
    "Female",   175  ,   54   ,"International Affairs"
    "Male"  ,   190  ,   70   ,"International Affairs"
    "Male"  ,   178  ,   85   ,"Business"
    "Female",   163  ,   55   ,"Business"
    "Female",   161  ,   59   ,"Business"
    "Female",   162  ,   44   ,"Communications"
    "Female",   170  ,   54   ,"Business"
    "Female",   154  ,   52   ,"Business"
    "Female",   170  ,   65   ,"Business"

Each column represents a variable, and the first row contains the variable names. The numbers are not enclosed in quotes; the words and descriptive phrases are. Close students.csv before proceeding. (If you don't have students.csv but have an electronic version of this document, copy and paste the data into a text document (in Windows use Notepad or Wordpad, in MacOS use TextEdit). Save the document as students.csv.)
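The file format described above can be tried out without downloading anything. The following sketch is an illustration under stated assumptions, not the author's procedure: it writes three made-up rows plus a comment line to a temporary file and reads them back. The names csv.lines, demo.file, and demo are hypothetical.

```r
# A small made-up data set in the format described above;
# the first line is a comment and is skipped by read.table()
csv.lines <- c(
  '# Source: made-up example; height in cm, weight in kg',
  '"sex","height","weight","major"',
  '"Female",172,63,"Business"',
  '"Male",186,90,"Business"',
  '"Female",170,52,"Other"'
)

demo.file <- tempfile(fileext = ".csv")   # temporary file, deleted below
writeLines(csv.lines, demo.file)

demo <- read.table(demo.file, header = TRUE, sep = ",")
demo
str(demo)            # height and weight are numeric; sex and major hold text

unlink(demo.file)    # delete the temporary file
```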

Now we'll tell R what the working directory is, that is, where to look for data. You can set the working directory with the R setwd command. The command will look like this in Mac OS:

    setwd("/Users/myMacBook/Documents/R Data")

and like this in Windows:

    setwd("C:/Documents and Settings/User/My documents/R Data")

The text in quotation marks is the path to the working directory. To find the path to the working directory in Windows, open My Computer and select any file in the R Data directory. Right-click the file and select Properties. Select and copy the file location, which will look something like: C:\Documents and Settings\User\My documents\R Data. Caution: Windows uses backslashes (\) in the path, but in the setwd() command you should use forward slashes (/). To find the path to the working directory in Mac OS, select any file in the R Data directory and choose File > Get Info. The Where item gives the path to the file.

You can also set the working directory via the menu:

Mac OS: R > Preferences > Initial working directory. Click the Change button and select the R Data directory you created. In Mac OS, R remembers the working directory the next time you start R, so you only have to do this once.

Windows: File > Change dir... . A dialog window opens. Browse to the R Data directory you created and click OK. In Windows, R doesn't remember the new working directory, so you'll have to redo this the next time you start R.

In the R console check the working directory with the command getwd().

Now we're ready to read the comma-separated data table from students.csv stored in the working directory:

    students <- read.table("students.csv",header=TRUE, sep=",")

The first part of the command (students <-) assigns the result of the read.table command to an object called students (it can be any valid object name and should not contain blank spaces). The "students.csv" argument gives the file name including the extension and enclosed in quotes.
The argument header=TRUE indicates that the file contains the names of the variables as its first line (header). The argument sep="," indicates that values on each line of the file are separated by a comma. To display the contents of the students object just type students. You can now call the variables stored in the object students as students$sex, students$height, students$weight, and students$major. Try this by typing students$height and pressing enter. It is more convenient to call the variables simply as sex, height, weight, and major. To do this, attach the data frame students with the attach() function:

    attach(students) # makes the variables in students available by name


Now type height and press enter. The console will display the values of the variable height.

As shown already, you can also read data stored on a web site using the url of the data set. In particular, if you use Dropbox, you can store your data files on your Dropbox account and access them from within R. Just put the Dropbox link to your data file (a url) enclosed in quotes in the read.table() command. For example, I stored a copy of the students.csv file in the Public folder of my own Dropbox account. The url for the file is:

    http://dl.dropbox.com/u/12062161/students.csv

To read the comma-separated data table from students.csv in R do:

    students.from.Dropbox <- read.table("http://dl.dropbox.com/u/12062161/students.csv", header=TRUE,sep=",")
    students.from.Dropbox
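Two practical checks go with the steps in this section: turning a path copied from Windows into the form setwd() expects, and confirming that R can see the data file before calling read.table(). The sketch below is only a suggestion; win.path and r.path are made-up names, and the path is the example path used in this section.

```r
# Path copied from Windows (a backslash is written as \\ inside an R string)
win.path <- "C:\\Documents and Settings\\User\\My documents\\R Data"

# Replace every backslash with a forward slash
r.path <- gsub("\\", "/", win.path, fixed = TRUE)
r.path

# Before reading a file, check that R can see it from the working directory
getwd()
file.exists("students.csv")   # TRUE only if students.csv is in the working directory
```

The value of r.path can then be passed to setwd() on a computer where that directory exists.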

A.2 Converting a spreadsheet file to a csv file

If the data are in a spreadsheet file (using a format such as .xls, .ods, or .numbers), first convert the spreadsheet file to a comma-separated values file. Here's how to do this in OpenOffice.org Calc (see footnote 1).

First, verify whether the spreadsheet program uses decimal points: open a new spreadsheet, type in cell A1: =pi() (the function to display the mathematical constant π = 3.1415927...) and press enter. If the cell displays something like 3.14 (possibly with more decimals) your spreadsheet program uses decimal points. If the cell displays something like 3,14 (possibly with more decimals) your spreadsheet program uses decimal commas and the procedure is slightly different (as explained below).

Second, open the data spreadsheet (in the example: students2.ods or students2.xls) in OpenOffice.org Calc (OpenOffice.org Calc opens Microsoft Excel spreadsheet files). Verify that the variables are in columns and the elements (cases) in rows. In the menu, select File > Save as... . A dialog window Select File Type appears. Select Text CSV and leave the Hide Extension box unchecked. The file name in the dialog window will automatically get the .csv extension. Select the R Data directory as the location of the .csv file. Then select Save and click Yes to save as Text CSV. A new dialog window Export of text files appears. Leave the Character set at its default value. If your data have decimal points select as Field delimiter a comma (,). If your data have decimal commas select as Field delimiter a semicolon (;). Leave the text delimiter at its default value ("). Leave the Save cell content as shown box checked and the Fixed column width box unchecked. Note that it's OK to save with the .csv extension even if you used semicolons rather than commas to separate data values. We'll tell R in the read.table command whether the data file uses commas or semicolons to separate values.
Footnote 1: I recommend OpenOffice.org Calc because it gives the user more control over how the spreadsheet is exported to a csv file. OpenOffice.org Calc opens Microsoft Excel spreadsheet files (.xls) and allows you to save them as csv files. In Microsoft Excel the instructions to save a spreadsheet file to a csv file start with: File > Save as...; in iWork Numbers with: File > Export. Consult the manuals or help functions of Microsoft Excel or iWork Numbers for details on exporting to csv files.


Open the text file (students2.csv) in your text editor and clean it up if necessary: sometimes the spreadsheet may have had empty but active cells, which will show up in the csv file as redundant value separators (commas or semicolons) at the end of a line or at the end of the document, or as blank lines at the end of the document. Remove them and save the file. You can now import the table in R as explained in the previous section. If your data file uses decimal points and has commas as separators the read.table() command looks like:

    students2 <- read.table("students2.csv",header=TRUE,sep=",")

If your data file uses decimal commas and uses semicolons to separate values, the read.table command should have as arguments sep=";" (values are separated by semicolons) and dec="," (numbers have decimal commas):

    students2 <- read.table("students2.csv",header=TRUE,sep=";",dec=",")

Type students2$height and press enter to display the heights.
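To see what sep=";" and dec="," do, here is a small sketch that reads made-up lines directly from a character vector instead of a file, using the text argument of read.table(). The names csv2.lines and demo2 and the data values (heights in metres, so that decimals appear) are invented for illustration.

```r
# Made-up lines as they might look in a file exported with
# semicolons as separators and decimal commas (height in metres)
csv2.lines <- c(
  '"sex";"height";"weight"',
  '"Female";1,72;63,5',
  '"Male";1,86;90,0'
)

demo2 <- read.table(text = csv2.lines, header = TRUE, sep = ";", dec = ",")
demo2$height          # read as numbers, displayed with decimal points
mean(demo2$height)
```

If dec="," were left out, the values with decimal commas would not be read as numbers, so mean() would not return a numeric average.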

A.3 Exporting data to a text file

Assume that you would like to sort the students by height and export the result to a file. First create a new data frame with the students sorted by height:

    students.ordered <- students[order(height),]

(The comma before the closing bracket is not a typo; it should be there.) Then write the new data to an external file (newdata.txt) in your working directory:

    write.table(students.ordered, file="newdata.txt",sep=",")
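The sort-and-export steps can be tested on a small made-up data frame before applying them to real data. This is a sketch under assumptions: demo, demo.ordered, out.file, and check are hypothetical names, and the argument row.names = FALSE (which leaves the row labels out of the file) is an optional addition that the command above does not use.

```r
# Made-up data frame standing in for the students data
demo <- data.frame(sex    = c("Female", "Male", "Female"),
                   height = c(172, 186, 170),
                   weight = c(63, 90, 52))

# Sort the rows by height; the comma before ] selects all columns
demo.ordered <- demo[order(demo$height), ]
demo.ordered

# Write to a temporary file and read it back to check the result
out.file <- tempfile(fileext = ".txt")
write.table(demo.ordered, file = out.file, sep = ",", row.names = FALSE)
check <- read.table(out.file, header = TRUE, sep = ",")
check$height

unlink(out.file)      # delete the temporary file
```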

A.4 Built-in data sets and data packages

R comes with a number of built-in data sets. To get a list of the built-in data sets do:

    data()

To get a description of one of the data sets (say, ChickWeight) do:

    help(ChickWeight)

To display the variable weight from the ChickWeight data set do:

    ChickWeight$weight

More data sets are available as packages. The Penn World Table (national income accounts for 188 countries over 1950-2000; Heston, Summers, & Aten, 2006) is available as the R package pwt. To install pwt make sure that you are online. Type:

    install.packages("pwt")


Installing a package can take a couple of minutes. You only have to do this once on the same computer: the command downloads and stores the Penn World Table on your hard disk. When the package pwt is installed, call it from the library as follows:

    library(pwt)

You have to repeat library(pwt) whenever you start a new R session and want to use the Penn World Table. Open the help file on pwt:

    help.start()

and follow the link packages and go to the pwt link. The file contains a description of the data set and the variables. To see how to cite the pwt package, type:

    citation(package="pwt")

To load the data from Penn World Table 6.2 and create a data frame type:

    data("pwt6.2")

To get the variable names type:

    names(pwt6.2)

To display the variable openc (openness, % in current prices) for Belgium type:

    pwt6.2$openc[pwt6.2$country=="Belgium"]

To display the years type:

    pwt6.2$year[pwt6.2$country=="Belgium"]

To create the corresponding time series type:

    openc.BEL.ts <- ts(pwt6.2$openc[pwt6.2$country=="Belgium"], start=1950,frequency=1)

To plot openc with the correct labels and layout type:

    plot(openc.BEL.ts,xlab="Year", ylab="Openness, % in current prices",frame.plot=FALSE)
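The same two techniques (selecting values with a logical condition and turning yearly values into a time series) can be practised on data sets that ship with R, without installing pwt. The sketch below is an illustration only: it uses the built-in ChickWeight and longley data sets rather than the Penn World Table, and weight.diet1 and employed.ts are made-up names.

```r
# Select the weights of the chicks on diet 1, in the same way as
# pwt6.2$openc[pwt6.2$country=="Belgium"] selects the values for Belgium
weight.diet1 <- ChickWeight$weight[ChickWeight$Diet == "1"]
length(weight.diet1)
summary(weight.diet1)

# longley is another built-in data set, with one observation per year
range(longley$Year)
employed.ts <- ts(longley$Employed, start = min(longley$Year), frequency = 1)
plot(employed.ts, xlab = "Year", ylab = "Employed", frame.plot = FALSE)
```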


Frequently used commands


    read.table(), scan()                    Read data
    x, print(x)                             Display the object x
    hist(x)                                 Histogram
    boxplot(x)                              Box plot
    plot(x), plot(x,y)                      Plot data
    table(x,y)                              Cross-tabulate
    summary(x), quantile(x), mean(),        Descriptive statistics
      median(), var(), sd()
    cov(x,y)                                Covariance
    cor(x,y)                                Correlation
    fitted.model1 <- lm(y ~ x)              Line of best fit
      summary(fitted.model1)
    dbinom(1:10,10,.4)                      Binomial distribution
    pbinom(1:10,10,.4)                      Cumulative binomial distribution
    pnorm(23, mean = 20, sd = 5)            Cumulative normal distribution
    qnorm(.25, mean = 20, sd = 5)           Quantile of normal distribution
    pt(.25, df = 18)                        Cumulative t distribution
    qt(.25, df = 18)                        Quantile of t distribution
    t.test(x, conf.level = .99)             Confidence interval
    t.test(x, mu = 64, conf.level = .99)    Hypothesis test
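The distribution functions in this list come in families: the functions starting with d give probabilities, those starting with p give cumulative probabilities, and those starting with q give quantiles. As a rough sketch of how they relate (the object names p and d are made up for illustration):

```r
# Cumulative probability P(X <= 23) for a normal distribution
# with mean 20 and standard deviation 5
p <- pnorm(23, mean = 20, sd = 5)
p

# qnorm() is the inverse of pnorm(): it returns the value 23 again
qnorm(p, mean = 20, sd = 5)

# pbinom() is the running total of dbinom(), starting from 0 successes
d <- dbinom(0:10, 10, .4)
cumsum(d)
pbinom(0:10, 10, .4)

# pt() and qt() are related in the same way
qt(pt(.25, df = 18), df = 18)
```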

References
[1] Adler, J. (2010). R in a nutshell. Sebastopol, CA: O'Reilly.
[2] Anderson, D.R., Sweeney, D.J., Williams, T.A., Freeman, J., & Shoesmith, E. (2007). Statistics for Business and Economics. London: Thomson.
[3] Crawley, M.J. (2002). Statistics: An Introduction Using R. Wiley.
[4] Crawley, M.J. (2007). The R Book. Chichester (UK): Wiley. (VUB library call number 004.43 G CRAW 2007)
[5] Dalgaard, P. (2002). Introductory Statistics with R. Berlin: Springer. (VUB library call number 004.9 G DALG 2002)
[6] Fox, J. (2002). An R and S-Plus Companion to Applied Regression. London: Sage. (VUB library call number 004.9 G FOX 2002)
[7] Freedman, D., Pisani, R., & Purves, R. (2007). Statistics. 4th edition. New York: Norton.
[8] Hens, L. (2009). Statistics 101 homepage. URL: http://homepages.vub.ac.be/~lmahens/STA101.html
[9] Heston, A., Summers, R., & Aten, B. (2006). Penn World Table Version 6.2. Philadelphia: Center for International Comparisons of Production, Income and Prices at the University of Pennsylvania. Retrieved on 12 February 2009 from http://pwt.econ.upenn.edu/
[10] Pfaff, B. (2006). Analysis of Integrated and Co-integrated Time Series with R. Berlin: Springer.
[11] R Development Core Team (2008). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. Retrieved on 12 February 2009 from http://www.R-project.org.
[12] Shumway, R.H., & Stoffer, D.S. (2006). Time Series Analysis and Its Applications, With R Examples. 2nd edition. Berlin: Springer.
[13] Tufte, E.R. (1983). The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.
[14] Venables, W. N., Smith, D. M., & the R Development Core Team (2006). An Introduction to R. URL: http://www.R-project.org.
[15] Verzani, J. (2005). Using R for introductory statistics. Boca Raton (FL): Chapman and Hall/CRC. (VUB library call number 311.1 G VERZ 2005)
