
Data manipulation tricks: Even better in R!

by Sharon Machlis, Online Managing Editor, Computerworld


After recently covering a session on data munging with Excel, I wanted to see how those tasks could be
accomplished in R. Surely anything you can do in a spreadsheet should be doable in a platform designed for
heavy-duty statistical analysis!
You can download the Excel Magic PDF and sample data spreadsheet and then follow along. (The original
Excel tips come from MaryJo Webster, senior data reporter with Digital First Media.)
(New to R? You can get up and running with our Beginner's guide to R series.)
If you want to follow along, you'll first want to load data from her sample spreadsheet into R. There are
several ways to do this, including:
You can save each sheet to CSV and load it with R's read.csv() function.
You can copy data in a spreadsheet and read.table() your clipboard (the techniques are slightly different for
Windows and Mac).
Or, you can install and load the xlsx package within R and read data directly from Excel.
Note: If a package I reference is not already installed on your system, you'll need to install it first by using
R's install.packages() function. Here's how to install the xlsx package:
install.packages("xlsx", dependencies=TRUE)
Note that this package can be a little finicky in Windows due to Java issues.
You only need to install a package once on a system. However, in order to use it, you need to load it in each
session. Here's how to load the xlsx package with library():
library(xlsx)
If you downloaded the spreadsheet to run R code on that sample data, set your working R directory to
whatever directory holds the spreadsheet. Replace DIRECTORY with your actual directory (capitalization
matters):
setwd("DIRECTORY")
If you are using the RStudio IDE for R, you can create a new RStudio project in the directory with the
spreadsheet, and then automatically be switched to your working directory each time you load that project.
See more about projects in RStudio.
Finally: Let's start coding!

Dates: Extract month, day and year from each date in a column
We'll start by parsing a single example date: 4/3/04. If you load in an Excel spreadsheet with dates, your
dates may already be R date objects. If you've pulled in a CSV file, though, they may just be character
strings. If your date is just a text string, we'd first need to turn that text into a date object and store it in a
variable. The package lubridate is helpful for date parsing.
Make sure to run
install.packages("lubridate", dependencies=TRUE)
if lubridate is not already installed on your system. Then we'll load lubridate with library(lubridate) and
use lubridate's mdy() function to let R know that the date format is month/day/year and not, say, the
European day/month/year. We can then use lubridate functions such as year() and month() to parse the
date, similar to functions in Excel:
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.0.3
mydate <- mdy("4/3/04")
#get year
year(mydate)
## [1] 2004
#get month
month(mydate) #as number
## [1] 4
month(mydate, label=TRUE) #as name of month
## [1] Apr
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
#get day
day(mydate)
## [1] 3
#day of week as number
wday(mydate)
## [1] 7
#day of week as name of day
wday(mydate, label=TRUE)
## [1] Sat
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
#number of the week
week(mydate)
## [1] 14
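If you would rather not depend on lubridate, the same pieces can be pulled out of a date with base R alone. This is a minimal sketch under that assumption; the variable names are only illustrative, and it relies on as.Date() with an explicit format string plus format() codes.

```r
# Parse month/day/two-digit-year text into a base R Date object
mydate <- as.Date("4/3/04", format = "%m/%d/%y")

# Extract pieces as numbers with format() codes
yr  <- as.integer(format(mydate, "%Y"))  # four-digit year
mon <- as.integer(format(mydate, "%m"))  # month number
dy  <- as.integer(format(mydate, "%d"))  # day of the month

# as.POSIXlt() numbers weekdays 0 (Sunday) through 6 (Saturday),
# so add 1 to match the numbering that lubridate's wday() printed above
wkday <- as.POSIXlt(mydate)$wday + 1

c(year = yr, month = mon, day = dy, weekday = wkday)
```

The numeric approach avoids weekday and month names, which vary with your system's language settings.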

Calculating ages and other date arithmetic


The lone tricky thing about doing date arithmetic in R is making sure you've got your data in the correct
date format for the package and function you decide to use.
The eeptools package has an extremely handy and elegant age_calc() function. It requires R Date objects as
input, which are easy to create with base R's as.Date() function. When using as.Date(), you just need to
remember to tell R the format of your character string, such as %m/%d/%y for mm/dd/yy and %m/%d/%Y
for mm/dd/yyyy. There's a list of how to describe some common date formats at Quick-R.
In this test, we'll calculate how many days of summer there are between Memorial Day (May 26) and Labor
Day (Sept. 1) in 2014. Remember to install eeptools with install.packages("eeptools") if it's not already on
your system, then load it with library(eeptools):
library(eeptools)
## Loading required package: ggplot2
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.0.3
## Loading required namespace: car
MemorialDay <- as.Date("5/26/2014", format="%m/%d/%Y") #Create date object for Memorial Day
LaborDay <- as.Date("9/1/2014", format="%m/%d/%Y") #Create date object for Labor Day
#The difference between the two dates in units of days:
summerdays <- age_calc(MemorialDay, LaborDay, units="days")
summerdays #This variable is an object of class difftime
## Time difference of 98 days
#To see the number of days as an integer, use as.integer()
as.integer(summerdays)
## [1] 98
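For a plain count of days, base R can also subtract one Date from another directly. The sketch below assumes nothing beyond base R and reuses the two holiday dates from above; difftime() is shown as one way to ask for other units.

```r
MemorialDay <- as.Date("5/26/2014", format = "%m/%d/%Y")
LaborDay    <- as.Date("9/1/2014", format = "%m/%d/%Y")

# Subtracting two Date objects gives a difftime object measured in days
summerdays <- LaborDay - MemorialDay
as.integer(summerdays)

# difftime() lets you request a different unit, such as weeks
summerweeks <- as.numeric(difftime(LaborDay, MemorialDay, units = "weeks"))
summerweeks
```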
Calculating ages, as MaryJo did in her Excel sheet, is even easier with age_calc(), because if no second date
is given, the function defaults to the current system date. So, you don't even have to explicitly state you
want today's date to calculate someone's age as of today, as you need to do in Excel.
dob <- as.Date("2/4/1982", format="%m/%d/%Y") #Create a test date of birth date as a date object
#Find today's age as of when I wrote this script and saved the results to an html file:
age <- age_calc(dob, units='years')
#Round off the age to whole years with the floor() function
wholeyears <- floor(age)
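If eeptools is not available, whole-year ages can be approximated in base R. The helper below, age_in_years(), is a hypothetical function written for this illustration (it is not part of any package): it subtracts the birth year from the reference year, then takes one year off when the birthday has not yet happened in the reference year.

```r
# Hypothetical helper: age in completed years as of a reference date
age_in_years <- function(dob, on = Sys.Date()) {
  dob_lt <- as.POSIXlt(dob)
  on_lt  <- as.POSIXlt(on)
  years  <- on_lt$year - dob_lt$year
  # TRUE when the birthday has already occurred in the reference year
  had_birthday <- (on_lt$mon > dob_lt$mon) |
    (on_lt$mon == dob_lt$mon & on_lt$mday >= dob_lt$mday)
  years - ifelse(had_birthday, 0, 1)
}

dob <- as.Date("2/4/1982", format = "%m/%d/%Y")
age_in_years(dob, on = as.Date("2014-02-03"))  # the day before a birthday
age_in_years(dob, on = as.Date("2014-02-04"))  # on the birthday
```

Because every step is vectorized, the same function should also accept a whole column of birth dates.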
#Now let's try this with an entire column of birth dates from MaryJo's spreadsheet.
#Read in data from the ExcelTricks2014 Dates worksheet using xlsx package
library(xlsx)

## Loading required package: rJava
## Loading required package: xlsxjars
testdates <- read.xlsx("ExcelTricks2014.xlsx", sheetName="Dates")
#What does the structure of that testdates object look like?
str(testdates)
## 'data.frame': 58 obs. of 9 variables:
## $ Player.: Factor w/ 58 levels "Adrian Awasom",..: 47 33 14 2 40 49 30 55 23 36 ...
## $ Pos.   : Factor w/ 11 levels "Center ","Defensive Back ",..: 7 7 7 8 8 8 8 8 11 11 ...
## $ Status.: Factor w/ 2 levels "Active ","Out ": 1 1 1 1 1 1 1 1 1 1 ...
## $ Ht.    : Factor w/ 12 levels "5'10' ","5'11' ",..: 6 8 6 5 1 7 8 5 5 4 ...
## $ Wt.    : num 215 220 229 217 191 240 258 237 190 204 ...
## $ DOB.   : Date, format: "1985-07-02" "1986-11-14" ...
## $ DATEDIF: logi NA NA NA NA NA NA ...
## $ YEAR   : logi NA NA NA NA NA NA ...
## $ WEEKDAY: logi NA NA NA NA NA NA ...
#Excellent, the DOB column was already read in as date objects!
#We want her DATEDIF column to have the ages:
testdates$DATEDIF <- floor(age_calc(testdates$DOB., units='years'))
#While we're at it, let's add year and weekday columns
testdates$YEAR <- year(testdates$DOB.)
testdates$WEEKDAY <- wday(testdates$DOB., label=TRUE)
#we can add week numbers per MaryJo's discussion of seeking patterns in the data
testdates$WEEKNUMs <- strftime(testdates$DOB., format="%W")
table(testdates$WEEKDAY, testdates$WEEKNUMs)
##
##          00 02 03 04 05 06 07 08 11 12 13 14 15 17 18 19 21 23 24 25 26 27
##   Sun     0  0  0  0  0  1  0  0  0  0  0  0  1  0  1  1  0  0  1  0  0  0
##   Mon     0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   Tues    1  0  1  1  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  2  0
##   Wed     0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  1  0  0  0
##   Thurs   0  0  0  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  1  0  0  0
##   Fri     0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  1
##   Sat     0  0  0  0  0  0  0  0  0  1  1  0  0  0  0  0  1  1  0  0  0  0
##
##          28 29 30 31 32 33 34 35 37 38 39 40 41 43 44 45 46 50 51 52
##   Sun     0  0  1  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0  0  0
##   Mon     0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
##   Tues    1  0  0  0  0  0  1  0  0  0  0  0  0  1  1  0  0  1  1  0
##   Wed     0  0  0  0  0  0  0  0  0  0  2  0  0  0  0  1  0  0  0  0
##   Thurs   0  0  0  1  0  0  0  0  1  1  0  0  0  0  0  1  1  0  2  1
##   Fri     0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0
##   Sat     0  1  0  0  2  1  0  0  0  0  0  1  0  0  0  0  0  0  1  0
#or just a frequency table for days of the week:
table(testdates$WEEKDAY)
##
##   Sun   Mon  Tues   Wed Thurs   Fri   Sat
##     8     3    13     7    11     6    10
#And let's save that to a new spreadsheet
write.xlsx(testdates, "ExcelToR.xlsx", sheetName = "Dates")
A final note about dates: The lubridate mdy() function used here creates an object of the class POSIXct. If you are
familiar with Unix (or some other programming languages that handle POSIX dates), you can probably
already guess what that means: POSIXct stores the date as the number of seconds since January 1, 1970.
If you try to print this object, it will show in R as a human-readable date such as 2004-04-03 UTC, but
don't be fooled: It's not actually an R Date object. So, not all R date arithmetic functions that require an
object of R class Date will work, because they're trying to use the wrong type of object.
In the lubridate version used for this article, mdy() and ymd() parse most date-like character strings into
POSIXct objects, not R Date objects. (Newer lubridate releases return Date objects unless you supply a time
zone, so check class() on your own system.) You can turn POSIXct objects into R Date objects with as.Date():
mydate <- mdy("2/28/14")
mydateAsDate <- as.Date(mydate)
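To see the difference between the two classes without relying on any particular lubridate version, here is a small base R sketch. It builds a POSIXct value directly with as.POSIXct() (my substitution for mdy(), purely for illustration) and compares what each class stores underneath.

```r
# A POSIXct date-time at midnight UTC
as_posix <- as.POSIXct("2014-02-28", tz = "UTC")
class(as_posix)        # includes "POSIXct"
as.numeric(as_posix)   # seconds since 1970-01-01 00:00:00 UTC

# Convert to a Date object
as_date <- as.Date(as_posix)
class(as_date)         # "Date"
as.numeric(as_date)    # days since 1970-01-01
```

Both print as 2014-02-28, but one counts seconds and the other counts days, which is why functions expecting a Date can stumble on a POSIXct value.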
Does it annoy you that R takes two steps (or one more complex single step) to do something that Excel
does in a single function? Well, that's the beauty of using a scripting language: If you don't want to
repeat multiple lines of code, write your own function to simplify it.
Here's one way to create a function called myDateFunc() that combines the lubridate mdy() and base R's
as.Date() into a simpler single line of code:
myDateFunc <- function(dateliketext){
  #Reminder: This requires text to be in some month-date-year format
  require(lubridate) #Load the lubridate package
  thedate <- mdy(dateliketext) #Parse the date-like string
  as.Date(thedate) #Return the result as a Date object
}
Voila! Now if we want to create a date object from February 28, 2014, we just run a single line of code
using the new function:
mynewdate <- myDateFunc("February 28, 2014")
#See what that mynewdate object looks like:
print(mynewdate)
## [1] "2014-02-28"
#Check the class of mynewdate
class(mynewdate)
## [1] "Date"
You can put that function in a separate file (mynewdate.R, for example) and then add the code
source("mynewdate.R")
to your script file. That tells the script file to run all the code in the mynewdate.R file. (This assumes
that mynewdate.R is in your working directory. If not, just include the full path to the file, such as
C:/Rscripts/mynewdate.R.)

Text functions: Search and substring extraction


R's substr() function performs the same task as Excel's LEFT and MID, using the syntax
substr(thestring, start, stop)
where start and stop are integers. So, substr("Computerworld", 1, 8) would return "Computer": It slices
the string starting at position 1 and stopping at position 8.
Searching is much more robust in R than in Excel, in part because R can use powerful regular expressions.
In addition, there are many ways to handle and process strings in R.
One demonstrated Excel task was to extract a two-letter state abbreviation from a city and state when there's
no comma separating them but you know the state is always the last two letters of the character string, using
LEFT and MID. We can use the same technique to find the last two letters of "New York NY" by finding the
length of the string with nchar() and the position one character before the end of the string with nchar() - 1, like so:
mytext <- "New York NY"
substr(mytext, nchar(mytext) - 1, nchar(mytext))
## [1] "NY"
#Get the rest of the string before the space and two-letter state abbreviation:
mytext <- "New York NY"
substr(mytext, 1, nchar(mytext)-3)
## [1] "New York"
As you probably guessed, nchar() returns the number of characters in a character string, including
spaces.
If you are familiar with regular expressions, you can also search for a more complex pattern than "last
two characters," such as "all characters except a space and the last two letters." Base R handles regular
expressions, but I find the stringr package to be more convenient for some text operations, including matching
regular expressions with str_match():
library(stringr)
#This pattern says the first group in parentheses is "everything up until a space
#and two capital letters."
#The second group in parentheses is "two capital letters."
mypattern <- "(.*?) ([A-Z]{2})"
parsed <- str_match(mytext, mypattern)

#The first column of the parsed object contains the entire match. The second column
#is the first group - that is, the match just within the first parentheses,
#which in this case is the city.
#The third column is the match within the second parentheses, in this case the state.
parsed
##      [,1]          [,2]       [,3]
## [1,] "New York NY" "New York" "NY"
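One caveat worth checking on your own data: because the pattern above is not anchored to the end of the string, an all-capitals city such as "ST PAUL MN" would, as far as I can tell, match early ("ST" followed by "PA"). A sketch of an anchored alternative in base R follows; the places vector is made-up sample data, and sub() with backreferences stands in for str_match().

```r
# Made-up sample strings, including an all-caps city name
places <- c("New York NY", "ST PAUL MN", "Salt Lake City UT")

# ^ and $ force the two capital letters to be the last thing in the string
anchored <- "^(.*) ([A-Z]{2})$"

city  <- sub(anchored, "\\1", places)  # first group: everything before the state
state <- sub(anchored, "\\2", places)  # second group: the two-letter state

data.frame(places, city, state, stringsAsFactors = FALSE)
```

The same anchored pattern can also be passed to str_match() if you prefer to stay with stringr.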
To perform this task on the sample spreadsheet, we can read in data from the CityState worksheet and then
populate the blank CITY and STATE columns:
cities <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "CityState")
parsed <- str_match(cities$CITY.STATE, mypattern)
cities$CITY <- parsed[,2]
#The second column of parsed has all the matches of the first group -
#in this case, everything before space and 2 capital letters
cities$STATE <- parsed[,3]
#Append this as a sheet to our new ExcelToR.xlsx spreadsheet
write.xlsx(cities, "ExcelToR.xlsx", sheetName = "CityState", append = TRUE)

Text functions: Search and replace


For Excel's SUBSTITUTE (replacing old text with new text), there is base R's
gsub("pattern to search for", "text to replace it with", CharacterString)
#And stringr's
str_replace_all(CharacterString, "pattern to search for", "text to replace it with")
Pick a syntax and structure you like, and off you go.
To create a new column in a dataframe named df that removes "PUBLIC SCHOOL DISTRICT" from a
SchoolDistricts column, you could run stringr's str_replace_all() function on the SchoolDistricts column:
df$SchoolDistrictsEdited <- str_replace_all(df$SchoolDistricts, "PUBLIC SCHOOL DISTRICT", "")
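For comparison, here is roughly what the base R version might look like on a tiny made-up data frame (the district names are invented for illustration). It also wraps the result in trimws(), since deleting the phrase tends to leave a trailing space behind.

```r
# Invented sample data standing in for a real SchoolDistricts column
df <- data.frame(
  SchoolDistricts = c("ANOKA PUBLIC SCHOOL DISTRICT",
                      "EDINA PUBLIC SCHOOL DISTRICT",
                      "ST. PAUL SCHOOLS"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats the search text literally instead of as a regular expression
df$SchoolDistrictsEdited <- trimws(
  gsub("PUBLIC SCHOOL DISTRICT", "", df$SchoolDistricts, fixed = TRUE)
)
df
```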

Misc text functions


For Excel's EXACT to see if two strings are identical, base R has identical().
For Excel's LEN(text) to get the length of a string, base R has nchar() and stringr has str_length().
For Excel's REPT(text, number) to repeat a text string a certain number of times, the stringr package has
str_dup().
To capitalize the first letter of each word, you can adapt an example from the existing toupper() function's
help file to write and load your own function, here called titleCase:
titleCase <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}
titleCase("hello there, world!")
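A quick sketch of those equivalents in action, using base R only. Note that titleCase() above handles one string at a time; titleCaseAll() below is a hypothetical variant of my own that works on a whole vector by using a Perl-style regular expression, and strrep() is a base R counterpart to str_dup() in recent R versions.

```r
same  <- identical("Texas", "texas")  # EXACT: comparison is case-sensitive
len   <- nchar("Computerworld")       # LEN
reps  <- strrep("ab", 3)              # REPT

# Hypothetical vectorized title-case helper:
# uppercase any lowercase letter at the start of the string or after whitespace
titleCaseAll <- function(x) {
  gsub("(^|\\s)([a-z])", "\\1\\U\\2", x, perl = TRUE)
}
titled <- titleCaseAll(c("hello there, world!", "new york"))
titled
```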

Using a wildcard search


Because R supports regular expressions, wildcard searching is a bit simpler than Excel's somewhat convoluted
=IF(ISERROR(SEARCH("Texas",B4,1)>0)=FALSE, "X","")
which adds a column marking with an X all rows where one column includes "Texas." In R, you can just use
an if-else statement such as:
ifelse(str_detect(mycolumn, "Texas"), "X", "")
However, there isn't always a need to add a column to do this, since you can easily filter a data frame by
searching for a string within a column. Here's some code to find all rows that include the phrase "WESTERN
DISTRICT" from the BasicIF tab:
#Note we need to tell R to start reading on row 5 here
#because rows 1-4 are not part of the table
df <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "BasicIF", stringsAsFactors = FALSE,
startRow=5, header=TRUE)
#Now we just want rows where the SUBDEPT includes the phrase "WESTERN DISTRICT"
justWestern <- subset(df, str_detect(SUBDEPT, "WESTERN DISTRICT"))
#Check the first 20 rows & first 5 columns
head(justWestern[,1:5], n=20)
##      LASTNAME FIRSTNAME DEPT                  SUBDEPT YRS.EXP
## 9    ANDERSON      NEIL SPPD WESTERN DISTRICT - SOUTH       3
## 10   ANDERSON     ALLEN SPPD   WESTERN DISTRICT-NORTH       5
## 15   ANDERSON     STEVE SPPD   WESTERN DISTRICT-NORTH      18
## 16   ANDERSON      ERIC SPPD         WESTERN DISTRICT      20
## 17     ARNOLD    THOMAS SPPD WESTERN DISTRICT - SOUTH      14
## 27     BAILEY      SARA SPPD   WESTERN DISTRICT-NORTH      20
## 33    BARABAS   MICHAEL SPPD   WESTERN DISTRICT-NORTH      19
## 40  BAUMHOFER       AMY SPPD WESTERN DISTRICT - SOUTH      15
## 50    BENNETT CONSTANCE SPPD         WESTERN DISTRICT       4
## 51    BENNETT     BRUCE SPPD   WESTERN DISTRICT-NORTH      13
## 55     BITNEY  TERRANCE SPPD         WESTERN DISTRICT       6
## 60    BOERGER    DARRYL SPPD WESTERN DISTRICT - SOUTH       4
## 62       BOHN       TIM SPPD WESTERN DISTRICT - SOUTH      11
## 69      BOYLE   JEFFERY SPPD   WESTERN DISTRICT-NORTH      20
## 76      BRODT      MARY SPPD WESTERN DISTRICT - SOUTH       1
## 80      BROWN   ANTHONY SPPD   WESTERN DISTRICT-NORTH      17
## 96     CARTER   MICHAEL SPPD         WESTERN DISTRICT      16
## 104    CHERRY   LYNETTE SPPD WESTERN DISTRICT - SOUTH      12
## 111 CLEVELAND      KENT SPPD   WESTERN DISTRICT-NORTH       1
## 115    CONROY   MICHAEL SPPD   WESTERN DISTRICT-NORTH      16
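The same flag-or-filter idea can be sketched without stringr by using base R's grepl(), which returns TRUE or FALSE for each element. The small staff data frame here is a stand-in: two rows borrow values from the output above, and the BAKER row is invented so there is something to filter out.

```r
staff <- data.frame(
  LASTNAME = c("ANDERSON", "BAKER", "ARNOLD"),
  SUBDEPT  = c("WESTERN DISTRICT-NORTH", "CENTRAL DISTRICT", "WESTERN DISTRICT - SOUTH"),
  stringsAsFactors = FALSE
)

# TRUE wherever SUBDEPT contains the phrase; fixed = TRUE means a literal match
is_western <- grepl("WESTERN DISTRICT", staff$SUBDEPT, fixed = TRUE)

# Option 1: add an Excel-style marker column
staff$Flag <- ifelse(is_western, "X", "")

# Option 2: skip the marker and keep only the matching rows
justWestern <- staff[is_western, ]
justWestern
```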

If statements
Excel's basic IF statement is similar to R's ifelse(): Both use the format (logical test, result if true, result if
false). This code can determine whether a home or visiting team won a game based on points scored by each,
using the sample spreadsheet's More BasicIF worksheet:
scores <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "More BasicIF")
str(scores)
## 'data.frame': 256 obs. of 8 variables:
## $ Date       : Date, format: "2003-09-04" "2003-09-07" ...
## $ WeekNum    : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Visit.Team : Factor w/ 32 levels "ARI","ATL","BAL",..: 22 1 10 14 3 13 15 18 19 26 ...
## $ Visit.Score: num 13 24 30 9 15 21 23 30 0 14 ...
## $ Home.Team  : Factor w/ 32 levels "ARI","ATL","BAL",..: 32 11 7 8 25 17 5 12 4 16 ...
## $ Home.Score : num 16 42 10 6 34 20 24 25 31 27 ...
## $ Winner     : logi NA NA NA NA NA NA ...
## $ WinTeam    : logi NA NA NA NA NA NA ...
#The team names are coming in as "factors" and not characters.
#We'll re-import the data, this time adding stringsAsFactors = FALSE
#to the function arguments:
scores <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "More BasicIF", stringsAsFactors = FALSE)
str(scores)
## 'data.frame': 256 obs. of 8 variables:
## $ Date       : Date, format: "2003-09-04" "2003-09-07" ...
## $ WeekNum    : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Visit.Team : chr "NYJ" "ARI" "DEN" "IND" ...
## $ Visit.Score: num 13 24 30 9 15 21 23 30 0 14 ...
## $ Home.Team  : chr "WAS" "DET" "CIN" "CLE" ...
## $ Home.Score : num 16 42 10 6 34 20 24 25 31 27 ...
## $ Winner     : logi NA NA NA NA NA NA ...
## $ WinTeam    : logi NA NA NA NA NA NA ...
#That's better.
#Note that if there's a space in a column name, R converts it to a period
#Use an ifelse statement to find whether the home or visiting team had more points
#and thus won the game
scores$Winner <- ifelse(scores$Home.Score > scores$Visit.Score, "Home", "Visitor")
#Find out which team had more points
scores$WinTeam <- ifelse(scores$Home.Score > scores$Visit.Score, scores$Home.Team, scores$Visit.Team)
#save to our new spreadsheet
write.xlsx(scores, "ExcelToR.xlsx", sheetName = "MoreBasicIF", append = TRUE)
As with Excel IF, R ifelse statements can be nested.
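As a sketch of what nesting looks like, the example below adds a third outcome for tied games, which the single ifelse() above would label "Visitor". The first two rows reuse scores shown in the str() output; the third row's tie is invented for illustration.

```r
games <- data.frame(
  Home.Team   = c("WAS", "DET", "CIN"),
  Home.Score  = c(16, 42, 10),
  Visit.Team  = c("NYJ", "ARI", "DEN"),
  Visit.Score = c(13, 24, 10),   # the 10-10 tie is made up
  stringsAsFactors = FALSE
)

# The inner ifelse() only decides rows where the home team did not win outright
games$Winner <- ifelse(games$Home.Score > games$Visit.Score, "Home",
                       ifelse(games$Home.Score < games$Visit.Score, "Visitor", "Tie"))
games
```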

Deal with data where column headers are rows within the data
Look at the Copy Down tab on the ExcelTricks2014.xlsx spreadsheet, and you'll see the problem: There's
a single row with the name of a team, the players on that team, the name of a second team, a list of players
on that team and so on. This interspersing of categories and values means that if you do any sorting or
aggregating of that column, you'll no longer know which player is on what team. What's needed is a way to
add a new column identifying which team each player is on.
I'm sure there's a more elegant "R way" to do this, but here I'll use a simple for loop instead. For loops are
discouraged in R, with vectorized functions preferred. However, those of us with experience in languages
where loops are common do find them a handy go-to.
#Read player data into R
players <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Copy Down",
stringsAsFactors = FALSE)
#See the structure of the data
str(players)
## 'data.frame': 451 obs. of 3 variables:
## $ Name    : chr "Arizona Cardinals" "Starks, Duane" "Stone, Michael" "Ransom, Derrick" ...
## $ Position: chr NA "DB" "DB" "DT" ...
## $ NA.     : chr NA NA "" "" ...
#Not sure what that NA. column is about, but we can get rid of it by setting it to NULL
players$NA. <- NULL
str(players)
## 'data.frame': 451 obs. of 2 variables:
## $ Name    : chr "Arizona Cardinals" "Starks, Duane" "Stone, Michael" "Ransom, Derrick" ...
## $ Position: chr NA "DB" "DB" "DT" ...
#That's better
#To see if a value is missing in R, use the is.na() function.
#Here we'll create a new column called Team.
#If there's no value in the Position column, we'll use the value of the players$Name column.
#If there is a value in the Position column,
#we'll use the value of the Team column one row higher.
for(i in 1:length(players$Name)){
players$Team[i] <- ifelse(is.na(players$Position[i]), players$Name[i], players$Team[i-1])
}
#We can delete rows with the team names by using the handy na.omit(dataframe) function;
#that will eliminate all rows in a data frame that have at least one missing value.
players <- na.omit(players)
#Here we can add the reformatted data to our new spreadsheet.
#Don't forget append=TRUE or the spreadsheet will be overwritten.
write.xlsx(players, "ExcelToR.xlsx", sheetName = "CopyDownTeams", append=TRUE)
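For readers curious about a vectorized route, here is one possible sketch in base R. It assumes, as in the sample sheet, that team rows are exactly the rows with a missing Position and that the first row is a team row. The tiny players data frame reuses a few names from the str() output plus an invented second team's player.

```r
players <- data.frame(
  Name     = c("Arizona Cardinals", "Starks, Duane", "Stone, Michael",
               "Atlanta Falcons", "Example, Player"),   # last name is invented
  Position = c(NA, "DB", "DB", NA, "QB"),
  stringsAsFactors = FALSE
)

# TRUE on the rows that hold a team name rather than a player
is_team_row <- is.na(players$Position)

# cumsum() counts how many team rows have appeared so far (1, 1, 1, 2, 2),
# which indexes into the vector of team names
players$Team <- players$Name[is_team_row][cumsum(is_team_row)]

# Drop the team-name rows, keeping only players
players <- players[!is_team_row, ]
players
```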


Functions by groups: SUMIF and COUNTIF equivalents


This is one of many areas where R shines over Excel: grouping items for any purpose, not just subtotals or
counts.
Assuming all the data is in rows 2 to 424, the team name is in column c and the salaries are in column e, the
Excel tip was to use
=sumif(c2:c424, "Dallas Mavericks", e2:e424)
to get just the Mavericks total and
=sumif(Salaries!c2:Salaries!c424, a3, Salaries!$e2 : Salaries!e$424)
to get subtotals by all teams where team names are in column a of another worksheet.
I prefer something that's not hardcoded with total row numbers and that's more easily reproducible on
slightly different data.
In R, there are numerous ways to apply functions to a data set by group. My current favorite is the relatively
new dplyr package for R because of its consistent and (to me) fairly human-readable functions.
Since salary data isn't included in the sample spreadsheet, I'm going to load a short table of top 40 salaries
from ESPN using the incredibly handy readHTMLTable() function in R's XML package. Note that I'm just
scraping the top 40 and not all the salaries to save time. Also note that it is indeed possible to scrape and
clean data from the Web using R :-).
#Load in data from table 1 at ESPN with XML package's readHTMLTable()
library(XML)
url <- 'http://espn.go.com/nba/salaries'
salaries <- readHTMLTable(url, stringsAsFactors = FALSE, which=1, header=TRUE)
#which=1 above means load the first table on the page
str(salaries)
## 'data.frame': 43 obs. of 4 variables:
## $ RK    : chr "1" "2" "3" "4" ...
## $ NAME  : chr "Kobe Bryant, SG" "Dirk Nowitzki, PF" "Amar'e Stoudemire, PF" "Joe Johnson, SG" ...
## $ TEAM  : chr "Los Angeles Lakers" "Dallas Mavericks" "New York Knicks" "Brooklyn Nets" ...
## $ SALARY: chr "$30,453,805" "$22,721,381" "$21,679,893" "$21,466,718" ...
#It's necessary to remove dollar signs and commas
#to turn SALARY character strings into integers for R
#Removes dollar sign:
salaries$SALARY <- str_replace_all(salaries$SALARY, '\\$', '')
#A handy decomma() function in the eeptools package removes commas and turns the
#numerical character strings into numbers
salaries$SALARY <- decomma(salaries$SALARY)
## Warning: NAs introduced by coercion
#Rows that don't contain numbers will appear in R as NA; we can remove those rows with na.omit()
salaries <- na.omit(salaries)
#Now that we have the data, time to sum and count top salaries by team -
#and let's add mean and median for good measure:
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.0.3
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:MASS':
##
##     select
##
## The following objects are masked from 'package:lubridate':
##
##     intersect, setdiff, union
##
## The following objects are masked from 'package:stats':
##
##     filter, lag
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

#Probably self-explanatory:
#Create a new variable salaries_grouped_by_team that uses dplyr's group_by() function
#to group the salaries data by the TEAM column
salaries_grouped_by_team <- group_by(salaries, TEAM)
#This creates new columns in a new variable, summaries_by_team, with summaries by group:
summaries_by_team <- summarise(salaries_grouped_by_team,
sums = sum(SALARY),
count = n(),
average = mean(SALARY),
median = median(SALARY))
#Finally, the arrange() function sorts by sums descending:
arrange(summaries_by_team, desc(sums))
## Source: local data frame [22 x 5]
##
##                      TEAM     sums count  average   median
## 1           Brooklyn Nets 67699917     4 16924979 16899732
## 2         New York Knicks 57169384     3 19056461 21388953
## 3              Miami Heat 56808000     3 18936000 19067500
## 4      Los Angeles Lakers 49739655     2 24869828 24869828
## 5   Oklahoma City Thunder 44876533     3 14958844 14693906
## 6   Golden State Warriors 40746632     3 13582211 13878000
## 7    Los Angeles Clippers 35109931     2 17554966 17554966
## 8         Houston Rockets 34214428     2 17107214 17107214
## 9       Memphis Grizzlies 33098856     2 16549428 16549428
## 10          Chicago Bulls 32932688     2 16466344 16466344
## 11 Minnesota Timberwolves 26793906     2 13396953 13396953
## 12      Charlotte Bobcats 26700000     2 13350000 13350000
## 13       Dallas Mavericks 22721381     1 22721381 22721381
## 14        Toronto Raptors 17888932     1 17888932 17888932
## 15 Portland Trail Blazers 14878000     1 14878000 14878000
## 16           Phoenix Suns 14487500     1 14487500 14487500
## 17         Indiana Pacers 14283844     1 14283844 14283844
## 18   New Orleans Pelicans 14283844     1 14283844 14283844
## 19    Cleveland Cavaliers 14275000     1 14275000 14275000
## 20        Detroit Pistons 13500000     1 13500000 13500000
## 21     Washington Wizards 13000000     1 13000000 13000000
## 22      San Antonio Spurs 12500000     1 12500000 12500000
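If dplyr is not installed, a rough base R equivalent of SUMIF and COUNTIF across every group can be sketched with split() and sapply(). The data below is a cut-down stand-in: the first four rows echo the scraped values shown earlier, while the last two rows are invented round numbers so that some teams have more than one entry.

```r
salaries <- data.frame(
  TEAM   = c("Los Angeles Lakers", "Dallas Mavericks", "New York Knicks",
             "Brooklyn Nets", "Brooklyn Nets", "New York Knicks"),
  SALARY = c(30453805, 22721381, 21679893, 21466718,
             10000000, 5000000),   # last two values are made up
  stringsAsFactors = FALSE
)

# split() breaks SALARY into one vector per team; sapply() summarizes each one
by_team <- split(salaries$SALARY, salaries$TEAM)
sums    <- sapply(by_team, sum)     # SUMIF for every team at once
counts  <- sapply(by_team, length)  # COUNTIF for every team at once

sums["Dallas Mavericks"]            # just one team's total
sort(sums, decreasing = TRUE)       # all totals, largest first
```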

#If we want just the Dallas Mavericks and Miami Heat data,
#pick one from several syntax options that you like best:
mydata <- subset(summaries_by_team, TEAM=="Dallas Mavericks" | TEAM=="Miami Heat")
#Or
mydata <- summaries_by_team[summaries_by_team$TEAM=="Dallas Mavericks" |
summaries_by_team$TEAM=="Miami Heat",]
#Or dplyr's filter()
mydata <- filter(summaries_by_team, TEAM=="Dallas Mavericks" |
TEAM=="Miami Heat")
mydata
## Source: local data frame [2 x 5]
##
##               TEAM     sums count  average   median
## 1 Dallas Mavericks 22721381     1 22721381 22721381
## 2       Miami Heat 56808000     3 18936000 19067500
#Likewise you can easily count how many teams have at least 3 players in this list
subset(summaries_by_team, count >= 3)
## Source: local data frame [5 x 5]
##
##                     TEAM     sums count  average   median
## 1          Brooklyn Nets 67699917     4 16924979 16899732
## 7  Golden State Warriors 40746632     3 13582211 13878000
## 13            Miami Heat 56808000     3 18936000 19067500
## 16       New York Knicks 57169384     3 19056461 21388953
## 17 Oklahoma City Thunder 44876533     3 14958844 14693906

R also has round() and rank() functions.
rank() gives the numerical rank by whatever column you want, while a stackexchange thread suggested this
easy function for percentile rank:
perc.rank <- function(x) trunc(rank(x))/length(x)
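A brief illustration of how these behave on a made-up vector of four values (the numbers are arbitrary and chosen to have no ties):

```r
perc.rank <- function(x) trunc(rank(x))/length(x)

values <- c(50, 20, 40, 10)    # arbitrary sample numbers

ranks    <- rank(values)       # 1 = smallest value
pct      <- perc.rank(values)  # rank as a share of the number of items
rounded  <- round(3.14159, 2)  # round to two decimal places

ranks
pct
```

Note that rank() puts the smallest value first; to rank the largest value as 1, one option is rank(-values).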

Lookup tables
I confess: I have indeed used combinations of VLOOKUP, INDEX and MATCH in Excel to look up the value
of a key on one worksheet to insert a related value in another. However, in general I'm not a fan of trying to
use Excel as a relational database unless there's a good reason for keeping my data in Excel (such as I'm
sharing a spreadsheet with colleagues who don't use MySQL or R).
With several different robust lookup options, R is a much better tool than Excel for using lookup tables. One
choice: You can run SQL commands on a data frame with the sqldf package, much like running SQL queries
on a relational database.
Another option: The data.table package, which comes highly recommended for its speed with large data sets,
creates index keys for data frames and offers many join options.
Finally, there are several R functions that offer SQL-like joins, such as dplyr's inner join and left join options
and base R's merge() function. (Some options require the common column in each table to have the same
name.) You can read more about all these options in this stackoverflow thread.
In the Excel Tricks example, there is a Lookups table with a fipscty column that holds a numerical code for
each county. Webster wants to add the county name to this worksheet (a separate table, Lookup2, has a list of all
the codes and county names).
I'll use dplyr's left_join() to accomplish this task (several other techniques will work well too). Why a left
join? That's a SQL database term which means join two tables by one or more common columns, keeping
all the rows in the left table (here, left means the first one mentioned in the join statement) and adding
whatever matches there are from the right table.
Here's the code:
#Read in data from spreadsheet
Lookups <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Lookups", startRow = 2,
stringsAsFactors = FALSE)
Lookup2 <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Lookup2", startRow = 2,
stringsAsFactors = FALSE)
#I am going to rename the first column in the lookup table to match the name in the Lookups table
names(Lookup2)[1] <- "fipscty"
#One line of code adds the county name from Lookup2 to the Lookups table
Lookups <- left_join(Lookups, Lookup2, by="fipscty")
#Check our results
head(Lookups)
##   fipstate fipscty Tot.Employ An.Payroll Num.Estab. n1_4 n5_9 n10_19
## 1       27     085      16223     522051        987  495  231    126
## 2       27     135       7721     215072        414  240   84     43
## 3       27     129       5517     133031        556  337   99     76
## 4       27     127       4742     113020        572  348  115     60
## 5       27     125        835      18330        109   74   17      8
## 6       27     143       3050      71132        399  258   76     36
##   n20_49 n50_99 n100_249 n250_499 n500_999 n1000 COUNTY.NAME   County
## 1     86     30       13        4        0     2          NA   McLeod
## 2     34      9        1        1        0     2          NA   Roseau
## 3     29      8        5        1        1     0          NA Renville
## 4     32     10        7        0        0     0          NA  Redwood
## 5      7      2        1        0        0     0          NA Red Lake
## 6     20      5        3        1        0     0          NA   Sibley
#Get rid of the blank COUNTY.NAME column
Lookups$COUNTY.NAME <- NULL
Note that left_join will work regardless of where the columns are located within the tables, unlike VLOOKUP
in Excel.
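For comparison, here is a sketch of the same kind of left join using base R's merge(), with all.x = TRUE playing the role of "keep every row of the left table." The two small data frames are stand-ins: the first two codes and county names come from the output above, and the 999 row is invented to show what happens when there is no match.

```r
employers <- data.frame(
  fipscty    = c("085", "135", "999"),   # "999" is a made-up code with no match
  Tot.Employ = c(16223, 7721, 100),
  stringsAsFactors = FALSE
)

# The lookup table's columns are deliberately in a different order
counties <- data.frame(
  County  = c("Roseau", "McLeod"),
  fipscty = c("135", "085"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps unmatched rows from the first table, filling County with NA
joined <- merge(employers, counties, by = "fipscty", all.x = TRUE)
joined
```

Keep in mind that merge() sorts the result by the key column by default, so row order may differ from the original table.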

Copying down a date versus date sequence


In Excel, if you click and drag a date down a column, Excel will assume you want to increment the date by 1
each row. So you need a special technique to copy the same date down a column. In R, if you want a column
to all be one date, you just assign it, such as:
df$mycolumn <- as.Date("2014-03-21")
With the code above, every row in the df dataframe will have the date value 2014-03-21 in the mycolumn
column.
But what if you want the default Excel behavior in R: adding one to each day in a column? Use the seq()
function. For instance, to get a date sequence of 15 days incremented by 1 day:
seq(as.Date("2014-03-21"), by="day", length.out=15 )
## [1] "2014-03-21" "2014-03-22" "2014-03-23" "2014-03-24" "2014-03-25"
## [6] "2014-03-26" "2014-03-27" "2014-03-28" "2014-03-29" "2014-03-30"
## [11] "2014-03-31" "2014-04-01" "2014-04-02" "2014-04-03" "2014-04-04"
If you want to do this for a data frame column, you have to tell seq() how many items your column needs.
You can do this by changing the hard-coded number 15 from the example above to the number of rows in
your data frame using the nrow() function:
df$mycolumn <- seq(as.Date("2014-03-21"), by="day", length.out=nrow(df) )
You can create date sequences by week, month, quarters and years as well.
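As a quick sketch of those other increments, the by argument of seq() accepts strings such as "week", "month" and "quarter" when it is given a Date:

```r
start <- as.Date("2014-03-21")

weekly    <- seq(start, by = "week", length.out = 3)
monthly   <- seq(start, by = "month", length.out = 3)
quarterly <- seq(start, by = "quarter", length.out = 3)

weekly
monthly
quarterly
```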

Using column names


In Excel, you need to explicitly create names in a spreadsheet in order to use column names in formulas. In
R, you can use either the column name or its numerical index position.
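A minimal sketch of both styles, using a made-up two-column data frame (the names team and salary are only for illustration):

```r
df <- data.frame(team = c("A", "B"), salary = c(100, 250),
                 stringsAsFactors = FALSE)

by_name  <- df$salary        # column by name with $
by_name2 <- df[["salary"]]   # column by name with [[ ]]
by_index <- df[, 2]          # column by numerical position

# Column names also work directly in calculations
df$doubled <- df$salary * 2
df
```

Referring to columns by name is usually the safer habit, since index positions change when columns are added or reordered.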

Reshaping data
The sample spreadsheet features an example of Affordable Health Care premium data where each plan's row
had age group data across multiple columns. The desired format was to have one plan price per age group
per row, not many age groups in a row. This means the data needs to be reshaped. In R lingo, we want to
reshape the data frame from wide to long.
Webster demonstrated a very useful free add-in for Excel from Tableau to perform this kind of reshaping. To
use the Tableau reshaping add-in for Excel, all the columns you want to be moved down from being column
headers must be on the right side of your spreadsheet; and all the columns you want to keep as column
headers must be on the left. In addition, you need to manually open the sheet and click on the correct cell,
which is fine if you're working on a one-time project, but less ideal if this is data you process frequently (or if
you want others to be able to easily reproduce and check your work).
With an R script, the columns can be in any order and a script that's written once can be run from a batch
file.
Please see my detailed explanation of Reshaping: Wide to long (and back) in R for a full run-through of
this type of reshaping. But in brief, you want to use the reshape2 package and tell it which column headers
you want to move down so they're no longer separate columns. In other words, if a data frame had column
headers for "young," "middle age" and "old" with a price for each but you wanted only one price per row,
you'd want to move those three column headers into one new variable column, perhaps called something like
"age group."
To go from wide to long you use reshape2's melt() function and tell melt either which columns you want to
move into a new variable column or which columns you want to stay as ID variables and not move. In this
sample data, there are far fewer ID variables that don't need to move than there are column variables that do
need to move, so I'll specify the id variables.
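To illustrate the two options, here is a small sketch on a made-up data frame (the wide data frame and its values are hypothetical, not taken from the sample spreadsheet). It melts the same data once by naming the ID column and once by naming the columns to move.

```r
library(reshape2)

# Hypothetical wide data: one row per plan, one price column per age group
wide <- data.frame(
  plan   = c("Plan A", "Plan B"),
  young  = c(100, 120),
  middle = c(150, 170),
  old    = c(200, 230)
)

# Option 1: name the column(s) that stay put; everything else moves
long_by_id <- melt(wide, id.vars = "plan")

# Option 2: name the columns that move; everything else stays put
long_by_measure <- melt(wide, measure.vars = c("young", "middle", "old"))

print(long_by_id)
```

Both calls should produce the same long data frame with six rows. By default the new columns are called variable and value.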
In addition, we have the option of naming what we want the variable column and value column to be called,
which I'll do below:
library(reshape2)
#read.xlsx() comes from the xlsx package
widedata <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Reshaper", header = TRUE)
#id.vars are the columns to keep as column headers.
#There's then no need to identify all the age group column headers that are moving
#from being column headers to being part of a new agegroup column.
#premium is the value column.
#Note: arguments are named with = inside a function call, not <-
reshaped <- melt(widedata,
                 id.vars = c("Company", "PlanName", "Metal", "RatingArea", "RateAreatxt"),
                 variable.name = "agegroup", value.name = "premium")
#Check results
head(reshaped)
##      Company           PlanName  Metal RatingArea RateAreatxt agegroup
## 1 All Savers Lowest cost Silver Silver          3   Colorado3   X0.20.
## 2 All Savers Lowest cost Silver Silver          7   Colorado7   X0.20.
## 3 All Savers Lowest cost Silver Silver          8   Colorado8   X0.20.
## 4 All Savers Lowest cost Silver Silver          9   Colorado9   X0.20.
## 5 All Savers Lowest cost Silver Silver         10  Colorado10   X0.20.
## 6      Cigna Lowest cost Silver Silver          3   Colorado3   X0.20.
##   premium
## 1   189.1
## 2   186.5
## 3   196.8
## 4   245.8
## 5   242.7
## 6   158.1

#There's an X in front of all the age groups because R column names can't start with a number
#and those columns in the spreadsheet all started with numbers.
#In addition, the - in 0-20 was turned into a period because - is not a legal character
#for an R data frame column name.
#If that bothers us, we can use some search-and-replace strategies we learned above
#to remove X from the age groups and return the - to 0-20
#(str_replace_all() is from the stringr package):
reshaped$agegroup <- str_replace_all(reshaped$agegroup, "X", "")
reshaped$agegroup <- str_replace_all(reshaped$agegroup, "0.20.", "0-20")
#We can see how many unique values of reshaped$agegroup there are with unique()
unique(reshaped$agegroup)
##  [1] "0-20"          "21"            "22"            "23"
##  [5] "24"            "25"            "26"            "27"
##  [9] "28"            "29"            "30"            "31"
## [13] "32"            "33"            "34"            "35"
## [17] "36"            "37"            "38"            "39"
## [21] "40"            "41"            "42"            "43"
## [25] "44"            "45"            "46"            "47"
## [29] "48"            "49"            "50"            "51"
## [33] "52"            "53"            "54"            "55"
## [37] "56"            "57"            "58"            "59"
## [41] "60"            "61"            "62"            "63"
## [45] "64.and.other."

#I'll change "64.and.other." to "64+"
reshaped$agegroup <- str_replace_all(reshaped$agegroup, "64.and.other.", "64+")
#Check unique values again
unique(reshaped$agegroup)
##  [1] "0-20" "21"   "22"   "23"   "24"   "25"   "26"   "27"   "28"   "29"
## [11] "30"   "31"   "32"   "33"   "34"   "35"   "36"   "37"   "38"   "39"
## [21] "40"   "41"   "42"   "43"   "44"   "45"   "46"   "47"   "48"   "49"
## [31] "50"   "51"   "52"   "53"   "54"   "55"   "56"   "57"   "58"   "59"
## [41] "60"   "61"   "62"   "63"   "64+"

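One caveat about the search-and-replace calls above: stringr treats the pattern as a regular expression by default, so each period matches any single character, not just a literal period. That happens to work on this data, but a literal match is safer. The sketch below, on made-up strings, shows the difference using stringr's fixed() helper.

```r
library(stringr)

# Hypothetical labels: the second one contains no periods at all
labels <- c("0.20.", "0x20y")

# As a regular expression, "." matches any character, so both strings match
regex_result <- str_replace_all(labels, "0.20.", "0-20")

# With fixed(), the pattern is matched literally, so only the first string changes
fixed_result <- str_replace_all(labels, fixed("0.20."), "0-20")

print(regex_result)
print(fixed_result)
```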
For lots more on using R, see The Beginner's Guide to R and 4 Data Wrangling Tasks for Advanced Beginners.
Sharon Machlis is online managing editor at Computerworld. You can follow her on Twitter at sharon000,
on Google or by subscribing to her RSS feeds: articles and blogs.
