Dates: Extract month, day and year from each date in a column
We'll start by parsing a single example date: 4/3/04. If you load in an Excel spreadsheet with dates, your
dates may already be R date objects. If you've pulled in a CSV file, though, they may just be character
strings. If your date is just a text string, first we'd need to turn that text into a date object and store it in a
variable. The package lubridate is helpful for date parsing.
Make sure to run
install.packages("lubridate", dependencies=TRUE)
if lubridate is not already installed on your system. Then we'll load lubridate with library(lubridate) and
use lubridate's mdy() function to let R know that the date format is month/day/year and not, say, the
European day/month/year. We can then use lubridate functions such as year() and month() to parse the
date, similar to functions in Excel:
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.0.3
mydate <- mdy("4/3/04")
#get year
year(mydate)
## [1] 2004
#get month
month(mydate) #as number
## [1] 4
month(mydate, label=TRUE) #as name of month
## [1] Apr
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
#get day
day(mydate)
## [1] 3
#day of week as number
wday(mydate)
## [1] 7
#day of week as name of day
wday(mydate, label=TRUE)
## [1] Sat
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
#number of the week
week(mydate)
## [1] 14
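All of these lubridate functions are vectorized, so they work on an entire column at once, not just one date at a time. A minimal sketch with a small made-up data frame (the data frame and column names here are hypothetical):

```r
library(lubridate)
#A small made-up data frame with dates stored as m/d/y text
mydf <- data.frame(DATE = c("4/3/04", "12/25/04", "7/4/05"),
                   stringsAsFactors = FALSE)
mydf$DATE  <- mdy(mydf$DATE)                 #parse the whole column to date objects
mydf$YEAR  <- year(mydf$DATE)                #2004 2004 2005
mydf$MONTH <- month(mydf$DATE, label = TRUE) #Apr Dec Jul
mydf$DAY   <- day(mydf$DATE)                 #3 25 4
```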
Applied across a full column of dates, the wday() and week() results can be cross-tabulated with R's table() function — for example, counts by day of week and week number, and totals by weekday:
##        00 02 03 04 05 06 07 08 11 12 13 14 15 17 18 19 21 23 24 25 26 27
## Sun     0  0  0  0  0  1  0  0  0  0  0  0  1  0  1  1  0  0  1  0  0  0
## Mon     0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## Tues    1  0  1  1  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  2  0
## Wed     0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  1  0  0  0
## Thurs   0  0  0  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  1  0  0  0
## Fri     0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  1
## Sat     0  0  0  0  0  0  0  0  0  1  1  0  0  0  0  0  1  1  0  0  0  0
##
##        28 29 30 31 32 33 34 35 37 38 39 40 41 43 44 45 46 50 51 52
## Sun     0  0  1  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0  0  0
## Mon     0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
## Tues    1  0  0  0  0  0  1  0  0  0  0  0  0  1  1  0  0  1  1  0
## Wed     0  0  0  0  0  0  0  0  0  0  2  0  0  0  0  1  0  0  0  0
## Thurs   0  0  0  1  0  0  0  0  1  1  0  0  0  0  0  1  1  0  2  1
## Fri     0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0
## Sat     0  1  0  0  2  1  0  0  0  0  0  1  0  0  0  0  0  0  1  0
##
##   Sun   Mon  Tues   Wed Thurs   Fri   Sat
##     8     3    13     7    11     6    10
To reuse code saved in a separate file, add the line
source("mynewdate.R")
to your script file. That tells R to run all the code in the mynewdate.R file. (This assumes that the
file is in your working directory. If not, just include the full path to the file, such as
C:/Rscripts/mynewdate.R.)
#The first column of the parsed object contains the entire match. The second column
#is the first group - that is, the match just within the first parentheses,
#which in this case is the city.
#The third column is the match within the second parentheses, in this case the state.
parsed
##      [,1]          [,2]       [,3]
## [1,] "New York NY" "New York" "NY"
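Note that mypattern was defined on an earlier page that isn't reproduced in this excerpt. A regular expression along the following lines would produce the match shown above — this reconstruction is an assumption, not necessarily the author's exact pattern:

```r
library(stringr)
#Assumed reconstruction: group 1 captures everything before the final space,
#group 2 captures the two capital letters at the end of the string
mypattern <- "(.+) ([A-Z]{2})$"
parsed <- str_match("New York NY", mypattern)
```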
To perform this task on the sample spreadsheet, we can read in data from the CityState worksheet and then
populate the blank CITY and STATE columns:
cities <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "CityState")
parsed <- str_match(cities$CITY.STATE, mypattern)
cities$CITY <- parsed[,2]
#The second column of parsed has all the matches of the first group -
#in this case, everything before a space and 2 capital letters
cities$STATE <- parsed[,3]
#Append this as a sheet to our new ExcelToR.xlsx spreadsheet
write.xlsx(cities, "ExcelToR.xlsx", sheetName = "CityState", append = TRUE)
If statements
Excel's basic IF statement is similar to R's ifelse(): Both use the format (logical test, result if true, result if
false). This code can determine whether a home or visiting team won a game based on points scored by each,
using the sample spreadsheet's More BasicIF worksheet:
scores <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "More BasicIF")
str(scores)
## 'data.frame':    256 obs. of  8 variables:
##  $ Date       : Date, format: "2003-09-04" "2003-09-07" ...
##  $ WeekNum    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Visit.Team : Factor w/ 32 levels "ARI","ATL","BAL",..: 22 1 10 14 3 13 15 18 19 26 ...
##  $ Visit.Score: num  13 24 30 9 15 21 23 30 0 14 ...
##  $ Home.Team  : Factor w/ 32 levels "ARI","ATL","BAL",..: 32 11 7 8 25 17 5 12 4 16 ...
##  $ Home.Score : num  16 42 10 6 34 20 24 25 31 27 ...
##  $ Winner     : logi  NA NA NA NA NA NA ...
##  $ WinTeam    : logi  NA NA NA NA NA NA ...
#The team names are coming in as "factors" and not characters.
#We'll re-import the data, this time adding stringsAsFactors = FALSE
#to the function arguments:
scores <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "More BasicIF", stringsAsFactors = FALSE)
str(scores)
## 'data.frame':    256 obs. of  8 variables:
##  $ Date       : Date, format: "2003-09-04" "2003-09-07" ...
##  $ WeekNum    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Visit.Team : chr  "NYJ" "ARI" "DEN" "IND" ...
##  $ Visit.Score: num  13 24 30 9 15 21 23 30 0 14 ...
##  $ Home.Team  : chr  "WAS" "DET" "CIN" "CLE" ...
##  $ Home.Score : num  16 42 10 6 34 20 24 25 31 27 ...
##  $ Winner     : logi  NA NA NA NA NA NA ...
##  $ WinTeam    : logi  NA NA NA NA NA NA ...
#That's better.
#Note that if there's a space in a column name, R converts it to a period
#Use an ifelse statement to find whether the home or visiting team had more points
#and thus won the game
scores$Winner <- ifelse(scores$Home.Score > scores$Visit.Score, "Home", "Visitor")
#Find out which team had more points
scores$WinTeam <- ifelse(scores$Home.Score > scores$Visit.Score, scores$Home.Team, scores$Visit.Team)
#save to our new spreadsheet
write.xlsx(scores, "ExcelToR.xlsx", sheetName = "MoreBasicIF", append = TRUE)
As with Excel's IF, R's ifelse() statements can be nested.
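For example, a nested ifelse() could first check for a tie before deciding between home and visitor — a sketch with made-up score vectors (note that the one-level test above would label a tied game "Visitor"):

```r
home  <- c(16, 42, 10)
visit <- c(13, 24, 10)
#The inner ifelse() only decides games that aren't ties
ifelse(home == visit, "Tie",
       ifelse(home > visit, "Home", "Visitor"))
#returns "Home" "Home" "Tie"
```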
Deal with data where column headers are rows within the data
Look at the Copy Down tab on the ExcelTricks2014.xlsx spreadsheet, and you'll see the problem: There's
a single row with the name of a team, the players on that team, the name of a second team, a list of players
on that team and so on. This interspersing of categories and values means that if you do any sorting or
aggregating of that column, you'll no longer know which player is on what team. What's needed is a way to
add a new column identifying which team each player is on.
I'm sure there's a more elegant "R way" to do this, but here I'll use a simple for loop instead. For loops are
discouraged in R, with vectorized functions preferred. However, those of us with experience in languages
where loops are common do find them a handy go-to.
#Read player data into R
players <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Copy Down",
stringsAsFactors = FALSE)
#See the structure of the data
str(players)
## 'data.frame':    451 obs. of  3 variables:
##  $ Name    : chr  "Arizona Cardinals" "Starks, Duane" "Stone, Michael" "Ransom, Derrick" ...
##  $ Position: chr  NA "DB" "DB" "DT" ...
##  $ NA.     : chr  NA NA "" "" ...
#Not sure what that NA. column is about, but we can get rid of it by setting it to NULL
players$NA. <- NULL
str(players)
## 'data.frame':    451 obs. of  2 variables:
##  $ Name    : chr  "Arizona Cardinals" "Starks, Duane" "Stone, Michael" "Ransom, Derrick" ...
##  $ Position: chr  NA "DB" "DB" "DT" ...
#That's better
#To see if a value is missing in R, use the is.na() function.
#Here we'll create a new column called Team.
#If there's no value in the Position column, we'll use the value of the players$Name column.
#If there is a value in the Position column,
#we'll use the value of the Team column one row higher.
for(i in 1:length(players$Name)){
players$Team[i] <- ifelse(is.na(players$Position[i]), players$Name[i], players$Team[i-1])
}
#We can delete rows with the team names by using the handy na.omit(dataframe) function;
#that will eliminate all rows in a data frame that have at least one missing value.
players <- na.omit(players)
#Here we can add the reformatted data to our new spreadsheet.
#Don't forget append=TRUE or the spreadsheet will be overwritten.
write.xlsx(players, "ExcelToR.xlsx", sheetName = "CopyDownTeams", append=TRUE)
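For the curious, here is one vectorized way to do the same "copy down" without a loop — a sketch, not the approach used above: because every team-name row has a missing Position, a running count of those rows indexes each player back to the most recent team name. A self-contained version with made-up data:

```r
#Small made-up version of the players data; the first row must be a team-name row
players <- data.frame(
  Name = c("Arizona Cardinals", "Starks, Duane", "Stone, Michael",
           "Atlanta Falcons", "Vick, Michael"),
  Position = c(NA, "DB", "DB", NA, "QB"),
  stringsAsFactors = FALSE)
is_team <- is.na(players$Position)      #TRUE on team-name rows
#cumsum(is_team) maps every row to the latest team-name row seen so far
players$Team <- players$Name[is_team][cumsum(is_team)]
players <- na.omit(players)             #drop the team-name rows, as before
```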
The next example works with a salaries data frame read in from an NBA salaries worksheet: 43 obs. of
4 variables — a rank ("1" "2" "3" "4" ...), the player ("Kobe Bryant, SG" "Dirk Nowitzki, PF"
"Amar'e Stoudemire, PF" "Joe Johnson, SG" ...), TEAM ("Los Angeles Lakers" "Dallas Mavericks"
"New York Knicks" "Brooklyn Nets" ...) and SALARY ("$30,453,805" "$22,721,381" "$21,679,893"
"$21,466,718" ...).
#Probably self-explanatory:
#Create a new variable salaries_grouped_by_team that uses dplyr's group_by() function
#to group the salaries data by the TEAM column
salaries_grouped_by_team <- group_by(salaries, TEAM)
#This creates new columns in a new variable, summaries_by_team, with summaries by group:
summaries_by_team <- summarise(salaries_grouped_by_team,
sums = sum(SALARY),
count = n(),
average = mean(SALARY),
median = median(SALARY))
#Finally, the arrange() function sorts by sums descending:
arrange(summaries_by_team, desc(sums))
##                    TEAM     sums count  average   median
## ...
## 16         Phoenix Suns 14487500     1 14487500 14487500
## 17       Indiana Pacers 14283844     1 14283844 14283844
## 18 New Orleans Pelicans 14283844     1 14283844 14283844
## 19  Cleveland Cavaliers 14275000     1 14275000 14275000
## 20      Detroit Pistons 13500000     1 13500000 13500000
## 21   Washington Wizards 13000000     1 13000000 13000000
## 22    San Antonio Spurs 12500000     1 12500000 12500000
#If we want just the Dallas Mavericks and Miami Heat data,
#pick whichever of several syntax options you like best:
mydata <- subset(summaries_by_team, TEAM=="Dallas Mavericks" | TEAM=="Miami Heat")
#Or
mydata <- summaries_by_team[summaries_by_team$TEAM=="Dallas Mavericks" |
summaries_by_team$TEAM=="Miami Heat",]
#Or dplyr's filter()
mydata <- filter(summaries_by_team, TEAM=="Dallas Mavericks" |
TEAM=="Miami Heat")
mydata
## Source: local data frame [2 x 5]
##
##               TEAM     sums count  average   median
## 1 Dallas Mavericks 22721381     1 22721381 22721381
## 2       Miami Heat 56808000     3 18936000 19067500
#Likewise you can easily count how many teams have at least 3 players in this list
subset(summaries_by_team, count >= 3)
Lookup tables
I confess: I have indeed used combinations of VLOOKUP, INDEX and MATCH in Excel to look up the value
of a key on one worksheet to insert a related value in another. However, in general I'm not a fan of trying to
use Excel as a relational database unless there's a good reason for keeping my data in Excel (such as I'm
sharing a spreadsheet with colleagues who don't use MySQL or R).
With several different robust lookup options, R is a much better tool than Excel for using lookup tables. One
choice: You can run SQL commands on a data frame with the sqldf package, much like running SQL queries
on a relational database.
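For instance, here is a self-contained sketch with toy data (the data frames and column names are made up for illustration):

```r
library(sqldf)
#Two small data frames standing in for a data table and its lookup table
records  <- data.frame(fipscty = c(1, 3, 5), pop = c(100, 200, 300))
counties <- data.frame(fipscty = c(1, 3, 5),
                       county  = c("Autauga", "Baldwin", "Barbour"))
#A SQL LEFT JOIN on the shared key column, run directly on the data frames
sqldf("SELECT r.*, c.county
       FROM records r LEFT JOIN counties c ON r.fipscty = c.fipscty")
```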
Another option: The data.table package, which comes highly recommended for its speed with large data sets,
creates index keys for data frames and offers many join options.
Finally, there are several R functions that offer SQL-like joins, such as dplyr's inner join and left join options
and base R's merge() function. (Some options require the common column in each table to have the same
name.) You can read more about all these options in this Stack Overflow thread.
In the Excel Tricks example, there is a Lookups table with a fipscty column that holds a numerical code for
each county. The goal is to add the county name to this worksheet (a separate table, Lookup2, has a list of all
the codes and county names).
I'll use dplyr's left_join() to accomplish this task (several other techniques would work well too). Why a left
join? That's a SQL database term meaning: join two tables by one or more common columns, keeping
all the rows in the left table (here, "left" means the first one mentioned in the join statement) and adding
whatever matches there are from the right table.
Here's the code:
#Read in data from spreadsheet
Lookups <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Lookups", startRow = 2,
stringsAsFactors = FALSE)
Lookup2 <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Lookup2", startRow = 2,
stringsAsFactors = FALSE)
#I am going to rename the first column in the lookup table to match the name in the Lookups table
names(Lookup2)[1] <- "fipscty"
#One line of code adds the county name from Lookup2 to the Lookups table
Lookups <- left_join(Lookups, Lookup2, by="fipscty")
#Check our results
head(Lookups)
## (output: the first six rows of Lookups, with the matching county name from Lookup2 now joined on)
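Base R's merge() can express the same left join — a sketch under the same assumption as above, that both tables share a fipscty column:

```r
#all.x = TRUE keeps every row of the first (left) table, like a SQL left join
Lookups <- merge(Lookups, Lookup2, by = "fipscty", all.x = TRUE)
```

One difference worth knowing: merge() sorts its result by the join column by default, while left_join() preserves the original row order.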
Reshaping data
The sample spreadsheet features an example of Affordable Care Act premium data where each plan's row
had age group data across multiple columns. The desired format was to have one plan price per age group
per row, not many age groups in a row. This means the data needs to be reshaped. In R lingo, we want to
reshape the data frame from wide to long.
Webster demonstrated a very useful free add-in for Excel from Tableau to perform this kind of reshaping. To
use the Tableau reshaping add-in for Excel, all the columns you want to be moved down from being column
headers must be on the right side of your spreadsheet, and all the columns you want to keep as column
headers must be on the left. In addition, you need to manually open the sheet and click on the correct cell
before running the add-in. That's fine if you're working on a one-time project, but less ideal if this is data
you process frequently (or if you want others to be able to easily reproduce and check your work).
With an R script, the columns can be in any order, and a script that's written once can be run from a batch
file.
Please see my detailed explanation of Reshaping: Wide to long (and back) in R for a full run-through of
this type of reshaping. But in brief, you want to use the reshape2 package and tell it which column headers
you want to move down so they're no longer separate columns. In other words, if a data frame had column
headers for young, middle age and old with a price for each but you wanted only one price per row,
you'd want to move those three column headers into one new variable column, perhaps called something like
age group.
To go from wide to long, you use reshape2's melt() function and tell melt either which columns you want to
move into a new variable column or which columns you want to stay as ID variables and not move. In this
sample data, there are far fewer ID variables that don't need to move than there are column variables that do
need to move, so I'll specify the id variables.
In addition, we have the option of naming what we want the variable column and value column to be called,
which I'll do below:
library(reshape2)
widedata <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Reshaper", header = TRUE)
#id.vars are the columns to keep as column headers.
#There's then no need to identify all the age group column headers that are moving
#from being column headers to being part of a new agegroup column.
#premium is the value column.
#Note: id.vars = (not id.vars <- ) is the correct way to name the argument;
#using <- inside the call would also create a stray global variable
reshaped <- melt(widedata, id.vars = c("Company", "PlanName", "Metal", "RatingArea",
                                       "RateAreatxt"),
                 variable.name="agegroup", value.name="premium")
#Check results
head(reshaped)
##      Company           PlanName ... premium
## 1 All Savers Lowest cost Silver ...   189.1
## 2 All Savers Lowest cost Silver ...   186.5
## 3 All Savers Lowest cost Silver ...   196.8
## 4 All Savers Lowest cost Silver ...   245.8
## 5 All Savers Lowest cost Silver ...   242.7
## 6      Cigna Lowest cost Silver ...   158.1
## (the Metal, RatingArea, RateAreatxt and agegroup columns are elided here)
#There's an X in front of all the age groups because R column names can't start with a number
#and those columns in the spreadsheet all started with numbers.
#In addition, the - in 0-20 was turned into a period because - is not a legal character
#for an R data frame column name.
#If that bothers us, we can use some search-and-replace strategies we learned above
#to remove X from the age groups and return the - to 0-20:
reshaped$agegroup <- str_replace_all(reshaped$agegroup, "X", "")
reshaped$agegroup <- str_replace_all(reshaped$agegroup, "0.20.", "0-20")
#We can see how many unique values of reshaped$agegroup there are with unique()
unique(reshaped$agegroup)
##  [1] "0-20"          "21"            "22"            "23"
##  [5] "24"            "25"            "26"            "27"
##  [9] "28"            "29"            "30"            "31"
## [13] "32"            "33"            "34"            "35"
## [17] "36"            "37"            "38"            "39"
## [21] "40"            "41"            "42"            "43"
## [25] "44"            "45"            "46"            "47"
## [29] "48"            "49"            "50"            "51"
## [33] "52"            "53"            "54"            "55"
## [37] "56"            "57"            "58"            "59"
## [41] "60"            "61"            "62"            "63"
## [45] "64.and.other."
#One more replacement cleans up the last mangled label. (The exact line used in the
#original isn't shown in this excerpt; something like this does the job:)
reshaped$agegroup <- str_replace_all(reshaped$agegroup, fixed("64.and.other."), "64+")
unique(reshaped$agegroup)
##  [1] "0-20" "21"   "22"   "23"   "24"   "25"   "26"   "27"   "28"   "29"
## [11] "30"   "31"   "32"   "33"   "34"   "35"   "36"   "37"   "38"   "39"
## [21] "40"   "41"   "42"   "43"   "44"   "45"   "46"   "47"   "48"   "49"
## [31] "50"   "51"   "52"   "53"   "54"   "55"   "56"   "57"   "58"   "59"
## [41] "60"   "61"   "62"   "63"   "64+"
For lots more on using R, see The Beginner's Guide to R and 4 Data Wrangling Tasks for Advanced Beginners.
Sharon Machlis is online managing editor at Computerworld. You can follow her on Twitter @sharon000,
on Google or by subscribing to her RSS feeds: articles and blogs.