1/21/13

Representing data in R

Jeffrey Leek, Assistant Professor of Biostatistics, Johns Hopkins Bloomberg School of Public Health

file://localhost/Users/jtleek/Dropbox/Jeff/teaching/2013/coursera/week1/005representingDataR/index.html#1


Important data types in R


Classes: Character, Numeric, Integer, Logical
Objects: Vectors, Matrices, Data frames, Lists, Factors, Missing values
Operations: Subsetting, Logical subsetting
For more information: Data Types


Character
firstName = "jeff"
class(firstName)

## [1] "character"

firstName

## [1] "jeff"


Numeric
heightCM = 188.2
class(heightCM)

## [1] "numeric"

heightCM

## [1] 188.2


Integer
numberSons = 1L
class(numberSons)

## [1] "integer"

numberSons

## [1] 1
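
The L suffix is what makes the value an integer. As a small sketch (numberDaughters is a hypothetical variable, not from the slides), the same number without the suffix is stored as numeric, and as.integer() converts it:

```r
# Without the L suffix, a whole number is stored as numeric (double).
numberDaughters = 1                  # hypothetical variable for illustration
class(numberDaughters)               # "numeric"

# The L suffix, or as.integer(), gives an integer.
numberSons = 1L
class(numberSons)                    # "integer"
class(as.integer(numberDaughters))   # "integer"

# The two values still compare as equal.
numberSons == numberDaughters        # TRUE
```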


Logical
teachingCoursera = TRUE
class(teachingCoursera)

## [1] "logical"

teachingCoursera

## [1] TRUE


Vectors
A set of values with the same class
heights = c(188.2, 181.3, 193.4)
heights

## [1] 188.2 181.3 193.4

firstNames = c("jeff", "roger", "andrew", "brian")
firstNames

## [1] "jeff"   "roger"  "andrew" "brian"

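Because every element of a vector must share one class, c() converts mixed inputs to a common class. A minimal sketch of that behaviour (the vectors mixed and numbers are made up for illustration):

```r
# Mixing numbers, strings and logicals: everything becomes character.
mixed = c(188.2, "jeff", TRUE)   # hypothetical example vector
class(mixed)                     # "character"
mixed                            # "188.2" "jeff" "TRUE"

# Mixing numbers and logicals: TRUE becomes 1, the result is numeric.
numbers = c(1.5, TRUE)
class(numbers)                   # "numeric"
numbers                          # 1.5 1.0
```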

Lists
A vector of values of possibly different classes
vector1 = c(188.2, 181.3, 193.4)
vector2 = c("jeff", "roger", "andrew", "brian")
myList = list(heights = vector1, firstNames = vector2)
myList

## $heights
## [1] 188.2 181.3 193.4
##
## $firstNames
## [1] "jeff"   "roger"  "andrew" "brian"
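
Elements of a list can be pulled out by name. The sketch below rebuilds the same list and shows the difference between $ or [[ ]], which return the element itself, and single brackets, which return a shorter list:

```r
vector1 = c(188.2, 181.3, 193.4)
vector2 = c("jeff", "roger", "andrew", "brian")
myList = list(heights = vector1, firstNames = vector2)

myList$heights             # the numeric vector itself
myList[["firstNames"]]     # the character vector itself
class(myList["heights"])   # "list": single brackets keep the list wrapper

# Each element keeps its own class and length.
sapply(myList, class)      # heights: "numeric", firstNames: "character"
sapply(myList, length)     # heights: 3, firstNames: 4
```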


Matrices
Vectors with multiple dimensions
myMatrix = matrix(c(1, 2, 3, 4), byrow = TRUE, nrow = 2)
myMatrix

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
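
The byrow argument controls the order in which the values fill the matrix. As a short sketch (byRows and byCols are names introduced here), leaving it at its default of FALSE fills column by column instead:

```r
byRows = matrix(c(1, 2, 3, 4), byrow = TRUE, nrow = 2)
byCols = matrix(c(1, 2, 3, 4), nrow = 2)   # default: byrow = FALSE

dim(byRows)      # 2 2
byRows[1, 2]     # 2  (row 1, column 2)
byCols[1, 2]     # 3  (values were filled down the columns)

# One matrix is the transpose of the other.
identical(byCols, t(byRows))   # TRUE
```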


Data frames
Multiple vectors of possibly different classes, of the same length
vector1 = c(188.2, 181.3, 193.4)
vector2 = c("jeff", "roger", "andrew", "brian")
myDataFrame = data.frame(heights = vector1, firstNames = vector2)

## Error: arguments imply differing number of rows: 3, 4

myDataFrame

## Error: object 'myDataFrame' not found


Data frames
Once both vectors have four elements, the data frame can be created:
vector1 = c(188.2, 181.3, 193.4, 192.3)
vector2 = c("jeff", "roger", "andrew", "brian")
myDataFrame = data.frame(heights = vector1, firstNames = vector2)
myDataFrame

##   heights firstNames
## 1   188.2       jeff
## 2   181.3      roger
## 3   193.4     andrew
## 4   192.3      brian
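
A few base R functions report the shape of a data frame without printing all of it. A minimal sketch using the same two vectors:

```r
vector1 = c(188.2, 181.3, 193.4, 192.3)
vector2 = c("jeff", "roger", "andrew", "brian")
myDataFrame = data.frame(heights = vector1, firstNames = vector2)

dim(myDataFrame)             # 4 2: four rows, two columns
nrow(myDataFrame)            # 4
names(myDataFrame)           # "heights" "firstNames"
class(myDataFrame$heights)   # "numeric"
```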


Factors
Qualitative variables that can be included in models
smoker = c("yes", "no", "yes", "yes")
smokerFactor = as.factor(smoker)
smokerFactor

## [1] yes no  yes yes
## Levels: no yes
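
A factor stores integer codes plus a set of levels, which as.factor() sorts alphabetically. The sketch below inspects both and, as one possible way to choose the order yourself, uses factor() with an explicit levels argument (smokerYesFirst is a name introduced here):

```r
smoker = c("yes", "no", "yes", "yes")
smokerFactor = as.factor(smoker)

levels(smokerFactor)       # "no" "yes"
as.integer(smokerFactor)   # 2 1 2 2: position of each value in the levels
table(smokerFactor)        # counts per level: no = 1, yes = 3

# Set the level order explicitly instead of alphabetically.
smokerYesFirst = factor(smoker, levels = c("yes", "no"))
levels(smokerYesFirst)     # "yes" "no"
```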


Missing values
In R they are usually coded NA
vector1 = c(188.2, 181.3, 193.4, NA)
vector1

## [1] 188.2 181.3 193.4    NA

is.na(vector1)

## [1] FALSE FALSE FALSE TRUE
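
Missing values propagate through most calculations, so they usually have to be handled explicitly. A short sketch of common ways to do that with the same vector (observed is a name introduced here):

```r
vector1 = c(188.2, 181.3, 193.4, NA)

mean(vector1)                  # NA: the missing value propagates
mean(vector1, na.rm = TRUE)    # mean of the three observed values
sum(is.na(vector1))            # 1: number of missing values

# Keep only the observed values.
observed = vector1[!is.na(vector1)]
observed                       # 188.2 181.3 193.4

# Comparing with == does not detect NA; use is.na() instead.
NA == NA                       # NA
```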


Subsetting
vector1 = c(188.2, 181.3, 193.4, 192.3)
vector2 = c("jeff", "roger", "andrew", "brian")
myDataFrame = data.frame(heights = vector1, firstNames = vector2)
vector1[1]

## [1] 188.2

vector1[c(1, 2, 4)]

## [1] 188.2 181.3 192.3
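
Beyond a vector of positions, square brackets also accept ranges and negative indices. A minimal sketch with the same vector:

```r
vector1 = c(188.2, 181.3, 193.4, 192.3)

vector1[2:3]   # 181.3 193.4: a range of positions
vector1[-1]    # 181.3 193.4 192.3: everything except the first element
vector1[5]     # NA: indexing past the end gives NA rather than an error
```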


Subsetting
myDataFrame[1, 1:2]

##   heights firstNames
## 1   188.2       jeff

myDataFrame$firstNames

## [1] jeff   roger  andrew brian
## Levels: andrew brian jeff roger
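
The Levels line in this output reflects R versions before 4.0.0, where data.frame() converted strings to factors by default. From R 4.0.0 the default is stringsAsFactors = FALSE, so the same column prints as a character vector. A small sketch that sets the argument explicitly (asFactor and asCharacter are names introduced here):

```r
vector1 = c(188.2, 181.3, 193.4, 192.3)
vector2 = c("jeff", "roger", "andrew", "brian")

# Strings converted to a factor (the pre-4.0.0 default).
asFactor = data.frame(heights = vector1, firstNames = vector2,
                      stringsAsFactors = TRUE)
class(asFactor$firstNames)      # "factor"

# Strings kept as character (the default from R 4.0.0).
asCharacter = data.frame(heights = vector1, firstNames = vector2,
                         stringsAsFactors = FALSE)
class(asCharacter$firstNames)   # "character"
```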


Logical subsetting
Rows are selected with a logical vector computed from the data frame's own columns.
myDataFrame[myDataFrame$firstNames == "jeff", ]

##   heights firstNames
## 1   188.2       jeff

myDataFrame[myDataFrame$heights < 190, ]

##   heights firstNames
## 1   188.2       jeff
## 2   181.3      roger
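
A logical subset is driven by a vector of TRUE/FALSE values, one per row. Inside the square brackets a bare name such as heights is looked up in the workspace rather than in the data frame, so it is safer to refer to the column as myDataFrame$heights. The sketch below shows the logical vector itself, a combined condition, and subset(), which does evaluate column names inside the data frame (isUnder190 is a name introduced here):

```r
vector1 = c(188.2, 181.3, 193.4, 192.3)
vector2 = c("jeff", "roger", "andrew", "brian")
myDataFrame = data.frame(heights = vector1, firstNames = vector2)

# The condition is a logical vector with one entry per row.
isUnder190 = myDataFrame$heights < 190
isUnder190                     # TRUE TRUE FALSE FALSE
myDataFrame[isUnder190, ]      # rows 1 and 2

# Conditions can be combined with & (and) or | (or).
myDataFrame[myDataFrame$heights < 190 & myDataFrame$firstNames != "jeff", ]

# subset() looks up column names in the data frame itself.
subset(myDataFrame, heights < 190)
```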


Variable naming conventions


Variable names should be short but descriptive. Here are some common styles:
Camel caps
myHeightCM = 188

Underscore
my_height_cm = 188

Dot separated
my.height.cm = 188


Style guides
http://4dpiecharts.com/r-code-style-guide/
http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html
http://wiki.fhcrc.org/bioc/Coding_Standards
