Vous êtes sur la page 1sur 56

Set a working directory for reading form a file:

session > set working directory > choose directlry

1. Concatenation, sequences

Many tasks involving sequences of numbers. Here are some basic examples on
how to manipulate and create sequences. The function c, concatenation, is used
often in R, as are rep and seq

X=3
Y=4
c(X,Y)
[1] 3 4
The function rep denotes repeat:

print(rep(1,4))
print(rep(2,3))
c(rep(1,4), rep(2,3))
[1] 1 1 1 1
[1] 2 2 2
[1] 1 1 1 1 2 2 2

The function seq denotes sequence. There are various ways of specifying the
sequence.
seq(0,10,length=11)
[1] 0 1 2 3 4 5 6 7 8 9 10
In [10]: %%R
seq(0,10,by=2)
[1] 0 2 4 6 8 10
You can sort and order sequences
In [11]: %%R
X = c(4,6,2,9)
sort(X)
[1] 2 4 6 9
Use an ordering of X to sort a list of Y in the same order
Y = c(1,2,3,4)
o = order(X)
X[o]
Y[o]
[1] 3 1 2 4

1.

Ploting 2 variables in a dataset : airquality

1
2
3
4
5
6

Ozone Solar.R Wind Temp Month Day


41
190 7.4
67
5
1
36
118 8.0
72
5
2
12
149 12.6
74
5
3
18
313 11.5
62
5
4
NA
NA 14.3
56
5
5
28
NA 14.9
66
5
6

>Plot(ozone~wind, data = airquality)


2.

Simple Regression analysis on the file:


data= read.csv("file_predict.csv",header=TRUE)
reg <- lm(y ~ x, data)

3.

In regression analysis we use lm() function where y can be equal to any

function of x like y= x^2 etc so reg<- lm(y ~ x^2, data)

This gives you regression equation output:


Coefficients:
(Intercept)
-1.45e-15

x
5.00e-02

You can also check this:


1. plot(reg)
The plots can be made nicer by adding colors and using di_erent symbols.
See the help for function par.
plot(X, Y, pch=21, bg='red')
2. summary(reg)
3. predict(reg)

4.

http://www.dummies.com/how-to/content/how-to-predict-new-data-values-

with-r.html

5.

coplot(a ~ b | c)

values of c.

produces a number of scatterplots of a against b for given

6.

image(x, y, z, ...)
contour(x, y, z, ...)
persp(x, y, z, ...)

Plots of three variables. The image plot draws a grid of rectangles using different
colours to represent the value of z, the contour plot draws contour lines to represent
the value of z, and the persp plot draws a 3D surface.

7. To add connecting lines and shapes:

lines(X[21:40], Y[21:40], lwd=2, lty=3, col='orange')


8. abline(a, b)

abline(h=y)
abline(v=x)
abline(lm.obj)
example:
mean_jordan <- mean(data_jordan$points)
plot(data_jordan, main = "1st NBA season of Michael Jordan")
abline(h = mean_jordan)

Adds a line of slope b and intercept a to the current plot. h=y may be used to
specify y-coordinates for the heights of horizontal lines to go across a plot, and
v=x similarly for the x-coordinates for vertical lines. Also lm.obj may be list with a
coefficients component of length 2 (such as the result of model-fitting functions,)
which are taken as an intercept and slope, in that order.

9.

More information, including a full listing of the features available can obtained

from within R using the commands:


> help(plotmath)

> example(plotmath)
> demo(plotmath)

10.

x <- array(1:20, dim=c(4,5))

to Generate a 4 by 5 array.

11. Creating a function:

y <- function(x) {x^2+x+1}


y(2)

12.

plot(x, y, pch="+")

produces a scatterplot using a plus sign as the plotting character, without


changing the default plotting character for future plots.

13. Read data from a URL into a dataframe called PE (physical endurance)
PE <- read.table("http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt", header = TRUE)

# Summary statistics
describe(PE) or summary(PE)
# Scatter plots
plot(PE$age ~ PE$activeyears)
plot(PE$endurance ~ PE$activeyears)
plot(PE$endurance ~ PE$age)
14. x <- seq(-pi, pi, len=50)
15. Time-series. The function ts creates an object of class "ts" from a vector

(single time-series) or a matrix (multivariate time-series), and some options which characterize the series. The options, with default values are:

ts(data = NA, start = 1, end = numeric(0), frequency = 1, deltat = 1, ts.eps =


getOption("ts.eps"), class, names)

Eg: ts(1:10, start = 1959)


Eg: ts(1:47, frequency = 12, start = c(1959, 2))

16. Suppose we want to repeat the integers 1 through 3 twice. That's a simple

command:
c(1:3, 1:3)

17. Now suppose we want these numbers repeated six times, or maybe sixty times.

Writing a function that abstracts this operation begins to make sense. In fact that
abstraction has already been done for us:
rep(1:3, 6)

18. A global assignment can be performed with <<-

19. Package functionality: Suppose you have seen a command that you want to try,

such as fortune('dog')
You try it and get the error message: Error: could not find function "fortune"
You, of course, think that your installation of R is broken. I don't have evidence
that your installation is not broken, but more likely it is because your current
R session does not include the package where the fortune function lives. You
can try:
require(fortune)
Where upon you get the message: Error in library(package, ...) :
there is no package called 'fortune'.

The problem is that you need to install the package onto your computer. Assuming
you are connected to the internet, you can do this with the command:
install.packages('fortune')

After a bit of a preamble, you will get:


Warning message: package 'fortune' is not available

Now the problem is that we have the wrong name for the package. Capitalization
as well as spelling is important. The successful incantation is:

install.packages('fortunes')
require(fortunes)
fortune('dog')

Installing the package only needs to be done once, attaching the package with
the require function needs to be done in every R session where you want the
functionality.
The command: library() shows you a list of packages that are in your standard
location for packages.

20. If you want to do multiple tests, you don't get to abbreviate. With the x1:

> x1 == 4 | 6
OR
> x1 == (4 | 6)

21. The Apply() Function: apply function returning a vector

If you use apply with a function that returns a vector, that becomes the _rst
dimension of the result. This is likely not what you naively expect if you are
operating on rows:
> matrix(15:1, 3)
[,1] [,2] [,3] [,4] [,5]
[1,] 15 12 9 6 3
[2,] 14 11 8 5 2
[3,] 13 10 7 4 1
> apply(matrix(15:1, 3), 1, sort)
[,1] [,2] [,3]
[1,] 3 2 1

[2,] 6 5 4
[3,] 9 8 7
[4,] 12 11 10
[5,] 15 14 13
The naive expectation is really arrived at with:
t(apply(matrix(15:1, 3), 1, sort))

But note that no transpose is required if you operate on columns|the nave


expectation holds in that case.
Examples: Matrix(Row, column)
matrix(15:1, 5,3)
[,1] [,2] [,3]
[1,]
15
10
5
[2,]
14
9
4
[3,]
13
8
3
[4,]
12
7
2
[5,]
11
6
1
> matrix(15:1, 3,5)
[,1] [,2] [,3] [,4] [,5]
[1,]
15
12
9
6
3
[2,]
14
11
8
5
2
[3,]
13
10
7
4
1

22. Combining Lists: c(List1,List2)

23. File Encoding for file import: read.table("intro.dat", fileEncoding = "UTF-8")

24. Create a data frame and enter data frame by frame:

df <- data. f rame ( time=numeric (N) , temp=numeric (N) , pr e s sur


e=numeric (N) )
df [ 1 , ] <- c ( 0 , 100 , 80)
df [ 2 , ] <- c (10 , 110 , 87)

OR

m <- matrix ( nrow=N, ncol=3)


colnames (m) <- c ( "time " , "temp " , "pr e s sur e ")
m[ 1 , ] <- c ( 0 , 100 , 80)
m[ 2 , ] <- c (10 , 110 , 87)
25. Generate random numbers with fixed mean/variance:

R> x <- rnorm(100 , mean = 5 , sd = 2)


R> x <- ( x - mean( x ) ) / s q r t ( var ( x ) )
R> mean( x )
[ 1 ] 1 .385177e-16
R> var ( x )
[1]1
and now create your sample with mean 5 and sd 2:
R> x <- x*2 + 5
R> mean( x )
[1]5
R> var ( x )
[1]4
26. Extract particular columns in database: Take columns x1, x2, and x3.

datSubset <- dat [ , c ( "x1 " , "x2 " , "x3 ") ]


27. select observations for men over age 40 from a column, and sex was coded either
m or M. Use the subset function:
maleOver40 <- subset(column_name , sex %in% main_dataset ( "m" , "M") & age > 40)
Examples:
fifththmonth <- subset(airquality, airquality$Month < 6)
fifththonly <- subset(airquality$Month, airquality$Month < 6)

28. If you want to omit all rows for which one or more column is NA (missing):

x2 <- na.omit ( x )

29. If you just need to remove rows 1, 6, and 13, do:

New_data <- old_data [ -c ( 1 , 6 , 1 3 ) , ]


30. Difference betrween Order and Sort :
For a table data with column x
X
1
1
3
2
3
1
1
2
3
4
3

sort(dd$x, decreasing=T)
[1] 4 3 3 3 2 2 1 1 1 1
Hence sort output is the actual sorting of input values.

order(dd$x)
[1] 1 2 5 6 4 7 3 8 10 9
Hence the output of order function is the indexes of the values of input variable.

31. Find Mean of each and every row:


For a dataset data_set with 100 rown and 4 columns:

y <- apply(as.matrix(d4),1,mean)
x <- seq(along = y)
And now combine the 2 data frames by:
cbind(x,y)

32. Increase length of a vector: length ( v ) <- 2* l ength ( v )


33. Select every nth item in a vector : seq (n , length ( vec ) , by=n)

34. Find index of missing values:

seq ( along=Pes ) [ i s . n a ( Pes ) ]


or
which ( i s . n a ( Pes ) )
35. Find index of largest item in vector

A[ which (A==max(A, na.rm=TRUE) ) ]


36. Count number of items meeting a criterion: length (which (data_set<3))
37. Inverse of a matrix: solve(A)
38. Regression Example:

x <- rnorm(100)
e <- rnorm (100)
y <- 12 + 4*x +13 * e
mylm <- lm( y _ x )
pl o t (x , y , main = "My r e g r e s s i o n ")
a b l i n e (mylm)
39. Smooth line connecting points:

x <- 1:5
y <- c(1,3,4,2.5,2)
plot(x , y )
sp <- spline(x , y , n = 50)
lines( sp )

40. how to plot several \lines" in one graph?

x <- rnorm(10)

y1 <- rnorm(10)
y2 <- rnorm(10)
y3 <- rnorm(10)
plot(x,y1,type="l")
lines(x,y2)
lines(x,y3)

41. Code that creates a table and then calls \prop.table" to get percents on the columns:

x<- c ( 1 , 3 , 1 , 3 , 1 , 3 , 1 , 3 , 4 , 4 )
y <- c ( 2 , 4 , 1 , 4 , 2 , 4 , 1 , 4 , 2 , 4 )
hmm <- table (x , y )
hmm_out <- prop.table (hmm, 2 ) * 100

42. If you want to sum the column, there is a function \margin.table()" for that, but it is
just the same as doing a sum manually, as in:

apply(hmm, 2, sum)
43. To

get equation of a line from x and y vector coordinates:

x<-c ( 1 , 3 , 2 , 1 , 4 , 2 , 4)
y<-c ( 6, 3 , 4 , 1 , 3 , 1 , 4 )
mod1 <- lm(y~x)

44. To

predict the next values of a datadet of x and y coordinates:

x<-c ( 1 , 3 , 2 , 1 , 4 , 1 , 4 ,NA)
y<-c ( 4 , 3 , 4 , 1 , 4 , 1 , 4 , 2 )
mod1 <- lm(y~x)

testdata <- data.frame (x=c(1))


predict(mod1 , newdata = testdata)
45. To calculate predicted values for a variety of different outputs. Here is where the fun
ction \expand.grid" comes in very handy. If one specifes a list of values to be considered
for each variable, then expand.grid will create a \mix and match" data frame to represen
t all possibilities:

x <- c ( 1 , 2 , 3 , 4 )
y <- c ( 22 .1 , 33 .2 , 44 . 4 )
expand.grid (x , y )
46. Create a table from x and y values: Cbind

table_data <- cbind(x,y)

"cbind" and "rbind" functions that put data frames side by side or on top of each
other: they also work with matrices.
> cbind( c(1,2), c(3,4) )
[,1] [,2]
[1,] 1 3
[2,] 2 4
> rbind( c(1,3), c(2,4) )
[,1] [,2]
[1,] 1 3
[2,] 2 4
47. To use a tabular data in vcov() functions we need to convert table into data frame:

frame_data <- data.frame(table_data)


48. Calculate standard errors in a dataset(x,y) named table_data:

frame_data <- data.frame(table_data)


m <- lm( y~x+x+x,data=frame_data)
vcov(m)
Standard_errors <- sqrt (diag(vcov(m)))

49. LOOPS:

a.)
for (i in 1:10)
{
print(i^2)
}

b.) for (w in c('red', 'blue', 'green'))


{
print(w)
}
50. The matrix product is %*%, tensor product (aka Kronecker product) is %x%.

A <- matrix(c(1,2,3,4), nr=2, nc=2)


J <- matrix(c(1,0,2,1), nr=2, nc=2)
A
[,1] [,2]
[1,] 1 3
[2,] 2 4
J
[,1] [,2]
[1,] 1 2
[2,] 0 1
> J %x% A
[,1] [,2] [,3] [,4]
[1,] 1 3 2 6
[2,] 2 4 4 8
[3,] 0 0 1 3
51. We can create a factor that follows a certain pattern with the "gl" command.

> gl(1,4)
[1] 1 1 1 1
Levels: 1
> gl(2,4)
[1] 1 1 1 1 2 2 2 2
Levels: 1 2

> gl(2,4, labels=c(T,F))


[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
Levels: TRUE FALSE
> gl(2,1,8)
[1] 1 2 1 2 1 2 1 2
Levels: 1 2
> gl(2,1,8, labels=c(T,F))
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
Levels: TRUE FALSE

51. The "expand.grid" computes a cartesian product (and yields a data.frame).

> x <- c("A", "B", "C")


> y <- 1:2
> z <- c("a", "b")
> expand.grid(x,y,z)
52. When playing with factors, people sometimes want to turn them into numbers.
This can be ambiguous and/or dangerous.
> x <- factor(c(3,4,5,1))
> as.numeric(x) # No NOT do that
[1] 2 3 4 1
>x
[1] 3 4 5 1
Levels: 1 3 4 5
53. In R, the function rowSums() conveniently calculates the totals for each row of
a matrix. This function creates a new vector:
sum_of_rows_vector <- rowSums(my_matrix)
54.

survey_vector <- c("M", "F", "F", "M", "M")


# Encode survey_vector as a factor
factor_survey_vector <- factor(survey_vector)
# Specify the levels of 'factor_survey_vector'
levels(factor_survey_vector) <- c("Female", "Male")

factor_survey_vector
output: [1] Male Female Female Male Male
Levels: Female Male
summary(factor_survey_vector)
Output: Female Male
2 3
55. str(dataset) function : Another method that is often used to get a rapid
overview of your data is the function str(). The function str() shows you structure
of your data set. For a data frame it tells you:
The total number of observations (e.g. 32 car types)
The total number of variables (e.g. 11 car features)
A full list of the variables names (e.g. mpg, cyl )
The data type of each variable (e.g. num for car features)
The first observations: Applying the str() function will often be the first thing that
you do when receiving a new data set or data frame. It is a great way to get more in
sight in your data set before diving into the real analysis.

56. The subset( ) function: is the easiest way to select variables and observations.
In the following example, we select all rows that have a value of age greater than
or equal to 20 or age less then 10. We keep the ID and Weight columns.

newdata <- subset(mydata, age >= 20 | age < 10, select=c(ID, Weight))
57. Comparing Vectors:
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
linkedin > 15
Output: [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE

58. Types of Statistical Variables :

59. Factor Vector:


> # Create a numeric vector with the identifiers of the participants of your survey
> participants_1 <- c(2, 3, 5, 7, 11, 13, 17)
> # Check what type of values R thinks the vector consists of
> class(participants_1)
[1] "numeric"
> # Transform the numeric vector to a factor vector
> participants_2 <- factor(participants_1)
> # Check what type of values R thinks the vector consists of now
> class(participants_2)
[1] "factor"
60.

61.

62. Histogram Function: > hist(AirPassengers)


Create a histogram of the verbal_baseline variable. Set "Distribution of verbal
memory baseline scores" as the title, "score" as the x-axis label and "frequency" as
the y-axis label.
hist(verbal_baseline, main = "Distribution of verbal memory baseline
scores", xlab = "score", ylab = "frequency")
63. Print basic statistical properties of the red_wine_data data.frame:
describe(red_wine_data)
64. Negatively skewed distributions characteristically have a longer left tail while
positively skewed distributions have a longer right tail.
65.

66. The Scale() Function

The scale() function makes use of the following arguments.


x: a numeric object

center: if TRUE, the objects' column means are subtracted from the values in
those columns (ignoring NAs); if FALSE, centering is not performed

scale: if TRUE, the centered column values are divided by the column's
standard deviation (when center is also TRUE; otherwise, the root mean
square is used); if FALSE, scaling is not performed
> x <- matrix(1:9,3,3)
>x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> y <- scale(x)
>y
[,1] [,2] [,3]
[1,] -1 -1 -1
[2,] 0 0 0
[3,] 1 1 1
attr(,"scaled:center")
[1] 2 5 8
attr(,"scaled:scale")
[1] 1 1 1

67. To Generate Z scores:

Y<- scale(x, center = TRUE, scale = TRUE)


OR
Y <- ( x mean(x) ) / sd(x)

68.

Summary(dataset) gives you following information:


Min. 1st Qu. Median Mean 3rd Qu.
Max.
Example:

Solution:

That is the probability is 78%.


To find a probability of height being >180 then we need to do this:
1- pnorm(180, mean =172, sd = 10)

69.

70. Variance also known as Mean Squares.


Var() function is used to calculate it in R

Example: var(data_jordan$points)
71. Analysis of Variance(ANOVA): ANOVA is used when more than two group
means are compared, whereas a t-test can only compare two group means.
Suppose a dataset has :

# Summary statistics by all groups (8 sessions, 12 sessions, 17


sessions, 19 sessions)
describeBy(WM, WM$condition)

# Boxplot IQ versus cond


boxplot(WM$IQ ~ WM$cond, main = "Boxplot", xlab = "Group
(cond)", ylab = "IQ")

# Apply the aov function


anova.WM <- aov(WM$IQ ~ WM$cond)
# Look at the summary table of the result
summary(anova.WM)

72. Generate density plot of the F-distribution


# Create the vector x
x <- seq(from = 0, to = 2, length = 100)
x
# Simulate the F-distributions
y_1 <- df(x, 1, 1)
y_2 <- df(x, 3, 1)
y_3 <- df(x, 6, 1)
y_4 <- df(x, 3, 3)
y_5 <- df(x, 6, 3)
y_6 <- df(x, 3, 6)
y_7 <- df(x, 6, 6)
# Plot the F-distributions
plot(x, y_1, col = 1, "l")
lines(x, y_2, col = 2, "l")
lines(x, y_3, col = 3, "l")
lines(x, y_4, col = 4, "l")
lines(x, y_5, col = 5, "l")
lines(x, y_6, col = 6, "l")

lines(x, y_7, col = 7, "l")


# Add the legend
legend("topright", c("df = (1,1)", "df = (3,1)", "df = (6,1)", "df = (3,3)", "df = (6,3)"
, "df = (3,6)", "df = (6,6)"), title = "F distributions", col = c(1,2,3,4,5,6,7), lty = 1)

73. Correlation and Covariance: cor(A,B) command


R is the correlation coefficient :

Step1: calculate covariance


>#
> # Take a quick peek at both vectors
>A
[1] 1 2 3
>B
[1] 3 6 7
> # Save differences of each vector element with the mean in a new variable
> diff_A <- A - mean(A)
> diff_B <- B - mean(B)
> # Do the summation of the elements of the vectors and divide by N-1 in
order to acquire covariance
> cov <- sum(diff_A * diff_B)/(3 - 1)

Step2: calculate standard deviations:


# Square the differences that were found in the previous step
sq_diff_A <- diff_A^2
sq_diff_B <- diff_B^2
# Take the sum of the elements, divide them by N-1 and consequently take
the square root to acquire the sample standard deviations
sd_A <- sqrt(sum(sq_diff_A)/(3-1))
sd_B <- sqrt(sum(sq_diff_B)/(3-1))

Step 3: calculation of r:
# Combine all the pieces of the puzzle
correlation <- cov/(sd_A*sd_B)
correlation

OR
Use the cor(A,B) command :

cor(A,B)

74. Multiple Regression

a. Simple regression
# fs is available in your working environment

fs
# Perform the two single regressions and save them in a variable

model_years <- lm(fs$salary ~ fs$years)


model_pubs <- lm(fs$salary ~ fs$pubs)
# Plot both enhanced scatter plots in one plot matrix of 1 by 2
par(mfrow = c(1, 2))
plot(fs$salary ~ fs$years, main = "plot_years", xlab = "years", ylab = "salary")
abline(model_years)
plot(fs$salary ~ fs$pubs, main = "plot_pubs", xlab = "pubs", ylab = "salary")
abline(model_pubs)

b. R2 coefficients of regression models: These coefficients are often used in practi


ce to select the best regression model in case of competing models. TheR2 coeffici
ent of a regression model is defined as the percentage of the variation in the outco
me variable that can be explained by the predictor variables of the model. In genera
l, the R2 coefficient of a model increases when more predictor variables are added
to the model. After all, adding more predictor variables to the model tends to incre
ase the odds of explaining more variation in the outcome variable.

> # fs is available in your working environment


> # Do a single regression of salary onto years of experience and check the output
> model_1 <- lm(fs$salary ~ fs$years)
> summary(model_1)

Call:
lm(formula = fs$salary ~ fs$years)

Residuals:
Min

1Q Median

3Q Max

-82972 -9537 4305 17703 57949

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68672.5

8259.0 8.315 5.38e-13 ***

fs$years

318.4 8.448 2.78e-13 ***

2689.9

--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30220 on 98 degrees of freedom


Multiple R-squared: 0.4214,

Adjusted R-squared: 0.4155

F-statistic: 71.37 on 1 and 98 DF, p-value: 2.779e-13

> # Do a multiple regression of salary onto years of experience and numbers of pub
lications and check the output
> model_2 <- lm(fs$salary ~ fs$years + fs$pubs)
> summary(model_2)

Call:
lm(formula = fs$salary ~ fs$years + fs$pubs)

Residuals:
Min

1Q Median

3Q Max

-67835 -14589 2362 13358 69613

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58828.7

7605.9 7.735 9.79e-12 ***

fs$years

1337.4

387.1 3.455 0.000819 ***

fs$pubs

634.9

123.6 5.137 1.44e-06 ***

--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26930 on 97 degrees of freedom


Multiple R-squared: 0.5451,

Adjusted R-squared: 0.5357

F-statistic: 58.12 on 2 and 97 DF, p-value: < 2.2e-16

> # Save the R squared of both models in preliminary variables


> preliminary_model_1 <- summary(model_1)$r.squared
> preliminary_model_2 <- summary(model_2)$r.squared
> # Round them off while you save them in a vector
> r_squared <- c()
> r_squared[1] <- round(preliminary_model_1, 3)
> r_squared[2] <- round(preliminary_model_2, 3)
> # Print out the vector to see both R squared coefficients
> r_squared
[1] 0.421 0.545

75.

GGVIS Package: Every ggvis graph contains 4 essential components: da

ta, a coordinate system, marks and corresponding properties. By changing the valu
es of each of these components you can create a vast array of unique plots.

TO install ggvis: > install.packages("ggvis")


Syntax: dataset %>% ggvis(~variable1, ~variable2, other options)

Example:
# The ggvis packages is loaded into the workspace already

# Change the code below to make a graph with red points


mtcars %>% ggvis(~wt, ~mpg, fill := "red") %>% layer_points()

# Change the code below draw smooths instead of points


mtcars %>% ggvis(~wt, ~mpg) %>% layer_smooths()

# Change the code below to amke a graph containing both points and a smoothed s
ummary line
mtcars %>% ggvis(~wt, ~mpg) %>% layer_points() %>% layer_smooths()

76. GGVIS Exmple:


# Make a scatterplot of the pressure dataset
pressure %>% ggvis(~temperature, ~pressure) %>% layer_points()

# Adapt the code you wrote for the first challenge: show bars instead of points
pressure %>% ggvis(~temperature, ~pressure) %>% layer_bars()

# Adapt the code you wrote for the first challenge: show lines instead of points
pressure %>% ggvis(~temperature, ~pressure) %>% layer_lines()

# Adapt the code you wrote for the first challenge: map the fill property to the tem
perature variable
pressure %>% ggvis(~temperature, ~pressure, fill = ~temperature) %>% layer_poi
nts()

# Extend the code you wrote for the previous challenge: map the size property to th
e pressure variable
pressure %>% ggvis(~temperature, ~pressure, fill = ~temperature, size = ~pressure
) %>% layer_points()

77. Importing Data with rxImport Function


A) Introduction for Big Data Import
The first step in this problem is to declare the file paths that point to where the data
is being stored on the server.
i.

The file.path function constructs a path to a file, and in this exercise, we will
use this function to define the location of the csv file that we would like to
import. As arguments tofile.path, we pass it the directory where the data
lives, and then the basename of the file we would like to import.

ii.

We can use rxGetOption(sampleDataDir) to get the appropriate directory.


In the exercise, fill in the appropriate filename for the csv file that contains
the appropriate airline dataset. in this example, we will just use a small
subsample of approximately 2.5% of those observations. These are available
in the file 2007_subset.csv.

iii.

Next, let's import the data. The rxImport function has the following syntax:

iv.

rxImport(inData, outFile, overwrite = TRUE)

v.

The inData argument is the file we want to convert, so it should be assigned


the csv file path. The outFile argument corresponds to the imported file we
want to create, so it should be defined as the xdf file path.

vi.

If we specify overwrite as TRUE, any existing data in the output file will be
overwritten by the results from this process. You should take extra care
when setting this argument to TRUE!

vii.

Once you have run the rxImport() command, you can runlist.files() to make
sure your xdf file has been created!

# Declare the file paths for the csv and xdf files
myAirlineCsv <- file.path(rxGetOption("sampleDataDir"), "2007_subset.csv")
myAirlineXdf <- "2007_subset.xdf"

# Use rxImport to import the data into xdf format


rxImport(inData = myAirlineCsv, outFile = myAirlineXdf, overwrite = TRUE)
list.files()

B) Operations on Big Data: Functions for Summarizing Data


Use the rxGetInfo(), rxSummary(), and rxHistogram() functions to summarize the
flight data.
i.

rxGetInfo() provides information about the dataset as a whole how many


rows and variables there are, the type of compression used, etc. Additionally,
if you set the getVarInfoargument to TRUE, then it will also provide a brief
synopsis of the variables. If you specify the numRows argument, it will also
return the first numRows rows.

ii.

rxSummary() provides summaries for individual variables within a dataset.

iii.

rxHistogram() will provide a histogram of a single variable.

C) Practical Use Example:

i. First, we will use rxGetInfo() in order to extract meta-data about the dataset.
The syntax of rxGetInfo() is:

rxGetInfo(data, getVarInfo, numRows)


data corresponds to the dataset that we would like to get information for.
getVarInfo is a boolean argument that determines whether meta-data regardin
g the variables is also returned (default = FALSE).
numRows determines the number of rows of the dataset that is returned (defa
ult = 0).

ii.

Use rxGetInfo() to summarize the myAirlineXdf dataset we created in the pr


evious exercise. In this function call, obtain information on all the variables,
and return the first ten rows of data.

iii.

Next, we will use rxSummary() to summarize a few variables in the dataset.


The syntax of rxSummary() is:
rxSummary(formula, data)
formula specifies the variables that you want to extract. The variables you
want to summarize should appear on the right-hand side of the formula, and
under most circumstances, will be separated by a + symbol.
data corresponds to dataset from which you would like to extract variables.

iv.

Summarize the variables - ActualElapsedTime, AirTime,DepDelay, and Dist


ance.

v.

Use rxSummary() to summarize ActualElapsedTime,AirTime, DepDelay, an


d Distance.

vi.

Finally, we will use rxHistogram() to visualize the distribution of one of our


variables. The syntax of rxHistogram() is:
rxHistogram(formula, data, )
formula specifies the variable that you want to visualize. In the simplest case
, it will take the form of ~ variable.

data corresponds to the dataset from which you would like to extract variabl
es.
is used to represent additional variables that govern the appearance of the
figure.
Use rxHistogram() to obtain a histogram for the variableDepDelay.
Build a second histogram in which the x axis is limited to departure delays b
etween 100 and 400 minutes. In this histogram, segment the data such that
there is one segment for a minute of delay. When the histogram is plotted ha
ve only ten ticks on the x-axis.

D) Preparing Data For Analysis: Import


Let's get a little more practice with importing and preparing data for analysis. Go a
head and declare the file paths, and use therxImport() command to import the airlin
e data to an xdf file.

Instructions
The first step in this problem is to declare the file paths that point to where the data
is being stored on the server.
The file.path() function constructs a path to a file, and in this problem you must
define the big data directory,sampleDataDir, where the files are being stored.
We can get the sampleDataDir by using the rxGetOption()function.
Once we know where the files are, we can look in that directory to examine what
files exist with list.files().
Once we have found the name of the file (AirlineDemoSmall.csv), we can import
it using rxImport(). In this case, we will also use the argument colInfo in order to
specify that the character variable DayOfWeek should be interpreted as a factor,
and that its levels should have the same order as the days of the week (Monday Sunday).
Next, let's import the data. You can use the help() or args() in order to remind
yourself of the arguments to the rxImport()function.

# Declare the file paths for the csv and xdf files
myAirlineCsv <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv
")
myAirlineXdf <- "ADS.xdf"

# Use rxImport to import the data into xdf format


rxImport(inData = myAirlineCsv,
outFile = myAirlineXdf,
overwrite = TRUE,
colInfo = list(
DayOfWeek = list(
type = "factor",
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday")
)
)
)

E) Preparing Data For Analysis: Exploration

If we are interested in predicting Arrival Delay by Day of the week, then the first
thing we might want to do is to explicitly compute the average arrival delay for
each day of the week. We can do this using the rxSummary() function.
rxSummary() provides a summary of variables. In the simplest case, it provides
simple univariate summaries of individual variables. In this example, we will use it
to create summaries of a numeric variable at different levels of a second,
categorical variable.

Instructions
Use rxSummary() in order to compute the average delay for each day of the week
in the myAirlineXdf dataset.
The basic syntax is: rxSummary(formula, data)
formula - A formula indicating the variables for which you want a summary. If
there is a variable on the left hand side of this formula, then it will compute
summary statistics of that variable for each level of a categorical variable on the
right hand side. If there is no variables on the left hand side, it will compute
summaries of the variables on the right hand side of the formula across the entire
dataset.
data - The dataset in which to look for variables specified informula.
Go ahead use rxSummary() to compute the mean arrival delay for each day of the
week.
After you have viewed the average arrival delay for each day of the week, you
might also want to view the distribution of arrival delays. We can do this
using rxHistogram(). Go ahead and use rxHistogram to visualize the distribution.

# Summarize arrival delay for each day of the week.


rxSummary(formula = ArrDelay ~ DayOfWeek, data = myAirlineXdf)

## Vizualize the arrival delay histogram


rxHistogram(formula = ~ ArrDelay, data = myAirlineXdf)

F) Hint: Remember that you can see what objects are available to you in the works
pace by using ls(), and that you can get help for functions by typing ?FunctionNam
e at the console. For example, ?rxGetInfo will provide a help browser for therxGetI
nfo() function.

For the rxGetInfo() command, remember that the dataargument corresponds to the
dataset you would like to get information about, getVarInfo corresponds to whethe

r or not information about variables should be extracted, andnumRows corresponds


to the number of rows to extract.

For the rxSummary() command, remember that the formulaargument specifies the
variables you would like to summarize. The variables should be placed on the right
hand side of the formula (You can see more help on formulas with ?formula). Rem
ember that you can get variable names from a dataset by using the rxGetInfo() func
tion with getVarInfo set to TRUE. The data argument should reference the dataset i
n which you want to look for the variables specified in the formula.

For the first rxHistogram command, remember that formulashould specify the vari
able you want to plot, in the format of a right-sided formula. If you can't remember
the name of appropriate variable, you can always run rxGetInfo() function with get
VarInfo set to TRUE. The data argument should reference the dataset in which you
want to look for the variables specified in the formula.

For the more specific histogram, several arguments have been added. You can vie
w the helpf for rxHistogram() with ?rxHistogram. xAxisMinMax controls the lowe
r and upper limits of the x axis. numBreaks controls the number of bins in the histo
gram, and xNumTicks controls the number of ticks on the x axis.

Example:
# Get basic information about your data
rxGetInfo(data = myAirlineXdf,
getVarInfo = TRUE,
numRows = 10)

## Summarize the variables corresponding to actual elapsed time, time in the air, d
eparture delay, flight Distance.
rxSummary(formula = ~ ActualElapsedTime + AirTime + DepDelay + Distance,

data = myAirlineXdf)

# Histogram of departure delays


rxHistogram(formula = ~DepDelay,
data = myAirlineXdf)

# Use parameters similar to a regular histogram to zero in on the interesting area


rxHistogram(formula = ~DepDelay,
data = myAirlineXdf,
xAxisMinMax = c(-100, 400),
numBreaks = 500,
xNumTicks = 10)

78. Construct a linear model


Use rxLinMod to create a simple linear model
The structure to a call to rxLinMod() is very similar to the lm()function in
the stats package. The syntax is: rxLinMod(formula, data, )

formula - The model specification.

data - The data in which you want to search for variables informula.

- Additional arguments

Using the data set myAirlineXdf, go ahead and start by creating a simple linear
model predicting arrival delay by day of the week.

## predict arrival delay by day of the week:


myLM1 <- rxLinMod( ArrDelay ~ DayOfWeek, data = myAirlineXdf)

## summarize the model


summary(myLM1)

## Use the transforms argument to create a factor variable associated with departur
e time "on the fly,"
## predict Arrival Delay by the interaction between Day of the week and that new f
actor variable
myLM2 <- rxLinMod( ArrDelay ~ DayOfWeek:catDepTime, data = myAirlineXdf
,
transforms = list(
catDepTime = cut(CRSDepTime, breaks = seq(from = 5, to = 23, by
= 2))
),
cube = TRUE
)

## summarize the model


summary(myLM2)

79. Correlation in Big Data: Use rxCor to construct the correlation matrix:
The syntax of rxCor function is: rxCor(formula, data, rowSelection) where

formula - a right-sided formula that specifies the variables you would like
correlate.

data - The dataset in which you want to search for the variables in the
formula.

rowSelection - a logical expression that will be used to select particular rows


for analysis. Rows in which the expression is TRUE will be included in the
correlation values; rows in which hte expression is FALSE will not.

# Correlation for departure delay, arrival delay, and air speed

rxCor(formula = ~ DepDelay + ArrDelay + airSpeed,


data = myAirlineXdf,
rowSelection = (airSpeed > 50) & (airSpeed < 800))

80. Linear Regression: Use rxLinMod to construct the regression: The syntax
is:rxLinMod(formula, data, rowSelection)

formula - a two-sided formula that specifies the variable you want to predict
on the left side, and the predictor variable(s) on the right side.

data - The dataset in which you want to search for the variables in the
formula.

rowSelection - a logical expression that will be used to select particular rows


for analysis. Rows in which the expression is *

Remember to use summary() to return your analysis results.

Example: Construct a linear regression predicting air speed by departure delay, for
air speeds between 50 and 800. Once you have computed the model, be sure to su
mmarize it in order to get inference results.
# Regression for airSpeed based on departure delay
myLMobj <- rxLinMod(formula = airSpeed ~ DepDelay,
data = myAirlineXdf,
rowSelection = (airSpeed > 50) & (airSpeed < 800))
summary(myLMobj)

81. Generating Predictions and Residuals


Use rxPredict() to generate arrival delay predictions for the model we created in
the prior exercise (myLM2).
Before we start with making predictions. Go ahead and summarize myLM2, so that
you can refresh your memory on the model and results.

rxPredict() is the RevoScaleR function that allows us to generate predictions and


residuals based on a variety of different models.
The syntax is: rxPredict(modelObject, data, outData, computeResiduals, )

modelObject - The model you would like to use in order to generate


predictions.

data - The data for which you want to make predictions.

outData - The location where you would like to store the residuals.

computeResiduals - Whether to compute residuals or not.

- Additional arguments.

In this case, we need to be careful. We need to keep in mind that it will generate as
many predictions as there are observations in the dataset. If you are trying to
generate predictions for a billion observations, your output will also have a billion
predictions. Because of this, the output is, by default not stored in memory, but
rather, stored in an xdf file.
Go ahead and create a new xdf file to store the predicted values of our original
dataset. Like other RevoScaleRfunctions, it can take additional arguments that
control which variables are kept in the creation of new data sets. Since we are
going to create our own copy of the dataset, we should also specify
the writeModelVars argument so that the values of the predictor variables are also
included in the new prediction file.
After using rxPredict() and rxGetInfo() to generate predictions, use the same
methods to generate the residuals.

## summarize model first


summary(myLM2)

## path to new dataset storing predictions


myNewADS <- "myNEWADS.xdf"

## generate predictions

rxPredict(modelObject = myLM2, data = myAirlineXdf,


outData = myNewADS,
writeModelVars = TRUE)
## get information on the new dataset
rxGetInfo(myNewADS, getVarInfo = TRUE)

## Generate residuals.
rxPredict(modelObject = myLM2, data = myAirlineXdf,
outData = myNewADS,
writeModelVars = TRUE,
computeResiduals = TRUE,
overwrite = TRUE)
## get information on the new dataset
rxGetInfo(myNewADS, getVarInfo = TRUE)

82. Computing k Means with rxKmeans()


Target: Cluster the mortgage default data into 4 clusters usingrxKmeans().
Use rxKmeans() function to perform k-means clustering on the data.
Before starting, be sure to take explore your workspace, and look at
the mortData dataset with rxGetInfo().
The syntax is: rxKmeans(formula, data, outFile, numClusters, )
formula - The formula that specifies which variables you would like to
include in the clustering algorithm.
data - The dataset in which to search for variables included in formula.
outFile - The dataset in which you would like to write the cluster IDs.
numClusters - The number (k) of clusters you would like to estimate.
- Additional arguments, several of which control the k means algorithm
(e.g. algorithm, numStarts, centers; see ?rxKmeans for more information)
Go ahead and use rxKmeans() to estimate a set of 4 clusters in
the mortData dataset. Use the variables corresponding to credit card debt, credit
score, and house age. Use therowSelection argument so that you only estimate the
clusters based on the observations from year 2000. Use theoutFile argument to
create a new dataset in which the model variables and the cluster IDs for each row
are written. Assign the output of the rxKmeans() call to an object namedKMout,
and then print that object, so that you can see some properties of the k-means
solution.
Once you have printed out some properties of the solution, take a look at the new
dataset that you have created. Extract the meta-data from the xdf file, and then
summarize the cluster variable using rxSummary().
Next, we are going to visualize the clustering. Go ahead and use
the rxXdfToDataFrame() function to read some of the new dataset into an
internal data.frame object. The first argument to rxXdfToDataFrame should be the
name of the xdf file you want to read in to memory. While this dataset is not
particularly
large,
we
will
go
ahead
and
use
therowSelection and transforms arguments in order to randomly sample a small
subset of the data. In this case, we use transforms to create a new
variable randSamp by randomly sampling the numbers 1 through 10 with
replacement. Using rowSelection to select only those rows where randSamp ==
1 means that we pull approximately 10% of our dataset into memory.
Once you have pulled the data into memory, visualize the clusters by creating each
possible pairwise scatter plot, with the color of each point corresponding to the
cluster it was assigned. The plot.data.frame() method is very useful here, but
remember to remove the variable corresponding to the cluster ID from the plot.

Can you explain why a single variable drives the clustering?


## Examine the mortData dataset
rxGetInfo(mortData, getVarInfo = TRUE)

## set up a path to a new xdf file


myNewMortData = "myMDwithKMeans.xdf"
## run k-means:
KMout <- rxKmeans(formula = ~ ccDebt + creditScore + houseAge,
data = mortData,
outFile = myNewMortData,
rowSelection = year == 2000,
numClusters = 4,
writeModelVars = TRUE)
print(KMout)

## Examine the variables in the new dataset:


rxGetInfo(myNewMortData, getVarInfo = TRUE)
## summarize the cluster variable:
rxSummary(~ F(.rxCluster), data = myNewMortData)

## read into memory 10% of the data:


mydf <- rxXdfToDataFrame(myNewMortData,
rowSelection = randSamp == 1,
varsToDrop = "year",
transforms = list(randSamp = sample(10, size = .rxNumRows, repla
ce = TRUE)))

## visualize the clusters:


plot(mydf[-1], col = mydf$.rxCluster)

83. Create some Decision Trees


Create a regression tree predicting default status using rxDTree().
Use rxDTree() to create a regression tree for the mortDatadataset.
rxDTree() is the RevoScaleR function that computes both regression and
classification trees. It decides which kind of tree to compute based on the type of
the dependent variable used in its formula argument. If that variable is numeric,
then it produces a regression tree; if that variable is a factor, then it produces a
classification tree.
The syntax is: rxDTree(formula, data, maxdepth, )

formula - The model specification

data - The data set in which to search for variables used informula

maxdepth - The maximum depth of the regression tree to estimate.

- Additional arguments

Go ahead and use rxDTree() in order to produce a decision tree predicting default
status by credit score, credit card debt, years employed, and house age, using a
maximum depth = 5. Use rowSelection to estimate this tree on only the data from
the year 2000. Assign the output to an object named regTreeOut. This can take a
few seconds, so be patient.
Once you have created this object, you can print it in order to view a summary and
textual description of the tree. Go ahead and print regTreeOut and spend a couple
of minutes looking at the output.
Although the text output can be useful, it is usually more intuitive to visualize such
an analysis via a dendrogram. You can produce this type of visualization a few
ways. First, you can make the output of rxDTree() appear to be an object produced
by the rpart() function by usingrxAddInheritance(). Once that is the case, then you
can use all of the methods associated with rpart: Using plot() on the object after
adding inheritance will produce a dendrogram, and then running text() on the
object after adding inheritance will add appropriate labels in the correct locations.
Go ahead and practice producing this dendrogram.

In most cases, you can also use the RevoTreeView library by loading that library
and running createTreeView() on theregTreeOut object. This does not work in the
datacamp platform, but will typically open an interactive web page where you can
expand nodes of the decision tree by clicking on them, and you can extract
information about each node by mousing over them.
Similar to the other modeling approaches we've seen, we can use rxPredict() in
order to generate predictions based on the model object. In order to generate
predictions, we need specify the same arguments as before: modelOject, data,
and outData. Since we will create a new dataset, we will also need to make sure to
write the model variables as well. And since we could generate predictions later as
well, let's go ahead and give the name of the new variable something more
specific. We can use teh predVarNames argument to specify the name of the
predicted values to be default_RegPred. Go ahead and try this. Once you have
created the variables, be sure to get information on your new variables
usingrxGetInfo() on the new dataset.
Another useful visualization for regression trees is the Receiver Operating
Characteristic (ROC) curve. This curve plots the "Hit" (or True Positive) rate as a
function of the False Positive rate for different criteria of classifying a positive or
negative. A good model will have a curve that deviates strongly from the identity
line y = x. One measure that is based on this ROC curve is the area under the curve
(AUC), which has a range between 0 and 1, with 0.5 corresponding to chance. We
can compute and display an ROC curve for a regression tree using rxRocCurve().
If we want to create a classification tree rather than a regression tree, we simply
need to convert our dependent measure into a factor. We can do this using
the transformsargument within the call to rxDTree() itself.

## regression tree:
regTreeOut <- rxDTree(default ~ creditScore + ccDebt + yearsEmploy + houseAge
,
rowSelection = year == 2000,
data = mortData,
maxdepth = 5)
## print out the object:
print(regTreeOut)

## plot a dendrogram, and add node labels:


plot(rxAddInheritance(regTreeOut))
text(rxAddInheritance(regTreeOut))

## Another visualization:
# library(RevoTreeView)
# createTreeView(regTreeOut)

## predict values:
myNewData = "myNewMortData.xdf"
rxPredict(regTreeOut,
data = mortData,
outData = myNewData,
writeModelVars = TRUE,
predVarNames = "default_RegPred")

## visualize ROC curve


rxRocCurve(actualVarName = "default",
predVarNames = "default_RegPred",
data = myNewData)

84. R Examples
a. Str(dataset) : for getting structure of data
b. Head(dataset): for getting top rows of data
Fivenum(dataset$column): Returns Tukey's five number summary (minimum, l
ower-hinge, median, upper-hinge, maximum) for the input data.
c.

d. IQR(dataset$column) : difference bwtween 75th and 25th quartile values.


e. Boxplot(dataset$column) : diagramitic analysis of min, max, median etc.
f. Newdataset <- edit(dataset) : to edit individual values in dataset.
g. Confidence and Support:
We will make some assumptions for what we might find in an experiment and find
the resulting confidence interval using a normal distribution. Here we assume that t
he sample mean is 5, the standard deviation is 2, and the sample size is 20. In the e
xample below we will use a 95% confidence level and wish to find the confidence i
nterval. The commands to find the confidence interval in R are the following:
> a <- 5
> s <- 2
> n <- 20
> error <- qnorm(0.975)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.123477
> right
[1] 5.876523

The true mean has a probability of 95% of being in the interval between 4.12 and 5.88
assuming that the original random variable is normally distributed, and the samples are
independent.

Calculating a Confidence Interval From a t Distribution:


Calculating the confidence interval when using a t-test is similar to using a normal
distribution. The only difference is that we use the command associated with the t-

distribution rather than the normal distribution. Here we repeat the procedures above,
but we will assume that we are working with a sample standard deviation rather than an
exact standard deviation.
Again we assume that the sample mean is 5, the sample standard deviation is 2, and the
sample size is 20. We use a 95% confidence level and wish to find the confidence
interval. The commands to find the confidence interval in R are the following:
> a <- 5
> s <- 2
> n <- 20
> error <- qt(0.975,df=n-1)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.063971
> right
[1] 5.936029

The true mean has a probability of 95% of being in the interval between 4.06 and 5.94
assuming that the original random variable is normally distributed, and the samples are
independent.
We now look at an example where we have a univariate data set and want to find the
95% confidence interval for the mean. In this example we use one of the data sets given
in the data input chapter. We use the w1.dat data set:
> w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
> summary(w1)
vals
Min.

:0.130

1st Qu.:0.480
Median :0.720
Mean

:0.765

3rd Qu.:1.008
Max.

:1.760

> length(w1$vals)
[1] 54

> mean(w1$vals)
[1] 0.765
> sd(w1$vals)
[1] 0.3781222

We can now calculate an error for the mean:


> error <- qt(0.975,df=length(w1$vals)-1)*sd(w1$vals)/sqrt(length(w1$vals)
)
> error
[1] 0.1032075

The confidence interval is found by adding and subtracting the error from the mean:
> left <- mean(w1$vals)-error
> right <- mean(w1$vals)+error
> left
[1] 0.6617925
> right
[1] 0.8682075

There is a 95% probability that the true mean is between 0.66 and 0.87 assuming that
the original random variable is normally distributed, and the samples are independent.

Calculating Many Confidence Intervals From a t Distribution


Suppose that you want to find the confidence intervals for many tests. This is a common
task and most software packages will allow you to do this.
We have three different sets of results:
Comparison 1

Mean

Std. Dev.

Number (pop.)

Group I

10

300

Group II

10.5

2.5

230

Comparison 2

Mean

Std. Dev.

Number (pop.)

Group I

12

210

Group II

13

5.3

340

Mean

Std. Dev.

Number (pop.)

Group I

30

4.5

420

Group II

28.5

400

Comparison 3

For each of these comparisons we want to calculate the associated confidence interval
for the difference of the means. For each comparison there are two groups. We will refer
to group one as the group whose results are in the first row of each comparison above.
We will refer to group two as the group whose results are in the second row of each
comparison above. Before we can do that we must first compute a standard error and a
t-score. We will find general formulae which is necessary in order to do all three
calculations at once.
We assume that the means for the first group are defined in a variable called m1. The
means for the second group are defined in a variable called m2. The standard deviations
for the first group are in a variable called sd1. The standard deviations for the second
group are in a variable called sd2. The number of samples for the first group are in a
variable called num1. Finally, the number of samples for the second group are in a
variable called num2.
With these definitions the standard error is the square root
of (sd1^2)/num1+(sd2^2)/num2. The R commands to do this can be found below:
> m1 <- c(10,12,30)
> m2 <- c(10.5,13,28.5)

> sd1 <- c(3,4,4.5)


> sd2 <- c(2.5,5.3,3)
> num1 <- c(300,210,420)
> num2 <- c(230,340,400)
> se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
> error <- qt(0.975,df=pmin(num1,num2)-1)*se

To see the values just type in the variable name on a line alone:
> m1
[1] 10 12 30
> m2
[1] 10.5 13.0 28.5
> sd1
[1] 3.0 4.0 4.5
> sd2
[1] 2.5 5.3 3.0
> num1
[1] 300 210 420
> num2
[1] 230 340 400
> se
[1] 0.2391107 0.3985074 0.2659216
> error
[1] 0.4711382 0.7856092 0.5227825

Now we need to define the confidence interval around the assumed differences. Just as
in the case of finding the p values in previous chapter we have to use the pmin command
to get the number of degrees of freedom. In this case the null hypotheses are for a
difference of zero, and we use a 95% confidence interval:
> left <- (m1-m2)-error
> right <- (m1-m2)+error
> left
[1] -0.9711382 -1.7856092
> right

0.9772175

[1] -0.02886177 -0.21439076

2.02278249

This gives the confidence intervals for each of the three tests. For example, in the first
experiment the 95% confidence interval is between -0.97 and -0.03 assuming that the
random variables are normally distributed, and the samples are independent.

h. if-else example:
if ( x < 0.2)
{

x <- x + 1
cat("increment that number!\n")

} else
{
x <- x - 1
cat("nah, make it smaller.\n");
}
nah, make it smaller.

For loop example: for (lupe in seq(0,1,by=0.3))


{

cat(lupe,"\n");

}
0
0.3
0.6
0.9
>
> x <- c(1,2,4,8,16)
> for (loop in x)
{
cat("value of loop: ",loop,"\n");
}
value of loop:

value of loop:

value of loop:

value of loop:

value of loop:

16

h.
sum(1, 3, 5)
rep("Yo ho!", times = 3)
i. You can list the files in the current directory from within R, by calling
the list.files function
j. To run a script, pass a string with its name to the source function.

source("bottle1.R")
k. barplot(1:100)
l. Vector Math:

a <- c(1, 2, 3)
>a+1
m. sum can take an optional named argument, na.rm. It's set to FALSE by default,
but if you set it to TRUE, all NA arguments will be removed from the vector
before the calculation is performed.
Try calling sum again, with na.rm set to TRUE:
RedoComplete
> sum(a, na.rm = TRUE)
n. a <- 1:12
matrix(a, 3, 4)
o. The dim assignment function sets dimensions for a matrix. It accepts a vector
with the number of rows and the number of columns to assign
dim(plank) <- c(2, 4)
p. contour map of the values simply by passing the matrix to the contour function
contour(matrix name)
q. create a 3D perspective plot with the persp function : persp(matrix name)
r. > persp(matrix name, expand=0.2)

s. image(matrix name) : to create a heat map.


t. add a line on the plot to show one standard deviation above the mean or TO
show two ablines: > abline(h = meanValue + deviation)
u. adding a line on the plot to show one standard devation below the mean:
abline(h = meanValue - deviation)

85. DATA FRAMES:


The weights, prices, and types data structures are all deeply tied together, if you
think about it. If you add a new weight sample, you need to remember to add a new
price and type, or risk everything falling out of sync. To avoid trouble, it would be
nice if we could tie all these variables together in a single data structure.
Fortunately, R has a structure for just this purpose: the data frame. You can think
of a data frame as something akin to a database table or an Excel spreadsheet. It
has a specific number of columns, each of which is expected to contain values of a
particular type. It also has an indeterminate number of rows - sets of related values
for each column.
Our vectors with treasure chest data are perfect candidates for conversion to a data
frame. And it's easy to do. Call the data.frame function, and pass weights, prices,
and types as the arguments. Assign the result to thetreasure variable:
> treasure <- data.frame(weights, prices, types)
> print(treasure)
weights prices types
1 300 9000 gold
2 200 5000 silver
3 100 12000 gems
4 250 7500 gold
5 150 18000 gems
a. You can get individual columns by providing their index number in doublebrackets. Try getting the second column (prices) of treasure:

> treasure[[2]]
b. You could instead provide a column name as a string in double-brackets. (This
is often more readable.) Retrieve the "weights" column:
> treasure[["weights"]]
c. Typing all those brackets can get tedious, so there's also a shorthand notation:
the data frame name, a dollar sign, and the column name (without quotes). Try
using it to get the "prices" column:
> treasure$prices
d. You can load a CSV file's content into a data frame by passing the file name to
the read.csv function. Try it with the "targets.csv" file:
RedoComplete
> read.csv("targets.csv")
e. Merge Data Frames: We want to loot the city with the most treasure and the
fewest guards. Right now, though, we have to look at both files and match up the
rows. It would be nice if all the data for a port were in one place...
R's merge function can accomplish precisely that. It joins two data frames together,
using the contents of one or more columns. First, we're going to store those file
contents in two data frames for you, targets and infantry.
The merge function takes arguments with an x frame (targets) and a y frame
(infantry). By default, it joins the frames on columns with the same name (the
two Port columns). See if you can merge the two frames:
RedoComplete
> targets <- read.csv("targets.csv")
> infantry <- read.table("infantry.txt", sep="\t", header=TRUE)
> merge(x = targets, y = infantry)

Vous aimerez peut-être aussi