Vous êtes sur la page 1sur 151

Machine Learning in R

Alexandros Karatzoglou 1

1 Telefonica Research Barcelona, Spain

December 15, 2010

Machine Learning in R Alexandros Karatzoglou 1 1 Telefonica Research Barcelona, Spain December 15, 2010 1

Outline

1 Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting

2 Basic Plots

3 Lattice Plots

4 Basic Statistics & Machine Learning Tests

5 Linear Models

6 Naive Bayes

7 Support Vector Machines

8 Decision Trees

9 Dimensionality Reduction

Learning Tests 5 Linear Models 6 Naive Bayes 7 Support Vector Machines 8 Decision Trees 9

2

R

Environment for statistical data analysis, inference and visualization.

Ports for Unix, Windows and MacOSX

Highly extensible through user-defined functions

Generic functions and conventions for standard operations like plot, predict etc.

1200 add-on packages contributed by developers from all over the world

e.g. Multivariate Statistics, Machine Learning, Natural Language Processing, Bioinformatics (Bioconductor), SNA, .

Interfaces to C, C++, Fortran, Java

Machine Learning, Natural Language Processing, Bioinformatics (Bioconductor), SNA, . Interfaces to C, C++, Fortran, Java 3

Outline

1

2

3

4

5

6

7

8

9

Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting

Basic Plots

Lattice Plots

Basic Statistics & Machine Learning Tests

Linear Models

Naive Bayes

Support Vector Machines

Decision Trees

Dimensionality Reduction

& Machine Learning Tests Linear Models Naive Bayes Support Vector Machines Decision Trees Dimensionality Reduction 4

4

Comprehensive R Archive Network (CRAN)

CRAN includes packages which provide additional functionality to the one existing in R

Currently over 1200 packages in areas like multivariate statistics, time series analysis, Machine Learning, Geo-statistics, environmental statistics etc.

packages are written mainly by academics, PhD students, or company staff

Some of the package have been ordered into Task Views

are written mainly by academics, PhD students, or company staff Some of the package have been

CRAN Task Views

Some of the CRAN packages are ordered into categories called Task Views

Task Views are maintained by people with experience in the corresponding field

there are Task Views on Econometrics, Environmental Statistics, Finance, Genetics, Graphics Machine Learning, Multivariate Statistics, Natural Language Processing, Social Sciences, and others.

Genetics, Graphics Machine Learning, Multivariate Statistics, Natural Language Processing, Social Sciences, and others. 6

Online documentation

1 Go to http://www.r-project.org

2 On the left side under Documentation

3 Official Manuals, FAQ, Newsletter, Books

4 Other and click on other publications contains a collection of contributed guides and manuals

Newsletter, Books 4 Other and click on other publications contains a collection of contributed guides and

Mailing lists

R-announce is for announcing new major enhancements in R R-packages announcing new version of packages R-help is for R related user questions! R-devel is for R developers

announcing new version of packages R-help is for R related user questions! R-devel is for R

Command Line Interface

R does not have a “real” gui All computations and statistics are performed with commands These commands are called “functions” in R

gui All computations and statistics are performed with commands These commands are called “functions” in R

help

click on help search help functionality FAQ link to reference web pages apropos

help click on help search help functionality FAQ link to reference web pages apropos 10

Outline

1

2

3

4

5

6

7

8

9

Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting

Basic Plots

Lattice Plots

Basic Statistics & Machine Learning Tests

Linear Models

Naive Bayes

Support Vector Machines

Decision Trees

Dimensionality Reduction

Machine Learning Tests Linear Models Naive Bayes Support Vector Machines Decision Trees Dimensionality Reduction 1 1

11

Objects

Everything in R is an object Everything in R has a class.

Data, intermediate results and even e.g. the result of a regression are stored in R objects

The Class of the object both describes what the object contains and what many standard functions

Objects are usually accessed by name. Syntactic names for objects are made up from letters, the digits 0 to 9 in any non-initial position and also the period “.”

names for objects are made up from letters, the digits 0 to 9 in any non-initial

Assignment and Expression Operations

R commands are either assignments or expressions

Commands are separated either by a semicolon ; or newline

An expression command is evaluated and (normally) printed

An assignment command evaluates an expression and passes the value to a variable but the result is not printed

printed An assignment command evaluates an expression and passes the value to a variable but the

Expression Operations

>

1

[1]

+

2

1

Expression Operations > 1 [1] + 2 1 14

Assigment Operations

>

res

<-

1

+

1

“<-” is the assignment operator in R

a series of commands in a file (script) can be executed with the command : source(“myfile.R”)

operator in R a series of commands in a file (script) can be executed with the

Sample session

> 1:5

2

> powers.of.2

5

[1]

1

3

4

> powers.of.2

[1]

2

4

8

<-

16

2^(1:5)

32

> class(powers.of.2)

"numeric"

>

"powers.of.2"

> rm(powers.of.2)

[1]

ls()

[1]

"res"

> class(powers.of.2) "numeric" > "powers.of.2" > rm(powers.of.2) [1] ls() [1] "res" 16

Workspace

R stores objects in workspace that is kept in memory

When quiting R ask you if you want to save that workspace

The workspace containing all objects you work on can then be restored next time you work with R along with a history of the used commands.

all objects you work on can then be restored next time you work with R along

Outline

1

2

3

4

5

6

7

8

9

Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting

Basic Plots

Lattice Plots

Basic Statistics & Machine Learning Tests

Linear Models

Naive Bayes

Support Vector Machines

Decision Trees

Dimensionality Reduction

Machine Learning Tests Linear Models Naive Bayes Support Vector Machines Decision Trees Dimensionality Reduction 1 8

18

Basic Data Structures

Vectors

Factors

Data Frame

Matrices

Lists

Basic Data Structures Vectors Factors Data Frame Matrices Lists 19

Vectors

In R the building blocks for storing data are vectors of various types. The most common classes are:

"character", a vector of character strings of varying length. These are normally entered and printed surrounded by double quotes.

"numeric", a vector of real numbers.

"integer", a vector of (signed) integers.

"logical", a vector of logical (true or false) values. The values are printed and as TRUE and FALSE.

integers. "logical", a vector of logical (true or false) values. The values are printed and as

Numeric Vectors

>

> is(vect)

vect

<-

c(1,

[1]

"numeric"

> vect[2]

2

[1]

> vect[2:3]

2

> length(vect)

[1]

99

[1]

6

>

sum(vect)

[1]

125

2,

99,

6,

"vector"

8,

9)

2 [1] > vect[2:3] 2 > length(vect) [1] 99 [1] 6 > sum(vect) [1] 125 2,

Character Vectors

>

+ "france",

+ "poland")

> is(vect3)

vect3

<-

c("austria",

"uk",

"spain",

"belgium",

[1]

"character"

[2]

"vector"

[3]

"data.frameRowLabels"

[4]

"SuperClassMethod"

> vect3[2]

 

[1]

"spain"

> vect3[2:3]

 

[1]

"spain"

"france"

> length(vect3)

[1]

6

"spain" > vect3[2:3]   [1] "spain" "france" > length(vect3) [1] 6 22

Logical Vectors

>

+

> is(vect4)

vect4

<-

c(TRUE,

TRUE)

FALSE,

TRUE,

[1]

"logical"

"vector"

FALSE,

TRUE,

> + > is(vect4) vect4 <- c(TRUE, TRUE) FALSE, TRUE, [1] "logical" "vector" FALSE, TRUE, 23

Factors

> citizen

+

> citizen

<-

"no",

factor(c("uk",

"us",

"au",

"uk",

"us",

"us"))

[1]

uk

us

no

au

uk

us

us

Levels:

au

no

uk

us

> unclass(citizen)

[1]

3

4

2

1

3

4

4

attr(,"levels")

[1]

"au"

"no"

"uk"

> citizen[5:7]

[1]

uk

us

us

Levels:

> citizen[5:7,

au

no

[1]

uk

Levels:

us

uk

us

us

uk

us

drop

"us"

=

TRUE]

citizen[5:7] [1] uk us us Levels: > citizen[5:7, au no [1] uk Levels: us uk us

Factors Ordered

>

+

+

>

income

<-

"Lo",

"Mid",

income

ordered(c("Mid",

"Hi",

"Mid",

"Hi"))

"Lo",

"Hi"),

levels

=

c("Lo",

[1]

Mid

Hi

Lo

<

Lo

Mid

Mid

<

Lo

Hi

Levels:

> as.numeric(income)

[1]

2

3

1

2

1

3

> class(income)

"ordered"

> income[1:3]

[1]

"factor"

Hi

[1]

Mid

Hi

Lo

Levels:

Lo

<

Mid

<

Hi

> income[1:3] [1] "factor" Hi [1] Mid Hi Lo Levels: Lo < Mid < Hi 25

Data Frames

A data frame is the type of object normally used in R to store a

data matrix.

It should be thought of as a list of variables of the same length, but

possibly of different types (numeric, factor, character, logical, etc.).

as a list of variables of the same length, but possibly of different types (numeric, factor,

Data Frames

>

library(MASS)

 

>

data(painters)

>

painters[1:3,

]

 

Composition

Drawing

Colour

Da

Udine

10

8

16

Da

Vinci

15

16

4

Del

Piombo

 

8

13

16

Expression

School

Da

Udine

3

A

Da

Vinci

14

A

Del

Piombo

 

7

A

> names(painters)

[1]

"Composition"

"Drawing"

[3]

"Colour"

"Expression"

[5]

"School"

> row.names(painters)

[1]

"Da

Udine"

"Da

Vinci"

"Expression" [5] "School" > row.names(painters) [1] "Da Udine" "Da Vinci" 2 7

27

Data Frames

Data frames are by far the commonest way to store data in R

They are normally imported by reading a file or from a spreadsheet or database.

However, vectors of the same length can be collected into a data frame by the function data.frame

or database. However, vectors of the same length can be collected into a data frame by

Data Frame

>

+ income)

> summary(mydf)

mydf

<-

data.frame(vect,

vect3,

 

vect

vect3

income

Min.

:

1.00

austria:1

Lo

:2

1st

Qu.:

3.00

belgium:1

Mid:2

Median

:

7.00

france

:1

Hi

:2

Mean

:20.83

poland

:1

3rd

Qu.:

8.75

spain

:1

Max.

:99.00

uk

:1

> mydf

+ income)

<-

data.frame(vect,

I(vect3),

Qu.: 8.75 spain :1 Max. :99.00 uk :1 > mydf + income) <- data.frame(vect, I(vect3), 29

Matrices

A data frame may be printed like a matrix, but it is not a matrix. Matrices like vectors have all their elements of the same type

may be printed like a matrix, but it is not a matrix. Matrices like vectors have

> mymat

> class(mymat)

<-

matrix(1:10,

[1]

>

"matrix"

dim(mymat)

[1]

2

5

2,

5)

> mymat > class(mymat) <- matrix(1:10, [1] > "matrix" dim(mymat) [1] 2 5 2, 5) 31

Lists

A list is a vector of other R objects. Lists are used to collect together items of different classes. For example, an employee record might be created by :

Lists are used to collect together items of different classes. For example, an employee record might

Lists

>

+

+ child.ages

<-

Empl

list(employee

=

"Fred",

=

c(4,

spouse

> Empl$employee

[1]

"Anna"

=

"Anna",

children

=

7,

9))

> Empl$child.ages[2]

[1]

7

3,

spouse > Empl$employee [1] "Anna" = "Anna", children = 7, 9)) > Empl$child.ages[2] [1] 7 3,

Basic Mathematical Operations

>

[1]

a

>

a

[1]

> a

>

>

b

5

[1]

-

2

<-

<-

-

1

*

2

3

b

2

b

3

2:4

rep(1,

3

4

3)

Operations > [1] a > a [1] > a > > b 5 [1] - 2

Matrix Multiplication

"

> a

<-

matrix(1:9,

3,

3)

> matrix(2,

b

<-

3,

3)

> a

c

<-

% * %

b

<- matrix(1:9, 3, 3) > matrix(2, b <- 3, 3) > a c <- % *

Outline

1

2

3

4

5

6

7

8

9

Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting

Basic Plots

Lattice Plots

Basic Statistics & Machine Learning Tests

Linear Models

Naive Bayes

Support Vector Machines

Decision Trees

Dimensionality Reduction

Machine Learning Tests Linear Models Naive Bayes Support Vector Machines Decision Trees Dimensionality Reduction 3 6

36

Missing Values

Missing Values in R are labeled with the logical value NA

Missing Values Missing Values in R are labeled with the logical value NA 37

Missing Values

> mydf[mydf

==

99]

<-

NA

>

mydf

vect

vect3

income

1

1

austria

Mid

2

2

spain

Hi

3

NA

france

Lo

4

6

uk

Mid

5

8

belgium

Lo

6

9

poland

Hi

spain Hi 3 NA france Lo 4 6 uk Mid 5 8 belgium Lo 6 9

Missing Values

>

a

<-

c(3,

5,

6)

>

a[2]

<-

NA

>

a

[1]

3

NA

6

> is.na(a)

[1]

FALSE

TRUE

FALSE

c(3, 5, 6) > a[2] <- NA > a [1] 3 NA 6 > is.na(a) [1]

Special Values

R has the also special values NaN , Inf and -Inf

> d

<-

c(-1,

0,

1)

> d/0

 

[1]

-Inf

NaN

Inf

values NaN , Inf and -Inf > d <- c(-1, 0, 1) > d/0   [1]

Outline

1

2

3

4

5

6

7

8

9

Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting

Basic Plots

Lattice Plots

Basic Statistics & Machine Learning Tests

Linear Models

Naive Bayes

Support Vector Machines

Decision Trees

Dimensionality Reduction

Machine Learning Tests Linear Models Naive Bayes Support Vector Machines Decision Trees Dimensionality Reduction 4 1

41

Entering Data

For all but the smallest datasets the easiest way to get data into R is to import it from a connection such as a file

all but the smallest datasets the easiest way to get data into R is to import

Entering Data

> c(5,

a

<-

8,

5,

4,

9,

7)

> scan()

b

<-

Entering Data > c(5, a <- 8, 5, 4, 9, 7) > scan() b <- 43

Entering Data

>

mydf2

<-

edit(mydf)

Entering Data > mydf2 <- edit(mydf) 44

Outline

1

2

3

4

5

6

7

8

9

Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting

Basic Plots

Lattice Plots

Basic Statistics & Machine Learning Tests

Linear Models

Naive Bayes

Support Vector Machines

Decision Trees

Dimensionality Reduction

Machine Learning Tests Linear Models Naive Bayes Support Vector Machines Decision Trees Dimensionality Reduction 4 5

45

File Input and Output

read.table can be used to read data from text file like csv

read.table creates a data frame from the values and tries to guess the type of each variable

>

mydata

<-

read.table("file.csv",

+

sep

=

",")

the type of each variable > mydata <- read.table("file.csv", + sep = ",") 46

Packages

Packages contain additional functionality in the form of extra functions and data

Installing a package can be done with the function install.packages()

The default R installation contains a number of contributed packages like MASS, foreign, utils

Before we can access the functionality in a package we have to load the package with the function library()

utils Before we can access the functionality in a package we have to load the package

Outline

1

2

3

4

5

6

7

8

9

Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting

Basic Plots

Lattice Plots

Basic Statistics & Machine Learning Tests

Linear Models

Naive Bayes

Support Vector Machines

Decision Trees

Dimensionality Reduction

Machine Learning Tests Linear Models Naive Bayes Support Vector Machines Decision Trees Dimensionality Reduction 4 8

48

Packages

Help on the contents of the packages is available

>

library(help

=

foreign)

Help on installed packages is also available by help.start() Vignettes are also available for many packages

foreign) Help on installed packages is also available by help.start() Vignettes are also available for many

SPSS files

Package foreign contains functions that read data from SPSS (sav), STATA and other formats

> library(foreign)

> spssdata

+ to.data.frame

<-

read.spss("ees04.sav",

=

TRUE)

and other formats > library(foreign) > spssdata + to.data.frame <- read.spss("ees04.sav", = TRUE) 50

Excel files

To load Excel file into R an external package xlsReadWrite is needed

We will install the package directly from CRAN using install.packages

>

install.packages("xlsReadWrite",

+

lib

=

"C:\temp")

CRAN using install.packages > install.packages("xlsReadWrite", + lib = "C:\temp") 51

Excel files

> library(xlsReadWrite)

>

data

<-

read.xls("sampledata.xls")

Excel files > library(xlsReadWrite) > data <- read.xls("sampledata.xls") 52

Data Export

You can save your data by using the functions write.table() or save()

write.table is more data specific, while save can save any R object

save saves in a binary format while write.table() in text format.

> write.table(mydata,

+

> save(mydata,

> load(file

file

=

"mydata.csv",

quote

=

FALSE)

file

=

"mydata.rda")

=

"mydata.rda")

> load(file file = "mydata.csv", quote = FALSE) file = "mydata.rda") = "mydata.rda") 53

Outline

1

2

3

4

5

6

7

8

9

Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting

Basic Plots

Lattice Plots

Basic Statistics & Machine Learning Tests

Linear Models

Naive Bayes

Support Vector Machines

Decision Trees

Dimensionality Reduction

Machine Learning Tests Linear Models Naive Bayes Support Vector Machines Decision Trees Dimensionality Reduction 5 4

54

Indexing

R has a powerful set of Indexing capabilities

This facilitates Data manipulation and can speed up data pre-processing

Indexing Vectors, Data Frames and matrices is done in a similar manner

and can speed up data pre-processing Indexing Vectors, Data Frames and matrices is done in a

Indexing by numeric vectors

> letters[1:3]

[1]

"a"

"b"

"c"

> letters[c(7,

[1]

"g"

"i"

9)]

> letters[-(1:15)]

[1]

"p"

"q"

"r"

"s"

"t"

"u"

"v"

"w"

"x"

[10]

"y"

"z"

"t" "u" "v" "w" "x" [10] "y" "z" 56

Indexing by logical vectors

>

>

> is.na(a)

a[c(3,

a

<-

1:10

5,

7)]

<-

NA

[1]

FALSE

FALSE

TRUE

FALSE

TRUE

FALSE

[7]

TRUE

FALSE

FALSE

FALSE

> a[!is.na(a)]

[1]

1

2

4

6

8

9

10

TRUE FALSE TRUE FALSE [7] TRUE FALSE FALSE FALSE > a[!is.na(a)] [1] 1 2 4 6

Indexing Matrices and Data Frames

> mymat[1,

[1]

1

3

5

7

]

9

>

mymat[,

c(2,

3)]

 

[,1]

[,2]

[1,]

3

5

[2,]

4

6

> mymat[1,

[1]

7

9

-(1:3)]

mymat[, c(2, 3)]   [,1] [,2] [1,] 3 5 [2,] 4 6 > mymat[1, [1] 7

Selecting Subsets

> attach(painters)

> painters[Colour

>=

17,

]

 

Composition

Drawing

Colour

Bassano

6

8

17

Giorgione

8

9

18

Pordenone

8

14

17

Titian

12

15

18

Rembrandt

15

6

17

Rubens

18

13

17

Van

Dyck

15

10

17

 

Expression

School

Bassano

0

D

Giorgione

4

D

Pordenone

5

D

Titian

6

D

Rembrandt

12

G

Rubens

17

G

Rubens 17 G

Selecting Subsets

>

painters[Colour

>=

15

&

Composition

>

+

10,

]

 

Composition

Drawing

Colour

Tintoretto

15

14

16

Titian

12

15

18

Veronese

15

10

16

Corregio

13

13

15

Rembrandt

15

6

17

Rubens

18

13

17

Van

Dyck

15

10

17

 

Expression

School

Tintoretto

4

D

Titian

6

D

Veronese

3

D

Corregio

12

E

Rembrandt

12

G

Rubens

17

G

Rubens 17 G

Graphics

R has a rich set of visualization capabilities scatter-plots, histograms, box-plots and more.

Graphics R has a rich set of visualization capabilities scatter-plots, histograms, box-plots and more. 61

histograms

> data(hills)

> hist(hills$time)

histograms > data(hills) > hist(hills$time) 62

Scatter Plot

>

plot(climb

~

time,

hills)

Scatter Plot > plot(climb ~ time, hills) 63

Paired Scatterplot

>

pairs(hills)

Paired Scatterplot > pairs(hills) 64

Box Plots

>

boxplot(count

~

spray,

data

=

InsectSprays,

+

col

=

"lightgray")

 
> boxplot(count ~ spray, data = InsectSprays, + col = "lightgray")   65

Formula Interface

A formula is of the general form response expression where the

left-hand side, response, may in some uses be absent and the right-hand side, expression, is a collection of terms joined by operators usually resembling an arithmetical expression.

The meaning of the right-hand side is context dependent

e.g. in linear and generalized linear modelling it specifies the form

of the model matrix

side is context dependent e.g. in linear and generalized linear modelling it specifies the form of

Scatterplot and line

> library(MASS)

> data(hills)

> attach(hills)

> plot(climb

+

> abline(0,

> abline(lm(climb

~

time,

ylab

=

"Feet")

40)

~

hills,

time))

xlab

=

"Time",

+ > abline(0, > abline(lm(climb ~ time, ylab = "Feet") 40) ~ hills, time)) xlab =

Multiple plots

> par(mfrow

=

c(1,

2))

> plot(climb

 

~

time,

hills,

xlab

=

"Time",

+ ylab

=

"Feet")

 

> plot(dist

~

time,

hills,

xlab

=

"Time",

+ ylab

=

"Dist")

 

> par(mfrow

=

c(1,

1))

xlab = "Time", + ylab = "Dist")   > par(mfrow = c(1, 1)) 68

Plot to file

> pdf(file

=

"H:/My

Documents/plot.pdf")

plot(dist

> ~

time,

hills,

xlab

=

"Time",

+ ylab

=

"Dist",

main

=

"Hills")

> dev.off()

xlab = "Time", + ylab = "Dist", main = "Hills") > dev.off() 69

Paired Scatterplot

>

splom(~hills)

Paired Scatterplot > splom(~hills) 70

Box-whisker plots

> data(michelson)

> bwplot(Expt

~

Speed,

data

=

michelson,

+ ylab

=

"Experiment

No.")

> title("Speed

of

Light

Data")

data = michelson, + ylab = "Experiment No.") > title("Speed of Light Data") 71

Box-whisker plots

> data(quine)

> bwplot(Age

~

+ data

Eth,

+ 2))

Days

|

Sex

=

quine,

Lrn

layout

*

*

=

c(4,

Box-whisker plots > data(quine) > bwplot(Age ~ + data Eth, + 2)) Days | Sex =

Discriptive Statistics

> mean(hills$time)

[1]

57.87571

> colMeans(hills)

dist

climb

time

7.528571

1815.314286

57.875714

> median(hills$time)

[1]

39.75

> quantile(hills$time)

0%

25%

50%

75%

100%

15.950

28.000

39.750

68.625

204.617

> var(hills$time)

2504.073

> sd(hills$time)

50.04072

[1]

[1]

15.950 28.000 39.750 68.625 204.617 > var(hills$time) 2504.073 > sd(hills$time) 50.04072 [1] [1] 73

Discriptive Statistics

> cor(hills)

time

dist 1.0000000 0.6523461 0.9195892 climb 0.6523461 1.0000000 0.8052392 time 0.9195892 0.8052392 1.0000000

> cov(hills)

dist

climb

 

dist

climb

time

dist

30.51387

5834.638

254.1944

climb

5834.63782

2621648.457

65243.2567

time

254.19442

65243.257

2504.0733

> cor(hills$time,

0.8052392

[1]

hills$climb)

65243.2567 time 254.19442 65243.257 2504.0733 > cor(hills$time, 0.8052392 [1] hills$climb) 74

Distributions

R contains a number of functions for drawing from a number of distribution like e.g. normal, uniform etc.

Distributions R contains a number of functions for drawing from a number of distribution like e.g.

Sampling from a Normal Distribution

>

nvec

<-

rnorm(100,

3)

Sampling from a Normal Distribution > nvec <- rnorm(100, 3) 76

Outline

1

2

3

4

5

6

7

8

9

Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting

Basic Plots

Lattice Plots

Basic Statistics & Machine Learning Tests

Linear Models

Naive Bayes

Support Vector Machines

Decision Trees

Dimensionality Reduction

Machine Learning Tests Linear Models Naive Bayes Support Vector Machines Decision Trees Dimensionality Reduction 7 7

77

Testing for the mean T-test

>

t.test(nvec,

mu

=

1)

One

Sample

t-test

 

data:

nvec

t

=

21.187,

df

=

99,

p-value

<

2.2e-16

 

alternative

hypothesis:

true

mean

is

not

equal

to

1

95

percent

2.993364

confidence

3.405311

sample

estimates:

mean

x

3.199337

of

interval:

is not equal to 1 95 percent 2.993364 confidence 3.405311 sample estimates: mean x 3.199337 of

Testing for the mean T-test

>

One

t.test(nvec,

Sample

mu

t-test

=

2)

data:

nvec

t

=

11.5536,

df

=

99,

p-value

<

2.2e-16

 

alternative

hypothesis:

true

mean

is

not

equal

to

2

95

percent

2.993364

confidence

3.405311

sample

estimates:

mean

x

3.199337

of

interval:

is not equal to 2 95 percent 2.993364 confidence 3.405311 sample estimates: mean x 3.199337 of

Testing for the mean Wilcox-test

>

Wilcoxon

continuity

wilcox.test(nvec,

mu

signed

rank

correction

=

test

3)

with

nvec

V

alternative

data:

=

3056,

p-value

=

0.06815

true

hypothesis:

location

is

not

equal

to

correction = test 3) with nvec V alternative data: = 3056, p-value = 0.06815 true hypothesis:

Two Sample Tests

>

> t.test(nvec,

Welch

nvec2

<-

rnorm(100,

nvec2)

2)

Two

Sample

t-test

data:

nvec

and

nvec2

t

=

7.4455,

df

=

195.173,

p-value

=

3.025e-12

alternative

hypothesis:

true

difference

in

means

is

no

95

percent

confidence

interval:

0.8567072

1.4741003

sample

mean

of

estimates:

mean

y

3.199337 2.033933

x

of

percent confidence interval: 0.8567072 1.4741003 sample mean of estimates: mean y 3.199337 2.033933 x of 81

Two Sample Tests

>

Wilcoxon

continuity

wilcox.test(nvec,

rank

sum

nvec2)

with

test

correction

nvec

W

alternative

data:

and

p-value

nvec2

=

=

7729,

2.615e-11

true

hypothesis:

location

shift

is

not

equ

correction nvec W alternative data: and p-value nvec2 = = 7729, 2.615e-11 true hypothesis: location shift

Paired Tests

>

t.test(nvec,

nvec2,

paired

=

TRUE)

Paired

t-test

 

data:

nvec

and

nvec2

t

=

7.6648,

df

=

99,

p-value

=

1.245e-11

 

alternative

hypothesis:

true

difference

in

means

is

no

95

percent

confidence

interval:

 

0.863712

1.467096

 

sample

estimates:

mean

of

the

differences

1.165404

confidence interval:   0.863712 1.467096   sample estimates: mean of the differences 1.165404 83

Paired Tests

>

Wilcoxon

continuity

wilcox.test(nvec,

signed

rank

correction

nvec2,

test

paired

with

=

TRUE)

nvec

V

alternative

data:

and

p-value

nvec2

=

=

4314,

7.776e-10

true

hypothesis:

location

shift

is

not

equ

with = TRUE) nvec V alternative data: and p-value nvec2 = = 4314, 7.776e-10 true hypothesis:

Density Estimation

> library(MASS)

> truehist(nvec,

nbins

=

20,

xlim

=

c(0,

+ 6),

ymax

=

0.7)

> lines(density(nvec,

width

=

"nrd"))

nbins = 20, xlim = c(0, + 6), ymax = 0.7) > lines(density(nvec, width = "nrd"))

Density Estimation

> data(geyser)

> geyser2

+

> attach(geyser2)

> par(mfrow

> plot(pduration,

<-

data.frame(as.data.frame(geyser)[-1,

=

geyser$duration[-299])

],

pduration

=

c(2,

2))

waiting,

xlim

=

c(0.5,

+ ylim

6),

=

c(40,

110),

xlab

=

"previous

duration

ylab

+ =

"waiting")

 

> f1

<-

kde2d(pduration,

waiting,

+ =

n

50,

lims

=

c(0.5,

6,

40,

+ 110))

> image(f1,

+

> kde2d(pduration,

+ 50,

n

+ 110),

+ width.SJ(waiting)))

zlim

=

c(0,

"waiting")

=

0.075),

xlab

=

"previous

durat

ylab

<-

=

f2

waiting,

6,

lims

h

=

=

c(0.5,

40,

c(width.SJ(duration),

l

b = " r i d 86 r t
b
=
"
r
i
d
86 r
t

> lim

im

(f2

=

(0

0

075)

Simple Linear Model

Fitting a simple linear model in R is done using the lm function

> data(hills)

>

> class(mhill)

mhill

<-

lm(time

~

dist,

data

=

hills)

[1]

"lm"

> mhill

Call:

lm(formula

=

time

~

dist,

data

=

hills)

Coefficients:

 

(Intercept)

dist

-4.841

8.330

= time ~ dist, data = hills) Coefficients:   (Intercept) dist -4.841 8.330 87

Simple Linear Model

> summary(mhill)

> names(mhill)

> mhill$residuals

Simple Linear Model > summary(mhill) > names(mhill) > mhill$residuals 88

Multivariate Linear Model

>

mhill2

<-

lm(time

~

dist

+

climb,

 

+ data

=

hills)

 

> mhill2

 

Call:

lm(formula

=

time

~

dist

+

climb,

data

=

hills)

Coefficients:

 

(Intercept)

 

dist

climb

 
 

-8.99204

 

6.21796

 

0.01105

  (Intercept)   dist climb     -8.99204   6.21796   0.01105 89

Multifactor Linear Model

> summary(mhill2)

> update(update(mhill2,

> predict(mhill2,

weights

])

hills[2:3,

=

1/hills$dist^2))

Model > summary(mhill2) > update(update(mhill2, > predict(mhill2, weights ]) hills[2:3, = 1/hills$dist^2)) 90

Bayes Rule

p(y|x) = p(x|y)p(y) p(x)

(1)

Consider the problem of email filtering: x is the email e.g. in the form of a word vector, y the label spam, ham

the problem of email filtering: x is the email e.g. in the form of a word

Prior p(y )

p(ham) m ham

m total ,

p(spam) m spam

m

total

(2)

Prior p ( y ) p ( ham ) ≈ m h a m m total

Likelihood Ratio

p(y|x) = p(x|y)p(y) p(x)

key problem: we do not know p(x |y ) or p(x ). We can get rid of p(x ) by settling for a likelihood ratio:

(3)

L(x) := p(spam|x) = p(x|spam)p(spam)

p(ham|x)

p(x|ham)p(ham)

(4)

Whenever L(x ) exceeds a given threshold c we decide that x is spam and consequently reject the e-mail

) (4) Whenever L ( x ) exceeds a given threshold c we decide that x

Computing p(x|y )

Key assumption in Naive Bayes is that the variables (or elements of x) are conditionally independent i.e. words in emails are independent of each other. We can then compute p(x