II YEAR – I SEMESTER
Unit-1 notes
Introduction to R language
Prepared by
S.S.R.K.M.GUPTA. M.Tech.,( Ph.D.), M.C.S.I.
Assistant Professor, CSE Department,
Aditya College of Engineering & Technology,
Surampalem.
STATISTICS WITH R PROGRAMMING
OBJECTIVE:
After taking the course, students will be able to
• Use R for statistical programming, computation, graphics, and modeling,
• Write functions and use R in an efficient way,
• Fit some basic types of statistical models
• Use R in their own research,
• Be able to expand their knowledge of R on their own.
BRIEF SYLLABUS
• UNIT-I: Introduction
• UNIT-II: R Programming Structures
• UNIT-III: Doing Math and Simulation in R
• UNIT-IV: Graphics
• UNIT-V: Probability and Basic Statistics
• UNIT-VI: Advanced Statistical Tools
OUTCOMES:
At the end of this course, students will be able to:
• List motivation for learning a programming language
• Access online resources for R and import new function packages into the R workspace
• Import, review, manipulate and summarize data-sets in R
• Explore data-sets to create testable hypotheses and identify appropriate statistical tests
• Perform appropriate statistical tests using R, Create and edit visualizations with R
TEXT BOOKS:
1) The Art of R Programming, A K Verma, Cengage Learning.
2) R for Everyone, Lander, Pearson.
3) The Art of R Programming, Norman Matloff, No Starch Press.
REFERENCE BOOKS:
1) R Cookbook, Paul Teetor, O'Reilly.
2) R in Action, Robert Kabacoff, Manning.
UNIT-1 - TOPICS
• Introduction
• How to run R
• R Sessions and Functions
• R basics : Basic Math, Variables, Data Types.
• Advanced Data Structures : Vectors , Data Frames, Lists, Matrices, Arrays, Classes.
Introduction
What is statistics?
• Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to
assist in making more effective decisions.
• Statistical analysis is used to manipulate, summarize, and investigate data so that it yields
information useful for decision-making.
• Types of statistics :
– Descriptive statistics – Methods of organizing, summarizing, and presenting data in an
informative way. These include measures of central tendency such as a) mean b) median c) mode, and
measures of variability such as a) range b) deviation c) variance d) standard deviation.
– Inferential statistics – Methods used to determine something about a population on the
basis of a sample. Inference is the process of drawing conclusions or making decisions
about a population based on sample results, e.g. estimation, hypothesis testing, etc.
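The descriptive measures listed above can be computed directly in R. The following is a minimal sketch on a small made-up sample; the vector name marks and its values are hypothetical. Base R has no built-in function for the statistical mode, so the sketch derives it from a frequency table.

```r
# Hypothetical sample data (not from the notes)
marks <- c(62, 75, 75, 81, 90)

# Measures of central tendency
avg <- mean(marks)                     # arithmetic mean: 76.6
mid <- median(marks)                   # middle value: 75
# Mode: the most frequent value, read off a frequency table
mode_value <- as.numeric(names(which.max(table(marks))))   # 75

# Measures of variability
rng    <- range(marks)                 # minimum and maximum: 62 90
spread <- diff(rng)                    # max - min: 28
v      <- var(marks)                   # sample variance: 104.3
s      <- sd(marks)                    # sample standard deviation, sqrt(v)

print(c(mean = avg, median = mid, mode = mode_value))
```

Note that var() and sd() use the sample (n - 1) denominator, which is the usual choice when the data are a sample from a larger population.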
What is R ?
• R is a scripting language for statistical data manipulation and analysis.
• It supports statistical computing and graphics for analyzing data and making decisions.
• It also has a large and highly flexible collection of graphing facilities for data display.
• “S” is a language that was developed by John Chambers in 1976 for internal statistical analysis.
• A GUI was later added to “S”, and the result was named “S-PLUS”.
• The “R” language is often referred to as “GNU S”.
• “R” was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, in
1993.
• “R” is named after the first letter of its two authors' first names, and the name is also a play on the
name of the “S” language.
• History and milestones of R:
– 1976 - “S” language was invented.
– 1983 – Version S3 is released with OOPs paradigm.
– 1988 - S-PLUS is first produced.
– 1993 - “R” was created by “Ross Ihaka” and “Robert Gentleman”.
– 1995 – R is released under the GNU General Public License
– 1997 – The R Core group is formed
– 2000 – version 1.0.0 is released
– 2014 – version 3.1.2 is released
– 2017 – version 3.4.0 is released
How to Run R
Installation:
How to install R and RStudio in different environments?
• Open the URL: https://cran.r-project.org
• Download the precompiled binary distribution of the base system from the links.
– Download R for Linux
– Download R for (Mac) OSX
– Download R for Windows
• Linux:
a) Ubuntu:
– >sudo apt-get update
– >sudo apt-get install r-base
– >sudo apt-get install r-base-dev
b) Redhat fedora:
– >sudo yum install R
– For R packages
– > yum list R-\*
– It lists all RPMs for additional packages
c) Debian:
– > apt-get update
– > apt-get install r-base r-base-dev
• (Mac)OS:
– Download the package file for R 3.4.0.pkg
– Double click on it and it will open the installer
• Windows:
– Select the sub directory: base (click on it)
– Click on the link, download R 3.4.0-win.exe
– Install it as per the directions given by it.
• Some popular IDEs for R-Language:
– RStudio
– Tinn-R
– Deducer
– Revolution R
– Text editors: Vim, Eclipse + StatET
• Installing RStudio: Download the latest version of RStudio from:
https://www.rstudio.com/products/rstudio/download/
Running R :
Explain the two modes to run R from the R IDE.
We can run the R environment in two modes.
a) Running R in interactive mode
• Open the shortcut R 64 3.4.0
• It opens the command window with the prompt ‘>’
• You can execute R commands
– e.g.
– > print("Welcome to R")
– [1] "Welcome to R"
• You can also run a .r file
– > source("sample.r") and press Enter
b) Running R in batch mode
• Sometimes it is preferable to automate the process of running R.
• We can run an R script automatically by typing the following at the operating system's command prompt:
• R CMD BATCH --vanilla [input file] [output file]
• Ex: R CMD BATCH --vanilla sample.r result.txt
• The --vanilla option tells R not to load any startup files or saved workspace, and not to save the workspace on exit.
Example Session:
• When we work in R, the objects we create and load are stored in an area of memory called the
workspace.
• If we answer “no” when asked to save the workspace on exit, all of these objects are lost; they are
wiped out of the workspace.
• If we answer “yes”, they are saved to a file called “.RData”, which is written to the present working
directory.
• The next time we start R in the same directory, the workspace and all the created objects are
restored automatically from the .RData file.
Listing the objects:
• ls() function is used to list objects in the workspace.
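As a small illustration of the workspace described above, the sketch below creates two objects, lists them with ls(), and removes one with rm(). The object names radius and label are hypothetical.

```r
# Hypothetical objects created during a session
radius <- 2.5
label  <- "circle"

objs <- ls()                         # names of the objects in the current workspace
print(objs)

rm(radius)                           # remove a single object from the workspace
still_there <- "radius" %in% ls()    # FALSE once the object has been removed
print(still_there)
```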
Functions:
• A function is a self-contained module of a program. It is called by its name, and its body is
executed each time it is called.
• We can pass input to a function through its parameter list.
Default arguments:
• R also makes frequent use of default arguments. Consider the (partial) function definition
• e.g. function(x, y = 2)
• y will be initialized to 2 if the programmer doesn’t specify y in the function call
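A minimal sketch of a function with a default argument follows. The function name raise_to is hypothetical and chosen only for illustration.

```r
# raise_to() is a hypothetical function; y defaults to 2 when it is not supplied
raise_to <- function(x, y = 2) {
  x^y
}

a  <- raise_to(3)              # y takes its default value: 3^2 = 9
b  <- raise_to(3, 3)           # default overridden by position: 3^3 = 27
c2 <- raise_to(y = 4, x = 2)   # arguments matched by name: 2^4 = 16

print(c(a, b, c2))
```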
R Programming Tokens
i) Reserved Keywords:
if else repeat while function
for in next break TRUE
FALSE NULL Inf NaN NA
NA_integer_ NA_real_ NA_complex_ NA_character_
ii) Identifiers:
Names of variables, methods, classes, etc
Rules:
• Identifiers can be a combination of letters, digits, period(.) and underscore (_)
• It must start with a letter or a period
• If it starts with a period, it can’t be followed by a digit
• Reserved keywords in R can’t be used as identifiers
• Identifiers are case sensitive and should not contain spaces
• Valid identifiers: total, sum, fine.with.dot, this_is_acceptable, number5
• Invalid Identifiers: tot@l, 5sum, _fine, TRUE, /ne
iii) Literals: constant values, which are normally assigned to variables.
• double – 0.3, 1.257, 12.0, .765, 12.75e+4
• integer – 10L, 0xF2CL (the L suffix marks an integer; a plain 10 is stored as a double)
• logical – TRUE, FALSE, T, F
• complex – 3.5+4.2i
• character – ‘a’, “a”, ‘hello’, “hello”
Special values:
• NA – missing elements, Not Available
• NaN – Not a number
• NULL – absence of object
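The difference between the three special values is easiest to see by experiment. The sketch below uses a made-up vector x; all functions are from base R.

```r
x <- c(4, NA, 7)               # NA marks a missing element

is.na(x)                       # FALSE  TRUE FALSE
mean(x)                        # NA: the missing value propagates
mean(x, na.rm = TRUE)          # 5.5: NA is dropped before averaging

0 / 0                          # NaN: the result is not a number
is.nan(0 / 0)                  # TRUE
is.na(0 / 0)                   # TRUE: NaN also counts as missing

length(NA)                     # 1: NA is a value occupying one position
length(NULL)                   # 0: NULL is the absence of an object
is.null(NULL)                  # TRUE
length(c(1, NULL, 3))          # 2: NULL disappears when combined
```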
iv) Operators:
Assignment operators:
<- left assignment, binary
-> Right assignment, binary
= left assignment but not recommended
<<- Left assignment in outer Lexical scope
Special operators:
$ list subset, binary
~ used in model formulae
: sequence, binary
:: refer to a function in a package
Arithmetic operators:
+ plus, can be unary or binary
- minus, can be unary or binary
* multiplication, binary
/ division, binary
^ exponentiation, binary
%% modulus, binary
%/% integer division, binary
%x% general form of the special binary operators, where x can be replaced by a name, for example:
%o% outer product, binary
%*% matrix product, binary
%in% matching operator, binary
Logical operators:
!x logical negation
x&y logical AND, element-wise
x&&y logical AND on single values (short-circuit)
x|y logical OR, element-wise
x||y logical OR on single values (short-circuit)
xor(x,y) element-wise exclusive OR
Relational operators:
< Less than
<= Less than or equal to
> greater than
>= greater than or equal to
== equal to
!= not equal to
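A few of the operators from the tables above in action. This is an illustrative sketch; the variable names m, x, and y are arbitrary.

```r
# Arithmetic operators
7 %% 3                             # modulus: 1
7 %/% 3                            # integer division: 2
2 ^ 3                              # exponentiation: 8

# Special binary operators
3 %in% c(1, 3, 5)                  # matching: TRUE
1:3 %o% 1:2                        # outer product: a 3 x 2 matrix
m <- matrix(1:4, nrow = 2)
m %*% m                            # matrix product

# Logical operators
c(TRUE, FALSE) & c(TRUE, TRUE)     # element-wise AND: TRUE FALSE
TRUE && FALSE                      # single-value AND: FALSE
xor(TRUE, FALSE)                   # exclusive OR: TRUE

# Assignment operators
x <- 5                             # left assignment
10 -> y                            # right assignment
```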
Basic Math:
Once you have the R environment set up, it is easy to start the R command prompt by typing
>R and pressing Enter
OR
clicking the shortcut R-64 on the desktop.
In the console, you can do basic math operations, using it as a calculator:
>2+3
[1] 5
>3*6
[1] 18
>"Hello welcome to R"
[1] "Hello welcome to R"
Declaring Variables:
>age<-20
>print(age)
[1] 20
>age
[1] 20
>name <- "Hari Krishna"
>name
[1] "Hari Krishna"
Printing Output:
>age <- 25
>name <- "Ramesh"
>print(paste("My name is", name))
[1] "My name is Ramesh"
>print(paste("My age is", age))
[1] "My age is 25"
>cat("My name is", name, "and my age is", age, "\n")
My name is Ramesh and my age is 25
Creating .R script:
File -> New Script -> opens a new editor
• Enter the below program:
name <- readline("Enter your name: ")
age <- readline("Enter your age: ")
print(paste("My name is", name))
print(paste("My age is", age))
• Now save the file as first.r
Running the .R script:
File->source R code->
• Select the file name as first.r
Or
• Type at the R command prompt
>source("~/first.r")
R – Data Types:
• When programming in any language, you need variables to
store various kinds of information.
• Variables are reserved memory locations that store values.
• You may want to store information of different data types, such as character, string, integer, floating
point, Boolean, etc.
• In the ‘C’ language, a variable is declared with a particular data type [like int, double, float, char], and
that variable can store only values of that type for as long as it is in scope.
• In R, however, a variable is not declared with a data type; rather, it takes the data type of the R object
or literal assigned to it.
• So R is called a dynamically typed language, which means we can change the data type of
the same variable again and again while using it in a program.
Example (data type, sample values, code, and output):
1. LOGICAL – TRUE, FALSE
>v <- TRUE
>print(class(v))
[1] "logical"
2. NUMERIC – 12.3, 5, 999
>v <- 23.5
>print(class(v))
[1] "numeric"
3. INTEGER – 2L, 34L, 0L
>v <- 2L
>print(class(v))
[1] "integer"
4. COMPLEX – 3+2i
>v <- 2+5i
>print(class(v))
[1] "complex"
5. CHARACTER – "a", 'a', "hello", 'hello'
>v <- "Welcome to R"
>print(class(v))
[1] "character"
6. RAW – "Hello" is stored as 48 65 6c 6c 6f
>v <- charToRaw("Hello")
>print(class(v))
[1] "raw"
Type checking functions:
Checks the data type of variables and returns the TRUE/FALSE
• is.numeric(variable)
• is.double(variable)
• is.logical(variable)
• is.complex(variable)
• is.character(variable)
• is.raw(variable)
• is.integer(variable)
e.g.
>x <- 10
>is.numeric(x)
[1] TRUE
>is.double(x)
[1] TRUE
>is.integer(x)
[1] FALSE
>y <- 25L
>is.integer(y)
[1] TRUE
Vectors:
• This is the most basic data structure.
• A contiguous sequence of data objects with a specific indexed order.
• A vector is called “atomic” because all objects stored in it have the same type.
• We can create a new vector using the c() function, which is short for “combine” or “concatenate”.
• Using the assignment operator “<-” we can assign an object and its values to a named variable.
• e.g.
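A minimal sketch of creating and indexing vectors with c(). The vector names and values below are hypothetical and used only for illustration.

```r
# Hypothetical data: c() combines its arguments into one vector
marks     <- c(78.5, 62.6, 91.8)
names_vec <- c("Prasad", "Kiran", "Lakshmi")

length(marks)            # 3
marks[2]                 # indexing starts at 1: 62.6
marks[c(1, 3)]           # several positions at once: 78.5 91.8
marks[marks > 70]        # logical indexing: 78.5 91.8

more <- c(marks, 55)     # c() also concatenates vectors
class(names_vec)         # "character"
```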
We can also store a sequence of numbers in a vector.
X <- 11: 15
X
[1] 11 12 13 14 15
seq( ) function:- generates a sequence of numbers
> X <- seq (-6,2)
>X
[1] -6 -5 -4 -3 -2 -1 0 1 2
From -6 to 7 , step=2:-
> X <- seq (-6 , 7 , by=2)
>X
[1] -6 -4 -2 0 2 4 6
With a smaller step by 0.3 :-
> X <- seq (-2 , 2 , by=0.3)
>X
[1] -2.0 -1.7 -1.4 -1.1 -0.8 -0.5 -0.2 0.1 0.4 0.7 1.0 1.3 1.6 1.9
> X <- seq (-2 , 2 , length.out=9) # specific number of elements
>X
[1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
rep ( ) function :repeating the data
> X <- rep (1:5)
>X
[1] 1 2 3 4 5
> X <- rep (1:5 , 2)
>X
[1] 1 2 3 4 5 1 2 3 4 5
> X <- rep (1:5 , each=2)
>X
[1] 1 1 2 2 3 3 4 4 5 5
> X <- rep.int (1:5 , 2)
>X
[1] 1 2 3 4 5 1 2 3 4 5
sample( ) function :- draws a random sample from the given values (the outputs below vary from run to run)
> X <- sample(1:8) # random permutation of 1:8
>X
[1] 8 4 7 2 3 6 5 1
> X <- sample(1:8 , replace = TRUE) # values may repeat
>X
[1] 7 6 2 4 1 3 1 1
> X <- sample (10:25 , size=5)
>X
[1] 22 14 16 19 15
> X <- sample.int (30 , size =4) # sample.int(n, size) draws from 1:n
>X
[1] 25 22 29 20
• A vector can hold elements of one type only; we cannot store elements of mixed data types.
• When we store both logical and numeric elements in one vector, the logical elements are
automatically converted to numeric.
• This automatic conversion to a more general type is known as coercion.
• The automatic type conversion follows this order:
logical->integer->double->complex->character
>a<-c(10L,T)
>typeof(a)
[1] "integer"
>a<-c(10L,9.5,T)
>typeof(a)
[1] "double"
>a<-c(10L,9.5,T,3+4i)
>typeof(a)
[1] "complex"
>a<-c(10L,9.5,T,3+4i,"hello")
>typeof(a)
[1] "character"
MATRICES
• Matrices are R objects in which elements of the same atomic type are arranged in a two-dimensional
rectangular layout.
• The basic syntax for creating a matrix is:
• matrix(data, nrow, ncol, byrow, dimnames)
– data is the input vector which becomes the data elements of the matrix
– nrow is the number of rows to be created
– ncol is the number of columns to be created
– byrow is a logical value; if TRUE, the input vector elements are arranged by row (row-major
matrix), otherwise they are filled column by column
– dimnames are the names assigned to the rows and columns
CREATING MATRIX:
>m1 <- matrix(1:6,nrow=2)
> m1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
>m2 <- matrix(1:6,nrow=2,byrow=TRUE)
>m2
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
>m3 <- matrix(1:6,ncol=2)
>m3
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
>m4 <- matrix(1:3,nrow=2,ncol=3)
>m4
[,1] [,2] [,3]
[1,] 1 3 2
[2,] 2 1 3
>rownames(m4) <- c("row1","row2")
>m4
[,1] [,2] [,3]
row1 1 3 2
row2 2 1 3
>colnames(m4) <- c("col1","col2","col3")
>m4
col1 col2 col3
row1 1 3 2
row2 2 1 3
>m5<-matrix(1:6,nrow=2,dimnames=list(c("row1","row2"),
c("col1","col2","col3")))
>m5
col1 col2 col3
row1 1 3 5
row2 2 4 6
>dimnames( m5 )
[[1]]
[1] "row1" "row2"
[[2]]
[1] "col1" "col2" "col3"
>mat1 <- matrix(letters[1:6],nrow=2)
>mat1
[,1] [,2] [,3]
[1,] "a" "c" "e"
[2,] "b" "d" "f"
>mat2 <- matrix(1:6,nrow=2)
>mat2
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
>typeof(mat1)
[1]"character"
>typeof(mat2)
[1]"integer"
>mat3 <- rbind(mat1,mat2) # integers are coerced to character
>mat3
[,1] [,2] [,3]
[1,] "a" "c" "e"
[2,] "b" "d" "f"
[3,] "1" "3" "5"
[4,] "2" "4" "6"
>typeof(mat3)
[1]"character"
Matrix Sub-setting:
> m1 <- matrix(21:32,nrow=3)
>m1
[,1] [,2] [,3] [,4]
[1,] 21 24 27 30
[2,] 22 25 28 31
[3,] 23 26 29 32
>m1[1:2, ] # accessing 1st row and 2nd row
[,1] [,2] [,3] [,4]
[1,] 21 24 27 30
[2,] 22 25 28 31
>m1[ ,2:3] # accessing 2nd column and 3rd column
[,1] [,2]
[1,] 24 27
[2,] 25 28
[3,] 26 29
>m1[1:2,3:4]
[,1] [,2]
[1,] 27 30
[2,] 28 31
>m1[c(F,F,T),c(F,F,T,T)]
[1] 29 32
OPERATIONS ON MATRICES:
# Creating zero matrix:
>m1 <- matrix(0,nrow=2,ncol=3)
>m1
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
# Creating unity matrix (all elements 1):
>m1 <- matrix(1,nrow=2,ncol=3)
>m1
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
# Creating identity matrix:
>m3 <- diag(3)
>m3
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
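Beyond creating special matrices, the common arithmetic operations on matrices can be sketched as follows. The matrix names a and i2 are arbitrary; all functions used are from base R.

```r
a  <- matrix(1:4, nrow = 2)    # filled column-wise: rows are (1, 3) and (2, 4)
i2 <- diag(2)                  # 2 x 2 identity matrix

dim(a)                         # 2 2
t(a)                           # transpose: rows become columns
a + a                          # element-wise addition
a * a                          # element-wise product (not the matrix product)
a %*% i2                       # matrix product; multiplying by the identity returns a
a %*% a                        # matrix product of a with itself
rowSums(a)                     # 4 6
colSums(a)                     # 3 7
```

The distinction between * (element by element) and %*% (linear-algebra product) is a frequent source of mistakes, so it is worth trying both on the same matrix.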
Data Frames:
• Real-world data sets are usually in tabular form, and such data can be stored using data frames.
• A data frame arranges tabular data in rows and columns.
• data.frame() is the function used to create a data frame.
• data.frame(vector1, vector2, ..., stringsAsFactors=FALSE)
• All vectors should be of the same length.
• By default, string vectors are stored as factors in R versions before 4.0.0 (including the 3.4.0 release
used in these notes); from R 4.0.0 onward they are kept as character vectors by default.
• The option stringsAsFactors=FALSE stores string vectors as they are.
Creating Data Frames:
>rno<-c(501,502,503,504)
>sname<-c("Prasad","Kiran","Lakshmi","mohan")
>age<-c(23,22,21,21)
>marks<-c(78.5,62.6,91.8,97.2)
>student_data<-data.frame(rno, sname, age,
marks, stringsAsFactors=FALSE)
>student_data
rno sname age marks
1 501 Prasad 23 78.5
2 502 Kiran 22 62.6
3 503 Lakshmi 21 91.8
4 504 mohan 21 97.2
>student_data$sname #accessing a column
[1] "Prasad" "Kiran" "Lakshmi" "mohan"
>str(student_data)
'data.frame': 4 obs. of 4 variables:
$ rno : num 501 502 503 504
$ sname: chr "Prasad" "Kiran" "Lakshmi" "mohan"
$ age : num 23 22 21 21
$ marks: num 78.5 62.6 91.8 97.2
>nrow(student_data)
[1] 4
>ncol(student_data)
[1] 4
>names(student_data)
[1] "rno" "sname" "age" "marks"
#changing column name
>colnames(student_data)[1]<-"rollno"
>colnames(student_data)
[1] "rollno" "sname" "age" "marks"
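Typical operations on a data frame include selecting rows and columns, filtering by a condition, and adding a derived column. The sketch below rebuilds the student records from the notes so that it is self-contained; the grade column and its threshold of 75 are assumptions made only for illustration.

```r
# Student records as in the notes, rebuilt so the sketch runs on its own
student_data <- data.frame(
  rno   = c(501, 502, 503, 504),
  sname = c("Prasad", "Kiran", "Lakshmi", "mohan"),
  age   = c(23, 22, 21, 21),
  marks = c(78.5, 62.6, 91.8, 97.2),
  stringsAsFactors = FALSE
)

student_data[2, ]                                    # second row
student_data[, "marks"]                              # one column as a vector
toppers <- student_data[student_data$marks > 90, ]   # rows meeting a condition
toppers$sname                                        # "Lakshmi" "mohan"

# Add a derived column; the threshold of 75 is an arbitrary example
student_data$grade <- ifelse(student_data$marks >= 75, "A", "B")
mean(student_data$marks)                             # 82.525
```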
Lists:
• In the R language, vectors, matrices, and arrays are used to store homogeneous elements only.
• To store heterogeneous elements, we can use two types of data structures – lists and data
frames.
• A list contains different types of objects, such as strings, numbers, vectors, matrices, and
functions.
• In other words, a list is a generic vector containing other objects.
• A list can be included as a sub-list into another list.
• A list is an ordered object, and we can access its elements by index.
• A list is created using the list() function.
Creating a list:
> emp1 <- list("Ravi", 50000, TRUE)
> emp1
[[1]]
[1] "Ravi"
[[2]]
[1] 50000
[[3]]
[1] TRUE
# The [[i]] operator is used to access the element of a list at index i.
> emp1[[2]]
[1] 50000
> length(emp1)
[1] 3
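List components can also be named, which allows access by name as well as by index. The following sketch shows naming, accessing, adding, modifying, and removing components; the component names and values are illustrative.

```r
# Named components (names and values here are illustrative)
emp <- list(name = "Ravi", salary = 50000, permanent = TRUE)

emp$name                 # access by name: "Ravi"
emp[["salary"]]          # same idea with [[ ]]: 50000
emp["salary"]            # single [ ] returns a list of length 1, not the value

emp$dept <- "CSE"        # add a new component
emp$salary <- 55000      # modify an existing component
emp$permanent <- NULL    # assigning NULL removes a component

names(emp)               # "name" "salary" "dept"
length(emp)              # 3
```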
Recursive Lists:
> list3 <- c(list(a=10, b=12, c = list(d=34,e=21)), recursive=TRUE)
> list3
a b c.d c.e
10 12 34 21
How to create arrays of n dimensions in R?
Arrays:
• An array is a collection of data of the same type arranged in N dimensions.
• The function to create an array is:
• array(data, dim, dimnames)
• data – input vector
• dim – to define dimensions
• dimnames – to assign names to the dimensions
• Ex:
• >arr1 <- array(11:18, dim = c ( 3,3,2)) # the 8 values are recycled to fill the 3 x 3 x 2 = 18 cells
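A slightly fuller sketch of a three-dimensional array, with enough data to fill every cell and with names for each dimension. The dimension names used here are hypothetical.

```r
# 24 values arranged as 3 rows x 4 columns x 2 layers (filled column by column)
arr <- array(1:24, dim = c(3, 4, 2),
             dimnames = list(c("r1", "r2", "r3"),
                             c("c1", "c2", "c3", "c4"),
                             c("layer1", "layer2")))

dim(arr)                     # 3 4 2
arr[2, 3, 1]                 # row 2, column 3, layer 1: 8
arr["r1", "c1", "layer2"]    # access by dimension names: 13
layer2 <- arr[, , 2]         # a whole layer is an ordinary 3 x 4 matrix
sum(arr[, , 1])              # sum of 1..12: 78
```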
S3 class:
• S3 is the most popular and prevalent class system in the R language.
• Most of the classes that come predefined in R are of this type.
• S3 classes have no formal, predefined definition.
• Basically, a list with its class attribute set to some class name is an S3 object.
• The components of the list become the member variables of the object.
> # creating list with required elements
> s <- list(name="Ravi", age=21, gpa=3.5)
> # name the class appropriately
> class(s) <- "student"
>s
$name
[1] "Ravi"
$age
[1] 21
$gpa
[1] 3.5
attr(,"class")
[1] "student"
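In S3, behaviour is attached to a class by writing methods for generic functions. The sketch below defines a print method for the "student" class; the method body and its output format are assumptions made for illustration.

```r
# Build the object and set its class attribute in one step
s <- structure(list(name = "Ravi", age = 21, gpa = 3.5), class = "student")

# A method for the existing generic print(); R dispatches on the class attribute
print.student <- function(x, ...) {
  cat("Student:", x$name, "| age:", x$age, "| GPA:", x$gpa, "\n")
  invisible(x)
}

print(s)                 # calls print.student()
inherits(s, "student")   # TRUE
s$gpa                    # list components are still reachable with $: 3.5
```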
S4 class:
• S4 classes have a formal class definition and a uniform way to create objects.
setClass("student",
slots= list(name="character", age="numeric", gpa="numeric"))
# Creating object of the class "student"
s<-new("student", name="Ravi", age=21, gpa=3.5)
>s
An object of class "student"
Slot "name":
[1] "Ravi"
Slot "age":
[1] 21
Slot "gpa":
[1] 3.5
Accessing and modifying slot of an object:
>s@name
[1] "Ravi"
>s@gpa<-5.2
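Because an S4 class has a formal definition, R checks slot types when an object is created. The following is a small sketch of that behaviour; the variable name result and the string "rejected" are illustrative.

```r
library(methods)   # S4 support; already attached in interactive sessions

setClass("student",
         slots = list(name = "character", age = "numeric", gpa = "numeric"))

s <- new("student", name = "Ravi", age = 21, gpa = 3.5)
s@gpa <- 3.8                    # modify a slot
slot(s, "age")                  # alternative accessor: 21
isVirtualClass("student")       # FALSE: objects can be created from it
is(s, "student")                # TRUE

# Slot types are checked: a number is not accepted for the character slot
result <- tryCatch(new("student", name = 42),
                   error = function(e) "rejected")
result                          # "rejected"
```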
Reference Classes:
• Unlike S3 and S4 classes, methods belong to the class rather than to generic functions.
Defining Reference Classes:
> student <- setRefClass("student", fields=list(name="character", age="numeric", gpa="numeric"))
> s <- student(name="Ravi", age=22, gpa=3.5)
>s
Reference class object of class "student"
Field "name":
[1] "Ravi"
Field "age":
[1] 22
Field "gpa":
[1] 3.5
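The sketch below shows a method defined inside a reference class, together with the reference semantics that set these classes apart: assigning an object to a second variable does not copy it. The class name Student and the method raise_gpa are illustrative, not part of the notes.

```r
library(methods)

# "Student" with a method raise_gpa(); both names are illustrative
Student <- setRefClass("Student",
  fields  = list(name = "character", gpa = "numeric"),
  methods = list(
    raise_gpa = function(by) {
      gpa <<- gpa + by          # <<- modifies the field of this object
      invisible(.self)
    }
  )
)

s1 <- Student$new(name = "Ravi", gpa = 3.5)
s2 <- s1                        # s2 refers to the same object, not a copy
s2$raise_gpa(0.2)
s1$gpa                          # 3.7: the change is visible through s1 as well

s3 <- s1$copy()                 # an independent copy
s3$gpa <- 4.0
s1$gpa                          # still 3.7
```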
Part A
a) What is variable scope?
b) List the differences between vector and list.
c) What are the different modes of working with R?
d) List the data structures in R.
e) Create a 3-dimensional array in R.
f) Create a simple matrix with 3X3 size in R.
g) Write about vectors in R
h) Write about type conversions in R?
i) Explain the importance of data frame?
j) What are the data structures in R that are used to perform statistical analyses and create graphs?
k) Write about linear vector algebra operations.
l) Explain different matrix operation functions in R?
Part B
1. a) How can basic arithmetic be carried out in R? Explain with an example each.
b) What is a vector? How is it created? Create a vector X of elements 5, 2, -1, 7, 4, 8, 12 and from it
create a vector Y containing the elements of X that are greater than 4.
2. a) How a data frame is different from a list? Create a data frame of seven days in a week showing
minimum temperatures on that day.
b) What is the difference between NA and NULL values? How to handle them?
3. a) Define a data frame and distinguish it from a matrix object in R
b) Explain in detail about vectors in R.
4.a) Discuss about matrices in R.
b) Explain Datatypes in R
5. a) Explain in detail about dataframe and arrays with example R code.
b) Explain list data structure and its operation with example.
6. a) What is a vector in R? Explain operations on vectors.
b) Explain different data structures in R.
7.a) Write about data frame? Write about operations on data frame.
b) Explain about variables, constants and Data Types in R Programming
8. a) How to create, name, access, merge, and manipulate list elements? Explain with examples.
b) Explain different types of classes with examples?
Assignment:
PART A:
PART B: