
Business Analytics

Analytics

 Analytics can be defined as a process that involves the use of statistical techniques
(measures of central tendency, graphs, and so on), information system software (data
mining, sorting routines), and operations research methodologies (linear programming) to
explore, visualize, discover and communicate patterns or trends in data.

 Analytics convert data into useful information.

Application Of Analytics In Business Functions


Skills of a Business Analyst

 1. Understand Your Objectives.

 2. Good Verbal Communication Skills.

 3. The Ability To Run Stakeholder Meetings.

 4. Be A Good Listener.

 5. Hone Your Presentation Skills.

 6. Be Excellent At Time Management.

 7. Documentation And Writing Skills.

 8. Stakeholder Management.

 9. Develop Your Modelling Skills.


Types of Analytics

 1. Descriptive: The application of simple statistical techniques that describes what is


contained in a data set or database.

 Example: An age bar chart is used to depict retail shoppers for a department store that
wants to target advertising to customers by age.

Purpose of Descriptive Analytics

 The purpose of descriptive analytics is to identify possible trends in large data sets or databases.

 The purpose is to get a rough picture of what generally the data looks like and what
criteria might have potential for identifying trends or future business behavior.

 Descriptive statistics, including measures of central tendency (mean, median, mode),


measures of dispersion (standard deviation), charts, graphs, sorting methods, frequency
distributions, probability distributions, and sampling methods
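As a minimal sketch in R (the shopper ages below are made-up data, not from the source), these descriptive techniques look like this:

```r
# Hypothetical ages of 10 retail shoppers (made-up data for illustration)
ages <- c(22, 25, 31, 34, 34, 41, 47, 52, 58, 63)

mean(ages)     # measure of central tendency: mean
median(ages)   # measure of central tendency: median
sd(ages)       # measure of dispersion: standard deviation

# Frequency distribution of ages by decade, as for an age bar chart
table(cut(ages, breaks = c(20, 30, 40, 50, 60, 70)))
```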

 2. Predictive: An application of advanced statistical, information software, or operations


research methods to identify predictive variables and build predictive models to identify
trends and relationships not readily observed in a descriptive analysis.

 Example: Multiple regression is used to show the relationship (or lack of relationship)
between age, weight, and exercise on diet food sales. Knowing that relationships exist
helps explain why one set of independent variables influences dependent variables such
as business performance.

Purpose of Predictive analytics

 To build predictive models designed to identify and predict future trends. Statistical
methods like multiple regression and ANOVA.

 Information system methods like data mining and sorting. Operations research methods
like forecasting models.
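A hedged sketch of predictive modeling in R: multiple regression with lm() on the built-in mtcars dataset (standing in for the diet-food example, whose data is not included in the source):

```r
# Multiple regression: predict fuel efficiency (mpg) from weight (wt)
# and horsepower (hp) using the built-in mtcars dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)   # coefficients, R-squared, p-values

# Use the fitted model to predict mpg for a (hypothetical) new car
predict(fit, newdata = data.frame(wt = 3.0, hp = 110))
```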

 3. Prescriptive: An application of decision science, management science, and operations


research methodologies (applied mathematical techniques) to make best use of allocable
resources.

 Example: A department store has a limited advertising budget to target customers. Linear
programming models can be used to optimally allocate the budget to various advertising
media.

Purpose of Prescriptive analytics

 To allocate resources optimally to take advantage of predicted trends or future


opportunities.

 Operations research methodologies like linear programming and decision theory.
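The department-store example can be sketched in base R as a toy brute-force allocation; a real model would use a linear-programming solver such as lp() from the lpSolve package, and the budget and per-unit returns below are made up:

```r
# Toy budget allocation: split a 10-unit advertising budget between two
# media with (made-up) returns of 3 per unit for TV and 2 per unit for
# print, subject to at most 6 units of TV.
best <- NULL
for (tv in 0:6) {
  print_units <- 10 - tv             # spend the rest of the budget on print
  value <- 3 * tv + 2 * print_units  # total (hypothetical) return
  if (is.null(best) || value > best$value) {
    best <- list(tv = tv, print = print_units, value = value)
  }
}
best  # optimal: tv = 6, print = 4, value = 26
```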


Business analytics

 Business analytics begins with a data set (a simple collection of data or a data file) or
commonly with a database (a collection of data files that contain information on people,
locations, and so on).
Types of Digital Data

 1. Structured data – Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository that is typically a database.

 It covers all data that can be stored in a SQL database, in tables with rows and columns.

 Structured data has relational keys and can easily be mapped into pre-designed fields.

 Today, such data is the most processed in development and the simplest way to manage information. Example: relational data.

 Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze.

 With some processing, you can store it in a relational database (this can be very hard for some kinds of semi-structured data), but semi-structured formats exist to save space. Example: XML data.

 Unstructured data is data that is not organized in a pre-defined manner or does not have a pre-defined data model, so it is not a good fit for a mainstream relational database.

 For unstructured data, there are alternative platforms for storing and managing it. Unstructured data is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Examples: Word documents, PDFs, text, media logs.

Unstructured data

Limited indication of data types

E.g., web pages in html contain some unstructured data

Figure shows part of HTML document representing unstructured data


Types of Digital Data

 Structured data

 Information stored DB

 Strict format

 Limitation

 Not all data collected is structured

 Semi-structured data

 Data may have certain structure but not all information collected has identical
structure

 Some attributes may exist in some of the entities of a particular type but not in
others
 Unstructured data

 Very limited indication of data type

 E.g., a simple text document

Semi Structured Data

Figure represents semi-structured data as a graph

Note: difference between the two workers' data


Big data

 Big data describes the collection of data sets that are so large and complex that software
systems are hardly able to process them (Isson and Harriott, 2013, pp. 57–61).

Little Data

 Isson and Harriott (2013, p. 61) define little data as anything that is not big data.

 Little data describes the smaller data segments or files that help individual businesses
keep track of customers.

 As a means of sorting through data to find useful information, the application of analytics
has found new purpose.
Command Schedule

 R itself is a powerful language that performs a wide variety of functions, such as data
manipulation, statistical modeling, and graphics.

 One really big advantage of R, however, is its extensibility. Developers can easily write
their own software and distribute it in the form of add-on packages.

 R is an interpreted language, which means that — contrary to compiled languages like C


and Java — you don’t need a compiler to first create a program from your code before
you can use it.

Features

 Console: In the bottom-left corner, you find the console. The console in RStudio is identical to the console in RGui. This is where you do all the interactive work with R.

 Workspace and history: The top-right corner is a handy overview of your workspace,
where you can inspect the variables you created in your session, as well as their values.
This is also the area where you can see a history of the commands you’ve issued in R.

 Files, plots, package, and help: In the bottom-right corner, you have access to several
tools:

 Files: This is where you can browse the folders and files on your computer.

 Plots: This is where R displays your plots (charts or graphs).

 Packages: This is where you can view a list of all the installed packages.

 Help: This is where you can browse the built-in Help system of R.

 Below all this information is the R prompt, denoted by a > symbol.

 The prompt indicates where you type your commands to R; you see a blinking cursor to
the right of the prompt.

 To quit your R session, type the following code in the console, after the command
prompt (>): > q()

 R is case-sensitive.

Simple Arithmetic Calculations

 > 24+7+11
 [1] 42

 > 5*2

 [1] 10

 > 25/5

 [1] 5

Printing Message

 > print("Hello R students of IPE!")

 [1] "Hello R students of IPE!"

Vector

 A vector is the simplest type of data structure in R. The R manual defines a vector as “A
single entity consisting of a collection of things.”

 You also can construct a vector by using operators. An operator is a symbol you stick
between two values to make a calculation.

 The symbols +, -, *, and / are all operators, and they have the same meaning they do in
mathematics.

 Think of a vector as a row or column of numbers or text. The list of numbers {1,2,3,4,5}, for example, could be a vector of length 5.

 Unlike most other programming languages, R allows you to apply functions to the whole
vector in a single operation without the need for an explicit loop.

 To construct a vector, type the following in the console:

 > c(1,2,3,4,5)

 [1] 1 2 3 4 5

 > x <- 1:5

 >x

 [1] 1 2 3 4 5

 The entries inside the parentheses are referred to as arguments.

 Next, we’ll add the value 2 to each element in the vector x and print the result:
 >x+2

 [1] 3 4 5 6 7

 > sum(1:5)

 [1] 15

 One very handy operator is called sequence, and it looks like a colon (:). Type the
following in your console:

 > 1:5

 [1] 1 2 3 4 5

 You can also add one vector to another. To add the values 6:10 element-wise to x, you do
the following:

 > x + 6:10

 [1] 7 9 11 13 15

 This feature of R is extremely powerful because it lets you perform many operations in a
single step.

Storing and calculating values

 > x <- 1:5

 >x

 [1] 1 2 3 4 5

 In R, the assignment operator is <-, which you type in the console by using two
keystrokes: the less-than symbol (<) followed by a hyphen (-). The combination of these
two symbols represents assignment.

 > y <- 10

 >x+y

 [1] 11 12 13 14 15

 > assign("j", 4)

 > j

 [1] 4
Calculations

 Now create a new variable z, assign it the value of x+y, and print its value:

 > z <- x + y

 >z

 [1] 11 12 13 14 15

 You must present text or character values to R inside quotation marks —either single or
double. R accepts both.

 So both h <- "Hello" and h <- 'Hello' are examples of valid R syntax.

Packages

 https://cran.r-project.org/web/packages/available_packages_by_name.html

 A package is essentially a library of prewritten code designed to accomplish some task or


a collection of tasks.

 Install and Load Package Commands

 > install.packages("package name")

 > library("package name")

Installation

 Installation

 Tools->Install Packages ->install dependencies

 Loading

 Check the box next to the package name in the Packages pane

 Command

 install.packages("coefplot")

 Unloading Package

 detach("package:ggplot2", unload=TRUE)


Uninstall Packages

 remove.packages(pkgs, lib)

 Arguments

 pkgs : a character vector with the names of the packages to be removed.

 lib : a character vector giving the library directories to remove the packages from. If
missing, defaults to the first element in .libPaths().

 remove.packages("coefplot")

Concatenate Strings/Combining Text

 c() function is used to combine numeric and text values into vectors.

 > cy<- c("Hello", "world!")

 > cy

 [1] "Hello" "world!"

 You can use the paste() function to concatenate multiple text elements. By default, paste() puts a space between the different elements, like this:

 > paste("Hello", "world!")

 [1] "Hello world!"

Talking back to the user

 readline() prompts the user for an input and stores the input as a character vector

 readline("User Name")
 User NameArjun Reddy

 [1] "Arjun Reddy"

Sourcing a script (send the entire script to the console)

 In RStudio, click anywhere in the source editor, and press Ctrl+Shift+Enter or click the Source button in the console.

Multiple Commands

 Use the ls() function to list the objects in the workspace. In the console, type the following:

 ls()

 [1] "cy" "x" "y"

 Select the 'History' tab to show previous operations.

 To remove it permanently, use the rm() function

 rm(cy)

 >m<- c(1,2,3,4,5)

 >m

 >sum(m)

 > firstnames <- c("Joris", "Carolien", "Koen")

 > lastname <- "Meys"

 > paste(firstnames, lastname)

 [1] "Joris Meys" "Carolien Meys" "Koen Meys"

Choosing a correct name

 Names must start with a letter or a dot. If you start a name with a dot, the second
character can’t be a digit.

 Names should contain only letters, numbers, underscore characters (_),

 and dots (.). Although you can force R to accept other characters in names, you

 shouldn’t, because these characters often have a special meaning in R.


 R doesn’t use the dot (.) as an operator,

 so the dot can be used in names for objects as well. This style is called dotted style.

 Ex: print.default()

Naming Convention in R

Order of Operations

 1. Exponentiation

 2. Multiplication and division in the order in which the operators are presented

 3. Addition and subtraction in the order in which the operators are presented

 The mod operator (%%) and the integer division operator (%/%) bind more tightly than the normal division operator (/), so use parentheses when mixing them in calculations.

 You can change the order of the operations by using parentheses i.e. ().
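A few console checks of this precedence order:

```r
# Exponentiation happens before multiplication, before addition
2 + 3 * 4^2        # 4^2 = 16, then 3*16 = 48, then 2+48 = 50

# Parentheses change the order
(2 + 3) * 4^2      # 5 * 16 = 80

# Mod and integer division
7 %% 3             # remainder: 1
7 %/% 3            # integer division: 2
```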
Calculating Logarithms and Exponentials

 In R, you can take the logarithm of the numbers from 1 to 3 like this:

 > log(1:3)

 [1] 0.0000000 0.6931472 1.0986123

 Whenever you use one of these functions, R calculates the natural logarithm if you don’t
specify any base.

 You calculate the logarithm of these numbers with base 6 like this:

 > log(1:3,base=6)

 [1] 0.0000000 0.3868528 0.6131472

Log and Inverse

 The inverse operation of log() is exp().

 >x<-log(1:20)

 > exp(x)

Manipulating Operators

 Actually, operators are also functions. It helps to know that operators can, in many cases, be treated just like any other function if you put the operator between backticks and add the arguments between parentheses:

 > `+`(4,6)

 [1] 10

 This may be useful later on when you want to apply a function over rows, columns, or
subsets of your data

Vector Types

 Numeric vectors, containing all kind of numbers.

 Integer vectors, containing integer values. (An integer vector is a special kind

 of numeric vector.)

 Logical vectors, containing logical values (TRUE and/or FALSE)

 Character vectors, containing text


 Datetime vectors, containing dates and times in different formats

 Factors, a special type of vector to work with categories.

 > o<-c(1,2,3,4,5)

 >o

 [1] 1 2 3 4 5

 > is.numeric(o)

 [1] TRUE

 > is.integer(o)

 [1] FALSE

 > seq(from = 4.5, to = 2.5, by = -0.5)

 [1] 4.5 4.0 3.5 3.0 2.5

 Alternatively, you can specify the length of the sequence by using the argument
length.out.

 R calculates the step size itself.

 > seq(from = -2.7, to = 1.3, length.out = 9)

 [1] -2.7 -2.2 -1.7 -1.2 -0.7 -0.2 0.3 0.8 1.3

Understanding indexing in R

 > numbers <- 30:1

 >numbers

 [1] 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

 The [1] is called the index; it shows the position within the vector of the first value on that line of output.

 > numbers[10]

 [1] 21

Using trigonometric functions

 All trigonometric functions are available in R: the sine, cosine, and tangent functions and
their inverse functions. You can find them on the Help page you reach by typing ?Trig.
 So, you may want to try to calculate the cosine of an angle of 120 degrees like this:

 > cos(120)

 [1] 0.814181

 This code doesn’t give you the correct result, however, because R always works with
angles in radians, not in degrees.

 Instead, use a special variable called pi. This variable contains the value of, you guessed it, π (3.141592653589...).

 The correct way to calculate the cosine of an angle of 120 degrees, then, is this:

 > cos(120*pi/180)

 [1] -0.5

 > 2/0

 [1] Inf

 To check whether a value is finite, use the functions is.finite() and is.infinite().

 Ex: >is.infinite(x)

 > 0/ 0

 >[1]Nan

Structure of a vector

 The str() function gives you the type and structure of the object.

 > fn <- c("Hello", "Mello", "Olo")

 > str(fn)

 chr [1:3] "Hello" "Mello" "Olo"

 First, it tells you that this is a num (numeric)/char(character) type of vector.

 Next to the vector type, R gives you the dimensions of the vector. This example has only
one dimension, and that dimension has indices ranging from 1 to 3.

 Finally, R gives you the first few values of the vector. If you want to know only how long
a vector is, you can simply use the length() function.

Combining Vectors

 > i<-c("KK",1,2,3)
 >i

 [1] "KK" "1" "2" "3"

 > j<-c("LL",6,7,8)

 > k<-c(i,j)

 > k

 [1] "KK" "1" "2" "3" "LL" "6" "7" "8"

 **The c() function stands for concatenate. It doesn’t create vectors — it just combines
them.

 c() function maintains the order of the numbers.

Repeating Vectors

 > rep(k, times=3)

 [1] "KK" "1" "2" "3" "LL" "6" "7" "8" "KK" "1" "2" "3" "LL" "6" "7" "8" "KK" "1" "2"
"3" [21] "LL" "6" "7" "8"

 > rep(c(0, 0, 7), times = 3)

 You also can repeat every value by specifying the argument each, like this:

 > rep(c(2, 4, 2), each = 3)

 [1] 2 2 2 4 4 4 2 2 2

 p<-c(0,0,99)

 rep(p,times=3)

 [1] 0 0 99 0 0 99 0 0 99

 R has a little trick up its sleeve. You can tell R for each value how often it has to be
repeated. To take advantage of that magic, tell R how often to repeat each value in a
vector by using the times argument:

 > rep(c(0, 7), times = c(4,2))

 [1] 0 0 0 0 7 7

 use the argument length.out to tell R how long you want it to be.

 > rep(1:3,length.out=7)

 [1] 1 2 3 1 2 3 1
Vector Manipulations

 r<-c(1,3,5,7,9)

 s<-c(2,4,6,8,10)

 >r

 [1] 1 3 5 7 9

 >s

 [1] 2 4 6 8 10

 r[3]<-111

 s[4]<-222

 >r

 [1] 1 3 111 7 9

 >s

 [1] 2 4 6 222 10

 > r.copy<-r

 > s.copy<-s

 Reassign the vector

 >r<-r.copy

 >s<-s.copy

Function What It Does

sum(x) Calculates the sum of all values in x

prod(x) Calculates the product of all values in x

min(x) Gives the minimum of all values in x

max(x) Gives the maximum of all values in x

cumsum(x) Gives the cumulative sum of all values in x

cumprod(x) Gives the cumulative product of all values in x

cummin(x) Gives the minimum for all values in x from the start of the vector until the position of that value

cummax(x) Gives the maximum for all values in x from the start of the vector until the position of that value

diff(x) Gives, for every value, the difference between that value and the next value in the vector

Comparison of Values using Logical Operators

 > x<-c(3,5,7,9)

 >y<-c(1,2,4,5)

 >x

 > sum(x)

 [1] 24
 x>100

 [1] FALSE FALSE FALSE FALSE

 > x>4

 [1] FALSE TRUE TRUE TRUE

Which Function

 The which() function takes a logical vector as argument. Hence, you can save the
outcome of a logical vector in an object and pass that to the which() function, as in the
next example.

 You also can use all these operators to compare vectors value by value.

 > which(x>5)

 [1] 3 4

 >z<-x>y

 >z

 [1] TRUE TRUE TRUE TRUE

 > which(z)

 [1] 1 2 3 4

 u<- x-y

 u

 [1] 2 3 3 4

Reading and Writing

 >y<-c("tt", "rr", "ii")

 >y

 [1] "tt" "rr" "ii"

 >is.character(y)

 [1] TRUE

 >y<-"PGDM Program at IPE"

 >is.character(y)

 [1] TRUE

Properties of Character Vector

 > length(y)

 [1] 3

 > nchar(y)

 [1] 2 2 2

The length() function tells you that y has length 3, and nchar() tells you that each of the 3 elements of y has 2 characters (here y is the vector c("tt", "rr", "ii"))

Subsetting

 The process of referring to a subset of a vector through indexing its elements is also
called subsetting.

 In other words, subsetting is the process of extracting a subset of a vector.

 Ex: Use the two built-in datasets letters and LETTERS.

 > letters

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" [13] "m" "n" "o" "p" "q" "r" "s" "t" "u"
"v" "w" "x" [25] "y" "z"

 > LETTERS

 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" [13] "M" "N" "O" "P" "Q" "R" "S"
"T" "U" "V" "W" "X" [25] "Y" "Z"

 > letters[15]

 [1] “o”

 > LETTERS[20:26]

 [1] "T" "U" "V" "W" "X" "Y" "Z“

Head or Tail

 You can use the head() function to get the first elements of a variable, and tail() to get the last elements.

 By default, both head() and tail() return six elements, but you can tell them to return any specific number of elements via the second argument.
 > tail(LETTERS,6)

 [1] "U" "V" "W" "X" "Y" "Z“

 > head(letters, 10)

 [1] “a” “b” “c” “d” “e” “f” “g” “h” “i” “j”

 > islands

 This built-in dataset islands, a named vector that contains the surface area of the world’s
48 largest land masses (continents and large islands).

 > str(islands)

 Named num [1:48] 11506 5500 16988 2968 16 ... - attr(*, "names")= chr [1:48] "Africa"
"Antarctica" "Asia" "Australia" ...

 > islands[10] ##Use of indexing

 Celebes 73

 > islands[c("Asia", "Africa", "Antarctica")]

 Asia Africa Antarctica

 16988 11506 5500

 You use the names() function to retrieve the names in a named vector:

 > names(islands)[1:5]

 [1] "Africa" "Antarctica" "Asia" [4] "Australia" "Axel Heiberg"”

SORT

 > names(sort(islands, decreasing=TRUE)[1:6])

 [1] "Asia" "Africa" "North America"

 [4] "South America" "Antarctica" "Europe"

 > names(sort(islands)[6:1])

 [1] "Taiwan" "Kyushu" [3] "Timor" "Prince of Wales" [5] "Hainan" "Vancouver"

 ## increasing (ascending) sort is the default

 > month.days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
 > names(month.days) <- month.name

 > month.days

 January February March April

 31 28 31 30

 May June July August

 31 30 31 31

 September October November December

 30 31 30 31

 Now you can use this vector to find the names of the months with 31 days:

 > names(month.days[month.days==31])

 #== equality operator

 [1] “January” “March” “May”

 [4] “July” “August” “October”

 [7] “December”

String theory: Combining and splitting strings

 A collection of combined letters and words is called a string. Whenever you work with
text, you need to be able to concatenate words (string them together) and split them apart.

 In R, you use the paste() function to concatenate and the strsplit() function to split.

Splitting text

 > wordplay <- "The quick brown fox jumps over the lazy dog"

 > wordplay

 [1] "The quick brown fox jumps over the lazy dog"

 > strsplit(wordplay,"")

 [[1]] [1] "T" "h" "e" " " "q" "u" "i" "c" "k" " " "b" "r" [13] "o" "w" "n" " " "f" "o" "x" " "
"j" "u" "m" "p" [25] "s" " " "o" "v" "e" "r" " " "t" "h" "e" " " "l" [37] "a" "z" "y" " " "d"
"o" "g"
 > strsplit(wordplay," ")

 [[1]]

 [1] "The" "quick" "brown" "fox" "jumps" "over" [7] "the" "lazy" "dog"

 > words <- strsplit(wordplay, " ")[[1]]

 > words

 [1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"

 To find the unique elements of a vector, including a vector of text, you use the unique()
function

 > unique(tolower(words))

 [1] "the" "quick" "brown" "fox" "jumps" "over" [7] "lazy" "dog"

 In the variable words, “the” appears twice: once in lowercase and once with the first letter
capitalized. To get a list of the unique words, first convert words to lowercase and then
use unique

Conversion from Lowercase to Uppercase

 > unique(toupper(words))

 [1] "THE" "QUICK" "BROWN" "FOX" "JUMPS" "OVER" [7] "LAZY" "DOG"

 > toupper(words)

 [1] "THE" "QUICK" "BROWN" "FOX" "JUMPS" "OVER" [7] "THE" "LAZY" "DOG"

Factoring in Factors

 R has a special data structure for categorical data, called factors. Factors are closely
related to characters because any character vector can be represented by a factor.

 In real-world problems, you often encounter data that can be described using words rather
than numerical values. For example, cars can be red, green, or blue (or any other color);
people can be left-handed or right-handed, male or female; energy can be derived from
coal, nuclear, wind, or wave power.

 You can use the term categorical data to describe these examples

 Factors are special types of objects in R. They’re neither character vectors nor numeric
vectors, although they have some attributes of both.
 Factors behave a little bit like character vectors in the sense that the unique categories
often are text.

 Factors also behave a little bit like integer vectors because R encodes the levels as
integers.

Creating a factor

 To create a factor in R, you use the factor() function.

 x: The input vector that you want to turn into a factor.

 levels: An optional vector of the values that x might have taken. The default is
lexicographically sorted, unique values of x.

 labels: Another optional vector that, by default, takes the same values as levels. You can
use this argument to rename your levels, as we explain in the next paragraph.

 Just remember that levels refer to the input values of x, while labels refer to the output
values of the new factor.
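A small sketch of factor() using the car-color example from above (the colors, levels, and labels below are made up for illustration):

```r
# A character vector of (made-up) car colors
colors <- c("red", "green", "red", "blue", "green")

# levels lists the input values; labels renames them in the output factor
f <- factor(colors,
            levels = c("blue", "green", "red"),
            labels = c("Blue", "Green", "Red"))

f              # Red Green Red Blue Green, with Levels: Blue Green Red
levels(f)      # "Blue" "Green" "Red"
as.integer(f)  # 3 2 3 1 2 -- R encodes the levels as integers
```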

Searching by position

 > head(substr(words, start=3, stop=6))

 [1] "e" "ick" "own" "x" "mps" "er"

 To find substrings, you can use the grep() function, which takes two essential arguments:

 pattern: The pattern you want to find.

 x: The character vector you want to search.

 Example: use the built-in character vector state.name (the names of the 50 US states).
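For example, with the built-in state.name vector, grep() can be used like this:

```r
# Indices of state names containing the pattern "New"
grep("New", state.name)
# [1] 29 30 31 32

# The matching values themselves, via value = TRUE
grep("New", state.name, value = TRUE)
# [1] "New Hampshire" "New Jersey" "New Mexico" "New York"
```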

 Searching by position

 If you know the exact position of a subtext inside a text element, you use the substr()
function to return the value. To extract the subtext that starts at the third position and
stops at the sixth position of state.name, use the following:

 > head(substr(state.name, start=3, stop=6))

 [1] "abam" "aska" "izon" "kans" "lifo" "lora"

Data Types
o Numeric
o Integer
o Character/String
o Date/POSIXct
o Logical (TRUE/FALSE)

Class function

 The class() function checks the type of data contained in a variable.

 > class(x)

 [1] "numeric"

 > i<-5L

 > i

 [1] 5

 > is.integer(i)

 [1] TRUE

 > is.numeric(i)

 [1] TRUE

Advanced Data Structures

 Data Frames: On the surface, a data frame is like an Excel spreadsheet with rows and columns.

 In statistical terms, each column is a variable and each row is an observation.

 In R's organization, each column of a data frame is a vector of the same length.

 Data frames are complex objects with many attributes

Create Data Frames

 Command: data.frame()

 > x<-c(3,5,7,9)

 >y<-c("tt","rr","ii","ee")

 >z<--4:-1

 > thedf<-data.frame(x,y,z)

 > thedf
 x y z

 1 3 tt -4

 2 5 rr -3

 3 7 ii -2

 4 9 ee -1

Checking attributes

 > nrow(thedf)

 [1] 4

 > ncol(thedf)

 [1] 3

 > dim(thedf)

 [1] 4 3 #Row and Column

More on Data Frames

 >head(thedf) # Shows first few elements

 >tail(thedf) #Shows last few elements

 > head(thedf,n=2) #Shows 2 rows

 >class(thedf) #Shows the class

 Since each column of a data frame is an individual vector, it can be accessed individually, and each has its own class.

 >thedf$x

 [1] 3 5 7 9

 > thedf[2,3]

 [1] -3

 > thedf[, 2:3] # access all columns of 2 through 3

 > thedf[2, ] # access all of row 2


List

 A list is a container of arbitrary objects, either of the same type or of varying types.

 > list(1,2,3)

 [[1]] [1] 1

 [[2]] [1] 2

 [[3]] [1] 3

 >list(c(1,2,3)) # Create single element list with one element being a vector

 [[1]] [1] 1 2 3

More on Lists

 >list(thedf,1:5) # Create 2 element list, first a data frame, second a vector

 list3 <-list(c(1,2,3),1:3) # Create 2 element list

 # Assigning names to list elements using name value pairs

 >list6 <-list(THEDATAFRAME =thedf, THEVECTOR = 1:5, THELIST =list3)

Matrices

 A matrix is a mathematical structure similar to a data frame, with rows and columns, but all its data must be of the same type (usually numeric).

 >m<-matrix(1:10,nrow=5) # create 5*2 Matrix

 >m

 n<-matrix(21:40,nrow=5)

 >nrow(m) #No of Rows

 >ncol(m) # No of Columns

 >dim(m) #dimension of matrix

Matrix Operations

 >m+n #Addition

 >m*n #Multiplication
 >m ==n # Test equality


Array

 An array is a multidimensional vector. All elements must be of the same type, and individual elements are accessed using square brackets.

 The first index is the row, the second is the column, and the remaining indices are the outer dimensions.

 > thearray <- array(1:12, dim = c(2,3,2))

 > thearray

Basic Statistics

 >x<-sample(x=1:100, size=100, replace=FALSE) # a random permutation of 1:100

 > x

 > mean(x)

 [1] 50.5

 > median(x)

 [1] 50.5

 > summary(x)

 Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 25.75 50.50 50.50 75.25 100.00

Fun with Central Tendencies

 Compute a table giving various descriptive statistics about the series in a data frame or in
a single/multiple time series

 Install and load the packages pastecs and statsr.

 > stat.desc(x, basic=TRUE, desc=TRUE, norm=FALSE, p=0.95)


 >stat.desc(x)

 Arguments

 x: a data frame or a time series

 basic: whether to return basic statistics (TRUE by default). These are: the number of values (nbr.val), the number of null values (nbr.null), the number of missing values (nbr.na), the minimal value (min), the maximal value (max), the range (range, that is, max - min), and the sum of all non-missing values (sum).

 desc: whether to return various descriptive statistics (TRUE by default). These are: the median (median), the mean (mean), the standard error on the mean (SE.mean), the confidence interval of the mean (CI.mean) at the p level, the variance (var), the standard deviation (std.dev), and the variation coefficient (coef.var), defined as the standard deviation divided by the mean.

 norm: whether to return normal distribution statistics (FALSE by default). These are: the skewness coefficient g1 (skewness) and its significance criterion (skew.2SE, that is, g1/(2·SE.g1); if skew.2SE > 1, skewness is significantly different from zero), the kurtosis coefficient g2 (kurtosis) and its significance criterion (kurt.2SE, same remark as for skew.2SE), and the statistic of a Shapiro-Wilk test of normality (normtest.W) with its associated probability (normtest.p).

 SquareRoot

 > sqrt(x)

 Variance

 > var(x)

 Weighted Mean:

 > grades<-c(95,72,87,66)

 >weights<-c(0.50, 0.23, 0.25, 0.47)

 > weighted.mean(x=grades, w=weights)
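The weighted mean is the sum of the value-weight products divided by the sum of the weights; a quick check that weighted.mean() matches the hand formula:

```r
grades  <- c(95, 72, 87, 66)
weights <- c(0.50, 0.23, 0.25, 0.47)

# weighted.mean() computes sum(grades * weights) / sum(weights)
by_hand  <- sum(grades * weights) / sum(weights)
built_in <- weighted.mean(x = grades, w = weights)

all.equal(by_hand, built_in)  # TRUE
```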

Correlation & Covariance

 Correlation and covariance are used to test the relationship between two or more variables.

 > data(Orange) #Load Orange dataset

 > head(Orange) # View top contents


 > cor(Orange$age, Orange$circumference)

 [1] 0.9135189

 >data() # to see inbuilt datasets

 Install packages RODBC #Open Database Connectivity standard in R

 Load RODBC

 Install ggplot2,coefplot,corrplot,Hmisc and load

Correlation Matrix/Correlation with Multiple Variables

 >data("mtcars") #mtcars data

 >head(mtcars)

 > mydata <- mtcars[, c(1,2,3,4,5,6,7)]

 > mydata

 >res <- cor(mydata)

 >round(res, 2)

 > head(economics) # economics data

 > cor(economics[,c(2,4:6)])

 > cor(economics[,c(2:6)])

Correlation matrix with significance levels (p-value)

 >install.packages("Hmisc")

 >require("Hmisc")

 > library("Hmisc") #Better practice

 > res2 <- rcorr(as.matrix(mydata))

 > res2
 The output of the function rcorr() is a list containing the following elements: r, the correlation matrix; n, the matrix of the number of observations used in analyzing each pair of variables; and P, the p-values corresponding to the significance levels of the correlations.

Extract the p-values or the correlation coefficients from the output

 # Extract the correlation coefficients

 >res2$r

 # Extract p-values

 >res2$P

Regression

Regression analysis is a tool for building statistical models that characterize relationships among
a dependent variable and one or more independent variables, all of which are numerical.

Simple linear regression involves a single independent variable.

Multiple regression involves two or more independent variables.

Purpose of regression analysis

 The purpose of regression analysis is to analyze relationships among variables.

 The analysis is carried out through the estimation of a relationship and the results serve
the following two purposes:
1. Answer the question of how much y changes with changes in each of the x's (x1,
x2,...,xk),

 Y is the dependent variable/Response

2. Forecast or predict the value of y based on the values of the X's

 X is the independent variable/Predictor

Simple Linear Regression

 Finds a linear relationship between:

- one independent variable X and

- one dependent variable Y

 First prepare a scatter plot to verify the data has a linear trend.

Use alternative approaches if the data is not linear

 It is used to determine the relationship between two variables.

 Given one variable regression tells us what we expect from the other variable.

 Y = a + bX

 Y = Dependent variable

 X = Independent variable

 a = Intercept

 b = Slope
Scatter Plots and Correlation

 A scatter plot (or scatter diagram) is used to show the relationship between two variables

 Correlation analysis is used to measure the strength of the association (linear
relationship) between two variables

 Only concerned with strength of the relationship

 No causal effect is implied

Finding the Best-Fitting Regression Line

 Two possible lines are shown below.

 Line A is clearly a better fit to the data.

 We want to determine the best regression line.

Y = b0 + b1X

where

b0 is the intercept

b1 is the slope

Least Squares Line


The most widely used criterion for measuring the goodness of fit of a line is the sum of squared
residuals (the squared vertical deviations of the data points from the line).

The line that gives the best fit to the data is the one that minimizes this sum; it is called the least
squares line or sample regression line.

The slope of a regression line represents the rate of change in y as x changes: because y depends
on x, the slope gives the change in the predicted value of y for a one-unit increase in x.
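The least squares estimates have simple closed forms, which can be computed directly and compared against lm() (a sketch using the same x and y vectors as the simple-regression example later in this section):

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Closed-form least squares estimates:
#   slope     b1 = cov(x, y) / var(x)
#   intercept b0 = mean(y) - b1 * mean(x)
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)   # matches coef(lm(y ~ x))
```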

Linear Relations

• We know from algebra lines come in the form y = mx + b, where m is the slope and b is
the y-intercept.

• In statistics, we use y = a + bx for the equation of a straight line. Now a is the intercept
and b is the slope.

• The slope (b) of the line is the amount by which y increases when x increases by 1 unit.

• This interpretation is very important.

• The intercept (a), sometimes called the vertical intercept, is the height of the line when x
= 0.

Simple Linear Regression with R

 > x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

 > y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

 > relation <- lm(y~x) # lm = Linear Model

 > print(summary(relation))

 Coefficients: (Intercept) = -38.4551, slope (x) = 0.6746

 Test

 54.64109= -38.4551+(0.6746 *138)


Regression using dataset

 Load package car (Companion to applied regression)

 > head(Prestige) # Prestige dataset

 > relation3<-lm(income~education, data=Prestige)

 > relation3

 > summary(relation3)

 # Outcome = Intercept + Slope × Predictor

 #Objective is to build a simple regression model that we can use to predict Distance
(dist) by establishing a statistically significant linear relationship with Speed (speed).

 Visualize the data

 1. Scatter plot: Visualize the linear relationship between the predictor and response

 2. Box plot: To spot any outlier observations in the variable. Having outliers in your
predictor can drastically affect the predictions as they can easily affect the direction/slope
of the line of best fit.

 3. Density plot: To see the distribution of the predictor variable. Ideally, a close to
normal distribution (a bell shaped curve), without being skewed to the left or right is
preferred. Let us see how to make each one of them.

Scatter Plot & Box Plot


 > head(cars)

 > scatter.smooth(x=cars$speed, y=cars$dist, main="Dist ~ Speed") # Scatter Plot

 > par(mfrow=c(1, 2))

 > boxplot(cars$speed, main="Speed", sub=paste("Outlier rows: ", boxplot.stats(cars$speed)$out)) # box plot for 'speed'

Density Plot

 >library(e1071)

 >par(mfrow=c(1, 2)) # divide graph area into 2 columns

 >plot(density(cars$speed), main="Density Plot: Speed", ylab="Frequency", sub=paste("Skewness:", round(e1071::skewness(cars$speed), 2))) # density plot for 'speed'

 >polygon(density(cars$speed), col="red")

 >plot(density(cars$dist), main="Density Plot: Distance", ylab="Frequency", sub=paste("Skewness:", round(e1071::skewness(cars$dist), 2))) # density plot for 'dist'

 >polygon(density(cars$dist), col="red")

cor(cars$speed, cars$dist) # calculate correlation between speed and distance

[1] 0.8068949

Linear Regression Model

 > head(cars)

 >linearMod <- lm(dist ~ speed, data=cars)

 >print(linearMod)

 dist = Intercept + (β ∗ speed)
=> dist = −17.579 + 3.932∗speed

 Cross Check: for speed = 10, dist = −17.579 + (3.932 × 10) ≈ 21.74

Multiple Regression

 Multiple regression is an extension of simple linear regression to relationships among
more than two variables.
 In multiple regression we have more than one predictor variable and one response
variable.

 y = a + b1x1 + b2x2 +...bnxn

 y is the response variable.

 a, b1, b2...bn are the coefficients.

 x1, x2, ...xn are the predictor variables.

 The basic syntax for lm() function in multiple regression is −

 lm(y ~ x1+x2+x3...,data) Following is the description of the parameters used −

 formula is a symbol presenting the relation between the response variable and predictor
variables.

 data is the data frame on which the formula will be applied.

 Load packages car & carData

Multiple Regression

 Objective: The goal of the model is to establish the relationship between "mpg" as a
response variable with "disp","hp" and "wt" as predictor variables.

 Solution: We create a subset of these variables from the mtcars data set for this purpose.

 > input <- mtcars[,c("mpg","disp","hp","wt")]

 > head(input)

Create Relationship Model and get Coefficients

 > model <- lm(mpg~disp+hp+wt, data = input) # Directly also possible without creating
subset

 > model # Model Shown

 >a <- coef(model) # Get the Intercept and coefficients as vector elements.

 >print(a)

Final Equation

 Y = a + b1·x1 + b2·x2 + b3·x3

 or

 Y = 37.105 + (-0.000937)·x1 + (-0.0311)·x2 + (-3.8009)·x3

 Apply Equation for predicting New Values

 For a car with disp = 221, hp = 102 and wt = 2.91, the predicted mileage is:

 Y = 37.105 + (-0.000937)×221 + (-0.0311)×102 + (-3.8009)×2.91 ≈ 22.66
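The same calculation can be cross-checked with predict(), which applies the fitted equation at full precision instead of hand-rounded coefficients:

```r
model <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Predicted mileage for the hypothetical car from the example
newcar <- data.frame(disp = 221, hp = 102, wt = 2.91)
predict(model, newdata = newcar)  # ~22.66
```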

Logistic Regression

 In logistic regression, we are only concerned with the probability of the outcome of the
dependent variable (success or failure).

 Probability of Success(p) and Probability of Failure(1-p). p should meet following


criteria:

1. It must always be non-negative (since p >= 0)

2. It must always be less than or equal to 1 (since p <= 1)

 The Logistic Regression is a regression model in which the response variable (dependent
variable) has categorical values such as True/False or 0/1.

 The general mathematical equation for logistic regression is −

 y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))

 Following is the description of the parameters used −

 y is the response variable.

 x1, x2, ... are the predictor variables.

 a and b1, b2, ... are the coefficients, which are numeric constants.

Syntax of Logistic Regression

 glm(formula,data,family) Following is the description of the parameters used −

 formula is the symbol presenting the relationship between the variables.

 data is the data set giving the values of these variables.

 family is an R object specifying the details of the model. Its value is binomial for logistic
regression.

Create Regression Model


 >input1 <- mtcars[,c("am","cyl","hp","wt")]

 >am.data = glm(formula = am ~ cyl + hp + wt, data = input1, family = binomial) # Directly also possible without creating a subset

 >print(summary(am.data))

 In the summary, the p-value in the last column is more than 0.05 for the variables "cyl"
and "hp", so we consider them insignificant in contributing to the value of the variable
"am".

 Only weight (wt) impacts the "am" value in this regression model.

Cross Check

 -11.9483662=19.70288+(-9.14947 *3.460)

Statistical Functions

 skewness returns value of skewness,

 kurtosis returns value of kurtosis,

 basicStats computes an overview of basic statistical values,

 rowStats calculates row statistics,

 colStats calculates column statistics,

 rowAvgs calculates row means,

 colAvgs calculates column means,

 rowVars calculates row variances,

 colVars calculates column variances,

 rowStdevs calculates row standard deviations,

 colStdevs calculates column standard deviations,

 rowSkewness calculates row skewness,

 colSkewness calculates column skewness,

 rowKurtosis calculates row kurtosis,


 colKurtosis calculates column kurtosis,

 rowCumsums calculates row cumulated Sums,

 colCumsums calculates column cumulated Sums.

Major Types of Learning

 Unsupervised Learning:

 In machine learning, unsupervised learning is a class of problems in which one seeks to


determine how the data are organized. It is distinguished from supervised learning (and
reinforcement learning) in that the learner is given only unlabeled examples.

 Supervised Learning:

 A machine learning technique whereby a system uses a set of training examples to learn
how to correctly perform a task

Clustering

 It is basically a type of unsupervised learning method.

 An unsupervised learning method is one in which we draw inferences from datasets
consisting of input data without labeled responses.

 Generally, it is used as a process to find meaningful structure, explanatory underlying
processes, generative features, and groupings inherent in a set of examples.

Clustering in Machine Learning

 Clustering: the assignment of a set of observations into subsets (called clusters) so
that observations in the same cluster are similar in some sense.

 Clustering is a method of unsupervised learning and a common technique for statistical
data analysis used in many fields.

 It is basically a grouping of objects on the basis of the similarity and dissimilarity
between them.

 Clustering is very important because it determines the intrinsic grouping among the
unlabeled data present. There is no single criterion for a good clustering; it depends on
the user's needs.
Applications of Clustering in different fields

 1. Marketing : It can be used to characterize & discover customer segments for


marketing purposes.
2. Biology : It can be used for classification among different species of plants and
animals.
3. Libraries : It is used in clustering different books on the basis of topics and
information.
4. Insurance : It is used to acknowledge the customers, their policies and identifying the
frauds.
5. City Planning : It is used to make groups of houses and to study their values based on
their geographical locations and other factors present.
6. Earthquake studies : By learning the earthquake affected areas we can determine the
dangerous zones.

Clustering Types

 Clustering plays a big role in machine learning by partitioning the data into groups.

 2 Major types:

 Hierarchical Clustering

 K-Means Clustering

K-means clustering

 K-means clustering is an algorithm to classify or group objects, based on their
attributes/features, into K groups, where K is a positive integer.

 The grouping is done by minimizing the sum of squared distances between the data
points and the corresponding cluster centroid. Thus the purpose of K-means clustering
is to classify the data.

K-means Clustering –Example

 Objects Attribute 1 (X):weight index Attribute 2 (Y): pH

 Medicine A 1 1

 Medicine B 2 1

 Medicine C 4 3

 Medicine D 5 4

 We also know beforehand that these objects belong to two groups of medicine (cluster 1
and cluster 2). The problem is to determine which medicines belong to cluster 1 and
which belong to cluster 2.
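This toy problem can be run directly through kmeans(); with two well-separated groups the algorithm recovers the expected partition (the labels 1/2 may be swapped between runs, since cluster numbering is arbitrary):

```r
# Toy data from the example: weight index (X) and pH (Y)
medicines <- data.frame(
  X = c(1, 2, 4, 5),
  Y = c(1, 1, 3, 4),
  row.names = c("MedicineA", "MedicineB", "MedicineC", "MedicineD")
)

set.seed(1)                        # kmeans starts from random centers
fit <- kmeans(medicines, centers = 2)
fit$cluster   # A and B share one cluster; C and D share the other
fit$centers   # the two cluster centroids
```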

Final Grouping –As a Result

 Objects      X (weight index)   Y (pH)   Group (Result)

 Medicine A   1                  1        1

 Medicine B   2                  1        1

 Medicine C   4                  3        2

 Medicine D   5                  4        2

Hierarchical Clustering

 Hierarchical clustering involves creating clusters that have a predetermined ordering from
top to bottom.

 Hierarchical clustering is where you build a cluster tree (a dendrogram) to represent data,
where each group (or “node”) links to two or more successor groups.

 The groups are nested and organized as a tree, which ideally ends up as a meaningful
classification scheme.
 Each node in the cluster tree contains a group of similar data; Nodes group on the graph
next to other, similar nodes.

 Clusters at one level join with clusters in the next level up, using a degree of similarity;
The process carries on until all nodes are in the tree, which gives a visual snapshot of the
data contained in the whole set.

 The total number of clusters is not predetermined before you start the tree creation.

Dendrogram

 A dendrogram is a type of tree diagram showing hierarchical clustering: relationships
between similar sets of data. They are frequently used in biology to show clustering
between genes or samples, but they can represent any type of grouped data.

 The clade is the branch. Clades are usually labeled with Greek letters from left to right
(e.g. α, β, δ, …).

 Each clade has one or more leaves. In the example dendrogram, the leaves are:

 Single (simplicifolius): F

 Double (bifolius): D E

 Triple (trifolius): A B C
Dendrogram Interpretation

 The clades are arranged according to how similar (or dissimilar) they are. Clades that are
close to the same height are similar to each other; clades with different heights are
dissimilar — the greater the difference in height, the more dissimilarity (you can
measure similarity in many different ways; One of the most popular measures is
Pearson’s Correlation Coefficient).

 Leaves A, B, and C are more similar to each other than they are to leaves D, E, or F.

 Leaves D and E are more similar to each other than they are to leaves A, B, C, or F.

 Leaf F is substantially different from all of the other leaves.

 Note that on the graph, the same clade, β, joins leaves A, B, C, D, and E. That
means that the two groups (A, B, C and D, E) are more similar to each other than they are
to F.

Hierarchical Clustering

 > car1<-hclust(d=dist(mtcars))
 > plot(car1)

 > car2<-hclust(dist(mtcars),method="single")

 > plot(car2)

 > car3<-hclust(dist(mtcars),method="complete")

 > plot(car3)

 > car4<-hclust(dist(mtcars),method="average")

 > plot(car4)


Specify the type of tree produced by clustering by splitting the observations into predefined
groups.

 Two ways to specify the split:

 How many clusters (number of cuts)

 Where to cut (height)

Cut Types in Hierarchical Clustering

 > install.packages("rattle.data")

 > library(rattle.data)

 > head(wine)

 # Plot into 3 clusters

 > wine1<-hclust(d=dist(wine))

 > plot(wine1)

 > rect.hclust(wine1,k=3, border="red")

 > rect.hclust(wine1,k=13, border="blue")

 # Cut using height of the cuts

 > rect.hclust(wine1,h=200, border="pink")


 > rect.hclust(wine1,h=600, border="orange")

 > rect.hclust(wine1,h=100, border="green")

K-Means Clustering

 It is one of the simplest unsupervised learning algorithms for solving the clustering
problem.

 The K-means algorithm partitions n observations into k clusters, where each observation
belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

 Need to specify the number of clusters

 Done using kmeans()

 In R, kmeans() does not work with categorical data

 Set a seed so the random starting centers are reproducible

 > set.seed(2453666) # Random number

 > wine2<-kmeans(x=wine[,-1],centers = 3) # drop the categorical Type column

 > print(wine2)

 > require(useful)

 > plot(wine2,data=wine)

Hartigan’s Rule

# Hartigan's rule is used to determine the number of clusters

 > wine4<-FitKMeans(wine[,-1], max.clusters=20, nstart=25,seed=454356) # again excluding the categorical Type column

 > wine4

 #Hartigan's rule for determining the number of clusters

 > PlotHartigan(wine4)

Data Import in R

 For Stata and Systat, use the foreign package. For SPSS and SAS use the Hmisc
package for ease and functionality.

 Install and load readxl and Rcpp # to read Excel files

 Read TXT files with read.table()

 If you have a .txt or a tab-delimited text file, you can easily import it with the basic R
function read.table().

 > df <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt", header = FALSE)

 > df
Group Manipulation

 Data managing/munging consumes about 80% of the effort in data analysis.

 Data munging requires repeated application of “split-apply-combine” procedures.

 The apply family of functions

Apply

 apply() works on a matrix (or array)

 The matrix can contain character, numeric, or logical elements, but only one type at a time

 # Create a matrix

 > thematrix<-matrix(1:9,nrow=3)

 > thematrix

 # Sum the columns

 > apply(thematrix,2,sum)
 [1] 6 15 24

 # Sum of rows

 > apply(thematrix,1,sum)

 [1] 12 15 18

 OR

 > rowSums(thematrix)

 [1] 12 15 18

 > colSums(thematrix)

 [1] 6 15 24

Lapply

 lapply() works by applying a function to each element of a list and returning the results as a list.

 > thelist<-list(A=matrix(1:9,3),B=1:5,C=matrix(1:4,2),D=2)

 > thelist

 > lapply(thelist,sum) # sum function applied to each element of the list

Sapply

 sapply() works like lapply() but simplifies the result to a vector where possible.

 > sapply(thelist,sum)

 A B C D

 45 15 10 2

 Because an atomic vector can be treated element-wise like a list, sapply() and lapply()
also accept vectors as input.

 > thenames<-c("PGDMC","PGDMD", "PGDMMM")

 > lapply(thenames,nchar)

Mapply
 mapply() applies a function to the corresponding elements of multiple lists.

 # Build two lists

 > firstlist<-list(A=matrix(1:16,4),B=matrix(1:16,2),C=1:5)

 > secondlist<-list(A=matrix(1:16,4),B=matrix(1:16,8),C=15:1)

 > firstlist

 >secondlist

 > mapply(identical,firstlist,secondlist)

 A B C

 TRUE FALSE FALSE

 #build a simple function that adds the number of rows/length

 # each corresponding element

 > simpleFunc<-function(x,y) NROW(x)+NROW(y)

 #apply the function to two lists

 > mapply(simpleFunc,firstlist,secondlist)

 A B C

 8 10 20

 # Aggregate Function

 > require(ggplot2)

 > library(plyr)

Aggregate Function

 > head(diamonds)
 # Calculate average price of each type of cut

 > aggregate(price~cut,diamonds,mean)

 cut price

 1 Fair 4358.758

 2 Good 3928.864

 3 Very Good 3981.760

 4 Premium 4584.258

 5 Ideal 3457.542

 > aggregate(price~cut+color,diamonds,mean)

Model Diagnostics

 Step 1: Analysis of residuals (difference between actual response and fitted values)

 Step 2: Import

 https://data.montgomerycountymd.gov/api/views/2qd6-
mr43/rows.csv?accessType=DOWNLOAD

 > library(readr)

 > USA_Housing <- read_csv("H:/Class Preparation/Business


Analytics/USA_Housing.csv")

 Or

 Load USA_Housing from File menu

 Install and load package “boot”

Data introduction

 > View(USA_Housing)

 # Heading Names Changed

 > names(USA_Housing)<-
c("avgincome","avghouseage","avgnorooms","avgbedrooms","population","price","addre
ss")

 > View(USA_Housing) # See the change


 > head (USA_Housing)

Model Building

 #Simple Regression/OLS

 > house1<-lm(price~avghouseage,data=USA_Housing)

 >house1

 > house2<-lm(price~avghouseage+avgnorooms,data=USA_Housing)

 > house3<-lm(price~avghouseage+avgnorooms+avgbedrooms,data=USA_Housing)

 > house4<-
lm(price~avghouseage+avgnorooms+avgbedrooms+population,data=USA_Housing)

 > house5<-
lm(price~avghouseage+avgnorooms+avgbedrooms+population+avgincome,data=USA_
Housing)

Visualizations of Regression

 >library(coefplot,ggplot2)

 > coefplot(house1)

 > coefplot(house2)

 > coefplot(house3)

 > coefplot(house4)

 > coefplot(house5)

Visualizations of Regression

 > plot(house1) # Q-Q plot: if the model is a good fit, the standardized residuals fall along a
straight line when plotted against the theoretical quantiles of the normal distribution

 >plot(house1,which=2) # Try which=2/3/4/5

 > plot(house2,which=2)

 > plot(house3,which=2)

 > plot(house4,which=2)
 > plot(house5,which=2)

Visualizations of Regression

 # Histogram of residuals

 > ggplot(house1,aes(x=.resid))+geom_histogram()

 > ggplot(house2,aes(x=.resid))+geom_histogram()

 > ggplot(house3,aes(x=.resid))+geom_histogram()

 > ggplot(house4,aes(x=.resid))+geom_histogram()

 > ggplot(house5,aes(x=.resid))+geom_histogram()

Comparing Models

 # Visualize all the models using multiplot from coefplot package.

 > multiplot(house1,house2,house3,house4,house5, pointSize=4)

 # Use anova() to return a table of results including the residual sum of squares (RSS),
a measure of error; the lower the better.

 > anova(house1,house2,house3,house4,house5)

 #Analysis of Variance Table

 The 4th model has the lowest RSS at 3.1519e+14.

 So the 4th model is the best by this criterion.

 Problem with RSS is that it always improves when additional variables are added.

 This can lead to excessive model complexity and overfitting.

 One solution is the Akaike information criterion (AIC): the model with the lowest AIC
(even if negative) is considered optimal.

 (2) Bayesian information criterion (BIC) or Schwarz information criterion (also SIC,
SBC, SBIC) is a criterion for model selection among a finite set of models; the model
with the lowest BIC is preferred.

 > AIC(house1,house2,house3,house4,house5)

 4th model lowest


 > BIC(house1,house2,house3,house4,house5)

 5th model lowest

 # Rule of thumb (Andrew Gelman): for every variable added to the model, the deviance
should drop by two; this applies even to categorical/factor variables

K fold cross validation

 It is the preferred method for assessing model quality.

 Step 1: The data are broken into k (usually 5 or 10) non-overlapping sections.

 Step 2: The model is fitted on k-1 sections of the data.

 Step 3: The model is then used to make predictions for the kth (held-out) section.

 Step 4: This is repeated k times, so that every section is held out for testing once and
included in model fitting k-1 times.
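The four steps can be written out by hand; the sketch below (an illustration on the built-in mtcars data, not the USA_Housing model) is what cv.glm() from the boot package automates:

```r
set.seed(42)
k     <- 5
data  <- mtcars
folds <- sample(rep(1:k, length.out = nrow(data)))  # Step 1: assign rows to k sections

mse <- numeric(k)
for (i in 1:k) {
  train  <- data[folds != i, ]                      # Step 2: fit on k-1 sections
  fit    <- lm(mpg ~ wt, data = train)
  test   <- data[folds == i, ]                      # Step 3: predict the held-out section
  pred   <- predict(fit, newdata = test)
  mse[i] <- mean((test$mpg - pred)^2)
}                                                   # Step 4: repeat for every fold
mean(mse)  # cross-validated estimate of prediction error
```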

K Fold

 # Fit the lm models with glm() (gaussian family) so cv.glm() can be used for cross-validation

 > library(boot)

 > houseg1<-glm(price~avghouseage,data=USA_Housing,
family=gaussian(link="identity"))

 #Ensure it gives same result as lm

 > identical(coef(house1), coef(houseg1))

 [1] TRUE

 > #Run cross validation with 5 folds

 > housecv1<-cv.glm(USA_Housing,houseg1,K=5)

 > #Check the error

 > housecv1$delta

 [1] 10239876350 10237640761

 The result delta contains two numbers:

 the first is the raw cross-validation estimate of the mean squared error; the second is the
adjusted cross-validation estimate.
 > houseg2<-glm(price~avghouseage +avgnorooms,data=USA_Housing,
family=gaussian(link="identity"))

 > houseg3<-glm(price~avghouseage +avgnorooms+avgbedrooms,data=USA_Housing,


family=gaussian(link="identity"))

 > houseg4<-glm(price~avghouseage
+avgnorooms+avgbedrooms+population,data=USA_Housing,
family=gaussian(link="identity"))

 > houseg5<-glm(price~avghouseage
+avgnorooms+avgbedrooms+population+avgincome,data=USA_Housing,
family=gaussian(link="identity"))

 > #Run Cross Validation

 > housecv2<-cv.glm(USA_Housing,houseg2,K=5)

 >

 > housecv3<-cv.glm(USA_Housing,houseg3,K=5)

 >

 > housecv4<-cv.glm(USA_Housing,houseg4,K=5)

 >

 > housecv5<-cv.glm(USA_Housing,houseg5,K=5)

 # Build data frame to check error results

 cvresults<-
as.data.frame(rbind(housecv1$delta,housecv2$delta,housecv3$delta,housecv4$delta,hous
ecv5$delta))

> cvresults


 #Give better column names

 > names(cvresults)<-c("Error","Adjusted Error")