Analytics
Analytics can be defined as a process that involves the use of statistical techniques
(measures of central tendency, graphs, and so on), information system software (data
mining, sorting routines), and operations research methodologies (linear programming) to
explore, visualize, discover and communicate patterns or trends in data.
Example: An age bar chart is used to depict retail shoppers for a department store that
wants to target advertising to customers by age.
The purpose is to get a rough picture of what generally the data looks like and what
criteria might have potential for identifying trends or future business behavior.
Example: Multiple regression is used to show the relationship (or lack of relationship)
between age, weight, and exercise on diet food sales. Knowing that relationships exist
helps explain why one set of independent variables influences dependent variables such
as business performance.
To build predictive models designed to identify and predict future trends, analytics draws on:
o Statistical methods like multiple regression and ANOVA.
o Information system methods like data mining and sorting.
o Operations research methods like forecasting models.
Example: A department store has a limited advertising budget to target customers. Linear
programming models can be used to optimally allocate the budget to various advertising
media.
Business analytics begins with a data set (a simple collection of data or a data file) or
commonly with a database (a collection of data files that contain information on people,
locations, and so on).
Types of Digital Data
1. Structured data – Structured data is data whose elements are addressable for effective
analysis. It covers all data that can be stored in a SQL database in tables with rows and
columns. Structured data has relational keys and can easily be mapped into pre-designed
fields. Today, such data is the most processed in development and the simplest way to
manage information. Example: relational data.
2. Semi-structured data – Semi-structured data is information that does not reside in a
relational database but has some organizational properties that make it easier to analyze.
With some processing, you can store it in a relational database (this can be very hard for
some kinds of semi-structured data). Example: XML data.
3. Unstructured data – Unstructured data is data that is not organized in a pre-defined
manner or does not have a pre-defined data model, so it is not a good fit for a mainstream
relational database. There are alternative platforms for storing and managing unstructured
data; it is increasingly prevalent in IT systems and is used by organizations in a variety of
business intelligence and analytics applications. Example: Word, PDF, text, media logs.
Structured data
o Information stored in a DB
o Strict format
o Limitation
Semi-structured data
o Data may have a certain structure, but not all information collected has identical
structure
o Some attributes may exist in some of the entities of a particular type but not in
others
Unstructured data
o No pre-defined structure
Big data describes the collection of data sets that are so large and complex that software
systems are hardly able to process them (Isson and Harriott, 2013, pp. 57–61).
Little Data
Isson and Harriott (2013, p. 61) define little data as anything that is not big data.
Little data describes the smaller data segments or files that help individual businesses
keep track of customers.
As a means of sorting through data to find useful information, the application of analytics
has found new purpose.
Command Schedule
R itself is a powerful language that performs a wide variety of functions, such as data
manipulation, statistical modeling, and graphics.
One really big advantage of R, however, is its extensibility. Developers can easily write
their own software and distribute it in the form of add-on packages.
Features
Console: In the bottom-left corner, you find the console. The console in R Studio is
identical to the console in RGui. This is where you do all the interactive work with R.
Workspace and history: The top-right corner is a handy overview of your workspace,
where you can inspect the variables you created in your session, as well as their values.
This is also the area where you can see a history of the commands you’ve issued in R.
Files, plots, package, and help: In the bottom-right corner, you have access to several
tools:
Files: This is where you can browse the folders and files on your computer.
Packages: This is where you can view a list of all the installed packages.
Help: This is where you can browse the built-in Help system of R.
The prompt indicates where you type your commands to R; you see a blinking cursor to
the right of the prompt.
To quit your R session, type the following code in the console, after the command
prompt (>): > q()
R is case sensitive
> 24+7+11
[1] 42
> 5*2
[1] 10
> 25/5
[1] 5
Printing Message
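A minimal sketch of printing in R (the message text is just a placeholder):

```r
# print() shows a value with its index prefix; cat() writes plain text.
print("Hello, R!")   # [1] "Hello, R!"
cat("Hello, R!\n")   # Hello, R!
```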
Vector
A vector is the simplest type of data structure in R. The R manual defines a vector as “A
single entity consisting of a collection of things.”
You also can construct a vector by using operators. An operator is a symbol you stick
between two values to make a calculation.
The symbols +, -, *, and / are all operators, and they have the same meaning they do in
mathematics.
A vector is a row or column of numbers or text. The list of numbers {1,2,3,4,5}, for
example, could be a vector of length 5.
Unlike most other programming languages, R allows you to apply functions to the whole
vector in a single operation without the need for an explicit loop.
> c(1,2,3,4,5)
[1] 1 2 3 4 5
> x <- c(1,2,3,4,5)
> x
[1] 1 2 3 4 5
Next, we’ll add the value 2 to each element in the vector x and print the result:
>x+2
[1] 3 4 5 6 7
> sum(1:5)
[1] 15
One very handy operator is called sequence, and it looks like a colon (:). Type the
following in your console:
> 1:5
[1] 1 2 3 4 5
You can also add one vector to another. To add the values 6:10 element-wise to x, you do
the following:
> x + 6:10
[1] 7 9 11 13 15
This feature of R is extremely powerful because it lets you perform many operations in a
single step.
>x
[1] 1 2 3 4 5
In R, the assignment operator is <-, which you type in the console by using two
keystrokes: the less-than symbol (<) followed by a hyphen (-). The combination of these
two symbols represents assignment.
> y <- 10
>x+y
[1] 11 12 13 14 15
> assign("j", 4)
> j
[1] 4
Calculations
Now create a new variable z, assign it the value of x+y, and print its value:
> z <- x + y
>z
[1] 11 12 13 14 15
You must present text or character values to R inside quotation marks —either single or
double. R accepts both.
Packages
https://cran.r-project.org/web/packages/available_packages_by_name.html
> install.packages("PackageName")
> library("PackageName")
Installation
install.packages("coefplot")
Loading
library(coefplot)
Unloading Package
detach("package:ggplot2", unload=TRUE)
Uninstall Packages
remove.packages(pkgs, lib)
Arguments
lib : a character vector giving the library directories to remove the packages from. If
missing, defaults to the first element in .libPaths().
remove.packages("coefplot")
The c() function is used to combine numeric and text values into vectors.
> cy
You can use the paste() function to concatenate multiple text elements. By default,
paste() puts a space between the different elements, like this:
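A sketch of the defaults (the strings here are placeholders):

```r
# By default paste() separates its arguments with a space;
# sep= changes the separator, and collapse= joins a vector into one string.
paste("Hello", "World")                  # "Hello World"
paste("Hello", "World", sep = "-")       # "Hello-World"
paste(c("a", "b", "c"), collapse = "+")  # "a+b+c"
```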
readline() prompts the user for input and stores the input as a character vector.
> readline("User Name")
User NameArjun Reddy
[1] "Arjun Reddy"
In R Studio, click anywhere in the source editor, and press Ctrl+Shift+Enter or click the
Source button in the console.
Multiple Commands
ls()
rm(cy)
>m<- c(1,2,3,4,5)
>m
>sum(m)
> paste(firstnames,lastname)
Naming Convention in R
Names must start with a letter or a dot. If you start a name with a dot, the second
character can't be a digit. Names may contain letters, digits, underscores (_), and dots (.).
Although you can force R to accept other characters in names, you shouldn't. The dot can
be used in names for objects as well; this style is called dotted style.
Ex: print.default()
Order of Operations
1. Exponentiation
2. Multiplication and division in the order in which the operators are presented
3. Addition and subtraction in the order in which the operators are presented
The mod operator (%%) and the integer division operator (%/%) have the same priority
as the normal division operator (/) in calculations.
You can change the order of the operations by using parentheses i.e. ().
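A short sketch of these rules:

```r
# %% gives the remainder, %/% the integer quotient; both have the
# same priority as ordinary division.
5 %% 3        # remainder: 2
5 %/% 3       # integer quotient: 1
(2 + 3) * 4   # parentheses force the addition first: 20
```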
Calculating Logarithms and Exponentials
In R, you can take the logarithm of the numbers from 1 to 3 like this:
> log(1:3)
Whenever you use one of these functions, R calculates the natural logarithm if you don’t
specify any base.
You calculate the logarithm of these numbers with base 6 like this:
> log(1:3,base=6)
>x<-log(1:20)
> exp(x)
Manipulating Operators
Actually, operators are also functions. It helps to know, though, that operators can, in
many cases, be treated just like any other function if you put the operator between
backticks and add the arguments between parentheses,
> `+`(4,6)
[1] 10
This may be useful later on when you want to apply a function over rows, columns, or
subsets of your data
Vector Types
Integer vectors, containing integer values. (An integer vector is a special kind
of numeric vector.)
> o<-c(1,2,3,4,5)
>o
[1] 1 2 3 4 5
> is.numeric(o)
[1] TRUE
> is.integer(o)
[1] FALSE
Alternatively, you can specify the length of the sequence by using the argument
length.out.
> seq(-2.7, 1.3, length.out=9)
[1] -2.7 -2.2 -1.7 -1.2 -0.7 -0.2 0.3 0.8 1.3
Understanding indexing in R
>numbers
[1] 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
> numbers[10]
[1] 21
All trigonometric functions are available in R: the sine, cosine, and tangent functions and
their inverse functions. You can find them on the Help page you reach by typing ?Trig.
So, you may want to try to calculate the cosine of an angle of 120 degrees like this:
> cos(120)
[1] 0.814181
This code doesn’t give you the correct result, however, because R always works with
angles in radians, not in degrees.
Instead, use a special variable called pi. This variable contains the value of —you
guessed it — π (3.141592653589 . . .).
The correct way to calculate the cosine of an angle of 120 degrees, then, is this:
> cos(120*pi/180)
[1] -0.5
> 2/0
[1] Inf
To check whether a value is finite, use the functions is.finite() and is.infinite().
Ex: >is.infinite(x)
> 0/0
[1] NaN
Structure of a vector
The str() function gives you the type and structure of the object.
> str(fn)
Next to the vector type, R gives you the dimensions of the vector. This example has only
one dimension, and that dimension has indices ranging from 1 to 3.
Finally, R gives you the first few values of the vector. If you want to know only how long
a vector is, you can simply use the length() function.
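A sketch of both functions (the vector fn shown here is a hypothetical example, since its definition is not in the excerpt):

```r
# str() shows the type, index range, and first values;
# length() just counts the elements.
fn <- c(1.2, 3.4, 5.6)
str(fn)      # num [1:3] 1.2 3.4 5.6
length(fn)   # [1] 3
```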
Combining Vectors
> i<-c("KK",1,2,3)
>i
> j<-c("LL",6,7,8)
> k<-c(i,j)
> k
[1] "KK" "1" "2" "3" "LL" "6" "7" "8"
Note: The c() function stands for concatenate. It doesn't create nested vectors; it just
combines its arguments into a single vector.
Repeating Vectors
> rep(k, times=3)
 [1] "KK" "1" "2" "3" "LL" "6" "7" "8" "KK" "1" "2" "3" "LL" "6" "7" "8" "KK" "1" "2" "3"
[21] "LL" "6" "7" "8"
You also can repeat every value by specifying the argument each, like this:
[1] 2 2 2 4 4 4 2 2 2
p<-c(0,0,99)
rep(p,times=3)
[1] 0 0 99 0 0 99 0 0 99
R has a little trick up its sleeve. You can tell R for each value how often it has to be
repeated. To take advantage of that magic, tell R how often to repeat each value in a
vector by using the times argument:
> rep(c(0,7), times=c(4,2))
[1] 0 0 0 0 7 7
You can also use the argument length.out to tell R how long you want the result to be.
> rep(1:3,length.out=7)
[1] 1 2 3 1 2 3 1
Vector Manipulations
r<-c(1,3,5,7,9)
s<-c(2,4,6,8,10)
>r
[1] 1 3 5 7 9
>s
[1] 2 4 6 8 10
r[3]<-111
s[4]<-222
>r
[1] 1 3 111 7 9
>s
[1] 2 4 6 222 10
> r.copy<-r
> s.copy<-s
>r<-r.copy
>s<-s.copy
cummin(x) Gives the minimum of all values in x from the start of the vector up to the position of that value
cummax(x) Gives the maximum of all values in x from the start of the vector up to the position of that value
diff(x) Gives, for every value, the difference between that value and the next value in the vector
Comparing Values in R
> x<-c(3,5,7,9)
>y<-c(1,2,4,5)
>x
> sum(x)
[1] 24
> x>100
[1] FALSE FALSE FALSE FALSE
> x>4
[1] FALSE TRUE TRUE TRUE
Which Function
The which() function takes a logical vector as argument. Hence, you can save the
outcome of a logical vector in an object and pass that to the which() function, as in the
next example.
You also can use all these operators to compare vectors value by value.
> which(x>5)
[1] 3 4
> z<-x>y
> z
[1] TRUE TRUE TRUE TRUE
> which(z)
[1] 1 2 3 4
u<- x-y
u
[1] 2 3 3 4
> y
> is.character(y)
[1] TRUE
> length(y)
[1] 3
> nchar(y)
[1] 2 2 2
The nchar() function tells you that y has length 3 and that each of the 3 elements in y has
2 characters.
Subsetting
The process of referring to a subset of a vector through indexing its elements is also
called subsetting.
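A minimal sketch of the common subsetting forms, using a small placeholder vector:

```r
v <- c(10, 20, 30, 40, 50)
v[2]          # a single element: 20
v[c(1, 3)]    # several elements: 10 30
v[-1]         # a negative index drops an element: 20 30 40 50
v[v > 25]     # logical subsetting: 30 40 50
```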
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
> letters[15]
[1] "o"
> LETTERS[20:26]
Head or Tail
You can use the head() function to get the first elements of a variable and the tail()
function to get the last ones.
By default, both head() and tail() return six elements, but you can tell them to return any
specific number of elements in the second argument.
> tail(LETTERS,6)
[1] "U" "V" "W" "X" "Y" "Z"
> islands
This built-in dataset islands, a named vector that contains the surface area of the world’s
48 largest land masses (continents and large islands).
> str(islands)
Named num [1:48] 11506 5500 16988 2968 16 ... - attr(*, "names")= chr [1:48] "Africa"
"Antarctica" "Asia" "Australia" ...
> islands["Celebes"]
Celebes
     73
You use the names() function to retrieve the names in a named vector:
> names(islands)[1:5]
SORT
> names(sort(islands)[6:1])
[1] "Taiwan" "Kyushu" "Timor" "Prince of Wales" "Hainan" "Vancouver"
> month.days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
> names(month.days) <- month.name
> month.days
  January  February     March     April       May      June      July    August
       31        28        31        30        31        30        31        31
September   October  November  December
       30        31        30        31
Now you can use this vector to find the names of the months with 31 days:
> names(month.days[month.days==31])
[1] "January" "March" "May" "July" "August" "October" "December"
A collection of combined letters and words is called a string. Whenever you work with
text, you need to be able to concatenate words (string them together) and split them apart.
In R, you use the paste() function to concatenate and the strsplit() function to split.
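The two functions are inverses of a sort; a small round-trip sketch (the phrase is a placeholder):

```r
phrase <- "lazy dog"
parts  <- strsplit(phrase, " ")[[1]]  # split into words: "lazy" "dog"
paste(parts, collapse = " ")          # join them back: "lazy dog"
```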
Splitting text
> wordplay <- "The quick brown fox jumps over the lazy dog"
> wordplay
[1] "The quick brown fox jumps over the lazy dog"
> strsplit(wordplay,"")
[[1]]
 [1] "T" "h" "e" " " "q" "u" "i" "c" "k" " " "b" "r" "o" "w" "n" " " "f" "o" "x" " " "j" "u" "m" "p" "s" " " "o" "v" "e" "r" " " "t" "h" "e" " " "l" "a" "z" "y" " " "d" "o" "g"
> strsplit(wordplay," ")
[[1]]
[1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
To find the unique elements of a vector, including a vector of text, you use the unique()
function
> words <- strsplit(wordplay, " ")[[1]]
In the variable words, "the" appears twice: once in lowercase and once with the first
letter capitalized. To get a list of the unique words, first convert words to lowercase and
then use unique():
> unique(tolower(words))
[1] "the" "quick" "brown" "fox" "jumps" "over" "lazy" "dog"
The same idea works in uppercase:
> toupper(words)
[1] "THE" "QUICK" "BROWN" "FOX" "JUMPS" "OVER" "THE" "LAZY" "DOG"
> unique(toupper(words))
[1] "THE" "QUICK" "BROWN" "FOX" "JUMPS" "OVER" "LAZY" "DOG"
Factoring in Factors
R has a special data structure for categorical data, called factors. Factors are closely
related to characters because any character vector can be represented by a factor.
In real-world problems, you often encounter data that can be described using words rather
than numerical values. For example, cars can be red, green, or blue (or any other color);
people can be left-handed or right-handed, male or female; energy can be derived from
coal, nuclear, wind, or wave power.
You can use the term categorical data to describe these examples
Factors are special types of objects in R. They’re neither character vectors nor numeric
vectors, although they have some attributes of both.
Factors behave a little bit like character vectors in the sense that the unique categories
often are text.
Factors also behave a little bit like integer vectors because R encodes the levels as
integers.
Creating a factor
You create a factor with the factor() function. Its main arguments are:
x: the vector of data values to encode as a factor.
levels: An optional vector of the values that x might have taken. The default is the
lexicographically sorted, unique values of x.
labels: Another optional vector that, by default, takes the same values as levels. You can
use this argument to rename your levels, as we explain in the next paragraph.
Just remember that levels refer to the input values of x, while labels refer to the output
values of the new factor.
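A small sketch of levels versus labels (the size codes are placeholders):

```r
# levels describe the input values; labels rename them in the result.
sizes <- factor(c("S", "L", "M", "S"),
                levels = c("S", "M", "L"),
                labels = c("small", "medium", "large"))
levels(sizes)      # "small" "medium" "large"
as.integer(sizes)  # the underlying integer codes: 1 3 2 1
```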
Searching by pattern
To find substrings, you can use the grep() function, which takes two essential arguments:
a pattern to search for and a character vector to search in.
Searching by position
If you know the exact position of a subtext inside a text element, you use the substr()
function to return the value. To extract the subtext that starts at the third position and
stops at the sixth position of state.name, use the following:
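A sketch of both calls on the built-in state.name vector, as the text describes:

```r
# grep() returns the positions where the pattern matches;
# substr() extracts characters 3 through 6 of each state name.
grep("New", state.name)      # positions of the four "New ..." states
substr(state.name, 3, 6)[1]  # "abam" (from "Alabama")
```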
Data Types
o Numeric
o Integer
o Character/String
o Date/POSIXct
o Logical (TRUE/FALSE)
Class function
> class(x)
[1] "numeric"
> i<-5L
> i
[1] 5
> is.integer(i)
[1] TRUE
> is.numeric(i)
[1] TRUE
Data Frames: On the surface, a data frame is like an Excel spreadsheet, with rows and
columns.
Command: data.frame()
> x<-c(3,5,7,9)
>y<-c("tt","rr","ii","ee")
>z<--4:-1
> thedf<-data.frame(x,y,z)
> thedf
  x  y  z
1 3 tt -4
2 5 rr -3
3 7 ii -2
4 9 ee -1
Checking attributes
> nrow(thedf)
[1] 4
> ncol(thedf)
[1] 3
> dim(thedf)
[1] 4 3
Since each column of a data frame is an individual vector, each column can be accessed
individually and each has its own class.
>thedf$x
[1] 3 5 7 9
> thedf[2,3]
[1] -3
> list(1,2,3)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

> list(c(1,2,3)) # Create a single-element list whose one element is a vector
[[1]]
[1] 1 2 3
More on Lists
Matrices
A matrix is a mathematical structure, similar to a data frame, with rows and columns, but
all data must be of the same type (usually numeric).
>m
> n<-matrix(21:40,nrow=5)
>ncol(m) # No of Columns
Matrix Operations
>m+n #Addition
>m*n #Multiplication
>m ==n # Test equality
Array
The first element is row index, second is the column index, and remaining elements are
outer dimensions.
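The array printed below is not defined in the excerpt; one way it could be built and indexed, as a sketch:

```r
# A 2 x 3 x 2 array: the first index is the row, the second the column,
# and the third selects the outer layer.
thearray <- array(1:12, dim = c(2, 3, 2))
thearray[1, 2, 1]   # row 1, column 2, layer 1: 3
thearray[2, 3, 2]   # row 2, column 3, layer 2: 12
```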
> thearray
Basic Statistics
>x<-sample(x=1:100,size=100,replace=TRUE)
x
> mean(x)
[1] 50.5
> median(x)
[1] 50.5
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 25.75 50.50 50.50 75.25 100.00
The stat.desc() function (in the pastecs package) computes a table giving various
descriptive statistics about the series in a data frame or in a single/multiple time series.
Arguments
basic: do we have to return basic statistics (by default, TRUE)? These are: the number of
values (nbr.val), the number of null values (nbr.null), the number of missing values
(nbr.na), the minimal value (min), the maximal value (max), the range (range, that is,
max-min), and the sum of all non-missing values (sum).
desc: do we have to return various descriptive statistics (by default, TRUE)? These are:
the median (median), the mean (mean), the standard error on the mean (SE.mean), the
confidence interval of the mean (CI.mean) at the p level, the variance (var), the standard
deviation (std.dev), and the variation coefficient (coef.var), defined as the standard
deviation divided by the mean.
norm: do we have to return normal distribution statistics (by default, FALSE)? These
are: the skewness coefficient g1 (skewness), its significance criterion (skew.2SE, that is,
g1/2·SEg1; if skew.2SE > 1, skewness is significantly different from zero), the kurtosis
coefficient g2 (kurtosis), its significance criterion (kurt.2SE, same remark as for
skew.2SE), the statistic of a Shapiro-Wilk test of normality (normtest.W), and its
associated probability (normtest.p).
SquareRoot
> sqrt(x)
Variance
> var(x)
Weighted Mean:
> grades<-c(95,72,87,66)
> weights<-c(0.4,0.3,0.2,0.1) # hypothetical weights, for illustration
> weighted.mean(grades, weights)
[1] 83.6
Load RODBC
>head(mtcars)
> mydata
>round(res, 2)
> cor(economics[,c(2,4:6)])
> cor(economics[,c(2:6)])
> install.packages("Hmisc")
> require(Hmisc)
> res2
The output of the function rcorr() is a list containing the following elements: r, the
correlation matrix; n, the matrix of the number of observations used in analyzing each
pair of variables; and P, the p-values corresponding to the significance levels of the
correlations.
>res2$r
# Extract p-values
>res2$P
Regression
Regression analysis is a tool for building statistical models that characterize relationships among
a dependent variable and one or more independent variables, all of which are numerical.
The analysis is carried out through the estimation of a relationship and the results serve
the following two purposes:
1. Answer the question of how much y changes with changes in each of the x's (x1,
x2,...,xk).
2. Forecast or predict the value of y based on the values of the x's.
First prepare a scatter plot to verify the data has a linear trend.
Given one variable regression tells us what we expect from the other variable.
Y = a + bX
Y = dependent variable
a = intercept
b = slope
Scatter Plots and Correlation
A scatter plot (or scatter diagram) is used to show the relationship between two variables
Y = b0 + b1X
where
b0 is the intercept
b1 is the slope
The line that gives the best fit to the data is the one that minimizes this sum; it is called the least
squares line or sample regression line.
The slope of a regression line represents the rate of change in y as x changes. Because y is
dependent on x, the slope describes the predicted values of y given x.
Linear Relations
• We know from algebra lines come in the form y = mx + b, where m is the slope and b is
the y-intercept.
• In statistics, we use y = a + bx for the equation of a straight line. Now a is the intercept
and b is the slope.
• The slope (b) of the line, is the amount by which y increases when x increase by 1 unit.
• The intercept (a), sometimes called the vertical intercept, is the height of the line when x
= 0.
> x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
> y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
> relation <- lm(y ~ x)
> print(summary(relation))
Test
> relation3
> summary(relation2)
#Outcome= Intercept+Slope
#Objective is to build a simple regression model that we can use to predict Distance
(dist) by establishing a statistically significant linear relationship with Speed (speed).
1. Scatter plot: Visualize the linear relationship between the predictor and response
2. Box plot: To spot any outlier observations in the variable. Having outliers in your
predictor can drastically affect the predictions as they can easily affect the direction/slope
of the line of best fit.
3. Density plot: To see the distribution of the predictor variable. Ideally, a close to
normal distribution (a bell shaped curve), without being skewed to the left or right is
preferred. Let us see how to make each one of them.
Density Plot
> plot(density(cars$speed))
> polygon(density(cars$speed), col="red")
> plot(density(cars$dist))
> polygon(density(cars$dist), col="red")
> cor(cars$speed, cars$dist)
[1] 0.8068949
> head(cars)
> linearMod <- lm(dist ~ speed, data=cars)
> print(linearMod)
dist = Intercept + (β ∗ speed)
=> dist = −17.579 + 3.932∗speed
Cross Check
Multiple Regression
formula is a symbol presenting the relation between the response variable and predictor
variables.
Multiple Regression
Objective: The goal of the model is to establish the relationship between "mpg" as a
response variable with "disp","hp" and "wt" as predictor variables.
Solution: We create a subset of these variables from the mtcars data set for this purpose.
> input <- mtcars[,c("mpg","disp","hp","wt")]
> head(input)
> model <- lm(mpg~disp+hp+wt, data = input) # Directly also possible without creating a
subset
>a <- coef(model) # Get the Intercept and coefficients as vector elements.
>print(a)
Final Equation
Y = a + bdisp·x1 + bhp·x2 + bwt·x3
or
Y = 37.15 + (-0.000937)·x1 + (-0.0311)·x2 + (-3.8008)·x3
For a car with disp = 221, hp = 102 and wt = 2.91 the predicted mileage is −
Y = 37.15+(-0.000937)*221+(-0.0311)*102+(-3.8008)*2.91 = 22.7104
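The hand calculation above can be reproduced with predict(); the model, data, and new values all come from the text:

```r
# Fit the mpg ~ disp + hp + wt model on mtcars and predict
# the mileage of the hypothetical new car.
model  <- lm(mpg ~ disp + hp + wt, data = mtcars)
newcar <- data.frame(disp = 221, hp = 102, wt = 2.91)
predict(model, newcar)   # roughly 22.7, matching the hand calculation
```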
Logistic Regression
In logistic regression, we are only concerned with the probability of the outcome of the
dependent variable (success or failure).
The Logistic Regression is a regression model in which the response variable (dependent
variable) has categorical values such as True/False or 0/1.
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
family is an R object used to specify the details of the model. Its value is binomial for
logistic regression.
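The am.data model summarized below is not defined in the excerpt; judging by the variables and coefficients discussed, it appears to be the standard mtcars example, which can be fit like this:

```r
# Logistic regression: transmission type (am) modeled from
# cylinders, horsepower, and weight in the built-in mtcars data.
am.data <- glm(am ~ cyl + hp + wt, data = mtcars, family = binomial)
print(summary(am.data))
```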
>print(summary(am.data))
In the summary, since the p-value in the last column is more than 0.05 for the variables
"cyl" and "hp", we consider them to be insignificant in contributing to the value of the
variable "am".
Only weight (wt) impacts the "am" value in this regression model.
Cross Check
-11.9483662=19.70288+(-9.14947 *3.460)
Statistical Functions
Unsupervised Learning:
A machine learning technique whereby a system discovers patterns and groupings in
unlabeled data, without training examples.
Supervised Learning:
A machine learning technique whereby a system uses a set of training examples to learn
how to correctly perform a task.
Clustering
Clustering is important because it determines the intrinsic grouping among the
unlabeled data present. There are no universal criteria for a good clustering; it depends
on the user's needs.
Applications of Clustering in different fields
Clustering Types
Clustering plays a big role in machine learning by partitioning the data into groups.
2 Major types:
Hierarchical Clustering
K-Means Clustering
K-means clustering
Object     X  Y
Medicine A 1  1
Medicine B 2  1
Medicine C 4  3
Medicine D 5  4
We also know before hand that these objects belong to two groups of medicine (cluster 1
and cluster 2). The problem now is to determine which medicines belong to cluster 1 and
which medicines belong to the other cluster.
Object     X  Y  Cluster
Medicine A 1  1  1
Medicine B 2  1  1
Medicine C 4  3  2
Medicine D 5  4  2
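The medicine example above can be run through base R's kmeans(); the data frame and seed here are just one way to set it up:

```r
# Four medicines with two attributes (X and Y), clustered into 2 groups.
med <- data.frame(X = c(1, 2, 4, 5), Y = c(1, 1, 3, 4),
                  row.names = c("A", "B", "C", "D"))
set.seed(1)
fit <- kmeans(med, centers = 2)
fit$cluster   # A and B end up together; C and D end up together
```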
Hierarchical Clustering
Hierarchical clustering involves creating clusters that have a predetermined ordering from
top to bottom.
Hierarchical clustering is where you build a cluster tree (a dendrogram) to represent data,
where each group (or “node”) links to two or more successor groups.
The groups are nested and organized as a tree, which ideally ends up as a meaningful
classification scheme.
Each node in the cluster tree contains a group of similar data; Nodes group on the graph
next to other, similar nodes.
Clusters at one level join with clusters in the next level up, using a degree of similarity;
The process carries on until all nodes are in the tree, which gives a visual snapshot of the
data contained in the whole set.
The total number of clusters is not predetermined before you start the tree creation.
Dendrogram
A clade is a branch. Clades are usually labeled with Greek letters from left to right (e.g.
α, β, δ, ...).
Each clade has one or more leaves. The leaves in the above image are:
Single (simplicifolius): F
Double (bifolius): D E
Triple (trifolious): A B C
Dendrogram Interpretation
The clades are arranged according to how similar (or dissimilar) they are. Clades that are
close to the same height are similar to each other; clades with different heights are
dissimilar — the greater the difference in height, the more dissimilarity (you can
measure similarity in many different ways; One of the most popular measures is
Pearson’s Correlation Coefficient).
Leaves A, B, and C are more similar to each other than they are to leaves D, E, or F.
Leaves D and E are more similar to each other than they are to leaves A, B, C, or F.
Note that on the above graph, the same clade, β, joins leaves A, B, C, D, and E. That
means that the two groups (A, B, C and D, E) are more similar to each other than they
are to F.
Hierarchical Clustering
> car1<-hclust(d=dist(mtcars))
> plot(car1)
> car2<-hclust(dist(mtcars),method="single")
> plot(car2)
> car3<-hclust(dist(mtcars),method="complete")
> plot(car3)
> car4<-hclust(dist(mtcars),method="average")
> plot(car4)
> car5<-hclust(dist(mtcars),method="complete")
> plot(car5)
You can also specify the type of tree produced by the clustering, or split the
observations into a predefined number of groups (for example, with cutree()).
Where to cut
> install.packages("rattle.data")
> library(rattle.data)
> head(wine)
> wine1<-hclust(d=dist(wine))
> plot(wine1)
K-Means Clustering
> wine2<-kmeans(x=wine,centers = 3)
> print(wine2)
> require(useful)
> plot(wine2,data=wine)
Hartigan’s Rule
> plot(wine2,data=wine)
> wine4<-FitKMeans(wine, max.clusters=20, nstart=25,seed=454356)
> wine4
> PlotHartigan(wine4)
Data Import in R
For Stata and Systat, use the foreign package. For SPSS and SAS use the Hmisc
package for ease and functionality.
If you have a .txt or a tab-delimited text file, you can easily import it with the basic R
function read.table().
> df
Group Manipulation
Apply
It can take character, numeric, or logical elements, but only elements of the same type
in the matrix.
# Create a matrix
> thematrix<-matrix(1:9,nrow=3)
> thematrix
> apply(thematrix,2,sum)
[1] 6 15 24
# Sum of rows
> apply(thematrix,1,sum)
[1] 12 15 18
OR
> rowSums(thematrix)
[1] 12 15 18
> colSums(thematrix)
[1] 6 15 24
Lapply
lapply() works by applying a function to each element of a list and returns the results as
a list.
> thelist<-list(A=matrix(1:9,3),B=1:5,C=matrix(1:4,2),D=2)
> thelist
Sapply
> sapply(thelist,sum)
 A  B  C  D
45 15 10  2
Since a list is technically a kind of vector, the sapply() and lapply() functions also
accept a plain vector as input.
> lapply(thenames,nchar)
Mapply
This function applies to each element of multiple lists.
> firstlist<-list(A=matrix(1:16,4),B=matrix(1:16,2),C=1:5)
> secondlist<-list(A=matrix(1:16,4),B=matrix(1:16,8),C=15:1)
> firstlist
>secondlist
> mapply(identical,firstlist,secondlist)
    A     B     C
 TRUE FALSE FALSE
> simpleFunc<-function(x,y) NROW(x)+NROW(y)
> mapply(simpleFunc,firstlist,secondlist)
 A  B  C
 8 10 20
# Aggregate Function
> require(ggplot2)
> library(plyr)
Aggregate Function
> head(diamonds)
# Calculate average price of each type of cut
> aggregate(price~cut,diamonds,mean)
        cut    price
1      Fair 4358.758
2      Good 3928.864
3 Very Good 3981.760
4   Premium 4584.258
5     Ideal 3457.542
> aggregate(price~cut+color,diamonds,mean)
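The same aggregate() pattern works on any data frame; a sketch using the built-in mtcars data instead of diamonds:

```r
# Mean mpg for each number of cylinders.
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```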
Model Diagnostics
Step 1: Analysis of residuals (difference between actual response and fitted values)
Step 2: Import
https://data.montgomerycountymd.gov/api/views/2qd6-mr43/rows.csv?accessType=DOWNLOAD
> library(readr)
Or
Data introduction
> View(USA_Housing)
> names(USA_Housing) <- c("avgincome","avghouseage","avgnorooms","avgbedrooms","population","price","address")
Model Building
#Simple Regression/OLS
> house1<-lm(price~avghouseage,data=USA_Housing)
>house1
> house2<-lm(price~avghouseage+avgnorooms,data=USA_Housing)
> house3<-lm(price~avghouseage+avgnorooms+avgbedrooms,data=USA_Housing)
> house4 <- lm(price~avghouseage+avgnorooms+avgbedrooms+population, data=USA_Housing)
> house5 <- lm(price~avghouseage+avgnorooms+avgbedrooms+population+avgincome, data=USA_Housing)
Visualizations of Regression
> library(coefplot)
> library(ggplot2)
> coefplot(house1)
> coefplot(house2)
> coefplot(house3)
> coefplot(house4)
> coefplot(house5)
Visualizations of Regression
> plot(house1) # Q-Q plot: if the model is a good fit, the standardized residuals fall
along a straight line when plotted against the theoretical quantiles of the normal
distribution.
> plot(house2,which=2)
> plot(house3,which=2)
> plot(house4,which=2)
> plot(house5,which=2)
Visualizations of Regression
# Histogram of residuals
> ggplot(house1,aes(x=.resid))+geom_histogram()
> ggplot(house2,aes(x=.resid))+geom_histogram()
> ggplot(house3,aes(x=.resid))+geom_histogram()
> ggplot(house4,aes(x=.resid))+geom_histogram()
> ggplot(house5,aes(x=.resid))+geom_histogram()
Comparing Models
# Use ANOVA to return a table of results including the residual sum of squares (RSS),
which is a measure of error; the lower, the better.
> anova(house1,house2,house3,house4,house5)
The problem with RSS is that it always improves when additional variables are added.
(1) The Akaike information criterion (AIC) penalizes model complexity; the model with
the lowest AIC is preferred.
(2) The Bayesian information criterion (BIC) or Schwarz information criterion (also SIC,
SBC, SBIC) is a criterion for model selection among a finite set of models; the model
with the lowest BIC is preferred.
> AIC(house1,house2,house3,house4,house5)
# Rule of thumb (Andrew Gelman): for every added variable in the model, the deviance
should drop by two. This applies even for categorical/factor variables.
Step 1: The data is split into K non-overlapping sections.
Step 2: The model is fit using K-1 sections of the data.
Step 3: The model is then used to make predictions about the kth section of data.
Step 4: This is repeated K times, so that every section is held out for testing once and
included in the model fitting K-1 times.
K Fold
> library(boot)
> houseg1<-glm(price~avghouseage,data=USA_Housing,
family=gaussian(link="identity"))
> identical(coef(house1), coef(houseg1))
[1] TRUE
> housecv1<-cv.glm(USA_Housing,houseg1,K=5)
> housecv1$delta
The first value is the raw cross-validation (mean squared) error; the second is the
adjusted cross-validation error.
> houseg2<-glm(price~avghouseage +avgnorooms,data=USA_Housing,
family=gaussian(link="identity"))
> houseg3 <- glm(price~avghouseage+avgnorooms+avgbedrooms, data=USA_Housing, family=gaussian(link="identity"))
> houseg4 <- glm(price~avghouseage+avgnorooms+avgbedrooms+population, data=USA_Housing, family=gaussian(link="identity"))
> houseg5 <- glm(price~avghouseage+avgnorooms+avgbedrooms+population+avgincome, data=USA_Housing, family=gaussian(link="identity"))
> housecv2 <- cv.glm(USA_Housing, houseg2, K=5)
> housecv3 <- cv.glm(USA_Housing, houseg3, K=5)
> housecv4 <- cv.glm(USA_Housing, houseg4, K=5)
> housecv5 <- cv.glm(USA_Housing, houseg5, K=5)
> cvresults <- as.data.frame(rbind(housecv1$delta, housecv2$delta, housecv3$delta, housecv4$delta, housecv5$delta))
> cvresults
Or