Agglomerative hierarchical clustering - This algorithm works by grouping the data one by one on the basis of the nearest distance measure, computed over all pairwise distances between the data points. After each merge, the distances are recalculated, but which distance should be used once groups have been formed? Many methods are available. Some of them are:
1) Single linkage (nearest distance).
2) Complete linkage (farthest distance).
3) Average linkage (average distance).
4) Centroid distance.
5) Ward's method: the sum of squared Euclidean distances is minimized.
In this way we go on grouping the data until one cluster is formed. On the basis of the dendrogram we can then decide how many clusters should actually be present.
The agglomerative algorithm (here with single linkage) proceeds as follows:
1) Begin with N clusters, each containing a single data point. Set the sequence number m = 0 and the level L(0) = 0.
2) Find the least dissimilar pair of clusters (r) and (s) in the current clustering, i.e., the pair with d[(r),(s)] = min d[(i),(j)] over all pairs.
3) Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4) Update the distance matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The distance between the new cluster, denoted (r,s), and an old cluster (k) is defined in this way: d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
5) If all the data points are in one cluster, stop; else repeat from step 2).
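The steps above can be sketched with base R's hclust() function; single linkage implements the min() update rule from step 4. A minimal illustration on hypothetical one-dimensional data:

```r
# Five 1-D points; their pairwise distance matrix is the input of step 1
x <- matrix(c(1, 2, 6, 8, 9), ncol = 1)
d <- dist(x)                        # pairwise Euclidean distances
# Steps 2-5: repeatedly merge the two nearest clusters (single linkage)
hc <- hclust(d, method = "single")
hc$height                           # merge levels L(m): 1 1 2 4
cutree(hc, k = 2)                   # read two clusters off the tree: 1 1 2 2 2
```

The merge levels L(m) grow as clusters get farther apart; cutting the tree at a chosen height recovers the clusters.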
Divisive hierarchical clustering - It is simply the reverse of the agglomerative hierarchical approach.
Advantages
1) No a priori information about the number of clusters is required.
2) Easy to implement, and gives the best results in some cases.
Disadvantages
1) The algorithm can never undo what was done previously.
2) A time complexity of at least O(n² log n) is required, where n is the number of data points.
3) Depending on the type of distance measure chosen for merging, the algorithms can suffer from one or more of the following:
i) Sensitivity to noise and outliers
ii) Breaking large clusters
iii) Difficulty handling different sized clusters and convex shapes
4) No objective function is directly minimized.
5) It is sometimes difficult to identify the correct number of clusters from the dendrogram.
Fig I: Dendrogram formed from a data set of size N = 60
There are two standard clustering strategies: partitioning methods (e.g., k-means and pam) and hierarchical
clustering.
Hierarchical clustering is an alternative to k-means clustering for identifying groups in a dataset. It does not require pre-specifying the number of clusters to be generated. The result is a tree-based representation of the observations, called a dendrogram. It uses a pairwise distance matrix between observations as the clustering criterion.
In this article we provide:
R lab sections with many examples for computing hierarchical clustering, and for visualizing and comparing dendrograms.
1 Required R packages
The required packages for this chapter are:
1. Install factoextra from GitHub:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
2. Install cluster and dendextend from CRAN:
install.packages("cluster")
install.packages("dendextend")
3. Load the packages:
library(cluster)
library(dendextend)
library(factoextra)
2 Algorithm
Hierarchical clustering can be divided into two main types: agglomerative and divisive.
1. Agglomerative clustering: It is also known as AGNES (Agglomerative Nesting) and works in a bottom-up manner. That is, each object is initially considered as a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar are combined into a new, bigger cluster (node). This procedure is iterated until all points are members of just one single big cluster (root) (see figure below). The result is a tree which can be plotted as a dendrogram.
2. Divisive hierarchical clustering: It is also known as DIANA (Divisive Analysis) and works in a top-down manner, the inverse order of AGNES. It begins with the root, in which all objects are included in a single cluster. At each step of the iteration, the most heterogeneous cluster is divided into two. The process is iterated until all objects are in their own cluster (see figure below).
Note that agglomerative clustering is good at identifying small clusters. Divisive hierarchical clustering is good at
identifying large clusters.
The merging or the division of clusters is performed according to some (dis)similarity measure. In R software, the Euclidean distance is used by default to measure the dissimilarity between each pair of observations.
As we already know, it's easy to compute a dissimilarity measure between two observations. As mentioned above, the two clusters that are most similar are fused into a new big cluster.
A natural question is :
How to measure the dissimilarity between two clusters of observations?
A number of different cluster agglomeration methods (i.e., linkage methods) have been developed to answer this question. The most common methods are:
Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements
in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these
dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in
cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage
criterion. It tends to produce long, loose clusters.
Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in
cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance
between the two clusters.
Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p, the number of variables) and the centroid for cluster 2.
Ward's minimum variance method: It minimizes the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged.
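To make the difference concrete, here is a small sketch on hypothetical one-dimensional data showing how the linkage choice changes the merge heights: single linkage joins two clusters at their nearest-pair distance, complete linkage at their farthest-pair distance:

```r
# Two tight groups (1-3 and 10-12) plus one outlying point (25)
x <- matrix(c(1, 2, 3, 10, 11, 12, 25), ncol = 1)
d <- dist(x)
hc.single   <- hclust(d, method = "single")    # long, loose merges
hc.complete <- hclust(d, method = "complete")  # compact clusters
# Final merge height: nearest-pair distance (25 - 12 = 13) for single,
# farthest-pair distance (25 - 1 = 24) for complete
max(hc.single$height)    # 13
max(hc.complete$height)  # 24
```

The same data, the same distance matrix: only the between-cluster distance definition differs, and the resulting trees merge at different heights.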
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
Note that the variables have very different means and variances. This is explained by the fact that the variables are measured in different units: Murder, Rape, and Assault are measured as the number of occurrences per 100,000 people, and UrbanPop is the percentage of the state's population that lives in an urban area.
They must be standardized (i.e., scaled) to make them comparable. Recall that standardization consists of transforming the variables such that they have mean zero and standard deviation one. You can read more about standardization in the following article: distance measures and scaling.
As we don't want the hierarchical clustering result to depend on an arbitrary variable unit, we start by scaling the data using the R function scale() as follows:
df <- scale(df)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
hclust() [in the stats package] and agnes() [in the cluster package] can be used for agglomerative hierarchical clustering (HC). A key argument is:
method: the agglomeration method to be used. Allowed values include ward.D, ward.D2, single, complete, average, mcquitty, median and centroid.
The dist() function is used to compute the Euclidean distance between observations. Finally, observations are clustered using Ward's method.
# Dissimilarity matrix
d <- dist(df, method = "euclidean")
# Hierarchical clustering using Ward's method
res.hc <- hclust(d, method = "ward.D2" )
# Plot the obtained dendrogram
plot(res.hc, cex = 0.6, hang = -1)
x: data matrix, data frame, or dissimilarity matrix. In the case of a matrix or data frame, rows are observations and columns are variables. In the case of a dissimilarity matrix, x is typically the output of daisy() or dist().
metric: the metric to be used for calculating dissimilarities between observations. Possible values are euclidean and manhattan.
stand: if TRUE, the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column) by subtracting the variable's mean value and dividing by the variable's mean absolute deviation.
method: the clustering method. Possible values include average, single, complete and ward.
The function agnes() returns an object of class agnes (see ?agnes.object) which has methods for the
functions: print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree().
The function diana() returns an object of class diana (see ?diana.object) which also has methods for the functions: print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree().
Compared to other agglomerative clustering methods such as hclust(), agnes() has the following features:
It yields the agglomerative coefficient (see agnes.object) which measures the amount of clustering
structure found
Apart from the usual tree it also provides the banner, a novel graphical display (see plot.agnes).
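The object res.agnes plotted below is not computed in this excerpt; a minimal sketch, assuming df is the scaled USArrests data from the earlier sections (note that agnes() names Ward's method "ward", not "ward.D2"):

```r
library(cluster)
df <- scale(USArrests)
# Agglomerative hierarchical clustering with Ward's method
res.agnes <- agnes(df, method = "ward")
# Agglomerative coefficient: closer to 1 means stronger clustering structure
res.agnes$ac
# Plot the tree
pltree(res.agnes, cex = 0.6, hang = -1, main = "Dendrogram of agnes")
```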
It's also possible to draw the AGNES dendrogram using the functions plot.hclust() and plot.dendrogram() as follows:
# plot.hclust()
plot(as.hclust(res.agnes), cex = 0.6, hang = -1)
# plot.dendrogram()
plot(as.dendrogram(res.agnes), cex = 0.6,
horiz = TRUE)
# Compute diana()
res.diana <- diana(df)
# Plot the tree
pltree(res.diana, cex = 0.6, hang = -1,
main = "Dendrogram of diana")
As for plotting the AGNES dendrogram, the functions plot.hclust() and plot.dendrogram() can be used as follows:
# plot.hclust()
plot(as.hclust(res.diana), cex = 0.6, hang = -1)
# plot.dendrogram()
plot(as.dendrogram(res.diana), cex = 0.6,
horiz = TRUE)
Note that conclusions about the proximity of two observations can be drawn only from the height at which the branches containing those two observations are first fused. We cannot use the proximity of two observations along the horizontal axis as a criterion of their similarity.
In order to identify sub-groups (i.e. clusters), we can cut the dendrogram at a certain height as described in the
next section.
It's also possible to draw the dendrogram with a border around the 4 clusters. The argument border is used to specify the border colors for the rectangles:
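The group vector grp used below is likewise assumed to come from cutting the Ward tree into 4 clusters; a sketch, assuming df and res.hc as in the earlier sections:

```r
df <- scale(USArrests)
res.hc <- hclust(dist(df, method = "euclidean"), method = "ward.D2")
# Cut the tree into 4 groups
grp <- cutree(res.hc, k = 4)
table(grp)  # number of states in each cluster
# Dendrogram with a colored border around each of the 4 clusters
plot(res.hc, cex = 0.6, hang = -1)
rect.hclust(res.hc, k = 4, border = 2:5)
```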
Using the function fviz_cluster() [in factoextra], we can also visualize the result in a scatter plot. Observations
are represented by points in the plot, using principal components. A frame is drawn around each cluster.
library(factoextra)
fviz_cluster(list(data = df, cluster = grp))
The function cutree() can also be used to cut the trees generated with agnes() and diana(), as follows:
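A sketch of cutting the agnes() and diana() trees (via as.hclust(), since cutree() expects an hclust-like object); df is again assumed to be the scaled USArrests data:

```r
library(cluster)
df <- scale(USArrests)
res.agnes <- agnes(df, method = "ward")
res.diana <- diana(df)
# Cut each tree into 4 groups
grp.agnes <- cutree(as.hclust(res.agnes), k = 4)
grp.diana <- cutree(as.hclust(res.diana), k = 4)
table(grp.agnes)
table(grp.diana)
```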
Note that, when the data are standardized, there is a functional relationship between the Pearson correlation coefficient r(x, y) and the Euclidean distance:
d_euc(x, y) = sqrt(2m[1 − r(x, y)])
where x and y are two standardized m-vectors with zero mean and unit length.
For example, the standard k-means clustering uses the Euclidean distance measure. So, if you want to compute k-means using a correlation distance, you just have to normalize the points before clustering.
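The relationship can be checked numerically. One caveat: R's scale() standardizes with the n − 1 convention, so the exact identity uses m − 1 in place of m (a sketch on simulated data):

```r
set.seed(42)
m <- 100
# Two standardized m-vectors (zero mean, unit standard deviation)
x <- as.numeric(scale(rnorm(m)))
y <- as.numeric(scale(rnorm(m)))
d.euc <- sqrt(sum((x - y)^2))   # Euclidean distance
r <- cor(x, y)                  # Pearson correlation
# d.euc equals sqrt(2 * (m - 1) * (1 - r)) exactly (up to rounding)
all.equal(d.euc, sqrt(2 * (m - 1) * (1 - r)))
```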
In the R code below, we'll start by computing the pairwise distance matrix using the function dist(). Next, hierarchical clustering (HC) is computed using two different linkage methods (average and ward.D2). Finally, the results of HC are transformed into dendrograms:
library(dendextend)
# Compute distance matrix
res.dist <- dist(df, method = "euclidean")
# Compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "average")
hc2 <- hclust(res.dist, method = "ward.D2")
# Create two dendrograms
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
# Create a list of dendrograms
dend_list <- dendlist(dend1, dend2)
9.1 Tanglegram
The function tanglegram() plots two dendrograms side by side, with their labels connected by lines. It can be used for visually comparing two methods of hierarchical clustering, as follows:
tanglegram(dend1, dend2)
Note that unique nodes, with a combination of labels/items not present in the other tree, are highlighted with
dashed lines.
The quality of the alignment of the two trees can be measured using the function entanglement(). The output
of tanglegram() can be customized using many other options as follow:
tanglegram(dend1, dend2,
  highlight_distinct_edges = FALSE,       # Turn off dashed lines
  common_subtrees_color_lines = FALSE,    # Turn off line colors
  common_subtrees_color_branches = TRUE,  # Color common branches
  main = paste("entanglement =", round(entanglement(dend_list), 2))
)
Entanglement is a measure between 1 (full entanglement) and 0 (no entanglement). A lower entanglement coefficient corresponds to a better alignment.
# Cophenetic correlation coefficient
cor_cophenetic(dend1, dend2)
## [1] 0.9646883
# Baker correlation coefficient
cor_bakers_gamma(dend1, dend2)
## [1] 0.9622885
It's also possible to compare multiple dendrograms simultaneously. The chaining operator %>% (available in dendextend) is used to run multiple functions in sequence. It's useful for simplifying the code:
# Create multiple dendrograms by chaining
dend1 <- df %>% dist %>% hclust("complete") %>% as.dendrogram
dend2 <- df %>% dist %>% hclust("single") %>% as.dendrogram
dend3 <- df %>% dist %>% hclust("average") %>% as.dendrogram
dend4 <- df %>% dist %>% hclust("centroid") %>% as.dendrogram
# Compute correlation matrix
dend_list <- dendlist("Complete" = dend1, "Single" = dend2,
"Average" = dend3, "Centroid" = dend4)
cors <- cor.dendlist(dend_list)
# Print correlation matrix
round(cors, 2)
##          Complete Single Average Centroid
## Complete     1.00   0.76    0.99     0.75
## Single       0.76   1.00    0.80     0.84
## Average      0.99   0.80    1.00     0.74
## Centroid     0.75   0.84    0.74     1.00
# Visualize the correlation matrix using corrplot package
library(corrplot)
corrplot(cors, method = "pie", type = "lower")
10 Infos
This analysis has been performed using R software (ver. 3.2.1)