LESSON 38:
CLUSTER ANALYSIS
So far we have learnt how to classify objects into different groups
using Discriminant analysis. Discriminant analysis is essentially
used to divide objects into two or more known groups; its result is
an equation that can predict which class or group a new object will
fall into. We now turn to another technique that groups objects:
cluster analysis. However, the basis on which it does so is different.
Cluster analysis is a technique for grouping individuals or objects
into previously unknown groups: unlike Discriminant analysis,
the number and characteristics of the groups are not known prior
to the analysis.
As this is again a technique built on complex, high-level statistics,
our exposition will focus on obtaining an intuitive understanding
of the technique and its applications.
Cluster Analysis: What It Is and What It Is Not
Frequently marketers are interested in putting people or objects
into groups on the basis of similarities among them on a common
set of measures. This concept has extensive applicability in
marketing, where we may want to group potential customers into
homogeneous groups so that a specific marketing mix can be
developed to reach each particular group. The basis of grouping
could involve a variety of socioeconomic and psychographic
characteristics.
One goal of a marketing manager is to identify consumer
segments so that the marketing programme can be developed and
tailored to each segment. This can be done by clustering
consumers. For example, we might cluster them on the basis of
the product benefits they seek from a college, or by lifestyles. The
result for students might be one group that likes outdoor
activities, another that enjoys entertainment, and so on. Each
segment would have different product needs and may respond
differently to advertising.
Alternatively, we might want to cluster brands to determine which
brands are regarded as similar. If a test marketing programme is
planned, we could cluster cities so that different marketing
programmes can be tried out in different cities.
Cluster analysis is neither a single technique nor, strictly speaking,
a statistical technique. It is a family of mathematical procedures
for dividing data into classes, without a preconceived notion of
what those classes are, based on relationships within the data.
There are many different ways to do this, and some of them use
statistical probabilities or statistical quantities such as sums of
squares at various points. But overall, the techniques themselves
are not really statistical, as they give you no means of assessing
likelihood.
Once you have come up with some classes, what are two techniques
that we have learned that you could use to explore the validity of
your classification?
Hierarchical vs. Nonhierarchical Approaches
Nonhierarchical approaches divide your data into a set of classes,
and each case is assigned to exactly one class. Hierarchical
approaches instead build a nested tree of clusters. Generally, the
hierarchical methods are more widely used because they give greater
insight into the overall structure. However, they have their own
challenges.
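As an illustration (not part of the original lesson, which works in SPSS), here is a minimal sketch of a nonhierarchical approach in Python using scikit-learn's KMeans; the random data and the choice of three clusters are assumptions for demonstration only.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))   # 30 hypothetical cases, 2 variables

# Partition the cases into a fixed number of classes; each case is
# assigned to exactly one cluster.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)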
There are two ways of doing a hierarchical analysis: agglomeratively
or divisively. Divisive methods are similar to what we discussed
with regression tree analysis: they start with the full data set,
subdivide it, and continue subdividing the subsets until some
predetermined threshold is reached. That threshold might be
determined by group size or by relationships among group
members. Agglomerative methods begin with each sample
representing its own cluster, join the two most similar, and then
repeatedly join clusters together until everything is merged.
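For illustration, a minimal sketch of agglomerative clustering in Python with scipy (an assumed tool choice): each row of the resulting linkage matrix records one join of the two most similar clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))   # 30 hypothetical cases, 4 variables

Z = linkage(X, method='complete', metric='euclidean')
# Each row of Z: (cluster i, cluster j, joining distance, new cluster size)
print(Z[:5])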
Results of hierarchical methods are usually presented as a
dendrogram:
[Dendrogram: "Rescaled Distance Cluster Combine" - an agglomeration plot for Cases 1-30 on a rescaled distance axis from 0 to 25, showing the cases merging step by step into a single cluster.]
The further to the right a join occurs, the more dissimilar the
clusters being joined are, and eventually everything is joined
together. Both agglomerative and divisive techniques can be used
to produce a dendrogram. A key thing to understand about
hierarchical techniques is that each sample has membership in
multiple nested groups, because the groups are themselves
clustered hierarchically. When you actually work with the data, you
often want to choose a cut point somewhere between each sample
being its own cluster and all samples forming one giant cluster.
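A minimal sketch of drawing a dendrogram and choosing such a cut point, again in Python with scipy; the four-cluster cut is an arbitrary illustrative choice.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
Z = linkage(X, method='complete')

dendrogram(Z, labels=[f'Case {i + 1}' for i in range(30)])
plt.ylabel('Distance at which clusters combine')
plt.show()

# Cut the tree somewhere between "every case its own cluster" and
# "one giant cluster" - here, into exactly 4 clusters.
labels = fcluster(Z, t=4, criterion='maxclust')
print(labels)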
Distance Measures and Agglomeration Methods
In order to do the agglomeration or division, there has to be a
means of measuring multivariate distance. I should also say that
while we will work with cluster analysis on raw data, often what
happens is that you first reduce the data using an ordination
technique and then cluster the reduced data. That is another
approach that can be used, and it is particularly useful if you're
working with something like species data, where you may literally
have hundreds of variables. That's too many variables! Species
data are really a somewhat special case, but a case that many of us
might wish to work with.
The common distance measures are the Euclidean distance, the
squared Euclidean distance, the city-block distance (a simple
distance that doesn't adjust for geometry, and so is not usually
recommended), and plenty of others. Generally, the Euclidean
distance in multivariate space will be the preferred measure. SPSS
gives you 8 choices if your data are interval or greater, 2 choices for
count data, and 7 choices for binary data. Again, you can usually
get around this decision quite easily by just using the Euclidean
distance.
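For concreteness, a minimal sketch comparing these distance measures on two hypothetical observation vectors (the values are assumptions):

import numpy as np
from scipy.spatial.distance import euclidean, sqeuclidean, cityblock

a = np.array([2.0, 5.0, 1.0])
b = np.array([4.0, 1.0, 3.0])

print(euclidean(a, b))     # square root of the sum of squared differences
print(sqeuclidean(a, b))   # sum of squared differences
print(cityblock(a, b))     # sum of absolute differences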
With species data, there is another distance measure that is often
used: the percentage similarity. Percentage similarity is calculated
between pairs of samples. It is the sum over species of the
minimum value for each species (i.e., whichever sample has the
least of a species, even if that number is zero), multiplied by 2 and
divided by the sum of the species abundances of the two samples.
This is a nice number because it varies between 0 and 1. One minus
the percentage similarity is the percentage dissimilarity, which is a
measure of distance. Most clustering of vegetation samples will
start with a matrix of dissimilarities and work directly from that,
which often gives a more stable solution than working with the
original data.
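The calculation is easy to express directly; here is a minimal sketch with two assumed species-abundance samples:

import numpy as np

def percentage_similarity(s1, s2):
    # 2 * (sum of per-species minima) / (total abundance of both samples)
    return 2 * np.minimum(s1, s2).sum() / (s1.sum() + s2.sum())

s1 = np.array([10, 0, 3, 7])   # assumed abundances, sample 1
s2 = np.array([6, 2, 3, 1])    # assumed abundances, sample 2

ps = percentage_similarity(s1, s2)
print(ps)        # similarity, between 0 and 1
print(1 - ps)    # percentage dissimilarity, usable as a distance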
Another decision point comes in how the agglomeration is done.
This is not so important at the very first step, where you are
simply joining two samples, but it becomes very important later
on. The key decision is whether to join groups by looking at the
individual cases and their similarity (choosing the most similar
cases), at an average of the cluster, or at some other means of
joining groups of points.
There are basically three common techniques:
Single linkage or nearest neighbour joins the two clusters that
contain the two most similar individual points. This can cause a
lot of problems - often what you get is a series of clusters that
each gain one new member at a time, which is also called chaining.
That is not a terrifically useful structure, although in certain
circumstances you might want one.
Complete linkage or furthest neighbour joins the two clusters
whose most distant points are the most similar. This method
tends to create globular clusters that have unequal variances and
sample sizes, which is often what we expect in ecology. I've found
that it is the most practical and useful method in terms of being
able to interpret the output straightforwardly. The clusters tend to
be very clean and well defined, and you don't get a lot of chaining
or reversals.
Ward's Method and Between-Groups Linkage
These are two further techniques; both tend to favour spherical
clusters with equal variance and sample size. Ward's method is
based on minimizing the within-cluster sum of squares. The
between-groups method also uses sums of squares.
There are other methods - within-groups linkage, which is closely
related to Ward's, and the centroid method, which joins the groups
with the nearest centroids - but they are not very useful for most
practical purposes. Two common problems in cluster analysis are
chaining and reversals. Chaining is where single samples join a
larger cluster one at a time, so that instead of a cluster analysis you
really have an ordination, i.e., there is no true hierarchical structure.
Reversals occur when an entity joins another cluster at a higher
level of similarity than an earlier join, making the hierarchical
structure go backwards in some cases.
Generally, these problems can largely be avoided by using a
complete linkage method.
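To see how much the agglomeration method matters, here is a minimal sketch running several linkage methods on the same assumed data and comparing the resulting cluster sizes; single linkage will often chain, leaving one huge cluster and a few stragglers.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))   # assumed data

for method in ('single', 'complete', 'average', 'ward'):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion='maxclust')
    print(method, np.bincount(labels)[1:])   # size of each cluster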
Assumptions and Scaling Problems
Cluster analysis has no assumptions per se, because you aren't
testing any hypotheses, but there are some considerations about
scaling. There are also problems that can occur if your variables
have a certain structure and you use techniques that don't account
for it. For example, if you are using a technique that bases distance
on linear correlation, then you are assuming that linear correlation
is a reasonable approximation of distance within your data.
If the variables that you are clustering have different scales, then
obviously those with a stronger structure and greater overall
variance will dominate the clusters. If you wish to avoid this, you
should rescale everything to a common system. SPSS offers several
ways to do this - z-scores, scaling from 0 to 1, scaling from -1 to
+1, etc. Even when your data are all on the same scale, you may
still have some problems. We'll talk a lot more about scaling when
we get to ordination, but for now we'll just think about it in terms
of scaling different variables to a common system.
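The rescaling options mentioned above are simple to compute by hand; a minimal sketch with two assumed variables on very different scales:

import numpy as np

rng = np.random.default_rng(3)
# Assumed data: income (in thousands) and age (in years)
X = np.column_stack([rng.normal(50, 15, 30), rng.normal(35, 8, 30)])

z = (X - X.mean(axis=0)) / X.std(axis=0)                       # z-scores
unit = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))   # 0 to 1
signed = 2 * unit - 1                                          # -1 to +1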
Interpretation and Validation
The interpretation and validation stages of cluster analysis are
difficult, because there is no way to really effectively assess the
outcome. The basic question is: is it useful? Answer that question
and you will have gone a long way. You can statistically evaluate
how well differentiated your clusters are using MANOVA, and you
can use DFA to look at which variables are contributing most
strongly to
the clustering. Using these other techniques to iteratively improve
your clustering is recommended.
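A minimal sketch of this validation idea under assumed data: cluster first, then fit a discriminant analysis to the cluster labels to see how separable the clusters are and which variables drive the separation.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 4))   # assumed data

# Step 1: a hierarchical 3-cluster solution (illustrative choice)
labels = fcluster(linkage(X, method='complete'), t=3, criterion='maxclust')

# Step 2: discriminant analysis on the cluster labels
lda = LinearDiscriminantAnalysis().fit(X, labels)
print(lda.score(X, labels))   # how well the clusters can be re-predicted
print(lda.coef_)              # which variables separate the clusters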
Much of the content and material for this chapter has been adapted
from 1998 course notes by Dr. Marilyn D. Walker and from various
other websites. Further applications and theory are drawn from
various websites.
Points to Ponder
Five steps are basic to the application of most cluster studies:
1. Selection of the sample to be clustered (e.g. buyers, medical
patients, inventory items, products, employees).
2. Definition of the variables on which to measure the objects,
events, or people (e.g. financial status, political affiliation, market
segment characteristics, symptom classes, product competition
definitions, productivity attributes).
3. Computation of similarities among the entities through
correlation, Euclidean distances, or other techniques.
4. Selection of mutually exclusive clusters (maximization of
within-cluster similarity and between-cluster differences) or of
hierarchically arranged clusters.
5. Cluster comparison and validation.
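As a closing illustration, a minimal end-to-end sketch of these five steps in Python, under assumed data:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
# Steps 1-2: a sample of 30 hypothetical buyers, 3 measured attributes
X = rng.normal(size=(30, 3))
# Step 3: similarities among the entities (Euclidean distances here)
D = pdist(X, metric='euclidean')
# Step 4: mutually exclusive clusters from a hierarchical solution
labels = fcluster(linkage(D, method='complete'), t=3, criterion='maxclust')
# Step 5: comparison and validation, e.g. inspecting cluster sizes
print(np.bincount(labels)[1:])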
Notes
