International Journal of
Geographical Information
Science
Publication details, including instructions for
authors and subscription information:
http://www.tandfonline.com/loi/tgis20
To cite this article: ALAN T. MURRAY & VLADIMIR ESTIVILL-CASTRO (1998): Cluster
discovery techniques for exploratory spatial data analysis, International Journal
of Geographical Information Science, 12:5, 431-443
To link to this article: http://dx.doi.org/10.1080/136588198241734
Research Article
Cluster discovery techniques for exploratory spatial data analysis
ALAN T. MURRAY
Australian Housing and Urban Research Institute, Department of Geographical
Sciences and Planning, University of Queensland, Brisbane, Queensland 4072,
Australia
email: alan.murray@mailbox.uq.edu.au
1. Introduction
The following notation will be used in the specification of the spatial interaction
clustering problem:
i, j = indices of observations (total number = n);
k = index of clusters (total number = p);
$a_i$ = demand/population/weight of observation i;
$d_{ij}$ = spatial difference measure relating observations i and j.
Decision variables:
$y_{ik}$ = 1 if observation i is in cluster k, 0 otherwise.
There are two issues which warrant further comment. The first is the selection of an
appropriate p value, for which various approaches have been suggested. They typically
involve the evaluation of a range of p values in terms of objective performance
as well as other measurement criteria. The second issue is the interpretation of $d_{ij}$.
Here $d_{ij}$ represents a distance based measure of the difference between two spatial
observations. This may be given for a variety of scales (see Hartigan 1975, Everitt
1980). Since the interest here is in information that has geographic association,
spatial proximity is an important component. The measure of distance between two
sites may be a coordinate metric or a network distance or travel time. The coordinate
metric is typically defined as an $l_m$ metric. Without loss of generality, in
two-dimensional space it has the following form:
$$l_m = \left( |x_i - x_j|^m + |y_i - y_j|^m \right)^{1/m}$$
where
$(x_i, y_i)$ = coordinates of observation i;
$(x_j, y_j)$ = coordinates of observation j;
m = distance metric parameter.
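The $l_m$ metric is easy to check numerically. The following is a minimal sketch rather than anything taken from the paper: the function name `lm_distance` and the guard on m are assumptions made for illustration. Setting m = 1 gives rectilinear distance and m = 2 gives Euclidean distance.

```python
def lm_distance(p, q, m=2.0):
    """l_m distance between two 2-D points p = (x_i, y_i) and q = (x_j, y_j)."""
    # Assumption: m >= 1 is enforced because smaller values do not give a metric.
    if m < 1:
        raise ValueError("m must be at least 1 for l_m to be a metric")
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    return (dx ** m + dy ** m) ** (1.0 / m)


# Same pair of points under two common parameter values.
rectilinear = lm_distance((0.0, 0.0), (3.0, 4.0), m=1)  # |3| + |4|
euclidean = lm_distance((0.0, 0.0), (3.0, 4.0), m=2)    # sqrt(9 + 16)
```

For the pair (0, 0) and (3, 4) the rectilinear distance is 7 and the Euclidean distance is 5.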
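Minimize
$$Z = \sum_{k} \sum_{i} \sum_{j} a_i a_j d_{ij} y_{ik} y_{jk}$$
Subject to:
(1)
$$\sum_{k} y_{ik} = 1$$
for all i;
(2)
$$\sum_{i} y_{ik} \geq 1$$
for all k;
(3) Integer requirements.
$$y_{ik} \in \{0, 1\}$$
for each i, k.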
The objective of the OICP is to minimize total weighted difference in the assignment
of observations to clusters. Constraint (1) ensures that each observation is assigned
to a cluster. Constraint (2) imposes the condition that at least one observation is
assigned to each cluster. Constraint (3) imposes integer restrictions on all decision
variables. This formulation roughly corresponds to that given in Rosing and
ReVelle (1986).
The non-linear objective function in the OICP makes it difficult to solve. There
are np decision variables and n + p constraints. Dynamic programming solution
techniques are not generally capable of solving this problem optimally. Further
discussion will follow in the section devoted to solution aspects of the presented
clustering problems.
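To make the OICP objective concrete, the sketch below evaluates it for a given assignment. It is an illustration under stated assumptions, not the authors' code: the function name `oicp_objective`, the label-vector encoding of $y_{ik}$, and the toy data are all hypothetical, and the sum is taken over all ordered pairs (i, j). With a symmetric $d_{ij}$ and $d_{ii} = 0$, restricting the sum to i < j would halve the value without changing which partition is best.

```python
def oicp_objective(labels, a, d):
    """Total weighted within-cluster difference for a cluster assignment.

    labels[i] is the cluster index k of observation i (so y_ik = 1),
    a[i] is the weight of observation i, d[i][j] the difference measure.
    """
    n = len(labels)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # y_ik * y_jk equals 1 only when i and j share a cluster
            if labels[i] == labels[j]:
                total += a[i] * a[j] * d[i][j]
    return total


# Hypothetical data: four observations on a line with unit weights.
pos = [0.0, 1.0, 10.0, 11.0]
a = [1.0, 1.0, 1.0, 1.0]
d = [[abs(u - v) for v in pos] for u in pos]

z_natural = oicp_objective([0, 0, 1, 1], a, d)  # clusters {0, 1} and {10, 11}
z_mixed = oicp_objective([0, 1, 0, 1], a, d)    # clusters {0, 10} and {1, 11}
```

The product $y_{ik} y_{jk}$ is what makes the objective non-linear: the contribution of a pair depends on two decision variables at once.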
3. Centre point clustering
Notice that the difference measure now relates an observation to a centre point.
Also, the $y_{ik}$ decision variables are unaltered in definition. The model formulation is
now given.
3.1. Centre Points Clustering Problem (CPCP)
Minimize
$$Z = \sum_{i} \sum_{k} a_i d_{ik} y_{ik}$$
Subject to:
(1)
$$\sum_{k} y_{ik} = 1$$
for all i;
(2) Integer requirements.
$$y_{ik} \in \{0, 1\}$$
for each i, k.
The objective of the CPCP is to minimize the total difference in the assignment
of observations to cluster centres. Constraint (1) ensures that each observation is
assigned to a cluster. Constraint (2) imposes integer restrictions on all decision
variables.
The somewhat hidden element of this formulation is that the objective is non-linear,
as in the OICP. This is due to the fact that the distance measure is a function
of the cluster membership. Thus, the centre point cannot be identified until the
cluster membership is determined. In two-dimensional space there are (2 + n)p
decision variables (centre point definition variables and $y_{ik}$ variables) and n
constraints associated with the CPCP. As with the OICP, dynamic programming
techniques are generally incapable of solving this problem optimally (see Rosing 1991
for special cases). Further discussion will follow in the section on solution aspects
of presented clustering problems.
Another important item worth mentioning is that there are at least two alternatives
for defining the centre. The version of the CPCP presented here uses
$d_{ik}$, which was suggested in Cooper (1963). This point is commonly referred to as
the Weber point when m = 2 in the facility location literature (Rosing 1991,
Wesolowsky 1993). Another potential version of the CPCP formulation, where $d_{ik}^2$
is given in the objective function, is typically found in statistical cluster analysis
(Fisher 1958, MacQueen 1967, Hartigan 1975). This centre point is defined and
referred to as the centroid or centre of gravity. In fact, the spatial data mining
literature has also relied upon this particular representation (Zhang et al. 1996). One
reason for the use of the squared distance measure is that the centre point within
each cluster is defined by a closed form equation when m = 2, which makes its
computation less difficult. It should be recognized that all of the clustering approaches
presented in this paper could utilize $d_{ik}^2$ as the distance measure. However, reasons
will be given in a later section for the preferred use of $d_{ik}$.
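The closed form mentioned above for the squared measure with m = 2 follows from a standard first-order condition. The derivation below is a sketch; the symbols $C_k$ (the members of cluster k) and $(X_k, Y_k)$ (the centre point of cluster k) are notation assumed here for illustration and do not appear in the original formulation.

```latex
% C_k = \{ i : y_{ik} = 1 \} is the membership of cluster k (assumed notation)
% (X_k, Y_k) is the centre point of cluster k (assumed notation)
\begin{align*}
f(X_k, Y_k) &= \sum_{i \in C_k} a_i \left[ (x_i - X_k)^2 + (y_i - Y_k)^2 \right] \\
\frac{\partial f}{\partial X_k} &= -2 \sum_{i \in C_k} a_i (x_i - X_k) = 0
  \quad\Longrightarrow\quad
  X_k = \frac{\sum_{i \in C_k} a_i x_i}{\sum_{i \in C_k} a_i} \\
\frac{\partial f}{\partial Y_k} &= -2 \sum_{i \in C_k} a_i (y_i - Y_k) = 0
  \quad\Longrightarrow\quad
  Y_k = \frac{\sum_{i \in C_k} a_i y_i}{\sum_{i \in C_k} a_i}
\end{align*}
```

With the unsquared measure the corresponding condition is $\sum_{i \in C_k} a_i (x_i - X_k) / d_{ik} = 0$, in which $d_{ik}$ itself depends on $(X_k, Y_k)$, so the centre generally has to be found iteratively rather than from a closed form expression.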
4. Median clustering
A slightly modified alternative to the centre approach is to define cluster membership
based on assigning observations to a representative observation. Approaches in
this area include Hakimi (1964), Vinod (1969), and ReVelle and Swain (1970). This
is referred to as a median (or medoid in Kaufman and Rousseuw 1990) approach.
The median, similar to the centre point, serves only as a means for identifying cluster
groups. The advantage of the median over the centre point is that the potential
medians are known a priori as they correspond to the set of observations. In contrast,
the centre point is a function of the cluster membership in continuous rather than
discrete space.
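The following notation will assist in the specification of the model formulation:
i = index of observations (total number = n);
j = index of potential cluster medians.
Decision variables:
$x_j$ = 1 if observation j is selected as a cluster median, 0 otherwise;
$z_{ij}$ = 1 if observation i is assigned to cluster median j, 0 otherwise.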
It is worth noting that the set of potential medians would typically correspond to
the set of observations in this particular clustering model. So, i and j are indices
referring to the spatial observations. The reason for this is associated with the intent
of the approach. The goal is to partition the spatial objects into natural groupings.
Since this model identifies partitions based upon the assignment of observations to
selected medians, the use of observations as medians facilitates the process and has
no practical interpretation.
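4.1. Median Clustering Problem (MCP)
Minimize
$$Z = \sum_{i} \sum_{j} a_i d_{ij} z_{ij}$$
Subject to:
(1) Each observation must be assigned to a cluster median.
$$\sum_{j} z_{ij} = 1$$
for all i;
(2) An observation may be assigned only to a selected median.
$$z_{ij} \leq x_j$$
for all i, j;
(3) Exactly p medians are selected.
$$\sum_{j} x_j = p$$
(4) Integer requirements.
$$z_{ij} \in \{0, 1\}$$
for each i, j;
$$x_j \in \{0, 1\}$$
for each j.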
in significant computational advantages for the MCP (and the OICP) over the
CPCP. There are $n^2 + n$ decision variables and $n^2 + n + 1$ constraints associated with
the MCP. Although problem size is an issue for the MCP, optimal solutions may
be obtained for small to medium sized problem instances. Further discussion of the
MCP will be given in the section on solution aspects of the presented clustering
approaches.
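As an illustration of the MCP objective, the sketch below evaluates a chosen median set by assigning each observation to its closest selected median, and finds the best set for a tiny instance by complete enumeration. Enumeration is swapped in here purely for demonstration; it is not the integer programming or Lagrangian approach discussed for the MCP, and it is only feasible for very small n. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def mcp_objective(medians, a, d):
    """MCP objective when each observation is assigned to its closest median."""
    return sum(a[i] * min(d[i][j] for j in medians) for i in range(len(a)))


def mcp_brute_force(p, a, d):
    """Enumerate every set of p medians; only sensible for very small n."""
    n = len(a)
    best = min(combinations(range(n), p), key=lambda s: mcp_objective(s, a, d))
    return list(best), mcp_objective(best, a, d)


# Hypothetical data: two groups of three observations on a line, unit weights.
pos = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
a = [1.0] * len(pos)
d = [[abs(u - v) for v in pos] for u in pos]

medians, z = mcp_brute_force(2, a, d)
```

For this instance the middle observation of each group is selected, so `medians` is `[1, 4]` and the objective value is 4.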
5. Application considerations
There are two important application issues in cluster identification for spatial
data associated with the distance based difference measure. One is the distance metric
utilized, as this is a function of geographical space and observation association. The
other important element in the detection of clusters is the actual form of the distance
measure applied to the utilized metric: specifically, whether the distance metric is used
as it is specified, e.g. $l_m$, or is squared. Thus, this section
reviews the effects of the use of the squared distance measure, $d_{ij}^2$, versus the originally
specified metric, $d_{ij}$. The importance of this aspect of the modelling effort is due to
the fact that squared distances have been suggested and applied for cluster analysis
and detection in spatial data and the associated effects are pronounced.
One area of spatial modelling that has recognized the effects of using $d_{ij}$ as
opposed to $d_{ij}^2$ is facility location (see Watson-Gandy 1972, Wesolowsky 1993). This
does not appear to be the case in cluster analysis (see Selim and Alsultan 1991,
Zhang et al. 1996). An exception may be Kaufman and Rousseuw (1990), but they
only defend their use of $d_{ij}$ rather than $d_{ij}^2$. The difference between these two forms
may be demonstrated through a simple illustration.
Figure 1 compares distance functions in terms of actual distance versus modelled
distance. What is shown in figure 1 is that the modelled distance differs significantly
from the actual distance for the distance squared function. As the intent in cluster
analysis is to create spatial partitions which minimize within group difference, there
is little reason for giving such importance to more distant observations. Further, $d_{ij}^2$
significantly alters the impact that the distance metric is meant to address. Figure 2 shows
optimal grouping configurations for the MCP using $d_{ij}$ and $d_{ij}^2$, where p = 5. The
application shown in figure 2 represents the Washington, DC area and has been analysed by
Murray and Church (1996) among others. The solid lines delineate the cluster regions
associated with the use of $d_{ij}$. Alternatively, the broken lines delineate the cluster
regions associated with the use of $d_{ij}^2$. The most obvious contrast between the two
partitions is that they are spatially different. However, this spatial difference is also
distinguishable when evaluating the two configurations using the MCP objective
function. Specifically, the configuration identified using $d_{ij}^2$ is over 9% less efficient
than the $d_{ij}$ partition when both are evaluated using $d_{ij}$, as the objective would
functionally be interpreted. This was found to be the case across values of p using
numerous spatial data sets. Thus, the use of the squared distance measure results in
inferior solutions when the data has spatial attributes. This should not be too surprising
given that $d_{ij}^2$ has no physical interpretation and is not representative of travel,
transportation or movement distance.
Figure 1.
Figure 2.
As mentioned in Kaufman and Rousseuw (1990), the use of $d_{ij}^2$ impacts outliers,
which is more or less what the previous discussion suggests. The use of the squared
distance function gives greater importance in cluster creation to outliers than
warranted. Why should an outlier have substantially more influence in partition
selection? Furthermore, clustering approaches have been developed which systematically
discard outliers because they are not representative of the actual relationships of
interest (Zhang et al. 1996).
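The pull that an outlier exerts under the squared measure can be seen in a very small example. The sketch below uses hypothetical one-dimensional data (not the Washington, DC application), unit weights and p = 1; the function names are assumptions for illustration. It selects the best single median under $d_{ij}$ and under $d_{ij}^2$, then evaluates both choices with the unsquared measure.

```python
def best_single_median(pos, power):
    """Index of the observation minimizing the sum of distance ** power."""
    def cost(j):
        return sum(abs(x - pos[j]) ** power for x in pos)
    return min(range(len(pos)), key=cost)


def linear_cost(pos, j):
    """MCP objective (unit weights, p = 1) evaluated with the unsquared d_ij."""
    return sum(abs(x - pos[j]) for x in pos)


# Hypothetical data: four nearby observations and one outlier at 20.
pos = [0.0, 1.0, 2.0, 3.0, 20.0]
j_d = best_single_median(pos, power=1)    # median chosen using d_ij
j_d2 = best_single_median(pos, power=2)   # median chosen using d_ij squared
loss = linear_cost(pos, j_d2) / linear_cost(pos, j_d) - 1.0
```

Here the unsquared measure selects the observation at 2, while the squared measure is drawn toward the outlier and selects the observation at 3. Evaluated with $d_{ij}$, the second choice costs 23 against 22, roughly 4.5% worse on this toy instance.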
The effects of $d_{ij}^2$ are pronounced and demonstrate that $d_{ij}$ is most appropriate
for cluster detection and knowledge discovery for ESDA.
6. Solution approaches
Simulated Annealing heuristics have been applied for the latter case by Selim and
Alsultan (1991) among others. Spatial data cluster analysis has not been an area of
application to date.
6.3. Median Clustering Problem (MCP)
Optimal approaches for solving the MCP include integer programming and
Lagrangian relaxation with branch and bound. A review of the application of
Lagrangian relaxation in this area may be found in Galvao (1993). Lagrangian
relaxation techniques are capable of solving problem instances approaching a thousand
observations (however, commercial code is not necessarily available). Integer
programming can efficiently solve MCP instances of only a couple of hundred observations
using commercially available packages. Given these limitations, heuristics are
certainly required for larger problem instances such as those associated with cluster
identification in GIS databases.
A variety of heuristics exist for solving the MCP and recent surveys may be
found in Murray and Church (1996) and Rolland et al. (1996). Such approaches
include interchange (hill-climbing), Simulated Annealing, Tabu search, and
Lagrangian heuristics. Most of the clustering techniques for spatial data developed
thus far have utilized the MCP solved using interchange heuristics (Kaufman and
Rousseuw 1990, Ng and Han 1994). Unfortunately, they fail to recognize previous
work in this area and as a result have produced inferior approaches. The generic
interchange heuristic begins with a set of p cluster medians (often selected at random),
and each observation is grouped with its closest median (based on the distance
measure). The interchange aspect of the heuristic is then to evaluate the replacement
of a member of the median set with one of the $n - p$ non-median observations (a number
of techniques exist for doing this and constitute distinctions between the alternatives).
If an improvement in the MCP objective results from an interchange (an exchange
of a current median with a non-median observation), then the best found is accepted
and the interchange evaluation process begins again. A local optimal solution is
identified and the process terminates when no interchange results in an improvement.
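The generic interchange description above can be sketched in a few lines. This is a minimal best-improvement variant written for illustration, not the implementation of any of the cited authors: the function names, the `seed` parameter and the toy data are assumptions, and every swap is re-evaluated from scratch with no attempt at the efficiency refinements found in the location literature.

```python
import random


def mcp_objective(medians, a, d):
    """MCP objective with every observation assigned to its closest median."""
    return sum(a[i] * min(d[i][j] for j in medians) for i in range(len(a)))


def interchange(p, a, d, seed=0):
    """Best-improvement interchange; returns (medians, objective) at a local optimum."""
    n = len(a)
    rng = random.Random(seed)
    medians = rng.sample(range(n), p)          # random starting medians
    best_z = mcp_objective(medians, a, d)
    while True:
        best_swap = None
        for slot in range(p):                  # each current median ...
            for cand in range(n):              # ... against each non-median
                if cand in medians:
                    continue
                trial = medians[:slot] + [cand] + medians[slot + 1:]
                z = mcp_objective(trial, a, d)
                if z < best_z:                 # remember the best improving swap
                    best_z, best_swap = z, trial
        if best_swap is None:                  # no interchange improves: stop
            return sorted(medians), best_z
        medians = best_swap                    # accept best found, start again


# Hypothetical data: two groups of three observations on a line, unit weights.
pos = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
a = [1.0] * len(pos)
d = [[abs(u - v) for v in pos] for u in pos]
medians, z = interchange(2, a, d, seed=0)
```

On this small instance the only local optimum is the pair of group centres, so the heuristic reaches `[1, 4]` with objective 4 from any starting set. On realistic data a local optimum need not be globally optimal, which is why the choice of interchange variant matters.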
Given this generic description of the interchange heuristic, there are three major
concerns associated with the implementations developed for the MCP by Kaufman
and Rousseuw ( 1990 ) and Ng and Han ( 1994):
(1) The interchange heuristic suggested by Kaufman and Rousseuw (1990) and
extended by Ng and Han (1994) is equivalent to the global interchange
heuristic developed for the MCP by Goodchild and Noronha (1983). The
global interchange approach evaluates the exchange of each of the $n - p$ non-median
observations with the current p medians before an exchange is accepted. The
global interchange heuristic has been shown to require more total computational
effort to reach a local optimum than other interchange approaches while
identifying comparable solutions (this is rather well known and accepted in
the location literature and may be confirmed in recent work by Densham
and Rushton 1992a and Rolland et al. 1996).
(2) Kaufman and Rousseuw (1990) and Zhang et al. (1996) implement a distance
based observation cut-off technique (which is actually prolific in the clustering
literature) in order to reduce computational effort in their interchange heuristics.
This approach attempts to reduce the number of exchanges evaluated
by not considering exchanges of non-medians with medians if they are beyond
a specified distance. An interchange heuristic has previously been proposed
and developed for the MCP based on the use of a limited distance string
(Densham and Rushton 1992a, b) and shown to be problematic by Sorensen
(1994) in that poorer quality local optimal solutions result if a data string
cut-off is employed. From an optimization perspective, this is a concern as
the intent of using a heuristic is to obtain the best possible near optimal
solution. The use of the data string cut-off reduces the likelihood of this
happening, which should be well understood in practice and application.
(3) Ng and Han (1994) present a sampling scheme within their global interchange
process. This approach, rather than using a distance cut-off, limits the
exchanges evaluated through sampling. Murray and Church (1995) found
that sampling based heuristics were significantly inferior to interchange and
other more sophisticated heuristics, although this was for a somewhat different
problem application. Thus, this approach leaves an open question as to
the effectiveness of sampling for identifying near optimal solutions.
In summary, a significant amount of research has been devoted to the analysis of
the MCP, and this is not reflected in the clustering heuristics developed thus far for ESDA.
7. Discussion
As the previous section has shown, the MCP has received the greatest amount
of attention in the area of cluster analysis for ESDA. The development of inferior
heuristics for solving the MCP may in part be due to the notion of 'throwing
computer power' at geographically based problems of interest, as suggested in
Openshaw (1987, 1991). That is, the most appropriate methods are not sought out
in advance before the application of a particular approach is carried out, even though
data structures, memory and complexity issues are critical. Based upon the review
of the various clustering methods and how they may be solved, it is clear that
heuristic solution techniques will be essential for analysing medium to large size
applications associated with GIS. Given that this is the case, each clustering method
may be of merit as heuristic solution performance is not generally affected by
non-linear objective functions or constraints.
An important point to note is that in practice the clustering heuristics discussed
in this paper are often modified in various ways in order to account for application
specific details, such as the treatment of outliers and the issues just mentioned above
(Zhang et al. 1996). Although the focus has been on non-hierarchical methods, much
of what has been presented and discussed throughout this paper has direct application
to hierarchical approaches (Everitt 1980).
The issues of memory and complexity are worth further elaboration. Memory
requirements are equivalent for each of the presented models in terms of the data
input. The major requirement is the $d_{ij}$ measure. Thus, memory needs would be
$O(n^2)$ if $d_{ij}$ is computed a priori rather than as needed. The complexity of the
heuristics is an important aspect of implementation. As demonstrated in the previous
section, heuristic development for the MCP has been significant so the complexity
issue has received a great deal of attention. Recent work in this area may be found
in Horn (1996) and Densham and Rushton (1992a, b), which has focused on the
operational complexity issue of the interchange heuristic.
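A rough sense of the memory trade-off noted above comes from simple arithmetic. The sketch below is illustrative only and rests on two assumptions that are not from the paper: values are stored as 8-byte floating point numbers, and $d_{ij}$ is a coordinate metric that can be recomputed from stored (x, y) pairs (a network distance or travel time could not be recovered this way).

```python
def dense_matrix_bytes(n, bytes_per_entry=8):
    """Memory for a full n x n d_ij matrix stored a priori."""
    return n * n * bytes_per_entry


def coordinate_bytes(n, bytes_per_entry=8):
    """Memory when only (x, y) is stored and d_ij is computed as needed."""
    return 2 * n * bytes_per_entry


n = 10_000
full = dense_matrix_bytes(n)      # grows as O(n^2)
on_demand = coordinate_bytes(n)   # grows as O(n)
```

For 10,000 observations the full matrix needs 800,000,000 bytes under these assumptions, against 160,000 bytes for the coordinates alone, at the price of recomputing distances during the search.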
8. Conclusions
This paper has reviewed three basic clustering methods which may be applied to
geographic information in an exploratory spatial data analysis (ESDA) capacity
within geographical information systems (GIS). This is a timely review for an
important emerging area. The first clustering method was the Observation Interaction
Clustering Problem (OICP). The second method was the Centre Point Clustering
Problem (CPCP), which utilizes a centre point for the creation of spatial partitions
or groups. The third method was the Median Clustering Problem (MCP), which uses
spatial observation members to create groupings. An important product of the
detailed presentation and discussion of these clustering methods is that each is
attempting to accomplish the same goal, the identification of spatial groupings
which are most similar, yet the CPCP and MCP only approximate what is modelled
explicitly in the OICP.
Solving these clustering problems exactly as a part of ESDA is
currently unrealistic. Thus, heuristic solution techniques are critical and continue to
be a needed area of future research, particularly for the OICP.
What neither this paper nor previous research has addressed is the differences in the
clusters identified by these methods or the simplistic yet computationally intensive
Geographical Analysis Machine of Openshaw et al. (1987). Such a study, examining
the differences in produced clusters as well as providing an interpretation of their
impacts for ESDA, is an important area for future research.
Acknowledgments
This research was supported in part by a grant from the Australian Research
Council. The authors would like to thank M. Goodchild for comments on an initial
draft of this manuscript.
References
de Amorim, S., Barthelemy, J., and Ribeiro, C., 1992, Clustering and clique partitioning:
simulated annealing and tabu search approaches. Journal of Classification, 9, 17-41.
Cooper, L., 1963, Location-allocation problems. Operations Research, 11, 331-343.
Cooper, L., 1964, Heuristic methods for location-allocation problems. SIAM Review, 6, 37-53.
Cooper, L., 1967, Solutions of generalized locational equilibrium models. Journal of Regional
Science, 7, 1-18.
Densham, P., and Rushton, G., 1992a, Strategies for solving large location-allocation
problems by heuristic methods. Environment and Planning A, 24, 289-304.
Densham, P., and Rushton, G., 1992b, A more efficient heuristic for solving large p-median
problems. Papers in Regional Science, 71, 307-329.
Dorndorf, U., and Pesch, E., 1994, Fast clustering algorithms. ORSA Journal on Computing,
6, 141-153.
Everitt, B., 1980, Cluster Analysis, 2nd edn (New York: Halsted Press).
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P., 1996, The KDD process for extracting
useful knowledge from volumes of data. Communications of the ACM, 39, 27-34.
Fisher, W., 1958, On grouping for maximum homogeneity. Journal of the American Statistical
Association, 53, 789-798.
Frawley, W., Piatetsky-Shapiro, G., and Matheus, C., 1991, Knowledge discovery in
databases: an overview. In Knowledge Discovery in Databases, edited by G.
Piatetsky-Shapiro and W. Frawley (California: AAAI Press), pp. 1-27.
Galvao, R., 1993, The use of Lagrangian relaxation in the solution of uncapacitated facility
location problems. Location Science, 1, 57-79.
Goodchild, M., and Noronha, V., 1983, Location-Allocation for Small Computers. Monograph
8, University of Iowa.
Hakimi, L., 1964, Optimum locations of switching centers and the absolute centers and
medians of a graph. Operations Research, 12, 450-459.
Hartigan, J., 1975, Clustering Algorithms (New York: John Wiley).
Horn, M., 1996, Analysis and computational schemes for p-median heuristics. Environment
and Planning A, 28, 1699-1708.
Jensen, R., 1969, A dynamic programming algorithm for cluster analysis. Operations Research,
12, 1034-1057.
Kaufman, L., and Rousseuw, P., 1990, Finding Groups in Data: An Introduction to Cluster
Analysis (New York: John Wiley).
Klein, G., and Aronson, J., 1991, Optimal clustering: a model and method. Naval Research
Logistics, 38, 447-461.
MacQueen, J., 1967, Some methods for classification and analysis of multivariate observations.
In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and
Probability vol. I, edited by L. Le Cam and J. Neyman (Berkeley: University of
Openshaw, S., 1991, Developing appropriate spatial analysis methods for GIS. In Geographical
Information Systems: Principles and Applications, edited by D. Maguire, M. Goodchild,
Openshaw, S., 1992, Some suggestions concerning the development of artificial intelligence
tools for spatial modelling and analysis in GIS. Annals of Regional Science, 26, 35-51.
Openshaw, S., Charlton, M., Wymer, C., and Craft, A., 1987, A mark I geographical
analysis machine for the automated analysis of point data sets. International Journal
of Geographical Information Systems, 1, 335-358.
Rao, M., 1971, Cluster analysis and mathematical programming. Journal of the American
Statistical Association, 66, 622-626.
ReVelle, C., and Swain, R., 1970, Central facilities location. Geographical Analysis, 2, 30-42.
Rolland, E., Schilling, D., and Current, J., 1996, An efficient tabu search procedure for
the p-median problem. European Journal of Operational Research, 96, 329-342.
Rosing, K., 1991, Towards the solution of the (generalised) multi-Weber problem. Environment
and Planning B, 18, 347-360.
Rosing, K., 1992, An optimal method for solving the (generalized) multi-Weber problem.
European Journal of Operational Research, 58, 414-426.
Rosing, K., and ReVelle, C., 1986, Optimal clustering. Environment and Planning A, 18,
1463-1476.
Selim, S., and Alsultan, K., 1991, A simulated-annealing algorithm for the clustering problem.
Pattern Recognition, 24, 1003-1008.
Sorensen, P., 1994, Analysis and design of heuristics for the p-median location-allocation