


Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number:
1072954 Registered office: Mortimer House, 37-41 Mortimer Street,
London W1T 3JH, UK

International Journal of Geographical Information Science
Publication details, including instructions for
authors and subscription information:
http://www.tandfonline.com/loi/tgis20

Cluster discovery techniques for exploratory spatial data analysis
ALAN T. MURRAY & VLADIMIR ESTIVILL-CASTRO
Version of record first published: 09 Aug 2010.

To cite this article: ALAN T. MURRAY & VLADIMIR ESTIVILL-CASTRO (1998): Cluster
discovery techniques for exploratory spatial data analysis, International Journal
of Geographical Information Science, 12:5, 431-443
To link to this article: http://dx.doi.org/10.1080/136588198241734


Int. J. Geographical Information Science, 1998, vol. 12, no. 5, 431-443

Research Article
Cluster discovery techniques for exploratory spatial data analysis
ALAN T. MURRAY
Australian Housing and Urban Research Institute, Department of Geographical
Sciences and Planning, University of Queensland, Brisbane, Queensland 4072,
Australia
email: alan.murray@mailbox.uq.edu.au

and VLADIMIR ESTIVILL-CASTRO
Department of Computer Science and Software Engineering, University of
Newcastle, Callaghan, NSW 2308, Australia
email: vlad@cs.newcastle.edu.au
(Received 10 June 1997; accepted 7 November 1997)
Abstract. This paper reviews approaches for automated pattern spotting and
knowledge discovery in spatially referenced data. This is an emerging field which
to date has received developmental contributions primarily from researchers in
statistics and knowledge discovery in databases (KDD). The field of geographical
information systems (GIS) has, however, recognized its importance as a means
of providing more exploratory analysis functionality. Tools based upon automated
approaches that identify potentially important relationships in spatial data
are essential in GIS in order to deal effectively with the increasing amounts of
information being gathered. Clustering techniques are proving to be valuable, but
there appears to be a general lack of understanding associated with the use and
application of various clustering methods in the geographic domain. Further,
there is little if any recognition of the relationships between clustering methods.
As a result, techniques known to be problematic or inferior have been developed.
This paper presents an overview of clustering methods for exploratory spatial
data analysis and associated application issues.

1. Introduction

The importance of spatial/geographical information is well established. Associated
with this is the continued and growing significance of geographical information
systems (GIS) as a component of most planning processes. However, it is widely
recognized that GIS development, in terms of spatial analysis capabilities, has not
matured in accordance with the range and abundance of geographical data that it
maintains (see Openshaw 1991, 1992 among others). Numerous spatial analysis
approaches are conceivable and necessary, as such functionality is dependent upon
the particular problem being analysed. One such capability is being able to explore
spatial information without knowing a priori what should be found. The reasoning
behind this is that geographical databases tend to be extremely large and trends are
not necessarily apparent. This is one form of exploratory spatial data analysis (ESDA)
associated with GIS and has been referred to as spatial data mining and pattern
spotting.
Such ESDA functionality could involve the investigation of a set or subset of
spatial object attributes in order to assess cluster presence. As an example, Openshaw
et al. (1987) discuss the application of pattern identification for cancer cluster
detection. A project that we are undertaking involves the analysis of criminal offences
in the south-east Queensland region. This spatial database contains information on
roughly 65 criminal offence categories (murder, assault, rape, robbery, etc.) for
approximately 541 suburbs. The ability to identify or recognize the clustering of
activity through the use of automated spatial data mining and pattern spotting
techniques is important, as such trends are not obvious using the traditional
query-based display options provided in GIS.

1365-8816/98 $12.00 © 1998 Taylor & Francis Ltd.
The exploration of databases is not unique to the GIS field. In fact, the area of
knowledge discovery in databases (KDD) is aimed at aspects of data investigation
and summarization (Frawley et al. 1991). A major component of KDD has been the
development of data mining approaches (Fayyad et al. 1996), which is actually the
notion of searching for patterns or relationships within data as suggested for ESDA
using GIS. However, the data typically utilized in KDD do not have geographical
attributes.
It should not be surprising that KDD is beginning to recognize the importance
of space and the potential to apply its technology to geographic problems. It
requires little imagination to see the emerging and evolving contributions of KDD
in the development of ESDA within GIS, particularly spatial data mining. A number
of spatial data mining approaches have appeared in the KDD literature based upon
the clustering methods of Kaufman and Rousseuw (1990) for identifying patterns
within data. Given the importance that space has in GIS, there are problematic and
concerning characteristics of the spatial data mining approaches being developed by
Ng and Han (1994) and Zhang et al. (1996). This may be due to a lack of
understanding regarding the implications associated with particular clustering
methods. Further, heuristic solution techniques recently suggested and developed by
Kaufman and Rousseuw (1990), Ng and Han (1994) and Zhang et al. (1996) have
drawbacks as well. Nevertheless, preliminary results demonstrate that clustering
methods applied to spatial information are contributing to the knowledge discovery
process. Thus, this paper serves as a basis for re-evaluation to take place so that
clustering-based ESDA components for inclusion within GIS are not poorly structured
or problematic.
This paper will present a variety of clustering approaches, principally
non-hierarchical (or partitioning), in the sections which follow. The use of
distance-based difference measures will then be examined. Next, solution approaches
for solving these clustering problems will be reviewed. Finally, a discussion and
conclusions will be presented.

2. Observation clustering based on spatial interaction

The rationale for clustering is to group spatial observations or objects in order
to minimize differences between group members. Although this has been dealt with
in various ways, the goal or intent is to ensure that spatially based partitions are
identified which are the most similar. Given this, within group difference must be
minimized explicitly. Representative approaches of this type include the dynamic
programming approach of Jensen (1969), the sum of average and total within group
approaches formulated by Rao (1971), and the full interaction model by Rosing and
ReVelle (1986).


The following notation will be used in the specification of the spatial interaction
clustering problem:
  $i, j$ = indices of observations (total number = $n$);
  $k$ = index of clusters (total number = $p$);
  $a_i$ = demand/population/weight of observation $i$;
  $d_{ij}$ = spatial difference measure relating observations $i$ and $j$.

Decision variables:

$$y_{ik} = \begin{cases} 1 & \text{if observation } i \text{ is in cluster } k \\ 0 & \text{otherwise.} \end{cases}$$

There are two issues which warrant further comment. The first is the selection of an
appropriate $p$ value, for which various approaches have been suggested. They
typically involve the evaluation of a range of $p$ values in terms of objective
performance as well as other measurement criteria. The second issue is the
interpretation of $d_{ij}$. The use of $d_{ij}$ represents a distance-based measure of
difference between two spatial observations. This may be given for a variety of
scales (see Hartigan 1975, Everitt 1980). Since the interest here is in information
that has geographic association, spatial proximity is an important component. The
measure of distance between two sites may be a coordinate metric, a network distance
or a travel time. The coordinate metric is typically defined as an $l_m$ metric.
Without loss of generality, in two-dimensional space it has the following form:
$$l_m = \left( |x_i - x_j|^m + |y_i - y_j|^m \right)^{1/m}$$

where
  $(x_i, y_i)$ = coordinates of observation $i$;
  $(x_j, y_j)$ = coordinates of observation $j$;
  $m$ = distance metric parameter.

Given this metric, $m = 2$ specifies a Euclidean distance and $m = 1$ indicates a
rectilinear distance measure. This parameter may be adjusted to correspond to the
study area so that it more accurately represents observed distances. Further
discussion of the use of distance measures in clustering will be left for a later
section. The first clustering model is now given.
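As a brief aside before the formulation, the $l_m$ metric can be sketched in a few lines of Python. This is only an illustration: the function name `lm_distance` and the sample coordinates are assumptions introduced here, not part of the original paper.

```python
def lm_distance(p, q, m=2.0):
    """l_m distance between two 2-D points p = (x_i, y_i) and q = (x_j, y_j)."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    return (dx ** m + dy ** m) ** (1.0 / m)


# Hypothetical coordinates, chosen only so the arithmetic is easy to check.
a, b = (0.0, 0.0), (3.0, 4.0)
print(lm_distance(a, b, m=2))  # Euclidean distance: 5.0
print(lm_distance(a, b, m=1))  # rectilinear distance: 7.0
```

Intermediate values of `m` between 1 and 2 give distances between the rectilinear and Euclidean extremes, which is one way the parameter could be tuned to a study area.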

2.1. Observation Interaction Clustering Problem (OICP)

$$\text{Minimize } Z = \sum_{k} \sum_{i} \sum_{j} a_i a_j d_{ij} y_{ik} y_{jk}$$

Subject to:

(1) Each observation must be assigned to a cluster.

$$\sum_{k} y_{ik} = 1 \quad \text{for all } i;$$

(2) Each cluster must have at least one member.

$$\sum_{i} y_{ik} \geq 1 \quad \text{for all } k;$$

(3) Integer requirements.

$$y_{ik} \in \{0, 1\} \quad \text{for each } i, k.$$

The objective of the OICP is to minimize total weighted difference in the assignment
of observations to clusters. Constraint (1) ensures that each observation is assigned
to a cluster. Constraint (2) imposes the condition that at least one observation is
assigned to each cluster. Constraint (3) imposes integer restrictions on all decision
variables. This formulation roughly corresponds to that given in Rosing and
ReVelle (1986).
The non-linear objective function in the OICP makes it difficult to solve. There
are $np$ decision variables and $n + p$ constraints. Dynamic programming solution
techniques are not generally capable of solving this problem optimally. Further
discussion will follow in the section devoted to solution aspects of the presented
clustering problems.
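To make the OICP objective concrete, the following minimal sketch evaluates it for a given assignment of observations to clusters. The function name, the four one-dimensional locations and the unit weights are assumptions for illustration only. Each unordered pair of observations is counted once; summing over all ordered pairs would simply double the value and leave the ranking of partitions unchanged.

```python
def oicp_objective(weights, dist, labels):
    """Total weighted within-cluster difference, counting each pair once.

    labels[i] is the cluster index assigned to observation i, so the product
    y_ik * y_jk equals 1 exactly when labels[i] == labels[j].
    """
    n = len(labels)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                total += weights[i] * weights[j] * dist[i][j]
    return total


coords = [0.0, 1.0, 10.0, 11.0]      # hypothetical 1-D locations
weights = [1.0] * len(coords)        # hypothetical unit weights a_i
dist = [[abs(u - v) for v in coords] for u in coords]

print(oicp_objective(weights, dist, [0, 0, 1, 1]))  # 2.0  (compact clusters)
print(oicp_objective(weights, dist, [0, 1, 0, 1]))  # 20.0 (interleaved clusters)
```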
3. Centre point clustering

Differing somewhat from the notion of measuring between observation differences
within clusters is the perspective that we assign cluster members to a centre point
(see Everitt 1980). Representative approaches include the spatial model of Cooper
(1963) and the k-means process of MacQueen (1967). The assignment to a centre
point represents an alternative way of identifying cluster members and changes the
interpretation of the distance (or difference) measure relating observations. The
centre points serve only as a means for generating or identifying cluster groups in
this context.
The following notation will be used in the specification of this alternative
clustering problem:
  $k$ = index of centres (total number = $p$);
  $d_{ik}$ = spatial difference measure relating observation $i$ and centre $k$.

Notice that the difference measure now relates an observation to a centre point.
Also, the $y_{ik}$ decision variables are unaltered in definition. The model
formulation is now given.
3.1. Centre Point Clustering Problem (CPCP)

$$\text{Minimize } Z = \sum_{i} \sum_{k} a_i d_{ik} y_{ik}$$

Subject to:

(1) Each observation must be assigned to a cluster.

$$\sum_{k} y_{ik} = 1 \quad \text{for all } i;$$

(2) Integer requirements.

$$y_{ik} \in \{0, 1\} \quad \text{for each } i, k.$$

The objective of the CPCP is to minimize the total difference in the assignment
of observations to cluster centres. Constraint (1) ensures that each observation is
assigned to a cluster. Constraint (2) imposes integer restrictions on all decision
variables.
The somewhat hidden element of this formulation is that the objective is non-linear,
as is that of the OICP. This is due to the fact that the distance measure is a
function of the cluster membership. Thus, the centre point cannot be identified until
the cluster membership is determined. In two-dimensional space there are $(2 + n)p$
decision variables (centre point definition variables and $y_{ik}$ variables) and $n$
constraints associated with the CPCP. As with the OICP, dynamic programming
techniques are generally incapable of solving this problem optimally (see Rosing 1991
for special cases). Further discussion will follow in the section on solution aspects
of presented clustering problems.
Another important item worth mentioning is that there are at least two alternatives
for defining the centre. The version of the CPCP presented here uses $d_{ik}$, as
suggested in Cooper (1963). This point is commonly referred to as the Weber point
when $m = 2$ in the facility location literature (Rosing 1991, Wesolowsky 1993).
Another potential version of the CPCP formulation, where $d_{ik}^2$ is given in the
objective function, is typically found in statistical cluster analysis (Fisher 1958,
MacQueen 1967, Hartigan 1975). This centre point is defined and referred to as the
centroid or centre of gravity. In fact, the spatial data mining literature has also
relied upon this particular representation (Zhang et al. 1996). One reason for the
use of the squared distance measure is that the centre point within each cluster is
defined by a closed form equation when $m = 2$, which makes its computation less
difficult. It should be recognized that all of the clustering approaches presented
in this paper could utilize $d_{ik}^2$ as the distance measure. However, reasons
will be given in a later section for the preferred use of $d_{ik}$.
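The contrast between the two centre definitions can be sketched as follows for a single cluster with $m = 2$. The centroid is the closed-form weighted mean. For the Weber point, which has no closed form, the sketch uses a Weiszfeld-type fixed-point iteration; that choice of method, the function names and the four sample points are assumptions made here for illustration and are not taken from the paper.

```python
from math import hypot


def centroid(points, weights):
    """Closed-form minimizer of the weighted sum of squared Euclidean distances."""
    w = sum(weights)
    cx = sum(a * x for a, (x, _) in zip(weights, points)) / w
    cy = sum(a * y for a, (_, y) in zip(weights, points)) / w
    return cx, cy


def weber_point(points, weights, iterations=200, eps=1e-12):
    """Approximate minimizer of the weighted sum of (unsquared) Euclidean distances."""
    cx, cy = centroid(points, weights)  # starting guess
    for _ in range(iterations):
        num_x = num_y = den = 0.0
        for a, (x, y) in zip(weights, points):
            d = max(hypot(x - cx, y - cy), eps)  # guard against division by zero
            num_x += a * x / d
            num_y += a * y / d
            den += a / d
        cx, cy = num_x / den, num_y / den
    return cx, cy


# Hypothetical cluster: three nearby observations and one distant one.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
weights = [1.0, 1.0, 1.0, 1.0]
print(centroid(points, weights))     # (2.75, 2.75), pulled towards the distant point
print(weber_point(points, weights))  # close to (0.5, 0.5)
```

In this small example the centroid is drawn well away from the three nearby observations, while the Weber point stays among them.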
4. Median clustering

A slightly modified alternative to the centre approach is to define cluster
membership based on assigning observations to a representative observation.
Approaches in this area include Hakimi (1964), Vinod (1969), and ReVelle and Swain
(1970). This is referred to as a median (or medoid in Kaufman and Rousseuw 1990)
approach. The median, similar to the centre point, serves only as a means for
identifying cluster groups. The advantage of the median over the centre point is
that the potential medians are known a priori as they correspond to the set of
observations. In contrast, the centre point is a function of the cluster membership
in continuous rather than discrete space.
The following notation will assist in the specification of the model formulation:
  $i$ = index of observations (total number = $n$);
  $j$ = index of potential medians (same as $i$);
  $d_{ij}$ = distance between observation $i$ and potential median $j$;
  $p$ = number of cluster medians to be selected.

Decision variables:

$$x_j = \begin{cases} 1 & \text{if cluster median } j \text{ is selected} \\ 0 & \text{otherwise.} \end{cases}$$

$$z_{ij} = \begin{cases} 1 & \text{if observation } i \text{ is assigned to cluster median } j \\ 0 & \text{otherwise.} \end{cases}$$

It is worth noting that the set of potential medians would typically correspond to
the set of observations in this particular clustering model. So, $i$ and $j$ are
indices referring to the spatial observations. The reason for this is associated
with the intent of the approach. The goal is to partition the spatial objects into
natural groupings. Since this model identifies partitions based upon the assignment
of observations to selected medians, the use of observations as medians simply
facilitates the process; the selected medians themselves have no practical
interpretation.
4.1. Median Clustering Problem (MCP)

$$\text{Minimize } Z = \sum_{i} \sum_{j} a_i d_{ij} z_{ij}$$

Subject to:

(1) Each observation must be assigned to a cluster median.

$$\sum_{j} z_{ij} = 1 \quad \text{for all } i;$$

(2) Cannot assign to a median unless one is selected.

$$z_{ij} \leq x_j \quad \text{for all } i, j;$$

(3) Select $p$ cluster medians.

$$\sum_{j} x_j = p$$

(4) Integer requirements.

$$z_{ij} \in \{0, 1\} \quad \text{for each } i, j;$$

$$x_j \in \{0, 1\} \quad \text{for each } j.$$

The objective of the MCP is to minimize the total weighted assignment distance of
observations to selected medians. Constraint (1) ensures that each observation is
assigned to a median. Constraint (2) imposes the condition that an observation may
only be assigned to a selected median. Constraint (3) specifies that $p$ cluster
medians are to be selected. Constraint (4) imposes integer restrictions on all
decision variables.
The objective function for the MCP is linear, in contrast to the OICP and the
CPCP models. Another contrast is that the $d_{ij}$ measure in the MCP (and the OICP)
may be calculated in advance, whereas this is not the case for the CPCP. This results
in significant computational advantages for the MCP (and the OICP) over the
CPCP. There are $n^2 + n$ decision variables and $n^2 + n + 1$ constraints associated
with the MCP. Although problem size is an issue for the MCP, optimal solutions may
be obtained for small to medium sized problem instances. Further discussion of the
MCP will be given in the section on solution aspects of the presented clustering
approaches.
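Because the medians are drawn from the observations and $d_{ij}$ can be computed in advance, the MCP objective is simple to evaluate for any candidate set of medians. The sketch below does this and, purely for illustration, finds an optimal solution by enumerating every set of $p$ medians. Enumeration is an assumption made here for clarity and is only viable for very small $n$; the function names and the six sample locations are likewise hypothetical.

```python
from itertools import combinations


def mcp_objective(weights, dist, medians):
    """Weighted distance of each observation to its closest selected median."""
    return sum(a * min(dist[i][j] for j in medians)
               for i, a in enumerate(weights))


def mcp_brute_force(weights, dist, p):
    """Enumerate all p-subsets of observations; practical only for tiny instances."""
    n = len(weights)
    return min(combinations(range(n), p),
               key=lambda medians: mcp_objective(weights, dist, medians))


coords = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]   # hypothetical 1-D locations
weights = [1.0] * len(coords)                # hypothetical unit weights a_i
dist = [[abs(u - v) for v in coords] for u in coords]

best = mcp_brute_force(weights, dist, p=2)
print(best, mcp_objective(weights, dist, best))  # (1, 4) 4.0
```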


5. Application considerations

There are two important application issues in cluster identification for spatial
data associated with the distance-based difference measure. One is the distance
metric utilized, as this is a function of geographical space and observation
association. The other important element in the detection of clusters is the actual
form of the distance measure applied to the utilized metric. Specifically, the
distance metric may be used as it is specified, e.g. $l_m$, or it may be squared.
Thus, this section reviews the effects of the use of the squared distance measure,
$d_{ij}^2$, versus the originally specified metric, $d_{ij}$. The importance of this
aspect of the modelling effort is due to the fact that squared distances have been
suggested and applied for cluster analysis and detection in spatial data and the
associated effects are pronounced.
One area of spatial modelling that has recognized the effects of using $d_{ij}$ as
opposed to $d_{ij}^2$ is facility location (see Watson-Gandy 1972, Wesolowsky 1993).
This does not appear to be the case in cluster analysis (see Selim and Alsultan 1991,
Zhang et al. 1996). An exception may be Kaufman and Rousseuw (1990), but they
only defend their use of $d_{ij}$ rather than $d_{ij}^2$. The difference between
these two forms may be demonstrated through a simple illustration.
Figure 1 compares distance functions in terms of actual distance versus modelled
distance. What is shown in figure 1 is that the modelled distance differs
significantly from the actual distance for the distance squared function. As the
intent in cluster analysis is to create spatial partitions which minimize within
group difference, there is little reason for giving such importance to distant
observations. Further, $d_{ij}^2$ significantly alters the impact that the distance
metric is meant to address. Figure 2 shows optimal grouping configurations for the
MCP using $d_{ij}$ and $d_{ij}^2$, where $p = 5$. The application shown in figure 2
represents the Washington, DC area and has been analysed by Murray and Church (1996)
among others. The solid lines delineate the cluster regions associated with the use
of $d_{ij}$. Alternatively, the broken lines delineate the cluster regions
associated with the use of $d_{ij}^2$. The most obvious contrast between the two
partitions is that they are spatially different. However, this difference is also
distinguishable when evaluating the two configurations using the MCP objective
function. Specifically, the configuration identified using $d_{ij}^2$ is over 9% less
efficient than the $d_{ij}$ partition when both are evaluated using $d_{ij}$, as the
objective would functionally be interpreted. This was found to be the case across
values of $p$ using numerous spatial data sets. Thus, the use of the squared distance
measure results in inferior solutions when the data have spatial attributes. This
should not be too surprising given that $d_{ij}^2$ has no physical interpretation
and is not representative of travel, transportation or movement distance.

Figure 1. Comparison of actual and modelled distance measures.

Figure 2. Identified cluster regions for $p = 5$.


As mentioned in Kaufman and Rousseuw (1990), the use of $d_{ij}^2$ impacts outliers,
which is more or less what the previous discussion suggests. The use of the squared
distance function gives greater importance in cluster creation to outliers than
warranted. Why should an outlier have substantially more influence in partition
selection? Furthermore, clustering approaches have been developed which
systematically discard outliers because they are not representative of the actual
relationships of interest (Zhang et al. 1996).
The effects of $d_{ij}^2$ are pronounced and demonstrate that $d_{ij}$ is most
appropriate for cluster detection and knowledge discovery for ESDA.
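The influence of an outlier under the squared measure can be seen in a very small, hypothetical example: five observations on a line, one of them far from the rest, and a single median ($p = 1$) chosen among the observations. The data and function names are assumptions for illustration and do not reproduce the Washington, DC application.

```python
def total_cost(coords, j, power):
    """Sum over observations of |x_i - x_j| ** power for candidate median j."""
    return sum(abs(x - coords[j]) ** power for x in coords)


def best_median(coords, power):
    """Index of the observation minimizing total cost for the given power."""
    return min(range(len(coords)), key=lambda j: total_cost(coords, j, power))


coords = [0, 1, 2, 3, 100]          # hypothetical locations; 100 is an outlier

j_plain = best_median(coords, 1)    # median selected using d_ij
j_squared = best_median(coords, 2)  # median selected using d_ij squared

print(coords[j_plain], total_cost(coords, j_plain, 1))      # 2 102
print(coords[j_squared], total_cost(coords, j_squared, 1))  # 3 103
```

The squared measure shifts the selected median towards the outlier, and the resulting choice is slightly worse when re-evaluated with the unsquared distance, which mirrors the kind of comparison reported for figure 2.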


6. Solution approaches

Solving these clustering models is a significant consideration, as the geographical
data to be explored typically contain hundreds or thousands of observations. Both
exact and heuristic solution approaches will be discussed in this section. However,
emphasis is given to heuristic solution techniques, as problem sizes dictate that
exact methods are unlikely to be effective or feasible in practice for cluster
discovery in spatial data analysis.
6.1. Observation Interaction Clustering Problem (OICP)
Although the formulation of the OICP is rather straightforward, there are
difficulties associated with its use in practice. The non-linear objective function
makes this formulation of the problem essentially impossible to solve optimally. One
possibility is a linear transformation such as that given by Rao (1971) (see also
Rosing and ReVelle 1986, Klein and Aronson 1991), which then allows the problem to be
solved using a commercial integer programming package. Relatively small problems
(up to $n = 50$) have been solved to date using such a technique (Rosing and ReVelle
1986, Klein and Aronson 1991). Another option is to solve the OICP as an equivalent
Clique Partitioning Problem (Torki et al. 1996). Slightly larger problems (up to
$n = 160$) have been optimally solved to date (Dorndorf and Pesch 1994). Thus, these
approaches are not practical for medium to large scale problem instances, which are
typically associated with spatial data.
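For illustration, one standard way to linearize a product of two binary variables is shown below. It is offered as a sketch under the assumption that $a_i a_j d_{ij} \geq 0$, and it is not claimed to be the exact transformation used by Rao (1971). The auxiliary variable $w_{ijk}$ is introduced here and does not appear in the original formulation.

```latex
\begin{align*}
\text{Minimize } Z &= \sum_{k} \sum_{i} \sum_{j} a_i a_j d_{ij} w_{ijk} \\
\text{Subject to: } \quad
w_{ijk} &\geq y_{ik} + y_{jk} - 1 && \text{for all } i, j, k; \\
w_{ijk} &\geq 0 && \text{for all } i, j, k; \\
\sum_{k} y_{ik} &= 1 && \text{for all } i; \\
\sum_{i} y_{ik} &\geq 1 && \text{for all } k; \\
y_{ik} &\in \{0, 1\} && \text{for each } i, k.
\end{align*}
```

With non-negative objective coefficients, minimization drives each $w_{ijk}$ down to $\max(0, y_{ik} + y_{jk} - 1)$, which equals $y_{ik} y_{jk}$ for binary values. The price is on the order of $n^2 p$ additional variables and constraints, which is consistent with only small instances being solvable this way.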
Given that exact approaches for identifying optimal solutions for the OICP are
limited at best, heuristic solution techniques are necessary in practice. The
application and development of heuristics for directly solving the OICP do not appear
in the literature. There have been heuristics developed for solving the Clique
Partitioning Problem using Simulated Annealing and Tabu search as well as approximate
exact methods (de Amorim et al. 1992, Torki et al. 1996). Problem sizes have been as
high as $n = 1000$ for these heuristics. However, the larger problem applications use
synthetic data. Thus, it remains to be seen how such approaches would respond to
spatial data application. Heuristic solution development for the OICP is clearly a
needed area of future research.
6.2. Centre Point Clustering Problem (CPCP)
It has already been pointed out that the CPCP model is non-linear. It is possible
to optimally solve special cases of the CPCP as a discrete problem using a somewhat
indirect approach developed in Rosing (1992). However, there are difficulties in
identifying all convex hulls and the necessity of having to solve a set covering
problem (possibly using a commercial integer programming package). Only small
sized problems have been solved to date and this is only for the case where $m = 2$
using $d_{ij}$ (Rosing 1992). Thus, optimal methods are unlikely to be implemented in
practice and heuristic techniques are essential.
Heuristic solution methods have been developed for various cases of the CPCP.
The most prominent are for $m = 2$:
  - alternating heuristics of Cooper (1964, 1967) using $d_{ij}$;
  - k-means process of MacQueen (1967) using $d_{ij}^2$.
Simulated Annealing heuristics have been applied for the latter case by Selim and
Alsultan (1991) among others. Spatial data cluster analysis has not been an area of
application to date.
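The general shape of an alternating (allocate, then relocate) heuristic for the squared-distance case can be sketched as below. This is a batch variant that recomputes every centroid after each full allocation pass, with unit weights assumed; it is not claimed to reproduce MacQueen's or Cooper's procedures exactly. The function name, starting centres and sample points are hypothetical.

```python
def alternate(points, centres, max_iter=100):
    """Alternate between allocating points to centres and relocating centres."""
    centres = list(centres)
    labels = []
    for _ in range(max_iter):
        # Allocation step: assign each point to its closest centre (squared Euclidean).
        labels = [min(range(len(centres)),
                      key=lambda k: (x - centres[k][0]) ** 2 + (y - centres[k][1]) ** 2)
                  for x, y in points]
        # Location step: move each centre to the centroid of its members.
        new_centres = []
        for k, old in enumerate(centres):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centres.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centres.append(old)  # leave an empty cluster's centre in place
        if new_centres == centres:       # no centre moved: a local optimum
            break
        centres = new_centres
    return centres, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]  # hypothetical data
centres, labels = alternate(points, [(0.0, 0.0), (1.0, 0.0)])
print(centres)  # [(0.0, 0.5), (10.0, 0.5)]
print(labels)   # [0, 0, 1, 1]
```

Replacing the centroid in the location step with an iteratively computed Weber point would give a sketch of the unsquared-distance case instead.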
6.3. Median Clustering Problem (MCP)
Optimal approaches for solving the MCP include integer programming and
Lagrangian relaxation with branch and bound. A review of the application of
Lagrangian relaxation in this area may be found in Galvao (1993). Lagrangian
relaxation techniques are capable of solving problem instances approaching a
thousand observations (however, commercial code is not necessarily available).
Integer programming can efficiently solve MCP instances of only a couple of hundred
observations using commercially available packages. Given these limitations,
heuristics are certainly required for larger problem instances such as those
associated with cluster identification in GIS databases.
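As an illustration of the Lagrangian idea, the sketch below computes a lower bound on the MCP optimum by relaxing the assignment constraints (1) with one multiplier per observation. This is the commonly used relaxation for this model, but it is presented as an assumption rather than as the specific scheme reviewed in Galvao (1993). The multiplier values are hypothetical, and the subgradient updates normally used to improve them are not shown.

```python
def lagrangian_bound(weights, dist, p, lam):
    """Lower bound on the MCP optimum for multipliers lam (one per observation).

    With constraint (1) relaxed, the problem separates by median j: selecting j
    contributes rho[j], the sum of all negative reduced costs a_i*d_ij - lam_i.
    """
    n = len(weights)
    rho = [sum(min(0.0, weights[i] * dist[i][j] - lam[i]) for i in range(n))
           for j in range(n)]
    # Select the p medians with the smallest rho values, then add back sum(lam).
    return sum(sorted(rho)[:p]) + sum(lam)


coords = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]   # hypothetical 1-D locations
weights = [1.0] * len(coords)                # hypothetical unit weights a_i
dist = [[abs(u - v) for v in coords] for u in coords]

print(lagrangian_bound(weights, dist, 2, [0.0] * 6))  # 0.0 (trivial bound)
print(lagrangian_bound(weights, dist, 2, [1.0] * 6))  # 4.0
```

For this tiny instance the second bound equals the objective value of selecting the observations at 1 and 11 as medians (4.0), so that solution is provably optimal. In general a gap remains and branch and bound is needed to close it.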
A variety of heuristics exist for solving the MCP and recent surveys may be
found in Murray and Church (1996) and Rolland et al. (1996). Such approaches
include interchange (hill-climbing), Simulated Annealing, Tabu search, and
Lagrangian heuristics. Most of the clustering techniques for spatial data developed
thus far have utilized the MCP solved using interchange heuristics (Kaufman and
Rousseuw 1990, Ng and Han 1994). Unfortunately, they fail to recognize previous
work in this area and as a result have produced inferior approaches. The generic
interchange heuristic begins with a set of $p$ cluster medians (often selected at
random), and each observation is grouped with its closest median (based on the
distance measure). The interchange aspect of the heuristic is then to evaluate the
replacement of a member of the median set with one of the $n - p$ non-median
observations (a number of techniques exist for doing this and constitute the
distinctions between the alternatives). If an improvement in the MCP objective
results from an interchange (an exchange of a current median with a non-median
observation), then the best found is accepted and the interchange evaluation process
begins again. A local optimal solution is identified and the process terminates when
no interchange results in an improvement.
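The generic description above can be sketched as follows. This is a minimal best-improvement version that recomputes the full objective for every trial exchange. Practical implementations use incremental updates, and the function names, the random seed and the sample data are assumptions for illustration.

```python
import random


def mcp_cost(weights, dist, medians):
    """Weighted distance of each observation to its closest selected median."""
    return sum(a * min(dist[i][j] for j in medians)
               for i, a in enumerate(weights))


def interchange(weights, dist, p, seed=0):
    """Best-improvement interchange heuristic from a random starting median set."""
    n = len(weights)
    medians = set(random.Random(seed).sample(range(n), p))
    best = mcp_cost(weights, dist, medians)
    while True:
        best_swap = None
        for out in medians:                 # current median to remove
            for cand in range(n):           # non-median observation to add
                if cand in medians:
                    continue
                cost = mcp_cost(weights, dist, (medians - {out}) | {cand})
                if cost < best:
                    best, best_swap = cost, (out, cand)
        if best_swap is None:               # no exchange improves: local optimum
            return sorted(medians), best
        out, cand = best_swap               # accept the best exchange found
        medians = (medians - {out}) | {cand}


coords = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]   # hypothetical 1-D locations
weights = [1.0] * len(coords)
dist = [[abs(u - v) for v in coords] for u in coords]
print(interchange(weights, dist, p=2))       # ([1, 4], 4.0)
```

On this small instance every starting set leads to the same local optimum. On realistic data the result generally depends on the starting medians, which is why the heuristic is often restarted from several random sets.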
Given this generic description of the interchange heuristic, there are three major
concerns associated with the implementations developed for the MCP by Kaufman
and Rousseuw (1990) and Ng and Han (1994):
(1) The interchange heuristic suggested by Kaufman and Rousseuw (1990) and
extended by Ng and Han (1994) is equivalent to the global interchange
heuristic developed for the MCP by Goodchild and Noronha (1983). The
global interchange approach evaluates the exchange of each of the $n - p$
non-median observations with the current $p$ medians before an exchange is
accepted. The global interchange heuristic has been shown to require more
total computational effort to reach a local optimum than other interchange
approaches while identifying comparable solutions (this is rather well known
and accepted in the location literature and may be confirmed in recent work
by Densham and Rushton 1992a and Rolland et al. 1996).
(2) Kaufman and Rousseuw (1990) and Zhang et al. (1996) implement a
distance-based observation cut-off technique (which is actually prolific in the
clustering literature) in order to reduce computational effort in their
interchange heuristics. This approach attempts to reduce the number of
exchanges evaluated by not considering exchanges of non-medians with medians
if they are beyond a specified distance. An interchange heuristic has
previously been proposed and developed for the MCP based on the use of a
limited distance string (Densham and Rushton 1992a, b) and shown to be
problematic by Sorensen (1994) in that poorer quality local optimal solutions
result if a data string cut-off is employed. From an optimization perspective,
this is a concern as the intent of using a heuristic is to obtain the best
possible near optimal solution. The use of the data string cut-off reduces the
likelihood of this happening, which should be well understood in practice and
application.
(3) Ng and Han (1994) present a sampling scheme within their global interchange
process. This approach, rather than using a distance cut-off, limits the
exchanges evaluated through sampling. Murray and Church (1995) found
that sampling-based heuristics were significantly inferior to interchange and
other more sophisticated heuristics, although this was for a somewhat
different problem application. Thus, this approach leaves an open question as to
the effectiveness of sampling for identifying near optimal solutions.
In summary, a significant amount of research has been devoted to the analysis of
the MCP, and this is not reflected in the clustering heuristics developed thus far
for ESDA.
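To illustrate the distinction drawn in point (1), the sketch below shows an interchange variant that accepts the first improving exchange it finds instead of scanning every exchange before accepting one. It is only a structural sketch with hypothetical names and data. It is not a reproduction of any of the cited heuristics, and it makes no claim about relative computational effort.

```python
def mcp_cost(weights, dist, medians):
    """Weighted distance of each observation to its closest selected median."""
    return sum(a * min(dist[i][j] for j in medians)
               for i, a in enumerate(weights))


def first_improvement_interchange(weights, dist, start_medians):
    """Accept the first improving exchange, then restart the scan."""
    n = len(weights)
    medians = set(start_medians)
    best = mcp_cost(weights, dist, medians)
    improved = True
    while improved:
        improved = False
        for out in sorted(medians):
            for cand in range(n):
                if cand in medians:
                    continue
                trial = (medians - {out}) | {cand}
                cost = mcp_cost(weights, dist, trial)
                if cost < best:             # accept immediately
                    medians, best, improved = trial, cost, True
                    break
            if improved:
                break                       # restart scan from the new median set
    return sorted(medians), best


coords = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]   # hypothetical 1-D locations
weights = [1.0] * len(coords)
dist = [[abs(u - v) for v in coords] for u in coords]
print(first_improvement_interchange(weights, dist, [0, 1]))  # ([1, 4], 4.0)
```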
7. Discussion

As the previous section has shown, the MCP has received the greatest amount
of attention in the area of cluster analysis for ESDA. The development of inferior
heuristics for solving the MCP may in part be due to the notion of 'throwing
computer power' at geographically based problems of interest, as suggested in
Openshaw (1987, 1991). That is, the most appropriate methods are not sought out
in advance before the application of a particular approach is carried out, even
though data structures, memory and complexity issues are critical. Based upon the
review of the various clustering methods and how they may be solved, it is clear that
heuristic solution techniques will be essential for analysing medium to large size
applications associated with GIS. Given that this is the case, each clustering method
may be of merit, as heuristic solution performance is not generally affected by
non-linear objective functions or constraints.
An important point to note is that in practice the clustering heuristics discussed
in this paper are often modified in various ways in order to account for
application-specific details, such as the treatment of outliers and the issues just
mentioned above (Zhang et al. 1996). Although the focus has been on non-hierarchical
methods, much of what has been presented and discussed throughout this paper has
direct application to hierarchical approaches (Everitt 1980).
The issues of memory and complexity are worth further elaboration. Memory
requirements are equivalent for each of the presented models in terms of the data
input. The major requirement is the $d_{ij}$ measure. Thus, memory needs would be
$O(n^2)$ if $d_{ij}$ is computed a priori rather than as needed. The complexity of the
heuristics is an important aspect of implementation. As demonstrated in the previous section, heuristic development for the MCP has been significant, so the complexity
issue has received a great deal of attention. Recent work in this area may be found
in Horn (1996) and Densham and Rushton (1992a, b), which have focused on the
operational complexity issue of the interchange heuristic.
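
The memory trade-off for the distance measure can be illustrated with a minimal Python sketch. It assumes planar coordinates and Euclidean distance for d_ij, neither of which is prescribed by the discussion above, and the function names are hypothetical.

```python
import math

def pairwise_matrix(points):
    """Compute every d_ij a priori: O(n^2) memory, constant-time lookup."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]

def distance_on_demand(points, i, j):
    """Compute d_ij as needed: only the n coordinate pairs are stored."""
    return math.dist(points[i], points[j])

# Illustrative data only (assumed, not taken from the paper).
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

d = pairwise_matrix(points)                 # stores n * n values
stored_values = sum(len(row) for row in d)  # grows quadratically with n
print(stored_values, d[0][1], distance_on_demand(points, 0, 2))
```

Precomputing favours heuristics that revisit the same pairs many times, whereas computing on demand trades repeated arithmetic for a memory footprint that grows only linearly with the number of observations.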

8. Conclusions

This paper has reviewed three basic clustering methods which may be applied to
geographic information in an exploratory spatial data analysis (ESDA) capacity
within geographical information systems (GIS). This is a timely review for an important emerging area. The first clustering method was the Observation Interaction
Clustering Problem (OICP). The second method was the Centre Point Clustering
Problem (CPCP), which utilizes a centre point for the creation of spatial partitions
or groups. The third method was the Median Clustering Problem (MCP), which uses
spatial observation members to create groupings. An important product of the
detailed presentation and discussion of these clustering methods is that each is
attempting to accomplish the same goal, namely the identification of spatial groupings
which are most similar, yet the CPCP and MCP only approximate what is modelled
explicitly in the OICP.
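
The relationship between the three objectives can be made concrete with a small Python sketch that evaluates one fixed partition under each of them. The formulations here are simplifying assumptions rather than the models as stated in this paper: observations are unweighted, distance is Euclidean, the centroid stands in for the centre point of the CPCP, and the MCP representative is the group member with the smallest total distance to the others. All function names are hypothetical.

```python
import math

def oicp_cost(points, groups):
    """Total distance between every pair of observations in the same group."""
    total = 0.0
    for members in groups:
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                total += math.dist(points[members[a]], points[members[b]])
    return total

def cpcp_cost(points, groups):
    """Total distance from each observation to its group's centroid."""
    total = 0.0
    for members in groups:
        cx = sum(points[i][0] for i in members) / len(members)
        cy = sum(points[i][1] for i in members) / len(members)
        total += sum(math.dist(points[i], (cx, cy)) for i in members)
    return total

def mcp_cost(points, groups):
    """Total distance from each observation to the best member of its group."""
    total = 0.0
    for members in groups:
        total += min(sum(math.dist(points[i], points[m]) for i in members)
                     for m in members)
    return total

# Illustrative data only (assumed): two groups of observations.
points = [(0.0, 0.0), (0.0, 1.0), (0.0, 5.0), (10.0, 0.0), (10.0, 4.0)]
groups = [[0, 1, 2], [3, 4]]

costs = (oicp_cost(points, groups),
         cpcp_cost(points, groups),
         mcp_cost(points, groups))
print(costs)
```

Under these assumptions the same partition receives a different value from each objective, because the CPCP and MCP measure each group through a single representative while the OICP sums every within-group interaction directly.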
Solving these clustering problems exactly as a part of ESDA is
currently unrealistic. Thus, heuristic solution techniques are critical and remain
an area in need of future research, particularly for the OICP.
What neither this paper nor previous research has addressed is the differences in the
clusters identified by these methods or by the simplistic yet computationally intensive
Geographical Analysis Machine of Openshaw et al. (1987). A study examining
the differences in produced clusters, as well as providing an interpretation of their
impacts for ESDA, is an important area for future research.
Acknowledgments

This research was supported in part by a grant from the Australian Research
Council. The authors would like to thank M. Goodchild for comments on an initial
draft of this manuscript.
References
de Amorim, S., Barthelemy, J., and Ribeiro, C., 1992, Clustering and clique partitioning: simulated annealing and tabu search approaches. Journal of Classification, 9, 17-41.
Cooper, L., 1963, Location-allocation problems. Operations Research, 11, 331-343.
Cooper, L., 1964, Heuristic methods for location-allocation problems. SIAM Review, 6, 37-53.
Cooper, L., 1967, Solutions of generalized locational equilibrium models. Journal of Regional Science, 7, 1-18.
Densham, P., and Rushton, G., 1992a, Strategies for solving large location-allocation problems by heuristic methods. Environment and Planning A, 24, 289-304.
Densham, P., and Rushton, G., 1992b, A more efficient heuristic for solving large p-median problems. Papers in Regional Science, 71, 307-329.
Dorndorf, U., and Pesch, E., 1994, Fast clustering algorithms. ORSA Journal on Computing, 6, 141-153.
Everitt, B., 1980, Cluster Analysis, 2nd edn (New York: Halsted Press).
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P., 1996, The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39, 27-34.
Fisher, W., 1958, On grouping for maximum homogeneity. Journal of the American Statistical Association, 53, 789-798.
Frawley, W., Piatetsky-Shapiro, G., and Matheus, C., 1991, Knowledge discovery in databases: an overview. In Knowledge Discovery in Databases, edited by G. Piatetsky-Shapiro and W. Frawley (California: AAAI Press), pp. 1-27.
Galvao, R., 1993, The use of Lagrangian relaxation in the solution of uncapacitated facility location problems. Location Science, 1, 57-79.
Goodchild, M., and Noronha, V., 1983, Location-Allocation for Small Computers. Monograph 8, University of Iowa.
Hakimi, L., 1964, Optimum locations of switching centers and the absolute centers and medians of a graph. Operations Research, 12, 450-459.
Hartigan, J., 1975, Clustering Algorithms (New York: John Wiley).
Horn, M., 1996, Analysis and computational schemes for p-median heuristics. Environment and Planning A, 28, 1699-1708.
Jensen, R., 1969, A dynamic programming algorithm for cluster analysis. Operations Research, 12, 1034-1057.
Kaufman, L., and Rousseeuw, P., 1990, Finding Groups in Data: An Introduction to Cluster Analysis (New York: John Wiley).
Klein, G., and Aronson, J., 1991, Optimal clustering: a model and method. Naval Research Logistics, 38, 447-461.
MacQueen, J., 1967, Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. I, edited by L. Le Cam and J. Neyman (Berkeley: University of California), pp. 281-297.
Murray, A., and Church, R., 1995, Heuristic solution approaches to operational forest planning problems. OR Spektrum, 17, 193-203.
Murray, A., and Church, R., 1996, Applying simulated annealing to location planning models. Journal of Heuristics, 2, 49-71.
Ng, R., and Han, J., 1994, Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th Conference on Very Large Data Bases (VLDB), edited by J. Bocca, M. Jarke and C. Zaniolo, pp. 144-155.
Openshaw, S., 1991, Developing appropriate spatial analysis methods for GIS. In Geographical Information Systems: Principles and Applications, edited by D. Maguire, M. Goodchild, and D. Rhind (London: Longman), pp. 389-402.
Openshaw, S., 1992, Some suggestions concerning the development of artificial intelligence tools for spatial modelling and analysis in GIS. Annals of Regional Science, 26, 35-51.
Openshaw, S., Charlton, M., Wymer, C., and Craft, A., 1987, A mark I geographical analysis machine for the automated analysis of point data sets. International Journal of Geographical Information Systems, 1, 335-358.
Rao, M., 1971, Cluster analysis and mathematical programming. Journal of the American Statistical Association, 66, 622-626.
ReVelle, C., and Swain, R., 1970, Central facilities location. Geographical Analysis, 2, 30-42.
Rolland, E., Schilling, D., and Current, J., 1996, An efficient tabu search procedure for the p-median problem. European Journal of Operational Research, 96, 329-342.
Rosing, K., 1991, Towards the solution of the (generalised) multi-Weber problem. Environment and Planning B, 18, 347-360.
Rosing, K., 1992, An optimal method for solving the (generalized) multi-Weber problem. European Journal of Operational Research, 58, 414-426.
Rosing, K., and ReVelle, C., 1986, Optimal clustering. Environment and Planning A, 18, 1463-1476.
Selim, S., and Alsultan, K., 1991, A simulated-annealing algorithm for the clustering problem. Pattern Recognition, 24, 1003-1008.
Sorensen, P., 1994, Analysis and design of heuristics for the p-median location-allocation problem. Masters Thesis, Department of Geography, University of California, Santa Barbara.
Torki, A., Yajima, Y., and Enkawa, T., 1996, A new composite algorithm for clustering problems. International Transactions in Operational Research, 3, 197-206.
Vinod, H., 1969, Integer programming and the theory of grouping. Journal of the American Statistical Association, 64, 506-517.
Watson-Gandy, C., 1972, A note on the centre of gravity in depot location. Management Science, 18, B478-481.
Wesolowsky, G., 1993, The Weber problem: history and perspectives. Location Science, 1, 5-23.
Zhang, T., Ramakrishnan, R., and Livny, M., 1996, BIRCH: an efficient data clustering method for very large databases. SIGMOD, 25, 103-114.
