
A GENERAL PURPOSE CONCEPTUAL CLUSTERING ENGINE1

Paul Craven
York University (Canada)

History is not just a ‘ballet dance of bloodless categories’ (F.H. Bradley), yet category-
making is an indispensable tool of the historian’s art. The richness and complexity of histor-
ical data, and the opportunities and difficulties they present for comparative analysis, make
computer-assisted classification techniques attractive. Their attraction is often outweighed
by the obstacles to their employment, though. Multivariate statistical techniques have lim-
ited application to historical research because their input must be quantifiable and their
output, for all its mathematical elegance, can rarely be expressed in clear narrative prose.
This paper describes a technique that can be used to help build classification schemes or
typologies out of the ordinary bread and butter of historical research—events, words, deeds,
ideas, transactions—and allows for literate explanations. Called ‘conceptual clustering,’ it
is a method for grouping like things together in order to arrive at inherently meaningful
classification schemes. It relies on pattern matching instead of number crunching.
Conceptual clustering was developed by researchers in artificial intelligence and ma-
chine learning.2 Despite having been more or less overtaken in those fields by neural networks
and genetic algorithms, conceptual clustering remains suitable for many problems in histor-
ical data analysis. It employs a common-sense heuristic that can be easily grasped and
explained. In our research on the distribution of employment law throughout the British
1. Research reported in this paper was conducted under the auspices of the Master & Servant Project at
York University. The project is co-directed by Paul Craven and Douglas Hay and supported by research
funding from the Social Sciences and Humanities Research Council of Canada and York University.

2. R.S. Michalski and R.E. Stepp, ‘Automated construction of classifications: conceptual clustering versus
numerical taxonomy,’ IEEE Transactions on Pattern Analysis and Machine Intelligence, v.PAMI-5, n.4, July
1983; R.S. Michalski and R.E. Stepp, ‘Learning from observation: conceptual clustering,’ in R.S. Michalski
et al. (eds.), Machine Learning: An Artificial Intelligence Approach, I (Palo Alto, CA, 1983); D. Fisher
and P. Langley, ‘Conceptual clustering,’ in W.A. Gale (ed.) Artificial Intelligence and Statistics (Reading,
MA, 1986). For an alternative approach, see Stepp and Michalski, ‘Conceptual clustering: inventing goal-
oriented classifications of structured objects,’ in R.S. Michalski et al. (eds.), Machine Learning: An Artificial
Intelligence Approach, II (Los Altos, CA, 1986). There is an excellent recent summation in Kenneth Haase,
‘Automated discovery,’ in Richard Forsyth (ed.), Machine Learning: Principles and Techniques (London,
1989), especially 132-7. A brief account of classification in artificial intelligence research and an illustration of
a simple conceptual clustering engine will be found in B. Thompson and B. Thompson, ‘Artificial intelligence:
overturning the category bucket,’ BYTE, 16:1, January 1991. For a discussion of a statistical cluster analysis
algorithm similar in many ways to the conceptual clustering model, see M.G. Kendall, ‘Cluster analysis,’ in
S. Watanabe (ed.), Frontiers of Pattern Recognition (New York, 1972).
Empire over four centuries, Douglas Hay and I have been using conceptual clustering for ex-
ploratory data analysis and for helping to test ideas about how the many parts of a complex
system interact and interrelate.3
One of the central questions in the Master & Servant Project is whether different
types of colonial political economy tended to adopt different types of employment law. This
formulation implies two distinct classification issues: types of colonial political economy
(which we consider the independent variable) and types of employment law (the dependent
variable). We are using conceptual clustering to derive a typology of employment law from a
detailed analysis of the contents of several hundred master and servant statutes from almost
a hundred jurisdictions, and we expect to use it to help develop the typology of colonial
political economies as well. We have implemented conceptual clustering in PDC Prolog
as part of a suite of computer programmes for comparative historical research.4 In this
paper, I describe the conceptual clustering technique by working through a simple tutorial
illustration of its use with a small dataset. Demonstration versions of the programme and
example dataset described here are available for personal computers running DOS.5 In the
Master & Servant Project, we are running versions of these programmes under OS/2 to
accommodate much larger datasets.
To illustrate the technique, I will use a small dataset containing information about the
first fifty countries described in the 1990 CIA World Factbook, a geopolitical almanac that
is freely available in electronic form from Project Gutenberg.6 This dataset is presented here
solely to demonstrate conceptual clustering, and certainly not as an example of sophisticated
social analysis.
The first step in any classification exercise is to determine the criteria of interest.
Researchers will normally expend much effort on identifying and refining typological criteria,
and conceptual clustering can be an heuristic aid during that very process of theorization. For
this tutorial, I extracted five characteristics which seemed plausibly interrelated: population
size, infant mortality, adult literacy, type of polity, and wealth.
3. For an overview of the project, see D. Hay and P. Craven, ‘Master and servant in England and the
empire: a comparative study,’ Labour/Le Travail 31 (Spring 1993), 175-85; for a preliminary report on the
application of conceptual clustering in this research, see P. Craven and D. Hay, ‘The criminalization of “free”
labour: master and servant in comparative perspective,’ Slavery & Abolition 15, 2 (August 1994), 71-101.

4. The programme was designed by the author and Will Traves, and originally implemented by Will Traves
and the author. It has subsequently been revised by the author. For another element in this suite of
programmes, see P. Craven and W. Traves, ‘A general-purpose hierarchical coding engine and its application
to computer analysis of statutes,’ Literary and Linguistic Computing 8, 1 (1993), 27-32.

5. The demonstration version is available by anonymous ftp from ftp.yorku.ca in the directory
/pub/york other/msproject/clusta. Login as ‘anonymous’ and give your e-mail address as the password.
Inquiries should be directed to the author at pcraven@yorku.ca.

6. For information on Project Gutenberg, send e-mail consisting of the line, send gutenberg catalog, to
the list-server at almanac@oes.orst.edu.

The second step is to summarize or abstract the raw data by coding. Here, too, real
research imposes significant demands: in the Master & Servant Project we spent more than
a year developing and refining our codebook and validation techniques. By contrast, I used
a crudely opportunistic coding scheme for the tutorial dataset, selecting breakpoints by ‘eye-
balling’ the ranges. Countries with fewer than a million people are ‘small’, those with more
than twenty million are ‘big’, and those in between are ‘mid’-sized. Countries with infant
mortality rates below twenty-five per thousand births have ‘low’ infant mortality, those with
rates above ninety-nine per thousand have ‘high’ mortality, and the rest are ‘mid’. Similarly,
adult literacy rates below forty per cent are ‘low’, those above sixty-nine per cent are ‘high’,
and the rest are ‘mid’. Type of polity is divided simple-mindedly enough between countries
with popular multi-party elections (‘democratic’) and those without (‘dictatorial’). Finally,
‘poor’ countries have per capita GNP or GDP below one thousand United States dollars,
‘rich’ countries have GNP or GDP above five thousand U.S. dollars per capita, and those in
between these limits are ‘mid’.
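To make the breakpoints concrete, the coding scheme can be sketched as simple threshold functions. This is an illustrative Python rendering of the tutorial scheme only, not the project's PDC Prolog code:

```python
def code_population(millions):
    """'small' under one million, 'big' over twenty million, else 'mid'."""
    if millions < 1:
        return "small"
    if millions > 20:
        return "big"
    return "mid"

def code_mortality(deaths_per_thousand):
    """Infant deaths per thousand live births."""
    if deaths_per_thousand < 25:
        return "low"
    if deaths_per_thousand > 99:
        return "high"
    return "mid"

def code_literacy(per_cent):
    """Adult literacy rate."""
    if per_cent < 40:
        return "low"
    if per_cent > 69:
        return "high"
    return "mid"

def code_wealth(gnp_per_capita_usd):
    """Per capita GNP or GDP in U.S. dollars."""
    if gnp_per_capita_usd < 1000:
        return "poor"
    if gnp_per_capita_usd > 5000:
        return "rich"
    return "mid"
```

Any such scheme discards information, of course; the point of the thresholds is only to turn raw magnitudes into comparable attvals.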
The dataset consists of fifty items (countries). Each item is described by a set of five
attributes of interest (population, mortality, literacy, polity and wealth). Each attribute
has a range of values associated with it. Each descriptive characteristic (for example,
‘democratic polity’ or ‘low literacy’) can be represented as an attribute-value pair, or attval
(e.g. ‘polity:democratic’ or ‘literacy:low’). Our objective is to build a classification scheme
or typology that groups similar countries together, and thereby to identify the distinctive
characteristics of each class or type. The more attribute-value pairs (attvals) two countries
have in common, the more similar they are for the purposes of this typology.
To simplify matters further, we can represent each item as a list of values. Since lists
are ordered data structures, the attributes are implicitly represented by their position in the list.
On the computer, the clustering engine expects data in the form of Prolog terms containing
the name of the item and a list of attval lists:7

rec("Denmark",[["mid"],["low"],["high"],["dem"],["rich"]]).
rec("Argentina",[["big"],["mid"],["high"],["dem"],[]]).

Missing data (unknown values) are represented by empty lists, which act as place-holders.
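For readers who do not use Prolog, the same records can be sketched in another notation; this Python rendering is an illustration only, not the format the programme reads:

```python
# One list of values per attribute, in the fixed order:
# population, mortality, literacy, polity, wealth.
# An empty list is a place-holder for missing data.
records = {
    "Denmark":   [["mid"], ["low"], ["high"], ["dem"], ["rich"]],
    "Argentina": [["big"], ["mid"], ["high"], ["dem"], []],   # wealth unknown
}
```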
7. We represent attvals as lists (rather than just strings) because in the process of cluster-generation,
clusters may have multiple values for an attribute. This is illustrated below. The corollary to this is that
we can characterize even single items with multiple values for an attribute. For example, if colour was an
attribute of interest, two of the values might be ‘blue’ and ‘green’. If we then encounter an item which is
greenish-blue or bluish-green, we have a coding choice. We can create a new value (‘grue’ or ‘bleen’, with
apologies to the logicians), but an item with the ‘grue’ attval would group only with other ‘grue’ items. If,
instead, we wanted to represent the idea that this item’s colour was indeterminate between blue and green,
so that it could make a partial match with blue or green items (but not with red ones), we could code
the colour attribute as ["blue","green"], which the programme treats as identical to ["green","blue"].
The item would then have a half-match with blue items, a half-match with green items, and no match with
red items, on the colour attribute. The proportional weighting scheme is illustrated below as it applies to
associating items with clusters; the same principles apply when items have multiple values for one or more
attributes.

For our purposes here, though, we can use a simpler (for people) tabular representation,
which has exactly the same meaning:

COUNTRY POPULATION MORTALITY LITERACY POLITY WEALTH


Denmark mid low high dem rich
Argentina big mid high dem --

One step remains before we can begin to cluster these data: we must decide how many
categories or sub-groupings we want to make. The clustering algorithm works by assigning
items to categories so that those with the most attvals in common are grouped together.
This requires that we know in advance how many categories there are to be. In practice,
this is not a serious limitation. Exploring the data may involve deriving several candidate
typologies, each with a different number of categories. As we shall see, it is possible to
choose from among these candidates the classification scheme that is ‘best’ for the purposes
intended.
In outline, the clustering process involves numerous iterations of four basic steps. First,
we seed the typology by selecting one item to fix the initial values for each category. Second,
we grow the clusters by associating other items with these seeds. Third, we test the
clustering, by measuring the similarity of items within each category and the differences
among categories. Fourth, we use the results of these tests to make a strategy for refining
the clustering by selecting alternate seeds. We apply this strategy to the next iteration,
repeating the whole series of steps until we are satisfied with the result. I shall illustrate
these steps by working through a clustering of the example dataset.

Seed
Clusters are ‘grown’ from single items, or ‘seeds’. These may be real data items (e.g. Den-
mark, Argentina) or ideal types designed by the researcher. There will be one seed for each
cluster in the model. The computer programme will suggest seeds for the initial iteration,
based on a weighted ranking of the items: the user may override this suggestion. After the
first pass, seeds are drawn by the programme in accordance with the search strategy for
each iteration. For our manual illustration, we will choose three seeds that differ from one
another on several dimensions:

POPULATION MORTALITY LITERACY POLITY WEALTH


C1 (Denmark) mid low high dem rich
C2 (Colombia) big mid high dem mid
C3 (Chad) mid high low dict poor

These seeds’ attvals supply the initial descriptions for each cluster (C1, C2, C3); because
there are three of them, we will call the complete model a ‘3-clustering’ of the dataset.

Grow
Clusters are grown by comparing each item in turn with each of the seeds. We count the
number of attvals the item has in common with each seed, and assign it to the most similar
cluster—the one with the greatest number of attvals in common. After first comparing each
item to each seed, we update the cluster descriptions to reflect any new attvals contributed by
the items added.8 Then we repeat the assignment of items and update the cluster descriptions
as many times as necessary to unambiguously assign each item to a cluster. Items that cannot
immediately be assigned to one cluster are suspended until all the unambiguous assignments
have been made.
We’ll start by considering a new item, Barbados:

Barbados small low high dem rich

We compare it to the three seeds, checking off the values it has in common with each of
them:

C1 0 1 1 1 1
C2 0 0 1 1 0
C3 0 0 0 0 0

Barbados has four attvals in common with C1, two with C2 and none with C3. We therefore
assign it to C1. Note, however, that Barbados has a ‘small’ population, while C1, which is
based on Denmark, currently has a ‘mid’-sized population. When we add Barbados to C1,
we update the cluster description to include this new attval. Now C1 includes two values for
the population attribute:

Old C1 mid low high dem rich


Barbados small low high dem rich
New C1 small,mid low high dem rich

Now we add another item:

Djibouti small high low dict mid

We compare it to the three seeds, checking off the values it has in common with each of
them. C1 has been changed with the addition of Barbados, but thus far C2 and C3 remain
the same. Djibouti has one of C1’s two population attvals, so we give it a score of 1/2 on
that attribute:
8. In this illustration, we update cluster descriptions immediately. In the computer programme, each item
is first compared with each of the seeds; only when all these individual comparisons have been made are the
cluster descriptions updated.

C1 1/2 0 0 0 0
C2 0 0 0 0 1
C3 0 1 1 1 0

Djibouti has half an attval in common with C1, one with C2 and three with C3, so we assign
it to C3 and update the cluster description accordingly:

Old C3 mid high low dict poor


Djibouti small high low dict mid
New C3 small,mid high low dict poor,mid

We continue in this way until every item has been assigned to a cluster.
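The whole grow step can be written out compactly. The following Python version is an illustrative reimplementation, not the PDC Prolog programme: it scores partial matches proportionally, exactly as in the Djibouti example, and merges each assigned item's new values into the winning cluster's description:

```python
def match_score(item, cluster):
    """Sum, over attributes, the fraction of the cluster's values matched."""
    score = 0.0
    for item_vals, cluster_vals in zip(item, cluster):
        if item_vals and cluster_vals:          # empty lists (missing data) score 0
            hits = sum(1 for v in item_vals if v in cluster_vals)
            score += hits / len(cluster_vals)
    return score

def assign(item, clusters):
    """Return the index of the best-matching cluster and merge the item in."""
    scores = [match_score(item, c) for c in clusters]
    best = scores.index(max(scores))
    for cluster_vals, item_vals in zip(clusters[best], item):
        for v in item_vals:
            if v not in cluster_vals:
                cluster_vals.append(v)          # e.g. C1 population becomes small,mid
    return best

# The worked example: the three seeds after Barbados has joined C1,
# then Djibouti is scored and assigned.
C1 = [["small", "mid"], ["low"], ["high"], ["dem"], ["rich"]]
C2 = [["big"], ["mid"], ["high"], ["dem"], ["mid"]]
C3 = [["mid"], ["high"], ["low"], ["dict"], ["poor"]]
djibouti = [["small"], ["high"], ["low"], ["dict"], ["mid"]]
scores = [match_score(djibouti, c) for c in (C1, C2, C3)]   # 0.5, 1.0, 3.0
where = assign(djibouti, [C1, C2, C3])                      # 2, i.e. C3
```

The programme proper suspends items whose best score is tied between clusters until the unambiguous assignments have been made; this sketch simply takes the first maximum.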

Test
Once every item has been assigned to a cluster, we use two measures to evaluate the cluster-
ing. Similarity is a measure of how closely the items within each cluster resemble one another.
Disjunction is a measure of how different each cluster is from all the other clusters.9 We
combine these measures to compute an overall score; the computer programme allows the
user to determine the ratio of similarity to disjunction to be used. (In the tutorial example,
we give the two measures equal weight.) We use these tests to make a decision about how
to proceed in the next iteration. They are not intended for making comparisons between
different classification schemes.
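The two measures are defined precisely in note 9 below. As a sketch, here is a literal Python transcription of those formulas; reading ‘pairwise intersections’ as distinct unordered pairs of clusters is my assumption, and the programme's own implementation may differ in detail:

```python
from itertools import combinations

def attvals(cluster):
    """The unique attvals of a cluster description, tagged by attribute index."""
    return {(i, v) for i, vals in enumerate(cluster) for v in vals}

def disjoint_score(clusters):
    """DISJOINT_SCORE = 1 - (((ISUM / N_AV) - 1) / (N_CLUST - 1))."""
    sets = [attvals(c) for c in clusters]
    isum = sum(len(a & b) for a, b in combinations(sets, 2))  # assumed pairing
    n_av = len(set().union(*sets))
    return 1 - (((isum / n_av) - 1) / (len(clusters) - 1))

def simil_score(cluster_members, n_att):
    """SIMIL_SCORE = CSUM / (N_INT * N_ATT), over lists of member items."""
    csum = n_int = 0
    for members in cluster_members:
        if len(members) == 1:           # singleton: intersect member with itself
            csum += n_att
            n_int += 1
            continue
        for a, b in combinations(members, 2):
            # count attributes on which the pair has any value in common
            csum += sum(1 for va, vb in zip(a, b) if set(va) & set(vb))
            n_int += 1
    return csum / (n_int * n_att)
```

On this reading, the fewer attvals two cluster descriptions share, the higher the disjunction score; the more attvals fellow members share, the closer similarity comes to 1.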

Strategy
Our objective is to make the best possible assignment of items to categories within the
constraints of this clustering model. Think of the clustering as a multidimensional space,
9. Disjunction is a property of the clustering as a whole and represents the distinctiveness of each cluster
from all the others. It is measured by calculating the number of attribute-values every cluster has in common
with each of the other clusters; the fewer attribute-values two clusters have in common, the greater their
disjunction:
DISJOINT_SCORE = 1 - (((ISUM / N_AV) - 1) / (N_CLUST - 1))
List all the unique attvals in each cluster. ISUM is the sum of the number of attvals in the pairwise
intersections of these lists. N_AV is the number of unique attvals in the whole clustering. N_CLUST is the
number of clusters.
Similarity is a property of individual clusters and represents their internal consistency. A cluster is con-
sistent if its members are similar. Items are similar if they have attributes (including patterns of association
among attributes) in common. Similarity is measured by calculating the extent to which the members of
each distinct cluster have attribute-values in common; the results are summed over the whole clustering.
SIMIL_SCORE = CSUM / (N_INT * N_ATT)
For each cluster (1 . . . n), take the intersection of each pair of members and count the attributes having
any values in common in CATT(1 . . . n). Accumulate a count of the total number of intersections in N_INT.
Treat the special case of a cluster with only one member (a ‘singleton cluster’) as the intersection of that
member with itself: i.e., put the number of attributes in the singleton’s CATT and add 1 to N_INT. N_ATT
is the number of attributes per item. CSUM is the sum of CATT(1) . . . CATT(n).

one dimension for each attribute. The items are scattered in this space in regions of greater
and lesser density. A clustering model is then a set of boundaries or envelopes marking off
and containing regions in this space. A good clustering will clearly demarcate the different
regions of higher density from one another (disjunction), while including all the neighbour
items within the same envelope (similarity). In such a clustering, the defining characteristics
of each cluster will be those of the items in the denser core of its region. Items which are more
distantly related to these central concepts will be found in the less-populated periphery of
the region. This image is the key to refining the model in successive iterations. We compare
the overall score for the current pass with the overall score for the best pass so far. If the
current score is higher, performance is improving, so new seeds are selected from the ‘core’
of each current cluster. If the current score is lower than the best score so far, performance
is deteriorating. In this case, we choose new seeds from the ‘edges’ of the current clusters.10
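The core/edge selection can be sketched by ranking each cluster's members by how well they match the cluster description, using the same proportional count employed when growing clusters. This Python fragment is a hypothetical illustration of the strategy, not the programme's actual seed-picker:

```python
def match(item, description):
    """Proportional attval match between an item and a cluster description."""
    return sum(
        sum(1 for v in vals if v in dvals) / len(dvals)
        for vals, dvals in zip(item, description) if vals and dvals
    )

def next_seeds(clusters, improving):
    """clusters: a list of (members, description) pairs.
    Seed from the dense core while the score improves, else from the edge."""
    seeds = []
    for members, description in clusters:
        ranked = sorted(members, key=lambda m: match(m, description), reverse=True)
        seeds.append(ranked[0] if improving else ranked[-1])
    return seeds
```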
We then grow new clusters from the new seeds, evaluate the resulting clustering, and
compare it to the score obtained on the previous pass. This process continues until a thresh-
old set by the user is met, or we run out of potential seeds. Normally, we set a minimum
number of iterations at the beginning of the run. We then ‘reward’ improvements in perfor-
mance by adding one or more iterations to this bank. This provision for successive refine-
ments of the model is the ‘machine learning’ aspect of conceptual clustering. The process
continues until the bank is exhausted or the programme runs out of potential seed combina-
tions. At termination, it reports the best clustering it has obtained in the run. The example
3-clustering yielded this result:

C1 (16 items)           C2 (20 items)           C3 (14 items)
------------------      ------------------      ------------------
Afghanistan             Antigua/Barbuda         Albania
Angola                  Australia               Algeria
Bangladesh              Austria                 Argentina
Benin                   Bahrain                 Belize
Bhutan                  Barbados                Botswana
Bolivia                 Belgium                 Brazil
Burkina                 Brunei                  Burma
Burundi                 Bulgaria                Cape Verde
Cambodia                Canada                  China
Cameroon                Chile                   Colombia
Central African Rep.    Costa Rica              Dominican Republic
Chad                    Cuba                    Ecuador
Comoros                 Cyprus                  Egypt
Congo                   Czechoslovakia          El Salvador
Djibouti                Denmark
Ethiopia                Dominica
                        Fiji
                        Finland
                        France
                        The Bahamas
10. After a number of passes with deteriorating performance, we might opt instead for an ‘impatient’ strategy,
selecting new seeds at random, on the theory that if our current method is not working we should try
something quite different. The user can determine how much ‘patience’ the computer programme will
exhibit before concluding that the current strategy is a dead end and embarking on a new path.

Interpretation
So far we have identified which items belong to which categories. This is the ‘clustering’
part of conceptual clustering, but what are the core concepts that distinguish the categories?
This is a matter of interpreting the clustering. A cluster description is a tabulation showing
which attvals occur in each cluster:

                 Population    Mortality     Literacy      Polity      Wealth
                 B   M   S    L   M   H    L   M   H    Dict  Dem    P   M   R

C1               +   +   +            +    +   +         +    +      +   +
C2               +   +   +    +                +   +     +    +          +   +
C3               +   +   +        +            +   +     +    +      +   +

In this 3-clustering, the three groups of countries are completely distinguishable by their
infant mortality rates. There appears to be some association with wealth and literacy,
although it is ambiguous in the mid-range. Population and polity make no difference at all.
A good clustering will be unambiguous. One way of representing this is to transform
the cluster description into a set of rules for the assignment of items to clusters. If all items
can be unambiguously assigned, then the model is a complete one. This rule completely
describes the example:

if infant mortality is high
    then assign to Cluster 1;
else if infant mortality is low
    then assign to Cluster 2;
else assign to Cluster 3;
endif.
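For concreteness, such a rule transliterates directly into a classification function; this Python rendering is my own illustration, not output of the programme:

```python
def classify(mortality):
    """Assign an item of the example 3-clustering from its mortality code."""
    if mortality == "high":
        return "Cluster 1"
    elif mortality == "low":
        return "Cluster 2"
    else:
        return "Cluster 3"
```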

Rules in this form are suitable for inclusion in expert systems or other inductive knowledge-
bases. The cluster description can also be shown as a decision tree (another way of rep-
resenting a rule). In the Master & Servant Project, we use a Prolog implementation of
J.R. Quinlan’s ID3 algorithm to generate decision trees from cluster descriptions.11 The
3-clustering in the example can be described by an extremely simple decision tree:

mortality = mid  ==> Cluster 3
mortality = high ==> Cluster 1
mortality = low  ==> Cluster 2
11. J.R. Quinlan, ‘Learning efficient classification procedures and their application to chess end games,’ in
R.S. Michalski et al. (eds.), Machine Learning: An Artificial Intelligence Approach, I (Palo Alto, CA, 1983);
J.R. Quinlan, ‘Induction of decision trees,’ Machine Learning 1 (1986), 81-106. We use an implementation
by Luis Torgo (Universidade do Porto, 1989), which is available by anonymous ftp from the machine learning
library of algorithms in Prolog at ftp.gmd.de and from the Carnegie Mellon Artificial Intelligence Repository
at ftp.cs.cmu.edu.
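At the heart of ID3 is an information-gain measure: split the labelled items on each attribute in turn, and keep the attribute that most reduces the entropy of the cluster labels. The sketch below is my Python illustration, not Torgo's Prolog implementation; applied to five labelled items from the example, it duly selects mortality as the root of the tree:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain from splitting the rows on attribute index attr."""
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [l for row, l in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return entropy(labels) - remainder

# (population, mortality, literacy, polity, wealth), with final cluster labels
rows = [
    ("mid",   "low",  "high", "dem",  "rich"),  # Denmark  -> C2
    ("small", "low",  "high", "dem",  "rich"),  # Barbados -> C2
    ("mid",   "high", "low",  "dict", "poor"),  # Chad     -> C1
    ("small", "high", "low",  "dict", "mid"),   # Djibouti -> C1
    ("big",   "mid",  "high", "dem",  "mid"),   # Colombia -> C3
]
labels = ["C2", "C2", "C1", "C1", "C3"]
attrs = ["population", "mortality", "literacy", "polity", "wealth"]
root = attrs[max(range(5), key=lambda a: gain(rows, labels, a))]
```

Because mortality splits these items into pure single-cluster groups, its gain equals the full entropy of the labels, and no other attribute can beat it.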

Alternative models
Often the researcher will want to test and compare alternative classification models. For
example, we might want to see whether we get a more desirable typology from a 2-clustering
or a 4-clustering. In comparing different models, we do not compare the evaluation scores
of the best clusterings. Instead, we compare the interpretations of the best clusterings.
For the data presented here, no other model is capable of so simple an interpretation
as the 3-clustering. Compare its decision tree to the decision trees for 2- and 4-clusterings of
the same data. (In the trees that follow, attributes are identified by their position in the item
list: 1 = population, 2 = mortality, 3 = literacy, 4 = polity, 5 = wealth.)
A 2-clustering of the example dataset:

mortality = mid  and  5 = mid  ==> Cluster 2
                      5 = poor ==> Cluster 1
mortality = high               ==> Cluster 1
mortality = low                ==> Cluster 2

A 4-clustering of the example dataset:

2=mid    1=small   3=high ==> Cluster 3
                   3=mid  ==> Cluster 4
         1=mid     3=mid  ==> Cluster 3
                   3=high ==> Cluster 4
         1=big            ==> Cluster 4
2=high                    ==> Cluster 1
2=low    1=big            ==> Cluster 2
         1=mid     3=high   4=dem    5=rich ==> Cluster 2
                                     5=mid  ==> Cluster 3
                            4=dict   5=rich ==> Cluster 2
                                     5=mid  ==> Cluster 3
         1=small   3=mid  ==> Cluster 2
                   3=high ==> Cluster 3

Some models cannot be given a complete interpretation. The engine can be adjusted
to compensate for ambiguous assignments, which may permit complete interpretations to be
derived at the expense of excluding some items. The 2-, 3- and 4-clusterings shown here are
all complete interpretations, of varying degrees of complexity. Typically, the complexity of a
typology will correspond to the number of attributes in its decision tree. (The decision tree
is a rule saying, in effect, that if you know these things about an item you can classify it

without having to know anything else about it. The more you have to know about an item to
classify it, the more complex the typology.) I suggest that in choosing among clusterings the
researcher should apply two criteria: the theoretical or explanatory adequacy of the typology
(does it leave out anything known to be crucial?) and Occam’s razor (less is more). I suggest
further that if the simplest typology is theoretically inadequate, the researcher may want to
reconsider the selection and coding of attributes upon which the clustering is based.
Conceptual clustering is but one example of the fruitful application of techniques and
methods in artificial intelligence and other computer science disciplines to research in the
humanities and social sciences. Historians’ computing has traditionally been oriented to
quantitative analysis, record linkage, and relational database applications, and so has to
some extent been relegated to the periphery of the discipline. With the dissemination of
techniques for textual analysis, comparison of documents, and other qualitative methods,
computer-assisted research tools may join computer-assisted writing tools on the desks of
the historical mainstream.

PAUL CRAVEN teaches in the Social Science Division, York University (4700 Keele Street,
North York, Ontario, Canada M3J 1P3). His publications include ‘An Impartial Umpire’:
Industrial Relations and the Canadian State, 1900-11 (University of Toronto Press, 1980),
Labouring Lives: Work and Workers in 19th-Century Ontario (University of Toronto Press,
1995), and a number of journal articles and book chapters in Canadian labour, legal and
economic history. With Douglas Hay (Osgoode Hall Law School, York University) he co-
directs the Master & Servant Project. He can be reached by e-mail as pcraven@yorku.ca.

