
Algorithms for Model-Based Gaussian Hierarchical Clustering

C. Fraley

Technical Report No. 311
Department of Statistics, University of Washington
Box 354322, Seattle, WA 98195-4322 USA

October 29, 1996

Funded by the Office of Naval Research under contracts N00014-96-1-0192 and N00014-96-1-0330. This work could not have been accomplished without the expertise and enthusiastic support of principal investigator Adrian Raftery.

Abstract
Agglomerative hierarchical clustering methods based on Gaussian probability models have recently shown promise in a variety of applications. In this approach, a maximum-likelihood pair of clusters is chosen for merging at each stage. Unlike classical methods, model-based methods reduce to a recurrence relation only in the simplest case, which corresponds to the classical sum of squares method. We show how the structure of the Gaussian model can be exploited to yield efficient algorithms for agglomerative hierarchical clustering.

Contents

1 Introduction
  1.1 Model-Based Cluster Analysis
  1.2 Hierarchical Agglomeration

2 Efficient Algorithms for the Four Basic Models
  2.1 $\Sigma_k = \sigma^2 I$
  2.2 $\Sigma_k = \sigma_k^2 I$
  2.3 Constant $\Sigma_k$
  2.4 Unconstrained $\Sigma_k$
  2.5 Benchmark Comparisons

3 Extension to more Complex Gaussian Models

4 Concluding Remarks

A Derivation of the Update Formula

List of Tables

1 Four parameterizations of the covariance matrix $\Sigma_k$ in the Gaussian model with the corresponding criteria to be minimized.
2 Parameterizations of the covariance matrix $\Sigma_k$ in the Gaussian model and their geometric interpretation.

List of Figures

1 CPU time vs. number of observations for the four basic models.

1 Introduction
Multivariate Gaussian models have been proposed for quite some time as a basis for clustering algorithms. Recently, methods of this type have shown promise in a number of practical applications [9]. Examples in the geophysical sciences include seismic data processing; in the biological sciences, classification of cell types based on chemical responses; and in the social sciences, classification based on attachment theory in psychology. They have also been used for clustering various types of industrial and financial data. Image-processing applications include unsupervised texture image segmentation, tissue classification in biomedical images, identification of objects in astronomy, analysis of images from molecular spectroscopy, and recognition and classification of surface defects in manufactured products.

Agglomerative hierarchical clustering (Murtagh and Raftery [8], Banfield and Raftery [1]), the EM algorithm and related iterative techniques (Celeux and Govaert [3]), or some combination of these (Dasgupta and Raftery [4]) are effective computational techniques for obtaining partitions from these models. The subject of efficient computation in this context has, however, received little attention. We aim to fill this gap in the case of agglomerative hierarchical clustering. Although no iterative computation is involved, the issue of efficiency is nevertheless important, since the practical value of these methods is limited by a growth in time complexity that is at least quadratic in the number of observations.

This paper is organized as follows: the remainder of this section gives the necessary background in model-based clustering and hierarchical agglomeration. In Section 2, we propose computational techniques for each of the four simplest and most common Gaussian models, and compare the performance of each method to an appropriate benchmark. Finally, extension to more complex Gaussian models is discussed in Section 3.

1.1 Model-Based Cluster Analysis

The relevant probability model is as follows: the population of interest consists of $G$ different subpopulations; the density of a $p$-dimensional observation $x$ from the $k$th subpopulation is $f_k(x; \theta)$ for some unknown vector of parameters $\theta$. Given observations $x = (x_1, \ldots, x_n)$, let $\gamma = (\gamma_1, \ldots, \gamma_n)^T$ denote the identifying labels for the classification, where $\gamma_i = k$ if $x_i$ comes from the $k$th subpopulation. In the classification likelihood approach to clustering, parameters $\theta$ and labels $\gamma$ are chosen so as to maximize the likelihood

$$L(x; \theta, \gamma) = \prod_{i=1}^{n} f_{\gamma_i}(x_i; \theta). \qquad (1)$$

Our focus is on the case where $f_k(x; \theta)$ is multivariate normal (Gaussian) with mean vector $\mu_k$ and variance matrix $\Sigma_k$. The overall approach is much more general and is not restricted to multivariate normal distributions [1]. However, experience to date suggests that clustering based on the multivariate normal distribution is useful in a great many situations of interest ([8], [1], [9], [3], [4]).

When $f_k(x; \theta)$ is multivariate normal, the likelihood (1) has the form

$$L(x; \mu_1, \ldots, \mu_G, \Sigma_1, \ldots, \Sigma_G, \gamma) = \prod_{k=1}^{G} \prod_{i \in I_k} (2\pi)^{-\frac{p}{2}} \left|\Sigma_k\right|^{-\frac{1}{2}} \exp\left\{ -\tfrac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right\}, \qquad (2)$$

where $I_k = \{ i : \gamma_i = k \}$ is the set of indices corresponding to observations belonging to the $k$th group. Replacing $\mu_k$ in (2) with its maximum likelihood estimator $\hat{\mu}_k = \bar{x}_k = \sum_{i \in I_k} x_i / n_k$, where $n_k$ is the number of elements in $I_k$, yields the concentrated log-likelihood

$$l(x; \bar{x}_1, \ldots, \bar{x}_G, \Sigma_1, \ldots, \Sigma_G, \gamma) = -\frac{pn}{2}\log(2\pi) - \frac{1}{2}\sum_{k=1}^{G}\left\{ \mathrm{tr}\left(W_k \Sigma_k^{-1}\right) + n_k \log\left|\Sigma_k\right| \right\}, \qquad (3)$$

in which $W_k = \sum_{i \in I_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^T$ is the sample cross-product matrix for the $k$th group.

If $\Sigma_k = \sigma^2 I$, then the log-likelihood (3) is maximized by classifications that minimize $\mathrm{tr}\bigl(\sum_{k=1}^{G} W_k\bigr)$. This is the well-known sum of squares criterion which, for example, was suggested by Ward [11] as a possible metric when he proposed the agglomerative hierarchical method for clustering. An alternative that allows a different variance for each group is $\Sigma_k = \sigma_k^2 I$, in which case $\gamma$ is chosen so as to minimize $\sum_{k=1}^{G} n_k \log\bigl[\mathrm{tr}(W_k)/n_k\bigr]$ [1]. If $\Sigma_k = \Sigma$ is the same for all groups but otherwise has no structural constraints, then values of $\gamma$ that minimize $\bigl|\sum_{k=1}^{G} W_k\bigr|$ maximize the log-likelihood [5]. When $\Sigma_k$ is allowed to vary completely between groups, the log-likelihood is maximized whenever $\gamma$ minimizes $\sum_{k=1}^{G} n_k \log\bigl|W_k/n_k\bigr|$ [10]. Table 1 summarizes the equivalent criteria to be minimized corresponding to these four parameterizations of $\Sigma_k$.

  $\Sigma_k = \sigma^2 I$:  $\mathrm{tr}\bigl(\sum_{k=1}^{G} W_k\bigr)$
  $\Sigma_k = \sigma_k^2 I$:  $\sum_{k=1}^{G} n_k \log\bigl[\mathrm{tr}(W_k)/n_k\bigr]$
  $\Sigma_k = \Sigma$ (constant):  $\bigl|\sum_{k=1}^{G} W_k\bigr|$
  $\Sigma_k$ unconstrained:  $\sum_{k=1}^{G} n_k \log\bigl|W_k/n_k\bigr|$

Table 1: Four parameterizations of the covariance matrix $\Sigma_k$ in the Gaussian model with the corresponding criteria to be minimized.
1.2 Hierarchical Agglomeration

Agglomerative hierarchical clustering (Ward 11]) is a stagewise procedure in which `optimal' pairs of clusters are successively merged. Each stage of merging corresponds to a unique 2

number of clusters, and a unique partition of the data. Classi cations di er according to the criterion for optimality, and the strategy for choosing a single pair when more than one is optimal. In model-based hierarchical clustering, a maximum-likelihood pair is merged at each stage. Although the resulting partitions are suboptimal, agglomerative hierarchical clustering methods are in common use because they often yield reasonable results and are relatively easy to compute. For model-based clustering, another advantage of hierarchical agglomeration is that there is an associated Bayesian criterion for choosing the best partition (hence the optimal number of clusters) from among those de ned by the hierarchy 1]. Hierarchical clustering can be accomplished by splitting rather than agglomeration, but the complexity of such algorithms is combinatorial unless severe restrictions on the allowed subdivisions are applicable. The process of hierarchical agglomeration is usually assumed to start with each observation in a cluster by itself, and proceed until all observations are in a single cluster. However it could just as well be started from a given partition and proceed from there to form larger clusters. The value returned consists of a `classi cation tree' (a list of pairs of clusters merged), and possibly the optimal value of the change in criterion at each stage. In classical agglomerative methods (e. g. sum of squares, nearest and farthest neighbor (single and complete link) 7]), there is a metric or `cost' based on geometric considerations associated with merging a pair of clusters. For a particular pair this cost remains xed as long as neither of the clusters in that pair is involved in a merge, so that the time complexity of hierarchical agglomeration can be signi cantly reduced if the cost of merging pairs is retained and updated during the course of the algorithm. The overall memory usage is then proportional to the square of the initial number of clusters (usually just the initial number of observations), which could be a severe limitation. For large data sets, one possible strategy is to apply hierarchical agglomeration to a subset of the data and partition the remaining observations via supervised classi cation or discriminant analysis. Ban eld and Raftery 1] used only 522 out of 26,000 pixels in an initial hierarchical phase to successfully classify tissue from an MRI brain-scan image via Gaussian model-based techniques. For each classical method, there is a simple recurrence relation for updating the cost of merging pairs. In the sum-of-squares method, this recurrence is (nj + nk ) (j k) ; nk (i j ) (hi j i k) = (ni + nk ) (i k) + (4) ni + nj + nk where (i j ) is the cost of merging groups i and j , and hi j i represents the group formed by merging groups i and j . Once the initial cost of merging each pair is obtained, computation can proceed without further reference to the data the size of each group must be retained and updated. The amount of space needed for (i j ) decreases as the number of groups increases. A memory-e cient scheme for maintaining (i j ) is as follows. Assume (without loss of generality) that observation number k is the observation of smallest index in group k in the initial classi cation. If for each group j > 1 values of (j i) are stored for all i < j , then it is easy to recover space during the course of the computation. 
Assuming that j is the highest index in a particular merge, and l is the largest current index, the space associated with group j can be used for group l, thereby freeing the (larger) space associated with group l. The original indexes for the groups can easily be recovered at the end. In programming 3

languages such as Fortran 77 in which memory allocation is static, values of (j i) can be stored sequentially in the order (2 1) (3 1) (3 2) (4 1) (4 2) (4 3) : : :, so that the scheme described above leaves contiguous free space that can be used for the classi cation tree and other return values. In languages such as C that allow dynamic memory allocation, a separate list of values (j i) for all i < j can be maintained for each j the space associated with the list for the largest value of j can be freed at each stage under this scheme. Model-based methods generally require more computational resources than classical methods. In some there is no advantage in storing the cost of merging pairs, and some require relatively expensive computations such as determinants of the cross-product matrices. The object of this paper is to show that there are relatively e cient methods for agglomerative hierachical clustering based on Gaussian models.
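As an illustration of the classical case, here is a minimal sketch of sum-of-squares agglomeration driven entirely by recurrence (4). It starts from singletons, keeps the pairwise costs in a Python dictionary rather than the packed triangular layout described above, and uses illustrative names throughout; it is not the report's Fortran implementation.

```python
import numpy as np

def ward_agglomeration(X):
    """Sum-of-squares agglomeration via recurrence (4).

    X : (n, p) array. Starts from singletons, for which
    delta(i, j) = ||x_i - x_j||^2 / 2. Returns the list of merges
    (i, j, cost); the merged group keeps the smaller label i.
    """
    n = len(X)
    size = {i: 1 for i in range(n)}
    delta = {(i, j): 0.5 * float(np.sum((X[i] - X[j]) ** 2))
             for j in range(n) for i in range(j)}
    merges = []
    active = set(range(n))
    while len(active) > 1:
        i, j = min(delta, key=delta.get)          # minimum-cost (maximum-likelihood) pair
        cost = delta.pop((i, j))
        merges.append((i, j, cost))
        for k in active - {i, j}:
            a = (min(i, k), max(i, k))
            b = (min(j, k), max(j, k))
            # recurrence (4): cost of merging <i,j> with k
            delta[a] = ((size[i] + size[k]) * delta[a] +
                        (size[j] + size[k]) * delta.pop(b) -
                        size[k] * cost) / (size[i] + size[j] + size[k])
        size[i] += size[j]                        # group <i,j> keeps label i
        active.remove(j)
    return merges
```

The singleton starting value $\|x_i - x_j\|^2/2$ agrees with the expression $w_{ij}^T w_{ij}$ derived in Section 2.1.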

2 Efficient Algorithms for the Four Basic Models


There is clearly structure to be exploited in the various criteria, in which $W_k$ is a symmetric, positive semidefinite matrix (see Table 1 in Section 1.1). Moreover, since only two groups are merged at each stage of hierarchical agglomeration, there should be a close relationship between criteria at successive stages. In fact, the sample cross-product matrix for the merged group can be obtained from the sum of the sample cross-product matrices of its two component groups by means of a symmetric rank-1 update:

$$W_{\langle i,j\rangle} = W_i + W_j + w_{ij} w_{ij}^T, \qquad (5)$$

where

$$w_{ij} = \sqrt{\frac{n_i n_j}{n_i + n_j}} \left( \frac{s_i}{n_i} - \frac{s_j}{n_j} \right), \qquad (6)$$

and $s_k$ denotes the sum of the observations for group $k$. A derivation is given in the Appendix. In the remainder of this section we show that this relation leads to efficient algorithms for all of the methods of Table 1. We assume that the input consists of an $n \times p$ matrix whose rows correspond to individual observations, and a vector of length $n$ indicating the initial classification of each observation.
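A minimal sketch of the update itself follows, assuming only NumPy; `merge_vector` and `merged_cross_product` are hypothetical helper names.

```python
import numpy as np

def merge_vector(s_i, n_i, s_j, n_j):
    """w_ij of equation (6), formed from the group sums s_k and sizes n_k."""
    return np.sqrt(n_i * n_j / (n_i + n_j)) * (s_i / n_i - s_j / n_j)

def merged_cross_product(W_i, W_j, s_i, n_i, s_j, n_j):
    """W_<i,j> = W_i + W_j + w_ij w_ij^T, the symmetric rank-1 update of equation (5)."""
    w = merge_vector(s_i, n_i, s_j, n_j)
    return W_i + W_j + np.outer(w, w)
```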

2.1 $\Sigma_k = \sigma^2 I$

When the covariance matrix is constrained to be diagonal and uniform across all groups, the criterion to be minimized at each stage is

$$\mathrm{tr}\left(\sum_{k=1}^{G} W_k\right) = \sum_{k=1}^{G} \mathrm{tr}(W_k). \qquad (7)$$

This is the sum-of-squares criterion, long known as a heuristic before any relationship to the Gaussian model was recognized: $\mathrm{tr}(W_k)$ is the sum of squares of observations in group $k$ with the group mean subtracted out. Of the classical methods, it is the only one known to have an underlying statistical model. In view of the recurrence relation (4), all that is required to start the hierarchical clustering procedure is a set of values $\delta(i,j)$ and the number of observations in each group. First, the value of $\delta(i,j)$ for each pair of observations can be computed; in the absence of other information, the individual observations usually constitute the initial partition of the data, and nothing further need be done. For coarser initial partitions, the recurrence relation could be used to obtain the initial values for hierarchical clustering given $\delta(i,j)$ for each pair of observations; merges for initialization are determined by the given partition rather than by the minimum value of $\delta(i,j)$ at any stage. The process just described, however, requires storage proportional to the square of the number of observations $n$, which is undesirable if there are $m < n$ groups to begin with.

The update formula (5) leads to a better initialization procedure for $\delta(i,j)$, since

$$\delta(i,j) = \mathrm{tr}\left(W_{\langle i,j\rangle}\right) - \left[\mathrm{tr}(W_i) + \mathrm{tr}(W_j)\right] = \mathrm{tr}\left(W_{\langle i,j\rangle} - (W_i + W_j)\right) = \mathrm{tr}\left(w_{ij} w_{ij}^T\right) = w_{ij}^T w_{ij}. \qquad (8)$$

Because we are assuming that $k$ is the smallest index associated with observations in group $k$, we can overwrite the $k$th observation by the sum $s_k$ of observations in that group, and the $k$th element of the classification vector with $n_k$. This can be accomplished in $O(np)$ time, and requires no additional storage since the input is overwritten. Then (8) and (6) can be used to initialize $\delta(i,j)$. The total storage required would then be $O(np + m^2)$: $O(np)$ for the input, and $O(m^2)$ for $\delta(i,j)$.
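A sketch of this initialization, assuming NumPy: each initial group is collapsed to its sum $s_k$ and size $n_k$ (held here in separate dictionaries rather than by overwriting the input), and then (6) and (8) are applied to every pair. Names are illustrative.

```python
import numpy as np

def init_sum_of_squares_costs(X, labels):
    """delta(i, j) = w_ij^T w_ij for every pair of initial groups (equations (6) and (8))."""
    labels = np.asarray(labels)
    groups = list(np.unique(labels))
    s = {g: X[labels == g].sum(axis=0) for g in groups}    # group sums s_k
    n = {g: float(np.sum(labels == g)) for g in groups}    # group sizes n_k
    delta = {}
    for b in range(1, len(groups)):
        for a in range(b):
            gi, gj = groups[a], groups[b]
            w = np.sqrt(n[gi] * n[gj] / (n[gi] + n[gj])) * (s[gi] / n[gi] - s[gj] / n[gj])
            delta[(gi, gj)] = float(w @ w)                  # trace of the rank-1 update
    return delta
```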

2.2 $\Sigma_k = \sigma_k^2 I$

When the covariance of each group is constrained to be diagonal, but otherwise allowed to vary between groups, the criterion to be minimized at each stage is

$$\sum_{k=1}^{G} n_k \log\left[\mathrm{tr}\left(\frac{W_k}{n_k}\right)\right] = \sum_{k=1}^{G} n_k \log\left[\frac{\mathrm{tr}(W_k)}{n_k}\right]. \qquad (9)$$

As for the sum-of-squares and the other classical methods, $\delta(i,j)$ remains unchanged from stage to stage unless either group $i$ or group $j$ is involved in a merge, so that storing $\delta(i,j)$ results in a gain in time efficiency for hierarchical clustering. From the update formula (5), we have

$$\delta(i,j) = (n_i + n_j)\log\left[\frac{\mathrm{tr}\left(W_{\langle i,j\rangle}\right)}{n_i + n_j}\right] - \left\{ n_i \log\left[\frac{\mathrm{tr}(W_i)}{n_i}\right] + n_j \log\left[\frac{\mathrm{tr}(W_j)}{n_j}\right] \right\}$$
$$= (n_i + n_j)\log\left[\frac{\mathrm{tr}(W_i) + \mathrm{tr}(W_j) + w_{ij}^T w_{ij}}{n_i + n_j}\right] - \left\{ n_i \log\left[\frac{\mathrm{tr}(W_i)}{n_i}\right] + n_j \log\left[\frac{\mathrm{tr}(W_j)}{n_j}\right] \right\}.$$

Unlike the classical methods, there is no simple recurrence relation for $\delta(\langle i,j\rangle, k)$ given $\delta(i,j)$, $\delta(i,k)$, and $\delta(j,k)$. However, there is a reasonably efficient update; all that is required is to maintain values of $n_k$, $\mathrm{tr}(W_k)$ and $n_k \log\left[\mathrm{tr}(W_k)/n_k\right]$ in addition to $s_k$. The vector $s_k$ can overwrite the $k$th observation, as was done for the sum of squares. But this time, for each group $k$ that has more than one element, the index $k'$ of the next observation in that group is stored in the $k$th element of the classification vector. The number of elements in the group is stored in the $k'$th entry of the classification vector, while the trace of the sample cross-product matrix and the corresponding term of the criterion overwrite the first two elements of the $k'$th observation. With this scheme no additional storage is necessary ($p \geq 2$) beyond that required for $\delta(i,j)$ and the input.

The issue of terms in which $\mathrm{tr}(W_k) = 0$ remains to be resolved; this includes those terms corresponding to groups consisting of a single observation, as well as those groups in which all observations coincide. Hence the first stages of hierarchical clustering will be arbitrary without some sort of initialization procedure. We replace (9) with a modified criterion in order to handle these cases transparently:

$$\sum_{k=1}^{G} n_k \log\left[\frac{\mathrm{tr}(W_k)}{n_k} + \alpha\,\frac{\mathrm{tr}(W)}{np}\right], \qquad (10)$$

where $W$ is the sample cross-product matrix for the group consisting of all observations, and $\alpha$ has the default value of 1. The factor $\mathrm{tr}(W)/np$ is an attempt to take into account scaling in the data, since the resulting criterion is scale dependent.
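The following sketch computes the merge cost for this model from the per-group quantities the text says must be maintained ($n_k$, $s_k$, $\mathrm{tr}(W_k)$), using the modified criterion (10). The weight written here as `alpha` stands for the regularization factor whose default value is 1; all names are illustrative, and this is not the report's implementation.

```python
import numpy as np

def diag_varying_cost(n_i, s_i, trW_i, n_j, s_j, trW_j, trW, n, p, alpha=1.0):
    """Change in the modified criterion (10) when groups i and j are merged.

    Each group enters only through n_k, its observation sum s_k, and tr(W_k);
    trW, n and p describe the whole data set (the scaling term of (10)).
    """
    w = np.sqrt(n_i * n_j / (n_i + n_j)) * (s_i / n_i - s_j / n_j)
    trW_merged = trW_i + trW_j + w @ w                 # tr W_<i,j> via (5)
    reg = alpha * trW / (n * p)
    term = lambda nk, trWk: nk * np.log(trWk / nk + reg)
    return term(n_i + n_j, trW_merged) - term(n_i, trW_i) - term(n_j, trW_j)
```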

2.3 Constant $\Sigma_k$

When the covariance matrix is uniform across all groups but otherwise has no structural constraints, the criterion to be minimized at each stage is

$$\left|\sum_{k=1}^{G} W_k\right|. \qquad (11)$$

In contrast to the methods discussed up to this point, the change in criterion caused by merging two groups is affected by merges among other groups in the current classification. Hence it is of no advantage here to store $\delta(i,j)$ for the duration of the computation. Nevertheless, (5) leads to an efficient update, because the change in $W = \sum_{k=1}^{G} W_k$ when groups $i$ and $j$ are merged can be represented as

$$W \leftarrow W + \left\{ W_{\langle i,j\rangle} - (W_i + W_j) \right\} = W + w_{ij} w_{ij}^T. \qquad (12)$$

Instead of $W$ itself, we maintain a lower triangular Cholesky factor $L$ for $W$ (see e.g. [6]), since then the determinant can be easily computed as the square of the product of the diagonals of $L$:

$$|W| = \left|LL^T\right| = |L|^2 = \left\{\prod \mathrm{diag}(L)\right\}^2.$$

Noting that $W = 0$, and hence $L = 0$, when each observation is in a group by itself, we can then either build the Cholesky factor for a coarser initial partition, or else merge optimal groups in hierarchical agglomeration, as follows:

$$W + w_{ij} w_{ij}^T = LL^T + w_{ij} w_{ij}^T = \begin{pmatrix} L & w_{ij} \end{pmatrix} \begin{pmatrix} L^T \\ w_{ij}^T \end{pmatrix},$$

and a sequence of Givens rotations applied to $\begin{pmatrix} L^T \\ w_{ij}^T \end{pmatrix}$ restores upper triangular form, yielding $\begin{pmatrix} \tilde{L}^T \\ 0 \end{pmatrix}$ with $W + w_{ij} w_{ij}^T = \tilde{L}\tilde{L}^T$. A Givens rotation is an elementary orthogonal transformation that allows selective and numerically stable introduction of zero elements in a matrix. The time efficiency of the Cholesky update is $O(p^2)$, in contrast to $O(p^3)$ for forming a new Cholesky factor from the updated $p \times p$ matrix $W$. For details of the Cholesky update via Givens rotations, see e.g. [6].

Although the criterion is defined for all possible partitions, there remains a problem with initialization: $|W| = 0$ whenever $W$ has rank less than $p$. In particular, the first stages of hierarchical clustering using this criterion will be arbitrary if initially each observation is in a cluster by itself, since merging any pair of observations $i$ and $j$ will result in $W = W_{\langle i,j\rangle}$, and $\left|W_{\langle i,j\rangle}\right| = 0$ whenever $i$ and $j$ are singletons ($p \geq 2$). To circumvent this, we use the sum-of-squares criterion $\mathrm{tr}(W)$ to determine merges until the value of $|W|$ is positive, while maintaining the Cholesky factor of $W$. As the computation proceeds, $s_k$ overwrites the data and $n_k$ overwrites the classification vector; most of the information needed to recover the classification tree and optimal values of the criterion can be stored in the portions of these structures that are no longer needed in the algorithm. Other than the necessary $O(p^2)$ storage for maintaining $L^T$, additional storage of size $O(M)$, where $M$ is the number of stages, is needed when $p < 4$ to store the merge indexes in order to completely reconstruct the classification tree.
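A compact version of the rank-one Cholesky update is sketched below (the standard algorithm, cf. Golub and Van Loan [6]; this is not the report's Fortran code). It assumes the diagonal of $L$ is strictly positive, so in the setting above it applies once $|W| > 0$; the earlier merges are decided by $\mathrm{tr}(W)$ as described.

```python
import numpy as np

def chol_update(L, w):
    """Return L' (lower triangular) with L' L'^T = L L^T + w w^T in O(p^2) operations."""
    L, w = L.astype(float).copy(), w.astype(float).copy()
    p = len(w)
    for k in range(p):
        r = np.hypot(L[k, k], w[k])              # new diagonal after zeroing w[k]
        c, s = r / L[k, k], w[k] / L[k, k]       # parameters of the rotation
        L[k, k] = r
        if k + 1 < p:
            L[k + 1:, k] = (L[k + 1:, k] + s * w[k + 1:]) / c
            w[k + 1:] = c * w[k + 1:] - s * L[k + 1:, k]
    return L

def log_det_from_factor(L):
    """log |W| = 2 * sum(log diag(L)), the determinant read off the maintained factor."""
    return 2.0 * np.sum(np.log(np.diag(L)))
```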
2.4 Unconstrained $\Sigma_k$

When the covariance matrix is allowed to vary completely between groups, the criterion to be minimized at each stage is

$$\sum_{k=1}^{G} n_k \log\left|\frac{W_k}{n_k}\right|. \qquad (13)$$

Like the criteria discussed in Sections 2.1 and 2.2, each group contributes a separate additive term to (13), so that it is time efficient to save values of $\delta(i,j)$. If $L_k$ denotes the Cholesky factor of $W_k$, then, in view of (5),

$$L_{\langle i,j\rangle} L_{\langle i,j\rangle}^T = L_i L_i^T + L_j L_j^T + w_{ij} w_{ij}^T = \begin{pmatrix} L_i & L_j & w_{ij} \end{pmatrix} \begin{pmatrix} L_i & L_j & w_{ij} \end{pmatrix}^T.$$

$L_{\langle i,j\rangle}$ can be computed efficiently from $L_i$, $L_j$ and $w_{ij}$ by applying Givens rotations (see Section 2.3) to the composite matrix:

$$\begin{pmatrix} L_i^T \\ L_j^T \\ w_{ij}^T \end{pmatrix} \;\rightarrow\; \cdots \;\rightarrow\; \begin{pmatrix} L_{\langle i,j\rangle}^T \\ 0 \end{pmatrix}. \qquad (14)$$

The composite matrix is never explicitly formed; instead, $w_{ij}$ and each row of $L_j^T$ are treated as separate updates to $L_i^T$. Although the time efficiency to form the updated Cholesky factor is of the same order of magnitude as that for forming a new Cholesky factor from the updated $W_{\langle i,j\rangle}$, the use of (14) has an advantage in storage efficiency. Maintaining the upper triangle of each $W_k$ and updating directly via (5) would require more storage, since $W_k$ has $p$ rows regardless of $n_k$, whereas $L_k$ has at most $\min(n_k - 1, p)$ nonzero rows ($W_k = 0$ whenever $n_k = 1$). Moreover, $s_k$ and $L_k^T$ can overwrite the data, and the entries corresponding to the lower triangle of $L_k^T$ can be used for the necessary pointers and values to be updated (the $k$th term in (13)). Besides what is required for the data and for $\delta(i,j)$, additional $O(p^2)$ storage is needed for the Cholesky factors when updating $\delta(i,j)$.

Finally, because $|W_k| = 0$ whenever $n_k < p$, there is even greater ambiguity with criterion (13) than with either (11) or (9). For this reason, we use

$$\sum_{k=1}^{G} n_k \log\left[\left|\frac{W_k}{n_k}\right| + \alpha_1\,\mathrm{tr}\left(\frac{W_k}{n_k}\right) + \alpha_2\,\frac{\mathrm{tr}(W)}{np}\right] \qquad (15)$$

in place of (13), which gives a hybrid between (13) and the modified criterion (10) for $\Sigma_k = \sigma_k^2 I$. The default value for both $\alpha_1$ and $\alpha_2$ is 1.
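A sketch of the merge in (14), assuming NumPy. Here the stacked matrix is re-triangularized with a QR factorization, which is mathematically equivalent to the row-by-row Givens scheme described above (the report never forms the composite matrix explicitly); the resulting factor is determined only up to the signs of its rows, which does not affect the determinant.

```python
import numpy as np

def merge_cholesky(L_i, L_j, w_ij):
    """Factor of the merged group: L L^T = L_i L_i^T + L_j L_j^T + w_ij w_ij^T (equation (14))."""
    stacked = np.vstack([L_i.T, L_j.T, w_ij[None, :]])   # rows: L_i^T, L_j^T, w_ij^T
    R = np.linalg.qr(stacked, mode='r')                  # R^T R = stacked^T stacked = W_<i,j>
    return R.T                                           # lower-triangular factor

def log_det_term(L, n_k):
    """n_k log |W_k / n_k|, one term of criterion (13); requires |W_k| > 0."""
    p = L.shape[0]
    return n_k * (2.0 * np.sum(np.log(np.abs(np.diag(L)))) - p * np.log(n_k))
```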

2.5 Benchmark Comparisons

In this section we compare the algorithms developed in Sections 2.1-2.4 with approaches that do not use the update formula (5) but are otherwise efficient. The `benchmark' algorithms leave the data intact, and keep track of the composition of all groups in the classification vector used to transmit the initial partition. For $\Sigma_k = \sigma^2 I$ and $\Sigma_k = \sigma_k^2 I$, $\delta(i,j)$ is updated after a merge by first forming the column means for the combined group, then forming the sum of squares for all of the observations in both groups with the column mean subtracted out. For constant $\Sigma_k$, the upper triangle of $W = \sum_{k=1}^{G} W_k$ after a merge is updated using the first part of (12): $W \leftarrow W + \{W_{\langle i,j\rangle} - (W_i + W_j)\}$. Column sums for $W_i$ and $W_j$ are computed from the list of observations in each group, and these are added to form the column sum for $W_{\langle i,j\rangle}$. Then the upper triangle of $W + \{W_{\langle i,j\rangle} - (W_i + W_j)\}$ is formed using symmetric rank-one operations. The quantity $|W|$ for the merge is then computed from its Cholesky decomposition, as described in Section 2.3. For unconstrained $\Sigma_k$, the upper triangle of $W_{\langle i,j\rangle}$ is formed from scratch after a merge involving $i$ or $j$, and the Cholesky decomposition is used to obtain its determinant. In addition to the $O(p^2)$ storage used for the sample cross-product matrices, $O(n)$ storage is allocated in the benchmarks for the return values, for the number of observations in each cluster, and, to facilitate updating $\delta(i,j)$, for the term contributed by each cluster to the criterion.

The methods of Section 2 use considerably less storage: constant variance requires $O(n)$ additional storage to recover results, and constant and unconstrained variance require $O(p^2)$ additional storage for maintaining Cholesky factors. Figure 1 shows marked gains in time efficiency for the methods of Section 2 over the benchmarks. Randomly generated observations of dimension $p = 5$ were used, with the default initial partition in which each singleton observation constitutes a cluster. The basic methods were written in Fortran with an S-Plus interface, and the time shown is the average over nine different data sets using S-Plus Version 3.3 for Unix (MathSoft, Inc., Seattle, WA, 1995) on a Silicon Graphics Iris workstation under the IRIX 5.2 operating system. The solid line represents the performance of algorithms based on the update formula (5), while the dashed line represents the performance of algorithms in which the necessary quantities are obtained without updating. The effect is most dramatic in the unconstrained case, where extrapolated results (ignoring effects of increased memory usage) show an improvement by a factor of around 15 for $n = 5000$, as compared to a factor of around 4 for $n = 500$. The results for $\Sigma_k = \sigma^2 I$ are also a point of comparison, since in that case the solid line represents the classical sum of squares approach via the well-known recurrence relation (4). Note that the time scale for the constant-variance method differs from that of the other methods, which use more memory in exchange for improved time efficiency.

3 Extension to more Complex Gaussian Models

Banfield and Raftery [1] developed a model-based framework that subsumes all of the parameterizations in Table 1. The resulting clustering methods include some criteria that are more general than $\Sigma_k = \sigma^2 I$ or constant $\Sigma_k$, while still constraining the structure of $\Sigma_k$. This is accomplished by means of a reparameterization of the covariance matrix in terms of its eigenvalue decomposition

$$\Sigma_k = \lambda_k D_k A_k D_k^T, \qquad (16)$$

where $D_k$ is the orthogonal matrix of eigenvectors, $A_k$ is a diagonal matrix whose elements are proportional to the eigenvalues of $\Sigma_k$, and $\lambda_k$ is a scalar proportional to the volume of the ellipsoid. The orientation of the principal components of $\Sigma_k$ is determined by $D_k$, while $A_k$ determines the shape of the density contours. This paradigm is particularly useful for two- and three-dimensional data, where geometric features can often be identified visually. It may also be applicable for higher-dimensional data when multivariate visualization analysis reveals some structure. For example, Banfield and Raftery [1] were able to closely match the clinical classification of a biomedical data set using Gaussian hierarchical clustering after analyzing its geometric features.

The parameterization can be selected so as to allow some but not all of the characteristics (orientation, volume and shape) of distributions to vary between groups, while constraining others to be the same. Analysis of the model that leads to the sum of squares criterion ($\Sigma_k = \lambda I$ or $\sigma^2 I$) in terms of (16) suggests that it is likely to be most appropriate when groups are spherical and of approximately the same size. The constant-variance assumption, in which $D_k$, $\lambda_k$ and $A_k$ are the same for all groups but otherwise unconstrained, favors clusters that are ellipsoidal with the same orientation, shape, and volume.

[Figure 1: CPU time vs. number of observations for the four basic models. The solid line represents the methods proposed in this paper. Panels: Diagonal (Uniform), Diagonal (Varying), Unconstrained, and Constant Variance; each panel plots time in seconds against the number of observations (100-500).]

If all elements of $\Sigma_k$ are allowed to vary between groups, the resulting classification is likely to contain elliptical groups with differing geometric features. Metrics appropriate for various intermediate situations can also be formulated. For example, assuming that $\Sigma_k = \lambda_k I$ or $\sigma_k^2 I$ implies that the underlying densities are spherical, while variation in $\lambda_k$ between groups allows their volumes to differ. Celeux and Govaert [2] analyzed this criterion and showed that it can give classification performance that is much better than traditional methods. In one example, they successfully apply the method to an astronomical image in which one tightly clustered galaxy is contained within another, more dispersed one. Table 2 shows relationships between orientation, volume and shape discussed in [1]. Criteria based on other combinations of these factors are also possible [3]. Software for hierarchical clustering based on these models is available in the public domain (see [1]); it has been used in a variety of applications with some success [9]. A revision based on the techniques described in this paper is currently in progress. Efficient computational methods for the first four models in Table 2 were given in Section 2.

| $\Sigma_k$ | Distribution | Volume | Shape | Orientation | Reference |
|---|---|---|---|---|---|
| $\lambda I$ | Spherical | fixed | fixed | NA | [11], [5], [10], [8], [1], [3] |
| $\lambda_k I$ | Spherical | variable | fixed | NA | [1], [3] |
| $\lambda D A D^T$ | Elliptical | fixed | fixed | fixed | [5], [10], [1], [3] |
| $\lambda_k D_k A_k D_k^T$ | Elliptical | variable | variable | variable | [10], [1], [3] |
| $\lambda D_k A D_k^T$ | Elliptical | fixed | fixed | variable | [8], [1], [3] |
| $\lambda_k D_k A D_k^T$ | Elliptical | variable | fixed | variable | [1], [3] |

Table 2: Parameterizations of the covariance matrix $\Sigma_k$ in the Gaussian model and their geometric interpretation. The models shown here are those discussed in Banfield and Raftery [1].

We conclude this section by showing how those techniques can be applied to the remaining models, $\Sigma_k = \lambda D_k A D_k^T$ and $\Sigma_k = \lambda_k D_k A D_k^T$. The relevant criteria are $\sum_{k=1}^{G} \mathrm{tr}\left(A^{-1}\Omega_k\right)$ and $\sum_{k=1}^{G} n_k \log\left[\mathrm{tr}\left(A^{-1}\Omega_k\right)/n_k\right]$, respectively, where $\Omega_k$ is the diagonal matrix of eigenvalues of $W_k$. In both cases, efficient algorithms are possible if $s_k$ and $L_k^T$ are maintained and updated in the storage provided for the original data, as described in Section 2.4. Instead of information pertaining to the terms of (15), the $k$th term of the sum in the appropriate criterion is stored in the space corresponding to the lower triangle of $L_k^T$. As in all of the methods in Section 2, the matrix $W_{\langle i,j\rangle}$ is never explicitly formed. Instead, its nonzero eigenvalues are obtained as the squares of the singular values of $L_{\langle i,j\rangle}^T$, which has $\min(n_{\langle i,j\rangle} - 1, p)$ rows and $p$ columns. For $\Sigma_k = \lambda_k D_k A D_k^T$, we include the additive term that appears in (10) for $\Sigma_k = \lambda_k I$ or $\sigma_k^2 I$ inside the logarithm, in order to accommodate those cases in which $W_k$ (and hence $\Omega_k$) vanishes.
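A sketch of the eigenvalue computation for these two models, assuming NumPy. Here `A_diag` holds the diagonal of the common shape matrix $A$, which this sketch treats as known, with its entries and the computed eigenvalues both in decreasing order; the names and this particular division of labor are illustrative, not the report's implementation.

```python
import numpy as np

def merged_eigenvalues(L_i, L_j, w_ij):
    """Eigenvalues of W_<i,j>, obtained as squared singular values of the stacked factor."""
    stacked = np.vstack([L_i.T, L_j.T, w_ij[None, :]])
    sv = np.linalg.svd(stacked, compute_uv=False)   # returned in decreasing order
    return sv ** 2

def shape_criterion_term(eigvals, A_diag, n_k):
    """n_k log[tr(A^{-1} Omega_k) / n_k], one term of the lambda_k D_k A D_k^T criterion.
    eigvals and A_diag are assumed sorted in matching (decreasing) order."""
    return n_k * np.log(np.sum(eigvals / A_diag) / n_k)
```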

4 Concluding Remarks
This paper has made several contributions toward computational efficiency in agglomerative hierarchical clustering. First, we gave a memory-efficient scheme suitable for any method that stores the change in criterion for each merged pair. Second, we showed that the sample cross-product matrix for the union of two Gaussian clusters can be formed by a rank-one update of the sum of the sample cross-product matrices of its constituent clusters, and described how this can be used to obtain efficient algorithms for model-based clustering. This included a memory-efficient initialization strategy for the sum of squares method, which corresponds to the simplest Gaussian model, as well as time- and memory-efficient algorithms for three other Gaussian models that have no counterpart in classical hierarchical agglomeration. At the same time, we gave strategies to resolve the inherent ambiguities in some of the models. Finally, we showed how these techniques can be easily extended to two additional Gaussian models based on a more sophisticated parameterization of the covariance matrix that has recently shown promise in practical applications.


A Derivation of the Update Formula


The following holds for the group $\langle i,j\rangle$ formed by merging groups $i$ and $j$:

$$W_{\langle i,j\rangle} = W_i + W_j + w_{ij} w_{ij}^T, \qquad \text{where} \qquad w_{ij} = \sqrt{\frac{n_i n_j}{n_i + n_j}} \left( \frac{s_i}{n_i} - \frac{s_j}{n_j} \right),$$

$W_k$ denotes the sample cross-product matrix, $s_k$ the sum of the observations, and $n_k$ the cardinality of group $k$.

Proof. Let $X_k$ be the matrix of observations corresponding to group $k$. If

$$\widetilde{X}_k = X_k - \frac{e_{n_k} e_{n_k}^T}{n_k} X_k = \left( I - \frac{e_{n_k} e_{n_k}^T}{n_k} \right) X_k,$$

where $e_{n_k}$ denotes the vector of length $n_k$ in which every element is equal to 1 ($\widetilde{X}_k$ is $X_k$ with the mean subtracted out of each column), then $W_k = \widetilde{X}_k^T \widetilde{X}_k$. Hence

$$W_k = X_k^T \left( I - \frac{e_{n_k} e_{n_k}^T}{n_k} \right)^T \left( I - \frac{e_{n_k} e_{n_k}^T}{n_k} \right) X_k = X_k^T \left( I - \frac{e_{n_k} e_{n_k}^T}{n_k} \right) X_k = X_k^T X_k - \frac{s_k s_k^T}{n_k},$$

since $s_k = X_k^T e_{n_k}$. It follows that

$$W_{\langle i,j\rangle} = \widetilde{X}_{\langle i,j\rangle}^T \widetilde{X}_{\langle i,j\rangle} = X_{\langle i,j\rangle}^T X_{\langle i,j\rangle} - \frac{s_{\langle i,j\rangle} s_{\langle i,j\rangle}^T}{n_{\langle i,j\rangle}}.$$

Since $X_{\langle i,j\rangle}$ is the matrix consisting of the observations in groups $i$ and $j$,

$$W_{\langle i,j\rangle} = \begin{pmatrix} X_i \\ X_j \end{pmatrix}^T \begin{pmatrix} X_i \\ X_j \end{pmatrix} - \frac{s_{\langle i,j\rangle} s_{\langle i,j\rangle}^T}{n_{\langle i,j\rangle}} = X_i^T X_i + X_j^T X_j - \frac{s_{\langle i,j\rangle} s_{\langle i,j\rangle}^T}{n_{\langle i,j\rangle}}.$$

Hence

$$W_{\langle i,j\rangle} - (W_i + W_j) = \frac{s_i s_i^T}{n_i} + \frac{s_j s_j^T}{n_j} - \frac{(s_i + s_j)(s_i + s_j)^T}{n_i + n_j}$$
$$= \frac{n_j}{n_i (n_i + n_j)}\, s_i s_i^T + \frac{n_i}{n_j (n_i + n_j)}\, s_j s_j^T - \frac{s_i s_j^T + s_j s_i^T}{n_i + n_j}.$$

If

$$w_{ij} = \sqrt{\frac{n_i n_j}{n_i + n_j}} \left( \frac{s_i}{n_i} - \frac{s_j}{n_j} \right),$$

then

$$w_{ij} w_{ij}^T = \frac{n_j}{n_i (n_i + n_j)}\, s_i s_i^T - \frac{s_i s_j^T + s_j s_i^T}{n_i + n_j} + \frac{n_i}{n_j (n_i + n_j)}\, s_j s_j^T,$$

so that

$$W_{\langle i,j\rangle} - (W_i + W_j) = w_{ij} w_{ij}^T. \qquad \Box$$
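As a quick sanity check of the identity (not part of the original report), the following snippet compares the rank-one update against the cross-product matrix of the pooled group computed from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
Xi, Xj = rng.normal(size=(7, 3)), rng.normal(size=(4, 3))

def W(X):                                   # sample cross-product matrix
    D = X - X.mean(axis=0)
    return D.T @ D

si, ni = Xi.sum(axis=0), len(Xi)
sj, nj = Xj.sum(axis=0), len(Xj)
w = np.sqrt(ni * nj / (ni + nj)) * (si / ni - sj / nj)

assert np.allclose(W(np.vstack([Xi, Xj])),  # W_<i,j> from scratch
                   W(Xi) + W(Xj) + np.outer(w, w))
```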

References
[1] J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803-821, 1993.
[2] G. Celeux and G. Govaert. Comparison of the mixture and the classification maximum likelihood in cluster analysis. Journal of Statistical Computation and Simulation, 47:127-146, 1993.
[3] G. Celeux and G. Govaert. Gaussian parsimonious clustering models. Pattern Recognition, 28:781-793, 1995.
[4] A. Dasgupta and A. E. Raftery. Detecting features in spatial point processes with clutter via model-based clustering. Technical Report 295, University of Washington, Department of Statistics, October 1995. See http://www.stat.washington.edu/tech.reports.
[5] H. P. Friedman and J. Rubin. On some invariant criteria for grouping data. Journal of the American Statistical Association, 62:1159-1178, 1967.
[6] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins, 2nd edition, 1989.
[7] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data. Wiley, 1990.
[8] F. Murtagh and A. E. Raftery. Fitting straight lines to point patterns. Pattern Recognition, 17:479-483, 1984.
[9] A. E. Raftery. Transitions from ONR Contract N00014-91-J-1074 `Time Series and Image Analysis'. Manuscript, Department of Statistics, University of Washington, December 1993.
[10] A. J. Scott and M. J. Symons. Clustering methods based on likelihood ratio criteria. Biometrics, 27:387-397, 1971.
[11] J. H. Ward. Hierarchical groupings to optimize an objective function. Journal of the American Statistical Association, 58:234-244, 1963.
