Basic Concepts
In clustering (unsupervised learning), no training data with class labels are available. The goal becomes: group the data into a number of sensible clusters (groups). This unravels similarities and differences among the available data. Applications: engineering, bioinformatics, social sciences, medicine, data and web mining. To perform clustering of a data set, a clustering criterion must first be adopted. Different clustering criteria lead, in general, to different clusters.
A simple example
Depending on the similarity measure, the clustering criterion, and the clustering algorithm, different clusters may result. Subjectivity is a reality to live with from now on. A simple example: how many clusters? 2 or 4?
Let $X = \{x_1, x_2, \dots, x_N\}$. An m-clustering R of X is defined as the partition of X into m sets (clusters) $C_1, C_2, \dots, C_m$, so that

$C_i \neq \emptyset, \quad i = 1, 2, \dots, m$

$\bigcup_{i=1}^{m} C_i = X$

$C_i \cap C_j = \emptyset, \quad i \neq j, \quad i, j = 1, 2, \dots, m$

In addition, data in $C_i$ are more similar to each other and less similar to the data in the rest of the clusters. Quantifying the terms similar and dissimilar depends on the types of clusters that are expected to underlie the structure of X.
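The three partition conditions above are easy to check mechanically. Below is a minimal sketch (the function name `is_m_clustering` is illustrative, not from the text) that tests whether a candidate grouping of a finite data set is a valid m-clustering: non-empty clusters that are pairwise disjoint and cover X.

```python
def is_m_clustering(X, clusters):
    """Return True if `clusters` is a partition of the set X."""
    if any(len(C) == 0 for C in clusters):      # each C_i must be non-empty
        return False
    if set().union(*clusters) != set(X):        # the union of the C_i must equal X
        return False
    # disjointness: if no element is shared, cluster sizes sum to |X|
    return sum(len(C) for C in clusters) == len(set(X))

X = [1, 2, 3, 4, 5]
print(is_m_clustering(X, [{1, 2}, {3, 4, 5}]))    # True: a valid 2-clustering
print(is_m_clustering(X, [{1, 2}, {2, 3, 4, 5}])) # False: clusters overlap
```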
Fuzzy clustering: each point belongs to all clusters up to some degree. A fuzzy clustering of X into m clusters is characterized by m functions

$u_j : X \rightarrow [0,1], \quad j = 1, 2, \dots, m$

with

$\sum_{j=1}^{m} u_j(x_i) = 1, \quad i = 1, 2, \dots, N$

$0 < \sum_{i=1}^{N} u_j(x_i) < N, \quad j = 1, 2, \dots, m$

These are known as membership functions. Thus, each $x_i$ belongs to any cluster up to some degree, depending on the values of $u_j(x_i)$, $j = 1, 2, \dots, m$: a value of $u_j(x_i)$ close to 1 indicates a high grade of membership of $x_i$ in cluster j, and a value close to 0 a low grade.
TYPES OF FEATURES
With respect to their domain
Continuous (the domain is a continuous subset of $\mathbb{R}$). Discrete (the domain is a finite discrete set). Binary or dichotomous (the domain consists of two possible values).

With respect to the relative significance of their values: Interval-scaled (the difference of two values is meaningful but their ratio is not, e.g., temperature). Ratio-scaled (the ratio of two values is meaningful, e.g., weight).
PROXIMITY MEASURES
Between vectors. A dissimilarity measure (DM) between vectors of X is a function

$d : X \times X \rightarrow \mathbb{R}$

with the following properties:

$\exists d_0 \in \mathbb{R} : -\infty < d_0 \le d(x, y) < +\infty, \quad \forall x, y \in X$

$d(x, x) = d_0, \quad \forall x \in X$

$d(x, y) = d(y, x), \quad \forall x, y \in X$

If in addition

$d(x, y) = d_0$ if and only if $x = y$

$d(x, z) \le d(x, y) + d(y, z), \quad \forall x, y, z \in X$ (triangular inequality)

then d is called a metric dissimilarity measure.
A similarity measure (SM) between vectors of X is a function

$s : X \times X \rightarrow \mathbb{R}$

with the following properties:

$\exists s_0 \in \mathbb{R} : -\infty < s(x, y) \le s_0 < +\infty, \quad \forall x, y \in X$

$s(x, x) = s_0, \quad \forall x \in X$

$s(x, y) = s(y, x), \quad \forall x, y \in X$

If in addition

$s(x, y) = s_0$ if and only if $x = y$

$s(x, y)\, s(y, z) \le [s(x, y) + s(y, z)]\, s(x, z), \quad \forall x, y, z \in X$

then s is called a metric similarity measure.
Between sets. Let U be a set of subsets $D_i$ of X. A proximity measure $\wp$ on U is a function

$\wp : U \times U \rightarrow \mathbb{R}$

A dissimilarity measure on U has to satisfy the properties stated for dissimilarity measures between vectors, with the $D_i$'s used in place of x, y (similarly for similarity measures).
The weighted $l_p$ metrics:

$d_p(x, y) = \left( \sum_{i=1}^{l} w_i |x_i - y_i|^p \right)^{1/p}$

p = 1 (weighted Manhattan norm), p = 2 (weighted Euclidean norm), p = $\infty$ ($d_\infty(x, y) = \max_{1 \le i \le l} w_i |x_i - y_i|$).
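The three special cases can be sketched directly from the definition. A minimal implementation (function names are illustrative):

```python
def d_p(x, y, w, p):
    """Weighted l_p distance: (sum_i w_i |x_i - y_i|^p)^(1/p)."""
    return sum(wi * abs(xi - yi) ** p for wi, xi, yi in zip(w, x, y)) ** (1.0 / p)

def d_inf(x, y, w):
    """Weighted l_infinity distance: max_i w_i |x_i - y_i|."""
    return max(wi * abs(xi - yi) for wi, xi, yi in zip(w, x, y))

x, y, w = [0.0, 3.0], [4.0, 0.0], [1.0, 1.0]
print(d_p(x, y, w, 1))  # 7.0  (Manhattan: 4 + 3)
print(d_p(x, y, w, 2))  # 5.0  (Euclidean: the 3-4-5 triangle)
print(d_inf(x, y, w))   # 4.0  (largest per-coordinate difference)
```

With all weights equal to 1, these reduce to the unweighted Minkowski metrics.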
Other measures:

$d_G(x, y) = -\log_{10}\left( 1 - \frac{1}{l} \sum_{j=1}^{l} \frac{|x_j - y_j|}{b_j - a_j} \right)$

where $b_j$ and $a_j$ are the maximum and the minimum values of the j-th feature, among the vectors of X (dependence on the current data set).

$d_Q(x, y) = \sqrt{ \frac{1}{l} \sum_{j=1}^{l} \left( \frac{x_j - y_j}{x_j + y_j} \right)^2 }$
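Both formulas translate directly into code. A hedged sketch (names `d_G`, `d_Q` and the sample data are illustrative; `a` and `b` stand for the per-feature minima and maxima over a hypothetical data set X):

```python
import math

def d_G(x, y, a, b):
    """a[j], b[j]: min and max of the j-th feature over the data set X."""
    l = len(x)
    s = sum(abs(xj - yj) / (bj - aj) for xj, yj, aj, bj in zip(x, y, a, b))
    return -math.log10(1.0 - s / l)

def d_Q(x, y):
    l = len(x)
    return math.sqrt(sum(((xj - yj) / (xj + yj)) ** 2 for xj, yj in zip(x, y)) / l)

x, y = [1.0, 2.0], [3.0, 2.0]
a, b = [0.0, 0.0], [4.0, 4.0]   # assumed per-feature min/max over X
print(round(d_G(x, y, a, b), 4))  # 0.1249  (= -log10(0.75))
print(round(d_Q(x, y), 4))        # 0.3536  (= sqrt(0.125))
```

Note the data-set dependence of $d_G$: changing X changes $a_j$, $b_j$ and hence the distance between the same two vectors.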
Similarity measures

The inner product:

$s_{inner}(x, y) = x^T y = \sum_{i=1}^{l} x_i y_i$

The Tanimoto measure:

$s_T(x, y) = \frac{x^T y}{\|x\|^2 + \|y\|^2 - x^T y}$

Also:

$s(x, y) = 1 - \frac{d_2(x, y)}{\|x\| + \|y\|}$
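The inner product and the Tanimoto measure can be sketched as follows (function names are illustrative):

```python
def s_inner(x, y):
    """Inner product: sum_i x_i * y_i."""
    return sum(xi * yi for xi, yi in zip(x, y))

def s_T(x, y):
    """Tanimoto similarity: x.y / (||x||^2 + ||y||^2 - x.y)."""
    xy = s_inner(x, y)
    return xy / (s_inner(x, x) + s_inner(y, y) - xy)

print(s_T([1.0, 0.0], [1.0, 0.0]))  # 1.0: a vector is maximally similar to itself
print(s_T([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal vectors
```

For a vector compared with itself, $x^T x / (2\|x\|^2 - x^T x) = 1$, so $s_T$ attains its maximum $s_0 = 1$ exactly at $y = x$.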
Discrete-valued vectors

Let $F = \{0, 1, \dots, k-1\}$ be a set of symbols and $X = \{x_1, \dots, x_N\} \subset F^l$. Let $A(x, y) = [a_{ij}]$, $i, j = 0, 1, \dots, k-1$, where $a_{ij}$ is the number of places where x has the i-th symbol and y has the j-th symbol.

NOTE: $\sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_{ij} = l$

Dissimilarity measures:

The Hamming distance (number of places where x and y differ):

$d_H(x, y) = \sum_{i=0}^{k-1} \sum_{j=0, j \neq i}^{k-1} a_{ij}$

The $l_1$ distance:

$d_1(x, y) = \sum_{i=1}^{l} |x_i - y_i|$
Similarity measures:

The Tanimoto measure:

$s_T(x, y) = \frac{\sum_{i=1}^{k-1} a_{ii}}{n_x + n_y - \sum_{i=1}^{k-1} \sum_{j=1}^{k-1} a_{ij}}$

where

$n_x = \sum_{i=1}^{k-1} \sum_{j=0}^{k-1} a_{ij}, \quad n_y = \sum_{i=0}^{k-1} \sum_{j=1}^{k-1} a_{ij}$

Further similarity measures: $\sum_{i=1}^{k-1} a_{ii} / l$ and $\sum_{i=0}^{k-1} a_{ii} / l$
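The contingency matrix $A(x, y)$ drives all of these measures, so it is worth building once and reusing. A minimal sketch under the definitions above (function names and the sample vectors are illustrative):

```python
def contingency(x, y, k):
    """A[i][j] = number of places where x has symbol i and y has symbol j."""
    A = [[0] * k for _ in range(k)]
    for xi, yi in zip(x, y):
        A[xi][yi] += 1
    return A

def hamming(x, y, k):
    """Sum of the off-diagonal entries of A: places where x and y differ."""
    A = contingency(x, y, k)
    return sum(A[i][j] for i in range(k) for j in range(k) if i != j)

def tanimoto_discrete(x, y, k):
    A = contingency(x, y, k)
    n_x = sum(A[i][j] for i in range(1, k) for j in range(k))
    n_y = sum(A[i][j] for i in range(k) for j in range(1, k))
    agree = sum(A[i][i] for i in range(1, k))
    both = sum(A[i][j] for i in range(1, k) for j in range(1, k))
    return agree / (n_x + n_y - both)

x, y, k = [0, 1, 2, 1], [0, 1, 1, 2], 3
print(hamming(x, y, k))            # 2: x and y differ in the last two places
print(tanimoto_discrete(x, y, k))  # 1/3
```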
Mixed-valued vectors
Some of the coordinates of the vectors x are real and the rest are discrete.
Methods for measuring the proximity between two such vectors $x_i$ and $x_j$: adopt a proximity measure (PM) suitable for real-valued vectors, or convert the real-valued features to discrete ones and employ a discrete PM. In the more general case of mixed-valued vectors, nominal, ordinal, interval-scaled, and ratio-scaled features are treated separately.
$s(x_i, x_j) = \frac{\sum_{q=1}^{l} s_q(x_i, x_j)}{\sum_{q=1}^{l} w_q}$

In the above definition: $w_q = 0$ if at least one of the q-th coordinates of $x_i$ and $x_j$ is undefined, or if the q-th coordinates are binary and both equal to 0; otherwise $w_q = 1$. If the q-th coordinates are binary, $s_q(x_i, x_j) = 1$ if $x_{iq} = x_{jq} = 1$ and 0 otherwise. If the q-th coordinates are nominal or ordinal, $s_q(x_i, x_j) = 1$ if $x_{iq} = x_{jq}$ and 0 otherwise. If the q-th coordinates are interval- or ratio-scaled,

$s_q(x_i, x_j) = 1 - |x_{iq} - x_{jq}| / r_q$

where $r_q$ is the length of the interval where the q-th coordinates of the vectors of the data set X lie.
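A sketch of this mixed-feature (Gower-style) similarity, assuming missing values are encoded as `None` and the caller supplies the kind and range of each feature (all names here are illustrative, not from the text):

```python
def s_mixed(x, y, kinds, ranges):
    """kinds[q] in {'binary', 'nominal', 'interval'}; ranges[q] = r_q for interval features."""
    num = den = 0.0
    for q, (xq, yq) in enumerate(zip(x, y)):
        # w_q = 0: a coordinate is undefined, or both binary coordinates are 0
        if xq is None or yq is None or (kinds[q] == 'binary' and xq == yq == 0):
            continue
        den += 1.0  # w_q = 1
        if kinds[q] == 'interval':
            num += 1.0 - abs(xq - yq) / ranges[q]   # s_q = 1 - |x_iq - x_jq| / r_q
        else:
            num += 1.0 if xq == yq else 0.0         # binary / nominal agreement
    return num / den

x = [1, 'red', 2.0]
y = [1, 'blue', 4.0]
kinds = ['binary', 'nominal', 'interval']
ranges = [None, None, 4.0]  # r_q for the one interval feature (assumed spread over X)
print(s_mixed(x, y, kinds, ranges))  # (1 + 0 + 0.5) / 3 = 0.5
```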
Fuzzy measures
Let $x, y \in [0,1]^l$. Here the value of the i-th coordinate, $x_i$, of x is not the outcome of a measuring device. The closer the coordinate $x_i$ is to 1 (0), the more likely the vector x possesses (does not possess) the i-th characteristic. As $x_i$ approaches 0.5, the certainty about the possession or not of the i-th feature by x decreases. A possible similarity measure that can quantify the above is:

$s(x_i, y_i) = \max(\min(1 - x_i, 1 - y_i), \min(x_i, y_i))$

Then:

$s_F^q(x, y) = \left( \sum_{i=1}^{l} s(x_i, y_i)^q \right)^{1/q}$
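The max-min construction rewards agreement in either direction: two coordinates are similar both when they jointly possess a feature and when they jointly lack it. A minimal sketch (function names are illustrative):

```python
def s_coord(xi, yi):
    """Per-coordinate fuzzy similarity: max(min(1-xi, 1-yi), min(xi, yi))."""
    return max(min(1 - xi, 1 - yi), min(xi, yi))

def s_F(x, y, q):
    """Aggregate fuzzy similarity: (sum_i s(x_i, y_i)^q)^(1/q)."""
    return sum(s_coord(xi, yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)

print(s_coord(1.0, 1.0))  # 1.0: both certainly possess the feature
print(s_coord(0.0, 0.0))  # 1.0: both certainly lack it, still maximally similar
print(s_coord(0.5, 0.5))  # 0.5: maximal uncertainty
print(s_F([1.0, 0.0], [1.0, 0.0], 1))  # 2.0 for l = 2, q = 1
```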
Missing data
For some vectors of the data set X, some feature values are unknown. One approach: let $\phi(x_i, y_i)$ denote the PM between two scalars $x_i$, $y_i$. Find the average proximities $\phi_{avg}(i)$ between all feature vectors in X along each component i. Then

$\wp(x, y) = \sum_{i=1}^{l} \psi(x_i, y_i)$

where $\psi(x_i, y_i) = \phi_{avg}(i)$ if $x_i$ or $y_i$ is unavailable, and $\psi(x_i, y_i) = \phi(x_i, y_i)$ otherwise.
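This substitution scheme can be sketched as follows, assuming missing values are encoded as `None` (the function name and the sample averages are illustrative, not from the text):

```python
def proximity_with_missing(x, y, phi, phi_avg):
    """phi: scalar PM; phi_avg[i]: average proximity along coordinate i over X."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        if xi is None or yi is None:
            total += phi_avg[i]      # substitute the coordinate's average proximity
        else:
            total += phi(xi, yi)     # both values available: use the scalar PM
    return total

phi = lambda a, b: abs(a - b)        # an l1-style scalar dissimilarity
phi_avg = [0.5, 0.5, 0.5]            # hypothetical per-coordinate averages over X
print(proximity_with_missing([1.0, None, 2.0], [0.0, 3.0, 2.5], phi, phi_avg))
# 1.0 + 0.5 + 0.5 = 2.0
```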
Proximity functions between a point x and a set C:

The max proximity function: $\wp_{max}^{ps}(x, C) = \max_{y \in C} \wp(x, y)$

The min proximity function: $\wp_{min}^{ps}(x, C) = \min_{y \in C} \wp(x, y)$

The average proximity function: $\wp_{avg}^{ps}(x, C) = \frac{1}{n_C} \sum_{y \in C} \wp(x, y)$

($n_C$ is the cardinality of C.)

Alternatively, a representative of C, $r_C$, contributes to the definition of $\wp(x, C)$; in this case $\wp(x, C) = \wp(x, r_C)$. Typical representatives are:

The mean vector: $m_p = \frac{1}{n_C} \sum_{y \in C} y$

The mean center $m_C \in C$: $\sum_{y \in C} d(m_C, y) \le \sum_{y \in C} d(z, y), \ \forall z \in C$, where d is a dissimilarity measure.
NOTE: Other representatives (e.g., hyperplanes, hyperspheres) are useful in certain applications (e.g., object identification using clustering techniques).
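The point-to-set functions and the mean-vector representative translate directly into code. A minimal sketch (names are illustrative; `prox` is any vector proximity measure):

```python
def prox_max(x, C, prox):
    return max(prox(x, y) for y in C)

def prox_min(x, C, prox):
    return min(prox(x, y) for y in C)

def prox_avg(x, C, prox):
    return sum(prox(x, y) for y in C) / len(C)

def mean_vector(C):
    """Coordinate-wise mean of the vectors in C."""
    l = len(C[0])
    return [sum(y[i] for y in C) / len(C) for i in range(l)]

d = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))  # l1 dissimilarity
C = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
x = [0.0, 0.0]
print(prox_min(x, C, d), prox_max(x, C, d), prox_avg(x, C, d))  # 0.0 4.0 2.0
print(mean_vector(C))  # [2.0, 0.0]
```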
Proximity functions between two sets $D_i, D_j \subset X$ (with $n_i$, $n_j$ points and mean vectors $m_i$, $m_j$, respectively):

The max proximity function: $\wp_{max}^{ss}(D_i, D_j) = \max_{x \in D_i, y \in D_j} \wp(x, y)$

The min proximity function: $\wp_{min}^{ss}(D_i, D_j) = \min_{x \in D_i, y \in D_j} \wp(x, y)$

The average proximity function: $\wp_{avg}^{ss}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{y \in D_j} \wp(x, y)$

The mean proximity function: $\wp_{mean}^{ss}(D_i, D_j) = \wp(m_i, m_j)$

Another proximity function: $\wp_e^{ss}(D_i, D_j) = \sqrt{\frac{n_i n_j}{n_i + n_j}}\, \wp(m_i, m_j)$

NOTE: Proximity functions between a vector x and a set C may be derived from the above functions if we set $D_i = \{x\}$.
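The set-to-set functions follow the same pattern. A minimal sketch using the Euclidean distance as the underlying vector PM (names are illustrative):

```python
import math

def ss_max(Di, Dj, prox):
    return max(prox(x, y) for x in Di for y in Dj)

def ss_min(Di, Dj, prox):
    return min(prox(x, y) for x in Di for y in Dj)

def ss_avg(Di, Dj, prox):
    return sum(prox(x, y) for x in Di for y in Dj) / (len(Di) * len(Dj))

def ss_mean(Di, Dj, prox):
    """Proximity between the mean vectors m_i and m_j of the two sets."""
    mean = lambda D: [sum(v[i] for v in D) / len(D) for i in range(len(D[0]))]
    return prox(mean(Di), mean(Dj))

d2 = lambda x, y: math.dist(x, y)  # Euclidean distance
Di = [[0.0, 0.0], [1.0, 0.0]]
Dj = [[3.0, 0.0], [4.0, 0.0]]
print(ss_min(Di, Dj, d2), ss_max(Di, Dj, d2))  # 2.0 4.0
print(ss_avg(Di, Dj, d2))                      # (3 + 4 + 2 + 3) / 4 = 3.0
print(ss_mean(Di, Dj, d2))                     # d(m_i, m_j) = |0.5 - 3.5| = 3.0
```

Single-link and complete-link hierarchical algorithms are exactly the `ss_min` and `ss_max` choices applied with a dissimilarity measure.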
Remarks:
Different choices of the proximity function between sets may lead to totally different clustering results. Even with the same proximity function between sets, different proximity measures between vectors may lead to totally different clustering results. The only way to achieve sensible clustering results is trial and error, taking into account the opinion of an expert in the field of application.