
Introduction to Information Retrieval

ISTE-612 Knowledge Processing Technologies

Week 12
Text Clustering


Today's Topic: Clustering

Document clustering
  Motivations
  Document representations
  Success criteria

Clustering algorithms
  Partitional
  Hierarchical


WHAT IS CLUSTERING?


Ch. 16

What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects
  Documents within a cluster should be similar.
  Documents from different clusters should be dissimilar.

The commonest form of unsupervised learning

Unsupervised learning = ?


Ch. 16

A data set with clear cluster structure

[Figure: scatter plot of points forming three visible clusters]

How would you design an algorithm for finding the three clusters in this case?

Sec. 16.1

Applications of clustering in IR

Whole corpus analysis/navigation
  Better user interface: search without typing

For improving recall in search applications
  Better search results (like pseudo RF)

For better navigation of search results
  Effective user recall will be higher

For speeding up vector space retrieval
  Cluster-based retrieval gives faster search


Applications of clustering in IR

Whole corpus analysis/navigation
  Explore data

For improving recall in search applications
  Returning similar documents

For better navigation of search results
  Explore search results

For speeding up vector space retrieval
  Fewer comparisons

Sec. 16.1


For improving search recall

Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.
Therefore, to improve search recall:
  Cluster docs in corpus a priori
  When a query matches a doc D, also return other docs in the cluster containing D
Hope: the query "car" will also return docs containing "automobile", because clustering grouped together docs containing "car" with those containing "automobile".

Why might this happen?
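As a toy illustration of this a-priori strategy, here is a minimal Python sketch. The doc ids, the `doc_cluster` mapping, and the function name are all hypothetical, invented for the example:

```python
# Hypothetical sketch: docs are clustered a priori, and when a query matches
# a doc, every doc in that doc's cluster is returned as well.

doc_cluster = {            # doc id -> cluster id, assumed computed in advance
    "d1": 0, "d2": 0,      # e.g., d1 mentions "car", d2 mentions "automobile"
    "d3": 1, "d4": 1,
}

def expand_with_clusters(matched_docs):
    """Return the matched docs plus every doc sharing a cluster with them."""
    hit_clusters = {doc_cluster[d] for d in matched_docs}
    return sorted(d for d, c in doc_cluster.items() if c in hit_clusters)

# A query for "car" matches only d1, but d2 ("automobile") shares its
# cluster, so it is returned too:
print(expand_with_clusters(["d1"]))  # ['d1', 'd2']
```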




CLUSTERING ISSUES



Issues for clustering

Representation for clustering
  Document representation
    Vector space model
  Need a notion of similarity/distance

How many clusters?
  Fixed a priori?
  Completely data driven?
  Avoid trivial clusters: too large or small
  What is the right size for a cluster?
    Likely data dependent

Sec. 16.2


Notion of similarity/distance
Ideal: semantic similarity.
Practical: term-statistical similarity (docs as vectors)
  Cosine similarity
For many algorithms, easier to think in terms of a distance (rather than similarity) between docs.
But real implementations use cosine similarity.
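A minimal sketch of cosine similarity over term-weight vectors, with a distance derived as 1 − cosine for algorithms phrased in terms of distance (function names are illustrative):

```python
import math

def cosine(x, y):
    """Cosine similarity between two term-weight vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def cosine_distance(x, y):
    """A distance in [0, 2] derived from cosine similarity."""
    return 1.0 - cosine(x, y)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (up to float rounding)
print(cosine([1.0, 0.0], [0.0, 3.0]))  # 0.0
```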


Clustering Algorithms
Flat algorithms
  Usually start with a random partitioning
  Refine it iteratively
  K-means clustering
Hierarchical algorithms
  Bottom-up, agglomerative

Sec. 16.4

K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity or mean) of points in a cluster c:

  μ(c) = (1/|c|) Σ_{x ∈ c} x

Reassignment of instances to clusters is based on distance to the current cluster centroids.
(Or one can equivalently phrase it in terms of similarities)

Sec. 16.4

K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or other stopping criterion):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  (Next, update the seeds to the centroid of each cluster)
  For each cluster cj:
    Compute the new centroid
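The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a production implementation: squared Euclidean distance stands in for the distance function, and a fixed iteration cap stands in for a convergence test; all names are invented for the example.

```python
import random

def kmeans(docs, k, iters=20, seed=0):
    """Minimal K-means sketch: pick K random docs as seeds, then alternate
    assignment to the nearest centroid and centroid recomputation."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(docs, k)]
    assign = [0] * len(docs)
    for _ in range(iters):
        # Assign each doc to its nearest centroid (squared Euclidean distance).
        for i, d in enumerate(docs):
            assign[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(d, centroids[j])),
            )
        # Recompute each centroid as the mean of its cluster's members.
        for j in range(k):
            members = [docs[i] for i in range(len(docs)) if assign[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

# Two obvious groups on a line:
labels, cents = kmeans([(0.0,), (0.1,), (10.0,), (10.1,)], k=2)
print(labels)  # e.g. [0, 0, 1, 1] or [1, 1, 0, 0], depending on the seeds
```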

Sec. 16.4

K-Means Example (K=2)

Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!


Termination conditions
Several possibilities, e.g.,
  A fixed number of iterations.
  Doc partition unchanged.
  Centroid positions don't change.
Does this mean that the docs in a cluster are unchanged?

Sec. 16.4


Seed Choice
Results can vary based on random seed selection.
Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.
  Select good seeds using a heuristic (e.g., doc least similar to any existing mean)
  Try out multiple starting points
  Initialize with the results of another method.

[Figure: six points A-F showing sensitivity to seeds]

In the figure, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
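The "doc least similar to any existing mean" heuristic can be sketched as a farthest-first traversal. This is one assumed reading of the heuristic, with Euclidean distance standing in for dissimilarity and the starting point fixed arbitrarily at the first doc:

```python
def farthest_first_seeds(docs, k):
    """Greedy seed selection: start from the first doc, then repeatedly add
    the doc farthest (in squared Euclidean distance) from all chosen seeds."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    seeds = [docs[0]]
    while len(seeds) < k:
        # The next seed maximizes its distance to the nearest existing seed.
        seeds.append(max(docs, key=lambda d: min(d2(d, s) for s in seeds)))
    return seeds

pts = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
print(farthest_first_seeds(pts, 3))  # [(0, 0), (10, 1), (5, 8)]
```

Spreading seeds apart this way tends to avoid the bad starts illustrated in the slide, at the cost of being sensitive to outliers.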


How Many Clusters?

Number of clusters K is given
  Partition n docs into predetermined number of clusters

Finding the right number of clusters is part of the problem
  Given docs, partition into an appropriate number of subsets.
  E.g., for query results: ideal value of K not known up front, though UI may impose limits.


HIERARCHICAL CLUSTERING


Ch. 17
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.

  animal
    vertebrate
      fish  reptile  amphib.  mammal
    invertebrate
      worm  insect  crustacean

One approach: recursive application of a partitional clustering algorithm.


Dendrogram: Hierarchical Clustering

[Figure: dendrogram]

Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Sec. 17.1

Hierarchical Agglomerative Clustering (HAC)

Starts with each doc in a separate cluster, then repeatedly joins the closest pair of clusters, until there is only one cluster.

The history of merging forms a binary tree or hierarchy.

Sec. 17.2

Closest pair of clusters

Many variants to defining closest pair of clusters:
  Single-link
    Similarity of the most cosine-similar (single-link)
  Complete-link
    Similarity of the furthest points, the least cosine-similar
  Centroid
    Clusters whose centroids (centers of gravity) are the most cosine-similar
  Average-link
    Average cosine between all pairs of elements
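The merge loop shared by these variants can be sketched with a pluggable linkage function: Python's built-in max gives single-link, min gives complete-link. A naive O(n³) illustration over a small similarity matrix, not an efficient implementation; all names are invented for the example:

```python
def hac(sims, linkage):
    """Naive HAC sketch: sims[i][j] is a similarity matrix over n docs;
    linkage is max (single-link) or min (complete-link). Repeatedly merge
    the most similar pair of clusters and record the merge history."""
    clusters = [[i] for i in range(len(sims))]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the highest linkage similarity.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(sims[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        history.append((clusters[a], clusters[b], s))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return history

# Three docs where 0 and 1 are very similar and 2 is an outlier:
sims = [[1.0, 0.9, 0.1],
        [0.9, 1.0, 0.2],
        [0.1, 0.2, 1.0]]
single = hac(sims, max)      # single-link
complete = hac(sims, min)    # complete-link
print(single[0][:2])  # ([0], [1])
```

Both variants first merge docs 0 and 1; they differ on the second merge's similarity (0.2 under single-link vs 0.1 under complete-link), which is exactly the max-vs-min distinction above.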

Sec. 17.2

Single Link Agglomerative Clustering

Use maximum similarity of pairs:

  sim(ci, cj) = max_{x ∈ ci, y ∈ cj} sim(x, y)

Can result in straggly (long and thin) clusters due to chaining effect.
After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:

  sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))

Sec. 17.2

Complete Link
Use minimum similarity of pairs:

  sim(ci, cj) = min_{x ∈ ci, y ∈ cj} sim(x, y)

Makes tighter, spherical clusters that are typically preferable.
After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:

  sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))

[Figure: clusters Ci, Cj, Ck]

Sec. 17.3

Group Average
Similarity of two clusters = average similarity of all pairs within merged cluster.

  sim(ci, cj) = (1 / (|ci ∪ cj| (|ci ∪ cj| − 1))) Σ_{x ∈ ci ∪ cj} Σ_{y ∈ ci ∪ cj, y ≠ x} sim(x, y)

Compromise between single and complete link.
Two options:
  Averaged across all pairs in the merged cluster
  Averaged over all pairs between the two original clusters


END FOR THIS WEEK


Sec. 16.3

What Is A Good Clustering?

Internal criterion: a good clustering will produce high-quality clusters in which:
  the intra-class (that is, intra-cluster) similarity is high
  the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used

Sec. 16.3

External criteria for clustering quality

Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data
Assesses a clustering with respect to ground truth: requires labeled data
Assume documents with C gold standard classes, while our clustering algorithms produce K clusters, ω1, ω2, …, ωK, with ni members.

Sec. 16.3

External Evaluation of Cluster Quality

Simple measure: purity. Count each cluster by its most frequent class:

  purity(Ω, C) = (1/N) Σ_k max_j |ωk ∩ cj|


Purity example

[Figure: three clusters of red, blue, and green points]

Cluster I: max(5, 1, 0) = 5 (red)
Cluster II: max(1, 4, 1) = 4 (blue)
Cluster III: max(2, 0, 3) = 3 (green)

Overall purity = (5 + 4 + 3) / 17 ≈ 0.71

Sec. 16.3
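The example can be checked with a short purity computation (a sketch; the label lists simply reproduce the per-cluster counts above, and the function name is invented for the example):

```python
from collections import Counter

def purity(clusters):
    """Purity: sum each cluster's most frequent class count, divide by N.
    `clusters` is a list of lists of class labels, one list per cluster."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

# The slide's example: per-cluster class counts (red, blue, green).
clusters = [
    ["red"] * 5 + ["blue"] * 1,                  # Cluster I
    ["red"] * 1 + ["blue"] * 4 + ["green"] * 1,  # Cluster II
    ["red"] * 2 + ["green"] * 3,                 # Cluster III
]
print(round(purity(clusters), 2))  # 0.71
```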
