
Introduction to Information Retrieval

ISTE-612 Knowledge Processing Technologies

Week 12
Text Clustering


Today's Topic: Clustering

Document clustering
  Motivations
  Document representations
  Success criteria

Clustering algorithms
  Partitional
  Hierarchical


WHAT IS CLUSTERING?


Ch. 16

What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects
  Documents within a cluster should be similar.
  Documents from different clusters should be dissimilar.

The commonest form of unsupervised learning

Unsupervised learning = ?


Ch. 16

A data set with clear cluster structure

[Figure: scatter plot of points forming three visible clusters]

How would you design an algorithm for finding the three clusters in this case?

Sec. 16.1

Applications of clustering in IR

Whole corpus analysis/navigation
  Better user interface: search without typing

For improving recall in search applications
  Better search results (like pseudo RF)

For better navigation of search results
  Effective user recall will be higher

For speeding up vector space retrieval
  Cluster-based retrieval gives faster search


Applications of clustering in IR

Whole corpus analysis/navigation
  Explore data

For improving recall in search applications
  Returning similar documents

For better navigation of search results
  Explore search results

For speeding up vector space retrieval
  Fewer comparisons

Sec. 16.1


For improving search recall

Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.
Therefore, to improve search recall:
  Cluster docs in corpus a priori
  When a query matches a doc D, also return other docs in the cluster containing D
Hope: the query "car" will also return docs containing "automobile", because clustering grouped together docs containing "car" with those containing "automobile".

Why might this happen?
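As a toy illustration of this a-priori strategy, here is a minimal Python sketch. The doc ids, the `doc_cluster` mapping, and the function name are all hypothetical, invented for the example:

```python
# Hypothetical sketch: docs are clustered a priori, and when a query matches
# a doc, every doc in that doc's cluster is returned as well.

doc_cluster = {            # doc id -> cluster id, assumed computed in advance
    "d1": 0, "d2": 0,      # e.g., d1 mentions "car", d2 mentions "automobile"
    "d3": 1, "d4": 1,
}

def expand_with_clusters(matched_docs):
    """Return the matched docs plus every doc sharing a cluster with them."""
    hit_clusters = {doc_cluster[d] for d in matched_docs}
    return sorted(d for d, c in doc_cluster.items() if c in hit_clusters)

# A query for "car" matches only d1, but d2 ("automobile") shares its
# cluster, so it is returned too:
print(expand_with_clusters(["d1"]))  # ['d1', 'd2']
```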




CLUSTERING ISSUES



Issues for clustering

Representation for clustering
  Document representation
    Vector space model
  Need a notion of similarity/distance

How many clusters?
  Fixed a priori?
  Completely data driven?
  Avoid trivial clusters: too large or small
  What is the right size for a cluster?
    Likely data dependent

Sec. 16.2


Notion of similarity/distance
Ideal: semantic similarity.
Practical: term-statistical similarity (docs as vectors)
  Cosine similarity
For many algorithms, easier to think in terms of a distance (rather than similarity) between docs.
But real implementations use cosine similarity.
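A minimal sketch of cosine similarity over term-weight vectors, with a distance derived as 1 − cosine for algorithms phrased in terms of distance (function names are illustrative):

```python
import math

def cosine(x, y):
    """Cosine similarity between two term-weight vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def cosine_distance(x, y):
    """A distance in [0, 2] derived from cosine similarity."""
    return 1.0 - cosine(x, y)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (up to float rounding)
print(cosine([1.0, 0.0], [0.0, 3.0]))  # 0.0
```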


Clustering Algorithms
Flat algorithms
  Usually start with a random partitioning
  Refine it iteratively
  K-means clustering
Hierarchical algorithms
  Bottom-up, agglomerative

Sec. 16.4

K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity or mean) of points in a cluster c:

  μ(c) = (1/|c|) Σ_{x ∈ c} x

Reassignment of instances to clusters is based on distance to the current cluster centroids.
(Or one can equivalently phrase it in terms of similarities)

Sec. 16.4

K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or other stopping criterion):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  (Next, update the seeds to the centroid of each cluster)
  For each cluster cj:
    Compute the new centroid
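The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a production implementation: squared Euclidean distance stands in for the distance function, and a fixed iteration cap stands in for a convergence test; all names are invented for the example.

```python
import random

def kmeans(docs, k, iters=20, seed=0):
    """Minimal K-means sketch: pick K random docs as seeds, then alternate
    assignment to the nearest centroid and centroid recomputation."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(docs, k)]
    assign = [0] * len(docs)
    for _ in range(iters):
        # Assign each doc to its nearest centroid (squared Euclidean distance).
        for i, d in enumerate(docs):
            assign[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(d, centroids[j])),
            )
        # Recompute each centroid as the mean of its cluster's members.
        for j in range(k):
            members = [docs[i] for i in range(len(docs)) if assign[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

# Two obvious groups on a line:
labels, cents = kmeans([(0.0,), (0.1,), (10.0,), (10.1,)], k=2)
print(labels)  # e.g. [0, 0, 1, 1] or [1, 1, 0, 0], depending on the seeds
```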

Sec. 16.4

K-Means Example (K=2)

Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!


Termination conditions
Several possibilities, e.g.,
  A fixed number of iterations.
  Doc partition unchanged.
  Centroid positions don't change.
Does this mean that the docs in a cluster are unchanged?

Sec. 16.4


Seed Choice
Results can vary based on random seed selection.
Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.
  Select good seeds using a heuristic (e.g., doc least similar to any existing mean)
  Try out multiple starting points
  Initialize with the results of another method.

[Figure: six points A-F showing sensitivity to seeds]

In the figure, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
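The "doc least similar to any existing mean" heuristic can be sketched as a farthest-first traversal. This is one assumed reading of the heuristic, with Euclidean distance standing in for dissimilarity and the starting point fixed arbitrarily at the first doc:

```python
def farthest_first_seeds(docs, k):
    """Greedy seed selection: start from the first doc, then repeatedly add
    the doc farthest (in squared Euclidean distance) from all chosen seeds."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    seeds = [docs[0]]
    while len(seeds) < k:
        # The next seed maximizes its distance to the nearest existing seed.
        seeds.append(max(docs, key=lambda d: min(d2(d, s) for s in seeds)))
    return seeds

pts = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
print(farthest_first_seeds(pts, 3))  # [(0, 0), (10, 1), (5, 8)]
```

Spreading seeds apart this way tends to avoid the bad starts illustrated in the slide, at the cost of being sensitive to outliers.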


How Many Clusters?

Number of clusters K is given
  Partition n docs into predetermined number of clusters

Finding the right number of clusters is part of the problem
  Given docs, partition into an appropriate number of subsets.
  E.g., for query results: ideal value of K not known up front, though UI may impose limits.


HIERARCHICAL CLUSTERING


Ch. 17
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.

  animal
    vertebrate
      fish  reptile  amphib.  mammal
    invertebrate
      worm  insect  crustacean

One approach: recursive application of a partitional clustering algorithm.


Dendrogram: Hierarchical Clustering

[Figure: dendrogram]

Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Sec. 17.1

Hierarchical Agglomerative Clustering (HAC)

Starts with each doc in a separate cluster, then repeatedly joins the closest pair of clusters, until there is only one cluster.

The history of merging forms a binary tree or hierarchy.

Sec. 17.2

Closest pair of clusters

Many variants to defining closest pair of clusters:
  Single-link
    Similarity of the most cosine-similar (single-link)
  Complete-link
    Similarity of the furthest points, the least cosine-similar
  Centroid
    Clusters whose centroids (centers of gravity) are the most cosine-similar
  Average-link
    Average cosine between all pairs of elements
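The merge loop shared by these variants can be sketched with a pluggable linkage function: Python's built-in max gives single-link, min gives complete-link. A naive O(n³) illustration over a small similarity matrix, not an efficient implementation; all names are invented for the example:

```python
def hac(sims, linkage):
    """Naive HAC sketch: sims[i][j] is a similarity matrix over n docs;
    linkage is max (single-link) or min (complete-link). Repeatedly merge
    the most similar pair of clusters and record the merge history."""
    clusters = [[i] for i in range(len(sims))]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the highest linkage similarity.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(sims[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        history.append((clusters[a], clusters[b], s))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return history

# Three docs where 0 and 1 are very similar and 2 is an outlier:
sims = [[1.0, 0.9, 0.1],
        [0.9, 1.0, 0.2],
        [0.1, 0.2, 1.0]]
single = hac(sims, max)      # single-link
complete = hac(sims, min)    # complete-link
print(single[0][:2])  # ([0], [1])
```

Both variants first merge docs 0 and 1; they differ on the second merge's similarity (0.2 under single-link vs 0.1 under complete-link), which is exactly the max-vs-min distinction above.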

Sec. 17.2

Single Link Agglomerative Clustering

Use maximum similarity of pairs:

  sim(ci, cj) = max_{x ∈ ci, y ∈ cj} sim(x, y)

Can result in straggly (long and thin) clusters due to chaining effect.
After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:

  sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))

Sec. 17.2

Complete Link
Use minimum similarity of pairs:

  sim(ci, cj) = min_{x ∈ ci, y ∈ cj} sim(x, y)

Makes tighter, spherical clusters that are typically preferable.
After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:

  sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))

[Figure: clusters Ci, Cj, Ck]

Sec. 17.3

Group Average
Similarity of two clusters = average similarity of all pairs within merged cluster.

  sim(ci, cj) = (1 / (|ci ∪ cj| (|ci ∪ cj| − 1))) Σ_{x ∈ ci ∪ cj} Σ_{y ∈ ci ∪ cj, y ≠ x} sim(x, y)

Compromise between single and complete link.
Two options:
  Averaged across all pairs in the merged cluster
  Averaged over all pairs between the two original clusters


END FOR THIS WEEK


Sec. 16.3

What Is A Good Clustering?

Internal criterion: a good clustering will produce high-quality clusters in which:
  the intra-class (that is, intra-cluster) similarity is high
  the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used

Sec. 16.3

External criteria for clustering quality

Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data
Assesses a clustering with respect to ground truth: requires labeled data
Assume documents with C gold standard classes, while our clustering algorithms produce K clusters, ω1, ω2, …, ωK, with ni members.

Sec. 16.3

External Evaluation of Cluster Quality

Simple measure: purity. Count each cluster by its most frequent class:

  purity(Ω, C) = (1/N) Σ_k max_j |ωk ∩ cj|


Purity example

[Figure: three clusters of red, blue, and green points]

Cluster I: max(5, 1, 0) = 5 (red)
Cluster II: max(1, 4, 1) = 4 (blue)
Cluster III: max(2, 0, 3) = 3 (green)

Overall purity = (5 + 4 + 3) / 17 ≈ 0.71

Sec. 16.3
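The example can be checked with a short purity computation (a sketch; the label lists simply reproduce the per-cluster counts above, and the function name is invented for the example):

```python
from collections import Counter

def purity(clusters):
    """Purity: sum each cluster's most frequent class count, divide by N.
    `clusters` is a list of lists of class labels, one list per cluster."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

# The slide's example: per-cluster class counts (red, blue, green).
clusters = [
    ["red"] * 5 + ["blue"] * 1,                  # Cluster I
    ["red"] * 1 + ["blue"] * 4 + ["green"] * 1,  # Cluster II
    ["red"] * 2 + ["green"] * 3,                 # Cluster III
]
print(round(purity(clusters), 2))  # 0.71
```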
