IMPORTANCE OF CLUSTERING:

Clustering is the task of identifying groups of similar data in a data set.
Entities within a group are more similar to one another than to entities in
other groups.
It is a main task of exploratory data mining, and a common technique for
statistical data analysis, used in many fields, including machine learning,
pattern recognition, image analysis, information retrieval, and bioinformatics.

WHY CLUSTERING?
• Organizing data into clusters shows internal structure of the data
– Ex. Clusty and clustering genes above
• Sometimes the partitioning is the goal
– Ex. Market segmentation
• Prepare for other AI techniques
– Ex. Summarize news (cluster and then find centroid)
• Techniques for clustering are useful in knowledge discovery in data
– Ex. Underlying rules, reoccurring patterns, topics, etc.

VARIOUS METHODS OF CLUSTERING:


k-means:

K-means clustering is a type of unsupervised learning, which is used
when you have unlabeled data (i.e., data without defined categories or
groups). The goal of this algorithm is to find groups in the data, with the
number of groups represented by the variable K. The algorithm works
iteratively to assign each data point to one of K groups based on the
features that are provided. Data points are clustered based on feature
similarity.
Limitations of k-means:
• It is difficult to predict the number of clusters (the value of K)
• Initial seeds have a strong impact on the final results
• The order of the data has an impact on the final results
• Sensitive to scale: rescaling your datasets (normalization or standardization)
can substantially change the results. This is not a problem in itself, but
overlooking the need to scale your data carefully can be.
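
The iterative assignment described above can be sketched in a few lines of
Python. This is a minimal illustration, not a reference implementation: the
function names, the toy points and the choice of initial seeds are all
assumptions made for the example, and squared Euclidean distance is assumed as
the similarity measure.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means: alternate assignment and update steps until stable.

    `initial_centroids` are the seeds; K is simply their number.
    """
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
```

Passing the seeds explicitly makes the dependence on initialization visible:
calling the function with different `initial_centroids` is a quick way to see
whether the final clusters change.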

FUZZY C MEANS:
Fuzzy c-means works by assigning each data point a membership in every
cluster, based on the distance between the data point and each cluster center.
This kind of fuzzy assignment can reduce data loss when clustering huge amounts
of data. With non-fuzzy (hard) clustering methods such as k-means, each data
point is stored in only a single cluster. These methods can give good results,
but information about how a point relates to the other clusters is lost when
the data are retrieved.
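
As a rough sketch of how such memberships are computed, the snippet below uses
the standard fuzzy c-means membership formula, in which the membership of a
point in cluster i is 1 / sum over k of (d_i / d_k)^(2 / (m - 1)). The
fuzzifier `m = 2`, the function name and the toy centers are assumptions for
illustration; the cluster centers are taken as given rather than learned.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Membership of `point` in each cluster, given fixed cluster centers.

    `m` is the fuzzifier (m > 1); memberships always sum to 1.
    """
    dists = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(point, c))) for c in centers
    ]
    # If the point sits exactly on one or more centers, share full
    # membership among those centers and avoid dividing by zero.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists
    ]


# Hypothetical example: a point at distance 1 from one center and 3 from another.
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fcm_memberships((1.0, 0.0), centers)
print(u)  # memberships are roughly 0.9 and 0.1
```

A hard clustering method would record only that the point belongs to the first
cluster; the fuzzy memberships also keep the weaker relation to the second one.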

Partitional Clustering:
A partitional clustering is simply a division of the set of data objects into
non-overlapping subsets (clusters) such that each data object is in exactly one
subset. Partitional clustering assigns n data points to k clusters using an
iterative process. A predefined criterion function J determines which of the
k sets each datum is assigned to: the assignment is chosen to minimize or
maximize J over the k sets.
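
The text does not say which criterion function J is meant. One common choice,
shown here only as an assumed example, is the sum of squared errors that
k-means minimizes, where C_j is the j-th cluster and \mu_j its mean (both
symbols are introduced here for illustration):

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Under this choice, each datum is placed in the cluster whose mean is nearest,
because that assignment gives the smallest contribution to J.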

DIFFERENCE BETWEEN FUZZY C MEANS AND K-MEANS:


The essential difference between fuzzy c-means clustering and standard
k-means clustering is the partitioning of objects into each group. Rather than
the hard partitioning of standard k-means clustering, where objects belong to
only a single cluster, fuzzy c-means clustering considers each object a member
of every cluster, with a variable degree of “membership”.
The similarity between objects is defined by a distance measure, which plays
an important role in obtaining correct clusters. For simple datasets where the
data are multidimensional, the Euclidean distance measure can be used.
However, several other distance measures can also be used, and different
measures may produce different clusters from the same data.
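
To illustrate why the choice of distance measure matters, here is a small
sketch comparing three common measures. The Manhattan and Chebyshev distances
are not named in the text and are brought in only as examples; the point `x`
and the two cluster centers are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))


measures = {"euclidean": euclidean, "manhattan": manhattan, "chebyshev": chebyshev}

# Hypothetical point and candidate cluster centers.
x = (0.0, 0.0)
centers = [(3.0, 3.0), (0.0, 4.1)]

# For each measure, find which center the point would be assigned to.
nearest = {
    name: min(centers, key=lambda c: dist(x, c)) for name, dist in measures.items()
}
for name, center in nearest.items():
    print(name, "->", center)
```

With these values the Euclidean and Manhattan distances assign the point to
the second center, while the Chebyshev distance assigns it to the first, so
the same data can end up clustered differently.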

Accuracy of fuzzy c-means and k-means:

For example, in estimating the arterial input function (AIF) with FCM
[Figure: top, time–concentration curves for different clusters; bottom, mean
curves], the M values of the clusters are 0.0182, 1.9697e-004, 1.7418e-004,
0.0016 and 3.9782e-004, respectively. The mean curve for M = 0.0182 was thus
selected to represent the estimated AIF.
With k-means [Figure: top, time–concentration curves for different clusters;
bottom, mean curves], the M values of the clusters are 0.0212, 5.8216e-004,
1.7180e-004, 2.3354e-004 and 0.0205, respectively. The mean curve for
M = 0.0212 was thus selected to represent the estimated AIF.

Association Rule Mining (ARM) techniques:
Association rule mining is a procedure that aims to find frequently occurring
patterns, correlations, or associations in datasets held in various kinds of
databases, such as relational databases, transactional databases, and other
forms of repositories.

ASSOCIATION RULE:
Association rule mining first finds all sets of items (itemsets) whose support
is greater than the minimum support, and then uses these large itemsets to
generate the desired rules whose confidence is greater than the minimum
confidence. The lift of a rule X → Y is the ratio of the observed support to
that expected if X and Y were independent. A typical and widely used
application of association rules is market basket analysis.
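
The three measures can be worked through on a small example. The sketch below
uses a made-up set of five market baskets; the transactions, item names and
function names are assumptions for illustration only.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Observed support of X and Y together over that expected if independent."""
    return support(set(x) | set(y), transactions) / (
        support(x, transactions) * support(y, transactions)
    )


# Rule {bread} -> {milk}
s = support({"bread", "milk"}, transactions)    # 3 of 5 baskets
c = confidence({"bread"}, {"milk"}, transactions)
l = lift({"bread"}, {"milk"}, transactions)
print(s, c, l)
```

Here the rule has support 0.6 and confidence 0.75, but its lift is slightly
below 1, which suggests that in this toy data bread and milk appear together
a little less often than independence would predict.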

FP GROWTH:
1. This algorithm needs to scan the database only twice, whereas Apriori scans
the transactions on every iteration.
2. The pairing of items is not done in this algorithm, which makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
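
The two database scans and the compact in-memory structure can be sketched as
follows. This is only the tree-building half of FP Growth, under simplifying
assumptions: it omits the header table, node links and the recursive mining of
conditional trees. The class and function names, the tie-breaking rule and the
toy transactions are hypothetical.

```python
from collections import Counter


class FPNode:
    """One node of the FP tree: an item, a count, and child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support_count):
    # Scan 1: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}

    root = FPNode(None)
    # Scan 2: insert each transaction's frequent items, most frequent first,
    # so that transactions sharing a prefix share a path in the tree.
    for t in transactions:
        items = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),  # ties broken alphabetically
        )
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


# Hypothetical transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
    ["bread", "milk", "eggs"],
]
root, frequent = build_fp_tree(transactions, min_support_count=2)
print(root.children["bread"].count)
```

Four of the five transactions start with "bread", so they share a single
"bread" node with a count of 4 instead of being stored four times, which is
where the compactness comes from.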

APRIORI:
1. The algorithm is easy to understand.
2. The Join and Prune steps are easy to implement on large itemsets in large
databases.
3. It requires heavy computation if the itemsets are very large and the minimum
support is kept very low.
4. The entire database needs to be scanned on each iteration.
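
The Join and Prune steps, and the repeated database scans, can be seen in a
compact sketch of Apriori. This is a simplified illustration, not an optimized
implementation; the function name, the use of an absolute support count and
the toy transactions are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return a dict mapping each frequent itemset (frozenset) to its count."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item.
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while candidates:
        # Scan the whole database to count each candidate.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c
            for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
    ["bread", "milk", "eggs"],
]
frequent = apriori(transactions, min_support_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this example the candidate {bread, milk, butter} is removed by the prune
step without being counted, because its subset {milk, butter} is not frequent.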

FP GROWTH VS APRIORI:
FP GROWTH:
Pattern generation: FP Growth generates patterns by constructing an FP tree.
Candidate generation: There is no candidate generation.
Process: The process is faster than Apriori. The runtime increases linearly
with the number of itemsets.
Memory usage: A compact version of the database is saved.
APRIORI:
Pattern generation: Apriori generates patterns by pairing the items into
singletons, pairs and triplets.
Candidate generation: Apriori uses candidate generation.
Process: The process is comparatively slower than FP Growth. The runtime
increases exponentially with the number of itemsets.
Memory usage: The candidate combinations are saved in memory.

DRAWBACKS OF APRIORI COMPARED WITH FP GROWTH:
1. The Apriori algorithm uses the Apriori property and the join and prune steps
for mining frequent patterns. The FP Growth algorithm constructs a conditional
FP tree and a conditional pattern base from the database that satisfy the
minimum support.
2. Apriori uses a breadth-first search method, whereas FP Growth uses a divide
and conquer method.
3. The Apriori algorithm requires a large amount of memory because it generates
a large number of candidate itemsets. The FP Growth algorithm requires less
memory because of its compact structure, and it discovers the frequent itemsets
without candidate itemset generation.
4. The Apriori algorithm performs multiple scans to generate the candidate
sets. The FP Growth algorithm scans the database only twice.
5. In the Apriori algorithm, much of the execution time is spent producing
candidates on every pass. FP Growth’s execution time is lower than Apriori’s.