Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. Data mining tasks can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Data can be associated with classes or concepts. For example, in the Electronics store, classes
of items for sale include computers and printers, and concepts of customers include
bigSpenders and budgetSpenders.
Data characterization
Data characterization is a summarization of the general characteristics or features of a target
class of data.
Data discrimination
Data discrimination is a comparison of the general features of target class data objects with
the general features of objects from one or a set of contrasting classes.
Frequent patterns are patterns that occur frequently in data. There are many kinds of frequent
patterns, including item sets, subsequences, and substructures.
Association analysis
Suppose, as a marketing manager, you would like to determine which items are frequently
purchased together within the same transactions.
Support=1% means that 1% of all of the transactions under analysis showed that computer
and software were purchased together.
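To make the support measure concrete, here is a small Python sketch; the transactions below are invented for illustration and are not the store data from the example above:

```python
# Minimal sketch: computing the support of an itemset over a list of
# transactions, each transaction being a set of item names.
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= set(t))
    return hits / len(transactions)

transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"software"},
    {"computer", "software", "printer"},
]
# "computer" and "software" appear together in 2 of 4 transactions.
print(support({"computer", "software"}, transactions))  # 0.5
```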
2. CLASSIFICATION OF DATA MINING SYSTEMS.
It is time to get into some more delicate topics. Over the past “stories” we discussed what data
mining is and how data is stored. As a recap of the past topics, let’s redefine data mining using
current knowledge:
neural networks
set theory
knowledge representation
high-performance computing
Depending on the nature of the data mined or on the given data mining application, the data
mining methods may also combine techniques from spatial data analysis, information
retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web
technology, economics, business, bioinformatics, or psychology.
Data mining systems can also be categorized as those that mine data regularities (commonly
occurring patterns) versus those that mine data irregularities (such as exceptions, or outliers).
Systems based on concept description, association and correlation analysis, classification,
prediction, and clustering mine data regularities, rejecting outliers as noise.
autonomous systems
query-driven systems
machine learning
statistics, visualization
pattern recognition
neural networks
A sophisticated data mining system will often adopt multiple data mining techniques or work
out an effective, integrated technique that combines the merits of a few individual
approaches.
3. Classification according to the applications adapted:
Data mining systems can also be categorized according to the applications they adapt.
For example, data mining systems may be tailored specifically for finance,
telecommunications, DNA, stock markets, e-mail, and so on.
Different applications often require the integration of application-specific methods.
Therefore, a generic, all-purpose data mining system may not fit domain-specific
mining tasks.
Data mining systems face many challenges and issues in today’s world; some of them are:
1 Mining methodology and user interaction issues
Different user - different knowledge - different way. That means different clients want
different kinds of information, so it becomes difficult to cover the vast range of data that can
meet the client requirements.
Interactive mining allows users to focus the search for patterns from different angles. The data
mining process should be interactive because it is difficult to know what can be discovered
within a database.
Background knowledge is used to guide discovery process and to express the discovered
patterns.
Relational query languages (such as SQL) allow users to pose ad hoc queries for data
retrieval. Similarly, a data mining query language should be perfectly matched with the
query language of the data warehouse.
In a large database, many of the attribute values will be incorrect. This may be due to human
error or instrument failure. Data cleaning methods and data analysis methods are used to
handle noisy data.
2 Performance issues
Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable.
The huge size of many databases, the wide distribution of data, and complexity of some data
mining methods are factors motivating the development of parallel and distributed data
mining algorithms. Such algorithms divide the data into partitions, which are processed in
parallel.
There are many kinds of data stored in databases and data warehouses. It is not possible for
one system to mine all these kinds of data. So different data mining systems should be
constructed for different kinds of data.
Data is fetched from different data sources on the Local Area Network (LAN) and Wide
Area Network (WAN). The discovery of knowledge from these different sources of structured,
semi-structured, and unstructured data is a great challenge to data mining.
1. DATA PREPROCESSING
Why preprocessing?
o Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies.
o Data reduction: reducing the volume but producing the same or similar analytical
results.
The quality of your data is critical to the final analysis. Any data that tends to be incomplete,
noisy, or inconsistent can affect your result.
Data cleaning in data mining is the process of detecting and removing corrupt or inaccurate
records from a record set, table or database.
1 You can ignore the tuple. This is done when the class label is missing. This method is not
very effective unless the tuple contains several attributes with missing values.
2 You can fill in the missing value manually. This approach is effective on a small data set
with few missing values.
3 You can replace all missing attribute values with a global constant, such as a label like
“Unknown” or minus infinity.
4 You can use the attribute mean to fill in the missing value. For example, if the customers'
average income is 25,000, then you can use this value to replace the missing values for income.
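Strategy 4 above can be sketched in a few lines of Python; the income values are made up for illustration, and None stands for a missing value:

```python
# Sketch: filling missing values (None) in a numeric attribute with the
# mean of the known values, as in strategy 4 above.
def fill_with_mean(values):
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

incomes = [30000, None, 20000, None, 25000]
# Mean of the known incomes is 25000.0, so both gaps are filled with it.
print(fill_with_mean(incomes))  # [30000, 25000.0, 20000, 25000.0, 25000]
```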
Noisy Data
Noise is a random error or variance in a measured variable. Noisy data may be due to faulty
data collection instruments, data entry problems, and technology limitations.
Binning:
Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed into a number of “buckets,” or bins.
For example
Bin a: 4, 8, 15
In this example, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3.
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin:
Bin a: 9, 9, 9
In smoothing by bin boundaries, each bin value is replaced by the closest boundary value:
Bin a: 4, 4, 15
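The two smoothing variants above can be sketched in Python. The helper names and the rounding of means to whole numbers are assumptions for illustration; the bin (4, 8, 15) is the one from the example:

```python
# Sketch: equal-frequency binning plus smoothing by bin means and by
# bin boundaries, matching the Bin a example in the text.
def make_bins(values, size):
    s = sorted(values)
    return [s[i:i + size] for i in range(0, len(s), size)]

def smooth_by_means(bins):
    # every value in a bin becomes the (rounded) bin mean
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # every value becomes whichever bin boundary (min or max) is closer
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [15, 4, 8]          # one bin's worth of data
bins = make_bins(prices, 3)  # [[4, 8, 15]]
print(smooth_by_means(bins))       # [[9, 9, 9]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15]]
```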
Regression
Data can be smoothed by fitting the data to a function, such as with linear regression.
Clustering:
Outliers may be detected by clustering, where similar values are organized into groups, or
“clusters.” Values that fall outside of the set of clusters may be considered outliers.
In the data transformation process, data are transformed from one format to another that is
more appropriate for data mining.
1 Smoothing
Smoothing works to remove noise from the data, using techniques such as binning,
regression, and clustering.
2 Aggregation
Aggregation is a process where summary or aggregation operations are applied to the data.
3 Generalization
In generalization, low-level data are replaced with high-level data by climbing concept
hierarchies.
4 Normalization
Normalization scales attribute data so that it falls within a small specified range, such as 0.0
to 1.0.
5 Attribute Construction
In Attribute construction, new attributes are constructed from the given set of attributes.
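As an illustration of strategy 4 above (normalization), here is a minimal min-max normalization sketch; the function name and sample values are invented:

```python
# Sketch: min-max normalization of an attribute into [new_min, new_max],
# by default the range 0.0 to 1.0 mentioned in the text.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

print(min_max_normalize([20, 40, 60, 100]))  # [0.0, 0.25, 0.5, 1.0]
```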
A database or data warehouse may store terabytes of data, so it may take very long to
perform data analysis and mining on such huge amounts of data.
Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume but still contains critical information.
1 Data Cube Aggregation
Aggregation operations are applied to the data in the construction of a data cube.
2 Dimensionality Reduction
In dimensionality reduction, redundant attributes are detected and removed, which reduces
the data set size.
3 Data Compression
In data compression, encoding mechanisms are applied to reduce the data set size.
4 Numerosity Reduction
In numerosity reduction, the data are replaced by alternative, smaller data representations.
5 Discretization and Concept Hierarchy Generation
Raw data values for attributes are replaced by ranges or higher conceptual levels.
Data discretization techniques can be used to divide the range of a continuous attribute into
intervals. Numerous continuous attribute values are replaced by a small number of interval labels.
Top-down discretization
If the process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range, and then repeats this recursively on the resulting intervals, then
it is called top-down discretization or splitting.
Bottom-up discretization
If the process starts by considering all of the continuous values as potential split points and
removes some by merging neighboring values to form intervals, then it is called bottom-up
discretization or merging.
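A very simple unsupervised discretization, equal-width splitting, can be sketched as follows; the function name and the sample ages are invented for illustration:

```python
# Sketch: split a continuous attribute's range into k equal-width
# intervals and replace each value with its interval label.
def equal_width_discretize(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp the max value into the last bin
        labels.append(f"[{lo + i * width:g}, {lo + (i + 1) * width:g})")
    return labels

ages = [13, 22, 35, 48, 64]
print(equal_width_discretize(ages, 3))
# ['[13, 30)', '[13, 30)', '[30, 47)', '[47, 64)', '[47, 64)']
```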
Concept hierarchies
Concept hierarchies can be used to reduce the data by collecting and replacing low-level
concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives.
Data mining on a reduced data set means fewer input/output operations and is more efficient
than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically
applied before data mining, rather than during mining.
Typical methods
1 Binning
2 Histogram Analysis
3 Cluster Analysis
Cluster analysis is a popular data discretization method. A clustering algorithm can be applied
to discretize a numeric attribute A by partitioning the values of A into clusters or groups.
Each initial cluster or partition may be further decomposed into several subclusters, forming
a lower level of the hierarchy.
UNIT – II
Data warehouse architectures generally consist of three tiers:
1 Bottom tier
2 Middle tier
3 Top tier
1 Bottom tier
The bottom tier of the architecture is the data warehouse database server. It is a relational
database system. Data is fed into the bottom tier by back-end tools and utilities.
1 Data Extraction
2 Data Cleaning
3 Data Transformation
4 Load
5 Refresh
2 Middle tier
The middle tier is an OLAP server that presents users with multidimensional data from the
data warehouse. The OLAP server can be implemented using either the relational OLAP
(ROLAP) model or the multidimensional OLAP (MOLAP) model.
ROLAP is an extended relational database management system. The ROLAP maps the
operations on multidimensional data to standard relational operations.
A data warehouse provides a customer view, which helps to improve customer relationships.
A data warehouse can also help in cost reduction by tracking trends and patterns.
The top-down view allows the selection of the relevant information necessary for the data
warehouse.
The data source view presents information being captured, stored, and managed by
operational systems. This information may be documented at various levels of detail and
accuracy, from individual data source tables to integrated data source tables.
The data warehouse view includes fact tables and dimension tables. This view represents the
information that is stored inside the data warehouse.
The business query view is the view of data in the data warehouse from the viewpoint of the end user.
There are some other differences between OLTP and OLAP which I have explained using the
comparison chart shown below.
OLTP Vs OLAP
1. Comparison Chart
2. Definition
3. Key Differences
Comparison Chart
Basic: OLTP is an online transactional system and manages database modification; OLAP is
an online data retrieving and data analysis system.
Focus: OLTP inserts, updates, and deletes information in the database; OLAP extracts data
for analysis that helps in decision making.
Data: OLTP and its transactions are the original source of data; different OLTP databases
become the source of data for OLAP.
Transaction: OLTP has short transactions; OLAP has long transactions.
Time: The processing time of a transaction is comparatively less in OLTP and comparatively
more in OLAP.
Queries: OLTP uses simpler queries; OLAP uses complex queries.
Normalization: Tables in an OLTP database are normalized (3NF); tables in an OLAP
database are not normalized.
Integrity: An OLTP database must maintain data integrity constraints; an OLAP database
does not get frequently modified, so data integrity is not affected.
Definition of OLTP
OLTP is an Online Transaction Processing system. The main focus of an OLTP system is to
record the current updates, insertions, and deletions during transactions. OLTP queries are
simple and short and hence require less processing time and less space.
An OLTP database gets updated frequently. It may happen that a transaction in OLTP fails
midway, which may affect data integrity. So, it has to take special care of data integrity.
An OLTP database has normalized tables (3NF).
The best example of an OLTP system is an ATM, in which we modify the status of our
account using short transactions. The OLTP system becomes the source of data for OLAP.
Definition of OLAP
OLAP is an Online Analytical Processing system. OLAP database stores historical data that
has been inputted by OLTP. It allows a user to view different summaries of multi-dimensional
data. Using OLAP, you can extract information from a large database and analyze it for
decision making.
OLAP also allows a user to execute complex queries to extract multidimensional data. In
OLAP, even if a transaction fails midway, it will not harm data integrity, because the user
only uses the OLAP system to retrieve data from a large database for analysis; the user can
simply fire the query again and extract the data.
The transactions in OLAP are long and hence take comparatively more time for processing
and require large space. Transactions in OLAP are less frequent as compared to OLTP.
Even the tables in an OLAP database may not be normalized. Examples of OLAP use are
viewing a financial report, budgeting, marketing management, sales reports, and so on.
1. The point that distinguishes OLTP and OLAP is that OLTP is an online transaction system
whereas OLAP is an online data retrieval and analysis system.
2. Online transactional data becomes the source of data for OLTP, while the different OLTP
databases become the source of data for OLAP.
3. OLTP's main operations are insert, update, and delete, whereas OLAP's main operation is to
extract multidimensional data for analysis.
4. OLTP has short but frequent transactions, whereas OLAP has long and less frequent
transactions.
5. The tables in an OLTP database must be normalized (3NF), whereas the tables in an OLAP
database may not be normalized.
6. As OLTP frequently executes transactions in the database, a transaction that fails midway
may harm data integrity, so OLTP must take special care of data integrity. In OLAP,
transactions are less frequent, so data integrity is less of a concern.
3. OLAP OPERATIONS
OLAP stands for Online Analytical Processing. It is a software technology that
allows users to analyze information from multiple database systems at the same time. It is
based on the multidimensional data model and allows the user to query multidimensional
data (e.g., Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes,
and these cubes are known as hyper-cubes.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In the drill-down operation, less detailed data is converted into highly detailed
data. It can be done by moving down in the concept hierarchy or by adding a new dimension.
In the cube given in the overview section, the drill-down operation is performed by
moving down in the concept hierarchy of the Time dimension (Quarter -> Month).
2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation on the
OLAP cube. It can be done by climbing up in the concept hierarchy or by reducing the
dimensions. In the cube given in the overview section, the roll-up operation is performed by
climbing up in the concept hierarchy of the Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In the
cube given in the overview section, a sub-cube is selected by choosing dimensions with
criteria.
4. Slice: It selects a single dimension from the OLAP cube which results in a new sub-cube
creation. In the cube given in the overview section, Slice is performed on the dimension Time
= “Q1”.
5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new
view of the representation. In the sub-cube obtained after the slice operation, performing the
pivot operation gives a new view of it.
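The roll-up and slice operations can be sketched on a tiny cube held as plain records. The cube below (quarters, cities, sales figures) is made up and is not the cube from the overview section:

```python
# Sketch: roll-up (aggregate along a hierarchy) and slice (fix one
# dimension value) on cube cells stored as dictionaries.
from collections import defaultdict

cells = [
    {"quarter": "Q1", "city": "Delhi",  "country": "India", "sales": 100},
    {"quarter": "Q1", "city": "Mumbai", "country": "India", "sales": 150},
    {"quarter": "Q2", "city": "Delhi",  "country": "India", "sales": 120},
]

def roll_up(cells, level):
    """Aggregate sales by climbing the Location hierarchy (city -> country)."""
    totals = defaultdict(int)
    for c in cells:
        totals[(c["quarter"], c[level])] += c["sales"]
    return dict(totals)

def slice_cube(cells, dim, value):
    """Select the sub-cube where one dimension is fixed to a single value."""
    return [c for c in cells if c[dim] == value]

print(roll_up(cells, "country"))  # {('Q1', 'India'): 250, ('Q2', 'India'): 120}
print(slice_cube(cells, "quarter", "Q1"))
```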
4. OLAP SERVERS
Relational On-Line Analytical Processing (ROLAP) works mainly with data that resides in a
relational database, where the base data and dimension tables are stored as relational tables.
ROLAP servers are placed between the relational back-end server and client front-end tools.
ROLAP servers use an RDBMS to store and manage warehouse data, and OLAP middleware
to support missing pieces.
Advantages of ROLAP
Disadvantages of ROLAP
Advantages of MOLAP
Disadvantages of MOLAP
Advantages of HOLAP
Disadvantages of HOLAP
1 The HOLAP architecture is very complex because it supports both MOLAP and ROLAP servers.
Unit- III
In this chapter, we will learn how to mine frequent patterns, association rules, and correlation
rules when working with R programs. Then, we will evaluate all these methods with
benchmark data to determine the interestingness of the frequent patterns and rules. We will
cover the following topics in this chapter:
High-performance algorithms
The algorithms to find frequent items from various data types can be applied to numeric or
categorical data. Most of these algorithms share one common basic algorithmic form,
A-Priori, with variations depending on the circumstances. Another basic algorithm is
FP-Growth, which is similar to A-Priori. Most pattern-related mining algorithms derive from
these basic algorithms.
With frequent patterns found as one input, many algorithms are designed to find association
and correlation rules. Each algorithm is only a variation of the basic algorithm.
Along with the growth, size, and types of datasets from various domains, new algorithms are
designed, such as the multistage algorithm, the multihash algorithm, and the limited-pass
algorithm.
One popular task for data mining is to find relations within the source dataset; this is based
on searching for frequent patterns in various data sources, such as market baskets, graphs,
and streams.
All the algorithms illustrated in this chapter are written from scratch in the R language for the
purpose of explaining association analysis, and the code will also be demonstrated using
standard R packages for the algorithms, such as arules.
With many applications across a broad field, frequent pattern mining is often used in solving
various problems, such as the market investigation for a shopping mall from the transaction
data.
Frequent patterns are the ones that often occur in the source dataset. The dataset types for
frequent pattern mining can be itemset, subsequence, or substructure. As a result, the frequent
patterns found are known as:
Frequent itemset
Frequent subsequence
Frequent substructures
These three frequent patterns will be discussed in detail in the upcoming sections.
These newly found frequent patterns will serve as an important platform when searching
for recurring interesting rules or relationships among the given dataset.
Various patterns are proposed to improve the efficiency of mining on a dataset. Some of them
are as follows; they will be defined in detail later:
Closed patterns
Maximal patterns
Approximate patterns
Condensed patterns
The frequent itemset originated from true market basket analysis. In a store such as Amazon,
there are many orders or transactions; a certain customer performs a transaction where their
Amazon shopping cart includes some items. The mass result of all customers' transactions
can be used by the storeowner to find out what items are purchased together by customers. As
a simple definition, itemset denotes a collection of zero or more items.
We call a transaction a basket, and a set of items can belong to any basket. We will set the
variable s as the support threshold, which is compared with the count of a certain set of items
that appear in all the baskets. If the count of a certain set of items that appear in all the
baskets is not less than s, we would call the itemset a frequent itemset.
If an itemset is frequent, then any of its subsets must be frequent. This is known as the
A-Priori principle, the foundation of the A-Priori algorithm. The direct application of the
A-Priori principle is to prune the huge number of candidate itemsets.
One important factor that affects the number of frequent itemsets is the minimum support
count: the lower the minimum support count, the larger the number of frequent itemsets.
For the purpose of optimizing the frequent itemset-generation algorithm, some more concepts
are proposed:
An itemset X is closed in a dataset S if there exists no proper superset of X with the same
support count as X; X is also called a closed itemset. In other words, if X is both closed and
frequent, then X is a closed frequent itemset.
An itemset X is considered a constrained frequent itemset once the frequent itemset satisfies
the user-specified constraints.
An itemset X is a top-k frequent itemset in the dataset S if X is the k-most frequent itemset,
given a user-defined value k.
The following example is of a transaction dataset containing ten transactions, T001 through
T010, where each transaction is a set of items.
A frequent sequence is an ordered list of elements where each element contains at least one
event. An example of this is the page-visit sequence on a site, or, more concretely speaking,
the order in which a certain user visits web pages. Here
are two examples of the frequent subsequence:
Customer: Successive shopping records of a certain customer in a shopping mart serve as
the sequence, each item bought serves as an event item, and all the items bought by a
customer in one shopping trip are treated as an element or transaction
Web usage data: A user's visit history on the WWW is treated as a sequence, each
UI/page serves as an event or item, and an element or transaction can be defined as the
pages visited by the user with one click of the mouse
The length of a sequence is defined by the number of items contained in the sequence. A
sequence of length k is called a k-sequence. The size of a sequence is defined by the number
of elements (or transactions) it contains.
In some domains, the tasks under research can be modeled with graph theory. As a result,
there are requirements for mining common subgraphs (subtrees or sublattices); some
examples are as follows:
Web mining: Web pages are treated as the vertices of a graph, links between pages serve as
edges, and a user's page-visiting records construct the graph.
Network computing: Any device with computation ability on the network serves as the
vertex, and the interconnection between these devices serves as the edge. The whole
network that is made up of these devices and interconnections is treated as a graph.
Semantic web: XML elements serve as the vertices, and the parent/child relations between
them are edges; all these XML files are treated as graphs.
Mining of association rules is based on the frequent patterns found. Different emphases
on the interestingness of relations derive two types of relations for further research:
association rules and correlation rules.
Association rules
In a later section, a method to show association analysis is illustrated; this is a useful method
to discover interesting relationships within a huge dataset. The relations can be represented in
the form of association rules or frequent itemsets.
Association rule mining finds the result rule set on a given dataset (a transaction dataset
or other sequence-pattern-type dataset), a predefined minimum support count s, and a
predefined minimum confidence c. A rule has the form X -> Y, where X and Y are disjoint
itemsets.
For association rules, the key measures of rule interestingness are rule support and
confidence. Their relationship is given as follows:
support(X -> Y) = P(X ∪ Y)
confidence(X -> Y) = P(Y | X) = support(X ∪ Y) / support(X)
The meaning of the found association rules should be interpreted with caution, especially
when there is not enough evidence to judge whether a rule implies causality. A rule only
shows the co-occurrence of the prefix and postfix of the rule. The following are the different
kinds of rules you can come across:
A rule is a Boolean association rule if it contains associations of the presence of items
A rule is a single-dimensional association if there is, at the most, only one dimension referred
to in the rules
A rule is a multidimensional association rule if there are at least two dimensions referred to
in the rules
Correlation rules
In some situations, the support and confidence pair is not sufficient to filter out uninteresting
association rules. In such a case, we use support count, confidence, and correlations to
filter association rules.
There are many methods to calculate the correlation of an association rule, such as lift and
the chi-squared measure.
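One common correlation measure is lift, defined as lift(X -> Y) = P(X ∪ Y) / (P(X) P(Y)); a value near 1 suggests X and Y are independent, while values well above 1 suggest positive correlation. A minimal sketch, with baskets invented for illustration:

```python
# Sketch: the lift correlation measure for a rule X -> Y over a list of
# transactions (each a set of items).
def lift(x, y, transactions):
    n = len(transactions)
    p_x = sum(1 for t in transactions if x <= t) / n
    p_y = sum(1 for t in transactions if y <= t) / n
    p_xy = sum(1 for t in transactions if (x | y) <= t) / n
    return p_xy / (p_x * p_y)

baskets = [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}]
# P(a,b)=2/4, P(a)=3/4, P(b)=2/4, so lift = 0.5 / 0.375
print(lift({"a"}, {"b"}, baskets))  # 1.333...
```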
The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a test
on an attribute. Each leaf node represents a class.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
In 1980, the machine learning researcher J. Ross Quinlan developed a decision tree algorithm
known as ID3 (Iterative Dichotomiser). Later, he presented C4.5, the successor of ID3.
ID3 and C4.5 adopt a greedy approach. In these algorithms, there is no backtracking; the
trees are constructed in a top-down, recursive, divide-and-conquer manner.
Input:
Data partition D, which is a set of training tuples
and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a split point or a splitting subset.
Output:
A Decision Tree
Method
create a node N;
if the tuples in D are all of the same class C then
return N as a leaf node labeled with class C;
if attribute_list is empty then
return N as a leaf node labeled with the majority class in D;
apply Attribute_selection_method(D, attribute_list) to find
the best splitting criterion, and label node N with it;
for each outcome j of the splitting criterion
let Dj be the set of tuples in D satisfying outcome j;
if Dj is empty then
attach a leaf labeled with the majority
class in D to node N;
else
attach the node returned by Generate
decision tree(Dj, attribute list) to node N;
end for
return N;
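Attribute selection in ID3 is based on information gain. The sketch below computes entropy and the gain of a single candidate attribute; the toy labels are invented, and this is not a full tree builder:

```python
# Sketch: entropy and information gain, the quantities ID3's attribute
# selection relies on.
from collections import defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    counts = defaultdict(int)
    for label in labels:
        counts[label] += 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D) for one candidate attribute."""
    n = len(labels)
    groups = defaultdict(list)
    for v, label in zip(values, labels):
        groups[v].append(label)
    split_info = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - split_info

buys = ["yes", "yes", "no", "no"]
# A perfectly predictive attribute recovers all the entropy (gain 1.0);
# an uninformative one recovers none (gain 0.0).
print(info_gain(["student", "student", "other", "other"], buys))  # 1.0
print(info_gain(["x", "y", "x", "y"], buys))                      # 0.0
```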
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are the statistical
classifiers. Bayesian classifiers can predict class membership probabilities such as the
probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
Posterior probability, P(H|X)
Prior probability, P(H)
where X is a data tuple and H is some hypothesis. Bayes' theorem states that
P(H|X) = P(X|H) P(H) / P(X)
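A minimal numeric sketch of Bayes' theorem; all probability values below are made up for a single hypothesis H and evidence X:

```python
# Sketch: posterior probability via Bayes' theorem,
# P(H|X) = P(X|H) * P(H) / P(X).
def posterior(p_x_given_h, p_h, p_x):
    return p_x_given_h * p_h / p_x

p_h = 0.3             # prior probability of the hypothesis
p_x_given_h = 0.8     # likelihood of the evidence given H
p_x_given_not_h = 0.1
# P(X) expanded by total probability over H and not-H: 0.24 + 0.07 = 0.31
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)
print(posterior(p_x_given_h, p_h, p_x))  # 0.774...
```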
Bayesian Belief Networks specify joint conditional probability distributions. They are also
known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
These variables may correspond to the actual attribute given in the data.
The following diagram shows a directed acyclic graph for six Boolean variables.
The arc in the diagram allows representation of causal knowledge. For example, lung cancer
is influenced by a person's family history of lung cancer, as well as whether or not the person
is a smoker. It is worth noting that the variable PositiveXray is independent of whether the
patient has a family history of lung cancer or that the patient is a smoker, given that we know
the patient has lung cancer.
Conditional Probability Table
The conditional probability table for the values of the variable LungCancer (LC) showing
each possible combination of the values of its parent nodes, FamilyHistory (FH), and Smoker
(S) is as follows −
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a
rule in the following form −
IF condition THEN conclusion
Points to remember −
The antecedent part, the condition, consists of one or more attribute tests, and these tests
are logically ANDed.
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a
decision tree.
Points to remember −
The leaf node holds the class prediction, forming the rule consequent.
The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training
data. We do not need to generate a decision tree first. In this algorithm, each rule for a given
class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general
strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by
the rule are removed, and the process continues for the rest of the tuples.
Note − Decision tree induction can be considered as learning a set of rules simultaneously,
because the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm, where rules are learned for one class at a
time. When learning a rule for a class Ci, we want the rule to cover all the tuples from class
Ci only and no tuple from any other class.
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: a set of IF-THEN rules.
Method:
Rule_set = { };
for each class c do
repeat
Rule = Learn_One_Rule(D, Att_vals, c);
remove tuples covered by Rule from D;
Rule_set = Rule_set + Rule;
until termination condition;
end for
return Rule_set;
Rule Pruning
The assessment of quality is made on the original set of training data. The rule may
perform well on training data but less well on subsequent data. That is why rule
pruning is required.
A rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R
has greater quality, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos - neg) / (pos + neg)
where pos and neg are the number of positive and negative tuples covered by R,
respectively. R is pruned when the pruned version scores higher.
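FOIL's pruning quality measure is commonly given as FOIL_Prune(R) = (pos - neg) / (pos + neg), where pos and neg count the positive and negative tuples a rule covers; a minimal sketch with made-up counts:

```python
# Sketch: the FOIL_Prune quality measure; a higher value means the rule
# covers proportionally more positive tuples.
def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

print(foil_prune(45, 5))   # 0.8  (mostly positive coverage)
print(foil_prune(30, 30))  # 0.0  (no better than chance)
```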
Without further ado, let’s start talking about the Apriori algorithm. It is a classic algorithm
used in data mining for learning association rules. It is nowhere near as complex as it
sounds; on the contrary, it is very simple. Let me give you an example to explain it.
Suppose you have records of a large number of transactions at a shopping center as
follows:
Transactions: Items bought
T1: Item1, Item2, Item3
T2: Item1, Item2
T3: Item2, Item5
T4: Item1, Item2, Item5
Learning association rules basically means finding the items that are purchased
together more frequently than others.
For example, in the above table you can see that Item1 and Item2 are bought together
frequently.
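A minimal Apriori sketch over the four transactions above, with a support threshold of 2 transactions. It is written in Python rather than R for brevity; the arules package mentioned earlier provides a full implementation:

```python
# Sketch: Apriori frequent-itemset generation with the candidate
# pruning step (every (k-1)-subset of a k-candidate must be frequent).
from itertools import combinations

transactions = [
    {"Item1", "Item2", "Item3"},
    {"Item1", "Item2"},
    {"Item2", "Item5"},
    {"Item1", "Item2", "Item5"},
]

def apriori(transactions, min_count):
    """Return all itemsets appearing in at least min_count transactions."""
    items = {i for t in transactions for i in t}
    frequent = []
    level = [frozenset([i]) for i in items
             if sum(1 for t in transactions if i in t) >= min_count]
    k = 1
    while level:
        frequent.extend(level)
        k += 1
        # join step, then A-Priori pruning of candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
        # count step: keep candidates meeting the support threshold
        level = [c for c in candidates
                 if sum(1 for t in transactions if c <= t) >= min_count]
    return frequent

freq = apriori(transactions, 2)
print(sorted(sorted(f) for f in freq))
# [['Item1'], ['Item1', 'Item2'], ['Item2'], ['Item2', 'Item5'], ['Item5']]
```

Note how {Item1, Item5} is counted but rejected (support 1), which in turn prunes {Item1, Item2, Item5} without counting it; that pruning is the point of the A-Priori principle.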
UNIT – IV
1. CLUSTER ANALYSIS
What is Clustering?
Clustering is the process of grouping abstract objects into classes of similar objects.
While doing cluster analysis, we first partition the set of data into groups based on
data similarity and then assign labels to the groups.
The following points throw light on the requirements of clustering in data mining −
High dimensionality − The clustering algorithm should be able to handle not only
low-dimensional data but also high-dimensional space.
Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects, and the partitioning method constructs ‘k’
partitions of the data. Each partition represents a cluster, and k ≤ n. That is, the method
classifies the data into k groups, which satisfy the following requirements −
Points to remember −
For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving objects from one group to another.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps on merging the objects or groups that are close to one another, and it keeps doing so until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects
in the same cluster. In the continuous iteration, a cluster is split up into smaller clusters. It is
down until each object in one cluster or the termination condition holds. This method is rigid,
i.e., once a merging or splitting is done, it can never be undone.
Here are the two approaches that are used to improve the quality of hierarchical clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given
cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data
point within a given cluster, the radius of a given cluster has to contain at least a minimum
number of points.
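As a rough illustration of this idea, here is a simplified DBSCAN-style sketch in Python: a point is "dense" if its neighborhood of radius `eps` contains at least `min_pts` points, and clusters grow outward from dense points. The `eps` and `min_pts` values and the sample points are illustrative choices for this sketch, not part of any specific method named in the text.

```python
def region_query(points, i, eps):
    """Indices of all points within radius eps of points[i] (including itself)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

def dbscan(points, eps=1.0, min_pts=3):
    """Grow clusters from dense (core) points; sparse points are labeled -1 (noise)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1            # not dense enough: noise (may be reclaimed later)
            continue
        cluster += 1                  # start a new cluster from this core point
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point previously marked as noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = region_query(points, j, eps)
            if len(more) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(more)
    return labels

# Two dense blobs and one isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5),
          (10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5),
          (50.0, 50.0)]
labels = dbscan(points, eps=1.0, min_pts=3)
print(labels)
```

The two blobs come out as separate clusters, while the isolated point stays labeled as noise: exactly the behavior described above, where a cluster keeps growing only while the neighborhood density exceeds the threshold.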
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time, which is dependent only on the number of cells in each dimension of the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given
model. This method locates the clusters by clustering the density function. It reflects spatial
distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user- or application-oriented constraints. A constraint refers to the user expectation or the properties of the desired clustering results.
2. OUTLIER ANALYSIS
An Outlier is a rare chance of occurrence within a given data set. In Data Science, an Outlier
is an observation point that is distant from other observations. An Outlier may be due to
variability in the measurement or it may indicate experimental error.
Outliers, being the most extreme observations, may include the sample maximum or sample
minimum, or both, depending on whether they are extremely high or low. However, the
sample maximum and minimum are not always outliers because they may not be unusually
far from other observations.
What is an Outlier?
While outliers are attributed to a rare chance and may not necessarily be fully explainable, outliers in data can distort predictions and affect accuracy if you don't detect and handle them.
The contentious decision to consider or discard an outlier needs to be taken at the time of
building the model. Outliers can drastically bias/change the fit estimates and predictions. It is
left to the best judgement of the analyst to decide whether treating outliers is necessary and
how to go about it.
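To make the detection step concrete, here is a minimal Python sketch of two common rule-of-thumb checks. The 3-sigma and 1.5×IQR thresholds are conventions rather than fixed rules, and the sample data is invented for illustration.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]   # 95 is a suspicious observation
print(iqr_outliers(data))
```

Note that on this small sample the 3-sigma rule misses the outlier, because the extreme value itself inflates the mean and standard deviation; this masking effect is one reason the IQR rule is often preferred for small data sets.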
3. HIERARCHICAL CLUSTERING
Hierarchical Clustering
Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. For example, all files and folders on the hard disk are organized in a hierarchy. There are two types of hierarchical clustering, Divisive and Agglomerative.
Divisive method
In the divisive or top-down clustering method, we assign all of the observations to a single cluster and then partition the cluster into the two least similar clusters. Finally, we proceed recursively on each cluster until there is one cluster for each observation. There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances, but divisive clustering is conceptually more complex.
Agglomerative method
In the agglomerative or bottom-up clustering method, we assign each observation to its own cluster. Then, we compute the similarity (e.g., distance) between each of the clusters and join the two most similar clusters. Finally, we repeat these two steps until there is only a single cluster left.
Before any clustering is performed, it is required to determine the proximity matrix containing the distance between each pair of points using a distance function. Then, the matrix is updated to display the distance between each cluster. The following three methods differ in how the distance between each cluster is measured.
Single Linkage
In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between two points, one in each cluster. For example, the distance between clusters “r” and “s” is equal to the length of the arrow between their two closest points.
Complete Linkage
In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points, one in each cluster. For example, the distance between clusters “r” and “s” is equal to the length of the arrow between their two furthest points.
Average Linkage
In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster. For example, the distance between clusters “r” and “s” is equal to the average length of the arrows connecting the points of one cluster to the other.
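The three linkage criteria can be written directly as distance functions between two clusters. The sketch below uses made-up 2-D points for clusters r and s; only the standard library is needed.

```python
from itertools import product
from math import dist  # Euclidean distance (Python 3.8+)

def single_linkage(r, s):
    """Shortest distance between any point in r and any point in s."""
    return min(dist(p, q) for p, q in product(r, s))

def complete_linkage(r, s):
    """Longest distance between any point in r and any point in s."""
    return max(dist(p, q) for p, q in product(r, s))

def average_linkage(r, s):
    """Average of all pairwise distances between r and s."""
    return sum(dist(p, q) for p, q in product(r, s)) / (len(r) * len(s))

# Illustrative clusters on a line, so the distances are easy to check by hand.
r = [(0.0, 0.0), (1.0, 0.0)]
s = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(r, s), complete_linkage(r, s), average_linkage(r, s))
```

Here the pairwise distances are 4, 6, 3, and 5, so single linkage reports 3 (the closest pair), complete linkage 6 (the furthest pair), and average linkage 4.5 (their mean).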
4. K-MEANS CLUSTERING
Let X = {x1, x2, x3, ……, xn} be the set of data points and V = {v1, v2, ……, vc} be the set of centers.
1) Randomly select ‘c’ cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign the data point to the cluster center whose distance from the cluster center is the minimum of all the cluster centers.
4) Recalculate the new cluster centers, where each center is the mean of the data points assigned to that cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop, otherwise repeat from step 3).
Here, ‘||xi - vj||’ denotes the Euclidean distance between data point xi and center vj.
Advantages
1) Gives best results when the data sets are distinct or well separated from each other.
Disadvantages
1) The learning algorithm requires a priori specification of the number of cluster centers.
2) The use of exclusive assignment: if there are two highly overlapping clusters, k-means will not be able to resolve that there are two clusters.
3) The learning algorithm is not invariant to non-linear transformations, i.e., with different representations of the data we get different results (data represented in the form of Cartesian coordinates and polar coordinates will give different results).
4) The learning algorithm provides only a local optimum of the squared error function.
5) Random initialization of the cluster centers may not lead to a fruitful result.
6) Applicable only when the mean is defined, i.e., it fails for categorical data.
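The steps above can be sketched as a minimal k-means implementation. For reproducibility this sketch initializes the centers from the first k points rather than randomly (random initialization is the usual choice, and as noted above the choice of initial centers can change the result); the sample points are invented.

```python
from math import dist

def kmeans(points, k, iters=100):
    """Minimal k-means sketch: repeat the assignment and mean-update steps
    until no center moves."""
    centers = list(points[:k])  # deterministic start for this illustration
    clusters = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center (steps 2-3).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster (step 4).
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # step 6: converged, nothing was reassigned
            break
        centers = new_centers
    return centers, clusters

# Two well-separated blobs of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 2.0),
          (8.0, 8.0), (9.0, 8.0), (8.5, 9.0)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

With distinct, well-separated blobs like these, the algorithm converges in a few iterations to one center per blob, illustrating the advantage noted above.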
5. TIME SERIES MINING
A time series is a sequence of data points recorded at specific time points - most often at regular time intervals (seconds, hours, days, months, etc.). Every organization generates a high volume of data every single day - be it sales figures, revenue, traffic, or operating cost. Time series data mining can generate valuable information for long-term business decisions, yet it is underutilized in most organizations. Below is a list of a few possible ways to take advantage of time series datasets:
Trend analysis: Just plotting data against time can generate very powerful insights.
One very basic use of time-series data is just understanding temporal pattern/trend in
what is being measured. In businesses it can even give an early indication on the
overall direction of a typical business cycle.
Predictive analytics: Advanced statistical analyses such as panel data models (fixed and random effects models) rely heavily on multivariate longitudinal datasets. These types of analysis help in business forecasts, identify explanatory variables, or simply help understand associations between features in a dataset.
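As a tiny example of the trend-analysis idea, the sketch below smooths an invented monthly sales series with a moving average so the direction of the trend stands out from month-to-month noise; the window size and the figures are illustrative.

```python
def moving_average(series, window=3):
    """Average of each `window` consecutive values; the result is
    window-1 values shorter than the input series."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Invented monthly sales figures: noisy, but trending upward.
monthly_sales = [100, 120, 90, 130, 150, 140, 170, 160, 190]
trend = moving_average(monthly_sales, window=3)
print(trend)
```

The raw series bounces up and down, but the smoothed values increase from one window to the next, which is the kind of "early indication on the overall direction" mentioned above.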
UNIT- V
1. TEXT MINING
Information Retrieval
Information retrieval deals with the retrieval of information from a large number of text-based documents. Some database system features are not usually present in information retrieval systems, because the two handle different kinds of data. Examples of information retrieval systems include −
Online library catalogue systems
Online document management systems
Web search systems, etc.
Note − The main problem in an information retrieval system is to locate relevant documents in a document collection based on a user's query. This kind of user query consists of some keywords describing an information need.
In such search problems, the user takes the initiative to pull relevant information out of a collection. This is appropriate when the user has an ad hoc information need, i.e., a short-term need. But if the user has a long-term information need, then the retrieval system can also take the initiative to push any newly arrived information item to the user.
This kind of access to information is called Information Filtering. And the corresponding
systems are known as Filtering Systems or Recommender Systems.
We need to check the accuracy of a system when it retrieves a number of documents on the basis of a user's input. Let the set of documents relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}. The set of documents that are both relevant and retrieved can be denoted as {Relevant} ∩ {Retrieved}. This can be shown in the form of a Venn diagram as follows −
There are three fundamental measures for assessing the quality of text retrieval −
Precision
Recall
F-score
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query. Precision can be defined as −
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall
Recall is the percentage of documents that are relevant to the query and were in fact retrieved. Recall is defined as −
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
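All three measures can be computed directly from the two sets. The sketch below uses invented document IDs; the F-score is the harmonic mean of precision and recall, which balances the two.

```python
# Documents actually relevant to the query, and documents the system returned.
# The IDs are invented for illustration.
relevant  = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}

both = relevant & retrieved               # {Relevant} ∩ {Retrieved}

precision = len(both) / len(retrieved)    # fraction of retrieved docs that are relevant
recall    = len(both) / len(relevant)     # fraction of relevant docs that were retrieved
f_score   = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f_score)
```

Here two of the three retrieved documents are relevant (precision 2/3), but only two of the four relevant documents were found (recall 1/2), so the F-score sits between the two at 4/7.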
2. WWW MINING
The web poses great challenges for resource and knowledge discovery based on the following
observations −
The web is too huge − The size of the web is very huge and rapidly increasing. This suggests that the web is too huge for data warehousing and data mining.
Complexity of web pages − Web pages do not have a unifying structure. They are very complex compared to traditional text documents. There is a huge number of documents in the digital library of the web, and these libraries are not arranged in any particular sorted order.
The basic structure of the web page is based on the Document Object Model (DOM). The
DOM structure refers to a tree like structure where the HTML tag in the page corresponds to
a node in the DOM tree. We can segment the web page by using predefined tags in HTML.
The HTML syntax is flexible; therefore, many web pages do not follow the W3C specifications. Not following the specifications of W3C may cause errors in the DOM tree structure.
The DOM structure was initially introduced for presentation in the browser and not for
description of semantic structure of the web page. The DOM structure cannot correctly
identify the semantic relationship between the different parts of a web page.
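To illustrate how HTML tags map to nodes of a DOM tree, here is a small sketch that builds a DOM-like tag tree using only Python's standard html.parser module. The page markup is invented, and the sketch deliberately ignores text nodes and void tags (e.g., <br>) for simplicity.

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Each HTML start tag becomes a node; nesting gives the tree structure."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "document", "children": []}
        self.stack = [self.root]          # path from root to the open element

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:           # never pop the synthetic root
            self.stack.pop()

def show(node, depth=0):
    """Print the tree with indentation reflecting nesting depth."""
    print("  " * depth + node["tag"])
    for child in node["children"]:
        show(child, depth + 1)

builder = TreeBuilder()
builder.feed("<html><body><div><p>text</p></div><div></div></body></html>")
show(builder.root)
```

Each tag in the page becomes one node, and the two <div> blocks appear as siblings under <body>, which is exactly the kind of structure the predefined-tag segmentation mentioned above operates on.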
The purpose of VIPS (VIsion-based Page Segmentation) is to extract the semantic structure of a web page based on its visual presentation.
Such a semantic structure corresponds to a tree structure. In this tree each node
corresponds to a block.
A value is assigned to each node. This value is called the Degree of Coherence. This
value is assigned to indicate the coherent content in the block based on visual
perception.
The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree.
After that it finds the separators between these blocks.
The separators refer to the horizontal or vertical lines in a web page that visually cross
with no blocks.
The semantics of the web page is constructed on the basis of these blocks.
3. DATA MINING APPLICATIONS AND TRENDS
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection
Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some of the typical cases are as follows −
Design and construction of data warehouses for multidimensional data analysis and
data mining.
Retail Industry
Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends that lead to improved quality of customer service and good customer retention and satisfaction.
Here is the list of examples of data mining in the retail industry −
Design and Construction of data warehouses based on the benefits of data mining.
Customer Retention.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-growing industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important in helping to understand the business.
Biological Data Analysis
In recent times, we have seen tremendous growth in the field of biology, in areas such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics. Following are the aspects in which data mining contributes to biological data analysis −
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences, astronomy, etc. Large data sets are also being generated by fast numerical simulations in fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc. Following are the applications of data mining in the field of scientific applications −
Graph-based mining.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the internet and the availability of tools and tricks for intruding and attacking networks have prompted intrusion detection to become a critical component of network administration. Here is the list of areas in which data mining technology may be applied for intrusion detection −
There are many data mining system products and domain specific data mining applications.
The new data mining systems and applications are being added to the previous systems. Also,
efforts are being made to standardize data mining languages.
Data Types − The data mining system may handle formatted text, record-based data,
and relational data. The data could also be in ASCII text, relational database data or
data warehouse data. Therefore, we should check what exact format the data mining
system can handle.
System Issues − We must consider the compatibility of a data mining system with
different operating systems. One data mining system may run on only one operating
system or on several. There are also data mining systems that provide web-based user
interfaces and allow XML data as input.
Data Sources − Data sources refer to the data formats in which the data mining system will operate. Some data mining systems may work only on ASCII text files, while others work on multiple relational sources. The data mining system should also support ODBC connections or OLE DB for ODBC connections.
Data Mining functions and methodologies − There are some data mining systems that provide only one data mining function, such as classification, while others provide multiple data mining functions, such as concept description, discovery-driven OLAP analysis, association mining, linkage analysis, statistical analysis, classification, prediction, clustering, outlier analysis, similarity search, etc.
Coupling data mining with databases or data warehouse systems − Data mining
systems need to be coupled with a database or a data warehouse system. The coupled
components are integrated into a uniform information processing environment. Here
are the types of coupling listed below −
o No coupling
o Loose coupling
o Semi-tight coupling
o Tight coupling
Trends in Data Mining
Data mining concepts are still evolving and here are the latest trends that we get to see in this
field −
Application Exploration.
Integration of data mining with database systems, data warehouse systems and web
database systems.
Web mining.