
UNIT – I FIRST HALF

1. DATA MINING FUNCTIONALITIES (OR) DATA MINING CHARACTERISTICS?

Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. Data mining tasks can be classified into two categories: descriptive and predictive.

Descriptive mining tasks characterize the general properties of the data in the database.

Predictive mining tasks perform inference on the current data in order to make predictions.

Concept/Class Description: Characterization and Discrimination

Data can be associated with classes or concepts. For example, in an electronics store, classes
of items for sale include computers and printers, and concepts of customers include big
spenders and budget spenders.

Data characterization

Data characterization is a summarization of the general characteristics or features of a target
class of data.

Data discrimination

Data discrimination is a comparison of the general features of target class data objects with
the general features of objects from one or a set of contrasting classes.

Mining Frequent Patterns, Associations, and Correlations

Frequent patterns are patterns that occur frequently in data. There are many kinds of frequent
patterns, including itemsets, subsequences, and substructures.

Association analysis

Suppose, as a marketing manager, you would like to determine which items are frequently
purchased together within the same transactions.

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

where X is a variable representing a customer. Confidence = 50% means that if a customer
buys a computer, there is a 50% chance that she will buy software as well.

Support=1% means that 1% of all of the transactions under analysis showed that computer
and software were purchased together.
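
For illustration, the following short Python sketch (not part of the original notes; the transactions and items are hypothetical) computes the support and confidence of a rule such as buys(computer) => buys(software) from a small transaction list:

# Minimal sketch: support and confidence of the rule computer => software,
# computed over a small, hypothetical list of transactions (each a set of items).
transactions = [
    {"computer", "software", "printer"},
    {"computer"},
    {"printer", "scanner"},
    {"computer", "software"},
]

n = len(transactions)
count_both = sum(1 for t in transactions if {"computer", "software"} <= t)
count_computer = sum(1 for t in transactions if "computer" in t)

support = count_both / n                  # fraction of all transactions containing both items
confidence = count_both / count_computer  # fraction of computer buyers who also buy software
print(f"support = {support:.0%}, confidence = {confidence:.0%}")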
2. CLASSIFICATION OF DATA MINING SYSTEMS.

As a recap of the earlier discussion of what data mining is and how data is stored, data mining
can be redefined as follows:

Data mining is an interdisciplinary area, the confluence of a set of disciplines, including
database systems, statistics, machine learning, visualisation, and information science.
Moreover, depending on the mining approach used, techniques from other disciplines may be
applied, such as :

 neural networks

 set theory

 knowledge representation

 inductive logic programming

 high-performance computing

Depending on the nature of the data mined or on the given data mining application, the data
mining methods may also combine techniques from spatial data analysis, information
retrieval, pattern recognition, image analysis, signal processing, computer graphics,
Web technology, economics, business, bioinformatics, or psychology.

Because of the diversity of disciplines contributing to data mining, research is expected to
generate a large variety of data mining systems. Therefore, it is necessary to provide a clear
classification of systems, which may help users differentiate such systems and identify those
that best match their needs. Data mining systems can be categorised according to various
criteria, as follows:

1.Classification according to the kinds of databases mined:


A data mining system can be classified according to the kinds of databases mined. Database
systems can themselves be classified by criteria such as data models, or the types of data or
applications involved, each of which may require its own data mining technique.
For instance, if classifying according to data models, we may have a relational, transactional,
object-relational, or data warehouse mining system. If classifying according to the special types of
data handled, we may have a spatial, time-series, text, stream data, or multimedia data mining
system, or a World Wide Web mining system.

2.Classification according to the kinds of knowledge mined:


Data mining systems can be categorized according to the kinds of knowledge they mine, that
is, based on data mining functionalities, such as characterization, discrimination, association
and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution
analysis. A comprehensive data mining system usually provides multiple integrated data
mining functionalities.
Furthermore, data mining systems can be distinguished based on the granularity (levels of
abstraction) of the knowledge mined, including generalized knowledge (at a high level of
abstraction), primitive-level knowledge (at the raw data level), or knowledge at multiple
levels (considering several levels of abstraction).
An advanced data mining system should facilitate the discovery of knowledge at multiple
levels of abstraction.

Data mining systems can also be categorized as those that mine data regularities (commonly
occurring patterns) versus those that mine data irregularities (such as exceptions or outliers).
Systems based on concept description, association and correlation analysis, classification,
prediction, and clustering usually mine data regularities, rejecting outliers as noise.

3. Classification according to the kinds of techniques utilized:


Data mining systems can be categorized according to the underlying data mining techniques
employed. These techniques can be described according to the degree of user interaction
involved:

 autonomous systems

 interactive exploratory systems

 query-driven systems

Or the methods of data analysis employed:

 database-oriented or data warehouse oriented techniques

 machine learning

 statistics, visualization

 pattern recognition

 neural networks

A sophisticated data mining system will often adopt multiple data mining techniques or work
out an effective, integrated technique that combines the merits of a few individual
approaches.
4. Classification according to the applications adapted:
Data mining systems can also be categorised according to the applications they adapt.
For example, data mining systems may be tailored specifically for finance,
telecommunications, DNA, stock markets, e-mail, and so on.
Different applications often require the integration of application-specific methods.
Therefore, a generic, all-purpose data mining system may not fit domain-specific
mining tasks.

3. MAJOR ISSUES IN DATA MINING.

Data mining systems face a number of challenges and issues in today's world; some of them are:

1 Mining methodology and user interaction issues

2 Performance issues

3 Issues relating to the diversity of database types

1 Mining methodology and user interaction issues:


Mining different kinds of knowledge in databases:

Different users need different kinds of knowledge, presented in different ways. Because
different clients want different kinds of information, it becomes difficult to cover the vast
range of knowledge that can meet client requirements.

Interactive mining of knowledge at multiple levels of abstraction:

Interactive mining allows users to focus the search for patterns from different angles. The data
mining process should be interactive because it is difficult to know in advance what can be
discovered within a database.

Incorporation of background knowledge:

Background knowledge is used to guide discovery process and to express the discovered
patterns.

Query languages and ad hoc mining:

Relational query languages (such as SQL) allow users to pose ad hoc queries for data
retrieval. A data mining query language needs to be well integrated with the query language
of the data warehouse.

Handling noisy or incomplete data:

In a large database, many of the attribute values may be incorrect. This may be due to human
error or instrument failure. Data cleaning methods and data analysis methods that can handle
noisy data are needed.
2 Performance issues
Efficiency and scalability of data mining algorithms:

To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms:

The huge size of many databases, the wide distribution of data, and complexity of some data
mining methods are factors motivating the development of parallel and distributed data
mining algorithms. Such algorithms divide the data into partitions, which are processed in
parallel.

3 Issues relating to the diversity of database types:


Handling of relational and complex types of data:

There are many kinds of data stored in databases and data warehouses. It is not possible for
one system to mine all these kinds of data, so different data mining systems should be
constructed for different kinds of data.

Mining information from heterogeneous databases and global information systems:

Data is fetched from different data sources on a Local Area Network (LAN) or Wide
Area Network (WAN). The discovery of knowledge from these different sources of structured,
semi-structured, or unstructured data is a great challenge to data mining.

UNIT – I SECOND HALF

1. DATA PREPROCESSING
Why preprocessing ?

1. Real world data are generally

o Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data

o Noisy: containing errors or outliers

o Inconsistent: containing discrepancies in codes or names

2. Tasks in data preprocessing

o Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies.

o Data integration: using multiple databases, data cubes, or files.


o Data transformation: normalization and aggregation.

o Data reduction: reducing the volume but producing the same or similar analytical
results.

o Data discretization: part of data reduction, replacing numerical attributes with
nominal ones.

Data Cleaning in Data Mining

The quality of your data is critical to the final analysis. Data that is incomplete, noisy, or
inconsistent can affect your results.

Data cleaning in data mining is the process of detecting and removing corrupt or inaccurate
records from a record set, table or database.

Some data cleaning methods:-

1 You can ignore the tuple. This is usually done when the class label is missing. This method is not very
effective unless the tuple contains several attributes with missing values.

2 You can fill in the missing value manually. This approach is effective on small data set with
some missing values.

3 You can replace all missing attribute values with global constant, such as a label like
“Unknown” or minus infinity.

4 You can use the attribute mean to fill in the missing value. For example, if the average
customer income is 25,000, you can use this value to replace a missing value for income.

5 Use the most probable value to fill in the missing value.

Noisy Data

Noise is a random error or variance in a measured variable. Noisy Data may be due to faulty
data collection instruments, data entry problems and technology limitation.

How to Handle Noisy Data?

Binning:

Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values
around it. The sorted values are distributed into a number of “buckets,” or bins.

For example

Price = 4, 8, 15, 21, 21, 24, 25, 28, 34


Partition into (equal-frequency) bins:

Bin a: 4, 8, 15

Bin b: 21, 21, 24

Bin c: 25, 28, 34

In this example, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3.

Smoothing by bin means:

Bin a: 9, 9, 9

Bin b: 22, 22, 22

Bin c: 29, 29, 29

In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.

Smoothing by bin boundaries:

Bin a: 4, 4, 15

Bin b: 21, 21, 24

Bin c: 25, 25, 34

In smoothing by bin boundaries, each bin value is replaced by the closest boundary value.
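
The smoothing steps above can be reproduced with a short Python sketch (an illustration, not from the notes):

# Equal-frequency binning of the sorted prices, then smoothing by bin means
# and by bin boundaries, matching the worked example above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]          # already sorted

bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: replace every value with the mean of its bin
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: replace every value with the closest bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]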

Regression

Data can be smoothed by fitting the data to a regression function.

Clustering:

Outliers may be detected by clustering, where similar values are organized into groups, or
“clusters.” Values that fall outside of the set of clusters may be considered outliers.

Data Transformation In Data Mining

In the data transformation process, data are transformed from one format to another format that is
more appropriate for data mining.

Some Data Transformation Strategies:-


1 Smoothing

Smoothing is a process of removing noise from the data.

2 Aggregation

Aggregation is a process where summary or aggregation operations are applied to the data.

3 Generalization

In generalization, low-level data are replaced with high-level data by using concept
hierarchy climbing.

4 Normalization

Normalization scales attribute data so that it falls within a small specified range, such as 0.0 to
1.0 (see the sketch after this list).

5 Attribute Construction

In Attribute construction, new attributes are constructed from the given set of attributes.
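
As a small illustration of the normalization strategy mentioned above, here is a min-max normalization sketch in Python (an assumption about how one might implement it, not part of the notes):

# Min-max normalization: rescale attribute values into a new range, e.g. 0.0 to 1.0.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

incomes = [12000, 25000, 47000, 98000]        # hypothetical attribute values
print(min_max_normalize(incomes))             # every value now lies between 0.0 and 1.0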

Data Reduction In Data Mining

A database or data warehouse may store terabytes of data, so it may take a very long time to
perform data analysis and mining on such huge amounts of data.

Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume but still contains critical information.

Data Reduction Strategies:-


1 Data Cube Aggregation

Aggregation operations are applied to the data in the construction of a data cube.
2 Dimensionality Reduction

In dimensionality reduction redundant attributes are detected and removed which reduce the
data set size.

3 Data Compression

Encoding mechanisms are used to reduce the data set size.

4 Numerosity Reduction

In numerosity reduction, the data are replaced or estimated by alternative, smaller data representations.

5 Discretisation and concept hierarchy generation

Where raw data values for attributes are replaced by ranges or higher conceptual levels.

Data Discretization and Concept Hierarchy Generation

Data discretization techniques can be used to divide the range of a continuous attribute into
intervals. Numerous continuous attribute values are replaced by a small number of interval labels.

This leads to a concise, easy-to-use, knowledge-level representation of mining results.

Top-down discretization

If the process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range, and then repeats this recursively on the resulting intervals, then
it is called top-down discretization or splitting.

Bottom-up discretization

If the process starts by considering all of the continuous values as potential split-points,
removes some by merging neighborhood values to form intervals, then it is called bottom-up
discretization or merging.

Discretization can be performed rapidly on an attribute to provide a hierarchical partitioning


of the attribute values, known as a concept hierarchy.

Concept hierarchies

Concept hierarchies can be used to reduce the data by collecting and replacing low-level
concepts with higher-level concepts.

In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives.
Data mining on a reduced data set means fewer input/output operations and is more efficient
than mining on a larger data set.

Because of these benefits, discretization techniques and concept hierarchies are typically
applied before data mining, rather than during mining.

Discretization and Concept Hierarchy Generation for Numerical Data

Typical methods
1 Binning

Binning is a top-down splitting technique based on a specified number of bins. Binning is an
unsupervised discretization technique.

2 Histogram Analysis

Because histogram analysis does not use class information, it is an unsupervised
discretization technique. Histograms partition the values of an attribute into disjoint ranges
called buckets.

3 Cluster Analysis

Cluster analysis is a popular data discretization method. A clustering algorithm can be applied
to discretize a numerical attribute, A, by partitioning the values of A into clusters or groups.

Each initial cluster or partition may be further decomposed into several subclusters, forming
a lower level of the hierarchy.

UNIT – II

1. Data Warehouse Architecture

Data warehouses and their architectures vary depending upon the situation:-

Three-Tier Data Warehouse Architecture

Data warehouses often adopt a three – tier architecture,

1 Bottom tier
2 Middle tier

3 Top tier

1 Bottom tier

The bottom tier of the architecture is the data warehouse database server. It is typically a relational
database system. Data is fed into the bottom tier by back-end tools and utilities.

Function performed by back end tools and utilities are:-

1 Data Extraction

2 Data Cleaning

3 Data Transformation

4 Load

5 Refresh

2 Middle tier

The middle tier is an OLAP server that presents users with multidimensional data from the data
warehouse. The OLAP server can be implemented using either the relational OLAP
(ROLAP) model or the multidimensional OLAP (MOLAP) model.

Relational OLAP (ROLAP) model:-

ROLAP is an extended relational database management system. The ROLAP maps the
operations on multidimensional data to standard relational operations.

Multidimensional OLAP (MOLAP) model:-

MOLAP directly implements multidimensional data and operations.


3 Top tier

The top tier is a front-end client layer.

The top tier layer holds following tool:-

Query and Reporting tools - Production reporting tool.

Analysis tools - Prepare charts based on analysis.

Data mining tools - Discover hidden knowledge, pattern.

A Business Analysis Framework

A data warehouse provides critical information to business analysts to measure the performance of
the business and to help in the process of decision making. With a data warehouse, business
productivity can be enhanced because business analysts can quickly gather information that
describes the organization.

A data warehouse provides a customer view, which helps to improve customer relationships. A data
warehouse can also help in cost reduction by tracking trends and patterns.

Different data warehouse views:-


Top-down view

The top-down view allows the selection of the relevant information necessary for the data
warehouse.

Data source view

The data source view presents information being captured, stored, and managed by operational
systems. This information may be documented at various levels of detail and accuracy, from
individual data source tables to integrated data source tables.

Data warehouse view

The data warehouse view includes fact tables and dimension tables. This view represents the
information that is stored inside the data warehouse.

Business query view

The business query view is the view of data in the data warehouse from the viewpoint of the end user.

2. DIFFERENCE BETWEEN OLTP AND OLAP


OLTP and OLAP are both online processing systems. OLTP is a transaction processing system
while OLAP is an analytical processing system. OLTP is a system that manages transaction-
oriented applications on the internet, for example, an ATM. OLAP is an online system that
answers multidimensional analytical queries such as financial reporting, forecasting, etc. The
basic difference between OLTP and OLAP is that OLTP is an online database modifying
system, whereas OLAP is an online database query answering system.

There are some other differences between OLTP and OLAP which I have explained using the
comparison chart shown below.

OLTP Vs OLAP

1. Comparison Chart

2. Definition

3. Key Differences

Comparison Chart

Basic: OLTP is an online transactional system and manages database modification. OLAP is an online data retrieving and data analysis system.

Focus: OLTP inserts, updates, and deletes information from the database. OLAP extracts data for analysis that helps in decision making.

Data: OLTP and its transactions are the original source of data. Different OLTP databases become the source of data for OLAP.

Transaction: OLTP has short transactions. OLAP has long transactions.

Time: The processing time of a transaction is comparatively less in OLTP and comparatively more in OLAP.

Queries: OLTP uses simpler queries. OLAP uses complex queries.

Normalization: Tables in an OLTP database are normalized (3NF). Tables in an OLAP database are not normalized.

Integrity: The OLTP database must maintain data integrity constraints. The OLAP database is not frequently modified, so data integrity is not affected.

Definition of OLTP

OLTP is an Online Transaction Processing system. The main focus of an OLTP system is to
record the current updates, insertions, and deletions during transactions. OLTP queries are
simpler and shorter and hence require less processing time and less space.

The OLTP database gets updated frequently. It may happen that a transaction in OLTP fails
midway, which may affect data integrity. So it has to take special care of data integrity.
OLTP database has normalized tables (3NF).
The best example for OLTP system is an ATM, in which using short transactions we modify
the status of our account. OLTP system becomes the source of data for OLAP.

Definition of OLAP

OLAP is an Online Analytical Processing system. OLAP database stores historical data that
has been inputted by OLTP. It allows a user to view different summaries of multi-dimensional
data. Using OLAP, you can extract information from a large database and analyze it for
decision making.

OLAP also allows a user to execute complex queries to extract multidimensional data. In
OLAP, even if a transaction fails midway it will not harm data integrity, as the user only uses the
OLAP system to retrieve data from a large database for analysis. The user can simply fire the
query again and extract the data for analysis.

The transactions in OLAP are long and hence take comparatively more time for processing
and require more space. The transactions in OLAP are less frequent as compared to OLTP.
Even the tables in OLAP database may not be normalized. The example for OLAP is to view
a financial report, or budgeting, marketing management, sales report, etc.

Key Differences Between OLTP and OLAP

1. The point that distinguishes OLTP and OLAP is that OLTP is an online transaction system
whereas, OLAP is an online data retrieval and analysis system.

2. Online transactional data becomes the source of data for OLTP. However, the different OLTPs
database becomes the source of data for OLAP.

3. OLTP’s main operations are insert, update and delete whereas, OLAP’s main operation is to
extract multi dimensional data for analysis.

4. OLTP has short but frequent transactions whereas OLAP has long and less frequent
transactions.

5. Processing time for the OLAP’s transaction is more as compared to OLTP.

6. OLAP's queries are more complex with respect to OLTP's.

7. The tables in OLTP database must be normalized (3NF) whereas, the tables in OLAP database
may not be normalized.

8. As OLTP frequently executes transactions in the database, if any transaction fails midway
it may harm data integrity, and hence OLTP must take care of data integrity. In OLAP the
transactions are less frequent, so data integrity is less of a concern.

3. OLAP OPERATIONS

OLAP stands for Online Analytical Processing. It is a software technology that
allows users to analyze information from multiple database systems at the same time. It is
based on the multidimensional data model and allows the user to query multi-dimensional
data (e.g. Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes,
and these cubes are known as hyper-cubes.

OLAP operations:

There are five basic analytical operations that can be performed on an OLAP cube:

1. Drill down: In drill-down operation, the less detailed data is converted into highly detailed
data. It can be done by:

o Moving down in the concept hierarchy

o Adding a new dimension

In the cube given in overview section, the drill down operation is performed by
moving down in the concept hierarchy of Time dimension (Quarter -> Month).
2. Roll up: It is just opposite of the drill-down operation. It performs aggregation on the OLAP
cube. It can be done by:

o Climbing up in the concept hierarchy

o Reducing the dimensions In the cube given in the overview section, the roll-up
operation is performed by climbing up in the concept hierarchy of Location dimension (City
-> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In the
cube given in the overview section, a sub-cube is selected by selecting following dimensions
with criteria:

o Location = “Delhi” or “Kolkata”

o Time = “Q1” or “Q2”

o Item = “Car” or “Bus”

4. Slice: It selects a single dimension from the OLAP cube which results in a new sub-cube
creation. In the cube given in the overview section, Slice is performed on the dimension Time
= “Q1”.

5. Pivot: It is also known as the rotation operation as it rotates the current view to get a new view
of the representation. In the sub-cube obtained after the slice operation, performing the pivot
operation gives a new view of it.
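
As a hedged illustration (not part of the notes), the pandas sketch below mimics these operations on a toy sales table with Location, Time, and Item dimensions; the figures are hypothetical:

# Toy "cube" as a flat table; roll-up, slice, dice, and pivot expressed with pandas.
import pandas as pd

sales = pd.DataFrame({
    "City":    ["Delhi", "Delhi", "Kolkata", "Kolkata"],
    "Country": ["India", "India", "India", "India"],
    "Quarter": ["Q1", "Q2", "Q1", "Q2"],
    "Item":    ["Car", "Bus", "Car", "Bus"],
    "Amount":  [100, 80, 120, 90],                       # hypothetical sales figures
})

# Roll-up: aggregate along the Location hierarchy (City -> Country)
rollup = sales.groupby(["Country", "Quarter"])["Amount"].sum()

# Drill-down would go the other way (e.g. Quarter -> Month) if month-level data were present.

# Slice: fix a single dimension value (Time = "Q1")
slice_q1 = sales[sales["Quarter"] == "Q1"]

# Dice: select a sub-cube using criteria on two or more dimensions
dice = sales[sales["City"].isin(["Delhi", "Kolkata"]) & sales["Item"].isin(["Car", "Bus"])]

# Pivot: rotate the view, Cities as rows and Quarters as columns
pivot = sales.pivot_table(index="City", columns="Quarter", values="Amount", aggfunc="sum")
print(rollup, slice_q1, dice, pivot, sep="\n\n")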

4. OLAP SERVERS

Types of OLAP Servers


Three types of OLAP servers are:-

1 Relational OLAP (ROLAP)

2 Multidimensional OLAP (MOLAP)

3 Hybrid OLAP (HOLAP)

1 Relational OLAP (ROLAP)

Relational On-Line Analytical Processing (ROLAP) works mainly on data that resides in a
relational database, where the base data and dimension tables are stored as relational tables.
ROLAP servers are placed between the relational back-end server and client front-end tools.
ROLAP servers use an RDBMS to store and manage warehouse data, and OLAP middleware to
support missing pieces.
Advantages of ROLAP

1 ROLAP can handle large amounts of data.

2 Can be used with data warehouse and OLTP systems.

Disadvantages of ROLAP

1 Limited by SQL functionalities.

2 Hard to maintain aggregate tables.

2 Multidimensional OLAP (MOLAP)

Multidimensional On-Line Analytical Processing (MOLAP) supports multidimensional views
of data through array-based multidimensional storage engines. With multidimensional data
stores, the storage utilization may be low if the data set is sparse.

Advantages of MOLAP

1 Optimal for slice and dice operations.

2 Performs better than ROLAP when data is dense.

3 Can perform complex calculations.

Disadvantages of MOLAP

1 Difficult to change dimension without re-aggregation.

2 MOLAP can handle limited amount of data.

3 Hybrid OLAP (HOLAP)

Hybrid On-Line Analytical Processing (HOLAP) is a combination of ROLAP and MOLAP.
HOLAP provides the greater scalability of ROLAP and the faster computation of MOLAP.

Advantages of HOLAP

1 HOLAP provides the advantages of both MOLAP and ROLAP.

2 Provide fast access at all levels of aggregation.

Disadvantages of HOLAP

1 The HOLAP architecture is very complex because it supports both MOLAP and ROLAP servers.
Unit- III

1. Mining Frequent Patterns, Associations, and Correlations

In this chapter, we will learn how to mine frequent patterns, association rules, and correlation
rules when working with R programs. Then, we will evaluate all these methods with
benchmark data to determine the interestingness of the frequent patterns and rules. We will
cover the following topics in this chapter:

 Introduction to associations and patterns

 Market basket analysis

 Hybrid association rules mining

 Mining sequence datasets

 High-performance algorithms

The algorithms to find frequent items from various data types can be applied to numeric or
categorical data. Most of these algorithms have one common basic algorithmic form, which is
A-Priori, depending on certain circumstances. Another basic algorithm is FP-Growth, which
is similar to A-Priori. Most pattern-related mining algorithms derive from these basic
algorithms.

With frequent patterns found as one input, many algorithms are designed to find association
and correlation rules. Each algorithm is only a variation from the basic algorithm.

Along with the growth, size, and types of datasets from various domains, new algorithms are
designed, such as the multistage algorithm, the multihash algorithm, and the limited-pass
algorithm.

An overview of associations and patterns

One popular task for data mining is to find relations among the source dataset; this is based
on searching frequent patterns from various data sources, such as market baskets, graphs, and
streams.

All the algorithms illustrated in this chapter are written from scratch in the R language for the
purpose of explaining association analysis, and the code will be demonstrated using the
standard R packages for the algorithms such as arules.

Patterns and pattern discovery

With many applications across a broad field, frequent pattern mining is often used in solving
various problems, such as the market investigation for a shopping mall from the transaction
data.
Frequent patterns are the ones that often occur in the source dataset. The dataset types for
frequent pattern mining can be itemset, subsequence, or substructure. As a result, the frequent
patterns found are known as:

 Frequent itemset

 Frequent subsequence

 Frequent substructures

These three frequent patterns will be discussed in detail in the upcoming sections.

These newly found frequent patterns will serve as an important platform when searching
for recurring interesting rules or relationships among the given dataset.

Various patterns are proposed to improve the efficiency of mining on a dataset. Some of them
are as follows; they will be defined in detail later:

 Closed patterns

 Maximal patterns

 Approximate patterns

 Condensed patterns

 Discriminative frequent patterns

The frequent itemset

The frequent itemset originated from true market basket analysis. In a store such as Amazon,
there are many orders or transactions; a certain customer performs a transaction where their
Amazon shopping cart includes some items. The mass result of all customers' transactions
can be used by the storeowner to find out what items are purchased together by customers. As
a simple definition, itemset denotes a collection of zero or more items.

We call a transaction a basket, and a set of items can belong to any basket. We will set the
variable s as the support threshold, which is compared with the count of a certain set of items
that appear in all the baskets. If the count of a certain set of items that appear in all the
baskets is not less than s, we would call the itemset a frequent itemset.

An itemset is called a k-itemset if it contains k pieces of items, where k is a non-zero integer.

The support count of an itemset X, denoted support_count(X), is the number of transactions in the
dataset that contain X.

For a predefined minimum support threshold s, the itemset X is a frequent itemset if
support_count(X) >= s. The minimum support threshold s is a customizable
parameter, which can be adjusted by domain experts or from experience.
The frequent itemset is also used in many domains. Some of them are shown in the following
table:

Items Baskets Comments

Related concepts Words Documents

Plagiarism Documents Sentences

Biomarkers Biomarkers and diseases The set of data about a patient

If an itemset is frequent, then all of its subsets must also be frequent. This is known as the A-
Priori principle, the foundation of the A-Priori algorithm. The direct application of the A-
Priori principle is to prune the huge number of candidate itemsets.

One important factor that affects the number of frequent itemsets is the minimum support
count: the lower the minimum support count, the larger the number of frequent itemsets.

For the purpose of optimizing the frequent itemset-generation algorithm, some more concepts
are proposed:

 An itemset X is closed in dataset S if there exists no proper superset of X with the same
support count in S; X is also called a closed itemset. If, in addition, X is frequent, then X is a
closed frequent itemset.

 An itemset X is a maximal frequent itemset if X is frequent and there exists no frequent
superset of X; in other words, X does not have frequent supersets.

 An itemset X is considered a constrained frequent itemset once the frequent itemset satisfies
the user-specified constraints.

 An itemset X is an approximate frequent itemset if X derives only approximate support
counts for the mined frequent itemsets.

 An itemset X is a top-k frequent itemset in the dataset S if X is among the k most frequent
itemsets, given a user-defined value k.

The following example is of a transaction dataset in which all itemsets contain items only from a
given item set, with the minimum support count assumed to be 3.

(The original table listed transactions T001 through T010 along with the items in each transaction,
and then stated the resulting frequent itemsets; the item lists are not reproduced in this copy.)

The frequent subsequence

A frequent sequence is an ordered list of elements where each element contains at least one
event. An example of this is the page-visit sequence on a website, that is, the order in which a
certain user visits web pages. Here are two examples of the frequent subsequence:
 Customer: Successive shopping records of certain customers in a shopping mart serves as
the sequence, each item bought serves as the event item, and all the items bought by a
customer in one shopping are treated as elements or transactions

 Web usage data: A user's visit history on the WWW is treated as a sequence, each
UI/page serves as the event or item, and the element or transaction can be defined as the
pages visited by the user with one click of the mouse

The length of a sequence is defined by the number of items contained in the sequence. A
sequence of length k is called a k-sequence. The size of a sequence is defined by the number
of itemsets in the sequence. We call a sequence s a subsequence of a sequence t (and t a
supersequence of s) when every element of s is contained, in the same order, in some element of t.

The frequent substructures

In some domains, the tasks under research can be modeled with a graph theory. As a result,
there are requirements for mining common subgraphs (subtrees or sublattices); some
examples are as follows:

 Web mining: Web pages are treated as the vertices of graph, links between pages serve as
edges, and a user's page-visiting records construct the graph.

 Network computing: Any device with computation ability on the network serves as the
vertex, and the interconnection between these devices serves as the edge. The whole
network that is made up of these devices and interconnections is treated as a graph.

 Semantic web: XML elements serve as the vertices, and the parent/child relations between
them are edges; all these XML files are treated as graphs.

A graph G is represented by G = (V, E), where V represents a group of vertices and E
represents a group of edges. A graph G' = (V', E') is called a subgraph of graph G = (V, E)
once V' is a subset of V and E' is a subset of E. For example, a subgraph can be obtained from
an original graph by omitting some of its edges (or, in other circumstances, some of its vertices).
Relationship or rules discovery

Mining of association rules is based on the frequent patterns found. Different emphases
on the interestingness of relations lead to two types of relations for further research:
association rules and correlation rules.

Association rules

In a later section, a method to show association analysis is illustrated; this is a useful method
to discover interesting relationships within a huge dataset. The relations can be represented in
the form of association rules or frequent itemsets.

Association rule mining finds the resulting rule set on a given dataset (a transaction dataset
or other sequence-pattern-type dataset), a predefined minimum support count s, and a
predefined minimum confidence c, such that any found rule X => Y satisfies
support(X => Y) >= s and confidence(X => Y) >= c.

X => Y is an association rule where X and Y are disjoint itemsets (X ∩ Y = ∅). The interesting
thing about this rule is that it is measured by its support and confidence. Support means the
frequency with which this rule appears in the dataset, and confidence means the probability of
the appearance of Y when X is present.

For association rules, the key measures of rule interestingness are rule support and
confidence. Their relationship is given as follows:

support(X => Y) = P(X ∪ Y)
confidence(X => Y) = P(Y | X) = support_count(X ∪ Y) / support_count(X)

support_count(X) is the number of itemsets (transactions) in the dataset that contain X.
As a convention, the confidence and support values are represented as percentages between 0 and 100.

The association rule is strong once support(X => Y) >= s and confidence(X => Y) >= c, where s is
the predefined minimum support threshold and c is the predefined minimum confidence threshold.

The meaning of the found association rules should be explained with caution, especially
when there is not enough evidence to judge whether the rule implies causality. A rule only shows
the co-occurrence of the prefix and postfix of the rule. The following are the different kinds of rules
you can come across:

 A rule is a Boolean association rule if it concerns associations between the presence or absence of items

 A rule is a single-dimensional association if there is, at the most, only one dimension referred
to in the rules

 A rule is a multidimensional association rule if there are at least two dimensions referred to
in the rules

 A rule is a correlation-association rule if the relations or rules are measured by statistical
correlation, which, once passed, leads to a correlation rule

 A rule is a quantitative-association rule if at least one item or attribute contained in it is
quantitative

Correlation rules

In some situations, the support and confidence pairs are not sufficient to filter uninteresting
association rules. In such a case, we will use support count, confidence, and correlations to
filter association rules.

There are a lot of methods to calculate the correlation of an association rule, such as
chi-squared analysis, all-confidence analysis, and cosine. For a k-itemset X = {i1, i2, ..., ik},
the all-confidence value of X is defined as:

all_confidence(X) = support(X) / max{ support(i1), support(i2), ..., support(ik) }
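
A tiny Python sketch (an illustration, not from the chapter) of the all-confidence measure over a hypothetical transaction list:

# all_confidence(X) = support(X) / max support of any single item of X
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def all_confidence(itemset):
    return support(itemset) / max(support({i}) for i in itemset)

print(all_confidence({"a", "b"}))   # 0.5 / 0.75 = 0.666...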

III – Unit Second Half

1. Classification by decision tree induction and its algorithm


A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf
node holds a class label. The topmost node in the tree is the root node.

A typical decision tree for the concept buy_computer indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a test
on an attribute. Each leaf node represents a class.

The benefits of having a decision tree are as follows −

 It does not require any domain knowledge.

 It is easy to comprehend.

 The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm

A machine learning researcher named J. Ross Quinlan developed, in 1980, a decision tree algorithm
known as ID3 (Iterative Dichotomiser). Later, he presented C4.5, the successor of
ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the
trees are constructed in a top-down, recursive, divide-and-conquer manner.

Generating a decision tree from training tuples of data partition D


Algorithm : Generate_decision_tree

Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a split point or a splitting subset.
Output:
A decision tree

Method:
create a node N;

if tuples in D are all of the same class, C, then
return N as a leaf node labeled with class C;

if attribute_list is empty then
return N as a leaf node labeled with
the majority class in D; // majority voting

apply Attribute_selection_method(D, attribute_list)
to find the best splitting_criterion;
label node N with splitting_criterion;

if splitting_attribute is discrete-valued and
multiway splits allowed then // not restricted to binary trees
attribute_list = attribute_list - splitting_attribute; // remove splitting_attribute

for each outcome j of splitting_criterion
// partition the tuples and grow subtrees for each partition
let Dj be the set of data tuples in D satisfying outcome j; // a partition

if Dj is empty then
attach a leaf labeled with the majority
class in D to node N;
else
attach the node returned by
Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
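
The Attribute_selection_method step can be illustrated with a short Python sketch (an assumption, not the textbook's code) that picks the attribute with the highest information gain, the measure used by ID3, on a hypothetical training set:

# Information-gain based attribute selection on a toy dataset.
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    # Entropy reduction obtained by splitting on the attribute at attr_index
    base = entropy(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return base - remainder

# Hypothetical training tuples: (age, student) -> buys_computer
rows = [("youth", "no"), ("youth", "yes"), ("middle", "no"), ("senior", "yes")]
labels = ["no", "yes", "yes", "yes"]

best = max(range(len(rows[0])), key=lambda i: information_gain(rows, labels, i))
print("best splitting attribute index:", best)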

2. BAYES THEOREM, NAIVE BAYES

Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are the statistical
classifiers. Bayesian classifiers can predict class membership probabilities such as the
probability that a given tuple belongs to a particular class.

Baye's Theorem

Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −

 Posterior Probability [P(H/X)]

 Prior Probability [P(H)]

where X is data tuple and H is some hypothesis.

According to Bayes' Theorem,

P(H/X)= P(X/H)P(H) / P(X)
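
As a hedged illustration (not from the notes), the posterior P(H|X) can be computed directly from assumed prior and likelihood values, which is also the core of a naive Bayesian classifier:

# Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X), with P(X) = sum over hypotheses of P(X|H) P(H).
priors = {"buys_computer=yes": 0.6, "buys_computer=no": 0.4}   # P(H), hypothetical values
likelihoods = {                                                # P(X|H) for a given tuple X, assumed
    "buys_computer=yes": 0.7,
    "buys_computer=no": 0.2,
}

evidence = sum(likelihoods[h] * priors[h] for h in priors)               # P(X)
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}  # P(H|X)

print(posteriors)
print("predicted class:", max(posteriors, key=posteriors.get))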


Bayesian Belief Network

Bayesian Belief Networks specify joint conditional probability distributions. They are also
known as Belief Networks, Bayesian Networks, or Probabilistic Networks.

 A Belief Network allows class conditional independencies to be defined between
subsets of variables.

 It provides a graphical model of causal relationships on which learning can be
performed.

 We can use a trained Bayesian Network for classification.

There are two components that define a Bayesian Belief Network −

 Directed acyclic graph

 A set of conditional probability tables

Directed Acyclic Graph

 Each node in a directed acyclic graph represents a random variable.

 These variables may be discrete or continuous valued.

 These variables may correspond to the actual attribute given in the data.

Directed Acyclic Graph Representation

The following diagram shows a directed acyclic graph for six Boolean variables.

The arc in the diagram allows representation of causal knowledge. For example, lung cancer
is influenced by a person's family history of lung cancer, as well as whether or not the person
is a smoker. It is worth noting that the variable PositiveXray is independent of whether the
patient has a family history of lung cancer or that the patient is a smoker, given that we know
the patient has lung cancer.
Conditional Probability Table

The conditional probability table for the variable LungCancer (LC) gives the probability of each
value of LC for every possible combination of the values of its parent nodes, FamilyHistory (FH)
and Smoker (S).

3. RULES BASED CLASSIFICATION

IF-THEN Rules

A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a
rule in the following form −

IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes


THEN buy_computer = yes

Points to remember −

 The IF part of the rule is called rule antecedent or precondition.

 The THEN part of the rule is called rule consequent.

 The antecedent part, the condition, consists of one or more attribute tests, and these tests
are logically ANDed.

 The consequent part consists of class prediction.

Note − We can also write rule R1 as follows −

R1: (age = youth) ^ (student = yes) => (buys_computer = yes)

If the condition holds true for a given tuple, then the antecedent is satisfied.
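
A tiny Python sketch (an illustration, not from the notes) of applying rule R1 to a tuple:

# Rule R1: IF age = youth AND student = yes THEN buys_computer = yes
def r1_antecedent(t):
    return t["age"] == "youth" and t["student"] == "yes"

def classify_with_r1(t):
    # If the antecedent is satisfied, the consequent gives the class prediction;
    # otherwise rule R1 simply does not cover this tuple.
    return "yes" if r1_antecedent(t) else "not covered by R1"

print(classify_with_r1({"age": "youth", "student": "yes"}))    # antecedent satisfied -> yes
print(classify_with_r1({"age": "senior", "student": "yes"}))   # R1 does not fire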

Rule Extraction

Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a
decision tree.

Points to remember −

To extract a rule from a decision tree −


 One rule is created for each path from the root to the leaf node.

 To form a rule antecedent, each splitting criterion is logically ANDed.

 The leaf node holds the class prediction, forming the rule consequent.

Rule Induction Using Sequential Covering Algorithm

The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data.
We do not need to generate a decision tree first. In this algorithm, each rule for a given
class covers many of the tuples of that class.

Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general
strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by
the rule are removed, and the process continues for the rest of the tuples.

Note − Decision tree induction can be considered as learning a set of rules
simultaneously, because the path to each leaf in a decision tree corresponds to a rule.

The following is the sequential learning algorithm where rules are learned for one class at a
time. When learning a rule for a class Ci, we want the rule to cover all the tuples from class
Ci only and no tuple from any other class.

Algorithm: Sequential Covering

Input:
D, a data set class-labeled tuples,
Att_vals, the set of all attributes and their possible values.

Output: A Set of IF-THEN rules.


Method:
Rule_set={ }; // initial set of rules learned is empty

for each class c do

repeat
Rule = Learn_One_Rule(D, Att_vals, c);
remove tuples covered by Rule from D;
until termination condition;

Rule_set=Rule_set+Rule; // add a new rule to rule-set


end for
return Rule_Set;

Rule Pruning

Rules are pruned for the following reasons −

 The assessment of quality is made on the original set of training data. The rule may
perform well on training data but less well on subsequent data. That is why rule
pruning is required.
 The rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R
has greater quality, as assessed on an independent set of tuples.

FOIL is one of the simple and effective methods for rule pruning. For a given rule R,

FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos and neg are the number of positive and negative tuples covered by R, respectively.
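
A one-function Python sketch of the FOIL_Prune measure above (the counts are hypothetical):

def foil_prune(pos, neg):
    # FOIL_Prune(R) = (pos - neg) / (pos + neg) for a rule R covering
    # pos positive and neg negative tuples on the pruning set.
    return (pos - neg) / (pos + neg)

print(foil_prune(pos=8, neg=2))   # 0.6
print(foil_prune(pos=8, neg=4))   # 0.33..., lower quality, so pruning may help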

4. APRIORI ALGORITHM AND ASSOCIATION RULES

Without further ado, let's start talking about the Apriori algorithm. It is a classic algorithm
used in data mining for learning association rules. It is nowhere as complex as it sounds; on the
contrary, it is very simple. Let me give you an example to explain it. Suppose you have records of
a large number of transactions at a shopping center as follows:
Transactions Items bought
T1 Item1, item2, item3
T2 Item1, item2
T3 Item2, item5
T4 Item1, item2, item5
Learning association rules basically means finding the items that are purchased together more
frequently than others. For example, in the above table you can see that Item1 and Item2 are
bought together frequently.

What is the use of learning association rules?


 Shopping centers use association rules to place items next to each other so that users buy
more items. If you are familiar with data mining you would know about the famous
beer-diapers-Wal-Mart story. Basically, Wal-Mart studied their data and found that on Friday
afternoons young American males who buy diapers also tend to buy beer. So Wal-Mart placed
beer next to diapers and beer sales went up. This is famous because no one would have predicted
such a result, and that is the power of data mining. You can Google this if you are interested in
further details.

 Also, if you are familiar with Amazon, they use association mining to recommend items based
on the current item you are browsing/buying.

 Another application is Google auto-complete, where after you type in a word it suggests
frequently associated words that users type after that particular word.
So, as I said, Apriori is the classic and probably the most basic algorithm for this. If you search
online you can easily find the pseudo-code and mathematical equations; here I would like to make
it intuitive and easy, without terminology or jargon, so that even a 10th or 12th grader can
understand it.

Let's start with a non-simple example:
Transaction ID Items Bought
T1 {Mango, Onion, Nintendo, Key-chain, Eggs, Yo-yo}
T2 {Doll, Onion, Nintendo, Key-chain, Eggs, Yo-yo}
T3 {Mango, Apple, Key-chain, Eggs}
T4 {Mango, Umbrella, Corn, Key-chain, Yo-yo}
T5 {Corn, Onion, Onion, Key-chain, Ice-cream, Eggs}
Now, we follow a simple golden rule: we say an item/itemset is frequently bought if it is bought
at least 60% of the time. So here it should be bought at least 3 times.

For simplicity, abbreviate each item by its first letter (M = Mango, O = Onion, and so on). The
original table becomes:
Transaction ID Items Bought
T1 {M, O, N, K, E, Y}
T2 {D, O, N, K, E, Y}
T3 {M, A, K, E}
T4 {M, U, C, K, Y}
T5 {C, O, O, K, I, E}
Step 1: Count the number of transactions in which each item occurs. Note that 'O = Onion' is
bought 4 times in total, but it occurs in just 3 transactions.
Item Number of transactions
M 3
O 3
N 2
K 5
E 4
Y 3
D 1
A 1
U 1
C 2
I 1
Step 2: Now remember we said an item is frequently bought if it is bought at least 3 times. So in
this step we remove all the items that are bought fewer than 3 times from the above table, and we
are left with:
Item Number of transactions
M 3
O 3
K 5
E 4
Y 3
These are the single items that are bought frequently. Now let's say we want to find pairs of items
that are bought frequently. We continue from the table in Step 2.
Step 3: We start making pairs from the first item, like MO, MK, ME, MY, and then continue with
the second item, like OK, OE, OY. We do not make OM because we already made MO when
making pairs with M, and buying a Mango and an Onion together is the same as buying an Onion
and a Mango together. After making all the pairs we get:
Item pairs
MO
MK
ME
MY
OK
OE
OY
KE
KY
EY
Step 4: Now we count how many times each pair is bought together. For example, M and O are
bought together only once, in {M,O,N,K,E,Y}, while M and K are bought together 3 times, in
{M,O,N,K,E,Y}, {M,A,K,E} and {M,U,C,K,Y}. After doing that for all the pairs we get:
Item Pairs Number of transactions
MO 1
MK 3
ME 2
MY 2
OK 3
OE 3
OY 2
KE 4
KY 3
EY 2
Step 5: Golden rule to the rescue. Remove all the item pairs with a number of transactions less
than three and we are left with:
Item Pairs Number of transactions
MK 3
OK 3
OE 3
KE 4
KY 3
These are the pairs of items frequently bought together. Now let's say we want to find a set of
three items that are bought together. We use the table in Step 5 and make sets of 3 items.
Step 6: To make the sets of three items we need one more rule (it is termed self-join). It simply
means that, from the item pairs in the above table, we find two pairs with the same first letter,
so we get:

 OK and OE, this gives OKE

 KE and KY, this gives KEY

Then we find how many times O, K, E are bought together in the original table, and the same for
K, E, Y, and we get the following table:
Item Set Number of transactions
OKE 3
KEY 2
While we are on this, suppose you have sets of 3 items, say ABC, ABD, ACD, ACE, BCD, and
you want to generate item sets of 4 items; you look for two sets having the same first two letters:

 ABC and ABD -> ABCD

 ACD and ACE -> ACDE

And so on. In general, you have to look for sets differing in just the last letter/item.
Step 7: So we again apply the golden rule: the item set must be bought together at least 3 times,
which leaves us with just OKE, since K, E, Y are bought together only two times.

Thus the set of three items that are bought together most frequently is O, K, E.
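
The whole worked example can be reproduced with a compact level-wise Apriori sketch in Python (my own illustration, not the blog's code; it should recover the same counts as the tables above):

# Level-wise Apriori on the transactions of the worked example.
from itertools import combinations

transactions = [
    {"M", "O", "N", "K", "E", "Y"},
    {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"},
    {"M", "U", "C", "K", "Y"},
    {"C", "O", "K", "I", "E"},   # the duplicate Onion in T5 counts once per transaction
]
min_support = 3   # the "golden rule": bought in at least 3 of the 5 transactions

def support(itemset):
    # number of transactions containing every item of the candidate itemset
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent single items
items = sorted({i for t in transactions for i in t})
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Lk: self-join L(k-1), prune candidates whose (k-1)-subsets are not all frequent, then count
k = 2
while frequent[-1]:
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level, sets in enumerate(frequent, start=1):
    for s in sorted(sets, key=sorted):
        print(level, sorted(s), support(s))
# Expected to match the hand counts, e.g. {K, E} -> 4 and {O, K, E} -> 3.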

UNIT – IV

1. CLUSTER ANALYSIS

What is Clustering?

Clustering is the process of making a group of abstract objects into classes of similar objects.

 A cluster of data objects can be treated as one group.

 While doing cluster analysis, we first partition the set of data into groups based on
data similarity and then assign the labels to the groups.

 The main advantage of clustering over classification is that it is adaptable to changes
and helps single out useful features that distinguish different groups.

Requirements of Clustering in Data Mining

The following points throw light on why clustering is required in data mining −

 Scalability − We need highly scalable clustering algorithms to deal with large
databases.

 Ability to deal with different kinds of attributes − Algorithms should be capable of
being applied to any kind of data, such as interval-based (numerical) data, categorical data,
and binary data.

 Discovery of clusters with arbitrary shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. It should not be bounded to only
distance measures that tend to find spherical clusters of small size.

 High dimensionality − The clustering algorithm should not only be able to handle
low-dimensional data but also the high dimensional space.

 Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.

 Interpretability − The clustering results should be interpretable, comprehensible, and
usable.
Clustering Methods

Clustering methods can be classified into the following categories −

 Partitioning Method

 Hierarchical Method

 Density-based Method

 Grid-Based Method

 Model-Based Method

 Constraint-based Method

Partitioning Method

Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’
partitions of the data. Each partition will represent a cluster and k ≤ n. This means that the
method classifies the data into k groups, which satisfy the following requirements −

 Each group contains at least one object.

 Each object must belong to exactly one group.

Points to remember −

 For a given number of partitions (say k), the partitioning method will create an initial
partitioning.

 Then it uses the iterative relocation technique to improve the partitioning by moving
objects from one group to another (see the k-means sketch below).
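
As a hedged sketch of one well-known partitioning method (k-means, which uses exactly this iterative relocation idea), the following Python code is an illustration rather than part of the notes:

# A compact k-means sketch on 2-D points.
import random

def kmeans(points, k, iterations=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)                  # initial partitioning
    clusters = []
    for _ in range(iterations):
        # assignment: each object goes to its closest center (exactly one group)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # relocation: move each center to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
    return centers, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(pts, k=2)
print(centers)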

Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here −

 Agglomerative Approach

 Divisive Approach

Agglomerative Approach

This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one
another. It keeps on doing so until all of the groups are merged into one or until the
termination condition holds.
Divisive Approach

This approach is also known as the top-down approach. In this, we start with all of the objects
in the same cluster. In each iteration, a cluster is split up into smaller clusters. This is done
until each object is in its own cluster or the termination condition holds. This method is rigid,
i.e., once a merging or splitting is done, it can never be undone.

Approaches to Improve Quality of Hierarchical Clustering

Here are the two approaches that are used to improve the quality of hierarchical clustering −

 Perform careful analysis of object linkages at each hierarchical partitioning.

 Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing macro-clustering
on the micro-clusters.

Density-based Method

This method is based on the notion of density. The basic idea is to continue growing the given
cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data
point within a given cluster, the radius of a given cluster has to contain at least a minimum
number of points.

Grid-based Method

In this, the objects together form a grid. The object space is quantized into finite number of
cells that form a grid structure.

Advantages

 The major advantage of this method is fast processing time.

 It is dependent only on the number of cells in each dimension in the quantized space.

Model-based methods

In this method, a model is hypothesized for each cluster to find the best fit of data for a given
model. This method locates the clusters by clustering the density function. It reflects spatial
distribution of the data points.

This method also provides a way to automatically determine the number of clusters based on
standard statistics, taking outlier or noise into account. It therefore yields robust clustering
methods.

Constraint-based Method

In this method, the clustering is performed by the incorporation of user- or application-
oriented constraints. A constraint refers to the user expectation or the properties of the desired
clustering results. Constraints provide us with an interactive way of communicating with the
clustering process. Constraints can be specified by the user or by the application requirement.

2. OUTLIER AND OUTLIER DETECTION:

An outlier is a rare occurrence within a given data set. In data science, an outlier is an
observation point that is distant from other observations. An outlier may be due to
variability in the measurement, or it may indicate experimental error.

Outliers, being the most extreme observations, may include the sample maximum or sample
minimum, or both, depending on whether they are extremely high or low. However, the
sample maximum and minimum are not always outliers because they may not be unusually
far from other observations.

[Figure: a simplistic representation of an outlier (not reproduced here).]

What is an Outlier?

While outliers are attributed to rare chance and may not necessarily be fully explainable,
outliers in data can distort predictions and affect accuracy if you don't detect and handle
them.

The contentious decision to consider or discard an outlier needs to be taken at the time of
building the model. Outliers can drastically bias/change the fit estimates and predictions. It is
left to the best judgement of the analyst to decide whether treating outliers is necessary and
how to go about it.

Treating or altering the outlier/extreme values in genuine observations is not a standard
operating procedure. If a data point (or points) is excluded from the data analysis, this should
be clearly stated in any subsequent report.
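
One common, non-authoritative way to flag candidate outliers before making that judgement is the interquartile-range rule, sketched below; the data and the conventional 1.5 multiplier are only illustrative assumptions:

# Minimal sketch: flagging candidate outliers with the 1.5 * IQR rule (NumPy assumed).
import numpy as np

values = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 25.0, 10.0])  # 25.0 looks extreme

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # candidate outliers; whether to treat or keep them is the analyst's call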

3. HIERARCHICAL CLUSTERING
Hierarchical Clustering
Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom.
For example, all files and folders on the hard disk are organized in a hierarchy. There are two types of
hierarchical clustering, Divisive and Agglomerative.

Divisive method

In the divisive or top-down clustering method we assign all of the observations to a single cluster and then
partition the cluster into the two least similar clusters. Finally, we proceed recursively on each cluster until
there is one cluster for each observation. There is evidence that divisive algorithms produce more accurate
hierarchies than agglomerative algorithms in some circumstances, but divisive clustering is conceptually
more complex.

Agglomerative method

In the agglomerative or bottom-up clustering method we assign each observation to its own cluster. Then we
compute the similarity (e.g., distance) between each of the clusters and join the two most similar clusters.
Finally, we repeat the compute-and-join steps until there is only a single cluster left.
Before any clustering is performed, it is required to determine the proximity matrix containing the distance
between each pair of points using a distance function. Then the matrix is updated to display the distance
between each pair of clusters. The following three linkage methods differ in how the distance between each
cluster is measured.

Single Linkage

In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance
between two points in each cluster. For example, the distance between clusters “r” and “s” is equal to the
length of the arrow between their two closest points.

Complete Linkage
In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest
distance between two points in each cluster. For example, the distance between clusters “r” and “s” is equal
to the length of the arrow between their two furthest points.

Average Linkage

In average linkage hierarchical clustering, the distance between two clusters is defined as the average
distance between each point in one cluster and every point in the other cluster. For example, the distance
between clusters “r” and “s” is equal to the average length of the arrows connecting the points of one cluster
to the other.
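
A minimal sketch of agglomerative clustering with these three linkage criteria, assuming SciPy is installed and using made-up two-dimensional data, is shown below; fcluster simply cuts each resulting hierarchy into two flat clusters:

# Minimal sketch: agglomerative hierarchical clustering with single, complete
# and average linkage (SciPy and NumPy assumed).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(15, 2)) for c in ([0, 0], [3, 3])])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # pairwise proximities, merged bottom-up
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 flat clusters
    print(method, labels)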

4. K-MEANS ALGORITHM (PARTITIONING CLUSTERING)

k-means clustering algorithm


k-means is one of the simplest unsupervised learning algorithms that solve the well-known
clustering problem. The procedure follows a simple and easy way to classify a given data set
through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k
centers, one for each cluster. These centers should be placed carefully, because different
locations cause different results. So the better choice is to place them as far away from each
other as possible. The next step is to take each point belonging to the given data set and
associate it to the nearest center. When no point is pending, the first step is completed and an early
grouping is done. At this point we need to re-calculate k new centroids as barycenters of the
clusters resulting from the previous step. After we have these k new centroids, a new binding has to
be done between the same data set points and the nearest new center. A loop has been
generated. As a result of this loop we may notice that the k centers change their location step by
step until no more changes are made, or in other words, the centers do not move any more. Finally,
this algorithm aims at minimizing an objective function known as the squared-error function, given by:

J(V) = Σ (i = 1 to c) Σ (j = 1 to ci) ( ||xi − vj|| )²

where,
‘||xi - vj||’ is the Euclidean distance between xi and vj.

‘ci’ is the number of data points in ith cluster.

‘c’ is the number of cluster centers.

Algorithmic steps for k-means clustering

Let X = {x1,x2,x3,……..,xn} be the set of data points and V = {v1,v2,…….,vc} be the set of
centers.

1) Randomly select ‘c’ cluster centers.

2) Calculate the distance between each data point and cluster centers.

3) Assign the data point to the cluster center whose distance from the cluster center is the
minimum over all the cluster centers.

4) Recalculate the new cluster center using:

vi = (1 / ci) Σ (j = 1 to ci) xj

where ‘ci’ represents the number of data points in the ith cluster and the sum runs over the
data points assigned to that cluster.


5) Recalculate the distance between each data point and the newly obtained cluster centers.

6) If no data point was reassigned then stop, otherwise repeat from step 3).
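
A minimal from-scratch sketch of these steps in NumPy follows; the synthetic data, the value of c, and the convergence test are illustrative assumptions, and the sketch assumes every cluster keeps at least one point:

# Minimal sketch of the k-means steps above using NumPy (illustrative data and settings).
import numpy as np

def kmeans(X, c, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]   # step 1: random initial centres
    for _ in range(max_iter):
        # steps 2-3: assign each point to its nearest centre (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centre as the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(c)])
        # steps 5-6: stop when the centres no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(20, 2)) for m in ([0, 0], [5, 0], [0, 5])])
labels, centers = kmeans(X, c=3)
print(centers)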

Advantages

1) Fast, robust and easier to understand.

2) Relatively efficient: O(tknd), where n is the number of objects, k the number of clusters,
d the dimension of each object, and t the number of iterations. Normally, k, t, d << n.

3) Gives the best result when the data sets are distinct or well separated from each other.

[Fig. I: Result of k-means for 'N' = 60 and 'c' = 3 (figure not reproduced here).]


Disadvantages

1) The learning algorithm requires a priori specification of the number of cluster centers.

2) The use of Exclusive Assignment - If there are two highly overlapping clusters, then
k-means will not be able to resolve that there are two clusters.

3) The learning algorithm is not invariant to non-linear transformations, i.e., with different
representations of the data we get different results (data represented in the form of Cartesian
co-ordinates and polar co-ordinates will give different results).

4) Euclidean distance measures can unequally weight underlying factors.

5) The learning algorithm provides the local optima of the squared error function.

6) Random choice of the initial cluster centers may not lead to a fruitful result.
7) Applicable only when mean is defined i.e. fails for categorical data.

8) Unable to handle noisy data and outliers.

9) Algorithm fails for non-linear data set.

5. MINING TIME SERIES DATA

A time series is a sequence of data points recorded at specific time points - most often in
regular time intervals (seconds, hours, days, months etc.). Every organization generates a
high volume of data every single day – be it sales figures, revenue, traffic, or operating cost.
Time series data mining can generate valuable information for long-term business decisions,
yet such data are underutilized in most organizations. Below is a list of a few possible ways to
take advantage of time series datasets:

 Trend analysis: Just plotting data against time can generate very powerful insights.
One very basic use of time-series data is just understanding temporal pattern/trend in
what is being measured. In businesses it can even give an early indication on the
overall direction of a typical business cycle.

 Outlier/anomaly detection: An outlier in a temporal dataset represents an anomaly.
Whether desired (e.g. profit margin) or not (e.g. cost), outliers detected in a dataset
can help prevent unintended consequences (a minimal sketch of trend and anomaly
detection follows this list).

 Examining shocks/unexpected variation: Time-series data can identify variations
(expected or unexpected) and abnormalities, and detect signals in the noise.

 Association analysis: By plotting bivariate/multivariate temporal data it is easy (just
visually) to identify associations between any two features (e.g. profit vs sales). This
association may or may not imply causation, but it is a good starting point for
selecting input features that impact output variables in more advanced statistical
analysis.
 Forecasting: Forecasting future values using historical data is a common
methodological approach – from simple extrapolation to sophisticated stochastic
methods such as ARIMA.

 Predictive analytics: Advanced statistical analysis such as panel data models (fixed
and random effects models) rely heavily on multi-variate longitudinal datasets. These
types of analysis help in business forecasts, identify explanatory variables, or simply
help in understanding associations between features in a dataset.
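
As referenced in the list above, a minimal pandas sketch of trend analysis (a rolling mean) and anomaly detection (a simple z-score rule on the deviations from trend) might look like this; the data, window size, and threshold are assumptions for the example:

# Minimal sketch: trend and anomaly detection on a time series (pandas/NumPy assumed).
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
idx = pd.date_range("2023-01-01", periods=120, freq="D")
sales = pd.Series(100 + 0.5 * np.arange(120) + rng.normal(0, 5, 120), index=idx)
sales.iloc[60] += 60   # inject one artificial shock

trend = sales.rolling(window=14).mean()            # trend analysis: smooth short-term noise
residual = sales - trend
z = (residual - residual.mean()) / residual.std()  # standardise the deviations from trend
anomalies = sales[z.abs() > 3]                     # flag points far from the local trend

print(anomalies)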

UNIT- V

1. TEXT MINING

Information Retrieval

Information retrieval deals with the retrieval of information from a large number of text-
based documents. Some typical database system features are not usually present in
information retrieval systems, because the two handle different kinds of data. Examples of
information retrieval systems include −

 Online Library catalogue system

 Online Document Management Systems

 Web Search Systems etc.

Note − The main problem in an information retrieval system is to locate relevant documents
in a document collection based on a user's query. Such a query consists of some keywords
describing an information need.

In such search problems, the user takes the initiative to pull relevant information out of a
collection. This is appropriate when the user has an ad hoc information need, i.e., a short-term
need. But if the user has a long-term information need, then the retrieval system can also take
the initiative to push any newly arrived information item to the user.

This kind of access to information is called Information Filtering. And the corresponding
systems are known as Filtering Systems or Recommender Systems.

Basic Measures for Text Retrieval

We need to check the accuracy of a system when it retrieves a number of documents on the
basis of a user's input. Let the set of documents relevant to a query be denoted as {Relevant}
and the set of retrieved documents as {Retrieved}. The set of documents that are both relevant
and retrieved can be denoted as {Relevant} ∩ {Retrieved}, and the three sets can be shown in
the form of a Venn diagram.
There are three fundamental measures for assessing the quality of text retrieval −

 Precision

 Recall

 F-score

Precision

Precision is the percentage of retrieved documents that are in fact relevant to the query.
Precision can be defined as −

Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall

Recall is the percentage of documents that are relevant to the query and were in fact
retrieved. Recall is defined as −

Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
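
A minimal sketch computing these measures over sets of document IDs follows; the document sets are hypothetical, and the F-score shown is the usual harmonic mean of precision and recall:

# Minimal sketch: precision, recall and F-score over sets of document IDs (illustrative data).
relevant = {"d1", "d2", "d3", "d4", "d5"}     # documents truly relevant to the query
retrieved = {"d2", "d3", "d6", "d7"}          # documents the system actually returned

hits = relevant & retrieved                   # {Relevant} ∩ {Retrieved}

precision = len(hits) / len(retrieved)
recall = len(hits) / len(relevant)
f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(precision, recall, f_score)             # 0.5, 0.4, ~0.444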

2. WWW MINING

Challenges in Web Mining

The web poses great challenges for resource and knowledge discovery based on the following
observations −

 The web is too huge − The size of the web is very huge and rapidly increasing. It
seems that the web is too huge for data warehousing and data mining.

 Complexity of Web pages − The web pages do not have a unifying structure. They are
very complex compared to traditional text documents. There is a huge amount of
documents in the digital library of the web, and these libraries are not arranged in any
particular sorted order.

 The web is a dynamic information source − The information on the web is rapidly
updated. Data such as news, stock markets, weather, sports, shopping, etc., are
regularly updated.
 Diversity of user communities − The user community on the web is rapidly
expanding. These users have different backgrounds, interests, and usage purposes.
There are more than 100 million workstations connected to the Internet, and the
number is still rapidly increasing.

 Relevancy of Information − It is considered that a particular person is generally
interested in only a small portion of the web, while the rest of the web contains
information that is not relevant to the user and may swamp desired results.

Mining Web page layout structure

The basic structure of the web page is based on the Document Object Model (DOM). The
DOM structure refers to a tree like structure where the HTML tag in the page corresponds to
a node in the DOM tree. We can segment the web page by using predefined tags in HTML.
The HTML syntax is flexible; therefore, many web pages do not follow the W3C
specifications. Not following the W3C specifications may cause errors in the DOM tree
structure.

The DOM structure was initially introduced for presentation in the browser and not for
description of semantic structure of the web page. The DOM structure cannot correctly
identify the semantic relationship between the different parts of a web page.
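
As a small illustration of the tag-to-node mapping only (not of VIPS itself), the sketch below uses Python's built-in html.parser to print the nesting of tags in a made-up page fragment; real pages are far messier, which is exactly the limitation described above:

# Minimal sketch: viewing HTML tags as nodes of a DOM-like tree (standard library only).
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + tag)   # each opening tag becomes a node at the current depth
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1

html = "<html><body><div><h1>News</h1><p>Story text</p></div><div><p>Ads</p></div></body></html>"
TreePrinter().feed(html)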

Vision-based page segmentation (VIPS)

 The purpose of VIPS is to extract the semantic structure of a web page based on its
visual presentation.

 Such a semantic structure corresponds to a tree structure. In this tree each node
corresponds to a block.

 A value is assigned to each node. This value is called the Degree of Coherence. This
value is assigned to indicate the coherent content in the block based on visual
perception.

 The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree.
After that it finds the separators between these blocks.

 The separators refer to the horizontal or vertical lines in a web page that visually cross
with no blocks.

 The semantics of the web page is constructed on the basis of these blocks.
3. DATA MINING APPLICATIONS AND TRENDS

Data Mining Applications

Here is the list of areas where data mining is widely used −

 Financial Data Analysis

 Retail Industry

 Telecommunication Industry

 Biological Data Analysis

 Other Scientific Applications

 Intrusion Detection

Financial Data Analysis

The financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining. Some of the typical cases are as
follows −

 Design and construction of data warehouses for multidimensional data analysis and
data mining.

 Loan payment prediction and customer credit policy analysis.

 Classification and clustering of customers for targeted marketing.

 Detection of money laundering and other financial crimes.

Retail Industry

Data mining has great application in the retail industry because it collects large amounts of
data on sales, customer purchasing history, goods transportation, consumption, and
services. It is natural that the quantity of data collected will continue to expand rapidly
because of the increasing ease, availability, and popularity of the web.

Data mining in retail industry helps in identifying customer buying patterns and trends that
lead to improved quality of customer service and good customer retention and satisfaction.
Here is the list of examples of data mining in the retail industry −

 Design and Construction of data warehouses based on the benefits of data mining.

 Multidimensional analysis of sales, customers, products, time and region.

 Analysis of effectiveness of sales campaigns.

 Customer Retention.

 Product recommendation and cross-referencing of items.

Telecommunication Industry

Today the telecommunication industry is one of the fastest-emerging industries, providing
various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web
data transmission, etc. Due to the development of new computer and communication
technologies, the telecommunication industry is rapidly expanding. This is the reason why
data mining has become very important in helping to understand the business.

Data mining in the telecommunication industry helps in identifying telecommunication
patterns, catching fraudulent activities, making better use of resources, and improving quality
of service. Here is the list of examples for which data mining improves telecommunication
services −

 Multidimensional Analysis of Telecommunication data.

 Fraudulent pattern analysis.

 Identification of unusual patterns.

 Multidimensional association and sequential patterns analysis.

 Mobile Telecommunication services.

 Use of visualization tools in telecommunication data analysis.

Biological Data Analysis

In recent times, we have seen tremendous growth in fields of biology such as genomics,
proteomics, functional genomics, and biomedical research. Biological data mining is a very
important part of Bioinformatics. Following are the aspects in which data mining contributes
for biological data analysis −

 Semantic integration of heterogeneous, distributed genomic and proteomic databases.


 Alignment, indexing, similarity search and comparative analysis of multiple nucleotide
sequences.

 Discovery of structural patterns and analysis of genetic networks and protein
pathways.

 Association and path analysis.

 Visualization tools in genetic data analysis.

Other Scientific Applications

The applications discussed above tend to handle relatively small and homogeneous data sets
for which the statistical techniques are appropriate. Huge amounts of data have been collected
from scientific domains such as geosciences, astronomy, etc. Large data sets are also being
generated because of fast numerical simulations in various fields such as climate
and ecosystem modeling, chemical engineering, fluid dynamics, etc. Following are the
applications of data mining in the field of Scientific Applications −

 Data Warehouses and data preprocessing.

 Graph-based mining.

 Visualization and domain specific knowledge.

Intrusion Detection

Intrusion refers to any kind of action that threatens integrity, confidentiality, or the
availability of network resources. In this world of connectivity, security has become the
major issue. The increased usage of the internet and the availability of tools and tricks for
intruding into and attacking networks have prompted intrusion detection to become a critical
component of network administration. Here is the list of areas in which data mining technology may be
applied for intrusion detection −

 Development of data mining algorithm for intrusion detection.

 Association and correlation analysis, aggregation to help select and build
discriminating attributes.

 Analysis of Stream data.

 Distributed data mining.

 Visualization and query tools.


Data Mining System Products

There are many data mining system products and domain specific data mining applications.
The new data mining systems and applications are being added to the previous systems. Also,
efforts are being made to standardize data mining languages.

Choosing a Data Mining System

The selection of a data mining system depends on the following features −

 Data Types − The data mining system may handle formatted text, record-based data,
and relational data. The data could also be in ASCII text, relational database data or
data warehouse data. Therefore, we should check what exact format the data mining
system can handle.

 System Issues − We must consider the compatibility of a data mining system with
different operating systems. One data mining system may run on only one operating
system or on several. There are also data mining systems that provide web-based user
interfaces and allow XML data as input.

 Data Sources − Data sources refer to the data formats in which the data mining system
will operate. Some data mining systems may work only on ASCII text files, while
others work on multiple relational sources. A data mining system should also support
ODBC connections or OLE DB for ODBC connections.

 Data Mining functions and methodologies − There are some data mining systems
that provide only one data mining function, such as classification, while others provide
multiple data mining functions such as concept description, discovery-driven OLAP
analysis, association mining, linkage analysis, statistical analysis, classification,
prediction, clustering, outlier analysis, similarity search, etc.

 Coupling data mining with databases or data warehouse systems − Data mining
systems need to be coupled with a database or a data warehouse system. The coupled
components are integrated into a uniform information processing environment. Here
are the types of coupling listed below −

o No coupling

o Loose Coupling

o Semi tight Coupling

o Tight Coupling

 Scalability − There are two scalability issues in data mining −

o Row (Database size) Scalability − A data mining system is considered row
scalable when the number of rows is enlarged 10 times and it takes no more than
10 times as long to execute a query.

o Column (Dimension) Scalability − A data mining system is considered
column scalable if the mining query execution time increases linearly with the
number of columns.

 Visualization Tools − Visualization in data mining can be categorized as follows −

o Data Visualization

o Mining Results Visualization

o Mining process visualization

o Visual data mining

 Data Mining query language and graphical user interface − An easy-to-use
graphical user interface is important to promote user-guided, interactive data mining.
Unlike relational database systems, data mining systems do not share an underlying
data mining query language.

Trends in Data Mining

Data mining concepts are still evolving and here are the latest trends that we get to see in this
field −

 Application Exploration.

 Scalable and interactive data mining methods.

 Integration of data mining with database systems, data warehouse systems and web
database systems.

 Standardization of data mining query language.

 Visual data mining.

 New methods for mining complex types of data.

 Biological data mining.

 Data mining and software engineering.

 Web mining.

 Distributed data mining.

 Real time data mining.

 Multi database data mining.


 Privacy protection and information security in data mining.
