
1. INTRODUCTION TO DATA MINING


What is Data Mining?
Data mining is the process of applying intelligent methods to extract data
patterns. It can also be described as the process of discovering interesting
patterns and knowledge from large amounts of data. The data sources may
include databases, data warehouses, the Web, and other repositories.

What are the steps of knowledge discovery from data (KDD)?

1. Data cleaning: The process of filling in missing values, removing outliers,
smoothing noisy data, and resolving inconsistencies in the data.
2. Data integration: The process of collecting data from multiple sources and
combining them into a coherent store.
3. Data selection: The process of only retrieving data that is relevant to the
analysis task from the database.
4. Data transformation: The process where data are transformed and
consolidated into forms appropriate for mining by performing summary or
aggregation operations.
5. Data mining: The process where intelligent methods are applied to extract
data patterns.
6. Pattern evaluation: The process to identify the truly interesting patterns
representing knowledge based on interestingness measures.
7. Knowledge presentation: The process where visualisation and knowledge
representation techniques are used to represent mined knowledge to users.
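As a rough illustration of these steps, here is a sketch using pandas; the file
names (sales.csv, customers.csv) and column names are invented for the example
and are not from any particular data set.

import pandas as pd

# 2. Data integration: load two hypothetical sources and combine them.
sales = pd.read_csv("sales.csv")          # illustrative file name
customers = pd.read_csv("customers.csv")  # illustrative file name
data = sales.merge(customers, on="customer_id", how="inner")

# 1. Data cleaning: fill missing amounts with the column mean, drop duplicates.
data["amount"] = data["amount"].fillna(data["amount"].mean())
data = data.drop_duplicates()

# 3. Data selection: keep only the attributes relevant to the analysis task.
data = data[["customer_id", "region", "amount"]]

# 4. Data transformation: consolidate via summary/aggregation operations.
summary = data.groupby("region")["amount"].agg(["count", "mean", "sum"])

# 5. Data mining (a deliberately trivial "pattern"): regions whose total sales
#    are above the average total.
interesting = summary[summary["sum"] > summary["sum"].mean()]

# 6-7. Pattern evaluation and knowledge presentation (here, just a sorted print).
print(interesting.sort_values("sum", ascending=False))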
Types of Data

1. Database Data

A database management system (DBMS) consists of a collection of
interrelated data, known as a database, and a set of software programs to
manage and access the data.

A relational database is a collection of tables, each of which is assigned a
unique name. Each table consists of a set of attributes (columns or fields)
and usually stores a large set of tuples (records or rows). Each tuple in a
relational table represents an object identified by a unique key and
described by a set of attribute values. A semantic data model, such as an
entity-relationship (ER) data model, is often constructed for relational
databases. An ER data model represents the database as a set of entities and
their relationships.
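A small illustration of tables, attributes, tuples, and keys using Python's
built-in sqlite3 module; the table and column names are invented for this
example.

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# A relational table: a set of attributes (columns) and tuples (rows),
# each tuple identified by a unique key (cust_id).
cur.execute("""
    CREATE TABLE customer (
        cust_id INTEGER PRIMARY KEY,
        name    TEXT,
        age     INTEGER,
        city    TEXT
    )
""")
cur.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [(1, "Alice", 34, "Oslo"), (2, "Bob", 41, "Bergen")],
)

# A query retrieves tuples by their attribute values.
for row in cur.execute("SELECT name, city FROM customer WHERE age > 35"):
    print(row)   # ('Bob', 'Bergen')
conn.close()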
2. Data Warehouses 

A data warehouse is a repository of information collected from multiple
sources, stored under a unified schema, and usually residing at a single
site. Data warehouses are constructed via a process of data cleaning, data
integration, data transformation, data loading and periodic data refreshing.

3. Transactional Data 

Each record in a transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a web page. A
transaction typically includes a unique transaction identity number (trans
ID) and a list of the items making up the transaction, such as the items
purchased in the transaction.
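A transactional data set can be represented simply as a mapping from trans IDs
to item lists; the IDs and items below are made up for illustration.

from collections import Counter

# Hypothetical transactional data: trans ID -> list of items in the transaction.
transactions = {
    "T100": ["bread", "milk", "butter"],
    "T200": ["milk", "diapers", "beer"],
    "T300": ["bread", "milk", "diapers"],
}

# A simple question to ask of such data: how often does each item occur?
item_counts = Counter(item for items in transactions.values() for item in items)
print(item_counts.most_common(3))   # e.g. [('milk', 3), ('bread', 2), ('diapers', 2)]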
4. Other Kinds of Data

• Time-related or sequence data - historical records, stock exchange data, and
time-series and biological sequence data

• Data streams - video surveillance and sensor data, which are continuously
transmitted

• Spatial data - maps


• Engineering design data -the design of buildings, system components, or
integrated circuits

• Hypertext and multimedia data - text, image, video, and audio data
• Graph and networked data - social and information networks
• Web - a huge, widely distributed information repository made available by
the Internet

Which Technologies are used for DM?


Several disciplines strongly influence the development of data mining
methods. Some of the major ones are:
1. Statistics

Statistics studies the collection, analysis, interpretation or explanation, and
presentation of data. Data mining has an inherent connection with statistics.
A statistical model is a set of mathematical functions that describe the
behaviour of the objects in a target class in terms of random variables and
their associated probability distributions. Statistical models are widely used
to model data and data classes. 

For example, in data mining tasks like data characterisation and
classification, statistical models of target classes can be built. In other
words, such statistical models can be the outcome of a data mining task.
Alternatively, data mining tasks can be built on top of statistical models.
For example, we can use statistics to model noise and missing data values.
Then, when mining patterns in a large data set, the data mining process can
use the model to help identify and handle noisy or missing values in the
data. 

Statistical methods can also be used to verify data mining results.
2. Machine Learning

Machine learning investigates how computers can learn (or improve their
performance) based on data. A main research area is for computer programs
to automatically learn to recognise complex patterns and make intelligent
decisions based on data.
A. Supervised learning is basically a synonym for classification.
B. Unsupervised learning is essentially a synonym for clustering.
C. Semi-supervised learning is a class of machine learning techniques that
make use of both labeled and unlabelled examples when learning a
model. In one approach, labeled examples are used to learn class models
and unlabelled examples are used to refine the boundaries between
classes.
D. Active learning is a machine learning approach that lets users play an
active role in the learning process. An active learning approach can ask a
user (e.g., a domain expert) to label an example, which may be from a
set of unlabelled examples or synthesised by the learning program. The
goal is to optimise the model quality by actively acquiring knowledge
from human users, given a constraint on how many examples they can
be asked to label.
3. Database Systems and Data Warehouses 

Database systems research focuses on the creation, maintenance, and use of
databases for organisations and end-users. Particularly, database systems
researchers have established highly recognised principles in data models,
query languages, query processing and optimisation methods, data storage,
and indexing and accessing methods. Database systems are often well
known for their high scalability in processing very large, relatively
structured data sets. 

Many data mining tasks need to handle large data sets or even real-time,
fast streaming data. Therefore, data mining can make good use of scalable
database technologies to achieve high efficiency and scalability on large
data sets. Moreover, data mining tasks can be used to extend the capability
of existing database systems to satisfy advanced users’ sophisticated data
analysis requirements.
4. Information retrieval (IR)

It is the science of searching for documents or information in documents.
Documents can be text or multimedia, and may reside on the Web. The
differences between traditional information retrieval and database systems
are twofold: Information retrieval assumes that
1. The data under search are unstructured; and
2. The queries are formed mainly by keywords, which do not have complex
structures.
The typical approaches in information retrieval adopt probabilistic
models. For example, a text document can be regarded as a bag of
words, that is, a multi-set of words appearing in the document. The
document’s language model is the probability density function that
generates the bag of words in the document. The similarity between
two documents can be measured by the similarity between their
corresponding language models.
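The bag-of-words idea can be sketched in a few lines: raw word counts stand in
for a (very crude) document model, and cosine similarity compares two
documents. The example documents are invented.

from collections import Counter
import math

def bag_of_words(text):
    # Multi-set of words appearing in the document.
    return Counter(text.lower().split())

def cosine_similarity(bag_a, bag_b):
    # Similarity between two documents via their word-count vectors.
    dot = sum(bag_a[w] * bag_b[w] for w in set(bag_a) & set(bag_b))
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = "data mining extracts interesting patterns from large data sets"
doc2 = "data mining discovers patterns and knowledge from data"
print(cosine_similarity(bag_of_words(doc1), bag_of_words(doc2)))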

Applications of Data Mining


1. Business Intelligence (BI)

Business intelligence technologies provide historical, current, and
predictive views of business operations. 

Data mining helps businesses perform effective market analysis, compare
customer feedback on similar products, discover the strengths and weaknesses
of their competitors, retain highly valuable customers, and make smart
business decisions.

Online analytical processing (OLAP) tools in business intelligence rely on
data warehousing and multidimensional data mining.
2. Web Search Engines

A Web search engine is a specialised computer server that searches for
information on the Web. The search results of a user query are often
returned as a list (sometimes called hits). The hits may consist of web
pages, images, and other types of files. The main functions of a search
engine are crawling, indexing, and searching.
3. Fraud Detection
4. Banking
5. Credit Rating System
6. Ad campaigns
7. Retail
1. Customer loyalty
2. Customer relationship
3. Market basket analysis
4. Ad campaigns
5. Inventory management

Major Issues in Data Mining

Mining Methodology
Handling uncertainty, noise, or incompleteness of data:

Data often contain noise, errors, exceptions, or uncertainty, or are incomplete. Errors and
noise may confuse the data mining process, leading to the derivation of erroneous
patterns. Data cleaning, data preprocessing, outlier detection and removal, and
uncertainty reasoning are examples of techniques that need to be integrated with the data
mining process.

Pattern evaluation and pattern- or constraint-guided mining:

What makes a pattern interesting may vary from user to user. Therefore, techniques are
needed to assess the interestingness of discovered patterns based on subjective measures.

User Interaction
Interactive mining: The data mining process should be highly interactive. Thus, it is
important to build flexible user interfaces and an exploratory mining environment,
facilitating the user’s interaction with the system.
Incorporation of background knowledge: Background knowledge, constraints, rules, and
other information regarding the domain under study should be incorporated.

Presentation and visualisation of data mining results:

Efficiency and Scalability


Efficiency and scalability of data mining algorithms:

Parallel, distributed, and incremental mining algorithms:

Diversity of Database Types


Handling complex types of data:

Mining dynamic, networked, and global data repositories:

Data Mining and Society


Social impacts of data mining:

Privacy-preserving data mining:


2. Getting to know your data
Data Objects and Attribute Types
A data object represents an entity. Data objects are typically described by
attributes. Data objects can also be referred to as samples, examples, instances,
data points, or objects. If the data objects are stored in a database, they are data
tuples. That is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.
An attribute is a data field, representing a characteristic or feature of a data
object.
The type of an attribute is determined by the set of possible values—nominal,
binary, ordinal, or numeric—the attribute can have.
1. Nominal Attributes 

The values of a nominal attribute are symbols or names of things. Each
value represents some kind of category, code, or state, and so nominal
attributes are also referred to as categorical.
2. Binary Attributes 

A binary attribute is a nominal attribute with only two categories or states: 0
or 1, where 0 typically means that the attribute is absent, and 1 means that it
is present. Binary attributes are referred to as Boolean if the two states
correspond to true and false.
3. Ordinal Attributes

An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
4. Numeric Attributes

A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values.

• Interval-Scaled Attributes 

Interval-scaled attributes are measured on a scale of equal-size units. The
values of interval-scaled attributes have order and can be positive, 0, or
negative. Thus, in addition to providing a ranking of values, such attributes
allow us to compare and quantify the difference between values.

• Ratio-Scaled Attributes 

A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
That is, if a measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value. In addition, the values are ordered, and
we can also compute the difference between values, as well as the mean,
median, and mode.
5. Discrete Attributes 

A discrete attribute has a finite or countably infinite set of values, which
may or may not be represented as integers.
6. Continuous Attributes 

If an attribute is not discrete, it is continuous. Continuous attributes are
typically represented as floating-point variables.
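To tie the attribute types together, here is one hypothetical data object with
each attribute annotated by its type; the values are made up.

# One made-up data object (tuple); comments give the attribute type.
customer = {
    "hair_colour": "black",    # nominal: a category or state, no order
    "is_smoker": 0,            # binary: 0 = absent, 1 = present
    "drink_size": "medium",    # ordinal: small < medium < large, gaps unknown
    "temperature_c": 21.5,     # interval-scaled numeric: no inherent zero point
    "income_eur": 43000.0,     # ratio-scaled numeric: inherent zero, ratios meaningful
    "num_children": 2,         # discrete: finite/countable value set
    "height_m": 1.78,          # continuous: real-valued measurement
}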

Data Visualisation
Similarity

3. Data Preprocessing
Data have quality if they satisfy the requirements of the intended use. There are
many factors comprising data quality, including accuracy, completeness,
consistency, timeliness, believability, and interpretability.
1. Data Cleaning 

Real-world data tend to be incomplete, noisy, and inconsistent. Data
cleaning (or data cleansing) routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct inconsistencies in
the data.
1) Missing Values
A. Ignore the tuple
B. Fill in the missing value manually
C. Use a global constant to fill in the missing value
D. Use a measure of central tendency (e.g., the mean or median) for the attribute
E. Use the attribute mean or median for all samples belonging to the same class as
the given tuple
F. Use the most probable value to fill in the missing value.
(A small sketch of strategies D and E, together with binning-based smoothing,
follows this data cleaning list.)
2) Noisy Data 

Noise is a random error or variance in a measured variable. Data
Smoothing techniques:
A. Binning

Binning methods smooth a sorted data value by consulting its
“neighbourhood,” that is, the values around it. The sorted values
are distributed into a number of “buckets,” or bins.
B. Regression

Data smoothing can also be done by regression, a technique that
conforms data values to a function.
C. Outlier analysis

Outliers may be detected by clustering, for example, where similar
values are organised into groups, or “clusters.” Intuitively, values that
fall outside of the set of clusters may be considered outliers.
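A small pandas sketch of some of the cleaning routines above: filling a missing
value with the median, filling with the class-wise mean, and smoothing by bin
means. The data frame is invented.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class": ["a", "a", "b", "b", "b"],
    "income": [30.0, np.nan, 50.0, 55.0, np.nan],
    "price": [4, 8, 9, 15, 21],
})

# D. Fill a missing value with a measure of central tendency (here, the median).
df["income_median"] = df["income"].fillna(df["income"].median())

# E. Fill with the mean of samples belonging to the same class as the tuple.
df["income_classmean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

# A. Binning: sort values into equal-frequency buckets, then smooth each
#    value by replacing it with its bin mean.
bins = pd.qcut(df["price"], q=2)
df["price_smoothed"] = df.groupby(bins, observed=True)["price"].transform("mean")

print(df)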
2. Data Integration
1) Entity Identification Problem 

When we have multiple data sources, real-world entities need to be
matched up.
2) Redundancy and Correlation Analysis

An attribute may be redundant if it can be “derived” from another
attribute or set of attributes. Inconsistencies in attribute or dimension
naming can also cause redundancies in the resulting data set. Correlation
analysis can be used to detect such redundancy (see the sketch after this
list).
3) Tuple Duplication 

Duplication should also be detected at the tuple level (e.g., where there
are two or more identical tuples for a given unique data entry case)
4) Data Value Conflict Detection and Resolution 

For the same real-world entity, attribute values from different sources
may differ. This may be due to differences in representation, scaling, or
encoding.
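For numeric attributes, redundancy can be checked with the correlation
coefficient; a quick NumPy sketch with invented attribute values:

import numpy as np

# Two hypothetical numeric attributes obtained from different sources.
annual_income = np.array([30, 45, 60, 75, 90], dtype=float)
monthly_income = np.array([2.5, 3.7, 5.0, 6.3, 7.5], dtype=float)

# A Pearson correlation close to +1 or -1 suggests that one attribute is
# (approximately) derivable from the other, i.e. redundant.
r = np.corrcoef(annual_income, monthly_income)[0, 1]
print(f"correlation = {r:.3f}")   # close to 1 here, so one attribute is redundant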
3. Data reduction

Data reduction techniques can be applied to obtain a reduced representation
of the data set that is much smaller in volume, yet closely maintains the
integrity of the original data.
1) Dimensionality reduction is the process of reducing the number of
random variables or attributes under consideration (a PCA sketch
follows this list).
A. Wavelet Transforms
B. Principal Components Analysis
2) Attribute Subset Selection

Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions)
3) Regression and Log-Linear Models: Parametric Data Reduction
4) Histogram: Histograms use binning to approximate data distributions
and are a popular form of data reduction.
5) Clustering
6) Sampling
7) Data Cube Aggregation
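A minimal principal components analysis sketch using NumPy's singular value
decomposition; random data stands in for a real data set.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 objects, 5 attributes

# Centre the data, then take the singular value decomposition.
X_centred = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Project onto the first k principal components: a reduced representation
# that preserves as much of the variance as possible.
k = 2
X_reduced = X_centred @ Vt[:k].T
print(X_reduced.shape)                  # (100, 2)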
4. Data Transformation and Data Discretisation 

In data transformation, the data are transformed or consolidated into
forms appropriate for mining. Strategies for data transformation include
the following:
1) Smoothing, which works to remove noise from the data. Techniques
include binning, regression, and clustering.
2) Attribute construction (or feature construction), where new
attributes are constructed and added from the given set of attributes to
help the mining process.
3) Aggregation, where summary or aggregation operations are applied
to the data. For example, the daily sales data may be aggregated so as
to compute monthly and annual total amounts. This step is typically
used in constructing a data cube for data analysis at multiple
abstraction levels.
4) Normalisation, where the attribute data are scaled so as to fall within
a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
5) Discretisation, where the raw values of a numeric attribute (e.g., age)
are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual
labels (e.g., youth, adult, senior). The labels, in turn, can be
recursively organised into higher-level concepts, resulting in a concept
hierarchy for the numeric attribute. More than one concept hierarchy can
be defined for the same attribute to accommodate the needs of various
users. (A short normalisation and discretisation sketch follows this
list.)
6) Concept hierarchy generation for nominal data, where attributes
such as street can be generalised to higher-level concepts, like city or
country. Many hierarchies for nominal attributes are implicit within
the database schema and can be automatically defined at the schema
definition level. 
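A short sketch of min-max normalisation and discretisation into conceptual
labels for a numeric attribute; the age values and cut-offs are made up.

import numpy as np

age = np.array([13, 25, 37, 48, 62, 71], dtype=float)

# 4) Normalisation: scale the attribute into the range [0.0, 1.0].
age_norm = (age - age.min()) / (age.max() - age.min())

# 5) Discretisation: replace raw values by conceptual labels.
labels = np.select([age <= 20, age <= 60], ["youth", "adult"], default="senior")

print(age_norm.round(2))   # [0.   0.21 0.41 0.6  0.84 1.  ]
print(labels)              # ['youth' 'adult' 'adult' 'adult' 'senior' 'senior']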

Cluster Analysis
Cluster analysis or simply clustering is the process of partitioning a set of data objects
(or observations) into subsets. Each subset is a cluster, such that objects in a cluster are
similar to one another, yet dissimilar to objects in other clusters. The set of clusters
resulting from a cluster analysis can be referred to as a clustering. Cluster analysis is
also known as data segmentation.

Requirements for Cluster Analysis


1. Scalability: Many clustering algorithms work well on small data sets containing
fewer than several hundred data objects; however, a large database may contain
millions or even billions of objects, particularly in Web search scenarios. Clustering
on only a sample of a given large data set may lead to biased results. Therefore,
highly scalable clustering algorithms are needed.

2. Ability to deal with different types of attributes: Many algorithms are designed to
cluster numeric (interval-based) data. However, applications may require clustering
other data types, such as binary, nominal (categorical), and ordinal data, or mixtures
of these data types. Recently, more and more applications need clustering techniques
for complex data types such as graphs, sequences, images, and documents.

3. Discovery of clusters with arbitrary shape: Many clustering algorithms determine
clusters based on Euclidean or Manhattan distance measures (Chapter 2). Algorithms
based on such distance measures tend to find spherical clusters with similar size and
density. However, a cluster could be of any shape. Consider sensors, for example,
which are often deployed for environment surveillance. Cluster analysis on sensor
readings can detect interesting phenomena. We may want to use clustering to find the
frontier of a running forest fire, which is often not spherical. It is important to develop
algorithms that can detect clusters of arbitrary shape.

4. Requirements for domain knowledge to determine input parameters: Many
clustering algorithms require users to provide domain knowledge in the form of input
parameters such as the desired number of clusters. Consequently, the clustering
results may be sensitive to such parameters. Parameters are often hard to determine,
especially for high-dimensionality data sets and where users have yet to grasp a deep
understanding of their data. Requiring the specification of domain knowledge not
only burdens users, but also makes the quality of clustering difficult to control.

5. Ability to deal with noisy data: Most real-world data sets contain outliers and/or
missing, unknown, or erroneous data. Sensor readings, for example, are often noisy—
some readings may be inaccurate due to the sensing mechanisms, and some readings
may be erroneous due to interferences from surrounding transient objects. Clustering
algorithms can be sensitive to such noise and may produce poor-quality clusters.
Therefore, we need clustering methods that are robust to noise.
6. Incremental clustering and insensitivity to input order: In many applications,
incremental updates (representing newer data) may arrive at any time. Some
clustering algorithms cannot incorporate incremental updates into existing clustering
structures and, instead, have to recompute a new clustering from scratch. Clustering
algorithms may also be sensitive to the input data order. That is, given a set of data
objects, clustering algorithms may return dramatically different clusterings depending
on the order in which the objects are presented. Incremental clustering algorithms and
algorithms that are insensitive to the input order are needed.

7. Capability of clustering high-dimensionality data: A data set can contain numerous
dimensions or attributes. When clustering documents, for example, each keyword can
be regarded as a dimension, and there are often thousands of keywords. Most
clustering algorithms are good at handling low-dimensional data such as data sets
involving only two or three dimensions. Finding clusters of data objects in a high-
dimensional space is challenging, especially considering that such data can be very
sparse and highly skewed.

8. Constraint-based clustering: Real-world applications may need to perform
clustering under various kinds of constraints. Suppose that your job is to choose the
locations for a given number of new automatic teller machines (ATMs) in a city. To
decide upon this, you may cluster households while considering constraints such as
the city’s rivers and highway networks and the types and number of customers per
cluster. A challenging task is to find data groups with good clustering behavior that
satisfy specified constraints.

9. Interpretability and usability: Users want clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied in with specific
semantic interpretations and applications. It is important to study how an application
goal may influence the selection of clustering features and clustering methods.

The following are orthogonal aspects with which clustering methods can be compared:

10. The partitioning criteria: In some methods, all the objects are partitioned so that no
hierarchy exists among the clusters. That is, all the clusters are at the same level
conceptually. Such a method is useful, for example, for partitioning customers into
groups so that each group has its own manager. Alternatively, other methods partition
data objects hierarchically, where clusters can be formed at different semantic levels.
For example, in text mining, we may want to organize a corpus of documents into
multiple general topics, such as “politics” and “sports,” each of which may have
subtopics. For instance, “football,” “basketball,” “baseball,” and “hockey” can exist
as subtopics of “sports.” The latter four subtopics are at a lower level in the hierarchy
than “sports.”

11. Separation of clusters: Some methods partition data objects into mutually exclusive
clusters. When clustering customers into groups so that each group is taken care of by
one manager, each customer may belong to only one group. In some other situations,
the clusters may not be exclusive, that is, a data object may belong to more than one
cluster. For example, when clustering documents into topics, a document may be
related to multiple topics. Thus, the topics as clusters may not be exclusive.

12. Similarity measure: Some methods determine the similarity between two objects by
the distance between them (see the short sketch after this list). Such a distance can
be defined on Euclidean space, a road
network, a vector space, or any other space. In other methods, the similarity may be
defined by connectivity based on density or contiguity, and may not rely on the
absolute distance between two objects. Similarity measures play a fundamental role in
the design of clustering methods. While distance-based methods can often take
advantage of optimization techniques, density- and continuity-based methods can
often find clusters of arbitrary shape.

13. Clustering space: Many clustering methods search for clusters within the entire
given data space. These methods are useful for low-dimensionality data sets. With
high- dimensional data, however, there can be many irrelevant attributes, which can
make similarity measurements unreliable. Consequently, clusters found in the full
space are often meaningless. It’s often better to instead search for clusters within
different subspaces of the same data set. Subspace clustering discovers clusters and
subspaces (often of low dimensionality) that manifest object similarity.
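To make the distance-based similarity of item 12 concrete, here are the two
classic measures mentioned under requirement 3 (Euclidean and Manhattan),
computed for two invented objects:

import math

def euclidean(p, q):
    # Straight-line distance between two objects described by numeric attributes.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # City-block distance: sum of absolute attribute differences.
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0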

Basic Clustering Methods


Partitioning methods: Given a set of n objects, a partitioning method constructs k
partitions of the data, where each partition represents a cluster and k ≤ n. That is, it
divides the data into k groups such that each group must contain at least one object.

Most partitioning methods use an iterative relocation technique, which improves the
partitioning by moving objects from one group to another.

Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the
given set of data objects. A hierarchical method can be classified as being either
agglomerative or divisive, based on how the hierarchical decomposition is formed.

The agglomerative approach, also called the bottom-up approach, starts with each object
forming a separate group. It successively merges the objects or groups close to one
another, until all the groups are merged into one (the topmost level of the hierarchy), or a
termination condition holds.

The divisive approach, also called the top-down approach, starts with all the objects in the
same cluster. In each successive iteration, a cluster is split into smaller clusters, until
eventually each object is in one cluster, or a termination condition holds.
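A brief agglomerative (bottom-up) example, assuming SciPy is available; the
2-D points are invented.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Six invented 2-D objects forming two visually obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

# Bottom-up merging: each object starts as its own group, and the closest
# groups are merged step by step (single-linkage distance here).
Z = linkage(X, method="single")

# Cut the hierarchy so that two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]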

Density-based methods: Most partitioning methods cluster objects based on the distance
between objects. Such methods can find only spherical-shaped clusters and encounter
difficulty in discovering clusters of arbitrary shapes. Other clustering methods have been
developed based on the notion of density.
Grid-based methods: Grid-based methods quantize the object space into a finite number
of cells that form a grid structure. All the clustering operations are performed on the grid
structure (i.e., on the quantized space). The main advantage of this approach is its fast
processing time, which is typically independent of the number of data objects and
dependent only on the number of cells in each dimension in the quantized space.
k-means clustering
Algorithm: k-means.
The k-means algorithm for partitioning, where each cluster’s center is
represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters,

D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3)   (re)assign each object to the cluster to which the object is the most
      similar, based on the mean value of the objects in the cluster;
(4)   update the cluster means, that is, calculate the mean value of the
      objects for each cluster;
(5) until no change;
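A plain NumPy sketch of the algorithm above; the data set and k are invented,
and the step numbers in the comments refer to the listing.

import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    # Partition the rows of D into k clusters, each represented by its mean.
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):                                    # (2) repeat
        # (3) (re)assign each object to the cluster with the nearest mean
        distances = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # (4) update the cluster means
        new_centers = np.array([D[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                    # (5) until no change
            break
        centers = new_centers
    return labels, centers

# Example usage with made-up 2-D data.
D = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
labels, centers = k_means(D, k=2)
print(labels)
print(centers)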

