Academic Documents
Professional Documents
Culture Documents
3. Transactional Data
Each record in a transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a web page. A
transaction typically includes a unique transaction identity number (trans
ID) and a list of the items making up the transaction, such as the items
purchased in the transaction.
4. Other Kinds of Data
• Data streams - video surveillance and sensor data, which are continuously
transmitted
• Hypertext and multimedia data - text, image, video, and audio data
• Graph and networked data - social and information networks
• Web - a huge, widely distributed information repository made available by
the Internet
Mining Methodology
Handling uncertainty, noise, or incompleteness of data:
Data often contain noise, errors, exceptions, or uncertainty, or are incomplete. Errors and
noise may confuse the data mining process, leading to the derivation of erroneous
patterns. Data cleaning, data preprocessing, outlier detection and removal, and
uncertainty reasoning are examples of techniques that need to be integrated with the data
mining process.
Pattern evaluation: What makes a pattern interesting may vary from user to user. Therefore,
techniques are needed to assess the interestingness of discovered patterns based on subjective measures.
User Interaction
Interactive mining: The data mining process should be highly interactive. Thus, it is
important to build flexible user interfaces and an exploratory mining environment,
facilitating the user’s interaction with the system.
Incorporation of background knowledge: Background knowledge, constraints, rules, and
other information regarding the domain under study should be incorporated
2. Getting to know your data
Data Objects and Attribute Types
A data object represents an entity. Data objects are typically described by
attributes. Data objects can also be referred to as samples, examples, instances,
data points, or objects. If the data objects are stored in a database, they are data
tuples. That is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.
An attribute is a data field, representing a characteristic or feature of a data
object.
The type of an attribute is determined by the set of possible values—nominal,
binary, ordinal, or numeric—the attribute can have.
1. Nominal Attributes
The values of a nominal attribute are symbols or names of things. Each
value represents some kind of category, code, or state, and so nominal
attributes are also referred to as categorical.
2. Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0
or 1, where 0 typically means that the attribute is absent, and 1 means that it
is present. Binary attributes are referred to as Boolean if the two states
correspond to true and false.
3. Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
4. Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values.
• Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The
values of interval-scaled attributes have order and can be positive, 0, or
negative. Thus, in addition to providing a ranking of values, such attributes
allow us to compare and quantify the difference between values.
• Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
That is, if a measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value. In addition, the values are ordered, and
we can also compute the difference between values, as well as the mean,
median, and mode.
5. Discrete Attributes
A discrete attribute has a finite or countably infinite set of values, which
may or may not be represented as integers.
6. Continuous Attributes
If an attribute is not discrete, it is continuous. Continuous attributes are
typically represented as floating-point variables.
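To make the attribute types above concrete, here is a small illustrative data object in Python. The attribute names and values are hypothetical, chosen only to show what each type permits:

```python
# One data object (a "tuple"), described by attributes of each type
# (all names and values here are illustrative):
person = {
    "hair_color": "black",   # nominal: a category or code; equality tests only
    "smoker": 0,             # binary: 0 = absent, 1 = present
    "grade": "B+",           # ordinal: ordered (A > B+ > B ...), gaps unknown
    "temperature_c": 36.6,   # interval-scaled: differences meaningful, no true zero
    "height_cm": 172.0,      # ratio-scaled: true zero, so ratios are meaningful
}

# Only ratio-scaled values support statements like "twice as tall":
ratio = person["height_cm"] / 86.0  # → 2.0
```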
Data Visualisation
Similarity
3. Data Preprocessing
Data have quality if they satisfy the requirements of the intended use. There are
many factors comprising data quality, including accuracy, completeness,
consistency, timeliness, believability, and interpretability.
1. Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data
cleaning (or data cleansing) routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct inconsistencies in
the data.
1) Missing Values
A. Ignore the tuple
B. Fill in the missing value manually
C. Use a global constant to fill in the missing
value
D. Use a measure of central tendency for the
attribute
E. Use the attribute mean or median for all
samples belonging to the same class as
the given tuple
F. Use the most probable value to fill in the missing value.
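Strategies D and E above can be sketched in Python. The records, attribute names, and helper functions below are illustrative, not a standard library API:

```python
# Illustrative records with a missing "income" value in each class.
records = [
    {"income": 30.0, "class": "low"},
    {"income": None, "class": "low"},
    {"income": 90.0, "class": "high"},
    {"income": None, "class": "high"},
    {"income": 50.0, "class": "low"},
]

def fill_with_mean(records, attr):
    """Strategy D: fill missing values with the attribute's overall mean."""
    known = [r[attr] for r in records if r[attr] is not None]
    mean = sum(known) / len(known)
    return [{**r, attr: mean if r[attr] is None else r[attr]} for r in records]

def fill_with_class_mean(records, attr, label):
    """Strategy E: fill missing values with the mean over tuples of the same class."""
    out = []
    for r in records:
        if r[attr] is None:
            same = [s[attr] for s in records
                    if s[label] == r[label] and s[attr] is not None]
            r = {**r, attr: sum(same) / len(same)}
        out.append(r)
    return out
```

Strategy E preserves per-class structure better than D: here the missing "low" income becomes 40.0 (the mean of 30.0 and 50.0) rather than the global mean.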
2) Noisy Data
Noise is a random error or variance in a measured variable. Data
Smoothing techniques:
A. Binning
Binning methods smooth a sorted data value by consulting its
“neighbourhood,” that is, the values around it. The sorted values
are distributed into a number of “buckets,” or bins.
B. Regression
Data smoothing can also be done by regression, a technique that con-
forms data values to a function.
C. Outlier analysis
Outliers may be detected by clustering, for example, where similar
values are organised into groups, or “clusters.” Intuitively, values that
fall outside of the set of clusters may be considered outliers.
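The binning technique (A) can be sketched as follows: sort the values, split them into equal-frequency bins, and replace each value by its bin mean. The helper below is a hypothetical sketch, and the price list is illustrative:

```python
def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: sort the values, partition them into bins
    of bin_size, and smooth each value to its bin's mean."""
    vals = sorted(values)
    smoothed = []
    for i in range(0, len(vals), bin_size):
        bin_ = vals[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smooth_by_bin_means(prices, bin_size=3)
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```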
2. Data Integration
1) Entity Identification Problem
When we have multiple data sources, real-world entities need to be
matched up.
2) Redundancy and Correlation Analysis
An attribute may be redundant if it can be “derived” from another
attribute or set of attributes. Inconsistencies in attribute or dimension
naming can also cause redundancies in the resulting data set. Such
redundancies can be detected by correlation analysis.
3) Tuple Duplication
Duplication should also be detected at the tuple level (e.g., where there
are two or more identical tuples for a given unique data entry case).
4) Data Value Conflict Detection and Resolution
For the same real-world entity, attribute values from different sources
may differ. This may be due to differences in representation, scaling, or
encoding.
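Redundancy between two numeric attributes (point 2 above) is commonly checked with the correlation coefficient: a value near ±1 suggests one attribute is largely derivable from the other. A minimal Pearson-correlation sketch, with illustrative data (the same height recorded in two units):

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two numeric attributes.
    Values near +1 or -1 indicate a strong linear relationship,
    flagging a likely redundant attribute."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

height_cm = [150, 160, 170, 180]
height_in = [59.1, 63.0, 66.9, 70.9]  # same quantity, different unit
r = pearson_r(height_cm, height_in)   # close to 1.0 → likely redundant
```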
3. Data reduction
Data reduction techniques can be applied to obtain a reduced representation
of the data set that is much smaller in volume, yet closely maintains the
integrity of the original data.
1) Dimensionality reduction is the process of reducing the number of
random variables or attributes under consideration.
A. Wavelet Transforms
B. Principal Components Analysis
2) Attribute Subset Selection
Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions)
3) Regression and Log-Linear Models: Parametric Data Reduction
4) Histogram: Histograms use binning to approximate data distributions
and are a popular form of data reduction.
5) Clustering
6) Sampling
7) Data Cube Aggregation
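Among the strategies above, sampling is the easiest to sketch. The routine below draws a simple random sample of size s in one pass (reservoir sampling), so the data set size need not be known in advance; the function name and data are illustrative:

```python
import random

def reservoir_sample(stream, s):
    """One-pass simple random sample of size s: each item in the stream
    ends up in the reservoir with equal probability."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < s:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = random.randrange(i + 1)     # uniform index in [0, i]
            if j < s:
                reservoir[j] = item         # replace with decreasing probability
    return reservoir

random.seed(1)
sample = reservoir_sample(range(10_000), 5)
```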
4. Data Transformation and Data Discretisation
In data transformation, the data are transformed or consolidated into
forms appropriate for mining. Strategies for data transformation include
the following:
1) Smoothing, which works to remove noise from the data. Techniques
include binning, regression, and clustering.
2) Attribute construction (or feature construction), where new
attributes are constructed and added from the given set of attributes to
help the mining process.
3) Aggregation, where summary or aggregation operations are applied
to the data. For example, the daily sales data may be aggregated so as
to compute monthly and annual total amounts. This step is typically
used in constructing a data cube for data analysis at multiple
abstraction levels.
4) Normalisation, where the attribute data are scaled so as to fall within
a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
5) Discretisation, where the raw values of a numeric
attribute (e.g., age) are replaced by interval labels (e.g., 0–10,
11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in
turn, can be recursively organised into higher-level concepts,
resulting in a concept hierarchy for the numeric attribute. Figure 3.12
shows a concept hierarchy for the attribute price. More than one
concept hierarchy can be defined for the same attribute to
accommodate the needs of various users.
6) Concept hierarchy generation for nominal data, where attributes
such as street can be generalised to higher-level concepts, like city or
country. Many hierarchies for nominal attributes are implicit within
the database schema and can be automatically defined at the schema
definition level.
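Strategies 4 (normalisation) and 5 (discretisation) can be sketched as follows; the ranges, cut points, and labels below are illustrative, not taken from the text:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def discretize_age(age):
    """Replace a raw age by a conceptual label (cut points are illustrative)."""
    if age <= 20:
        return "youth"
    elif age <= 60:
        return "adult"
    return "senior"

normalized = min_max_normalize([20, 30, 40, 60])       # → [0.0, 0.25, 0.5, 1.0]
labels = [discretize_age(a) for a in (15, 35, 70)]     # → ['youth', 'adult', 'senior']
```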
Cluster Analysis
Cluster analysis or simply clustering is the process of partitioning a set of data objects
(or observations) into subsets. Each subset is a cluster, such that objects in a cluster are
similar to one another, yet dissimilar to objects in other clusters. The set of clusters
resulting from a cluster analysis can be referred to as a clustering. (aka data
segmentation)
Requirements for clustering include the following:
1. Ability to deal with different types of attributes: Many algorithms are designed to
cluster numeric (interval-based) data. However, applications may require clustering
other data types, such as binary, nominal (categorical), and ordinal data, or mixtures
of these data types. Recently, more and more applications need clustering techniques
for complex data types such as graphs, sequences, images, and documents.
2. Ability to deal with noisy data: Most real-world data sets contain outliers and/or
missing, unknown, or erroneous data. Sensor readings, for example, are often noisy—
some readings may be inaccurate due to the sensing mechanisms, and some readings
may be erroneous due to interferences from surrounding transient objects. Clustering
algorithms can be sensitive to such noise and may produce poor-quality clusters.
Therefore, we need clustering methods that are robust to noise.
3. Incremental clustering and insensitivity to input order: In many applications,
incremental updates (representing newer data) may arrive at any time. Some
clustering algorithms cannot incorporate incremental updates into existing clustering
structures and, instead, have to recompute a new clustering from scratch. Clustering
algorithms may also be sensitive to the input data order. That is, given a set of data
objects, clustering algorithms may return dramatically different clusterings depending
on the order in which the objects are presented. Incremental clustering algorithms and
algorithms that are insensitive to the input order are needed.
The following are orthogonal aspects with which clustering methods can be compared:
1. The partitioning criteria: In some methods, all the objects are partitioned so that no
hierarchy exists among the clusters. That is, all the clusters are at the same level
conceptually. Such a method is useful, for example, for partitioning customers into
groups so that each group has its own manager. Alternatively, other methods partition
data objects hierarchically, where clusters can be formed at different semantic levels.
For example, in text mining, we may want to organize a corpus of documents into
multiple general topics, such as “politics” and “sports,” each of which may have
subtopics. For instance, “football,” “basketball,” “baseball,” and “hockey” can exist
as subtopics of “sports.” The latter four subtopics are at a lower level in the hierarchy
than “sports.”
2. Separation of clusters: Some methods partition data objects into mutually exclusive
clusters. When clustering customers into groups so that each group is taken care of by
one manager, each customer may belong to only one group. In some other situations,
the clusters may not be exclusive, that is, a data object may belong to more than one
cluster. For example, when clustering documents into topics, a document may be
related to multiple topics. Thus, the topics as clusters may not be exclusive.
3. Similarity measure: Some methods determine the similarity between two objects by
the distance between them. Such a distance can be defined on Euclidean space, a road
network, a vector space, or any other space. In other methods, the similarity may be
defined by connectivity based on density or contiguity, and may not rely on the
absolute distance between two objects. Similarity measures play a fundamental role in
the design of clustering methods. While distance-based methods can often take
advantage of optimization techniques, density- and contiguity-based methods can
often find clusters of arbitrary shape.
4. Clustering space: Many clustering methods search for clusters within the entire
given data space. These methods are useful for low-dimensionality data sets. With
high-dimensional data, however, there can be many irrelevant attributes, which can
make similarity measurements unreliable. Consequently, clusters found in the full
space are often meaningless. It’s often better to instead search for clusters within
different subspaces of the same data set. Subspace clustering discovers clusters and
subspaces (often of low dimensionality) that manifest object similarity.
The agglomerative approach, also called the bottom-up approach, starts with each object
forming a separate group. It successively merges the objects or groups close to one
another, until all the groups are merged into one (the topmost level of the hierarchy), or a
termination condition holds.
The divisive approach, also called the top-down approach, starts with all the objects in the
same cluster. In each successive iteration, a cluster is split into smaller clusters, until
eventually each object is in one cluster, or a termination condition holds.
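The agglomerative approach can be sketched on 1-D points, using single linkage (the distance between the closest pair of points across two clusters) as the merging criterion. The function and the termination condition (a target number of clusters) are illustrative:

```python
def agglomerative(points, target_clusters):
    """Bottom-up clustering sketch: start with each point as its own
    cluster, then repeatedly merge the two closest clusters (single
    linkage) until target_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

agglomerative([1, 2, 9, 10, 25], target_clusters=3)
# → [[1, 2], [9, 10], [25]]
```

The divisive approach would run in the opposite direction, starting from one all-inclusive cluster and splitting it.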
Density-based methods: Most partitioning methods cluster objects based on the distance
between objects. Such methods can find only spherical-shaped clusters and encounter
difficulty in discovering clusters of arbitrary shapes. Other clustering methods have been
developed based on the notion of density.
Grid-based methods: Grid-based methods quantize the object space into a finite number
of cells that form a grid structure. All the clustering operations are performed on the grid
structure (i.e., on the quantized space). The main advantage of this approach is its fast
processing time, which is typically independent of the number of data objects and
dependent only on the number of cells in each dimension in the quantized space.
k-means clustering
Algorithm: k-means.
The k-means algorithm for partitioning, where each cluster’s center is
represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3)     (re)assign each object to the cluster to which the object is the most
        similar, based on the mean value of the objects in the cluster;
(4)     update the cluster means, that is, calculate the mean value of the
        objects for each cluster;
(5) until no change;
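A minimal Python sketch of the pseudocode above, assuming 2-D points; the data set, seed, and function name are illustrative:

```python
import random

def k_means(data, k, max_iters=100):
    """Sketch of the k-means method above for a list of 2-D points."""
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = random.sample(data, k)
    for _ in range(max_iters):                    # (2) repeat
        # (3) (re)assign each object to the cluster whose mean is nearest
        clusters = [[] for _ in range(k)]
        for x, y in data:
            i = min(range(k),
                    key=lambda c: (x - centers[c][0]) ** 2
                                  + (y - centers[c][1]) ** 2)
            clusters[i].append((x, y))
        # (4) update the cluster means
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]                  # keep center if cluster empty
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:                # (5) until no change
            break
        centers = new_centers
    return centers, clusters

random.seed(0)
points = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8.5, 9.5)]
centers, clusters = k_means(points, k=2)
```

Because step (1) chooses initial centers arbitrarily, the resulting clustering can depend on that choice; in practice the algorithm is often run several times and the best result kept.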