
Ch2. Data Preprocessing

[Title-slide diagram: DB → Data Mining → Information]

2.1 Why Preprocess the Data?


Today's real-world databases are highly susceptible
to noisy, missing, and inconsistent data because of
their typically huge size, often several gigabytes or
more.
There are a number of data preprocessing
techniques:

descriptive data summarization


data cleaning
data integration
data transformations
data reduction


2.2 Descriptive Data Summarization


Mean: the average, provided as the aggregate function
avg() in SQL in relational database systems.
Median: the middle value of the ordered set.
Mode: the value that occurs most frequently in the set.
A data set may be unimodal or multimodal (bimodal, trimodal, ...).
For unimodal frequency curves that are moderately
skewed (asymmetrical), we have the following empirical
relation: mean − mode ≈ 3 × (mean − median)
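A minimal sketch of these measures in Python (standard library only; the sample values are invented for illustration):

```python
from statistics import mean, median, mode

values = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample

m = mean(values)     # like SQL's avg()
md = median(values)  # middle value of the ordered set
mo = mode(values)    # most frequent value (21 here)

# Empirical relation for moderately skewed unimodal data:
# mean - mode ≈ 3 * (mean - median)
print(m - mo, 3 * (m - md))
```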

2.3 Data Cleaning: Missing Values


You can use the following methods to fill in
the missing values for these attributes (two of them
are sketched in code after the list):

Ignore the tuple


Fill the missing value manually
Use a global constant to fill in the missing value
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the
same class as the given tuple
Use the most probable value to fill in the missing value
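A hedged sketch of the two mean-based strategies using pandas (the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "income": [30000, None, 52000, None, 41000, 78000],
    "class":  ["low", "low", "mid", "mid", "low", "mid"],
})

# Fill with the overall attribute mean
df["income_filled"] = df["income"].fillna(df["income"].mean())

# Fill with the mean of samples belonging to the same class
df["income_by_class"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
```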

2.3 Data Cleaning: Noisy Data


We can use the following techniques to smooth
out the data and remove the noise:

Binning
Clustering
Combined computer and human inspection
Regression

2.3 Data Cleaning: Noisy Data


Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equal-frequency bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22 , 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
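A small sketch reproducing this example in Python (equal-frequency bins of size 3):

```python
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer boundary
by_bounds = [[min((b[0], b[-1]), key=lambda e: abs(v - e)) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```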

2.3 Data Cleaning: Inconsistent Data


There may be inconsistencies in the data recorded
for some transactions. Some data inconsistencies
may be corrected manually using external
references. For example, errors made at data entry
may be corrected by performing a paper trace.

2.4 Data Integration and Transformation


Data integration: the merging of data from
multiple data stores.
It is likely that your data analysis task will involve
data integration, which combines data from
multiple sources into a coherent data store, as in
data warehousing.
Redundancy is another important issue. An
attribute may be redundant if it can be derived
from another table, such as annual revenue.

2.4 Data Integration and Transformation


Some redundancies can be detected by correlation
analysis.
For example, given two attributes, such analysis
can measure how strongly one attribute implies the
other, based on the available data. The correlation
between attributes A and B can be measured by
r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B}

2.4 Data Integration and Transformation


n: the number of tuples.
\bar{A}, \bar{B}: the respective mean values of A and B.
\sigma_A, \sigma_B: the respective standard deviations of A and B.

The mean of A is \bar{A} = \frac{1}{n}\sum A.
The standard deviation of A is \sigma_A = \sqrt{\frac{\sum (A - \bar{A})^2}{n - 1}}.
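A minimal sketch of this coefficient with numpy (attribute values invented; ddof=1 matches the (n − 1) denominator above):

```python
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # hypothetical attribute A
B = np.array([1.0, 3.0, 5.0, 9.0, 12.0])  # hypothetical attribute B
n = len(A)

r_AB = ((A - A.mean()) * (B - B.mean())).sum() / (
    (n - 1) * A.std(ddof=1) * B.std(ddof=1)
)

# Agrees with numpy's built-in Pearson correlation
assert np.isclose(r_AB, np.corrcoef(A, B)[0, 1])
print(r_AB)
```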

2.4 Data Integration and Transformation


Data transformation can involve the following:
Smoothing, which works to remove the noise from the data.
Aggregation, where summary or aggregation operations are
applied to the data.
Generalization, where low-level or primitive data are
replaced by higher-level concepts through the use of concept
hierarchies.
Normalization, where the attribute data are scaled so as to
fall within a small specified range.
Attribute construction, where new attributes are constructed
and added from the given set of attributes to help the mining
process.

2.4 Data Integration and Transformation


An attribute is normalized by scaling its values so
that they fall within a small specified range, such
as 0.0 to 1.0. Normalization is particularly useful
for classification algorithms involving neural
networks, or for distance measurements such as
nearest-neighbor classification and clustering.

2.4 Data Integration and Transformation


Min-max normalization performs a linear transformation on
the original data.
Suppose that min_A and max_A are the minimum and
maximum values of an attribute A. Min-max normalization
maps a value v of A to v' in the range [new_min_A,
new_max_A] by computing

v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

Min-max normalization preserves the relationships among
the original data values. It will encounter an "out-of-bounds"
error if a future input case for normalization falls
outside of the original data range for A.

2.4 Data Integration and Transformation


Ex:
Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively.
We would like to map income to the range [0.0, 1.0].
By min-max normalization, a value of $73,600 for income
is transformed to

v' = \frac{73600 - 12000}{98000 - 12000}\,(1.0 - 0.0) + 0.0 = 0.716
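A quick sketch of this transformation as a Python function, checked against the worked example:

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Linearly map v from [min_a, max_a] onto [new_min, new_max]
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The income example above: $73,600 in [$12,000, $98,000] -> [0.0, 1.0]
print(round(min_max_normalize(73600, 12000, 98000), 3))  # 0.716
```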

2.5 Data Reduction


Strategies for data reduction include the following:
Data cube aggregation, where aggregation operations
are applied to the data in the construction of a data
cube.
Dimension reduction, where irrelevant, weakly
relevant, or redundant attributes or dimensions may be
detected and removed.
Data compression, where encoding mechanisms are
used to reduce the data set size.

2.5 Data Reduction


Numerosity reduction, where the data are replaced or
estimated by alternative, smaller data representations
such as parametric models (which need to store only
the model parameters instead of the actual data), or
nonparametric methods such as clustering, sampling,
and the use of histograms.
Discretization and concept hierarchy generation, where
raw data values for attributes are replaced by ranges or
higher conceptual levels.

2.5 Data Reduction


Data cube aggregation (illustrative figure omitted)

2.5 Data Reduction


Dimension reduction
Basic heuristic methods of attribute subset
selection include the following techniques:
1. Stepwise forward selection: The procedure starts with
an empty set of attributes. The best of the original
attributes is determined and added to the set. At each
subsequent iteration or step, the best of the remaining
original attributes is added to the set (see the sketch
after item 3 below).

2.5 Data Reduction


2. Stepwise backward elimination: The procedure starts
with the full set of attributes. At each step, it removes
the worst attribute remaining in the set.
3. Combination of forward selection and backward
elimination: The stepwise forward selection and
backward elimination methods can be combined so
that, at each step, the procedure selects the best
attribute and removes the worst from among the
remaining attributes.
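A schematic sketch of stepwise forward selection (item 1 above); score is a hypothetical caller-supplied evaluation function, e.g. the accuracy of a classifier trained on the candidate subset:

```python
def forward_select(attributes, score, k):
    # Greedy stepwise forward selection of up to k attributes
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < k:
        # Add the attribute whose inclusion scores best
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward elimination is symmetric: start from the full set and repeatedly drop the attribute whose removal scores best.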

2.5 Data Reduction


Data compression
The discrete wavelet transform (DWT) is a linear signal
processing technique that, when applied to a data vector
D, transforms it to a numerically different vector, D', of
wavelet coefficients. The two vectors have the same
length.
Popular wavelet transforms include the Haar-2,
Daubechies-4, and Daubechies-6 transforms.

2.5 Data Reduction


The method is as follows:
The length, L, of the input data vector must be an integer
power of 2. This condition can be met by padding the data
vector with zeros, as necessary.
Each transform involves applying two functions. The
first applies some data smoothing, such as a sum or
weighted average. The second performs a weighted
difference, which acts to bring out the detailed features
of the data.

2.5 Data Reduction


The two functions are applied to pairs of the input data,
resulting in two sets of length L/2. In general, these
represent a smoothed or low frequency version of the
input data, and the high frequency content of it,
respectively.
The two functions are recursively applied to the sets of
data obtained in the previous iteration, until the resulting
data sets are of length 2.
A selection of values from the data sets obtained in the
above iterations are designated the wavelet coefficients
of the transformed data.
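A minimal sketch of this recursive pairwise scheme using the normalized Haar transform (normalization and padding conventions vary between implementations; here the recursion runs down to length 1 to keep the code short):

```python
import math

def haar_dwt(data):
    # Pad with zeros so the length is an integer power of 2
    n = 1
    while n < len(data):
        n *= 2
    data = list(data) + [0.0] * (n - len(data))

    coeffs = []
    while len(data) > 1:
        # Pairwise smoothing (low freq.) and weighted difference (high freq.)
        smooth = [(data[i] + data[i + 1]) / math.sqrt(2)
                  for i in range(0, len(data), 2)]
        detail = [(data[i] - data[i + 1]) / math.sqrt(2)
                  for i in range(0, len(data), 2)]
        coeffs = detail + coeffs  # collect detail coefficients
        data = smooth             # recurse on the smoothed half
    return data + coeffs          # overall average first

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))  # invented 8-point input
```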

2.5 Data Reduction


Ex: Discrete Wavelet Transformation (figure slides; images omitted)

Space domain → frequency domain; the simple approach uses the
Haar function: pairwise addition yields the low-frequency (smoothed)
component, pairwise difference yields the high-frequency (detail)
component.

Remaining slide titles: Low and High Frequency; Haar Function →
DWT; Horizontal Segmentation; Vertical Segmentation; Two-Stage
DWT; Three-Stage DWT; DWT for Video Retrieval; Video Database
Indexing (coefficient quantization).

2.5 Data Reduction


Numerosity reduction
Can we reduce the data volume by choosing alternative,
smaller forms of data representation?
Parametric methods: a model is used to estimate the
data so that typically only the model parameters need be
stored, instead of the actual data.
Nonparametric methods: store reduced representations
of the data; they include histograms, clustering,
and sampling.

2.5 Data Reduction


Regression and log-linear models
Linear regression: the data are modeled to fit a straight line.
A random variable, Y (called a response variable), can be modeled
as a linear function of another random variable, X (called a
predictor variable), with the equation

Y = \alpha + \beta X

where \alpha and \beta are regression coefficients and the variance
of Y is assumed to be constant (see the least-squares sketch after
this list).
Log-linear models: approximate discrete multidimensional
probability distributions. The method can be used to estimate the
probability of each cell in a base cuboid for a set of discretized
attributes, based on the smaller cuboids making up the data cube
lattice.
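A small sketch of fitting such a line by least squares with numpy (data invented; only \alpha and \beta need to be stored, not the points themselves):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor variable
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # response variable

# Least-squares estimates for Y = alpha + beta * X
beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()

print(alpha, beta)  # the stored model parameters
```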

2.5 Data Reduction


Numerosity reduction
Histograms use binning to approximate data
distributions and are a popular form of data reduction.
A histogram for an attribute A partitions the data
distribution of A into disjoint subsets, or buckets.

2.5 Data Reduction

[Figure: an equiwidth histogram for price, where values are
aggregated so that each bucket has a uniform width of $10
(buckets 1-10, 11-20, 21-30).]
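A quick sketch of building such an equiwidth histogram with numpy (the price values are invented):

```python
import numpy as np

prices = np.array([1, 5, 8, 12, 14, 14, 18, 21, 25, 28, 30])  # hypothetical
edges = [1, 11, 21, 31]                    # uniform bucket width of $10
counts, _ = np.histogram(prices, bins=edges)

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo}-{hi - 1}: {c}")           # 1-10: 3, 11-20: 4, 21-30: 4
```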

2.5 Data Reduction


Discretization and concept hierarchy generation
Discretization techniques can be used to reduce the
number of values for a given continuous attribute, by
dividing the range of the attribute into intervals.
Interval labels can then be used to replace actual data
values.
A concept hierarchy for a given numeric attribute
defines a discretization of the attribute. Concept
hierarchies can be used to reduce the data by collecting
and replacing low-level concepts (such as numeric
values for the attribute age) by higher-level concepts
(such as young, middle-aged, or senior).
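A tiny sketch of such an age hierarchy as a mapping from numeric values to higher-level concepts (the cut-offs are invented):

```python
def age_concept(age):
    # Map a numeric age onto a higher-level concept (illustrative cut-offs)
    if age < 40:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

print([age_concept(a) for a in (23, 45, 71)])  # ['young', 'middle-aged', 'senior']
```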

2.5 Data Reduction


Discretization and concept hierarchy
generation for numeric data

Binning
Histogram analysis
Cluster analysis
Entropy-based discretization (sketched in code after this list)
Segmentation by natural partitioning
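A sketch of the entropy-based idea for a single split: choose the boundary that minimizes the weighted class entropy of the two resulting intervals (toy data; the full method applies this recursively):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    # Try every boundary between neighbouring sorted values
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        cost = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(pairs)
        if best is None or cost < best[0]:
            best = (cost, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best[1]

ages = [21, 25, 33, 40, 45, 52, 60, 68]                       # invented values
buys = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"]  # class labels
print(best_split(ages, buys))  # 36.5: separates the two pure intervals
```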
