
Ch2. Data Preprocessing

[Title-slide diagram: DB → Data Mining → Information]

2.1 Why Preprocess the Data?


Today's real-world databases are highly susceptible
to noisy, missing, and inconsistent data because of
their typically huge size, often several gigabytes or
more.
There are a number of data preprocessing
techniques:

descriptive data summarization


data cleaning
data integration
data transformations
data reduction


2.2 Descriptive Data Summarization


Mean: the average, provided as the aggregate function
avg() in SQL in relational database systems.
Median: the middle value of the ordered set.
Mode: the value that occurs most frequently in the set.
A data set may be unimodal or multimodal (bimodal, trimodal, ...).
For unimodal frequency curves that are moderately
skewed (asymmetrical), we have the following empirical
relation: mean − mode ≈ 3 × (mean − median)
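A minimal sketch of these measures in Python (standard library only; the sample values are invented for illustration):

```python
from statistics import mean, median, mode

values = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample

m = mean(values)     # like SQL's avg()
md = median(values)  # middle value of the ordered set
mo = mode(values)    # most frequent value (21 here)

# Empirical relation for moderately skewed unimodal data:
# mean - mode ≈ 3 * (mean - median)
print(m - mo, 3 * (m - md))
```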

2.3 Data Cleaning: Missing Values


You can use the following methods to fill in
the missing values for these attributes (two of them
are sketched in code after the list):

Ignore the tuple


Fill the missing value manually
Use a global constant to fill in the missing value
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the
same class as the given tuple
Use the most probable value to fill in the missing value
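A hedged sketch of the two mean-based strategies using pandas (the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "income": [30000, None, 52000, None, 41000, 78000],
    "class":  ["low", "low", "mid", "mid", "low", "mid"],
})

# Fill with the overall attribute mean
df["income_filled"] = df["income"].fillna(df["income"].mean())

# Fill with the mean of samples belonging to the same class
df["income_by_class"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
```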

2.3 Data Cleaning: Noisy Data


We can use the following techniques to smooth
out the data and remove the noise:

Binning
Clustering
Combined computer and human inspection
Regression

2.3 Data Cleaning: Noisy Data


Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equal-frequency bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22 , 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
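A small sketch reproducing this example in Python (equal-frequency bins of size 3):

```python
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer boundary
by_bounds = [[min((b[0], b[-1]), key=lambda e: abs(v - e)) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```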

2.3 Data Cleaning: Inconsistent Data


There may be inconsistencies in the data recorded
for some transactions. Some data inconsistencies
may be corrected manually using external
references. For example, errors made at data entry
may be corrected by performing a paper trace.

2.4 Data Integration and Transformation


Data integration: the merging of data from
multiple data stores.
It is likely that your data analysis task will involve
data integration, which combines data from
multiple sources into a coherent data store, as in
data warehousing.
Redundancy is another important issue. An
attribute may be redundant if it can be derived
from another table, such as annual revenue.

2.4 Data Integration and Transformation


Some redundancies can be detected by correlation
analysis.
For example, given two attributes, such analysis
can measure how strongly one attribute implies the
other, based on the available data. The correlation
between attributes A and B can be measured by
r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B}

2.4 Data Integration and Transformation


n: the number of tuples.
\bar{A}, \bar{B}: the respective mean values of A and B.
\sigma_A, \sigma_B: the respective standard deviations of A and B.

The mean of A is \bar{A} = \frac{1}{n}\sum A.
The standard deviation of A is \sigma_A = \sqrt{\frac{\sum (A - \bar{A})^2}{n - 1}}.
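A minimal sketch of this coefficient with numpy (attribute values invented; ddof=1 matches the (n − 1) denominator above):

```python
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # hypothetical attribute A
B = np.array([1.0, 3.0, 5.0, 9.0, 12.0])  # hypothetical attribute B
n = len(A)

r_AB = ((A - A.mean()) * (B - B.mean())).sum() / (
    (n - 1) * A.std(ddof=1) * B.std(ddof=1)
)

# Agrees with numpy's built-in Pearson correlation
assert np.isclose(r_AB, np.corrcoef(A, B)[0, 1])
print(r_AB)
```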

2.4 Data Integration and Transformation


Data transformation can involve the following:
Smoothing, which works to remove the noise from the data.
Aggregation, where summary or aggregation operations are
applied to the data.
Generalization, where low-level or primitive data are
replaced by higher-level concepts through the use of concept
hierarchies.
Normalization, where the attribute data are scaled so as to
fall within a small specified range.
Attribute construction, where new attributes are constructed
and added from the given set of attributes to help the mining
process.

2.4 Data Integration and Transformation


An attribute is normalized by scaling its values so
that they fall within a small specified range, such
as 0.0 to 1.0. Normalization is particularly useful
for classification algorithms involving neural
networks, or for distance measurements such as
nearest-neighbor classification and clustering.

2.4 Data Integration and Transformation


Min-max normalization performs a linear transformation on
the original data.
Suppose that min_A and max_A are the minimum and
maximum values of an attribute A. Min-max normalization
maps a value v of A to v' in the range [new_min_A,
new_max_A] by computing

v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

Min-max normalization preserves the relationships among
the original data values. It will encounter an "out-of-bounds"
error if a future input case for normalization falls
outside of the original data range for A.

2.4 Data Integration and Transformation


Ex:
Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively.
We would like to map income to the range [0.0, 1.0].
By min-max normalization, a value of $73,600 for income
is transformed to

v' = \frac{73600 - 12000}{98000 - 12000}\,(1.0 - 0.0) + 0.0 = 0.716
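A quick sketch of this transformation as a Python function, checked against the worked example:

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Linearly map v from [min_a, max_a] onto [new_min, new_max]
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The income example above: $73,600 in [$12,000, $98,000] -> [0.0, 1.0]
print(round(min_max_normalize(73600, 12000, 98000), 3))  # 0.716
```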

2.5 Data Reduction


Strategies for data reduction include the following:
Data cube aggregation, where aggregation operations
are applied to the data in the construction of a data
cube.
Dimension reduction, where irrelevant, weakly
relevant, or redundant attributes or dimensions may be
detected and removed.
Data compression, where encoding mechanisms are
used to reduce the data set size.

2.5 Data Reduction


Numerosity reduction, where the data are replaced or
estimated by alternative, smaller data representations
such as parametric models (which need to store only
the model parameters instead of the actual data), or
nonparametric methods such as clustering, sampling,
and the use of histograms.
Discretization and concept hierarchy generation, where
raw data values for attributes are replaced by ranges or
higher conceptual levels.

2.5 Data Reduction


Data cube aggregation (illustrative figure omitted)

2.5 Data Reduction


Dimension reduction
Basic heuristic methods of attribute subset
selection include the following techniques:
1. Stepwise forward selection: The procedure starts with
an empty set of attributes. The best of the original
attributes is determined and added to the set. At each
subsequent iteration or step, the best of the remaining
original attributes is added to the set (see the sketch
after item 3 below).

2.5 Data Reduction


2. Stepwise backward elimination: The procedure starts
with the full set of attributes. At each step, it removes
the worst attribute remaining in the set.
3. Combination of forward selection and backward
elimination: The stepwise forward selection and
backward elimination methods can be combined so
that, at each step, the procedure selects the best
attribute and removes the worst from among the
remaining attributes.
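A schematic sketch of stepwise forward selection (item 1 above); score is a hypothetical caller-supplied evaluation function, e.g. the accuracy of a classifier trained on the candidate subset:

```python
def forward_select(attributes, score, k):
    # Greedy stepwise forward selection of up to k attributes
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < k:
        # Add the attribute whose inclusion scores best
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward elimination is symmetric: start from the full set and repeatedly drop the attribute whose removal scores best.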

2.5 Data Reduction


Data compression
The discrete wavelet transform (DWT) is a linear signal
processing technique that, when applied to a data vector
D, transforms it to a numerically different vector, D', of
wavelet coefficients. The two vectors have the same
length.
Popular wavelet transforms include the Haar-2,
Daubechies-4, and Daubechies-6 transforms.

2.5 Data Reduction


The method is as follows:
The length, L, of the input data vector must be an integer
power of 2. This condition can be met by padding the data
vector with zeros, as necessary.
Each transform involves applying two functions. The
first applies some data smoothing, such as a sum or
weighted average. The second performs a weighted
difference, which acts to bring out the detailed features
of the data.

2.5 Data Reduction


The two functions are applied to pairs of the input data,
resulting in two sets of length L/2. In general, these
represent a smoothed or low frequency version of the
input data, and the high frequency content of it,
respectively.
The two functions are recursively applied to the sets of
data obtained in the previous iteration, until the resulting
data sets are of length 2.
A selection of values from the data sets obtained in the
above iterations are designated the wavelet coefficients
of the transformed data.
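A minimal sketch of this recursive pairwise scheme using the normalized Haar transform (normalization and padding conventions vary between implementations; here the recursion runs down to length 1 to keep the code short):

```python
import math

def haar_dwt(data):
    # Pad with zeros so the length is an integer power of 2
    n = 1
    while n < len(data):
        n *= 2
    data = list(data) + [0.0] * (n - len(data))

    coeffs = []
    while len(data) > 1:
        # Pairwise smoothing (low freq.) and weighted difference (high freq.)
        smooth = [(data[i] + data[i + 1]) / math.sqrt(2)
                  for i in range(0, len(data), 2)]
        detail = [(data[i] - data[i + 1]) / math.sqrt(2)
                  for i in range(0, len(data), 2)]
        coeffs = detail + coeffs  # collect detail coefficients
        data = smooth             # recurse on the smoothed half
    return data + coeffs          # overall average first

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))  # invented 8-point input
```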

2.5 Data Reduction


Ex: Discrete Wavelet Transformation (figure slides; images omitted)

Space domain → frequency domain; the simple approach uses the
Haar function: pairwise addition yields the low-frequency (smoothed)
component, pairwise difference yields the high-frequency (detail)
component.

Remaining slide titles: Low and High Frequency; Haar Function →
DWT; Horizontal Segmentation; Vertical Segmentation; Two-Stage
DWT; Three-Stage DWT; DWT for Video Retrieval; Video Database
Indexing (coefficient quantization).

2.5 Data Reduction


Numerosity reduction
Can we reduce the data volume by choosing alternative,
smaller forms of data representation?
Parametric methods: a model is used to estimate the
data so that typically only the model parameters need be
stored, instead of the actual data.
Nonparametric methods: store reduced representations
of the data; they include histograms, clustering,
and sampling.

2.5 Data Reduction


Regression and log-linear models
Linear regression: the data are modeled to fit a straight line.
A random variable, Y (called a response variable), can be modeled
as a linear function of another random variable, X (called a
predictor variable), with the equation

Y = \alpha + \beta X

where \alpha and \beta are regression coefficients and the variance
of Y is assumed to be constant (see the least-squares sketch after
this list).
Log-linear models: approximate discrete multidimensional
probability distributions. The method can be used to estimate the
probability of each cell in a base cuboid for a set of discretized
attributes, based on the smaller cuboids making up the data cube
lattice.
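A small sketch of fitting such a line by least squares with numpy (data invented; only \alpha and \beta need to be stored, not the points themselves):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor variable
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # response variable

# Least-squares estimates for Y = alpha + beta * X
beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()

print(alpha, beta)  # the stored model parameters
```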

2.5 Data Reduction


Numerosity reduction
Histograms use binning to approximate data
distributions and are a popular form of data reduction.
A histogram for an attribute A partitions the data
distribution of A into disjoint subsets, or buckets.

2.5 Data Reduction

[Figure: an equiwidth histogram for price, where values are
aggregated so that each bucket has a uniform width of $10
(buckets 1-10, 11-20, 21-30).]
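A quick sketch of building such an equiwidth histogram with numpy (the price values are invented):

```python
import numpy as np

prices = np.array([1, 5, 8, 12, 14, 14, 18, 21, 25, 28, 30])  # hypothetical
edges = [1, 11, 21, 31]                    # uniform bucket width of $10
counts, _ = np.histogram(prices, bins=edges)

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo}-{hi - 1}: {c}")           # 1-10: 3, 11-20: 4, 21-30: 4
```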

2.5 Data Reduction


Discretization and concept hierarchy generation
Discretization techniques can be used to reduce the
number of values for a given continuous attribute, by
dividing the range of the attribute into intervals.
Interval labels can then be used to replace actual data
values.
A concept hierarchy for a given numeric attribute
defines a discretization of the attribute. Concept
hierarchies can be used to reduce the data by collecting
and replacing low-level concepts (such as numeric
values for the attribute age) by higher-level concepts
(such as young, middle-aged, or senior).
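A tiny sketch of such an age hierarchy as a mapping from numeric values to higher-level concepts (the cut-offs are invented):

```python
def age_concept(age):
    # Map a numeric age onto a higher-level concept (illustrative cut-offs)
    if age < 40:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

print([age_concept(a) for a in (23, 45, 71)])  # ['young', 'middle-aged', 'senior']
```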

2.5 Data Reduction


Discretization and concept hierarchy
generation for numeric data

Binning
Histogram analysis
Cluster analysis
Entropy-based discretization (sketched in code after this list)
Segmentation by natural partitioning
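A sketch of the entropy-based idea for a single split: choose the boundary that minimizes the weighted class entropy of the two resulting intervals (toy data; the full method applies this recursively):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    # Try every boundary between neighbouring sorted values
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        cost = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(pairs)
        if best is None or cost < best[0]:
            best = (cost, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best[1]

ages = [21, 25, 33, 40, 45, 52, 60, 68]                       # invented values
buys = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"]  # class labels
print(best_split(ages, buys))  # 36.5: separates the two pure intervals
```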
