
Data Preprocessing



Why Data Preprocessing?

 Data in the real world is dirty


– incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
 e.g., occupation=“ ”
– noisy: containing errors or outliers
 e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes
or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records



What is Data?

 A collection of data objects and their attributes

 An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– An attribute is also known as a variable, field, characteristic, or feature

 A collection of attributes describes an object
– An object is also known as a record, point, case, sample, entity, or instance

[Figure: a sample data table in which each row (object) is identified by a Tid and each column is an attribute]
The different types of
attributes
 The following properties (operations) of numbers are typically used
to describe attributes:

– 1. Distinctness: =, ≠
– 2. Order: <, <=, >, >=
– 3. Addition: +, −
– 4. Multiplication: *, /

Keeping these properties in mind, attributes can be categorized as:

Attributes → Nominal, Ordinal, Interval, Ratio


Types of Attributes
 There are different types of attributes
– Nominal (categorical or qualitative: =, ≠)
 The values of a nominal attribute are just different names.
 Nominal values provide only enough information to
distinguish one object from the other.
 Examples: ID numbers, eye color, zip codes
– Ordinal (categorical or qualitative (<,>))
 Provide enough information to order the objects.
 Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
– Interval (quantitative or numeric (+,-))
 The differences between the attribute values are
meaningful.
 Examples: calendar dates, temperatures in Celsius or
Fahrenheit



– Ratio (quantitative or numeric (*,/))
 Both differences and ratios are meaningful.
 A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula
Ae^{Bt} or Ae^{-Bt}
where A and B are positive constants and t represents time.
 Examples: growth of a bacteria, decay of a radioactive
element



Discrete and Continuous
Attributes
 Discrete Attribute
– Has only a finite or countably infinite set of values
– Can be categorical.
– Examples: zip codes, Id numbers, counts, or the set of words in
a collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes

 Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.



Data Quality

 What kinds of data quality problems?


 How can we detect problems with the data?
 What can we do about these problems?

 Examples of data quality problems:


– Noise and outliers
– missing values
– duplicate data



Noise

 Noise refers to modification of original values


– Examples: distortion of a person’s voice when talking on a
poor phone and “snow” on television screen

[Figure: two sine waves (left) and the same two sine waves with added noise (right)]



Outliers

 Outliers are data objects with characteristics that


are considerably different than most of the other
data objects in the data set



Missing Values

 Reasons for missing values


– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

 Handling missing values


– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their
probabilities)



Duplicate Data

 Data set may include data objects that are


duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

 Examples:
– Same person with multiple email addresses

 Data cleaning
– Process of dealing with duplicate data issues
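As a rough illustration, the sketch below uses pandas to remove exact duplicates and near-duplicates; the column names and the choice of "name" as the matching key are hypothetical, not taken from the slides.

```python
import pandas as pd

# Hypothetical customer records merged from two sources
customers = pd.DataFrame({
    "name":  ["J. Smith", "J. Smith", "A. Jones"],
    "email": ["jsmith@a.com", "jsmith@b.com", "ajones@c.com"],
})

# Exact duplicate rows are easy to remove
deduplicated = customers.drop_duplicates()

# Near-duplicates (same person, different email) need a matching key
one_per_person = customers.drop_duplicates(subset="name", keep="first")
print(one_per_person)
```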



Why Preprocess the Data? (continued)
 Reasons for data cleaning
– Incomplete data (missing data)
– Noisy data (contains errors)
– Inconsistent data (containing discrepancies)

 Reasons for data integration


– Data comes from multiple sources

 Reason for data transformation


– Some data must be transformed to be used for mining

 Reasons for data reduction


– Performance

 No quality data → no quality mining results!



Major Tasks in Data
Preprocessing
 1. Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies

 2. Data integration
– Integration of multiple databases, data cubes, or files

 3. Data transformation
– Normalization and aggregation

 4. Data reduction (sampling, dimensionality reduction, feature subset selection)
– Obtains reduced representation in volume but produces the
same or similar analytical results



 5. Data discretization
– Some classification algorithms require the data to be in the form of categorical attributes
– Algorithms that find association patterns require the data to be in the form of binary attributes
– It is therefore often necessary to transform a continuous attribute into a categorical attribute (discretization); a short sketch follows this list
– Part of data reduction but with particular importance,
especially for numerical data
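A minimal sketch of discretizing a continuous attribute, assuming pandas is available; the "age" column, the bin labels, and the binarized category are made-up examples, not from the slides.

```python
import pandas as pd

df = pd.DataFrame({"age": [23, 45, 12, 67, 34, 56, 29, 41]})  # hypothetical data

# Equal-width discretization into 3 categories
df["age_width"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "old"])

# Equal-frequency discretization into 3 categories
df["age_freq"] = pd.qcut(df["age"], q=3, labels=["young", "middle", "old"])

# Binarization of one category, as needed by association-pattern algorithms
df["is_young"] = (df["age_width"] == "young").astype(int)
print(df)
```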



Forms of Data Preprocessing



1. Data Cleaning

 Data cleaning tasks


– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data
integration



1. Data Cleaning: How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective unless the tuple is missing several attribute values
 Fill in the missing value manually: time-consuming and not feasible for large datasets
 Fill it in automatically with one of the following (a short sketch follows):
– a global constant, e.g., “unknown” (which may effectively introduce a new class)
– the attribute mean
– the most probable value: inference-based methods such as a Bayesian formula, regression, or decision tree induction
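The sketch below illustrates the automatic options with pandas; the column names and values are hypothetical, and the "most probable value" strategy is only hinted at in a comment because it needs a fitted model.

```python
import pandas as pd

df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", None],   # hypothetical attribute
    "salary":     [52_000.0, 48_000.0, None, 61_000.0],
})

# Ignore the tuple: drop rows that contain any missing value
dropped = df.dropna()

# Fill with a global constant
df["occupation"] = df["occupation"].fillna("unknown")

# Fill with the attribute mean
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Most probable value: would require an inference model (regression, decision tree, ...)
```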



1. Data Cleaning: How to Handle Noisy Data?

 Noise: a random error or variance in a measured variable.

 Incorrect attribute values may be due to


– faulty data collection

– data entry problems

– data transmission problems

– data conversion errors

– Data decay problems

– technology limitations, e.g. buffer overflow or field size


limits



1. Data Cleaning: How to Handle Noisy Data?

Methods
 Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
 Regression
– smooth by fitting the data into regression functions

 Clustering
– detect and remove outliers

 Combined computer and human inspection


– detect suspicious values and check by human (e.g., deal
with possible outliers)



1. Data Cleaning: Binning Methods
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
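A small NumPy sketch that reproduces the slide's numbers under the assumption of three equal-frequency bins; rounding the bin means to integers is my choice, not stated on the slide.

```python
import numpy as np

prices = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(prices, 3)                      # equal-frequency (equi-depth) bins

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean
by_means = [np.full(len(b), int(round(b.mean()))) for b in bins]
# -> [9 9 9 9], [23 23 23 23], [29 29 29 29]

# Smoothing by bin boundaries: every value snaps to the nearest bin boundary
by_bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins]
# -> [4 4 4 15], [21 21 25 25], [26 26 26 34]
```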



1. Data Cleaning: Regression

•Data can be smoothed by fitting the data to a function, such as with regression.
•Linear regression involves finding the ‘best’ line to fit two variables.
•Also, it is possible to predict one variable using the other variable.

[Figure: a scatter plot with the fitted line y = x + 1; Y1' is the value predicted by the line at X1, replacing the observed value Y1]
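A brief sketch of smoothing by linear regression with NumPy; the synthetic data roughly follows the slide's y = x + 1 line and is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = x + 1.0 + rng.normal(scale=0.5, size=x.size)   # noisy observations around y = x + 1

# Fit the 'best' straight line by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Smoothed values: replace each noisy y with the value predicted by the line
y_smoothed = slope * x + intercept
```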



1. Data Cleaning: Cluster Analysis



2. Data Integration

 Data integration:
– Combines data from multiple sources into a coherent store

 Schema integration: e.g., A.cust-id ≡ B.cust-#


– Integrate metadata from different sources

 Entity identification problem:


– Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
 Detecting and resolving data value conflicts
– For the same real world entity, attribute values from
different sources are different
– Possible reasons: different representations, different scales



Data Integration
: Handling Redundancy in Data
Integration
 Redundant data often occur when integrating multiple databases
– Object identification: The same attribute or object may
have different names in different databases
– Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
 Redundant attributes may be detected by correlation analysis
 Careful integration of the data from multiple
sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and
quality
Data Integration :
Correlation Analysis (Numerical Data)

 Correlation coefficient (also called Pearson’s product


moment coefficient)

r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B} = \frac{\sum(AB) - n\,\bar{A}\,\bar{B}}{(n - 1)\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.

 If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
 r_{A,B} = 0: A and B are uncorrelated; r_{A,B} < 0: A and B are negatively correlated
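A short NumPy check of the formula above; the two attribute vectors are hypothetical, and the sample standard deviation (ddof=1) is used so the (n − 1) factors match the formula.

```python
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical attribute A
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])   # hypothetical attribute B
n = len(A)

# r_{A,B} = sum((A - mean(A)) * (B - mean(B))) / ((n - 1) * sigma_A * sigma_B)
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

# Cross-check against NumPy's built-in Pearson correlation
assert np.isclose(r, np.corrcoef(A, B)[0, 1])
print(r)   # close to +1: strongly positively correlated
```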



Data Integration
: Correlation Analysis (Categorical
Data)
 Χ² (chi-square) test

\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}

 The larger the Χ2 value, the more likely the


variables are related
 The cells that contribute the most to the Χ2 value
are those whose actual count is very different from
the expected count
 Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population



Chi-Square Calculation: An Example

                            Play chess    Not play chess    Sum (row)
Like science fiction          250 (90)         200 (360)          450
Not like science fiction       50 (210)       1000 (840)         1050
Sum (col.)                         300              1200          1500

 Χ² (chi-square) calculation (numbers in parentheses are the expected counts calculated from the data distribution in the two categories)
\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93

 It shows that like_science_fiction and play_chess are


correlated in the group
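The same calculation can be reproduced directly from the observed table; this NumPy sketch derives the expected counts from the row and column sums.

```python
import numpy as np

# Observed counts: rows = {like SF, not like SF}, cols = {play chess, not play chess}
observed = np.array([[250.0, 200.0],
                     [50.0, 1000.0]])

row_sums = observed.sum(axis=1, keepdims=True)   # 450, 1050
col_sums = observed.sum(axis=0, keepdims=True)   # 300, 1200
total = observed.sum()                           # 1500

expected = row_sums * col_sums / total           # [[90, 360], [210, 840]]
chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)   # about 507.94 (the slide rounds it to 507.93)
```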



Data Transformation

 Smoothing: remove noise from data


 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified
range
– min-max normalization
– z-score normalization
– normalization by decimal scaling

 Attribute/feature construction
– New attributes constructed from the given ones



Data Transformation
: Normalization
 Min-max normalization: to [new_min_A, new_max_A]

v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

– Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to \frac{73,600 - 12,000}{98,000 - 12,000}(1.0 - 0) + 0 = 0.716

 Z-score normalization (μ: mean, σ: standard deviation):

v' = \frac{v - \mu_A}{\sigma_A}

– Ex. Let μ = 54,000 and σ = 16,000. Then \frac{73,600 - 54,000}{16,000} = 1.225

 Normalization by decimal scaling

v' = \frac{v}{10^j}, where j is the smallest integer such that Max(|v'|) < 1
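A compact sketch of the three normalizations with NumPy, reusing the slide's income example; apart from the quoted bounds and the value 73,600, the sample values are hypothetical.

```python
import numpy as np

income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])   # hypothetical values

# Min-max normalization to [0.0, 1.0]; 73,600 maps to about 0.716
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization using the slide's mean 54,000 and std 16,000; 73,600 maps to 1.225
z_score = (income - 54_000.0) / 16_000.0

# Decimal scaling: divide by the smallest power of 10 that makes every |v'| < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / 10 ** j                              # j = 5, so 98,000 -> 0.98
```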
Data Reduction Strategies

 Why data reduction?


– A database/data warehouse may store terabytes of data
– Complex data analysis/mining may take a very long time to
run on the complete data set

 Data reduction
– Obtain a reduced representation of the data set that is much
smaller in volume but yet produce the same (or almost the
same) analytical results

 Data reduction strategies


– Aggregation
– Sampling
– Dimensionality Reduction
– Feature subset selection
– Feature creation
– Discretization and Binarization
– Attribute Transformation



Data Reduction : Aggregation

 Combining two or more attributes (or objects) into


a single attribute (or object)

 Purpose
– Data reduction
 Reduce the number of attributes or objects
– Change of scale
 Cities aggregated into regions, states, countries, etc
– More “stable” data
 Aggregated data tends to have less variability
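A small pandas sketch of aggregation in the spirit of the precipitation example on the next slide; the monthly values are randomly generated stand-ins, not real Australian data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
monthly = pd.DataFrame({
    "year": np.repeat(np.arange(2000, 2010), 12),                 # 10 years x 12 months
    "precipitation": rng.gamma(shape=2.0, scale=30.0, size=120),  # hypothetical values
})

# Aggregate 120 monthly objects into 10 yearly objects (change of scale, data reduction)
yearly = monthly.groupby("year")["precipitation"].mean()

# Aggregated data tends to have less variability
print(monthly["precipitation"].std(), yearly.std())
```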



Data Reduction : Aggregation

[Figure: Variation of precipitation in Australia: standard deviation of average monthly precipitation (left) vs. standard deviation of average yearly precipitation (right)]



Data Reduction : Sampling

 Sampling is the main technique employed for data selection.


– It is often used for both the preliminary investigation of the data
and the final data analysis.

 Statisticians sample because obtaining the entire set of data


of interest is too expensive or time consuming.

 Sampling is used in data mining because processing the


entire set of data of interest is too expensive or time
consuming.



Data Reduction : Types of
Sampling
 Simple Random Sampling
– There is an equal probability of selecting any particular item

 Sampling without replacement


– As each item is selected, it is removed from the population

 Sampling with replacement


– Objects are not removed from the population as they are
selected for the sample.
 In sampling with replacement, the same object can be picked more than once
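A minimal NumPy sketch of both sampling schemes; the population of 100 object ids and the sample size of 10 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(100)   # hypothetical data object ids

# Simple random sampling without replacement: each object appears at most once
sample_without = rng.choice(population, size=10, replace=False)

# Simple random sampling with replacement: the same object may be picked more than once
sample_with = rng.choice(population, size=10, replace=True)
```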



Data Reduction
: Dimensionality Reduction
 Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise

 Techniques
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques



Dimensionality Reduction : PCA

 Goal is to find a projection that captures the largest


amount of variation in data

[Figure: 2-D data plotted on axes x1 and x2 with the direction of largest variation indicated]



Dimensionality Reduction : PCA

 Find the eigenvectors of the covariance matrix


 The eigenvectors define the new space

[Figure: the same data shown in the new space defined by the eigenvectors]
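A rough sketch of PCA done exactly as described above (eigenvectors of the covariance matrix of centered data); the 2-D synthetic data is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0],
                                          [1.0, 0.5]])   # correlated 2-D data

# Center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvectors of the covariance matrix define the new space
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
components = eigvecs[:, order]

# Project onto the first principal component (1-D reduced representation)
X_reduced = X_centered @ components[:, :1]
```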



Data Reduction
: Feature Subset Selection
 Another way to reduce dimensionality of data

 Redundant features
– duplicate much or all of the information contained in one
or more other attributes
– Example: purchase price of a product and the amount of
sales tax paid

 Irrelevant features
– contain no information that is useful for the data mining
task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA



Data Reduction
: Feature Subset Selection
 Techniques:
– Brute-force approach:
 Try all possible feature subsets as input to data mining
algorithm
– Filter approaches:
 Features are selected before data mining algorithm is
run
– Wrapper approaches:
 Use the data mining algorithm as a black box to find
best subset of attributes
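A tiny sketch of the filter approach using absolute correlation with the target as the score; the synthetic data, the scoring function, and keeping the top 2 features are all illustrative assumptions rather than a prescribed method.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                    # hypothetical attributes
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.1, size=200)  # target uses features 0 and 2

# Filter approach: score each feature before running the mining algorithm
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Keep the k best-scoring features
top_k = np.argsort(scores)[::-1][:2]
X_selected = X[:, top_k]    # the irrelevant features 1 and 3 are dropped
```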



Data Reduction
: Feature Creation
 Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes

 Three general methodologies:


– Feature Extraction
 domain-specific
– Mapping Data to New Space
– Feature Construction
 combining features



Data Reduction
: Mapping Data to a New Space

 Fourier transform
 Wavelet transform

[Figure: two sine waves, the same two sine waves with noise, and their frequency-domain representation]



Data Reduction
: Discretization Using Class Labels

 Entropy based approach

[Figure: entropy-based discretization producing 3 categories for both x and y (left) and 5 categories for both x and y (right)]



Data Reduction
: Discretization Without Using Class
Labels

[Figure: the original data and its discretization by equal interval width, equal frequency, and K-means]



Data Reduction
: Attribute Transformation
 A function that maps the entire set of values of a
given attribute to a new set of replacement values
such that each old value can be identified with
one of the new values
– Simple functions: x^k, log(x), e^x, |x|
– Standardization and Normalization
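A short sketch of simple attribute transformations with NumPy; the sample values are hypothetical and chosen small enough that e^x stays finite.

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 4.0])   # hypothetical attribute values

transformed = {
    "x^2":    x ** 2,        # x^k with k = 2
    "log(x)": np.log(x),
    "e^x":    np.exp(x),
    "|x|":    np.abs(x),
}

# Standardization (z-score) as another attribute transformation
standardized = (x - x.mean()) / x.std()
```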

