Académique Documents
Professionnel Documents
Culture Documents
Comp 150 DW
Chapter 3: Data
Preprocessing
Instructor: Dan Hebert
Data Warehousing/Mining
Chapter 3: Data
Preprocessing
Data cleaning
Data reduction
Summary
Data Warehousing/Mining
Data Warehousing/Mining
Multi-Dimensional Measure of
Data Quality
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
intrinsic, contextual, representational, and
accessibility.
Data Warehousing/Mining
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the
same or similar analytical results
Data discretization
Part of data reduction but with particular importance,
especially for numerical data
Data Warehousing/Mining
Forms of data
preprocessing
Data Warehousing/Mining
Data Cleaning
Data Warehousing/Mining
Missing Data
Data Warehousing/Mining
Name
SSN
Address
Phone #
Date
Acct Total
John Doe
111-223333
1 Main St
Bedford,
Ma
111-2223333
2/12/1999
2200.12
7/15/2000
12000.54
8/22/2001
2000.33
12/22/2002
15333.22
John W. Doe
Bedford,
Ma
John Doe
111-223333
James
Smith
222-334444
2 Oak St
Boston,
Ma
222-3334444
Jim Smith
222-334444
2 Oak St
Boston,
Ma
222-3334444
Jim Smith
222-334444
2 Oak St
Boston,
Ma we
should
222-3334444
How
Data Warehousing/Mining
12333.66
handle this?
9
Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
Data Warehousing/Mining
10
Noisy Data
Data Warehousing/Mining
11
Name
SSN
Address
Phone #
Date
Acct Total
John Doe
111-223333
1 Main St
Bedford,
Ma
111-2223333
2/12/1999
2200.12
John Doe
111-223333
1 Main St
Bedford,
Ma
111-2223333
2/12/1999
2233.67
James
Smith
222-334444
2 Oak St
Boston,
Ma
222-3334444
12/22/2002
15333.22
James
Smith
222-334444
2 Oak St
Boston,
Ma
222-3334444
12/23/2003
15333000.
00
12
Binning method:
first sort data and partition into (equi-depth) bins
then one can smooth by bin means, smooth by
bin median, smooth by bin boundaries, etc.
Clustering
detect and remove outliers
Regression
smooth by fitting the data into regression functions
Data Warehousing/Mining
13
Data Warehousing/Mining
14
15
16
Cluster Analysis
Allowsdetectionandremovalofoutliers
Data Warehousing/Mining
17
Regression
y
Y1
y=x+1
Y1
X1
Linearregressionfindthebestlinetofittwovariablesand
useregressionfunctiontosmoothdata
Data Warehousing/Mining
18
Data
Integration
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real world entities
from multiple data sources, e.g., A.cust-id B.cust-#
Data Warehousing/Mining
19
Handling Redundant
Data in Data
Redundant data occur often when integration of
Integration
multiple databases
The same attribute may have different names in different
databases
One attribute may be a derived attribute in another
table, e.g., annual revenue
Data Warehousing/Mining
20
Correlational
Analysis
A,B
21
Correlational
Analysis Example
RA,B>0correlatedconsiderremovalofAorB
Data Warehousing/Mining
22
Data
Transformation
Attribute/feature construction
New attributes constructed from the given ones to help in
the data mining process
Data Warehousing/Mining
23
Data
Transformation:
min-max normalization
Normalization
v minA
v'
(new _ maxA new _ minA) new _ minA
maxA minA
Data Warehousing/Mining
24
Data
Transformation:
z-score normalization
Normalization
v meanA
v'
stand _ devA
Data Warehousing/Mining
25
Data
Transformation:
normalization by decimal scaling
Normalization
v
v' j
10
Data Warehousing/Mining
26
Data Reduction
Strategies
27
Data Warehousing/Mining
28
Dimensionality Reduction
Data Warehousing/Mining
29
Data
Compression
String compression
Audio/video compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Data Warehousing/Mining
31
Data
Compression
Compressed
Data
Original Data
lossless
Original Data
Approximated
Data Warehousing/Mining
y
s
s
lo
32
Wavelet Transforms
Haar2
Daubechie4
Method:
Length, L, must be an integer power of 2 (padding with 0s, when
necessary)
Each transform has 2 functions: smoothing, difference
Applies to pairs of data, resulting in two set of data of length L/2
Applies two functions recursively, until reaches the desired length
Data Warehousing/Mining
33
Data Warehousing/Mining
34
Numerosity
Reduction
Parametric methods
Assume the data fits some model, estimate
model parameters, store only the parameters,
and discard the data (except possible outliers)
Log-linear models: obtain value at a point in mD space as the product on appropriate marginal
subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
Data Warehousing/Mining
35
Data Warehousing/Mining
36
Histograms
30
25
20
15
10
5
90000
80000
70000
60000
100000
Data Warehousing/Mining
50000
0
10000
35
40000
40
30000
A popular data
reduction technique
Divide data into buckets
and store average
(sum) for each bucket
Can be constructed
optimally in one
dimension using
dynamic programming
Related to quantization
problems.
20000
37
Histograms (continued)
Data Warehousing/Mining
38
Clustering
Data Warehousing/Mining
39
Sampling
Data Warehousing/Mining
40
Sampling (continued)
Alltupleshaveequalprobabilityofselection
R
O
W
SRS le random
t
p
u
o
m
i
h
t
s
i
(
Onceselected,cantbe
w
e
l
p
sam ment) selectedagain
ce
a
l
p
e
r
Raw Data
Data Warehousing/Mining
SRSW
R
(simp
le ran
dom
samp
le
replac with Onceselected,canbe
emen
selectedagain
t)
41
Sampling (continued)
Ifdataisclusteredorstratified,performedasimplerandom
sample(withorwithoutreplacement)ineachclusterorstrata
Raw Data
Data Warehousing/Mining
Cluster/Stratified Sample
42
Hierarchical Reduction
Data Warehousing/Mining
43
Discretization
Discretization:
divide the range of a continuous attribute into
intervals
Some classification algorithms only accept
categorical attributes.
Reduce data size by discretization
Prepare for further analysis
Data Warehousing/Mining
44
Discretization
reduce the number of values for a given
continuous attribute by dividing the range of
the attribute into intervals. Interval labels can
then be used to replace actual data values.
Concept hierarchies
reduce the data by collecting and replacing
low level concepts (such as numeric values for
the attribute age) by higher level concepts
(such as young, middle-aged, or senior).
Data Warehousing/Mining
45
Entropy-based discretization
Data Warehousing/Mining
46
Entropy-Based Discretization
| S 1|
|S|
Ent ( S 1)
|S 2|
|S|
Ent ( S 2)
Data Warehousing/Mining
47
Entropy-Based Discretization
Data Warehousing/Mining
48
Segmentation by natural
partitioning
3-4-5 rule can be used to segment numeric data into
relatively uniform, natural intervals.
* If an interval covers 3, 6, 7 or 9 distinct values at
the most significant digit, partition the range into 3
equi-width intervals
* If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 intervals
* If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals
Data Warehousing/Mining
49
Step 1:
Step 2:
-$351
-$159
Min
msd=1,000
profit
Low=-$1,000
(-$1,000 - 0)
(-$4000 - 0)
(-$2000 -$1000)
Data
Max
High=$2,000
($1,000 - $2,000)
(0 -$ 1,000)
(-$4000 -$5,000)
Step 4:
(-$3000 -$2000)
$4,700
(-$1,000 - $2,000)
Step 3:
(-$4000 -$3000)
$1,838
(0 - $1,000)
(0 $200)
($1,000 $1,200)
($200 $400)
($1,200 $1,400)
($1,400 $1,600)
($400 $600)
($600 $800)
(-$1000 0)
Warehousing/Mining
($800 $1,000)
($2,000 $3,000)
($3,000 $4,000)
($4,000 $5,000)
50
Data Warehousing/Mining
51
Concept hierarchy
generation for categorical
data
Specification of a partial ordering of attributes
Data Warehousing/Mining
52
Concept hierarchy
generation for categorical
Specification of a set of attributes, but not of their partial
data
ordering
Data Warehousing/Mining
53
Specification of a set of
attributes
Concept hierarchy can be automatically
generated based on the number of distinct
values per attribute in the given attribute set.
The attribute with the most distinct values is
placed at the lowest level of the hierarchy.
country
15 distinct values
province_or_ state
65 distinct
values
3567 distinct values
city
Data Warehousing/Mining
street
54
Summary
Data Warehousing/Mining
55
Homework Assignment
Data Warehousing/Mining
56
Data Warehousing/Mining
57
58