Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
© 2006 Jiawei Han and Micheline Kamber. All rights reserved.
February 19, 2008
Data cleaning
Data reduction
Summary
e.g., occupation=" " (missing value)
e.g., Salary="-10" (erroneous value)
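Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$
Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$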
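As a minimal sketch of the mean and the weighted mean (each weight w_i multiplies its value x_i, and the sum is divided by the total weight), the following may help. The function names and the sample numbers are illustrative assumptions, not taken from the slides.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Illustrative data (hypothetical, not from the slides).
salaries = [30, 36, 47, 50, 52]
weights = [1, 1, 2, 2, 4]

print(mean(salaries))                    # 43.0
print(weighted_mean(salaries, weights))  # 46.8
```

With all weights equal, the weighted mean reduces to the ordinary mean.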
Empirical formula:
$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
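Variance (sample: $s^2$, population: $\sigma^2$):
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$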
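The sample variance (divisor n - 1) and the population variance (divisor N) can be checked numerically. This is a rough sketch with hypothetical function names and made-up data; the second function uses the algebraically equivalent sum-of-squares form, which needs only one pass over the data.

```python
def sample_variance(xs):
    """s^2 = sum((x_i - mean)^2) / (n - 1)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def sample_variance_sums(xs):
    """Same quantity from running sums: (sum(x^2) - (sum x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


def population_variance(xs):
    """sigma^2 = sum((x_i - mu)^2) / N."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n


# Illustrative data (hypothetical).
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))  # 4.0
print(sample_variance(data))
print(sample_variance_sums(data))
```

The sum-of-squares form is convenient for incremental computation, although it can lose precision when the values are large and close together.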
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Missing Data
equipment malfunction
Noisy Data
Binning
first sort data and partition into (equal-frequency)
bins
then one can smooth by bin means, smooth by
bin median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human
(e.g., deal with possible outliers)
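The binning approach listed above (sort, partition into equal-frequency bins, then smooth by bin means or bin boundaries) can be sketched as follows. The function names and the price list are illustrative assumptions, and the sketch assumes there are at least as many values as bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Illustrative data (hypothetical prices).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by bin medians follows the same pattern, with the median of each bin in place of the mean.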
if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B - A)/N.
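A small sketch of equal-width partitioning with W = (B - A)/N, where A and B are the lowest and highest values and N is the number of intervals. The function name and the data are hypothetical; assigning the maximum value to the last interval is a design choice of this sketch.

```python
def equal_width_bins(values, n):
    """Return the interval width, the interval edges, and each value's interval index."""
    a, b = min(values), max(values)
    w = (b - a) / n
    edges = [a + i * w for i in range(n + 1)]

    def index(v):
        if w == 0:  # all values identical: everything falls in interval 0
            return 0
        # The maximum value would land on the last edge; keep it in the last interval.
        return min(int((v - a) / w), n - 1)

    return w, edges, [index(v) for v in values]


# Illustrative data (hypothetical prices).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, edges, indices = equal_width_bins(prices, 3)
print(width)    # 10.0
print(edges)    # [4.0, 14.0, 24.0, 34.0]
print(indices)
```

Because the width depends only on the extremes, a single outlier can stretch the intervals and leave most values in one bin.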
Regression
[Figure: data points and the fitted regression line y = x + 1, with axes x and y and a marked point (X1, Y1)]
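Smoothing by regression can be illustrated with an ordinary least-squares line: fit y = slope * x + intercept, then replace each observed y by the fitted value. This is a minimal sketch with hypothetical names and made-up noisy data scattered around y = x + 1.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative noisy observations (hypothetical).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]  # fitted values replace the noisy ones
print(slope, intercept)
print(smoothed)
```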
Cluster Analysis
Data Integration
Data integration:
Combines data from multiple sources into a
coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data
sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values
from different sources are different
Possible reasons: different representations,
different scales, e.g., metric vs. British units
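As a rough sketch of the two integration issues above, the snippet below maps a differently named key onto a common schema and converts a measurement to a common unit before comparing values. The record layout, the `weight` attribute, and all names are hypothetical; only the `cust-id` / `cust-#` correspondence comes from the slide.

```python
LB_PER_KG = 2.20462  # pounds per kilogram

# Schema mapping: attribute names in source B -> names in the common schema.
KEY_MAP = {"cust-#": "cust-id"}


def normalize(record):
    """Rename keys to the common schema and express weight in kilograms."""
    out = {KEY_MAP.get(k, k): v for k, v in record.items()}
    if out.get("unit") == "lb":
        out["weight"] = out["weight"] / LB_PER_KG
        out["unit"] = "kg"
    return out


# Hypothetical records describing the same customer in two sources.
source_a = {"cust-id": 17, "weight": 70.0, "unit": "kg"}
source_b = {"cust-#": 17, "weight": 154.3, "unit": "lb"}

a, b = normalize(source_a), normalize(source_b)
same_entity = a["cust-id"] == b["cust-id"]
conflict = abs(a["weight"] - b["weight"]) > 0.1  # tolerance is an arbitrary choice
print(same_entity, conflict)
```

After normalization the two weights agree within the tolerance, so the apparent value conflict was only a difference in scale.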
χ² (chi-square) test
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
The larger the χ² value, the more likely the
variables are related
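                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

Numbers in parentheses are expected counts, calculated from the row and column sums.
$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$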
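The χ² value for the table above can be reproduced with a short sketch. Each expected count is taken as (row sum × column sum) / total, which matches the counts shown in parentheses; the function name is a hypothetical choice.

```python
def chi_square(observed):
    """Return the chi-square statistic and the expected counts for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total
            expected_row.append(e)
            chi2 += (o - e) ** 2 / e
        expected.append(expected_row)
    return chi2, expected


# Observed counts: columns are (play chess, not play chess).
observed = [[250, 200],
            [50, 1000]]

chi2, expected = chi_square(observed)
print(expected)        # [[90.0, 360.0], [210.0, 840.0]]
print(round(chi2, 2))
```

The statistic comes out at roughly 507.9, in line with the value on the slide up to rounding in the last digit.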
Data Transformation
min-max normalization
z-score normalization
Attribute/feature construction
Data Transformation:
Normalization
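Min-max normalization to [new_min_A, new_max_A]:
$v' = \frac{v - \min_A}{\max_A - \min_A} (\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
Z-score normalization ($\mu_A$: mean, $\sigma_A$: standard deviation of attribute A):
$v' = \frac{v - \mu_A}{\sigma_A}$
Ex. Let $\mu_A$ = 54,000, $\sigma_A$ = 16,000. Then 73,600 is mapped to
$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
Normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$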
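The three normalization methods can be sketched in a few lines. The function names, the income range 12,000 to 98,000 used for the min-max call, and the values passed to decimal scaling are illustrative assumptions; only the z-score example (mean 54,000, standard deviation 16,000, value 73,600) comes from the slide.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Number of standard deviations that v lies from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(z_score(73_600, 54_000, 16_000))   # 1.225
print(min_max(73_600, 12_000, 98_000))   # hypothetical range, result lies in [0, 1]
print(decimal_scaling([-986, 917]))      # hypothetical values, j = 3
```

Min-max normalization needs the attribute's true minimum and maximum, so a later value outside that range falls outside the target interval; z-score normalization avoids this at the cost of an unbounded output range.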
[Figure: decision tree with a test on attribute A1 at the root and leaves labeled Class 1 and Class 2]
Numerosity Reduction
Partitioning rules:
[Figure: histogram with a value axis from 10,000 to 100,000 and a count axis up to 35]
Can have hierarchical clustering and be stored in multidimensional index tree structures
[Figure: Raw Data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
[Figure: Raw Data and the corresponding Cluster/Stratified Sample]
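The sampling schemes named in the figures can be sketched with Python's standard `random` module. The function names, the proportional allocation used for the stratified sample, and the toy records are assumptions made for illustration.

```python
import random
from collections import defaultdict


def srswor(data, n, rng):
    """Simple random sample without replacement: no item is drawn twice."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample with replacement: an item may be drawn repeatedly."""
    return rng.choices(data, k=n)


def stratified_sample(records, key, fraction, rng):
    """Draw roughly the same fraction from every stratum (at least one item each)."""
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
raw = list(range(100))
without_repl = srswor(raw, 10, rng)
with_repl = srswr(raw, 10, rng)

# Hypothetical records: (age group, id), deliberately unbalanced.
records = [("young", i) for i in range(60)] + [("senior", i) for i in range(40)]
strat = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=rng)
print(without_repl, with_repl, strat, sep="\n")
```

Stratifying keeps small groups represented, which a simple random sample of skewed data does not guarantee.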
Discretization
Discretization:
Discretization
Segmentation by Natural Partitioning
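Step 1: profit
  Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
Step 2: msd = 1,000, Low = -$1,000, High = $2,000
Step 3: (-$1,000 - $2,000) is partitioned into
  (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
Step 4: the range is adjusted to (-$400 - $5,000) to cover Min and Max, and each interval is partitioned further
  (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
  (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
  ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
  ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)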
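One partitioning step of the example can be sketched as below. The sketch assumes the 3-4-5 rule as it is usually stated: after rounding to the most significant digit, a range covering 3, 6, 7, or 9 distinct values is split into 3 intervals (2-3-2 grouping for 7), one covering 2, 4, or 8 into 4 intervals, and one covering 1, 5, or 10 into 5 intervals. The function names are hypothetical, and the adjustment of the outer intervals to Min and Max in Step 4 is not modeled.

```python
import math


def msd_unit(low, high):
    """Place value of the most significant digit of the larger magnitude."""
    largest = max(abs(low), abs(high))
    return 10 ** int(math.floor(math.log10(largest)))


def partition_345(low, high):
    """Split [low, high] (with low < high) into 3, 4, or 5 natural intervals."""
    unit = msd_unit(low, high)
    lo = math.floor(low / unit) * unit   # round low down at the msd
    hi = math.ceil(high / unit) * unit   # round high up at the msd
    distinct = round((hi - lo) / unit)   # distinct values at the msd

    if distinct == 7:  # 2-3-2 grouping
        return [(lo, lo + 2 * unit), (lo + 2 * unit, lo + 5 * unit), (lo + 5 * unit, hi)]
    if distinct in (3, 6, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    elif distinct in (1, 5, 10):
        parts = 5
    else:
        raise ValueError(f"no rule for {distinct} distinct values in this sketch")

    width = (hi - lo) / parts
    return [(lo + i * width, lo + (i + 1) * width) for i in range(parts)]


print(partition_345(-159, 1838))  # Step 2 and Step 3 of the example
print(partition_345(-400, 0))     # one of the Step 4 sub-intervals
```

Applied to Low = -$159 and High = $1,838, the sketch yields the three $1,000-wide intervals of Step 3; applied to (-$400, 0), it yields the four $100-wide intervals of Step 4.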
country (15 distinct values)
province_or_state
city
street
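A concept hierarchy such as the one above can be proposed automatically by counting distinct values per attribute and placing the attribute with the fewest distinct values at the top. This is a minimal sketch under that heuristic; the function name and the sample rows are hypothetical, and the heuristic can fail when a higher-level attribute happens to have more distinct values than a lower-level one.

```python
def hierarchy_by_distinct_counts(rows, attributes):
    """Order attributes from fewest to most distinct values (top to bottom level)."""
    counts = {a: len({row[a] for row in rows}) for a in attributes}
    ordered = sorted(attributes, key=lambda a: counts[a])
    return ordered, counts


# Hypothetical location records.
rows = [
    {"street": "1 Main St", "city": "Urbana", "province_or_state": "Illinois", "country": "USA"},
    {"street": "2 Oak Ave", "city": "Urbana", "province_or_state": "Illinois", "country": "USA"},
    {"street": "3 Elm Rd", "city": "Chicago", "province_or_state": "Illinois", "country": "USA"},
    {"street": "4 Pine St", "city": "Vancouver", "province_or_state": "British Columbia", "country": "Canada"},
    {"street": "5 Lake Dr", "city": "Toronto", "province_or_state": "Ontario", "country": "Canada"},
]

ordered, counts = hierarchy_by_distinct_counts(
    rows, ["street", "city", "province_or_state", "country"]
)
print(ordered)
print(counts)
```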
Summary
Discretization
Data Mining: Concepts and Techniques