
Chapter 2: Data Preprocessing

Focus:
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
2.1 Why Data Preprocessing?
Example: You want to analyze your company's sales data for your branch
Situation 1: No info on whether an item was on sale when purchased
Data in the real world is dirty
o incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
o noisy: containing errors or outliers
o inconsistent: containing discrepancies in codes or names
Ex., some items were entered with the wrong code
No quality data, no quality mining results! (GIGO)
o Quality decisions must be based on quality data
o Data warehouse needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
o Accuracy
o Completeness
o Consistency
o Timeliness
o Believability
o Value added
o Interpretability
o Accessibility
Situation 2: You find some of the missing data in another database.
Major Tasks in Data Preprocessing
Data cleaning
o Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
Data integration
o Integration of multiple databases, data cubes, or files
Situation 3: You decided to create a decision tree from the data. But what should you do with
the attribute price?
Data transformation
o Normalization and aggregation
Ex., scale numeric values to [0, 1], especially if you want to use a distance-based algorithm
o Data discretization (yes, it is a word, no matter what Microsoft says)
o Part of data reduction but with particular importance, especially for numerical data
Situation 4: You are all set, ready to go, but you find that there is too much data now.
Data reduction
o Obtains a representation reduced in volume that produces the same or similar analytical results
2.2 Descriptive Data Summarization
Motivation
o To better understand the data: central tendency, variation and spread
Data dispersion characteristics
o median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
o Data dispersion: analyzed with multiple granularities of precision
o Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
o Folding measures into numerical dimensions
o Boxplot or quantile analysis on the transformed cube
2.2.1 Measuring the Central Tendency
Mean (an algebraic measure) (sample vs. population):
o Weighted arithmetic mean (i.e., weighted average)
o Trimmed mean: chopping extreme values
Median: a holistic measure
o Middle value if odd number of values, or average of the middle two values otherwise
o Estimated by interpolation (for grouped data)
Mode
o Value that occurs most frequently in the data
o Unimodal, bimodal, trimodal
o Empirical formula relating mean, median, and mode (see below)
2.2.2 Measuring the Dispersion of Data
Quartiles, outliers and boxplots
o Quartiles: Q1 (25th percentile), Q3 (75th percentile)
o Inter-quartile range: IQR = Q3 - Q1
o Five number summary: min, Q1, M (median), Q3, max
o Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend
to the extremes, and outliers are plotted individually
o Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
Variance and standard deviation (sample: use s; population: use σ)
o Variance: (algebraic, scalable computation)
o Standard deviation s (or σ) is the square root of the variance s² (or σ²)
The key formulas:
o Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample); $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ (population)
o Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
o Median of grouped data, by interpolation: $\text{median} = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{\text{median}}} \right) c$
o Empirical mode formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
o Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left( \sum_{i=1}^{n} x_i \right)^2 \right]$
o Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
The normal (distribution) curve
o One standard deviation (μ-σ to μ+σ): contains about 68% of the measurements (μ:
mean, σ: standard deviation)
o Two standard deviations (μ-2σ to μ+2σ): contains about 95% of it
o Three standard deviations (μ-3σ to μ+3σ): contains about 99.7% of it
Ex., (from Han's ppt)
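A minimal sketch of these dispersion measures in Python (not from the notes; the toy
data and the use of NumPy are assumptions) computes the five-number summary and flags
outliers with the 1.5 x IQR rule:

import numpy as np

# Toy data (made up); 90 is planted as an outlier.
data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 90])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print("five-number summary:", data.min(), q1, median, q3, data.max())

# Flag values more than 1.5 x IQR beyond the quartiles as outliers.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("outliers:", data[(data < low) | (data > high)])  # [90]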
2.2.3 Graphic Displays of Basic Descriptive Data Summaries
Frequency histograms
o A univariate graphical method
o Consists of a set of rectangles that reflect the counts or frequencies of the classes
present in the given data
o Figure 2.4
Quantile plot
o Displays all of the data (allowing the user to assess both the overall behavior and
unusual occurrences)
o Plots quantile information
- For data xi sorted in increasing order, fi indicates that approximately
100 fi% of the data are below or equal to the value xi
Quantile-quantile (Q-Q) plot
o Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
o Allows the user to view whether there is a shift in going from one distribution to
another
Scatter plot
o Provides a first look at bivariate data to see clusters of points, outliers, etc.
o Each pair of values is treated as a pair of coordinates and plotted as points in the
plane
Loess curve
o Adds a smooth curve to a scatter plot in order to provide better perception of the
pattern of dependence
o The loess curve is fitted by setting two parameters: a smoothing parameter, and the
degree of the polynomials that are fitted by the regression
2.3 Data Cleaning
Tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
2.3.1 Missing Data: what caused it
Data is not always available
o E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
Missing data may be due to
o equipment malfunction
o inconsistency with other recorded data, leading to deletion
o data not entered due to misunderstanding
o certain data not being considered important at the time of entry, ex., whether an
item was on sale
o failure to register history or changes of the data
Missing data may need to be inferred.
Missing Data: how to handle it
1. Ignore the tuple: usually done when the class label is missing (assuming the task is
classification)
2. Does this work? Not always
a. A missing value may have significance in itself (e.g., missing years in a resume)
b. Not effective when the percentage of missing values per attribute varies
considerably
3. Fill in the missing value manually: tedious and often infeasible
4. Use a global constant to fill in the missing value: e.g., "unknown" - a new class?!
5. Use the attribute mean to fill in the missing value
6. Use the attribute mean for all samples belonging to the same class to fill in the missing
value: smarter
7. Use the most probable value to fill in the missing value: inference-based, such as a
Bayesian formula or a decision tree
Methods 3-6 bias the data, though 6 is a preferred method (see the sketch below)
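A short sketch of methods 4-6 using pandas (an assumption; the toy frame and column
names are made up, not from the notes):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cls":    ["a", "a", "b", "b", "b"],   # class label
    "income": [50.0, np.nan, 30.0, 34.0, np.nan],
})

# Method 4: a global constant ("unknown" for a categorical attribute).
by_constant = df["income"].fillna(-1)

# Method 5: the attribute mean.
by_mean = df["income"].fillna(df["income"].mean())

# Method 6: the attribute mean per class -- smarter, uses the class label.
by_class_mean = df["income"].fillna(df.groupby("cls")["income"].transform("mean"))
print(by_class_mean.tolist())  # [50.0, 50.0, 30.0, 34.0, 32.0]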
2.3.2 Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
o faulty data collection instruments
o data entry problems
o data transmission problems
o technology limitations
o inconsistency in naming conventions
Other data problems which require data cleaning
o duplicate records
o incomplete data
o inconsistent data
How to Handle Noisy Data?
1. Binning method:
o sort the data and partition it into bins
o replace the values in each bin with the bin mean (smoothing by bin means)
o can also smooth by bin medians, bin boundaries, etc.
o Equal-width (distance) partitioning (Fig 3-10)
- Divides the range into N intervals of equal size: a uniform grid
- If A and B are the lowest and highest values of the attribute, the width of the
intervals will be W = (B - A)/N
- The most straightforward approach
- But outliers may dominate the presentation
- Skewed data is not handled well
o Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same
number of samples
- Good data scaling
- Managing categorical attributes can be tricky
Example:
Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
i. Bin 1: 4, 8, 9, 15
ii. Bin 2: 21, 21, 24, 25
iii. Bin 3: 26, 28, 29, 34
Smoothing by bin means:
i. Bin 1: 9, 9, 9, 9
ii. Bin 2: 23, 23, 23, 23
iii. Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
i. Bin 1: 4, 4, 4, 15
ii. Bin 2: 21, 21, 25, 25
iii. Bin 3: 26, 26, 26, 34
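A sketch of the example above in Python (a plain re-implementation, not code from the
notes):

prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
depth = 4  # equi-depth bins of 4 values each
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer bin edge.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]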
2. Clustering
a. detect and remove outliers via cluster analysis
3. Combined computer and human inspection
a. detect suspicious values and have a human check them
b. 2-D and 3-D visualizations show dependencies
c. domain experts need to be consulted
d. too much data to inspect? Take a sample!
4. Regression
a. smooth by fitting the data to a mathematical function (a regression function) (more
about this in 3.4)
[Figure: regression smoothing - data points in the (x, y) plane with a fitted line
y = x + 1; a noisy observation Y1 at X1 is replaced by the fitted value Y1'.]
2.4 Data Integration and Transformation
2.4.1 Data integration:
Combines data from multiple sources into a coherent store
Major issues:
1. Schema integration
o integrate metadata from different sources
o have to deal with the entity identification problem: identify real-world entities from
multiple data sources, e.g., A: cust-id vs. B: cust-#
2. Detecting and resolving data value conflicts
o for the same real-world entity, attribute values from different sources are different
o possible reasons: different representations, different scales, e.g., metric vs. British
units
3. Handling redundant data in data integration
o redundant data occur often when integrating multiple databases
- The same attribute may have different names in different databases
- One attribute may be a "derived" attribute in another table, e.g., annual
revenue
o redundant data may be detected by correlation analysis (see the sketch below)
Careful integration of the data from multiple sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
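A small sketch of correlation analysis for redundancy detection (the attributes, values,
and the 0.95 threshold are assumptions for illustration):

import numpy as np

monthly_revenue = np.array([10.0, 12.0, 9.0, 14.0, 11.0])
annual_revenue = monthly_revenue * 12      # a derived, redundant attribute

r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
if abs(r) > 0.95:                          # threshold chosen arbitrarily
    print(f"|r| = {abs(r):.2f}: candidate redundant attribute pair")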
2.4.2 Data Transformation
Includes the following:
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Ex.,
[Figure: a concept hierarchy for the dimension location, with levels all > region >
country > city > office; e.g., all > Europe > Germany > Frankfurt, and
all > North_America > Canada > Vancouver > M. Wind.]
Normalization: scale values to fall within a small, specified range
o min-max normalization
o z-score normalization
o normalization by decimal scaling
Attribute/feature construction
o New attributes constructed from the given ones to help the mining process
Data Transformation: Normalization
min-max normalization: $v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$
z-score (or zero-mean) normalization: $v' = \frac{v - \text{mean}_A}{\text{stand\_dev}_A}$
normalization by decimal scaling: $v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$
Examples 2.2 - 2.4
Income attribute with the following characteristics:
Min = 12,000; Max = 98,000; Mean = 54,000
Standard deviation = 16,000
Consider how the income value 73,600 would be changed
Min-max normalization:
New value as a fraction of the range = (73600 - 12000)/(98000 - 12000) = 0.716 (not the 71st percentile)
New value v' (on a scale of 0 to 1) = 0.716 x (1 - 0) + 0 = 0.716
New value v' (on a scale of -1 to 1) = 0.716 x (1 - (-1)) + (-1) = 0.43
Z-score:
v' = (73600 - 54000)/16000 = 1.225
Decimal scaling:
Max = 98,000 < 100,000 (j = 5)
v' = 73600/100000 = 0.736
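The three normalizations as a Python sketch, reproducing the numbers of the worked
example (the function names are mine, not from the notes):

import math

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    j = math.ceil(math.log10(max_abs))  # smallest j with max(|v'|) < 1
    return v / 10 ** j

print(round(min_max(73600, 12000, 98000), 3))         # 0.716
print(round(min_max(73600, 12000, 98000, -1, 1), 2))  # 0.43
print(round(z_score(73600, 54000, 16000), 3))         # 1.225
print(decimal_scaling(73600, 98000))                  # 0.736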
2.5 Data Reduction
Why?
A warehouse may store terabytes of data: complex data analysis/mining may take a very
long time to run on the complete data set
Data reduction:
Obtains a reduced representation of the data set that is much smaller in volume but yet produces
the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation
2.;.1 Data Cu"e 0ggregation
1he loest le#el of a data cu"e
o the aggregated data for an indi#idual entity of interest
o e.g.$ a customer in a phone calling data arehouse.
[Figure: a sample data cube with dimensions Date (1Qtr-4Qtr), Product (TV, VCR, PC), and
Country (U.S.A., Canada, Mexico); "sum" cells along each dimension hold aggregates,
e.g., the total annual sales of TVs in the U.S.A.]
Multiple levels of aggregation in data cubes
o Further reduce the size of the data to deal with
Reference appropriate levels
o Use the smallest representation which is enough to solve the task
Queries regarding aggregated information should be answered using the data cube, when
possible
2.;.2 0ttri"ute su"set selection:
Feature Selection
o Select a minimum set of features such that the pro"a"ility distri"ution of different
classes gi#en the #alues for those features is as close as possi"le to the original
distri"ution gi#en the #alues of all features
o reduce O of patterns in the patterns$ easier to understand
Ieuristic .ethods
o Step/ise forard selection
o Step/ise "ac!ard elimination
o Com"ining forard selection and "ac!ard elimination
o Decision/tree induction
[Figure: example of decision tree induction for attribute selection. Initial attribute
set: {A1, A2, A3, A4, A5, A6}; the induced tree tests A4 at the root and A1 and A6 below
it, with leaves Class 1 / Class 2, so the reduced attribute set is {A1, A4, A6}.]
Heuristic Feature Selection Methods
There are $2^d$ possible sub-features of d features
Several heuristic feature selection methods:
o Best single features under the feature independence assumption: choose by
significance tests
o Best step-wise feature selection (sketched below):
- The best single feature is picked first
- Then the next best feature conditioned on the first, ...
o Step-wise feature elimination:
- Repeatedly eliminate the worst feature
o Best combined feature selection and elimination
o Optimal branch and bound:
- Use feature elimination and backtracking
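A sketch of step-wise forward selection; the score function is a stand-in for whatever
relevance measure is used (e.g., a significance test or cross-validated accuracy), and
the toy weights are made up:

def forward_selection(features, score, k):
    """Greedily add the feature that most improves the subset's score."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

weights = {"A1": 0.9, "A2": 0.1, "A3": 0.4}          # made-up relevances
subset = forward_selection(weights, lambda s: sum(weights[f] for f in s), 2)
print(subset)  # ['A1', 'A3']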
2.5.3 Dimensionality Reduction (in the data cube)
Sample data cube schema:
Example of a star schema:
o Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold,
dollars_sold, avg_sales (the last three are measures)
o time dimension: time_key, day, day_of_the_week, month, quarter, year
o item dimension: item_key, item_name, brand, type, supplier_type
o branch dimension: branch_key, branch_name, branch_type
o location dimension: location_key, street, city, province_or_street, country
Generally, there are two types of approaches:
1. Wrapper approach: use the mining algorithm itself for attribute selection for
classification tasks. More accurate.
2. Filter approach: reduce the attribute set before the mining algorithm is used
Data Compression
Typical methods:
String compression
o There are extensive theories and well-tuned algorithms
o Typically lossless
o But only limited manipulation is possible without expansion
Audio/video compression
o Typically lossy compression, with progressive refinement
o Sometimes small fragments of the signal can be reconstructed without reconstructing
the whole
[Figure: lossless compression maps the original data to compressed data and back
exactly, while lossy compression recovers only an approximation of the original data.]
Popular lossy data compression methods:
Wavelet transforms
Discrete wavelet transform (DWT): from linear signal processing and Fourier
transformation
Transforms a data vector D to a numerically different vector D' of wavelet coefficients
Compressed approximation: store only a small fraction of the strongest wavelet
coefficients
Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in
space
Method:
o The length, L, must be an integer power of 2 (padding with 0s when necessary)
o Each transform has 2 functions: smoothing and difference (e.g., the Haar-2 and
Daubechies-4 wavelets)
o Applies to pairs of data, resulting in two sets of data of length L/2
o Applied recursively, until the desired length is reached
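A minimal sketch of the Haar-2 transform described above: each pass turns pairs into a
smoothed half (pairwise averages) and a difference half, then recurses on the smoothed
half. (The 1/2 scaling is the simplest convention; orthonormal variants use 1/sqrt(2).)

def haar_dwt(data):
    assert len(data) & (len(data) - 1) == 0, "length must be a power of 2"
    out, n = list(data), len(data)
    while n > 1:
        smooth = [(out[i] + out[i + 1]) / 2 for i in range(0, n, 2)]
        detail = [(out[i] - out[i + 1]) / 2 for i in range(0, n, 2)]
        out[:n] = smooth + detail
        n //= 2
    return out  # out[0] is the overall average, the rest are coefficients

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
# Compression: keep only the strongest coefficients, zero the rest.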
Principal Component Analysis
[Figure: data points in the (X1, X2) plane with the principal component axes Y1 and Y2.]
Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be
used to represent the data
o The original data set is reduced to one consisting of N data vectors on c principal
components (reduced dimensions)
Each data vector is a linear combination of the c principal component vectors
Works for numeric data only
Used when the number of dimensions is large
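A sketch of PCA via the eigenvectors of the covariance matrix (NumPy is an assumption;
production code would typically use a library implementation):

import numpy as np

def pca_reduce(X, c):
    """Project the N x k matrix X onto its top c principal components."""
    Xc = X - X.mean(axis=0)                 # center each dimension
    cov = np.cov(Xc, rowvar=False)          # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:c]]
    return Xc @ top                         # N x c reduced data

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # N=100 vectors, k=5 dimensions
print(pca_reduce(X, 2).shape)               # (100, 2)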
2.5.4 Numerosity Reduction
Reduce the data volume by choosing alternative, "smaller" forms of the data
Parametric methods
o Assume the data fits some model, estimate the model parameters, store only the
parameters, and discard the data (except possible outliers)
o Ex., log-linear models: obtain the value at a point in m-D space as the product over
appropriate marginal subspaces
Non-parametric methods
o Do not assume models
o Major families: histograms, clustering, sampling
Regression and Log-Linear Models
Linear regression: data are modeled to fit a straight line
o Often uses the least-squares method to fit the line
Multiple regression: allows a response variable Y to be modeled as a linear function of a
multidimensional feature vector
Log-linear model: approximates discrete multidimensional probability distributions
Regression Analysis and Log-Linear Models
Linear regression: $Y = \alpha + \beta X$
o Two parameters, $\alpha$ and $\beta$, specify the line and are to be estimated by using the data
at hand
o Fit by applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
o Demo at http://www.math.csusb.edu/faculty/stanton/m252/regress/regress.html
Multiple linear regression: $Y = b_0 + b_1 X_1 + b_2 X_2$
o Many nonlinear functions can be transformed into the above
Log-linear models:
o The multi-way table of joint probabilities is approximated by a product of lower-
order tables
Histograms
A popular data reduction technique
Divide the data into buckets and store the average (or sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
Related to quantization problems
Example 3.4
The list of prices of commonly sold items at AllElectronics:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18,
18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
Partition rules:
Singleton buckets: one value per bucket (figure 3.9)
Equi-width: every bucket has the same width (i.e., value range) (figure 3.10)
Equi-depth: every bucket has the same height (i.e., the same number of items in each bucket)
V-optimal: use the histogram with the least variance among all possible histograms
- the most accurate and practical
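A sketch contrasting equi-width and equi-depth buckets on the price list above, storing
only bucket summaries rather than the raw values:

from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15,
          15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20,
          20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30,
          30, 30]

# Equi-width, width 10: bucket i covers values [1 + 10*i, 1 + 10*(i+1)).
width_counts = Counter((p - 1) // 10 for p in prices)
print(dict(width_counts))                        # bucket index -> count

# Equi-depth, 4 buckets with the same number of values each.
depth = len(prices) // 4
buckets = [prices[i:i + depth] for i in range(0, len(prices), depth)]
print([(b[0], b[-1], len(b)) for b in buckets])  # (lo, hi, count) per bucket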
Clustering
Partition the data set into clusters, and store the cluster representation only
Can be very effective if the data is clustered, but not if the data is messy
Can use hierarchical clustering and be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms, further
detailed in Chapter 8
Sampling
Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of
the data
Simple methods for choosing a representative subset of the data (sketched after this list):
1. Simple Random Sample Without Replacement (SRSWOR): randomly draw a subset
of n tuples from the data set D
2. Simple Random Sample With Replacement (SRSWR): each time a tuple is drawn,
it is placed back in D so that it may be drawn again
Simple random sampling may have very poor performance in the presence
of skewed data. Additional methods:
3. Cluster sample: if D is grouped into M mutually disjoint "clusters" (ex., pages),
obtain a random sample of n clusters
- Reduces DB I/O
4. Stratified sample: if D is divided into mutually disjoint parts called strata (ex.,
each stratum represents a value of an attribute or a classification), a stratified
sample of D can be obtained by an SRS of each stratum
- Approximates the percentage of each class (or subpopulation of interest) in
the overall database
- Used in conjunction with skewed data
Sampling may not reduce database I/Os (a page is read at a time).
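A sketch of SRSWOR, SRSWR, and a stratified sample with Python's standard library (the
toy data set and the 10% sample size are assumptions):

import random

random.seed(42)
D = [(f"tuple{i}", "rare" if i % 10 == 0 else "common") for i in range(100)]

srswor = random.sample(D, 10)                    # without replacement
srswr = [random.choice(D) for _ in range(10)]    # with replacement

# Stratified: an SRS within each stratum, proportional to stratum size,
# so the rare class keeps its share of the sample.
strata = {}
for t in D:
    strata.setdefault(t[1], []).append(t)
stratified = [t for s in strata.values()
              for t in random.sample(s, len(s) * 10 // len(D))]
print(len(srswor), len(srswr), len(stratified))  # 10 10 10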
[Figures: raw data reduced to a sample by SRSWOR (simple random sample without
replacement) and SRSWR; raw data reduced by a cluster/stratified sample.]
Hierarchical Reduction
Use a multi-resolution structure with different degrees of reduction
Hierarchical clustering is often performed but tends to define partitions of data sets rather
than clusters
Parametric methods are usually not amenable to hierarchical representation
Hierarchical aggregation
o An index tree hierarchically divides a data set into partitions by the value range of
some attributes
o Each partition can be considered as a bucket
o Thus an index tree with aggregates stored at each node is a hierarchical histogram
2.6 Discretization
Three types of attributes:
o Nominal: values from an unordered set
o Ordinal: values from an ordered set
o Continuous: real numbers
Discretization:
o divide the range of a continuous attribute into intervals
o Some classification algorithms only accept categorical attributes
o Reduce data size by discretization
o Prepare for further analysis
2.6.1 Discretization and concept hierarchy generation for numeric data
Binning (see earlier sections)
Histogram analysis (see earlier sections)
Clustering analysis (see earlier sections)
Entropy-based discretization
Segmentation by natural partitioning
Classification-based (from Witten, chapter 4)
Entropy-Based Discretization
Given a set of samples S (i.e., s tuples), the basic method for discretizing attribute A:
1. Each value of A is considered a potential interval boundary (or threshold) T;
ex., a value v divides the samples into A < v and A >= v.
2. If S is partitioned into two intervals S1 and S2 using boundary T, the entropy
after partitioning is
$E(S, T) = \frac{|S_1|}{|S|}\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\mathrm{Ent}(S_2)$
where Ent(S1) is the entropy of S1, calculated as $\mathrm{Ent}(S_1) = -\sum_i p_i \log_2(p_i)$,
where $p_i$ is the probability of class i in S1. The purer S1 is, the smaller its entropy.
<. 1he "oundary that minimizes the entropy function o#er all possi"le "oundaries
is selected as a "inary discretization.
K. 1he process is recursi#ely applied to partitions o"tained until some stopping
criterion is met$ e.g.$
56periments sho that it may reduce data size and impro#e classification accuracy
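A sketch of one step of entropy-based discretization: scan candidate boundaries and keep
the T that minimizes E(S, T) (the toy values and class labels are made up):

import math
from collections import Counter

def ent(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_boundary(pairs):
    """pairs: (value, class_label) tuples; returns (T, E(S, T))."""
    pairs, best = sorted(pairs), None
    for i in range(1, len(pairs)):
        t = pairs[i][0]                  # candidate: A < t vs. A >= t
        s1 = [c for v, c in pairs if v < t]
        s2 = [c for v, c in pairs if v >= t]
        if not s1 or not s2:
            continue
        e = (len(s1) * ent(s1) + len(s2) * ent(s2)) / len(pairs)
        if best is None or e < best[1]:
            best = (t, e)
    return best

data = [(1, "lo"), (2, "lo"), (3, "lo"), (10, "hi"), (11, "hi"), (12, "hi")]
print(best_boundary(data))               # (10, 0.0): a pure binary split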
Segmentation "y natural partitioning
</K/; rule can "e used to segment numeric data into relati#ely uniform$ natural
inter#als. *t partitions a gi#en range into <$ K$ or ; relati#ely e&ui/idth inter#als.
1he rules ("ased on the most significant digit,:
o *f an inter#al co#ers <$ E$ = or H distinct #alues at the most significant digit$
partition the range into < e&ui/idth inter#als
o *f it co#ers 2$ K$ or F distinct #alues at the most significant digit$ partition the range
into K inter#als
o *f it co#ers 1$ ;$ or 18 distinct #alues at the most significant digit$ partition the
range into ; inter#als
1he rules can "e applied recursi#ely to create a concept hierarchy
Example:
Profits at branches of AllElectronics:
Min = -351,976
Max = 4,700,896
Step 1: Low (5th percentile) = -159,876; High (95th percentile) = 1,838,761
Step 2: Between Low and High, the msd (most significant digit) is 1,000,000. Rounding
to the msd divides the range into 3 intervals with breakpoints -1,000,000; 0; 1,000,000; 2,000,000
Step 3: Create a top tier from these intervals
Step 4: Expand the top tier to include Min (no work needed) and Max (must create a new interval:
2,000,000 - 5,000,000)
[Figure: the 3-4-5 rule on the profit data, with values in thousands. Step 1: Min = -$351,
Low (5%-tile) = -$159, High (95%-tile) = $1,838, Max = $4,700. Step 2: msd = 1,000,
Low' = -$1,000, High' = $2,000. Step 3: the top tier (-$1,000 - $2,000) splits into
(-$1,000 - 0], (0 - $1,000], ($1,000 - $2,000]. Step 4: adjusting for Min and Max gives
(-$400 - 0], (0 - $1,000], ($1,000 - $2,000], ($2,000 - $5,000]; recursing splits these
into 4, 5, 5, and 3 subintervals, e.g., (-$400 - -$300] ... (-$100 - 0] and
($2,000 - $3,000] ... ($4,000 - $5,000].]
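A sketch of one level of the 3-4-5 rule (my own straightforward reading of the rules
above): round Low and High to the most significant digit of the range, count how many
msd units it covers, then split into 3, 4, or 5 equi-width intervals:

import math

def three_four_five(low, high):
    msd = 10 ** math.floor(math.log10(high - low))
    lo, hi = math.floor(low / msd) * msd, math.ceil(high / msd) * msd
    units = round((hi - lo) / msd)       # distinct values at the msd
    n = {3: 3, 6: 3, 7: 3, 9: 3, 2: 4, 4: 4, 8: 4, 1: 5, 5: 5, 10: 5}[units]
    width = (hi - lo) / n
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n)]

# The profit example: rounding (-159876, 1838761) to msd 1,000,000 gives
# (-1,000,000, 2,000,000), which covers 3 units -> 3 intervals.
print(three_four_five(-159876, 1838761))
# [(-1000000.0, 0.0), (0.0, 1000000.0), (1000000.0, 2000000.0)]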
Concept hierarchy generation for categorical data
Specification of a partial ordering of attributes explicitly at the schema level by users or
experts
Specification of a portion of a hierarchy by explicit data grouping
Specification of a set of attributes, but not of their partial ordering
o Ordering can be done automatically using heuristics: fewer distinct values means
higher in the concept hierarchy
o Based on the number of values
o See the example below
o Any counter-examples?
Specification of a set of attributes: a concept hierarchy can be automatically generated
based on the number of distinct values per attribute in the given attribute set. The
attribute with the most distinct values is placed at the lowest level of the hierarchy:
o country: 15 distinct values
o province_or_state: 65 distinct values
o city: 3,567 distinct values
o street: 674,339 distinct values
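The heuristic above in two lines of Python, using the distinct-value counts from the
slide:

counts = {"country": 15, "province_or_state": 65, "city": 3567,
          "street": 674339}
print(" > ".join(sorted(counts, key=counts.get)))
# country > province_or_state > city > street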
Discretization vs. Concept Hierarchy
Discretization
o Reduces the number of values for a given continuous attribute by dividing the range
of the attribute into intervals. Interval labels can then be used to replace actual data
values.
Concept hierarchies
o Reduce the data by collecting and replacing low-level concepts (such as numeric
values for the attribute age) with higher-level concepts (such as young, middle-aged,
or senior).
2.7 Summary
Data preparation is a big issue for both warehousing and mining
Data preparation includes
o Data cleaning and data integration
o Data reduction and feature selection
o Discretization
A lot of methods have been developed, but data preparation is still an active area of research