Chapter 2 6 Data Mining

Discretization and concept hierarchy generation
Entropy-based discretization
Entropy is often described as disorder, but that description is not quantitative. In its original physical sense, entropy refers to the statistical distribution of energy across the possible states of a system; here it measures how mixed the class labels are within a set of tuples.
Entropy-based discretization is a supervised, top-down splitting technique. It uses class distribution information to calculate and select split-points. To discretize a numerical attribute, A, the method selects the value of A that yields the minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization.
Let D consist of data tuples defined by a set of attributes and a class-label attribute. The class-label attribute provides the class information per tuple. The basic method for entropy-based discretization of an attribute A within the set is as follows:
1. Each value of A can be considered as a potential interval boundary or split-point to partition the range of A. That is, a split-point for A can partition the tuples in D into two subsets satisfying the conditions A ≤ split_point and A > split_point, respectively, thereby creating a binary discretization.
2. Given a set of samples D, if D is partitioned into two intervals D1 and D2 using boundary T, the expected information (the size-weighted entropy) after partitioning is
$$I(D, T) = \frac{|D_1|}{|D|}\,\mathrm{Entropy}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Entropy}(D_2)$$

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of D1 is
$$\mathrm{Entropy}(D_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
where p_i is the probability of class i in D1.
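As a minimal sketch of the two formulas above, the following Python computes Entropy(D) from a list of class labels and I(D, T) for a candidate boundary T, then picks the candidate with the smallest I(D, T). The function names (`entropy`, `partition_info`, `best_split`) and the choice to skip the largest value as a candidate (it would leave D2 empty) are assumptions made for illustration, not part of the original notes.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(D) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition_info(values, labels, t):
    """I(D, T): size-weighted entropy of D1 (A <= t) and D2 (A > t)."""
    d1 = [y for x, y in zip(values, labels) if x <= t]
    d2 = [y for x, y in zip(values, labels) if x > t]
    n = len(labels)
    return len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)

def best_split(values, labels):
    """Return the value of A that minimizes I(D, T).

    Assumption: the largest value is not a candidate, since it would
    leave D2 empty. Raises ValueError if A has fewer than two distinct values.
    """
    candidates = sorted(set(values))[:-1]
    return min(candidates, key=lambda t: partition_info(values, labels, t))

values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
split = best_split(values, labels)
```

In this toy data set the two classes separate cleanly, so the selected boundary leaves both subsets pure and I(D, T) is zero at that boundary.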
3. The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
o The process is recursively applied to the partitions obtained until some stopping criterion is met.
o Such a boundary may reduce data size and improve classification accuracy.
Interval Merge by χ² Analysis
It employs a bottom-up approach: the best neighboring intervals are found and merged to form larger intervals, recursively.
ChiMerge
o Initially, each distinct value of a numerical attribute A is considered to be one interval.
o χ² tests are performed for every pair of adjacent intervals.
o Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions.
o This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.).
Discretization by Intuitive Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals.
o If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 intervals (equi-width for 3, 6, and 9; grouped 2-3-2 for 7).
o If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equi-width intervals.
o If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equi-width intervals.
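Example: 3-4-5 partitioning of profit values (figure; horizontal axis: profit, vertical axis: count)
Step 1: Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
Step 2: msd = 1,000, Low = -$1,000, High = $2,000
Step 3: (-$1,000 - $2,000) is partitioned into (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
Step 4: the range is adjusted to (-$400 - $5,000) to cover Min and Max, and each interval is partitioned further:
o (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
o (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
o ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
o ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)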
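The top-level step of the 3-4-5 rule can be sketched in Python as follows. This is an illustrative sketch only: the function name `three_four_five` is hypothetical, the most significant digit is taken from the larger endpoint magnitude, and the boundary adjustment to Min and Max shown in Step 4 is not handled. It assumes the two endpoints are not both zero.

```python
import math

def three_four_five(low, high):
    """Split [low, high] once with the 3-4-5 rule; returns (start, end) pairs."""
    # Unit of the most significant digit (assumed: taken from the larger magnitude).
    msd = 10 ** math.floor(math.log10(max(abs(low), abs(high))))
    # Round low down and high up to a multiple of that unit.
    lo = math.floor(low / msd) * msd
    hi = math.ceil(high / msd) * msd
    # Number of distinct values covered at the most significant digit.
    distinct = round((hi - lo) / msd)
    if distinct == 7:
        widths = [2, 3, 2]                 # 3 intervals grouped 2-3-2
    elif distinct in (3, 6, 9):
        widths = [distinct // 3] * 3       # 3 equi-width intervals
    elif distinct in (2, 4, 8):
        widths = [distinct / 4] * 4        # 4 equi-width intervals
    elif distinct in (1, 5, 10):
        widths = [distinct / 5] * 5        # 5 equi-width intervals
    else:
        raise ValueError(f"rule does not cover {distinct} distinct values")
    bounds = [lo]
    for w in widths:
        bounds.append(bounds[-1] + w * msd)
    return list(zip(bounds[:-1], bounds[1:]))

top_level = three_four_five(-159, 1838)
```

Called with the 5%-tile and 95%-tile values from the profit example, the sketch yields the three Step 3 intervals; calling it again on an individual interval such as (2000, 5000) gives the corresponding Step 4 subdivision.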
Concept Hierarchy Generation for Categorical Data
Categorical data are discrete data. Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values.
1. Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
o street < city < state < country
2. Specification of a hierarchy for a set of values by explicit data grouping
o {Urbana, Champaign, Chicago} < Illinois
3. Specification of only a partial set of attributes
o E.g., only street < city, not others
4. Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
o E.g., for a set of attributes: {street, city, state, country}
o Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
o The attribute with the most distinct values is placed at the lowest level of the hierarchy
o There are exceptions where the count heuristic gives the wrong order, e.g., weekday, month, quarter, year
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
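The distinct-value heuristic can be sketched in a few lines of Python. The helper names `distinct_counts` and `hierarchy_from_counts` are hypothetical, and the sketch assumes each tuple is a dictionary keyed by attribute name. It orders attributes from the top of the hierarchy (fewest distinct values) to the bottom (most distinct values) and does nothing about the exceptions noted above.

```python
def distinct_counts(rows, attributes):
    """Count the distinct values of each attribute across the data tuples."""
    return {a: len({row[a] for row in rows}) for a in attributes}

def hierarchy_from_counts(counts):
    """Order attributes from top level (fewest distinct values) to lowest level."""
    return sorted(counts, key=counts.get)

# Distinct-value counts taken from the example above.
counts = {"street": 674339, "country": 15, "city": 3567, "province_or_state": 365}
levels = hierarchy_from_counts(counts)
```

With the counts from the example, the generated order runs country, province_or_state, city, street, which matches the hierarchy shown. The result should still be reviewed by a user or expert, since attributes such as weekday would be misplaced by this heuristic.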
