
Discretization and Concept Hierarchy Generation

Entropy-Based Discretization

Entropy is often defined as disorder, but this definition is by no means quantitative. Entropy really refers to the statistical distribution of energy across the different possible states.

Entropy-based discretization is a supervised, top-down splitting technique. It uses class distribution information in its calculation and determination of split-points. To discretize a numerical attribute A, the method selects the value of A that has the minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization.

Let D consist of data tuples defined by a set of attributes and a class-label attribute. The class-label attribute provides the class information per tuple. The basic method for entropy-based discretization of an attribute A within the set is as follows:

1. Each value of A can be considered as a potential interval boundary or split-point to partition the range of A. That is, a split-point for A can partition the tuples in D into two subsets satisfying the conditions A ≤ split_point and A > split_point, respectively, thereby creating a binary discretization.

2. Given a set of samples D, if D is partitioned into two intervals D1 and D2 using boundary T, the information after partitioning (the size-weighted entropy of the two intervals) is

$$I(D, T) = \frac{|D_1|}{|D|}\,\mathrm{Entropy}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Entropy}(D_2)$$


Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of D1 is

$$\mathrm{Entropy}(D_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where p_i is the probability of class i in D1.
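The two quantities defined in step 2 can be sketched in Python as follows. This is only an illustrative sketch: the function names `entropy` and `expected_info` and the toy data are assumptions introduced here, not part of the original notes. Each candidate boundary T is scored with I(D, T), and the candidate with the smallest score is the one an entropy-based method would favor as a split-point.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels: sum of p_i * log2(1 / p_i).

    This equals -sum(p_i * log2(p_i)); only classes that occur are counted.
    """
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def expected_info(values, labels, split_point):
    """I(D, T): size-weighted entropy of D1 (A <= T) and D2 (A > T)."""
    d1 = [y for x, y in zip(values, labels) if x <= split_point]
    d2 = [y for x, y in zip(values, labels) if x > split_point]
    n = len(labels)
    # Empty partitions contribute nothing to the weighted sum.
    return sum(len(d) / n * entropy(d) for d in (d1, d2) if d)


# Hypothetical toy data: values of attribute A and the class label per tuple.
ages = [21, 25, 30, 42, 47, 55]
classes = ["no", "no", "no", "yes", "yes", "yes"]

# Score every distinct value except the largest (which would leave D2 empty).
scores = {t: expected_info(ages, classes, t) for t in sorted(set(ages))[:-1]}
best = min(scores, key=scores.get)
print(best, scores[best])
```

On this toy data the boundary T = 30 separates the two classes perfectly, so both intervals have zero entropy and T = 30 gets the lowest score.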

3. The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
o The process is recursively applied to the partitions obtained until some stopping criterion is met.
o Such a boundary may reduce data size and improve classification accuracy.

Interval Merge by χ² Analysis

This method employs a bottom-up approach: it finds the best neighboring intervals and then merges them to form larger intervals.
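A minimal Python sketch of this kind of bottom-up merging is given here, under stated assumptions: "best neighboring intervals" is taken to mean the adjacent pair with the lowest Pearson χ² statistic on their class counts, and merging stops once a chosen number of intervals remains. The names `chi2`, `chimerge`, and `max_intervals` are hypothetical, and the statistic is the plain Pearson form without any smoothing.

```python
from collections import Counter


def chi2(counts_a, counts_b):
    """Pearson chi-square statistic for the class counts (Counters) of two intervals."""
    classes = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    total = n_a + n_b
    stat = 0.0
    for c in classes:
        col = counts_a[c] + counts_b[c]  # tuples of class c in both intervals
        for observed, row in ((counts_a[c], n_a), (counts_b[c], n_b)):
            expected = row * col / total
            stat += (observed - expected) ** 2 / expected
    return stat


def chimerge(values, labels, max_intervals):
    """Merge the adjacent pair with the lowest chi-square until max_intervals remain."""
    # Start with one interval per distinct value: [low, high, class counts].
    intervals = []
    for v in sorted(set(values)):
        counts = Counter(y for x, y in zip(values, labels) if x == v)
        intervals.append([v, v, counts])
    while len(intervals) > max_intervals:
        stats = [chi2(intervals[i][2], intervals[i + 1][2])
                 for i in range(len(intervals) - 1)]
        i = stats.index(min(stats))  # most similar neighboring pair
        low, _, counts_a = intervals[i]
        _, high, counts_b = intervals[i + 1]
        intervals[i:i + 2] = [[low, high, counts_a + counts_b]]
    return [(low, high) for low, high, _ in intervals]


# Hypothetical toy data: two well-separated classes.
result = chimerge([1, 2, 3, 10, 11, 12], list("aaabbb"), max_intervals=2)
print(result)
```

A fixed interval count is only one possible stopping rule; a significance threshold on the χ² value could be substituted in the `while` condition.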

Merge: Find the best neighboring intervals and merge them to form larger intervals, recursively.

ChiMerge
o Initially, each distinct value of a numerical attribute A is considered to be one interval.
o χ² tests are performed for every pair of adjacent intervals.
o Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions.
o This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.).

Discretization by Intuitive Partitioning

A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals.
o If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals.
o If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals.
o If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals.

Example (figure: distribution of profit, count vs. profit)

Step 1: Min = -$351; Low (i.e., 5th percentile) = -$159; High (i.e., 95th percentile) = $1,838; Max = $4,700
Step 2: msd = 1,000; Low = -$1,000; High = $2,000
Step 3: (-$1,000 to $2,000) is partitioned into (-$1,000 to 0), (0 to $1,000), ($1,000 to $2,000)
Step 4: overall range (-$400 to $5,000), partitioned as follows:
o (-$400 to 0): (-$400 to -$300), (-$300 to -$200), (-$200 to -$100), (-$100 to 0)
o (0 to $1,000): (0 to $200), ($200 to $400), ($400 to $600), ($600 to $800), ($800 to $1,000)
o ($1,000 to $2,000): ($1,000 to $1,200), ($1,200 to $1,400), ($1,400 to $1,600), ($1,600 to $1,800), ($1,800 to $2,000)
o ($2,000 to $5,000): ($2,000 to $3,000), ($3,000 to $4,000), ($4,000 to $5,000)
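The partitioning step of the 3-4-5 rule can be sketched in Python as below. The function name `partition_345` is hypothetical, and the sketch assumes integer endpoints that have already been rounded at the most significant digit. It follows the rule as stated in these notes, with equal-width intervals in every case; some formulations instead split 7 distinct values into a 2-3-2 grouping, and that refinement is omitted here.

```python
def partition_345(low, high):
    """Split (low, high) into 3, 4, or 5 equal-width intervals by the 3-4-5 rule.

    Assumes integer endpoints already rounded at the most significant digit.
    """
    # Unit of the most significant digit, e.g. 1000 for 2000 and 100 for 400.
    msd = 10 ** (len(str(max(abs(low), abs(high)))) - 1)
    distinct = (high - low) // msd  # distinct values covered at that digit
    if distinct in (3, 6, 7, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    elif distinct in (1, 5, 10):
        parts = 5
    else:
        raise ValueError(f"rule does not cover {distinct} distinct values")
    edges = [low + (high - low) * k / parts for k in range(parts + 1)]
    return list(zip(edges[:-1], edges[1:]))


# Ranges taken from the profit example (Step 3 and part of Step 4).
print(partition_345(-1000, 2000))
print(partition_345(-400, 0))
print(partition_345(2000, 5000))
```

With the ranges from the profit example, the first call yields three intervals of width 1,000, the second yields four intervals of width 100, and the third yields three intervals of width 1,000.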

Concept Hierarchy Generation for Categorical Data

Categorical data are discrete data. Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values. Concept hierarchies for such data can be obtained in several ways:

1. Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
o street < city < state < country
2. Specification of a hierarchy for a set of values by explicit data grouping
o {Urbana, Champaign, Chicago} < Illinois
3. Specification of only a partial set of attributes
o E.g., only street < city, not others
4. Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
o E.g., for a set of attributes: {street, city, state, country}

Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set.

The attribute with the most distinct values is placed at the lowest level of the hierarchy. Exceptions: e.g., weekday, month, quarter, year.

country             15 distinct values
province_or_state   365 distinct values
city                3,567 distinct values
street              674,339 distinct values
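The distinct-value heuristic amounts to sorting the attributes by their number of distinct values. A minimal Python sketch, using the counts listed above, follows; the function name `generate_hierarchy` is an assumption made for illustration.

```python
def generate_hierarchy(distinct_counts):
    """Order attributes from the top level (fewest distinct values)
    down to the lowest level (most distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get)


# Distinct-value counts per attribute, as listed in the table above.
counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}

hierarchy = generate_hierarchy(counts)
print(" < ".join(reversed(hierarchy)))
# → street < city < province_or_state < country
```

The same sort would misorder the exceptions noted above: weekday has only 7 distinct values and month only 12, so a purely count-based ordering would place them above attributes such as year when the data span more than 12 years. Such cases need an ordering supplied by a user or expert.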
