Entropy is often defined as disorder, but this definition is by no means quantitative. Entropy really refers to the statistical distribution of energy across the different possible states.
Entropy-based discretization
o Supervised, top-down split
o It explores class distribution information in its calculation and determination of split-points
o To discretize a numerical attribute, A, the method selects the value of A that has the minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization
Let D consist of data tuples defined by a set of attributes and a class-label attribute. The class-label attribute provides the class information per tuple. The basic method for entropy-based discretization of an attribute A within the set is as follows:
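2. Given a set of samples D, if D is partitioned into two intervals, the entropy of an interval D1 is

$$\mathrm{Entropy}(D_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where p_i is the probability of class i in D1 and m is the number of classes.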
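The entropy formula and the minimum-entropy split can be sketched in a few lines of Python. This is a minimal illustration, not the definitive procedure: the function names `entropy` and `best_split` are hypothetical, candidate split-points are assumed to be midpoints between consecutive distinct values, and the score of a split is assumed to be the size-weighted average of the entropies of the two intervals.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(D) = -sum_i p_i * log2(p_i), over the classes present in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (split_point, score) for the split with the lowest score.

    Assumption: score = |D1|/|D| * Entropy(D1) + |D2|/|D| * Entropy(D2).
    Assumption: candidates are midpoints between consecutive distinct values.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a boundary
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best
```

Applying `best_split` again to the samples on each side of the returned split-point gives the recursive, hierarchical discretization described above.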
Merge (bottom-up approach): find the best neighboring intervals and merge them to form larger intervals, recursively
ChiMerge
o Initially, each distinct value of a numerical attribute A is considered to be one interval
o χ² tests are performed for every pair of adjacent intervals
o Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
o This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)
Discretization by Intuitive Partitioning
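For the ChiMerge procedure described just before this heading, the merge loop can be sketched as follows. This is a rough sketch under stated assumptions: the names `chi2` and `chimerge` are hypothetical, the statistic is the standard Pearson χ² over the class counts of two adjacent intervals, cells with an expected count of zero are simply skipped, and the only stopping criterion implemented is a maximum number of intervals (`max_intervals`).

```python
from collections import Counter


def chi2(a, b, classes):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:  # assumption: skip classes absent from both intervals
                stat += (row[c] - expected) ** 2 / expected
    return stat


def chimerge(values, labels, max_intervals):
    """Return the lower bounds of the intervals left after merging."""
    classes = sorted(set(labels))
    counts = {}
    for v, lab in zip(values, labels):
        counts.setdefault(v, Counter())[lab] += 1
    # Start with one interval per distinct value: [lower_bound, class counts].
    intervals = [[v, counts[v]] for v in sorted(counts)]
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))  # adjacent pair with the lowest chi-square
        intervals[i][1] = intervals[i][1] + intervals[i + 1][1]
        del intervals[i + 1]
    return [lower for lower, _ in intervals]
```

A significance-level criterion would replace the `while` condition with a comparison of the smallest χ² value against a threshold taken from the χ² distribution.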
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals.
o If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
o If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
o If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
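[Figure: worked example of the 3-4-5 rule. Surviving labels: count; Max = $4,700; Step 4: overall range (-$400 to $5,000); (-$400 to 0) split into (-$400 to -$300), (-$300 to -$200), (-$200 to -$100), (-$100 to 0); further intervals (0 to $200), ($200 to $400), ($400 to $600), ($600 to $800); ($1,000 to $2,000).]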
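One level of the 3-4-5 rule can be sketched as below. The function names `msd_unit` and `partition_345` are hypothetical, and the sketch makes two assumptions beyond the text: the range is first rounded outward to the most significant digit, and a range covering 7 distinct values is grouped as 2-3-2 (the usual textbook refinement) rather than into three equal widths.

```python
import math


def msd_unit(low, high):
    """Place value of the most significant digit of the larger-magnitude bound."""
    return 10 ** math.floor(math.log10(max(abs(low), abs(high))))


def partition_345(low, high):
    """Split [low, high] into 3, 4, or 5 natural intervals (one level only)."""
    unit = msd_unit(low, high)
    lo = math.floor(low / unit) * unit   # round down at the most significant digit
    hi = math.ceil(high / unit) * unit   # round up at the most significant digit
    distinct = round((hi - lo) / unit)   # distinct values at that digit
    if distinct == 7:
        # Assumption: 2-3-2 grouping for 7 distinct values.
        bounds = [lo + c * unit for c in (0, 2, 5, 7)]
    else:
        if distinct in (3, 6, 9):
            parts = 3
        elif distinct in (2, 4, 8):
            parts = 4
        elif distinct in (1, 5, 10):
            parts = 5
        else:
            raise ValueError(f"rule does not cover {distinct} distinct values")
        bounds = [lo + i * (hi - lo) / parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))
```

Under these assumptions, `partition_345(-400, 0)` yields the four $100-wide intervals and `partition_345(0, 1000)` the five $200-wide intervals of the kind listed in the figure; applying the function to each resulting interval gives the deeper levels of the hierarchy.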
Categorical data are discrete data. Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values.
1. Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
o street < city < state < country
2. Specification of a hierarchy for a set of values by explicit data grouping
o {Urbana, Champaign, Chicago} < Illinois
3. Specification of only a partial set of attributes
o E.g., only street < city, not others
4. Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
o E.g., for a set of attributes: {street, city, state, country}
o Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
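o The attribute with the most distinct values is placed at the lowest level of the hierarchy
o Exceptions, e.g., weekday, month, quarter, year
[Figure: hierarchy for {street, city, state, country}, from top to bottom: country (15 distinct values), state (365 distinct values), city (3,567 distinct values), street (674,339 distinct values)]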
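The distinct-value heuristic for automatic hierarchy generation can be sketched as a sort on per-attribute counts. The function name `generate_hierarchy` and the row format (a list of dictionaries) are assumptions made for illustration only.

```python
def generate_hierarchy(rows, attributes):
    """Order attributes from the top level (fewest distinct values) to the
    lowest level (most distinct values)."""
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct[a])
```

The heuristic is only as good as its premise. For the exceptions noted above, weekday has 7 distinct values and month has 12, so a count-based ordering would place weekday above month, which is not a meaningful hierarchy; such cases need an explicitly specified ordering.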