
One of the Major Tasks in Data Preprocessing

Data discretization – Part of data reduction but with particular importance, especially for numerical data

Discretization

Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).

Data discretization transforms numeric data by mapping values to interval or concept labels.
Discretization techniques include binning, histogram analysis, cluster analysis, and so on.

Data discretization and concept hierarchy generation are also forms of data reduction. The raw data are
replaced by a smaller number of interval or concept labels. This simplifies the original data and makes
the mining more efficient. The resulting patterns mined are typically easier to understand.

Three types of attributes:

(An attribute is a property or characteristic of an object)


Examples: eye color of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, or feature.

Nominal — values from an unordered set, e.g., color, profession


Ordinal — values from an ordered set, e.g., military or academic rank
Continuous — numeric values, e.g., integers or real numbers (here we aggregate interval and ratio attributes into continuous)

Discretization:
– Reduce the number of values for a given continuous attribute by dividing its range into intervals
– Interval labels can then be used to replace the actual data values
– Reduces data size and prepares the data for further analysis

************************

Discretization techniques can be categorized based on how the discretization is performed, such as
whether it uses class information or which direction it proceeds (i.e., top-down vs. bottom-up). If the
discretization process uses class information, then we say it is supervised discretization. Otherwise, it is
unsupervised. If the process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called
top-down discretization or splitting. This contrasts with bottom-up discretization or merging, which
starts by considering all of the continuous values as potential split-points, removes some by merging
neighborhood values to form intervals, and then recursively applies this process to the resulting
intervals.
– Supervised vs. unsupervised (use class or don’t use class variable)
– Split (top-down) vs. merge (bottom-up)

Discretization for Numeric Data

Typical methods: All the methods can be applied recursively


– Binning (covered earlier)
Top-down split, unsupervised
First sort the data and partition it into (equal-frequency) bins
Then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.

Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is,
the values around it. The sorted values are distributed into a number of “buckets,” or bins. Because
binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2
illustrates some binning techniques. In this example, the data for price are first sorted and then
partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin
means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values
4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9. Similarly,
smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In
smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the
width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the
interval range of values in each bin is constant. Binning is also used as a discretization technique.
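The smoothing operations above can be sketched in Python. The price list below extends the text's Bin 1 values (4, 8, 15) with illustrative data, so the remaining values are assumptions:

```python
# Smoothing a sorted list of prices with equal-frequency bins of size 3.
# Only 4, 8, 15 come from the text; the other values are illustrative.
def make_bins(sorted_values, bin_size):
    return [sorted_values[i:i + bin_size]
            for i in range(0, len(sorted_values), bin_size)]

def smooth_by_means(bins):
    # Replace every value in a bin by the bin mean.
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value by the closer of the bin's min/max boundary.
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted
bins = make_bins(prices, 3)
print(smooth_by_means(bins))       # Bin 1 becomes [9.0, 9.0, 9.0]
print(smooth_by_boundaries(bins))  # Bin 1 becomes [4, 4, 15]
```

As in the text, the mean of 4, 8, and 15 is 9, so smoothing by bin means replaces all of Bin 1 with 9; smoothing by bin boundaries pulls 8 to the nearer boundary, 4.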

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins. Section 3.2.2 discussed binning methods for data smoothing. These methods are also used as
discretization methods for data reduction and concept hierarchy generation. For example, attribute
values can be discretized by applying equal-width or equal-frequency binning, and then replacing each
bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians,
respectively. These techniques can be applied recursively to the resulting partitions to generate concept
hierarchies. Binning does not use class information and is therefore an unsupervised discretization
technique. It is sensitive to the user-specified number of bins, as well as the presence of outliers.
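As a sketch, equal-width binning that replaces each value with an interval label might look like the following (the attribute name "age" and the sample values are illustrative assumptions, not from the text):

```python
# Unsupervised discretization by equal-width binning: each numeric value
# is replaced by the label of the interval it falls into.
def equal_width_labels(values, n_bins):
    lo, hi = min(values), max(values)
    w = (hi - lo) / n_bins
    labels = []
    for v in values:
        i = min(int((v - lo) / w), n_bins - 1)  # clamp the max value into the last bin
        # Half-open label; in effect the last bin is closed at the top.
        labels.append(f"[{lo + i * w:g}, {lo + (i + 1) * w:g})")
    return labels

ages = [13, 15, 16, 19, 22, 25, 33, 35, 42, 70]
print(equal_width_labels(ages, 3))
```

Each bin's labels could then be collapsed further (e.g., to the bin mean or a concept label such as "youth"), and the procedure applied recursively to generate a concept hierarchy.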
– Histogram analysis (covered earlier)
Top-down split, unsupervised

Discretization by Histogram Analysis

Like binning, histogram analysis is an unsupervised discretization technique because it does not use class information. Histograms were introduced in
Section 2.2.3. A histogram partitions the values of an attribute, A, into disjoint ranges called buckets or
bins. Various partitioning rules can be used to define histograms (Section 3.4.6). In an equal-width
histogram, for example, the values are partitioned into equal-size partitions or ranges (e.g., earlier in
Figure 3.8 for price, where each bucket has a width of $10). With an equal-frequency histogram, the
values are partitioned so that, ideally, each partition contains the same number of data tuples. The
histogram analysis algorithm can be applied recursively to each partition in order to automatically
generate a multilevel concept hierarchy, with the procedure terminating once a prespecified number of
concept levels has been reached. A minimum interval size can also be used per level to control the
recursive procedure. This specifies the minimum width of a partition, or the minimum number of values
for each partition at each level. Histograms can also be partitioned based on cluster analysis of the data
distribution, as described next.
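The recursive procedure described above can be sketched as follows; the parameter names (`max_levels`, `min_width`) are illustrative stand-ins for the prespecified number of concept levels and the per-level minimum interval size:

```python
# Sketch: recursive equal-width partitioning that builds a multilevel
# concept hierarchy. Recursion stops when the prespecified number of
# levels is reached or partitions would fall below the minimum width.
def build_hierarchy(lo, hi, n_bins, max_levels, min_width):
    node = {"interval": (lo, hi), "children": []}
    width = (hi - lo) / n_bins
    if max_levels > 1 and width >= min_width:
        for i in range(n_bins):
            node["children"].append(
                build_hierarchy(lo + i * width, lo + (i + 1) * width,
                                n_bins, max_levels - 1, min_width))
    return node

# Three concept levels over the range [0, 100), splitting in half each time.
tree = build_hierarchy(0, 100, 2, max_levels=3, min_width=10)
```

Each node of the resulting tree is one interval of the hierarchy; its children are the subintervals at the next (lower) concept level.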
– Cluster analysis
Either top-down split or bottom-up merge, unsupervised
Can detect and remove outliers
Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to
discretize a numeric attribute, A, by partitioning the values of A into clusters or groups. Clustering takes
the distribution of A into consideration, as well as the closeness of data points, and therefore is able to
produce high-quality discretization results. Clustering can be used to generate a concept hierarchy for A
by following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster
forms a node of the concept hierarchy. In the former, each initial cluster or partition may be further
decomposed into several subclusters, forming a lower level of the hierarchy. In the latter, clusters are
formed by repeatedly grouping neighboring clusters in order to form higher-level concepts. Clustering
methods for data mining are studied in Chapters 10 and 11.
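A minimal 1-D k-means sketch illustrates clustering-based discretization: the values of A are grouped by closeness, and the resulting clusters define the intervals. This is an illustrative implementation under simple assumptions (fixed iteration count, naive seeding), not the specific algorithm of Chapters 10 and 11:

```python
# Illustrative 1-D k-means for discretizing a numeric attribute A.
def kmeans_1d(values, k, iters=20):
    # Naive seeding: spread initial centers across the sorted values.
    centers = sorted(values)[::max(len(values) // k, 1)][:k]
    for _ in range(iters):
        # Assign each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            i = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            clusters[i].append(v)
        # Recompute each center as its cluster mean (keep old center if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

print(kmeans_1d([1, 2, 3, 50, 51, 52], 2))  # two natural groups
```

Because the clusters follow the data distribution, the gaps between them (here, between 3 and 50) become natural cut points for the intervals.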

Simple Discretization Methods: Binning


Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform grid
– If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B – A)/N
– The most straightforward, but outliers may dominate the presentation
– Skewed data is not handled well
Equal-depth (frequency) partitioning


– Divides the range into N intervals, each containing approximately the same number of data points
– Good data scaling
– Managing categorical attributes can be tricky
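The contrast between the two partitioning schemes can be sketched on skewed data; the sample values (with 1000 as an outlier) are illustrative assumptions:

```python
# Equal-width vs. equal-depth partitioning on skewed data: the outlier
# pulls nearly all values into one equal-width bin, while equal-depth
# bins keep their counts balanced.
def equal_width_counts(values, n):
    lo, hi = min(values), max(values)
    w = (hi - lo) / n
    counts = [0] * n
    for v in values:
        counts[min(int((v - lo) / w), n - 1)] += 1  # clamp the max value
    return counts

def equal_depth_bins(values, n):
    s = sorted(values)
    bins, start = [], 0
    for i in range(n):
        size = len(s) // n + (1 if i < len(s) % n else 0)  # spread remainder
        bins.append(s[start:start + size])
        start += size
    return bins

data = [1, 2, 2, 3, 3, 4, 5, 5, 6, 1000]
print(equal_width_counts(data, 3))                  # [9, 0, 1]
print([len(b) for b in equal_depth_bins(data, 3)])  # [4, 3, 3]
```

With W = (1000 – 1)/3, nine of the ten values land in the first equal-width bin, showing how an outlier dominates the presentation; equal-depth partitioning scales gracefully but, as noted above, does not carry over directly to categorical attributes.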
