Académique Documents
Professionnel Documents
Culture Documents
or observation
Attribute Values
Attribute values are numbers/text/symbols
assigned to an attribute
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented
using a finite number of digits.
Continuous attributes are typically represented as floating-
point variables.
Types of Attributes (more
specifically)
Four different types of attributes
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height in {tall, medium,
short}
Interval
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
Ratio
Examples: temperature in Kelvin, length, time, counts
Properties of Attribute
Values
The type of an attribute depends on which of
the following properties it possesses:
Distinctness: =
Order: < >
Addition: + -
Multiplication: * /
types of
Document Data
Transaction Data
Graph
data sets: World Wide Web
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
P r o j e c t i o n P r o j e c t i o n D i s t a n c e L o a d T h i c k n e s s
o f x L o a d o f y l o a d
1 0 . 2 3 5 . 2 7 1 5 . 2 2 2 . 7 1 . 2
1 2 . 6 5 6 . 2 5 1 6 . 2 2 2 . 2 1 . 1
Document Data
Each document becomes a term
vector,
each term is a component (attribute) of
the vector
the value of each component is the
number of times the corresponding term
occurs in the document.
Transaction Data
A special type of record data, where
each record (transaction) involves a set
of items.
For example, consider a grocery store.
The set of products purchased by a
customer during one shopping trip
constitute TID
a transaction,
Items while the
individual 1products that
Bread, Coke, were purchased
Milk
are the items.
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Ordered Data
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Spatio-Temporal Data
Date Latitude Longitude
Temperature
1/24/16 23.566 27.567 69.3
1/25/16 23.566 27.567 71.6
.
.
.
Summary of the importance of
different kinds of data sets
Different kinds of data sets have different kinds
of information quality issues (sensor problems,
key stroke errors, etc.)
Different kinds of data sets will lead to different
data mining goals (finding associations in
transaction data, predicting the next values in
a sequence, etc.)
Data mining techniques cannot be applied
correctly without understanding the
attribute types and the context of the
entire set of data
Data Exploration/Statistics
(From Chapter 3)
Sampling
Can be used for both the preliminary investigation of the data and the
final data analysis.
Stratified sampling
Split the data into several partitions; then draw random
samples from each partition
Sample Size
In a statistics class you might learn n=30 is good (for certain tests/assumptions)
but is it always?
Sparsity
Certain important events may occur
rarely (impacts the data mining
technique)
Resolution
Patterns can depend on the scale
(granularity)
Curse of Dimensionality
When dimensionality
increases, data
becomes increasingly
sparse in the space
that it occupies
Definitions of density
and distance between
points, which is critical
for clustering and
outlier detection,
Randomly generate 500 points
become less
Compute difference between
meaningful; for max and min distance between
classification, more any pair of points
data become
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by
data mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce
noise
Techniques
Principle Component Analysis and similar algebraic
methods are the most common techniques
Others: supervised and non-linear techniques
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
duplicate much or all of the information contained in
one or more other attributes
Example: purchase price of a product and the amount
of sales tax paid
Irrelevant features
contain no information that is useful for the data
mining task at hand
Example: students' ID is often irrelevant to the task of
predicting students' GPA
Data Exploration
What is data exploration?
220
220
210 210
200 200
190 190
180 180
170 170
160 160
150 160 170 180 190 200 210 150 160 170 180 190 200 210
= 0.87 = 0.08
Visually Evaluating Correlation
Scatter plots
showing the
similarity
from 1 to 1.
Visualization
Run charts
Histograms
10th
percentile
75th
percentile
50th
percentile
25th
percentile
10th
percentile
Example of Box Plots
Box plots can be used to compare
attributes
Visualization Techniques: Scatter
Plots
Scatter plots
Attributes values determine the position
Two-dimensional scatter plots most
common, but can have three-dimensional
scatter plots
Often additional attributes can be
displayed by using the size, shape, and
color of the markers that represent the
objects
It is useful to have arrays of scatter plots
can compactly summarize the
relationships of several pairs of attributes
See example on the next slide
Scatter Plot Array of Iris
Attributes
Visualization Notes
Run charts are not in the textbook, but are very
useful and can show information that cannot be
Run Chart seen on a histogramthey show plots of data
over time and allow unusual trends over time to
be seen;
Histograms show overall distributions of data
and can also be used to diagnose problems like
missing values in the data;
Scatterplots show relationships between data
and can be used to find outliers that dont follow
the typical relationship
Histogram
Scatterplot
Useful Statistical
Knowledge/Methods
Why is statistics involved in data
mining?
Data mining requires understanding
the data being mined
Important for looking for information
quality issues
Important for finding the right data
mining method
Exploratory data analysis from the
field of statistics enables an
understanding of the data
Why is statistics involved in data
mining? (continued)
Data mining can involve sampling, the field of
statistics has long studied sampling
Even though there are some fundamental
differences in the two fields, data mining and
statistics both involve minimizing decision
making errors from data and thus utilize
many of the same techniques/theories
The next few slides give a few bits of
knowledge that are commonly used in data
mining and statistics
Distributions
Describes how data are distributed (a measure of
central tendency with variability provide a look at
the distribution of data)
In non-technical terms, similar types of data tend
to have similar histogram patternsa distribution
is a way of representing a histogram pattern with a
mathematical formula
Distributions may be :
Unimodal- describes many types of data that arise from
a single distinct process
Multimodal often arise from several distinct processes
Example Common
Distributions
Discrete: Uniform, Binomial, Poisson
Continuous: Normal, Exponential
y 0 1 x
0 y 1 x
( x x )( y y )
1
( x x) 2
If it looks strange and unusual and you
have never seen it before, its OKmany
more details to come when we start using
it.
OLAP
Moving on from the role of Statistics in Data Mining
OLAP is a database approach to data exploration
OLAP
On-Line Analytical Processing (OLAP)
was proposed by E. F. Codd, the father
of the relational database.
Relational databases put data into
tables, while OLAP uses a
multidimensional array
representation.
Such representations of data previously
existed in statistics and other fields
There are a number of data analysis
and data exploration operations that
are easier with such a data
Creating a Multidimensional
Array
Two key steps in converting tabular data into a
multidimensional array.
First, identify which attributes are to be the
dimensions and which attribute is to be the target
attribute whose values appear as entries in the
multidimensional array.
The attributes used as dimensions must have discrete
values
The target value is typically a count or continuous value,
e.g., the cost of an item
Can have no target variable at all except the count of
objects that have the same set of attribute values
Second, find the value of each entry in the
multidimensional array by summing the values (of
the target attribute) or count of all objects that have
the attribute values corresponding to that entry.
Example: Iris data
We show how the attributes, petal
length, petal width, and species type
can be converted to a
multidimensional array
First, we discretized the petal width and
length to have categorical values: low,
medium, and high
We get the following table - note the
count attribute
Example: Iris data
(continued)
Each unique tuple of petal width, petal
length, and species type identifies one
element of the array.
This element is assigned the
corresponding count value.
The figure illustrates
the result.
All non-specified
tuples are 0.
Example: Iris data
(continued)
Slices of the multidimensional array
are shown by the following cross-
tabulations
What do these tables tell us?
OLAP Operations: Data
Cube
The key operation of a OLAP is the formation of a
data cube
A data cube is a multidimensional representation
of data, together with all possible aggregates.
By all possible aggregates, we mean the
aggregates that result by selecting a proper
subset of the dimensions and summing over all
remaining dimensions.
For example, if we choose the species type
dimension of the Iris data and sum over all other
dimensions, the result will be a one-dimensional
entry with three entries, each of which gives the
number of flowers of each type.
Data Cube Example
Consider a data set that records the
sales of products at a number of
company stores at various dates.
This data can be represented
as a 3 dimensional array
There are 3 two-dimensional
aggregates (3 choose 2 ),
3 one-dimensional aggregates,
and 1 zero-dimensional
aggregate (the overall total)
Data Cube Example (continued)
The following figure table shows one of the
two dimensional aggregates, along with two
of the one-dimensional aggregates, and the
overall total
OLAP Operations: Slicing and Dicing
Slicing is selecting a group of cells from the
entire multidimensional array by specifying
a specific value for one or more dimensions.
Dicing involves selecting a subset of cells
by specifying a range of attribute values.
This is equivalent to defining a subarray from
the complete array.
In practice, both operations can also be
accompanied by aggregation over some
dimensions.
OLAP Operations: Roll-up and Drill-down