
Data (from Chapter 2)

This PPT attempts to highlight certain material and provide a few additional thoughts on the material in chapters 2 and 3 of the textbook.
Data have Value
Data, as we will use the term in class, are stored on a computer.
Someone made the decision to store the data because the data have inherent value.
What is/are Data?

Collection of data objects and their attributes

An attribute is a property or characteristic of an object
  Examples: eye color of a person, temperature, etc.
  Attribute is also known as variable, field, characteristic, or feature

A collection of attributes describes an object (also called an entity)
  Object is also known as record, point, case, sample, entity, instance, or observation

In the table below, each row is an object and each column is an attribute:

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes
Attribute Values
Attribute values are numbers/text/symbols assigned to an attribute

Distinction between attributes and attribute values:
  Same attribute can be mapped to different attribute values
    Example: height can be measured in feet or meters
  Different attributes can be mapped to the same set of values
    Example: attribute values for ID and age are integers
    But properties of attribute values can be different: ID has no limit, but age has a maximum and minimum value
Discrete and Continuous
Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection
of documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes

Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented
using a finite number of digits.
Continuous attributes are typically represented as floating-
point variables.
Types of Attributes (more
specifically)
Four different types of attributes
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height in {tall, medium,
short}
Interval
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
Ratio
Examples: temperature in Kelvin, length, time, counts
Properties of Attribute
Values
The type of an attribute depends on which of
the following properties it possesses:
Distinctness: = and ≠
Order: < >
Addition: + -
Multiplication: * /

Nominal attribute: distinctness


Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
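To make the taxonomy concrete, here is a small illustrative sketch (not from the textbook; the type-to-property mapping is from the slide above, and the code structure is just one way to express it):

```python
# An illustrative sketch: the slide's mapping from attribute type
# to the operations it supports.
PROPERTIES = {
    "nominal":  {"distinctness"},
    "ordinal":  {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio":    {"distinctness", "order", "addition", "multiplication"},
}

def supports(attribute_type: str, prop: str) -> bool:
    """True if the attribute type supports the given operation."""
    return prop in PROPERTIES[attribute_type]

print(supports("ordinal", "addition"))       # False: ranks cannot be added
print(supports("ratio", "multiplication"))   # True: e.g., Kelvin temperatures
```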
Example types of data sets:

Record
  Data Matrix
  Document Data
  Transaction Data

Graph
  World Wide Web
  Molecular Structures

Ordered
  Spatial Data
  Temporal Data
  Sequential Data
  Genetic Sequence Data

This is not a comprehensive list; the next few slides are just meant to give a few examples of the variety of types of data being mined.
Record Data
Data that consists of a collection of
records, each of which consists of a
fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute, and matrix algebra may be used in analysis techniques.
Note you can turn record data into a data matrix via indicator variables (0s and 1s that indicate the presence/absence of a class).

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
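A minimal sketch of the indicator-variable idea, assuming pandas is available (the column names are hypothetical):

```python
# Turn categorical record data into a numeric data matrix via 0/1
# indicator variables (one-hot encoding).
import pandas as pd

records = pd.DataFrame({"marital_status": ["Single", "Married", "Divorced"],
                        "taxable_income": [125, 100, 95]})
matrix = pd.get_dummies(records, columns=["marital_status"], dtype=int)
print(matrix)   # one 0/1 column per marital_status category
```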
Document Data
Each document becomes a term vector:
  each term is a component (attribute) of the vector
  the value of each component is the number of times the corresponding term occurs in the document.
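A toy sketch of building term vectors with plain Python (illustrative only, not a production vectorizer):

```python
# One term-frequency vector per document.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({word for doc in docs for word in doc.split()})
for doc in docs:
    counts = Counter(doc.split())
    print([counts[w] for w in vocab])   # value = occurrences of each term
```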
Transaction Data
A special type of record data, where each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Ordered Data
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Spatio-Temporal Data

Date     Latitude  Longitude  Temperature
1/24/16  23.566    27.567     69.3
1/25/16  23.566    27.567     71.6
...
Summary of the importance of
different kinds of data sets
Different kinds of data sets have different kinds of information quality issues (sensor problems, keystroke errors, etc.)
Different kinds of data sets will lead to different
data mining goals (finding associations in
transaction data, predicting the next values in
a sequence, etc.)
Data mining techniques cannot be applied
correctly without understanding the
attribute types and the context of the
entire set of data
Data Exploration/Statistics

(From Chapter 3)
Sampling
Can be used for both the preliminary investigation of the data and the final data analysis.

Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming. (In statistics, determining sample size is a formal process involving the power of a statistic to detect a change of practical significance.)

Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
Sampling
The key principle for effective sampling is the following:
  For many types of data mining tasks, using a sample will work almost as well as using the entire data set, if the sample is representative.
  A sample is representative if it has approximately the same property (of interest) as the original set of data.
Types of Sampling
Simple Random Sampling
There is an equal probability of selecting any particular
item

Stratified sampling
Split the data into several partitions; then draw random
samples from each partition
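A minimal sketch of both sampling types, assuming pandas and scikit-learn are available (the Iris data and its integer "target" class column stand in for any labeled data set):

```python
# Simple random sampling vs. stratified sampling with pandas.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
simple = iris.sample(n=30, random_state=0)        # simple random sample
stratified = iris.groupby("target", group_keys=False).apply(
    lambda g: g.sample(frac=0.2, random_state=0)  # same fraction per class
)
print(len(simple), len(stratified))               # 30 30
```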
Sample Size
In a statistics class you might learn n = 30 is good (for certain tests/assumptions), but is it always?

[Figure: the same data set shown with 8000 points, 2000 points, and 500 points.]

In the above data, we don't get a full picture of what is going on in the data with the 2000 or 500 point data sets. There is no magic number for data mining sampling; in data mining the rule of thumb is to use all the data if you can, or as much of the data as possible.
Important Characteristics of Data

Dimensionality (the number of columns or attributes)
  Curse of Dimensionality

Sparsity
  Certain important events may occur rarely (impacts the data mining technique)

Resolution
  Patterns can depend on the scale (granularity)
Curse of Dimensionality

When dimensionality increases, data becomes increasingly sparse in the space that it occupies.

Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful; for classification, more data become necessary.

[Figure: randomly generate 500 points, then compute the difference between the max and min distance between any pair of points.]
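The experiment in the figure can be sketched in a few lines, assuming numpy and scipy are available:

```python
# Generate 500 random points and watch the gap between the largest and
# smallest pairwise distances shrink (relative to the smallest) as
# dimensionality grows.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 50, 200):
    points = rng.random((500, d))    # 500 points in d dimensions
    dists = pdist(points)            # all pairwise Euclidean distances
    spread = (dists.max() - dists.min()) / dists.min()
    print(f"dim={d:3d}  relative distance spread = {spread:.2f}")
```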
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by
data mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce
noise

Techniques
Principal Component Analysis and similar algebraic
methods are the most common techniques
Others: supervised and non-linear techniques
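A minimal PCA sketch, assuming scikit-learn is available:

```python
# Reduce the 4-dimensional Iris data to 2 dimensions so it can be
# visualized more easily.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 x 4 data matrix
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)                           # (150, 2)
```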
Feature Subset Selection
Another way to reduce dimensionality of data

Redundant features
duplicate much or all of the information contained in
one or more other attributes
Example: purchase price of a product and the amount
of sales tax paid

Irrelevant features
contain no information that is useful for the data
mining task at hand
Example: students' ID is often irrelevant to the task of
predicting students' GPA
Data Exploration
What is data exploration?

A preliminary exploration of the data


to better understand its
characteristics.
Key motivations of data exploration include
Helping to select the right tool for preprocessing or
analysis
Making use of humans' abilities to recognize patterns
People can recognize patterns not captured
by data analysis tools
Related to the area of Exploratory Data Analysis
(EDA)
Created by statistician John Tukey
Seminal book is Exploratory Data Analysis by Tukey
A nice online introduction can be found in Chapter 1 of
the NIST Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
Techniques Used In Data Exploration

In our discussion of data exploration,


we focus on
Summary statistics
Visualization
Online Analytical Processing (OLAP)
Summary Statistics
Measures of central tendency:
Mean, Median, Mode
Measures of variability:
Max, Min, Quantiles, standard deviation
(variation)
Measure of linear similarity:
correlation
For categorical data, frequency is used in the
calculation for mean, standard deviation and
correlation
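A quick sketch of these summary statistics with pandas (the height/weight values are hypothetical):

```python
# Central tendency, variability, and linear similarity with pandas.
import pandas as pd

df = pd.DataFrame({"height": [1.60, 1.70, 1.80, 1.75, 2.00],
                   "weight": [60, 72, 80, 77, 95]})
print(df.mean())                              # central tendency
print(df["height"].quantile([.25, .5, .75]))  # quartiles
print(df.std())                               # variability
print(df.corr())                              # linear similarity (Pearson)
```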
Measures of Location: Mean and
Median
The mean is the most common
measure of the location of a set of
points.
However, the mean is very sensitive
to outliers.
Thus, the median or a trimmed mean
is also sometimes used.
Measures of Spread: Range and Variance

Range is the difference between the max and min.
The variance or standard deviation is the most common measure of the spread of a set of points.
The standard deviation is the square root of the variance.
Correlation
Correlation measures the linear
relationship between objects
Correlation Coefficient (ρ)

[Figure: two scatter plots over the same axes. Left: a strong linear relationship, ρ = 0.87. Right: no relationship, ρ = 0.08.]

A correlation coefficient near -1 or 1 indicates a strong linear relationship.


A correlation coefficient near 0 indicates no linear relationship.
Correlation Coefficient
(continued)
Calculated on data pairs (must have
a natural pairing of the data to
calculate correlation)


Visually Evaluating Correlation

Scatter plots showing the similarity from -1 to 1.
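A minimal sketch of computing the correlation coefficient with numpy (the paired values are hypothetical):

```python
# Pearson correlation coefficient for paired data.
import numpy as np

x = np.array([150, 160, 170, 180, 190, 200])
y = np.array([152, 165, 168, 183, 188, 205])
r = np.corrcoef(x, y)[0, 1]
print(f"correlation coefficient = {r:.2f}")   # near 1: strong linear
```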
Visualization
Run charts

Histograms

Scatter plots (2 and 3-D)

Higher dimensions sometimes with


use of color, different sized dots, etc.
Visualization
Visualization is the conversion of data into a
visual or tabular format so that the
characteristics of the data and the relationships
among data items or attributes can be analyzed
or reported.

Visualization of data is one of the most powerful


and appealing techniques for data exploration.
Humans have a well developed ability to analyze
large amounts of information that is presented
visually
Can detect general patterns and trends (however, we can also overdo this)
Can detect outliers and unusual patterns
Example: Sea Surface Temperature
The following shows the Sea Surface
Temperature (SST) for July 1982
Tens of thousands of data points are
summarized in a single figure
Representation
Is the mapping of information to a visual
format
Data objects, their attributes, and the
relationships among data objects are
translated into graphical elements such as
points, lines, shapes, and colors.
Example:
Objects are often represented as points
Their attribute values can be represented as the
position of the points or the characteristics of the
points, e.g., color, size, and shape
If position is used, then the relationships of points,
i.e., whether they form groups or a point is an
outlier, is easily perceived.
Arrangement
Is the placement of visual elements
within a display
Can make a large difference in how
easy it is to understand the data
Example: Same data, represented differently
Contingency Table (Joint Probability Table, Confusion Matrix)

        B                  Not B                  Totals
A       Total A and B      Total A and Not B      Total A
Not A   Total B and Not A  Total Not B and Not A  Total Not A
Totals  Total B            Total Not B            Grand Total

Used to summarize some data mining results (confusion matrices in classification, associations in association analysis).
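A minimal contingency-table sketch with pandas (hypothetical data; crosstab's margins option adds the totals row and column):

```python
# Build a contingency table for two binary attributes.
import pandas as pd

df = pd.DataFrame({"A": [True, True, False, False, True],
                   "B": [True, False, False, True, True]})
print(pd.crosstab(df["A"], df["B"], margins=True))
```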
Selection in visualization
Data mining algorithms may be able to handle all the data, but visualization rarely can.
Visualization often requires selection for the elimination or the de-emphasis of certain objects and attributes.
Selection may involve choosing a subset of attributes:
  Dimensionality reduction is often used to reduce the number of dimensions to two or three
  Alternatively, pairs of attributes can be considered
Selection may also involve choosing a subset of objects:
  A region of the screen can only show so many points
  Can sample, but want to preserve points in sparse areas
Visualization Techniques:
Histogram
Histograms
Usually shows the distribution of values of a single
variable
Divide the values into bins and show a bar plot of the
number of objects in each bin.
The height of each bar indicates the number of objects
Shape of histogram depends on the number of bins
Example: Petal Width (10 and 20 bins, respectively)
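A minimal histogram sketch, assuming matplotlib and scikit-learn are available:

```python
# Iris petal width in 10 bins; the fourth data column is petal width.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

petal_width = load_iris().data[:, 3]
plt.hist(petal_width, bins=10)    # try bins=20 to see the shape change
plt.xlabel("Petal width (cm)")
plt.ylabel("Count")
plt.show()
```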
Two-Dimensional
Histograms
Show the joint distribution of the values
of two attributes
Example: petal width and petal length
What does this tell us?
Visualization Techniques: Box Plots

Invented by J. Tukey
Another way of displaying the distribution of data
The following figure shows the basic parts of a box plot:

[Figure: a box plot annotated with an outlier, the upper whisker (90th percentile), the 75th percentile, the 50th percentile, the 25th percentile, and the lower whisker (10th percentile).]
Example of Box Plots
Box plots can be used to compare
attributes
Visualization Techniques: Scatter
Plots
Scatter plots
Attribute values determine the position
Two-dimensional scatter plots most
common, but can have three-dimensional
scatter plots
Often additional attributes can be
displayed by using the size, shape, and
color of the markers that represent the
objects
It is useful to have arrays of scatter plots that can compactly summarize the relationships of several pairs of attributes
See example on the next slide
Scatter Plot Array of Iris
Attributes
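The array itself is a figure on the original slide; a minimal sketch of generating one, assuming pandas, matplotlib, and scikit-learn:

```python
# A scatter plot array (one panel per attribute pair) for Iris.
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
scatter_matrix(df, figsize=(8, 8))
plt.show()
```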
Visualization Notes
Run charts are not in the textbook, but are very useful and can show information that cannot be seen on a histogram: they show plots of data over time and allow unusual trends over time to be seen.
Histograms show overall distributions of data and can also be used to diagnose problems like missing values in the data.
Scatterplots show relationships between data and can be used to find outliers that don't follow the typical relationship.

[Figures: an example run chart, histogram, and scatterplot.]
Useful Statistical
Knowledge/Methods
Why is statistics involved in data
mining?
Data mining requires understanding
the data being mined
Important for looking for information
quality issues
Important for finding the right data
mining method
Exploratory data analysis from the
field of statistics enables an
understanding of the data
Why is statistics involved in data
mining? (continued)
Data mining can involve sampling; the field of statistics has long studied sampling
Even though there are some fundamental
differences in the two fields, data mining and
statistics both involve minimizing decision
making errors from data and thus utilize
many of the same techniques/theories
The next few slides give a few bits of
knowledge that are commonly used in data
mining and statistics
Distributions
Describes how data are distributed (a measure of central tendency together with variability provides a look at the distribution of data).
In non-technical terms, similar types of data tend to have similar histogram patterns; a distribution is a way of representing a histogram pattern with a mathematical formula.
Distributions may be:
  Unimodal: describes many types of data that arise from a single distinct process
  Multimodal: often arises from several distinct processes
Example Common
Distributions
Discrete: Uniform, Binomial, Poisson
Continuous: Normal, Exponential

Understanding underlying distribution of the data is required for


some classification, clustering and anomaly detection techniques
so as needed, we will bring these up in more detail later in the
course
3 Important Rules Regarding the Average and Standard Deviation

The Chebyshev inequality:
  P[μ - hσ < X < μ + hσ] ≥ 1 - 1/h²
  Works for any distribution!

The Vysochanskij & Petunin (1981) extension to the Chebyshev inequality:
  P[μ - hσ < X < μ + hσ] ≥ 1 - 4/(9h²)
  Works for any unimodal distribution!

The central limit theorem (i.e., use the normal distribution in many cases)

These three stat rules provide a basis for saying if data look unusual using only the average and standard deviation; we will use these later in the course for anomaly detection.
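A minimal sketch of the anomaly-detection idea these rules support, using numpy (synthetic data with one injected outlier):

```python
# Flag values more than h standard deviations from the mean. By
# Chebyshev, with h = 3 at least 1 - 1/9 (about 89%) of ANY
# distribution lies within h sigma, so flagged points are rare.
import numpy as np

rng = np.random.default_rng(1)
data = np.append(rng.normal(10, 1, 200), 25.0)   # one injected outlier
mu, sigma = data.mean(), data.std()
h = 3
unusual = data[np.abs(data - mu) > h * sigma]
print("flagged as unusual:", unusual)            # flags the 25.0
```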
Understanding Uncertainty
Present in every decision, process, or
future event
Sometimes called: variability, error,
or inconsistency
Data mining supports decision
making, and therefore should
quantify uncertainty!
Quantifying Risk in a Traditional
Statistical Test:

Type I Error (alpha)


Type II Error (beta)
May be difficult to measure in
complex decisions
Acceptable levels of each error depend on the situation
The next six slides provide an example of a statistical hypothesis test. Please note, you will not have to know or perform this kind of statistical hypothesis testing for this class. I am showing this example (hopefully a reminder from a previous statistics class) to emphasize the contrast between statistical techniques and data mining techniques.
Example of Using Traditional Statistics (hypothesis testing) to Quantify Risk in a Simple Decision

Suppose you are working as a consultant to build a website for a business wanting to branch into online sales.

Should the background of the webpage be blue or yellow? (A ridiculous example, but just pretend this is a really important decision.)

If you want to involve users and make a scientifically valid decision on what users actually prefer, statistics is the only way.
Hypothesis formation
Users prefer a blue background over a yellow background.

Must be stated formally as a hypothesis that can be tested to provide confidence in conclusions:

  Null Hypothesis H0: pB = .5
  Alternative Hypothesis HA: pB > .5
  where pB is the proportion of users that prefer the blue background
Hypothesis Testing
null hypothesis:
Mathematically stated as an equality
Data are used to provide evidence on whether or not the
null hypothesis can be rejected
alternative hypothesis:
Mathematically stated as greater than, less than, or not
equal to
Data collection
Data must be collected in an unbiased manner.
Blind and double-blind techniques are often used to ensure the researcher is not biasing results.

Suppose 68 of 100 randomly surveyed customers preferred a blue to a yellow background, i.e., pB = .68.
Statistical Test

In this case the appropriate test is a one-sample z-test:

  z = (pB - p0) / sqrt(p0(1 - p0)/n)

With pB = .68, p0 = .5, and n = 100:

  z = (.68 - .5) / sqrt(.5(1 - .5)/100) = 3.6

Why use statistics? Why not just say "68 out of 100 is more than half, people must like blue, done"? Because if we were to repeat the experiment and ask 100 different people, would we really expect exactly 68 to prefer blue in the next sample? Nope. So at what point can we really conclude blue is preferred? Statistical tests are the only scientifically valid way to decide.
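A minimal sketch of the same one-sample z-test in code, assuming scipy is available for the p-value:

```python
# One-sample z-test for a proportion; scipy's normal CDF gives the p-value.
import math
from scipy.stats import norm

p_hat, p0, n = 0.68, 0.5, 100
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 1 - norm.cdf(z)        # one-sided alternative: pB > 0.5
print(f"z = {z:.2f}, p-value = {p_value:.4f}")   # z = 3.60, p ~ 0.0002
```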
Statistical Test (continued)
Rejection of the null hypothesis depends on a predetermined value of alpha, the probability of a type I error (0.05 is commonly used for 95% confidence).

The alpha value and test statistic are taken to the appropriate distribution (using tables or software) to determine the agreement of the data with the null hypothesis.

z* = 1.65 for 95% confidence in our design.
Compare z to z*: if z is greater than z*, reject the null hypothesis.
3.6 is greater than 1.65, so the conclusion is that more users prefer a blue background.
Data Mining Methods Differ From Traditional Statistical Methods!
Data mining does not involve well-controlled experiments.
Data mining will typically involve so much data that hypothesis testing is meaningless.
  Traditional statistics optimizes decision making for small sample sizes; if we were to try traditional statistical approaches for data mining, statistical significance would be shown even for events of no practical significance.
  Why? Sample size appears in the denominator of the standard error of the test statistic in statistical hypothesis tests, meaning as sample size becomes large, significance will always be shown, making such an approach useless for large amounts of data.
However, the principles of quantifying risk, stating assumptions, etc. should be carried over to data mining as much as possible.
Also, sometimes in data mining similar calculations are used, but interpreted differently.
Data Mining Should
Clearly state uncertainty
e.g. confidence bounds, sensitivity
analysis

Provide a measurable improvement


over guessing
Often many alternative models/methods
are compared before selecting the best
DM method
Are there some statistical methods that are used for data mining? Yes. For example, regression is a statistical technique very useful in DM; we will cover this when we talk about classification (note, the details of how regression is done differ from a stat problem to a DM problem).
The simplest form of regression

Ordinary least squares regression:

  ŷ = β0 + β1·x

  β0 = ȳ - β1·x̄

  β1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
If it looks strange and unusual and you
have never seen it before, its OKmany
more details to come when we start using
it.
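A minimal sketch implementing the formulas above with numpy (the data pairs are hypothetical):

```python
# Ordinary least squares: slope and intercept from the closed-form
# formulas on the slide.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"y = {b0:.2f} + {b1:.2f} x")   # fitted line
```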
OLAP
Moving on from the role of Statistics in Data Mining
OLAP is a database approach to data exploration
OLAP
On-Line Analytical Processing (OLAP)
was proposed by E. F. Codd, the father
of the relational database.
Relational databases put data into
tables, while OLAP uses a
multidimensional array
representation.
Such representations of data previously
existed in statistics and other fields
There are a number of data analysis and data exploration operations that are easier with such a data representation.
Creating a Multidimensional
Array
Two key steps in converting tabular data into a
multidimensional array.
First, identify which attributes are to be the
dimensions and which attribute is to be the target
attribute whose values appear as entries in the
multidimensional array.
The attributes used as dimensions must have discrete
values
The target value is typically a count or continuous value,
e.g., the cost of an item
Can have no target variable at all except the count of
objects that have the same set of attribute values
Second, find the value of each entry in the
multidimensional array by summing the values (of
the target attribute) or count of all objects that have
the attribute values corresponding to that entry.
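A minimal sketch of this second step with pandas, where a pivot table plays the role of the multidimensional array (the store/product/amount names are hypothetical):

```python
# Each entry of the "array" sums the target attribute over all objects
# with the matching (store, product) attribute values.
import pandas as pd

sales = pd.DataFrame({
    "store":   ["A", "A", "B", "B", "B"],
    "product": ["pen", "ink", "pen", "pen", "ink"],
    "amount":  [3, 5, 2, 4, 1],
})
cube = pd.pivot_table(sales, values="amount", index="store",
                      columns="product", aggfunc="sum", fill_value=0)
print(cube)
```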
Example: Iris data
We show how the attributes, petal
length, petal width, and species type
can be converted to a
multidimensional array
First, we discretized the petal width and
length to have categorical values: low,
medium, and high
We get the following table - note the
count attribute
Example: Iris data
(continued)
Each unique tuple of petal width, petal
length, and species type identifies one
element of the array.
This element is assigned the
corresponding count value.
The figure illustrates
the result.
All non-specified
tuples are 0.
Example: Iris data
(continued)
Slices of the multidimensional array
are shown by the following cross-
tabulations
What do these tables tell us?
OLAP Operations: Data
Cube
The key operation of OLAP is the formation of a data cube
A data cube is a multidimensional representation
of data, together with all possible aggregates.
By all possible aggregates, we mean the
aggregates that result by selecting a proper
subset of the dimensions and summing over all
remaining dimensions.
For example, if we choose the species type
dimension of the Iris data and sum over all other
dimensions, the result will be a one-dimensional
entry with three entries, each of which gives the
number of flowers of each type.
Data Cube Example
Consider a data set that records the
sales of products at a number of
company stores at various dates.
This data can be represented
as a 3 dimensional array
There are 3 two-dimensional
aggregates (3 choose 2 ),
3 one-dimensional aggregates,
and 1 zero-dimensional
aggregate (the overall total)
Data Cube Example (continued)
The following table shows one of the two-dimensional aggregates, along with two of the one-dimensional aggregates, and the overall total
OLAP Operations: Slicing and Dicing
Slicing is selecting a group of cells from the
entire multidimensional array by specifying
a specific value for one or more dimensions.
Dicing involves selecting a subset of cells
by specifying a range of attribute values.
This is equivalent to defining a subarray from
the complete array.
In practice, both operations can also be
accompanied by aggregation over some
dimensions.
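A minimal slicing/dicing sketch on a hypothetical two-dimensional sales cube (continuing the pandas pivot-table stand-in from earlier):

```python
# Slicing fixes a dimension to one value; dicing keeps subsets of values.
import pandas as pd

cube = pd.DataFrame({"ink": [5, 1], "pen": [3, 6]},
                    index=pd.Index(["A", "B"], name="store"))
slice_a = cube.loc["A"]            # slice: all sales for store A
dice = cube.loc[["A"], ["pen"]]    # dice: a subarray of the full array
print(slice_a, dice, sep="\n\n")
```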
OLAP Operations: Roll-up and Drill-down

Attribute values often have a hierarchical


structure.
Each date is associated with a year, month, and
week.
A location is associated with a continent, country,
state (province, etc.), and city.
Products can be divided into various categories,
such as clothing, electronics, and furniture.
Note that these categories often nest and form
a tree or lattice
A year contains months, which contain days
A country contains states, which contain cities
OLAP Operations: Roll-up and Drill-down

This hierarchical structure gives rise


to the roll-up and drill-down
operations.
For sales data, we can aggregate (roll up)
the sales across all the dates in a month.
Conversely, given a view of the data
where the time dimension is broken into
months, we could split the monthly sales
totals (drill down) into daily sales totals.
Likewise, we can drill down or roll up on
the location or product ID attributes.
Summary OLAP
OLAP is a database approach to data exploration
(and more generally is used for Business
Intelligence)
Many different OLAP products, simple OLAP
features are built into Excel (pivot tables)
May not be easy to implement OLAP for
extremely large tables (however, the same results
can be achieved with carefully written SQL)
Clarifying note: OLAP is not data mining, it might
be used as a first step to better understand data
before a data mining technique is applied
Other topics closely tied to data mining that we will not cover in detail (all three have enough material for a class of their own):
Data Warehouses: often used for data mining (bring together different data sources into a subject-oriented relational database)
Non-relational databases: growing in popularity for data mining because of speed
Information quality: fundamentally tied to the effectiveness of data mining
Big take away from chapters 2 and 3
The first steps in data mining are:
Understand the data. Beyond just determining the type of data, we are trying to answer the question: Is there something of value that can be found in this data? (The answer is not always yes.)
  Would it be valuable to know anomalies (unusual entities)?
  Would it be valuable to know relationships in the data (correlations, associations, clusters)?
  Would it be valuable to make predictions (classify or predict new values)?
Look for information quality issues (which can become data mining problems on their own).
