Académique Documents
Professionnel Documents
Culture Documents
Data Visualization
Summary
pla
y
ball
score
game
wi
n
lost
timeout
season
coach
team
Record
Relational records
Transaction data
Graph and network
Molecular Structures
Ordered
Image data:
Video data:
Document 1
Document 2
Document 3
TID
Items
2
3
4
5
Beer, Bread
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Diaper, Milk
Dimensionality
Sparsity
Resolution
Curse of dimensionality
Distribution
Data Objects
Examples:
Attributes
Types:
Nominal
Binary
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
e.g., gender
No true zero-point
Ratio
Inherent zero-point
monetary quantities
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a
collection of documents
Sometimes, represented as integer variables
Note: Binary attributes are a special case of discrete
attributes
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and
represented using a finite number of digits
Continuous attributes are typically represented as
floating-point variables
Data Visualization
Summary
Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities
of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
1 n
x xi
n i 1
n
Median:
w x
i 1
n
w
i 1
Mode
median L1 (
n / 2 ( freq)l
freqmedian
Empirical formula:
)width
positively skewed
symmetric
negatively skewed
Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
1 n
1 n 2 1 n
2
s
( xi x )
[ xi ( xi ) 2 ]
n 1 i 1
n 1 i 1
n i 1
2
1
N
2
1
(
x
i
N
i 1
2
xi 2
i 1
Boxplot Analysis
Boxplot
Histogram Analysis
40
30
35
25
10000
30000
50000
70000
90000
Quantile Plot
Scatter plot
Uncorrelated Data
Data Visualization
Summary
Data Visualization
(a) Income
(d) age
28
Direct visualization
Landscapes
Prosection views
Hyperslice
Parallel coordinates
Scatterplot Matrices
Landscapes
news articles
visualized as
a landscape
Parallel Coordinates
n equidistant axes which are parallel to one of the screen axes and
correspond to the attributes
The axes are scaled to the [minimum, maximum]: range of the
corresponding attribute
Every data item corresponds to a polygonal line which intersects each
of the axes at the point which corresponds to the value for the
attribute
Attr. 1
Attr. 2
Attr. 3
Attr. k
Chernoff Faces
Stick Figures
General techniques
Chernoff Faces
Stick Figure
A census data
figure showing
age, income,
gender,
education, etc.
A 5-piece stick
figure (1 body
and 4 limbs w.
different
angle/length)
Data Mining, IGIT
Dimensional Stacking
Worlds-within-Worlds
Tree-Map
Cone Trees
InfoCube
Dimensional Stacking
attribute 4
attribute 2
attribute 3
attribute 1
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
Worlds-within-Worlds
Assign the function and two most important parameters to innermost
world
Nvision: Dynamic
interaction through data
glove and stereo
displays, including
rotation, scaling (inner)
and translation
(inner/outer)
Auto Visual: Static
interaction by means of
queries
Tree-Map
InfoCube
Data Visualization
Summary
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0,1]
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects
are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity refers to a similarity or dissimilarity
Data matrix
n data points with p
dimensions
Two modes
Dissimilarity matrix
n data points, but
registers only the
distance
A triangular matrix
Single mode
x11
...
x
i1
...
x
n1
... x1f
... ...
...
xif
...
...
... xnf
... x1p
... ...
... xip
... ...
... xnp
d(2,1)
d(3,1) d ( 3,2) 0
:
:
:
binary variables:
Example
Name
Jack
Mary
Jim
Gender
M
F
M
Fever
Y
Y
Y
Cough
N
N
P
Test-1
P
P
N
Test-2
N
N
N
Test-3
N
P
N
Test-4
N
N
N
Z-score:
sf 1
n (| x1 f m f | | x2 f m f | ... | xnf m f |)
mf 1
n (x1 f x2 f ... xnf )
.
xif m f
zif
sf
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
point
x1
x2
x3
x4
attribute1 attribute2
1
2
3
5
2
0
4
5
Dissimilarity Matrix
(with Euclidean Distance)
x1
x1
x2
x3
x4
x2
0
3.61
2.24
4.24
x3
0
5.1
1
x4
0
5.39
where i = (xi1, xi2, , xip) and j = (xj1, xj2, , xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called L-h norm)
Properties
point
x1
x2
x3
x4
attribute 1 attribute 2
1
2
3
5
2
0
4
5
Manhattan (L1)
L
x1
x2
x3
x4
x1
0
5
3
6
x2
x3
x4
0
6
1
0
7
x2
x3
x4
Euclidean (L2)
L2
x1
x2
x3
x4
x1
0
3.61
2.24
4.24
0
5.1
1
0
5.39
Supremum
L
x1
x2
x3
x4
x1
x2
0
3
2
3
x3
0
5
1
x4
0
5
Ordinal Variables
rif 1
M f 1
pf 1 ij( f ) dij( f )
d (i, j)
pf 1 ij( f )
f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
f is numeric: use the normalized distance
f is ordinal
Compute ranks rif and
zif r 1
M 1
Treat zif as interval-scaled
if
Cosine Similarity
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5
= 6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5
= 4.12
cos(d1, d2 ) = 0.94
Data Visualization
Summary
Summary
Many types of data sets, e.g., numerical, text, graph, Web, image.
Many methods have been developed but still an active area of research.