Multimedia Databases
Christos Faloutsos
CMU
www.cs.cmu.edu/~christos
Outline
Goal: Find similar / interesting things
Problem - Applications
Indexing - similarity search
New tools for Data Mining: Fractals
Conclusions
Resources
U. of Alberta
C. Faloutsos
Problem
Given a large collection of (multimedia)
records, find similar/interesting things, i.e.:
Allow fast, approximate queries, and
Find rules/patterns
Sample queries
Similarity search
Find pairs of branches with similar sales patterns
Find medical cases similar to Smith's
Find pairs of sensor series that move in sync
Outline
Goal: Find similar / interesting things
Problem - Applications
Indexing - similarity search
New tools for Data Mining: Fractals
Conclusions
Resources
Indexing - Multimedia
Problem:
given a set of (multimedia) objects,
find the ones similar to a desired query object (quickly!)
[Figure: three example time sequences, $price vs. day (365 days)]
GEMINI - Pictorially
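[Figure: each sequence S1, ..., Sn ($price vs. day, 365 days) is mapped to a point F(S1), ..., F(Sn) in feature space, with axes e.g. avg and e.g. std]
The feature points are indexed with off-the-shelf SAMs (Spatial Access Methods)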
GEMINI
fast; correct (=no false dismissals)
used for
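A minimal sketch of the filter-and-refine idea behind GEMINI, under stated assumptions: the features are the scaled mean and standard deviation of each sequence (matching the "avg" and "std" axes in the picture), the distance is Euclidean, and a linear scan stands in for the spatial access method. The names `features` and `range_query` are illustrative, not taken from the talk. With this scaling, the feature distance never exceeds the true distance, which is what rules out false dismissals.

```python
import math


def features(seq):
    """Map a sequence to 2 features: sqrt(n) * (mean, population std)."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / n)
    scale = math.sqrt(n)
    return (scale * mean, scale * std)


def range_query(query, sequences, eps):
    """Return indices of sequences within Euclidean distance eps of query."""
    fq = features(query)
    # Filter step: cheap test in the 2-d feature space. A real system would
    # ask a spatial access method for this; a linear scan is used here.
    candidates = [i for i, s in enumerate(sequences)
                  if math.dist(fq, features(s)) <= eps]
    # Refine step: discard false alarms using the true distance.
    return [i for i in candidates if math.dist(query, sequences[i]) <= eps]


sequences = [[1, 2, 3, 4], [1, 2, 3, 5], [10, 10, 10, 10], [4, 3, 2, 1]]
print(range_query([1, 2, 3, 4], sequences, 1.5))
```

The reversed sequence [4, 3, 2, 1] has the same mean and std as the query, so it survives the filter step and is only removed by the refine step: a false alarm, which costs time but not correctness.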
Remaining issues
how to extract features automatically?
how to merge similarity scores from
different media
Outline
Goal: Find similar / interesting things
Problem - Applications
Indexing - similarity search
Visualization: FastMap
Relevance feedback: FALCON
FastMap
      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0
[Figure: distance matrix -> ?? -> the objects drawn as points; two groups ~100 apart, members of each group ~1 apart]
FastMap
Multi-dimensional scaling (MDS) can do that, but in O(N^2) time
We want a linear algorithm: FastMap [SIGMOD95]
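A rough sketch of the FastMap idea, assuming only that a distance function between object indices is available. Each axis is defined by two far-apart pivot objects, every object is projected onto the line through them with the cosine law, and later axes work on the residual distances. The function name, the pivot heuristic with two "farthest" hops, and the starting object are illustrative choices. Only distances to the pivots are evaluated, so the cost grows linearly with the number of objects.

```python
def fastmap(n, dist, k):
    """Map n objects to k-d points, given dist(i, j) between object indices."""
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, axis):
        # Squared distance left over after removing the first `axis` coordinates.
        val = dist(i, j) ** 2
        for c in range(axis):
            val -= (coords[i][c] - coords[j][c]) ** 2
        return max(val, 0.0)  # clip small negative values

    for axis in range(k):
        # Pivot heuristic: start anywhere, then hop to the farthest object twice.
        b = 0
        a = max(range(n), key=lambda i: d2(b, i, axis))
        b = max(range(n), key=lambda i: d2(a, i, axis))
        dab2 = d2(a, b, axis)
        if dab2 == 0.0:
            break  # nothing left to explain; remaining coordinates stay 0
        dab = dab2 ** 0.5
        for i in range(n):
            # Cosine law: position of object i along the line from a to b.
            coords[i][axis] = (d2(a, i, axis) + dab2 - d2(b, i, axis)) / (2.0 * dab)
    return coords


# The 5-object distance matrix from the previous slide.
D = [[0, 1, 1, 100, 100],
     [1, 0, 1, 100, 100],
     [1, 1, 0, 100, 100],
     [100, 100, 100, 0, 1],
     [100, 100, 100, 1, 0]]
for name, point in zip(["O1", "O2", "O3", "O4", "O5"],
                       fastmap(5, lambda i, j: D[i][j], 2)):
    print(name, point)
```

On this matrix the first coordinate alone already separates {O1, O2, O3} from {O4, O5} by roughly 100.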
[Figure: exchange rate vs. time for JPY and HKD]
Applications - financial
currency exchange rates [ICDE00]
[Figure: plot with points labeled FRF, GBP, JPY, HKD, USD(t), USD(t-5)]
Applications - financial
currency exchange rates [ICDE00]
[Figure: plot with points labeled FRF, DEM, HKD, JPY, USD(t), USD(t-5), USD, GBP]
Outline
Goal: Find similar / interesting things
Problem - Applications
Indexing - similarity search
Visualization: FastMap
Relevance feedback: FALCON
DEMO (demo server)
FALCON
[Figure: example sequences shaped like Vs and like inverted Vs]
average: is flat!
[Figure: three panels of positive examples (+), labeled Rocchio (avg), MARS, and MindReader; the MARS and MindReader panels also mark a point x]
[Figure: positive examples (+) in feature space; axis feature2, e.g., std]
A: Aggregate Dissimilarity D_G(x)
[Figure: candidate x, good examples g1, g2 (+), and the distances d(g, x)]
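The slide does not spell out a formula for the aggregate dissimilarity. The sketch below assumes a power-mean form with a negative exponent, so that a candidate close to any one good example scores as similar; the exponent value -5, the Euclidean distance, and the function name are illustrative assumptions, not taken from the slides.

```python
import math


def aggregate_dissimilarity(x, good, alpha=-5.0):
    """Power mean of the distances from x to every good example (assumed form)."""
    dists = [math.dist(x, g) for g in good]
    if alpha < 0 and any(d == 0.0 for d in dists):
        return 0.0  # x coincides with a good example
    return (sum(d ** alpha for d in dists) / len(dists)) ** (1.0 / alpha)


good = [(0.0, 0.0), (10.0, 0.0)]
print(aggregate_dissimilarity((1.0, 0.0), good))  # close to g1
print(aggregate_dissimilarity((5.0, 0.0), good))  # midpoint of g1 and g2
```

With a negative exponent the point near g1 scores lower (more similar) than the midpoint, whereas a query built from the plain average of the good examples would favor the midpoint. This is one way to handle disjunctive sets of examples such as "Vs or inverted Vs".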
FALCON
converges quickly (~5 iterations)
good precision/recall
is fast (can use off-the-shelf spatial/metric
access methods)
Outline
Goal: Find similar / interesting things
Problem - Applications
Indexing - similarity search
New tools for Data Mining: Fractals
Conclusions
Resources
- separability??
[Figure: plot with an axis labeled engine size]
(Q: why?
A: to avoid the dimensionality curse)
Answer:
Fractals / self-similarities / power laws
What is a fractal?
= self-similar point set, e.g., Sierpinski triangle:
...
zero area;
infinite length!
Definitions (contd)
Paradox: infinite perimeter; zero area!
dimensionality: between 1 and 2
actually: log(3)/log(2) ≈ 1.58 (long story)
Eg: #cylinders; miles / gallon
[Figure: scatter plot with axes x and y]
Q: fd of a plane?
A: nn ( <= r ) ~ r^2
fd == slope of ( log(nn) vs log(r) )
Sierpinski triangle
== correlation integral
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
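The slope measurement above can be reproduced with a small experiment. This is a sketch under assumptions: the points are sampled with the "chaos game" (jump halfway towards a randomly chosen corner), the radii are a handful of powers of 1/2, and the slope is an ordinary least-squares fit. The sample size, seed, and function names are illustrative. The estimate is noisy, but it should land near log(3)/log(2) rather than near 1 or 2.

```python
import bisect
import math
import random


def sierpinski_points(n, seed=0):
    """Sample n points of the Sierpinski triangle with the chaos game."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.25, 0.25
    points = []
    for step in range(n + 20):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2
        if step >= 20:  # skip the first few steps, before the orbit settles
            points.append((x, y))
    return points


def correlation_slope(points, radii):
    """Slope of log(#pairs within <= r) vs log(r), by least squares."""
    dists = sorted(math.dist(p, q)
                   for i, p in enumerate(points) for q in points[i + 1:])
    xs = [math.log(r) for r in radii]
    # max(..., 1) guards against log(0) if a radius captures no pair.
    ys = [math.log(max(bisect.bisect_right(dists, r), 1)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
            / sum((a - mx) ** 2 for a in xs))


points = sierpinski_points(1000)
print(correlation_slope(points, [1 / 64, 1 / 32, 1 / 16, 1 / 8]))
```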
Observations
self-similarity
<=> fractals
<=> scale-free
<=> power-laws (y = x^a, F = C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
Road map
clusters?
separable?
attraction/repulsion?
data scrubbing
duplicates?
- repulsion!
[Figure: pair-count curves labeled spi-spi and spi-ell, plotted against log(r)]
spatial d.m.
Heuristic on choosing # of clusters
[Figure: plot with radii r1 and r2 marked]
- repulsion!!
- duplicates
[Figure: pair-count curves labeled spi-spi and spi-ell, plotted against log(r)]
[Figure: example point sets, panels (b) Line and (c) Spike]
Solution:
drop the attributes that don't increase the partial f.d. (PFD)
dfn: the PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
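One way to read the rule above is as a greedy pass over the attributes. The sketch below is a simplified forward pass, not the algorithm of the SBBD00 paper: it estimates the f.d. of each projection as the slope of log(#pairs within <= r) vs log(r), and keeps an attribute only if adding it raises the PFD by more than a threshold. The radii, the threshold `min_gain`, and all function names are assumptions made for the illustration, and the result depends on the order in which attributes are visited.

```python
import bisect
import math
import random


def fractal_dim(points, radii=(0.02, 0.04, 0.08, 0.16)):
    """Estimate the f.d. as the slope of log(#pairs within <= r) vs log(r)."""
    dists = sorted(math.dist(p, q)
                   for i, p in enumerate(points) for q in points[i + 1:])
    xs = [math.log(r) for r in radii]
    ys = [math.log(max(bisect.bisect_right(dists, r), 1)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
            / sum((a - mx) ** 2 for a in xs))


def select_attributes(rows, min_gain=0.3):
    """Keep attribute a only if it raises the PFD of the kept set."""
    kept, current = [], 0.0
    for a in range(len(rows[0])):
        trial = kept + [a]
        pfd = fractal_dim([[row[j] for j in trial] for row in rows])
        if pfd - current > min_gain:
            kept, current = trial, pfd
    return kept


# Synthetic data: attribute 1 is a copy of attribute 0 (times 2),
# attribute 2 is independent of both.
rng = random.Random(1)
rows = []
for _ in range(400):
    x, z = rng.random(), rng.random()
    rows.append((x, 2 * x, z))
print(select_attributes(rows))
```

Attribute 1 carries no new information, so adding it leaves the PFD near 1 and it is dropped; the independent attribute 2 pushes the PFD towards 2 and is kept.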
[Figure: panels (a) Quarter-circle, (b) Line, (c) Spike, annotated with per-attribute PFD values (PFD=1, PFD~1, PFD=0)]
Notice: max variance would fail here
Currency dataset
[Table residue: rows Day1, Day2; columns USD, HKD, JPY; surviving values 1.62 (Day1) and 1.58 (Day2)]
self-similar?
[Figure: S(r) vs. log(radii) for two datasets. Currency dataset: slope 1.9807 (fd=1.98). Eigenfaces dataset: slope 4.2506 (fd=4.25)]
[Figure: curve plotted against #Attributes considered (vertical axis 0.8 to 1.8), with points labeled American Dollar, German Mark, British Pound, French Franc, Japanese Yen]
HKD: useless
>1.98 axes are needed
Road map
App. : traffic
disk traces: self-similar (also: web traffic; comm. errors; etc)
[Figure: #bytes vs. time]
[Figure: plot over octree levels; fd = 2.63]
More fractals:
stock prices (LYCOS) - random walks: 1.5
[Figure: stock price plotted over 1 year and over 2 years]
More fractals:
coast-lines: 1.1-1.2 (up to 1.58)
Examples: MG county
Montgomery County of MD (road endpoints)
Examples: LB county
Long Beach county of CA (road endpoints)
Bible - rank vs frequency (log-log)
[Figure: frequency vs. log(rank) on log-log axes; one point is labeled "the"]
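A rank-frequency plot like the one above can be checked numerically by fitting a line in log-log space. The sketch below is an illustration on made-up word counts, not on the Bible text; the function name and the synthetic data are assumptions. It needs at least two distinct words, otherwise the fit is undefined.

```python
import math
from collections import Counter


def rank_frequency_fit(words):
    """Least-squares line through (log10 rank, log10 frequency)."""
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return slope, my - slope * mx


# Synthetic text in which the word of rank i occurs 1200 / i times.
words = [f"w{i}" for i in range(1, 7) for _ in range(1200 // i)]
slope, intercept = rank_frequency_fit(words)
print(slope, intercept)
```

A straight line in log-log space corresponds to a power law; the synthetic counts are exactly proportional to 1/rank, so the fitted slope here is -1.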
Internet
Internet routers: how many neighbors
within h hops?
Internet topology
Internet routers: how many neighbors
within h hops? [SIGCOMM 99]
Reachability function: number of neighbors within r hops, vs r (log-log).
[Figure: log(#pairs) vs. log(hops); annotated 2.8]
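A minimal sketch of how the reachability counts can be computed, assuming an undirected graph stored as an adjacency dictionary: run a breadth-first search from every node and count the pairs of distinct nodes that are at most h hops apart. The function name and the tiny example graph are illustrative; real router graphs would need a more economical approach than one full search per node.

```python
from collections import deque


def hop_plot(adj, max_hops):
    """Return [#pairs within <= h hops for h = 0 .. max_hops]."""
    counts = [0] * (max_hops + 1)
    for src in adj:
        # Breadth-first search from src, cut off at max_hops.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if node != src:
                for h in range(d, max_hops + 1):
                    counts[h] += 1
    # Every unordered pair was counted once from each endpoint.
    return [c // 2 for c in counts]


# A path of four routers: 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(hop_plot(path, 3))
```

Plotting the logarithm of these counts against log(hops) gives the kind of curve shown on the slide.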
Scandinavian lakes
area vs complementary cumulative count (log-log axes)
[Figure: horizontal axis log(area)]
Olympic medals:
[Figure: log(# medals) vs. log rank, with linear fit y = -0.9676x + 2.3054, R^2 = 0.9458]
[Figure: plots with axis labels amplitude, day, and magnitude]
Overall Conclusions:
Find similar/interesting things in multimedia
databases
Indexing: feature extraction (GEMINI)
automatic feature extraction: FastMap
Relevance feedback: FALCON
Conclusions - contd
New tools for Data Mining: Fractals/power laws:
appear everywhere
lead to skewed distributions (not Gaussian, Poisson, uniformity, independence)
correlation integral for separability/cluster detection
PFD for dimensionality reduction
Conclusions - contd
can model bursty time sequences (buffering/prefetching)
selectivity estimation (how many neighbors within x km?)
dim. curse diagnosis (it's the fractal dim. that matters! [ICDE2000])
Resources:
Software and papers:
http://www.cs.cmu.edu/~christos
Fractal dimension (FracDim)
Separability (SIGMOD 2000)
Relevance feedback for query by content (FALCON, VLDB 2000)