Indexing and Data Mining in Multimedia Databases
Christos Faloutsos
CMU
www.cs.cmu.edu/~christos

Outline
Goal: Find similar / interesting things
Problem - Applications
Indexing - similarity search
New tools for Data Mining: Fractals
Conclusions
Resources
U. of Alberta

C. Faloutsos

Problem
Given a large collection of (multimedia)
records, find similar/interesting things, i.e.:
Allow fast, approximate queries, and
Find rules/patterns


Sample queries
Similarity search
Find pairs of branches with similar sales
patterns
Find medical cases similar to Smith's
Find pairs of sensor series that move in sync


Sample queries (contd)
Rule discovery
Clusters (of patients; of customers; ...)
Forecasting (total sales for next year?)
Outliers (e.g., fraud detection)


Outline
Goal: Find similar / interesting things
Problem - Applications
Indexing - similarity search
New tools for Data Mining: Fractals
Conclusions
Resources

Indexing - Multimedia
Problem:
given a set of (multimedia) objects,
find the ones similar to a desirable query
object (quickly!)


[Figure: three time sequences, $price vs. day (1 to 365)]
distance function: by expert


GEMINI - Pictorially
[Figure: each sequence S1, ..., Sn ($price vs. day, 1 to 365) is mapped to a point F(S1), ..., F(Sn) in feature space; axes: e.g., avg and e.g., std]
off-the-shelf S.A.M.s (Spatial Access Methods)

GEMINI
fast; correct (=no false dismissals)
used for

images (e.g., QBIC) (2x, 10x faster)
shapes (27x faster)
video (e.g., InforMedia)
time sequences ([Rafiei+Mendelzon], ++)
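
The filter-and-refine idea behind GEMINI can be sketched as follows for time sequences. This is only an illustration under stated assumptions: Euclidean distance, the two features from the picture (avg, std), and hypothetical function names. Scaling the features by sqrt(n) is an added assumption that makes the feature-space distance a lower bound of the true distance, which is what rules out false dismissals. A linear scan over the features stands in for the spatial access method.

```python
import math
import random

def features(seq):
    """Map a sequence to (avg, std), scaled by sqrt(n) so that the
    feature distance lower-bounds the Euclidean distance."""
    n = len(seq)
    avg = sum(seq) / n
    std = math.sqrt(sum((v - avg) ** 2 for v in seq) / n)  # population std
    return (math.sqrt(n) * avg, math.sqrt(n) * std)

def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(db, feats, query, eps):
    """Return indices of sequences within eps of query (filter, then refine)."""
    fq = features(query)
    # Filter: cheap test in feature space; a SAM would answer this step.
    candidates = [i for i, f in enumerate(feats) if euclid(f, fq) <= eps]
    # Refine: discard false alarms using the true distance.
    return [i for i in candidates if euclid(db[i], query) <= eps]

# Synthetic data, for illustration only.
random.seed(0)
db = [[random.gauss(0, 1) for _ in range(365)] for _ in range(200)]
feats = [features(s) for s in db]
query = [v + random.gauss(0, 0.5) for v in db[7]]  # noisy copy of sequence 7
hits = range_query(db, feats, query, eps=15.0)
brute = [i for i, s in enumerate(db) if euclid(s, query) <= 15.0]
```

The filter step may return false alarms, which the refine step removes; because of the lower bound it should not drop a true answer, so `hits` is expected to match the brute-force scan.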


Remaining issues
how to extract features automatically?
how to merge similarity scores from
different media?


Outline
Goal: Find similar / interesting things
Problem - Applications
Indexing - similarity search
Visualization: Fastmap
Relevance feedback: FALCON

Data Mining / Fractals


Conclusions

FastMap

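      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0

[Figure: the objects, known only through their pairwise distances (??), mapped to points; distances of ~1 within a group and ~100 between groups]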

FastMap
Multi-dimensional scaling (MDS) can do
that, but in O(N^2) time
We want a linear algorithm: FastMap
[SIGMOD95]
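
A compact sketch of the FastMap idea, applied to the five-object distance matrix from the earlier slide. It assumes the usual formulation: pick two far-apart pivot objects, project every object onto the line through them with the cosine law, then repeat on the residual distances. The function name, the pivot heuristic details, and the clamping of negative residuals are illustrative choices, not taken from the slides.

```python
import math

def fastmap(dist, k):
    """Map n objects to k-d points given a symmetric n x n distance matrix."""
    n = len(dist)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, dim):
        # Squared residual distance after removing the first `dim` coordinates.
        s = dist[i][j] ** 2
        for c in range(dim):
            s -= (coords[i][c] - coords[j][c]) ** 2
        return max(s, 0.0)  # clamp: non-Euclidean input can go slightly negative

    for dim in range(k):
        # Pivot heuristic: start anywhere, then jump to the farthest object twice.
        b = 0
        a = max(range(n), key=lambda i: d2(b, i, dim))
        b = max(range(n), key=lambda i: d2(a, i, dim))
        dab2 = d2(a, b, dim)
        if dab2 == 0.0:
            break  # all residual distances are zero; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine-law projection of object i onto the line through a and b.
            coords[i][dim] = (d2(a, i, dim) + dab2 - d2(b, i, dim)) / (2 * dab)
    return coords

dist = [
    [0, 1, 1, 100, 100],
    [1, 0, 1, 100, 100],
    [1, 1, 0, 100, 100],
    [100, 100, 100, 0, 1],
    [100, 100, 100, 1, 0],
]
coords = fastmap(dist, k=2)  # rows correspond to O1..O5
```

Each output dimension needs only a constant number of passes over the objects, which is where the linear cost in N comes from for a fixed k.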


Applications: time sequences


given n co-evolving time sequences
visualize them + find rules [ICDE00]
[Figure: exchange rate vs. time for DEM, JPY, HKD]

Applications - financial
currency exchange rates [ICDE00]
[Figure: scatter plot with points labeled FRF, GBP, JPY, HKD, USD(t), USD(t-5)]

Applications - financial
currency exchange rates [ICDE00]
[Figure: scatter plot with points labeled FRF, DEM, HKD, JPY, USD(t), USD(t-5), USD, GBP]

Outline
Goal: Find similar / interesting things
Problem - Applications
Indexing - similarity search
Visualization: Fastmap
Relevance feedback: FALCON

Data Mining / Fractals


Conclusions

Merging similarity scores


e.g., video: text, color, motion, audio
weights change with the query!

solution 1: user specifies weights


solution 2: user gives examples
and we learn what he/she wants: rel. feedback
(Rocchio, MARS, MindReader)
but: how about disjunctive queries?

DEMO

[Slide links: server, demo]

FALCON
[Figure: example sequences shaped like Vs and like inverted Vs]
Trader wants only unstable stocks

FALCON
[Figure: Vs and inverted Vs]
average: is flat!

Single query point methods


[Figure: Rocchio; positive examples (+) and a single query point (x); axes: avg, std]

Single query point methods


[Figure: three panels (Rocchio, MARS, MindReader), each showing positive examples (+) and a single query point (x)]
The averaging effect in action...

Main idea: FALCON Contours


[Wu+, vldb2000]
[Figure: positive examples (+) in feature space; axes: feature1 (e.g., avg) and feature2 (e.g., std)]

A: Aggregate Dissimilarity
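[Figure: candidate point x and good examples g1, g2 (+), with distances d(g_i, x)]
D_G(x) = ( (1/k) * sum_{i=1..k} d(g_i, x)^alpha )^(1/alpha)
alpha: parameter (~ -5 ~ soft OR)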
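
A small sketch of why a negative exponent behaves like a soft OR. It assumes the aggregate dissimilarity is a power mean of the distances from a candidate x to the good examples, in the spirit of [Wu+, vldb2000]; the function name, the Euclidean distance, and the toy points are illustrative assumptions.

```python
import math

def aggregate_dissimilarity(x, good, alpha=-5.0):
    """Power mean (exponent alpha) of the distances from x to each good example."""
    dists = [math.dist(x, g) for g in good]
    if alpha < 0 and any(d == 0.0 for d in dists):
        return 0.0  # x coincides with a good example
    mean_pow = sum(d ** alpha for d in dists) / len(dists)
    return mean_pow ** (1.0 / alpha)

good = [(0.0, 0.0), (10.0, 0.0)]  # two far-apart good examples (made up)
near_g1 = aggregate_dissimilarity((0.1, 0.0), good)
near_g2 = aggregate_dissimilarity((9.9, 0.0), good)
middle = aggregate_dissimilarity((5.0, 0.0), good)
```

With alpha = -5 the smallest distance dominates the mean, so a point close to either example scores as similar, while the midpoint between the two groups does not. A single averaged query point would rank the midpoint best instead.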



FALCON
converges quickly (~5 iterations)
good precision/recall
is fast (can use off-the-shelf spatial/metric
access methods)


Conclusions for indexing + visualization
GEMINI: fast indexing, exploiting off-the-shelf SAMs
FastMap: automatic feature extraction in
O(N) time
FALCON: relevance feedback for
disjunctive queries


Outline
Goal: Find similar / interesting things
Problem - Applications
Indexing - similarity search
New tools for Data Mining: Fractals
Conclusions
Resources

Data mining & fractals


Road map

Motivation problems / case study


Definition of fractals and power laws
Solutions to posed problems
More examples


Problem #1 - spatial d.m.


Galaxies (Sloan Digital Sky Survey w/ B. Nichol)
- spiral and elliptical galaxies
(stores & households; healthy & ill subjects)
- patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability??

Problem#2: dim. reduction


given attributes x1, ... xn
possibly, non-linearly correlated
drop the useless ones
(Q: why?
A: to avoid the dimensionality curse)
[Figure: scatter plot of mpg vs. engine size]

Answer:
Fractals / self-similarities / power laws


What is a fractal?
= self-similar point set, e.g., Sierpinski triangle:
[Figure: successive iterations of the Sierpinski triangle]
zero area;
infinite length!

Definitions (contd)
Paradox: infinite perimeter; zero area!
dimensionality: between 1 and 2
actually: log(3)/log(2) = 1.58 (long story)
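
One common way to shorten the long story is the similarity dimension, sketched here as a supplement that is not on the slides. The Sierpinski triangle consists of N = 3 copies of itself, each scaled by a factor s = 1/2, and the dimension D is the exponent that balances the two:

```latex
N \cdot s^{D} = 1
\quad\Longrightarrow\quad
3 \cdot \left(\tfrac{1}{2}\right)^{D} = 1
\quad\Longrightarrow\quad
D = \frac{\log 3}{\log 2} \approx 1.585
```

The same rule gives D = 1 for a line segment (2 copies at scale 1/2) and D = 2 for a filled square (4 copies at scale 1/2).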


Intrinsic (fractal) dimension


Q: fractal dimension
of a line?
E.g.: #cylinders; miles / gallon

x   y
5   1
4   2
3   3
2   4

Intrinsic (fractal) dimension


Q: fractal dimension
of a line?
A: nn ( <= r ) ~ r^1


Intrinsic (fractal) dimension


Q: fractal dimension
of a line?
A: nn ( <= r ) ~ r^1

Q: fd of a plane?
A: nn ( <= r ) ~ r^2
fd == slope of (log(nn) vs log(r))

Sierpinski triangle
== correlation integral
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
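
The correlation integral can be estimated directly by counting pairs. The sketch below samples the Sierpinski triangle with the chaos game, counts the pairs within each radius, and fits a least-squares line on log-log axes. The sampling method, sample size, and radii are assumptions chosen for illustration; with them the slope is expected to land near log(3)/log(2), though a finite sample will not reproduce it exactly.

```python
import bisect
import math
import random

def sierpinski_points(n, seed=0):
    """Sample n points on the Sierpinski triangle with the chaos game."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.0, 0.0  # start on a corner, so every iterate lies on the set
    pts = []
    for _ in range(n):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2  # jump halfway to a random corner
        pts.append((x, y))
    return pts

def correlation_dimension(pts, radii):
    """Slope of log(#pairs within <= r) vs. log(r), by least squares."""
    # Squared distances of all unordered pairs, sorted for fast counting.
    d2 = sorted((px - qx) ** 2 + (py - qy) ** 2
                for i, (px, py) in enumerate(pts)
                for (qx, qy) in pts[i + 1:])
    xs = [math.log(r) for r in radii]
    ys = [math.log(bisect.bisect_right(d2, r * r)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
            / sum((a - mx) ** 2 for a in xs))

pts = sierpinski_points(1000)
radii = [2.0 ** -k for k in range(2, 7)]  # 1/4 down to 1/64
slope = correlation_dimension(pts, radii)
```

The all-pairs count is quadratic in the number of points, which is fine for a sketch; it is not meant as an efficient implementation.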

Observations
self-similarity ->
<=> fractals
<=> scale-free
<=> power-laws (y=x^a, F=C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]

Road map

Motivation problems / case studies


Definition of fractals and power laws
Solutions to posed problems
More examples
Conclusions


Solution#1: spatial d.m.


Galaxies (Sloan Digital Sky Survey w/ B.
Nichol - BOPS plot - [sigmod2000])

clusters?
separable?
attraction/repulsion?
data scrubbing
duplicates?

Solution#1: spatial d.m.


[Figure: log(#pairs within <= r) vs. log(r) for ell-ell, spi-spi, and spi-ell pairs]
- 1.8 slope
- plateau!
- repulsion!

Solution#1: spatial d.m.

[w/ Seeger, Traina, Traina, SIGMOD00]


[Figure: log(#pairs within <= r) vs. log(r) for ell-ell, spi-spi, and spi-ell pairs]
- 1.8 slope
- plateau!
- repulsion!

spatial d.m.

[Figure: radii r1 and r2]
Heuristic on choosing # of clusters


Solution#1: spatial d.m.


[Figure: log(#pairs within <= r) vs. log(r) for ell-ell, spi-spi, and spi-ell pairs]
- 1.8 slope
- plateau!
- repulsion!!
- duplicates

Problem #2: Dim. reduction


[Figure: three 2-d point sets: (a) Quarter-circle, (b) Line, (c) Spike]

Solution:
drop the attributes that don't increase the
partial f.d. (PFD)
dfn: PFD of attribute set A is the f.d. of the
projected cloud of points [w/ Traina, Traina,
Wu, SBBD00]
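
The rule above can be illustrated with a simplified greedy forward pass. This is not the algorithm of [SBBD00]: the pass order, the `min_gain` threshold, and the synthetic data are assumptions made for the sketch. The fractal dimension of each projection is estimated as the log-log slope of the pair-count plot.

```python
import bisect
import math
import random

def fractal_dim(rows, attrs, radii):
    """Correlation fractal dimension of rows projected onto attrs."""
    pts = [[row[a] for a in attrs] for row in rows]
    d2 = sorted(sum((u - v) ** 2 for u, v in zip(p, q))
                for i, p in enumerate(pts) for q in pts[i + 1:])
    xs = [math.log(r) for r in radii]
    # max(1, ...) guards against log(0) when no pair falls within a radius.
    ys = [math.log(max(1, bisect.bisect_right(d2, r * r))) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
            / sum((a - mx) ** 2 for a in xs))

def select_attributes(rows, n_attrs, radii, min_gain=0.3):
    """Greedy forward pass: keep an attribute only if it raises the PFD."""
    kept, pfd = [], 0.0
    for a in range(n_attrs):
        new_pfd = fractal_dim(rows, kept + [a], radii)
        if new_pfd - pfd >= min_gain:
            kept, pfd = kept + [a], new_pfd
    return kept

# Synthetic data: attribute 1 is a non-linear function of attribute 0,
# attribute 2 is independent of both.
rng = random.Random(1)
rows = []
for _ in range(500):
    x = rng.random()
    rows.append((x, x * x, rng.random()))
radii = [0.02, 0.04, 0.08, 0.16]
kept = select_attributes(rows, 3, radii)
```

Here the attribute y = x^2 adds almost nothing to the partial fractal dimension and is dropped, even though it is not linearly related to x.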


Problem #2: dim. reduction


global FD=1
[Figure: (a) Quarter-circle, (b) Line, (c) Spike; axes annotated with partial fractal dimensions: PFD~1 for (a), PFD=1 for (b), PFD=0 on one axis of (c)]

Problem #2: dim. reduction


global FD=1
[Figure: (a) Quarter-circle, (b) Line, (c) Spike; axes annotated with partial fractal dimensions: PFD~1 for (a), PFD=1 for (b), PFD=0 on one axis of (c)]
Notice: max variance would fail here

Problem #2: dim. reduction


global FD=1
[Figure: (a) Quarter-circle, (b) Line, (c) Spike; axes annotated with partial fractal dimensions: PFD~1 for (a), PFD=1 for (b), PFD=0 on one axis of (c)]
Notice: SVD would fail here

Currency dataset
[Table: one row per day, one column per currency (USD, HKD, JPY); USD values shown: Day1 1.62, Day2 1.58]

self-similar?
[Figure: two log-log plots of S(r) vs. log(radii)]
Currency dataset: slp=1.9807 (fd=1.98)
Eigenfaces dataset: slp=4.2506 (fd=4.25)

FDR on the currency dataset


[Figure: plot against #Attributes considered (x axis, 1 to 5; y axis, 0.8 to 2), with points labeled American Dollar, German Mark, British Pound, French Franc, Japanese Yen, and a reference line labeled "if unif + indep."]

FDR on the currency dataset


[Figure: plot against #Attributes considered (x axis, 1 to 5; y axis, 0.8 to 2), with points labeled American Dollar, German Mark, British Pound, French Franc, Japanese Yen, and a reference line labeled "if unif + indep."]
HKD: useless
>1.98 axes are needed

Road map

Motivation problems / case studies


Definition of fractals and power laws
Solutions to posed problems
More examples
Conclusions


App. : traffic
disk traces: self-similar (also: web traffic; comm.
errors; etc)
[Figure: #bytes vs. time]

More apps: Brain scans


Oct-trees; brain-scans
[Figure: log(#octants) vs. octree levels; slope 2.63 = fd]

More fractals:
stock prices (LYCOS) - random walks: 1.5
[Figure: LYCOS stock price over 1 year and over 2 years]

More fractals:
coast-lines: 1.1-1.2 (up to 1.58)


Examples: MG county
Montgomery County of MD (road end-points)

Examples: LB county
Long Beach county of CA (road end-points)

More power laws: Zipf's law
Bible - rank vs
frequency (log-log)
[Figure: log(freq) vs. log(rank); labeled points include the words "the" and "a"]
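
A power law shows up as a straight line on log-log axes, so its exponent can be read off as the slope of a least-squares line. The sketch below uses synthetic, made-up counts that follow freq = 1000 / rank exactly; it does not use the Bible data from the slide, and the function name is hypothetical.

```python
import math

def loglog_fit(ranks, freqs):
    """Least-squares line through (log10 rank, log10 freq).

    Returns (slope, intercept).
    """
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Synthetic counts: an exact Zipf-like law with exponent -1.
ranks = list(range(1, 51))
freqs = [1000.0 / r for r in ranks]
slope, intercept = loglog_fit(ranks, freqs)
```

On real rank-frequency data the points scatter around the line, and the fitted slope is only an estimate of the exponent.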

More power laws


Freq. distr. of first names; last names
(Mandelbrot)


Internet
Internet routers: how many neighbors
within h hops?


Internet topology
Internet routers: how many neighbors
within h hops? [SIGCOMM 99]
Reachability function:
number of neighbors
within r hops, vs r (log-log).
[Figure: log(#pairs) vs. log(hops) for Mbone routers, 1995; slope 2.8]
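
The quantity on the vertical axis can be computed with a breadth-first search from every node. The sketch below counts unordered pairs of distinct nodes at most h hops apart in an undirected graph; the exact counting convention of the [SIGCOMM 99] paper may differ, and the four-router chain is a hypothetical toy topology.

```python
from collections import deque

def pairs_within_hops(adj, max_hops):
    """counts[h] = number of unordered node pairs at most h hops apart."""
    counts = [0] * (max_hops + 1)
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            if dist[u] == max_hops:
                continue  # do not expand beyond the hop limit
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if node != src:
                for h in range(d, max_hops + 1):
                    counts[h] += 1
    return [c // 2 for c in counts]  # each pair was seen from both ends

# Hypothetical toy topology: a chain of routers 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
counts = pairs_within_hops(adj, 3)
```

Plotting log(counts[h]) against log(h) for a real topology gives the kind of curve whose slope is reported on the slide.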


More power laws: areas


Korcak's law
([icde99], w/ Proietti)
Scandinavian lakes

More power laws: areas


Korcak's law
Scandinavian lakes
area vs
complementary
cumulative count
(log-log axes)
[Figure: log(count( >= area)) vs. log(area)]

Olympic medals:
[Figure: log(# medals) vs. log rank, with a linear fit]
y = -0.9676x + 2.3054
R^2 = 0.9458

More power laws


Energy of earthquakes (Gutenberg-Richter
law) [simscience.org]
[Figure: amplitude vs. day; log(count) vs. magnitude]

Even more power laws:

Income distribution (Pareto's law);
sales distributions;
duration of UNIX jobs
Distribution of UNIX file sizes
publication counts (Lotka's law)

Even more power laws:


web hit frequencies ([Huberman])
hyper-link distribution [Barabasi], ++


Overall Conclusions:
Find similar/interesting things in multimedia
databases
Indexing: feature extraction (GEMINI)
automatic feature extraction: FastMap
Relevance feedback: FALCON


Conclusions - contd
New tools for Data Mining: Fractals/power
laws:
appear everywhere
lead to skewed distributions (not Gaussian,
Poisson, uniform, or independent)
correlation integral for separability/cluster
detection
PFD for dimensionality reduction

Conclusions - contd
can model bursty time sequences
(buffering/prefetching)
selectivity estimation (how many neighbors
within x km?)
dim. curse diagnosis (it's the fractal dim. that
matters! [ICDE2000])

Resources:
Software and papers:

http://www.cs.cmu.edu/~christos
Fractal dimension (FracDim)
Separability (sigmod 2000)
Relevance feedback for query by content
(FALCON vldb 2000)
