Richard Weber
Francisco Cisternas
(frcister@ing.uchile.cl)
Departamento de Ingeniería Industrial
Universidad de Chile
THE KDD PROCESS
KNOWLEDGE DISCOVERY IN DATABASES
Motivation (1/3)
Problem: 9 quality-of-life indices for 329 US cities (MATLAB example).
What do we do? Let's start with a simple exploration.
Boxplot
[Figure: boxplots of the nine indices (climate, housing, health, crime, transportation, education, arts, recreation, economics) by column number; the values axis runs on a ×10^4 scale, with arts far larger than the other indices.]
Motivation (2/3)
Let's look at the relationships between the indices.
Motivation (3/3)
The Solution
Basic Idea
[Figure: the original axes x1, x2 are rotated to new axes z1, z2 aligned with the directions of greatest variance.]
Following the Idea
Possible solution: let a1 be the weight vector of the projection, of dimension p×1 (unknown for now). Then the first component we are looking for can be written as:

z1 = X a1

and its variance (with X centered) is

(1/n) z1^T z1 = (1/n) a1^T X^T X a1 = a1^T S a1

where S is the variance-covariance matrix.
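This identity is easy to verify numerically. A minimal sketch with NumPy, where a random matrix merely stands in for the 329×9 cities data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 329, 9                      # 329 cities, 9 indices, as in the slides
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)             # the formula assumes centered data

S = (X.T @ X) / n                  # variance-covariance matrix S
a1 = rng.normal(size=p)
a1 /= np.linalg.norm(a1)           # unit-norm weight vector

z1 = X @ a1                        # projected component z1 = X a1
var_z1 = (z1 @ z1) / n             # (1/n) z1^T z1
assert np.isclose(var_z1, a1 @ S @ a1)
```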
Then Let's Maximize the Variance
Too easy: just scale a1 up and the variance grows without bound.
Solution: constrain the weights to unit norm,

a1^T a1 = 1
Now we have a constrained optimization problem; let's use a Lagrange multiplier λ:

M = a1^T S a1 - λ (a1^T a1 - 1)

We maximize by differentiating:

∂M/∂a1 = 2 S a1 - 2 λ a1 = 0

that is,

(S - λI) a1 = 0
S a1 = λ a1
Solution
S a1 = λ a1 is an eigenvalue problem: the eigenvector associated with the largest eigenvalue gives the first component, the next largest gives the second, and so on.
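In code, the whole derivation reduces to one eigendecomposition of S. A sketch on synthetic data:

```python
import numpy as np

def pca_eig(X):
    """Principal components via the eigenproblem S a = lambda a."""
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)          # variance-covariance matrix
    lam, A = np.linalg.eigh(S)            # eigh: S is symmetric
    order = np.argsort(lam)[::-1]         # largest eigenvalue first
    return lam[order], A[:, order]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.diag([5.0, 1.0, 0.2])
lam, A = pca_eig(X)
z1 = (X - X.mean(axis=0)) @ A[:, 0]       # first principal component scores
# the top eigenvalue equals the variance of the first component
assert np.isclose(z1.var(ddof=1), lam[0])
```

Each eigenvalue λ_i is the variance captured by component i, which is why sorting them in decreasing order reproduces the "first component, second component, and so on" ordering above.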
Back to our problem. Loadings on the raw (unstandardized) data:

Index          | a1      | a2      | a3      | a4
---------------|---------|---------|---------|--------
Climate        |  0.0064 | -0.0155 |  0.0067 |  0.0263
Housing        |  0.2691 | -0.9372 |  0.0826 |  0.1778
Health         |  0.1783 |  0.0205 | -0.0278 |  0.0266
Crime          |  0.0281 |  0.0109 | -0.0376 | -0.0990
Transportation |  0.1493 | -0.0188 | -0.9715 |  0.0384
Education      |  0.0252 |  0.0014 | -0.0415 | -0.0216
Arts           |  0.9309 |  0.2823 |  0.1510 | -0.0278
Recreation     |  0.0698 | -0.1038 | -0.1496 | -0.0690
Economics      |  0.0251 | -0.1734 | -0.0127 | -0.9745

Each component is dominated by a single index (a1 by arts, a2 by housing): the indices with the largest raw scales.
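The single dominant loading per component is a scale effect, not structure: with unstandardized data, the variable with the largest raw variance owns the first component. A sketch with synthetic data, where one column is given a much larger scale to stand in for arts:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 329
# synthetic stand-in for the cities data: column 6 ("arts") has a huge scale
X = rng.normal(size=(n, 9)) * np.array([1, 2, 4, 1, 3, 1, 200, 2, 1.5])

S = np.cov(X, rowvar=False)
lam, A = np.linalg.eigh(S)            # eigh returns ascending eigenvalues
a1 = A[:, -1]                         # loading vector of the top component
assert abs(a1[6]) > 0.99              # PC1 is essentially the big column alone
```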
Solution
More Plots
[Figure: Pareto chart of the variance explained by each principal component, and scatter of the first two principal component scores (×10^4 scale); New York, NY sits far from the rest, with Philadelphia, PA-NJ, Chicago, IL, San Francisco, CA, Honolulu, HI, Norwalk, CT and Stamford, CT also labeled.]
Loadings on the standardized (z-scored) data:

Index          | a1      | a2      | a3      | a4
---------------|---------|---------|---------|--------
Climate        |  0.2064 |  0.2178 | -0.6900 |  0.1373
Housing        |  0.3565 |  0.2506 | -0.2082 |  0.5118
Health         |  0.4602 | -0.2995 | -0.0073 |  0.0147
Crime          |  0.2813 |  0.3553 |  0.1851 | -0.5391
Transportation |  0.3512 | -0.1796 |  0.1464 | -0.3029
Education      |  0.2753 | -0.4834 |  0.2297 |  0.3354
Arts           |  0.4631 | -0.1948 | -0.0265 | -0.1011
Recreation     |  0.3279 |  0.3845 | -0.0509 | -0.1898
Economics      |  0.1354 |  0.4713 |  0.6073 |  0.4218
[Figure: Pareto chart of the variance explained by each principal component of the standardized data, and scatter of the first two principal component scores; labeled cities include New York, NY, Washington, DC-MD-VA, Boston, MA, San Francisco, CA, Chicago, IL, Seattle, WA, Miami-Hialeah, FL, Anaheim-Santa Ana, CA, Baltimore, MD, Pittsburgh, PA and Cumberland, MD-WV.]
a1 = (0.278, 0.286, 0.265, 0.318, 0.208, 0.293, 0.176, 0.202, 0.339, 0.329, 0.276, 0.248, 0.329)
a2 = (0.280, 0.396, 0.392, 0.325, -0.288, -0.259, -0.189, -0.315, 0.163, -0.050, -0.169, -0.329, -0.232)
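Balanced loadings like these are what standardization produces: z-scoring gives every variable unit variance, so no single index can dominate through scale alone. A sketch of the preprocessing step (the scales below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(329, 9)) * np.array([1, 2, 4, 1, 3, 1, 200, 2, 1.5])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # z-score each column
R = np.cov(Z, rowvar=False)     # covariance of Z = correlation matrix of X
assert np.allclose(np.diag(R), 1.0)   # every index now has variance 1
assert np.allclose(R, np.corrcoef(X, rowvar=False), atol=1e-12)
```

Running PCA on Z is equivalent to eigendecomposing the correlation matrix R instead of the covariance matrix S.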
Do not confuse PCA with Factor Analysis.
Example: the Cocktail Party Problem
[Figure: four microphone signals (Microphone 1-4) are fed into Independent Component Analysis, which recovers four separated signals (Component 1-4).]
Applications of ICA
http://www.cis.hut.fi/projects/ica/
Application in Santiago
Monitoring stations = microphones
Sources of contamination = persons
Measurements → Independent Component Analysis
Contaminants:
CO
SO2
NO/NO2
O3
MP10
Others
Available data
[Table: measurement availability by year (AÑO, 1997-2002) for CO, MP10, O3, SO2, MP25 and NO2; the per-cell entries were lost in extraction.]
[Figure: hourly series (x-axis: Hora, ticks 11-23; y-axis up to 250) comparing the curves labeled "M8 Centrado" (centered) and "M8 Pasado" (lagged).]
Forecasting
Independent components + external variables → forecasting techniques (neural networks, regression, ARIMA, etc.) → forecast for t days
ICA Model
How?
[Figure: source densities f(s1) and f(s2).]
Definition 1 (general definition): ICA of the random vector x consists of finding a linear transform s = Wx so that the components si are as independent as possible, in the sense of maximizing some function F(s1, ..., sm) that measures independence.
ICA Model
Assumptions:
The sources are independent.
At most one source is Gaussian.
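Under exactly these two assumptions, the cocktail-party separation can be sketched from scratch. Everything below is made up for illustration (sources, mixing matrix, and a bare-bones FastICA-style loop with a tanh nonlinearity); it is a sketch, not the implementation behind the linked demo:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 2000
# two independent, non-Gaussian sources (the "speakers")
s = np.c_[np.sign(np.sin(2 * np.pi * np.arange(T) / 50)),  # square wave
          rng.uniform(-1.0, 1.0, T)]                       # uniform noise
A = np.array([[1.0, 0.5],
              [0.6, 1.0]])           # unknown mixing (the "room")
x = s @ A.T                          # two microphone recordings

# whiten the mixtures: zero mean, identity covariance
xc = x - x.mean(axis=0)
d, E = np.linalg.eigh(np.cov(xc, rowvar=False))
z = xc @ E @ np.diag(d ** -0.5) @ E.T

# FastICA fixed-point iteration, one component at a time (deflation)
W = np.zeros((2, 2))
for i in range(2):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        g = np.tanh(z @ w)
        w = (z * g[:, None]).mean(axis=0) - (1 - g ** 2).mean() * w
        w -= W[:i].T @ (W[:i] @ w)   # keep orthogonal to found components
        w /= np.linalg.norm(w)
    W[i] = w

s_hat = z @ W.T                      # recovered sources (up to sign/order)
# each recovered component should match one true source almost perfectly
C = np.corrcoef(np.c_[s_hat, s].T)[:2, 2:]
assert np.all(np.abs(C).max(axis=1) > 0.9)
```

Note the two assumptions at work: independence is what the tanh-based contrast maximizes, and if both sources were Gaussian the whitened mixture would be rotationally symmetric and no rotation W could be identified.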
ICA
http://www.cis.hut.fi/projects/ica/cocktail/cocktail_en.cgi