Vous êtes sur la page 1sur 7

STATISTICS 407

APPLIED MULTIVARIATE ANALYSIS

TOPICS
Principal Component Analysis (PCA): Reduce the ________________, summarize the sources of variation in the data, transform the data into a new data set where the variables are uncorrelated. Factor Analysis (FA): When its not possible to _________________ of interest directly, measure whats possible and create the variables of interest from the observed data. Discriminant Analysis (classication, supervised learning): Build a rule to _____________ or group id from observed training data.

TOPICS
Cluster Analysis (unsupervised learning): Find similar groups of individuals, or ______________________ based on their similarity. M(ultivariate)ANOVA: Infer information about the ____________________ based on the sample means. Multivariate Regression and Canonical Correlation Analysis: ______________________ variables, and explore the association between the set of dependent variables and a set of explanatory variables. PLUS how to plot multivariate data, ____________.

A TAXONOMY OF TECHNIQUES
Variable-directed: Quantifying the relationships between variables, eg ___________________________ _______________________________________________ _______________________________________________. Individual-directed: Summarizing relationships that exist between individuals, or experimental units, eg _______________________________________________.

MULTIVARIATE DATA
Example, nutritional information of chocolates (100g equivalent ) from around the world:

SOME MATH...
Matrix Notation Data (n observations, p variables) has matrix form as follows:

X = [X1 X2 . . . Xp]

X11 X12 X21 X22 . . . . . . Xn1 Xn2

. . . X1p . . . X2p . ... . . . . . Xnp np

Xij is the element in the ith row and j th column, that is ith case and j th variable.
5

SOME MATH...
Mean Vector, Variance-Covariance/Correlation Matrices 1 X . = X . . p X

S=

S11 S12 . . . S1p S21 S22 . . . S2p . . . ... . . . . . . Sp1 Sp2 . . . Spp

How do you calculate S11, S12, r12?

R=

1 r12 . . . r1p r21 1 . . . r2p . . . . . . . ... . . rp1 rp2 . . . 1


8

MULTIVARIATE DATA
For the chocolates data example, wed calculate the mean vector, var-cov matrix, and correlation matrix for the _________________________. For the __________ variables wed report counts, and proportions. The mean vector, var-cov and corr matrix might also be reported _____________ for each category of the categorical variables.

MULTIVARIATE DATA
Chocolates data: n=10, p=6
Whats the mean Calories? variance of Fiber? correlation between Sugars and Calories? Which variables have negative covariance? Which variable has the largest variance? Standard deviation of Chol?
2299.6 229.3 156.1 S= 296.3 48.8 425.1 R= 1.000 0.712 0.303 0.130 0.380 0.632 551.18 36.31 = 9.48 X 56.88 6.27 40.01 229.26 45.14 16.62 166.50 5.97 66.28 0.712 1.000 0.230 0.520 0.332 0.703 156.05 16.62 115.50 108.79 5.65 60.36 0.303 0.230 1.000 0.212 0.197 0.400 48.78 5.97 5.65 61.30 7.16 23.60 0.380 0.332 0.197 0.480 1.000 0.629 425.1 66.3 60.4 90.8 23.6 196.9 0.632 0.703 0.400 0.136 0.629 1.000

296.3 166.5 108.8 2275.2 61.3 90.8 0.130 0.520 0.212 1.000 0.480 0.136

MULTIVARIATE DATA
Chocolates data - summary of categorical variables 1 using counts. These are the ____________, or even the dependent variables that might be used for classifying observations.

10

MORE MATH...
Linear combinations, and projections:
If 1 . = . . p p1 1 X11 + 2 X21 + . . . + p Xp1 . . X = . 1 X1n + 2 X2n + . . . + p Xpn n1

then

2 + . . . + 2 = 1 then is a projection vector, and X is a projection If 1 p of the data.

Used in ______________________________________.

11

MORE MATH...
Distance measures:
For two points (rows of the data matrix) A = (A1 A2 . . . Ap ) and B = (B1 B2 . . . Bp ), Euclidean distance is dened as d(A, B) = (A B)(A B) = (A1 B1 )2 + . . . + (Ap Bp )2 and statistical distance (or Mahalobis distance) is dened as d(A, B) = (A B)S1 (A B) .

Generally any distance measure can be dened, but it must satisfy (1) d(A, B) = d(B, A), (2) d(A, B) > 0, if A = B, (3) d(A, B) = 0, if A = B, (4) d(A, B) d(A, C) + d(C, B), for any intermediate point C.

Used in ______________________.

12

MORE MATH...
Scaling:
The standardized data matrix is z11 z12 z21 z22 Z= . . . . . . zn 1 zn 2
x x

...

z1p z2 p . . .

. . . znp

where zij = ijsj j , i = 1, ..., n; j = 1, ..., p. The standardized data has mean vector all zeros, and variances all equal to 1.

Different from ____________! Doesnt change the correlation between variables. _____________ does remove correlation - more to come on this.
13

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/ licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

14

Vous aimerez peut-être aussi