Intro Class PDF

STATISTICS 407
APPLIED MULTIVARIATE ANALYSIS
TOPICS
Principal Component Analysis (PCA): Reduce the ________________, summarize the sources of variation in the data, and transform the data into a new data set where the variables are uncorrelated.
Factor Analysis (FA): When it's not possible to _________________ of interest directly, measure what's possible and create the variables of interest from the observed data.
Discriminant Analysis (classification, supervised learning): Build a rule to _____________ or group id from observed training data.
TOPICS
Cluster Analysis (unsupervised learning): Find similar groups of individuals, or ______________________ based on their similarity.
M(ultivariate)ANOVA: Infer information about the ____________________ based on the sample means.
Multivariate Regression and Canonical Correlation Analysis: ______________________ variables, and explore the association between the set of dependent variables and a set of explanatory variables.
PLUS how to plot multivariate data, ____________.
A TAXONOMY OF TECHNIQUES
Variable-directed: Quantifying the relationships between variables, e.g. ___________________________ _______________________________________________ _______________________________________________.
Individual-directed: Summarizing relationships that exist between individuals, or experimental units, e.g. _______________________________________________.
MULTIVARIATE DATA
Example: nutritional information of chocolates (100 g equivalent) from around the world:
SOME MATH...
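Matrix Notation
Data (n observations, p variables) has matrix form as follows:
$$
X = [X_1 \; X_2 \; \cdots \; X_p] =
\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1p} \\
X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{np}
\end{bmatrix}_{n \times p}
$$
$X_{ij}$ is the element in the $i$th row and $j$th column, that is, the $i$th case and $j$th variable.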
SOME MATH...
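Mean Vector, Variance-Covariance/Correlation Matrices
$$
\bar{X} = \begin{bmatrix} \bar{X}_1 \\ \vdots \\ \bar{X}_p \end{bmatrix}
$$
$$
S = \begin{bmatrix}
S_{11} & S_{12} & \cdots & S_{1p} \\
S_{21} & S_{22} & \cdots & S_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
S_{p1} & S_{p2} & \cdots & S_{pp}
\end{bmatrix}
$$
How do you calculate $S_{11}$, $S_{12}$, $r_{12}$?
$$
R = \begin{bmatrix}
1 & r_{12} & \cdots & r_{1p} \\
r_{21} & 1 & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{bmatrix}
$$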
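One way to see how the entries of the mean vector, S and R arise is to compute them by hand on a small array. The sketch below uses made-up toy numbers (not the chocolates data) and assumes the usual n - 1 divisor for the sample covariance, which the slides do not state explicitly.

```python
import numpy as np

# Hypothetical toy data: n = 4 observations (rows), p = 3 variables (columns).
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 1.0],
    [4.0, 3.0, 3.0],
])
n, p = X.shape

xbar = X.mean(axis=0)        # mean vector, one entry per variable
Xc = X - xbar                # centred data: subtract each column's mean
S = Xc.T @ Xc / (n - 1)      # variance-covariance matrix (assumes n - 1 divisor)
sd = np.sqrt(np.diag(S))     # standard deviations, s_j = sqrt(S_jj)
R = S / np.outer(sd, sd)     # correlations, r_jk = S_jk / (s_j * s_k)

# Cross-check the hand computation against NumPy's built-ins.
assert np.allclose(S, np.cov(X, rowvar=False))
assert np.allclose(R, np.corrcoef(X, rowvar=False))
```

Here `S[0, 0]` plays the role of S11, `S[0, 1]` of S12, and `R[0, 1]` of r12.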

MULTIVARIATE DATA
For the chocolates data example, we'd calculate the mean vector, var-cov matrix, and correlation matrix for the _________________________. For the __________ variables we'd report counts and proportions. The mean vector, var-cov and corr matrix might also be reported _____________ for each category of the categorical variables.
MULTIVARIATE DATA
Chocolates data: n=10, p=6
What's the mean Calories? The variance of Fiber? The correlation between Sugars and Calories? Which variables have negative covariance? Which variable has the largest variance? What is the standard deviation of Chol?
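$$
\bar{X} = \begin{bmatrix} 551.18 \\ 36.31 \\ 9.48 \\ 56.88 \\ 6.27 \\ 40.01 \end{bmatrix}
$$
$$
S = \begin{bmatrix}
2299.6 & 229.26 & 156.05 & 296.3 & 48.78 & 425.1 \\
229.3 & 45.14 & 16.62 & 166.5 & 5.97 & 66.3 \\
156.1 & 16.62 & 115.50 & 108.8 & 5.65 & 60.4 \\
296.3 & 166.50 & 108.79 & 2275.2 & 61.30 & 90.8 \\
48.8 & 5.97 & 5.65 & 61.3 & 7.16 & 23.6 \\
425.1 & 66.28 & 60.36 & 90.8 & 23.60 & 196.9
\end{bmatrix}
$$
$$
R = \begin{bmatrix}
1.000 & 0.712 & 0.303 & 0.130 & 0.380 & 0.632 \\
0.712 & 1.000 & 0.230 & 0.520 & 0.332 & 0.703 \\
0.303 & 0.230 & 1.000 & 0.212 & 0.197 & 0.400 \\
0.130 & 0.520 & 0.212 & 1.000 & 0.480 & 0.136 \\
0.380 & 0.332 & 0.197 & 0.480 & 1.000 & 0.629 \\
0.632 & 0.703 & 0.400 & 0.136 & 0.629 & 1.000
\end{bmatrix}
$$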
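The correlation matrix can be recovered from the variance-covariance matrix alone, which gives a quick check on the numbers reported for the chocolates data. The sketch below types in the reported S (using the more precise of the two roundings shown for each off-diagonal entry) and rescales it by the standard deviations; variables are referred to by position only.

```python
import numpy as np

# Variance-covariance matrix as reported for the chocolates data (p = 6).
S = np.array([
    [2299.6, 229.26, 156.05,  296.3, 48.78, 425.1],
    [229.26,  45.14,  16.62, 166.50,  5.97, 66.28],
    [156.05,  16.62, 115.50, 108.79,  5.65, 60.36],
    [ 296.3, 166.50, 108.79, 2275.2, 61.30,  90.8],
    [ 48.78,   5.97,   5.65,  61.30,  7.16, 23.60],
    [ 425.1,  66.28,  60.36,   90.8, 23.60, 196.9],
])

sd = np.sqrt(np.diag(S))     # standard deviation of each variable
R = S / np.outer(sd, sd)     # r_jk = S_jk / (s_j * s_k)

print(np.round(sd, 2))       # one standard deviation per variable
print(np.round(R, 3))        # compare with the reported correlation matrix
```

The computed entries should agree with the reported R up to rounding in the last digit; for instance `R[0, 1]` comes out close to 0.712.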
MULTIVARIATE DATA
Chocolates data - summary of categorical variables using counts. These are the ____________, or even the dependent variables that might be used for classifying observations.
MORE MATH...
Linear combinations, and projections:
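If
$$
\alpha = \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_p \end{bmatrix}_{p \times 1}
$$
then
$$
X\alpha = \begin{bmatrix}
\alpha_1 X_{11} + \alpha_2 X_{12} + \cdots + \alpha_p X_{1p} \\
\vdots \\
\alpha_1 X_{n1} + \alpha_2 X_{n2} + \cdots + \alpha_p X_{np}
\end{bmatrix}_{n \times 1}
$$
If $\alpha_1^2 + \cdots + \alpha_p^2 = 1$ then $\alpha$ is a projection vector, and $X\alpha$ is a projection of the data.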
Used in ______________________________________.
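A linear combination of the columns is a single matrix-vector product. The sketch below uses made-up toy data and a coefficient vector named `alpha` (a name chosen for this example), rescaled to unit length so that it qualifies as a projection vector.

```python
import numpy as np

# Hypothetical toy data: n = 4 observations, p = 3 variables.
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 1.0],
    [4.0, 3.0, 3.0],
])

a = np.array([1.0, 1.0, 0.0])        # arbitrary coefficients, one per variable
alpha = a / np.linalg.norm(a)        # rescale so the squared entries sum to 1

proj = X @ alpha                     # n x 1 projection: one value per observation

# Each projected value is the weighted sum of that row's p measurements.
assert np.isclose(proj[0], np.sum(alpha * X[0]))
assert np.isclose(np.sum(alpha ** 2), 1.0)
```

Projecting onto two or more orthonormal vectors works the same way, with the vectors stacked as columns of a p x d matrix.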
MORE MATH...
Distance measures:
For two points (rows of the data matrix) $A = (A_1 \; A_2 \; \ldots \; A_p)$ and $B = (B_1 \; B_2 \; \ldots \; B_p)$, Euclidean distance is defined as
$$
d(A, B) = \sqrt{(A - B)(A - B)^\top} = \sqrt{(A_1 - B_1)^2 + \cdots + (A_p - B_p)^2}
$$
and statistical distance (or Mahalanobis distance) is defined as
$$
d(A, B) = \sqrt{(A - B) S^{-1} (A - B)^\top}.
$$
Generally any distance measure can be defined, but it must satisfy
(1) $d(A, B) = d(B, A)$,
(2) $d(A, B) > 0$ if $A \neq B$,
(3) $d(A, B) = 0$ if $A = B$,
(4) $d(A, B) \leq d(A, C) + d(C, B)$ for any intermediate point $C$.
Used in ______________________.
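Both distances can be written in a few lines. The sketch below is an illustration on made-up toy data; the function names `euclidean` and `mahalanobis` are chosen for this example, and S is estimated with the n - 1 divisor, which is an assumption.

```python
import numpy as np

def euclidean(A, B):
    """Square root of the sum of squared coordinate differences."""
    diff = A - B
    return np.sqrt(diff @ diff)

def mahalanobis(A, B, S):
    """Statistical distance: differences weighted by the inverse of S."""
    diff = A - B
    # solve(S, diff) computes S^{-1} diff without forming the inverse explicitly.
    return np.sqrt(diff @ np.linalg.solve(S, diff))

# Hypothetical toy data: n = 5 observations, p = 2 variables.
X = np.array([[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0], [10.0, 8.0]])
S = np.cov(X, rowvar=False)          # sample variance-covariance matrix

A, B, C = X[0], X[3], X[1]
d_euc = euclidean(A, B)
d_mah = mahalanobis(A, B, S)

# The required properties: symmetry and the triangle inequality.
assert np.isclose(d_mah, mahalanobis(B, A, S))
assert d_mah <= mahalanobis(A, C, S) + mahalanobis(C, B, S)
```

With S replaced by the identity matrix, the statistical distance reduces to the Euclidean distance.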
MORE MATH...
Scaling:
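The standardized data matrix is
$$
Z = \begin{bmatrix}
z_{11} & z_{12} & \cdots & z_{1p} \\
z_{21} & z_{22} & \cdots & z_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
z_{n1} & z_{n2} & \cdots & z_{np}
\end{bmatrix}
$$
where $z_{ij} = \dfrac{x_{ij} - \bar{x}_j}{s_j}$, $i = 1, \ldots, n$; $j = 1, \ldots, p$. The standardized data has mean vector all zeros, and variances all equal to 1.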
Different from ____________! Standardizing doesn't change the correlation between variables. _____________ does remove correlation - more to come on this.
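The claims about standardized data can be verified numerically. The sketch below standardizes made-up toy data column by column, assuming $s_j$ is the sample standard deviation with the n - 1 divisor, and checks that the means become 0, the variances become 1, and the correlation matrix is unchanged.

```python
import numpy as np

# Hypothetical toy data: n = 4 observations, p = 3 variables.
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 1.0],
    [4.0, 3.0, 3.0],
])

xbar = X.mean(axis=0)            # column means
s = X.std(axis=0, ddof=1)        # column standard deviations (n - 1 divisor)
Z = (X - xbar) / s               # z_ij = (x_ij - xbar_j) / s_j

assert np.allclose(Z.mean(axis=0), 0.0)           # mean vector is all zeros
assert np.allclose(Z.var(axis=0, ddof=1), 1.0)    # variances all equal 1
# Correlations between variables are the same before and after scaling.
assert np.allclose(np.corrcoef(Z, rowvar=False), np.corrcoef(X, rowvar=False))
```

A consequence is that the variance-covariance matrix of Z equals the correlation matrix of X.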
This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
