Principal Component Analysis

Factor analysis dates back to the 1930s. It was originally used in psychology to study intelligence. Attempts were made to relate test results to other factors. The premise was that

$X = SL + E$

where
X = test performance
L = intrinsic intelligence factors
S = individual scores
E = residual error

Using an eigenvector rotation, it would be possible to decompose the X matrix into a series of loadings and scores. Underlying or intrinsic factors related to intelligence could then be detected. In chemistry, this approach can be applied by diagonalizing the correlation or covariance matrix. This is Principal Component Analysis.

PCA is typically conducted using the covariance matrix from autoscaled data.

Covariance matrix $= \frac{1}{NV - 1} X'^{T} X'$

It is then diagonalized (an eigenvector rotation). Typically, the eigenvectors with the largest eigenvalues are the most important.
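The steps described here (autoscale, form the covariance matrix, diagonalize, rank by eigenvalue) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are random numbers, the function names `autoscale` and `pca_by_covariance` are made up for this example, and the divisor is assumed to be the number of samples minus one.

```python
import numpy as np

def autoscale(X):
    """Mean-center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca_by_covariance(X):
    """Return eigenvalues, loadings and scores, largest eigenvalue first."""
    Z = autoscale(X)
    n_samples = Z.shape[0]
    C = Z.T @ Z / (n_samples - 1)          # covariance matrix of the autoscaled data
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh is suitable because C is symmetric
    order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                   # project the data onto the eigenvectors
    return eigvals, eigvecs, scores

# Made-up correlated data, used only to exercise the functions
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

eigvals, loadings, scores = pca_by_covariance(X)
explained = eigvals / eigvals.sum()        # fraction of variance per component
```

Because the data are autoscaled, the eigenvalues sum to the number of variables, so `explained` gives the share of the total variance carried by each component.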
Covariance and Correlation

Covariance
A measure of the association of two variables. The sum of cross products between two variables as deviations from their respective means.

$\mathrm{COVAR} = \sum (x - \bar{x})(y - \bar{y})$

Approaches fall into two categories:
Complete diagonalization of the matrix.
Approximation methods that extract one component at a time.
In the end, the results are the same. The data is decomposed into a set of loadings, scores and a residual.

$X = t_1 p_1 + t_2 p_2 + \dots + t_a p_a + E$

where X and E are n × m, each score vector $t_i$ is n × 1 and each loading vector $p_i$ is 1 × m.
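To illustrate that both routes give the same decomposition, the sketch below extracts two components one at a time and compares them with a full diagonalization. The one-at-a-time method used here is NIPALS, a common choice; the slides do not say which method was intended, so treat that as an assumption. The data are random and mean-centered inside the function.

```python
import numpy as np

def nipals(X, n_components, tol=1e-12, max_iter=1000):
    """Extract components one at a time; returns scores T, loadings P, residual E."""
    E = X - X.mean(axis=0)                        # work on mean-centered data
    T, P = [], []
    for _component in range(n_components):
        t = E[:, np.argmax(E.var(axis=0))].copy() # start from the highest-variance column
        for _iteration in range(max_iter):
            p = E.T @ t / (t @ t)                 # regress the data onto the score vector
            p /= np.linalg.norm(p)                # normalize the loading vector
            t_new = E @ p                         # updated scores
            converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        E = E - np.outer(t, p)                    # deflate: remove this component
        T.append(t)
        P.append(p)
    return np.column_stack(T), np.column_stack(P), E

# Made-up data with clearly separated variances
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])
T, P, E = nipals(X, n_components=2)

# Complete diagonalization of the covariance matrix for comparison
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]

# Loadings agree up to an arbitrary sign per component
same = np.allclose(np.abs(P.T @ top2), np.eye(2), atol=1e-6)
```

Here `T @ P.T + E` reproduces the centered data exactly, which is the scores, loadings and residual decomposition written above with two components retained.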
Correlation
The covariance between two z-transformed (autoscaled) variables.

$r = \frac{\mathrm{COVAR}}{S_x S_y}$
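A small numerical check of the relationship between covariance and correlation may help. This sketch assumes the sum of cross products is divided by n - 1 (the usual sample convention, not stated on the slide) and uses made-up numbers; the helper names are illustrative only.

```python
import numpy as np

def covariance(x, y):
    """Sum of cross products of deviations from the means, divided by n - 1."""
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / (x.std(ddof=1) * y.std(ddof=1))

def zscore(x):
    """z-transform (autoscale) a single variable."""
    return (x - x.mean()) / x.std(ddof=1)

# Made-up values for two related variables
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

r = correlation(x, y)
r_from_z = covariance(zscore(x), zscore(y))   # covariance of the z-transformed variables
```

The two values `r` and `r_from_z` agree, which is the statement that correlation is the covariance between autoscaled variables.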
Principal component (PC). A linear combination of related variables. It represents an intrinsic factor of your data.
Scores. The projection of your data into PC space.
Loadings. Show the relative significance of the original variables.
Residual. The data that could not be correlated, typically random noise.
Varimax rotation
A secondary adjustment of the PCs to help better observe relationships. It is essentially a secondary rotation of your data in an attempt to lump all of the variance from individual variables into single components. It can often help you to better understand the effects of your original data.
[Figure: eigenvectors EV1 and EV2 drawn relative to the axes of the original variables]
Assume we have 5 original variables and are only interested in the first 3 PCs.
[Figure: % variance from original variables 1-5 in PC1, PC2 and PC3 before rotation]
After varimax rotation, it might look like:
[Figure: % variance from original variables 1-5 in PC1, PC2 and PC3 after rotation]
The significance of each variable is easier to see.
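For readers who want to try a varimax rotation numerically, here is a minimal sketch of the widely used iterative SVD formulation. It is not taken from the slides; the function name, the stopping rule and the example loading matrix are all assumptions made for illustration.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate a (variables x components) loading matrix.

    Returns the rotated loadings and the orthogonal rotation matrix R.
    """
    n_vars, n_comp = loadings.shape
    R = np.eye(n_comp)
    criterion = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient-like matrix for the varimax criterion
        G = loadings.T @ (L**3 - L @ np.diag(np.sum(L**2, axis=0)) / n_vars)
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt                          # orthogonal matrix closest to G
        new_criterion = s.sum()
        if criterion != 0 and new_criterion < criterion * (1 + tol):
            break                           # no meaningful improvement
        criterion = new_criterion
    return loadings @ R, R

# Hypothetical loadings: a simple structure deliberately mixed by a 30 degree rotation
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.9], [0.1, 0.8]])
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ mix

rotated, R = varimax(unrotated)
```

Because `R` is orthogonal, the total variance attributed to each original variable (the row sums of squared loadings) is unchanged; only its distribution across components changes.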
Using PCA results

The best way to appreciate PCA is to look at a series of examples. We'll attempt to show what types of information can be obtained and how it can be used.
Examples
Classification of artifacts
Classification of whiskey
Noise reduction of 3D data
PCA of archaeological artifacts

The information presented in this example is from: Kowalski, Schatzki and Stross, Anal. Chem. 44, 2176 (1972). A complete evaluation of the data is also presented in Chemometrics by Sharaf, Illman and Kowalski, John Wiley & Sons, 1986.

Summary of study.

Native American artifacts made of obsidian glass were obtained from 5 sites in northern California. Obsidian samples from 4 quarry sites were obtained in the same area. X-ray fluorescence analysis for ten elements (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y and Zr) was conducted on all 75 samples.

We will start by assigning a class to each type of sample, to be used in labeling the various plots.
1-4: Quarry samples
5-7: Artifacts from Indian sites

Questions posed.
Can the different sources of obsidian be differentiated based on the chemical measurements made?
Can something be said regarding the sources of the artifacts and the migration and trading patterns of the Indians?
Archaeological artifacts - Data scaling
Both unscaled and autoscaled data will be evaluated using XLStat.

Archaeological artifacts - Eigenvalues
[Figure: scree plot for the unscaled data showing eigenvalue and cumulative variability (%) for F1 to F10]
Virtually all of the variance is in the first principal component.

PC1 vs. PC2
We can now produce displays of our components. A plot of the scores for PC1 vs. PC2 will result in about 98% of the original information being displayed. A loadings plot of L1 vs. L2 will show the importance of the original variables in the construction of PC1 and PC2.
[Figure: scores plot for the unscaled data, F1 (81.68 %) vs. F2 (16.14 %), with points labeled by class 1-7]
Quarry samples from site 1 tend to form an individual group.
Quarry site 3 and artifact site 7 appear to be related.
L1 vs. L2
[Figure: loadings plot of L1 vs. L2 for the ten elements]
The loadings show that Ca and Fe both have an effect on PC1. The other variables have a smaller effect.

Correlation plot
[Figure: correlations between the original variables and the factors, both axes running from -1 to 1]
XLStat will also produce a plot that indicates the correlation between the original variables and the factors. Here, it indicates that Y has little effect and the others have an impact on both PCs.
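The quantity shown in a correlation plot of this kind can be computed directly: it is the correlation between each original variable and each set of factor scores. The sketch below uses made-up data and assumes autoscaling; it also checks the known shortcut that, for autoscaled data, this correlation equals the loading multiplied by the square root of the eigenvalue. It is not a description of how XLStat works internally.

```python
import numpy as np

# Made-up correlated data, autoscaled before PCA
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 3))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA by diagonalizing the covariance matrix of the autoscaled data
eigvals, P = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], P[:, order]
T = Z @ P                                   # scores

# Correlation of every variable (rows) with every factor (columns)
n_vars = Z.shape[1]
corr = np.array([[np.corrcoef(Z[:, j], T[:, k])[0, 1] for k in range(n_vars)]
                 for j in range(n_vars)])

# Shortcut for autoscaled data: loading times sqrt(eigenvalue)
shortcut = P * np.sqrt(eigvals)
```

A variable whose row in `corr` has small values in the first two columns sits near the center of the correlation plot and contributes little to those two factors.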

Archaeological artifacts - Autoscaling
[Figure: scree plot for the autoscaled data showing eigenvalue and cumulative variability (%) for F1 to F10]
Now that each variable is given equal weight, the variance is no longer all concentrated in the first component.
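The effect of autoscaling on the eigenvalue distribution is easy to reproduce with synthetic numbers. The sketch below does not use the obsidian data; it assumes four independent, made-up variables, one of which is recorded on a much larger numeric scale.

```python
import numpy as np

def explained_variance(X):
    """Fraction of the total variance carried by each PC, largest first."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

rng = np.random.default_rng(3)
# Hypothetical measurements: the first variable has a far larger numeric range
X = rng.normal(size=(75, 4)) * np.array([1000.0, 10.0, 5.0, 1.0])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscaled copy

unscaled = explained_variance(X)      # dominated by the large-scale variable
autoscaled = explained_variance(Z)    # spread more evenly across components
```

With unscaled data the first component mostly reflects whichever variable has the largest numbers; after autoscaling each variable contributes one unit of variance, so no single variable dominates by scale alone.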
PC1 vs. PC2
[Figure: biplot showing scores and loadings for the autoscaled data, F1 (52.52 %) vs. F2 (20.78 %)]
So what does it all mean? In this case, both scaled and unscaled results indicate a grouping of related samples, so XRF results can be used to classify related samples. Samples from the different quarries (1-4) can easily be distinguished, although samples from site 2 are fairly scattered.
Can we tell anything about the artifacts?
Archaeological artifacts - Results
[Figure: biplot for the autoscaled data, F1 (52.52 %) vs. F2 (20.78 %)]
No artifacts appear to have come from quarry site 4. It's well resolved from the other samples.
Artifacts from site 6 appear to come from quarry 2, although the results are scattered. Artifacts from site 7 are from quarry 3. Site 5 artifacts appear to come from all over the place. It's possible that this was a nomadic tribe.
Using the loadings

The loadings indicate that many of our variables are closely related. In addition, Y appears to have little effect on our results.
We can reprocess our data after eliminating Y and some of our correlated variables. This might improve our results. At a minimum, it will make subsequent studies easier, with less data to collect.
[Figure: loadings plot annotated with the variables to use (K, Ca, Fe) and those to remove]
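On the slides the correlated variables are picked out by eye from the loadings plot. A rough numerical stand-in, shown below, keeps one variable from each group whose pairwise correlation exceeds a threshold. The threshold of 0.9, the function name and the variable names `v1` to `v4` are all assumptions for illustration, and the data are synthetic.

```python
import numpy as np

def pick_representatives(X, names, threshold=0.9):
    """Greedily keep one variable from each group of highly correlated variables."""
    R = np.corrcoef(X, rowvar=False)
    kept = []
    for j, name in enumerate(names):
        # Keep this variable only if it is not strongly correlated with one already kept
        if all(abs(R[j, names.index(k)]) < threshold for k in kept):
            kept.append(name)
    return kept

rng = np.random.default_rng(4)
base = rng.normal(size=(60, 2))
# Made-up variables: v1 and v2 track one factor, v3 and v4 track another
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.05 * rng.normal(size=60),
    base[:, 1],
    -base[:, 1] + 0.05 * rng.normal(size=60),
])

kept = pick_representatives(X, ["v1", "v2", "v3", "v4"])
```

The result depends on the order of the variables and on the threshold, so it is best treated as a starting point to check against the loadings plot rather than a replacement for it.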
Modified Study
We now use only 4 variables.
Is it enough to still give the same characterization as with the original 10? If this works, we'll be able to save quite a bit of time and money on subsequent assays. Also, does it improve our results?
[Figure: PC1 vs. PC2 biplot using K, Ca, Fe and Zr, F1 (60.47 %) vs. F2 (27.66 %)]
Our results are almost identical to our 10 variable work.
Classification of whiskey
Another example comes from work conducted in our laboratory. One study involved the characterization of whiskey based on GC/MS traces. This example shows what you might need to do in order to make your data suitable for PCA evaluation.
Representative whiskeys
Methylene chloride extracts for a series of whiskeys were assayed using GC/MS. Variables needed to be constructed from these traces.
Data preprocessing
Each chromatogram consisted of approximately 1800 points. This would be too much for many systems to handle.
Variables were constructed by summing the response at 1 min intervals, resulting in 30 variables.
To improve variable stability:
The smallest response value was treated as a baseline for background correction.
An internal standard was used to normalize detector response.
The internal standard was also used to account for small time variations.
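The preprocessing steps can be sketched as follows for a single trace: subtract the smallest response as a baseline, normalize to the internal-standard response, and sum the response in 1 min intervals to give 30 variables. The trace is simulated, the internal-standard window of 12 to 13 min is hypothetical, and the retention-time correction mentioned above is not implemented here.

```python
import numpy as np

def preprocess_trace(time_min, response, istd_window, bin_width=1.0, n_bins=30):
    """Turn one chromatogram into binned, baseline-corrected, normalized variables."""
    corrected = response - response.min()            # smallest response as the baseline
    lo, hi = istd_window
    in_window = (time_min >= lo) & (time_min < hi)
    istd_area = corrected[in_window].sum()           # internal-standard response
    edges = np.arange(n_bins + 1) * bin_width        # 0, 1, ..., 30 min
    # Sum the corrected response falling in each 1 min interval
    binned, _ = np.histogram(time_min, bins=edges, weights=corrected)
    return binned / istd_area                        # normalize to the internal standard

# Simulated trace: 1800 points over 30 min with a flat, noisy background
time_min = np.linspace(0.0, 30.0, 1800, endpoint=False)
rng = np.random.default_rng(5)
response = 50.0 + rng.random(1800)
# Hypothetical internal-standard peak near 12.5 min
response += 200.0 * np.exp(-((time_min - 12.5) / 0.1) ** 2)

variables = preprocess_trace(time_min, response, istd_window=(12.0, 13.0))
```

In this example the internal-standard window coincides with one bin, so that variable equals 1 after normalization and every other variable is expressed relative to it.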
All data were autoscaled prior to PCA.
Questions asked
Could whiskeys be classified?
Could the approach be used to detect sample dilution, blending or contamination?
Initial PCA analysis
[Figure: scores plot of PC1 vs. PC2. S - Scotch, B - Bourbon, C - Canadian, L - Blended, T - Tennessee]
Blending of one scotch into another
[Figure: scores plot of PC1 vs. PC2 for brand X, brand Y and blends labeled by the % of brand Y in the blend (25, 50 and 75)]

Contamination
[Figure: scores plot of PC1 vs. PC2, with a plot of ppm methyl benzoate (0 to 90 ppm) vs. PC1]

Dilution
[Figure: scores plot of PC1 vs. PC2, with points labeled by % by volume whiskey (0 to 80)]
