
Principal Component Analysis


Factor analysis dates back to the 1930s. It was originally used in psychology to study intelligence. Attempts were made to relate test results to other factors. The premise was that X = S L + E, where
X = test performance
L = intrinsic intelligence factors
S = individual scores
E = residual error



Using an eigenvector rotation, it would be possible to decompose the X matrix into a series of loadings and scores. Underlying or intrinsic factors related to intelligence could then be detected. In chemistry, this approach can be applied by diagonalizing the correlation or covariance matrix: Principal Component Analysis.



PCA is typically conducted using the covariance matrix from autoscaled data.

Covariance matrix = $\frac{1}{NV - 1} X'^{T} X'$

The matrix is then diagonalized (an eigenvector rotation). Typically, the eigenvectors with the largest eigenvalues are the most important.
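The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data values are made up, the function names `autoscale` and `pca_eig` are hypothetical, and the covariance matrix is divided by the number of samples minus one (the usual sample-covariance convention, which may not be exactly what the slide's normalization intends).

```python
import numpy as np

def autoscale(X):
    """Mean-center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca_eig(X):
    """Diagonalize the covariance matrix of the autoscaled data."""
    Z = autoscale(X)
    # Assumption: normalize by (number of samples - 1).
    C = Z.T @ Z / (Z.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh is meant for symmetric matrices
    order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
    return eigvals[order], eigvecs[:, order]

# Hypothetical data: 6 samples (rows) by 3 variables (columns).
X = np.array([[2.0,  40.0, 1.1],
              [3.0,  55.0, 0.9],
              [4.0,  61.0, 1.4],
              [5.0,  82.0, 1.0],
              [6.0,  95.0, 1.3],
              [7.0, 118.0, 1.2]])

eigvals, loadings = pca_eig(X)
explained = eigvals / eigvals.sum()        # fraction of variance per component
print(explained)
```

Because the data are autoscaled, the covariance matrix here is the correlation matrix, so the eigenvalues sum to the number of variables.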

Covariance and Correlation


Covariance
A measure of the association of two variables. The sum of cross products between two variables as deviations from their respective means.

$\mathrm{COVAR} = \sum (x - \bar{x})(y - \bar{y})$
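As a small worked example, the sum of cross products and the correlation coefficient r = COVAR / (s_x s_y) used in this section can be computed directly. The two vectors are made-up values. One assumption is labeled in the code: s_x and s_y are taken as root sums of squared deviations, so that any common 1/(n - 1) factor cancels and r does not depend on the normalization chosen.

```python
import numpy as np

# Hypothetical measurements of two variables on five samples.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

dx = x - x.mean()                 # deviations from the mean of x
dy = y - y.mean()                 # deviations from the mean of y

covar = np.sum(dx * dy)           # sum of cross products
# Assumption: spreads on the same (unnormalized) footing as covar.
s_x = np.sqrt(np.sum(dx ** 2))
s_y = np.sqrt(np.sum(dy ** 2))

r = covar / (s_x * s_y)           # correlation coefficient
print(covar, r)
```

The value of `r` agrees with `np.corrcoef(x, y)[0, 1]`, which is a convenient cross-check.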

Correlation
The covariance between two z-transformed (autoscaled) variables.

$r = \frac{\mathrm{COVAR}}{s_x s_y}$

Principal Component Analysis

Approaches fall into two categories:
Complete diagonalization of the matrix.
Approximation methods that extract one component at a time.
In the end, the results are the same. The data are decomposed into a set of loadings, scores and a residual.

$X = t_1 p_1^{T} + t_2 p_2^{T} + \dots + t_a p_a^{T} + E$

Here X and E are n x m, each score vector t has n elements and each loading vector p has m elements.

Principal component (PC). A linear combination of related variables. It represents an intrinsic factor of your data.
Scores. The projection of your data into PC space.
Loadings. Show the relative significance of the original variables.
Residual. The data that could not be correlated, typically random noise.
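The decomposition into scores, loadings and a residual can be sketched with a singular value decomposition, which is one complete-diagonalization route (NIPALS is a well-known example of the one-component-at-a-time kind). The function name `pca_decompose`, the synthetic data and the choice of mean-centering only are assumptions made for illustration.

```python
import numpy as np

def pca_decompose(X, a):
    """Split mean-centered X into scores T, loadings P and residual E
    using the first `a` principal components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :a] * s[:a]     # scores, n x a
    P = Vt[:a].T             # loadings, m x a
    E = Xc - T @ P.T         # residual, n x m
    return T, P, E

# Hypothetical data: two underlying factors plus a little random noise.
rng = np.random.default_rng(0)
n, m = 20, 5
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, m))
X = X + 0.01 * rng.normal(size=(n, m))

T, P, E = pca_decompose(X, a=2)
print(np.linalg.norm(E))     # small: only the noise is left in the residual
```

Adding the pieces back together, `T @ P.T + E`, reproduces the mean-centered data exactly, whatever number of components is kept.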

Varimax rotation

A secondary tweaking of the PCs to help better observe relationships. It is essentially a secondary rotation of your data in an attempt to lump all variance from individual variables into single components. It can often help you to better understand the effects of your original variables.

[Figure: eigenvectors EV1 and EV2 drawn in the space of the original variables (axes labeled Variable 1 to Variable 3).]

Assume we have 5 original variables and are only interested in the first 3 PCs.

[Figure: % variance of original variables 1-5 in PC1, PC2 and PC3 before rotation.]

After varimax rotation, it might look like:

[Figure: % variance of original variables 1-5 in PC1, PC2 and PC3 after varimax rotation.]

The significance of each variable is easier to see.
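For readers who want to experiment, here is a minimal sketch of one commonly used SVD-based iteration for varimax. It is not necessarily the algorithm used by any particular package, and the 5 x 3 loading matrix is made up to mirror the "5 variables, 3 PCs" situation above. The function name `varimax` and its parameters are hypothetical.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a (variables x components) loading matrix.

    Returns the rotated loadings and the rotation matrix R.
    """
    p, k = loadings.shape
    R = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Matrix whose polar factor gives the next rotation.
        G = loadings.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt                      # product of orthogonal matrices
        d = s.sum()
        if d_old != 0.0 and d / d_old < 1.0 + tol:
            break                       # no meaningful improvement
        d_old = d
    return loadings @ R, R

# Hypothetical loadings: 5 original variables, first 3 PCs.
A = np.array([[0.7,  0.5,  0.1],
              [0.6,  0.6,  0.2],
              [0.5, -0.5,  0.3],
              [0.6, -0.4, -0.2],
              [0.3,  0.1,  0.8]])

rotated, R = varimax(A)
print(np.round(rotated, 2))
```

Because R is orthogonal, the sum of squared loadings for each variable (its row) is unchanged. The rotation only redistributes that variance among the components.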

Using PCA results


The best way to appreciate PCA is to look at a series of examples. We'll attempt to show what types of information can be obtained and how they can be used.

Examples:
Classification of artifacts
Classification of whiskey
Noise reduction of 3D data

PCA of archaeological artifacts


The information presented in this example is from:
Kowalski, Schatzki and Stross, Anal. Chem. 44, 2176 (1972).
A complete evaluation of the data is also presented in Chemometrics by Sharaf, Illman and Kowalski, John Wiley & Sons, 1986.

Summary of study

Native American artifacts made of obsidian glass were obtained from 5 sites in northern California. Obsidian samples from 4 quarry sites were obtained in the same area.
X-ray fluorescence analysis for ten elements (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y and Zr) was conducted on all 75 samples.

Questions posed:
Can the different sources of obsidian be differentiated based on the chemical measurements made?
Can something be said regarding the sources of the artifacts and the migration and trading patterns of the Indians?

We will start by assigning classes to each type of sample, to be used in the labeling of the various plots.
1-4  Quarry samples
5-7  Artifacts from Indian sites

Archaeological artifacts - Data scaling

Both unscaled and autoscaled data will be evaluated using XLStat.

Archaeological artifacts - Eigenvalues

[Figure: scree plot for the unscaled data, showing the eigenvalue and cumulative variability (%) for factors F1 to F10.]

Virtually all of the variance is in the first principal component.

We can now produce displays of our components. A plot of the scores for PC1 vs. PC2 displays about 98% of the original information (81.68% + 16.14%). A loadings plot of L1 vs. L2 shows the importance of the original variables in the construction of PC1 and PC2.

[Figure: PC1 vs. PC2 scores plot for the unscaled data, F1 (81.68 %) vs. F2 (16.14 %), with samples labeled by class 1-7.]

Quarry samples from site 1 tend to form an individual group.
Quarry site 3 and artifact site 7 appear to be related.

[Figure: loadings plot, L1 vs. L2, for the ten elements.]

The loadings show that Ca and Fe both have an effect on PC1. The other variables have a smaller effect.

[Figure: correlation plot of the original variables against the first two factors.]

XLStat will also produce a plot that indicates the correlation between the original variables and the factors. Here, it indicates that Y has little effect and the others have an impact on both PCs.
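The correlation between each original variable and each factor can also be computed by hand. For autoscaled data, the correlation of variable j with component k equals the loading multiplied by the square root of that component's eigenvalue. The sketch below checks this on synthetic data. It is not a description of how XLStat computes its plot, and all names and values are assumptions.

```python
import numpy as np

# Hypothetical data: 50 samples, 4 variables, two of them closely related.
rng = np.random.default_rng(1)
base = rng.normal(size=(50, 2))
X = np.column_stack([base[:, 0],
                     base[:, 0] + 0.3 * rng.normal(size=50),
                     base[:, 1],
                     rng.normal(size=50)])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # autoscale
C = Z.T @ Z / (Z.shape[0] - 1)                      # correlation matrix
eigvals, P = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]                   # largest eigenvalue first
eigvals, P = eigvals[order], P[:, order]

scores = Z @ P                                      # projection into PC space
# Column k of P scaled by sqrt(eigenvalue k): variable-factor correlations.
corr_loadings = P * np.sqrt(eigvals)
print(np.round(corr_loadings[:, :2], 2))            # coordinates for a correlation plot
```

A variable that plots near the origin of the first two columns, as Y does in the slide, is poorly described by those two factors.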

Archaeological artifacts - Autoscaling

[Figure: scree plot for the autoscaled data, showing the eigenvalue and cumulative variability (%) for factors F1 to F10.]

Now that each variable is given equal weight, the variance is no longer concentrated in the first component.
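The effect of scaling on the eigenvalues is easy to reproduce with synthetic data. In the sketch below, four independent variables are given very different magnitudes. The values are invented for illustration and are not the obsidian data, and the helper `explained_variance` is a hypothetical name.

```python
import numpy as np

def explained_variance(X, autoscale):
    """Percent of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)
    if autoscale:
        Xc = Xc / X.std(axis=0, ddof=1)
    C = Xc.T @ Xc / (Xc.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(C)[::-1]   # eigvalsh is ascending; reverse it
    return 100.0 * eigvals / eigvals.sum()

# Hypothetical: independent variables measured on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([1000.0, 10.0, 1.0, 0.1])

unscaled = explained_variance(X, autoscale=False)
scaled = explained_variance(X, autoscale=True)
print(np.round(unscaled, 2))   # dominated by the large-magnitude variable
print(np.round(scaled, 2))     # spread much more evenly
```

Without scaling, the first component mostly tracks whichever variable has the largest numerical range. With autoscaling, each variable contributes one unit of variance.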

[Figure: biplot showing scores and loadings for the autoscaled data, PC1 vs. PC2, F1 (52.52 %) vs. F2 (20.78 %).]

So what does it all mean? In this case, both scaled and unscaled results indicate a grouping of related samples, so XRF results can be used to classify related samples. Samples from the different quarries (1-4) can easily be distinguished, although samples from site 2 are fairly scattered.

Can we tell anything about the artifacts?

Archaeological artifacts - Results

[Figure: the autoscaled biplot, F1 (52.52 %) vs. F2 (20.78 %), repeated with the sample groupings highlighted.]

No artifacts appear to have come from quarry site 4. It is well resolved from the other samples.

Artifacts from site 6 appear to come from quarry 2, although the results are scattered. Artifacts from site 7 are from quarry 3. Site 5 artifacts appear to come from a variety of sources. It is possible that this was a nomadic tribe.

Using the loadings

[Figure: loadings plot annotated with which variables to use and which to remove.]

The loadings indicate that many of our variables are closely related. In addition, Y appears to have little effect on our results.

We can reprocess our data after eliminating Y and some of our correlated variables. This might improve our results. At a minimum it will make subsequent studies easier, since there is less data to collect.
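One simple way to look for closely related variables numerically, as a complement to reading the loadings plot, is to list the pairs whose correlation exceeds some cutoff. This is a sketch only. The 0.9 threshold, the variable names and the data are all hypothetical, and the slides themselves select variables by inspecting the loadings.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.9):
    """Return (name_i, name_j, r) for variable pairs with |r| >= threshold."""
    R = np.corrcoef(X, rowvar=False)    # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(R[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(R[i, j])))
    return pairs

# Hypothetical data: v2 nearly duplicates v1, v3 is unrelated.
rng = np.random.default_rng(2)
v1 = rng.normal(size=100)
v2 = 2.0 * v1 + 0.1 * rng.normal(size=100)
v3 = rng.normal(size=100)
X = np.column_stack([v1, v2, v3])

pairs = correlated_pairs(X, ["v1", "v2", "v3"])
print(pairs)   # only the v1/v2 pair should be flagged
```

For each flagged pair, one variable can be kept and the other dropped before rerunning the PCA.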

Modified study

We now use only 4 variables. Is that enough to still give the same characterization as with the original 10? If this works, we'll be able to save quite a bit of time and money on subsequent assays. Also, does it improve our results?

[Figure: PC1 vs. PC2 biplot for the modified study, F1 (60.47 %) vs. F2 (27.66 %), with loadings for K, Ca, Fe and Zr.]

Our results are almost identical to our 10 variable work.

Classification of whiskey
Another example comes from work conducted in our laboratory. One study involved the characterization of whiskey based on GC/MS traces. This example shows what you might need to do in order to make your data suitable for PCA evaluation.

Representative whiskeys
Methylene chloride extracts for a series of whiskeys were assayed using a GC/MS. Variables needed to be constructed from these traces.

Data preprocessing
Each chromatogram consisted of approximately 1800 points. This would be too much for many systems to handle.
Variables were constructed by summing the response at 1 min intervals, resulting in 30 variables.
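The binning step can be sketched as follows, assuming the roughly 1800 points are evenly spaced over 30 minutes so that each 1 min interval holds 60 points. The trace here is a made-up ramp, and `bin_trace` is a hypothetical helper name.

```python
import numpy as np

def bin_trace(trace, n_bins=30):
    """Sum a 1-D chromatogram into n_bins equal-width intervals."""
    trace = np.asarray(trace, dtype=float)
    usable = (trace.size // n_bins) * n_bins    # ignore leftover points at the end
    return trace[:usable].reshape(n_bins, -1).sum(axis=1)

# Hypothetical trace: 1800 points standing in for a 30 min chromatogram.
trace = np.arange(1800.0)
variables = bin_trace(trace)
print(variables.shape)    # one variable per 1 min interval
```

Summing rather than averaging keeps each variable proportional to the total response in its interval.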

To improve variable stability:
The smallest response value was treated as a baseline for background correction.
An internal standard was used to normalize detector response.
The internal standard was also used to account for small time variations.
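A rough sketch of these three corrections is given below. The slides do not say how the internal standard peak was located or how the time correction was applied, so the index window, the use of the summed window response, and the simple shift with `np.roll` are all assumptions made for illustration.

```python
import numpy as np

def preprocess(trace, is_window):
    """Baseline-correct a trace and normalize it to the internal standard."""
    trace = np.asarray(trace, dtype=float)
    corrected = trace - trace.min()       # smallest response as the baseline
    lo, hi = is_window                    # assumed index range of the standard
    is_response = corrected[lo:hi].sum()  # summed internal standard response
    return corrected / is_response

def align(trace, is_window, target_index):
    """Shift the trace so the internal standard apex lands on target_index.

    np.roll wraps points around the ends, which is acceptable only for
    small shifts over a flat baseline.
    """
    lo, hi = is_window
    apex = lo + int(np.argmax(trace[lo:hi]))
    return np.roll(trace, target_index - apex)

# Hypothetical trace: flat background, an internal standard peak, one analyte.
trace = np.full(100, 5.0)
trace[40:45] += np.array([1.0, 3.0, 6.0, 3.0, 1.0])   # internal standard
trace[70] += 10.0                                     # analyte peak

out = preprocess(trace, (35, 50))
aligned = align(out, (35, 50), target_index=40)
print(out[70], int(np.argmax(aligned[35:50])) + 35)
```

After normalization, the analyte response is expressed relative to the internal standard, so run-to-run changes in detector sensitivity largely cancel.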

All data were autoscaled prior to PCA.

Questions asked:
Could whiskeys be classified?
Could the approach be used to detect sample dilution, blending or contamination?

Initial PCA analysis

[Figure: PC1 vs. PC2 scores plot of the whiskey samples. S - Scotch, B - Bourbon, C - Canadian, L - Blended, T - Tennessee.]

Blending of one scotch into another

[Figure: PC1 vs. PC2 scores plot for brand X, brand Y and blends labeled 25Y, 50Y and 75Y, where the number is the % of brand Y in the blend.]

Contamination

[Figure: ppm methyl benzoate (0 to 90 ppm) vs. PC1.]

Dilution

[Figure: PC1 vs. PC2 scores plot with samples labeled X and 0, 20, 40, 60 and 80; the legend reads "% by V, whiskey".]
