Tanagra

R.R.

Subject
Implementing the Principal Component Analysis (PCA) with TANAGRA.
The PCA belongs to the factor analysis approaches. It is used to discover the underlying structure of a set of variables. It reduces the attribute space from a larger number of variables to a smaller number of factors (dimensions). It is an unsupervised procedure, i.e. it does not assume that a dependent variable is specified [1]. In this tutorial, we show how to implement this approach and how to interpret the results with Tanagra.

Dataset
We use the AUTOS_ACP.XLS dataset from SAPORTA's well-known book [2] (Tableau 17.1, page 428). The interest of this dataset is that we can compare our results with those described in the book (pages 177 to 181). In this tutorial, we simply show the sequence of operations and how to read the results tables. For a detailed interpretation, it is best to refer to the book. The data table is the following:
Modele            CYL   PUISS  LONG  LARG  POIDS  V-MAX  FINITION  PRIX   R-POID.PUIS
Alfasud TI        1350   79    393   161    870   165    B         30570  11.01
Audi 100          1588   85    468   177   1110   160    TB        39990  13.06
Simca 1300        1294   68    424   168   1050   152    M         29600  15.44
Citroen GS Club   1222   59    412   161    930   151    M         28250  15.76
Fiat 132          1585   98    439   164   1105   165    B         34900  11.28
Lancia Beta       1297   82    429   169   1080   160    TB        35480  13.17
Peugeot 504       1796   79    449   169   1160   154    B         32300  14.68
Renault 16 TL     1565   55    424   163   1010   140    B         32000  18.36
Renault 30        2664  128    452   173   1320   180    TB        47700  10.31
Toyota Corolla    1166   55    399   157    815   140    M         26540  14.82
Alfetta-1.66      1570  109    428   162   1060   175    TB        42395   9.72
Princess-1800     1798   82    445   172   1160   158    B         33990  14.15
Datsun-200L       1998  115    469   169   1370   160    TB        43980  11.91
Taunus-2000       1993   98    438   170   1080   167    B         35010  11.02
Rancho            1442   80    431   166   1129   144    TB        39450  14.11
Mazda-9295        1769   83    440   165   1095   165    M         27900  13.19
Opel-Rekord       1979  100    459   173   1120   173    B         32700  11.20
Lada-1300         1294   68    404   161    955   140    M         22100  14.04

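The first column is the label of the examples. The active variables, used during the computation of the axes, are CYL, PUISS, LONG, LARG, POIDS and V-MAX (in green in the original table); the supplementary (illustrative) variables, used only for the interpretation of the results, are FINITION, PRIX and R-POID.PUIS (in blue in the original table). Compared to the original dataset, we create a new variable, R-POID.PUIS, which is the ratio of the weight to the horsepower of the vehicle (POIDS / PUISS). Its values are low for sports cars.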
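The derived column can be checked directly against the table. The short sketch below is only an illustration (the function name weight_to_power is ours, not part of Tanagra or of the original spreadsheet); it recomputes POIDS / PUISS for a few rows copied from the data table and lists them from the lowest to the highest ratio.

```python
# (model, POIDS = weight, PUISS = horsepower), copied from the data table.
cars = [
    ("Alfasud TI", 870, 79),
    ("Renault 16 TL", 1010, 55),
    ("Renault 30", 1320, 128),
    ("Alfetta-1.66", 1060, 109),
    ("Lada-1300", 955, 68),
]


def weight_to_power(poids, puiss):
    """Return POIDS / PUISS rounded to two decimals, as displayed in the table."""
    return round(poids / puiss, 2)


ratios = {model: weight_to_power(poids, puiss) for model, poids, puiss in cars}

# Low values mean little weight per horsepower, i.e. the sportier cars.
for model, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{model:15s} {ratio:6.2f}")
```

The recomputed values agree with the R-POID.PUIS column, which confirms that the ratio is weight divided by horsepower.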

[1] http://faculty.chass.ncsu.edu/garson/PA765/factor.htm
[2] G. SAPORTA, Probabilités, Analyse de données et Statistique, TECHNIP, 2006 (in French).

19 May 2009

Principal Component Analysis with TANAGRA


Creating a diagram
We can launch Tanagra from Excel using an add-on [3]. We select the range of cells that contains the dataset. Then we click on the TANAGRA / EXECUTE TANAGRA menu.

A dialog box appears. We click on OK if the selection is right.

TANAGRA is launched. We check that there are 18 examples and 10 variables [4].
[3] http://dataminingtutorials.blogspot.com/2008/10/excelfilehandlingusingaddin.html; we can also import the XLS data file even if Excel is not installed on our computer, see http://dataminingtutorials.blogspot.com/2008/10/excelfileformatdirectimportation.html.
[4] Tanagra does not handle the label column. The first column is thus counted as a categorical variable in our dataset. This is not a problem: Tanagra considers that each label is a value of the variable. In this case, however, the number of distinct values is limited to 255.

Principal component analysis
First, we must define the types of variables. We insert the DEFINE STATUS component into the diagram by clicking the shortcut in the toolbar. We set the active variables as INPUT. We will see below how to use the illustrative variables.

Then we add the PRINCIPAL COMPONENT ANALYSIS component (FACTORIAL ANALYSIS tab). We click on the PARAMETERS menu: we set the number of dimensions to calculate (3 factors); we ask for the COS2 (contributions of the dimensions to the points, or squared correlations) and the CTR (contributions of the points to the dimensions).

We click on the VIEW menu in order to obtain the results.

Eigenvalues. The first table describes the eigenvalues. They reflect the importance of each dimension. We see that the first two factors account for 87.95% of the available information.
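The eigenvalue table can be cross-checked outside Tanagra. The sketch below is not Tanagra's implementation: it assumes that the PCA is run on the correlation matrix of the six active variables, and it uses a hand-written cyclic Jacobi routine in plain Python (the function names are ours). The tutorial reports 87.95% for the first two axes; the script prints the cumulated percentages so that they can be compared with that figure.

```python
import math

# CYL, PUISS, LONG, LARG, POIDS, V-MAX for the 18 cars, copied from the data table.
X = [
    [1350, 79, 393, 161, 870, 165],    # Alfasud TI
    [1588, 85, 468, 177, 1110, 160],   # Audi 100
    [1294, 68, 424, 168, 1050, 152],   # Simca 1300
    [1222, 59, 412, 161, 930, 151],    # Citroen GS Club
    [1585, 98, 439, 164, 1105, 165],   # Fiat 132
    [1297, 82, 429, 169, 1080, 160],   # Lancia Beta
    [1796, 79, 449, 169, 1160, 154],   # Peugeot 504
    [1565, 55, 424, 163, 1010, 140],   # Renault 16 TL
    [2664, 128, 452, 173, 1320, 180],  # Renault 30
    [1166, 55, 399, 157, 815, 140],    # Toyota Corolla
    [1570, 109, 428, 162, 1060, 175],  # Alfetta-1.66
    [1798, 82, 445, 172, 1160, 158],   # Princess-1800
    [1998, 115, 469, 169, 1370, 160],  # Datsun-200L
    [1993, 98, 438, 170, 1080, 167],   # Taunus-2000
    [1442, 80, 431, 166, 1129, 144],   # Rancho
    [1769, 83, 440, 165, 1095, 165],   # Mazda-9295
    [1979, 100, 459, 173, 1120, 173],  # Opel-Rekord
    [1294, 68, 404, 161, 955, 140],    # Lada-1300
]


def correlation_matrix(rows):
    """Correlation matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(p)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]
    return [[sum(zi[a] * zi[b] for zi in z) / n for b in range(p)] for a in range(p)]


def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    p = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off < tol:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if abs(a[i][j]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (i, j) entry.
                theta = 0.5 * math.atan2(2 * a[i][j], a[i][i] - a[j][j])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # A <- A R (columns i and j)
                    aki, akj = a[k][i], a[k][j]
                    a[k][i] = c * aki + s * akj
                    a[k][j] = -s * aki + c * akj
                for k in range(p):  # A <- R^T A (rows i and j)
                    aik, ajk = a[i][k], a[j][k]
                    a[i][k] = c * aik + s * ajk
                    a[j][k] = -s * aik + c * ajk
    return sorted((a[i][i] for i in range(p)), reverse=True)


eigenvalues = jacobi_eigenvalues(correlation_matrix(X))
total = sum(eigenvalues)  # trace of a correlation matrix = number of variables
cumulated = 0.0
for k, value in enumerate(eigenvalues, start=1):
    cumulated += value / total
    print(f"axis {k}: eigenvalue = {value:.4f}   cumulated = {100 * cumulated:.2f}%")
```

Because the correlation matrix has ones on its diagonal, the eigenvalues sum to 6, so each eigenvalue divided by 6 is the share of information carried by the corresponding axis.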

Correlations between factors and active variables. The second part describes the correlations and the COS2 (squared correlations), as percentages and cumulative percentages, between the variables and the factors.

Factor scores. A third table displays the coefficients for the computation of the factor scores of the individuals. The mean and the standard deviation used during the computation of the correlation matrix are given. A new instance must be standardized with these values before the coefficients are applied.
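The "standardize, then apply the coefficients" step can be illustrated on a deliberately reduced case. The sketch below keeps only two variables (CYL and PUISS), for which the first principal axis of the correlation matrix is known in closed form: (1/sqrt(2), 1/sqrt(2)) when the correlation is positive. These are therefore not the coefficients printed by Tanagra for the six-variable analysis. The use of the population (1/n) standard deviation and the new car's values are assumptions made for the example.

```python
import math

# CYL and PUISS for the 18 cars, copied from the data table.
cyl = [1350, 1588, 1294, 1222, 1585, 1297, 1796, 1565, 2664,
       1166, 1570, 1798, 1998, 1993, 1442, 1769, 1979, 1294]
puiss = [79, 85, 68, 59, 98, 82, 79, 55, 128,
         55, 109, 82, 115, 98, 80, 83, 100, 68]


def mean_std(values):
    """Mean and population (1/n) standard deviation (assumed convention)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std


# Parameters estimated on the learning data, one (mean, std) pair per variable.
params = [mean_std(cyl), mean_std(puiss)]

# First axis of a 2-variable PCA on standardized data with positive correlation.
coefs = [1 / math.sqrt(2), 1 / math.sqrt(2)]


def factor_score(instance, params, coefs):
    """Standardize each value with the learning mean/std, then apply the coefficients."""
    return sum(c * (x - m) / s for x, (m, s), c in zip(instance, params, coefs))


new_car = [1600, 90]  # hypothetical new instance (CYL, PUISS)
print(f"score of the new car on axis 1: {factor_score(new_car, params, coefs):+.3f}")
```

A car whose values equal the learning means gets a score of exactly 0, which is a quick way to check that the standardization step has not been skipped.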

But we can also get the computed factor scores for all the individuals. Indeed, the PCA component automatically adds new columns to the current dataset. They are available in the subsequent part of the diagram. According to our settings (see above), the COS2 and the contributions are also computed.
In order to view these values, we add the VIEW DATASET component (DATA VISUALIZATION tab) into the diagram. We click on the VIEW menu.
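For reference, the usual textbook definitions of these two indicators for the individuals are recalled here. They assume uniform weights 1/n, and that Tanagra follows this convention is our assumption. The notation is ours: F_{ik} is the factor score of individual i on axis k, \lambda_k the k-th eigenvalue, z_{ij} the standardized value of variable j for individual i, n the number of individuals and p the number of active variables.

```latex
\mathrm{CTR}_k(i) = \frac{F_{ik}^{2}}{n\,\lambda_k},
\qquad
\mathrm{COS2}_k(i) = \frac{F_{ik}^{2}}{\sum_{j=1}^{p} z_{ij}^{2}},
\qquad
\sum_{i=1}^{n} \mathrm{CTR}_k(i) = 1,
\qquad
\sum_{k=1}^{p} \mathrm{COS2}_k(i) = 1 .
```

The CTR values say which individuals build an axis; the COS2 values say how well a given individual is represented on that axis.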

TANAGRA uses a scientific format. A simple way to obtain a more convenient presentation is to copy/paste the values into a spreadsheet (COMPONENT / COPY RESULTS menu).

In addition, the spreadsheet offers multiple sorting options that help to highlight the relevant information. For instance, according to the contributions, we note that the first dimension is mainly defined by the opposition between RENAULT 30 TS + DATSUN 200L (big cars) and TOYOTA COROLLA + LADA 1300 (small cars).

Scatterplots. The popularity of factorial methods is based largely on graphical representations. They allow us to visually evaluate the proximity between observations. In our case, we project the observations onto the first two dimensions. We can associate a label with each point. We insert the SCATTERPLOT WITH LABEL component (DATA VISUALIZATION tab) into the diagram. We set the first dimension for the horizontal axis and the second dimension for the vertical one. Note that the axes can easily be changed.

The points are automatically labeled by their number. We can modify this by clicking on the LEGEND / ATTRIBUTE VALUE option. We select MODELE as the reference attribute. We obtain the scatterplot on the first two dimensions. Of course, this option is practical only as long as the number of points remains reasonable. Beyond a certain number of observations, the graph becomes unreadable.
Sometimes, some points are superposed. In this case, copying the coordinates into a spreadsheet and ranking the examples according to the dimensions is the best way to identify each example precisely. We can also modify the size of the labels with the shortcuts CTRL+W and CTRL+Q.

Correlation circle and illustrative variables. The correlation circle (or correlation scatterplot) is a graphical tool that enhances the interpretation of the factors. The correlations of the factors with the active and the illustrative continuous variables are computed. These are the coordinates of the variables in the scatterplot.
First we insert the DEFINE STATUS component. We set as TARGET the first two factors; then we set as INPUT all the continuous variables, including the illustrative ones.

Then, we add the CORRELATION SCATTERPLOT component. We click on the VIEW menu.
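The coordinates plotted in the correlation circle are plain Pearson correlations between each variable and the first two factors. A minimal sketch of that computation is given here for the six active variables and the illustrative variable PRIX. It is not Tanagra's code: the factors are obtained with a simple power iteration with deflation (a substitute technique chosen for brevity), the function names are ours, and the signs of the axes are arbitrary.

```python
import math

NAMES = ["CYL", "PUISS", "LONG", "LARG", "POIDS", "V-MAX"]
# Active variables and PRIX for the 18 cars, copied from the data table.
X = [
    [1350, 79, 393, 161, 870, 165], [1588, 85, 468, 177, 1110, 160],
    [1294, 68, 424, 168, 1050, 152], [1222, 59, 412, 161, 930, 151],
    [1585, 98, 439, 164, 1105, 165], [1297, 82, 429, 169, 1080, 160],
    [1796, 79, 449, 169, 1160, 154], [1565, 55, 424, 163, 1010, 140],
    [2664, 128, 452, 173, 1320, 180], [1166, 55, 399, 157, 815, 140],
    [1570, 109, 428, 162, 1060, 175], [1798, 82, 445, 172, 1160, 158],
    [1998, 115, 469, 169, 1370, 160], [1993, 98, 438, 170, 1080, 167],
    [1442, 80, 431, 166, 1129, 144], [1769, 83, 440, 165, 1095, 165],
    [1979, 100, 459, 173, 1120, 173], [1294, 68, 404, 161, 955, 140],
]
PRIX = [30570, 39990, 29600, 28250, 34900, 35480, 32300, 32000, 47700,
        26540, 42395, 33990, 43980, 35010, 39450, 27900, 32700, 22100]


def standardize(col):
    """Center and scale a column with the population (1/n) standard deviation."""
    n = len(col)
    mean = sum(col) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / std for v in col]


def pearson(u, v):
    """Pearson correlation coefficient between two columns."""
    return sum(a * b for a, b in zip(standardize(u), standardize(v))) / len(u)


def top_axes(corr, k, iterations=1000):
    """First k (eigenvalue, eigenvector) pairs by power iteration with deflation."""
    p = len(corr)
    m = [row[:] for row in corr]
    pairs = []
    for _ in range(k):
        v = [float(j + 1) for j in range(p)]  # arbitrary non-zero start vector
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * m[i][j] * v[j] for i in range(p) for j in range(p))
        pairs.append((lam, v))
        # Deflation: subtract the component that has just been extracted.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return pairs


columns = [list(c) for c in zip(*X)]
z = [standardize(c) for c in columns]
n, p = len(X), len(columns)
corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(p)] for i in range(p)]
axes = top_axes(corr, 2)
# Factor scores of the 18 cars on the first two axes.
factors = [[sum(v[j] * z[j][i] for j in range(p)) for i in range(n)] for _, v in axes]

# Coordinates in the correlation circle: (r with factor 1, r with factor 2).
circle = {name: tuple(pearson(col, f) for f in factors)
          for name, col in zip(NAMES + ["PRIX"], columns + [PRIX])}
for name, (r1, r2) in circle.items():
    print(f"{name:6s} {r1:+.3f} {r2:+.3f}")
```

Every point lies inside the unit circle, and for the active variables the squared correlations with a factor sum to the corresponding eigenvalue, which links this plot to the eigenvalue table.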

We see that PRICE (PRIX), which is an illustrative variable, is highly correlated with the first factor. It is associated with "big" cars (with high horsepower, high weight, etc.).
The second factor is associated with the "sports" characteristic of the vehicles. The location of R-POID.PUIS in the scatterplot confirms this analysis.

Categorical illustrative variables. Active variables must be continuous for PCA. But we can use categorical illustrative variables in order to improve the interpretation of the factors.
As for the correlation scatterplot, we must first define the types of the variables using the DEFINE STATUS component. We set as TARGET the first two factors; then we set as INPUT the categorical illustrative variable, e.g. FINITION (finish quality of the cars: M = mediocre; B = good; TB = very good).
Then we insert the VIEW MULTIPLE SCATTERPLOT component.
The coordinates correspond to the conditional average of each category of the variable on each dimension.

We note that "big" cars also have very good finishing touches.

Illustrative examples. We do not use this option in our tutorial, but we can also subdivide the dataset into a learning sample and an illustrative sample. The first is used for the computation of the new dimensions. The second corresponds to new instances that we want to locate in this new representation space, for instance when we want to apply the results to another subpopulation.

Conclusion
The principal component analysis, and factor analysis in general, is useful for understanding the underlying structure of a tabular dataset. In this tutorial, we show how to implement this approach with Tanagra.
The ability to copy/paste the results into a spreadsheet is certainly one of the most interesting features of the software. Indeed, it gives us access to tools (sorting, formatting, etc.) in an environment that is well known to data processing experts. For example, sorting the various tables according to the contributions and the COS2 is really useful when we wish to interpret the dimensions.
