LESSON 34:
PRINCIPAL COMPONENT ANALYSIS
Students, we are going to deal with aspects of principal component analysis. We will work through the issues with the help of examples.
A (hypothetical) study was conducted by a bank to determine if special marketing programs should be developed for several key segments. One of the study's research questions concerned attitudes toward banking. The respondents were asked their opinion on a 0-to-9, agree-disagree scale, on the following questions:
1. Small banks charge less than large banks.
2. Large banks are more likely to make mistakes than small banks.
3. Tellers do not need to be extremely courteous and friendly; it's enough for them simply to be civil.
4. I want to be known personally at my bank and to be treated with special courtesy.
5. If a financial institution treated me in an impersonal or uncaring
way, I would never patronize that organization again.
Rotation
This is the second stage of the analysis and is optional.
Factor analysis can generate several solutions, each one being termed a rotation. Each time there is a rotation, the factor loadings change, as does the interpretation of the factors.
There are many rotation programs, e.g. Varimax (an orthogonal rotation).
Outputs
The most important items in the output are the factor loadings: the correlations between the factors and the variables, which are used to interpret each factor. The percentage-of-variance-explained criterion helps determine the number of factors to include.
Also included are some practical application exercises from the internet.
The following exercises are to be done over the two practical sessions.
You should be familiar with some of the early procedures. For
the EFA procedures, please refer to the notes earlier in this handout.
Saving a copy of the data file
Before you go any further, you should save a copy of the file
driving01.sav into your file space. You can find driving01.sav
by:
1. Open SPSS in the usual way, select Open existing file and More
files
2. In the Open file window, go to Psycho\ courses\ psy2005 \ spss\ , select the driving01.sav file and click on OK
3. Once the file is open, click on Save as and put it in my documents in PC files on Singer
Whenever you need the file again, you now have a copy from
which to work.
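If you prefer to work from a syntax window, the same steps can be sketched as follows (both paths here are placeholders; substitute the actual locations given above):

* Open the original data file from the shared drive (hypothetical path).
GET FILE='P:\psy2005\spss\driving01.sav'.
* Save a working copy into your own file space (hypothetical path).
SAVE OUTFILE='M:\my documents\driving01.sav'.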
Exploring the data set
1. Cross-Tabulation
Before you start any kind of analysis of a new data set, you should
explore the data so that you know what the variables are and what
each number actually means. If you move the cursor to the grey
cell at the top of a column, a label will appear, telling you what the
variable is.
The variables in the file are as follows: gender, age, area respondent
lives, length of time respondent has held a driving licence (in years
and months), annual mileage, preferred speed on a variety of
different roads at day and at night (motorways, dual carriageways,
A-roads, country lanes, residential roads and busy high street) and
finally, a series of scores relating to items on a personality trait
inventory.
You can use the descriptives and frequencies commands to investigate the data, but they cannot tell you everything. If we
wanted to find out how many women there are in the dataset
who live in rural areas, we must use a Crosstabs (Cross-Tabulation)
command:
4. Click on Analyze in the top menu, then select Descriptive
Statistics, and click on Crosstabs.
5. Select the two variables that you want to compare (in this case
gender and area), put one in the Row box, and one in the
Column box.
6. Click on Statistics, and check the Chi-Square box. Click on
Continue.
7. Click on OK.
The output tells us how many men and women in the data set
come from each type of area, and the chi-square option tells us
whether there are significantly different numbers in each cell.
However, it is not clear where these differences lie, so:
8. Click on Analyze in the top menu, then select Descriptive Statistics, and click on Crosstabs (so long as you haven't done anything else since the first Crosstabs analysis above, the gender and area variables should still be in the correct boxes; if not, move them into the Row and Column boxes, click on Statistics, and check the Chi-Square box. Click on Continue).
9. Click on Cells and check the Expected counts box (also, try
selecting the Row, Column and Total percentage boxes).
Click on Continue.
10. Click on OK.
Comparing the expected count with the observed count will tell
you whether or not there is a higher observed frequency than
expected in that particular cell. This will then tell you where the
significant differences lie.
Using the Crosstabs procedure, how many female respondents live in a rural area, and what percentage of the total sample do they make up? (10, 4.6%). How many male respondents are between the ages of 36 and 40? What percentage of the total sample do they constitute? (35, 16.1%).
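For reference, steps 4 to 10 can also be run from a syntax window rather than the menus; a minimal sketch, assuming the variables are simply named gender and area:

CROSSTABS
  /TABLES=gender BY area
  /STATISTICS=CHISQ
  /CELLS=COUNT EXPECTED ROW COLUMN TOTAL.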
2. Creating a Scale
You might want to sum people's scores on several items to create a kind of index of their attitude. For example, we know that some of the personality inventory items in the data set relate to the Thrill-Sedate Driver Scale (Meadows, 1994). These items are numbered 7 to 13 in the questionnaire (Appendix A) and var7 to var13 in the data set. How do we create a single scale score?
First of all, some of the items may have been counterbalanced, so we have to reverse the scoring on these variables before we add them together to give a single scale score. Currently, a high score may indicate a strong positive tendency on some of the items, whereas the opposite is true of other items. We need to ensure that a score of 5 represents the same tendency throughout the scale (in this case, a high Thrill driving style), so that the item scores may be added together to create a meaningful overall score.
Missing values
Make sure, before computing a new variable like this, that you have already defined missing values; otherwise these will be included in the scale score. For example, you would not want the value 99 ('no response') included in your scales, so defining it as missing will mean that that particular respondent will not be included in the analysis.
11. Double-click on the grey cell at the top of the relevant column
of data
12. Click on Missing Values
13. Select Discrete Missing Values and type in 99 into one of the
boxes
14. Click on Continue and OK
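In syntax form, steps 11 to 14 reduce to a single command; a sketch for the seven scale items (this assumes var07 to var13 sit next to each other in the file, since the TO keyword works on file order):

MISSING VALUES var07 TO var13 (99).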
Recoding Variable Scores
It is usually fairly clear which items need to be recoded. If strong
agreement with the item statement indicates positive tendency,
then that item is okay to include in the scale without recoding.
However, if disagreement with the statement indicates positive tendency, that item's scores must be recoded. Looking at the actual
questions in the questionnaire, it is clear that items var07, var08,
var09 and var10 should all be recoded (var11- var13 are okay, because
strong agreement implies a high Thrill driving style).
Follow these steps to recode each item and then compute a scale
composed of all item variables:
15. Go to Transform → Recode → Into Different Variables and select the first item variable (var07) that requires recoding
16. Give a name for the new recoded variable, such as var07r, and label it as 'reversed var07'
17. Set the new values by clicking Old and New Values and entering the old and new values in the appropriate boxes (adding each transformation as you go along), so that you finish up with 1 → 5; 2 → 4; 3 → 3; 4 → 2; and 5 → 1
18. Click Continue and then Change, and check that the transformation has worked by getting a frequency table for the old and new variables var07 and var07r. Have the values reversed properly? If not, then you may need to do it again!
Follow the same procedure for the other items in the scale that
need to be reversed
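Steps 15 to 18 can be sketched in syntax for all four items at once:

* Reverse-score the counterbalanced items into new variables.
RECODE var07 var08 var09 var10
  (1=5) (2=4) (3=3) (4=2) (5=1)
  INTO var07r var08r var09r var10r.
VARIABLE LABELS var07r 'reversed var07' var08r 'reversed var08'
  var09r 'reversed var09' var10r 'reversed var10'.
EXECUTE.
* Check the recoding by comparing old and new frequency tables.
FREQUENCIES VARIABLES=var07 var07r.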
Scale calculation
Once you have successfully reversed the counterbalanced item
variables, you can compute your scale.
19. Click on Transform → Compute, type a name for the scale (e.g. Thrill) in the Target Variable box, and type the following in the Numeric Expression box:
var07r + var08r + var09r + var10r + var11 + var12 + var13
20. Click on OK
Now take a look at your new variable (it will have appeared in a column on the far right of your data sheet) and get a descriptives analysis on it. You should find that the maximum and minimum values make sense in terms of the original values. The seven Thrill-Sedate items are scored between 1 and 5, so there should be no scores lower than 7 (i.e. 1 × 7) and none higher than 35 (i.e. 5 × 7). If there are scores outside these limits, perhaps you forgot to exclude missing values.
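In syntax, steps 19 and 20 and the descriptives check might look like this:

COMPUTE Thrill = var07r + var08r + var09r + var10r + var11 + var12 + var13.
EXECUTE.
* The minimum and maximum should lie between 7 and 35.
DESCRIPTIVES VARIABLES=Thrill
  /STATISTICS=MEAN MIN MAX.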
3. Checking the Scale's Internal Reliability
Checking the internal reliability of a scale is vital. It assesses how
much each item score is correlated with the overall scale score (a
simplified version of the correlation matrix that I talked about in
the lecture).
To check scale reliability:
21. Click on Analyze → Scale → Reliability Analysis
22. Select the items that you want to include in the scale (in this case, all the items between var07 and var13 that didn't require recoding in the earlier step, plus all the recoded ones; in other words, those listed in the previous scale-calculation step), and move them into the Items box.
23. Click on Statistics
24. Select Scale if item deleted and Inter-item Correlations
25. Click on Continue → OK
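If you prefer syntax, the same reliability analysis can be sketched as follows, using the recoded variable names from the scale-calculation step:

RELIABILITY
  /VARIABLES=var07r var08r var09r var10r var11 var12 var13
  /SCALE('Thrill') ALL
  /MODEL=ALPHA
  /STATISTICS=CORR
  /SUMMARY=TOTAL.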
In the output, you can first see the correlations between items in
the proposed scale. This is just like the correlation matrix referred
to in the lectures. Secondly, you will see a list of the items in the
scale with a certain amount of information about the items and
the overall scale. The statistic that SPSS uses to check reliability is Cronbach's Alpha, which takes values between zero and 1. The closer the value is to 1, the better, with acceptable reliability if Alpha exceeds about 0.7. The column on the far right will tell us if there are any items currently in the scale that don't correlate with the
rest. If any of the values in that column exceed the value for Alpha at the bottom of the table, then the scale would be better without that item; it should be removed from the scale and the Reliability Analysis run again. For this example, you should get a
value for Alpha of 0.7793, with none of the seven items requiring
removal from the scale.
Factor analysis of Driving01.sav
1. Orthogonal (Varimax) Rotation (Uncorrelated
Factors)
An orthogonal (varimax) analysis will identify factors that are
entirely independent of each other. Using the data in Driving01.sav
we will run a factor analysis on the personality trait items (var01 to
var20).
Use the following procedure to carry out the analysis:
27. Analyze → Data Reduction → Factor
28. Select all the items from var01 to var20 and move them into
the Variables box
29. Click on Extraction
30. Click on the button next to the Method box and select
Principal Axis Factoring from the drop-down list
31. Make sure there is a tick in the Scree Plot option
32. Click on Continue
33. Click on Rotation, select Varimax (make sure the circle is
checked)
34. Click on Options, select Sort by size and Suppress absolute
values less than 0.1 and then change the value to 0.3 (instead
of 0.1)
35. Click on Continue → OK.
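The whole procedure (steps 27 to 35) can also be sketched as a single syntax command:

FACTOR
  /VARIABLES var01 TO var20
  /MISSING LISTWISE
  /PRINT INITIAL EXTRACTION ROTATION
  /FORMAT SORT BLANK(.30)
  /PLOT EIGEN
  /CRITERIA MINEIGEN(1) ITERATE(25)
  /EXTRACTION PAF
  /ROTATION VARIMAX.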
Output
First, you have the Communalities. These are all okay, as there are
none lower than about 0.2 (anything less than 0.1 should prompt
you to drop that particular variable, as it clearly does not have
enough in common with the factors in the solution to be useful.
If you drop a variable, you should run the analysis again, but
without the problem variable).
The next table displays the Eigenvalues for each potential factor. You will have as many factors as there were variables to begin with, but this on its own does not give any kind of data reduction, so it is not very useful. The first four factors have eigenvalues greater than 1,
so SPSS will extract these factors by default (SPSS automatically
extracts all factors with eigenvalues greater than 1, unless you tell it
to do otherwise). In column 2 you have the amount of variance
explained by each factor, and in the next column, the cumulative
variance explained by each successive factor. In this example, the
cumulative variance explained by the first four factors is 52.4%.
You can ignore the remaining columns.
The Scree plot is displayed next. You can see that in this example, although four factors have been extracted (using the SPSS default criterion; see later), the scree plot shows that a 3-factor solution might be better: the big difference in the slope of the line comes after three factors have been extracted. You can see this more clearly if you place a ruler along the slope in the scree plot. The discontinuity between the first three factors and the remaining set is clear: they have a far steeper slope than the later factors. Perhaps three factors would be better than four? See section [iii] p.9 for further discussion of this issue.
Next comes the factor matrix, showing the loadings for each of
the variables on each of the four factors. Remember that this is for
unrotated factors, so move on to look at the rotated factor matrix
below it, which will be easier to interpret. Each factor has a number
of variables which have higher loadings, and the rest have lower
ones. Remember that we have asked SPSS to suppress or ignore
any values below 0.1, so these will be represented by blank spaces.
You should concentrate on those values greater than 0.3, as any
lower than this can also be ignored. To make things easier, you
could go back and ask SPSS to suppress values less than 0.3; that will clean up the rotated factor matrix and make it easier to interpret.
Finally comes the factor rotation matrix, which can also be ignored
(it simply specifies the rotation that has been applied to the factors).
2. Correlated Factors: Oblique (Oblimin) Rotation
You may have noticed that some of the questions in the
questionnaire seem to measure similar things (for example, the
law is mentioned in variable items that do not appear to load
heavily on the same factor). Two or more of the factors identified
in the last exercise may well correlate with one another, as personality
variables have a habit of doing. An orthogonal analysis may not
be the most logical procedure to carry out. Using the data in
Driving01.sav we will run an oblique factor analysis on the
personality trait items (var01 to var20), which will identify factors
that may be correlated to some degree.
Use the procedure described above, but when you click on the
Rotation button, instead of checking the Varimax option, check
the Direct Oblimin option instead. Compare the output from
this analysis with the output from the varimax analysis. The first
few sections will look the same, because both analyses use the
same process to extract the factors. The difference comes once the initial solution has been identified and SPSS rotates it in order to clarify the solution (by redistributing the variance across the factors).
Instead of a rotated factor matrix, you will have pattern and
structure matrices.
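In syntax, the only change from the varimax sketch above is the rotation subcommand:

FACTOR
  /VARIABLES var01 TO var20
  /MISSING LISTWISE
  /PRINT INITIAL EXTRACTION ROTATION
  /FORMAT SORT BLANK(.30)
  /PLOT EIGEN
  /CRITERIA MINEIGEN(1) ITERATE(25)
  /EXTRACTION PAF
  /ROTATION OBLIMIN.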
Look at the factors and the loadings in the pattern matrix (concentrate on loadings greater than ±0.3). Do they look the same as in the varimax solution? One thing that has changed is that, although the factors look similar, the loadings will have changed a bit, and not all load in the same way as before. For example, var02 ('These days a person doesn't really know quite who he can count on') no longer loads on factor 3, but only on factor 4. How does this change the interpretation of factors 3 and 4?
Finally comes the factor correlation matrix. In a varimax solution, you
can ignore the plus or minus signs in front of the factor loadings,
because the factors are entirely independent of one another (in
other words, uncorrelated). However, in oblique (oblimin)
analyses, we have to take these into account because the factors
correlate with one another to some extent, and therefore we need
to know if there is a positive or a negative relationship (correlation)
between the factors. The relationship between correlated factors
must inherently take into account the sign of the loadings. In this
example, the negative correlations are so small as to be unimportant
(correlations less than 0.1 are usually non-significant), and so this
is not an issue. However, you should be aware that this may not
always be the case. It may seem confusing at first, but working out
the logic behind the relationships between factors makes sense
when you look at the variable items that represent the factors (the
relevant questionnaire statements).
3. Extracting a Specific Number of Factors
Up to now, you have been letting SPSS decide how many factors
to extract, and it has been using the default criterion (called the
Kaiser criterion) of extracting factors with eigenvalues greater
than 1. Look at the second table in your output: four factors have
eigenvalues greater than 1, so SPSS extracts and rotates four factors.
However, this criterion doesn't always guarantee the optimal solution. We may have an idea of how many factors we should extract; the scree plot can give some heavy hints (as mentioned earlier). The scree plot is not exact: there is a degree of judgement in drawing these lines and deciding where the major change in slope comes, but with larger samples it is usually pretty reliable.
I reckon that three factors would lead to a more accurate solution than four, so try running the analysis again, but this time ask for a 3-factor solution by setting the Number of factors to extract to 3 in the Extraction options window. The solution fits quite well, with all variable items loading quite highly on only one factor, thus revealing a good simple structure.
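In syntax, forcing a 3-factor solution means replacing the eigenvalue criterion with a fixed number of factors:

FACTOR
  /VARIABLES var01 TO var20
  /PRINT EXTRACTION ROTATION
  /FORMAT SORT BLANK(.30)
  /CRITERIA FACTORS(3) ITERATE(25)
  /EXTRACTION PAF
  /ROTATION VARIMAX.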
Where to go from here? Further Exercises (not included in the original handout)
The following exercises don't introduce any new ideas or concepts, but should enable you to practise some of the techniques that are
covered earlier in this series of exercises. They should also help
you to see how the techniques from the three sections of the
PSY2005 Multivariate Statistics module fit together with one
another.
1. Exploring the Thrill-Sedate Scale Scores
Some of you were asking what to do with the Thrill-Sedate Driver
scale once you had calculated it. You could try comparing men vs
women in terms of the scale score. We might expect that men
would record higher scores than women, and this is the case.
However, an independent t-test shows us that this difference is
not significant - why?
36. The first step in answering this question is to produce a crosstabs table for gender vs age-group. Compare the observed values (count) and expected values for each cell. You'll see that there are more older men (i.e. fewer younger men) and more younger women (i.e. fewer older women) than would be expected in the sample.
37. Now run an ANOVA using the Thrill scale score as DV and
Age-group as IV. Previous research has found that younger
people record higher Thrill scores than older people. Does
this pattern appear in this sample?
38. If you run a two-way ANOVA with Thrill score as DV and
with Age and Gender as IVs, you may find an interaction
between them. What does the interaction mean?
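The t-test mentioned above and steps 37 and 38 might look like this in syntax (agegroup is a hypothetical variable name, and the gender codes 1 and 2 are assumptions; check your own data file):

* Independent t-test comparing men and women on the Thrill scale.
T-TEST GROUPS=gender(1 2)
  /VARIABLES=Thrill.
* Two-way ANOVA with age group and gender as IVs.
UNIANOVA Thrill BY agegroup gender
  /PRINT=DESCRIPTIVE
  /DESIGN=agegroup gender agegroup*gender.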
Can you now see why, for this particular sample, there is no
significant difference between men and women in terms of Thrill
scores? The male sample is made up of older men, while the
female sample is made up of younger women, so the scores will
be similar. This obviously emphasises how important it is to
ensure that your sample is representative.
2. Further factor analysis
Once the EFA procedures carried out during the factor analysis of
the data have identified the variables that load on each factor
(exercise 3), you could construct scales for the other two factors
from the items in the questionnaire that load on each factor (looking
at the scree plot, we can see that there is quite a neat 3-factor
solution, with each variable loading on only one of the factors).
39. First try to interpret the factors, based on the questionnaire
items (variables) that the factors load on.
40. Use the scale-building and reliability procedures described
earlier in these exercises to produce internally-reliable scales
which we may then use to describe differences between people.
41. You could put all three personality trait scores into a
Regression analysis and see how well they predict preferred
speed on different types of road.
42. You could also factor analyse the preferred speed data, to see
if there are any patterns in the way that people respond to the
different items. How could you interpret the resulting factors?
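Exercises 41 and 42 could start from syntax like the following (speed_mway_day, scale2, scale3 and the range speed01 TO speed12 are all hypothetical names; substitute the speed variables and the two extra scales you built):

* Predict preferred motorway speed from the three trait scales.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT speed_mway_day
  /METHOD=ENTER Thrill scale2 scale3.
* Factor analyse the preferred-speed items.
FACTOR
  /VARIABLES speed01 TO speed12
  /PLOT EIGEN
  /EXTRACTION PAF
  /ROTATION VARIMAX.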
Remember, this is Real World data, so the possibilities are endless
- you could come up with your own hypotheses, based on your
own ideas about how people drive.
Cris Burgess (2001)
Notes