
Brief Article

Journal of Psychoeducational Assessment, 2016, Vol. 34(4), 397-403
© The Author(s) 2015
DOI: 10.1177/0734282915609843

Are Gender Differences in Spatial Ability Real or an Artifact? Evaluation of Measurement Invariance on the Revised PSVT:R

Yukiko Maeda1 and So Yoon Yoon2

Abstract
We investigated the extent to which the observed gender differences in mental rotation ability
among 2,468 engineering freshmen at a Midwest public university could be attributed to gender
bias in the test. The Revised Purdue Spatial Visualization Tests: Visualization of Rotations
(Revised PSVT:R) is a spatial test frequently used to measure students' spatial visualization ability
in three-dimensional mental rotation in science, technology, engineering, and mathematics fields.
With two major approaches for evaluating measurement invariance, we found that five items
in the Revised PSVT:R showed a difference in response pattern by gender, but the impact
of these biased items on the total scores of the scale was marginal. Our findings support the
equitable use of the Revised PSVT:R by gender for educational research and practices.

Keywords
measurement invariance, gender difference, spatial ability, fair testing, the Revised PSVT:R

Spatial ability refers to "the ability to generate, retain, retrieve, and transform well-structured
visual images" (Lohman, 1996, p. 98). Mental rotation ability, a sub-component of spatial ability,
involves cognitive processing to mentally rotate visual stimuli, often two-dimensional (2-D) or
three-dimensional (3-D) objects, in the direction indicated by a comparison stimulus or
instruction (Linn & Petersen, 1985; Uttal et al., 2013). Because the literature
suggests that spatial ability has shown a positive link to academic and career success, particularly
in science, technology, engineering, and mathematics (STEM) fields (e.g., Wai, Lubinski, &
Benbow, 2009), and the ability is a prerequisite for developing quantitative reasoning skills, a
spatial test often has been used to predict students' academic success (e.g., Sorby, Casey, Veurink,
& Dulaney, 2013). A review of relevant literature also indicated the malleability of spatial skills
with appropriate interventions, and discussed the possibility that spatially enriched education
may increase the opportunity for participation in STEM disciplines for all students, especially

1Purdue University, West Lafayette, IN, USA


2Texas A&M University, College Station, TX, USA

Corresponding Author:
Yukiko Maeda, College of Education, Department of Educational Studies, Purdue University, West Lafayette, IN
47907, USA.
Email: ymaeda@purdue.edu
398 Journal of Psychoeducational Assessment 34(4)

minority and/or female students pursuing careers in STEM fields (Hill, Corbett, & St. Rose,
2010; Uttal et al., 2013).
Although research provides strong evidence that spatial ability is crucial for success in STEM
fields, the existence of gender differences in mental rotation ability, with males scoring higher,
has widely been supported by meta-analyses (e.g., Linn & Petersen, 1985; Maeda & Yoon, 2013).
For example, Linn and Petersen (1985) reported an effect size (i.e., the standardized mean differ-
ence between males and females on spatial tasks) of 0.73 favoring males. Recent evidence suggests
that this trend has not changed for the last three decades (Maeda & Yoon, 2013). In addition, the
gender differences are likely to remain after spatial training, although both females and males tend
to improve their performance to the same extent through the intervention (e.g., Uttal et al., 2013).
Although the literature discusses several possible reasons for gender differences in mental
rotation ability, Maeda and Yoon (2013) found that employed assessment procedures moderate
the magnitude of the gender difference. For example, males' outperformance on a spatial test
tends to increase with a stringent time limit for testing, as opposed to no time limit or relaxed
time limit conditions. The procedural impact on gender differences is of particular interest
because it suggests the observed gender differences may partially derive from measurement
errors or construct-irrelevant factors resulting from employed procedures for measuring spatial
ability. Investigation of measurement bias against females on spatial tests is critical because the
use of a biased test may explain persistent underperformance of females on spatial tasks, conse-
quently leading to underrepresentation of females in STEM fields. Therefore, investigating a
possible gender bias inherent in measuring spatial ability is imperative to support the fair use of
the test in educational settings (AERA, APA, & NCME, 2014).
Although various spatial tests were used in STEM research and education, the psychometric
evaluations of these tests are limited, particularly regarding fairness of the test or items for mak-
ing sound educational decisions (Maeda & Yoon, 2013). Given the lack of investigation on psy-
chometric properties of spatial tests, particularly regarding a potential gender bias, we conducted
this study to estimate the extent to which the observed gender differences result from bias in the
instrument used for measuring spatial performance. For this investigation, we selected the
Revised Purdue Spatial Visualization Tests: Visualization of Rotations (Revised PSVT:R; Yoon,
2011) due to its pervasive use in STEM education research, as well as supporting evidence for
high reliability and validity from past research on the instrument (Maeda & Yoon, 2013; Maeda,
Yoon, Kim-Kang, & Imbrie, 2013; Yoon, 2011).

Method
Data and Data Analysis
The Revised PSVT:R measures the 3-D mental rotation ability of individuals 13 years or older
(Yoon, 2011). The Revised PSVT:R contains 30 items consisting of 13 symmetrical and 17 non-
symmetrical 3-D objects that are drawn in a 2-D isometric format. Each item asks a respondent
to mentally rotate an object in the same direction as visually indicated in the instructions. The
respondent is then asked to select the right answer from five possible response options.
We used archival data of the Revised PSVT:R obtained from 2,468 engineering freshmen (of
those, NM = 1,888 [76.5%] were males, NF = 580 [23.5%] were females) who took the spatial test
in the fall of 2010 or 2011. Cronbach's alphas for both gender groups were almost equal: .816 for
females and .834 for males.
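For reference, the internal-consistency coefficient reported above is Cronbach's alpha; using generic notation not taken from the article (k items, item variances σ²_Yi, and total-score variance σ²_X), it can be written as

$$\alpha \;=\; \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right),$$

computed here separately within each gender group.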
We chose two major approaches to examine the potential bias and measurement invariance
by gender: differential item functioning (DIF) analyses and a multiple-groups confirmatory fac-
tor analysis (MCFA). We began with a series of descriptive analyses to examine how observed
score distributions differ by gender. Next, we conducted three methods of DIF analyses for
convergence of findings across methods: (a) the Mantel-Haenszel (M-H) method, (b) the logistic
regression method, and (c) a three-parameter logistic (3-PL) item response model. Because of the
large sample size used in the study, we used both statistical and practical significance for identi-
fying a DIF item. We selected these three methods because of their distinctive differences in
procedures to identify items that show a DIF. We used DIFAS software (Penfield, 2005, 2012) to
run M-H analyses, using the total score on 30 items as a matching variable, and SPSS 22 (2013)
to run a series of logistic regression analyses. Although there are a variety of approaches for item
response theory (IRT)-based DIF analyses, they tend to produce relatively consistent findings
(e.g., Yang et al., 2011). Thus, we selected one approach, that is, a likelihood-ratio test using
freeware called IRTLRDIF (Thissen, 2001), in the current investigation. We also conducted a
differential test functioning (DTF) analysis to evaluate whether the set of items as a whole (i.e., the
test) is biased by gender, because bias at the individual item level may have little practical impact if
the test as a whole shows no bias.
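To illustrate the first of these approaches, the following is a minimal Python sketch of an M-H DIF check for a single dichotomous item, stratifying on the total score as the matching variable. It is not the DIFAS implementation; the function name, the continuity correction, and the handling of sparse strata are our own simplifying choices.

```python
import numpy as np

def mantel_haenszel_dif(item, total, group):
    """Mantel-Haenszel DIF statistics for one dichotomous item.

    item  : 0/1 responses to the studied item
    total : matching variable (e.g., total score on the 30 items)
    group : 0 = reference group (e.g., males), 1 = focal group (e.g., females)
    Returns the common odds ratio, its log (LOR), and the M-H chi-square.
    """
    item, total, group = map(np.asarray, (item, total, group))
    num = den = 0.0               # accumulators for the common odds ratio
    a_obs = a_exp = a_var = 0.0   # accumulators for the chi-square statistic
    for k in np.unique(total):
        s = total == k
        A = np.sum((group[s] == 0) & (item[s] == 1))   # reference, correct
        B = np.sum((group[s] == 0) & (item[s] == 0))   # reference, incorrect
        C = np.sum((group[s] == 1) & (item[s] == 1))   # focal, correct
        D = np.sum((group[s] == 1) & (item[s] == 0))   # focal, incorrect
        N = A + B + C + D
        if N < 2 or (A + B) == 0 or (C + D) == 0:
            continue                                   # stratum carries no information
        num += A * D / N
        den += B * C / N
        m1, m0 = A + C, B + D                          # column totals (correct/incorrect)
        a_obs += A
        a_exp += (A + B) * m1 / N
        a_var += (A + B) * (C + D) * m1 * m0 / (N**2 * (N - 1))
    alpha_mh = num / den
    chi_square = (abs(a_obs - a_exp) - 0.5) ** 2 / a_var   # with continuity correction
    return alpha_mh, np.log(alpha_mh), chi_square
```

A large chi-square together with a log-odds ratio well away from zero is the pattern that flags an item, as for Item 6 in Table 1.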
Finally, we ran MCFA with the robust weighted least squares (WLSMV) estimator and theta
parameterization in Mplus 7.0 (Muthén & Muthén, 1998-2012). Because each Revised PSVT:R
item yields a dichotomous (correct/incorrect) response, WLSMV is appropriate for generating
accurate parameter estimates for dichotomous indicators (Brown, 2006). We first exam-
ined the equivalence of factor structure of both gender groups. Then, we evaluated the equivalence
of measurement parameters (i.e., factor loadings and thresholds in tandem) using chi-square tests
for the difference between restrictive and less restrictive models. Because the result of the chi-
square test was significant, we tested partial measurement invariance models by releasing equal-
ity constraints on the factor loadings and thresholds of items with large modification indices, each in
turn and together (Brown, 2006; Sass, 2011).
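The comparison of nested models described above rests on a chi-square difference test. A minimal sketch of that logic is below; note that with WLSMV estimation the published difference test is the scaled DIFFTEST value from Mplus, so this unscaled version only illustrates the comparison and will not reproduce the values in Table 2. The fit statistics in the usage line are hypothetical.

```python
from scipy.stats import chi2

def chi_square_difference(chi2_restricted, df_restricted, chi2_free, df_free):
    """Unscaled chi-square difference test between two nested CFA models.

    The more restrictive model (equal loadings and thresholds across groups)
    is compared against the less restrictive model; a significant result
    suggests that at least some parameters are non-invariant.
    """
    delta = chi2_restricted - chi2_free
    delta_df = df_restricted - df_free
    p_value = chi2.sf(delta, delta_df)
    return delta, delta_df, p_value

# Hypothetical fit statistics for two nested models (not taken from Table 2)
print(chi_square_difference(1250.4, 438, 1210.1, 410))
```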

Results and Discussion


Consistent with the majority of the extant literature (e.g., Linn & Petersen, 1985; Maeda & Yoon,
2013), we observed gender differences on both raw, t(914.4) = 12.20, p < .01, and IRT-based
ability scores, t(964.6) = 12.75, p < .01, on the Revised PSVT:R. On average, male students
answered 77% of the items correctly (M = 23.13, SD = 4.97), whereas female students correctly
answered 67% of the items (M = 20.11, SD = 5.29). The magnitude of the difference was relatively
large (Hedges' g = .60), similar in size to the estimate (Hedges' g = .57) reported in the meta-
analytic study by Maeda and Yoon (2013).
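The effect size reported above can be reproduced from the summary statistics with the usual standardized-mean-difference formula. The sketch below is a generic Python illustration (the function name is ours), applying the pooled standard deviation and the small-sample correction for Hedges' g to the means, standard deviations, and group sizes given in the text.

```python
import numpy as np

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference with Hedges' small-sample correction."""
    pooled_sd = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd                    # Cohen's d
    correction = 1 - 3 / (4 * (n1 + n2) - 9)     # approximate correction factor
    return d * correction

# Raw-score summaries reported above: males vs. females
print(round(hedges_g(23.13, 4.97, 1888, 20.11, 5.29, 580), 2))  # ~0.60
```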
Although most items listed in Table 1 showed minor DIF, Items 6 and 14 showed substantial
gender bias. For Item 6, the M-H chi-square test was significant, and the Educational Testing
Service (ETS) classification (Zwick, 2012) was C, indicating large DIF. The logistic regression
and IRT-based results were congruent in supporting uniform DIF. Items 6 and 14 are moderately easy;
82.2% of respondents (N = 2,468) answered Item 6 correctly, and 78.3% answered Item 14 cor-
rectly. These items seem to be more difficult for females than males with the same ability level.
These two items showed differences only in item difficulty; differences in the item discrimination
and guessing parameters were not statistically significant. However, the result of the DTF analysis
indicates that the weighted variance of DIF across the 30 items is 0.05, suggesting that the impact
of the identified DIF items on the total test function might be negligible.
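The ETS letter categories in Table 1 come from flagging rules defined on the ETS delta scale, where MH D-DIF = -2.35 ln(alpha_MH). The sketch below encodes the commonly cited version of those rules; the exact operational criteria are described in Zwick (2012), so the thresholds and significance tests here should be read as an approximation rather than a reimplementation of the procedure used in the study.

```python
from scipy.stats import norm

def ets_category(mh_d_dif, se):
    """Approximate ETS A/B/C flagging rule for Mantel-Haenszel DIF.

    mh_d_dif : MH D-DIF, i.e., -2.35 * ln(alpha_MH) on the ETS delta scale
    se       : standard error of MH D-DIF
    """
    abs_d = abs(mh_d_dif)
    differs_from_zero = abs_d / se > norm.ppf(0.975)    # two-sided test against 0
    exceeds_one = (abs_d - 1.0) / se > norm.ppf(0.95)   # one-sided test against 1
    if abs_d >= 1.5 and exceeds_one:
        return "C"   # large DIF
    if abs_d >= 1.0 and differs_from_zero:
        return "B"   # moderate DIF
    return "A"       # negligible DIF
```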
Table 2 shows the results of the MCFA. Overall fit statistics for the one-factor structure
support a good model fit for both gender groups, and the data from the women showed better fit
indices than those from the men. The chi-square difference test between the two models was
significant, and the modification indices suggested that 11 out of 30 items (including Items 6 and 14)
might contribute to non-invariance. However, when measurement parameters of three items (Items 13, 15,
and 16 with the largest modification indices) were freely estimated for partial invariance, the
chi-square test yielded a non-significant result.

Table 1. The Summary of the Items Identified by Three DIF Analyses.

                M-H                                      Logistic regression             IRT likelihood ratio statistic (G²)
Item   χ²       LOR    LOR (Z)   CDR   ETS     Uniform χ²   Non-uniform χ²     a      b      c
5      5.17     0.38   2.33      F     A
6      30.08    0.68   5.42      F     C       34.04        6.5                1.4    33.5   0.3
8      5.50     0.29   2.39      F     A       6.86
13     11.90    0.38   3.51      F     A       11.59                           1.2    10.9   0.2
14     16.51    0.49   4.06      F     B       18.52                           0.3    16.7   0.0
15     7.23     0.32   2.72      F     A       8.57         10.11              0.3    0.7    10.5
16                                                          7.96               1.5    0.1    7.6
17     9.09     0.36   3.08      F     A       10.85                           0.5    0.4    8.9
22                                                          12.06
23                                                          6.39
29     8.69     0.34   3.00      F     A       10.265                          0.0    2.0    8.0

Note. DIF = differential item functioning; M-H = Mantel-Haenszel method; IRT = item response theory; LOR = log-odds ratio; LOR (Z) = standardized log-odds ratio; CDR = combined decision rule; ETS = Educational Testing Service (ETS) categorization scheme (Zwick, 2012).

Table 2. Tests of Measurement Invariance of the Revised PSVT:R by Gender.

                                         χ²          df    Δχ²diff   Δdf   RMSEA   CFI    TLI
One group solution with both genders     1,623.06*   405                   .035    .928   .923
Single group solutions
  Male (NM = 1,888)                      1,490.32*   405                   .038    .905   .898
  Female (NF = 580)                        577.58*   405                   .027    .942   .937
Measurement invariance
  Least restrictive model                1,961.93*   810                   .034    .921   .915
  More restrictive model                 1,976.33*   838   63.33*    28    .033    .922   .919
Partial measurement invariance
  Re-specified model                     1,931.54*   835   30.47     25    .033    .924   .921

Note. Revised PSVT:R = Revised Purdue Spatial Visualization Tests: Visualization of Rotations; RMSEA = root mean
square error of approximation; CFI = comparative fit index; TLI = Tucker-Lewis index.
*p < .001.

In practice, having measurement parameter invariance across groups on all the items in an
instrument is rare (Schmitt & Kuljanin, 2008). Furthermore, the current literature has not reached a
consensus on how much partial measurement invariance is acceptable
(Schmitt & Kuljanin, 2008). Because MCFA heavily relies on chi-square tests to evaluate mea-
surement invariance for items, even small differences will reach statistical significance with large
sample sizes. This might be the situation for the current study, as our sample size (N = 2,468)
provides considerable power for the chi-square test. It seems that retaining the three items does
not threaten the content and construct validity of the Revised PSVT:R. If there is a threat, it
should be minimal (Byrne, Shavelson, & Muthén, 1989), because gender differences in average
factor scores remain after statistically controlling for the functional differences in these items.
Table 3. Characteristics of Items That Showed DIF by Two Evaluation Approaches.

                                                          % correct        Unstandardized factor loading      Unstandardized threshold
                                                          by gender        (standardized factor loading)      (standardized threshold)
Item   Involved rotation                                  Male    Female   Male            Female             Male            Female
6      One 90° rotation around one axis                   86.9    66.9     1.003 (0.638)   1.003 (0.441)      0.570 (0.729)   0.570 (0.512)
13     One 180° rotation around one axis                  69.5    51.7     0.815 (0.376)   1.007 (0.443)      0.319 (0.295)   0.048 (0.043)
14     One 180° rotation around one axis                  82.7    63.8     1.052 (0.606)   1.052 (0.459)      0.490 (0.567)   0.490 (0.436)
15     One 90° rotation around one axis and another 90°   74.8    71.2     1.108 (0.483)   0.589 (0.277)      0.447 (0.392)   0.582 (0.559)
       rotation around a different axis
16     One 90° rotation around one axis and another 90°   80.6    74.1     1.358 (0.560)   0.810 (0.369)      0.652 (0.540)   0.697 (0.648)
       rotation around a different axis

Note. DIF = differential item functioning.


To verify the conclusion, we conducted a small simulation study with 1,000 replications to
examine the extent to which female average scores differ under the two conditions: (a) IRT item
parameters of Items 6, 13, 14, 15, and 16 show DIF as reported in Table 1 (the "biased" condi-
tion) and (b) item parameters of these items are equalized, so there is no DIF by gender (the
"unbiased" condition). The results showed that the average difference between the "biased" and
"unbiased" female group means was 0.07 (SD = 0.14), and the mean comparison by a t test was not
significant in any of the 1,000 replications. Thus, the observed gender differences on Revised PSVT:R
scores were not affected by the item bias.
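The design of this simulation can be sketched as follows. The item parameters below are placeholders rather than the calibrated Revised PSVT:R estimates (which are not reproduced in this article), so the resulting numbers will not match the reported 0.07; only the logic, generating female total scores under a "biased" and an "unbiased" parameter set and comparing the means across 1,000 replications, mirrors the study.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def simulate_scores(a, b, c, n, rng):
    """Simulate 3-PL total scores for n examinees with ability theta ~ N(0, 1)."""
    theta = rng.normal(0.0, 1.0, size=(n, 1))
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))  # 3-PL with D = 1.7
    return (rng.random(p.shape) < p).sum(axis=1)

# Placeholder parameters for a 30-item test; NOT the estimates from the study.
n_items, n_females, n_reps = 30, 580, 1000
a = rng.uniform(0.8, 1.6, n_items)     # discrimination
b = rng.normal(0.0, 1.0, n_items)      # difficulty
c = np.full(n_items, 0.2)              # guessing
b_biased = b.copy()
b_biased[:5] += 0.4                    # hypothetical difficulty shift on five "DIF" items

diffs, n_significant = [], 0
for _ in range(n_reps):
    biased = simulate_scores(a, b_biased, c, n_females, rng)
    unbiased = simulate_scores(a, b, c, n_females, rng)
    diffs.append(unbiased.mean() - biased.mean())
    n_significant += ttest_ind(unbiased, biased).pvalue < .05
print(np.mean(diffs), np.std(diffs), n_significant)
```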
Although our findings offer support for the equitable use of the Revised PSVT:R for gender
comparison, a substantial question remains: Why do some items function differently by gender?
Table 3 summarizes the characteristics of these five items along with the percent correct and fac-
tor loading by gender. These items have little in common regarding item characteristics, includ-
ing the shape of objects and rotations involved. Therefore, further investigation may be necessary
to scrutinize how cognitive processes would differ by gender as functions of item features in
experimental settings (i.e., the shape of the object, the direction and angle of rotation, the com-
plexity of rotating tasks [single vs. multiple rotations], etc.). Characteristics of the gender-DIF
items reported in Table 3 should provide insights into such investigations. This line of research
will also help explain gender differences in mental rotation ability and may contrib-
ute to identifying the source of gender differences.
Furthermore, because we used the archival data obtained from freshmen in engineering at a
public university in the Midwest, we acknowledge that the data do not represent the college
population in general. Therefore, although gender differences with higher average scores by
males tend to be observed in different majors (Yoon, 2011), caution is required when generaliz-
ing our results to other college populations. Finally, the current investigation only concludes that
the Revised PSVT:R is unbiased for gender comparison. Further evaluation of measurement
invariance across different subgroups of the population will promote the equitable use of the
Revised PSVT:R, particularly for high-stakes decisions such as assignment to a remedial course.

Declaration of Conflicting Interests


The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or
publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

References
American Educational Research Association (AERA), American Psychological Association (APA), &
National Council on Measurement in Education (NCME). (2014). Standards for educational and psy-
chological testing. Washington, DC: American Educational Research Association.
Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York, NY: Guilford Press.
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and
mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456-466.
doi:10.1037/0033-2909.105.3.456
Hill, C., Corbett, C., & St. Rose, A. (2010). Why so few? Women in science, technology, engineering, and
mathematics. Washington, DC: American Association of University Women. Retrieved from http://
files.eric.ed.gov/fulltext/ED509653.pdf
Linn, M. C., & Petersen, A. C. (1985). Emergence and characterization of sex differences in spatial ability:
A meta-analysis. Child Development, 56, 1479-1498. doi:10.2307/1130467
Lohman, D. F. (1996). Spatial ability and g. In I. Dennis & P. Tapsfield (Eds.), Human abilities: Their
nature and measurement (pp. 97-116). Mahwah, NJ: Lawrence Erlbaum.
Maeda, Y., & Yoon, S. Y. (2013). A meta-analysis on gender differences in mental rotation ability measured
by the Purdue Spatial Visualization Tests: Visualization of Rotations (PSVT:R). Educational Psychology
Review, 25(1), 69-94. doi:10.1007/s10648-012-9215-x
Maeda, Y., Yoon, S. Y., Kim-Kang, K., & Imbrie, P. K. (2013). Psychometric properties of the Revised
PSVT:R for measuring first-year engineering students' spatial ability. International Journal of
Engineering Education, 29, 763-776.
Muthén, L. K., & Muthén, B. O. (1998-2012). Mplus user's guide (7th ed.). Los Angeles, CA: Author.
Penfield, R. D. (2005). DIFAS: Differential item functioning analysis system. Applied Psychological
Measurement, 29, 150-151. doi:10.1177/0146621603260686
Penfield, R. D. (2012). DIFAS 5.0: User's manual. Retrieved from http://erm.uncg.edu/wp-content/
uploads/2012/07/DIFASManual_V5.pdf
Sass, D. A. (2011). Testing measurement invariance and comparing latent factor means within a con-
firmatory factor analysis framework. Journal of Psychoeducational Assessment, 29, 347-363.
doi:10.1177/0734282911406661
Schmitt, N., & Kuljanin, G. (2008). Measurement invariance: Review of practice and implications. Human
Resource Management Review, 18, 210-222. doi:10.1016/j.hrmr.2008.03.003
Sorby, S., Casey, B., Veurink, N., & Dulaney, A. (2013). The role of spatial training in improving spatial
and calculus performance in engineering students. Learning and Individual Differences, 26, 20-29.
doi:10.1016/j.lindif.2013.03.010
Thissen, D. (2001). IRTLRDIF v2.0b. Retrieved from http://www.swmath.org/software/11676
Uttal, D. H., Meadow, N. G., Tipton, E., Hand, L. L., Alden, A. R., Warren, C., & Newcombe, N. S. (2013).
The malleability of spatial skills: A meta-analysis of training studies. Psychological Bulletin, 139, 352-
402. doi:10.1037/a0028446
Wai, J., Lubinski, D., & Benbow, C. P. (2009). Spatial ability for STEM domains: Aligning over 50 years
of cumulative psychological knowledge solidifies its importance. Journal of Educational Psychology,
101, 817-835. doi:10.1037/a0016127
Yang, F. M., Heslin, K. C., Mehta, K. M., Yang, C. W., Ocepek-Welikson, K., Kleinman, M., . . . Teresi,
J. A. (2011). A comparison of item response theory-based methods for examining differential item
functioning in object naming test by language of assessment among older Latinos. Psychological Test
and Assessment Modeling, 53, 440-460.
Yoon, S. Y. (2011). Psychometric properties of the Revised Purdue Spatial Visualization Tests: Visualization
of Rotations (Doctoral dissertation). Retrieved from ProQuest Dissertations and Theses. (Order No.
3480934).
Zwick, R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules,
minimum sample size requirements, and criterion refinement. Princeton, NJ: Educational Testing
Service.
