
Canadian Journal of Behavioural Science

2007, Vol. 39, No. 3, 184-201

Copyright 2007 by the Canadian Psychological Association


DOI: 10.1037/cjbs2007015

Socially Desirable Responding Does Moderate Personality Scale Validity Both in Experimental and in Nonexperimental Contexts

RONALD R. HOLDEN, Queen's University

Abstract
The influence of socially desirable responding on the validity of self-reported personality is examined in three studies that involve 1,056 participants, Five-Factor Model personality scales, and a psychometrically strong measure of impression management. Findings indicate that, whereas experimentally induced faking produces extremely strong validity-moderating effects for impression management, such effects are altered, but still significant and of large effect size, for naturally occurring variations in socially desirable responding. Attenuated levels of significance for naturally occurring socially desirable responding as a moderator may relate to the lack of construct validity for the measurement of impression management and to the reduced statistical power for small sample sizes. It is concluded that the dismissal of socially desirable responding as an issue for self-report personality scale validity is premature.
Résumé
The influence of socially desirable responding on the validity of self-reported personality is examined in three studies involving 1,056 participants, Five-Factor Model personality scales, and a psychometrically strong measure of impression management. The results indicate that experimentally induced faking generates extremely strong validity-moderating effects for impression management, but that such effects are altered, while remaining significant and of large effect size, for naturally occurring variations in socially desirable responding. Attenuated levels of significance for naturally occurring socially desirable responding as a moderator may be linked to a lack of construct validity in the measurement of impression management and to reduced statistical power owing to small sample size. The author concludes that the dismissal of socially desirable responding as an issue for the validity of self-reported personality scales is premature.

Does socially desirable responding affect the validity of self-report personality scales? Although the question has long been debated, no consensus exists. Some
researchers have shown that socially desirable
responding is deleterious to personality scale validity
in real-world (e.g., Rosse, Stecher, Miller, & Levin,
1998) and simulated (e.g., Douglas, McDaniel, &
Snell, 1996; Jackson, Wroblewski, & Ashton, 2000) circumstances, while others suggest that the detrimental
results from artificial assessment scenarios do not
generalize to actual situations (Ones & Viswesvaran,
1998; Ones, Viswesvaran, & Reiss, 1996; Smith &
Ellingson, 2002). Validity, in some real-world contexts, has not appeared compromised by socially
desirable responding (e.g., Barrick & Mount, 1996;
Hough, Eaton, Dunnette, Kamp, & McCloy, 1990)
and, consequently, socially desirable responding has
been dismissed by some (Ones et al., 1996) as a red
herring. The current paper challenges the dismissal
of socially desirable responding, arguing that such a
judgment is premature.
The dismissal of socially desirable responding as
an issue for self-report is predicated on a putative
lack of generalizability from experimentally induced
faking investigations to studies where naturally
occurring impression management may occur. The
present research contributes by focusing on the construct validity associated with measuring the impression management component of socially desirable
responding. In three studies, I differentiate between
the effect of experimentally induced faking and the
influence of naturally occurring socially desirable
responding on a prominent scale that purports to
measure an impression management component (i.e.,
faking) of socially desirable responding. In the first
study, using a state-of-the-art impression management scale, psychometrically strong personality
scales, and recommended analytic procedures, I
demonstrate that a scale of socially desirable
responding can moderate validity. A second study,
using the same predictor and socially desirable


Socially Desirable Responding 185


responding measures and analytic techniques,
demonstrates that socially desirable responding may
not appear to moderate validity. Secondary analyses,
however, demonstrate that the moderating effect of
socially desirable responding is present and is,
indeed, of a large effect size. Further, more detailed
analyses of the first two studies indicate a source of
the apparent discrepancy and challenge the adequacy
of the scale of socially desirable responding. A third
study articulates why the scale of socially desirable
responding, despite evidence for its construct validity, fails as an adequate measure for assessing the
moderating effects of impression management. This
research contributes by demonstrating that the
understanding of the essence of socially desirable
responding and impression management is incomplete. Socially desirable responding can compromise personality scale validity both in experimental and in nonexperimental contexts and cannot be readily dismissed: it is a complex phenomenon with multifaceted relationships to test validity.
Self-report personality inventories are common in
psychology. Personality scales are used routinely in
personnel, counseling, clinical, and research settings.
Sources of invalidity for these scales are potential
concerns for test developers and users. Prominent
response biases that could compromise validity
include deviant responding, careless responding,
consistent responding, omitting items, acquiescence,
extremity bias, and socially desirable responding
(Paulhus, 1991). As an issue, socially desirable responding biases have a long history (e.g., Edwards, 1990; Nicholson & Hogan, 1990) and are now viewed as multifactorial (Holden & Fekken, 1989; O'Grady, 1988; Paulhus, 1984). In particular,
Paulhus (1984) differentiates socially desirable
responding into unconscious, self-deceptive enhancement and conscious, impression management
processes. He further indicates (Paulhus, 1998, 2002)
that, related to faking, impression management is an
important issue for personality assessment.
Reviews of the effect of faking on personality
scales (e.g., Hough & Oswald, 2000) indicate that,
although instructed faking can affect scale scores, the
effect of naturally occurring distortion is uncertain.
Further, although instructed faking may alter personality scales' latent structure, this effect tends to be
smaller in natural (e.g., job applicant) settings. In
addition, Hough and Oswald indicate that, despite
instructed faking resulting in lower validities for self-report scales, job applicant settings produce the same
or only slightly lower validities than for job incumbents assessed in research settings.
Does impression management (i.e., faking or conscious response distortion) affect self-reported personality? Consider that, although meta-analysis indicates that personality scale scores are altered approximately half a standard deviation by faking instructions, findings may neither generalize to real-world applications nor address predictive validity (Viswesvaran & Ones, 1999). Next, consider that some
results (e.g., Piedmont, McCrae, Riemann, &
Angleitner, 2000) find that, under standard conditions, validity scales do not moderate the association
between self-report personality scales and criterion
measures. Finally, consider that research on the factor
structure of personality measures (e.g., Ellingson,
Smith, & Sackett, 2001; Marshall, De Fruyt, Rolland,
& Bagby, 2005) does not indicate differences, for job
applicants, between high and low scorers on scales of
socially desirable responding (but see Brown &
Barrett, 1999, and Schmit & Ryan, 1993, for dissenting
evidence). Consequently, it is implied that associations among personality scales are not subject to a
bias associated with natural positive self-presentation. Indeed, this perspective also has some empirical
support for actual criterion validity in real-world
contexts. Using correlations between personality
scales and criteria, two notable studies have produced null results for the influence of socially desirable responding. Hough et al. (1990) reported that: a)
when instructed to, army personnel could fake on a
personality measure; b) this faking did not moderate
criterion validity (i.e., the correlation of personality
scales with criteria); and c) for another sample of
over 9,000 enlisted personnel, scales on this personality measure had criterion validity (i.e., correlated
with various performance measures). Interestingly, in
concluding that personality scale validity is not compromised by socially desirable responding, these
researchers do also indicate that army applicants did
not show greater socially desirable responding than
army incumbents. In another prominent study,
Barrick and Mount (1996) demonstrated that socially
desirable responding was significantly associated
with personality predictor measures but, for a sample
of 286 transportation job applicants, socially desirable responding did not influence the association
between these predictors and job-relevant criteria.
The studies by Hough et al. (1990) and Barrick
and Mount (1996) represent particularly influential
investigations. They are real-world studies; they
compute actual criterion validities; and in the case of
Hough et al., they involve thousands of participants
comprising job incumbents and employee applicants
who, putatively, seem naturally inclined to self-present favourably. In contrast to laboratory research
that shows effects on personality scale mean scores

TABLE 1
Scale Descriptive Statistics as a Function of Instructional Condition

                          Instructions (n = 105 per group)
                        Standard        Honest          Fake Good       Fake Bad
Scale                   M      SD       M      SD       M      SD       M      SD       F(3, 416)   η²
Neuroticism             23.29  8.36     22.42  9.18     10.51  6.91     38.59   6.61    226.47 **   .62
Extraversion            31.95  5.64     32.32  6.27     34.49  5.93     14.61   7.39    220.66 **   .62
Openness                28.94  6.70     28.50  6.52     24.40  6.49     23.36   7.75     17.73 **   .11
Agreeableness           30.82  6.36     33.08  6.47     28.94  7.67     17.50  10.30     81.69 **   .37
Conscientiousness       32.13  6.22     33.40  6.72     42.56  6.05      9.06   7.67    476.44 **   .78
Impression Management    6.33  3.50      7.40  3.75     11.98  4.72      3.61   4.45     74.51 **   .35

** p < .01.

with small samples of undergraduates who are experimentally instructed to either fake good, fake bad, or respond honestly, it is easy to be swayed by Hough et al.'s and Barrick and Mount's null findings for the effects of socially desirable responding on personality scale validity. Not surprisingly, therefore,
there does seem to be a growing view that socially
desirable responding, in general, and naturally
occurring faking, in particular, are not a widespread
concern for personality assessment. The present
investigation revisits this issue.
In Study 1, I examine Five-Factor Model personality scale validity based on correlations with corresponding self-report criteria. For this, the validity-moderating effects of both experimentally manipulated faking instructions and socially desirable responding are evaluated. For Study 2, with peer report-based criteria, the relationship between scale validity and naturally occurring socially desirable responding is investigated. Differences with Study 1 results are analytically articulated. Finally, focusing on one Five-Factor Model domain (i.e., Extraversion), Study 3 emphasizes peer-based criterion validity and demonstrates how the validity-moderating effect of socially desirable responding is dependent on the proportion of respondents who are faking and how the measurement of faking is operationalized.
Study 1
This first investigation focused on the experimental manipulation of instructions in order to demonstrate the effects of induced faking and socially
desirable responding on common, psychometrically
strong, personality scales.
Method
Participants. Participants were 420 undergraduates
(347 women, 73 men) with a mean age of 19.03 years
(SD = 2.50). Individuals participated in exchange for
credit toward an introductory psychology course.

Materials. Stimuli comprised the NEO Five-Factor Inventory (NEO-FFI; Costa & McCrae, 1991), the Impression Management scale of the Balanced Inventory of Desirable Responding (BIDR; Paulhus, 1998), and a series of self-report criteria. The NEO-FFI has considerable psychometric strengths (Costa & McCrae, 1989), consists of 60 items answered on 5-point continua, and assesses the Five-Factor Model of personality with scales of Neuroticism, Extraversion, Openness, Agreeableness, and Conscientiousness. As a measure of socially desirable responding, the 20-item Impression Management scale quantifies a form of dissimulation "known as faking or lying" (Paulhus, 1998, p. 9). Items (e.g., "I always obey laws, even if I'm unlikely to get caught") are answered on 5-point scales that range from Not True to Very True.
For a sample of 200 undergraduates, Holden, Starzyk, McLeod, and Edwards (2000) report a coefficient alpha reliability of .72 and strong support for the scale's item latent structure. Furthermore, the scale's validity has been demonstrated through its sensitivity to situational self-presentation demands (Paulhus, 1984; Paulhus, Bruce, & Trapnell, 1995).
Similar to Costa and McCrae (1992, p. 10), criteria were self-report. NEO-FFI criteria were from Holden, Wood, and Tomashewski (2001), who derived them from Goldberg (1992, p. 41) and Paunonen and Jackson (1985). For each of the five NEO-FFI dimensions, Goldberg's best five positively keyed and best five negatively keyed unipolar adjective markers were used, as were Paunonen and Jackson's bipolar rating scales for dimensions that loaded most strongly on Five-Factor Model factors (Costa & McCrae, 1988, p. 263).
Procedure
Prior to the experimental manipulation, respondents completed the criteria under standard instructions. Subsequently, participants were randomly assigned to one of four instructional conditions (n = 105 per group) under which they completed the NEO-FFI and Impression Management scale. Respondents were asked to imagine that, in answering these subsequent materials, they were being screened by the government for possible military induction. For instructional conditions, participants were either given standard instructions associated with the NEO-FFI, asked to answer as honestly as possible, asked to fake so as to maximize their chances of being inducted into the military (i.e., fake good), or asked to fake so as to minimize their chances of military induction (i.e., fake bad).1

1 Paulhus (1993) suggests that socially desirable responding may be differentially reduced by instructions to respond honestly versus standard instructions.

TABLE 2
NEO-FFI Scale Validities as a Function of Instructional Condition

                        Instructions
Scale                   Standard     Honest       Fake Good    Fake Bad
                        (n = 105)    (n = 105)    (n = 105)    (n = 105)
Neuroticism             .50          .66          .03          .21
Extraversion            .47          .62          .15          -.04
Openness                .54          .60          .08          .28
Agreeableness           .77          .77          .07          .03
Conscientiousness       .72          .73          .13          -.07
Mean                    .62          .68          .09          .08

Figure 1. Mean NEO-FFI scale validity as a function of Impression Management scale scores, Study 1 (all instructional groups).

Results
Scale descriptive statistics are displayed in Table 1. Multivariate analysis of variance indicated a significant effect for instructions on the NEO-FFI scale scores, Wilks' Λ = .148, F(15, 1,134.99) = 75.46, p < .01. The effect size was more than large, η² = .47. Impression Management scale scores also differed significantly among instructional groups, F(3, 416) = 74.51, p < .01, and the effect size, f = 0.73, vastly surpassed a standard of .40 (Cohen, 1992) for a large effect. In particular, Impression Management scale scores for standard instructions (M = 6.33, SD = 3.50) differed significantly and with large effect sizes from those for faking good (M = 11.98, SD = 4.72), t(208) = -7.77, p < .01, Cohen's d = 1.07, and faking bad (M = 3.61, SD = 4.45), t(208) = 6.66, p < .01, Cohen's d = .92.
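For readers reproducing these effect sizes, the reported d values are consistent with recovering d from the t statistic (d = 2t/√df for two equal-sized groups), while a pooled-SD computation from the Table 1 summary statistics gives a somewhat larger value; a minimal sketch:

```python
import math

def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d using the pooled standard deviation (equal group sizes)."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

def d_from_t(t, df):
    """Cohen's d recovered from an independent-groups t statistic (equal n)."""
    return 2 * t / math.sqrt(df)

# Standard vs. fake-good Impression Management scores (Table 1 / reported t)
print(round(d_from_t(7.77, 208), 2))                # ~1.08, in line with the reported 1.07
print(round(cohens_d(11.98, 4.72, 6.33, 3.50), 2))  # pooled-SD version: ~1.36
```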

Coefficient alpha reliabilities for the Impression Management scale were .73, .76, .85, and .90 for standard, honest, faking good, and faking bad groups, respectively. Thus, the Impression Management scale was an internally consistent index that not only confirmed the effectiveness of the experimental manipulation but was an exceptionally strong indicator of faking.
Following Holden et al. (2001), for each NEO-FFI
dimension, criterion scale scores for the Goldberg
markers were standardized and summed with the
corresponding standardized criterion scale scores for
the Paunonen and Jackson (1992) ratings for the same
dimension. Validities for each scale were calculated
by correlating each NEO-FFI scale and corresponding
criteria separately within each experimental condition (Table 2). Mean validities of .62 (standard
instructions) and .68 (honest instructions) were comparable to those reported by Costa and McCrae
(1992, p. 54) for self-report adjective criteria.
Moderated multiple regression procedures examined whether instructions affected validity. For each
NEO-FFI dimension, criterion scores were regressed
on corresponding self-report scores and variables
that coded experimental instructional group membership. Then, interaction variables (product terms
between the self-report scale scores and the variables
coding instructional group membership) were added
and tested for the statistical significance of their
increment in prediction. For each of the five NEO-FFI
dimensions, validity was influenced by instructions
(all ps < .01).
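The regression steps above can be sketched as follows. The data are simulated for illustration only (the group slopes, noise level, and seed are assumptions, not the study's data); the structure of the test matches the text: group main effects first, then scale-by-group product terms, with the F test on the R² increment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per = 105
groups = ["standard", "honest", "fake_good", "fake_bad"]
true_slope = {"standard": 0.6, "honest": 0.7, "fake_good": 0.1, "fake_bad": 0.1}

x, y, g = [], [], []
for grp in groups:
    xs = rng.normal(size=n_per)                                    # self-report scale scores
    ys = true_slope[grp] * xs + rng.normal(scale=0.8, size=n_per)  # criterion scores
    x += list(xs); y += list(ys); g += [grp] * n_per
x, y = np.array(x), np.array(y)

# Dummy-code group membership (three dummies for four groups)
dummies = np.column_stack([[gi == grp for gi in g] for grp in groups[1:]]).astype(float)

def fit_r2(X, y):
    """R-squared and parameter count for an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - (resid ** 2).sum() / ss_tot, X.shape[1]

# Step 1: predictor plus group main effects; Step 2: add the product terms
r2_base, k_base = fit_r2(np.column_stack([x, dummies]), y)
r2_full, k_full = fit_r2(np.column_stack([x, dummies, dummies * x[:, None]]), y)

n = len(y)
f_inc = ((r2_full - r2_base) / (k_full - k_base)) / ((1 - r2_full) / (n - k_full))
print(f"R2 increment = {r2_full - r2_base:.3f}, F({k_full - k_base}, {n - k_full}) = {f_inc:.1f}")
```

A significant F for the increment indicates that the self-report/criterion slope differs across instructional groups, i.e., that instructions moderate validity.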
Figure 1 displays average NEO-FFI scale validities (based on all participants) for indicated values of the Impression Management scale. Paulhus (1998, p. 10) designates that "probably invalid" responding is associated with Impression Management scale scores above 12 or below 1. For scores above 8 or below 2, responding "may be invalid." Figure 1's curvilinear graph demonstrates that extreme scores in either direction on Impression Management are associated with lower scale validity. In evaluating the moderating effects of Impression Management scale scores on
the validity of self-reported personality, a hierarchical
regression procedure was used. This procedure,
undertaken separately for each NEO-FFI dimension
and using standardized scores for all variables,
involved the following steps. First, criterion scores
were regressed on corresponding self-report personality scale scores and Impression Management scale
scores. Second, a product term of the personality
scale and Impression Management scale scores was
added to the regression equation and the increment
in variance accounted for was evaluated for statistical

significance. This served to test whether the


Impression Management scores acted as a linear
moderator of validity. Third, the Impression
Management scale scores were squared and added to
the regression equation. Finally, the product term of
the personality scale and the squared Impression
Management scale scores was added to the regression equation and the increment in variance accounted for was examined for statistical significance. This
examination tested whether the Impression
Management scores acted as a quadratic moderator
of validity. Overall, moderated multiple regression
analyses indicated that, whereas Impression
Management scale scores were a linear moderator
only for the validity of the Agreeableness scale (p <
.01; all other ps > .19), Impression Management scale
scores were a quadratic moderator of validity for
each of the five NEO-FFI scales (all ps < .01).
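The four steps of this hierarchical procedure can be sketched in code. The data below are simulated so that validity genuinely decays at both impression-management extremes; all values are illustrative assumptions, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 420
z = lambda v: (v - v.mean()) / v.std()

personality = rng.normal(size=n)
im = rng.normal(size=n)                    # impression-management scores
# Simulated validity that drops at both IM extremes (a quadratic moderator)
criterion = (0.7 - 0.3 * z(im) ** 2) * personality + rng.normal(scale=0.6, size=n)

p, m = z(personality), z(im)               # standardized scores, as in the text

def r2(X, y):
    """R-squared for an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Steps 1-2: test the personality x IM product term (linear moderation)
lin_inc = r2(np.column_stack([p, m, p * m]), criterion) - r2(np.column_stack([p, m]), criterion)
# Steps 3-4: add IM squared, then test the personality x IM^2 term (quadratic moderation)
quad_inc = (r2(np.column_stack([p, m, p * m, m ** 2, p * m ** 2]), criterion)
            - r2(np.column_stack([p, m, p * m, m ** 2]), criterion))
print(f"linear increment = {lin_inc:.4f}, quadratic increment = {quad_inc:.4f}")
```

With this data-generating process the quadratic product term carries the moderation, so its R² increment dwarfs the linear one, mirroring the pattern Study 1 reports for the NEO-FFI scales.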
Discussion
This first study confirms four important points.
First, under nondissimulation conditions, NEO-FFI
scales have substantial validity for indicating self-report criteria. Values for validities are consistent
with those in the inventory manual for similar criteria (Costa & McCrae, 1992). Second, NEO-FFI scales
are susceptible to instructed faking. This is consistent
with reports concerning the NEO-FFI, in particular
(Holden et al., 2001), and personality inventories, in
general (Viswesvaran & Ones, 1999). Third, the
Impression Management scale is an internally consistent, valid index of instructed faking. The currently
obtained effect size (i.e., f = 0.73) not only is very
large, but exceeds that found in similar faking studies
(f = 0.58, Holden et al., 2000; f = 0.59, Holden, Book,
Edwards, Wasylkiw, & Starzyk, 2003). Fourth,
instructed faking (based on group membership)
moderates scale validity. Again, this is consistent
with studies of the NEO-FFI (Holden et al., 2001;
Topping & O'Gorman, 1997) and other inventories
(Douglas et al., 1996).
Extending previous research, socially desirable responding, as a continuous variable measured by the Impression Management scale, also moderated validity. Although the convergence in moderator analysis between measuring faking either by assigned group membership or by an Impression Management scale score seems intuitive and logical, the effects of socially desirable responding (as measured by a continuous variable's scale scores) on validity have very rarely been statistically demonstrated through moderated multiple regression procedures, even when instructed faking is used to experimentally manipulate socially desirable responding levels. This finding validates the use of Paulhus' scale as a validity index and confirms lay and professional perceptions that respondents who fake provide invalid information and that fakers can be caught.

Figure 2. Mean NEO-FFI scale validity as a function of Impression Management scale scores, Study 2 (standard instructions).

TABLE 3
NEO-FFI Scale Intercorrelations (N = 420)

Scale                        NS     ES     OS     AS     CS     NR     ER     OR     AR     CR
Self-Report
  Neuroticism (NS)          (.88)
  Extraversion (ES)         -.36   (.79)
  Openness (OS)             -.01    .05   (.77)
  Agreeableness (AS)        -.29    .23    .02   (.75)
  Conscientiousness (CS)    -.29    .20   -.17    .20   (.86)
Roommate-Report
  Neuroticism (NR)           .46   -.19    .03   -.01   -.10   (.85)
  Extraversion (ER)         -.29    .58   -.01    .01    .05   -.35   (.82)
  Openness (OR)             -.04   -.01    .60   -.02   -.12    .04    .05   (.72)
  Agreeableness (AR)        -.19    .19    .02    .50    .14   -.23    .24    .10   (.83)
  Conscientiousness (CR)    -.23    .09   -.18    .12    .55   -.24    .14   -.03    .26   (.89)

Note. Validity coefficients (the correlations between corresponding self-report and roommate-report scales: .46, .58, .60, .50, .55) are where corresponding scales cross. Coefficient alpha reliabilities are in parentheses on the diagonal.
Two limitations for this first study are noteworthy.
First, criterion measures were self-report. Although
self-report criteria are used in various validation
studies of personality measures (e.g., Costa &
McCrae, 1989, 1992), such criteria may not be as
objective as those that are based on peer ratings,

observed behaviour, or job performance. Second, socially desirable responding was induced through
experimental instructions. The extent to which results
associated with such induced responding generalize
to natural assessment contexts is a source of debate.
Study 2
Although, for instructed faking, personality scale validity may be moderated by scores on a measure of socially desirable responding, some (e.g., Ones & Viswesvaran, 1998) question whether induced simulation studies generalize to the real world. Experimental studies may produce extremities in faking that either exaggerate or differ qualitatively from naturally occurring results. Further, others (e.g., Holden & Fekken, 1989) have offered an alternative interpretation of Paulhus' (1984) dimensions of socially desirable responding and suggest that his impression management dimension may be a broader construct that is interpreted as interpersonal sensitivity. Consequently, the construct validity of socially desirable responding scales for indicating faking may be challenged (Pauls & Crost, 2004). Following from these issues, this second study focused on personality scale validity, under standard instructions, as a function of naturally occurring levels of a measure of socially desirable responding.

TABLE 4
NEO-FFI Scale Validity Coefficients as a Function of Impression Management (IM) Scale Score

IM Scale Score   n    Neuroticism   Extraversion   Openness   Agreeableness   Conscientiousness   Mean
≤ 1              32   .61           .73            .58        .51             .65                 .62
2                34   .60           .57            .59        .59             .43                 .56
3                39   .39           .56            .59        .45             .56                 .52
4                39   .52           .64            .54        .48             .63                 .56
5                49   .51           .60            .63        .46             .44                 .53
6                38   .58           .49            .70        .50             .48                 .55
7                31   .45           .57            .53        .58             .37                 .51
8                47   .10           .60            .54        .57             .68                 .52
9                30   .48           .57            .58        .46             .46                 .52
10               25   .14           .57            .76        .33             .38                 .46
11               23   .46           .54            .47        .33             .63                 .49
≥ 12             33   .37           .62            .54        .43             .54                 .50

Note. IM scores of 1 or less were combined and scores of 12 or greater were combined to ensure at least 20 subjects per IM group.
Method
Participants. Participants were 420 university students (210 roommate pairs; 332 women, 88 men) who had lived together for at least three months (M =
18.42; SD = 27.24). Mean age was 21.07 years (SD =
2.46). Individuals were paid for their participation.
Materials. Materials comprised the self-report
Impression Management scale, and the self-report
(Form S) and observer (Form R) versions of the NEO-FFI. Funder (1991) has indicated that peer report may
be the best single method of trait assessment because
peer ratings are based on a large number of behaviours occurring in natural, daily settings. Peer report
is commonly used as criteria to validate self-report
personality scales (e.g., Costa & McCrae, 1989, 1992;
Holden et al., 2001), including in studies of socially
desirable responding (e.g., Piedmont et al., 2000).

Procedure. Participants completed the Impression Management scale and self-report NEO-FFI as applied
to the self and the observer NEO-FFI as applied to
their respective roommate. All materials were administered under standard instructions.
Results
Coefficient alpha reliabilities for NEO-FFI scales
(Table 3) all exceeded .70, indicating that the personality scales and criterion measures possessed acceptable levels of internal consistency for this sample.
Correlations between self and roommate report on
corresponding NEO-FFI scales indicated validities of
large effect size for the self-report scales. Table 4 presents validity coefficients as a function of Impression
Management scale scores.
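The layout of Table 4 (within-bin validity coefficients, with extreme scores collapsed so every bin keeps a workable n) can be sketched as follows; the data are simulated for illustration, not the study's:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 420
im = rng.integers(0, 15, size=n)            # hypothetical IM scale scores
self_report = rng.normal(size=n)            # one simulated NEO-FFI scale
criterion = 0.55 * self_report + rng.normal(scale=0.8, size=n)

# Collapse the extremes (as in Table 4's note) so each bin keeps a workable n
bins = np.clip(im, 1, 12)

validities = {}
for b in np.unique(bins):
    mask = bins == b
    # Within-bin validity: correlation of self-report with its criterion
    validities[int(b)] = np.corrcoef(self_report[mask], criterion[mask])[0, 1]
print(validities)
```

Because the simulated validity is constant across bins here, the dictionary values hover around a single correlation; in the actual data, the interest is in whether they drift downward as IM scores rise.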
Calculated on all respondents, the Impression
Management scale had a coefficient alpha reliability
of .74. Moderated multiple regression was employed
on a scale-by-scale basis to examine whether
Impression Management scale scores moderated
NEO-FFI scale validity. As a linear moderator, only
one of five analyses (for the Neuroticism scale) indicated a significant moderator effect. Importantly,
however, all beta weights for linear moderators were
in a consistent direction, indicating less validity for
greater socially desirable responding (p < .05, sign
test, one-tailed). Examining the Impression
Management scale as a quadratic moderator where
both extremes (positive and negative) of socially
desirable responding reduce validity, one of five
moderators was statistically significant (for the
Agreeableness scale) and no consistency in the direction of moderator beta weights occurred.
TABLE 5
NEO-FFI Scale Validity Coefficients as a Function of Extreme Impression Management (IM) Scale Score and Instructional Group (Reanalysis of Study 1)

                              Neuroticism   Extraversion   Openness   Agreeableness   Conscientiousness   Mean
IM Scale Score ≤ 2
  Standard and Honest
    Instructions (n = 28)     .40           .56            .55        .80             .77                 .64
  Fake Bad (n = 56)           .11           -.23           .21        .05             -.08                .01
IM Scale Score ≥ 12
  Standard and Honest
    Instructions (n = 22)     .38           .51            .51        .63             .63                 .54
  Fake Good (n = 62)          -.15          .03            .14        .11             .14                 .05

Across the five NEO-FFI scales, mean validities were graphed (Figure 2) as a function of the same levels of Impression Management scale scores as in Study 1. No quadratic function relating socially desirable responding to validity was evident, and this contrasts starkly with Study 1 (Figure 1). For respondents scoring one or less on the Impression
Management scale (described as "probably invalid" by the scale's manual), the mean validity across NEO-FFI scales was .62. For respondents scoring 12 or greater on the Impression Management scale (also described as "probably invalid" by the scale's manual), the mean validity across NEO-FFI scales was .50. Importantly, however, mean scale validity was linearly (negatively) correlated with Impression Management scale scores, Spearman's r(10) = -.86, p < .01, indicating a decline of large effect size in average scale validity for respondents scoring high on the Impression Management scale.2 No evidence of heterogeneity of
variance across levels of impression management for
any of the self-reported or roommate-reported NEO-FFI scales was present (all ps > .05 for Levene's test
statistic). Thus, this negative linear relationship
between average validity and Impression
Management scores could not be attributed to statistical artifacts associated with inflated or attenuated
scale score variances. Of note, although average validity declined with increasing Impression Management scale scores, the group of high scorers (> 12) still manifested validities that represented a large effect size (i.e., r of .50; Cohen, 1992).3

2 The correlation between average validity and Impression Management scale scores is a correlation referred to as r_alerting (Rosenthal & DiMatteo, 2001; Rosnow & Rosenthal, 2002). In computing r_alerting, error found within conditions is ignored. However, for the present data, no evidence of differential error (based on Levene's test) across scores of the Impression Management scale was observed. In examining nonlinear associations between average validity and Impression Management scale scores, no supporting results were obtained for nonlinearity.
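The reported Spearman correlation can be reproduced directly from the mean-validity column of Table 4; the tie-handling rank function below is a minimal stand-in for a library routine:

```python
import numpy as np

# Mean NEO-FFI validity at each Impression Management score level
# (Table 4's "Mean" column; end categories collapsed as in the table)
im_levels = np.arange(1, 13).astype(float)
mean_validity = np.array([.62, .56, .52, .56, .53, .55, .51, .52, .52, .46, .49, .50])

def ranks(v):
    """1-based average ranks, with tied values sharing their mean rank."""
    order = np.argsort(v)
    sorted_v = v[order]
    r = np.empty(len(v))
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and sorted_v[j + 1] == sorted_v[i]:
            j += 1
        r[order[i:j + 1]] = (i + j) / 2 + 1     # average rank for the tie run
        i = j + 1
    return r

# Spearman's rho is the Pearson correlation of the two rank vectors
spearman = np.corrcoef(ranks(im_levels), ranks(mean_validity))[0, 1]
print(f"Spearman r = {spearman:.2f}")           # -0.86, as reported
```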
Discrepancies in the shapes of Figures 1 and 2 for
the two studies were explored by re-examining
validities in Study 1 for respondents scoring 2 or less ("may be invalid"; Paulhus, 1998, p. 10) and 12 or greater ("probably invalid"). Differentiating between
instructional conditions (Table 5) offers a re-interpretation of the extremities of Figure 1 for Study 1.
Although, as an entire group, respondents who were
extreme for Impression Management scale scores
provided less valid self-report, the reduced validities
were much more prominent for those instructed to
fake as opposed to those given standard or honest
instructions. Thus, although the Impression
Management scale may represent a superior index of
faking and shatters the standard of a large effect size
for doing so, it fails to identify correctly a seemingly
nontrivial group of individuals (50 of 208) who do
supply self-report of substantial validity.
Interestingly and confirming the findings for Study 2
participants, among Study 1 respondents receiving
standard or honest instructions, those scoring high (>
12) on the Impression Management scale had lower
scale validities than those scoring low (< 2) on the
scale, mean validities of .54 and .64, respectively
(both large effect sizes), paired t(4) = 2.33, p < .05,
one-tailed, Cohen's d = 1.04.

3 Only 33 of 420 respondents had Impression Management scale scores > 12. Although such scores are normatively extreme,
scores can theoretically range up to 20. Only 8 of 420 participants
had scale scores over 14 and, thus, it was not feasible to examine
validities for these even more extreme scale values.


Discussion
Results for Study 2's sample confirm that, based
on peer criteria, the NEO-FFI scales have substantial
validity (M = .54). Observed values are consistent
with those in the inventory's manual (M = .58; Costa
& McCrae, 1992) for similar criteria completed by
spouses. For the Impression Management scale, consistency with previous results is also present. An
observed mean score of 6.30 (SD = 3.55) is similar to
the general population mean (i.e., M = 6.7) in the
scale's manual (Paulhus, 1998). It is noteworthy, however, that 6% (n = 25) and 34% (n = 143) of Study 2's
sample are "probably invalid" and "may be invalid,"
respectively, based on cut-scores indicated in the
manual. Paradoxically, such warnings appear questionable when NEO-FFI scale validities are examined
as a function of Impression Management scale scores
(Table 4). That is, under standard instructions, substantial variability in impression management exists,
a nontrivial number of respondents are flagged by
validity index guidelines, and, despite these invalidity warnings, these potentially nonvalid respondents
provide self-report having a mean validity over .50 (a
large effect size).
At least three nonmutually exclusive interpretations of these results merit consideration. First, NEO-FFI scale validity may not be susceptible to faking.
Study 1 and other research (Holden et al., 2001;
Topping & O'Gorman, 1997) refute this interpretation. Nevertheless, a counterargument to this is that
the NEO-FFI may be unaffected by naturally occurring
(as opposed to experimentally instructed) faking.
Second, faking (and, therefore, socially desirable
responding) effects may not have been present in this
sample. Supporting this view is the general lack of
significant moderator effects found in the moderated
multiple regression analyses. Impression
Management scale scores were statistically significant
for only one of five linear and one of five quadratic
moderators. However, countering this interpretation
are: (a) a significant, observed consistency in the
signs of linear moderator regression weights that
supports the perspective that increased positivity in
self-report attenuates validity; (b) a negative correlation (i.e., -.86) of large effect size between impression
management and average validity across scales; (c)
the nontrivial observed number of respondents that
exceeded cut-scores for indicating invalid responding; and (d) a lack of statistical power generally associated with moderated multiple regression analyses.
Further, the significant difference in mean scale validity between extremely high (mean validity of .54;
29% of the variance in the criteria) and low (mean
validity of .64; 40% of the variance in the criteria)

Impression Management scale scorers for the reanalysis of standard and honest respondents in Study
1 reinforces the interpretation of the presence of a
socially desirable responding effect.
Third, the Impression Management scale may not
be an adequate index of faking. Study 1 and published research (e.g., Holden et al., 2000, 2003) attest
to strong Impression Management scale effect sizes
for identifying fakers. These effect sizes vastly surpass standards for a large effect size. Nevertheless, a
re-examination of validity by instructional group for
extreme scorers on the Impression Management scale
in Study 1, in conjunction with the validity for
extreme Impression Management scale scores in
Study 2, highlights an important challenge for the
scale's construct validity. In particular, although fakers may provide extreme scores on the scale, extreme
scorers may not necessarily be fakers. Indeed, under
standard conditions, honest responding appears to
predominate over invalid responding for these
extreme scorers. Consequently, the function in Figure
2 emerges. When, among these extreme scorers, fakers dominate, validity plummets and Figure 1's
curvilinear function predominates. This quadratic
function can be altered (Table 4) back to that shown
in Figure 2 by excluding instructed fakers, even
though other extreme Impression Management scorers (not instructed to fake) are retained.
Overall, therefore, this second study contributes by
demonstrating that, although multiple regression
analyses of individual personality scales failed to
uncover moderating effects for a valid scale of
impression management, linear moderating effects of
large size were confirmed when personality scales
were analyzed collectively. Further, the Impression
Management scale, although a highly valid indicator
of faking, does not provide extreme scale scores that
have an unambiguous interpretation. Extreme scorers
may still provide valid self-report even if that validity is significantly reduced.
Study 3
Given that extreme scorers on the Impression
Management scale (and presumably other socially
desirable responding scales) offer valid self-report
(Study 2), but instructed fakers (Study 1) and maybe
natural fakers do not, then the influence of socially
desirable responding on scale validity is unclear. If
scores on a valid socially desirable responding scale
are isomorphic with instructed faking conditions,
such a scale would be anticipated to moderate validity. Alternatively, if a valid socially desirable responding scale is not an isomorphic indicator of faking
(e.g., Holden & Fekken, 1989), moderating effects on validity may not be indicated.
The purpose of Study 3 was to examine how the
Impression Management scale behaves, or fails to behave, as a perfect indicator of faking. For this study, the personality construct of extraversion was chosen
the personality construct of extraversion was chosen
because it is common to many multiscale inventories
and it is a readily observable dimension, a feature
associated with accurate peer ratings (Funder &
Colvin, 1988; Funder & Dobroth, 1987). For this
example, the emphasis was on positive impression
management (i.e., "faking good") because a positivity
bias is the general focus of most research, particularly
with personnel applications. As a consequence of
investigating only positivity, effects on validity for
faking or socially desirable responding should be linear rather than curvilinear.
Method
Participants. Eighty-nine female and 19 male same-sex university roommate pairs (216 students in total)
volunteered and were paid for their participation.
Mean age of the sample was 20.12 years (SD = 2.17).
This sample comprised roommate pairs who had
lived together for at least three months (M = 10.50; SD
= 9.59).
Materials. Stimuli comprised the Extraversion scale
of the NEO-FFI, the Impression Management scale,
and a series of criteria. For the NEO-FFI Extraversion
scale, Costa and McCrae (1991, p. 17) report a coefficient alpha reliability of .79 and validities of .60, .51,
and .38 based on correlations with adjective self-reports, spouse ratings, and peer ratings, respectively. Criteria for the Extraversion scale were as in Study
1 and comprised five positively keyed (i.e., assertive,
bold, extraverted, talkative, verbal) and five negatively keyed (i.e., bashful, introverted, quiet, shy,
untalkative) unipolar markers and three bipolar rating scales (i.e., sociable vs. withdrawn; exhibitionistic
vs. shy; fun-loving vs. serious).
Procedure. Initially, respondents completed the criteria under standard instructions as applied to participants' respective roommates. Subsequently, individuals were randomly assigned to complete the NEO-FFI Extraversion and Impression Management scales
under instructions either to answer honestly (n = 107)
or to fake. In the Faking condition (n = 109), respondents received the following instructions:
Imagine that you are applying for a job. The job is a sensitive government position involving exposure to confidential material. As part of the application procedure, please
complete the following personnel security survey. You

wish, however, to respond so as to MAXIMIZE YOUR


CHANCES OF BEING HIRED. Therefore, do not necessarily answer the following statement truthfully, but
answer so that you WILL BE HIRED. FAKE this test so
you will get the job. Although you may feel that you
would never represent yourself dishonestly, please try to
do so for this study. However, BEWARE that the survey
has certain features (WHICH YOU WANT TO AVOID)
designed to detect faking. Do your best to FAKE out
the survey and get the job. All your responses are strictly
CONFIDENTIAL. Please respond to all items even if
some seem not applicable.

Participants in the Honest condition were presented with the same job scenario (i.e., identical first three sentences of instructions), but were asked to answer honestly.
Results
Based on all participants, the 10-item criteria from
the unipolar markers had a coefficient alpha of .88.
The 3-item criteria from the bipolar ratings had a
coefficient alpha of .59. Because totals associated with
these two criteria correlated .69, these totals were
standardized across respondents and summed to
produce a more reliable peer-report criterion score
for extraversion.
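The standardize-and-sum compositing step can be sketched in a few lines; the scores and names below are hypothetical illustrations, not data from the study:

```python
def zscore(xs):
    # Standardize scores to mean 0, SD 1 (population SD).
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / sd for x in xs]

def composite(unipolar_totals, bipolar_totals):
    # Standardize each criterion total across respondents, then sum,
    # so each total contributes equally regardless of its original metric.
    return [u + b for u, b in
            zip(zscore(unipolar_totals), zscore(bipolar_totals))]

# Hypothetical peer-rating totals for five respondents
uni = [42, 35, 50, 28, 45]
bi = [15, 11, 18, 9, 16]
print(composite(uni, bi))
```

Standardizing before summing is what keeps the 10-item and 3-item totals, which sit on different raw metrics, from being weighted by their scale lengths.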
Scores on the Extraversion scale were significantly
higher in the Faking (M = 35.41, SD = 6.13, coefficient
alpha = .83) than in the Honest (M = 32.24, SD = 5.90,
coefficient alpha = .78) condition, t(214) = 3.86, p <
.001. The magnitude of the difference represented a
medium effect size, d = 0.53 (Cohen, 1992). Thus, the
Extraversion scale was susceptible to substantial
response distortion.
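The reported effect size can be reproduced from the group statistics above; this sketch uses the pooled-standard-deviation form of Cohen's d (an assumption, as the article does not state which variant was computed):

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    # Cohen's d: mean difference divided by the pooled standard deviation.
    pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                       / (n1 + n2 - 2))
    return (m1 - m2) / pooled

# Extraversion scale statistics reported for the Faking and Honest groups
d = cohens_d(35.41, 6.13, 109, 32.24, 5.90, 107)
print(round(d, 2))  # 0.53, a medium effect by Cohen's (1992) benchmarks
```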
For the Impression Management scale, scores were
significantly higher in the Faking (M = 11.80, SD =
4.46, coefficient alpha = .84) than in the Honest (M =
6.02, SD = 3.12, coefficient alpha = .66) condition,
t(214) = 11.02, p < .001. The effect size, d = 1.50, vastly
surpassed Cohen's (1992) standard for a large effect
(i.e., d = 0.80). Application of logistic regression yielded an overall correct classification rate of 78.24% (169
of 216 respondents) for identifying respondents as
either answering honestly or faking, χ²(1, N = 216) =
94.68, p < .0001. Interestingly, application of the manual's recommendation for using Impression
Management scores for identifying protocols that
may be invalid due to "faking good" (i.e., scores > 8;
Paulhus, 1998, p. 10) resulted in an identical correct
classification hit rate (including 76.15% of fakers and
80.37% of honest respondents correctly classified).
Figure 3. Significance levels of moderator as a function of percentage of instructed fakers.

Thus, the Impression Management scale had substantial validity for detecting faking (correlation of .60 with the experimental condition), and the cut-scores from the scale's manual were highly effective, but imperfect, in correctly classifying instructed fakers and nonfakers.
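The cut-score classification can be sketched in pure Python; the Impression Management scores below are hypothetical, and only the scores > 8 decision rule comes from the text (Paulhus, 1998):

```python
def hit_rates(im_scores, instructed_fake, cut=8):
    # Flag a protocol as faked when the Impression Management score
    # exceeds the cut (scores > 8), then tally classification accuracy
    # separately for instructed fakers and honest respondents.
    pairs = list(zip(im_scores, instructed_fake))
    hits_fake = sum(1 for s, f in pairs if f and s > cut)
    hits_honest = sum(1 for s, f in pairs if not f and s <= cut)
    n_fake = sum(instructed_fake)
    n_honest = len(pairs) - n_fake
    return (hits_fake / n_fake, hits_honest / n_honest,
            (hits_fake + hits_honest) / len(pairs))

# Hypothetical scores; True marks an instructed faker
scores = [12, 9, 14, 7, 11, 10, 5, 6, 8, 3]
faked = [True, True, True, True, True,
         False, False, False, False, False]
print(hit_rates(scores, faked))  # faker, honest, and overall hit rates
```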
Criterion validity for the Extraversion scale was
computed by correlating scale scores with peer criterion scores, separately for each condition. Validities
were .11 and .54 for Faking and Honest conditions,
respectively, and these differed significantly, z = 3.58,
p < .0005. According to Cohen (1992), the magnitude
of the difference between these validities was a large
effect size, q = .50. To further confirm that faking had
a statistically significant effect on validity, moderated
multiple regression procedures were again
employed. Initially, extraversion peer criterion scores
were regressed on self-report extraversion scores and
a variable representing experimental group membership. Subsequently, the interaction of self-report
extraversion scores and the group membership variable was added. Associated with the addition of this
interaction term, an observed change in R from .33 to .41 corresponded to an increment in R² of .058 (p <
.001) and verified that faking moderated the validity
of self-report extraversion scores. Thus, faking did
affect the validity of a self-report scale.
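The comparison of the two validities can be reproduced with the standard Fisher r-to-z test for independent correlations (assumed here to be the test used, since the article reports only the resulting z):

```python
import math

def fisher_z(r):
    # Fisher r-to-z transformation.
    return 0.5 * math.log((1 + r) / (1 - r))

def independent_r_test(r1, n1, r2, n2):
    # z test for the difference between two independent correlations.
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

# Validities of .54 (Honest, n = 107) and .11 (Faking, n = 109)
z = independent_r_test(0.54, 107, 0.11, 109)
q = fisher_z(0.54) - fisher_z(0.11)  # Cohen's q effect size
print(round(z, 2), round(q, 2))  # z matches the reported 3.58
```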
Given that the Impression Management scale was
a highly valid index of faking, the effects of faking
were also analyzed by substituting Impression Management scale scores for the experimental group membership variable in the moderated multiple
regression analysis. Here, with the addition of the
interaction term (i.e., the product of the self-report
Extraversion and Impression Management scale
scores), the observed change in R from .38 to .40 corresponded to a nonsignificant increment in R² of .010
(p > .11). Thus, enigmatically, a reliable, highly valid
index of faking was incapable of demonstrating a significant moderating effect for faking when indeed
such an effect was present and was large.
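The moderated multiple regression procedure, with either the group variable or Impression Management scores as moderator, can be sketched as follows; the data are synthetic (with moderation deliberately built in), and NumPy's least squares stands in for whatever software was actually used:

```python
import numpy as np

def r_squared(X, y):
    # R^2 from an ordinary least squares fit (intercept added here).
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ beta).var() / y.var()

rng = np.random.default_rng(0)
n = 216
predictor = rng.normal(size=n)          # self-report scale scores
moderator = rng.integers(0, 2, size=n)  # 0 = honest, 1 = instructed faking
# Synthetic criterion: the predictor is valid only for honest respondents
criterion = (1 - moderator) * predictor + rng.normal(scale=0.8, size=n)

# Step 1: main effects only; Step 2: add the product (interaction) term
r2_main = r_squared(np.column_stack([predictor, moderator]), criterion)
r2_full = r_squared(np.column_stack(
    [predictor, moderator, predictor * moderator]), criterion)
print(round(r2_full - r2_main, 3))  # R^2 increment for the interaction
```

A significant R² increment for the product term is what identifies the moderator effect in this procedure.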
To elucidate further the discrepancy between the
experimental implementation of faking and the operational measurement of this experimental faking by
the Impression Management scale, validity-moderating effects were examined as a function of the ratio of
instructed fakers to honest respondents. Samples of
three participants from the faking condition were
randomly selected without replacement and successively added to the 107 respondents in the honest condition. When all participants
instructed to fake had been added, random samples
of three respondents in the honest condition were
successively deleted from the total group. At each
change in sample size, the statistical significance of
and the unique variance accounted for (i.e., part correlation squared) by the moderator were examined
using either the experimental condition or the Impression Management scale scores separately as the moderator in moderated multiple regressions.

Figure 4. Effect sizes [criterion-(moderator × predictor) part correlations] as a function of percentage of instructed fakers.
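The sample-composition loop described above can be sketched as follows; the data are synthetic, and the squared part correlation of the interaction term is computed as the R² increment when the product term is added to the main-effects model:

```python
import numpy as np

def interaction_delta_r2(pred, mod, crit):
    # Squared part correlation of the interaction term: R^2 with the
    # product term minus R^2 from the main-effects-only model.
    def r2(X):
        X = np.column_stack([np.ones(len(crit)), X])
        beta, *_ = np.linalg.lstsq(X, crit, rcond=None)
        return 1 - (crit - X @ beta).var() / crit.var()
    return (r2(np.column_stack([pred, mod, pred * mod]))
            - r2(np.column_stack([pred, mod])))

rng = np.random.default_rng(1)
n_honest, n_fake = 107, 109
pred = rng.normal(size=n_honest + n_fake)
mod = np.array([0] * n_honest + [1] * n_fake)  # instructed-faking condition
crit = (1 - mod) * pred + rng.normal(scale=0.8, size=n_honest + n_fake)

# Grow the sample three instructed fakers at a time, as in the study,
# recording the moderator effect size at each sample composition
effects = []
for k in range(3, n_fake + 1, 3):
    keep = np.r_[np.arange(n_honest), n_honest + np.arange(k)]
    effects.append(interaction_delta_r2(pred[keep], mod[keep], crit[keep]))
print(len(effects), round(max(effects), 3))
```

Plotting these effect sizes against the percentage of fakers reproduces the logic behind Figures 3 and 4, although the exact shapes depend on the simulated data.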
In terms of statistical significance (Figure 3),
results demonstrated that Impression Management
scale scores were almost always a weaker moderator
than the experimental faking condition. Further, even
the instructed faking condition failed to consistently
return a significant moderator effect (p < .05) when
the percentage of fakers was less than 12.30% (i.e., <
15 fakers of 122 respondents) or greater than 82.58%
(i.e., > 109 fakers of 132 participants). With a more
stringent significance level (i.e., p < .01), Impression
Management scale scores failed as a moderator for
any ratio of instructed fakers to honest respondents.
Further, a moderating effect for the experimental faking condition only consistently emerged when the
percentage of instructed fakers was between 29.61%
(45 of 152 respondents) and 71.24% (109 of 153 participants).
As a specific example, when 42 instructed fakers
had been added to the sample of 107 respondents
instructed to answer honestly, not only did
Impression Management scale scores fail to moderate
validity (p > .01), but so too did the experimental faking condition (p > .01). Yet, surprisingly, for this identical subsample, Paulhus' (1998, p. 10) Impression
Management scale cut-off score for distinguishing
invalid (i.e., scores > 8) from valid protocols resulted
in a correct classification rate of 77.18% (115 of 149),
χ²(1, N = 149) = 33.04, p < .00001. Thus, an apparent
paradox exists whereby a measure can correctly classify fakers and honest respondents to a significant
and substantial degree, yet is incapable of significantly moderating validity. This paradox is even more
anomalous considering that, whereas the application
of the Impression Management scale cut-off score
yields a dichotomous variable (i.e., either valid or
invalid), the moderated multiple regression analysis
uses a continuous variable (i.e., Impression
Management scale scores), where statistical power
should be greater.
Consideration of statistical significance is, however, influenced by sample size; examination (Figure 4) of moderator effect size (i.e., part correlation squared, or variance accounted for), which is independent of total sample size, assists in clarifying the
above paradox. Figure 4, in conjunction with Figure
3, demonstrates three particularly noteworthy points.


First, the Impression Management scale, although a highly valid indicator of faking (i.e., correlation of .60 with the experimental condition), is generally a nonsignificant moderator of validity and has a less than small effect size, accounting for less than 2% of the
variance in the criterion scores. Second, faking condition as a moderator in multiple regression yields an
effect size for socially desirable responding that is, on
average, five times (Median = 5.3 times) greater than
that for the Impression Management scale. Third,
faking condition as a moderator generally indicates
significant and more than small effect sizes, accounting for up to 7.3% of the variance in the criterion.
Discussion
Results for Study 3 contribute by demonstrating
that a highly valid index of faking can have substantial limitations for demonstrating the moderating
effects of socially desirable responding on personality
scale validity. Although a large effect for faking was
present (Cohen's q = .50 for the difference in group
validities) and the Impression Management scale
demonstrated a large effect size in detecting fakers
(Cohen's d = 1.50), as a moderator of validity, the
scale manifested notable shortcomings. The
Impression Management scale both failed to moderate, in terms of statistical significance, the validity of
self-reported personality and provided a severe
underestimation, in terms of effect size, of the moderating influence of faking on validity.
General Discussion
This research contributes by highlighting three
considerations when obtaining nonsignificant findings for the influence of socially desirable responding
on personality scale validity in naturalistic settings.
First, such socially desirable responding influences,
relative to experimental research, may appear to be
attenuated (even to the point of statistical nonsignificance). In particular, Studies 1 and 2 demonstrate: a)
this attenuated effect between experimental and nonexperimental data; b) a significant, negative association of large effect size between impression management and personality scale average validity in natural responding; and c) that, despite this large negative
association, substantial personality scale validity can
still exist for extreme scorers on measures of socially
desirable responding. For this latter point, although
substantial validity remained across levels of naturally occurring social desirability, a variation of over 10% in the criterion variance explained constituted
more than a small effect size.
Second, extreme scores on a scale of impression management may reflect more than faking. The current three studies indicate that, although instructed fakers may provide extreme scores on a measure of
socially desirable responding and may present personality self-report that has reduced or no validity,
other respondents who have extreme scores on the
same socially desirable responding scale present substantially valid personality self-report.
Third, failures to detect moderating effects on personality scale validity for socially desirable responding may reflect power issues associated with imbalances in the distribution of valid and invalid
responding or may represent psychometric limitations of what is a strong scale of socially desirable
responding. When a nonsignificant validity-moderating effect for a measure of socially desirable responding is found, multiple interpretations are possible
and, consequently, null results should not lead
unequivocally to the conclusion that socially desirable responding is a nonissue for personality scale
validity. An alternative interpretation is that the valid
scale of faking (e.g., the Impression Management
scale) is not up to the psychometric challenge of
moderating validity. Study 3 (Figure 4) emphasizes a
marked discrepancy, based on the same respondents,
between moderator effects measured by group
assignment versus a scale of socially desirable
responding. Thus, a scale of socially desirable
responding may be highly valid but be less than an
adequate moderator.
Imagine the following. As part of a job application, a potential employee completes a personality
scale that has predictive validity for job performance.
Rather than answering the scale honestly, the applicant opts to respond so as to maximize the chances of
being hired. In doing so, the potential employee
fakes. The personality scale, however, has a psychometrically strong validity indicator to detect faking.
Will the dissimulating applicant be identified as a
faker? Probably: if the validity indicator is as good
as the Impression Management scale and if the
nature of faking is similar to that induced in Study 3,
then there would be an over 75% chance of correctly
identifying this person as a faker. Would faking moderate the validity of the personality scale? Very probably not: if the applicant's data were pooled with
normative data comprising both predictor and criterion scores, moderated multiple regression procedures would fail to detect the validity-moderating
effect of the psychometrically strong validity indicator unless the scale was better than merely psychometrically adequate and unless there was a substantial mix of both faking and nonfaking respondents in
the pool. Does this failure mean that the validity indicator is invalid? Not at all: the validity indicator



may well have made a correct decision if it identified
the respondent as a faker. Even if the index made an
incorrect classification in this instance, it may still
possess substantial validity. Does the nonsignificant
moderator finding mean that faking is not an issue
for the personality scale? No: regardless of statistical
significance (or lack thereof) for the moderator, this
respondent provided distorted responses. Faking
remains an issue for the validity of individual protocols regardless of whether or not it is an issue for the
personality scale in general.
Present results confirm that respondents can fake
on a self-report personality measure when instructed
to do so. These findings confirm those of others and
indicate a generalizable effect across areas of personality assessment (e.g., Dunnette, McCartney, Carlson,
& Kirchner, 1962; Hough et al., 1990; Jackson et al.,
2000; Rosse et al., 1998).
Even if individuals can fake on self-report personality scales when instructed to do so, does faking
occur naturally? Although debatable, empirical evidence indicates that this may be the case. In a comparison of job applicants and nonapplicants, Stark,
Chernyshenko, Chan, Lee, and Drasgow (2001)
reported substantial differential item functioning for
each of 15 noncognitive personality scales. Rosse et
al. (1998) found that, relative to incumbents, job
applicants for a property management firm scored an
average of 0.65 standard deviations higher on appropriate personality measures. For 3,760 applicants for
security officer positions, Jackson et al. (2000) reported that the effect of faking in this natural context
exceeded that associated with experimentally
induced faking in a research setting. Dunnette et al.
(1962) indicated that, for sales applicants, natural faking may occur and that it may be present in 14% of
applicants. Among 271 job applicants for airline pilot
positions, Butcher, Morfitt, Rouse, and Holden (1997)
reported that 27% were faking. In a study of attendants at a professional training centre, Rees and
Metcalfe (2003) reported that 17% of personnel staff
would probably or definitely present themselves
favourably when completing a personality inventory
as part of a job application procedure. In contrast to
these findings, Hough et al. (1990) found no evidence
to indicate the presence of faking among 125 applicants at a military entrance processing station. Rosse
et al., however, have suggested that Hough et al.'s
nonsignificant findings are associated with applicants who actually had already been sworn into the
military and who may not be representative of more
typical job applicants. It appears, therefore, that natural faking on self-report personality measures does occur, that it occurs in job applicant settings, and that the size of the effect does vary, sometimes being less than and sometimes exceeding the effect size associated with experimental faking. The estimated percentage of job applicants who fake varies substantially and may be a function of the population sampled
and the assessment situation. Further research is
needed to delineate rates of faking for various applicant populations, occupations, and assessment contexts.
Can fakers be detected? Present findings indicate
that a response distortion scale (i.e., the Impression
Management scale) can detect individuals instructed
to fake. Despite not being a significant moderator of
criterion validity in the third study, the Impression
Management scale correctly identified, to a significant degree, over 77% of respondents as being either
in the Faking or Honest instruction condition. An
extraordinarily large effect size for faking on the
Impression Management scale (i.e., Cohen's d = 1.50)
attested to the construct validity of this socially desirable responding scale as have effect sizes from other
dissimulation studies (Holden et al., 2000, 2003;
MacNeil & Holden, 2005). Thus, validity scales, in
particular the Impression Management scale, detect
the response distortion of experimentally instructed
fakers. Nevertheless, even if excellent identification
hit rates are associated with a scale that detects faking, substantial misclassifications occur. Consider
that, in the third study, the Impression Management
scale's classification hit rate of an impressive 78.24% misclassified 21 of 107 honest-instruction respondents as fakers and 26 of 109 faking-instruction respondents as honest. The Impression Management
scale may have strong construct validity, but it is far
from an infallible faking indicator. Its irrelevant variance may be random error or may reflect an overlapping (Cunningham, Wong, & Barbee, 1994; Meston,
Heiman, Trapnell, & Paulhus, 1998) or alternative
construct such as interpersonal sensitivity (Holden &
Fekken, 1989).
Does faking affect the validity of a self-report personality scale? Present data indicate that experimentally induced faking does influence scale validity.
This result is not novel, but replicates an established
phenomenon (Viswesvaran & Ones, 1999). For example, for the NEO-FFI scales, Topping and O'Gorman
(1997) report significantly different mean validities of
.54 and .22 for honest and faking good groups,
respectively. With experimental faking, it generally
appears that faking good results in validities that are
less than those associated with honest responding.
Does natural faking affect the validity of self-report? Although Hough et al. (1990) seemingly
answer this question negatively, there are mitigating


issues that merit evaluation. Consider that, in the shift from studies of instructed faking to noninstructed faking, typically two factors are confounded. While there is a change from instructed faking
to natural faking, there is usually a concomitant
change in the method of measurement of this independent variable, from experimental treatment condition to a scale of faking (e.g., socially desirable
responding). In the present studies, however, it was
possible to disentangle this confound. In Study 1,
both faking condition and the Impression
Management scale demonstrated moderator effects
on scale validity. In Study 2, where there was no
experimentally induced faking but rather a range of
Impression Management scale scores that indicated
the presence of naturally occurring faking, validity-moderating effects for the Impression Management
scale, at one level, seemed attenuated relative to
Study 1. Nevertheless, despite substantial validity
existing for the self-reported personality of relatively
extreme scorers on a scale of impression management, there was a negative linear association (-.86) of
large effect size between socially desirable responding and average validity. Three considerations are
noteworthy here. First, using the demonstrably highly valid Impression Management scale, natural faking produced less detectable moderating effects on
validity than experimentally induced faking. As
Figure 4 articulated, a highly regarded scale of socially desirable responding severely underestimated (by
a factor of five) the effect size for experimentally
induced faking. This highlights a potential shortcoming for scales of socially desirable responding and raises the
issue as to whether similar underestimations of faking effects have occurred in naturalistic research on
socially desirable responding. This underestimation
could be an explanatory factor for concerns about the
generalizability of results from experimental faking
studies to naturally occurring misrepresentation
(Ones & Viswesvaran, 1998; Ones, Viswesvaran, &
Reiss, 1996; Smith & Ellingson, 2002). Research by
Schmitt and Oswald (2006) has attempted to bridge
the gap between experimental and natural faking
research through Monte Carlo methods. Whether
their null findings (for corrections for faking) generalize from artificial data with a specific range of simulation scenarios to either experimental or natural
faking in actual respondents remains to be verified.
Second, validity-moderating effects in Study 2
were of large effect size when aggregated across
scales, but much weaker when moderated multiple
regression analysis was undertaken at the individual
scale level.4 This highlights the difficulties associated
with detecting moderator effects in nonexperimental

designs where score unreliability and nonoptimal distributions may result in substandard statistical
power (McClelland & Judd, 1993). Although results
for natural distortion may arguably be weaker than
for experimentally instructed faking, statistical
power issues may mitigate the interpretation of null
results.
A third consideration is that naturally occurring faking may not have been present in the current second study.
Because the Impression Management scale is not an
isomorphic indicator of faking, high and low scorers
on this scale may represent relatively interpersonally
sensitive and insensitive persons, respectively
(Holden & Fekken, 1989), rather than fakers.5 This
alternative view may explain why extreme scorers in
Study 2 and extreme scorers in the nonfaking instructional conditions of Study 1 provided self-report of
substantial validity. If true for one of the premier
scales of socially desirable responding, then this consideration is also relevant for other socially desirable
responding indicators (e.g., Piedmont et al., 2000)
that may not possess psychometric strengths equal to
those of Paulhus' Impression Management scale.6
Are different processes involved in instructed and
natural faking? Reviews (Hough & Oswald, 2000;
Ones & Viswesvaran, 1998) argue that the effects for
4 With aggregation, re-examination of Hough et al.'s (1990) validities (p. 591, their Table 7) challenges socially desirable responding as a nonmoderator of validity. In their Table 7, 10 of 33 validity comparisons were significant (versus a chance rate under two). Further, 22 of 31 validities (with two ties) were greater for the "accurate" group than the "overly desirable" group, χ²(1, N = 31) = 4.65, p < .05. Additionally, although mean validities were only .17 and .15 for the "accurate" and "overly desirable" groups, respectively, paired comparisons across scales and criteria indicate a significant difference, t(32) = 2.97, p < .01, and an effect size that is more than small, d = .52. For predicting effort and leadership, mean validities were .17 and .15 for the "accurate" and "overly desirable" groups, respectively, t(10) = 4.06, p < .01, d = 1.22. With the criterion of personal discipline, mean validities for "accurate" and "overly desirable" groups were .15 and .16, respectively, t(10) = -1.50, p > .05, d = 0.45. For the physical fitness criterion, mean validities were .20 and .16 for the "accurate" and "overly desirable" groups, respectively, t(10) = 4.48, p < .01, d = 1.35. Thus, additional analyses of Hough et al.'s data indicate that validities associated with "accurate" respondents exceed comparable validities found with "overly desirable" respondents and that the differences between groups are nontrivial.
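The summary statistics in this footnote can be reproduced from the reported counts and t values alone. A minimal Python sketch, assuming the 22-of-31 comparison used a Yates-corrected sign test against a 50/50 split and that d for a paired t-test was obtained as |t|/√n:

```python
import math

def yates_chi_square(successes, n):
    """Yates-corrected chi-square for a sign test against a 50/50 split."""
    expected = n / 2
    deviation = abs(successes - expected) - 0.5  # continuity correction
    return 2 * deviation ** 2 / expected

def cohens_d_from_paired_t(t, df):
    """Effect size for a paired t-test: d = |t| / sqrt(n), where n = df + 1."""
    return abs(t) / math.sqrt(df + 1)

# 22 of 31 validities (two ties excluded) favoured the "accurate" group
chi_sq = yates_chi_square(22, 31)             # ≈ 4.65, as reported
# Overall paired comparison across 33 scale-criterion pairs: t(32) = 2.97
d_overall = cohens_d_from_paired_t(2.97, 32)  # ≈ 0.52
d_effort = cohens_d_from_paired_t(4.06, 10)   # ≈ 1.22
d_fitness = cohens_d_from_paired_t(4.48, 10)  # ≈ 1.35
print(round(chi_sq, 2), round(d_overall, 2), round(d_effort, 2), round(d_fitness, 2))
```

Under these assumptions, every reported value in the footnote is recovered to two decimals, which supports reading the reanalysis as internally consistent.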

5 Paradoxical relationships with the Impression Management scale (e.g., negative correlations between the scale scores and psychopathy; Seto, Khattar, Lalumiere, & Quinsey, 1997) have been observed. Recently, Paulhus (2002) has indicated an evolution of the socially desirable responding construct, suggesting content-based as well as stylistic interpretations.

Socially Desirable Responding 199


instructed faking are generally greater than when
naturally occurring socially desirable responding
would be expected to occur, such as in job application contexts. Although this could indicate the presence of a single process of faking that differs quantitatively across experimental and nonexperimental
conditions, another interpretation is that qualitatively
different processes are involved. Other research
attests not only to the multidimensionality of socially
desirable responding (Paulhus, 1984) but also to the
multidimensionality of experimentally induced
impression management (Holden & Evoy, 2005;
Holden et al., 2003). Thus, additional research may be
required to elucidate the cognitive mechanism(s)
associated with distorted self-report.
Limitations to this research exist. Present studies:
a) were conducted with undergraduates, b) focused
on the Five-Factor Model of personality as operationalized by the NEO-FFI, c) used peer-report for
Five-Factor Model dimensions as criteria, and d)
assessed socially desirable responding and faking
either through Paulhus' Impression Management
scale or through experimental conditions associated
with particular instructions. The use of other scales of
socially desirable responding and personality will
establish the stability of current findings. Further,
despite Funder's (1991) praise for using peer ratings, there is no gold standard as to what constitutes an appropriate criterion for personality scales, and many applied researchers choose to focus on job-relevant outcomes (e.g., employee turnover, performance ratings). The degree to which current findings extend to these applied outcomes within a nomological personality network remains to be confirmed. Additionally, replication with contexts and respondents from populations where motivations to distort self-presentation may be naturally present (e.g., employment contexts, workers' compensation clients) will provide evaluations of the generalizability of the present results.
In conclusion, the present research contributes by
demonstrating that:

6 An issue about the generalizability of Hough et al.'s (1990) and Piedmont et al.'s (2000) findings exists. Those studies used measures of socially desirable responding whose psychometric properties are not as well established as other scales of response distortion, such as the Paulhus Impression Management scale. Further, in contrast to other studies (e.g., Rosse et al., 1998), job applicants in Hough et al.'s study scored lower than job incumbents on socially desirable personality scale dimensions. Thus, for the Hough et al. study, replicability and generalizability of findings to other populations (e.g., nonmilitary) and to other socially desirable responding scales still await confirmation.

• Under standard instructions, personality inventory respondents have a broad range of socially desirable responding, including extreme zones that are designated as reflecting invalidity.
• Under standard instructions, self-report associated with these extreme zones has substantial validity and has greater validity than for instructed fakers scoring in these same extreme zones.
• Nevertheless, under standard instructions, the average validity of self-report is moderated significantly and with a large effect size by socially desirable responding.
• When faking is present, even as a large effect size, faking will moderate validity only when the ratio of fakers to nonfakers is not extreme.
• Even if highly valid in measuring faking and detecting fakers, socially desirable responding scales such as Paulhus' Impression Management scale are fallible and, consequently, can severely underestimate the effect size and statistical significance of the validity-moderating influence of socially desirable responding.
• Test users should continue to be alert to contaminations of self-report by all invalid sources of variance and should not prematurely dismiss concerns about socially desirable responding.
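The last three conclusions can be illustrated with a toy simulation; all numbers and the distortion model below are invented for illustration, not taken from the present studies. When self-report fidelity declines as impression management rises, subgroup validities diverge, and a fallible impression-management measure that misclassifies respondents shrinks the observed validity gap:

```python
import math
import random

def pearson_r(x, y):
    """Plain Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(42)
n = 4000
trait = [random.gauss(0, 1) for _ in range(n)]
im = [random.random() for _ in range(n)]  # impression management: 0 = candid
# Toy model: self-report tracks the true trait less faithfully as im rises.
self_report = [t * (1 - i) + random.gauss(0, 0.5) for t, i in zip(trait, im)]
criterion = [t + random.gauss(0, 0.5) for t in trait]  # e.g., a peer rating

def subgroup_validity(scores):
    """Self-report/criterion correlations for low vs. high scorers."""
    low = [k for k in range(n) if scores[k] < 0.5]
    high = [k for k in range(n) if scores[k] >= 0.5]
    return tuple(pearson_r([self_report[k] for k in g],
                           [criterion[k] for k in g]) for g in (low, high))

r_low, r_high = subgroup_validity(im)  # split on the true moderator
# A fallible scale misclassifies 25% of respondents across the cut score.
noisy_im = [1 - i if random.random() < 0.25 else i for i in im]
r_low_noisy, r_high_noisy = subgroup_validity(noisy_im)
print(round(r_low, 2), round(r_high, 2))              # clear validity gap
print(round(r_low_noisy, 2), round(r_high_noisy, 2))  # gap is attenuated
```

The simulation shows the attenuation mechanism only in outline: even a moderately accurate impression-management measure recovers a smaller between-group validity difference than the true moderator would, consistent with the point that observed moderation effects can understate the real ones.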
This research was supported by the Social Sciences
and Humanities Research Council of Canada.
Correspondence concerning this article should be
addressed to Ronald R. Holden, Department of
Psychology, Queen's University, Kingston, Ontario,
Canada K7L 3N6 (E-mail: holdenr@post.queensu.ca).
References
Barrick, M. R., & Mount, M. K. (1996). Effects of impression management and self-deception on the predictive
validity of personality constructs. Journal of Applied
Psychology, 81, 261-272.
Brown, R., & Barrett, P. (1999, June). Differences between
applicant and non-applicant personality questionnaire data.
British Psychological Society Test User Conference. In
published Conference Proceedings, pp. 76-86.
Leicester: British Psychological Society.
Butcher, J. N., Morfitt, R. C., Rouse, S. V., & Holden, R. R.
(1997). Reducing MMPI-2 defensiveness: The effect of
specialized instructions on retest validity in a job
applicant sample. Journal of Personality Assessment, 68,
385-401.
Cohen, J. (1992). A power primer. Psychological Bulletin,
112, 155-159.
Costa, P. T. Jr., & McCrae, R. R. (1988). From catalog to
classification: Murray's needs and the Five-Factor

Model. Journal of Personality and Social Psychology, 55,
258-265.
Costa, P. T. Jr., & McCrae, R. R. (1989). The NEO-PI/NEO-FFI manual supplement. Odessa, FL: Psychological
Assessment Resources.
Costa, P. T. Jr., & McCrae, R. R. (1991). NEO Five-Factor
Inventory Form S. Odessa, FL: Psychological
Assessment Resources.
Costa, P. T. Jr., & McCrae, R. R. (1992). NEO PI-R professional manual: Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI).
Odessa, FL: Psychological Assessment Resources.
Cunningham, M. R., Wong, D. T., & Barbee, A. P. (1994).
Self-presentation dynamics on overt integrity tests:
Experimental studies of the Reid Report. Journal of
Applied Psychology, 79, 643-658.
Douglas, E. F., McDaniel, M. A., & Snell, A. F. (1996). The
validity of non-cognitive measures decays when applicants fake. In J. B. Keyes & L. N. Dosier (Eds.),
Proceedings of the Academy of Management (pp. 127-131).
Madison, WI: Omnipress.
Dunnette, M. D., McCartney, J., Carlson, H. C., &
Kirchner, W. K. (1962). A study of faking behavior on a
forced-choice self-description checklist. Personnel
Psychology, 15, 13-24.
Edwards, A. L. (1990). Construct validity and social desirability. American Psychologist, 45, 287-289.
Ellingson, J. E., Smith, D. B., & Sackett, P. R. (2001).
Investigating the influence of socially desirable
responding on personality factor structure. Journal of
Applied Psychology, 86, 122-133.
Funder, D. C. (1991). Global traits: A neo-Allportian
approach to personality. Psychological Science, 2, 31-39.
Funder, D. C., & Colvin, C. R. (1988). Friends and
strangers: Acquaintanceship, agreement, and the accuracy of personality judgment. Journal of Personality and
Social Psychology, 55, 149-158.
Funder, D. C., & Dobroth, K. M. (1987). Differences
between traits: Properties associated with interjudge
agreement. Journal of Personality and Social Psychology,
52, 409-418.
Goldberg, L. R. (1992). The development of markers for
the Big-Five factor structure. Psychological Assessment,
4, 26-42.
Holden, R. R., Book, A. S., Edwards, M. J., Wasylkiw, L.,
& Starzyk, K. B. (2003). Experimental faking in self-reported psychopathology: Unidimensional or multidimensional? Personality and Individual Differences, 35, 1107-1117.
Holden, R. R., & Evoy, R. A. (2005). Personality inventory
faking: A four-dimensional simulation of dissimulation. Personality and Individual Differences, 39, 1307-1318.
Holden, R. R., & Fekken, G. C. (1989). Three common

socially desirable responding scales: Friends, acquaintances, or strangers? Journal of Research in Personality,
23, 180-191.
Holden, R. R., Starzyk, K. B., McLeod, L. D., & Edwards,
M. J. (2000). Comparisons among the Holden
Psychological Screening Inventory (HPSI), the Brief
Symptom Inventory (BSI), and the Balanced Inventory
of Desirable Responding (BIDR). Assessment, 7, 163-175.
Holden, R. R., Wood, L. L., & Tomashewski, L. (2001). Do
response time limitations counteract the effect of faking on personality inventory validity? Journal of
Personality and Social Psychology, 81, 160-169.
Hough, L. M., Eaton, N. K., Dunnette, M. D., Kamp, J. D.,
& McCloy, R. A. (1990). Criterion-related validities of
personality constructs and the effect of response distortion on those validities. Journal of Applied
Psychology, 75, 581-595.
Hough, L. M., & Oswald, F. L. (2000). Personnel selection:
Looking toward the future – remembering the past.
Annual Review of Psychology, 51, 631-664.
Jackson, D. N., Wroblewski, V. R., & Ashton, M. C. (2000).
The impact of faking on employment test validity:
Does forced-choice offer a solution? Human
Performance, 13, 371-388.
MacNeil, B., & Holden, R. R. (2005, June). Detecting faking
in self-report personality assessment: A comparison of
validity indices. Paper presented at the Canadian
Psychological Association Annual Convention,
Montral, Canada.
Marshall, M. B., De Fruyt, F., Rolland, J.-P., & Bagby, R.
M. (2005). Socially desirable responding and the factorial stability of the NEO PI-R. Psychological Assessment,
17, 379-384.
McClelland, G. H., & Judd, C. M. (1993). Statistical difficulties of detecting interactions and moderator effects.
Psychological Bulletin, 114, 376-390.
Meston, C. M., Heiman, J. R., Trapnell, P. D., & Paulhus,
D. L. (1998). Socially desirable responding and sexuality self-reports. The Journal of Sex Research, 35, 148-157.
Nicholson, R. A., & Hogan, R. (1990). The construct validity of social desirability. American Psychologist, 45, 290-292.
O'Grady, K. E. (1988). The Marlowe-Crowne and
Edwards socially desirable responding scales: A psychometric perspective. Multivariate Behavioral Research,
23, 87-101.
Ones, D. S., & Viswesvaran, C. (1998). The effects of
socially desirable responding and faking on personality and integrity assessment for personnel selection.
Human Performance, 11, 245-269.
Ones, D. S., Viswesvaran, C., & Reiss, A. D. (1996). Role of
socially desirable responding in personality testing for
personnel selection: The red herring. Journal of Applied



Psychology, 81, 660-679.
Paulhus, D. L. (1984). Two-component model of socially
desirable responding. Journal of Personality and Social
Psychology, 46, 598-609.
Paulhus, D. L. (1991). Measurement and control of
response bias. In J. P. Robinson, P. R. Shaver, & L. S.
Wrightsman (Eds.), Measures of personality and social
psychological attitudes, Vol. 1 (pp. 17-59). San Diego, CA:
Academic Press.
Paulhus, D. L. (1993). Bypassing the will: The automatization of affirmations. In D. M. Wegner & J. W. Pennebaker (Eds.), Handbook of mental control (pp. 573-587). Upper Saddle River, NJ: Prentice-Hall.
Paulhus, D. L. (1998). Paulhus Deception Scales (PDS) user's
manual. North Tonawanda, NY: Multi-Health Systems.
Paulhus, D. L. (2002). Socially desirable responding: The
evolution of a construct. In H. I. Braun, D. N. Jackson,
& D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 49-69). Mahwah,
NJ: Erlbaum.
Paulhus, D. L., Bruce, M. N., & Trapnell, P. D. (1995).
Effects of self-presentation strategies on personality
profiles and their structure. Personality and Social
Psychology Bulletin, 21, 100-108.
Pauls, C. A., & Crost, N. W. (2004). Effects of faking on
self-deception and impression management scales.
Personality and Individual Differences, 37, 1137-1151.
Paunonen, S. V., & Jackson, D. N. (1985). Idiographic
measurement strategies for personality and prediction:
Some unredeemed promissory notes. Psychological
Review, 92, 486-511.
Piedmont, R. L., McCrae, R. R., Riemann, R., &
Angleitner, A. (2000). On the invalidity of validity
scales: Evidence from self-reports and observer ratings
in volunteer samples. Journal of Personality and Social
Psychology, 78, 582-593.
Rees, C. J., & Metcalfe, B. (2003). The faking of personality
questionnaire results: Whos kidding whom? Journal of
Managerial Psychology, 18, 156-165.
Rosenthal, R., & DiMatteo, M. R. (2001). Meta-analysis:

Recent developments in quantitative methods for literature reviews. Annual Review of Psychology, 52, 59-82.
Rosnow, R. L., & Rosenthal, R. (2002). Contrasts and correlations in theory assessment. Journal of Pediatric
Psychology, 27, 59-66.
Rosse, J. G., Stecher, M. D., Miller, J. L., & Levin, R. A.
(1998). The impact of response distortion on preemployment personality testing and hiring decisions.
Journal of Applied Psychology, 83, 634-644.
Schmit, M. J., & Ryan, A. M. (1993). The Big Five in personnel selection: Factor structure in applicant and
nonapplicant populations. Journal of Applied
Psychology, 78, 966-974.
Schmitt, N., & Oswald, F. L. (2006). The impact of corrections for faking on the validity of noncognitive measures in selection settings. Journal of Applied Psychology,
91, 613-621.
Seto, M. C., Khattar, N. A., Lalumiere, M. L., & Quinsey,
V. L. (1997). Deception and sexual strategy in psychopathy. Personality and Individual Differences, 22, 301-307.
Smith, D. B., & Ellingson, J. E. (2002). Substance versus
style: A new look at socially desirable responding in
motivating contexts. Journal of Applied Psychology, 87,
211-219.
Stark, S., Chernyshenko, O. S., Chan, K.-Y., Lee, W. C., &
Drasgow, F. (2001). Effects of the testing situation on
item responding: Cause for concern. Journal of Applied
Psychology, 86, 943-953.
Topping, G. D., & O'Gorman, J. G. (1997). Effects of faking set on validity of the NEO-FFI. Personality and
Individual Differences, 23, 117-124.
Viswesvaran, C., & Ones, D. S. (1999). Meta-analyses of
fakability estimates: Implications for personality
assessment. Educational and Psychological Measurement,
59, 197-210.
Received March 13, 2006
Revised September 1, 2006
Accepted September 12, 2006
