Scale Development Research
A Content Analysis and Recommendations
for Best Practices

Roger L. Worthington
University of Missouri–Columbia
Tiffany A. Whittaker
University of Texas at Austin

The authors conducted a content analysis on new scale development articles appearing
in the Journal of Counseling Psychology during 10 years (1995 to 2004). The authors
analyze and discuss characteristics of the exploratory and confirmatory factor analysis
procedures in these scale development studies with respect to sample characteristics,
factorability, extraction methods, rotation methods, item deletion or retention, factor
retention, and model fit indexes. The authors uncovered a variety of specific practices
that were at variance with the current literature on factor analysis or structural equa-
tion modeling. They make recommendations for best practices in scale development
research in counseling psychology using exploratory and confirmatory factor analysis.

Counseling psychology has a rich tradition of producing psychometrically
sound instruments for applications in research, training, and practice. Many
areas of scholarly inquiry in counseling psychology continue to be ripe for
scale development research. In a special issue of the Journal of Counseling
Psychology (JCP) on quantitative research methods, Dawis (1987) pro-
vided an overview of scale development techniques, Tinsley and Tinsley
(1987) discussed the use of factor analysis, and Fassinger (1987) presented
an overview of structural equation modeling (SEM). Although these articles
continue to be cited in counseling psychology research, recent advances
require updated information and a comprehensive overview of all three top-
ics. More recently, Quintana and Maxwell (1999) and Martens (2005) pro-
vided comprehensive updates of SEM, but their focus was not specifically
on its use in scale development research (see also Martens & Haase, 2006
[this issue]; Weston & Gore, 2006 [TCP, special issue, part 1]).
The purpose of this article is threefold: (a) to provide an overview of the
steps taken in the scale development process using exploratory factor analy-
sis (EFA) and confirmatory factor analysis (CFA), (b) to assess current prac-

The authors contributed equally to the writing of this article. We would like to thank Jeffrey
Andreas Tan for his assistance with the content analysis. Address correspondence to Roger L.
Worthington, Department of Educational, School, and Counseling Psychology, University of
Missouri, Columbia, MO 65211; e-mail: WorthingtonR@missouri.edu
THE COUNSELING PSYCHOLOGIST, Vol. 34 No. 6, November 2006 806-838
DOI: 10.1177/0011000006288127
© 2006 by the Society of Counseling Psychology

tices by reporting the results of a 10-year content analysis of scale develop-
ment research in counseling psychology, and (c) to provide a set of recom-
mendations for best practices in using EFA and CFA in scale development
(for more on factor analysis, see Kahn, 2006 [TCP, special issue, part 1]). We
assume the reader has basic knowledge of psychometrics, including principles
of reliability (Helms, Henze, Sass, & Mifsud, 2006 [TCP, special issue, part
1]), validity (Hoyt, Warbasse, & Chu, 2006 [this issue]), and multivariate sta-
tistics (Sherry, 2006 [TCP, special issue, part 1]). We begin with an overview
of EFA and CFA, followed by a discussion of the procedure we used in con-
ducting our content analysis. We then embed the findings of our content
analysis within more detailed discussions of EFA and CFA, identifying poten-
tial problems and highlighting best practices. We conclude with an integrative
discussion of best practices and findings from the content analysis.

OVERVIEW OF EFA AND CFA

Factor analysis is a technique used to identify or confirm a smaller num-
ber of factors or latent constructs from a large number of observed variables
(or items). There are two main categories of factor analysis: (a) exploratory
and (b) confirmatory (Kahn, 2006 [TCP, special issue, part 1]). Although
researchers may use factor analysis for a range of purposes, one of the most
prevalent uses of factor-analytic techniques is to support the validity of
newly developed tests or scales—that is, does the newly developed test or
scale measure the intended construct(s)? More specifically, the application
of factor analysis to a set of items may help researchers answer the follow-
ing questions: How many factors or constructs underlie the set of items?
What are the defining features or dimensions of the factors or constructs
that underlie the set of items (Tabachnick & Fidell, 2001)?
EFA assesses the construct validity during the initial development of an
instrument. After developing an initial set of items, researchers apply EFA to
examine the underlying dimensionality of the item set. Thus, they can group a
large item set into meaningful subsets that measure different factors. The pri-
mary reason for using EFA is that it allows items to be related to any of the
factors underlying examinee responses. As a result, the developer can easily
identify items that do not measure an intended factor or that simultaneously
measure multiple factors, in which case they could be poor indicators of the
desired construct and eliminated from further consideration.
When used for scale development, EFA becomes a combination of qual-
itative and quantitative methods, which can be either confusing or enliven-
ing for researchers. We have found that novices (and some who are not
novice) hope to have the statistical program produce the ultimate solution
that will provide them with a set of empirically determined, indisputable
dimensions or factors. However, effectively using EFA procedures requires
researchers to use inductive reasoning, while patiently and subtly adjusting
and readjusting their approach to produce the most meaningful results.
Therefore, the process of scale development using EFA can become a rela-
tively dynamic process of examination and revision, followed by more
examination and revision, ultimately leading to a tentative rather than a
definitive outcome.
The most current approach in conducting CFA is to use SEM. Prior to
analyzing the data, a researcher must indicate (a) how many factors are
present in an instrument, (b) which items are related to each factor, and
(c) whether the factors are correlated or uncorrelated (issues that are
revealed during the process of EFA). Because items are generally constrained
to load on only one factor in CFA, the purpose is not to explore whether a
given item measures no factors, one factor, or multiple factors but rather to
evaluate or confirm the extent to which the researcher's measurement model
is replicated in the sample data. Thus, it
is critical to have prior knowledge of the expected relationships between
items and factors before conducting CFA—hence the term confirmatory.
SEM is a powerful confirmatory technique because it allows the
researcher greater control over the form of constraints placed on items
and factors when analyzing a hypothesized model. Furthermore, as we
discuss later, researchers can also use SEM to examine competing models
to assess the extent to which one hypothesized model fits the data better
than an alternative model. In our discussion, we provide information about
the basic concepts and procedures necessary to use SEM in scale devel-
opment research. For more advanced discussions of SEM, we refer read-
ers to several existing books and articles (e.g., Bollen, 1989; Kline, 2005;
Martens, 2005; Martens & Haase, 2006; Quintana & Maxwell, 1999;
Thompson, 2004).

CONTENT-ANALYSIS PROCEDURE

To provide context for our discussion of scale development best practices,
we conducted a content analysis of scale development articles in counsel-
ing psychology that reflect common practices. In this section, we provide
an overview of the article-selection process used in our content analysis.
We then integrate the findings of our content analysis into the remainder of
the article as we review the literature and recommend best practices for
scale development.
We reviewed scale development articles published in JCP in the 10 years
between 1995 and 2004, inclusive (see appendix for a list of articles). We
based our selection of articles on two central criteria: We included (a) only
new scale development research articles (i.e., we excluded articles investi-
gating only the reliability, validity, or revisions of existing scales) and (b)
only articles that reported results from EFA and CFA. A paid graduate stu-
dent assistant reviewed the tables of contents for each issue of JCP pub-
lished during the specified time frame. We instructed the graduate student
to err on the side of being overly inclusive, which resulted in the identifi-
cation of 38 articles that used EFA and CFA to examine the psychometric
properties of measurement instruments. The first author reviewed these
articles and eliminated 15 that did not meet the selection criteria, resulting
in 23 articles for our sample. Next, the first author and second author inde-
pendently evaluated the 23 articles to identify and quantify the EFA and
CFA characteristics. The only discrepancies in the independent evaluations
of the articles were because of clerical errors in recording descriptive infor-
mation (as opposed to disagreement in classification), which we jointly
checked and verified.
We were interested in a number of characteristics of the studies. For stud-
ies reporting EFA procedures, we were interested in the following: (a) sample
characteristics, (b) criteria for assessing the factorability of the correlation
matrix, (c) extraction methods, (d) criteria for determining rotation method,
(e) rotation methods, (f) criteria for factor retention, (g) criteria for item dele-
tion, and (h) purposes and criteria for optimizing scale length (see Table 1).
For studies reporting CFA procedures, we were interested in the follow-
ing: (a) using SEM versus alternative methods as a confirmatory approach,
(b) sample-size criteria, (c) fit indexes, (d) fit-index criteria, (e) cross-validation
indexes, and (f) model-modification issues (see Table 2).

THE PROCESS OF SCALE DEVELOPMENT RESEARCH

There are various strategies used in scale construction, often described
using somewhat differing labels for similar approaches. Brown (1983) sum-
marized three primary strategies: logical, empirical, and homogeneous.
Friedenberg (1995) identified a slightly different set of categories: logical-
content or rational, theoretical, and empirical, in which the latter contains
criterion group and factor analysis methods. The rational or logical
approach simply uses the scale developer’s judgments to identify or con-
struct items that are obviously related to the characteristic being measured.
The theoretical approach uses psychological theory to determine the con-
tent of the scale items. Neither the theoretical approach nor the rational or
logical approach remains a popular method in scale development. The more
rigorous empirical approach uses statistical analyses of item responses as
the basis for item selection based on (a) predictive utility for a criterion
group (e.g., depressives) or (b) homogeneous item groupings. The method
described in this article is an empirical approach that employs factor analy-
sis to form homogeneous item groupings.

TABLE 1: Characteristics of Exploratory Factor Analyses Used in Scale Development Studies Published in the Journal of Counseling Psychology (1995 to 2004)

Characteristic Frequency

Sample characteristics
    Convenience sample 5
    Purposeful sample of target group 10
    Convenience and purposeful sampling 6
Criteria used to assess factorability of correlation matrix
    Absolute sample size 1
    Item intercorrelations 1
    Participants per item ratio 3
    Bartlett's test of sphericity 5
    Kaiser-Meyer-Olkin test of sample adequacy 7
    Unspecified 11
Extraction method
    Principal-components analysis 9
    Common-factors analysis
        Principal-axis factoring 6
        Maximum likelihood 3
        Unspecified 1
    Combination principal-components analysis and common-factors analysis 1
    Unspecified 1
Criteria for determining rotation method
    Subscale intercorrelations 2
    Theory 3
    Both 1
    Other 3
    Unspecified 12
Rotation method
    Orthogonal
        Varimax 8
        Unspecified 1
    Oblique
        Promax 1
        Oblimin 3
        Unspecified 4
    Both orthogonal and oblique 3
    Unspecified 1
Criteria for item deletion or retention
    Loadings 16
    Cross-loadings 13
    Communalities 0
    Item analysis 1
    Other 3
    Unspecified 2
    No items were deleted 2
Criteria for factor retention
    Eigenvalues 18
    Scree plot 17
    Minimum proportion of variance accounted for by factor 2
    Number of items per factor 4
    Simple structure 5
    Conceptual interpretability 15
    Other 3
    Unspecified 2
Optimizing scale length
    None attempted 15
    Purpose
        Reduce total scale length 2
        Limit total items per factor 3
        Balance items per factor 2
    Criteria
        Redundant items 1
        Conceptually unrelated items 1
        Statistical invariance 1
        Cross-loadings 1
        Dropped items with lowest loadings 4
        Item content 2

NOTE: Values in each category may not sum to equal the total number of studies because
some studies may have reported more than one criterion or approach.

A number of authors have recommended similar sequences of steps to
be taken prior to using factor-analytic techniques (e.g., Anastasi, 1988;
Dawis, 1987; DeVellis, 2003). We review these preliminary steps in the fol-
lowing section because, as is the case in most scientific endeavors, early
mistakes in scale development often lead to problems later in the process.
Once we have described all the steps in some detail, we address the extent
to which the studies in our content analysis incorporated the steps in their
designs.
Although there is little variation between models proposed by different
authors, we rely primarily on DeVellis (2003) as the most current resource.
Thus, the following description is only one of several similar models available
and does not reflect a unitary best practice. DeVellis (2003) recommends the
following steps in constructing new instruments: (a) Determine clearly what
you want to measure, (b) generate an item pool, (c) determine the format of
the measure, (d) have experts review the initial item pool, (e) consider inclu-
sion of validation items, (f) administer items to a development sample, (g)
evaluate the items, and (h) optimize scale length.

TABLE 2: Characteristics of Confirmatory Factor Analyses Used in Scale Development Studies Published in the Journal of Counseling Psychology (1995 to 2004)

Characteristic Frequency

SEM versus FA as a confirmatory approach
    SEM used 14
    FA used 2
Typical SEM approaches
    Single-model approach 2
    Competing-models approach 8
        Nested models compared 4
        Nonnested or equivalent models compared 4
Sample-size criteria (SEM only)
    Participants per parameter 1
    Unspecified 13
Overall model fit
    Chi-square 12
    Chi-square and df ratio 6
Incremental fit indexes reported
    CFI 8
    PCFI 1
    IFI 2
    NFI 4
    NNFI/TLI 7
    RNI 1
Absolute fit indexes reported
    GFI 10
    AGFI 6
    RMSEA 6
    RMSEA with confidence intervals 1
    RMR 4
    SRMR 1
    Hoelter's N 1
Predictive fit indexes reported
    AIC 2
    CAIC 1
    ECVI 2
    BIC 1
Fit index criteria
    Recommended cutoff 11
    Unspecified 3
Model modification
    Lagrange multiplier 3
    Wald statistic 0
    Item parceling 2

NOTE: Values in each category may not sum to equal the total number of studies because
some studies may have reported more than one criterion or approach. AGFI = Adjusted
Goodness-of-Fit Index; AIC = Akaike's Information Criterion; BIC = Bayesian Information
Criterion; CAIC = Consistent Akaike's Information Criterion; CFI = Comparative Fit Index;
ECVI = Expected Cross-Validation Index; FA = Common-Factors Analysis; GFI = Goodness-
of-Fit Index; IFI = Incremental Fit Index; NFI = Normed Fit Index; NNFI/TLI = Nonnormed
Fit Index or Tucker-Lewis Index; PCFI = Parsimony Comparative Fit Index; RMR = Root
Mean-Square Residual; RMSEA = Root Mean-Square Error of Approximation; RNI =
Relative Noncentrality Index; SEM = Structural Equation Modeling; SRMR = Standardized
Root Mean-Square Residual.

In scale development, the first step is to define your construct clearly and
concretely, using both existing theory and research to provide a sound con-
ceptual foundation. This is sometimes more difficult than it may initially
appear because it requires researchers to distinctly define the attributes of
abstract constructs. Nothing is more difficult to measure than an ill-defined
construct because it leads to the inclusion of items that may be only periph-
erally related to the construct of interest or to the exclusion of items that are
important components of the content domain.
The next step is to generate a pool of items designed to tap the construct.
Ultimately, the objective is to arrive at a set of items that clearly represent the
construct of interest so that factor-analytic, data-reduction techniques yield a
stable set of underlying factors that accurately reflect the construct. Items that
are poorly worded or not central to a clearly articulated construct will introduce
potential sources of error variance, reducing the strength of correlations among
items, and will diminish the overall objectives of scale development (see
Quintana & Minami, 2006 [this issue], on dealing with measurement error in
meta-analyses). In general, researchers should write items so that they are clear,
concise, readable, distinct, and reflect the scale’s purpose (e.g., produce
responses that can be scored in a meaningful way in relation to the construct
definition). DeVellis (2003) and Anastasi (1988) offer a host of recommenda-
tions for generating quality items and choosing a response format that are
beyond the scope of this article. It suffices to say that the researcher should not
take the quality of the item pool lightly, and a carefully planned approach to
item generation is a critical beginning to scale development research.
Having the items reviewed by one or more groups of knowledgeable
people (experts) to assess item quality on a number of different dimensions
is another critical step in the process. At a minimum, expert review should
involve an analysis of content validity (e.g., the extent to which a set of items
reflects the content domain). Experts can also evaluate items for clarity,
conciseness, grammar, reading level, face validity, and redundancy. Finally,
it is also helpful at this stage for experts to offer suggestions about adding new
items and about the length of administration.
Although it is possible to include additional scales for participants to
complete that may provide information about convergent and discriminant
validity, we recommend that researchers limit such efforts at this stage of
development. We recommend this for two reasons. First, it is wise to keep
the total questionnaire length as short as possible and directly related to the
study’s central purpose. The longer the questionnaire, the less likely poten-
tial participants will be to volunteer for the study or to complete all the items
(Converse & Presser, 1986). Scale development studies sometimes include
as many as 3 to 4 times the number of items that will eventually end up on
the instrument, making inclusion of additional scales prohibitive. Second,
there are several ways that items from other measures may interact with
items designed for the new instrument to affect participant responses and,
thus, to interfere in the scale development process. In particular, it would be
very difficult, if not impossible, to control for order effects of different mea-
sures while testing the initial factor structure for the new scale. Randomly
administering existing measures with the other instruments might contami-
nate participants’ responses on the items for the new scale, but administer-
ing the new items first to avoid contamination eliminates an important
procedure commonly used when researchers use multiple self-report scales
concurrently within a single study. Thus, we believe that it is important to
avoid influencing item responses during the initial phase of scale develop-
ment by limiting the use of additional measures. Although ultimately a mat-
ter of researcher judgment, assessing the convergent and discriminant
validity (e.g., correlation with other measures) is an important step that we
believe should occur later in the process of scale development.
Of the 23 studies in our content analysis, 14 reported a construct or scale
definition that guided item generation, and all but 2 studies indicated that
item generation was based on prior theoretical and empirical literature
in the field. Occasionally, however, we found that articles provided only
sparse details in the introductory material articulating the theoretical
foundations for the research. The studies in our review used various item-
generation approaches. All the approaches involved some form of rational
item generation, with the primary variations involving the combination of
rational and empirical approaches. Although the extensiveness and specific
approaches of the procedures varied widely, only a few studies (n = 2) did not
include (or failed to report) expert review of item sets prior to conducting
EFA or CFA. Finally, our content analysis showed three typical patterns with
respect to the inclusion of validity items during administration to the initial
development sample: (a) administering only the scale items (no validity items
being included), (b) assessing only social desirability along with the scale
items, or (c) administering numerous other scales along with the scale items
to provide additional evidence of convergent and discriminant validity.

THE ORDERING OF EFA AND CFA IN NEW SCALE DEVELOPMENT RESEARCH

Researchers typically use CFA after an instrument has already been
assessed using EFA, and they want to know if the factor structure produced
by EFA fits the data from a new sample. An alternative, less typical approach,
is to perform CFA to confirm a theoretically driven item set without the prior
use of EFA. However, Byrne (2001) stated that “the application of CFA pro-
cedures to assessment instruments that are still in the initial stages of devel-
opment represents a serious misuse of this analytic strategy” (p. 99).
Furthermore, reporting the findings of a single CFA is of little advantage over
conducting a single EFA. Specifically, research has shown that exploratory
methods (i.e., principal-axis and maximum-likelihood factor analysis) are
able to recover the correct factor model satisfactorily a majority of the time
(Gerbing & Hamilton, 1996). In addition, a key validity issue is the replica-
tion of the hypothesized factor structure using a new sample. Thus, rather
than produce a CFA that would ultimately need to be followed by a second
CFA, the most logical approach would be to conduct an EFA followed by a
CFA in all cases. Thus, when developing new scales, researchers should con-
duct an EFA first, followed by CFA. Regardless of how effectively the
researcher believes item generation has reproduced the theorized latent vari-
ables, we believe that the initial validation of an instrument should involve
empirically appraising the underlying factor structure (i.e., EFA).
Of the 23 new scale development articles we reviewed, a significant major-
ity conducted EFA followed by CFA (n = 10) or only EFA without CFA
(n = 8). One article reported using SEM following EFA, but the procedure
was inconsistent with CFA. Two smaller subsets of articles reported only
CFA (n = 2) or conducted CFA followed by EFA (n = 2). In the two stud-
ies in which EFA followed CFA, researchers had produced theoretically
derived instruments that they believed required only a confirmation of the
hypothesized factor structure (which proved wrong in both cases). As a
result, when the hypothesized factor structure did not fit the data using
SEM, the researchers reverted to EFA (using the same sample) as a means
of uncovering the underlying factor structure—a somewhat questionable
procedure that could have been avoided if they had relied on EFA in the
first place. The studies that successfully used only CFA included one that
reported only a single CFA and another that reported two consecutive CFAs
(in which the second replicated the findings of the first).

EFA

Development sample characteristics. Representativeness in scale develop-
ment research does not follow conventional wisdom—that is, it is not neces-
sary to closely represent any clearly identified population as long as those
who would score high and those who would score low are well represented
(Gorsuch, 1997). Furthermore, one reason many scholars have consistently
advocated for large samples in scale development research (see further on) is
that scale variance attributable to specific participants tends to be cancelled
by random effects as sample size increases (Tabachnick & Fidell, 2001).
Nevertheless, samples that do not adequately represent the population of
interest affect factor-structure stability and generalizability. When all partici-
pants are drawn from a particular source sharing certain characteristics (e.g.,
age, education, socioeconomic status, and racial and ethnic group), even large
samples will not sufficiently control for the systematic variance produced by
these characteristics. Thus, it is advisable to ensure the appropriateness of the
development sample to the degree possible before conducting an EFA.
An important caveat with respect to sample characteristics is that in
counseling psychology research, there are many potential populations
whose members may be difficult to identify or from whom it is particularly
difficult to solicit participation (e.g., lesbian, gay, bisexual and transgender
individuals, and persons with disabilities). Under circumstances where a
researcher believes that the sample characteristics might be at variance
from unknown population characteristics, she or he may be forced to adjust
to these unknowns and simply move forward with a sample that is adequate
but not ideal (Worthington & Navarro, 2003).
In the studies we reviewed for the content analysis, some form of purpose-
ful sampling from a specific target population was the most common approach,
followed by a combination of convenience and purposeful sampling. Only
about 25% of the studies used convenience sampling, most often with under-
graduate student participants. Three of the studies we reviewed used split
samples (i.e., a large sample split into two groups for separate analyses).


Sample size. Sample size is an issue that has received considerable
discussion in the literature. There are two central risks with using too few
participants: (a) Patterns of covariation may not be stable, because chance
can substantially influence correlations among items when the ratio of par-
ticipants to items is relatively low; and (b) the development sample may not
adequately represent the intended population (DeVellis, 2003). Comrey
(1973) has been cited often as classifying a variety of sample sizes from
very poor (N = 50) to excellent (N = 1,000) based solely on the number of
participants in a sample and as recommending at least 300 cases for factor
analysis. Gorsuch (1983) has also proposed guidelines for minimum ratios
of participants to items (5:1 or 10:1), which have been widely cited in coun-
seling psychology research. However, other authors have pointed out that
these general guidelines may be misleading (MacCallum, Widaman,
Zhang, & Hong, 1999; Tabachnick & Fidell, 2001; Velicer & Fava, 1998).
In general, there is some agreement that larger sample sizes are likely
to result in more stable correlations among variables and will result in
greater replicability of EFA outcomes. Velicer and Fava (1998) produced
evidence indicating that any ratio less than a minimum of three partici-
pants per item is inadequate, and there is additional evidence that factor
saturation (the number of items per factor) and item communalities are the
most important determinants of adequate sample size (Guadagnoli &
Velicer, 1988; MacCallum et al., 1999). Thus, we offer four overarching
guidelines: (a) Sample sizes of at least 300 are generally sufficient in most
cases, (b) sample sizes of 150 to 200 are likely to be adequate with data
sets containing communalities higher than .50 or with 10:1 items per fac-
tor with factor loadings at approximately |.4|, (c) smaller sample sizes
may be adequate if all communalities are .60 or greater or with at least 4:1
items per factor and factor loadings greater than |.6|, and (d) sample sizes
less than 100 or with fewer than 3:1 participant-to-item ratios are gener-
ally inadequate (Reise, Waller, & Comrey, 2000; Thompson, 2004). Note
that this requires researchers to set a minimum sample size at the outset
and to evaluate the need for additional data collection based on the out-
comes of an initial EFA.
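
These guidelines are heuristics rather than strict rules, but they can be encoded as a rough planning check. The following Python sketch simply restates the thresholds from this paragraph; the function name, arguments, and example values are hypothetical illustrations rather than part of any published procedure.

    def efa_sample_size_adequate(n, n_items, n_factors, min_communality, typical_loading):
        """Rough screen of EFA sample-size adequacy based on the heuristics above."""
        ratio = n / n_items
        items_per_factor = n_items / n_factors
        if n < 100 or ratio < 3:
            return "generally inadequate"                      # guideline (d)
        if n >= 300:
            return "generally sufficient"                      # guideline (a)
        if n >= 150 and (min_communality > .50 or
                         (items_per_factor >= 10 and typical_loading >= .4)):
            return "likely adequate"                           # guideline (b)
        if min_communality >= .60 or (items_per_factor >= 4 and typical_loading > .6):
            return "possibly adequate"                         # guideline (c)
        return "uncertain; plan for more data and reevaluate after an initial EFA"

    # Hypothetical planning values for a 40-item pool expected to yield 4 factors.
    print(efa_sample_size_adequate(n=180, n_items=40, n_factors=4,
                                   min_communality=.55, typical_loading=.45))
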
In our content analysis, absolute magnitude of sample sizes and participant-
per-item ratios were virtually the only references made with respect to sample
size, and both varied widely. Absolute sample sizes varied from 84 to 411
(M = 258.95; SD = 100.80). Participant-per-item ratios varied from 2:1 to 35:1
(the modal ratio was 3:1). The authors addressed no other sample-size criteria
when discussing the adequacy of their sample sizes.
Factorability of the correlation matrix. Although many people are famil-
iar with the previously described standards regarding sample size, the fac-
torability of a data set also has been related to the sizes of correlations in
the matrix. Researchers can use Bartlett's (1950) test of sphericity to
estimate the probability that correlations in a matrix are 0. However, it is
highly susceptible to the influence of sample size and likely to be signifi-
cant for large samples with relatively small correlations (Tabachnick &
Fidell, 2001). Thus, we recommend using this test only if there are fewer than
about 5 cases per variable, but this becomes moot with samples containing
fewer than three cases per variable (see earlier). In studies with cases-per-
item ratios higher than 5:1, we recommend that researchers provide addi-
tional evidence for scale factorability.
The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is also
useful for evaluating factorability. This measure of sampling adequacy
accounts for the relationship of partial correlations to the sum of squared
correlations. Thus, it indicates the extent to which a correlation matrix actu-
ally contains factors or simply chance correlations between a small subset
of variables. Tabachnick and Fidell (2001) suggested that values of .60 and
higher are required for good factor analysis.
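
Both diagnostics are easy to compute directly from the item correlation matrix. The following Python sketch implements Bartlett's test and the overall KMO value from their standard formulas using numpy and scipy; it is illustrative only and is not drawn from any particular statistical package, and the random data in the usage example merely stand in for real item responses.

    import numpy as np
    from scipy import stats

    def bartlett_sphericity(data):
        """Bartlett's (1950) test that the item correlation matrix is an identity matrix."""
        n, p = data.shape
        corr = np.corrcoef(data, rowvar=False)
        chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(corr))
        df = p * (p - 1) / 2
        return chi2, df, stats.chi2.sf(chi2, df)

    def kmo_overall(data):
        """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
        corr = np.corrcoef(data, rowvar=False)
        inv = np.linalg.inv(corr)
        d = np.sqrt(np.diag(inv))
        partial = -inv / np.outer(d, d)        # matrix of partial correlations
        np.fill_diagonal(corr, 0)
        np.fill_diagonal(partial, 0)
        return (corr ** 2).sum() / ((corr ** 2).sum() + (partial ** 2).sum())

    # Placeholder data: 200 respondents, 12 items with no real factor structure,
    # so Bartlett's test should be nonsignificant and the KMO value should be low.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 12))
    print(bartlett_sphericity(X))
    print(kmo_overall(X))
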
In our content analysis of scale development articles in JCP, the largest
number of studies (n = 11) did not report using any criteria to assess the fac-
torability of the correlation matrix. Although some studies (n = 5) reported
using Bartlett’s test of sphericity, only one of those studies contained a
cases-to-items ratio small enough to provide useful information on the basis
of Bartlett’s test. Although other studies had cases-to-items ratios less than
5:1, they did not report using Bartlett's test to assess scale factorability. Only
7 of the articles reported the value of KMO as a precursor to completing
factor analysis, and a few articles (n = 3) used the participants-per-item
ratio as the sole criterion.
Extraction methods. There are a variety of factor-extraction methods
based on a number of statistical theories, but the two most commonly
known and studied are principal-components analysis (PCA) and common-
factors analysis (FA). There has been a protracted debate over the preferred
use of PCA versus FA (e.g., principal-axis factoring, maximum-likelihood
factoring) as exploratory procedures, which has yet to be resolved
(Gorsuch, 2003). We do not intend to examine this debate in detail (see
Multivariate Behavioral Research, 1990, Volume 25, Issue 1, for an exten-
sive discussion of the pros and cons of both). However, it is important for
researchers to understand the distinct purposes of each technique. The pur-
pose of PCA is to reduce the number of items while retaining as much of
the original item variance as possible. The purpose of FA is to understand
the latent factors or constructs that account for the shared variance among
items. Thus, the purpose of FA is more closely aligned with the develop-
ment of new scales. In addition, although it has been shown that PCA and
FA often produce similar results (Velicer & Jackson, 1990; Velicer,
Peacock, & Jackson, 1982), there are several conditions under which FA
has been shown to be superior to PCA (Gorsuch, 1990; Tucker, Koopman,
& Linn, 1969; Widaman, 1993). Finally, compared with PCA, the outcomes
of FA should more effectively generalize to CFA (Floyd & Widaman,
1995). Thus, although there may be other appropriate uses for PCA, we
recommend FA for the development of new scales.
An example of the use of FA versus PCA in a simulated data set might
illustrate the differences between these two approaches. Imagine that a
researcher at a public university is interested in measuring campus climate for
diversity. The researcher created 12 items to measure three different aspects of
campus climate (each using 4 items): (a) general comfort or safety, (b) open-
ness to diversity, and (c) perceptions of the learning environment. In a sample
of 500 respondents, correlations among the 12 variables indicated that one
item from each subset did not correlate with any other items on the scale (e.g.,
no higher than r = .12 for any bivariate pair containing these items). In FA, the
three uncorrelated items appropriately drop out of the solution because of low
factor loadings (loadings < .23), resulting in a three-factor solution (each fac-
tor retaining 3 items). In PCA, the three uncorrelated items load together on a
fourth factor (loadings > .45). This example demonstrates that under certain
conditions, PCA may overestimate factor loadings and result in erroneous
decisions about the number of factors or items to retain.
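
A rough simulation in the spirit of this hypothetical example can make the contrast concrete. The sketch below uses scikit-learn's maximum-likelihood FactorAnalysis (rather than the principal-axis method discussed above, which scikit-learn does not provide) and PCA; the data-generating values are arbitrary, and the exact numbers will vary by random seed, so only the qualitative pattern matters.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis, PCA
    from sklearn.preprocessing import scale

    rng = np.random.default_rng(0)
    n = 500

    # Three latent factors, each measured by three correlated items plus one
    # pure-noise item, loosely mirroring the 12-item campus-climate example.
    latent = rng.normal(size=(n, 3))
    cols = []
    for f in range(3):
        cols += [0.7 * latent[:, f] + 0.5 * rng.normal(size=n) for _ in range(3)]
        cols.append(rng.normal(size=n))            # item unrelated to any factor
    X = scale(np.column_stack(cols))

    # Common-factor (maximum-likelihood) solution: the noise items (columns 3, 7,
    # and 11) should show near-zero loadings on every factor.
    fa = FactorAnalysis(n_components=4, rotation="varimax").fit(X)
    print(np.round(fa.components_.T, 2))           # items x factors

    # Principal-components solution: because every item's full variance must be
    # distributed across components, the noise items retain substantial loadings
    # on the later components, which can suggest a spurious extra factor.
    pca = PCA(n_components=4).fit(X)
    pca_loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
    print(np.round(pca_loadings, 2))
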
We should also make clear that there are several techniques of FA,
including principal-axis factoring, maximum likelihood, image factoring,
alpha factoring, and unweighted and generalized least squares. Gerbing and
Hamilton (1996) have shown that principal-axis factoring and maximum-
likelihood approaches are relatively equal in their capacities to extract the
correct model when the model is known in the population. However,
Gorsuch (1997) points out that maximum-likelihood extractions result in
occasional problems that do not occur with principal-axis factoring. Prior to
the current use of SEM as a CFA technique, maximum-likelihood extraction
had some advantages over other FA procedures as a confirmatory technique
(Tabachnick & Fidell, 2001). For further discussion of less commonly used
approaches, see Tabachnick and Fidell (2001).
Among the studies in our content analysis, most used some form of FA
(n = 10), but a similar number used PCA (n = 9). One study used a combi-
nation of PCA and FA, and another did not report an extraction method.
(Note: 2 of the 23 studies used only CFA and are not included in the figures
reported earlier.) A cursory examination of the publication dates indicates
that the majority of studies using PCA were published prior to the majority
of those using FA, suggesting a trend away from PCA in favor of FA.
Criteria for determining rotation method. FA rotation methods include
two basic types: orthogonal and oblique. Researchers use orthogonal
rotations when the factors underlying a given item set are assumed or
known to be uncorrelated. Researchers use oblique rotations when the fac-
tors are assumed or known to be correlated. A discussion of the statistical
properties of the various types of orthogonal and oblique rotation methods
is beyond the scope of this article (we refer readers to Gorsuch [1983] and
Thompson [2004] for such discussions). In practice, researchers can deter-
mine whether to use an orthogonal versus oblique rotation during the initial
FA based on either theory or data. However, if they discover that the factors
appear to be correlated in the data when theory has suggested them to be
uncorrelated, it is still most appropriate to rely on the data-based approach
and to use an oblique rotation. Although, in some cases, both procedures
might produce the same factor structure with the same data, using an
orthogonal rotation with correlated factors tends to overestimate loadings
(e.g., they will have higher values than with an oblique rotation; Loehlin,
1998). Thus, researchers may retain or reject some items inappropriately,
and the factor structure may be more difficult to replicate during CFA.
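
One simple data-based check, corresponding to the subscale-intercorrelation criterion that a few of the reviewed studies reported, is to form provisional unit-weighted subscale scores from an initial solution and inspect how strongly they correlate. The Python sketch below is only a proxy for examining the factor correlation matrix itself; the item-to-factor assignments and the data are hypothetical.

    import numpy as np
    import pandas as pd

    def subscale_intercorrelations(data, assignments):
        """Correlations among unit-weighted subscale scores.

        data: DataFrame of item responses; assignments: dict mapping a provisional
        factor label to the list of item columns assigned to it.
        """
        scores = pd.DataFrame({factor: data[items].mean(axis=1)
                               for factor, items in assignments.items()})
        return scores.corr()

    # Hypothetical example: nine items provisionally assigned to three factors.
    rng = np.random.default_rng(0)
    items = pd.DataFrame(rng.normal(size=(300, 9)),
                         columns=[f"item{i}" for i in range(1, 10)])
    assignments = {"comfort": ["item1", "item2", "item3"],
                   "openness": ["item4", "item5", "item6"],
                   "learning": ["item7", "item8", "item9"]}
    print(subscale_intercorrelations(items, assignments).round(2))
    # Sizable intercorrelations argue for retaining an oblique rotation.
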
Our content analysis showed that relatively few of the studies in our
review reported an adequate rationale for selecting an orthogonal or oblique
rotation method, with only 2 using subscale intercorrelations, 3 using theory,
and 1 using both. Twelve studies did not specify the criteria used to select
a rotation method, and 3 studies actually reported criteria irrelevant to the
task (e.g., although the factors were correlated, the orthogonal solution
matched the prior expectations for the factor solution). Also, 8 studies used
orthogonal rotations despite reporting moderate to high correlations among
factors, and 4 studies did not provide factor intercorrelations.
Criteria for factor retention. Researchers can use numerous criteria to
estimate the number of factors for a given item set. The most widely
known approaches were recommended by Kaiser (1958) and Cattell
(1966) on the basis of eigenvalues, which may help determine the impor-
tance of a factor and indicate the amount of variance in the entire set of
items accounted for by a given factor (for a more detailed explanation of
eigenvalues, see Gorsuch, 1983). The iterative process of factor analysis
produces successively less useful information with each new factor
extracted in a set because each factor extracted after the first is based on
the residual of the previous factor’s extraction. The eigenvalues produced
will be successively smaller with each new factor extracted (accounting
for smaller and smaller proportions of variance) until virtually meaning-
less values result. Thus, Kaiser (1958) believed that eigenvalues less than
1.0 reflect potentially unstable factors. Cattell (1966) used the relative val-
ues of eigenvalues to estimate the correct number of factors to examine
during factor analysis—a procedure known as the scree test. Using the
scree plot, a researcher examines the descending values of eigenvalues to
locate a break in the size of eigenvalues, after which the remaining values
tend to level off horizontally.
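
Both criteria work from the eigenvalues of the item correlation matrix, which are simple to obtain. The following sketch prints the number of eigenvalues exceeding 1.0 (the Kaiser criterion) and draws a basic scree plot with matplotlib; the random placeholder data stand in for real item responses.

    import numpy as np
    import matplotlib.pyplot as plt

    def correlation_eigenvalues(data):
        """Descending eigenvalues of the item correlation matrix."""
        corr = np.corrcoef(data, rowvar=False)
        return np.sort(np.linalg.eigvalsh(corr))[::-1]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 12))                 # placeholder item responses
    eigvals = correlation_eigenvalues(X)

    print("Kaiser criterion retains", int(np.sum(eigvals > 1.0)), "factors")

    # Scree plot: look for the break after which the values level off.
    plt.plot(np.arange(1, len(eigvals) + 1), eigvals, marker="o")
    plt.xlabel("Factor number")
    plt.ylabel("Eigenvalue")
    plt.title("Scree plot")
    plt.show()
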
Parallel analysis (Horn, 1965) is another procedure for deciding how
many factors to retain. Generally, when using parallel analysis, researchers
randomly order the participants’ item scores and conduct a factor analysis on
both the original data set and the randomly ordered scores. Researchers
determine the number of factors to retain by comparing the eigenvalues
determined in the original data set and in the randomly ordered data set.
They retain a factor if the original eigenvalue is larger than the eigenvalue
from the random data. This has been shown to work reasonably well when
using FA (Humphreys & Montanelli, 1975) as well as PCA (Zwick &
Velicer, 1986). Parallel analysis is not readily available in commonly used
statistical software, but programs are available that conduct parallel analysis
when using principal-axis factor analysis and PCA (see O’Connor, 2000).
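
Because the logic of parallel analysis is straightforward, it can also be scripted directly. The sketch below is a permutation-based version along the lines described above, using eigenvalues of the unreduced correlation matrix (O'Connor's 2000 programs also offer a principal-axis variant); the placeholder data and the choice of the mean random eigenvalue as the comparison threshold are simplifying assumptions.

    import numpy as np

    def parallel_analysis(data, n_reps=100, seed=0):
        """Permutation-based parallel analysis.

        Each item's scores are independently reshuffled across participants to
        destroy inter-item correlations; the function returns the number of
        observed eigenvalues that exceed the corresponding mean eigenvalue
        from the reshuffled data sets.
        """
        rng = np.random.default_rng(seed)
        n, p = data.shape
        observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

        random_eigs = np.empty((n_reps, p))
        for r in range(n_reps):
            shuffled = np.column_stack([rng.permutation(data[:, j]) for j in range(p)])
            corr = np.corrcoef(shuffled, rowvar=False)
            random_eigs[r] = np.sort(np.linalg.eigvalsh(corr))[::-1]

        threshold = random_eigs.mean(axis=0)
        return int(np.sum(observed > threshold)), observed, threshold

    # Placeholder data; replace with the real participants-by-items matrix.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 12))
    n_factors, observed, threshold = parallel_analysis(X)
    print("Parallel analysis suggests retaining", n_factors, "factors")
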
Approximating simple structure is another way to evaluate factor reten-
tion during EFA. According to McDonald (1985), the term simple structure
has two radically different meanings that are often confused. A factor pat-
tern has simple structure (a) if several items load strongly on only one fac-
tor and (b) if items have a zero correlation to other factors in the solution.
SEM constrains the relationships between items and factors to produce
simple structure as defined earlier (which will become important later).
McDonald (1985) differentiates this from what he prefers to call approxi-
mate simple structure, often reported in counseling psychology research as
if it were simple structure, which substitutes the word small (undefined) for
the word zero (definitive) in the primary definition. Researchers can esti-
mate approximate simple structure by using rotation methods during FA. In
EFA, efforts to produce factor solutions with approximate simple structure
are central to decisions about the final number of factors and about the
retention and deletion of items in a given solution. If factors share items
that cross-load too highly on more than one factor (e.g., > .32), the items
are considered complex because they reflect the influence of more than one
factor. Approximating simple structure can be achieved through item or fac-
tor deletion or both. SEM approaches to CFA assume simple structure, and
very closely approximating simple structure during EFA will likely
improve the subsequent results of CFA using SEM.
The larger the number of items on a factor, the more confidence one has that
it will be a reliable factor in future studies. Thus, with a few minor caveats,
some authors have recommended against retaining factors with fewer than
three items (Tabachnick & Fidell, 2001). It is possible to retain a factor with
only two items if the items are highly correlated (i.e., r > .70) and relatively
uncorrelated with other variables. Under these conditions, it may be appropri-
ate to consider other criteria (e.g., interpretability) in deciding whether to retain
the factor or to discard it. Nevertheless, it may be best to revisit item-genera-
tion procedures to produce additional items intended to load on the factor
(which would require a new EFA before moving on to the CFA).
Conceptual interpretability is the definitive factor-retention criterion. In
the end, researchers should retain a factor only if they can interpret it in a
meaningful way no matter how solid the evidence for its retention based on
the empirical criteria earlier described. EFA is ultimately a combination of
empirical and subjective approaches to data analysis because the job is not
complete until the solution makes sense. (Note that this is not necessarily
true for the criterion-group method of scale development.) At this stage, the
researcher should conduct an analysis of the items within each factor to
assess the extent to which the items make sense as a group. Although
uncommon, it may be useful to submit the item-factor combinations to a
small group of experts for external interpretation to avoid a situation in
which a factor makes sense to the researcher eager for a viable scale but not
to anybody else.
In our content analysis of JCP articles, it appeared that numerous
researchers encountered problems reconciling their EFA findings with
their conceptual interpretation of the factor solution and occasionally
engaged in rationalizations that led to questionable practices. For example,
researchers in one study selected a factor solution that fit their precon-
ceived conceptualization of the scale although some of the factors were
very highly intercorrelated (e.g., the data indicated fewer factors than the
authors adopted). When a researcher desires a specific factor structure that
is not adequately reproduced during EFA, the recommended practice
would be (a) to adopt the factor solution supported by the data and engage
in meaningful interpretation based on those findings or (b) to return to
item generation and go back through earlier steps in the scale development
process (including EFA). There were a few articles in our content analysis
that inappropriately moved forward with CFA after making revisions that
were not assessed by EFA.
Criteria for item deletion or retention. Although, on rare occasions, a
researcher may retain all the initial items submitted to EFA, item deletion
is a very common and expected part of the process. Researchers most often
use the values of the item loadings and cross-loadings on the factors to
determine whether items should be deleted or retained. Inevitably, this
process is intertwined with the process of determining the number of fac-
tors that will be retained (described earlier). For example, in some
instances, a researcher might be evaluating the relative value of several dif-
ferent factor solutions (e.g., 2, 3, or 4 factors). As such, deleting items
before establishing the final number of factors could actually reduce the
number of factors retained. On the other hand, unnecessarily retaining
items that fail to contribute meaningfully to any of the potential factor solu-
tions will make it more difficult to make a final decision about the number
of factors to retain. Thus, the process we recommend is designed to retain
potentially meaningful items early in the process and to optimize scale
length only after the factor solution is clear.
Most researchers begin EFA with a substantially larger number of items
than they ultimately plan to retain. However, there is considerable variation
among studies in the proportion of items in the initial pool that are planned
for deletion. We recommend that researchers wait until the last step in EFA
to trim unnecessary items and focus primarily on empirical scale develop-
ment procedures at this stage in the process so as not to confuse the purposes
of these two similar activities (e.g., item deletion). Thus, researchers should
base decisions about whether to retain or delete items at this stage on their
contribution to the factor solution rather than on the final length of the scale.
Most researchers use some guideline for a lower limit on item factor
loadings and cross-loadings to determine whether to retain or delete items,
but the criteria for determining the magnitude of loadings and cross-loadings
have been described as a matter of researcher preference (Tabachnick &
Fidell, 2001). Larger, more frequent cross-loadings will contribute to factor
intercorrelations (requiring oblique rotation) and lesser approximations of
simple structure (described earlier). Thus, to the degree possible, researchers
should attempt to set their minimum values for factor loadings as high as
possible and the absolute magnitude for cross-loadings as low as possible
(without compromising scale length or factor structure), which will result
in fewer cross-loadings of lower magnitudes and better approximations of
simple structure. For example, researchers should delete items with factor
loadings less than .32 or with a cross-loading that differs by less than .15 from
the item's highest factor loading. In addition, they should delete items that
contain absolute loadings higher than a certain value (e.g., .32) on two or
more factors. However, we urge researchers to use caution when using
cross-loadings as a criterion for item deletion until establishing the final
factor solution because an item with a relatively high cross-loading could
be retained if the factor on which it is cross-loaded is deleted or collapsed
into another existing factor.
Item communalities after rotation can be a useful guide for item deletion as
well. Remember that high item communalities are important for determining
the factorability of a data set, but they can also be useful in evaluating specific
items for deletion or retention because a communality reflects the proportion of
item variance accounted for by the factors; it is the squared multiple correlation
of the item as predicted from the set of factors in the solution (Tabachnick &
Fidell, 2001). Thus, items with low communalities (e.g., less than .40) are not
highly correlated with one or more of the factors in the solution.
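
These loading, cross-loading, and communality rules can be applied systematically once a rotated pattern matrix is in hand. In the Python sketch below, the function name, thresholds, and the small loading matrix are hypothetical illustrations of the heuristics discussed above; as noted earlier, cross-loading rules are best applied only after the factor solution has stabilized.

    import numpy as np
    import pandas as pd

    def flag_items_for_deletion(loadings, communalities,
                                min_loading=.32, min_gap=.15, min_communality=.40):
        """Flag items against the heuristics discussed above.

        loadings: DataFrame (items x factors) of pattern loadings after rotation;
        communalities: Series indexed by item. Thresholds restate the examples
        in the text and can be tightened by the researcher.
        """
        abs_load = loadings.abs()
        sorted_loads = np.sort(abs_load.values, axis=1)[:, ::-1]
        highest, second = sorted_loads[:, 0], sorted_loads[:, 1]

        flags = pd.DataFrame(index=loadings.index)
        flags["low_primary_loading"] = highest < min_loading
        flags["high_cross_loading"] = (highest - second) < min_gap
        flags["complex_item"] = (abs_load >= min_loading).sum(axis=1) >= 2
        flags["low_communality"] = communalities < min_communality
        return flags

    # Hypothetical three-factor pattern matrix for four items.
    loadings = pd.DataFrame(
        [[.68, .05, .10],     # clean item
         [.45, .40, .02],     # complex item with a high cross-loading
         [.25, .12, .08],     # weak item
         [.55, .10, .31]],
        index=["item1", "item2", "item3", "item4"],
        columns=["F1", "F2", "F3"])
    communalities = pd.Series([.52, .37, .09, .41], index=loadings.index)
    print(flag_items_for_deletion(loadings, communalities))
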


In our content analysis, the most common criteria for item-deletion deci-
sions were absolute values of item loadings and cross-loadings, which were
often used in combination. None of the studies we reviewed reported using
item communalities as a criterion for deletion, and one study used item-
analysis procedures (e.g., contribution to internal consistency reliability).
There were no items deleted in two studies, and two others did not specify
the criteria for item deletion.
Optimizing scale length. Once the items have been evaluated, it is useful to
assess the trade-off between length and reliability to optimize scale length.
Longer scales of relatively highly correlated items are generally more reliable,
but Converse and Presser (1986) recommended that questionnaires take no
longer than 50 minutes to complete. In our experience, scales that take longer
than about 15 to 30 minutes might become problematic, depending on the
respondents, the intended use of the scale, and the respondents’ motivation
regarding the purpose of the administration. Thus, scale developers may find
it useful to examine the length of each subscale to determine whether it is a
reasonable trade-off to sacrifice a small degree of internal consistency to
shorten its length. Some statistical packages (e.g., SPSS) allow researchers to
compare all the items on a given subscale to identify those that contribute the
least to internal consistency, making item deletion with the goal of optimizing
scale length relatively easy. Generally, when a factor contains more than the
desired number of items, the researcher will have the option of deleting items
that (a) have the lowest factor loadings, (b) have the highest cross-loadings,
(c) contribute the least to the internal consistency of the scale scores, and
(d) have low conceptual consistency with other items on the factor. The
researcher should avoid scale-length optimization that degrades the quality of
the factor structure, factor intercorrelations, item communalities, factor load-
ings, or cross-loadings. Ultimately, researchers must conduct a final EFA to
ensure that the factor solution does not change after deleting items.
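
Researchers without access to such output can compute the same quantities directly. The sketch below gives a minimal numpy implementation of Cronbach's alpha and of alpha recomputed with each item removed (analogous to the "alpha if item deleted" statistic mentioned above); the simulated subscale merely stands in for real responses.

    import numpy as np

    def cronbach_alpha(items):
        """Cronbach's alpha for an (n_respondents x n_items) array of item scores."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    def alpha_if_item_deleted(items):
        """Alpha recomputed with each item removed in turn."""
        items = np.asarray(items, dtype=float)
        return np.array([cronbach_alpha(np.delete(items, j, axis=1))
                         for j in range(items.shape[1])])

    # Placeholder data for one six-item subscale; replace with real responses.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(300, 1))
    subscale = latent + 0.8 * rng.normal(size=(300, 6))
    print(round(cronbach_alpha(subscale), 3))
    print(np.round(alpha_if_item_deleted(subscale), 3))
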

CFA

SEM versus FA. SEM has become a widely used tool in explaining theo-
retical models within the social and behavioral sciences (see Martens, 2005;
Martens & Haase, 2006; Quintana & Maxwell, 1999; Weston & Gore, 2006).
CFA is one of the most popular uses of SEM. CFA is most commonly used
during the scale development process to help support the validity of a scale
following an EFA. In the past, a number of published studies have used FA or
PCA procedures as confirmatory approaches (Gerbing & Hamilton, 1996).
With the increasing availability of computer software, however, most
researchers use SEM as the preferred approach for CFA.


In our content analysis, 14 of the studies used SEM as the confirmatory
approach. In comparison, 2 studies used PCA as a confirmatory approach
(these appeared before SEM was widely applied in counseling psychology
research).
Typical SEM approaches. Once a researcher obtains a theoretically
meaningful factor structure via EFA, the logical next step is to specify the
resulting factor solution in the SEM confirmatory procedure—that is, if the
researcher obtains a three-factor oblique factor structure in the EFA, speci-
fying the same correlated three-factor model using SEM and finding good
fit of the model to the data in a new sample will help support the factor-
structure reliability and the validity of the scale. Another approach is to
compare competing theoretically plausible models (e.g., different numbers
of factors, inclusion or exclusion of specific paths). Thus, the researcher
can compare the factor structure uncovered in the EFA with alternative
models to evaluate which model best fits the data. The hypothesized
model’s fitting the data better than alternative models is further evidence of
construct validity. If an alternative model fits the data better than the
hypothesized model, the investigator is obligated to explain how discrepan-
cies between models affect construct validity and then to conduct another
study to further validate the newly adopted model (or start over).
Testing nested or hierarchically related models is another typical SEM
approach. A model is nested if it is a subset of another model to which it is
compared. For example, suppose a researcher conducted a study on an
eight-item, course-evaluation survey in which four items assess satisfaction
with the readings and homework assigned in the course and the remaining
four items assess satisfaction with the professor’s sensitivity to diversity,
resulting in a two-factor correlated model. However, one could assume that
the eight items on the survey assess overall satisfaction with the course,
resulting in a one-factor model. If this one-factor model was compared with
the correlated two-factor model, the one-factor (restricted) model would be
nested within the two-factor (unrestricted) model because the correlation
between the two factors in the two-factor model would be set to a value
of 1.0 to form the one-factor model. When comparing nested models,
researchers use a chi-square difference test to examine whether a significant
loss in fit occurs when going from the unrestricted model to the nested
(restricted) model (for the statistical formula, see Kline, 2005).
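
The test itself is simple to compute once the chi-square and degrees of freedom of the two fitted models are known. The following sketch shows the arithmetic with scipy; the fit statistics plugged in for the course-evaluation example are hypothetical numbers, not results from any study reviewed here.

    from scipy import stats

    def chi_square_difference_test(chi2_restricted, df_restricted,
                                   chi2_unrestricted, df_unrestricted):
        """Chi-square difference test for nested models.

        The restricted model (e.g., the one-factor model above) is nested in the
        unrestricted model (the correlated two-factor model); a significant result
        indicates a meaningful loss of fit when adopting the restricted model.
        """
        delta_chi2 = chi2_restricted - chi2_unrestricted
        delta_df = df_restricted - df_unrestricted
        return delta_chi2, delta_df, stats.chi2.sf(delta_chi2, delta_df)

    # Hypothetical fit statistics for the course-evaluation example.
    print(chi_square_difference_test(chi2_restricted=58.4, df_restricted=20,
                                     chi2_unrestricted=31.2, df_unrestricted=19))
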
When structural equation models are not nested (i.e., one model is not a
subset of another model), the chi-square difference test is an inappropriate
method to assess model fit differences because neither of the two models
can serve as a baseline comparison model. Still, there are instances when
researchers compare nonhierarchically related models in terms of model fit,
such as when testing different theoretical models posited to support the data. In this case, researchers may use fit indices to select among competing models. It is increasingly common to compare nonnested models using predictive fit indices (discussed further on), which indicate how well a model will cross-validate in future samples.
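As an illustration of how a predictive fit index can inform a choice between nonnested models, the sketch below computes one common SEM formulation of Akaike's Information Criterion (model chi-square plus twice the number of freely estimated parameters; exact formulations differ across programs) for two hypothetical competing models fit to the same data.

    def aic(chisq, n_free_params):
        # One common SEM formulation: chi-square + 2 * (free parameters).
        return chisq + 2 * n_free_params

    # Hypothetical, nonnested competing models fit to the same sample
    model_a = {"chisq": 58.3, "free_params": 17}
    model_b = {"chisq": 49.9, "free_params": 21}

    for name, m in (("Model A", model_a), ("Model B", model_b)):
        print(f"{name}: AIC = {aic(m['chisq'], m['free_params']):.1f}")
    # The model with the smaller AIC is expected to cross-validate better
    # in future samples from the same population.

Note that the index penalizes model complexity, so the better-fitting but more heavily parameterized model is not automatically preferred.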
Some competing models may be equivalent models; that is, although their parameter configurations appear different, they are mathematically equivalent and therefore yield identical chi-square test statistics and goodness-of-fit indices (MacCallum, Wegener, Uchino, & Fabrigar, 1993). Thus, theory should play the strongest role in selecting the appropriate model when comparing equivalent models.
Another SEM approach that may support the construct validity of a scale
is called multiple-group analysis. In multiple-group analysis, the same
structural equation model may be applied to the data for two or more dis-
tinct groups (e.g., male and female) to simultaneously test for invariance
(model equivalency) across the two groups by constraining different sets of
model parameters to be equal in both groups (for more on conducting
multiple-group analysis, see Bentler, 1995; Bollen, 1989; Byrne, 2001).
Of the 10 studies in the content analysis using a confirmatory SEM
approach, 2 of them used the single-model approach wherein the model
produced by the EFA was specified in a CFA, and 8 of the studies per-
formed model comparisons. Of these 8 studies, 4 evaluated nested models,
but only 3 of the 4 used the chi-square difference test when selecting among the nested models. The remaining 4 studies used fit indices to select among nonnested competing models, and 2 of these used predictive fit indices when selecting among the set of competing models. Researchers compared equivalent and nonequiv-
alent models in 2 of the studies in the content analysis. One of these stud-
ies selected a nonequivalent model over 2 equivalent models based on
higher values of the fit indices. In the second study, the authors relied on
theory when selecting among 2 equivalent models.
Sample-size considerations. The statistical theory underlying SEM is asymptotic; that is, large sample sizes are assumed to be necessary to obtain stable parameter estimates (Bentler, 1995). Thus, some researchers have
suggested that SEM analyses should not be performed on sample sizes
smaller than 200, whereas others recommend minimum sample sizes
between 100 and 200 participants (Kline, 2005). Another recommendation
is that there should be between 5 and 10 participants per observed variable
(Grimm & Yarnold, 1995); yet another guideline is that there should be
between 5 and 10 participants per parameter to be estimated (Bentler &
Chou, 1987). The findings are mixed in terms of which criterion is best
because it depends on various model characteristics, including the number
of indicator variables per factor (Marsh, Hau, Balla, & Grayson, 1998), estimation method (Fan, Thompson, & Wang, 1999), nonnormality of the data (West, Finch, & Curran, 1995), and the strength of the relationships among indicator variables and latent factors (Velicer & Fava, 1998). However, because there is a clear relationship between sample size and model complexity, we recommend that the researcher account for the number of parameters to be estimated when considering sample size. Given ideal conditions (e.g., enough indicators per factor, high factor loadings, and normally distributed data), we recommend Bentler and Chou's (1987) guideline of at least a 5:1 ratio of participants to number of parameters, with a ratio of 10:1 being optimal. In addition, we do not recommend using SEM on sample sizes smaller than 100 participants.

TABLE 3: Incremental, Absolute, and Predictive Fit Indices Used in Structural Equation Modeling

Fit index: Source citation

Incremental fit indices
  Normed Fit Index (NFI): Bentler & Bonett (1980)
  Incremental Fit Index (IFI): Bollen (1989)
  Nonnormed Fit Index (NNFI) or Tucker-Lewis Index (TLI): Tucker & Lewis (1973)
  Comparative Fit Index (CFI): Bentler (1990)
  Parsimony Comparative Fit Index (PCFI): Mulaik et al. (1989)
  Relative Noncentrality Index (RNI): McDonald & Marsh (1990)
Absolute fit indices
  Chi-square/df ratio: Marsh, Balla, & McDonald (1988)
  Goodness-of-Fit Index (GFI): Jöreskog & Sörbom (1984)
  Adjusted Goodness-of-Fit Index (AGFI): Jöreskog & Sörbom (1984)
  McDonald's Fit Index (MFI) or McDonald's Centrality Index (MCI): McDonald (1989)
  Gamma hat: Steiger (1989)
  Hoelter N: Hoelter (1983)
  Root Mean Square Residual (RMR): Jöreskog & Sörbom (1981)
  Standardized Root Mean Square Residual (SRMR): Bentler (1995)
  Root Mean Square Error of Approximation (RMSEA): Steiger & Lind (1980)
Predictive fit indices
  Akaike's Information Criterion (AIC): Akaike (1987)
  Consistent AIC (CAIC): Bozdogan (1987)
  Bayesian Information Criterion (BIC): Schwarz (1978)
  Expected Cross-Validation Index (ECVI): Browne & Cudeck (1992)

Only one study in our content analysis reported using one of the earlier
described criteria (5 to 10 participants per indicator) to establish an ade-
quate sample size. The remainder of the studies did not specify whether they used particular criteria to evaluate the adequacy of the sample size to
conduct SEM. However, we assessed the sample sizes for all the studies
included in the content analysis and determined that the remaining studies
met the 5:1 ratio of participants to parameters.
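To show how the participants-per-parameter guideline translates into concrete numbers, the following sketch counts the free parameters of a hypothetical 18-item, correlated three-factor CFA (simple structure, with each factor's metric set by fixing one loading to 1) and converts the 5:1 and 10:1 ratios into minimum sample sizes. The model layout is assumed for illustration only.

    def cfa_free_parameters(items_per_factor, n_factors):
        # Free parameters for a simple-structure CFA identified by fixing one
        # loading per factor to 1 (marker-variable approach).
        p = items_per_factor * n_factors           # total observed indicators
        loadings = p - n_factors                   # one loading per factor is fixed
        error_variances = p
        factor_variances = n_factors
        factor_covariances = n_factors * (n_factors - 1) // 2
        return loadings + error_variances + factor_variances + factor_covariances

    q = cfa_free_parameters(items_per_factor=6, n_factors=3)   # 18 items, 3 factors
    print(f"Free parameters: {q}")           # 39
    print(f"Minimum N at 5:1:    {5 * q}")   # 195
    print(f"Preferred N at 10:1: {10 * q}")  # 390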
Overall model fit. Researchers typically use a chi-square test statistic
as a test of overall model fit in SEM. The chi-square test, however, is
often criticized for its sensitivity to sample size (Bentler & Bonett, 1980;
Hu & Bentler, 1999). The sample-size dependency of the chi-square test
statistic has led to the proposal of numerous alternative fit indices that
evaluate model fit, supplementing the chi-square test statistic. These fit
indices may be classified as incremental, absolute, or predictive fit indices
(Kline, 2005).
Incremental fit indices measure the improvement in a model’s fit to the
data by comparing a specific structural equation model to a baseline struc-
tural equation model. The typical baseline comparison model is the null (or
independence) model in which all the variables are independent of each
other or uncorrelated (Bentler & Bonett, 1980). Absolute fit indices mea-
sure how well a structural equation model explains the relationships found
in the sample data. Predictive fit indices (or information criteria) measure
how well the structural equation model would fit in other samples from the
same population (see Table 3 for examples of incremental, absolute, and
predictive fit indices).
We should note that there are various recommendations about reporting
these indices as well as suggested cutoff values for each of these fit indices
(e.g., see Hu & Bentler, 1999; Kline, 2005). Researchers have commonly
interpreted incremental fit index, goodness-of-fit index, adjusted goodness-
of-fit index, and McDonald’s Fit Index (MFI) values greater than .90 as an
acceptable cutoff (Bentler & Bonett, 1980). More recently, however, SEM
researchers have advocated .95 as a more desirable level (e.g., Hu &
Bentler, 1999). Values for the standardized root mean square residual
(SRMR) less than .10 are generally indicative of acceptable model fit.
Values for the root mean square error of approximation (RMSEA) at or less
than .05 indicate close model fit, which is customarily considered accept-
able. However, debate continues concerning the use of these indices and the
cutoff values when fitting structural equation models (e.g., see Marsh, Hau,
& Wen, 2004). One reason for this debate is that the findings are mixed in
terms of which index is best, and their performance depends on various
study characteristics, including the number of variables (Kenny & McCoach,
2003), estimation method (Fan et al., 1999; Hu & Bentler, 1998), model
misspecification (Hu & Bentler, 1999), and sample size (Marsh, Balla, &
Hau, 1996). Researchers should bear in mind that suggested cutoff criteria
are general guidelines and are not necessarily definitive rules.
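The sketch below shows how several of the indices in Table 3 are computed from the chi-square statistics of the fitted (target) model and the null (independence) model, using the formulas as they are commonly presented (e.g., Kline, 2005), and compares them with the conventional cutoffs just described. All input values are hypothetical.

    import math

    # Hypothetical chi-square results for a fitted model and its null model
    chisq_m, df_m = 102.5, 48      # target model
    chisq_b, df_b = 1450.0, 66     # baseline (independence) model
    n = 350                        # sample size

    # Comparative Fit Index (Bentler, 1990)
    cfi = 1 - max(chisq_m - df_m, 0) / max(chisq_b - df_b, chisq_m - df_m, 1e-12)

    # Nonnormed Fit Index / Tucker-Lewis Index (Tucker & Lewis, 1973)
    tli = ((chisq_b / df_b) - (chisq_m / df_m)) / ((chisq_b / df_b) - 1)

    # Root mean square error of approximation (Steiger & Lind, 1980);
    # some programs use N rather than N - 1 in the denominator.
    rmsea = math.sqrt(max(chisq_m - df_m, 0) / (df_m * (n - 1)))

    print(f"CFI   = {cfi:.3f}  (.95 desirable; .90 conventional)")
    print(f"TLI   = {tli:.3f}")
    print(f"RMSEA = {rmsea:.3f} (<= .05 close fit)")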

According to Kline (2005), a minimum collection of these types of fit indices to report would consist of (a) the chi-square test statistic with cor-
responding degrees of freedom and level of significance, (b) the RMSEA
(Steiger & Lind, 1980) with its corresponding 90% confidence interval,
(c) the Comparative Fit Index (CFI; Bentler, 1990), and (d) the SRMR
(Bentler, 1995). Hu and Bentler (1999) recommend using a two-index com-
bination approach when reporting findings in SEM. More specifically, they
recommend using the SRMR accompanied by one of the following indices:
Nonnormed Fit Index, Incremental Fit Index, Relative Noncentrality Index,
CFI, Gamma Hat, MFI, or RMSEA. Although there is evidence that Hu and
Bentler’s (1999) joint criteria help minimize the possibility of rejecting the
right model, there is also evidence that misspecified (incorrect) models
could be considered acceptable when using the proposed cutoff criteria
(Marsh et al., 2004). Thus, we adopt Kline’s (2005) recommendation with
respect to the minimum fit indices to report. In addition, because structural equation models are only approximations of reality, we further recommend that researchers compare competing theoretically plausible models whenever possible and report predictive fit indices (see Table 3) to gauge how well the model will cross-validate in subsequent samples. Finally, and most important, researchers
should always base their selections of the appropriate model on relevant
theory.
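Because Kline's (2005) minimum set includes the RMSEA with its 90% confidence interval, the following sketch illustrates one way to obtain that interval from the model chi-square by solving for the bracketing noncentrality parameters of the noncentral chi-square distribution (the logic underlying Steiger & Lind, 1980). The input values are hypothetical.

    import math
    from scipy.stats import ncx2
    from scipy.optimize import brentq

    def rmsea_ci(chisq, df, n, level=0.90):
        # Confidence interval for the RMSEA based on the noncentral chi-square.
        upper_tail = (1 + level) / 2   # .95 for a 90% interval
        lower_tail = (1 - level) / 2   # .05

        def noncentrality_for(prob):
            # Solve ncx2.cdf(chisq, df, lam) = prob for the noncentrality lam.
            f = lambda lam: ncx2.cdf(chisq, df, lam) - prob
            if f(1e-9) < 0:            # chisq falls below this percentile even at lam ~ 0
                return 0.0
            return brentq(f, 1e-9, max(10.0, chisq * 10))

        lam_lower = noncentrality_for(upper_tail)
        lam_upper = noncentrality_for(lower_tail)
        to_rmsea = lambda lam: math.sqrt(lam / (df * (n - 1)))
        return to_rmsea(lam_lower), to_rmsea(lam_upper)

    low, high = rmsea_ci(chisq=102.5, df=48, n=350)
    print(f"RMSEA 90% CI = [{low:.3f}, {high:.3f}]")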
In our content analysis, 12 of the 14 studies using SEM reported the chi-
square statistic. All 14 studies reported at least two fit indices. We list the
most commonly reported fit indices in these studies in Table 2. Although 7
articles reported the RMSEA, only 1 of these reported its corresponding 90%
confidence interval (regarding confidence intervals around the RMSEA, see
Quintana & Maxwell, 1999; for more on confidence intervals, see Henson,
2006 [TCP, special issue, part 1]). All but 3 studies assessed model fit using
various suggested cutoff criteria (e.g., Bentler, 1990, 1992; Byrne, 2001;
Comrey & Lee, 1992; Hu & Bentler, 1999; Kline, 2005; Quintana &
Maxwell, 1999). Several of the studies were published after the seminal Hu
and Bentler (1999) cutoff-criteria article and referred to the less stringent cut-
off criteria suggested by previous researchers (e.g., .90 for incremental fit
indices). Only 3 of the 8 studies in the content analysis comparing compet-
ing models (nested or nonnested) reported predictive fit indices.
Model modification. When structural equation models do not demon-
strate good fit, researchers often modify (respecify) and subsequently retest
models (MacCallum, Roznowski, & Necowitz, 1992). This effectively turns the confirmatory approach back into an exploratory one, but that is of less consequence than leaving the reasons for poor model fit unknown.
Modification indices are sometimes used to either add or drop parameters
in the process of model respecification. For example, the Lagrange Multiplier Modification index estimates the decrease in the chi-square test statistic that would occur if a parameter were to be freely estimated. More
specifically, it indicates which parameters could be added to increase model
fit by significantly decreasing the chi-square test statistic of overall fit. In
contrast, the Wald statistic estimates the increase in the chi-square test sta-
tistic that would occur if a parameter were fixed to 0, which is essentially
the same as dropping a nonsignificant parameter from the model (Kline,
2005). Researchers have examined the performance of these indices in
terms of helping the researcher arrive at the correct structural equation model
and have shown these indices to be inaccurate under certain conditions
(e.g., Chou & Bentler, 2002; MacCallum, 1986). Thus, applied researchers should be cautious about the accuracy of respecified models when modifications are made using the Lagrange Multiplier and the Wald statistic. In the end, the-
ory should guide model respecification, and respecified models should be
tested using new samples.
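The decision logic behind these indices can be illustrated with a short sketch: a Lagrange Multiplier (modification) index for a single parameter is judged against the critical chi-square value for 1 degree of freedom, whereas the Wald statistic asks whether fixing a currently free parameter to 0 would significantly worsen fit. The parameter names and index values below are hypothetical, and any change they suggest should also be defensible on theoretical grounds.

    from scipy.stats import chi2

    critical_1df = chi2.ppf(0.95, df=1)   # about 3.84

    # Hypothetical Lagrange Multiplier indices: expected drop in the model
    # chi-square if the parameter were freed (at a cost of 1 df each).
    lm_indices = {
        "error covariance, item 3 with item 7": 11.2,
        "cross-loading of item 5 on Factor 2": 2.1,
    }
    for param, lm in lm_indices.items():
        action = "consider freeing (if theoretically defensible)" if lm > critical_1df else "leave fixed"
        print(f"LM = {lm:5.1f}  {param}: {action}")

    # Hypothetical Wald statistic: expected rise in chi-square if a currently
    # free parameter were fixed to 0 (i.e., dropped from the model).
    wald_statistics = {"loading of item 8 on Factor 1": 0.9}
    for param, w in wald_statistics.items():
        action = "candidate for deletion" if w < critical_1df else "retain"
        print(f"Wald = {w:4.1f}  {param}: {action}")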
Researchers may also modify models in terms of the unit of analysis
used, such as item parcels. Parceling means either summing or averaging
two or more items together to create parcels (sometimes referred to as bun-
dles). These parcels are then used as the unit of analysis in SEM instead of
the individual items. It is crucial, however, that researchers in the scale
development process do not use item parceling, because item parcels can
hide the true relationships among items in the scale (Cattell, 1974). In addi-
tion, model misspecification may be hidden when using item parceling
(Bandalos & Finney, 2001).
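For readers unfamiliar with the mechanics, the brief sketch below shows what parceling entails (here, averaging hypothetical item columns into three parcels); the point of the preceding paragraph is that this aggregation is precisely what can conceal item-level misfit during scale development.

    import numpy as np

    rng = np.random.default_rng(0)
    items = rng.integers(1, 6, size=(200, 6))   # 200 respondents, 6 hypothetical Likert items

    # Average items (1, 2), (3, 4), and (5, 6) into three parcels
    parcel_members = [(0, 1), (2, 3), (4, 5)]
    parcels = np.column_stack(
        [items[:, list(members)].mean(axis=1) for members in parcel_members]
    )

    print(parcels.shape)   # (200, 3): the CFA now "sees" 3 indicators instead of 6,
                           # so item-level cross-loadings and correlated errors are hidden.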
The data-driven methods for model respecification in SEM are more
appropriate for fine-tuning a model than they are for large-scale respeci-
fication of severely misspecified initial models because multiple mis-
specification errors interact with each other, making respecification more
difficult (Gerbing & Hamilton, 1996). For similar reasons, Gorsuch
(1997) suggested that it is possible to use FA procedures as an appropri-
ate alternative to adjusting the confirmatory model when finding mis-
specification, but this does not imply reversing the typical order of FA
prior to SEM in scale development research. Finally, we highly recom-
mend cross-validation of respecified structural equation models to estab-
lish predictive validity (MacCallum et al., 1992). Thus, another sample of
data should be collected and the respecified model tested in a confirma-
tory approach.
Of the 14 studies conducting SEM, three examined modification indices
(e.g., the Lagrange Multiplier) to assess whether they should add parameters to the
model to significantly improve the fit. In two of these three studies, the
authors implemented modifications and retested the models. These two
studies allowed the errors to covary, and one study also allowed the factors to covary. Neither of the two studies that modified the original structural
equation model cross-validated the respecified model in a separate sample.
Researchers in two of the studies in the content analysis used item parcel-
ing to avoid estimating a large number of parameters and to reduce random
error, an approach we do not recommend.

CONCLUSIONS

In this article, we have examined common practices in counseling psychology scale development research using EFA and CFA techniques. We con-
ducted a content analysis of new scale development articles in JCP during 10
years (1995 to 2004) to assess current practices in scale development. We
used data from our content analysis to provide information about the typical
procedures used in counseling psychology scale development research, and
we compared these practices to current literature on EFA and CFA to make
recommendations about best practices (which we summarize further on).
We found that counseling psychology scale development research
employed a wide range of procedures. Although we did not conduct a for-
mal trend analysis, our impression from the content-analysis data was that counseling psychology scale development research became increasingly rigorous and sophisticated during the evaluation period, especially through the declining use of PCA and the increased use of SEM as a confirmatory procedure. However, we also found
a variety of practices that seemed at odds with the current literature on EFA
and SEM, which indicated a need for even more rigor and standardization.
Specifically, we found the use of the following new scale development
practices to be problematic: a) employing SEM prior to using EFA, b) using
criteria that varied widely (or were not reported) with respect to determin-
ing the adequacy of the sample for both EFA and SEM, c) failing to report
an adequate rationale for selecting orthogonal versus oblique rotation meth-
ods, d) using orthogonal rotation methods during EFA despite clear evi-
dence that the factors were moderately to highly correlated, e) using
inappropriate rationales or ignoring contrary data when identifying and
reporting the final factor solution during EFA (e.g., ignoring high factor
intercorrelations to retain a preferred factor structure), f) using questionable
criteria as the basis for decisions about item deletion or retention, g) failing
to consider the extent to which the final factor solution achieved adequate
approximation of simple structure, h) making revisions to item content or
adding or deleting items between the conclusion of EFA and the initiation
of SEM, i) using criteria and fit indices that varied widely to determine
overall model fit during SEM, j) failing to report confidence intervals when using the RMSEA, k) using item parcels (bundles) in scale development, and l) failing to engage in additional cross-validation following model mis-
specification and modification during SEM.
We offer a number of caveats for the earlier described critique of scale
development practices. First, some of these recommendations do not trans-
fer directly to other approaches to empirical scale development (e.g., crite-
rion group) and should be understood as primarily referring to the
homogenous item-grouping approach. Second, it is important to note that
EFA is intended to be a flexible statistical procedure to produce the most
interpretable solution, which can lead to acceptable variations in practice.
Thus, some researchers may disagree on how stringently to use criteria to
constrain the process of EFA, and we acknowledge that the subjective and
interpretive aspects of scale development may justify variations that arise in
specific contexts. Finally, the current literature on both EFA and SEM continues to contain debates and conflicting recommendations that may be at
variance with our conclusions. We provide recommendations for best practices here to increase standardization and rigor rather than as a resolution of those ongoing debates, and we expect them to be refined as data-driven improvements in best practices emerge.

RECOMMENDED BEST PRACTICES

1. Always provide a clear definition of the construct that the scale is intended to measure.
2. Use expert review of items prior to submitting them to EFA.
3. In general, EFA should precede CFA.
4. When using EFA, set a preestablished minimum sample size (≥ 100)
and then evaluate the need for additional data collection on the basis
of an initial EFA using communalities, factor saturation, and factor-
loadings criteria: (a) sample sizes of 150 to 200 are likely to be ade-
quate with data sets containing communalities higher than .50 or
with 10:1 items per factor with factor loadings at approximately |.4|
and (b) smaller samples sizes may be adequate if communalities are
all .60 or greater or with at least 4:1 items per factor and factor load-
ings greater than |.6|.
5. Verify the factorability of data via a significant Bartlett’s test of
sphericity (when the participants to items ratio is between 3:1 and
5:1), the KMO measure of sampling adequacy (values greater than
.60), or both.
6. Recognize and understand the basic differences between PCA and
FA extraction methods. For the purpose of scale development, FA is
generally preferred over PCA in most instances.

7. Even when theory suggests that factors will be uncorrelated, it is
good practice to use an oblique rotation when factors are correlated
in the data. Consider using an oblique rotation in the first run of an
EFA with each factor solution to empirically establish whether fac-
tors might be correlated.
8. Establish which criteria to use for factor retention and item deletion
or retention in advance (e.g., delete items with factor loadings less
than .32 or cross-loadings less than .15 difference from an item’s
highest factor loading; approximate simple structure; parallel analy-
sis; delete factors with fewer than two items unless the items are correlated, for example, r > .70). A brief sketch of the parallel-analysis criterion appears after this list.
9. Avoid allowing the influence of preconceived biases (e.g., how the
researcher wants the final solution to look) to override important sta-
tistical findings when making judgments. Consider using indepen-
dent judges to assist in decision making if it seems difficult to
disentangle researcher bias from conceptual interpretation of EFA
results.
10. If conducting scale-length optimization, it is essential to rerun the
EFA to ensure that item elimination did not result in changes to
factor structure, factor intercorrelations, item communalities, factor
loadings, or cross-loadings, so that all of the originally established
criteria for these outcomes are still met.
11. Avoid making changes to the scale produced by the final EFA prior
to conducting a CFA (e.g., adding new items, deleting items, chang-
ing item content, altering the rating scale). If you feel that the out-
comes of the EFA are unsatisfactory or that changes to the scale are
necessary, it is most appropriate to conduct a new EFA on the revised
scale before moving to CFA.
12. Competing-models approaches in SEM seem to be gaining favor in the literature over single-model approaches, indicating that researchers should consider comparing theoretically plausible models, whether nested, nonnested, or equivalent.
13. When using SEM, use model complexity as the central indicator to
establish the minimum sample size required before conducting CFA;
we recommend a minimum of 5 cases per parameter to be esti-
mated.
14. At a minimum, report the following SEM fit indices: (a) the chi-
square with corresponding degrees of freedom and level of signifi-
cance, (b) the RMSEA with corresponding 90% confidence
intervals, (c) the CFI, and (d) the SRMR.
15. When comparing competing models with SEM, add an appropriate
predictive fit index to the standard set described earlier (see Table 3).
16. Data-driven methods for model respecification in SEM are more
appropriate for fine-tuning than for large-scale respecification of
severely misspecified models.
17. The Lagrange Multiplier Modification index may be used for respec-
ifications in which parameters are being added to the model; the
Wald statistic may be used for decisions about eliminating parameters from the model. In the end, however, theory should accompany
modification procedures using these modification indices.
18. We recommend against item parceling (bundling) in SEM for scale
development research because item parcels can hide (a) the true rela-
tionships among items in the scale and (b) model misspecification
(which runs contrary to the underlying purposes of CFA).
19. Clearly report all of the decisions, rationales, and procedures when
using EFA and SEM in scale development research.
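As a complement to Recommendation 8, the sketch below illustrates the parallel-analysis criterion for factor retention (Horn, 1965; O'Connor, 2000): eigenvalues of the observed correlation matrix are retained only while they exceed the corresponding eigenvalues expected from random data of the same dimensions. This version uses the unreduced correlation matrix (the most common, component-based implementation); the demonstration data are simulated and would be replaced by the actual item matrix.

    import numpy as np

    def parallel_analysis(data, n_simulations=500, percentile=95, seed=0):
        # Number of factors whose observed eigenvalues exceed the chosen
        # percentile of eigenvalues obtained from random normal data.
        rng = np.random.default_rng(seed)
        n, p = data.shape
        observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

        random_eigs = np.empty((n_simulations, p))
        for i in range(n_simulations):
            random_data = rng.standard_normal((n, p))
            random_eigs[i] = np.sort(
                np.linalg.eigvalsh(np.corrcoef(random_data, rowvar=False))
            )[::-1]
        threshold = np.percentile(random_eigs, percentile, axis=0)

        exceeds = observed > threshold
        n_factors = p if exceeds.all() else int(np.argmin(exceeds))
        return n_factors, observed, threshold

    # Illustrative data: 300 cases, 12 items (substitute the real item responses)
    demo_items = np.random.default_rng(1).standard_normal((300, 12))
    n_factors, observed_eigs, random_thresholds = parallel_analysis(demo_items)
    print(f"Factors suggested by parallel analysis: {n_factors}")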

APPENDIX
Journal of Counseling Psychology
Scale Development Articles
Reference List (1995 to 2004)
Barber, J. P., Foltz, C., & Weinryb, R. M. (1998). The central relationship questionnaire: Initial
report. Journal of Counseling Psychology, 45, 131-142.
Dillon, F. R., & Worthington, R. L. (2003). The Lesbian, Gay, and Bisexual Affirmative
Counseling Self-Efficacy Inventory (LGB-CSI): Development, validation, and training
implications. Journal of Counseling Psychology, 50, 235-251.
Heppner, P. P., Cooper, C., Mulholland, A., & Wei, M. (2001). A brief, multidimensional, problem-
solving psychotherapy outcome measure. Journal of Counseling Psychology, 48, 330-343.
Hill, C. E., & Kellems, I. S. (2002). Development and use of the helping skills measure to
assess client perceptions of the effects of training and of helping skills in sessions. Journal
of Counseling Psychology, 49, 264-272.
Inman, A. G., Ladany, N., Constantine, M. G., & Morano, C. K. (2001). Development and pre-
liminary validation of the Cultural Values Conflict Scale for South Asian women. Journal
of Counseling Psychology, 48, 17-27.
Kim, B. K., Atkinson, D. R., & Yang, P. H. (1999). The Asian Values Scale: Development, fac-
tor analysis, validation, and reliability. Journal of Counseling Psychology, 46, 342-352.
Kivlighan, D. M., Multon, K. D., & Brossart, D. F. (1996). Helpful impacts in group counsel-
ing: Development of a multidimensional rating system. Journal of Counseling Psychology,
43, 347-355.
Lee, R. M., Choe, J., Kim, G., & Ngo, V. (2000). Construction of the Asian American Family
Conflicts Scale. Journal of Counseling Psychology, 47, 211-222.
Lehrman-Waterman, D., & Ladany, N. (2001). Development and validation of the evaluation
process within supervision inventory. Journal of Counseling Psychology, 48, 168-177.
Lent, R. W., Hill, C. E., & Hoffman, M. A. (2003). Development and validation of the
Counselor Activity Self-Efficacy scales. Journal of Counseling Psychology, 50, 97-108.
Liang, C. T. H., Li, L. C., & Kim, B. S. K. (2004). The Asian American Racism-Related Stress
Inventory: Development, factor analysis, reliability, and validity. Journal of Counseling
Psychology, 51, 103-114.
Mallinckrodt, B., Gantt, D. L., & Coble, H. M. (1995). Attachment patterns in the psy-
chotherapy relationship: Development of the client attachment to therapist scale. Journal
of Counseling Psychology, 42, 307-317.
Miville, M. L., Gelso, C. J., Pannu, R., Liu, W., Touradji, P., Holloway, P., & Fuertes, J. (1999).
Appreciating similarities and valuing differences: The Miville-Guzman Universality-
Diversity Scale. Journal of Counseling Psychology, 46, 291-307.

Mohr, J. J., & Rochlen, A. B. (1999). Measuring attitudes regarding bisexuality in lesbian, gay
male, and heterosexual populations. Journal of Counseling Psychology, 46, 353-369.
Neville, H. A., Lilly, R. L., Duran, G., Lee, R. M., & Browne, L. (2000). Construction and ini-
tial validation of the Color-Blind Racial Attitudes Scale (CoBRAS). Journal of Counseling
Psychology, 47, 59-70.
O'Brien, K. M., Heppner, M. J., Flores, L. Y., & Bikos, L. H. (1997). The Career Counseling
Self-Efficacy Scale: Instrument development and training applications. Journal of
Counseling Psychology, 44, 20-31.
Phillips, J. C., Szymanski, D. M., Ozegovic, J. J., & Briggs-Phillips, M. (2004). Preliminary
examination and measurement of the internship research training environment. Journal of
Counseling Psychology, 51, 240-248.
Rochlen, A. B., Mohr, J. J., & Hargrove, B. K. (1999). Development of the attitudes toward
career counseling scale. Journal of Counseling Psychology, 46, 196-206.
Schlosser, L. Z., & Gelso, C. J. (2001). Measuring the working alliance in advisor-advisee
relationships in graduate school. Journal of Counseling Psychology, 48, 157-167.
Skowron, E. A., & Friedlander, M. L. (1998). The differentiation of self inventory: Development
and initial validation. Journal of Counseling Psychology, 45, 235-246.
Spanierman, L. B., & Heppner, M. J. (2004). Psychosocial Costs of Racism to Whites Scale
(PCRW): Construction and initial validation. Journal of Counseling Psychology, 51, 249-262.
Utsey, S. O., & Ponterotto, J. G. (1996). Development and validation of the index of race-
related stress. Journal of Counseling Psychology, 43, 490-501.
Wang, Y., Davidson, M. M., Yakushko, O. F., Savoy, H. B., Tan, J. A., & Bleier, J. K. (2003).
The Scale of Ethnocultural Empathy: Development, validation, and reliability. Journal of
Counseling Psychology, 50, 221-234.

REFERENCES

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317-332.


Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Bandalos, D. J., & Finney, S. J. (2001). Item parceling issues in structural equation modeling.
In G. A. Marcoulides & R. E. Schumacker (Eds.), New developments and techniques in
structural equation modeling (pp. 269-296). Mahwah, NJ: Lawrence Erlbaum.
Bartlett, M. S. (1950). Tests of significance in factor analysis. British Journal of Psychology,
3, 77-85.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin,
107, 238-246.
Bentler, P. M. (1992). On the fit of models to covariances and methodology to the Bulletin.
Psychological Bulletin, 112, 400-404.
Bentler, P. M. (1995). EQS: Structural equations program manual. Encino, CA: Multivariate
Software.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis
of covariance structures. Psychological Bulletin, 88, 588-606.
Bentler, P. M., & Chou, C.-P. (1987). Practical issues in structural modeling. Sociological
Methods & Research, 16, 78-117.
Bollen, K. A. (1989). A new incremental fit index for general structural equation models.
Sociological Methods & Research, 17, 303-316.
Bozdogan, H. (1987). Model selection and Akaike’s information criteria (AIC): The general
theory and its analytical extensions. Psychometrika, 52, 345-370.


Brown, F. G. (1983). Principles of educational and psychological testing (3rd ed.). New York:
Holt, Rinehart, & Winston.
Browne, M. W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological
Methods and Research, 21, 230-258.
Byrne, B. M. (2001). Structural equation modeling with AMOS: Basic concepts, applications
and programming. Mahwah, NJ: Lawrence Erlbaum.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral
Research, 1, 245-276.
Cattell, R. B. (1974). Radial item parcel factoring vs. item factoring in defining personality
structure in questionnaires: Theory and experimental checks. Australian Journal of
Psychology, 26, 103-119.
Chou, C., & Bentler, P. M. (2002). Model modification in structural equation modeling by
imposing constraints. Computational Statistics and Data Analysis, 41, 271-287.
Comrey, A. L. (1973). A first course in factor analysis. New York: Academic Press.
Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Hillsdale, NJ:
Lawrence Erlbaum.
Converse, J. M., & Presser, S. (1986). Survey questions: Handcrafting the standardized ques-
tionnaire. Newbury Park, CA: Sage.
Dawis, R. V. (1987). Scale construction. Journal of Counseling Psychology, 34, 481-489.
DeVellis, R. F. (2003). Scale development: Theory and applications (2nd ed.). Thousand Oaks,
CA: Sage.
Fan, X., Thompson, B., & Wang, L. (1999). Effects of sample size, estimation methods, and model
specification on structural equation modeling fit indexes. Structural Equation Modeling, 6, 56-83.
Fassinger, R. E. (1987). Use of structural equation modeling in counseling psychology
research. Journal of Counseling Psychology, 34, 425-436.
Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refinement of
clinical assessment instruments. Psychological Assessment, 7, 286-299.
Friedenberg, L. (1995). Psychological testing: Design, analysis, and use. Boston, MA: Allyn
and Bacon.
Gerbing, D. W., & Hamilton, J. G. (1996). Viability of exploratory factor analysis as a pre-
cursor to confirmatory factor analysis. Structural Equation Modeling, 3, 62-72.
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Gorsuch, R. L. (1990). Common factor analysis versus principal components analysis: Some
well and little known facts. Multivariate Behavioral Research, 25, 33-39.
Gorsuch, R. L. (1997). Exploratory factor analysis: Its role in item analysis. Journal of
Personality Assessment, 68, 532-560.
Gorsuch, R. L. (2003). Factor analysis. In J. A. Schinka & W. F. Velicer (Eds.), Handbook of
psychology: Research methods in psychology (Vol. 2, pp. 143-164). Hoboken, NJ: John Wiley.
Grimm, L. G., & Yarnold, P. R. (1995). Reading and understanding multivariate statistics.
Washington, DC: American Psychological Association.
Guadagnoli, E., & Velicer, W. F. (1988). The relationship of sample size to the stability of com-
ponent patterns. Psychological Bulletin, 103, 265-275.
Helms, J. E., Henze, K. T., Sass T. L., & Mifsud, V. A. (2006). Treating Cronbach’s alpha
reliability as data in nonpsychometric substantive applied research. The Counseling
Psychologist, 34, 630-660.
Henson, R. K. (2006). Effect-size measures and meta-analytic thinking in counseling psy-
chology research. The Counseling Psychologist, 34, 601-629.
Hoelter, J. W. (1983). The analysis of covariance structures: Goodness-of-fit indices.
Sociological Methods & Research, 11, 325-344.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis.
Psychometrika, 30, 179-185.


Hoyt, W. T., Warbasse, R. E., & Chu, E. Y. (2006). Construct validation in counseling
psychology research. The Counseling Psychologist, 34, 769-805.
Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to
underparameterized model misspecification. Psychological Methods, 3, 424-453.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.
Humphreys, L. G., & Montanelli, R. G. (1975). An investigation of the parallel analysis criterion for
determining the number of common factors. Multivariate Behavioral Research, 10, 193-205.
Jöreskog, K. G., & Sörbom, D. (1981). LISREL V: Analysis of linear structural relations by
the method of maximum likelihood. Chicago: International Educational Services.
Jöreskog, K. G., & Sörbom, D. (1984). LISREL 6: A guide to the program and applications.
Chicago: SPSS.
Kahn, J. H. (2006). Factor analysis in counseling psychology research, training, and practice:
Principles, advances, and applications. The Counseling Psychologist, 34, 684-718.
Kaiser, H. F. (1958). The varimax criterion for analytic rotation in factor analysis.
Psychometrika, 23, 187-200.
Kenny, D. A., & McCoach, D. B. (2003). Effect of the number of variables on measures of fit
in structural equation modeling. Structural Equation Modeling, 10, 333-351.
Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New
York: Guilford.
Loehlin, J. C. (1998). Latent variable models: An introduction to factor, path, and structural
analysis (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
MacCallum, R. C. (1986). Specification searches in covariance structure modeling.
Psychological Bulletin, 100, 107-120.
MacCallum, R. C., Roznowski, M., & Necowitz, L. B. (1992). Model modifications in covari-
ance structure analysis: The problem of capitalization on chance. Psychological Bulletin,
111, 490-504.
MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of
equivalent models in applications of covariance structure analysis. Psychological Bulletin,
114, 185-199.
MacCallum, R. C., Widaman, K. F., Zhang, S., & Hong, S. (1999). Sample size in factor analy-
sis. Psychological Methods, 4, 84-99.
Marsh, H. W., Balla, J. R., & Hau, K. T. (1996). An evaluation of incremental fit indices: A clar-
ification of mathematical and empirical properties. In G. A. Marcoulides & R. E. Schumacker
(Eds.), Advanced structural equation modeling: Issues and techniques (pp. 315-353).
Mahwah, NJ: Lawrence Erlbaum.
Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indexes in confirma-
tory factor analysis: The effect of sample size. Psychological Bulletin, 103, 391-410.
Marsh, H. W., Hau, K.-T., Balla, J. R., & Grayson, D. (1998). Is more ever too much? The
number of indicators per factor in confirmatory factor analysis. Multivariate Behavioral
Research, 33, 181-220.
Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-
testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing
Hu and Bentler’s (1999) findings. Structural Equation Modeling, 11, 320-341.
Martens, M. P. (2005). The use of structural equation modeling in counseling psychology
research. The Counseling Psychologist, 33, 269-298.
Martens, M. P., & Hasse, R. F. (2006). Advanced applications of structural equation modeling
in counseling psychology research. The Counseling Psychologist, 34, 878-911.
McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Lawrence Erlbaum.
McDonald, R. P. (1989). An index of goodness-of-fit based on noncentrality. Journal of
Classification, 6, 97-103.


McDonald, R. P., & Marsh, H. W. (1990). Choosing a multivariate model: Noncentrality and
goodness of fit. Psychological Bulletin, 107, 247-255.
Mulaik, S. A., James, L. R., Van Alstine, J., Bennett, N., Lind, S., & Stilwell, C. D. (1989).
Evaluation of goodness-of-fit indices for structural equation models. Psychological
Bulletin, 105, 430-445.
O’Connor, B. P. (2000). SPSS and SAS programs for determining the number of components
using parallel analysis and Velicer’s MAP test. Behavior Research Methods, Instruments,
and Computers, 32, 396-402.
Quintana, S. M., & Maxwell, S. E. (1999). Implications of recent developments in structural
equation modeling for counseling psychology. The Counseling Psychologist, 27, 485-527.
Quintana, S. M., & Minami, T. (2006). Guidelines for meta-analyses of counseling psychol-
ogy research. The Counseling Psychologist, 34, 839-876.
Reise, S. P., Waller, N. G., & Comrey, A. L. (2000). Factor analysis and scale revision.
Psychological Assessment, 12, 287-297.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.
Sherry, A. (2006). Discriminant analysis in counseling psychology research. The Counseling
Psychologist, 34, 661-683.
Steiger, J. H. (1989). EzPATH: A supplementary module for SYSTAT and SYGRAPH.
Evanston, IL: SYSTAT.
Steiger, J. H., & Lind, J. C. (1980, May). Statistically based tests for the number of common
factors. Paper presented at the annual meeting of the Psychometric Society, Iowa City, IA.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). New York:
Harper & Row.
Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts
and applications. Washington, DC: American Psychological Association.
Tinsley, H. E. A., & Tinsley, D. J. (1987). Uses of factor analysis in counseling psychology
research. Journal of Counseling Psychology, 34, 414-424.
Tucker, L. R., Koopman, R. F., & Linn, R. L. (1969). Evaluation of factor analytic research
procedures by means of simulated correlation matrices. Psychometrika, 34, 421-459.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor
analysis. Psychometrika, 38, 1-10.
Velicer, W. F., & Fava, J. L. (1998). Effects of variable and subject sampling on factor pattern
recovery. Psychological Methods, 3, 231-251.
Velicer, W. F., & Jackson, D. N. (1990). Component analysis versus common factor analysis:
Some issues in selecting an appropriate procedure. Multivariate Behavioral Research,
25, 1-28.
Velicer, W. F., Peacock, A. C., & Jackson, D. N. (1982). A comparison of component and fac-
tor patterns: A Monte Carlo approach. Multivariate Behavioral Research, 17, 371-388.
West, S. G., Finch, J. F., & Curran, P. J. (1995). Structural equation models with nonnormal
variables: Problems and remedies. In R. H. Hoyle (Ed.), Structural equation modeling:
Concepts, issues, and applications (pp. 56-75). Thousand Oaks, CA: Sage.
Weston, R., & Gore, P. A., Jr. (2006). SEM 101: A brief guide to structural equation modeling.
The Counseling Psychologist, 34, 719-751.
Widaman, K. F. (1993). Common factor analysis versus principal components analysis:
Differential bias in representing model parameters? Multivariate Behavioral Research,
28, 263-311.
Worthington, R. L., & Navarro, R. L. (2003). Pathways to the future: Analyzing the contents
of a content analysis. The Counseling Psychologist, 31, 85-92.
Zwick, W. R., & Velicer, W. F. (1986). Factors influencing five rules for determining the num-
ber of components to retain. Psychological Bulletin, 99, 432-442.
