
Article

Educational and Psychological Measurement
2014, Vol. 74(6) 1049-1066
© The Author(s) 2014
Reprints and permissions: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0013164414525040
epm.sagepub.com

Enhancing a Short Measure of Big Five Personality Traits With Bayesian Scaling

W. Paul Jones1

Abstract
A study in a university clinic/laboratory investigated adaptive Bayesian scaling as a
supplement to interpretation of scores on the Mini-IPIP. A probability of belonging
in categories of low, medium, or high on each of the Big Five traits was calculated
after each item response, continuing until all items had been used or until a
predetermined criterion for the posterior probability had been reached. The study found
higher levels of correspondence with the IPIP-50 score categories using the adaptive
Bayesian scaling than with the Mini-IPIP alone. The number of additional items ranged
from a mean of 2.9 to 12.5 contingent on the level of certainty desired.

Keywords
Big Five personality traits, Bayesian scaling, Mini-IPIP

Introduction
Issues associated with inconsistent and/or random responses continue to be a concern
in personality assessment (Siefert et al., 2012). Although not without some contro-
versy (McGrath, Mitchell, Kim, & Hough, 2010), a variety of approaches to detect
invalid protocols, after the fact, are available. It has also been suggested that reduc-
ing the number of items and the corresponding time required to complete the assess-
ment could reduce the likelihood of invalid responses. Examples of factors that have
been suggested to possibly increase response validity through reduction in the length
of the test include reduction in tedium for the person taking the test (Forbey & Ben-
Porath, 2007) and reduction in careless responses due to frustration (Schmidt, Le, &

1University of Nevada, Las Vegas, NV, USA

Corresponding Author:
W. Paul Jones, Department of Educational Psychology and Higher Education, University of Nevada, Las
Vegas, 4505 S. Maryland Parkway, Las Vegas, NV 89154-3003, USA.
Email: jones@unlv.nevada.edu

Downloaded from epm.sagepub.com at Tumaini Uni-Ingringa University College on November 28, 2015

Ilies, 2003). In addition, when the questionnaire is one of several included in a single
research study, Donnellan, Oswald, Baird, and Lucas (2006) point out that long ques-
tionnaires could encourage users to simply drop out of the study. Kruyen, Emons,
and Sijtsma (2013), in an extensive review of shorter versions of existing tests, found
that savings in time and efficiency was a frequently reported motivation for develop-
ment of the shorter versions.

Shorter Personality Scales

There are many examples of shortened personality assessment scales. For example,
Saucier (1994) reduced the length of a set of adjective markers for the Big Five per-
sonality traits from 100 to 40 and found several advantages in the outcome including
lower interscale correlations as well as increased brevity. Gosling, Rentfrow, and
Swann (2003) investigated the properties of scales to measure the Big Five personal-
ity traits in studies using one and two items per trait.
One particularly noteworthy example of item reduction efforts in the assessment
of Big Five personality traits is the Mini-IPIP. Exploratory factor analysis was used
(Donnellan et al., 2006), followed by a series of studies with college students in the
United States, to select four items for each of the Big Five personality traits from a
base of the 50-item International Personality Item Pool five-factor model (IPIP-50;
Goldberg, 1999). Follow-up studies by other researchers (Baldasaro, Shanahan, &
Bauer, 2013; Cooper, Smillie, & Corr, 2010) and a version extended to include an
additional trait (Milojev, Osborne, Greaves, Barlow, & Sibley, 2013), including sam-
ples from other countries, have been generally supportive of Mini-IPIP as a viable
short form of the IPIP-50.
A scale with fewer items, while obviously reducing testing time, often comes
with a cost in measurement quality (Crede, Harms, Niehorster, & Gaye-Valentine,
2012; Kruyen et al., 2013). For example, in development studies for the Mini-IPIP
(Donnellan et al., 2006), the alpha reliability coefficients for the five traits using the
50-item IPIP ranged from .80 to .87 with a median of .80 in one study and ranged
from .78 to .91 with a median of .81 in another. Corresponding alpha coefficients
for the Mini-IPIP ranged from .65 to .77 with a median of .69 in one study and
from .70 to .82 with a median of .75 in another. Comparable alpha coefficients
for the Mini-IPIP were found in a study with college students in England and
Wales (Cooper et al., 2010).

Adaptive Scaling Approaches


The term adaptive testing is most often associated with applications of item
response theory (IRT), but there are other approaches (Forbey & Ben-Porath, 2007).
In fact, while IRT models for computer-based adaptive scaling are dominant in mea-
surement of cognitive function, questions have been raised by Chernyshenko, Stark,


Chan, Drasgow, and Williams (2001) about the extent to which IRT models are
effective in applications with personality assessment.
An example of a non-IRT adaptive scaling approach when assignment to a normative
category is sufficient is the simple countdown method proposed for the
Minnesota Multiphasic Personality Inventory (MMPI; Butcher, Keller, & Bacon,
1985), based on determining whether a scale score did or did not reach a level of
clinical elevation. Item administration ends as soon as the item total reaches
clinical elevation or the number of remaining items would not be sufficient for
classification as clinically elevated. A derivative of this method with the
MMPI-2 focuses on identifying the extent of clinical elevation: item administration
stops when the level of clinical elevation cannot be reached with the remaining
items but continues through all items when a scale is clinically elevated. Studies,
for example, Forbey and Ben-Porath (2007), with the countdown approaches have
generally found significant time savings with the fewer number of items and without
significant impact on validity.
Another adaptive approach for appraisal of personality characteristics when cate-
gory assignment is sufficient is an application of Bayesian probability. Bayesian
models can be mathematically complex, but Jones (1993) found that an application
of the most basic Bayesian approach (Phillips, 1973) could significantly reduce the
number of items used in a personality assessment with little apparent loss in classifi-
cation accuracy. This approach is similar to the countdown method in that the objec-
tive is assigning the score to a category rather than identifying a point on an
underlying trait continuum, but differs in that the outcome of the assessment includes
the probability of the individual's belonging in the category.
The underlying premise in this adaptive approach is that there are research ques-
tions for which determining the correlates of classification as high, medium, or low
groups on a Big Five personality trait scale is potentially as useful as knowing the
point scores on the scale. Given data on responses to individual items by individuals
in each of those groups, a basic Bayesian application can provide the probability of
belonging to a high, medium, or low group that is informed on a continuous basis
by responses to each successive item. Measurement can continue until there have
been responses to all available items or can stop whenever a predetermined level of
certainty is reached.

Overview of the Present Study


Data obtained in a university clinic/laboratory in the United States provided the
opportunity for investigation of the impact of using the basic Bayesian model in a
form of adaptive scaling as an extension of the Mini-IPIP. The impact of Bayesian
scaling was examined with a data set composed of university subject pool partici-
pants who completed the IPIP-50 in the period between 2009 and 2012. The data set
was divided into a baseline group whose data were used to generate the normative
categories and item response patterns within each category and a separate validation


group used to test the accuracy of the predictions of classifications in five scenarios
for adding items to the Mini-IPIP.
The study was designed to address three primary research questions. Simply
increasing the number of items in the Mini-IPIP with standard scoring would be
expected to increase the accuracy of predictions of IPIP-50 category classifications.
The first research question is whether, and to what extent, using adaptive Bayesian
scaling with the additional items would increase the accuracy of the predictions from
the Mini-IPIP alone. The second and third research questions address the extent of
increase in the number of items and impact on classification predictions using vari-
ous stop rules to define an acceptable level of certainty, and the impact on scale
reliability with a variable number of items administered.

Method
Participants in the Study
Data for the study were provided by a total of 854 students attending an urban, south-
western university in the United States. Participants were volunteers who chose a
study that included online computer administration of the IPIP-50 item scale from
several different projects available to meet a research participation requirement for
courses in educational psychology.
The baseline sample, 607 participants, was composed of the subset of the total
participant sample who completed the IPIP-50 in studies during the period between
summer 2009 and summer 2011 and was used to provide the normative data and the
individual item likelihoods for the Bayesian analysis. In the baseline sample, 450
(74.1%) were female and 157 (25.9%) were male. Of the participants (606) who
responded to the age range question, 373 (61.6%) were in the 18 to 25 age range,
158 (26%) were in the 26 to 35 age range, 38 (6.3%) were in the 36 to 45 age range,
and 37 (6.1%) were over 45 years of age. Of the participants (600) who reported eth-
nic background, the results were the following: African American = 58 (9.7%),
Asian American = 37 (6.2%), Caucasian = 378 (63%), Hispanic American = 80
(13.3%), and Other = 47 (7.8%).
The validation sample, 247 participants, was composed of the subset of the total
sample who completed the IPIP-50 in studies in the spring, summer, and fall of 2012.
Of the validation participants (n = 246) who responded to the gender, age range, and
ethnicity questions, 185 were female and 61 were male; 144 (58.5%) were in the 18 to
25 age range, 57 (23.2%) were in the 26 to 35 age range, 25 (10.2%) were in the 36 to
45 age range, and 20 (8.1%) were over 45 years of age; ethnic background was as follows:
African American = 19 (7.7%), Asian American = 23 (9.3%), Caucasian = 155
(63%), Hispanic American = 34 (13.8%), and Other = 15 (6.1%).
Chi-square analysis was used to examine the demographic comparability between
the baseline and validation samples to assess the extent to which the two samples
appeared to represent the same population, an important feature in cross-validation.
There were no statistically significant differences evident between the baseline and


validation samples in gender, χ2(1, 853) = 0.06, p = .807, age range, χ2(3, 852) = 5.5,
p = .139, or ethnicity, χ2(4, 846) = 3.96, p = .412.

Instrumentation
IPIP-50. The Big Five personality characteristics were measured using the 50-item
scale from the International Personality Item Pool (http://ipip.ori.org/ipip/). The
International Personality Item Pool (Goldberg et al., 2006) is an open source resource
with sample questionnaires for personality constructs including the five-factor model
of Extraversion (E), Agreeableness (A), Conscientiousness (C), Neuroticism (N), and
Openness (O). In the IPIP-50, there are 10 items for each of the five traits, statements
to which the user responds on a five-choice Likert-type scale with response options
ranging from "very inaccurate" to "very accurate." Reliability estimates (coefficient
alpha) for the five scales using the baseline sample data were the following: E = .88,
A = .76, C = .81, N = .86, and O = .76, similar to the IPIP-50 reliability estimates in
studies with different participant samples.

Mini-IPIP. Scores on the Mini-IPIP were constructed, as in prior studies, by summing
the responses on the four identified items for each trait. Reliability estimates for the
Mini-IPIP trait scores in the baseline sample were the following: E = .764, A = .683,
C = .688, N = .671, and O = .648, also similar to prior studies.
In their development of the Mini-IPIP, Donnellan et al. (2006) chose Intellect/
Imagination as the label for the trait often referred to as Openness. While this author
concurs with the rationale they presented and would add comparable concern about
implications of the label Agreeableness, the more traditional labels were used for
the traits in this study.

Procedure
Norms and Item Likelihoods. National norms for the IPIP-50 or Mini-IPIP to use in
assignment of scores as high, medium, and low are not available. Goldberg's stated
position on national norms (Goldberg et al., 2006) is that most such norms are
misleading, and he suggests that users needing norms on IPIP scales should develop local
norms based on their own samples. For this study, data from the 607 participants in
the baseline sample were used to create high, medium, and low normative categories
for each of the five personality traits on the IPIP-50 and constructed Mini-IPIP. The
three category classifications were based on a traditional normalized stanine scale
with the usual interpretation of stanines 1 to 3 as low, stanines 4 to 6 as medium, and
stanines 7 to 9 as high (McIntire & Miller, 2007). Using the z score boundaries for
stanines 4 and 6, the categories were created by calculating z scores for each trait:
z scores ranging from −.75 to +.75 were identified as medium, and z scores lower or
higher were classified as low or high, respectively.
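The stanine-based split can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code, and the score values shown are hypothetical rather than data from the study.

```python
from statistics import mean, stdev

def categorize(scores, low_z=-0.75, high_z=0.75):
    """Assign low/medium/high labels using the stanine 4-6 z boundaries.

    Scores with z between -.75 and +.75 are "medium"; lower and higher
    z scores are "low" and "high", respectively, as described in the text.
    """
    m, s = mean(scores), stdev(scores)
    labels = []
    for x in scores:
        z = (x - m) / s
        labels.append("low" if z < low_z else "high" if z > high_z else "medium")
    return labels

# Hypothetical raw trait scores, not the study's norms:
print(categorize([1, 2, 3, 4, 5]))  # → ['low', 'medium', 'medium', 'medium', 'high']
```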
The resulting norms for IPIP-50 and Mini-IPIP in this sample are displayed in
Table 1. The extent to which these normative classifications would generalize to


Table 1. Normative Classifications for IPIP-50 and Mini-IPIP From Baseline Sample.

              E (n = 600)    A (n = 599)    C (n = 585)    N (n = 599)    O (n = 599)

IPIP-50
  M (SD)      33.95 (8.453)  41.61 (5.324)  36.64 (6.822)  28.46 (8.043)  38.38 (5.694)
  High        41-50          47-50          43-50          35-50          44-50
  Medium      28-40          38-46          32-42          22-34          34-43
  Low         10-27          10-37          10-31          10-21          10-33
Mini-IPIP
  M (SD)      13.53 (3.855)  16.64 (2.625)  14.05 (3.458)  10.48 (3.453)  15.19 (3.077)
  High        17-20          19-20          18-20          14-20          18-20
  Medium      11-16          15-18          11-17          8-13           13-17
  Low         5-10           5-14           5-10           5-7            5-12

Note. E = Extraversion; A = Agreeableness; C = Conscientiousness; N = Neuroticism; O = Openness.

other populations is uncertain, but it was interesting to note that the IPIP-50 means
and standard deviations in this baseline sample appeared remarkably similar to those
reported in a study with 2,263 participants (Donnellan et al., 2006). For the
Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness
(Imagination/Intellect) traits, the mean scores in the sample used in the current study
were 34.0, 41.6, 36.6, 28.5, and 38.4, respectively. Comparable mean scores on the
IPIP-50 in the study used to develop the Mini-IPIP (Donnellan et al., 2006) were
33.6, 40.0, 35.7, 27.2, and 36.3.
The mean Mini-IPIP scores for the baseline sample for the Extraversion,
Agreeableness, Conscientiousness, Neuroticism, and Openness traits were 13.5, 16.6,
14.1, 10.5, and 15.1, respectively. Comparable Mini-IPIP scores in the study by
Cooper et al. (2010) with approximately 1,500 participants in England and Wales
were 13.0, 16.6, 13.2, 11.8, and 15.8.
The three normative categories were then used in calculation of individual item
likelihoods for the adaptive Bayesian scaling. For each IPIP-50 normative classifica-
tion and each item, a likelihood was calculated for each response alternative. To
illustrate, the first item in the IPIP-50 Extraversion trait scale is "Am the life of the
party." Of those whose overall score on the Extraversion scale was in the high category,
the proportion who responded "very accurate" was .322, the proportion who
responded "moderately accurate" was .507, and so on. Of those whose overall
Extraversion score was in the medium category, the proportions of "very accurate"
and "moderately accurate" responses on this item were .079 and .465, respectively.
The comparable proportions of participants with overall Extraversion scores in the
low category were .014 and .117 for the "very accurate" and "moderately accurate"
response options, respectively. The Descriptive Statistics-Frequencies function in
IBM SPSS, Version 20, was used to generate the 750 likelihoods (5 response options
× 50 items × 3 categories).
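The likelihood tabulation described above amounts to computing, within each normative category, the proportion of respondents choosing each response option. A minimal sketch, with a hypothetical function name and toy data rather than the study's SPSS output:

```python
from collections import Counter, defaultdict

def item_likelihoods(responses, categories, options=(1, 2, 3, 4, 5)):
    """Estimate P(response option | category) for one item from baseline data.

    `responses` holds the option codes chosen for this item; `categories`
    holds the high/medium/low classification of each respondent's trait score.
    """
    by_cat = defaultdict(list)
    for resp, cat in zip(responses, categories):
        by_cat[cat].append(resp)
    return {cat: {opt: Counter(rs)[opt] / len(rs) for opt in options}
            for cat, rs in by_cat.items()}

# Tiny hypothetical example: three "high" respondents and one "low" one.
table = item_likelihoods([5, 5, 4, 1], ["high", "high", "high", "low"])
```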


Table 2. Illustration of Bayesian Formula.

Hypothesis   Prior Belief   Likelihood   Prior × Likelihood   Posterior Belief

High         .23            .322         .07406               .07406/.11994 = .61748
Medium       .54            .079         .04266               .04266/.11994 = .35568
Low          .23            .014         .00322               .00322/.11994 = .02685
Sum                                      .11994

Validation Procedures. For this study, a parsimonious Bayesian approach was selected
using the basic Bayesian formula described below:

P(H|E) = P(H) P(E|H) / P(E),

where

P(H|E) = posterior probability
P(H) = prior probability of outcome
P(E|H) = likelihood of observed event given hypothesized outcome
P(E) = overall likelihood of observed event

Table 2 illustrates the calculations using the response to the first item in the
Extraversion scale. The prior beliefs about the respondent being in the high, medium,
or low category on this scale are .23, .54, and .23, respectively (from the procedure
used to calculate the norms). In the illustration, the respondent selected "very
accurate" for the item "life of the party." From the item likelihood data obtained from
the baseline sample, the response of "very accurate" on this item is the one made by
32.2% of those in the high group, 7.9% of those in the medium group, and 1.4% of
those in the low group. The proportions become the likelihoods for the analysis. The
likelihoods are multiplied by the prior beliefs and summed.
The posterior beliefs for belonging in each of the three categories are then calcu-
lated as the product of the prior beliefs and likelihoods divided by the sum of the
products of the prior beliefs and likelihoods. The posterior beliefs become the new
prior beliefs for calculations with the next item. In this illustration, the probability of
belonging in the normative category of high on the Extraversion trait changed from
.23 to .617 after the response to the initial item. The process continues until there are
no more items or until a predetermined criterion for the posterior probability has been
obtained.
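The update is a direct application of the formula; a minimal sketch that reproduces the Table 2 illustration:

```python
def bayes_update(priors, likelihoods):
    """One step of the basic rule: posterior proportional to prior x likelihood."""
    products = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(products.values())  # P(E), the .11994 in Table 2
    return {h: p / total for h, p in products.items()}

# Priors from the norms; likelihoods for a "very accurate" response to the
# first Extraversion item (the Table 2 illustration).
post = bayes_update({"high": .23, "medium": .54, "low": .23},
                    {"high": .322, "medium": .079, "low": .014})
print(round(post["high"], 3))  # → 0.617
```

The returned posteriors would serve as the priors for the next item, exactly as the text describes.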
The procedure, while labor intensive if done by hand, is readily accomplished with
a computer. The calculations were done with a standard spreadsheet in this study.
The adaptive aspect of this study necessitates identifying a "stop rule": a probability
deemed sufficient for classification, after which no additional items for
that trait would be presented. When adaptive testing is based on an IRT model,


variable length stop rules are typically based on either standard error or minimum
information approaches (Choi, Grady, & Dodd, 2011). In this Bayesian application,
however, the premise is that the stop rule will vary by the intent of the assessment
and preferences of the clinician or researcher. Setting a lower stop point reduces the
number of items that would be presented, with a risk of increasing classification
errors. Setting a higher stop point has the opposite effect: a lower risk of
classification errors but a smaller reduction in the number of items required.
The stop rules in this study were defined based on the posterior probability
after each calculation. For example, when the stop rule was .95, no more items for
that trait were administered when there was a posterior probability equal to or greater
than .95 for the assignment as high, medium, or low.
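Combining the sequential update with a stop rule gives the adaptive procedure. A sketch under the same assumptions as above; the likelihood-table layout and example data are illustrative, not the study's:

```python
def adaptive_scale(responses, item_tables, priors, stop=0.95, min_items=4):
    """Update beliefs item by item; stop once any posterior reaches `stop`.

    `item_tables[i][cat][resp]` is P(resp | cat) for item i, in the fixed
    administration order. At least `min_items` items (the Mini-IPIP core in
    the study) are always used before the stop rule is checked.
    """
    beliefs = dict(priors)
    used = 0
    for i, resp in enumerate(responses):
        total = sum(beliefs[c] * item_tables[i][c][resp] for c in beliefs)
        beliefs = {c: beliefs[c] * item_tables[i][c][resp] / total for c in beliefs}
        used = i + 1
        if used >= min_items and max(beliefs.values()) >= stop:
            break
    return beliefs, used

# Hypothetical two-item example in which both responses strongly favor "high".
tbl = {"high": {5: .9}, "medium": {5: .05}, "low": {5: .05}}
beliefs, used = adaptive_scale([5, 5], [tbl, tbl],
                               {"high": .23, "medium": .54, "low": .23},
                               stop=0.95, min_items=1)
```

In this toy run, the first response leaves the "high" posterior below .95, so a second item is administered and the posterior then clears the stop point.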
This study first calculated the correspondence between placement in the high,
medium, and low categories on each trait based on the Mini-IPIP norms compared
with placement based on the IPIP-50 norms without Bayesian scaling.
Correspondence was defined using both the percentage of agreement and the Kappa
coefficient (Cohen, 1960). The study then simulated the impact of adaptive Bayesian
scaling on correspondence and the number of items required with five stop points,
.95, .90, .85, .80, and .75. After administration of a minimum of four items (Mini-
IPIP), the use of additional items for that trait continued until the posterior probabil-
ity of belonging in a normative category reached the predetermined stop point or
until all items for that trait had been used.
When items beyond the four items in the Mini-IPIP were required to reach the stop
points in the validation sample, the order of the additional items in each trait was
determined by the correlation of the item with the raw score total on that trait in the
baseline sample. The order was the same for all respondents. Thus, if five items were
required to reach the stop point in a trait, the same five items in the same order were
used for each participant, and so forth.
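The fixed ordering by item-total correlation can be sketched as follows; this is an illustrative reconstruction (function names and data are hypothetical), not the authors' code:

```python
def order_items(item_columns, trait_totals):
    """Return item indices sorted by Pearson r with the trait raw total,
    highest first, fixing the administration order for all respondents."""
    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)
    return sorted(range(len(item_columns)),
                  key=lambda i: pearson(item_columns[i], trait_totals),
                  reverse=True)

# Hypothetical baseline data: item 0 tracks the total, item 1 opposes it.
order = order_items([[1, 2, 3], [3, 2, 1]], [10, 20, 30])  # → [0, 1]
```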

Results
Correspondence With IPIP-50 Classifications of Mini-IPIP and Bayesian
Scenarios
Table 3 displays the correspondence between normative category classifications
based on the IPIP-50 compared with classifications from the Mini-IPIP and various
Bayesian scaling scenarios. The n for each trait varied (233 to 246) contingent on the
number of participant response omissions for that trait.
A perusal of Table 3 suggests an overall satisfactory level of correspondence of
the Mini-IPIP and IPIP-50 classifications. The Kappa coefficient proposed by Cohen
(1960) quantifies the extent of agreement between ratings, correcting for the expected
extent of agreement that results from chance alone. Kappa coefficients across the five
traits ranged from .575 to .744 with a median of .673.
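Cohen's kappa can be computed directly from the paired classifications; a minimal sketch with hypothetical data, not the study's:

```python
def cohen_kappa(rater_a, rater_b, labels=("low", "medium", "high")):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_chance = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical classifications: 3 of 4 agree, and chance agreement is .5.
k = cohen_kappa(["low", "low", "high", "high"],
                ["low", "high", "high", "high"])
print(k)  # → 0.5
```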
Landis and Koch (1977) provided the initial guidance for interpreting the Kappa
coefficient with labels ranging from less than chance level to almost perfect


Table 3. Correspondence of Mini-IPIP and Bayesian Scaling Scenarios With IPIP-50
Classifications in Validation Sample.

E (n = 243) A (n = 242) C (n = 233) N (n = 246) O (n = 246)

Mini-IPIP
Hit Miss Hit Miss Hit Miss Hit Miss Hit Miss
Total 205 38 179 63 188 45 208 38 197 49
Total % 84 16 74 26 81 19 85 15 80 20
Kappa .730 .575 .652 .744 .673
.95 Stop Rule
Hit Miss Hit Miss Hit Miss Hit Miss Hit Miss
Total 221 22 217 25 208 25 225 21 226 20
Total % 91 9 90 10 89 11 91 9 92 8
Kappa .851 .824 .811 .859 .865
.90 Stop Rule
Total 218 25 216 26 202 31 222 24 223 23
Total % 90 10 89 11 87 13 90 10 91 9
Kappa .829 .818 .770 .840 .844
.85 Stop Rule
Total 215 28 212 30 200 33 220 26 219 27
Total % 88 12 88 12 86 15 89 11 89 11
Kappa .806 .789 .755 .827 .816
.80 Stop Rule
Total 216 27 210 37 197 36 214 32 216 30
Total % 89 11 87 15 85 15 87 13 88 12
Kappa .813 .774 .735 .787 .795
.75 Stop Rule
Total 211 32 205 37 197 36 213 33 215 31
Total % 87 13 85 15 85 15 87 13 87 13
Kappa .775 .737 .735 .780 .788

Note. E = Extraversion; A = Agreeableness; C = Conscientiousness; N = Neuroticism; O = Openness.

agreement, and these continue to be commonly cited (Viera & Garrett, 2005). Munoz
and Bangdiwala (1997) noted that the labels were arbitrary rather than empirically
based and suggested modifications of the range of coefficients in the labels based on
the number of categories. For a 3 × 3 table, they suggest .75 as the lower bound for
Kappa interpretation of almost perfect, .45 as the lower bound for substantial, and .20
as the lower bound for moderate agreement. Using these guidelines, the Kappa coeffi-
cients for Mini-IPIP and IPIP-50 ratings of placement in high, medium, or low on
each trait would typically be interpreted as indicating substantial agreement.
A key question is whether adaptive Bayesian scaling would enhance the extent of
agreement and/or the level of accuracy estimated by the Kappa coefficients using the
Mini-IPIP alone and the number of additional items that would be required for such
enhancement. The results of the effects of five stop rules on simple percentage of
agreement and on classification accuracy with the Kappa coefficient are also dis-
played in Table 3.


A perusal of the data in Table 3 indicates that the simple percentages of agree-
ment and the Kappa coefficients with IPIP-50 classifications as the criterion were
higher in all of the adapted Bayesian scaling scenarios than with the Mini-IPIP alone.
Mini-IPIP percentages of agreement for the five traits ranged from a low of 74
(Agreeableness) to a high of 85 (Neuroticism) with a median of 81. Corresponding
percentages for the .95 stop rule ranged from a low of 89 (Conscientiousness) to a
high of 92 (Openness) with a median of 91. With a stop rule of .75, the smallest
number of additional items, the percentages of agreement ranged from 85
(Agreeableness and Conscientiousness) to 87 (Extraversion, Neuroticism, and
Openness) with a median of 87.
Kappa coefficients when item presentation continued until there was a probability
of .95 ranged from .811 to .865 with a median of .851, a marked difference between
the corresponding range and median for the Mini-IPIP (.575 to .744 with a median of
.673). The stop point of .75 resulted in Kappa coefficients ranging from .735 to .788
with a median of .775, all higher than the corresponding Mini-IPIP coefficients.
Results of the study were supportive of a positive answer to the first research
question. All scenarios resulted in a higher level of accuracy of prediction and higher
Kappa coefficients with IPIP-50 classification categories when items were added to
the Mini-IPIP using adaptive Bayesian scaling.

Additional Items Required With Various Bayesian Scenarios


The additional extent of agreement with the adaptive Bayesian scaling came, of
course, with a cost; more than 20 items were typically required to reach the stop rule.
The number of additional items required in the stop rule scenarios is displayed in
Table 4 with the number of participants reaching the stop rule, the number of items
required to reach that stop rule, and the mean number of items required with each
stop rule for each trait. To illustrate, the first line of Table 4 indicates that on the
Extraversion trait, with application of the most stringent stop rule (.95 posterior prob-
ability), 88 of the 243 participants needed only the four Mini-IPIP items to reach the
stop point, 133 participants reached the stop point with the Mini-IPIP and one addi-
tional item, 161 reached the stop point with the Mini-IPIP and two additional items,
and so forth. The total number of items required to reach the .95 posterior probability
level on the Extraversion trait was a mean of only 6.2.
The sum of the mean item total for each trait displayed in Table 4 with the .95 stop
rule was 32.5, as compared to 20 items on the Mini-IPIP and 50 items on the IPIP-50,
and the corresponding total using the .75 stop rule was only 22.9. To reach the .95
posterior probability level on each trait, most required more than the four Mini-IPIP
items (64% to 83%), but approximately one-half (43% to 55%) of the participants
reached that level with no more than five items. With the .75 posterior probability
level as the stop rule, approximately three fourths (72% to 78%) stopped with the four
Mini-IPIP items, and a large majority (83% to 91%) stopped with a total of five items

Table 4. Distribution of Participants by Items Required to Reach Five Posterior Probability Stop Points.

                    n requiring  n requiring  n requiring  n requiring  n requiring  n requiring  Mean items
                    <5 items     <6 items     <7 items     <8 items     <9 items     <10 items    required

E  .95 stop rule    88           133          161          179          187          193          6.1
   .90 stop rule    117          170          189          203          211          219          5.4
   .85 stop rule    154          192          210          218          227          233          4.9
   .80 stop rule    171          207          221          233          238          240          4.6
   .75 stop rule    183          220          235          239          240          243          4.4
A  .95 stop rule    40           103          117          149          155          164          7.0
   .90 stop rule    119          133          150          178          195          196          6.0
   .85 stop rule    136          148          168          194          202          209          5.6
   .80 stop rule    144          155          201          215          220          228          5.2
   .75 stop rule    177          201          217          228          229          236          4.7
C  .95 stop rule    82           123          139          159          165          175          6.4
   .90 stop rule    110          153          170          187          194          201          5.4
   .85 stop rule    126          174          186          198          204          210          5.3
   .80 stop rule    150          186          200          211          216          221          4.9
   .75 stop rule    182          211          217          225          226          228          4.7
N  .95 stop rule    72           105          139          161          171          184          6.6
   .90 stop rule    112          160          174          191          209          218          5.7
   .85 stop rule    154          185          207          219          230          234          5.0
   .80 stop rule    178          208          229          236          238          241          4.6
   .75 stop rule    190          216          233          239          243          244          4.5
O  .95 stop rule    60           126          151          167          183          197          6.4
   .90 stop rule    109          172          192          199          207          217          5.5
   .85 stop rule    148          189          206          213          221          226          5.1
   .80 stop rule    163          205          217          225          229          232          4.8
   .75 stop rule    178          215          225          234          238          240          4.6

Note. E = Extraversion, n = 243; A = Agreeableness, n = 242; C = Conscientiousness, n = 233; N = Neuroticism, n = 246; O = Openness, n = 246.


or fewer. The median number of items required for the .75 posterior probability level
was four on all traits.
For the second research question, the number of additional items required and
impact on classification accuracy, the .95 stop rule produced the highest percentage
of correspondence and Kappa coefficient (median 91%, median Kappa = .851) and
required a mean of 12.5 additional items. Using the lowest stop point, .75 posterior
probability, there was a reduction in both percentage of correspondence (median of
81%) and Kappa coefficient (median = .775) but with a corresponding reduction in
the number of items required, a mean of 2.9 additional items. Application of stop
rules of .95, .90, .85, .80, and .75 resulted in estimated total number of items of 32.5,
28, 25.9, 24.1, and 22.9, respectively, as compared to the 20 items required for the
Mini-IPIP.

Reliability Estimates
Reliability estimates are displayed in Table 5. Calculation of coefficient alpha relia-
bility estimates was enabled because the item order was fixed, so, for example, any
participant who reached the stop rule with only six items responded to the same six
items as any other participant who stopped after six items. The data in Table 5 show
considerable variation across traits, item counts, and stop rules. In
general, the data suggest that, as expected, the magnitude of the alpha reliability
estimate increases with each additional item, and there is a gradual decrease in the
magnitude of the alpha coefficient with lower levels of the stop rule. There are special
challenges in estimating the influence of the Bayesian scaling on general
scale reliability, in part because of limitations inherent in the scenarios and
also because the number of items varies by participant and by stop rule. Thus, the
coefficient alpha reliability estimates in Table 5 can only be preliminary, subject to
confirmation with additional studies and alternative techniques for estimating scale
reliability.
With the caveat noted, the information in Table 5 may be instructive in regard to
the probable impact of Bayesian scaling scenarios on scale reliability as compared
with reliability of the four items per trait in the Mini-IPIP alone. The coefficient alpha
reliability estimates for the Mini-IPIP, four items for each trait, with all participants
in the validation sample were .747, .685, .662, .717, and .649 for the Extraversion,
Agreeableness, Conscientiousness, Neuroticism, and Openness traits, respectively.
With participants limited to those with a mid-level stop point (posterior probability of
.85), the corresponding alpha coefficients with the four items were .806, .713, .773,
.801, and .737, respectively.
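The alpha estimates reported at each fixed item count follow the standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores). A generic, minimal implementation is sketched below; this is not code from the study, and the names are illustrative.

```python
def cronbach_alpha(scores):
    """Coefficient alpha for a respondents-by-items matrix of item scores.

    Because the item order was fixed, every respondent reaching a given
    item count answered the same items, so each row has the same length.
    """
    k = len(scores[0])  # number of items

    def var(xs):  # sample variance with n - 1 denominator
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_variances = [var([row[i] for row in scores]) for i in range(k)]
    total_variance = var([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

When every item ranks respondents identically, the formula returns 1.0; as item responses diverge, alpha drops toward (and can fall below) zero.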
Most relevant for the research question comparing estimates of reliability from
the Mini-IPIP with reliability estimates from the Bayesian scaling would be scenarios
in which a participant being assessed with the Bayesian model would require one or
more additional items to meet the predetermined stop rule. Kruyen et al. (2013) sug-
gested calculating a Relative 95% CI, defined as the ratio of a 95% confidence
interval to the length of the scale, to indicate measurement precision when compar-
ing instruments of different length. In this procedure, the higher the Relative 95%
CI, the less precise the measurement.

Table 5. Coefficient Alpha Reliability Estimates With and Without Bayesian Scaling.

                 4 items (n)  5 items (n)  6 items (n)  7 items (n)  8 items (n)  9 items (n)  10 items (n)

E  Mini-IPIP     .747 (243)
   .95 stop rule .856 (88)   .855 (133)  .867 (161)  .877 (179)  .885 (187)  .894 (193)  .893 (195)
   .90 stop rule .819 (117)  .836 (170)  .859 (189)  .871 (203)  .878 (211)  .881 (219)  .885 (220)
   .85 stop rule .806 (154)  .823 (192)  .849 (210)  .871 (218)  .875 (227)  .879 (233)  .882 (233)
   .80 stop rule .788 (171)  .820 (207)  .844 (221)  .863 (233)  .876 (238)  .873 (240)  .880 (240)
   .75 stop rule .761 (183)  .816 (220)  .842 (235)  .862 (239)  .874 (240)  .873 (243)  .879 (243)
A  Mini-IPIP     .685 (242)
   .95 stop rule .501 (40)   .807 (103)  .838 (117)  .776 (149)  .807 (155)  .819 (164)  .815 (172)
   .90 stop rule .735 (119)  .770 (133)  .796 (150)  .767 (178)  .795 (195)  .803 (196)  .800 (207)
   .85 stop rule .713 (136)  .753 (148)  .792 (168)  .759 (194)  .777 (202)  .792 (209)  .787 (217)
   .80 stop rule .693 (144)  .745 (155)  .785 (201)  .753 (215)  .777 (220)  .784 (228)  .779 (232)
   .75 stop rule .701 (177)  .726 (201)  .779 (217)  .748 (228)  .773 (229)  .783 (236)  .771 (238)
C  Mini-IPIP     .662 (233)
   .95 stop rule .865 (82)   .845 (123)  .850 (139)  .825 (159)  .849 (165)  .844 (175)  .850 (178)
   .90 stop rule .788 (110)  .812 (153)  .812 (170)  .811 (187)  .816 (194)  .824 (201)  .838 (209)
   .85 stop rule .773 (126)  .792 (174)  .791 (186)  .797 (198)  .807 (204)  .812 (210)  .828 (215)
   .80 stop rule .749 (150)  .769 (186)  .781 (200)  .795 (211)  .795 (216)  .806 (221)  .821 (223)
   .75 stop rule .702 (182)  .772 (211)  .774 (217)  .790 (225)  .790 (226)  .801 (228)  .816 (230)
N  Mini-IPIP     .717 (246)
   .95 stop rule .855 (72)   .851 (105)  .878 (139)  .885 (161)  .897 (171)  .908 (184)  .907 (193)
   .90 stop rule .814 (112)  .833 (160)  .856 (174)  .880 (191)  .889 (209)  .899 (218)  .899 (219)
   .85 stop rule .801 (154)  .823 (185)  .848 (207)  .876 (219)  .883 (230)  .895 (234)  .895 (236)
   .80 stop rule .744 (178)  .820 (208)  .848 (229)  .873 (236)  .883 (238)  .895 (241)  .895 (243)
   .75 stop rule .742 (190)  .815 (216)  .840 (233)  .864 (239)  .878 (243)  .889 (244)  .893 (244)
O  Mini-IPIP     .649 (246)
   .95 stop rule .890 (60)   .834 (126)  .816 (151)  .848 (167)  .859 (183)  .855 (197)  .850 (199)
   .90 stop rule .787 (109)  .800 (172)  .806 (192)  .824 (199)  .837 (207)  .845 (217)  .836 (219)
   .85 stop rule .737 (148)  .789 (189)  .793 (206)  .820 (213)  .826 (221)  .841 (216)  .833 (219)
   .80 stop rule .711 (163)  .772 (205)  .776 (217)  .809 (225)  .823 (229)  .838 (232)  .828 (235)
   .75 stop rule .701 (178)  .760 (215)  .772 (225)  .807 (234)  .818 (238)  .834 (240)  .825 (241)

Note. E = Extraversion; A = Agreeableness; C = Conscientiousness; N = Neuroticism; O = Openness.
To illustrate, the Relative 95% CI for Extraversion based on the scores on the
Mini-IPIP was .502. Approximately three fourths of the participants needed fewer
than six items to reach the .85 stop rule; the Relative 95% CI for the Extraversion
score with this group was .417. For the Conscientiousness, Neuroticism, and
Openness traits, the comparable Relative 95% CIs for the Bayesian scaling with five
items were .421, .414, and .387, respectively, as compared to corresponding Mini-
IPIP ratios of .523, .507, and .507. On the Agreeableness trait, a total of seven items
was required for three fourths of the participants to reach the .85 stop point. The
Relative 95% CI ratio was .303, as compared to the Mini-IPIP ratio of .401.
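One plausible reading of that definition builds the interval from the standard error of measurement and divides its width by the scale's raw score range. The sketch below reflects that reading; whether Kruyen et al. (2013) operationalize the denominator exactly this way is an assumption here, and the function name is hypothetical.

```python
import math


def relative_ci95(alpha, sd, score_range):
    """Relative 95% CI: width of a 95% confidence interval around an
    observed score (built from the standard error of measurement),
    divided by the scale's raw score range. Higher = less precise.
    """
    sem = sd * math.sqrt(1 - alpha)  # standard error of measurement
    ci_width = 2 * 1.96 * sem        # full width of the 95% interval
    return ci_width / score_range
```

For instance, with alpha = .75, a standard deviation of 4 raw-score points, and a 16-point score range, the ratio works out to .49.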
The overall impression regarding the third research question, the impact of adaptive
Bayesian scaling on instrument reliability, is that higher levels of internal consistency
will be evident with the adaptive Bayesian scaling as a result of the additional test
item(s). The added value of this Bayesian approach is that not all participants are
required to respond to additional items.

Discussion
Limitations to Generalizability
There certainly are limitations in this study that could influence the generalizability
of the results. Of particular concern is that these data came from Mini-IPIP scores
constructed from IPIP-50 responses, which may or may not reflect participant response
patterns in an actual administration of the Mini-IPIP or administration of additional items
with a stop rule. In the standard administration of the Mini-IPIP, a participant
responds to an item for one trait, followed by an item for another trait, and so forth,
cycling through the item pool until there have been responses to four items for each
of the Big Five personality traits. The item responses in this study came instead from
responding to an item for one trait, followed by an item for another trait, cycling
through the items until there were responses to 10 items for each of the personality
traits, with the Mini-IPIP scores constructed from only 4 of the 10 items. An influence
not controlled in this design is the possibility that participants remembered their
responses to prior items. With the Bayesian scaling in this design, it was theoretically
possible for a participant to cycle through four items for each of the core traits
and then be asked to respond to six consecutive items associated with only one of
the traits. Although real data were used in this study, the results are, in effect, a
simulation of what might be expected with a standard administration of the Mini-IPIP
or with an instrument designed specifically for adaptive Bayesian scaling. Another
potential limitation is that these results are based on responses by university students,
mostly female, all attending a university in the United States, and all participating
as part of a required subject-pool research assignment. While the descriptive data,
particularly mean scores on the Mini-IPIP and IPIP-50, appeared remarkably similar
to those obtained in other studies and with participants from countries other than the
United States, the extent to which the responses, normative data, and item likelihoods
would generalize to other settings is unknown.

Examples of Needs for Further Study


Evident in the identified limitations is a need for future studies in which Mini-IPIP
scores are not constructed from IPIP-50 responses, and additional items are presented
with an instrument created for adaptive Bayesian scaling. Additional investigation is
also needed with samples from diverse populations to determine the extent to which
the item likelihoods and normative data in this study can be generalized. The 750
individual item likelihoods used in this study are available from the author.
Milojev et al. (2013) make a strong case that when measuring relatively enduring
personality characteristics, test-retest reliability is a key psychometric property, and
they further suggest the value of a different application of Bayesian probability as a
tool to assess that property. Test-retest studies using an instrument designed with
adaptive Bayesian scaling are needed for direct rather than inferred estimates of the
reliability of the results with Bayesian adaptive scoring.
The concept of the International Personality Item Pool as a scientific collaboration
offers still another potential use of this Bayesian scaling approach. The levels of cer-
tainty associated with the category assignments in this study were constrained by the
size of the item pool, a maximum of 10 items for each trait. Shared data sets from
administrations of the 100-item version of the IPIP could be used to generate item
likelihoods for additional items, reducing the number of outcomes that fall short of
the preferred stop point.
Additional investigation could also compare the results obtained with this
Bayesian adaptive scaling model with results from other adaptive scaling techniques.
For example, the simple countdown method originally proposed for the MMPI
(Butcher et al., 1985) could be modified for use with the IPIP-50, with item
presentation stopping when no possible responses to the items remaining for the
trait could change the normative category placement.
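A minimal sketch of such a modified countdown rule, assuming a simple cut-score categorization, is shown below. The cut scores, response range, and function names are hypothetical, not drawn from the MMPI procedure or the IPIP-50.

```python
def countdown_stop(responses, n_items, min_r, max_r, cutoffs):
    """True once no possible responses to the remaining items could
    change the normative category placement.

    responses    -- responses observed so far, one per administered item
    n_items      -- total number of items on the trait scale
    min_r, max_r -- smallest and largest possible item response values
    cutoffs      -- (low_cut, high_cut): totals below low_cut are 'low',
                    totals at or above high_cut are 'high', else 'medium'
    """
    low_cut, high_cut = cutoffs

    def category(total):
        if total < low_cut:
            return "low"
        return "high" if total >= high_cut else "medium"

    remaining = n_items - len(responses)
    lowest = sum(responses) + remaining * min_r   # most extreme low finish
    highest = sum(responses) + remaining * max_r  # most extreme high finish
    return category(lowest) == category(highest)
```

Once the best- and worst-case completions of the scale land in the same category, further items are uninformative for category placement and presentation can stop.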
The intent of this study was direct assessment of the utility of a particular
Bayesian model for adaptive scaling with intentional focus on probabilities associ-
ated with category assignment rather than latent traits. Other researchers may want
to assess the extent to which an IRT approach could enhance measurement precision
with short personality tests and to estimate the number of additional items likely to
be required.

Summary and Conclusions


With limitations and the need for additional studies acknowledged, creating shorter
versions of existing personality tests is an evident trend (Kruyen et al., 2013;
Milojev et al., 2013), sometimes with the intent that the savings in time could reduce
the extent of invalid responses caused by participant frustration (Schmidt et al., 2003). An
example for measurement of the Big Five personality traits is the 20-item Mini-IPIP
(Donnellan et al., 2006) developed by selecting items from the 50-item International
Personality Item Pool five-factor model (IPIP-50; Goldberg, 1999).
While the Mini-IPIP has demonstrated generally satisfactory levels of measure-
ment quality, the substantial reduction in the number of items has brought a correspond-
ing reduction in the alpha reliability coefficients that may also have limited the
external validity of the scale (Baldasaro et al., 2013; Milojev et al., 2013). The cur-
rent study explored the possibility that application of Bayes' theorem could
moderate the negative consequences of shortening the scale, specifically examining
the extent to which predictions of normative categories on the full IPIP-50 instrument
would be enhanced and identifying the number of additional items required to
attain that enhancement.
While this Bayesian application would be appropriate only in circumstances where
assignment to a normative category would be sufficient to answer the research or
clinical question, arguably this often is the case in our use of personality test results.
In those instances, the results of this study appear supportive of the viability of this
Bayesian approach.
Each of the scenarios provided somewhat higher levels of prediction of IPIP-50
categories in comparison to the Mini-IPIP alone. The additional items do appear to
enhance the reliability of the results, and the number of additional items required is
substantially less than would be required in administering the full IPIP-50 scale.
Obviously this Bayesian approach adds a layer of complexity to simply summing
item responses to produce a trait score. Whether the additional complexity is war-
ranted by the outcome would appear to be an "it depends" question, contingent on
the objectives of the researcher and/or the clinician. One advantage of this adaptive
approach is that this technique provides the user not only with a predicted normative
category but also with an empirical probability statement about the level of certainty
associated with that prediction. A researcher, for example, investigating whether the
Conscientiousness trait is associated with performance on a cognitive measure could
limit the statistical analysis to the participants whose normative category assignment
had a substantial posterior probability.
A key to this adaptive approach is that the number of items administered to each
participant is contingent on the prior responses. Thus, while a comparable level of
reliability to that found with the Bayesian scaling most likely could have been
achieved by simply adding items to the Mini-IPIP, for example, five or six rather
than four items per trait, all participants would then have to respond to the additional
items. In contrast, contingent on the level of certainty desired by the researcher
or clinician, many participants would reach an acceptable level of certainty in norma-
tive category placement with only four items, and additional items would be admi-
nistered only as needed by individual participants.


Declaration of Conflicting Interests


The author(s) declared no potential conflicts of interest with respect to the research, authorship,
and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of
this article.

References
Baldasaro, R. E., Shanahan, M. J., & Bauer, D. J. (2013). Psychometric properties of the Mini-
IPIP in a large, nationally representative sample of young adults. Journal of Personality
Assessment, 95, 74-84.
Butcher, J. N., Keller, L. S., & Bacon, S. F. (1985). Current developments and future directions
in computerized personality assessment. Journal of Consulting and Clinical Psychology,
53, 803-815.
Chernyshenko, O. S., Stark, S., Chan, K.-Y., Drasgow, F., & Williams, B. (2001). Fitting item
response theory models to two personality inventories: Issues and insights. Multivariate
Behavioral Research, 36, 523-562.
Choi, S. W., Grady, M. W., & Dodd, B. G. (2011). A new stopping rule for computerized
adaptive testing. Educational and Psychological Measurement, 71, 37-53.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20, 37-46.
Cooper, A. J., Smillie, L. D., & Corr, P. J. (2010). A confirmatory factor analysis of the Mini-
IPIP five-factor model personality scale. Personality and Individual Differences, 48,
688-691.
Crede, M., Harms, P., Niehorster, S., & Gaye-Valentine, A. (2012). An evaluation of the
consequences of using short measures of the big five personality traits. Journal of
Personality and Social Psychology, 102, 874-888.
Donnellan, M., Oswald, F., Baird, B., & Lucas, R. (2006). The Mini-IPIP scales: Tiny-yet-
effective measures of the Big Five factors of personality. Psychological Assessment, 18,
192-203.
Forbey, J. D., & Ben-Porath, Y. S. (2007). Computerized adaptive personality testing: A
review and illustration with the MMPI-2 computerized adaptive version. Psychological
Assessment, 19, 14-24.
Goldberg, L. R. (1999). A broad-bandwidth, public-domain, personality inventory measuring
the lower-level facets of several five-factor models. In I. Mervielde, I. J. Deary, F. De
Fruyt, & F. Ostendorf (Eds.), Personality psychology in Europe (Vol. 7, pp. 7-28). Tilburg,
Netherlands: Tilburg University Press.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., &
Gough, H. G. (2006). The International Personality Item Pool and the future of public-
domain personality measures. Journal of Research in Personality, 40, 84-96.
Gosling, S. D., Rentfrow, P. J., & Swann, W. B. (2003). A very brief measure of the Big-Five
personality domains. Journal of Research in Personality, 37, 504-528.
Jones, W. P. (1993). Real-data simulation of computerized adaptive Bayesian scaling.
Measurement and Evaluation in Counseling and Development, 26, 143-151.


Kruyen, P. M., Emons, W. H. M., & Sijtsma, K. (2013). On the shortcomings of shortened
tests: A literature review. International Journal of Testing, 13, 223-248.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical
data. Biometrics, 33, 159-174.
McGrath, R. E., Mitchell, M., Kim, B. H., & Hough, L. (2010). Evidence for response bias as a
source of error variance in applied assessment. Psychological Bulletin, 136, 450-470.
McIntire, S. A., & Miller, L. A. (2007). Foundations of psychological testing: A practical
approach (2nd ed.). Thousand Oaks, CA: Sage.
Milojev, P., Osborne, D., Greaves, L. M., Barlow, F. K., & Sibley, C. G. (2013). The Mini-
IPIP6: Tiny yet highly stable markers of Big Six personality. Journal of Research in
Personality, 47, 936-944.
Munoz, S. R., & Bangdiwala, S. I. (1997). Interpretation of Kappa and B statistics measures of
agreement. Journal of Applied Statistics, 24, 105-111.
Phillips, L. D. (1973). Bayesian statistics for social scientists. New York, NY: Crowell.
Saucier, G. (1994). Mini-markers: A brief version of Goldberg's unipolar Big-Five markers.
Journal of Personality Assessment, 63, 506-516.
Schmidt, F. L., Le, H., & Ilies, R. (2003). Beyond alpha: An empirical examination of the
effects of different sources of measurement error on reliability estimates for measures of
individual differences constructs. Psychological Methods, 8, 206-224.
Siefert, C. J., Stein, M., Sinclair, S. J., Antonius, D., Shiva, A., & Blais, M. A. (2012).
Development and initial validation of a scale for detecting inconsistent responding on the
Personality Assessment InventoryShort Form. Journal of Personality Assessment, 94,
601-606.
Viera, A. J., & Garrett, J. M. (2005). Understanding interobserver agreement: The Kappa
statistic. Family Medicine, 37, 360-363.
