
THE INTERNATIONAL JOURNAL FOR THE PSYCHOLOGY OF RELIGION, 17(2), 157-178. Copyright 2007, Lawrence Erlbaum Associates, Inc.

RESEARCH

An Item Response Theory Analysis of the Spiritual Assessment Inventory


Todd W. Hall
Rosemead School of Psychology Biola University

Steven P. Reise
Department of Psychology University of California, Los Angeles

Mark G. Haviland
Department of Psychiatry Loma Linda University

Item response theory (IRT) was applied to evaluate the psychometric properties of the Spiritual Assessment Inventory (SAI; Hall & Edwards, 1996, 2002). The SAI is a 49-item self-report questionnaire designed to assess five aspects of spirituality: Awareness of God, Disappointment (with God), Grandiosity (excessive self-importance), Realistic Acceptance (of God), and Instability (in one's relationship to God). IRT analysis revealed that (a) for several scales, two or three items per scale carry the psychometric workload and (b) measurement precision is peaked for all five scales, such that one end of the scale, and not the other, is measured precisely. We considered how sample homogeneity and the possible quasi-continuous nature of the SAI constructs may have affected our results and, in light of this, made suggestions for SAI revisions, as well as for measuring spirituality in general.
Correspondence should be sent to Todd W. Hall, Ph.D., Rosemead School of Psychology, Biola University, 13800 Biola Avenue, La Mirada, CA 90639. E-mail: todd.hall@biola.edu


In large-scale aptitude testing, item response theory (IRT) is now the dominant psychometric theory underlying scale development and analysis (Embretson & Reise, 2000). IRT also has been applied to personality and psychopathology measurement (Reise & Haviland, 2005; Reise & Waller, 2003). With one exception, Gomez and Fisher's (2005) IRT analysis of the Spiritual Well-Being Questionnaire, IRT methods have not been used in the development and evaluation of religiousness and spirituality (RS) constructs. Herein, we report IRT analyses of the Spiritual Assessment Inventory (SAI; Hall & Edwards, 1996, 2002) to illustrate further the benefits of IRT in exploring the psychometric properties of RS measures. We will not review all of the technical details of IRT methods or compare IRT to traditional classical test theory (CTT; for this, see Reise, 2004; Reise & Henson, 2003). As we proceed through the analyses and results, however, several major differences between IRT and CTT will be evident.

THE SPIRITUAL ASSESSMENT INVENTORY

The SAI (Hall & Edwards, 1996, 2002) is a theoretically based measure of spiritual development designed for use by both clinicians and researchers. The overall framework for this measure is the notion of relationship, which integrates theistic and relational psychological perspectives (i.e., attachment and object relations) of personality. The SAI consists of five subscales: Awareness of God (AOG), Disappointment (DIS; with God), Grandiosity (GRA; excessive self-importance), Realistic Acceptance (RA; of God), and Instability (INS; in one's relationship to God). The first measures a person's tendency to experience God's presence and communications. The remaining four assess the developmental maturity of one's patterns of relationship, or internal working model, with respect to God. More specifically, DIS taps an individual's level of anger, frustration, and disappointment with God (e.g., "There are times when I feel frustrated with God"); GRA, the degree to which individuals view their relationship with God as unique or special (e.g., "God recognizes that I am more spiritual than most people"); RA, how well people reconcile their relationship with God following disappointment (e.g., "When I feel betrayed by God, I still want our relationship to continue"); and INS, individuals' concerns about the stability of their relationships with God (e.g., "I feel I have to please God or he might reject me"). Three factor analytic studies have been conducted, which have led to a revised version of the SAI (Hall & Edwards, 1996, 2002; Hall, Edwards, & Slater, 2003). Each subscale has demonstrated good internal consistency reliability (.73 to .95). The five-factor structure (excluding an experimental impression management scale presently in development) has been corroborated by a recent confirmatory factor analysis (Hall & Edwards, 2002). Moreover, Hall,

Edwards, and Slater (2003) found that the factor structure of the SAI was stable, even when three other similar measures of spirituality and the SAI were factor analyzed together. Construct validity was explored by correlating various SAI scores with other self-report measures, such as the Bell Object Relations Inventory (Bell, 1991), the Intrinsic/Extrinsic-Revised (Gorsuch & McPherson, 1989), the Spiritual Well-Being Scale (Ellison, 1983), and the Narcissistic Personality Inventory (Emmons, 1984, 1987). The SAI showed incremental validity over the Spiritual Well-Being Scale and the Intrinsic/Extrinsic-Revised in predicting object relations development (Hall & Edwards, 2002). Moreover, the SAI and object relations made independent contributions to the prediction of psychological adjustment (Hall, Edwards, & Hall, 2000). After multiple revisions and factor analyses of the SAI with different samples, the evidence showed that the factor structure is very stable and that the scales reliably measure the constructs they are intended to measure. It is important to note that the SAI met or exceeded the acceptability criteria (theoretical basis, sample representativeness/generalization, reliability, and validity) that Hill (2005) has established for religion and spirituality measures. As a next step, we investigated the item and scale properties with IRT methods. Our goal was to determine the extent to which the SAI scales provide adequate discrimination among individuals across the construct range. In the discussion, we suggest ways in which the SAI may be improved and comment on the more general advantages of an IRT approach to spirituality assessment.
METHOD

Participants

As part of a larger project on college students' spiritual development, several measures of religiousness and spirituality, including the SAI (Hall & Edwards, 1996, 2002), were administered to a sample of 1,024 undergraduates (M age = 20.5 years) from nine Christian liberal arts colleges and universities: Asbury College, Azusa Pacific University, Bethel (Minnesota), Biola University, Bluffton College, Eastern College, Eastern Nazarene College, Greenville College, and Taylor University. Predominant in the sample were Euro-Americans and women. The largest proportion of participants had nondenominational church affiliations (14.5%). The other participants were affiliated with a wide range of Protestant denominations, 2.8% were Catholic, and 1.9% were religiously unaffiliated.

Measure

The SAI is a 49-item self-report measure designed to assess five constructs: AOG (19 items), DIS (7 items), GRA (7 items), RA (7 items), and INS (9 items). (See

Hall & Edwards, 2002, for all item content.) For the most part, the SAI was administered as an in-class exercise in a variety of general education, introductory psychology, and educational psychology classes. Each item is rated on a five-point scale: 1 (not at all true), 2 (slightly true), 3 (moderately true), 4 (substantially true), and 5 (very true). The descriptive item statistics for each scale (means, standard deviations, and item-test correlations) are shown in the first three columns of Table 1. Scale raw score descriptive statistics (means, standard deviations, and skewness) are shown in Table 2. It is not surprising for this sample of college students attending Christian schools that the SAI scales tend to be skewed due to ceiling (AOG, RA) or floor
TABLE 1
Descriptive Statistics and Item Parameter Estimates

Item      M     SD     r    Slope    b1     b2     b3     b4

Awareness of God
A1       3.3   1.1   .72    2.46   1.98   0.84   0.10   1.02
A2       3.6   1.0   .77    2.92   2.14   1.01   0.10   0.90
A3       3.7   1.0   .78    3.01   2.30   1.33   0.36   0.72
A4       3.3   1.0   .72    2.43   2.28   0.98   0.14   1.27
A5       3.9   1.0   .71    2.31   2.79   1.66   0.91   0.56
A6       3.8   1.0   .79    3.19   2.23   1.21   0.37   0.60
A7       3.9   1.0   .75    2.75   2.55   1.40   0.42   0.58
A8       3.6   1.0   .74    2.56   2.44   1.27   0.13   1.04
A9       4.0   1.0   .78    3.12   2.48   1.58   0.49   0.35
A10      3.8   1.0   .72    2.31   2.66   1.50   0.48   0.76
A11      4.1   0.9   .75    2.61   2.78   1.70   0.87   0.22
A12      4.1   1.0   .74    2.65   2.53   1.71   0.63   0.24
A13      3.6   1.0   .67    1.94   2.58   1.30   0.23   1.20
A14      3.9   0.9   .71    2.43   2.75   1.65   0.61   0.61
A15      3.9   0.9   .70    2.15   3.02   1.74   0.68   0.64
A16      4.0   0.9   .67    1.97   3.16   1.93   0.82   0.33
A17      4.3   0.9   .70    2.55   2.78   2.10   1.14   0.10
A18      3.7   1.1   .67    2.01   2.57   1.30   0.31   0.66
A19      4.1   0.9   .68    2.15   2.59   1.91   0.95   0.19

Disappointment
D1       2.6   1.2   .79    3.54   0.89   0.11   0.70   1.49
D2       2.7   1.3   .79    3.58   0.88   0.09   0.58   1.25
D3       3.2   1.2   .71    2.51   1.62   0.63   0.24   1.17
D4       2.7   1.1   .71    2.60   1.21   0.02   0.88   1.63
D5       2.2   1.2   .72    2.50   0.54   0.46   1.22   2.00
D6       1.9   1.1   .67    2.44   0.06   0.85   1.49   2.12
D7       3.0   1.2   .62    1.80   1.70   0.34   0.56   1.56

Grandiosity
G1       1.4   0.8   .58    2.88   0.49   1.46   2.04   2.68
G2       1.3   0.7   .50    2.57   1.04   1.68   2.23   2.60
G3       1.6   0.9   .51    1.78   0.43   1.28   2.39   3.43
G4       1.7   0.9   .54    1.94   0.14   1.21   2.21   2.89
G5       1.1   0.7   .43    2.11   1.54   2.01   2.50   2.86
G6       2.4   1.2   .40    1.08   1.19   0.14   1.65   3.03
G7       2.0   1.1   .33    0.88   0.45   0.90   2.67   4.54

Realistic Acceptance
R1       3.9   1.2   .66    2.50   1.69   1.33   0.58   0.14
R2       4.5   0.9   .68    3.57   2.01   1.75   1.13   0.49
R3       4.0   1.1   .71    2.71   1.83   1.37   0.70   0.25
R4       4.2   1.0   .66    2.75   2.34   1.66   0.89   0.08
R5       4.0   1.2   .55    1.88   1.82   1.55   0.94   0.04
R6       4.7   0.7   .62    2.84   2.59   2.21   1.57   0.82
R7       3.9   1.1   .67    2.51   2.00   1.32   0.55   0.33

Instability
I1       2.2   1.2   .67    2.63   0.47   0.49   1.21   1.95
I2       1.8   1.0   .59    1.90   0.09   0.90   1.76   2.63
I3       2.4   1.2   .60    1.88   1.00   0.16   0.95   1.89
I4       1.7   1.0   .55    1.44   0.18   1.34   2.05   2.90
I5       2.4   1.2   .51    1.54   1.17   0.20   1.07   2.28
I6       1.5   0.9   .52    1.52   0.71   1.52   2.24   3.12
I7       3.1   1.2   .33    0.74   3.40   0.94   0.83   2.69
I8       2.1   1.1   .43    0.93   0.57   0.90   2.39   3.77
I9       2.1   1.1   .46    1.03   1.11   0.81   2.10   3.49

TABLE 2
Scale Descriptives

                              Raw Score              Eigenvalues
Scale                       M       SD   Skewness     1      2    Ratio   GFI   RMSR
Awareness of God (19)     72.87   14.65    0.92     12.50   0.73  17.12   .99   .04
Disappointment (7)        18.36    6.75    0.52      4.95   0.53   9.34   .99   .04
Grandiosity (7)           16.26    4.03    1.42      3.84   0.86   4.46   .99   .06
Realistic Acceptance (7)  29.45    5.51    1.31      4.94   0.51   9.69   .99   .03
Instability (9)           19.64    6.45    0.78      4.15   1.03   4.03   .98   .07

Note. Number of items given in parentheses.

(DIS, GRA, INS) effects. Coefficient alpha reliabilities were AOG = .96, DIS = .91, GRA = .74, RA = .87, and INS = .82, which are well above conventional acceptability standards for scales of this length.
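The coefficient alpha computation reported here can be sketched directly; this is a generic illustration of Cronbach's alpha on made-up data, not the SAI responses:

```python
import numpy as np

def coefficient_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha for a persons-by-items response matrix."""
    n_items = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)        # per-item variances
    total_var = responses.sum(axis=1).var(ddof=1)    # variance of raw scale scores
    return n_items / (n_items - 1) * (1 - item_vars.sum() / total_var)

# Toy 5-person, 3-item example (hypothetical data)
data = np.array([[1, 2, 1],
                 [3, 3, 2],
                 [4, 4, 5],
                 [2, 2, 2],
                 [5, 4, 4]])
alpha = coefficient_alpha(data)   # high, because the items covary strongly
```

As the surrounding text notes, alpha rises with item intercorrelations and scale length, which is why the 19-item, highly redundant AOG scale reaches .96.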
Note: The DIS and RA scales are linked. For example, DIS Item 1, There are times when I feel irritated at God, is followed up by RA Item 1, When I feel this way, I put effort into restoring our relationship. If an individual responds 1 (not at all true) to a DIS item, the corresponding RA item is not scored. In turn, the RA scale is scored by taking the average of the scored RA items, which is the scale score. Using traditional psychometric procedures, this is necessary to place different people who respond to different subsets of items onto a comparable scale. In this study, however, which uses IRT, we simply treated nonresponse to an RA item (which occurs whenever a person scores 1 on a corresponding DIS item) as a missing response. This has no material effect, because under an IRT framework, it is not required that all people respond to the same item set.
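Under an IRT framework, the skip rule just described simply produces missing responses; under the traditional procedure, the RA score is the mean of whichever RA items were scored. That scoring rule can be sketched for a single respondent (the data below are invented; only the pairing and skip logic follow the note above):

```python
import numpy as np

# Hypothetical responses for one person on the 7 paired DIS/RA items.
# np.nan marks an RA item left blank because the matching DIS response was 1.
dis = np.array([1, 3, 4, 1, 2, 5, 3])
ra = np.array([np.nan, 4, 5, np.nan, 3, 4, 5])

# An RA item is scored only when the paired DIS response exceeds 1 ...
scored = ra[dis > 1]
# ... and the scale score is the mean of the scored RA items.
ra_score = np.nanmean(scored)
```

Averaging over only the scored items is what places people who answered different subsets of RA items on a comparable raw-score metric.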

PROCEDURE AND RESULTS

The primary objective of IRT modeling is to develop, for each scale item, a model of the relation between individual differences on a latent variable (e.g., AOG), denoted by θ, and the probability of responding in a particular item category. This person parameter, θ, represents individual differences on a latent trait or construct (e.g., AOG) and is assumed to cause differences between individuals in item response behavior. In IRT applications, the θ scale has an arbitrary metric. To identify the scale (i.e., to create one that is easily interpreted), the latent trait is conventionally fixed to a mean of zero and a variance of one; hence, the scale by which individual differences are represented can be regarded as a Z-score scale.

Given that most spirituality measures use multipoint responses, we review the characteristics of an IRT model appropriate for this format: the Graded Response Model (Samejima, 1969, 1996). In the Graded Response Model, the parameters for a set of threshold response curves (TRCs) are estimated for each scale item. For an item with five response options, such as those on the SAI scales, four TRCs are estimated. Four TRCs must be estimated because there are four boundaries, or thresholds, between the k = 5 response options (1 vs. 2, 3, 4, 5; 1, 2 vs. 3, 4, 5; 1, 2, 3 vs. 4, 5; and 1, 2, 3, 4 vs. 5). A TRC is a curve that describes how the probability of responding above a particular category increases as a function of the latent trait. The parameters of a TRC have the following definitions: for each item (i), there will be four threshold parameter estimates (called bj, where j = 1 . . . 4) and one item discrimination parameter estimate (called a). The bj parameters control the location of the TRCs along the latent trait continuum, and the a parameter defines their steepness. More discriminating items have larger a parameters.

To illustrate, Figure 1a shows the estimated TRCs for Item 1 on the AOG scale (I experience an awareness of God speaking to me personally). The estimated item parameters are a = 2.46, b1 = 1.98, b2 = 0.84, b3 = 0.10, and b4 = 1.02. These curves describe the probability of responding above a particular category as a function of the latent trait.

FIGURE 1a Threshold Response Curves for AOG 1 (I experience an awareness of God speaking to me personally).

FIGURE 1b Category Response Curves for AOG 1 (I experience an awareness of God speaking to me personally).
Note that the b parameters reflect how much trait standing is needed to have a .50 probability of responding above a particular category (or threshold).
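The graded response model can be made concrete with a short sketch that computes threshold and category probabilities. The parameter values below are loosely based on AOG 1; the negative signs on the two lower thresholds are our assumption (chosen to match a Z-score trait metric for an item whose mean sits above the scale midpoint), since signs are not reproduced above:

```python
import math

def trc(theta, a, b):
    """Threshold response curve: P(responding above threshold b) under the GRM."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def category_probs(theta, a, bs):
    """Category response curves: P(responding in each of the k + 1 ordered categories)."""
    # Boundary probabilities, padded with P* = 1 below category 1 and P* = 0 above category 5
    stars = [1.0] + [trc(theta, a, b) for b in bs] + [0.0]
    return [stars[j] - stars[j + 1] for j in range(len(bs) + 1)]

# Illustrative AOG 1-like parameters (threshold signs are assumed)
a, bs = 2.46, [-1.98, -0.84, 0.10, 1.02]

probs = category_probs(1.0, a, bs)   # a person 1 SD above the mean
# The five probabilities sum to 1; for this person the top category is the most likely,
# matching the description of Figure 1b.
```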

The TRCs shown in Figure 1a are difficult to interpret by themselves; however, they can be translated into more readily interpretable category response curves. The category response curves show the probability of endorsing an item in a particular category as a function of the latent trait. In Figure 1b, for example, we show the category response curves for Item 1 on the AOG scale. This graph shows precisely how the probability of responding in a particular category (1, 2, and so forth) changes as a function of the latent trait (i.e., AOG). Clearly, only individuals who are extremely low on the latent trait are likely to respond in category 1, whereas any individual around a standard deviation above the mean will be most likely to respond in the highest category, which is 5.

IRT models make several fundamental assumptions about item response data, so we must first determine whether the data meet these assumptions. The first is monotonicity: for a polytomous response item, monotonicity means that as trait levels increase, it becomes more likely that individuals will respond in a higher category. We evaluated this assumption by computing rest-score graphs using the MSP software program (Molenaar & Sijtsma, 2000). A rest-score is a raw scale score minus the item score, which reflects how the individual scored on the rest of the scale. For each scale item, rest-score graphs show the rest score (i.e., raw score groupings with sufficient sample sizes, minus the item score) on the X-axis and the proportion responding in a particular category on the Y-axis. Our investigation revealed no consequential violations of monotonicity. The rest-score curve analyses, however, showed that most SAI items were good, whereas a few were problematic. In Figures 2a and 2b, we display the rest-score graphs for items DIS 3 (There are times when I feel frustrated with God) and INS 7 (When I sin, I tend to withdraw from God), respectively. In the figures are four lines, one for each threshold between the response categories.

The area above the top line represents the proportion of people who respond 1 as a function of the rest score. The area between the top line and the line below it is the proportion of people who respond 2, and so forth. Figures 2a and 2b illustrate psychometrically sound items. These items are well functioning because response proportions increase across the rest-score continuum for each category, and the category boundaries are spread out; in other words, the categories are distinguishing among individuals at different trait levels. In contrast, Figures 2c and 2d illustrate what relatively poorly functioning items look like. Figure 2c is AOG 17 (I am aware of God attending to me in times of need) and 2d is GRA 5 (Manipulating God seems to be the best way to get what I want). For these items, the response categories do not differentiate among individuals. In AOG 17, almost all individuals in this sample responded in the highest category, so having multiple lower response options is of no value. In GRA 5, almost all individuals responded in the lowest category, so the higher response options serve no useful purpose.
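The rest-score computation behind these graphs can be sketched as follows. The data are hypothetical, and MSP's grouping of rest scores into sufficiently large bins is omitted for simplicity:

```python
import numpy as np

def rest_score_proportions(responses, item, category):
    """P(item response >= category) within each rest-score group.

    `responses` is a persons-by-items matrix; `item` is a column index.
    Returns {rest_score: proportion at or above `category`}.
    """
    rest = responses.sum(axis=1) - responses[:, item]   # score on the rest of the scale
    out = {}
    for r in np.unique(rest):
        mask = rest == r
        out[int(r)] = float((responses[mask, item] >= category).mean())
    return out

# Hypothetical 6-person, 3-item data; monotonicity predicts that the
# proportions rise (or at least do not fall) as the rest score increases.
data = np.array([[1, 1, 2],
                 [2, 1, 1],
                 [3, 2, 3],
                 [3, 3, 2],
                 [4, 5, 4],
                 [5, 4, 5]])
props = rest_score_proportions(data, item=0, category=3)
```

Plotting one such curve per threshold (categories 2 through 5) produces the four lines shown in each of Figures 2a through 2d.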

FIGURE 2a Rest-score Curves for DIS 3 (There are times when I feel frustrated with God).

FIGURE 2b Rest-score Curves for INS 7 (When I sin, I tend to withdraw from God).

A second assumption that must be met prior to applying IRT models is unidimensionality: that one and only one latent variable (i.e., factor) is needed to explain the item covariances. Given the very specific and narrow content of the SAI scales and the relatively small number of items on four of the five scales, unidimensionality is almost certain. We nevertheless tested the unidimensionality of each scale by performing factor analyses on the polychoric correlations for each of the five subscales using MicroFact 2.0 (Waller, 2000).
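One common screen for unidimensionality, used below, is the ratio of the first to second eigenvalues of the item correlation matrix, with roughly 3:1 or greater taken as evidence of a strong common dimension. A rough sketch, with a Pearson matrix standing in for the polychoric matrix that would normally be computed for ordinal items:

```python
import numpy as np

def first_to_second_ratio(corr):
    """Ratio of the 1st to 2nd eigenvalues of an item correlation matrix."""
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]   # descending order
    return eigvals[0] / eigvals[1]

# Toy matrix for four items that all intercorrelate .70, mimicking the
# highly redundant item content of a narrow-bandwidth subscale
corr = np.array([[1.0, 0.7, 0.7, 0.7],
                 [0.7, 1.0, 0.7, 0.7],
                 [0.7, 0.7, 1.0, 0.7],
                 [0.7, 0.7, 0.7, 1.0]])
ratio = first_to_second_ratio(corr)   # (1 + 3*0.7) / (1 - 0.7), well above 3:1
```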

FIGURE 2c Rest-score Curves for AOG 17 (I am aware of God attending to me in times of need).

FIGURE 2d Rest-score Curves for GRA 5 (Manipulating God seems to be the best way to get what I want).

Although numerous criteria for evaluating unidimensionality have been proposed (see Hattie, 1985), there is no clear standard for evaluating whether an item set is sufficiently unidimensional for IRT application. A common procedure is to conduct a factor analysis of polychoric or tetrachoric correlations (Reise & Waller, 2001), and one generally looks for (a) a large ratio of the 1st to 2nd eigenvalues (e.g., 3 to 1 or greater), which indicates a strong common dimension among the items; (b) all items loading highly on a single common factor (e.g., greater than .40); and, most important, (c) small residuals after extracting one common factor (i.e., the first dimension accounts for a high percentage of the item covariance). The ratios of the first-to-second eigenvalues, goodness-of-fit statistics, and root mean squared residual results are shown in Table 2. These results provide strong support for the unidimensionality of all scales (enough support for application of IRT models). Such results are expected for narrow-bandwidth measures, like the SAI subscales, which are comprised of items with similar content. This built-in item redundancy is a common practice in measures designed under a CTT framework, because the goal is to increase coefficient alpha reliability. The best way to increase coefficient alpha is to have highly correlated items, achieved by writing items with highly similar content.

Fitting the IRT Model

Item parameters for the graded response model were estimated using Parscale (Muraki & Bock, 1997). Table 1 shows the estimated item parameters for the SAI items for each factor. To interpret this table, note again that the discrimination parameters reflect the steepness (slope) of the TRCs within an item. An item with a larger discrimination parameter does a better job of distinguishing among individuals high and low on the construct. Discrimination parameters are highly related to an item's average correlation with the other items on the scale and thus tend to be very large in narrow-bandwidth measures where the items are highly intercorrelated. From the item discrimination parameters, we learned two important things. First, many of the values are exceptionally high. By typical IRT standards, discriminations above 2.0 would be considered very large (almost all AOG discriminations are in this very high range).
Again, and unfortunately, this does not mean that these are exceptionally good items; rather, they merely reflect the high item intercorrelations and the small conceptual gap between the item content and the latent trait. This is a common finding with narrow-band measures, such as those tapped by the SAI. Second, for the DIS, GRA, and INS scales, the discriminations have a steep staircase pattern; that is, one or two items have very high discriminations, whereas the remaining items have substantially lower discriminations. This means that for these scales, once the one or two items most central to the trait are administered, the other items contribute relatively little unique measurement information. In short, it is possible that these constructs could be measured with the two or three items carrying the psychometric workload.

Next, we evaluated the threshold parameters in Table 1. Recall that a threshold marks the location on the latent trait at which an examinee has a .50 probability of responding above a given category and that the latent trait has a Z-score-like interpretation. The psychometric purpose of a polytomous response item is to spread the threshold parameters across the latent trait range. In turn, such an item allows good discrimination among examinees regardless of their position on the latent trait. Optimally, an item will have thresholds within reasonable boundaries (e.g., between -2 and 2), so that the response categories are not too extreme and attract sufficient responses. Moreover, we would hope that the item threshold parameters are not highly similar across items, which would suggest redundancy. In interpreting the threshold parameters, one must keep in mind that for negatively oriented subscales, such as GRA, INS, and DIS, a low score is positive, and thus the first between-category threshold parameter (b1) reveals the difficulty level at the positive end of the trait continuum. For the positively oriented subscales, AOG and RA, the fourth between-category threshold (b4) provides this information. These parameters show the trait level necessary to have a .50 probability of obtaining the most positive (in terms of spirituality) score on the 5-point scale.

There are several interesting and noteworthy findings regarding the threshold parameters on the SAI scales. For the positively oriented scales, the b1 parameters are very low, suggesting that few individuals will respond in the lowest category. Relatedly, for the positive scales, even examinees with low and modest trait levels are likely to endorse the highest category on some of the items. This is most evident on the AOG scale. Six AOG items have b4 parameters below .50. For example, on item AOG 17 (I am aware of God responding to me in a variety of ways), the b4 parameter is -0.10, suggesting that even individuals 0.10 standard deviations below the mean have a high probability of responding to this item in the highest category. Five AOG items, however, have b4 parameters at or above .90, suggesting that individuals need to be about 1 standard deviation above the average trait level to have a high probability of endorsing the highest category. This same phenomenon also is evident on the RA scale, where an item like RA 6 has a b4 parameter of 0.82. On the negative scales, the b4 parameters are all very high, suggesting that a person would have to be very extreme to respond to those items in the highest response category. Moreover, the b1 parameters for the negative scales are not extremely low. This suggests that a large proportion of individuals is expected to respond in the lowest category (the highest in terms of spirituality) on these scales. For example, the means of the b1 parameters for GRA, DIS, and INS are -.29, -.97, and -.44, respectively. In other words, examinees who are below -.29, -.97, and -.44 standard deviations on the latent trait metric are expected to respond in the lowest response category.

Item and Scale Information

In traditional psychometrics, scale precision is defined by the reliability coefficient and the standard error of measurement, and both indices are assumed constant for all individuals. In IRT, on the other hand, an item's and a scale's measurement precision is evaluated with information curves. Information indexes an item's ability to differentiate people at different trait levels. Item information is determined by two item parameters. First, the amount of psychometric information is determined by the item's discrimination parameter: higher discrimination leads to more psychometric information. Second, where an item provides discrimination is determined by the threshold parameters; for each item, the information tends to be peaked around the threshold parameters. If the threshold parameters are bunched close together, an item will have a peaked information function. Conversely, if the item threshold parameters are spread out, the information will be spread more evenly across the trait range.

To illustrate item information, consider the item information curves (IICs) shown in Figures 3a and 3b. The IIC in Figure 3a is for the highly discriminating item DIS 2 (There are times when I feel angry at God), which has an a = 3.58, b1 = 0.88, b2 = 0.09, b3 = 0.58, and b4 = 1.25. The IIC in Figure 3b is for the relatively poorly discriminating item DIS 7 (There are times when I feel frustrated by God for not responding to my prayers), which has an a = 1.80, b1 = 1.70, b2 = 0.34, b3 = 0.56, and b4 = 1.56. Because information is relative to the squared discrimination value, DIS 2 provides roughly 4 times the information that DIS 7 does. For this reason, it would take four items such as DIS 7 to match the value provided by one item with the properties of DIS 2. In Figures 4a and 4b, we illustrate IICs for items with extremely skewed threshold parameters.

Figure 4a is the IIC for GRA 5 (Manipulating God seems to be the best way to get what I want), which has a = 2.11, b1 = 1.54, b2 = 2.01, b3 = 2.50, and b4 = 2.86. In other words, all threshold parameters are positive, so this item discriminates only among individuals in the higher trait range on this construct. In contrast, Figure 4b shows the IIC for RA 5 (When this happens my trust in God is not completely broken), which has a = 1.88, b1 = 1.82, b2 = 1.55, b3 = 0.94, and b4 = 0.04. Clearly, this item provides information only in the low trait range. As is evident in the previous figures, an IIC describes where on the latent trait continuum an item provides information; that is, where the item best differentiates among individuals. In IRT, item information is additive across items within a scale. Thus, we can aggregate the IICs within a scale to form a scale information curve. Scale information is very important because it has an inverse relationship with an individual's standard error of measurement. When information is high, standard errors will be low (i.e., precise measurement), and when information is low, standard errors will be high (i.e., poor measurement).
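The graded-response information function behind these comparisons, and the information-to-standard-error conversion, can be sketched as follows. The slopes are those reported for DIS 2 and DIS 7; the threshold signs are our assumption (the values as printed above omit signs):

```python
import math

def grm_item_information(theta, a, bs):
    """Samejima graded-response-model item information at trait level theta."""
    # Boundary probabilities P*, padded for the lowest and highest categories
    stars = [1.0] + [1 / (1 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    # Derivatives dP*/dtheta = a * P* * (1 - P*) for each boundary
    derivs = [0.0] + [a * p * (1 - p) for p in stars[1:-1]] + [0.0]
    info = 0.0
    for j in range(len(bs) + 1):                 # sum over the k + 1 categories
        p = stars[j] - stars[j + 1]              # category probability
        if p > 0:
            info += (derivs[j] - derivs[j + 1]) ** 2 / p
    return info

# DIS 2 vs. DIS 7 near the center of the trait range (threshold signs assumed)
info_dis2 = grm_item_information(0.0, 3.58, [-0.88, -0.09, 0.58, 1.25])
info_dis7 = grm_item_information(0.0, 1.80, [-1.70, 0.34, 0.56, 1.56])
# Information scales with a**2, so DIS 2 yields roughly (3.58/1.80)**2, about
# 4 times, the information of DIS 7, as the text describes.

# Scale information is the sum over items, and SE(theta) = 1 / sqrt(information)
se = 1.0 / math.sqrt(info_dis2 + info_dis7)
```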

FIGURE 3a Item Information Curve for DIS 2 (There are times when I feel angry at God).

FIGURE 3b Item Information Curve for DIS 7 (There are times when I feel frustrated by God for not responding to my prayers).

An individual's standard error of measurement, in fact, is equal to one divided by the square root of the scale information, conditional on any trait level value. With the possible exception of the AOG scale, the scale information for the SAI scales tends to be peaked rather than spread out across the trait continuum. This means that the scales are precise at only one end of the trait continuum. For example, Figure 5 displays the scale information curve for GRA. Clearly,

FIGURE 4a Item Information Curve for GRA 5 (Manipulating God seems to be the best way to get what I want).

FIGURE 4b Item Information Curve for RA 5 (When this happens, I still want our relationship to continue).

all the measurement precision is concentrated at the high end of the scale (i.e., the GRA scale differentiates only among people who tend to be grandiose, not among people who are below average or low on the construct). This is due, in part, to the floor effect on this scale, where individuals tend to score very low. Peaked information could result from poor item writing (i.e., items that are too extreme) or from poor response anchors. In this case, however, peaked information may have more to do with the quasi-continuous nature of the SAI constructs than with poor items and anchors, a point we consider later.

FIGURE 5 Scale Information Curve for Grandiosity (GRA).

DISCUSSION

RS measurement has a long history, and with advancements, the constructs have become more influential (Hill, 2005; Hill & Pargament, 2003; Hood & Belzen, 2005). In this report, we applied the graded response IRT model (Samejima, 1969, 1996) to further explore the psychometric properties of one such measure, the SAI. Our main objectives were to learn more about how the SAI items and subscales function and to identify ways that the SAI might be improved. In the following, we review the main study findings and comment on their implications for the SAI, in particular, and for the measurement of RS constructs, in general. It is important to note here that the calibration sample comprised undergraduate students attending Christian colleges and universities; thus, when we refer to the mean on the latent trait, for example, we are referring to a mean defined in this particular type of population. Recall that in IRT analyses, the scale for the latent trait is arbitrary and thus must be defined prior to estimating item parameters. We defined the latent trait metric to have a mean of zero and a standard deviation of 1.0 (which is customary), so that the latent trait metric could be interpreted like a Z-score. It is important to keep in mind how the metric was defined, because all results are relative to this metric. Moreover, sample homogeneity almost certainly affected our findings. It is quite possible that in a more diverse sample, some of the scales and items might not have displayed the ceiling and floor effects, clumping of category threshold parameters, or peaked information functions that we observed in our sample. For example, in the Gomez and Fisher (2005) IRT analysis of the Spiritual Well-Being Questionnaire, a more diverse and secular sample of high

school and university students was used. Although their ndings were similar to ours in that items tended to have large discriminations, their item threshold parameters appeared to be spread out a little more than ours. The Spiritual WellBeing Questionnaire, however, does not measure precisely the same constructs as does the SAI; moreover, the SAI was designed specically to study change and growth in Christian populations. Prior to applying the IRT model, we evaluated unidimensionality and explored monotonicity of category response using rest-score graphs. Results revealed that the SAI scales are highly univocaleach scale measures one and only one common trait. This is not surprising given the narrow-bandwidth nature of these measures and the highly homogeneous item content within each scale. More interesting, perhaps, was our analysis of rest-score curves. Recall that for a good item, the four rest-score curves for a ve-category item will be spread out. On the other hand, within an item, if the rest-score curves are clumped together, the response categories are likely not differentiating among individuals. We provided two graphical examples of SAI items for which this was the case (other items displayed this tendency, as well). This phenomenon clumping of thresholds in IRT terminologyoccurs because people at different trait levels cannot make the appropriate distinctions among the response options. Threshold clumping has several possible causes, such as item content extremity (e.g., I think about God all the time everyday of the year, which, of course, is not an SAI item) or a highly skewed latent trait. In this study, however, we suggest that it may have occurred in SAI items simply because certain distinctions make little sense for certain types of items. For example, for an item like AOG 17 (I am aware of God attending to me in times of need ), it is possible that people cannot reliably differentiate among moderately true, substantially true, or very true. 
Simply stated, these distinctions appear to have no meaning for items with this content; people are either aware of God's presence or they are not. If retained in a revision, these items could be either dichotomized or trichotomized, so that the response format better represents the actual response process. (Conducting focus groups before scales are finalized is especially helpful in heading off this problem.)
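The rest-score diagnostics described above can be sketched in a few lines of code. The following is a minimal illustration, entirely our own sketch (the function and variable names are ours, not from the SAI analyses, which used MSP5; Molenaar & Sijtsma, 2000), of how item-step rest-score curves might be computed for one polytomous item:

```python
import numpy as np

def rest_score_curves(responses, item, n_groups=5):
    """Compute rest-score curves for one polytomous item.

    responses: (n_persons, n_items) array of 0..4 category scores.
    Returns, for each rest-score group (low to high), the proportion
    of people responding at or above each category step k = 1..max,
    i.e., the item-step response functions.
    """
    responses = np.asarray(responses)
    # A person's rest score is the total score minus the studied item.
    rest = responses.sum(axis=1) - responses[:, item]
    # Split people into roughly equal-sized groups ordered by rest score.
    order = np.argsort(rest)
    groups = np.array_split(order, n_groups)
    max_cat = int(responses[:, item].max())
    curves = np.zeros((n_groups, max_cat))
    for g, idx in enumerate(groups):
        for k in range(1, max_cat + 1):
            curves[g, k - 1] = np.mean(responses[idx, item] >= k)
    return curves  # rows = rest-score groups, cols = item steps
```

For a well-functioning item, the columns of the returned matrix rise at clearly different rates across rest-score groups; columns that track one another closely signal the clumping described above.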

We recommend that other researchers consider rest-score curve analyses of their own surveys to identify when this phenomenon occurs or to verify that all response options are working as intended.

Fitting the IRT model revealed several interesting properties of the SAI scales. Most notably, the items tend to have very high discrimination parameters. Again, this is consistent with Gomez and Fisher's (2005) findings. Because high item discrimination leads to an item providing more psychometric information,


this generally is viewed as a good property. In the case of narrow-band measures, as here, these high discriminations likely result from having very homogeneous item content and very high item intercorrelations.

For the AOG scale, all items had large discriminations, and there was little variability in discrimination within this scale. Moreover, for this scale, the item threshold parameters are fairly consistent across items. This suggests that all the AOG items are nearly equivalent measures of the latent trait. For three of the scales (DIS, GRA, and INS), we found that one or two items had very large discriminations, whereas the discriminations of the other items were relatively low. This suggests that only a few items within a scale are most relevant to the construct and that the construct, perhaps, could be measured adequately with fewer items. These findings underscore how difficult it is to write good (discriminating) items, because the pool of relevant indicators is limited for these very specific constructs. In other words, once one or two relevant items are written, additional items provide considerably less information (i.e., less ability to distinguish among individuals) than the first items.

Finally, our analyses of item thresholds and our subsequent analyses of item and scale information revealed that, with the exception of AOG, the SAI scales tend to have very peaked information on one end of the trait continuum. In other words, one end of the construct is measured relatively well, whereas the other end is measured relatively poorly. One possible cause of this is that the items were either too easy or too hard for this population.
(Ceiling and floor effects commonly are seen on other spirituality measures, too; Gomez & Fisher, 2005; Slater, 1999; Slater, Hall, & Edwards, 2001.) For example, if we used several easy items on an IQ test (What is 3 + 2?), the information curve would be highly peaked at the low end, because items like this would not distinguish among individuals with high IQs (only among those with low IQs).
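The relation between threshold placement and the shape of the information curve can be made concrete. Below is a small sketch, our own construction (the parameter values are illustrative, not estimates from the SAI data), of item information under Samejima's (1969, 1996) graded response model, contrasting an item whose thresholds are clumped with one whose thresholds are spread out:

```python
import numpy as np

def grm_item_information(theta, a, b):
    """Fisher information for one graded-response-model item.

    theta: trait values at which to evaluate information;
    a: discrimination; b: ordered threshold (boundary) parameters.
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    # Boundary curves P*(k): probability of responding in category k or
    # above, padded with P* = 1 (lowest) and P* = 0 (above the highest).
    pstar = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    pstar = np.column_stack([np.ones_like(theta), pstar, np.zeros_like(theta)])
    probs = pstar[:, :-1] - pstar[:, 1:]      # category response probabilities
    dpstar = a * pstar * (1.0 - pstar)        # derivatives of boundary curves
    dprobs = dpstar[:, :-1] - dpstar[:, 1:]   # derivatives of category probs
    return np.sum(dprobs ** 2 / np.clip(probs, 1e-12, None), axis=1)

theta = np.linspace(-3, 3, 121)
# Clumped thresholds concentrate information in a narrow trait range ...
peaked = grm_item_information(theta, a=3.0, b=[-0.2, 0.0, 0.2, 0.4])
# ... whereas spread-out thresholds distribute it across the continuum.
spread = grm_item_information(theta, a=3.0, b=[-2.0, -0.7, 0.7, 2.0])
```

For the clumped item, nearly all of the information piles up near the thresholds and the tails are measured poorly, mirroring the peaked curves we observed; the spread item covers the low end of the continuum far better.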

Although a similar argument could be made here (i.e., that better items need to be written that are either easier or harder to endorse) to explain the peaked information on the SAI scales, we would argue otherwise. In fact, the peaked information is a surprising result, given that the SAI items have five response options, and more response options typically lead to information being spread more evenly across the trait range (e.g., as on the AOG scale). We suggest that the peaked information occurs here because the SAI dimensions are not full bipolar continuous trait dimensions. Rather, they are quasi-continuous traits (i.e., constructs defined on only one end) or, perhaps, latent types (either you have the trait or you do not), although future research is needed to clarify this. Whether a construct is fully dimensional, a quasi-dimension, or a type has important implications in terms of scaling individual differences and modeling change. For example, if some RS constructs are really latent types, then we should be considering latent


transition change models (see Collins & Sayer, 2001), not latent growth curve analysis, for investigating RS substantive hypotheses.

The Future of the SAI

First, and most important, based on this IRT analysis and earlier CTT analyses, researchers and clinicians can comfortably continue using the SAI in its present form. We base this judgment on the fact that our present analyses show that all the scales provide a reasonable degree of information and precision, albeit peaked, and that no truly bad items were found (i.e., items that did not belong on the scale to which they are assigned). On the other hand, our results do suggest some specific points to consider for the next revision. The first is at the scale level: the peaked measurement precision of each of the scales (item thresholds being close together). Possible causes include response categories not differentiating among people, the extremity of the sample (i.e., highly spiritual people), poorly worded or nonsensible response anchors, or the constructs, in reality, being latent types (e.g., either you are aware of God or you are not) rather than continua. (If it is the latter, no revision will solve the problem, and latent type, rather than latent trait, models should be pursued.) At the very least, consideration needs to be given to changing the response options (and, perhaps, anchoring labels). Also, explorations of how SAI items work in samples not drawn from religious institutions should be undertaken (however, the SAI and some related measures were not developed with such applications in mind).

The IRT results also suggest some ways to modify either specific scales or items within a scale. Our analyses show, for example, that the AOG items are redundant in terms of the psychometric information they provide. That is, they are all highly discriminating, and the threshold parameters are nearly equivalent across items.
What this suggests is that the AOG scale could be significantly shortened, to perhaps five items, with little meaningful loss of measurement precision at the high end of the scale. With respect to specific scale items, most troubling are the items within each scale that have much lower slopes than the best functioning items. Specifically, on DIS, two items have slopes of approximately 3.5, whereas DIS 7 (There are times when I feel frustrated with God for not responding to my prayers) has a relatively low slope of 1.8. Note the similarity to DIS 3 (There are times when I feel frustrated with God); DIS 7 adds only the additional complexity (multidimensionality) regarding prayers. Given the limited value of this item relative to other DIS items, DIS 7 will almost certainly be deleted in a future version. GRA 6 (My relationship with God is an extraordinary one that most people would not understand) and GRA 7 (I seem to have a unique ability to influence God through my prayers) have relatively low discrimination (slope)


and are apparently so extreme in content that few people respond in the highest categories. These two items also will be considered for deletion or rewording in future versions. Finally, on the INS scale, it is interesting that the item most pertinent to the scale title (INS 9, My emotional connection with God is unstable) is one of the poorer ones psychometrically. Perhaps the scale should be renamed (e.g., Abandonment or Rejection by God) to better match the content of the best functioning items (INS 1, When I sin, I am afraid of what God will do to me; INS 2, I feel I have to please God or he might reject me).

The Future of IRT in the RS Domain

In this study, IRT methods were used in their most basic form, that is, as a simple psychometric tool for examining SAI item and scale functioning. Nevertheless, we believe that even these basic analyses revealed some interesting psychometric features of this instrument and highlighted the potential challenges inherent in measuring some RS constructs, in particular, those that are narrow-band. It is not easy, we believe, to write items that spread information across the trait range or to write many items for certain constructs, all of which have large discriminations. Of course, whether this is generally true across most spirituality measures (atheistic, theistic, theistic-relational, and so forth) remains to be demonstrated. It most likely depends on the degree to which the construct is general or broad, as is the case with spirituality, or more specific or narrow.

IRT methods, however, are not only a tool for the psychometric analysis of scales. Because IRT models have certain properties that CTT methods do not, interesting applications are possible (see Reise, 2004; Reise & Henson, 2003).
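To give a concrete flavor of the machinery behind such applications, here is a minimal sketch, entirely our own construction (the function name, grid, and use of a two-parameter logistic model are illustrative assumptions, not part of the SAI analyses), of expected a posteriori (EAP) trait estimation, which can score a person from any calibrated subset of items:

```python
import numpy as np

def eap_score(responses, a, b, grid=np.linspace(-4.0, 4.0, 81)):
    """EAP latent-trait estimate under a two-parameter logistic model.

    responses: 0/1 answers to the administered items;
    a, b: calibrated discrimination and difficulty parameters for those
    same items. Because the parameters refer to a common latent-trait
    metric, any calibrated subset yields an estimate on that metric.
    """
    x, a, b = (np.asarray(v, dtype=float) for v in (responses, a, b))
    # Response probabilities at each grid point for each item.
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (grid[:, None] - b[None, :])))
    likelihood = np.prod(np.where(x[None, :] == 1.0, p, 1.0 - p), axis=1)
    posterior = likelihood * np.exp(-grid ** 2 / 2.0)  # standard normal prior
    return float(np.sum(grid * posterior) / np.sum(posterior))
```

Responses to, say, five items from one form and five from another can be fed to the same estimator, provided all ten are calibrated on the same metric; a CTT total score affords no such comparison.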
For example, because IRT item parameters have an invariance property, IRT facilitates the linking of different measures of the same construct onto a common scale (e.g., a large testing firm may link different measures of verbal ability onto the same scale so that examinees who took different tests can be compared). This linking of metrics could be an important future direction in the RS domain because of the large number of available measures that purport to measure the same or very similar constructs.

Also, traditional scale construction practice in the RS domain, as in other fields, is to develop specific fixed-length paper-and-pencil tests that reliably measure individual differences on a particular construct. When one changes the measure in CTT (e.g., by deleting an item), one changes what is being measured (because the true-score metric changes). IRT leads researchers away from fixed-length measures and toward the development of item banks (Flaugher, 2000). Item banks are the foundation for computerized adaptive testing (Wainer et al., 2000), in which subsets of items can be selected from a bank to tailor the properties of a scale to an individual's trait level and to the needs of the measurement situation. Because a person's score in IRT is an estimated latent


trait rather than a total score on a particular set of items, an individual's standing on a latent trait continuum can be estimated using responses to any subset of items that have been calibrated under an IRT model. This cannot be done within a CTT framework, because different subsets of items will not be strictly parallel, which is required to compare individuals on the same metric.

REFERENCES
Bell, M. (1991). An introduction to the Bell Object Relations and Reality Testing Inventory. Los Angeles: Western Psychological Services.
Collins, L. M., & Sayer, A. G. (Eds.). (2001). New methods for the analysis of change. Washington, DC: American Psychological Association.
Ellison, C. W. (1983). Spiritual well-being: Conceptualization and measurement. Journal of Psychology and Theology, 11, 330–340.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Emmons, R. A. (1984). Factor analysis and construct validity of the Narcissistic Personality Inventory. Journal of Personality Assessment, 48, 291–300.
Emmons, R. A. (1987). Narcissism: Theory and measurement. Journal of Personality and Social Psychology, 52, 11–17.
Flaugher, R. (2000). Item pools. In H. Wainer, N. J. Dorans, D. Eignor, R. Flaugher, B. F. Green, R. J. Mislevy, L. Steinberg, & D. Thissen (Eds.), Computerized adaptive testing: A primer (2nd ed., pp. 37–60). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Gomez, R., & Fisher, J. W. (2005). Item response theory analysis of the spiritual well-being questionnaire. Personality and Individual Differences, 38, 1107–1121.
Gorsuch, R. L., & McPherson, S. E. (1989). Intrinsic/extrinsic measurement: I/E-revised and single-item scales. Journal for the Scientific Study of Religion, 28, 348–354.
Hall, T. W., & Edwards, K. J. (1996). The initial development and factor analysis of the spiritual assessment inventory. Journal of Psychology and Theology, 24, 233–246.
Hall, T. W., & Edwards, K. J. (2002). The spiritual assessment inventory: A theistic model and measure for assessing spiritual development. Journal for the Scientific Study of Religion, 41, 341–357.
Hall, M. E. L., Edwards, K. J., & Hall, T. W. (2000, March). The role of spiritual development in the psychological and cross-cultural functioning of missionaries. Paper presented at the National Convention of the Christian Association for Psychological Studies, Tulsa, OK.
Hall, T. W., Edwards, K. J., & Slater, W. (2003). Relational spirituality: A psychometric analysis of four measures. Unpublished manuscript, Biola University, La Mirada, CA.
Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9, 139–164.
Hill, P. C. (2005). Measurement in the psychology of religion and spirituality: Current status and evaluation. In R. F. Paloutzian & C. L. Park (Eds.), Handbook of the psychology of religion and spirituality (pp. 43–61). New York: Guilford.
Hill, P. C., & Pargament, K. I. (2003). Advances in the conceptualization and measurement of religion and spirituality: Implications for physical and mental health research. American Psychologist, 58, 64–74.
Hood, R. W., Jr., & Belzen, J. A. (2005). Research methods in the psychology of religion. In R. F. Paloutzian & C. L. Park (Eds.), Handbook of the psychology of religion and spirituality (pp. 62–79). New York: Guilford.


Molenaar, I. W., & Sijtsma, K. (2000). User's manual MSP5 for Windows. Groningen, The Netherlands: iec ProGAMMA.
Muraki, E., & Bock, R. D. (1997). PARSCALE: IRT item analysis and test scoring for rating-scale data. Chicago: Scientific Software International.
Reise, S. P. (2004). Item response theory and its applications for cancer outcomes measurement. In J. Lipscomb, C. C. Gotay, & C. F. Snyder (Eds.), The cancer outcomes measurement working group (COMWG): An NCI initiative to improve the science of outcomes measurement in cancer (pp. 425–444). Boston, MA: Cambridge University Press.
Reise, S. P., & Haviland, M. G. (2005). Item response theory and the measurement of clinical change. Journal of Personality Assessment, 84, 228–238.
Reise, S. P., & Henson, J. M. (2003). A discussion of modern versus traditional psychometrics as applied to personality assessment scales. Journal of Personality Assessment, 81, 93–103.
Reise, S. P., & Waller, N. G. (2001). Dichotomous IRT models. In F. Drasgow & N. Schmitt (Eds.), Advances in measurement and data analysis (pp. 88–122). Williamsburg, VA: Jossey-Bass.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, No. 17.
Samejima, F. (1996). The graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory. New York: Springer.
Slater, W. E. (1999, August). Defining and measuring spiritual health: An investigation of the psychometric properties of measures of spiritual well-being and spiritual maturity. Paper presented at the 107th Annual Meeting of the American Psychological Association, Boston, MA.
Slater, W. E., Hall, T. W., & Edwards, K. J. (2001). Measuring religion and spirituality: Where are we and where are we going? Journal of Psychology and Theology, 29, 322–333.
Wainer, H., Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., et al. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Waller, N. G. (2000). A microcomputer factor analysis program for ordered polytomous data and mainframe size problems [Computer software]. St. Paul, MN: Assessment Systems Incorporated.
