

The International Journal of Educational and Psychological Assessment September 2012, Vol. 11(2)

An Introduction to Differential Item Functioning

Hossein Karami

University of Tehran, Iran


Abstract

Differential Item Functioning (DIF) has been increasingly applied in fairness studies in psychometric circles. Judicious application of this methodology by researchers, however, requires an understanding of the technical complexities involved. This has become an impediment, especially for non-mathematically oriented researchers. This paper is an attempt to bridge the gap. It provides a non-technical introduction to the fundamental concepts involved in DIF analysis. In addition, an introductory-level explanation of a number of the most frequently applied DIF detection techniques is offered. These include Logistic Regression, Mantel-Haenszel, Standardization, Item Response Theory, and the Rasch model. For each method, a number of the relevant software packages are also introduced.

Key words: Differential Item Functioning, Validity, Fairness

Introduction

Differential Item Functioning (DIF) occurs when two groups of equal ability levels are not equally able to correctly answer an item. In other words, one group does not have an equal chance of getting an item right even though its members have ability levels comparable to the other group. If the factor leading to DIF is not part of the construct being tested, then the test is biased. DIF analysis has been increasingly applied in psychometric circles for detecting bias at the item level (Zumbo, 1999). Language testing researchers have also followed suit and have exploited DIF analysis in their fairness studies. They have conducted a plethora of research studies to investigate the existence of bias in their tests. These studies have focused on such factors as gender (e.g., Ryan & Bachman, 1992; Karami, 2011; Takala & Kaftandjieva, 2000), language background (Chen & Henning, 1985; Brown, 1999; Elder, 1996; Kim, 2001; Ryan & Bachman, 1992), and academic background or content knowledge (Alderson & Urquhart, 1985; Hale, 1988; Karami, 2010; Pae, 2004). Despite the widespread application of DIF analysis in psychometric circles, however, it seems that the inherent complexity of the concepts in DIF analysis has hampered its wider application among less mathematically oriented researchers. This paper is an attempt to bridge this gap by providing a non-technical introduction to the fundamental concepts in DIF analysis. The paper begins with an overview of the basic concepts involved. Then, a brief overview of the development of fairness studies and DIF analyses during the last century follows. The paper ends with a detailed, though non-technical, explanation of a number of the most widely used DIF detection techniques. For each technique, a number of the most widely used software packages are also introduced. A few studies applying the relevant techniques are also


listed. Neither the list of the software nor the studies cited are meant to be exhaustive. Rather, these are intended to orient the reader.

Differential Item Functioning

Differential Item Functioning (DIF) occurs when examinees with the same ability level but from two different groups have different probabilities of endorsing an item (Clauser & Mazor, 1998). It is synonymous with statistical bias, where one or more parameters of the statistical model are under- or overestimated (Camilli, 2006; Wiberg, 2007). Whenever DIF is present in an item, the source(s) of this variance should be investigated to ensure that it is not a case of bias. Any item flagged as showing DIF is biased if, and only if, the source of variance is irrelevant to the construct being measured by the test. In other words, it is a source of construct-irrelevant variance, and the groups perform differentially on an item because of a grouping factor (Messick, 1989, 1994). There are at least two groups, i.e. focal and reference groups, in any DIF study. The focal group, a group of minorities for example, is the potentially disadvantaged group. The group which is considered to be potentially advantaged by the test is called the reference group. Note, however, that naming the groups is not always clear-cut; in such cases, the labels are assigned arbitrarily. There are two types of DIF, namely uniform and non-uniform DIF. Uniform DIF occurs when a group performs better than another group at all ability levels. That is, almost all members of a group outperform almost all members of the other group who are at the same ability levels. In the case of non-uniform DIF, members of one group are favored up to a point on the ability scale, and from that point on the relationship is reversed. That is, there is an interaction between grouping and ability level. As stated earlier, DIF occurs when two groups of the same ability levels have different chances of endorsing an item. Thus, a criterion is needed for matching the examinees for ability. The process is called conditioning, and the criterion is dubbed the matching criterion. Matching is of two types: internal and external. In the case of internal matching, the criterion is the observed or latent score of the test itself. For external matching, the observed or latent score of another test is considered as the criterion. External matching can become problematic because in such cases the assumption is that the supplementary test itself is free of bias and that it is testing the same construct as the test of focus (McNamara & Roever, 2006). DIF is not, in itself, evidence of bias in the test. It is evidence of bias if, and only if, the factor causing DIF is irrelevant to the construct underlying the test. If that factor is part of the construct, it is called impact rather than bias. The decision as to whether the real source of DIF in an item is part of the construct being gauged is largely subjective. Usually, a panel of experts is consulted to give more validity to the interpretations.
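To summarize the uniform/non-uniform distinction in symbols (a paraphrase of the definitions above, not a formula taken from the sources cited), let $P_R(\theta)$ and $P_F(\theta)$ denote the probability of a correct response at ability level $\theta$ for the reference and focal groups, respectively:

$$\text{Uniform DIF: } P_R(\theta) - P_F(\theta) \text{ keeps the same sign for all } \theta; \qquad \text{Non-uniform DIF: } P_R(\theta) - P_F(\theta) \text{ changes sign as } \theta \text{ increases.}$$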



The Development of DIF

The origins of bias analysis can be traced back to the early twentieth century (McNamara & Roever, 2006). At the time, researchers were concerned with developing tests that measured raw intelligence. A number of studies conducted at the time, however, showed that the socio-economic status of the test takers was a confounding variable. Thus, they aimed to partial out this variance by purging items that functioned differently for examinees with high and low socio-economic status. In the 1960s, the focus of bias studies shifted from intelligence tests to areas where social equity was a major concern (Angoff, 1993). The role of fairness in tests became highlighted. A variety of techniques were developed for detecting bias in tests. There was a problem with all these bias-detection techniques: they all required performance on a criterion test. Criterion measures could not be obtained until tests were in use, however, making test-bias detection procedures inapplicable (Scheuneman & Bleistein, 1989, p. 256). Consequently, researchers turned to devising a plethora of item-level detection procedures. The Golden Rule case, initiated in 1976, was a landmark in bias studies because legal issues entered the scene (McNamara & Roever, 2006). The Golden Rule Insurance Company filed a suit against the Educational Testing Service and the Illinois Department of Insurance due to alleged bias against Black examinees in the tests they developed. The court issued a verdict in favor of the Golden Rule Insurance Company. ETS was considered liable for the tests it developed and was legally ordered to make every effort to rule out bias in its tests. The main point about the settlement was the fact that bias analysis turned out to be a legal issue, as test developing agencies were legally held responsible for the consequences of their tests. The case also highlighted the significance of Samuel Messick's (1980, 1989) work, which emphasized the consequential aspects of tests in his validation framework. A number of researchers (e.g., Linn & Drasgow, 1987) opposed the verdict, emphasizing that simply discarding items showing DIF may render the test invalid by making it less representative of the construct measured by the test. Another reason they put forward was that the items may reflect true differences between the test takers, so the test may mirror real-world ability differences. The proponents of the settlement, however, argued that there is no reason to believe that there are ability differences between the test takers simply because they are from different test taking groups. Thus, any observed difference in the performance of, say, Black and White examinees, is cause for concern and a source of construct-irrelevant variance in Messick's terminology. Since then, a number of techniques have been developed to detect differentially functioning items.

DIF, Validity, and Fairness

The primary concern in test development and test use, as Bachman (1990) suggests, is demonstrating that the interpretations and uses we make of test scores


are valid. Moreover, a test needs to be fair for different test takers. In other words, the test should not be biased with regard to test taker characteristics, e.g. males vs. females, blacks vs. whites, etc. Examining such an issue requires, at the least, a statistical approach to test analysis which is able to determine, initially, whether the test items are functioning differentially across test taking groups and, finally, to detect the sources of this variance (Geranpayeh & Kunnan, 2007). One of the approaches suggested for such purposes is DIF. Studying the differential performance of different test taking groups is essential in test development and test use procedures. If the sources of DIF are irrelevant to the construct being measured by the test, they are a source of bias and the validity of the test is under question. The higher the stakes of the test, the more serious the consequences of test use are. With high-stakes tests, it is incumbent upon the test users to ensure that their test is free of bias and that the interpretations made of the test scores are valid. Test fairness analysis and the search for test bias are closely interwoven. In fact, they are two sides of the same coin: whenever a test is biased, it is not fair, and vice versa. The search for fairness has gained new impetus during the last two decades, mainly due to advances within Critical Language Testing (CLT). The proponents of CLT believe that all uses of language tests are politically motivated. Tests are, as they suggest, means of manipulating society and imposing the will of the system on individuals (see Shohamy, 2001). DIF analysis provides only a partial answer to fairness issues. It focuses only on the differential performance of two groups on an item. Therefore, whenever no groupings are involved in a test, DIF is not applicable. However, when groupings are involved, the possibility exists that the items are favoring one group. If this happens, then the test may not be fair for the disfavored group. Thus, DIF analysis should be applied in such contexts to obviate the problem.

DIF Methodology

McNamara and Roever (2006, p. 93) have classified methods of DIF detection into four categories:

1. Analyses based on item difficulty (e.g. the transformed item difficulty index (TID) or delta plot).
2. Nonparametric methods. These methods make use of contingency tables and chi-square statistics.
3. Item-response-theory-based approaches, which include 1-, 2-, and 3-parameter logistic models.
4. Other approaches. These methods have not been developed primarily for DIF detection but can be utilized for this purpose. They include multifaceted Rasch measurement and generalizability theory.

Despite the diversity of techniques, only a limited number of them appear to be in current use. DIF detection techniques based on difficulty indices are not common. Although they are conceptually simple and their application does not require understanding complicated mathematical formulas, they face certain


problems, including the fact that they assume equal discrimination across all items and that there is no matching for ability (McNamara & Roever, 2006). If the first assumption is not met, the results can be misleading (Angoff, 1993; Scheuneman & Bleistein, 1989). When an item has a high discrimination level, it shows large differences between the groups. On the other hand, differences between the groups will not be significant for an item with low discrimination. As indicated above, the DIF indices based upon item difficulty are not common. Thus, they will not be discussed here. (For a detailed account of DIF detection methods, both traditional and modern, see the following: Kamata & Vaughn, 2004; Scheuneman & Bleistein, 1989; Wiberg, 2007). In the next sections, a general discussion of the most frequently used DIF detection methods will be presented.

Logistic Regression

Logistic regression, first proposed for DIF detection by Swaminathan and Rogers (1990), is basically used when we have one or more independent variables, which are most of the time continuous, and a binary or dichotomous dependent variable (Pampel, 2000; Swaminathan & Rogers, 1990; Zumbo, 1999). In applying logistic regression to DIF detection, one attempts to see whether item performance, a wrong or right answer, can be predicted from total scores alone, from total scores plus group membership, or from total scores, group membership, and the interaction between them. The procedure can be presented formulaically as follows:

$$\ln\!\left(\frac{P_{mi}}{1 - P_{mi}}\right) = \beta_0 + \beta_1\theta + \beta_2 G + \beta_3(\theta G)$$

In the formula, $\beta_0$ is the intercept, $\beta_1$ is the effect of the conditioning variable $\theta$, which is usually the total score on the test, $\beta_2$ is the effect of the grouping variable $G$, and finally $\beta_3(\theta G)$ is the ability-by-grouping interaction effect. If the conditioning variable alone is enough to predict the item performance, with relatively small residuals, then no DIF is present. If group membership, $\beta_2$, adds to the precision of the prediction, uniform DIF is detected. That is, one group performs better than the other group, and this is a case of uniform DIF. Finally, if, in addition to total scores and grouping, an interaction effect, signified by $\beta_3$ in the formula, is also needed for a more precise prediction of the item responses, it is a case of non-uniform DIF (Zumbo, 1999). Also, note that the formula is based on the logit, $\ln\!\left(\frac{P_{mi}}{1 - P_{mi}}\right)$, where $P_{mi}$ is the probability of a correct answer to item $i$ by person $m$ and $1 - P_{mi}$ is the probability of a wrong response. In simple words, it is the natural logarithm of the odds of success over the odds of failure. Identifying DIF through logistic regression is similar to step-wise regression in that successive models are built up, entering a new variable at each step to see whether the new model is an improvement over the previous one due to the presence of the new variable.



As such, logistic regression involves three successive steps:

1. The conditioning variable, or the total score, is entered into the model.
2. The grouping variable is added.
3. The interaction term is also entered.

As a test of the significance of DIF, the chi-square value of step 1 is subtracted from the chi-square value of step 3. This is an overall index of the significance of DIF. The chi-square value of step 2 can be subtracted from that of step 3 to provide a significance test of non-uniform DIF. In addition, comparing the chi-square values of steps 1 and 2 provides a test of uniform DIF. Zumbo (1999) argued that logistic regression has three main advantages over other DIF detection techniques in that one:

- need not categorize a continuous criterion variable,
- can model both uniform and non-uniform DIF, and
- can generalize the binary logistic regression model for use with ordinal item scores. (p. 23)

Also, Wiberg (2007) noted that logistic regression and the Mantel-Haenszel statistic (explained in a later section) have gained particular attention because they can be utilized for detecting DIF in small sample sizes. For example, Zumbo (1999) pointed out that 200 people per group are needed. This is not a remarkable sample size compared to that required by other models, such as the three-parameter IRT model, which requires over 1000 test takers per group. McNamara and Roever (2006) also stated that "Logistic regression is useful because it allows modeling of uniform and non-uniform DIF, is nonparametric, can be applied to dichotomous and rated items, and requires less complicated computing than IRT-based analysis" (p. 116).
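To make the model-comparison logic concrete, the following is a minimal Python sketch (not code from any of the studies cited here) using simulated data and the statsmodels and scipy libraries; the variable names and data-generating values are hypothetical. It fits the three nested models listed above and compares them with chi-square difference tests.

```python
# A minimal, hypothetical sketch of logistic-regression DIF detection via nested
# model comparison (simulated data; not code from the studies cited in this paper).
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 400
total = rng.normal(0, 1, n)            # matching criterion, e.g. total test score
group = rng.integers(0, 2, n)          # 0 = reference group, 1 = focal group
# Simulate one dichotomous item with some uniform DIF against the focal group
p = 1 / (1 + np.exp(-(1.2 * total - 0.6 * group)))
item = rng.binomial(1, p)              # 1 = correct, 0 = incorrect

def fit(exog):
    """Fit a binary logistic regression of the item response on the predictors."""
    return sm.Logit(item, sm.add_constant(exog)).fit(disp=0)

m1 = fit(total)                                           # step 1: total score only
m2 = fit(np.column_stack([total, group]))                 # step 2: add group membership
m3 = fit(np.column_stack([total, group, total * group]))  # step 3: add the interaction

# Chi-square difference = 2 * (difference in log-likelihoods) of the nested models
chi2_overall = 2 * (m3.llf - m1.llf)   # overall DIF test (2 df)
chi2_uniform = 2 * (m2.llf - m1.llf)   # uniform DIF test (1 df)
chi2_nonunif = 2 * (m3.llf - m2.llf)   # non-uniform DIF test (1 df)
for chi2, df in [(chi2_overall, 2), (chi2_uniform, 1), (chi2_nonunif, 1)]:
    print(round(chi2, 2), round(stats.chi2.sf(chi2, df), 4))
```

Dedicated packages such as lordif and difR, mentioned below, implement refinements of this basic procedure.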
There are a number of software packages for DIF analysis using logistic regression. LORDIF (Choi, Gibbons, & Crane, 2011) conducts DIF analysis for dichotomous and polytomous items using both ordinal logistic regression and IRT. In addition, SPSS can also be used for DIF analysis through both the MH and logistic regression procedures. Zumbo (1999) and Kamata and Vaughn (2004) provide examples of such

analyses. Magis, Béland, Tuerlinckx, and De Boeck (2010) have also introduced an R package for DIF detection, called difR, that can apply nine DIF detection techniques, including logistic regression and the MH. A number of studies have applied logistic regression for DIF detection. Shabani (2008) utilized logistic regression to analyze a version of the University of Tehran English Proficiency Test (UTEPT) for the presence of DIF due to gender differences. Kim (2001) conducted a DIF analysis of the polytomously scored speaking items in the SPEAK test (the Speaking Proficiency English Assessment Kit), a test developed by the Educational Testing Service. The participants were divided into two groups: East Asian and European. He utilized the IRT likelihood ratio test and logistic regression to detect the differentially functioning items. Davidson (2004) investigated the


comparability of the performances of non-aboriginal and aboriginal students. Lee, Breland, and Muraki (2004) examined the comparability of computer-based testing (CBT) writing prompts in the Test of English as a Foreign Language (TOEFL) for examinees of different native language backgrounds, with a focus on European (German, French, and Spanish) and East Asian (Chinese, Japanese, and Korean) native language groups as reference and focal groups, respectively.

Standardization

The idea here is to compute the difference between the proportions of test takers from the focal and reference groups who answer the item correctly at each score level. More weight is attached to score levels with more test takers (McNamara & Roever, 2006). The procedure can be presented formulaically as (Clauser & Mazor, 1998):

$$\mathrm{STD\text{-}P\text{-}DIF} = \sum_{s} w_s \left(P_{fs} - P_{rs}\right)$$

where $w_s$ is the relative frequency of focal group members at score level $s$, $P_{fs}$ is the proportion of the focal group at score level $s$ correctly responding to the item, and $P_{rs}$ is the proportion of reference group members scoring $s$ who endorse the item.
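The following is a small illustrative Python sketch of the standardized p-difference just defined; the function name and input format are my own, and a real application would involve the larger samples and score-grouping decisions discussed below.

```python
# Illustrative sketch of the standardized p-difference (STD-P-DIF); the function
# name and the input format are hypothetical.
import numpy as np

def std_p_dif(score, group, item):
    """score: matching score; group: 'F' (focal) or 'R' (reference); item: 0/1 response."""
    score, group, item = map(np.asarray, (score, group, item))
    num = den = 0.0
    for s in np.unique(score):
        f = (score == s) & (group == "F")
        r = (score == s) & (group == "R")
        if not f.any() or not r.any():
            continue                     # skip score levels missing one of the groups
        w = f.sum()                      # weight: focal-group count at this score level
        num += w * (item[f].mean() - item[r].mean())
        den += w
    return num / den                     # |STD-P-DIF| > 0.10 is the usual flag (see below)

# Toy usage: one score level, focal proportion .5 vs. reference proportion 1.0
print(std_p_dif([10, 10, 10, 10], ["F", "F", "R", "R"], [1, 0, 1, 1]))   # -0.5
```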

There are two versions of this technique, based on whether the sign of the difference is taken into account or not: the unsigned proportion difference and the signed proportion difference (Wiberg, 2007). The latter is also referred to as the standardized p-difference. The standardized p-difference index is the more common of the two. The item will be flagged as showing DIF if the absolute value of this index is above 0.1. Despite its conceptual and statistical simplicity, the standardization procedure is not so prevalent due to the large sample sizes that it requires (McNamara & Roever, 2006). Another shortcoming of the procedure is that it has no significance test (Clauser & Mazor, 1998). One of the most recent software packages introduced for DIF detection through the Standardization procedure is EASY-DIF (González et al., 2011). EASY-DIF also applies the Mantel-Haenszel procedure, which is explained in the next section. Also, STDIF (Robin, 2001) is a free DOS-based program to compute DIF through the Standardization approach. STDIF also has a manual (Zenisky, Robin, & Hambleton, 2009), which is freely available. The software and the manual are both available at: http://www.umass.edu/remp/software/STDIF.html. Zenisky, Hambleton, and Robin (2003) utilized STDIF to apply a two-stage methodology for evaluating DIF in large-scale state assessment data. These researchers were concerned with the merits of iterative approaches to DIF detection. In a later study, Zenisky et al. (2004) also applied STDIF to identify gender DIF in a large-scale science assessment. As the authors explain, their methodology was a variant of the Standardization technique. Lawrence, Curley, and McHale (1988) also applied the Standardization technique to detect differentially functioning items in the reading comprehension and sentence completion items of the verbal section of the Scholastic Aptitude Test (SAT).


Freedle and Kostin (1997) conducted an ethnic comparison study using the DIF methodology. They scrutinized a large number of items from the SAT and GRE exams, comparing the performance of Black and White examinees. Gallagher (2004) applied the Standardization procedure, logistic regression, and the MH to investigate the reading performance differences between African-American and White students taking a nationally normed reading test.

Mantel-Haenszel

The Mantel-Haenszel (MH) procedure was first proposed for DIF analysis by Holland and Thayer (as cited in Kamata & Vaughn, 2004). The basic idea is to calculate the odds of correctly endorsing an item for the focal group relative to the reference group. If there are large differences, DIF is present. According to Scheuneman and Bleistein (1989), "The MH estimate is a weighted average of the odds ratios at each of j ability levels" (p. 262). That is, the odds ratios of success at each ability level are estimated and then combined, as a weighted average, across all ability levels. Table 1 shows the hypothetical performance of two groups of test takers, the focal and reference groups, on an item.

Table 1

Hypothetical Performance of Two Groups on an Item

                  Correct   Incorrect   Total
Reference group   14        6           20
Focal group       8         12          20
Total             22        18          40

The first step in calculating the MH statistic is to compute the probabilities of correct and incorrect responses for both groups. The empirical probabilities are shown in Table 2. The second step is to find out how much more likely the members of each group are to answer the item correctly rather than incorrectly. For the reference group, the odds are:

$$\text{Odds}_R = 0.7 / 0.3 = 2.33$$

Similarly, the odds of giving a correct answer to the item for the focal group are:

$$\text{Odds}_F = 0.4 / 0.6 = 0.67$$

Table 2

Empirical Probabilities

                  Correct   Incorrect
Reference group   .7        .3
Focal group       .4        .6



Finally, we want to know how much more likely the members of the reference group are to respond correctly than the members of the focal group. To this aim, we compute the odds ratio:

$$\alpha = \text{Odds}_R / \text{Odds}_F = 2.33 / 0.67 \approx 3.5$$

Simply put, the odds ratio in the above example shows that members of the reference group are about three and a half times more likely than members of the focal group to endorse the item. However, note that we have calculated the odds ratio for only one ability level. The overall DIF index is therefore obtained by combining the odds ratios across all ability levels as a weighted average, as described in the quotation above. The resulting index is the Mantel-Haenszel odds ratio, denoted by $\alpha_{MH}$. This index is usually transformed by taking its natural logarithm:

$$\beta_{MH} = \ln(\alpha_{MH})$$

A negative $\beta_{MH}$ indicates DIF in favor of the focal group, whereas a positive value shows DIF favoring the reference group (Wiberg, 2007). Sometimes, $\alpha_{MH}$ is further transformed into the ETS delta metric:

$$\Delta_{MH} = -2.35 \ln(\alpha_{MH})$$

A positive value of $\Delta_{MH}$ indicates that the item is more difficult for the reference group, while a negative value shows that the focal group faces more difficulty with the item. The Educational Testing Service uses the MH statistic in DIF analysis. Items flagged as DIF are further classified into three types (Zieky, 1993) to avoid identifying items that display "practically trivial but statistically significant DIF" (Clauser & Mazor, 1998, p. 39). Items are identified as showing type A DIF if the absolute value of $\Delta_{MH}$ is smaller than 1.0 or not significantly different from zero. Type C DIF occurs when the absolute value of $\Delta_{MH}$ is greater than 1.5 and significantly different from 1.0. All other DIF items are flagged as type B. The main software packages for DIF analysis using the MH are DIFAS (Penfield, 2005), EZDIF (Waller, 1998), and, more recently, EASY-DIF (González, Padilla, Hidalgo, Gómez-Benito, & Benítez, 2011). Another relevant program is DICHODIF (Rogers, Swaminathan, & Hambleton, 1993), which can apply both the MH and logistic regression. Also, LERTAP (Nelson, 2000) is an Excel-based classical item analysis package that is able to do DIF analysis using the MH. Its student version is freely available and the full version is available from http://assess.com/xcart/product.php?productid=235&cat=21&page=1. For more helpful information about the software, see also http://lertap.curtin.edu.au/. Winsteps (Linacre, 2010) also provides MH-based DIF estimates as part of its output.
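As an illustration only (not the ETS implementation), the Python sketch below reproduces the single-stratum odds ratio from Table 1 and shows how the usual Mantel-Haenszel common odds ratio combines 2 x 2 tables across score levels before the delta transformation; the function name and input format are hypothetical.

```python
# Illustrative sketch: Mantel-Haenszel common odds ratio over 2x2 tables and the
# ETS delta transformation. Function name and input format are hypothetical.
import numpy as np

def mh_odds_ratio(tables):
    """tables: one 2x2 per score level, as [[ref_correct, ref_wrong],
    [foc_correct, foc_wrong]]. Returns alpha_MH and MH D-DIF = -2.35 * ln(alpha_MH)."""
    num = den = 0.0
    for t in tables:
        (a, b), (c, d) = np.asarray(t, dtype=float)
        n = a + b + c + d
        num += a * d / n                 # "reference right, focal wrong", weighted by 1/n
        den += b * c / n                 # "reference wrong, focal right", weighted by 1/n
    alpha = num / den
    return alpha, -2.35 * np.log(alpha)

# Single stratum from Table 1: (14/6) / (8/12) = 3.5
alpha, delta = mh_odds_ratio([[[14, 6], [8, 12]]])
print(round(alpha, 2), round(delta, 2))  # 3.5 and a negative delta: harder for the focal group
```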


Elder (1996) conducted a study to determine whether language background may lead to DIF. She examined the reading and listening subsections of the Australian Language Certificates (ALC), a test given to Australian school-age learners from diverse language backgrounds. Her participants were learners enrolled in language classes in years 8 and 9. The languages of her focus were Greek, Italian, and Chinese. Elder (1996) compared the performance of background speakers (those who spoke the target language as well as English at home) with non-background speakers (those who were exposed only to English at home). She applied the Mantel-Haenszel procedure to detect DIF. Ryan and Bachman (1992) also utilized the Mantel-Haenszel procedure to compare the performance of male and female test takers on the FCE and TOEFL tests. Allalouf and Abramzon (2008) investigated the differences between groups from different first language backgrounds, namely Arabic and Russian, using the Mantel-Haenszel procedure. Ockey (2007) applied both IRT and the MH to compare the performance of English language learner (ELL) and non-ELL 8th-grade students on National Assessment of Educational Progress (NAEP) math word problems. For an overview of the applications of the Mantel-Haenszel procedure to detect DIF, see Guilera, Gómez-Benito, and Hidalgo (2009).

Item Response Theory

The main difference between IRT-based DIF detection techniques and other methods, including logistic regression and the MH, is the fact that in non-IRT approaches, "examinees are typically matched on an observed variable (such as total test score), and then counts of examinees in the focal and reference groups getting the studied item correct or incorrect are compared" (Clauser & Mazor, 1998, p. 35). That is, the conditioning or matching criterion is the observed score. However, in IRT-based methods, matching is based on the examinees' estimated ability level, or the latent trait, $\theta$.

Figure 1. A Typical ICC


Methods based on item response theory are conceptually elegant though mathematically very complicated. The building block of IRT is the item characteristic curve (ICC) (see Baker, 2001; DeMars, 2010; Embretson & Reise, 2000;


Hambleton, Swaminathan, & Rogers, 1991). It is a smooth S-shaped curve which depicts the relationship between the ability level and the probability of a correct response to the item. As is evident from Figure 1, the probability of a correct response approaches one at the higher end of the ability scale, never actually reaching one. Similarly, at the lower end of the ability scale, the probability approaches, but never reaches, zero. IRT uses three features to describe the shape of the ICC: item difficulty, item discrimination, and a guessing factor. Based on how many of these parameters are involved in the estimation of the relationship between ability and the item response patterns, there are three IRT models, namely the one-, two-, and three-parameter logistic models. In the one-parameter logistic model and the Rasch model, it is assumed that all items have the same discrimination level. The two-parameter IRT model takes account of item difficulty and item discrimination; however, no guessing parameter is included. Finally, the three-parameter model includes a guessing parameter in addition to item difficulty and discrimination. The models provide a mathematical equation for the relation of the responses to ability levels (Baker, 2001). The equation for the three-parameter model is:

$$P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}}$$

where $b$ is the difficulty parameter, $a$ is the discrimination parameter, $c$ is the guessing or pseudo-chance parameter, and $\theta$ is the ability level. The basic idea in detecting DIF through IRT models is that if DIF is present in an item, the ICCs of the item for the reference and the focal groups should be different (Thissen, Steinberg, & Wainer, 1993). However, where there is no DIF, the item parameters, and hence the ICCs, should be almost the same. It is evident that the ICCs will differ if the item parameters vary from one group to another. Thus, one possible way of detecting DIF through IRT is to compare item parameters across the two groups. If the item parameters are significantly different, DIF is indicated. IRT-based DIF can be computed using BILOG-MG (Scientific Software International, 2003) for dichotomously scored items, and PARSCALE (Muraki & Bock, 2002) and MULTILOG (Thissen, 1991) for polytomously scored items. In addition, for small sample sizes, nonparametric IRT can be employed using the TestGraf software (Ramsay, 2001). For an exemplary study of the application of TestGraf for DIF detection, see Laroche, Kim, and Tomiuk (1998). Finally, the IRTDIF software (Kim & Cohen, 1992) can do DIF analysis under the IRT framework.
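A brief Python sketch of the three-parameter ICC may help; the parameter values below are hypothetical, and the unsigned area between the group-specific curves is shown only as one simple way of summarizing how far two ICCs diverge, not as the method used in the studies cited here.

```python
# Sketch of the three-parameter logistic ICC; all parameter values are hypothetical.
import numpy as np

def icc_3pl(theta, a, b, c):
    """P(correct | theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 161)
p_ref = icc_3pl(theta, a=1.2, b=0.0, c=0.2)   # reference-group item parameters
p_foc = icc_3pl(theta, a=1.2, b=0.5, c=0.2)   # focal group: same a and c, harder item (uniform DIF)
# Unsigned area between the two ICCs, as one crude summary of their divergence
print(np.sum(np.abs(p_ref - p_foc)) * (theta[1] - theta[0]))
```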


Pae (2004) undertook a DIF study of examinees with different academic backgrounds sitting the English subtest of the Korean National Entrance Exam for Colleges and Universities. He applied the three-parameter IRT model through MULTILOG for DIF analysis. Before applying IRT, however, Pae (2004) also did an initial DIF analysis using the MH procedure to detect suspect items. Geranpayeh and Kunnan (2007) also examined the existence of differentially functioning items on the listening section of the Certificate in Advanced English examination for test takers from three different age groups. Uiterwijk and Vallen (2005) investigated the performance of second generation immigrant (SGI) students and native Dutch (ND) students in the Final Test of Primary Education in the Netherlands. Both IRT and Mantel-Haenszel were applied in their study.

The Rasch Model

Although the one-parameter logistic model and the Rasch model are mathematically similar, they were developed independently of each other. In fact, a number of scholars (e.g. Pollitt, 1997) believe that the IRT models are fundamentally different from the Rasch model. The Rasch model focuses on the probability of person $m$ endorsing item $i$. In aiming to model this probability, it essentially takes into account person ability and item difficulty. Probability is a function of the difference between person ability and item difficulty. The following formula shows just this:

$$P(x_{mi} = 1 \mid \theta_m, b_i) = f(\theta_m - b_i)$$

where $\theta_m$ is person ability and $b_i$ is item difficulty. The formula simply states that the probability of endorsing the item is a function of the difference between person ability, $\theta_m$, and item difficulty, $b_i$. This is possible because item difficulty and person ability are on the same scale in the Rasch model. It is also intuitively appealing to conceive of probability in such terms. The Rasch model assumes that any person taking the test possesses an amount of the construct gauged by the test and that any item also reflects an amount of the construct. These values work in opposite directions. Thus, it is the difference between person ability and item difficulty that counts. Three cases can be considered for any encounter of a person and an item (Wilson, 2005):

1. Item difficulty and person ability are the same, $\theta_m - b_i = 0$, and the person has an equal probability of endorsing or failing the item. Thus, the probability is .5.
2. Person ability is greater than item difficulty, $\theta_m - b_i > 0$, and the person has more than a .5 probability of endorsing the item.
3. Person ability is lower than item difficulty, $\theta_m - b_i < 0$, and the probability of giving a correct response to the item is less than .5.

The exact formula for the Rasch model, in log-odds form, is the following:

$$\ln\!\left(\frac{P_{mi}}{1 - P_{mi}}\right) = \theta_m - b_i$$
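The following minimal sketch (with hypothetical values, not from any calibration reported here) evaluates the Rasch probability implied by the formula above and illustrates the three person-item encounters just listed.

```python
# Hypothetical illustration of the Rasch probability and the three encounters above.
import math

def rasch_p(theta, b):
    """P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

for theta, b in [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0)]:
    print(theta - b, round(rasch_p(theta, b), 2))   # 0.5, then above .5, then below .5
```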



The Rasch model provides us with sample-independent item difficulty indices. Therefore, DIF occurs when this invariance does not hold in a particular application of the model (Engelhard, 2009); that is, when the indices turn out to depend on the sample that takes the test. The amount of DIF is calculated by a separate-calibration t-test approach first proposed by Wright and Stone (1979; see Smith, 2004). The formula is the following:

$$t = \frac{d_{i2} - d_{i1}}{\sqrt{s_{i1}^2 + s_{i2}^2}}$$

where $d_{i1}$ is the difficulty of item $i$ in the calibration based on group 1, $d_{i2}$ is the difficulty of item $i$ in the calibration based on group 2, $s_{i1}^2$ is the squared standard error of estimate for $d_{i1}$, and $s_{i2}^2$ is the squared standard error of estimate for $d_{i2}$.
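As a small worked illustration of the separate-calibration t statistic above, using hypothetical difficulty estimates and standard errors:

```python
# Hypothetical worked example of the separate-calibration t statistic shown above.
import math

def rasch_dif_t(d1, se1, d2, se2):
    """t = (d_i2 - d_i1) / sqrt(se_i1**2 + se_i2**2)."""
    return (d2 - d1) / math.sqrt(se1 ** 2 + se2 ** 2)

# Item calibrated at 0.30 logits (SE 0.12) in group 1 and 0.85 logits (SE 0.15) in group 2
print(round(rasch_dif_t(0.30, 0.12, 0.85, 0.15), 2))   # about 2.86; |t| > 2 is a common flag
```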

Baghaei (2009), Bond and Fox (2007), and Wilson (2005) present excellent introductory-level expositions of the Rasch model. Among the software packages for DIF analysis using the Rasch model are ConQuest (Wu, Adams, Wilson, & Haldane, 2007), Winsteps (Linacre, 2010), and Facets (Linacre, 2009). Karami (2011) has applied the Rasch model to investigate the existence of DIF items in the UTEPT for male and female examinees. He applied Linacre's Winsteps for DIF analysis. Also, Karami (2010) exploited Winsteps to examine the UTEPT items for possible DIF for test takers from different academic backgrounds. Elder, McNamara, and Congdon (2003) also applied the Rasch model to examine the performance of native and non-native speakers on a test of academic English. Furthermore, Takala and Kaftandjieva (2000) undertook a study to investigate the presence of DIF in the vocabulary subtest of the Finnish Foreign Language Certificate Examination, an official, national high-stakes foreign-language examination based on a bill passed by Parliament. To detect DIF, they utilized the One Parameter Logistic Model (OPLM), a modification of the Rasch model where item discrimination is not fixed at one but is input as a known constant. Pallant and Tennant (2007) also applied the Rasch model to scrutinize the utility of the Hospital Anxiety and Depression Scale (HADS) total score (HADS-14) as a measure of psychological distress.

Conclusion

DIF analysis aims to detect items that differentially favor examinees of the same ability levels but from different groups. The technical requirements of this methodology, however, have hampered non-mathematically oriented researchers. Even if researchers do not apply these techniques in their own studies, they need to be familiar with them in order to fully appreciate published papers that report such analyses. This paper attempted to provide a non-technical introduction to the basic principles of DIF analysis. Five DIF detection techniques were explained: Logistic Regression, Mantel-Haenszel, Standardization, Item Response Theory, and the Rasch
model. For each technique, a number of the most widely applied software packages, along with some studies applying the technique, were briefly cited. The interested reader may refer to


such studies for further information about their application. It is hoped that the

exposition offered here will enable researchers to appreciate and enjoy reading studies that have conducted a DIF analysis.

References

Alderson, J. C., & Urquhart, A. (1985). The effect of students' academic discipline on their performance on ESP reading tests. Language Testing, 2, 192-204.
Allalouf, A., & Abramzon, A. (2008). Constructing better second language assessments based on differential item functioning analysis. Language Assessment Quarterly, 5, 120-141.
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-4). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Baghaei, P. (2009). Understanding the Rasch model. Mashad: Mashad Islamic Azad University Press.
Baker, F. (2001). The basics of item response theory. ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. London: Lawrence Erlbaum.
Brown, J. D. (1999). The relative importance of persons, items, subtests and languages to TOEFL test variance. Language Testing, 16, 217-238.
Camilli, G. (2006). Test fairness. In R. Brennan (Ed.), Educational measurement (pp. 221-256). New York: American Council on Education & Praeger series on higher education.
Chen, Z., & Henning, G. (1985). Linguistic and cultural bias in language proficiency tests. Language Testing, 2(2), 155-163.
Choi, S. W., Gibbons, L. E., & Crane, P. K. (2011). lordif: An R package for detecting differential item functioning using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simulations. Journal of Statistical Software, 39(8), 1-30.
Clauser, E. B., & Mazor, M. K. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31-44.
Davidson, B. (2004). Comparability of test scores for non-aboriginal and aboriginal students (Doctoral dissertation, University of British Columbia, 2004). UMI ProQuest Digital Dissertation.
DeMars, C. E. (2010). Item response theory. New York: Oxford University Press.
Elder, C. (1996). The effect of language background on foreign language test performance: The case of Chinese, Italian, and Modern Greek. Language Learning, 46, 233-282.
Elder, C., McNamara, T. F., & Congdon, P. (2003). Understanding Rasch measurement: Rasch techniques for detecting bias in performance assessments: An example comparing the performance of native and


non-native speakers on a test of academic English. Journal of Applied Measurement, 4, 181-197.
Embretson, S. E., & Reise, S. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum Publishers.
Engelhard, G. (2009). Using item response theory and model-data fit to conceptualize differential item and person functioning for students with disabilities. Educational and Psychological Measurement, 69, 585-602.
Freedle, R., & Kostin, I. (1997). Predicting black and white differential item functioning in verbal analogy performance. Intelligence, 24, 417-444.
Gallagher, M. (2004). A study of differential item functioning: Its use as a tool for urban educators to analyze reading performance (Unpublished doctoral dissertation, Kent State University). UMI ProQuest Digital Dissertation.
Geranpayeh, A., & Kunnan, A. J. (2007). Differential item functioning in terms of age in the Certificate in Advanced English examination. Language Assessment Quarterly, 4, 190-222.
González, A., Padilla, J. L., Hidalgo, M. D., Gómez-Benito, J., & Benítez, I. (2011). EASY-DIF: Software for analyzing differential item functioning using the Mantel-Haenszel and standardization procedures. Applied Psychological Measurement, 35, 483-484.
Guilera, G., Gómez-Benito, J., & Hidalgo, M. D. (2009). Scientific production on the Mantel-Haenszel procedure as a way of detecting DIF. Psicothema, 21(3), 492-498.
Hale, G. A. (1988). Student major field and text content: Interactive effects on reading comprehension in the Test of English as a Foreign Language. Language Testing, 5, 49-61.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Kamata, A., & Vaughn, B. K. (2004). An introduction to differential item functioning analysis. Learning Disabilities: A Contemporary Journal, 2, 49-69.
Karami, H. (2010). A differential item functioning analysis of a language proficiency test: An investigation of background knowledge bias. Unpublished master's thesis, University of Tehran, Iran.
Karami, H. (2011). Detecting gender bias in a language proficiency test. International Journal of Language Studies, 5, 167-178.
Kim, M. (2001). Detecting DIF across the different language groups in a speaking test. Language Testing, 18, 89-114.
Kim, S.-H., & Cohen, A. S. (1992). IRTDIF: A computer program for IRT differential item functioning analysis. Applied Psychological Measurement, 16, 158.
Laroche, M., Kim, C., & Tomiuk, M. A. (1998). Translation fidelity: An IRT analysis of Likert-type scale items from a culture change measure for Italian-Canadians. Advances in Consumer Research, 25, 240-245.
Lawrence, I. M., Curley, W. E., & McHale, F. J. (1988). Differential item functioning for males and females on SAT verbal reading subscore items (Report No. 884). New York: College Entrance Examination Board.


Lee, Y. W., Breland, H., & Muraki, E. (2004). Comparability of TOEFL CBT writing prompts for different native language groups (TOEFL Research Report No. RR-77). Princeton, NJ: Educational Testing Service. Retrieved September 29, 2011, from http://www.ets.org/Media/Research/pdf/RR-04-24.pdf
Linacre, J. M. (2009). FACETS Rasch-model computer program (Version 3.66.0) [Computer software]. Chicago, IL: Winsteps.com.
Linacre, J. M. (2010). Winsteps (Version 3.70.0) [Computer software]. Beaverton, OR: Winsteps.com.
Linn, R. L., & Drasgow, F. (1987). Implications of the Golden Rule settlement for test construction. Educational Measurement: Issues and Practice, 6, 13-17.
Magis, D., Béland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847-862.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Malden, MA & Oxford: Blackwell.
Messick, S. (1980). Test validation and the ethics of assessment. American Psychologist, 35, 1012-1027.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13-103). New York: American Council on Education & Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
Muraki, E., & Bock, D. (2002). PARSCALE 4.1 [Computer program]. Chicago: Scientific Software International, Inc.
Nelson, L. R. (2000). Item analysis for tests and surveys using Lertap 5. Perth, Western Australia: Curtin University of Technology (www.lertap.curtin.edu.au).
Ockey, G. J. (2007). Investigating the validity of math word problems for English language learners with DIF. Language Assessment Quarterly, 4(2), 149-164.
Pae, T. (2004). DIF for learners with different academic backgrounds. Language Testing, 21, 53-73.
Pallant, J. F., & Tennant, A. (2007). An introduction to the Rasch measurement model: An example using the Hospital Anxiety and Depression Scale (HADS). British Journal of Clinical Psychology, 4, 1-18.
Pampel, F. (2000). Logistic regression: A primer. Thousand Oaks, CA: Sage.
Penfield, R. D. (2005). DIFAS: Differential Item Functioning Analysis System. Applied Psychological Measurement, 29, 150-151.
Pollitt, A. (1997). Rasch measurement in latent trait models. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education. Volume 7: Language testing and assessment (pp. 243-254). Dordrecht: Kluwer Academic.
Ramsay, J. O. (2001). TestGraf: A program for the graphical analysis of multiple-choice test and questionnaire data [Computer software and manual]. Montreal, Canada: McGill University.
Robin, F. (2001). STDIF: Standardization-DIF analysis program [Computer program]. Amherst, MA: University of Massachusetts, School of Education.


Rogers, H. J., Swaminathan, H., & Hambleton, R. K. (1993). DICHODIF: A FORTRAN program for DIF analysis of dichotomously scored item response data [Computer software]. Amherst: University of Massachusetts.
Roznowski, M., & Reith, J. (1999). Examining the measurement quality of tests containing differentially functioning items: Do biased items result in poor measurement? Educational and Psychological Measurement, 59, 248-269.
Ryan, K., & Bachman, L. (1992). Differential item functioning on two tests of EFL proficiency. Language Testing, 9, 12-29.
Sasaki, M. (1991). A comparison of two methods for detecting differential item functioning in an ESL placement test. Language Testing, 8(2), 95-111.
Scheuneman, J. D., & Bleistein, C. A. (1989). A consumer's guide to statistics for identifying differential item functioning. Applied Measurement in Education, 2, 255-275.
Shohamy, E. (2001). The power of tests: A critical perspective on the uses of language tests. London: Longman/Pearson Education.
Smith, R. (2004). Detecting item bias with the Rasch model. Journal of Applied Measurement, 5(4), 430-449.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.
Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary test. Language Testing, 17, 323-340.
Thissen, D. (1991). MULTILOG user's guide: Multiple categorical item analysis and test scoring using item response theory (Version 6.0). Chicago: Scientific Software.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum Associates.
Uiterwijk, H., & Vallen, T. (2005). Linguistic sources of item bias for second generation immigrants in Dutch tests. Language Testing, 22, 211-234.
Waller, N. G. (1998). EZDIF: Detection of uniform and nonuniform differential item functioning with the Mantel-Haenszel and logistic regression procedures. Applied Psychological Measurement, 22, 391.
Wiberg, M. (2007). Measuring and detecting differential item functioning in criterion-referenced licensing tests: A theoretic comparison of methods (Educational Measurement Technical Report No. 2).
Wilson, M. (2005). Constructing measures: An item response modeling approach. London: Lawrence Erlbaum Associates.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ACER ConQuest Version 2: Generalized item response modeling software [Computer program]. Camberwell: Australian Council for Educational Research.



Zenisky, A. L., Hambleton, R. K., & Robin, F. (2003). Detection of differential item functioning in large-scale state assessments: A study evaluating a two-stage approach. Educational and Psychological Measurement, 63(1), 49-62.
Zenisky, A. L., Hambleton, R. K., & Robin, F. (2004). DIF detection and interpretation in large-scale science assessments: Informing item-writing practices. Educational Assessment, 9(1&2), 61-78.
Zenisky, A. L., Robin, F., & Hambleton, R. K. (2009). Differential item functioning analyses with STDIF: User's guide. Amherst, MA: University of Massachusetts, Center for Educational Assessment.
Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337-348). Hillsdale, NJ: Lawrence Erlbaum Associates.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense.

About the Author

Hossein Karami (hkarami@ut.ac.ir) is currently a Ph.D. candidate in TEFL and an instructor at the Faculty of Foreign Languages and Literature, University of Tehran, Iran. His research interests include various aspects of language testing in general, and Differential Item Functioning, validity, and fairness in particular.

