
Sultan Qaboos University Language Centre

MFRM TO ADJUST FOR RATER SEVERITY/LENIENCY


Presentation for the LC Conference
by

Mr Farah Bahrouni
bahrouni@squ.edu.om

April 20, 2011


Plan
Briefing about MFRM
Running the analysis for 5 facets: candidate, rater, background, experience & category
Adjusting scores as per FACETS estimates
Conclusion


Descriptive statistics of the raw ratings (68 ratings per criterion; TA, CC, LR and GR each scored out of 25; Total out of 100):

Student 1    TA         CC         LR         GR         Total
Mean         19.62132   19.38971   18.20956   16.45588
Max          25         24         23         22         94
Min          14         13         14         10         51
Range        11         11         9          12         43
Count        68         68         68         68

Student 2    TA         CC         LR         GR         Total
Mean         20.13971   20.09926   19.88235   18.88971
Max          25         25         25         24         99
Min          14         13         12         11         50
Range        11         12         13         13         49
Count        68         68         68         68

Student 3    TA         CC         LR         GR         Total
Mean         15.16544   15.79559   15.48162   18.88971
Max          25         23         20         24         92
Min          10         10         8          11         39
Range        15         13         12         13         53
Count        68         68         68         68
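These statistics can be reproduced directly from the raw ratings. A minimal Python sketch; the data layout and values below are illustrative, not the actual ratings:

from statistics import mean

# Illustrative layout: criterion -> raw ratings for one script
# (68 ratings per criterion in the real data; shortened here).
ratings = {
    "TA": [14, 25, 19, 22, 18],
    "CC": [13, 24, 20, 21, 19],
}

for criterion, scores in ratings.items():
    print(f"{criterion}: Mean={mean(scores):.5f}  Max={max(scores)}  "
          f"Min={min(scores)}  Range={max(scores) - min(scores)}  Count={len(scores)}")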


Assessment of language proficiency in speaking and writing is inherently subjective: a number of distinct factors directly or indirectly impinge upon the assessment/measurement outcomes.

These factors are referred to as facets.


A facet has been defined as:

"Any factor, variable, or component [e.g. examinees, tasks, raters, interviewers, etc.] of the measurement situation that is assumed to affect test scores in a systematic way."
(Bachman, 2004; Linacre, 2002; Wolfe & Dobria, 2008; cited in Eckes, 2009: 2)


The error-prone nature of most measurement facets brings about serious concerns about both the reliability and validity of the obtained scores.


The usual approaches to dealing with rater variability include:
rater training
using 2 or more raters in the scoring of performance assessments
calling for an adjudicator (a 3rd/4th... rater, usually a more experienced/senior/expert one)
developing rubrics that spell out the proficiency levels
identifying anchor papers to provide concrete examples of each proficiency level
(for details see Johnson et al., 2005, 2003, 2001, 2000)


Nevertheless, research has found that, however carefully they are applied, none of these methods is effective enough to guarantee reliable, objective scores, and they are diverse enough to raise questions about the quality of the resolved scores.
Underlying these resolution models is the common assumption that the discrepant scores might lack the requisite levels of reliability and validity, and that adjudication might improve this deficit to some extent (Johnson et al., 2005: 123).


As for rater training, it has been found that even with proper training, substantial differences between raters persist.
(Linacre, 1990; Hamp-Lyons, 1991; Weigle, 1994, 1998, 2002; Lumley & McNamara, 1995; McNamara, 1996; Lumley, 2005)

Rater differences are reduced by training, but they do persist (McNamara, 1996: 118). One reason:

Some see severity as a personality trait that raters inherently bring to any rating situation.
(Myford et al., 2003)

Multi-facet Rasch Model (MFRM) provides a rich set of highly flexible tools to account, and compensate, for measurement error, especially rater-dependent measurement error.
It is an extension of the basic Rasch model that incorporates more facets than the 2 usually included in dichotomous item tests, i.e. candidates and items.
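In its dichotomous form, the Rasch model expresses the log-odds of a correct response as the difference between candidate ability and item difficulty; the many-facet extension (Linacre, 1989) adds one parameter per extra facet, e.g. rater severity, together with the rating-scale category thresholds. In LaTeX notation:

    \log \frac{P_{ni1}}{P_{ni0}} = B_n - D_i

    \log \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k

where B_n is the ability of candidate n, D_i the difficulty of item i, C_j the severity of rater j, and F_k the difficulty of scale category k relative to category k-1.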


Multi-facet Rasch measurement is a stochastic (probabilistic) model; the analysis is performed using FACETS, a computer program developed by Linacre (1989).
Candidate ability is estimated from all ratings given by all raters on all items (Lunz & Wright, 1997; McNamara, 1996: 132).

Item difficulty (TA, CC, LR & GR) is estimated from all responses across all candidates to that item (ibid.).
Rater severity is estimated from all ratings given across all candidates and items (ibid.).
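Once those parameters are estimated, FACETS can report a "fair average" for each candidate: the score that candidate would be expected to receive from a rater of average (zero) severity. A minimal Python sketch of that adjustment under the rating-scale formulation given earlier; all parameter values below are invented for illustration:

import math

def category_probs(b, d, c, thresholds):
    # Rating-scale model: P(category k) is proportional to
    # exp(sum over h <= k of (b - d - c - F_h)), with P(0) proportional to exp(0).
    cumulative, logits = 0.0, [0.0]
    for f in thresholds:
        cumulative += b - d - c - f
        logits.append(cumulative)
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(b, d, c, thresholds):
    # Model-expected raw score: sum of k * P(k) over the scale categories.
    return sum(k * p for k, p in enumerate(category_probs(b, d, c, thresholds)))

# Invented values (in logits): candidate ability, item difficulty,
# the severity of one harsh rater, and thresholds F_1..F_4 of a 0-4 scale.
b, d, severe_rater = 1.2, -0.3, 0.8
thresholds = [-1.5, -0.5, 0.5, 1.5]
print("expected from the severe rater:", round(expected_score(b, d, severe_rater, thresholds), 2))
print("fair score (average severity): ", round(expected_score(b, d, 0.0, thresholds), 2))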

In addition, MFRM has 2 more very informative functions:

Fit analysis

Bias analysis

These 2 functions enable researchers to look at:

how individual raters, ratees, or traits included in the analysis are performing (fit analysis: z-score values between +2 and -2 are usually accepted in contexts similar to ours);

how the individual elements within the facets interact, i.e. the individual-level effects of the various elements (bias analysis: z-score values between +2 and -2).

Thus, the source(s) of variation in the scores can be efficiently determined.
(Myford et al., 2003; Lunz & Wright, 1997)
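Once a run is complete, flagging elements outside that band is straightforward. A minimal Python sketch with a made-up fit table (not actual FACETS output):

# Made-up standardized infit (z) values per rater from a hypothetical run.
infit_z = {"R01": 0.4, "R02": -1.1, "R03": 2.7, "R04": -2.3}

misfitting = {rater: z for rater, z in infit_z.items() if abs(z) > 2}
print("Raters outside the +2/-2 band:", misfitting)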


Conclusion

Owing to the above features, MFRM has been found to be a model with great potential to improve our capacity to produce objective measures of test takers' ability in performance assessment contexts. It is practical and can be used in our context alongside pair rating.
(Linacre et al., 1990; Engelhard, 1991, 1992, 1994, 1996; Engelhard & Myford, 2003; Hamp-Lyons, 1991; Lunz, 1996, 1997a, 1997b; Lunz & Wright, 1997; Weigle, 1994, 1998, 2002; Schaefer, 2003, 2008; Kondo-Brown, 2002; Lumley & McNamara, 1995; Lumley, 2005; McNamara, 1991, 1996, 1997, 2000, 2002, 2008; McNamara & Roever, 2006; Myford et al., 2003, 2004; Shaw & Weir, 2007; Wigglesworth, 1993, 1994)
