
The Effect of Multiple Choice Item Sequence on EFL Students' Performance and Test Reliability

Asep Suarman, Asf_suarman@yahoo.co.id

Abstract

In the 2011 Indonesian national examination, five packages of items were used whose sequences were relatively random in terms of level of difficulty, which might affect students' performance on the test. This paper reports an investigation of whether sequencing multiple choice items has a significant effect on the performance of junior high school EFL students and whether the difference in sequencing affects the internal reliability of the items. A multiple choice paper-and-pencil test was used to collect the data. Sixty-eight students, divided into two intact groups, an easy-to-difficult (ED) item group and a difficult-to-easy (DE) item group, were involved. The students' answers were analyzed with the ANATESV4 and SPSS 15 software. The results revealed that the students in the ED group outperformed the DE group (t = -2.114, df = 52, p = 0.039), meaning that sequencing the items has a significant effect on the students' performance. In terms of internal reliability, the items in ED sequence were better than those in DE sequence (ED = .85, DE = .59). Further investigation is necessary to reach a more convincing conclusion.

Keywords: Multiple Choice, Sequence, Performance, Reliability

Background

The national examination is still conducted in Indonesia every year. It is administered at the end of the last grade of junior and senior high school, and it functions both as one of the graduation requirements and as a general mapping of students' capability. It covers Bahasa Indonesia, Mathematics, English and Science.

In the last few years, there have been some changes in the way the multiple choice items of the national examination for junior and senior high schools in Indonesia are packaged. In 2008 and 2009, two packages of items derived from the same test criteria were used; the items were different but assumed to have the same level of difficulty, and they were sequenced roughly from easy to difficult. In 2010, two packages were employed; they contained the same items but in a different sequence. In 2011, five packages were used. The items in every package derived from the same test criteria, consisting of the same genres of texts, the same material (skills tested) and a similar degree of difficulty, but in a different sequence.

At first sight, such items may pose no problem. However, when the items are analyzed and compared, some drawbacks of the sequencing come up. An analysis of the five packages of the 2011 English national examination shows that the packages are basically made up of only two sets of items with the same genres of text and the same skills tested: packages 39, 25 and 12 consist of the same items, and so do packages 46 and 54; only the sequence of the items in each package differs. For example, item 1, a notice asking for the main idea, is the same in all packages. Items 2 and 3 in package 39, an announcement asking for explicit and implicit information, appear as items 10 and 11 in package 25 and as items 12 and 13 in package 12. In the other two packages, items 2 and 3 in package 46 are placed as items 49 and 50 in package 54. The analysis shows that the test criteria, or the material tested, are the same, but the items differ in one important respect: the degree of difficulty of the items might be the same, but the sequence is very different. One package might consist of items sequenced from easy to difficult (ED), while another might run from difficult to easy (DE).
This, of course, may affect students' achievement in the exam, since students' motivation and disappointment may psychologically influence their scores. Furthermore, from the fairness point of view (Brown, 2005, p. 26), the difference in item sequence is not fair: the item sequence leads to different results. Although the items are similar in terms of competence, indicators and degree of difficulty, the result of such a test remains rather unobjective, because examinees who have actually received the same treatment, materials, tasks, guidance and feedback are tested with different items.

Several studies investigating the sequence of items have indicated that the sequence of a test has a significant effect on the performance of test takers: Jessel and Sullin (1975), Towle and Meril (1975), Carsten and McKeag (1982), Carlson and Ostrosky (1992), Hodson (2006) and Soureshjani (2011). Jessel and Sullin (1975) found that the arrangement of multiple choice items made a significant difference to test performance and reliability. Towle and Meril (1975) found that students given easy-to-hard (EH) items scored significantly higher than those given hard-to-easy (HE) items, and pointed out that no anxiety effect was found, meaning that the difficulty sequence did not affect anxiety arousal. Carsten and McKeag (1982), Gohman and Spector (1989) and Carlson and Ostrosky (1992) showed evidence that the distribution of test scores may be influenced by the sequence of the items, but that the items' validity and reliability remain unaffected. Hodson (2006) and Soureshjani (2011) showed that students taking easy-to-difficult (ED) multiple choice items outperformed those taking difficult-to-easy (DE) items.

This previous research indicates that the sequence of items can affect adult learners' performance on a test, but no such research has been done with teenagers or junior high school students. In addition, it is difficult to find research relating the sequence of items to the internal reliability of a test. Thus, this study attempted to shed light on two questions: whether there is any significant effect of sequencing multiple choice test items on the performance of junior high school students, and whether the difference in sequencing affects the internal reliability of the items. Hopefully, this research can provide additional support for existing theories and findings on the topic.

By and large, many factors affect the score a test taker gains: the testing environment, the test rubric, the nature of the input of the test, the nature of the expected response, and the relationship between input and response (Bachman, 1990, in Soureshjani, 2011). In addition, the test format, such as multiple choice, true-false, cloze procedure, open-ended or other formats, may influence test takers' performance (Alderson, 2000; Bachman & Palmer, 1996; Buck, 2001; cited in Soureshjani, 2011).

With multiple choice items, three kinds of jumbling, or re-sequencing, can be derived from the same test criteria. The first is jumbling only the options: the items are the same, but the options of each item are ordered differently. The second is jumbling the sequence of the items: the test has the same items with the same options, but the order of the items is different. The third is jumbling both the items and the options: the items appear in a different order and the options of every item are also rearranged.

On top of that, reliability is one of the criteria, besides practicality and validity, for judging the quality of test items. Reliability is defined as the extent to which results can be considered consistent or stable (Brown, 2005, p. 175), or as the desired consistency (or reproducibility) of test scores (Crocker and Algina, 1986, in Fulcher and Davidson, 2007, p. 104). A reliable test is consistent and dependable (Brown, 2001; Brown, 2004). The reliability of a test may lie in the test itself, which is generally called test reliability, or in the scoring of the test, which is called rater (scorer) reliability (Brown, 2001). Since this study uses a multiple choice test, only the former is investigated here. Among the three basic strategies for estimating the reliability of a test, i.e. test-retest, equivalent (parallel) forms and internal consistency (Brown, 2005, pp. 175-9; see also Fulcher and Davidson, 2007), this study employed internal consistency using the split-half method, with ANATESV4 as the software.

Method

This study was conducted in EFL classes in Serang Regency, involving 74 participants divided into two intact groups. The population, grade IX students of junior high school, were about 14-16 years old and at a beginner-to-intermediate level of proficiency; all of them had studied English for at least two and a half years. The instrument used in the study was a 40-item reading comprehension test. The items were adapted from the Prediksi Ujian Nasional 2010 (the 2010 national examination prediction) issued in softcopy form by Depdiknas (the Ministry of National Education). The selected genres were report, procedure, letter and advertisement, which are taught in the odd term of Grade IX. The items were multiple choice with four options, arranged in two sequences: easy to hard and vice versa.
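Since ANATESV4 reports the split-half estimate directly, the computation itself is not shown in this paper. The following is only a minimal sketch of the odd-even split with a Spearman-Brown correction; the simulated response matrix, the simple logistic model and all variable names are illustrative assumptions, not the study's data or the ANATES algorithm.

    import numpy as np

    rng = np.random.default_rng(0)
    n_students, n_items = 30, 40
    ability = rng.normal(0.0, 1.0, size=(n_students, 1))       # hypothetical person ability
    difficulty = rng.normal(0.0, 1.0, size=(1, n_items))        # hypothetical item difficulty
    p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))   # chance of answering correctly
    responses = (rng.random((n_students, n_items)) < p_correct).astype(int)  # 1 = correct

    odd_half = responses[:, 0::2].sum(axis=1)    # score on items 1, 3, 5, ...
    even_half = responses[:, 1::2].sum(axis=1)   # score on items 2, 4, 6, ...

    r_halves = np.corrcoef(odd_half, even_half)[0, 1]   # odd-even correlation
    reliability = 2 * r_halves / (1 + r_halves)          # Spearman-Brown full-test estimate

    print(f"odd-even correlation: {r_halves:.2f}")
    print(f"split-half reliability: {reliability:.2f}")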

This study consisted of three stages: a pilot study, a trial stage and the administration of the test. The first was done to check the homogeneity of the groups. The second was to find out the quality of the test items and their degree of difficulty, and was conducted with a different intact group that was not involved in the main study. In the third, the same test in two different sequences was administered to collect the data: one group of students did the easy-to-difficult items (the ED group) and the other did the difficult-to-easy ones (the DE group). SPSS and ANATESV4 were used to analyze the data.

In the pilot study, the results of two previous tests were analyzed to check the homogeneity of the groups. The multiple choice (MC) items were teacher-made and covered different genres, i.e. report, procedure, letter and advertisement texts. A t-test was employed to determine homogeneity. In the trial, 40 MC items covering all the genres mentioned were given to another intact group not involved in the study, and the results were analyzed with ANATESV4 to see the degree of difficulty, the quality of the distracters and the internal validity of the items; the poor distracters were then revised. The items were subsequently arranged in two sequences: easy-to-difficult (ED) and difficult-to-easy (DE). In the last stage, the ED and DE item sets were administered to the two different but equivalent intact groups, and the results were analyzed with ANATES and an SPSS independent t-test: the t-test was used to see whether the item sequence had a significant effect, and ANATES was used to examine the internal reliability of the test.

Finding and Discussion

In the pilot study, the results of the two previous tests were analyzed to check the homogeneity of the ED and DE groups. The first test was on reading report texts and the second on reading procedure texts, and the groups were required to have equal assumed ability. Levene's test shows equality of variances of the previous test scores of the ED and DE groups. The calculation shows that the t-value for the daily test I scores, with equal variances assumed, is -.113 with a probability of 0.910 (t = -.113, df = 64, p = 0.910), and the t-value for daily test II, with equal variances assumed, is .012 with a probability of .990 (t = .012, df = 66, p = .990). This indicates that both groups were equal; the ED and DE groups were statistically homogeneous.

Then the trial stage was conducted to try out the items with another intact group not involved in the study. The ANATESV4 analysis of the students' answer sheets reveals that the 40 items were generally good, although some needed correction. The internal reliability of the test items was 0.76, meaning the test was quite reliable, and the correlation between the odd- and even-numbered halves was .62. The mean score, with each correct answer rated as one point, was 16.86, and the standard deviation was 4.56, with 27 as the highest score and 10 as the lowest. Before being given to the ED and DE groups, the items were therefore revised to improve the quality of the distracters.

The items were then rearranged into two sequences based on level of difficulty: the first from the easy to the difficult items, the second from the difficult to the easy ones. In sequencing, not only the difficulty index but also the attachment of items to a particular text had to be considered, since some items were integrated with a text and could not be separated; even so, the overall order followed the difficulty level.
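As an illustration only, the rearrangement described above can be sketched as follows: items attached to the same text are kept together as a block, and the blocks are ordered by their average facility value (the proportion of test takers answering correctly), descending for the ED version and ascending for the DE version. The item numbers, text labels and facility values below are invented for the example and are not the trial statistics.

    from itertools import groupby

    # (item_number, text_id, facility); facility = proportion of correct answers in a trial
    items = [
        (1, "notice_1", 0.80), (2, "announcement_1", 0.65), (3, "announcement_1", 0.40),
        (4, "report_1", 0.55), (5, "report_1", 0.30), (6, "letter_1", 0.70),
    ]

    # Group items that share a text so they are never separated.
    blocks = [list(group) for _, group in groupby(items, key=lambda item: item[1])]

    def mean_facility(block):
        return sum(item[2] for item in block) / len(block)

    ed_sequence = sorted(blocks, key=mean_facility, reverse=True)  # easy -> difficult
    de_sequence = sorted(blocks, key=mean_facility)                # difficult -> easy

    print("ED order:", [item[0] for block in ed_sequence for item in block])
    print("DE order:", [item[0] for block in de_sequence for item in block])

In the actual study, the difficulty ratings came from the ANATESV4 trial analysis rather than from invented values.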
Subsequently, three days after the trial, the test was administered to the ED and DE groups. To maintain the internal validity of the test, it was held at the same time in different classrooms (Hatch and Farhady, 1982, p. 7). The test takers had 80 minutes to answer the 40 items, and a test administrator monitored them to prevent interfering factors such as cheating, cooperation or other disturbances. The independent t-test in SPSS shows that the mean score (on a 0-100 scale) of the ED group is 42.04 and that of the DE group is 34.88, with standard deviations of 15.67 for the ED group and 8.10 for the DE group. The result of the SPSS independent t-test analysis is as follows.

Table 1: Independent samples t-test of the DE and ED groups in the test stage
Levene's Test for Equality of Variances: F = 9.824, Sig. = .003

t-test for Equality of Means:
  Equal variances assumed:     t = -2.114, df = 52, Sig. (2-tailed) = .039, Mean Difference = -7.15110, Std. Error Difference = 3.38352, 95% Confidence Interval of the Difference = -13.94063 to -.36157
  Equal variances not assumed: t = -2.159, df = 41.543, Sig. (2-tailed) = .037, Mean Difference = -7.15110, Std. Error Difference = 3.31207, 95% Confidence Interval of the Difference = -13.83731 to -.46489
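The figures in Table 1 come from SPSS, but the same two-step analysis (Levene's test for equality of variances followed by an independent-samples t-test) can be sketched with scipy. The two score lists below are hypothetical placeholders, not the ED and DE groups' actual scores.

    from scipy import stats

    ed_scores = [55.0, 47.5, 40.0, 62.5, 35.0, 30.0, 50.0, 42.5]  # hypothetical ED group scores
    de_scores = [37.5, 32.5, 35.0, 30.0, 40.0, 27.5, 37.5, 35.0]  # hypothetical DE group scores

    # Levene's test: a small p-value indicates unequal variances.
    levene_stat, levene_p = stats.levene(ed_scores, de_scores)
    equal_var = levene_p >= 0.05

    # Independent-samples t-test; Welch's version is used when variances differ.
    t_stat, p_value = stats.ttest_ind(ed_scores, de_scores, equal_var=equal_var)

    print(f"Levene: statistic = {levene_stat:.3f}, p = {levene_p:.3f}")
    print(f"t-test ({'equal' if equal_var else 'unequal'} variances): t = {t_stat:.3f}, p = {p_value:.3f}")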

The table above shows the result of Levene's test, which reveals a difference in the variances of the test-stage scores of the ED and DE groups: F for the test-stage scores, with equal variances assumed, is 9.824 with a probability of .003, which is smaller than .05 (p < .05), meaning that the variances of the two groups are different. The table also shows that the t-value for the test scores, with equal variances assumed, is -2.114 with a probability of 0.039 (t = -2.114, df = 52, p = 0.039); the 'equal variances not assumed' row leads to the same conclusion (t = -2.159, df = 41.543, p = .037). This indicates that the null hypothesis (H0) is rejected: the two groups are not equal and do not perform the same. The difference is most likely due to the sequence of the items, since the items are the same for both groups and the groups were previously shown to be homogeneous; the group with the ED items performed better than the group with the DE items.

In summary, the students who did the items in easy-to-difficult sequence (the ED group) outperformed those who did the items in difficult-to-easy sequence (the DE group). This is in line with previous findings that the sequence of a test has a significant effect on the performance of test takers (Towle and Meril, 1975; Carlson and Ostrosky, 1992; Hodson, 2006). The result also confirms Soureshjani's (2011) finding that students taking easy-to-difficult (ED) test items outperform those taking difficult-to-easy (DE) items. The difference in test performance might further be driven by affective factors such as anxiety, motivation or frustration; Munz and Smouse (1968) claimed that differential item sequences affect the performance of test takers with different anxiety levels. The DE (difficult-to-easy) sequence might lead students to become frustrated, unmotivated or even disappointed, so that they lose concentration on the remaining items.

In addition, the ANATESV4 analysis comparing the trial, ED and DE groups reveals that the ED group outperforms the DE group: the mean score, the standard deviation, the XY correlation and the test reliability of the ED group are all higher than those of the DE group. The details are given in the table below.

Table 2: The result of the ANATESV4 analysis of the ED and DE groups
                         Trial group   ED group   DE group   Difference (ED - DE)
Number of population     36            30         29         1
Mean score               16.19         15.71      13.83      1.88
Standard deviation       4.57          6.71       3.57       3.14
Correlation of XY        0.42          0.73       0.42       0.31
Internal reliability     0.59          0.85       0.59       0.26
Highest score            28            33         23         10
Lowest score             10            9          8          1

Further, it can be seen that the DE group shows almost the same results as the trial group. Despite slight differences in mean score and standard deviation, the internal reliability and the XY correlation are the same for the trial and DE groups, so it can be assumed that the trial group and the DE group have similar ability; they look homogeneous. In terms of internal-consistency reliability (Brown, 2005, pp. 176-8; Hughes, 2003, pp. 38-9), in which the odd- and even-numbered items are scored separately and compared, the table above shows that the items sequenced from easy to difficult have better internal reliability than those sequenced from difficult to easy. In the DE group, the reliability of the items is .59 (the same as in the trial stage), but in the ED group it is .85.
A reliability coefficient of .85 means the items have more than 80% reliability, which is close to the best possible coefficient of 1, or 100% (Hughes, 2003). This suggests that the items are more reliable when they are sequenced from easy to difficult.

Conclusion and Suggestions

In summary, the results of the study confirm previous findings that sequencing the items has a significant effect on students' performance. The SPSS independent t-test analysis shows that the students in the ED (easy-to-difficult) group performed better than those in the DE (difficult-to-easy) group, possibly because of affective factors such as anxiety, motivation and frustration at the beginning of the test. Meanwhile, in terms of internal item reliability, the data show that items in easy-to-difficult (ED) sequence have a better reliability coefficient than items in difficult-to-easy (DE) sequence.

However, as this study may have some drawbacks, further studies covering non-multiple-choice items or larger populations need to be conducted to obtain more convincing evidence about the sequencing of test items. It is worth mentioning that, although a teacher-made test was used here, a future study could use items from a standardized test, which might produce different results. It would also be worthwhile to investigate a similar question with computer- or internet-based multiple choice tests and to employ other software to analyze the results. Finally, a study of the affective factors influencing test takers is necessary so that tests are valid and reflect test takers' authentic performance.

Bibliography
Alderson, J. C. 2000. Assessing Reading. Cambridge: Cambridge University Press.
Bachman, L. 1990. Fundamental Considerations in Language Testing. London: OUP. In Soureshjani, H. K. 2011. Item Sequence on Test Performance: Easy Items First? Language Testing in Asia, Volume 1, Issue 3, October 2011.
Brown, J. D. 2005. Testing in Language Programs: A Comprehensive Guide to English Language Assessment. Singapore: McGraw-Hill Education.
Brown, H. D. 2001. Teaching by Principles: An Interactive Approach to Language Pedagogy, Second Edition. New York: Addison Wesley Longman, Inc.
Brown, H. D. 2004. Language Assessment: Principles and Classroom Practices. New York: Pearson Education, Inc.
Buck, G. 2001. Assessing Listening. Cambridge: Cambridge University Press. In Soureshjani, H. K. 2011. Item Sequence on Test Performance: Easy Items First? Language Testing in Asia, Volume 1, Issue 3, October 2011. Available online at http://www.languagetestingasia.com/. Retrieved November 2011.
Carlson and Ostrosky. 1992. Item Sequence and Student Performance on Multiple Choice Exams. The Journal of Economic Education, Vol. 23, No. 3 (Summer 1992), pp. 232-235. Available online at http://www.jstor.org/stable/1183225. Retrieved November 2011.
Carsten, P. W. and McKeag, R. A. 1982. The Effect of a Change in Item Sequence Order on Performance in a Test, Re-Test Experiment. Online publication. Available at http://www.eric.ed.gov. Retrieved November 2011.
Fulcher, G. and Davidson, F. 2007. Language Testing and Assessment: An Advanced Resource Book. Oxon: Routledge.
Gohman, S. F. and Spector, L. C. 1989. Test Scrambling and Student Performance. Journal of Economic Education, Summer. In Carlson and Ostrosky. 1992. Item Sequence and Student Performance on Multiple Choice Exams. The Journal of Economic Education, Vol. 23, No. 3, pp. 232-235.
Hatch, E. and Farhady, H. 1982. Research Design and Statistics for Applied Linguistics. Los Angeles, California: Newbury House Publishers, Inc.
Hodson. 2006. The Effect of a Change in Item Sequence on Student Performance in a Multiple-Choice Chemistry Test. Journal of Educational Measurement, National Council on Measurement in Education. Available online at http://www.jstor.org. Retrieved November 2011.
Hughes, A. 2003. Testing for Language Teachers, Second Edition. Cambridge: Cambridge University Press.
Jessel, J. C. and Sullin, W. L. 1975. The Effect of Keyed Response Sequencing of Multiple Choice Items on Performance and Reliability. Journal of Educational Measurement, Volume 12, No. 1. National Council on Measurement in Education. Available online at http://www.jstor.org. Retrieved November 2011.
Soureshjani, H. K. 2011. Item Sequence on Test Performance: Easy Items First? Language Testing in Asia, Volume 1, Issue 3, October 2011. Available online at http://www.languagetestingasia.com/. Retrieved November 2011.
Towle and Meril. 1975. Effects of Anxiety Type and Item Difficulty Sequencing on Mathematics Test Performance. Journal of Educational Measurement, National Council on Measurement in Education. Available online at http://www.jstor.org. Retrieved November 2011.
