
SGDE 4013

DEVELOPMENT AND ANALYSES OF ITEMS

PREPARED BY:
BALQIS BINTI HASINI (824695)
NUR AQLILI HANUM BINTI ABDULLAH (824762)
PUTRI AMIRAH BINTI MEGAT AZAMUDDIN (824728)

PREPARED FOR:
DR. NURLIYANA BUKHARI
TABLE OF CONTENTS

1.0 Introduction
    1.1 Deciding on a Test's Purpose
    1.2 Background of Test Takers
    1.3 Table of Specifications
    1.4 Marking Scheme / Scoring Rubric
2.0 Item Development
3.0 Test Administration Procedure
4.0 Scoring and Recording
5.0 Item Analysis
    5.1 Item Difficulty Index
    5.2 Item Discrimination Index
    5.3 Reliability
6.0 Analyses of Students' Performance
7.0 Discussion
    7.1 Difficulty Index
    7.2 Discrimination Index
    7.3 Reliability
8.0 Suggestions for Improvement
9.0 Conclusion
10.0 References
1.0 INTRODUCTION

Assessment entails the systematic gathering of evidence to judge a student's understanding of what has been learned (Alias, 2005). Through the assessments they construct, educators can then judge whether students have learned what they are expected to learn. For this project, we chose one of the topics from the assessment subject, namely reliability. We want to measure how well students understand the concept of reliability. The topic concerns the relative consistency of test scores in educational assessment. Reliability refers to how well a score represents an individual's ability and, within education, ensures that assessments accurately measure student knowledge. Reliability is an important concept for educators to understand, because reliable scores help students grasp their level of development and help educators improve their teaching effectiveness.

1.1. Deciding on a Test’s Purpose

Determining a test's objectives is the first step in the test construction process. The test objectives are the criteria used to judge whether the test is valid. The following objectives served as our guideline for gauging students' level of understanding of the reliability topic:

 Measure Post Graduate Diploma in Education students' understanding of reliability concepts.

 Measure Post Graduate Diploma in Education students' understanding of the factors affecting reliability.

 Measure Post Graduate Diploma in Education students' ability to identify the various methods of measuring reliability in any assessment.

 Measure Post Graduate Diploma in Education students' ability to apply the concept of reliability in given situations.

1.2. Background of Test Takers

Thirty-one students from the PGDE programme took part in this assessment: 6 male students and 25 female students. All 31 students were selected from Universiti Utara Malaysia (UUM).

1.3. Table of Specifications

A Table of Specifications identifies the achievement domains being measured and ensures that a fair and representative sample of questions appears on the test. It also describes the reliability topics to be covered by the test and the number of items or points associated with each topic. This table of specifications is built on three constructs: the concept of reliability, the factors affecting reliability, and the methods of measuring reliability. Each item is further classified by cognitive level (easy, average, or difficult).

Objective items (1 mark each)

No.  Construct                        Items       Total Marks   Total Items
1    Concept of reliability           Q1 - Q5     5             5
2    Factors affecting reliability    Q6 - Q10    5             5
3    Methods of reliability           Q11 - Q15   5             5
     TOTAL                                        15            15

Subjective items

No.  Construct                        Items (marks)       Total Marks   Total Items
1    Concept of reliability           QS1 (2)             2             1
2    Factors affecting reliability    QS2 (3), QS3 (4)    7             2
3    Methods of reliability           QS4 (4), QS5 (2)    6             2
     TOTAL                                                15            5

1.4. Marking Scheme / Scoring Rubric

The marking rubric is a scoring guide used to evaluate the quality of students' constructed responses.

No. Answer Reason

1. B Reliability refers to whether we are truly measuring the concept of interest in our study.
FALSE: That statement describes validity; reliability is the degree of consistency between two measures of the same thing.

2. A Reliability = Consistency

TRUE: The degree of consistency between two measures of the same thing.

3. B Measurement reliability refers to the:


a. Dependency of the scores (Assess nursing dependency)
b. Consistency of the scores (Reliability = Consistency)
c. Comprehensiveness of the scores (To assess general language
ability)
d. Accuracy of the scores (Validity)

4. C Which are not sources of error in assessment?

i. Examinee
ii. Examination
iii. Examiner
a. i and ii (Examiner is not included in the answer)
b. i and iii (Examination is not included in the answer)
c. All of the above (Examinee, examination, and examiner are all sources of error)

5. A

Based on the above diagram, which statement is correct?

a. Both reliable and valid (The target is all in the centre and consistent)
b. Low validity and low reliability
c. Reliable and not valid
d. Not reliable and not valid

6. B All of the following are factors that influence reliability EXCEPT?
a. Length of test (a longer test increases reliability; less guessing)
b. Culture (culture does not affect score reliability)
c. Range of ability (a wider spread of scores increases reliability)
d. Scorer's objectivity (increases reliability because the resulting scores are not influenced by the scorer's judgment or opinion)

7. C What effect would the following most likely have on reliability?

a. Increasing the number of tasks in the assessment (increases reliability)
b. Removing ambiguous tasks (increases reliability)
c. Changing from a multiple-choice test to an essay test covering the same material (has the most effect, because essay scoring introduces the examiner's subjectivity, which lowers reliability)

8. A What effect does a homogeneous group of examinees have on reliability?
a. It lowers reliability (a homogeneous group, with the same background/ability, narrows the spread of scores and lowers reliability)
b. It increases reliability (a heterogeneous group, with different abilities/backgrounds, increases reliability)
c. It does not have any effect

9. B Which factor can increase reliability?
a. Lower the number of test items (decreases reliability)
b. Test a heterogeneous group of examinees (increases reliability: wider spread of scores)
c. Narrow the range of examinees' ability (decreases reliability)
d. Test a homogeneous group of examinees (decreases reliability)

10. A Items that discriminate well tend to increase reliability?
a. True (reliable, because item discrimination compares the number of high scorers and low scorers who answer an item correctly)
b. False

11. A Why could the index of reliability be negative?

a. Students who score high in the first test will get a low score in the second test, and vice versa. (A negative reliability index indicates inverse consistency)
b. Students who score low in the first test will get a low score in the second test. (Consistent - positive)
c. Students who score high in the first test will get a high score in the second test. (Consistent - positive)
d. Students who score high in the test will always score high in any test. (Consistent - positive)

12. B At which level is reliability if the (r) value is 0.163?

a. Reliability is very good
b. Reliability is very poor
c. Reliability is poor
d. Reliability is average

Based on Mehrens and Lehmann (1991)

13. D Which is one disadvantage of the test-retest method?

a. It requires two tests and at least two forms. (Disadvantage of the parallel-form method)
b. Different subsections of the test affect test homogeneity, thus reducing score reliability. (Disadvantage of the split-half method)
c. There are more possibilities for raters to disagree. (Disadvantage of inter-rater reliability)
d. Reliability cannot be estimated until after the second test. (Disadvantage of test-retest reliability)

14. C Which method of estimating reliability is suitable for essays?

a. Test-retest (the same test administered twice to the same group)
b. Inter-rater reliability (related to the examiner)
c. Cronbach's alpha (suitable for dichotomous and polytomous items)
d. Kuder-Richardson-20 (suitable for dichotomous items only)

15. A Inter-rater reliability means that if two different raters scored the item using the scoring rules, they should attain similar results.
a. True (inter-rater reliability involves two different raters)
b. False (the statement is correct, so False is wrong)

Answer scheme for the subjective questions.

No. Description Marks

1. Define reliability. (2 marks)
 The degree of consistency between two measures of the same thing - 2 marks

2. Why can the scorer's objectivity affect the reliability of the score? (3 marks)
 Measures without reference to outside influences - 1 mark
 A more objectively scored assessment result - 1 mark
 The resulting scores are not influenced by the examiner's judgement - 1 mark

3. How do we test reliability? (4 marks)
 Test-retest - 1 mark
 Parallel form - 1 mark
 Inter-rater - 1 mark
 Internal consistency - 1 mark

4. Briefly explain inter-rater reliability and intra-rater reliability. (4 marks)
 Inter-rater reliability - if two different raters scored the scale using the scoring rules, they should attain the same result - 2 marks
 Intra-rater reliability - the same rater gives consistent estimates of the same measurement over time - 2 marks

5. How can the examiner affect the score? (2 marks)
 The emotional state or health of the examiner - 1 mark
 The examiner feels tired or anxious, leading to impatience when marking the paper - 1 mark
 A different scorer - 1 mark
 (Any 2 correct answers earn full marks.)

2.0 ITEM DEVELOPMENT

We created 20 questions, comprising multiple-choice and subjective items, to gauge the students' knowledge of reliability. The multiple-choice section contains 15 questions; this format can be used to measure knowledge outcomes and is the most widely used for measuring knowledge, comprehension, and application of the reliability concept. A multiple-choice question typically has three parts: a stem, the correct answer (the key), and several wrong answers (the distractors). Each question presents its response alternatives as options (i.e., A, B, C, D). The correct answer is the best answer among the options given; the distractors are incorrect but appear plausible, especially to those who have not mastered the content. To determine their understanding of the concept of reliability, the students were asked to circle "A", "B", "C", or "D". The section also contains statements that the students had to mark as true or false. Here we focused the questions on understanding the concept of reliability, the factors affecting reliability, and the methods used in measuring reliability, so students had to discriminate among options that vary in degree of correctness.

The other 5 questions were created in subjective form. Subjective items permit the student to organize and present an original answer, so they involve a wider variety of thinking skills: students recall, select, and organize what they understand about the reliability topic. This approach makes the test more challenging for students and decreases the chance of getting an answer correct by guessing, so we can find out how far students have mastered the areas of the reliability topic being tested.

The first five questions tested students on the specific concept of reliability. Questions six to ten asked about the factors that affect reliability scores. In the last five questions, students were given facts about specific reliability index results and, based on this knowledge, had to determine which index values make an assessment acceptably reliable. We also created questions to gauge the students' level of understanding of the methods of measuring reliability. The aim of these questions was to test the application of knowledge.

3.0 TEST ADMINISTRATION PROCEDURE

The objective test was prepared before the presentation on the reliability topic was delivered. There are fifteen objective questions covering the concept of reliability, sources of measurement error, factors affecting reliability, and methods used to test reliability. Right after the presentation, the students were given ten minutes to take the test on the reliability topic.

The test was held on 12 November 2018, from 12.00 pm to 12.10 pm, in the SEML classroom. The setting was semi-formal: examinees took the paper-based test at their own seats. All examinees had to submit their answers after the ten-minute time limit. The examinees also sat the test without being informed in advance that they would be tested.

During the test, one issue that may have occurred is that students could copy from a friend beside them, since all students answered the test at their own seats. It was easy for students to discuss and imitate one another, because no rules were set other than the time limit. Besides that, guessing might also have occurred, because the students had only just heard about the topic at that time and had no time to revise and prepare for the test. Apart from that, because no one monitored the students during the test, the possibility of students looking up information on the internet was high, despite the short time allowed.

4.0 SCORING AND RECORDING

After the test ended, the papers were marked manually by the group members who administered the test. The papers were marked against the pre-prepared answer scheme so that every paper was marked equally and without bias. For the objective section, each question carries 1 mark, so a wrong answer scores 0. The marks and scores were then recorded in an Excel sheet, in table form, so that the data are easy to review and analyze. The data were recorded in the Excel sheet as shown below:

(Score recorded in Excel sheet)
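As a minimal sketch of this recording step (assuming a hypothetical file name and a layout with one row per student and one column per item; the actual sheet may differ), the tabulation and total-score computation could be done with pandas:

```python
import pandas as pd

# Hypothetical file and column names: Q1..Q15 scored 0/1, QS1..QS5 in partial marks.
df = pd.read_excel("reliability_test_scores.xlsx")

item_cols = [c for c in df.columns if c.startswith("Q")]
df["SAS_SCORE"] = df[item_cols].sum(axis=1)  # total score per student

# Class summary (reported later as mean 21.65, SD 3.508)
print(df["SAS_SCORE"].mean(), df["SAS_SCORE"].std(ddof=1))
```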

5.0 ITEM ANALYSIS

ITEM    MEAN    SD       CORRECTED ITEM-TOTAL    CRONBACH'S ALPHA
                         CORRELATION             IF ITEM DELETED
Q1      0.13    0.341     0.083                  0.486
Q2      1.00    0.000     0.000                  0.491
Q3      0.87    0.341    -0.081                  0.504
Q4      0.97    0.180     0.197                  0.481
Q5      1.00    0.000     0.000                  0.491
Q6      1.00    0.000     0.000                  0.491
Q7      0.26    0.445    -0.352                  0.547
Q8      0.68    0.475     0.376                  0.440
Q9      0.61    0.495     0.554                  0.406
Q10     0.61    0.495    -0.126                  0.520
Q11     0.68    0.475     0.421                  0.432
Q12     0.68    0.475     0.376                  0.440
Q13     0.68    0.475     0.331                  0.447
Q14     0.84    0.374     0.183                  0.474
Q15     0.81    0.402     0.122                  0.481
QS1     1.74    0.682    -0.039                  0.518
QS2     1.29    1.216     0.574                  0.297
QS3     3.10    1.106     0.372                  0.402
QS4     3.68    0.871    -0.237                  0.584
QS5     1.03    0.706     0.221                  0.460

Scale Reliability (Cronbach's Alpha for SAS): 0.489
Scale Mean: 21.65
Scale SD: 3.508
N of Items: 20

(Table 1: Item Analysis)

(Chart 1: Item Analyses)
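The statistics reported in Table 1 can be reproduced from the score matrix. The following is a minimal sketch (assuming the scores are loaded as a 31 x 20 array of marks; it illustrates the calculations rather than the exact software used):

```python
import numpy as np

def cronbach_alpha(X):
    """Coefficient alpha (Cronbach, 1951) for a score matrix X (rows = examinees)."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_var = X.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = X.sum(axis=1).var(ddof=1)    # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

def item_analysis(X):
    """Print mean, SD, corrected item-total correlation, and alpha-if-deleted per item."""
    X = np.asarray(X, dtype=float)
    total = X.sum(axis=1)
    for j in range(X.shape[1]):
        item = X[:, j]
        if item.std(ddof=1) == 0:
            r = 0.0  # zero-variance items (everyone scored the same) carry no information
        else:
            r = np.corrcoef(item, total - item)[0, 1]  # correlate with total minus the item
        alpha_del = cronbach_alpha(np.delete(X, j, axis=1))
        print(f"Item {j + 1}: mean={item.mean():.2f}  sd={item.std(ddof=1):.3f}  "
              f"r_it={r:.3f}  alpha_if_deleted={alpha_del:.3f}")
```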

5.1. Item Difficulty Index

The Difficulty Index (DI) measures how difficult an item is for the group tested and is used to describe the difficulty of an item on a test. The DI ranges between 0.00 and 1.00, and the difficulty level of each test item is determined by its DI value. Table 2 below shows how the difficulty level is determined by the difficulty index:

Difficulty Index (DI)    Difficulty Level
0.00 - 0.20              Too difficult
0.21 - 0.40              Difficult
0.41 - 0.60              Moderate/Average
0.61 - 0.80              Easy
0.81 - 1.00              Too easy

(Table 2: Table of Difficulty Index Item)
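As a minimal sketch of this classification (assuming each item's scores are stored as marks earned out of a known maximum), the DI and its level according to Table 2 can be computed as follows:

```python
def difficulty_index(scores, max_marks=1):
    """Proportion of available marks earned: 0.0 = no marks at all, 1.0 = everyone full marks."""
    return sum(scores) / (len(scores) * max_marks)

def difficulty_level(di):
    """Classify a DI value according to Table 2."""
    if di <= 0.20:
        return "Too difficult"
    if di <= 0.40:
        return "Difficult"
    if di <= 0.60:
        return "Moderate/Average"
    if di <= 0.80:
        return "Easy"
    return "Too easy"

# Example: Q1 was answered correctly by 4 of the 31 examinees.
q1 = [1] * 4 + [0] * 27
print(round(difficulty_index(q1), 2))          # 0.13, as reported for Q1 in Table 1
print(difficulty_level(difficulty_index(q1)))  # "Too difficult"
```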

This study involved 31 respondents from among the Postgraduate Diploma in Education (PGDE) students. The construct testing the concept of reliability contains 7 items. The difficulty levels for this construct show that 5 items are too easy (Q2, Q3, Q4, Q5, and QS1), 1 item is at the moderate level (QS5), and 1 item is at the too difficult level (Q1).

For the construct on the factors affecting reliability there are 6 items. One item falls under the too easy level (Q6), 3 items fall under the easy level (Q8, Q9, and Q10), 1 item is at the moderate level (QS2), and 1 item is at the difficult level (Q7). Next, the construct testing the methods of measuring reliability has 7 items: 3 items are too easy (Q14, Q15, and QS4), while 4 items are easy (Q11, Q12, Q13, and QS3).

Examining the DI values and difficulty levels across the three constructs (concepts, factors, and methods), the proportion of too easy items over the whole test is 45%, easy items 35%, moderate items 10%, difficult items 5%, and too difficult items 5%. The difficulty-level percentages for the whole test are shown in the table below:

Difficulty Level     Concept   Factors Affecting   Methods of Measuring   Total Items   %
                               Reliability         Reliability
Too easy             5         1                   3                      9             45%
Easy                 -         3                   4                      7             35%
Moderate/Average     1         1                   -                      2             10%
Difficult            -         1                   -                      1             5%
Too difficult        1         -                   -                      1             5%
TOTAL                7         6                   7                      20            100%

(Table 3: % Overall Test Item Difficulty Level)

The item difficulty level is obtained from the respondents' scores. Although Table 3 shows 1 item (5%) at the too difficult level, all test items were retained at this stage, because retention cannot rely solely on an item's difficulty level: which items are retained in the actual study is determined by the item discrimination index (ID) analyzed next.

A too difficult item indicates that respondents failed to answer it correctly. The items analyzed in this study actually demonstrate the respondents' existing knowledge. Respondents were not informed in advance, precisely to prevent them from making early preparations; only our team and the lecturer knew, and their cooperation allowed the study to run.

Given that this set of items on reliability knowledge comprehensively covers all the subtopics of reliability, the difficulty level obtained for each item indicates the respondents' level of understanding of these topics. Therefore, the findings of the item analysis based on DI values help us identify the strengths and weaknesses of the actual survey respondents.

5.2. Item Discrimination Index

The discrimination index (ID) is a measure of how effectively an item discriminates between the high-scoring and low-scoring groups; it is intended to show the difference between the two groups on an item. Usually, an item with a high difficulty level can only be answered correctly by the high-scoring group, although some items can be answered by both groups.

Table of Index Discrimination

Index Discrimination (ID)   Description of Item
ID > 0.4                    High positive discrimination
0.2 < ID < 0.4              Moderate positive discrimination
0 < ID < 0.2                Low positive discrimination
ID < 0                      Negative discrimination: the lower group performs better than the higher group

(Source: Low Hiang Loon, n.d.)

High performers should be more likely to answer a good item correctly, and low performers more likely to answer it incorrectly. The index ranges from -1.00 to +1.00, with an ideal value of +1.00. A positive coefficient indicates that high-scoring examinees tended to score higher on the item, while a negative coefficient indicates that low-scoring examinees tended to score higher on it. On items that discriminate well, more high scorers than low scorers answer correctly. The higher the discrimination index, the better the item, because high values indicate that the item discriminates in favor of the upper group, which should answer more items correctly. If more low scorers answer an item correctly, the index is negative and the item is probably flawed (McCowan & McCowan, 1999).
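Table 1 reports the corrected item-total correlation, which plays this discriminating role in our analysis; the classic upper-lower group index described above can be sketched as follows (an illustration, not the exact computation behind Table 1):

```python
import numpy as np

def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Classic upper-lower index: proportion correct in the top group minus
    proportion correct in the bottom group (range -1.00 to +1.00)."""
    item_scores = np.asarray(item_scores, dtype=float)
    order = np.argsort(total_scores)               # examinees sorted by total score
    n = max(1, int(round(len(order) * fraction)))  # size of each tail group
    low, high = order[:n], order[-n:]
    return item_scores[high].mean() - item_scores[low].mean()
```

A positive value means the upper group outperformed the lower group on the item; a negative value flags the item for revision.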

There are 3 constructs measured in this study: the concept of reliability, the factors affecting reliability, and the methods of reliability. Based on Table 1 above, for the concept construct, 2 items (questions 2 and 5) have an ID of zero, which shows that these items do not discriminate in any way at all. Question 3 (objective) and question 1 (subjective) have negative ID values, meaning that more people in the low group than in the high group got the item correct. Of the remaining items, questions 1 and 4 have low positive ID values and question 5 (subjective) has a moderate positive ID value.

For the construct on the factors affecting reliability, 1 item (question 6) has an ID of zero. Two items (questions 7 and 10) have negative discrimination values. One item (question 8) has a moderate positive value, while the two remaining items, question 9 and question 2 (subjective), have high positive discrimination.

The third construct, methods of reliability, shows 3 items with moderate positive discrimination values: questions 12 and 13, and question 3 (subjective). The lowest ID (-0.237) belongs to question 4 (subjective), while the item with the highest ID (0.421) is question 11. The remaining 2 items, questions 14 and 15, have low positive discrimination.

Description   Concept   Factors Affecting   Methods of    Number of   Percentage (%)
                        Reliability         Reliability   Items
Low           4         1                   2             7           35
Moderate      1         1                   3             5           25
High          0         2                   1             3           15
Negative      2         2                   1             5           25
TOTAL         7         6                   7             20          100

(Table 4: Percentage of Item Discrimination)

From the analysis in the table above, 5 items (25%) overall have negative discrimination indices; these items should be eliminated or completely revised, because more students in the low group than in the high group got them correct (the items are not doing what they should). Meanwhile, only 3 items (15%) have high discrimination indices; these items are functioning satisfactorily, because more people in the high group than in the low group got them correct (the items are doing what they should).

5.3. Reliability

According to Hanna and Dettmer (2004), reliability refers to the consistency of the measures produced by a tool. Indeed, the reliability of an exam means the consistency of the scores produced by the test.

This study uses a single-administration procedure to determine the reliability and consistency of the research instrument. The method selected is Cronbach's alpha formula; Cronbach (1951) used the alpha coefficient as a measure of internal consistency. This method is suitable for dichotomous and polytomous items, especially essay items, whose scores can take a wide range of values.
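For reference, the coefficient is computed as shown below; plugging in the item SDs and the scale SD from Table 1 (so the sum of item variances is about 6.58 and the total-score variance is 3.508^2, about 12.31) reproduces the reported value:

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_{Y_i}^{2}}{\sigma_{X}^{2}}\right) = \frac{20}{19}\left(1 - \frac{6.58}{12.31}\right) \approx 0.489

where k is the number of items (20), \sigma_{Y_i}^{2} is the variance of item i, and \sigma_{X}^{2} is the variance of the total scores.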

Reliability Statistics (whole test)
Cronbach's Alpha: 0.489    N of Items: 20

The value of Cronbach's alpha obtained for the whole test is 0.489. The level of understanding measured by this study consists of three main constructs, namely concept, factor, and method.

Reliability Statistics (by construct)
Concept:   Cronbach's Alpha 0.188   (7 items)
Factor:    Cronbach's Alpha 0.239   (6 items)
Method:    Cronbach's Alpha 0.367   (7 items)

Furthermore, the alpha obtained for the concept construct is 0.188, for the factor construct 0.239, and for the method construct 0.367.

Mehrens and Lehmann (1991) listed five types of reliability and the methods of determining their indices. The reliability index lies between -1.00 and +1.00, and a negative reliability index indicates inverse consistency. Normally indexes are positive and, for tests, an index between 0.65 and 0.85 is adequate. As a guide, the reliability of a test can be interpreted from the index (r) as shown below.

Index (r)     Item Description
< 0.20        Very poor
0.21 - 0.40   Poor
0.41 - 0.60   Moderate
0.61 - 0.80   Good
0.81 - 1.00   Very good

(Table 5: Reliability Index)

Hence, the instrument of this study, analyzed using Cronbach's alpha, obtained an overall reliability and consistency value of 0.489 for the whole item set, which is moderate.

6.0 ANALYSES OF STUDENTS’ PERFORMANCE

(Table 6: Score of Mean, Standard Deviation and Z-Score)

Based on Table 6, the mean score is 21.65. The mean score is the average of the test scores for the class, and it shows that the students performed quite well. However, the content or constructs being assessed probably need to be reviewed in class, because distractor analysis can help the educator identify which misconceptions are shared by the majority of the students and correct them.

The standard deviation (SD) is another way of showing the spread of scores: it measures the degree to which the group of scores deviates from the mean. Table 6 shows that the standard deviation for the overall scores is 3.508. A large standard deviation means there is much variability in the group's test scores, i.e., students performed quite differently on the test.

The z-score is a conversion of a raw score into a standard score based on the mean and the standard deviation. A positive z-score means the value is larger than the mean: the maximum score has a z-score of 1.81174, showing that it is 1.81174 standard deviations above the mean. A negative z-score means the value is smaller than the mean: the minimum score has a z-score of -1.60941, showing that it is 1.60941 standard deviations below the mean.
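As a minimal sketch, the statistics in Table 6 can be reproduced directly from the recorded SAS scores:

```python
import numpy as np

# SAS scores of the 31 examinees, taken from Table 6
scores = np.array([19, 28, 27, 27, 24, 23, 26, 28, 25, 23, 23, 23, 21, 26, 21,
                   20, 21, 23, 20, 21, 18, 18, 19, 19, 16, 16, 18, 18, 22, 21, 17])

mean = scores.mean()      # 21.65
sd = scores.std(ddof=1)   # 3.508 (sample standard deviation)
z = (scores - mean) / sd  # z = (X - M) / SD

print(z.max())  # ~ +1.81 for the top score of 28
print(z.min())  # ~ -1.61 for the bottom score of 16
```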

7.0 DISCUSSION

7.1. Difficulty Index

The item difficulty levels show that the only item at the too difficult level is Q1, with a difficulty index of 0.13. This may be due to respondents' lack of understanding of, or confusion about, the concept-of-reliability subtopic, so the chance of answering this question incorrectly was high. Only one item, Q7, with a difficulty index of 0.26, falls under the difficult level. This item is about the factors that affect reliability scores. It may have fallen under the difficult category because the answer choices confused respondents; indeed, even though this question was taken from a textbook, we, as the team that designed the item, were also unsure which answer was correct.

Two items fall at the moderate level: QS2, with a difficulty index of 0.43, and QS5, with 0.52. QS2 tests respondents' understanding of the factors affecting reliability scores, while QS5 tests the concept of reliability in scoring. For item QS2, some respondents simply left the question blank, and some of those who answered did not give as many points as the question required: the item carries 3 marks, and respondents who gave only 1 or 2 points did not earn full marks. Item QS5 was designed to test the concept of how the examiner affects scores in assessment; it likewise saw some respondents give fewer answers than the 2 marks required, so they did not earn full marks.

Seven items are categorized at the easy level: Q8, Q9, Q10, Q11, Q12, Q13, and QS3, with difficulty indices of 0.68, 0.61, 0.61, 0.68, 0.68, 0.68, and 0.77 respectively. Q8, Q9, and Q10 test respondents' understanding of the factors affecting reliability scores, while Q11, Q12, Q13, and QS3 test the methods used to assess reliability. These items questioned the respondents directly, with simple and clear answer choices; the subjective item likewise asked a direct question, so most respondents could answer easily. Other possible explanations are that respondents referred to notes or the internet, or discussed the answers with friends near them.

Lastly, 9 items fall at the too easy level: Q2, Q3, Q4, Q5, Q6, Q14, Q15, QS1, and QS4, with difficulty indices of 1.00, 0.87, 0.97, 1.00, 1.00, 0.84, 0.81, 0.87, and 0.92 respectively. Items Q2, Q3, Q4, Q5, and QS1 test respondents' understanding of the concept of reliability, item Q6 tests the factors that affect reliability scores, and items Q14, Q15, and QS4 test the methods used to assess reliability. These items also questioned the respondents directly and clearly, with simple answer choices, and the subjective questions required only short, direct answers. Apart from that, respondents may have referred to notes or the internet to find the answers.

7.2. Discrimination Index

Based on the discrimination indices of the test above, 7 items had low positive discrimination. These items deserve consideration, because when an item is so easy that nearly everyone gets it correct, or so difficult that nearly everyone gets it wrong, it becomes very difficult to discriminate those who have actually mastered the content from those who have not.

We found that the distractors of objective question number 6 produced a discrimination index of 0 because they are too obvious, showing that they are not working at all. It seems the test-maker gave a slight clue in the length of the distractors: option B was chosen by all the students, because it is obvious that the other options are factors of reliability while option B, culture, is not related to the factors of reliability at all.

Five of the 20 items had negative discrimination. When an item discriminates negatively, the most knowledgeable students tend to get the item wrong while the least knowledgeable students get it right. A low-group student may guess or refer to other references, select that response, and come up with the correct answer. High-group students may be suspicious of a question stem that looks confusing, especially question number 7; they may take the harder path to solving the problem, read too much into the question, and end up being less successful than those who guess.

7.3. Reliability

The reliability index for this study was 0.489, which is moderate. We cannot expect assessment results to be perfectly consistent: numerous factors other than the quality being measured may influence assessment results.

Among the factors that can affect the reliability of the items is the range of ability. The respondents in this study had homogeneous ability, which caused the reliability index to be moderate. If the selected respondents had heterogeneous ability, the reliability index would increase.

Item difficulty level also affects the reliability index. Reliability will be low if a test is so easy or so difficult that every student gets most or all of the items correct or wrong. The items given to the 31 respondents were mostly at the too easy level (45%). When most items fall in the same category and the difficulty levels are not diversified, the reliability index will be low.

Furthermore, examinees' scores are affected by guessing. When the items were distributed, some respondents did not fully understand the questions and only guessed; unfortunately, some guessed the answers correctly. This situation decreases the reliability index.

Last but not least, there are the distractors of the dichotomous items. In this study, these distractors were quite weak and could not make the respondents hesitate over the answer choices, so most respondents answered correctly without any hesitation. Indirectly, this decreases the reliability index.

8.0 SUGGESTIONS FOR IMPROVEMENT

Once the analysis was carried out, we saw many deficiencies in the items that need to be improved. Improvements are needed so that the items can achieve their target: the items must be sound so that the scores obtained by the examinees evaluate what is supposed to be measured. Several suggestions can be made to fix the items.

The questions given to the 31 Postgraduate Diploma in Education students of Universiti Utara Malaysia have low item difficulty. In the future, we need to diversify the items' difficulty; when the difficulty varies from item to item, the test can achieve higher reliability and validity.

In addition, every test should have an enforced time limit. When the items were distributed, we told the students to answer within ten minutes, but we did not collect the papers back within the time given; therefore, they completed the items without a real time limit. Reliability and validity can be affected when no time limit is enforced. In the future, we should state and enforce the time limit.

Furthermore, examinees should be spaced apart when answering the items; otherwise the rate of copying between them will be high. Examinees may discuss the items with each other and answer based not on their own knowledge but under the influence of other examinees. They also had the opportunity to look up answers on the internet while answering.

Before the students started answering the questions, we presented the chapter in only twenty minutes, because the class was to end in 30 minutes. We could not confirm that our knowledge and information were successfully delivered to them. Therefore, our suggestion for improving this issue in the future is to allocate enough time for the presentation, making sure the students receive it and gain a deep understanding of the chapter.

Next, the distractors of the dichotomous items are mostly very weak: respondents can identify the answer directly and are not made to hesitate when choosing. Therefore, as an improvement for the next study, we need to create distractors that make the respondents weigh the given options carefully. In this way, the reliability of the study will increase.

9.0 CONCLUSION

This study involved 31 respondents from among the Postgraduate Diploma in Education students of Universiti Utara Malaysia. The item difficulty indices obtained through the item analysis indicate that 45% of the items are at the too easy level, 35% at the easy level, 10% at the moderate level, 5% at the difficult level, and 5% at the too difficult level.

The reliability of the instrument was assessed using Cronbach's alpha, which showed a moderate reliability and consistency value of 0.489.

10.0 REFERENCES

Alias, M. (2005). Assessment of learning outcomes: Validity and reliability of classroom tests. World Transactions on Engineering and Technology Education, 4(5). Retrieved from http://eprints.uthm.edu.my.ASSESSMENT_OF_LEARNING_OUTCOMES_2005.pdf

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Hanna, G. S., & Dettmer, P. A. (2004). Assessment for effective teaching: Using context-adaptive planning. Boston: Pearson/Allyn and Bacon.

Low Hiang Loon. (n.d.). Penganalisisan dan pentafsiran soalan selepas pemarkahan [Analysis and interpretation of questions after marking]. Retrieved 10 March 2016 from http://www.iium.edu.my

McCowan, R., & McCowan, S. (1999). Item analysis for criterion-referenced tests. [S.l.]: Distributed by ERIC Clearinghouse.

Mehrens, W. A., & Lehmann, I. J. (1991). Measurement and evaluation in education and psychology (4th ed.). Chicago: Holt, Rinehart and Winston.

Name: ______________________

Reliability Questions
1. Reliability refers to whether we are truly measuring the concept of interest in our study.
a. True
b. False

2. Reliability = Consistency
a. True
b. False

3. Measurement reliability refers to the:


a. Dependency of the scores
b. Consistency of the scores
c. Comprehensiveness of the scores
d. Accuracy of the scores

4. Which are not sources of error in assessment?

i. Examinee
ii. Examination
iii. Examiner

a. i and ii
b. i and iii
c. All of the above

5. Based on the above diagram, which statement is correct?

A. Both reliable and valid.
B. Low validity and low reliability.
C. Reliable and not valid.
D. Not reliable and not valid.

6. All of the following are factors that influence reliability EXCEPT?
a. Length of test
b. Culture
c. Range of ability
d. Scorer's objectivity
7. What effect would the following most likely have on reliability?
a. Increasing the number of tasks in the assessment.
b. Removing ambiguous tasks.
c. Changing from a multiple-choice test to an essay test covering the same material.

8. What effect does a homogeneous group of examinees have on reliability?


a. It lowers reliability.
b. It increases higher reliability.
c. It does not have any effect.

9. Which factor can increase reliability?


a. Lower the number of test items.
b. Test a heterogeneous group of examinees.
c. Narrow the range of examinees’ ability.
d. Test a homogeneous group of examinees.

10. Items that discriminate well tend to increase reliability?
a. True
b. False

11. Why could the index of reliability be negative?


a. Students who score high in the first test will get a low score in the second test, and vice versa.
b. Students who score low in the first test will get a low score in the second test.
c. Students who score high in the first test will get a high score in the second test.
d. Students who score high in the test will always score high in any test.

12. At which level is reliability if the (r) value is 0.163?


a. Reliability is very good
b. Reliability is very poor
c. Reliability is poor
d. Reliability is average

13. Which is one disadvantage of the test-retest method?


a. It requires two tests and at least two forms.
b. Different subsections of the test affect test homogeneity, thus reducing score reliability.
c. There are more possibilities for raters to disagree.
d. Reliability cannot be estimated until after the second test.

14. Which method of estimating reliability is suitable for essays?


a. Test-retest
b. Inter-rater reliability
c. Cronbach’s alpha
d. Kuder-Richardson-20
15. Inter-rater reliability means that if two different raters scored the item using the scoring rules,
they should attain similar results.
a. True
b. False

Subjective (15 Marks)

1. Define reliability. (2 marks)

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

2. Why can the scorer's objectivity affect the reliability of the score? (3 marks)

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

3. How do we test reliability? (4 marks)

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________
4. Briefly explain inter-rater reliability and intra-rater reliability. (4 marks)

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

5. How can the examiner affect the score? (2 marks)

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________
Table of SAS_Score, Mean, Standard Deviation (SD) and Z Score (Students’ Performance)

STUDENT SAS SCORE MEAN STANDARD Z-SCORE


(M) DEVIATION (X-M)/SD
(SD)
001 19 21.65 3.508 -0.75412
002 28 21.65 3.508 1.81174
003 27 21.65 3.508 1.52664
004 27 21.65 3.508 1.52664
005 24 21.65 3.508 0.67135
006 23 21.65 3.508 0.38626
007 26 21.65 3.508 1.24155
008 28 21.65 3.508 1.81174
009 25 21.65 3.508 0.95645
010 23 21.65 3.508 0.38626
011 23 21.65 3.508 0.38626
012 23 21.65 3.508 0.38626
013 21 21.65 3.508 -0.18393
014 26 21.65 3.508 1.24155
015 21 21.65 3.508 -0.18393
016 20 21.65 3.508 -0.46903
017 21 21.65 3.508 -0.18393
018 23 21.65 3.508 0.38626
019 20 21.65 3.508 -0.46903
020 21 21.65 3.508 -0.18393
021 18 21.65 3.508 -1.03922
022 18 21.65 3.508 -1.03922
023 19 21.65 3.508 -0.75412
024 19 21.65 3.508 -0.75412
025 16 21.65 3.508 -1.60941
026 16 21.65 3.508 -1.60941
027 18 21.65 3.508 -1.03922
028 18 21.65 3.508 -1.03922
029 22 21.65 3.508 0.10116
030 21 21.65 3.508 -0.18393
031 17 21.65 3.508 -1.32432
