
SGDE 4013

DEVELOPMENT AND ANALYSES OF ITEMS

PREPARED BY:
BALQIS BINTI HASINI (824695)
NUR AQLILI HANUM BINTI ABDULLAH (824762)
PUTRI AMIRAH BINTI MEGAT AZAMUDDIN (824728)

PREPARED FOR:
DR. NURLIYANA BUKHARI
TABLE OF CONTENTS

1.0 Introduction
    1.1 Deciding on a Test's Purpose
    1.2 Background of Test Takers
    1.3 Table of Specifications
    1.4 Marking Scheme / Scoring Rubric
2.0 Item Development
3.0 Test Administration Procedure
4.0 Scoring and Recording
5.0 Item Analysis
    5.1 Item Difficulty Index
    5.2 Item Discrimination Index
    5.3 Reliability
6.0 Analyses of Students' Performance
7.0 Discussion
    7.1 Difficulty Index
    7.2 Discrimination Index
    7.3 Reliability
8.0 Suggestions for Improvement
9.0 Conclusion
10.0 References
1.0 INTRODUCTION

Assessment entails the systematic gathering of evidence to judge a student's understanding of what has been learned (Alias, 2005). Through the assessments they construct, educators can then judge whether students have learned what they are expected to learn. For this project, we chose one of the topics from the assessment subject, namely reliability. We want to measure how well students understand the concept of reliability. The topic concerns the relative consistency of test scores in educational assessment. Reliability refers to how well a score represents an individual's ability and, within education, ensures that assessments accurately measure student knowledge. Reliability is an important concept for educators to understand, because reliable scores help students grasp their level of development and help educators improve their teaching effectiveness.

1.1. Deciding on a Test’s Purpose

Determining a test's objectives is the first step in the test construction process. The test objectives are the criteria used to judge whether the test is valid. The following objectives served as our guideline for gauging students' level of understanding of the reliability topic:

 Measure Post Graduate Diploma in Education students' understanding of reliability concepts.

 Measure Post Graduate Diploma in Education students' understanding of the factors affecting reliability.

 Measure Post Graduate Diploma in Education students' ability to identify the various methods of measuring reliability in any assessment.

 Measure Post Graduate Diploma in Education students' ability to apply the concept of reliability in given situations.

1.2. Background of Test Takers

Thirty-one students from the PGDE programme took part in this assessment: 6 male students and 25 female students. All 31 students were selected from Universiti Utara Malaysia (UUM).

1.3. Table of Specifications

A Table of Specifications identifies the achievement domains being measured and ensures that a fair and representative sample of questions appears on the test. It also describes the reliability topics to be covered by the test and the number of items or points associated with each topic. This table of specifications is built on three constructs: the concept of reliability, the factors affecting reliability, and the methods of measuring reliability. Each item is further classified by cognitive level (easy, average, or difficult).

Objective items (1 mark each)

No.  Construct                        Items       Total Marks   Total Items
1    Concept of reliability           Q1 - Q5     5             5
2    Factors affecting reliability    Q6 - Q10    5             5
3    Methods of reliability           Q11 - Q15   5             5
     TOTAL                                        15            15

Subjective items

No.  Construct                        Items (marks)       Total Marks   Total Items
1    Concept of reliability           QS1 (2)             2             1
2    Factors affecting reliability    QS2 (3), QS3 (4)    7             2
3    Methods of reliability           QS4 (4), QS5 (2)    6             2
     TOTAL                                                15            5

1.4. Marking Scheme / Scoring Rubric

The marking rubric is a scoring guide used to evaluate the quality of students' constructed responses.

No. Answer Reason

1. B Reliability refers to whether we are truly measuring the concept of interest in our study.
FALSE: That statement describes validity; reliability is the degree of consistency between two measures of the same thing.

2. A Reliability = Consistency

TRUE: The degree of consistency between two measures of the same thing.

3. B Measurement reliability refers to the:


a. Dependency of the scores (Assess nursing dependency)
b. Consistency of the scores (Reliability = Consistency)
c. Comprehensiveness of the scores (To assess general language
ability)
d. Accuracy of the scores (Validity)

4. C Which are not sources of error in assessment?

i. Examinee
ii. Examination
iii. Examiner
a. i and ii (Examiner is not included in the answer)
b. i and iii (Examination is not included in the answer)
c. All of the above (Examinee, examination, and examiner are all sources of error)

5. A

Based on the above diagram, which statement is correct?

a. Both reliable and valid (The target is all in the centre and consistent)
b. Low validity and low reliability
c. Reliable and not valid
d. Not reliable and not valid

6. B All of the following are factors that influence reliability EXCEPT?
a. Length of test (a longer test increases reliability; less guessing)
b. Culture (culture does not affect score reliability)
c. Range of ability (a wider spread of scores increases reliability)
d. Scorer's objectivity (increases reliability because the resulting scores are not influenced by the scorer's judgment or opinion)

7. C What effect would the following most likely have on reliability?

a. Increasing the number of tasks in the assessment (increases reliability)
b. Removing ambiguous tasks (increases reliability)
c. Changing from a multiple-choice test to an essay test covering the same material (has the most effect, because essay scoring introduces the examiner's subjectivity, which lowers reliability)

8. A What effect does a homogeneous group of examinees have on reliability?
a. It lowers reliability (a homogeneous group, with the same background/ability, narrows the spread of scores and lowers reliability)
b. It increases reliability (a heterogeneous group, with different abilities/backgrounds, increases reliability)
c. It does not have any effect

9. B Which factor can increase reliability?
a. Lower the number of test items (decreases reliability)
b. Test a heterogeneous group of examinees (increases reliability: wider spread of scores)
c. Narrow the range of examinees' ability (decreases reliability)
d. Test a homogeneous group of examinees (decreases reliability)

10. A Items that discriminate well tend to increase reliability?
a. True (reliable, because item discrimination compares the number of high scorers and low scorers who answer an item correctly)
b. False

11. A Why could the index of reliability be negative?

a. Students who score high in the first test will get a low score in the second test, and vice versa. (A negative reliability index indicates inverse consistency)
b. Students who score low in the first test will get a low score in the second test. (Consistent - positive)
c. Students who score high in the first test will get a high score in the second test. (Consistent - positive)
d. Students who score high in the test will always score high in any test. (Consistent - positive)

12. B At which level is reliability if the (r) value is 0.163?

a. Reliability is very good
b. Reliability is very poor
c. Reliability is poor
d. Reliability is average

Based on Mehrens and Lehmann (1991)

13. D Which is one disadvantage of the test-retest method?

a. It requires two tests and at least two forms. (Disadvantage of the parallel-form method)
b. Different subsections of the test affect test homogeneity, thus reducing score reliability. (Disadvantage of the split-half method)
c. There are more possibilities for raters to disagree. (Disadvantage of inter-rater reliability)
d. Reliability cannot be estimated until after the second test. (Disadvantage of test-retest reliability)

14. C Which method of estimating reliability is suitable for essays?

a. Test-retest (the same test administered twice to the same group)
b. Inter-rater reliability (related to the examiner)
c. Cronbach's alpha (suitable for dichotomous and polytomous items)
d. Kuder-Richardson-20 (suitable for dichotomous items only)

15. A Inter-rater reliability means that if two different raters scored the item using the scoring rules, they should attain similar results.
a. True (inter-rater reliability involves two different raters)
b. False (the statement is correct, so False is wrong)

Answer scheme for the subjective questions.

No. Description Marks

1. Define reliability. (2 marks)
 The degree of consistency between two measures of the same thing - 2 marks

2. Why can the scorer's objectivity affect the reliability of the score? (3 marks)
 Measures without reference to outside influences - 1 mark
 A more objectively scored assessment result - 1 mark
 The resulting scores are not influenced by the examiner's judgement - 1 mark

3. How do we test reliability? (4 marks)
 Test-retest - 1 mark
 Parallel form - 1 mark
 Inter-rater - 1 mark
 Internal consistency - 1 mark

4. Briefly explain inter-rater reliability and intra-rater reliability. (4 marks)
 Inter-rater reliability - if two different raters scored the scale using the scoring rules, they should attain the same result - 2 marks
 Intra-rater reliability - the same rater gives consistent estimates of the same measurement over time - 2 marks

5. How can the examiner affect the score? (2 marks)
 The emotional state or health of the examiner - 1 mark
 The examiner feels tired or anxious, leading to impatience when marking the paper - 1 mark
 A different scorer - 1 mark
 (Any 2 correct answers earn full marks.)

2.0 ITEM DEVELOPMENT

We created 20 questions, comprising multiple-choice and subjective items, to gauge the students' knowledge of reliability. The multiple-choice section contains 15 questions; this format can be used to measure knowledge outcomes and is the most widely used for measuring knowledge, comprehension, and application of the reliability concept. A multiple-choice question typically has three parts: a stem, the correct answer (the key), and several wrong answers (the distractors). Each question presents its response alternatives as options (i.e., A, B, C, D). The correct answer is the best answer among the options given; the distractors are incorrect but appear plausible, especially to those who have not mastered the content. To determine their understanding of the concept of reliability, the students were asked to circle "A", "B", "C", or "D". The section also contains statements that the students had to mark as true or false. Here we focused the questions on understanding the concept of reliability, the factors affecting reliability, and the methods used in measuring reliability, so students had to discriminate among options that vary in degree of correctness.

The other 5 questions were created in subjective form. Subjective items permit the student to organize and present an original answer, so they involve a wider variety of thinking skills: students recall, select, and organize what they understand about the reliability topic. This approach makes the test more challenging for students and decreases the chance of getting an answer correct by guessing, so we can find out how far students have mastered the areas of the reliability topic being tested.

The first five questions tested students on the specific concept of reliability. Questions six to ten asked about the factors that affect reliability scores. In the last five questions, students were given facts about specific reliability index results and, based on this knowledge, had to determine which index values make an assessment acceptably reliable. We also created questions to gauge the students' level of understanding of the methods of measuring reliability. The aim of these questions was to test the application of knowledge.

3.0 TEST ADMINISTRATION PROCEDURE

The objective test was prepared before the presentation on the reliability topic was delivered. There are fifteen objective questions covering the concept of reliability, sources of measurement error, factors affecting reliability, and methods used to test reliability. Right after the presentation, the students were given ten minutes to take the test on the reliability topic.

The test was held on 12 November 2018, from 12.00 pm to 12.10 pm, in the SEML classroom. The setting was semi-formal: examinees took the paper-based test at their own seats. All examinees had to submit their answers after the ten-minute time limit. The examinees also sat the test without being informed in advance that they would be tested.

During the test, one issue that may have occurred is that students could copy from a friend beside them, since all students answered the test at their own seats. It was easy for students to discuss and imitate one another, because no rules were set other than the time limit. Besides that, guessing might also have occurred, because the students had only just heard about the topic at that time and had no time to revise and prepare for the test. Apart from that, because no one monitored the students during the test, the possibility of students looking up information on the internet was high, despite the short time allowed.

4.0 SCORING AND RECORDING

After the test ended, the papers were marked manually by the group members who administered the test. The papers were marked against the pre-prepared answer scheme so that every paper was marked equally and without bias. For the objective section, each question carries 1 mark, so a wrong answer scores 0. The marks and scores were then recorded in an Excel sheet, in table form, so that the data are easy to review and analyze. The data were recorded in the Excel sheet as shown below:

(Score recorded in Excel sheet)
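As a minimal sketch of this recording step (assuming a hypothetical file name and a layout with one row per student and one column per item; the actual sheet may differ), the tabulation and total-score computation could be done with pandas:

```python
import pandas as pd

# Hypothetical file and column names: Q1..Q15 scored 0/1, QS1..QS5 in partial marks.
df = pd.read_excel("reliability_test_scores.xlsx")

item_cols = [c for c in df.columns if c.startswith("Q")]
df["SAS_SCORE"] = df[item_cols].sum(axis=1)  # total score per student

# Class summary (reported later as mean 21.65, SD 3.508)
print(df["SAS_SCORE"].mean(), df["SAS_SCORE"].std(ddof=1))
```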

5.0 ITEM ANALYSIS

ITEM    MEAN    SD       CORRECTED ITEM-TOTAL    CRONBACH'S ALPHA
                         CORRELATION             IF ITEM DELETED
Q1      0.13    0.341     0.083                  0.486
Q2      1.00    0.000     0.000                  0.491
Q3      0.87    0.341    -0.081                  0.504
Q4      0.97    0.180     0.197                  0.481
Q5      1.00    0.000     0.000                  0.491
Q6      1.00    0.000     0.000                  0.491
Q7      0.26    0.445    -0.352                  0.547
Q8      0.68    0.475     0.376                  0.440
Q9      0.61    0.495     0.554                  0.406
Q10     0.61    0.495    -0.126                  0.520
Q11     0.68    0.475     0.421                  0.432
Q12     0.68    0.475     0.376                  0.440
Q13     0.68    0.475     0.331                  0.447
Q14     0.84    0.374     0.183                  0.474
Q15     0.81    0.402     0.122                  0.481
QS1     1.74    0.682    -0.039                  0.518
QS2     1.29    1.216     0.574                  0.297
QS3     3.10    1.106     0.372                  0.402
QS4     3.68    0.871    -0.237                  0.584
QS5     1.03    0.706     0.221                  0.460

Scale Reliability (Cronbach's Alpha for SAS): 0.489
Scale Mean: 21.65
Scale SD: 3.508
N of Items: 20

(Table 1: Item Analysis)

(Chart 1: Item Analyses)
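The statistics reported in Table 1 can be reproduced from the score matrix. The following is a minimal sketch (assuming the scores are loaded as a 31 x 20 array of marks; it illustrates the calculations rather than the exact software used):

```python
import numpy as np

def cronbach_alpha(X):
    """Coefficient alpha (Cronbach, 1951) for a score matrix X (rows = examinees)."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_var = X.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = X.sum(axis=1).var(ddof=1)    # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

def item_analysis(X):
    """Print mean, SD, corrected item-total correlation, and alpha-if-deleted per item."""
    X = np.asarray(X, dtype=float)
    total = X.sum(axis=1)
    for j in range(X.shape[1]):
        item = X[:, j]
        if item.std(ddof=1) == 0:
            r = 0.0  # zero-variance items (everyone scored the same) carry no information
        else:
            r = np.corrcoef(item, total - item)[0, 1]  # correlate with total minus the item
        alpha_del = cronbach_alpha(np.delete(X, j, axis=1))
        print(f"Item {j + 1}: mean={item.mean():.2f}  sd={item.std(ddof=1):.3f}  "
              f"r_it={r:.3f}  alpha_if_deleted={alpha_del:.3f}")
```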

5.1. Item Difficulty Index

The Difficulty Index (DI) measures how difficult an item is for the group tested and is used to describe the difficulty of an item on a test. The DI ranges between 0.00 and 1.00, and the difficulty level of each test item is determined by its DI value. Table 2 below shows how the difficulty level is determined by the difficulty index:

Difficulty Index (DI)    Difficulty Level
0.00 - 0.20              Too difficult
0.21 - 0.40              Difficult
0.41 - 0.60              Moderate/Average
0.61 - 0.80              Easy
0.81 - 1.00              Too easy

(Table 2: Table of Difficulty Index Item)
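As a minimal sketch of this classification (assuming each item's scores are stored as marks earned out of a known maximum), the DI and its level according to Table 2 can be computed as follows:

```python
def difficulty_index(scores, max_marks=1):
    """Proportion of available marks earned: 0.0 = no marks at all, 1.0 = everyone full marks."""
    return sum(scores) / (len(scores) * max_marks)

def difficulty_level(di):
    """Classify a DI value according to Table 2."""
    if di <= 0.20:
        return "Too difficult"
    if di <= 0.40:
        return "Difficult"
    if di <= 0.60:
        return "Moderate/Average"
    if di <= 0.80:
        return "Easy"
    return "Too easy"

# Example: Q1 was answered correctly by 4 of the 31 examinees.
q1 = [1] * 4 + [0] * 27
print(round(difficulty_index(q1), 2))          # 0.13, as reported for Q1 in Table 1
print(difficulty_level(difficulty_index(q1)))  # "Too difficult"
```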

This study involved 31 respondents from among the Postgraduate Diploma in Education (PGDE) students. The construct testing the concept of reliability contains 7 items. The difficulty levels for this construct show that 5 items are too easy (Q2, Q3, Q4, Q5, and QS1), 1 item is at the moderate level (QS5), and 1 item is at the too difficult level (Q1).

For the construct on the factors affecting reliability there are 6 items. One item falls under the too easy level (Q6), 3 items fall under the easy level (Q8, Q9, and Q10), 1 item is at the moderate level (QS2), and 1 item is at the difficult level (Q7). Next, the construct testing the methods of measuring reliability has 7 items: 3 items are too easy (Q14, Q15, and QS4), while 4 items are easy (Q11, Q12, Q13, and QS3).

Examining the DI values and difficulty levels across the three constructs (concepts, factors, and methods), the proportion of too easy items over the whole test is 45%, easy items 35%, moderate items 10%, difficult items 5%, and too difficult items 5%. The difficulty-level percentages for the whole test are shown in the table below:

Difficulty Level     Concept   Factors Affecting   Methods of Measuring   Total Items   %
                               Reliability         Reliability
Too easy             5         1                   3                      9             45%
Easy                 -         3                   4                      7             35%
Moderate/Average     1         1                   -                      2             10%
Difficult            -         1                   -                      1             5%
Too difficult        1         -                   -                      1             5%
TOTAL                7         6                   7                      20            100%

(Table 3: % Overall Test Item Difficulty Level)

The item difficulty level is obtained from the respondents' scores. Although Table 3 shows 1 item (5%) at the too difficult level, all test items were retained at this stage, because retention cannot rely solely on an item's difficulty level: which items are retained in the actual study is determined by the item discrimination index (ID) analyzed next.

A too difficult item indicates that respondents failed to answer it correctly. The items analyzed in this study actually demonstrate the respondents' existing knowledge. Respondents were not informed in advance, precisely to prevent them from making early preparations; only our team and the lecturer knew, and their cooperation allowed the study to run.

Given that this set of items on reliability knowledge comprehensively covers all the subtopics of reliability, the difficulty level obtained for each item indicates the respondents' level of understanding of these topics. Therefore, the findings of the item analysis based on DI values help us identify the strengths and weaknesses of the actual survey respondents.

5.2. Item Discrimination Index

The discrimination index (ID) is a measure of how effectively an item discriminates between the high-scoring and low-scoring groups; it is intended to show the difference between the two groups on an item. Usually, an item with a high difficulty level can only be answered correctly by the high-scoring group, although some items can be answered by both groups.

Table of Index Discrimination

Index Discrimination (ID)   Description of Item
ID > 0.4                    High positive discrimination
0.2 < ID < 0.4              Moderate positive discrimination
0 < ID < 0.2                Low positive discrimination
ID < 0                      Negative discrimination: the lower group performs better than the higher group

(Source: Low Hiang Loon, n.d.)

High performers should be more likely to answer a good item correctly, and low performers more likely to answer it incorrectly. The index ranges from -1.00 to +1.00, with an ideal value of +1.00. A positive coefficient indicates that high-scoring examinees tended to score higher on the item, while a negative coefficient indicates that low-scoring examinees tended to score higher on it. On items that discriminate well, more high scorers than low scorers answer correctly. The higher the discrimination index, the better the item, because high values indicate that the item discriminates in favor of the upper group, which should answer more items correctly. If more low scorers answer an item correctly, the index is negative and the item is probably flawed (McCowan & McCowan, 1999).
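Table 1 reports the corrected item-total correlation, which plays this discriminating role in our analysis; the classic upper-lower group index described above can be sketched as follows (an illustration, not the exact computation behind Table 1):

```python
import numpy as np

def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Classic upper-lower index: proportion correct in the top group minus
    proportion correct in the bottom group (range -1.00 to +1.00)."""
    item_scores = np.asarray(item_scores, dtype=float)
    order = np.argsort(total_scores)               # examinees sorted by total score
    n = max(1, int(round(len(order) * fraction)))  # size of each tail group
    low, high = order[:n], order[-n:]
    return item_scores[high].mean() - item_scores[low].mean()
```

A positive value means the upper group outperformed the lower group on the item; a negative value flags the item for revision.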

There are 3 constructs measured in this study: the concept of reliability, the factors affecting reliability, and the methods of reliability. Based on Table 1 above, for the concept construct, 2 items (questions 2 and 5) have an ID of zero, which shows that these items do not discriminate in any way at all. Question 3 (objective) and question 1 (subjective) have negative ID values, meaning that more people in the low group than in the high group got the item correct. Of the remaining items, questions 1 and 4 have low positive ID values and question 5 (subjective) has a moderate positive ID value.

For the construct on the factors affecting reliability, 1 item (question 6) has an ID of zero. Two items (questions 7 and 10) have negative discrimination values. One item (question 8) has a moderate positive value, while the two remaining items, question 9 and question 2 (subjective), have high positive discrimination.

The third construct, methods of reliability, shows 3 items with moderate positive discrimination values: questions 12 and 13, and question 3 (subjective). The lowest ID (-0.237) belongs to question 4 (subjective), while the item with the highest ID (0.421) is question 11. The remaining 2 items, questions 14 and 15, have low positive discrimination.

Description   Concept   Factors Affecting   Methods of    Number of   Percentage (%)
                        Reliability         Reliability   Items
Low           4         1                   2             7           35
Moderate      1         1                   3             5           25
High          0         2                   1             3           15
Negative      2         2                   1             5           25
TOTAL         7         6                   7             20          100

(Table 4: Percentage of Item Discrimination)

From the analysis in the table above, 5 items (25%) overall have negative discrimination indices; these items should be eliminated or completely revised, because more students in the low group than in the high group got them correct (the items are not doing what they should). Meanwhile, only 3 items (15%) have high discrimination indices; these items are functioning satisfactorily, because more people in the high group than in the low group got them correct (the items are doing what they should).

5.3. Reliability

According to Hanna and Dettmer (2004), reliability refers to the consistency of the measures produced by a tool. Indeed, the reliability of an exam means the consistency of the scores produced by the test.

This study uses a single-administration procedure to determine the reliability and consistency of the research instrument. The method selected is Cronbach's alpha formula; Cronbach (1951) used the alpha coefficient as a measure of internal consistency. This method is suitable for dichotomous and polytomous items, especially essay items, whose scores can take a wide range of values.
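For reference, the coefficient is computed as shown below; plugging in the item SDs and the scale SD from Table 1 (so the sum of item variances is about 6.58 and the total-score variance is 3.508^2, about 12.31) reproduces the reported value:

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_{Y_i}^{2}}{\sigma_{X}^{2}}\right) = \frac{20}{19}\left(1 - \frac{6.58}{12.31}\right) \approx 0.489

where k is the number of items (20), \sigma_{Y_i}^{2} is the variance of item i, and \sigma_{X}^{2} is the variance of the total scores.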

Reliability Statistics (whole test)
Cronbach's Alpha: 0.489    N of Items: 20

The value of Cronbach's alpha obtained for the whole test is 0.489. The level of understanding measured by this study consists of three main constructs, namely concept, factor, and method.

Reliability Statistics (by construct)
Concept:   Cronbach's Alpha 0.188   (7 items)
Factor:    Cronbach's Alpha 0.239   (6 items)
Method:    Cronbach's Alpha 0.367   (7 items)

Furthermore, the alpha obtained for the concept construct is 0.188, for the factor construct 0.239, and for the method construct 0.367.

Mehrens and Lehmann (1991) listed five types of reliability and the methods of determining their indices. The reliability index lies between -1.00 and +1.00, and a negative reliability index indicates inverse consistency. Normally indexes are positive and, for tests, an index between 0.65 and 0.85 is adequate. As a guide, the reliability of a test can be interpreted from the index (r) as shown below.

Index (r)     Item Description
< 0.20        Very poor
0.21 - 0.40   Poor
0.41 - 0.60   Moderate
0.61 - 0.80   Good
0.81 - 1.00   Very good

(Table 5: Reliability Index)

Hence, the instrument of this study, analyzed using Cronbach's alpha, obtained an overall reliability and consistency value of 0.489 for the whole item set, which is moderate.

6.0 ANALYSES OF STUDENTS’ PERFORMANCE

(Table 6: Score of Mean, Standard Deviation and Z-Score)

Based on Table 6, the mean score is 21.65. The mean score is the average of the test scores for the class, and it shows that the students performed quite well. However, the content or constructs being assessed probably need to be reviewed in class, because distractor analysis can help the educator identify which misconceptions are shared by the majority of the students and correct them.

The standard deviation (SD) is another way of showing the spread of scores: it measures the degree to which the group of scores deviates from the mean. Table 6 shows that the standard deviation for the overall scores is 3.508. A large standard deviation means there is much variability in the group's test scores, i.e., students performed quite differently on the test.

The z-score is a conversion of a raw score into a standard score based on the mean and the standard deviation. A positive z-score means the value is larger than the mean: the maximum score has a z-score of 1.81174, showing that it is 1.81174 standard deviations above the mean. A negative z-score means the value is smaller than the mean: the minimum score has a z-score of -1.60941, showing that it is 1.60941 standard deviations below the mean.
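As a minimal sketch, the statistics in Table 6 can be reproduced directly from the recorded SAS scores:

```python
import numpy as np

# SAS scores of the 31 examinees, taken from Table 6
scores = np.array([19, 28, 27, 27, 24, 23, 26, 28, 25, 23, 23, 23, 21, 26, 21,
                   20, 21, 23, 20, 21, 18, 18, 19, 19, 16, 16, 18, 18, 22, 21, 17])

mean = scores.mean()      # 21.65
sd = scores.std(ddof=1)   # 3.508 (sample standard deviation)
z = (scores - mean) / sd  # z = (X - M) / SD

print(z.max())  # ~ +1.81 for the top score of 28
print(z.min())  # ~ -1.61 for the bottom score of 16
```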

7.0 DISCUSSION

7.1. Difficulty Index

The item difficulty levels show that the only item at the too difficult level is Q1, with a difficulty index of 0.13. This may be due to respondents' lack of understanding of, or confusion about, the concept-of-reliability subtopic, so the chance of answering this question incorrectly was high. Only one item, Q7, with a difficulty index of 0.26, falls under the difficult level. This item is about the factors that affect reliability scores. It may have fallen under the difficult category because the answer choices confused respondents; indeed, even though this question was taken from a textbook, we, as the team that designed the item, were also unsure which answer was correct.

Two items fall at the moderate level: QS2, with a difficulty index of 0.43, and QS5, with 0.52. QS2 tests respondents' understanding of the factors affecting reliability scores, while QS5 tests the concept of reliability in scoring. For item QS2, some respondents simply left the question blank, and some of those who answered did not give as many points as the question required: the item carries 3 marks, and respondents who gave only 1 or 2 points did not earn full marks. Item QS5 was designed to test the concept of how the examiner affects scores in assessment; it likewise saw some respondents give fewer answers than the 2 marks required, so they did not earn full marks.

Seven items are categorized at the easy level: Q8, Q9, Q10, Q11, Q12, Q13, and QS3, with difficulty indices of 0.68, 0.61, 0.61, 0.68, 0.68, 0.68, and 0.77 respectively. Q8, Q9, and Q10 test respondents' understanding of the factors affecting reliability scores, while Q11, Q12, Q13, and QS3 test the methods used to assess reliability. These items questioned the respondents directly, with simple and clear answer choices; the subjective item likewise asked a direct question, so most respondents could answer easily. Other possible explanations are that respondents referred to notes or the internet, or discussed the answers with friends near them.

Lastly, 9 items fall at the too easy level: Q2, Q3, Q4, Q5, Q6, Q14, Q15, QS1, and QS4, with difficulty indices of 1.00, 0.87, 0.97, 1.00, 1.00, 0.84, 0.81, 0.87, and 0.92 respectively. Items Q2, Q3, Q4, Q5, and QS1 test respondents' understanding of the concept of reliability, item Q6 tests the factors that affect reliability scores, and items Q14, Q15, and QS4 test the methods used to assess reliability. These items also questioned the respondents directly and clearly, with simple answer choices, and the subjective questions required only short, direct answers. Apart from that, respondents may have referred to notes or the internet to find the answers.

7.2. Discrimination Index

Based on the discrimination indices of the test above, 7 items had low positive discrimination. These items deserve consideration, because when an item is so easy that nearly everyone gets it correct, or so difficult that nearly everyone gets it wrong, it becomes very difficult to discriminate those who have actually mastered the content from those who have not.

We found that the distractors of objective question number 6 produced a discrimination index of 0 because they are too obvious, showing that they are not working at all. It seems the test-maker gave a slight clue in the length of the distractors: option B was chosen by all the students, because it is obvious that the other options are factors of reliability while option B, culture, is not related to the factors of reliability at all.

Five of the 20 items had negative discrimination. When an item discriminates negatively, the most knowledgeable students tend to get the item wrong while the least knowledgeable students get it right. A low-group student may guess or refer to other references, select that response, and come up with the correct answer. High-group students may be suspicious of a question stem that looks confusing, especially question number 7; they may take the harder path to solving the problem, read too much into the question, and end up being less successful than those who guess.

7.3. Reliability

The reliability index for this study was 0.489, which is moderate. We cannot expect assessment results to be perfectly consistent: numerous factors other than the quality being measured may influence assessment results.

Among the factors that can affect the reliability of the items is the range of ability. The respondents in this study had homogeneous ability, which caused the reliability index to be moderate. If the selected respondents had heterogeneous ability, the reliability index would increase.

Item difficulty level also affects the reliability index. Reliability will be low if a test is so easy or so difficult that every student gets most or all of the items correct or wrong. The items given to the 31 respondents were mostly at the too easy level (45%). When most items fall in the same category and the difficulty levels are not diversified, the reliability index will be low.

Furthermore, examinees' scores are affected by guessing. When the items were distributed, some respondents did not fully understand the questions and only guessed; unfortunately, some guessed the answers correctly. This situation decreases the reliability index.

Last but not least, there are the distractors of the dichotomous items. In this study, these distractors were quite weak and could not make the respondents hesitate over the answer choices, so most respondents answered correctly without any hesitation. Indirectly, this decreases the reliability index.

8.0 SUGGESTIONS FOR IMPROVEMENT

Once the analysis was carried out, we saw many deficiencies in the items that need to be improved. Improvements are needed so that the items can achieve their target: the items must be sound so that the scores obtained by the examinees evaluate what is supposed to be measured. Several suggestions can be made to fix the items.

The questions given to the 31 Postgraduate Diploma in Education students of Universiti Utara Malaysia have low item difficulty. In the future, we need to diversify the items' difficulty; when the difficulty varies from item to item, the test can achieve higher reliability and validity.

In addition, every test should have an enforced time limit. When the items were distributed, we told the students to answer within ten minutes, but we did not collect the papers back within the time given; therefore, they completed the items without a real time limit. Reliability and validity can be affected when no time limit is enforced. In the future, we should state and enforce the time limit.

Furthermore, examinees should be spaced apart when answering the items; otherwise the rate of copying between them will be high. Examinees may discuss the items with each other and answer based not on their own knowledge but under the influence of other examinees. They also had the opportunity to look up answers on the internet while answering.

Before the students started answering the questions, we presented the chapter in only twenty minutes, because the class was to end in 30 minutes. We could not confirm that our knowledge and information were successfully delivered to them. Therefore, our suggestion for improving this issue in the future is to allocate enough time for the presentation, making sure the students receive it and gain a deep understanding of the chapter.

Next, the distractors of the dichotomous items are mostly very weak: respondents can identify the answer directly and are not made to hesitate when choosing. Therefore, as an improvement for the next study, we need to create distractors that make the respondents weigh the given options carefully. In this way, the reliability of the study will increase.

9.0 CONCLUSION

This study involved 31 respondents from among the Postgraduate Diploma in Education students of Universiti Utara Malaysia. The item difficulty indices obtained through the item analysis indicate that 45% of the items are at the too easy level, 35% at the easy level, 10% at the moderate level, 5% at the difficult level, and 5% at the too difficult level.

The reliability of the instrument was assessed using Cronbach's alpha, which showed a moderate reliability and consistency value of 0.489.

10.0 REFERENCES

Alias, M. (2005). Assessment of learning outcomes: Validity and reliability of classroom tests. World Transactions on Engineering and Technology Education, 4(5). Retrieved from http://eprints.uthm.edu.my.ASSESSMENT_OF_LEARNING_OUTCOMES_2005.pdf

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Hanna, G. S., & Dettmer, P. A. (2004). Assessment for effective teaching: Using context-adaptive planning. Boston: Pearson/Allyn and Bacon.

Low Hiang Loon. (n.d.). Penganalisisan dan pentafsiran soalan selepas pemarkahan [Analysis and interpretation of questions after marking]. Retrieved 10 March 2016 from http://www.iium.edu.my

McCowan, R., & McCowan, S. (1999). Item analysis for criterion-referenced tests. [S.l.]: Distributed by ERIC Clearinghouse.

Mehrens, W. A., & Lehmann, I. J. (1991). Measurement and evaluation in education and psychology (4th ed.). Chicago: Holt, Rinehart and Winston.

Name: ______________________

Reliability Questions
1. Reliability refers to whether we are truly measuring the concept of interest in our study.
a. True
b. False

2. Reliability = Consistency
a. True
b. False

3. Measurement reliability refers to the:


a. Dependency of the scores
b. Consistency of the scores
c. Comprehensiveness of the scores
d. Accuracy of the scores

4. Which are not sources of error in assessment?

i. Examinee
ii. Examination
iii. Examiner

a. i and ii
b. i and iii
c. All of the above

5. Based on the above diagram, which statement is correct?

A. Both reliable and valid.
B. Low validity and low reliability.
C. Reliable and not valid.
D. Not reliable and not valid.

6. All of the following are factors that influence reliability EXCEPT?
a. Length of test
b. Culture
c. Range of ability
d. Scorer's objectivity
7. What effect would the following most likely have on reliability?
a. Increasing the number of tasks in the assessment.
b. Removing ambiguous tasks.
c. Changing from a multiple-choice test to an essay test covering the same material.

8. What effect does a homogeneous group of examinees have on reliability?


a. It lowers reliability.
b. It increases higher reliability.
c. It does not have any effect.

9. Which factor can increase reliability?


a. Lower the number of test items.
b. Test a heterogeneous group of examinees.
c. Narrow the range of examinees’ ability.
d. Test a homogeneous group of examinees.

10. Items that discriminate well tend to increase reliability?
a. True
b. False

11. Why could the index of reliability be negative?


a. Students who score high in the first test will get a low score in the second test, and vice versa.
b. Students who score low in the first test will get a low score in the second test.
c. Students who score high in the first test will get a high score in the second test.
d. Students who score high in the test will always score high in any test.

12. At which level is reliability if the (r) value is 0.163?


a. Reliability is very good
b. Reliability is very poor
c. Reliability is poor
d. Reliability is average

13. Which is one disadvantage of the test-retest method?


a. It requires two tests and at least two forms.
b. Different subsections of the test affect test homogeneity, thus reducing score reliability.
c. There are more possibilities for raters to disagree.
d. Reliability cannot be estimated until after the second test.

14. Which method of estimating reliability is suitable for essays?


a. Test-retest
b. Inter-rater reliability
c. Cronbach’s alpha
d. Kuder-Richardson-20
15. Inter-rater reliability means that if two different raters scored the item using the scoring rules,
they should attain similar results.
a. True
b. False

Subjective (15 Marks)

1. Define reliability. (2 marks)

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

2. Why can the scorer's objectivity affect the reliability of the score? (3 marks)

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

3. How do we test reliability? (4 marks)

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________
4. Briefly explain inter-rater reliability and intra-rater reliability. (4 marks)

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

5. How can the examiner affect the score? (2 marks)

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________
Table of SAS_Score, Mean, Standard Deviation (SD) and Z Score (Students’ Performance)

STUDENT SAS SCORE MEAN STANDARD Z-SCORE


(M) DEVIATION (X-M)/SD
(SD)
001 19 21.65 3.508 -0.75412
002 28 21.65 3.508 1.81174
003 27 21.65 3.508 1.52664
004 27 21.65 3.508 1.52664
005 24 21.65 3.508 0.67135
006 23 21.65 3.508 0.38626
007 26 21.65 3.508 1.24155
008 28 21.65 3.508 1.81174
009 25 21.65 3.508 0.95645
010 23 21.65 3.508 0.38626
011 23 21.65 3.508 0.38626
012 23 21.65 3.508 0.38626
013 21 21.65 3.508 -0.18393
014 26 21.65 3.508 1.24155
015 21 21.65 3.508 -0.18393
016 20 21.65 3.508 -0.46903
017 21 21.65 3.508 -0.18393
018 23 21.65 3.508 0.38626
019 20 21.65 3.508 -0.46903
020 21 21.65 3.508 -0.18393
021 18 21.65 3.508 -1.03922
022 18 21.65 3.508 -1.03922
023 19 21.65 3.508 -0.75412
024 19 21.65 3.508 -0.75412
025 16 21.65 3.508 -1.60941
026 16 21.65 3.508 -1.60941
027 18 21.65 3.508 -1.03922
028 18 21.65 3.508 -1.03922
029 22 21.65 3.508 0.10116
030 21 21.65 3.508 -0.18393
031 17 21.65 3.508 -1.32432
