1.0 OVERVIEW OF ASSESSMENT: CONTEXT, ISSUES AND TRENDS
SYNOPSIS
LEARNING OUTCOMES
By the end of this topic, you will be able to:
1.2
1.
2.
3.
FRAMEWORK OF TOPICS
CONTENT
INTRODUCTION
1.4
1.4.1 Test
The four terms above are frequently used interchangeably in academic
discussions. A test is a subset of assessment intended to measure
a test-taker's language proficiency, knowledge, performance or skills. Testing
is a type of assessment technique. It is a systematically prepared procedure
that happens at a point in time when a test-taker gathers all his abilities to
achieve ultimate performance because he knows that his responses are being
evaluated and measured. A test is first a method of measuring a test-taker's
ability, knowledge or performance in a given area; and second, it must
measure.
Bachman (1990), who was also quoted by Brown, defined a test as a
process of quantifying a test-taker's performance according to explicit
procedures or rules.
1.4.2 Assessment
1.4.3 Evaluation
Evaluation is another confusing term. Many are confused between
evaluation and testing. Evaluation does not necessarily entail testing. In
reality, evaluation is involved when the results of a test (or other assessment
procedure) are used for decision-making (Bachman, 1990, pp. 22-23).
Evaluation involves the interpretation of information. If a teacher simply
records numbers or makes check marks on a chart, it does not constitute
evaluation. When a tester or marker evaluates, s/he values the results in
such a way that the worth of the performance is conveyed to the test-taker.
This is usually done with some reference to the consequences, either good or
bad, of the performance. This is commonly practised in applied linguistics
research, where the focus is often on describing processes, individuals, and
groups, and the relationships among language use, the language use
situation, and language ability.
Test scores are an example of measurement, and conveying the
meaning of those scores is evaluation. However, evaluation can occur
without measurement. For example, if a teacher appraises a student's correct
oral response with words like "Excellent insight, Lilly!", it is evaluation.
1.4.4 Measurement
a) research methodology;
b) practical advances;
c)
d)
e) concerns with the ethics of language testing and professionalising the field
The beginning of the new millennium is another exciting time for
Pre-Independence
Implementation of the Razak Report (1956)
Implementation of the Rahman Talib Report (1960)
Implementation of the Cabinet Report (1979)
Implementation of the Malaysia Education Blueprint (2013-2025)
The emphasis is on School-Based Assessment (SBA). It was first
introduced in 2002. It is a new system of assessment and is one of the new
areas where teachers are directly involved. The national examination and
school-based assessments will be revamped in stages, whereby by 2016,
at least 40% of questions in Ujian Penilaian Sekolah Rendah (UPSR) and
50% in Sijil Pelajaran Malaysia (SPM) are higher-order thinking skills
questions.
TOPIC 2
ROLE AND PURPOSES OF ASSESSMENT IN TEACHING AND LEARNING
Tutorial question
Examine the contributing factors to the changing trends of
language assessment.
Create and present findings using graphic organisers.
States, two common standardised English Language tests once used were
the Modern Language Aptitude Test (MLAT; Carroll & Sapon, 1958) and the
Pimsleur Language Aptitude Battery (PLAB; Pimsleur, 1966). Since there is
no research to show unequivocally that these kinds of tasks predict
communicative success in a language, apart from untutored language
acquisition, standardised aptitude tests are seldom used today with the
exception of identifying foreign language disability (Stansfield & Reed, 2004).
Progress Tests
These tests measure the progress that students are making towards
defined course or programme goals. They are administered at various stages
throughout a language course to see what the students have learned,
perhaps after certain segments of instruction have been completed. Progress
tests are generally teacher-produced and narrower in focus than
achievement tests because they cover a smaller amount of material and
assess fewer objectives.
Placement Tests
These tests, on the other hand, are designed to assess students' level
of language ability for placement in an appropriate course or class. This type
of test indicates the level at which a student will learn most effectively. The
main aim is to create groups, which are homogeneous in level. In designing a
placement test, the test developer may choose to base the test content either
on a theory of general language proficiency or on learning objectives of the
curriculum. In the former, institutions may choose to use a well-established
proficiency test such as the TOEFL or IELTS exam and link it to curricular
benchmarks. In the latter, tests are based on aspects of the syllabus taught at
the institution concerned.
In some contexts, students are placed according to their overall rank in
the test results. At other institutions, students are placed according to their
level in each individual skill area. Elsewhere, placement test scores are used
to determine if a student needs any further instruction in the language or could
TOPIC 3
CONTENT
SESSION THREE (3 hours)
3.3
3.5
Norm-Referenced Test
A test that measures students' achievement as compared to other
students in the group
Formative Test
Formative test or assessment, as the name implies, is a kind of
students may also need to change and improve. Due to the demanding
nature of this formative test, numerous teachers prefer not to adopt it,
although giving back any assessed homework or achievement test presents
both teachers and students with healthy and ultimate learning opportunities.
3.6
Summative Test
Summative test or assessment, on the other hand, refers to the kind of
measurement that summarises what the student has learnt or gives a one-off
measurement. In other words, summative assessment is assessment of
student learning. Students are more likely to experience assessment carried
out individually, where they are expected to reproduce discrete language
items from memory. The results are then used to yield a school report and to
determine what students know and do not know. It does not necessarily
provide a clear picture of an individual's overall progress or even his/her full
potential, especially if s/he is hindered by the fear factor of physically sitting
for a test, but it may provide straightforward and invaluable results for
teachers to analyse. It is given at a point in time to measure student
achievement in relation to a clearly defined set of standards, but it does not
necessarily show the way to future progress. It is given after learning is
supposed to occur. End-of-the-year tests in a course and other general
proficiency or public exams are some examples of summative tests or
assessment. Table 3.1 shows formative and summative assessments that
are common in schools.
Formative Assessment: Anecdotal records; Quizzes and essays; Diagnostic tests
Summative Assessment: Final exams; National exams (UPSR, PMR, SPM, STPM); Entrance exams
Table 3.1: Common formative and summative assessments in schools
3.7
Objective Test
ii. True-false items/questions;
iii. Matching items/questions; and
iv.
2.
Stem
Every multiple-choice item consists of a stem (the body of the item).
The stem should be short or simple, compact and clear. However, it must not
easily give away the right answer.
3.
Options or alternatives
These are the possible responses to a test item. There are
usually between three and five options/alternatives to choose from.
4.
Key
This is the correct response; it can be either the correct answer or the
best one. Usually, for a good item, the correct answer is not obvious
compared to the distractors.
5. Distractors
A distractor is an incorrect option included to distract students from
selecting the correct answer. An excellent distractor is almost the same as
the correct answer, but it is not correct.
When building multiple-choice items for both classroom-based and
large-scale standardised tests, consider the four guidelines below:
i.
ii.
iii.
Make certain that the intended answer is clearly the only correct
one;
iv.
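The anatomy described above (stem, options, key, distractors) can be modelled as a small data structure. The sketch below is purely illustrative: the grammar item it contains is invented, and the class is not part of any cited test or library.

```python
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    stem: str       # the body of the item: short, compact and clear
    options: list   # the three to five possible responses
    key: str        # the correct (or best) response

    def __post_init__(self):
        # Enforce the guidelines stated above.
        if not 3 <= len(self.options) <= 5:
            raise ValueError("an item usually offers three to five options")
        if self.key not in self.options:
            raise ValueError("the key must appear among the options")

    @property
    def distractors(self):
        """Every option other than the key is a distractor."""
        return [o for o in self.options if o != self.key]


# A hypothetical grammar item:
item = MultipleChoiceItem(
    stem="She ___ to school every day.",
    options=["go", "goes", "going", "gone"],
    key="goes",
)
print(item.distractors)  # ['go', 'going', 'gone']
```

Representing an item this way makes the guidelines checkable: an item with the key missing from its options, or with too few alternatives, is rejected at construction time.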
3.8
Subjective Test
Contrary to an objective test, a subjective test is evaluated by giving an
predict. Generally, subjective tests will test the higher skills of analysis,
synthesis, and evaluation. In short, subjective tests enable students to be
more creative and critical. Table 3.2 shows various types of objective and
subjective assessments.
Objective Assessments: True/False Items; Multiple-choice Items; Multiple-response Items; Matching Items
Subjective Assessments: Extended-response Items; Restricted-response Items; Essay
Table 3.2: Various types of objective and subjective assessments
Some have argued that the distinction between objective and
subjective assessments is neither useful nor accurate because, in reality,
there is no such thing as objective assessment. In fact, all assessments are
created with inherent biases built into decisions about relevant subject matter
and content, as well as cultural (class, ethnic, and gender) biases.
Reflection
1.
Objective test items are items that have only one answer or correct
response. Describe in-depth the multiple-choice test item.
2.
Discussion
1. Identify at least three differences between formative and summative
assessment.
2. What are the strengths of multiple-choice items compared to essay
items?
3. Informal assessments are often unreliable, yet they are still
important in classrooms. Explain why this is the case, and defend
your explanation with examples.
4. Compare and contrast Norm-Referenced Tests with Criterion-Referenced Tests.
TOPIC 4
4.0
SYNOPSIS
LEARNING OUTCOMES
By the end of this topic, you will be able to:
4.2
1.
2.
3.
FRAMEWORK OF TOPICS
Reliability
Interpretability
Validity
Types of
Tests
Practicality
Authenticity
CONTENT
Washback Effect
Objectivity
INTRODUCTION
Assessment is a complex, iterative process requiring skills,
RELIABILITY (consistency)
Reliability means the degree to which an assessment tool produces
stable and consistent results. For example, if two raters agree 8 out of 10
times, the test has an 80% inter-rater reliability rate. Rater reliability is
assessed by having two or more independent judges score the test. The
scores are then compared to determine the consistency of the raters'
estimates.
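The 8-out-of-10 arithmetic above amounts to a simple percent-agreement calculation, which can be sketched as follows. This is a minimal illustration only: the two score lists are invented, and percent agreement is the crudest index of rater consistency (it is not chance-corrected, unlike statistics such as Cohen's kappa).

```python
def inter_rater_agreement(rater_a, rater_b):
    """Percent agreement between two raters scoring the same scripts."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be the same non-zero length")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100 * matches / len(rater_a)


# Two raters mark the same ten scripts (hypothetical band scores).
# They agree on 8 of the 10 scripts:
rater_a = [5, 4, 3, 5, 2, 4, 4, 3, 5, 1]
rater_b = [5, 4, 3, 5, 2, 4, 3, 3, 4, 1]
print(inter_rater_agreement(rater_a, rater_b))  # 80.0
```

An 80.0 here corresponds to the "80% inter-rater reliability rate" in the text: the raters' marks coincide on eight of the ten scripts compared.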
Intra-rater reliability is an internal factor. Its main aim is
consistency within the rater. For example, if a rater
(teacher) has many examination papers to mark and does not have
enough time to mark them, s/he might take much more care with the
first, say, ten papers, than the rest. This inconsistency will affect the
students' scores; the first ten might get higher scores. In other
words, while inter-rater reliability involves two or more raters, intra-rater
reliability is the consistency of grading by a single rater.
Scores on a test are rated by a single rater/judge at different times.
When we grade tests at different times, we may become
inconsistent in our grading for various reasons. Some papers that are
graded during the day may get our full and careful attention, while
others that are graded towards the end of the day are very quickly
glossed over. As such, intra-rater reliability determines the
consistency of our grading.
Both inter-and intra-rater reliability deserve close attention in
that test scores are likely to vary from rater to rater or even from the
same rater (Clark, 1979).
4.4.2 Test Administration Reliability
There are a number of reasons which influence test
administration reliability. Unreliability occurs due to outside
interference like noise, variations in photocopying, temperature
variations, the amount of light in various parts of the room, and even
the condition of desk and chairs. Brown (2010) stated that he once
witnessed the administration of a test of aural comprehension in which
an audio player was used to deliver items for comprehension, but due
to street noise outside the building, test-takers sitting next to open
windows could not hear the stimuli clearly. According to him, that was
a clear case of unreliability caused by the conditions of the test
administration.
b.
Teacher-Student factors
In most tests, it is normal for teachers to construct and
c.
Environment factors
An examination environment certainly influences test-takers and
Because students' grades are dependent on the way tests are being
administered, test administrators should strive to provide clear and
accurate instructions, sufficient time and careful monitoring of tests to
improve the reliability of their tests. A test-retest technique can be
used to determine test reliability.
e.
Marking factors
It is common that different markers award different marks for the same
answer, even with a prepared mark scheme. A marker's assessment
may vary from time to time and with different situations. Conversely, this
does not happen with objective types of tests, since the responses are
fixed. Thus, objectivity is a condition for reliability.
4.5
VALIDITY
Validity refers to the evidence base that can be provided about
Content validity: Does the assessment content cover what you want to
assess? Have satisfactory samples of language and language skills been
selected for testing?
Concurrent (parallel) validity: Can you use the current test score to
estimate scores of other criteria? Does the test correlate with other existing
measures?
the criteria (concepts, skills and knowledge) relevant to the purpose of the
examination. The important notion here is the purpose.
juncture, (lack of) hesitations, and other elements within the construct
of fluency. Tests are, in a manner of speaking, operational definitions
of constructs in that their test tasks are the building blocks of the entity
that is being measured (see Davidson, Hudson, & Lynch, 1985; T.
McNamara, 2000).
4.5.4 Concurrent validity
Concurrent validity is the use of another more reputable and
recognised test to validate one's own test. For example, suppose you
come up with your own new test and would like to determine its
validity. If you choose to use concurrent validity, you would
look for a reputable test and compare your students' performance on
your test with their performance on the reputable and acknowledged
test. In concurrent validity, a correlation coefficient is obtained and used
to generate an actual numerical value. A high positive correlation of 0.7
to 1 indicates that the learners' scores are relatively similar for the two
tests or measures.
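Since the concurrent-validity check above rests on a correlation coefficient, the calculation can be sketched as a plain Pearson correlation between two sets of scores. The two score lists below are invented for illustration; they stand in for a new classroom test and an established, reputable test taken by the same eight learners.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two sets of test scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical scores of eight learners on the two tests:
new_test = [55, 60, 68, 70, 75, 80, 88, 90]
reputable_test = [50, 58, 65, 72, 74, 83, 85, 92]

r = pearson_r(new_test, reputable_test)
print(r > 0.7)  # True: the coefficient falls in the 0.7-1 range described above
```

A coefficient in the 0.7-1 range, as here, would suggest the new test ranks learners much as the established test does, which is the evidence concurrent validity asks for.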
For example, in a course unit whose objective is for students to
be able to orally produce voiced and unvoiced stops in all possible
phonetic environments, the results of one teacher's unit test might be
compared with an independent assessment such as a commercially
produced test of similar phonemic proficiency. Since criterion-related
evidence usually falls into one of two categories, concurrent and
predictive validity, a classroom test designed to assess mastery of a
point of grammar in communicative use will have criterion validity if
test scores are verified either by observed subsequent behaviour or by
other communicative measures of the grammar point in question.
4.5.5 Predictive validity
Predictive validity is closely related to concurrent validity in that
it too generates a numerical value. For example, the predictive validity
4.5.7 Objectivity
The objectivity of a test refers to the ability of
teachers/examiners who mark the answer scripts. Objectivity refers to
the extent to which an examiner examines and awards scores to the
same answer script. The test is said to have high objectivity when the
examiner is able to give the same score to similar answers, guided
by the mark scheme. An objective test has the highest level of
objectivity because the scoring is not influenced by the examiner's
skills and emotions. Meanwhile, a subjective test is said to have the
lowest objectivity. Based on various studies, different examiners tend
to award different scores to an essay test. It is also possible that the
same examiner would give different scores to the same essay if s/he
re-checks it at different times.
4.5.8 Washback effect
The term 'washback' or backwash (Hughes, 2003, p.1)
refers to the impact that tests have on teaching and learning. Such
impact is usually seen as being negative: tests are said to force
TOPIC 5
5.0
SYNOPSIS
Topic 5 exposes you to the stages of test construction, the preparation of the
test blueprint/test specifications, the elements in a Test Specifications
Guidelines, and the importance of following the guidelines for constructing
test items. Then we look at the various test formats that are appropriate for
language assessment.
5.1
LEARNING OUTCOMES
By the end of this topic, you will be able to:
1.
2.
3.
4.
5.
6.
7.
8.
validity
identify the elements in a Test Specifications Guidelines
demonstrate an understanding of the importance of following the
9.
CONTENT
SESSION FIVE (3 hours)
5.3
i. determining
ii. planning
iii. writing
iv. preparing
v. reviewing
vi. pre-testing
vii. validating
5.3.1 Determining
The essential first step in testing is to make oneself perfectly
clear about what it is one wants to know and for what purpose. When
we start to construct a test, the following questions have to be
answered.
5.3.2 Planning
The first form that the solution takes is a set of specifications for
the test. This will include information on: content, format and timing,
criteria, levels of performance, and scoring procedures.
In this stage, the test constructor has to determine the content by
answering the following questions:
Describing the purpose of the test;
Describing the characteristics of the test takers, the nature of the
population of the examinees for whom the test is being designed.
Defining the nature of the ability we want to measure;
Developing a plan for evaluating the qualities of test usefulness, which
is the degree to which a test is useful for teachers and students; it
includes six qualities: reliability, validity, authenticity, practicality,
interactiveness, and impact;
Identifying resources and developing a plan for their allocation and
management;
Determining format and timing of the test;
Determining levels of performance;
Determining scoring procedures
5.3.3 Writing
Although writing items is time-consuming, writing good items is an art.
No one can expect to be able consistently to produce perfect items.
Some items will have to be rejected, others reworked. The best way to
identify items that have to be improved or abandoned is through
teamwork. Colleagues must really try to find fault; and despite the
seemingly inevitable emotional attachment that item writers develop to
items that they have created, they must be open to, and ready to
accept, the criticisms that are offered to them. Good personal relations
are a desirable quality in any test writing team.
Test item writers should possess the following characteristics:
5.3.4 Preparing
One has to understand the major principles, techniques and
experience of preparing the test items. Not every teacher can make a
good tester. To construct different kinds of tests, the tester should
observe some principles. In the production-type tests, we have to bear
in mind that no comments are necessary. Test writers should also try to
avoid test items which can be answered through test-wiseness.
Test-wiseness refers to the capacity of the examinees to utilise the
characteristics and formats of the test to guess the correct answer.
5.3.5 Reviewing
Principles for reviewing test items:
The test should not be reviewed immediately after its construction,
skills to be included
are your guiding plan for designing an instrument that effectively fulfils
your desired principles, especially validity.
It is vital to note that for large-scale standardised tests like Test
of English as a Foreign Language (TOEFL Test), International
English Language Testing System (IELTS), the Michigan English
Language Assessment Battery (MELAB), and the like, that are intended
to be widely distributed and thus are broadly generalised, test
specifications are much more formal and detailed (Spaan, 2006). They
are also usually confidential so that the institution that is designing the
test can ensure the validity of subsequent forms of a test.
Many language teachers claim that it is difficult to construct an item. In
reality, it is rather easy to develop an item if we are committed to
planning the measuring instruments that evaluate students'
achievement.
However, what exactly is an item for a test? An item is a tool, an
instrument, instruction or question used to get feedback from
test-takers, which is evidence of something that is being measured. An
item is an instrument used to get feedback, which is useful
information for consideration in measuring or asserting a construct
measurement. Items can be classified as recall and thinking items. A
recall item is an item that requires one to recall in order to answer, and
a thinking item refers to an item that requires test-takers to use their
thinking skills to attempt it.
For instance, consider a grammar unit test that will be administered at
the end of a three-week grammar course for high beginning adult
learners (Level 2). The students will be taking a test that covers verb
tenses and two integrated skills (listening/speaking and reading/writing),
and the grammar class they attend serves to reinforce the grammatical
forms that they have learnt in the two earlier classes.
Based on the scenario above, the test specs that you design
might consist of four sequential steps:
1. a broad outline of how the test will be organised
2. which of the eight sub-skills you will test
3. what the various tasks and item types will be
4. how results will be scored, reported to students, and used in future
class (washback)
Besides knowing the purpose of the test you are creating, you
are required to know as precisely as possible what it is you want to
test. Do not conduct a test hastily. Instead, you need to examine the
objectives for the unit you are testing carefully.
5.5
Taxonomy by allowing these two aspects, the noun and verb, to form
separate dimensions, the noun providing the basis for the Knowledge
dimension and the verb forming the basis for the Cognitive Process
dimension.
Level 1 (C1)
Categories & Cognitive Processes: Remember
Definition: Retrieve knowledge from long-term memory
- Recognising (also: Identifying): Locating knowledge in long-term memory that is consistent with presented material
- Recalling (also: Retrieving): Retrieving relevant knowledge from long-term memory

Level 2 (C2)
Categories & Cognitive Processes: Understand
Definition: Construct meaning from instructional messages, including oral, written, and graphic communication
- Interpreting (also: Clarifying, Paraphrasing, Representing, Translating)
- Exemplifying (also: Illustrating, Instantiating)
- Classifying (also: Categorising, Subsuming)
- Summarising (also: Abstracting, Generalising)
- Inferring (also: Concluding, Extrapolating, Interpolating, Predicting)
- Comparing (also: Contrasting, Mapping, Matching)
- Explaining (also: Constructing models): Constructing a cause-and-effect model of a system

Level 3 (C3)
Categories & Cognitive Processes: Apply
Definition: Applying a procedure to a familiar or unfamiliar task
- Executing (also: Carrying out): Applying a procedure to a familiar task
- Using: Applying a procedure to an unfamiliar task

Categories & Cognitive Processes: Analyse
- Differentiating: Distinguishing relevant from irrelevant parts, or important from unimportant parts, of presented material
- Organising: Determining how elements fit or function within a structure
- Attributing: Determining a point of view, bias, values, or intent underlying presented material

Categories & Cognitive Processes: Evaluate
Definition: Make judgments based on criteria and standards
- Checking (also: Coordinating, Detecting, Monitoring, Testing): Detecting inconsistencies or fallacies within a process or product; determining whether a process or product has internal consistency; detecting the effectiveness of a procedure as it is being implemented
- Critiquing (also: Judging): Detecting inconsistencies between a product and external criteria; determining whether a product has external consistency; detecting the appropriateness of a procedure for a given problem

Categories & Cognitive Processes: Create
Definition: Putting elements together to form a coherent or functional whole; reorganising elements into a new pattern or structure
- Generating (also: Hypothesising): Coming up with alternative hypotheses based on criteria
- Planning (also: Designing)
- Producing: Inventing a product

The Knowledge Dimension
- Factual Knowledge: The basic elements students must know to be acquainted with a discipline or solve problems in it
- Conceptual Knowledge
- Procedural Knowledge
- Metacognitive Knowledge
and Multistructural), to the ability to link the ideas and elements of a task
together (Relational) and finally (Extended Abstract) to understand the topic
for themselves, possibly going beyond the initial scope of the task (Biggs &
Collis, 1982; Hattie & Brown, 2004). In their later research into multimodal
learning, Biggs & Collis noted that there was an increase in the structural
complexity of their (the students') responses (1991:64).
It may be useful to view the SOLO taxonomy as an integrated strategy,
to be used in lesson design, in task guidance and formative and summative
assessment (Smith & Colby, 2007; Black & William, 2009; Hattie, 2009; Smith,
2011). The structure of the taxonomy encourages viewing learning as an ongoing process, moving from simple recall of facts towards a deeper
understanding; that learning is a series of interconnected webs that can be
built upon and extended. Nückles et al. (2009:261) elaborate:
Cognitive strategies such as organization and elaboration are at
the heart of meaningful learning because they enable the learner
to organize learning into a coherent structure and integrate new
information with existing knowledge, thereby enabling deep
understanding and long-term retention.
This would help to develop Smith's (2011:92) self-regulating, self-evaluating
learners who were well motivated by learning.
A range of SOLO-based techniques exist to assist teachers and
students. Use of constructive alignment (Biggs & Tang, 2009) encourages
teachers to be more explicit when creating learning objectives, focusing on
what the student should be able to do and at which level. This is essential for
a student to make progress and allows for the creation of rubrics, for use in
class (Black & Wiliam, 2009; Nückles et al., 2009; Huang, 2012), to make the
process explicit to the student. Use of HOTS (Higher Order Thinking Skills)
maps (Hook & Mills, 2011) can be used in English to scaffold in-depth
discussion, encouraging students to:
5.6
clearly written questions that do not attempt to trick or confuse them into
incorrect responses. The following presents the major characteristics of
well-written test items.
5.6.1 Aim of the test
Test item development is a critical step in building a test that properly
meets certain standards. A good test is only as good as the quality of the test
items. If the individual test items are not appropriate and do not perform well,
how can the test scores be meaningful? The topic to be evaluated (construct)
and where the evaluation is done (title/context) must be part of the
curriculum. If it is evaluated outside the curriculum, the curricular validity of
the item can be disputed. Therefore, test items must be developed to
precisely measure the objectives prescribed by the blueprint and meet quality
standards.
5.6.2 Range of the topics to be tested
A test must measure the test-takers' ability or proficiency in applying
the knowledge and principles on the topics that they have learnt. Ample
opportunity must be given to students to learn the topics that are to be
Test format
What is the difference between test format and test type? For example,
when you want to introduce a new kind of test, for example a reading test,
which is organised a little differently from the existing test items, what do you
say: test format or test type? Test format refers to the layout of questions on
a test. For example, the format of a test could be two essay questions, 50
multiple-choice questions, etc. For the sake of brevity, I will consider
providing the outlines of some large-scale standardised tests.
UPSR
The Primary School Evaluation Test, also known as Ujian Penilaian
Sekolah Rendah (commonly abbreviated as UPSR; Malay), is a national
examination taken by all pupils in our country at the end of their sixth year
in primary school, before they leave for secondary school. It is prepared and
examined by the Malaysian Examinations Syndicate. This test consists of two
papers, namely Paper 1 and Paper 2.
Multiple-choice questions in Paper 1 are tested using a standardised
optical answer sheet that uses optical mark recognition for detecting answers,
while Paper 2 comprises three sections, namely Sections A, B, and C.
reading, speaking and writing), which take a total of about four and a half
hours to complete.
TOPIC 6
6.0
SYNOPSIS
Topic 6 focuses on ways to assess language skills and language
content. It defines the types of test items used to assess language
skills and language content. It also provides teachers with suggestions
on ways a teacher can assess the listening, speaking, reading and
writing skills in a classroom. It also discusses concepts of and
differences between discrete point test, integrative test and
communicative test.
6.1
LEARNING OUTCOMES
At the end of Topic 6, teachers will be able to:
6.2
FRAMEWORK OF TOPICS
CONTENT
SESSION SIX (6 hours)
6.2.1
b.
Speaking
In the assessment of oral production, both discrete feature
objective tests and integrative task-based tests are used. The first
type tests such skills as pronunciation, knowledge of what
language is appropriate in different situations, language required
in doing different things like describing, giving directions, giving
instructions, etc. The second type involves finding out if pupils
can perform different tasks using spoken language that is
appropriate for the purpose and the context. Task-based activities
involve describing scenes shown in a picture, participating in a
discussion about a given topic, narrating a story, etc. As in the
listening performance assessment tasks, Brown (2010) cited four
categories for oral assessment.
1.
B.
C.
concern.
Intensive (controlled). Beyond the fundamentals of imitative
writing are skills in producing appropriate vocabulary within a
context, collocation and idioms, and correct grammatical features
up to the length of a sentence. Meaning and context are
important in determining correctness and appropriateness but
most assessment tasks are more concerned with a focus on form
3.
4.
It is not by accident that we find there are few, if any, test formats that are
either supply type and objective or select type and subjective. Select type
tests tend to be objective while supply type tests tend to be subjective.
In addition to the above, Brown and Hudson (1998) have also suggested
three broad categories to differentiate tests according to how students are
expected to respond. These categories are the selected response tests, the
constructed response tests, and the personal response tests. Examples of
each of these types of tests are given in Table 6.1.
Selected response: True-false; Matching; Multiple choice
Constructed response: Fill-in; Short answer; Performance test
Personal response: Conferences; Portfolios
Table 6.1: Examples of selected-response, constructed-response and personal-response tests
Communicative Test
As language teaching has emphasised the importance of
communication through the communicative approach, it is not
surprising that communicative tests have also been given prominence.
A communicative emphasis in testing involves many aspects, two of
which revolve around communicative elements in tests and meaningful
content. Both these aspects are briefly addressed in the following
subsections:
In short, the kinds of tests that we should expect more of in the future
will be communicative tests in which candidates actually have to
produce the language in an interactive setting involving some degree of
unpredictability which is typical of any language interaction situation.
These tests would also take the communicative purpose of the
interaction into consideration and require the student to interact with
language that is actual and unsimplified for the learner. Fulcher finally
points out that in a communicative test, the only real criterion of
success is the behavioural outcome, or whether the learner was
able to achieve the intended communicative effect (p. 493). It is
obvious from this description that the communicative test may not be
so easily developed and implemented. Practical reasons may hinder
some of the demands listed. Nevertheless, a solution to this problem
has to be found in the near future in order to have valid language tests
that are purposeful and can stimulate positive washback in teaching and
learning.
Exercise 1
1.
2.
TOPIC 7
7.0
SYNOPSIS
Topic 7 focuses on scoring, grading and assessment criteria. It
provides teachers with brief descriptions of the different approaches to
scoring, namely objective, holistic and analytic.
7.1
LEARNING OUTCOMES
7.2
FRAMEWORK OF TOPICS
CONTENT
SESSION SEVEN (3 hours)
7.2.1
Objective approach
One type of scoring approach is the objective approach, which relies on
quantified methods of evaluating students' writing. A sample of how objective
scoring is conducted is given by Bailey (1999) as
follows:
Criteria
The 6-point scale above includes broad descriptors of what a student's essay
reflects for each band. It is quite apparent that graders using this scale are
expected to pay attention to vocabulary, meaning, organisation, topic
Components        Weight
Content           30 points
Organisation      20 points
Vocabulary        20 points
Language Use      25 points
Mechanics          5 points
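As a rough illustration, the weighting above can be applied in code. This is a minimal sketch: the component scores for the sample essay below are invented for the example, not taken from the module.

```python
# Weights for each component of the analytic scale, as listed above.
WEIGHTS = {
    "Content": 30,
    "Organisation": 20,
    "Vocabulary": 20,
    "Language Use": 25,
    "Mechanics": 5,
}

def analytic_total(component_scores):
    """Sum the component scores after checking each stays within its weight."""
    for name, score in component_scores.items():
        if not 0 <= score <= WEIGHTS[name]:
            raise ValueError(f"{name} must be between 0 and {WEIGHTS[name]}")
    return sum(component_scores.values())

# Hypothetical scores awarded to one essay.
sample_essay = {"Content": 24, "Organisation": 15, "Vocabulary": 16,
                "Language Use": 20, "Mechanics": 4}
print(analytic_total(sample_essay))  # 79 (out of a possible 100)
```

Because the weights sum to 100, the total can be read directly as a percentage.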
Advantages

Objective
- Quickly graded
- Provides a public standard that is understood by teachers and students alike
- Relatively high degree of rater reliability
- Applicable to the assessment of many different topics
- Emphasises the students' strengths rather than their weaknesses

Analytical
- Provides clear guidelines for grading in the form of the various components
- Allows graders to consciously address important aspects of writing
- Emphasises the students' strengths rather than their weaknesses
EXERCISE
1.
TOPIC 8
8.0
SYNOPSIS
Topic 8 focuses on item analysis and interpretation. It provides teachers with
brief descriptions of basic statistical terms such as mode, median,
mean, standard deviation, standard score and interpretation of data. It will also
look at some item analysis that deals with item difficulty and item discrimination.
Teachers will also be introduced to distractor analysis in language assessment.
FRAMEWORK OF TOPICS
CONTENT
SESSION EIGHT (6 hours)
8.2.1 Basic Statistics
Let us assume that you have just graded the test papers for your class. You
now have a set of scores. If a person were to ask you about the performance
of the students in your class, it would be very difficult to give all the scores in
the class. Instead, you may prefer to cite only one score.
Or perhaps you would like to report on the performance by giving some
values that would help provide a good indication of how the students in your
class performed. What values would you give? In this section, we will look at
two kinds of measures, namely measures of central tendency and measures
of dispersion. Both these types of measures are useful in score reporting.
Central tendency measures the extent to which a set of scores clusters
around a central value. There are three major measures of central tendency: the
mode, median and mean.
MODE
MEDIAN
MEAN
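The three measures can be sketched with Python's standard library. The score set below is invented purely for illustration.

```python
# Computing the three measures of central tendency with the statistics module.
import statistics

scores = [65, 70, 70, 80, 95]

print(statistics.mode(scores))    # 70 - the most frequently occurring score
print(statistics.median(scores))  # 70 - the middle score once sorted
print(statistics.mean(scores))    # 76 - the arithmetic average
```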
8.2.2
Standard deviation
Standard deviation refers to how much the scores deviate from the mean.
There are two methods of calculating standard deviation which are the
deviation method and raw score method which are illustrated by the following
formulae.
To illustrate this, we will use the scores 20, 25 and 30. Using the deviation
method, we come up with the following table:
Table 8.1: Calculating the Standard Deviation Using the Deviation Method
Using the raw score method, we can come up with the following:
Table 8.2: Calculating the Standard Deviation Using the Raw Score Method
Both methods result in the same final value of 5. If you are calculating
standard deviation with a calculator, it is suggested that the deviation
method be used when there are only a few scores and the raw score
method be used when there are many scores. This is because when
there are many scores, it will be tedious to calculate the square of the
deviations and their sum.
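Both methods can be sketched in code for the scores 20, 25 and 30 used in the text. The (n − 1) denominator is assumed here because it reproduces the value of 5 stated above.

```python
# Standard deviation by the deviation method and the raw score method.
import math

scores = [20, 25, 30]
n = len(scores)
mean = sum(scores) / n

# Deviation method: square each score's deviation from the mean.
dev_sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

# Raw score method: work only from the sum of scores and the sum of squares.
raw_sd = math.sqrt((sum(x * x for x in scores) - sum(scores) ** 2 / n) / (n - 1))

print(dev_sd, raw_sd)  # 5.0 5.0
```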
8.2.3 Standard score
Standardised scores are necessary when we want to make
comparisons across tests and measurements. Z scores and T scores
are the more common forms of standardised scores although you
may come up with your own standardised score. A standardised score
can be computed for every raw score in a set of scores for a test.
i. The Z score
The Z score is the basic standardised score. It is referred to as the
basic form as other computations of standardised scores must first
calculate the Z score. The formula used to calculate the Z score is as
follows:

Z = (raw score − mean) / standard deviation
Z score values are very small and usually range only from −2 to 2.
Such small values make it inappropriate for score reporting especially
for those unaccustomed to the concept. Imagine what a parent may
say if his child comes home with a report card with a Z score of 0.47
in English Language! Fortunately, there is another form of
standardised score - the T score with values that are more
palatable to the relevant parties.
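A minimal sketch of the Z score computation follows; the raw score, mean and standard deviation values are invented for illustration.

```python
# Express a raw score as standard deviations above or below the mean.
def z_score(raw, mean, sd):
    return (raw - mean) / sd

print(round(z_score(45, 42, 7), 2))  # 0.43 - above the mean
print(round(z_score(38, 42, 7), 2))  # -0.57 - below the mean
```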
ii. The T score
The T score is a standardised score which can be computed using the
formula 10 (Z) + 50. As such, the T score for students A, B, C, and D in
the table 4.3 are 10(-1.28) + 50; 10 (-0.23) + 50; 10(0.47) + 50; and 10
(1.04) + 50 or 37.2, 47.7, 54.7, and 60.4 respectively. These values
seem perfectly appropriate compared to the Z score. The T score
average or mean is always 50 (i.e. a Z score of 0), which
connotes an average ability and the midpoint of a 100-point scale.
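The T scores for students A to D can be reproduced directly from the Z scores quoted in the text:

```python
# Convert a Z score to a T score with mean 50 and standard deviation 10.
def t_score(z):
    return 10 * z + 50

for z in (-1.28, -0.23, 0.47, 1.04):
    print(round(t_score(z), 1))
# prints 37.2, 47.7, 54.7 and 60.4 in turn
```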
8.2.4 Interpretation of data
How can En. Abu solve this problem? He would have to have
standardised scores in order to decide. This would require the
following information:
Test 1: mean = 42, standard deviation = 7
Test 2: mean = 47, standard deviation = 8
Using the information above, En. Abu can find the Z score for each
raw score reported as follows:
Table 8.4: Z Score for Form 2A
Based on Table 8.4, both Ali and Chong have a negative Z score as
their total score for both tests. However, Chong has a higher Z score
total (i.e. −1.07 compared to −1.34) and therefore performed better
when we take the performance of all the other students into
consideration.
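En. Abu's comparison can be sketched as follows. The test means and standard deviations come from the text, but the raw scores assigned to Ali and Chong below are invented so that the Z score totals match the −1.34 and −1.07 quoted above.

```python
# (mean, standard deviation) for Test 1 and Test 2, as given in the text.
TESTS = [(42, 7), (47, 8)]

def z_total(raw_scores):
    """Sum of the Z scores across both tests for one student."""
    return sum((raw - m) / sd for raw, (m, sd) in zip(raw_scores, TESTS))

ali = z_total([37, 42])    # hypothetical raw scores
chong = z_total([38, 43])  # hypothetical raw scores

print(round(ali, 2), round(chong, 2))  # -1.34 -1.07
# Chong's total is higher, so he performed better relative to the class.
```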
THE NORMAL CURVE
The normal curve is a hypothetical curve that represents the distribution of
many naturally occurring phenomena. It is assumed that if we were to sample a
particular characteristic such as the height of Malaysian men, then we will
find that while most will have an average height of perhaps 5 feet 4 inches,
there will be a few who will be relatively shorter and an equal number who
are relatively taller. By plotting the heights of all Malaysian men according to
frequency of occurrence, it is expected that we would obtain something
similar to a normal distribution curve. Similarly, test scores that measure any
characteristic such as intelligence, language proficiency or writing ability of a
specific population are also expected to provide us with a normal curve.
The following is a diagram illustrating how the normal curve would look.
Assuming a standard deviation of 5, the score for the standard deviation value
of 1 is 5, for the value of 2 it is 5 x 2 = 10, and for the value of 3 it is 15, and
so on. Standard deviation values of −1, −2, and −3 will have corresponding
negative scores of −5, −10, and −15.
8.2.5
Item analysis
a.
Item difficulty
Item difficulty refers to how easy or difficult an item is. The formula
used to measure item difficulty is quite straightforward. It involves
finding out how many students answered an item correctly and
dividing it by the number of students who took this test. The formula
is therefore:

item difficulty = number of students who answered the item correctly / number of students who took the test
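The calculation just described can be sketched in a few lines; the counts below are invented for illustration.

```python
# Item difficulty: the proportion of test-takers who answered correctly.
def item_difficulty(num_correct, num_takers):
    return num_correct / num_takers

print(item_difficulty(70, 100))  # 0.7 - a moderately easy item
print(item_difficulty(12, 100))  # 0.12 - a very difficult item
```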
A discrimination value of −1 tells us that the weaker students performed better
on an item than the better students. This is hardly what we want from an item,
and if we obtain such a value, it may indicate that there is something not quite
right with the item. It is strongly recommended that we examine the item to see
whether it is ambiguous or poorly written. A discrimination value of +1
shows positive discrimination, with the better students performing much
better than the weaker ones, as is to be expected.
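Item discrimination with upper and lower groups can be sketched as the difference in correct answers between the two groups, divided by the size of one group. The group counts below are invented for illustration.

```python
# Item discrimination: (upper-group correct - lower-group correct) / group size.
# Values near +1 mean the item separates strong from weak students;
# negative values flag an item worth re-examining.
def discrimination(upper_correct, lower_correct, group_size):
    return (upper_correct - lower_correct) / group_size

print(discrimination(4, 1, 4))  # 0.75 - discriminates well
print(discrimination(1, 3, 4))  # -0.5 - weaker students did better: re-examine
```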
Let's take the following as an example. Suppose you have just
conducted a twenty-item test and obtained the following results:
As there are twelve students in the class, 33% of this total would be 4
students. Therefore, the upper group and the lower group will each consist
of 4 students. Based on their total scores, the upper group would
Distractor analysis
Distractor analysis is an extension of item analysis, using techniques
that are similar to item difficulty and item discrimination. In distractor
analysis, however, we are no longer interested in how test-takers select
the correct answer, but in how effectively the distractors function
by drawing the test-takers away from the correct answer.
The number of times each distractor is selected is noted in order to
determine the effectiveness of the distractor. We would expect that the
distractor is selected by enough candidates for it to be a viable
distractor.
Let us assume that 100 students took the test. If we assume that A is the
answer and the item difficulty is 0.7, then 70 students answered correctly.
What about the remaining 30 students and the effectiveness of the three
distractors? If all 30 selected D, then distractors B and C are useless in
their role as distractors. Similarly, if 15 students selected D and another 15
selected B, then C is not an effective distractor and should be replaced.
Therefore, the ideal situation would be for each of the three distractors to
be selected by an equal number of all students who did not get the answer
correct, i.e. in this case 10 students. Therefore the effectiveness of each
distractor can be quantified as 10/100 or 0.1 where 10 is the number of
students who selected the distractor and 100 is the total number of students
who took the test. This technique is similar to a difficulty index although the
result does not indicate the difficulty of each item, but rather the
effectiveness of the distractor. In the first situation described in this
paragraph, options A, B, C and D would have a difficulty index of 0.7, 0, 0,
and 0.3 respectively. If the distractors worked equally well, then the indices
would be 0.7, 0.1, 0.1, and 0.1. Unlike in determining the difficulty of an
item, the value of the difficulty index formula for the distractors must be
interpreted in relation to the indices for the other distractors.
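The ideal spread described above can be checked mechanically. This sketch reuses the 100-student example from the text, with A as the key and the remaining choices spread evenly across the distractors.

```python
# Distractor effectiveness: the proportion of all test-takers selecting
# each option. A is the key; B, C and D are the distractors.
responses = {"A": 70, "B": 10, "C": 10, "D": 10}
total = sum(responses.values())

for option, count in responses.items():
    print(option, count / total)
# A 0.7, then 0.1 for each of B, C and D
```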
From a different perspective, the item discrimination formula can also be
used in distractor analysis. The concept of upper groups and lower groups
would still remain, but the analysis and expectation would differ slightly
from the regular item discrimination that we have looked at earlier. Instead
of expecting a positive value, we should logically expect a negative value
as more students from the lower group should select distractors. Each
distractor can have its own item discrimination value in order to analyse
how the distractors work and ultimately refine the effectiveness of the test
item itself.
Table 8.6: Selection of Distractors

         Distractor A   Distractor B   Distractor C   Distractor D
Item 1        8*
Item 2        8*
Item 3        8*
Item 4        8*
Item 5        7*

* indicates key
For Item 1, the discrimination index for each distractor can be calculated
using the discrimination index formula. From Table 8.5, we know that all the
students in the upper group answered this item correctly and only one
student from the lower group did so. If we assume that the three remaining
students from the lower group all selected distractor B, then the
discrimination index for item 1, distractor B will be:

(0 − 3) / 4 = −0.75
This negative value indicates that more students from the lower group
selected the distractor compared to students from the upper group. This
result is to be expected of a distractor and a value of -1 to 0 is preferred.
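The same discrimination formula applied to a distractor can be sketched as follows, using the figures from the worked example above: no upper-group student and three lower-group students selected distractor B, with four students per group.

```python
# Discrimination index for a distractor: a negative value is desirable,
# since more lower-group than upper-group students should choose it.
def distractor_discrimination(upper_selected, lower_selected, group_size):
    return (upper_selected - lower_selected) / group_size

print(distractor_discrimination(0, 3, 4))  # -0.75
```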
EXERCISE
1. Calculate the mean, mode, median and range of the following set of
scores:
23, 24, 25, 23, 24, 23, 23, 26, 27, 22, 28.
2. What is a normal curve and what does this show? Does the final
result always show a normal curve and how does this relate to
standardised tests?
TOPIC 9
9.0 SYNOPSIS
Topic 9 focuses on reporting assessment data. It provides teachers with brief
descriptions on the purposes of reporting and the reporting methods.
9.1 LEARNING OUTCOMES
By the end of Topic 9, teachers will be able to:
CONTENT
SESSION NINE (3 hours)
9.2.2
Reporting methods
Student achievement progress can be reported by comparing:
i. Norm-Referenced Assessment and Reporting
Assessing and reporting a student's achievement and progress in
comparison to other students.
ii. Criterion-Referenced Assessment and Reporting
TOPIC 10
10.0 SYNOPSIS
Topic 10 focuses on the issues and concerns related to assessment in
Malaysian primary schools. It will look at how assessment is viewed and used
in Malaysia.
10.1 LEARNING OUTCOMES
By the end of Topic 10, teachers will be able to:
CONTENT
SESSION TEN (3 hours)
10.3
Exam-oriented System
10.4
Bloom's Taxonomy
Knowledge
Comprehension
Application
Analysis
Synthesis
Evaluation
Knowledge
Recalling memorized information. May involve remembering a wide range of
material from specific facts to complete theories, but all that is required is the
bringing to mind of the appropriate information. Represents the lowest level
of learning outcomes in the cognitive domain.
Learning objectives at this level: know common terms, know specific facts,
know methods and procedures, know basic concepts, know principles.
Question verbs: Define, list, state, identify, label, name, who? when? where?
what?
Comprehension
The ability to grasp the meaning of material. Translating material from one
form to another (words to numbers), interpreting material (explaining or
summarizing), estimating future trends (predicting consequences or effects).
Goes one step beyond the simple remembering of material, and represents
the lowest level of understanding.
Learning objectives at this level: understand facts and principles, interpret
verbal material, interpret charts and graphs, translate verbal material to
mathematical formulae, estimate the future consequences implied in data,
justify methods and procedures.
Question verbs: Explain, predict, interpret, infer, summarize, convert,
translate, give example, account for, paraphrase.
Application
The ability to use learned material in new and concrete situations. Applying
rules, methods, concepts, principles, laws, and theories. Learning outcomes
in this area require a higher level of understanding than those under
comprehension.
Learning objectives at this level: apply concepts and principles to new
situations, apply laws and theories to practical situations, solve mathematical
problems.
Analysis
The ability to break down material into its component parts. Identifying parts,
analysis of relationships between parts, recognition of the organizational
principles involved. Learning outcomes here represent a higher intellectual
level than comprehension and application because they require an
understanding of both the content and the structural form of the material.
Learning objectives at this level: recognize unstated assumptions, recognize
logical fallacies in reasoning, distinguish between facts and inferences,
evaluate the relevancy of data, analyze the organizational structure of a work
(art, music, writing).
Question verbs: Differentiate, compare / contrast, distinguish x from y, how
does x affect or relate to y? why? how? What piece of x is missing / needed?
Synthesis
(By definition, synthesis cannot be assessed with multiple-choice questions.
It appears here to complete Bloom's taxonomy.)
The ability to put parts together to form a new whole. This may involve the
production of a unique communication (theme or speech), a plan of
operations (research proposal), or a set of abstract relations (scheme for
classifying information). Learning outcomes in this area stress creative
behaviors, with major emphasis on the formulation of new patterns or
structure.
Learning objectives at this level: write a well organized paper, give a well
organized speech, write a creative short story (or poem or music), propose a
plan for an experiment, integrate learning from different areas into a plan for
solving a problem, formulate a new scheme for classifying objects (or events,
or ideas).
Question verbs: Design, construct, develop, formulate, imagine, create,
change, write a short story and label the following elements:
Evaluation
The ability to judge the value of material (statement, novel, poem, research
report) for a given purpose. The judgments are to be based on definite
criteria, which may be internal (organization) or external (relevance to the
purpose). The student may determine the criteria or be given them. Learning
outcomes in this area are highest in the cognitive hierarchy because they
contain elements of all the other categories, plus conscious value judgments
based on clearly defined criteria.
Learning objectives at this level: judge the logical consistency of written
material, judge the adequacy with which conclusions are supported by data,
judge the value of a work (art, music, writing) by the use of internal criteria,
judge the value of a work (art, music, writing) by use of external standards of
excellence.
Question verbs: Justify, appraise, evaluate, judge x according to given
criteria. Which option would be better/preferable to party y?
10.5
School-based Assessment
The traditional system of assessment no longer satisfies the educational
and social needs of the third millennium. In the past few decades, many
countries have made profound reforms in their assessment systems.
Several educational systems have in turn introduced school-based
assessment as part of or instead of external assessment in their
certification. Examination bodies acknowledge the immense
potential of school-based assessment in terms of validity and flexibility,
but at the same time they have to guard against or deal with difficulties
related to reliability, quality control and quality assurance. In the debate
on school-based assessment, the issue of why has been widely written
about and there is general agreement on the principles of validity of
this form of assessment.
Izard (2001) as well as Raivoce and Pongi (2001) explain that school-based
assessment (SBA) is often perceived as the process put in place
to collect evidence of what students have achieved, especially in
Academic:
Non-academic:
Centralised Assessment
Conducted and administered by teachers in schools using instruments,
rubrics, guidelines, timelines and procedures prepared by LP
Monitoring and moderation conducted by PBS Committee at School,
District and State Education Department, and LP
School Assessment
The emphasis is on collecting first-hand information about pupils' learning
based on curriculum standards.
Teachers plan the assessment, prepare the instrument and administer the
assessment during the teaching and learning process.
Teachers mark pupils' responses and report their progress continuously.
10.6
Alternative Assessment
Traditional Assessment          Alternative Assessment
One-shot tests
Indirect tests                  Direct tests
Inauthentic tests               Authentic assessment
Individual projects             Group projects
No feedback to learners
Speeded exams                   Power exams
                                Classroom-based tests
Summative                       Formative
Product of instruction          Process of instruction
Intrusive                       Integrated
Judgmental                      Developmental
Teacher proof                   Teacher mediated
Physical demonstration
Pictorial products
Reading response logs
K-W-L (what I know/what I want to know/what I've learned) charts
Dialogue journals
Checklists
Teacher-pupils conferences
Interviews
Performance tasks
Portfolios
Self assessment
Peer assessment
Portfolios
A well-known and commonly used alternative assessment is portfolio
assessment. The contents of the portfolio become evidence of abilities,
much like how we would use a test to measure the abilities of our
students.
Bailey (1998, p. 218) describes a portfolio as containing four primary
elements.
First, it should have an introduction to the portfolio itself
which provides an overview to the content of the portfolio.
Bailey even suggests that this section include a reflective essay
by the student in order to help express the student's thoughts
and feelings about the portfolio, perhaps explaining strengths
and possible weaknesses as well as explaining why certain pieces
are included in the portfolio.
Introductory Section
- Overview
- Reflective Essay
Personal Section
- Photographs
- Personal items
Assessment Section
- Evaluation by peers
- Self-evaluation
- Journals
- Score reports
A self-assessment scale with a descriptor for each level (for example,
3: I have difficulty with some questions, but I generally get the meaning)
can help stimulate meta-cognition.
EXERCISE
In your opinion, what are the advantages of using portfolios as
a form of alternative assessment?
REFERENCES
McKay.
University Press.
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T. and
McNamara, T. (1999). Dictionary of language testing.
Cambridge: University of Cambridge Local Examinations
Syndicate and Cambridge University Press.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (ed.).
Educational Measurement. (3rd. ed.) pp.105-146. New York, NY:
Macmillan.
Gottlieb, M. (2006). Assessing English Language Learners:
Bridges from Language Proficiency to Academic Achievement.
USA: Corwin Press.
Grotjahn, R. (1986). Test validation and cognitive psychology:
Some methodological considerations. Language Testing,
3, pp. 158-85.
Hattie, J. (2009). Visible Learning. New York: Routledge.
Hattie, J. (2012). Visible Learning for Teachers: Maximizing Impact
on Learning. Abingdon: Routledge.
Hattie, J. & Brown, G. (2004). Cognitive processes in asTTle: The
SOLO taxonomy. University of Auckland/Ministry of Education.
asTTle Technical Report 43.
Hook, P. & Mills, J. (2011). SOLO Taxonomy: A Guide for Schools
Book 1: A common language of learning. Laughton, UK:
Essential Resources Educational Publishers.
Huang, S.C. (2012). English Teaching: Practice and Critique, 11(4),
pp. 99-119.
Hughes, A. (2003). Testing for language teachers (2nd ed.).
Cambridge, MA: Cambridge University Press.
Gavin, B. et al. (2008). An introduction to educational assessment,
measurement and evaluation. (2nd ed.). Australia: Pearson
Education New Zealand.
McNamara, T. (2000). Language testing. Oxford, UK: Oxford
University Press.
Linn, R. L., & Gronlund, N. E. (2000). Measurement and
assessment in teaching. (8th ed.). Upper Saddle River, NJ:
Merrill/Prentice Hall.
Moseley, D., Baumfield, V., Elliott, J., Gregson, M., Higgins, S.,
Miller, J., & Newton, D. (2005). Frameworks for Thinking: A
handbook for teaching and learning. Cambridge: Cambridge
University Press.
Mousavi, S. A. (2009). An encyclopedic dictionary of language
testing (4th ed.). Tehran: Rahnama Publications.
Norleha Ibrahim. (2009). Management of measurement and
evaluation module. Selangor: Open University Malaysia.
Nückles, M., Hübner, S. & Renkl, A. (2009). Enhancing self-regulated
learning by writing learning protocols. Learning and
Instruction, 19(3), pp. 259-271. Available at:
http://linkinghub.elsevier.com/retrieve/pii/S0959475208000558
(Retrieved March 26, 2013).
Oller, J. W. (1979). Language tests at school: A pragmatic
approach. London: Longman.
Pearson, I. (1988). Tests as levers for change. In D. Chamberlain
& R. Baumgardner (Eds.), ESP in the classroom: Practice and
evaluation (Vol. 128, pp. 98-107). London: Modern
English Publications.
Pimsleur, P. (1966). Pimsleur Language Aptitude Battery. New
York, NY: Harcourt, Brace & World.
Shepard, L. A. (2000). The role of assessment in a learning
culture. Paper presented at the Annual Meeting of the
American Educational Research Association. Available at:
http://www.aera.net/meeting/am2000/wrap/praddr01.htm
(Retrieved 10.8.2013).
NAME: NURLIZA BT OTHMAN
othmannurliza@yahoo.com
QUALIFICATIONS: