
LANGUAGE

ASSESSMENT
ELT Teacher Training
Tarık İNCE

CHAPTER 1
TESTING
ASSESSING
AND
TEACHING

In an era of communicative language teaching:

Tests should measure up to standards of authenticity and meaningfulness.
Ts should design tests that serve as motivating learning experiences rather than anxiety-provoking threats.
Tests:
should be positive experiences
should build a person's confidence and become learning experiences
should bring out the best in students
shouldn't be degrading
shouldn't be artificial
shouldn't be anxiety-provoking
Language Assessment aims:
to create more authentic, intrinsically motivating assessment procedures that are appropriate for their context and designed to offer constructive feedback to sts

What is a test?
A test is a method of measuring a person's ability, knowledge, or performance in a given domain.
1. Method
A set of techniques, procedures, or items.
To qualify as a test, the method must be explicit and structured. For example:
Multiple-choice questions with prescribed correct answers
A writing prompt with a scoring rubric
An oral interview based on a question script and a checklist of expected responses to be filled in by the administrator
2. Measure
A means for offering the test-taker some kind of result.
If an instrument does not specify a form of reporting measurement, then that technique cannot be defined as a test.
Scoring may take forms such as the following:
A classroom-based short-answer essay test may earn the test-taker a letter grade accompanied by the instructor's marginal comments.
Large-scale standardized tests provide a total numerical score, a percentile rank, and perhaps some sub-scores.

3. The test-taker (the individual) = The person who takes the test.

Testers need to understand:
who the test-takers are,
what their previous experience and background are,
whether the test is appropriately matched to their abilities, and
how test-takers should interpret their scores.
4. Performance
A test measures performance, but the results imply the test-taker's ability or competence.
Some language tests measure one's ability to perform language:
to speak, write, read, or listen to a subset of language.
Others measure a test-taker's knowledge about language:
defining a vocabulary item, reciting a grammatical rule, or identifying a rhetorical feature in written discourse.

5. Measuring a given domain

This means measuring the desired criterion and not including other factors.
Proficiency tests:
Even though the actual performance on the test involves only a sampling of skills, the domain is overall proficiency in a language: general competence in all skills of a language.
Classroom-based performance tests:
These have more specific criteria. For example:
A test of pronunciation might well be a test of only a limited set of phonemic minimal pairs.
A vocabulary test may focus on only the set of words covered in a particular lesson.
A well-constructed test is an instrument that provides an accurate measure of the test-taker's ability within a particular domain.

TESTING, ASSESSMENT & TEACHING


TESTING
Tests are prepared administrative procedures that occur at identifiable times in a curriculum.
When tested, learners know that their performance is being measured and evaluated.
When tested, learners muster all their faculties to offer peak performance.
Tests are a subset of assessment. They are only one among many procedures and tasks that teachers can ultimately use to assess students.
Tests are usually time-constrained (usually spanning a class period or at most several hours) and draw on a limited sample of behaviour.

ASSESSMENT
Assessment is an ongoing process that encompasses a much wider domain.
A good teacher never ceases to assess students, whether those assessments are incidental or intended.
Whenever a student responds to a question, offers a comment, or tries out a new word or structure, the teacher subconsciously makes an assessment of the student's performance.
Assessment includes testing. Assessment is more extended, and it includes many more components.

What about TEACHING?

For optimal learning to take place, learners must have opportunities to play with language without being formally graded.
Teaching sets up the practice games of language learning:
the opportunities for learners to listen, think, take risks, set goals, and process feedback from the teacher (coach),
and then recycle through the skills that they are trying to master.
During these practice activities, teachers are indeed observing students' performance and making various evaluations of each learner.
It can therefore be said that testing and assessment are subsets of teaching.

ASSESSMENT
Informal Assessment
Informal assessments are incidental, unplanned comments and responses.
Examples include: "Nice job!" "Well done!" "Good work!" "Did you say can or can't?" "Broke or break?", or putting a ☺ on some homework.
Classroom tasks are designed to elicit performance without recording results and making fixed judgements about a student's competence.
Examples of unrecorded assessment: marginal comments on papers, responding to a draft of an essay, advice about how to better pronounce a word, a suggestion for a strategy for compensating for a reading difficulty, and showing how to modify a student's note-taking to better remember the content of a lecture.

Formal Assessment
Formal assessments are exercises or procedures specifically designed to tap into a storehouse of skills and knowledge.
They are systematic, planned sampling techniques constructed to give Ts and sts an appraisal of student achievement.
They are the tournament games that occur periodically in the course of teaching.
It can be said that all tests are formal assessments, but not all formal assessment is testing.
Example 1: A student's journal or portfolio of materials can be used as a formal assessment of attainment of certain course objectives, but it is problematic to call those two procedures tests.
Example 2: A systematic set of observations of a student's frequency of oral participation in class.

THE FUNCTION OF AN ASSESSMENT
Formative Assessment
Evaluating students in the process of forming their competencies and skills, with the goal of helping them to continue that growth process.

It supports the ongoing development of the learner's language.
Example: When you give sts a comment or a suggestion, or call attention to an error, that feedback is offered to improve the learner's language ability.
Virtually all kinds of informal assessment are (or should be) formative.

Summative Assessment
It aims to measure, or summarize, what a student has grasped, and typically occurs at the end of a course.
It does not necessarily point the way to future progress.
Examples: final exams in a course and general proficiency exams.
All tests/formal assessments (quizzes, periodic review tests, midterm exams, etc.) tend to be summative.

IMPORTANT:
As far as summative assessment is concerned, in the aftermath of any test, students tend to think, "Whew! I'm glad that's over. Now I don't have to remember that stuff anymore!"
An ideal teacher should try to change this attitude among students.
A teacher should:
instill a more formative quality into his lessons
offer students an opportunity to convert tests into learning experiences.

TESTS
Norm-Referenced Tests
Each test-taker's score is interpreted in relation to a mean (average score), median (middle score), standard deviation (extent of variance in scores), and/or percentile rank, as in the sketch below.
The purpose is to place test-takers along a mathematical continuum in rank order.
Scores are usually reported back to the test-taker in the form of a numerical score (230 out of 300, 84%, etc.).
Typical of these tests are standardized tests like the SAT, TOEFL, ÜDS, KPDS, YDS, etc.
These tests are intended to be administered to large audiences, with results efficiently disseminated to test-takers.
They must have fixed, predetermined responses in a format that can be scored quickly at minimum expense.
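These descriptive statistics are easy to compute. Below is a minimal Python sketch (the scores are hypothetical totals out of 300) showing how a mean, median, standard deviation, and percentile rank of the kind reported above could be derived:

```python
from statistics import mean, median, pstdev

def percentile_rank(score, all_scores):
    """Percentage of test-takers scoring at or below the given score."""
    return 100.0 * sum(1 for s in all_scores if s <= score) / len(all_scores)

scores = [230, 248, 212, 260, 275, 241, 233, 255, 219, 266]  # hypothetical totals out of 300
print(f"mean = {mean(scores):.1f}, median = {median(scores)}, sd = {pstdev(scores):.1f}")
print(f"percentile rank of 255 = {percentile_rank(255, scores):.0f}%")
```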

Criterion-Referenced Tests
They are designed to give test-takers feedback, usually in the form of grades, on specific course or lesson objectives.
Tests that involve the sts of only one class, and are connected to a curriculum, are criterion-referenced tests.
Much time and effort on the part of the teacher are required to deliver useful, appropriate feedback to students.
The distribution of students' scores across a continuum may be of little concern as long as the instrument assesses the appropriate objectives.
As opposed to standardized large-scale testing, criterion-referenced testing, with its emphasis on classroom-based testing, is of more prominent interest to classroom teachers.

Approaches to Language Testing: A Brief History

Historically, language-testing trends have followed the trends of teaching methods.
During the 1950s: an era of behaviourism and special attention to contrastive analysis.
Testing focused on specific lang elements such as phonological, grammatical, and lexical contrasts between two languages.
During the 1970s and '80s: communicative theories were widely accepted, leading to a more integrative view of testing.
Today: test designers are trying to form authentic, valid instruments that simulate real-world interaction.

APPROACHES TO LANGUAGE TESTING

A) Discrete-Point Testing
Language can be broken down into its component parts, and those parts can be tested successfully.
Component parts: listening, speaking, reading, and writing.
Units of language (discrete points): phonology, graphology, morphology, lexicon, syntax, and discourse.
A language proficiency test should sample all 4 skills and as many linguistic discrete points as possible.
In the face of evidence that, in one study, each student scored differently in various skills depending on his background, country, and major field, Oller later admitted that the unitary trait hypothesis was wrong.

B) Integrative Testing
Language competence is a unified set of interacting abilities that cannot be tested separately.
Communicative competence is global and requires such integration that it cannot be captured in additive tests of grammar, reading, vocab, and other discrete points of lang.
Two types of tests are classic examples of integrative tests: the cloze test and dictation.
Unitary trait hypothesis: it suggests an indivisible view of language proficiency; that vocabulary, grammar, phonology, the 4 skills, and other discrete points of lang cannot be disentangled from each other in language performance.

Cloze Test:
Cloze test results are good measures of overall proficiency.
The ability to supply appropriate words in blanks requires a number of abilities that lie at the heart of competence in a language: knowledge of vocabulary, grammatical structure, discourse structure, and reading skills and strategies.
It was argued that successful completion of cloze items taps into all of those abilities, which were said to be the essence of global language proficiency. A sketch of how such a test is built follows.
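In the classic fixed-ratio format, every nth word is deleted from the passage. A minimal Python sketch of that procedure (the passage, the ratio of 7, and the blank format are illustrative assumptions, not prescribed values):

```python
import re

def make_cloze(text: str, n: int = 7):
    """Replace every nth word with a numbered blank; return cloze text and answer key."""
    words = text.split()
    answers = []
    for i in range(n - 1, len(words), n):
        answers.append(re.sub(r"\W", "", words[i]))  # strip punctuation from the key
        words[i] = f"__({len(answers)})__"
    return " ".join(words), answers

passage = ("A cloze test is created by deleting words from a passage at fixed "
           "intervals and asking learners to restore them from context.")
cloze, key = make_cloze(passage, n=7)
print(cloze)
print(key)
```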
Dictation
Essentially, learners listen to a passage of 100 to 150 words read aloud by an administrator (or audiotape) and write what they hear, using correct spelling.
Supporters argue that dictation is an integrative test because success on a dictation requires careful listening, reproduction in writing of what is heard, efficient short-term memory, and, to an extent, some expectancy rules to aid the short-term memory.

C) Communicative Language Testing (a recent approach, after the mid-1980s)

What does it criticise?

In order for a particular language test to be useful for its intended purposes, test performance must correspond in demonstrable ways to language use in non-test situations.
Integrative tests such as cloze only tell us about a candidate's linguistic competence. They do not tell us anything directly about a student's performance ability. (Knowledge about a language, not the use of language.)
Any suggestion?
A quest for authenticity, as test designers centered on communicative performance.
The supporters emphasized the importance of strategic competence (the ability to employ communicative strategies to compensate for breakdowns as well as to enhance the rhetorical effect of utterances) in the process of communication.
Any problem in using this approach?
Yes, communicative testing presented challenges to test designers, because they had to identify the real-world tasks that language learners were called upon to perform.
It was clear that the contexts for those tasks were extraordinarily varied and that the sampling of tasks for any one assessment procedure needed to be validated by what language users actually do with language.
As a result:
The assessment field became more and more concerned with the authenticity of tasks and the genuineness of texts.

D) Performance-Based Assessment
Performance-based assessment of language typically involves oral production, written production, open-ended responses, integrated performance (across skill areas), group performance, and other interactive tasks.
Any problems?
It is time-consuming and expensive, but those extra efforts pay off in more direct testing, because sts are assessed as they perform actual or simulated real-world tasks.
The advantage of this approach?
Higher content validity is achieved because learners are measured in the process of performing the targeted linguistic acts.
Importantly, performance-based assessment means that Ts should rely a little less on formally structured tests and a little more on evaluation while sts are performing various tasks.
In performance-based assessment:
Interactive tests (speaking, requesting, responding, etc.) are IN; paper-and-pencil tests are OUT.
Result: test tasks can approach the authenticity of real-life language use.

CURRENT ISSUES IN CLASSROOM TESTING

The design of communicative, performance-based assessment continues to challenge both assessment experts and classroom teachers.
There are three issues that are helping to shape our current understanding of effective assessment:
the effect of new theories of intelligence on the testing industry;
the advent of what has come to be called "alternative assessment";
the increasing popularity of computer-based testing.
New Views on Intelligence
In the past:
Intelligence was once viewed strictly as the ability to perform linguistic and logical-mathematical problem solving.
For many years, we lived in a world of standardized, norm-referenced tests that are timed, in a multiple-choice format, consisting of a multiplicity of logic-constrained items, many of which are inauthentic.
We relied on timed, discrete-point, analytical tests in measuring language, and we were forced to stay within the limits of objectivity and give impersonal feedback.

Recently, conceptions of intelligence have broadened to include:
spatial intelligence
musical intelligence
bodily-kinesthetic intelligence
interpersonal intelligence
intrapersonal intelligence
EQ (Emotional Quotient), which underscores the role of emotions in our cognitive processing.
Those who manage their emotions tend to be more capable of fully intelligent processing, because anger, grief, resentment, and other feelings can easily impair peak performance in everyday tasks as well as higher-order problem solving.
The intuitive appeal of these conceptualizations of intelligence infused the 1990s with a sense of both freedom and responsibility in our testing agenda.
Our new challenge was to test interpersonal, creative, communicative, interactive skills, and in doing so to place some trust in our subjectivity and intuition.

Traditional and Alternative Assessment

Traditional Assessment              Alternative Assessment
One-shot, standardized exams        Continuous long-term assessment
Timed, multiple-choice format       Untimed, free-response format
Decontextualized test items         Contextualized communicative tests
Scores suffice for feedback         Individualized feedback and washback
Norm-referenced scores              Criterion-referenced scores
Focus on the "right" answer         Open-ended, creative answers
Summative                           Formative
Oriented to product                 Oriented to process
Non-interactive process             Interactive process
Fosters extrinsic motivation        Fosters intrinsic motivation

IMPORTANT
It is difficult to draw a clear line of distinction between traditional and alternative assessment.
Many forms of assessment fall in between the two, and some combine the best of both.
More time and higher institutional budgets are required to administer and score assessments that presuppose more subjective evaluation, more individualization, and more interaction in the process of offering feedback.
But the payoff of alternative assessment comes with more useful feedback to students, the potential for intrinsic motivation, and ultimately a more complete description of a student's ability.

Computer-Based Testing
Some computer-based tests are small-scale. Others are standardized, large-scale tests (e.g. TOEFL) in which thousands of test-takers are involved.
A type of computer-based test, the Computer-Adaptive Test (CAT), is also available.
In a CAT, the test-taker sees only one question at a time, and the computer scores each question before selecting the next one.
Test-takers cannot skip questions, and, once they have entered and confirmed their answers, they cannot return to questions. A sketch of this selection logic follows.
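A minimal Python sketch of this adaptive selection logic, under simplified assumptions: a 0-1 difficulty scale and fixed step-size updates. Real CATs typically use item-response-theory models, not this toy rule.

```python
import random

def run_cat(item_pool, answer_fn, num_items=10):
    """Toy computer-adaptive test: pick the unused item nearest the current
    ability estimate, score it at once, and never allow a return to it."""
    ability = 0.5                      # running estimate on a 0-1 scale
    administered = []
    for _ in range(num_items):         # assumes num_items <= len(item_pool)
        item = min((i for i in item_pool if i["id"] not in administered),
                   key=lambda i: abs(i["difficulty"] - ability))
        administered.append(item["id"])
        correct = answer_fn(item)      # scored immediately; no skipping back
        ability = min(max(ability + (0.1 if correct else -0.1), 0.0), 1.0)
    return ability, administered

# hypothetical usage: 20 items with difficulties spread evenly over 0-1
pool = [{"id": n, "difficulty": n / 19} for n in range(20)]
print(run_cat(pool, lambda item: random.random() > item["difficulty"]))
```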
Advantages of Computer-Based Testing:
o Classroom-based testing
o Self-directed testing on various aspects of a lang (vocabulary, grammar, discourse, etc.)
o Practice for upcoming high-stakes standardized tests
o Some individualization, in the case of CATs
o Electronic scoring for rapid reporting of results
Disadvantages of Computer-Based Testing:
Lack of security and the possibility of cheating in unsupervised computerized tests.
Home-grown quizzes may be mistaken for validated assessments.
Open-ended responses are less likely to appear because of the need for human scorers.
The human interactive element is absent.

An Overall Summary
Tests
Assessment is an integral part of the teaching-learning cycle.
In an interactive, communicative curriculum, assessment is almost constant.
Tests can provide authenticity, motivation, and feedback to the learner.
Tests are essential components of a successful curriculum and learning process.
Assessments
Periodic assessments can increase motivation as milestones of student progress.
Appropriate assessments aid in the reinforcement and retention of information.
Assessments can confirm strengths and pinpoint areas needing further work.
Assessments provide a sense of periodic closure to modules within a curriculum.
Assessments promote sts' autonomy by encouraging self-evaluation of progress.
Assessments can spur learners to set goals for themselves.
Assessments can aid in evaluating teaching effectiveness.

Decide whether the following statements are TRUE or FALSE.

1. It's possible to create authentic and motivating assessment to offer constructive feedback to the sts. ----------
2. All tests should offer the test-takers some kind of measurement or result. ----------
3. Performance-based tests measure test-takers' knowledge about language. ----------
4. Tests are the best tools to assess students. ----------
5. Assessment and testing are synonymous terms. ----------
6. Ts' incidental and unplanned comments and responses to sts are an example of formal assessment. ----------
7. Most of our classroom assessment is summative assessment. ----------
8. Formative assessment always points toward future formation of learning. ----------
9. The distribution of sts' scores across a continuum is a concern in norm-referenced tests. ----------
10. Criterion-referenced testing has more instructional value than norm-referenced testing for classroom teachers. ----------

Answers:
1. TRUE
2. TRUE
3. FALSE (They are designed to test actual use of lang, not knowledge about lang.)
4. FALSE (We cannot say they are the best, but they are one of many useful devices to assess sts.)
5. FALSE (They are not.)
6. FALSE (They are informal assessment.)
7. FALSE (formative assessment)
8. TRUE
9. TRUE
10. TRUE

CHAPTER 2
PRINCIPLES OF LANGUAGE
ASSESSMENT

There are five criteria for evaluating ("testing") a test:

1. Practicality 2. Reliability 3. Validity 4. Authenticity 5. Washback
1. PRACTICALITY
A practical test
is not excessively expensive,
stays within appropriate time constraints,
is relatively easy to administer, and
has a scoring/evaluation procedure that is specific and time-efficient.
For a test to be practical,
administrative details should clearly be established before the test,
sts should be able to complete the test reasonably within the set time frame,
the test should be able to be administered smoothly, without procedural complications,
all materials and equipment should be ready,
the cost of the test should be within budgeted limits,
the scoring/evaluation system should be feasible in the teacher's time frame, and
methods for reporting results should be determined in advance.

2. RELIABILITY
A reliable test is consistent and dependable.
The issue of the reliability of a test may best be addressed by considering a number of factors that may contribute to the unreliability of a test.
Consider the following possible sources of fluctuation:
in the student (Student-Related Reliability),
in scoring (Rater Reliability),
in test administration (Test Administration Reliability), and
in the test itself (Test Reliability).

Student-Related Reliability:
Temporary illness, fatigue, a bad day, anxiety, and other physical or psychological factors may make an observed score deviate from one's true score.
A test-taker's "test-wiseness," or strategies for efficient test-taking, can also be included in this category.

Rater Reliability:
Human error, subjectivity, lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases may enter into the scoring process.
Inter-rater unreliability occurs when two or more scorers yield inconsistent scores on the same test (see the sketch below).
Intra-rater unreliability stems from unclear scoring criteria, fatigue, bias toward particular "good" and "bad" students, or simple carelessness.
One solution to such intra-rater unreliability is to read through about half of the tests before rendering any final scores or grades, then to recycle back through the whole set of tests to ensure an even-handed judgment.
The careful specification of an analytical scoring instrument can increase rater reliability.
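One common way to check inter-rater reliability is to correlate two raters' scores on the same set of papers and to compute their rate of exact agreement. A minimal Python sketch with hypothetical essay scores (statistics.correlation requires Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

rater_a = [4, 3, 5, 2, 4, 3, 5, 4]   # hypothetical scores from two raters
rater_b = [4, 2, 5, 3, 4, 3, 4, 4]   # on the same eight essays

r = correlation(rater_a, rater_b)    # consistency of the two rank orderings
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"inter-rater correlation = {r:.2f}, exact agreement = {agreement:.0%}")
```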
Test Administration Reliability:
Unreliability may also result from the conditions in which the test is administered:
street noise, photocopying variations, poor light, temperature, and uncomfortable desks and chairs.
Test Reliability:
Sometimes the nature of the test itself can cause measurement errors.
Timed tests may discriminate against sts who do not perform well under a time limit.
Poorly written test items may be a further source of test unreliability.

3. VALIDITY
The extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment.

How is the validity of a test established?

There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in support.
It may be appropriate to examine the extent to which a test calls for performance that matches that of the course or unit of study being tested.
In other cases we may be concerned with how well a test determines whether or not students have reached an established set of goals or level of competence.
It could be appropriate to study statistical correlations with other related but independent measures.
Other concerns about a test's validity may focus on the consequences of a test, beyond measuring the criteria themselves, or even on the test-taker's perception of validity.
We will look at these five types of evidence below.

Content Validity:
If a test requires the test-taker to perform the behaviour that is being measured, it can claim content-related evidence of validity, often popularly referred to as content validity.
If you want to assess a person's ability to speak the TL, asking sts to answer paper-and-pencil multiple-choice questions requiring grammatical judgements does not achieve content validity.
For content validity to be achieved, the following conditions should hold:
Classroom objectives should be identified and appropriately framed. The first measure of an effective classroom test is the identification of objectives.
Lesson objectives should be represented in the form of test specifications.
A test should have a structure that follows logically from the lesson or unit you are testing.
If you clearly perceive the performance of test-takers as reflective of the classroom objectives, then you can argue that content validity has probably been achieved.
To understand content validity, consider the difference between direct and indirect testing.
Direct testing involves the test-taker in actually performing the target task.
Indirect testing involves performing not the target task itself, but a task related to it in some way.
Direct testing is the most feasible way to achieve content validity in assessment.

Criterion-Related Validity:
It examines the extent to which the criterion of the test has actually been achieved.
For example, a classroom test designed to assess a point of grammar in communicative use will have criterion validity if the test scores are corroborated either by observed subsequent behavior or by other communicative measures of the grammar point in question.
Criterion-related evidence usually falls into one of two categories:
Concurrent validity:
A test has concurrent validity if its results are supported by other concurrent performance beyond the assessment itself.
For example, the validity of a high score on the final exam of a foreign language course will be substantiated by actual proficiency in the language.
Predictive validity:
The assessment criterion in such cases is not to measure concurrent ability but to assess (and predict) a test-taker's likelihood of future success.
For example, the predictive validity of an assessment becomes important in the case of placement tests, language aptitude tests, and the like.

Construct Validity:
Every issue in language learning and teaching involves theoretical constructs.
In the field of assessment, construct validity asks, "Does this test actually tap into the theoretical construct as it has been identified?" (That is, does the test carry the structural features needed to test the subject or skill it sets out to test?)
Imagine that you have been given a procedure for conducting an oral interview. The scoring analysis for the interview includes several factors in the final score: pronunciation, fluency, grammatical accuracy, vocabulary use, and sociolinguistic appropriateness. The justification for these five factors lies in a theoretical construct that claims those factors to be major components of oral proficiency. So if you were asked to conduct an oral proficiency interview that evaluated only pronunciation and grammar, you could be justifiably suspicious about the construct validity of that test.
Large-scale standardized tests are not very satisfactory in terms of construct validity, because for the sake of practicality (that is, for both time and cost reasons) they cannot measure all the language skills that ought to be measured. For example, the absence of an oral production section in the TOEFL long stood as a major obstacle to its construct validity.

Consequential Validity:
Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test's interpretation and use.
McNamara (2000, p. 54) cautions against test results that may reflect socioeconomic conditions such as opportunities for coaching. For example, only some families can afford coaching, and children with more highly educated parents get help from their parents.

Teachers should consider the effect of assessments on students' motivation, subsequent performance in a course, independent learning, study habits, and attitude toward school work.

Face Validity:
Face validity is the degree to which a test "looks right," and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of test-takers.
Face validity means that the students perceive the test to be valid.
Face validity asks the question, "Does the test, on the face of it, appear from the learner's perspective to test what it is designed to test?"
Face validity is not something that can be empirically tested by a teacher or even by a testing expert. It depends on the subjective evaluation of the test-taker.
A classroom test is not the time to introduce new tasks.
If a test samples the actual content of what the learner has achieved or expects to achieve, face validity will be more likely to be perceived.
Content validity is a very important ingredient in achieving face validity.
Students will generally judge a test to be face valid if directions are clear, the structure of the test is organized logically, its difficulty level is appropriately pitched, the test has no surprises, and timing is appropriate.
To give an assessment procedure that is "biased for best," a teacher offers students appropriate review and preparation for the test, suggests strategies that will be beneficial, and structures the test so that the best students will be modestly challenged and the weaker students will not be overwhelmed.

4. AUTHENTICITY
In an authentic test,
the language is as natural as possible,
items are as contextualized as possible,
topics and situations are interesting, enjoyable and/or humorous,
some thematic organization, such as through a story line or episode, is provided, and
tasks represent real-world tasks.
Reading passages are selected from real-world sources that test-takers are likely to have encountered or will encounter.
Listening comprehension sections feature natural language with hesitations, white noise, and interruptions.
More and more tests offer items that are episodic, in that they are sequenced to form meaningful units, paragraphs, or stories.

5. WASHBACK
Washback includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment.
Informal performance assessment is by nature more likely to have built-in washback effects, because the teacher is usually providing interactive feedback.
Formal tests can also have positive washback, but they provide no washback if the students receive a simple letter grade or a single overall numerical score.
Tests should serve as learning devices through which washback is achieved.
Sts' incorrect responses can become windows of insight into further work.
Their correct responses need to be praised, especially when they represent accomplishments in a student's interlanguage.
Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others.
To enhance washback, comment generously and specifically on test performance.
Washback implies that students have ready access to the teacher to discuss the feedback and evaluation he has given.
Teachers can raise the washback potential by asking students to use test results as a guide to setting goals for their future effort.

What is washback?

In general terms: the effect of testing on teaching and learning.
In large-scale assessment: the effects that tests have on instruction in terms of how students prepare for the test.
In classroom assessment: the information that washes back to students in the form of useful diagnoses of strengths and weaknesses.

What does washback enhance?
Intrinsic motivation, autonomy, language ego, interlanguage, self-confidence, and strategic investment.

What should teachers do to enhance washback?
Comment generously and specifically on test performance.
Respond to as many details as possible.
Praise strengths.
Criticize weaknesses constructively.
Give strategic hints to improve performance.

Decide whether the following statements are TRUE or FALSE.


1. An expensive test is not practical.
2. One of the sources of unreliability of a test is the school.
3. Sts, raters, the test, and its administration may affect a test's reliability.
4. In indirect tests, students do not actually perform the target task.
5. If students are aware of what is being tested when they take a test, and think that the questions are appropriate, the test has face validity.
6. Face validity can be tested empirically.
7. Diagnosing strengths and weaknesses of students in language learning is a facet of washback.
8. One way of achieving authenticity in testing is to use simplified language.
Answers: 1. TRUE 2. FALSE 3. TRUE 4. TRUE 5. TRUE 6. FALSE 7. TRUE 8. FALSE

Decide which type of validity each sentence belongs to.

1. It is based on subjective judgment. ----------
2. It questions the accuracy of measuring the intended criteria. ----------
3. It appears to measure the knowledge and abilities it claims to measure. ----------
4. It measures whether the test meets the classroom objectives. ----------
5. It requires the test to be based on a theoretical background. ----------
6. Washback is part of it. ----------
7. It requires the test-taker to perform the behavior being measured. ----------
8. The students (test-takers) think they are given enough time to do the test. ----------
9. It assesses a test-taker's likelihood of future success (e.g. placement tests). ----------
10. The students' psychological mood may affect it negatively or positively. ----------
11. It includes the consideration of the test's effect on the learner. ----------
12. Items of the test do not seem to be complicated. ----------
13. The test covers the objectives of the course. ----------
14. The test has clear directions. ----------

Answers: 1. Face 2. Consequential 3. Face 4. Content 5. Construct 6. Consequential 7. Content 8. Face 9. Criterion-related 10. Consequential 11. Consequential 12. Face 13. Content 14. Face

Decide which type of reliability each sentence could be related to.

1. There are ambiguous items.
2. The student is anxious.
3. The tape is of bad quality.
4. The teacher is tired but continues scoring.
5. The test is too long.
6. The room is dark.
7. The student has had an argument with the teacher.
8. The scorers interpret the criteria differently.
9. There is a lot of noise outside the building.

Answers:
1. Test reliability
2. Student-related reliability
3. Test administration reliability
4. Rater reliability
5. Test reliability
6. Test administration reliability
7. Student-related reliability
8. Rater reliability
9. Test administration reliability

CHAPTER 3
DESIGNING CLASSROOM
LANGUAGE TESTS

We examine test types, and learn how to design tests and revise existing ones.
To start the process of designing tests, we will ask some critical questions.
Five questions should form the basis of your approach to designing tests for your class.
Question 1: What is the purpose of the test?
Why am I creating this test?
For an evaluation of overall proficiency? (Proficiency Test)
To place students into a course? (Placement Test)
To measure achievement within a course? (Achievement Test)
Once you have established the major purpose of a test, you can determine its objectives.
Question 2: What are the objectives of the test?
What specifically am I trying to find out?
What language abilities are to be assessed?
Question 3: How will the test specifications reflect both purpose and objectives?
When a test is designed, the objectives should be incorporated into a structure that appropriately weights the various competencies being assessed.

Question 4: How will test tasks be selected and the separate items arranged?
The tasks need to be practical.
They should also achieve content validity by presenting tasks that mirror those of the course being assessed.
They should be able to be evaluated reliably by the teacher or scorer.
The tasks themselves should strive for authenticity, and the progression of tasks ought to be biased for best performance.
Question 5: What kind of scoring, grading, and/or feedback is expected?
Tests vary in the form and function of feedback, depending on their purpose.
For every test, the way results are reported is an important consideration.
Under some circumstances a letter grade or a holistic score may be appropriate; other circumstances may require that a teacher offer substantive washback to the learner.

TEST TYPES
Defining your purpose will help you choose the right kind of test, and
it will also help you to focus on the specific objectives of the test.
Below are the test types to be examined:
1. Language Aptitude Tests
2. Proficiency Tests
3. Placement Tests
4. Diagnostic Tests
5. Achievement Tests

1. Language Aptitude Tests

They predict a person's success prior to exposure to the second language.
An aptitude test is designed to measure capacity or general ability to learn a FL.
They are designed to apply to the classroom learning of any language.
Two standardized aptitude tests have been used in the US:
the Modern Language Aptitude Test (MLAT), and
the Pimsleur Language Aptitude Battery (PLAB).
Tasks in the MLAT include: number learning, phonetic script, spelling clues, words in sentences, and paired associates.
There's no unequivocal evidence that language aptitude tests predict communicative success in a language.
Any test that claims to predict success in learning a language is undoubtedly flawed, because we now know that with appropriate self-knowledge and active strategic involvement in learning, everyone can succeed eventually.

2. Proficiency Tests
A proficiency test is not limited to any one course, curriculum, or single skill in the language; rather, it tests overall ability.
It typically includes standardized multiple-choice items on grammar, vocabulary, reading comprehension, and aural comprehension.
Sometimes a sample of writing is added, and more recent tests also include oral production.
Such tests often have content validity weaknesses.
Proficiency tests are almost always summative and norm-referenced.
They are usually not equipped to provide diagnostic feedback.
Their role is to accept or deny someone's passage into the next stage of a journey.
The TOEFL is a typical standardized proficiency test.
Creating and validating them with research is a time-consuming and costly process.
Choosing one of a number of commercially available proficiency tests is a far more practical method for classroom teachers.

3. Placement Tests

The objective of a placement test is to correctly place sts into a course or level.
Certain proficiency tests can act in the role of placement tests.
A placement test usually includes a sampling of the material to be covered in the various courses in a curriculum.
Sts should find the test neither too easy nor too difficult, but challenging.
The ESL Placement Test (ESLPT) at San Francisco State University has three parts:
Part 1: sts read a short article and then write a summary essay.
Part 2: sts write a composition in response to an article.
Part 3: multiple-choice; sts read an essay and identify grammar errors in it.
The ESLPT is more authentic but less practical, because human evaluators are required for the first two parts.
Reliability problems are present but are mitigated by conscientious training of evaluators.
What is lost in practicality and reliability is gained in the diagnostic information that the ESLPT provides.

4. Diagnostic Tests
A diagnostic test is designed to diagnose specified aspects of a language.
A diagnostic test can help a student become aware of errors and encourage the adoption of appropriate compensatory strategies.
A test of pronunciation might diagnose the phonological features that are difficult for sts and should become part of a curriculum. Such tests offer a checklist of features for the administrator to use in pinpointing difficulties.
A writing diagnostic elicits a writing sample from sts that allows Ts to identify those rhetorical and linguistic features on which the course needs to focus special attention.
A diagnostic test of oral production was created by Clifford Prator (1972) to accompany a manual of English pronunciation. In the test,
test-takers are directed to read a 150-word passage while they are tape-recorded.
The test administrator then refers to an inventory of phonological items for analyzing a learner's production.
After multiple listenings, the administrator produces a checklist of errors in five categories:
stress and rhythm, intonation, vowels, consonants, and other factors.

This information helps Ts make decisions about which aspects of English phonology to address.

5. Achievement Tests
An achievement test is related directly to lessons, units, or even a total curriculum.
Achievement tests should be limited to the particular material addressed in a curriculum within a particular time frame, and should be offered after a course has focused on the objectives in question.
There's a fine line of difference between a diagnostic test and an achievement test.
Achievement tests analyze the extent to which students have acquired language features that have already been taught. (They analyze the past.)
Diagnostic tests should elicit information on what students need to work on in the future. (They analyze what lies ahead.)
The primary role of an achievement test is to determine whether course objectives have been met, and the appropriate knowledge and skills acquired, by the end of a period of instruction.
They are often summative because they are administered at the end of a unit or term.
But effective achievement tests can serve as useful washback by showing students their errors and helping them analyze their weaknesses and strengths.
Achievement tests range from five- or ten-minute quizzes to three-hour final examinations, with an almost infinite variety of item types.

Practical steps in constructing classroom tests:

A) Assessing Clear, Unambiguous Objectives
Before giving a test,
examine the objectives for the unit you're testing.
Your first task in designing a test, then, is to determine appropriate objectives. For example:
"Students will recognize and produce tag questions, with the correct grammatical form and final intonation pattern, in simple social conversations."
B) Drawing Up Test Specifications
Test specifications will simply comprise
a) a broad outline of the test,
b) what skills you will test, and
c) what the items will look like.
Test specifications would then be drawn up for the objective stated above:
"Students will recognize and produce tag questions, with the correct grammatical form and final intonation pattern, in simple social conversations."

C) Devising Test Tasks

In devising test tasks, consider how students will perceive them (face validity), the extent to which authentic language and contexts are present, and the potential difficulty caused by cultural schemata.
In revising your draft, you should ask yourself some important questions:
1. Are the directions to each section absolutely clear?
2. Is there an example item for each section?
3. Does each item measure a specified objective?
4. Is each item stated in clear, simple language?
5. Does each multiple-choice item have appropriate distractors; that is, are the wrong items clearly wrong and yet sufficiently alluring that they aren't ridiculously easy?
6. Is the difficulty of each item appropriate for your students?
7. Is the language of each item sufficiently authentic?
8. Do the sum of the items and the test as a whole adequately reflect the learning objectives?
In the final revision of your test:
time yourself as you take the test;
if the test should be shortened or lengthened, make the necessary adjustments;
make sure your test is neat and uncluttered on the page;
if there is an audio component, make sure that the script is clear.

D) Designing Multiple-Choice Test Items

There are a number of weaknesses in multiple-choice items:
The technique tests only recognition knowledge.
Guessing may have a considerable effect on test scores.
The technique severely restricts what can be tested.
It is very difficult to write successful items.
Washback may be harmful.
Cheating may be facilitated.

However, two principles support multiple-choice formats: practicality and reliability.

Some important jargon in Multiple-Choice Items:

Multiple-choice items are all receptive, or selective; that is, the test-taker chooses from a set of responses rather than creating a response.
Other receptive item types include true-false questions and matching lists.
Every multiple-choice item has a stem, which presents a stimulus, and several options or alternatives to choose from.
One of those options, the key, is the correct response; the others serve as distractors (see the sketch below).
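The stem/key/distractor anatomy can be made concrete with a small data structure. A minimal Python sketch (the tag-question item is a hypothetical example echoing the objective stated earlier):

```python
from dataclasses import dataclass

@dataclass
class MultipleChoiceItem:
    stem: str            # the stimulus the test-taker responds to
    options: list[str]   # all alternatives presented
    key: int             # index of the correct response in options

    @property
    def distractors(self) -> list[str]:
        return [o for i, o in enumerate(self.options) if i != self.key]

item = MultipleChoiceItem(
    stem="She has lived here for ten years, ____?",
    options=["hasn't she", "doesn't she", "isn't she", "didn't she"],
    key=0,
)
print(item.distractors)   # the three wrong-but-alluring alternatives
```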

IMPORTANT!!!
Consider the following four guidelines for designing multiple-choice items for both classroom-based and large-scale situations:
1. Design each item to measure a specific objective. (For example, do not try to measure knowledge of modals and knowledge of articles in the same item.)
2. State both stem and options as simply and directly as possible. Do not use superfluous words; another rule of succinctness is to remove needless redundancy from your options.
3. Make certain that the intended answer is clearly the only correct one. Eliminating unintended possible answers is often the most difficult problem of designing multiple-choice items. With only a minimum of context in each stem, a wide variety of responses may be perceived as correct.
4. Use item indices to accept, discard, or revise items: the appropriate selection and arrangement of suitable multiple-choice items on a test can best be accomplished by measuring items against three indices: a) item facility (IF), or item difficulty; b) item discrimination (ID), or item differentiation; and c) distractor analysis.

a) Item facility (IF) is the extent to which an item is easy or difficult for the proposed group of test-takers.
For example, if 13 out of 20 students answer an item correctly, IF = 13/20 = 0.65 (65%). Values between roughly 15% and 85% are considered acceptable (see the sketch below).
Two good reasons for including a very easy item (85% or higher) are to build in some affective feelings of success among lower-ability students and to serve as warm-up items. And very difficult items can provide a challenge to the highest-ability sts.
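The computation itself is a one-liner; a minimal Python sketch using the 13-of-20 example from the text:

```python
def item_facility(num_correct: int, num_takers: int) -> float:
    """IF = proportion of test-takers answering the item correctly."""
    return num_correct / num_takers

print(f"IF = {item_facility(13, 20):.2f}")  # 0.65, inside the rough 0.15-0.85 band
```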
b) Item discrimination (ID) is the extent to which an item differentiates between high- and low-ability test-takers.
An item on which high-ability students and low-ability students score equally well would have poor ID, because it does not discriminate between the two groups.
An item that garners correct responses from most of the high-ability group and incorrect responses from most of the low-ability group has good discrimination power.
For example, divide 30 students, ranked from best to worst, into three equal groups. Suppose the 10 highest-scoring and the 10 lowest-scoring students answered one item as follows:

Item #                            Correct   Incorrect
High-ability students (top 10)       7          3
Low-ability students (bottom 10)     2          8

ID = (7 - 2) / 10 = 0.50. The result tells us that the item has a moderate level of ID.
A highly discriminating item would approach 1.0, and an item with no discriminating power at all would score zero. In most cases, you would want to discard an item that scored near zero.
No absolute rule governs the establishment of acceptable and unacceptable ID indices. A sketch of the calculation follows.
c) Distractor efficiency (DE) is the extent to which the distractors lure a sufficient number of test-takers, especially lower-ability ones, and the extent to which those responses are somewhat evenly distributed across all distractors.
Example (C is the correct response):

Choices                      A    B    C*   D    E
High-ability students (10)   0    1    7    0    2
Low-ability students (10)    3    5    2    0    0

The item might be improved in two ways:
a) Distractor D doesn't fool anyone. Therefore it probably has no utility. A revision might provide a distractor that actually attracts a response or two.
b) Distractor E attracts more responses (2) from the high-ability group than from the low-ability group (0). Why are good students choosing this one? Perhaps it includes a subtle reference that entices the high group but is over the heads of the low group, so the latter sts don't even consider it.
The other two distractors (A and B) seem to be fulfilling their function of attracting some attention from the lower-ability students.
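Tallying each group's choices per option makes dead distractors (like D) and miscuing ones (like E) stand out at a glance. A minimal Python sketch; the two response strings are hypothetical answer sheets that reproduce the counts in the table above:

```python
from collections import Counter

def distractor_table(high_responses, low_responses, options="ABCDE"):
    """Print how often each group chose each option."""
    high, low = Counter(high_responses), Counter(low_responses)
    for opt in options:
        print(f"{opt}: high={high.get(opt, 0):2d}  low={low.get(opt, 0):2d}")

# hypothetical answer sheets matching the table above (C is the key)
distractor_table(list("CCCCCCCEEB"), list("CCABBBBBAA"))
```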

SCORING, GRADING AND GIVING FEEDBACK

A) Scoring
As you design a test, you must consider how the test will be scored and graded.
Your scoring plan reflects the relative weight that you place on each section and on the items in each section.
The skills you consider most important should carry the greatest weight, for example:
oral production 30%, listening 30%, reading 20%, and writing 20% (see the sketch below).
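A minimal Python sketch of such a weighting scheme (the raw section percentages are hypothetical):

```python
weights = {"oral production": 0.30, "listening": 0.30, "reading": 0.20, "writing": 0.20}
raw_percent = {"oral production": 80, "listening": 70, "reading": 90, "writing": 60}  # hypothetical

total = sum(weights[s] * raw_percent[s] for s in weights)
print(f"weighted total = {total:.1f}/100")  # 75.0
```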
B) Grading
Grading doesn't mean just giving an A for 90-100. It's not that simple.
How you assign letter grades is a product of the country, culture, and context of the class:
institutional expectations (most of them unwritten),
explicit and implicit definitions of grades that you have set forth,
the relationship you have established with the class, and
sts' expectations engendered by previous tests and quizzes in the class.

C) Giving Feedback
Feedback should become beneficial washback. Here are some examples of feedback:
1. a letter grade
2. a total score
3. four subscores (speaking, listening, reading, writing)
4. for the listening and reading sections:
a. an indication of correct/incorrect responses
b. marginal comments
5. for the oral interview:
a. scores for each element being rated
b. a checklist of areas needing work
c. oral feedback after the interview
d. a post-interview conference to go over results
6. on the essay:
a. scores for each element being rated
b. a checklist of areas needing work
c. marginal and end-of-essay comments, suggestions
d. a post-test conference to go over the work
e. a self-assessment
7. on all or selected parts of the test, peer checking of results
8. a whole-class discussion of the results of the test
9. individual conferences with each student to review the whole test

Decide whether the following statements are TRUE or FALSE.


1. A language aptitude test measures a learner's future success in learning a FL.
2. Language aptitude tests are very common today.
3. A proficiency test is limited to a particular course or curriculum.
4. The aim of a placement test is to place a student into a particular level.
5. Placement tests have many varieties.
6. Any placement test can be used in a particular teaching program.
7. Achievement tests are related to classroom lessons, units, or curriculum.
8. A five-minute quiz can be an achievement test.
9. The first task in designing a test is to determine the test specifications.
Answers:
1. TRUE
2. FALSE
3. FALSE
4. TRUE
5. TRUE
6. FALSE (Not all placement tests suit every teaching program.)
7. TRUE
8. TRUE (Achievement tests range from five-minute quizzes to three-hour exams.)
9. FALSE (The first task is to determine appropriate objectives.)

Decide whether the following statements are TRUE or FALSE.


1. It is very easy to develop multiple-choice tests.
2. Multiple-choice tests are practical but not reliable.
3. Multiple-choice tests are time-saving in terms of scoring and grading.
4. Multiple-choice items are receptive.
5. Each multiple-choice item in a test should measure a specific objective.
6. The stem of a multiple-choice item should be as long as possible in order to help students understand the context.
7. If the Item Facility value is .10 (10%), it means the item is very easy.
8. The item discrimination index differentiates between high- and low-ability sts.
1. FALSE (It seems easy, but is not very easy.)
2. FALSE (They can be both practical and reliable.)
3. TRUE
4. TRUE
5. TRUE
6. FALSE (It should be short and to the point.)
7. FALSE (An item with an IF value of .10 is a very difficult one.)
8. TRUE

Chapter 4 STANDARDIZED TESTING

WHAT IS STANDARDIZATION?
A standardized test presupposes certain standard objectives or criteria that are held constant from one form of the test to another.
Standardized tests measure a broad band of competencies, not just one particular curriculum.
They are norm-referenced, and the main goal is to place sts in rank order.
Scholastic Aptitude Test (SAT): a college entrance exam.
The Graduate Record Exam (GRE): a test for entry into many graduate school programs.
Graduate Management Admission Test (GMAT) & Law School Aptitude Test (LSAT): tests that specialize in particular disciplines.
Test of English as a Foreign Language (TOEFL): produced by Educational Testing Service (ETS).
These tests are standardized because they specify a set of competencies for a given domain, and through a process of construct validation they program a set of tasks.
In general, standardized test items are in MC format.
They provide an objective means for determining correct and incorrect responses.
However, MC is not the only item type in standardized tests; human-scored tests of oral and written production are also involved.

ADVANTAGES AND DISADVANTAGES OF STANDARDIZED TESTS:
Advantages:
* Ready-made (Ts don't need to spend time preparing them)
* They can be administered to large numbers of sts within time constraints
* Easy to score, thanks to the MC format (computerized or hole-punched grid scoring)
* They have face validity
Disadvantages:
* Inappropriate use of tests
* Misunderstanding of the difference between direct and indirect testing


DEVELOPING A STANDARDIZED TEST:

Knowing how to develop a standardized test can be helpful for revising an existing test, adapting or expanding an existing test, or creating a smaller-scale standardized test. Three examples are used below:
(A) The Test of English as a Foreign Language (TOEFL): a general ability or proficiency test
(B) The English as a Second Language Placement Test (ESLPT), San Francisco State University (SFSU): a placement test at a university
(C) The Graduate Essay Test (GET), SFSU: a gate-keeping essay test

1. Determine the purpose and objectives of the test.

Standardized tests are expected to be valid and practical.
TOEFL
* To evaluate the English proficiency of people whose native language is not English.
* Colleges and universities in the US use the TOEFL score to admit or refuse international applicants for admission.
ESLPT
* To place already admitted sts at SFSU in an appropriate course in academic writing and oral production.
* To provide Ts with some diagnostic information about their sts.
GET
* To determine whether sts' writing ability is sufficient to permit them to enter graduate-level courses in their programs (it is offered at the beginning of each term).

2. Design test specifications.

TOEFL
The first step is to define the construct of language proficiency. After breaking language competence down into a subset of the four skills, each performance mode can be examined on a continuum of linguistic units (pronunciation, spelling, words, grammar).
The oral production section tests fluency and pronunciation by using imitation.
The listening section focuses on particular features of lang or on overall listening comprehension.
The reading section aims to test comprehension of long/short passages, single sentences, phrases, or words.
The writing section tests writing ability in an open-ended form (free composition), or it can be structured to elicit anything from correct spelling to discourse-level competence.
ESLPT
Designing test specs for the ESLPT was a simpler task: the purpose is placement, and construct validation consisted of an examination of the content of the ESL courses.
In the recent revision of the ESLPT, content and face validity were important theoretical issues; practicality and reliability in tasks and item response formats were equally important.
The specifications mirrored the reading-based, process-writing approach used in class.
GET
The specifications for the GET are the skills of writing grammatically and rhetorically acceptable prose on a topic, with clearly produced organization of ideas and logical development.

3. Design, select, and arrange test tasks/items.

TOEFL
Content coding: the skills and a variety of subject matter, without bias (the content must be universal and as neutral as possible).
Statistical characteristics: these include IF and ID.
Before administration, items are piloted and scientifically selected to meet difficulty specifications within each subsection, each section, and the test overall.
ESLPT
For the written parts, the main problems are:
a) selecting appropriate passages (they must conform to the standards of content validity),
b) providing appropriate prompts (they should fit the passages), and
c) processing data from pilot testing.
In the MC editing test, the easier first task is to choose an appropriate essay within which to embed errors. A more complicated task is to embed a specified number of errors from pre-determined error categories. (Ts can derive the categories from sts' previous errors in written work, and sts' errors can be used as distractors.)
GET
Topics should be appealing and capable of yielding the intended product: an essay with an organized, logical argument and conclusion. No pilot testing of prompts is conducted.
Be careful about the potential cultural effect on the numerous international students who must take the GET.

4. Make appropriate evaluations of different kinds of items.


- IF, ID, and distractor analysis may not be necessary for a classroom (one-time) test, but they are a must for standardized MC tests.
- For production responses, different forms of evaluation become important (i.e., practicality, reliability & facility):
*practicality: clarity of directions, timing of the test, ease of administration & how much time is required to score
*reliability: a major player in instances where more than one scorer is employed and, to a lesser extent, where a single scorer has to evaluate tests over long spans of time that could lead to deterioration of standards
*facility: key to valid and successful items. Unclear directions, complex lang, obscure topics, fuzzy data, or culturally biased information may lead to a higher level of difficulty.
GET
*No data are collected from sts on their perceptions, but the scorers have an opportunity to reflect on the validity of a given topic.

5. Specify scoring procedures and reporting formats.


TOEFL
-Scores are calculated and reported for
*three sections of TOEFL
*a total score
*a separate score
ESLPT
*It reports a score for each of the essay sections (each essay is read by 2 readers)
*The editing section is machine-scanned
*It provides data to place sts, plus diagnostic information
*Sts don't receive their essays back
GET
*Each GET is read by two trained readers, each giving a score from 1 to 4 (so combined scores range from 2 to 8)
*The recommended combined score of 6 is the threshold for allowing sts to pursue graduate-level courses
*If a st scores below 6, he or she must either repeat the test or take a remedial course
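The decision rule implied by these numbers can be sketched in a few lines; note that summing the two readers' 1-4 ratings to reach the threshold of 6 is an inference from the notes, not something they state explicitly.

    def get_decision(reader1: int, reader2: int, threshold: int = 6) -> str:
        # Each trained reader rates the essay 1-4, so combined scores run 2-8.
        combined = reader1 + reader2
        if combined >= threshold:
            return "may pursue graduate-level courses"
        return "must repeat the test or take a remedial course"

    print(get_decision(3, 3))  # may pursue graduate-level courses
    print(get_decision(2, 3))  # must repeat the test or take a remedial course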

6. Perform ongoing construct validation studies.


Any standardized test must be accompanied by systematic periodic
corroboration of its effectiveness and by steps toward its
improvement
TOEFL
*The latest study of the TOEFL examined the content characteristics of the test from a communicative perspective, based on current research in applied linguistics and language proficiency assessment.
ESLPT
*The development of the new ESLPT involved a lengthy process of both content and construct validation, along with such practical issues as scoring the written sections and creating a machine-scorable MC answer sheet.
GET
*There is no research to validate the GET itself. Administrators rely on research on university-level academic writing tests such as the TWE.
*Some criticism of the GET has come from international test-takers, who posit that the topics and time limits of the GET work to the disadvantage of writers whose native language is not English.

TOEFL
Primary market
U.S. universities and colleges for admission purposes


Type
Computer-based and paper-based
Response modes
Multiple-choice responses and essay
Time allocation
Up to 4 hours (CB); 3 hours (PB)
Specifications
CB: a listening section which includes dialogs, short conversations, academic discussions, and mini-lectures;
a structure section which tests formal language with two types of questions (completing incomplete sentences and identifying the one of four underlined words or phrases that is not acceptable in English);
a reading section which includes four to five passages on academic subjects with 10-14 questions for each passage;
a writing section which requires examinees to compose an essay on a given topic

MELAB
Primary market
U.S. and Canadian language programs and colleges; some worldwide
educational settings
Type
Paper-based
Response modes
Multiple-choice responses and essay
Time allocation
2.5 to 3.5 hours
Specifications
A 30-minute impromptu essay on a given topic;
a 25-minute multiple-choice listening comprehension test;
a 100-item, 75-minute multiple-choice test of grammar, cloze reading, vocabulary, and reading comprehension;
an optional oral interview

IELTS
Primary market
Australian, British, Canadian, and New Zealand academic institutions
and professional organizations and some American academic
institutions
Type
Computer-based for Reading and Writing sections; paper-based for
Listening and Speaking parts
Response modes
Multiple-choice responses, essay, and oral production
Time allocation
2 hours, 45 minutes
Specifications
A 60-minute reading section;
a 60-minute writing section;
a 30-minute listening section in four parts;
a 10- to 15-minute speaking test in five parts

TOEIC
Primary market
Worldwide; workplace settings
Type
Computer-based and paper-based
Response modes
Multiple-choice responses
Time allocation
2 hours
Specifications
A 100-item, approximately 45-minute listening administered by
audiocassette and which includes statements, questions, short
conversations, and short talks;
a 100-item, 75-minute reading which includes cloze sentences, error
recognition, and reading comprehension

CHAPTER 5: STANDARDS-BASED ASSESSMENT

Mid-20th Century

Standardized tests enjoyed unchallenged popularity and growth.
Standardized tests brought convenience, efficiency, and an air of empirical science.
Tests were considered to be a way of making reforms in education.
Quickly and cheaply assessing students became a political issue.
Late 20th Century
*There was a possible inequity: a disparity between the content of such tests and what Ts teach in their classes.
*The claims of the mid-20th century began to be questioned/criticised in all areas.
*Teachers were in the leading position among those making the challenges.
The Last 20 Years
*Educators became aware of weaknesses in standardized testing: the tests were not accurate measures of achievement and success, and they were not based on carefully framed, comprehensive, and validated standards of achievement.
*A movement started to establish standards for assessing students of all ages and subject-matter areas.
*There have been efforts to base standardised tests on clearly specified criteria for each content area being measured.

Criticism:
Some teachers claimed that those tests were unfair: there was a dissimilarity between the content & tasks of the tests & what they were teaching in their classes.
Solutions:
Becoming aware of these weaknesses, educators started to establish standards on which sts of all ages & subject-matter areas might be assessed.
Most state departments of education in the US have specified appropriate standards (criteria, objectives) for each grade level (pre-school to grade 12) and each content area (math, science, arts).
The construction of standards makes possible a concordance between standardized test specifications and curricular goals and objectives (for ESL, ESOL, ELD learners; the term LEP has been discarded because of the negative connotation of the word "limited").

ELD STANDARDS
In creating benchmarks for accountability, there is a tremendous responsibility to carry out a comprehensive study of a number of domains:
Categories of language: phonological, discourse, pragmatic, functional, and sociolinguistic elements.
Specification of what ELD students' needs are.
A realistic scope of standards to be included in the curriculum (the standards in the curriculum should be realistic).
Standards for teachers: qualifications, expertise, training (it sets standards for teachers).
A thorough analysis of the means available to assess student attainment of those standards (how we will assess what students have learned).

ELD ASSESSMENT
The development of standards obviously implies the responsibility for correctly assessing their attainment.
When it was found that the standardized tests of past decades were not in line with the newly developed standards, the interactive process of not only developing standards but also creating standards-based assessment started.
Specialists design, revise, and validate many tests.
The California English Language Development Test (CELDT) is a battery of instruments designed to assess attainment of ELD standards across grade levels (not publicly available).
A language and literacy assessment rubric collected samples of students' work; teachers' observations were recorded on scannable forms.
It provided useful data on students' performance in oral production, reading, and writing across different grades.

CASAS AND SCANS


CASAS (Comprehensive Adult Student Assessment System):
Designed to provide broadly based assessments of ESL curricula across the US.
It includes more than 80 standardized assessment instruments used to:
*place sts in programs *diagnose learners' needs
*monitor progress *certify mastery of functional skills
at higher levels of education (colleges, adult and language schools, the workplace).
SCANS (Secretary's Commission on Achieving Necessary Skills):
Outlines competencies necessary for language in the workplace. The competencies are acquired and maintained through training in:
basic skills (the 4 skills);
thinking skills (reasoning & problem solving);
personal qualities (self-esteem & sociability);
resources (allocating time, materials, staff, etc.);
interpersonal skills (teamwork, customer service, etc.);
information processing (evaluating data, organising files, etc.);
systems (understanding social and organizational systems);
technology use and application.

TEACHER STANDARDS (what teachers should be like)


Linguistics and language development
Culture and the interrelationship between language and culture
Planning and managing instruction
Consequences of standards-based and standardized testing
Positive
High level of practicality and reliability
Provides insights into academic performance
Accuracy in placing a large number of test-takers onto a norm-referenced scale
Ongoing construct validation studies
Negative
They involve a number of test biases
A small but significant number of test-takers are not assessed fairly, nor are they assessed accurately
Fosters extrinsic motivation
Multiple intelligences are not considered
There is a danger of test-driven learning and teaching
In general, performance is not directly assessed

Test bias
Standardized tests involve many test biases (lang, culture, race, gender, learning styles).
The National Center for Fair and Open Testing has collected claims of test bias from teachers, parents, students, and legal consultants (e.g., in reading texts and listening stimuli).
Standardised tests promote logical-mathematical and verbal-linguistic intelligences to the virtual exclusion of other contextualised, integrative intelligences. (Some learners may need to be assessed with interviews, portfolios, samples of work, demonstrations, or observation reports: more formative assessment rather than summative.)
That would ease test-bias problems, but bias is difficult to control in standardized items.
Those who use standardised tests for gate-keeping purposes, with few if any other assessments, would do well to consider multiple measures before attributing infallible predictive power to standardised tests.
Test-driven learning and teaching
This is another consequence of standardized testing. When students know that one single measure of performance will determine their lives, they are less likely to take positive attitudes toward learning (extrinsic motivation, not intrinsic).
Ts are also affected by test-driven policies. They are under pressure to make sure their sts excel in the exam, at the risk of ignoring other objectives in the curriculum. A more serious effect was to punish schools in lower-socioeconomic neighbourhoods.

ETHICAL ISSUES: CRITICAL LANGUAGE TESTING


One of the by-products of the rapidly growing testing industry is the danger of an abuse of power.
Tests represent a social technology deeply embedded in education, government, and business; tests are most powerful because they are often the single indicators for determining the future of individuals (Shohamy).
Standards, specified by client educational institutions, bring with them certain ethical issues surrounding the gate-keeping nature of standardized tests.
Teachers can demonstrate standards in their teaching.
Teachers can be assessed through their classroom performance.
Performance can be detailed with indicators: examples of evidence that the teacher can meet a part of a standard.
Indicators are more than how-to statements (they are complex evidence of performance).
Performance-based assessment is integrated (not a checklist or discrete assessments).
Each assessment has performance criteria against which performance can be measured.
Performance criteria identify to what extent the teacher meets the standard.
Student learning is at the heart of the teacher's performance.

6 ASSESSING LISTENING

OBSERVING THE PERFORMANCE OF FOUR SKILLS


1. Two interacting concepts:
Performance
Observation
Sometimes a performance does not indicate true competence: a bad night's rest, illness, an emotional distraction, test anxiety, a memory block, or other student-related reliability factors may interfere.
One important principle for assessing a learner's competence is to consider the fallibility of the results of a single performance, such as that produced in a test.
An assessment that takes performances and contexts into account should be designed with the following:
Several tests that are combined to form an assessment.
Listening tasks designed to assess the candidate's ability to process forms of spoken English.
A single test with multiple test tasks to account for learning styles and performance variables.
In-class and extra-class graded work.
Alternative forms of assessment (e.g., journal, portfolio, conference, observation, self-assessment, peer assessment).

Multiple measures give a more reliable & valid assessment than a single measure.
For listening, we can observe neither the process of performing nor a product.
1. Receptive skills -- listening performance
The process of listening performance is the invisible, inaudible process of internalizing meaning from the auditory signals being transmitted to the ear and brain.
2. The productive skills allow us to hear and see the performance as it occurs: writing gives a permanent product (the written piece), but unless speech is recorded, there is no permanent observable product for speaking.
THE IMPORTANCE OF LISTENING
Listening has often played second fiddle to its counterpart, speaking, yet it's rare to find just a listening test.
Listening is often implied as a component of speaking.
Oral production ability (other than monologues, speeches, reading aloud, and the like) is only as good as one's listening comprehension.
Input in the aural-oral mode accounts for a large proportion of successful language acquisition.

BASIC TYPES OF LISTENING


For an effective test, designing appropriate assessment tasks in listening begins with the specification of objectives, or criteria.
The following processes flash through your brain as you listen:
1. Recognize speech sounds and hold a temporary imprint of them in short-term memory.
2. Simultaneously determine the type of speech event.
3. Use (bottom-up) linguistic decoding skills and/or (top-down) background schemata to bring a plausible interpretation to the message and assign a literal and intended meaning to the utterance. (Cf. Jeremy Harmer, p. 305, on the value of activating students' schemata.)
4. In most cases, delete the exact linguistic form in which the message was originally received in favor of conceptually retaining important or relevant information in long-term memory.

Four commonly identified types of listening performance:

1. Intensive.
Listening for perception of the components of a larger stretch of language.
Teachers use audio material on tape or hard disk when they want their students to practice listening skills.
2. Responsive.
3. Selective.
4. Extensive.
Extensive listening will usually take place outside the classroom.
Material for extensive listening can be obtained from a number of sources.

Micro and Macro skills


Micro skills
Attending to smaller bits and chunks, in more of a bottom-up process:
Discriminate among the sounds of English
Retain chunks of language of different lengths in short-term memory
Recognize stress patterns, words in stressed/unstressed positions, rhythmic structure, intonation contours, and their role in signaling information
Recognize reduced forms of words
Distinguish word boundaries, recognize the core of a word, and interpret word order patterns and their significance
Process speech at different rates of delivery
Process speech containing pauses, errors, corrections, and other performance variables
Recognize grammatical word classes (nouns, verbs, etc.), systems (e.g. tense, agreement, pluralization), patterns, rules, and elliptical forms
Detect sentence constituents and distinguish between major and minor constituents
Recognize that a particular meaning may be expressed in different grammatical forms
Recognize cohesive devices in spoken discourse

Macroskills

Focusing on larger elements involved in a top-down approach:

Recognize the communicative functions of utterances according to situations, participants, and goals
Infer situations, participants, and goals using real-world knowledge
From events, ideas, and so on described, predict outcomes, infer links and connections between events, deduce causes and effects, and detect such relations as main idea, supporting idea, new information, given information, generalization, and exemplification
Distinguish between literal and implied meanings
Use facial expressions, kinesics, body language, and other nonverbal clues to decipher meanings
Develop and use a battery of listening strategies, such as detecting key words, guessing the meaning of words from context, appealing for help, and signaling comprehension or lack thereof

What Makes Listening Difficult


1. Clustering
Chunking: phrases, clauses, constituents
2. Redundancy
Repetitions, rephrasing, elaborations, and insertions
3. Reduced Forms
Understanding reduced forms that may not have been part of a learner's past experience in classes where only formal textbook lang has been presented
4. Performance Variables
Hesitations, false starts, corrections, diversions
5. Colloquial Language
Idioms, slang, reduced forms, shared cultural knowledge
6. Rate of Delivery
Keeping up with the speed of delivery, processing automatically as the speaker continues
7. Stress, Rhythm, and Intonation
Correctly understanding the prosodic elements of spoken language, which is more difficult than understanding the smaller phonological bits and pieces
8. Interaction
Negotiation, clarification, attending signals, turn-taking

Designing Assessment Tasks: Intensive Listening

Recognizing Phonological and Morphological Elements


Phonemic pair, consonants
Test-takers hear:
He's from California.
Test-takers read:
A. He's from California.
B. She's from California.

Phonemic pair, vowels

Test-takers hear:
Is he living?
Test-takers read:
A. Is he leaving?
B. Is he living?

Morphological pair, -ed ending

Test-takers hear:
I missed you very much.
Test-takers read:
A. I missed you very much.
B. I miss you very much.

Stress pattern in can't

Test-takers hear:
My girlfriend can't go to the party.
Test-takers read:
A. My girlfriend can go to the party.
B. My girlfriend can't go to the party.

One-word stimulus

Test-takers hear:
vine
Test-takers read:
A. vine
B. wine
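A minimal sketch of how hear/read items like those above might be represented and scored; the MinimalPairItem class and the sample data are hypothetical, not taken from any published test.

    from dataclasses import dataclass

    @dataclass
    class MinimalPairItem:
        audio_prompt: str    # what test-takers hear
        options: list        # what test-takers read (A, B, ...)
        correct: int         # index of the option matching the prompt

        def score(self, choice: int) -> int:
            """Dichotomous scoring: 1 for the matching option, else 0."""
            return 1 if choice == self.correct else 0

    items = [
        MinimalPairItem("He's from California.",
                        ["He's from California.", "She's from California."], 0),
        MinimalPairItem("vine", ["vine", "wine"], 0),
    ]

    # A test-taker who confuses /v/ and /w/ might miss the second item:
    answers = [0, 1]
    print(sum(item.score(a) for item, a in zip(items, answers)))  # 1 of 2 correct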

Paraphrase Recognition
Sentence paraphrase
Test-takers hear:
Hello, my name is Keiko. I come from Japan.
Test-takers read:
A. Keiko is comfortable in Japan.
B. Keiko wants to come to Japan.
C. Keiko is Japanese.
D. Keiko likes Japan.

Dialogue paraphrase
Test-takers hear:
Man: Hi, Maria, my name is George.
Woman: Nice to meet you, George. Are you American?
Man: No, I'm Canadian.
Test-takers read:
A. George lives in the United States.
B. George is American.
C. George comes from Canada.
D. Maria is Canadian.

Designing Assessment Tasks: Responsive Listening
Appropriate response to a question
Test-takers hear:
How much time did you take to do your homework?
Test-takers read:
A. In about an hour.
B. About an hour.
C. About $10.
D. Yes, I did.

Open-ended response to a question

Test-takers hear:
How much time did you take to do your homework?
Test-takers write or speak:
__________________________________

Designing Assessment Tasks: Selective Listening


The test-taker listens to a limited quantity of aural input and must discern some specific information.
Listening Cloze (cloze dictations or partial dictation)
Listening cloze tasks require the test-taker to listen to a story, monologue, or conversation and simultaneously read a written text in which selected words or phrases have been deleted.
One potential weakness of the listening cloze technique:
Such tasks may simply become reading comprehension tasks. Test-takers who are asked to listen to a story with periodic deletions in the written version may not need to listen at all, yet may still be able to respond with the appropriate word or phrase.
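One way to prepare such a task is fixed-ratio deletion over the transcript, sketched below; the every-nth-word rule and the sample passage are illustrative (a test designer could instead hand-pick the deleted words, i.e., rational deletion).

    def make_cloze(text: str, n: int = 7):
        """Blank out every nth word; return the gapped text and the answer key."""
        words = text.split()
        key = []
        for i in range(n - 1, len(words), n):
            key.append(words[i])
            words[i] = "______"
        return " ".join(words), key

    passage = ("Lucy gets up at six every morning and takes the bus to work "
               "because her office is on the other side of the city")
    gapped, answers = make_cloze(passage, n=7)
    print(gapped)   # every seventh word replaced by a blank
    print(answers)  # ['morning', 'work', 'other'] -- the scoring key

Scoring can then accept only the exact deleted word, or any contextually acceptable word; the answer key returned above supports the stricter exact-word policy.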
Information Transfer
Information that is aurally processed must be transferred to a visual representation, e.g., labelling a diagram, identifying an element in a picture, completing a form, or showing routes on a map.
Chart Filling
Test-takers see a chart of Lucy's daily schedule and fill in the schedule.
Sentence Repetition
The test-taker must retain a stretch of language long enough to reproduce it, and then must respond with an oral repetition of that stimulus.

DESIGNING ASSESSMENT TASKS: EXTENSIVE LISTENING


Dictation: Test-takers hear a passage, typically 50-100 words, read three times:
First reading: natural speed, no pauses; test-takers listen for the gist.
Second reading: slowed speed, a pause at each break; test-takers write.
Third reading: natural speed; test-takers check their work.
Communicative Stimulus-Response Tasks
The test-takers are presented with a stimulus monologue or conversation and then are asked to respond to a set of comprehension questions.
First: test-takers hear the instructions and the dialogue or monologue.
Second: test-takers read the multiple-choice comprehension questions and choose the correct answers.
Authentic Listening Tasks
Buck (2001, p. 92): "Every test requires some components of communicative language ability, and no test covers them all. Similarly, every task shares some characteristics with target-language tasks, and no test is completely authentic."

Alternatives to assess comprehension in a truly communicative context:
Note-taking
Listening to a lecture and writing down the important ideas.
Disadvantage: scoring is time-consuming.
Advantages: mirrors real classroom situations; fulfills the criteria of cognitive demand, communicative language & authenticity.
Editing
Editing a written stimulus against an aural stimulus.
Interpretive tasks
Paraphrasing a story or conversation. Potential stimuli include: song lyrics, poetry, radio/TV news reports, etc.
Retelling
Listening to a story & simply retelling it, either orally or in writing, to show full comprehension.
Difficulties: scoring and reliability.
Validity, cognitive demand, communicative ability, and authenticity are well incorporated into the task.
Interactive listening (face-to-face conversations)

Chapter 7: Assessing Speaking

Challenges of testing speaking:

1. The interaction of speaking and listening
2. Elicitation techniques
3. Scoring
BASIC TYPES OF SPEAKING
1. Imitative: (parrot back) testing the ability to imitate a word, phrase, or sentence. Pronunciation is tested. Examples: word, phrase, and sentence repetition.
2. Intensive: the purpose is producing short stretches of oral language, designed to demonstrate competence in a narrow band of grammatical, phrasal, lexical, or phonological relationships (stress / rhythm / intonation).
3. Responsive: (interacting with an interlocutor) includes interaction and tests comprehension, but at the somewhat limited level of very short conversations, standard greetings, small talk, simple requests and comments, and the like.
4. Interactive: the difference between responsive and interactive speaking is the length and complexity of the interaction, which includes multiple exchanges and/or multiple participants.
5. Extensive (monologue): extensive oral production tasks include speeches, oral presentations, and story-telling, during which the opportunity for oral interaction from listeners is either highly limited (perhaps to nonverbal responses) or ruled out altogether.

Micro- and Macroskills of Speaking


Microskills of speaking refer to producing small chunks of language such as phonemes, morphemes, words, and phrasal units. The macroskills include the speaker's focus on the larger elements: fluency, discourse, function, style, cohesion, nonverbal communication, and strategic options.
Macroskills
1. Appropriately accomplish communicative functions according to situations, participants, and goals.
2. Use appropriate styles, registers, implicature, redundancies, pragmatic conventions, conversation rules, floor-keeping and -yielding, interrupting, and other sociolinguistic features in face-to-face conversations.
3. Convey links and connections between events and communicate such relations as focal and peripheral ideas, events and feelings, new information and given information, generalization and exemplification.
4. Convey facial features, body language, and other nonverbal cues along with verbal language.
5. Develop and use a battery of speaking strategies, such as emphasizing key words, rephrasing, providing a context for interpreting the meaning of words, appealing for help, and accurately assessing how well your interlocutor is understanding you.

Microskills:
1. Produce differences among English phonemes and allophonic variants.
2. Produce chunks of language of different lengths.
3. Produce English stress patterns, words in stressed and unstressed positions, rhythmic structure, and intonation contours.
4. Produce reduced forms of words and phrases.
5. Use an adequate number of lexical units (words) to accomplish pragmatic purposes.
6. Produce fluent speech at different rates of delivery.
7. Monitor one's own oral production and use various devices (pauses, fillers, self-corrections, backtracking) to enhance the clarity of the message.
8. Use grammatical word classes (nouns, verbs, etc.), systems (tense, agreement, pluralization), word order, patterns, rules, and elliptical forms.
9. Produce speech in natural constituents: appropriate phrases, pause groups, breath groups, and sentence constituents.
10. Express a particular meaning in different grammatical forms.
11. Use cohesive devices in spoken discourse.

Three important issues as you set out to design tasks:

1. No speaking task is capable of isolating the single skill of oral production. Concurrent involvement of the additional performance of aural comprehension, and possibly reading, is usually necessary.
2. Eliciting the specific criterion you have designated for a task can be tricky, because beyond the word level spoken language offers a number of productive options to test-takers. Make sure your elicitation prompt achieves its aims as closely as possible.
3. It is important to carefully specify scoring procedures for a response so that ultimately you achieve as high a reliability index as possible.
Interaction between speaking and listening or reading is unavoidable.
Interaction effect: the impossibility of testing speaking in isolation.
Elicitation techniques: to elicit the specific criterion we expect from test-takers.
Scoring: to achieve reliability.

Designing Assessment Tasks: Imitative Speaking


Pay more attention to pronunciation, especially suprasegmentals, in an attempt to help learners be more comprehensible.
Repetition tasks are legitimate as long as they are not allowed to occupy a dominant role in an overall oral production assessment, and as long as a negative washback effect is avoided.
In a simple repetition task, test-takers repeat the stimulus, whether it is a pair of words, a sentence, or perhaps a question (to test for intonation production).
Word repetition task:
Scoring specifications must be clear in order to avoid reliability breakdowns. A common form of scoring simply uses a 2- or 3-point system for each response.
Scoring scale for repetition tasks:
2 = acceptable pronunciation
1 = comprehensible, partially correct pronunciation
0 = silence, seriously incorrect pronunciation
The longer the stretch of language, the more possibility for error, and therefore the more difficult it becomes to assign a point system to the response.
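When two raters apply the 2-1-0 scale above, one quick reliability check is percent exact agreement, sketched below as a simple stand-in for fuller statistics such as Cohen's kappa; the rating lists are invented for illustration.

    def exact_agreement(rater_a, rater_b):
        """Proportion of responses on which the two raters gave the same score."""
        matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
        return matches / len(rater_a)

    rater_a = [2, 1, 0, 2, 2, 1, 0, 2]  # one rater's 2-1-0 scores
    rater_b = [2, 1, 1, 2, 2, 0, 0, 2]  # a second rater, same 8 responses
    print(f"{exact_agreement(rater_a, rater_b):.0%}")  # 75%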

PHONEPASS TEST
Research on the PhonePass test has supported the construct validity of its repetition tasks not just for phonological ability but also for discourse and overall oral production ability.
The PhonePass test elicits computer-assisted oral production over a telephone.
Test-takers read aloud, repeat sentences, say words, and answer questions.
Test-takers are directed to telephone a designated number and listen for directions.
The test has five sections:
Part A: Testees read aloud selected sentences from among those printed on the test sheet.
Part B: Testees repeat sentences dictated over the phone.
Part C: Testees answer questions with a single word or a short phrase of two or three words.
Part D: Testees hear three word groups in random order and link them into a correctly ordered sentence.
Part E: Testees have 30 seconds to give their opinion about some topic that is dictated over the phone.
Scores are calculated by a computerized scoring template and reported back to the test-taker within minutes.
Pronunciation, reading fluency, repeat accuracy and fluency, and listening vocabulary are the sub-skills scored.
The scoring procedure has been validated against human scoring, with extraordinarily high reliabilities and correlation statistics.

Designing Assessment Tasks: Intensive Speaking


Test-takers are prompted to produce short stretches of discourse (no more than a sentence) through which they demonstrate linguistic ability at a specified level of language.
Intensive tasks may also be described as limited-response tasks, or mechanical tasks, or what classroom pedagogy would label as controlled responses.
Directed Response Tasks
The administrator elicits a particular grammatical form or a transformation of a sentence.
Such tasks are clearly mechanical and not communicative (a possible drawback), but they require minimal processing of meaning in order to produce the correct grammatical output (a practical advantage).
Read-Aloud Tasks (to assess pronunciation and fluency)
These extend beyond the sentence level, up to a paragraph or two. The task is easily administered by selecting a passage that incorporates the test specs and by recording the testee's output; scoring is relatively easy because all of the test-taker's oral production is controlled.
While reading aloud offers certain practical advantages (predictable output, practicality, reliability in scoring), there are several drawbacks.
Reading aloud is somewhat inauthentic in that we seldom read anything aloud to someone else in the real world, with the exception of a parent reading to a child.

Sentence / Dialogue Completion Tasks and Oral Questionnaires
(to produce omitted lines or words in a dialogue appropriately)
Test-takers read a dialogue in which one speaker's lines have been omitted. They are first given time to read through the dialogue to get its gist and to think about appropriate lines to fill in.
An advantage of this technique lies in its moderate control of the test-taker's output (a practical advantage).
One disadvantage is its reliance on literacy and on an ability to transfer easily from written to spoken English (a possible drawback).
Another disadvantage is the contrived, inauthentic nature of the task (a drawback).
Picture-Cued Tasks (to elicit oral production by using pictures)
One of the more popular ways to elicit oral language performance at both the intensive and extensive levels is a picture-cued stimulus that requires a description from the test-taker.
Assessment of oral production may be stimulated through a more elaborate picture (a practical advantage).
Maps are another visual stimulus that can be used to assess the language forms needed to give directions and specify locations.
Scoring may be problematic depending on the expected performance.
Scoring scale for intensive tasks:
2 = comprehensible; acceptable target form
1 = comprehensible; partially correct target form
0 = silence, or seriously incorrect target form
Translation (of Limited Stretches of Discourse) (to translate from the native language into the target language)
The test-takers are given a native-language word, phrase, or sentence and are asked to translate it.
As an assessment procedure, the advantage of translation lies in its control of the test-taker's output, which of course means that scoring is more easily specified.

Designing Assessment Tasks: Responsive Speaking

Assessment involves brief interactions with an interlocutor, differing from intensive tasks in the increased creativity given to the test-taker, and from interactive tasks in the somewhat limited length of utterances.
Question and Answer
Question-and-answer tasks can consist of one or two questions from an interviewer, or they can make up a portion of a whole battery of questions and prompts in an oral interview.
The first question is intensive in its purpose; it is a display question intended to elicit a predetermined correct response.
Questions at the responsive level tend to be genuine referential questions, in which the test-taker is given more opportunity to produce meaningful language in response.
Test-takers respond with a few sentences at most.
Test-takers may also respond with questions: a potentially tricky form of oral production assessment involves more than one test-taker with an interviewer. With two students in an interview context, both test-takers can ask questions of each other.

Giving Instructions and Directions

The technique is simple: the administrator poses the problem, and the test-taker responds. Scoring is based primarily on comprehensibility and secondarily on other specified grammatical or discourse categories.
Eliciting instructions or directions
Paraphrasing
Test-takers read or hear a number of sentences and produce a paraphrase.
Advantages: paraphrase tasks elicit short stretches of output and perhaps tap into the testee's ability to practice the conversational art of conciseness by reducing the output/input ratio.
If you use short paraphrasing tasks as an assessment procedure, it's important to pinpoint the objective of the task clearly. In this case, the integration of listening and speaking is probably more at stake than simple oral production alone.
TEST OF SPOKEN ENGLISH (TSE)
The TSE is a 20-minute audio-taped test of oral language ability within an academic or professional environment.
The scores are also used for selecting and certifying health professionals such as physicians, nurses, pharmacists, physical therapists, and veterinarians.
The tasks on the TSE are designed to elicit oral production in various discourse categories rather than in selected phonological or grammatical targets.

Designing Assessment Tasks: Interactive Speaking


Tasks include longer interactive discourse (interviews, role plays, discussions, games).
Interview
A test administrator and a test-taker sit down in a direct face-to-face exchange and proceed through a protocol of questions and directives. The interview is then scored on accuracy in pronunciation and/or grammar, vocabulary usage, fluency, pragmatic appropriateness, task accomplishment, and even comprehension.
Placement interviews are designed to get a quick spoken sample from a student to verify placement into a course.
Four stages:

1. Warm-up: (small talk) the interviewer directs mutual introductions, helps the testee become comfortable with the situation, apprises the testee of the format, and allays anxieties. (No scoring.)
2. Level check: the interviewer stimulates the testee to respond using expected/predicted forms and functions. This stage gives the interviewer a picture of the testee's extroversion, readiness to speak, and confidence. Linguistic target criteria are scored in this phase.
3. Probe: probe questions and prompts challenge testees to reach the heights of their ability, to extend beyond the limits of the interviewer's expectation, through difficult questions.
4. Wind-down: a short period of time during which the interviewer encourages the testee to relax with easy questions and sets the testee's mind at ease.

The success of an oral interview will depend on:
*clearly specifying administrative procedures of the assessment (practicality)
*focusing the questions and probes on the purpose of the assessment (validity)
*appropriately eliciting an optimal amount and quality of oral production from the test-taker (biased for best performance)
*creating a consistent, workable scoring system (reliability).

Role Play
Role playing is a popular pedagogical activity in communicative language teaching classes.
Within the constraints set forth by the guidelines, it frees students to be somewhat creative in their linguistic output.
While role play can be controlled or guided by the interviewer, this technique takes test-takers beyond simple intensive and responsive levels to a level of creativity and complexity that approaches real-world pragmatics.
Scoring presents the usual issues in any task that elicits somewhat unpredictable responses from test-takers.
Discussions and Conversations
As formal assessment devices, discussions and conversations with and among students are difficult to specify and even more difficult to score.
But as informal techniques to assess learners, they offer a level of authenticity and spontaneity that other assessment techniques may not provide.
Assessing the performance of participants through scores or checklists should be carefully designed to suit the objectives of the observed discussion.
Discussion is an integrative task, so it is also advisable to give some cognizance to comprehension performance in evaluating learners' oral production.

Games
Among informal assessment devices are a variety of games that directly involve language production.
Assessment games:
1. Tinkertoy game (building-block assembly)
2. Crossword puzzles
3. Information-gap grids
4. City maps
ORAL PROFICIENCY INTERVIEW (OPI)
The best-known oral interview format is the Oral Proficiency Interview.
The OPI is the result of a historical progression of revisions under the auspices of several agencies, including the Educational Testing Service and the American Council on the Teaching of Foreign Languages (ACTFL).
The OPI is carefully designed to elicit pronunciation, fluency and integrative ability, sociolinguistic and cultural knowledge, grammar, and vocabulary.
Performance is judged by the examiner to be at one of ten possible levels on the ACTFL-designated proficiency guidelines for speaking: Superior; Advanced-high, mid, low; Intermediate-high, mid, low; Novice-high, mid, low.

Designing Assessment Tasks: Extensive Speaking


Extensive speaking involves complex, relatively lengthy stretches of discourse: variations on monologues, with minimal verbal interaction.
Oral Presentations
In academic and professional arenas, it would not be uncommon to be called on to present a report, a paper, a marketing plan, a sales idea, the design of a new product, or a method.

Once again the rules for effective assessment must be invoked:

a) specify the criterion,
b) set appropriate tasks,
c) elicit optimal output,
d) establish practical, reliable scoring procedures.
Scoring is the key assessment challenge.
Picture-Cued Story-Telling
One technique for eliciting oral production is through pictures, photographs, diagrams, and charts.
Consider a picture or series of pictures as a stimulus for a longer story or description.
Criteria for scoring need to be clear about what it is you are hoping to assess.

Retelling a Story or News Event

In this type of task, test-takers hear or read a story or news event that they are asked to retell.
The objectives in assigning such a task vary from listening comprehension of the original to production of a number of oral discourse features (communicating sequences and relationships of events, stress and emphasis patterns, expression in the case of a dramatic story), fluency, and interaction with the hearer.
Scoring should meet the intended criteria.
Translation (of Extended Prose)
Longer texts are presented for the test-taker to read in the NL and then translate into English (dialogues, directions for assembly of a product, a synopsis of a story, play, or movie, directions on how to find something on a map, and other genres).
The advantage of translation is the control of the content, vocabulary, and, to some extent, the grammatical and discourse features.
The disadvantage is that translation of longer texts is a highly specialized skill for which some individuals obtain post-baccalaureate degrees.
Criteria for scoring should take into account not only the purpose in stimulating a translation but also the possibility of errors that are unrelated to oral production ability.

8 ASSESSING READING

TYPES (GENRES) OF READING


Academic reading
Reference material, textbooks, theses; essays, papers; test directions; editorials and opinion writing
Job-related reading
Messages, letters/emails, memos
Personal reading
Newspapers, magazines; letters, emails, cards, invitations; schedules (trains, bus)

Microskills:
Discriminate among the distinctive graphemes and orthographic patterns of English.
Retain chunks of language of different lengths in short-term memory.
Process writing at an efficient rate of speed to suit the purpose.
Recognize a core of words, and interpret word order patterns and their significance.
Recognize grammatical word classes (nouns, verbs, etc.), systems (tense, agreement, pluralization), patterns, rules, and elliptical forms.
Recognize cohesive devices in written discourse and their role in signaling the relationship between and among clauses.

Macroskills :
Recognize the rhetorical forms of written discourse and their
significance for interpretation.
Recognize the communicative functions of written text, according to
form and purpose
Infer context that is not explicit by using background knowledge
From described events, ideas, etc, infer links and connections
between events, deduce causes and effects, and detect such
relations as main idea, supporting idea, new information,
generalization, and exemplification
Distinguish between literal and implied meanings.
Detect culturally specific references and interpret them in a context
of the appropriate cultural schemata.
Develop and use a battery of reading strategies, such as scanning
and skimming, detecting discourse markers, guessing the meaning
of words from the context, and activating schemata for interpretation
of texts.

Some principal strategies for reading comprehension:


Identify your purpose in reading a text
Apply spelling rules and conventions for bottom-up decoding
Use lexical analysis to determine meaning
Guess at meaning when you aren't certain
Skim the text for the gist and for main ideas
Scan the text for specific information(names, dates, key words)
Use silent reading techniques for rapid processing
Use marginal notes, outlines, charts, or semantic maps for
understanding and retaining information
Distinguish between literal and implied meanings
Capitalize on discourse markers to process relationships.

TYPES OF READING
Perceptive
Involves attending to the components of larger stretches of discourse: letters, words, punctuation, and other graphemic symbols.
Selective
Largely an artifact of assessment formats. Uses picture-cued tasks, matching, true/false, multiple-choice, etc.
Interactive
The interactive task is to identify relevant features (lexical, symbolic, grammatical, and discourse) within texts of moderately short length, with the objective of retaining the information that is processed.
Extensive
The purposes of assessment usually are to tap into a learner's global understanding of a text, as opposed to asking test-takers to zoom in on small details. Top-down processing is assumed for most extensive tasks.

PERCEPTIVE READING
Reading Aloud
Test-takers read items aloud, one by one, in the presence of an administrator.
Written Response
Test-takers reproduce the probe in writing. Evaluation of the test-taker's response must be carefully treated.
Multiple-Choice
Choosing one of four or five possible answers.
Picture-Cued Items
Test-takers are shown a picture and written text and are given one of a number of possible tasks to perform.

SELECTIVE READING
The test designer focuses on formal aspects of language (lexical, grammatical, and a few discourse features). This category includes what many incorrectly think of as testing "vocabulary and grammar".
Multiple-Choice (for Form-Focused Criteria)
Items may have little context but might serve as a vocabulary or grammar check.
Matching Tasks
The most frequently appearing criterion in matching procedures is vocabulary.
Editing Tasks
Editing for grammatical or rhetorical errors is a widely used test method for assessing linguistic competence in reading.
Picture-Cued Tasks
Read a sentence or passage and choose the one of four pictures that is being described.
Read a series of sentences or definitions, each describing a labeled part of a picture or diagram.
Gap-Filling Tasks
Create completion items in which test-takers read part of a sentence and then complete it by writing a phrase.

INTERACTIVE READING
Cloze Tasks
The ability to fill in gaps in an incomplete image (visual, auditory, or cognitive) and supply (from background schemata) the omitted details.
Impromptu Reading Plus Comprehension Questions
No assessment of reading is complete without some component involving impromptu reading and responding to questions.
Short-Answer Tasks
A popular alternative to multiple-choice questions following reading passages is the age-old short-answer format.
Editing (Longer Texts)
The editing technique has been applied successfully to longer passages of 200 to 300 words.
Advantages: 1st, authenticity; 2nd, the task simulates proofreading one's own essay; 3rd, it can be connected to a specific curriculum.
Scanning
A strategy used by all readers to find relevant information in a text.
Ordering Tasks
Variations on this can serve as an assessment of overall global understanding of a story and of the cohesive devices that signal the order of events or ideas.
Information Transfer: Reading Charts, Maps, Graphs, Diagrams
Such media presuppose the reader's schemata for interpreting them and are often accompanied by oral or written discourse to convey, clarify, question, argue, or debate, among other linguistic functions.

EXTENSIVE READING
Involves longer texts than we have been dealing with up to this point.
Skimming Tasks
The process of rapid coverage of reading matter to determine its gist or main idea.
Summarizing and Responding
Test-takers write a summary of the text and then a response to it.
Note-Taking and Outlining
A teacher, perhaps in one-on-one conferences with students, can use student notes/outlines as indicators of the presence or absence of effective reading strategies, and thereby point learners in positive directions.

UNIT 9: ASSESSING
WRITING

GENRES OF WRITING
Academic Writing
Papers and general subject reports; essays, compositions; academically focused journals; short-answer test responses; technical reports (e.g., lab reports); theses, dissertations
Job-Related Writing
Messages, letters/emails, memos (e.g., interoffice), reports (e.g., job evaluations, project reports); schedules, labels, signs, advertisements, announcements, manuals
Personal Writing
Letters, emails, greeting cards, invitations; messages, notes, calendar entries, shopping lists, reminders; financial documents (e.g., checks, tax forms, loan applications); forms, questionnaires, medical reports, immigration documents; diaries, personal journals, fiction (e.g., short stories, poetry)

MICROSKILLS AND MACROSKILLS OF WRITING


Micro-skills
Produce graphemes and orthographic patterns of English.
Produce writing at an efficient rate of speed to suit the purpose.
Produce an acceptable core of words and use appropriate word order patterns.
Use acceptable grammatical systems (tense, agreement), patterns, and rules.
Express a particular meaning in different grammatical forms.
Use cohesive devices in written discourse.
Macro-skills
Use the rhetorical forms and conventions of written discourse.
Appropriately accomplish the communicative functions of written texts according to form and purpose.
Convey links and connections between events; communicate such relations as main idea, supporting idea, new information, generalization, exemplification.
Distinguish between literal and implied meanings when writing.
Correctly convey culturally specific references in the context of the written text.
Develop & use a battery of writing strategies: accurately assessing the audience's interpretation, using prewriting devices, writing with fluency in first drafts, using paraphrases and synonyms, soliciting feedback, and using feedback for revision.

Types of Writing Performance


Imitative Writing
Assesses the ability to spell correctly & to perceive phoneme-grapheme correspondences.
Form rather than meaning (letters, words, punctuation, brief sentences, mechanics of writing).
Intensive Writing
To produce appropriate vocabulary within a context and correct grammatical features in a sentence.
More form than meaning, but meaning and context are of some importance (collocations, idioms, correctness, appropriateness).
Responsive Writing
To connect sentences & create a logically connected sequence of two or three paragraphs.
Discourse conventions with a strong emphasis on context and meaning (limited discourse level, connecting sentences logically), mostly 2-3 paragraphs.
Extensive Writing
To manage all the processes of writing, for all purposes, up to lengthy texts (essays, papers, theses).
Processes (strategies) of writing.

IMITATIVE WRITING
Tasks in Handwriting Letters, Words, and Punctuation
Copying (bit __ / bet __ / bat __)
Copy the words given in the spaces provided.
Listening cloze selection tasks
Write the missing words in the blanks, selecting according to what is heard.
A combination of dictation with a written text; the purpose is to give practice in writing.
Picture-cued tasks
Write the word the picture represents.
Make sure that pictures are not ambiguous.
Form completion tasks
Complete the blanks in simple forms, e.g., name, address, phone number.
Make sure that students have practiced filling out such forms.
Converting numbers/abbreviations to words
Either write out the numbers or convert the abbreviations to words.
More reading than writing, so specify the criterion.
Low authenticity, but a reliable method to stimulate handwritten English.

Spelling Tasks and Detecting Phoneme-Grapheme Correspondences
Spelling Tests
Write words that are dictated; choose words that have been heard or spoken.
Scoring = correct spelling.
Picture-Cued Tasks
Write the words that are displayed in pictures, e.g., boot-book, read-reed, bit-bite.
Choose items according to your test purpose.
Multiple-Choice Techniques
Choose and write the word with the correct spelling to fit the given sentence.
Items are better if they have a writing component; the addition of homonyms makes the task more challenging.
Overlaps with reading, so be careful: the purpose is to assess the ability to spell words correctly and to process phoneme-grapheme correspondences.
Matching Phonetic Symbols
Write the correctly spelled word alphabetically (i.e., in the Latin alphabet) next to its phonetic transcription.
Since Latin alphabet and phonetic alphabet symbols are different from each other, this works well.

INTENSIVE (CONTROLLED) WRITING

Dictation
Writing what is heard aurally.
Listening & correct spelling & punctuation.
Dicto-comp
Re-writing a paragraph in one's own words after hearing it two or three times.
Listening & vocabulary & spelling & punctuation.
Grammatical transformation
Making grammatical transformations by changing or combining forms of lang.
Grammatical competence; easy to administer & practical & reliable.
No meaningful value; even with context, no authenticity.
Picture-cued
1. Short sentences
2. Picture description
3. Picture sequence description
Reading non-verbal means & grammar & spelling & vocabulary.
Reading-writing integration; scoring is problematic when pictures are open to interpretation.
Vocabulary assessment
Either defining or using a word in a sentence; assessing collocations and derived morphology.
Vocabulary & grammar; less authentic (using a word in a sentence?).
Ordering
Ordering / re-ordering a scrambled set of words.
If verbal = intensive speaking; if written = intensive writing.
Reading and grammar.
Appealing for those who like word games and puzzles; inauthentic.
Needs practicing in class; both reading and writing.
Short answer and sentence completion
Answering or asking questions for the given statements / writing two or three sentences using the given prompts.
Reading & writing; scoring on a 2-1-0 scale is appropriate.

1. AUTHENTICITY (face and content validity)

The teacher becomes less an instructor, more a coach or facilitator.
Assessment: formative, with positive washback prioritized over practicality and reliability.
2. SCORING
Both how sts string words together and what they say.
3. TIME
No time constraints: freedom to draft before the finished product.
A questioned issue: is the timed impromptu format a valid method of assessing writing?

RESPONSIVE AND EXTENSIVE WRITING


1. Paraphrasing
Its importance: To say something in one's own words, to avoid plagiarism, to offer some variety in expression.
Test-takers' task: Paraphrasing sentences or paragraphs with these purposes in mind.

Assessment type: Informal and formative; positive washback.

Scoring: Conveying a similar message is primary; discourse, grammar, and vocabulary are secondary.

2. Guided question and answer

Its importance: To provide the benefits of guiding test-takers without dictating the form of the output.
Test-takers' task: Writing a text guided by a series of questions that serve as an outline of the emergent written text.

Assessment type: Informal and formative.

Scoring: Either on a holistic scale or an analytical one.

3. Paragraph Construction Tasks


Topic Sentence Writing
The presence or absence of a topic sentence; the effectiveness of the topic sentence.
Topic Development in a Paragraph
The clarity of expression; the logic of the sequence; the unity and cohesion; the overall effectiveness.
Multi-Paragraph Essay
Addressing the topic / main idea / purpose; organizing supporting ideas; using appropriate details to support ideas; facility and fluency in language use; demonstrating syntactic variety.
4. Strategic Options
Free writing, outlining, drafting, and revising are strategies that help writers create effective texts.
Writers need to know their subject, purpose, and audience in order to write; developing main and supporting ideas is not limited to essay writing alone.
Some tasks commonly addressed in academic writing courses are compare/contrast, problem/solution, pros/cons, and cause/effect.
Assessment of tasks in an academic writing course could be formative & informal.
Knowing the conventions & opportunities of a genre will help sts write effectively.
Every genre of writing requires different conventions.

Test of Written English (TWE)

Time allocated: 30-minute time limit; no preparation ahead of time.

Prepared by: a panel of experts


Scoring: the mean of 2 independent ratings, based on holistic scoring.

Number of raters: 2 trained raters working independently


Limitations: inauthentic / not real life / puts test-takers into an artificially time-constrained context; inappropriate for instructional purposes.
Strengths: serves for administrative purposes
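The reported score described above is the mean of two independent holistic ratings. The sketch below adds a discrepancy rule (calling for a third rating when the two raters differ by more than one point); that rule is a common essay-scoring practice assumed here, not something stated in these notes.

    def report_twe_style_score(r1: int, r2: int, max_gap: int = 1):
        """Mean of two independent holistic ratings, or None if they diverge."""
        if abs(r1 - r2) > max_gap:
            return None  # flag the essay for adjudication by a third rater
        return (r1 + r2) / 2

    print(report_twe_style_score(4, 5))  # 4.5
    print(report_twe_style_score(3, 6))  # None -> needs a third rating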
Follow 6 steps to be successful
Carefully identify the topic.
Plan your supporting ideas.
In introductory paragraph, restate topic and state organizational plan
of essay.
Write effective supporting paragraphs (show transitions, include a
topic sentence, specify details).
Restate your position and summarize in the concluding paragraph.
Edit sentence structure and rhetorical expression.

SCORING METHODS FOR RESPONSIVE AND EXTENSIVE WRITING
Holistic Scoring

Definition: Assigning a single score to represent general overall


assessment

Purpose of use: Appropriate for administrative purposes /


Admission into an institution or placement in a course

Advantage(s): Quick scoring; high inter-rater reliability; scores easily interpreted by lay persons; emphasizes the strengths of the written piece; applicable to many different disciplines.

Disadvantage(s):
No washback potential.
Masks the differences across the sub-skills within each score.
Not applicable to all genres.
Needs trained evaluators to use the scale accurately.

Primary Trait Scoring

Definition
Assigning a score based on the effectiveness of the text in achieving its purpose (accuracy, clarity, description, expression of opinion).

Purpose of use
To focus on the principal function of the text.

Advantage(s)
Practical; allows both the writer and the scorer to focus on function/purpose.

Disadvantage(s)
Other features of the writing (organization, grammar, mechanics) receive no separate rating.

Analytic Scoring

Definition
Breaking the text down into a number of subcategories (e.g., organization, content, vocabulary, grammar, mechanics) and assigning a separate rating to each.

Purpose of use
Classroom instructional purposes.

Advantage(s)
More washback into the further stages of learning; diagnoses both the weaknesses and the strengths of a piece of writing.

Disadvantage(s)
Lower practicality, since scorers have to attend to details within each sub-score.
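One way to turn analytic sub-scores into a single reported number is a weighted composite, sketched below; the five categories and equal weights are illustrative, not a published scale.

    # Illustrative categories and weights; a real scale (e.g., Brown & Bailey's)
    # defines its own categories, weights, and point ranges.
    WEIGHTS = {"organization": 0.2, "content": 0.2, "grammar": 0.2,
               "mechanics": 0.2, "style": 0.2}

    def analytic_composite(sub_scores: dict, scale_max: int = 5) -> float:
        """Weighted average of 0-5 sub-scores, reported on a 0-100 scale."""
        weighted = sum(WEIGHTS[cat] * s for cat, s in sub_scores.items())
        return 100 * weighted / scale_max

    essay = {"organization": 4, "content": 5, "grammar": 3,
             "mechanics": 4, "style": 4}
    print(analytic_composite(essay))  # 80.0

The diagnostic value lies in reporting the sub-scores themselves alongside the composite: that per-category feedback is what gives analytic scoring its washback advantage.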

BEYOND SCORING: RESPONDING TO EXTENSIVE WRITING


Here the writer discusses the process approach to writing and how assessment takes place in this approach.
Many educators advocate a process approach to writing, which pays attention to the various stages that any piece of writing goes through.
By spending time with learners on pre-writing phases, editing, re-drafting, and finally producing a finished version of their work, a process approach aims to get to the heart of the various skills that most writers employ.

Types of responding: Self, peer, teacher responding

Assessment type: Informal / formative


Washback: Potential positive washback
Role of the assessor: Guide / facilitator

GUIDELINES FOR ASSESSING STAGES OF WRITTEN


COMPOSITION
Initial stages

Focus: Meaning & main idea & organization

Ignore: Grammatical and lexical errors / minor errors
Indicate: Global errors (but do not correct them)
Later stages

Focus: Fine-tuning toward a final version

Ignore:
Indicate: Problems related to cohesion/documentation/citation

10 BEYOND TESTS:
ALTERNATIVES IN
ASSESSMENT

Characteristics of Alternative Assessment


require students to perform, create, produce, or do something;
use real-world contexts or simulations;
are non-intrusive in that they extend the day-to-day classroom
activities;
allow students to be assessed on what they normally do in class
every day;
use tasks that represent meaningful instructional activities;
focus on processes as well as products;
tap into higher-level thinking and problem-solving skills;
provide information about both the strengths and weaknesses of
students;
are multi-culturally sensitive when properly administered;
ensure that people, not machines, do the scoring, using human
judgment;
encourage open disclosure of standards and rating criteria; and
call upon teachers to perform new instructional and assessment
roles.

DILEMMA OF MAXIMIZING BOTH PRACTICALITY AND WASHBACK


LARGE-SCALE STANDARDIZED TESTS

one-shot performances
timed
multiple-choice
decontextualized
norm-referenced
foster extrinsic motivation
highly practical, reliable instruments that minimize time and money
cannot offer much washback or authenticity

ALTERNATIVE ASSESSMENT

open-ended in their time orientation and format
contextualized to a curriculum
referenced to the criteria (objectives) of that curriculum
likely to build intrinsic motivation
require considerable time and effort
offer much authenticity and washback

The dilemma of maximizing both practicality and washback


The principal purpose of this chapter is to examine some of the alternatives in assessment that are markedly different from formal tests.
Large-scale standardized tests, especially, tend to be one-shot performances that are timed, multiple-choice, decontextualized, norm-referenced, and that foster extrinsic motivation.
On the other hand, tasks like portfolios, journals, conferences, interviews, and self-assessment are
open-ended in their time orientation and format,
contextualized to a curriculum,
referenced to the criteria (objectives) of that curriculum, and
likely to build intrinsic motivation.

PORTFOLIOS
One of the most popular alternatives in assessment, especially within
a framework of communicative language teaching, is portfolio
development.
Portfolios include materials such as
Essays and compositions in draft and final forms;
Reports and project outlines;
Poetry and creative prose;
Artwork, photos, newspaper or magazine clippings;
Audio and/or video recordings of presentations, demonstrations, etc.;
Journals, diaries, and other personal reflections;
Tests, test scores, and written homework exercises;
Notes on lectures; and
Self- and peer-assessments (comments and checklists).

Successful portfolio development will depend on following a number of steps and guidelines:
1. State objectives clearly.
2. Give guidelines on what materials to include.
3. Communicate assessment criteria to students.
4. Designate time within the curriculum for portfolio development.
5. Establish periodic schedules for review and conferencing.
6. Designate an accessible place to keep portfolios.
7. Provide positive washback when giving final assessments.

JOURNALS
A journal is a log or account of one's thoughts, feelings, reactions,
assessments, ideas, or progress toward goals, usually written with
little attention to structure, form, or correctness.
Categories or purposes in journal writing include the following:
a. Language learning logs
b. Grammar journals
c. Responses to readings
d. Strategies-based learning logs
e. Self-assessment reflections
f. Diaries of attitudes, feelings, and other affective factors
g. Acculturation logs

CONFERENCES AND INTERVIEWS

Conferences
Conferencing is not limited to drafts of written work; it also covers portfolios and journals.
Conferences must assume that the teacher plays the role of a facilitator and guide, not that of an administrator of a formal assessment.
Interviews
An interview may have one or more of several possible goals, in which the teacher
assesses the student's oral production,
ascertains a student's needs before designing a course or curriculum, or
seeks to discover a student's learning styles and preferences.
One overriding principle of effective interviewing centers on the nature of the questions that will be asked.

OBSERVATIONS
In order to carry out classroom observation, it is of course important
to take the following steps:
1. Determine the specific objectives of the observation.
2. Decide how many students will be observed at one time.
3. Set up the logistics for making unnoticed observations.
4. Design a system for recording observed performances.
5. Plan how many observations you will make.

SELF AND PEER ASSESSMENT


Five categories of self- and peer-assessment:
1. Assessment of a specific performance: in this category, a student typically monitors him- or herself in either oral or written production and renders some kind of evaluation of that performance.
2. Indirect assessment of general performance: indirect assessment targets larger slices of time with a view to rendering an evaluation of general ability, as opposed to one specific, relatively time-constrained performance.
3. Metacognitive assessment for setting goals: some kinds of evaluation are more strategic in nature, with the purpose not just of viewing past performance or competence but of setting goals and maintaining an eye on the process of their pursuit.
4. Socioaffective assessment: yet another type of self- and peer-assessment comes in the form of methods of examining affective factors in learning. Such assessment is quite different from looking at and planning linguistic aspects of acquisition.
5. Student-generated tests: a final type of assessment, not usually classified strictly as self- or peer-assessment, is the technique of engaging students in the process of constructing tests themselves.

GUIDELINES FOR SELF AND PEER ASSESSMENT


Self and peer assessment are among the best possible formative
types of assessment and possibly the most rewarding.
Four guidelines will help teachers bring this intrinsically motivating
task into the classroom successfully.
1. Tell students the purpose of assessment
2. Define the task clearly
3. Encourage impartial evaluation of performance or ability
4. Ensure beneficial washback through follow up tasks
A TAXONOMY OF SELF AND PEER ASSESSMENT TASKS
It is helpful to consider a variety of tasks within each of the four skills (listening, speaking, reading, writing).
An evaluation of self- and peer-assessment according to our classic principles of assessment yields a pattern that is quite consistent with the other alternatives in assessment that have been analyzed in this chapter.
Practicality can reach a moderate level with such procedures as checklists and questionnaires.

CHAPTER 11:
GRADING AND STUDENT
EVALUATION

GUIDELINES FOR SELECTING GRADING CRITERIA


It is essential for all components of grading to be consistent with an
institutional philosophy and/or regulations (see below for a further
discussion of this topic).
All of the components of a final grade need to be explicitly stated in
writing to students at the beginning of a term of study, with a
designation of percentages or weighting figures for each component.
If your grading system includes items (d) through (g) in the
questionnaire above (improvement, behavior, effort, motivation), it
is important for you to recognize their subjectivity. But this should
not give you an excuse to avoid converting such factors into
observable and measurable results.
Finally, consider allocating relatively small weights to items (c)
through (h) so that a grade primarily reflects achievement. A
designation of 5 percent to 10 percent of a grade to such factors will
not mask strong achievement in a course.
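
As a sketch only, the following Python fragment shows how such weighting figures might be combined into a final score; the component names, weights, and scores are hypothetical, chosen so that achievement dominates and the subjective factors stay within the 5-10 percent band suggested above.

components = {  # hypothetical (weight %, score 0-100) pairs
    "midterm":       (25, 78),
    "final_exam":    (30, 85),
    "assignments":   (30, 90),
    "effort":        (5, 95),   # subjective factor, deliberately small weight
    "participation": (10, 88),
}

assert sum(w for w, _ in components.values()) == 100  # weights must sum to 100%

final_score = sum(w * s for w, s in components.values()) / 100
print(final_score)  # 85.55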

CALCULATING GRADES: ABSOLUTE AND RELATIVE GRADING

ABSOLUTE GRADING:
If you pre-specify standards of performance on a
numerical point system, you are using an absolute
system of grading.
For example, having established points for a midterm
test, points for a final exam, and points accumulated for
the semester, you might adhere to the specifications in
the table below.
The key to making an absolute grading system work is to
be painstakingly clear on competencies and objectives,
and on tests, tasks, and other assessment techniques
that will figure into the formula for assigning a grade.
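
A minimal sketch of the idea, using hypothetical cut-off points in place of the table mentioned above: under absolute grading the standards are fixed in advance and do not move with the class.

CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]  # assumed thresholds

def absolute_grade(points):
    # Pre-specified standards: the first cut-off the score meets decides the grade.
    for minimum, letter in CUTOFFS:
        if points >= minimum:
            return letter
    return "F"

print(absolute_grade(85.55))  # "B", regardless of how classmates performed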

RELATIVE GRADING:
It is more commonly used than absolute grading. It has the advantage of allowing your own interpretation and of adjusting for unpredicted ease or difficulty of a test.
Relative grading is usually accomplished by ranking students in order of performance (percentile ranks) and assigning cut-off points for grades.
An older, relatively uncommon method of relative grading is what has been called grading "on the curve," a term that comes from the normal bell curve of normative data plotted on a graph.
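
By contrast, here is a hedged sketch of relative grading: students are ranked and grades assigned by percentile cut-offs. The bands used (top 20% A, next 30% B, and so on) are illustrative assumptions only, as are the student names and scores.

def relative_grades(scores):
    # Rank students from highest to lowest and grade by cumulative share of the class.
    ranked = sorted(scores, key=scores.get, reverse=True)
    bands = [(0.20, "A"), (0.50, "B"), (0.80, "C"), (1.00, "D")]  # assumed bands
    grades = {}
    for i, student in enumerate(ranked):
        share = (i + 1) / len(ranked)
        grades[student] = next(g for cutoff, g in bands if share <= cutoff)
    return grades

print(relative_grades({"Ali": 88, "Beste": 92, "Can": 75, "Derya": 81, "Efe": 67}))
# {'Beste': 'A', 'Ali': 'B', 'Derya': 'C', 'Can': 'C', 'Efe': 'D'}

Note that the same raw score can earn different letters in different classes, which is exactly the flexibility, and the risk, that relative grading introduces.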

TEACHERS' PERCEPTIONS OF APPROPRIATE GRADE DISTRIBUTIONS

Most teachers bring to a test or a course evaluation an interpretation of estimated appropriate distributions, follow that interpretation, and make minor adjustments to compensate for such matters as unexpected difficulty.
What is surprising, however, is that teachers' preconceived notions of their own standards for grading often do not match their actual practice.

INSTITUTIONAL EXPECTATIONS AND CONSTRAINTS


For many institutions letter grading is foreign but point
systems (100 pts or percentages) are common.
Some institutions refuse to employ either a letter grade
or a numerical system of evaluation and instead offer
narrative evaluations of Ss.
This preference for more individualized evaluations is often a reaction to the overgeneralization of letter and numerical grades.
CROSS-CULTURAL FACTORS AND THE QUESTION OF DIFFICULTY

A number of variables bear on the issue. In many cultures:
it is unheard of to ask a student to self-assess performance;
Ts assign a grade, and nobody questions the teacher's criteria;
the measure of a good teacher is one who can design a test so difficult that no student could achieve a perfect score (the fact that students fall short of such marks of perfection is a demonstration of the teacher's superior knowledge);
as a corollary, grades of A are reserved for a highly select few, and students are delighted with Bs;
one single final examination is the accepted determinant of a student's entire course grade;
the notion of a teacher's preparing students to do their best on a test is an educational contradiction.
In some cultures a "hard" test is a good test, but in others, a good test results in a distribution like the one in the bar graph for a "great bunch": a large proportion of As and Bs, a few Cs, and maybe a D or an F for the weakest students.

How do you gauge such difficulty as you design a classroom test that has not had the luxury of piloting and pre-testing?
The answer is complex. It is usually a combination of a
number of possible factors:
experience as a teacher (with appropriate intuition)
adeptness at designing feasible tasks
special care in framing items that are clear and relevant
mirroring in-class tasks that students have mastered
variation of tasks on the test itself
reference to prior tests in the same course
a thorough review and preparation for the test
knowledge of your students' collective abilities
a little bit of luck

WHAT DO LETTER GRADES MEAN?

Typically, institutional manuals for teachers and students will list the following descriptors of letter grades:
A: excellent
B: good
C: adequate
D: inadequate/unsatisfactory
F: failing/unacceptable
The overgeneralization implicit in letter grading underscores the meaninglessness of the adjectives typically cited as descriptors of those letters. Is there a solution to their gate-keeping role?
1. Every teacher who uses letter grades or a percentage score to provide an evaluation, whether a summative, end-of-course assessment or a formal assessment procedure, should
a. use a carefully constructed system of grading,
b. assign grades on the basis of explicitly stated criteria, and
c. base those criteria on the objectives of the course or assessment procedure(s).
2. Educators everywhere must work to persuade the gatekeepers of the world that letter/numerical evaluations are simply one side of a complex representation of a student's ability.
ALTERNATIVES TO LETTER GRADING


For assessment of a test, paper, report, extra-class exercise, or other
formal, scored task, the primary objective of which is to offer
formative feedback, the possibilities beyond a simple number or
letter include
a teacher's marginal and/or end comments,
a teacher's written reaction to a student's self-assessment of
performance,
a teacher's review of the test in the next class period,
peer-assessment of performance,
self-assessment of performance, and
a teacher's conference with the student.
For summative assessment of a student at the end of a course, those
same additional assessments can be made, perhaps in modified
forms:
a teacher's marginal and/or end of exam/paper/project comments
T's summative written evaluative remarks on a journal, portfolio, or
other tangible product
T's written reaction to a student's self assessment of performance in
a course
a completed summative checklist of competencies, with comments
narrative evaluations of general performance on key objectives
a teacher's conference with the student

A more detailed look is now appropriate for a few of the summative alternatives to grading, particularly self-assessment, narrative evaluations, checklists, and conferences.
1. Self-assessment.
Self-assessment of end-of-course attainment of objectives is recommended through the use of the following:
checklists;
a guided journal entry that directs the student to reflect on the content and linguistic objectives;
an essay that self-assesses; and
a teacher-student conference.
2. Narrative evaluations.
In protest against the widespread use of letter grades as exclusive indicators of achievement, a number of institutions have at one time or another required narrative evaluations of students. In some instances those narratives replaced grades, and in others they supplemented them. (pg. 296-297)
Advantages: individualization, evaluation of multiple objectives of a course, face validity, washback potential.
Disadvantages: not quantifiable by admissions and transcript evaluation offices; not practical (time-consuming); Ss paying little attention to them; Ts succumbing to formulaic narratives, which undermines their individualized purpose.

3. Checklist evaluations.
To compensate for the time-consuming impracticality of narrative evaluation, some programs opt for a compromise: a checklist with brief comments from the teacher, ideally followed by a conference and/or a response from the student.
Advantages: increased practicality, reliability, washback. Teacher time is minimized; uniform measures are applied across all students; some open-ended comments from the teacher are available; and the student responds with his or her own goals (in light of the results of the checklist and teacher comments).
!!! When the checklist format is accompanied, as in this case, by letter grades as well, virtually none of the disadvantages of narrative evaluations remain, with only a small chance that individualization may be slightly reduced.
4. Conferences.
Perhaps enough has been said about the virtues of conferencing. You already know that the impracticality of scheduling sessions with students is offset by its washback benefits.

SOME PRINCIPLES AND GUIDELINES FOR GRADING AND EVALUATION

You should now understand that


grading is not necessarily based on a universally
accepted scale,
grading is sometimes subjective and context-dependent,
grading of tests is often done on the "curve,"
grades reflect a teacher's philosophy of grading,
grades reflect an institutional philosophy of grading,
cross-cultural variation in grading philosophies needs to be
understood,

grades often conform, by design, to a teacher's expected distribution of students across a continuum,
tests do not always yield an expected level of difficulty,
letter grades may not "mean" the same thing to all
people, and
alternatives to letter grades or numerical scores are
highly desirable as additional indicators of achievement.

With those characteristics of grading and evaluation in mind, the following principled guidelines should help you be an effective grader and evaluator of student performance:
Develop an informed, comprehensive personal
philosophy of grading that is consistent with your
philosophy of teaching and evaluation.
Ascertain an institution's philosophy of grading and, unless otherwise negotiated, conform to that philosophy (so that you are not out of step with others).
Design tests that conform to appropriate institutional and
cultural expectations of the difficulty that Ss should
experience.
Select appropriate criteria for grading and their relative
weighting in calculating grades.
Communicate criteria for grading to Ss at the beginning
of the course and at subsequent grading periods (midterm, final).
Triangulate letter grade evaluations with alternatives that
are more formative and that give more washback.
