
The Similarities & Difference of Classroom Test & Standardized Achievement Test
by Stacy Alleyne, studioD
Formative assessments are used more frequently in the classroom.

Assessments serve several purposes, but they are most commonly used to gauge the
level and depth of student learning and skill retention. Assessments can be either
formative or summative. In a classroom, formative assessments are used to help
teachers plan future lessons and identify areas they have to reteach or ways they must
adjust their lessons. Formative assessments are usually not graded as strictly as
summative assessments. Summative assessments, like standardized achievement tests,
are used to gauge where students are at a particular time in relation to specific learning
goals.
Purpose
The main similarities and differences between classroom and standardized achievement
tests lie in the purpose for which a particular test is administered. Classroom teachers
utilize formative assessments because they are more concerned with equipping their
students with certain knowledge and skills, while standardized test administrators' sole
purpose is to evaluate student readiness.
Standardized Assessments
Standardized achievement tests are used to test a student's understanding of skills and
knowledge in comparison to students of the same age group or educational level. The
scores from these tests are used in determining a student's readiness for college,
graduate school and professional programs. The SAT, for example, is a well-known
summative test that is used to determine a student's aptitude for college. Several
standardized achievement tests are given in elementary, middle and high school to
provide schools with the data they need to see how a school and its students are
performing in comparison with schools in other states across the nation.
Similarities
Classroom and standardized tests are similar in that they both test student skill and
knowledge at various levels. In a high school English class, for example, teachers can
choose to give students a summative test that assesses understanding of grammar and
usage, just as a standardized achievement test does. However, teachers may choose to
test only the areas covered in class, while standardized achievement tests are likely to
cover more ground.
Differences
Some of the major differences between classroom tests and standardized tests are the
allotted time, structure and content of the tests. Classroom tests can be much more
individualized. A teacher may choose to test students specifically on the subject matter
he or she taught in class and may also vary the amount of time allowed for students to
take a test. In a standardized testing situation, all students take the exact same test and
are given the same amount of time to take it. With the exception of accommodations for
students with disabilities, standardized tests are much more structured and uniform than
classroom tests.

DIRECT VS. INDIRECT ASSESSMENT MEASURES


What is a "direct measure" of student learning?
Direct measures assess student performance of identified learning outcomes, such as
mastery of a lifelong skill. They require standards of performance. Examples of direct
assessments are: pre/post test; course-embedded questions; standardized exams;
portfolio evaluation; videotape/audiotape of performance; capstone course evaluation.
What is an "indirect measure" of student learning?
Indirect measures assess opinions or thoughts about student knowledge, skills, attitudes,
learning experiences, and perceptions. Examples of indirect measures are: student
surveys about instruction; focus groups; alumni surveys; employer surveys.
Examples of Direct Measures of Student Learning

Faculty assigned to teach different sections of a gatekeeper course in the general
education social and behavioral sciences program agree to assign a final paper
that asks students for a comparative analysis of several sociological theories. With
the students' permission, the faculty copy the final papers before marking on
them, and then give the copies (or a representative or random sample) to the
review committee. Three members of the committee score each paper according
to guidelines developed by the program.

Each final exam for every course in a general education foreign language
program requires that students translate selected passages they are unlikely to
have seen before. At the end of the academic year, each translation is reviewed by
at least two foreign language faculty using the same scoring guide to analyze
points of strength and weakness in the translations for each academic class. As
they accumulate student translations from year to year, the faculty are able to
measure the improvement in the translations that individual students did in their
first and second year language courses.

Each year, twenty percent of the papers submitted by students in a program
capstone course are randomly selected for assessment and sent to a panel of
outside evaluators. The evaluators write comments about the overall quality of the
sample, and then select ten papers to discuss in depth. As part of their discussion,
the evaluators determine the extent to which each paper reflects the achievement
of one or two of the program learning goals, which the program has previously
selected, and score them according to guidelines previously developed by the
program.

Students in an occupational program are asked to collect and maintain a portfolio
of materials throughout their occupational coursework, and to write a short
summary for each course. Specific questions about learning styles and experiences
are provided to help guide their reflections. They write a full paper about their
experiences in the occupational program, tracing the main themes and content of
their learning experience. In addition, each student is asked to design and carry
out a research project on a topic that extends their learning experience. Three
faculty independently rate the reflective and research papers, using guidelines and
specific questions about how the students' performance reflects each of the
program's knowledge or skill goals.

The general education math program has developed six outcomes for
intermediate algebra and eight outcomes for college algebra. A portion of the final
exam includes common problems that directly measure the students' ability to
perform each outcome. The problems are used in every intermediate and college
algebra course offered. Common grading criteria have also been established and
are outlined in the course syllabi.

A general education science program includes in the final exam a set of common
questions that measure the course outcomes. Instructors are then able to
determine overall whether the students demonstrate competence, as well as
identify areas in which students may be weak.

An occupational program developed a competence checklist for speech and public
speaking that evaluates two presentations students give in the program as well as
their interaction in small group activities. Upon evaluating the students'
performance on designated assessment activities, the instructor indicates in the
final grade roster, beside the final letter grade, whether or not the student has
achieved the competence measured. The program is then able to calculate the
percentage of students who were competent in the area assessed, as sketched
below. This provides feedback to faculty in the program about where students have
a high level of competence and where additional work may be needed.
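
A minimal Python sketch of that percentage calculation, using a hypothetical grade roster (the grades and competence flags below are invented for the example):

    # Hypothetical roster entries: (final letter grade, competence flag), "C" = competent
    roster = [("A", "C"), ("B", "C"), ("C", ""), ("B", "C"), ("D", "")]

    competent = sum(1 for _, flag in roster if flag == "C")
    percent_competent = 100 * competent / len(roster)
    print(f"{percent_competent:.0f}% of students achieved the competence")  # 60%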

A general education humanities program requires that students submit samples
of their work from their courses. During exam week, faculty gather and critique
the portfolios, looking not only at the quality of work in evidence, but at each
student's improvement. The review committee then asks faculty to reflect on the
trends they saw in their critiques, and what they might suggest about areas of
strength and weakness in the program overall.

A management/marketing program asks each student in the capstone course to
develop a marketing plan for a product and to present it to a panel of faculty while
being videotaped. Students have been told that the videotapes will be used for
assessment and their permissions have been secured. The assessment committee
then invites local marketing professionals to join them in developing guidelines by
which to assess the presentations. They view a random sample of the
presentations and, as a group, roughly divide them into groups of adequate,
superior and inadequate. They then review the tapes for each group and come up
with five or six sentences describing the properties that make a presentation
superior, adequate or inadequate. After inductively composing this scoring guide,
the group divides into pairs, with each pair separately scoring every presentation in
a larger sample. If the two scorers reach different conclusions, the presentation is
sent to a third reviewer.

Students in a Spanish language program are interviewed at the end of their second
year. A panel of three faculty conduct the interviews in Spanish, and each rates
each student according to a standard scoring guide that rates students' proficiency
in Spanish as well as knowledge of Spanish literary, historical and cultural
traditions. At the bottom of each score card, faculty note particular areas of
strength and weakness within each category. After the interviews are completed,
faculty gather to compare observations, analyze the scores, and discuss common
areas of strength and weakness in the students' interviews.

Students entering an occupational program take a test to determine their
understanding of various concepts critical to the field of study. When the students
complete their course sequences, they are given exit exams to determine their
growth. Scores from the entering and exiting exams for a random sample of each
year's graduates are compared as part of each year's outcomes assessment
report; a brief sketch of this comparison follows.
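
A small Python sketch of that entry/exit comparison, with hypothetical paired scores for a sampled cohort:

    # Hypothetical paired entry and exit scores for a random sample of one year's graduates
    entry_scores = [52, 61, 47, 70, 58]
    exit_scores = [68, 74, 55, 82, 71]

    gains = [post - pre for pre, post in zip(entry_scores, exit_scores)]
    mean_gain = sum(gains) / len(gains)
    print(f"Average growth from entry to exit: {mean_gain:.1f} points")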

Examples of Indirect Measures of Student Learning


Surveys - opinions, thoughts, reactions

Student Surveys

In Class Surveys

Department Surveys

Student Evaluation of Instruction

Student Ratings

Alumni Surveys

Employer Surveys

Faculty Surveys

Self Assessment / Reports

Student/Alumni/Faculty self assessments or reports

Focus Groups

Student groups

Alumni groups

Employer groups

Interviews

Student interviews

Exit interviews

Employer interviews

Course Grades
Graduation / Completion Rates
Job Placement Data
Advisory Board Feedback / Evaluation
Course Content / Grade Correlations

Difference Between Achievement and Aptitude Tests


Achievement vs Aptitude Tests
In education, taking tests is a way of life for students. There are many types of tests that gauge
various strengths or weaknesses of a student. Often they are taken to measure the psychological,
logical, and general intelligence of a student. Some of the more popular test forms administered
nowadays are achievement and aptitude tests. But how do these tests differ from each other?
To measure one's ability or capacity to learn, an aptitude test is the most appropriate test form to
administer. Taking such a test helps evaluators, parents and teachers predict how a particular
student is likely to fare in school. Because of the nature of this exam, there is actually no need to
study or prepare for it, because there is no specific material to study for an aptitude evaluation.

Nonetheless, there are some techniques that may help the student gain better results on an aptitude
exam. These are the following:
o Foster or encourage reading in your child or student.
o Converse with him or her on topics that lean more toward current events.
o Have a dictionary and thesaurus readily available for your child or student for quick reference for
newly encountered words or terms.
o Take him or her to art galleries, museums, libraries and other enriching locations within your area.
Conversely, achievement tests are very different in the sense that these exams are taken to gauge the
extent of what the child or student has already learned. In this regard, skills and
current knowledge regarding both familiar and trivial subject matters, most likely discussed
previously, can all be included. This type of test is probably the most commonly used test form at
school, because almost all tests that measure the knowledge of what students have learned from the
lesson are achievement exams, such as long tests, preliminary exams, midterm exams and even
final exams.
When taking achievement tests, the student must first have some time to refresh his or her memory
and study. Repeated and quality reviews can help the student get higher marks on achievement exams.
Overall:
1. Aptitude tests are used to predict a student's likelihood to pass or perform in school, whereas
achievement tests measure what the student has already learned in general.
2. There is almost no specific or guaranteed way to prepare for an aptitude test, while you only need
to review or study what you have previously learned to be prepared for an achievement test.

Components of Communication
1. Context
2. Sender/Encoder
3. Message
4. Medium
5. Receiver/Decoder
6. Feedback
Context

Every message, oral or written, begins with context. Context is a very broad field that consists of
different aspects. One aspect is country, culture and organization: every organization, culture and
country communicates information in its own way.
Another aspect of context is the external stimulus. The sources of external stimuli include a meeting,
letter, memo, telephone call, fax, note, email and even a casual conversation. These external stimuli
motivate you to respond, and this response may be oral or written.
Internal stimuli are another aspect of context. Internal stimuli include your opinions, attitudes,
likes, dislikes, emotions, experience, education and confidence. All of these have a multifaceted
influence on the way you communicate your ideas.
A sender can communicate his or her ideas effectively by considering all aspects of context mentioned
above.
Sender/Encoder
The encoder is the person who sends the message. In oral communication the encoder is the speaker,
and in written communication the writer is the encoder. An encoder uses a combination of symbols,
words, graphs and pictures understandable by the receiver to best convey the message and achieve the
desired response.
Message
The message is the information that is exchanged between sender and receiver. The first task is to
decide what you want to communicate and what the content of your message will be: what are the
main points of your message and what other information should be included. The central idea of the
message must be clear. While writing the message, the encoder should keep in mind all aspects of the
context and of the receiver (how he or she will interpret the message).
Messages can be intentional or unintentional.
Medium
The medium is the channel through which the encoder communicates the message, that is, how the
message gets there. Your medium for sending a message may be print, electronic or sound; the
medium may even be a person, such as a postman. The choice of medium depends entirely on the
nature of your message and the contextual factors discussed above. The choice of medium is also
influenced by the relationship between the sender and receiver.
The oral medium is effective when your message is urgent or personal, or when immediate feedback
is desired. When your message is long, technical and needs to be documented, a written medium,
which is more formal in nature, should be preferred. These guidelines may change when
communicating internationally, where complex situations are dealt with orally and communicated in
writing later on.
Receiver/Decoder
The person to whom the message is being sent is called the receiver/decoder. The receiver may be a
listener or a reader, depending on the medium the sender chose to transmit the message. The receiver
is also influenced by the context and by internal and external stimuli.
Because the receiver is the person who interprets the message, there is a high chance of
miscommunication arising from the receiver's perception, opinion, attitude and personality. There
will be only minor deviation in transmitting the exact idea if the receiver is educated and has
communication skills.
Feedback
The response or reaction of the receiver to a message is called feedback. Feedback may be a written
or oral message, an action, or simply silence.
Feedback is the most important component of communication in business. Communication is
said to be effective only when it receives some feedback. Feedback, in fact, completes the loop of
communication.

Types of Test Item Formats


Introduction
Just as there are several types of tests available to help employers make employment decisions, there
are also several types of test formats. In this section, the pros and cons of general types of test item
formats are described, along with some general guidelines for using different types of test item
formats.
Before deciding on a particular type of test format, you should first establish a) whether testing makes
sense (see the section on Employment Testing Overview) and b) what it is you want to assess (see the
section on Establishing an Effective Employee Testing Program). The determination of what you want
to measure with the test should precede the determination of how you are going to measure it.
Pros and Cons of Multiple Choice Test Items
PROS
- Can be used to test many levels of learning
- Can be used to test a person's ability to integrate information
- Can be used to diagnose a person's difficulty with certain concepts
- Can provide test takers with feedback about why distractors were wrong and why correct answers were right
- Can ask more questions, for greater coverage of material
- Can cover a wide range of difficulty levels
- Usually requires less time for test takers to answer
- Usually easily scored and graded

CONS
- Test takers may perceive questions to be tricky or too picky
- Difficult to test attitudes toward learning because correct responses can be easily faked
- Does not allow test takers to demonstrate knowledge beyond the options provided
- Requires a great deal of time to construct effective multiple-choice questions, especially ones that test higher levels of learning
- Encourages guessing because one option is always right
- Test takers may misinterpret questions

Pros and Cons of True-False Test Items


PROS
- Can ask more questions for greater coverage of material
- Can cover a wide range of difficulty levels
- Usually requires less time for test takers to answer
- Usually easily graded and scored

CONS
- Does not allow test takers to demonstrate a broad range of knowledge
- Is difficult to construct effective true-false items that test higher levels of learning
- Encourages guessing due to the 50/50 chance of being correct
- Is easily faked; difficult to test attitudes toward learning

Pros and Cons of Essay Test Items


PROS
- Can test complex learning objectives
- Can test processes used to answer the question, such as the ability to integrate ideas and synthesize information
- Requires use of writing skills, correct spelling, and grammar
- Can provide a more realistic and generalizable task for the test
- Usually takes less time to construct
- Is more difficult for test takers to guess the correct answer

CONS
- Usually takes more time to answer
- Can be unreliable in assessing the entire content of a course or topic area
- Essay answers are often written poorly because test takers may not have time to organize and proofread answers
- Is typically graded or scored more subjectively; non-test-related information may influence the scoring process
- Requires special effort to be graded in an objective manner
- Requires more time to grade or score

Guidelines for Using Multiple Choice or True-False Test Items


It is generally best to use multiple-choice or true-false items when:
You want to test the breadth of learning because more material can be covered with this
format.
You want to test different levels of learning.
You have little time for scoring.
You are not interested in evaluating how well a test taker can formulate a correct answer.
You have a clear idea of which material is important and which material is less important.
You have a large number of test takers.
Guidelines for Using Essay Test Items
It is generally best to use essay items when:
You want to evaluate a person's ability to formulate a correct answer.
You want to assess people's ability to express themselves in writing, and writing is an
important aspect of the job.
You have time to score the essay items thoroughly.
You feel more confident about your ability to read written answers critically than to construct
effective multiple-choice items.
You want to test a person's ability to apply concepts and information to a new situation.
You have a clear idea of the most important information and concepts that should be tested.

EXPLORING RELIABILITY IN ACADEMIC ASSESSMENT

Written by Colin Phelan and Julie Wren, Graduate Assistants, UNI Office of
Academic Assessment (2005-06)

Reliability is the degree to which an assessment tool produces stable and consistent
results.

Types of Reliability

1. Test-retest reliability is a measure of reliability obtained by administering the same test twice
over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then
be correlated in order to evaluate the test for stability over time.

Example: A test designed to assess student learning in psychology could be given to a


group of students twice, with the second administration perhaps coming a week after
the first. The obtained correlation coefficient would indicate the stability of the scores.
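
A minimal Python sketch of that correlation step, using hypothetical scores from two administrations of the same test (SciPy is assumed to be available):

    from scipy.stats import pearsonr

    # Hypothetical scores for the same six students at Time 1 and Time 2 (one week apart)
    time1 = [78, 85, 62, 90, 71, 66]
    time2 = [80, 83, 65, 88, 70, 69]

    r, _ = pearsonr(time1, time2)
    print(f"Test-retest (stability) coefficient: r = {r:.2f}")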

2. Parallel forms reliability is a measure of reliability obtained by administering different


versions of an assessment tool (both versions must contain items that probe the same
construct, skill, knowledge base, etc.) to the same group of individuals. The scores from the
two versions can then be correlated in order to evaluate the consistency of results across
alternate versions.

Example: If you wanted to evaluate the reliability of a critical thinking assessment, you
might create a large set of items that all pertain to critical thinking and then randomly
split the questions up into two sets, which would represent the parallel forms.
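
A short sketch, with an invented item pool, of how that random split into two parallel forms might be done in Python:

    import random

    # Hypothetical pool of 40 critical-thinking item IDs
    item_pool = [f"item_{i:02d}" for i in range(1, 41)]

    random.shuffle(item_pool)
    half = len(item_pool) // 2
    form_a, form_b = item_pool[:half], item_pool[half:]
    # Each student would take both forms; the two total scores are then
    # correlated (e.g., with scipy.stats.pearsonr) to estimate parallel-forms reliability.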

3. Inter-rater reliability is a measure of reliability used to assess the degree to which different
judges or raters agree in their assessment decisions. Inter-rater reliability is useful because
human observers will not necessarily interpret answers the same way; raters may disagree as
to how well certain responses or material demonstrate knowledge of the construct or skill being
assessed.

Example: Inter-rater reliability might be employed when different judges are evaluating
the degree to which art portfolios meet certain standards. Inter-rater reliability is
especially useful when judgments can be considered relatively subjective. Thus, the
use of this type of reliability would probably be more likely when evaluating artwork as
opposed to math problems.
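
One simple (though not the only) way to quantify such agreement is the percentage of identical ratings between two judges; a Python sketch with hypothetical rubric scores:

    # Hypothetical ratings (1-4 rubric) from two judges for ten portfolios
    rater_1 = [3, 4, 2, 3, 1, 4, 3, 2, 4, 3]
    rater_2 = [3, 4, 2, 2, 1, 4, 3, 3, 4, 3]

    agreements = sum(a == b for a, b in zip(rater_1, rater_2))
    percent_agreement = 100 * agreements / len(rater_1)
    print(f"Exact agreement between raters: {percent_agreement:.0f}%")  # 80%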

4. Internal consistency reliability is a measure of reliability used to evaluate the degree to


which different test items that probe the same construct produce similar results.

A. Average inter-item correlation is a subtype of internal consistency reliability. It is


obtained by taking all of the items on a test that probe the same construct (e.g., reading
comprehension), determining the correlation coefficient for each pair of items, and
finally taking the average of all of these correlation coefficients. This final step yields the
average inter-item correlation.
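
A compact Python sketch of those steps, using an invented matrix of item scores (rows are students, columns are items probing the same construct):

    import numpy as np

    # Hypothetical right/wrong item scores; rows = students, columns = items
    items = np.array([
        [1, 1, 0, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 0, 1, 1],
    ])

    corr = np.corrcoef(items, rowvar=False)        # item-by-item correlation matrix
    pairs = corr[np.triu_indices_from(corr, k=1)]  # each item pair counted once
    print(f"Average inter-item correlation: {pairs.mean():.2f}")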

B. Split-half reliability is another subtype of internal consistency reliability. The process


of obtaining split-half reliability is begun by splitting in half all items of a test that are
intended to probe the same area of knowledge (e.g., World War II) in order to form two
sets of items. The entire test is administered to a group of individuals, the total score
for each set is computed, and finally the split-half reliability is obtained by determining
the correlation between the two total set scores.
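
The same procedure in a brief Python sketch, with hypothetical item scores and an odd/even split standing in for "splitting in half" (SciPy assumed available):

    from scipy.stats import pearsonr

    # Hypothetical per-item scores (rows = students) on a test of one knowledge area
    scores = [
        [1, 0, 1, 1, 0, 1, 1, 0],
        [1, 1, 1, 0, 1, 1, 0, 1],
        [0, 0, 1, 0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [0, 1, 0, 1, 0, 0, 1, 0],
    ]

    half_a = [sum(row[0::2]) for row in scores]  # total for odd-positioned items
    half_b = [sum(row[1::2]) for row in scores]  # total for even-positioned items

    r, _ = pearsonr(half_a, half_b)
    print(f"Split-half reliability: r = {r:.2f}")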

Validity refers to how well a test measures what it is purported to measure.

Why is it necessary?
While reliability is necessary, it alone is not sufficient; a test must also be valid. For
example, if your scale is off by 5 lbs, it reads your weight every day with an excess of
5 lbs. The scale is reliable because it consistently reports the same weight every day,
but it is not valid because it adds 5 lbs to your true weight. It is not a valid measure of
your weight.

Types of Validity

1. Face Validity ascertains that the measure appears to be assessing the intended construct under
study. The stakeholders can easily assess face validity. Although this is not a very scientific type of
validity, it may be an essential component in enlisting motivation of stakeholders. If the stakeholders
do not believe the measure is an accurate assessment of the ability, they may become disengaged
with the task.

Example: If a measure of art appreciation is created, all of the items should be related to
the different components and types of art. If the questions concern historical time
periods, with no reference to any artistic movement, stakeholders may not be motivated to
give their best effort or invest in this measure because they do not believe it is a true
assessment of art appreciation.

2. Construct Validity is used to ensure that the measure is actually measuring what it is
intended to measure (i.e. the construct), and not other variables. Using a panel of
experts familiar with the construct is one way in which this type of validity can be assessed.
The experts can examine the items and decide what each specific item is intended to
measure. Students can be involved in this process to obtain their feedback.

Example: A women's studies program may design a cumulative assessment of learning
throughout the major. If the questions are written with complicated wording and phrasing,
the test can inadvertently become a test of reading comprehension rather than a test of
women's studies. It is important that the measure is actually assessing the intended
construct, rather than an extraneous factor.

3. Criterion-Related Validity is used to predict future or current performance - it


correlates test results with another criterion of interest.

Example: A physics program might design a measure to assess cumulative student learning
throughout the major. The new measure could be correlated with a standardized measure
of ability in this discipline, such as an ETS field test or the GRE subject test. The higher
the correlation between the established measure and the new measure, the more faith
stakeholders can have in the new assessment tool.
4. Formative Validity, when applied to outcomes assessment, describes how well a
measure is able to provide information to help improve the program under study.

Example: When designing a rubric for history, one could assess students' knowledge
across the discipline. If the measure can provide information that students are lacking
knowledge in a certain area, for instance the Civil Rights Movement, then that assessment
tool is providing meaningful information that can be used to improve the course or
program requirements.

5. Sampling Validity (similar to content validity) ensures that the measure covers the
broad range of areas within the concept under study. Not everything can be covered, so
items need to be sampled from all of the domains. This may need to be completed using
a panel of experts to ensure that the content area is adequately sampled. Additionally, a
panel can help limit expert bias (i.e. a test reflecting what an individual personally feels
are the most important or relevant areas).

Example: When designing an assessment of learning in the theatre department, it would
not be sufficient to only cover issues related to acting. Other areas of theatre, such as
lighting, sound and the functions of stage managers, should all be included. The assessment
should reflect the content area in its entirety.

What are some ways to improve validity?


1. Make sure your goals and objectives are clearly defined and operationalized. Expectations of
students should be written down.
2. Match your assessment measure to your goals and objectives. Additionally, have the test
reviewed by faculty at other schools to obtain feedback from an outside party who is less
invested in the instrument.

3. Get students involved; have the students look over the assessment for troublesome wording,
or other difficulties.
4. If possible, compare your measure with other measures, or data that may be available.

NORM-REFERENCED TEST
LAST UPDATED: 07.22.15
Norm-referenced refers to standardized tests that are designed to compare and
rank test takers in relation to one another. Norm-referenced tests report whether test
takers performed better or worse than a hypothetical average student, which is
determined by comparing scores against the performance results of a statistically
selected group of test takers, typically of the same age or grade level, who have already
taken the exam.
Calculating norm-referenced scores is called the norming process, and the comparison
group is known as the norming group. Norming groups typically comprise only a small
subset of previous test takers, not all or even most previous test takers. Test developers
use a variety of statistical methods to select norming groups, interpret raw scores, and
determine performance levels.
Norm-referenced scores are generally reported as a percentage or percentile ranking. For
example, a student who scores in the seventieth percentile performed as well as or better
than seventy percent of other test takers of the same age or grade level, while thirty
percent of test takers performed better (as determined by norming-group scores).
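
A tiny Python sketch of that percentile-rank idea, using an invented norming group and raw score:

    # Hypothetical norming-group raw scores and one student's raw score
    norming_group = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]
    student_score = 27

    at_or_below = sum(1 for s in norming_group if s <= student_score)
    percentile = 100 * at_or_below / len(norming_group)
    print(f"Scored as well as or better than {percentile:.0f}% of the norming group")  # 70%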
Norm-referenced tests often use a multiple-choice format, though some include open-ended,
short-answer questions. They are usually based on some form of national standards, not
locally determined standards or curricula. IQ tests are among the most well-known
norm-referenced tests, as are developmental-screening tests, which are used to identify
learning disabilities in young children or determine eligibility for special-education
services. A few major norm-referenced tests include the California Achievement Test,
Iowa Test of Basic Skills, Stanford Achievement Test, and TerraNova.
The following are a few representative examples of how norm-referenced tests and
scores may be used:

To determine a young child's readiness for preschool or kindergarten. These tests
may be designed to measure oral-language ability, visual-motor skills, and cognitive
and social development.

To evaluate basic reading, writing, and math skills. Test results may be used for a
wide variety of purposes, such as measuring academic progress, making course
assignments, determining readiness for grade promotion, or identifying the need for
additional academic support.

To identify specific learning disabilities, such as autism, dyslexia, or nonverbal
learning disability, or to determine eligibility for special-education services.

To make program-eligibility or college-admissions decisions (in these cases,
norm-referenced scores are generally evaluated alongside other information about a
student). Scores on SAT or ACT exams are a common example.

Norm-Referenced vs. Criterion-Referenced Tests


Norm-referenced tests are specifically designed to rank test takers on a bell curve, or a
distribution of scores that resembles, when graphed, the outline of a bell: a small
percentage of students performing well, most performing average, and a small
percentage performing poorly. To produce a bell curve each time, test questions are
carefully designed to accentuate performance differences among test takers, not to
determine if students have achieved specified learning standards, learned certain
material, or acquired specific skills and knowledge. Tests that measure performance
against a fixed set of standards or criteria are called criterion-referenced tests.
Criterion-referenced test results are often based on the number of correct answers
provided by students, and scores might be expressed as a percentage of the total
possible number of correct answers. On a norm-referenced exam, however, the score
would reflect how many more or fewer correct answers a student gave in comparison to
other students. Hypothetically, if all the students who took a norm-referenced test
performed poorly, the least-poor results would rank students in the highest percentile.
Similarly, if all students performed extraordinarily well, the least-strong performance
would rank students in the lowest percentile.
It should be noted that norm-referenced tests cannot measure the learning achievement
or progress of an entire group of students, but only the relative performance of
individuals within a group. For this reason, criterion-referenced tests are used to measure
whole-group performance.
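
A brief Python sketch contrasting the two score interpretations on the same invented results, including the case described above where every student performs poorly:

    # Hypothetical raw scores on a 50-item test where everyone did poorly
    raw_scores = {"Ana": 20, "Ben": 15, "Cai": 18, "Dee": 12, "Eli": 10}
    total_items = 50
    passing_standard = 0.70  # criterion: at least 70% correct

    for name, raw in raw_scores.items():
        percent_correct = raw / total_items                       # criterion-referenced view
        at_or_below = sum(r <= raw for r in raw_scores.values())
        percentile = 100 * at_or_below / len(raw_scores)          # norm-referenced view
        status = "meets standard" if percent_correct >= passing_standard else "below standard"
        print(f"{name}: {percent_correct:.0%} correct ({status}); percentile rank {percentile:.0f}")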

Reform
Norm-referenced tests have historically been used to make distinctions among students,
often for the purposes of course placement, program eligibility, or school admissions. Yet
because norm-referenced tests are designed to rank student performance on a relative
scale, i.e., in relation to the performance of other students, norm-referenced testing has
been abandoned by many schools and states in favor of criterion-referenced tests, which
measure student performance in relation to a common set of fixed criteria or standards.
It should be noted that norm-referenced tests are typically not the form of standardized
test widely used to comply with state or federal policies, such as the No Child Left
Behind Act, that are intended to measure school performance, close achievement
gaps, or hold schools accountable for improving student learning results. In most cases,
criterion-referenced tests are used for these purposes because the goal is to determine
whether schools are successfully teaching students what they are expected to learn.
Similarly, the assessments being developed to measure student achievement of
the Common Core State Standards are also criterion-referenced exams. However,
some test developers promote their norm-referenced exams (for example, the TerraNova
Common Core) as a way for teachers to benchmark learning progress and determine if
students are on track to perform well on Common Core-based assessments.

Debate
While norm-referenced tests are not the focus of ongoing national debates about
high-stakes testing, they are nonetheless the object of much debate. The essential
disagreement is between those who view norm-referenced tests as objective, valid, and
fair measures of student performance, and those who believe that relying on relative
performance results is inaccurate, unhelpful, and unfair, especially when making
important educational decisions for students. While part of the debate centers on
whether or not it is ethically appropriate, or even educationally useful, to evaluate
individual student learning in relation to other students (rather than evaluating individual
performance in relation to fixed and known criteria), much of the debate is
also focused on whether there is a general overreliance on standardized-test scores in
the United States, and whether a single test, no matter what its design, should be used
to the exclusion of other measures to evaluate school or student performance.
It should be noted that perceived performance on a standardized test can potentially be
manipulated, regardless of whether a test is norm-referenced or criterion-referenced. For
example, if a large number of students are performing poorly on a test, the performance
criteria, i.e., the bar for what is considered passing or proficient, could be lowered
to improve perceived performance, even if students are not learning more or
performing better than past test takers. For example, if a standardized test administered
in eleventh grade uses proficiency standards that are considered to be equivalent to
eighth-grade learning expectations, it will appear that students are performing well,
when in fact the test has not measured learning achievement at a level appropriate to
their age or grade. For this reason, it is important to investigate the criteria used to
determine proficiency on any given test, and particularly when a test is considered
high stakes, since there is greater motivation to manipulate perceived test
performance when results are tied to sanctions, funding reductions, public
embarrassment, or other negative consequences.
The following are representative of the kinds of arguments typically made by proponents
of norm-referenced testing:

Norm-referenced tests are relatively inexpensive to develop, simple to administer,
and easy to score. As long as the results are used alongside other measures of
performance, they can provide valuable information about student learning.

The quality of norm-referenced tests is usually high because they are developed by
testing experts, piloted, and revised before they are used with students, and they are
dependable and stable for what they are designed to measure.

Norm-referenced tests can help differentiate students and identify those who may
have specific educational needs or deficits that require specialized assistance or
learning environments.

The tests are an objective evaluation method that can decrease bias or favoritism
when making educational decisions. If there are limited places in a gifted and
talented program, for example, one transparent way to make the decision is to give
every student the same test and allow the highest-scoring students to gain entry.

The following are representative of the kinds of arguments typically made by critics of
norm-referenced testing:

Although testing experts and test developers warn that major educational
decisions should not be made on the basis of a single test score, norm-referenced
scores are often misused in schools when making critical educational decisions, such
as grade promotion or retention, which can have potentially harmful consequences
for some students and student groups.

Norm-referenced tests encourage teachers to view students in terms of a bell
curve, which can lead them to lower academic expectations for certain groups of
students, particularly special-needs students, English-language learners, or
minority groups. And when academic expectations are consistently lowered year
after year, students in these groups may never catch up to their peers, creating a
self-fulfilling prophecy. For a related discussion, see high expectations.
Multiple-choice tests, the dominant norm-referenced format, are better suited to
measuring remembered facts than more complex forms of thinking. Consequently,
norm-referenced tests promote rote learning and memorization in schools over more
sophisticated cognitive skills, such as writing, critical reading, analytical thinking,
problem solving, or creativity.

Overreliance on norm-referenced test results can lead to inadvertent discrimination
against minority groups and low-income student populations, both of which tend to
face more educational obstacles than non-minority students from higher-income
households. For example, many educators have argued that the overuse of
norm-referenced testing has resulted in a significant overrepresentation of minority
students in special-education programs. On the other hand, using norm-referenced
scores to determine placement in gifted and talented programs, or other enriched
learning opportunities, leads to the underrepresentation of minority and lower-income
students in these programs. Similarly, students from higher-income
households may have an unfair advantage in the college-admissions process because
they can afford expensive test-preparation services.

An overreliance on norm-referenced test scores undervalues important
achievements, skills, and abilities in favor of the narrower set of skills measured
by the tests.

Many educators and members of the public fail to grasp the distinctions between
criterion-referenced and norm-referenced testing. It is common to hear the two types of testing
referred to as if they serve the same purposes, or share the same characteristics. Much confusion
can be eliminated if the basic differences are understood.
The following is adapted from: Popham, J. W. (1975). Educational evaluation. Englewood Cliffs,
New Jersey: Prentice-Hall, Inc.
Dimension: Purpose
Criterion-Referenced Tests: To determine whether each student has achieved specific skills or
concepts; to find out how much students know before instruction begins and after it has finished.
Norm-Referenced Tests: To rank each student with respect to the achievement of others in broad
areas of knowledge; to discriminate between high and low achievers.

Dimension: Content
Criterion-Referenced Tests: Measures specific skills which make up a designated curriculum. These
skills are identified by teachers and curriculum experts. Each skill is expressed as an instructional
objective.
Norm-Referenced Tests: Measures broad skill areas sampled from a variety of textbooks, syllabi, and
the judgments of curriculum experts.

Dimension: Item Characteristics
Criterion-Referenced Tests: Each skill is tested by at least four items in order to obtain an adequate
sample of student performance and to minimize the effect of guessing. The items which test any
given skill are parallel in difficulty.
Norm-Referenced Tests: Each skill is usually tested by fewer than four items. Items vary in difficulty.
Items are selected that discriminate between high and low achievers.

Dimension: Score Interpretation
Criterion-Referenced Tests: Each individual is compared with a preset standard for acceptable
achievement; the performance of other examinees is irrelevant. A student's score is usually expressed
as a percentage. Student achievement is reported for individual skills.
Norm-Referenced Tests: Each individual is compared with other examinees and assigned a score,
usually expressed as a percentile, a grade equivalent score, or a stanine. Student achievement is
reported for broad skill areas, although some norm-referenced tests do report student achievement
for individual skills.

The differences outlined are discussed in many texts on testing. The teacher or administrator who
wishes to acquire a more technical knowledge of criterion-referenced tests or their norm-referenced
counterparts may find the text from which this material was adapted particularly helpful.
