Chapter 14 Standardised and standards-based assessments
O'Donnell, Angela M., et al. EDUCATIONAL PSYCHOLOGY SECOND AUSTRALIAN EDITION, Wiley, 2015. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/curtin/detail.action?docID=4748094.
Created from curtin on 2018-02-10 18:09:54.
UP TO STANDARD?
You are a week into your appointment as a first-year teacher at a P–Year 10 regional
school of 186 students and five teaching staff, including a teaching-principal (Marty).
You have just received the following SMS from the principal.
Hi Kaylan, need to speak. Look at action plan for lifting our literacy & numeracy
results. What advice? Speak Monday.
Best, Marty.
Sent 4.12 pm Friday, 22 February.
Chapter overview
Standardised and standards-based tests are as much a fact of life at school as taking
attendance, lunch break and water fountains. In this chapter, standardised and
standards-based assessments are demystified. What these tests are, what they are
used for and how they are developed are examined. The benefits gained from using
such assessments as well as the problems associated with them are also considered.
The history of such tests is reviewed, along with how to understand and interpret
the scores and the controversies associated with standardised testing.
Developments in Europe and the United States led to the forms of standardised
testing that exist today. In Germany, the pioneering psychologist Wilhelm Wundt and his students began
the serious examination of individual differences among humans. In England,
Charles Darwin’s cousin and contemporary, Francis Galton, was interested in the
inheritance of intelligence. In the United States, James Cattell focused his efforts
on vision, reaction time and memory, among other characteristics.
In France, Alfred Binet worked to develop a series of mental tests that would
make it possible to determine the mental age of students who were not performing
well in public schools so that they might be assigned to the proper school for
remedial work. Binet’s work led to two fascinating and quite divergent developments.
The first was the creation of the intelligence test (a measure of generalised
intellectual ability), and Binet is rightly called the father of intelligence
testing. His initial measure, intended to assess abilities
in children of age 3–13, consisted of a series of questions of increasing difficulty
and can be considered the first intelligence test (Wolf, 1973). Psychologist Lewis
Terman expanded on Binet’s work to create the Stanford-Binet intelligence test,
which is still in use today (Terman, 1916).
The second development arising from Binet’s work occurred shortly after his
death at a relatively young age. A young Swiss researcher named Jean Piaget came
to work in the laboratory that Binet had established with his colleague, Theodore
Simon. Piaget used many of the tasks and experiments that Binet had developed
(such as the conservation task described in chapter 3), but he was more interested
than Binet in finding out why children answered questions the way they did — and
especially in the characteristic mistakes they made. Thus, Binet’s work spawned
not only intelligence testing but also Piaget’s theory of cognitive development.
All testing at this time was done individually by a trained specialist.
A student of Lewis Terman, Arthur Otis, was instrumental in the development
of group testing, including objective measures such as the famous (or infamous)
multiple-choice item (Anastasi & Urbina, 1997). His work led to the development
of the Army Alpha and Beta tests, which were used extensively in World War I for
matching recruits to various tasks and roles in military service and for selecting
officers. From the 1920s through the 1940s, United States college admissions
testing, vocational testing, and testing for aptitudes and personality character-
istics all flourished, bolstered by the belief that progress in the quantification of
mental abilities could occur in as scientific a fashion as progress in mathematics,
physics, chemistry and biology.
Multiple-choice testing was favoured because marking could be done in an objective fashion and
machines could be used to score the tests. More programs developed over the
decades, and from the 1960s onward, the number of children taking standardised,
end-of-year tests grew remarkably rapidly (Cizek, 1998; Johnson, 2010). Although
these programs usually provide tests for Years K–12, the primary focus has tra-
ditionally been on Years 2–8, with less emphasis on Kindergarten, Preparatory
Year, Year 1 and the high school years.
Measurement specialists develop school testing programs by first looking at
what schools teach and when they teach it. In recent years, the focus of this process
has shifted to statewide assessment standards, and more recently to a common
set of such standards that state and Commonwealth Ministers of Education ratified.
Typically, standards are examined and reduced to a common set of objectives,
organised by the school year in which they are taught. Often, compromises
are necessary because different schools teach certain things at different times,
particularly in mathematics, science and history. Tests are then developed to
reflect a common set of objectives.
The draft versions of tests go through a number of pilot tests to ensure that
they have good technical qualities (discussed later in the chapter). The final
version is then tested in a major nationwide norming study, in which a large
group of students (tens of thousands) take the test to see how well a
representative national sample performs on it. Results of the norming study are
used to determine scales, such as Year-equivalent scores and percentiles, all of
which are discussed later in the chapter.
norming study: The administration of an assessment to a representative national sample to obtain a distribution of typical performance on the assessment.
Standards-based assessment
The most recent development in standardised school testing is standards-based
assessment (Briars & Resnick, 2000; Nicholson, 2014). This approach to
assessment is based on the development of a comprehensive set of standards
(which look much like objectives) for each Year level in each subject area. The
standards are then used to develop teaching approaches and materials, as well as
assessments that are clearly linked to the standards and to each other. The
primary distinction between standards-based assessment and the commercial school
testing programs described earlier lies in the notion that standards, teaching
and assessment are all linked together; that is, assessment is not something to
be added on once goals and teaching programs have been established.
Standards-based assessment programs at the national level are a recent
development in Australia. It would be an excellent idea to look at the NAPLAN
assessment information and the statement on standards for the state where you
are planning to teach (www.nap.edu.au/naplan/naplan.html).
standards-based assessment: Assessments that are generated from a list of educational standards, usually at the state or national level; a form of standardised assessment.
NAPLAN
NAPLAN was introduced in 2008 across Australia as the national, standards-based
assessment for students in Years 3, 5, 7 and 9 and is a key part of Australia’s
National Testing Program (NAP). The first cohort of young Australians to have
completed assessment in all three year-levels will have done so in 2014. NAPLAN’s
appearance on Australia’s formal education agenda followed six years after the
No Child Left Behind Act (NCLB) became federal United States law. As such, the
NCLB provides a useful basis of comparison and contrast for Australia’s NAPLAN.
The NCLB mandates that states comply with a series of regulations and meet
achievement standards to receive federal assistance in education. One of the pri-
mary requirements of the Act is that each state must develop an annual assess-
ment program and that schools show regular progress towards meeting the goal
that all students will have reached high standards of achievement by the 2013–14
United States school year. The NCLB also requires that all students are taught by
highly qualified teachers. Schools that do not meet the standards of the NCLB
initially receive support from their states to improve, but if improvement does not
occur, a series of sanctions take place that could include replacing the teaching
staff and administration of the school.
NAPLAN has developed a little differently from the NCLB in two ways. Impor-
tantly, it is a test designed and administered nationally rather than a series
of state-level tests as in the United States. It is part of the work of the Australian
Curriculum, Assessment and Reporting Authority (ACARA), an organisation rep-
resenting the Australian Government and all education streams (independent,
government and Catholic) across states and territories. ACARA is also responsible
for development of the Australian curriculum from Kindergarten to Year 12. This
enables a clear link to be made between national curriculum and national testing.
Thus, Australian NAPLAN results provide a basis for comparison nationally on
how all Australian students are performing in relation to literacy and numeracy
standards, and the comparison is not clouded because of variation across several
tests. The test is a single one — rather than different ones for each state and
territory.
Performances on NAPLAN are shown on national achievement scales from
Year 3 to Year 9. Six bands of achievement are reported for each Year level.
ACARA advises that:
One of these bands will represent the national minimum standard for students
at each Year level. A result at the national minimum standard indicates that the
student demonstrated the basic literacy and numeracy skills needed to participate
in that Year level (ACARA, 2010).
The second issue of difference is that several points of accountability for United
States education under the NCLB are specifically mandated. Accountability is
treated more softly in the Australian situation. Outcomes from the assessment are
seen as having uses for parents, students, schools and systems that include advice
on where resources for improvement might be located, but there is as yet no
threat of sanctions. For detailed information on NAPLAN and where it fits within
Australia’s National Testing Program (NAP), see www.nap.edu.au.
Intelligence testing
Intelligence testing is a hypersensitive issue. It has a long and often not very
attractive history. It began in the early 1900s with the work of Binet, described
earlier, and was followed by that of Lewis Terman, Henry Goddard and Charles
Spearman. Originally designed for the admirable purpose of trying to help stu-
dents with difficulties in learning, intelligence testing (IQ testing) has been used
for a host of less acceptable purposes, including lifelong labelling of students
within a mythology about IQ tests rather than a clear understanding of them
(Dawson, 2003; Pfeiffer, 2013). Today, intelligence testing in education retains
its importance in attempts to learn as much as possible about students as learn-
ers. For example, school psychologist Peg Dawson (2003) asserted that observ-
ing how respondents completed an IQ test gives her insights into their complex
reasoning and problem solving, and into a greater variety of other cognitive and
psychological variables, than any other single data source.
The most widely used IQ tests are the Wechsler Intelligence Scale for Children —
Revised (WISC–R) and the Kaufman Assessment Battery for Children (K-ABC).
Uncommon sense
Stanovich (2013) believes that we need better understanding and measures
of processes behind such terms as ‘dysrationalia’, ‘myside bias’ and ‘miserly
cognition’. The puzzle earlier is based on an item that he has used in progressing this
development. With the advantage of such evolving knowledge and resources, teachers
and learners will be better placed to optimise opportunity structures for richer
education. Even at this point of his work, Stanovich adds uncommon sense to what
you and colleague educators might consider in looking at the reliability and validity
issues of standardised and standards-based assessments.
Categories of assessments
In chapter 13, differences between formative and summative assessment were
discussed. Assessments can be categorised in a variety of other ways. Three of
the most widely used categorisation systems are:
• norm and criterion referenced
• achievement, aptitude and affective measures
• traditional and alternative assessments.
Chapter 13 also introduced criterion-referenced and norm-referenced testing in broader terms. Tests that use norm-based scores
to give meaning to the results are norm-referenced tests. Those that use an arbi-
trarily determined standard of performance to give meaning, albeit standards that
teachers as test designers would like their students to reach (Biggs, 2003), are
criterion-referenced assessments.
Assessment is not limited to what people know and can do; it also includes
how they learn, how they feel about themselves, how motivated they are,
and what they like and do not like. Issues related to an individual’s attitudes,
opinions, dispositions and feelings are usually labelled affective issues. A large
number of affective assessments are used in education, including measures of
self-efficacy, self-esteem, school motivation, test anxiety, study habits and
alienation. Educational psychologists frequently use affective assessments to
help them understand why some students do better at school than others. (See the
‘Uncommon sense’ box.)
affective assessment: An assessment related to feelings, motivation, attitudes and the like.
Uncommon sense
Intelligence testing has long been questioned, as has, to some extent, the distinction between achievement and aptitude tests.
Concerns for the former centre on whether students have had the opportunity to learn
the material on an achievement test and whether the test really measures achievement
or instead measures the opportunity to have learned the material. For the latter, they
concern whether aptitude tests can measure one’s potential independently of non-
specific interactions such as differences in nurturance. In both cases, they also question
racial, ethnic and gender differences in test results and whether tests whose results
reveal such differences should be used for admissions and scholarship purposes. People
now often talk about ability tests as simply a measure of a student’s level of academic
performance at a given point in time. Without knowledge of a student’s educational
history, a score on such a test does not imply a judgement about how the student
attained his or her present level of performance.
Technical issues in assessment
The technical issues involved in assessment can be daunting to educators at all
levels; consequently, some educators shy away from them. However, with some
foundational knowledge, all teachers can understand and discuss these issues.
[Figure: Obtaining the mean. There are five scores: 8, 5, 10, 5 and 7. Add them (8 + 5 + 10 + 5 + 7 = 35) and divide by 5, and you have the mean of the scores: 35/5 = 7.]
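As a quick check on the arithmetic in the figure, the same calculation can be sketched in a few lines of Python (the scores and the function name here are purely illustrative):

```python
def mean(scores):
    """The mean: add the scores and divide by how many there are."""
    return sum(scores) / len(scores)

scores = [8, 5, 10, 5, 7]
print(mean(scores))  # (8 + 5 + 10 + 5 + 7) / 5 = 7.0
```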
The median is used when a set of scores includes some extreme scores that might
make the mean appear not to be representative of the set of scores as a whole.
For example, the mean age of all the
people in a Preparatory class in May is usually around seven. This is because the
children are around five years old, and the teacher could be in their thirties or
forties. Therefore, the mean is not very useful in this case. The median (and the
mode) would be five, a number much more representative of the typical person
in the class. Figure 14.2 shows how to obtain the median.
[Figure 14.2: The same five scores put in order from lowest to highest: 5, 5, 7, 8, 10. The middle score, 7, is the median.]
The mode is simply the score that occurs most frequently. Researchers use the
mode in describing the central tendency of variables in situations in which the
use of decimals seems inappropriate. For example, it is more reasonable to say
that the modal family has two children, rather than saying that families have
2.4 children on average (Gravetter & Wallnau, 2004). In the example of the
Preparatory classroom, the mode would be a good measure of the central tendency.
mode: The most frequently occurring score in a set of scores.
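The median and mode from the examples above can be sketched the same way (a minimal illustration, not a statistics library; Python's built-in statistics module offers equivalent functions):

```python
from collections import Counter

def median(scores):
    """The middle score once the scores are ordered
    (the mean of the middle two if the count is even)."""
    ordered = sorted(scores)
    n = len(ordered)
    if n % 2:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def mode(scores):
    """The most frequently occurring score."""
    return Counter(scores).most_common(1)[0][0]

scores = [8, 5, 10, 5, 7]
print(median(scores))  # ordered: 5, 5, 7, 8, 10 -> middle score is 7
print(mode(scores))    # 5 occurs twice, more often than any other score
```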
[Figure 14.3: Obtaining the variance and standard deviation for the scores 8, 5, 10, 5 and 7 (mean = 7). The deviations from the mean are 8 − 7 = 1, 5 − 7 = −2, 10 − 7 = 3, 5 − 7 = −2 and 7 − 7 = 0. The squared deviations sum to 1 + 4 + 9 + 4 + 0 = 18. The variance is this sum divided by the number of scores: 18/5 = 3.6. The standard deviation (SD) is the square root of the variance: SD = √3.6 = 1.897.]
The variance is widely used in statistical analysis, but it is not as practical as the
standard deviation because the variance is in squared units. That is, it
tells you the average squared distance of the scores from the mean. The standard
deviation, on the other hand, provides a measure of the spread of the scores in
the numbering system of the scores themselves. Calculating the standard devi-
ation is easy once you have the variance: You simply take the square root of the
variance to obtain the standard deviation, as shown in figure 14.3.
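The steps in figure 14.3 translate directly into code. This sketch uses the population form of the variance (dividing by the number of scores), exactly as the figure does:

```python
import math

def variance(scores):
    """Mean of the squared deviations from the mean (dividing by n, as in figure 14.3)."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores) / len(scores)

def standard_deviation(scores):
    """Square root of the variance, back in the units of the original scores."""
    return math.sqrt(variance(scores))

scores = [8, 5, 10, 5, 7]
print(variance(scores))            # 18 / 5 = 3.6
print(standard_deviation(scores))  # sqrt(3.6), about 1.897
```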
standard deviation: A measure of how far scores vary from the mean.
The standard deviation provides an index of how far from the mean the scores
tend to be. If, in a large set of scores, the distribution of scores looks roughly
normal (or bell-shaped), about 95 per cent of the scores will fall within 2
standard deviations on either side of the mean. That is, if one goes up 2 standard
deviations from the mean, then down 2 standard deviations from the mean, about
95 per cent of the scores will fall between those two values. If a particular set of
scores has a mean of 38, a standard deviation of 6, and more or less a normal
distribution, about 95 per cent of the scores would fall between 26 (2 standard
deviations below the mean) and 50 (2 standard deviations above the mean). If the
standard deviation were 15, the bulk of the scores would range from 8 to 68. If
the standard deviation were 2, the scores would range from 34 to 42. (More infor-
mation about the normal curve appears later in the chapter.)
In essence, the standard deviation provides an idea of the numbering system
of a set of scores. For example, imagine a group of test scores that has a standard
deviation of 100. If a student increases his score by 10 on such a test, this is not
a very big increase. However, if the test score has a standard deviation of 5, a
10-point increase by a student represents a huge jump.
z-Scores
Earlier in the chapter norm-referenced testing was discussed as a way to give
meaning to a score by comparing it to those of others who took the same assess-
ment. Consider a set of test scores that has a mean of 500 and a standard devi-
ation of 100. A score of 550 would be one-half of a standard deviation above the
mean. It can be said that a score of 550 is +0.5 standard deviations above the
mean. A score of 320 would be 1.8 standard deviations below the mean, or −1.8.
z-score: A standard score that any set of scores can be converted to; it has a mean of 0.0 and a standard deviation of 1.0.
This concept is called a z-score; it is simply how many standard deviations
away from its mean a given score is. If the score is above the mean, the z-score
is positive. If the score is below the mean, the z-score is negative. The
calculation for the z-score is simply the score minus its mean divided by its
standard deviation. The formula looks like this:

z = (score − mean) / standard deviation
For example, a student gets a score of 85 on a test. The mean for all the
students in the class is 76, and the standard deviation is 6. The z-score is:

z = (85 − 76) / 6 = +1.5
This means that a raw score of 85 is 1.5 standard deviations above the mean for
this group of students. When combined with a working knowledge of the normal
curve, presented below, z-scores provide a lot of information. For example,
imagine Maria took a physical fitness test that had national norms that were
roughly normal (bell-shaped). Maria gets a raw score of 143. Is that score good,
bad or in the middle? If Maria’s z-score on the fitness test was +1.0, that would
mean she outperformed roughly 84 per cent of the students in the norming group.
A z-score of −1.0 would mean she outperformed only 16 per cent of the students.
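The z-score formula, and the link between a z-score and a percentile, can be sketched as follows. The helper percentile_from_z is illustrative: it evaluates the cumulative normal distribution with the standard error function, so it assumes the norming distribution really is normal:

```python
import math

def z_score(score, mean, sd):
    """How many standard deviations a score sits above (+) or below (-) its mean."""
    return (score - mean) / sd

def percentile_from_z(z):
    """Per cent of a normal distribution falling below z (cumulative normal via erf)."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(z_score(85, 76, 6))              # (85 - 76) / 6 = 1.5
print(round(percentile_from_z(1.0)))   # about 84, as in Maria's example
print(round(percentile_from_z(-1.0)))  # about 16
```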
You can see that the raw score does not present much useful information, but the
z-score tells us a lot.
Normal distribution
The bell-shaped curve, often mentioned in conversations about testing and
statistics, is more formally known as the normal distribution. It is actually a
mathematical model or abstraction that provides a good representation of what
data look like in the real world, particularly in biology and the social
sciences. Many sets of test scores show a more or less normal distribution. The
normal curve is depicted in figure 14.4. Roughly speaking, normal distributions
result when a number of independent factors contribute to the value of some
variable.
normal distribution: A mathematical conceptualisation of how scores are distributed when they are influenced by a variety of relatively independent factors.
In a perfectly normal distribution, 68 per cent of the scores fall between
1 standard deviation above and 1 standard deviation below the mean, and about
95 per cent of the scores fall between 2 standard deviations below the mean and
2 standard deviations above the mean. This can be seen in figure 14.4. This figure provides
a good reference for thinking about where scores are located in most roughly
normal distributions. Many of the scales used in reporting standardised test scores
are based on the concept of the z-score.
[Figure 14.4: The normal curve, marked with the per cent of cases in each region and, beneath it, the corresponding percentile ranks (1–99), NAPLAN scale (0–1000), ACT scale (5–40) and IQ scale.]
Are the means and standard deviations similar for all schools in your district? If
one school is doing better than others on a given score, or subscale, that
could be a clue to finding a particularly effective teaching strategy.
Scales
As mentioned earlier, most standardised assessment programs report
results using one or more scales employed to transform the scores into
numbering systems, or metrics, that are easier to understand. This sec-
tion describes the most commonly used scales.
Raw scores
Raw scores are usually obtained through the simple addition of all the
points awarded for all the items (questions, prompts etc.) on a test. For
example, if an assessment has ten multiple-choice items worth 1 point
each and five essays worth 5 points each, the maximum raw score would
be 35, and each student’s score would be the sum of the points attained
on each item. There is, however, one important exception to this
definition of a raw score. Occasionally, students are penalised for making
wrong guesses, such as losing marks or part marks on multiple-choice
tests. The logic behind this penalty is that students should not receive
credit for blind guessing.
raw scores: Scores that are simple sums of the points obtained on an assessment.
[Photo: When looking at a set of test scores, it is important not to lose sight of the individual students who produced them.]
Imagine that a person guessed randomly on 100 five-choice multiple-choice
items. One would expect that person on average to get 20 items correct (one-fifth
probability on each item times 100 items). That would yield a score of 20. However,
under a penalty system, that person would also be penalised for guessing wrong
on the other 80 items. If the penalty is a deduction of one-fourth of a point for
each wrong guess, this would result in a deduction of 20 points (one-fourth of a
point per wrong guess times 80 wrong guesses). The 20 points deducted would
negate the 20 points gained for the correct guesses, leaving the total at zero.
Thus, under those conditions, students would on average neither gain nor lose by
guessing randomly. However, on any particular day with any particular student,
luck may be running well or poorly. Random guessing could work for or against
a particular student’s results.
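The expected-value reasoning above is easy to verify in code (the function name and the numbers are illustrative):

```python
def expected_score(n_items, n_choices, penalty):
    """Expected raw score from pure random guessing:
    expected correct answers minus the penalty applied to expected wrong answers."""
    expected_right = n_items / n_choices
    expected_wrong = n_items - expected_right
    return expected_right - penalty * expected_wrong

# 100 five-choice items, quarter-point penalty per wrong answer:
print(expected_score(100, 5, 0.25))  # 20 - 0.25 * 80 = 0.0
```

With no penalty (penalty = 0), the same call shows why blind guessing inflates scores: expected_score(100, 5, 0) returns 20.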
Scaled scores
Raw scores are useful in that they let students know how well they did against
a total maximum score. They are not particularly useful in helping students (or
teachers or parents) know how good a given score is. Moreover, assessment pro-
grams that test students on a yearly basis (such as end-of-year standardised school
assessment programs) usually construct a new assessment for each new cycle of
the assessment. Testing companies try to make the new test produce scores that
are as similar, or parallel, to the old one as possible.
In testing, parallel means that the two tests have highly similar means and
standard deviations, and there is a very high correlation between the two tests
(usually .80 or above). Even though two tests may be highly parallel, one test may
be slightly easier than the other (or perhaps easier in the lower ability ranges and
harder in the higher ability ranges). When this happens, a 64 on one test may be
equivalent to a 58 on the other. In an effort to make scores more meaningful and
to simplify equating scores from one form of a test to another, testing companies
report scores in terms of scaled scores. Scaled scores are mathematical
transformations of raw scores into new scales, or numbering systems.
scaled scores: Scores from an assessment that have been transformed into an arbitrary numbering system to facilitate interpretation.
The first scaled score was the IQ score developed by Lewis Terman (1916).
Terman took the French psychologist Alfred Binet’s notion of mental age (the age
associated with how well a student could perform intelligence-test tasks) and
divided it by the student’s chronological age to obtain an intelligence quotient, or
IQ score. (IQ scores are no longer determined in this way.)

NAPLAN (0–1000)
provides scaled scores. Some states help schools to interpret how well their scores
position them in relation to the state average (mean) and distribution (standard
deviation) data. The NSW Department of Education provides software for schools
to make this comparison. The outcome is both numerical comparative data and
an evaluative comment on performance, such as At state level or Well above state
level (NSW Department of Education and Training, 2010).
Figure 14.5, below, shows the scale used for NAPLAN. The 1000-point range of
NAPLAN scaled scores has been gathered into ten bands for the purposes of
making reasonable comparisons for a child at different ages. Whereas those
performing at Band 2 in their Year 3 testing (scaled scores between 100 and 199)
are at the national minimum standard, higher band level performances are
needed for the same evaluation for Year 5 (300 and 399), 7 (400 and 499) or 9
(500 and 599) students (ACARA, 2010, p. 3).
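Taking the simplified description above at face value (even 100-point bands, with Band 2 running from 100 to 199), a band lookup would be a one-liner. This is an illustration of the text's simplified scheme only; ACARA's actual band cut points are not even multiples of 100:

```python
def naplan_band(scaled_score):
    """Band for a 0-1000 scaled score under the simplified even-100-point scheme
    described in the text (Band 2 = 100-199, and so on); not ACARA's real cut points."""
    if not 0 <= scaled_score <= 1000:
        raise ValueError("NAPLAN scaled scores run from 0 to 1000")
    return min(scaled_score // 100 + 1, 10)

print(naplan_band(150))  # Band 2: national minimum standard for Year 3
print(naplan_band(350))  # Band 4: national minimum standard for Year 5
```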
Percentiles
Percentiles are the most straightforward scores other than raw scores. A
percentile is the percentage of people who score less well than the score under
consideration. For example, if 76 per cent of the people who are tested score
below a raw score of 59, the percentile score for a raw score of 59 is the 76th
percentile. Percentiles are easy to interpret and are often used to report scores
on school testing programs. Do not confuse percentile with per cent correct
(which is a percentage representing how well a student scored on an assessment).
A second caution is that percentiles, along with the other scales described here,
can drop or increase rapidly if the scale is based on only a few questions.
Therefore, when looking at percentiles, always check to see how many questions
were included in the scale being reported.
percentile: A number that indicates what percentage of the national norming sample performed less well than the score in question.
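The definition of a percentile translates into a direct count (a sketch only; real testing programs derive percentiles from the norming study rather than from a raw list of scores):

```python
def percentile_rank(score, norming_scores):
    """Percentage of the norming group scoring below the given score."""
    below = sum(1 for s in norming_scores if s < score)
    return 100 * below / len(norming_scores)

# 76 of 100 test-takers score below 59, so 59 sits at the 76th percentile:
group = [58] * 76 + [59] * 10 + [60] * 14
print(percentile_rank(59, group))  # 76.0
```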
[Figure 14.5: The NAPLAN assessment scale, showing achievement bands (up to Band 10) for each Year level.]
Stanines
The United States Army developed stanines for use in classifying recruits in the
armed services during World War II. Stanines (short for ‘standard nine’) are scores
from 1 to 9, with 1 being the lowest score and 9 the highest. They are calculated
by transforming raw scores into a new scale with a mean of 5 and a standard
deviation of 2. They are then rounded off to whole numbers. Therefore, a stanine
of 1 would represent a score that is roughly 2 standard deviations below the
mean, and a stanine of 6 would be 0.5 standard deviations above the mean. Look
at figure 14.4 to see how stanines work. The utility of the stanine is that it allows
communication of a score with a single digit (number). In the days before the
widespread use of computers, this was particularly useful, and stanines are still
used in assessment programs today for quickly and easily providing an index of
how well a student did on a test.
stanines Short for ‘standard nine’, a scaled score that runs from 1 to 9 and has a mean of 5 and a standard deviation of 2.
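The transformation just described can be sketched in a few lines. This is an illustrative linear version only (operational programs assign stanines from fixed percentages of the normal curve, so boundary cases can differ), and the test mean and standard deviation here are hypothetical.

```python
def stanine(raw_score, mean, sd):
    """Rescale a raw score to mean 5, SD 2, then round and clip to 1-9."""
    z = (raw_score - mean) / sd          # standard (z) score
    return min(9, max(1, round(5 + 2 * z)))

# Hypothetical test with mean 50 and standard deviation 10
print(stanine(50, 50, 10))   # an average score -> stanine 5
print(stanine(55, 50, 10))   # 0.5 SD above the mean -> stanine 6
print(stanine(30, 50, 10))   # 2 SD below the mean -> stanine 1
```

The clipping step is what makes the scale communicable in a single digit: every score, however extreme, lands between 1 and 9.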
other scales that will be available. There are several advantages to using such
scales. First, they have desirable statistical qualities, such as the idea that a gain
of a certain number of points means the same thing all across the scale. Second,
they allow a teacher or parent to look at growth over time instead of focusing on
whether a child is below or above the national average.
Year 5 (like school years, Year-equivalent years contain only 10 months). But if
this is the average performance, is it not the case that about half the students fall
below this level? Yes — by the definition of Year-equivalent, about half the stu-
dents will always be below Year level. Do not interpret Year-equivalent scores as
where students ought to be; instead, interpret them as average scores for students
at that point in their school progress. The ‘Uncommon sense’ box brings this point
home. A teacher can also use them to show how much gain has been made in a
particular year.
Raw score: The total number of points earned on the assessment. This could be simply the total number of items correct. Use: simply looking at the number of items a student got right or, if a rubric is used, the number of points earned.
Scaled score: An arbitrary numbering system that is deliberately designed not to look like other scores. NAPLAN and IQ scores are examples of scaled scores. Use: testing organisations sometimes use scaled scores to equate one form of a test to another.
Percentile: Ranging from 1 to 99, percentiles indicate the percentage of test takers who got a score below the score under consideration. Thus, a raw score of 42 (out of a 50-point maximum) could have a percentile of 94 if the test were a difficult one. Use: percentiles are very good for seeing how well a student did compared to other similar students; they provide a norm-referenced look at how well a student is doing.
Normal curve equivalent score: Ranging from 1 to 99, normal curve equivalent scores (NCEs) are based on the normal curve and have a mean of 50 and a standard deviation of 21.06. Use: NCE scores were developed to look like percentiles but also have the statistical quality of being a linear scale, which allows for mathematical operations to be carried out on them.
Stanine: Ranging from 1 to 9, stanines (an abbreviation of ‘standard nine’) are based on the normal curve. They have a mean of 5 and a standard deviation of 2 and are also presented rounded off to whole numbers. The US military developed stanines to provide a score that could be presented in a single digit and compared across tests. Use: stanines are good for quickly and easily providing an index of how well a student did on a test compared to other students.
Year-equivalent score: Ranging basically from 1.0 to 12.9, Year-equivalent scores indicate how well a typical student at that year and month of school would have done on the test in question. Imagine that a student had a Year-equivalent score of 5.4 on a given reading test. This would mean that this is how well a typical Year 5 student in the fourth month of the school year would have scored. Use: Year-equivalent scores give a quick picture of how well a student is doing compared to what students in the norming group did at a given year level. Be cautious about Year-equivalent scores when they are far from the level of the test that has been given.
Cut, passing or mastery score: A score, presented usually either as a scaled score or a raw score, that indicates that students have exceeded a minimum level of performance on a test. This could be a high school graduation test or a formative test used as part of classroom teaching. Use: usually used with criterion-referenced assessments, these scores are increasing in importance. They indicate that the student has met the minimal requirements, whether that be for a unit, for high school or for a driver’s licence.
whole. These schools are invited to participate in a norming study. If one declines
to participate, another similar school is invited. They continue this process until
they have a representative sample. The schools administer the test under stand-
ard testing conditions, even though the schools, or the districts or regions within
which they are located, may or may not actually use the results. The scores of
students produced by this norming study give the testing companies the
information they need to develop the norms for the test. Norms are the
information base that the companies use to determine such scales as percentiles
and Year-equivalents. Norms can also be developed from students who take the
test under real conditions. This is how NAPLAN works, as do most statewide
testing programs such as Queensland’s Year 12 Core Skills Test. The percentiles
for these tests are actual percentiles from people who took the test at the same
time that the student did, not from a norming study.
norms A set of tables based on a representative administration of an assessment that makes it possible to show how well particular students did compared to a national sample of students.
This approach, or a variation on it, is the most common method used to set
passing scores for standardised assessment. As can be seen, the score that is estab-
lished will depend to a large extent on who is chosen to be on the judging panel
and the choice of level of performance considered minimally acceptable. It is
important to understand that this process is fundamentally subjective: Although
technical issues are involved, the individuals’ judgements are the basis for what
the passing score should be.
• All the students who are rated pass are put into one group and all those rated
not pass are put into a second group. The distribution of scores in each group
is examined.
• The point where the two curves meet (see figure 14.7) is the passing score for
the assessment. It is the point that best differentiates minimally competent
students from students who are not minimally competent, in the judgement of
their teachers.
• Judges are told to think about a minimal passing student as one who had a
67 per cent chance of performing successfully on a given item.
• Judges are asked to go through the ordered test and to place a ‘bookmark’ at
the point in the test where a minimal passing student would no longer have a
67 per cent chance of successful performance.
• Once judges have made an initial determination of the passing level, their
decisions are discussed in groups and refined through various procedures.
• This procedure has been described as if there were only one passing score to
set, but the procedure could easily be extended to include several passing, or
cut, scores.
• This procedure can be used with several question types, such as multiple
choice and matching.
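The contrasting-groups logic in the first bullet points above can also be sketched numerically. Rather than plotting the two distributions, this illustration (with invented scores, and a simple minimise-misclassification rule standing in for finding the point where the two curves meet) searches for the cut score that best separates the teacher-rated groups.

```python
def contrasting_groups_cut(pass_scores, fail_scores):
    """Return the cut score that misclassifies the fewest students,
    given teacher ratings of 'pass' and 'not pass'."""
    best_cut, fewest_errors = None, float("inf")
    for cut in range(min(fail_scores), max(pass_scores) + 2):
        # An error is a 'pass' student below the cut, or a
        # 'not pass' student at or above it.
        errors = (sum(s < cut for s in pass_scores) +
                  sum(s >= cut for s in fail_scores))
        if errors < fewest_errors:
            best_cut, fewest_errors = cut, errors
    return best_cut

# Invented test scores for the two teacher-rated groups
passes = [55, 58, 60, 62, 63, 65, 68, 70]
fails  = [40, 44, 47, 50, 52, 54, 57, 59]
print(contrasting_groups_cut(passes, fails))   # the overlap puts the cut in the mid-50s
```

As the text notes for the real procedure, the result is driven entirely by the judgements fed in: change who rates the students, and the cut score moves.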
Taking it to the classroom
[Marking-table residue: only the bottom row is recoverable. Letter mark F, number mark 0.0, cut-off 59 and below under System A, 69 and below under System B.]
Even when marks are disappointing, they can contain useful information for improvement and growth.
There is nothing wrong with using these systems, but they assume that a
given percentage correct on an assessment always means the same thing. Some
assessments are simply easier than others, even when they are measuring the same
objective or achievement target. Consider the following maths item from a Year 5
assessment.
What three consecutive even integers add up to 48?
This is a moderately difficult item. But look what happens when the item becomes
a multiple-choice item.
What three consecutive even integers add up to 48?
a. 8, 10, 12
b. 16, 16, 16
c. 15, 16, 17
d. 14, 16, 18
Now students do not have to generate an answer; they simply have to find a set
that adds up to 48 and meets the criterion of consisting of even, consecutive integers.
Here is yet another possibility.
What three consecutive even integers add up to 48?
a. 4, 6, 8
b. 8, 10, 12
c. 14, 16, 18
d. 22, 24, 26
Now all that students have to do is correctly add sets of three numbers and see
which set totals 48. The difficulty of the items has been changed substantially, even
though ostensibly they all measure the same objective. The point here is that a score
of 90 (or 93) may not always represent an A level of performance.
Teachers can recalibrate their marking for an assessment by going through a
procedure similar to the Angoff–Nedelsky procedure. After developing an assessment,
go through it and determine how many points a minimal A student would receive on
each item. Add up these points; this becomes the A/B break point. Then do the same
thing for the minimally passing student to arrive at the D/F break point. Once these
two break points have been determined, the B/C break point and the C/D break point
can be determined just by making each grade range roughly equal. For example, this
process might yield the following system.
A/B break point 86 (a difficult test)
D/F break point 64
With this information, the B/C break point could be 78 and the C/D break point
71. The point to keep in mind is that the grades associated with an assessment should
be the result of a thoughtful process rather than a predetermined system.
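The recalibration procedure just described is easy to express as arithmetic. The per-item judgements below are invented, though chosen so the two anchor sums match the example in the text (86 and 64); rounding the evenly spaced intermediate points can land a point or so away from any particular hand-worked example.

```python
# Hypothetical per-item judgements for a 10-item, 100-point assessment:
# the points a 'minimal A' student and a 'minimally passing' student
# would earn on each item, in the judge's estimation.
minimal_a    = [10, 9, 8, 10, 9, 8, 9, 8, 7, 8]
minimal_pass = [8, 7, 6, 7, 6, 6, 7, 6, 5, 6]

a_b_break = sum(minimal_a)     # A/B break point
d_f_break = sum(minimal_pass)  # D/F break point

# Spread the remaining break points evenly between the two anchors
step = (a_b_break - d_f_break) / 3
b_c_break = round(a_b_break - step)
c_d_break = round(d_f_break + step)

print(a_b_break, b_c_break, c_d_break, d_f_break)   # 86 79 71 64
```

The two sums are the judgement-laden part; everything after them is mechanical, which is exactly the division of labour in the Angoff-style procedures described earlier.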
antique, it has to be old, but not all old things are antique (rocks, for example).
For standardised assessments, measurement specialists conduct studies to
validate empirically that the assessment measures what it is intended to measure.
These validation studies often include the following.
• Having experts critically review the items on the assessment to ensure that
they measure what is intended (this is called content validity evidence).
• Statistically relating the scores from the measure with other, known indicators
of the same traits or abilities (this is called criterion-related validity evidence).
• Conducting research studies in which the assessments are hypothesised to
demonstrate certain results based on theories of what the assessments measure
(called construct validity evidence).
criterion-related validity The degree to which an assessment correlates with an independent indicator of the same underlying ability.
construct validity An indicator of the degree to which an ability really exists in the way it is theorised to exist and whether it is appropriately measured in that way by the assessment.
For example, imagine that a state put a spelling test on its standards-based
language arts/literacy test for all Year 5 children. A content validity study might
consist of showing the words on the spelling test to Year 5 teachers to see if they
agreed that these were words that those in Year 5 should be able to spell (not too
easy, not too hard). A criterion-related validity study might compare the scores
obtained on this test to the number of spelling errors students made in their
regular written work. A construct validity study might look at how the questions
are asked on the test (students pick the correctly spelled word out of a list of
four options, versus spelling the words on their answer sheets after the teacher
pronounces them).
More recently, educators have become concerned about the consequences of
using a particular assessment. For example, if a NSW or ACT university highly
values the Australian Tertiary Admission Rank (ATAR) score as a determining
factor in admitting students, what kind of a message does that send to high school
students who want to go to that university with regard to how hard they should
work on their school subjects? Concerns of this type assess the consequential val-
idity of the assessment. In general, the issue of validity has to do with whether
an assessment really measures what it is intended to measure and whether the
conclusions or inferences that are made about students based on the assessment
are justified. In the spelling test example above, the consequential validity ques-
tion would centre on how teaching in language arts/literacy would differ if the
test were included or not.
Reliability
Reliability is the consistency or dependability of the scores obtained from an
assessment. Reliability asks the question, Would I get roughly the same score for
this student if I gave the assessment again? Imagine a group of high school
students were given a grammar assessment in Spanish on a Friday, and the same
measure again on Monday (without learning how well they did in between).
Correlating the scores between these two assessments would provide a measure
of the reliability of the assessment. Reliability is closely related to validity, but it
is more limited in scope. In chapter 13, reliability in classroom assessment was
defined as having enough information to make a good judgement about students’
achievement. The definition presented here is a more formal definition of
reliability, applicable to a wider range of assessment issues.
reliability Consistency over an aspect of assessment, such as over time or over multiple raters. In classroom assessment, reliability means having enough information about students on which to base judgements.
As with validity, it is not the assessment itself but the particular application
of the assessment that is reliable or not. Moreover, assessments are not either
reliable or not reliable; they are either more reliable or less reliable. The essence
of reliability is how certain one can be that the assessment would produce the
same results on a second administration.
However, reliability is not related to the question of whether the assessment is
really measuring what it is intended to measure. That is, just because a measure
is reliable (i.e. produces consistent results) it does not necessarily mean that it is
valid (i.e. measures what is wanted or needed). The SAT (Scholastic Aptitude Test)
used widely by US universities for tertiary admission has a maths score that is just
If the study involves using two different forms of the same test, the reliability
would be called an alternate form reliability.
There are a number of other ways to calculate reliability coefficients. One very
common approach is split-half reliability. In this approach, the assessment is given split-half reliability A form of
once to a group of students. Each student receives two scores, one based on per- reliability coefficient, similar
formance on the even-numbered items and another based on performance on the to coefficient alpha, that
takes half of the items on an
odd-numbered items. These two scores are then correlated and adjusted using a
assessment, sums them into a
formula that takes into account the fact that only half of the test has been used score and then correlates that
in obtaining each score. A variation on split-half reliability takes an average of all score with a score based on
possible ways of splitting an assessment into two halves; this is coefficient alpha, or the other half of the items.
Cronbach’s alpha. For multiple-choice assessments, a version of coefficient alpha
coefficient alpha An approach
called KR-20 is often used. Finally, if a rater or judge is used to score the items to assessing reliability that
on an assessment (such as on an essay assessment or a performance assessment), uses a single administration of
an index of inter-rater reliability is needed. Inter-rater reliability is assessed by the assessment and focuses
having two raters score a set of assessments for a group of students and then cor- on whether all the assessment
relating the scores produced by the two raters. items appear to be measuring
the same underlying ability.
Teachers often ask how high a reliability coefficient should be. Generally
Also called Cronbach’s alpha.
speaking, an assessment that is used to make a decision about a child should
have a reliability coefficient of .90 or above. If the assessment is going to be
combined with other information, a slightly lower reliability (in the .80s) may be
acceptable.
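The split-half procedure described above can be sketched directly: correlate odd-item and even-item half-test totals, then step the correlation up with the Spearman–Brown formula to estimate full-length reliability. The student data here are invented, and the odd/even split is just one of the many possible splits that coefficient alpha averages over.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """item_scores: one row of 0/1 item scores per student.
    Correlate odd- and even-item half-test totals, then apply the
    Spearman-Brown correction for the halved test length."""
    odd  = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    r_half = pearson_r(odd, even)
    return 2 * r_half / (1 + r_half)   # Spearman-Brown step-up

# Invented 0/1 item scores for six students on an 8-item quiz
scores = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 2))
```

The Spearman–Brown step is the "adjusted using a formula" mentioned in the text: a correlation between two half-length tests understates the reliability of the full-length test.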
A concept closely related to reliability is very useful in understanding
the scores students receive on assessments. This is the standard error of
measurement (SEM). The SEM provides a way for determining how much
variability there might be in a student’s score. The best way to think about the
SEM is to imagine that a student took a standardised test 1000 times. Each test
included different items but measured the same thing. The student would get
a somewhat different score on each administration of the test, depending on
the specific items on that test, how the student was feeling, whether it was
a lucky or unlucky day and so forth. A plot of these scores would look like a
normal distribution. The mean of that distribution is what measurement
specialists call the true score. The standard deviation of that distribution would
be an index of how much the student’s score would vary. This standard deviation
(of one student taking the test many times) would be the standard error of
measurement.
standard error of measurement An index of how high or low an individual’s assessment score might change on a subsequent testing.
true score The hypothetical true ability of an individual on an assessment.
Let us consider an example. Imagine that Martin received a 560 on his NAPLAN
numeracy test. The SEM for the NAPLAN was estimated as roughly ±12 per cent
or 120 points (Wu, 2010). If he took the test again (without doing any additional
preparation for it), and assuming normal distribution, he would have a two-thirds
chance of scoring somewhere between 440 and 680, and a 95 per cent chance of
scoring between 320 and 800. That may seem like a large spread in the scores.
Indeed, it may be a conservative view of the real spread, if NAPLAN’s reliability is
low. While ACARA has provided a detailed account of steps taken to ensure relia-
bility of its program (ACARA, 2010), it has not yet published hard reliability data,
which has been an issue of concern for professional groups such as the Australian
College of Educators (2010).
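Martin's example can be reproduced with a couple of lines. The band arithmetic below follows the text; the `sem_from_reliability` helper uses the standard psychometric relation SEM = SD × √(1 − reliability), which the text does not state but which shows why low reliability inflates the spread.

```python
def sem_from_reliability(sd, reliability):
    """Standard psychometric relation: SEM = SD * sqrt(1 - reliability)."""
    return sd * (1 - reliability) ** 0.5

def sem_bands(observed, sem):
    """68% and 95% score bands around an observed score, assuming
    normally distributed measurement error."""
    return ((observed - sem, observed + sem),
            (observed - 2 * sem, observed + 2 * sem))

# Martin's example from the text: NAPLAN numeracy score 560, SEM about 120
band_68, band_95 = sem_bands(560, 120)
print(band_68, band_95)   # (440, 680) (320, 800)
```

Running the helper the other way makes the text's concern concrete: holding the score scale fixed, a drop in reliability widens the SEM, and with it both bands.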
Interpreting standardised assessments
Interpreting standardised assessments can be very difficult for teachers. The
difficulty might be attributable partly to the nature of the reports they receive,
which can be hard to read, and partly to the love/hate relationship between
educators and standardised assessment. That is, at the same time that many
educators wish standardised tests would go away, they often put too much faith
in them when making decisions about students.
Teachers encounter several problems when looking at standardised test reports.
The first is that a teacher sees only the test report, not the actual efforts of stu-
dents taking the test. On a classroom assessment, the teacher sees the work of
each child on each problem or prompt. The teacher is able to gather information
about the student on each problem, draw inferences and make judgements about
the student’s overall performance on the assessment. With a standardised test,
the teacher does not see the work itself, but rather a scaled score of some sort
on a label, such as interpreting text or process skills. What does a score of 189 on
process skills mean?
A related problem is that teachers do not usually get the results of an assess-
ment until well after it was given. It may be months later; in fact, in some cases
it may be after the end of the school year. It is difficult to have an influence on a
student who is no longer in your class. Finally, the content of standardised assess-
ments does not always line up directly with what was taught to all of the students.
This can be a particular problem in mathematics, where different students can be
working on very different material.
The purpose of this section is to see how looking at standardised test results
can be a useful part of generating an overall picture of how well students are per-
forming. For most teachers, standardised assessment results are available near the
end of the school year. This section assumes that teachers looking at these results
are considering students they have had in their classes during the year. They are
assessing the progress those students have made, reflecting on the school year in
terms of each student’s growth, and perhaps putting together a summary com-
munication to parents or to the student’s teacher for the following school year.
(how well she did on her last classroom project, the kinds of questions she typi-
cally asks, a discussion with her parents at the parent–teacher conference) with
the assessment results. They are all snapshots of Margaret. None of them by itself
gives a true image of Margaret, but in combination, they will get you pretty close
to the real person. And that is the goal: finding the student in the data.
three referents. The first is a vertical line representing the average result; the
second is a colour, with the darker shade representing the band of results for the
middle 50 per cent of Year 9 students in Queensland; and the third is a dotted
vertical line for three of the subscales, Reading and Viewing, Writing, and Overall
Numeracy. The three referents are normative. Each is a statement about base-
lines against which Mitchell’s results might be compared.
Figure 14.8 A Year 9 student’s NAPLAN report
The relative placement of each of Mitchell’s results shows that in all areas of
literacy he is close to average and performing within the band representing the
middle 50 per cent of Queensland’s Year 9 students. In spelling, he is slightly
higher than average. In the two literacy subscales where comparison with a
national benchmark is provided, Mitchell is performing above the benchmark. In
numeracy, Mitchell again is within the performance band of the middle 50 per cent
on three of the four subscales, falling a little below it on Number. On three of the
four, he is below the average score of his Queensland peers, most noticeably in
Number. Yet his overall result for numeracy is above the national benchmark.
Probably the best approach to interpreting these scores is to think of the state
and national norming data as a set of baselines for looking at a student’s perfor-
mances. Look at the state average score in each subscale. That is how well the
typical Queensland student did on that part of the assessment. Now look at how
well Mitchell did. Are the scores above or below the state average? Now look simi-
larly for those where a national benchmark is provided. How did Mitchell com-
pare? Such normative comparisons allow Mitchell, his parents and teachers to see
his performance in relation to those of other students at his school-year level.
performance presents a different picture than his or her standardised test scores.
• Are there any aspects of these scores that do not seem to make sense with
regard to this student?
The next level to look at concerns the subscales that are provided within each
primary score. If all these look similar, they probably reflect the student’s general
ability in the overall area to which they are related (e.g. literacy, numeracy).
If, however, there are strong discrepancies, this may be a good area for further
investigation.
day-to-day performances as a learner at school and at home. Mark regularly
scores in the bottom 30 per cent on a standardised assessment, but his classroom
performance is always at least average and occasionally better than most. He is
seemingly attentive and engaged with class tasks and his parents report that in
addition to his positive approach and outcomes with homework, he learns things
like sports skills and cooking easily and well. The disparity appears to be with his
test performances rather than with evaluations of his learning in class and at home.
For many test-takers, including young children, there is great pressure to do well.
This is not always clearly obvious, and may come from a variety of sources including
their concerns for what an outcome might mean for themselves, their family and
others — such as their teacher and school.
What can be done to reconcile Mark’s test performances?
• Have a discussion with Mark about why he is performing poorly on standardised
tests when he seems to do so well in class and at home.
• Reflect on your own observations of his demeanour around any class discussions
and practice leading up to the test; include these in your discussion with Mark to
see if they tally with his impressions.
• Have a parent–teacher meeting to go over Mark’s test performance and their
thoughts about Mark’s responses to you and his communication with them about
the discrepant performances.
• If there is any indication of high levels of test anxiety in Mark’s, his parents’
or your observations, request his parents’ permission to refer the matter to the
school’s guidance officer, and seek advice from the guidance officer for how you
might collaborate with any accommodation.
• If there is no indication of high levels of test anxiety, check his test-taking
planning and completion strategies. For example, select and use some similar test
items to do a ‘Think Aloud’ with Mark as he attends to an item.
• Check with other teachers about how Mark has done with them.
might be interested in talking with him about what they remembered after
reading this morning’s newspaper or from a mystery book some years ago — and
how they think their memories formed and remained. This would give them an
enjoyable vehicle to use in working with him.
It would be wonderful if each student could get an education tailored to his
or her unique needs, but that usually does not happen with 26 or more equally
deserving youngsters in the same classroom. Once a teacher has a good picture of
a student and his or her needs, that student has to be brought into the educational
and social structure of the classroom. Bringing the student to the classroom thus
requires imagination and creativity on the part of the teacher.
Thinking about the classroom, school and
district levels
If we only ever had one student to think about, teaching would be a lot easier.
But teachers work with classrooms full of young people and the group needs
to be considered as well as each individual. Teachers can use many of the
ideas presented in terms of interpreting scores for individuals for the class as
a whole. Overall, what are the strengths of this class, what are the weaknesses?
Are there issues that can be a focus for the class as a whole over the course of
the year?
Perhaps a group of students appears to be stronger in reading than in
mathematics. This might provide their teacher with an opportunity to allocate a
little more teaching time to mathematics and perhaps to assess progress in
mathematics within the classroom on a more regular basis.
Another possibility is that there are several students who are having difficulty
with a particular aspect of the curriculum. These students might be grouped
together for purposes of some special attention. Of course, it is important to reas-
sess progress on a regular basis to make sure that teaching is attuned to students’
current needs, and not just to where they were at the beginning of the year.
Although teachers focus primarily at the classroom level, they need to recognise that principals are concerned with the school as a whole, just as district-level personnel need to think about the needs of all the children in the district. Coordination of teaching from one year to the next is a hallmark of a good teaching program and a good action plan. Careful examination of assessment scores across years of schooling, and across Year levels, is an important component of such a program and plan.
Students who do not speak English well, usually those for whom English is a
second language (ESL), pose a major problem for the interpretation of assess-
ments that have been delivered in English. The problem is easy to understand
but difficult to solve. Language arts literacy scores may not be measuring the lan-
guage arts abilities of these students at all but merely measuring their abilities in
English. On the other hand, some ESL students speak English fairly well but may
have deficiencies in language arts. How can the former situation be separated
from the latter?
To begin with, there is research evidence showing that testing accommodations
made for ESL students do not reduce the validity of the scores on the assessment
(Abedi, 2008; Kieffer, Lesaux, Rivera, & Francis, 2009; Kopriva, 2008; Munroe,
Abbott, & Rossiter, 2013). Teachers should make sure that ESL students in their
classes receive accommodations if they are entitled to them. Teachers should
communicate across Year levels: A teacher who has an ESL student in Year 3 can
communicate with the Year 2 teacher about the language abilities a particular
student had displayed previously.
If you are wearing glasses or contact lenses to read this material, you are using
assistive technology! The range of assistive technologies includes:
• computer-assisted teaching devices
• devices that augment communication, for example speech synthesisers
• software, such as text-to-speech and speech-to-text programs
• assistive listening devices, such as hearing aids
• specially adapted versions of textbooks, including audio, Braille and large-type
editions (both print and electronic)
• extended keyboards, Dvorak keyboards (which require less finger movement
than standard QWERTY keyboards) and other modified computer input devices
• provision of individual printed copies of PowerPoint slides, whiteboard content
or teaching notes
• specially designed learning games.
Controversies in assessment
Assessment involves evaluating someone’s progress. Any form of evaluation or
assessment holds the potential for controversy, and assessment of students is no
exception. Some issues in assessment have been controversial for decades; others, such as the disquiet over aspects of NAPLAN and its reporting of results through My School (Goldstein, 2010) outlined in chapter 13, have appeared only recently.
This section begins with a discussion of some currently significant, but long-term
controversies.
Bias in testing
Bias in testing has long been a topic of heated debate (Alderman, 2013; Barry &
Chapman, 2007; Thorndike, 1997; Xi, 2010). Concerns about bias in testing often
revolve around the highly verbal nature of some measures where a strong com-
mand of the English language is important in obtaining a good score. But the
development of such language proficiency would seem to favour wealthier indi-
viduals who have greater access to the kinds of words that appear on a measure.
Certainly, individuals who do not speak English as a first language would be at a
disadvantage unless accommodations are made (Abedi, 2008).
test bias: The degree to which the scores from an assessment take on different meanings when obtained by individuals from different groups.
However, the situation is far from simple. Even the definition of test bias is a subject of debate among scholars. Some believe that whenever an assessment produces different results for members of different racial or ethnic groups, or between genders, that assessment is biased. Measurement specialists use a more refined definition: if individuals from different groups (racial groups, genders etc.)
obtain the same score on an assessment, it should mean the same thing for both
individuals. If it does not, that is evidence that the test is biased. For example, if
two students, one male and one female, get the same university entrance ranking
score such as an ATAR or OP, they should be predicted to do similarly well at uni-
versity. Of course, the prediction is not made for one pair of individuals but for
large groups.
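The specialists' definition can be made concrete with a small sketch. The scores, outcomes and group labels below are invented for illustration, and the simple pooled-regression check shown is only a toy version of the idea (real differential-prediction studies use far more sophisticated analyses):

```python
# A minimal sketch of predictive test bias: a test is unbiased for
# prediction if the same score predicts the same outcome regardless of
# group. All data here are hypothetical.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

def mean_residual(a, b, xs, ys):
    """Average prediction error for one group under the common line."""
    return sum(y - (a + b * x) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical entrance-rank scores and first-year university GPAs.
group_1 = ([60, 70, 80, 90], [4.8, 5.4, 6.0, 6.6])
group_2 = ([60, 70, 80, 90], [4.9, 5.3, 6.1, 6.5])

# Fit one regression line to the pooled data ...
a, b = fit_line(group_1[0] + group_2[0], group_1[1] + group_2[1])

# ... then ask whether that line systematically over- or under-predicts
# for either group. Near-zero mean residuals for both groups suggest the
# same score means the same thing for both, i.e. no predictive bias.
for name, (xs, ys) in [("group 1", group_1), ("group 2", group_2)]:
    print(name, round(mean_residual(a, b, xs, ys), 3))
```

With this toy data both mean residuals come out at essentially zero: the common line predicts equally well for both groups, even though nothing stops the groups from having different average scores.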
Australian research (Coates & Friedman, 2010; Edwards, Coates, & Friedman,
2013) on tertiary admissions testing, as with similar studies in the United States
(Vernon, 2005; Xi, 2010), indicates that the tests typically do not show bias. That
is, the tests do an equally good job of predicting university performance for stu-
dents from minority groups as they do for majority students. This is not to say
that there are no group differences in performance on these measures. More-
over, if universities or colleges weigh the results of admissions tests too heavily,
the admissions process can still be biased even though the assessment is not. This is
somewhat akin to using height as the sole determinant of whom to select for a
basketball team. All other things being equal, taller players tend to be better than
shorter players. However, height is not the only factor that determines the quality
of basketball players. In the same fashion, admissions test scores are not the
only factor that determines success at university or college. Thus, an admissions
system that relies too heavily on testing can be biased even though it may be hard
to find bias in the measures themselves.
The situation is even more subtle for students who do not speak English as
a first language. If a mathematics assessment contains a number of word prob-
lems, ESL students may fail to reach a correct answer not because of a deficiency
in mathematics ability but because of difficulty understanding exactly what was
being asked of them.
Teachers attempt to prepare their students for the assessment that accompanies the
adoption of statewide or national standards. The problem this poses for class-
room teachers is that the students in a classroom typically are not all at the same
stage of ability or achievement in any given subject area. If they must all take the
same test, the teacher must review this material prior to the assessment, thereby
interrupting the natural progression of learning for many students.
Education is a complex and multifaceted phenomenon. How children and
youth learn, the best conditions for learning and how to maximise achievement
for a class of students are all difficult issues to address. When issues of poverty,
learning difficulties, unstable home environments and limited English ability
are thrown into the mix, the situation becomes even more challenging. Over the
past several decades, education has become increasingly politicised at both the
national and state levels. The problem is that impediments to improving edu-
cation do not go away simply because people want them to. What often is offered
in place of substantive plans for improving education is a call for higher standards
and more rigorous assessment of students’ achievement (e.g. Improving Literacy
Standards in Government Schools, Government of Victoria, 2003).
This seems like a worthwhile proposal — who could be opposed to higher stan-
dards? The problem is that the mechanisms for meeting the higher standards are
left to school districts, schools and ultimately to classroom teachers. The logic
here is tenuous: If students are not achieving enough now, demand more. Smith,
Smith, and De Lisi (2001) have compared this stance to working with a high
jumper. If the high jumper cannot clear the bar at 1.5 metres, is it going to do any
good to set it at 2 metres? The goal seems noble, but to attain it, something in
addition to higher standards is needed.
Concomitant with the increased emphasis on standardised assessment is a
greater tendency to teach to the test. This phrase refers to the practice of primarily,
or even exclusively, teaching those aspects of the curriculum that one knows are
going to appear on the standardised assessment. The problem is that those aspects
that do not appear on the test will disappear from the curriculum. In its extreme
form, students are not just taught only the content that will appear on the test,
they are taught that content only in the precise form in which it will appear on
the test. For example, if a Year 7 applied geometry standard is assessed by asking
students how many square metres of wallpaper will be needed to paper a room
with certain dimensions (including windows and doors), then eventually, in some
classrooms, that is the only way in which that aspect of geometry will be taught.
However, as Popham (2001) pointed out, if the curriculum is well defined and
the assessment covers it appropriately, teaching to the test can simply represent
good teaching. Of course, this requires teaching the content of the assessment in
such a way that students would be able to use it in a variety of situations, not just
to do well in the assessment.
Teachers cannot afford simply to bury their heads in the sand and hope that all
will turn out for the best with regard to testing. They must understand the pol-
itical, educational and assessment context in which they work. Teachers can be
effective advocates for best practices in education for their students, both in the
classroom and in larger political and social contexts.
Summary
Outline what standardised assessments are, and how such
instruments have been developed over time.
Assessment has a short history, with the initial development of standardised
assessment evolving from the need to determine appropriate educational place-
ment for students with learning difficulties. Standardised school assessments and
testing used for tertiary admissions both derive from developments of the first half
of the twentieth century. Most assessments in use at schools today are required
by legislation and are based on core curriculum standards developed by educators.
Professional test development companies create most standardised assessments.
Key terms
affective assessment, p. 591
central tendency, p. 592
coefficient alpha, p. 607
construct validity, p. 605
percentile, p. 597
raw scores, p. 596
reliability, p. 606
reliability coefficient, p. 606
Reflection
First, think about what your role is and what kinds of advice you can give to the
principal. You are a first-year teacher participating with teachers and the principal
who are more experienced and have seen a number of innovations in education
come and go. To begin with, do not try to be either an expert or a person with no ideas; recognise your knowledge and see yourself as an invited contributor. Make it clear that you respect, and would like to listen to, more experienced teachers, especially any who were part of the committee that designed the action plan. Next, think about whether and where you can make a contribution to the action plan. Do your homework using what you've learned and are learning about standardised and standards-based assessment.
Come to the Monday meeting prepared.
What theoretical or conceptual information might assist in this situation?
Consider the following.
• The validity of the assessment, curricular and action plan alignments. How well does the
school’s curriculum align with the achievement standards? How well does the action
plan align with both of them? If either alignment is poor, then the results may not be
a valid picture of students’ achievement. However, since the standards are set (fixed),
the action plan, the curriculum or both may be what has to be adjusted.
• Employing statistical analysis. What does the distribution of scores for the regional
education district look like, not just the percentages of students passing and
failing? Are there areas where the district is close to having a lot more students
pass the test? Are there areas of strong need?
• Looking for strengths and weaknesses. Where do our strengths and weaknesses appear
to be? What needs to be addressed? What strengths can be built upon?
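Looking past bare pass/fail percentages to the distribution itself can be sketched in a few lines. The scores and the pass mark of 50 below are invented purely for illustration:

```python
# A hypothetical look at a class's score distribution rather than just
# its pass rate. Scores and the pass mark are invented.
from statistics import mean, median, stdev

scores = [38, 42, 45, 47, 48, 49, 49, 51, 55, 58, 62, 64, 70, 73, 81]
PASS_MARK = 50

passing = sum(s >= PASS_MARK for s in scores)
# Students just below the mark: targeted support could move them over.
near_misses = [s for s in scores if PASS_MARK - 5 <= s < PASS_MARK]

print(f"mean {mean(scores):.1f}, median {median(scores)}, sd {stdev(scores):.1f}")
print(f"{passing}/{len(scores)} passing; {len(near_misses)} within 5 marks of passing")
```

Here the headline figure (8 of 15 passing) hides the more useful fact that five more students sit within 5 marks of the pass mark, which is exactly the kind of "close to having a lot more students pass" pattern the question above asks about.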
Information gathering
In this chapter, you have learned a great deal about how to look at assessment scores.
Find the school-wide report for each of the past three years since the action plan was
designed and bring them along to examine carefully at the meeting. What are the
school’s strengths and weaknesses? How do performances compare over the three
years? Are there trends that are consistent over time? How does your school compare
to others in the district or to similar schools in the state and nation? You are not the
only school facing this kind of problem. What have other schools done to improve scores? You might suggest that finding answers to these questions, together with some creative thinking about the options, would be good preparation for the meeting.
Decision making
There is a fable about mice being eaten by a cat. The mice decide to put a bell around
the cat’s neck so they can hear the cat coming and run away. The problem, of course,
is how to get the bell on the cat. Ideas that sound great but cannot be accomplished
are sometimes referred to as belling the cat. You need ideas that are practical and can
be put into effect. In choosing among various options about what might be done
in your school, the solutions have to be reasonable, given your circumstances and
resources. Using what you know about standardised and standards-based assessment is a good place to start in bringing clear argument and good ideas about practice together in support of the action plan.
Evaluation
Were you able to use your knowledge about standardised assessments to contribute
to the first (and any follow-up) meetings with the principal? The ultimate evaluation
will occur when you look at what happens when the action plan is revisited, and the
achievement scores for this year, next year and the year after that. However, that is
a long time to wait for results. You might want to suggest giving a mid-year or even
quarterly assessment that resembles the NAPLAN assessment to provide an idea of
how you are doing and where adjustments might be made.
Lesson plans
The lesson plans on these pages are excerpts from more complete plans. They are included in this
form to provide you with practice in examining a lesson from the perspective of the content of this
chapter. For both, read the lesson plan excerpt provided here and respond to the questions related
to assessment in the ‘Ask yourself!’ section at the end.
Years 1–2
Learning areas: Science
Overview Students will consider, explore and talk about different types of weather and the
Materials • Book – I’m Freezing by Shane McG (or similar from the school library)
• Interactive whiteboard, whiteboard or paper block
Procedure 1. The teacher will begin by reading the book, I’m Freezing by Shane
McG (or similar from the school library) and then prompt the class to
discuss elements of the story in relation to what weather is and why it
is important in everyone’s life (e.g. staying inside vs going outside; how
we dress; games we can play; how weather makes us feel; what clouds
look like; why we don’t look directly at the sun, but can look at the moon
and stars). Then, the teacher will work with students in looking around
inside — and through the windows to the outside — to compare the
weather as it seems to be indoors and outside. (Note: Warn students not
to look directly at the sun, but to concentrate on the light and the warmth
it provides).
2. Initiate class discussion on words and word groups that describe weather
(e.g. conditions like cold/hot, wet/dry, windy/still, cloudy/clear; features
like sky, sun, wind, rain, fog, mist; and people’s reactions like sweaty/goose-
bumps, hot/freezing, warm clothes/cool clothes). List these interactively in
plain view on an available medium.
3. Students and teacher collaborate on how the listed items might be
categorised into sets.
4. With the teacher’s guidance, students then crease their paper sheet into
halves. On one half and again with the teacher’s modelling, they use the
listed language items and their own illustrations to create a picture of a
weather feature, its related condition/s and people’s reaction/s (as depicted
in faces, activities and clothing of some children playing).
5. With the teacher determining the predominant basis of grouping (e.g. sunny/
rainy/windy day) students in groups will talk about their pictures and find
something in a peer’s work that they would like to adopt for their own picture
and to speak about with the whole class.
6. Teacher and students repeat the previous two steps, this time using a
different weather feature to complete an illustration on the other half of the
creased page.
7. Finally, students present their two-part artwork and explain the ideas
behind their pictures, particularly the similarities and contrasts. Teacher’s
progressive feedback during the presentations will reinforce positive
instances of accord between a student’s recount and learning outcomes set
for the lesson — or arising through the implementation — and conclude with
a summation and consolidation.
Ask yourself!
1 Read the learning outcomes listed above. How will you adjust them and turn these into
teaching activities for your class? Write some teaching objectives appropriate for this lesson
plan and for which you can design a reliable assessment.
2 Check the Australian Curriculum links for Science that apply to this lesson, such as:
• daily and seasonal changes in our environment, including the weather, affect everyday life
(ACSSU004)
• science involves exploring and observing the world using the senses (ACSHE013)
• respond to questions about familiar objects and events (ACSIS014)
Lesson 2: Film reels 13+
Years 8–9
Learning areas: Mathematics
Overview Students will examine motion picture terminology, specifically focusing on film
reels, to determine why reels are still preferred to digital recordings for most
high-end motion pictures. On average, the length of a 35 mm motion picture reel
is 300 m. The reel itself has a cylindrical core and walls on the sides to hold the
film, which is wound around the core. Using standard reel sizes, students will
complete maths computations.
Procedure 1. The teacher will explain the concept of film and film reels to students, utilising
opportunities to help students identify and clarify unfamiliar concepts and
language.
2. The teacher then will review geometrical maths computations involving
cylinders, radius, diameter, surface area and volume, by first providing
these five concepts as headings and asking students to construct
30-second mind-maps of each depicting their current and remembered knowledge. Students will then build on their maps as necessary as the teacher leads the review.
3. The teacher then shows the mechanical operation of the rotating film reel
passing across the light projection source and the power source that provides
the rotation, light and projection, and about the number of frames/metre in
the film.
4. Students are invited into a collaborative discussion about what
considerations of speed are necessary to ensure that the frames will be
projected in such a way as to provide sustained, credible human movement
and activity — and how mathematics helps in the planning for this.
5. The teacher then asks students what part the maths depicted on their mind maps may play in this planning, and assists students as necessary to revisit and complete their maps more fully.
6. The teacher sets students to peruse questions from the task list based on
mathematical characteristics of film reels, explaining that their purposes
are to consider why film makers in days of reel technology and those
doing digital conversions of old film would need such information, and,
what mathematical formulae students will need to complete the five items
of the task.
7. Students then complete the task either independently or by working
collaboratively with others.
8. The teacher has students provide their answers and reasoning behind them,
ensuring that all instances of linkage to the stated learning objectives are
highlighted for discussion and consolidation.
Questions for discussion
1. The diameter of a 150 m film reel is 20 cm. What is the radius of this reel?
2. If the diameter of a 150 m film reel is 20 cm, how many metres of film would a reel with a 45 cm diameter hold?
3. What is the surface area of a reel with a diameter of 40 cm and a height
of 5 cm?
4. Based on the numbers used above, what is the volume of a 450 m film reel?
5. Based on the numbers used above, what is the surface area of a film reel
with a diameter of 60 cm?
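As a check on questions 1 and 3, a short sketch might look like the following. It assumes the reel can be treated as a simple solid cylinder, which the task appears to intend:

```python
# Worked sketches for two of the reel questions, treating the reel as a
# solid cylinder. All figures come from the question list above.
import math

# Q1: the radius is half the diameter.
radius_q1 = 20 / 2  # cm

# Q3: surface area of a cylinder with diameter 40 cm and height 5 cm:
# two circular faces plus the curved side, SA = 2*pi*r^2 + 2*pi*r*h.
r, h = 40 / 2, 5
surface_area_q3 = 2 * math.pi * r**2 + 2 * math.pi * r * h  # cm^2

print(f"Q1 radius: {radius_q1} cm")                    # 10.0 cm
print(f"Q3 surface area: {surface_area_q3:.0f} cm^2")  # 1000*pi, about 3142 cm^2
```

Question 2 is deliberately left out of the sketch: relating film length to reel diameter depends on how the core and the thickness of the wound film are modelled, which is exactly the kind of assumption students should surface in discussion.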
Ask yourself!
1 How well do the learning objectives listed here align with the Australian Curriculum for Year
8 Mathematics (see especially ACMMG195, ACMMG197 and ACMMG199)?
2 How might you develop and extend some of the content questions in the list the teacher
provided to link more closely with the first learning objective?
3 What other kinds of media within the arts curriculum allow for mathematics to be used in
this way?