10 1 1 918 6674

Scaling descriptors for language
proficiency scales
Brian North Eurocentres Foundation, Zürich and
Günther Schneider Institut für deutsche Sprache, University
of Fribourg
This paper reports results from a Swiss National Science Research Council project
which aimed to develop a scale of language proficiency in the form of a ‘descriptor
bank’. Up until now, most scales of language proficiency have been produced by
appeal to intuition and to those scales which already exist rather than to theories
of linguistic description or of measurement. The intention in this project was to use
an item-banking methodology to develop a flexible scale of stand-alone criterion
statements with known difficulty values.
The project took place in two rounds: the first for English (1994), the second
for French, German and English (1995). In each year pools of descriptors were
produced by analysing available proficiency scales. Through workshops with rep-
resentative teachers, the descriptors were then refined into stand-alone criterion
statements considered to be clear, useful and relevant to the sectors concerned.
Selected descriptors presented on questionnaires were then used by participating
teachers to assess the proficiency of learners in their classes. This data was used
to scale the descriptors using the Rasch rating scale model. The difficulty estimates
for the descriptors produced in relation to English in 1994 proved remarkably
stable in relation to French, German and English in 1995.
I Introduction
During the past decade, two influences have led to the increasing use
of scales of language proficiency. The first influence has been a gen-
eral movement towards more transparency in educational systems.
The second has been moves towards greater international integration,
particularly in Europe, which places a higher value on being able to
state what the attainment of a given language objective means in prac-
tice. The result is that whereas 10 or 15 years ago scales which were
not directly or indirectly related back to the 1950s US Foreign Service
Institute (FSI) scale (Wilds, 1975) were quite rare, the last few years
have seen quite a proliferation of European scales which do not take
Address for correspondence: Brian North, Eurocentres, Seestrasse 247, CH-8038, Zurich, Switz-
erland; e-mail: bnorth얀eurocentres.com
Language Testing 1998 15 (2) 217–263 0265-5322(98)LT149OA  1998 Arnold
Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

218 Scaling descriptors for language proficiency scales
American scales as their starting point. Some examples are: the Brit-
ish National Language Standards (Languages Lead Body, 1992); the
Eurocentres Scale of Language Proficiency (North, 1991; 1993a); the
Finnish Scale of Language Proficiency (Luoma, 1993) and the ALTE
Framework (Association of Language Testers in Europe, 1998).
Section II considers scales of language proficiency: the functions
they can fulfil, common criticisms of them and various methodologies
which have been used recently in scale construction. It goes on to
explain why one particular methodology (qualitative validation of
descriptors with informants followed by Rasch model scaling) was
selected for the study in question. Section II then outlines the study
which is the subject of this article: the background and project struc-
ture and the way the phases operated in 1994 and 1995. Section III
briefly presents the results: a calibrated scale of language proficiency
and a bank of classified, calibrated descriptors. Section IV discusses
some of the complications encountered, including dimensionality
problems and differential item functioning: the way descriptors were
interpreted in different contexts, and finally Section V concludes on
the significance of the work undertaken.
II Scales of language proficiency

1 Functions
Many scales of proficiency represent what Bachman (1990: 325–30)
has described as the ‘real-life’ or behavioural approach to assessment.
This is because they try to give a picture of what a learner at a parti-
cular level can do in the real world. Other scales take what Bachman
describes as an ‘interactive-ability’ approach attempting to describe
the aspects of the learner’s language ability being sampled. Alderson
(1991: 71–76) uses a three-way classification focusing upon the pur-
poses for which scales are written and used. His expression for Bach-
man’s ‘interactive-ability’ approach is ‘assessor-oriented’ since such
scales are intended to bring consistency to the rating process. He
identifies in addition a ‘user-oriented’ function to give meaning to
scores in reporting results (usually in ‘real-life’ terms) as well as a
‘constructor-oriented’ function to provide guidance in the construc-
tion of tests or syllabuses, again usually defined in terms of ‘real life’
tasks. Matthews (1990) and Pollitt and Murray (1993/1996) point out
that complex analytic grids of ‘interactive-ability’ descriptors
intended for profiling can confuse rather than aid the actual assess-
ment process. Pollitt and Murray (ibid.) go on to suggest that such
profiling grids are, rather, ‘diagnosis-oriented’. As Alderson points
out, problems arrive when a scale developed for one function is used

Brian North and Günther Schneider 219
for another. In an educational framework, there will be circumstances

in which descriptors relating to ‘real-life’ tasks are appropriate; there
will be circumstances in which descriptors relating to qualitative
aspects of a person’s proficiency (‘interactive-ability’) will be appro-
priate. As Pollitt and Murray point out, rich detail may be appropriate
for some functions, but not for others.
Scales offering definitions of learner proficiency at successive
bands of ability are becoming more popular because they can be used
1) to provide ‘stereotypes’ against which learners can compare their
self image and roughly evaluate their position (Trim, 1978;
Oscarson, 1978/1979; 1984);
2) to increase the reliability of subjectively judged ratings, provid-
ing a common standard and meaning for such judgements
(Alderson, 1991);
3) to provide guidelines for test construction (Dandonoli and Hen-
ning, 1990; Alderson, 1991);
4) to report results from teacher assessments, scored tests, rated
tests and self-assessment all in terms of the same instrument –
whilst avoiding the spurious suggestion of precision given by a
scored scale (e.g. 1–1,000) (Alderson, 1991; Griffin, 1989);
5) to provide coherent internal links within an institution between
pre-course testing, syllabus planning, materials organisation, pro-
gress assessment and certification (North, 1991; 1993a);
6) to establish a framework of reference which can describe
achievement in a complex educational system in terms meaning-
ful to all the different partners involved (Trim, 1978; Brindley,
1986; 1991; Richterich and Schneider, 1992; Council of
Europe, 1996);
7) to enable comparison between systems or populations using a
common metric or yardstick (Lowe, 1983; Liskin-Gasparro,
1984; Bachman and Savignon, 1986; Carroll, and West, 1989).
2 Criticisms
A definition by John Clark (1985: 348) catches the main weakness
of the genre:
descriptions of expected outcomes, or impressionistic etchings of what pro-
ficiency might look like as one moves through hypothetical points or levels
on a developmental continuum.
Put another way, there is no guarantee that the description of pro-
ficiency offered in a scale is accurate, valid or balanced. Learners
may actually be able to interpret a scale remarkably successfully for
self assessment; correlations of 0.74–0.77 to test/interview results are

usual in relation to the Eurocentres scale. Raters may actually be

trained to think the same; inter-rater reliability correlations of over
0.8 are common in the literature and correlations over 0.9 are reported
in ‘studio conditions’ (e.g., Milanovic et al., 1992). But the fact that
people may be able to use such instruments with surprising effective-
ness doesn’t necessarily mean that what the scales say is valid. Fur-
thermore, with the vast majority of scales of language proficiency, it
is far from clear on what basis it was decided to put certain statements
at Level 3 and others at Level 4 anyway.
Another line of criticism has been that many scales of proficiency
cannot be regarded as offering criterion-referenced assessment
although they generally claim to do so. First, the meaning of the
descriptors at one level is often dependent on a reading of the descrip-
tors at other levels. Secondly, the formulation of the descriptors is
itself sometimes overtly relativistic (e.g. ‘better than Level 2’) or
norm-referenced (e.g., using expressions like ‘poor’, ‘weak’,
‘moderate’). Since the descriptors on most scales are not developed
independently to check that they are actually saying something, it is
not surprising that many scale descriptors fail to present stand-alone
criteria capable of generating a Yes/No response (Skehan, 1984:
217).
Most scales of language proficiency appear in fact to have been
produced pragmatically by appeal to intuition, the local pedagogic
culture and those scales to which the author had access. In the process
of development it is rare that much consideration is given to the fol-
lowing points:
• using a model of communicative competence and/or language
use;
• checking that the categories and the descriptor formulations are
relevant and make sense to users, as is standard practice in behav-
ioural scaling in other fields (Smith and Kendall, 1963; see North,
1993b for a review).
• using a model of measurement.
• avoiding the dangers of lifting rank-ordered scale content from
one context and then using it inappropriately in another (see
Spolsky, 1986; 1989: 65–69).
Whilst an intuitive approach may be appropriate in the development
of scales for use in a low-stakes context in which a known group of
assessors rate a familiar population of learners, it has been criticised
in relation to the development of national framework scales (e.g.
Skehan, 1984; Fulcher, 1987; 1993 in relation to the (British) English
Language Testing Service (ELTS); Brindley 1986; 1991; Pienemann
and Johnston, 1987 in relation to the Australian Second Language

Proficiency Ratings (ASLPR); Bachman and Savignon, 1986; Lantolf

and Frawley, 1985; 1988; 1992; Spolsky, 1986; 1993 in relation to the
American Council of the Teaching of Foreign Languages (ACTFL)
Proficiency Guidelines. As De Jong (1990: 72) has put it: ‘the
acceptability of these levels, grades and frameworks seems to rely
primarily on the authority of the scholars involved in their definition,
or on the political status of the bodies that control and promote them’.
3 Methodologies for scale development

Concerns such as those voiced above have led recently in language
testing to a search for data-based solutions to scale construction.
Methods used have included:
1) the intuitive identification of key ‘features’ at different levels
through rater discussion of performance samples ranked in a con-
sensus order (Mullis, 1980; Alderson, 1991: 81);
2) the identification in discussion with raters of performance fea-
tures and of decision points in relation to those features, so as to
construct a Yes/No binary algorithm (Upshur and Turner, 1995);
3) the empirical identification of key performance features through
a discriminant analysis of factors discovered in a detailed interac-
tion analysis of performance samples (Fulcher, 1993);
4) the construction of tests made up of items considered to oper-
ationalise proficiency descriptors; scaling those test items
through Rasch analysis of test data; editing a band scale by set-
ting cut-off points on the scale made up of the descriptors asso-
ciated with the calibrated test items (Brown et al., 1992);
5) the identification with informants of key target behaviours and
their formulation into descriptors; scaling those descriptors
through Rasch rating scale analysis of teacher ratings of their
students on questionnaires constructed from the descriptors; edit-
ing a band scale by setting cut-off points on the scale made up
of the calibrated descriptors (Griffin, 1989; 1990a; 1990b).
Methodologies 1–3 are based upon samples of performance in the
test setting concerned. They aim to provide assessor-oriented scales
to guide raters in the assessment of performance in a specific speaking
or writing task. This use of scales of proficiency is only one (No. 2)
of the six main functions listed in Section I.
Methodologies 1–2 are based on expert judgement rather than on
analysis. They do not actually scale the descriptor formulations pro-
duced. In view of the difficulty experts have in agreeing about task
difficulty, well documented in relation to standard setting (e.g.,
Jaeger, 1976) and language test construction (e.g., Alderson, 1990a),

such an approach is only likely to be successful in one specific, lim-

ited context.
Methodologies 3–4 are based upon analysis, but the descriptor for-
mulations are not directly scaled in either case. Whereas the features
observed in Methodology 3 can be described objectively from the
data, Methodology 4 runs into expert judgement. How can one estab-
lish that descriptor x is an exact verbal equivalence to performance
on test item y? Alderson (Alderson and Lukmani, 1989; Alderson,
1990b) and Lumley (1993) have disagreed over the ability of judges
to state what a test item is actually testing. However, even if that
problem is surmounted, the problem remains of associating the test
item with this particular formulation of the task involved rather than
that variant of it. Pollitt (1993) considers that for nonobservable per-
formances like listening and reading establishing a firm association
between an item and a descriptor is extremely difficult.
Methodology 5 is the only methodology which directly scales the
descriptors. The difficulty level represented by a descriptor is estab-
lished through statistical analysis of the way teachers actually use the
descriptor in live assessment. Expert opinions about the level of a
descriptor affect only which data collection instrument it is used on –
not its final scaled value. The danger of the approach is that one can
of course scale anything. Care needs to be taken in the development
and pre-testing of the descriptors to ensure that they take account of
available models of communicative competence and of language use
in order to provide a coherent, balanced framework for profiling. In
other words, to the degree that this is feasible, a priori construct vali-
dation is necessary in relation to theory as well as to the way users
think (Griffin and McKay, 1992: 21). Care also needs to be taken
during the scaling analysis to identify dimensionality problems: that
aspects are not scaled together if they should be dealt with as separate
parameters; that people are not lumped together if they should be
dealt with as separate populations.
There seem in fact to be five major differences between this last
kind of data-based approach and the other four kinds listed above.
All of these points were advantageous for the project reported here.
• Criterion statements: The descriptors are initially developed as

educationally relevant, stand-alone criterion statements, and are
then objectively scaled.
• Curriculum relevance: The technique is not confined to features
or behaviours observable in a specific test task, but can include
any features or ‘real-life’ behaviours that informants are able to
rate consistently.
• Consensus interpretation: The consultation of independent groups

of informants during the process of determining the content and

formulation of the descriptors helps ensure comprehensibility and
relevance. The use of questionnaires to collect the main data bases
the scaling of the descriptors on the collective interpretation of a
relatively large and potentially representative sample of users.
• Quality control: The consultation with informants should serve to
eliminate poorly formulated or irrelevant descriptors. In addition a
Rasch analysis helps in the process of identifying those descriptors
which, for one of a variety of reasons, do not ‘fit’ well with the
main construct being measured.
• Scaled descriptors: Rasch theory claims scale values to be stable
and objective. They are said to be ‘instrument-free’ in the sense
that no one single instrument must be used, and ‘person-free’ in
the sense that values apply to further groups which can be con-
sidered part of the same population (Wright and Linacre, 1987:
2). A scale of proficiency thus produced could therefore be said
to meet Thurstone’s requirement that:
the scale values of the statements should not be affected by the opinions
of the people who helped to construct it (the scale). This may turn out
to be a severe test in practice, but the scaling method must stand such
a test before it can be accepted as being more than a description of the
people who construct the scale’ (Thurstone, 1928: 547–48, cited in
Wright and Masters, 1982: 15).
The concept of objectivity is being used in two different senses

here. Firstly, the scaling is objective in Thurstone’s everyday
sense that the difficulty of the descriptors is not determined by
author opinion but established through data. Second, the scaling
is objective in Wright and Linacre’s technical sense that the values
can be replicated with different people (from the same population)
or with different instruments (linked to the scale).
A claim to absolute objectivity in the scaling would, however,
be overstated. Any analysis of data is affected by the character-
istics and the experiences of both the raters making the judge-
ments which produce the data and the learners being rated in the
process (see Farhady, 1982: 55; Carroll, 1983: 93; Upshur and
Homburg, 1983: 194; Cziko, 1984: 28, 34; Sang et al., 1986: 60,
70 for discussion of this point). What is actually being scaled here
is the teachers’ collective interpretation of the difficulty of these
descriptors, a mapping of the behaviours of one group by another
group (Pienemann and Johnston, 1987: 67). This is not necessarily
synonymous with objectivity but was attractive in the context of
developing a common framework.

II The study
1 Background
The study took place within the context of moves in Europe towards
a common framework of reference. The authors were members of a
Council of Europe working party charged with producing a ‘Common
European Framework of reference’ for language learning, teaching
and assessment (Council of Europe, 1996). The core of the Frame-
work is 1) a descriptive scheme representing aspects of communicat-
ive language competence and use and 2) a set of common reference
levels. The authors were simultaneously members of a Swiss working
party charged with developing a ‘Language Passport’ or ‘Language
Portfolio’ recording achievement in relation to the Framework
(Schärer, 1992; Council of Europe, 1997). In 1993 a Swiss National
Science Research Council three-year project (Schneider and North,
forthcoming) was set up with the primary aim of developing a bank
of transparent, calibrated descriptors of communicative language pro-
ficiency to be used in the first editions of the Framework and the Port-
folio.
2 Project structure
A pilot project for English conducted in 1994 (Year 1) was the sub-
ject of a PhD at Thames Valley University (North, 1996). The focus
in the pilot for English was on spoken interaction, including compre-
hension in interaction, and on spoken production (extended
monologue). Some descriptors were also included for written interac-
tion (letters, questionnaire and form filling) and for written production
(report writing, essays, etc.). In 1995 (Year 2) the survey was
extended to French and German as well as English. Descriptors were
also added for reading and for noninteractive listening.
The project took place in three steps in each of the two years 1994
and 1995:
1) Comprehensive documentation: creation of a descriptor pool
A survey of existing scales of language proficiency (North, 1994)
provided a starting point. Forty-one proficiency scales were
pulled apart with the definition for each level from each scale
assigned to a provisional level. Each descriptor was then split up
into sentences which were then each allocated to a provisional
category. When adjacent sentences were part of the same point,
they were edited into a compound sentence.
2) Qualitative validation: consultation with teachers through work-
shops
Qualitative validation of the descriptor pool was undertaken

through wide consultation with foreign-language teachers rep-

resentative of the different sectors in the Swiss educational sys-
tem. The consultation aimed to ensure that the way teachers
thought about proficiency was catered for in the descriptor pool
and that all the categories used in the pool were relevant and
usable for the teachers. In addition, the procedure identified the
most effective descriptors.
3) Quantitative validation: main data collection and Rasch scaling
• Data collection instruments: A selection of the best

descriptors was scaled in a questionnaire survey in which
class teachers assessed learners representative of the spread
of ability in their classes. Assessment was of two kinds:
a. Teachers’ assessment of the proficiency of 10 learners
in their classes using questionnaires made up of 50
descriptors. (See Appendix 1 for an example.) Adjac-
ent questionnaires shared a certain number of descrip-
tors (‘common items’).
b. Teachers’ assessment of video performances of selec-
ted learners in the survey using ‘mini questionnaires’.
All teachers rated all videos for the language in ques-
tion within two weeks of completing their class
questionnaire(s). The video performances were struc-
tured to a standard format including monologue
description and spontaneous interaction. The video
‘mini-questionnaires’ used a small selection of
descriptors drawn from the main questionnaires for
the level concerned. Descriptors which were parti-
cularly appropriate to each video performance con-
cerned were selected for this purpose.
• Analysis methodology: The analysis method was an adap-
tation of classic Rasch item banking in which a series of
tests (here questionnaires) are linked by common items
called ‘anchor items’ in order to create a common item
scale. In order to estimate and adjust for rater severity in
calibrating the video performances onto the scale, the
‘many-faceted’ Rasch rating scale model (Linacre 1989) as
operationalised in the program FACETS was used.
• Subjects: Exactly 100 teachers took part in the English pilot
in 1994, most rating 5 learners from two different classes
(total 945 learners). In the second year 192 teachers (81
French teachers, 65 German teachers, 46 English teachers)
each rated 10 learners, most rating 10 learners from the
same class. In each year about a quarter of the teachers

were teaching their mother tongue, and the main edu-

cational sectors were represented as follows:
Lower Upper Vocational Adult

Secondary Secondary
Year 1: 35% 19% 15% 31%
Year 2: 24% 31% 17% 28%
• Rating scale: Each descriptor on both the main question-
naires and the video mini-questionnaires had the same 0–
4 rating scale. This scale was defined in full on the cover
page of the questionnaire and repeated in short form at the
top of each page (see Appendix 1).
Consistency in the way the rating scale was actually used
1) by different teachers and 2) in relation to different items
was checked with FACETS by running Partial Credit
Model (PCM) analyses (Masters, 1982) which allowed
each item and each rater to define their own scale. No sig-
nificant differences were discovered.
3 Year 1: English, 1994
a. Creating a descriptor pool: In Year 1 the creation of the descrip-
tor pool for the project coincided with the period in which the Council
of Europe Framework authoring group was developing the descriptive
scheme. The descriptive scheme draws on the area of consensus
between existing models of communicative competence and language
use (e.g., Canale and Swain, 1980; 1981; Van Ek, 1986; Bachman,
1990; Skehan, 1995; McNamara, 1995). In addition, an organisation
of language activities under the headings Reception, Interaction and
Production, developing an idea from Brumfit (1987), was adopted.
Space does not permit detailed consideration of the scheme; readers
are referred to North (1997) for a shortened version of a study pro-
duced at the time and to the actual document (Council of Europe,
1996). Very briefly summarised, the scheme sees communicative lan-
guage competence (linguistic, pragmatic, socio-linguistic) as a part of
general human competences (including socio-cultural competence).
Learners draw on these competences to the extent that they can, given
the conditions and constraints operating, and adopt in the process stra-
tegies appropriate to their purpose in the circumstances in order to
complete the specific task in the language activity concerned. The
product is text.
Year 1 focused on speaking and on interaction, so from the point
of view of organising descriptors, the main potentially scaleable cat-
egories involved in the scheme were the following:

The descriptors for communicative language activities would corre-

spond to Bachman’s real-life and Alderson’s constructor-oriented
classification. The following example, taken from a Dutch scale, is a
descriptor for conversation, taken under spoken interaction. It was
later calibrated to Threshold Level:
Can enter unprepared into conversations on familiar topics.
The descriptors for qualitative aspects of language proficiency would
correspond to Bachman’s interactive-ability and Alderson’s assessor-
oriented classification. The following example, edited from entries in
a number of scales, is a descriptor for fluency, taken under pragmatic
(see discussion below). It was later calibrated to a new level for which
the name Vantage Level has been adopted.
Can produce stretches of language with a fairly even tempo; although he/she
can be hesitant as he or she searches for patterns and expressions, there are
few noticeably long pauses.
The descriptors for strategies fall somewhere between this
activity/quality distinction. The following example, written after
Kellerman, Bongearts and Poulisse’s (1987: 106) concept ‘analytic
strategy’, is a descriptor for compensating, taken under production
strategies. This descriptor was later calibrated to Threshold Plus, the
level between Threshold and Vantage.
Can define the features of something concrete for which he/she can’t remember
the word.
Descriptors from the scales used as sources were classified according
to this scheme. Since few descriptors for strategies were available,
some 80 were written for interaction and production strategies
(following Tarone, 1980; 1981/1983) with others added for reception
(restricted to listening in interaction). Problems with the classification
naturally arose because at least part of the time, one was trying to
relate performance descriptors and categories teachers are familiar
with (e.g. fluency) to competence categories. In that particular case
a broad view of pragmatic competence was taken following Thomas
(1983), and a broad view of fluency adopted, following Fillmore
(1979) and Lennon (1990). This was also a reflection of Brumfit’s
(1984) popular accuracy (linguistic) / fluency (pragmatic) distinction
referred to by many teachers in the workshops described below.

The elimination of repetition, negative formulation and norm-

referenced statements now meaningless away from their co-text pro-
duced a pool of approximately 1,000 stand-alone, positively worded
criterion statements such as those above and those appearing on the
questionnaire in Appendix 1.
b Qualitative validation: consultation with teachers through work-

shops: The purpose of the workshops was to ensure three things.
First, that the classification made sense to teachers, secondly, that the
way teachers thought – the categories they used to talk about pro-
ficiency – were well represented in the pool. And finally, that the
descriptors were relevant and well formulated. Two techniques were
used in each of 18 workshops attended by between 4 and 25 teachers.
The first technique was adapted from that reported by Pollitt and
Murray (1993/1996). Teachers were asked to discuss which of a pair
of learners talking to each other on a video was better – and justify
their choice. The aim was to elicit the metalanguage teachers used to
talk about qualitative aspects of proficiency and check that these were
included in the categories in the descriptor pool. These discussions
were recorded, transcribed in note form, analysed and, if something
new was being said, formulated into descriptors.
The second technique was based on that used by Smith and Kendall
(1963). Pairs of teachers were given a pile of 60–90 descriptors cut
up into confetti-like strips of paper and asked to sort them into three
or four labelled piles which represented related potential categories
of description. At least two, generally four and up to ten pairs of
teachers sorted each set of descriptors. A discard pile was provided
for descriptors for which the teachers couldn’t decide the category,
or found unclear or unhelpful. In addition teachers were asked to
indicate which descriptors they found particularly clear and useful and
which were relevant to their particular sector. This data was coded in
descriptor item histories. Some 400 items which had consistently been
sorted correctly and which offered a balanced coverage of the tax-
onomy were identified as suitable for use on the questionnaires in the
next stage.
c Quantitative validation: scaling descriptors through teacher

assessments: Seven main questionnaires spanning proficiency levels
from beginner to advanced were linked by 10–15 anchors in each
direction. One questionnaire is given as an example (Appendix 1),
with the descriptors which were dropped in analysis printed in italics.
A number of technical difficulties encountered in the analysis were
reported in North (1995). All concerned an apparently excessive over-
lap between the scales produced by the individual forms when these

separate scales were equated onto the common scale. This was caused
by three problems. First, the logit scale produced by the MLE
(maximum likelihood estimation) procedure used by most Rasch pro-
grams including FACETS distorts towards the top and bottom (see
Camilli, 1988; Warm, 1989; Jones, 1993). The solution adopted was
to exclude items and learners scoring over 75% or under 25% from
the analysis, thus setting what Warm (1989: 447) calls ‘rational
bounds’. Secondly, the powerful data set of all 100 teachers rating
all the video performances swamped the main questionnaire data. The
solution adopted was to first estimate difficulty values for the items
on the basis of the analysis of the main questionnaire data alone with
teacher severity anchored to zero. The third complication was that it
transpired that FACETS ratchets the forms too closely together when
analysing the whole data set (‘one-step equating’ Jones, 1993; ‘con-
current calibration’ Kenyon and Stansfield, 1992). The solution
adopted here was to fall back on Woods and Baker’s (1985: 128–31)
classic method of ‘two-step equating’. With this method each form
is analysed separately. Then the forms are linked together by increas-
ing values on each successive form by the average difference of dif-
ficulty of the anchor items on the two forms. Then the resulting com-
mon scale is recentred on zero.
During this phase of scale construction, Wright and Stone’s (1979:
92–96) classic technique for plotting the stability of the difficulty
estimates for the anchors on adjacent forms was used. In this simple
technique, the values produced by anchor items on one form are plot-
ted on the X axis with the values produced on the other form plotted
on the Y axis. A series of calculations based upon pooled standard
error of measurement at various points along the 45° diagonal pro-
duces upper and lower 95% criterion lines at the conventional 0.05
significance level. Anchor items appearing outside those lines demon-
strate an instability which cannot be explained by standard error. Such
items were therefore excluded. Unfortunately 9 of the 13 items lost
in this way were items on different kinds of strategies, such as the
example below:
Can identify words which sound as if they might be ‘international’, and try
them.
These descriptors had been used as anchor items precisely because

they seemed to be relevant to both the levels concerned. However,
presumably for that reason, they showed almost exactly the same dif-
ficulty value on both forms, whereas elementary descriptors on an
intermediate form should become easier relative to the other
(intermediate) items on the form. Fortunately, other descriptors on

strategies did stay within the criterion lines, indicating that their dif-
ficulty value was not dependent on the level at which they were used.
The problems with anchor items mentioned above were not sig-
nalled in misfit statistics. There was, however, a substantial amount
of misfit in Year 1 which led to three complete categories of items
being lost:
1) Socio-cultural competence: It is not clear how much this problem
was caused a) by the concept being a quite separate construct
from language proficiency and hence not ‘fitting’ – as also found
by Pollitt and Hutchinson (1987); b) by rather vague descriptors
identified as problematic in the workshops, or c) by inconsistent
responses by the teachers. Nos. 46 and 47 on the questionnaire
given as Appendix 1 are examples of descriptors lost in this way.
2) Work-related: Those descriptors asking teachers to guess about
activities (generally work-related) beyond their direct experi-
ence: telephoning; attending formal meetings; giving formal
presentations; writing reports and essays; formal correspondence.
Descriptors for these categories tended to show higher levels of
misfit and to have their difficulty underestimated when analysed
with the main construct. Griffin (1989; 1990a: 296) reports a
similar problem in relation to library activities during the devel-
opment of his primary-school reading scale. No. 9 in Appendix
1, on telephoning, is an example of a descriptor lost in this way.
One could argue that No. 4. (on negotiating) is also an example
of this type, though other descriptors for Negotiating were cali-
brated successfully.
3) Negative concept: Those descriptors relating to dependence
on / independence from interlocutor accommodation (need for
simplification; need to get repetition/clarification), which are
implicitly negative concepts, misfitted wildly. These aspects
worked better as provisos in positively worded statements, for
example:
Can generally understand clear, standard speech on familiar matters
directed at him/her, provided he/she can ask for repetition or reformul-
ation from time to time.
Nos. 27 and 28 in Appendix 1 are examples of descriptors lost
in this way.
Pronunciation is another concept which is often conceived in
negative terms – the strength of accent, the amount of foreignness
causing comprehension difficulties. The pronunciation items for
lower levels were negatively worded and showed large amounts
of misfit although they were calibrated sensibly. No. 45 in
Appendix 1 is an example of a descriptor lost in this way.

Apart from these groups of descriptors, a number of individual

descriptors were lost. There are five very typical examples in Appen-
dix 1: No. 26 (deducing unknown words – category called receptive
strategies); No. 29 (taking the turn); No. 32 (rehearsing and using
new language – category called flexibility); No. 35 (language aware-
ness and self-correction – category called monitoring) and No. 38
(coping with the unexpected – not used as a category). Three things
are noticeable about these five descriptors. First, they all concern
attempts to define aspects of communicative proficiency which are
not purely linguistic but have a strategic or an educational element.
Second, these aspects are all a little nebulous and difficult to observe
in comparison with the rather concrete statements in the list of spoken
tasks in Appendix 1. Finally, and most significantly, other descriptors
for all these categories except receptive strategies were successful in
both years.
Once the technical difficulties had been overcome, the stability of
the anchors had been checked and the categories and individual items
showing gross misfit had been excluded, difficulty values for the
remaining items were estimated to construct the scale.
d Setting cut-offs: Once the scale had been constructed, the descrip-
tors appeared calibrated in rank order onto a common logit scale as
shown in the extract in Appendix 2. The next task was to establish
‘cut-off points’ between bands or levels on this logit scale. Setting
cut-offs is always a subjective decision (Jaeger 1976: 2; 1989: 492)
and as Wright and Grosse (1993: 316) put it: ‘No measuring system
can decide for us at what point short becomes tall’. The cut-offs set
were not, however, arbitrary. As Pollitt (1991: 90) shows there is a
relationship between the reliability of a set of data and the number
of levels it will bear. In this case the scale reliability of 0.97 justified
10 levels.
The first step taken therefore was to set provisional cut-offs at
approximately equal intervals to create a 10-band scale. The second
step was to fine tune these cut-offs in relation to descriptor wording
in case there were threshold effects between levels. To check the
result, the content of descriptors for each category (e.g., comprehen-
sion in spoken interaction) at each level was broken up into elements
(e.g., qualities of the speech you can handle; degree of help required)
and displayed in a table for each category. This was in order to check
the plausibility of the progression and to see whether there was a
qualitative difference between the levels defined by the selected cut-
off points.
Appendix 3 shows the cut-offs on the logit scale between the 10
levels and the way those levels have been regrouped into 6 broader

levels for the Council of Europe Common Framework. The range on

the scale for each of the 10 levels is approximately 1 logit, though
this distance is slightly less at the centre of the scale (0.97–0.98) and
slightly more at the two ends (1.10). Since, as discussed above, MLE
logit scales are known to lose linearity towards the ends, a slight
distortion, even after the corrective action taken, was not surprising.
4 Year 2 French, German and English, 1995

The project followed the same three phases established in Year 1.
a Creating a descriptor pool for listening and reading: A similar

classification scheme for communicative activities, strategies, quali-
tative aspects of proficiency and socio-cultural competence was again
employed. Once again editing produced a pool of approximately
1,000 descriptors.
b Qualitative validation: consultation with teachers through work-

shops: Fourteen workshops with about 150 teachers were conduc-
ted, with sorting tasks as in Year 1. During these workshops, the
rejection rate was considerably higher than in Year 1. Both very glo-
bal statements and statements trying to define linguistic qualities of
texts which could be understood tended to be unpopular with teachers.
More concrete information about activities tended to be preferred.
c Quantitative validation: scaling descriptors through teacher

assessments: The same range of level as 1994 was covered with
four questionnaires, with a fifth very high level questionnaire which
in the event did not yield enough data for a satisfactory analysis.
Sixty-one of the 170 items employed on the four questionnaires which
could be analysed ‘anchored’ back to the 1994 English survey.
Parallel scale-construction analyses were run, one anchoring the 61
items from Year 1 back to their 1994 values in order to link the two
analyses onto the same scale, and the other allowing the 1994 items
to ‘float’ and establish new values. The Wright and Stone (1979)
anchor-checking technique was again employed to check stability of
anchor values between forms as in 1994. It was also exploited to
check the stability of the difficulty estimates of the descriptors com-
mon to the 1994 and 1995 surveys. The analysis procedure followed
in 1994 was repeated in 1995.
Following Bejar (1980), subanalyses were also run for the three
main content strands: interaction, listening and reading to see if these
would be better analysed separately. As Henning (1992) has
explained, tests can exhibit sufficient psychometric unidimensionality

without justifying assumptions about psychological unidimensional-

ity. In other words, dimensionality in, for example, Rasch, has a tech-
nical meaning related to the technical meaning of reliability as separ-
ability. Are the items strung out nicely along a plot – or are some
items pulling away to the sides because they do not really ‘fit’ the
main construct created by the data? Removal of such ‘outliers’ clari-
fies the picture and increases the scale length, the reliability and the
precision of the difficulty values for the items – in this case descrip-
tors. Unlike classical test theory (CTT: reliability theory) item
response theory (IRT: Rasch) does not say such outliers are necess-
arily bad items – but rather that they don’t belong here and should
perhaps be analysed separately to see if they build their own con-
struct.
In this project dimensionality problems showed themselves in one
or more of four ways:
1) statistical ‘misfit’ in the classic Rasch sense: either in terms of

the amount of misfit or the standardised residual (this had been
the case in 1994 with socio-cultural competence and with ‘inde-
pendence’ from interlocutor help and accommodation);
2) showing a consistent tendency to Differential Item Functioning
(DIF): to show apparently significant differences in the difficulty
values obtained from different groups of learners (reading);
3) showing a different slope to the scale when a separate analysis
was plotted against the main construct (again: reading);
4) showing a difference larger than standard error in difficulty esti-
mates when anchored to the 1994 construct or when analysed
separately (again: reading).
Thus in 1995, reading did not appear to ‘fit’ a construct dominated

by the overlapping concepts of speaking and interaction and needed
to be analysed separately – despite the fact that Rasch ‘fit’ statistics
did not indicate this. This then left the problem of how to equate the
reading scale to the main scale. The solution adopted was to analyse
reading together with listening (listening + reading = reception). The
37 calibrated listening items were then used as anchor items in order
to equate the two subsequent scales. The relationship between the two
scales proved in fact to be linear with a correlation of 0.98, but the
listening items were calibrated 0.31 logits lower when in company
with reading. The reading items were therefore equated to the
speaking/interaction/listening scale by increasing their values on the
reading and listening scale by 0.31.

IV Results
1 A calibrated scale of language proficiency
The scale of 10 levels was produced in the 1994 analysis, the process
being described in detail in North (1996). The central aim of the 1995
survey was to see if the 1994 scale values for descriptors would be
replicated in a survey focused mainly on French and German. This
is why the 1995 survey was anchored back to 1994 with 61 descrip-
tors. The difficulty values for the items in the 1994 construct (spoken
interaction and production) proved to be very stable. Only 8 of the
61 1994 descriptors reused in 1995 were interpreted in a significantly
different way – i.e., fell outside Wright and Stone’s 95% criterion
line. After the removal of those eight descriptors, the values of the
103 listening and speaking items used in 1995 (now including only
53 from 1994) correlated 0.99 (Pearson) when analysed a) entirely
separately from 1994 and b) with the 53 common items anchored to
their 1994 values. This is a very satisfactory consistency between the
two years when one considers that:
1) The 1994 difficulty values were based on judgements by 100
English teachers, whilst in 1995 only 46 of the 192 teachers
taught English, and only 20 of them had taken part in 1994. The
ratings dominating the 1995 construct were therefore those of the
French and German teachers.
2) The questionnaire forms used for data collection in 1994 and
1995 were completely different in terms of both content and
range of difficulty with four forms in 1995 covering the ground
covered by seven forms in 1994.
3) The majority of teachers in 1995 were using the descriptors in
French or German. Therefore it is possible that the problems with
the eight 1994 descriptors may have been at least partly caused
by inadequate translation.
2 A bank of classified, calibrated descriptors

The categories for which descriptors were successfully scaled are
shown in Figure 1.
Appendix 4 gives a subscale for comprehension in spoken interac-
tion. As mentioned earlier, the coherence of the classification of con-
tent was checked through comparing charts such as the one given as
Appendix 5. The degree of coherence within levels (i.e., across
tables) was remarkable and a clear progression (i.e., up tables) was
evident, with the suggestion of a qualitative change at the cut-off
points between levels. This point is taken in the discussion in Sec-
tion V.

Communicative activities
Listening Overall listening comprehension
Receptive Listening to announcements and instructions
Listening as a member of an audience
Listening to radio and audio recordings
Watching TV and film
Interactive Comprehension in spoken interaction
Reading Overall reading comprehension
Reading instructions
Reading for information
Reading for orientation (scanning)
Interaction: transactional Service encounters and negotiations
Information exchange
Interviewing and being interviewed
Notes, messages and forms
Interaction: interpersonal Conversation
Discussion
Personal correspondence
Production (spoken) Describing experience
(Sustained monologue) Putting a case
Processing and summarizing
Strategies
Receptive strategies Deducing meaning from context (only 2
descriptors)
Interaction strategies Taking the turn
Cooperating
Asking for clarification
Production strategies Planning
Compensating
Repairing and monitoring
Qualitative aspects of language proficiency
Pragmatic (language use) Fluency
Flexibility
Coherence
Thematic development
Precision
Linguistic (language resources)
Range (knowledge) General range
Vocabulary range
Accuracy (control) Grammatical accuracy
Vocabulary control
Figure 1 Categories scaled
V Discussion
1 Scaling
The stability of the scale difficulty values from two quite different
surveys (r = 0.99) suggests that the technical difficulties reported were
in fact overcome and that the items were satisfactorily scaled.
When one looks at the vertical scale of calibrated items, of which
an extract is given in Appendix 2, it is striking the extent to which

descriptors on similar issues land adjacent to each other although they

were used on different questionnaires. There are several pairs of this
sort visible in this extract: Nos. 259 and 221 at difficulty values of
approximately 1.5 (natural turn-taking, social fluency); Nos. 254 and
213 at approximately 1.00 (developing an argument); Nos. 195 and
245 at approximately 0.75 (explain a viewpoint/opinion).
Indeed, the levels produced by the cut-off points show a remarkable
consistency of key characteristics. Space does not permit a detailed
discussion of the whole scale, but taking three levels as an example:
Threshold is intended to represent the Council of Europe specifi-
cation for a visitor to a foreign country and is perhaps most categor-
ised by two features:
First, the ability to maintain interaction and get across what you
want to in a range of contexts:
• generally follow the main points of extended discussion around
him/her, provided speech is clearly articulated in standard dialect;
• give or seek personal views and opinions in an informal discussion
with friends;
• express the main point he/she wants to make comprehensibly;
• exploit a wide range of simple language flexibly to express much
of what he or she wants to;
• maintain a conversation or discussion but may sometimes be dif-
ficult to follow when trying to say exactly what he/she would
like to;
• keep going comprehensibly, even though pausing for grammatical
and lexical planning and repair is very evident, especially in
longer stretches of free production.
Secondly, the ability to cope flexibly with less straightforward situ-
ations in everyday life:
• cope with less routine situations on public transport;
• deal with most situations likely to arise when making travel
arrangements through an agent or when actually travelling;
• make a complaint;
• enter unprepared into conversations on familiar topics;
• take some initiatives in an interview/consultation (e.g. to bring
up a new subject) but is very dependent on interviewer in the
interaction;
• ask someone to clarify or elaborate what they have just said.
The next level, Threshold Plus, shows the same two concepts with
the addition of a number of descriptors which focus on the exchange
of detailed information:
• take messages communicating enquiries, explaining problems;

• provide concrete information required in an

interview/consultation (e.g. describe symptoms to a doctor) but
do so with limited precision;
• explain why something is a problem;
• summarise and give his/her opinion about a short story, article,
talk, discussion interview, or documentary and answer further
questions of detail;
• carry out a prepared interview, checking and confirming infor-
mation, though he/she may occasionally have to ask for repetition
if the other person’s response is rapid or extended;
• describe how to do something, giving detailed instructions;
• exchange accumulated factual information on familiar routine and
nonroutine matters within his/her field with some confidence.
The next level appears to represent a significant shift, offering some

justification for the new name Vantage. According to Trim (personal
communication), the intention is, as with Threshold and Waystage,
to find a name which hasn’t been used before and which symbolises
something central to the level concerned. In this case, the metaphor
is that having been progressing slowly but steadily across the inter-
mediate plateau, the learner finds he/she has arrived somewhere.
He/she acquires a new perspective and can look around him/her in
a new way. This concept does seem to be borne out to a considerable
extent by the descriptors calibrated here, which, as can be seen in
Appendix 2, represent quite a break with the content scaled so far.
At the lower end of the band there is a focus on effective argument:
• account for and sustain his/her opinions in discussion by provid-

ing relevant explanations, arguments and comments;
• explain a viewpoint on a topical issue giving the advantages and
disadvantages of various options;
• construct a chain of reasoned argument;
• develop an argument giving reasons in support of or against a
particular point of view;
• explain a problem and make it clear that his/her counterpart in a
negotiation must make a concession;
• speculate about causes, consequences, hypothetical situations;
• take an active part in informal discussion in familiar contexts,
commenting, putting point of view clearly, evaluating alternative
proposals and making and responding to hypotheses.
Running right through the band are two new focuses:

First, being able to more than hold your own in social discourse,
e.g.

• understand in detail what is said to him/her in the standard spoken

language even in a noisy environment;
• initiate discourse, take his/her turn when appropriate and end con-
versation when he/she needs to, though he/she may not always
do this elegantly;
• use stock phrases (e.g., ‘That’s a difficult question to answer’) to
gain time and keep the turn whilst formulating what to say;
• converse naturally, fluently and effectively;
• interact with a degree of fluency and spontaneity that makes reg-
ular interaction with native speakers quite possible without
imposing strain on either party;
• adjust to the changes of direction, style and emphasis normally
found in conversation;
• sustain relationships with native speakers without unintentionally
amusing or irritating them or requiring them to behave other than
they would with a native speaker.
Second, there is a new degree of language awareness, especially
self-monitoring:
• correct mistakes if they have led to misunderstandings;
• make a note of ‘favourite mistakes’ and consciously monitor
speech for them;
• generally correct slips and errors if he/she becomes conscious
of them;
• plan what is to be said and the means to say it, considering the
effect on the recipient(s).
2 Constructs and dimensionality

The connection between constructs and test dimensionality is not a
clear-cut one. ‘Unidimensionality (the notion of a single continuum)
is a relative concept and is constructed either to understand complex
phenomena or to facilitate decision-making’ (Andrich, 1988: 303). It
is ‘conceptual rather than factual, qualitative rather than quantitative’
(Wright and Linacre, 1989: 3), ‘a matter of degree’ (Choi and
Bachman, 1992: 74). Blais and Laurier (1995) consider that
unidimensionality in test data is a relative issue and demonstrate that
a test will appear uni-or multidimensional depending on the ‘test’ of
unidimensionality used. McNamara (1996: 278–79) agrees, illustrat-
ing this with the example of photographing a family group outside a
church: to get them all in you stand further back; to get the church
in as well you stand so far back as to lose a lot of detail.
Henning (1992: 8) argues that a group of highly correlating
psychological constructs such as those underlying most language tests

will tend to build a psychometrically unidimensional construct. As

both Henning (ibid.) and McNamara (1996: 217–81) point out, the
fact that underlying psychological constructs contributing to language
proficiency are multidimensional does not alter the fact that results
for the same individuals in the assessment data will tend to be inter-
correlated, building a psychometric dimension. As McNamara (ibid.:
271) points out, the assumption that this is so lies behind all tests –
and one might add all questionnaires. It is just that a Rasch analysis
helps to identify the extent to which this assumption is met. On the
other hand, as Carroll (1983: 83, 93) pointed out, psychometric unidi-
mensionality found in data merely means results are intercorrelated
because people are progressing in all the skills concerned in a fairly
even manner. It may or may not have implications about the organis-
ation of those skills or of the underlying psychological constructs
involved. As McNamara concludes (ibid.: 279) – and as Rasch
measurement specialists themselves maintain (Wright and Linacre,
personal communication) – Rasch results need to be interpreted in
the light of theory and models of the psychological constructs under-
lying test performance.
The coherence and plausibility of the scaling of content suggests
that a sufficiently robust psychometric dimension was refined during
the analysis of the data. According to Stansfield and Kenyon (1992:
10), less than 10% of the items should be misfitting before adequate
fit to the Rasch model and necessary psychometric unidimensionality
is claimed. By that criterion the scale produced in this study is on
the borderline of psychometric acceptability since slightly under 10%
of the surviving items in the 1994/95 speaking/interaction/listening
scale show misfit at the level Stansfield and Kenyon identify (1.5 for
amount of misfit; 2.0 for standardised residuals). Considering the real
multidimensionality implied by the range of content strands, edu-
cational sectors, language regions, local education systems, and
teacher experience, that is quite a satisfactory result.
The fact that reading on the one hand and written production
(sustained coherent text: essays, reports, etc.) on the other hand did
not fit the main construct suggests that this scaled construct is an
overlap of speaking, interaction and listening rather than global langu-
age proficiency.
The difficulties encountered with socio-cultural competence and
strategies suggest in addition that the construct was limited to com-
municative language proficiency rather than overall communicative
competence. This is regrettable, but again not that surprising.
Bachman and Palmer (1982) found evidence for the fact that
pragmatic/grammatical competence formed one trait, and that what
they called socio-linguistic competence, defined as ‘distinguishing of

registers, nativeness, and control of non-literal, figurative language

and relevant cultural allusions’ formed another (p. 450). From their
definition it would appear that they were using the term ‘socio-
linguistic’ in a broad sense which would subsume what many people
call socio-cultural competence.
The coherence and consistency of the scaling of the descriptors
which were calibrated is probably due to the fact that the descrip-
tors were:
1) organised and selected according to a principled taxonomy infor-
med by models of communicative language competence,
language and strategy use;
2) rigorously pre-tested with representative informants from the tar-
get users;
3) scrutinised during analysis and subjected to quality control on
the basis of item histories; these item histories logged a) fit; b)
stability of difficulty values in relation to subgroups of the popu-
lation (Differential Item Functioning); c) stability of difficulty in
relation to different proficiency levels, and finally d) counterin-
tuitive calibration.
It was in fact surprising how few of the problems which did occur
with dimensionality, with anchoring and with equating were signalled
by Rasch fit statistics. Logging detailed item histories in which fit
was just one factor proved far more useful in interpreting the data.
3 Differential Item Functioning

Constructing such item histories was the main reason for investigating
DIF. When items produce values in different contexts which are sig-
nificantly different in the statistical sense that they could not have
occurred by chance, this is known as Differential Item Functioning
(DIF).
An amount of DIF is to be expected in a study of this sort. What
is surprising is not that DIF occurred, but rather that the 1994 scale
values were replicated (r = 0.99) in the 1995 study with different
kinds of learners. This suggests that what DIF was occurring in
relation to individual descriptors was balancing itself out. This
hypothesis was confirmed by an analysis quantifying the effect of the
‘facets’ target language, educational sector, language region and
mother tongue, all of which showed nonsignificant amounts of vari-
ation. The fact that the aggregate scale of descriptors shows no bias
against any group supports the case for using it to create a common
scale. A common scale can then be used to profile genuine strengths
and weaknesses of different groups. However, although it may there-
fore be desirable to keep descriptors which show DIF in the profiling

bank, it may be wiser to avoid them when choosing descriptors for

a short holistic summary of each level. Hence the decision to log the
results of DIF studies in item histories.
The first technique used to investigate DIF was the same graphic
plotting technique used to check anchor stability (Wright and Stone,
1979: 92–96). In Year 1 (English) there was relatively little DIF
affecting only a handful of items. This DIF was also easy to explain
either in terms of undesirable curriculum effects (e.g., a lack of listen-
ing comprehension in the upper intermediate textbooks used in
French-speaking Switzerland) or in terms of genuine differences in
the real-life experience of different groups. In Year 2 the issue was
far more complex. Considerable DIF was indicated, but since the dif-
ferent demographic variables interact in complex ways in a relatively
unstructured sample of teachers and since the number of teachers
involved in the subsamples was small, conclusions were difficult to
draw. As De Jong (personal communication, 18.4.97) maintains, there
does seem to be a threshold for sample size below which the standard
error (the basis of the Wright and Stone technique) is grossly under-
estimated.
In order to investigate the DIF more thoroughly, a variation study
was commissioned from the statistics department of the University of
Fribourg. A variant of Wilks’ lambda (Johnson and Wichern, 1992:
246–48) was used to identify significance levels when all three or
four variables of a particular facet were compared simultaneously.
However, the more one went into the issue, the more it became appar-
ent that a number of fundamental assumptions made in both variation
studies were not being met in the data. The fundamental problem is
that 10 judgements by a teacher are not independent observations.
Therefore n (number of observations), already small for the popu-
lation subgroups, is exaggerated. The exaggeration of n causes the
Standard Error of Measurement (SEM) to be again underestimated.
Furthermore, the smallness of the sample of teachers means that it is
unlikely to have the normal distribution assumed by Wilks’ lambda,
again leading to exaggeration of DIF.
For these reasons a cautious approach was taken to the identifi-
cation of significance. The interpretation of the variation data had a
heuristic function in logging item histories to help decide which
descriptors to use in holistic summaries and which to confine to profi-
ling instruments. No claims of significant findings could be made in
relation to the characteristics of learner proficiency in the different
educational sectors and language regions involved.

VI Conclusion
There are those who consider that the development of common frame-
work scales should not be attempted before research has provided
adequate empirically validated descriptions of language proficiency
and of the language-learning process (Lantolf and Frawley, 1985;
1988; 1992). Spolsky (1993: 208) and Brindley (forthcoming) have
voiced similar concerns, Brindley concluding that:
rather than continue to proliferate scales which use generalised and empirically
unsubstantiated descriptors % it would perhaps be more profitable to draw
on SLA and LT research to develop more specific empirically-derived and
diagnostically-oriented scales of task performance which are relevant to parti-
cular purposes of language use in particular contexts and to investigate the
extent to which performance on these tasks taps common components of com-
petence (Brindley, ibid.: 22).
Yet previously, Brindley had accepted that ‘we cannot wait for the
emergence of empirically validated models of proficiency in order to
build up criteria for assessing learners’ second language performance’
(Brindley, 1989: 56). As Hulstijn puts it: ‘it should be obvious that
syllabus writers, teachers and testers cannot wait for full-fledged
theories of language proficiency to emerge from research laboratories.
In the absence of theories, they have to work with taxonomies which
seem to make sense even if they cannot be fully supported by a theor-
etical description’ (1985: 277).
The purpose of a common reference framework is to provide such
a taxonomy in response to a demand for this. The purpose of descrip-
tors of common reference levels is to provide a metalanguage of cri-
terion statements which people can use to roughly situate themselves
and/or their learners, in response to a demand for this. It is widely
recognised that the development of such a taxonomy entails a tension
between theoretical models developed by applied linguists (which are
incomplete) on the one hand and operational models developed by
practitioners (which may be impoverished) on the other hand (see
North, 1993b: 7; McNamara, 1995: 159–165; Chalhoub-Deville,
1997; and Brindley, forthcoming: 21).
Up until now a methodology for the development of such instru-
ments has been lacking. This project demonstrates one way in which
such an undertaking can be carried through in a principled fashion:
• comprehensive documentation of the experience and consensus in
the field of proficiency scales;
• classification of descriptors to a taxonomy informed by theoreti-
cal models;
• pre-testing of categories, formulations and translations to ensure

that the descriptors represent clear, useful, relevant, accessible,

stand-alone criterion statements;
• scaling of the descriptors with a measurement model;
• replication of the scale values.
The stability of the scale difficulty values obtained in relation to one
language (1994) and in relation to several languages (1995), plus the
stability of proficiency estimates for video samples rated in different
contexts (survey with questionnaire; conference with scale and grid),
suggests that an item-banking methodology can be used successfully
to develop a scale of descriptors in this way.
Some of the limits to such an endeavour have also been suggested.
To the extent that it is feasible, one needs to:
• start from theoretical models, but also ensure that representative
practitioners can distinguish between the categories, that ‘new cat-
egories’ (e.g., interaction strategies) can be seen as offering a
clear relative advantage and that popular categories (e.g., fluency)
can be located in the taxonomy;
• develop clear, positively formulated descriptors of observable
aspects of proficiency;
• ensure adequate, systematic anchoring in the data-collection
design;
• take steps to avoid and/or correct for distortions to the scaling
caused by imperfections in the measurement model used and/or
by response effects;
• identify those content strands which behave significantly differ-
ently to the main construct and those sub groups of learners who
behave significantly differently to the main population.
But however good descriptors are and however objectively they are
scaled, they are still subject to interpretation by raters in relation to
groups of learners. The provision of a scale with a degree of a priori
validity is only the first step in establishing an assessment framework.
Experience of a scale over a period of time, the relationship of the
scale to levels used by publishers and schools, training with standard-
ised performance samples, collection of collateral information from
tests, analyses of rater behaviour with programs like FACETS: all
these can contribute to effective implementation of a framework. But
that would be a subject for a future article.
Acknowledgements
The authors would like to express their gratitude to the US National
Foreign Language Center for the award of the Mellon Fellowship to

Brian North in 1992 which greatly facilitated the development of a

methodology for the project, to Mike Linacre (Chicago) and Alastair
Pollitt (Cambridge) for their help as technical advisers, and to Peter
Skehan for supervising the PhD which piloted the methodology.
VII References
Alderson, J.C. 1990a: Judgements in language testing, version three. Paper
presented at the 9th World Congress of Applied Linguistics, Thessa-
loniki, Greece, April.
—— 1990b: Testing reading comprehension skills (Part One). Reading in
a Foreign Language 6, 425–38.
—— 1991: Bands and scores. In Alderson, J.C. and North, B., editors,
Language testing in the 1990s, London: Modern English
Publications / British Council / Macmillan, 71–86.
Alderson, J.C. and Lukmani, Y. 1989: Cognition and reading: cognitive
levels as embodied in test questions. Reading in a Foreign Language
5, 253–270.
Alderson, J.C. and North, B., editors, 1991: Language testing in the 1990s.
London: Modern English Publications / British Council / Macmillan.
Andrich, D. 1988: Thurstone Scales. In Keeves J.P., editor, Educational
research, methodology and measurement: an international handbook,
Oxford / New York, Pergamon Press, 303–306.
Association of Language Testers in Europe (ALTE) 1998: ALTE Hand-
book of European Language examinations and examination systems:
descriptions of examinations offered and examinations administered
by members of the Association of Language Testers in Europe. Cam-
bridge: University of Cambridge Local Examinations Syndicate.
Bachman, L.F. 1990: Fundamental considerations in language testing.
Oxford: Oxford University Press.
Bachman L.F. and Palmer A. 1982: The construct validation of some
components of communicative proficiency. TESOL Quarterly 16,
449–64.
Bachman, L.F. and Savignon S.J. 1986: The evaluation of communicative
language proficiency: a critique of the ACTFL oral interview. Modern
Language Journal 70, 380–90.
Bejar, I. 1980: A procedure for investigating the unidimensionality of
achievement tests based on item parameter estimates. Journal of Edu-
cational Measurement 17, 283–96.
Blais, J.-G. and Laurier, M. 1995: The dimensionality of a placement test
from several analytic perspectives. Language Testing 12, 72–98.
Brindley, G. 1986: The assessment of second language proficiency: issues
and approaches. Adelaide: National Curriculum Resource Centre.
—— 1989: Assessing achievement in the learner-centred curriculum,
NCELTR Research Series. Sydney: Macquarie University.
—— 1991: Defining language ability: the criteria for criteria. In Anivan, S.,
editor, Current developments in language testing, Singapore: Regional
Language Centre.

—— forthcoming: Describing language development? Rating scales and

second language acquisition. In Bachman, L.F. and Cohen, A.D., edi-
tors, Interfaces between SLA and language testing research. Cam-
bridge: Cambridge University Press.
Brown, A., Elder, C., Lumley, T., McNamara, T. and McQueen, J. 1992:
Mapping abilities and skills levels using Rasch techniques. Paper
presented at the 14th Language Testing Research Colloquium, Van-
couver.
Brumfit, C.J. 1984: Communicative methodology in language teaching: the
roles of fluency and accuracy. Cambridge: Cambridge University
Press.
—— 1987: Concepts and categories in language teaching methodology.
AILA Review 4, 25–31.
Camilli, G. 1988: Scale shrinkage and the estimation of latent distribution
parameters. Journal of Educational Statistics 13, 227–41.
Canale, M. and Swain, M. 1980: Theoretical bases of communicative
approaches to second language teaching and testing. Applied Linguis-
tics 1, 1–47.
Carroll, B.J. and West, R. 1989. ESU (English-speaking union) framework.
Performance scales for English language examinations. London:
Longman.
Carroll, J.B. 1983: Psychometric theory and language testing. In Oller,
J.W., editor, Issues in language testing research. Rowley, MA: New-
bury House, 80–107.
Chalhoub-Deville, M. 1997: Theoretical models, assessment frameworks
and test construction. Language Testing 14, 3–22.
Choi, I.C. and Bachman, L.F. 1992: An investigation of the adequacy of
three IRT models for data from two EFL reading tests. Language Test-
ing 9, 51–78.
Clark, J.L. 1985: Curriculum renewal in second language learning: an over-
view. Canadian Modern Language Review 42, 342–60.
Council of Europe 1992: Transparency and coherence in language learning
in Europe: objectives, assessment and certification. Proceedings of the
intergovernmental symposium held at Rüschlikon, November 1991
(North, B., editor). Strasbourg: Council of Europe.
—— 1996: Modern languages: learning, teaching, assessment. A common
European framework of reference. Draft 2 of a framework proposal.
CC-LANG (95) 5 rev IV. Strasbourg: Council of Europe.
—— 1997: European language portfolio. Proposals for development. CC-
LANG (97)1. Strasbourg: Council of Europe.
Cziko, G. 1984: Some problems with empirically-based models of com-
municative competence. Applied Linguistics 5, 23–38.
Dandonoli, P. and Henning, G. 1990: An investigation of the construct
validity of the ACTFL proficiency guidelines and oral interview pro-
cedure. Foreign Language Annals 23, 11–22.
De Jong, H.A.L. 1990: Response to Masters: linguistic theory and psycho-
metric models, in De Jong, H.A.L. and Stevenson D.K., Individualising

the assessment of language abilities, Cleveland: Multilingual Matters,

71–82.
Farhady, H. 1982: Measures of language proficiency from the learner’s
perspective. TESOL Quarterly 16, 43–59.
Fillmore, C. 1979: On fluency. In Fillmore, C., Kempler, D. and Wang,
W., editors, Individual differences in language ability and language
behaviour, New York: Academic Press, 85–101.
Fulcher, G. 1987: Tests of oral performance: the need for data-based cri-
teria. ELT Journal 41, 287–91.
—— 1993: The construction and validation of rating scales for oral tests in
English as a foreign language. PhD thesis, University of Lancaster.
Griffin, P.E. 1989: Monitoring proficiency development in language. Paper
presented at the Annual Congress of the Modern Language Teachers
Association of Victoria, Monash University, 10–11 July.
—— 1990a: Profiling literacy development: monitoring the accumulation of
reading skills. Australian Journal of Education 34, 290–311.
—— 1990b: Literacy rating scales: literacy: learning for life. Unpublished
paper, Assessment Research Centre, Phillips Institute of Technology.
Griffin, P.E. and McKay P. 1992: Assessment and reporting in the ESL
Language and Literacy in Schools Project. In National Languages and
Literacy Institute of Australia, ESL development: language and liter-
acy in schools project. Volume 2: Documents on bandscale develop-
ment and language acquisition. Canberra: National Languages and Lit-
eracy Institute of Australia. Cited in Brindley, G., forthcoming,
Describing language development? Rating scales and second language
acquisition. In Bachman, L.F. and Cohen, A.D., editors, Interfaces
between SLA and language testing research, Cambridge: Cambridge
University Press.
Henning, G. 1992: Dimensionality and construct validity of language tests.
Language Testing 9, 1–11.
Hulstjin, J.H. 1985: Testing second language proficiency with direct pro-
cedures. A response to Ingram. In Hyltensstam K. and Pienemann M.,
editors, Modelling and assessing second language development, Cle-
vedon: Multilingual Matters, 277–82.
Jaeger, R.M. 1976: Measurement consequences of selected standard setting
models. Florida Journal of Educational Research 18, 22–27.
—— 1989: Certification of student competence. In Linn, R.L., editor, Edu-
cational measurement (3rd edn), New York: American Council on
Education / Macmillan, 485–514.
Johnson, R.A. and Wichern, D.W. 1992: Applied multivariate statistical
analysis (3rd edn). Englewood Cliffs: Prentice Hall.
Jones, N. 1993: An item bank for testing English language proficiency:
using the Rasch model to construct an objective measure. PhD thesis,
University of Edinburgh.
Keeves J.P., editor, 1988: Educational research, methodology and measure-
ment: an international handbook. Oxford / New York: Pergamon
Press.
Kellerman, E., Bongearts, T. and Poulisse, N. 1987: Strategy and system

in L2 referential communication. In Ellis R., Second language acqui-

sition in context, Englewood Cliffs: Prentice Hall, 100–12.
Kenyon, D.M. and Stansfield, C. W. 1992: Examining the validity of a
scale used in a performance assessment from many angles using the
many-faceted Rasch model. Paper presented at the Annual Meeting of
the American Educational Research Association, San Francisco, April.
Washington, DC: Center for Applied Linguistics.
Languages Lead Body 1992: National standards for languages: units of
competence and assessment guidance. UK Languages Lead Body,
July.
Lantolf, J. and Frawley, W. 1985: Oral proficiency testing: a critical analy-
sis. Modern Language Journal 69, 337–45.
—— 1988: Proficiency, understanding the construct. Studies in Second Lan-
guage Acquisition 10, 181–96.
——1992: Rejecting the OPI again: a response to Hagen. ADFL Bulletin
23, 34–37.
Lennon, P. 1990: Investigating fluency in EFL: a quantitative approach.
Language Learning, 40, 387–417.
Linacre, J.M. 1989: Multi-faceted measurement. Chicago: MESA Press.
Liskin-Gasparro, J.E. 1984: The ACTFL proficiency guidelines: a histori-
cal perspective. In Higgs, T.C., editor, Teaching for proficiency, the
organising principle. Lincolnwood, IL: National Textbook Company,
11–42.
Lowe, P. 1983: The IRL oral interview: origins, applications, pitfalls and
implications. Unterrichtspraxis 16, 230–44.
Lumley T. 1993: Reading comprehension sub-skills: teachers’ perceptions
of content in an EAP test. Melbourne Papers in Language Testing 2,
25–55.
Luoma, S. 1993: Validating the (Finnish) certificates of foreign language
proficiency. Paper presented at the 15th Language Testing Research
Colloquium, Cambridge and Arnhem, 2–4 August.
Masters, G. 1982: A Rasch model for partial credit scoring. Psychometrika
47, 149–74.
Matthews, M. 1990: The measurement of productive skills. Doubts concern-
ing the assessment criteria of certain public examinations. ELT Journal
44, 117–20.
McNamara, T. 1995: Modelling performance: opening Pandora’s box.
Applied Linguistics 16, 159–79.
—— 1996: Measuring second language performance. Harlow, Essex:
Addison Wesley Longman.
Milanovic, M. and Saville, N., editors, 1996: Performance testing, cognition
and assessment. Cambridge: University of Cambridge Local Examin-
ations Syndicate.
Milanovic, M., Saville, N., Pollitt, A. and Cook, A. 1992: Developing and
validating rating scales for CASE: theoretical concerns and analyses.
Paper presented at the 14th Annual Language Testing Research Collo-
quium. In Cumming, A. and Berwick, R., editors, 1996, Validation in
language testing, Clevedon: Multilingual Matters, 15–38.

Mullis, I.V.S. 1980: Using the primary trait system for evaluating writing.
Manuscript No. 10-W-51, Educational Testing Service, Princeton, NJ,
reprinted December 1981.
North, B. 1991: Standardisation of continuous assessment grades. In Alder-
son, J.C. and North, B., editors, Language testing in the 1990s, Lon-
don: Modern English Publications / British Council / Macmillan,
167–77.
—— 1993a: Transparency, coherence and washback in language assessment.
In Sajavaara, K., Takala, S., Lambert, D. and Morfit, C., editors, 1994,
National foreign language policies: practices and prospects. Univer-
sity of Jyväskyla: Institute for Education Research, 157–93.
—— 1993b: The development of descriptors on scales of proficiency: per-
spectives, problems, and a possible methodology. NFLC Occasional
Paper. Washington, DC: National Foreign Language Center, April.
—— 1994: Scales of language proficiency: a survey of some existing sys-
tems. Strasbourg: Council of Europe.
—— 1995: The development of a common framework scale of descriptors
of language proficiency based on a theory of measurement. System 23,
445–65.
—— 1996: The development of a common framework scale of descriptors
of language proficiency based on a theory of measurement. Unpub-
lished PhD thesis, Thames Valley University.
—— 1997: Perspectives on language proficiency and aspects of competence.
Language Teaching 30, 93–100.
Oller, J.W., editor, 1983: Issues in language testing research. Rowley, MA:
Newbury House.
Oscarson, M. 1978/1979: Approaches to self-assessment in foreign
language learning. Strasbourg: Council of Europe, 1978: Oxford, Per-
gamon, 1979.
—— 1984: Self-assessment of foreign language skills: a survey of research
and development work. Strasbourg: Council of Europe.
Pienemann, M. and Johnston, M. 1987: Factors influencing the develop-
ment of language proficiency. In Nunan, D., editor, Applying second
language acquisition research, Adelaide: National Curriculum
Resource Centre, 89–94.
Pollitt, A. 1991: Response to Alderson: bands and scores. In Alderson, J.C.
and North, B., editors, Language testing in the 1990s, London: Modern
English Publications / British Council / Macmillan, 87–94.
—— 1993 Reporting reading test results in grades. Paper presented at the
15th Language Testing Research Colloquium, Cambridge and Arnhem,
2–4 August.
Pollitt, A. and Hutchinson, C. 1987: Calibrating graded assessments; Rasch
partial credit analysis of performance in writing. Language Testing 4,
72–92.
Pollitt, A. and Murray, N.L. 1993/1996: What raters really pay attention to.
Paper presented at the 15th Language Testing Research Colloquium,
Cambridge and Arnhem, 2–4 August 1993. In Milanovic, M. and

Saville, N. editors, 1996, Performance testing, cognition and assessment,

Cambridge: University of Cambridge Local Examinations Syndicate.
Richterich, R. and Schneider, G. 1992: Transparency and coherence: why
and for whom? In Council of Europe, Transparency and coherence in
language learning in Europe: objectives, assessment and certification,
proceedings of the intergovernmental symposium held at Rüschlikon,
November 1991, Strasbourg: Council of Europe, 43–50.
Sang, B., Schmitz, H.J., Vollmer, J. and Roeder P.M. 1986: Models of
second language competence: a structural equation approach.
Schärer, R. 1992: A European language portfolio – a possible format. In
Council of Europe, Transparency and coherence in language learning
in Europe: objectives, assessment and certification, proceedings of the
intergovernmental symposium held at Rüschlikon, November 1991,
Strasbourg: Council of Europe, 140–146. Reprinted in Schärer, R. and
North, B. 1992: Towards a common European framework for reporting
language competency. NFLC Occasional Paper. Washington, DC:
National Foreign Language Center.
Schneider, G. and North, B. forthcoming: Assessment and self-assessment
of foreign language proficiency at cross-over-points in the Swiss edu-
cational system: transparent and coherent description of foreign lang-
uage competence as assessment, reporting and planning instruments.
Bern: National Science Research Council.
Skehan, P. 1984: Issues in the testing of English for specific purposes.
—— 1995: Analysability, accessibility and ability for use. In Cook, G. and
Seidelhofer, S., editors, Principle and practice in applied linguistics.
Oxford: Oxford University Press.
Smith, P.C. and Kendall, J.M. 1963: Retranslation of expectations: an
approach to the construction of unambiguous anchors for rating scales.
Journal of Applied Psychology 47, 149–54.
Spolsky, B. 1986: A multiple choice for language testers. Language Testing
3, 147–58.
—— 1989: Conditions for second language learning. Introduction to a gen-
eral theory. Oxford: Oxford University Press.
—— 1993: Testing and examinations in a national foreign language policy.
In Sajavaara, K., Takala, S., Lambert, D. and Morfit, C., editors, 1994,
National foreign language policies: practices and prospects, Univer-
sity of Jyväskyla: Institute for Education Research, 194–214.
Stansfield, C.W. and Kenyon, D. 1992: Comparing scales of speaking tasks
by language teachers and by the ACTFL guidelines. Paper presented
at the 14th Language Testing Research Colloquium Vancouver. In
Cumming, A. and Berwick, R., editors, 1996, Validation in language
testing, Clevedon, Avon: Multilingual Matters, 124–153.
Tarone, E. 1980 Communication strategies, foreigner talk and repair in
interlanguage. Language Learning 30, 417–31.
—— 1983/1981 Some thoughts on the notion of ‘Communication strategy’.
In Faerch, C. and Kasper, G., editors, 1983, Strategies in interlanguage

communication, London: Longman, 63–68 (originally published in

TESOL Quarterly, September 1981).
Thomas, J. 1983: Cross-cultural pragmatic failure. Applied Linguistics 4,
91–112.
Thurstone, L.L. 1928: Attitudes can be measured. American Journal of
Sociology 33, 529–54. Cited in Wright, B.D. and Masters, G., 1982,
Rating scale analysis, Chicago: Mesa Press, 10–15.
Trim, J.L.M. 1978: Some possible lines of development of an overall struc-
ture for a European unit/credit scheme for foreign language learning
by adults. Strasbourg: Council of Europe.
Upshur, J.A. and Homburg, T.J. 1983: Some relations among language
tests at successive ability levels. In Oller, J.W., editor, Issues in
language testing research, Rowley, MA: Newbury House.
Upshur, J.A. and Turner, C.E. 1995: Constructing rating scales for second
language tests. ELT Journal 49, 3–12.
Van Ek, J.A. 1986: Objectives for foreign language teaching: Volume I:
scope. Strasbourg: Council of Europe.
Warm, T.A. 1989: Weighted likelihood estimation of ability in item
response theory. Psychometrika 54, 427–50.
Wilds, C.P. 1975: The oral interview test. In Spolsky, B. and Jones, R.,
Testing language proficiency. Washington DC: Center for Applied
Linguistics, 29–44.
Woods, A. and Baker, R. 1985: Item response theory. Language Testing
2, 117–40.
Wright, B.D. and Grosse, M. 1993: How to set standards, Rasch Measure-
ment, Transactions of the Rasch Measurement Special Interest Group
of the American Educational Research Association 7/3, 315–16.
Wright, B.D. and Linacre, J.M. 1987: Research notes, Rasch Measurement,
Transactions of the Rasch Measurement Special Interest Group of the
American Educational Research Association 1/1, 2–3.
—— 1989: Research notes, Rasch Measurement, Transactions of the Rasch
Measurement Special Interest Group of the American Educational
Research Association 3/3, 1–3.
Wright, B.D. and Masters, G. 1982: Rating scale analysis. Chicago:
Mesa Press.
Wright, B.D. and Stone, M.H. 1979: Best test design. Chicago: Mesa Press.
Appendix 1 Sample data-collection questionnaire

QUESTIONNAIRE T1
The teacher
Teacher’s name: .....................................................................................................
The class
Sector: 왏 Lower Secondary 왏 Upper Secondary
왏 Commercial and Professional Schools
왏 Adult Education 왏 Other: please specify .............

Level of English Year of English study/Level: ...............................................

The learner
Name: ............................................................................. Sex: M: 왏 F: 왏
Mother tongue: ....................................... Age: ........................................
Please rate the learner for each of the 50 items on the questionnaire on the
following pages using the following scale. Please cross the appropriate num-
ber next to each item: ×
0 This describes a level which is definitely beyond his/her capabilities. Could
not be expected to perform like this.
1 Could be expected to perform like this provided that circumstances are
favourable, for example if he/she has some time to think about what to say,
or the interlocutor is tolerant and prepared to help out.
2 Could be expected to perform like this without support in normal circum-
stances.
3 Could be expected to perform like this even in difficult circumstances, for
example when in a surprising situation or when talking to a less co-operat-
ive interlocutor.
4 This describes a performance which is clearly below his/her level. Could
perform better than this.
Spoken tasks
1 Can deal with common aspects of everyday living 0 1 2 3 4

such as travel, lodgings, eating and shopping.
2 Can use public transport: buses, trains and taxis, ask 0 1 2 3 4
for basic information, ask and give directions, and
buy tickets.
3 Can cope with less routine situations in shops, post 0 1 2 3 4
office, bank, e.g., asking for a larger size, returning
an unsatisfactory purchase.
4 Can negotiate a price, e.g., for a second-hand car, 0 1 2 3 4
bike.
5 Can get all the information needed from a tourist 0 1 2 3 4
office, as long as it is of a straightforward,
nonspecialised nature.
6 Can give simple directions and instructions, e.g., 0 1 2 3 4
explain how to get somewhere; how to play a game.
7 Can provide concrete information required in an 0 1 2 3 4
interview/consultation (e.g., describe symptoms to a
doctor) but does so with limited precision.
8 Can take some initiatives in an interview/consultation 0 1 2 3 4
(e.g., to bring up a new subject) but is very
dependent on interviewer in the interaction.

Please rate the learner for each item using the scale defined on the first
page. Please cross the appropriate number next to each item: ×
0 1 2 3 4
Describes a Yes, in Yes, in normal Yes, even in Clearly
level beyond favourable circumstances difficult better
his/her circumstances circumstances than this
capabilities
9 Can make a phone call to book a hotel, order a 0 1 2 3 4

book, etc., coping with the switchboard, wrong
numbers, bad lines, etc. and deliver a short prepared
statement.
10 Can enter unprepared into conversations on familiar 0 1 2 3 4
topics.
11 Can make an introduction and use basic greeting and 0 1 2 3 4
leave-taking expressions.
12 Can maintain a conversation or discussion but may 0 1 2 3 4
sometimes be difficult to follow when trying to say
exactly what he or she would like to.
13 Can express and respond to feelings such as surprise, 0 1 2 3 4
happiness, sadness, interest and indifference.
14 Can discuss topics of interest. 0 1 2 3 4
15 Can discuss in a simple way how to organise an 0 1 2 3 4
event, e.g., an outing.
16 Can seek and respond to opinion on familiar subjects. 0 1 2 3 4
17 Can express belief, opinion, agreement and 0 1 2 3 4
disagreement.
18 Can give brief comments on others’ views during 0 1 2 3 4
discussion.
19 Can explain a problem and make it clear that his/her 0 1 2 3 4
counterpart in a negotiation must make a concession.
20 Can make his/her opinions and reactions understood 0 1 2 3 4
as regards solutions to problems or practical questions
of where to go, what to do.
21 Can describe their family, living conditions, 0 1 2 3 4
educational background, present or most recent job.
22 Can describe events, real or imagined. 0 1 2 3 4
23 Can narrate a story. 0 1 2 3 4
24 Can briefly give reasons and explanations for 0 1 2 3 4
opinions, plans and actions.
Comprehension
25 Can generally understand clear, standard speech on 0 1 2 3 4
familiar matters directed at him/her, provided he or
she can ask for repetition or reformulation from time
to time.

0 1 2 3 4
capabilities
26 Can understand key words and phrases in 0 1 2 3 4

conversations between native speakers and use them
to follow the topic.
27 Native speakers need to keep their language clear 0 1 2 3 4
and simple, as far as possible avoiding use of
idiomatic expressions and complex information
structuring in order for him/her to understand.
28 Often needs a good deal of paraphrase and 0 1 2 3 4
explanation when speaker is attempting to convey or
obtain detailed information in order to be able to
understand.
Interaction strategies
29 Can regularly join in a conversation, but may often 0 1 2 3 4
do so inappropriately.
30 Can repeat back part of what someone has said to 0 1 2 3 4
confirm mutual understanding and help keep the
development of ideas on course.
31 Can ask for clarification about key words not 0 1 2 3 4
understood using stock phrases.
32 Can rehearse and try out new combinations and 0 1 2 3 4
expressions, inviting feedback.
33 Can use a simple word meaning something similar to 0 1 2 3 4
the concept he or she wants to convey and invites
‘correction’.
34 Can define the features of something concrete for 0 1 2 3 4
which he or she can’t remember the word.
35 Is conscious of when he or she mixes up tenses, 0 1 2 3 4
words and expressions, and tries to self-correct.
Qualities of spoken performance

36 Can keep going comprehensibly, even though pausing 0 1 2 3 4
for grammatical and lexical planning and repair is
very evident, especially in longer stretches of free
production.
37 Can exploit a wide range of simple language flexibly 0 1 2 3 4
to express much of what he or she wants to.

0 1 2 3 4
capabilities
38 Can cope with unpredictable elements in familiar 0 1 2 3 4

situations.
39 Can interact with reasonable ease in structured 0 1 2 3 4
situations, given some help, but participation in open
discussion is fairly restricted.
40 Can link a series of shorter, discrete simple elements 0 1 2 3 4
into a connected discourse.
41 Has a repertoire of basic language which enables him 0 1 2 3 4
or her to deal with everyday situations with
predictable content, though he or she will generally
have to compromise the message and search for
words.
42 Has sufficient vocabulary to conduct routine, 0 1 2 3 4
everyday transactions involving familiar situations
and topics.
43 Can use reasonably accurately a repertoire of 0 1 2 3 4
frequently used ‘routines’ and patterns associated with
more predictable situations.
44 Can use some simple structures correctly, but still 0 1 2 3 4
systematically makes basic mistakes.
45 Stress and intonation are very foreign, but can be 0 1 2 3 4
followed okay nearly all the time.
46 Beginning to show awareness of the rules and 0 1 2 3 4
conventions governing social interaction but still
frequently fails to recognise nuances in tone of voice
and to respond appropriately.
47 Can make a crude formal/informal distinction of 0 1 2 3 4
register, but often mixes formal and informal
elements in his/her speech.
Writing tasks
48 Can fill in uncomplicated forms with personal details 0 1 2 3 4
name, address, nationality, marital status.
49 Can write simple notes to friends 0 1 2 3 4
50 Can write personal letters to a friend, host, etc. giving 0 1 2 3 4
and asking for news.

Appendix 2 Extract from the 1994 vertical scale of calibrated

descriptors
Column headings are explained at the end of the table.
Logit No. Descriptor Q Err Fit Std
1.69 222 Can produce stretches of I .26 .77 −1.21

language with a fairly even
tempo; although he/she can be
hesitant as he or she searches
for patterns and expressions,
there are few noticeably long
pauses.
1.64 199 Can adjust to the changes of T2/I .26 .82 −.96
direction, style and emphasis
normally found in
conversation.
1.63 250 Can sustain relationships with E .26 .82 −.97
native speakers without
unintentionally amusing or
irritating them or requiring
them to behave other than
they would with a native
speaker.
1.58 259 Can initiate, maintain and end E .26 1.40 1.83
discourse naturally with
effective turn-taking.
1.5 221 Can interact with a degree of I .26 .91 −.45
fluency and spontaneity that
makes regular interaction with
native speakers quite possible
without imposing strain on
either party.
1.36 197 Can help the discussion along T2/I .26 1.62 2.60
on familiar ground confirming
comprehension, inviting others
in, etc.
1.3 206 Can take an active part in I .26 .62 −2.23
informal discussion in familiar
contexts, commenting, putting
point of view clearly,

evaluating alternative
proposals and making and
responding to hypotheses.
1.3 263 Can generally correct slips E .26 1.25 1.18
and errors if he/she becomes
conscious of them.
1.24 235 Can plan what is to be said I/E .26 .75 −1.38
and the means to say it,
considering the effect on the
recipient(s).
1.23 196 Can use stock phrases (e.g., T2/I .26 1.07 .33
‘That’s a difficult question to
answer’) to gain time and
keep the turn whilst
formulating what to say.
1.23 204 Can engage in extended I .26 .63 −2.11
conversation in a clearly
participatory fashion on most
general topics.
1.23 255 Can speculate about causes, E .26 .67 −1.88
consequences, hypothetical
situations.
1.22 135 Can explain a problem and T1 .20 .89 −.77
make it clear that his/her
counterpart in a negotiation
must make a concession.
1.16 218 Can initiate discourse, take I .26 .65 −2.02
his/her turn when appropriate
and end conversation when
he/she needs to, though
he/she may not always do this
elegantly.
1.11 256 Can understand in detail what E .27 1.05 .23
is said to him/her in the
standard spoken language even
in a noisy environment.
1.09 254 Can develop an argument E .27 .91 −.46
giving reasons in support of or
against a particular point of
view.

1.03 181 Can make a note of ‘favourite T2 .24 1.17 .87

mistakes’ and consciously
monitor speech for it/them.
1.03 213 Can construct a chain of I .26 .95 −.25
reasoned argument.
1.02 248 Can converse naturally, E .27 .84 −.85
fluently and effectively.
0.76 220 Can correct mistakes if they I .26 1.02 .08
have led to misunderstandings.
0.75 195 Can explain a viewpoint on a T2/I .26 .58 −2.50
topical issue giving the
advantages and disadvantages
of various options.
0.74 245 Can account for and sustain E .27 .68 −1.71
his/her opinions in discussion
by providing relevant
explanations, arguments and
comments.
0.72 VANTAGE CUT-OFF
0.65 151 Can cope with less routine T1/ .24 1.18 1.02
situations in shops, post office, T2
bank, e.g., asking for a larger
size, returning an
unsatisfactory purchase.
0.64 231 Can exchange accumulated I/E .27 1.11 .48
factual information on familiar
routine and nonroutine matters
within his/her field with some
confidence.
0.57 201 Can describe how to do I .26 .63 −2.18
something, giving detailed
instructions
0.43 202 Can carry out a prepared I .26 .51 −3.07
interview, checking and
confirming information,
though he/she may
occasionally have to ask for
repetition if the other person’s
response is rapid or extended.

0.33 177 Can extrapolate the meaning T2 .24 1.46 2.37

of occasional unknown words
from the context and deduce
sentence meaning provided the
topic discussed is familiar.
Explanation of column headings in Appendix 2:
Logit: The estimated difficulty value of the item on the common logit scale
created by linking the scales for the different questionnaires. The cut-off
point for Vantage Level can be seen at 0.72 (see Appendix 3 for other cut-
offs).
No.: The serial number of the item.
Descriptor: The English formulation of the descriptor.
Q: The questionnaire(s) on which the descriptor was used.Where two
questionnaires are indicated, the item was an ‘anchor item’ common to two
adjacent questionnaires. In Year 1 there were 7 questionnaires:
B (Breakthrough)
W1 (Waystage 1)
W2 (effectively parallel at Waystage)
T1 (Threshold 1)
T2 (Threshold 2: slightly higher that T1)
I (Independence)
E (Effectiveness)
Err: The standard error of measurement (SEM) expressed in logits, a factor
of the number of judgements the values are based upon.
Fit: The mean-square fit statistic, indicating the actual amount of misfit.
Perfect fit is 1.0. Less than 1.0 indicates overfit. Over 1.5 is significantly
misfitting (e.g. No. 197 at 1.36 logits). Items misfitting over 1.75 were
discarded.
Std: The standardised residual misfit, a measure of plausiblility indicating
the extent to which the data fits the model. Perfect fit is 0.0. Overfit is
expressed as a minus, −2.0 being significant overfit.

Appendix 3 Cut-off points dividing band levels on the logit

scale
Finer levels Council of Cut-off on Range on logit
(Swiss) Europe logit scale scale
Mastery M Mastery 3.90

Effectiveness E Effectiveness 2.80 1.10
Vantage Plus V+ 1.74 1.06
Vantage V Vantage 0.72 1.02
Threshold Plus T+ −0.26 0.98
Threshold T Threshold −1.23 0.97
Waystage Plus W+ −2.21 0.98
Waystage W Waystage −3.23 1.02
Breakthrough B Breakthrough −4.29 1.06
Tourist Tour −5.39 1.10
Appendix 4 Subscale for one category: comprehension in

spoken interaction
Level Logit No. Descriptor Source
M No descriptor calibrated at this

level
E No descriptor calibrated at this

level
V+ 2.56 257 Can keep up with an animated North 7
conversation between native
speakers.
V 1.11 256 Can understand in detail what is North 6/

said to him/her in the standard Hofman V/
spoken language even in a noisy London 5/
environment. FSI 3
T+ 0.33 177 Can extrapolate the meaning of ASLPRI+
occasional unknown words from
the context and deduce sentence
meaning provided the topic
discussed is familiar.

T −1.04 176 Can generally follow the main North 4 /

points of extended discussion AMES 3 /
around him/her, provided speech is Milan 3
clearly articulated in standard
dialect.
T −1.09 215 Can follow clearly articulated IELTS 5
speech directed at him/her in edited
everyday conversation, though will
sometimes have to ask for
repetition of particular words and
phrases.
W+ −1.83 101 Can understand enough to manage AMES 3

simple, routine exchanges without
undue effort.
W+ −2.13 85 Can generally identify the topic of North 3
discussion around him/her which is
conducted slowly and clearly.
W+ −2.13 138 Can generally understand clear, Wilkins 3 /
standard speech on familiar matters ESU 4 /
directed at him/her, provided Eurocentres 5
he/she can ask for repetition or / Brit. NLS
reformulation from time to time. edited /
North 4
W −2.72 67 Can understand what is said Eurocentres 3
clearly, slowly and directly to / Finnish 3 /
him/her in simple everyday Hofman II
conversation; can be made to
understand, if the speaker can take
the trouble.
B −3.5 23 Can understand everyday Milan 1 /

expressions aimed at the ESU 1
satisfaction of simple needs of a
concrete type, delivered directly to
him/her in clear, slow and repeated
speech by a sympathetic speaker.
B −3.64 24 Can follow speech which is very North 1
slow and carefully articulated, with
long pauses for him/her to
assimilate meaning.

B −4.12 22 Can understand questions and Eurocentres 1

instructions addressed carefully and
slowly to him/her and follow short,
simple directions.
Source scales referred to in Appendix 4:

AMES Adult Migrant Education Program Scale (Australian), 1989
edition
ASLPR Australian Second Language Proficiency Ratings, 1982 edition
Brit. NLS British National Language Standards, 1993 edition
ESU English-Speaking Union Framework yardsticks, 1989 edition
Eurocentres Eurocentres Scale of Language Proficiency, 1993 edition
FSI US Foreign Service Institute scale, 1975 edition
Finnish Finnish Scale of Language Proficiency, 1992 edition
Hofman Hofman, Th. R. 1974: Levels of competence in oral
communication. ED 123 869
IELTS International English Language Testing Service, draft listening
proficiency band scales, internal paper, Brinley and Nunan, 1992
London University of London School Examination Board: Certificate of
Attainment 1987
Milan Milanese Adult Education Institute, 1986 edition
North European Language Portfolio Mock-up: Rüschlikon Symposium
1991
Wilkins Proposals for Level Definitions for a (European) Unit/Credit
Scheme 1978
Appendix 5 Element chart for the subscale: comprehension in

spoken interaction
Level Setting Speech Help
M
3,9
E
2.8
V+ • animated
1.74 conversation
between native
speakers
V • even noisy • standard spoken

0.72 environments language

Level Setting Speech Help
T+ • topics which are • none; extrapolate

−0.26 familiar unknown words;
deduce meaning
T • extended • clearly articulated As Waystage +

−1.23 everyday standard speech
conversation
W+ • simple, routine As Waystage

−2.21 exchanges • ask for repetition
• familiar matters and reformulation
W • simple everyday • clear, slow, • if partner will

−3.23 conversation standard take the trouble
• directed at
him/her
B • everyday • very clear, slow, • sympathetic

−4.29 expressions aimed carefully articulated partner
at the satisfaction repeated speech • long pauses to
of needs of a directed at him/her assimilate meaning
concrete type
• short, simple
questions and
instructions
Note:
Figures in the Level column indicate the cut-offs on the logit scale (see
Appendix III).

10 1 1 918 6674

Transféré par

Informations du document

Titre original

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

10 1 1 918 6674

Transféré par

Droits d'auteur :

Formats disponibles

Scaling descriptors for language

Language Testing 1998 15 (2) 217–263 0265-5322(98)LT149OA  1998 Arnold

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

II Scales of language proficiency

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

for another. In an educational framework, there will be circumstances

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

usual in relation to the Eurocentres scale. Raters may actually be

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

Proficiency Ratings (ASLPR); Bachman and Savignon, 1986; Lantolf

3 Methodologies for scale development

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

such an approach is only likely to be successful in one specific, lim-

• Criterion statements: The descriptors are initially developed as

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

of informants during the process of determining the content and

The concept of objectivity is being used in two different senses

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

through wide consultation with foreign-language teachers rep-

• Data collection instruments: A selection of the best

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

were teaching their mother tongue, and the main edu-

Lower Upper Vocational Adult

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

The descriptors for communicative language activities would corre-

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

The elimination of repetition, negative formulation and norm-

b Qualitative validation: consultation with teachers through work-

c Quantitative validation: scaling descriptors through teacher

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

These descriptors had been used as anchor items precisely because

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

Apart from these groups of descriptors, a number of individual

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

levels for the Council of Europe Common Framework. The range on

4 Year 2 French, German and English, 1995

a Creating a descriptor pool for listening and reading: A similar

b Qualitative validation: consultation with teachers through work-

c Quantitative validation: scaling descriptors through teacher

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

without justifying assumptions about psychological unidimensional-

1) statistical ‘misfit’ in the classic Rasch sense: either in terms of

Thus in 1995, reading did not appear to ‘fit’ a construct dominated

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

2 A bank of classified, calibrated descriptors

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

Figure 1 Categories scaled

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

descriptors on similar issues land adjacent to each other although they

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

• provide concrete information required in an

The next level appears to represent a significant shift, offering some

• account for and sustain his/her opinions in discussion by provid-

Running right through the band are two new focuses:

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

• understand in detail what is said to him/her in the standard spoken

2 Constructs and dimensionality

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

will tend to build a psychometrically unidimensional construct. As

Downloaded from ltj.sagepub.com at PENNSYLVANIA STATE UNIV on May 13, 2016

registers, nativeness, and control of non-literal, figurative language