Scaling descriptors for language proficiency scales
Brian North, Eurocentres Foundation, Zürich, and Günther Schneider, Institut für deutsche Sprache, University of Fribourg
This paper reports results from a Swiss National Science Research Council project
which aimed to develop a scale of language proficiency in the form of a ‘descriptor
bank’. Up until now, most scales of language proficiency have been produced by
appeal to intuition and to those scales which already exist rather than to theories
of linguistic description or of measurement. The intention in this project was to use
an item-banking methodology to develop a flexible scale of stand-alone criterion
statements with known difficulty values.
The project took place in two rounds: the first for English (1994), the second
for French, German and English (1995). In each year pools of descriptors were
produced by analysing available proficiency scales. Through workshops with rep-
resentative teachers, the descriptors were then refined into stand-alone criterion
statements considered to be clear, useful and relevant to the sectors concerned.
Selected descriptors presented on questionnaires were then used by participating
teachers to assess the proficiency of learners in their classes. This data was used
to scale the descriptors using the Rasch rating scale model. The difficulty estimates
for the descriptors produced in relation to English in 1994 proved remarkably
stable in relation to French, German and English in 1995.
I Introduction
During the past decade, two influences have led to the increasing use
of scales of language proficiency. The first influence has been a gen-
eral movement towards more transparency in educational systems.
The second has been the movement towards greater international integration, particularly in Europe, which places a higher value on being able to state what the attainment of a given language objective means in practice. The result is that whereas 10 or 15 years ago scales which were not directly or indirectly related back to the 1950s US Foreign Service Institute (FSI) scale (Wilds, 1975) were rare, the last few years have seen a proliferation of European scales which do not take
American scales as their starting point. Some examples are: the Brit-
ish National Language Standards (Languages Lead Body, 1992); the
Eurocentres Scale of Language Proficiency (North, 1991; 1993a); the
Finnish Scale of Language Proficiency (Luoma, 1993) and the ALTE
Framework (Association of Language Testers in Europe, 1998).
Section II considers scales of language proficiency: the functions
they can fulfil, common criticisms of them and various methodologies
which have been used recently in scale construction. It goes on to
explain why one particular methodology (qualitative validation of
descriptors with informants followed by Rasch model scaling) was
selected for the study in question. Section III then outlines the study which is the subject of this article: the background and project structure and the way the phases operated in 1994 and 1995. Section IV briefly presents the results: a calibrated scale of language proficiency and a bank of classified, calibrated descriptors. Section V discusses some of the complications encountered, including dimensionality problems and differential item functioning (the way descriptors were interpreted in different contexts). Finally, Section VI concludes on the significance of the work undertaken.
2 Criticisms
A definition by John Clark (1985: 348) catches the main weakness
of the genre:
descriptions of expected outcomes, or impressionistic etchings of what pro-
ficiency might look like as one moves through hypothetical points or levels
on a developmental continuum.
Put another way, there is no guarantee that the description of pro-
ficiency offered in a scale is accurate, valid or balanced. Learners may actually be able to interpret a scale remarkably successfully for self-assessment; correlations of 0.74–0.77 with test/interview results have been reported. Thurstone long ago set out the test which any scaling method must pass:
the scale values of the statements should not be affected by the opinions
of the people who helped to construct it (the scale). This may turn out
to be a severe test in practice, but the scaling method must stand such
a test before it can be accepted as being more than a description of the
people who construct the scale’ (Thurstone, 1928: 547–48, cited in
Wright and Masters, 1982: 15).
III The study
1 Background
The study took place within the context of moves in Europe towards
a common framework of reference. The authors were members of a
Council of Europe working party charged with producing a ‘Common
European Framework of reference’ for language learning, teaching
and assessment (Council of Europe, 1996). The core of the Frame-
work is 1) a descriptive scheme representing aspects of communicat-
ive language competence and use and 2) a set of common reference
levels. The authors were simultaneously members of a Swiss working
party charged with developing a ‘Language Passport’ or ‘Language
Portfolio’ recording achievement in relation to the Framework
(Schärer, 1992; Council of Europe, 1997). In 1993 a Swiss National
Science Research Council three-year project (Schneider and North,
forthcoming) was set up with the primary aim of developing a bank
of transparent, calibrated descriptors of communicative language pro-
ficiency to be used in the first editions of the Framework and the Port-
folio.
2 Project structure
A pilot project for English conducted in 1994 (Year 1) was the sub-
ject of a PhD at Thames Valley University (North, 1996). The focus
in the pilot for English was on spoken interaction, including compre-
hension in interaction, and on spoken production (extended
monologue). Some descriptors were also included for written interac-
tion (letters, questionnaire and form filling) and for written production
(report writing, essays, etc.). In 1995 (Year 2) the survey was
extended to French and German as well as English. Descriptors were
also added for reading and for noninteractive listening.
The project took place in three steps in each of the two years 1994
and 1995:
1) Comprehensive documentation: creation of a descriptor pool
A survey of existing scales of language proficiency (North, 1994)
provided a starting point. Forty-one proficiency scales were
pulled apart with the definition for each level from each scale
assigned to a provisional level. Each descriptor was then split up
into sentences which were then each allocated to a provisional
category. When adjacent sentences were part of the same point,
they were edited into a compound sentence.
2) Qualitative validation: consultation with teachers through work-
shops
Qualitative validation of the descriptor pool was undertaken in workshops with representative teachers.
The separate scales were then equated onto a common scale. Three problems complicated this step. First, the logit scale produced by the MLE
(maximum likelihood estimation) procedure used by most Rasch pro-
grams including FACETS distorts towards the top and bottom (see
Camilli, 1988; Warm, 1989; Jones, 1993). The solution adopted was
to exclude items and learners scoring over 75% or under 25% from
the analysis, thus setting what Warm (1989: 447) calls ‘rational
bounds’. Secondly, the powerful data set of all 100 teachers rating
all the video performances swamped the main questionnaire data. The
solution adopted was to first estimate difficulty values for the items
on the basis of the analysis of the main questionnaire data alone with
teacher severity anchored to zero. The third complication was that it
transpired that FACETS ratchets the forms too closely together when
analysing the whole data set (‘one-step equating’ Jones, 1993; ‘con-
current calibration’ Kenyon and Stansfield, 1992). The solution
adopted here was to fall back on Woods and Baker’s (1985: 128–31)
classic method of ‘two-step equating’. With this method each form
is analysed separately. Then the forms are linked together by increas-
ing values on each successive form by the average difference of dif-
ficulty of the anchor items on the two forms. Then the resulting com-
mon scale is recentred on zero.
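The arithmetic of two-step equating is simple enough to sketch. The following Python fragment is purely illustrative — the function name and data structures are ours, not part of FACETS or of Woods and Baker's presentation — and assumes each form has already been analysed separately, yielding a dictionary of logit difficulties per item:

import numpy as np

def two_step_equate(forms, anchor_sets):
    """Chain per-form difficulty estimates onto one common scale.

    forms       : list of dicts {item_id: logit difficulty}, one per
                  questionnaire form, each from a separate analysis.
    anchor_sets : anchor_sets[i] holds the item ids shared by forms
                  i and i + 1 (so len(anchor_sets) == len(forms) - 1).
    """
    shifts = [0.0]                     # form 0 defines the frame of reference
    for i, link in enumerate(anchor_sets):
        # link constant = average difficulty difference of the anchor items
        diffs = [forms[i][a] - forms[i + 1][a] for a in link]
        shifts.append(shifts[i] + float(np.mean(diffs)))
    common = {}
    for form, shift in zip(forms, shifts):
        for item, d in form.items():
            common.setdefault(item, d + shift)   # keep first estimate of each anchor
    grand_mean = float(np.mean(list(common.values())))
    return {item: d - grand_mean for item, d in common.items()}  # recentre on zero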
During this phase of scale construction, Wright and Stone’s (1979:
92–96) classic technique for plotting the stability of the difficulty
estimates for the anchors on adjacent forms was used. In this simple
technique, the values produced by anchor items on one form are plot-
ted on the X axis with the values produced on the other form plotted
on the Y axis. A series of calculations based upon pooled standard
error of measurement at various points along the 45° diagonal pro-
duces upper and lower 95% criterion lines at the conventional 0.05
significance level. Anchor items appearing outside those lines demon-
strate an instability which cannot be explained by standard error. Such
items were therefore excluded. Unfortunately 9 of the 13 items lost
in this way were items on different kinds of strategies, such as the
example below:
Can identify words which sound as if they might be ‘international’, and try
them.
Other items on strategies did, however, stay within the criterion lines, indicating that their difficulty value was not dependent on the level at which they were used.
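Wright and Stone's graphical check can also be expressed numerically. The sketch below is our own simplification, not their procedure verbatim: rather than drawing criterion lines, it flags an anchor item whenever the between-form difference in its difficulty, after removing the overall shift between forms, exceeds 1.96 times the pooled standard error — i.e., whenever the item would fall outside the 95% band:

import math

def unstable_anchors(d1, d2, se1, se2, z=1.96):
    """Flag anchor items whose between-form difference exceeds what
    pooled standard error can explain at the 0.05 level.

    d1, d2   : dicts {item_id: difficulty} for the same anchor items
               estimated on two adjacent forms.
    se1, se2 : dicts {item_id: standard error} from the two analyses.
    """
    items = sorted(d1)
    # remove any overall shift so that only relative instability is tested
    mean_shift = sum(d1[i] - d2[i] for i in items) / len(items)
    flagged = []
    for i in items:
        gap = abs((d1[i] - d2[i]) - mean_shift)
        pooled_se = math.sqrt(se1[i] ** 2 + se2[i] ** 2)
        if gap > z * pooled_se:        # outside the 95% criterion lines
            flagged.append(i)
    return flagged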
The problems with anchor items mentioned above were not sig-
nalled in misfit statistics. There was, however, a substantial amount
of misfit in Year 1 which led to three complete categories of items
being lost:
1) Socio-cultural competence: It is not clear how much this problem
was caused a) by the concept being a quite separate construct
from language proficiency and hence not ‘fitting’ – as also found
by Pollitt and Hutchinson (1987); b) by rather vague descriptors
identified as problematic in the workshops, or c) by inconsistent
responses by the teachers. Nos. 46 and 47 on the questionnaire
given as Appendix 1 are examples of descriptors lost in this way.
2) Work-related: Those descriptors asking teachers to guess about
activities (generally work-related) beyond their direct experi-
ence: telephoning; attending formal meetings; giving formal
presentations; writing reports and essays; formal correspondence.
Descriptors for these categories tended to show higher levels of
misfit and to have their difficulty underestimated when analysed
with the main construct. Griffin (1989; 1990a: 296) reports a
similar problem in relation to library activities during the devel-
opment of his primary-school reading scale. No. 9 in Appendix
1, on telephoning, is an example of a descriptor lost in this way.
One could argue that No. 4 (on negotiating) is also an example
of this type, though other descriptors for Negotiating were cali-
brated successfully.
3) Negative concept: Those descriptors relating to dependence
on / independence from interlocutor accommodation (need for
simplification; need to get repetition/clarification), which are
implicitly negative concepts, misfitted wildly. These aspects
worked better as provisos in positively worded statements, for
example:
Can generally understand clear, standard speech on familiar matters
directed at him/her, provided he/she can ask for repetition or reformul-
ation from time to time.
Nos. 27 and 28 in Appendix 1 are examples of descriptors lost
in this way.
Pronunciation is another concept which is often conceived in
negative terms – the strength of accent, the amount of foreignness
causing comprehension difficulties. The pronunciation items for
lower levels were negatively worded and showed large amounts
of misfit although they were calibrated sensibly. No. 45 in
Appendix 1 is an example of a descriptor lost in this way.
d Setting cut-offs: Once the scale had been constructed, the descrip-
tors appeared calibrated in rank order onto a common logit scale as
shown in the extract in Appendix 2. The next task was to establish
‘cut-off points’ between bands or levels on this logit scale. Setting
cut-offs is always a subjective decision (Jaeger, 1976: 2; 1989: 492)
and as Wright and Grosse (1993: 316) put it: ‘No measuring system
can decide for us at what point short becomes tall’. The cut-offs set
were not, however, arbitrary. As Pollitt (1991: 90) shows there is a
relationship between the reliability of a set of data and the number
of levels it will bear. In this case the scale reliability of 0.97 justified
10 levels.
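Pollitt's (1991) argument is not reproduced in this extract, but a closely related rule of thumb from Rasch measurement — the Wright and Masters separation index — gives a feel for the order of magnitude involved. The snippet below is our illustration, not Pollitt's calculation; for a reliability of 0.97 it yields roughly eight statistically distinct strata, the same order of magnitude as the 10 bands adopted:

import math

def rasch_strata(reliability):
    """Rough number of statistically distinct levels a scale supports:
    separation index G = sqrt(R / (1 - R)); strata = (4G + 1) / 3."""
    g = math.sqrt(reliability / (1.0 - reliability))
    return (4.0 * g + 1.0) / 3.0

print(round(rasch_strata(0.97), 1))    # about 7.9 distinct strata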
The first step taken therefore was to set provisional cut-offs at
approximately equal intervals to create a 10-band scale. The second
step was to fine tune these cut-offs in relation to descriptor wording
in case there were threshold effects between levels. To check the
result, the content of descriptors for each category (e.g., comprehen-
sion in spoken interaction) at each level was broken up into elements
(e.g., qualities of the speech you can handle; degree of help required)
and displayed in a table for each category. This was in order to check
the plausibility of the progression and to see whether there was a
qualitative difference between the levels defined by the selected cut-
off points.
Appendix 3 shows the cut-offs on the logit scale between the 10 levels and the way those levels have been regrouped into 6 broader bands.
IV Results
1 A calibrated scale of language proficiency
The scale of 10 levels was produced in the 1994 analysis, the process
being described in detail in North (1996). The central aim of the 1995
survey was to see if the 1994 scale values for descriptors would be
replicated in a survey focused mainly on French and German. This
is why the 1995 survey was anchored back to 1994 with 61 descrip-
tors. The difficulty values for the items in the 1994 construct (spoken
interaction and production) proved to be very stable. Only 8 of the
61 1994 descriptors reused in 1995 were interpreted in a significantly
different way – i.e., fell outside Wright and Stone’s 95% criterion
line. After the removal of those eight descriptors, the values of the
103 listening and speaking items used in 1995 (now including only
53 from 1994) correlated 0.99 (Pearson) when analysed a) entirely
separately from 1994 and b) with the 53 common items anchored to
their 1994 values. This is a very satisfactory consistency between the
two years when one considers that:
1) The 1994 difficulty values were based on judgements by 100
English teachers, whilst in 1995 only 46 of the 192 teachers
taught English, and only 20 of them had taken part in 1994. The
ratings dominating the 1995 construct were therefore those of the
French and German teachers.
2) The questionnaire forms used for data collection in 1994 and
1995 were completely different in terms of both content and
range of difficulty with four forms in 1995 covering the ground
covered by seven forms in 1994.
3) The majority of teachers in 1995 were using the descriptors in
French or German. Therefore it is possible that the problems with
the eight 1994 descriptors may have been at least partly caused
by inadequate translation.
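Computationally, the replication check just described reduces to two operations: drop the descriptors falling outside the 95% criterion lines, then correlate the surviving difficulty estimates from the two years. A minimal sketch, reusing the unstable_anchors() function outlined earlier (all names are illustrative, not the project's actual code):

import numpy as np

def replication_check(d94, d95, se94, se95):
    """Flag descriptors outside the criterion lines, drop them, and
    correlate the surviving 1994 and 1995 difficulty estimates.
    Inputs are {item_id: value} dicts from the two years' analyses."""
    common = set(d94) & set(d95)
    d94c = {i: d94[i] for i in common}
    d95c = {i: d95[i] for i in common}
    se94c = {i: se94[i] for i in common}
    se95c = {i: se95[i] for i in common}
    bad = set(unstable_anchors(d94c, d95c, se94c, se95c))
    keep = sorted(common - bad)        # descriptors inside the criterion lines
    r = np.corrcoef([d94[i] for i in keep], [d95[i] for i in keep])[0, 1]
    return bad, float(r)               # unstable items, Pearson correlation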
V Discussion
1 Scaling
The stability of the scale difficulty values from two quite different
surveys (r = 0.99) suggests that the technical difficulties reported were
in fact overcome and that the items were satisfactorily scaled.
When one looks at the vertical scale of calibrated items, of which
an extract is given in Appendix 2, it is striking the extent to which
VI Conclusion
There are those who consider that the development of common frame-
work scales should not be attempted before research has provided
adequate empirically validated descriptions of language proficiency
and of the language-learning process (Lantolf and Frawley, 1985;
1988; 1992). Spolsky (1993: 208) and Brindley (forthcoming) have
voiced similar concerns, Brindley concluding that:
rather than continue to proliferate scales which use generalised and empirically
unsubstantiated descriptors … it would perhaps be more profitable to draw
on SLA and LT research to develop more specific empirically-derived and
diagnostically-oriented scales of task performance which are relevant to parti-
cular purposes of language use in particular contexts and to investigate the
extent to which performance on these tasks taps common components of com-
petence (Brindley, ibid.: 22).
Yet previously, Brindley had accepted that ‘we cannot wait for the
emergence of empirically validated models of proficiency in order to
build up criteria for assessing learners’ second language performance’
(Brindley, 1989: 56). As Hulstijn puts it: ‘it should be obvious that
syllabus writers, teachers and testers cannot wait for full-fledged
theories of language proficiency to emerge from research laboratories.
In the absence of theories, they have to work with taxonomies which
seem to make sense even if they cannot be fully supported by a theor-
etical description’ (1985: 277).
The purpose of a common reference framework is to provide such
a taxonomy in response to a demand for this. The purpose of descrip-
tors of common reference levels is to provide a metalanguage of cri-
terion statements which people can use to roughly situate themselves
and/or their learners, in response to a demand for this. It is widely
recognised that the development of such a taxonomy entails a tension
between theoretical models developed by applied linguists (which are
incomplete) on the one hand and operational models developed by
practitioners (which may be impoverished) on the other hand (see
North, 1993b: 7; McNamara, 1995: 159–165; Chalhoub-Deville,
1997; and Brindley, forthcoming: 21).
Up until now a methodology for the development of such instru-
ments has been lacking. This project demonstrates one way in which
such an undertaking can be carried through in a principled fashion:
• comprehensive documentation of the experience and consensus in
the field of proficiency scales;
• classification of descriptors to a taxonomy informed by theoreti-
cal models;
pre-testing of categories, formulations and translations to ensure that they are clear, useful and relevant.
Acknowledgements
The authors would like to express their gratitude to the US National
Foreign Language Center for the award of the Mellon Fellowship to
VII References
Alderson, J.C. 1990a: Judgements in language testing, version three. Paper
presented at the 9th World Congress of Applied Linguistics, Thessa-
loniki, Greece, April.
—— 1990b: Testing reading comprehension skills (Part One). Reading in
a Foreign Language 6, 425–38.
—— 1991: Bands and scores. In Alderson, J.C. and North, B., editors,
Language testing in the 1990s, London: Modern English
Publications / British Council / Macmillan, 71–86.
Alderson, J.C. and Lukmani, Y. 1989: Cognition and reading: cognitive
levels as embodied in test questions. Reading in a Foreign Language
5, 253–70.
Alderson, J.C. and North, B., editors, 1991: Language testing in the 1990s.
London: Modern English Publications / British Council / Macmillan.
Andrich, D. 1988: Thurstone scales. In Keeves, J.P., editor, Educational research, methodology and measurement: an international handbook, Oxford / New York: Pergamon Press, 303–306.
Association of Language Testers in Europe (ALTE) 1998: ALTE Hand-
book of European Language examinations and examination systems:
descriptions of examinations offered and examinations administered
by members of the Association of Language Testers in Europe. Cam-
bridge: University of Cambridge Local Examinations Syndicate.
Bachman, L.F. 1990: Fundamental considerations in language testing.
Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A. 1982: The construct validation of some
components of communicative proficiency. TESOL Quarterly 16,
449–64.
Bachman, L.F. and Savignon, S.J. 1986: The evaluation of communicative
language proficiency: a critique of the ACTFL oral interview. Modern
Language Journal 70, 380–90.
Bejar, I. 1980: A procedure for investigating the unidimensionality of
achievement tests based on item parameter estimates. Journal of Edu-
cational Measurement 17, 283–96.
Blais, J.-G. and Laurier, M. 1995: The dimensionality of a placement test
from several analytic perspectives. Language Testing 12, 72–98.
Brindley, G. 1986: The assessment of second language proficiency: issues
and approaches. Adelaide: National Curriculum Resource Centre.
—— 1989: Assessing achievement in the learner-centred curriculum,
NCELTR Research Series. Sydney: Macquarie University.
—— 1991: Defining language ability: the criteria for criteria. In Anivan, S.,
editor, Current developments in language testing, Singapore: Regional
Language Centre.
Mullis, I.V.S. 1980: Using the primary trait system for evaluating writing.
Manuscript No. 10-W-51, Educational Testing Service, Princeton, NJ,
reprinted December 1981.
North, B. 1991: Standardisation of continuous assessment grades. In Alder-
son, J.C. and North, B., editors, Language testing in the 1990s, Lon-
don: Modern English Publications / British Council / Macmillan,
167–77.
—— 1993a: Transparency, coherence and washback in language assessment.
In Sajavaara, K., Takala, S., Lambert, D. and Morfit, C., editors, 1994,
National foreign language policies: practices and prospects. University of Jyväskylä: Institute for Educational Research, 157–93.
—— 1993b: The development of descriptors on scales of proficiency: per-
spectives, problems, and a possible methodology. NFLC Occasional
Paper. Washington, DC: National Foreign Language Center, April.
—— 1994: Scales of language proficiency: a survey of some existing sys-
tems. Strasbourg: Council of Europe.
—— 1995: The development of a common framework scale of descriptors
of language proficiency based on a theory of measurement. System 23,
445–65.
—— 1996: The development of a common framework scale of descriptors
of language proficiency based on a theory of measurement. Unpub-
lished PhD thesis, Thames Valley University.
—— 1997: Perspectives on language proficiency and aspects of competence.
Language Teaching 30, 93–100.
Oller, J.W., editor, 1983: Issues in language testing research. Rowley, MA:
Newbury House.
Oscarson, M. 1978/1979: Approaches to self-assessment in foreign
language learning. Strasbourg: Council of Europe, 1978; Oxford: Pergamon, 1979.
—— 1984: Self-assessment of foreign language skills: a survey of research
and development work. Strasbourg: Council of Europe.
Pienemann, M. and Johnston, M. 1987: Factors influencing the develop-
ment of language proficiency. In Nunan, D., editor, Applying second
language acquisition research, Adelaide: National Curriculum
Resource Centre, 89–94.
Pollitt, A. 1991: Response to Alderson: bands and scores. In Alderson, J.C.
and North, B., editors, Language testing in the 1990s, London: Modern
English Publications / British Council / Macmillan, 87–94.
—— 1993: Reporting reading test results in grades. Paper presented at the
15th Language Testing Research Colloquium, Cambridge and Arnhem,
2–4 August.
Pollitt, A. and Hutchinson, C. 1987: Calibrating graded assessments: Rasch
partial credit analysis of performance in writing. Language Testing 4,
72–92.
Pollitt, A. and Murray, N.L. 1993/1996: What raters really pay attention to. Paper presented at the 15th Language Testing Research Colloquium, Cambridge and Arnhem, 2–4 August 1993. In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge: Cambridge University Press.
Appendix 1

Please rate the learner for each of the 50 items on the questionnaire on the following pages using the following scale. Please cross the appropriate number next to each item: ×
0 This describes a level which is definitely beyond his/her capabilities. Could
not be expected to perform like this.
1 Could be expected to perform like this provided that circumstances are
favourable, for example if he/she has some time to think about what to say,
or the interlocutor is tolerant and prepared to help out.
2 Could be expected to perform like this without support in normal circum-
stances.
3 Could be expected to perform like this even in difficult circumstances, for
example when in a surprising situation or when talking to a less co-operat-
ive interlocutor.
4 This describes a performance which is clearly below his/her level. Could
perform better than this.
Spoken tasks
Please rate the learner for each item using the scale defined on the first page. Please cross the appropriate number next to each item: ×

0 = describes a level beyond his/her capabilities; 1 = yes, in favourable circumstances; 2 = yes, in normal circumstances; 3 = yes, even in difficult circumstances; 4 = clearly better than this

Comprehension

25 Can generally understand clear, standard speech on familiar matters directed at him/her, provided he or she can ask for repetition or reformulation from time to time. 0 1 2 3 4
Interaction strategies
29 Can regularly join in a conversation, but may often do so inappropriately. 0 1 2 3 4

30 Can repeat back part of what someone has said to confirm mutual understanding and help keep the development of ideas on course. 0 1 2 3 4

31 Can ask for clarification about key words not understood using stock phrases. 0 1 2 3 4

32 Can rehearse and try out new combinations and expressions, inviting feedback. 0 1 2 3 4

33 Can use a simple word meaning something similar to the concept he or she wants to convey and invites ‘correction’. 0 1 2 3 4

34 Can define the features of something concrete for which he or she can’t remember the word. 0 1 2 3 4

35 Is conscious of when he or she mixes up tenses, words and expressions, and tries to self-correct. 0 1 2 3 4
Writing tasks
48 Can fill in uncomplicated forms with personal details: name, address, nationality, marital status. 0 1 2 3 4

49 Can write simple notes to friends. 0 1 2 3 4

50 Can write personal letters to a friend, host, etc. giving and asking for news. 0 1 2 3 4
Appendix 2 Extract from the scale of calibrated descriptors

(continued from the preceding descriptor) … evaluating alternative proposals and making and responding to hypotheses.

Logit | No. | Descriptor | Cat. | Error | Fit | t
1.30 | 263 | Can generally correct slips and errors if he/she becomes conscious of them. | E | .26 | 1.25 | 1.18
1.24 | 235 | Can plan what is to be said and the means to say it, considering the effect on the recipient(s). | I/E | .26 | .75 | −1.38
1.23 | 196 | Can use stock phrases (e.g., ‘That’s a difficult question to answer’) to gain time and keep the turn whilst formulating what to say. | T2/I | .26 | 1.07 | .33
1.23 | 204 | Can engage in extended conversation in a clearly participatory fashion on most general topics. | I | .26 | .63 | −2.11
1.23 | 255 | Can speculate about causes, consequences, hypothetical situations. | E | .26 | .67 | −1.88
1.22 | 135 | Can explain a problem and make it clear that his/her counterpart in a negotiation must make a concession. | T1 | .20 | .89 | −.77
1.16 | 218 | Can initiate discourse, take his/her turn when appropriate and end conversation when he/she needs to, though he/she may not always do this elegantly. | I | .26 | .65 | −2.02
1.11 | 256 | Can understand in detail what is said to him/her in the standard spoken language even in a noisy environment. | E | .27 | 1.05 | .23
1.09 | 254 | Can develop an argument giving reasons in support of or against a particular point of view. | E | .27 | .91 | −.46
0.65 | 151 | Can cope with less routine situations in shops, post office, bank, e.g., asking for a larger size, returning an unsatisfactory purchase. | T1/T2 | .24 | 1.18 | 1.02
0.64 | 231 | Can exchange accumulated factual information on familiar routine and nonroutine matters within his/her field with some confidence. | I/E | .27 | 1.11 | .48
0.57 | 201 | Can describe how to do something, giving detailed instructions. | I | .26 | .63 | −2.18
0.43 | 202 | Can carry out a prepared interview, checking and confirming information, though he/she may occasionally have to ask for repetition if the other person’s response is rapid or extended. | I | .26 | .51 | −3.07
[Level-column entries from adjacent rows of the table mark cut-offs on the logit scale: M at 3.9, E at 2.8 and V+ at 1.74, the last beside the element ‘animated conversation between native speakers’.]
Note: Figures in the Level column indicate the cut-offs on the logit scale (see Appendix 3).