
The Challenges of Creating a Valid and Reliable Speaking Test as Part of a Communicative English Program

David Jeffrey

David Jeffrey is an instructor in the Communicative English Program at the Niigata University of International and Information Studies, Japan. He has experience both in socio-economic development in South Africa and in teaching English in Japan.

Abstract

This paper describes the challenging evolution and present form of the speaking test, which is the backbone of the Communicative English Program (CEP) of the Niigata University of International and Information Studies (NUIS), a private university in Japan, and discusses how valid and reliable this test appears to be. CEP is a semi-intensive, skill-based teaching program founded in 2000. It is part of the Department of Information Culture of NUIS. One of the biggest challenges in setting up CEP was the need to create a speaking test that could accurately measure the fluency criteria of communication (content and communication strategies) as well as the accuracy criteria (grammar, vocabulary and pronunciation). The test also had to be practical and make the best use of available resources. The CEP speaking test is described in detail, using illustrative examples from the test itself wherever possible. Attention is also given to the use of interrater reliability correlations as a measure of the consistency between the examiners when applying the testing criteria. Finally, the usefulness of reflecting on both the evolution and present forms of testing procedures is considered in terms of its potential contribution to the professional development of teachers and to their teaching programs.

The Origins of the CEP Speaking Test


The origins of the CEP speaking test can be traced back to 1999, when the coordinator of CEP, Hadley, was a teacher at the Nagaoka National College of Technology in Japan. Together with his co-worker Mort, he created a speaking test (the forerunner of the CEP speaking test) to assess the oral proficiency of their learners in terms of their ability to use English as a natural communicative skill. Their primary concern was to find out what their learners could do, rather than what they knew (Hadley and Mort, 1999). As a result, the speaking test that emerged gave primary attention to the fluency aspects of conversation (the skills of making meaningful conversation) and secondary attention to the accuracy aspects (such as vocabulary and grammatical correctness), which are also considered an important part of conversation.

This emphasis made the examining process more subjective. Hadley and Mort consequently became concerned about the test's internal reliability, especially with regard to the examining process and the need to maximise interrater reliability. Interrater reliability measures the consistency between different examiners. Hadley and Mort (1999, p. 2) described it as "the degree of correlation between two or more examiners, with the goal of determining whether they are using the same set of criteria when testing the oral proficiency of their learners."

The Speaking Test Evaluation Sheet used in these early days can be seen in Figure 1. Note the different weightings applied to the testing categories, which give higher priority to communicative ability and fluency (and lower priority to features of accuracy).

Figure 1 - Speaking Test Evaluation Sheet, as Used at the Nagaoka National College of Technology in 1999

Communicative Ability: Includes lengths of utterances, flexibility to speakers of differing levels, complexity of responses. (Multiply by 6)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 6 = ____

Fluency: Appropriate speed, pauses and discourse strategies. (Multiply by 4)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 4 = ____

Vocabulary: Did the student use a wide variety of words and phrases, or use new vocabulary used in class? (Multiply by 3)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 3 = ____

Non-verbal Strategies: Did the student supplement oral communication with appropriate gestures, eye contact and body language? (Multiply by 3)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 3 = ____

Grammar: How accurate and appropriate was the student's grammar? (Multiply by 2)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 2 = ____

Pronunciation: Was effort made to use correct intonation, or was the accent a barrier to communication? (Multiply by 2)
0 / 1 / 2 / 3 / 4 / 5 = ____ x 2 = ____
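The arithmetic implied by the evaluation sheet in Figure 1 can be illustrated with a minimal Python sketch. It assumes that the six weighted products are simply added together, which the sheet suggests but does not state outright; on that assumption the multipliers sum to 20, so a learner rated 5 in every category would score 100. The category keys and the example ratings are hypothetical and are not data from Hadley and Mort.

```python
# Multipliers as printed on the 1999 evaluation sheet (Figure 1).
WEIGHTS_1999 = {
    "communicative_ability": 6,
    "fluency": 4,
    "vocabulary": 3,
    "non_verbal_strategies": 3,
    "grammar": 2,
    "pronunciation": 2,
}


def weighted_total(ratings):
    """Add up each 0-5 rating multiplied by its category weight.

    Assumption: the sheet's six products are summed into one total.
    """
    total = 0
    for category, weight in WEIGHTS_1999.items():
        rating = ratings[category]
        if not 0 <= rating <= 5:
            raise ValueError(f"{category} rating must be between 0 and 5")
        total += rating * weight
    return total


# Hypothetical ratings for one learner (illustration only).
example_ratings = {
    "communicative_ability": 4,
    "fluency": 3,
    "vocabulary": 3,
    "non_verbal_strategies": 4,
    "grammar": 2,
    "pronunciation": 3,
}

print(weighted_total(example_ratings))  # 67
```

Because communicative ability and fluency together carry half of the available points in this reading, a learner who converses freely but inaccurately can still outscore one who is accurate but hesitant.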

It is important to stress that the aim at this time was not for the examiners to achieve exactly the same results, but rather to achieve similar results with a fairly high level of agreement. Indeed, theorists that Hadley and Mort referred to, such as Heaton (1997, p. 164, in Hadley and Mort, p. 5), note that the internal reliabilities of objective tests (such as multiple-choice tests) are inherently higher than those of subjective tests (such as oral tests), and therefore a level that would not be considered acceptable for a multiple-choice test may be acceptable for an oral test. Heaton adds that a moderate level of internal reliability is in fact desirable for an oral test, because such tests rely on many uncontrolled variables within natural communicative expression, rather than on the direct questions and discrete answers required by objective tests.

The split-half method was used to check the internal reliability of the speaking test at this time. This method involves dividing the test into two nearly equal parts, correlating the scores for the two parts, and adjusting the resulting coefficient using the Spearman-Brown Prophecy Formula. According to Henning (1987, p. 197), this formula is used to "adjust estimates of reliability to coincide with changes in the numbers of items or independent raters in a test."

Hadley and Mort (1999, pp. 3-5) were disappointed with the results of their interrater reliability testing, because a measure of only 0.54 was achieved, which was lower than the desired level even when taking into consideration that speaking tests are subjective in nature. It suggested that they needed to become more closely aligned in their common understanding of the issues and in internalising their examining criteria. They note two possible reasons for this. The first was a general lack of confidence, and feelings of distraction, on the part of the two examiners.
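The split-half procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard Pearson correlation and the standard form of the Spearman-Brown formula, r_full = k·r / (1 + (k − 1)·r), with k = 2 for a test split into halves. The paper does not say how the halves were formed, and the half-test scores below are made up for the example, not taken from Hadley and Mort.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length (at least 2 scores)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / sqrt(var_x * var_y)


def spearman_brown(r, factor=2.0):
    """Estimate reliability when test length (or rater count) is multiplied by `factor`."""
    return factor * r / (1 + (factor - 1) * r)


# Hypothetical scores of five learners on each half of a test (illustration only).
first_half = [12, 15, 9, 18, 14]
second_half = [11, 16, 10, 17, 12]

half_r = pearson_r(first_half, second_half)   # correlation between the two halves
full_r = spearman_brown(half_r)               # adjusted estimate for the whole test
print(round(half_r, 2), round(full_r, 2))
```

The adjustment is needed because each half is shorter than the full test, so the raw correlation between halves tends to understate the reliability of the test as a whole.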
The second reason was that the scoring bands and their meanings needed to be more explicit. Because Hadley and Mort's (1999, p. 9) scoring bands and their meanings were somewhat inexplicit, one examiner used a basic criterion of "would a native speaker, who is unaccustomed with Japanese speech patterns and mannerisms, be able to understand this student?", whereas the second examiner used a basic criterion of "based upon my experience of living in Japan for eight years, can I understand what this student is trying to say?" They concluded that they had not been explicit enough in their basic pedagogic criteria for rating learners, and worked out a middle ground between the two, which was stated as: "will a native speaker of English, who is sincerely open to communicating with Japanese, be able to understand what the learner is trying to say, even though he or she is mostly unaccustomed with Japanese mannerisms and speech patterns?" They also concluded that the results had helped them reach some important decisions for forthcoming oral testing, particularly that the examiners should endeavour to understand each other's pedagogical stance in order to improve interrater reliability. These considerations laid the basis of the CEP speaking test.

Defining What We Wanted in a Speaking Test

Although we had some philosophical and practical background as a foundation for the CEP speaking test, thanks to the work already carried out by Hadley and Mort at the Nagaoka National College of Technology, there was still much to consider in refining the speaking test to meet the specific requirements of CEP and to make it as valid and reliable as possible, especially in terms of interrater reliability. Our starting point was to consider exactly what we wanted to achieve, and to work towards that goal. We began by going back to basics and considering the importance of testing, particularly oral testing in a communicative program. Definitions of tests were considered as one starting point, such as that of Bachman (1990, p. 20), who defines a language test as "a measurement instrument designed to elicit a specific sample of an individual's behaviour ... (and) quantifies characteristics of individuals according to explicit procedures", and that of Underhill (1987, p. 7), who refers to a speaking test as "a repeatable procedure in which the learner speaks and is assessed on the basis of what he says."

Weir (1990, p. 7) says: "in testing communicative language ability we are evaluating samples of performance, in certain specific contexts of use, created under particular test constraints, for what they can tell us about a candidate's communicative capacity or language ability."

We found going back to the theoretical basics of speaking tests very beneficial as a common starting point from which to advance. This process wasn't easy, and it was time-consuming, especially given the many demands that designing and implementing a program simultaneously placed on us. In retrospect, it is no wonder that speaking tests are generally considered a necessary evil by many teachers and learners alike. However, they remain an indispensable means of providing teachers and learners with information about the teaching and learning process. Hughes (1989, p. 1) says: "Many language teachers harbour a deep mistrust of tests and of testers ... this mistrust is frequently well-founded. It cannot be denied that a great deal of testing is of very poor quality. Too often language tests have a harmful effect on teaching and learning; and too often they fail to measure accurately whatever it is they have intended to measure."

We were aware of this potential shortcoming. Despite our frustrations, we put as much effort as possible into making the CEP speaking test valid and reliable, so that the testing would be done well and any mistrust that learners and teachers might harbour towards it would be lessened, especially given the centrality of the speaking test within CEP.

The Importance of Reliability and Validity


Reliability and validity were the central concepts around which we worked to create the CEP speaking test, but what is specifically meant by reliability and validity, and why are they important? The two are interrelated and depend on many aspects of a test. In a broad sense, Henning (1987, p. 198) defines validity as "the extent to which a test measures the ability or knowledge that it is purported to measure." He defines reliability as "the consistency of the scores obtainable from a test. It is usually an estimate on a scale of zero to one of the likelihood that the test would rank testees in the same order from one administration to another proximate one" (p. 198). Reliability is therefore concerned with whether a test gives consistent results; as Underhill (1987, p. 9) says: "If the same learners are tested on two or three occasions, do they get the same score each time?" Validity, on the other hand, is concerned with whether a test measures what it is supposed to measure.

Many important aspects of tests have a bearing on validity and reliability; those worth mentioning here include backwash effects, face validity, content validity and construct validity. Hughes (1989, p. 1) states that "the effect of testing on teaching and learning is known as backwash." Backwash effects can be positive or negative, and they are positive if they motivate both teachers and learners to prepare for the tests. Related to this is the importance of considering the potential "forward wash" effects of tests that take place at the beginning of teaching cycles, which motivate learners to learn and to perform better in future tests (Hunt, 1998, p. 68).

Henning (1987, p. 192) defines face validity as "a subjective impression, usually on the part of examinees, of the extent to which the test and its format fulfils the intended purpose of measurement." Face validity is closely associated with content validity, defined by Henning (1987, p. 190) as "usually a non-empirical expert judgement of the extent to which the content of a test is comprehensive and representative of the content domain purported to be measured by the test." Face and content validity therefore refer to the extent to which the test is recognisable as a fair test by learners, who perform to their ability as a result.
Tests that lack face and content validity cause negative backwash effects and result in learners underperforming, as well as in the results being contested by both teachers and learners.

Henning (1987, p. 190) defines construct validity as "the validity of the constructs measured by a test." Construct validity is related to content validity, in that it is concerned with the contents of the test and their wider context. Construct validity thus refers to whether the test shares the philosophy of the teaching program of which it is a part, and can be measured by both statistical and intuitive methods, according to Underhill (1987, p. 106), who adds that "construct validity is not an easy idea to work with ... to reduce it to its simplest statement it says: does the test match your views on language learning?" In practice, there may be little difference between construct and content validity. These were just some of the many things that needed a good deal of consideration whilst creating the CEP speaking test.

CEP and its Speaking Test Today


Almost three years ago, when CEP was founded, it was clear that its teaching philosophy would be communicative. What was meant by communicative, and how this philosophy would be reflected in our teaching and testing methodology, was a matter of much contemplation. In the first year of CEP the co-ordinator and the two instructors were engaged in considerable innovation in syllabus design and implementation, which included extensive lesson planning and the creation of the CEP intranet server for the administrative files, a system that has proved to be extremely convenient and where the final results of the speaking tests are stored. Our CEP website was also made at that time, and can be viewed at http://www.nuis.ac.jp/~hadley/cepweb/cep/. The first year of CEP was like building and sailing a ship on a rough sea.

Now, after intense innovation followed by consolidation and refinement, CEP has become a semi-intensive, skills-based, English as an International Language (EIL) program. Small classes of 22 learners are streamed into six distinct levels of language proficiency, and meet once a day for 45 minutes from Monday to Friday, studying courses that focus on oral communication, listening and reading skills. CEP consists of 8 teaching cycles (4 in each semester). Each cycle lasts 3 weeks. The first 2 weeks of each cycle are devoted to classroom activities, and the last week is devoted to testing activities. Although a listening test is also undertaken during the last week, it is the speaking test that is the most important and the most challenging for both the examiners and the learners. There are thus 8 speaking tests in an academic year.

Oral communication skills are considered to be the most important part of CEP, and thus the CEP speaking test has been, and still is, given considerable attention. It is true to say that the CEP speaking test has become the backbone of CEP, although listening and reading tests are also undertaken.
The learners consider the speaking tests to be the most demanding of the tests, and also the most accurate reflection of how they are managing in CEP, given its communicative philosophy. It should also be stressed that the tradition of considering the fluency aspects of conversation as being most important has been sustained, with a significant amount of time and effort being devoted to creating the present form of the CEP speaking test and to making it as valid and reliable as possible.

Accuracy is regarded in CEP as attention to, and familiarity with, aspects of form, whereas fluency is regarded as a skill (as automated knowledge). CEP recognises that too much attention to accuracy jeopardises fluency, and thus diverts the bulk of attention away from accuracy by focusing on meaning. CEP therefore aims to encourage learners to concentrate more on what they are saying, and less on how they are saying it.

Learners in CEP are taught to express matters that are important to them and their lives, focussing on Japanese issues as they relate to the international setting. They are encouraged to learn how to communicate their concerns, cultural viewpoints and personal interests confidently and effectively, by taking ownership of English and using it as a means of meaningful interchange with people of other countries, and to relate what it means to be Japanese in a positive way to others in the world community. CEP thus wants its learners to learn how to express authentically, in English, who they are as Japanese, and to be able to explain who they are and why they think the way they do to people of other cultures.

New Interchange: English for International Communication (NIC) Levels 1, 2 and 3 (Richards, 1998) are the base texts used for the homework, listening and speaking activities.
Although the NIC units are followed in the order prescribed by their writers, the sequencing of lessons within these units is determined by the policy of CEP to move from accuracy (grammar homework) to fluency (conversation), with the listening activities bridging the two. Hence, the approach is to begin with activity-based homework as a form of consciousness-raising. Homework checking takes about 5 minutes, and listening activities another 10 minutes, so that the remaining 30 minutes can be given to fluency-based conversational activities. In the conversational activities, learners are encouraged to place most emphasis on fluency (as opposed to accuracy), and conversational content and strategy, as well as physical gestures and eye contact, play important roles. Learners are taught how to open and close conversations, introduce and develop topics, and understand and use common useful expressions as well as idiomatic phrases in the classroom. The speaking test then checks whether they have internalised what they have done in the classroom, and whether they can take ownership of what they have learned and use it as a skill.

To reinforce the need to practise speaking as much as possible, learners receive points in the form of plastic coins during their classroom speaking activities, which they cash in at the end of class. The points are recorded and contribute towards their final semester score. This encourages the learners to talk. Points are awarded primarily for conversational effort, that is, for making an effort to communicate meaningfully, and points are not taken away for mistakes. For example, if the teacher asks a question during a listening exercise and a student attempts to answer it, that student will receive a participation point, which counts towards the year-end grade, irrespective of whether the question is answered correctly or not.

In addition to measuring mainly the fluency aspects of oral communicative ability that are encouraged in the classroom, the speaking test gives a definite purpose to the homework and classroom activities of each cycle. Without doing the homework, learners find themselves unable to cope adequately with the listening and speaking activities in the classroom (given the semi-intensive nature of CEP and the quick tempo in the sequencing of class activities), and without coping with the speaking activities in the classroom, they are unable to cope adequately with the speaking tests. The speaking tests thus also aim to ensure that the learners undertake their learning activities with the necessary seriousness. The learners are expected to show that they can recall and use forms of speech related to the topics covered in the past two weeks, and couple them with their own insights in communication. The aim here is to encourage the learners to appreciate the relevance of the speaking tests to their own communicative and study purposes.

The CEP speaking tests are predominantly performance-referenced. They serve both as progress tests, testing how the learners are getting on in mid-course, and as achievement tests, testing how well the learners have done at the end of each cycle. They aim to give information about the learners' ability to communicate, and about how well they might be able to cope in situations of real-life language use.

The CEP Speaking Test in Detail


Testing Procedure
On the day of the speaking test, groups of three learners are chosen at random. The procedure is made random by having their names written on cards that are shuffled. They enter the testing room and sit facing each other at a small table. The test questions are written on cards and placed face down on the examiners' table. Although these question cards are based on the topics covered in class over the previous two weeks, they do not come directly from the textbook but are created by the examiners. Some examples of questions asked in a test based on units in a cycle focussing on environmental issues and learning are:

- Discuss the environmental problems of Japan
- Discuss what can be done to stop the illegal dumping of rubbish
- Discuss ways to help abandoned animals
- Discuss good ways to improve your English skills
- Discuss some new things you would like to learn to do
- Discuss the reasons why you study English

One of the three learners is asked to choose a number from one to six (the number of cards in the continually shuffled pack). Without showing the corresponding card to the learners, the examiner reads the question slowly and clearly to them twice, and the learners then think about the question for ten seconds (this makes the task more directed and allows the weaker learners some thinking time). The learners must then discuss the question for three minutes. The examiners listen and give each student a grade (at no point in this process may examiners clarify the meaning of any word in the question, and the conversation is exclusively between the three learners). The scores are entered into the examiners' computers, in prepared master files containing the formulas that calculate each learner's score in accordance with the weightings of the testing categories. The learners' final scores are an average of the three examiners' grades. After the three minutes are up, the learners are asked to stop. A new group of learners is called into the room, and the process continues until all the learners have been tested. The learners are given their results in their next class on Monday. This rapid feedback is possible because the scoring process is computerised, and it benefits the learners in that their memories of the tests are still fresh, so they can fairly easily recall what they did and reflect on what they scored.
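The random selection steps in this procedure can be mimicked in a short Python sketch. It is an illustration rather than a description of any software used in CEP, where the shuffling is done with physical cards. The function names and learner names are hypothetical, and the handling of a class that does not divide evenly into threes (the last group is simply smaller) is an assumption, since the paper does not say what happens in that case.

```python
import random

# Example question cards quoted in the paper for one cycle.
QUESTION_CARDS = [
    "Discuss the environmental problems of Japan",
    "Discuss what can be done to stop the illegal dumping of rubbish",
    "Discuss ways to help abandoned animals",
    "Discuss good ways to improve your English skills",
    "Discuss some new things you would like to learn to do",
    "Discuss the reasons why you study English",
]


def make_test_groups(names, group_size=3, rng=None):
    """Shuffle the name cards and deal them into groups of `group_size`.

    Assumption: leftover learners form a final, smaller group.
    """
    rng = rng or random.Random()
    shuffled = list(names)          # copy, so the class list itself is untouched
    rng.shuffle(shuffled)
    return [shuffled[i:i + group_size] for i in range(0, len(shuffled), group_size)]


def pick_question(cards, chosen_number, rng=None):
    """Reshuffle the pack, then take the card at the learner's chosen position (1-based)."""
    if not 1 <= chosen_number <= len(cards):
        raise ValueError("chosen number must match a card position")
    rng = rng or random.Random()
    pack = list(cards)
    rng.shuffle(pack)
    return pack[chosen_number - 1]


# Hypothetical class of 22 learners.
class_list = [f"Learner {i}" for i in range(1, 23)]
groups = make_test_groups(class_list)
for group in groups:
    print(group, "->", pick_question(QUESTION_CARDS, chosen_number=3))
```

Reshuffling the pack before every draw means that the number a learner chooses carries no information about the question, which matches the intent of the continually shuffled pack.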

Score Sheets
The scores of the CEP speaking test are based on two score sheets comprising rating bands that set out the assessment criteria for the examiners. The sheets were created by the examiners in mutual agreement, and thus reflect their combined pedagogical stance. There are two rating band sheets in CEP, one for the higher proficiency levels of classes A, B and C and the other for the lower proficiency levels of classes D, E and F. This is to make the process as fair as possible for the learners, especially the lower proficiency learners. The different levels allow learners to undergo class work and testing under conditions that are neither too easy nor too difficult given their current proficiency levels. Despite the different levels, attention is given to the reliability of the bands so that they do not overlap, are sufficiently described, are as free as possible from subjective and impressionistic elements entering into the evaluation, and give the examiners as much opportunity as possible to enter similar scores.

To make the process fair, the upper limit for the higher proficiency levels (A, B and C) does not require a student to have the ability of a native English speaker, but an ability similar to that of a Japanese person who has spent three to five weeks abroad in a summer home stay program. The requirements for the lower proficiency levels (D, E and F) are 3 bands lower than the A, B and C bands, with the upper limit being an ability to converse on a simple level. Although "on a simple level" is subject to many interpretations, the examiners undergo extensive discussion during the norming sessions to be sure that a mutual understanding of what it means is reached and internalised. A simplified version of the score sheets can be seen in Figure 2.

Figure 2 - Simplified Version of Score Sheets (used in the examination room by the examiners)

For each criterion, the rating bands are listed from the highest to the lowest.

CEP ABC Assessment Criteria

Fluency (Content of Contributions)
- Offers many details or examples; offers valid & pertinent reasons & opinions; able to converse on topic without struggle
- Offers a few details or examples; offers reasons & opinions; able to converse on topic with minimal struggle
- Offers a few details or examples; offers simple reasons or opinions; able to converse on topic with struggle
- Offers details or examples when asked; offers reasons or opinions when asked; greatly struggles on topic
- Struggles with details or examples when asked; struggles with reason or opinion when asked; converses mainly on unrelated topic
- Struggles greatly with details or examples when asked; struggles greatly with reasons or opinions when asked; converses only on unrelated topic

Fluency (Communication Strategies)
- Very actively engages others; uses gestures and maintains eye contact skilfully
- Often actively engages others; uses gestures and maintains eye contact appropriately
- Sporadically active in engaging others; uses few gestures and maintains some eye contact
- Somewhat passive and is sometimes engaged by others; uses few gestures and maintains little eye contact
- Passive and is usually engaged by others; uses almost no gestures nor eye contact
- Very passive; uses neither gestures nor eye contact

Accuracy (Grammar & Vocabulary)
- Uses/understands complex grammatical structures often; uses/understands new and sophisticated vocabulary often
- Uses/understands complex grammatical structures with a few mistakes; uses/understands some new and sophisticated vocabulary
- Uses/understands less complex grammatical structures; uses/understands basic topical vocabulary
- Uses/understands simple grammatical structures; uses/understands some basic topical vocabulary
- Uses/understands simple grammatical structures with difficulty; uses/understands some basic topical vocabulary with difficulty
- Uses/understands simple grammatical structures with great difficulty and numerous errors; uses some basic topical vocabulary with great difficulty

Accuracy (Pronunciation)
- Nativelike accent; natural rhythm and intonation
- Nonnative accent with very few mispronunciations; natural rhythm and intonation with slight mistakes
- Nonnative accent with some mispronunciations; nonnative rhythm and intonation
- Nonnative accent with many mispronunciations; nonnative rhythm and intonation interferes with comprehension
- Nonnative accent with many mispronunciations causing difficulty in understanding; nonnative rhythm and intonation causing difficulty in comprehension
- Nonnative accent with many mispronunciations causing great difficulty in understanding; nonnative rhythm and intonation interferes greatly with comprehension

CEP DEF Assessment Criteria

Fluency (Content of Contributions)
- Offers simple details or examples; offers simple reason or opinions; able to adequately converse on topic
- Offers limited details or examples; offers very simple reasons or opinions; converses on topic with a little struggle
- Offers few details or examples; offers few reasons or opinions; converses on topic with struggle, and tends to wander
- Offers few details or examples when asked; offers limited reasons or opinion when asked; struggles greatly to converse on topic and shifts to unrelated topics
- Struggles greatly with details or examples when asked; struggles greatly with reason or opinion when asked; able only to converse on unrelated topic
- Offers no details or examples when asked; offers no reason or opinion when asked; total breakdown

Fluency (Communication Strategies)
- Very actively engages others; uses gestures and maintains eye contact appropriately
- Active in engaging others; gestures and maintains eye contact
- Sporadically active in engaging others; uses few gestures and maintains some eye contact
- Somewhat passive and is sometimes engaged by others; uses almost no gestures and maintains little eye contact
- Very passive and is engaged by others; uses neither gestures nor eye contact
- Doesn't participate; total communication breakdown

Accuracy (Grammar & Vocabulary)
- Uses/understands simple grammatical structures well; uses/understands new vocabulary well
- Uses/understands simple grammatical structures with a few mistakes; uses/understands some new vocabulary
- Uses/understands simple grammatical structures with errors; uses/understands basic topical vocabulary
- Uses/understands simple grammatical structures with numerous errors; uses/understands some basic topical vocabulary
- Uses/understands simple grammatical structures with great difficulty and numerous errors; uses/understands some basic topical vocabulary with great difficulty
- Doesn't use simple grammatical structures; doesn't understand basic vocabulary

Accuracy (Pronunciation)
- Nonnative accent with some mispronunciations; nonnative rhythm and intonation
- Nonnative accent with several mispronunciations; nonnative rhythm and intonation with slight mistakes
- Nonnative accent with numerous mispronunciations; nonnative rhythm and intonation with several mistakes
- Nonnative accent is heavy and interferes with comprehension; nonnative rhythm and intonation interferes with comprehension
- Nonnative accent with many mispronunciations causing great difficulty in understanding; nonnative rhythm and intonation interferes greatly with comprehension
- Nonnative accent breaks communication; nonnative rhythm and intonation stops the conversation

Grading Procedure
During the speaking tests, the learners are graded in terms of content of conversation (40%), communication strategies (30%), grammar and vocabulary (15%) and pronunciation (15%). It should be noted that a larger portion of the grade is allocated to the content of conversation and the communication strategies (the fluency-based criteria) than to grammar and vocabulary and pronunciation (the accuracy-based criteria). Content of conversation relates to the ability to converse on a topic in some detail, giving reasons for opinions, while communication strategies relate to starting and closing conversations, responding to questions and soliciting information, and include gestures and eye contact. A simplified copy of the grading procedure can be seen in Figure 3.

Figure 3 - Simplified Version of Grading Procedure

Each category is rated on a scale of 5 (good), 4, 3, 2, 1 (bad).

Content (40%)
5 (good): Speaks a lot; gives examples; explains why
1 (bad): Speaks a little; no examples; no reasons why

Communication and Participation (30%)
5 (good): Active; uses gestures; looks at partner while speaking
1 (bad): Not active (passive); few gestures; doesn't look at partner when speaking

Vocabulary and Grammar (15%)
5 (good): Uses new words from textbook; uses grammar from textbook
1 (bad): Uses high school level English; doesn't use new words or grammar from textbook

Pronunciation (15%)
5 (good): Understandable
1 (bad): Very strong accent; hard to understand

(___ x 6 = ___) + (___ x 8 = ___) + (___ x 3 = ___) + (___ x 3 = ___)

The learners are reminded at the beginning of each cycle that the questions in the speaking tests will be based on the activities they will cover in the classroom. It is felt that the weighting of scores, in terms of their distribution between fluency and accuracy, reflects the balance required in CEP between knowledge (accuracy) and skills (fluency). The examiners use laptop computers and record the test results in formatted files. Their scores are combined after the testing so that the learners' final grades are the average of the three examiners' scores.
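The arithmetic behind the grading sheet can be illustrated with a minimal sketch. It assumes that each criterion is rated on the 1 to 5 scale of Figure 3 and that each multiplier equals the stated percentage divided by the top band of 5 (so content, at 40%, is multiplied by 8, and communication, at 30%, by 6). The printed formula lists the multipliers 6, 8, 3 and 3 without tying them to columns, so this mapping is an assumption, as are the function names and the sample band scores.

```python
# Multipliers that convert a 1-5 band into points out of 100.
# Assumption: multiplier = stated percentage / top band (5).
WEIGHTS = {
    "content": 8,         # 40% / 5
    "communication": 6,   # 30% / 5
    "vocab_grammar": 3,   # 15% / 5
    "pronunciation": 3,   # 15% / 5
}


def weighted_score(bands):
    """Convert one examiner's 1-5 band scores into a score out of 100."""
    if set(bands) != set(WEIGHTS):
        raise ValueError("bands must cover exactly the four criteria")
    for name, band in bands.items():
        if not 1 <= band <= 5:
            raise ValueError(f"{name} band must be between 1 and 5")
    return sum(WEIGHTS[name] * band for name, band in bands.items())


def final_grade(examiner_bands):
    """Average the weighted scores given by all examiners."""
    scores = [weighted_score(bands) for bands in examiner_bands]
    return sum(scores) / len(scores)


# Hypothetical band scores from three examiners for one learner.
examiners = [
    {"content": 4, "communication": 3, "vocab_grammar": 3, "pronunciation": 4},
    {"content": 4, "communication": 4, "vocab_grammar": 3, "pronunciation": 3},
    {"content": 3, "communication": 4, "vocab_grammar": 4, "pronunciation": 4},
]
final = final_grade(examiners)
print(f"Final grade: {final:.1f} out of 100")
```

With these multipliers, a learner rated 5 on every criterion receives 100 points, and 70 of those points come from the two fluency criteria.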

Norming Procedure
The CEP examiners undergo a norming procedure on a regular basis (once a month, at the end of each cycle, just prior to the speaking tests). The norming process is taken seriously, given how important it is for the CEP examiners to understand and internalise the common testing standards so that the speaking tests are as fair and reliable as possible. To this end, it is also important that the examiners base their scores on the learners' performances in the test itself, and not on how they might be expected to perform based on their performance in the classroom. Some learners converse competently in the classroom yet perform poorly in the speaking tests, while others are the opposite, but the CEP examiners make a considerable effort to grade only what happens in the test, regardless of their subjective opinions of the learners' abilities.

During the norming sessions, the rating band sheets (a more detailed version than those used in the classroom) are studied closely, and the terms and concepts therein are discussed to make sure that common understandings are reached. Taped excerpts of learners taking the speaking tests are then watched, and each examiner assigns a grade accordingly. The examiners then discuss the grades they gave. If the examiners' scores vary by no more than 8 percentage points from each other, it is considered that an adequate level of norming has been reached. The goal of this practice is not that the examiners give exactly the same score to learners each and every time, but rather that the same standards are understood and applied.

The learners are also exposed to the speaking test standards, so that they understand the requirements. A copy of the examiners' rating sheet written in Japanese and a simplified version of the examiners' rating sheet written in English are given to each student at the beginning of the academic year, and the students are asked to grade a group of learners shown to them on a video undergoing a speaking test. After the learners discuss the scores they gave, the teachers tell them what score each student actually received from the examiners, and the reasons for those scores. Learners are also given additional handouts to prepare them for the tests. Every effort is made to emphasise that conversational activities in the classroom are linked to the tests not only by topic but also by the conversational authenticity and fluency that learners are required to practice in the classroom. Each student then prepares, as homework, four questions based on the topics of the two NIC units covered in the previous two weeks, and brings them to class for the review day prior to the speaking tests. They practice these in the classroom in groups of three for three minutes at a time, as a final preparation for the test.
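The norming criterion for the examiners (scores on a taped excerpt varying by no more than 8 percentage points from each other) can be sketched as a simple check. This is only an illustration: it assumes each examiner's score is expressed out of 100, so that the largest pairwise difference equals the highest score minus the lowest, and the function names and sample scores are hypothetical.

```python
NORMING_TOLERANCE = 8  # percentage points, as stated in the norming procedure


def score_spread(scores):
    """Largest difference between any two examiners' scores."""
    return max(scores) - min(scores)


def is_normed(scores, tolerance=NORMING_TOLERANCE):
    """True if all examiners' scores lie within the tolerance of each other."""
    return score_spread(scores) <= tolerance


# Hypothetical scores (out of 100) from three examiners for two taped excerpts.
excerpts = {
    "excerpt_1": [71, 74, 72],
    "excerpt_2": [60, 71, 66],
}

# Excerpts whose scores differ too much are candidates for further discussion.
needs_discussion = [name for name, scores in excerpts.items() if not is_normed(scores)]
print(needs_discussion)
```

An excerpt flagged in this way would not mean that any examiner is wrong, only that the terms on the rating band sheets may need more discussion before the test.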

How Valid and Reliable is the CEP Speaking Test?


It seems that the CEP speaking test, in its current form, has come a long way from its forerunner, and has a fairly high degree of validity and reliability for the following reasons:

The different weightings for accuracy and fluency in the scoring sheets, with most emphasis on fluency, that remain unchanged throughout the year;
The content of the scoring sheets also remaining the same throughout the year, in terms of terminological contents and mathematical formulae;
The explicit nature of the terminological content of the scoring sheets, created with input from all the examiners, reflecting a combined understanding of their pedagogical stance;
The homogeneity of the learners in terms of race, age, academic status, socioeconomic and academic background;
Their consistency with the philosophy of CEP;
The consistency in application, directly after the first two weeks in each cycle;
The consistency and thoroughness of the norming procedure by the examiners, to maximise interrater reliability;
The number of examiners, three in total;
The process of making learners aware of the purpose of the speaking test through their exposure to the norming process in class just prior to the test;
The practicing of three-minute conversations in rotation in the class just prior to the speaking test, simulating the test with questions created by the learners;
The fairly large contribution of the speaking tests to the final CEP grade;
The consistent and fairly high correlations achieved despite a new examiner joining the team and having undergone only one intensive norming session (see below);
The rapid feedback of results, helping learners relate their scores to their performances more easily; and
The backwash and forward wash effects that motivate the learners to continue to practice in the classroom in preparation for the speaking tests, thus internalising the link between classroom practice and test performance.

Interrater Reliability Correlations to Establish the Validity and Reliability of the CEP Speaking Test

Interrater reliability testing is consistently carried out in CEP, but the split-half method that was used in the forerunner of the speaking test has not been applied again, mainly because of problems encountered with the method when it was used prior to the time of CEP. Hadley and Mort (1999, p. 50) noted:

"although the split-half method is used with success with many more objective test designs, it is not certain if our test instrument can be measured objectively. We suspect that that this instrument may be more organic in nature, and cannot be easily separated into different parts."
The correlation formula provided by the Microsoft Excel software package (which uses a simple regression method, the results of which are then correlated using the Pearson r correlation coefficient) has proven to be a more convenient and accurate alternative to the split-half method in CEP. It is easy to apply, given that the scores are entered into computers and stored on the CEP intranet program, and it has been used for the previous two years (2001 and 2002). The Microsoft Excel software package makes a regression analysis possible between two variables, and thereby allows interrater correlations to be calculated between two examiners at a time.

CEP had its first speaking tests for the current academic year on May 16th and 17th. For the first time, three examiners were used instead of two (the greater the number of examiners, the better the interrater reliability and thereby the better the internal reliability). The interrater correlation results between the three examiners for this test are shown in Table 1.

Table 1: Interrater Reliability Correlations of the Three CEP Examiners


Correlation between Examiner A and C: 0.87
Correlation between Examiner A and B: 0.89
Correlation between Examiner C and B: 0.87

These correlations are inspiring, in that they reveal that the examiners are not giving exactly the same scores (as that would render a correlation of 1.00), and neither are the scores too different from each other (as that would render a correlation of less than 0.50). They are also slightly higher than those of the previous two years (2001 and 2002), when two examiners were used. An interesting observation is that, while examiners B and C have been in the CEP program for two consecutive years, examiner A recently joined in April 2002 and only underwent one norming session with the other examiners prior to the testing. The correlations thus also suggest that the CEP speaking test has a fair degree of internal reliability, in that it renders similar outcomes irrespective of who is examining the test, and this in turn suggests that the norming procedure is working well in practice. Since this test, two more speaking tests have been undertaken and the interrater correlations have continued to be at acceptable levels.

Detailed comparisons of interrater reliability correlations from cycle to cycle (test to test) have not been made to date, but this would be a useful exercise, as it would give some idea of the changing impact of the norming procedure over time. Similar trends could be established for the learners' scores from cycle to cycle. Such comparisons could yield interesting results, particularly in an effort to measure the relationship over time between aspects of reliability and validity (for example, comparisons between interrater correlations and backwash and forward wash effects through changes in the learners' scores). Although this is beyond the scope of this paper, it could be a subject of interesting academic research in the future.
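The pairwise correlations reported in Table 1 were produced with Microsoft Excel. As a minimal sketch of the same kind of calculation, the Pearson r coefficient can be computed directly for every pair of examiners. The function names and the scores below are hypothetical illustrations, not CEP data, and the sketch does not reproduce the exact Excel workflow used in CEP.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variation_x = sum((a - mean_x) ** 2 for a in x)
    variation_y = sum((b - mean_y) ** 2 for b in y)
    if variation_x == 0 or variation_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return covariation / sqrt(variation_x * variation_y)


def interrater_correlations(scores_by_examiner):
    """Pearson r for every pair of examiners, keyed by the pair of names."""
    return {
        (a, b): pearson_r(scores_by_examiner[a], scores_by_examiner[b])
        for a, b in combinations(sorted(scores_by_examiner), 2)
    }


# Hypothetical scores (out of 100) given by three examiners to the same six learners.
scores = {
    "A": [71, 55, 88, 62, 79, 93],
    "B": [74, 58, 85, 60, 82, 90],
    "C": [69, 60, 90, 58, 75, 95],
}

for pair, r in interrater_correlations(scores).items():
    print(pair, round(r, 2))
```

Running the same calculation after every cycle would give the cycle-to-cycle comparison of interrater correlations suggested above.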

Conclusion
We have found it very useful to consider the aspects of validity and reliability in the creation of the CEP speaking tests. It has made us look at testing, especially oral testing, in a more critical way, and made us more aware of the need for validity and reliability, especially through the norming procedure to achieve acceptable levels of interrater reliability. It made us realise that there is a lot more to oral testing than we initially envisaged, and that maintaining an oral test in good form needs constant attention. It is not an objective test that, once created, can be filed and taken out only at times of use. The speaking test has consequently become a living part of CEP, in that all classroom activity is given a specific purpose, and the examiners undergo norming procedures before each speaking test is administered.

Before going through this rigorous process, I often questioned the need to be so thorough and precise concerning all aspects of the speaking test, particularly during the discussions in the norming sessions, and sometimes felt that we overdid it at the expense of other things that needed to be done. In hindsight, I have come to realise the relevance of what we were doing, and appreciate the product that I helped create all the more. We feel that our experience has equipped us with a better understanding of the complex dynamics of oral testing and will certainly benefit our future professional development. I would therefore recommend that all serious ESL teachers who have not yet closely considered the validity and reliability of their speaking tests do so, as the insight gained would be very beneficial to their professional development as well.

References
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford University Press.
Hadley, G., & Mort, J. (1999). An Investigation of Interrater Reliability in Oral Testing. Nagaoka National College of Technology Journal, 35(2), 45-51, from http://www.nuis.ac.jp/~hadley/publication/interrater/reliability.htm
Heaton, J. B. (1997). Writing English Language Tests. New York: Longman.
Henning, G. (1987). A Guide to Language Testing: Development, Evaluation, Research. Heinle & Heinle Publishers.
Hughes, A. (1989). Testing for Language Teachers. Cambridge University Press.
Hunt, D. (1998). Designing a Reading Comprehension Test for Oral English Classes. The Shizuoka Gakuen College Review Journal, 11, 61-80.
Richards, J. C. (1998). New Interchange 1: English for International Communication. Cambridge University Press.
Richards, J. C. (1998). New Interchange 2: English for International Communication. Cambridge University Press.
Richards, J. C. (1998). New Interchange 3: English for International Communication. Cambridge University Press.
Underhill, N. (1987). Testing Spoken Language. Cambridge University Press.
Weir, C. J. (1990). Communicative Language Testing. Prentice Hall International.