International Journal of Educational Management

To cite this document: Dmitriy V. Chulkov and Jason Van Alstine, (2012), "Challenges in designing student teaching evaluations in a business program", International Journal of Educational Management, Vol. 26 No. 2, pp. 162-174.
Permanent link to this document: http://dx.doi.org/10.1108/09513541211201979

Challenges in designing student teaching evaluations in a business program

Dmitriy V. Chulkov and Jason Van Alstine
School of Business, Indiana University Kokomo, Kokomo, Indiana, USA

Received 27 October 2010; revised 13 December 2010 and 7 March 2011; accepted 31 March 2011

Abstract
Purpose: This article aims to present an empirical analysis of the effects of changes in the student teaching evaluation (STE) form in a business school.
Design/methodology/approach: The authors discuss a case of STE re-design in a business school that focused on improving the STE instrument. They utilize empirical data collected from students who completed both the original and the revised STE form in several semesters of undergraduate economics courses to examine the effect of changing the evaluation scale and the fashion in which written comments are solicited.
Findings: There are three results of interest to departments considering a change to student evaluation instruments. First, the authors find that a shift from a four-point scale to a five-point scale leads to a decrease in evaluation scores even after making an adjustment for scaling. Second, they find that students tend to give lower scores on comparison-type questions that ask for a comparison of the instructor or the course to the student's entire college experience. A larger share of such comparison-type questions may depress the mean scores on composite evaluations. Third, soliciting written feedback in a specific section of the form is an effective way to increase both the number of written comments and the size of each comment.
Practical implications: Student teaching evaluations serve as an assessment instrument and are frequently used in faculty promotion decisions. A discussion of best practices in designing the STE is provided in order to caution stakeholders about the problems that may arise and to guide academic institutions in the review of evaluation procedures.
Originality/value: The authors start with an example of STE re-design and then analyze empirical data from several semesters. Analysis of the literature and empirical evidence leads to recommended best practices that make STE data more useful both as a summative measure for administrative decisions and as a formative measure used by faculty looking to improve their teaching skills and course design.

Keywords: Teaching evaluation, Evaluation methods, Assessment, Business education, Teachers, Business schools, Change management
Paper type: Research paper

1. Introduction
Student teaching evaluations (STEs) serve as an important assessment tool and are widely used in faculty promotion or merit pay decisions (Dommeyer et al., 2002). STEs typically contain a number of questions that ask students to evaluate various aspects of the instructor's performance and course design. These forms are completed by the students at the end of the semester and often serve as a summative measure used in administrative decisions about faculty tenure, promotion, and merit pay (Kulik, 2001). They also have an important assessment function and are used as a formative measure by faculty looking to improve their teaching skills and course design (Onwuegbuzie et al., 2009).
STEs are particularly common in US higher education, as Seldin (1993) states that STEs are used in personnel decisions by 86 percent of higher education institutions in the US. The growing importance of STEs internationally is reflected by studies on the effectiveness of STEs from Australia (Bedggood and Pollard, 1999), European countries (Husbands and Fosh, 1993), Malaysia (Liaw and Goh, 2003), Singapore (Koh and Tan, 1997), and the UAE (Badri et al., 2006). One factor that promotes the use of STEs is the influence of international accrediting bodies. For instance, the Association for Advancement of Collegiate Schools of Business (AACSB), the leading accrediting body for business schools, states the following in its accreditation standards: "The school should have a systematic program for evaluating instructional performance of faculty members. Information from instructional evaluation should be available to both faculty members and administrators. The school should use instructional evaluations as the basis for development efforts for individual faculty members and for the faculty as a whole" (AACSB, 2010). As of 2010, AACSB includes 123 accredited schools in 36 countries outside of the US as well as over 400 schools in the US.
A significant amount of research is devoted to the issue of validity of STE instruments (e.g. Marsh and Dunkin, 1997; Wachtel, 1998). Empirical studies indicate that a well-constructed and score-validated STE instrument can serve as a useful indicator of teaching effectiveness (e.g. D'Apollonia and Abrami, 1997; Marsh, 1987). However, Seldin (1993) cautions that the validity of student responses on STEs is affected by administration procedures, such as the manner in which students are provided with instructions and the timing of the administration of the STE during the semester. More recently, Onwuegbuzie et al. (2009) discuss a meta-validity model that summarizes evidence about the validity of using STE instruments and recommend that STE results not be interpreted as the only indicator of teaching effectiveness.
In the area of Business and Economics, the widespread use of STEs is similar to the overall experience of the US educational system, as these evaluations are frequently a factor in personnel decisions (Becker and Watts, 1999). McPherson et al. (2009) detail the empirical findings from an extensive dataset of 24 semesters and suggest that instructor characteristics such as experience, age, and gender may affect STE scores. Thus, using STE scores for personnel decisions in isolation may affect faculty
members within a department in a non-uniform fashion. Kozub (2010) concludes that
administrators should interpret STE ratings with care, particularly if a faculty member
teaches at an unpopular time of day, in areas that many students do not find
intrinsically interesting, or if the instructor is perceived by students to be a difficult
grader. Similar conclusions from international studies of STEs in business
departments are advocated by Koh and Tan (1997), Liaw and Goh (2003), and Badri
et al. (2006).
This study complements the literature by focusing on the practical implications of STE design. We start with the discussion of a case study of the review and re-design of STEs in a business school of a public US university, and then perform empirical analysis of the effects of changes in STEs. The STEs used in the evaluation of faculty at this school were originally designed in an ad hoc fashion and the basis for the design of the form's questions and answer scale was not clear. Assessment activities triggered a review of the STEs and a number of issues with the form were identified, including the content of the questions on the instrument, the evaluation scales, and the fashion in which written comments were solicited from the students. In the following section, we outline these issues. Later, we present empirical data from several sections of undergraduate Economics courses and evaluate the impact of the changes in the STE. We conclude with a discussion of the results and provide best practices in the design of STEs. The intent of this study is to provide a review of the challenges that occur in the process of designing the STE, to caution the stakeholders of the problems that may arise, and to guide academic institutions in general, and Business departments in particular, in the review of their evaluation procedures.

2. STE re-design in a business program


Best practices in assessment involve periodic evaluation of the assessment process
itself. We start with an example of such evaluation of the STE in the business school at
a small campus of a public US university that uncovered multiple issues with the
school's existing STE form used for the evaluation of faculty instruction and led to a
revision of the STE. Please note that due to space limitations we do not present the
complete STE forms in this article. In the following discussion of STE design, the
original form refers to the form that existed in the school prior to the re-design, and the
revised form refers to the re-designed STE instrument[1].
The first issue discovered in the review of the original STE form was the fact that
the scale of evaluation for the questions on the original form did not provide for
ordinality in the responses available to the students. The original STE was designed in
an ad hoc fashion over a number of years and the evaluation scale was not the same for
each question. While most questions used a score of 1 as the best possible score and a
score of 4 as the worst, several questions did not use an ordinal scale. For instance,
the question "Work related to class level" had the following answer choices:
(1) Work suited to class level.
(2) Attempt made to suit class level.
(3) Work completely above class level.
(4) Work completely below class level.

A score of 4 on this question was not necessarily worse than 2 or 3. Thus, averaging the
answers to this question did not provide a summary measure with a clear
interpretation. Furthermore, using the average in comparison to other questions that
had an ordinal scale could provide for misleading results. There were a total of six
questions on the original STE with no ordinal answer scale. Most of these questions
appeared in the course evaluation section of the STE, with only one such question in
the instructor evaluation section of the STE. Thus, the problem with non-ordinal scale
questions may have skewed the mean for the course evaluation questions for each
instructor more than the mean of the instructor evaluation questions. The department
used the STE to evaluate the mean performance on course-related questions separately
from the mean performance on the instructor-related questions. A bias caused by
non-ordinal scales makes the comparison of such means questionable.
A second issue concerned the evaluation scales on the original STE. Two questions
used a five-point scale, while the 15 remaining questions used a four-point scale.
Obviously, this made comparison of the means misleading. Furthermore, as the
comparison of the mean answers to evaluation questions is frequently used, clear
standards for the evaluation scale must exist to make these comparisons meaningful.
An examination of the original STE also revealed that the four-point scale answer choices did not have the same midpoint on the answer scale. One would expect a four-point scale to have two answer choices above the average and two answer choices below the average. This was not always the case. For instance, the question "Speaking ability" had the following answer choices:
(1) Voice and demeanor excellent.
(2) Average.
(3) Poor speaking, distracting.
(4) Poor speaking, a serious handicap.

The average choice here was answer choice 2, rather than the point between 2 and 3. Meanwhile, for several of the other questions an answer of 2 implied that the instructor was above the average for that particular question.
The third issue with the scale was discovered in the comparison with the other
schools and departments at the university. The other departments used scales in which
the highest answer choice was the best and the lowest was the worst. Having a scale
with the lowest score being the best put the School of Business at a disadvantage, as
the difference in scales had to be explained to the stakeholders at the campus level
every time the Business faculty STE scores were discussed.
Addressing these issues required making changes to the scale on the STE in order
to make the results ordinally comparable within the department and across the
departments at the university. The literature recommends using the five-point Likert
scale in the design of the STE (e.g. Frick et al., 2010). The three issues identified
previously are addressed by introducing a five-point scale as follows: 1 = strongly disagree; 2 = disagree; 3 = undecided; 4 = agree; 5 = strongly agree. This scale
provides a clear midpoint. It also is consistent with the scales used by other
departments across the campus in that the highest answer choice is the best.
Two other issues arose in the review of the original STE. The first involves the way
in which written comments were solicited from students. The original form provided
space to write additional comments at the side of the form in relation to any of the 17
questions on the form. The number of written comments actually submitted by the
students was low. The department decided to add specific questions to the form that
ask for optional written feedback. The following section presents empirical results that
demonstrate that including an explicit request for feedback and suggestions for
improvement helps increase the number of written comments received. The written
comments are important, as they complement the numerical rankings from the STE.
This additional evidence helps attenuate the issues with the validity of numerical STE
rankings raised by Onwuegbuzie et al. (2009). Complementing numerical rankings with written comments and other teaching effectiveness data is advocated by studies of STEs in business schools (e.g. McPherson et al., 2009; Kozub, 2010).
The final issue regards the nature and the phrasing of the questions. The original
form did not have a theoretical basis for the questions asked. The literature provides a
number of dimensions for evaluating teaching effectiveness. The STE should be used to
gather evidence along these dimensions. Lowman (1994) presents a two-factor model of
teaching effectiveness in which the factors are intellectual excitement and interpersonal
rapport. The basic idea of using two main factors to organize student ratings is supported by a number of other researchers, including Frey (1978), Cranton and Smith (1986), and Erdle et al. (1985). In these studies, one factor represents the pedagogical skills in the classroom, such as presentation and organization, and the other factor represents concern for or rapport with the students. These two dimensions are also consistent with those identified in the factor analysis of STEs by Abrami et al. (1997) as being the two most important. They also reflect two of the three roles of instructors (i.e. presentation and facilitation) identified in the factor analysis by Feldman (1976).
Using the two-factor model, the department focused on the pedagogical skills in the
classroom, such as presentation and organization, and the instructor's rapport with the
students. The revised form was designed to have two sections evaluated with the
Likert scale and an additional section that specifically asks for written feedback. The
17 STE questions on the revised form are divided into two sections dealing with the
instructor attributes and the course attributes, respectively. An optional written
comment section is placed at the end. This section solicits answers to four specific
questions designed to provide feedback to the instructor:
(1) What did you like most about the instructor?
(2) What did you like most about the course?
(3) What could your instructor do to most improve his/her effectiveness as a
teacher?
(4) What specific suggestions do you have for improving this course?

Writing the questions for the revised STE also involved changing the language in
order to clarify each question and to align each question with the Likert scale that was
introduced. For instance, question phrased as Ability to explain on the original form
was changed to: The instructor clearly explained the course material. In contrast, the
original STE relied on the answer choices on the four-point scale to clarify the nature of
the question. Moving away from the multiple scales toward the standard scale makes
the resulting data more comparable.
The issues with the scale of the STE form, the phrasing of the questions, and the way
in which the form solicits written feedback were addressed in the design of the revised
form. These issues may be present in the STE instruments used by other academic
institutions, especially in forms that were designed in an ad hoc fashion. In the following
we present an empirical analysis of the effects of changes to the STE instrument.

3. Data and methodology


At the end of each semester, students are asked to evaluate each instructor using a
standard STE form. After the school re-designed the STE form, we attempted to
evaluate the efficacy of the new evaluations by collecting empirical data from six
sections of undergraduate Economics courses taught by two different instructors over
the following three semesters. In collecting data, we asked students in these sections to
complete both the original STE and the revised STE. Both STE forms were completed
in the same environment and student privacy was protected through anonymity in
responses and through the use of a student proctor. Students were advised that
participation was voluntary and would not affect their course grade. Table I describes
the participation rates for the study. The standard department STE (the revised STE)
had to be completed by all students present in class. Students were able to opt out of participation in the study by not completing the original STE. A total of 97 students completed both forms, for a participation rate of 78 percent.
In order to evaluate the data we use a two-sample comparison-of-means test. Before completing these tests, we convert responses to questions with a four-point scale to a five-point scale and we adjust the original STE responses to a scale in which higher-numbered responses are better. We study three issues to look for differences that have arisen from changing the STE forms (a sketch of the conversion and test appears after the list below):
(1) Does changing the STE scale lead to differences in the information reported to
instructors from the student responses?
(2) Do responses on comparison-type questions differ from more standard
evaluation-type questions?
(3) Does direct solicitation of free-form written responses lead to an increase in the
quantity and size of such responses?
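As an illustration of this procedure, the sketch below rescales four-point responses onto the five-point range and runs a two-sample comparison-of-means test in Python. The linear rescaling formula, the sample responses, and the use of Welch's unequal-variance t-test are assumptions made for this example; the article does not report these implementation details.

```python
# Minimal sketch of the scale adjustment and two-sample comparison-of-means
# test; the rescaling rule and sample data are illustrative assumptions.
import numpy as np
from scipy import stats

def rescale_four_to_five(responses):
    """Linearly map 1-4 responses onto the 1-5 range:
    new = 1 + (old - 1) * (5 - 1) / (4 - 1)."""
    r = np.asarray(responses, dtype=float)
    return 1.0 + (r - 1.0) * 4.0 / 3.0

# Hypothetical responses to one paired question, already reordered so that
# higher numbers are better on both forms.
original_4pt = [4, 3, 4, 4, 2, 3, 4, 3]   # original STE (four-point scale)
revised_5pt = [4, 3, 5, 4, 3, 3, 4, 2]    # revised STE (five-point Likert scale)

# Two-sample comparison-of-means test (Welch's unequal-variance variant).
t_stat, p_value = stats.ttest_ind(rescale_four_to_five(original_4pt),
                                  revised_5pt, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

A negative t-statistic in this setup indicates that the revised-form mean is lower, matching the sign convention used for the tables that follow.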

Of the 17 questions on both the original STE and the revised STE, only nine questions can be directly compared based on their content. Seven of these questions ask the student to evaluate the characteristics of the instructor. Specifically, students are asked to evaluate the instructor's knowledge of the subject, the instructor's enthusiasm about the subject, the clarity of the instructor's explanations, the instructor's attitude towards students, the instructor's openness and responsiveness to questions during class, the instructor's organization of lectures, and how the instructor compares to other instructors the student has had. The other two questions ask the student to evaluate characteristics of the course, specifically the organization of the topics covered and the textbook used for the course. Means of all instructor questions, all course questions, and all questions (including the questions that cannot be directly paired) are also generated and studied, as these measures are often used as summary statistics to evaluate the effectiveness of an instructor.
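As a concrete illustration of how such composite measures might be assembled from one student's responses, a minimal sketch follows; the question labels, section assignments, and pandas-based layout are hypothetical and not taken from the actual STE forms.

```python
# Illustrative construction of the instructor composite, course composite,
# and overall mean from one student's responses; labels are hypothetical.
import pandas as pd

responses = pd.DataFrame({
    "question": ["knowledge", "enthusiasm", "clarity", "attitude",
                 "questions_welcome", "lecture_organization",
                 "vs_other_instructors", "topic_organization", "textbook"],
    "section": ["instructor"] * 7 + ["course"] * 2,
    "score": [5, 4, 4, 5, 4, 4, 3, 4, 3],   # five-point Likert responses
})

instructor_composite = responses.loc[responses["section"] == "instructor", "score"].mean()
course_composite = responses.loc[responses["section"] == "course", "score"].mean()
overall_mean = responses["score"].mean()
print(instructor_composite, course_composite, overall_mean)
```

Averaging these per-student composites within a section gives the summary statistics that are compared across the original and revised forms below.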

4. Empirical results
4.1 Changing the scale on the STE form
The first step we take to empirically study the impact of changing the answer scale
between the two STEs is to compare the means of the responses from the original STE
and the revised STE. We choose to start at this point since mean response values have
become a common measure to evaluate an instructor's performance relative to the other
instructors in the department. Table II presents the t-statistics for a
comparison-of-means test. As a number of questions were changed on the revised
form, this comparison is only presented for the questions that can be paired between
the original and the revised forms. A negative t-value implies that the mean on the
revised STE is lower than on the original STE.

Table I. Summary of student participation in study

                          Section
Student participation    1    2    3    4    5    6
Registered students     24   23   27   32   37   22
Completed revised STE   17   14   24   26   26   17
Completed original STE  17   14   17   17   20   12
Table II. Comparison of the mean responses from the original and the revised STEs

                                         Section
STE question                      1        2        3        4        5        6    Entire sample
Knowledge of subject            0.30    -1.04    -0.59    -1.96    -1.79    -0.72        -2.51*
Enthusiasm about subject       -1.26     0.00    -1.73    -2.22*   -2.22*   -2.75*       -3.96*
Clarity of explanations        -0.36     0.98    -1.10    -1.87    -1.94    -1.00        -2.56*
Attitude toward students        0.00    -0.37    -1.42    -2.13*   -1.85    -1.73        -3.11*
Opportunity for questions      -0.14     0.39    -0.95    -1.01    -1.65    -0.36        -2.06*
Compared to other instructors   0.64     0.92    -0.46    -3.00*   -1.51    -1.12        -2.02*
Organization of topics         -2.73*   -2.79*   -2.41*   -3.53*   -3.03*   -3.40*       -6.92*
Organization of lectures        0.20     0.00    -1.24    -2.01*   -2.09*   -0.92        -3.17*
Textbooks                       0.34    -0.33    -1.09    -1.35     0.18     0.24        -1.13
All instructor questions        1.50     3.48*   -1.87    -5.80*   -5.50*   -4.56*       -6.58*
All course questions           -4.23*   -3.94*   -4.78*   -6.98*   -5.23*   -3.33*      -11.53*
All questions                  -1.25     0.37    -4.43*   -8.60*   -7.28*   -4.19*      -11.85*

Notes: Table reports t-statistics. *Significant at the 5 per cent level
Two measures that are used frequently in reporting the evaluation results to instructors are the composite of all instructor-related questions and the composite of all of the course-related questions. Table II shows the results for a comparison-of-means test for the instructor composite and the course composite in each section and for the aggregated data. The course composite score is significantly lower in all six sections. The instructor composite results are more mixed, with sections 1 and 2 showing a higher average on the revised STE (only one was significant) and sections 3-6 showing a decrease. The results of a comparison-of-means test for the average of all 17 STE questions are also reported in Table II, with four of the six sections and the overall aggregate showing a significantly lower value for the revised STE. These results suggest that changing the scale from a four-point to a five-point scale resulted in generally lower composite evaluation mean scores. Note that we have converted the results from the original STE to the five-point scale in order to perform this test, and yet the significant effect of changing the scale is still observed. Lower scores were more pronounced in the course-related questions. This result may reflect the fact that the original evaluation form had several non-ordinal-scale questions in the course-related section, which may have contributed to the apparent bias toward higher scores in the original STE.
To complete a more thorough analysis of the impact of the evaluation scale on the
overall evaluation scores, we match similar questions from the original STE to the
revised STE in order to compare the means of the responses for such pairs. Among the
17 questions on each STE, we identify nine questions that are comparable in content.
Table II reports the t-statistics from a comparison-of-means test for each section and
for the aggregate of all sections. Results from the revised STE were significantly lower
for eight of these questions and lower (but not significantly so) for the remaining question. This comparison of similar questions across the two STEs leads the
researchers to believe that introducing a five-point scale for answers may lead to a
decrease in the average evaluation from students compared with a four-point scale.
The overall implication of this analysis is that using the five-point scale on the
revised STE generates lower evaluation scores on most of the comparable questions,
on the instructor and course composites, and on the average of the entire evaluation
form. This implies that care should be used by each instructor when reporting his/her
STE results over time if a change to the evaluation scale occurs. Even if adjustments
are made to account for numerical differences between a four-point and five-point scale,
our results indicate that the five-point scale leads to lower STE mean scores, which could create the appearance that one's scores have decreased. Therefore, we believe
that comparisons should not be made for means from STEs that have changed their
scales over time. The results of the question-by-question comparison also imply that
students do appear to grade instructors differently (and lower) if they are given a
five-point Likert scale compared to the four-point scale that was used previously.

4.2 Question types: comparison questions vs evaluation questions


We observed that the STE forms contained several comparison questions that ask
students to compare the instructor or the course to the student's entire college experience.
This is different from the other STE questions that generally ask for an evaluation of a
specific aspect of the students experience with the course or the instructor, such as
ability to communicate or organization of the course. We test whether students answer
such comparison-type questions differently than evaluation-type questions.
In order to examine the equivalency of the two question types, we compared the mean of all evaluation-type questions on the original STE to the mean of all comparison-type questions on the original STE. We repeated this analysis for the revised STE. Table III presents t-statistics for a comparison-of-means test. A positive number in Table III indicates that evaluation-type questions have a higher average score than comparison-type questions. For the original STE this average was significantly higher in two of the sections and for the aggregate. For the revised STE this average was also significantly higher in two of the sections and for the aggregate.
These results imply that comparison-type questions tend to have a lower mean score. Adding such questions to a section of an STE, as was done in the course-related question section of the revised STE, could depress the overall section mean score. In contrast, a shift to more evaluation-type questions may increase STE scores.

4.3 Soliciting written comments


The original STE contained no specific written-comment section, but did allow space for students to make comments on any of the 17 scaled questions on the evaluation.
The revised STE continues this practice, but additionally includes a section that
solicits unstructured responses from students. In order to evaluate the impact of this
addition, we compare the average number of comments and the average number of
words contained in the comments provided by students on both the original STE and
the revised STE. Table IV reports t-statistics for a comparison-of-means test. Positive
values imply that the revised evaluations have more comments (or words) than the
original evaluations.
The results reported in Table IV indicate that both the number of comments and the
number of words increased with the inclusion of the specific written comment section
for all sections. Such a result is bound to have a positive impact for the department's instructors, as they will receive more feedback on both the practices that students like and the practices that students think can be improved.
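To make the comparison reported in Table IV concrete, the short sketch below counts the comments and words on a single evaluation form; treating each form's written feedback as a list of free-text strings is an assumption made for illustration.

```python
# Count the number of non-empty comments and the total number of words
# written on one evaluation form; the sample answers are hypothetical.
def comment_stats(comments):
    nonempty = [c.strip() for c in comments if c and c.strip()]
    n_comments = len(nonempty)
    n_words = sum(len(c.split()) for c in nonempty)
    return n_comments, n_words

# Hypothetical answers to the four open-ended questions on the revised STE.
form = ["Clear examples in every lecture.", "",
        "Post the slides before class.", "More practice problems please."]
print(comment_stats(form))   # -> (3, 14)
```

Per-form counts of this kind, averaged by section, can then be compared across the original and revised STEs with the same comparison-of-means test used earlier.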
Table III. Comparison of means for evaluation-type and comparison-type questions

                        Section
                1       2       3       4       5       6    Entire sample
Revised STE   3.03*   2.96*   1.83    1.56    0.03    0.78        3.39*
Original STE  1.77    3.11*   2.41*   0.41    1.30   -0.99        3.63*

Notes: Table reports t-statistics. *Significant at the 5 per cent level

Table IV. Comparison of written comments between the original and the revised STE

                         Section
                      1       2       3       4       5       6    Entire sample
Number of comments  3.15*   3.82*   6.06*   5.15*   3.57*   3.85*       9.89*
Number of words     2.83*   2.55*   3.59*   5.30*   3.68*   2.93*       8.16*

Notes: Table reports t-statistics. *Significant at the 5 per cent level

The previous empirical analysis provides three specific results of interest to departments that are considering a change to student evaluation instruments. First, we
find that a shift from a four-point scale to a five-point Likert scale leads to a decrease in evaluation scores even after making an adjustment for scaling. This means that departments and instructors should be careful before drawing conclusions when the scale on the STE is revised. Second, we find that students tend to give lower scores on comparison-type questions that ask for a comparison of the instructor or the course to the student's entire college experience. When making a change in STEs, one should consider the ratio of evaluation-type questions to comparison-type questions. A larger share of such comparison-type questions may depress the mean scores on composite evaluations. Third, soliciting written feedback in a specific section of the STE is an effective way to increase both the number of written comments and the size of each comment. Providing space for students to leave comments without directly asking them specific written-comment questions is not sufficient to elicit responses from students.

5. Implications for academic programs and best practices


Student teaching evaluations are routinely used in tenure, promotion, and merit pay decisions (e.g. Dommeyer et al., 2002; Seldin, 1993). They also serve as an
assessment tool for faculty striving to improve teaching effectiveness and course
design and are often used as a source of data for research in the scholarship of teaching
and learning. This study describes a review of STEs and procedures in a business
school that uncovered a number of issues ranging from the scale of evaluation to the
nature of the questions. We reviewed the relevant literature, highlighted the changes
made to the STE, and evaluated data from several sections of undergraduate courses
that used both the original and the revised STEs. In order to assist academic
institutions and particularly business schools in the review of the STE process, the
following best practices are recommended.
First, the STE should not be used as the sole measure of teaching effectiveness. While many researchers (e.g. D'Apollonia and Abrami, 1997; Marsh, 1987) agree that a well-constructed STE instrument can serve as a useful indicator of teaching effectiveness, others demonstrate that experience, age of the instructor, timing of the course, and area of study may affect STE scores (McPherson et al., 2009; Kozub, 2010). Our empirical analysis shows that STE results are also sensitive to the number of answer choices available to the student in the STE scale, and to whether questions are of the comparison type or not. A departmental system of teaching effectiveness evaluation should include additional components such as student focus groups, classroom visits, etc. (King et al., 1999). The business school discussed in this study recommended measuring the teaching effectiveness of faculty for the purposes of tenure and promotion with multiple measures, such as teaching portfolios that include evidence from classroom visits and examples of course design. For assurance of student learning, it was recommended that STE data be complemented with course-embedded assessment techniques and external standardized testing such as the ETS Major Field Test in Business.
Second, the STE should utilize a standard scale, such as a five-point Likert scale.
The scale should be ordinal in order to facilitate comparison of means for each
question. The scale should be comparable across the questions on the form and across
forms for the various departments at a campus. If a five-point Likert scale is used, all departments should have the same order for the scale, such as five as the best possible answer and one as the worst. Our empirical results suggest that changes to the evaluation scale significantly affect the mean scores of the responses, making STE results non-comparable for the samples before and after the changes in scale, even if scaling adjustments are made. Thus, departments that rely on STE instruments in the evaluation of faculty should not directly compare mean scores from samples that involve different scales.
Third, the questions on the form should have a theoretical base. One example of such a base is the two-factor model of teaching effectiveness in which the factors are
intellectual excitement and interpersonal rapport (Lowman, 1994). In applying this
model, the questions are designed to measure the effectiveness of teachers in:
(1) Course presentation and organization.
(2) Concern for and rapport with the students as a factor of student motivation.

Other possible applications of theory to the design of an evaluation form include the
TALQ process described by Frick et al. (2010) or the holistic approach of Patel (2003).
Fourth, the questions should be written clearly. Clear phrasing of the question is
especially important if the STE instrument uses a standard scale for the answers. The
questions should not rely on providing additional explanations or qualifications in the
answer choices, as these make the data less comparable across questions.
Fifth, numerical STE scores should be supplemented by written comments. It is recommended to establish a separate written-comment section that includes specific open-ended questions. Our empirical analysis demonstrates that including
such a section significantly increased the number of comments and the size of
comments in our sample. Increasing the number of written comments enhances the
assessment information available to the instructors and helps improve course design
and teaching effectiveness.
Following these best practices makes STE data more useful both as a summative
measure used in administrative decisions about faculty tenure, promotion, and merit
pay and as a formative measure used by faculty looking to improve their teaching
skills and course design. Research on the validity and application of STEs is ever-growing, and continuous review and improvement of the process helps ensure that STEs remain helpful in improving student learning in Business programs and other areas of higher education.
The limitations of this study include the fact that all empirical data were collected in
a public US university. International experiences with STE scales and question types
may be different, and additional research is needed to examine the applicability of the
findings to international data. Another limitation of this study is its scope, which is
empirical and is not designed to provide new theoretical developments. A direction for
future research is the development of a theoretical model that addresses the empirical
issues discovered in this study, including the fact that a change in scale may lead to significantly lower evaluation scores even after adjustment for scaling, and the fact
that lower scores are observed on comparison-type questions.

Note
1. The original STE and the revised STE form are available from the authors on request.
References
AACSB (2010), "Eligibility procedures and accreditation standards for business accreditation", available at: www.aacsb.edu/accreditation/business_standards.pdf (accessed 12 December 2010).
Abrami, P., D'Apollonia, S. and Rosenfield, S. (1997), "The dimensionality of student ratings of instruction: what we know and what we do not", in Perry, R. and Smart, J. (Eds), Effective Teaching in Higher Education: Research and Practice, Agathon Press, New York, NY.
Badri, M., Abdulla, M., Kamali, M. and Dodeen, H. (2006), "Identifying potential biasing variables in student evaluation of teaching in a newly accredited business program in the UAE", International Journal of Educational Management, Vol. 20, pp. 43-59.
Becker, W. and Watts, M. (1999), "How departments of economics evaluate teaching", American Economic Association Papers and Proceedings, Vol. 89, pp. 344-9.
Bedggood, R. and Pollard, R. (1999), "Uses and misuses of student opinion surveys in eight Australian universities", Australian Journal of Education, Vol. 43, pp. 129-41.
Cranton, P. and Smith, R. (1986), "A new look at the effect of course characteristics on student ratings of instruction", American Educational Research Journal, Vol. 23, pp. 117-28.
D'Apollonia, S. and Abrami, P. (1997), "Navigating student ratings of instruction", American Psychologist, Vol. 52 No. 11, pp. 1198-208.
Dommeyer, C., Baum, P., Chapman, K. and Hanna, R. (2002), "Attitudes of business faculty towards two methods of collecting teaching evaluations: paper vs online", Assessment and Evaluation in Higher Education, Vol. 27, pp. 455-62.
Erdle, S., Murray, H. and Rushton, J. (1985), "Personality, classroom behavior, and student ratings of college teaching effectiveness", Journal of Educational Psychology, Vol. 11, pp. 394-407.
Feldman, K. (1976), "The superior college teacher from the students' view", Research in Higher Education, Vol. 5, pp. 243-88.
Frey, P. (1978), "A two-dimensional analysis of student ratings of instruction", Research in Higher Education, Vol. 9, pp. 69-91.
Frick, T., Chadha, R., Watson, C. and Zlatkovska, E. (2010), "Improving course evaluations to improve instruction and complex learning in higher education", Educational Technology Research and Development, Vol. 58, pp. 115-36.
Husbands, C. and Fosh, P. (1993), "Students' evaluation of teaching in higher education: experiences from four European countries and some implications of the practice", Assessment and Evaluation in Higher Education, Vol. 18, pp. 95-114.
King, M., Morison, I., Reed, G. and Stachow, G. (1999), "Student feedback systems in the business school: a departmental model", Quality Assurance in Education, Vol. 7, pp. 90-8.
Koh, H. and Tan, T. (1997), "Empirical investigation of the factors affecting SET results", International Journal of Educational Management, Vol. 11, pp. 170-8.
Kozub, R. (2010), "Relationship of course, instructor, and student characteristics to dimensions of student ratings of teaching effectiveness in business schools", American Journal of Business Education, Vol. 3, pp. 33-41.
Kulik, J. (2001), "Student ratings: validity, utility, and controversy", New Directions in Institutional Research, Vol. 109, pp. 9-25.
Liaw, S. and Goh, K. (2003), "Evidence and control of biases in student evaluations of teaching", International Journal of Educational Management, Vol. 17, pp. 37-43.
Lowman, L. (1994), "Professors as performers and motivators", College Teaching, Vol. 42, pp. 137-41.
McPherson, M., Jewell, R. and Kim, M. (2009), "What determines student evaluation scores?", Eastern Economic Journal, Vol. 35, pp. 37-51.
Marsh, H. (1987), "Students' evaluations of university teaching: research findings, methodological issues, and directions for research", International Journal of Educational Research, Vol. 11, pp. 253-88.
Marsh, H. and Dunkin, M. (1997), "Student evaluations of university teaching: a multidimensional perspective", in Perry, R. and Smart, J. (Eds), Effective Teaching in Higher Education: Research and Practice, Agathon Press, New York, NY.
Onwuegbuzie, A., Daniel, L. and Collins, K. (2009), "A meta-validation model for assessing the score-validity of student teaching evaluations", Quality and Quantity, Vol. 43, pp. 197-209.
Patel, N. (2003), "A holistic approach to learning and teaching interaction: factors in the development of critical learners", International Journal of Educational Management, Vol. 17, pp. 272-84.
Seldin, P. (1993), "The use and abuse of student ratings of professors", The Chronicle of Higher Education, Vol. 39 No. 46, p. A40.
Wachtel, H. (1998), "Student evaluation of college teaching effectiveness: a brief review", Assessment and Evaluation in Higher Education, Vol. 23, pp. 191-211.

About the authors


Dmitriy V. Chulkov is an Associate Professor of Economics at Indiana University Kokomo. He
earned his doctorate from the Krannert Graduate School of Management at Purdue University.
His research interests are in information economics and the scholarship of teaching and learning.
Dmitriy V. Chulkov is the corresponding author and can be contacted at: dchulkov@iuk.edu
Jason Van Alstine is an Assistant Professor of Economics at Indiana University Kokomo. He
earned his doctorate from Indiana University in Bloomington and focuses on research in public
economics and economics of education.

