
BEHAVIORAL RESEARCH IN ACCOUNTING

Volume 16, 2004


pp. 75–88

Debiasing Balanced Scorecard Evaluations


Michael L. Roberts
Thomas L. Albright
Aleecia R. Hibbets
The University of Alabama
ABSTRACT: Lipe and Salterio (2000) found that superiors disregarded half of the
information when using a Balanced Scorecard to evaluate the performance of two
divisional managers. Only common measures affected the superiors' holistic evaluations, defeating the purpose of the Balanced Scorecard. Our study examines whether
disaggregating the Balanced Scorecard results in evaluations consistent with the intent
of the Balanced Scorecard approach. Results indicate the disaggregated strategy allows superiors to utilize unique as well as common measures, thus overcoming the
common-measures bias. In addition, we find Balanced Scorecard performance evaluations explain more than half the variation in subsequent compensation decisions.

INTRODUCTION
Kaplan and Norton (1996) observe that many corporate managers rely on financial measures
alone to evaluate subordinates' performance, disregarding key elements in the corporation's
strategic mission and inadvertently emphasizing measures that lag, instead of lead, actual
firm performance. Kaplan and Norton (1996) created the Balanced Scorecard (BSC) to enable
managers to utilize strategically important nonfinancial as well as financial measures. A central
premise behind the BSC is that each business unit of a firm should develop its own scorecard with
measures that capture the unit's unique strategy. The tool is now widely used in organizations (Silk
1998).
However, Lipe and Salterio (2000) demonstrated that M.B.A. students assigned to the role of
superiors using the BSC disregard measures unique to particular divisions. Superiors relied only on
the items appearing on both divisions' scorecards. Half of the measures included in the scorecards,
which were unique or specific to a single division, were ignored. Because all of the items on a BSC
are assumed to be critically important measures of strategic performance, this common-measures
bias undercuts its potential usefulness.
Lipe and Salterio (2000) attribute the common-measures bias they found to the superiors' need
to employ simplifying cognitive strategies. The purpose of this study is to examine a potential
approach to debias performance evaluations using the BSC. We use a disaggregated/mechanically
aggregated Balanced Scorecard (hereafter Disaggregated Balanced Scorecard) in which participants: (1) evaluate performance separately for each of 16 performance measures and then (2)
mechanically aggregate the separate judgments using pre-assigned weights for each measure. Following
this disaggregation-plus-mechanical-aggregation, participants make an overall evaluation. Thus, we
examine whether the common-measures bias found by Lipe and Salterio (2000) when the BSC is

The authors gratefully acknowledge the cooperation of Marlys Lipe in providing copies of experimental materials from Lipe
and Salterio (2000).


Roberts, Albright, and Hibbets

used to make holistic judgments can be overcome by utilizing a prior disaggregated, mechanically
aggregated information processing strategy.
M.B.A. students role-playing superiors in our study weighted unique measures consistently with
the BSC guidelines they were given for both unique and common measures, in contrast to Lipe and
Salterio (2000). Thus, our findings suggest disaggregating the steps involved in performing BSC
evaluations can overcome common-measures bias. Disaggregating the process, therefore, is one
approach for improving effectiveness of the BSC. Using disaggregated steps was not suggested by
the BSC's originators, Kaplan and Norton (1996).
We also extend Lipe and Salterio (2000) to examine the influence of BSC performance evaluations on subsequent compensation decisions. Although Kaplan and Norton suggest the BSC should
affect compensation, they provide no guidelines for this linkage (Kaplan and Norton 1996; Lipe and
Salterio 2000). We find superiors' performance evaluations using the disaggregated BSC strategy
explain slightly more than half of the variation in superiors' decisions to distribute a bonus to
division managers. Performance and bonus allocations are highly correlated.
The remainder of the paper is organized into five sections. The next section reviews relevant
literature and presents hypotheses. In the third section, we describe our research methods. Then we
present results of our experiment, tests of hypotheses, and supplemental analyses of related questions. In the fourth section, we discuss implications and limitations and offer suggestions for future
research. In the final section, we present our conclusions.
LITERATURE REVIEW AND HYPOTHESIS DEVELOPMENT
Cognitive Demands in Comparative and Individual Judgments
Prior research in psychology has shown that decision makers faced with comparative evaluations tend to use information common to both objects and to underweight information unique to each
object (Slovic and MacPhillamy 1974). Dominance of the common information was found only
when objects were evaluated in pairs. The same information item did not dominate when each object
was evaluated individually.
Lipe and Salterio's (2000) (hereafter Lipe and Salterio) participants were older, were M.B.A. students with an average of five years of work experience, and arguably were more knowledgeable
about their task than Slovic and MacPhillamy's (1974) undergraduate participants. Lipe and Salterio
instructed their participants to evaluate two retail divisional managers independently, in contrast to
Slovic and MacPhillamy (1974), whose task involved choosing which of two candidates would be
more successful. However, Lipe and Salterio's experimental materials presented participants with
Balanced Scorecards for both division managers before they evaluated each manager's performance.
Lipe and Salterio's results were consistent with Slovic and MacPhillamy (1974). Lipe and
Salterio found their M.B.A. participants, role-playing the part of superiors evaluating division managers, used the common measures but disregarded the unique measures in evaluating the performance of division managers using the BSC. Thus, Lipe and Salterio demonstrate the application of
common-measures bias in the BSC context, an important practical application.
Common measures may dominate in comparative evaluations for at least three related reasons.
First, they form a smaller subset of the total information, and it is cognitively easier to retain and
process less, rather than more, information (Anderson 1990). Second, not only does this result in less
total information, but also it may result in fewer categories or types of information to process (Lipe
and Salterio 2002). Third, common measures are the only information available to directly compare
the managers.
An Aid to Debiasing
Lipe and Salterio (2000, 287) suggested their subjects ignored unique measures in order to
reduce their effort to complete the evaluation tasks. One method for improving judgment quality

Behavioral Research in Accounting, 2004


when effort is insufficient is to use a decision aid (Kennedy 1995). Searching for the optimum
combination of human judgment and statistical modeling, Einhorn (1972) demonstrated improved
decision accuracy when human judges coded decision information into quantitative form and outputs
were generated using a mechanical combination rule. Bowman (1963) suggested combining man and
model by using clinical synthesis whereby the individual uses the output of a model as an input to
the individual's final judgment.
Application of Einhorn's (1972) and Bowman's (1963) suggested approaches to the BSC would
involve a two-step process: (1) disaggregate the evaluation decision into several smaller decisions
and (2) aggregate the smaller decisions into an overall score based on predetermined weights (e.g.,
Einhorn 1972; Lyness and Cornelius 1982; Edwards and Newman 1982). Step 1, disaggregating a
complex decision, would increase the extent to which each individual dimension is processed.
When focusing attention on one dimension, the decision maker's short-term working memory would
be free from simultaneously keeping information about other dimensions from decaying. This shift
in attention and processing capacity should facilitate greater total effort and ensure that effort is
exerted on all measures. This step should overcome common-measures bias to the extent the bias is
caused by failure to adequately attend to unique measures. In Step 2, the predetermined weights used
to aggregate the evaluations into an overall score should reinforce the importance of both common
and unique measures to the organization. It is thus more likely that both common and unique
measures will be used in subsequent holistic evaluations because decision makers will have already
incurred the processing cost of evaluating each dimension.
Disaggregated judgment strategies are more advantageous the more complex the judgment
required, even when complex judgments include as few as nine information cues (Lyness and
Cornelius 1982). In comparison, the BSC typically requires four to seven performance measures in
each of four categories (as suggested by Kaplan and Norton 1996). As a result, evaluators using the
BSC could potentially have 16 to 28 cues to process, holistically, in assessing the performance of a
firm manager. Thus, performance judgments using the BSC should be adequately complex to realize
the benefits of a disaggregated, mechanically aggregated judgment.¹
Disaggregated judgment plus mechanical aggregation both decreases and increases task demands. Cognitive demands at any one time are reduced because the amount of information to be
considered for evaluating each individual dimension is less than the information in the entire BSC.
However, the total time and effort required increases because the number of evaluations and computations increases.²
For example, to apply disaggregation-plus-mechanical-aggregation to Lipe and Salterio's BSC,
16 separate evaluations would be required for each of the two division managers (a total of 32
separate judgments, compared with only two holistic judgments in Lipe and Salterio). Then each of
the 16 evaluations would have to be extended by its decision weight, and the total of the 16 products
summed. A total of 96 evaluations and computations would be necessary.
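The arithmetic in this example can be sketched in Python. The ratings and the unit weights below are hypothetical placeholders rather than data from the study, and the tally assumes one judgment, one multiplication, and one addition per measure for each manager, which is one reading that reproduces the total of 96.

```python
# Illustrative sketch only: ratings and weights are hypothetical, not study data.

N_MEASURES = 16   # scorecard items per division manager
N_MANAGERS = 2    # two division managers are evaluated


def mechanically_aggregate(ratings, weights):
    """Return the weighted sum of the separate judgments."""
    if len(ratings) != len(weights):
        raise ValueError("each rating needs exactly one weight")
    return sum(r * w for r, w in zip(ratings, weights))


# Assumption: unit weights (1/16 each) stand in for pre-assigned weights.
unit_weights = [1 / N_MEASURES] * N_MEASURES
# Hypothetical 0-100 ratings for one manager.
ratings = [80] * 8 + [70] * 8
aggregated_score = mechanically_aggregate(ratings, unit_weights)

# Assumed tally: one judgment, one multiplication, one addition per measure.
judgments = N_MEASURES * N_MANAGERS
multiplications = N_MEASURES * N_MANAGERS
additions = N_MEASURES * N_MANAGERS
total_operations = judgments + multiplications + additions

print(aggregated_score, total_operations)
```

Under these assumptions the holistic approach requires two judgments, whereas the disaggregated approach requires 32 judgments plus the weighting and summing steps.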
Consistent with Kennedy's (1995) debiasing framework, we expect that providing superiors
with a disaggregated BSC will increase the total cognitive effort expended to evaluate all measures
prior to making holistic evaluations. Given this increased effort, we expect superiors to utilize all the
¹ We note reports that a few divisions of some firms have adopted weights for each scorecard item or categories of items
(Davis 2000; Kaplan 1997; Kaplan and Norton 2001, 256; Malina and Selto 2001). However, Kaplan and Norton do not
advocate weighting and scoring each scorecard item separately or any particular method or algorithm for aggregating
individual scores. Also, Lipe and Salterio (2002) demonstrate that when all measures within a BSC category are
consistently above or below target, BSC users tend to collapse performance on the individual items into a categorical
evaluation. In this situation, the total number of information cues for making an overall evaluation could be reduced
considerably, e.g., from 16 to four.
² In addition, Bonner et al. (1996) suggest reasonableness checks should be employed when using mechanically aggregated
judgments. Reasonableness checks ensure aggregation problems such as those encountered by Jiambalvo and Waller
(1984) and Daniel (1988) do not occur. For example, decision makers can be directed to adjust their disaggregated
judgments of risk components so total risk does not exceed 1.0. Another approach, which we employ, is to provide
weights for each subtask to be mechanically aggregated (Lyness and Cornelius 1982).


BSC measures rather than the strategy chosen by Lipe and Salterio's participants of utilizing only
half the BSC measures.
Based on the above, we test the following hypothesis in alternative form:
H1: Presenting the BSC in a disaggregated format will result in subsequent holistic
evaluations of managers' performance that reflect unique (as well as common)
measures.

We describe the specific methods we use to disaggregate-mechanically-aggregate the BSC in
section three. First, however, we describe an additional extension of Lipe and Salterio (2000).
Linking the BSC to Compensation
Conceptually, performance evaluation using the BSC should be linked to compensation of unit
managers (Kaplan and Norton 1996, 217). However, firms traditionally have implemented the BSC
on an experimental basis and have waited to become more familiar with the new performance
evaluation tool before changing compensation practices (Chow et al. 1997; McWilliams 1996). As a
result, Kaplan and Norton (1996) make no recommendations about how BSC evaluations should
apply to compensation decisions.
Supervisors may be reluctant to be tied to a formal evaluation tool that does not allow them to
compensate subordinates at their discretion. Therefore, it is important to determine whether superiors will follow the formal BSC procedure in making compensation decisions.
Lipe and Salterio did not test the theoretical linkage between performance evaluation and
compensation decisions in their study. Because the link between performance evaluation and compensation is seen as critical to ex ante decisions of managers (Lipe and Salterio 2000, 293), we test
this linkage directly with the following hypothesis:
H2: Superiors' holistic performance evaluations using the disaggregated BSC will affect
subsequent compensation decisions.

PROCEDURES
M.B.A. students were given a case involving two divisions of WCS Incorporated, a retail firm
specializing in women's apparel. The case was administered during class, prior to any instruction on
the Balanced Scorecard. No credit was given for participation, and responses were anonymous. This
approach is identical to Lipe and Salterio (2000), whose 58 first-year M.B.A. participants completed
a classroom case. The case was adapted from Lipe and Salterio, which had followed Kaplan and
Norton's (1996) Kenyon Stores example of a BSC implementation. Participants were asked to
assume the role of a senior executive of WCS who has recently participated in a Harvard Business
School symposium on the Balanced Scorecard. Participants were given the mission statement of
WCS³ and introduced to the two division managers. The case informed participants of the individual
divisions' strategies and presented each division's Balanced Scorecard.
Next, participants completed the two steps of the Disaggregated BSC: they (1) rated each
manager's performance on each of the 16 Balanced Scorecard items, using a scale from 0 ("Unacceptable") to 100 ("Excellent"), and then (2) multiplied these individual judgments by pre-determined
weights and summed the weighted scores to create a total, aggregated score for each division. Predetermined weights for the unique measures were 64 percent of the total. These two steps were not
used by Lipe and Salterio nor were they suggested by Kaplan and Norton (1996).

³ The mission statement reads, "We will be an outstanding apparel supplier in each of the specialty niches served by WCS."


Participants then made a separate overall assessment of each manager's performance, measured on a scale from 0 ("Reassign") to 100 ("Excellent"). This overall assessment was worded the same
and used the same scale as Lipe and Salterio's study and is used to test H1. This separate judgment
was elicited to give participants the opportunity to adjust their overall assessment if they were not
satisfied with the outcome of their mechanically aggregated score for any reason. Thus, participants
were not bound by the mechanical aggregation of their disaggregated judgments. They were free to
disregard them completely or to use them however they saw fit in evaluating each manager's
performance.
After making an overall evaluation of each manager's performance, participants allocated a
total year-end bonus fund of $100,000 between the two division managers. These allocations are
used to test H2. Then they completed follow-up questions about the case, provided demographic
information, answered manipulation checks, and responded to questions regarding task difficulty,
realism, and understandability.
Participants were given information about two of WCS's divisions, RadWear (RAD), specializing in teen clothing, and WorkWear (WORK), specializing in women's business uniforms. The
strategies for each division were presented, and performance measures appropriate for the division's
strategy were employed on each division's scorecard.
The design is 2 × 2 × 2, with two between-subjects factors (Common and Unique) and one
within-subjects factor (Division). The first between-subjects factor, Common, indicates whether
RadWear or WorkWear performs better on the common measures. The second between-subjects
factor, Unique, indicates whether RadWear or WorkWear performs better on the unique measures.
Each participant evaluated managers of both divisions; thus Division is the within-subjects factor.
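The cell structure of this design can be enumerated with a short Python sketch. The dictionary keys are hypothetical labels introduced here for illustration; only the factor names and division names come from the text.

```python
from itertools import product

DIVISIONS = ("RadWear", "WorkWear")

# Between-subjects cells: which division the common and the unique measures favor.
between_cells = [
    {"common_favors": common, "unique_favors": unique}
    for common, unique in product(DIVISIONS, repeat=2)
]

# Within-subjects factor: every participant evaluates both division managers.
observations = [
    {**cell, "division": division}
    for cell in between_cells
    for division in DIVISIONS
]

for cell in between_cells:
    print(cell)
```

Each participant falls into one of the four between-subjects cells and contributes two evaluations, one per division, giving eight cell-by-division combinations.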
Each scorecard contained 16 separate measures, four in each of the four categories. In each
category, two measures were common across divisions, and two measures were unique to each
division. For example, in the financial category, both divisions had measures for return on sales and
sales growth. The two measures unique to RadWear were new store sales and market share relative
to retail space; WorkWear's two unique financial measures were revenues per sales visit and catalog
profits. Both divisions performed better than target on all 16 measures. The percentage above target,
however, was varied so that either RadWear or WorkWear performed better as indicated in the
experimental design described above. The percentage above target was calculated to the second digit
and reported in a column of the scorecard. These percentages are identical to Lipe and Salterio
(2000).
With 16 common and unique measures, unit weighting would imply a weight of 6.25 percent for
each measure (100/16). Total weight for each of the four categories was set at 25 percent, and within
each category we varied the pre-assigned weights between 4.0 and 9.0 percent.⁴ These weights were
given to participants on the face of the Disaggregated BSC. Unique measures were assigned 64
percent of the total weights.⁵ A copy of the Disaggregated BSC is shown in the Appendix.
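A weighting scheme of this kind can be checked with a small Python sketch. The individual weights below are hypothetical values chosen only to satisfy the constraints stated in the text (25 percent per category, each weight between 4.0 and 9.0 percent, 64 percent on unique measures); they are not the weights from the Appendix, and the category names are the standard four BSC perspectives, assumed here for labeling.

```python
# Hypothetical weights in percent; NOT the weights used in the study.
CATEGORIES = ("financial", "customer", "internal processes", "learning and growth")

hypothetical_weights = {
    category: {"common": (4.0, 5.0), "unique": (7.0, 9.0)}
    for category in CATEGORIES
}


def summarize(weights, low=4.0, high=9.0):
    """Check the stated constraints and return (total, unique_share)."""
    total = 0.0
    unique_share = 0.0
    for category, groups in weights.items():
        category_total = sum(groups["common"]) + sum(groups["unique"])
        if category_total != 25.0:
            raise ValueError(f"{category} weights must total 25 percent")
        for w in groups["common"] + groups["unique"]:
            if not low <= w <= high:
                raise ValueError(f"weight {w} is outside {low}-{high} percent")
        total += category_total
        unique_share += sum(groups["unique"])
    return total, unique_share


total_weight, unique_weight = summarize(hypothetical_weights)
unit_weight = 100 / 16  # 6.25 percent per measure under unit weighting

print(total_weight, unique_weight, unit_weight)
```

With these illustrative values, every weight departs from the 6.25 percent unit weight, so a participant who simply weighted all measures equally would not reproduce the pre-assigned scheme.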
If the unique measures are used in the evaluation as hypothesized, an interaction of Division and
Unique should be observed. This is in addition to the interaction of Division and Common reported
by Lipe and Salterio (2000).
Participants
Eighty-one (81) M.B.A. students participated in the experiment. Seventy-nine (79) useable
responses are reported below because one participant failed to complete the overall performance
evaluation for both managers, and another participant did not provide disaggregated scores for
⁴ Large discrepancies among individual item weights result in a perception by users that the BSC is not, in fact, balanced
and result in ignoring low-weight items (Malina and Selto 2001, 71).
⁵ Fifty percent of the 16 items on the scorecard are unique to each division. We weighted these items slightly more than 50
percent to ensure that participants were not merely resorting to unit-weighting.


WorkWear. Twenty-five (25) participants were Executive M.B.A. students; 54 were regular M.B.A.
students. We tested for potential systematic differences in these two participant groups on the
variables of interest by including degree program as a variable in each statistical model. No significant differences were observed for degree program; therefore, the two groups were collapsed for the
analysis reported below.
Mean age was 27.6 (median, 25.0), with 5.1 years of work experience (median, 2.0). Seventy-three (73) percent of participants were male. Fifty-three (53) percent indicated prior experience in
making performance evaluations.
Attention and Manipulation Checks
Overall, participants regarded the case as realistic, easy to understand, and not difficult to
complete. The mean score for realism was 2.2 on a scale from −5.0 to 5.0, with 5.0 indicating
participants strongly agree the case is realistic. The mean score for understandability was 3.1, and
the mean score for difficulty was 2.6. Participants also agreed the BSC items were usefully categorized (mean of 2.6), that RadWear and WorkWear target different markets (mean score 3.7), used
different measures (mean score of 3.1), and should use different measures (mean score of 3.4). All
means were significantly different from zero (p < 0.01).
We also checked each participant's multiplication and addition of the weighted scores for the
mechanical aggregation (Step 2) part of the task. For RadWear, 70 of the 79 participants' mechanically aggregated scores were within ±1.0 of our recalculation, and all 79 were within ±5.0. For
WorkWear, 75 participants were within ±1.0 of our recalculation, and all 79 were within ±6.0.
RESULTS
Disaggregation Strategy
Table 1 presents the results of the repeated measures ANOVA (compare to Lipe and Salterio
2000, Table 3). If the Disaggregated BSC is successful in preventing the common-measures bias
observed by Lipe and Salterio, there should be a significant interaction between Unique measures
and Division. As shown in Panel A, both the Division × Unique interaction (F = 30.51, p < 0.01) and
the Division × Common interaction (F = 12.81, p < 0.01) are significant. Therefore, our
results provide evidence that both common and unique measures are important in explaining differences in overall evaluation scores. This result differs from Lipe and Salterio, who found significance
only on common measures. (Note: None of the between-subjects tests shown in Panel A are significant, nor is the three-way within-subject interaction. This is a result of the balanced experimental
design and is expected.)
Panel B of Table 1 reports means to illustrate direction and magnitude of the results. Consistent
with Lipe and Salterio, when common measures favor RadWear, superiors rank RadWear's manager
2.28 points higher than WorkWear's manager. Likewise, when common measures favor WorkWear,
superiors rank WorkWear's manager 2.58 points higher than RadWear's manager. These differences
for common measures are marginally significant, p = 0.05.
However, in contrast to Lipe and Salterio, our results indicate that when unique measures favor
RadWear, superiors rank RadWear's manager 3.75 points higher than WorkWear's manager. Likewise, when unique measures favor WorkWear, superiors rank WorkWear's manager 4.0 points
higher than RadWear's manager. These differences for unique measures are significant, p < 0.01.
To further examine the relative influence of the common and unique measures, we regressed
differences in superiors' overall performance evaluations on common and unique measures. Lipe
and Salterio reported a significant positive slope coefficient from regression of 10.87 for Common
measures (t = 3.28, p < 0.01), but an insignificant coefficient for Unique measures, 0.08 (t = 0.02,
p > 0.10). In contrast, as shown in Table 2, both Common and Unique measures in our study have
significantly positive slope coefficients: 5.18 (t = 3.63, p < 0.001) and 8.00 (t = 5.67, p < 0.001) for
Common and Unique, respectively.


TABLE 1
Influence of Common and Unique Measures on Subjective Overall Performance Evaluations Using
Disaggregated Balanced Scorecard

Panel A: Results of a 2 × 2 × 2 Repeated Measures ANOVA of Subjective Overall Performance
Evaluations of RadWear's and WorkWear's Division Managers

Variable                        df          SS        MS       F         p
Between Subjects
  Common                         1        0.77      0.77    0.00      0.95
  Unique                         1       87.98     87.98    0.39      0.54
  Common × Unique                1       50.11     50.11    0.22      0.64
  Error                         74   17,004.39   2226.73
Within Subjects
  Division                       1        3.36      3.36    0.17      0.68
  Division × Common              1      258.76    258.76   12.81    0.0006
  Division × Unique              1      616.10    616.10   30.51   <0.0001
  Division × Common × Unique     1       10.86     10.86    0.54      0.47
  Error                         74    1,514.50     20.19

Panel B: Mean Subjective Overall Performance Evaluations of RadWear's and WorkWear's Division
Managersᵃ

Measures                            Favor RadWear      Favor WorkWear
Common
  RadWear                           79.12ᵇ (10.90)     76.56 (11.46)
  WorkWear                          76.85 (12.11)      79.12 (10.13)
  Difference: RadWear − WorkWear     2.28              −2.58
  T-test p-value                     0.05               0.05
Unique
  RadWear                           79.03 (11.57)      76.71 (10.74)
  WorkWear                          75.28 (11.72)      80.75 (10.03)
  Difference: RadWear − WorkWear     3.75              −4.0
  T-test p-value                     0.005             < 0.0001

ᵃ Overall evaluations made on a 101-point scale, with 0 labeled "Reassign" and 100 labeled "Excellent."
ᵇ Panel values are means (standard deviation). Common measures appear on both divisions' balanced scorecards; Unique
measures appear on only one division's balanced scorecard. "Favor RadWear" indicates the measures were higher for the
RadWear division than the WorkWear division. "Favor WorkWear" indicates the measures were higher for the WorkWear
division than the RadWear division.
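As a reading aid for Panel A, the following Python sketch recomputes each within-subjects F-statistic as the effect mean square divided by the within-subjects error mean square. The numbers are copied from the table; the variable and function names are illustrative, and agreement is expected only up to rounding of the published values.

```python
# Within-subjects rows of Table 1, Panel A: (mean square, reported F).
WITHIN_ERROR_MS = 20.19
effects = {
    "Division": (3.36, 0.17),
    "Division x Common": (258.76, 12.81),
    "Division x Unique": (616.10, 30.51),
    "Division x Common x Unique": (10.86, 0.54),
}


def f_ratio(effect_ms, error_ms):
    """F-statistic for a one-df effect: effect mean square over error mean square."""
    return effect_ms / error_ms


recomputed = {
    name: f_ratio(ms, WITHIN_ERROR_MS) for name, (ms, _) in effects.items()
}
# Largest absolute gap between a recomputed and a reported F-statistic.
max_gap = max(abs(recomputed[name] - reported) for name, (_, reported) in effects.items())

for name, value in recomputed.items():
    print(f"{name}: F = {value:.2f}")
```

The two interaction terms of interest, Division × Common and Division × Unique, are the rows with large mean squares relative to the error term.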

Based on the results shown in Tables 1 and 2, we conclude the Disaggregated BSC is effective
in eliminating the common-measures bias Lipe and Salterio found when the BSC is used for holistic
performance evaluations.
Bonus Distribution (Allocation)
Our second hypothesis examines the influence of performance evaluations on the bonus allocation. We calculated the difference in managers' bonuses assigned by each participant. We regressed
this difference on the differences in managers' overall performance evaluations assigned by each

TABLE 2ᵃ
Comparison of Relative Weights of Common and Unique Measures on Differences in
Subjective Overall Evaluations of Division Managers: Regression Analysis Results

Source             df   Sum of Squares   Mean Square   F-value    Pr > F
Model               2         1,724.16        862.08     21.48    <.0001
Error              76         3,050.72         40.14
Corrected Total    79         4,774.88
R²                0.36
Adj. R²           0.34

Variable           df   Parameter Estimate   Standard Error   t-value   Pr > |t|
Intercept           1                 6.91             1.30      5.33     <.0001
Common              1                 5.18             1.43      3.63     0.0005
Unique              1                 8.00             1.43      5.67     <.0001

ᵃ We obtained the same results when using the difference between the mechanically aggregated scores as the dependent
variable.
The dependent variable is the difference in the overall evaluations of RadWear's and WorkWear's Division managers'
performance for the past year on a 101-point scale, with 0 labeled "Reassign" and 100 labeled "Excellent."
Common = a 0/1 dummy variable indicating the particular division scored low (high) on the eight Balanced
Scorecard measures that appeared on both divisions' scorecards; and
Unique = a 0/1 dummy variable indicating the particular division scored low (high) on the eight Balanced
Scorecard measures that were unique to that division, i.e., did not appear on both divisions' scorecards.

participant using the Disaggregated BSC (PerformDiff), controlling for differences in each manager's
mechanically aggregated score (AggScDiff). Table 3 reports the regression results. The performance-compensation model is significant, F = 48.84, p < 0.0001. Managers' overall evaluation
scores were significant (p < 0.0001). Mechanically aggregated scores, included as a control variable,
were marginally significant (p = 0.07). Interestingly, the model explains only 55 percent of the
variance in bonus differences. Thus, superiors appear to use the Disaggregated BSC performance
evaluations as part of their judgment models for assigning bonuses, but they are either inconsistent in
their application of performance evaluation information or they adjust bonus allocations for additional factors not included in the BSC.⁶
Supplemental Analyses
By design, the mechanically aggregated BSC scores represent an input to the superiors' performance and compensation decisions. Superiors' final decisions were made separately from the mechanical aggregation. Importantly, their decisions were framed as an overall (holistic) evaluation.
This distinction raises the question to what extent the overall performance evaluations are affected
by the preliminary, mechanically aggregated BSC score.
To address this linkage, we correlated the superiors' subjective, overall evaluations of each
manager's performance with their mechanically aggregated score for the same manager. Coefficients
of correlation for RadWear were 0.74 (p < 0.0001) and for WorkWear, 0.84 (p < 0.0001). Thus, for
each division manager, the mechanically aggregated scores are significantly correlated with the
⁶ Debriefing conversations following the experiment revealed that some participants may have rated WorkWear's performance as slightly better than RadWear's because WorkWear was described as more stable and less growth-oriented and,
thus, may have been at a disadvantage in achieving above-target performance. A paired t-test for differences in bonus
compensation awarded to each division manager, however, was not significant (p = 0.28).


TABLE 3
Influence of Disaggregated Balanced Scorecard Performance Evaluations
on Difference in Managers' Bonuses: Regression Analysis Results

Source             df    Sum of Squares     Mean Square   F-value    Pr > F
Model               2     7,071,283,943   3,535,641,971     48.84    <.0001
Error              76     5,502,293,601      72,398,600
Corrected Total    78    12,573,577,543
R²                0.56
Adj. R²           0.55

Variable           df   Parameter Estimate   Standard Error   t-value   Pr > |t|
Intercept           1             1,580.13           965.64      1.64     0.1059
PerformDiff         1               955.50           179.41      5.33     <.0001
AggScDiff           1               451.11           245.10      1.84     0.0696

The dependent variable is the difference in the dollar amounts of a total annual bonus of $100,000 that was available to
allocate between the two managers.
PerformDiff = the difference between the two managers' subjective overall performance evaluations using the
disaggregated Balanced Scorecard; and
AggScDiff = the difference between the two managers' mechanically aggregated scores using the disaggregated
Balanced Scorecard.
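The summary statistics reported in Tables 2 and 3 follow from the sums of squares and degrees of freedom by standard formulas. The Python sketch below recomputes them; the function name is illustrative, the inputs are copied from the tables, and the total degrees of freedom are taken as model plus error degrees of freedom, which is an assumption of this sketch.

```python
def regression_summary(ss_model, ss_error, df_model, df_error):
    """Return (R-squared, adjusted R-squared, F-value) from an ANOVA table."""
    ss_total = ss_model + ss_error
    df_total = df_model + df_error          # assumed: n - 1
    r_squared = ss_model / ss_total
    # Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    adj_r_squared = 1 - (1 - r_squared) * df_total / df_error
    f_value = (ss_model / df_model) / (ss_error / df_error)
    return r_squared, adj_r_squared, f_value


# Table 2: differences in overall evaluations regressed on Common and Unique.
table2 = regression_summary(1724.16, 3050.72, 2, 76)
# Table 3: bonus differences regressed on PerformDiff and AggScDiff.
table3 = regression_summary(7_071_283_943, 5_502_293_601, 2, 76)

for label, (r2, adj_r2, f) in (("Table 2", table2), ("Table 3", table3)):
    print(f"{label}: R2 = {r2:.2f}, adj. R2 = {adj_r2:.2f}, F = {f:.2f}")
```

The adjusted R² of 0.55 for Table 3 corresponds to the statement in the text that the model explains 55 percent of the variance in bonus differences.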

subjective, overall evaluations. Both correlations are less than 1.0, however, indicating superiors'
holistic evaluations included some mental adjustment of their mechanically aggregated scores or, at
least, they were not perfectly consistent.⁷
Previous research has found disaggregating decisions increases consensus and inter-judge agreement (Libby and Libby 1989; Davis 1998). We compared standard deviations for our participants'
evaluations (Table 1, Panel B) with those reported by Lipe and Salterio (2000, Table 3, Panel B). F-statistics were significant for only one of the eight comparisons (p < 0.05). Thus, we conclude that
disaggregating BSC evaluations does not reduce variation among evaluators. We note, however, that
our participants made use of twice the number of BSC items as Lipe and Salterio's participants.
Also, the standard deviations available for comparison with Lipe and Salterio are averages across
two experimental cells, which would necessarily indicate less variation than individual cell means.
IMPLICATIONS, LIMITATIONS, AND SUGGESTIONS
Implications
Lipe and Salterio (2000) note the common measures employed in the BSC tend to be more
traditional financial measures, like return on sales and average markdowns, and that these measures
tend to lag actual performance. In contrast, the unique measures, such as sales from new market
leaders and market share relative to retail space, tend to be nontraditional and, more importantly,
leading indicators of performance that capture elements of corporate and division strategic emphasis
not captured elsewhere. Thus, ignoring the unique measures in the BSC is tantamount, in many
cases, to ignoring many leading indicators and focusing managerial attention more on lagging
indicators.
To be effective as a management control device, the BSC should result in evaluations that are
accurate, objective, and verifiable (Malina and Selto 2001, 75). Significant conflict and tension
between superiors and evaluatees were observed when evaluation was perceived as subjective.
Perceptions of subjectivity led to rejection of the BSC and return to financial performance measures at
another large firm (Ittner et al. 2003).

7 See footnote 6.

Using the disaggregated Balanced Scorecard, our participants utilized the unique factors to a
substantial extent. While two other studies find training (Dilla and Steinbart 2002) and explicit
communication of the importance of all BSC measures (Roberts et al. 2002) can improve utilization
of unique measures, both these latter studies find common measures account for two to four times
more variation in evaluations than unique measures. BSC items were not explicitly weighted in either
of these studies. In contrast, the present study demonstrates that weights established as part of the BSC
design enable decision makers to place an equal or greater weight on unique measures, consistent
with company strategy. To the extent unique measures represent leading indicators, the disaggregated BSC will enable managers to intervene sooner when divisions encounter problems and to
attempt corrective action.
Limitations
The results of this study are limited to comparative evaluations. As discussed above, Slovic and
MacPhillamy's (1974) findings of common-measures bias did not hold when individuals, rather than
pairs, were evaluated. Thus, when the BSC is used to evaluate divisions individually, an important
condition leading to common-measures bias will be absent. Also, the participants in this experiment
did not have personal experience with the managers being evaluated nor individual accountability
for their performance evaluations and compensation decisions. Accountability has positively
affected decision making in some related contexts, such as when decision aids are not available
(Ashton 1990) and when decision makers sequentially process several positive and negative
information items (Kennedy 1993). Finally, though our participants are similar to Lipe and Salterio's
(2000), i.e., M.B.A. students at a major public university, there may be other differences in the
participants, timing, or setting of the two experiments of which we are unaware and that we
have not considered.
Suggestions
We used a two-part, disaggregated-mechanically-aggregated decision aid strategy consistent
with earlier research on judgments of man versus models of man (Ashton 1982, 34-43). In our
approach, however, human decision makers perform the aggregation, as suggested by Bowman
(1963), prior to making subjective, overall evaluations. Thus, common-measures bias could possibly
be mitigated by either (1) requiring BSC users to evaluate performance on each BSC measure or
(2) suggesting weights for each measure. Future research could test whether common-measures bias
can be reduced or overcome by one of these approaches alone. We note, however, one study found
that requiring disaggregated judgments without providing a mechanism for combination resulted in
decreased judgment quality compared to holistic judgments (Lyness and Cornelius 1982). Also,
providing suggested weights would likely produce a result similar to a reminder to use all the
measures (Roberts et al. 2002).
Additionally, superiors could be asked to evaluate performance for each BSC category, i.e., to
evaluate performance on four items at a time, and then make a holistic judgment. Theoretically, this
would substantially lessen the amount of information to be processed at each stage, thereby reducing
the need for the cognitive simplifying strategies observed in Lipe and Salterio's (2000) study.
Future research should examine the extent to which mechanical aggregation is acceptable to
managers and superiors. The influences of factors extraneous to the stated BSC measures should also
be addressed. In this study, the mechanically aggregated scores explained slightly more than 50
percent of the variation in overall evaluations of performance for one division (RadWear) and
70 percent of the variation in performance evaluation for the other division (WorkWear). Perhaps
participants view the teenage RadWear market as more volatile than the WorkWear market,
resulting in greater variance in evaluations of RadWear, or superiors could be reacting negatively to some
items on the BSC. They may discount the BSC somewhat since, in this experiment, they were not
active participants in developing the measures, or they could be reacting negatively to the
target-setting practice of the BSC. For example, it may seem unusual to participants that both divisions
exceeded their target performance on all 16 BSC measures. Superiors may be imposing their own
standards of performance that differ somewhat from the BSC guidelines. This possibility is
suggested by the average mechanically aggregated scores, as well as the holistic scores obtained by Lipe
and Salterio, in the 70-80 range for managerial performance that exceeded target on all 16 measures.
These and other possible explanations should be investigated by future research. Since acceptance of
the performance evaluation tool is critical to managers' ex ante behavior (Lipe and Salterio 2000,
293), these issues are important to understand.
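
As a reading aid for the "percent of variation explained" figures above, the following minimal sketch shows one way such a figure could be computed. It assumes a simple one-predictor linear fit, in which case the share of explained variation equals the squared Pearson correlation between the mechanically aggregated scores and the holistic evaluations. The function name `r_squared` and the two score lists are hypothetical illustrations, not data or procedures from the study.

```python
def r_squared(x, y):
    """Share of the variation in y explained by a one-predictor linear fit on x.

    For a single predictor this equals the squared Pearson correlation.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    return sxy ** 2 / (sxx * syy)


# Hypothetical scores for five evaluators (NOT data from the study).
aggregated = [70.0, 72.5, 75.0, 78.0, 81.0]  # mechanically aggregated scores
holistic = [74.0, 70.0, 78.0, 75.0, 83.0]    # subjective overall evaluations

print(f"Share of variation explained: {r_squared(aggregated, holistic):.2f}")
```

A value well below 1.0 in such a calculation would be consistent with evaluators adjusting their holistic judgments away from the mechanically aggregated score.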
CONCLUSIONS
Lipe and Salterio (2000) demonstrate an important limitation to using the BSC. Without a way
to ease the cognitive burden on users, decision makers making comparative evaluations
will tend to focus only on measures common among managers and ignore measures
unique to each division manager. Our study demonstrates an efficient method for reducing the
overwhelming cognitive demands of the Balanced Scorecard, while enabling users to make evaluations
consistent with all the important elements of corporate strategy and mission.
Although they circumvent the issue somewhat, Kaplan and Norton (1996) indicate employee
behaviors are not likely to be modified without a definite link to compensation. If the amount of
compensation to be received is determined from a superior's evaluation of the employee's
performance in meeting the division's goals, then it is important to know how those superiors' evaluations
are affected by the inclusion of weights on the BSC. Our results indicate decision makers'
compensation decisions are strongly supported by the overall performance evaluation scores of the
disaggregated Balanced Scorecard. This evidence, and similar evidence from practice, should reassure
employees in firms that have adopted the BSC approach that their bonus is, in fact, based on the
messages communicated by management, but only if the weights and disaggregated scores are
made explicit.

APPENDIX
RadWear Balanced Scorecard
Targets and Actuals for 1999

Measure                                     Weight   Target   Actual    % Better      Performance   Weighted
                                                                        than Target   Evaluation*   Score**
Financial:
  1. Return on sales                        4%       24%      26%       8.33%         _______       ________
  2. New store sales                        9%       30%      32.5%     8.33%         _______       ________
  3. Sales growth                           5%       35%      38%       8.57%         _______       ________
  4. Market share relative to retail space  7%       $80      $86.85    8.56%         _______       ________
Customer-Related:
  1. Mystery shopper program rating         8%       85       96        12.94%        _______       ________
  2. Repeat sales                           5%       30%      34%       13.33%        _______       ________
  3. Returns by customers as % of sales     8%       12%      11.6%     3.33%         _______       ________
  4. Customer satisfaction rating           4%       92%      95%       3.26%         _______       ________
Internal Business Processes:
  1. Returns to suppliers                   4%       6%       5%        16.67%        _______       ________
  2. Average major brand names/store        7%       32       37        15.63%        _______       ________
  3. Average markdowns                      6%       16%      13.5%     15.63%        _______       ________
  4. Sales from new market leaders          8%       25%      29%       16.00%        _______       ________
Learning and Growth:
  1. Average tenure of sales personnel      9%       1.4      1.6       14.29%        _______       ________
  2. Hours of employee training/employee    4%       15       17        13.33%        _______       ________
  3. Stores computerizing                   8%       85%      90%       5.88%         _______       ________
  4. Suggestions/employee                   4%       3.3      3.5       6.06%         _______       ________

Composite Score:
(Aggregate of Weighted Scores)
* Use the following 100-point scale to indicate your evaluation of each scorecard item (place a value
corresponding to this scale in the blank beside each scorecard item):
0                                      50                                     100
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
Unacceptable   very poor    poor     average      good    very good     Excellent
** Multiply your performance evaluation of each measure by the weighting factor corresponding to the measure.
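
The arithmetic behind the scorecard above can be sketched as follows. This is a minimal illustration, not the instrument used in the experiment: the names `Measure`, `pct_better_than_target`, and `composite_score` are hypothetical, the `lower_is_better` flags are an assumption inferred from the reported "% Better than Target" figures (returns and markdowns improve as they fall), and the evaluations fed in at the end are placeholder values standing in for a participant's entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    name: str
    weight_pct: int                # weight printed on the scorecard, in percent
    target: float
    actual: float
    lower_is_better: bool = False  # assumption inferred from the reported percentages


def pct_better_than_target(m: Measure) -> float:
    """Percentage by which actual performance beats the target."""
    gain = (m.target - m.actual) if m.lower_is_better else (m.actual - m.target)
    return 100.0 * gain / m.target


def composite_score(measures, evaluations) -> float:
    """Sum of (evaluation on the 0-100 scale) x (weight as a fraction)."""
    if len(measures) != len(evaluations):
        raise ValueError("one evaluation is required per measure")
    return sum(m.weight_pct * e for m, e in zip(measures, evaluations)) / 100


# Weights, targets, and actuals as listed in the RadWear scorecard.
RADWEAR = [
    Measure("Return on sales", 4, 24, 26),
    Measure("New store sales", 9, 30, 32.5),
    Measure("Sales growth", 5, 35, 38),
    Measure("Market share relative to retail space", 7, 80, 86.85),
    Measure("Mystery shopper program rating", 8, 85, 96),
    Measure("Repeat sales", 5, 30, 34),
    Measure("Returns by customers as % of sales", 8, 12, 11.6, lower_is_better=True),
    Measure("Customer satisfaction rating", 4, 92, 95),
    Measure("Returns to suppliers", 4, 6, 5, lower_is_better=True),
    Measure("Average major brand names/store", 7, 32, 37),
    Measure("Average markdowns", 6, 16, 13.5, lower_is_better=True),
    Measure("Sales from new market leaders", 8, 25, 29),
    Measure("Average tenure of sales personnel", 9, 1.4, 1.6),
    Measure("Hours of employee training/employee", 4, 15, 17),
    Measure("Stores computerizing", 8, 85, 90),
    Measure("Suggestions/employee", 4, 3.3, 3.5),
]

# Hypothetical evaluations (NOT participant data): every measure rated 75.
evaluations = [75.0] * len(RADWEAR)

for m in RADWEAR:
    print(f"{m.name}: {pct_better_than_target(m):.2f}% better than target")
print(f"Composite score: {composite_score(RADWEAR, evaluations):.1f}")
```

Because the sixteen weights sum to 100 percent, the composite stays on the same 0-100 scale as the individual evaluations, so a uniform rating is returned unchanged and any departure from it reflects the weights placed on the measures rated higher or lower.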


REFERENCES
Anderson, J. R. 1990. Cognitive Psychology and its Implications. New York, NY: W. H. Freeman and
Company.
Ashton, R. H. 1982. Human Information Processing in Accounting. Sarasota, FL: American Accounting
Association.
Ashton, R. H. 1990. Pressure and performance in accounting decision settings: Paradoxical effects of incentives,
feedback, and justification. Journal of Accounting Research 28 (Supplement): 148-180.
Bonner, S. E., R. Libby, and M. W. Nelson. 1996. Using decision aids to improve auditors' conditional
probability judgments. The Accounting Review 71 (2): 221-241.
Bowman, E. H. 1963. Consistency and optimality in managerial decision making. Management Science 9 (1):
310-321.
Chow, C. W., K. M. Hadad, and J. E. Williamson. 1997. Applying the balanced scorecard to small companies.
Management Accounting 79 (2): 21-27.
Daniel, S. J. 1988. Some empirical evidence about the assessment of audit risk in practice. Auditing: A Journal
of Practice & Theory (Spring): 174-181.
Davis, E. B. 1998. Decision-aids for going concern evaluation: Expectations of partial reliance. Advances in
Accounting Behavioral Research 1: 33-59.
Davis, S. 2000. An investigation of the development, implementation, and effectiveness of the balanced
scorecard: A field study. Dissertation, The University of Alabama.
Dilla, W. N., and P. J. Steinbart. 2002. The effects of alternative supplementary information display formats on
judgments made using the Balanced Scorecard. Working paper, Iowa State University.
Einhorn, H. J. 1972. Expert measurement and mechanical combination. Organizational Behavior and Human
Decision Processes 19 (Feb): 86-106.
Edwards, W., and J. R. Newman. 1982. Multiattribute Evaluation. Beverly Hills, CA: Sage Publications, Inc.
Ittner, C. D., D. F. Larcker, and M. W. Meyer. 2003. Subjectivity and the weighting of performance measures:
Evidence from a balanced scorecard. The Accounting Review 78 (3): 725-758.
Jiambalvo, J., and W. Waller. 1984. Decomposition and assessments of audit risk. Auditing: A Journal of
Practice & Theory (Spring): 80-88.
Kaplan, R., and D. Norton. 1996. The Balanced Scorecard. Boston, MA: Harvard Business School Press.
Kaplan, R. 1997. Mobil USM&R. Harvard Business School case. Boston, MA: Harvard Business School Publishing.
Kaplan, R., and D. Norton. 2001. The Strategy-Focused Organization. Boston, MA: Harvard Business School
Press.
Kennedy, J. 1993. Debiasing audit judgment with accountability: A framework and experimental results.
Journal of Accounting Research 31 (Autumn): 231-245.
Kennedy, J. 1995. Debiasing the curse of knowledge in audit judgment. The Accounting Review 70 (2): 249-273.
Lee, C., K. S. Law, and P. Bobko. 1999. The importance of justice perceptions on pay effectiveness: A two year
study of a skill-based pay plan. Journal of Management 25 (6): 851-873.
Libby, R., and P. A. Libby. 1989. Expert measurement and mechanical combination in control reliance
decisions. The Accounting Review 64 (4): 729-747.
Libby, R., and M. G. Lipe. 1992. Incentives, effort, and the cognitive processes involved in accounting-related
judgments. Journal of Accounting Research 30 (2): 249-273.
Lipe, M., and S. Salterio. 2000. The balanced scorecard: Judgmental effects of common and unique
performance measures. The Accounting Review 75 (3): 283-298.
Lipe, M., and S. Salterio. 2002. A note on the judgmental effects of the Balanced Scorecard's information
organization. Accounting, Organizations and Society 27 (6): 531-540.
Lyness, K. S., and E. T. Cornelius III. 1982. A comparison of holistic and decomposed judgment strategies in a
performance rating simulation. Organizational Behavior and Human Performance 29 (Feb): 1-38.
Malina, M. A., and F. H. Selto. 2001. Communicating and controlling strategy: An empirical study of the
effectiveness of the Balanced Scorecard. Journal of Management Accounting Research 13: 47-90.
McWilliams, B. 1996. The measure of success. Across the Board 33 (2): 16-20.


Roberts, M. L., T. L. Albright, and A. R. Hibbets. 2002. Improving utilization of unique measures in the
Balanced Scorecard: The effects of increased awareness and experience. Working paper, The University
of Alabama.
Silk, S. 1998. Automating the balanced scorecard. Management Accounting (May): 38-44.
Slovic, P., and D. MacPhillamy. 1974. Dimensional commensurability and cue utilization in comparative
judgment. Organizational Behavior and Human Performance 11: 172-194.
