An Empirical Validation of Object-Oriented Class Complexity Metrics and Their Ability To Predict Error-Prone Classes

JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION: RESEARCH AND PRACTICE
J. Softw. Maint. Evol.: Res. Pract. 2008; 20:171–197

Published online 21 February 2008 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/smr.366
Research
An empirical validation
of object-oriented class
complexity metrics and their
ability to predict error-prone
classes in highly iterative, or
agile, software: a case study
Hector M. Olague1, 2, ∗, † , Letha H. Etzkorn2 , Sherri L. Messimer3
and Harry S. Delugach2
1 U.S. Army Space and Missile Defense Command, P.O. Box 1500, Huntsville,
AL, U.S.A.
2 Computer Science Department, University of Alabama in Huntsville, Huntsville,
AL, U.S.A.
3 Industrial and Systems Engineering Department, University of Alabama in
Huntsville, Huntsville, AL, U.S.A.
SUMMARY
Empirical studies have shown complexity metrics to be good predictors of testing effort and maintainability
in traditional, imperative programming languages. Empirical validation studies have also shown that
complexity is a good predictor of initial quality and reliability in object-oriented (OO) software. To date,
one of the most empirically validated OO complexity metrics is the Chidamber and Kemerer Weighted
Methods in a Class (WMC). However, there are many more OO complexity metrics whose predictive
power has not been as extensively explored. In this study, we explore the predictive ability of several
complexity-related metrics for OO software that have not been heavily validated. We do this by exploring
their ability to measure quality in an evolutionary software process, by correlating these metrics to
defect data for six versions of Rhino, an open-source implementation of JavaScript written in Java.
Using statistical techniques such as Spearman’s correlation, principal component analysis, binary logistic
regression models and their respective validations, we show that some lesser known complexity metrics
including Michura et al.’s standard deviation method complexity and Etzkorn et al.’s average method
∗ Correspondence to: Hector M. Olague, U.S. Army Space and Missile Defense Command, P.O. Box 1500, Huntsville,
AL, U.S.A.
† E-mail: holague@cs.uah.edu, hector.olague@smdc.army.mil
Contract/grant sponsor: NASA; contract/grant numbers: NAG5-12725, NCC8-200
Copyright q 2008 John Wiley & Sons, Ltd.

172 H. M. OLAGUE ET AL.
complexity are more consistent predictors of OO quality than any variant of the Chidamber and Kemerer
WMC metric. We also show that these metrics are useful in identifying fault-prone classes in software
developed using highly iterative or agile software development processes. Copyright © 2008 John Wiley
& Sons, Ltd.
Received 31 December 2006; Revised 29 November 2007; Accepted 29 November 2007
KEY WORDS: complexity metrics; object-oriented complexity metrics; fault proneness; object-oriented metrics;
agile software development processes; empirical validation
1. INTRODUCTION
Cyclomatic complexity has been shown to be a good predictor of software testing and maintenance
in software written in imperative programming languages (non-object oriented) [1]. With the advent
of object-oriented (OO) programs, several studies that empirically validate the predictive ability of
OO complexity metrics to identify fault-prone classes have been presented. A summary of these
studies and their results is contained in El Emam et al. [2]. Additional empirical validation studies
involving complexity that have appeared since the publication of El Emam et al. [2] include [3–7].
Error proneness appears to be the most common method by which OO metrics have been evaluated
as quality indicators [2]. This can be done by examining the pairwise correlation statistics of OO
class metrics to defects discovered in OO software classes [8]. Both ordinary linear regression
and binary logistic regression (BLR) techniques have also been used as they provide a more in-
depth and realistic model than the pairwise comparisons [8]. The studies referenced earlier (above)
have primarily focused on empirically validating the Chidamber and Kemerer (C&K) OO metric
suite.
Basili et al. [8] point out that metrics may be useful as predictors in initial quality assessments of
OO software (i.e. the initial release of a software product). The reason is that once faulty components
are identified and repaired, their metrics may remain relatively the same (e.g. lines of code (LOC)
in a class remain relatively the same after errors are corrected) but do not identify a (once) faulty
component (because it has already been repaired). Therefore, their effectiveness diminishes after
a release of the software in traditionally designed and implemented software systems. This paper
departs from empirical validation for initial software quality by examining the usefulness of several
different OO complexity metrics in a highly iterative or agile design, where new functionality is
being added continuously as the design evolves.
In this paper, we first revalidate the conclusions of previous studies that various OO complexity
metrics are good indicators of initial quality [8]. Then we compare the predictive power of several
complexity OO metrics using statistical methods on a non-contrived, highly iterative, real-world
test case, as both initial and evolutionary quality indicators. According to Chapin et al. [9], software
evolution refers to the application of software maintenance activities and processes that ‘generate a
new operational software version, with a changed customer-experienced functionality or properties
from a prior operational version.’ The test case we have selected supports both scenarios. We model
error proneness using univariate and multivariate logistic regression techniques over several of the
most significant software complexity metrics.
Copyright q 2008 John Wiley & Sons, Ltd. J. Softw. Maint. Evol.: Res. Pract. 2008; 20:171–197
DOI: 10.1002/smr
EMPIRICAL VALIDATION OF OO CLASS COMPLEXITY METRICS 173
2. BACKGROUND
A number of the OO class metrics used in this paper use McCabe cyclomatic complexity or a
simplified complexity metric in some capacity in the metric calculation. Several of the metrics used
are really size metrics based on LOC. However, it has been shown that lines of code correlate
strongly with cyclomatic complexity for software packages of significant size [10]. Hence, it is
reasonable to use the LOC metrics as part of a complexity analysis. A brief definition of each of
the metrics used in this paper is provided in Table I.
Table I. Specific definitions of metrics actually used in this study.

Metric name value Definition
McCabe cyclomatic Calculated from the control flow graph of a function as number of edges—number
complexity of nodes +2. Can be separately calculated as the number of decision points in a
program. In this paper, we employ McCabe cyclomatic complexity as a subpart of
some other metrics (see other definitions below in this table).
Weighted methods per Sum of complexities of local methods of a class. In this paper, when we refer to
class (WMC) WMC, we employ simple WMC, when the complexity of each local method is
unity. Thus, simple WMC = number of methods in the class. Thus, this definition of
WMC is a size metric.
Weighted methods per Sum of the complexities of local methods of a class, when the complexity of each
class McCabe local method is measured using McCabe’s cyclomatic complexity metric. Thus,
(WMC McCabe) WMC McCabe = sum of the McCabe cyclomatic complexities of all local methods
of a class.
n
Standard deviation SDMC = n1 i=1 (Mi − M)2
method complexity
(SDMC)
where n = number of methods in a class
Mi = complexity of method i of a class
M = average complexity of the methods in the class
In this paper, the complexity of each method is calculated using McCabe’s
cyclomatic complexity.
n
Average method AMC = n1 i=1 Mi
complexity (AMC)
where n = number of methods in a class.
Mi = static complexity of method i of a class.
In this paper, the complexity of each method is calculated using McCabe’s
cyclomatic complexity.
Maximum cyclomatic This metric is equal to the McCabe cyclomatic complexity value of the ‘most
complexity of a single complex’ local method within a class. The ‘most complex’ local method is that
method of a class (CC local method where the McCabe cyclomatic complexity value is the largest McCabe
Max) cyclomatic complexity value of any method in the class.
Number of instance Number of methods in an instance object of a class. This is different from a class
methods (NIM) method, which refers to a method which only operates on data belonging to the
class itself, not on data that belong to individual objects. Thus, NIM is a size metric.
Number of trivial Number of trivial methods is the number of local methods in the class whose
methods (NTM) McCabe complexity value is equal to one. This can also be classed as a size metric.
Average lines of code Average of the executable lines of code of a class. This is a size metric.
Lines of code per class Sum of the lines of code of a class. This is a size metric.
DOI: 10.1002/smr
2.1. McCabe cyclomatic complexity
The metrics used in this study are a blend of traditional [1], intuitive [11] and recent OO complexity
metrics [12]. The basis for several of these metrics is cyclomatic complexity, developed by
McCabe [13].
Cyclomatic complexity was devised as a measure to predict the maintainability and testability
of a program. To find the cyclomatic complexity of a program, it must first be represented by a
control flow graph. The cyclomatic complexity of a program is equal to the cyclomatic number of
its control flow graph and is calculated using the equation
v(G) = e −n +2 (1)
where e is the number of edges in the control flow graph and n the number of nodes in the control
flow graph.
The cyclomatic complexity of a program is also equal to the number of decision statements in
the program plus 1.
According to Fenton and Pfleeger [10], in several cases McCabe’s cyclomatic complexity metric
has had claims made about it that are difficult to support. For example, Fenton and Pfleeger say
that cyclomatic complexity presents only a partial view, rather than a general view, of complexity,
as it is related to the number of decisions in a program. However, there are several programs that
have a large number of decisions but are easy to understand, code and maintain. Similarly, Fenton
and Pfleeger discuss how, from a measurement theory perspective, various conclusions that one
would draw using cyclomatic complexity are not intuitive. Hence, a more complete view of program
complexity than that provided by McCabe’s cyclomatic complexity is necessary. Lastly, Fenton and
Pfleeger discuss how the many claims that have been made that the McCabe cyclomatic complexity
metric has been validated are negated by studies showing that cyclomatic complexity is no better
than a simple LOC measure (we see something like this in our study as well, see Section 6.1).
However, Fenton and Pfleeger do say that cyclomatic complexity has been shown to be a useful
indicator of how difficult a program will be to test or maintain.
2.2. Weighted methods per class
The weighted methods per class (WMC) metric, or simple WMC, is defined to be the summation
of the complexities of the local methods in a class [14]. It does not include inherited class methods
[15]. It is intended to be a first-level approximation of the complexity of a class. Chidamber and
Kemerer left undefined the exact meaning of complexity, leaving it up to the implementer to choose
an appropriate definition. If complexity is taken to be unity (1) for each method, WMC for a class
is equal to the number of local class methods. Many, if not most, metrics suites implement WMC
this way. Another variant is WMC-McCabe.
2.3. WMC-McCabe
After it was pointed out that the definition of simple WMC is essentially a method count per class,
the McCabe version of the WMC metric appeared in Chidamber and Kemerer [16]. WMC-McCabe
DOI: 10.1002/smr
redefined WMC as the aggregation of all the McCabe complexities of the local class methods
(it does not include the complexities of inherited class methods).
The WMC-McCabe is defined as follows:

n
WMC McCabe = Mi (2)
i=1
where n is the number of methods in a class and Mi the McCabe’s cyclomatic complexity of method
i of a class.
Then WMC McCabe represents the summation of all the McCabe complexities of all the methods
in a class.
3. DESCRIPTION OF METRICS USED
3.1. Standard deviation method complexity [12]
Michura and Capretz [12] make the point that no single measure of complexity can reliably predict
the maintainability of OO software. They present the metric standard deviation method complexity
(SDMC), which they use in tandem with WMC to provide insight into the nature of the complexity
of an OO class. They claim that the method diversity of a class serves as an indication that a class is
performing many different actions. For example, a class with a high WMC and low SDMC implies
that the class consists of only a small number of similarly complex methods. A class with a low
WMC and high SDMC implies that class methods are not complex with the exception of a few.
SDMC is defined as

1 n
SDMC = (Mi − M)2 (3)
n i=1
where n is the number of methods in a class, Mi the complexity of method i of a class. In this
paper, we use McCabe’s cyclomatic complexity to calculate the complexity of method i and M
the average complexity of the methods in the class. In this paper, we use McCabe’s cyclomatic
complexity in the determination of average complexity.
3.2. Average method complexity
Etzkorn et al. [17] attempt to clarify some misleading interpretations of WMC by using complexity
averaging of the McCabe complexity of the methods. The idea was that average method complexity
(AMC) would better reflect the complexity of a class type.
AMC is defined as follows:
1 n
AMC = Mi (4)
n i=1
where n is the number of methods in a class and Mi the static complexity of method i of a class.
In this paper, we use McCabe’s cyclomatic complexity to calculate the complexity of method i.
DOI: 10.1002/smr
Then, AMC represents the average method complexity of all the methods in the class. The
complexity of each method is measured using McCabe’s cyclomatic complexity metric. The use of
the AMC metric gives a better indication of the complexity of a class with a large number of
non-complex member functions than does the code complexity metric WMC (WMC-McCabe) [16],
which adds McCabe’s cyclomatic complexity numbers of all member functions in the class. (Another
variation of WMC, simple WMC, as it is known, is a count of all the methods in a class [14].)
3.3. Maximum cyclomatic complexity (CC Max) of a single method of a class
This metric compares local methods in a class and provides the McCabe cyclomatic complexity
value of the most complex local method in the class. This is different from WMC in that WMC
is basically a count of all the methods in a class. It is also different from WMC-McCabe since
WMC-McCabe is an aggregation of all the McCabe cyclomatic complexities of all the methods in
a class (except inherited class methods).
3.4. Number of instance methods
An instance method refers to a method that operates on data that are local to the instance object
of the class. This is different from a class method, which refers to a method that only operates on
data that belong to the class itself, not on data that belong to individual objects. Number of instance
methods (NIM) counts all the public, protected and private methods defined for a class’ instances.
NIM does not count inherited methods [11,15,18]. This is different from WMC in that WMC is a
count of all the local methods in a class (does not count inherited methods). A Java example of
instance methods is shown in Figure 1. In Java, an instance method is a method that does not have
static in its declaration.
public class Circle {

// A class field
public static final double PI= 3.14159; // A useful constant
// A class method: just compute a value based on the arguments

public static double radiansToDegrees(double rads) {
return rads * 180 / PI;
}
// An instance field
public double r; // The radius of the circle
// Two instance methods: they operate on the instance fields of an object

public double area() { // Compute the area of the circle
return PI * r * r;
}
public double circumference() { // Compute the circumference of the

circle
return 2 * PI * r;
}
}
Figure 1. An example of instance methods declared in Java.
DOI: 10.1002/smr
3.5. Number of class methods
A class method refers to a method that only operates on data that belong to the class itself, not
on data that do not belong to the class. Number of class methods (NCM) is the count of directly
declared methods in the class. NCM does not count inherited methods [11,15].
3.6. Number of trivial methods
While the simple WMC metric can result in obscuring the true complexity of a class, by simply
counting trivial member functions the code-complexity WMC (WMC-McCabe) redefinition may
also be problematic as it may result in the possibility of a few, highly complex methods dominating
the composite complexity of the class. Michura and Capretz have empirically shown how class
complexity employing the WMC-McCabe metric may be misinterpreted if used without investigating
the complexity contribution of each method to the overall class complexity [12]. This skewing
effect of trivial methods in the overall class complexity in both simple WMC and WMC-McCabe
calculations was the motivation for the development of a separate metric, number of trivial methods
(NTM), to differentiate trivial methods in OO classes separately from non-trivial methods [11].
NTM is defined as the number of local methods in the class whose McCabe complexity value
is equal to one [13]. The value for this metric can be obtained simply by counting the number of
methods in a class with complexity equal to one. This metric determines how much of the class’
WMC is due to trivial methods.
3.7. WMC
WMC is used in this study and has already been described in Section 2.2.
3.8. WMC-McCabe
WMC-McCabe is used in this study and has already been described in Section 2.3.
3.9. Average LOC and LOC per class
We included these LOC metrics because they are widely used in industry, primarily for sizing and
cost estimation. We count only executable LOC, not all lines, of the class. We take both the average
and aggregated LOC of all methods in the class.
4. TEST CASES USED FOR THIS STUDY
We chose the Mozilla Rhino project for this study since extensive error data were available for a
number of consecutive, sequential releases of this system. Also, the software development strategy
employed was highly iterative with a bottom-up approach. Rhino is an open-source implementation
of JavaScript written in Java. It is typically embedded in Java applications to provide scripting to
end users.
DOI: 10.1002/smr
Rhino versions 14R3, 15R1, 15R2, 15R3, 15R4 and 15R5 were analyzed and used in this study.
The development of Rhino meets most of the 12 principles defined by the Agile Alliance for achieving
agility: continuous delivery of software, welcoming changing requirements, deliver working soft-
ware with a varying cycle time (from 2 to16 months), working software is the primary measure
of progress, etc. [19,20]. Rhino can be considered an example of the use of the agile software
development model in open-source software [20]. The incorporation of software extensions and
improvements by its development team and generous third parties provide the agile aspects of Rhino’s
development cycle [20]. New software enhancements in Rhino are usually followed by new software
defects. This is due to the code integration and testing philosophy employed by agile systems (and
Rhino) in general. Regression testing is performed on Rhino versions prior to their release, but
the Rhino team relies heavily on user feedback to try out a new version and to report problems [20].
Error data exist for Rhino in the online Bugzilla repository located at [21]. The change logs for
each version of Rhino were examined, which usually listed the bugs that had been resolved for
that version of Rhino. Examining the change logs was easier than trying to search Bugzilla initially
because the change log contains query links back to the Bugzilla database. The change log allows
us to search for bugs targeted against each version of Rhino. Bug fixes were cross-referenced with
the classes that were affected by each bug/fix.
5. GOAL STATEMENT AND RESEARCH HYPOTHESIS
Two research hypotheses will be tested under the common goal:

Goal: To assess the ability of OO complexity metrics to identify fault-prone components in
different software development environments.
Hypothesis 1
H0 : OO complexity metrics cannot identify fault-prone classes in traditional and highly iterative,
or agile, developed OO software during its initial delivery (initial quality).
H1 : OO complexity metrics can identify fault-prone classes in traditional and highly iterative,
or agile, developed OO software during its initial delivery (initial quality).
Hypothesis 2
H0 : OO complexity metrics cannot identify fault-prone classes in multiple, sequential releases of
OO software systems that are developed using highly iterative, or agile, software development
processes.
H1 : OO complexity metrics can identify fault-prone classes in multiple, sequential releases of
OO software systems that are developed using highly iterative, or agile, software development
processes.
6. ANALYSIS
In order to test Hypothesis 1, we perform a correlation analysis (described below) on the initial
version of Rhino. To test Hypothesis 2, we will use the results of our analysis of Hypothesis 1
DOI: 10.1002/smr
Table II. Rhino version statistics for study.

Rhino version: 14R3 15R1 15R2 15R3 15R4 15R5
Files 77 102 124 122 128 126
Lines 29 549 44 007 54 601 54 834 58 412 61 246
Blank lines 2905 4023 5103 5165 5513 5589
Code lines 19 838 29 496 37 559 37 975 40 111 42 684
Comment lines 7953 11 788 13 188 12 951 13 295 13 443
Statements executable 8682 13 292 18 333 18 575 20 101 21 178
Statements declarative 3740 5322 6969 7062 7693 8120
Ratio comment/code 0.40 0.40 0.35 0.34 0.33 0.31
Number of classes 95 126 179 178 198 201
Number of methods 1155 1536 2085 1662 1760 1817
Defects reported 21 29 10 41 153 61
Enhancements made 1 3 0 0 76 37
(initial quality) and measure the robustness of the metrics for predicting fault-prone classes over
the remaining versions of the Rhino software. In addition to correlation analysis, later in this
section and in Section 7, we build univariate logistic regression models using each of these metrics
to predict defects to test Hypothesis 2. We also explore the possibility of building multivariate
logistic regression models using several metrics in conjunction. We used Understand for JavaR ,
a commercial reverse engineering and metrics tool, to collect commonly accepted and used OO
complexity metrics. Understand for JavaR provided a clear advantage over other metrics tools we
examined because it allows Perl scripts to be written to collect and compute metrics that are not
available in the standard suite of reports provided (e.g. SDMC, AMC). Several custom metrics
collection scripts were developed in support of this study. Parametric summary data for each version
of Rhino are provided in Table II and it includes the number of files, number of classes, number
of defects reported and other important data. After considering the variety of files and number of
classes in each version, we had an indication of the degree to which additional requirements are
being added to Rhino from version to version.
6.1. Class defect density correlation
The testing of Hypothesis 1 is performed to ascertain whether the currently available metrics can
predict the initial quality of a software system and is, more or less, a validation of the current
analysis methods in this field. This required that a simple correlation be calculated, using the
complexity metrics in this study against class defects density (defects/class). Table III presents the
results of the correlation analysis for all Rhino versions contained in this study. It also provides
the P-value that is used to test the significance of the correlation. Spearman rank correlation is
used here because the data are not normally distributed. The first row of Table III presents the
correlation for the initial version of the Rhino software, version 14R3. In Table III, the data for
each version of Rhino are independent of the data for each other version of Rhino. That is, for
each version of Rhino, we display the results of each metric collected separately on that version of
Rhino, correlated with the number of defects discovered in that version of Rhino. Varying degrees
of correlation were experienced. In this study, we are following the guidelines set forth by Cohen
DOI: 10.1002/smr
Table III. Rhino Spearman rank correlation between complexity measure and
defects in class (P-value in parentheses).
OO complexity metrics
Rhino WMC- Ave LOC/
version SDMC AMC CC Max NIM NCM NTM WMC McCabe LOC/class class
14R3 0.522 0.504 0.511 0.274 0.373 0.192 0.419 0.506 0.510 0.504
(< 0.001) (< 0.001) (< 0.001) (0.007) (< 0.001) (0.062) (< 0.001) (< 0.001) (< 0.001) (< 0.001)
15R1 0.315 0.284 0.334 0.213 0.441 0.352 0.395 0.395 0.298 0.386
(< 0.001) (0.001) (< 0.001) (<0.001) (< 0.001) (< 0.001) (< 0.001) (< 0.001) (<0.001) (< 0.001)
15R2 0.277 0.266 0.286 0.214 0.277 0.219 0.297 0.302 0.276 0.310
(<0.001) (<0.001) (<0.001) (0.004) (<0.001) (0.003) (<0.001) (< 0.001) (<0.001) (< 0.001)
15R3 0.328 0.320 0.354 0.289 0.349 0.301 0.381 0.414 0.309 0.407
(< 0.001) (< 0.001) (< 0.001) (<0.001) (< 0.001) (< 0.001) (< 0.001) (< 0.001) (< 0.001) (< 0.001)
15R4 0.439 0.406 0.445 0.417 0.545 0.464 0.529 0.531 0.349 0.519
(< 0.001) (< 0.001) (< 0.001) (< 0.001) (< 0.001) (< 0.001) (< 0.001) (< 0.001) (0.002) (< 0.001)
15R5 0.209 0.216 0.258 0.142 0.321 0.186 0.236 0.244 0.257 0.322
(0.003) (0.002) (<0.001) (0.045) (< 0.001) (0.008) (0.001) (<0.001) (<0.001) (< 0.001)
[22]: a correlation 0.5 or greater is considered ‘large,’ 0.3–0.5 is considered ‘moderate’ and 0.1–
0.3 is considered ‘small.’ For example, the correlation between the metric LOC per class and the
class defect density for this version is calculated to be 0.530, indicating that there is a fairly large
correlation between this complexity measure and the class defect density. We set a threshold of
correlation of 0.3 to test our hypotheses. Those correlations meeting or exceeding this threshold are
shown in bold in Table III. Based on this threshold, NTM and NIM correlations to the defects/class
were considered insignificant. Otherwise, all the metrics considered meet the criteria for significant
correlation for the metrics: SDMC, AMC, CC Max, NCM, WMC, WMC-McCabe, Ave LOC/Class
and LOC/Class. Since most of the metrics correlate with defects, higher values for these metrics
would indicate a higher number of defects in the class, and lower values for these metrics would
indicate a lower number of defects in the class. Thus, for Hypothesis 1, we reject the null hypothesis
and accept the alternative hypothesis that OO complexity metrics can identify fault-prone classes
in traditional and highly iterative, or agile, developed OO software during its initial delivery (initial
quality).
Correlation analysis was carried out on the subsequent versions of Rhino. Based on this analysis,
the metric WMC-McCabe produces the highest degree of correlation in three of the subsequent
versions, while the LOC/Class metric produces a high degree of correlation in the final two versions.
However, all these metrics do identify fault-prone classes, overall, in this study. Again, Table III
shows all correlations exceeding our correlation threshold (0.3) in bold. Table III shows that 63% of
all correlations exceeded our threshold value (were moderate or large). Therefore, for Hypothesis 2,
we reject the null hypothesis and accept the alternative hypothesis that OO complexity metrics
can identify fault-prone classes in multiple, sequential releases of OO software systems that are
developed using highly iterative, or agile, software development processes.
We note here that the different OO complexity metrics are intended to measure different aspects
of class complexity (see Section 2); however, over the data employed in this study, WMC-McCabe
and LOC/Class had the highest correlations with class defect density. It has long-been known that
DOI: 10.1002/smr
size of code correlates with the number of defects (that is, that larger software has more errors)
[23]. What is interesting here is that the size of code in terms of LOC correlates better with the
number of defects than do all of the other complexity metrics we examined except WMC McCabe.
6.2. Principal component analysis
In our second analysis, we examined the complexity metrics to determine whether they measure
different dimensions of OO class complexity (whether they measure different aspects of software
complexity) or are measuring the same thing. The goal of principal component analysis (PCA) is to
determine whether a few uncorrelated metrics can be used to model the defect data, without losing
a great deal of information. To do this, we performed a PCA on all the metrics collected.
When variables are strongly correlated, they tend to measure the same underlying property.
Principal components (PCs) are used to transform the original data set into a smaller set of linear
combinations that account for most of the variance of the original data set. The purpose is to identify
factors that explain as much of the variation with as few factors as possible [24,25].
Our analysis focuses on obtaining PCs for later use in a multiple binary logistic regression (BLR)
(see Section 6.5), to ensure that these PCs measured different aspects of the data and made sense
to include together in the later BLR.
We used SPSS Release 12 to analyze six versions of the Rhino software. Table IV contains our
findings for the PCA analysis for each version of the software. We omitted eigenvalues less than
0.3 to improve the readability of Table IV. The PCs are labeled PC1, PC2, etc.
In the first PC, PC1, SDMC was one of the strongest predictors in five out of six versions of
Rhino (14R3, 15R1, 15R2, 15R3 and 15R5) although it did not rank highest in four of these (14R3,
15R1, 15R2, 15R3). AMC and average LOC explained the greatest amount of variability in four
of six versions of Rhino. Therefore we conclude that SDMC is most consistent at explaining the
greatest amount of variability in this dimension of the data followed by AMC.
In this first principal component, AMC, SDMC and Ave LOC/Class all measure the same dimen-
sion of the software, the consistency of class size in terms of its methods. It is reasonable to assess
that the consistency of the complexity of the methods within a class would be a good indicator of
quality in terms of fault proneness. If most of the methods within a class were trivial, we could
predict that the class would be largely fault free. Similarly, if the class consisted largely of highly
complex methods, then we could assess that the class would probably be fault prone. Michura et al.
discussed this in [12], but did not present extensive empirical evidence to support their hypothesis.
In a regression model, we can include only one of the components from PC1 so we would choose
SDMC because it is more consistent than the others.
Our conclusions in PC1 are supported by the second principal component, PC2. PC2 measures
the method count/type dimension in a class. In PC2, the NTM in a class explains the greatest
variation of the data in the PC followed by the NIMs in a class and the WMC. (Note: This version
of WMC assigns the class method weight to unity.) We can reason that the complexity composition
of the class would be a good indicator of class quality (NTM and NIM), followed by the number of
methods in the class (WMC). NIM, NTM and WMC all measure the same dimension of the data, the
quantity and type of methods in a class. In a multiple logistic regression model, we would choose
NIM from PC2 because it explains more of the variability in the data in this dimension than the
NTM or WMC.
DOI: 10.1002/smr
Table IV. Rhino principal component analyses for OO complexity metrics∗ .

Rhino 14R3 Rhino 15R1
PC1 PC2 PC3 PC4 PC5 PC1 PC2 PC3 PC4 PC5
Eigenvalue 5.892 2.422 0.992 0.327 0.202 5.935 2.383 0.949 0.327 0.267
Proportion 58.923 24.216 9.923 3.266 2.020 59.346 23.835 9.486 3.267 2.669
Cumulative 58.923 83.139 93.062 96.328 98.347 59.346 83.181 92.667 95.934 98.603
SDMC 0.955 — — — — 0.938 — — — —
AMC 0.967 — — — — 0.971 — — — —
Max CC 0.882 — — — 0.309 0.845 — — 0.311 0.315
NIM — 0.963 — — — — 0.932 — 0.304 —
NCM — — 0.984 — — — — 0.973 — —
NTM — 0.936 — — — — 0.953 — — —
WMC — 0.727 0.649 — — — 0.804 0.489 — —
WMC McCabe 0.594 0.393 0.576 0.364 — 0.532 0.534 0.350 0.540 —
Average lines 0.921 — — — −0.340 0.939 — — — —
Lines of code (LOC) 0.556 0.363 0.503 0.536 — 0.463 0.525 — 0.658 —

Eigenvalue 5.849 2.436 0.892 0.414 0.249 5.806 2.542 0.907 0.437 0.194
Proportion 58.494 24.362 8.920 4.138 2.487 58.057 25.422 9.070 4.374 1.940
Cumulative 58.494 82.856 91.777 95.915 98.402 58.057 83.479 92.548 96.922 98.863
SDMC 0.916 — — — — 0.935 — — — —
AMC 0.966 — — — — 0.978 — — — —
Max CC 0.770 — 0.518 — — 0.756 — 0.562 — —
NIM — 0.939 — — — — 0.934 0.303 — —
NCM — — — 0.962 — — — — 0.963 —
NTM — 0.940 — — — — 0.937 — — —
WMC — 0.812 0.307 0.472 — — 0.801 0.320 0.485 —
WMC McCabe 0.438 0.488 0.637 0.374 — 0.419 0.466 0.668 0.372 —
Average lines 0.934 — — — −0.318 0.965 — — — —
Lines of code (LOC) 0.394 0.491 0.715 — — 0.369 0.480 0.732 — —

Eigenvalue 5.549 2.274 0.895 0.835 0.250 5.777 2.032 0.895 0.790 0.214
Proportion 55.487 22.735 8.952 8.347 2.501 57.769 20.316 8.955 7.898 2.141
Cumulative 55.487 78.223 87.175 95.522 98.022 57.769 78.086 87.041 94.938 97.080
SDMC — 0.973 — — — 0.957 — — — —
AMC — — 0.989 — — — — — 0.979 —
Max CC — 0.745 0.540 — — 0.869 — — 0.363 —
NIM 0.952 — — — — — 0.934 — — —
NCM — — — 0.961 — — — 0.967 — —
NTM 0.915 — — 0.328 — — 0.870 0.407 — —
WMC 0.821 — — 0.490 — — 0.720 0.611 — —
DOI: 10.1002/smr
Table IV. Continued.

WMC McCabe 0.493 0.611 0.322 0.360 0.366 0.737 0.424 0.407 — —
Average lines — 0.435 0.869 — — 0.534 — — 0.785 —
Lines of code (LOC) 0.567 0.607 — — 0.423 0.689 0.407 0.463 — 0.310
Extraction method: Principal component analysis. Rotation method: Varimax with Kaiser normalization.
A rotation converged in six iterations.
∗ Values <0.30 removed.
In PC3, WMC-McCabe appears in four of six versions of Rhino (14R3, 15R1, 15R2, 15R3) but
does not explain the greatest amount of variability in these data sets. NCM appeared in three out of
six versions of Rhino (14R3, 15R1, 15R5) but explained greater variability in the data sets common
to both (14R3, 15R1). This PC appears to measure the number of complex methods in a class. In
a multiple logistic regression model, we would choose NCM.
The components that are produced in Rhino version 15R4 are inconsistent (different) from the
components that emerge from the other five versions of Rhino in PCs 1, 2 and 3. We note that the
number of defects reported for this version was much higher than for the other Rhino versions,
relating to the large number of enhancements added to this class (see Table II).
6.3. Logistic regression analysis
Logistic regression has been used in different software metrics studies due to the lack of variation in
the dependent variable (faults per software class or software module) [8]. Some versions of Rhino
are not very fault dense and lack the variability in the response variable that is needed to justify
the use of ordinary regression techniques. Therefore, we shall also make use of logistic regression
techniques in this study to build better regression models.
In Section 6.4, we describe a univariate (one independent variable) logistic regression of the
metrics versus faults to determine which metrics were statistically significant indicators of quality. In
Section 6.5, we describe how we used multivariate (more than one independent variable) bionomial
logistic regression (BLR) to construct predictive models for this particular application. First, a
general overview of both the models is provided below.
6.3.1. Analysis method
Bionomial logistic regression (BLR), sometimes called binomial logistic regression, is a widely
used statistical technique where the dependent variable can be classified into one of two classes
and the independent variables may take on any form. It is used to predict the probability of a
particular outcome, given a particular set of circumstances. It is used extensively in the biological
sciences, for instance, to predict the possibility of disease given behavioral and genetic information
about a patient. Here, it will be used to assess the OO class complexity metrics selected in the
previous section by the PCA. A brief overview of logistic regression is given below; see Hosmer
and Lemeshow [26] for more details on the technique and its application.
DOI: 10.1002/smr
BLR does not require many of the assumptions of other statistical techniques such as linearity
between the independent and dependent variables, normally distributed variables, or homoscedas-
ticity [26]. BLR is a regression technique that utilizes log-likelihood estimators on the transformed
variables, rather than the typical maximum likelihood estimators (MLEs) seen with traditional linear
regression. First, the odds of the dependent variable occurring are calculated from historical data.
The natural log of these odds is then calculated and serves as the transformed dependent variable.
6.3.1.1. Univariate BLR. The first case considered is a model that considers only one explanatory
variable, a single class design metric. The univariate logistic model can be described as follows:

Logit(Y ) = natural log(odds) = ln = +X (5)
1−
where Y is the binary dependent variable with Y = 0 indicating no fault is found, and Y = 1 indicating
a fault is found, the probability that a fault was found in a class after the software was deployed
(Y = 1), the y intercept, the regression coefficient, and, in our case, X is a single class design
metric.
Taking the antilog of both sides of equation (5) derives
e+x
(Y = y|X = x) = (6)
1+e+x
Equation (6) allows us to predict the probability that we will find a fault in a software system based
on the outcome of a particular metric [27].
6.3.1.2. Multivariate BLR. Similarly, the multivariate logistic regression model can be described
as follows:

Logit(Y ) = natural log(odds) = ln = +1 X 1 +2 X 2 +· · ·+n X n (7)
1−
and, similarly taking the antilog of both sides yields
e+1 X 1 +2 X 2 +···+n X n

(Y = y|X 1 = x1 , X 2 = x2 , . . . , X n = xn ) = (8)
1+e+1 X 1 +2 X 2 +···+n X n
where X 1 , X 2 , etc. in our application are the class metrics under evaluation [27].
The Hosmer and Lemeshow (HL) goodness-of-fit (GOF) test is used to determine whether the
derived model adequately describes the data. In the HL GOF test, the data are first divided into
deciles based on predicted probabilities. A chi-square is computed from the observed and expected
frequencies. The null hypothesis is that there is no difference between the observed and predicted
values of the dependent variable. For example, if the HL GOF statistic is less than 0.05, we reject
the null hypothesis that there is no difference, implying that the model’s estimates fit the data at an
acceptable level. This does not imply that the model explains much of the variance in the dependent
variable(s), only that it does to a degree that is significant [28].
DOI: 10.1002/smr
6.4. Univariate BLR for complexity metrics
We performed a univariate logistic regression of the complexity metrics versus faults in the Rhino
software, by class, to determine which metrics were statistically significant indicators of quality.
In other words, each complexity measure was analyzed within each version of the software. The
univariate BLR analysis for the complexity metrics in our study shows that, with few exceptions,
(NTM in 14R3, NCM in 15R2 and AMC in 15R4) most of the coefficients are significant in many
versions of Rhino. See Table V for details of this analysis. Each table corresponds to a Rhino
version and contains results for each of the complexity measures under consideration in this study.
In Table V, all the metrics for which this analysis was not significant have been grayed out. One
important indicator is the odds ratio that is calculated for each of the metrics. Comparing across all
versions of the software, we noted that the largest and more consistent log odds ratios were found
in AMC (in 5 of 6 versions) and SDMC (in all versions of Rhino used in this study). The log odds
ratio is important because it indicates that for every unit increase in OO software class defects, a
correspondingly high value of the log odds will ensue. For example, in Rhino Version 14R3, AMC
has a log odds ratio of 1.51. This means that a unit increase in defects will be accompanied by an
increase of 1.51 in AMC. In Table V, for easy comparison, the odds ratios are shown on the rows
with a gray background.
Examining the AMC metric results in more detail, it is noted that its coefficient is significant at
the = 0.05 level for five out of six versions of Rhino examined (14R3, 15R1, 15R2, 15R3, 15R5).
The log-likelihood test statistic G, which tests the null hypothesis that all the coefficients associated
with the predictors equal zero versus these coefficients not all being equal to zero indicates that
there is sufficient evidence that at least one of the coefficients is different from zero since its P-value
is below our critical value of 0.025 in all five cases. The inferential GOF test that we used was the
HL test. It showed a good fit of the data for four of the five remaining versions of Rhino (P-value
>0.05).
Similar analysis for SDMC showed that its coefficient is significant at the = 0.05 level in all
versions of Rhino examined in this study. Its coefficient varied from 1.08 to 1.31. This means that
a unit increase in defects will be accompanied (in the best case) by an increase of 1.31 in SDMC.
The HL test showed a good fit of the data in four out of six versions of Rhino. NCM had log-odds
ratio that varied from 1.04 to 1.38 across the six versions of Rhino in this study. However, the
log-likelihood statistic (G) for Rhino version 15R2 indicated that its coefficient was no different
from zero and the HL test showed that the model was a poor fit for the data. Rhino version 15R4
had similar problems with its HL test.
NTM had a log-odds ratio that varied from 1.04 to 1.17. This means that a unit increase in defects
will be accompanied (in the best case) by an increase of 1.17 in NTM. Its coefficient is significant
at the = 0.05 level in five out of six versions of Rhino examined in this study and the G statistic
was acceptable in those same versions (15R1, 15R2, 15R3, 15R4 and 15R5). The HL test showed
that the models associated with NTM binary univariate logistic regression was good in four out of
the remaining five versions of Rhino (15R1, 15R2, 15R3 and 15R5).
WMC had modest log-odds ratios that varied from 1.04 to 1.08. Its coefficient was shown to be
significant in all six versions of Rhino. The G statistic indicated that the coefficient was different
from zero in all six cases and the HL test showed that the models produced were a good fit in three
out of six versions of Rhino (15R1, 15R3 and 15R5). In general, WMC’s modest log-odds ratio
DOI: 10.1002/smr
186
Copyright q
Table V. Rhino univariate binary logistic regression results.
SDMC AMC Max complex NIM NCM NTM WMC WMC McCabe Ave lines Lines of code
Rhino version 14R3 univariate binary logistic regression results

Constant −2.3035 −3.055 −2.3165 −2.3095 −2.0757 −1.9827 −2.5295 −2.6812 −3.0323 −2.6627
Coefficient 0.13504 0.4147 0.03261 0.05383 0.07948 0.0353 0.0506 0.0155 0.07880 0.00309
P-value 0.002 0.001 0.002 0.017 0.046 0.239 0.002 0.000 0.000 0.000
H. M. OLAGUE ET AL.
Odds ratio 1.14 1.51 1.03 1.06 1.08 1.04 1.05 1.02 1.08 1.000
Log-likelihood −35.844 −34.136 −35.778 −39.373 −38.113 −41.633 −36.403 −32.198 −32.721 −32.770
G/DF 12.853/1 16.271/1 12.985/1 5.796/1 8.316/1 1.276/1 11.736/1 20.146/1 19.1000/1 19.002/1
2008 John Wiley & Sons, Ltd.

P-value (G) 0.000 0.000 0.000 0.016 0.004 0.259 0.001 0.000 0.000 0.000
Somers’ D 0.80 0.77 0.79 0.43 0.50 0.30 0.65 0.80 0.78 0.80
Goodman–Kruskal Gamma 0.80 0.78 0.80 0.45 0.60 0.32 0.67 0.80 0.78 0.80
Kendall’s Tau-a 0.21 0.20 0.20 0.11 0.13 0.08 0.17 0.21 0.20 0.21
Hosmer and Lemeshow 17.157 12.780 16.826 8.825 7.205 10.250 14.121 17.765 12.903 16.005
goodness-of-fit test∗∗ 6 6 5 6 3 7 7 7 7 8
0.009 0.047 0.005 0.184 0.066 0.175 0.049 0.013 0.075 0.042

Constant −2.1822 −2.4150 −2.3309 −2.2809 −2.5827 −2.5105 −2.7719 −2.8133 −2.3419 −2.8109
Coefficient 0.08208 0.17880 0.02620 0.04321 0.17044 0.09176 0.0606 0.01470 0.03189 0.00309
P-value 0.007 0.013 0.001 0.014 0.000 0.002 0.000 0.000 0.017 0.000
Odds ratio 1.09 1.20 1.03 1.04 1.19 1.10 1.06 1.01 1.03 1.00
Log-likelihood −51.467 −52.136 −48.626 −51.523 −42.854 −49.642 −44.626 −41.215 −52.452 −41.525
G/DF 7.370/1 6.032/1 13.053/1 7.258/1 24.597/1 11.021/1 21.052/1 27.875/1 5.401/1 27.255/1
P-value (G) 0.007 0.014 0.000 0.007 0.000 0.001 0.000 0.000 0.000 0.000
Somers’ D 0.50 0.45 0.52 0.36 0.59 0.56 0.63 0.61 0.44 0.62
Kendall’s Tau-a 0.12 0.11 0.13 0.09 0.14 0.13 0.15 0.15 0.11 0.15
Hosmer and 8.227 6.448 17.424 4.463 1.328 8.894 3.939 6.820 10.282 7.655
Lemeshow 6 6 7 8 3 7 8 8 7 8
goodness-of-fit test∗∗ 0.222 0.375 0.015 0.813 0.723 0.260 0.863 0.556 0.173 0.468
DOI: 10.1002/smr
Copyright q
Table V. Continued.

Constant −4.0109 −4.3507 −4.2877 −3.9823 −3.5829 −3.9517 −4.1675 −4.6944 −4.1442 −4.5031
Coefficient 0.09897 0.22594 0.03242 0.04794 0.04301 0.07657 0.04312 0.01186 0.03754 0.00258
P-value 0.001 0.001 0.000 0.007 0.036 0.019 0.001 0.000 0.004 0.000
Odds ratio 1.10 1.25 1.03 1.05 1.04 1.08 1.04 1.01 1.04 1.00
Log-likelihood −25.719 −26.285 −22.968 −27.577 −28.608 −28.365 −25.850 −19.526 −27.069 −26.110
G/DF 9.942/1 8.811/1 15.445/1 6.227/1 4.165/1 4.650/1 9.681/1 22.329/1 7.242/1 19.160/1

P-value (G) 0.002 0.003 0.000 0.013 0.041 0.031 0.002 0.000 0.007 0.000
Somers’ D 0.53 0.81 0.87 0.63 0.61 0.67 0.90 0.92 0.81 0.93
Kendall’s Tau-a 0.05 0.05 0.06 0.04 0.04 0.04 0.06 0.06 0.05 0.06
Hosmer and 6.565 7.553 9.759 6.700 19.481 6.857 14.140 14.403 5.490 8.235
Lemeshow 5 6 6 6 3 6 6 7 7 8

Constant −2.1930 −2.3877 −2.3861 −2.6634 −2.0768 −2.5220 −2.9794 −2.6776 −2.2746 −2.8197
Coefficient 0.10314 0.19649 0.03409 0.07831 0.10552 0.10908 0.07540 0.01264 0.03812 0.00339
P-value 0.001 0.002 0.000 0.000 0.002 0.001 0.000 0.000 0.002 0.000
Odds ratio 1.11 1.22 1.03 1.08 1.11 1.12 1.08 1.01 1.04 1.00
Log-likelihood −49.920 −51.622 −46.194 −47.442 −50.205 −50.351 −42.591 −42.392 −51.467 −40.401
G/DF 15.136/1 11.731/1 22.588/1 20.091/1 14.566/1 14.274/1 29.792/1 30.192/1 12.042/1 34.172/1
P-value (G) 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.000
Somers’ D 0.60 0.58 0.64 0.49 0.51 0.48 0.65 0.72 0.61 0.72
Kendall’s Tau-a 0.16 0.16 0.17 0.13 0.14 0.13 0.18 0.20 0.17 0.20
Hosmer and 8.325 9.678 8.741 0.821 3.113 8.255 6.079 5.497 6.474 5.509
Lemeshow 6 6 5 7 3 6 7 7 6 8
EMPIRICAL VALIDATION OF OO CLASS COMPLEXITY METRICS
DOI: 10.1002/smr
187
188
Copyright q
Table V. Continued.

Constant −1.2348 −0.5350 0.8853 −1.1884 −1.0851 −1.4226 −1.4850 −1.1129 −0.9323 −1.2385
Coefficient 0.27055 0.00127 0.02130 0.0683 0.3238 0.15303 0.07879 0.01097 0.03313 0.00369
P-value 0.000 0.898 0.004 0.001 0.001 0.000 0.000 0.000 0.018 0.000
Odds ratio 1.31 1.00 1.03 1.07 1.38 1.17 1.08 1.01 1.03 1.00
H. M. OLAGUE ET AL.
Log-likelihood −78.100 −94.270 −87.086 −85.368 −77.086 −81.233 −77.977 −80.949 −89.708 −77.751
G/DF 32.357/1 0.016/1 14.386/1 17.820/1 34.385/1 26.091/1 32.604/1 26.659/1 9.140/1 33.054/1
P-value (G) 0.000 0.899 0.000 0.000 0.000 0.000 0.000 0.000 0.003 0.000

Somers’ D 0.61 0.17 0.60 0.52 0.57 0.59 0.66 0.68 0.52 0.68
Kendall’s Tau-a 0.29 0.08 0.28 0.24 0.27 0.28 0.31 0.32 0.24 0.32
Hosmer and 16.484 32.759 21.147 17.957 114.208 15.444 19.480 21.032 12.170 24.247
Lemeshow 6 6 5 6 3 5 7 8 7 8

Constant −1.8000 −1.9885 −1.8838 −1.9818 −1.8819 −2.0906 −2.2656 −1.9421 −2.0339 −2.0635
Coefficient 0.07681 0.14888 0.02087 0.04735 0.09862 0.08836 0.05090 0.00627 0.04017 0.00197
P-value 0.003 0.011 0.001 0.003 0.001 0.002 0.000 0.001 0.003 0.000
Odds ratio 1.08 1.16 1.02 1.05 1.10 1.09 1.05 1.01 1.04 1.00
Log-likelihood −67.251 −66.990 −64.534 −67.087 −62.690 −65.720 −61.597 −63.934 −65.723 −61.530
G/DF 9.072/1 9.592/1 14.506/1 9.399/1 18.194/1 12.133/1 20.380/1 15.706/1 12.127/1 20.513/1
P-value (G) 0.003 0.002 0.000 0.002 0.000 0.000 0.000 0.000 0.000 0.000
Somers’ D 0.36 0.35 0.42 0.20 0.44 0.26 0.35 0.44 0.47 0.53
Kendall’s Tau-a 0.11 0.11 0.13 0.06 0.14 0.08 0.11 0.14 0.14 0.16
Hosmer and 9.576 5.975 9.226 7.317 2.896 6.462 7.085 10.880 8.132 6.651
Lemeshow 6 6 5 7 3 6 6 8 7 8
∗∗ 2 , DF, P-value given for Hosmer and Lemeshow goodness-of-fit test.
DOI: 10.1002/smr
compared with AMC, SDMC and NCM was significantly smaller; therefore, we might decide not
to use this independent variable in any BLR model.
6.5. Multivariate binary logistic regression
Although the results of the univariate regression provided some insight into the usefulness of the
individual metrics in the prediction of quality software, it is appropriate to consider the utility
of several metrics used in conjunction. In some cases, the often-utilized techniques of automated
forward and/or backward regression techniques result in incorrectly formulated models if the model
developer is not careful in considering the interactions of the independent variables. Instead, we
rely on the results of our PCA and univariate BLR to guide the formulation of the model [29].
Based on the results of our PCA and univariate binary logistics regression, it appears that AMC
and SDMC are both likely candidates for independent variables for our multivariate BLR (MBLR).
However, these two variables are not independent of each other and thus could not be used in the
same model. This is because an increase in AMC is likely to affect SDMC. The opposite may also
be true. Thus, only one of these variables may be used in an MBLR.
In the selection of a second variable for an MBLR, it could be argued that NTM could be used with
either AMC or SDMC because they are relatively independent of each other. That is, the addition
of a trivial method (complexity = 1) to a class may not be likely to affect the value of either AMC
or SDMC. However, the same cannot be said about NCM as the addition of a complex method may
significantly affect AMC or SDMC.
Also, a model can be developed using both NTM and NCM as these two independent variables
are independent of each other. WMC cannot be used with AMC, SDMC, NTM or NCM because it
is not independent of these variables.
The MBLR using MiniTab produced the results contained in Table VI. As can be seen, the
addition of a second variable, NTM, to separate models containing the independent variables AMC
Table VI. MiniTab MBLR model results (logistic regression table).

95% CI
Predictor Coefficient SE coefficient Odds Z P Ratio Lower Upper
MBLR results for model containing SDMC, NTM using Rhino 14R3
Constant −2.4613 0.4384 −5.61 0.000
SDMC 0.13014 0.04377 2.97 0.003 1.14 1.05 1.24
NTM 0.02368 0.03307 0.72 0.474 1.02 0.96 1.09
Log-likelihood = −35.609
Test that all slopes are zero: G = 13.323, DF = 2, P-value = 0.001
MBLR results for model containing AMC, NTM using Rhino 14R3
Constant −3.3168 0.6047 −5.48 0.000
AMC 0.4107 0.1251 3.28 0.001 1.51 1.18 1.93
NTM 0.03571 0.03211 1.11 0.266 1.04 0.97 1.10
Log-likelihood = −33.581
Test that all slopes are zero: G = 17.380, DF = 2, P-value = 0.000
DOI: 10.1002/smr
and SDMC results in the P-value for NTM exceeding 0.05 in both cases. Therefore, NTM is not
significant in these models. No other variable combinations were explored due to our previous
analysis and conclusions (see Sections 6.1, 6.2 and 6.4).
7. VALIDATION
We used a simple holdout method to validate our univariate SDMC and AMC models. We developed
MBLR models with Rhino version n data and validated them with Rhino version n +1 data, for
Rhino versions 14R3–15R5. The measure of effectiveness used is the concordant, discordant and
tied pairs. The concordant, discordant and tied pairs entries are calculated by pairing the observations
with different response values: the pairs, concordant, discordant and tied pairs. For example, Rhino
version 14R3 has 15 classes with defects and 80 classes that are not defective. This results in
80×15 pairs (or 1200 pairs) with different response values. A pair is concordant if a class that is
defective has a higher probability of being defective, discordant if the opposite is true and tied if
the probabilities are equal [30]. Refer to [31] for more information.
The results are shown in a tabular form in Table VII, and graphically in Figures 2–4. Figures
2 and 3 show that SDMC as the independent variable in a univariate BLR is slightly better than
AMC at identifying fault-prone classes. The results are fairly similar for the first several versions
of Rhino, but AMC is worse for the final Rhino version (15R4). The performance of the AMC
model constructed using Rhino 15R4 is significantly decreased due to the significant number of
tied pairs in that data for that model. The SDMC model suffers decreased performance but it is
not as drastic as the AMC model. However, overall both metrics perform comparably on all other
previous versions of Rhino.
We performed a Wilcoxon signed-ranks test (WSRT) to compare the results of the SDMC and
AMC to see whether there was a statistical difference between them. The WSRT considers infor-
mation about both the sign of the differences and the magnitude of the differences between pairs.
If the two variables are similarly distributed, the number of positive and negative differences will
not differ significantly. The hypothesis tested by the WSRT in this application is
H0 : AMC BLR model does not produce OO fault-proneness classification results that are different
from the SDMC BLR model.
H1 : AMC BLR model does produce OO fault-proneness classification results that are different
from the SDMC BLR model.
Table VII. SDMC and AMC univariate BLR model validation results.
SDMC AMC
Rhino model Concordant Discordant Ties Concordant Discordant Ties
14R3 72.3 22.3 5.4 70.3 26.0 3.7
15R1 91.5 8.4 0.1 89.9 9.5 0.6
15R2 72.4 15.5 12.1 74.3 18.5 7.2
15R3 77.4 16.1 6.5 73.4 19.3 7.3
15R4 62.5 26.7 10.8 30.0 5.8 64.2
DOI: 10.1002/smr
100
90
80
70
60
Percentage
50
40
30
20
10
0
14R3 15R1 15R2 15R3 15R4
Concordant 72.3 91.5 72.4 77.4 62.5
Discordant 22.3 8.4 15.5 16.1 26.7
Ties 5.4 0.1 12.1 6.5 10.8
Rhino Versions in Ascending Order
Figure 2. Binary logistics results shown in graphical form for SDMC.
100
90
80
70
60
Percentage
50
40
30
20
10
0
14R3 15R1 15R2 15R3 15R4
Concordant 70.3 89.9 74.3 73.4 30
Discordant 26 9.5 18.5 19.3 5.8
Ties 3.7 0.6 7.2 7.3 64.2
Figure 3. Binary logistics results shown in graphical form for AMC.
DOI: 10.1002/smr
100
90
80
Percentage Correctly Classified
70
60
50
40
30
20
10
0
14R3 15R1 15R2 15R3 15R4
SDMC 72.3 91.5 72.4 77.4 62.5
AMC 70.3 89.9 74.3 73.4 30
Figure 4. A comparison of SDMC vs AMC univariate BLR results.
We assume a symmetrical distribution and assume a critical value (/2) of 0.025. The results
produce a Z score (a standardized measure of the distance between the rank sum of the negative
group and its expected value) of −1.483 and a significance 0.138, which exceeds our critical value
and shows that the AMC BLR model results are significantly different from the SDMC BLR model
results. Therefore, we conclude that for this test case the SDMC BLR model produces better results.
Since the AMC and SDMC metrics both perform well indicating fault-prone classes for several
versions of Rhino, for Hypothesis 2, we reject the null hypothesis and accept the alternative hypoth-
esis that OO complexity metrics can identify fault-prone classes in multiple, sequential releases
of OO software systems that are developed using highly iterative, or agile, software development
processes.
8. RELATED WORK
A literature search produced the following studies that use OO metrics to study the evolution of OO
software. We briefly discuss them here and identify differences from our research as appropriate.
• Ohlsson et al. produced a case study that tracked system evolution to identify decaying compo-
nents in software and take corrective action to avoid software brittleness in a system written in
the C programming language (non-OO). The study used metrics different from those used in
DOI: 10.1002/smr
this study: defect fix reports, degree of interaction, total number of changes to the source files,
unique number of files fixed in a component, the average number of changes in a source file
and the growth in both source and executable files [25]. By comparison, ours is an evolutionary
study that uses OO complexity metrics.
• Alshayeb and Li successfully examined the utility of software metrics in predicting design
efforts in dynamic, iterative, short-cycled software development processes. Their study uses six
OO metrics as independent variables: (WMC, depth of inheritance tree (DIT), lack of cohesion
of methods (LCOM), NLM, coupling though abstract data type (CTA) and coupling through
message passing (CTM). It used multiple linear regression analysis and used LOC added, LOC
changed and LOC deleted as dependent variables. Their objective was to predict change from
one system iteration to the next [32]. By comparison, our study used various OO complexity
metrics to create OO class quality models in terms of fault proneness in OO classes.
• Gyimothy et al. produced a study in 2005 whose focus was to re-validate the findings in Basili
et al. in [8] and also to use machine-learning techniques (neural networks and decision trees)
to assess the ability of these techniques to predict fault-prone classes. They examine the last
version of the Mozilla system (version 1.6), written in C++, and draw some conclusions while
comparing their results with the Basili et al. study. The Basili et al. study focused on the
C&K OO class metrics (WMC, DIT, RFC, NOC, CBO, LCOM, LCOMN) and also the LOC
metric. They only revalidated the fault-proneness ability of the metrics and did not conduct
a study to evaluate their ability to serve as quality indicators in an evolutionary role, over all
the versions of Mozilla. They validated their ordinary and logistic regression models using
the first version of Mozilla (1.0). A limited software evolution study of Mozilla consisted of
examining the trend in the number of bugs per version, number of classes per version, and
comparing those values with the mean and standard deviation for each of the C&K and LOC
metrics. By comparison, our study used fault proneness as a quality indicator over all versions
of software in our case study to evaluate their use as quality indicators in highly iterative or
agile software systems [33].
• Tsantalis et al. similarly studied predicting the probability of change in OO system components
using the C&K OO class metrics. Their goal was to assess the probability that each class
would change in a future version of the system. They showed statistically that a correlation
between the extracted probabilities and the actual changes in a system existed. Probability
values were extracted considering both the change history of a design as well as its structural
characteristics [34]. By comparison, we evaluated the software system in our case study using
fault proneness over several software iterations.
• Nakatani et al. used the C&K and the LOC metrics to describe the evolution of boundary,
domain and common class categories for iterative software development processes. They study
the evolution patterns of three modest software systems written in Smalltalk [35].
• Mens and Demeyer identified predictive and retrospective software evolution metrics and their
value in assessing software quality and controlling the software evolution process. They use
metrics to compare two versions of a logic language written in Smalltalk. They attempt to
retrospectively classify the nature of the evolution of the software [36].
In addition, Subramanyam and Krishnan [37] also studied the C&K metrics suite and the LOC
metric using fault proneness as a quality indicator. Their case study was a large, mixed language
(C++ and Java) e-commerce application and examined one version of the system, not multiple,
DOI: 10.1002/smr
sequential versions. In addition to their primary conclusion that there is a high correlation between
OO metrics and their usefulness as indicators of quality, they also concluded that there was a
significant difference in the correlation between faults and C++ classes versus the correlation
between faults and Java classes. Their conclusions may also be applicable to our case study.
9. CONCLUSIONS
In this paper, we examined the utility of nine object-oriented (OO) software complexity metrics to
predict faulty OO classes in Rhino, a highly iterative, open-source Java-based software project that
has many characteristics typical of an agile software development process. Our analysis covered
six sequential versions of Rhino. This is important because as an agile or highly iterative project
evolves, we are interested to know whether we can still use the same methods that are associated
with traditional waterfall software development. First we verified the findings of other OO class
metrics empirical validation studies. These studies typically focus on initial quality, which involves
only the initial release of a software product. The first six versions of the software case study
were analyzed, focusing on the OO class metric trends between versions. We showed that most
of the complexity metrics used in this study correlated well with fault proneness of an OO class,
in both the initial build of the software and five subsequent deliveries. We used binary logistic
regression (BLR) as a modeling technique to build models to assist in identifying fault-prone OO
classes in multiple versions of the software. We conducted a PCA of the nine complexity metrics
to improve our multivariate regression modeling. We also conducted a univariate BLR using every
complexity metric in this study and a multivariate BLR analysis on the most promising complexity
metrics that we rationalized are independent of each other. The PCA and univariate BLR results
showed that the Michura et al.’s standard deviation method complexity (SDMC) metric and Etzkorn
et al.’s average method complexity (AMC) were best at predicting fault-prone classes over multiple
versions of Rhino. A simple holdout validation using SDMC and AMC in individual univariate
BLR models showed that SDMC performed better over all versions of Rhino. The lack of indepen-
dence between complexity metrics made it impossible to build properly classified multivariate BLR
models.
The empirical results of this study show that OO class metrics may be used over several iterations
of highly iterative or agile software products to predict fault-prone classes. It was previously thought
that OO quality metrics (which include complexity metrics) were only useful in assessing initial
quality in traditionally developed OO software since the initial deployment would encompass 90%
or more of its requirements [8]. Therefore, once deployed, the software’s initial defects would
be removed but the metrics would remain virtually unchanged. However, highly iterative or agile
projects increase in size and capability with every new version of the software. The integration of
new components with the current version of the software provides a similar basis for the use of
the OO class metrics as the initial quality argument with OO software developed using traditional
processes.
A secondary but significant finding was that metrics that qualify the complexity composition
of an OO class (e.g. SDMC and AMC) produced better results using BLR than the intuitive
size measurement metrics (e.g. LOC, WMC). The introspective nature of these metrics on class
complexity is a better indicator at predicting fault-prone classes. Both SDMC and AMC produced
good results individually. However, since they are highly collinear, a single model using both of
DOI: 10.1002/smr
these metrics could not be built using BLR. However, other non-statistically based techniques may
be developed using these two (or more) measures of complexity that may produce better results
than we have shown here.
ACKNOWLEDGEMENTS
We thank Norris Boyd, the creator of Mozilla’s Rhino Project, for providing invaluable support and information
about the Rhino project, and Stacy Lukins of the University of Alabama in Huntsville for her invaluable assistance
organizing the Rhino class fault data. We also thank Ken Nelson, Michael Staheli, Jason Haslam and Kevin
Groke of Scientific Toolworks, Inc. for providing evaluation copies of Understand for Java R
, usage assistance,
and for providing custom Perl scripts to extract the OO software metrics used in this study. We thank Lisa
Pulignanni of McCabe Associates, Inc. for her assistance with complexity metrics definitions used in this study.
Finally, we thank the anonymous reviewers of this paper for their thoughtful and generous comments.
This work was funded in part by NASA under Grants NAG5-12725 and NCC8-200.
REFERENCES
1. Pressman R. Software Engineering: A Practitioner’s Approach (6th edn), 1994; 538–539.

2. El Emam K, Benlarbi S, Goel N, Rai SN. The confounding effect of class size on the validity of object-oriented metrics.
IEEE Transactions on Software Engineering 2001; 27(7):630–650.
3. Bellini P, Bruno I, Nesi P, Rogai D. Comparing fault-proneness estimation models. Proceedings 10th IEEE International
Conference on Engineering of Complex Computer Systems, 2005; 205–214.
4. Briand LC, Melo WL, Wust J. Assessing the applicability of fault-proneness models across object-oriented software
projects. IEEE Transactions on Software Engineering 2002; 28(7):706–720.
5. Denaro G, Pezze M. An empirical evaluation of fault-proneness models. Proceedings of the Twenty-fourth International
Conference on Software Engineering, 2002; 241–251.
6. Ferenc R, Siket I, Gyimothy T. Empirical validation of object-oriented metrics on open source software for fault
prediction. IEEE Transactions on Software Engineering 2005; 31(10):897–910.
7. Fioravanti F, Nesi P. A study on fault-proneness detection of object-oriented systems. Fifth European Conference on
Software Maintenance and Reengineering, 2001; 121–130.
8. Basili VR, Briand LC, Melo WL. A validation of object-oriented design metrics as quality indicators. IEEE Transactions
on Software Engineering 1996; 22(10):751–761.
9. Chapin N, Hale JE, Khan KM, Ramil JF, Tan W. Types of software evolution and software maintenance. Journal of
Software Maintenance and Evolution: Research and Practice 2001; 13(1):3–30.
10. Fenton NE, Pfleeger SL. Software Metrics (2nd edn), 1997.
11. Lorenz M, Kidd J. Object-oriented Software Metrics, 1994; 146.
12. Michura J, Capretz MAM. Metrics suite for class complexity. International Conference on Information Technology:
Coding and Computing, 2005; 404–409.
13. McCabe TJ. A complexity measure. IEEE Transactions on Software Engineering 1976; 2(4):308–320.
14. Chidamber S, Kemerer C. A metrics suite for object oriented design. IEEE Transactions on Software Engineering 1994;
20(6):476–493.
15. Scientific Toolworks, Inc. Understand home page. Understand for Java User Guide and Reference Manual, 2006; 198.
http://scitools.com/
documents/manuals/pdf/understand java.pdf [21 November 2007].
16. Chidamber S, Kemerer C. ‘Authors reply’. IEEE Transactions on Software Engineering 1995; 21(3):265.
17. Etzkorn LH, Bansiya J, Davis C. Design and code complexity metrics for OO classes. Journal of Object-oriented
Programming 1999; 12(1):35–40.
18. Groke K. Scientific Toolworks, Inc., Private Communication [19 May 2006].
19. The Agile Alliance Home Page. http://www.agilealliance.org/home [21 November 2007].
20. Boyd N (module owner of the Mozilla Rhino Project). Private Communication [October 2005].
21. Bugzilla. Bugzilla Database, 2004; 2. https://bugzilla.mozilla.org/ [21 November 2007].
22. Cohen J. Statistical Power Analysis for the Behavioral Sciences (2nd edn), 1988; 567.
23. Conte SD, Dunsmore HE, Shen VY. Software Engineering Metrics and Models, 1986.
24. Briand LC, Melo WL, Wust J. Assessing the applicability of fault-proneness models across object-oriented software
projects. IEEE Transactions on Software Engineering 2002; 28(7):706–720.
DOI: 10.1002/smr
25. Ohlsson MC, Andrews AA, Wohlin C. Modeling fault-proneness statistically over a sequence of releases: A case study.
Journal of Software Maintenance and Evolution: Research and Practice 2001; 13:167–199.
26. Hosmer D, Lemeshow S. Applied Logistic Regression (2nd edn), 2000; 375.
27. Peng J, Lee KL, Ingersoll GM. An Introduction to Logistic Regression Analysis and Reporting. Indiana University,
Bloomington IN, 2007. http://www.class.uidaho.edu/psy586/Course%20Readings/Peng%20Lee%20&%20Ingersoll 02.pdf
[21 November 2007].
28. Garson D. Logistic Regression: SPSS Output. North Carolina State University PA765: Raleigh NC, 2006; 5.
http://www2.chass.ncsu.edu/garson/PA765/logispss.htm [21 November 2007].
29. Evanco WM. Comments on ‘The confounding effect of class size on the validity of object-oriented metrics’. IEEE
Transactions on Software Engineering 2003; 29(7):670–672.
30. Olague HM, Etzkorn LH, Gholston S, Quattlebaum S. Empirical validation of three software metrics suites to predict
fault-proneness of object-oriented classes developed using highly iterative or agile software development processes. IEEE
31. MiniTab. Users Documentation Web Page, 2006; 138. http://www.minitab.com/support/docs/rel14/MeetMinitab14.pdf
[21 November 2007].
32. Alshayeb M, Li W. An empirical validation of object-oriented metrics in two different iterative software processes. IEEE
33. Gyimothy T, Ferenc R, Siket I. Empirical validation of object-oriented metrics on open source software for fault
prediction. IEEE Transactions on Software Engineering 2005; 31(10):897–910.
34. Tsantalis N, Chatzigeorgiou A, Stephanides G. Predicting the probability of change in object-oriented systems. IEEE
35. Nakatani T, Tamai T, Tomoeda A, Matsuda H. Towards constructing a class evolution model. Proceedings of the
Asia-Pacific Software Engineering Conference and International Computer Science Conference, 1997; 131–138.
36. Mens T, Demeyer S. Future trends in software evolution metrics. Proceedings of the Fourth International Workshop on
Principles of Software Evolution (IWPSE), 2001; 83–86.
37. Subramanyam R, Krishnan MS. Empirical analysis of CK metrics for object-oriented design complexity: Implications
for software defects. IEEE Transactions on Software Engineering 2003; 29(4):297–310.
AUTHORS’ BIOGRAPHIES
Hector M. Olague received the bachelor’s degree in marine engineering systems from
the United States Merchant Marine Academy, Kings Point, NY, and the MS and PhD
degrees in computer science from the University of Alabama in Huntsville. He is an
engineer and a government program manager at the U.S. Army Space and Missile
Defense Command in Huntsville, AL. His research interests include software engineering,
object-oriented software metrics, information theory, and statistical and non-statistical
classification modeling.
Letha H. Etzkorn received the bachelor’s and master’s degree in electrical engineering
from the Georgia Institute of Technology and the PhD degree in computer science from
the University of Alabama in Huntsville. She is an associate professor in the Computer
Science Department at the University of Alabama in Huntsville. Her primary research
areas are in software engineering, primarily software metrics and program understanding,
and mobile and intelligent agents.
DOI: 10.1002/smr
Sherri L. Messimer is an associate professor of Industrial and Systems Engineering

and an Associate Dean of Engineering Student Affairs at the University of Alabama in
Huntsville. She received the BSE and MSE from the University of Texas in Arlington and
the PhD from Texas A&M University, all in Industrial Engineering. Her research interests
are in design of quality strategies for low defect processes, manufacturing systems analysis
and statistical approaches to software development process control.
Harry S. Delugach is an associate professor of Computer Science at the University

of Alabama in Huntsville. He has over 20 years of teaching experience, as well as an
extensive scholarly publication record in knowledge-based systems, conceptual graphs,
Common Logic and modeling in software engineering. He serves on several conference
program committees, including a senior role in the International Conference on Conceptual
Structures (ICCS). He has been asked to serve as a reviewer for journal articles and
books in both software engineering and knowledge representation. He is the author of
CharGer, an open-source conceptual graph visualization package. He serves on the ANSI
L8 committee, which is one of the technical advisory groups to ISO/IEC JTC1’s SC32
subcommittee on data interchange, under whose auspices he served as editor of the
Common Logic standard (ISO/IEC 24707:2007). He consults regularly with commercial
and government enterprises for their software engineering and analysis needs.
DOI: 10.1002/smr

An Empirical Validation of Object-Oriented Class Complexity Metrics and Their Ability To Predict Error-Prone Classes

Transféré par

Informations du document

Titre original

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

An Empirical Validation of Object-Oriented Class Complexity Metrics and Their Ability To Predict Error-Prone Classes

Transféré par

Droits d'auteur :

Formats disponibles

JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION: RESEARCH AND PRACTICE

J. Softw. Maint. Evol.: Res. Pract. 2008; 20:171–197

Contract/grant sponsor: NASA; contract/grant numbers: NAG5-12725, NCC8-200

Copyright q 2008 John Wiley & Sons, Ltd.

Table I. Specific definitions of metrics actually used in this study.

2.1. McCabe cyclomatic complexity

2.2. Weighted methods per class

3. DESCRIPTION OF METRICS USED

3.1. Standard deviation method complexity [12]

3.2. Average method complexity

3.3. Maximum cyclomatic complexity (CC Max) of a single method of a class

3.4. Number of instance methods

public class Circle {

// A class method: just compute a value based on the arguments

// Two instance methods: they operate on the instance fields of an object

public double circumference() { // Compute the circumference of the

Figure 1. An example of instance methods declared in Java.

3.5. Number of class methods

3.6. Number of trivial methods

3.9. Average LOC and LOC per class

4. TEST CASES USED FOR THIS STUDY

5. GOAL STATEMENT AND RESEARCH HYPOTHESIS

Two research hypotheses will be tested under the common goal:

Table II. Rhino version statistics for study.

6.1. Class defect density correlation

6.2. Principal component analysis

Table IV. Rhino principal component analyses for OO complexity metrics∗ .

Rhino 15R2 Rhino 15R3

Rhino 15R4 Rhino 15R5

Table IV. Continued.

6.3. Logistic regression analysis

6.3.1. Analysis method

e+1 X 1 +2 X 2 +···+n X n

6.4. Univariate BLR for complexity metrics

Rhino version 14R3 univariate binary logistic regression results

2008 John Wiley & Sons, Ltd.

Rhino version 15R1 univariate binary logistic regression results

Rhino version 15R2 univariate binary logistic regression results

2008 John Wiley & Sons, Ltd.

Rhino version 15R3 univariate binary logistic regression results

Rhino version 15R4 univariate binary logistic regression results

2008 John Wiley & Sons, Ltd.

Rhino version 15R5 univariate binary logistic regression results

∗∗ 2 , DF, P-value given for Hosmer and Lemeshow goodness-of-fit test.

6.5. Multivariate binary logistic regression

Table VI. MiniTab MBLR model results (logistic regression table).

Figure 2. Binary logistics results shown in graphical form for SDMC.

Figure 3. Binary logistics results shown in graphical form for AMC.

Figure 4. A comparison of SDMC vs AMC univariate BLR results.

1. Pressman R. Software Engineering: A Practitioner’s Approach (6th edn), 1994; 538–539.

Sherri L. Messimer is an associate professor of Industrial and Systems Engineering

Harry S. Delugach is an associate professor of Computer Science at the University

Vous aimerez peut-être aussi