
Unit-III

Establishing software quality requirements:


The software quality metrics methodology is a systematic approach to establishing quality requirements and identifying, implementing, analyzing, and validating the process and product of software quality metrics for a software system. It spans the entire software life cycle and comprises five steps. These steps shall be applied iteratively because insights gained from applying a step may show the need for further evaluation of the results of prior steps.

(a) Establish software quality requirements: A list of quality factors is selected, prioritized, and quantified at the outset of system development or system change. These requirements shall be used to guide and control the development of the system and to assess whether the system meets the quality requirements specified in the contract.
(b) Identify software quality metrics: The software quality metrics framework is applied in the selection of relevant metrics.
(c) Implement the software quality metrics: Tools are either procured or developed, data is collected, and metrics are applied at each phase of the software life cycle.
(d) Analyze software quality metrics results: The metrics results are analyzed and reported to help control the development and assess the final product.
(e) Validate the software quality metrics: Predictive metrics results are compared to the direct metrics results to determine whether the predictive metrics accurately measure their associated factors.
The documentation or output produced as a result of these steps is shown in Table 1.

Table 1 - Outputs of metrics methodology steps

Metrics methodology step                    Output
Establish software quality requirements     - Quality requirements
Identify software quality metrics           - Approved quality metrics framework
                                            - Metrics set
                                            - Cost-benefit analysis
Implement the software quality metrics      - Description of data items
                                            - Metric data items
                                            - Traceability matrix
                                            - Training plan and schedule

5.1 Establish software quality requirements:

Quality requirements shall be represented in either of the following forms:

Direct metric value: A numerical target for a factor to be met in the final product. For example, mean time to failure (MTTF) is a direct metric of final system reliability.

Predictive metric value: An intermediate requirement that is an early indicator of final system performance. For example, design or code errors may be early predictors of final system reliability.

5.1.1 Identify a list of possible quality requirements:

Identify quality requirements that may be applicable to the software system. Use organizational experience, required standards, regulations, or laws to create this list. Annex A contains sample lists of factors and subfactors. In addition, list other system requirements that may affect the feasibility of the quality requirements. Consider acquisition concerns, such as cost or schedule constraints, warranties, and organizational self-interest. Do not rule out mutually exclusive requirements at this point. Focus on factor/direct metric combinations instead of predictive metrics. All parties involved in the creation and use of the system shall participate in the quality requirements identification process.

5.1.2 Determine the actual list of quality requirements:

Rate each of the listed quality requirements by importance. Importance is a function of the system characteristics and the viewpoints of the people involved.

To determine the actual list of quality requirements from the possible quality requirements, follow the two steps below: survey the involved parties to rate the candidate quality requirements, and then resolve the results of the survey into a single list of quality requirements. This shall involve a technical feasibility analysis of the quality requirements. The proposed factors for this list may have cooperative or conflicting relationships. Conflicts between requirements shall be resolved at this point. In addition, if the choice of quality requirements is in conflict with cost, schedule, or system functionality, one or the other shall be altered. Care shall be exercised in choosing the desired list to ensure that the requirements are technically feasible, reasonable, complementary, achievable, and verifiable. All involved parties shall agree to this final list.

5.1.3 Quantify each factor:

For each factor, assign one or more direct metrics to represent the factor, and assign direct metric values to serve as quantitative requirements for that factor. For example, if high efficiency was one of the quality requirements from the previous item, the direct metric actual resource utilization/allocated resource utilization with a value of 90% could represent that factor. This direct metric value is used to verify the achievement of the quality requirement. Without it, there is no way to tell whether or not the delivered system meets its quality requirements. The quantified list of quality requirements and their definitions shall again be approved by all involved parties.
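As a minimal illustration (the interpretation of the 90% value as a ceiling, and the numbers below, are assumptions made for this sketch, not part of the methodology), a quantified requirement can be checked mechanically once the direct metric value is fixed:

# Hypothetical sketch: verifying a quantified quality requirement.
# The efficiency factor is represented by the direct metric
# actual resource utilization / allocated resource utilization,
# here assumed to have a required direct metric value of at most 0.90 (90%).

def efficiency_metric(actual_utilization: float, allocated_utilization: float) -> float:
    """Direct metric for the efficiency factor (utilization ratio)."""
    return actual_utilization / allocated_utilization

REQUIRED_MAX = 0.90  # quantitative requirement agreed by all involved parties

measured = efficiency_metric(actual_utilization=0.68, allocated_utilization=0.80)
print(f"measured utilization ratio = {measured:.2f}")
print("requirement met" if measured <= REQUIRED_MAX else "requirement NOT met")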

5.2 Identify software quality metrics:

When identifying software quality metrics, apply the software quality metrics framework, and perform a cost-benefit analysis. Then gain commitment to the metrics from all involved parties.

5.2.1 Apply the software quality metrics framework:

Create a chart of the quality requirements based on the hierarchical tree structure found in figure 1. At this point, only the factor level must be complete. Next decompose each factor into subfactors as indicated in clause 4. The decomposition into subfactors must continue for as many levels as needed until the subfactor level is complete.


Using the software quality metrics framework, decompose the subfactors into measurable metrics. For each validated metric on the metric level, assign a target value and a critical value and range that should be achieved during development. The target values constitute additional quality requirements for the system.

The framework and the target values for the metrics shall be reviewed and approved by all involved parties.

To help ensure that metrics are used appropriately, only validated metrics (that is, either direct metrics or metrics validated with respect to direct metrics) shall be used to assess current and future product and process quality. Nonvalidated metrics may be included for future analysis, but shall not be included as part of the system requirements. Furthermore, the metrics that are used shall be those that are associated with the quality requirements of the software project. Given that the above conditions are satisfied, the selection of specific metrics as candidates for validation and the selection of specific validated metrics for application is at the discretion of the user of this standard. Examples of metrics and experiences with their use are given in annex B.

Document each metric using the format shown in table 2.

Table 2 - Metrics set

Item          Description
Name          Name given to the metric.
Impact        Indication of whether a metric may be used to alter the project.
Target value  Numerical value of the metric that is to be achieved in order to meet the quality requirement.
Factors       Factors that are related to this metric.
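For illustration only, a Table 2 entry could be captured in machine-readable form as sketched below; the field names follow the table, while the specific metric, target value, and factors are assumed examples:

# Hypothetical metrics-set entry following the Table 2 format.
metric_entry = {
    "name": "design_inspection_coverage",           # Name given to the metric
    "impact": "may be used to alter the project",   # Impact
    "target_value": 0.95,                           # Numerical value to be achieved
    "factors": ["reliability", "maintainability"],  # Related factors
}

def meets_target(entry, measured_value):
    """True if the measured metric value reaches the target value."""
    return measured_value >= entry["target_value"]

print(meets_target(metric_entry, 0.97))  # True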

5.2.2 Perform a cost-benefit analysis:

Perform a cost-benefit analysis by identifying the costs of implementing the metrics, identifying the benefits of applying the metrics, and then adjusting the metrics set.

5.2.2.1 Identify the costs of implementing the metrics:

Identify and document (see 2.2) all the costs associated with the metrics in the metrics set. For each metric, estimate and document the following impacts and costs:
- Metric utilization cost: The costs of collecting data; automating the metric value calculation (when possible); and analyzing, interpreting, and reporting the results.
- Software development cost change: The set of metrics may imply a change in the organizational structure used to produce the software system.

- Special equipment: Hardware or software tools may have to be located, purchased, adapted, or developed to implement the metrics.
- Training: The quality assurance/control organization or the entire development team may need training in the use of the metrics and data collection procedures. If the introduction of metrics has caused changes in the development process, the development team may need to be educated about the changes.

5.2.2.2 Identify the benefits of applying the metrics:

Identify and document the benefits that are associated with each metric in the metrics set. Some benefits to consider include the following:
- Identify quality goals and increase awareness of the goals in the software organization.
- Provide timely feedback useful in developing higher quality software.
- Increase customer satisfaction by quantifying the quality of the software before it is delivered to the customer.
- Provide a quantitative basis for making decisions about software quality.

5.2.2.3 Adjust the metrics set:

Weigh the benefits, tangible and intangible, against the costs of each metric. If the costs exceed the benefits of a given metric, alter it or delete it from the metrics set. For the metrics that remain, make plans for any necessary changes to the software development process, organizational structure, tools, and training. In most cases it will not be feasible to quantify benefits; in these cases, judgment shall be exercised in weighing qualitative benefits against quantitative costs.

5.2.3 Gain commitment to the metrics set:

All involved parties shall review the revised metrics set to which the cost-benefit analysis has been added. The metrics set shall be formally adopted and supported by this group.

5.3 Implement the software quality metrics:


To implement the software quality metrics, define the data collection procedures, use selected software to prototype the measurement process, then collect the data and compute the metric values.

5.3.1 Define the data collection procedures:

For each metric in the metrics set, determine the data that will be collected and the assumptions that will be made about the data (for example, random sample, subjective or objective measure). The flow of data shall be shown from the point of collection to the evaluation of metrics. Describe or reference when and how tools are to be used, and describe data storage procedures. Also, identify candidate tools and select tools for use with the prototyping process. A traceability matrix shall be established between metrics and data items. Identify the organizational entities that will directly participate in data collection, including those responsible for monitoring data collection. Describe the training and experience required for data collection and the training process for the personnel involved. Describe each data item thoroughly, using the format shown in Table 3.
Table 3 - Description of a data item

Item            Description
Name            Name given to the data item.
Metrics         Metrics that are associated with the data item.
Definition      Straightforward description of the data item.
Source          Location of where the data originates.
Collector       Entity responsible for collecting the data.
Timing          Time(s) in the life cycle at which the data is to be collected. (Some data items are collected more than once.)
Procedures      Methodology (for example, automated or manual) used to collect the data.
Representation  Manner in which the data is represented, for example, its precision and format.
Sample          Method used to select the data to be collected and the percentage of the available data that is to be collected.
Alternatives    Methods other than the primary method that may be used to collect the data.
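The sketch below shows one possible machine-readable form of a Table 3 data item description, together with a small traceability matrix between metrics and data items; all names and values are hypothetical:

# Hypothetical sketch of a data item description (Table 3 format) and a
# simple traceability matrix linking metrics to the data items they need.
data_item = {
    "name": "defects_found_in_inspection",
    "metrics": ["defect_density", "inspection_effectiveness"],
    "definition": "Count of defects recorded during a design or code inspection",
    "source": "inspection meeting report",
    "collector": "inspection moderator",
    "timing": ["design inspection", "code inspection"],
    "procedures": "manual count, entered into the metrics database",
    "representation": "non-negative integer",
    "sample": "100% of inspections",
    "alternatives": "automated extraction from the inspection tool, if available",
}

# Traceability matrix: metric -> data items required to compute it.
traceability = {
    "defect_density": ["defects_found_in_inspection", "inspected_size_kloc"],
    "inspection_effectiveness": ["defects_found_in_inspection", "total_defects_found"],
}

for metric, items in traceability.items():
    print(f"{metric} <- {', '.join(items)}")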

5.3.2 Prototype the measurement process:

Test the data collection and metric computation procedures on selected software that will act as a prototype. If possible, the samples selected should be similar to the project(s) on which the metrics will later be used. An analysis shall be made to determine whether the data is collected uniformly and whether instructions have always been interpreted in the same manner. In particular, data requiring subjective judgments shall be checked to determine whether the descriptions and instructions are clear enough to ensure uniform results. In addition, the cost of the measurement process for the prototype shall be examined to verify or improve the cost analysis.
5.3.3 Collect the data and compute the metrics values:

Using the formats in table 2 and table 3, collect and store data at the appropriate time in the life cycle. The data shall be checked for accuracy and proper unit of measure. Data collection shall be monitored. If a sample of data is used, requirements such as randomness, minimum size of sample, and homogeneous sampling shall be verified. If more than one person is collecting the data, it shall be checked for uniformity. Compute the metrics values from the collected data.

5.4 Analyze the software metrics results:

Analyzing the results of the software metrics includes interpreting the results, identifying software quality, making software quality predictions, and ensuring compliance with requirements.

5.4.1 Interpret the results:

The results shall be interpreted and recorded against the broad context of the project as well as for a particular product or process of the project. The differences between the collected metric data and the target values for the metrics shall be analyzed against the quality requirements. Substantive differences shall be investigated.

5.4.2 Identify software quality:

Quality metric values for software components shall be determined and reviewed. Quality metric values that are outside the anticipated tolerance intervals (low or unexpectedly high quality) shall be identified for further study. Unacceptable quality may be manifested as excessive complexity, inadequate documentation, lack of traceability, or other undesirable attributes. The existence of such conditions is an indication that the software may not satisfy quality requirements when it becomes operational. Since many of the direct metrics that are usually of interest cannot be measured during software development (for example, reliability metrics), validated metrics shall be used when direct metrics are not available. Direct or validated metrics shall be measured for software components and process steps. The measurements shall be compared with critical values of the metrics. Software components whose measurements deviate from the critical values shall be analyzed in detail. Unexpectedly high quality metric values shall cause a review of the software development process, as the expected tolerance levels may need to be modified or the process for identifying quality metric values may need to be improved.

5.4.3 Make software quality predictions:

During development, validated metrics shall be used to make predictions of direct metric values. Predicted values of direct metrics shall be compared with target values to determine whether to flag software components for detailed analysis. Predictions shall be made for software components and process steps. Software components and process steps whose predicted direct metric values deviate from the target values shall be analyzed in detail. Prediction is potentially very valuable because it estimates the metric of ultimate interest, the direct metric. However, prediction is difficult because it involves using validated metrics from an earlier phase of the life cycle (for example, development) to make a prediction about a different but related metric (the direct metric) in a much later phase (for example, operations).
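A minimal sketch of such a prediction, assuming a simple linear relationship fitted to invented historical data (the methodology does not prescribe a particular prediction model, and all names and numbers below are hypothetical):

import numpy as np

# Hypothetical sketch: using a validated complexity metric collected during
# development to predict MTTF (a direct metric) and flag components whose
# predicted value misses the target.
complexity_history = np.array([4, 6, 8, 10, 12])       # validated metric (past projects)
mttf_history = np.array([1600, 1400, 1100, 900, 700])  # measured MTTF in operation (hours)

slope, intercept = np.polyfit(complexity_history, mttf_history, 1)  # simple linear fit

TARGET_MTTF = 1000  # hours, from the quantified quality requirements

new_components = {"comp_A": 5, "comp_B": 11}  # current project, complexity values
for name, cx in new_components.items():
    predicted = slope * cx + intercept
    flag = "analyze in detail" if predicted < TARGET_MTTF else "ok"
    print(f"{name}: predicted MTTF = {predicted:.0f} h -> {flag}")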

5.4.4 Ensure compliance with requirements:

Direct metrics shall be used to ensure compliance of software products with quality requirements during system and acceptance testing. Direct metrics shall be measured for software components and process steps. These values shall be compared with target values of the direct metrics that represent quality requirements. Software components and process steps whose measurements deviate from the target values are noncompliant.

5.5 Validate the software quality metrics:


For information on the statistical techniques used, see [B114], [B115], [B117], or similar references.

5.5.1 Purpose of metrics validation:

The purpose of metrics validation is to identify both product and process metrics that can predict specified quality factor values, which are quantitative representations of quality requirements. If metrics are to be useful, they shall indicate accurately whether quality requirements have been achieved or are likely to be achieved in the future. When it is possible to measure factor values at the desired point in the life cycle, these direct metrics are used to evaluate software quality. At some points in the life cycle, certain quality factor values (for example, reliability) are not available; they are obtained after delivery or late in the project. In these cases, other metrics are used early in a project to predict quality factor values. Although quality subfactors are useful when identifying and establishing factors and metrics, they need not be used in metrics validation, because the focus in validation is on determining whether a statistically significant relationship exists between predictive metric values and factor values. Quality factors may be affected by multiple variables. A single metric, therefore, may not sufficiently represent any one factor if it ignores these other variables.

5.5.2 Validity criteria:

To be considered valid, a predictive metric shall demonstrate a high degree of association with the quality factors it represents. This is equivalent to accurately portraying the quality condition(s) of a product or process. A metric may be valid with respect to certain validity criteria and invalid with respect to other criteria. The person in the organization who understands the consequences of the values selected shall designate threshold values for the following:

V - square of the linear correlation coefficient
B - rank correlation coefficient
A - prediction error
P - success rate

A short numerical example follows the definition of each validity criterion. Detailed examples of the application of metrics validation are contained in the annex.

Correlation: The variation in the quality factor values explained by the variation in the metric values, which is given by the square of the linear correlation coefficient (R^2) between the metric and the corresponding factor, shall exceed the threshold; that is, R^2 > V. This criterion assesses whether there is a sufficiently strong linear association between a

factor and a metric to warrant using the metric as a substitute for the factor when it is infeasible to use the latter. For example, the correlation coefficient between a complexity metric and the factor reliability may be 0.8. The square of this is 0.64: only 64% of the variation in the factor is explained by the variation in the metric. If V has been established as 0.7, the conclusion would be drawn that the metric is invalid.

Tracking: If a metric M is directly related to a quality factor F for a given product or process, then a change in the quality factor value from F_T1 to F_T2 at times T1 and T2 shall be accompanied by a change in the metric value from M_T1 to M_T2 in the same direction. For example, if a complexity metric is claimed to be a measure of reliability, then it is reasonable to expect a change in the reliability of a software component to be accompanied by an appropriate change in the metric value (for instance, if the product increases in reliability, the metric value should also change in a direction that indicates the product has improved). That is, if mean time to failure (MTTF) is used to measure reliability and is equal to 1000 hours during testing (T1) and 1500 hours during operation (T2), a complexity metric whose value is 8 in T1 and 6 in T2, where 6 is less complex than 8, is said to track reliability for this software component. If this relationship is demonstrated over a representative sample of software components, the conclusion could be drawn that the metric can track reliability (that is, indicate changes in product reliability) over the software life cycle.

Consistency: If factor values F1, F2, ..., Fn, corresponding to a set of products or processes, have the relationship F1 > F2 > ... > Fn, then the corresponding metric values shall have the relationship M1 > M2 > ... > Mn. To perform this test, compute the coefficient of rank correlation (r) between paired values (from the same software components) of the factor and the metric; |r| shall exceed B. This criterion assesses whether there is consistency between the ranks of the factor values of a set of software components and the ranks of the metric values for the same set of software components.

Predictability: If a metric is used at time T1 to predict a quality factor value for a given product or process, it shall predict the quality factor value F_pT2 with an accuracy of |F_pT2 - F_aT2| / F_aT2 < A, where F_aT2 is the actual factor value at time T2. This criterion assesses whether a metric is capable of predicting a factor value with the required accuracy. For example, if a complexity metric is used during development (T1) to predict the reliability of a software component during operation (T2) to be 1200 hours MTTF (F_pT2), and the actual MTTF measured during operation is 1000 hours (F_aT2), then the error in prediction is 200 hours, or 20%. If the acceptable prediction error (A) is 25%, the prediction accuracy is acceptable. If the ability to predict is demonstrated over a representative sample of software components, the conclusion could be drawn that the metric can be used as a predictor of reliability.

Discriminative power: A metric shall be able to discriminate between high quality software components (for example, high MTTF) and low quality software components (for example, low MTTF). For instance, the set of metric values associated with the former should be significantly higher (or lower) than those associated with the latter. The Mann-Whitney test and the chi-square test for differences in probabilities (contingency tables) can be used for this validation test.

Reliability: A metric shall demonstrate the correlation, tracking, consistency, predictability, and discriminative power properties above for P percent of the applications of the metric. This criterion is used to ensure that a metric has passed a validity test over a sufficient number or percentage of applications so that there will be confidence that the metric can perform its intended function consistently.
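A rough sketch of how the correlation, consistency, predictability, and discriminative power checks could be computed is given below; the metric and factor values and the thresholds V, B, and A are invented for illustration, and scipy is assumed to be available:

import numpy as np
from scipy import stats

# Hypothetical sketch of the validity checks in 5.5.2.
metric = np.array([12, 9, 15, 7, 11, 14, 6, 10])                  # predictive metric values
factor = np.array([800, 1100, 600, 1400, 950, 700, 1500, 1000])   # direct metric (MTTF, hours)

V, B, A = 0.7, 0.7, 0.25   # thresholds designated by the organization (assumed)

# Correlation: square of the linear correlation coefficient must exceed V.
r, _ = stats.pearsonr(metric, factor)
print("correlation ok:", r**2 > V)

# Consistency: |rank correlation| must exceed B.
rho, _ = stats.spearmanr(metric, factor)
print("consistency ok:", abs(rho) > B)

# Predictability: relative prediction error must be below A.
predicted_mttf, actual_mttf = 1200.0, 1000.0
print("predictability ok:", abs(predicted_mttf - actual_mttf) / actual_mttf < A)

# Discriminative power: metric values for low-MTTF vs. high-MTTF components
# should differ significantly (Mann-Whitney test).
low_quality = metric[factor < np.median(factor)]
high_quality = metric[factor >= np.median(factor)]
u, p_value = stats.mannwhitneyu(low_quality, high_quality, alternative="two-sided")
print("discriminative power ok:", p_value < 0.05)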

5.5.3 Validation procedure:

Metrics validation shall include identifying the quality factors sample, identifying the metrics sample, performing a statistical analysis, and documenting the results.

5.5.3.1 Identify the quality factors sample:

These factor values (for example, measurements of reliability), which represent the quality requirements of a project, were previously identified (see 5.1.3) and collected and stored (see 5.3.3). For validation purposes, draw a sample from the metrics database.

5.5.3.2 Identify the metrics sample:

These metrics (for example, design structure) are used to predict or to represent quality factor values when the factor values cannot be measured. The metrics were previously identified (see 5.2.1), their data collected and stored, and their values computed from the collected data (see 5.3.3). For validation purposes, draw a sample from the same domain (for example, the same software components) of the metrics database as used in 5.5.3.1.

5.5.3.3 Perform a statistical analysis:

The tests described in 5.5.2 shall be performed. Before a metric is used to evaluate the quality of a product or process, it shall be validated against the criteria described in 5.5.2. If a metric does not pass all of the validity tests, it shall only be used according to the criteria prescribed by those tests (for example, if it only passes the tracking validity test, it shall be used only for tracking the quality of a product or process).

5.5.3.4 Document the results:

Documenting results shall include the direct metric, predictive metric, validation criteria, and numerical results, as a minimum.

5.5.4 Additional requirements:

Additional requirements, namely the need for revalidation, confidence in analysis results, and the stability of the environment, are described in the following subclauses.

5.5.4.1 The need for revalidation:

It is important to revalidate a predictive metric before it is used for another environment or application. As the software engineering process changes, the validity of metrics changes. Cumulative metric validation values may be misleading, because a metric that has been valid for several uses may become invalid. It is wise to compare the one-time validation of a metric with its validation history to avoid being misled. The following statements of caution should be noted:
- A validated metric may not necessarily be valid in other environments or future applications.
- A metric that has been invalidated may be valid in other environments or future applications.

5.5.4.2 Confidence in analysis results:

Metrics validation is a continuous process. Confidence in metrics increases over time as metrics are validated on a variety of projects and as the metrics database and sample size increase. Confidence is not a static, one-time property. If a metric is valid, confidence will increase with increased use (that is, the correlation coefficient will be significant at decreasing values of the significance level). Greatest confidence occurs when metrics have been tentatively validated based on data collected from previous projects. Even when this is the case, the validation analysis will continue into future projects as the metrics database and sample size grow.

5.5.4.3 Stability of environment:

Practicable metrics validation shall be undertaken in a stable development environment (that is, where the design language, implementation language, or program development tools do not change over the life of the project in which validation is performed). In addition, there shall be at least one project in which metrics data have been collected and validated prior to application of the predictive metrics. This project shall be similar to the one in which the metrics are applied with respect to software engineering skills, application, size, and software engineering environment.

SOFTWARE QUALITY INDICATOR:

A software quality indicator (SQI) is a variable whose value can be determined through direct analysis of product or process characteristics and whose evidential relationship to one or more software engineering attributes is undeniable. A software quality indicator can be calculated to provide an indication of the quality of the system by assessing system characteristics.

Method: Assemble a quality indicator from factors that can be determined automatically with commercial or custom code scanners, such as the following:
- cyclomatic complexity of code (e.g., McCabe's Complexity Measure),
- unused/unreferenced code segments (these should be eliminated over time),
- average number of application calls per module (complexity is directly proportional to the number of calls),
- size of compilation units (reasonably sized units have approximately 20 functions or paragraphs, or about 2000 lines of code; these guidelines will vary greatly by environment),
- use of structured programming constructs (e.g., elimination of GOTOs and circular procedure calls).

These measures apply to traditional 3GL environments and are more difficult to determine in environments that use object-oriented languages, 4GLs, or code generators.

Tips and Hints: With existing software, the software quality indicator could also include a measure of the reliability of the code. This can be determined by keeping a record of how many times each module has to be fixed in a given time period. There are other factors that contribute to the quality of a system, such as: procedure re-use, clarity of code and documentation, consistency in the application of naming conventions, adherence to standards, consistency between documentation and code, and the presence of current unit test plans.

These factors are harder to determine automatically. However, with the introduction of CASE tools and reverse-engineering tools, and as more of the design and documentation of a system is maintained in structured repositories, these measures of quality will be easier to determine, and could be added to the indicator.
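One possible way to combine such scanner-derived factors into a single indicator is sketched below; the scoring rule, weights, and reference points are assumptions to be tuned per environment, not a standard formula:

# Hypothetical sketch: combining automatically scanned factors into a single
# software quality indicator (SQI) between 0 and 1.
def score(value, good, bad):
    """Linearly map a raw measure to [0, 1], where `good` scores 1 and `bad` scores 0."""
    if good == bad:
        return 1.0
    s = (value - bad) / (good - bad)
    return max(0.0, min(1.0, s))

module_scan = {
    "avg_cyclomatic_complexity": 14,   # e.g., McCabe's measure
    "unused_code_segments": 3,
    "avg_calls_per_module": 25,
    "avg_unit_size_loc": 2600,
    "goto_count": 5,
}

weights = {
    "avg_cyclomatic_complexity": 0.3,
    "unused_code_segments": 0.15,
    "avg_calls_per_module": 0.2,
    "avg_unit_size_loc": 0.2,
    "goto_count": 0.15,
}

# (good, bad) reference points per factor; assumed values, tune per environment.
references = {
    "avg_cyclomatic_complexity": (5, 30),
    "unused_code_segments": (0, 20),
    "avg_calls_per_module": (10, 60),
    "avg_unit_size_loc": (2000, 6000),
    "goto_count": (0, 50),
}

sqi = sum(weights[k] * score(module_scan[k], *references[k]) for k in module_scan)
print(f"SQI = {sqi:.2f}  (1.0 = best)")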

Fundamentals in Measurement Theory:


This section discusses the fundamentals of measurement theory. We outline the relationships among theoretical concepts, definitions, and measurement, and describe some basic measures that are used frequently. It is important to distinguish the levels of the conceptualization process, from abstract concepts, to definitions that are used operationally, to actual measurements. Depending on the concept and the operational definition derived from it, different levels of measurement may be applied: nominal scale, ordinal scale, interval scale, and ratio scale. It is also beneficial to spell out the explicit differences among some basic measures such as ratio, proportion, percentage, and rate. Significant amounts of wasted effort and resources can be avoided if these fundamental measurements are well understood. We then focus on measurement quality. We discuss the most important issues in measurement quality, namely reliability and validity, and their relationships with measurement errors. We then discuss the role of correlation in observational studies and the criteria for causality.

3.1 Definition, Operational Definition, and Measurement:

It is undisputed that measurement is crucial to the progress of all sciences. Scientific progress is made through observations and generalizations based on data and measurements, the derivation of theories as a result, and in turn the confirmation or refutation of theories via hypothesis testing based on further empirical data. As an example, consider the proposition "the more rigorously the front end of the software development process is executed, the better the quality at the back end." To confirm or refute this proposition, we first need to define the key concepts. For example, we define "the software development process" and distinguish the process steps and activities of the front end from those of the back end. Assume that after the requirements-gathering process, our development process consists of the following phases:
- Design
- Design reviews and inspections
- Code
- Code inspection
- Debug and development tests
- Integration of components and modules to form the product
- Formal machine testing
- Early customer program

Integration is the development phase during which various parts and components are integrated to form one complete software product. Usually after integration the product is under formal change control. Specifically, after integration every change to the software must have a specific reason (e.g., to fix a bug uncovered during testing) and must be documented and tracked. Therefore, we may want to use integration as the cutoff point: the design, coding, debugging, and integration phases are classified as the front end of

the development process, and the formal machine testing and early customer trials constitute the back end. We need to specify the indicator(s) of the definition and make them operational. For example, suppose the process documentation says all designs and code should be inspected. One operational definition of rigorous implementation may be inspection coverage, expressed in terms of the percentage of the estimated lines of code (LOC) or of the function points (FP) that are actually inspected. Another indicator of good reviews and inspections could be the scoring of each inspection by the inspectors at the end of the inspection, based on a set of criteria. We may want to operationally use a five-point Likert scale to denote the degree of effectiveness (e.g., 5 = very effective, 4 = effective, 3 = somewhat effective, 2 = not effective, 1 = poor inspection). There may also be other indicators. In addition to design, design reviews, code implementation, and code inspections, development testing is part of our definition of the front end of the development process. We also need to operationally define "quality at the back end" and decide which measurement indicators to use. For the sake of simplicity, let us use defects found per KLOC (or defects per function point) during formal machine testing as the indicator of back-end quality. From these metrics, we can formulate several testable hypotheses, such as the following:
- For software projects, the higher the percentage of the designs and code that are inspected, the lower the defect rate at the later phase of formal machine testing.
- The more effective the design reviews and the code inspections, as scored by the inspection team, the lower the defect rate at the later phase of formal machine testing.

- The more thorough the development testing (in terms of test coverage) before integration, the lower the defect rate at the formal machine testing phase.

With the hypotheses formulated, we can set out to gather data and test the hypotheses. We also need to determine the unit of analysis for our measurement and data. In this case, it could be at the project level or at the component level of a large project. If we are able to collect a number of data points that form a reasonable sample size (e.g., 35 projects or components), we can perform statistical analysis to test the hypotheses.
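As a small illustration of such a test, assuming invented per-component data and the availability of scipy, the first hypothesis could be examined with a simple correlation:

from scipy import stats

# Hypothetical sketch: components with higher inspection coverage should show
# lower defect rates in formal machine testing. The data points are invented.
inspection_coverage = [0.95, 0.80, 0.60, 0.90, 0.70, 0.50, 0.85, 0.65]  # fraction of LOC inspected
test_defect_rate    = [1.2,  2.0,  3.5,  1.5,  2.8,  4.1,  1.8,  3.0]   # defects per KLOC in testing

r, p_value = stats.pearsonr(inspection_coverage, test_defect_rate)
print(f"r = {r:.2f}, p = {p_value:.3f}")
# A significant negative r would support the hypothesis for this sample.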

The building blocks of theory are concepts and definitions. In a theoretical definition a concept is defined in terms of other concepts that are already well understood. In the deductive logic system, certain concepts would be taken as undefined; they are the primitives. All other concepts would be defined in terms of the primitive concepts. For example, the concepts of point and line may be used as undefined and the concepts of triangle or rectangle can then be defined based on these primitives.

3.2 Level of Measurement:

We have seen that the route from theory to empirical hypothesis, and from theoretically defined concepts to operational definitions, is by no means direct. As the example illustrates, when we operationalize a definition and derive measurement indicators, we must consider the scale of measurement. For instance, to measure the quality of software inspection we may use a five-point scale to score the inspection effectiveness, or we may use a percentage to indicate the inspection coverage. For some cases, more than one measurement scale is applicable; for others, the nature of the concept and the resultant operational definition can be

measured only with a certain scale. In this section, we briefly discuss the four levels of measurement: nominal scale, ordinal scale, interval scale, and ratio scale.

Nominal Scale

The simplest operation in science and the lowest level of measurement is classification. In classifying we attempt to sort elements into categories with respect to a certain attribute. For example, if the attribute of interest is religion, we may classify the subjects of the study into Catholics, Protestants, Jews, Buddhists, and so on. If we classify software products by the development process models through which the products were developed, then we may have categories such as waterfall development process, spiral development process, iterative development process, object-oriented programming process, and others. In a nominal scale, the names of the categories and their sequence bear no assumptions about relationships among categories. For instance, we place the waterfall development process in front of the spiral development process, but we do not imply that one is "better than" or "greater than" the other.

Ordinal Scale

Ordinal scale refers to measurement operations through which the subjects can be compared in order. For example, we may classify families according to socioeconomic status: upper class, middle class, and lower class. We may classify software development projects according to the SEI maturity levels or according to a process rigor scale. The ordinal measurement scale is at a higher level than the nominal scale in the measurement hierarchy. Through it we are able not only to group subjects into categories, but also to order the categories. An ordinal scale is asymmetric in the sense that if A > B is true then B > A is false. It has the transitivity property in that if A > B and B > C, then A > C. When we translate order relations into mathematical operations, we cannot use operations such as addition, subtraction, multiplication, and division; we can use only "greater than" and "less than." However, in real-world applications, for some specific types of ordinal scales (such as the Likert five-point, seven-point, or ten-point scales), the

assumption of equal distance is often made, and operations such as averaging are applied to these scales. In such cases, we should be aware that the measurement assumption is being violated, and use extreme caution when interpreting the results of the data analysis.

Interval and Ratio Scales:

An interval scale indicates the exact differences between measurement points. The mathematical operations of addition and subtraction can be applied to interval scale data. For instance, assuming products A, B, and C are developed in the same language, if the defect rate of software product A is 5 defects per KLOC and product B's rate is 3.5 defects per KLOC, then we can say product A's defect level is 1.5 defects per KLOC higher than product B's defect level. An interval scale of measurement requires a well-defined unit of measurement that can be agreed on as a common standard and that is repeatable. Given a unit of measurement, it is possible to say that the difference between two scores is 15 units, or that one difference is the same as a second. Assuming product C's defect rate is 2 defects per KLOC, we can thus say the difference in defect rate between products A and B is the same as that between B and C. For interval and ratio scales, the measurement can be expressed as both integer and noninteger data. Integer data are usually given in terms of frequency counts (e.g., the number of defects customers will encounter for a software product over a specified time length).

3.3 Some Basic Measures

Regardless of the measurement scale, when the data are gathered we need to analyze them to extract meaningful information. Various measures and statistics are available for summarizing the raw data and for making comparisons across groups. In this section we discuss some basic measures such as ratio, proportion, percentage, and rate, which are frequently used in our daily lives as well as in various activities associated with software development and software quality. These basic measures, while seemingly easy, are often misused. There are also numerous sophisticated statistical techniques and methodologies that can be employed in data analysis.

However, such topics are not within the scope of this discussion.

Ratio

A ratio results from dividing one quantity by another. The numerator and denominator are from two distinct populations and are mutually exclusive. For example, in demography, the sex ratio is defined as the number of males divided by the number of females, multiplied by 100. If the ratio is less than 100, there are more females than males; otherwise there are more males than females.

Ratios are also used in software metrics. The most often used, perhaps, is the ratio of the number of people in an independent test organization to the number of those in the development group. The test/development head-count ratio could range from 1:1 to 1:10 depending on the management approach to the software development process. For the large-ratio (e.g., 1:10) organizations, the development group usually is responsible for the complete development (including extensive development tests) of the product, and the test group conducts system-level testing in terms of customer environment verifications. For the small-ratio organizations, the independent group takes the major responsibility for testing (after debugging and code integration) and quality assurance.

Proportion

Proportion is different from ratio in that the numerator in a proportion is a part of the denominator:

p = a / (a + b)

Proportion also differs from ratio in that ratio is best used for two groups, whereas proportion is used for multiple categories (or populations) of one group. In other words, the denominator in the preceding formula can be more than just a + b. If

p = a / (a + b + c + d + e)

then we have

a/(a + b + c + d + e) + b/(a + b + c + d + e) + ... + e/(a + b + c + d + e) = 1

When the numerator and the denominator are integers and represent counts of certain events, then p is also referred to as a relative frequency. For example, the proportion of satisfied customers of the total customer set is given by the number of satisfied customers divided by the total number of customers surveyed.

The numerator and the denominator in a proportion need not be integers. They can be frequency counts as well as measurement units on a continuous scale (e.g., height in inches, weight in pounds). When the measurement unit is not integer, proportions are called fractions.

Percentage

A proportion or a fraction becomes a percentage when it is expressed in terms of per hundred units (the denominator is normalized to 100). The word percent means per hundred. A proportion p is therefore equal to 100p percent (100p%). Percentages are frequently used to report results, and as such are frequently misused. First, because percentages represent relative frequencies, it is important that enough contextual information be given, especially the total number of cases, so that the readers can interpret the information correctly. Jones (1992) observes that many reports and presentations in the software industry are careless in using percentages and ratios. He cites this example:

Requirements bugs were 15% of the total, design bugs were 25% of the total, coding bugs were 50% of the total, and other bugs made up 10% of the total.

Had the results been stated as follows, it would have been much more informative:

The project consists of 8 thousand lines of code (KLOC). During its development a total of 200 defects were detected and removed, giving a defect removal rate of 25 defects per KLOC. Of the 200 defects, requirements bugs constituted 15%, design bugs 25%, coding bugs 50%, and other bugs made up 10%.

A second important rule of thumb is that the total number of cases must be sufficiently large for percentages to be used. Percentages computed from a small total are not stable; they also convey an impression that a large number of cases are involved. Some writers recommend that the minimum number of cases for which percentages should be calculated is 50. We recommend that, depending on the number of categories, the minimum number be 30, the smallest sample size required for parametric statistics. If the number of cases is too small, then absolute numbers, instead of percentages, should be used. For instance:

Of the total 20 defects for the entire project of 2 KLOC, there were 3 requirements bugs, 5 design bugs, 10 coding bugs, and 2 others.

When results in percentages appear in table format, usually both the percentages and actual numbers are shown when there is only one variable. When there are more than two groups, such as the example in Table 3.1, it is better just to show the percentages and the total number of cases (N) for each group. With percentages and N known, one can always reconstruct the frequency distributions. The total of 100.0% should always be shown so that it is clear how the percentages are computed. In a two-way table, the direction in which the percentages are computed depends on the purpose of the comparison. For instance, the percentages in Table 3.1 are computed vertically (the total of each column is 100.0%), and the purpose is to compare the defect-type profile across projects (e.g., project B proportionally has more requirements defects than project A). In Table 3.2, the percentages are computed horizontally; the purpose here is to compare the distribution of defects across projects for each type of defect. The interpretations of the two tables differ. Therefore, it is important to carefully examine percentage tables to determine exactly how the percentages are calculated.

Table 3.1. Percentage Distributions

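A short sketch of the basic measures discussed above, using defect counts chosen to reproduce the percentages in Jones's example (the individual category counts below are assumed):

# Sketch of ratio, proportion, and percentage with assumed defect counts
# consistent with the 8 KLOC / 200 defect example above.
defects = {"requirements": 30, "design": 50, "coding": 100, "other": 20}
kloc = 8

total = sum(defects.values())      # 200 defects
removal_rate = total / kloc        # 25 defects per KLOC

# Proportions (each category is part of the denominator) and percentages.
for category, count in defects.items():
    proportion = count / total
    print(f"{category:12s}: {count:3d} defects  {100 * proportion:.0f}%")

# A ratio compares two mutually exclusive groups, e.g. coding vs. non-coding bugs.
coding_to_noncoding = defects["coding"] / (total - defects["coding"])
print(f"total = {total}, removal rate = {removal_rate} defects/KLOC, "
      f"coding:non-coding ratio = {coding_to_noncoding:.1f}")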

Ratios, proportions, and percentages are static summary measures. They provide a cross-sectional view of the phenomena of interest at a specific time. The concept of rate is associated with the dynamics (change) of the phenomena of interest; generally it can be defined as a measure of change in one quantity (y) per unit of another quantity (x) on which the former (y) depends. Usually the x variable is time. It is important that the time unit always be specified when describing a rate associated with time. For instance, in demography the crude birth rate (CBR) is defined as:

CBR = (B / P) x K

where B is the number of live births in a given calendar year, P is the mid-year population, and K is a constant, usually 1,000.

The concept of exposure to risk is also central to the definition of rate, which distinguishes rate from proportion. Simply stated, all elements or subjects in the denominator have to be at risk of becoming or producing the elements or subjects in the numerator. If we take a second look at the crude birth rate formula, we will note that the denominator is the mid-year population, and we know that not the entire population is subject to the risk of giving birth. Therefore, the operational definition of CBR is not strictly consistent with the concept of exposure to risk, which is why the measure is called a "crude" rate.


Six Sigma

The term six sigma represents a stringent level of quality. It is a specific defect rate: 3.4 defective parts per million (ppm). It was made known in the industry by Motorola, Inc., in the late 1980s when Motorola won the first Malcolm Baldrige National Quality Award (MBNQA). Six sigma has become an industry standard as an ultimate quality goal. Sigma (the Greek letter) is the statistical notation for standard deviation. As Figure 3.2 indicates, the areas under the curve of the normal distribution defined by standard deviations are constants in terms of percentages, regardless of the distribution parameters. The area under the curve as defined by plus and minus one standard deviation (sigma) from the mean is 68.26%. The area defined by plus/minus two standard deviations is 95.44%, and so forth. The area defined by plus/minus six sigma is 99.9999998%. The area outside the six sigma area is thus 100% - 99.9999998% = 0.0000002%.

Figure 3.3. Specification Limits, Centered Six Sigma, and Shifted (1.5 Sigma) Six Sigma

If we take the area within the six sigma limit as the percentage of defect-free parts and the area outside the limit as the percentage of defective parts, we find that six sigma is equal to 2 defectives per billion parts, or 0.002 defective parts per million. The interpretation of the defect rate as it relates to the normal distribution will be clearer if we include the specification limits in the discussion, as shown in the top panel of Figure 3.3. Given the specification limits (which were derived from customers' requirements), our


purpose is to produce parts or products within the limits. Parts or products outside the specification limits do not conform to requirements.

If we can reduce the variations in the production process so that the six sigma (standard deviations) variation of the production process is within the specification limits, then we will have six sigma quality level.

The six sigma value of 0.002 ppm is from the statistical normal distribution. It assumes that each execution of the production process will produce the exact distribution of parts or products centered with regard to the specification limits. In reality, however, process shifts and drifts always result from variations in process execution. The maximum process shift, as indicated by research (Harry, 1989), is 1.5 sigma. If we account for this 1.5-sigma shift in the production process, we get the value of 3.4 ppm. Such shifting is illustrated in the two lower panels of Figure 3.3. Given fixed specification limits, the distribution of the production process may shift to the left or to the right. When the shift is 1.5 sigma, the area outside the specification limit on one end is 3.4 ppm, and on the other it is nearly zero. The six sigma definition accounting for the 1.5-sigma shift (3.4 ppm), proposed and used by Motorola (Harry, 1989), has become the industry standard in terms of the six sigma level of quality (versus the normal distribution's six sigma of 0.002 ppm). Furthermore, when the production distribution shifts 1.5 sigma, the intersection points of the normal curve and the specification limits become 4.5 sigma at one end and 7.5 sigma at the other. Since, for all practical purposes, the area outside 7.5 sigma is negligible, essentially all of the 3.4 ppm comes from the 4.5 sigma end.
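These two defect rates can be reproduced directly from the normal distribution; the sketch below assumes scipy is available:

from scipy.stats import norm

# Centered process: area outside +/- 6 sigma.
ppm_centered = 2 * norm.sf(6) * 1e6
print(f"centered six sigma: {ppm_centered:.4f} defective ppm")            # ~0.002 ppm

# Process mean shifted by 1.5 sigma: the nearer limit (4.5 sigma) dominates.
ppm_shifted = (norm.sf(4.5) + norm.sf(7.5)) * 1e6
print(f"shifted (1.5 sigma) six sigma: {ppm_shifted:.1f} defective ppm")  # ~3.4 ppm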

3.4 Reliability and Validity

Recall that concepts and definitions have to be operationally defined before measurements can be taken. Assuming operational definitions are derived and measurements are taken, the logical question to ask is, how good are the operational metrics and the measurement data? Do they really accomplish their task of measuring the concept that we want to measure, and do so with good quality? Of the many criteria of measurement quality, reliability and validity are the two most important.

Reliability refers to the consistency of a number of measurements taken using the same measurement method on the same subject. If repeated measurements are highly consistent or even identical, then the measurement method or the operational definition has a high degree of reliability. If the variations among repeated measurements are large, then reliability is low. For example, if an operational definition of a body height measurement of children (e.g., between ages 3 and 12) includes specifications of the time of the day to take measurements, the specific scale to use, who takes the measurements (e.g., trained pediatric nurses), whether the measurements should be taken barefooted, and so on, it is likely that reliable data will be obtained. If the operational definition is very vague in terms of these considerations, the data reliability may be low. Measurements taken in the early morning may be greater than those taken in the late afternoon, because children's bodies tend to be more stretched after a good night's sleep and become somewhat compacted after a tiring day. Other factors that can contribute to the variations of the measurement data include different scales, trained or untrained personnel, with or without shoes on, and so on.

29

Figure 3.4. An Analogy to Validity and Reliability

From Practice of Social Research (Non-Info Trac Version), 9th edition, by E. Babbie. 2001 Thomson Learning. Reprinted with permission of Brooks/Cole, an imprint of the Wadsworth Group, a division of Thomson Learning (fax: 800-730-2215).

Note that there is some tension between validity and reliability. For the data to be reliable, the measurement must be specifically defined. In such an endeavor, the risk of being unable to represent the theoretical concept validly may be high. On the other hand, for the definition to have good validity, it may be quite difficult to define the measurements precisely. For example, the measurement of church attendance may be quite reliable because it is specific and observable. However, it may not be valid as a representation of the concept of religiousness. On the other hand, to derive valid measurements of religiousness is quite difficult. In the real world of measurements and metrics, it is not uncommon for a certain tradeoff or balance to be made between validity and reliability.

Validity and reliability issues come to the fore when we try to use metrics and measurements to represent abstract theoretical constructs. In traditional quality engineering, where measurements are frequently physical and usually do not involve abstract concepts, the counterparts of validity and reliability are termed accuracy and precision (Juran and Gryna, 1970). Much confusion surrounds these two terms despite their having distinctly different meanings. If we want a much higher degree of precision in measurement (e.g., accuracy up to three digits after the decimal point when measuring height), then our chance of getting all measurements accurate may be reduced. In contrast, if accuracy is required only at the level of the integer inch (less precise), then it is a lot easier to meet the accuracy requirement.

3.5 Measurement Errors

In this section we discuss validity and reliability in the context of measurement error. There are two types of measurement error: systematic and random.

Systematic measurement error is associated with validity; random error is associated with reliability. Let us revisit our example of the bathroom weight scale with an offset of 10 lb. Each time a person uses the scale, he will get a measurement that is 10 lb. more than his actual body weight, in addition to the slight variations among measurements. Therefore, the expected value of the measurements from the scale does not equal the true value, because of the systematic deviation of 10 lb. In simple formula, for this example:

M = T + 10 + e

In the general case:

M = T + s + e

where M is the observed/measured score, T is the true score, s is systematic error, and e is random error.

The presence of s (systematic error) makes the measurement invalid. Now let us assume the measurement is valid and the s term is not in the equation. We have the following:

M = T + e

The equation still states that any observed score is not equal to the true score, because of random disturbance: the random error e. These disturbances mean that on one measurement a person's score may be higher than his true score, and on another occasion the measurement may be lower than the true score. However, since the disturbances are random, the positive errors are just as likely to occur as the negative errors, and these errors are expected to cancel each other. In other words, the average of these


errors in the long run, or the expected value of e, is zero: E(e) = 0. Furthermore, from statistical theory about random error, we can also assume the following:
- The correlation between the true score and the error term is zero.
- There is no serial correlation between the true score and the error term.
- The correlation between errors on distinct measurements is zero.

From these assumptions, we find that the expected value of the observed scores is equal to the true score:

E(M) = E(T + e) = E(T) + E(e) = T

Reliability can be expressed as the proportion of the observed-score variance that is true-score variance, that is, 1 minus the ratio of error variance to observed variance. Therefore, the reliability of a metric varies between 0 and 1. In general, the larger the error variance relative to the variance of the observed score, the poorer the reliability. If all of the variance of the observed scores is a result of random errors, then the reliability is zero [1 - (1/1) = 0].

3.5.1 Assessing Reliability

Thus far we have discussed the concept and meaning of validity and reliability and their interpretation in the context of measurement errors. Validity is associated with systematic error, and the only way to eliminate systematic error is through better understanding of the concept we are trying to measure, and through deductive logic and reasoning to derive better definitions. Reliability is associated with random error. To reduce random error, we need good operational definitions and, based on them, good execution of measurement operations and data collection. In this section, we discuss how to assess the reliability of empirical measurements. There are several ways to assess the reliability of empirical measurements, including the test/retest method, the alternative-form method, the split-halves method, and the internal consistency method (Carmines and Zeller, 1979). Because our purpose is to illustrate how to use our understanding of reliability to interpret software metrics, rather than an in-depth statistical examination of the subject, we take the easiest method, the retest method. The retest method is simply taking a second measurement of the subjects some time after

the first measurement is taken and then computing the correlation between the first and the second measurements. For instance, to evaluate the reliability of a blood pressure machine, we would measure the blood pressures of a group of people and, after everyone has been measured, we would take another set of measurements. The second measurement could be taken one day later at the same time of day, or we could simply take two measurements at one time. Either way, each person will have two scores. For the sake of simplicity, let us confine ourselves to just one measurement, either the systolic or the diastolic score. We then calculate the correlation between the first and second scores, and the correlation coefficient is the reliability of the blood pressure machine. A schematic representation of the test/retest method for estimating reliability is shown in Figure 3.5.

Figure 3.5. Test/Retest Method for Estimating Reliability
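A minimal sketch of this retest idea, using two invented series of repeated measurements (for example, two record keepers reporting defect counts for the same inspections, as discussed below):

import numpy as np

# Hypothetical test/retest data: two repeated measurements of the same subjects.
series_1 = np.array([12, 7, 15, 9, 4, 11, 8, 14])   # first measurement per subject
series_2 = np.array([11, 7, 16, 8, 5, 11, 9, 13])   # second measurement, same subjects

reliability = np.corrcoef(series_1, series_2)[0, 1]  # correlation = reliability estimate
print(f"estimated reliability: {reliability:.2f}")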

The equations for the two tests can be represented as follows:

    M1 = T + e1
    M2 = T + e2

where T is the true score and e1 and e2 are the random error terms of the two measurements.

From the assumptions about the error terms, as we briefly stated before, it can be shown that the correlation between the two measurements is

    p(M1, M2) = Var(T)/Var(M) = p_m

where p_m is the reliability measure.

As an example in software metrics, let us assess the reliability of the reported number of defects found at design inspection. Assume that the inspection is formal; that is, an inspection meeting was held and the participants included the design owner, the inspection moderator, and the inspectors. At the meeting, each defect is acknowledged by the whole group, and the record keeping is done by the moderator. The test/retest method may involve two record keepers; at the end of the inspection, each turns in his recorded number of defects. If this method is applied to a series of inspections in a development organization, we will have two reports for each inspection over a sample of inspections. We then calculate the correlation between the two series of reported numbers, and this gives us an estimate of the reliability of the reported inspection defects.

3.5.2 Correction for Attenuation

One of the important uses of reliability assessment is to adjust or correct correlations for the unreliability that results from random errors in measurements. Correlation is perhaps one of the most important methods in software engineering and other disciplines for analyzing relationships between metrics. For us to substantiate or refute a hypothesis, we have to gather data for both the independent and the dependent variables and examine the correlation of the data. Let us revisit our hypothesis-testing example from the beginning of this chapter: the more effective the design reviews and the code inspections, as scored by the inspection team, the lower the defect rate encountered at the later phase of formal machine testing. As mentioned, we first need to operationally define the independent variable (inspection effectiveness) and the dependent variable (defect rate during formal machine testing). Then we gather data on a sample of components or projects and calculate the correlation between the independent variable and the dependent variable. However, because of random errors in the data, the resultant correlation is often lower than the true correlation. With an estimate of the reliability of the variables of interest, we can adjust the observed correlation to get a more accurate picture of the relationship under consideration. In software development, we have observed that a key reason some theoretically sound hypotheses are not supported by actual project data is that the operational definitions of the metrics are poor and there is too much noise in the data.


Given the observed correlation and the reliability estimates of the two variables, the formula for the correction for attenuation (Carmines and Zeller, 1979) is as follows:

    p(x_t, y_t) = p(x_i, y_i) / sqrt(p_xx' * p_yy')

where
- p(x_t, y_t) is the correlation corrected for attenuation, in other words, the estimated true correlation;
- p(x_i, y_i) is the observed correlation, calculated from the observed data;
- p_xx' is the estimated reliability of the X variable;
- p_yy' is the estimated reliability of the Y variable.

For example, if the observed correlation between two variables was 0.2 and the reliability estimates were 0.5 and 0.7, respectively, for X and Y, then the correlation corrected for attenuation would be

    0.2 / sqrt(0.5 * 0.7) = 0.2 / 0.59 = 0.34
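A minimal sketch of this adjustment (the function and variable names are illustrative) reproduces the worked example above:

    import math

    def correct_for_attenuation(r_observed, reliability_x, reliability_y):
        # Divide the observed correlation by the geometric mean of the
        # two reliability estimates to estimate the true correlation.
        return r_observed / math.sqrt(reliability_x * reliability_y)

    # Observed correlation 0.2, reliabilities 0.5 (X) and 0.7 (Y).
    print(round(correct_for_attenuation(0.2, 0.5, 0.7), 2))  # 0.34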

This means that the correlation between X and Y would be 0.34 if both were measured perfectly, without error.

3.6 Be Careful with Correlation

Correlation is probably the most widely used statistical method for assessing relationships among observational data (versus experimental data). However, caution must be exercised when using correlation; otherwise, the true relationship under investigation may be disguised or misrepresented. There are several points about correlation that one has to know before using it.

First, although there are special types of nonlinear correlation analysis available in the statistical literature, most of the time when one mentions correlation, it means linear correlation. Indeed, the most well-known correlation coefficient, the Pearson coefficient, assumes a linear relationship.


Therefore, if a correlation coefficient between two variables is weak, it simply means there is no linear relationship between the two variables; it does not mean there is no relationship of any kind. Let us look at the five types of relationship shown in Figure 3.6. Panel A represents a positive linear relationship and panel B a negative linear relationship. Panel C shows a curvilinear convex relationship, and panel D a concave relationship. In panel E, a cyclical relationship (such as the Fourier series representing frequency waves) is shown. Because correlation assumes linear relationships, when the Pearson correlation coefficients for the five relationships are calculated, the results accurately show that panels A and B have significant correlations. However, the correlation coefficients for the other three relationships will be very weak or will show no relationship at all. For this reason, it is highly recommended that when we use correlation we always look at the scattergrams. If a scattergram shows a particular type of nonlinear relationship, then we need to pursue analyses or coefficients other than linear correlation.

Figure 3.6. Five Types of Relationship Between Two Variables
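To see why a weak Pearson coefficient does not rule out a relationship, consider the following minimal sketch (hypothetical data): a perfect convex relationship, like the one in panel C, yields a Pearson coefficient of essentially zero.

    import numpy as np

    x = np.linspace(-3, 3, 61)  # values spread symmetrically around zero
    y = x ** 2                  # an exact curvilinear (convex) relationship

    r = np.corrcoef(x, y)[0, 1]
    print(round(r, 3))          # approximately 0, despite the exact relationship

A scattergram of x and y would reveal the relationship immediately, which is exactly why plotting the data is recommended.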

Second, if the data contain noise (due to unreliability in measurement) or if the range of the data points is large, the Pearson correlation coefficient will probably show no relationship. In such a situation, we recommend using a rank-order correlation method, such as Spearman's rank-order correlation.
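The following minimal sketch (hypothetical data, using scipy for both coefficients) illustrates the point: a couple of gross measurement errors add so much variance that the Pearson coefficient nearly vanishes, while the rank-order coefficient still reflects the underlying trend because the ordering of the remaining points is undisturbed.

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    # Underlying relationship: y increases steadily with x.
    x = np.arange(1.0, 31.0)
    y = x.copy()

    # Two gross random measurement errors inflate the variance of y.
    y[5] = 300.0
    y[20] = -250.0

    r_pearson, _ = pearsonr(x, y)    # close to zero: the linear correlation is wiped out
    r_spearman, _ = spearmanr(x, y)  # clearly positive: the ranks still show the trend
    print(round(r_pearson, 2), round(r_spearman, 2))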


The Pearson correlation (the correlation we usually refer to) requires interval scale data, whereas rank-order correlation requires only ordinal data. If there is too much noise in the interval data, the Pearson correlation coefficient thus calculated will be greatly attenuated. As discussed in the last section, if we know the reliability of the variables involved, we can adjust the resultant correlation. However, if we have no knowledge of the reliability of the variables, rank-order correlation will be more likely to detect the underlying relationship. Specifically, if the noise in the data does not affect the original ordering of the data points, then rank-order correlation will be more successful in representing the true relationship. Since both Pearson's correlation and Spearman's rank-order correlation are covered in basic statistics textbooks and are available in most statistical software packages, we need not go into the calculation details here.

Third, the method of linear correlation (the least-squares method) is very vulnerable to extreme values. If there are a few extreme outliers in the sample, the correlation coefficient may be seriously affected. For example, Figure 3.7 shows a moderately negative relationship between X and Y. However, because there are three extreme outliers at the northeast coordinates, the correlation coefficient becomes positive. This outlier susceptibility reinforces the point that when correlation is used, one should also look at the scatter diagram of the data.

3.7 Criteria for Causality

The isolation of cause and effect in controlled experiments is relatively easy. For example, a headache medicine was administered to a sample of subjects who were having headaches, and a placebo was administered to another group with headaches (a group that was statistically not different from the first). If, after a certain time of taking the medicine or the placebo, the headaches of the first group were reduced or had disappeared while headaches persisted in the second group, then the curing effect of the headache medicine is clear. For analysis with observational data, the task is much more difficult. Researchers (e.g., Babbie, 1986) have identified three criteria:

1. The first requirement in a causal relationship between two variables is that the cause precede the effect in time or as shown clearly in logic.

2. The second requirement in a causal relationship is that the two variables be empirically correlated with one another.

3. The third requirement for a causal relationship is that the observed empirical correlation between the two variables not be the result of a spurious relationship.

The first and second requirements simply state that, in addition to empirical correlation, the relationship has to be examined in terms of sequence of occurrence or deductive logic. Correlation is a statistical tool and can be misused without the guidance of a logic system. For instance, it is possible to correlate the outcome of a Super Bowl (National Football League versus American Football League) with interesting artifacts such as fashion (length of skirt, popular color, and so forth) and weather. However, logic tells us that coincidence or spurious association cannot substantiate causation. The third requirement is a difficult one. There are several types of spurious relationships, as Figure 3.8 shows, and it can be a formidable task to show that an observed correlation is not due to a spurious relationship. For this reason, it is much more difficult to prove causality with observational data than with experimental data. Nonetheless, examining for spurious relationships is necessary for scientific reasoning; as a result, findings from the data will be of higher quality.

Figure 3.8. Spurious Relationships

In Figure 3.8, case A is the typical spurious relationship between X and Y, in which X and Y have a common cause Z. Case B is the case of an intervening variable, in which the real cause of Y is an intervening variable Z rather than X. In the strict sense, X is not a direct cause of Y. However, since X


causes Z and Z in turn causes Y, one could claim causality if the sequence is not too indirect. Case C is similar to case A; however, instead of X and Y having a common cause as in case A, both X and Y are indicators (operational definitions) of the same concept C. It is logical that there is a correlation between them, but causality should not be inferred. An example of a spurious causal relationship due to two indicators measuring the same concept is Halstead's (1977) software science formula for program length:

    N = n1 * log2(n1) + n2 * log2(n2)

where
- N = estimated program length
- n1 = number of unique operators
- n2 = number of unique operands

Researchers have reported high correlations between actual program length (actual lines of code count) and the predicted length based on the formula, sometimes as high as 0.95 (Fitzsimmons and Love, 1978). However, as Card and Agresti (1987) show, both the formula and the actual program length are functions of n1 and n2, so the correlation exists by definition. In other words, both the formula and the actual lines of code count are operational measurements of the concept of program length. Furthermore, one has to conduct an actual count of n1 and n2 for the formula to work, and these counts are not available until the program is complete or almost complete. Therefore, the relationship is not a cause-and-effect relationship, and the predictive usefulness of the formula is limited.
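For completeness, here is a minimal sketch of the length estimator (the operator and operand counts below are hypothetical):

    import math

    def halstead_estimated_length(n1, n2):
        # N = n1*log2(n1) + n2*log2(n2), where n1 and n2 are the numbers
        # of unique operators and unique operands, respectively.
        return n1 * math.log2(n1) + n2 * math.log2(n2)

    # Hypothetical counts for a small module: 20 unique operators, 35 unique operands.
    print(round(halstead_estimated_length(20, 35)))  # about 266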

