Discovering Similarity Measures in Software by Using Mining Graphs

International Journal of Computer Information Systems, Vol. 3, No. 4, 2011

Dr. V Gayathri, Professor, Dept. of Computer Science, Kuppam Engineering College, Kuppam, Andhra Pradesh, gayhar11@gmail.com
Mr. Lokanath J C, Asst. Professor, Dept. of CSE, Kuppam Engineering College, Kuppam, AP, India, loka.jc@gmail.com

ABSTRACT

Clustering semantically related terms is crucial for many applications, such as document categorization and word sense disambiguation. However, automatically identifying semantically similar terms is challenging. We present a novel approach for automatically determining the degree of relatedness between terms to facilitate their subsequent clustering. Using the analogy of ensemble classifiers in machine learning, we combine multiple techniques, such as contextual similarity and semantic relatedness, to boost the accuracy of our computations. Other research suggests that neglected conditions may be even more important than is generally indicated in the literature.

I. LITERATURE SURVEY

Hundreds of bugs involving neglected conditions have been found in code for major operating systems such as Linux and OpenBSD. The semantic-graph differencing tool Dex has also been applied to samples of patches to the Apache HTTP server and the GCC C compiler: 38 percent of the Apache patches and 44 percent of the GCC patches involved inserting conditional selection statements, and 31 percent of the Apache patches and 32 percent of the GCC patches involved altering existing if conditions.

Many neglected conditions can be prevented by the use of requirement elicitation and analysis techniques that are intended to ensure completeness of a requirement specification, such as viewpoint analysis. However, many other neglected conditions are not traceable to shortcomings of requirements engineering, because they involve design or implementation issues that do not correspond directly to requirements. A familiar example is failing to check that a pointer or object reference is non-NULL before it is dereferenced.

In our approach, candidate rules and possible rule violations are identified automatically. As with Engler et al.'s approach, candidate rules are identified by their frequency of occurrence and must be confirmed manually by developers.
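The idea of identifying candidate rules by their frequency of occurrence can be illustrated with a deliberately simplified sketch. It is not the mining algorithm of this paper or of Engler et al.: it assumes that each call site has already been reduced to a hypothetical tuple (function, location, result checked?), and the 0.8 threshold in `min_support` is an illustrative choice.

```python
from collections import defaultdict


def candidate_rules(call_sites, min_support=0.8):
    """Return {function: (support, unchecked_locations)} for every function
    whose result is checked at a fraction >= min_support of its call sites."""
    total = defaultdict(int)
    checked = defaultdict(int)
    unchecked = defaultdict(list)
    for function, location, is_checked in call_sites:
        total[function] += 1
        if is_checked:
            checked[function] += 1
        else:
            unchecked[function].append(location)

    rules = {}
    for function, count in total.items():
        support = checked[function] / count
        if support >= min_support:
            # Frequent pattern: report it as a candidate rule, together with
            # the call sites that deviate from it (possible violations).
            rules[function] = (support, unchecked[function])
    return rules


# Hypothetical observations, not data from the paper.
observations = [
    ("malloc", "a.c:10", True),
    ("malloc", "a.c:42", True),
    ("malloc", "b.c:7", True),
    ("malloc", "b.c:88", True),
    ("malloc", "c.c:5", False),
    ("puts", "a.c:11", False),
    ("puts", "b.c:9", True),
]
rules = candidate_rules(observations)
```

In this toy data only `malloc` is checked often enough to become a candidate rule, and its single unchecked call site is what a developer would be asked to confirm or reject manually.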
Our approach does consider semantically relevant constraints between elements of potential rules, in the form of enhanced program dependences. This paper extends the work in [5] by employing enhanced program dependence graphs (EPDGs), presenting a new heuristic maximal frequent subgraph algorithm, and evaluating our approach on four open source projects not considered in [5].

PRELIMINARY STUDY

A program dependence graph (PDG) is a labeled directed graph that models dependences between the statements of a program or procedure. Two types of dependences are represented. A statement s1 is data dependent on a statement s2 if there is a variable x and a control flow path s2 P s1 from s2 to s1 such that x is defined at s2, used at s1, and not redefined along the subpath P. A statement s1 is control dependent on s2 if s2 is a branch predicate that directly controls whether or not s1 is executed. Because program dependence graphs capture the essential ordering constraints between program elements,
programming rules relating elements that need not be adjacent to one another in a program, and need not appear in the same textual order wherever the rule occurs, can be represented as subgraphs or, as we shall see, minors of program dependence graphs.

Neglected conditions are an important but difficult-to-find class of software defects. This paper presents a novel approach for revealing neglected conditions that integrates static program analysis and advanced data mining techniques to discover implicit conditional rules in a code base and to discover rule violations that indicate neglected conditions. The approach requires the user to indicate only minimal constraints on the context of the rules to be sought. It builds upon the idea that vital clues about neglected conditions are often distributed throughout a project code base (or even multiple code bases). Our approach is intended to discover a wide variety of programming rules and violations of them without requiring developers to supply specific rule templates or checkers. Instead, developers indicate minimal constraints on the kind of rule violations they wish to find (e.g., any neglected condition or any neglected condition pertaining ...).

In our work, we have employed the system dependence graph (SDG) generated by the CodeSurfer static analysis tool. The SDG extends the program dependence graph representation for monolithic programs to incorporate collections of procedures. Each procedure is represented by a PDG, and PDGs are augmented with special edges linking callers and callees. SDG edges can be classified into two overlapping sets of categories: 1) data dependence and control dependence edges and 2) interprocedural edges and intraprocedural edges. We use the code shown in Fig.
1, from the openssl project, to illustrate informally how a programming rule associated with the function UI_process is mined and how a violation of the rule is detected by our approach. The first step of our approach is to create a dependence sphere with limited radius r for each call site node of UI_process, with the call site node as its center. The objective of this step is to extract the essential elements of rule instances. Initially, the dependence spheres contain only control and data dependences, but they are enhanced by adding SDDEs. SDDEs allow some semantic relationships to be modeled more precisely and provide some benefits of interprocedural analysis without incurring its cost. Each sphere is then reduced by removing those nodes whose occurrences in the set of spheres are infrequent.

In our approach, a programming rule corresponds to a frequent graph minor of an SDG. To mine a frequent minor, instead of a frequent subgraph, our HMFSM algorithm is applied to NTCs of the reduced spheres (with 80 percent support) to find a maximal frequent subgraph. A discovered frequent graph minor is a candidate rule.

The PDGs are enhanced by adding directed edges, called shared data dependence edges (SDDEs), between pairs of program elements that use the same variable definition and are connected by a control flow path. The resulting graphs are called enhanced PDGs (EPDGs).
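The first mining step described above (building a dependence sphere of radius r around each call site, then dropping nodes that occur infrequently across the spheres) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: edge direction is ignored when measuring distance from the center, a node counts as infrequent when its label appears in fewer than a `min_support` fraction of the spheres, and the node names and labels are hypothetical.

```python
from collections import Counter, deque


def dependence_sphere(edges, center, radius):
    """Nodes within `radius` dependence edges of `center` (direction ignored)."""
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)

    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand beyond the sphere boundary
        for nxt in neighbours.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return set(distance)


def reduce_spheres(spheres, label_of, min_support):
    """Drop nodes whose label occurs in fewer than min_support of the spheres."""
    counts = Counter()
    for sphere in spheres:
        counts.update({label_of[node] for node in sphere})
    threshold = min_support * len(spheres)
    frequent = {label for label, count in counts.items() if count >= threshold}
    return [{n for n in sphere if label_of[n] in frequent} for sphere in spheres]


# Hypothetical dependence edges around two call sites, c1 and c2.
edges = [("p1", "c1"), ("c1", "u1"), ("u1", "x1"),
         ("p2", "c2"), ("c2", "u2"), ("c2", "log2")]
label_of = {"p1": "if ok", "p2": "if ok",
            "c1": "call UI_process", "c2": "call UI_process",
            "u1": "use result", "u2": "use result",
            "x1": "other", "log2": "call log"}

spheres = [dependence_sphere(edges, c, 1) for c in ("c1", "c2")]
reduced = reduce_spheres(spheres, label_of, min_support=1.0)
```

With radius 1 the sphere around `c1` excludes `x1`, and the reduction removes `log2` because its label appears in only one of the two spheres. A frequent subgraph miner would then run on the reduced spheres.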
October Issue
ISSN 2229 5208

Because EPDG minors represent transitive (direct and indirect) intraprocedural dependences between program statements, they capture essential constraints between rule elements and exclude spurious ones. SDGs contain various types of nodes, such as those representing call sites, statements, control points, actual-in/out parameters, formal-in/out parameters, switch statements, and so forth. Expressions associated with PDG nodes are represented by abstract syntax trees (ASTs). Our experiments indicate that the CodeSurfer SDG representation, augmented with additional information derived from control-flow graphs, is quite adequate for representing a wide variety of rules, including conditional rules.

REPRESENTING CONDITIONAL RULES

Our decision to represent programs and conditional programming rules in terms of dependence graphs is based on our intuition that a wide variety of rules can be described in terms of data and control dependence relationships among a set of nodes with particular attributes (e.g., parameter types and expression ASTs). Control dependences are obviously critical elements of conditional rules, and the absence of an expected control dependence is a prime indicator of a neglected condition. It is important to note, however, that a violation of a conditional rule may involve a missing data dependence instead of a missing control dependence, as when a critical variable is erroneously omitted from a control predicate.
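The observation that a violation shows up as a missing control or data dependence can be sketched with a simple set comparison. This is an assumption-laden illustration, not the heuristic graph matching algorithm of the paper: dependences are encoded as hypothetical (kind, source label, target label) tuples, and nodes are matched by label only, which sidesteps the real graph minor matching problem.

```python
def check_instance(rule_edges, instance_edges):
    """Compare a candidate code fragment against a rule, edge by edge.

    Returns (verdict, missing), where `missing` lists the rule's dependence
    edges that the fragment lacks, sorted for stable output.
    """
    rule = set(rule_edges)
    present = rule & set(instance_edges)
    missing = sorted(rule - present)
    if not missing:
        return "instance", missing            # every rule edge is present
    if present:
        return "possible violation", missing  # partial match: inspect manually
    return "unrelated", missing               # nothing in common with the rule


# Hypothetical rule: a pointer must be NULL-checked before it is used.
rule = [
    ("control", "if (p != NULL)", "call use(p)"),
    ("data", "p = lookup()", "if (p != NULL)"),
    ("data", "p = lookup()", "call use(p)"),
]
good = rule + [("data", "call use(p)", "return")]
bad = [("data", "p = lookup()", "call use(p)")]  # the check is absent

good_verdict = check_instance(rule, good)
bad_verdict = check_instance(rule, bad)
```

For the fragment `bad`, the missing edges are exactly the control dependence on the predicate and the data dependence feeding that predicate, which is the pattern described above as a prime indicator of a neglected condition.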
Dependence graphs abstract away many inessential constraints on statement ordering that are implicit in other program representations such as ASTs.

    if (parent_err) {                      // apr_file_t *parent_err
        rv = apr_file_dup(&attr->parent_out, parent_err, attr->pool);
    }

    if (parent_err) {                      // apr_file_t *parent_err
        if (attr->parent_err == NULL)      // apr_procattr_t *attr
            rv = apr_file_dup(&attr->parent_err, parent_err, attr->pool);
        else
            rv = apr_file_dup2(attr->parent_err, parent_err, attr->pool);
    }

ADDITIONAL CONSTRAINTS

In some cases, a user may wish to search specifically for conditional rules satisfying certain constraints on nodes or edges, e.g., rules in which a function's return value is checked. When such a constraint involves the presence of a single node with a given type or label, it can be enforced by modifying the HMFSM algorithm to discard subgraphs that do not contain such a node, which also reduces the computation time.

To demonstrate the accuracy and robustness of our method, evaluation was performed on completely unstructured, noisy free text downloaded from the Internet, as opposed to most previous works, where evaluation involved highly domain-specific and (semi-)structured corpora. Our evaluation corpus consisted of mobile phone descriptions from vendors' sites and customers' opinions from online forums. The free-text nature of the corpus presented additional challenges, such as identifying syntactic dependencies between words in a term's context to form the topic signatures and dealing with a significant noise level. However, when security questions arise, the confidence of these entrepreneurs decreases. When users adopt cloud computing, they let others store their data, so business data or users' private information may be lost.
Name                                                  Number   Percent
Missing Boolean Expression                            28       26%
Missing branch                                        13       26%
Missing Conditional Statements                        5        5%
Both Conditional and Block needing control missing    86       79%

(Classification of Neglected Conditions in Firefox)

GENERATING TARGET TERMS

The downloaded documents are pre-processed (parsed and cleansed of stop-words and noise) to produce a corpus of 500 documents, each with an average of 15 sentences. Relevant terms from the corpus, whose TF-IDF [8] weights exceeded an experimental threshold, were selected as target terms to be subsequently clustered. We considered only terms with at most 3 constituent words.

RELATED WORK

To evaluate our approach, we conducted an empirical study in which it was used to find conditional rules and rule violations in code from four open source projects. In this study, the rules and violations discovered by our approach were checked manually to see if they were valid. The evaluation focused on five principal questions:

1. Is the approach able to discover a high proportion of the conditional rules actually present in a code base automatically?
2. Are the discovered rules of interest?
3. Is the addition of SDDEs to PDGs useful for finding programming rules?
4. Does the algorithm for detecting rule violations effectively distinguish between rule instances and non-instances?
5. Do the rule violations reported by the heuristic graph matching algorithm actually involve neglected conditions?

Combining similarity measures

Nenadi et al. [5] presented a methodology incorporating contextual, lexical and syntactic similarity measures. Contextual similarity was defined as the ratio of common to distinct context patterns. Lexically similar terms were identified based on their common head nouns. Syntactically similar terms were those co-occurring in certain lexico-syntactic patterns. However, these patterns are heavily corpus-dependent and are not reliable for measuring similarity.
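The similarity measures summarized above can be made concrete with a small sketch. It rests on several assumptions that are not stated in the source: the "ratio of common to distinct context patterns" is read as shared patterns divided by all distinct patterns of the two terms, the head noun of a term is taken to be its last word, and the measures are combined by a weighted average whose weights are purely illustrative. The terms and patterns are hypothetical.

```python
def contextual_similarity(patterns_a, patterns_b):
    """Shared context patterns divided by all distinct patterns (assumed reading)."""
    a, b = set(patterns_a), set(patterns_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexical_similarity(term_a, term_b):
    """1.0 when two non-empty terms share a head noun (assumed: the last word)."""
    head_a = term_a.split()[-1].lower()
    head_b = term_b.split()[-1].lower()
    return 1.0 if head_a == head_b else 0.0


def combined_similarity(scores, weights):
    """Weighted average of the individual measures named in `weights`."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


# Hypothetical context patterns for two mobile phone terms.
battery = {"long N", "N lasts", "improve N"}
standby = {"long N", "improve N", "short N"}

contextual = contextual_similarity(battery, standby)   # 2 shared of 4 distinct
lexical = lexical_similarity("standby time", "talk time")
combined = combined_similarity(
    {"contextual": contextual, "lexical": lexical},
    {"contextual": 3, "lexical": 1},
)
```

Keeping each measure as a separate function and combining them in one place makes it easy to change the weights or add a further measure without touching the others.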
It was also found that none of the similarity measures was very reliable on its own, and that they had to be combined for improved performance.

EFFICIENCY OF RULE MINING

On a Windows system with one 1.74 GHz CPU and 1 Gbyte of RAM, the average times for mining rules from the graph data set consisting of the dependence spheres generated for a candidate node were about 63, 105, 174, and 56 seconds for openssl, make, procmail,
and amaya, respectively, as shown in Table 2. This indicates that the efficiency of our approach was not greatly affected by the sizes of the SDGs.

CONCLUSION

The empirical results presented in Section 5 suggest that our approach to rule discovery is effective and reasonably efficient for discovering conditional rules involving preconditions and postconditions of function calls and for discovering violations of those rules. The results also show that our approach is able to detect many rules involving function call ordering and related violations. They further indicate that augmenting program dependence graphs with intraprocedural SDDEs enables the detection of a significant number of additional rules that actually involve interprocedural dependences.

We plan to extend our approach to consider interprocedural, as well as intraprocedural, dependences. To address node labeling issues, we plan to investigate 1) use of incident data dependences and SDDEs to distinguish control points, 2) use of type inference to distinguish semantically distinct expressions having the same ASTs, and 3) clustering of semantically similar ASTs. Finally, we plan to broaden the range of targeted rules, e.g., by choosing other types of candidate nodes such as control points, expressions, and formal-in parameters. Further research can also be directed towards automatically determining the weights to be assigned to contextual similarity and to semantic relatedness based on the corpus characteristics.

REFERENCES

[1] Apache HTTP Server Project, Apache.org, www.apache.org, 2008.
[2] M. Acharya, T. Xie, J. Pei, and J. Xu, "Mining API Patterns as Partial Orders from Source Code: From Usage Scenarios to Specifications," Proc. Sixth Joint Meeting of the European Software Eng. Conf. and the ACM SIGSOFT Symp. Foundations of Software Eng., pp. 25-34, 2007.
[3] T.A. Budd, R.A. DeMillo, R.J. Lipton, and F.G. Sayward, "Theoretical and Empirical Studies on Using Program Mutation to Test the Functional Correctness of Programs," Proc. Seventh Ann. ACM Symp. Principles of Programming Languages, pp. 220-233, 1980.
[4] D. Burdick, M. Calimlim, and J. Gehrke, "MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases," Proc. 17th Int'l Conf. Data Eng., 2001.
[5] R.Y. Chang, A. Podgurski, and J. Yang, "Finding What's Not There: A New Approach to Revealing Neglected Conditions in Software," Proc. ACM Int'l Symp. Software Testing and Analysis, pp. 163-173, 2007.
[6] B. Chelf, D. Engler, and S. Hallem, "How to Write System-Specific, Static Checkers in Metal," Proc. ACM Workshop Program Analysis for Software Tools and Eng., pp. 51-56, 2002.
[7] H. Chockler, O. Kupferman, and M. Vardi, "Coverage Metrics for Formal Verification," Lecture Notes in Computer Science, vol. 2860, pp. 111-125, 2003.
[8] A. Dunsmore, M. Roper, and M. Wood, "Practical Code Inspection Techniques for Object-Oriented Systems: An Experimental Comparison," IEEE Software, vol. 20, no. 4, pp. 21-29, July/Aug. 2003.
[9] D. Yarowsky, "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora," Proc. 14th Conf. Computational Linguistics, Nantes, France, 1992.
[10] D. Engler, D.Y. Chen, S. Hallem, A. Chou, and B. Chelf, "Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code," Proc. 18th ACM Symp. Operating Systems Principles, pp. 57-72, 2001.
[11] D. Engler, "Meta-Level Compilation," metacomp.stanford.edu, 2008.