discussions, stats, and author profiles for this publication at: http://www.researchgate.net/publication/2369031
2 AUTHORS:
Timothy Lethbridge
Nicolas Anquetil
University of Ottawa
Abstract
1 Introduction
The goal of our research is to develop tools to
help software engineers (SEs) more effectively
maintain software. By analyzing the work of
SEs, we have come to the conclusion that they
spend a considerable portion of their time exploring source code, using a process that we call just-in-time program comprehension. As a result of
our analysis, we have developed a set of requirements for tools that can support the SEs.
In section 2 we present our analysis and the
requirements; we also explain why other tools do
not meet the requirements. In section 3, we present our design for a system that meets the requirements. Since there is not space for
exhaustive design details, we focus on key issues
2
We use the term semantically significant so as
not to require the tool to
retrieve hits on arbitrary sequences of characters in
the source code text. For example, the character
sequence "e u" occurs near the beginning of this
footnote, but we wouldn't expect an information
retrieval system to index such sequences; it would
only have to retrieve hits on words. In software the
semantically significant names are filenames,
routine names, variable names etc. Semantically
significant associations include such things as
routine calls and file inclusion.
NF6 Be well integrated and incorporate all frequently-used facilities and advantages of
tools that SEs already commonly use.
It is important for acceptance of a tool that it
neither represent a step backwards, nor require work-arounds such as switching to alternative tools for frequent tasks. In a survey
of 26 SEs [5], the most frequent complaint
about tools (23%) was that they are not integrated and/or are incompatible with each
other; the second most common complaint
was missing features (15%). In section 2.4
we discuss some tools the SEs already use
for the program comprehension task.
It works with arbitrary strings in text, not semantic items (requirement F1) such as routines, variables etc.
SEs must spend considerable time performing repeated greps to trace relationships
(requirement F2); and grep does not help them
organize the presentation of these relationships.
[Figure: architecture diagram. Recoverable labels: Source Code File, Parsers, Write-API data, DBMS, Database.]
TA++ is generated by all programming language parsers. It is directly piped into a special
TA++ parser which builds the database (although
storage of TA++ in files or transmission over
network connections is also anticipated). Our
system also has a TA++ generator which permits
a database to be converted back to TA++. This is
useful since TA++ is far more space-consuming
than the database; thus we don't want to keep TA++ files unnecessarily.
Data in TA++ format, as well as data passed
through the DBMS write-API, are both merely
lists of facts about the software as extracted by
parsers.
Although the read-API provides basic query facilities, an API doesn't easily lend itself to composite queries, e.g. "Tell me the variables
common to all files that contain calls to both
routine x and routine y." For this, a query language is needed, as illustrated in figure 4. Discussion of this language is in section 5.4.
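To make the difficulty concrete, the following minimal sketch answers the example query by hand over a toy in-memory fact store. The dictionaries `calls_in_file` and `variables_in_file` and the function name are hypothetical stand-ins invented for illustration; they are not part of the read-API described here.

```python
# Toy facts, invented for illustration: the routines each file calls and
# the variables each file uses. A real client would obtain these through
# repeated read-API calls.
calls_in_file = {
    "a.pas": {"x", "y"},
    "b.pas": {"x", "y", "z"},
    "c.pas": {"x"},
}
variables_in_file = {
    "a.pas": {"count", "total", "buf"},
    "b.pas": {"count", "buf", "flag"},
    "c.pas": {"count", "other"},
}


def variables_common_to_callers(routine_a, routine_b):
    """Return the variables used in every file that calls both routines."""
    files = [
        name for name, called in calls_in_file.items()
        if routine_a in called and routine_b in called
    ]
    if not files:
        return set()
    common = set(variables_in_file[files[0]])
    for name in files[1:]:
        common &= variables_in_file[name]
    return common


result = variables_common_to_callers("x", "y")
print(sorted(result))
```

Even in this small case the client has to filter, iterate and intersect explicitly, which is the kind of bookkeeping a query language can hide.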
[Architecture diagrams. Recoverable labels: Parsers, Write-API data, DBMS, Read-API data, Clients (User Interfaces and other analysis tools), Query Engine, Query response, Interchange format (TA++), TA++ Files, TA++ Parser, TA++ Generator, 3rd party tools that generate TA++.]
However, although the database contains comprehensive information about source code objects
and their relationships, there are certain data
whose storage in the database would not be appropriate, but which is still needed to satisfy
decisions about the actual implementation paradigm will be deferred to section 4.10
[Architecture diagram. Recoverable labels: Source Code File, Parsers, Auxiliary Analysis Tools, Interchange format (TA++), TA++ Files, TA++ Parser, TA++ Generator, 3rd party tools that produce TA++, Write-API data, DBMS, Database, Read-API data, Query Engine, Query response, Clients (User Interfaces and other analysis tools).]
a) Locations of definitions
b) Locations where defined objects are used
c) File inclusion (really a special case of b)
d) Routine calls (another special case of b)
Figure 7 presents a simplistic object model relating source files and routines. We can improve
this by recognizing that routines are just smaller
units of source. We thus derive the recursive dia-
[Diagrams for figures 7 to 9. Recoverable labels: classes SourceFile, Routine, SourceUnit, SourceWithinFile, RoutineSource, Definition, ReferenceExistence, SoftwareObject (attribute objectId); associations contains, calls; annotation (Vector).]
Figure 9: The top of the inheritance hierarchy of software objects. The vector and
the objectId are discussed later.
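The recursive containment of source units can be sketched as follows. This is only an illustration under assumptions: the class names come from the diagram labels, but the exact inheritance arrows (in particular RoutineSource as a kind of SourceWithinFile), the `label` attribute and the `walk` helper are our guesses and are not taken from the design itself.

```python
import itertools

# Hypothetical id generator; how objectIds are really allocated is not
# described at this point in the text.
_next_id = itertools.count(1)


class SoftwareObject:
    """Root of the hierarchy: every software object has an objectId."""

    def __init__(self):
        self.object_id = next(_next_id)


class SourceUnit(SoftwareObject):
    """A unit of source text that may contain smaller units."""

    def __init__(self, label):
        super().__init__()
        self.label = label   # assumed attribute, for display only
        self.contains = []   # nested units, in source order

    def add(self, unit):
        self.contains.append(unit)
        return unit


class SourceFile(SourceUnit):
    """Top-level unit: a file is not contained in any other unit."""


class SourceWithinFile(SourceUnit):
    """Any unit nested inside a file or inside another nested unit."""


class RoutineSource(SourceWithinFile):
    """The source text of one routine."""


def walk(unit, depth=0):
    """Yield (depth, unit) pairs for a unit and everything it contains."""
    yield depth, unit
    for child in unit.contains:
        yield from walk(child, depth + 1)


source_file = SourceFile("flistaud.pas")
routine = source_file.add(RoutineSource("init"))
outline = [(depth, unit.label) for depth, unit in walk(source_file)]
print(outline)
```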
[Class diagram. Recoverable labels: classes SourceUnit, SourceFile, SourceWithinFile, RoutineSource, ReferenceExistence, RoutineCallExistence, FileInclusionExistence; associations includes, calls, contains; annotation allObjects.]
3
In the more popular design pattern book by
Gamma et al. [1], this is really an inverse Composite,
because the root of the hierarchy is the recursion-limiting case, instead of the leaves.
[Figure 12 diagrams. Recoverable class diagram labels: SourceUnit, refersToData, DataUseExistence, FieldUseExistence, VariableUseExistence, nextComponent, component. Instance diagram labels: f1, f2, a, b, c, d, a.b, b.c, a.b.c, b.d.]
Figure 12: One approach to storing multipart names that permits searches for particular combinations of components. The
class diagram is at the left, and an example instance diagram is at the right. This
example requires 15 objects and 16 links.
[Class diagram labels: (HashTable), Name (attribute string), SourceFile.]
It provides sufficient information for the system to narrow the set of files for a fully qualified search, such that the search can be quickly
done by scanning files at run-time for the particular pattern.
Fully qualified searches will be relatively rarely
requested, compared to searches for variables
and fields.
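The two-step strategy described above (narrow the candidate files using stored information, then scan only those files for the full pattern) might look roughly like this. The index layout, the in-memory `sources` dictionary standing in for files on disk, and both function names are assumptions made for this sketch.

```python
import re


def candidate_files(index, components):
    """Files in which every component name is known to occur."""
    sets = [index.get(component, set()) for component in components]
    return set.intersection(*sets) if sets else set()


def fully_qualified_search(index, sources, qualified_name):
    """Scan only the candidate files for the full dotted name.

    Returns a dict mapping file name to the 1-based line numbers that match.
    """
    components = qualified_name.split(".")
    pattern = re.compile(
        r"(?<![\w.])"                                   # not inside a longer name
        + r"\s*\.\s*".join(re.escape(c) for c in components)
        + r"(?!\w)"
    )
    hits = {}
    for path in sorted(candidate_files(index, components)):
        lines = [
            number
            for number, line in enumerate(sources[path].splitlines(), 1)
            if pattern.search(line)
        ]
        if lines:
            hits[path] = lines
    return hits


# Hypothetical data: component name -> files that use it, plus file contents.
index = {"a": {"f1", "f2"}, "b": {"f1", "f2"}, "c": {"f1"}, "d": {"f2"}}
sources = {
    "f1": "x := a.b.c;\ny := b;\n",
    "f2": "z := a.b;\nw := b.d;\n",
}
print(fully_qualified_search(index, sources, "a.b.c"))
```

Only f1 is scanned for `a.b.c` here, because the index already shows that f2 never uses `c`.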
[Class diagram labels: ReferenceExistence, RoutineSource.]
Figure 11: Separating names from objects. The hash table is explained in section
[Class diagrams. Recoverable labels: classes SourceUnit, DataUseExistence, FieldUseExistence, VariableUseExistence, Name, TypedDefinition, TypeUseExistence, StandaloneDefinition; associations refersToData, usedWith, usedAsTypeOf; attributes startChar, endChar, startLine, endLine; instance labels f1, f2, a, b, c, d.]
Reconstruct each file, building as many alternative top-level definitions as needed to account for all cases of conditional compilation.
Each top-level definition variant is then given
a modified name that reflects the conditions required for it to be compiled.
[Class diagram. Recoverable labels: classes Field, RoutineSource, SourceUnit, EnumerationConst, TypeDef, DatumDef (attribute isConst), EnumeratedTypeDef, RecordTypeDef; associations declaredAs, formalArgsIn, defines, formalArgs.]
A composite file, containing all top-level definition variants, is then compiled in the normal
way.
4.8 Macros
Macros are similar to conditional compilation in
the sense that they can, at their worst, result in
arbitrary syntax (for example, in cases where
punctuation is included in the macro). Of course
it is very bad programming practice to create
trick macros that break syntax; however tools
like our system are all the more needed when
code is very obscure.
The following are approaches to this
problem:
[Class diagram. Recoverable labels: classes SourceFile, SourceUnit, RoutineSource, FileInclusionExistence, DataUseExistence, FieldUseExistence, VariableUseExistence, RoutineCallExistence, ManifestConstExistence, TypedDeclaration, TypeUseExistence; associations includes, potentiallyIncludedIn, foundInSource, refersToData, potentiallyCalledBy, calls, usedWith, usedAsTypeOf, usedAsReturnTypeOf.]
One function to create each concrete class. Arguments are the attributes of the class. Return
value is the new object.
Generic functions to add links of refers/referred-by and defines/defined-by associations. A single call to one of these functions creates both
directions of the association.
Four functions that provide all that is necessary to query the associations. The four functions are:
cdbGetObjectListThatReferToObject()
cdbGetObjectListReferedByObject()
cdbGetObjectListThatDefineObject()
cdbGetObjectListDefinedByObject()
The first two traverse the refers/referred-by associations in either direction, while the second
two traverse the defined-in/defined-by associations
in either direction (see footnote 4).
They all return an iterator and take as arguments 1) an object that is the starting point for
links to be traversed, and 2) a class that constrains the links to be traversed. For example, one
could ask for all the refers-to links from a
RoutineSource to class ReferenceExistence in
general, in which case one would get a list containing routines, variables, types etc. On the
other hand one could ask for the refers-to links
from a RoutineSource to a more specific class
such as RoutineCallExistence, in which case one
would be asking about one level of the call hierarchy, and would get a subset of the results
of the former query.
The above design allows for a small number
of functions to efficiently handle a large number of types of query.
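The idea of a few generic traversal functions constrained by a class argument can be illustrated with the sketch below. It is not the cdb API itself, whose signatures are not given here: `CodeDatabase`, `add_reference`, `objects_referred_by` and `objects_that_refer_to` are hypothetical Python names, and only the refers/referred-by pair is shown (the defines/defined-by pair would be analogous).

```python
class SoftwareObject:
    def __init__(self, name):
        self.name = name


class ReferenceExistence(SoftwareObject):
    """Any reference found in source (call, data use, ...)."""


class RoutineCallExistence(ReferenceExistence):
    """A reference that is a routine call."""


class DataUseExistence(ReferenceExistence):
    """A reference that is a use of a variable or field."""


class RoutineSource(SoftwareObject):
    """The source text of a routine."""


class CodeDatabase:
    """Toy in-memory store for one bidirectional association."""

    def __init__(self):
        self._refers_to = {}
        self._referred_by = {}

    def add_reference(self, source, target):
        """A single call records both directions of the association."""
        self._refers_to.setdefault(source, []).append(target)
        self._referred_by.setdefault(target, []).append(source)

    def objects_referred_by(self, start, cls=SoftwareObject):
        """Iterate over objects that start refers to, restricted to cls."""
        return (o for o in self._refers_to.get(start, ()) if isinstance(o, cls))

    def objects_that_refer_to(self, start, cls=SoftwareObject):
        """Iterate over objects that refer to start, restricted to cls."""
        return (o for o in self._referred_by.get(start, ()) if isinstance(o, cls))


db = CodeDatabase()
main = RoutineSource("main")
call = RoutineCallExistence("init")
use = DataUseExistence("count")
db.add_reference(main, call)
db.add_reference(main, use)

all_refs = [o.name for o in db.objects_referred_by(main, ReferenceExistence)]
calls_only = [o.name for o in db.objects_referred_by(main, RoutineCallExistence)]
referrers = [o.name for o in db.objects_that_refer_to(call)]
print(all_refs, calls_only, referrers)
```

Passing a more specific class narrows the result to a subset of the general query, which is how one pair of functions can serve many kinds of query.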
4
Although there are various different association names, they can be divided into these two categories.
5.3 TA++
As discussed in section 3.2, we designed an interchange format to allow various different program
comprehension tools to work together (e.g. sharing parsers and databases).
The TA language [2] is a generic syntax for
representing the nodes and arcs. Here are some
key facts about it:
Ensuring that all the information about attributes of objects has been obtained before the object is created.
Ensuring an object is defined before instances
of relations are made that need it.
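One way to respect both ordering constraints is to read all facts first and only then create objects and links. The sketch below does this for the small subset of TA++ syntax that appears in this section's example (FACT TUPLE and FACT ATTRIBUTE sections with integer attribute values). The function names, the dictionary representation of objects and the integer-only assumption are ours; the sketch is not the system's actual TA++ parser.

```python
def parse_ta(text):
    """Collect instances, relation tuples and attributes without creating objects."""
    instances, relations, attributes = {}, [], {}
    section, current = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("FACT TUPLE"):
            section = "tuple"
        elif line.startswith("FACT ATTRIBUTE"):
            section = "attribute"
        elif section == "tuple":
            relation, first, second = line.split()
            if relation == "$INSTANCE":
                instances[first] = second            # object name -> class name
            else:
                relations.append((relation, first, second))
        elif section == "attribute":
            closes = line.endswith("}")
            line = line.rstrip("}").strip()
            if line.endswith("{"):
                current = line[:-1].strip()
                attributes.setdefault(current, {})
            elif line:
                key, value = (part.strip() for part in line.split("=", 1))
                attributes[current][key] = int(value)  # assumes integer values
            if closes:
                current = None
    return instances, relations, attributes


def build(instances, relations, attributes):
    """Create each object with its complete attributes, then the links."""
    objects = {
        name: {"class": cls, **attributes.get(name, {})}
        for name, cls in instances.items()
    }
    links = []
    for relation, source, target in relations:
        if source not in objects or target not in objects:
            raise ValueError(f"{relation}: undefined object in {source} -> {target}")
        links.append((relation, source, target))
    return objects, links


sample = """FACT TUPLE:
$INSTANCE flistaud.pas SourceFile
$INSTANCE audit_str_const_1 DatumDef
defines flistaud.pas audit_str_const_1
FACT ATTRIBUTE:
audit_str_const_1 {
startChar = 1883
endChar = 1907
isConst = 1 }
"""
objects, links = build(*parse_ta(sample))
print(objects["audit_str_const_1"], links)
```

Deferring object creation costs memory for the intermediate fact lists, but it means an object is never created with partial attributes and a relation never refers to a missing object.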
FACT TUPLE:
$INSTANCE flistaud.pas SourceFile
$INSTANCE audit_str_const_1 DatumDef
defines flistaud.pas audit_str_const_1
FACT ATTRIBUTE:
audit_str_const_1 {
startChar = 1883
endChar = 1907
isConst = 1 }

The following are basic query formats that return object lists. If result-class is omitted in
any of the following then SoftwareObject is
assumed. Parentheses may be used to group
when necessary.

<result-class> <name>

Returns all objects that are in the given relation to the objects retrieved as a result of
query2.

(<query>) <operator> <query>
Returns objects by composing queries. Operator "and" means find the intersection of the two
queries; operator "or" means find the union; operator "-" means set difference.
info <object-id>
next <n>
classinfo <class-id>
Acknowledgments
We would like to thank the SEs who have participated in our studies, in particular those at
Mitel with whom we have worked for many
months. We would like to thank the members of
our team who have developed and implemented
the ideas discussed in this paper: Abderrahmane
Lakas, Sue Rao and Javed Chowdhury. Janice
Singer has been instrumental in the studies of
SEs. We also acknowledge fruitful discussions
with others in the Consortium for Software
Engineering Research, especially Ric Holt now
of the University of Waterloo.
References
[2] Holt, R., An Introduction To TA: The Tuple-Attribute Language, Draft, to be published. www.turing.toronto.edu/~holt/papers/ta.html
[13] Take5 Corporation home page, http://www.takefive.com/index.htm