Assembler:- A compiler may produce assembly language as its output, because assembly language is easier to produce and easier to debug. The assembly language is then processed by a program called an assembler that produces relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may
have to be linked together with other relocatable object files and library files into
the code that actually runs on the machine. The linker resolves external memory
addresses, where the code in one file may refer to a location in another file. The
loader then puts together all of the executable object files into memory for
execution.
www.sacet.ac.in Page 1
DEPARTMENT OF CSE
Figure:- Compiler
Interpreter:- An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter directly executes the operations specified in the source program on inputs supplied by the user.
Figure:- An interpreter
The machine-language target program produced by a compiler executes much faster than an interpreter. An interpreter, however, can usually give better error diagnostics than a compiler, because it executes the source program statement by statement.
Figure:- An Assembler
Linker:- The linker is a program which links the object programs of functions to the
main program.
Loader:- The loader loads the program from the hard disk into main memory, loads the starting address of the program into the program counter (PC), and makes the program ready for execution.
Analysis phase:- The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of it. It also collects information about the source program and stores it in the symbol table. Lexical Analyzer, Syntax Analyzer, and Semantic Analyzer are the parts of this phase. The analysis part is the front end of the compiler.
Synthesis phase:- The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table. Intermediate Code Generator, Code Optimizer, and Code Generator are the parts of this phase. The synthesis part is the back end of the compiler.
Intermediate Code Generator:- After syntax and semantic analysis, the compiler generates an explicit machine-like intermediate representation, which should be easy to produce and easy to translate into target machine code. One such representation, three-address code, consists of a sequence of assembly-like instructions with three operands per instruction.
Code Optimizer:- The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result: target code that executes faster and consumes less power.
Code Generator:- The code generator takes the intermediate representation of the source program and converts it into the target code. If the target language is machine code, registers or memory locations are selected for each of the variables used by the program. Then, the intermediate instructions are translated into sequences of machine instructions that perform the same task.
Symbol Table:- The symbol table is a data structure containing a record for each
variable name, with fields for the attributes of the name. The data structure should be
designed to allow the compiler to find the record for each name quickly and to store
or retrieve data from that record quickly.
Error Handler:- The error handler reports the presence of an error, including the place in the source program where the error is detected. Common programming errors can occur at many different levels.
Lexical errors include misspellings of identifiers, keywords, or operators.
Syntax errors include misplaced semicolons or extra or missing braces.
Semantic errors include type mismatches between operators and operands.
Logical errors can be anything from incorrect reasoning on the part of the
programmer to the use in a C program of the assignment operator = instead of
the comparison operator ==.
The main goals of the error handler are:
1. Report the presence of errors clearly and accurately.
2. Recover from each error quickly enough to detect subsequent errors.
3. Add minimal overhead to the processing of correct programs.
3. Compiler: For the first time of compilation the process may take more time, but as the target program is saved on the hard disk, later executions do not repeat the translation. Interpreter: For the first time of interpretation the process may complete within less time, but as the target program is not saved, the translation is repeated on every execution.
Role of Lexical Analysis: - Lexical analyzer is the first phase of a compiler. The
main task of the lexical analyzer is to read the input characters of the source
program, group them into lexemes, and produce tokens for each lexeme in the
source program. The stream of tokens is sent to the parser for syntax analysis. When
lexical analyzer discovers a lexeme constituting an identifier, it interacts with the symbol table to enter that lexeme into the symbol table. Commonly, the interaction is implemented by having the parser call the lexical analyzer. The getNextToken command given by the parser causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce the next token, which it returns to the parser.
Since the lexical analyzer is the part of the compiler that reads the source text,
it may perform certain other tasks besides identification of lexemes. One such task is
stripping out comments and whitespace (blank, newline, tab, and perhaps other
characters that are used to separate tokens in the input). Another task is correlating
error messages generated by the compiler with the source program. For instance, the
lexical analyzer may keep track of the number of newline characters seen, so it can
associate a line number with each error message. In some compilers, the lexical analyzer is divided into a cascade of two processes: scanning, which removes comments and compacts whitespace, and lexical analysis proper, which produces tokens.
The analysis portion of a compiler is separated into lexical analysis and parsing for the following reasons:
1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to
apply specialized techniques that serve only the lexical task, not the job of
parsing. In addition, specialized buffering techniques for reading input
characters can speed up the compiler significantly.
3. Compiler portability is enhanced. Input-device-specific peculiarities can be
restricted to the lexical analyzer.
Token, patterns and Lexemes: - When discussing lexical analysis, we use three
related but distinct terms:
A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier.
A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword.
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
In many programming languages, the following classes cover most or all of the tokens:
Lexical Errors:- It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For instance, in a statement beginning fi (a == f(x)), the string fi could be a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let the parser handle the error due to the transposition of the letters. However, suppose a situation arises where the lexeme does not satisfy any of the patterns. The simplest
recovery strategy is "panic mode" recovery. We delete successive characters from
the remaining input, until the lexical analyzer can find a well-formed token at the
beginning of what input is left. This recovery technique may confuse the parser, but
in an interactive computing environment it may be quite adequate. Other possible
error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Transformations like these may be tried in an attempt to repair the input. The simplest such strategy is to see whether a prefix of the remaining input can be
transformed into a valid lexeme by a single transformation. This strategy makes
sense, since in practice most lexical errors involve a single character. A more general
correction strategy is to find the smallest number of transformations needed to
convert the source program into one that consists only of valid lexemes, but this
approach is considered too expensive in practice to be worth the effort.
Identifiers, for example, can be described by the regular expression letter_ ( letter_ | digit )*. The vertical bar means union, the parentheses are used to group subexpressions, and the star means "zero or more occurrences of". The letter_ at the beginning indicates that the identifier must begin with a letter or an underscore (_). Regular expressions are built recursively out of smaller regular expressions.
Regular Definitions:- A regular definition for an alphabet Σ is a sequence of definitions of the form d1 → r1, d2 → r2, ..., dn → rn, where
1. Each di is a new symbol, not in Σ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di−1}.
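The identifier pattern letter_ ( letter_ | digit )* can be checked directly in C without a regex library. The function name below is an illustrative assumption:

```c
#include <ctype.h>

/* Direct simulation of letter_ ( letter_ | digit )* : the first character
   must be a letter or underscore, the rest letters, digits or underscores. */
int is_identifier(const char *s) {
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;                       /* must begin with letter or _ */
    for (int i = 1; s[i] != '\0'; i++)
        if (!(isalnum((unsigned char)s[i]) || s[i] == '_'))
            return 0;                   /* remaining: letter_ | digit */
    return 1;
}
```

Note that an empty string is rejected by the first test, since isalpha('\0') is false.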
Example 2 : Unsigned numbers (integer or floating point) are strings such as 5280,
0.01234, 6.336E4, or 1.89E-4. Write a regular definition for Unsigned numbers in C
language.
Each state represents a condition that could occur during the process of scanning the
input looking for a lexeme that matches one of several patterns. Edges are
directed from one state of the transition diagram to another. Each edge is labeled by a
symbol or set of symbols. If we are in some state s, and the next input symbol is
a, we look for an edge out of state s labeled by a (and perhaps by other symbols, as
well). If we find such an edge, we enter the state of the transition diagram to
which that edge leads.
Example 1 : Draw a transition diagram for the relational operators (relop). The relational operators are <, <=, <>, >, >=, =
Relop → < | <= | <> | > | >= | =
The Structure of Lex Programs:- A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
1. The declarations section includes declarations of variables, manifest constants, and regular definitions.
2. The translation rules each have the form
Pattern {Action}
Each pattern is a regular expression, which may use the regular definitions of the declarations section. The actions are fragments of code, typically written in C, although other languages can also be used.
3. The third section holds whatever additional functions are used in the actions. Alternatively, these functions can be compiled separately and loaded with the lexical analyzer.
The lexical analyzer created by Lex works along with the parser as follows. When called by the parser, the lexical analyzer begins reading its remaining input, one character at a time, until it finds the longest prefix of the input that matches one of the patterns Pi. It then executes the associated action Ai. Typically, Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace or comments), then the lexical analyzer proceeds to find additional lexemes, until one of the corresponding actions causes a return to the parser. The lexical analyzer returns a single value, the token name, to the parser, but uses the shared integer variable yylval to pass additional information about the lexeme found, if needed.
Ex 1:- Write a lex program to identify parentheses, brackets, labels, numbers and strings in the given input.
"(" printf("%s is an open parenthesis\n",yytext);
")" printf("%s is a closed parenthesis\n",yytext);
"[" printf("%s is an open square bracket\n",yytext);
"]" printf("%s is a closed square bracket\n",yytext);
{label} printf("%s is a label\n",yytext);
{number} printf("%s is a number\n",yytext);
\".*\" printf("%s is a string\n",yytext);
%%
main(int argc,char **argv)
{
FILE *f;
f=fopen(argv[1],"r");
yyin=f;
yylex();
return 0;
}
Ex 2:- Write a lex program to convert upper case into lower case and vice versa in a
given string
/* Program to convert upper case into lower case and vice versa */
%{
int upcnt=0,lwcnt=0;
%}
%%
[a-z] { char ch = yytext[0];
ch=ch-32; /* lower to upper */
printf("%c",ch);
lwcnt++;
}
[A-Z] { char ch = yytext[0];
ch=ch+32; /* upper to lower */
printf("%c",ch);
upcnt++;
}
%%
int main()
{
yylex();
printf("\nLower case letters=%d Upper case letters=%d\n",lwcnt,upcnt);
return 0;
}
Ex 3:- Write a lex program to count the number of characters, words and lines in a given text
/* Program to count no of characters, words and lines in a given text */
%{
int nchar=0, nword=0, nline=0;
%}
%%
\n { nline++; nchar++; }
[^ \t\n]+ { nword++; nchar += yyleng; }
. { nchar++; }
%%
int main()
{
yylex();
printf("No of characters= %d\nNo of words=%d\nNo of lines=%d\n", nchar, nword, nline);
return 0;
}
IMPORTANT QUESTIONS
1. (a) What are the functions of pre-processing?
(b) Explain briefly, the need and functionality of linkers, assemblers and loaders.
2. (a) Mention the functions of linkers and loaders in pre-processing.
3. (a) Show the output of each phase of the compiler for the expression a := b + c * 50.
(b) Give and explain the diagrammatic representation of a language processing
system.
4. Discuss about Lexical Analysis and Role of Lexical Analysis.
5. Differentiate between Lexical Analysis and Parsing.
6. Define the terms Token, Pattern and Lexeme.
7. Explain briefly about Lexical Errors.
8. Define Regular Expressions and Regular definitions. Write Regular
Expressions for the language constructs such as Strings, Sequences and
Comments.
9. Define Transition diagram and draw transition diagrams for recognition of tokens, reserved words and identifiers.
Figure: Position of parser in compiler model
The parser reports any syntax errors in an intelligible fashion and tries to recover from commonly occurring errors so that it can continue processing the remainder of the program. Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the rest of the compiler for further processing.
Bottom-up parsers:- Bottom-up parsers start from the leaves and work their way up
SE
to the root. In either case, the input to the parser is scanned from left to right, one
symbol at a time. Parsers for the larger class of LR grammars are usually constructed
using automated tools. The bottom-up parsing can be implemented by using the
following techniques.
1. Shift-reduce parser
2. Operator Precedence parser
3. LR parsers
LR parsers are further subdivided into
a) SLR parser
b) LALR parser
c) CLR parser
Derivations:- The production rules are used to derive certain strings. The generation of a language using production rules is called derivation. A parse tree is a graphical representation of a derivation that filters out the order in which productions are applied to replace nonterminals. Each interior node of a parse tree represents the application of a production. The interior node is labeled with the nonterminal A in the left-hand side of the production; the children of the node are labeled, from left to right, by the symbols in the right-hand side of the production by which this A was replaced during the derivation.
Ambiguity:- A grammar that produces more than one parse tree for some sentence
is said to be ambiguous. Put another way, an ambiguous grammar is one that
produces more than one leftmost derivation or more than one rightmost derivation
for the same sentence.
The arithmetic expression grammar permits two distinct leftmost derivations
for the sentence id + id * id. The corresponding parse trees appear in Fig.
To construct a top-down parse tree for the input string w = cad with the grammar S → cAd, A → ab | a, begin with a tree consisting of a single node labeled S, and the input pointer pointing to c, the first symbol of w. S has only one production, so we use it to expand S and obtain the tree
of Fig.(a). The leftmost leaf, labeled c, matches the first symbol of input w, so we
advance the input pointer to a, the second symbol of w, and consider the next leaf,
labeled A.
Now, we expand A using the first alternative A → ab to obtain the tree of Fig. (b).
We have a match for the second input symbol, a, so we advance the input pointer to
d, the third input symbol, and compare d against the next leaf, labeled b. Since b
does not match d, we report failure and go back to A to see whether there is another
alternative for A that has not been tried, but that might produce a match.
In going back to A, we must reset the input pointer to position 2, the position
it had when we first came to A, which means that the procedure for A must store the
input pointer in a local variable. The second alternative for A produces the tree of
Fig.(c). The leaf a matches the second symbol of w and the leaf d matches the third
symbol. Since we have produced a parse tree for w, we halt and announce successful
completion of parsing.
Difficulties in top-down parsing:- There are various difficulties associated with top-
down parsing. They are
1. Backtracking is the major difficulty with top-down parsing. Choosing a wrong
production for expansion necessitates backtracking. Top-down parsing with
backtracking involves exponential time complexity with respect to the length
of the input.
2. Left recursive grammars cannot be parsed by top-down parsers since they may
create an infinite loop.
3. Grammar must be left factored before applying it as an input to top-down
parser.
4. Top-down parsers cannot parse the ambiguous grammar.
5. Top-down parsers are slow and debugging is very difficult.
Here "other" stands for any other statement. According to this grammar, the compound conditional statement
if E1 then S1 else if E2 then S2 else S3
has the parse tree shown above.
Fig:- Parse tree for a conditional statement
The above grammar is ambiguous, since the string
if E1 then if E2 then S1 else S2
has the two parse trees shown in Fig. below.
The idea is that a statement appearing between a then and an else must be "matched"; that is, the interior statement must not end with an unmatched or open then.
Elimination of Left Recursion:- A grammar is left recursive if it has a nonterminal A such that there is a derivation A ⇒ Aα for some string α. The immediate left recursion in productions of the form A → Aα | β can be eliminated by replacing them with
A → βA'
A' → αA' | ε
without changing the strings derivable from A. This rule by itself suffices for many grammars.
Example : Consider the following grammar
Left Factoring:- Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive, or top-down, parsing. In general, if A → αβ1 | αβ2 are two A-productions, and the input begins with a nonempty string derived from α, we do not know whether to expand A to αβ1 or αβ2. We may defer the decision by expanding A to αA'; then, after seeing the input derived from α, we expand A' to β1 or to β2. That is, left-factored, the original productions become
A → αA'
A' → β1 | β2
Example:- Consider the grammar and perform left-factoring.
T( )
{
    F( );
    B( );
}
B( )
{
    if (lookahead == *)
    {
        match( );
        F( );
        B( );
    }
}
A( )
{
    if (lookahead == +)
    {
        match( );
        T( );
        A( );
    }
}
F( )
{
    if (lookahead == id)
    {
        match( );
    }
    else if (lookahead == '(')
    {
        match( );
        E( );
        if (lookahead == ')')
            match( );
        else
            ERROR;
    }
    else
        ERROR;
}
Example:- Write a code for the recursive-descent parsing of the following grammar
expr → term rest
rest → + term rest | - term rest | ε
term → 0 | 1 | ..... | 9
void expr( )
{
    term( );
    rest( );
}
void rest( )
{
    if (lookahead == '+')
    {
        match('+'); term( ); rest( );
    }
    else if (lookahead == '-')
    {
        match('-'); term( ); rest( );
    }
    else
        return;   /* rest → ε */
}
void term( )
{
    if (isdigit(lookahead))
        match(lookahead);
    else
        Error( );
}
Predictive parsing:- A nonrecursive predictive parser can be built by maintaining a
stack explicitly, rather than implicitly via recursive calls. The parser mimics a
leftmost derivation. If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols α such that S ⇒* wα by a leftmost derivation.
The table-driven parser in Fig below has an input buffer, a stack containing a
sequence of grammar symbols, a parsing table constructed by Algorithm, and an
output stream. The input buffer contains the string to be parsed, followed by the
endmarker $. We reuse the symbol $ to mark the bottom of the stack, which initially
contains the start symbol of the grammar on top of $. The parsing table is a two-
dimensional array M[A, a] where A is a nonterminal, and a is a terminal or the
symbol $.
The construction of the parsing table is aided by two functions associated with the grammar, called FIRST and FOLLOW.
FIRST:- FIRST(α) is defined to be the set of terminals that appear as the first symbols of one or more strings of terminals generated from α. To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set.
1. If X is a terminal, then FIRST(X) = {X}.
2. If X is a nonterminal and X → Y1Y2...Yk is a production, then add FIRST(Y1) to FIRST(X); if Y1 can derive ε, also add FIRST(Y2), and so on.
3. If X → ε is a production, then add ε to FIRST(X).
4. If X → aα is a production, then a is in FIRST(X).
5. If X → aα | ε is a production, then both a and ε are in FIRST(X).
FOLLOW:- FOLLOW(A) of a nonterminal A is defined to be the set of terminals a that can appear immediately to the right of A in some sentential form. To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can be added to any FOLLOW set.
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
If, after performing the above, there is no production at all in M[A, a], then set
M[A, a] to error (which we normally represent by an empty entry in the table).
Example:- Consider the following grammar:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
FIRST(F) = FIRST(T) = FIRST(E) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
As E is the start symbol, $ is in FOLLOW(E).
Since F → (E), ')' is also in FOLLOW(E).
Therefore FOLLOW(E) = FOLLOW(E') = { ), $ }
From E → TE':
FOLLOW(T) = (FIRST(E') − {ε}) ∪ FOLLOW(E')
= ({ +, ε } − {ε}) ∪ { ), $ } = { +, ), $ }
FOLLOW(T') = FOLLOW(T) = { +, ), $ }
From T → FT':
FOLLOW(F) = (FIRST(T') − {ε}) ∪ FOLLOW(T)
= ({ *, ε } − {ε}) ∪ { +, ), $ } = { *, +, ), $ }
Construction of Parsing Table:-
1. For every production A → β of the grammar, go to steps 2 & 3.
2. For each terminal symbol x in FIRST(β), place A → β in the cell M[A, x], where M is a two-dimensional array.
3. If FIRST(β) contains ε, then place A → β in the cell M[A, y] for every terminal symbol y in FOLLOW(A).
Example: Consider the grammar E → TE' above. On input id + id * id, determine the sequence of moves of the nonrecursive predictive parser.
Example:- Check whether the given grammar is an LL(1) grammar.
S → iEtSS' | a
S' → eS | ε
E → b
Construct a predictive parse table.
FIRST(S) = { i, a }
FIRST(S') = { e, ε }
FIRST(E) = { b }
FOLLOW(S) = { $ } ∪ (FIRST(S') − {ε}) = { e, $ }
FOLLOW(S') = FOLLOW(S) = { e, $ }
FOLLOW(E) = { t }
The predictive parse table:
        i            a        b        e                    t        $
S       S → iEtSS'   S → a
S'                                     S' → eS, S' → ε               S' → ε
E                             E → b
Since the predictive parse table has two entries in M[S', e], namely S' → eS and S' → ε, the given grammar is not LL(1).
Important Questions
1. Explain about Syntax Analysis and Role of a parser
2. Discuss about Classification of parsing techniques
3. Briefly explain Top down parsing
4. Explain about Recursive descent parsing
5. Explain about predictive parsing
6. Explain the construction of a predictive parse table using FIRST and FOLLOW
7. Discuss about LL(1) grammars
8. Briefly explain Error recovery in predictive parsing.
Introduction to Simple LR:- The most important type of bottom-up parser is based on a
concept called LR(k) parsing; the "L" is for left-to-right scanning of the input, the "R" for
constructing a rightmost derivation in reverse, and the k for the number of input symbols
of lookahead that are used in making parsing decisions. Generally, LR(k) parsers use k = 0 or k = 1. When (k) is omitted, k is assumed to be 1. The easiest
method for constructing shift-reduce parsers is called "simple LR" (or SLR, for
short). Another two more complex bottom-up parsers are canonical-LR and LALR
which are used in the majority of LR parsers.
LR parsers can handle a larger class of grammars than LL parsers. A grammar for which we can construct an LR parsing table is said to be an LR grammar. For a grammar to be LR, it is sufficient that a left-to-right shift-reduce parser be able to recognize handles of right-sentential forms when they appear on top of the stack. LR parsing is attractive for a variety of reasons:
1. LR parsers can be constructed to recognize all programming language constructs
for which context-free grammars can be written.
2. The LR-parsing method is the most general nonbacktracking shift-reduce parsing
method known, yet it can be implemented as efficiently as other shift-reduce
methods.
3. An LR parser can detect a syntactic error as soon as it is possible to do so on a left-
to-right scan of the input.
4. The class of grammars that can be parsed using LR methods is a proper superset of
the class of grammars that can be parsed with predictive or LL methods.
2. If ACTION[Sm, ai] = reduce A → β, the parser pops r = |β| symbols off the stack and then pushes A and the entry GOTO[Sm−r, A] onto the stack.
3. If ACTION[Sm, ai] = accept, parsing is completed.
4. If ACTION[Sm, ai] = error, the parser has discovered an error and calls an error recovery routine.
Shift Reduce parsing:- Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols and an input buffer holds the rest of the string to be parsed. As we shall see, the handle always appears at the top of the stack just before it is identified as the handle. A "handle" is a substring that matches the body of a production, and whose reduction represents one step along the reverse of a rightmost derivation.
We use $ to mark the bottom of the stack and also the right end of the input. Conventionally, when discussing bottom-up parsing, we show the top of the stack on the right, rather than on the left as we did for top-down parsing. Initially, the stack is empty and the string w is on the input.
During a left-to-right scan of the input string, the parser shifts zero or more input
symbols onto the stack, until it is ready to reduce a string β of grammar symbols on top of
the stack. It then reduces β to the head of the appropriate production. The parser repeats
this cycle until it has detected an error or until the stack contains the start symbol and the
input is empty:
Upon entering this configuration, the parser halts and announces successful completion of
parsing.
There are actually four possible actions a shift-reduce parser can make:
(1) shift, (2) reduce, (3) accept, and (4) error.
1. Shift:- Shift the next input symbol onto the top of the stack.
Example:- List out the actions of a shift-reduce parser to parse the input string id1 * id2 according to the expression grammar.
Example:- List out the actions of a shift-reduce parser to parse the input string id
*id+id according to the following grammar
E E + E | E * E | ( E ) | id
$          id*id+id$   Shift id
$id        *id+id$     Reduce E → id
$E         *id+id$     Shift *
$E*        id+id$      Shift id
$E*id      +id$        Reduce E → id
$E*E       +id$        Shift +
$E*E+      id$         Shift id
$E*E+id    $           Reduce E → id
$E*E+E     $           Reduce E → E+E
$E*E       $           Reduce E → E*E
$E         $           Accept
Conflicts During Shift-Reduce Parsing:- There are context-free grammars for which shift-reduce parsing cannot be used. Every shift-reduce parser for such a grammar can reach a configuration in which the parser, knowing the entire stack contents and the next input symbol, cannot decide whether to shift or to reduce (a shift/reduce conflict), or cannot decide which of several reductions to make (a reduce/reduce conflict). For example, in the trace above, with the stack contents $E*E and the next input symbol +, the parser can either shift + or reduce by E → E*E; this is a shift/reduce conflict. To solve this problem, preference is given to "Shift +".
Reduce/reduce conflict:- Suppose the stack and the input buffer contain the contents shown below:
Stack Contents   Input Buffer   Action Taken
$E*E+E           $
The parser can take the action "Reduce E → E+E" or "Reduce E → E*E". This is known as a reduce/reduce conflict. As bottom-up parsing traces a rightmost derivation in reverse, preference is given to "Reduce E → E+E".
A shift/reduce or reduce/reduce conflict is encountered for grammars that are not LR or that are ambiguous. Compilers, however, use LR grammars, so these conflicts do not arise during compilation. A shift-reduce parser cannot be constructed for a non-LR or ambiguous grammar.
Rule 3:- If operators θ1 and θ2 have equal precedence, take θ1 ·> θ2 and θ2 ·> θ1 if they are left associative, and θ1 <· θ2 and θ2 <· θ1 if they are right associative.
Rule 4:- Unary operators have higher precedence than binary operators.
Rule 5:- A pair of parentheses cannot be removed until all operations between the pair have been performed. A pair of outer parentheses cannot be removed until all inner parentheses have been removed. An operation outside a pair of parentheses cannot be performed until the pair has been removed.
Construction of precedence parse table:- The parse table is a table of size n x n, where n is the number of terminal symbols in the defined grammar.
1. Fill each cell with the relation between the vertical terminal symbol and the
horizontal terminal symbol.
Operator precedence parser:- The operator precedence parser uses a parsing table, called the operator precedence parsing table, for making informed decisions to identify the handle.
Operator precedence parsing algorithm:-
Fig:- Operator Precedence Parser
Step 1:- Let a pointer P point to the first input symbol of x$, where x is the string to be parsed.
Step 2:- If the relation between top of the stack and current input symbol is <· then
perform shift operation to push current input symbol onto the stack.
Step 3:- If the relation between the top of the stack and the current input symbol is ·> then perform the reduce operation, popping the handle from the stack.
Example:- Construct the operator precedence parse table for the following grammar:
E → E+E    E → E*E    E → (E)    E → id
and check the string id+id*id by using the operator precedence parser.
Stack contents   Relation   Input Buffer   Action Taken
$                <·         id+id*id$      Shift id
$id              ·>         +id*id$        Reduce E → id
$E               <·         +id*id$        Shift +
$E+              <·         id*id$         Shift id
$E+id            ·>         *id$           Reduce E → id
$E+E             <·         *id$           Shift *
$E+E*            <·         id$            Shift id
$E+E*id          ·>         $              Reduce E → id
$E+E*E           ·>         $              Reduce E → E*E
$E+E             ·>         $              Reduce E → E+E
$E               ·>         $              Accept
Construction of SLR tables: - The SLR method for constructing parsing tables is a good starting point for studying LR parsing. The parsing table constructed by this method is an SLR table, and an LR parser using an SLR-parsing table is an SLR parser. The SLR method begins with LR(0) items and LR(0) automata. That is, given a grammar G, we produce the augmented grammar G', with a new start symbol S'. From G', we construct C, the canonical collection of sets of items for G', together with the GOTO function.
The ACTION and GOTO entries in the parsing table are then constructed using FOLLOW(A) for each nonterminal A of the grammar.
1. Construct C = {I0, I1, . . . , In}, the collection of sets of LR(0) items for G'.
2. State i is constructed from Ii. The parsing actions for state i are determined as follows:
(a) If [A → α·aβ] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j." Here a must be a terminal.
(b) If [A → α·] is in Ii, then set ACTION[i, a] to "reduce A → α" for all a in FOLLOW(A); here A may not be S'.
(c) If [S' → S·] is in Ii, then set ACTION[i, $] to "accept."
Example:- Construct the SLR parse table for the following grammar.
S → L=R | R
L → *R | id
R → L
Step 1:- Produce the augmented grammar G' for the given grammar.
S' → S
S → L=R
S → R
L → *R
L → id
R → L
Step 2: Construct the collection of sets of LR(0) items for G'.
I0 = Closure( S' → ·S )
  S' → ·S
  S → ·L=R
  S → ·R
  L → ·*R
  L → ·id
  R → ·L
I1 = goto(I0, S) = Closure( S' → S· )
  S' → S·
I2 = goto(I0, L) = Closure( S → L·=R, R → L· )
  S → L·=R
  R → L·
I3 = goto(I0, R) = Closure( S → R· )
  S → R·
I4 = goto(I0, *) = Closure( L → *·R )
  L → *·R
  R → ·L
  L → ·*R
  L → ·id
I5 = goto(I0, id) = Closure( L → id· )
  L → id·
I6 = goto(I2, =) = Closure( S → L=·R )
  S → L=·R
  R → ·L
  L → ·*R
  L → ·id
I7 = goto(I4, R) = Closure( L → *R· )
  L → *R·
I8 = goto(I4, L) = Closure( R → L· )
  R → L·
goto(I4, *) = I4
goto(I4, id) = I5
I9 = goto(I6, R) = Closure( S → L=R· )
  S → L=R·
Follow(S’) = { $ }
Follow(S) = Follow(S’) = { $ }
Follow(R) = Follow(S) = { $ }
Follow(L) = Follow(R) = { $ }
Follow(L) = {=}
Thus Follow(L) = { =, $ }
More Powerful LR Parsers: - LR parsing techniques can be extended to use one symbol of lookahead on the input. Parsers constructed with this technique are called more powerful LR parsers. There are two parsers constructed in this way. They are
1. The "canonical-LR" parser which makes full use of the lookahead symbol(s). This
method uses a large set of items, called the LR(1) items.
2. The "lookahead-LR" parser which is based on the LR(0) sets of items, and has
many fewer states than CLR parsers which are based on the LR(1) items. By
carefully introducing lookaheads into the LR(0) items, LALR parsers can handle
many more grammars than with the SLR parser, and build parsing tables that are
no bigger than the SLR tables. LALR is the method of choice in most situations.
Construction of CLR(1):- The method for building the collection of sets of valid LR(1) items is essentially the same as the one for building the canonical collection of sets of LR(0) items. In CLR(1) we collect the canonical sets of LR(1) items; the value 1 in the bracket indicates that there is one lookahead symbol in each item.
Construction of the canonical set of items along with the lookahead: -
1. For the grammar G, initially add [S' → ·S, $] to the set of items C.
2. For each set of items Ii in C and for each grammar symbol X, add Closure(goto(Ii, X)). This process should be repeated by applying goto(Ii, X) for each X in Ii, such that goto(Ii, X) is not empty and not already in C. The sets of items have to be constructed until no more sets of items can be added to C.
3. The closure function can be computed as follows: for each item [A → α·Xβ, a], for each rule X → γ, and for each b ∈ FIRST(βa), add the item [X → ·γ, b].
Construction of the goto graph: - Each set of items Ii is drawn as a node labeled with the canonical item name. There is an edge from node Ii to node Ij if Ij = goto(Ii, X). This process is repeated until all the nodes are connected by means of the edges.
Construction of the canonical LR parsing table: - The ACTION and GOTO entries in the parsing table are then constructed as follows.
(a) If [A → α·aβ, b] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j". Here a must be a terminal.
(b) If [A → α·, a] is in Ii, then set ACTION[i, a] to "reduce A → α". Here A may not be S'.
(c) If [S' → S·, $] is in Ii, then set ACTION[i, $] to "accept".
(d) The goto part of the LR table is filled as follows: if GOTO(Ii, A) = Ij, then GOTO[i, A] = j.
(e) All entries not defined by the above rules are made "error".
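These ACTION/GOTO rules drive a simple table-driven parse loop. The sketch below hand-transcribes an illustrative CLR(1) table for the grammar S → CC, C → cC | d (the state numbers follow the numbering used in the worked example that follows; the table layout and function names are this sketch's own):

```python
# A minimal LR table-driven parse loop (shift/reduce/accept) for the
# grammar S -> CC, C -> cC | d. Table entries are transcribed by hand.

PRODS = {1: ("S", 2), 2: ("C", 2), 3: ("C", 1)}  # number -> (head, body length)

ACTION = {
    (0, "c"): ("s", 3), (0, "d"): ("s", 4),
    (1, "$"): ("acc", 0),
    (2, "c"): ("s", 6), (2, "d"): ("s", 7),
    (3, "c"): ("s", 3), (3, "d"): ("s", 4),
    (4, "c"): ("r", 3), (4, "d"): ("r", 3),
    (5, "$"): ("r", 1),
    (6, "c"): ("s", 6), (6, "d"): ("s", 7),
    (7, "$"): ("r", 3),
    (8, "c"): ("r", 2), (8, "d"): ("r", 2),
    (9, "$"): ("r", 2),
}

GOTO = {(0, "S"): 1, (0, "C"): 2, (2, "C"): 5, (3, "C"): 8, (6, "C"): 9}

def lr_parse(tokens):
    stack = [0]                       # stack of states
    tokens = list(tokens) + ["$"]
    i = 0
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            return False              # blank entry: error
        kind, n = act
        if kind == "s":               # shift: push state n, advance input
            stack.append(n)
            i += 1
        elif kind == "r":             # reduce: pop |body| states, push GOTO
            head, size = PRODS[n]
            del stack[len(stack) - size:]
            stack.append(GOTO[(stack[-1], head)])
        else:                         # accept
            return True

print(lr_parse("cdcd"))  # True
print(lr_parse("cddd"))  # False
```

The same loop works for SLR, CLR and LALR parsers; only the tables differ.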
Example:- Construct the CLR parsing table for the grammar S → CC, C → cC | d, and parse the strings "cdcd" and "cddd" using the CLR parsing table.
Solution:- Start by computing the closure of {[S' → ·S, $]}. To find the closure, match [S' → ·S, $] with the item [A → α·Bβ, a]. That is, A = S', α = ε, B = S, β = ε, and a = $. The function CLOSURE tells us to add [B → ·γ, b] for each production B → γ and terminal b in FIRST(βa).
The initial set of items is
Now compute GOTO(I0,S),
No additional productions are added , since the dot is at the right end. Thus we have the
next set of items
Now compute GOTO(I0,C),
We have finished considering GOTO on I0. We get no new sets from I1, but we can apply
goto's on I2 . Now compute GOTO(I2,C),
Now compute GOTO(I6, C).
The GOTO's of I6 on c and d are I6 and I7, respectively:
GOTO(I6, c) = I6
GOTO(I6, d) = I7
The remaining sets of items yield no GOTO'S, so we are done.
The goto graph for the above given grammar is
Parsing the input string "cdcd" using the LR(1) parsing table:

Stack        Input Buffer    Parsing Action
$0           cdcd$           Shift S3 [push c & 3]
$0c3         dcd$            Shift S4 [push d & 4]
$0c3d4       cd$             R3 Reduce C → d
$0c3C8       cd$             R2 Reduce C → cC
$0C2         cd$             Shift S6 [push c & 6]
$0C2c6       d$              Shift S7 [push d & 7]
$0C2c6d7     $               R3 Reduce C → d
$0C2c6C9     $               R2 Reduce C → cC
$0C2C5       $               R1 Reduce S → CC
$0S1         $               Accept

Thus the given input string is successfully parsed using the LR(1) parser or canonical LR parser.
Parsing the input string "cddd" using the LR(1) parsing table:

Stack        Input Buffer    Parsing Action
$0           cddd$           Shift S3 [push c & 3]
$0c3         ddd$            Shift S4 [push d & 4]
$0c3d4       dd$             R3 Reduce C → d
$0c3C8       dd$             R2 Reduce C → cC
$0C2         dd$             Shift S7 [push d & 7]
$0C2d7       d$              Error

Thus the given input string is not accepted by the LR(1) parser or canonical LR parser.
Example:- Construct the CLR parsing table for the grammar A → BA | ε, B → aB | b (augmented with S → A).
Solution:- Start with the item [S → ·A, $]. To compute its closure, match [A → ·BA, $] with the item [A → α·Bβ, a]. That is, α = ε, β = A, and a = $. Therefore FIRST(βa) = FIRST(A$) = a/b/$.
The initial set of items is
I0: S → ·A, $
    A → ·BA, $
    A → ·, $
    B → ·aB, a/b/$
    B → ·b, a/b/$
Now compute GOTO(I0, A):
I1: S → A·, $
Now compute GOTO(I0, B):
I2: A → B·A, $
    A → ·BA, $
    A → ·, $
    B → ·aB, a/b/$
    B → ·b, a/b/$
Similarly compute all the canonical items until · reaches the end in every production.
I3: GOTO(I0, a)
    B → a·B, a/b/$
    B → ·aB, a/b/$
    B → ·b, a/b/$
I4: GOTO(I0, b)
    B → b·, a/b/$
I5: GOTO(I2, A)
    A → BA·, $
I6: GOTO(I3, B)
    B → aB·, a/b/$
The canonical parsing table for the above grammar is

            Action              GoTo
State    a      b      $       A    B
  0      S3     S4     r2      1    2
  1                    Accept
  2      S3     S4     r2      5    2
  3      S3     S4                  6
  4      r4     r4     r4
  5                    r1
  6      r3     r3     r3
Most syntactic constructs of programming languages can be expressed conveniently by SLR grammars, but there are a few constructs that cannot be conveniently handled by LR parsers. For a comparison of parser size, the SLR and LALR tables for a grammar always have the same number of states, and this number is typically several hundred states for a language like C. The canonical LR table would typically have several thousand states for the same-size language, so it is much easier and more economical to construct SLR and LALR tables.
Construction of the LALR parsing table:-
1. Construct the collection of sets of LR(1) items.
2. Merge the sets of items that have the same core (the same first components), taking the union of their lookaheads.
3. Construct the ACTION and GOTO entries from the merged sets, as for the canonical LR table.
4. Parse the input string using the LALR parse table, similar to LR(1) parsing.
Example: - Construct the LALR parsing table for the following grammar.
S → CC
C → aC
C → d
Parse the strings "adad" and "addd" using the LALR parsing table.
Solution:- Convert the above grammar into an augmented grammar.
S' → S
S → CC
C → aC
C → d
The initial set of items is
I0: S' → ·S, $
    S → ·CC, $
    C → ·aC, a/d
    C → ·d, a/d
Now compute GOTO(I0, S). No additional productions are added, since the dot is at the right end. Thus we have the next set of items:
I1: S' → S·, $
Now compute GOTO(I0, C):
I2: S → C·C, $
    C → ·aC, $
    C → ·d, $
Computing the remaining items we have
I3: GOTO(I0, a)
    C → a·C, a/d
    C → ·aC, a/d
    C → ·d, a/d
I4: GOTO(I0, d)
    C → d·, a/d
I5: GOTO(I2, C)
    S → CC·, $
I6: GOTO(I2, a)
    C → a·C, $
    C → ·aC, $
    C → ·d, $
I7: GOTO(I2, d)
    C → d·, $
I8: GOTO(I3, C)
    C → aC·, a/d
GOTO(I3, a) = I3
GOTO(I3, d) = I4
I9: GOTO(I6, C)
    C → aC·, $
GOTO(I6, a) = I6
GOTO(I6, d) = I7
The remaining sets of items yield no GOTO's, so we are done.
As the first components (cores) of states I3 and I6 are the same, we merge the two states to get I36.
I36: C → a·C, a/d/$
     C → ·aC, a/d/$
     C → ·d, a/d/$
Similarly we merge the two states I4 and I7 to get I47, and states I8 and I9 to get I89.
I47: GOTO(I0/I2, d)
     C → d·, a/d/$
I89: GOTO(I3/I6, C)
     C → aC·, a/d/$
The goto graph for the above given grammar is
Parsing the input string "adad" using the LALR parsing table:

Stack          Input Buffer    Parsing Action
$0             adad$           Shift S36 [push a & 36]
$0a36          dad$            Shift S47 [push d & 47]
$0a36d47       ad$             R3 Reduce C → d
$0a36C89       ad$             R2 Reduce C → aC
$0C2           ad$             Shift S36 [push a & 36]
$0C2a36        d$              Shift S47 [push d & 47]
$0C2a36d47     $               R3 Reduce C → d
$0C2a36C89     $               R2 Reduce C → aC
$0C2C5         $               R1 Reduce S → CC
$0S1           $               Accept

Thus the given input string is successfully parsed using the LALR parser.
Parsing the input string "addd" using the LALR parsing table:

Stack          Input Buffer    Parsing Action
$0             addd$           Shift S36 [push a & 36]
$0a36          ddd$            Shift S47 [push d & 47]
$0a36d47       dd$             R3 Reduce C → d
$0a36C89       dd$             R2 Reduce C → aC
$0C2           dd$             Shift S47 [push d & 47]
$0C2d47        d$              R3 Reduce C → d
$0C2C5         d$              Error

Thus the given input string is not accepted by the LALR parser. Note that the LALR parser makes one more reduction than the canonical LR parser would before announcing the error.
Example:- Show that the following grammar is LR(1) but not LALR(1).
S → Aa | bAc | Bc | bBa
A → d
B → d
Solution:- Convert the above grammar into an augmented grammar.
S' → S
S → Aa | bAc | Bc | bBa
A → d
B → d
The initial set of items is
I0: S' → ·S, $
    S → ·Aa, $
    S → ·bAc, $
    S → ·Bc, $
    S → ·bBa, $
    A → ·d, a
    B → ·d, c
I1: GOTO(I0, S)
    S' → S·, $
I2: GOTO(I0, A)
    S → A·a, $
I3: GOTO(I0, b)
    S → b·Ac, $
    S → b·Ba, $
    A → ·d, c
    B → ·d, a
I4: GOTO(I0, B)
    S → B·c, $
I5: GOTO(I0, d)
    A → d·, a
    B → d·, c
I6: GOTO(I2, a)
    S → Aa·, $
I7: GOTO(I3, A)
    S → bA·c, $
I8: GOTO(I3, B)
    S → bB·a, $
I9: GOTO(I3, d)
    A → d·, c
    B → d·, a
I10: GOTO(I4, c)
    S → Bc·, $
I11: GOTO(I7, c)
    S → bAc·, $
I12: GOTO(I8, a)
    S → bBa·, $
The remaining sets of items yield no GOTO's, so we are done.
The LR(1) parsing table for the above grammar is

                  Action                      GoTo
State    a      b      c      d      $      S    A    B
  0             s3            s5            1    2    4
  1                                  Accept
  2      s6
  3                           s9                 7    8
  4                    s10
  5      r5            r6
  6                                  r1
  7                    s11
  8      s12
  9      r6            r5
 10                                  r3
 11                                  r2
 12                                  r4
To build the LALR table, states I5 and I9 (which have the same core) are merged into I59, with items A → d·, a/c and B → d·, a/c. The resulting LALR parsing table shows multiple entries in Action[59, a] and Action[59, c]. This is called a reduce/reduce conflict. Because of this conflict we cannot parse the input. Thus it is shown that the given grammar is LR(1) but not LALR(1).
Dangling Else ambiguity:- It is a fact that every ambiguous grammar fails to be LR. However, certain types of ambiguous grammars are quite useful in the specification and implementation of languages. Consider again the following grammar for conditional statements:
stmt → if expr then stmt else stmt | if expr then stmt | other
The above grammar is ambiguous because it does not resolve the dangling-else ambiguity. To simplify the discussion, let us consider an abstraction of this grammar, where i stands for if expr then, e stands for else, and a stands for "all other productions". Converting the abstracted grammar into an augmented grammar we have
S' → S
S → iSeS | iS | a
The above table has multiple entries at Action[4, e], so the grammar suffers from a shift/reduce conflict.
Parse the input string iiaea:

Stack contents   Input Buffer    Action Taken
$0               iiaea$          S2 push i & 2
$0i2             iaea$           S2 push i & 2
$0i2i2           aea$            S3 push a & 3
$0i2i2a3         ea$             R3 Reduce S → a
$0i2i2S4         ea$             Shift/reduce conflict
When such a situation occurs, first try choosing each action separately. First choosing the reduce action we have

Stack contents   Input Buffer    Action Taken
$0               iiaea$          S2 push i & 2
$0i2             iaea$           S2 push i & 2
$0i2i2           aea$            S3 push a & 3
Error Recovery in LR Parsing: - An LR parser will detect an error when it consults the parsing action table and finds an error entry. Errors are never detected by consulting the goto table. An LR parser will announce an error as soon as there is no valid continuation for the portion of the input thus far scanned. A canonical LR parser will not make even a single reduction before announcing an error. SLR and LALR parsers may make several reductions before announcing an error, but they will never shift an erroneous input symbol onto the stack.
Panic-mode error recovery: - Suppose the input string being parsed contains an error. Part of that string has already been processed, and the result of this processing is a sequence of states on top of the stack; the remainder of the string is still in the input. The parser scans down the stack until a state s with a goto on a particular nonterminal A is found, and skips over the input looking for a terminal that can legitimately follow A. By removing states from the stack, skipping over the input, and pushing GOTO(s, A) on the stack, the parser pretends that it has found an instance of A and resumes normal parsing.
Phrase-level error recovery: - Phrase-level recovery is implemented by examining each error entry in the LR parsing table, and an appropriate recovery procedure can then be constructed. In designing specific error-handling routines for an LR parser, we can fill in
each blank entry in the action field with a pointer to an error routine that will take the
appropriate action selected by the compiler designer. The actions may include insertion
or deletion of symbols from the stack or the input or both, or alteration and transposition
of input symbols. The modifications should be such that the LR parser will not get into an
infinite loop. A safe strategy will assure that at least one input symbol will be removed or
shifted eventually, or that the stack will eventually shrink if the end of the input has been
reached.
position = initial + rate * 60
In the above expression suppose that position, initial, and rate have been declared to be floating-point numbers, and that the lexeme 60 by itself forms an integer. The type checker in the semantic analyzer discovers that the operator * is applied to a floating-point number rate and an integer 60. In this case, the integer may be converted into a floating-point number.
Semantic errors include type mismatches between operators and operands. The semantic analyzer reports an error when an array index is out of range. It can report semantic errors both at compile time and at run time: at compile time it checks the compatibility of operands and operators, and at run time it checks the range of array indices.
Syntax-Directed Definitions: - A syntax-directed definition (SDD) is a context-free grammar together with attributes and rules. Attributes are associated with grammar symbols and rules are associated with productions. If X is a symbol and a is one of its attributes, then we write X.a to denote the value of a at a particular parse-tree node labeled X. If we implement the nodes of the parse tree by records or objects, then the attributes of X can be implemented by data fields in the records that represent the nodes for X. Attributes may be of any kind: numbers, types, table references, or strings, for instance. The strings may even be long sequences of code, say code in the intermediate language used by a compiler.
Inherited and Synthesized Attributes: - A syntax-directed definition (SDD) may use two kinds of attributes for nonterminals. They are
1. Synthesized Attributes: - A synthesized attribute at node N is defined only in terms of attribute values at the children of N and at N itself.
2. Inherited Attributes: - An inherited attribute at node N is defined only in terms of attribute values at N's parent, N itself, and N's siblings. For example, in a declaration int L1, L2 the type int is passed down to the identifiers as an inherited attribute: L1.type = int and L2.type = int.
Solution: - The values of lexval are presumed supplied by the lexical analyzer. Each
of the nodes for the non terminals has attribute val computed in a bottom-up order,
and we see the resulting values associated with each node. For instance, at the node
with a child labeled *, after computing T.val= 3 and F.val = 5 at its first and third
children, we apply the rule that says T.val is the product of these two values, or 15.
The annotated parse tree is as shown above.
Ex : Draw the annotated parse tree for the input string 3 * 5 using the grammar and
rules given in the table.
Solution: - To see how the semantic rules are used, consider the annotated parse tree for 3 * 5 in the above figure. The leftmost leaf in the parse tree, labeled digit, has attribute value lexval = 3, where the 3 is supplied by the lexical analyzer. Its parent is for production 4, F → digit. The only semantic rule associated with this production defines F.val = digit.lexval, which equals 3.
At the second child of the root, the inherited attribute T1.inh is defined by the semantic rule T1.inh = F.val associated with production 1. Thus, the left operand, 3, for the * operator is passed from left to right across the children of the root. The production at the node for T11 is T1 → * F T11. (We retain the subscript 1 in the annotated parse tree to distinguish between the two nodes for T1.) The inherited attribute T11.inh is defined by the semantic rule T11.inh = T1.inh × F.val associated with production 2.
With T1.inh = 3 and F.val = 5, we get T11.inh = 15. At the lower node for T11, the production is T1 → ε. The semantic rule T1.syn = T1.inh defines T11.syn = 15. The syn attributes at the nodes for T1 pass the value 15 up the tree to the node for T, where T.val = 15.
First we parse the input token stream and generate the parse tree. Then the tree is traversed, evaluating the semantic rules at the parse-tree nodes.
Applications of Syntax-Directed Translation: - The main application of syntax-
directed translation techniques is the construction of syntax trees. Since some
compilers use syntax trees as an intermediate representation, a common form of
SDD turns its input string into a tree. We consider two SDD's for constructing
syntax trees for expressions. The first, an S-attributed definition, is suitable for use
during bottom-up parsing. The second, L-attributed, is suitable for use during top-
down parsing.
Construction of Syntax Trees: - An SDD can be used to construct either syntax trees or DAG's. Each node in a syntax tree represents a construct; the children of the node represent the meaningful components of the construct. A syntax-tree node representing an expression E1 + E2 has label + and two children representing the subexpressions E1 and E2. We shall implement the nodes of a syntax tree by objects with a suitable number of fields. Each object will have an op field that is the label of the node. The objects will have additional fields as follows:
If the node is a leaf, an additional field holds the lexical value for the leaf. A constructor function Leaf(op, val) creates a leaf object. Alternatively, if nodes are viewed as records, then Leaf returns a pointer to a new record for a leaf.
If the node is an interior node, there are as many additional fields as the node has children in the syntax tree. A constructor function Node takes two or more arguments: Node(op, c1, c2, ..., ck) creates an object with first field op and k
additional fields for the k children c1, ..., ck.
Ex: - Construct a syntax tree for the expression a - 4 + c using the above constructor functions.
Solution: - Every time the first production, E → E1 + T, is used, its rule creates a node with '+' for op and two children, E1.node and T.node, for the subexpressions. The second production, E → E1 - T, has a similar rule.
For production 3, E → T, no node is created, since E.node is the same as T.node. Similarly, no node is created for production 4, T → ( E ). The value of
T.node is the same as E.node, since parentheses are used only for grouping; they influence the structure of the parse tree, but once their job is done, there is no further need to retain them in the syntax tree.
The last two T-productions have a single terminal on the right. We use the
constructor Leaf to create a suitable node, which becomes the value of T.node.
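The Leaf and Node constructors described above can be sketched as follows; the Python class shapes and the show helper below are illustrative, not part of the original SDD:

```python
# Sketch of the Leaf/Node constructors used to build the syntax tree
# for a - 4 + c bottom-up. Field names are illustrative.

class Leaf:
    def __init__(self, op, val):
        self.op, self.val = op, val        # label and lexical value

class Node:
    def __init__(self, op, *children):
        self.op, self.children = op, children

# E -> E1 - T creates Node('-', ...); E -> E1 + T creates Node('+', ...)
t1 = Leaf("id", "a")
t2 = Leaf("num", 4)
e1 = Node("-", t1, t2)        # a - 4
t3 = Leaf("id", "c")
root = Node("+", e1, t3)      # (a - 4) + c

def show(n):
    # Render the tree as a fully parenthesized expression.
    if isinstance(n, Leaf):
        return str(n.val)
    return "(" + n.op.join(show(c) for c in n.children) + ")"

print(show(root))  # ((a-4)+c)
```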
Above figure shows the construction of a syntax tree for the input a - 4 + c. The
C
nodes of the syntax tree are shown as records, with the op field first. Syntax-tree
edges are now shown as solid lines. The underlying parse tree, which need not
SA
actually be constructed, is shown with dotted edges. The third type of line, shown
dashed, represents the values of E.node and T.node; each line points to the
appropriate syntax-tree node. At the bottom we see leaves for a, 4 and c, constructed
by Leaf.
Dependency Graphs: - A dependency graph depicts the flow of information among
the attribute instances in a particular parse tree; an edge from one attribute instance
to another means that the value of the first is needed to compute the second. Edges
express constraints implied by the semantic rules.
Ex : - Construct the dependency graph tree for a - 4 + c using the L-attributed
definition given below.
Solution: - The below dependency graph depicts the order of evaluation of the attributes in the parse tree for a - 4 + c.
A machine-independent intermediate representation gives two main benefits:
1. A compiler for the same language on a different machine can be developed by attaching a new back end to an existing front end (retargeting).
2. A compiler for different languages on the same machine can be developed by making use of multiple front ends and a single back end.
Postfix Notation:- An expression contains operands and operators. If the expression contains the operator in between the operands then it is an infix expression. If the expression contains the operator after the operands then it is a postfix expression. There is a lot of complexity in evaluating an infix expression, so infix expressions are converted to postfix expressions. Postfix evaluation is very easy.
Example:- Convert a*b+c/d into postfix notation.
Solution:- As * has higher precedence, convert a*b into postfix notation ab*,
i.e. {ab*}+c/d.
As / has the next higher precedence, convert c/d into postfix notation cd/,
i.e. {ab*}+{cd/}.
As + has the least precedence, convert {ab*}+{cd/} into postfix notation,
i.e. ab*cd/+.
Evaluation of postfix expression:-
1. Scan the expression from left to right.
2. If an operand is encountered, place it onto the stack.
3. If an operator is encountered, pop the topmost operands, perform the specified operation, and push the result back onto the stack.
4. Repeat steps 2 & 3 until the whole expression is scanned. Now the stack contains only one element, which is the final result.
For example, while evaluating ab*cd/+:
4. As '*' is an operator, pop the topmost operands, perform the specified operation, and push the result back onto the stack.
6. As '+' is an operator, pop the topmost operands, perform the specified operation, and push the result back onto the stack.
7. As the whole expression is scanned, the stack now contains only one element, which is the final result, i.e. 5.
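The evaluation steps above can be sketched as a stack loop; the operand values below (a=2, b=1, c=6, d=2) are chosen here only so that ab*cd/+ evaluates to 5, matching the worked result:

```python
# A sketch of the stack-based postfix evaluation algorithm.
def eval_postfix(tokens):
    stack = []
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    for tok in tokens:
        if tok in ops:
            b = stack.pop()          # right operand is on top of the stack
            a = stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok)) # operand: push onto the stack
    return stack[0]                  # one element left: the final result

# Evaluate ab*cd/+ with a=2, b=1, c=6, d=2 -> 2*1 + 6/2 = 5.0
print(eval_postfix(["2", "1", "*", "6", "2", "/", "+"]))  # 5.0
```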
Abstract syntax trees: - One form of intermediate code is called abstract syntax trees, or simply syntax trees. A syntax tree represents the hierarchical syntactic structure of the source program. The parser produces a syntax tree that is further translated into three-address code. In the syntax tree the leaf nodes are operands and the interior nodes are operators.
Construction of Syntax Trees:-
1. Identify the operator which has the least priority in the given expression. That operator becomes the root node. The sub-expression before the root-node operator becomes the left child and the sub-expression after the root-node operator becomes the right child.
2. Repeat the same process for the sub-expression which is the left child of the root node.
3. Repeat the same process for the sub-expression which is the right child of the root node.
4. Steps 1, 2 & 3 are repeated until the sub-expressions are operands of the given expression.
Example:- Construct the syntax tree for A*B+C/D.
1. In the expression '+' has the least priority. So it becomes the root. A*B becomes the left child and C/D becomes the right child.
2. In the left child expression '*' is the only operator. So it becomes the root. A becomes the left child and B becomes the right child.
3. In the right child expression '/' is the only operator. So it becomes the root. C becomes the left child and D becomes the right child.
        +
       / \
      *   /
     / \ / \
    A  B C  D
4. As all the sub expressions are operands the process is stopped and the tree
obtained is the syntax tree of the given expression.
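Steps 1-4 can be sketched as a recursive function that splits the expression at the lowest-priority operator; it assumes single-letter operands and no parentheses, and represents each interior node as an illustrative (op, left, right) tuple:

```python
# Build a syntax tree by repeatedly splitting at the least-priority
# operator: + and - first, then * and /. Scanning from the right makes
# the operators left-associative.

def build(expr):
    for level in ("+-", "*/"):                    # lowest priority first
        for i in range(len(expr) - 1, -1, -1):    # rightmost split point
            if expr[i] in level:
                return (expr[i], build(expr[:i]), build(expr[i + 1:]))
    return expr                                   # a single operand: a leaf

print(build("A*B+C/D"))  # ('+', ('*', 'A', 'B'), ('/', 'C', 'D'))
```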
Three-address code: - In a three-address instruction of the general form x = y op z, op stands for an operator. In three-address code, there is at most one operator on the right side of an instruction; that is, no built-up arithmetic expressions are permitted. Thus a source-language expression like x+y*z might be translated into the sequence of three-address instructions
t1 = y * z
t2 = x + t1
Three-address code is built from two concepts: addresses and instructions. An
ET
address can be one of the following:
1. A name. For convenience, we allow source-program names to appear as
addresses in three-address code. In an implementation, a source name is
replaced by a pointer to its symbol-table entry, where all information about the name is kept.
2. A constant. In practice, a compiler may deal with many different types of constants.
3. A compiler-generated temporary. It is useful to create a distinct name each time a temporary is needed.
Quadruples: - A quadruple (or just "quad") has four fields, which we call operator, operand1, operand2 and result. For instance, the three-address instruction x = y + z is represented by placing + in operator, y in operand1, z in operand2, and x in result. The following are some exceptions to this rule:
1. Instructions with unary operators like x = minus y or x = y do not use operand2. Note that for a copy statement like x = y, operator is =, while for most other operations the assignment operator is implied.
2. For instructions like param x, operator is param and operand1 is x, but the instruction uses neither operand2 nor result.
3. Conditional and unconditional jumps put the target label in result.
For example, a quadruple representation of the three-address code for the statement x = (a + b) * − c/d is shown in Table 1.

Table 1: Quadruple Representation of x = (a + b) * − c/d

    Operator   Operand1   Operand2   Result
1   +          a          b          t1
2   −          c                     t2
3   *          t1         t2         t3
4   /          t3         d          t4
5   =          t4                    x
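The same quadruples can be held in a simple list of 4-tuples; the tuple layout and the 'minus' opcode name below are illustrative choices, not a fixed convention:

```python
# Quadruples for x = (a + b) * -c / d as a list of
# (operator, operand1, operand2, result); None marks an unused field.

quads = [
    ("+",     "a",  "b",  "t1"),
    ("minus", "c",  None, "t2"),   # unary minus uses only operand1
    ("*",     "t1", "t2", "t3"),
    ("/",     "t3", "d",  "t4"),
    ("=",     "t4", None, "x"),    # copy: the = operator is explicit
]

for op, a1, a2, res in quads:
    print(op, a1, a2 or "", res)
```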
Triples: - In quadruples, the contents of the operand1, operand2, and result fields are normally pointers to the symbol-table records for the names represented by these fields. Hence, it becomes necessary to enter temporary names into the symbol table as they are created. This can be avoided by using the position of a statement to refer to a temporary value. If this is done, then a record structure with three fields is enough to represent a three-address statement: the first holds the operator, and the next two hold the values for operand1 and operand2, respectively. Such a representation is called a "triple representation". The contents of the operand1 and operand2 fields are either pointers to symbol-table records, or pointers to records (for temporary names) within the triple representation itself. For example, a triple representation of the three-address code for the statement x = (a+b)*−c/d is shown in Table 2.

Table 2: Triple Representation of x = (a + b) * − c/d

     Operator   Operand1   Operand2
(1)  +          a          b
(2)  −          c
(3)  *          (1)        (2)
(4)  /          (3)        d
(5)  =          x          (4)
Indirect triples: - Instead of listing the triples themselves in execution order, we can list the pointers to the triples in the desired order. This is called an indirect triple representation. For example, an indirect triple representation of the three-address code for the statement x = (a+b)*−c/d is shown in Table 3. The numbers in parentheses represent pointers into the triple structure.

Table 3: Indirect Triple Representation of x = (a + b) * − c/d

Statement list        Triple structure
11  (1)               (1)  +   a      b
12  (2)               (2)  −   c
13  (3)               (3)  *   (11)   (12)
14  (4)               (4)  /   (13)   d
15  (5)               (5)  =   x      (14)
IMPORTANT QUESTIONS
Code generation: Issues, target language, Basic blocks & flow graphs,
Simple code generator, Peephole optimization, Register allocation and
assignment.
Symbol Table: - Symbol table is a data structure that is used by compilers to hold
information of source program. The information is collected incrementally by the
analysis phases of a compiler and used by the synthesis phases to generate the target
code. Entries in the symbol table contain information about an identifier such as its
name, its type, its position of storage, and any other relevant information.
Symbol table format:- A Symbol table is a storage area used by the compiler to
store symbols and their associated properties. For every identifier in the source
program, there exists an entry in the symbol table. The properties for each name can
be type, scope and its binding.
Properties
Symbol P1 P2 P3 P4 .. ..
S1
S2
S3
:
:
:
In the above table P1, P2, P3,… are the properties of the symbol table and
S1,S2,S3… are the symbols encountered in the source program. The symbol table is
used by various phases as follows.
1. Lexical analyzer stores the information of the symbols in the symbol table.
2. Parser while checking the syntax of the statements uses the symbol table.
3. Semantic analysis phase refers symbol table for type checking.
4. Code generation refers symbol table to know run time memory allocated for
the symbols.
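A minimal sketch of such a table is a dictionary mapping each identifier to its property record; the method names and property keys below are illustrative:

```python
# A minimal symbol-table sketch: one entry per identifier, holding a
# record of properties that later phases can extend.

class SymbolTable:
    def __init__(self):
        self.entries = {}

    def insert(self, name, **props):
        # Create the entry if missing, then add/update properties.
        self.entries.setdefault(name, {}).update(props)

    def lookup(self, name):
        # Returns the property record, or None if the name is unknown.
        return self.entries.get(name)

st = SymbolTable()
st.insert("rate", type="float", address=45)     # from the declaration
st.insert("rate", lines_referenced=[6, 9])      # added by a later phase
print(st.lookup("rate")["type"])  # float
```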
The most important property of the symbol table is that it should be easily editable and growable: new properties of a symbol may be added in different phases of the compiler. Hence, the symbol table should be flexible enough to accommodate them. Depending on how symbol names are stored, symbol tables are of two types: fixed length and variable length.
Fixed Length:- In a fixed-length symbol table, the length of every symbol or name is fixed. The size of the table is still growable depending on the number of symbols in the program.
Symbol          Properties
a
b
f a c t
s u m
:
:
The advantage of using a fixed size is to limit the maximum length of any symbol in the language.
The disadvantage is that a shorter symbol wastes the unused memory allocated to it.
Variable Length:- A variable-length symbol table does not impose any constraint on the maximum length of a symbol. If a symbol needs only three cells then only three cells are allocated for it in a separate array. The symbol table itself contains only the starting index of each symbol in the array. The array stores each symbol, and a special character ($) is used to separate the symbols.
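The variable-length scheme can be sketched as one character array plus a table of starting indices, with '$' as the separator (the function names here are illustrative):

```python
# Variable-length symbol storage: all names live in one character
# array; the table records only each symbol's starting index.

names = ""          # the character array holding all names
starts = []         # symbol table: starting index of each name

def add_symbol(sym):
    global names
    starts.append(len(names))
    names += sym + "$"          # '$' separates the stored names

def get_symbol(i):
    end = names.index("$", starts[i])
    return names[starts[i]:end]

for s in ("a", "fact", "sum"):
    add_symbol(s)
print(names)          # a$fact$sum$
print(get_symbol(1))  # fact
```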
Methods of organizing the symbol table: - There exist many ways to organize a symbol table. Among these methods, ordered and unordered symbol tables are simple to implement.
Unordered Symbol Table:- When the declaration of a variable is encountered, the variable is entered into the symbol table. (Non-block-structured languages use implicit declarations.) When a new symbol is to be inserted, the table is searched (i.e. a lookup is done), and if the symbol is not found it is inserted as a new entry. Here the symbol entries are made in the order of their declaration, so the table is unordered.
Disadvantages:-
1. For a large size, an unordered symbol table is not suitable because more time is consumed for searching and inserting operations.
2. For direct generation of a cross-reference listing, the unordered symbol table needs to be sorted first.
Ordered Symbol Table:- In an ordered symbol table, the entries are kept sorted (for example, alphabetically) by symbol name.
Advantages :-
1. The lookup (i.e. searching) operation is simplified for an ordered symbol table.
Disadvantages: -
1. The insertion operation needs an average of (n+1)/2 record moves, as records are to be kept in alphabetic order.
Attributes of a symbol table: - Some of the attributes of the symbol table are:
Variable Name:- The name of a variable is a compulsory attribute of a symbol table, as the variable name helps in identifying a variable, which is required by the code generator and semantic analyzer.
Address:- Each variable in a program is associated with an object-code address. The address gives the relative location for a variable's value or values at run time. When a variable is first encountered or declared, its object-code address is entered into the symbol table; whenever the variable is referred to later in the source program, its object-code address is obtained from the symbol table.
Line Declared:- The number of the line at which the variable is declared; this integer value is used to populate the line declared column of the symbol table.
Lines Referenced:- If an already declared variable is referenced at some other lines in the program, then these line numbers, separated by commas, are indicated in the lines referenced attribute of the symbol table. This could be difficult to handle if the variable is referenced on many lines.
Link:- The link field is used to generate a cross-reference listing which is ordered alphabetically by variable name. If the cross-reference listing feature is not required in a compiler, then attributes like line declared and lines referenced can be deleted from the symbol table.
i     21   2   0   2   7,10        2
avg   45   1   0   4   6,9,10      1
x     53   1   1   3   5,7,14,15   3
Data of the target code can be stored in three storage areas. They are static,
stack and Heap.
The size of the generated target code is fixed at compile time, so the compiler
can place the executable target code in a statically determined area Code, usually in
the low end of memory. Similarly, the size of some program data objects, such as
global constants, and data generated by the compiler, such as information to support
garbage collection, may be known at compile time, and these data objects can be
placed in another statically determined area called Static. One reason for statically
allocating as many data objects as possible is that the addresses of these objects can
be compiled into the target code. In early versions of Fortran, all data objects could
be allocated statically.
To maximize the utilization of space at run time, the other two areas, Stack and
Heap, are at the opposite ends of the remainder of the address space. These areas are
dynamic; their size can change as the program executes. These areas grow towards
each other as needed. The stack is used to store data structures called activation
records that get generated during procedure calls. The stack grows towards lower
addresses, the heap towards higher.
Activation Records:- Procedure calls and returns are usually managed by a run-time
stack called the control stack. Each live procedure has an activation record on the
control stack. If one procedure calls another procedure, the latter procedure has its
activation record at the top of the stack.
SA
The contents of activation records vary with the language being implemented. The kinds of data that might appear in an activation record include temporary values, local data, saved machine status, an access link, a control link, actual parameters, and the returned value, though the exact layout varies with the language in general.
Memory allocation: - Allocating run-time storage can be tricky because the same name in a program text can refer to multiple locations at run
time. The two memory allocation techniques are
1. Static Memory Allocation
2. Dynamic Memory Allocation.
Static Memory Allocation:- The storage-allocation decision is static, if the storage
allocation is done at compile time.
Dynamic Memory Allocation:- The storage-allocation decision is dynamic, if the
storage allocation is done at execution time.
Many compilers use some combination of the following two strategies for
dynamic storage allocation. Non-Block structured languages uses Static Memory
Allocation.
Storage Allocation Schemes:- Depending upon where the activation records of the
procedures are stored, the storage allocation schemes are divided into three types.
They are
1. Static Allocation
2. Stack Allocation
3. Heap Allocation
Static Allocation:- Static allocation allocates memory for the activation record at compile time. The compiler uses the type of a variable to determine the storage required. The address assigned to each variable is fixed at compile time. FORTRAN uses activation records stored in the static data area.
Example:-
add( )
{
    ------
    average( )
}
average( )
{
    ------
}
Disadvantages:-
1. The size of the object should be known in advance.
2. Recursive procedures cannot be implemented in static allocation.
3. Dynamically created objects cant be used as the allocation is static.
Stack Allocation:- Almost all compilers for languages that use procedures, functions,
or methods use stack as a part of their run-time memory. Each time a procedure is
called, the activation record of the procedure is pushed onto a stack, and when the procedure terminates, that activation record is popped off the stack. This arrangement allows memory to be shared by procedure calls whose durations do not overlap in time.
Example:-
add( )
{ ------
 average( )
}
average( )
{ ------
 print( )
}
print( )
{ ------
}
Advantages:-
1. Recursion can be implemented.
2. Dynamically created objects can be used, as the allocation is on the stack.
Disadvantages:-
1. More time is spent in pushing and popping activation records.
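Stack allocation is what makes recursion possible: each active call has its own record. A minimal C sketch (the function name fact is illustrative):

```c
/* Each call to fact pushes a fresh activation record holding its own copy
   of n; this is why recursion requires stack, not static, allocation. */
int fact(int n) {
    if (n <= 1)
        return 1;
    return n * fact(n - 1);   /* the callee's record sits above the caller's */
}
```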
Heap Allocation:- The heap is the portion of the memory that is used for data that
lives indefinitely, or until the program explicitly deletes it. While local variables
typically become inaccessible when their procedures end, many languages enable us
to create objects or other data whose existence is not tied to the procedure that
creates them. For example, both C++ and Java use the new operator to create objects that may be passed from procedure to procedure, so they continue to exist long after the procedure that created them has terminated. Such objects are stored on a heap.
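A minimal C sketch of this behaviour, using malloc as the C analogue of new; the Node type and make_node are illustrative names, not from the text:

```c
#include <stdlib.h>

typedef struct Node { int value; } Node;

/* make_node's activation record disappears when it returns, but the Node it
   allocated lives on the heap until the program explicitly frees it. */
Node *make_node(int v) {
    Node *n = malloc(sizeof(Node));   /* heap storage, not tied to this call */
    if (n)
        n->value = v;
    return n;                          /* caller must eventually free it */
}
```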
2. Referencing deleted data is a dangling-reference error.
Dangling Reference in storage allocation:- A dangling reference occurs in static and stack storage allocation when a deallocated object is still referenced by an object in an activation record.
Ex:-
procedure add
{
 a,b,sum,*c:integer;
 sum=a+b;
 c=proc(b)
}
procedure proc(d: integer)
{
avg: integer;
avg=d/2;
return(&avg);
}
When the activation record of proc( ) is removed, its local variables are also deleted. After proc( ) terminates, control returns to the main program at the line c=proc(b). Here c is an integer pointer pointing to the location returned by proc( ). The proc( ) procedure returns the address of the variable avg, but avg is already deallocated. Pointer c thus points to already deallocated data, which is known as a dangling reference. The dangling-reference problem causes pointer c to point to:
a garbage value, if no other variable is allocated, or
some other location, if the space of avg was allocated to some other data.
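The same bug appears in C when a function returns the address of a local variable. A hedged sketch of the fix, assuming heap allocation is acceptable (proc_fixed is an illustrative name):

```c
#include <stdlib.h>

/* The dangling version would be:
 *     int *proc(int d) { int avg = d / 2; return &avg; }
 * which is a bug: avg vanishes when proc's activation record is popped.
 * proc_fixed avoids the dangling reference by placing the result on the heap. */
int *proc_fixed(int d) {
    int *avg = malloc(sizeof(int));   /* heap storage outlives the call */
    if (avg)
        *avg = d / 2;
    return avg;                        /* caller must free the result */
}
```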
as the value of the corresponding formal parameter. Uses of the formal parameter in
the code of the called program are implemented by following this pointer to the
location indicated by the caller. Changes to the formal parameter thus appear as
changes to the actual parameter. If the actual parameter is an expression, however,
then the expression is evaluated before the call, and its value stored in a location of
its own. Changes to the formal parameter change this location, but can have no
effect on the data of the caller. Call-by-reference is used for "ref" parameters in C++
and is an option in many other languages. It is almost essential when the formal
parameter is a large object, array, or structure.
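C itself passes parameters by value, but the pointer idiom sketched below simulates call-by-reference: changes through the pointer appear as changes to the caller's actual parameter. The function name increment is illustrative:

```c
/* Passing a pointer simulates call-by-reference in C: the callee follows
   the pointer to the location supplied by the caller. */
void increment(int *p) {
    *p = *p + 1;    /* updates the caller's variable, not a local copy */
}
```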
The third mechanism - call-by-name - was used in the early programming
language Algol 60. It requires that the callee execute as if the actual parameter were
substituted literally for the formal parameter in the code of the callee, as if the
formal parameter were a macro standing for the actual parameter (with renaming of
local names in the called procedure, to keep them distinct). When the actual
2. Reusability of memory can be achieved with the help of garbage collector.
Disadvantages:-
1. The execution of the program is stopped for some time when the garbage
collector is automatically invoked.
2. Sometimes a situation like thrashing may occur due to the garbage collector. Assume the garbage collector is called to obtain some free space, but almost all the nodes are referenced by external pointers. The garbage collector executes and returns only a small amount of space. The system then invokes the garbage collector again for more free space, and once again it returns very little. This happens repeatedly, so the garbage collector is executing almost all the time. This process is called thrashing. Thrashing must be avoided for better system performance.
Partitioning three-address instructions into basic blocks: - First, we determine those
instructions in the intermediate code that are leaders, that is, the first instructions in
the basic block. The rules for finding leaders are:
1. The first three-address instruction in the intermediate code is a leader.
2. Any instruction that is the target of a conditional or unconditional jump is a
leader.
3. Any instruction that immediately follows a conditional or unconditional jump
is a leader.
Then, for each leader, its basic block consists of itself and all instructions up to
but not including the next leader or the end of the intermediate program.
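The three leader rules can be sketched as follows, under an assumed representation in which each three-address instruction records whether it is a jump and, if so, the index of its target; the Instr type and find_leaders are illustrative names:

```c
#include <stdbool.h>

typedef struct {
    bool is_jump;      /* conditional or unconditional jump */
    int  target;       /* index of the jump target, valid when is_jump */
} Instr;

void find_leaders(const Instr *code, int n, bool *leader) {
    for (int i = 0; i < n; i++)
        leader[i] = false;
    if (n > 0)
        leader[0] = true;                      /* rule 1: first instruction */
    for (int i = 0; i < n; i++) {
        if (code[i].is_jump) {
            leader[code[i].target] = true;     /* rule 2: target of a jump */
            if (i + 1 < n)
                leader[i + 1] = true;          /* rule 3: follows a jump */
        }
    }
}
```

Applied to the fact( ) example below (indices 0-8, with the conditional jump at index 2 targeting 8 and the goto at index 7 targeting 2), the leaders come out as instructions 1, 3, 4, and 9 in the 1-based numbering used in the listing.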
First, convert the following source code into three-address code. Here we assume each array element occupies 8 bytes.
Source program:
int fact( x )
{
 int f = 1;
 i = 2;
 while( i <= x )
 {
  f = f * i;
  i = i + 1;
 }
 print(f);
}

Three-address code:
1) f = 1
2) i = 2
3) if( i > x ) goto 9
4) t1 = f * i
5) f = t1
6) t2 = i + 1
7) i = t2
8) goto 3
9) print(f)
Many compilers follow this strategy: they generate naive code and then improve the quality of the target code by applying "optimizing" transformations to the target program.
A simple but effective technique for locally improving the target code is
peephole optimization, which is done by examining a sliding window of target
instructions (called the peephole) and replacing instruction sequences within the
peephole by a shorter or faster sequence, whenever possible. Peephole optimization
can also be applied directly after intermediate code generation to improve the
intermediate representation.
The peephole is a small, sliding window on a program. The code in the peephole
need not be contiguous, although some implementations do require this. It is
characteristic of peephole optimization that each improvement may spawn
opportunities for additional improvements. In general, repeated passes over the target
code are necessary to get the maximum benefit. Some program transformations that
are characteristic of peephole optimizations:
Eliminating Unreachable Code:- Suppose debug has been set to 0 in the program; then consider the sequence
 if debug != 1 goto L2
 print debugging information
L2:
Now the argument of the first statement always evaluates to true, so the
statement can be replaced by goto L2. Then all statements that print debugging
information are unreachable and can be eliminated one at a time.
Flow-of-Control Optimizations:- Simple intermediate code-generation algorithms
frequently produce jumps to jumps, jumps to conditional jumps, or conditional jumps
to jumps. These unnecessary jumps can be eliminated in either the intermediate code
or the target code by the following types of peephole optimizations. We can replace
the sequence
 goto L1
 ...
L1: goto L2
by the sequence
 goto L2
 ...
L1: goto L2
If there are now no jumps to L1, then it may be possible to eliminate the
statement L1: goto L2 provided it is preceded by an unconditional jump.
Similarly, the sequence
 if a < b goto L1
 ------
L1: goto L2
can be replaced by the sequence
 if a < b goto L2
 ------
L1: goto L2
Such instructions can also be used in code for statements like x = x + 1.
Register Allocation and Assignment:- Instructions involving only register operands are faster than those involving memory operands. Therefore, efficient utilization of registers is vitally important in generating good code. One approach to register allocation and assignment is to assign specific values
in the target program to certain registers. For example, assign base addresses to one
group of registers, arithmetic computations to another, the top of the stack to a fixed
register, and so on. This approach has the advantage that it simplifies the design of a
code generator. Its disadvantage is that, applied too strictly, it uses registers
inefficiently; certain registers may go unused over substantial portions of code, while
unnecessary loads and stores are generated into the other registers. Nevertheless, it is
reasonable in most computing environments to reserve a few registers for base
registers, stack pointers, and allow the remaining registers to be used by the code
generator as it sees fit. The various techniques for register allocation are
Global Register Allocation:- The code generation algorithm used registers to hold
values for the duration of a single basic block. However, all live variables were stored
at the end of each block. To save some of these stores and corresponding loads, we
might arrange to assign registers to frequently used variables and keep these registers
consistent across block boundaries (globally). Since programs spend most of their
time in inner loops, a natural approach to global register assignment is to try to keep a
frequently used value in a fixed register throughout a loop. One strategy for global
register allocation is to assign some fixed number of registers to hold the most active
values in each inner loop. The selected values may be different in different loops.
Registers not already allocated may be used to hold values local to one block as in
Section. This approach has the drawback that the fixed number of registers is not
always the right number to make available for global register allocation.
Example: Consider the basic blocks in the inner loop as shown in figure and
calculate the usage counts of each variable and show what variables are
stored in global registers.
Assume registers R0, R1, and R2 are allocated to hold values throughout the loop.
Variables live on entry into and on exit from each block are shown in Fig.
To evaluate the usage count for x = a, we observe that a is live on exit from B1 and used in B2 and B3. In general, for a loop L,
usage count(x) = sum over blocks B in L of ( use(x, B) + 2 * live(x, B) ).
Thus
usage count for a = use in B2 + use in B3 + 2*live from B1 = 4
usage count for b = use in B1 + 2*live from B4 + 2*live from B3 = 5
usage count for c = use in B1 + use in B3 + use in B4 = 3
usage count for d = use in B1 + use in B2 + use in B3 + use in B4 + 2*live from B1 = 6
If an outer loop L1 contains an inner loop L2, the register allocation is as follows. If a variable x is allocated a register in L2, it need not be allocated a register in L1 - L2. If we allocate x a register in L2 but not in L1, we must load x on entrance to L2 and store x on exit from L2.
Register Allocation by Graph Coloring:- When a register is needed for a computation
but all available registers are in use, the contents of one of the used registers must be
stored (spilled) into a memory location in order to free up a register. Graph coloring is
a simple, systematic technique for allocating registers and managing register spills.
In the method, two passes are used. In the first, target-machine instructions are
selected as though there are an infinite number of symbolic registers. Once the
instructions have been selected, a second pass assigns physical registers to symbolic
ones. The goal is to find an assignment that minimizes the cost of spills. In the second pass, a register-interference graph (RIG) is constructed: there is a node for each temporary, and an edge between any two temporaries if they are live simultaneously at some point in the program. Two temporaries can be allocated to the same register if there is no edge connecting them.
4. Now all the nodes have fewer than four neighbours, so remove all of them and add them to the stack. Thus S = { f, e, c, b, d, a }.
5. Start assigning colours to f, e, c, b, d, a. As k = 4, the graph can be coloured with a minimum of 4 colours. Assign a colour to each node by checking the adjacent coloured nodes. Repeat the process until all the nodes are coloured.
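The colouring step can be sketched in C with a greedy variant of the select phase; the adjacency-matrix representation, k = 2, and the name color_graph are illustrative, and a full allocator would also implement the simplify and spill machinery:

```c
#include <stdbool.h>

#define N 4   /* number of temporaries in this illustrative graph */
#define K 2   /* available registers (colours) */

/* Give each node the lowest colour not used by an already-coloured
   neighbour; returns false if some node cannot be coloured, meaning a
   temporary would have to be spilled to memory. */
bool color_graph(const bool adj[N][N], int color[N]) {
    for (int v = 0; v < N; v++)
        color[v] = -1;
    for (int v = 0; v < N; v++) {
        bool used[K] = {false};
        for (int u = 0; u < N; u++)
            if (adj[v][u] && color[u] >= 0)
                used[color[u]] = true;         /* colour taken by neighbour */
        int c = 0;
        while (c < K && used[c])
            c++;
        if (c == K)
            return false;                      /* spill needed */
        color[v] = c;
    }
    return true;
}
```

For a chain of four temporaries t0-t1-t2-t3 (each live together only with its neighbours), two registers suffice, since adjacent nodes simply alternate colours.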
induction variables, elimination of common sub-expressions and replacement
of compile time computations.
3. At the target code level, the compiler optimizes on choosing proper machine
resources. This includes the usage of registers for heavily used variables,
choosing suitable addressing modes for the target machine and peephole
optimizations. The richest source of optimization is in the efficient use of
registers and instruction set of a machine.
The properties of code optimization are listed below:
1. The transformation should preserve the meaning of programs, i.e. optimization should not change the output of the program or produce an error.
2. The transformation should improve the speed efficiency of the program and/or
reduce the space occupied by the program.
example b+c is not a common expression, since one of its operands has been changed before the expression is used again. In the second example r2+r3 is a common expression, so its result is stored in a new variable temp, which is assigned to r4 instead of recomputing the expression r2+r3.
Copy propagation:- Statements of the form f := g are called copy statements or copies. When common expressions are eliminated, copy statements are introduced, and hence they have to be eliminated. We can use g instead of f after the copy statement.
Example:
 x[i] = a;            x[i] = a;
 ----------           ----------
 sum = x[i] + a;      sum = a + a;
we can use ‘a’ instead of x[i] in further calculations. So in the statement sum = x[i] +
a, x[i] is replaced with ‘a’ which produces the statement sum = a + a;
Elimination of dead code:- A piece of code is dead if it is not reachable or the values it computes are never used anywhere in the program; such code can be removed from the program safely. An assignment to a variable is dead code if the value of this variable is not used in the subsequent program, or if there is always another assignment to the same variable before its value is used.
Copy propagation often makes copy statements into dead code which can be
easily eliminated. In the above example as x[i]=a is a copy statement we replaced
x[i] with ‘a’. If x[i] has no further use in the program then x[i] = a becomes a dead
statement and can be eliminated.
Loop optimization:- The major source of code optimization is loops, especially the
inner loops. Most of the run-time is spent inside the loops which can be reduced by
reducing the number of instructions in an inner loop. Important techniques of loop
optimization are
1. Code motion
2. Elimination of induction variables
3. Strength reduction.
Code motion: - Code motion reduces the number of instructions in a loop by moving
some loop-invariant instructions outside a loop. Loop-invariant computations are those instructions or expressions that result in the same value independent of the number of times a loop is executed. Loop-invariant instructions inside the loop are identified and moved to the beginning of the loop.
The invariant computation max-2 is moved before the loop and its result is used in the loop. This eliminates the necessity of calculating max-2 every time the loop repeats.
In the above example there are three induction variables i,j and k which take
on the values 1,2,3, ... , 10 each time through the loop. Suppose that the values of
variables j and k are not used after the end of the loop then we can eliminate them
from the function fun ( ) by replacing them by variable i.
Strength Reduction: - Strength Reduction is the process of replacing expensive
operations by their equivalent cheaper operations on the target machine. On many
machines a multiplication operation takes more time than addition. On such
machines the speed of the object code can be increased by replacing a multiplication
by an addition.
For example, when we want to calculate multiples of 2, we can use a multiplication in the loop to compute the ith multiple. But the same result can be obtained through addition, since multiplication is nothing but repeated addition. This replaces the expensive multiplication operation by its equivalent, less expensive addition operations.
Before strength reduction:
i=1;
while ( i <= 10 )
{
 prod=2*i;
 i = i + 1;
 -----------
}

After strength reduction:
i=1;
prod=0;
while ( i <= 10 )
{
 prod=prod+2;
 i = i + 1;
 -----------
}
Frequency reduction:-
Loop unrolling:- In order to reduce the number of iterations of a loop, the body of the loop is duplicated.
Before unrolling:
i=1;
while ( i <= n )
{
 a[i]=b[i];
 i = i + 1;
 -----------
}

After unrolling:
i=1;
while ( i <= n )
{
 a[i]=b[i];
 i = i + 1;
 a[i]=b[i];
 i = i + 1;
}
In the above example we are transferring the elements of array b into array a. In the unrolled version we transfer two elements per iteration, halving the number of iterations.
Folding:- Constant folding is a third optimization technique that evaluates constant expressions at compile time and replaces such expressions by their computed values. For example, the constant expression 3 * 3 can be replaced by 9 at compile time. Often the use of symbolic constants results in constant expressions.
Before folding:
i=1;
while ( i <= 10 )
{
 prod=2*2;
 i = i + 1;
 -----------
}

After folding:
i=1;
while ( i <= 10 )
{
 prod=4;
 i = i + 1;
 -----------
}
We start the construction of the DAG from the first statement, a = b + c. Since b and c are defined elsewhere and used in this block, they are designated b0 and c0. For the expression a = b + c, the operator + becomes the root, and b0 and c0 become the left and right children respectively. As the result of the expression is stored in a, the node + is labeled a. Similarly we repeat the process for the second statement. For the third statement, c = b + c, the use of b refers to the node labeled - , because that is the most recent definition of b. The node corresponding to the fourth statement, d = a - d, has the operator - and the nodes with attached variables a and d0 as children. Since the operator and the children are the same as those for the node corresponding to statement two, we do not create this node, but add d to the list of definitions for the node labeled -.
Applications of DAG:- The DAG representation of a basic block lets us perform
several code improving transformations on the code represented by the block.
1. We can eliminate local common sub expressions, that is, instructions that
compute a value that has already been computed.
When we generate the code from the DAG, the common expressions are eliminated and statements are automatically reordered as shown in fig 3.
4. We can apply algebraic laws to reorder operands of three-address instructions,
and sometimes there by simplify the computation.
The DAG-construction process can help us to apply general algebraic
transformations such as commutativity and associativity. For example,
suppose the language reference manual specifies that * is commutative; that
is, x* y = y*x. Before we create a new node labeled * with left child M and
right child N, we always check whether such a node already exists. However,
a = b * c
d = a + b
e = c * b
[DAG: a * node labeled a, e with children b0 and c0, and a + node labeled d whose children are the * node and b0]
In the above expression we have constructed the DAG for the first
expression in a normal manner. When considering the second expression we
have to construct the node + whose left child is a and right child is b. As we know a + b = b + a, we reorder the operands to make use of the already existing
node a. During the third statement we know * is commutative so we have
We can avoid recomputing an expression E by assigning the result of E to a variable x and using x in place of E, provided the operands of E have not changed in the interim.
Consider the TAC in the above figure. In block B1, 4*k is computed and is an available expression at B2. Block B2 has the same computation, 4*k, which can therefore be eliminated.
t2=a[m] is a copy statement. So after this statement we can use t2 instead of a[m].
Dead code consists of statements which compute values that never get used. While the programmer is unlikely to introduce any dead code intentionally, it may appear as the result of previous transformations.
In the first fig there are copy statements and when they are eliminated we get
figure 2. In figure 2 t1 has assigned a value of m and is not further used, so it
becomes a dead code and can be eliminated. Similarly t5 has assigned a value of m
and is not further used, so it becomes a dead code and can be eliminated as shown in
fig 3.
Elimination of induction variables:- Induction variables are loop variables that change their value every time the loop repeats, i.e. they either get incremented or decremented. Remove unnecessary induction variables from the loop by substituting their uses with another basic induction variable.
In the above example r1 and r2 are two induction variables which computes
the same value every time the loop repeats. So use one induction variable i.e r2
instead of r1. Thus the fig 2 consists of only one induction variable.
Procedure Inlining:- Procedure inlining, the replacement of a procedure call by the body of the procedure, is particularly useful in code optimization. This method speeds up execution when the procedures are simple.
Normally when the procedure is called the calling program is stopped and the
procedure is copied on to the main memory and then executes the procedure. When
the procedure is terminated the calling program continues its execution. So there will
be a lot of internal work to be done when a procedure is called and terminated.
Now if there are many calls to that procedure, and the procedure contains only a few lines of code, then such jumping between memory locations becomes a performance overhead. It ultimately slows down the execution of the program. Hence the procedure body is expanded in place of the call, so the whole code is available continuously and this overhead is avoided.
When an inline procedure is called 5 times the code is copied into the program
5 times which avoids jump to the procedure. The code size may slightly increase but
the performance of the compiler may be improved.
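A minimal C sketch of an inlining candidate; the function names square and sum_of_squares are illustrative, and the inline keyword only suggests (does not force) expansion to the compiler:

```c
/* square's body is tiny, so expanding it at each call site removes the
   call/return overhead at a small cost in code size. */
static inline int square(int x) {
    return x * x;
}

int sum_of_squares(int a, int b) {
    return square(a) + square(b);   /* the compiler may expand both calls inline */
}
```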
t1 := a + b    MOV a, R0
               ADD b, R0
t2 := c - d    MOV c, R1
               SUB d, R1
t3 := e + t2   MOV R0, t1
               MOV e, R0
               ADD R0, R1
t4 := t1 + t3  MOV t1, R0
               ADD R1, R0
               MOV R0, t4
Now if we change the ordering sequence of the above three address code.
t2 := c - d    MOV c, R0
               SUB d, R0
t3 := e + t2   MOV e, R1
               ADD R0, R1
t1 := a + b    MOV a, R0
               ADD b, R0
t4 := t1 + t3  ADD R1, R0
               MOV R0, t4
In the first case the assembly code contains 10 lines. After rearranging the three-address code sequence, the assembly code contains 8 lines. So by rearranging the sequence of instructions we can generate efficient code using a minimum number of registers. Thus here, an optimal order means the order that yields the shortest instruction sequence.
Now we apply the transformations on block B5 and B6.
In block B5 there are common subexpressions 4 * i and 4 * j; we will remove these common subexpressions and the code will be
Now we will apply the global transformations once again on B5. As val contains a[t2], and a[t2] is already stored in t3, we can replace val by t3. Similarly, as a[t4] is already computed in block B3 and its value is stored in t5, we can eliminate t9. The optimized block will then be
Now we will apply the global transformations once again on B6. As a[t2] is already stored in variable t3, the optimized block B6 will then be