Assembler:- A compiler may produce assembly language as its output, because assembly language is easier to produce and easier to debug. The assembly language is then processed by a program called an assembler that produces relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may
have to be linked together with other relocatable object files and library files into
the code that actually runs on the machine. The linker resolves external memory
addresses, where the code in one file may refer to a location in another file. The
loader then puts together all of the executable object files into memory for
execution.
www.sacet.ac.in Page 1
DEPARTMENT OF CSE
Figure:- Compiler
Interpreter:- An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter directly executes the operations specified in the source program on inputs supplied by the user.
Figure:- An interpreter
The machine-language target program produced by a compiler executes much faster than an interpreter. An interpreter, however, can usually give better error diagnostics than a compiler, because it executes the source program statement by statement.
Figure:- An Assembler
Linker:- The linker is a program which links the object programs of functions to the
main program.
Loader:- The loader loads the program from the hard disk into main memory, loads the starting address of the program into the program counter (PC), and makes the program ready for execution.
Analysis phase:- The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of it. It also collects information about the source program and stores it in the symbol table. Lexical Analyzer, Syntax Analyzer, and Semantic Analyzer are the parts of this phase. The analysis part is the front end of the compiler.
Synthesis phase:- The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table. Intermediate Code Generator, Code Optimizer, and Code Generator are the parts of this phase. The synthesis part is the back end of the compiler.
Intermediate Code Generator:- After syntax and semantic analysis, the compiler generates an explicit machine-like intermediate representation, which should be easy to produce and easy to translate into target machine code. One such representation, three-address code, consists of a sequence of assembly-like instructions with three operands per instruction.
Code Optimizer:- The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result: target code that executes faster and consumes less power.
Code Generator:- The code generator takes the intermediate representation of the source program and converts it into the target code. If the target language is machine code, registers or memory locations are selected for each of the variables used by the program. Then, the intermediate instructions are translated into sequences of machine instructions that perform the same task.
Symbol Table:- The symbol table is a data structure containing a record for each
variable name, with fields for the attributes of the name. The data structure should be
designed to allow the compiler to find the record for each name quickly and to store
or retrieve data from that record quickly.
Error Handler:- The error handler reports the presence of an error, including the place in the source program where the error is detected. Common programming errors can occur at many different levels.
Lexical errors include misspellings of identifiers, keywords, or operators.
Syntax errors include misplaced semicolons or extra or missing braces.
Semantic errors include type mismatches between operators and operands.
Logical errors can be anything from incorrect reasoning on the part of the
programmer to the use in a C program of the assignment operator = instead of
the comparison operator ==.
The main goals of the error handler are:
1. Report the presence of errors clearly and accurately.
2. Recover from each error quickly enough to detect subsequent errors.
3. Add minimal overhead to the processing of correct programs.
3. Compiler: For the first time of compilation the process may take more time, but as the target program is saved on the hard disk, later executions do not repeat the translation. Interpreter: For the first time of interpretation the process may complete within less time, but as the target program is not saved, the translation is repeated on every execution.
Role of Lexical Analysis: - Lexical analyzer is the first phase of a compiler. The
main task of the lexical analyzer is to read the input characters of the source
program, group them into lexemes, and produce tokens for each lexeme in the
source program. The stream of tokens is sent to the parser for syntax analysis. When
lexical analyzer discovers a lexeme constituting an identifier, it interacts with the symbol table to enter that lexeme into the symbol table. Commonly, the interaction is implemented by having the parser call the lexical analyzer. The getNextToken command given by the parser causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce the next token, which it returns to the parser.
Since the lexical analyzer is the part of the compiler that reads the source text,
it may perform certain other tasks besides identification of lexemes. One such task is
stripping out comments and whitespace (blank, newline, tab, and perhaps other
characters that are used to separate tokens in the input). Another task is correlating
error messages generated by the compiler with the source program. For instance, the
lexical analyzer may keep track of the number of newline characters seen, so it can
associate a line number with each error message. In some compilers, the lexical analyzer is divided into a cascade of two processes: scanning, which removes comments and compacts whitespace, and lexical analysis proper, which produces tokens.
The analysis portion of a compiler is separated into lexical analysis and parsing for the following reasons:
1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to
apply specialized techniques that serve only the lexical task, not the job of
parsing. In addition, specialized buffering techniques for reading input
characters can speed up the compiler significantly.
3. Compiler portability is enhanced. Input-device-specific peculiarities can be
restricted to the lexical analyzer.
Token, patterns and Lexemes: - When discussing lexical analysis, we use three
related but distinct terms:
A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier.
A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword.
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
In many programming languages, the following classes cover most or all of the tokens:
Lexical Errors:- It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For instance, in a statement beginning fi (a == f(x)), the string fi could be a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let the parser handle the error due to the transposition of the letters. However, suppose a situation arises where the lexeme does not satisfy any of the patterns. The simplest
recovery strategy is "panic mode" recovery. We delete successive characters from
the remaining input, until the lexical analyzer can find a well-formed token at the
beginning of what input is left. This recovery technique may confuse the parser, but
in an interactive computing environment it may be quite adequate. Other possible
error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Transformations like these may be tried in an attempt to repair the input. The simplest such strategy is to see whether a prefix of the remaining input can be
transformed into a valid lexeme by a single transformation. This strategy makes
sense, since in practice most lexical errors involve a single character. A more general
correction strategy is to find the smallest number of transformations needed to
convert the source program into one that consists only of valid lexemes, but this
approach is considered too expensive in practice to be worth the effort.
Identifiers, for example, can be described by the regular expression letter_ ( letter_ | digit )*. The vertical bar means union, the parentheses are used to group subexpressions, and the star means "zero or more occurrences of". The letter_ at the beginning indicates that the identifier must begin with a letter or an underscore (_). Regular expressions are built recursively out of smaller regular expressions.
Regular Definitions:- A regular definition for an alphabet Σ is a sequence of definitions of the form d1 → r1, d2 → r2, ..., dn → rn, where
1. Each di is a new symbol, not in Σ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di−1}.
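The identifier pattern letter_ ( letter_ | digit )* can be checked directly in C without a regex library. The function name below is an illustrative assumption:

```c
#include <ctype.h>

/* Direct simulation of letter_ ( letter_ | digit )* : the first character
   must be a letter or underscore, the rest letters, digits or underscores. */
int is_identifier(const char *s) {
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;                       /* must begin with letter or _ */
    for (int i = 1; s[i] != '\0'; i++)
        if (!(isalnum((unsigned char)s[i]) || s[i] == '_'))
            return 0;                   /* remaining: letter_ | digit */
    return 1;
}
```

Note that an empty string is rejected by the first test, since isalpha('\0') is false.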
Example 2 : Unsigned numbers (integer or floating point) are strings such as 5280,
0.01234, 6.336E4, or 1.89E-4. Write a regular definition for Unsigned numbers in C
language.
Each state represents a condition that could occur during the process of scanning the
input looking for a lexeme that matches one of several patterns. Edges are
directed from one state of the transition diagram to another. Each edge is labeled by a
symbol or set of symbols. If we are in some state s, and the next input symbol is
a, we look for an edge out of state s labeled by a (and perhaps by other symbols, as
well). If we find such an edge, we enter the state of the transition diagram to
which that edge leads.
Example 1 : Draw a transition diagram for the relational operators (relop). The relational operators are <, <=, <>, >, >=, =
Relop → < | <= | <> | > | >= | =
The Structure of Lex Programs:- A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
1. The declarations section includes declarations of variables, manifest constants, and regular definitions.
2. The translation rules each have the form
Pattern {Action}
Each pattern is a regular expression, which may use the regular definitions of the declarations section. The actions are fragments of code, typically written in C, although other languages can also be used.
3. The third section holds whatever additional functions are used in the actions. Alternatively, these functions can be compiled separately and loaded with the lexical analyzer.
The lexical analyzer created by Lex works along with the parser as follows. When called by the parser, the lexical analyzer begins reading its remaining input, one character at a time, until it finds the longest prefix of the input that matches one of the patterns Pi. It then executes the associated action Ai. Typically, Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace or comments), then the lexical analyzer proceeds to find additional lexemes, until one of the corresponding actions causes a return to the parser. The lexical analyzer returns a single value, the token name, to the parser, but uses the shared integer variable yylval to pass additional information about the lexeme found, if needed.
Ex 1:- Write a lex program to identify parentheses, brackets, labels, numbers and strings in the given input.
"(" printf("%s is an open parenthesis\n",yytext);
")" printf("%s is a closed parenthesis\n",yytext);
"[" printf("%s is an open square bracket\n",yytext);
"]" printf("%s is a closed square bracket\n",yytext);
{label} printf("%s is a label\n",yytext);
{number} printf("%s is a number\n",yytext);
\".*\" printf("%s is a string\n",yytext);
%%
main(int argc,char **argv)
{
FILE *f;
f=fopen(argv[1],"r");
yyin=f;
yylex();
return 0;
}
Ex 2:- Write a lex program to convert upper case into lower case and vice versa in a
given string
/* Program to convert upper case into lower case and vice versa */
%{
int upcnt=0,lwcnt=0;
%}
%%
[a-z] { char ch = yytext[0];
ch=ch-32; /* lower to upper */
printf("%c",ch);
lwcnt++;
}
[A-Z] { char ch = yytext[0];
ch=ch+32; /* upper to lower */
printf("%c",ch);
upcnt++;
}
%%
int main()
{
yylex();
printf("\nLower case letters=%d Upper case letters=%d\n",lwcnt,upcnt);
return 0;
}
Ex 3:- Write a lex program to count the number of characters, words and lines in a given text
/* Program to count no of characters, words and lines in a given text */
%{
int nchar=0, nword=0, nline=0;
%}
%%
\n { nline++; nchar++; }
[^ \t\n]+ { nword++; nchar += yyleng; }
. { nchar++; }
%%
int main()
{
yylex();
printf("No of characters= %d\nNo of words=%d\nNo of lines=%d\n", nchar, nword, nline);
return 0;
}
IMPORTANT QUESTIONS
1. (a) What are the functions of pre-processing?
(b) Explain briefly, the need and functionality of linkers, assemblers and loaders.
2. (a) Mention the functions of linkers and loaders in pre-processing.
3. (a) Show the output of each phase of the compiler for the expression a := b + c * 50.
(b) Give and explain the diagrammatic representation of a language processing
system.
4. Discuss about Lexical Analysis and Role of Lexical Analysis.
5. Differentiate between Lexical Analysis and Parsing.
6. Define the terms Token, Pattern and Lexeme.
7. Explain briefly about Lexical Errors.
8. Define Regular Expressions and Regular definitions. Write Regular
Expressions for the language constructs such as Strings, Sequences and
Comments.
9. Define Transition diagram and draw transition diagrams for recognition of tokens, reserved words and identifiers.
Figure: Position of parser in compiler model
The parser reports any syntax errors in an intelligible fashion and tries to recover from commonly occurring errors so that it can continue processing the remainder of the program. Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the rest of the compiler for further processing.
Bottom-up parsers:- Bottom-up parsers start from the leaves and work their way up
SE
to the root. In either case, the input to the parser is scanned from left to right, one
symbol at a time. Parsers for the larger class of LR grammars are usually constructed
using automated tools. The bottom-up parsing can be implemented by using the
following techniques.
1. Shift-reduce parser
2. Operator Precedence parser
3. LR parsers
LR parsers are further subdivided into
a) SLR parser
b) LALR parser
c) CLR parser
Derivations:- The production rules are used to derive certain strings. The generation of a language using production rules is called derivation. A parse tree is a graphical representation of a derivation that filters out the order in which productions are applied to replace nonterminals. Each interior node of a parse tree represents the application of a production. The interior node is labeled with the nonterminal A in the left-hand side of the production; the children of the node are labeled, from left to right, by the symbols in the right-hand side of the production by which this A was replaced during the derivation.
Ambiguity:- A grammar that produces more than one parse tree for some sentence
is said to be ambiguous. Put another way, an ambiguous grammar is one that
produces more than one leftmost derivation or more than one rightmost derivation
for the same sentence.
The arithmetic expression grammar permits two distinct leftmost derivations
for the sentence id + id * id. The corresponding parse trees appear in Fig.
To construct a top-down parse tree for the input string w = cad with the grammar S → cAd, A → ab | a, begin with a tree consisting of a single node labeled S, and the input pointer pointing to c, the first symbol of w. S has only one production, so we use it to expand S and obtain the tree
of Fig.(a). The leftmost leaf, labeled c, matches the first symbol of input w, so we
advance the input pointer to a, the second symbol of w, and consider the next leaf,
labeled A.
Now, we expand A using the first alternative A → ab to obtain the tree of Fig. (b).
We have a match for the second input symbol, a, so we advance the input pointer to
d, the third input symbol, and compare d against the next leaf, labeled b. Since b
does not match d, we report failure and go back to A to see whether there is another
alternative for A that has not been tried, but that might produce a match.
In going back to A, we must reset the input pointer to position 2, the position
it had when we first came to A, which means that the procedure for A must store the
input pointer in a local variable. The second alternative for A produces the tree of
Fig.(c). The leaf a matches the second symbol of w and the leaf d matches the third
symbol. Since we have produced a parse tree for w, we halt and announce successful
completion of parsing.
Difficulties in top-down parsing:- There are various difficulties associated with top-
down parsing. They are
1. Backtracking is the major difficulty with top-down parsing. Choosing a wrong
production for expansion necessitates backtracking. Top-down parsing with
backtracking involves exponential time complexity with respect to the length
of the input.
2. Left recursive grammars cannot be parsed by top-down parsers since they may
create an infinite loop.
3. Grammar must be left factored before applying it as an input to top-down
parser.
4. Top-down parsers cannot parse the ambiguous grammar.
5. Top-down parsers are slow and debugging is very difficult.
Here "other" stands for any other statement. According to this grammar, the compound conditional statement
if E1 then S1 else if E2 then S2 else S3
has the parse tree shown above.
Fig:- Parse tree for a conditional statement
The above grammar is ambiguous, since the string
if E1 then if E2 then S1 else S2
has the two parse trees shown in Fig. below.
The idea is that a statement appearing between a then and an else must be "matched"; that is, the interior statement must not end with an unmatched or open then.
Elimination of Left Recursion:- A grammar is left recursive if it has a nonterminal A such that there is a derivation A ⇒ Aα for some string α. The immediate left recursion in productions of the form A → Aα | β can be eliminated by replacing them with
A → βA'
A' → αA' | ε
without changing the strings derivable from A. This rule by itself suffices for many grammars.
Example : Consider the following grammar
Left Factoring:- Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive, or top-down, parsing. In general, if A → αβ1 | αβ2 are two A-productions, and the input begins with a nonempty string derived from α, we do not know whether to expand A to αβ1 or αβ2. We may defer the decision by expanding A to αA'; then, after seeing the input derived from α, we expand A' to β1 or to β2. That is, left-factored, the original productions become
A → αA'
A' → β1 | β2
Example:- Consider the grammar and perform left-factoring.
T( )
{
    F( );
    B( );
}
B( )
{
    if (lookahead == *)
    {
        match( );
        F( );
        B( );
    }
}
A( )
{
    if (lookahead == +)
    {
        match( );
        T( );
        A( );
    }
}
F( )
{
    if (lookahead == id)
    {
        match( );
    }
    else if (lookahead == '(')
    {
        match( );
        E( );
        if (lookahead == ')')
            match( );
        else
            ERROR;
    }
    else
        ERROR;
}
Example:- Write a code for the recursive-descent parsing of the following grammar
expr → term rest
rest → + term rest | - term rest | ε
term → 0 | 1 | ..... | 9
void expr( )
{
    term( );
    rest( );
}
void rest( )
{
    if (lookahead == '+')
    {
        match('+'); term( ); rest( );
    }
    else if (lookahead == '-')
    {
        match('-'); term( ); rest( );
    }
    else
        return;   /* rest → ε */
}
void term( )
{
    if (isdigit(lookahead))
        match(lookahead);
    else
        Error( );
}
Predictive parsing:- A nonrecursive predictive parser can be built by maintaining a
stack explicitly, rather than implicitly via recursive calls. The parser mimics a
leftmost derivation. If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols α such that S ⇒* wα by a leftmost derivation.
The table-driven parser in Fig below has an input buffer, a stack containing a
sequence of grammar symbols, a parsing table constructed by Algorithm, and an
output stream. The input buffer contains the string to be parsed, followed by the
endmarker $. We reuse the symbol $ to mark the bottom of the stack, which initially
contains the start symbol of the grammar on top of $. The parsing table is a two-
dimensional array M[A, a] where A is a nonterminal, and a is a terminal or the
symbol $.
The construction of the parsing table is aided by two functions associated with the grammar, called FIRST and FOLLOW.
FIRST:- FIRST(α) is defined to be the set of terminals that appear as the first symbols of one or more strings of terminals generated from α. To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set.
1. If X is a terminal, then FIRST(X) = {X}.
2. If X is a nonterminal and X → Y1Y2...Yk is a production, then add FIRST(Y1) to FIRST(X); if Y1 can derive ε, also add FIRST(Y2), and so on.
3. If X → ε is a production, then add ε to FIRST(X).
4. If X → aα is a production, then a is in FIRST(X).
5. If X → aα | ε is a production, then both a and ε are in FIRST(X).
FOLLOW:- FOLLOW(A) of a nonterminal A is defined to be the set of terminals a that can appear immediately to the right of A in some sentential form. To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can be added to any FOLLOW set.
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
If, after performing the above, there is no production at all in M[A, a], then set
M[A, a] to error (which we normally represent by an empty entry in the table).
Example:- Consider the following grammar:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
FIRST(F) = FIRST(T) = FIRST(E) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
As E is the start symbol, $ is in FOLLOW(E).
Since F → (E), ')' is also in FOLLOW(E).
Therefore FOLLOW(E) = FOLLOW(E') = { ), $ }
From E → TE':
FOLLOW(T) = (FIRST(E') − {ε}) ∪ FOLLOW(E')
= ({ +, ε } − {ε}) ∪ { ), $ } = { +, ), $ }
FOLLOW(T') = FOLLOW(T) = { +, ), $ }
From T → FT':
FOLLOW(F) = (FIRST(T') − {ε}) ∪ FOLLOW(T)
= ({ *, ε } − {ε}) ∪ { +, ), $ } = { *, +, ), $ }
Construction of Parsing Table:-
1. For every production A → β of the grammar, go to steps 2 & 3.
2. For each terminal symbol x in FIRST(β), place A → β in the cell M[A, x], where M is a two-dimensional array.
3. If FIRST(β) contains ε, then place A → β in the cell M[A, y] for every terminal symbol y in FOLLOW(A).
Example: Consider the grammar E → TE' above. On input id + id * id, determine the sequence of moves of the nonrecursive predictive parser.
Example:- Check whether the given grammar is an LL(1) grammar.
S → iEtSS' | a
S' → eS | ε
E → b
Construct a predictive parse table.
FIRST(S) = { i, a }
FIRST(S') = { e, ε }
FIRST(E) = { b }
FOLLOW(S) = { $ } ∪ (FIRST(S') − {ε}) = { e, $ }
FOLLOW(S') = FOLLOW(S) = { e, $ }
FOLLOW(E) = { t }
The predictive parse table:
        i            a        b        e                    t        $
S       S → iEtSS'   S → a
S'                                     S' → eS, S' → ε               S' → ε
E                             E → b
Since the predictive parse table has two entries in M[S', e], namely S' → eS and S' → ε, the given grammar is not LL(1).
Important Questions
1. Explain about Syntax Analysis and Role of a parser
2. Discuss about Classification of parsing techniques
3. Briefly explain Top down parsing
4. Explain about Recursive descent parsing
5. Explain about predictive parsing
6. Explain the construction of a predictive parse table using FIRST and FOLLOW
7. Discuss about LL(1) grammars
8. Briefly explain Error recovery in predictive parsing.
Introduction to Simple LR:- The most important type of bottom-up parser is based on a
concept called LR(k) parsing; the "L" is for left-to-right scanning of the input, the "R" for
constructing a rightmost derivation in reverse, and the k for the number of input symbols
of lookahead that are used in making parsing decisions. Generally, LR(k) parsers use k = 0 or k = 1. When (k) is omitted, k is assumed to be 1. The easiest
method for constructing shift-reduce parsers is called "simple LR" (or SLR, for
short). Another two more complex bottom-up parsers are canonical-LR and LALR
which are used in the majority of LR parsers.
LR parsers can handle a larger class of grammars than LL parsers. A grammar for which we can construct an LR parsing table is said to be an LR grammar. For a grammar to be LR, it is sufficient that a left-to-right shift-reduce parser be able to recognize handles of right-sentential forms when they appear on top of the stack. LR parsing is attractive for a variety of reasons:
1. LR parsers can be constructed to recognize all programming language constructs
for which context-free grammars can be written.
2. The LR-parsing method is the most general nonbacktracking shift-reduce parsing
method known, yet it can be implemented as efficiently as other shift-reduce
methods.
3. An LR parser can detect a syntactic error as soon as it is possible to do so on a left-
to-right scan of the input.
4. The class of grammars that can be parsed using LR methods is a proper superset of
the class of grammars that can be parsed with predictive or LL methods.
2. If ACTION[Sm, ai] = reduce A → β, the parser pops r = |β| symbols off the stack and then pushes A and the entry GOTO[Sm−r, A] onto the stack.
3. If ACTION[Sm, ai] = accept, parsing is completed.
4. If ACTION[Sm, ai] = error, the parser has discovered an error and calls an error recovery routine.
Shift Reduce parsing:- Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols and an input buffer holds the rest of the string to be parsed. As we shall see, the handle always appears at the top of the stack just before it is identified as the handle. A "handle" is a substring that matches the body of a production, and whose reduction represents one step along the reverse of a rightmost derivation.
We use $ to mark the bottom of the stack and also the right end of the input. Conventionally, when discussing bottom-up parsing, we show the top of the stack on the right, rather than on the left as we did for top-down parsing. Initially, the stack is empty and the string w is on the input.
During a left-to-right scan of the input string, the parser shifts zero or more input
symbols onto the stack, until it is ready to reduce a string β of grammar symbols on top of
the stack. It then reduces β to the head of the appropriate production. The parser repeats
this cycle until it has detected an error or until the stack contains the start symbol and the
input is empty:
Upon entering this configuration, the parser halts and announces successful completion of
parsing.
There are actually four possible actions a shift-reduce parser can make:
(1) shift, (2) reduce, (3) accept, and (4) error.
1. Shift:- Shift the next input symbol onto the top of the stack.
Example:- List out the actions of a shift-reduce parser to parse the input string id1 * id2 according to the expression grammar.
Example:- List out the actions of a shift-reduce parser to parse the input string id
*id+id according to the following grammar
E E + E | E * E | ( E ) | id
$          id*id+id$   Shift id
$id        *id+id$     Reduce E → id
$E         *id+id$     Shift *
$E*        id+id$      Shift id
$E*id      +id$        Reduce E → id
$E*E       +id$        Shift +
$E*E+      id$         Shift id
$E*E+id    $           Reduce E → id
$E*E+E     $           Reduce E → E+E
$E*E       $           Reduce E → E*E
$E         $           Accept
Conflicts During Shift-Reduce Parsing:- There are context-free grammars for which shift-reduce parsing cannot be used. Every shift-reduce parser for such a grammar can reach a configuration in which the parser, knowing the entire stack contents and the next input symbol, cannot decide whether to shift or to reduce (a shift/reduce conflict), or cannot decide which of several reductions to make (a reduce/reduce conflict). For example, in the trace above, with the stack contents $E*E and the next input symbol +, the parser can either shift + or reduce by E → E*E; this is a shift/reduce conflict. To solve this problem, preference is given to "Shift +".
Reduce/reduce conflict:- Suppose the stack and the input buffer contain the contents shown below:
Stack Contents   Input Buffer   Action Taken
$E*E+E           $
The parser can take the action "Reduce E → E+E" or "Reduce E → E*E". This is known as a reduce/reduce conflict. As bottom-up parsing traces a rightmost derivation in reverse, preference is given to "Reduce E → E+E".
A shift/reduce or reduce/reduce conflict is encountered for grammars that are not LR or that are ambiguous. Compilers, however, use LR grammars, so these conflicts do not arise during compilation. A shift-reduce parser cannot be constructed for a non-LR or ambiguous grammar.
Rule 3:- If operators θ1 and θ2 have equal precedence, take θ1 ·> θ2 and θ2 ·> θ1 if they are left associative, and θ1 <· θ2 and θ2 <· θ1 if they are right associative.
Rule 4:- Unary operators have higher precedence than binary operators.
Rule 5:- A pair of parentheses cannot be removed until all operations between the pair have been performed. A pair of outer parentheses cannot be removed until all inner parentheses have been removed. An operation outside a pair of parentheses cannot be performed until the pair has been removed.
Construction of precedence parse table:- The parse table is a table of size n x n, where n is the number of terminal symbols in the defined grammar.
1. Fill each cell with the relation between the vertical terminal symbol and the
horizontal terminal symbol.
Operator precedence parser:- The operator precedence parser uses a parsing table, called the operator precedence parsing table, for making informed decisions to identify the handle.
Operator precedence parsing algorithm:-
Fig:- Operator Precedence Parser
Step 1:- Let a pointer P point to the first input symbol of x$, where x is the string to be parsed.
Step 2:- If the relation between top of the stack and current input symbol is <· then
perform shift operation to push current input symbol onto the stack.
Step 3:- If the relation between the top of the stack and the current input symbol is ·> then perform the reduce operation, popping the handle from the stack.
Example:- Construct the operator precedence parse table for the following grammar:
E → E+E    E → E*E    E → (E)    E → id
and check the string id+id*id by using the operator precedence parser.
Stack contents   Relation   Input Buffer   Action Taken
$                <·         id+id*id$      Shift id
$id              ·>         +id*id$        Reduce E → id
$E               <·         +id*id$        Shift +
$E+              <·         id*id$         Shift id
$E+id            ·>         *id$           Reduce E → id
$E+E             <·         *id$           Shift *
$E+E*            <·         id$            Shift id
$E+E*id          ·>         $              Reduce E → id
$E+E*E           ·>         $              Reduce E → E*E
$E+E             ·>         $              Reduce E → E+E
$E               ·>         $              Accept
Construction of SLR tables: - The SLR method for constructing parsing tables is a good starting point for studying LR parsing. The parsing table constructed by this method is an SLR table, and an LR parser using an SLR-parsing table is an SLR parser. The SLR method begins with LR(0) items and LR(0) automata. That is, given a grammar G, we produce the augmented grammar G', with a new start symbol S'. From G', we construct C, the canonical collection of sets of items for G', together with the GOTO function.
The ACTION and GOTO entries in the parsing table are then constructed using FOLLOW(A) for each nonterminal A of the grammar.
1. Construct C = {I0, I1, . . . , In}, the collection of sets of LR(0) items for G'.
2. State i is constructed from Ii. The parsing actions for state i are determined as follows:
(a) If [A → α·aβ] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j." Here a must be a terminal.
(b) If [A → α·] is in Ii, then set ACTION[i, a] to "reduce A → α" for all a in FOLLOW(A); here A may not be S'.
(c) If [S' → S·] is in Ii, then set ACTION[i, $] to "accept."
Example:- Construct the SLR parse table for the following grammar.
S → L=R | R
L → *R | id
R → L
Step 1:- Produce the augmented grammar G' for the given grammar.
S' → S
S → L=R
S → R
L → *R
L → id
R → L
Step 2: Construct the collection of sets of LR(0) items for G'.
I0 = Closure( S' → ·S )
  S' → ·S
  S → ·L=R
  S → ·R
  L → ·*R
  L → ·id
  R → ·L
I1 = goto(I0, S) = Closure( S' → S· )
  S' → S·
I2 = goto(I0, L) = Closure( S → L·=R, R → L· )
  S → L·=R
  R → L·
I3 = goto(I0, R) = Closure( S → R· )
  S → R·
I4 = goto(I0, *) = Closure( L → *·R )
  L → *·R
  R → ·L
  L → ·*R
  L → ·id
I5 = goto(I0, id) = Closure( L → id· )
  L → id·
I6 = goto(I2, =) = Closure( S → L=·R )
  S → L=·R
  R → ·L
  L → ·*R
  L → ·id
I7 = goto(I4, R) = Closure( L → *R· )
  L → *R·
I8 = goto(I4, L) = Closure( R → L· )
  R → L·
goto(I4, *) = I4
goto(I4, id) = I5
I9 = goto(I6, R) = Closure( S → L=R· )
  S → L=R·
Follow(S’) = { $ }
Follow(S) = Follow(S’) = { $ }
Follow(R) = Follow(S) = { $ }
Follow(L) = Follow(R) = { $ }
Follow(L) = {=}
Thus Follow(L) = { =, $ }
More Powerful LR Parsers: - LR parsing techniques can be extended to use one symbol of lookahead on the input. Parsers constructed with this technique are called more powerful LR parsers. There are two parsers constructed in this way. They are
1. The "canonical-LR" parser which makes full use of the lookahead symbol(s). This
method uses a large set of items, called the LR(1) items.
2. The "lookahead-LR" parser which is based on the LR(0) sets of items, and has
many fewer states than CLR parsers which are based on the LR(1) items. By
carefully introducing lookaheads into the LR(0) items, LALR parsers can handle
many more grammars than with the SLR parser, and build parsing tables that are
no bigger than the SLR tables. LALR is the method of choice in most situations.
Construction of CLR(1):- The method for building the collection of sets of valid LR(1) items is essentially the same as the one for building the canonical collection of sets of LR(0) items. In CLR(1) we collect the canonical sets of LR(1) items; the value 1 in the bracket indicates that there is one lookahead symbol in each item.
Construction of the canonical set of items along with the lookahead: -
1. For the grammar G, initially add [S' → ·S, $] to the set of items C.
2. For each set of items Ii in C and for each grammar symbol X, add Closure(goto(Ii, X)). This process should be repeated by applying goto(Ii, X) for each X in Ii, such that goto(Ii, X) is not empty and not already in C. The sets of items have to be constructed until no more sets of items can be added to C.
3. The closure function can be computed as follows: for each item [A → α·Xβ, a], for each rule X → γ, and for each b ∈ FIRST(βa), add the item [X → ·γ, b].
Construction of the goto graph: - Each set of items Ii is drawn as a node labeled with the canonical item name. There is an edge from node Ii to node Ij if Ij = goto(Ii, X). This process is repeated until all the nodes are connected by means of the edges.
Construction of the canonical LR parsing table: - The ACTION and GOTO entries in the parsing table are then constructed as follows.
(a) If [A → α·aβ, b] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j". Here a must be a terminal.
(b) If [A → α·, a] is in Ii, then set ACTION[i, a] to "reduce A → α". Here A may not be S'.
(c) If [S' → S·, $] is in Ii, then set ACTION[i, $] to "accept".
(d) The goto part of the LR table is filled as follows: if GOTO(Ii, A) = Ij, then GOTO[i, A] = j.
(e) All entries not defined by the above rules are made "error".
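These ACTION/GOTO rules drive a simple table-driven parse loop. The sketch below hand-transcribes an illustrative CLR(1) table for the grammar S → CC, C → cC | d (the state numbers follow the numbering used in the worked example that follows; the table layout and function names are this sketch's own):

```python
# A minimal LR table-driven parse loop (shift/reduce/accept) for the
# grammar S -> CC, C -> cC | d. Table entries are transcribed by hand.

PRODS = {1: ("S", 2), 2: ("C", 2), 3: ("C", 1)}  # number -> (head, body length)

ACTION = {
    (0, "c"): ("s", 3), (0, "d"): ("s", 4),
    (1, "$"): ("acc", 0),
    (2, "c"): ("s", 6), (2, "d"): ("s", 7),
    (3, "c"): ("s", 3), (3, "d"): ("s", 4),
    (4, "c"): ("r", 3), (4, "d"): ("r", 3),
    (5, "$"): ("r", 1),
    (6, "c"): ("s", 6), (6, "d"): ("s", 7),
    (7, "$"): ("r", 3),
    (8, "c"): ("r", 2), (8, "d"): ("r", 2),
    (9, "$"): ("r", 2),
}

GOTO = {(0, "S"): 1, (0, "C"): 2, (2, "C"): 5, (3, "C"): 8, (6, "C"): 9}

def lr_parse(tokens):
    stack = [0]                       # stack of states
    tokens = list(tokens) + ["$"]
    i = 0
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            return False              # blank entry: error
        kind, n = act
        if kind == "s":               # shift: push state n, advance input
            stack.append(n)
            i += 1
        elif kind == "r":             # reduce: pop |body| states, push GOTO
            head, size = PRODS[n]
            del stack[len(stack) - size:]
            stack.append(GOTO[(stack[-1], head)])
        else:                         # accept
            return True

print(lr_parse("cdcd"))  # True
print(lr_parse("cddd"))  # False
```

The same loop works for SLR, CLR and LALR parsers; only the tables differ.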
Example:- Construct the CLR parsing table for the grammar S → CC, C → cC | d, and parse the strings "cdcd" and "cddd" using the CLR parsing table.
Solution:- Start by computing the closure of {[S' → ·S, $]}. To find the closure, match [S' → ·S, $] with the item [A → α·Bβ, a]. That is, A = S', α = ε, B = S, β = ε, and a = $. The function CLOSURE tells us to add [B → ·γ, b] for each production B → γ and terminal b in FIRST(βa).
The initial set of items is
Now compute GOTO(I0,S),
No additional productions are added , since the dot is at the right end. Thus we have the
next set of items
Now compute GOTO(I0,C),
We have finished considering GOTO on I0. We get no new sets from I1, but we can apply
goto's on I2 . Now compute GOTO(I2,C),
Now compute GOTO(I6, C).
The GOTO's of I6 on c and d are I6 and I7, respectively:
GOTO(I6, c) = I6
GOTO(I6, d) = I7
The remaining sets of items yield no GOTO'S, so we are done.
The goto graph for the above given grammar is
Parsing the input string "cdcd" using the LR(1) parsing table:

Stack        Input Buffer    Parsing Action
$0           cdcd$           Shift S3 [push c & 3]
$0c3         dcd$            Shift S4 [push d & 4]
$0c3d4       cd$             R3 Reduce C → d
$0c3C8       cd$             R2 Reduce C → cC
$0C2         cd$             Shift S6 [push c & 6]
$0C2c6       d$              Shift S7 [push d & 7]
$0C2c6d7     $               R3 Reduce C → d
$0C2c6C9     $               R2 Reduce C → cC
$0C2C5       $               R1 Reduce S → CC
$0S1         $               Accept

Thus the given input string is successfully parsed using the LR(1) parser or canonical LR parser.
Parsing the input string "cddd" using the LR(1) parsing table:

Stack        Input Buffer    Parsing Action
$0           cddd$           Shift S3 [push c & 3]
$0c3         ddd$            Shift S4 [push d & 4]
$0c3d4       dd$             R3 Reduce C → d
$0c3C8       dd$             R2 Reduce C → cC
$0C2         dd$             Shift S7 [push d & 7]
$0C2d7       d$              Error

Thus the given input string is not accepted by the LR(1) parser or canonical LR parser.
Example:- Construct the CLR parsing table for the grammar A → BA | ε, B → aB | b (augmented with S → A).
Solution:- Start with the item [S → ·A, $]. To compute its closure, match [A → ·BA, $] with the item [A → α·Bβ, a]. That is, α = ε, β = A, and a = $. Therefore FIRST(βa) = FIRST(A$) = a/b/$.
The initial set of items is
I0: S → ·A, $
    A → ·BA, $
    A → ·, $
    B → ·aB, a/b/$
    B → ·b, a/b/$
Now compute GOTO(I0, A):
I1: S → A·, $
Now compute GOTO(I0, B):
I2: A → B·A, $
    A → ·BA, $
    A → ·, $
    B → ·aB, a/b/$
    B → ·b, a/b/$
Similarly compute all the canonical items until · reaches the end in every production.
I3: GOTO(I0, a)
    B → a·B, a/b/$
    B → ·aB, a/b/$
    B → ·b, a/b/$
I4: GOTO(I0, b)
    B → b·, a/b/$
I5: GOTO(I2, A)
    A → BA·, $
I6: GOTO(I3, B)
    B → aB·, a/b/$
The canonical parsing table for the above grammar is

            Action              GoTo
State    a      b      $       A    B
  0      S3     S4     r2      1    2
  1                    Accept
  2      S3     S4     r2      5    2
  3      S3     S4                  6
  4      r4     r4     r4
  5                    r1
  6      r3     r3     r3
Most syntactic constructs of programming languages can be expressed conveniently by SLR grammars, but there are a few constructs that cannot be conveniently handled by LR parsers. For a comparison of parser size, the SLR and LALR tables for a grammar always have the same number of states, and this number is typically several hundred states for a language like C. The canonical LR table would typically have several thousand states for the same-size language, so it is much easier and more economical to construct SLR and LALR tables.
Construction of the LALR parsing table:-
1. Construct the collection of sets of LR(1) items.
2. Merge the sets of items that have the same core (the same first components), taking the union of their lookaheads.
3. Construct the ACTION and GOTO entries from the merged sets, as for the canonical LR table.
4. Parse the input string using the LALR parse table, similar to LR(1) parsing.
Example: - Construct the LALR parsing table for the following grammar.
S → CC
C → aC
C → d
Parse the strings "adad" and "addd" using the LALR parsing table.
Solution:- Convert the above grammar into an augmented grammar.
S' → S
S → CC
C → aC
C → d
The initial set of items is
I0: S' → ·S, $
    S → ·CC, $
    C → ·aC, a/d
    C → ·d, a/d
Now compute GOTO(I0, S). No additional productions are added, since the dot is at the right end. Thus we have the next set of items:
I1: S' → S·, $
Now compute GOTO(I0, C):
I2: S → C·C, $
    C → ·aC, $
    C → ·d, $
Computing the remaining items we have
I3: GOTO(I0, a)
    C → a·C, a/d
    C → ·aC, a/d
    C → ·d, a/d
I4: GOTO(I0, d)
    C → d·, a/d
I5: GOTO(I2, C)
    S → CC·, $
I6: GOTO(I2, a)
    C → a·C, $
    C → ·aC, $
    C → ·d, $
I7: GOTO(I2, d)
    C → d·, $
I8: GOTO(I3, C)
    C → aC·, a/d
GOTO(I3, a) = I3
GOTO(I3, d) = I4
I9: GOTO(I6, C)
    C → aC·, $
GOTO(I6, a) = I6
GOTO(I6, d) = I7
The remaining sets of items yield no GOTO's, so we are done.
As the first components (cores) of states I3 and I6 are the same, we merge the two states to get I36.
I36: C → a·C, a/d/$
     C → ·aC, a/d/$
     C → ·d, a/d/$
Similarly we merge the two states I4 and I7 to get I47, and states I8 and I9 to get I89.
I47: GOTO(I0/I2, d)
     C → d·, a/d/$
I89: GOTO(I3/I6, C)
     C → aC·, a/d/$
The goto graph for the above given grammar is
Parsing the input string "adad" using the LALR parsing table:

Stack          Input Buffer    Parsing Action
$0             adad$           Shift S36 [push a & 36]
$0a36          dad$            Shift S47 [push d & 47]
$0a36d47       ad$             R3 Reduce C → d
$0a36C89       ad$             R2 Reduce C → aC
$0C2           ad$             Shift S36 [push a & 36]
$0C2a36        d$              Shift S47 [push d & 47]
$0C2a36d47     $               R3 Reduce C → d
$0C2a36C89     $               R2 Reduce C → aC
$0C2C5         $               R1 Reduce S → CC
$0S1           $               Accept

Thus the given input string is successfully parsed using the LALR parser.
Parsing the input string "addd" using the LALR parsing table:

Stack          Input Buffer    Parsing Action
$0             addd$           Shift S36 [push a & 36]
$0a36          ddd$            Shift S47 [push d & 47]
$0a36d47       dd$             R3 Reduce C → d
$0a36C89       dd$             R2 Reduce C → aC
$0C2           dd$             Shift S47 [push d & 47]
$0C2d47        d$              R3 Reduce C → d
$0C2C5         d$              Error

Thus the given input string is not accepted by the LALR parser. Note that the LALR parser makes one more reduction than the canonical LR parser would before announcing the error.
Example:- Show that the following grammar is LR(1) but not LALR(1).
S → Aa | bAc | Bc | bBa
A → d
B → d
Solution:- Convert the above grammar into an augmented grammar.
S' → S
S → Aa | bAc | Bc | bBa
A → d
B → d
The initial set of items is
I0: S' → ·S, $
    S → ·Aa, $
    S → ·bAc, $
    S → ·Bc, $
    S → ·bBa, $
    A → ·d, a
    B → ·d, c
I1: GOTO(I0, S)
    S' → S·, $
I2: GOTO(I0, A)
    S → A·a, $
I3: GOTO(I0, b)
    S → b·Ac, $
    S → b·Ba, $
    A → ·d, c
    B → ·d, a
I4: GOTO(I0, B)
    S → B·c, $
I5: GOTO(I0, d)
    A → d·, a
    B → d·, c
I6: GOTO(I2, a)
    S → Aa·, $
I7: GOTO(I3, A)
    S → bA·c, $
I8: GOTO(I3, B)
    S → bB·a, $
I9: GOTO(I3, d)
    A → d·, c
    B → d·, a
I10: GOTO(I4, c)
    S → Bc·, $
I11: GOTO(I7, c)
    S → bAc·, $
I12: GOTO(I8, a)
    S → bBa·, $
The remaining sets of items yield no GOTO's, so we are done.
The LR(1) parsing table for the above grammar is

                  Action                      GoTo
State    a      b      c      d      $      S    A    B
  0             s3            s5            1    2    4
  1                                  Accept
  2      s6
  3                           s9                 7    8
  4                    s10
  5      r5            r6
  6                                  r1
  7                    s11
  8      s12
  9      r6            r5
 10                                  r3
 11                                  r2
 12                                  r4
To build the LALR table, states I5 and I9 (which have the same core) are merged into I59, with items A → d·, a/c and B → d·, a/c. The resulting LALR parsing table shows multiple entries in Action[59, a] and Action[59, c]. This is called a reduce/reduce conflict. Because of this conflict we cannot parse the input. Thus it is shown that the given grammar is LR(1) but not LALR(1).
Dangling Else ambiguity:- It is a fact that every ambiguous grammar fails to be LR. However, certain types of ambiguous grammars are quite useful in the specification and implementation of languages. Consider again the following grammar for conditional statements:
stmt → if expr then stmt else stmt | if expr then stmt | other
The above grammar is ambiguous because it does not resolve the dangling-else ambiguity. To simplify the discussion, let us consider an abstraction of this grammar, where i stands for if expr then, e stands for else, and a stands for "all other productions". Converting the abstracted grammar into an augmented grammar we have
S' → S
S → iSeS | iS | a
The above table has multiple entries at Action[4, e], so the grammar suffers from a shift/reduce conflict.
Parse the input string iiaea:

Stack contents   Input Buffer    Action Taken
$0               iiaea$          S2 push i & 2
$0i2             iaea$           S2 push i & 2
$0i2i2           aea$            S3 push a & 3
$0i2i2a3         ea$             R3 Reduce S → a
$0i2i2S4         ea$             Shift/reduce conflict
When such a situation occurs, first try choosing each action separately. First choosing the reduce action we have

Stack contents   Input Buffer    Action Taken
$0               iiaea$          S2 push i & 2
$0i2             iaea$           S2 push i & 2
$0i2i2           aea$            S3 push a & 3
Error Recovery in LR Parsing: - An LR parser will detect an error when it consults the parsing action table and finds an error entry. Errors are never detected by consulting the goto table. An LR parser will announce an error as soon as there is no valid continuation for the portion of the input thus far scanned. A canonical LR parser will not make even a single reduction before announcing an error. SLR and LALR parsers may make several reductions before announcing an error, but they will never shift an erroneous input symbol onto the stack.
Panic-mode error recovery: - Suppose the input string being parsed contains an error. Part of that string has already been processed, and the result of this processing is a sequence of states on top of the stack; the remainder of the string is still in the input. The parser scans down the stack until a state s with a goto on a particular nonterminal A is found, and skips over the input looking for a terminal that can legitimately follow A. By removing states from the stack, skipping over the input, and pushing GOTO(s, A) on the stack, the parser pretends that it has found an instance of A and resumes normal parsing.
Phrase-level error recovery: - Phrase-level recovery is implemented by examining each error entry in the LR parsing table, and an appropriate recovery procedure can then be constructed. In designing specific error-handling routines for an LR parser, we can fill in
each blank entry in the action field with a pointer to an error routine that will take the
appropriate action selected by the compiler designer. The actions may include insertion
or deletion of symbols from the stack or the input or both, or alteration and transposition
of input symbols. The modifications should be such that the LR parser will not get into an
infinite loop. A safe strategy will assure that at least one input symbol will be removed or
shifted eventually, or that the stack will eventually shrink if the end of the input has been
reached.
position = initial + rate * 60
In the above expression suppose that position, initial, and rate have been declared to be floating-point numbers, and that the lexeme 60 by itself forms an integer. The type checker in the semantic analyzer discovers that the operator * is applied to a floating-point number rate and an integer 60. In this case, the integer may be converted into a floating-point number.
Semantic errors include type mismatches between operators and operands. The semantic analyzer reports an error when an array index is out of range. It can report semantic errors both at compile time and at run time: at compile time it checks the compatibility of operands and operators, and at run time it checks the range of array indices.
Syntax-Directed Definitions: - A syntax-directed definition (SDD) is a context-free grammar together with attributes and rules. Attributes are associated with grammar symbols and rules are associated with productions. If X is a symbol and a is one of its attributes, then we write X.a to denote the value of a at a particular parse-tree node labeled X. If we implement the nodes of the parse tree by records or objects, then the attributes of X can be implemented by data fields in the records that represent the nodes for X. Attributes may be of any kind: numbers, types, table references, or strings, for instance. The strings may even be long sequences of code, say code in the intermediate language used by a compiler.
Inherited and Synthesized Attributes: - A syntax-directed definition (SDD) may use two kinds of attributes for nonterminals. They are
1. Synthesized Attributes: - A synthesized attribute at node N is defined only in terms of attribute values at the children of N and at N itself.
2. Inherited Attributes: - An inherited attribute at node N is defined only in terms of attribute values at N's parent, N itself, and N's siblings. For example, in a declaration int L1, L2 the type int is passed down to the identifiers as an inherited attribute: L1.type = int and L2.type = int.
Solution: - The values of lexval are presumed supplied by the lexical analyzer. Each
of the nodes for the non terminals has attribute val computed in a bottom-up order,
and we see the resulting values associated with each node. For instance, at the node
with a child labeled *, after computing T.val= 3 and F.val = 5 at its first and third
children, we apply the rule that says T.val is the product of these two values, or 15.
The annotated parse tree is as shown above.
Ex : Draw the annotated parse tree for the input string 3 * 5 using the grammar and
rules given in the table.
Solution: - To see how the semantic rules are used, consider the annotated parse tree for 3 * 5 in the above figure. The leftmost leaf in the parse tree, labeled digit, has attribute value lexval = 3, where the 3 is supplied by the lexical analyzer. Its parent is for production 4, F → digit. The only semantic rule associated with this production defines F.val = digit.lexval, which equals 3.
At the second child of the root, the inherited attribute T1.inh is defined by the semantic rule T1.inh = F.val associated with production 1. Thus, the left operand, 3, for the * operator is passed from left to right across the children of the root. The production at the node for T11 is T1 → * F T11. (We retain the subscript 1 in the annotated parse tree to distinguish between the two nodes for T1.) The inherited attribute T11.inh is defined by the semantic rule T11.inh = T1.inh × F.val associated with production 2.
With T1.inh = 3 and F.val = 5, we get T11.inh = 15. At the lower node for T11, the production is T1 → ε. The semantic rule T1.syn = T1.inh defines T11.syn = 15. The syn attributes at the nodes for T1 pass the value 15 up the tree to the node for T, where T.val = 15.
First we parse the input token stream and generate the parse tree. Then the tree is traversed, evaluating the semantic rules at the parse-tree nodes.
Applications of Syntax-Directed Translation: - The main application of syntax-
directed translation techniques is the construction of syntax trees. Since some
compilers use syntax trees as an intermediate representation, a common form of
SDD turns its input string into a tree. We consider two SDD's for constructing
syntax trees for expressions. The first, an S-attributed definition, is suitable for use
during bottom-up parsing. The second, L-attributed, is suitable for use during top-
down parsing.
Construction of Syntax Trees: - An SDD can be used to construct either syntax trees or DAG's. Each node in a syntax tree represents a construct; the children of the node represent the meaningful components of the construct. A syntax-tree node representing an expression E1 + E2 has label + and two children representing the subexpressions E1 and E2. We shall implement the nodes of a syntax tree by objects with a suitable number of fields. Each object will have an op field that is the label of the node. The objects will have additional fields as follows:
If the node is a leaf, an additional field holds the lexical value for the leaf. A constructor function Leaf(op, val) creates a leaf object. Alternatively, if nodes are viewed as records, then Leaf returns a pointer to a new record for a leaf.
If the node is an interior node, there are as many additional fields as the node has children in the syntax tree. A constructor function Node takes two or more arguments: Node(op, c1, c2, ..., ck) creates an object with first field op and k
additional fields for the k children c1, ..., ck.
Ex: - Construct a syntax tree for the expression a - 4 + c using the above constructor functions.
Solution: - Every time the first production, E → E1 + T, is used, its rule creates a node with '+' for op and two children, E1.node and T.node, for the subexpressions. The second production, E → E1 - T, has a similar rule.
For production 3, E → T, no node is created, since E.node is the same as T.node. Similarly, no node is created for production 4, T → ( E ). The value of
T.node is the same as E.node, since parentheses are used only for grouping; they influence the structure of the parse tree, but once their job is done, there is no further need to retain them in the syntax tree.
The last two T-productions have a single terminal on the right. We use the
constructor Leaf to create a suitable node, which becomes the value of T.node.
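The Leaf and Node constructors described above can be sketched as follows; the Python class shapes and the show helper below are illustrative, not part of the original SDD:

```python
# Sketch of the Leaf/Node constructors used to build the syntax tree
# for a - 4 + c bottom-up. Field names are illustrative.

class Leaf:
    def __init__(self, op, val):
        self.op, self.val = op, val        # label and lexical value

class Node:
    def __init__(self, op, *children):
        self.op, self.children = op, children

# E -> E1 - T creates Node('-', ...); E -> E1 + T creates Node('+', ...)
t1 = Leaf("id", "a")
t2 = Leaf("num", 4)
e1 = Node("-", t1, t2)        # a - 4
t3 = Leaf("id", "c")
root = Node("+", e1, t3)      # (a - 4) + c

def show(n):
    # Render the tree as a fully parenthesized expression.
    if isinstance(n, Leaf):
        return str(n.val)
    return "(" + n.op.join(show(c) for c in n.children) + ")"

print(show(root))  # ((a-4)+c)
```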
Above figure shows the construction of a syntax tree for the input a - 4 + c. The
C
nodes of the syntax tree are shown as records, with the op field first. Syntax-tree
edges are now shown as solid lines. The underlying parse tree, which need not
SA
actually be constructed, is shown with dotted edges. The third type of line, shown
dashed, represents the values of E.node and T.node; each line points to the
appropriate syntax-tree node. At the bottom we see leaves for a, 4 and c, constructed
by Leaf.
Dependency Graphs: - A dependency graph depicts the flow of information among
the attribute instances in a particular parse tree; an edge from one attribute instance
to another means that the value of the first is needed to compute the second. Edges
express constraints implied by the semantic rules.
Ex : - Construct the dependency graph tree for a - 4 + c using the L-attributed
definition given below.
Solution: - The below dependency graph depicts the order of evaluation of the attributes in the parse tree for a - 4 + c.
A machine-independent intermediate representation gives two main benefits:
1. A compiler for the same language on a different machine can be developed by attaching a new back end to an existing front end (retargeting).
2. A compiler for different languages on the same machine can be developed by making use of multiple front ends and a single back end.
Postfix Notation:- An expression contains operands and operators. If the expression contains the operator in between the operands then it is an infix expression. If the expression contains the operator after the operands then it is a postfix expression. There is a lot of complexity in evaluating an infix expression, so infix expressions are converted to postfix expressions. Postfix evaluation is very easy.
Example:- Convert a*b+c/d into postfix notation.
Solution:- As * has higher precedence, convert a*b into postfix notation ab*,
i.e. {ab*}+c/d.
As / has the next higher precedence, convert c/d into postfix notation cd/,
i.e. {ab*}+{cd/}.
As + has the least precedence, convert {ab*}+{cd/} into postfix notation,
i.e. ab*cd/+.
Evaluation of postfix expression:-
1. Scan the expression from left to right.
2. If an operand is encountered, place it onto the stack.
3. If an operator is encountered, pop the topmost operands, perform the specified operation, and push the result back onto the stack.
4. Repeat steps 2 & 3 until the whole expression is scanned. Now the stack contains only one element, which is the final result.
For example, while evaluating ab*cd/+:
4. As '*' is an operator, pop the topmost operands, perform the specified operation, and push the result back onto the stack.
6. As '+' is an operator, pop the topmost operands, perform the specified operation, and push the result back onto the stack.
7. As the whole expression is scanned, the stack now contains only one element, which is the final result, i.e. 5.
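The evaluation steps above can be sketched as a stack loop; the operand values below (a=2, b=1, c=6, d=2) are chosen here only so that ab*cd/+ evaluates to 5, matching the worked result:

```python
# A sketch of the stack-based postfix evaluation algorithm.
def eval_postfix(tokens):
    stack = []
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    for tok in tokens:
        if tok in ops:
            b = stack.pop()          # right operand is on top of the stack
            a = stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok)) # operand: push onto the stack
    return stack[0]                  # one element left: the final result

# Evaluate ab*cd/+ with a=2, b=1, c=6, d=2 -> 2*1 + 6/2 = 5.0
print(eval_postfix(["2", "1", "*", "6", "2", "/", "+"]))  # 5.0
```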
Abstract syntax trees: - One form of intermediate code is called abstract syntax trees, or simply syntax trees. A syntax tree represents the hierarchical syntactic structure of the source program. The parser produces a syntax tree that is further translated into three-address code. In the syntax tree the leaf nodes are operands and the interior nodes are operators.
Construction of Syntax Trees:-
1. Identify the operator which has the least priority in the given expression. That operator becomes the root node. The sub-expression before the root-node operator becomes the left child and the sub-expression after the root-node operator becomes the right child.
2. Repeat the same process for the sub-expression which is the left child of the root node.
3. Repeat the same process for the sub-expression which is the right child of the root node.
4. Steps 1, 2 & 3 are repeated until the sub-expressions are operands of the given expression.
Example:- Construct the syntax tree for A*B+C/D.
1. In the expression '+' has the least priority. So it becomes the root. A*B becomes the left child and C/D becomes the right child.
2. In the left child expression '*' is the only operator. So it becomes the root. A becomes the left child and B becomes the right child.
3. In the right child expression '/' is the only operator. So it becomes the root. C becomes the left child and D becomes the right child.
        +
       / \
      *   /
     / \ / \
    A  B C  D
4. As all the sub expressions are operands the process is stopped and the tree
obtained is the syntax tree of the given expression.
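Steps 1-4 can be sketched as a recursive function that splits the expression at the lowest-priority operator; it assumes single-letter operands and no parentheses, and represents each interior node as an illustrative (op, left, right) tuple:

```python
# Build a syntax tree by repeatedly splitting at the least-priority
# operator: + and - first, then * and /. Scanning from the right makes
# the operators left-associative.

def build(expr):
    for level in ("+-", "*/"):                    # lowest priority first
        for i in range(len(expr) - 1, -1, -1):    # rightmost split point
            if expr[i] in level:
                return (expr[i], build(expr[:i]), build(expr[i + 1:]))
    return expr                                   # a single operand: a leaf

print(build("A*B+C/D"))  # ('+', ('*', 'A', 'B'), ('/', 'C', 'D'))
```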
Three-address code: - In a three-address instruction of the general form x = y op z, op stands for an operator. In three-address code, there is at most one operator on the right side of an instruction; that is, no built-up arithmetic expressions are permitted. Thus a source-language expression like x+y*z might be translated into the sequence of three-address instructions
t1 = y * z
t2 = x + t1
Three-address code is built from two concepts: addresses and instructions. An
ET
address can be one of the following:
1. A name. For convenience, we allow source-program names to appear as
addresses in three-address code. In an implementation, a source name is
replaced by a pointer to its symbol-table entry, where all information about the name is kept.
2. A constant. In practice, a compiler may deal with many different types of constants.
3. A compiler-generated temporary. It is useful to create a distinct name each time a temporary is needed.
Quadruples: - A quadruple (or just "quad") has four fields, which we call operator, operand1, operand2 and result. For instance, the three-address instruction x = y + z is represented by placing + in operator, y in operand1, z in operand2, and x in result. The following are some exceptions to this rule:
1. Instructions with unary operators like x = minus y or x = y do not use operand2. Note that for a copy statement like x = y, operator is =, while for most other operations the assignment operator is implied.
2. For instructions like param x, operator is param and operand1 is x, but the instruction uses neither operand2 nor result.
3. Conditional and unconditional jumps put the target label in result.
For example, a quadruple representation of the three-address code for the statement x = (a + b) * − c/d is shown in Table 1.

Table 1: Quadruple Representation of x = (a + b) * − c/d

    Operator   Operand1   Operand2   Result
1   +          a          b          t1
2   −          c                     t2
3   *          t1         t2         t3
4   /          t3         d          t4
5   =          t4                    x
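The same quadruples can be held in a simple list of 4-tuples; the tuple layout and the 'minus' opcode name below are illustrative choices, not a fixed convention:

```python
# Quadruples for x = (a + b) * -c / d as a list of
# (operator, operand1, operand2, result); None marks an unused field.

quads = [
    ("+",     "a",  "b",  "t1"),
    ("minus", "c",  None, "t2"),   # unary minus uses only operand1
    ("*",     "t1", "t2", "t3"),
    ("/",     "t3", "d",  "t4"),
    ("=",     "t4", None, "x"),    # copy: the = operator is explicit
]

for op, a1, a2, res in quads:
    print(op, a1, a2 or "", res)
```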
Triples: - In quadruples, the contents of the operand1, operand2, and result fields are normally pointers to the symbol-table records for the names represented by these fields. Hence, it becomes necessary to enter temporary names into the symbol table as they are created. This can be avoided by using the position of a statement to refer to a temporary value. If this is done, then a record structure with three fields is enough to represent a three-address statement: the first holds the operator, and the next two hold the values for operand1 and operand2, respectively. Such a representation is called a "triple representation". The contents of the operand1 and operand2 fields are either pointers to symbol-table records, or pointers to records (for temporary names) within the triple representation itself. For example, a triple representation of the three-address code for the statement x = (a+b)*−c/d is shown in Table 2.

Table 2: Triple Representation of x = (a + b) * − c/d

     Operator   Operand1   Operand2
(1)  +          a          b
(2)  −          c
(3)  *          (1)        (2)
(4)  /          (3)        d
(5)  =          x          (4)
Indirect triples: - Instead of listing the triples themselves in execution order, we can list the pointers to the triples in the desired order. This is called an indirect triple representation. For example, an indirect triple representation of the three-address code for the statement x = (a+b)*−c/d is shown in Table 3. The numbers in parentheses represent pointers into the triple structure.

Table 3: Indirect Triple Representation of x = (a + b) * − c/d

Statement list        Triple structure
11  (1)               (1)  +   a      b
12  (2)               (2)  −   c
13  (3)               (3)  *   (11)   (12)
14  (4)               (4)  /   (13)   d
15  (5)               (5)  =   x      (14)
IMPORTANT QUESTIONS
Code generation: Issues, target language, Basic blocks & flow graphs,
Simple code generator, Peephole optimization, Register allocation and
assignment.
Symbol Table: - Symbol table is a data structure that is used by compilers to hold
information of source program. The information is collected incrementally by the
analysis phases of a compiler and used by the synthesis phases to generate the target
code. Entries in the symbol table contain information about an identifier such as its
name, its type, its position of storage, and any other relevant information.
Symbol table format:- A Symbol table is a storage area used by the compiler to
store symbols and their associated properties. For every identifier in the source
program, there exists an entry in the symbol table. The properties for each name can
be type, scope and its binding.
Properties
Symbol P1 P2 P3 P4 .. ..
S1
S2
S3
:
:
:
In the above table P1, P2, P3,… are the properties of the symbol table and
S1,S2,S3… are the symbols encountered in the source program. The symbol table is
used by various phases as follows.
1. Lexical analyzer stores the information of the symbols in the symbol table.
2. Parser while checking the syntax of the statements uses the symbol table.
3. Semantic analysis phase refers symbol table for type checking.
4. Code generation refers symbol table to know run time memory allocated for
the symbols.
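A minimal sketch of such a table is a dictionary mapping each identifier to its property record; the method names and property keys below are illustrative:

```python
# A minimal symbol-table sketch: one entry per identifier, holding a
# record of properties that later phases can extend.

class SymbolTable:
    def __init__(self):
        self.entries = {}

    def insert(self, name, **props):
        # Create the entry if missing, then add/update properties.
        self.entries.setdefault(name, {}).update(props)

    def lookup(self, name):
        # Returns the property record, or None if the name is unknown.
        return self.entries.get(name)

st = SymbolTable()
st.insert("rate", type="float", address=45)     # from the declaration
st.insert("rate", lines_referenced=[6, 9])      # added by a later phase
print(st.lookup("rate")["type"])  # float
```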
The most important property of the symbol table is that it should be easily editable and growable: new properties of a symbol may be added in different phases of the compiler. Hence, the symbol table should be flexible enough to accommodate them. Depending on how symbol names are stored, symbol tables are of two types: fixed length and variable length.
Fixed Length:- In a fixed-length symbol table, the length of every symbol or name is fixed. The size of the table is still growable depending on the number of symbols in the program.
Symbol          Properties
a
b
f a c t
s u m
:
:
The advantage of using a fixed size is to limit the maximum length of any symbol in the language.
The disadvantage is that a shorter symbol wastes the unused memory allocated to it.
Variable Length:- A variable-length symbol table does not impose any constraint on the maximum length of a symbol. If a symbol needs only three cells then only three cells are allocated for it in a separate array. The symbol table itself contains only the starting index of each symbol in the array. The array stores each symbol, and a special character ($) is used to separate the symbols.
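The variable-length scheme can be sketched as one character array plus a table of starting indices, with '$' as the separator (the function names here are illustrative):

```python
# Variable-length symbol storage: all names live in one character
# array; the table records only each symbol's starting index.

names = ""          # the character array holding all names
starts = []         # symbol table: starting index of each name

def add_symbol(sym):
    global names
    starts.append(len(names))
    names += sym + "$"          # '$' separates the stored names

def get_symbol(i):
    end = names.index("$", starts[i])
    return names[starts[i]:end]

for s in ("a", "fact", "sum"):
    add_symbol(s)
print(names)          # a$fact$sum$
print(get_symbol(1))  # fact
```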
Methods of organizing the symbol table: - There exist many ways to organize a symbol table. Among these methods, ordered and unordered symbol tables are simple to implement.
Unordered Symbol Table:- When the declaration of a variable is encountered, the variable is entered into the symbol table. (Non-block-structured languages use implicit declarations.) When a new symbol is to be inserted, the table is searched (i.e. a lookup is done), and if the symbol is not found it is inserted as a new entry. Here the symbol entries are made in the order of their declaration, so the table is unordered.
Disadvantages:-
1. For a large size, an unordered symbol table is not suitable because more time is consumed for searching and inserting operations.
2. For direct generation of a cross-reference listing, the unordered symbol table needs to be sorted first.
Ordered Symbol Table:- In an ordered symbol table, the entries are kept sorted (for example, alphabetically) by symbol name.
Advantages :-
1. The lookup (i.e. searching) operation is simplified for an ordered symbol table.
Disadvantages: -
1. The insertion operation needs an average of (n+1)/2 record moves, as records are to be kept in alphabetic order.
Attributes of a symbol table: - Some of the attributes of the symbol table are:
Variable Name:- The name of a variable is a compulsory attribute of a symbol table, as the variable name helps in identifying a variable, which is required by the code generator and semantic analyzer.
Address:- Each variable in a program is associated with an object-code address. The address gives the relative location for a variable's value or values at run time. When a variable is first encountered or declared, its object-code address is entered into the symbol table; whenever the variable is referred to later in the source program, its object-code address is obtained from the symbol table.
Line Declared:- The number of the line at which the variable is declared; this integer value is used to populate the line declared column of the symbol table.
Lines Referenced:- If an already declared variable is referenced at some other lines in the program, then these line numbers, separated by commas, are indicated in the lines referenced attribute of the symbol table. This could be difficult to handle if the variable is referenced on many lines.
Link:- The link field is used to generate a cross-reference listing which is ordered alphabetically by variable name. If the cross-reference listing feature is not required in a compiler, then attributes like line declared and lines referenced can be deleted from the symbol table.
i     21   2   0   2   7,10        2
avg   45   1   0   4   6,9,10      1
x     53   1   1   3   5,7,14,15   3
Data of the target code can be stored in three storage areas. They are static,
stack and Heap.
The size of the generated target code is fixed at compile time, so the compiler
can place the executable target code in a statically determined area Code, usually in
the low end of memory. Similarly, the size of some program data objects, such as
global constants, and data generated by the compiler, such as information to support
garbage collection, may be known at compile time, and these data objects can be
placed in another statically determined area called Static. One reason for statically
allocating as many data objects as possible is that the addresses of these objects can
be compiled into the target code. In early versions of Fortran, all data objects could
be allocated statically.
To maximize the utilization of space at run time, the other two areas, Stack and
Heap, are at the opposite ends of the remainder of the address space. These areas are
dynamic; their size can change as the program executes. These areas grow towards
each other as needed. The stack is used to store data structures called activation
records that get generated during procedure calls. The stack grows towards lower
addresses, the heap towards higher.
Activation Records:- Procedure calls and returns are usually managed by a run-time
stack called the control stack. Each live procedure has an activation record on the
control stack. If one procedure calls another procedure, the latter procedure has its
activation record at the top of the stack.
SA
The contents of activation records vary with the language being implemented. The kinds of data that might appear in an activation record include temporary values, local data, saved machine status, an access link, a control link, actual parameters, and the returned value, though the exact layout varies with the language in general.
Memory allocation: - Allocating run-time storage can be tricky because the same name in a program text can refer to multiple locations at run
time. The two memory allocation techniques are
1. Static Memory Allocation
2. Dynamic Memory Allocation.
Static Memory Allocation:- The storage-allocation decision is static, if the storage
allocation is done at compile time.
Dynamic Memory Allocation:- The storage-allocation decision is dynamic, if the
storage allocation is done at execution time.
Many compilers use some combination of the following two strategies for
dynamic storage allocation. Non-Block structured languages uses Static Memory
Allocation.
Storage Allocation Schemes:- Depending upon where the activation records of the
procedures are stored, the storage allocation schemes are divided into three types.
They are
1. Static Allocation
2. Stack Allocation
3. Heap Allocation
Static Allocation:- Static allocation allocates memory for the activation record at compile time. The compiler uses the type of a variable to determine the storage required. The address assigned to each variable is fixed at compile time. FORTRAN uses activation records stored in the static data area.
Example:-
add( )
{
    ------
    average( )
}
average( )
{
    ------
}
Disadvantages:-
1. The size of the object should be known in advance.
2. Recursive procedures cannot be implemented in static allocation.
3. Dynamically created objects cant be used as the allocation is static.
Stack Allocation:- Almost all compilers for languages that use procedures, functions,
or methods use stack as a part of their run-time memory. Each time a procedure is
called, the activation record of the procedure is pushed onto a stack, and when the procedure terminates, that activation record is popped off the stack. This arrangement allows memory to be shared by procedure calls whose durations do not overlap in time.
Example:-
add( )
{ ------
 average( )
}
average( )
{ ------
 print( )
}
print( )
{ ------
}
Advantages:-
1. Recursion can be implemented.
2. Dynamically created objects can be used, as the allocation is on the stack.
Disadvantages:-
1. More time is spent in pushing and popping activation records.
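Stack allocation is what makes recursion possible: each active call has its own record. A minimal C sketch (the function name fact is illustrative):

```c
/* Each call to fact pushes a fresh activation record holding its own copy
   of n; this is why recursion requires stack, not static, allocation. */
int fact(int n) {
    if (n <= 1)
        return 1;
    return n * fact(n - 1);   /* the callee's record sits above the caller's */
}
```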
Heap Allocation:- The heap is the portion of the memory that is used for data that
lives indefinitely, or until the program explicitly deletes it. While local variables
typically become inaccessible when their procedures end, many languages enable us
to create objects or other data whose existence is not tied to the procedure that
creates them. For example, both C++ and Java use the new operator to create objects that may be passed from procedure to procedure, so they continue to exist long after the procedure that created them has terminated. Such objects are stored on a heap.
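A minimal C sketch of this behaviour, using malloc as the C analogue of new; the Node type and make_node are illustrative names, not from the text:

```c
#include <stdlib.h>

typedef struct Node { int value; } Node;

/* make_node's activation record disappears when it returns, but the Node it
   allocated lives on the heap until the program explicitly frees it. */
Node *make_node(int v) {
    Node *n = malloc(sizeof(Node));   /* heap storage, not tied to this call */
    if (n)
        n->value = v;
    return n;                          /* caller must eventually free it */
}
```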
2. Referencing deleted data is a dangling-reference error.
Dangling Reference in storage allocation:- A dangling reference occurs in static and stack storage allocation when a deallocated object is still referenced by an object in an activation record.
Ex:-
procedure add
{
 a,b,sum,*c:integer;
 sum=a+b;
 c=proc(b)
}
procedure proc(d: integer)
{
avg: integer;
avg=d/2;
return(&avg);
}
When the activation record of proc( ) is removed, its local variables are also deleted. After proc( ) terminates, control returns to the main program at the line c=proc(b). Here c is an integer pointer pointing to the location returned by proc( ). The proc( ) procedure returns the address of the variable avg, but avg is already deallocated. Pointer c thus points to already deallocated data, which is known as a dangling reference. The dangling-reference problem causes pointer c to point to:
a garbage value, if no other variable is allocated, or
some other location, if the space of avg was allocated to some other data.
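The same bug appears in C when a function returns the address of a local variable. A hedged sketch of the fix, assuming heap allocation is acceptable (proc_fixed is an illustrative name):

```c
#include <stdlib.h>

/* The dangling version would be:
 *     int *proc(int d) { int avg = d / 2; return &avg; }
 * which is a bug: avg vanishes when proc's activation record is popped.
 * proc_fixed avoids the dangling reference by placing the result on the heap. */
int *proc_fixed(int d) {
    int *avg = malloc(sizeof(int));   /* heap storage outlives the call */
    if (avg)
        *avg = d / 2;
    return avg;                        /* caller must free the result */
}
```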
as the value of the corresponding formal parameter. Uses of the formal parameter in
the code of the called program are implemented by following this pointer to the
location indicated by the caller. Changes to the formal parameter thus appear as
changes to the actual parameter. If the actual parameter is an expression, however,
then the expression is evaluated before the call, and its value stored in a location of
its own. Changes to the formal parameter change this location, but can have no
effect on the data of the caller. Call-by-reference is used for "ref" parameters in C++
and is an option in many other languages. It is almost essential when the formal
parameter is a large object, array, or structure.
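C itself passes parameters by value, but the pointer idiom sketched below simulates call-by-reference: changes through the pointer appear as changes to the caller's actual parameter. The function name increment is illustrative:

```c
/* Passing a pointer simulates call-by-reference in C: the callee follows
   the pointer to the location supplied by the caller. */
void increment(int *p) {
    *p = *p + 1;    /* updates the caller's variable, not a local copy */
}
```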
The third mechanism - call-by-name - was used in the early programming
language Algol 60. It requires that the callee execute as if the actual parameter were
substituted literally for the formal parameter in the code of the callee, as if the
formal parameter were a macro standing for the actual parameter (with renaming of
local names in the called procedure, to keep them distinct). When the actual
2. Reusability of memory can be achieved with the help of garbage collector.
Disadvantages:-
1. The execution of the program is stopped for some time when the garbage
collector is automatically invoked.
2. Sometimes a situation like thrashing may occur due to the garbage collector. Assume the garbage collector is called to obtain some free space, but almost all the nodes are referenced by external pointers. The garbage collector executes and returns only a small amount of space. The system then invokes the garbage collector again for more free space, and once again it returns very little. This happens repeatedly, so the garbage collector is executing almost all the time. This process is called thrashing. Thrashing must be avoided for better system performance.
Partitioning three-address instructions into basic blocks: - First, we determine those
instructions in the intermediate code that are leaders, that is, the first instructions in
the basic block. The rules for finding leaders are:
1. The first three-address instruction in the intermediate code is a leader.
2. Any instruction that is the target of a conditional or unconditional jump is a
leader.
3. Any instruction that immediately follows a conditional or unconditional jump
is a leader.
Then, for each leader, its basic block consists of itself and all instructions up to
but not including the next leader or the end of the intermediate program.
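The three leader rules can be sketched as follows, under an assumed representation in which each three-address instruction records whether it is a jump and, if so, the index of its target; the Instr type and find_leaders are illustrative names:

```c
#include <stdbool.h>

typedef struct {
    bool is_jump;      /* conditional or unconditional jump */
    int  target;       /* index of the jump target, valid when is_jump */
} Instr;

void find_leaders(const Instr *code, int n, bool *leader) {
    for (int i = 0; i < n; i++)
        leader[i] = false;
    if (n > 0)
        leader[0] = true;                      /* rule 1: first instruction */
    for (int i = 0; i < n; i++) {
        if (code[i].is_jump) {
            leader[code[i].target] = true;     /* rule 2: target of a jump */
            if (i + 1 < n)
                leader[i + 1] = true;          /* rule 3: follows a jump */
        }
    }
}
```

Applied to the fact( ) example below (indices 0-8, with the conditional jump at index 2 targeting 8 and the goto at index 7 targeting 2), the leaders come out as instructions 1, 3, 4, and 9 in the 1-based numbering used in the listing.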
First, convert the following source code into three-address code. Here we assume each array element occupies 8 bytes.
Source program:
int fact( x )
{
 int f = 1;
 i = 2;
 while( i <= x )
 {
  f = f * i;
  i = i + 1;
 }
 print(f);
}

Three-address code:
1) f = 1
2) i = 2
3) if( i > x ) goto 9
4) t1 = f * i
5) f = t1
6) t2 = i + 1
7) i = t2
8) goto 3
9) print(f)
Many compilers follow this strategy: they generate naive code and then improve the quality of the target code by applying "optimizing" transformations to the target program.
A simple but effective technique for locally improving the target code is
peephole optimization, which is done by examining a sliding window of target
instructions (called the peephole) and replacing instruction sequences within the
peephole by a shorter or faster sequence, whenever possible. Peephole optimization
can also be applied directly after intermediate code generation to improve the
intermediate representation.
The peephole is a small, sliding window on a program. The code in the peephole
need not be contiguous, although some implementations do require this. It is
characteristic of peephole optimization that each improvement may spawn
opportunities for additional improvements. In general, repeated passes over the target
code are necessary to get the maximum benefit. Some program transformations that
are characteristic of peephole optimizations:
Eliminating Unreachable Code:- Suppose debug has been set to 0 in the program; then consider the sequence
 if debug != 1 goto L2
 print debugging information
L2:
Now the argument of the first statement always evaluates to true, so the
statement can be replaced by goto L2. Then all statements that print debugging
information are unreachable and can be eliminated one at a time.
Flow-of-Control Optimizations:- Simple intermediate code-generation algorithms
frequently produce jumps to jumps, jumps to conditional jumps, or conditional jumps
to jumps. These unnecessary jumps can be eliminated in either the intermediate code
or the target code by the following types of peephole optimizations. We can replace
the sequence
 goto L1
 ...
L1: goto L2
by the sequence
 goto L2
 ...
L1: goto L2
If there are now no jumps to L1, then it may be possible to eliminate the
statement L1: goto L2 provided it is preceded by an unconditional jump.
Similarly, the sequence
 if a < b goto L1
 ------
L1: goto L2
can be replaced by the sequence
 if a < b goto L2
 ------
L1: goto L2
Such instructions can also be used in code for statements like x = x + 1.
Register Allocation and Assignment:- Instructions involving only register operands are faster than those involving memory operands. Therefore, efficient utilization of registers is vitally important in generating good code. One approach to register allocation and assignment is to assign specific values
in the target program to certain registers. For example, assign base addresses to one
group of registers, arithmetic computations to another, the top of the stack to a fixed
register, and so on. This approach has the advantage that it simplifies the design of a
code generator. Its disadvantage is that, applied too strictly, it uses registers
inefficiently; certain registers may go unused over substantial portions of code, while
unnecessary loads and stores are generated into the other registers. Nevertheless, it is
reasonable in most computing environments to reserve a few registers for base
registers, stack pointers, and allow the remaining registers to be used by the code
generator as it sees fit. The various techniques for register allocation are
Global Register Allocation:- The code generation algorithm used registers to hold
values for the duration of a single basic block. However, all live variables were stored
at the end of each block. To save some of these stores and corresponding loads, we
might arrange to assign registers to frequently used variables and keep these registers
consistent across block boundaries (globally). Since programs spend most of their
time in inner loops, a natural approach to global register assignment is to try to keep a
frequently used value in a fixed register throughout a loop. One strategy for global
register allocation is to assign some fixed number of registers to hold the most active
values in each inner loop. The selected values may be different in different loops.
Registers not already allocated may be used to hold values local to one block as in
Section. This approach has the drawback that the fixed number of registers is not
always the right number to make available for global register allocation.
Example: Consider the basic blocks in the inner loop as shown in figure and
calculate the usage counts of each variable and show what variables are
stored in global registers.
Assume registers R0, R1, and R2 are allocated to hold values throughout the loop.
Variables live on entry into and on exit from each block are shown in Fig.
To evaluate the usage count for x = a, we observe that a is live on exit from B1 and used in B2 and B3. In general, for a loop L,
usage count(x) = sum over blocks B in L of ( use(x, B) + 2 * live(x, B) ).
Thus
usage count for a = use in B2 + use in B3 + 2*live from B1 = 4
usage count for b = use in B1 + 2*live from B4 + 2*live from B3 = 5
usage count for c = use in B1 + use in B3 + use in B4 = 3
usage count for d = use in B1 + use in B2 + use in B3 + use in B4 + 2*live from B1 = 6
If an outer loop L1 contains an inner loop L2, the register allocation is as follows. If a variable x is allocated a register in L2, it need not be allocated a register in L1 - L2. If we allocate x a register in L2 but not in L1, we must load x on entrance to L2 and store x on exit from L2.
Register Allocation by Graph Coloring:- When a register is needed for a computation
but all available registers are in use, the contents of one of the used registers must be
stored (spilled) into a memory location in order to free up a register. Graph coloring is
a simple, systematic technique for allocating registers and managing register spills.
In the method, two passes are used. In the first, target-machine instructions are
selected as though there are an infinite number of symbolic registers. Once the
instructions have been selected, a second pass assigns physical registers to symbolic
ones. The goal is to find an assignment that minimizes the cost of spills. In the second pass, a register-interference graph (RIG) is constructed: there is a node for each temporary, and an edge between any two temporaries if they are live simultaneously at some point in the program. Two temporaries can be allocated to the same register if there is no edge connecting them.
4. Now all the nodes have fewer than four neighbours, so remove all of them and add them to the stack. Thus S = { f, e, c, b, d, a }.
5. Start assigning colours to f, e, c, b, d, a. As k = 4, the graph can be coloured with a minimum of 4 colours. Assign a colour to each node by checking the adjacent coloured nodes. Repeat the process until all the nodes are coloured.
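The colouring step can be sketched in C with a greedy variant of the select phase; the adjacency-matrix representation, k = 2, and the name color_graph are illustrative, and a full allocator would also implement the simplify and spill machinery:

```c
#include <stdbool.h>

#define N 4   /* number of temporaries in this illustrative graph */
#define K 2   /* available registers (colours) */

/* Give each node the lowest colour not used by an already-coloured
   neighbour; returns false if some node cannot be coloured, meaning a
   temporary would have to be spilled to memory. */
bool color_graph(const bool adj[N][N], int color[N]) {
    for (int v = 0; v < N; v++)
        color[v] = -1;
    for (int v = 0; v < N; v++) {
        bool used[K] = {false};
        for (int u = 0; u < N; u++)
            if (adj[v][u] && color[u] >= 0)
                used[color[u]] = true;         /* colour taken by neighbour */
        int c = 0;
        while (c < K && used[c])
            c++;
        if (c == K)
            return false;                      /* spill needed */
        color[v] = c;
    }
    return true;
}
```

For a chain of four temporaries t0-t1-t2-t3 (each live together only with its neighbours), two registers suffice, since adjacent nodes simply alternate colours.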
induction variables, elimination of common sub-expressions and replacement
of compile time computations.
3. At the target code level, the compiler optimizes on choosing proper machine
resources. This includes the usage of registers for heavily used variables,
choosing suitable addressing modes for the target machine and peephole
optimizations. The richest source of optimization is in the efficient use of
registers and instruction set of a machine.
The properties of code optimization are listed below:
1. The transformation should preserve the meaning of programs, i.e. optimization should not change the output of the program or produce an error.
2. The transformation should improve the speed efficiency of the program and/or
reduce the space occupied by the program.
example b+c is not a common expression, since one of its operands has been changed before the expression is used again. In the second example r2+r3 is a common expression, so its result is stored in a new variable temp, which is assigned to r4 instead of recomputing the expression r2+r3.
Copy propagation:- Statements of the form f := g are called copy statements or copies. When common expressions are eliminated, copy statements are introduced, and hence they have to be eliminated. We can use g instead of f after the copy statement.
Example:
 x[i] = a;            x[i] = a;
 ----------           ----------
 sum = x[i] + a;      sum = a + a;
we can use ‘a’ instead of x[i] in further calculations. So in the statement sum = x[i] +
a, x[i] is replaced with ‘a’ which produces the statement sum = a + a;
Elimination of dead code:- A piece of code is dead if it is not reachable or the values it computes are never used anywhere in the program; such code can be removed from the program safely. An assignment to a variable is dead code if the value of this variable is not used in the subsequent program, or if there is always another assignment to the same variable before its value is used.
Copy propagation often makes copy statements into dead code which can be
easily eliminated. In the above example as x[i]=a is a copy statement we replaced
x[i] with ‘a’. If x[i] has no further use in the program then x[i] = a becomes a dead
statement and can be eliminated.
Loop optimization:- The major source of code optimization is loops, especially the
inner loops. Most of the run-time is spent inside the loops which can be reduced by
reducing the number of instructions in an inner loop. Important techniques of loop
optimization are
1. Code motion
2. Elimination of induction variables
3. Strength reduction.
Code motion: - Code motion reduces the number of instructions in a loop by moving
some loop-invariant instructions outside a loop. Loop-invariant computations are those instructions or expressions that result in the same value independent of the number of times a loop is executed. Loop-invariant instructions inside the loop are identified and moved to the beginning of the loop.
The invariant computation max-2 is moved before the loop and its result is used in the loop. This eliminates the necessity of calculating max-2 every time the loop repeats.
In the above example there are three induction variables i,j and k which take
on the values 1,2,3, ... , 10 each time through the loop. Suppose that the values of
variables j and k are not used after the end of the loop then we can eliminate them
from the function fun ( ) by replacing them by variable i.
Strength Reduction: - Strength Reduction is the process of replacing expensive
operations by their equivalent cheaper operations on the target machine. On many
machines a multiplication operation takes more time than addition. On such
machines the speed of the object code can be increased by replacing a multiplication
by an addition.
For example, when we want to calculate multiples of 2, we can use a multiplication in the loop to compute the ith multiple. But the same result can be obtained through addition, since multiplication is nothing but repeated addition. This replaces the expensive multiplication operation by its equivalent, less expensive addition operations.
Before strength reduction:
i=1;
while ( i <= 10 )
{
 prod=2*i;
 i = i + 1;
 -----------
}

After strength reduction:
i=1;
prod=0;
while ( i <= 10 )
{
 prod=prod+2;
 i = i + 1;
 -----------
}
Frequency reduction:-
Loop unrolling:- In order to reduce the number of iterations of a loop, the body of the loop is duplicated.
Before unrolling:
i=1;
while ( i <= n )
{
 a[i]=b[i];
 i = i + 1;
 -----------
}

After unrolling:
i=1;
while ( i <= n )
{
 a[i]=b[i];
 i = i + 1;
 a[i]=b[i];
 i = i + 1;
}
In the above example we are transferring the elements of array b into array a. In the unrolled version we transfer two elements per iteration, halving the number of iterations.
Folding:- Constant folding is a third optimization technique that evaluates constant expressions at compile time and replaces such expressions by their computed values. For example, the constant expression 3 * 3 can be replaced by 9 at compile time. Often the use of symbolic constants results in constant expressions.
Before folding:
i=1;
while ( i <= 10 )
{
 prod=2*2;
 i = i + 1;
 -----------
}

After folding:
i=1;
while ( i <= 10 )
{
 prod=4;
 i = i + 1;
 -----------
}
We start the construction of the DAG from the first statement, a = b + c. Since b and c are defined elsewhere and used in this block, they are designated b0 and c0. For the expression a = b + c, the operator + becomes the root, and b0 and c0 become the left and right children respectively. As the result of the expression is stored in a, the node + is labeled a. Similarly we repeat the process for the second statement. For the third statement, c = b + c, the use of b refers to the node labeled - , because that is the most recent definition of b. The node corresponding to the fourth statement, d = a - d, has the operator - and the nodes with attached variables a and d0 as children. Since the operator and the children are the same as those for the node corresponding to statement two, we do not create this node, but add d to the list of definitions for the node labeled -.
Applications of DAG:- The DAG representation of a basic block lets us perform
several code improving transformations on the code represented by the block.
1. We can eliminate local common sub expressions, that is, instructions that
compute a value that has already been computed.
When we generate the code from the DAG, the common expressions are eliminated and statements are automatically reordered as shown in fig 3.
4. We can apply algebraic laws to reorder operands of three-address instructions,
and sometimes there by simplify the computation.
The DAG-construction process can help us to apply general algebraic
transformations such as commutativity and associativity. For example,
suppose the language reference manual specifies that * is commutative; that
is, x* y = y*x. Before we create a new node labeled * with left child M and
right child N, we always check whether such a node already exists. However,
a = b * c
d = a + b
e = c * b
[DAG: a * node labeled a, e with children b0 and c0, and a + node labeled d whose children are the * node and b0]
In the above expression we have constructed the DAG for the first
expression in a normal manner. When considering the second expression we
have to construct the node + whose left child is a and right child is b. As we know a + b = b + a, we reorder the operands to make use of the already existing
node a. During the third statement we know * is commutative so we have
We can avoid recomputing an expression E by assigning the result of E to a variable x and using x in place of E, provided the operands of E have not changed in the interim.
Consider the TAC in the above figure. In block B1, 4*k is computed and is an available expression at B2. Block B2 has the same computation, 4*k, which can therefore be eliminated.
t2=a[m] is a copy statement. So after this statement we can use t2 instead of a[m].
Dead code consists of statements which compute values that never get used. While the programmer is unlikely to introduce any dead code intentionally, it may appear as the result of previous transformations.
In the first fig there are copy statements and when they are eliminated we get
figure 2. In figure 2 t1 has assigned a value of m and is not further used, so it
becomes a dead code and can be eliminated. Similarly t5 has assigned a value of m
and is not further used, so it becomes a dead code and can be eliminated as shown in
fig 3.
Elimination of induction variables:- Induction variables are loop variables that change their value every time the loop repeats, i.e. they either get incremented or decremented. Remove unnecessary induction variables from the loop by substituting their uses with another basic induction variable.
In the above example r1 and r2 are two induction variables which computes
the same value every time the loop repeats. So use one induction variable i.e r2
instead of r1. Thus the fig 2 consists of only one induction variable.
Procedure Inlining:- Procedure inlining, the replacement of a procedure call by the body of the procedure, is particularly useful in code optimization. This method speeds up execution when the procedures are simple.
Normally when the procedure is called the calling program is stopped and the
procedure is copied on to the main memory and then executes the procedure. When
the procedure is terminated the calling program continues its execution. So there will
be a lot of internal work to be done when a procedure is called and terminated.
Now if there are many calls to that procedure, and the procedure contains only a few lines of code, then such jumping between memory locations becomes a performance overhead. It ultimately slows down the execution of the program. Hence the procedure body is expanded in place of the call, so the whole code is available continuously and this overhead is avoided.
When an inline procedure is called 5 times the code is copied into the program
5 times which avoids jump to the procedure. The code size may slightly increase but
the performance of the compiler may be improved.
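A minimal C sketch of an inlining candidate; the function names square and sum_of_squares are illustrative, and the inline keyword only suggests (does not force) expansion to the compiler:

```c
/* square's body is tiny, so expanding it at each call site removes the
   call/return overhead at a small cost in code size. */
static inline int square(int x) {
    return x * x;
}

int sum_of_squares(int a, int b) {
    return square(a) + square(b);   /* the compiler may expand both calls inline */
}
```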
t1 := a + b    MOV a, R0
               ADD b, R0
t2 := c - d    MOV c, R1
               SUB d, R1
t3 := e + t2   MOV R0, t1
               MOV e, R0
               ADD R0, R1
t4 := t1 + t3  MOV t1, R0
               ADD R1, R0
               MOV R0, t4
Now if we change the ordering sequence of the above three address code.
t2 := c - d    MOV c, R0
               SUB d, R0
t3 := e + t2   MOV e, R1
               ADD R0, R1
t1 := a + b    MOV a, R0
               ADD b, R0
t4 := t1 + t3  ADD R1, R0
               MOV R0, t4
In the first case the assembly code contains 10 lines. After rearranging the three-address code sequence, the assembly code contains 8 lines. So by rearranging the sequence of instructions we can generate efficient code using a minimum number of registers. Thus here, an optimal order means the order that yields the shortest instruction sequence.
Now we apply the transformations on block B5 and B6.
In block B5 there are common subexpressions 4 * i and 4 * j; we will remove these common subexpressions and the code will be
Now we will apply the global transformations once again on B5. As val contains a[t2], and a[t2] is already stored in t3, we can replace val by t3. Similarly, as a[t4] is already computed in block B3 and its value is stored in t5, we can eliminate t9. The optimized block will then be
Now we will apply the global transformations once again on B6. As a[t2] is already stored in variable t3, the optimized block B6 will then be