
9. Chomsky Normal Form
A grammar where every production is either of the form A → BC or A → c (where A, B, C are arbitrary variables and c an arbitrary terminal symbol).
Example:
S → AS | a
A → SA | b
(If the language contains ε, then we allow S → ε, where S is the start symbol, and forbid S on the RHS.)


Why Chomsky Normal Form?
The key advantage is that in Chomsky Normal Form, every derivation of a string of n letters has exactly 2n − 1 steps.
Thus: one can determine if a string is in the language by exhaustive search of all derivations.
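For instance, with the example grammar above, the two-letter string ba is derived in exactly 2·2 − 1 = 3 steps: S ⇒ AS ⇒ bS ⇒ ba. In general, n − 1 steps of the form A → BC build up a string of n variables, and n steps of the form A → c turn each variable into a letter, giving (n − 1) + n = 2n − 1 steps.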
Conversion
The conversion to Chomsky Normal Form has four main steps:
1. Get rid of all ε-productions.
2. Get rid of all productions where the RHS is one variable.
3. Replace every production that is too long by shorter productions.
4. Move all terminals to productions where the RHS is one terminal.
1) Eliminate ε-Productions
Determine the nullable variables (those that generate ε) (algorithm given earlier). Go through all productions, and for each, omit every possible subset of nullable variables. For example, if P → AxB with both A and B nullable, add productions P → xB | Ax | x. After this, delete all productions with empty RHS.
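The nullable-variable computation referred to above can be sketched as a fixed-point iteration. The following is only an illustration, not the algorithm from the earlier notes; the grammar encoding (uppercase variables, lowercase terminals, productions in a multimap, "" for ε) is an assumption made for this sketch.

#include <iostream>
#include <map>
#include <set>
#include <string>

// Productions: uppercase = variable, lowercase = terminal,
// and the empty string "" stands for an epsilon right-hand side.
using Grammar = std::multimap<char, std::string>;

// A variable is nullable if some production rewrites it to epsilon or to
// a string made entirely of nullable variables; iterate to a fixed point.
std::set<char> nullableVariables(const Grammar& g) {
    std::set<char> nullable;
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& [lhs, rhs] : g) {
            if (nullable.count(lhs)) continue;
            bool allNullable = true;
            for (char sym : rhs)
                if (!nullable.count(sym)) { allNullable = false; break; }
            if (allNullable) {          // also true when rhs is "" (epsilon)
                nullable.insert(lhs);
                changed = true;
            }
        }
    }
    return nullable;
}

int main() {
    // The worked example below: S -> aXbX, X -> aY | bY | eps, Y -> X | c
    Grammar g = {{'S', "aXbX"}, {'X', "aY"}, {'X', "bY"}, {'X', ""},
                 {'Y', "X"},    {'Y', "c"}};
    for (char v : nullableVariables(g)) std::cout << v << '\n';  // X then Y
}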
2) Eliminate Variable Unit Productions
A unit production is one where the RHS has only one symbol. Consider a production A → B. Then for every production B → α, add the production A → α. Repeat until done (but don't re-create a unit production already deleted).
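A sketch of this step in the same encoding as above (again only an illustration; computing the closure of unit pairs first is one standard way to avoid re-creating deleted unit productions):

#include <cctype>
#include <iostream>
#include <set>
#include <string>
#include <utility>

using Prod = std::pair<char, std::string>;   // variable -> right-hand side

// A unit production has a single variable (uppercase letter) on the RHS.
static bool isUnit(const Prod& p) {
    return p.second.size() == 1 && std::isupper((unsigned char)p.second[0]);
}

// For every chain A =>* B built from unit productions and every non-unit
// production B -> alpha, emit A -> alpha; no unit production survives.
std::set<Prod> eliminateUnits(const std::set<Prod>& prods) {
    std::set<std::pair<char, char>> unitPairs;   // (A, B): A =>* B via units
    for (const Prod& p : prods) unitPairs.insert({p.first, p.first});
    bool changed = true;
    while (changed) {                            // transitive closure
        changed = false;
        for (const Prod& p : prods)
            if (isUnit(p))
                for (auto [a, b] : unitPairs)
                    if (b == p.first &&
                        unitPairs.insert({a, p.second[0]}).second)
                        changed = true;
    }
    std::set<Prod> result;
    for (auto [a, b] : unitPairs)
        for (const Prod& p : prods)
            if (p.first == b && !isUnit(p)) result.insert({a, p.second});
    return result;
}

int main() {
    // The worked example below, after step 1; Y -> X is the unit production.
    std::set<Prod> g = {{'S', "aXbX"}, {'S', "abX"}, {'S', "aXb"}, {'S', "ab"},
                        {'X', "aY"},   {'X', "bY"},  {'X', "a"},   {'X', "b"},
                        {'Y', "X"},    {'Y', "c"}};
    for (const Prod& p : eliminateUnits(g))      // Y gains aY | bY | a | b
        std::cout << p.first << " -> " << p.second << '\n';
}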
3) Replace Long Productions by Shorter Ones
For example, if we have the production A → BCD, then replace it with A → BE and E → CD. (In theory this introduces many new variables, but one can re-use variables if careful.)
4) Move Terminals to Unit Productions
For every terminal on the right of a non-unit production, add a substitute variable. For example, replace the production A → bC with the productions A → BC and B → b.
Example
Consider the CFG:
S → aXbX
X → aY | bY | ε
Y → X | c
The variable X is nullable, and so therefore is Y.
After elimination of ε, we obtain:
S → aXbX | abX | aXb | ab
X → aY | bY | a | b
Y → X | c

Example: Step 2
After elimination of the unit production Y → X, we obtain:
S → aXbX | abX | aXb | ab
X → aY | bY | a | b
Y → aY | bY | a | b | c
Example: Steps 3 & 4
Now, break up the RHSs of S, and replace a by A, b by B and c by C wherever not units:
S → EF | AF | EB | AB
X → AY | BY | a | b
Y → AY | BY | a | b | c
E → AX
F → BX
A → a
B → b
C → c
There are special forms for CFGs such as Chomsky Normal Form, where every production has the form A → BC or A → c. The algorithm to convert to this form involves (1) determining all nullable variables and getting rid of all ε-productions, (2) getting rid of all variable unit productions, (3) breaking up long productions, and (4) moving terminals to unit productions.

10. In computer science, an ambiguous grammar is a context-free grammar for which there exists a string that can have more than one leftmost derivation, while an unambiguous grammar is a context-free
grammar for which every valid string has a unique leftmost derivation. Many languages admit both ambiguous
and unambiguous grammars, while some languages admit only ambiguous grammars. Any non-empty language
admits an ambiguous grammar by taking an unambiguous grammar and introducing a duplicate rule or
synonym (the only language without ambiguous grammars is the empty language). A language that only admits
ambiguous grammars is called an inherently ambiguous language, and there are inherently ambiguous context-free languages. Deterministic context-free grammars are always unambiguous, and are an important subclass of
unambiguous CFGs; there are non-deterministic unambiguous CFGs, however.
For real-world programming languages, the reference CFG is often ambiguous, due to issues such as
the dangling else problem. If present, these ambiguities are generally resolved by adding precedence rules or
other context-sensitive parsing rules, so the overall phrase grammar is unambiguous.
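A classic source of such ambiguity is the dangling else. In the C++ fragment below, the grammar alone would allow the else to attach to either if; the language resolves this with the rule that an else binds to the nearest unmatched if:

#include <iostream>

void classify(bool a, bool b) {
    // The phrase grammar alone would let this 'else' attach to either 'if';
    // C and C++ resolve the ambiguity: 'else' binds to the nearest 'if'.
    if (a)
        if (b)
            std::cout << "a and b\n";
        else                          // belongs to 'if (b)', whatever
            std::cout << "a only\n";  // the indentation suggests
}

int main() { classify(true, false); }   // prints "a only"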
Recognizing ambiguous grammars
The general decision problem of whether a grammar is ambiguous is undecidable, because it can be shown to be equivalent to the Post correspondence problem.[1] There are, however, tools that implement semi-decision procedures for detecting the ambiguity of context-free grammars.[2]
The efficiency of context-free grammar parsing is determined by the automaton that accepts it. Deterministic context-free grammars are accepted by deterministic pushdown automata and can be parsed in linear time, for example by the LR parser.[3] They are a subset of the context-free grammars, which are accepted by pushdown automata and can be parsed in polynomial time, for example by the CYK algorithm.
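As an illustration of polynomial-time membership testing on a grammar in Chomsky Normal Form, here is a minimal CYK sketch (the encoding of variables as uppercase letters and terminals as lowercase letters is an assumption of this sketch):

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A CNF grammar: terminal rules A -> c and binary rules A -> BC.
struct CNF {
    std::multimap<char, char> unit;                     // A -> c
    std::multimap<char, std::pair<char, char>> binary;  // A -> BC
};

// table[len-1][i] holds the variables deriving the substring of w
// of length len starting at position i; fill by increasing length.
bool cykMember(const CNF& g, const std::string& w, char start) {
    size_t n = w.size();
    if (n == 0) return false;   // a possible S -> eps is handled separately
    std::vector<std::vector<std::set<char>>> table(
        n, std::vector<std::set<char>>(n));
    for (size_t i = 0; i < n; ++i)
        for (const auto& [A, c] : g.unit)
            if (c == w[i]) table[0][i].insert(A);
    for (size_t len = 2; len <= n; ++len)
        for (size_t i = 0; i + len <= n; ++i)
            for (size_t k = 1; k < len; ++k)            // split point
                for (const auto& [A, bc] : g.binary)
                    if (table[k - 1][i].count(bc.first) &&
                        table[len - k - 1][i + k].count(bc.second))
                        table[len - 1][i].insert(A);
    return table[n - 1][0].count(start) > 0;
}

int main() {
    // The CNF grammar from the start of these notes:
    // S -> AS | a, A -> SA | b
    CNF g;
    g.unit   = {{'S', 'a'}, {'A', 'b'}};
    g.binary = {{'S', {'A', 'S'}}, {'A', {'S', 'A'}}};
    std::cout << cykMember(g, "ba", 'S')    // 1: S => AS => bS => ba
              << cykMember(g, "bb", 'S')    // 0
              << '\n';
}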
Inherently ambiguous languages
The existence of inherently ambiguous languages was proven with Parikh's theorem in 1961 by Rohit Parikh in
an MIT research report.[5]
While some context-free languages (the set of strings that can be generated by a grammar) have both ambiguous
and unambiguous grammars, there exist context-free languages for which no unambiguous context-free
grammar can exist. An example of an inherently ambiguous language is the union of { a^n b^m c^m d^n | n, m > 0 } with { a^n b^n c^m d^m | n, m > 0 }. This set is context-free, since the union of two context-free languages is always context-free. But Hopcroft & Ullman (1979) give a proof that there is no way to unambiguously parse strings in the (non-context-free) subset { a^n b^n c^n d^n | n > 0 }, which is the intersection of these two languages.
Interpreter vs. Compiler
Interpreter: No intermediate object code is generated; hence interpreters are memory efficient.
Compiler: Generates intermediate object code, which further requires linking; hence requires more memory.
Interpreter: Continues translating the program until the first error is met, in which case it stops. Hence debugging is easy.
Compiler: Generates the error message only after scanning the whole program. Hence debugging is comparatively hard.
Interpreter: Programming languages like Python and Ruby use interpreters.
Compiler: Programming languages like C and C++ use compilers.

5. Error handling is concerned with failures due to many causes: errors in the compiler or its environment (hardware, operating system), design errors in the program being compiled, an incomplete understanding of the source language, transcription errors, incorrect data, etc. The tasks of the error handling process are to detect each error, report it to the user, and possibly make some repair to allow processing to continue.
The purpose of error handling is to aid the programmer by highlighting inconsistencies. It has a low frequency in comparison with other compiler tasks, and hence the time required to complete it is largely irrelevant, but it cannot be regarded as an 'add-on' feature of a compiler. Its influence upon the overall design is pervasive, and it is a necessary debugging tool during construction of the compiler itself. Proper design and implementation of an error handler, however, depends strongly upon complete understanding of the compilation process. This is why we have deferred consideration of error handling until now.
11.1 Terminology
Errors can roughly be divided into four different types:
Compile-time errors
Logical errors
Run-time errors
Generated errors
A compile-time error, for example a syntax error, should not cause much trouble as it is caught by the compiler.
A logical error is when a program does not behave as intended, but does not crash. An example could be that
nothing happens when a button in a graphical user interface is clicked.
A run-time error is when a crash occurs. An example could be when an operator is applied to arguments of the
wrong type. The Erlang programming language has built-in features for handling of run-time errors.
A run-time error can also be emulated by calling erlang:error(Reason) or erlang:error(Reason, Args) (those
appeared in Erlang 5.4/OTP-R10).
A run-time error is another name for an exception of class error.

A generated error is when the code itself calls exit/1 or throw/1. Note that emulated run-time errors are not
denoted as generated errors here.
Generated errors are exceptions of classes exit and throw.
When a run-time error or generated error occurs in Erlang, execution for the process which evaluated the
erroneous expression is stopped. This is referred to as a failure, that execution or evaluation fails, or that the
process fails, terminates or exits. Note that a process may terminate/exit for other reasons than a failure.
A process that terminates will emit an exit signal with an exit reason that says something about which error has
occurred. Normally, some information about the error will be printed to the terminal.
Exceptions
Exceptions are run-time errors or generated errors and are of three different classes, with different origins.
The try expression (appeared in Erlang 5.4/OTP-R10B) can distinguish between the different classes, whereas
the catch expression can not. They are described in the Expressions chapter.
Class   Origin
error   Run-time error, for example 1+a, or the process called erlang:error/1,2 (appeared in Erlang 5.4/OTP-R10B)
exit    The process called exit/1
throw   The process called throw/1
Table 11.1: Exception Classes.
An exception consists of its class, an exit reason (the Exit Reason), and a stack trace (that aids in finding the code
location of the exception).
The stack trace can be retrieved using erlang:get_stacktrace/0 (new in Erlang 5.4/OTP-R10B) from within
a try expression, and is returned for exceptions of class error from a catch expression. An exception of class error is also known as a run-time error.

4. A context-free grammar is a finite set of variables (nonterminals), each of which represents a language. The languages represented by the nonterminals are described recursively in terms of each other and primitive symbols called terminals. The rules relating the nonterminals are called productions. A production states that the language associated with a given nonterminal contains strings that are formed by concatenating strings from the languages of certain other nonterminals, possibly along with some terminals.
Regular expressions cannot check balancing tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG), which is recognized by push-down automata.
CFG is a superset of Regular Grammar: every Regular Grammar is also context-free, but there exist problems that are beyond the scope of Regular Grammar. CFG is a helpful tool in describing the syntax of programming languages.
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce terminologies used in
parsing technology.
A context-free grammar has four components:
A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated by the grammar.
A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which strings are formed.
A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
One of the non-terminals is designated as the start symbol (S), from where the production begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start symbol) by the right side of a production for that non-terminal.
Example
We take the problem of the palindrome language, which cannot be described by means of a Regular Expression. That is, L = { w | w = w^R } is not a regular language. But it can be described by means of a CFG, as illustrated below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → ε | Z → 0Q0 | N → 1Q1 }
S = { Q }
This grammar describes the palindrome language, with strings such as 1001, 11100111, 00100, 1010101, 11111, etc.
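For example, the string 1001 is derived as Q ⇒ N ⇒ 1Q1 ⇒ 1Z1 ⇒ 10Q01 ⇒ 1001, using Q → ε in the final step.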

3. A program as source code is merely a collection of text (code, statements, etc.), and to make it alive, it requires actions to be performed on the target machine. A program needs memory resources to execute instructions. A program contains names for procedures, identifiers, etc., that require mapping with the actual memory location at runtime.
By runtime, we mean a program in execution. The runtime environment is a state of the target machine, which may include software libraries, environment variables, etc., to provide services to the processes running in the system.
Dynamic Memory Allocation
Allocating memory
There are two ways that memory gets allocated for data storage:
1. Compile Time (or static) Allocation
o Memory for named variables is allocated by the compiler.
o Exact size and type of storage must be known at compile time.
o For standard array declarations, this is why the size has to be constant.
2. Dynamic Memory Allocation
o Memory is allocated "on the fly" during run time.
o Dynamically allocated space is usually placed in a program segment known as the heap or the free store.
o The exact amount of space or number of items does not have to be known by the compiler in advance.
o For dynamic memory allocation, pointers are crucial.

Dynamic Memory Allocation
We can dynamically allocate storage space while the program is running, but we cannot create new variable names "on the fly". For this reason, dynamic allocation requires two steps:
1. Creating the dynamic space.
2. Storing its address in a pointer (so that the space can be accessed).
To dynamically allocate memory in C++, we use the new operator.
De-allocation:
o Deallocation is the "clean-up" of space being used for variables or other data storage.
o Compile-time variables are automatically deallocated based on their known extent (this is the same as scope for "automatic" variables).
o It is the programmer's job to deallocate dynamically created space.
o To de-allocate dynamic memory, we use the delete operator.
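A minimal sketch of the two steps described above (create the space, hold its address in a pointer) and the matching de-allocation:

#include <iostream>

int main() {
    // Create the dynamic space and store its address in a pointer,
    // since dynamically created space has no name of its own.
    int* single = new int(42);    // one int, initialized to 42
    int* block  = new int[5]{};   // five ints, zero-initialized

    std::cout << *single << ' ' << block[0] << '\n';   // prints: 42 0

    // De-allocation of dynamic space is the programmer's job.
    delete single;     // matches new
    delete[] block;    // matches new[]
}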

Dynamic and Static Memory


Readers who are familiar with the concepts of dynamic memory and pointers may wish to skip to the next section of this chapter. In this one, the general concepts of static and dynamic memory are outlined. How these issues are specifically handled in Modula-2 is not taken up again until after the rest of the preliminary discussions are complete.
The distinction made in this section is based on the timing and manner for the setting aside of memory
for the use of a program. Some memory use is predetermined by the compiler and will always be set aside for
the program in exactly the same manner at the beginning of every run of a given program. Other memory is
obtained by the program at various points during the time it is running. This memory may only be used by the
program temporarily and then released for other uses.
The allocation of memory for the specific fixed purposes of a program in a predetermined fashion controlled by
the compiler is said to be static memory allocation.
The allocation of memory (and possibly its later deallocation) during the running of a program and under the
control of the program is said to be dynamic memory allocation.
Heap Allocation
Variables local to a procedure are allocated and de-allocated only at runtime. Heap allocation is used to dynamically allocate memory to the variables and claim it back when the variables are no longer required.
Unlike the statically allocated memory area, both stack and heap memory can grow and shrink dynamically and unexpectedly. Therefore, they cannot be provided with a fixed amount of memory in the system.
In a typical layout, the text part of the code is allocated a fixed amount of memory, while stack and heap memory are arranged at the extremes of the total memory allocated to the program; both shrink and grow against each other.
Parameter Passing
The communication medium among procedures is known as parameter passing. The values of the variables
from a calling procedure are transferred to the called procedure by some mechanism. Before moving ahead,
first go through some basic terminologies pertaining to the values in a program.

2. The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong to the language in hand. It searches for the pattern defined by the language rules.
Regular expressions have the capability to express finite languages by defining a pattern for finite strings of
symbols. The grammar defined by regular expressions is known as regular grammar. The language defined by
regular grammar is known as regular language.
Regular expression is an important notation for specifying patterns. Each pattern matches a set of strings, so
regular expressions serve as names for a set of strings. Programming language tokens can be described by
regular languages. The specification of regular expressions is an example of a recursive definition. Regular
languages are easy to understand and have efficient implementation.
There are a number of algebraic laws that are obeyed by regular expressions, which can be used to manipulate
regular expressions into equivalent forms.
Operations
The various operations on languages are:
Union of two languages L and M is written as
L ∪ M = { s | s is in L or s is in M }
Concatenation of two languages L and M is written as
LM = { st | s is in L and t is in M }
The Kleene Closure of a language L is written as
L* = zero or more occurrences of language L.
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
Union : (r)|(s) is a regular expression denoting L(r) ∪ L(s)
Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
Kleene closure : (r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
Precedence and Associativity
*, concatenation (.), and | (pipe sign) are left associative.
* has the highest precedence.
Concatenation (.) has the second highest precedence.
| (pipe sign) has the lowest precedence of all.
Representing valid tokens of a language in regular expressions
If x is a regular expression, then:
x* means zero or more occurrences of x,
i.e., it can generate { ε, x, xx, xxx, xxxx, ... }
x+ means one or more occurrences of x,
i.e., it can generate { x, xx, xxx, xxxx, ... }, that is, x.x*
x? means at most one occurrence of x,
i.e., it can generate either { x } or { ε }.
[a-z] is all lower-case letters of the English language.
[A-Z] is all upper-case letters of the English language.
[0-9] is all natural digits used in mathematics.
Representing occurrence of symbols using regular expressions
letter = [a-z] or [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
Representing language tokens using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
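As a sketch, the two token patterns above can be tried out with C++'s std::regex; the character classes below are an illustrative transcription of the definitions above, not part of the original specification:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Decimal    = (sign)? (digit)+
    // Identifier = (letter) (letter | digit)*
    std::regex decimal("[+-]?[0-9]+");
    std::regex identifier("[A-Za-z][A-Za-z0-9]*");

    for (std::string lexeme : {"-42", "count1", "1count"}) {
        if (std::regex_match(lexeme, decimal))
            std::cout << lexeme << ": Decimal\n";
        else if (std::regex_match(lexeme, identifier))
            std::cout << lexeme << ": Identifier\n";
        else
            std::cout << lexeme << ": no match\n";   // 1count matches neither
    }
}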
The only problem left with the lexical analyzer is how to verify the validity of a regular expression used in
specifying the patterns of keywords of a language. A well-accepted solution is to use finite automata for
verification.

1. A symbol table is an important data structure created and maintained by compilers in order to store information about the occurrence of various entities such as variable names, function names, objects, classes, interfaces, etc. The symbol table is used by both the analysis and the synthesis parts of a compiler.
A symbol table may serve the following purposes depending upon the language in hand:
To store the names of all entities in a structured form at one place.
To verify if a variable has been declared.
To implement type checking, by verifying that assignments and expressions in the source code are semantically correct.
To determine the scope of a name (scope resolution).
A symbol table is simply a table which can be either linear or a hash table. It maintains an entry for each name
in the following format:
<symbol name, type, attribute>
For example, if a symbol table has to store information about the following variable declaration:
static int interest;
then it should store the entry such as:
<interest, int, static>
The attribute clause contains the entries related to the name.
Implementation

If a compiler is to handle a small amount of data, then the symbol table can be implemented as an unordered list, which is easy to code, but it is only suitable for small tables. A symbol table can be implemented in one of the following ways:
Linear (sorted or unsorted) list
Binary Search Tree
Hash table
Among all, symbol tables are mostly implemented as hash tables, where the source code symbol itself is treated
as a key for the hash function and the return value is the information about the symbol.
Operations
A symbol table, either linear or hash, should provide the following operations.
insert()
This operation is more frequently used by the analysis phase, i.e., the first half of the compiler where tokens are
identified and names are stored in the table. This operation is used to add information in the symbol table
about unique names occurring in the source code. The format or structure in which the names are stored
depends upon the compiler in hand.
An attribute for a symbol in the source code is the information associated with that symbol. This information
contains the value, state, scope, and type about the symbol. The insert() function takes the symbol and its
attributes as arguments and stores the information in the symbol table.
For example:
int a;
should be processed by the compiler as:
insert(a, int);
lookup()
The lookup() operation is used to search a name in the symbol table to determine:
whether the symbol exists in the table;
whether it is declared before being used;
whether the name is used in the scope;
whether the symbol is initialized;
whether the symbol is declared multiple times.
The format of the lookup() function varies according to the programming language. The basic format should match the following:
lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol table. If the symbol exists in the symbol
table, it returns its attributes stored in the table.
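A minimal hash-table sketch of insert() and lookup() follows; the attribute record here is simplified to a type and a storage class for illustration (a real compiler also stores value, state, and scope, as described above):

#include <iostream>
#include <string>
#include <unordered_map>

// Simplified attribute record; a real compiler also stores value,
// state, and scope for each symbol.
struct Attributes {
    std::string type;
    bool isStatic = false;
};

class SymbolTable {
    std::unordered_map<std::string, Attributes> table_;  // name -> attributes
public:
    void insert(const std::string& name, const Attributes& attr) {
        table_[name] = attr;
    }
    // The analogue of "returns 0 if the symbol does not exist".
    const Attributes* lookup(const std::string& name) const {
        auto it = table_.find(name);
        return it == table_.end() ? nullptr : &it->second;
    }
};

int main() {
    SymbolTable st;
    st.insert("interest", {"int", true});        // static int interest;
    if (const Attributes* a = st.lookup("interest"))
        std::cout << "interest: " << a->type << '\n';
}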
Scope Management
A compiler maintains two types of symbol tables: a global symbol table which can be accessed by all the
procedures and scope symbol tables that are created for each scope in the program.
To determine the scope of a name, symbol tables are arranged in hierarchical structure as shown in the
example below:
...
int value = 10;

void pro_one()
   {
   int one_1;
   int one_2;
      {                  \
      int one_3;          |_ inner scope 1
      int one_4;          |
      }                  /
   int one_5;
      {                  \
      int one_6;          |_ inner scope 2
      int one_7;          |
      }                  /
   }

void pro_two()
   {
   int two_1;
   int two_2;
      {                  \
      int two_3;          |_ inner scope 3
      int two_4;          |
      }                  /
   int two_5;
   }
...
The above program can be represented in a hierarchical structure of symbol tables:

The global symbol table contains names for one global variable (int value) and two procedure names, which
should be available to all the child nodes shown above. The names mentioned in the pro_one symbol table (and
all its child tables) are not available for pro_two symbols and its child tables.
This symbol table data structure hierarchy is stored in the semantic analyzer and whenever a name needs to be
searched in a symbol table, it is searched using the following algorithm:

first, a symbol will be searched in the current scope, i.e., the current symbol table;
if the name is found, then the search is completed; else it will be searched in the parent symbol table, until either the name is found or the global symbol table has been searched for the name.
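This parent-chain search can be sketched by linking each scope's table to the table of its enclosing scope (an illustration only, building on the flat table sketched earlier; entries are simplified to a name-to-type map):

#include <iostream>
#include <string>
#include <unordered_map>

// One table per scope, linked to the enclosing scope's table;
// the global table is the root and has no parent.
class ScopedTable {
    std::unordered_map<std::string, std::string> entries_;  // name -> type
    const ScopedTable* parent_;
public:
    explicit ScopedTable(const ScopedTable* parent = nullptr)
        : parent_(parent) {}
    void insert(const std::string& name, const std::string& type) {
        entries_[name] = type;
    }
    // Search the current scope first, then walk up the parents until the
    // name is found or the global table has been searched.
    const std::string* lookup(const std::string& name) const {
        for (const ScopedTable* t = this; t != nullptr; t = t->parent_) {
            auto it = t->entries_.find(name);
            if (it != t->entries_.end()) return &it->second;
        }
        return nullptr;
    }
};

int main() {
    ScopedTable global;              // holds value, pro_one, pro_two
    global.insert("value", "int");
    ScopedTable proOne(&global);     // scope of pro_one
    proOne.insert("one_1", "int");
    std::cout << *proOne.lookup("one_1") << ' '    // found in current scope
              << *proOne.lookup("value") << '\n';  // found via the parent
}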

