
Compilation Techniques (4)

The Syntactic Analysis


Context Free Grammars
Ambiguity
Associativity
Bottom-Up Parsing
Top-Down Parsing

© Codruta-Mihaela ISTIN 2015


The Syntactic Analysis (SA)
 Verifies the order of the tokens and groups
them according to the syntactic rules
 At the end of the lexical analysis there is
no structure or relation between tokens –
this structure and its constructs are
determined and produced in the syntactic
phase
 Example: the token sequence
ID:sqrt LPAR REAL:25.0 RPAR
is recognized as a call of function sqrt using the real constant 25.0


From Regular Definitions to
Context-Free Grammars
 Perhaps the most important limitation of
regular definitions (RD) is that recursion is
not allowed
 Constructs which cannot be expressed
with RD: parenthesized expressions,
nested comments
 The rules of Context-Free Grammars
(CFG) allow recursion, so the above
constructs can be expressed.

expr ::= INT | expr ADD expr | LPAR expr RPAR
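A regular definition cannot match arbitrarily deep nesting, but a recursive rule handles it naturally. As a minimal sketch (Python; the function name is hypothetical), a recognizer for the recursive part of the rule above, with the ADD alternative omitted for brevity:

```python
def parse_expr(tokens, i=0):
    """Recognize expr ::= INT | LPAR expr RPAR starting at index i.
    Returns the index just past the recognized expression."""
    if tokens[i] == 'INT':
        return i + 1
    if tokens[i] == 'LPAR':
        i = parse_expr(tokens, i + 1)   # recursion handles arbitrary nesting
        if tokens[i] != 'RPAR':
            raise SyntaxError('expected RPAR')
        return i + 1
    raise SyntaxError('expected INT or LPAR')
```

Each LPAR opens a recursive call that must be closed by a matching RPAR, which is exactly what no finite regular definition can express.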


Context-Free Grammars
 Definition: G = (T, N, S, P)
 T – a set of terminals. These are the basic symbols from
which the strings of the grammar (G) are formed. The
terminals are token names (or simply tokens). Ex: ID, INT,
ADD, IF
 N – a set of nonterminals or syntactic variables. Each
nonterminal defines a set of strings.
 S – one nonterminal distinguished as start symbol
 P – a set of productions. A production consists of:
◦ Head – a nonterminal
◦ The symbol ::= or →
◦ Body – an expression consisting of terminals and
nonterminals which describes one way in which the head
can be constructed
Backus-Naur Form (BNF) and
Extended BNF (EBNF)
 BNF is a way to describe CFG productions
 Since its development (late 1950s), BNF has been
highly successful and is widely used in many compiler
books and programming language standards. There are
many tools which use this form and many algorithms for it
 It contains only terminals, nonterminals, sequences (a b)
and alternatives (a | b). Optional components were written
as separate alternatives (a b | b) and iteration was
written as recursion (A ::= ID | A , ID)
 For ease of writing, new operators were added, such
as: +, *, ?. These new conventions were named EBNF and
they can be reduced to BNF. There are multiple
conventions; for example, optionality can be written as
a? or [a]
BNF examples
E ::= E + E | E * E | - E | ( E ) | id
stmt ::= if E then stmt else stmt
| if E then stmt
| begin stmtList end
stmtList ::= stmt ; stmtList | stmt

 When the meaning is clear, operators, separators and
numbers can be written as such (not as token names)
 Terminals are written using lowercase and boldface
 Nonterminals are written using uppercase letters or
lowercase italics
 Lowercase Greek letters are used for a part of a production:
let α=“if E then stmt” => stmt ::= α else stmt | α | …



Derivations
 A CFG can be seen as a set of rules to rewrite the
nonterminals (the heads of the productions) with their
production bodies
 Derivation – a rewrite (expansion) of a nonterminal
according to one of its production bodies

E ::= E + E | E * E | - E | ( E ) | id

Possible derivations (one at a time):


E → id
E → -E → -id
E → E*E → (E)*E → (E+E)*E → (id+E)*E → (id+id)*E → (id+id)*id
Parse trees
 A graphical representation of a derivation

E → E*E → (E)*E → (E+E)*E → (id+E)*E → (id+id)*E → (id+id)*id

        E
      / | \
     E  *  E
   / | \    \
  (  E  )   id
   / | \
  E  +  E
  |     |
  id    id
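A parse tree can also be held as a plain data structure. Here is one possible sketch (Python; the nested-tuple representation is hypothetical) for the tree of (id+id)*id, with a helper that reads back its frontier, i.e. the derived string:

```python
# Parse tree for (id+id)*id as nested tuples: (nonterminal, children...).
# Terminals are plain strings.
tree = ('E',
        ('E', '(', ('E', ('E', 'id'), '+', ('E', 'id')), ')'),
        '*',
        ('E', 'id'))

def leaves(t):
    """Collect the terminal leaves left to right (the tree's frontier)."""
    if isinstance(t, str):
        return [t]
    out = []
    for child in t[1:]:
        out += leaves(child)
    return out
```

Reading the leaves of the tree gives back exactly the derived token string `( id + id ) * id`.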
Derivation order
 A CFG by itself does not specify the derivation order, so in
some cases multiple derivations are possible for the same
result. In practice two types of derivations are most used:
 Leftmost derivation – the leftmost nonterminal in the current
string is always chosen first for derivation
E → -E → -(E) → -(E+E) → -(id+E) → -(id+id)
 Rightmost derivation – the rightmost nonterminal in the current
string is always chosen first for derivation
E → -E → -(E) → -(E+E) → -(E+id) → -(id+id)



Ambiguity
 When multiple alternatives are possible, a CFG does not
impose an order on them, so ambiguities are possible
 For the production: E ::= E * E | E + E | id
 The input sequence: id*id+id
 Can be obtained through either of the derivations below (here the
parentheses only show the derivation order; they are not part of
the grammar):
E → E*E → E*(E+E) → id*(id+id) (1)
E → E+E → (E*E)+E → (id*id)+id (2)
 Most probably derivation (2) is the desired result, but because CFGs
have no notion of operator precedence, derivation (1) is also
possible, so there is an ambiguity
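The two derivations correspond to two different trees, and once numbers are substituted for id the trees can even yield different values. A small sketch (Python; the numbers 2, 3, 4 are illustrative stand-ins for id):

```python
import operator

# Tree for derivation (1): E*E first, then E+E on the right -> id*(id+id)
tree1 = (operator.mul, 2, (operator.add, 3, 4))
# Tree for derivation (2): E+E first, then E*E on the left -> (id*id)+id
tree2 = (operator.add, (operator.mul, 2, 3), 4)

def evaluate(t):
    """Evaluate a tree given as (operator, left, right), or a plain number."""
    if isinstance(t, int):
        return t
    op, left, right = t
    return op(evaluate(left), evaluate(right))
```

With id = 2, 3, 4 the first tree evaluates to 2*(3+4) = 14 and the second to (2*3)+4 = 10, so the ambiguity has observable consequences.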



Ambiguity elimination
 The order of derivations can be strictly defined by creating new productions
for the alternatives which are to be processed first (the ones with higher
precedence) and calling these productions from the ones which are to be
executed later (the ones with lower precedence). For every production a
direct path to its higher precedence alternatives must be provided, in order
to make the lower precedence operations optional.

Original grammar:
E ::= E * E | E + E | id | E < E

Comparison separated (lowest precedence):
E ::= E < T | T
T ::= T * T | T + T | id

Addition separated:
E ::= E < T | T
T ::= T + F | F
F ::= F * F | id

Final grammar (one production per precedence level):
E ::= E < T | T
T ::= T + F | F
F ::= F * R | R
R ::= id

 In this way, in order to evaluate E, first T must be evaluated, so the
multiplication will be executed before the addition.
 The alternatives “E < T | T” are necessary because the original E can also
be “E * E” or “id”. If E were transformed only as “E ::= T < T”, the
comparison would have been required in all cases.
 id was put in its own production and not in T because otherwise it would
be possible to simply consume an id and not the entire expression.
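One way to see the precedence at work is to code the final grammar directly, one function per nonterminal. A sketch in Python (the left-recursive rules are written as loops, a standard rewriting for hand-coded parsers, and numeric literals stand in for id so the result can be evaluated):

```python
import re

def tokenize(src):
    # Numbers stand in for id so that results can be computed.
    return re.findall(r'\d+|[<+*]', src) + ['$']

def parse_E(toks):                     # E ::= E < T | T
    v = parse_T(toks)
    while toks[0] == '<':
        toks.pop(0)
        v = int(v < parse_T(toks))
    return v

def parse_T(toks):                     # T ::= T + F | F
    v = parse_F(toks)
    while toks[0] == '+':
        toks.pop(0)
        v += parse_F(toks)
    return v

def parse_F(toks):                     # F ::= F * R | R
    v = parse_R(toks)
    while toks[0] == '*':
        toks.pop(0)
        v *= parse_R(toks)
    return v

def parse_R(toks):                     # R ::= id
    return int(toks.pop(0))
```

For example, parse_E(tokenize('1+2*3')) returns 7 and parse_E(tokenize('2<1+4')) returns 1: multiplication is evaluated before addition, and the comparison last, exactly as the layered productions dictate.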
Associativity
 If there is a sequence of operations using the same operator,
or operators with the same precedence, their associativity is
important. Ex: is “18/6/3” “(18/6)/3” or “18/(6/3)”?
 In BNF the associativity can be implemented by placing the head of
the rule in the rule body in different ways
 Left associativity – “E ::= T | E + T”. The expression is a simple
term, or all the left part of “+” followed by a simple term
 Right associativity – “Assignment ::= E | id = Assignment”. First
the right part of the assignment will be evaluated, so “a=b=7” will be
interpreted as “a=(b=7)”
 Non-associativity – “E ::= T | T < T”. In this way there is no
possibility to chain more than one “<” in an expression. This is
sometimes desirable because otherwise “a<b<c” would be interpreted
as “(a<b)<c” (a boolean result compared with something) and not as
“a<b && b<c”.
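The difference can be made concrete by evaluating the two tree shapes for “18/6/3”. A sketch (Python, integer division; the helper names are hypothetical):

```python
def eval_left(nums):
    """Left-associative chain, as produced by E ::= T | E / T:
    18/6/3 groups as (18/6)/3."""
    acc = nums[0]
    for n in nums[1:]:
        acc //= n
    return acc

def eval_right(nums):
    """Right-associative chain, as produced by E ::= T | T / E:
    18/6/3 groups as 18/(6/3)."""
    if len(nums) == 1:
        return nums[0]
    return nums[0] // eval_right(nums[1:])
```

For [18, 6, 3] the left-associative grouping gives 1 while the right-associative one gives 9, so the grammar's choice directly changes the computed value.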
CFG parsing algorithms
 If a list of tokens is given, how can their order be verified and how
can they be grouped into syntactic constructs according to a CFG?
 Top-down algorithms – they start with the CFG start nonterminal
and try to match it against the token list. If the start nonterminal
has other nonterminals embedded, these are also matched one by
one. This approach follows human thinking, along the lines of “a
program is made of functions, global variables and types, each
function consisting of a header and a body, …”
 Bottom-up algorithms – they take the tokens directly from the list
and try to assemble them into CFG productions. In this approach
they resemble a finite automaton in which the tokens are the input
symbols and with each token they advance on a transition towards a
possible final state, which is a CFG production. Example of thinking
along these lines: “if a for token is found, it must be the start of a for
statement, so the next transition will need to consume an open
parenthesis, …”.
Lookahead tokens
 Lookahead tokens – the tokens following the current one, which are
inspected (without being consumed) in order to optimize the parser or to
make its choices deterministic in the current situation
 Both top-down and bottom-up algorithms may use lookahead tokens
 If a parser for a particular type of grammar (ex: LL or LR) can be made
deterministic by inspecting k lookahead tokens, that grammar is
denoted LL(k) or LR(k)
 Many programming languages have k=0 or k=1 grammars, and in these
cases the parsing algorithms can be better optimized
 Example of a k=0 grammar: a sequence of commands, each one starting
with a keyword; every command can have an optional numeric argument.
forward 50
left 20
pickup
 In this case, by considering only the current token without any lookahead,
the parser can fully know what the command is and what its parameters are
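A sketch of such a parser (Python; the command set and token list are the hypothetical ones from the example above, assumed to be already produced by the lexical analysis):

```python
# Which commands accept a numeric argument.
TAKES_ARG = {'forward': True, 'left': True, 'pickup': False}

def parse_program(tokens):
    """Each command is recognized from its own token alone (k = 0):
    a keyword starts a command, a number can only be an argument."""
    i, cmds = 0, []
    while i < len(tokens):
        name = tokens[i]
        i += 1
        if TAKES_ARG[name] and i < len(tokens) and tokens[i].isdigit():
            cmds.append((name, int(tokens[i])))
            i += 1
        else:
            cmds.append((name, None))
    return cmds
```

Running it on the token list ['forward', '50', 'left', '20', 'pickup'] yields [('forward', 50), ('left', 20), ('pickup', None)]: no token ever has to be inspected before being consumed.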
Top-down algorithms
 They are easier for programmers to implement, so they are more used
in hand-written parsers
 They are more flexible and offer more features (ex: semantic
predicates) than the bottom-up algorithms, because the latter
must conform to the stricter rules of automata theory
 It can be proven that the top-down algorithms perform leftmost
derivations, so the CFGs which can be implemented with these
algorithms are named LL grammars (Left to right scan, Leftmost derivation)
 For simple grammars (with few alternatives) the efficiency is
comparable with the bottom-up algorithms. For complex grammars
the top-down algorithms are slower because they need to try many
alternatives, in a backtracking manner.
 The efficiency can be increased by using prediction methods, for
example by testing lookahead tokens and trying only the alternatives
which are consistent with them. This method is called predictive
parsing and sometimes it can completely eliminate the backtracking
Bottom-up algorithms
 They are harder for programmers to implement because they need
complex transition tables, so they are used especially by the tools
which generate syntactic parsers
 It can be proven that the bottom-up algorithms perform rightmost
derivations (in reverse), so the CFGs which can be implemented with
these algorithms are named LR grammars (Left to right scan, Rightmost
derivation)
 They are faster than the top-down algorithms especially when there
are many possible alternatives, for example in processing natural
languages, which can have thousands of alternatives. In this case it is
much more efficient to advance (mostly) deterministically on a finite
automaton with each token than to try thousands of possibilities in a
backtracking manner.



A bottom-up algorithm
 In a bottom-up algorithm there are mainly two operations:
◦ Shift – take one token from the tokens list and put it on a processing
stack (shifts the token from the list to the stack)
◦ Reduce – interpret the last symbol(s) on the stack as the body of a
production and replace them with the corresponding nonterminal (the
head of that production)
 Because some tokens or nonterminals can appear in multiple places (ex:
parentheses can appear in expressions or as part of statements, identifiers
in expressions and declarations, …) a table must be built for each symbol
with all its possible appearances from all possible states
 The possible states of a CFG are collected in much the same manner as a
regular definition is transformed into a transition diagram (TD)
 Because the resulting TD is not always deterministic (and because of the
recursive rules it cannot always be made deterministic), conflicts may
sometimes appear, for example when both a shift and a reduce operation
are possible. In this case other disambiguation methods are needed; for
example, one or more lookahead tokens can be considered in order to
choose the correct operation.
Bottom-up example
 Input string: a * b

Grammar:
E ::= E + T | T
T ::= T * F | F
F ::= ( E ) | id

Stack    Input    Action
         a*b      shift
a        *b       reduce by F → id
F        *b       reduce by T → F
T        *b       shift
T *      b        shift
T * b             reduce by F → id
T * F             reduce by T → T * F
T                 reduce by E → T
E                 accept
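The trace above can be reproduced by a toy shift-reduce recognizer. In this sketch (Python, hypothetical helper) the shift/reduce decisions are hand-coded, with one lookahead token guarding the E-reductions; a real LR parser derives the same decisions from its precomputed tables:

```python
def parse(tokens):
    """Return the list of actions taken while recognizing the input
    with the grammar E ::= E+T | T,  T ::= T*F | F,  F ::= (E) | id."""
    stack, trace = [], []
    toks = list(tokens) + ['$']          # '$' marks the end of input
    while True:
        la = toks[0]                     # one lookahead token
        if stack[-1:] == ['id']:
            stack[-1:] = ['F']; trace.append('reduce F -> id')
        elif stack[-3:] == ['(', 'E', ')']:
            stack[-3:] = ['F']; trace.append('reduce F -> (E)')
        elif stack[-3:] == ['T', '*', 'F']:
            stack[-3:] = ['T']; trace.append('reduce T -> T*F')
        elif stack[-1:] == ['F']:
            stack[-1:] = ['T']; trace.append('reduce T -> F')
        elif stack[-3:] == ['E', '+', 'T'] and la != '*':
            stack[-3:] = ['E']; trace.append('reduce E -> E+T')
        elif stack[-1:] == ['T'] and la != '*':
            stack[-1:] = ['E']; trace.append('reduce E -> T')
        elif la != '$':
            stack.append(toks.pop(0)); trace.append('shift')
        elif stack == ['E']:
            trace.append('accept'); return trace
        else:
            raise SyntaxError(stack)
```

The `la != '*'` guards encode exactly the shift/reduce decision from the trace: with T on the stack and * as lookahead, the parser must shift instead of reducing T to E.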
Other parsing algorithms
 Parsing Expression Grammar (PEG) – a top-down variation in
which the alternatives are considered strictly in their order, so no
ambiguities are possible (if two alternatives match the same input
tokens, the first one is always chosen). PEG grammars are faster
and easier to interpret, but when one alternative is a prefix of
another, the programmer must list the longer alternative first, so
that it is tried before the shorter one.
 Generalized LR (GLR) – an extension to LR parsers to make
them handle nondeterminism and ambiguity. In such cases GLR
acts in a parallel way, by simultaneously trying multiple ways to
shift/reduce the input.
 LL(*) – a combination of LL parsing with backtracking. If a part of
the grammar is deterministic with a finite lookahead k, this part is
implemented as a fast deterministic automaton. For the parts which
are not deterministic, a more powerful backtracking algorithm is
used.
Bibliography reading
 Compilers: Principles, Techniques, and Tools –
sections 4.1, 4.2, 4.3, 4.5
