Compiler Design: Language Grammars

Compiler Design
4. Language Grammars
Kanat Bolazar
January 28, 2010
Introduction to Parsing: Language Grammars
Programming language grammars are usually written as some variation of context-free grammars (CFGs).
The notation used is often BNF (Backus-Naur form):
<block> -> { <statementlist> }
<statementlist> -> <statement> ; <statementlist>
<statement> -> <assignment> ;
| if ( <expr> ) <block> else <block>
| while ( <expr> ) <block>
...
Example Grammar: Language 0+0
A language that we'll call "Language 0+0":
E -> E + E | 0
Equivalently:
E -> E + E
E -> 0
Note that if there are multiple rules for the same left hand side,
they are alternatives.
This language only contains sentences of the form:
0 0+0 0+0+0 0+0+0+0 ...
Derivation for 0+0+0:
E -> E + E -> E + E + E -> 0 + 0 + 0
Note: This grammar is ambiguous: in the second step, did we
expand the first or the second E to E + E? Both paths work.
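The ambiguity can be made concrete by counting parse trees. The following is a minimal sketch; the function name count_parse_trees is hypothetical, introduced only for illustration. It counts the distinct parse trees that E -> E + E | 0 allows for a sentence with a given number of zeros.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_parse_trees(zeros: int) -> int:
    """Count parse trees for a sentence with `zeros` zeros under E -> E + E | 0."""
    if zeros == 1:
        return 1  # only E -> 0 applies
    # E -> E + E: pick which '+' is at the top of the tree. The left operand
    # then covers `left` zeros and the right operand covers the rest.
    return sum(
        count_parse_trees(left) * count_parse_trees(zeros - left)
        for left in range(1, zeros)
    )


for n in range(1, 6):
    sentence = "+".join("0" * n)
    print(sentence, count_parse_trees(n))
```

For 0+0+0 the count is 2, matching the two choices described above. The count grows quickly as the sentence gets longer.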
Example Grammar: Arithmetic, Ambiguous
Arithmetic expressions:
Exp -> num | Exp Op Exp
Op -> + | - | * | / | %
The "num" here represents a token. What it corresponds to is
defined in the lexical analyzer with a regular expression:
num [0-9]+
This language allows:
45 35 + 257 * 5 - 2 ...
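As an illustration of how a lexical analyzer might turn such a sentence into tokens, here is a minimal sketch using a regular expression. The token kind names ("num", "op") and the function tokenize are assumptions made for this example and do not come from any particular tool.

```python
import re

# One alternative per token kind; "ws" is skipped and "bad" catches anything else.
TOKEN_RE = re.compile(r"(?P<num>[0-9]+)|(?P<op>[-+*/%])|(?P<ws>\s+)|(?P<bad>.)")


def tokenize(text: str) -> list[tuple[str, str]]:
    """Split text into (kind, lexeme) pairs, e.g. ("num", "35")."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "ws":
            continue  # whitespace separates tokens but is not a token
        if kind == "bad":
            raise SyntaxError(
                f"unexpected character {match.group()!r} at position {match.start()}"
            )
        tokens.append((kind, match.group()))
    return tokens


print(tokenize("35 + 257 * 5 - 2"))
```

The parser then works on the token kinds (num, +, *, ...) rather than on individual characters.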
This grammar as defined here is ambiguous:
2 + 5 * 7: is this Exp * 7 or 2 + Exp?
Depending on the tools you use, you may be able to just
define precedence of operators, or may have to change the
grammar.
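To see why the choice matters, the two readings of 2 + 5 * 7 can be written out as trees and evaluated. This is a sketch under assumptions: trees are nested tuples (left, op, right), and "/" is taken to be integer division. Neither choice comes from the grammar itself.

```python
import operator

# Assumed meaning of each operator; "/" as integer division is an assumption.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.floordiv,
    "%": operator.mod,
}


def evaluate(tree):
    """Evaluate a tree that is either an int or a (left, op, right) tuple."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    return OPS[op](evaluate(left), evaluate(right))


exp_times_7 = ((2, "+", 5), "*", 7)   # reading "Exp * 7"
two_plus_exp = (2, "+", (5, "*", 7))  # reading "2 + Exp"

print(evaluate(exp_times_7))   # 49
print(evaluate(two_plus_exp))  # 37
```

Both trees are valid under the ambiguous grammar, but only the second matches the usual precedence of * over +.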
Example Language: Arithmetic, Factored
Arithmetic expressions grammar, factored for operator
precedence:
Exp -> Factor | Factor Addop Exp
Factor -> num | num Multop Factor
Addop -> + | -
Multop -> * | / | %
This grammar allows the same sentences:
45 35 + 257 * 5 - 2 ...
This grammar is not ambiguous; it groups factors first:
2 + 5 * 7
Factor Addop Exp
num + Exp
num + Factor
num + num Multop Factor
num + num * num
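The derivation above can be followed mechanically with one function per nonterminal. The sketch below is one possible way to do this, not a definitive parser. It assumes tokens are separated by whitespace, and the function names and the nested-tuple tree shape are illustrative choices.

```python
def parse_exp(tokens, pos=0):
    """Exp -> Factor | Factor Addop Exp"""
    factor, pos = parse_factor(tokens, pos)
    if pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        rest, pos = parse_exp(tokens, pos + 1)
        return ("Exp", factor, op, rest), pos
    return ("Exp", factor), pos


def parse_factor(tokens, pos):
    """Factor -> num | num Multop Factor"""
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise SyntaxError(f"expected num at token {pos}")
    num = tokens[pos]
    pos += 1
    if pos < len(tokens) and tokens[pos] in ("*", "/", "%"):
        op = tokens[pos]
        rest, pos = parse_factor(tokens, pos + 1)
        return ("Factor", num, op, rest), pos
    return ("Factor", num), pos


def parse(text):
    """Parse a whitespace-separated expression into a nested-tuple tree."""
    tokens = text.split()
    tree, pos = parse_exp(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("2 + 5 * 7"))
```

In the resulting tree, 5 * 7 sits inside a single Factor node, so the multiplication is grouped before the addition.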
Grammar Definitions
The grammar is a set of rules, sometimes called productions,
that construct valid sentences in the language.
Nonterminal symbols represent constructs in the language.
These would be the phrases in a natural language.
Terminal symbols are the actual words of the language. These
are the tokens produced by the lexical analyzer. In a natural
language, these would be the words, symbols, and space.
A sentence in the language only contains terminal symbols.
Nonterminals are intermediate linguistic constructs to define
the structure of a sentence.
Rules, Nonterminal and Terminal Symbols
Arithmetic expressions grammar, using multiplicative factors
for operator precedence:
Exp -> Factor | Factor Addop Exp
Factor -> num | num Multop Factor
Addop -> + | -
Multop -> * | / | %
This grammar has four rules as written here. If we expand
each option, we would have 2 + 2 + 2 + 3 = 9 rules.
There are four nonterminals:
Exp Factor Addop Multop
There are six terminals (tokens):
num + - * / %
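These counts can be checked by writing the grammar down as data. In this sketch the dictionary layout (nonterminal mapped to a list of alternatives) is just one convenient representation, chosen for illustration.

```python
# Each nonterminal maps to its list of alternatives (right-hand sides).
GRAMMAR = {
    "Exp": [["Factor"], ["Factor", "Addop", "Exp"]],
    "Factor": [["num"], ["num", "Multop", "Factor"]],
    "Addop": [["+"], ["-"]],
    "Multop": [["*"], ["/"], ["%"]],
}

# Nonterminals are the symbols that have rules.
nonterminals = set(GRAMMAR)

# Terminals are every other symbol that appears on a right-hand side.
terminals = {
    symbol
    for alternatives in GRAMMAR.values()
    for alternative in alternatives
    for symbol in alternative
} - nonterminals

# Counting each alternative as its own rule gives the expanded rule count.
expanded_rule_count = sum(len(alternatives) for alternatives in GRAMMAR.values())

print(len(GRAMMAR), "rules as written")
print(expanded_rule_count, "rules when expanded")
print(sorted(nonterminals))
print(sorted(terminals))
```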
Grammar Definitions: Rules
The production rules are rewrite rules. The basic CFG rule
form is:
X -> Y1 Y2 Y3 ... Yn
where X is a nonterminal and the Ys may be nonterminals or
terminals.
There is a special nonterminal called the start symbol.
The language is defined to be all the strings of terminals that
can be generated by starting with the start symbol and repeatedly
replacing a nonterminal by the right-hand side of one of its rules
until no nonterminals remain.
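This definition can be turned directly into a small generator. The sketch below is illustrative only: it uses the Language 0+0 grammar, always expands the leftmost nonterminal, and stops at a length bound. It also assumes no rule has an empty right-hand side, so a form that is already too long can be discarded safely.

```python
from collections import deque

GRAMMAR = {"E": [["E", "+", "E"], ["0"]]}


def generate(grammar, start, max_len):
    """Return every sentence of at most max_len symbols derivable from start."""
    sentences = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Find the leftmost nonterminal in the current sentential form.
        index = next((i for i, s in enumerate(form) if s in grammar), None)
        if index is None:
            sentences.add(" ".join(form))  # only terminals left: a sentence
            continue
        for rhs in grammar[form[index]]:
            new_form = form[:index] + tuple(rhs) + form[index + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(sentences, key=len)


print(generate(GRAMMAR, "E", 5))  # ['0', '0 + 0', '0 + 0 + 0']
```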
Larger Grammar Examples
We'll look at language grammar examples for MicroJava
and Decaf.
Note: Decaf extends the standard notation. The very useful { X },
meaning X | X, X | X, X, X | ..., is not standard.
Parse Trees
Derivation of a sentence by the language rules can be used
to construct a parse tree.
We expect parse trees to correspond to meaningful semantic
phrases of the programming language.
Each node of the parse tree will represent some portion that
can be implemented as one section of code.
The nonterminals expanded during the derivation are
trunk/branches in the parse tree.
The terminals at the end of branches are the leaves of the
parse tree.
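As a small illustration, a parse tree can be stored as nodes with children. The Node class and the leaves function below are hypothetical names chosen for this sketch. The tree is the one for 2 + 5 * 7 under the factored arithmetic grammar.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str
    children: list["Node"] = field(default_factory=list)


def leaves(node):
    """Collect the terminal symbols at the leaves, left to right."""
    if not node.children:
        return [node.symbol]
    return [leaf for child in node.children for leaf in leaves(child)]


# Parse tree for "num + num * num": nonterminals are inner nodes.
tree = Node("Exp", [
    Node("Factor", [Node("num")]),
    Node("Addop", [Node("+")]),
    Node("Exp", [
        Node("Factor", [
            Node("num"),
            Node("Multop", [Node("*")]),
            Node("Factor", [Node("num")]),
        ]),
    ]),
])

print(leaves(tree))  # ['num', '+', 'num', '*', 'num']
```

Reading the leaves from left to right gives back the token sequence of the sentence.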
Parsing
A parser:
Uses the grammar to check whether a sentence (a program for us) is in
the language or not.
Reports a syntax error if this is not a proper sentence/program.
Constructs a parse tree from the derivation of the correct program from
the grammar rules.
Top-down parsing:
Starts with the start symbol and applies rules until it gets the desired
input program.
Bottom-up parsing:
Starts with the input program and applies rules in reverse until it can get
back to the start symbol.
Looks at the left part of the input program to see if it matches the right-hand side of a rule.
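A bottom-up parse can be sketched for Language 0+0 (E -> E + E | 0) with a stack: shift tokens onto the stack and reduce whenever the top of the stack matches the right-hand side of a rule. This is a simplified illustration. Reducing as early as possible happens to work for this grammar, but real bottom-up parsers use tables to decide when to shift and when to reduce.

```python
RULES = [("E", ["E", "+", "E"]), ("E", ["0"])]


def shift_reduce(tokens):
    """Return (accepted, trace) for a list of tokens such as ['0', '+', '0']."""
    stack = []
    trace = []
    remaining = list(tokens)
    while True:
        for lhs, rhs in RULES:
            # Reduce if the top of the stack matches a right-hand side.
            if stack[-len(rhs):] == rhs:
                del stack[-len(rhs):]
                stack.append(lhs)
                trace.append(f"reduce {lhs} -> {' '.join(rhs)}")
                break
        else:
            # No reduction applies: shift the next token, or stop at end of input.
            if not remaining:
                break
            stack.append(remaining.pop(0))
            trace.append(f"shift {stack[-1]}")
    # The input is accepted if everything was reduced to the start symbol.
    return stack == ["E"], trace


accepted, trace = shift_reduce("0 + 0 + 0".split())
print(accepted)
for step in trace:
    print(step)
```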
Parsing Issues
Derivation Paths = Choices
Naive top-down and bottom-up parsing may require
backtracking to find a correct parse.
Restrictions on the form of grammar rules can make parsing
deterministic.
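Backtracking can be sketched with Python generators: each alternative of a rule is tried in turn, and a failed choice simply yields nothing. The code below is an illustrative recognizer for the factored arithmetic grammar. The function names are hypothetical and tokens are assumed to be separated by whitespace. It relies on the grammar having no left recursion; with a rule such as E -> E + E this naive approach would recurse forever.

```python
GRAMMAR = {
    "Exp": [["Factor"], ["Factor", "Addop", "Exp"]],
    "Factor": [["num"], ["num", "Multop", "Factor"]],
    "Addop": [["+"], ["-"]],
    "Multop": [["*"], ["/"], ["%"]],
}


def token_matches(terminal, token):
    """The terminal "num" matches any digit string; others match literally."""
    return token.isdigit() if terminal == "num" else token == terminal


def match_symbol(symbol, tokens, pos):
    """Yield every position reachable after matching symbol from pos."""
    if symbol in GRAMMAR:
        for alternative in GRAMMAR[symbol]:  # try each choice in turn
            yield from match_sequence(alternative, tokens, pos)
    elif pos < len(tokens) and token_matches(symbol, tokens[pos]):
        yield pos + 1


def match_sequence(symbols, tokens, pos):
    """Yield every position reachable after matching all symbols in order."""
    if not symbols:
        yield pos
        return
    for middle in match_symbol(symbols[0], tokens, pos):
        yield from match_sequence(symbols[1:], tokens, middle)


def recognizes(text):
    tokens = text.split()
    return any(end == len(tokens) for end in match_symbol("Exp", tokens, 0))


print(recognizes("2 + 5 * 7"))
print(recognizes("2 + * 7"))
```

For "2 + 5 * 7" the first alternative Exp -> Factor matches only "2", which leaves input unconsumed, so the recognizer falls back to the second alternative.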
Ambiguity
One program may have two different correct derivations from
the grammar.
This may be a problem if it implies two different semantic
interpretations.
Famous examples are arithmetic operators and the dangling
else problem.
Ambiguity: Dangling Else Problem
Which if does this else associate with?
if X
if Y
find()
else
getConfused()
The corresponding ambiguous grammar may be:
IfStmt -> if Cond Action
| if Cond Action else Action
Two derivations at the top (associated with the top "if") are:
if Cond Action
if Cond Action else Action
Programming languages often associate else with the inner if.
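The inner-if convention falls out naturally from a parser that takes an else as soon as it sees one. The sketch below assumes that an Action is either a nested if statement or a single token, which is an assumption added so that the example can nest. The function names and the tuple tree shape are illustrative.

```python
def parse_action(tokens, pos):
    """Action -> IfStmt | any single token (assumed for this sketch)."""
    if pos >= len(tokens) or tokens[pos] == "else":
        raise SyntaxError(f"expected an action at token {pos}")
    if tokens[pos] == "if":
        return parse_if(tokens, pos)
    return tokens[pos], pos + 1


def parse_if(tokens, pos):
    """IfStmt -> if Cond Action | if Cond Action else Action"""
    if pos + 1 >= len(tokens):
        raise SyntaxError("expected a condition after 'if'")
    cond = tokens[pos + 1]
    then_branch, pos = parse_action(tokens, pos + 2)
    else_branch = None
    # Take the else immediately if one follows: this binds it to the
    # innermost if that is still open.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_action(tokens, pos + 1)
    return ("if", cond, then_branch, else_branch), pos


tokens = "if X if Y find() else getConfused()".split()
tree, end = parse_action(tokens, 0)
print(tree)
```

The result is ('if', 'X', ('if', 'Y', 'find()', 'getConfused()'), None): the else belongs to the inner if, and the outer if has no else branch.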
Resources
Aho, Lam, Sethi, and Ullman, Compilers: Principles,
Techniques, and Tools, 2nd ed. Addison-Wesley, 2006.
Compiler Construction Course Notes at Linz:
http://www.ssw.uni-linz.ac.at/Misc/CC/
CS 143 Compiler Course at Stanford:
http://www.stanford.edu/class/cs143/