Parsing
Given a string w and a grammar G, a parser finds
a derivation of the string w from the grammar G,
or else determines that the string is not part of the
language
Thus, a parser solves the membership problem for
a language, which is the problem of deciding, for
any string w and grammar G, whether w belongs
to the language generated by G
Typically, a parser also constructs a parse tree for
the string (which can be used by a compiler for
code generation)
03/24/16 05:30
Two questions
Can we solve the membership problem for
context-free languages? That is, can we
develop a parsing algorithm for any context-free language?
If so, can we develop an efficient parsing
algorithm?
We saw in the previous chapter that we can,
if we place restrictions on the grammar.
Simplified forms
Theorem 6.1: Let G = (V, T, S, P) be a context-free grammar. Suppose that P contains a
production rule of the form:
A → x1Bx2
Assume that A and B are different variables and
that
B → y1 | y2 | . . . | yn
is the set of all productions in P which have B as
the left side.
Simplified forms
Theorem 6.1: (continued)
Let G' = (V, T, S, P') be the grammar in which P' is
constructed by deleting
A → x1Bx2
from P, and adding to it
A → x1y1x2 | x1y2x2 | . . . | x1ynx2
Then it may be shown that
L(G') = L(G)
(see the Linz textbook for the proof).
Simplified forms
Example:
A → a | aaA | abBc
B → abbA | b
Here we can't eliminate all rules with B on the left
side, but we can eliminate B from the right side
of any A rules. The equivalent productions
would be:
A → a | aaA | ababbAc | abbc
B → abbA | b
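The substitution in Theorem 6.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout (variable mapped to a list of right-hand-side strings, one character per symbol) and the name `substitute` are hypothetical, not part of the original slides.

```python
def substitute(productions, a, b):
    """Expand B inside the A-productions, as in A -> x1 B x2.

    productions maps a variable to a list of right-hand sides (strings).
    Only the first occurrence of B in each right-hand side is expanded.
    Assumes a != b, as the theorem requires.
    """
    result = {var: list(rhss) for var, rhss in productions.items()}
    new_rhss = []
    for rhs in productions[a]:
        if b in rhs:
            x1, x2 = rhs.split(b, 1)  # rhs = x1 B x2
            new_rhss.extend(x1 + y + x2 for y in productions[b])
        else:
            new_rhss.append(rhs)
    result[a] = new_rhss
    return result


subst_grammar = {"A": ["a", "aaA", "abBc"], "B": ["abbA", "b"]}
print(substitute(subst_grammar, "A", "B")["A"])
# ['a', 'aaA', 'ababbAc', 'abbc']
```

The B rules are left in place; whether they are still needed depends on the rest of the grammar.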
Simplified forms
Example:
Suppose that our complete simplified
grammar is:
S → A
A → a | aaA | ababbAc | abbc
B → abbA | b
Since you can't get to B from S, there is no
longer any way that any B rules can play a part
in any derivation; they are useless.
Simplified forms
Another example:
Suppose that our grammar is:
S → aSb | λ | A
A → aA
Notice that the production rule A → aA can
never be used to produce a sequence of all
terminals. It is therefore useless.
The production rule S → A is also useless.
(Why?) Both of these rules may be deleted
without changing the language generated by the grammar.
Reachable
Definition:
A variable A in a CFG G = (V, T, S, P)
is reachable if S ⇒* xAy for some x, y ∈ (V ∪
T)*.
Reachable variables are variables that appear in
strings derivable from S.
Example
S → EA
A → abA | ab
C → EC | Ab
E → bC
G → EbE | CE | ba
Reachable variables:
R0 = {S}
R1 = {S, E, A}
R2 = {S, E, A, C}
R3 = {S, E, A, C}
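The reachable set can be computed mechanically. The sketch below uses a worklist instead of the round-by-round sets R0, R1, ... shown above, but it arrives at the same final set. The grammar encoding (a dictionary from variable to right-hand-side strings, one character per symbol) and the function name are assumptions made for illustration.

```python
def reachable_variables(productions, start):
    """Return the variables that appear in some string derivable from start."""
    reached = {start}
    frontier = [start]
    while frontier:
        var = frontier.pop()
        for rhs in productions.get(var, []):
            for symbol in rhs:
                # Symbols that are keys of the dictionary are variables.
                if symbol in productions and symbol not in reached:
                    reached.add(symbol)
                    frontier.append(symbol)
    return reached


reach_grammar = {
    "S": ["EA"],
    "A": ["abA", "ab"],
    "C": ["EC", "Ab"],
    "E": ["bC"],
    "G": ["EbE", "CE", "ba"],
}
print(sorted(reachable_variables(reach_grammar, "S")))
# ['A', 'C', 'E', 'S']
```

G never enters the set, so it is unreachable in this grammar.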
Useful variables
Definition:
Let G = (V, T, S, P) be a context-free grammar.
Let A ∈ V; then A is live iff there is at least
one string w ∈ L(G) such that
xAy ⇒* w
with x, y in (V ∪ T)*
Informally, live variables are those from which
strings of terminals can be derived. Variables
which are not live are said to be dead.
Example
S → AB | CD | ADF | CF | EA
A → abA | ab
B → bB | aD | BF | aF
C → cB | EC | Ab
D → bB | FFB
E → bC | AB
F → abbF | baF | bD | BB
G → EbE | CE | ba
Live variables:
L0={A, G}
L1={A, G, C}
L2={A, G, C, E}
L3={A, G, C, E, S}
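The live set is a fixed point: a variable becomes live as soon as one of its right-hand sides consists only of terminals and variables already known to be live. A minimal sketch follows, assuming the same hypothetical dictionary encoding (one character per symbol, dictionary keys are the variables). Because it updates the set in place, a single pass may add more than one "round" of variables, but the final set is the same.

```python
def live_variables(productions):
    """Return the variables from which some string of terminals can be derived."""
    live = set()
    changed = True
    while changed:
        changed = False
        for var, rhss in productions.items():
            if var in live:
                continue
            # A right-hand side qualifies if every symbol is a terminal
            # (not a dictionary key) or an already-live variable.
            if any(all(s in live or s not in productions for s in rhs) for rhs in rhss):
                live.add(var)
                changed = True
    return live


live_grammar = {
    "S": ["AB", "CD", "ADF", "CF", "EA"],
    "A": ["abA", "ab"],
    "B": ["bB", "aD", "BF", "aF"],
    "C": ["cB", "EC", "Ab"],
    "D": ["bB", "FFB"],
    "E": ["bC", "AB"],
    "F": ["abbF", "baF", "bD", "BB"],
    "G": ["EbE", "CE", "ba"],
}
print(sorted(live_variables(live_grammar)))
# ['A', 'C', 'E', 'G', 'S']
```

B, D, and F never become live: every one of their rules contains B, D, or F again.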
Useful variables
Definition 6.1 (modified): A variable A in a CFG is useful if it is both live and reachable; otherwise it is useless.
Useless variables
So a variable is useless if either:
1. it is not live (i.e., cannot derive a terminal
string), or
2. it is not reachable from the start symbol
A production is useless if it involves any
useless variables.
Exercise
Example:
Given G = ({S, A, B, C}, {a, b}, S, P), with P =
S → aS | A | C
A → a
B → aa
C → aCb
eliminate all useless variables and productions.
First, we find any dead variables.
It should be obvious that C can never generate a
string of all-terminals. C is dead.
Exercise
Delete any productions involving C.
New grammar: S → aS | A
A → a
B → aa
Next, we check to see if there are any variables
which cannot be reached from the start symbol.
To do this, we may use a dependency graph.
Exercise
Example: S → aS | A | C
A → a
B → aa
C → aCb
Dependency graph: [figure not recovered; S has edges to A and C, and no edge leads to B]
Exercise
B cannot be reached from S, so delete any productions involving B.
New grammar: S → aS | A
A → a
The only productions that were deleted from the
original grammar were useless.
This new grammar generates all and only the
strings generated by the original grammar. It is
equivalent to the original grammar.
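The two phases of the exercise (first drop dead variables, then drop unreachable ones) can be combined into one routine. The following is a sketch under assumptions, not a definitive implementation: the dictionary encoding (one character per symbol) and the name `remove_useless` are hypothetical.

```python
def remove_useless(productions, start):
    """Remove dead variables first, then unreachable ones."""
    # Phase 1: find live variables by fixed-point iteration.
    live = set()
    changed = True
    while changed:
        changed = False
        for var, rhss in productions.items():
            if var not in live and any(
                all(s in live or s not in productions for s in rhs) for rhs in rhss
            ):
                live.add(var)
                changed = True
    # Keep only productions that mention no dead variable.
    pruned = {
        var: [rhs for rhs in rhss if all(s in live or s not in productions for s in rhs)]
        for var, rhss in productions.items()
        if var in live
    }
    # Phase 2: find variables reachable from the start symbol.
    reached, frontier = {start}, [start]
    while frontier:
        var = frontier.pop()
        for rhs in pruned.get(var, []):
            for s in rhs:
                if s in pruned and s not in reached:
                    reached.add(s)
                    frontier.append(s)
    return {var: rhss for var, rhss in pruned.items() if var in reached}


exercise_grammar = {"S": ["aS", "A", "C"], "A": ["a"], "B": ["aa"], "C": ["aCb"]}
print(remove_useless(exercise_grammar, "S"))
# {'S': ['aS', 'A'], 'A': ['a']}
```

The order of the phases matters: deleting dead variables can make other variables unreachable, so reachability is checked last.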
Useless variables
Theorem 6.2: Let G = (V, T, S, P) be a context-free grammar. Then there exists an equivalent
grammar G' = (V', T', S, P') that does not
contain any useless variables or productions.
Note that useless variables may be removed from
V to give V', and any terminals not occurring in
any useful production may be removed from T
to give T'.
λ-productions
Definition 6.2: Any production of a context-free
grammar of the form
A → λ
is called a λ-production.
Any variable A for which the derivation A ⇒* λ
is possible is called nullable.
Nullable variables
A nullable variable in a context-free grammar G = (V,
T, S, P) is defined as follows:
1. Any variable A for which P contains the production
A → λ is nullable.
2. If P contains the production A → B1B2...Bn and
B1, B2, ..., Bn are nullable variables, then A is nullable.
3. No other variables in V are nullable.
The nullable variables in V are precisely those variables
A for which A ⇒* λ.
λ-productions
Note that without λ-productions, a grammar would
have no way to reduce the number of symbols
in its intermediate strings. In such a grammar,
we could stop processing intermediate strings as
soon as they exceeded the length of the target
string.
λ-productions
So, given a CFG G without λ-productions, we
could determine if a given string x of length |x|
belonged to L(G) simply by applying production
rules and generating all intermediate strings of length at most |x|. If x
had not been generated up to that point, it could
not belong to the language.
λ-productions
Given the grammar
S → aS1b
S1 → aS1b | λ
What is the effect of the production S1 → λ?
The effect is to delete an occurrence of S1 from a sentential form,
just as if S1 had been deleted from the right-hand side of the
production rule that introduced it.
λ-productions
If we apply the production S1 → λ to
S → aS1b
the resulting production rule is
S → ab
If we apply the production S1 → λ to
S1 → aS1b
the resulting production rule is
S1 → ab
λ-productions
Therefore, we can eliminate any λ-productions from
this grammar by adding the new productions
obtained by substituting λ for S1 wherever S1
appears on the right-hand side of the production
rules, and then deleting the λ-production.
When we do this, we obtain the equivalent
grammar:
S → aS1b | ab
S1 → aS1b | ab
λ-productions
Theorem 6.3: Let G be any context-free grammar
with λ not in L(G). Then there exists an
equivalent grammar G' having no λ-productions.
Algorithm FindNull
Establish the set N0, which is the set of all variables A
in the grammar that go directly to λ.
Now loop:
Each time through the loop, add to this set every
variable B that has a production whose right-hand side
consists entirely of variables already in the set.
The first time through the loop, these are variables that go to
strings of variables in N0; the second time, variables that go to
strings of variables found so far; etc. . . .
Stop when no new variables were added to the set during
the last iteration of the loop.
Example
Let G be the CFG with the productions:
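S → ABCBCDA
A → CD
B → Cb
C → a | λ
D → bD | λ
Here, C and D are nullable because there are production
rules C → λ and D → λ.
But A is also nullable, because A → CD, and both C
and D are nullable.
B is not nullable, because its only rule contains the terminal b;
S is not nullable, because its only rule contains B.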
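The FindNull loop can be sketched as follows. The encoding is an assumption made for illustration: right-hand sides are strings with one character per symbol, and the empty string stands for λ. Each pass through the `while` loop corresponds to one step from N_i to N_(i+1).

```python
def find_nullable(productions):
    """Return the set of nullable variables (those deriving the empty string)."""
    # N0: variables with a direct lambda-production (empty right-hand side).
    nullable = {var for var, rhss in productions.items() if "" in rhss}
    while True:
        # Variables with a right-hand side made up only of nullable variables.
        added = {
            var
            for var, rhss in productions.items()
            if var not in nullable
            and any(rhs and all(s in nullable for s in rhs) for rhs in rhss)
        }
        if not added:
            return nullable
        nullable |= added


null_grammar = {
    "S": ["ABCBCDA"],
    "A": ["CD"],
    "B": ["Cb"],
    "C": ["a", ""],
    "D": ["bD", ""],
}
print(sorted(find_nullable(null_grammar)))
# ['A', 'C', 'D']
```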
Example
Given a context-free grammar with the following
production rules, find the nullable variables:
S → ABC
A → B | a
B → C | b | λ
C → AB | D
D → Cd
N0 = {B}
N1 = {B, A}
N2 = {B, A, C}
N3 = {B, A, C, S}
Example (continued)
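Original grammar:
S → ABC
A → B | a
B → C | b | λ
C → AB | D
D → Cd
N = {A, B, C, S}
Rules that gain new alternatives when nullable variables are deleted from their right-hand sides:
S → ABC becomes S → ABC | BC | AC | AB | A | B | C
C → AB | D becomes C → AB | A | B | D
D → Cd becomes D → Cd | d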
Example (continued)
S → ABC | AB | AC | BC | A | B | C
A → B | a
B → C | b
C → AB | A | B | D
D → Cd | d
Note that we have gotten rid of all λ-productions.
(Since S is nullable, λ is in L(G); the new grammar generates L(G) - {λ}.)
However, other beneficial changes can still be
made.
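The elimination step in this example can be sketched in Python. For each right-hand side, every nullable variable is either kept or dropped, and the empty result is discarded. The encoding (strings with one character per symbol, empty string for λ) and the function name are assumptions for illustration only.

```python
from itertools import product


def remove_lambda_productions(productions):
    """Return an equivalent grammar (up to the empty string) without lambda-productions."""
    # Find the nullable variables by fixed-point iteration.
    nullable = {v for v, rhss in productions.items() if "" in rhss}
    changed = True
    while changed:
        changed = False
        for v, rhss in productions.items():
            if v not in nullable and any(
                rhs and all(s in nullable for s in rhs) for rhs in rhss
            ):
                nullable.add(v)
                changed = True
    result = {}
    for v, rhss in productions.items():
        new_rhss = []
        for rhs in rhss:
            # Each nullable symbol may be kept or replaced by the empty string.
            choices = [(s, "") if s in nullable else (s,) for s in rhs]
            for combo in product(*choices):
                candidate = "".join(combo)
                if candidate and candidate not in new_rhss:
                    new_rhss.append(candidate)
        result[v] = new_rhss
    return result


lambda_grammar = {
    "S": ["ABC"],
    "A": ["B", "a"],
    "B": ["C", "b", ""],
    "C": ["AB", "D"],
    "D": ["Cd"],
}
for var, rhss in remove_lambda_productions(lambda_grammar).items():
    print(var, "->", " | ".join(rhss))
```

The alternatives may be printed in a different order than on the slide, but the set of productions for each variable is the same.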
Unit productions
Definition 6.3: Any production of a context-free
grammar of the form
A → B,
where A, B ∈ V, is called a unit-production.
Unit productions
Theorem 6.4: Let G = (V, T, S, P) be any context-free grammar without λ-productions. Then there
exists a context-free grammar G' = (V', T', S, P')
that does not have any unit-productions and that
is equivalent to G.
Proof: See p. 159 in the Linz text.
Example
Original grammar:
S → S+T | T
T → T*F | F
F → (S) | a
Resulting grammar:
S → S+T | T*F | (S) | a
T → T*F | (S) | a
F → (S) | a
Variables derivable through unit productions:
{S-derivable} = {T, F} (T directly, then F through T → F)
{T-derivable} = {F}
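The construction used in this example can be sketched as: for each variable, collect the variables derivable from it through unit productions, then copy over their non-unit right-hand sides. The code below is a minimal illustration assuming a grammar without λ-productions, encoded as strings with one character per symbol; the function name is hypothetical.

```python
def remove_unit_productions(productions):
    """Return an equivalent grammar with no productions of the form A -> B."""

    def is_unit(rhs):
        # A unit production has a single variable as its right-hand side.
        return len(rhs) == 1 and rhs in productions

    result = {}
    for var in productions:
        # Variables derivable from var using only unit productions (plus var itself).
        derivable, frontier = {var}, [var]
        while frontier:
            current = frontier.pop()
            for rhs in productions[current]:
                if is_unit(rhs) and rhs not in derivable:
                    derivable.add(rhs)
                    frontier.append(rhs)
        # Copy every non-unit right-hand side of those variables.
        new_rhss = []
        for other in productions:
            if other in derivable:
                for rhs in productions[other]:
                    if not is_unit(rhs) and rhs not in new_rhss:
                        new_rhss.append(rhs)
        result[var] = new_rhss
    return result


unit_grammar = {"S": ["S+T", "T"], "T": ["T*F", "F"], "F": ["(S)", "a"]}
for var, rhss in remove_unit_productions(unit_grammar).items():
    print(var, "->", " | ".join(rhss))
# S -> S+T | T*F | (S) | a
# T -> T*F | (S) | a
# F -> (S) | a
```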
Summary
Theorem 6.5: Let L be a context-free language
that does not contain λ. Then there exists a
context-free grammar that generates L and that
does not have any useless productions, λ-productions, or unit-productions.
Proof: Find a CFG that generates L. Apply the
procedures in Theorems 6.2, 6.3, and 6.4. The
result is an equivalent CFG that generates L but
does not have any useless productions, λ-productions, or unit-productions.
Summary
Note that the procedures specified above must be applied
in a particular order. The procedure for removing
λ-productions can create new unit-productions,
and the procedure for eliminating unit-productions must start with a CFG that has no λ-productions. The required sequence is:
1. Remove λ-productions
2. Remove unit-productions
3. Remove useless productions
Unit productions
Given a context-free grammar G without
λ-productions or unit-productions, any production rule must either:
Convert a non-terminal to a terminal, or
Replace a non-terminal with at least two other
symbols
Unit productions
Let:
l = length of the current string
t = the number of terminals in the current string
Each application of a production rule increases l + t by at least 1.
The value of l + t is 1 for the starting string S and 2k for a
string (all terminals) of length k in the language.
The value of l + t for an intermediate string of length k
containing 1 or more variables would be < 2k.
Any intermediate string with l + t > 2k cannot generate a
string of length k in the language.
Simplified forms
What does this mean for us?
Given a grammar G without λ-productions or unit-productions, it means that if
you have a string x in L(G) with |x| = k, then starting
from S there are no more than 2k - 1 steps in the
derivation of x.
Proof:
Every rule that lengthens the intermediate string adds at least
one symbol, so at most k - 1 such steps are needed to go from S
to an intermediate string with k symbols.
Once the intermediate string has k symbols in it, any
additional rules involved in the derivation of x must
simply replace variable symbols with terminals. The
worst-case scenario is if all the symbols are variables;
in that case, we will need at most k steps (of rules of the
second type, which replace a single variable with a
single terminal) to convert the intermediate string into a
string of all terminals.
It will take no more than (k - 1) + k = 2k - 1 applications of the
production rules to derive x.
These rules can be applied in any order. (We don't have to
expand the string first and then convert it to terminals.)
Example
Original grammar:
S → AB | ab
A → ABAB | BA
B → ab | b
After step 2:
S → AB | XaXb
Xa → a
Xb → b
A → ABAB | BA
B → XaXb | b
Example
After step 3:
S → AB | XaXb
Xa → a
Xb → b
A → AY1 | BA
Y1 → BY2
Y2 → AB
B → XaXb | b
Example
If you recognize that
A → ABAB
has two copies of the
same pair of variables,
you could substitute
the following instead
(but the first procedure
works equally well):
After step 3:
S → AB | XaXb
Xa → a
Xb → b
A → Y1Y1 | BA
Y1 → AB
B → XaXb | b
Proof (concluded)
This constitutes a proof by construction that
any CFG whose language does not contain λ can be converted to CNF.
Later, this will be used to prove that there are
languages which are not context-free.
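The two transformation steps illustrated in the example above can be sketched in Python. This assumes that "step 2" replaces terminals in right-hand sides of length two or more with new variables, and that "step 3" splits right-hand sides longer than two into chains, which is what the example output suggests. The encoding (tuples of symbol names), the function name `to_cnf`, and the generated names `Xa`, `Y1`, ... are illustrative assumptions; the input is assumed to have no λ-productions or unit-productions, and the generated names are assumed not to clash with existing variables.

```python
def to_cnf(productions):
    """Convert a grammar (dict: variable -> list of symbol tuples) to CNF."""
    variables = set(productions)
    terminal_vars = {}  # terminal -> name of the new variable standing for it
    replaced = {}

    # Assumed step 2: in right-hand sides of length >= 2,
    # replace each terminal a by a new variable Xa with Xa -> a.
    for var, rhss in productions.items():
        replaced[var] = []
        for rhs in rhss:
            if len(rhs) >= 2:
                rhs = tuple(
                    s if s in variables else terminal_vars.setdefault(s, "X" + s)
                    for s in rhs
                )
            replaced[var].append(rhs)
    for terminal, name in terminal_vars.items():
        replaced[name] = [(terminal,)]

    # Assumed step 3: split right-hand sides longer than 2 into a chain
    # of two-symbol rules using fresh variables Y1, Y2, ...
    counter = 0
    final = {}
    for var, rhss in replaced.items():
        final.setdefault(var, [])
        for rhs in rhss:
            head = var
            while len(rhs) > 2:
                counter += 1
                fresh = "Y" + str(counter)
                final.setdefault(head, []).append((rhs[0], fresh))
                head, rhs = fresh, rhs[1:]
            final.setdefault(head, []).append(rhs)
    return final


cnf_input = {
    "S": [("A", "B"), ("a", "b")],
    "A": [("A", "B", "A", "B"), ("B", "A")],
    "B": [("a", "b"), ("b",)],
}
for var, rhss in to_cnf(cnf_input).items():
    print(var, "->", " | ".join("".join(rhs) for rhs in rhss))
```

On this input the sketch yields the same rules as the first "After step 3" grammar in the example; it does not attempt the shortcut of reusing one new variable for a repeated pair.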
Some motivation
Here is the idea of the algorithm:
For a grammar in Chomsky normal form, any
derivation of a string w has 2n-1 steps, where n is
the length of w. (Why?) So, it is only necessary to
check derivations of 2n-1 steps to decide whether G
generates w.
Of course, this parsing algorithm is inefficient! It
would never be used in practice. But it solves the
membership problem for CFLs.
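The brute-force idea can be sketched as an exhaustive search over leftmost derivations of at most 2n - 1 steps. This is an illustration only: the encoding (tuples of symbol names), the function name `derives`, and the pruning of intermediate strings longer than n are assumptions, and the running time is exponential in general.

```python
def derives(productions, start, target):
    """Brute-force membership test for a grammar in Chomsky normal form."""
    n = len(target)
    if n == 0:
        return False  # a CNF grammar of this kind cannot generate the empty string
    goal = tuple(target)
    forms = {(start,)}
    for _ in range(2 * n - 1):  # no derivation of target needs more steps
        next_forms = set()
        for form in forms:
            # Expand the leftmost variable only (leftmost derivations suffice).
            index = next((i for i, s in enumerate(form) if s in productions), None)
            if index is None:
                continue  # all terminals already; nothing left to expand
            for rhs in productions[form[index]]:
                new_form = form[:index] + rhs + form[index + 1:]
                if len(new_form) <= n:  # CNF rules never shorten a string
                    next_forms.add(new_form)
        if goal in next_forms:
            return True
        forms = next_forms
    return False


cnf_grammar = {
    "S": [("A", "B"), ("Xa", "Xb")],
    "Xa": [("a",)],
    "Xb": [("b",)],
    "A": [("A", "Y1"), ("B", "A")],
    "Y1": [("B", "Y2")],
    "Y2": [("A", "B")],
    "B": [("Xa", "Xb"), ("b",)],
}
print(derives(cnf_grammar, "S", "ab"))   # True
print(derives(cnf_grammar, "S", "abb"))  # False
```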
LL grammars
A top-down parser finds a leftmost derivation of a string.
Top-down means to start with the start symbol and
show how to derive the string from it.
An LL(k) grammar allows a parser to perform a left-to-right scan of the input to find a leftmost derivation, using
k symbols of lookahead to select the next rule.
Many compilers have been written using LL parsers. But
LL grammars are not sufficiently general to generate all
deterministic CFLs. This led to study of more general
deterministic grammars, especially LR grammars.
LR grammars
A bottom-up parser finds a rightmost derivation of a
string, constructed in reverse. Bottom-up means to start with a string and
reduce it to the start symbol.
An LR(k) grammar allows a parser to perform a left-to-right scan of the input to produce a rightmost derivation,
using k symbols of lookahead to select the next rule.
The class of languages generated by LR(1) grammars is
exactly the deterministic CFLs.
Two subclasses of LR(1) grammars, called SLR(1) (for
simple LR) and LALR(1) (for lookahead LR) are
commonly used for programming languages.
Parsing algorithms
Parsing is an extremely important topic in the
design and compilation of programming
languages. You will study parsing algorithms
based on various LL and LR grammars in a
course on compiler design.
Most of what we have studied in these
chapters about regular and context-free
languages provides the mathematical
foundation for designing good compilers. (It
has many other applications as well.)
Efficient parsing
The syntax of most programming languages is described by context-free
grammars, and parsing is central to any
programming language compiler.
Many parsing algorithms for context-free
grammars have been developed over the years.
Most simulate pushdown automata.
However, some PDAs cannot be simulated
efficiently by computer programs because they
are nondeterministic. Efficient parsers simulate
deterministic PDAs.
Regular grammars
All regular languages can be generated by regular
grammars. All regular grammars generate regular
languages.
Context-free grammars are more powerful than
regular grammars. Regular languages are a
proper subset of context-free languages, so CFGs
can generate all regular languages (as well as
non-regular context-free languages).