The lexical analyser breaks up the program input into a sequence of tokens, the
basic syntactic components of the language.
The lexical analyser is also responsible for decoding the lexical representation of
symbols in the source program
(e.g. it is possible to modify a Pascal compiler to accept the character @ instead of ^
simply by making a minor modification to the lexical analyser).
The spelling of reserved words should be the concern of the lexical analyser only (e.g.
all Pascal reserved words could be replaced by their Welsh translations by changing the
lexical analyser's table).
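The idea that spellings live only in the lexical analyser's tables can be sketched as follows. This is a minimal illustration in Python rather than Pascal; the token names (pointersym, identsym, unknown) and the function classify are hypothetical and are not taken from any real compiler.

```python
# Hypothetical token names; only these tables know how tokens are spelled.
RESERVED_WORDS = {"begin": "beginsym", "end": "endsym", "while": "whilesym"}
SPECIAL_SYMBOLS = {"^": "pointersym"}

# Accepting @ instead of ^ is a one-entry change to the symbol table.
AT_SYMBOLS = {"@": "pointersym"}


def classify(lexeme, reserved=RESERVED_WORDS, special=SPECIAL_SYMBOLS):
    """Map a lexeme to a token name using only the lookup tables."""
    if lexeme in special:
        return special[lexeme]
    if lexeme.isalnum() and lexeme[0].isalpha():
        # Pascal reserved words are not case-sensitive.
        return reserved.get(lexeme.lower(), "identsym")
    return "unknown"
```

Calling classify("@", special=AT_SYMBOLS) gives the same token as classify("^") with the default table, so nothing beyond the lexical analyser needs to change. Translating the reserved words would be a matter of replacing the keys of RESERVED_WORDS in the same way.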
Tokens
The lexical analyser recognises the basic syntactic components of a programming
language. For example, a lexical analyser for Pascal would recognise identifiers, reserved
words, numerical constants, strings, punctuation symbols and special symbols. These
tokens would be passed to the syntax analyser as single values (nature, value or
identification).
So, the lexical analyser has to perform the dual task of recognising the token and
sometimes evaluating it.
Significant Problem: the need for look-ahead
It may be necessary to read several characters of a token before the
type of that token can be determined. The degree of look-ahead
depends on the particular token and on the language.
Pascal is a good language in this respect (most tokens can be distinguished using just
one or two characters of look-ahead).
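As an illustration of one-character look-ahead, the sketch below distinguishes Pascal operators that share a first character, such as : and :=, or <, <= and <>. It is written in Python as an assumption-laden sketch; the function name scan_operator and the token names (becomes, colon, leq, neq, lss, geq, gtr) are hypothetical.

```python
def scan_operator(text, pos):
    """Return (token, next_pos) for a Pascal operator starting at text[pos]."""
    ch = text[pos]
    # One character of look-ahead; empty string at the end of the input.
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if ch == ":":
        return ("becomes", pos + 2) if nxt == "=" else ("colon", pos + 1)
    if ch == "<":
        if nxt == "=":
            return ("leq", pos + 2)
        if nxt == ">":
            return ("neq", pos + 2)
        return ("lss", pos + 1)
    if ch == ">":
        return ("geq", pos + 2) if nxt == "=" else ("gtr", pos + 1)
    raise ValueError(f"unexpected character {ch!r} at position {pos}")
```

The type of the token is only known once the character after : or < has been inspected, which is exactly the look-ahead described above.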
or
A → aB
Alternation and repetition can be specified by these grammars, but more complex
structures such as balanced parentheses cannot be handled.
Strengths of these grammars: possible to construct simple and efficient parsers for them.
If the lexical tokens of a programming language can be defined in terms of a type 3
grammar, then an efficient lexical analyser for a compiler for that language can be
constructed simply, derived directly from the set of production rules defining the tokens.
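Notation:
Regular Expressions
A regular expression consists of symbols (from the alphabet of the language being defined) and a
set of operators that allow concatenation (a b), alternation (a | b) and repetition (a *).
In terms of precedence, repetition binds most tightly, then concatenation, then alternation.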
a b denotes the set of strings {ab}; this set contains just one member
a | b denotes {a, b}
a * denotes {ε, a, aa, aaa, ...}, where ε is the empty string
a b * denotes {a, ab, abb, abbb, ...}
(a | b)* denotes the set of strings made up of zero or more instances of an a or a b
(a b | c)* d denotes {d, abd, cd, abcd, ababcd, ...}
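These sets can be checked mechanically. The sketch below uses Python's standard re module, whose operators |, * and parentheses have the same meaning as in the notation above; the spaces in the notes' notation are only for readability and are dropped in the patterns. The helper name in_language is an assumption made for this example.

```python
import re


def in_language(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return re.fullmatch(pattern, s) is not None


# Membership checks for some of the expressions listed above.
examples = [
    ("a*", "", True),            # the empty string is in a*
    ("ab*", "abbb", True),
    ("ab*", "abab", False),      # * applies to b only, not to ab
    ("(ab|c)*d", "ababcd", True),
    ("(ab|c)*d", "ad", False),   # an a must be followed by a b
]
for pattern, s, expected in examples:
    assert in_language(pattern, s) == expected
```

re.fullmatch is used rather than re.match so that the whole string, not just a prefix, has to be described by the expression.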
Equivalent regular expressions: two regular expressions are equivalent when they denote the same set of strings, that is, the same
language.
e.g. a ( b | c) and
a b | a c are equivalent
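A quick sanity check of equivalence is to compare two expressions on every string up to some length. The Python sketch below does this by brute force; note that it is only a bounded test and not a proof of equivalence. The function name agree_up_to is an assumption made for this example.

```python
import re
from itertools import product


def agree_up_to(p, q, alphabet, max_len):
    """True if p and q accept exactly the same strings of length <= max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p, s)) != bool(re.fullmatch(q, s)):
                return False  # s is in one language but not the other
    return True
```

For instance, agree_up_to("a(b|c)", "ab|ac", "abc", 4) finds no disagreement, whereas comparing "ab*" with "(ab)*" fails immediately because only the second accepts the empty string.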
Finite-State Automata
It is possible to represent a regular expression as a transition diagram, that is, a directed
graph with labelled branches,
e.g. (a b | c ) * d
The nodes (states) are drawn as circles containing the state number. Double-circled states are accepting
states (i.e. states reached when the expression to be parsed has been successfully
recognised). The arrowed lines are the edges or transitions.
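[Figure: transition diagram for (a b | c ) * d. State 1 is the start state, with an edge labelled a to state 2, an edge labelled c back to state 1 and an edge labelled d to state 3; state 2 has an edge labelled b back to state 1; state 3 is the accepting state.]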
Input abcd:
Start in state 1
Input a, transition to state 2
Input b, transition to state 1
Input c, transition to state 1
Input d, transition to state 3, an accepting state, so the parse succeeds
Another input, beginning with ac:
1. Start in state 1
2. Input a, transition to state 2
3. Input c: there are no edges labelled c from state 2, and hence this parse fails.
(Since the parser knows that it is in state 2 when the parse fails, it can output the
informative message that it was expecting a b, the label of the only edge leaving
state 2.)
Transition table (of (a b | c ) * d )

State    a    b    c    d
1        2    -    1    3
2        -    1    -    -
3        finished
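A transition table like this one can drive a recogniser directly. The following Python sketch encodes the table for (a b | c)* d as a dictionary and reports which input was expected when the parse fails; the names TRANSITIONS, START, ACCEPTING and run, and the wording of the messages, are assumptions made for this example.

```python
# Transition table for (a b | c)* d: state -> {input symbol -> next state}.
TRANSITIONS = {
    1: {"a": 2, "c": 1, "d": 3},
    2: {"b": 1},
    3: {},  # accepting state: no outgoing edges
}
START = 1
ACCEPTING = {3}


def run(text):
    """Return (accepted, message) after feeding text through the automaton."""
    state = START
    for i, ch in enumerate(text):
        edges = TRANSITIONS[state]
        if ch not in edges:
            # The current state tells us which inputs would have been legal.
            expected = " or ".join(sorted(edges)) or "end of input"
            return False, (f"state {state}, position {i}: "
                           f"got {ch!r}, expected {expected}")
        state = edges[ch]
    if state in ACCEPTING:
        return True, "accepted"
    return False, f"input ended in non-accepting state {state}"
```

With this table, run("abcd") follows the transitions 1, 2, 1, 1, 3 and succeeds, while run("ac") fails in state 2 with a message that a b was expected.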
[Table fragment: a {1} {4}; b {2} {2, 3}; finished]
First Steps:
- Decide on the precise set of tokens the lexical analyser should recognise.
- Decide on the notation to be used to specify the syntax of these tokens, preferably a
formal notation.
Tokens are identified by an integer value or equivalent,
e.g. an enumerated type in Pascal.
A convenient way is to implement the lexical analyser as a procedure or function called by the
syntax analyser,
e.g. (a lexical analyser written in Pascal):
type lextoken = (beginsym, endsym, ifsym, dosym, whilesym, periodsym,
                 commasym, semicolon);
var token: lextoken; (* the L.A. updates this variable each time it is called *)
procedure NextToken;
.
.
.
NextToken has to read characters from the compiler's input and return the identity of a single lexical token in the global variable token each time it is called.
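The calling convention can be sketched as follows, in Python rather than Pascal. The enumeration reuses the token names from the Pascal type above; ident and eof are extra members added for this example only, and the class Lexer, its method next_token and the two lookup tables are likewise hypothetical. The attribute self.token plays the role of the global variable token.

```python
from enum import Enum

# Token names from the notes, plus ident and eof (assumed for this sketch).
LexToken = Enum("LexToken", "beginsym endsym ifsym dosym whilesym "
                            "periodsym commasym semicolon ident eof")

RESERVED = {"begin": LexToken.beginsym, "end": LexToken.endsym,
            "if": LexToken.ifsym, "do": LexToken.dosym,
            "while": LexToken.whilesym}
PUNCTUATION = {".": LexToken.periodsym, ",": LexToken.commasym,
               ";": LexToken.semicolon}


class Lexer:
    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.token = None  # updated each time next_token is called

    def next_token(self):
        """Read characters and leave the next token's identity in self.token."""
        src = self.source
        while self.pos < len(src) and src[self.pos].isspace():
            self.pos += 1  # skip blanks between tokens
        if self.pos >= len(src):
            self.token = LexToken.eof
            return
        ch = src[self.pos]
        if ch.isalpha():
            start = self.pos
            while self.pos < len(src) and src[self.pos].isalnum():
                self.pos += 1
            word = src[start:self.pos].lower()
            self.token = RESERVED.get(word, LexToken.ident)
        elif ch in PUNCTUATION:
            self.pos += 1
            self.token = PUNCTUATION[ch]
        else:
            raise ValueError(f"unexpected character {ch!r}")
```

A syntax analyser would call next_token() whenever it needs another token and then inspect the token attribute, much as the Pascal version inspects the global variable.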
Numerical constants
Integer, floating-point constants
'0', '1', '2', '3', '4',
'5', '6', '7', '8', '9':
  begin
    intval := 0;
    while ch in ['0'..'9'] do
    begin
      (* accumulate decimal value *)
      intval := intval * 10 + ord(ch) - ord('0');
      NextCh
    end;
    token := integersym
    (* value of constant returned in intval *)
  end;
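The same accumulation loop can be tried out in isolation. The Python sketch below mirrors the Pascal fragment, with the position pos standing in for the calls to NextCh; the function name scan_integer and its return convention are assumptions made for this example.

```python
def scan_integer(text, pos):
    """Return (intval, next_pos) for the run of digits starting at text[pos]."""
    intval = 0
    while pos < len(text) and "0" <= text[pos] <= "9":
        # Accumulate the decimal value: shift left one digit, add the new one.
        intval = intval * 10 + ord(text[pos]) - ord("0")
        pos += 1  # corresponds to NextCh
    return intval, pos
```

For the input 1234; the loop computes 1, 12, 123 and 1234 in turn and stops at the semicolon, which is left for the next call to the lexical analyser.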