M. E. Lesk and E. Schmidt

ABSTRACT
Lex helps write programs whose control flow is directed by instances of regular
expressions in the input stream. It is well suited for editor-script type transformations
and for segmenting input in preparation for a parsing routine.
Lex source is a table of regular expressions and corresponding program fragments.
The table is translated to a program which reads an input stream, copying it to an output
stream and partitioning the input into strings which match the given expressions. As
each such string is recognized the corresponding program fragment is executed. The
recognition of the expressions is performed by a deterministic finite automaton generated
by Lex. The program fragments written by the user are executed in the order in which
the corresponding regular expressions occur in the input stream.
The lexical analysis programs written with Lex accept ambiguous specifications
and choose the longest match possible at each input point. If necessary, substantial look-
ahead is performed on the input, but the input stream will be backed up to the end of the
current partition, so that the user has general freedom to manipulate it.
Lex can generate analyzers in either C or Ratfor, a language which can be
translated automatically to portable Fortran. It is available on the PDP-11 UNIX,
Honeywell GCOS, and IBM OS systems. This manual, however, will only discuss
generating analyzers in C on the UNIX system, which is the only supported form of Lex
under UNIX Version 7. Lex is designed to simplify interfacing with Yacc, for those with
access to this compiler-compiler system.
on different computer hardware, Lex can write code in different host languages. The host
language is used for the output code generated by Lex and also for the program fragments
added by the user. Compatible run-time libraries for the different host languages are
also provided. This makes Lex adaptable to different environments and different users.
Each application may be directed to the combination of hardware and host language
appropriate to the task, the user's background, and the properties of local
implementations. At present, the only supported host language is C, although Fortran
(in the form of Ratfor [2]) has been available in the past. Lex itself exists on UNIX,
GCOS, and OS/370; but the code generated by Lex may be taken anywhere the appropriate
compilers exist.

Lex turns the user's expressions and actions (called source in this memo) into the host
general-purpose language; the generated program is named yylex. The yylex program will
recognize expressions in a stream (called input in this memo) and perform the specified
actions for each expression as it is detected. See Figure 1.

[Figure 1. An overview of Lex: Source -> Lex -> yylex; Input -> yylex -> Output.]

For a trivial example, consider a program to delete from the input all blanks or tabs at
the ends of lines.

    %%
    [ \t]+$   ;

is all that is required. The program contains a %% delimiter to mark the beginning of
the rules, and one rule. This rule contains a regular expression which matches one or
more instances of the characters blank or tab (written \t for visibility, in accordance
with the C language convention) just prior to the end of a line. The brackets indicate
the character class made of blank and tab; the + indicates "one or more ..."; and the $
indicates "end of line," as in QED. No action is specified, so the program generated by
Lex (yylex) will ignore these characters. Everything else will be copied. To change any
remaining string of blanks or tabs to a single blank, add another rule:

    %%
    [ \t]+$   ;
    [ \t]+    printf(" ");

The finite automaton generated for this source will scan for both rules at once,
observing at the termination of the string of blanks or tabs whether or not there is a
newline character, and executing the desired rule action. The first rule matches all
strings of blanks or tabs at the end of lines, and the second rule all remaining strings
of blanks or tabs.

Lex can be used alone for simple transformations, or for analysis and statistics
gathering on a lexical level. Lex can also be used with a parser generator to perform
the lexical analysis phase; it is particularly easy to interface Lex and Yacc [3]. Lex
programs recognize only regular expressions; Yacc writes parsers that accept a large
class of context free grammars, but require a lower level analyzer to recognize input
tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a
preprocessor for a later parser generator, Lex is used to partition the input stream,
and the parser generator assigns structure to the resulting pieces. The flow of control
in such a case (which might be the first half of a compiler, for example) is shown in
Figure 2. Additional programs, written by other generators or by hand, can be added
easily to programs written by Lex.

[Figure 2. Lex with Yacc: lexical rules -> Lex -> yylex; grammar rules -> Yacc ->
yyparse; Input -> yylex -> yyparse -> Parsed input.]

Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to
be named, so that the use of this name by Lex simplifies interfacing.

Lex generates a deterministic finite automaton from the regular expressions in the
source [4]. The automaton is interpreted, rather than compiled, in order to save space.
The result is still a fast analyzer. In particular, the time taken by a Lex program to
recognize and partition an input stream is proportional to the length of the input. The
number of Lex rules or the complexity of the rules is not important in determining
speed, unless rules which include forward context require a significant amount of
rescanning. What does increase with the number and complexity of rules is the size of
the finite automaton, and therefore the size of the program generated by Lex.
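As a rough illustration of what the two whitespace rules shown earlier in this section accomplish, the following C function performs the same transformation by hand on a buffer. It is a sketch only: the name squeeze_blanks is hypothetical and is not part of Lex, and the program Lex actually generates is a table-driven automaton rather than a hand-written loop like this one.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Lex): mimics the two rules
 *     [ \t]+$   ;
 *     [ \t]+    printf(" ");
 * on a NUL-terminated buffer. A run of blanks/tabs immediately before a
 * newline is dropped; every other run becomes one blank; all other
 * characters are copied unchanged.
 * `out` must have room for at least strlen(in) + 1 bytes. */
void squeeze_blanks(const char *in, char *out)
{
    size_t i = 0, o = 0;

    while (in[i] != '\0') {
        if (in[i] == ' ' || in[i] == '\t') {
            /* Scan to the end of the run, then look at what follows it. */
            while (in[i] == ' ' || in[i] == '\t')
                i++;
            if (in[i] != '\n')      /* second rule: not at end of line */
                out[o++] = ' ';
            /* first rule: the run ends at a newline, so emit nothing */
        } else {
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
}
```

Note that the decision between the two rules is made only after the whole run has been scanned, which mirrors the way the automaton examines the character following the blanks before choosing an action.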
In the program written by Lex, the user's fragments (representing the actions to be
performed as each regular expression is found) are gathered as cases of a switch. The
automaton interpreter directs the control flow. Opportunity is provided for the user to
insert either declarations or additional statements in the routine containing the
actions, or to add subroutines outside this action routine.

Lex is not limited to source which can be interpreted on the basis of one character
look-ahead. For example, if there are two rules, one looking for ab and another for
abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input
pointer just before cd. . . Such backup is more costly than the processing of simpler
languages.

2. Lex Source.

The general format of Lex source is:

    {definitions}
    %%
    {rules}
    %%
    {user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is
optional, but the first is required to mark the beginning of the rules. The absolute
minimum Lex program is thus

    %%

(no definitions, no rules) which translates into a program which copies the input to the
output unchanged.

In the outline of Lex programs shown above, the rules represent the user's control
decisions; they are a table, in which the left column contains regular expressions (see
section 3) and the right column contains actions, program fragments to be executed when
the expressions are recognized. Thus an individual rule might appear

    integer   printf("found keyword INT");

to look for the string integer in the input stream and print the message "found keyword
INT" whenever it appears. In this example the host procedural language is C and the C
library function printf is used to print the string. The end of the expression is
indicated by the first blank or tab character. If the action is merely a single C
expression, it can just be given on the right side of the line; if it is compound, or
takes more than a line, it should be enclosed in braces. As a slightly more useful
example, suppose it is desired to change a number of words from British to American
spelling. Lex rules such as

    colour      printf("color");
    mechanise   printf("mechanize");
    petrol      printf("gas");

would be a start. These rules are not quite enough, since the word petroleum would
become gaseum; a way of dealing with this will be described later.

3. Lex Regular Expressions.

The definitions of regular expressions are very similar to those in QED [5]. A regular
expression specifies a set of strings to be matched. It contains text characters (which
match the corresponding characters in the strings being compared) and operator
characters (which specify repetitions, choices, and other features). The letters of the
alphabet and the digits are always text characters; thus the regular expression

    integer

matches the string integer wherever it appears and the expression

    a57D

looks for the string a57D.

Operators. The operator characters are

    " \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape should be used. The quotation
mark operator (") indicates that whatever is contained between a pair of quotes is to be
taken as text characters. Thus

    xyz"++"

matches the string xyz++ when it appears. Note that a part of a string may be quoted. It
is harmless but unnecessary to quote an ordinary text character; the expression

    "xyz++"

is the same as the one above. Thus by quoting every non-alphanumeric character being
used as a text character, the user can avoid remembering the list above of current
operator characters, and is safe should further extensions to Lex lengthen the list.

An operator character may also be turned into a text character by preceding it with \ as
in

    xyz\+\+

which is another, less readable, equivalent of the above expressions. Another use of the
quoting mechanism is to get a blank into an expression; normally, as explained above,
blanks or tabs end a rule. Any blank character not contained within [ ] (see below) must
be quoted. Several normal C escapes with \ are recognized: \n is newline, \t is
tab, and \b is backspace. To enter \ itself, use \\. Since newline is illegal in an
expression, \n must be used; it is not required to escape tab and backspace. Every
character but blank, tab, newline and the list above is always a text character.

Character classes. Classes of characters can be specified using the operator pair [ ].
The construction [abc] matches a single character, which may be a, b, or c. Within
square brackets, most operator meanings are ignored. Only three characters are special:
these are \ - and ^. The - character indicates ranges. For example,

    [a-z0-9<>_]

indicates the character class containing all the lower case letters, the digits, the
angle brackets, and underline. Ranges may be given in either order. Using - between any
pair of characters which are not both upper case letters, both lower case letters, or
both digits is implementation dependent and will get a warning message. (E.g., [0-z] in
ASCII is many more characters than it is in EBCDIC). If it is desired to include the
character - in a character class, it should be first or last; thus

    [-+0-9]

matches all the digits and the two signs.

In character classes, the ^ operator must appear as the first character after the left
bracket; it indicates that the resulting string is to be complemented with respect to
the computer character set. Thus

    [^abc]

matches all characters except a, b, or c, including all special or control characters;
or

    [^a-zA-Z]

is any character which is not a letter. The \ character provides the usual escapes
within character class brackets.

Arbitrary character. To match almost any character, the operator character

    .

is the class of all characters except newline. Escaping into octal is possible although
non-portable:

    [\40-\176]

matches all printable characters in the ASCII character set, from octal 40 (blank) to
octal 176 (tilde).

Optional expressions. The operator ? indicates an optional element of an expression.
Thus

    ab?c

matches either ac or abc.

Repeated expressions. Repetitions of classes are indicated by the operators * and +.

    a*

is any number of consecutive a characters, including zero; while

    a+

is one or more instances of a. For example,

    [a-z]+

is all strings of lower case letters. And

    [A-Za-z][A-Za-z0-9]*

indicates all alphanumeric strings with a leading alphabetic character. This is a
typical expression for recognizing identifiers in computer languages.

Alternation and Grouping. The operator | indicates alternation:

    (ab|cd)

matches either ab or cd. Note that parentheses are used for grouping, although they are
not necessary on the outside level;

    ab|cd

would have sufficed. Parentheses can be used for more complex expressions:

    (ab|cd+)?(ef)*

matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.

Context sensitivity. Lex will recognize a small amount of surrounding context. The two
simplest operators for this are ^ and $. If the first character of an expression is ^,
the expression will only be matched at the beginning of a line (after a newline
character, or at the beginning of the input stream). This can never conflict with the
other meaning of ^, complementation of character classes, since that only applies within
the [ ] operators. If the very last character is $, the expression will only be matched
at the end of a line (when immediately followed by newline). The latter operator is a
special case of the / operator character, which indicates trailing context. The
expression

    ab/cd

matches the string ab, but only if followed by cd. Thus

    ab$

is the same as

    ab/\n

Left context is handled in Lex by start conditions as explained in section 10. If a rule
is only to be executed when the Lex automaton interpreter is in start condition x, the
rule should be prefixed by

    <x>

using the angle bracket operator characters. If we considered "being at the beginning of
a line" to be start condition ONE, then the ^ operator would
-- --
quotation (") marks, and provides that to include a " in a string it must be preceded by
a \. The regular expression which matches that is somewhat confusing, so that it might
be preferable to write

    \"[^"]*   {
            if (yytext[yyleng-1] == '\\')
                yymore();
            else
                ... normal user processing
            }

which will, when faced with a string such as "abc\"def" first match the five characters
"abc\ ; then the call to yymore() will cause the next part of the string, "def , to be
tacked on the end. Note that the final quote terminating the string should be picked up
in the code labeled "normal processing".

The function yyless() might be used to reprocess text in various circumstances. Consider
the C problem of distinguishing the ambiguity of "=-a". Suppose it is desired to treat
this as "=- a" but print a message. A rule might be

    =-[a-zA-Z]   {
            printf("Op (=-) ambiguous\n");
            yyless(yyleng-1);
            ... action for =- ...
            }

which prints a message, returns the letter after the operator to the input stream, and
treats the operator as "=-". Alternatively it might be desired to treat this as "= -a".
To do this, just return the minus sign as well as the letter to the input:

    =-[a-zA-Z]   {
            printf("Op (=-) ambiguous\n");
            yyless(yyleng-2);
            ... action for = ...
            }

will perform the other interpretation. Note that the expressions for the two cases might
more easily be written

    =-/[A-Za-z]

in the first case and

    =/-[A-Za-z]

in the second; no backup would be required in the rule action. It is not necessary to
recognize the whole identifier to observe the ambiguity. The possibility of "=-3",
however, makes

    =-/[^ \t\n]

a still better rule.

In addition to these routines, Lex also permits access to the I/O routines it uses. They
are:

    1) input() which returns the next input character;
    2) output(c) which writes the character c on the output; and
    3) unput(c) pushes the character c back onto the input stream to be read later by
       input().

By default these routines are provided as macro definitions, but the user can override
them and supply private versions. These routines define the relationship between
external files and internal characters, and must all be retained or modified
consistently. They may be redefined, to cause input or output to be transmitted to or
from strange places, including other programs or internal memory; but the character set
used must be consistent in all routines; a value of zero returned by input must mean end
of file; and the relationship between unput and input must be retained or the Lex
lookahead will not work. Lex does not look ahead at all if it does not have to, but
every rule ending in + * ? or $ or containing / implies lookahead. Lookahead is also
necessary to match an expression that is a prefix of another expression. See below for a
discussion of the character set used by Lex. The standard Lex library imposes a 100
character limit on backup.

Another Lex library routine that the user will sometimes want to redefine is yywrap()
which is called whenever Lex reaches an end-of-file. If yywrap returns a 1, Lex
continues with the normal wrapup on end of input. Sometimes, however, it is convenient
to arrange for more input to arrive from a new source. In this case, the user should
provide a yywrap which arranges for new input and returns 0. This instructs Lex to
continue processing. The default yywrap always returns 1.

This routine is also a convenient place to print tables, summaries, etc. at the end of a
program. Note that it is not possible to write a normal rule which recognizes
end-of-file; the only access to this condition is through yywrap. In fact, unless a
private version of input() is supplied a file containing nulls cannot be handled, since
a value of 0 returned by input is taken to be end-of-file.

5. Ambiguous Source Rules.

Lex can handle ambiguous specifications. When more than one expression can match the
current input, Lex chooses as follows:

    1) The longest match is preferred.
    2) Among rules which matched the same number of characters, the rule given first is
       preferred.
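The two selection rules (longest match first, then earliest rule on a tie) can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the code Lex generates: the matcher_fn type and the functions match_integer, match_lower and choose_rule are hypothetical names introduced here, and each rule is tried separately, whereas Lex scans for all rules at once with a single automaton.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical matcher type: returns how many leading characters of s
 * the rule's expression matches, or 0 if it does not match at all. */
typedef size_t (*matcher_fn)(const char *s);

/* Stands in for a rule whose expression is the literal text: integer */
size_t match_integer(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Stands in for a rule whose expression is: [a-z]+ */
size_t match_lower(const char *s)
{
    size_t n = 0;
    while (islower((unsigned char)s[n]))
        n++;
    return n;
}

/* Returns the index of the rule selected at s, or -1 if no rule matches.
 * 1) The longest match wins.
 * 2) On a tie the rule listed first wins; the strict '>' comparison
 *    never replaces an earlier rule with a later one of equal length. */
int choose_rule(const matcher_fn *rules, int nrules, const char *s,
                size_t *len)
{
    int best = -1;
    size_t best_len = 0;

    for (int i = 0; i < nrules; i++) {
        size_t n = rules[i](s);
        if (n > best_len) {
            best = i;
            best_len = n;
        }
    }
    if (len != NULL)
        *len = best_len;
    return best;
}
```

With the rules listed in the order { match_integer, match_lower }, the input integers selects the second rule (8 characters against 7), while the input integer selects the first, because both match 7 characters and the earlier rule takes precedence.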
Thus, suppose the rules

    integer   keyword action ...;
    [a-z]+    identifier action ...;

to be given in that order. If the input is integers, it is taken as an identifier,
because [a-z]+ matches 8 characters while integer matches only 7. If the input is
integer, both rules match 7 characters, and the keyword rule is selected because it was
given first. Anything shorter (e.g. int) will not match the expression integer and so
the identifier interpretation is used.

The principle of preferring the longest match makes rules containing expressions like .*
dangerous. For example,

    '.*'

might seem a good way of recognizing a string in single quotes. But it is an invitation
for the program to read far ahead, looking for a distant single quote. Presented with
the input

    'first' quoted string here, 'second' here

the above expression will match

    'first' quoted string here, 'second'

which is probably not what was wanted. A better rule is of the form

    '[^'\n]*'

which, on the above input, will stop after 'first'. The consequences of errors like this
are mitigated by the fact that the . operator will not match newline. Thus expressions
like .* stop on the current line. Don't try to defeat this with expressions like [.\n]+
or equivalents; the Lex generated program will try to read the entire input file,
causing internal buffer overflows.

Note that Lex is normally partitioning the input stream, not searching for all possible
matches of each expression. This means that each character is accounted for once and
only once. For example, suppose it is desired to count occurrences of both she and he in
an input text. Some Lex rules to do this might be

    she   s++;
    he    h++;
    \n    |
    .     ;

where the last two rules ignore everything besides he and she. Remember that . does not
include newline. Since she includes he, Lex will normally not recognize the instances of
he included in she, since once it has passed a she those characters are gone.

Sometimes the user would like to override this choice. The action REJECT means "go do
the next alternative." It causes whatever rule was second choice after the current rule
to be executed. The position of the input pointer is adjusted accordingly. Suppose the
user really wants to count the included instances of he:

    she   {s++; REJECT;}
    he    {h++; REJECT;}
    \n    |
    .     ;

these rules are one way of changing the previous example to do just that. After counting
each expression, it is rejected; whenever appropriate, the other expression will then be
counted. In this example, of course, the user could note that she includes he but not
vice versa, and omit the REJECT action on he; in other cases, however, it would not be
possible a priori to tell which input characters were in both classes.

Consider the two rules

    a[bc]+   { ... ; REJECT;}
    a[cd]+   { ... ; REJECT;}

If the input is ab, only the first rule matches, and on ad only the second matches. The
input string accb matches the first rule for four characters and then the second rule
for three characters. In contrast, the input accd agrees with the second rule for four
characters and then the first rule for three.

In general, REJECT is useful whenever the purpose of Lex is not to partition the input
stream but to detect all examples of some items in the input, and the instances of these
items may overlap or include each other. Suppose a digram table of the input is desired;
normally the digrams overlap, that is the word the is considered to contain both th and
he. Assuming a two-dimensional array named digram to be incremented, the appropriate
source is

    %%
    [a-z][a-z]   {
            digram[yytext[0]][yytext[1]]++;
            REJECT;
            }
    .    ;
    \n   ;

where the REJECT is necessary to pick up a letter pair beginning at every character,
rather than at every other character.

6. Lex Source Definitions.

Remember the format of the Lex source:

    {definitions}
    %%
    {rules}
    %%
    {user routines}

So far only the rules have been described. The
-- --
assigned a bigger number than the size of the hardware character set.

12. Summary of Source Format.

The general form of a Lex source file is:

    {definitions}
    %%
    {rules}
    %%
    {user subroutines}

The definitions section contains a combination of

    1) Definitions, in the form "name space translation".
    2) Included code, in the form "space code".
    3) Included code, in the form
           %{
           code
           %}
    4) Start conditions, given in the form
           %S name1 name2 ...
    5) Character set tables, in the form
           %T
           number space character-string
           ...
           %T
    6) Changes to internal array sizes, in the form
           %x nnn
       where nnn is a decimal integer representing an array size and x selects the
       parameter as follows:

           Letter   Parameter
           p        positions
           n        states
           e        tree nodes
           a        transitions
           k        packed character classes
           o        output array size

Lines in the rules section have the form "expression action" where the action may be
continued on succeeding lines by using braces to delimit it.

Regular expressions in Lex use the following operators:

    x        the character "x"
    "x"      an "x", even if x is an operator.
    \x       an "x", even if x is an operator.
    [xy]     the character x or y.
    [x-z]    the characters x, y or z.
    [^x]     any character but x.
    .        any character but newline.
    ^x       an x at the beginning of a line.
    <y>x     an x when Lex is in start condition y.
    x$       an x at the end of a line.
    x?       an optional x.
    x*       0,1,2, ... instances of x.
    x+       1,2,3, ... instances of x.
    x|y      an x or a y.
    (x)      an x.
    x/y      an x but only if followed by y.
    {xx}     the translation of xx from the definitions section.
    x{m,n}   m through n occurrences of x

13. Caveats and Bugs.

There are pathological expressions which produce exponential growth of the tables when
converted to deterministic machines; fortunately, they are rare.

REJECT does not rescan the input; instead it remembers the results of the previous scan.
This means that if a rule with trailing context is found, and REJECT executed, the user
must not have used unput to change the characters forthcoming from the input stream.
This is the only restriction on the user's ability to manipulate the not-yet-processed
input.

14. Acknowledgments.

As should be obvious from the above, the outside of Lex is patterned on Yacc and the
inside on Aho's string matching routines. Therefore, both S. C. Johnson and A. V. Aho
are really originators of much of Lex, as well as debuggers of it. Many thanks are due
to both.

The code of the current version of Lex was designed, written, and debugged by Eric
Schmidt.

15. References.

1. B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, N. J.
   (1978).
2. B. W. Kernighan, Ratfor: A Preprocessor for a Rational Fortran, Software Practice
   and Experience, 5, pp. 395-496 (1975).
3. S. C. Johnson, Yacc: Yet Another Compiler Compiler, Computing Science Technical
   Report No. 32, 1975, Bell Laboratories, Murray Hill, NJ 07974.
4. A. V. Aho and M. J. Corasick, Efficient String Matching: An Aid to Bibliographic
   Search, Comm. ACM 18, 333-340 (1975).
5. B. W. Kernighan, D. M. Ritchie and K. L. Thompson, QED Text Editor, Computing
   Science Technical Report No. 5, 1972, Bell Laboratories, Murray Hill, NJ 07974.