
BITS Pilani, Hyderabad Campus

Dr. Aruna Malapati
Asst. Professor, Department of CSIS

Lexer / Scanner
Today’s Agenda

• Lexical Analysis

• Basic Concepts & Regular Expressions

• What does a Lexical Analyzer do?

• How does it Work?

• Formalizing Token Definition & Recognition

BITS Pilani, Hyderabad Campus


Short story of a lexer

Syntax analyzer, to the scanner: "Go get me the next team."

The input queue holds the characters: i n t a ;

Scanner, to the first character: "So you are Mr. i, please wait in the next queue until I find a team for you." Then, "Next please," and the same happens to Mr. n and Mr. t.

With i, n and t waiting together, the scanner announces: "I found a team called keyword," and sends the token off to the syntax analyzer.

"Next please." A white space steps up. Scanner: "I do not need a person with the skills you possess, so you get lost." The white space is discarded.

"Next please." Scanning continues in the same way with a and ;.


The Function of a Scanner

Source code -> Scanner -> Tokens

How to specify the patterns: regular expressions.
Tools for pattern recognition: DFA & NFA.


Lexical Analysis

• Lexical analysis recognizes the vocabulary of the programming language and transforms a string of characters into a string of words or tokens.

• Lexical analysis discards white space and comments between the tokens.

• The lexical analyzer (or scanner or lexer) is the program that performs lexical analysis.


Lexical Analysis

INPUT: sequence of characters
OUTPUT: sequence of tokens

The parser repeatedly asks the scanner to "get next token"; the scanner in turn reads the source program character by character ("get next character") until it can return the next token. Both components consult the symbol table.

A lexical analyzer is generally a subroutine of the parser. Keeping it separate gives:
- Simpler design
- Efficiency
- Portability
Lexical Function

1. Read the input stream (sequence of characters) and group the characters into primitives (tokens). Return each token as <type, value>.

2. Throw out certain sequences of characters (blanks, comments, etc.).

3. Build the symbol table.

4. Generate error messages.

Tokens are described using regular expressions.


What is a Lexical Analyzer?

Source program text -> Tokens

• Examples of tokens
  Operators: = + - > ( { := == <>
  Keywords: if while for int double
  Numeric literals: 43 4.565 -3.6e10 0x13F3A
  Character literals: 'a' '~' '\''
  String literals: "4.565" "Fall 10" "\"\" = empty"
• Examples of non-tokens
  White space: space (' ') tab ('\t') end-of-line ('\n')
  Comments: /*this is not a token*/
Introducing Basic Terminology

• What are the major terms for lexical analysis?
  - TOKEN
    • A set of strings defining an atomic element with a defined meaning
    • Examples include <identifier>, <number>, etc.
  - PATTERN
    • A rule describing a set of strings
    • Recall file and OS wildcards ([A-Z]*.*)
  - LEXEME
    • A sequence of characters that matches some pattern
    • Identifiers: x, count, name, etc.


Lexemes

• Lexemes are the lowest-level syntactic units.

Example:
val = (int)(xdot + y*0.3) ;

In the above statement, the lexemes are
val, =, (, int, ), (, xdot, +, y, *, 0.3, ), ;


Tokens

Tokens are the categories of lexemes.

• Identifiers: Names chosen by the programmer. Eg. val, xdot, y.

• Keywords: Names chosen by the language designer to help syntax and structure. Eg. int, return, void. (Keywords that cannot be used as identifiers are known as reserved words.)


Tokens (Contd.)

• Operators: Identify actions. Eg. +, &&, !

• Literals: Denote values directly. Eg. 3.14, -10, 'a', true, null

• Punctuation Symbols: Support syntactic structure. Eg. (, ), ;, {, }


Examples

Token        Pattern                     Sample lexeme
while        while                       while
relation_op  = | != | < | >              <
integer      [0-9]+                      42
string       characters between " "      "hello"


A Program Fragment Viewed As a Stream of Tokens


Example - 1

int max(int a, int b)
{
    if (a > b)
        return a;
    else
        return b;
}

Lexeme   Token
int      keyword
max      identifier
(        operator
int      keyword
a        identifier
,        operator
int      keyword
b        identifier
)        operator
{        operator
if       keyword
..       ..
Example - 2

Input string: size = r * 32 + c

<token, lexeme> pairs:

<id, size>
<assign, =>
<id, r>
<arith_sym, *>
<integer, 32>
<arith_sym, +>
<id, c>


Semantic Values of Tokens

• Semantic values are used to distinguish different tokens of a token type.
  - <for, >
  - <ID, Var1>
  - <symbol, =>
  - <Num, 10>
• Token types affect syntax analysis; semantic values affect semantic analysis.


Lexical Analyzer in Action

Lexical analysis
- Transforms the multi-character input stream into a token stream
- Reduces the length of the program representation (removes spaces)

For the input "for var1 = 10 var1 <= ...", the scanner emits one token at a time, building up the stream:

for_key
for_key ID("var1")
for_key ID("var1") eq_op
for_key ID("var1") eq_op Num(10)
for_key ID("var1") eq_op Num(10) ID("var1")
for_key ID("var1") eq_op Num(10) ID("var1") le_op


Implementing a lexer

Practical issues:
• Translating REs into executable form
• Input buffering
• Must be able to capture a large number of tokens with a single machine
• Interface to the parser
• Tools


Capturing multiple tokens

Capturing the keyword "begin": a chain of states reading b, e, g, i, n, then WS.

Capturing variable names: A or _ on the first character, then a loop on AN, ending on WS.

Legend: WS = white space, AN = alphanumeric, A = alphabet (letter).

What if both need to happen at the same time?


Capturing multiple tokens

The machine becomes more complicated, just for two tokens.


C code for RE - letter (letter|digit)*

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

/* Report failure to match and quit. */
static void error(void) {
    fprintf(stderr, "lexical error\n");
    exit(1);
}

int main(void) {
    int in = getchar();
    if (isalpha(in)) {        /* first character must be a letter */
        in = getchar();
    } else {
        error();
    }
    /* zero or more letters or digits */
    while (isalpha(in) || isdigit(in)) {
        in = getchar();
    }
    return 0;
}
Input buffer (lb, fp)

The current lexeme, e.g. "f o r", lies in the input buffer between the lexeme-beginning pointer lb and the forward pointer fp.

The lexical analyzer consists of a finite state machine plus a simulator:
- The finite state machine encodes the patterns.
- The simulator keeps track of the characters seen by fp and runs the pattern-matching algorithm.

Its output is the stream of tokens.


Finite Automaton - letter (letter|digit)*

Diagram: state 1 --letter--> state 2; state 2 loops on letter/digit.

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

static void error(void) {
    fprintf(stderr, "lexical error\n");
    exit(1);
}

int main(void) {
    int state = 1;            /* start state */
    int in = getchar();
    while (isalpha(in) || isdigit(in)) {
        switch (state) {
        case 1:
            if (isalpha(in)) { state = 2; } else { error(); }
            break;
        case 2:               /* accepting state: stay on letter or digit */
            break;
        }
        in = getchar();
    }
    return state;             /* 2 means the input matched */
}


Implementation Issues

• Input buffering
  - Read characters one by one
    • Unable to look ahead
    • Inefficient
  - Read in a whole string and store it in memory
    • Requires a big buffer
  - Buffer pairs


Input Buffering

• Scanner performance is crucial:
  • This is the only part of the compiler that examines the entire input program one character at a time.
  • Disk input can be slow.
  • The scanner accounts for ~25-30% of total compile time.
• We need lookahead to determine when a match has been found.
• Scanners use double-buffering to minimize the overheads associated with this.
Input buffering

lb, fp
f o r   v a r 1   =   1 0   v a r 1   < =

lb - lexeme beginning (marks the start of the current lexeme)
fp - forward pointer (keeps track of the portion of the input string scanned)


Input buffering

lb      fp
f o r   v a r 1   =   1 0   v a r 1   < =


Input buffering

lb      fp
f o r   v a r 1   =   1 0   v a r 1   < =

fp is scanned forward to search for the end of the lexeme (often a blank space).

The lexeme is identified only when fp scans the blank space after "for". The token and attribute of this lexeme are then returned:

for_key


Input buffering

lb      fp
f o r   v a r 1   =   1 0   v a r 1   < =

Scanning character by character is expensive, hence a block of data is read into a buffer first and then scanned.

Two designs: the one-buffer scheme and the two-buffer scheme.


One buffer scheme

f o r   v a r 1   =        Buffer
1 0   v a r 1   < =        Input string (rest)

• If a lexeme is long, it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.


Two buffer scheme

f o r   v a r 1   =        Buffer 1
1 0   v a r 1   < =        Buffer 2


Sentinels

E = M * eof C * * 2 eof eof

(a sentinel eof ends each buffer half; a further eof marks the true end of input)

switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters;
}


Specification of tokens

• In the theory of compilation, regular expressions are used to formalize the specification of tokens.

• Regular expressions are a means for specifying regular languages.

• Example:
  letter_ (letter_ | digit)*

• Each regular expression is a pattern specifying the form of strings.


Recognition of tokens

• The starting point is the language grammar, to understand the tokens:

stmt -> if expr then stmt
      | if expr then stmt else stmt
expr -> term relop term
      | term
term -> id
      | number


Recognition of tokens (cont.)

• The next step is to formalize the patterns:

digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id     -> letter (letter | digit)*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>

• We also need to handle white space:

ws -> (blank | tab | newline)+


Transition diagrams

• Transition diagram for relop



Transition diagrams (cont.)

• Transition diagram for reserved words and identifiers
  • Examine the symbol table for the lexeme found
    • Return whatever token name is there
  • Place the ID in the symbol table if it is not there
    • Return a pointer to the symbol table entry


Transition diagrams (cont.)

• Transition diagram for unsigned numbers



Transition diagrams (cont.)

• Transition diagram for whitespace



Ambiguous Token Rule Sets

• A single expression is a completely unambiguous specification of a token.

• Sometimes, when we put together a set of regular expressions to specify all of the tokens in a language, ambiguities arise:
  - i.e., two regular expressions match the same string

• For example, the lexeme if8:
  • How do I tokenize it: is it Keyword(if) followed by integer_const(8), or ID(if8)?
Ambiguous Token Rule Sets

We resolve ambiguities using two rules:

- Longest match: The regular expression that matches the longest string takes precedence.

- Rule priority: The regular expressions identifying tokens are written down in sequence. If two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.


Longest match

• Assume the input is >=.

• A finite automaton may reach a state that is a final state while the end of the lexeme has not yet been encountered.

• Therefore, matching should always start with the first transition diagram.

• If failure occurs in one transition diagram, retract the forward pointer to the start state and activate the next diagram.

• If failure occurs in all diagrams, a lexical error has occurred.


Example

Legend for the transition diagrams: an initial state, final or accept states, and a * mark meaning "retract the forward pointer".


Implementation of Transition Diagram




Architecture of a DFA based Lexical Analyzer


Handling Lexical Errors

• Error handling is very localized with respect to the input source.

• For example: whil ( x = 0 ) do
  generates no lexical errors in PASCAL, since whil is a valid identifier.

• In what situations do errors occur?
  - The lexical analyzer is unable to proceed because none of the patterns for tokens matches a prefix of the remaining input.

• Panic mode recovery
  - Delete successive characters from the remaining input until the analyzer can find a well-formed token.
  - May confuse the parser

• Possible error recovery actions:
  - Deleting or inserting input characters
  - Replacing or transposing characters


Panic mode

• It defines a small set of "safe symbols" that delimit clean points in the input.

• When an error occurs, a panic-mode recovery algorithm deletes input tokens until it finds a safe symbol, then backs the parser out to a context in which that symbol might appear.

Example (Modula-2):

IF a b THEN x;
ELSE y;
END;

The error is detected at b. When this algorithm runs, it is likely to skip forward to the semicolon, thereby missing the THEN.

Used in unstructured languages like BASIC, FORTRAN.




Problems with panic mode recovery

• The meaning of the program may be changed.

• The whole input may get deleted in the process.

• Error recovery must be accurate, precise, and fast, and must not lead to an error cascade.

• Examples:
  • "charr" can be corrected to "char" by deleting an "r".
  • "cha" can be corrected by inserting an "r".
  • "whiel" can be corrected to "while" by transposition.
  • "chrr" can be corrected by replacing an "r" with an "a".
Phrase-Level Recovery

• We can improve the quality of recovery by employing different sets of "safe" symbols in different contexts.

• Parsers that incorporate this improvement are said to implement phrase-level recovery.


Lexical Analyzer Generator - Lex

Lex source program (lex.l) -> Lex compiler -> lex.yy.c
lex.yy.c -> C compiler -> a.out
Input stream -> a.out -> sequence of tokens


Structure of Lex programs

declarations
%%
translation rules    pattern { action }
%%
auxiliary functions
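A minimal Lex specification following this structure might look like the sketch below; the token actions and names are illustrative assumptions, not from a specific course example:

```lex
%{
/* declarations: C code copied verbatim into lex.yy.c */
#include <stdio.h>
%}
digit   [0-9]
letter  [A-Za-z_]
%%
"if"                          { printf("KEYWORD(if)\n"); }
{letter}({letter}|{digit})*   { printf("ID(%s)\n", yytext); }
{digit}+                      { printf("NUM(%s)\n", yytext); }
[ \t\n]+                      { /* discard white space */ }
.                             { printf("ERROR(%s)\n", yytext); }
%%
int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }
```

Rule order matters: listing "if" before the identifier rule implements rule priority, while Lex itself always applies longest match, so if8 still becomes a single ID.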


Take home message

• Lexical analysis is the first phase; it detects the basic units in an input program.

• It removes unwanted spaces, comments, etc.

• It can be implemented using a finite state automaton.

• Tools such as lex/flex can recognize patterns written as regular expressions.


Take home message

• A lexer can be implemented using existing tools, by hand, or directly as a finite automaton.

• Handwritten lexers are error-prone and naive.

• The finite automaton approach has to combine the individual token machines and then implement the language as a whole.

• A lexer uses longest match for recognition.

• Errors can be detected and corrected at the lexer level, but correction is risky at this stage, so it is usually left for the next phase to handle.
