
BITS Pilani, Hyderabad Campus

Dr. Aruna Malapati
Asst. Professor, Department of CSIS

Lexer / Scanner
Today’s Agenda

• Lexical Analysis

• Basic Concepts & Regular Expressions

• What does a Lexical Analyzer do?

• How does it Work?

• Formalizing Token Definition & Recognition

BITS Pilani, Hyderabad Campus


Short story of a lexer

Syntax analyzer, to the scanner: "Go get me the next team."

The input queue holds the characters: i n t a ;

Scanner, to the first character: "So you are Mr. i, please wait in the next queue until I find a team for you." Then, "Next please," and the same happens to Mr. n and Mr. t.

With i, n and t waiting together, the scanner announces: "I found a team called keyword," and sends the token off to the syntax analyzer.

"Next please." A white space steps up. Scanner: "I do not need a person with the skills you possess, so you get lost." The white space is discarded.

"Next please." Scanning continues in the same way with a and ;.


The Function of a Scanner

Source code -> Scanner -> Tokens

How to specify the patterns: regular expressions.
Tools for pattern recognition: DFA & NFA.


Lexical Analysis

• Lexical analysis recognizes the vocabulary of the programming language and transforms a string of characters into a string of words or tokens.

• Lexical analysis discards white space and comments between the tokens.

• The lexical analyzer (or scanner or lexer) is the program that performs lexical analysis.


Lexical Analysis

INPUT: sequence of characters
OUTPUT: sequence of tokens

The parser repeatedly asks the scanner to "get next token"; the scanner in turn reads the source program character by character ("get next character") until it can return the next token. Both components consult the symbol table.

A lexical analyzer is generally a subroutine of the parser. Keeping it separate gives:
- Simpler design
- Efficiency
- Portability
Lexical Function

1. Read the input stream (sequence of characters) and group the characters into primitives (tokens). Return each token as <type, value>.

2. Throw out certain sequences of characters (blanks, comments, etc.).

3. Build the symbol table.

4. Generate error messages.

Tokens are described using regular expressions.


What is a Lexical Analyzer?

Source program text -> Tokens

• Examples of tokens
  Operators: = + - > ( { := == <>
  Keywords: if while for int double
  Numeric literals: 43 4.565 -3.6e10 0x13F3A
  Character literals: 'a' '~' '\''
  String literals: "4.565" "Fall 10" "\"\" = empty"
• Examples of non-tokens
  White space: space (' ') tab ('\t') end-of-line ('\n')
  Comments: /*this is not a token*/
Introducing Basic Terminology

• What are the major terms for lexical analysis?
  - TOKEN
    • A set of strings defining an atomic element with a defined meaning
    • Examples include <identifier>, <number>, etc.
  - PATTERN
    • A rule describing a set of strings
    • Recall file and OS wildcards ([A-Z]*.*)
  - LEXEME
    • A sequence of characters that matches some pattern
    • Identifiers: x, count, name, etc.


Lexemes

• Lexemes are the lowest-level syntactic units.

Example:
val = (int)(xdot + y*0.3) ;

In the above statement, the lexemes are
val, =, (, int, ), (, xdot, +, y, *, 0.3, ), ;


Tokens

Tokens are the categories of lexemes.

• Identifiers: Names chosen by the programmer. Eg. val, xdot, y.

• Keywords: Names chosen by the language designer to help syntax and structure. Eg. int, return, void. (Keywords that cannot be used as identifiers are known as reserved words.)


Tokens (Contd.)

• Operators: Identify actions. Eg. +, &&, !

• Literals: Denote values directly. Eg. 3.14, -10, 'a', true, null

• Punctuation Symbols: Support syntactic structure. Eg. (, ), ;, {, }


Examples

Token        Pattern                     Sample lexeme
while        while                       while
relation_op  = | != | < | >              <
integer      [0-9]+                      42
string       characters between " "      "hello"


A Program Fragment Viewed As a Stream of Tokens


Example - 1

int max(int a, int b)
{
    if (a > b)
        return a;
    else
        return b;
}

Lexeme   Token
int      keyword
max      identifier
(        operator
int      keyword
a        identifier
,        operator
int      keyword
b        identifier
)        operator
{        operator
if       keyword
..       ..
Example - 2

Input string: size = r * 32 + c

<token, lexeme> pairs:

<id, size>
<assign, =>
<id, r>
<arith_sym, *>
<integer, 32>
<arith_sym, +>
<id, c>


Semantic Values of Tokens

• Semantic values are used to distinguish different tokens of a token type.
  - <for, >
  - <ID, Var1>
  - <symbol, =>
  - <Num, 10>
• Token types affect syntax analysis; semantic values affect semantic analysis.


Lexical Analyzer in Action

Lexical analysis
- Transforms the multi-character input stream into a token stream
- Reduces the length of the program representation (removes spaces)

For the input "for var1 = 10 var1 <= ...", the scanner emits one token at a time, building up the stream:

for_key
for_key ID("var1")
for_key ID("var1") eq_op
for_key ID("var1") eq_op Num(10)
for_key ID("var1") eq_op Num(10) ID("var1")
for_key ID("var1") eq_op Num(10) ID("var1") le_op


Implementing a lexer

Practical issues:
• Translating REs into executable form
• Input buffering
• Must be able to capture a large number of tokens with a single machine
• Interface to the parser
• Tools


Capturing multiple tokens

Capturing the keyword "begin": a chain of states reading b, e, g, i, n, then WS.

Capturing variable names: A or _ on the first character, then a loop on AN, ending on WS.

Legend: WS = white space, AN = alphanumeric, A = alphabet (letter).

What if both need to happen at the same time?


Capturing multiple tokens

The machine becomes more complicated, just for two tokens.


C code for RE - letter (letter|digit)*

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

/* Report failure to match and quit. */
static void error(void) {
    fprintf(stderr, "lexical error\n");
    exit(1);
}

int main(void) {
    int in = getchar();
    if (isalpha(in)) {        /* first character must be a letter */
        in = getchar();
    } else {
        error();
    }
    /* zero or more letters or digits */
    while (isalpha(in) || isdigit(in)) {
        in = getchar();
    }
    return 0;
}
Input buffer (lb, fp)

The current lexeme, e.g. "f o r", lies in the input buffer between the lexeme-beginning pointer lb and the forward pointer fp.

The lexical analyzer consists of a finite state machine plus a simulator:
- The finite state machine encodes the patterns.
- The simulator keeps track of the characters seen by fp and runs the pattern-matching algorithm.

Its output is the stream of tokens.


Finite Automaton - letter (letter|digit)*

Diagram: state 1 --letter--> state 2; state 2 loops on letter/digit.

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

static void error(void) {
    fprintf(stderr, "lexical error\n");
    exit(1);
}

int main(void) {
    int state = 1;            /* start state */
    int in = getchar();
    while (isalpha(in) || isdigit(in)) {
        switch (state) {
        case 1:
            if (isalpha(in)) { state = 2; } else { error(); }
            break;
        case 2:               /* accepting state: stay on letter or digit */
            break;
        }
        in = getchar();
    }
    return state;             /* 2 means the input matched */
}


Implementation Issues

• Input buffering
  - Read characters one by one
    • Unable to look ahead
    • Inefficient
  - Read in a whole string and store it in memory
    • Requires a big buffer
  - Buffer pairs


Input Buffering

• Scanner performance is crucial:
  • This is the only part of the compiler that examines the entire input program one character at a time.
  • Disk input can be slow.
  • The scanner accounts for ~25-30% of total compile time.
• We need lookahead to determine when a match has been found.
• Scanners use double-buffering to minimize the overheads associated with this.
Input buffering

lb, fp
f o r   v a r 1   =   1 0   v a r 1   < =

lb - lexeme beginning (marks the start of the current lexeme)
fp - forward pointer (keeps track of the portion of the input string scanned)


Input buffering

lb      fp
f o r   v a r 1   =   1 0   v a r 1   < =


Input buffering

lb      fp
f o r   v a r 1   =   1 0   v a r 1   < =

fp is scanned forward to search for the end of the lexeme (often a blank space).

The lexeme is identified only when fp scans the blank space after "for". The token and attribute of this lexeme are then returned:

for_key


Input buffering

lb      fp
f o r   v a r 1   =   1 0   v a r 1   < =

Scanning character by character is expensive, hence a block of data is read into a buffer first and then scanned.

Two designs: the one-buffer scheme and the two-buffer scheme.


One buffer scheme

f o r   v a r 1   =        Buffer
1 0   v a r 1   < =        Input string (rest)

• If a lexeme is long, it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.


Two buffer scheme

f o r   v a r 1   =        Buffer 1
1 0   v a r 1   < =        Buffer 2


Sentinels

E = M * eof C * * 2 eof eof

(a sentinel eof ends each buffer half; a further eof marks the true end of input)

switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters;
}


Specification of tokens

• In the theory of compilation, regular expressions are used to formalize the specification of tokens.

• Regular expressions are a means for specifying regular languages.

• Example:
  letter_ (letter_ | digit)*

• Each regular expression is a pattern specifying the form of strings.


Recognition of tokens

• The starting point is the language grammar, to understand the tokens:

stmt -> if expr then stmt
      | if expr then stmt else stmt
expr -> term relop term
      | term
term -> id
      | number


Recognition of tokens (cont.)

• The next step is to formalize the patterns:

digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id     -> letter (letter | digit)*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>

• We also need to handle white space:

ws -> (blank | tab | newline)+


Transition diagrams

• Transition diagram for relop



Transition diagrams (cont.)

• Transition diagram for reserved words and identifiers
  • Examine the symbol table for the lexeme found
    • Return whatever token name is there
  • Place the ID in the symbol table if it is not there
    • Return a pointer to the symbol table entry


Transition diagrams (cont.)

• Transition diagram for unsigned numbers



Transition diagrams (cont.)

• Transition diagram for whitespace



Ambiguous Token Rule Sets

• A single expression is a completely unambiguous specification of a token.

• Sometimes, when we put together a set of regular expressions to specify all of the tokens in a language, ambiguities arise:
  - i.e., two regular expressions match the same string

• For example, the lexeme if8:
  • How do I tokenize it: is it Keyword(if) followed by integer_const(8), or ID(if8)?
Ambiguous Token Rule Sets

We resolve ambiguities using two rules:

- Longest match: The regular expression that matches the longest string takes precedence.

- Rule priority: The regular expressions identifying tokens are written down in sequence. If two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.


Longest match

• Assume the input is >=.

• A finite automaton may reach a state that is a final state while the end of the lexeme has not yet been encountered.

• Therefore, matching should always start with the first transition diagram.

• If failure occurs in one transition diagram, retract the forward pointer to the start state and activate the next diagram.

• If failure occurs in all diagrams, a lexical error has occurred.


Example

Legend for the transition diagrams: an initial state, final or accept states, and a * mark meaning "retract the forward pointer".


Implementation of Transition Diagram




Architecture of a DFA based Lexical Analyzer


Handling Lexical Errors

• Error handling is very localized with respect to the input source.

• For example: whil ( x = 0 ) do
  generates no lexical errors in PASCAL, since whil is a valid identifier.

• In what situations do errors occur?
  - The lexical analyzer is unable to proceed because none of the patterns for tokens matches a prefix of the remaining input.

• Panic mode recovery
  - Delete successive characters from the remaining input until the analyzer can find a well-formed token.
  - May confuse the parser

• Possible error recovery actions:
  - Deleting or inserting input characters
  - Replacing or transposing characters


Panic mode

• It defines a small set of "safe symbols" that delimit clean points in the input.

• When an error occurs, a panic-mode recovery algorithm deletes input tokens until it finds a safe symbol, then backs the parser out to a context in which that symbol might appear.

Example (Modula-2):

IF a b THEN x;
ELSE y;
END;

The error is detected at b. When this algorithm runs, it is likely to skip forward to the semicolon, thereby missing the THEN.

Used in unstructured languages like BASIC, FORTRAN.




Problems with panic mode recovery

• The meaning of the program may be changed.

• The whole input may get deleted in the process.

• Error recovery must be accurate, precise, and fast, and must not lead to an error cascade.

• Examples:
  • "charr" can be corrected to "char" by deleting an "r".
  • "cha" can be corrected by inserting an "r".
  • "whiel" can be corrected to "while" by transposition.
  • "chrr" can be corrected by replacing an "r" with an "a".
Phrase-Level Recovery

• We can improve the quality of recovery by employing different sets of "safe" symbols in different contexts.

• Parsers that incorporate this improvement are said to implement phrase-level recovery.


Lexical Analyzer Generator - Lex

Lex source program (lex.l) -> Lex compiler -> lex.yy.c
lex.yy.c -> C compiler -> a.out
Input stream -> a.out -> sequence of tokens


Structure of Lex programs

declarations
%%
translation rules    pattern { action }
%%
auxiliary functions
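A minimal Lex specification following this structure might look like the sketch below; the token actions and names are illustrative assumptions, not from a specific course example:

```lex
%{
/* declarations: C code copied verbatim into lex.yy.c */
#include <stdio.h>
%}
digit   [0-9]
letter  [A-Za-z_]
%%
"if"                          { printf("KEYWORD(if)\n"); }
{letter}({letter}|{digit})*   { printf("ID(%s)\n", yytext); }
{digit}+                      { printf("NUM(%s)\n", yytext); }
[ \t\n]+                      { /* discard white space */ }
.                             { printf("ERROR(%s)\n", yytext); }
%%
int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }
```

Rule order matters: listing "if" before the identifier rule implements rule priority, while Lex itself always applies longest match, so if8 still becomes a single ID.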


Take home message

• Lexical analysis is the first phase; it detects the basic units in an input program.

• It removes unwanted spaces, comments, etc.

• It can be implemented using a finite state automaton.

• Tools such as lex/flex can recognize patterns written as regular expressions.


Take home message

• A lexer can be implemented using existing tools, by hand, or directly as a finite automaton.

• Handwritten lexers are error-prone and naive.

• The finite automaton approach has to combine the individual token machines and then implement the language as a whole.

• A lexer uses longest match for recognition.

• Errors can be detected and corrected at the lexer level, but correction is risky at this stage, so it is usually left for the next phase to handle.
