Lexical Analysis
By
SHREYAS V KASHYAP[1BG13CS097]
SPANDANA RAO[1BG13CS103]
SUHAS RAINA[1BG13CS108]
Under the guidance
of
Smt. Usha C R
Assistant Professor
CSE Department
BNM Institute of Technology
Vidyaya Amrutham Ashnuthe
Department of Computer Science & Engineering
B. N. M. Institute of Technology
12th Main, 27th Cross, Banashankari II Stage, Bangalore 560 070.
B. N. M. Institute of Technology
12th Main, 27th cross, Banashankari II Stage, Bangalore - 560070
Department of Computer Science & Engineering
Vidyaya Amrutham Ashnuthe
Certificate
Certified that the mini project entitled Lexical Analysis has been carried out by
Shreyas V Kashyap [1BG13CS097], Spandana Rao [1BG13CS103], and Suhas Raina
[1BG13CS108], bonafide students of B. N. M. Institute of Technology, towards the
degree of Bachelor of Engineering in Computer Science & Engineering of the
Visvesvaraya Technological University, Belgaum, during the year 2014-15. The
mini project report has been approved.
The lexical analyzer reads a string of characters and checks whether it forms a valid token in the grammar.
Token:
Terminal symbol in a grammar
Classes of sequences of characters with a collective meaning
Constants, operators, punctuation, keywords.
Lexeme:
Character sequence matched by an instance of the token.
Project Description
The lexical analyzer converts a stream of input characters into a stream of tokens. The different
tokens that our lexical analyzer identifies are as follows:
KEYWORDS: int, char, float, double, if, for, while, else, switch, struct, printf, scanf, case, break,
return, typedef, void
NUMBERS: positive and negative integers, positive and negative floating point numbers.
OPERATORS: +, ++, -, --, ||, *, ?, /, >, >=, <, <=, =, ==, &, &&.
BRACKETS: [ ], { }, ( ).
For tokenizing into identifiers and keywords we incorporate a symbol table which initially
consists of the predefined keywords. The tokens are read from an input file. If the encountered
token is an identifier or a keyword, the lexical analyzer looks it up in the symbol table to check
whether the token already exists. If an entry exists, we proceed to the next token; if not, that
token along with its token value is written into the symbol table. The rest of the tokens are
displayed directly by writing them into an output file.
The output file will consist of all the tokens present in our input file along with their respective
token values.
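As a rough illustration of this lookup (a minimal C++ sketch, not the project code; the preloaded keywords and the token values shown are assumptions), the keyword/identifier decision could look like this:

#include <iostream>
#include <map>
#include <string>

int main() {
    // Symbol table preloaded with a few keywords and assumed token values.
    std::map<std::string, int> symtab = {
        {"int", 1}, {"if", 2}, {"for", 3}, {"while", 4}, {"return", 5}
    };
    int nextValue = 100;                        // values given to newly seen identifiers

    std::string words[] = {"int", "count", "while", "count"};
    for (const std::string& w : words) {
        if (symtab.count(w)) {
            // Already present: a keyword or a previously installed identifier.
            std::cout << w << " -> token value " << symtab[w] << "\n";
        } else {
            // New identifier: install it into the symbol table.
            symtab[w] = nextValue++;
            std::cout << w << " -> new identifier, value " << symtab[w] << "\n";
        }
    }
    return 0;
}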
SYSTEM DESIGN
Process:
The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters
and produce as output a sequence of tokens that the parser uses for syntax analysis. This
interaction is summarized schematically in fig. a.
Upon receiving a "get next token" command from the parser, the lexical analyzer reads the input
characters until it can identify the next token.
Sometimes, lexical analyzers are divided into a cascade of two phases, the first called
scanning, and the second lexical analysis.
The scanner is responsible for doing simple tasks, while the lexical analyzer proper does the
more complex operations.
The lexical analyzer which we have designed takes its input from an input file. It reads one
character at a time from the input file, and continues to read until the end of the file is reached. It
recognizes valid identifiers and keywords and specifies the token values of the keywords.
It also identifies header files, #define statements, numbers, special characters, and various
relational and logical operators, while ignoring white space and comments. It prints the output to
a separate file, specifying the line number of each token.
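For example, for a single hypothetical input line such as int count = 10; appearing on line 3 of the input file, the output file might contain entries of the following form (the layout is illustrative, not the project's actual format):

    3   keyword      int
    3   identifier   count
    3   operator     =
    3   number       10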
BLOCK DIAGRAM:
OBJECTIVE OF THE PROJECT
AIM OF THE PROJECT
The aim of the project is to develop a lexical analyzer that can generate tokens for further
processing by the compiler.
GOALS
To create tokens from the given input stream.
SCOPE OF PROJECT
The lexical analyzer converts the input program into a stream of the valid words of the language,
known as tokens.
The parser looks at the sequence of these tokens and identifies the language constructs occurring
in the input program. The parser and the lexical analyzer work hand in hand: whenever the parser
needs further tokens to proceed, it requests them from the lexical analyzer. The lexical analyzer
in turn scans the remaining input stream and returns the next token occurring there. Apart from
that, the lexical analyzer also participates in the creation and maintenance of the symbol table,
because it is the first module to identify the occurrence of a symbol. If a symbol is being defined
for the first time, it needs to be installed into the symbol table, and the lexical analyzer is usually
the module that does so.
PROJECT CONTENTS
PROJECT CATEGORY
COMPILER
To define what a compiler is one must first define what a translator is. A translator is a program
that takes another program written in one language, also known as the source language, and
outputs a program written in another language, known as the target language.
Now that the translator is defined, a compiler can be defined as a translator whose source
language is a high-level language such as Java or Pascal and whose target language is a low-level
language such as machine or assembly language.
Lexical Analysis is the act of taking an input source program and outputting a stream of tokens.
This is done with the Scanner. The Scanner can also place identifiers into something called the
symbol table or place strings into the string table. The Scanner can report trivial errors such as
invalid characters in the input file.
Syntax Analysis is the act of taking the token stream from the scanner and comparing it
against the rules and patterns of the specified language. Syntax Analysis is done with the Parser.
The Parser produces a tree, which can come in many formats, but is referred to as the parse tree.
It reports errors when the tokens do not follow the syntax of the specified language. Errors that
the Parser can report are syntactical errors such as missing parentheses, semicolons, and
keywords.
Semantic Analysis is the act of determining whether or not the parse tree is relevant and
meaningful. The output is intermediate code, also known as an intermediate representation (or
IR). Most of the time, this IR is closely related to assembly language but it is machine
independent. Intermediate code allows different code generators for different machines and
promotes abstraction and portability from specific machine types and languages (perhaps the
most famous example is Java's bytecode and the JVM). Semantic Analysis finds more meaningful
errors such as undeclared variables, type compatibility, and scope resolution.
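For instance, a statement such as a = b + c * 4 might be lowered into a machine-independent three-address form along these lines (an illustrative sketch of what an IR can look like, not the output of any particular compiler):

    t1 = c * 4
    t2 = b + t1
    a  = t2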
Code Optimization makes the IR more efficient. Code optimization is usually done in a sequence
of steps. Some optimizations include code hoisting (moving constant values to better places
within the code), redundant code discovery, and removal of useless code.
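A small illustration of hoisting, written as two hypothetical C++ functions (before and after the transformation):

void fill(int *a, int n, int x, int y) {
    // Before optimization: x * y is recomputed on every iteration.
    for (int i = 0; i < n; i++)
        a[i] = x * y + i;
}

void fill_hoisted(int *a, int n, int x, int y) {
    int t = x * y;                  // hoisted: computed once, outside the loop
    for (int i = 0; i < n; i++)
        a[i] = t + i;
}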
Code Generation is the final step in the compilation process. The input to the Code Generator is
the IR and the output is machine language code.
PLATFORM (TECHNOLOGY/TOOLS)
Although C was designed for writing architecturally independent system software, it is also
widely used for developing application software.
Worldwide, C is the first or second most popular language in terms of number of developer
positions or publicly available code. It is widely used on many different software platforms, and
there are few computer architectures for which a C compiler does not exist. C has greatly
influenced many other popular programming languages, most notably C++, which originally
began as an extension to C, and Java and C# which borrow C lexical conventions and operators.
Characteristics
Like most imperative languages in the ALGOL tradition, C has facilities for structured
programming and allows lexical variable scope and recursion, while a static type system prevents
many unintended operations. In C, all executable code is contained within functions. Function
parameters are always passed by value. Pass-by-reference is achieved in C by explicitly passing
pointer values. Heterogeneous aggregate data types (struct) allow related data elements to be
combined and manipulated as a unit. C program source text is free-format, using the semicolon
as a statement terminator (not a delimiter).
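A short sketch of these conventions (illustrative example code, not tied to this project):

#include <stdio.h>

struct point { int x, y; };                 // heterogeneous aggregate type

void bump_copy(int v) { v = v + 1; }        // pass by value: caller's variable is unchanged
void bump_ptr(int *v) { *v = *v + 1; }      // "pass by reference" via an explicit pointer

int main(void) {
    int n = 1;
    bump_copy(n);                           // n is still 1
    bump_ptr(&n);                           // n is now 2
    struct point p = {3, 4};
    printf("%d %d %d\n", n, p.x, p.y);
    return 0;
}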
Features
The relatively low-level nature of the language affords the programmer close control over what
the computer does, while allowing special tailoring and aggressive optimization for a particular
platform. This allows the code to run efficiently on very limited hardware, such as embedded
systems.
C does not have some features that are available in some other programming languages.
Operators
C supports a rich set of operators, which are symbols used within an expression to specify the
manipulations to be performed while evaluating that expression. C has operators for:
increment and decrement (++, --)
arithmetic (+, -, *, /, %)
equality testing (==, !=)
order relations (<, <=, >, >=)
boolean logic (!, &&, ||)
bitwise logic (~, &, |, ^)
reference and dereference (&, *, [ ])
conditional evaluation (? :)
member selection (., ->)
type conversion (( ))
object size (sizeof)
function argument collection (( ))
sequencing (,)
subexpression grouping (( ))
C has a formal grammar, specified by the C standard.
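A few of these operators in use (an illustrative fragment only):

#include <stdio.h>

int main(void) {
    int a = 5, b = 3;
    int max    = (a > b) ? a : b;           // conditional evaluation
    int both   = (a > 0) && (b > 0);        // boolean logic
    int masked = a & b;                     // bitwise logic
    size_t sz  = sizeof a;                  // object size
    printf("%d %d %d %zu\n", max, both, masked, sz);
    return 0;
}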
Data structures
C has a static, weakly typed type system that shares some similarities with that of other ALGOL
descendants such as Pascal. There are built-in types for integers of various sizes, both signed and
unsigned, floating-point numbers, characters, and enumerated types (enum). C99 added a
boolean datatype. There are also derived types including arrays, pointers, records (struct), and
untagged unions (union).
C is often used in low-level systems programming where escapes from the type system may be
necessary. The compiler attempts to ensure type correctness of most expressions, but the
programmer can override the checks in various ways, either by using a type cast to explicitly
convert a value from one type to another, or by using pointers or unions to reinterpret the
underlying bits of a value in some other way.
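For example (illustrative), a cast converts a value from one type to another, while a pointer to unsigned char can be used to look at the underlying bytes of a value:

#include <stdio.h>

int main(void) {
    double d = 3.75;
    int truncated = (int)d;                       // explicit type cast: value conversion
    unsigned char *bytes = (unsigned char *)&d;   // reinterpret the underlying bits as bytes
    printf("%d %02x\n", truncated, bytes[0]);
    return 0;
}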
Arrays
Array types in C are traditionally of a fixed, static size specified at compile time. (The more
recent C99 standard also allows a form of variable-length arrays.) However, it is also possible to
allocate a block of memory (of arbitrary size) at run-time, using the standard library's malloc
function, and treat it as an array. C's unification of arrays and pointers (see below) means that
true arrays and these dynamically-allocated, simulated arrays are virtually interchangeable. Since
arrays are always accessed (in effect) via pointers, array accesses are typically not checked
against the underlying array size, although the compiler may provide bounds checking as an
option. Array bounds violations are therefore possible and rather common in carelessly written
code, and can lead to various repercussions, including illegal memory accesses, corruption of
data, buffer overruns, and run-time exceptions.
C does not have a special provision for declaring multidimensional arrays, but rather relies on
recursion within the type system to declare arrays of arrays, which effectively accomplishes the
same thing. The index values of the resulting "multidimensional array" can be thought of as
increasing in row-major order.
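The following sketch (illustrative) shows a fixed-size array, a dynamically allocated block treated as an array, and an array of arrays indexed in row-major order:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int fixed[5] = {1, 2, 3, 4, 5};                // size fixed at compile time
    int *dyn = (int *)malloc(5 * sizeof(int));     // run-time block used as an array
    for (int i = 0; i < 5; i++)
        dyn[i] = fixed[i] * 10;
    int grid[2][3] = {{1, 2, 3}, {4, 5, 6}};       // "array of arrays", row-major layout
    printf("%d %d %d\n", fixed[4], dyn[4], grid[1][2]);
    free(dyn);
    return 0;
}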
Although C supports static arrays, it is not required that array indices be validated (bounds
checking). For example, one can try to write to the sixth element of an array with five elements,
generally yielding undesirable results. This type of bug, called a buffer overflow or buffer
overrun, is notorious for causing a number of security problems. On the other hand, since bounds
checking elimination technology was largely nonexistent when C was defined, bounds checking
came with a severe performance penalty, particularly in numerical computation. A few years
earlier, some Fortran compilers had a switch to toggle bounds checking on or off; however, this
would have been much less useful for C, where array arguments are passed as simple pointers.
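The classic mistake described above looks like this (illustrative; the loop's last iteration is undefined behaviour and must not appear in real code):

void zero_out(void) {
    int buf[5];
    for (int i = 0; i <= 5; i++)    // off-by-one: i runs from 0 to 5
        buf[i] = 0;                 // buf[5] writes one element past the end of the array
}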
Deficiencies
Although the C language is extremely concise, C is subtle, and expert competency in C is not
common, taking more than ten years to achieve. [11] C programs are also notorious for security
vulnerabilities due to the unconstrained direct access to memory of many of the standard C
library function calls.
In spite of its popularity and elegance, real-world C programs commonly suffer from instability
and memory leaks, to the extent that any appreciable C programming project will have to adopt
specialized practices and tools to mitigate spiraling damage. Indeed, an entire industry has been
born merely out of the need to stabilize large source-code bases.
Although C was developed for Unix, Microsoft adopted C as the core language of its operating
systems. Although all standard C library calls are supported by Windows, there is only ad-hoc
support for Unix functionality side-by-side with an inordinate number of inconsistent Windows-
specific API calls. There is currently no document in existence that can explain programming
practices that work well across both Windows and Unix.
It is perhaps inevitable that C chose not to limit the size or endianness of its types; for example,
each compiler is free to choose the size of an int type as anything of at least 16 bits, according to
what is efficient on the current platform. Many programmers work based on size and endianness
assumptions, leading to code that is not portable.
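For example (illustrative), code that silently assumes int is exactly 32 bits is not portable; the fixed-width types of C99's <stdint.h> make such assumptions explicit:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    printf("int is %zu bits on this platform\n", sizeof(int) * 8);
    int32_t counter = 0;            // exactly 32 bits wherever int32_t exists
    uint16_t flags  = 0;            // exactly 16 bits
    return (int)(counter + flags);
}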
Also inevitable is that the C standard defines only a very limited gamut of functionality,
excluding anything related to network communications, user interaction, or process/thread
creation. The related POSIX standard includes such a wide array of functionality
that no operating system appears to support it exactly, and only UNIX systems have even
attempted to support substantial parts of it.
Therefore the kinds of programs that can be portably written are extremely restricted, unless
specialized programming practices are adopted.
SOFTWARE AND HARDWARE TOOLS
Windows XP
Turbo C++
Turbo C++ is a C++ compiler and integrated development environment (IDE) from Borland.
The original Turbo C++ product line was put on hold after 1994, and was revived in 2006 as an
introductory-level IDE, essentially a stripped-down version of their flagship C++ Builder. Turbo
C++ 2006 was released on September 5, 2006 and is available in 'Explorer' and 'Professional'
editions. The Explorer edition is free to download and distribute while the Professional edition is
a commercial product. The professional edition is no longer available for purchase from Borland.
Turbo C++ 3.0 was released in 1991 (shipping on November 20), and came in amidst
expectations of the coming release of Turbo C++ for Microsoft Windows. Initially released as an
MS-DOS compiler, 3.0 supported C++ templates, Borland's inline assembler, and generation of
MS-DOS mode executables for both 8086 real mode and 286 protected mode (as well as the Intel
80186). Version 3.0 implemented AT&T C++ 2.1, the most recent version at the time. The
separate Turbo Assembler product was no longer included, but the inline assembler could stand
in as a reduced-functionality version.
Starting with version 3.0, Borland segmented their C++ compiler into two distinct product-lines:
"Turbo C++" and "Borland C++". Turbo C++ was marketed toward the hobbyist and entry-level
compiler market, while Borland C++ targeted the professional application development market.
Borland C++ included additional tools, compiler code-optimization, and documentation to
address the needs of commercial developers. Turbo C++ 3.0 could be upgraded with separate
add-ons, such as Turbo Assembler and Turbovision 1.0.
HARDWARE REQUIREMENT
RAM : 256 MB
Hard Disk : 40 GB
FDD : 4 GB
SOFTWARE REQUIREMENT
Languages : C++
FEASIBILITY STUDY
Feasibility study: The feasibility study is a general examination of the potential of an idea to be
converted into a business. This study focuses largely on the ability of the entrepreneur to convert
the idea into a business enterprise. The feasibility study differs from the viability study as the
viability study is an in-depth investigation of the profitability of the idea to be converted into a
business enterprise.
Resource Feasibility
This involves questions such as how much time is available to build the new system,
when it can be built, whether it interferes with normal business operations, type and
amount of resources required, dependencies, etc. Contingency and mitigation plans
should also be stated here so that if the project does over run the company is ready for
this eventuality.
Schedule Feasibility
A project will fail if it takes too long to be completed before it is useful. Typically this
means estimating how long the system will take to develop, and if it can be completed in
a given time period using some methods like payback period.
Economic Feasibility
Economic analysis is the most frequently used method for evaluating the effectiveness of
a candidate system. More commonly known as cost/benefit analysis, the procedure is to
determine the benefits and savings that are expected from a candidate system and
compare them with costs. If benefits outweigh costs, then the decision is made to design
and implement the system.
Operational feasibility
Do the current work practices and procedures support a new system? Social factors must
also be considered, i.e. how the organizational changes will affect the working lives of
those affected by the system.
Technical feasibility
Centers around the existing computer system and the extent to which it can support the
proposed addition.
SYSTEM DESIGN
A lexical analyzer generator creates a lexical analyzer using a set of specifications usually in the
format
p1 {action 1}
p2 {action 2}
............
pn {action n}
where each pi is a regular expression and each actioni is a program fragment that is to be
executed whenever a lexeme matched by pi is found in the input. If more than one pattern
matches, then the longest lexeme matched is chosen. If there are two or more patterns that match
the longest lexeme, the first listed matching pattern is chosen.
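A concrete instance of this specification format (an illustrative sketch in the flex style used later in this report; the patterns and actions are assumptions, not the project's actual rules):

[0-9]+                  { printf("NUMBER\n"); }
"if"|"while"            { printf("KEYWORD\n"); }
[a-zA-Z][a-zA-Z0-9]*    { printf("IDENTIFIER\n"); }

Here the input if matches both the keyword pattern and the identifier pattern with the same (longest) lexeme, so the keyword action runs because that pattern is listed first, while the input ifx matches only the identifier pattern, since the longer lexeme wins.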
This is usually implemented using a finite automaton. There is an input buffer with
two pointers to it, a lexeme-beginning and a forward pointer. The lexical analyzer generator
constructs a transition table for a finite automaton from the regular expression patterns in the
lexical analyzer generator specification. The lexical analyzer itself consists of a finite automaton
simulator that uses this transition table to look for the regular expression patterns in the input
buffer.
This can be implemented using an NFA or a DFA. The transition table for an NFA is
considerably smaller than that for a DFA, but the DFA recognises patterns faster than the NFA.
Using NFA
The transition table for the NFA N is constructed for the composite pattern p1|p2|...|pn.
The NFA recognizes the longest prefix of the input that is matched by a pattern. In the final NFA,
there is an accepting state for each pattern pi. The set of states the final NFA can be in after
seeing each input character is constructed. The NFA is simulated until it reaches termination
or it reaches a set of states from which there is no transition defined for the current input symbol.
The specification for the lexical analyzer generator is designed so that a valid source program
cannot entirely fill the input buffer without having the NFA reach termination. To find a correct
match two things are done. Firstly, whenever an accepting state is added to the current set of
states, the current input position and the pattern pi corresponding to this accepting state are
recorded. If the current set of states already contains an accepting state, then only the pattern that
appears first in the specification is recorded. Secondly, the transitions are recorded until
termination is reached.
Upon termination, the forward pointer is retracted to the position at which the last match
occurred. The pattern making this match identifies the token found, and the lexeme matched is
the string between the lexeme beginning and forward pointers. If no pattern matches, the lexical
analyser should transfer control to some default recovery routine.
Using DFA
Here a DFA is used for pattern matching. This method is a modified version of the
method using NFA. The NFA is converted to a DFA using a subset construction algorithm. Here
there may be several accepting states in a given subset of nondeterministic states. The accepting
state corresponding to the pattern listed first in the lexical analyzer generator specification has
priority. Here also state transitions are made until a state is reached which has no next state for
the current input symbol. The last input position at which the DFA entered an accepting state
gives the lexeme.
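A minimal sketch of this DFA-driven matching loop (the transition-table representation and function below are assumptions for illustration, not the generator's actual code):

#include <string>
#include <vector>

// next[state][symbol] gives the next state, or -1 if no transition exists.
// accepting[state] holds the index of the first-listed pattern accepted in that
// state, or -1 if the state is not accepting.
int longestMatch(const std::vector<std::vector<int> >& next,
                 const std::vector<int>& accepting,
                 const std::string& input, int start, int& lexemeEnd) {
    int state = 0, lastPattern = -1;
    lexemeEnd = start;
    for (int i = start; i < (int)input.size(); i++) {
        int s = next[state][(unsigned char)input[i]];
        if (s < 0) break;                    // no next state: stop scanning
        state = s;
        if (accepting[state] >= 0) {         // remember the last accepting state seen
            lastPattern = accepting[state];
            lexemeEnd = i + 1;               // the forward pointer retracts to here
        }
    }
    return lastPattern;                      // -1 means no pattern matched
}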
DATA-FLOW DIAGRAM
A data flow diagram (DFD) is a graphical representation of the "flow" of data through an
information system. It differs from the flowchart as it shows the data flow instead of the control
flow of the program.
A data flow diagram can also be used for the visualization of data processing (structured design).
Context level (Level 0)
This level shows the overall context of the system and its operating environment and shows the
whole system as just one process. It does not usually show data stores, unless they are "owned"
by external systems, e.g. are accessed by but not maintained by this system; however, these are
often shown as external entities.
Level 1
This level shows all processes at the first level of numbering, data stores, external entities and the
data flows between them. The purpose of this level is to show the major high level processes of
the system and their interrelation. A process model will have one, and only one, level 1 diagram.
A level 1 diagram must be balanced with its parent context level diagram, i.e. there must be the
same external entities and the same data flows; these can be broken down into more detail in the
level 1 diagram, e.g. the "enquiry" data flow could be split into "enquiry request" and "enquiry
results" and still be valid.
Level 2
A Level 2 Data flow diagram showing the "Process Enquiry" process for the same system.
This level is a decomposition of a process shown in a level 1 diagram, as such there should be
level 2 diagrams for each and every process shown in a level 1 diagram. In this example
processes 1.1, 1.2 & 1.3 are all children of process 1, together they wholly and completely
describe process 1, and combined must perform the full capacity of this parent process. As
before, a level 2 diagram must be balanced with its parent level 1 diagram.
FLOW CHART
A flowchart is a common type of chart that represents an algorithm or process, showing the steps
as boxes of various kinds, and their order by connecting these with arrows. Flowcharts are used
in analyzing, designing, documenting or managing a process or program in various fields.
Flowcharts are used in designing and documenting complex processes. Like other types of
diagram, they help visualize what is going on and thereby help the viewer to understand a
process, and perhaps also find flaws, bottlenecks, and other less-obvious features within it. There
are many different types of flowcharts, and each type has its own repertoire of boxes and
notational conventions. The two most common types of boxes in a flowchart are a processing
step (usually drawn as a rectangle) and a decision (usually drawn as a diamond).
Symbols
A typical flowchart from older Computer Science textbooks may have the following
kinds of symbols:
Start and end symbols
Represented as lozenges, ovals or rounded rectangles, usually containing the word "Start"
or "End", or another phrase signaling the start or end of a process, such as "submit
enquiry" or "receive product".
Arrows
Showing what's called "flow of control" in computer science. An arrow coming from one
symbol and ending at another symbol represents that control passes to the symbol the
arrow points to.
Processing steps
Represented as rectangles. Examples: "Add 1 to X"; "replace identified part"; "save
changes" or similar.
Input/Output
Represented as a parallelogram. Examples: Get X from the user; display X.
Conditional or decision
Represented as a diamond (rhombus). These typically contain a Yes/No question or
True/False test. This symbol is unique in that it has two arrows coming out of it, usually
from the bottom point and right point, one corresponding to Yes or True, and one
corresponding to No or False. The arrows should always be labeled. More than two
arrows can be used, but this is normally a clear indicator that a complex decision is being
taken, in which case it may need to be broken-down further, or replaced with the "pre-
defined process" symbol.
A number of other symbols have less universal currency, such as:
A Document represented as a rectangle with a wavy base;
A Manual input represented by a parallelogram, with the top irregularly sloping up from
left to right. An example would be to signify data-entry from a form;
A Manual operation represented by a trapezoid with the longest parallel side at the top, to
represent an operation or adjustment to process that can only be made manually.
A Data File represented by a cylinder
Flowcharts may contain other symbols, such as connectors, usually represented as circles, to
represent converging paths in the flow chart. Circles will have more than one arrow coming into
them but only one going out. Some flow charts may just have an arrow point to another arrow
instead. These are useful to represent an iterative process (what in Computer Science is called a
loop). A loop may, for example, consist of a connector where control first enters, processing
steps, a conditional with one arrow exiting the loop, and one going back to the connector. Off-
page connectors are often used to signify a connection to a (part of another) process held on
another sheet or screen. It is important to keep these connections in logical order.
All processes should flow from top to bottom and left to right.
EVALUATION
The lexical analyzer converts a stream of input characters into a stream of tokens. The different
tokens that our lexical analyzer identifies are as follows:
KEYWORDS: int, char, float, double, if, for, while, else, switch, struct, printf, scanf, case, break,
return, typedef, void
NUMBERS: positive and negative integers, positive and negative floating point numbers.
OPERATORS: +, ++, -, --, ||, *, ?, /, >, >=, <, <=, =, ==, &, &&.
BRACKETS: [ ], { }, ( ).
For tokenizing into identifiers and keywords we incorporate a symbol table which initially
consists of the predefined keywords. The tokens are read from an input file. If the encountered
token is an identifier or a keyword, the lexical analyzer looks it up in the symbol table to check
whether the token already exists. If an entry exists, we proceed to the next token; if not, that
token along with its token value is written into the symbol table. The rest of the tokens are
displayed directly by writing them into an output file.
The output file will consist of all the tokens present in our input file along with their respective
token values.
CODE
#include<iostream>
#include<string>
#include<string.h>
using namespace std;

const int tablesize = 100;                 // number of buckets in the symbol table (assumed size)

class hash1{
private:
    struct item{
        string type;                       // token class, e.g. "keyword", "identifier"
        string value;                      // the lexeme itself
        item* next;                        // chaining for collisions
    };
    item* hashtable[tablesize];
public:
    hash1();                               // constructor
    int hashfunc(string key);              // maps a lexeme to a bucket index
    void additem(string type, string value);
    int numberofitemsinindex(int index);
    void printtable();
    void printitemsinindex(int index);
    string search(string type);
};

hash1::hash1(){
    for(int i = 0; i < tablesize; i++){
        hashtable[i] = new item;
        hashtable[i]->type = "empty";
        hashtable[i]->value = "empty";
        hashtable[i]->next = NULL;
    }
}

int hash1::hashfunc(string key){
    int hash = 0;
    for(unsigned i = 0; i < key.length(); i++)
        hash += (int)key[i];
    int index = hash % tablesize;
    return index;
}

void hash1::additem(string type, string value){
    int index = hashfunc(value);
    if(hashtable[index]->type == "empty"){         // first entry in this bucket
        hashtable[index]->type = type;
        hashtable[index]->value = value;
    }
    else if(hashtable[index]->value == value){
        //do nothing, the token is already in the table
    }
    else{                                          // collision: append to the chain
        item* n = new item;
        n->type = type;
        n->value = value;
        n->next = NULL;
        item* ptr = hashtable[index];
        while(ptr->next != NULL)
            ptr = ptr->next;
        ptr->next = n;
    }
}

int hash1::numberofitemsinindex(int index){
    int count = 0;
    if(hashtable[index]->value == "empty")
        return count;
    count++;
    for(item* ptr = hashtable[index]; ptr->next != NULL; ptr = ptr->next)
        count++;
    return count;
}

void hash1::printtable(){
    for(int i = 0; i < tablesize; i++){
        int number = numberofitemsinindex(i);
        if(number > 0)
            printitemsinindex(i);
    }
}

void hash1::printitemsinindex(int index){
    for(item* ptr = hashtable[index]; ptr != NULL; ptr = ptr->next){
        if(ptr->type == "empty")
            continue;
        cout << ptr->type << "\t" << ptr->value << endl;
    }
}

string hash1::search(string type){
    bool foundtype = false;
    string value = "";
    for(int i = 0; i < tablesize; i++)
        for(item* ptr = hashtable[i]; ptr != NULL; ptr = ptr->next)
            if(ptr->type == type){
                foundtype = true;          //found
                value = ptr->value;
            }
    if(foundtype == true)
        return value;
    else
        return "not found";
}
%{
#include "hash.cpp"
#include <cstdio>
using namespace std;
string ty, val;
hash1 hashobj;
%}

%option noyywrap

/* exclusive start conditions: HD scans the rest of a header-file line,   */
/* DEF scans a declaration list; IO, RET, FN and OPO are unused below.    */
%x HD
%x DEF
%x IO
%x RET
%x FN
%x OPO

ID  [a-zA-Z][a-zA-Z0-9]*
DEC "int"|"float"|"char"|"short"|"long"|"unsigned"
OP  "="|"+"|"-"|"*"|"/"|"%"|"=="|">="|"<="|"!="|"&&"|"||"|"<"|">"|"!"|"&"|"|"|"~"|"^"|"<<"|">>"|"+="|"-="|"/="|"*="|"%="|"<<="|">>="|"sizeof"
SP  [ \t\n]*

%%
"#include"          { ty = "header file"; BEGIN HD; }
<HD>[^\n]+          { val = yytext;
                      hashobj.additem(ty, val); BEGIN 0; }
{DEC}               { ty = "keyword"; val = yytext;
                      hashobj.additem(ty, val); BEGIN DEF; }
<DEF>{ID}           { ty = "identifier"; val = yytext;
                      hashobj.additem(ty, val); }
<DEF>[ \t\n,]+      { ; }
<DEF>";"            { BEGIN 0; }
<DEF>.              { ; }
{OP}                { ty = "operator"; val = yytext;
                      hashobj.additem(ty, val); }
{ID}                { ty = "identifier"; val = yytext;
                      hashobj.additem(ty, val); }
[0-9]+(\.[0-9]+)?   { ty = "number"; val = yytext;
                      hashobj.additem(ty, val); }
<RET>.              { ; }
<IO>.               { ; }
.|\n                { ; }
%%

int main()
{
    yyin = fopen("tt.txt", "r");
    yylex();
    fclose(yyin);
    hashobj.printtable();
    // hashobj.search(ty);   // optional: look up one token class in the table
    return 0;
}
ADVANTAGES AND DISADVANTAGES OF LEXICAL ANALYZER
ADVANTAGES
DISADVANTAGES
Development is complicated when done by hand.
CONCLUSION
Lexical analysis is a stage in the compilation of any program. In this phase we generate
tokens from the input stream of data, and to perform this task we need a lexical analyzer.
We have therefore designed a lexical analyzer that generates tokens from the given input.
REFERENCES
www.google.co.in
www.wikipedia.com