
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

A MINI PROJECT REPORT


on

Lexical Analysis
By
SHREYAS V KASHYAP [1BG13CS097]
SPANDANA RAO [1BG13CS103]
SUHAS RAINA [1BG13CS108]
Under the guidance
of
Smt. Usha C R
Assistant Professor
CSE Department
BNM Institute of Technology

Vidyaya Amrutham Ashnuthe
Department of Computer Science & Engineering

B. N. M. Institute of Technology
12th Main, 27th Cross, Banashankari II Stage, Bangalore 560 070.
B. N. M. Institute of Technology
12th Main, 27th Cross, Banashankari II Stage, Bangalore - 560070
Department of Computer Science & Engineering

Vidyaya Amrutham Ashnuthe

Certificate
Certified that the mini project entitled Lexical Analysis has been carried out by
Shreyas V Kashyap [1BG13CS097], Spandana Rao [1BG13CS103], and Suhas Raina
[1BG13CS108], bonafide students of B. N. M. Institute of Technology, in partial
fulfilment of the requirements of the Bachelor of Engineering degree in Computer
Science & Engineering of the Visvesvaraya Technological University, Belgaum,
during the year 2014-15. The mini project report has been approved.

Smt. Usha C R                                Dr. Sahana Gowda

Assistant Professor Professor and HOD


CSE Dept CSE Dept
Preface
The lexical analyzer is responsible for scanning the source input file and translating lexemes into
small objects that the compiler can easily process. These small values are often called tokens.
The lexical analyzer is also responsible for converting sequences of digits into their numeric
form as well as processing other literal constants, for removing comments and whitespace from
the source file, and for taking care of many other mechanical details.

The lexical analyzer reads a string of characters and checks whether it forms a valid token in the grammar.

Lexical analysis terminology:

Token:
Terminal symbol in a grammar
Classes of sequences of characters with a collective meaning
Constants, operators, punctuations, keywords.
Lexeme:
Character sequence matched by an instance of the token.
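
For illustration only (this snippet is not part of the project code), the following shows how one C statement would be broken into (token, lexeme) pairs; the token names are representative.

#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Source line being analyzed: int count = count + 42;
    std::vector<std::pair<std::string, std::string>> tokens = {
        {"KEYWORD", "int"}, {"IDENTIFIER", "count"}, {"OPERATOR", "="},
        {"IDENTIFIER", "count"}, {"OPERATOR", "+"}, {"NUMBER", "42"},
        {"PUNCTUATION", ";"}
    };
    for (const auto& t : tokens)                 // token class : lexeme
        std::cout << t.first << " : " << t.second << std::endl;
    return 0;
}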
Project
Description
Project Description
The lexical analyzer converts a stream of input characters into a stream of tokens. The different
tokens that our lexical analyzer identifies are as follows:

KEYWORDS: int, char, float, double, if, for, while, else, switch, struct, printf, scanf, case, break,
return, typedef, void

IDENTIFIERS: main, fopen, getch, etc.

NUMBERS: positive and negative integers, positive and negative floating point numbers.

OPERATORS: +, ++, -, --, ||, *, ?, /, >, >=, <, <=, =, ==, &, &&.

BRACKETS: [ ], { }, ( ).

STRINGS: sets of characters enclosed within quotes

COMMENT LINES: single-line and multi-line comments (these are recognized and ignored)

For tokenizing into identifiers and keywords we incorporate a symbol table, which initially
consists of the predefined keywords. The tokens are read from an input file. If the encountered
token is an identifier or a keyword, the lexical analyzer looks it up in the symbol table to check
for the existence of that token. If an entry already exists, we proceed to the next token; if not,
that token along with its token value is written into the symbol table. The rest of the tokens are
directly displayed by writing them into an output file.

The output file will consist of all the tokens present in our input file along with their respective
token values.
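
A minimal sketch of the lookup-or-insert behaviour described above, using a std::map for brevity; the actual project uses the chained hash table listed in the CODE section.

#include <iostream>
#include <map>
#include <string>

int main() {
    // Symbol table pre-loaded with a few keywords (sketch only).
    std::map<std::string, std::string> symtab = {
        {"int", "keyword"}, {"while", "keyword"}, {"return", "keyword"}
    };
    std::string tokens[] = {"int", "count", "while", "count"};
    for (const std::string& tok : tokens) {
        if (symtab.find(tok) != symtab.end()) {
            std::cout << tok << " already in symbol table\n";  // proceed to next token
        } else {
            symtab[tok] = "identifier";                        // install the new entry
            std::cout << tok << " added as identifier\n";
        }
    }
    return 0;
}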
SYSTEM DESIGN
Process:

The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters
and produce as output a sequence of tokens that the parser uses for syntax analysis. This
interaction is summarized schematically in fig. a.

Upon receiving a get-next-token command from the parser, the lexical analyzer reads the input
characters until it can identify the next token.

Sometimes, lexical analyzers are divided into a cascade of two phases, the first called
scanning, and the second lexical analysis.

The scanner is responsible for doing simple tasks, while the lexical analyzer proper does the
more complex operations.

The lexical analyzer which we have designed takes its input from an input file. It reads one
character at a time from the input file, and continues to read until the end of the file is reached. It
recognizes the valid identifiers and keywords and specifies the token values of the keywords.

It also identifies the header files, #define statements, numbers, special characters, various
relational and logical operators, ignores the white spaces and comments. It prints the output in a
separate file specifying the line number.
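
A hedged sketch of this get-next-token interaction; the class and function names (Lexer, getNextToken) are illustrative and not taken from the project code, but the input file name matches the one used later in hash.l.

#include <fstream>
#include <iostream>
#include <string>

// Illustrative lexer that hands out whitespace-separated words as "tokens",
// one at a time, the way a parser would request them.
class Lexer {
    std::ifstream in;
public:
    explicit Lexer(const std::string& file) : in(file) {}
    bool getNextToken(std::string& tok) { return static_cast<bool>(in >> tok); }
};

int main() {
    Lexer lex("tt.txt");              // same input file name used by the project's main()
    std::string tok;
    while (lex.getNextToken(tok))     // the parser's "get next token" loop
        std::cout << "token: " << tok << '\n';
    return 0;
}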
BLOCK DIAGRAM:
OBJECTIVE OF
THE PROJECT
AIM OF THE PROJECT

Aim of the project is to develop a Lexical Analyzer that can generate tokens for the further
processing of compiler.

PURPOSE OF THE PROJECT


The lexical features of a language can be specified using a type-3 (regular) grammar. The job of
the lexical analyzer is to read the source program one character at a time and produce as output a
stream of tokens. The tokens produced by the lexical analyzer serve as input to the next phase,
the parser. Thus, the lexical analyzer's job is to translate the source program into a form more
conducive to recognition by the parser.

GOALS
To create tokens from the given input stream.

SCOPE OF PROJECT
The lexical analyzer converts the input program into a stream of the valid words of the language,
known as tokens.

The parser looks at the sequence of these tokens and identifies the language constructs occurring
in the input program. The parser and the lexical analyzer work hand in hand: whenever the parser
needs further tokens to proceed, it requests them from the lexical analyzer, which in turn scans
the remaining input stream and returns the next token occurring there. Apart from that, the
lexical analyzer also participates in the creation and maintenance of the symbol table, because it
is the first module to identify the occurrence of a symbol. If a symbol is being defined for the
first time, it needs to be installed into the symbol table, and the lexical analyzer is the natural
place to do this.
PROJECT
CONTENTS
PROJECT CATEGORY

This project falls under the category of Compiler Design.

COMPILER

To define what a compiler is, one must first define what a translator is. A translator is a program
that takes a program written in one language, known as the source language, and outputs a
program written in another language, known as the target language.

Now that a translator has been defined, a compiler can be defined as a translator whose source
language is a high-level language such as Java or Pascal and whose target language is a low-level
language such as assembly or machine code.

There are five parts of compilation (or phases of the compiler)

1.) Lexical Analysis


2.) Syntax Analysis
3.) Semantic Analysis
4.) Code Optimization
5.) Code Generation

Lexical Analysis is the act of taking an input source program and outputting a stream of tokens.
This is done with the Scanner. The Scanner can also place identifiers into something called the
symbol table or place strings into the string table. The Scanner can report trivial errors such as
invalid characters in the input file.

Syntax Analysis is the act of taking the token stream from the scanner and comparing them
against the rules and patterns of the specified language. Syntax Analysis is done with the Parser.
The Parser produces a tree, which can come in many formats, but is referred to as the parse tree.
It reports errors when the tokens do not follow the syntax of the specified language. Errors that
the Parser can report are syntactical errors such as missing parentheses, semicolons, and
keywords.
Semantic Analysis is the act of determining whether or not the parse tree is relevant and
meaningful. The output is intermediate code, also known as an intermediate representation (or
IR). Most of the time, this IR is closely related to assembly language but it is machine
independent. Intermediate code allows different code generators for different machines and
promotes abstraction and portability across specific machine types and languages (the most
famous example is probably Java's byte-code and the JVM). Semantic Analysis finds more
meaningful errors such as undeclared variables, type incompatibilities, and scope resolution problems.

Code Optimization makes the IR more efficient. Code optimization is usually done in a sequence
of steps. Some optimizations include code hoisting (moving constant values to better places
within the code), redundant code discovery, and removal of useless code.

Code Generation is the final step in the compilation process. The input to the Code Generator is
the IR and the output is machine language code.
PLATFORM (TECHNOLOGY/TOOLS)

In computing, C is a general-purpose computer programming language originally developed in
1972 by Dennis Ritchie at the Bell Telephone Laboratories to implement the Unix operating
system.

Although C was designed for writing architecturally independent system software, it is also
widely used for developing application software.

Worldwide, C is the first or second most popular language in terms of number of developer
positions or publicly available code. It is widely used on many different software platforms, and
there are few computer architectures for which a C compiler does not exist. C has greatly
influenced many other popular programming languages, most notably C++, which originally
began as an extension to C, and Java and C# which borrow C lexical conventions and operators.

Characteristics
Like most imperative languages in the ALGOL tradition, C has facilities for structured
programming and allows lexical variable scope and recursion, while a static type system prevents
many unintended operations. In C, all executable code is contained within functions. Function
parameters are always passed by value. Pass-by-reference is achieved in C by explicitly passing
pointer values. Heterogeneous aggregate data types (struct) allow related data elements to be
combined and manipulated as a unit. C program source text is free-format, using the semicolon
as a statement terminator (not a delimiter).
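
A small sketch (not from the report) of the parameter-passing point: arguments are copied, so modifying a caller's variable requires passing its address explicitly.

#include <iostream>

void incrementByValue(int n)    { n = n + 1; }    // changes only the local copy
void incrementByPointer(int* n) { *n = *n + 1; }  // changes the caller's variable

int main() {
    int x = 10;
    incrementByValue(x);
    std::cout << x << '\n';    // still 10: the argument was passed by value
    incrementByPointer(&x);
    std::cout << x << '\n';    // now 11: the address was passed explicitly
    return 0;
}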

C also exhibits the following more specific characteristics:

non-nestable function definitions


variables may be hidden in nested blocks
partially weak typing; for instance, characters can be used as integers
low-level access to computer memory by converting machine addresses to typed pointers
function and data pointers supporting ad hoc run-time polymorphism
array indexing as a secondary notion, defined in terms of pointer arithmetic
a preprocessor for macro definition, source code file inclusion, and conditional
compilation
complex functionality such as I/O, string manipulation, and mathematical functions
consistently delegated to library routines
A relatively small set of reserved keywords (originally 32, now 37 in C99)
a lexical structure that resembles B more than ALGOL, for example:
{ ... } rather than ALGOL's begin ... end
the equal sign is for assignment (copying), much like Fortran
two consecutive equal signs test for equality (compare to .EQ. in Fortran or the equal sign
in BASIC)
&& and || in place of ALGOL's and & or (these are semantically distinct from the
bit-wise operators & and | because they never evaluate the right operand if the result
can be determined from the left alone, i.e. short-circuit evaluation; see the short sketch after this list)
a large number of compound operators, such as +=, ++, ...
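
The short-circuit behaviour of && and || mentioned above can be seen in the following small sketch (illustrative only):

#include <iostream>

bool sideEffect(const char* msg) { std::cout << msg << '\n'; return true; }

int main() {
    int denom = 0;
    // && never evaluates its right operand here, so the division by zero is never reached.
    if (denom != 0 && (10 / denom) > 1)
        std::cout << "unreachable\n";
    bool left = true;
    // || skips its right operand once the left operand is already true.
    if (left || sideEffect("never printed"))
        std::cout << "left operand decided the result\n";
    return 0;
}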

Features

The relatively low-level nature of the language affords the programmer close control over what
the computer does, while allowing special tailoring and aggressive optimization for a particular
platform. This allows the code to run efficiently on very limited hardware, such as embedded
systems.

C does not have some features that are available in some other programming languages:

No assignment of arrays or strings (copying can be done via standard functions;
assignment of objects having struct or union type is supported)
No automatic garbage collection
No requirement for bounds checking of arrays
No operations on whole arrays
No syntax for ranges, such as the A..B notation used in several languages
No separate Boolean type: zero/nonzero is used instead
No formal closures or functions as parameters (only function and variable pointers)
No generators or coroutines; intra-thread control flow consists of nested function calls,
except for the use of the longjmp or setcontext library functions
No exception handling; standard library functions signify error conditions with the global
errno variable and/or special return values
Only rudimentary support for modular programming
No compile-time polymorphism in the form of function or operator overloading
Only rudimentary support for generic programming
Very limited support for object-oriented programming with regard to polymorphism and
inheritance
Limited support for encapsulation
No native support for multithreading and networking
No standard libraries for computer graphics and several other application programming
needs
A number of these features are available as extensions in some compilers, or can be supplied by
third-party libraries, or can be simulated by adopting certain coding disciplines.

Operators

bitwise shifts (<<, >>)


assignment (=, +=, -=, *=, /=, %=, &=, |=, ^=, <<=, >>=)

increment and decrement (++, --) Main article: Operators in C and C++

C supports a rich set of operators, which are symbols used within an expression to specify the
manipulations to be performed while evaluating that expression. C has operators for:

arithmetic (+, -, *, /, %)
equality testing (==, !=)
order relations (<, <=, >, >=)
boolean logic (!, &&, ||)
bitwise logic (~, &, |, ^)
reference and dereference (&, *, [ ])
conditional evaluation (? :)
member selection (., ->)
type conversion (( ))
object size (sizeof)
function argument collection (( ))
sequencing (,)
subexpression grouping (( ))
C has a formal grammar, specified by the C standard.
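
A brief example exercising a few of the operator categories listed above (illustrative only):

#include <iostream>

int main() {
    int a = 6, b = 4;
    std::cout << (a % b) << '\n';               // arithmetic: remainder, prints 2
    std::cout << (a > b ? "a" : "b") << '\n';   // conditional evaluation (? :), prints a
    std::cout << (a & b) << '\n';               // bitwise logic, prints 4
    std::cout << (a << 1) << '\n';              // bitwise shift, prints 12
    std::cout << sizeof(a) << '\n';             // object size in bytes (platform dependent)
    return 0;
}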

Data structures

C has a static, weak type system that shares some similarities with that of other ALGOL
descendants such as Pascal. There are built-in types for integers of various sizes, both signed and
unsigned, floating-point numbers, characters, and enumerated types (enum). C99 added a
boolean datatype. There are also derived types including arrays, pointers, records (struct), and
untagged unions (union).

C is often used in low-level systems programming where escapes from the type system may be
necessary. The compiler attempts to ensure type correctness of most expressions, but the
programmer can override the checks in various ways, either by using a type cast to explicitly
convert a value from one type to another, or by using pointers or unions to reinterpret the
underlying bits of a value in some other way.
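
A short sketch of the two escape hatches mentioned above: an explicit cast, and reinterpreting the bits of a value through a union (the union trick relies on implementation-defined behaviour):

#include <iostream>

union Reinterpret {
    float f;
    unsigned int bits;   // assumes float and unsigned int have the same size on this platform
};

int main() {
    double d = 3.9;
    int truncated = (int)d;            // explicit cast: the fractional part is dropped
    std::cout << truncated << '\n';    // prints 3

    Reinterpret r;
    r.f = 1.0f;
    std::cout << std::hex << r.bits << '\n';  // raw bit pattern of 1.0f (typically 3f800000)
    return 0;
}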

Arrays

Array types in C are traditionally of a fixed, static size specified at compile time. (The more
recent C99 standard also allows a form of variable-length arrays.) However, it is also possible to
allocate a block of memory (of arbitrary size) at run-time, using the standard library's malloc
function, and treat it as an array. C's unification of arrays and pointers (see below) means that
true arrays and these dynamically-allocated, simulated arrays are virtually interchangeable. Since
arrays are always accessed (in effect) via pointers, array accesses are typically not checked
against the underlying array size, although the compiler may provide bounds checking as an
option. Array bounds violations are therefore possible and rather common in carelessly written
code, and can lead to various repercussions, including illegal memory accesses, corruption of
data, buffer overruns, and run-time exceptions.
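
A minimal sketch of the run-time "simulated array" described above, using malloc and pointer indexing; as the paragraph notes, no bounds checking is performed:

#include <cstdio>
#include <cstdlib>

int main(void) {
    int n = 5;
    int* a = (int*)malloc(n * sizeof(int));   // block of memory treated as an array
    if (a == NULL) return 1;
    for (int i = 0; i < n; i++)
        a[i] = i * i;          // a[i] is defined in terms of pointer arithmetic: *(a + i)
    printf("%d\n", a[4]);      // prints 16; a[5] would be an unchecked out-of-bounds access
    free(a);
    return 0;
}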

C does not have a special provision for declaring multidimensional arrays, but rather relies on
recursion within the type system to declare arrays of arrays, which effectively accomplishes the
same thing. The index values of the resulting "multidimensional array" can be thought of as
increasing in row-major order.
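
The array-of-arrays idea and its row-major layout can be seen in a small sketch (illustrative only):

#include <cstdio>

int main(void) {
    int m[2][3] = { {1, 2, 3}, {4, 5, 6} };   // an array of two arrays of three ints
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 3; j++)
            // In row-major order the element m[i][j] sits at linear offset i*3 + j.
            printf("m[%d][%d] at offset %d = %d\n", i, j, i * 3 + j, m[i][j]);
    return 0;
}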

Although C supports static arrays, it is not required that array indices be validated (bounds
checking). For example, one can try to write to the sixth element of an array with five elements,
generally yielding undesirable results. This type of bug, called a buffer overflow or buffer
overrun, is notorious for causing a number of security problems. On the other hand, since bounds
checking elimination technology was largely nonexistent when C was defined, bounds checking
came with a severe performance penalty, particularly in numerical computation. A few years
earlier, some Fortran compilers had a switch to toggle bounds checking on or off; however, this
would have been much less useful for C, where array arguments are passed as simple pointers.

Deficiencies
Although the C language is extremely concise, C is subtle, and expert competency in C is not
common, taking more than ten years to achieve. C programs are also notorious for security
vulnerabilities due to the unconstrained direct access to memory of many of the standard C
library function calls.

In spite of its popularity and elegance, real-world C programs commonly suffer from instability
and memory leaks, to the extent that any appreciable C programming project will have to adopt
specialized practices and tools to mitigate spiraling damage. Indeed, an entire industry has been
born merely out of the need to stabilize large source-code bases.

Although C was developed for Unix, Microsoft adopted C as the core language of its operating
systems. Although all standard C library calls are supported by Windows, there is only ad-hoc
support for Unix functionality side-by-side with an inordinate number of inconsistent Windows-
specific API calls. There is currently no document in existence that can explain programming
practices that work well across both Windows and Unix.

It is perhaps inevitable that C did not choose to limit the size or endianness of its types; for
example, each compiler is free to choose the size of an int type as anything of 16 bits or more,
according to what is efficient on the current platform. Many programmers work on the basis of
size and endianness assumptions, leading to code that is not portable.
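
A small sketch of why such assumptions break: the size of int and the byte order are properties of the platform and can be inspected at run time (illustrative only):

#include <cstdio>

int main(void) {
    printf("sizeof(int) on this platform: %zu bytes\n", sizeof(int));
    unsigned int probe = 1;
    unsigned char first_byte = *(unsigned char*)&probe;  // inspect the lowest-addressed byte
    if (first_byte == 1)
        printf("little-endian byte order\n");
    else
        printf("big-endian byte order\n");
    return 0;
}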

Also inevitable is that the C standard defines only a very limited gamut of functionality,
excluding anything related to network communications, user interaction, or process/thread
creation. Its parent document, the POSIX standard, includes such a wide array of functionality
that no operating system appears to support it exactly, and only UNIX systems have even
attempted to support substantial parts of it.

Therefore the kinds of programs that can be portably written are extremely restricted, unless
specialized programming practices are adopted.
SOFTWARE AND HARDWARE TOOLS

Windows XP

Windows XP is a line of operating systems produced by Microsoft for use on personal


computers, including home and business desktops, notebook computers, and media centers. The
name "XP" is short for "experience". Windows XP is the successor to both Windows 2000
Professional and Windows Me, and is the first consumer-oriented operating system produced by
Microsoft to be built on the Windows NT kernel and architecture. Windows XP was first released
on 25 October 2001, and over 400 million copies were in use in January 2006, according to an
estimate in that month by an IDC analyst. It is succeeded by Windows Vista, which was released
to volume license customers on 8 November 2006 and worldwide to the general public on 30
January 2007. Direct OEM and retail sales of Windows XP ceased on 30 June 2008, although it
is still possible to obtain Windows XP from System Builders (smaller OEMs who sell assembled
computers) until 31 July 2009 or by purchasing Windows Vista Ultimate or Business and then
downgrading to Windows XP.

Windows XP introduced several new features to the Windows line, including:

Faster start-up and hibernation sequences


The ability to discard a newer device driver in favor of the previous one (known as driver
rollback), should a driver upgrade not produce desirable results
A new, arguably more user-friendly interface, including the framework for developing
themes for the desktop environment
Fast user switching, which allows a user to save the current state and open applications of
their desktop and allow another user to log on without losing that information
The ClearType font rendering mechanism, which is designed to improve text readability
on Liquid Crystal Display (LCD) and similar monitors
Remote Desktop functionality, which allows users to connect to a computer running
Windows XP Pro from across a network or the Internet and access their applications,
files, printers, and devices
Support for most DSL modems and wireless network connections, as well as networking
over FireWire, and Bluetooth.

Turbo C++
Turbo C++ is a C++ compiler and integrated development environment (IDE) from Borland.
The original Turbo C++ product line was put on hold after 1994, and was revived in 2006 as an
introductory-level IDE, essentially a stripped-down version of their flagship C++ Builder. Turbo
C++ 2006 was released on September 5, 2006 and is available in 'Explorer' and 'Professional'
editions. The Explorer edition is free to download and distribute while the Professional edition is
a commercial product. The professional edition is no longer available for purchase from Borland.

Turbo C++ 3.0 was released in 1991 (shipping on November 20), and came in amidst
expectations of the coming release of Turbo C++ for Microsoft Windows. Initially released as an
MS-DOS compiler, 3.0 supported C++ templates, Borland's inline assembler, and generation of
MS-DOS mode executables for both 8086 real mode and 286 protected mode (as well as the Intel
80186). Version 3.0 implemented AT&T C++ 2.1, the most recent specification at the time. The separate Turbo
Assembler product was no longer included, but the inline-assembler could stand in as a reduced
functionality version.

Starting with version 3.0, Borland segmented their C++ compiler into two distinct product-lines:
"Turbo C++" and "Borland C++". Turbo C++ was marketed toward the hobbyist and entry-level
compiler market, while Borland C++ targeted the professional application development market.
Borland C++ included additional tools, compiler code-optimization, and documentation to
address the needs of commercial developers. Turbo C++ 3.0 could be upgraded with separate
add-ons, such as Turbo Assembler and Turbo Vision 1.0.
HARDWARE REQUIREMENT

Processor : Pentium (IV)

RAM : 256 MB

Hard Disk : 40 GB

FDD : 4 GB

SOFTWARE REQUIREMENT

Platform Used : flex package in Terminal

Operating System : Unix & other versions

Languages : C++
FEASIBILITY STUDY

Feasibility study: The feasibility study is a general examination of the potential of an idea to be
converted into a business. This study focuses largely on the ability of the entrepreneur to convert
the idea into a business enterprise. The feasibility study differs from the viability study as the
viability study is an in-depth investigation of the profitability of the idea to be converted into a
business enterprise.

Types of Feasibility Studies

The following sections describe various types of feasibility studies.

Technology and System Feasibility


This involves questions such as whether the technology needed for the system exists, how
difficult it will be to build, and whether the firm has enough experience using that
technology. The assessment is based on an outline design of system requirements in terms
of Input, Processes, Output, Fields, Programs, and Procedures. This can be quantified in
terms of volumes of data, trends, frequency of updating, etc in order to estimate if the
new system will perform adequately or not.

Resource Feasibility
This involves questions such as how much time is available to build the new system,
when it can be built, whether it interferes with normal business operations, type and
amount of resources required, dependencies, etc. Contingency and mitigation plans
should also be stated here so that, if the project does overrun, the company is ready for
this eventuality.

Schedule Feasibility
A project will fail if it takes too long to be completed before it is useful. Typically this
means estimating how long the system will take to develop, and if it can be completed in
a given time period using some methods like payback period.
Economic Feasibility
Economic analysis is the most frequently used method for evaluating the effectiveness of
a candidate system. More commonly known as cost/benefit analysis, the procedure is to
determine the benefits and savings that are expected from a candidate system and
compare them with costs. If benefits outweigh costs, then the decision is made to design
and implement the system.

Operational feasibility
Do the current work practices and procedures support a new system? Social factors also matter,
i.e. how the organizational changes will affect the working lives of those affected by the
system.

Technical feasibility
This centers on the existing computer system and the extent to which it can support the
proposed addition.
SYSTEM DESIGN

A lexical analyzer generator creates a lexical analyzer using a set of specifications usually in the
format

p1 {action 1}

p2 {action 2}

............

pn {action n}

where each pi is a regular expression and each actioni is a program fragment that is to be
executed whenever a lexeme matched by pi is found in the input. If more than one pattern
matches, then the longest lexeme matched is chosen. If there are two or more patterns that match
the longest lexeme, the first listed matching pattern is chosen.
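
A hedged C++ sketch of the matching rule just described: every pattern is tried at the current position, the longest match wins, and on a tie the first-listed pattern wins. The patterns below are illustrative, not the project's own specification.

#include <iostream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Patterns in specification order; "int" is listed before the identifier pattern.
    std::vector<std::pair<std::string, std::regex>> spec = {
        {"KEYWORD",    std::regex("int")},
        {"IDENTIFIER", std::regex("[a-zA-Z][a-zA-Z0-9]*")},
        {"NUMBER",     std::regex("[0-9]+")}
    };
    std::string input = "integer";
    std::string best_name;
    std::size_t best_len = 0;
    for (const auto& p : spec) {
        std::smatch m;
        // Try the pattern anchored at the start of the remaining input.
        if (std::regex_search(input, m, p.second, std::regex_constants::match_continuous)) {
            std::size_t len = static_cast<std::size_t>(m.length(0));
            if (len > best_len) {      // strictly longer: longest-lexeme rule
                best_len = len;
                best_name = p.first;   // on a tie the earlier (first-listed) pattern is kept
            }
        }
    }
    // "integer" matches KEYWORD for 3 characters but IDENTIFIER for all 7,
    // so the longest-lexeme rule selects IDENTIFIER.
    std::cout << best_name << " : " << input.substr(0, best_len) << std::endl;
    return 0;
}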

This is usually implemented using a finite automaton. There is an input buffer with
two pointers to it, a lexeme-beginning and a forward pointer. The lexical analyzer generator
constructs a transition table for a finite automaton from the regular expression patterns in the
lexical analyzer generator specification. The lexical analyzer itself consists of a finite automaton
simulator that uses this transition table to look for the regular expression patterns in the input
buffer.

This can be implemented using an NFA or a DFA. The transition table for an NFA is
considerably smaller than that for a DFA, but the DFA recognises patterns faster than the NFA.

Using NFA

The transition table for the NFA N is constructed for the composite pattern p1|p2|...|pn.
The NFA recognizes the longest prefix of the input that is matched by a pattern. In the final NFA,
there is an accepting state for each pattern pi. The set of states the final NFA can be in after
seeing each input character is computed as the simulation proceeds. The NFA is simulated until it
reaches termination or it reaches a set of states from which there is no transition defined for the
current input symbol. The specification for the lexical analyzer generator is written so that a valid
source program cannot entirely fill the input buffer without having the NFA reach termination.
To find a correct match two things are done. Firstly, whenever an accepting state is added to the
current set of states, the current input position and the pattern pi corresponding to this accepting
state are recorded. If the current set of states already contains an accepting state, then only the
pattern that appears first in the specification is recorded. Secondly, transitions are made until
termination is reached. Upon termination, the forward pointer is retracted to the position at which
the last match occurred. The pattern making this match identifies the token found, and the lexeme
matched is the string between the lexeme-beginning and forward pointers. If no pattern matches,
the lexical analyzer should transfer control to some default recovery routine.
Using DFA

Here a DFA is used for pattern matching. This method is a modified version of the
method using NFA. The NFA is converted to a DFA using a subset construction algorithm. Here
there may be several accepting states in a given subset of nondeterministic states. The accepting
state corresponding to the pattern listed first in the lexical analyzer generator specification has
priority. Here also state transitions are made until a state is reached which has no next state for
the current input symbol. The last input position at which the DFA entered an accepting state
gives the lexeme.
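
A compact, hedged sketch of the table-driven simulation described above: a small hand-built DFA recognizes identifiers and unsigned integers, remembering the last input position at which an accepting state was seen so the lexeme can be cut there. The states and patterns are illustrative, not generated from the project's specification.

#include <cctype>
#include <iostream>
#include <string>

// States of a tiny hand-built DFA.
enum State { START, IN_ID, IN_NUM, DEAD };

State step(State s, char c) {                  // the transition table, written as a function
    switch (s) {
    case START:  if (std::isalpha((unsigned char)c)) return IN_ID;
                 if (std::isdigit((unsigned char)c)) return IN_NUM;
                 return DEAD;
    case IN_ID:  return std::isalnum((unsigned char)c) ? IN_ID : DEAD;
    case IN_NUM: return std::isdigit((unsigned char)c) ? IN_NUM : DEAD;
    default:     return DEAD;
    }
}

int main() {
    std::string input = "count42+37";
    std::size_t begin = 0;                                 // lexeme-beginning pointer
    while (begin < input.size()) {
        State s = START;
        std::size_t forward = begin, last_accept = begin;  // forward pointer
        std::string last_token;
        while (forward < input.size() && (s = step(s, input[forward])) != DEAD) {
            forward++;
            if (s == IN_ID)  { last_accept = forward; last_token = "IDENTIFIER"; }
            if (s == IN_NUM) { last_accept = forward; last_token = "NUMBER"; }
        }
        if (last_accept == begin) {              // no pattern matched: emit a single character
            std::cout << "OTHER : " << input[begin] << '\n';
            begin++;
        } else {                                 // retract forward to the last accepting position
            std::cout << last_token << " : "
                      << input.substr(begin, last_accept - begin) << '\n';
            begin = last_accept;
        }
    }
    return 0;
}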
DATA-FLOW DIAGRAM

A data flow diagram (DFD) is a graphical representation of the "flow" of data through an
information system. It differs from the flowchart as it shows the data flow instead of the control
flow of the program.

A data flow diagram can also be used for the visualization of data processing (structured design).

Context Level Diagram (Level 0)

A context level Data flow diagram created using Select SSADM.

This level shows the overall context of the system and its operating environment and shows the
whole system as just one process. It does not usually show data stores, unless they are "owned"
by external systems, e.g. accessed by but not maintained by this system; however, these are
often shown as external entities.
Level 1

A Level 1 Data flow diagram for the same system.

This level shows all processes at the first level of numbering, data stores, external entities and the
data flows between them. The purpose of this level is to show the major high level processes of
the system and their interrelation. A process model will have one, and only one, level 1 diagram.
A level 1 diagram must be balanced with its parent context level diagram, i.e. there must be the
same external entities and the same data flows; these can be broken down into more detail in the
level 1 diagram, e.g. the "enquiry" data flow could be split into "enquiry request" and "enquiry
results" and still be valid.
Level 2

A Level 2 Data flow diagram showing the "Process Enquiry" process for the same system.

This level is a decomposition of a process shown in a level 1 diagram, as such there should be
level 2 diagrams for each and every process shown in a level 1 diagram. In this example
processes 1.1, 1.2 & 1.3 are all children of process 1, together they wholly and completely
describe process 1, and combined must perform the full capacity of this parent process. As
before, a level 2 diagram must be balanced with its parent level 1 diagram.
FLOW CHART

A flowchart is a common type of diagram that represents an algorithm or process, showing the
steps as boxes of various kinds, and their order by connecting them with arrows. Flowcharts are
used in analyzing, designing, documenting or managing a process or program in various fields.

Flowcharts are used in designing and documenting complex processes. Like other types of
diagram, they help visualize what is going on and thereby help the viewer to understand a
process, and perhaps also find flaws, bottlenecks, and other less-obvious features within it. There
are many different types of flowcharts, and each type has its own repertoire of boxes and
notational conventions. The two most common types of boxes in a flowchart are:

A processing step, usually called activity, and denoted as a rectangular box


A decision usually denoted as a diamond.

Flow chart building blocks

Symbols
A typical flowchart from older Computer Science textbooks may have the following
kinds of symbols:
Start and end symbols
Represented as lozenges, ovals or rounded rectangles, usually containing the word "Start"
or "End", or another phrase signaling the start or end of a process, such as "submit
enquiry" or "receive product".
Arrows
Showing what's called "flow of control" in computer science. An arrow coming from one
symbol and ending at another symbol represents that control passes to the symbol the
arrow points to.
Processing steps
Represented as rectangles. Examples: "Add 1 to X"; "replace identified part"; "save
changes" or similar.
Input/Output
Represented as a parallelogram. Examples: Get X from the user; display X.
Conditional or decision
Represented as a diamond (rhombus). These typically contain a Yes/No question or
True/False test. This symbol is unique in that it has two arrows coming out of it, usually
from the bottom point and right point, one corresponding to Yes or True, and one
corresponding to No or False. The arrows should always be labeled. More than two
arrows can be used, but this is normally a clear indicator that a complex decision is being
taken, in which case it may need to be broken-down further, or replaced with the "pre-
defined process" symbol.
A number of other symbols that have less universal currency, such as:
A Document represented as a rectangle with a wavy base;
A Manual input represented by parallelogram, with the top irregularly sloping up from
left to right. An example would be to signify data-entry from a form;
A Manual operation represented by a trapezoid with the longest parallel side at the top, to
represent an operation or adjustment to process that can only be made manually.
A Data File represented by a cylinder

Flowcharts may contain other symbols, such as connectors, usually represented as circles, to
represent converging paths in the flow chart. Circles will have more than one arrow coming into
them but only one going out. Some flow charts may just have an arrow point to another arrow
instead. These are useful to represent an iterative process (what in Computer Science is called a
loop). A loop may, for example, consist of a connector where control first enters, processing
steps, a conditional with one arrow exiting the loop, and one going back to the connector. Off-
page connectors are often used to signify a connection to a (part of another) process held on
another sheet or screen. It is important to remember to keep these connections logical in order.
All processes should flow from top to bottom and left to right.
EVALUATION

The lexical analyzer converts a stream of input characters into a stream of tokens. The different
tokens that our lexical analyzer identifies are as follows:

KEYWORDS: int, char, float, double, if, for, while, else, switch, struct, printf, scanf, case, break,
return, typedef, void

IDENTIFIERS: main, fopen, getch, etc.

NUMBERS: positive and negative integers, positive and negative floating point numbers.

OPERATORS: +, ++, -, --, ||, *, ?, /, >, >=, <, <=, =, ==, &, &&.

BRACKETS: [ ], { }, ( ).

STRINGS: sets of characters enclosed within quotes

COMMENT LINES: single-line and multi-line comments (these are recognized and ignored)

For tokenizing into identifiers and keywords we incorporate a symbol table, which initially
consists of the predefined keywords. The tokens are read from an input file. If the encountered
token is an identifier or a keyword, the lexical analyzer looks it up in the symbol table to check
for the existence of that token. If an entry already exists, we proceed to the next token; if not,
that token along with its token value is written into the symbol table. The rest of the tokens are
directly displayed by writing them into an output file.

The output file will consist of all the tokens present in our input file along with their respective
token values.
CODE

File Name: hash.cpp

#include <iostream>
#include <string>
#include <string.h>

using namespace std;

class hash1 {
private:
    static const int tablesize = 20;
    struct item {
        string type;
        string value;
        item* next;
    };
    item* hashtable[tablesize];
public:
    hash1();                                   // constructor
    int return_index(string key);
    void additem(string type, string value);
    int numberofitemsinindex(int index);       // count the no. of items under a bucket
    void printtable();                         // print the contents of each bucket
    void printitemsinindex(int index);
    int search(string type, string value);
};

hash1::hash1()
{
    for (int i = 0; i < tablesize; i++) {      // initialize every bucket as empty
        hashtable[i] = new item;
        hashtable[i]->type = "empty";
        hashtable[i]->value = "empty";
        hashtable[i]->next = NULL;
    }
}

int hash1::return_index(string key)
{
    int hash = 0;
    for (int i = 0; i < (int)key.length(); i++)   // sum the character codes of the key
        hash += (int)key[i];
    int index = hash % tablesize;
    return index;
}

void hash1::additem(string type, string value)
{
    int index = return_index(type);
    if (hashtable[index]->type == "empty") {       // bucket unused: fill it directly
        hashtable[index]->type = type;
        hashtable[index]->value = value;
    }
    else if (search(type, value)) {                // already present: do nothing
        cout << "duplicate entry so not added to table\n";
    }
    else {                                         // collision: append to the chain
        item* ptr = hashtable[index];
        item* n = new item;
        n->type = type;
        n->value = value;
        n->next = NULL;
        while (ptr->next != NULL)                  // traverse list to its end
            ptr = ptr->next;
        ptr->next = n;
    }
}

int hash1::numberofitemsinindex(int index)
{
    int count = 0;
    if (hashtable[index]->value == "empty")
        return count;
    count++;
    item* ptr = hashtable[index];
    while (ptr->next != NULL) {
        count++;
        ptr = ptr->next;
    }
    return count;
}

void hash1::printtable()
{
    int number;                                    // no. of elements in a bucket
    for (int i = 0; i < tablesize; i++) {
        number = numberofitemsinindex(i);
        cout << "---------\n";
        cout << "index = " << i << endl;
        cout << hashtable[i]->type << endl;
        cout << hashtable[i]->value << endl;
        cout << "no. of items = " << number << endl;
        cout << "---------\n";
        printitemsinindex(i);
    }
}

void hash1::printitemsinindex(int index)
{
    item* ptr = hashtable[index];                  // point to the first item in that index
    if (ptr->type == "empty") {
        cout << "index = " << index << " is empty\n";
    }
    else {
        cout << "items under this index are :" << endl;
        while (ptr != NULL) {
            cout << "*******\n";
            cout << ptr->type << endl;
            cout << ptr->value << endl;
            cout << "*******\n";
            ptr = ptr->next;
        }
    }
}

int hash1::search(string type, string value)
{
    int bucket = return_index(type);
    bool foundtype = false;
    item* ptr = hashtable[bucket];
    while (ptr != NULL) {                          // traverse the chain in this bucket
        if (ptr->type == type && ptr->value == value) {
            foundtype = true;                      // found a matching (type, value) pair
            value = ptr->value;
        }
        ptr = ptr->next;
    }
    if (foundtype == true)
        return 1;                                  // item found
    else
        return 0;                                  // item not found; caller adds it to the table
}

File name: hash.l

%{
#include "hash.cpp"
string ty, val, i;
void addword();
hash1 hashobj;
%}

%x HD
%x DEF
%x IO
%x RET
%x FN
%x OPO

ID  [a-zA-Z][a-zA-Z0-9]*
DEC "int"|"float"|"char"|"short"|"long"|"unsigned"
OP  "="|"+"|"-"|"*"|"/"|"%"|"=="|">="|"<="|"!="|"&&"|"||"|"<"|">"|"!"|"&"|"|"|"~"|"^"|"<<"|">>"|"+="|"-="|"/="|"*="|"%="|"<<="|">>="|"sizeof"
SP  [ \t\n]*

%%

 /* header file rules */
"#include<"            {BEGIN HD;}
<HD>{ID}\.{ID}         {ty = "header file"; val = yytext; hashobj.additem(ty, val);}
<HD>">"                {BEGIN 0;}

 /* declaration rules */
"int"|"float"|"char"   {ty = "keyword"; val = yytext; hashobj.additem(ty, val); BEGIN DEF;}
<DEF>{SP}{ID}\;        {i = yytext[1]; hashobj.additem(val, i); BEGIN 0;}
<DEF>{SP}{ID}\,        {ty = "keyword"; val = yytext[1]; hashobj.additem(ty, val);}
<DEF>{SP}{ID}{SP}\,    {ty = "keyword"; val = yytext[1]; hashobj.additem(ty, val);}
<DEF>{SP}{ID}{SP}\;    {ty = "keyword"; val = yytext[1]; hashobj.additem(ty, val); BEGIN 0;}

 /* return statement rules: the ";" rule must come before "." so it can end the state */
"return"               {ty = "keyword"; val = yytext; hashobj.additem(ty, val); BEGIN RET;}
<RET>";"               {BEGIN 0;}
<RET>.                 {;}

 /* input/output statement rules */
"printf"|"scanf"       {ty = "keyword"; val = yytext; hashobj.additem(ty, val); BEGIN IO;}
<IO>";"                {BEGIN 0;}
<IO>.                  {;}

{OP}                   {ty = "operand"; val = yytext; hashobj.additem(ty, val);}

%%

int main()
{
    yyin = fopen("tt.txt", "r");       // input source program to be tokenized
    yylex();
    fclose(yyin);

    string type = "";
    cout << "Symbol table :\n" << endl;
    hashobj.printtable();

    //while (type != "exit")           // interactive search (disabled in this version)
    //{
    //    cout << "search for \n";
    //    cin >> type;
    //    if (type != "exit")
    //        hashobj.search(type, "");
    //}

    return 0;
}
ADVANTAGES
AND
Disadvantages
OF LEXICAL
ANALYZER
ADVANTAGES

Easier and faster development.


More efficient and compact.

DISADVANTAGES

Done by hand.
Development is complicated.
CONCLUSION
Lexical analysis is a stage in the compilation of any program. In this phase we generate
tokens from the input stream of data, and for performing this task we need a lexical
analyzer.

We have therefore designed a lexical analyzer that generates tokens from the given
input.
REFERENCES

www.google.co.in

www.wikipedia.com

Let Us C : Yashwant Kanetkar

Software Engineering : Roger Pressman

System Software Engineering : D. M. Dhamdhere
