
Compilers

Software is of two types:

Systems software: Software without which the computer cannot work. Ex: operating systems, assemblers, compilers, loaders, macro processors.

Applications software: Packages that are developed using the systems software. Ex: accounts package, salary system package, etc.

Without systems software, the computer is like a dead machine. A compiler is a program that takes as input a source program in a high-level language and produces its equivalent assembly-level language program as output.

Phases in the design of a compiler


1. Lexical analysis (scanning)
2. Syntax analysis (parsing)
3. Intermediate code generation (ICG)
4. Intermediate code optimization (ICO)
5. Assembly language code generation
6. Assembly language code optimization

1. Lexical analysis (scanning)


The lexical analyzer is the interface between the source program and the compiler. It reads the source program one character at a time and cuts it into a sequence of atomic units called tokens. Each token represents a sequence of characters that can be treated as a single logical entity. Identifiers, keywords, constants, operators, and punctuation symbols are typical tokens. Tokens are separated by BLANKS or by other break characters.

Identifiers: Variables etc. These are the words defined by the programmer; they are data names. An identifier can be a literal identifier or a non-literal identifier. Each identifier should begin with an alphabetic character and should not exceed a predefined length.

Literals: These are constants. Ex: numeric constants, character constants, etc.

Terminals: These are keywords, operators, and special symbols defined in the language.

Break characters: BLANK, (, ), +, -, *, /, punctuation marks.

Non-break characters: A, B, C, ..., 1, 2, 3, ...

During lexical analysis, we perform two jobs:
1. Breaking the source code into tokens: Here, we separate all the words in the program, which are delimited by the break characters.
2. Categorizing the tokens: Here, every token in the source code is classified as an identifier, a terminal, or a literal.

Databases: The lexical analysis phase involves the following databases:
HLLP : High-level language program
TT   : Terminal table
LT   : Literal table
IT   : Identifier table
UST  : Uniform symbol table

By using these databases, we determine the category to which each token belongs.

Literal Table: It is created by the lexical analyzer to describe all literals used in the source program. There is one entry for each literal, consisting of a value, a number of attributes, an address denoting the location of the literal at execution (filled in by a later phase), and other information. Attributes such as data type or precision can be deduced from the literal itself and are filled in by the lexical analyzer. Format of literal table:

Identifier table: This is also created by the lexical analyzer to describe all identifiers used in the source program. There is one entry for each identifier. The lexical analyzer creates an entry in the table and places the name of the identifier into that entry. Since in many languages the name of an identifier may be from 1 to 31 characters long, the lexical phase may instead enter a pointer in the identifier table for efficiency of storage. The pointer points to the name in a separate table of names. Later phases fill in the data attributes and the address of each identifier. Format of identifier table:

Uniform symbol table: This is created by the lexical analyzer to represent the program as a string of tokens rather than of individual characters. (Spaces and comments in the source code are not represented as uniform symbols and therefore are not used by further phases.) There is one uniform symbol for every token in the program. Each entry in the uniform symbol table contains the identification of the table of which the token is a member and its index within that table. Format of UST:

Terminal Table: It is a permanent database that has an entry for each terminal symbol (Ex: arithmetic operators, keywords, special symbols like #, {, }, etc). Each entry consists of the terminal symbol, an indication of its classification (K: keyword, P: operator, or B: break character), and its precedence (used in later phases).

Source Program: Original form of the program, which appears to the compiler as a string of characters.

Example: Let us consider a COBOL statement:

COMPUTE XYZ = (A + B - 10)

For the above statement we construct the databases as follows:

Terminal Table
SNO   TERMINAL   INDICATOR   BREAKCHAR
1     PERFORM    K           NO
2     GOTO       K           NO
3     COMPUTE    K           NO
..    ..         ..          ..
10    BLANK                  YES
11    +          P           YES
12    -          P           YES
13    *          P           YES
14    /          P           YES
15    =          P           YES
16    (          S           YES
17    )          S           YES
..    ..         ..          ..

Literal Table
SNO   LITERAL
1     10

Identifier Table
SNO   IDENTIFIER
1     XYZ
2     A
3     B

Uniform Symbol (US) Table
SNO   US    INDEX
1     TER   3
2     ID    1
3     TER   15
4     TER   16
5     ID    2
6     TER   11
7     ID    3
8     TER   12
9     LIT   1
10    TER   17

Note:
1. Tokens are entered only once in their corresponding tables, but they may appear more than once in the UST.
2. There is no need to enter BLANKS in the UST.
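The two jobs of the lexical analyzer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `TERMINAL_INDEX`, `BREAK_CHARS`, `split_tokens`, and `build_tables` are hypothetical, the terminal indices are copied from the example tables, and only unsigned integer literals are recognized.

```python
# Hypothetical terminal table: symbol -> index, following the example above.
TERMINAL_INDEX = {"PERFORM": 1, "GOTO": 2, "COMPUTE": 3,
                  "+": 11, "-": 12, "*": 13, "/": 14,
                  "=": 15, "(": 16, ")": 17}
BREAK_CHARS = set(" +-*/=()")  # BLANK plus the one-character terminals


def split_tokens(source):
    """Job 1: cut the source into tokens at break characters."""
    tokens, word = [], ""
    for ch in source:
        if ch in BREAK_CHARS:
            if word:
                tokens.append(word)
                word = ""
            if ch != " ":  # BLANK separates tokens but is not a token itself
                tokens.append(ch)
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens


def build_tables(source):
    """Job 2: classify each token and build the LT, IT and UST."""
    literals, identifiers, ust = [], [], []
    for tok in split_tokens(source):
        if tok in TERMINAL_INDEX:
            ust.append(("TER", TERMINAL_INDEX[tok]))
        elif tok.isdigit():
            if tok not in literals:  # each literal is entered only once
                literals.append(tok)
            ust.append(("LIT", literals.index(tok) + 1))
        else:
            if tok not in identifiers:  # each identifier is entered only once
                identifiers.append(tok)
            ust.append(("ID", identifiers.index(tok) + 1))
    return literals, identifiers, ust


literals, identifiers, ust = build_tables("COMPUTE XYZ = (A + B - 10)")
print(identifiers)  # ['XYZ', 'A', 'B']
print(literals)     # ['10']
for sno, (table, index) in enumerate(ust, start=1):
    print(sno, table, index)
```

The printed rows reproduce the uniform symbol table of the example: a token is stored once in its own table, while the UST records every occurrence.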

2. Syntax analysis (parsing)


Here, we check whether the given sentence is legal or not, that is, whether it can be derived from the grammar of the language. For this we apply different parsing techniques.

Ex: Consider a grammar G = (VN, VT, S, P), where
VN = {S} (non-terminals)
VT = {a, b} (terminals)
S is the start symbol
P = {S -> aS, S -> b} (productions)

Now, the sentence aaab is legal because it is derivable from the above grammar: S => aS => aaS => aaaS => aaab.

There are two types of parsing techniques: top-down parsing and bottom-up parsing.
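As an illustration of top-down parsing, the following sketch checks whether a sentence is derivable from the example grammar with productions S -> aS and S -> b. The function name `derives` is hypothetical, and the code is a hand-written recursive-descent check for this one grammar, not a general parser.

```python
def derives(sentence):
    """Return True if the sentence can be derived from S -> aS | b."""
    def parse_s(pos):
        # Try S -> aS: consume 'a', then parse another S.
        if pos < len(sentence) and sentence[pos] == "a":
            return parse_s(pos + 1)
        # Try S -> b: consume 'b' and stop.
        if pos < len(sentence) and sentence[pos] == "b":
            return pos + 1
        return None  # no production applies

    # Legal only if S consumed the whole sentence.
    return parse_s(0) == len(sentence)


print(derives("aaab"))  # True
print(derives("aaba"))  # False
```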

3. Intermediate code generation (ICG)


The intermediate code generation phase transforms the parse tree into an intermediate language representation of the source program. In complexity, the intermediate code lies between the high-level language and the low-level representation.

Forms of intermediate code:
(a) Tree form
(b) Matrix form

(a) Tree form: Here any type of statement is represented in the form of a tree. Ex: A = B + C - D

(b) Matrix form: Here, we represent any statement in the matrix form.

Ex: A = B + C - D

LNO   Op. Code   Opr1   Opr2
(1)   +          B      C
(2)   -          (1)    D
(3)   =          A      (2)

Ex: X = (B + C) * D + (P/Q)

LNO   Op. Code   Opr1   Opr2
(1)   +          B      C
(2)   /          P      Q
(3)   *          (1)    D
(4)   +          (3)    (2)
(5)   =          X      (4)
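The translation of a statement into matrix form can be sketched as a post-order walk over its expression tree. The sketch below borrows Python's standard `ast` module to build the tree, which is an assumption made for brevity (a real compiler would use its own parse tree); the function name `to_matrix` is hypothetical, and only the operators +, -, * and / are handled.

```python
import ast

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def to_matrix(statement):
    """Translate 'target = expression' into (op, opr1, opr2) rows."""
    target, expression = (part.strip() for part in statement.split("=", 1))
    rows = []

    def emit(node):
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left = emit(node.left)    # operands first (post-order)
            right = emit(node.right)
            rows.append((OPS[type(node.op)], left, right))
            return f"({len(rows)})"   # later rows refer to this line number
        raise ValueError("unsupported construct")

    result = emit(ast.parse(expression, mode="eval").body)
    rows.append(("=", target, result))
    return rows


for lno, row in enumerate(to_matrix("A = B + C - D"), start=1):
    print(f"({lno})", *row)
# (1) + B C
# (2) - (1) D
# (3) = A (2)
```

For a statement with several independent sub-expressions, the rows may come out in a different order than in a hand-built table, but each row still refers only to rows that precede it.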

4. Intermediate code optimization (ICO)


In this phase, the intermediate code is optimized. This means that the number of lines in the matrix form is reduced where possible.

Ways of optimization:
1. Common sub-expression elimination
2. Compile time compute technique
3. Moving invariant computation outside the loop
4. Boolean expression optimization

1. Common sub-expression elimination

Let us consider the statement X = (C + D) + Q ^ P - (D + C - 10)

Matrix table
LNO   Op. Code   Opr1   Opr2
(1)   +          C      D
(2)   +          D      C
(3)   -          (2)    10
(4)   ^          Q      P
(5)   +          (1)    (4)
(6)   -          (5)    (3)
(7)   =          X      (6)

Optimized matrix table
LNO   Op. Code   Opr1   Opr2
(1)   +          C      D
(2)   -          (1)    10
(3)   ^          Q      P
(4)   +          (1)    (3)
(5)   -          (4)    (2)
(6)   =          X      (5)

Steps to optimize the table:
(i) Arrange the operands in ascending order for commutative Op. Codes.
(ii) Identify rows having common sub-expressions.
(iii) Eliminate all such rows except one.
(iv) Modify the matrix form accordingly.
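The four steps can be sketched as a single pass over the matrix. The sketch assumes that the example statement reads X = (C + D) + Q ^ P - (D + C - 10), that rows are (op, opr1, opr2) tuples, and that a reference to line n is written as the string "(n)"; the function name is hypothetical.

```python
COMMUTATIVE = {"+", "*"}


def eliminate_common_subexpressions(rows):
    """Remove duplicate rows from a matrix of (op, opr1, opr2) tuples."""
    new_rows = []   # the optimized matrix
    seen = {}       # normalized row -> its reference in the new matrix
    renumber = {}   # old reference "(n)" -> new reference "(m)"

    for old_no, (op, a, b) in enumerate(rows, start=1):
        # Step (iv): rewrite references so they point into the new matrix.
        a, b = renumber.get(a, a), renumber.get(b, b)
        # Step (i): order the operands of commutative operators.
        if op in COMMUTATIVE:
            a, b = sorted((a, b))
        key = (op, a, b)
        # Steps (ii)-(iii): keep only the first of several identical rows.
        if op != "=" and key in seen:
            renumber[f"({old_no})"] = seen[key]
            continue
        new_rows.append(key)
        seen[key] = renumber[f"({old_no})"] = f"({len(new_rows)})"
    return new_rows


matrix = [("+", "C", "D"), ("+", "D", "C"), ("-", "(2)", "10"),
          ("^", "Q", "P"), ("+", "(1)", "(4)"), ("-", "(5)", "(3)"),
          ("=", "X", "(6)")]
for lno, row in enumerate(eliminate_common_subexpressions(matrix), start=1):
    print(f"({lno})", *row)
```

Assignment rows are never merged, since two assignments with the same operands are not a redundant computation in the sense used here.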

2. Compile time compute technique

The compiler is able to do simple arithmetic itself.

Ex: X = 2 * 3 / A

Matrix table
LNO   Op. Code   Opr1   Opr2
(1)   *          2      3
(2)   /          (1)    A
(3)   =          X      (2)

Optimized matrix table
LNO   Op. Code   Opr1   Opr2
(1)   /          6      A
(2)   =          X      (1)
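The reduction shown in the two tables above can be sketched as follows. The sketch assumes that numeric constants are stored as Python numbers and that a reference to line n is the string "(n)"; the names `FOLD` and `fold_constants` are hypothetical. Division is deliberately left out of the folded operators so that the sketch never has to deal with division by zero.

```python
import operator

FOLD = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(rows):
    """Evaluate rows whose two operands are numbers and drop them."""
    new_rows = []
    replace = {}  # old reference "(n)" -> computed value or new reference

    for old_no, (op, a, b) in enumerate(rows, start=1):
        # Substitute earlier results and renumbered references.
        a, b = replace.get(a, a), replace.get(b, b)
        both_numbers = (isinstance(a, (int, float))
                        and isinstance(b, (int, float)))
        if op in FOLD and both_numbers:
            # Compute now; the row itself disappears from the matrix.
            replace[f"({old_no})"] = FOLD[op](a, b)
            continue
        new_rows.append((op, a, b))
        replace[f"({old_no})"] = f"({len(new_rows)})"
    return new_rows


matrix = [("*", 2, 3), ("/", "(1)", "A"), ("=", "X", "(2)")]
for lno, row in enumerate(fold_constants(matrix), start=1):
    print(f"({lno})", *row)
# (1) / 6 A
# (2) = X (1)
```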

The aim is to perform at compile time any computation whose operands are already known, so that it is not repeated at run time. For that, if both operands are scalars (numbers), compute the value and substitute the result at the necessary places, and eliminate the lines in which both operands are numbers (or scalars).

3. Moving invariant computation outside the loop

We have to find whether each variable in the loop is variant, not variant (invariant), or partially variant. A computation whose operands are all invariant can be moved outside the loop.

4. Boolean expression optimization

Ex: IF <COND> THEN <SL1> ELSE <SL2>

We can apply the short-cut methods of Boolean expressions to simplify conditional statements in any high-level language program. In this way we can optimize the intermediate code and save both time and space. It is better to write conditional statements using only OR operations or only AND operations, where they are necessary.

5. Assembly language code generation


Here, the optimized intermediate code is converted into assembly language code.

ALCG : ALC Generator
ALC  : Assembly Language Code

To get the ALC from ICO, we first read the database.

ICO database
LNO   Op. Code   Opr1   Opr2
M1    +          A      B
M2    -          C      D
..    ..         ..     ..

Steps for +
(i) Read the operands
(ii) Execute the ADD routine
(iii) Store results somewhere else

The routines developed in assembly language are maintained in a separate database.

Assembly language database (ALDB)
Op. Code   Routine
+          L 1, Op1
           A 1, Op2
           ST 1, MX
-          L 1, Op1
           S 1, Op2
           ST 1, MX
*          L 1, Op1
           M 1, Op2
           ST 1, MX
/          L 1, Op1
           D 1, Op2
           ST 1, MX
=          L 1, Op2
           ST 1, Op1

We assume that
1  : General purpose register
L  : Load
A  : Add
S  : Subtract
M  : Multiply
D  : Divide
ST : Store
MX : Memory location

Ex: We have X = A + B - C

INTERMEDIATE CODE
LNO   Op. Code   Opr1   Opr2
M1    +          A      B
M2    -          M1     C
M3    =          X      M2

The corresponding ALC is
L 1, A
A 1, B
ST 1, M1
L 1, M1
S 1, C
ST 1, M2
L 1, M2
ST 1, X
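The lookup in the ALDB can be sketched as a template expansion. The dictionary below copies the routines from the ALDB; the names `ALDB` (as a Python dictionary) and `generate_alc`, and the `{op1}`, `{op2}`, `{mx}` placeholders, are hypothetical. Row n is assumed to store its result in the temporary location Mn.

```python
# One routine per Op. Code, copied from the ALDB.
ALDB = {
    "+": ["L 1, {op1}", "A 1, {op2}", "ST 1, {mx}"],
    "-": ["L 1, {op1}", "S 1, {op2}", "ST 1, {mx}"],
    "*": ["L 1, {op1}", "M 1, {op2}", "ST 1, {mx}"],
    "/": ["L 1, {op1}", "D 1, {op2}", "ST 1, {mx}"],
    "=": ["L 1, {op2}", "ST 1, {op1}"],
}


def generate_alc(rows):
    """Expand each (op, opr1, opr2) row into its assembly routine."""
    code = []
    for n, (op, op1, op2) in enumerate(rows, start=1):
        for template in ALDB[op]:
            code.append(template.format(op1=op1, op2=op2, mx=f"M{n}"))
    return code


intermediate = [("+", "A", "B"), ("-", "M1", "C"), ("=", "X", "M2")]
print("\n".join(generate_alc(intermediate)))
```

Running it on the intermediate code for X = A + B - C prints the same eight instructions as listed above.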

6. Assembly language code optimization


To optimize ALC:
(i) First we look at the assembly language instructions. After that we can get the optimized code.
(ii) We remove unnecessary instructions.

The other and better way is discussed through an example.

Ex: X = A * B + C * D

Intermediate code
SNO   Op. Code   Opr1   Opr2
M1    *          A      B
M2    *          C      D
M3    +          M1     M2
M4    =          X      M3

Table-1: ALC for the above intermediate code
L 1, A
M 1, B
ST 1, M1
L 1, C
M 1, D
ST 1, M2
L 1, M1
A 1, M2
ST 1, M3
L 1, M3
ST 1, X

The steps for optimization are:

(i) Consecutive store and load instructions are to be eliminated if they deal with the same operand.

Table-2: Reduced ALC from Table-1
L 1, A
M 1, B
ST 1, M1
L 1, C
M 1, D
A 1, M1
ST 1, X

Here the pair ST 1, M3 and L 1, M3 is removed directly. Because addition is commutative, C * D can also stay in register 1 and M1 can be added to it, which removes ST 1, M2 and L 1, M1 as well.

(ii) Try to use RR type instructions in place of RM type instructions. RR type instructions are much faster than RM type instructions.

L 1, A
M 1, B
ST 1, 3
L 1, C
M 1, D
A 1, 3
ST 1, X

Note: Here, 3 is another register, which is used in place of M1.


(iii) If it is possible, try to use one or more additional registers as general purpose registers. By using general purpose registers we can eliminate more instructions.

L 1, A
M 1, B
L 2, C
M 2, D
A 1, 2
ST 1, X

Note: Here, we use general purpose register 2 to store C * D.
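Step (i) can be sketched as a simple pass over the instruction list. The function name `remove_store_load_pairs` is hypothetical, and the sketch assumes that a temporary location is not needed again once its store has been removed. It applies only the stated rule (a store immediately followed by a load of the same operand into the same register), so on Table-1 it removes the pair on M3 but leaves the instructions on M1 and M2 in place; the further reduction in Table-2 needs the reordering of the commutative addition.

```python
def remove_store_load_pairs(code):
    """Drop 'ST r, X' when it is immediately followed by 'L r, X'."""
    result = []
    i = 0
    while i < len(code):
        if i + 1 < len(code):
            op1, rest1 = code[i].split(maxsplit=1)
            op2, rest2 = code[i + 1].split(maxsplit=1)
            if op1 == "ST" and op2 == "L" and rest1 == rest2:
                i += 2  # the value is already in the register
                continue
        result.append(code[i])
        i += 1
    return result


table_1 = ["L 1, A", "M 1, B", "ST 1, M1",
           "L 1, C", "M 1, D", "ST 1, M2",
           "L 1, M1", "A 1, M2", "ST 1, M3",
           "L 1, M3", "ST 1, X"]
print("\n".join(remove_store_load_pairs(table_1)))
```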
