Introduction to Compilers
[Figure: The Mac C compiler source code, written in Unix C, is compiled by the Unix C compiler into a Mac C compiler usable on Unix. That compiler then compiles the same source code into a Mac C compiler usable on Mac.]
Bootstrapping
Process of writing a compiler (or assembler) in the target programming language which it is intended to compile.
Applying this technique leads to a self-hosting compiler.
Many compilers for many programming languages are bootstrapped, including compilers for BASIC, ALGOL, C, Pascal, PL/I, Factor, Haskell, Modula-2, Oberon, OCaml, Common Lisp, Scheme, Java, Python, Scala, Nimrod, Eiffel, and more.
Formal Languages
Already studied
Roles of Scanner
Removal of comments
Case conversion
Removal of white spaces
  Blanks, tabs, carriage returns and line feeds
Interpretation of compiler directives
  #include, #ifdef, #ifndef and #define are directives to redirect the input of the compiler
  May be done by a precompiler
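The first three roles can be sketched as a single pass over the source text. This is only an illustration, not a prescribed design: the function name clean_source is hypothetical, comments are assumed to be C-style /* ... */, runs of white space are collapsed to one blank so that neighbouring tokens stay separated, and case conversion is assumed to mean folding to lower case. Compiler directives are not handled here.

```c
#include <ctype.h>
#include <stddef.h>

/* Copies src into dst while dropping comments, collapsing white space
 * and folding letters to lower case. dst must have room for
 * strlen(src) + 1 bytes. Returns the number of characters written,
 * not counting the final '\0'. */
size_t clean_source(const char *src, char *dst)
{
    size_t n = 0;
    int pending_space = 0;   /* a separator is owed before the next character */

    while (*src != '\0') {
        if (src[0] == '/' && src[1] == '*') {
            /* Assumed comment syntax: skip up to and including the closing delimiter. */
            src += 2;
            while (*src != '\0' && !(src[0] == '*' && src[1] == '/'))
                src++;
            if (*src != '\0')
                src += 2;
            pending_space = 1;   /* a comment separates tokens like a blank */
        } else if (isspace((unsigned char)*src)) {
            /* Blanks, tabs, carriage returns and line feeds. */
            src++;
            pending_space = 1;
        } else {
            if (pending_space && n > 0)
                dst[n++] = ' ';
            pending_space = 0;
            dst[n++] = (char)tolower((unsigned char)*src);
            src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

A real scanner would leave string literals untouched during case conversion; the sketch ignores that detail.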
Token: An element of the lexical definition of the language.
Lexeme: A sequence of characters identified as a token.
Pattern: The set of strings is described by a rule, called a pattern, associated with a token.
Possible Implementations
Lexical Analyzer Generator (e.g. Lex)
+ Safe, quick
- Must learn software, unable to handle unusual situations
Table-Driven Lexical Analyzer
+ General and adaptable method, same function can be used for all table-driven lexical analyzers
- Building the transition table can be tedious and error prone
Handwritten
+ Can be optimized, can handle any unusual situation, easy to build for most languages
- Error prone, not adaptable or maintainable
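To make the table-driven option more concrete, here is a minimal sketch in which one generic driver function walks whatever transition table it is given. The table below recognizes only identifiers and unsigned integers. All names (scan, next_state, accepts, the state and class constants) are hypothetical and chosen for illustration.

```c
#include <ctype.h>

enum { C_LETTER, C_DIGIT, C_OTHER, N_CLASSES };          /* character classes */
enum { S_START, S_ID, S_NUM, N_STATES, S_STOP = -1 };    /* states */
enum { T_NONE, T_ID, T_NUM };                            /* token codes */

/* next_state[state][class]: the following state, or S_STOP if there is no edge. */
static const int next_state[N_STATES][N_CLASSES] = {
    /* S_START */ { S_ID,   S_NUM, S_STOP },
    /* S_ID    */ { S_ID,   S_ID,  S_STOP },
    /* S_NUM   */ { S_STOP, S_NUM, S_STOP },
};

/* Token recognized when the scan stops in a given state. */
static const int accepts[N_STATES] = { T_NONE, T_ID, T_NUM };

static int char_class(int c)
{
    if (isalpha(c)) return C_LETTER;
    if (isdigit(c)) return C_DIGIT;
    return C_OTHER;   /* includes the terminating '\0' */
}

/* Generic driver: follows the table until no edge applies, stores the
 * number of characters consumed in *length and returns the token code
 * of the last state reached. Only the tables change between languages. */
int scan(const int table[][N_CLASSES], const int *accepting,
         const char *input, int *length)
{
    int state = S_START;
    int i = 0;

    for (;;) {
        int next = table[state][char_class((unsigned char)input[i])];
        if (next == S_STOP)
            break;
        state = next;
        i++;
    }
    *length = i;
    return accepting[state];
}
```

Calling scan(next_state, accepts, "count1 = 25", &len) returns T_ID with len set to 6. Writing the table by hand for a full language is where the tedium mentioned above comes from.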
Specification of tokens
Regular expressions are an important notation for specifying patterns.
Rules to define regular expressions
Regular Expressions
∅ : {}
s : {s | s in s^}
a : {a}
r | s : {r | r in r^} or {s | s in s^}
s* : {sⁿ | s in s^ and n >= 0}
s+ : {sⁿ | s in s^ and n >= 1}
id -> letter (letter | digit)*
num -> digit+ (. digit+)? (E (+ | -)? digit+)?
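The two definitions for id and num can be turned into hand-coded recognizers almost symbol by symbol. The sketch below is one possible rendering: the function names match_id and match_num are hypothetical, "letter" is assumed to mean an ASCII alphabetic character, and each function returns the length of the longest matching prefix (0 when nothing matches).

```c
#include <ctype.h>
#include <stddef.h>

/* digit* : number of leading digits in s. */
static size_t digits(const char *s)
{
    size_t i = 0;
    while (isdigit((unsigned char)s[i]))
        i++;
    return i;
}

/* id -> letter (letter | digit)* */
size_t match_id(const char *s)
{
    size_t i = 1;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[i]))
        i++;
    return i;
}

/* num -> digit+ (. digit+)? (E (+ | -)? digit+)? */
size_t match_num(const char *s)
{
    size_t i = digits(s);               /* digit+ */
    if (i == 0)
        return 0;

    if (s[i] == '.') {                  /* (. digit+)? */
        size_t frac = digits(s + i + 1);
        if (frac > 0)
            i += 1 + frac;              /* taken only when the whole group matches */
    }

    if (s[i] == 'E') {                  /* (E (+ | -)? digit+)? */
        size_t j = i + 1;
        size_t expo;
        if (s[j] == '+' || s[j] == '-')
            j++;
        expo = digits(s + j);
        if (expo > 0)
            i = j + expo;
    }
    return i;
}
```

For example, match_num("3.14E-2;") returns 7, while match_num("7.") returns 1 because the optional fraction needs at least one digit after the dot.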
Recognition of tokens
Transition diagrams:
As an intermediate step in the construction of a lexical analyzer, we produce a stylized flowchart, called a transition diagram.
[Figure: Transition diagram for identifiers and keywords. From start state 9, a letter leads to state 10; state 10 loops on letter or digit; any other character leads to state 11, which executes return(gettoken(), install_id()).]
Implementing a transition diagram
A sequence of transition diagrams can be converted into a program to look for the tokens specified by the diagrams. Program size is proportional to the number of states and edges in the diagrams.
[Figure: Transition diagram for numbers. From start state 25, a digit leads to state 26; state 26 loops on digit; any other character leads to state 27.]
token nexttoken()
{
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();
            /* c is lookahead character */
            if (c == blank || c == tab || c == newline) {
                state = 0;
                lexeme_beginning++;
                /* advance beginning of lexeme */
            }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            /* ... */
        }
    }
}
gettoken()
Looks for the lexeme in the symbol table. If the lexeme is a keyword, the corresponding token is returned; otherwise the token id is returned.
install_id()
Has access to the buffer where the identifier lexeme is located.
The symbol table is examined, and if the lexeme is found marked as a keyword, it returns 0.
If the lexeme is found and is a program variable, it returns a pointer to the symbol table entry.
If it is not found in the symbol table, it is installed as a variable and a pointer to the newly created entry is returned.
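The behaviour described for gettoken() and install_id() can be sketched with a small array-based symbol table. This is an assumption-laden illustration: the slides say both routines read the lexeme from the input buffer, whereas here it is passed in as a string; the table layout, the keyword set (if, then, else), the token codes and the lookup helper are all hypothetical.

```c
#include <string.h>

#define MAX_SYMBOLS 64
#define MAX_LEXEME  32

enum { TOK_ID = 256, TOK_IF, TOK_THEN, TOK_ELSE };   /* hypothetical token codes */

struct entry {
    char lexeme[MAX_LEXEME];
    int  token;              /* TOK_ID for variables, a keyword code otherwise */
};

/* Keywords are assumed to be preloaded into the table. */
static struct entry symtable[MAX_SYMBOLS] = {
    { "if", TOK_IF }, { "then", TOK_THEN }, { "else", TOK_ELSE },
};
static int symcount = 3;

static struct entry *lookup(const char *lexeme)
{
    int i;
    for (i = 0; i < symcount; i++)
        if (strcmp(symtable[i].lexeme, lexeme) == 0)
            return &symtable[i];
    return NULL;
}

/* Keyword: its own token code. Anything else: the token id. */
int gettoken(const char *lexeme)
{
    struct entry *e = lookup(lexeme);
    return (e != NULL && e->token != TOK_ID) ? e->token : TOK_ID;
}

/* Keyword: 0 (NULL). Known variable: its entry. Unknown: install it as a
 * variable and return the new entry. In this sketch a full table or an
 * over-long lexeme also yields NULL. */
struct entry *install_id(const char *lexeme)
{
    struct entry *e = lookup(lexeme);
    if (e != NULL)
        return e->token == TOK_ID ? e : NULL;
    if (symcount == MAX_SYMBOLS || strlen(lexeme) >= MAX_LEXEME)
        return NULL;
    e = &symtable[symcount++];
    strcpy(e->lexeme, lexeme);
    e->token = TOK_ID;
    return e;
}
```

A linear search keeps the sketch short; a production symbol table would more likely use hashing.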
Install_num()
[Figure: transition diagram with edges labelled letter, digit and other.]
Implementation Concerns
Backtracking
Principle: A token is normally recognized only when the next character is read.
Problem: This character may be part of the next token.
Example: x<1. The token < is recognized only when 1 is read. In this case, we have to backtrack one character to continue token recognition.
The occurrence of these cases can be included in the state transition table.
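The x<1 case can be sketched with a forward index into the input and a retract operation that gives one character back. The names (struct scanner, nextchar, retract, scan_less) and the token codes are hypothetical; the operators <= and <> are assumed only to show why the character after < must be read at all.

```c
enum { LT, LE, NE, NOT_RELOP };   /* hypothetical token codes */

struct scanner {
    const char *input;
    int forward;                  /* index of the next character to read */
};

static int nextchar(struct scanner *s)
{
    return (unsigned char)s->input[s->forward++];
}

/* Backtrack one character: it belongs to the next token. */
static void retract(struct scanner *s)
{
    s->forward--;
}

/* Recognizes <, <= or <> starting at the current position. */
int scan_less(struct scanner *s)
{
    int c;

    if (nextchar(s) != '<') {
        retract(s);
        return NOT_RELOP;
    }
    c = nextchar(s);              /* lookahead needed to decide */
    if (c == '=')
        return LE;
    if (c == '>')
        return NE;
    retract(s);                   /* e.g. the 1 in x<1 is given back */
    return LT;
}
```

Starting at index 1 of "x<1", scan_less returns LT and leaves forward at 2, so the next call to the scanner sees the 1 again.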
Ambiguity
Problem: The lexemes of some tokens are subsets of other tokens.
Example: n-1. Is it <n><-><1> or <n><-1>?
Solutions:
Postpone the decision to the syntactic analyzer.
Do not allow a sign prefix on numbers in the lexical specification.
Interact with the syntactic analyzer to find a solution. (Induces coupling.)
Example
Alphabet:
{:, *, =, (, ), <, >, {, }, [a..z], [0..9]}
Simple tokens:
{(, ), {, }, :, <, >}
Composite tokens:
{:=, >=, <=, <>, (*, *)}
Words:
id ::= letter(letter | digit)*
num ::= digit*
Ambiguity problems:
Character    Possible tokens
:            :, :=
>            >, >=
<            <, <=, <>
(            (, (*
*            *, *)
Backtracking:
Must back up a character when we read a character that is part of the next token.
Occurrences are coded in the table.
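A handwritten scanner for this example language might resolve the ambiguities by always preferring the longer token. The sketch below peeks at the second character instead of reading it and backing up, which has the same effect. Several points are assumptions rather than part of the example: the names (next_token, struct token, the kind constants), blanks as separators, num requiring at least one digit, a lone * counted as a single-character token because the ambiguity table lists it, and a lone = reported as an error.

```c
#include <stddef.h>
#include <string.h>

enum kind { T_EOF, T_ERROR, T_SIMPLE, T_COMPOSITE, T_ID, T_NUM };

struct token {
    enum kind   kind;
    const char *start;    /* first character of the lexeme */
    size_t      length;   /* lexeme length */
};

static int is_letter(int c) { return c >= 'a' && c <= 'z'; }
static int is_digit(int c)  { return c >= '0' && c <= '9'; }

/* Composite tokens: :=  >=  <=  <>  (*  *) */
static int is_composite(char a, char b)
{
    return (a == ':' && b == '=') || (a == '>' && b == '=') ||
           (a == '<' && (b == '=' || b == '>')) ||
           (a == '(' && b == '*') || (a == '*' && b == ')');
}

/* Returns the next token and advances *cursor past it. */
struct token next_token(const char **cursor)
{
    const char *p = *cursor;
    struct token t;

    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;
    t.start = p;
    t.length = 0;

    if (*p == '\0') {
        t.kind = T_EOF;
    } else if (is_letter(*p)) {
        t.kind = T_ID;
        while (is_letter(p[t.length]) || is_digit(p[t.length]))
            t.length++;
    } else if (is_digit(*p)) {
        t.kind = T_NUM;
        while (is_digit(p[t.length]))
            t.length++;
    } else if (is_composite(p[0], p[1])) {
        t.kind = T_COMPOSITE;          /* the longer token wins */
        t.length = 2;
    } else {
        /* The second character, if any, belongs to the next token. */
        t.kind = strchr("(){}:<>*", *p) != NULL ? T_SIMPLE : T_ERROR;
        t.length = 1;
    }
    *cursor = p + t.length;
    return t;
}
```

On the input "a<b" the middle token is the simple token <, while on "a<>b" the same position yields the composite token <>.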