Vous êtes sur la page 1sur 14

Intermediate Languages

Intermediate languages are with us for quite a long time. Most of the
compilers employ them in one-way or another. The way of using
intermediate languages have also sometimes contributed to the
success of the languages like Pascal, Java and recently C# (with .NET
initiative in general). Even some tools like MS-Word makes use of an
intermediate language (p-code). Intermediate languages have made a
big difference in achieving portability of code. This article explores the
concept of intermediate languages and their use in achieving complete
portability and language interoperability.

In 1950s itself, this idea of having an intermediate language was


experimented with the introduction of UNCOL [1]. Later Pascal became
widely available because of p-code. Not until recently did Java
employed the same technology with its promise of portability to
become a big success. Now Microsoft's .NET initiative is also based on
this approach, but the keyword is language interoperability. A closer
look reveals that all are based on same idea: use of intermediate code,
and that is the current trend in the language translation technology.

Approaches in Language Translation

Two major approaches towards language translation are compilation


and interpretation. In compilation, since the native code is generated,
different compilers are needed for different platforms. An interpreter is
also particular to the platform, but its specifics are abstracted making
only the behavior of the program transparent as the code is translated
and executed 'on-the-fly'.

Neither of them is ‘the’ best approach - both boasts of significant


advantages and also suffer from some serious disadvantages.

The advantages of compilation mostly overweigh its disadvantages and


that is why it is preferred in most of the languages. The translation
and optimization needs to be done only once, and native code ensures
efficiency of execution. However as native code is generated, it is
platform specific.

Interpretation has some very good advantages that compilation


doesn’t have. In particular, they can achieve full portability. Also it can
provide better interactive features and diagnostic (error) messages.
But one problem that cripples interpretation is performance. Since
translation needs to be done every time (and for the same reason,
optimization is less attractive) execution becomes generally very slow
(10 to 100 times) compared to the equivalent compiled code.

You can think of an interpreter as the one that implements a virtual


machine. So, it can be inferred that the 'machine language' of that
virtual machine is the high-level language for which the interpreter is
written!

Interpreters provide many advantages including greater flexibility,


better diagnostics and interactive features like debugging. True power
of interpretation is generally underestimated. To illustrate, you can
think of printf library routine in C as a tiny interpreter for formatted
output - depending on the requirement it parses the format string,
gets the arguments, computes the format on the fly, prints it to the
output along with facilities like padding, aligning, specific notations etc.

Interpretation allows the code to be generated or modify itself "on the


fly" and execute it. For example, interpreted languages like Lisp or
even partly interpreted languages like Java or C# provides such
facilities (although not recommended to use such facilities for practical
programming). Note that few language features like reflection are only
possible with interpretation. This is the reason why compilers cannot
be written for some languages and interpretation is the only option.

So, instead of restricting to pure compilation or interpretation, you can


infer that a combination of the both is better to enjoy the advantages
of these approaches.

It seems to be very simple to use both compilation and interpretation.


For that itself, various languages use different approaches.

Few languages provides options for both interpretation and compilation


and can be used depending on the requirement. For example, for
development and debugging, Microsoft Visual Basic uses an interpreter
and for deployment, it employs a compiler.

As already stated, writing compilers for the languages that are meant
for interpretation is sometimes not possible. In such cases, it is
possible to bundle the source code with the whole interpreter as an
executable file and distribute it. Java Native Interface (JNI) follows a
similar approach by embedding a Java interpreter in an executable
code created by a C/C++ source code, and run the Java code through
it. This solution is feasible because the Java interpreter’s size is not
very big.

A very interesting approach towards combining compilation and


interpretation through hardware support was widely used in as early
as 1960s in IBM chips. IBM released chips that incorporated the
interpreter software burnt into chips (nothing but firmware) that
supported a general instruction set. The main advantage being that
the same instruction set (of the interpreter) shall be used across
different processor architectures. Since, the compilers would generate
the code just for that one instruction set, the effort is minimized.
Later, the advent of VLSI design with RISC architecture obviated this
microprogramming approach.

The very popular and widely used approach that combines the
compilation and interpretation is making use of intermediate code
directly. Since its use in Java this approach seems to have received
widespread acceptance, even though its idea was used long back in
Pascal and in many other languages. Languages following this
approach are sometimes referred to as p-code languages since p-code
of Pascal is one of the first intermediate codes to follow this approach
successfully. Few of important p-code languages are Python, Perl, Java
and now, C# (.NET framework). The idea is to use a compiler to
generate the intermediate language code. The intermediate code can
be moved on to other platforms or even across the network where an
interpreter (also called as virtual machine) for that platform will take
over to interpret the code.

The usage of intermediate code in compilers is not a new idea. There


are various phases in a compiler and they can be grouped generally as
the front-end and the back-end. The analysis and translation of the
code (in general 'parsing') is done by the front end and the synthesis
of the code is taken care by the back-end. The information in the form
of intermediate languages is represented that is generally independent
any particular architecture and thus serves as a link to transfer the
information from front-end to the back-end. The advantage with this
approach is that, the front-end can be same and only the back-end
needs to be changed to support new architectures and hence the effort
is limited only in rewriting the back-end of the compiler. Intermediate
languages also help in general optimization of the code that are not
particular to any platform and provides many related advantages. So
most of the compilers employ intermediate code.
If intermediate languages have been in usage for a long time in
compilers, why is this renewed interest?

The difference is that previously most of the compilers hid the use of
intermediate languages as ‘implementation details’. But now, to make
the best use of them, they are extended, exposed and used in
conjunction with interpretation.

An advantage with the intermediate language approach is ‘binary code


portability’, i.e., the next level of portability promised by languages
like C/C++ that provides source code portability. As long as an
interpreter is there to understand and execute the intermediate code,
no recompilation is necessary to move the code to a new platform. It
also provides the ability to verify code and thus type-safety can be
ensured and as a result, ability to build security systems on that.

Intermediate languages can also provide the advantage of ‘language


interoperability’. As more than one language can be converted to the
same intermediate language, the code compiled through two different
high-level languages can interact at intermediate code level agnostic of
the different possible source languages. A related benefit is that the
number of translators that needs to be written is also drastically
reduced. This is a significant benefit as the cost of retargeting the
translator is high and writing a new one is very tedious. As a result
only a few translators need to be developed.

To illustrate, in traditional compilation approach, for N high-level


languages and M target machines, N * M translators are required. With
intermediate language approach, where the IL is capable of supporting
N high-level languages and M target platforms, then the number of
translators required effectively reduces to just N + M.

Apart from portability and interoperability, there are many features


that make the use of intermediate languages attractive. One of the
advantages that seems to be insignificant at the first look is that the
file size tends to be small compared to the equivalent native code.
That is one of the reasons why intermediate code approach was
popular around 1980s when disk and memory space was precious. The
small size of the code is advantageous even now: in the networked
environment, the small size of applets (nothing but Java class files)
makes their transmission across the network faster.
Approaches in Representing Intermediate Code

A challenge for an intermediate language is to satisfy seemingly two


contradicting goals:

1) To represent the source code without significant loss of


information

2) To keep only the necessary information so that the source code


can be translated directly and efficiently to the target machine
code later.

Depending on the level of abstraction of the intermediate language, we


can classify them as high-level, medium-level or low-level ILs.

Usually the high-level ILs are represented as trees or DAGs (Directed


Acyclic Graphs) and sometimes as stack based. The middle-level ILs
typically use approaches like triples and quadruples. The low-level ILs
generally use code closer to the assembly language of the target
machine.

High-level ILs takes more effort to get the code converted to the
native code, but are sufficiently high-level that it can represent rich
features of various source languages directly. So it is possible to
retarget new languages to high-level ILs. However, more effort is
needed in converting the intermediate code to native code. On another
extreme, low-level ILs make the translation to native code easier, but
it is generally hard to retarget other high-level languages than that it
is originally intended for.

Another design approach is to abstract the programming language (for


example, ANDF) or the machine (for example, RTL) and an IL has to
strike a balance between the two extremes.

There are many methods used to represent the information at


intermediate level. Some widely used approaches are discussed here.

Three-address code

In most of the language compilers, three address codes are generated


as intermediate code and later code-generators generate code for that
particular target machine. They are mostly register based, and are
fairly low-level, so target-code generation is straightforward. A similar
approach is Register Transfer Language (RTL) widely used in GNU C
compilers. Following is an example for three address code:

R1 := R2 + R3
And the general syntax is.
X := Y op Z

Object files

An interesting approach to represent information at intermediate level


is object files. With each platform, the format of object code file and
the information stored differs. This makes them platform dependent.
An effort has been made to combine the information that is required in
various formats. It resulted in creating a huge format that contains
all such details, and is referred to as 'fat binaries' effectively used in
NExtStep platforms. On the other hand, 'slim binaries' are also
available that generates the portable object code 'on-the-fly'.

Tree representations

ANDF [2,3] stands for Architectural Neutral Distribution Format


abstracts high-level language constructs. It aims at converting various
language constructs in a programming language to an intermediate
form that shall retain all the information available in the source. It
takes care to represent different language semantics, platform
dependencies etc. Languages including C, C++, Fortran, Ada has ANDF
producers (high-level compilers). This is achieved through an
intermediate representation of the language constructs (keeping its
variants in various programming languages). For example, expressions
in the source language are converted into EXPs, identifiers as TAGs,
parameterization as TOKENs etc. ANDF follows a tree representation of
constructs called CAPSULEs. It also has installers (low-level compilers)
in wide variety of platforms including Dec-Alpha, Sun-Sparc, PowerPC
and Intel and executed as native code.

Using another High Level Language!

Instead of creating a new intermediate language, this approach is


towards using a high-level language as a intermediate language that is
already available, highly portable, efficient and sufficiently low-level.
As you would have guessed, C is an interesting high-level language
with ability to write low-level code. Owing to this property, it is
sometimes referred to as a portable assembler or 'middle-level'
language. The beauty of C is that, in spite of its low-level abilities, the
code is highly portable. Due to this nature, C has effectively served as
a form of a portable intermediate language for other high-level
programming languages. In fact, in initial days of UNIX, translators
generating C code as target code was written so that re-writing them
for every platform were not required. Many languages follow this
approach, including Mercury, Scheme, APL, Haskell etc. Also GNU
Fortran and Pascal compilers (f2c and p2c) generate C code and native
C compilers take over to compile and execute them.

Stack-based machines

This representation assumes the presence of a run-time stack and


generates code for that. It uses the stack for evaluation of expressions
by making use of stack itself to store the intermediate values. Thus
the underlying concept is very simple and that is its main strength.
Also, least assumptions are made about the hardware and support
available in the target architecture. When a virtual machine (runtime
interpreter) simulates a stack-based machine and provides necessary
resources, this approach can effectively provide platform-
independence. In other words, the virtual machine becomes a platform
of its own that shall be simulated in possibly any hardware platform
(and thus needs to be written for those platforms). One problem with
this approach however is that the code optimization is difficult. Java
virtual machine is a very good example for such stack-based
representation.

Other major efforts include Oberon Module Interchange (OMI) [4],


'Juice'[5] that is based on OMI, Dynamic binding and VP code of TAO
operating system and U-Code.

Implementations based on stack approach

Since stack based machines are one of the most successful ones and is
gaining widespread acceptance in recent times, let us discuss few
implementations based on that in detail here.

P-code:

P-system was popular in 1970-80s as an operating system that made


use of p-code [6]. Compilers would emit p-code, an architectural
neutral one that was executed by a virtual machine. UCSD Pascal was
the most popular compiler that generated p-code and also the whole
p-system was written in UCSD Pascal itself making it easier to port to
new architectures with much ease.
Pascal implementations contained a compiler generating p-code and
later an interpreter will take over to execute that p-code. The
interpreter was very simple and so it was easy to write an interpreter
for a new architecture. With that interpreter, a Pascal compiler in the
form of p-code can be used to compile Pascal programs that can in-
turn be run on the interpreter.

However, this steps need to be done only in initial steps. Later the
compiler can be modified to generate native code, and once, the
Pascal compiler itself is given as input for the Pascal compiler in p-
code, the compiler becomes available in native code. This process is
referred to as 'bootstrapping', that is generally used in porting
compilers to new environments. But the point here is that this
'bootstrapping' process becomes easy and elegant with the p-code
approach. This let to the popularity and widespread use of Pascal
compilers.

Java bytecodes:

Java[7] was originally intended to be used in embedded systems and


set-top boxes and thus a heterogeneous environment. Internet was
becoming very popular at that time and there was no language that
best suited that heterogeneous environment. Java suited for that
purpose very well and thus became a sort of lingua franca of Internet.

The technology is as we already saw of intermediate languages. Java


compiler converts the source into intermediate code - bytecode and
pack with related information in class files. These class files (applets or
applications) can be distributed across the network and executed in
any platform provided that a Java Virtual Machine [8] is available
there. The advantage of this technology is, as everybody knows now,
'write once, run anywhere'. Programmers need not worry about the
target platform. They just concentrate on the problem and develop the
code.

To illustrate, for each and every class or interface defined, a class file
is generated. Every method is compiled into byte codes - the
instructions for the hypothetical microprocessor. For example:

int a = b + 2;

this may be converted to the bytecodes as,

iload_1 // push the variable 1 (int type) to evaluation stack


iconst_2 // push the constant value 2 (int type) on the stack

iadd // pop two integer values from stack, add them, push the result back.

istore_2 // pop the int value from the stack and store it in variable 2 // (int type)

All the information that you give inside the class definition are
converted and get stored to form an intermediate class file.

The intermediate class file format is a specification given by Sun. It is


a complex format and holds the information like the methods declared
along with the byte codes, the class variables used, initializing values
(if any), super class info, signatures of variables and methods, etc. It
has a constant pool - a per class runtime data structure, which is
similar to the symbol table created by the compiler.

Virtual machine plays a major role in Java technology. All the Java
virtual machines have the same behavior, as defined by SUN, but the
implementation differs. It forms a layer over the physical machine in
the memory.

Even though the Java technology provides the basis for achieving full
portability, Java’s design aids to make it architectural neutral. To
illustrate, lets compare it with a C design consideration with that of
Java. C gives much importance to efficiency and for that, it leaves size
of its data types (except char) to be implementation-dependent. But
Java primitive types are of predetermined size, independent of the
implementation. In general, Java improves upon C by removing the
constructs having various behavioral types in C by having well defined
behavior instead.

JVMs are available in very wide variety of platforms. However, Java


class file format and bytecodes are closely tied with the Java language
semantics. So, it becomes tough to retarget other language compilers
to generate Java class files. Few of the languages including Eiffel, Ada
and Python have compilers generating class files to be executed by
JVMs. Since the class files and bytecodes are not versatile enough to
accommodate wide range of languages, class files (and bytecodes)
doesn’t fare as a feasible universal intermediate language.

.NET Architecture

.NET architecture addresses an important need - language


interoperability, the concept that can change the way programming is
done and is significantly different from what Java offers. If Java came
as a revolution providing platform independence, .NET has language
interoperability to offer.

To illustrate, JVM is implemented for multiple platforms including


Windows, Linux, Solaris, OS/2 etc. But it has only few languages
targeting at JVM to generate class files. On the other hand, .NET
supports multiple languages generating code (MSIL – Microsoft
Intermediate Language) targeting Common Language Runtime (CLR).
This list includes, C#, VB, C, C++, COBOL, etc. and nevertheless this
list is an impressive one. CLR, the runtime interpreter of .NET platform
(equivalent of JVM of Java) is currently implemented only for Windows
platform and efforts are underway to implement it for Linux and other
platforms. A Java program can be compiled on a PC and can be
executed by a Mac or a Mainframe. Whereas with .NET, we can write a
code in COBOL, extend it in VB and deploy it as a component. Thus the
Java and .NET have fundamentally different design goals. However,
you may have noticed that they partially satisfy the two different
requirements for universal intermediate language: to support different
source languages and target architectures.

In .NET, the unit of deployment is the PE (Portable Executable) file - a


predefined binary standard (similar to class files of Java). It is made
up of collection of modules, exported types and resources and is put
together as an .exe file. It is very useful in versioning, deploying,
sharing, etc. The modules in PE file are known as assemblies.

The IL is designed such that it can accommodate wide range of


languages. Also at the heart of the .NET is the type system where the
types are declared and the metadata of how to use the types is also
stored. One important difference between MSIL and bytecodes is that
MSIL is type-neutral. For example, iconst_0 is a bytecode that tells to
push integer value 0 on the stack, meaning that the type information
is kept in the bytecodes. But the similar IL code just tells to push four
bytes, meaning that no type information is passed on.

With .NET you can write a class in C#, extend it in VB and instantiate
it in managed C++. This gives you the ability to work in diversified
languages depending on your requirement and still benefit from the
unified platform to which the code for any .NET compliant language is
compiled into.

Summary
The benefits of using intermediate languages are well known, and the
current trend is towards achieving fully portable code and language
interoperability. To understand the concept of portable code
compilation, the understanding of the conventional translation
technology and their advantages/disadvantages needs to be known.
There are many issues involved in intermediate code design and
understanding them shall enable us to appreciate the various efforts to
generate portable code.

Way back in 1950s itself, the first effort towards achieving portability
was taken through UNCOL initiative. Later, during 1970-80s, Pascal in
the form of p-code picked up the idea. In 1995s Java came as a
revolution to capture the imagination of the language world. Now it
is .NET’s turn with the age-old concept of language interoperability.
Thanks to such technological advances, achieving 100% portable code
with language interoperability is no more a dream. It is clear that the
concept of intermediate languages is here to stay for a long time and
we are not far from having a universal intermediate language.

References:

[1] see ftp://riftp.osf.org/pub/andf/andf_coll_papers/UNCOLtoANDF.ps


[2] see ANDF home page: http://www.andf.net/
[3] refer 'Architecture Neutral Distribution Format (XANDF)', Xopen
Co. Ltd., 1996
[4] see http://huxley.inf.ethz.ch/~marais/OMI.html
[5] see http://www.ics.uci.edu/~juice/
[6] see http://www.threedee.com/jcm/psystem/
[7] see http://java.sun.com/
[8] refer James Gosling, Bill Joy, Guy Steele, "The Java Language
Specification", Addison-Wesley, 1996
[9] refer Tim Lindolm, Frank Yellin, "The Java Virtual Machine
Specification", Addison-Wesley, 1997
[10] ECMA C# specification, Microsoft Corporation, 2001

[11] "The Development of the C language", Dennis M. Ritchie, Second


History of programming Languages conference, Cambridge, Mass.
1993

Sidebar 1:

Approaches in improving the performance of the intermediate


code in Java/.NET

One very obvious disadvantage of the combination of compilation and


interpretation is the speed factor. Although the code is compiled, still
an interpreter needs to be employed to execute the code and typically
the execution speed is reduced to the factor of 10 to 30 times to the
equivalent of the native C/C++ code. There are many suggested ways
to improve this speed to achieve the speed of "native code". The
remaining part of the article explores the approaches that are made to
overcome this problem with Java and .NET in particular.

An interesting approach is to avoid the interpretation at the target


platform and do compilation again. This can avoid the slowness
inherent in the interpretation approach, as translation is required each
and every time the code needs to be executed. So an alternative
approach of compiling the intermediate code again into native code to
execute can be thought of. This makes the life of the virtual machine
tougher but can improve the performance and it should be noted that
the approach should make sure that the platform independence is not
lost for this performance gain.

One such approach is Just-In-Time compilation. The aim is to compile


the code when the code is called and 'boot-strap' that the compiled
native code be used the next time the code is called. Since, in general,
90% of the time is spent in executing 10% of the code, this approach
can reap rich dividends. JIT compilers are also referred to as "load and
go" compilers as they don’t write the target code into a file. When a
method is called for the first time, it is compiled into native code and
kept in memory that the compiled code may be used next time the
method is called again. Java uses this approach of compilation on
demand extensively and many of commercial JVMs are JITters. Note
that, JVMs are allowed to follow the interpretative approach or JIT and
hence it is an optional feature for performance improvement. For .NET,
the use of JIT is mandatory and thus all MSIL code is JITted to native
code before execution.

Aggressive optimizations are also possible to make that are not


possible with static optimization techniques, since:

i) precious runtime information is available for doing


optimization

ii) depending on the situation and user requirement, only


particular parts of the code will be called and doing aggressive
optimizations on that part of the code and bootstrapping it shall
improve the performance considerably.
iii) since optimization is done the frequently called code, it shall
provide improved performance in general than the general
optimization without any profiler information.

However it should be noted that, the performance of JITters is still not


on par with the equivalent C/C++ code, but it is considerably better
than the equivalent interpreted code. Also, in environments like
embedded systems where RAM is precious, it may not be possible to
bootstrap the methods for later use.

Providing ‘Ahead of time’ recompilation is yet another approach for


improving the performance through native code. A compiler reads the
information in the Java class files and generates a file that contains the
original bytecodes and the native code equivalent for various
platforms. The resultant file is sometime referred to as "fat" class file
(refer to the "fat" binaries in the main text) and depending on the
platform, the native code shall be invoked. This can effectively avoid
the necessity to recompile every time the code needs to be executed
and at the same time not losing portability.

Another approach is Ahead of time compilation. When the file is loaded


into the memory itself, all the code gets compiled into native code. The
advantage with this approach is that the code shall be compiled before
any method is called, as opposed to JIT approach and hence is
sometimes referred to as PreJIT. So, it doesn't suffer the problem of
overhead with JIT compilation. This can thus effectively reduce the
time for start-up and execution, as the code is ready of execution in
form of native code.

So it seems that Ahead of time compiling is a very good alternative for


JIT, but that is not the end of the story. The virtual machine still needs
the meta-data about the classes, for example for supporting features
like reflection. If no meta-data is maintained, then reflection cannot be
supported which would be a serious disadvantage. Moreover, if the
environment or security settings change, then the code needs to be
compiled again. Thus PreJIT is not much promising approach. .NET
supports PreJIT with its support with its NGEN tool in Visual
Studio.NET

One of the interesting ideas to overcome the performance problem is


implementing the interpreter in hardware. That is what the JavaChip
intends to do - the bytecodes shall be directly executed by a
microprocessor designed specifically for that. Since Java bytecode are
not too low level to be implemented at chip level, still a JVM runs over
that, with the difference that the bytecodes are now directly executed
by the hardware. This idea too, is reminiscent of the old idea in IBM
machines that is discussed in the main text.

Sidebar - 2

C - A retrospect on its evolution and design decision

C is known for its high performance through generation of native code.


Its interesting to note that its predecessor B didn't generate native
code, instead it generated a stack based intermediate code called as
threaded code. When Ritchie wanted to write his operating system
with his new language, he found that it was handicapped to use the
interpreted approach. To overcome this problem, Thomson tried by
offering a 'virtual B' compiler that still had an interpreter, but it was
too slow for practical purposes. So, to make C sufficiently low level
that it can be used to write operating systems and compilers, Ritchie
abandoned threaded code approach and instead generated native
code. So this drastic change in the translation approach itself was
influenced by two main problems with interpretation: the runtime
overhead of interpretation and performance. In retrospect, had C
continued the B's tradition of interpretive approach, would it have
become such a huge success?

All rights reserved. Copyright Jan 2004.

Vous aimerez peut-être aussi