M. Anton Ertl
TU Wien
David Gregg
Trinity College, Dublin
ABSTRACT
1. INTRODUCTION
Different programming language implementation approaches provide different tradeoffs with respect to the following criteria:
Ease of implementation
Portability (retargetability)
Compilation speed
Execution speed

General Terms
Languages, Performance, Experimentation

Keywords
Interpreter, branch target buffer, branch prediction, code replication, superinstruction

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI'03, June 9-11, 2003, San Diego, California, USA.
Copyright 2003 ACM 1-58113-662-5/03/0006 ...$5.00.
Interpreters are a popular language implementation approach that can be very good at the first three criteria, but they have an execution speed disadvantage: an interpreter designed for efficiency typically suffers a factor-of-ten slowdown
for general-purpose programs over native code produced by
an optimizing compiler [10].1 In this paper we investigate
how to improve the execution speed of interpreters.
Existing efficient interpreters perform a large number of
indirect branches (up to 13% of the executed instructions).
Mispredicted branches are expensive on modern processors
(e.g., they cost about 10 cycles on the Pentium III and
Athlon and 20 cycles on the Pentium 4). As a result, interpreters can spend more than half of their execution time
recovering from indirect branch mispredictions [7]. Consequently, improving the indirect branch prediction accuracy
has a large effect on interpreter performance.
The best indirect branch predictor in widely available
processors is the branch target buffer (BTB). Most current
desktop and server processors have a BTB or similar structure: all Pentiums, Athlon, Alpha 21264, Itanium 2. BTBs
mispredict 50%–63% of the executed indirect branches in threaded-code interpreters and 81%–98% in switch-based interpreters [7].
In this paper, we look at software ways to improve the
prediction accuracy. The main contributions of this paper
are:
We propose the new technique of replication (Section 4.1) for eliminating mispredictions.
1 For library-intensive special-purpose programs the speed difference is usually much smaller. Not all interpreters are designed for efficiency on general-purpose programs, and some may produce slowdowns by a factor > 1000 [15]. Unfortunately, many people draw incorrect general conclusions about the performance of interpreters from such examples.
Paper and BibTeX entry are available at http://www.complang.tuwien.ac.at/papers/. This paper was published in: SIGPLAN '03 Conference on Programming Language Design and Implementation (PLDI '03).
typedef enum {
  add /* ... */
} Inst;

void engine()
{
  static Inst program[] = { add /* ... */ };
  Inst *ip = program;
  int *sp;

  for (;;)
    switch (*ip++) {
    case add:
      sp[1] = sp[0] + sp[1];
      sp++;
      break;
    /* ... */
    }
}
2. BACKGROUND
Figure: Threaded code. Each cell of the VM code (imul, iadd, iadd, ...) holds the address of the VM instruction routine that implements it. The dispatch code in GNU C is:

next_inst = *ip;
ip++;
goto *next_inst;

which compiles to the following Alpha assembly:

ldq  s2,0(s1)   ;load next VM instruction
addq s1,0x8,s1  ;increment VM instruction pointer
jmp  (s2)       ;jump to next VM instruction

Figure: A BTB entry maps the address of a branch instruction to its predicted target.
3.
Figure: BTB behavior with switch dispatch for the VM program "label: A B A GOTO label". All dispatches go through the single indirect branch of the switch, so they share one BTB entry and, in this loop, every prediction is wrong:

BTB entry   prediction   actual
switch      A            B
switch      B            A
switch      A            GOTO
switch      GOTO         A
Figure: BTB behavior with threaded-code dispatch for the same VM program. Each VM instruction routine has its own dispatch branch and thus its own BTB entry; only the entry for A, which dispatches alternately to B and to GOTO, still mispredicts:

BTB entry   prediction   actual
br-A        GOTO         B
br-B        A            A
br-A        B            GOTO
br-GOTO     A            A
Figure 5: BTB behavior with replication: the two occurrences of A are replaced by the replicas A1 and A2 (VM program "label: A1 B A2 GOTO label"). Each occurrence now has its own dispatch branch and BTB entry, and all predictions are correct:

BTB entry   prediction   actual
br-A1       B            B
br-B        A2           A2
br-A2       GOTO         GOTO
br-GOTO     A1           A1
Figure 6: BTB behavior with the superinstruction B_A (VM program "label: A B_A GOTO label"). Every VM instruction now occurs only once in the loop, so all predictions are correct:

BTB entry   prediction   actual
br-A        B_A          B_A
br-B_A      GOTO         GOTO
br-GOTO     A            A
4.2 Superinstructions
Combining several VM instructions into superinstructions
is a technique that has been used for reducing VM code size
and for reducing the dispatch and argument access overhead
in the past [14, 13, 10, 8]. However, its effect on branch
prediction has not been investigated in depth yet.
In this paper we investigate the effect of superinstructions
on dispatch mispredictions; in particular, we find that using superinstructions reduces mispredictions far more than
it reduces dispatches or executed native instructions (see
Section 7.3).
To get an idea why this is the case, consider Figure 6: we combine the sequence B A into the superinstruction B_A. This superinstruction occurs only once in the loop, and A now also occurs only once, so there are no mispredictions after the first iteration while the interpreter executes the loop.
5. IMPLEMENTATION
5.1 Static Approach
There are two ways of implementing replication and superinstructions (see Fig. 7).
In the static approach the interpreter writer produces
replicas and/or superinstructions at interpreter build-time,
typically by generating C code for them with a macro processor or interpreter generator (e.g., vmgen supports static
superinstructions [8]). During VM code generation (at interpreter run-time) the interpreter front-end just selects among
the built-in replicas and/or superinstructions.
For static replication two plausible ways to select the copy
come to mind: round-robin (i.e., always select the statically
least-recently-used copy) and random. We tried both approaches in our simulator, and achieved better results for
round-robin, so we use that in the rest of the paper. Our
explanation for the better results with round-robin selection is spatial locality in the code; execution does not jump
around in the code at random, but tends to stay in a specific
region (e.g., in a loop), and there it is less likely to encounter
the same replica twice with round-robin selection. E.g., in
our example loop we will get the perfect result (Fig. 5) if we
have at least two replicas of A and use round-robin selection,
whereas random selection might use the same replica of A
twice and thus produce 50% mispredictions.
For static superinstructions one can use dynamic programming (shortest-path algorithm) to select the optimal
(minimum) number of superinstructions for a given basic
block [2]. A simpler alternative is the greedy (maximum
munch) algorithm. In the rest of the paper we use the greedy
algorithm (because we have not yet implemented dynamic
programming); preliminary simulation results indicate there
is almost no difference between the results for optimal and
greedy selection.
Figure 7: Static replication vs. dynamic replication. The code segment holds the VM routine originals: machine code for iadd followed by a dispatch-next sequence, machine code for iload followed by a dispatch-next sequence; replicas are copies of these routines.
5.3 Comparison
The main advantage of the static approach is that it is
completely portable, whereas the dynamic approach requires
a small amount of platform-specific code.
Another advantage of static superinstructions is that their
code can be optimized across their component instructions,
whereas dynamic superinstructions simply concatenate the
components without optimization. In particular, static superinstructions can keep stack items in registers across components, and combine the stack pointer updates of the components. In addition, static superinstructions make it possible to use instruction scheduling across component VM
instructions. These advantages can also be exploited in a
dynamic setting by combining static superinstructions with
dynamic superinstructions and dynamic replication.
6. EXPERIMENTAL SETUP
Figure 9: Benchmark programs.

Program    Version  Lines  Description
gray       4        754    parser generator
bench-gc   1.1      1150   garbage collector
tscp       0.4      1625   chess
vmgen      0.5.9    2068   interpreter generator
cross      0.5.9    2735   Forth cross-compiler
brainless  0.0.2    3519   chess
brew       38       29804  evolutionary programming
6.1 Implementation
We implemented the techniques described in Section 4 in
Gforth, a product-quality Forth interpreter.
In particular, we implemented static superinstructions using vmgen [8]; we implemented static replication by replicating the code for the (super)instructions on interpreter startup instead of at interpreter build-time; in all other respects this implementation behaves like normal static replication (i.e., the replication is not specific to the interpreted program, unlike dynamic replication). This was easier to implement, allowed us to use more replication configurations (in particular, more replicas), and should produce the same results as normal static replication (except for the copying overhead, whose impact was small compared to the benchmark run-times).
We implemented the dynamic methods much as described in Section 5.2, with free choice (through command-line flags) of replication, superinstructions, or both, and of superinstructions within basic blocks or across them. By using this machinery with a VM interpreter that includes static superinstructions, we can also explore the combination of static superinstructions (with optimizations across component instructions) and the dynamic methods.
One thing we have not implemented is eliminating the increments of the VM instruction pointer along with the rest of the instruction dispatch in dynamic superinstructions. However, by using static superinstructions in addition to dynamic superinstructions and replication we also reduce these increments (among other optimizations); judging from those results, eliminating only the increments probably does not have much effect. It would also conflict with superinstructions across basic blocks.
6.2 Machines
We used an 800MHz Celeron (VIA Apollo Pro chipset,
512MB PC100 SDRAM, Linux-2.4.7, glibc-2.2.2, gcc-2.95.3)
for most of the results we present here. The reason for
this choice is that the Celeron has a relatively small I-cache
(16KB), L2 cache (128KB), and BTB (512 entries), so any
negative performance impacts of the code growth from our
techniques should become visible on this processor.
Figure: Speedups over plain threaded code (baseline 1.0, y-axis up to 3.0) on gray, bench-gc, tscp, vmgen, cross, brainless, and brew for the configurations plain, static repl, static super, static both, dynamic repl, dynamic super, dynamic both, and across bb.
with static super First, combine instructions within a basic block into static superinstructions (with 400 superinstructions) with greedy selection, then form dynamic superinstructions across basic blocks with replication from that. This combines the speed benefits of
static superinstructions (optimization across VM instructions) with the benefits of dynamic superinstructions with replication.
6.3 Benchmarks
Figure 9 shows the benchmarks we used for our experiments. The line counts include libraries that are not preloaded in Gforth, but not what would be considered input files in languages with a hard compile-time/run-time boundary (e.g., the grammar for gray, and the program to be compiled for cross), as far as we could tell the difference.
7. RESULTS
7.2 Speedups
plain Threaded code; this is used as the baseline of our
comparison (factor 1).
static repl Static replication with 400 replicas and round-robin selection.
static super 400 static superinstructions with greedy selection.
static both 35 unique superinstructions, 365 replicas of instructions and superinstructions (for a total of 400).
dynamic repl Dynamic replication.
dynamic super Dynamic superinstructions without replication, limited to basic blocks (very similar to what
Piumarta and Riccardi proposed [13]).
dynamic both Dynamic superinstructions, limited to basic blocks, with replication.
across bb Dynamic superinstructions across basic blocks,
with replication.
Figure: Performance counter results for the configurations above on one benchmark, normalized to the scales cycles *500M, taken_branches *50M, icache_misses *100k, code_bytes *250k, instructions *250M, taken_mispredicted *50M, and miss_cycles *500M (y-axis 0.0 to 1.0).

Figure: The same performance counters on a larger benchmark, with scales cycles *60G, taken_branches *6G, icache_misses *200M, code_bytes *1M, instructions *30G, taken_mispredicted *6G, and miss_cycles *60G.
Figure: Cycles as a function of the number of replicas and superinstructions (0, 25, 50, 100, 200, 400, 800, 1600; y-axis up to 400M) and of the mix between superinstructions (0%-100%) and replicas (100%-0%).
Figure: Speedups over plain threaded code for across bb and for the native-code Forth compilers bigForth and iForth:

Program    across bb  bigForth  iForth
tscp       2.98       5.13      3.51
brainless  2.49       2.73
brew       2.17       0.92
8. RELATED WORK
Replication for conditional branches differs from our work in all aspects: our work addresses a dynamic indirect branch predictor (the BTB) instead of a static conditional branch predictor. Replication for conditional branches works at compile time and is based on profiling to find correlations between branches that replication can exploit, and no data is affected; in contrast, our replication changes the representation of the interpreted program, deciding at program startup time which replicas to use.
Better indirect branch predictors than BTBs have been
proposed in a number of papers [4, 5, 11] and they work well
on interpreters [7], but they are not available in hardware
yet, and it will probably take a long time before they are
universally available, if at all.
There are a number of recent papers on improving interpreter performance [14, 6, 13, 16]. Software pipelining
the interpreter [9, 10] is a way to reduce the branch dispatch costs on architectures with delayed indirect branches
(or split indirect branches).
Ertl and Gregg [7] investigated the performance of various branch predictors on interpreters, but did not investigate
means to improve the prediction accuracy beyond threaded
code. In a similar vein, Romer et al. [15] investigated the
performance characteristics of several interpreters. They
used inefficient interpreters, and thus did not notice that
efficient interpreters spend much of their time on dispatch
branches.
Papers dealing with superoperators and superinstructions [14, 13, 10, 8] concentrated on reducing the number of executed dispatches and sometimes the VM code size, but did not evaluate the effect of superinstructions on BTB prediction accuracy (apart from two paragraphs in [8]). In particular, Piumarta and Riccardi invested extra work to avoid replication (in order to reduce code size), but this increases mispredictions on processors with BTBs.
9. CONCLUSION
If a VM instruction occurs several times in the working
set of an interpreted program, a BTB will frequently mispredict the dispatch branch of the VM instruction. We present
two techniques for reducing mispredictions in interpreters: replicating VM instructions, such that hopefully each replica occurs only once in the working set (speedup of up to a factor of 2.39 over an efficient threaded-code interpreter); and combining sequences of VM instructions into superinstructions (speedup of up to a factor of 2.45). In combination these techniques achieve an even greater speedup (up to a factor of 3.17).
There are two variants of these optimizations: The static
variant creates the replicas and/or superinstructions at interpreter build-time; it produces less speedup (up to a factor
of 1.99), but is completely portable. The dynamic variant
creates replicas and/or superinstructions at interpreter runtime; it produces very good speedups (up to a factor of
3.09), but requires a little bit of porting work for each new
platform. The dynamic techniques can be combined with static superinstructions for even greater speed (up to a factor of 3.17).
The speedup of an optimization has to be balanced against
the cost of implementing it. In the present case, in addition
to giving good speedups, the dynamic methods are relatively
easy to implement (a few days of work). Static replication
with a few static superinstructions is also pretty easy to
implement for a particular interpreter.
Acknowledgements
We thank the referees for their helpful comments. The performance counter measurements were made using Mikael Pettersson's perfctr package.
10. REFERENCES
[1] J. R. Bell. Threaded code. Commun. ACM, 16(6):370–372, 1973.
[2] T. C. Bell, J. G. Cleary, and I. H. Witten. Text Compression. Prentice-Hall, 1990.
[3] B. Calder and D. Grunwald. Reducing branch costs via branch alignment. In Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), pages 242–251, 1994.
[4] K. Driesen and U. Hölzle. Accurate indirect branch prediction. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), pages 167–178, 1998.
[5] K. Driesen and U. Hölzle. Multi-stage cascaded prediction. In Euro-Par '99 Conference Proceedings, volume 1685 of LNCS, pages 1312–1321. Springer, 1999.
[6] M. A. Ertl. Stack caching for interpreters. In SIGPLAN '95 Conference on Programming Language Design and Implementation, pages 315–327, 1995.
[7] M. A. Ertl and D. Gregg. The behaviour of efficient virtual machine interpreters on modern architectures. In Euro-Par 2001, pages 403–412. Springer LNCS 2150, 2001.
[8] M. A. Ertl, D. Gregg, A. Krall, and B. Paysan. vmgen: a generator of efficient virtual machine interpreters. Software: Practice and Experience, 32(3):265–294, 2002.