Rv8 - RISC-V Simulator For x86-64

14/04/2018 rv8 - RISC-V simulator for x86-64
rv8
RISC-V simulator for x86-64
rv8 is a RISC-V simulation suite comprising a high performance x86-64 binary translator, a user mode
simulator, a full system emulator, an ELF binary analysis tool and ISA metadata:
rv-jit - user mode x86-64 binary translator

rv-sim - user mode system call proxy simulator
rv-sys - full system emulator with soft MMU
rv-bin - ELF disassembler and histogram tool
rv-meta - code and documentation generator
About | Installation | Optimisations | Benchmarks | Logging | Tracing | Histograms | Linux
About
The rv8 simulator suite contains libraries and command line tools for creating instruction opcode maps, C
headers and source containing instruction set metadata, instruction decoders, a JIT assembler, LaTeX
documentation, a metadata based RISC-V disassembler, a histogram tool for generating statistics on RISC-V
ELF executables, a RISC-V proxy syscall simulator, a RISC-V full system emulator that implements the RISC-
V 1.9.1 privileged speci cation and an x86-64 binary translator.
The current version is available at: https://github.com/rv8-io/rv8
RISC-V to x86-64 binary translator
The rv8 binary translation engine works by interpreting code while pro ling it for hot paths. Hot paths are
translated on the y to native code. The translation engine maintains a call stack to allow runtime inlining of
hot functions. A jump target cache is used to accelerate returns and indirect calls through function pointers.
The translator supports hybrid binary translation and interpretation to handle instructions that do not have
native translations. Currently ‘IM’ code is translated and ‘AFD’ is interpreted. The translator supports RVC
compressed code.
The rv8 binary translator supports a number of simple optimisations:
Hybrid interpretation and compilation of hot code paths
https://rv8.io/ 1/28
Incremental translation with dynamic trace linking

Inline caching of function calls in hot code paths
L1 jump target cache for indirect calls and returns
Macro-op fusion for common RISC-V instruction sequences
RISC-V full system emulator
The rv8 suite includes a full system emulator that implements the RISC-V privileged ISA with support for
interrupts, MMIO (memory mapped input output) devices, a soft MMU (memory management unit) with
separate instruction and data TLBs (translation lookaside buffers). The full system emulator has a simple
integrated debugger that allows setting breakpoints, single stepping and disassembling instructions as they
are executed.
The rv8 full system emulator has the following features:
RISC-V IMAFD Base plus Privileged ISA (riscv32 and riscv64)

Simple integrated debug command line interface
Histograms (registers, instruction and program counter frequency)
Soft MMU supporting sv32, sv39, sv48 page translation modes
Abstract MMIO device interface for device emulation
Extensible interpreter generated from ISA metadata
Protected address space
RISC-V user mode simulator
The rv8 user mode simulator is a single address space implementation of the RISC-V ISA that implements a
subset of the RISC-V Linux syscall ABI (application binary interface) and delegates system calls to the
underlying native host operating system. The user mode simulator can run RISC-V Linux binaries on non-
Linux operating systems via system call emulation. The current user mode simulator implements a small
number of Linux system calls to allow running RISC-V Linux ELF static binaries.
The rv8 user mode simulator has the following features:
RISC-V IMAFD Base ISA (riscv32 and riscv64)

Simple integrated debug command line interface
A small set of emulated Linux system calls for simple le IO
Extensible interpreter generated from ISA metadata
Instruction logging mode for tracing program execution
Shared address space
0x000000000000 - 0x000000000fff (zero)
0x000000001000 - 0x7ffdffffffff (guest)
0x7ffe00000000 - 0x7fffffffffff (host)
RISC-V instruction set metadata
The rv-bin tool contains a meta-data driven disassembler and a histogram tool for analysing static register
usage and static instruction usage.
The rv-meta tool is able to generate opcode maps, instruction decoders, source, headers and instruction set
listing LaTeX from ISA metadata. The following is an example of PDF output:
RISC-V instruction set metadata is available here. The linked page shows an example RISC-V Instruction Set
Reference generated by the rv-meta tool. rv8 has a simple extensible generator framework that allows
re ection on the instruction set metadata to generate a number of different output formats.
Installation
rv8 supports the following target architecture and host operating system combinations:
Target
RV32IMAFDC
RV64IMAFDC
Privilged ISA 1.9.1
Host
Linux (Debian 9.0 x86-64, Ubuntu 16.04 x86-64, Fedora 25 x86-64) (stable)
macOS (Sierra 10.11 x86-64) (stable)
FreeBSD (11 x86-64) (alpha)
Please read the RISC-V toolchain installtion instructions in the riscv-gnu-toolchain repository. To
experiment with the RISC-V toolchain online try the RISC-V Compiler Explorer.
Building riscv-gnu-toolchain
$ sudo apt-get install autoconf automake autotools-dev curl \

libmpc-dev libmpfr-dev libgmp-dev gawk build-essential \
bison flex texinfo gperf libtool patchutils bc zlib1g-dev
$ git clone https://github.com/riscv/riscv-gnu-toolchain.git
$ cd riscv-gnu-toolchain
$ git submodule update --init --recursive
$ ./configure --prefix=/opt/riscv/toolchain
$ make
rv8 has minimal external dependencies besides a C++14 compiler, the C/C++ standard libraries and the
asmjit submodule.
gcc-5.4 or clang-3.4 (Linux minimum)

gcc-6.3 or clang-3.8 (Linux recommended)
Building rv8
$ git clone https://github.com/rv8-io/rv8.git

$ cd rv8
$ git submodule update --init --recursive
$ make
$ sudo make install
Running rv8
The riscv64-unknown-elf newlib toolchain is required for building the rv8 test cases and this build step depends
on the RISCV environment variable.
$ cd rv8
$ export RISCV=/opt/riscv/toolchain
$ make test-build
$ make test-sim
$ make test-sys
$ rv-jit build/riscv64-unknown-elf/bin/test-dhrystone
Optimisations
The rv8 binary translator performs JIT (Just In Time) translation of RISC-V code to X86-64 code. This is a
challenging problem for many reasons; with the principle challange due to RISC-V having 31 integer
registers while x86-64 has only 16 integer registers.
Register allocation
rv8 solves the register set size problem by using a static register allocation and spilling registers to memory
(L1 cache) (a future versions may use dynamic register allocator). A signi cant amount of performance is
lost due to register allocations that take advantage of the larger number of available registers and less
frequent stack spills. It is not possible for the translator to rearrange memory and registers for optimal
stack spills as memory accesses must be translated precisely. The additional registers are translated as x86-
64 memory operands (which produce load and store micro-ops) or in some circumstances, explicit mov
instructions.
RV x86 RV x86 RV x86 RV x86

ra → rdx t1 → rdi a2 → r10 a5 → r13
sp → rbx a0 → r8 a3 → r11 a6 → r14
t0 → rsi a1 → r9 a4 → r12 a7 → r1
The remaining unallocated registers are stored in a memory spill area accessed using the rbp register. e.g.
qword [rbp+0xF8] would be used to access t4.
Translator temporaries
The rv8 translator needs to use several host registers to point to translator internal structures and for use
as temporary registers for the emulation of many instructions, for example a store instructions require the
use of two temporary registers if both register operands are in the spill area. The translator uses the
following x86-64 host registers as temporaries leaving 12 registers available for mapping to RISC-V
registers:
rbp - pointer to the register spill area and jump target cache
rsp - pointer to the host stack to allow procedure calls
rax - translator temporary register
rcx - translator temporary register
CISC vs RISC operands
The rv8 translator makes use of CISC memory operands to access registers residing in the memory backed
register spill area, which resides in L1 cache. The complex memory operands end up being cracked back into
micro-ops in the CISC pipeline however the use of complex memory operands helps increase instruction
density, which increases performance due to better use of I$ (instruction cache).
There are many combinations of instruction expansions depending on whether a register is mapped to a live
register, is memory backed and whether there are two or three operands. A three operand RISC-V
instruction is translated into a move and a destructive two operand x86-64 instruction. Temporary
registers are used if both operands are memory backed. The principle is to maintain the densest possible
mapping to the x86-64 ISA.
Memory operands are used to access registers in the spill area:
Indirect call acceleration
Indirect calls through function pointers cannot be statically translated as the target address of their
translation is not known at the time of translation. rv8 employs a trace cache which is a hashtable of guest
program addresses to native code addresses. A full trace cache lookup is relatively slow because it requires
saving caller-save registers and calling into C++ code. To accelerate indirect calls through function pointers,
a small assembly stub looks up the target address in a sparse 1024 entry direct mapped L1 translation
cache, and falls back to a slow translation cache miss path that saves registers and calls into the translator
code to populate the L1 translation cache so that the next indirect call can be accelerated.
The direct mapped L1 translation cache is indexed by bits[10:1] of the guest address. Bit zero can be ignored
because RISC-V instructions must start on a 2-byte boundary
Inline caching
Returns also make use of the L1 translation cache, however a procedure call made inside of a hot trace can
be inlined. The translator maintains a call stack to keep track of return addresses. Upon reaching an inlined
procedure RET (jalr zero, ra) instruction, the link register (ra in RISC-V, rdx in the x86 translation) is compared
against the callers known return address and if it matches, control ow continues along the return path. In
the case that the function is not inlined, the regular L1 translation cache is used to lookup the address of the
translated code.
An inlined subroutine call needs to test the return address:
Branch tail dynamic linking
The translator performs lazy translation of the source program during tracing and when it reaches
branches, it can only link both sides of the branch if there exists an existing translation for the not taken side
of the branch. To accelerate branch tail exits, the translator emits a relative branch to a trampoline that
returns to the tracer main loop, and the tracer adds the branch to a table of branch xup addresses indexed
by target guest address. If the branch target is hot, once it has been translated, all relative branches that
point to tail exit trampolines will be relinked to branch directly to the translated native code.
Macro-op fusion
The rv8 translator implements an optimisation known as macro-op fusion whereby speci c patterns of
adjacent instructions are translated into a smaller sequence of host instructions. The macro-op fusion
pattern matcher has potential to increase performance further with the addition of common patterns. The
following is a list of macro-op fusion patterns that are currently implemented in rv8:
AUIPC rs, imm20; ADDI rd, rs1, imm12;

PC-relative address is resolved using a single MOV instruction.
AUIPC t0, imm20; JALR ra, imm12(t0);

Target address is constructed using a single MOV instruction.
AUIPC ra, imm20; JALR ra, imm12(ra);

Target address register write is elided.
AUIPC rd, imm20; LW rd, imm12(rs1);

Fused into single MOV with an immediate addressing mode
AUIPC rd, imm20; LD rd, imm12(rs1);

Fused into single MOV with an immediate addressing mode
SLLI rd, rs1, 32; SRLI rd, rs1, 32;

Fused into a single MOVZX instruction.
ADDIW rd, rs1, imm12; SLLI rd, rs1, 32; SRLI rd, rs1, 32;
Fused into 32-bit zero extending ADD instruction.
SRLI r1, rs, imm12; SLLI r2, rs, 64 - imm12; OR r1, r1, r2;
Fused into 64-bit ROR with one residual SHL or SHR temporary
SRLIW r1, rs, imm12; SLLIW r2, rs, 32 - imm12; OR r1, r1, r2;
Fused into 32-bit ROR with one residual SHL or SHR temporary
A technique known as deoptimisation can be employed to allow elision of temporary registers in macro-op
fusion patterns assuming the translator sees the register killed within its translation window.
Deoptimisation requires that the optimised translation has an accompanying deoptimisation sequence to
ll in elided register values, and this is played back in the case of a fault (device or debug interrupt) so that
the visible machine state precisely matches that which the ISA dictates. rv8 does not presently implement
deoptimisation, however it may be necessary to allow more sophisticated optimisations.
Sign extension versus zero extension
In addition to the register allocation problem, rv8 has to make sure that 32-bit operations on registers are
sign extended instead of zero-extended. The normal behaviour of 32-bit operations on x86-64 is to zero
extend bit 31 to bit 63 whereas RISC-V sign extends bit 31 to bit 63. One potential optimisation is lazy sign
extension. It may be possible in a future version of the JIT translation engine to elide redundant sign
extension operations, however it is important that the register state precisely matches the semantics of the
ISA before executing an instruction that may cause a fault e.g. loads and stores.
Example of sign-extended vs zero-extended 32-bit arithmetic on RISC-V and x86-64:
Bit manipulation intrinsics
The bencharks below contain digest algorithms and ciphers which can take advantage of bit manipulation
instructions such as rotate and bswap. Present day compilers detect rotate and byte swap bitwise logical
operations by matching intermediate representation patterns that can be lowered directly to bit
manipulation instructions such as ROR, ROL, BSWAP on x86-64. This approach has the bene t of accelerating
code that does not use inline assembly or compiler builtin functions. RISC-V currently lacks bit manipulation
instructions however there are proposals to add them in the B extension. The following is a typical byte
swap pattern.
32-bit integer byteswap pattern (4 constants, 4 shifts, 4 ands, 3 ors)

((rs1 >> 24) & 0x000000ff) | ((rs1 << 8 ) & 0x00ff0000) |
((rs1 >> 8 ) & 0x0000ff00) | ((rs1 << 24) & 0xff000000)
rv8 implements rotate macro-op fusion which can translate two shift instructions and one OR instructions
with the correct offsets into one shift and one rotate. The rotate macro-op fusion needs to create the
residual temporary register side effects so that the register le contents are precisely matched, as it can’t
easily prove the residual temporary register is not later used. Deoptimisation would be required to elide the
temporary register.
32-bit rotate right or left pattern (2 shifts, 1 or)

(rs1 >> shamt) | (rs1 << (32 - shamt))
64-bit rotate right or left pattern (2 shifts, 1 or)

(rs1 >> shamt) | (rs1 << (64 - shamt))
Measurement
A future goal is to quantify the factors that contribute to the performance differences between native x86-
64 code and translated RISC-V code, so future benchmarks should measure:
Sign extension overhead

Indirect call/return overhead
Macro-op fusion performance improvement
Register allocation and spilled register accesses
Work per instruction (fused and unfused micro-ops)
Benchmarks
The following section contains benchmark runtime and instructions per second results comparing the
QEMU and rv8 JIT engines against native x86. This section also contains runtime neutral results comparing
total retired RISC-V instructions to x86 micro-ops. The benchmark programs are compiled for aarch64,
arm32, riscv64, riscv32, x86-64 and x86-32. See the Benchmarks Results page for the complete result set
including optimisation level comparisons, macro-op fusion performance, executable le sizes, dynamic
register and instruction usage charts.
Benchmark source
The following sources have been used to run the benchmarks:
rv8 - https://github.com/rv8-io/rv8/
rv8-bench - https://github.com/rv8-io/rv8-bench/
qemu-riscv - https://github.com/riscv/riscv-qemu/
musl-riscv-toolchain - https://github.com/rv8-io/musl-riscv-toolchain/
Benchmark metrics
The following benchmark metrics have been plotted and tabulated:
Runtimes
Instructions Per Second
Benchmark details
The rv8-bench benchmark suite contains the following test programs:
Benchmark Type Description

aes crypto encrypt, decrypt and compare 30MiB of data
bigint numeric compute 23 ^ 111121 and count base 10 digits
dhrystone synthetic synthetic integer workload
miniz compression compress, decompress and compare 8MiB of data
norx crypto encrypt, decrypt and compare 30MiB of data
primes numeric calculate largest prime number below 33333333
qsort sorting sort array containing 50 million items
sha512 digest calculate SHA-512 hash of 64MiB of data
Compiler details
The following compiler architectures, versions, compile options and runtime libraries are used to run the
benchmarks:
Architecture Compiler C Library Compile options

x86-32 GCC 7.1.0 musl libc '-O3', '-O2' and '-Os'
x86-64 GCC 7.1.0 musl libc '-O3', '-O2' and '-Os'
riscv32 GCC 7.1.0 musl libc '-O3', '-O2' and '-Os'
riscv64 GCC 7.1.0 musl libc '-O3', '-O2' and '-Os'
arm32 GCC 7.2.0 musl libc '-O3', '-O2' and '-Os'
aarch64 GCC 7.1.0 musl libc '-O3', '-O2' and '-Os'
Measurement details
Dynamic instruction counts are measured using rv-sim -E

Benchmarks are run 20 times and the best result is taken
All programs are compiled as position independent executables (-fPIE)
Host: Intel® 6th-gen Core™ i7-5557U Broadwell (3.10-3.40GHz, 4MB cache)
x86-64 μops measured with
perf stat -e cycles,instructions,r1b1,r10e,r2c2,r1c2 <cmd>
Runtimes
Runtime results comparing qemu, rv8 and native x86:
rv8-bench (Runtime -O3 64-bit)
10
native-x86-64-O3-runtime
rv8-riscv64-O3-runtime
qemu-riscv64-O3-runtime
qemu-aarch64-O3-runtime
8
6
Runtime (secs)
0
aes bigint dhrystone miniz norx primes qsort sha512
Figure 1: Benchmark runtimes -O3 64-bit
Runtime 64-bit -O3 (seconds)
program qemu-aarch64 qemu-riscv64 rv8-riscv64 native-x86-64

aes 1.31 2.16 1.49 0.32
bigint 1.38 1.08 0.71 0.38
dhrystone 0.98 0.57 0.20 0.10
miniz 2.66 2.21 1.53 0.77
norx 0.60 1.17 0.99 0.22
primes 2.09 1.26 0.65 0.60
qsort 7.38 4.76 1.21 0.64
sha512 0.64 1.24 0.81 0.24
(Sum) 17.04 14.45 7.59 3.27
Performance Ratio 64-bit -O3 (smaller is better)

aes 4.12 6.76 4.68 1.00
bigint 3.62 2.83 1.85 1.00

dhrystone 9.96 5.87 2.03 1.00
miniz 3.46 2.86 1.99 1.00
norx 2.73 5.33 4.51 1.00
primes 3.49 2.11 1.09 1.00
qsort 11.55 7.46 1.90 1.00
sha512 2.66 5.13 3.36 1.00
(Geomean) 4.44 4.39 2.40 1.00
10
qemu-aarch64-O2-runtime
8
6
Runtime (secs)
0

aes 1.32 2.18 1.49 0.32
bigint 1.34 1.03 1.44 0.38
dhrystone 1.77 1.06 0.23 0.12
miniz 2.72 2.22 1.52 0.77
norx 0.66 1.16 1.08 0.22

primes 2.11 1.25 0.66 0.59
qsort 7.35 4.74 1.19 0.62
sha512 0.68 1.32 0.96 0.24
(Sum) 17.95 14.96 8.57 3.26

aes 4.13 6.84 4.68 1.00
bigint 3.51 2.70 3.76 1.00
dhrystone 14.73 8.81 1.95 1.00
miniz 3.52 2.87 1.96 1.00
norx 2.94 5.20 4.81 1.00
primes 3.58 2.13 1.12 1.00
qsort 11.79 7.60 1.91 1.00
sha512 2.80 5.45 3.96 1.00
(Geomean) 4.75 4.64 2.69 1.00
rv8-bench (Runtime -Os 64-bit)
10
native-x86-64-Os-runtime
rv8-riscv64-Os-runtime
qemu-riscv64-Os-runtime
8 qemu-aarch64-Os-runtime
6
Runtime (secs)
0
Figure 3: Benchmark runtimes -Os 64-bit
Runtime 64-bit -Os (seconds)

aes 1.22 1.91 1.26 0.37
bigint 1.60 1.40 2.85 0.38
dhrystone 5.42 2.59 1.28 0.39
miniz 2.74 2.24 1.73 0.83
norx 1.58 1.53 0.96 0.24
primes 1.97 1.23 0.74 0.59
qsort 7.99 5.27 0.90 0.66
sha512 0.64 1.14 0.67 0.25
(Sum) 23.16 17.31 10.39 3.71
Performance Ratio 64-bit -Os (smaller is better)

aes 3.29 5.16 3.39 1.00
bigint 4.22 3.70 7.53 1.00
dhrystone 13.97 6.66 3.30 1.00
miniz 3.30 2.70 2.09 1.00
norx 6.59 6.36 4.00 1.00
primes 3.31 2.07 1.25 1.00
qsort 12.20 8.05 1.37 1.00
sha512 2.56 4.58 2.68 1.00
(Geomean) 5.07 4.49 2.75 1.00
10
qemu-arm32-O3-runtime
8
6
Runtime (secs)
0
program qemu-arm32 qemu-riscv32 rv8-riscv32 native-x86-32

aes 1.70 1.89 1.47 0.48
bigint 2.98 1.37 1.41 0.88
dhrystone 1.17 1.11 0.39 0.28
miniz 2.99 2.17 1.41 0.88
norx 0.77 0.77 0.78 0.26
primes 4.20 2.34 1.89 1.51
qsort 8.39 4.56 1.15 0.70
sha512 3.82 2.91 1.92 0.63
(Sum) 26.02 17.12 10.42 5.62

aes 3.56 3.97 3.09 1.00
bigint 3.39 1.56 1.61 1.00

dhrystone 4.13 3.91 1.37 1.00
miniz 3.40 2.47 1.60 1.00
norx 2.98 2.96 3.00 1.00
primes 2.79 1.55 1.25 1.00
qsort 12.04 6.54 1.65 1.00
sha512 6.04 4.60 3.04 1.00
(Geomean) 4.23 3.09 1.95 1.00
10
qemu-arm32-O2-runtime
8
6
Runtime (secs)
0

aes 1.73 1.88 1.41 0.48
bigint 2.73 1.46 1.76 0.85
dhrystone 2.19 1.80 0.41 0.36
miniz 2.98 2.16 1.36 0.88
norx 0.84 0.78 0.83 0.27

primes 4.32 2.31 1.91 1.54
qsort 10.53 4.55 1.16 0.68
sha512 3.78 3.95 2.19 0.57
(Sum) 29.10 18.89 11.03 5.63

aes 3.59 3.92 2.93 1.00
bigint 3.19 1.71 2.06 1.00
dhrystone 6.11 5.02 1.13 1.00
miniz 3.39 2.46 1.54 1.00
norx 3.10 2.87 3.04 1.00
primes 2.81 1.50 1.24 1.00
qsort 15.44 6.67 1.70 1.00
sha512 6.64 6.92 3.84 1.00
(Geomean) 4.63 3.37 2.00 1.00
rv8-bench (Runtime -Os 32-bit)
10
native-x86-32-Os-runtime
rv8-riscv32-Os-runtime
qemu-riscv32-Os-runtime
8 qemu-arm32-Os-runtime
6
Runtime (secs)
0
Figure 6: Benchmark runtimes -Os 32-bit
Runtime 32-bit -Os (seconds)

aes 1.62 1.57 1.13 0.50
bigint 3.62 1.80 3.21 1.02
dhrystone 4.71 2.31 1.43 0.58
miniz 3.06 2.20 1.56 1.26
norx 1.38 1.18 1.00 0.32
primes 4.46 2.20 2.74 1.38
qsort 8.70 5.01 0.81 0.77
sha512 3.52 2.69 2.20 0.79
(Sum) 31.07 18.96 14.08 6.62
Performance Ratio 32-bit -Os (smaller is better)

aes 3.24 3.15 2.26 1.00
bigint 3.55 1.76 3.14 1.00
dhrystone 8.14 4.00 2.48 1.00
miniz 2.43 1.74 1.24 1.00
norx 4.32 3.70 3.12 1.00
primes 3.22 1.59 1.98 1.00
qsort 11.24 6.47 1.05 1.00
sha512 4.47 3.41 2.78 1.00
(Geomean) 4.47 2.90 2.11 1.00
Instructions Per Second
Instructions per second in millions comparing qemu, rv8 and native x86:
rv8-bench (MIPS -O3 64-bit)
native-x86-64-O3-mips
14000
rv8-riscv64-O3-mips
qemu-riscv64-O3-mips
12000
MIPS (Millions of Instructions Per Second)
10000
8000
6000
4000
2000
0
Figure 5: Millions of Instructions Per Second -O3 64-bit
Instructions per second (MIPS) qemu, rv8 and native 64-bit -O3
program qemu-riscv64-mips rv8-riscv64-mips native-x86-mips

aes 2414 3395 11035
bigint 3738 5712 10557
dhrystone 1843 5274 8369
miniz 2625 3622 5530
norx 2223 2167 9112
primes 2438 4421 6100
qsort 644 2518 5780
sha512 2982 4556 12177
(Geomean) 2149 3769 8232
rv8-bench (MIPS -Os 64-bit)
native-x86-64-Os-mips
14000
rv8-riscv64-Os-mips
qemu-riscv64-Os-mips
12000
10000
8000
6000
4000
2000
0
Figure 6: Millions of Instructions Per Second -Os 64-bit
Instructions per second (MIPS) qemu, rv8 and native 64-bit -Os

aes 2655 3879 10072
bigint 3973 1955 12873
dhrystone 1250 2462 9073
miniz 2650 3427 5052
norx 1817 2439 8852
primes 2226 3555 6101
qsort 572 3340 6063
sha512 3269 5567 12206
(Geomean) 2008 3175 8355
rv8-bench (MIPS -O3 32-bit)
native-x86-32-O3-mips
14000
rv8-riscv32-O3-mips
qemu-riscv32-O3-mips
12000
10000
8000
6000
4000
2000
0
Figure 7: Millions of Instructions Per Second -O3 32-bit
Instructions per second (MIPS) qemu, rv8 and native 32-bit -O3

aes 2442 2851 9634
bigint 3964 3835 9780
dhrystone 1998 5667 3747
miniz 2195 3379 4988
norx 2824 2554 9146
primes 3039 3652 6368
qsort 671 2658 6259
sha512 2773 3671 11074
(Geomean) 2259 3428 7186
rv8-bench (MIPS -Os 32-bit)
native-x86-32-Os-mips
14000
rv8-riscv32-Os-mips
qemu-riscv32-Os-mips
12000
10000
8000
6000
4000
2000
0
Figure 8: Millions of Instructions Per Second -Os 32-bit
Instructions per second (MIPS) qemu, rv8 and native 32-bit -Os

aes 2832 3565 9472
bigint 3856 2166 9817
dhrystone 1435 2345 8479
miniz 2177 3062 4171
norx 1965 1970 8129
primes 2928 2355 7105
qsort 576 3528 5892
sha512 2901 3131 8396
(Geomean) 2063 2702 7441
Logging
rv-sim and rv-sys support the ability to log instructions (--log-instructions), register values (--log-operands) and rv-sys
can log page table walks (--log-pagewalks).
Sample output from rv-sim with the --log-instructions option
$ rv-sim -l build/riscv64-unknown-elf/bin/hello-world-pcrel
0000000000000000000 core-0 :0000000000010078 (4501 ) mv a0, zero
0000000000000000001 core-0 :000000000001007a (00000597) auipc a1, pc + 0
0000000000000000002 core-0 :000000000001007e (02658593) addi a1, a1, 38
0000000000000000003 core-0 :0000000000010082 (4631 ) addi a2, zero, 12
0000000000000000004 core-0 :0000000000010084 (4681 ) mv a3, zero
0000000000000000005 core-0 :0000000000010086 (04000893) addi a7, zero, 64
Hello World
0000000000000000006 core-0 :000000000001008a (00000073) ecall
0000000000000000007 core-0 :000000000001008e (4501 ) mv a0, zero
0000000000000000008 core-0 :0000000000010090 (4581 ) mv a1, zero
0000000000000000009 core-0 :0000000000010092 (4601 ) mv a2, zero
0000000000000000010 core-0 :0000000000010094 (4681 ) mv a3, zero
0000000000000000011 core-0 :0000000000010096 (05d00893) addi a7, zero, 93
Tracing
The rv-jit program supports the ability to log RISC-V instructions along with the dynamically translated x86-
64 assembly and machine code (--log-jit-trace). This mode is useful for JIT translation debugging and
optimisation analysis.
Sample output from rv-jit with the --log-jit-trace option
# 0x0000000000103d70 addi a0, zero, 1

mov r8, 1 ; 41B801000000
L3:
# 0x0000000000103d74 slli a0, a0, 28
shl r8, 1C ; 49C1E01C
L4:
# 0x0000000000103d78 addi a1, zero, -1
mov r9, FFFFFFFFFFFFFFFF ; 49C7C1FFFFFFFF
L5:
# 0x0000000000103d7c sb a1, 0(a0)
rex mov byte [r8], r9b ; 458808
L6:
# 0x0000000000103d80 lbu a2, 0(a0)
movzx r10d, byte [r8] ; 450FB610
L7:
Histograms
The rv-sim and rv-sys programs support the ability to record and print histograms. Program counter
frequency (--pc-usage-histogram), dynamic instruction frequency (--instruction-usage-histogram) and dynamic register
usage (--register-usage-histogram) is supported.
The rv-bin program via the histogram subcommand has the ability to print static instruction frequency and
static register usage.
Sample output from rv-sim with the --register-usage-histogram and --instruction-usage-histogram options
$ rv-sim --register-usage-histogram \
--instruction-usage-histogram \
build/riscv64-unknown-elf/bin/test-aes
integer register file

~~~~~~~~~~~~~~~~~~~~~
ra :0x00000000000197a4
sp :0x000000007fffff68 gp :0x0000000000020b18
tp :0xe87f5200d8d3e2fd t0 :0x0000000000011808
t1 :0x0000000000040000 t2 :0x000000000000035c
s0 :0x0000000000000000 s1 :0x0e9e1c7894d54be1
a0 :0x0000000000000000 a1 :0x0000000000000000
a2 :0x0000000000000000 a3 :0x0000000000000000
a4 :0x0000000000000000 a5 :0x0000000000000000
a6 :0x0000000006021440 a7 :0x000000000000005d
s2 :0x94a7319b493ab93c s3 :0xb643788d224af6fb
s4 :0x7269d597ce6e1df9 s5 :0x5f9113739f9b0d72
s6 :0x404c0e868734cf0c s7 :0x2aea8c1ef338fa59
s8 :0xff41772b6673e771 s9 :0x7a5a4806d4a6fe41
s10 :0x0a595462adddb5a8 s11 :0xda895865c86e2f07
t3 :0x0000000000000001 t4 :0x0000000000000000
t5 :0x0000000004000000 t6 :0x0000000000000000
control and status registers

~~~~~~~~~~~~~~~~~~~~~~~~~~~~
instret : 5148514055 time :0x000290f90f8a9e3b
pc :0x000000000001944e fcsr :0x00000000
register usage histogram

~~~~~~~~~~~~~~~~~~~~~~~~
1. a5 15.11% [1769998250] #######################################
2. a2 13.35% [1564476600] ##################################
3. a7 10.03% [1174405210] #########################
4. a4 9.15% [1071646586] #######################
5. a6 7.50% [878706827] ###################
6. t4 5.12% [599785551] #############
7. t6 4.91% [574619648] ############
8. t1 4.65% [545259607] ############
9. a1 2.26% [264241924] #####
10. s1 2.15% [251658621] #####
11. s0 2.02% [236978689] #####
12. s8 2.01% [234881068] #####
13. t5 1.79% [209715245] ####
14. t0 1.68% [197132294] ####
15. s2 1.63% [190840979] ####
16. a0 1.61% [188744347] ####
17. s4 1.50% [176160835] ###
18. s5 1.50% [176160833] ###
19. s3 1.47% [171966617] ###
20. sp 1.43% [167773141] ###
21. s6 1.43% [167772235] ###
22. t3 1.43% [167772225] ###
23. s7 1.43% [167772223] ###
24. t2 1.36% [159383552] ###
25. zero 0.86% [100664377] ##

26. s9 0.75% [88080392 ] #
27. a3 0.68% [79693333 ] #
28. s11 0.68% [79691785 ] #
29. s10 0.43% [50331663 ] #
30. ra 0.07% [8388996 ]
31. gp 0.00% [90 ]
instruction usage histogram

~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. lw 16.37% [843055543] #######################################
2. xor 13.85% [713031972] ################################
3. add 13.69% [704643870] ################################
4. slli 13.20% [679477645] ###############################
5. srliw 10.75% [553648296] #########################
6. andi 9.94% [511705332] #######################
7. srli 3.05% [157286473] #######
8. lbu 2.93% [150995155] ######
9. addi 2.69% [138412722] ######
10. addiw 2.44% [125829177] #####
11. sd 1.79% [92275184 ] ####
12. sb 1.47% [75497501 ] ###
13. jal 1.14% [58720476 ] ##
14. beq 1.06% [54526059 ] ##
15. ld 0.98% [50332182 ] ##
16. and 0.98% [50331715 ] ##
17. slliw 0.98% [50331682 ] ##
18. bne 0.90% [46137534 ] ##
19. or 0.65% [33554442 ] #
20. auipc 0.41% [20971539 ]
21. lui 0.24% [12583002 ]
22. mulw 0.16% [8388608 ]
23. lwu 0.16% [8388608 ]
24. jalr 0.08% [4194443 ]
25. sraiw 0.08% [4194314 ]
26. sw 0.00% [213 ]
27. bltu 0.00% [78 ]
28. bge 0.00% [39 ]
29. blt 0.00% [33 ]
30. bgeu 0.00% [33 ]
31. sub 0.00% [29 ]
...
Linux
This section describes how to build and boot a Linux image in the full system emulator.
Please read the RISC-V toolchain installation instructions in the riscv-gnu-toolchain repository. The riscv64-
unknown-elf newlib toolchain is required for building the rv8 test cases and the riscv64-unknown-linux-gnu glibc
toolchain is required for building busybox which is used to create the Linux image that runs in the full system
emulator.
The toolchain script will download any required dependencies.
Building riscv-gnu-toolchain for linux
$ cd riscv-gnu-toolchain
$ make linux
The linux build script will download any required dependencies.
Building linux, busybox and bbl-lite
$ cd rv8
$ export RISCV=/opt/riscv/toolchain
$ export PATH=${PATH}:${RISCV}/bin
$ make linux
To start linux, we execute bbl (the Berkeley Boot Loader) which performs early machine set up and then
passes control to an embedded linux kernel. After kernel initialisation, busybox is then executed from the
initramfs as pid 1 (init). The linux image and the initramfs are combined together and linked into bbl as the
boot payload.
Running the full system emulator
$ rv-sys build/riscv64-unknown-elf/bin/bbl
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
vvvvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr vvvvvvvvvvvvvvvvvvvvvv
rr vvvvvvvvvvvvvvvvvvvvvv
rr vvvvvvvvvvvvvvvvvvvvvvvv rr
rrrr vvvvvvvvvvvvvvvvvvvvvvvvvv rrrr
rrrrrr vvvvvvvvvvvvvvvvvvvvvv rrrrrr
rrrrrrrr vvvvvvvvvvvvvvvvvv rrrrrrrr
rrrrrrrrrr vvvvvvvvvvvvvv rrrrrrrrrr
rrrrrrrrrrrr vvvvvvvvvv rrrrrrrrrrrr
rrrrrrrrrrrrrr vvvvvv rrrrrrrrrrrrrr
rrrrrrrrrrrrrrrr vv rrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrr rrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrr rrrrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrrrr rrrrrrrrrrrrrrrrrrrrrr
INSTRUCTION SETS WANT TO BE FREE

[ 0.000000] Linux version 4.6.2-00044-g250754b-dirty (mclark@minty) (gcc version 7.1.1 20170509 (GCC) ) #2 Sat Jul 1 22:42:14 NZST 201
[ 0.000000] bootconsole [early0] enabled
[ 0.000000] Available physical memory: 1020MB
[ 0.000000] Initial ramdisk at: 0xffffffff800100b0 (261728 bytes)
[ 0.000000] Zone ranges:
[ 0.000000] Normal [mem 0x0000000080400000-0x00000000bfffffff]

[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000080400000-0x00000000bfffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000080400000-0x00000000bfffffff]
[ 0.000000] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 257550
[ 0.000000] Kernel command line: earlyprintk=sbi-console rdinit=/sbin/init
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.000000] Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
[ 0.000000] Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
[ 0.000000] Sorting __ex_table...
[ 0.000000] Memory: 1025684K/1044480K available (1701K kernel code, 125K rwdata, 488K rodata, 324K init, 230K bss, 18796K reserved, 0
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[ 0.000000] NR_IRQS:0 nr_irqs:0 0
[ 0.000000] clocksource: riscv_clocksource: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 191126044627 ns
[ 0.000000] Calibrating delay loop (skipped), value calculated using timer frequency.. 20.00 BogoMIPS (lpj=100000)
[ 0.000000] pid_max: default: 32768 minimum: 301
[ 0.010000] Mount-cache hash table entries: 2048 (order: 2, 16384 bytes)
[ 0.010000] Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes)
[ 0.010000] devtmpfs: initialized
[ 0.010000] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[ 0.010000] NET: Registered protocol family 16
[ 0.010000] clocksource: Switched to clocksource riscv_clocksource
[ 0.010000] TCP established hash table entries: 8192 (order: 4, 65536 bytes)
[ 0.010000] TCP bind hash table entries: 8192 (order: 4, 65536 bytes)
[ 0.010000] TCP: Hash tables configured (established 8192 bind 8192)
[ 0.010000] UDP hash table entries: 512 (order: 2, 16384 bytes)
[ 0.010000] UDP-Lite hash table entries: 512 (order: 2, 16384 bytes)
[ 0.010000] Unpacking initramfs...
[ 0.010000] console [sbi_console0] enabled
[ 0.010000] console [sbi_console0] enabled
[ 0.010000] bootconsole [early0] disabled
[ 0.010000] bootconsole [early0] disabled
[ 0.010000] futex hash table entries: 256 (order: 0, 6144 bytes)
[ 0.010000] workingset: timestamp_bits=61 max_order=18 bucket_order=0
[ 0.010000] 9p: Installing v9fs 9p2000 file system support
[ 0.010000] io scheduler noop registered
[ 0.010000] io scheduler cfq registered (default)
[ 0.010000] 9pnet: Installing 9P2000 support
[ 0.010000] Freeing unused kernel memory: 324K (ffffffff80000000 - ffffffff80051000)
[ 0.010000] This architecture does not have kernel memory protection.
BusyBox v1.26.1 (2017-06-30 16:46:44 NZST) built-in shell (ash)

Enter 'help' for a list of built-in commands.
/#
References
RISC-V Foundation
RISC-V ISA Speci cation

RISC-V Privileged ISA Speci cation
RISC-V GNU Toolchain
RISC-V Open Source Software Implementations
RISC-V Assembler Reference
RISC-V Instruction Set Reference
Hosted on GitHub Pages — Theme by orderedlist

Rv8 - RISC-V Simulator For x86-64

Transféré par

Informations du document

Titre original

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Rv8 - RISC-V Simulator For x86-64

Transféré par

Droits d'auteur :

Formats disponibles

14/04/2018 rv8 - RISC-V simulator for x86-64

rv-jit - user mode x86-64 binary translator

About | Installation | Optimisations | Benchmarks | Logging | Tracing | Histograms | Linux

The current version is available at: https://github.com/rv8-io/rv8

RISC-V to x86-64 binary translator

The rv8 binary translator supports a number of simple optimisations:

Hybrid interpretation and compilation of hot code paths

Incremental translation with dynamic trace linking

RISC-V full system emulator

The rv8 full system emulator has the following features:

RISC-V IMAFD Base plus Privileged ISA (riscv32 and riscv64)

RISC-V user mode simulator

The rv8 user mode simulator has the following features:

RISC-V IMAFD Base ISA (riscv32 and riscv64)

RISC-V instruction set metadata

$ sudo apt-get install autoconf automake autotools-dev curl \

gcc-5.4 or clang-3.4 (Linux minimum)

$ git clone https://github.com/rv8-io/rv8.git

RV x86 RV x86 RV x86 RV x86

CISC vs RISC operands

Memory operands are used to access registers in the spill area:

Indirect call acceleration

An inlined subroutine call needs to test the return address:

Branch tail dynamic linking

AUIPC rs, imm20; ADDI rd, rs1, imm12;

AUIPC t0, imm20; JALR ra, imm12(t0);

AUIPC ra, imm20; JALR ra, imm12(ra);

AUIPC rd, imm20; LW rd, imm12(rs1);

AUIPC rd, imm20; LD rd, imm12(rs1);

SLLI rd, rs1, 32; SRLI rd, rs1, 32;

Sign extension versus zero extension

Example of sign-extended vs zero-extended 32-bit arithmetic on RISC-V and x86-64:

Bit manipulation intrinsics

32-bit integer byteswap pattern (4 constants, 4 shifts, 4 ands, 3 ors)

32-bit rotate right or left pattern (2 shifts, 1 or)

64-bit rotate right or left pattern (2 shifts, 1 or)

Sign extension overhead

The following sources have been used to run the benchmarks:

The following benchmark metrics have been plotted and tabulated:

The rv8-bench benchmark suite contains the following test programs:

Benchmark Type Description

Architecture Compiler C Library Compile options

x86-64 GCC 7.1.0 musl libc '-O3', '-O2' and '-Os'

riscv32 GCC 7.1.0 musl libc '-O3', '-O2' and '-Os'

riscv64 GCC 7.1.0 musl libc '-O3', '-O2' and '-Os'

arm32 GCC 7.2.0 musl libc '-O3', '-O2' and '-Os'

aarch64 GCC 7.1.0 musl libc '-O3', '-O2' and '-Os'

Dynamic instruction counts are measured using rv-sim -E

Runtime results comparing qemu, rv8 and native x86:

rv8-bench (Runtime -O3 64-bit)

Figure 1: Benchmark runtimes -O3 64-bit

Runtime 64-bit -O3 (seconds)

program qemu-aarch64 qemu-riscv64 rv8-riscv64 native-x86-64

Performance Ratio 64-bit -O3 (smaller is better)

program qemu-aarch64 qemu-riscv64 rv8-riscv64 native-x86-64

program qemu-aarch64 qemu-riscv64 rv8-riscv64 native-x86-64

rv8-bench (Runtime -O2 64-bit)

Figure 2: Benchmark runtimes -O2 64-bit

Runtime 64-bit -O2 (seconds)

program qemu-aarch64 qemu-riscv64 rv8-riscv64 native-x86-64

program qemu-aarch64 qemu-riscv64 rv8-riscv64 native-x86-64

Performance Ratio 64-bit -O2 (smaller is better)

program qemu-aarch64 qemu-riscv64 rv8-riscv64 native-x86-64