
2013 IEEE Sixth International Conference on Cloud Computing

Evaluation of a Server-Grade Software-Only ARM Hypervisor
Alexey Smirnov (alexey@itri.org.tw), Mikhail Zhidko (mikhail@itri.org.tw), Yingshiuan Pan (yspan@itri.org.tw),
Po-Jui Tsao (pjtsao@itri.org.tw), Kuang-Chih Liu (kcliu@itri.org.tw), Tzi-Cker Chiueh (tcc@itri.org.tw)
Industrial Technology Research Institute, Hsinchu, Taiwan
Abstract: Because of its enormous popularity in embedded systems and mobile devices, the ARM CPU is arguably the most widely used CPU in the world. The resulting economies of scale entice system architects to ponder the feasibility of building lower-cost and lower-power-consumption servers using ARM CPUs. In modern data centers, especially those built to host cloud applications, virtualization is a must, so how to support virtualization on ARM CPUs becomes a major issue for constructing ARM-based servers. Although the latest versions of the ARM architecture (Cortex-A15 and beyond) provide hardware support for virtualization, the majority of ARM-based SOCs (system-on-chip) currently available on the market do not. This paper presents the results of an evaluation study of a fully operational hypervisor that successfully runs multiple VMs on an ARM Cortex-A9-based server, which is architecturally non-virtualizable, and supports VM migration. This hypervisor features several optimizations that significantly reduce the performance overhead of virtualization, including physical memory remapping and batching of sensitive/privileged instruction emulation.

I. INTRODUCTION

Today the compute server market is dominated by x86-based CPUs. The resulting servers are relatively power-hungry and expensive. In contrast, the ARM CPU has been the dominant CPU in embedded systems and mobile devices because of its emphasis on low-power design and lower cost. As a result, ARM-based server boards are emerging from a number of hardware vendors such as Marvell [1], Calxeda [2] and ZT Systems [3]. These boards are based on ARM Cortex-A9 CPUs, the most popular ARM CPUs commercially available today, which however lack server-grade hypervisor support. Compared with their x86 counterparts, these server boards may have lower absolute performance, but their lower cost and lower power consumption allow them to fare well in terms of metrics such as the amount of work done per watt and the amount of work done per dollar.

Because cloud data centers use server virtualization technology extensively in their resource allocation and management, an immediate requirement for ARM-based servers is that they provide the same server virtualization capability available on current x86 servers. Although the latest ARM architecture, Cortex-A15 and beyond, does provide architectural support for virtualization similar to the VT technology [4] in x86, production-grade SOCs based on this new architecture remain unavailable. Judging from the market adoption rate of recent generations of ARM CPUs, it is expected to take several years for ARM CPUs with hardware virtualization support to become mainstream in the market place. In the meantime, a hypervisor that provides virtualization support for currently available ARM CPUs is needed to fill the gap. Without such a hypervisor, it is impossible for servers using current-generation ARM CPUs to effectively compete with x86 servers.

Because current-generation ARM CPUs do not have hardware virtualization support, we set out to build a production-grade software-only hypervisor that guarantees isolation between VMs and between VMs and the hypervisor, minimizes the amount of paravirtualization required on the guest OSs, and, most importantly, incurs acceptable performance overhead. The resulting hypervisor is the ITRI ARM hypervisor.

Although there exist hypervisors for ARM-based mobile devices [5], [6], [7] or embedded systems [8], none of them satisfies all three requirements listed above, and therefore they are not ready for deployment on data center servers. In contrast, the ITRI ARM hypervisor evaluated in this paper is the first known server-grade hypervisor that runs on a state-of-the-art ARM-based SOC, the Marvell ARMADA-XP, which comes with a quad-core 1.6GHz Cortex-A9 CPU and consumes less than 10W, and that supports concurrent execution of multiple VMs and VM migration.

The rest of the paper is organized as follows. In Section II we review the related work in the area of ARM-based hypervisors. In Section III we present the overall architecture of our hypervisor and discuss the challenges associated with implementing a hypervisor on a CPU without hardware support for virtualization. In this section, we also present the memory virtualization and exception handling framework and the binary rewriting approach for CPU virtualization. We present a performance evaluation of our hypervisor in Section IV. Section V concludes the paper with a summary of our contributions and directions for future work.
978-0-7695-5028-2/13 $26.00 © 2013 IEEE
DOI 10.1109/CLOUD.2013.71


II. RELATED WORK

Open-source hypervisors from the x86 world such as Xen [9] and KVM [10] have been ported to the ARM platform by enthusiast developers [5], [6], [8]. Xen-based hypervisors [6], [8] provide virtualization by replacing the privileged operations inside the source code of the guest OS with hypercalls (traps to the hypervisor). On the other hand, KVM-ARM [5] and ARMvisor [11] leverage the fact that a guest VM runs as an unprivileged process inside the host OS, so any attempt to execute a privileged operation results in a trap that the host OS can intercept. Still, the KVM-ARM hypervisor also needs to resort to para-virtualization, because there are some instructions with side effects, called sensitive instructions, that do not generate traps.

III. ITRI ARM HYPERVISOR

In this section we describe the relevant architectural features of Marvell ARMADA-XP, the target of the ITRI ARM hypervisor, and the ITRI ARM hypervisor itself, particularly its performance optimizations.

A. Marvell ARMADA-XP

Marvell ARMADA-XP is a SOC containing a quad-core Cortex-A9 CPU, which supports 7 modes of operation, such as user mode, supervisor mode, abort mode, undefined mode, etc. The ITRI ARM hypervisor only uses user (unprivileged) mode and supervisor mode (also known as SVC mode or kernel mode).
All ARM instructions are of the same size, four bytes, which greatly simplifies binary rewriting. However, not every instruction can run in every possible mode. Some privileged instructions (PI) can only run on a specific co-processor. For example, most MMU instructions can run only on co-processor 15 and in SVC mode. When these instructions are executed in any other mode, the CPU generates an (undefined instruction) exception. In addition, there are so-called sensitive instructions (SI), which can only be executed correctly in SVC mode but do not trap when executed in user mode. The effect of their execution in user mode is undefined. Examples of such instructions are MSR and MRS, which write to and read from the program status registers, respectively. Finally, the ARM architecture supports a domain mechanism [22], which allows a process's address space to be partitioned into up to 16 protection domains whose accessibility can be dynamically controlled.
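As a concrete illustration (a small user-mode program we add here; it is not from the paper), the following C snippet attempts to change the CPSR mode field with MSR. On ARMv7 the privileged fields are simply left unchanged when this runs in user mode and no exception is raised, which is exactly why trap-and-emulate alone cannot catch sensitive instructions.

/* Illustrative only: shows why MSR/MRS are "sensitive" on ARMv7.
 * Build for a 32-bit ARM target, e.g.: arm-linux-gnueabihf-gcc -marm msr_demo.c */
#include <stdio.h>

int main(void)
{
    unsigned int cpsr;

    __asm__ volatile("mrs %0, cpsr" : "=r"(cpsr));
    /* Try to switch the mode field (bits [4:0]) to SVC (0x13). */
    __asm__ volatile("msr cpsr_c, %0" : : "r"((cpsr & ~0x1Fu) | 0x13u));
    __asm__ volatile("mrs %0, cpsr" : "=r"(cpsr));

    /* In user mode the privileged fields are silently left unchanged and no
     * exception is raised, so a hypervisor that runs the guest kernel in user
     * mode never gets a chance to intervene. */
    printf("mode bits after MSR: 0x%02x\n", cpsr & 0x1Fu);
    return 0;
}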

Most hypervisors, both on x86 and ARM, employ a shadow page table (SPT) [12] as the memory virtualization technique. The idea is that the hypervisor injects a so-called shadow page table between the guest VM and the hardware, which maps guest physical addresses into machine physical addresses. The hypervisor also carries the burden of allocating the memory for the guest OS and mapping those ranges into the guest physical address space. This approach is favorable only when supported by the hardware, and is notoriously difficult to implement correctly in software due to a number of challenges. Some advanced hardware-assisted techniques for managing shadow page tables on the x86 architecture have been proposed [13], [14]. In addition to virtualizing hardware resources, a hypervisor needs to perform privileged operations on behalf of the guest (such as changing the page table base pointer). There are two well-known techniques that a hypervisor can employ to deal with this problem [5]: 1) binary translation, in which each sensitive instruction is translated into a block of non-SIs, and 2) paravirtualization, in which every guest kernel function containing SIs is replaced with a hypercall. Our hypervisor uses a hybrid approach: it employs both minimal lightweight paravirtualization and binary translation.

In addition to a Cortex-A9 CPU, Marvell ARMADA-XP includes an L2 cache, a memory controller, a set of peripheral devices, and a system interconnect called Mbus, which, just like the PCI Express architecture, allows each Mbus device to own a window of the physical memory space and even to remap each incoming physical memory address to something else through a remapping register. An unusual feature of ARMADA-XP is that it allows regions of the main memory (SDRAM) to be accessed as an Mbus device. Therefore, an access to the main memory can be captured by an Mbus device and then remapped. Because of this remapping capability, the physical memory space of an ARMADA-XP server is 16GB rather than 4GB. Finally, the physical memory address window associated with an Mbus device can be turned on and off at run time. When an Mbus device's physical memory address window is turned off, any physical memory access that targets that window triggers an exception. When an Mbus device's physical memory address window is turned on (off), it is on (off) to all cores of the CPU.
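As a rough sketch of how a hypervisor can exploit this facility on a world switch, the code below opens only the window that covers the next guest's memory chunk. The register layout, the enable bit and the mbus_win structure are hypothetical placeholders, not the actual ARMADA-XP register map.

/* Hypothetical sketch: per-VM Mbus address-window control.  The real window
 * control/base/remap registers are defined in Marvell's ARMADA-XP
 * documentation; names, offsets and bit 0 as "enable" are assumptions. */
#include <stdint.h>

struct mbus_win {
    volatile uint32_t *ctrl;    /* window control register (assumed enable bit 0) */
    volatile uint32_t *base;    /* window base address register                   */
    volatile uint32_t *remap;   /* remap register for incoming addresses          */
};

static void mbus_win_enable(struct mbus_win *w, int on)
{
    uint32_t v = *w->ctrl;
    *w->ctrl = on ? (v | 0x1u) : (v & ~0x1u);
}

/* Before running guest 'next', open only the window covering its memory
 * chunk; every other chunk's window stays off, so a stray access faults. */
void select_guest_window(struct mbus_win *wins, int nr_guests, int next)
{
    for (int i = 0; i < nr_guests; i++)
        mbus_win_enable(&wins[i], i == next);
}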

Industry progress on ARM virtualization is rapid. ARM Cortex-A15 CPUs provide hardware support for virtualization similar to the capabilities found in x86 CPUs. NICTA [15] has done a quantitative analysis for Cortex-A15. Virtual Open Systems [16] is a company that ported KVM-ARM to the Cortex-A15 platform and submitted the patches to the Linux kernel maintainers. The Xen community started to port the Xen hypervisor to ARMv7 with the virtualization extension and submitted the patches to the mailing list as well [17]. VMware Horizon Mobile [7] is the most prominent commercial hypervisor that can also support Cortex-A9; its main purpose is to help IT departments manage a corporate mobile workspace on employees' smartphones. Some other popular commercial hypervisors are vLogix Mobile [18] and the OKL4 microvisor [19]. ARM announced a technology code-named big.LITTLE [20] whose goal is to combine the performance of Cortex-A15 cores with the energy efficiency of Cortex-A7 processors. The most advanced architecture is the 64-bit ARMv8 core. We expect a whole variety of commercial and open-source solutions to be released in the near future [21].

As will become clear later, ARMADA-XP's ability to treat regions of main memory as Mbus devices offers a poor man's alternative to such memory virtualization support as Intel's extended page table (EPT) [4] or AMD's nested page table [23], because it offers the ITRI ARM hypervisor a hardware-based check-and-remap capability for every physical memory access coming from a guest VM.
B. Baseline Implementation

The ITRI ARM hypervisor is similar to KVM in architecture, and runs on top of an Ubuntu Linux distribution that comes with the ARMADA-XP reference design. The host OS

runs on the ARMADA-XP SOC in SVC mode, and every guest VM runs entirely in user mode and consists of one or multiple processes running on the host OS. Isolation between the host OS and the guest VMs is guaranteed through the Mbus mechanism or the shadow page table, whereas isolation between a guest OS and the user processes running on top of it is achieved through the domain mechanism [5]. The current implementation of the ITRI ARM hypervisor consists of the following components:
1) A loadable kernel module, which is a significantly revised version of KVM and is responsible for allocating memory for newly started guest VMs, handling exceptions when guest VMs run, and performing world context switches, including changing the operation mode, the page table and the exception handler table.
2) Para-virtualizations of the supported guest OS, specifically the Linux kernel 2.6.35, including removing unnecessary privileged instructions, e.g. redundant cache and TLB flushing, and simplifying the architecture-dependent code in the booting process.
3) An ARM binary rewriting tool used to patch the binary image of the supported guest OS.
4) A user-level machine emulator, QEMU [24], that allows users to configure and launch guest VMs, and interfaces with real device drivers and backend device drivers.

The ITRI ARM hypervisor virtualizes the memory resource using a shadow page table (SPT) [12] implementation, which creates a shadow copy of each guest page table so that the MMU actually points to the shadow copy of each guest page table rather than the guest page table itself. Whenever a modification is made to a guest page table, this modification must be propagated to its corresponding shadow page table. Automatically detecting changes to guest page tables is the most complicated and time-consuming part of an SPT implementation. The ITRI ARM hypervisor relies on para-virtualization to detect such changes.

The ITRI ARM hypervisor virtualizes I/O devices using the Virtio framework [25], which requires a front-end driver installed in every guest VM and a back-end driver inside the host OS (vhost). Because ARM does not support I/O instructions, the ITRI ARM hypervisor needs to intercept memory-mapped I/O instructions by marking the target pages of these instructions as invalid and forcing them to trap to the host OS.
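A minimal sketch of such a para-virtual hook on the guest side is shown below; the hook name, the hypercall convention and the call number are our assumptions for illustration, not the actual guest patch.

/* Guest side: instead of silently writing a page-table entry, the
 * para-virtualized kernel also tells the hypervisor about the change so
 * the corresponding shadow PTE can be updated.  All names are illustrative. */
#include <stdint.h>

#define HCALL_SET_PTE 0x10                          /* assumed hypercall number */

static inline void hypercall_set_pte(uint32_t ptep, uint32_t pteval)
{
    register uint32_t r0 __asm__("r0") = ptep;
    register uint32_t r1 __asm__("r1") = pteval;
    register uint32_t r7 __asm__("r7") = HCALL_SET_PTE;  /* call number in r7 (assumed ABI) */
    /* SWI traps to the host OS, where the revised KVM module updates the shadow PTE. */
    __asm__ volatile("swi 0" : : "r"(r0), "r"(r1), "r"(r7) : "memory");
}

void guest_set_pte(uint32_t *ptep, uint32_t pteval)
{
    *ptep = pteval;                                 /* guest page table update      */
    hypercall_set_pte((uint32_t)ptep, pteval);      /* propagate to the shadow copy */
}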

C. Performance Optimizations

1) Light-Weight Context Switch: The most common exceptions are those that can be handled by the guest OS itself without involving the host OS, e.g. a user process in a guest VM making a system call or executing certain privileged instructions. A naive implementation requires two world context switches, one from the guest VM to the host OS, and then one from the host OS back to the guest VM. The ITRI ARM hypervisor features a light-weight context switch mechanism that embeds in the exception handlers the logic to recognize this type of exception and immediately transfer control back to the guest OS without performing any world context switch. A light-weight context switch triggers a protection domain crossing, but does not require page table or exception handler table switching, and thus avoids the associated performance overhead due to flushing of the TLB and L2 cache.
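The shape of this fast path is sketched below; the structure and helper names are hypothetical reconstructions rather than the actual ITRI code.

/* Hypothetical fast-path logic for a light-weight context switch. */
struct vcpu;          /* per-VM virtual CPU state, defined elsewhere */
struct trap_frame;    /* registers saved at exception entry          */

enum trap_kind { TRAP_GUEST_SYSCALL, TRAP_EMULATE_LOCALLY, TRAP_NEEDS_HOST };

extern enum trap_kind classify_trap(const struct trap_frame *tf);
extern void reflect_into_guest_kernel(struct vcpu *v, struct trap_frame *tf);
extern void emulate_in_guest_context(struct vcpu *v, struct trap_frame *tf);
extern void full_world_switch_to_host(struct vcpu *v, struct trap_frame *tf);

void handle_guest_exception(struct vcpu *v, struct trap_frame *tf)
{
    switch (classify_trap(tf)) {
    case TRAP_GUEST_SYSCALL:
        /* Only a protection-domain crossing: page table and exception
         * handler table stay in place, so TLB and L2 contents survive. */
        reflect_into_guest_kernel(v, tf);
        break;
    case TRAP_EMULATE_LOCALLY:
        emulate_in_guest_context(v, tf);   /* e.g. a PI with a cheap emulation */
        break;
    default:
        full_world_switch_to_host(v, tf);  /* the expensive path */
        break;
    }
}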

To run a guest VM, the QEMU process loads the guest kernel image into its address space, performs a few initialization tasks and then enters the main loop, which makes an ioctl() call to the revised KVM module, which in turn transfers control to the guest OS by making a world context switch. Execution of the guest VM occasionally triggers exceptions, because of privileged instructions, sensitive instructions, system calls or hypercalls. The revised KVM module either handles these exceptions itself or delegates them to the QEMU emulator or some backend device drivers. The set of exception handlers used when a guest VM runs is different from the set used when the host OS runs. Therefore, the world context switch includes a switch of the exception handler table.
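The run loop looks roughly like the sketch below. It is modeled on the stock KVM user-space API (KVM_RUN and the kvm_run exit reasons); the revised module's actual ioctl interface and exit codes may differ.

/* User-space run loop, KVM-style (illustrative). */
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <stdbool.h>

/* hypothetical helpers provided elsewhere by the emulator */
extern void emulate_mmio(struct kvm_run *run);
extern void handle_hypercall(struct kvm_run *run);

void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
    bool running = true;
    while (running) {
        ioctl(vcpu_fd, KVM_RUN, 0);          /* world switch into the guest  */
        switch (run->exit_reason) {          /* back in QEMU after a trap    */
        case KVM_EXIT_MMIO:
            emulate_mmio(run);               /* delegate to a backend driver */
            break;
        case KVM_EXIT_HYPERCALL:
            handle_hypercall(run);
            break;
        case KVM_EXIT_SHUTDOWN:
            running = false;
            break;
        default:
            break;                           /* handled in the kernel module */
        }
    }
}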

2) Static Memory Partitioning: The idea of static memory partitioning is to partition the physical memory on an ARM server into multiple contiguous chunks and assign each chunk to the host OS or a guest VM. The Linux kernel supports a kernel boot option that specifies a physical memory address range given to that kernel, e.g. 16GB to 24GB. With this option, a well-behaved kernel, i.e., one that is not infected by a virus or malicious code, puts only physical page numbers in its assigned chunk into its page tables, and thus makes it impossible to access physical memory pages outside its chunk. However, this approach does not prevent a kernel rootkit in one guest VM from tampering with the page tables and thus accessing physical memory pages of other guest VMs.
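For illustration, on 32-bit ARM the mainline kernel already understands a mem=<size>@<start> boot parameter, so a guest's chunk can be pinned with a command line along the following lines; the size, start address, console and root device here are invented for the example, not taken from the paper.

console=ttyS0 root=/dev/vda1 mem=1G@0x40000000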

The ITRI ARM hypervisor employs static binary rewriting to recognize sensitive instructions in a guest kernel binary and replace them with non-sensitive instructions. There is no need to rewrite privileged instructions since they cause traps to the host OS anyway and thus can be emulated afterward. Our first implementation replaced every recognized sensitive instruction with a software interrupt instruction (SWI), which causes a trap to the host OS. The original instruction being replaced was encoded as the replacing SWI instruction's 24-bit argument, the same way as was done in [5]. However, the performance overhead of this implementation is too high to be acceptable. The workflow of this binary rewriter is shown in Figure 1. The current prototype of the ITRI ARM hypervisor does not support run-time sensitive instruction rewriting, and therefore does not allow new code to be added to the VM's kernel, either through dynamic code generation or through kernel module loading.
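The following sketch conveys the flavor of that first-cut rewriting scheme; the use of a side table indexed by the 24-bit immediate, and the exact encoding policy, are our assumptions for illustration rather than the tool's actual format.

/* Sketch of the SWI-based rewriting: each sensitive instruction is replaced
 * by an SWI whose 24-bit immediate identifies the original instruction.     */
#include <stdint.h>

/* ARM SWI/SVC encoding: cond[31:28] | 1111[27:24] | imm24[23:0] */
#define SWI_OPCODE(cond, imm24)  (((cond) << 28) | (0xFu << 24) | ((imm24) & 0xFFFFFFu))

/* Replace the SI at *insn, remembering the original word in a side table so
 * the trap handler can emulate it later. */
uint32_t rewrite_sensitive(uint32_t *insn, uint32_t *si_table, uint32_t idx)
{
    uint32_t cond = *insn >> 28;          /* keep the original condition code */
    si_table[idx] = *insn;                /* original word, used for emulation */
    *insn = SWI_OPCODE(cond, idx);        /* traps into the host when reached  */
    return si_table[idx];
}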

The ITRI ARM hypervisor closes this security hole by leveraging the physical address window mechanism in ARMADA-XP. More specifically, the ITRI ARM hypervisor assigns each guest VM a physical memory chunk, and associates each physical memory chunk with a separate Mbus device by setting the device's physical memory address window to the chunk's physical address range. When the hypervisor schedules a guest VM for execution, it turns on the physical memory address window of the Mbus device associated with the physical memory chunk of the guest VM, and leaves all other Mbus devices' physical memory address windows off. With this set-up, it is impossible for a guest VM to access any physical memory page outside its chunk.

857

Fig. 1. Binary rewriting algorithm flowchart showing the process of finding sensitive instructions (SIs) in the guest image, generating assembly wrappers that invoke an emulation function for each SI, replacing each SI with a branch to the corresponding wrapper, and appending and linking the emulation code to the guest OS image.

Because a physical address window services memory access requests from any core in ARMADA-XP, simultaneously running multiple guest VMs on different cores introduces the security risk of one guest VM corrupting another guest VM's physical memory chunk. Therefore, for absolute security, it is advisable to run only one guest VM on all CPU cores at a time, rather than multiple guest VMs each on a separate CPU core.

The main advantage of this physical memory remapping approach to memory virtualization is that it greatly simplifies the implementation complexity and reduces the performance overhead associated with shadow page table-based implementations, especially with respect to the detection of guest page table modifications when para-virtualization is not used.

In addition to physical memory pages, a guest VM is also given a physical memory address range as the target of memory-mapped I/O instructions when it starts up. The ITRI ARM hypervisor gives each guest VM the same unmapped range, so that when a guest VM accesses any location in this range, a trap is generated and control is transferred to the host OS.

3) Streamlined PI/SI Emulation: Emulation of many sensitive instructions does not always need to run in SVC mode. Therefore, replacing all sensitive instructions with SWI causes traps to the host OS and incurs world context switch overheads unnecessarily. The ITRI ARM hypervisor solves this problem by creating a dedicated emulation instruction sequence for each sensitive instruction and then replacing each sensitive instruction with a branch instruction to its associated emulation instruction sequence. Instead of performing instruction decoding, i.e., parsing the opcode, operands and addressing modes, at run time, the binary rewriter performs these operations statically and outputs the result to the emulation instruction sequences for a selective set of sensitive instructions.

Many SI emulation instruction sequences can run in user mode just fine because the host resources these SIs access are virtualized, e.g. the program status register and the fault status register (FSR). As a result, accesses to these virtualized host resources do not need to go through the host OS. Unfortunately, not all host resources can be virtualized this way. That is why some SI emulation instruction sequences still need to call on the help of the host OS, e.g. those that change the VCPU mode, because the hypervisor needs to keep track of whether a guest VM runs in guest user or guest kernel mode.

Some benchmarks, such as iperf and hdparm, incur significant performance overhead when emulating sensitive instructions. After investigating the root cause, it was found that the performance loss is due to extensive execution of load and store with translation instructions (LDR[B]T/STR[B]T) in kernel routines such as __copy_to_user_std, __copy_from_user, etc. For example, in the __copy_from_user function a bunch of LDRT instructions follow one after another:

(1) ldrt r3, [r1], #4
(2) ldrt r4, [r1], #4
(3) ldrt r5, [r1], #4
(4) ...
(5) ldrt r8, [r1], #4
(6) ldrt ip, [r1], #4
(7) ldrt lr, [r1], #4
(8) subs r2, r2, #32


TABLE I. THE SET OF BENCHMARKS USED IN THE PERFORMANCE EVALUATION STUDY, INCLUDING THE TYPE OF SYSTEM RESOURCES THEY STRESS AND THEIR INVOCATION PARAMETERS

Name              | Type        | Parameters
sysbench-cpu      | CPU         | cpu-max-prime=100
sysbench-threads  | CPU, Memory | num-threads=64, thread-yields=100, thread-locks=2
sysbench-memory   | Memory      | memory-total-size=100M
Apache Benchmark  | CPU, IO     | requests=10000, concurrency=100
iperf             | IO          | time=60
hdparm            | IO          | -t
dd                | IO          | Read: if=/dev/zero of=200MB file; Write: if=200MB file of=/dev/null
Unixbench-double  | CPU         | n/a
Unixbench-spawn   | Memory      | n/a

At the beginning of every iteration, the hypervisor marks all the pages as read-only in the shadow page table. When a page is first modified in an iteration, a write protection fault occurs, and the hypervisor restores the page's original permission, as indicated in the corresponding guest page table entry, and records the page in the dirty page bitmap. Pages in the dirty page bitmap are the targets of transfer in the next iteration.
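A compact sketch of this write-protect-and-record step follows; the shadow-PTE bookkeeping and helper names are our assumptions rather than the ITRI data structures.

/* Sketch of the dirty-page tracking step in Fig. 2 (illustrative only). */
#include <stdint.h>

struct spte {
    uint32_t *entry;        /* shadow page table entry                */
    uint32_t  guest_perm;   /* permission bits from the guest PTE     */
};

extern void set_spte_readonly(uint32_t *entry);             /* arch-specific */
extern void set_spte_perm(uint32_t *entry, uint32_t perm);  /* arch-specific */

/* Start of a copy iteration: write-protect every mapped page. */
void begin_dirty_round(struct spte *spt, unsigned long npages, uint32_t *dirty_bitmap)
{
    for (unsigned long i = 0; i < npages; i++) {
        set_spte_readonly(spt[i].entry);
        dirty_bitmap[i / 32] &= ~(1u << (i % 32));
    }
}

/* Write-protection fault on page i: restore the guest's own permission
 * and remember the page so it is re-sent in the next iteration. */
void handle_write_protect_fault(struct spte *spt, unsigned long i, uint32_t *dirty_bitmap)
{
    set_spte_perm(spt[i].entry, spt[i].guest_perm);
    dirty_bitmap[i / 32] |= 1u << (i % 32);
}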

Fig. 2. To detect pages that are modified within a period of time, the hypervisor marks them as read-only at the beginning of the period, forces write protection faults when the pages are modified, and records those pages in the dirty page bitmap.

Our original solution to this problem was to recognize such sensitive instruction sequences, and to replace each sequence with a single branch to a specific emulation instruction sequence that emulates the aggregated effects of these instructions and then returns to the first instruction after the bunch ((8) in the above example). All the sensitive instructions in the bunch except the first one are replaced with a NOP (No OPeration) instruction. Unfortunately, this solution did not always work, because in the guest kernel image there exist instruction bunches in which some indirect branch actually jumps to an instruction within the bunch. To address this problem, our final solution is to replace each instruction with a branch to an emulation instruction sequence that performs the loads or stores associated with the replaced instruction and those following it in the bunch. In the above example, the emulation instruction sequence for instruction (5) emulates instructions (5), (6) and (7). This way, even if instruction (5) is the target of some indirect branch, the execution result remains correct. For some workloads this technique does not provide any performance improvement, but for the iperf benchmark it reduces the number of branches due to sensitive instructions by more than 4 times.
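The C fragment below is our conceptual rendering of one such generated stub; the real stubs are emitted as ARM assembly by the binary rewriter, and the register-file layout shown is illustrative.

/* Conceptual emulation stub for instruction (5) in the LDRT run above: it
 * also performs the work of (6) and (7), so an indirect branch landing on
 * (5) still produces the correct result.                                   */
#include <stdint.h>

struct guest_regs { uint32_t r[16]; };   /* 32-bit guest register file; r[12]=ip, r[14]=lr */

void emulate_ldrt_run_from_5(struct guest_regs *g)
{
    const uint32_t *p = (const uint32_t *)(uintptr_t)g->r[1];
    g->r[8]  = *p++;                     /* (5) ldrt r8, [r1], #4 */
    g->r[12] = *p++;                     /* (6) ldrt ip, [r1], #4 */
    g->r[14] = *p++;                     /* (7) ldrt lr, [r1], #4 */
    g->r[1]  = (uint32_t)(uintptr_t)p;   /* post-indexed write-back of r1 */
    /* control then resumes at (8), the first instruction after the bunch */
}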

IV. PERFORMANCE EVALUATION

A. Evaluation Methodology

The hardware test-bed used to evaluate the ITRI ARM hypervisor prototype is the Marvell ARMADA-XP development board [27] with a quad-core Sheeva Cortex-A9 CPU. The board has 4GB of RAM and is equipped with a SATA disk. It runs the Linux 2.6.35 kernel with Marvell's proprietary patch. The host OS is Ubuntu 9.10 (karmic) for ARM. The guest VM uses the same Linux kernel version without the Marvell patch, and boots from the same root file system. We are using the arm-softmmu target of QEMU v0.14.50 with virtio patches.

Table I lists the set of benchmarks used in this performance evaluation study. Sysbench [28] is an open-source benchmark with many testing options. The Sysbench CPU test finds the maximum prime number below 100. The Sysbench threads test creates 64 threads; each thread processes 100 requests, each of which consists of locking 2 mutexes, yielding the CPU, and then unlocking them when the thread is rescheduled. The Sysbench memory test reads/writes 100MB of data.

The binary rewriter also strives to reduce the run-time overhead of guest VMs by avoiding unnecessary traps into the hypervisor. In the original ARM Linux kernel, there are many instances in which one or multiple privileged instructions are embedded inside a loop, but another instruction outside the loop could actually achieve the same effect as these privileged instructions. For example, a loop containing several cache cleaning instructions is followed by another instruction that triggers a context switch, which automatically entails a cache cleaning operation. In this case, all the cache cleaning instructions can be safely removed. Our binary rewriter is able to identify these types of loops and remove the unnecessary privileged instructions in them.
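The pattern being targeted looks roughly like the following hypothetical guest-kernel fragment; the function names are invented, but the reasoning is the one described above: the loop's privileged cache cleans are subsumed by the cache clean performed at the subsequent context switch.

/* Hypothetical guest-kernel pattern targeted by this optimization (not the
 * actual 2.6.35 code). */
extern void clean_dcache_page(void *page);    /* privileged: traps to the host  */
extern void context_switch_full_clean(void);  /* also cleans the entire D-cache */

void flush_then_switch(void **pages, int n)
{
    for (int i = 0; i < n; i++)
        clean_dcache_page(pages[i]);          /* removed by the binary rewriter */
    context_switch_full_clean();              /* makes the loop above redundant */
}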

Apache Benchmark (ab) [29] is a benchmark designed to test the service performance and quality of web sites. It measures the total time a web server spends handling 10000 requests that come in through 100 simultaneous connections. Iperf [30] is a network benchmark that tests how much data can be transmitted within 60 seconds. Hdparm [31] is a benchmark for testing the read/write performance of hard disks. The tool dd is for disk data reading and writing. The dd test writes 200MB to disk from the zero device (/dev/zero) and reads this file back to the null device (/dev/null).

D. Live Migration

The ITRI hypervisor also supports live migration [26], which iteratively copies the memory state of a VM from the source machine to the destination machine, and stops the VM in order to copy the residual memory state over to complete the migration when the residual memory state is sufficiently small. To keep track of the set of dirty pages that still need to be migrated, the ITRI ARM hypervisor implements a dirty page tracking mechanism, as shown in Figure 2.
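The overall iterative pre-copy structure follows Clark et al. [26]; the sketch below is our illustration of it, with placeholder helpers, thresholds and round limits rather than the hypervisor's actual code.

/* Iterative pre-copy skeleton (illustrative). */
#include <stddef.h>

extern void   start_dirty_tracking(void);        /* write-protect all pages          */
extern size_t copy_all_pages(void);              /* returns bytes sent               */
extern size_t copy_and_clear_dirty_pages(void);  /* send pages dirtied since the last
                                                    call and re-protect them          */
extern void   pause_vm(void);
extern void   resume_on_destination(void);

void live_migrate(size_t residual_threshold, int max_rounds)
{
    start_dirty_tracking();                          /* track writes from now on   */
    size_t residual = copy_all_pages();              /* iteration 0: full copy     */

    for (int r = 1; r < max_rounds && residual > residual_threshold; r++)
        residual = copy_and_clear_dirty_pages();     /* iterative pre-copy rounds  */

    pause_vm();                                      /* freeze: down time begins   */
    copy_and_clear_dirty_pages();                    /* residual memory state      */
    resume_on_destination();
}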

Unixbench [32] is an open-source benchmark that consists of a number of specific tests. Every test uses an INDEX to express its performance result; the higher the INDEX value, the better. In this paper we choose the Whetstone and Process Creation tests. The Unixbench Whetstone test measures the execution time of floating-point operations such as sin, cos, sqrt, exp, and log. The Unixbench Process Creation test measures the number of processes that can be forked within 30 seconds when processes are forked and then immediately closed.

TABLE II. MEASURED PERFORMANCE OF THE SET OF BENCHMARKS WHEN EXECUTED NATIVELY ON THE MARVELL BOARD, ON A VM RUNNING ON QEMU AND THEN ON THE MARVELL BOARD, AND ON A VM RUNNING ON THE ITRI ARM HYPERVISOR AND THEN ON THE MARVELL BOARD. THE ITRI ARM HYPERVISOR USES STATIC MEMORY PARTITIONING, EMPLOYS PARA-VIRTUALIZATION RATHER THAN SENSITIVE INSTRUCTION EMULATION FOR SENSITIVE INSTRUCTIONS, AND ENABLES LIGHT-WEIGHT CONTEXT SWITCHING.

Benchmark               | Native     | TCG       | ITRI-ARM
sysbench-cpu (s)        | 0.8        | 45.1      | 0.8
sysbench-threads (s)    | 1.2        | 388.7     | 6.3
sysbench-memory (s)     | 0.5        | 43.3      | 0.9
Apache Benchmark (s)    | 3.7        | 106.1     | 7.7
Iperf (Mb/s)            | 761        | 19.5      | 435
hdparm (MB/s)           | 125        | 11.1      | 17.2
dd read/write (MB/s)    | 68.1 / 118 | 2.6 / 8.8 | 19.2 / 9.6
Unixbench-double        | 72.7       | 0.6       | 71.7
Unixbench-spawn         | 222.3      | 7.1       | 182.8

Because more than 90% of the PI exceptions in Unixbench-spawn are handled by light-weight context switches, its relative performance penalty is lower than that of sysbench-threads.

For network-intensive benchmarks, i.e., iperf and ab, TCG's performance is 30 to 40 times slower than Native's, but the slow-down of ITRI-ARM compared with Native is around a factor of 2. The performance difference between ITRI-ARM and Native comes from the additional context switches on the network packet delivery path induced by the virtio architecture. For disk-intensive benchmarks, i.e., hdparm and dd, TCG's performance is 10 to 20 times slower than Native's, and ITRI-ARM is between 7 and 10 times slower than Native's. The substantial performance difference between ITRI-ARM and Native comes from the fact that the former uses readv/writev for file I/O whereas the latter uses read/write. The reason that ITRI-ARM uses readv/writev is that the guest physical addresses of the payload of a file I/O operation issued from a guest VM are in general not consecutive.

For the Sysbench and Apache benchmarks, the performance measure is the total execution time, so the smaller the better. For Iperf and hdparm, the performance measure is the average throughput, so the higher the better. For the Unixbench benchmarks, the performance measure is their INDEX number, so the higher the better.

To better understand the performance gap between Native and ITRI-ARM under sysbench-threads and sysbench-memory, we decompose the performance overhead of the ITRI ARM hypervisor into the following five categories: (1) emulation of privileged instructions (PI), (2) handling of data access exceptions (DA), (3) processing of interrupts (IRQ), (4) handling of system calls (SC) and (5) emulation of sensitive instructions (SI), as shown in Table III. Without any optimization, processing of these five types of exceptions requires full context switching. The light-weight context switch optimization in the ITRI ARM hypervisor reduces the context switching overheads of SC and of some percentage of PI, DA and SI. In addition, the ITRI ARM hypervisor emulates some sensitive instructions in user space, thus incurring no context switch at all. The fact that a significant percentage of sensitive instructions are handled completely in user space demonstrates the effectiveness of the SI emulation optimizations built into the ITRI ARM hypervisor.

B. Performance Overhead of the ITRI ARM Hypervisor

We ran the test benchmarks natively on the Marvell board (Native), on a VM running on QEMU, which in turn runs on the Marvell board (TCG, or Tiny Code Generator [24]), and on a VM running on the ITRI ARM hypervisor, which in turn runs on the Marvell board (ITRI-ARM). The native configuration does not have any virtualization overhead. The performance overhead of the TCG configuration represents the upper bound of the virtualization overhead, because QEMU emulates every instruction executed by a VM. The version of the ITRI ARM hypervisor used in this test adopts static memory partitioning rather than a shadow page table for memory virtualization, employs light-weight context switching to eliminate unnecessary world context switches, and uses manual para-virtualization rather than SI emulation to handle sensitive instructions. Therefore this version represents the most performant configuration of the ITRI ARM hypervisor.

This overhead breakdown shows that the main reason behind the significant performance penalty incurred by the ITRI ARM hypervisor in the sysbench-threads and sysbench-memory workloads is the increased number of PI exceptions that require a full context switch, compared to other workloads. In addition, sysbench-threads also makes many more system calls, each of which requires a number of sensitive instructions to return from the guest kernel to the guest user.

For CPU-intensive benchmarks, i.e., sysbench-cpu, sysbench-threads, ab and Unixbench-double, TCG's performance is 30 to 60 times slower than Native's, but ITRI-ARM's performance is almost the same as Native's, except for sysbench-threads, which requires the guest OS to execute many privileged instructions and trap to the hypervisor. As a result, the performance of sysbench-threads under ITRI-ARM is five times slower than that under Native, and this performance difference mainly comes from the additional context switches caused by the privileged instructions executed by the guest OS. Because the ITRI ARM hypervisor runs guest VMs as user processes, it is inevitable that privileged instructions trigger exceptions.

C. Effectiveness of Performance Optimizations

Table IV presents the performance benefit of each of the three main performance optimizations: static memory partitioning, light-weight context switch, and para-virtualization to remove SI emulation. The "with SPT" column of Table IV shows that implementing memory virtualization using a shadow page table rather than static memory partitioning, which corresponds to the fastest version of the ITRI ARM hypervisor (the ITRI-ARM column), incurs no performance loss for all benchmarks except Unixbench-spawn. This is because Unixbench-spawn creates and destroys many processes, causing many updates to the guest page tables and eventually to their shadow copies.

For memory-intensive benchmarks, i.e., sysbench-threads, sysbench-memory, and Unixbench-spawn, TCG's performance is again 30 to 400 times slower than Native's. Excluding sysbench-threads, ITRI-ARM's performance is between 1.3 and 1.8 times slower than Native's.

TABLE III. COUNT OF AND TIME SPENT IN FIVE TYPES OF TRAPS WHEN THE ITRI ARM HYPERVISOR RUNS UNDER SYSBENCH-THREADS AND SYSBENCH-MEMORY: (1) PRIVILEGED INSTRUCTION (PI), (2) DATA ACCESS EXCEPTION (DA), (3) INTERRUPT (IRQ), (4) SYSTEM CALL (SC), AND (5) SENSITIVE INSTRUCTION (SI). LIGHT-WEIGHT CONTEXT SWITCH CAN ONLY BE APPLIED TO SC AND SOME PERCENTAGE OF DA, PI AND SI.

                                    | sysbench-threads (Total = 9.2 s) | sysbench-memory (Total = 3.9 s)
Trap type                           | # of traps  | Time (s)           | # of traps  | Time (s)
PI  (full context switch)           | 531275      | 3.03               | 891         | 0.0092
DA  (full context switch)           | 180         | 0.0024             | 104         | 0.002
IRQ (full context switch)           | 579         | 0.024              | 169         | 0.007
PI  (light-weight context switch)   | 386481      | 0.5                | 2187        | 0.192
DA  (light-weight context switch)   | 1060        | 0.0006             | 708         | 0.0005
SC  (light-weight context switch)   | 1161816     | 0.64               | 409801      | 0.238
SI  (light-weight context switch)   | 1161184     | 0.5                | 414384      | 2.3
SI  (no context switch)             | 12882150    | 2.4                | 2908546     | 0.7

TABLE IV. PERFORMANCE IMPACT OF SHADOW PAGE TABLE (SPT), LIGHT-WEIGHT CONTEXT SWITCHING (LWCS), AND EMULATION OF SENSITIVE INSTRUCTIONS IN EXCEPTION HANDLERS (SIE) ON THE ITRI ARM HYPERVISOR, WHOSE OPTIMIZED SETTING IS SPT OFF, LWCS ON AND SIE OFF.

Benchmark               | ITRI-ARM   | Four Guests | with SPT   | without LWCS | with SIE
sysbench-cpu (s)        | 0.8        | 0.8         | 0.8        | 1.1          | 0.8
sysbench-threads (s)    | 6.3        | 6.6         | 6.6        | 34.2         | 9.2
sysbench-memory (s)     | 0.9        | 0.9         | 0.9        | 6.3          | 3.9
Apache Benchmark (s)    | 7.7        | 14.1        | 7.7        | 10.9         | 12.7
Iperf (Mbps)            | 435        | 123         | 434        | 428          | 109
hdparm (MB/s)           | 17.2       | 16.5        | 17.1       | 17.1         | 16.7
dd read/write (MB/s)    | 19.2 / 9.6 | 11.3 / 4.8  | 18.6 / 9.1 | 12.5 / 8.8   | 1.4 / 9.5
Unixbench-double        | 71.7       | 71.6        | 71.5       | 71.6         | 71.7
Unixbench-spawn         | 182.8      | 117.5       | 40.5       | 44.1         | 95.3


These guest page table updates eventually result in a more than 4 times slow-down (from 182.8 to 40.5). This degradation is surprisingly low, and one explanation is that the current SPT implementation requires para-virtualization to capture guest page table modifications.
The substantial difference between the "without LWCS" column and the ITRI-ARM column across all benchmarks confirms that light-weight context switch is a useful optimization that effectively cuts down unnecessary full context switches. The "with SIE" column corresponds to the guest OS configuration that uses emulation rather than para-virtualization to handle sensitive instructions. As a result, for the Unixbench-spawn, iperf and sysbench-memory benchmarks, which trigger many SI exceptions, the difference between the "with SIE" column and the ITRI-ARM column is substantial. For the other benchmarks, such as sysbench-cpu and Unixbench-double, these two columns are essentially the same. The performance penalty of each test benchmark when SIE is turned on is proportional to the number of sensitive instructions encountered in the test run. For example, the performance penalty of SIE for sysbench-memory is more significant than that for sysbench-cpu because the number of sensitive instructions executed in the former is more than 8 times that in the latter.

TABLE V. THE PERFORMANCE OF A SET OF BENCHMARKS WHEN THEY RUN IN A NORMAL VM AND WHEN THEY RUN IN A MIGRATED VM, AND THE SERVICE DOWN TIME OF THE TEST VMS WHEN THEY ARE MIGRATED

Benchmark                  | Normal    | In Migration | Down Time (s)
sysbench-cpu2 (s)          | 10.77     | 25.91        | 2.05
sysbench-threads (s)       | 5.46      | 19.42        | 2.27
sysbench-memory2 (s)       | 8.28      | 25.73        | 2.43
hdparm (MB/s)              | 10.24     | 2.40         | 2.53
dd for write/read (MB/s)   | 7.3/10.0  | 6.0/8.0      | 4.05
Unixbench-double           | 72.4      | 72.2         | 2.05
Unixbench-spawn            | 198.5     | 197.3        | 2.13

For dd, surprisingly, the performance of the four-guest configuration is one half rather than one quarter of that of the one-guest configuration, because the bottleneck in this case is the virtio block device rather than the underlying disk, and so the one-guest configuration's performance is not the best it could be.
D. Live Migration Performance

Table V shows the performance of a set of benchmarks when they run in a normal VM and when they run in a VM that is being migrated from one physical machine to another. The performance differences between these two columns represent the performance impact of migration on the test applications. The Normal column has the same configuration as the ITRI-ARM column in the previous tables except for sysbench-cpu and sysbench-memory. Because the original elapsed times of sysbench-cpu and sysbench-memory were too short to measure, we adjusted their parameters to make them execute for a longer period. Also, we put the virtual disk image on a network file system for migration. In general, the more memory state is modified, the larger the performance degradation. For CPU/memory-intensive workloads, the performance degradation is particularly significant because they modify more memory pages, and thus cause higher dirty page tracking and memory state copying overhead. The Down Time corresponds to the freeze time of the VM being migrated in each test, and ranges between 2 and 4 seconds. These down times are higher than expected because the current ITRI ARM hypervisor prototype uses only two iterations in each VM migration operation, rather than multiple iterations, which would make the residual memory state small enough.

Because the Marvell board has a quad-core ARM CPU, it is possible to run four guest VMs on it simultaneously, each running on a separate core. The "Four Guests" column in Table IV shows the average measured performance of each test benchmark running on each guest VM when four guest VMs run on the Marvell board. For CPU-intensive benchmarks, there is nearly no difference between the one-guest and four-guest configurations. This means the ITRI ARM hypervisor is able to efficiently utilize all four cores to run the four guest VMs. However, for some memory-intensive benchmarks, e.g. Unixbench-spawn, the four-guest configuration shows a non-trivial performance degradation, because the L2 cache and memory are shared among the guest VMs. For iperf, the performance of the four-guest configuration is roughly one quarter of that of the one-guest configuration because the network card of the Marvell board is shared among the four guest VMs.

V. CONCLUSION

In this paper we present a fully functional KVM-ARM hypervisor running on the Marvell ARMADA-XP board, which has already been deployed in some server products. To the best of our knowledge, this is the first hypervisor that successfully runs multiple VMs with full isolation on a real-world ARM Cortex-A9-based SOC, which does not provide any hardware support for virtualization. Although the KVM-ARM hypervisor described here borrows ideas and implementations heavily from the KVM-x86 project, it still features several innovations that are tailored to the Marvell platform, including binary rewriting to remove unnecessary cache/TLB flushing, memory virtualization using static memory partitioning and physical memory remapping, and light-weight context switch.

With these optimizations, the performance of CPU-intensive benchmarks running on VMs that sit on top of the KVM-ARM hypervisor is almost the same as native performance, but the performance of system-activity-intensive benchmarks fares much worse, sometimes up to a factor of 5 slow-down. Most of the performance penalty associated with system-activity-intensive benchmarks is due to sensitive instruction emulation and exception handling for privileged instructions. We plan to further optimize this KVM-ARM hypervisor, and expect to port some of its advanced features such as VM migration, and to develop VM fault tolerance, for hypervisors on the Cortex-A15 and the next-generation Cortex-A53/A57.

REFERENCES

[13] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne, Accelerating


two-dimensional page walks for virtualized systems, in Proceedings
of the 13th international conference on Architectural support for
programming languages and operating systems, ser. ASPLOS XIII.
New York, NY, USA: ACM, 2008, pp. 26–35. [Online]. Available:
http://doi.acm.org/10.1145/1346281.1346286
[14] G. Hoang, C. Bae, J. Lange, L. Zhang, P. Dinda, and R. Joseph, A case
for alternative nested paging models for virtualized systems, Computer
Architecture Letters, vol. 9, no. 1, pp. 17–20, Jan. 2010.
[15] P. Varanasi and G. Heiser, Hardware-supported virtualization on
ARM, in Proceedings of the Second Asia-Pacic Workshop on
Systems, ser. APSys 11. New York, NY, USA: ACM, 2011, pp. 11:1
11:5. [Online]. Available: http://doi.acm.org/10.1145/2103799.2103813
[16] Virtual open systems, http://www.virtualopensystems.com/.
[17] S. Stabellini and I. Campbell, Xen on ARM Cortex A15, Xen Summit,
2012.
[18] Redbend software: vlogix mobile for mobile virtualization,
http://www.redbend.com/en/products-services/mobilevirtualization/how-it-works.
[19] G. Heiser and B. Leslie, The okl4 microvisor: convergence point
of microkernels and hypervisors, in Proceedings of the rst ACM
asia-pacic workshop on Workshop on systems, ser. APSys 10.
New York, NY, USA: ACM, 2010, pp. 1924. [Online]. Available:
http://doi.acm.org/10.1145/1851276.1851282
[20] ARM, big.little processing, http://www.arm.com/products/processors/
technologies/biglittleprocessing.php.
[21] M. Zyngier, KVM on arm64, http://lwn.net/Articles/529848/.
[22] ARM, ARM architecture reference manual ARMv7-a and ARMv7-r
edition. [Online]. Available: http://infocenter.arm.com/help/topic/
com.arm.doc.ddi0406b/index.html
[23] AMD, Amd64 virtualization codenamed pacica technology:
Secure
virtual
machine
architecture
reference
manual,
http://www.cs.utexas.edu/users/hunt/class/2005-fall/cs352/docsem64t/AMD/virtualization-33047.pdf.
[24] F. Bellard, Qemu, a fast and portable dynamic translator,
in Proceedings of the annual conference on USENIX Annual
Technical Conference, ser. ATEC 05. Berkeley, CA, USA:
USENIX Association, 2005, pp. 4141. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1247360.1247401
[25] R. Russell, virtio: towards a de-facto standard for virtual i/o devices,
SIGOPS Oper. Syst. Rev., vol. 42, no. 5, pp. 95103, jul 2008.
[Online]. Available: http://doi.acm.org/10.1145/1400097.1400108
[26] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt,
and A. Wareld, Live migration of virtual machines, in Proceedings
of the 2nd conference on Symposium on Networked Systems Design
& Implementation - Volume 2, ser. NSDI05. Berkeley, CA,
USA: USENIX Association, 2005, pp. 273–286. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1251203.1251223
Marvell, ARMADA-XP overview, http://www.marvell.com/embedded-processors/armada-xp/.
[28] A. Kopytov, Sysbench, a system performance benchmark,
http://sysbench.sourceforge.net/.
[29] T. A. S. Foundation, ab - apache http server benchmarking tool,
http://httpd.apache.org/docs/2.0/programs/ab.html.
[30] A. Tirumala, F. Qin, J. Dugan, J. Ferguson, and K. Gibbs, Iperf: The
tcp/udp bandwidth measurement tool, http://iperf.sourceforge.net/.
[31] M. Lord, hdparm, http://hdparm.sourceforge.net/.
[32] I. Smith and K. Lucas, Unixbench, http://code.google.com/p/byte-unixbench/.

[1] Marvell powers Dell Copper ARM server, http://www.marvell.com/company/news/pressDetail.do?releaseID=2396.
[2] Calxeda ecx-1000, http://www.calxeda.com/technology/products/
processors/ecx-1000-series/.
[3] ZT Systems announces ARM-based server solution, http://ztsystems.com/Default.aspx?tabid=1484.
[4] Intel Virtualization Technology, http://www.intel.com/technology/
virtualization/.
[5] C. Dall and J. Nieh, Kvm for arm, in Proceedings of the Ottawa
Linux Symposium, Ottawa, Canada, 2010.
[6] S. Sang-bum, Secure xen on ARM: Status and driver domain separation, 5th Xen Summit, 2007.
[7] VMware, Horizon mobile, http://www.vmware.com/products/
desktop virtualization/mobile/overview.html.
[8] Embeddedxen virtualization framework, http://sourceforge.net/
projects/embeddedxen/.
[9] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris,
A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, Xen
and the art of virtualization, SIGOPS Oper. Syst. Rev.,
vol. 37, no. 5, pp. 164–177, October 2003. [Online]. Available:
http://doi.acm.org/10.1145/1165389.945462
[10] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, kvm: the
linux virtual machine monitor, in Proceedings of the Linux Symposium,
vol. 1, 2007, pp. 225–230.
[11] J. Ding, C. Lin, P. Chang, C. Tsang, W. Hsu, and Y. Chung, ARMvisor:
System virtualization for ARM, in Linux Symposium, 2012, p. 95.
[12] K. Adams and O. Agesen, A comparison of software and
hardware techniques for x86 virtualization, in Proceedings of
the 12th international conference on Architectural support for
programming languages and operating systems, ser. ASPLOS XII.
New York, NY, USA: ACM, 2006, pp. 2–13. [Online]. Available:
http://doi.acm.org/10.1145/1168857.1168860

