Mikhail Zhidko, Yingshiuan Pan, Po-Jui Tsao, Kuang-Chih Liu, Tzi-Cker Chiueh
Industrial Technology Research Institute, Hsinchu, Taiwan
Emails: alexey@itri.org.tw, mikhail@itri.org.tw, yspan@itri.org.tw, pjtsao@itri.org.tw, kcliu@itri.org.tw, tcc@itri.org.tw
Abstract—Because of its enormous popularity in embedded systems and mobile devices, the ARM CPU is arguably the most widely used CPU in the world. The resulting economies-of-scale benefit entices system architects to ponder the feasibility of building lower-cost and lower-power-consumption servers using ARM CPUs. In modern data centers, especially those built to host cloud applications, virtualization is a must, so how to support virtualization on ARM CPUs becomes a major issue in constructing ARM-based servers. Although the latest versions of the ARM architecture (Cortex-A15 and beyond) provide hardware support for virtualization, the majority of ARM-based SOCs (system-on-chip) currently available on the market do not. This paper presents the results of an evaluation study of a fully operational hypervisor that successfully runs multiple VMs on an ARM Cortex-A9-based server, which is architecturally non-virtualizable, and supports VM migration. This hypervisor features several optimizations that significantly reduce the performance overhead of virtualization, including physical memory remapping and batching of sensitive/privileged instruction emulation.
I. INTRODUCTION
Today the compute server market is dominated by x86-based CPUs. The resulting servers are relatively power-hungry and expensive. In contrast, the ARM CPU has been the dominant CPU in embedded systems and mobile devices because of its emphasis on low-power design and lower cost. As a result, ARM-based server boards are emerging from a number of hardware vendors such as Marvell [1], Calxeda [2], and ZT Systems [3]. These boards are based on ARM Cortex-A9 CPUs, the most popular ARM CPUs commercially available today, but they lack server-grade hypervisor support. Compared with their x86 counterparts, these server boards may have lower absolute performance, but their lower cost and lower power consumption allow them to fare well in terms of metrics such as the amount of work done per watt and the amount of work done per dollar.

Because cloud data centers use server virtualization technology extensively in their resource allocation and management, an immediate requirement for ARM-based servers is that they provide the same server virtualization capability available on current x86 servers. Although the latest ARM architecture, Cortex-A15 and beyond, does provide architectural support for virtualization, the majority of ARM-based SOCs currently available on the market do not.

Although there existed hypervisors for ARM-based mobile devices [5], [6], [7] or embedded systems [8], none of them satisfy all three requirements listed above, and therefore they are not ready for deployment on data center servers. In contrast, the ITRI ARM hypervisor evaluated in this paper is the first known server-grade hypervisor that runs on a state-of-the-art ARM-based SOC, the Marvell ARMADA-XP, which comes with a quad-core 1.6 GHz Cortex-A9 CPU and consumes less than 10 W, and that supports concurrent execution of multiple VMs and VM migration.
II. RELATED WORK
A. Marvell ARMADA-XP
Marvell ARMADA-XP is a SOC containing a quad-core Cortex-A9 CPU, which supports seven modes of operation, such as user mode, supervisor mode, abort mode, and undefined mode. The ITRI ARM hypervisor uses only user (unprivileged) mode and supervisor mode (also known as SVC mode or kernel mode).
All ARM instructions are of the same size, four bytes, which greatly simplifies binary rewriting. However, not every instruction can run in every possible mode. Some privileged instructions (PIs) can only be executed through a specific coprocessor. For example, most MMU instructions can run only on coprocessor 15 and in SVC mode. When these instructions are executed in any other mode, the CPU generates an undefined-instruction exception. In addition, there are so-called sensitive instructions (SIs), which can only be executed correctly in SVC mode but do not trap when executed in user mode; the effect of their execution in user mode is undefined. Examples of such instructions are MSR and MRS, which write to and read from the program status registers, respectively. Finally, the ARM architecture supports a domain mechanism [22], which allows a process's address space to be partitioned into up to 16 protection domains whose accessibility can be dynamically controlled.
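As a minimal illustration of why MRS and MSR are problematic, the following user-mode snippet (a sketch written for this discussion, not code from the hypervisor) reads the CPSR with MRS and then attempts to switch the mode bits to SVC with MSR; on a Cortex-A9 the write is silently ignored rather than trapped, which is precisely the behavior that defeats classical trap-and-emulate virtualization.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t before, after;

    /* MRS: read the current program status register; executes in user
       mode without trapping. */
    __asm__ volatile("mrs %0, cpsr" : "=r"(before));

    /* MSR: try to set the mode bits to SVC (0b10011).  In user mode the
       CPU silently ignores the change instead of raising an exception,
       so a hypervisor never gets a chance to intercept it. */
    __asm__ volatile("msr cpsr_c, %0" : : "r"((before & ~0x1Fu) | 0x13u));

    __asm__ volatile("mrs %0, cpsr" : "=r"(after));

    printf("mode bits before: 0x%02x, after attempted switch: 0x%02x\n",
           before & 0x1Fu, after & 0x1Fu);
    return 0;
}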
In addition to a Cortex-A9 CPU, the Marvell ARMADA-XP includes an L2 cache, a memory controller, a set of peripheral devices, and a system interconnect called Mbus, which, much like the PCI Express architecture, allows each Mbus device to own a window of the physical memory space and even to remap each incoming physical memory address to something else through a remapping register. An unusual feature of the ARMADA-XP is that it allows regions of the main memory (SDRAM) to be accessed as an Mbus device. Therefore, an access to the main memory can be captured by an Mbus device and then remapped. Because of this remapping capability, the physical memory space of an ARMADA-XP server is 16GB rather than 4GB. Finally, the physical memory address window associated with an Mbus device can be turned on and off at run time. When an Mbus device's physical memory address window is turned off, any physical memory access that targets that window triggers an exception. When an Mbus device's physical memory address window is turned on (off), it is on (off) for all cores of the CPU.
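The paper does not give the ARMADA-XP window register layout, so the following sketch only illustrates the mechanism the hypervisor relies on: turning a window off so that accesses to it trap, or re-enabling it with a new remap target. The register offsets, bit positions, and helper names here are hypothetical placeholders, not the real Marvell definitions.

#include <stdint.h>

/* Hypothetical MMIO layout of one Mbus address window. */
#define WIN_CTRL_OFF   0x0u   /* bit 0: window enable (placeholder)    */
#define WIN_REMAP_OFF  0x8u   /* remapped target address (placeholder) */

static inline void mmio_write32(volatile uint32_t *base, uint32_t off, uint32_t val)
{
    base[off / 4] = val;
}

/* Disable a window: any physical access falling into its range now
   raises an exception on every core, which the hypervisor can catch. */
static void mbus_window_disable(volatile uint32_t *win_base)
{
    mmio_write32(win_base, WIN_CTRL_OFF, 0);
}

/* Re-enable the window and point it at a different physical region,
   e.g. the SDRAM slice assigned to the currently scheduled guest VM. */
static void mbus_window_remap(volatile uint32_t *win_base, uint32_t new_target)
{
    mmio_write32(win_base, WIN_REMAP_OFF, new_target);
    mmio_write32(win_base, WIN_CTRL_OFF, 1u);
}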
III. THE ITRI ARM HYPERVISOR

The ITRI ARM hypervisor is similar to KVM in architecture, and runs on top of an Ubuntu Linux distribution that
comes with the ARMADA-XP reference design. The host OS
C. Performance Optimizations

1) Light-Weight Context Switch: The most common exceptions are those that can be handled by the guest OS itself without involving the host OS, e.g., a user process in a guest VM making a system call or executing certain privileged instructions. A naive implementation requires two world context switches: one from the guest VM to the host OS, and then one from the host OS back to the guest VM. The ITRI ARM hypervisor features a light-weight context switch mechanism that embeds in the exception handlers the logic to recognize this type of exception and immediately transfer control back to the guest OS without performing any world context switch. A light-weight context switch triggers a protection domain crossing, but does not require page table or exception handler table switching, and thus avoids the associated performance overhead due to flushing of the TLB and L2 cache.
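The fast path can be sketched as follows; the types and helper names (classify_trap(), emulate_in_place(), world_switch_to_host()) are hypothetical placeholders standing in for the hypervisor's internal routines, but the control flow mirrors the description above: resolve the trap in guest context whenever possible, and fall back to a full world switch only when the host OS must be involved.

/* Sketch of an exception handler with a light-weight-context-switch
   fast path; all identifiers below are illustrative placeholders. */

struct trap_frame;                 /* guest registers saved at trap time */

enum trap_kind { TRAP_SYSCALL, TRAP_SENSITIVE_INSN, TRAP_PRIV_INSN, TRAP_OTHER };

extern enum trap_kind classify_trap(const struct trap_frame *tf);
extern int emulate_in_place(struct trap_frame *tf);        /* 0 on success */
extern void world_switch_to_host(struct trap_frame *tf);   /* full switch  */

void handle_guest_trap(struct trap_frame *tf)
{
    switch (classify_trap(tf)) {
    case TRAP_SYSCALL:
    case TRAP_SENSITIVE_INSN:
    case TRAP_PRIV_INSN:
        /* Fast path: handle the trap here and return straight to the
           guest.  Only a protection-domain crossing is needed, so the
           page table and exception handler table stay in place and the
           TLB and L2 cache are not flushed. */
        if (emulate_in_place(tf) == 0)
            return;
        /* fall through: this trap needs the host OS after all */
    default:
        /* Slow path: full world context switch to the host (KVM/QEMU). */
        world_switch_to_host(tf);
    }
}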
To run a guest VM, the QEMU process loads the guest kernel image into its address space, performs a few initialization tasks, and then enters its main loop, which makes an ioctl() call to the revised KVM module, which in turn transfers control to the guest OS by making a world context switch. Execution of the guest VM occasionally triggers exceptions, because of privileged instructions, sensitive instructions, system calls, or hypercalls. The revised KVM module either handles these exceptions itself or delegates them to the QEMU emulator or to backend device drivers. The set of exception handlers used when a guest VM runs is different from the set used when the host OS runs; therefore, the world context switch includes a switch of the exception handler table.
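Because this control flow matches the standard QEMU/KVM split, the user-space side can be sketched with the stock KVM ioctl interface; the revised KVM module in the ITRI hypervisor may expose additional or slightly different commands, and guest memory and register setup are omitted here, so this is an illustration of the run loop rather than the actual ITRI code.

#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Minimal sketch of the QEMU-side main loop: create a VM and a VCPU,
   then call KVM_RUN repeatedly; each return is an exit that the
   in-kernel module chose not to handle by itself. */
int run_vcpu_loop(void)
{
    int kvm  = open("/dev/kvm", O_RDWR);
    int vm   = ioctl(kvm, KVM_CREATE_VM, 0);
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);

    /* kvm_run is the shared structure that describes why KVM_RUN returned. */
    int run_sz = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, run_sz, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);              /* world switch into the guest */

        switch (run->exit_reason) {
        case KVM_EXIT_MMIO:                   /* delegate to emulated devices */
        case KVM_EXIT_IO:
            /* ... hand off to QEMU's device emulation or virtio backends ... */
            break;
        case KVM_EXIT_SHUTDOWN:
            return 0;
        default:
            fprintf(stderr, "unhandled exit reason %d\n", run->exit_reason);
            return -1;
        }
    }
}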
Fig. 1. Binary rewriting algorithm flowchart showing the process of finding sensitive instructions (SIs) in the guest image, generating assembly wrappers that invoke the emulation function for each SI, replacing each SI with a branch to the corresponding wrapper, and appending and linking the emulation code to the guest image.
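The patching step in Fig. 1 can be illustrated with a small routine that scans a code section word by word and overwrites each sensitive instruction with an ARM B (branch) to its wrapper. The is_sensitive() and wrapper_for() helpers are placeholders for the paper's instruction tables and generated wrappers; the branch encoding itself follows the standard ARM format (condition AL, signed 24-bit word offset relative to PC+8).

#include <stddef.h>
#include <stdint.h>

extern int is_sensitive(uint32_t insn);                      /* placeholder  */
extern uint32_t wrapper_for(uint32_t insn, uint32_t site);   /* wrapper addr */

/* Encode an unconditional ARM branch from address 'from' to address 'to'.
   The offset is relative to PC+8 and stored as a signed 24-bit word count. */
static uint32_t encode_branch(uint32_t from, uint32_t to)
{
    int32_t words = (int32_t)(to - (from + 8)) >> 2;
    return 0xEA000000u | ((uint32_t)words & 0x00FFFFFFu);
}

/* Replace every sensitive instruction in a code section with a branch to
   the wrapper that emulates it and then branches back to the guest code. */
void rewrite_section(uint32_t *code, size_t n_words, uint32_t section_vaddr)
{
    for (size_t i = 0; i < n_words; i++) {
        uint32_t insn = code[i];
        if (!is_sensitive(insn))
            continue;
        uint32_t site = section_vaddr + 4u * (uint32_t)i;
        code[i] = encode_branch(site, wrapper_for(insn, site));
    }
}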
(1) ldrt r3, [r1], #4
(2) ldrt r4, [r1], #4
(3) ldrt r5, [r1], #4
(4) ...
(5) ldrt r8, [r1], #4
(6) ldrt ip, [r1], #4
(7) ldrt lr, [r1], #4
(8) subs r2, r2, #32
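The listing above is a run of consecutive ldrt (load-with-user-privilege) instructions; such runs are natural targets for the batching of sensitive-instruction emulation mentioned in the abstract, because the whole block can be redirected to a single wrapper instead of trapping or branching once per instruction. The sketch below is a hypothetical illustration of such a batched wrapper, with guest_regs and guest_load_user() standing in for the hypervisor's real guest state and guest-memory accessors.

#include <stdint.h>

/* Hypothetical guest CPU state kept by the hypervisor during emulation. */
struct guest_regs {
    uint32_t r[16];     /* r0-r12, sp (r13), lr (r14), pc (r15) */
};

/* Placeholder: load a word from guest address 'gva' with user-mode
   permission checks, i.e. the semantics of a single ldrt. */
extern uint32_t guest_load_user(struct guest_regs *g, uint32_t gva);

/* Emulate a block of "ldrt rd, [r1], #4" instructions in one call.
   'dests' lists the destination register numbers in program order,
   e.g. r3, r4, r5, ..., r8, ip (r12), lr (r14) for the block above. */
void emulate_ldrt_batch(struct guest_regs *g, const uint8_t *dests, int count)
{
    uint32_t addr = g->r[1];
    for (int i = 0; i < count; i++) {
        g->r[dests[i]] = guest_load_user(g, addr);
        addr += 4;                      /* post-indexed writeback, #4 each */
    }
    g->r[1] = addr;                     /* commit the updated base register */
}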
TABLE I. THE SET OF BENCHMARKS USED IN THE PERFORMANCE EVALUATION STUDY, INCLUDING THE TYPE OF SYSTEM RESOURCES THEY STRESS AND THEIR INVOCATION PARAMETERS

Name              | Type         | Parameters
sysbench-cpu      | CPU          | cpu-max-prime=100
sysbench-threads  | CPU, Memory  | num-threads=64, thread-yields=100, thread-locks=2
sysbench-memory   | Memory       | memory-total-size=100M
Apache Benchmark  | CPU, IO      | requests=10000, concurrency=100
iperf             | IO           | time=60
hdparm            | IO           | -t
dd                | IO           | Read: if=/dev/zero of=200MB file; Write: if=200MB file of=/dev/null
Unixbench-double  | CPU          | n/a
Unixbench-spawn   | Memory       | n/a
Fig. 2. To detect pages that are modified within a period of time, the hypervisor marks them as read-only at the beginning of the period, forces write protection faults when pages are modified, and records those pages in the dirty page bitmap.
D. Live Migration

The ITRI hypervisor also supports live migration [26], which iteratively copies the memory state of a VM from the source machine to the destination machine, and stops the VM in order to copy the residual memory state over to complete the migration when the residual memory state is sufficiently small. To keep track of the set of dirty pages that still need to be migrated, the ITRI ARM hypervisor implements a dirty page tracking mechanism, as shown in Figure 2. At the beginning of each tracking period, the hypervisor marks the guest VM's pages as read-only, so that the first write to any page triggers a protection fault and the page is recorded in the dirty page bitmap.
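The write-protection scheme of Figure 2 can be sketched as follows; the helpers set_guest_page_readonly()/set_guest_page_writable(), the dirty_bitmap array, and the page count are hypothetical placeholders for the hypervisor's actual page-table code, but the sequence is the one described above: every page starts a round read-only, the first write faults and is recorded, and the migration loop later consults the bitmap.

#include <stdint.h>
#include <string.h>

#define GUEST_PAGES 4096u           /* placeholder guest page count */

/* One bit per guest page frame, set when the page is written this round. */
static uint8_t dirty_bitmap[GUEST_PAGES / 8];

extern void set_guest_page_readonly(uint32_t gfn);   /* placeholder */
extern void set_guest_page_writable(uint32_t gfn);   /* placeholder */

/* Called at the beginning of each tracking period (pre-copy iteration). */
void dirty_tracking_start_round(void)
{
    memset(dirty_bitmap, 0, sizeof(dirty_bitmap));
    for (uint32_t gfn = 0; gfn < GUEST_PAGES; gfn++)
        set_guest_page_readonly(gfn);
}

/* Called from the write-protection fault handler when the guest writes
   a page that the tracker marked read-only. */
void dirty_tracking_on_write_fault(uint32_t gfn)
{
    dirty_bitmap[gfn / 8] |= (uint8_t)(1u << (gfn % 8));
    set_guest_page_writable(gfn);   /* only the first write per round faults */
}

/* Used by the migration loop to decide which pages must be re-sent. */
int page_is_dirty(uint32_t gfn)
{
    return (dirty_bitmap[gfn / 8] >> (gfn % 8)) & 1;
}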
IV. PERFORMANCE EVALUATION

A. Evaluation Methodology

The hardware test-bed used to evaluate the ITRI ARM hypervisor prototype is the Marvell ARMADA-XP development board [27] with a quad-core Sheeva Cortex-A9 CPU. The board has 4GB of RAM and is equipped with a SATA disk. It runs the Linux 2.6.35 kernel with Marvell's proprietary patch. The host OS is Ubuntu 9.10 (karmic) for ARM. The guest VM uses the same Linux kernel version without the Marvell patch, and boots from the same root file system. We use the arm-softmmu target of QEMU v0.14.50 with virtio patches.

Table I lists the set of benchmarks used in this performance evaluation study. Sysbench [28] is an open-source benchmark with many testing options. The Sysbench CPU test finds the maximum prime number below 100. The Sysbench threads test creates 64 threads, and each thread processes 100 requests, each of which consists of locking 2 mutexes, yielding the CPU, and then unlocking them when the thread is rescheduled. The Sysbench memory test reads/writes 100MB of data.
TABLE II. PERFORMANCE OF THE BENCHMARKS UNDER NATIVE EXECUTION, QEMU (TCG) EMULATION, AND THE ITRI ARM HYPERVISOR

Benchmark             | Native     | TCG        | ITRI-ARM
sysbench-cpu (s)      | 0.8        | 45.1       | 0.8
sysbench-threads (s)  | 1.2        | 388.7      | 6.3
sysbench-memory (s)   | 0.5        | 43.3       | 0.9
Apache Benchmark (s)  | 3.7        | 106.1      | 7.7
Iperf (Mb/s)          | 761        | 19.5       | 435
hdparm (MB/s)         | 125        | 11.1       | 17.2
dd read/write (MB/s)  | 68.1 / 118 | 2.6 / 8.8  | 19.2 / 9.6
Unixbench-double      | 72.7       | 0.6        | 71.7
Unixbench-spawn       | 222.3      | 7.1        | 182.8
TABLE III. COUNT OF AND TIME SPENT IN FIVE TYPES OF TRAPS WHEN THE ITRI ARM HYPERVISOR RUNS SYSBENCH-THREADS AND SYSBENCH-MEMORY: (1) PRIVILEGED INSTRUCTION (PI), (2) DATA ACCESS EXCEPTION (DA), (3) INTERRUPT (IRQ), (4) SYSTEM CALL (SC), AND (5) SENSITIVE INSTRUCTION (SI). LIGHT-WEIGHT CONTEXT SWITCH CAN ONLY BE APPLIED TO SC AND SOME PERCENTAGE OF DA, PI AND SI. (SYSBENCH-THREAD TOTAL = 9.2 SEC, SYSBENCH-MEMORY TOTAL = 3.9 SEC)
TABLE IV. PERFORMANCE IMPACT OF SHADOW PAGE TABLE (SPT), LIGHT-WEIGHT CONTEXT SWITCHING (LWCS), AND EMULATION OF SENSITIVE INSTRUCTIONS IN EXCEPTION HANDLERS (SIE) ON THE ITRI ARM HYPERVISOR, WHOSE OPTIMIZED SETTING IS SPT OFF, LWCS ON AND SIE OFF

Benchmark             | ITRI-ARM   | Four Guests | with SPT   | without LWCS | with SIE
sysbench-cpu (s)      | 0.8        | 0.8         | 0.8        | 1.1          | 0.8
sysbench-threads (s)  | 6.3        | 6.6         | 6.6        | 34.2         | 9.2
sysbench-memory (s)   | 0.9        | 0.9         | 0.9        | 6.3          | 3.9
Apache Benchmark (s)  | 7.7        | 14.1        | 7.7        | 10.9         | 12.7
Iperf (Mbps)          | 435        | 123         | 434        | 428          | 109
hdparm (MB/s)         | 17.2       | 16.5        | 17.1       | 17.1         | 16.7
dd read/write (MB/s)  | 19.2 / 9.6 | 11.3 / 4.8  | 18.6 / 9.1 | 12.5 / 8.8   | 1.4 / 9.5
Unixbench-double      | 71.7       | 71.6        | 71.5       | 71.6         | 71.7
Unixbench-spawn       | 182.8      | 117.5       | 40.5       | 44.1         | 95.3
TABLE V. THE PERFORMANCE OF A SET OF BENCHMARKS WHEN THEY RUN IN A NORMAL VM AND WHEN THEY RUN IN A MIGRATED VM, AND THE SERVICE DOWN TIME OF THE TEST VMS WHEN THEY ARE MIGRATED

Benchmark                 | Normal     | In Migration | Down Time (s)
sysbench-cpu2 (s)         | 10.77      | 25.91        | 2.05
sysbench-threads (s)      | 5.46       | 19.42        | 2.27
sysbench-memory2 (s)      | 8.28       | 25.73        | 2.43
hdparm (MB/s)             | 10.24      | 2.40         | 2.53
dd for write/read (MB/s)  | 7.3 / 10.0 | 6.0 / 8.0    | 4.05
Unixbench-double          | 72.4       | 72.2         | 2.05
Unixbench-spawn           | 198.5      | 197.3        | 2.13
guest VMs. For dd, surprisingly, the performance of the four-guest configuration is one half rather than one quarter of that of the one-guest configuration, because the bottleneck in this case is the virtio block device rather than the underlying disk, and so the one-guest configuration's performance is not the best it could be.
D. Live Migration Performance

Table V shows the performance of a set of benchmarks when they run in a normal VM and when they run in a VM that is being migrated from one physical machine to another. The performance differences between these two columns represent the performance impact of migration on the test applications. The Normal column has the same configuration as the ITRI-ARM column in the previous tables, except for sysbench-cpu and sysbench-memory. Because the original elapsed times of sysbench-cpu and sysbench-memory were too short to measure, we adjusted their parameters to make them execute for a longer period. Also, we put the virtual disk image on a network file system for migration. In general, the more memory state is modified, the larger the performance degradation. For CPU/memory-intensive workloads, the performance degradation is particularly significant because they modify more memory pages and thus incur higher dirty page tracking and memory state copying overhead. The Down Time column corresponds to the freeze time of the VM being migrated in each test, and ranges between 2 and 4 seconds. These down times are higher than expected because the current ITRI ARM hypervisor prototype uses only two iterations in each VM migration.
V. CONCLUSION
In this paper we present a fully functional KVM-ARM hypervisor running on the Marvell ARMADA-XP board, which has already been deployed in some server products. To the best of our knowledge, this is the first hypervisor that successfully runs multiple VMs with full isolation on a real-world ARM Cortex-A9-based SOC, which does not provide any hardware support for virtualization. Although the KVM-ARM hypervisor described here borrows ideas and implementations heavily from the KVM x86 project, it still features several innovations that are tailored to the Marvell platform, including binary rewriting to remove unnecessary cache/TLB flushing, memory virtualization using static memory partitioning and physical memory remapping, and light-weight context switching.

With these optimizations, the performance of CPU-intensive benchmarks running on VMs that sit on top of the KVM-ARM hypervisor is almost the same as native performance, but the performance of system-activity-intensive benchmarks fares much worse, sometimes with up to a factor-of-5 slowdown. Most of the performance penalty associated with system-activity-intensive benchmarks is due to sensitive instruction emulation and exception handling for privileged instructions. We plan to further optimize this KVM-ARM hypervisor, and expect to port advanced features such as VM migration, and to develop VM fault tolerance, for hypervisors on the Cortex-A15 and the next-generation Cortex-A53/A57.
REFERENCES

[1] Marvell powers Dell "Copper" ARM server. http://www.marvell.com/company/news/pressDetail.do?releaseID=2396
[2] Calxeda ECX-1000. http://www.calxeda.com/technology/products/processors/ecx-1000-series/
[3] ZT Systems announces ARM-based server solution. http://ztsystems.com/Default.aspx?tabid=1484
[4] Intel Virtualization Technology. http://www.intel.com/technology/virtualization/
[5] C. Dall and J. Nieh, "KVM for ARM," in Proceedings of the Ottawa Linux Symposium, Ottawa, Canada, 2010.
[6] S. Sang-bum, "Secure Xen on ARM: Status and driver domain separation," 5th Xen Summit, 2007.
[7] VMware, Horizon Mobile. http://www.vmware.com/products/desktop_virtualization/mobile/overview.html
[8] EmbeddedXEN virtualization framework. http://sourceforge.net/projects/embeddedxen/
[9] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 164-177, October 2003. [Online]. Available: http://doi.acm.org/10.1145/1165389.945462
[10] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, "kvm: the Linux virtual machine monitor," in Proceedings of the Linux Symposium, vol. 1, 2007, pp. 225-230.
[11] J. Ding, C. Lin, P. Chang, C. Tsang, W. Hsu, and Y. Chung, "ARMvisor: System virtualization for ARM," in Linux Symposium, 2012, p. 95.
[12] K. Adams and O. Agesen, "A comparison of software and hardware techniques for x86 virtualization," in Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XII). New York, NY, USA: ACM, 2006, pp. 2-13. [Online]. Available: http://doi.acm.org/10.1145/1168857.1168860