
Toward Performance Optimization with CPU Offloading for Virtualized Multi-Tenant Data Center Networks

An-Dee Lin, Hubertus Franke, Chung-Sheng Li, and Wanjiun Liao

Abstract
Network virtualization is an essential technique for data center operators to provide traffic isolation, differentiated service, and security enforcement for multi-tenant services. However, traditional protocols used in local area networks may not be applicable to data center networks due to the difference in network topology. Recent research suggests that layer-2-in-layer-3 tunneling protocols may be the solution to address these challenges. In this article, we find via testbed experiments that directly applying these tunneling protocols toward network virtualization results in poor performance due to scalability problems. Specifically, we observe that the bottlenecks actually reside inside the servers. We then propose a CPU offloading mechanism that exploits a packet steering function to balance packet processing among the available CPU threads, thus greatly improving network performance. Compared to a virtualized network created based on VXLAN, our scheme improves the bandwidth by up to almost 300 percent on a 10 Gb/s link between a pair of tunnel endpoints.

Cloud computing is the fastest growing area in information technology. In order for cloud service providers to deliver cost-effective infrastructure services that include both computing and storage, it is essential to provide efficient computing, storage, and networking virtualization so that these resources can be maximally shared across a large number of tenants in a massively multi-tenant cloud environment.

Infrastructures used by major cloud service providers often involve many cloud data centers distributed across the globe, with dedicated leased lines or dark fibers connecting these data centers. Within each data center, network virtualization based on overlay networks is the primary isolation mechanism, which enables each tenant to allocate its own dedicated virtual network and assign its own network settings (e.g., routing policies and layer 2/3 address space). Overlay networks also enable explicit identification of different categories of network traffic for provisioning differentiated quality of service.

Traditional local area networks often leverage the virtual local area network (VLAN) technique to achieve isolation among different user groups within a layer 2 network. By doing so, the network administrator controls the flow of packets of each tenant within the network, and enables differentiated services and security enforcement. However, VLAN is difficult to scale in cloud data center networks, due to the limitation that only up to 4096 distinct VLAN IDs can be defined (according to IEEE 802.1q). A large cloud service provider often hosts hundreds of thousands of tenants across the globe and will face severe scalability challenges if unique VLAN IDs are needed for each tenant.

A number of tunneling protocols that encapsulate layer 2 frames in layer 3 packets have emerged to address the VLAN ID challenge [1–5]. These tunneling protocols do incur additional computing overhead resulting from performing encapsulation and decapsulation at the tunneling endpoints. In this article, we investigate the performance problems of existing tunneling protocols via testbed experiments. These experiments reveal that the effective bandwidth between the tunneling endpoints approaches 100 percent of the link rate if the link rate is at or below 1 Gb/s, indicating that there is little overhead introduced by the tunneling. However, the effective bandwidth is reduced to only approximately 25 percent of the link rate when the link rate is increased to 10 Gb/s. In other words, the overlay network only provides 2.5 Gb/s of effective bandwidth to tenants, which is highly inefficient in terms of network resource utilization. One of the reasons is that the packet processing on the recipient side is carried out by a single CPU thread, which thus forms a bottleneck.

An approach to overcome this bottleneck is proposed in [6], in which packets heading to virtual machines (VMs) are processed by multiple CPU threads. This approach indeed addresses the performance issue, but requires additional software modules to be integrated into the runtime environment. In contrast, our work focuses on properly configuring the tunneling functions in the Linux kernel and offloading the packet processing to unused CPU threads, as well as properly setting up pertinent parameters such as the maximum transmission unit (MTU). The effectiveness of the proposed solution is validated through experiments.

In the use case of the virtual network function forwarding graph, overlay networks can easily chain the VMs that perform various network functions. Some network functions, like firewalls, deep packet inspection, or load balancers, usually sit at the point where traffic flows converge. Our scheme is able to significantly benefit the bandwidth performance of an overlay endpoint that attaches to such virtualized network functions, especially on 10 Gb/s networks.

An-Dee Lin and Wanjiun Liao are with National Taiwan University. Hubertus Franke and Chung-Sheng Li are with the IBM Thomas J. Watson Research Center.


| Protocol | NVGRE | VXLAN | STT |
| --- | --- | --- | --- |
| Backer | Microsoft | VMware | VMware |
| Broadcast within overlay network | Establish a point-to-multipoint tree with an IP multicast address (all three protocols). | | |
| Network ID space | 24 bits | 24 bits | 64 bits |
| Format of encapsulated packet | GRE | UDP/IP | Modified TCP/IP |
| Per-flow equal-cost multipath | Yes; Virtual Subnet ID and FlowID are used to calculate the hash. | Yes; UDP port number is hashed. | Yes; TCP port number is hashed. |
| TCP segmentation offloading | No | No | Yes |

Table 1. Profile of tunneling protocols.

The rest of the article is organized as follows. The following section provides an overview of the most popular tunneling protocols. After that we describe the experimental setup of the tunneling environment at 1 Gb/s and 10 Gb/s. Then we report on the analysis of the experimental results and identify the performance bottlenecks. An optimized framework is then described to address these bottlenecks. The final section summarizes the lessons learned and discusses potential future directions.

Overview of Encapsulation Protocols

A number of layer-2-in-layer-3 protocols for network encapsulation have emerged within industry and academia [1–5]. Examples include Generic Routing Encapsulation (GRE) [7], Network Virtualization Using Generic Routing Encapsulation (NVGRE) [8], Virtual Extensible LAN (VXLAN) [9], and Stateless Transport Tunneling (STT) [10]. These protocols encapsulate layer 2 frames within layer 3 packets and deliver these packets through an IP network. By decoupling the tenant-defined packet header from the operator-defined packet header, operators can modify the configurations of underlay networks or servers without disrupting tenants' traffic, and can provide larger virtualized networks spanning multiple routers. Moreover, these protocols invariably provide a large address space for each tenant.

Generic Routing Encapsulation

GRE [7] is a point-to-point protocol that encapsulates a tenant packet within another IP packet. A GRE-encapsulated packet is routed through the data center network based on the destination IP address. When the packet reaches the destination, the outer IP and GRE headers are stripped, and the remaining inner packet is delivered to the tenant's application. The tenant is totally unaware of this encapsulation and decapsulation process, and presumes it is the only user in this virtualized network.

More recently, a new encapsulation framework, NVGRE [8], was proposed to better support multi-tenant environments. Specifically, this framework encapsulates a tenant's layer 2 frame. The Virtual Subnet ID (VSID) and FlowID fields are used to identify tenants' flows. The 24-bit VSID can support up to 16 million tenants and virtual LANs.

Virtual Extensible Local Area Network

In contrast to GRE and NVGRE, VXLAN [9] encapsulates each user's data frame into a new UDP/IP datagram, and extracts the data frame based on the Virtual Tunnel End Point (VTEP) protocol. A VTEP can take the form of software residing at each server endpoint or a daemon in the edge switch. It learns other VTEP addresses from the outer and inner headers of the packets it receives. VXLAN also provides a 24-bit address space for tenant IDs. Furthermore, VXLAN achieves per-flow load balancing in the network fabric by hashing on the outer UDP source port number.

Stateless Transport Tunneling

STT [10] encapsulates the layer 2 frames of each tenant into modified TCP/IP packets. The sequence number field in the outer TCP header is redefined as the STT frame length and STT fragment offset fields, and the acknowledgment number is used similarly to the IPv4 identification field for packet reassembly. Thus, STT does not maintain a state machine for retransmission. This is acceptable in data center networking because the TCP stack of the tenants' operating systems will handle the retransmission task.

NVGRE, VXLAN, and STT are the three most widely used tunneling protocols for creating overlay networks in data centers, as listed in Table 1. By decoupling tenants' virtual networks from the infrastructure network, overlay network technologies provide the key to accommodating a large number of tenants in a cloud infrastructure. First, they create an isolated virtual network for each tenant. Tenants can set up their own medium access control (MAC)/IP addresses without interfering with the MAC/IP addresses of the other tenants. From the network administrator's point of view, these tunneling protocols provide a large IP address space for the virtual networks of their tenants with simultaneous traffic isolation and security enforcement. Seamless VM migration can also be achieved as the headers of the decapsulated packets remain the same.
nation IP address. When the packet reaches the destination,
the outer IP and GRE headers are stripped, and the remain- Evaluation Methodology and Setup
ing inner packet is delivered to the tenants application. The In these experiments, we compare the effective bandwidth
tenant is totally unaware of this encapsulation and decapsula- between a pair of bare-metal servers without tunneling and
tion process, and presumes it is the only user in this virtual- that of going through a tunnel using GRE or VXLAN. All of
ized network. these configurations include pairs of vSwitches [11] residing in
More recently, a new encapsulation framework, NVGRE identical 2.4 GHz Intel Xeon servers connected by a 10 Gb/s
[7], was proposed to better support multi-tenant environments. top-of-rack (TOR) switch, as shown in Fig. 1. The vSwitches
Specifically, this framework encapsulates a tenants layer 2 inside both servers are software implementations of virtual
frame. The fields of Virtual Subnet ID (VSID) and FlowID network switches that support tunneling protocols such as
are used to identify tenants flows. This 24-bit VSID can sup- GRE and VXLAN, and thus they can function as tunneling
port up to 16 million tenants and virtual LANs. endpoints. By deploying vSwitches across servers, these end-
points form a dedicated overlay network for a tenant. The
Virtual Extensible Local Area Network most important feature of an overlay network is that it makes
In contrast to GRE and NVGRE, VXLAN [9] encapsu- the underlay fabric completely transparent and thus substan-
lates each users data frame into a new UDP/IP datagram, tially improves the flexibility. The responsibility of vSwitches
and extracts the data frame based on the Virtual Tunnel are to classify and deliver packets from applications to the
End Points (VTEP) protocol. VTEP can be in the form of underlay IP network and vice versa through pushing or pop-
software residing at each server endpoint or a daemon in ping tunneling headers.
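The article does not list the exact commands used to build this testbed; the following is a minimal sketch of how a matching pair of Open vSwitch tunnel endpoints could be configured on the sending server, using the addresses of Table 2. The bridge and port names (br-vxlan, vxlan0, br-gre, gre0) are assumptions; the mirror-image commands would be run on the receiving server with remote_ip=10.10.5.25:

# VXLAN endpoint on the sending server (physical NIC address 10.10.5.25).
ovs-vsctl add-br br-vxlan
ovs-vsctl add-port br-vxlan vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.5.26
ip addr add 192.168.188.25/24 dev br-vxlan
ip link set br-vxlan up

# GRE endpoint on the same server.
ovs-vsctl add-br br-gre
ovs-vsctl add-port br-gre gre0 -- set interface gre0 type=gre options:remote_ip=10.10.5.26
ip addr add 192.168.88.25/24 dev br-gre
ip link set br-gre up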



Figure 1. The two servers are interconnected by a 10 Gb/s rack switch (not shown here) and form a 10 Gb/s Ethernet link. The virtualized networks implemented by VXLAN and GRE run over this physical 10 Gb/s link; iperf traffic is injected at the sending server and goodput is measured at the receiving server. The operating system is Ubuntu 12.04.3 Long Term Support, and Open vSwitch is version 1.11.
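The encapsulated traffic of Fig. 1 can be observed directly on the physical adapter of either server. The capture filters below are a sketch based on well-known protocol constants (VXLAN's IANA-assigned UDP port 4789 and GRE's IP protocol number 47); eth0 stands in for the 10 Gb/s adapter and is an assumption, not the testbed's actual interface name:

# Encapsulated VXLAN traffic arriving on the physical adapter
# (adjust the port if the deployment uses a non-default one).
tcpdump -ni eth0 udp port 4789
# Encapsulated GRE traffic (IP protocol 47).
tcpdump -ni eth0 ip proto 47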

| Name of endpoints | IP on sending server | IP on receiving server |
| --- | --- | --- |
| Bare metal | 10.10.5.25 | 10.10.5.26 |
| GRE | 192.168.88.25 | 192.168.88.26 |
| VXLAN | 192.168.188.25 | 192.168.188.26 |

Table 2. IP settings of the interfaces created by the vSwitches.

Each vSwitch creates an interface within the Linux kernel with the specific IP addresses assigned in Table 2. We only evaluate the performance of GRE, as we expect NVGRE to share similar characteristics for the purpose of this article. STT is not evaluated in this work.

The performance gap in the effective bandwidth represents the encapsulation and decapsulation overhead. In our experiments, available bandwidth (goodput) values are used to assess the system performance. We adopt iperf [12] as the workload generator to produce TCP traffic to exercise the bare metal link and the VXLAN and GRE tunnels.
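For reference, a typical invocation of this kind of measurement with iperf (version 2 syntax) looks like the following; the flow count and duration are illustrative, not the exact parameters used in the article:

# On the receiving server: start an iperf TCP server.
iperf -s
# On the sending server: generate 4 concurrent TCP flows for 30 seconds
# toward the receiver's VXLAN endpoint (see Table 2) and report the goodput.
iperf -c 192.168.188.26 -P 4 -t 30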
Preliminary Experiment Results

The first experiment assesses the performance of the tunneling protocols over 1 Gb/s and 10 Gb/s Ethernet with default system parameters. We create overlay networks in a data center network by directly applying GRE or VXLAN on the link. Figure 2 shows the performance as a function of the number of iperf flows for the bare metal network, the GRE overlay, and the VXLAN overlay.

While running on a 1 Gb/s network, both GRE and VXLAN achieve almost the same performance as bare metal. On the other hand, a single GRE or VXLAN flow only consumes around 25 percent of the full link bandwidth at 10 Gb/s. Obviously there are bottlenecks between the iperf server and the client. In order to ensure that iperf has saturated the available system capacity, multiple flows were generated through the VXLAN and GRE tunnels. We only observe small variations as the number of flows increases from 2 to 10, indicating that the bottlenecks exist elsewhere.

Performance Optimization

The low aggregated bandwidth suggests that there is a potential bottleneck within the system kernel. Scrutiny of the 10 Gb/s network and the pair of servers reveals two symptoms: packet fragmentation, and the fact that only one of the CPU threads is busy while the others remain idle. We perform a series of experiments to verify that a proper MTU setting and distributing the workload of packet processing among unused threads are cures for the low bandwidth syndrome.

We use wireshark [13] to sniff packets within each overlay network. The packet fragmentation phenomenon is clearly visible within the packet reassembly history. This symptom is often caused by a misconfigured MTU size. The MTU defines the maximum permissible payload size of an IP packet for a communication interface. Once the size of the total payload exceeds this value, the payload is split into two or more fragments, causing the receiving side to spend additional effort on packet reassembly. The immediate remedy is to adjust the MTU value at both ends of the tunnel, that is, the upper and lower vSwitches in Fig. 1. This adjustment makes the MTU value of the 10 Gb/s Ethernet adapter card and the vSwitch attached to it match the payload size from the application layer. After adjusting the MTU values of the GRE and VXLAN endpoints as shown in Table 3, the packet fragmentation entirely disappeared. We can thus conclude that the first remedy for improving network utilization is to optimize the MTU values on the vSwitches acting as tunnel endpoints so as to minimize or eliminate packet fragmentation.
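The adjusted values in Table 3 are consistent with the encapsulation overhead on a 1500-byte physical MTU: VXLAN adds roughly 50 bytes (20-byte outer IP header, 8-byte UDP header, 8-byte VXLAN header, and the 14-byte inner Ethernet header), leaving 1450 bytes, while the leaner GRE encapsulation adds roughly 38 bytes (20-byte outer IP, 4-byte GRE header, 14-byte inner Ethernet header), leaving 1462 bytes. A sketch of applying these values, assuming the illustrative interface names br-vxlan and br-gre from the earlier configuration sketch:

# Assumed interface names; substitute the actual vSwitch interfaces on both servers.
ip link set dev br-vxlan mtu 1450
ip link set dev br-gre mtu 1462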
The CPU utilization is investigated to identify additional performance bottlenecks. In particular, the activities of all CPU threads are monitored through sampling. We noticed that only one of the 32 CPU threads is busy, indicating that this thread is responsible for all network interrupts and packet processing, resulting in a thread-level bottleneck. To enable multiple threads to participate in the network protocol handling, receive packet steering (RPS) [14] is used to tackle the low bandwidth utilization problem.

RPS is a software load balancer that steers packets among specific CPU threads and enables idle CPU threads to participate in processing the incoming packets from the Ethernet adapter on the receiving side.
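One way to reproduce the observation of a single saturated thread is to watch per-CPU utilization while a flow is running. The mpstat tool (from the sysstat package) is not mentioned in the article and is simply one possible choice:

# Per-CPU utilization, refreshed every second; with the default settings one CPU
# shows high %soft (softirq) load while the others remain close to idle.
mpstat -P ALL 1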



Figure 2. Bandwidth performance with default settings: aggregated bandwidth (Gb/s) versus number of concurrent flows for bare metal, GRE, and VXLAN, on a 1 Gb/s link (left) and a 10 Gb/s link (right).

| Tunnel name | Default MTU | Adjusted MTU |
| --- | --- | --- |
| GRE | 1500 | 1462 |
| VXLAN | 1500 | 1450 |

Table 3. Adjusted MTU values of the tunnel endpoints.

Figure 3. Comparison of aggregated bandwidth between the VXLAN overlay with default and optimized settings: aggregated bandwidth (Gb/s) versus number of concurrent flows, with the bare metal link as reference.

RPS is disabled by default as the physical Ethernet adapter already benefits from receive side scaling (RSS) [14], a firmware version of RPS. Unlike a server-grade physical Ethernet adapter equipped with multiple receive queues, each software vSwitch has only one receive queue. Leaving RPS disabled results in all packets arriving at the vSwitch being processed by the same CPU thread. No additional packets can be received once this thread reaches 100 percent utilization, resulting in poor utilization (~25 percent) of GRE or VXLAN tunneling over the 10 Gb/s link.

To enable the RPS function on the vSwitch, we set a hexadecimal value in the file /sys/class/net/device/queues/rx-queue/rps_cpus, where device is the name of the virtual network interface generated by the vSwitch and rx-queue is the name of the appropriate receive queue. The virtual interface has only one receive queue, named rx-0. The value of rps_cpus is a bitmap that determines which CPU threads are associated with this receive queue. For example, to make the first four CPU threads participate in packet processing, rps_cpus is set to the binary value 00001111. The default value is 0, which means RPS is disabled and thus forms a bottleneck at the overlay endpoint. RPS can be turned on by the following command line:

echo f > /sys/class/net/device/queues/rx-0/rps_cpus

with root privilege. Now rps_cpus has the hexadecimal value 0000000f, indicating that the assignment is complete.
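The bitmap can also be chosen with the CPU topology in mind, for example to select one hardware thread per physical core as described below. The sketch that follows is an assumption about how that selection could be made, not the authors' exact procedure; device is the same placeholder used above:

# List which hardware threads share a physical core (hyper-thread siblings),
# so that at most one thread per core is placed in the RPS mask.
lscpu --extended=CPU,CORE,SOCKET
# Example: enable CPU threads 0, 2, 4, and 6 (one per core) -> binary 01010101 -> hex 55.
echo 55 > /sys/class/net/device/queues/rx-0/rps_cpus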
The choice of CPU thread within RPS is based on the hash value of the tuple formed by the source IP, destination IP, source port, and destination port. The packets in a receive queue with identical tuple values will go to the same CPU thread. The aggregated bandwidth therefore improves when there is more than one concurrent flow heading to the destination server, whether the flows are encapsulated or bare metal flows. The less than fully utilized threads can jointly process the received packets, preventing any single thread from being overwhelmed by network interrupts and packet decapsulation. For better results, we choose one thread from each CPU core in addition to the thread that binds to the Ethernet adapter. That RPS indeed spreads the multiple flows heading to the configured virtual network interface across threads can be validated with the following command:

watch -n1 cat /proc/softirqs

We should see that NET_RX softirqs are now distributed among multiple CPU threads.

In the next experiment, RPS is turned on for the vSwitches at both tunneling endpoints. Iperf is used to generate multiple flows from the source to the destination. We measure the aggregated bandwidth over these flows and compare the result on the VXLAN tunnel against the result on the bare metal link. The results are shown in Fig. 3.

The baseline case (indicated by the orange line) is the VXLAN-based overlay network where the receiving side has a default MTU value of 1500 bytes and RPS disabled, while the optimized case (indicated by the blue line) is the same network with a 1450-byte MTU setting and RPS enabled.



When there is only one flow, the gap between the blue and orange lines shows the benefit of avoiding packet reassembly on the receiver side. When the number of flows increases, each flow is processed by a different thread on the receiver side. As shown in Fig. 3, four concurrent flows already obtain an aggregated bandwidth close to the link rate (9.2 Gb/s, with each of the flows rated at 2.3 Gb/s), indicating that the bottlenecks have been completely removed.

These results validate that optimizing the MTU settings to prevent packet fragmentation and load balancing the packet processing by enabling RPS for the vSwitches are both essential for high networking performance on a link of 10 Gb/s or above.

Conclusion and Future Opportunities

Network overlays substantially increase the number of virtual subnets that can be created on a physical network, which in turn supports isolation and VM mobility, and can speed up the provisioning of new or existing services in a multi-tenant cloud environment. The traditional approach based on VLANs is not suitable for creating such overlays in the emerging cloud data center networks due to its fundamental scalability limitation (up to 4096 VLAN IDs). Emerging overlay network methods such as VXLAN, NVGRE, and STT aim to remove this limitation. Their efficacy, however, can be limited by performance bottlenecks introduced in the networking stack of the operating system. The primary culprit is the saturation of a single CPU thread by the required computation, resulting in underutilized links. To tackle this problem, we propose an optimized approach that includes selecting suitable MTU settings for packet integrity and enabling packet steering for CPU offloading. We have empirically validated the proposed approach and demonstrated nearly full link utilization in a 10 Gb/s environment.

The proposed optimized approach can be integrated as part of a network diagnosis and monitoring system. Based on such an integrated system, one could exploit MTU discovery so as to proactively prevent packet fragmentation in overlay networks. The integrated system could also continuously evaluate the utilization of the network and the CPU threads in order to dynamically provision unused CPU threads to participate in packet processing. This approach can easily be implemented on top of a software-defined network controller such as OpenDaylight [15]. This will further enable the network management to provide much better views of the loading of the infrastructure and to perform finer-grained network optimization for tenants. All of these require future studies.

Acknowledgment

The authors would like to thank Dr. J. Tracey and Dr. B. Karacali-Akyamac for their insights in preparing the experiments. This work was supported in part by the Excellent Research Projects of National Taiwan University under Grant Number AE00-00-04, and in part by the Ministry of Science and Technology (MOST), Taiwan, under Grant Numbers 102-2221-E-002-017-MY3 and 102-2917-I-002-074.

References
[1] N. Chowdhury and R. Boutaba, "Network Virtualization: State of the Art and Research Challenges," IEEE Commun. Mag., vol. 47, no. 7, July 2009, pp. 20–26.
[2] M. Casado et al., "Virtualizing the Network Forwarding Plane," Proc. ACM PRESTO, 2010.
[3] N. Chowdhury and R. Boutaba, "A Survey of Network Virtualization," Elsevier Computer Networks, vol. 54, no. 5, Apr. 2010, pp. 862–76.
[4] B. Pfaff et al., "Extending Networking into the Virtualization Layer," Proc. ACM Wksp. Hot Topics in Networks (HotNets-VIII), 2009.
[5] T. Narten et al., "Problem Statement: Overlays for Network Virtualization," IETF Internet Draft, July 2013.
[6] L. Xia et al., "Fast VMM-Based Overlay Networking for Bridging the Cloud and High Performance Computing," Springer Cluster Computing, vol. 17, no. 1, Mar. 2014, pp. 39–59.
[7] D. Farinacci et al., "Generic Routing Encapsulation (GRE)," IETF RFC 2784, Mar. 2000.
[8] M. Sridharan et al., "NVGRE: Network Virtualization Using Generic Routing Encapsulation," IETF Internet Draft, Feb. 2014.
[9] M. Mahalingam et al., "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks," IETF RFC 7348, Aug. 2014.
[10] B. Davie and J. Gross, "A Stateless Transport Tunneling Protocol for Network Virtualization (STT)," IETF Internet Draft, Apr. 2014.
[11] Open vSwitch, http://openvswitch.org/
[12] Iperf, https://iperf.fr/
[13] Wireshark, https://www.wireshark.org/
[14] "Scaling in the Linux Networking Stack," https://www.kernel.org/doc/Documentation/networking/scaling.txt
[15] OpenDaylight, http://www.opendaylight.org/

Biographies

An-Dee Lin (d00942017@ntu.edu.tw) received his M.S. degree from National Tsing Hua University, Taiwan, in 2009 and is currently a Ph.D. student at National Taiwan University. He was a recipient of a Graduate Students Study Abroad Program Award from the Ministry of Science and Technology, Taiwan, in 2013, and was a visiting scholar at the IBM Thomas J. Watson Research Center from August 2013 to July 2014. His research interests include cloud-based networking and infrastructure.

Hubertus Franke (frankeh@us.ibm.com) is a Distinguished Research Staff Member and senior manager for Software Defined Infrastructures at the IBM T. J. Watson Research Center. He received a Diplom degree in computer science from the Technical University of Karlsruhe, Germany, in 1987, and M.S. and Ph.D. degrees in electrical engineering from Vanderbilt University in 1989 and 1992, respectively. He subsequently joined IBM at the IBM T. J. Watson Research Center, where he worked on the IBM SP1/2 MPI (Message Passing Interface) subsystem, scalable operating systems, Linux scalability and multicore architectures, scalable applications, the PowerEN architecture and application space, and, most recently, cloud platforms. He has received several IBM Outstanding Innovation Awards for his work. He is an author or co-author of more than 30 patents and over 100 technical papers.

Chung-Sheng Li [F] (csli@us.ibm.com) is currently the director of Commercial Systems and has been with the IBM T. J. Watson Research Center since May 1990. His research interests include cloud computing, data center networks, security and compliance, and digital library and multimedia databases. He has authored or co-authored more than 130 journal and conference papers, and received the best paper award from IEEE Transactions on Multimedia in 2003. He is a member of the IBM Academy of Technology Leadership Team. He received his B.S.E.E. degree from National Taiwan University in 1984, and his M.S. and Ph.D. degrees in electrical engineering and computer science from the University of California, Berkeley, in 1989 and 1991, respectively.

Wanjiun Liao [F] (wjliao@ntu.edu.tw) is a Distinguished Professor and the department chair of electrical engineering at National Taiwan University. Her research interests include cloud networking, green communications, and wireless networking. She was an Associate Editor of IEEE Transactions on Wireless Communications and IEEE Transactions on Multimedia, an IEEE ComSoc Distinguished Lecturer, an IEEE Fellow Committee member, and the IEEE ComSoc Asia Pacific Board Director.

