
Horizontal Scaling of Network Applications

A Seminar Report

Submitted in partial fulfillment of the requirements


for the degree of

Master of Technology
by
Jashkumar Dave
(Roll no. 153050004)

Supervisor:
Prof. Mythili Vutukuru

Department of Computer Science and Engineering


Indian Institute of Technology Bombay
Mumbai 400076 (India)

18 April 2016

Declaration
I declare that this written submission represents my ideas in my own words and where
others' ideas or words have been included, I have adequately cited and referenced the
original sources. I declare that I have properly and accurately acknowledged all sources
used in the production of this thesis.
I also declare that I have adhered to all principles of academic honesty and integrity
and have not misrepresented or fabricated or falsified any idea/data/fact/source in my
submission. I understand that any violation of the above will be a cause for disciplinary
action by the Institute and can also evoke penal action from the sources which have thus
not been properly cited or from whom proper permission has not been taken when needed.

Jashkumar Dave
(153050004)

Date: 18 April 2016

Abstract
As the demand for a service grows, we are required to scale the service capacity to meet that demand. Horizontal scaling is an approach that scales service capacity by adding identical pieces of hardware, such as servers. An alternative to horizontal scaling is vertical scaling, but it has many limitations, which are discussed later in this report. This report describes the challenges faced while designing horizontally scalable network applications and how different papers solve them. Further, we describe how the solutions from these papers can be combined to design highly scalable network applications.


Table of Contents

Abstract
List of Figures
List of Tables
1 Introduction
2 Literature Survey
  2.1 Load Balancing
    2.1.1 Ananta: Cloud Scale Load Balancing
    2.1.2 Duet: Cloud Scale Load Balancing with Hardware and Software
  2.2 Traffic Steering
    2.2.1 SIMPLE-fying Middlebox Policy Enforcement Using SDN
    2.2.2 Enforcing Network-Wide Policies in the Presence of Dynamic Middlebox Actions using FlowTags
  2.3 State Management
    2.3.1 OpenNF: Enabling Innovation in Network Function Control
    2.3.2 Scaling Up Clustered Network Appliances with ScaleBricks
  2.4 Dynamic Scaling
    2.4.1 E2: A Framework for NFV Applications
  2.5 Latency
    2.5.1 PacketShader: a GPU-Accelerated Software Router
    2.5.2 RouteBricks: Exploiting Parallelism To Scale Software Routers
3 Discussions and Comparison
4 Conclusion
A Terminology
References
Acknowledgements

List of Figures

1.1 Vertical versus Horizontal Scaling [6]
2.1 Ananta Architecture [8]
2.2 Duet Architecture [3]
2.3 An example implementation of a web service
2.4 SIMPLE Overview [9]
2.5 Flowtags Architecture [2]
2.6 OpenNF Architecture [4]
2.7 ScaleBricks Architecture [10]
2.8 E2 Architecture [7]
2.9 PacketShader Architecture [5]
2.10 RouteBricks Architecture [1]

List of Tables

3.1 Comparison of papers

Chapter 1
Introduction
As the demand for a particular network application increases beyond a certain limit, the current hardware setup is no longer able to meet it. Thus we are required to increase the serving capacity of the network application from time to time with respect to demand. There are two approaches for scaling the capacity, viz. vertical scaling and horizontal scaling.
In vertical scaling we increase the capacity of a network application by increasing the capacity of the serving hardware. For example, we increase the RAM of a server, replace the CPU with a higher-speed CPU, etc. Increasing the capacity of a single server is limited by the availability of high-end hardware. The rate at which hardware capacity increases is very slow and cannot keep up with the rate at which demand increases. Thus we cannot rely only on vertical scaling to increase the serving capacity.
In horizontal scaling we connect multiple pieces of similar-capacity hardware (e.g. servers) to scale up the total capacity of a network application. Since we use multiple identical nodes to scale the capacity, the scaling capacity of the horizontal scheme is much larger than that of vertical scaling and is not limited by the hardware development rate. Each node in a horizontally scalable system provides a certain functionality of a network function and, in coordination with the other nodes, presents a single logical view of the network function to the users. Figure 1.1 illustrates the difference between horizontal and vertical scaling.

(a) Vertical Scaling

(b) Horizontal Scaling

Figure 1.1: Vertical versus Horizontal Scaling [6]


As we distribute the work of a network application among multiple replicas, there arises a need for load balancing, traffic steering, state management, etc. A few of these challenges are listed below.
Load balancing: Distributing the workload among multiple replicas, so that no single instance gets overloaded.
Traffic steering: Routing the packet traffic across multiple instances of network functions.
State management: A network application may maintain some state about the flows it handles. This state needs to be managed across the network application replicas for correct functioning.
Dynamic scaling: Increasing or reducing the service capacity at runtime as per the demand.
Latency: The time required to process a request may increase as we distribute the work of a network application across multiple servers; this extra overhead adds to latency.
Reliability: As we distribute the work across multiple replicas, the chance of failure also increases. The ability of the network function to sustain such failures is termed reliability.
The next chapter discusses these challenges in more detail and their solutions as described by different papers.

Chapter 2
Literature Survey
This chapter briefly describes the functionality of each paper, the challenges it handles, and an overview of its architecture.

2.1 Load Balancing
Load balancing is the process of distributing the workload among all serving nodes in some fashion. The load may be distributed equally among all nodes, in proportion to their serving capacity, or according to any other scheme. As we scale a network application horizontally, instead of running the application on a single server we run multiple instances of it over multiple servers. This creates the need for load balancing among the instances of the network application. For larger services the load balancer itself may get overloaded, so we also need to scale the load balancer.
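To make the idea concrete, the following minimal Python sketch (not taken from any of the surveyed papers; the replica names and capacities are invented) distributes incoming requests among replicas in proportion to their serving capacity.

```python
import random
from collections import Counter

# Hypothetical replicas with their relative serving capacities.
replicas = {"server-1": 4, "server-2": 2, "server-3": 1}

def pick_replica():
    """Pick a replica with probability proportional to its capacity."""
    names = list(replicas)
    weights = [replicas[n] for n in names]
    return random.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    # Over many simulated requests the split approaches the 4:2:1 capacities.
    print(Counter(pick_replica() for _ in range(7000)))
```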

2.1.1 Ananta: Cloud Scale Load Balancing

Ananta[8] presents a scalable software-based load balancer designed and used by Microsoft for its Windows Azure public cloud. To scale out the load balancer, Ananta divides the functionality of a hardware load balancer into three parts, viz. Ananta Manager, Mux and Host Agent, and deploys them on different machines, where each component is independently scalable. Since each component is independently scalable, the whole system becomes highly scalable. Another benefit of Ananta is that it can be implemented on commodity hardware and is thus fully programmable.


(a) Ananta Components

(b) Ananta Data Plane Tiers

Figure 2.1: Ananta Architecture[8].


Figure 2.1a shows the components of Ananta and their brief functionalities. Figure 2.1b shows the arrangement of Ananta components in a typical data center architecture. Ananta Manager collects traffic information from switches and health information from host agents, and uses this information for route management and source selection. Ananta Muxes are responsible for forwarding packets of inbound connections, and host agents are responsible for source-NATing the outbound traffic. Each service on the Windows Azure cloud is assigned a unique virtual IP address (VIP); this helps to present a single logical view of a service. Each virtual machine (VM) instance is assigned a direct IP address (DIP); this helps Ananta Muxes map a service's VIP to its serving VMs. Evaluations show that Ananta can handle up to 1Gbps of traffic for a single VIP and aggregated traffic of 1Tbps. Average latency was less than 1msec for 90% of packets, with an approximate median of 600 µsec. Each Mux was able to handle 800Mbps per core and up to 220 Kpps on each server.
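As an illustration of the VIP-to-DIP mapping described above, the following Python sketch hashes a flow's five-tuple to pick a DIP behind a VIP, so every packet of a flow reaches the same VM. The VIP, DIPs and five-tuple are hypothetical; the real Mux works on encapsulated packets in the data plane, and this is only the flavour of the selection step.

```python
import hashlib

# Hypothetical VIP-to-DIP mappings, of the kind a Mux would hold.
vip_to_dips = {
    "10.0.0.100": ["192.168.1.11", "192.168.1.12", "192.168.1.13"],
}

def select_dip(vip, five_tuple):
    """Hash the flow's five-tuple so all packets of a flow land on the same DIP."""
    dips = vip_to_dips[vip]
    digest = hashlib.sha1(repr(five_tuple).encode()).hexdigest()
    return dips[int(digest, 16) % len(dips)]

# (src IP, src port, dst IP, dst port, protocol) of one inbound flow.
flow = ("203.0.113.7", 51000, "10.0.0.100", 80, "tcp")
print(select_dip("10.0.0.100", flow))   # same DIP for every packet of this flow
```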

2.1.2 Duet: Cloud Scale Load Balancing with Hardware and Software

The problem with Ananta is that its latency is a bit higher than that of a hardware load balancer. Duet[3] is an extension of Ananta by Microsoft, which overcomes this latency problem. The idea behind Duet is based on the observation that the intermediate switches in a data center network store very few rules, so most of the flow-table space remains unused. Duet makes use of this unused space on the intermediate switches for storing VIP-DIP mappings. Since Duet doesn't require any extra hardware, the cost involved in implementing Duet is very low.


The working of Duet is based on the Duet controller, which gathers topology and traffic information from the underlying network. Based on this information, the Duet controller assigns flow-routing information to the switches. The Duet controller also monitors the health of each node and, in case of failures, reroutes the traffic. Duet also uses software Muxes from Ananta as a backup and for the flows that cannot be assigned to the hardware switches due to space constraints.

Figure 2.2: Duet Architecture[3].


Figure 2.2 shows the working of Duet. The requests coming for a certain VIP (service) are handled by the switch responsible for that VIP. For example, in figure 2.2, core switch C2 is responsible for VIP1 and aggregation switch A6 handles VIP2; solid lines represent VIP traffic and dotted lines represent DIP traffic. Evaluations show that Duet improves the performance of Ananta significantly. It is able to handle 1Gbps of traffic for a single VIP and aggregated traffic of 15Tbps. Median latency is reduced to 381 µsec from the 600 µsec of Ananta. Also, each switch is capable of handling 600 Kpps, almost thrice what an Ananta Mux can handle.
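A rough sketch of the placement decision faced by Duet's controller is given below: VIP-to-DIP rules are packed into the spare flow-table space of switches, and VIPs that do not fit fall back to software Muxes. The capacities, rule counts and the greedy strategy are illustrative assumptions, not Duet's actual assignment algorithm.

```python
# Hypothetical spare flow-table slots on intermediate switches.
switch_capacity = {"core-C2": 2, "agg-A6": 1}

# Hypothetical VIPs with the number of DIP rules each one needs.
vip_rule_count = {"VIP1": 2, "VIP2": 1, "VIP3": 3}

def assign_vips(vips, switches):
    """Greedily pack VIPs into spare switch table space; the rest go to software Muxes."""
    assignment, overflow = {}, []
    free = dict(switches)
    for vip, need in sorted(vips.items(), key=lambda kv: -kv[1]):
        target = next((s for s, cap in free.items() if cap >= need), None)
        if target is None:
            overflow.append(vip)        # handled by an Ananta-style software Mux
        else:
            assignment[vip] = target
            free[target] -= need
    return assignment, overflow

print(assign_vips(vip_rule_count, switch_capacity))
# ({'VIP1': 'core-C2', 'VIP2': 'agg-A6'}, ['VIP3'])
```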

2.2 Traffic Steering
An implementation of a service may require multiple network applications to work in coordination underneath. For example, for a web service we may pass the incoming traffic through a light intrusion detection system (IDS) for security purposes and, if required, through a heavy IDS as well. Figure 2.3 shows an example implementation of such a service. Such an implementation may require the same packet to be routed multiple times, possibly through the same switch. In such a case the five-tuple alone will not suffice to route the packet correctly. Since some packets from the same source to the same destination may need to pass through the heavy IDS while others may not, such dynamic routing decisions require more information. This raises the need for a more controlled way to route traffic among instances of network applications. This controlled way of routing packets among middleboxes is referred to as traffic steering. An instance of a network application is sometimes also referred to as a middlebox.

Figure 2.3: An example implementation of a web service.

2.2.1 SIMPLE-fying Middlebox Policy Enforcement Using SDN

SIMPLE[9] provides an SDN-based solution for managing and steering traffic among multiple middleboxes. It makes use of a combination of tags and tunneling to steer traffic among middleboxes. The inputs to SIMPLE are the network topology, policy graphs, and middlebox and switch constraints. Based on these inputs, the resource manager outputs a set of middlebox processing assignments that satisfies the given constraints. This is done by selecting an optimal subset of all possible assignments that satisfies the middlebox and switch constraints. This selection problem is NP-hard, and thus SIMPLE uses an approximation algorithm that provides near-optimal results. Another important component of SIMPLE is the dynamics handler, which monitors the current traffic pattern and infers the mappings between incoming and outgoing packets. This helps SIMPLE support dynamic policy chains without requiring any changes to middleboxes. The rule generator is responsible for generating the rules based on the information provided by the resource manager and the dynamics handler, and feeds them to the switches. Figure 2.4 shows an overview of the SIMPLE approach.


Figure 2.4: SIMPLE Overview[9].


Since SIMPLE requires no modification to middleboxes, there is no extra overhead induced on the middleboxes by SIMPLE. Evaluations show that it takes 300msec to generate and install rules for a 23-node topology and 1.22 sec for a 256-node topology. SIMPLE also achieves 99% of the optimal performance, and the dynamics handler was able to classify the traffic correctly in 95% of cases.
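The following Python sketch conveys the flavour of the resource manager's placement problem: traffic classes are assigned to middlebox instances without exceeding per-instance capacity. SIMPLE itself formulates this as a hard optimization problem and solves it approximately; the greedy pass, instance names and loads below are illustrative assumptions only.

```python
# Hypothetical middlebox instances with processing capacity (in load units).
instances = {"ids-1": 100, "ids-2": 100}

# Hypothetical traffic classes and their expected load.
traffic_classes = {"web": 80, "mail": 60, "dns": 40}

def greedy_assign(classes, capacity):
    """Assign each traffic class to the least-loaded instance that can absorb it."""
    load = {m: 0 for m in capacity}
    plan = {}
    for cls, demand in sorted(classes.items(), key=lambda kv: -kv[1]):
        candidates = [m for m in capacity if load[m] + demand <= capacity[m]]
        if not candidates:
            raise RuntimeError(f"no instance can absorb class {cls}")
        target = min(candidates, key=lambda m: load[m])
        plan[cls] = target
        load[target] += demand
    return plan, load

print(greedy_assign(traffic_classes, instances))
# ({'web': 'ids-1', 'mail': 'ids-2', 'dns': 'ids-2'}, {'ids-1': 80, 'ids-2': 100})
```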

2.2.2 Enforcing Network-Wide Policies in the Presence of Dynamic Middlebox Actions using FlowTags

While SIMPLE supports dynamic policies, it does so by using a similarity-based detector for packets modified by middleboxes, which may work fine in most cases but cannot be relied upon. FlowTags[2] is an improved version of SIMPLE that guarantees correct routing of packets even in the presence of dynamic policy chains. FlowTags marks each packet with a tag, which is then used by switches to route the packet and by middleboxes to mark the modifications. Thus FlowTags requires some support from middleboxes, leading to some changes being made to the middleboxes. Each middlebox and switch maintains a FlowTags table and communicates with the FlowTags controller, as shown in figure 2.5. The FlowTags controller then uses this tag information to install rules in the switches, to guide the traffic flow.


Figure 2.5: Flowtags Architecture[2].


Figure 2.5 shows the FlowTags architecture. In contrast to SIMPLE, FlowTags doesn't use a dynamics handler; instead it takes help directly from the middleboxes to learn the modifications made to packets, and tags them accordingly. As FlowTags makes changes to middleboxes, it incurs an overhead of less than 0.5%, but is able to classify all the traffic correctly. During experiments, the required changes to middlebox code were at most 75 LOC for a specific middlebox and about 250 LOC for a common library. The reduction in throughput was observed to be less than 4% in most cases.
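A toy sketch of the tag-based control loop is shown below: a controller hands out tags for flows processed by a middlebox and installs switch rules that match on the tag rather than on the (possibly rewritten) five-tuple. The class and method names are invented for illustration and do not reflect the actual FlowTags API.

```python
import itertools

class FlowTagsControllerSketch:
    """Toy controller: hands out tags to middleboxes and turns them into switch rules."""

    def __init__(self):
        self._next_tag = itertools.count(1)
        self.tag_context = {}     # tag -> (original flow, middlebox that tagged it)
        self.switch_rules = []    # (switch, match expression, action)

    def allocate_tag(self, flow, middlebox):
        tag = next(self._next_tag)
        self.tag_context[tag] = (flow, middlebox)
        return tag

    def install_route(self, tag, switch, out_port):
        # Switches match on the tag instead of the rewritten five-tuple.
        self.switch_rules.append((switch, f"tag=={tag}", f"output:{out_port}"))

ctrl = FlowTagsControllerSketch()
flow = ("203.0.113.7", "10.0.0.100", "tcp/80")
t = ctrl.allocate_tag(flow, middlebox="light-ids")
ctrl.install_route(t, switch="S2", out_port=3)   # steer tagged packets towards the heavy IDS
print(ctrl.tag_context, ctrl.switch_rules)
```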

2.3 State Management
Many network applications store some state information related to the flows they serve. For example, a state for an IDS can be the packet count of each flow. As we move from a single instance of a network application to multiple instances, we also need to move the state across these instances for the correct functioning of our application. State management is this process of moving and managing state across multiple instances of network applications.

2.3.1 OpenNF: Enabling Innovation in Network Function Control

OpenNF[4] is a network function control framework; its goals are to satisfy tight service-level agreements, accurately monitor and manipulate network traffic, and minimize operating expenses. It primarily focuses on state management and state migration between virtualized NFs to achieve these goals. OpenNF exposes an API to manage state between NFs and thus requires some changes in the NFs to support the exposed API. The OpenNF controller monitors the state of each NF and manages the load dynamically. Figure 2.6 shows the OpenNF architecture. The OpenNF controller is divided into two parts, viz. the NF state manager and the flow manager. The flow manager is responsible for communicating with switches and installing rules on them. The NF state manager monitors NF state, instructs the flow manager accordingly, and manages state migration between NF instances.

Figure 2.6: OpenNF Architecture[4].


Evaluations show that OpenNF was able to move state among NFs in 400ms at a traffic rate of 2500 packets/sec, along with loss-free and order-preserving guarantees. The average latency incurred was 60ms and the maximum latency incurred was 100ms.
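The following sketch illustrates, under simplifying assumptions, the loss-free move operation that OpenNF's API provides: packets for the flow are buffered, the per-flow state is copied to the destination NF, and the buffered packets are replayed there. The class and function names are invented; the real API additionally covers order preservation and shared state.

```python
class NFInstanceSketch:
    """Toy NF keeping per-flow packet counts, the kind of state OpenNF migrates."""

    def __init__(self, name):
        self.name = name
        self.flow_state = {}          # flow-id -> packet count

    def process(self, flow_id):
        self.flow_state[flow_id] = self.flow_state.get(flow_id, 0) + 1


def move_flow_state(flow_id, src, dst, in_flight_packets):
    """Loss-free move in the spirit of OpenNF (names here are illustrative):
    1. hold back packets for the flow instead of delivering them to src,
    2. copy the flow's state from src to dst,
    3. replay the buffered packets at dst, then switch routing to dst.
    """
    buffered = list(in_flight_packets)                      # step 1: events held back
    dst.flow_state[flow_id] = src.flow_state.pop(flow_id)   # step 2: state transfer
    for _ in buffered:                                      # step 3: replay, no loss
        dst.process(flow_id)


old_nf, new_nf = NFInstanceSketch("nf-old"), NFInstanceSketch("nf-new")
for _ in range(5):
    old_nf.process("flow-42")
move_flow_state("flow-42", old_nf, new_nf, in_flight_packets=["p6", "p7"])
print(old_nf.flow_state, new_nf.flow_state)   # {} {'flow-42': 7}
```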

2.3.2 Scaling Up Clustered Network Appliances with ScaleBricks

Sometimes, for specific applications, it is necessary to pin a particular flow to a specific node. For such needs, ScaleBricks[10] provides an approach to pin a specific flow to a specific handling node. ScaleBricks does so by using a new hash-based data structure named SetSep. SetSep is used to construct a compact Global Partition Table (GPT), which resides on each node and is used to look up a packet's handling node. This helps to route the packet to the correct handling node in at most one hop, and removes the need to maintain the full Forwarding Information Base (FIB) on each node, thus helping the system scale out widely. Figure 2.7 shows the FIB architecture of ScaleBricks. Experiments show that it reduces the lookup overhead by 23% and latency by 10% in an LTE evolved packet core. A single node was able to achieve a processing rate of 15Mpps with 2M lookup entries and 12Mpps with 32M lookup entries.


Figure 2.7: ScaleBricks Architecture[10].
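The indirection introduced by the GPT can be sketched as follows. A plain Python dictionary stands in for the compact SetSep-based GPT, and the node names, flows and ports are made up; the point is only that any ingress node can find the handling node with a single lookup and at most one extra internal hop.

```python
# Stand-in for ScaleBricks' compact Global Partition Table (GPT): it maps a flow
# key to the node that owns the full forwarding state for that flow.
gpt = {"flow-1": "node-B", "flow-2": "node-A", "flow-3": "node-C"}

# Only the owning node keeps the full FIB entry for its flows.
local_fib = {
    "node-A": {"flow-2": "port-7"},
    "node-B": {"flow-1": "port-3"},
    "node-C": {"flow-3": "port-9"},
}

def handle_packet(ingress_node, flow_id):
    """Any node can receive a packet; the GPT adds at most one internal hop."""
    owner = gpt[flow_id]
    if owner != ingress_node:
        print(f"{ingress_node}: forwarding {flow_id} to owner {owner}")
    out_port = local_fib[owner][flow_id]
    print(f"{owner}: {flow_id} -> {out_port}")

handle_packet("node-A", "flow-1")   # one extra hop, then forwarded by node-B
```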

2.4 Dynamic Scaling
Since the demand for services keeps fluctuating, it is difficult to predict how many servers are required to serve the demand. One solution is virtualization, where we can dynamically boot up virtual machines (VMs) as and when required. This is known as dynamic scaling.

2.4.1 E2: A Framework for NFV Applications

Elastic Edge (E2)[7] provides a framework for managing and dynamically scaling network functions (NFs) in a virtualized environment. E2 takes policy graphs and constraints as inputs, and computes and implements the instance graph. Figure 2.8 shows the E2 architecture. E2 has the following three major components. The E2 manager monitors the NFs, gives commands to boot or shut down NFs, and updates rules on switches. A server agent resides on each physical server and passes the current status of its NFs to the E2 manager; it is also responsible for booting and shutting down NFs as directed by the E2 manager. The E2 Dataplane (E2D) acts as an efficient multiplexer and demultiplexer between the NFs and the underlying network. Results show that E2 was able to distribute load equally between a newly booted NF and an old NF within 5 seconds. The inter-NF transfer rate achieved on a single server was 152.515 Gbps (1500B packets) and 12.76 Mpps (64B packets), with a latency of 1.56 µsec.


Figure 2.8: E2 Architecture[7].
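A toy version of such a scaling loop is sketched below. The thresholds, instance names and decision rule are invented for illustration and are much simpler than E2's actual placement and scaling logic.

```python
class ScalerSketch:
    """Toy scaling loop in the spirit of the E2 manager; thresholds are invented."""

    SCALE_OUT_AT = 0.8   # average load above which a new NF instance is booted
    SCALE_IN_AT = 0.3    # average load below which an instance is drained

    def __init__(self):
        self.instances = {"nf-1": 0.0}           # instance name -> load fraction

    def report_load(self, loads):
        """Called periodically with measured per-instance load."""
        self.instances.update(loads)
        avg = sum(self.instances.values()) / len(self.instances)
        if avg > self.SCALE_OUT_AT:
            name = f"nf-{len(self.instances) + 1}"
            self.instances[name] = 0.0           # placeholder for booting a new NF
            print(f"average load {avg:.2f}: scale out, booted {name}")
        elif avg < self.SCALE_IN_AT and len(self.instances) > 1:
            victim = min(self.instances, key=self.instances.get)
            del self.instances[victim]           # placeholder for draining and shutdown
            print(f"average load {avg:.2f}: scale in, removed {victim}")

scaler = ScalerSketch()
scaler.report_load({"nf-1": 0.9})                 # triggers scale out
scaler.report_load({"nf-1": 0.2, "nf-2": 0.1})    # triggers scale in
```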

2.5 Latency

2.5.1 PacketShader: a GPU-Accelerated Software Router

Another major challenge in horizontal scaling is maintaining low latency. PacketShader[5] provides an example of how the power of a Graphics Processing Unit (GPU) can be leveraged for faster computation in a multi-threaded environment. GPUs are capable of processing hundreds of threads in parallel for single instruction multiple data (SIMD) workloads. Many network applications can leverage this feature of the GPU, thus relieving the CPU for other computations. PacketShader demonstrates this with an example implementation of a software router, which achieved 40Gbps throughput on a single server. GPUs can be very helpful in situations where the CPU becomes the bottleneck.


Figure 2.9: PacketShader Architecture[5].


Figure 2.9 shows PacketShader's architecture. PacketShader typically collects packets in batches from the fast path and skips the Linux TCP/IP stack, but it also provides an option to use the Linux TCP/IP stack if required. The pre-shader and post-shader are programs running on the CPU (worker threads); they are responsible, respectively, for stripping out the packet headers to be processed on the GPU and for passing the processed packets to the NIC for forwarding. The shader is a program running on the GPU which performs the packet lookup for forwarding. The master thread passes the packet batches to the shader and hands the shader's output to the worker threads, which then forward the packets. PacketShader was able to achieve a full-duplex 40Gbps forwarding rate on a single server (eight 2.66 GHz CPU cores and two GPUs) without overloading the CPU. The average round-trip latency for IPv6 forwarding was around 200 µsec at 5Gbps load and 400 µsec at 25Gbps load.
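The batched pre-shader/shader/post-shader pipeline can be sketched as below. The "shader" here is an ordinary Python function standing in for the GPU kernel, and the packets and forwarding table are made up; the sketch only shows how work is handed between the stages in whole batches.

```python
# Hypothetical forwarding table used by the lookup stage.
FORWARDING_TABLE = {"10.0.1.0/24": "port-1", "10.0.2.0/24": "port-2"}

def pre_shader(raw_packets):
    """Worker (CPU) stage: strip each packet down to the fields the lookup needs."""
    return [pkt["dst"] for pkt in raw_packets]

def shader(dst_prefixes):
    """GPU stage in the real system: one lookup per packet, done for a whole batch."""
    return [FORWARDING_TABLE.get(dst, "drop") for dst in dst_prefixes]

def post_shader(raw_packets, out_ports):
    """Worker (CPU) stage: reattach results and hand packets to the NICs."""
    for pkt, port in zip(raw_packets, out_ports):
        print(f"packet {pkt['id']} -> {port}")

batch = [{"id": 1, "dst": "10.0.1.0/24"}, {"id": 2, "dst": "10.0.9.0/24"}]
post_shader(batch, shader(pre_shader(batch)))
```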


2.5.2 RouteBricks: Exploiting Parallelism To Scale Software Routers

Figure 2.10: RouteBricks Architecture[1].


Another paper, RouteBricks[1], illustrates how parallelism in CPUs can be exploited to achieve high throughput, with an example of a software router. Figure 2.10 shows the architecture of RouteBricks. Each server maintains the full forwarding base and is directly connected to every other server with low-data-rate ports. Also, each server is connected to the external world with a high-data-rate port. It uses Direct Valiant Load Balancing (Direct VLB) for load distribution among servers. RouteBricks achieved performance rates of 38.8Gbps for minimal forwarding, 19.9Gbps for IP routing, and 5.8Gbps for IPsec, for 64B packet size, at 100% CPU utilization. The estimated latency for each packet is 47.6–66.4 µsec.
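The two-phase routing idea behind Valiant load balancing can be sketched as follows; the server names and the uniform-load check are illustrative stand-ins for RouteBricks' actual Direct VLB logic rather than its implementation.

```python
import random

SERVERS = ["R1", "R2", "R3", "R4"]

def route_internally(ingress, egress, load_is_uniform=True):
    """Sketch of Valiant load balancing across a RouteBricks-style cluster.

    Classic VLB bounces a packet off a random intermediate server to spread load
    evenly; the "direct" variant skips that hop when the ingress server can tell
    that doing so will not overload any internal link (modelled here by a flag).
    """
    if ingress == egress:
        return [ingress]
    if load_is_uniform:
        return [ingress, egress]                  # direct: one internal hop
    intermediate = random.choice([s for s in SERVERS if s not in (ingress, egress)])
    return [ingress, intermediate, egress]        # classic VLB: two internal hops

print(route_internally("R1", "R3"))
print(route_internally("R1", "R3", load_is_uniform=False))
```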

Chapter 3
Discussions and Comparison
Summarizing the solutions from the different papers, we can say that Ananta and Duet can be used to distribute load among instances of network applications. ScaleBricks can be used for load distribution in scenarios where a flow is bound to a specific node. Solutions from SIMPLE and FlowTags can be used for traffic steering between NF instances. Elastic Edge can be used for dynamic scaling of a service, and OpenNF can be used to manage state among multiple replicas of a network application. RouteBricks and PacketShader provide good insight into efficient hardware utilization and can be used to tackle latency problems. Though all these solutions will not blend together directly, they can be combined with some changes.
Table 3.1 shows a brief comparison of the papers and the challenges they solve.


Table 3.1: Comparison of papers

Paper        | Load Balancing                     | Traffic Steering                      | State Management              | Latency                        | Dynamic Scaling Support
-------------|------------------------------------|---------------------------------------|-------------------------------|--------------------------------|------------------------------
Ananta       | Yes, generic to all NFs            | --                                    | --                            | 600 µsec median                | Specific to LB
Duet         | Yes, generic to all NFs            | --                                    | --                            | Lesser than Ananta (381 µsec)  | Specific to LB
SIMPLE       | --                                 | Yes (no change to middleboxes)        | --                            | --                             | --
FlowTags     | --                                 | Yes (requires changes to middleboxes) | --                            | --                             | --
OpenNF       | --                                 | --                                    | Yes, with loss-free guarantee | --                             | Yes
ScaleBricks  | By pinning flows to specific nodes | --                                    | --                            | Reduces hop count by using GPT | Specific to LB
E2           | --                                 | --                                    | --                            | --                             | Yes (API for dynamic scaling)
PacketShader | --                                 | --                                    | --                            | Optimum utilization of GPU     | --
RouteBricks  | --                                 | --                                    | --                            | Optimum utilization of CPU     | --

Chapter 4
Conclusion
We have seen the challenges associated with the horizontal scaling of network applications and the solutions to these challenges. We have also seen how each paper solves one or more of these challenges, along with the benefits and drawbacks of each approach.


Appendix A
Terminology
API

Application Programming Interface

CPU

Central Processing Unit

DIP

Direct IP

DPG

Dynamic Policy Graph

FIB

Forwarding Information Base

GPT

Global Partition Table

GPU

Graphics Processing Unit

IDS

Intrusion Detection System

LOC

Lines Of Code

LTE

Long Term Evolution

NAT

Network Address Translation

NF

Network Function

NFV

Network Function Virtualization

SDN

Software Defined Network

VIP

Virtual IP

VM

Virtual Machine

Gbps

Gigabits per second

Mbps

Megabits per second

Tbps

Terabits per second

Mpps

Million packets per second

Kpps

Kilo packets per second


References
[1] Dobrescu, M., Egi, N., Argyraki, K., Chun, B.-G., Fall, K., Iannaccone, G., Knies, A., Manesh, M., and Ratnasamy, S., 2009, RouteBricks: Exploiting parallelism to scale software routers, in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09 (ACM, New York, NY, USA), pp. 15–28.
[2] Fayazbakhsh, S. K., Sekar, V., Yu, M., and Mogul, J. C., 2013, FlowTags: Enforcing network-wide policies in the presence of dynamic middlebox actions, in Proceedings of the Second ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN '13 (ACM, New York, NY, USA), pp. 19–24.
[3] Gandhi, R., Liu, H. H., Hu, Y. C., Lu, G., Padhye, J., Yuan, L., and Zhang, M., 2014 Aug., Duet: Cloud scale load balancing with hardware and software, SIGCOMM Comput. Commun. Rev. 44, 27–38.
[4] Gember-Jacobson, A., Viswanathan, R., Prakash, C., Grandl, R., Khalid, J., Das, S., and Akella, A., 2014 Aug., OpenNF: Enabling innovation in network function control, SIGCOMM Comput. Commun. Rev. 44, 163–174.
[5] Han, S., Jang, K., Park, K., and Moon, S., 2010 Aug., PacketShader: A GPU-accelerated software router, SIGCOMM Comput. Commun. Rev. 40, 195–206.
[6] Jenkov, J., 2014, Scalable architectures, http://tutorials.jenkov.com/software-architecture/scalable-architectures.html, [Online; last updated 2014-10-31].
[7] Palkar, S., Lan, C., Han, S., Jang, K., Panda, A., Ratnasamy, S., Rizzo, L., and Shenker, S., 2015, E2: A framework for NFV applications, in Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15 (ACM, New York, NY, USA), pp. 121–136.
[8] Patel, P., Bansal, D., Yuan, L., Murthy, A., Greenberg, A., Maltz, D. A., Kern, R., Kumar, H., Zikos, M., Wu, H., Kim, C., and Karri, N., 2013 Aug., Ananta: Cloud scale load balancing, SIGCOMM Comput. Commun. Rev. 43, 207–218.
[9] Qazi, Z. A., Tu, C.-C., Chiang, L., Miao, R., Sekar, V., and Yu, M., 2013 Aug., SIMPLE-fying middlebox policy enforcement using SDN, SIGCOMM Comput. Commun. Rev. 43, 27–38.
[10] Zhou, D., Fan, B., Lim, H., Andersen, D. G., Kaminsky, M., Mitzenmacher, M., Wang, R., and Singh, A., 2015 Aug., Scaling up clustered network appliances with ScaleBricks, SIGCOMM Comput. Commun. Rev. 45, 241–254.

Acknowledgements

Jashkumar Dave
IIT Bombay
18 April 2016

