Abstract—State-of-the-art cluster-based data centers consisting of three tiers (Web server, application server, and database server)
are being used to host complex Web services such as e-commerce applications. The application server handles dynamic and sensitive
Web contents that need protection from eavesdropping, tampering, and forgery. Although the Secure Sockets Layer (SSL) is the most
popular protocol to provide a secure channel between a client and a cluster-based network server, its high overhead degrades the server
performance considerably and, thus, affects the server scalability. Therefore, improving the performance of SSL-enabled network
servers is critical for designing scalable and high-performance data centers. In this paper, we examine the impact of SSL offering and
SSL-session-aware distribution in cluster-based network servers. We propose a back-end forwarding scheme, called ssl_with_bf, that
employs a low-overhead user-level communication mechanism like Virtual Interface Architecture (VIA) to achieve a good load balance
among server nodes. We compare three distribution models for network servers, Round Robin (RR), ssl_with_session, and ssl_with_bf,
through simulation. The experimental results with 16-node and 32-node cluster configurations show that, although the session reuse of
ssl_with_session is critical to improve the performance of application servers, the proposed back-end forwarding scheme can further
enhance the performance due to better load balancing. The ssl_with_bf scheme can minimize the average latency by about 40 percent
and improve throughput across a variety of workloads.
Index Terms—Secure Sockets Layer, cluster, Web servers, application server layer, load distribution, user-level communication.
1 INTRODUCTION
communication to enhance the SSL-enabled network server performance.

To this end, we compare three distribution models in clusters: Round Robin (RR), ssl_with_session, and ssl_with_bf (backend_forwarding). The RR model, widely used in Web clusters, distributes requests from clients to servers using the RR scheme. ssl_with_session uses a more sophisticated distribution algorithm in which subsequent requests of the same client are forwarded to the same server, avoiding expensive SSL setup costs. The proposed ssl_with_bf uses the same distribution policy as ssl_with_session, but includes an intelligent load balancing scheme that forwards client requests from a heavily loaded back-end node to a lightly loaded node to improve the utilization across all nodes. This policy uses the underlying user-level communication for fast communication.

Extensive performance analyses with various workload and system configurations are summarized as follows: First, schemes with reusable sessions, deployed in the ssl_with_session and ssl_with_bf models, are essential to minimize the SSL overhead. Second, the average latency can be reduced by 40 percent with the proposed ssl_with_bf model compared to the ssl_with_session model, resulting in improved throughput. Third, the proposed scheme provides high utilization and better load balance across all nodes.

The rest of this paper is organized as follows: In Section 2, a brief overview of cluster-based network servers, user-level communication, and SSL is provided. Section 3 outlines three distribution models, including our proposed SSL back-end forwarding scheme, and Section 4 presents the simulation platform. The performance results are analyzed in Section 5, followed by the related work in Section 6 and the concluding remarks in Section 7.

2 BACKGROUND

In this section, we summarize a generic architecture of a cluster-based data center and the Virtual Interface Architecture (VIA) for user-level communication, and detail the SSL protocol and the performance impact of SSL on the Apache server.

2.1 Cluster-Based Data Centers

Fig. 1 depicts the typical architecture of a cluster-based data center or network server consisting of three layers: front-end Web server, mid-level application server, and back-end database server. A Web server layer in a data center is a Web system architecture that consists of multiple server nodes interconnected through a System Area Network (SAN). The Web server presents the clients a single system view through a front-end Web switch, which distributes the requests among the nodes. A request from a client goes through the Web switch to initiate a connection between the client and the Web server. When a request arrives, the Web switch distributes it to one of the servers using either a content-aware (Layer-7) or a content-oblivious (Layer-4) distribution [11]. The front-end Web server provides static or simple dynamic services. The Web resources provided by the first tier are usually open to the public and, thus, do not require authentication or data encryption. Hence, the average latency of client requests in this layer is usually shorter than in the application servers.

The mid-tier, called the application server, is located between the Web servers and the back-end database. The application server has a separate load balancer and a security infrastructure such as a firewall, and should be equipped with support for databases, transaction management, communication, legacy data, and other functionalities [37]. After receiving a client's request, an application server parses and converts it to a query. Then, it sends the generated query to a database and gets back the response from the database. Finally, it converts the response into an HTML-based document and sends it back to the client. The application server provides important functionalities for online business such as online billing, banking, and inventory management. Therefore, the majority of the content here is generated dynamically and requires an adequate security mechanism.

The back-end database layer houses the most confidential and secure data. The main communication overhead of a database layer is the frequent disk access through the Storage Area Network [38].
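As a concrete illustration, the three distribution policies introduced above (RR, ssl_with_session, and ssl_with_bf) can be sketched as follows. This is a simplified sketch only: the class layout, the load metric (outstanding requests per node), and the forwarding threshold are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of the three request-distribution policies compared in
# this paper. The load metric and the forwarding threshold are assumptions
# for illustration only.

class Cluster:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.load = [0] * num_nodes   # outstanding requests per node (assumed metric)
        self.rr_next = 0
        self.session_node = {}        # client_id -> node caching its SSL session

    def dispatch_rr(self):
        """RR: rotate over nodes; a client's next request usually lands on a
        node without its SSL session, forcing a full (expensive) handshake."""
        node = self.rr_next
        self.rr_next = (self.rr_next + 1) % self.num_nodes
        return node

    def dispatch_ssl_with_session(self, client_id):
        """ssl_with_session: send a client back to the node that cached its
        SSL session, so only the cheap session-reuse handshake is needed."""
        if client_id not in self.session_node:
            self.session_node[client_id] = self.dispatch_rr()
        return self.session_node[client_id]

    def dispatch_ssl_with_bf(self, client_id, threshold=4):
        """ssl_with_bf: same sticky policy, but a heavily loaded session node
        hands the request over the low-overhead user-level network (VIA in
        the paper) to the most lightly loaded node."""
        node = self.dispatch_ssl_with_session(client_id)
        if self.load[node] >= threshold:
            node = min(range(self.num_nodes), key=lambda n: self.load[n])
        return node
```

Note how ssl_with_bf preserves the session affinity needed to avoid a full SSL handshake and only relocates the request processing itself, which is why it can balance load without paying the asymmetric-key setup cost again.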
948 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 7, JULY 2007
TABLE 1
The Cost of Cryptographic Algorithms
TABLE 2
Timing Parameters
TABLE 3
Distribution Models Used in Simulation
uniprocessor with a 1-Gbyte memory. Using this cluster, we measured the latency values with the various file sizes. Then, we obtained the above equations as a function of the message size (S) using the interpolation method. HTTP_protocol_overhead and HTTP_processing_overhead are the CPU times for the interprocess communication and request preprocessing such as path translation, respectively. We assume that the dynamic content is generated by FastCGI. The average dynamic content generation rate of FastCGI, studied in [20], is 2,250 Kbytes per second. We assume that 80 percent of the requests are dynamic and are directed to the application servers. The remaining 20 percent are serviced by the Web servers.

Based on the client behaviors studied in [31] and [18], we assume that the number of client requests per session ranges from 1 to 100. The SSL parameters used in the simulation are shown in Table 1.

4.2 Distribution Models

Table 3 shows the distribution models used for generating the Web files and incoming request intervals. Since many studies have supported that the characteristics of a Web server follow the long-tail distribution [17], [33], we use the Pareto distribution to model the Web file size and the incoming request intervals. Since the Pareto distribution does not fit well to model small files, the Lognormal distribution model is also used [5]. The Lognormal distribution models smaller Web files (≤ 10 Kbytes), whereas the Pareto distribution captures larger files (> 10 Kbytes). In the model, p represents the maximum size of the Web files, k0 is the minimum file size, and α is the Pareto shape parameter. Since the size of the dynamic requests is unpredictable, we limit the value of p to 20,000 Kbytes for our simulation. The interarrival time of the requests is also modeled by the Pareto distribution [13], where k in the interarrival time distribution represents the minimum arrival interval of the clients. In our simulation, k is varied to reflect the load of the Web cluster.

4.3 Simulator Validation

The simulator validation consists of two parts: accuracy of the intracluster communication model and the SSL-enabled Web server parameters. First, we validate the communication model of the simulator by measuring the round-trip time in the 16-node Linux cluster. Fig. 5 shows the latency numbers obtained from the simulator and from measurements. To measure the round-trip latency, we used a ping-pong application that exchanges a fixed size of data between two nodes. A ping-pong program captures the network latency accurately with minimal system overhead.

Fig. 5. Round-trip latency comparison. (a) VIA nonblocking mode. (b) VIA blocking mode.

In addition, we vary the number of ping-pong processes from one to four. In Fig. 5, we show two latency results since VIA supports two communication modes: nonblocking and blocking. In both figures, the results from the simulator match well with those from the real experiments. These experiments validate the accuracy of the underlying communication mechanism.

Second, to test the timing parameters used for modeling an SSL-enabled Web server, we compare the throughputs of the simulator and a single Apache Web server, which is configured to provide secure connections. We used the Clarknet trace as the workload for the simulator and the Apache Web server. The least recently used (LRU) cache replacement policy is employed in the simulator. Fig. 6 plots the throughput as a function of the number of clients. The throughput increases until the number of clients is 10, and after that, it enters the stable region. Fig. 6 shows that the throughputs of the simulator and Apache server are very close. The Clarknet workload is observed, showing that the distribution of transfer sizes closely follows the heavy-tailed
distribution [19], [8]. Thus, Fig. 6 shows that the file size
distribution models used in this paper are also validated
through this experiment. With the validated simulator, we
next conduct a detailed performance evaluation. All our
simulation results are collected with a 90 percent confidence
interval.
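The workload model of Section 4.2 can be sketched as follows: Lognormal sizes for small Web files, Pareto sizes for large files capped at p, and Pareto interarrival times whose minimum value k controls the offered load. The Pareto form P(x) = αk^α/x^(α+1) with mean αk/(α − 1) for α > 1 follows the text, but the concrete shape values and Lognormal parameters below are illustrative placeholders, not the paper's calibrated settings.

```python
import random

# Sketch of the hybrid workload model: Lognormal for small Web files
# (<= 10 Kbytes), truncated Pareto for large files (> 10 Kbytes, capped at
# p = 20,000 Kbytes), and Pareto interarrival times. Shape and Lognormal
# parameters are illustrative assumptions.

def pareto_sample(k, alpha):
    """Inverse-CDF draw from P(x) = alpha * k**alpha / x**(alpha + 1), x >= k."""
    u = 1.0 - random.random()          # uniform in (0, 1], avoids division by zero
    return k / (u ** (1.0 / alpha))

def file_size_kb(p=20000.0, k0=10.0, alpha=1.2, small_ratio=0.5):
    """Hybrid file-size model: Lognormal below k0 Kbytes, truncated Pareto above."""
    if random.random() < small_ratio:
        return min(random.lognormvariate(1.5, 0.8), k0)   # small files
    return min(pareto_sample(k0, alpha), p)               # heavy tail, capped at p

def interarrival(k, alpha=1.5):
    """Pareto interarrival time; a smaller k shortens the mean interval
    alpha * k / (alpha - 1) and thus increases the load on the cluster."""
    return pareto_sample(k, alpha)
```

Decreasing k in `interarrival` is exactly the knob the experiments in Section 5 turn to raise the offered load.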
5 PERFORMANCE ANALYSIS

In this paper, we consider two performance metrics: latency and throughput. Latency is the time between the arrival of a request at a server and the completion of the request. Throughput is the number of requests completed per second.

The latency and throughput results of the three models (RR, ssl_with_session, and ssl_with_bf) in a 16-node application server are plotted in Fig. 7. The results are measured as the k value decreases from 26 to 18, where k is the parameter of the Pareto distribution for the incoming request interval, P(x) = αk^α/x^(α+1). When k decreases, the request interval becomes shorter and, consequently, the load on the servers increases.

Fig. 7. Average latency and throughput of a 16-node application server. (a) Latency. (b) Throughput.

Fig. 7a shows the average request latency with the RR, ssl_with_session, and ssl_with_bf models. We can observe that RR shows the worst performance since subsequent requests from a client are not likely to be forwarded to the same server that caches the previous session information of the client. Thus, CPU cycles are wasted to reauthenticate and negotiate keys between a client and a server. The results of RR show that the SSL setup procedure is the main bottleneck in application servers. Therefore, this experiment indicates that, in a cluster-based Web server that provides SSL connections, good performance cannot be obtained with a distribution policy that considers only the cache locality or resource utilization. User locality for the reuse of the session information is the critical factor to improve performance. Fig. 7a also shows that the proposed ssl_with_bf scheme has much lower latency than ssl_with_session. Although the difference is not clear in the figure, because the latency of the RR distributor is too high compared to the other two models, the latency of ssl_with_bf is about 50 percent less than that of ssl_with_session. Fig. 7b plots the throughput results of the three models. Like the latency result, the throughput of RR is much lower compared to the ssl_with_bf and ssl_with_session models. The ssl_with_bf model also yields a better throughput compared to ssl_with_session as the load increases. Since Fig. 7 obviously shows that RR cannot yield good performance, we omit the RR results in the forwarding discussion.

Fig. 8 shows the latency and throughput results in a 32-node application server. The tendency of these results is very similar to the results of the 16-node cluster. The average latency of ssl_with_bf reduces by about 40 percent and the throughput of ssl_with_bf increases up to 10 percent compared to ssl_with_session in a 32-node cluster environment.

In both results, the range of k values is chosen such that the average utilization of the servers lies between 50 percent and 85 percent. When the servers in the cluster are less loaded than this, the ssl_with_bf scheme does not help since the CPU of each server has enough computation power to process all the requests. Beyond this point, because all of the servers are overloaded, the load balancing mechanism of the ssl_with_bf scheme is not effective.

Next, to further analyze these latency results, the normalized breakdown of the CPU time in a server is presented in Fig. 9. The total measured time is the duration in which the Web cluster completes 100,000 requests. The CPU time is divided into four categories:

1. asymmetric key operation time,
2. symmetric key and hash operation time,
3. the time the CPU needs for servicing static and dynamic jobs, such as memory access time and TCP/Internet Protocol (IP) connection time, and
4. idle time.

The RR scheme spends over 90 percent of the CPU time on the asymmetric key operation. Even the ssl_with_bf and ssl_with_session models spend the largest portion of the
KIM ET AL.: AN SSL BACK-END FORWARDING SCHEME IN CLUSTER-BASED WEB SERVERS 953
TABLE 4
Normalized Utilization and Standard Deviation of
ssl_with_session to That of ssl_with_bf
Fig. 10. Impact of the session period on performance (Node = 32). (a) Latency. (b) Throughput.
Fig. 11. Latency and throughput variation with file size (N = 32, k = 20). (a) Latency. (b) Throughput.
ave to 2×ave, and at the 2×ave point, the latency of ssl_with_bf is 90 percent of the ssl_with_session latency. The ssl_with_bf latency drops until the period is 4×ave and slightly decreases until the 8×ave point. At this point, the latency of ssl_with_bf is about 45 percent of the ssl_with_session latency.

With the lower load (k = 14), the performance results of the two models are much better than those of k = 12, and the latency drops until the 4×ave point. The throughput results of ssl_with_session and ssl_with_bf in Fig. 10b show a similar tendency as the latency results. The ssl_with_bf model yields a better throughput than the ssl_with_session model over the observation window. The results in Fig. 10 indicate that the session period should be at least 4×ave to reap the benefits of session reuse.

Next, we plot the effect of the average file size on the two models in a 32-node application server in Fig. 11. The average latency of both models increases and the throughputs decrease when the average file size increases. However, the performance of the ssl_with_session model degrades more rapidly than that of the ssl_with_bf model because, when the file size becomes large, the load imbalance among the servers in the cluster increases.

The ssl_with_bf model alleviates the load imbalance using the back-end forwarding algorithm, whereas, in the ssl_with_session model, when a node has an outstanding request that requires generating a large dynamic content, subsequent requests suffer from a large waiting time. Therefore, the latency difference increases as the file size becomes larger. The throughput of the two models also decreases with a larger file size, but the throughput of ssl_with_bf remains relatively constant until the average file size becomes 45 Kbytes, compared to the ssl_with_session model.

In the next experiment, we consider the clients' access behavior. Mitzenmacher [27] analyzed the clients' behavior, which indicates that most of the clients access Web servers with low access rates, whereas a few clients show aggressive accesses to the Web servers. Therefore, we divided the clients into two groups: busy clients and normal clients. The busy clients generate frequent requests to the servers, whereas the normal clients access the Web resource less frequently. In our experiment, 10 percent of the requests are from busy clients and 90 percent are from normal clients. The interval times of requests in the two scenarios are adjusted to have the same average interval (the expected value for the Pareto distribution, αk/(α − 1)).

In Fig. 12, we compared four models: ssl_with_bf, mixed_ssl_with_bf, ssl_with_session, and mixed_ssl_with_session. ssl_with_bf and ssl_with_session are the same models used in the previous experiments, and mixed_ssl_with_bf and mixed_ssl_with_session are the results with busy and normal clients. Fig. 12 shows very interesting results. The latency of mixed_ssl_with_session slightly increases, whereas mixed_ssl_with_bf shows latency reduced by up to 40 percent compared to the results with homogeneous clients.

The reason why the mixed_ssl_with_bf model shows a better latency is that the busy clients have more requests per session and, thus, the application server can service more requests with fewer asymmetric key operations. Since the asymmetric key negotiation phase is the major performance bottleneck, mixed_ssl_with_bf can yield a better performance than ssl_with_bf. mixed_ssl_with_session also has the same advantage; however, the high skewness due to the busy clients increases the imbalance among the nodes, which offsets this advantage.
Throughout this paper, we assume that 80 percent of the requests are dynamic and are directed to the application servers. The remaining 20 percent are serviced by the Web servers. Fig. 13 shows the throughput and latency results in a 32-node cluster when the percentage of dynamic requests becomes 90 percent. Fig. 13 shows a similar tendency with Fig. 8. However, the latency and throughput of both models become worse because the increased dynamic requests require additional CPU computation time.

Fig. 13. Latency and throughput when the ratio of the dynamic requests increases. (a) Latency (ms). (b) Throughput.

Fig. 14 shows the throughput results when the interval of the distributor for gathering load information of the servers varies (300 ms, 600 ms, 1,000 ms, and 2,000 ms) in a 32-node cluster. As expected, the throughput degrades as the period of the distributor becomes larger.

Fig. 14. Throughput of ssl_with_bf as a function of the distributor period.

6 RELATED WORK

In this section, we summarize the prior studies related to this work. In a single Web server environment, the cost of the SSL layer was studied by Apostolopoulos et al. [7] using the Netscape Enterprise Server and the Apache Web server, and it was shown that session reuse is critical for improving the performance of Web servers. This study was extended to a cluster system that was composed of three Web server nodes [6]. The paper described the architecture of the L5 system and presented two application experiments: routing HTTP sessions based on Uniform Resource Locators (URLs) and session-aware dispatching of SSL connections. The SSL-session reuse scheme is also investigated in [21], which presented a session-based adaptive overload control mechanism based on SSL connection differentiation and admission control. Guitart et al. [21] proposed a possible extension of the Java Secure Socket Extension (JSSE) API to allow the differentiation of resumed SSL connections from new SSL connections.

Recent studies on data centers have focused on cluster-based Web servers [9], [12], [23], and the following works are related to our research. Aron et al. [9] have proposed a back-end request forwarding scheme in cluster-based Web servers for supporting HTTP/1.1 persistent connections. The client requests are directed by a content-blind Web switch to a Web server in the cluster by a simple distribution scheme such as the RR Domain Name System (DNS). The first node that receives the request is called the initial node. The initial node parses the request and determines whether to service it locally or forward it to another node based on the cache and load balance information. The forwarded request is sent back to the initial node for responding to the client. However, this
study does not consider the impact of user-level communication and SSL-enabled application servers.

The first effort that analyzed the impact of user-level communication on distributed Web servers is the PRESS model [12]. The clients in the PRESS model communicate with the cluster using TCP over Fast Ethernet, whereas the intracluster communication uses VIA over a connectionless local area network (cLAN) [12]. It was shown that the server throughput can improve up to 30 percent by deploying VIA. Our previous work in [23] shows that, in addition to taking advantage of a user-level communication scheme, coscheduling of the communicating processes reduces the average response time by an additional 25 percent. Due to the low cost of the intracluster communication, reading a file from a remote cache turns out to be faster than reading the file from the local disk. Implementation on an 8-node cluster showed that PRESS can improve the server throughput by about 29 percent compared to the TCP/IP model. Zhou et al. [38] have deployed VIA between a database server and the storage subsystem. They implemented an interface, called Direct Storage Access (DSA), to support the Microsoft SQL Server's use of VI. However, none of these studies has investigated application server performance with SSL offering.

Amza et al. [4] have explored the characteristics of several Web sites, including auction, online bookstore, and bulletin board sites, using synthetic benchmarks. In their study, the online bookstore benchmark reveals that the CPU in the database server is the bottleneck, whereas the auction and bulletin board sites show that the CPU in the Web server is the bottleneck. Cecchet et al. [14] examined the performance and scalability issues in Enterprise JavaBeans (EJB) applications. They modeled an online auction site like eBay and experimented on it with several EJB implementations. Their test shows that the CPU on the EJB application server is the performance obstacle. In addition, the network is also saturated for some services.

The design of InfiniBand data centers is studied in [10]. It compares the performance between the Sockets Direct Protocol (SDP) and a native sockets implementation over InfiniBand (IPoIB). This paper only uses user-level communication in a data center, without any intelligent distribution algorithm or architectural support for secure transactions.

7 CONCLUSIONS

In this paper, we investigated the performance implications of the SSL protocol for providing a secure service in a cluster-based application server and proposed a back-end forwarding scheme for improving server performance through a better load balance. The proposed ssl_with_bf scheme exploits the underlying user-level communication in order to minimize the intracluster communication overhead. We compared three application server models, RR, ssl_with_session, and ssl_with_bf, through simulation. The simulation model captures the VIA communication characteristics and the application server design in sufficient detail and uses realistic numbers for SSL encryption overheads obtained from measurements.

Simulation with 16-node and 32-node cluster configurations with a variety of workloads provides the following conclusions: First, schemes with reusable sessions, deployed in the ssl_with_session and ssl_with_bf models, are essential for minimizing the SSL overhead. Second, the average latency can be reduced by about 40 percent with the ssl_with_bf model compared to the ssl_with_session model, resulting in improved throughput. Third, ssl_with_bf yields a better performance with the mixed clients, whereas the performance of the ssl_with_session model is degraded due to the increasing skewness. Finally, ssl_with_bf is more robust than ssl_with_session in handling variable file sizes. All of these results indicate that the proposed back-end forwarding scheme is a viable mechanism for improving the performance of secure cluster-based network servers.

ACKNOWLEDGMENTS

This research was supported in part by US National Science Foundation (NSF) Grants EIA-0202007, CCF-0429631, and CNS-0509251.

REFERENCES

[1] "SSLeay Description and Source," http://www2.psy.uq.edu.au/ftp/Crypto/, 2007.
[2] S. Abbott, "On the Performance of SSL and an Evolution to Cryptographic Coprocessors," Proc. RSA Conf., Jan. 1997.
[3] C. Allen and T. Dierks, The TLS Protocol Version 1.0, IETF Internet draft, work in progress, Nov. 1997.
[4] C. Amza, A. Chanda, A.L. Cox, S. Elnikety, R. Gil, E. Cecchet, J. Marguerite, K. Rajamani, and W. Zwaenepoel, "Specification and Implementation of Dynamic Web Site Benchmarks," Proc. IEEE Fifth Ann. Workshop Workload Characterization (WWC-5), Nov. 2002.
[5] M. Andreolini, E. Casalicchio, M. Colajanni, and M. Mambelli, "A Cluster-Based Web System Providing Differentiated and Guaranteed Services," Cluster Computing, vol. 7, no. 1, pp. 7-19, 2004.
[6] G. Apostolopoulos, D. Aubespin, V. Peris, P. Pradhan, and D. Saha, "Design, Implementation and Performance of a Content-Based Switch," Proc. INFOCOM, 2000.
[7] G. Apostolopoulos, V. Peris, and D. Saha, "Transport Layer Security: How Much Does It Really Cost?" Proc. INFOCOM, 1999.
[8] M.F. Arlitt and C.L. Williamson, "Internet Web Servers: Workload Characterization and Performance Implications," IEEE/ACM Trans. Networking, vol. 5, Oct. 1997.
[9] M. Aron, P. Druschel, and W. Zwaenepoel, "Efficient Support for P-HTTP in Cluster-Based Web Servers," Proc. Usenix Ann. Technical Conf., pp. 185-198, June 1999.
[10] P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D.K. Panda, "Sockets Direct Protocol over InfiniBand in Clusters: Is It Beneficial?" Proc. IEEE Int'l Symp. Performance Analysis of Systems and Software (ISPASS '04), Mar. 2004.
[11] V. Cardellini, E. Casalicchio, M. Colajanni, and P.S. Yu, "The State of the Art in Locally Distributed Web-Server Systems," ACM Computing Surveys, vol. 34, no. 2, pp. 263-311, 2002.
[12] E.V. Carrera, S. Rao, L. Iftode, and R. Bianchini, "User-Level Communication in Cluster-Based Servers," Proc. Eighth Int'l Symp. High-Performance Computer Architecture (HPCA '02), pp. 248-259, 2002.
[13] E. Casalicchio and M. Colajanni, "A Client-Aware Dispatching Algorithm for Web Clusters Providing Multiple Services," Proc. 10th Int'l World Wide Web Conf., pp. 535-544, May 2001.
[14] E. Cecchet, J. Marguerite, and W. Zwaenepoel, "Performance and Scalability of EJB Applications," Proc. 17th ACM SIGPLAN Conf. Object-Oriented Programming, Systems, Languages, and Applications, pp. 246-261, 2002.
[15] M. Colajanni, P.S. Yu, and D.M. Dias, "Analysis of Task Assignment Policies in Scalable Distributed Web-Server Systems," IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 6, pp. 585-600, June 1998.
[16] Compaq Computer Corp., Intel Corp., and Microsoft Corp., Virtual Interface Architecture Specification, Version 1.0, http://www.vidf.org, Dec. 1997.
[17] M. Crovella and P. Barford, "Self-Similarity in the World Wide Web Traffic: Evidence and Possible Cause," Proc. ACM Int'l Conf. Measurement and Modeling of Computer Systems (SIGMETRICS '96), pp. 160-169, 1996.
[18] C. Cunha, A. Bestavros, and M. Crovella, "Characteristics of WWW Client-Based Traces," Technical Report BU-CS-95-010, Boston Univ., 1995.
[19] A.B. Downey, "The Structural Cause of File Size Distributions," Proc. ACM Int'l Conf. Measurement and Modeling of Computer Systems (SIGMETRICS '01), 2001.
[20] G. Gousios and D. Spinellis, "A Comparison of Portable Dynamic Web Content Technologies for the Apache Server," Proc. Third Int'l System Administration and Network Eng. Conf. (SANE '02), pp. 103-119, 2002.
[21] J. Guitart, D. Carrera, V. Beltran, J. Torres, and E. Ayguade, "Session-Based Adaptive Overload Control for Secure Dynamic Web Applications," Proc. Int'l Conf. Parallel Processing (ICPP '05), 2005.
[22] C. Huitema, "Network vs. Server Issues in End-to-End Performance," keynote speech, Proc. Performance and Architecture of Web Servers Workshop, June 2000.
[23] J.-H. Kim, G.S. Choi, D. Ersoz, and C.R. Das, "Improving Response Time in Cluster-Based Web Servers through Coscheduling," Proc. 18th Int'l Parallel and Distributed Processing Symp., pp. 88-97, 2004.
[24] Q. Li and B. Moon, "Distributed Cooperative Apache Web Server," Proc. 10th Int'l World Wide Web Conf., pp. 1215-1229, 2001.
[25] Z. Maamar, "Commerce, E-Commerce, and M-Commerce: What Comes Next?" Comm. ACM, vol. 46, no. 12, pp. 251-257, 2003.
[26] D.A. Menasce, "Performance and Availability of Internet Data Centers," IEEE Internet Computing, vol. 8, no. 3, pp. 94-96, 2004.
[27] M. Mitzenmacher, "Dynamic Models for File Sizes and Double Pareto Distributions," Internet Math., 2004.
[28] Network Working Group, The MD5 Message-Digest Algorithm, IETF RFC 1321, http://www.ietf.org/rfc/rfc1321.txt, 1992.
[29] Network Working Group, The Use of HMAC-SHA-1-96 within ESP and AH, IETF RFC 2404, http://www.ietf.org/rfc/rfc2404.txt, 1998.
[30] Network Working Group, Triple-DES and RC2 Key Wrapping, IETF RFC 3217, http://www.ietf.org/rfc/rfc3217.txt, 2001.
[31] A. Oke and R. Bunt, "Hierarchical Workload Characterization for a Busy Web Server," LNCS, vol. 2324, p. 309, Aug. 2003.
[32] V.S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. Nahum, "Locality-Aware Request Distribution in Cluster-Based Network Servers," Proc. Eighth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, pp. 205-216, 1998.
[33] J.E. Pitkow, "Summary of WWW Characterizations," Proc. Seventh Int'l Conf. World Wide Web, pp. 551-558, 1998.
[34] R. Rivest, The RC4 Encryption Algorithm, RSA Data Security, Inc., Mar. 1992.
[35] Transport Layer Security Working Group, The SSL Protocol Version 3.0, Internet draft, work in progress, Mar. 1996.
[36] T. Wilson, "E-Biz Bucks Lost under SSL Strain," http://www.internetwk.com/lead/lead0520.htm, May 1999.
[37] M. Yousif, "Characterizing Datacenter Applications," http://www.intel.com/idf, Feb. 2003.
[38] Y. Zhou, A. Bilas, S. Jagannathan, C. Dubnicki, J.F. Philbin, and K. Li, "Experiences with VI Communication for Database Storage," Proc. 29th Ann. Int'l Symp. Computer Architecture, pp. 257-268, May 2002.

Jin-Ha Kim received the MS degree from the Department of Computer Science, Chungnam National University, and the PhD degree from the Department of Computer Science and Engineering, Pennsylvania State University, University Park, in 2005. She is currently working in the Research and Development Center at Samsung Networks. Her research interests include cluster systems, distributed systems, intracluster communication, and security on Web server communication. She is a member of the IEEE.

Gyu Sang Choi received the PhD degree in computer science and engineering from Pennsylvania State University. He is a research staff member at the Samsung Advanced Institute of Technology (SAIT) in Samsung Electronics. His research interests include parallel and distributed computing, supercomputing, cluster-based Web servers, and data centers. Currently, he is working on a real-time operating system in embedded systems at SAIT, while his prior research has been mainly focused on improving the performance of clusters. He is a member of the IEEE.

Chita R. Das received the MSc degree in electrical engineering from the Regional Engineering College, Rourkela, India, in 1981 and the PhD degree in computer science from the Center for Advanced Computer Studies, University of Louisiana at Lafayette, in 1986. Since 1986, he has been with the Pennsylvania State University, where he is currently a professor in the Department of Computer Science and Engineering. His main areas of interest are parallel and distributed computer architectures, cluster computing, mobile computing, Internet quality of service (QoS), multimedia systems, performance evaluation, and fault-tolerant computing. He has served on the editorial boards of the IEEE Transactions on Computers and IEEE Transactions on Parallel and Distributed Systems. He is a fellow of the IEEE and a member of the ACM.