
Hadoop MapReduce in Eucalyptus Private Cloud


Johan Nilsson

May 27, 2011


Bachelors Thesis in Computing Science, 15 credits
Supervisor at CS-UmU: Daniel Henriksson
Examiner: Pedher Johansson

Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN

Abstract
This thesis investigates how to set up a private cloud using the Eucalyptus Cloud system, along with the usability, requirements and limitations of Eucalyptus as an open-source platform providing private cloud solutions. It also studies whether using the MapReduce framework, through Apache Hadoop's implementation, on top of the private Eucalyptus cloud can provide near-linear scalability in terms of time and the number of virtual machines in the cluster.
Analysis has shown that Eucalyptus is lacking in a few usability areas when setting up the cloud infrastructure, in terms of private networking and DNS lookups, yet the API that Eucalyptus provides gives benefits when migrating from public clouds like Amazon. The MapReduce framework shows an initial near-linear relation which declines as the number of virtual machines approaches the maximum of the cloud infrastructure.


Contents

1 Introduction

2 Problem Description
  2.1 Problem Statement
  2.2 Goals
  2.3 Related Work

3 Virtualized cloud environments and Hadoop MapReduce
  3.1 Virtualization
    3.1.1 Networking in virtual operating systems
  3.2 Cloud Computing
    3.2.1 Amazon's public cloud service
  3.3 Software study - Eucalyptus
    3.3.1 The different parts of Eucalyptus
    3.3.2 A quick look at the hypervisors in Eucalyptus
    3.3.3 The Metadata Service
    3.3.4 Networking modes
    3.3.5 Accessing the system
  3.4 Software study - Hadoop MapReduce & HDFS
    3.4.1 HDFS
    3.4.2 MapReduce

4 Accomplishment
  4.1 Preliminaries
  4.2 Setup, configuration and usage
    4.2.1 Setting up Eucalyptus
    4.2.2 Configuring a Hadoop image
    4.2.3 Running MapReduce on the cluster
    4.2.4 The MapReduce implementation

5 Results
  5.1 MapReduce performance times

6 Conclusions
  6.1 Restrictions and limitations
  6.2 Future work

7 Acknowledgements

References

A Scripts and code

List of Figures

3.1 A hypervisor can have multiple guest operating systems in it.
3.2 Different types of hypervisor-based server and machine virtualizations.
3.3 Simplified visualization of cloud computing.
3.4 Overview of the components in Eucalyptus on rack based servers.
3.5 Metadata request example in Eucalyptus.
3.6 The HDFS node structure.
3.7 The interaction between nodes when a file is read from HDFS.
3.8 The MapReduce phases in Hadoop MapReduce.
4.1 Eucalyptus network layout on the test servers.
4.2 The optimal physical layout compared to the test environment.
4.3 Network traffic in the test environment.
5.1 Runtimes on a 2.9 GB database.
5.2 Runtimes on a 4.0 GB database.
5.3 Runtimes on a 9.6 GB database.
5.4 Map task times on a 2.9 GB database.
5.5 Map task times on a 4.0 GB database.
5.6 Map task times on a 9.6 GB database.

Chapter 1

Introduction
By using a cloud service a company, organization or even a private person can outsource management, maintenance and administration of large clusters of servers and still keep the benefits. While using a public cloud provider is sufficient for most tasks, concerns about bandwidth, storage, data protection or pricing might encourage companies to house a private cloud. The infrastructure to control and maintain the cloud can be proprietary, like Microsoft Hyper-V Cloud [17], VMware vCloud [21] and Citrix Open Cloud [4], but there are also a number of free and open-source solutions like Eucalyptus Cloud, OpenNebula [19] and CloudStack [5].
The cloud can provide the processing power, but the framework to take advantage of these distributed instances does not inherently come with the machines. Hadoop MapReduce claims to provide very high scalability and stability across a large cluster [7]. It is meant to run on dedicated servers, but nothing prevents it from running on virtual machines.
This thesis is a study performed at Umeå University, Department of Computing Science, to provide familiarity with the cloud and its related technologies in general, focusing specifically on the Eucalyptus cloud infrastructure. It shows a means of setting up a private cloud, along with using the Hadoop MapReduce idiom/framework on top of the cloud, presenting the benefits and requirements of running MapReduce on a Eucalyptus private cloud. As a proof of concept a simple MapReduce test is implemented and run on the cloud to provide an analysis of the distributed computation of MapReduce.
The report begins with a software study of the systems used in the thesis, followed by a description of the configuration, setup and usage of Eucalyptus and Hadoop. Finally, the results from the analysis are presented along with a short conclusion.


Chapter 2

Problem Description
This thesis is two-fold. It will provide a relatively large software study of the Eucalyptus cloud and a general overview of some of the technologies it uses. It will also study what Hadoop MapReduce is and how it can be used in conjunction with Eucalyptus.
The first part of the thesis is to analyse how to set up a Eucalyptus private cloud in a small environment; what the requirements are to run and maintain it, and what problems and/or benefits the current implementation has. This is a documentation and implementation of one way to configure the infrastructure to deliver virtual machines on a small scale to a private user, company or organization.
The second part is to test how well Hadoop MapReduce performs in a virtual cluster. The machines used for the cluster will be virtual machines delivered through the Eucalyptus cloud that has been set up in the course of the thesis. A (simple) MapReduce application will be implemented to process a subset of Wikipedia's articles, and the time it takes to process this, based on the number of nodes that the cluster runs on, will be measured. In a perfect environment the MapReduce framework can deliver near-linear performance [7], but that is without the extra overhead of running on small virtual machines.

2.1 Problem Statement

By first setting up a small Eucalyptus Cloud on a few local servers the thesis can answer
which problems and obstacles there are when preparing the open-source infrastructure. The
main priority is setting up a cloud that can deliver virtual instances capable of running
Hadoop MapReduce on them to supply a base to perform the analysis of the framework.
Simplifying launching Hadoop MapReduce clusters inside the Eucalyptus Cloud is of
second priority after setting up the infrastructure and testing the feasibility of MapReduce
on virtual machines. This can include scripts, stand-alone programs or utilities beyond
Eucalyptus and/or Hadoop.


2.2 Goals

The goal of this thesis is to do a software study and an analysis of the performance and usability of Hadoop MapReduce running on top of virtual machines inside a Eucalyptus Cloud infrastructure. It will study means to set up, launch, maintain and remove virtual instances that can together form a MapReduce cluster. The following are the specific goals of this thesis:
Demonstrate a way of setting up a private cloud infrastructure using the Eucalyptus Cloud system. This includes configuring subsystems that Eucalyptus uses, like hypervisors, controller libraries and networking systems.
Create a virtual machine image containing Hadoop MapReduce that provides ease of use and minimal manual configuration at provisioning time.
Provide a way to easily create and remove virtual instances inside the private cloud, adjusting the number of Hadoop worker nodes available in the cluster.
Test the Hadoop MapReduce framework on a virtual cluster inside the private cloud. This is to show what kind of performance increase a user gains when adding more virtual nodes to the cluster, and whether it is a near-linear increase.

2.3 Related Work

Apache Whirr is a collection of scripts that has grown into a project of its own. The purpose of Whirr is to simplify controlling virtual nodes inside a cloud like Amazon Web Services [10]. Whirr handles launching, removing and maintaining instances that Hadoop can then utilize in a cluster.
Another similar controller program is Puppet [14] from Puppet Labs. This program fully controls instances and clusters inside an EC2-compatible cloud (AWS or Eucalyptus, for example). It uses a program outside the cloud infrastructure that can control whether to launch, edit or remove instances. Puppet also controls the Hadoop MapReduce cluster inside the virtual cluster. Mathias Gug, an Ubuntu developer, has tested how to deploy a virtual cluster inside an Ubuntu Enterprise Cloud using Puppet. The results can be found on his blog [13].
Hadoop's commercial and enterprise offspring, Cloudera [6], has released a distribution called CDH. The current version, version 3, contains a virtual machine with Hadoop MapReduce configured, along with Apache Whirr instructions. This is to simplify launching and configuring Hadoop MapReduce clusters inside a cloud. These releases also contain extra packages for enterprise clusters, such as Pig, Hive, Sqoop and HBase. CDH also uses Apache Whirr to simplify AWS deployment.

Chapter 3

Virtualized cloud environments and Hadoop MapReduce

This in-depth study focuses on explaining some key concepts regarding cloud computing, virtualization and clustering, along with how certain specific software solutions work based on these concepts. As some of the software is used in the practical implementation of the thesis, the in-depth study naturally focuses on how it works in a practical environment.

3.1 Virtualization

The term virtualization refers to creating a virtual environment instead of an actual physical one. This enables a physical system to run different logical solutions on it by virtually creating an environment that meets the demands of the solution. By virtually creating several different operating systems on one physical workstation, the administrator can create a cluster of computers that act as if they were physical.
There are several different methods of virtualization. Network Virtualization refers to creating virtual networks that can be used for segmenting, subnetworking or creating virtual private networks (VPNs), as a few examples. Desktop virtualization enables a user to access his local desktop from a remote location and is commonly used in large corporations or authorities to ensure security and accessibility. A more common kind of virtualization usually encountered by a home user is Application Virtualization, which enables compilation of code to machine instructions running in a certain environment. Examples of this include the Java VM and Microsoft's .NET framework. In cloud computing, Server & Machine Virtualization is extensively used to virtually create new computers that can act as a completely different operating system, independent of the underlying system it runs on [26].
Without virtualization, situations would arise where machines use only a percentage of their maximum capacity. If the server had virtualization active, enabling more operating systems to run on the physical hardware, the hardware would be used more effectively. This is why server and machine virtualization is of great benefit when creating a cloud environment: the cloud host can maximize efficiency and distribute resources without having to buy a new physical server each time a new instance is needed.
The system that keeps track of the machine virtualization is called a hypervisor. Hypervisors are mediators that translate calls from the virtualized OS to the hardware and act as a security guard. The guard prevents different virtual instances from accessing each other's memory or storage areas outside their virtual bounds. When a hypervisor creates a new virtual instance (a Guest OS), the hypervisor marks memory, CPU and storage areas to be used by that instance [22]. The underlying hardware usually limits how many virtual instances can be run on one physical machine.

Figure 3.1: A hypervisor can have multiple guest operating systems in it.
Depending on the type of hypervisor, it can either work directly with the hardware (called type 1 virtualization) or on top of an already installed OS (called type 2 virtualization). The type used varies based on which hypervisor, underlying OS and hardware are installed. These different variations impose different requirements on each system; a hypervisor might work flawlessly on one hardware/OS setup but be inoperable in a slightly different variation [26]. See figure 3.2.

Figure 3.2: Different types of hypervisor-based server and machine virtualizations.


Hypervisor-based virtualization is the most commonly used kind [22], but several different variants of it exist. Kernel-based virtualization employs specialized OS kernels, where the kernel runs a separate version of itself along with a virtual machine on the physical hardware. In practice one could say that the kernel acts as a hypervisor, and it is usually a Linux kernel that uses this technique. Hardware virtualization does not rely on any software OS, but instead uses specialized hardware along with a special hypervisor to provide virtualization. The benefit of this is that the OS running inside the hypervisor does not have to be modified, which normal software hypervisor virtualization requires [22]. Technologies for providing hardware virtualization on the CPUs (native virtualization) come from the CPU vendors, such as Intel VT-x and AMD-V.
The operating systems that run in the virtual environment are called machine images. These images can be put to sleep and then stored on the hard drive with their current installation, configuration and even running processes hibernated. When requested, the images can then be restored to their running state to continue what they were doing before hibernation. This allows dynamic activation and deactivation of resources.

3.1.1 Networking in virtual operating systems

With operating systems acting inside a hypervisor and not directly contacting the physical hardware, a problem arises when several instances want to communicate on the network. They do not actually have a physical Network Interface Card (NIC) connected to them, so the hypervisor has to ensure that the right instance receives the correct network packets.
The way the networking is handled depends on the hypervisor. There are four techniques used to create virtual NICs [26]:
NAT Networking
NAT (Network Address Translation) is the same type of technique used in common home routers. It translates an external IP address to an internal one, which enables multiple internal IPs. The packets sent are recognized by the port they are sent to and from. The hypervisor provides the NAT translation and the VMs reside in a subnetwork with the hypervisor acting as the router.
Bridge Networking
Bridging the networking is basically connecting the virtual NIC with the physical hardware NIC. The hypervisor sets up the bridge and the virtual OS connects to it, believing it to be a physical NIC. The benefit of this is that the virtual machine will show up on the local network just like any other physical machine.
Host-only
Host-only networking is the local variant of networking. The hypervisor disables networking to external machines outside of the VM, which defeats the purpose of the VM in a cloud environment. This is mostly used on local machines.
Hybrid
Hybrid networking is a combination or variation of the networking styles mentioned. These can connect to most of the other types of networking styles and in some ways can act as a bridge to a host-only VM.
Networking the virtual machines in a proper way is crucial when setting up a virtualized cloud. The virtual machines have to be able to connect to the cloud system network to provide resources.

3.2 Cloud Computing

Cloud computing is a type of distributed computing that provides elastic, dynamic processing power and storage when needed. In essence, it gives users computing power when they need it. The term cloud refers to the typical visual representation of the Internet in a diagram: a cloud. Cloud computing means that there is a collection of computers that can give the customer/user the amount of computational power needed, without them having to worry about maintenance or hardware [20].
Typically a cloud is hosted on a server farm with a large amount of clustered computers. These provide the hardware resources. The cloud provider (the organization that hosts the servers) offers an interface where users pay for a certain amount of processing power, storage or computers in a business model. These resources can then be increased or decreased based on demand, so the users only need to focus on their content whereas the provider takes care of maintenance, security and networking.

Figure 3.3: Simplified visualization of cloud computing.


The servers in the server farm are usually virtualized, although they are not required to be in order to be part of a cloud. Virtualization is a cornerstone of cloud computing; it enables the provider to maximize the processing power of the raw hardware and gives the cloud elasticity, the ability for users to scale the instances required. It also helps provide two other key features of a cloud: multitenancy, the sharing of resources, and massive scalability, the ability to have huge amounts of processing systems and storage areas (tens of thousands of systems with large amounts of terabytes or petabytes of data) [16].
There are three major types of services that can be provided from a cloud. These usually offer different levels of access for the user, ranging from control of just a few components to the operating system itself [16]:
Infrastructure as a Service (IaaS)
IaaS gives the user the most freedom and access on the systems. These can sometimes be on dedicated hardware (that is, not virtualized) where the users have to install whatever they want on the system themselves. The user is given access to the operating system, or the ability to create their own through images that they create (typically in a virtualized environment). This is used when the user wants the raw processing power of a lot of systems or needs a huge amount of storage.
Platform as a Service (PaaS)
PaaS does not give as much freedom as IaaS, but instead focuses on having key applications already installed on the delivered systems. These are used to provide the user with the systems needed in a quick and accessible way. The users can then modify the applications to their needs. An example of this would be a hosted website; the tools for hosting the website (along with extra systems like databases, a web service engine etc.) are installed and the user can create the page without having to think about networking or accessibility.
Software as a Service (SaaS)
SaaS is generally transparent to the user. It gives the user software that runs its processing in a cloud. The user can only interact with the software itself and is often unaware that it is being processed in a cloud. A quick example is Google Docs, where users can edit documents that are hosted and processed in the Google cloud.
The cloud providers in the business model often use web interfaces to enable users to increase or decrease the instances they use. They are then billed by the amount of space required or processing power used (depending on what type of service is bought) in a pay-as-you-go system. This is the IaaS type of service, which for example Amazon, Proofpoint and Rightscale can provide [16]. However, a cloud does not necessarily exist only on the Internet as a business model for delivering computational power or storage. A cloud can be public, which means that the machines delivered reside on the Internet; private, which means that the cluster is hosted locally; or hybrid, where the instances are local at the start but public cloud services can be used on demand if the private cloud does not have sufficient power [20].
Cloud computing is also used in many other systems. Google uses cloud computing to provide the backbone of their large systems such as Gmail, Google App Engine and Google Sites. Google App Engine provides a PaaS for the sole purpose of creating and hosting web sites. Google Apps is a SaaS cloud that is a distributed system for handling office types of files; a clouded Microsoft Office [16].

3.2.1 Amazon's public cloud service

Amazon was one of the first big companies to become a large cloud system provider [20]. Amazon is interesting in terms of being one of the first giants to provide an API to access data through their Amazon Web Services (AWS) and web pages. Since Eucalyptus uses the same API, albeit with a different and open-source implementation, a closer look at Amazon and their services is worthwhile. Amazon provides several different services [20], but in terms of this thesis some are of more interest:


Amazon Simple Storage Service (S3)
The Simple Storage Service is Amazon's way of providing vast amounts of storage space to the user. A user can pay for the amount of space needed, from just a few gigabytes to several petabytes. In addition, fees apply to the amount of data transferred to and from the storage. S3 uses buckets, which in layman's terms can be seen as folders to store data within. These buckets are stored somewhere inside the cloud and are replicated on several devices to provide redundancy. Using standard protocols such as HTTP, SOAP, REST and even BitTorrent to transfer the data, the Simple Storage Service provides ease of access [3] to the user.
Amazon Elastic Compute Cloud (EC2)
The Elastic Compute Cloud is a way to provide a dynamic/elastic amount of computational power. Amazon gives the user the ability to pay for nodes. These nodes are virtualized computers that can take an Amazon Machine Image (AMI) and use it as the image that runs in their virtualized environment (see section 3.1). EC2 aims at supplying large amounts of CPU and RAM to the user, but it is up to the user to write and execute the applications that use the resources [2]. These virtualized computers, nodes, are contained inside a security group, a virtual network consisting of all the EC2 nodes the user has paid for. During computation they can be linked together to provide a strong distributed computational base.
Amazon Elastic Block Store (EBS)
While S3 is focused on storage, it does not focus on speed and fast access to the data. When using the EC2 system, the data that has to be processed must be stored in a fast-to-access way to avoid downtime for the EC2 system. S3 does not provide that, so Amazon has created a way to attach virtual, reliable and fast storage devices to EC2. This is called the Elastic Block Store, EBS. EBS differs from S3 in that the volumes cannot be as small or as large as on S3 (1 GB - 1 TB on EBS compared to 1 B - 5 TB on S3 [1, 3]), but they have faster read-write times and are easier to attach to EC2 instances. One EBS volume can only be attached to one EC2 instance at a time, but one EC2 instance can have several EBS volumes attached to it. EBS also offers the ability to snapshot a volume and store the snapshot on a different storage medium, for example S3 [1].
As an example of using Amazon EC2, the New York Times used EC2 and S3 in conjunction to convert 4 TB of articles from 1851 - 1980 (around 11 million articles) stored as TIFF images to PDF format. By using 100 Amazon EC2 nodes, the NY Times converted the TIFF images to 1.5 TB of PDFs in less than 24 hours, a conversion that would have taken far greater time if done on a single computer [18]. As a side note, the NY Times also used Apache Hadoop installed on their AMIs to process the data (see section 3.4).

3.3 Software study - Eucalyptus

Eucalyptus is a free, open-source cloud management system that uses the same API as AWS. This enables tools that were originally developed for Amazon to be used with Eucalyptus, with the added benefit of Eucalyptus being free and open-source. It provides the same functionality in terms of IaaS deployment and can be used as a private, hybrid or even a public cloud system with enough hardware. Instances running inside Eucalyptus run Eucalyptus Machine Images (EMIs, cleverly named after AMIs), which can either be created by the user or downloaded as pre-packaged versions. An EMI can contain either a Windows, Linux or CentOS operating system [8]. At the time of writing Eucalyptus does not support Mac OS.

3.3.1 The different parts of Eucalyptus

Eucalyptus resides on the host operating system it is installed on. Since it uses libraries and hypervisors that are restricted to the Linux OS, it cannot be run on other operating systems like Microsoft Windows or Apple OS. When Eucalyptus starts it contacts its different components to determine the layout and setup of the systems it controls. These components are configured using configuration files in each component. They all have different responsibilities and areas, together creating a complete system that can handle dynamic creation of virtualized instances, large storage environments and user access control.
Providing the same features as Amazon in terms of computation clouds and storage, the components inside Eucalyptus have different names but equal functionality and API [8]:
Walrus
Walrus is the name of the storage container system, similar to Amazon S3. It stores data in buckets and has the same API to read and write data in a redundant system. Eucalyptus offers a way to limit access to and size of the storage buckets through the same means as S3, by enforcing user credentials and size limits. Walrus is written in Java, and is accessible through the same means as S3 (SOAP, REST or a web browser).
Cloud Controller
The Cloud Controller (CLC) is the Eucalyptus implementation of the Elastic Compute Cloud (EC2) that Amazon provides. The CLC is responsible for starting, stopping and controlling instances in the system, as it provides the computational power (CPU & RAM) to the user. The CLC indirectly contacts the hypervisors through Cluster Controllers (CCs) and Node Controllers (NCs). The CLC is written in Java.
Storage Controller
This is the equivalent of the EBS found in Amazon. The Storage Controller (SC) is responsible for providing fast dynamic storage devices with low latency and variable storage size. It resides outside the virtual CLC instances, but can communicate with them as external devices in a similar fashion to the EBS system. The SC is written in Java.
Beneath the Cluster Controller, on every physical machine, lies the Node Controller (NC). Written in C, this component is in direct contact with the hypervisor. The CC and SC talk with the NC to determine the availability, access and needs of the hypervisors. The CC and SC run at cluster level, which means only one of each is needed per cluster. Usually the SC and CC are deployed on the head node of each cluster - that is, a defined machine marked as being the leader of the rest of the physical machines in the cluster - but if the cloud only consists of one large cluster, the CLC, SC, CC and Walrus can all reside on the head node, the front-end node.

Figure 3.4: Overview of the components in Eucalyptus on rack based servers.


All components communicate with each other over SOAP with WS-Security [8]. To make up the entire system, Eucalyptus has more parts, which were mentioned in brief earlier. The Cluster Controller, written in C, is responsible for an entire physical cluster of machines, providing scheduling and network control of all the machines under the same physical switch/router. See figure 3.4. While the CLC is responsible for controlling most of the instances and their requests (creating, deleting, setting EMIs etc.), it talks with both the CC and SC on the cluster level. Walrus, on the other hand, is only responsible for storage actions and thus only talks with the SC.
The front-end serves as the maintenance access point. If a user wants more instances or needs more storage allocated to them, the front-end has Walrus and the CLC ready to accept requests and propagate them to CCs and SCs. This provides the user with the transparency of the Eucalyptus cloud. The user cannot tell where and how the storage is created, only that they actually received more storage space by requesting it from the front-end.

3.3.2 A quick look at the hypervisors in Eucalyptus

To be able to create and destroy virtualized instances on demand the Node Controller needs
to talk with a hypervisor installed on the machine it is running on. Currently, Eucalyptus
only supports the hypervisors Xen and KVM. To be able to communicate with them,
Eucalyptus utilizes the libvirt virtualization API and virsh.
The Xen hypervisor is a type 1 hypervisor which utilizes paravirtualization to run operating systems on it. This requires that the operating systems running on it are modified to make calls to the hypervisor instead of the actual hardware [22]. Xen can also support hardware virtualization, but that requires specialized virtualization hardware. See section 3.1. The first guest operating system that Xen virtualizes is called dom0 (basically the first domain) and is automatically booted when starting the computer. When Xen runs a virtual machine, the drivers are run in user-space, which means that every OS runs inside the memory of a user instead of the kernel's memory space. Xen provides networking by bridging the NIC (see section 3.1.1).
KVM, the Kernel-based Virtual Machine, is a hypervisor built using the OS's kernel. This means that KVM uses calls far deeper into the OS architecture, which in turn provides greater speed. KVM is very small and built into the Linux kernel, but it cannot by itself provide CPU paravirtualization. To be able to do that it uses the QEMU CPU emulator. QEMU is, in short, an emulator designed to simulate different CPUs through API calls. The usage of QEMU inside KVM means that KVM sometimes is referred to as qemu/KVM. Since KVM runs inside the kernel-space, it uses calls through QEMU to interact with the user-space parts, like creating or destroying a virtual machine [22]. Like Xen, KVM also bridges the NIC to provide networking to the virtual machines.
When Eucalyptus creates EMIs to be used inside the cloud system, it requires images along with kernel and ramdisk pairs that work for the designated hypervisor. A RAM disk is not required in the beginning, since the ramdisk image defines the state of the RAM memory in the virtual machine [22], but if the image has been installed and is running when put to sleep, a ramdisk image should come with it (if there were none, all the RAM would be empty when the virtual machine resumed). Since the images might look different depending on which hypervisor created them, Xen and KVM cannot load each other's images. This raises an interesting point about Eucalyptus: if, in theory, there were an image and ramdisk/kernel pair that worked on both hypervisors, Eucalyptus could run physical machines that had either KVM or Xen installed on them and boot any Xen/KVM image without encountering any virtualization problems. With the current discrepancy, the machines in the cloud are forced to run a specific hypervisor so that the EMIs can be loaded on any Node Controller in the cloud.

3.3.3 The Metadata Service

Just like Amazon, Eucalyptus has a metadata service available for the virtual machines [8]. The metadata service supplies information to VMs about themselves. This is achieved by the VMs contacting the CLC with an HTTP request. The CLC checks from which VM the call is made and returns the requested information based on the VM that made the call. For example, if the CLC's IP is 169.254.169.254, a VM could make a request like:
http://169.254.169.254:8773/latest/meta-data/<name of metadata tag>
http://169.254.169.254:8773/latest/user-data/<name of user-defined metadata tag>

The metadata tag can be anything from standard default ones like kernel-id, security-groups or public-hostname, or it can be specific ones defined by the administrator. This is a method of obtaining data to set up information when new instances are created or destroyed. The metadata calls have the exact same call names and structure as in AWS, so tools used with the AWS system work with the Eucalyptus metadata.

Figure 3.5: Metadata request example in Eucalyptus.
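As an illustration, a VM can read its own metadata with a plain HTTP GET. The following is a minimal sketch in Java and is not taken from the thesis implementation; the CLC address matches the example above, and local-ipv4 is one of the standard metadata keys, but any tag is fetched in the same way.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class MetadataExample {
    public static void main(String[] args) throws Exception {
        // Ask the metadata service for this instance's private IP address.
        URL url = new URL("http://169.254.169.254:8773/latest/meta-data/local-ipv4");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}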

3.3.4 Networking modes

With Eucalyptus installed on all the machines in the cluster(s), the different components call each other with SOAP commands on the physical NIC. However, when new instances are created they need to have their networking set up on the fly. Since the physical network might have other settings regarding how the NICs retrieve their IPs, Eucalyptus has different modes to give the virtual machines access to the network. The virtual machines communicate with each other using virtual subnets. These subnets must not in any way overlap with the physical network used by the components of Eucalyptus (note the difference between components like the CLC, Walrus and NC, and the virtual machines). The CC has one connection towards the virtual subnet and another bridged to the physical network [8].
The networking modes inside the Eucalyptus cloud system differ in how much freedom and connectivity the instances have. Some modes add features to the VM networks:
Elastic IPs are, in short, a way to supply the user with a range of IPs that the VMs can use. These can then be used as external, public IPs and are ideal if the user needs a persistent web server, for example.
Security groups are a way of giving the user of a group of instances control over what is allowed in terms of network traffic. For example, one security group can enforce that no ICMP calls are answered, or that no SSH connections can be made.


VM isolation, if activated, prevents VMs in different security groups from contacting each other. When running a public cloud providing IaaS, this is almost a must-have.
The different modes give different benefits and drawbacks, and some are even required under certain circumstances. There are four different networking modes in Eucalyptus. In three modes, the front-end acts as the DHCP server distributing IPs to the virtual machines. The fourth mode requires an external DHCP server to distribute IPs to virtual machines [8]. In all networking modes the VMs can be connected to from an external source if they are given a public IP and the security group allows it. The different modes are the following:
SYSTEM
The only networking mode that requires an external DHCP server to serve new VM instances with IPs. This mode requires little configuration since it does not limit internal interaction. It does, however, provide no extra features like security groups, elastic IPs or VM isolation. This mode is best used when the cloud is private and few users share the cloud.
STATIC
STATIC mode requires that the DHCP server on the network is either turned off or configured not to serve the specific IP range that the VMs use. The front-end has to take care of DHCP towards the instances, but in a non-dynamic way, by adding pairs of MAC addresses and IPs for the VMs. Just like SYSTEM, it does not provide the benefits normally associated with a public cloud, like elastic IPs, VM isolation or security groups.
MANAGED
MANAGED mode gives the most freedom to the cloud. Advanced features like VM
isolation, elastic IPs and security groups are available by creating virtual networks
between the VMs.
MANAGED-NoVLAN
If the physical network relies on VLANs then normal MANAGED mode will not work (since several VLAN packets on top of each other will cause problems for the routing). In this mode most of the MANAGED mode features are still there, except VM isolation.
When setting up the Eucalyptus networking mode one has to consider what type of
cloud it is and what kind of routing setup is made on the physical network.

3.3.5 Accessing the system

When a user wants to create, remove or edit instances, they can either contact them directly through SSH (if they have public IPs) or control the instances by using the Eucalyptus web interface. By logging in on the front-end with a username and password, the user or admin can configure settings of the system. Also, tools developed for AWS can be used for this since Eucalyptus supports the same API calls [8].
Similarly, there is a tool called euca2ools for administration. It is a command-line interface tool that is used to manipulate the instances that a user has running. An admin using euca2ools has more access than an ordinary user. Euca2ools is almost mandatory when working with Eucalyptus.

3.4 Software study - Hadoop MapReduce & HDFS

Hadoop is an open-source software package from the Apache Foundation which contains different systems aimed at file storage, analysis and processing of large amounts of data, ranging from only a few gigabytes to several hundreds or thousands of petabytes. Hadoop's software is all written in Java, but the different parts are separate projects in themselves, so bindings to other programming languages exist on a per-project basis. The three major subprojects of Hadoop are the following [9]:
HDFS
The Hadoop Distributed File System, HDFS, is a specialized filesystem for storing large amounts of data across a distributed system of computers, with very high throughput and multiple replication within a cluster. It provides reliability between the different physical machines to support a base for very fast computations on a large dataset.
MapReduce
MapReduce is a programming idiom for analyzing and processing extremely large datasets in a fast, scalable and distributed way. Originally conceived by Google as a way of handling the enormous amount of data produced by their search bots [23], it has been adapted so that it can run on a cluster of normal commodity machines.
Common
The Hadoop Common subproject provides interfaces and components built in Java to support distributed filesystems and I/O. This is more of a library that has all the features that HDFS and MapReduce use to handle the distributed computation. It has the code for persistent data structures and Java RPC that HDFS needs to store clustered data [23].
While these are the major subprojects of Hadoop, there are several others that are related to the Hadoop package. These are generally projects related to distributed systems which either use the major Hadoop subprojects or are related to them in some way:
Pig
Pig introduces a higher-level data-flow language and framework for doing parallel computation. It can work in conjunction with MapReduce and HDFS and has an SQL-like syntax.
Hive
Hive is a data warehouse infrastructure with a basic query language, Hive QL, which is based on SQL. Hive is designed to easily integrate and work together with the data storage of MapReduce jobs.


HBase
A distributed database designed to support large tables of data with a scalable infrastructure on top of normal commodity hardware. Its main usage is handling extremely large database tables, i.e. billions of rows and millions of columns.
Avro
By using Remote Procedure Calls (RPC), Avro provides a data serialization system to be used in distributed systems. Avro can be used when parts of a system need to communicate over the network.
Chukwa
Built on top of HDFS and MapReduce, Chukwa is a monitoring system for large distributed systems.
Mahout
A large machine learning library. It uses MapReduce to provide scalability and handling of large datasets.
ZooKeeper
ZooKeeper is mainly a service for distributed systems control, monitoring and synchronization.
HDFS and MapReduce are intended to work on commodity hardware, which is the opposite of specialized high-end server hardware designed for computation-heavy processing. The idea is to be able to use the Hadoop software on a cluster of not-that-high-end computers and still get a very good result in terms of throughput and reliability. An example of commodity hardware, taken from Hadoop - The Definitive Guide [24]:
Processor: 2 quad-core 2-2.5 GHz CPUs
Memory: 16-24 GB ECC RAM
Hard drive: 4 x 1 TB SATA disks
Network: Gigabit Ethernet

Since Hadoop is designed to use multiple large hard drives and multiple CPU cores, having more of them is almost always a benefit. ECC RAM stands for Error-Correcting Code RAM and is almost a must-have, since Hadoop uses a lot of memory during processing and clusters without it reportedly see a lot of checksum errors [24]. Using Hadoop on a large cluster of racked physical machines in a two-level network architecture is a common setup.

3.4.1 HDFS

The Hadoop Distributed File System is designed to be a filesystem that gives high throughput and reliability for very large datasets. HDFS is basically a Java program that communicates with other networked instances of HDFS through RPC to store blocks of data across a cluster. It is designed to work well with large file sizes (which can vary from hundreds of MBs to several PBs), but since it focuses on delivering high amounts of data between the physical machines it has a slower access rate and higher latency [23].


HDFS is split into three software parts. The NameNode is the master of the filesystem, which keeps track of where and how the files are stored in the filesystem. The DataNode is the slave in the system and is controlled by the NameNode. There is also a Secondary NameNode which, contrary to what its name says, is not a replacement of the NameNode. The Secondary NameNode is optional, which is explained later in this section.
When HDFS stores files in its filesystem it splits the data into blocks. These blocks of raw data are of configurable size (defined in the NameNode configuration), but the default size is 64 MB. This can be compared to a normal disk block, which is 512 bytes [23]. When a data file has been split up into blocks, the blocks are sent to the different DataNodes (other machines) where they are stored on disk. The same block can be sent to multiple DataNodes, which provides redundancy and higher throughput when another system requests access to the file.
The NameNode is responsible for keeping track of the location of the file among the DataNodes, as well as the tree structure that the filesystem uses. The metadata about each file is also stored in the NameNode, like which original data file each block belongs to and its relation to other blocks. This data is stored on disk in the NameNode in the form of two files: the namespace image and the edit log. The exact block locations on the DataNodes are not stored in the namespace image; they are reconstructed on startup by communicating with the DataNodes and then only kept in memory [23].

Figure 3.6: The HDFS node structure.


Due to the NameNode keeping track of the metadata of the files and the tree structure of the file system, it is also a single point of failure. If it breaks down, the whole HDFS filesystem will be invalid, since the DataNodes only store the data on disk without any knowledge of the structure. Even the Secondary NameNode cannot work without the NameNode, since the Secondary NameNode is only responsible for validating the namespace image of the NameNode. Due to the large amounts of data that the file metadata can grow to, the NameNode and Secondary NameNode should be different machines (and separated from the DataNodes) in a large system [23].
However, as of Hadoop 0.21.0 work has begun to remove the Secondary NameNode and replace it with a Checkpoint Node and a Backup Node, which are meant to keep track of the NameNode and keep an up-to-date copy of it. This will work as a backup in case of a NameNode breakdown [11], lowering the risk of failure if the NameNode crashes.
By default, the NameNode replicates each block by a factor of three. That is, the NameNode tries to keep three copies of each block on different DataNodes at all times. This provides both redundancy and more throughput for the client that uses the filesystem. To provide better redundancy and throughput, HDFS is also rack-aware; that is, it wants to know which rack each node resides in and how far, in terms of bandwidth, the nodes are from each other. That way the NameNode can keep more copies of a block on one rack for faster throughput, and additional copies on other racks for better redundancy.

Figure 3.7: The interaction between nodes when a file is read from HDFS.
DataNodes are more or less data dummies that take care of storing and sending file data to and from clients. When started, they have the NameNode's location defined as a URL in their configuration file. This is localhost by default, which needs to be changed as soon as there is more than one node in the cluster.
When a user wants to read a file, it uses a Hadoop HDFS client that contacts the NameNode. The NameNode then fetches the block locations and returns them to the client, leaving the client to do the reading and merging of blocks from the DataNodes. See figure 3.7. Since HDFS requires a special client to interact with the filesystem, it is not as easy as mounting an NFS (Network File System) share and reading from it in an operating system. However, there are bindings to HTTP and FTP available, and software like FUSE, Thrift or WebDAV can also work with HDFS [23]. Using FUSE on top of HDFS means that it can be mounted as a normal Unix userspace drive.
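To make the read path concrete, the sketch below shows how a client program opens and prints a file from HDFS using the Java FileSystem API. It is a minimal illustration, not part of the thesis implementation; the NameNode host and port in fs.default.name are assumptions and must match the actual cluster configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode (hypothetical host and port).
        conf.set("fs.default.name", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        // The NameNode supplies the block locations; the contents are then
        // streamed from the DataNodes holding the blocks.
        FSDataInputStream in = fs.open(new Path(args[0]));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}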

3.4.2 MapReduce

MapReduce is a programming idiom/model for processing extremely large datasets using distributed computing on a computer cluster. It was invented and patented by Google. The word MapReduce derives from two typical functions used within functional programming, the map and reduce functions [7]. Hadoop has taken this framework and implemented it, similarly to HDFS, to be able to run on top of a cluster of computers that are not high-end, through a license from Google. The purpose of Hadoop's MapReduce is to be able to utilize the combined resources of a large cluster of commodity hardware. MapReduce relies on a distributed file system, where HDFS is currently one of the few supported.

MapReduce phases
The MapReduce framework is split into two major phases: the Map phase and the Reduce phase. The entire framework is built around key-value pairs, and the only things communicated between the different parts of the framework are key-value pairs. The keys and values can be user-implemented, but they are required to be serializable since they are communicated across the network. Keys and values can range from simple primitive types to large data types. When implementing a MapReduce problem, the problem has to be possible to split into n parts, where n is at least the number of Hadoop nodes in the cluster.
It is important to understand that while the different phases in a MapReduce job can be regarded as sequential, they in fact work in parallel as much as possible. The shuffle and reduce phases can start working as soon as one map task has completed, and this is usually the case. Depending on the work slots available across the cluster, each job is divided as much as possible. The MapReduce framework is built around these components:
InputFormat
Reads file(s) on the DFS, tables from a DBMS or whatever the programmer wants it to read. This phase takes an input of some sort and splits it into InputSplits.
InputSplits
An InputSplit is dependent on what the input data is. It is a subset of the data read, and one InputSplit is sent to each Map task.
MAP
The Map phase takes a key-value pair generated through the InputSplit. Each node runs its map tasks in parallel with the other nodes. One Map task takes a key-value pair, processes it and generates another key-value pair.
Combine
The optional combine phase is a local task run directly after each map task on each node. It does a mini-reduce by combining all identical keys generated by the current map task.
Shuffle
When the nodes have completed their map tasks, the job enters the shuffle phase, where data is communicated between the nodes. Key-value pairs are passed between the nodes to append, sort and partition them. This is the only phase where the nodes communicate with each other.
Shuffle - Append
Appending the data during the shuffle phase is generally just putting all the data together. The shuffle append is automatically done by the framework.
Shuffle - Sort
The sort phase is when the keys are sorted, either in a default way or in a programmer-implemented way.

Shuffle - Partition
The partition phase is the last phase of the shuffle. It calculates how the combined data should be split out to the reducers. It can either be handled in a default way or be programmer-implemented. It should send an equal amount of data to each reducer for optimal performance.
REDUCE
Reduce is done by taking all the key-value pairs with the same key and performing some kind of reduction on the values. Each reduce task takes a subset of all the key-value pairs, but will always have all the values for one key. For example, (Foo, Bar) and (Foo, Bear) will go to the same reducer as (Foo, [Bar, Bear]). If a reducer has one key, no other reducer will receive that key.
Output
Each reducer generates one output to storage. The output can be controlled by subclassing OutputFormat. By default the output is one part-file per reducer, in the form of files named part-r-00000, part-r-00001 and so on.
Although each InputSplit is sent to one Map task, the programmer can tell the InputFormat (through a RecordReader) to read across the boundaries of the split given. This enables the InputFormat to read a subset of the data without having to combine it from two or more maps. When one InputSplit has been read across its boundaries, the latter split will begin after where the former stopped. The size of the InputSplit given to a map task is most often dependent on the size of the data and the size of an HDFS block. Since Hadoop MapReduce is optimized for - and most often runs on - HDFS, the block size of HDFS usually dictates the size of the split if it is read from a file.

Figure 3.8: The MapReduce phases in Hadoop MapReduce.


The key-value pairs given to the mapper do not always require the keys to be meaningful. The values can be the only interesting part for the mapper, which outputs a different key and value after computation. The general flow of key-value pairs in the MapReduce framework is the following:
map(K1, V1) → list(K2, V2)
reduce(K2, list(V2)) → list(V3)

However, when implementing MapReduce the framework takes care of generating the lists and combining the values for each key. The programmer only needs to focus on what one map or reduce task does, and the framework will apply it n times until the job is complete. In terms of the Hadoop framework it can be regarded as:
framework in(data → (K1, V1))
map(K1, V1) → (K2, V2)
framework shuffle((K2, V2) → (K2, list(V2)))
reduce(K2, list(V2)) → (K3, V3)
framework out((K3, V3) → (K3, list(V3)))
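As an illustration of this flow, the sketch below is a minimal word-count style job written against the new org.apache.hadoop.mapreduce API used in Hadoop 0.21.0. It is not the MapReduce implementation from this thesis; the class names and the input/output paths are illustrative. The mapper emits (word, 1) pairs, the framework shuffles them into (word, list of counts), and the reducer sums the list for each key.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(K1, V1) -> list(K2, V2): K1 is the byte offset of a line, V1 the line text.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);   // emit (K2, V2) = (word, 1)
            }
        }
    }

    // reduce(K2, list(V2)) -> (K3, V3): sum all counts for one word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // the optional combine phase
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // part-r-* files
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submitting such a job is typically done with the hadoop jar command against the JobTracker described below.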

When starting a Hadoop MapReduce cluster, it requires a master that contains the HDFS NameNode and the JobTracker. The JobTracker is the scheduler and entry point of a MapReduce job. The JobTracker communicates with TaskTrackers that run on other nodes in the cluster. A node that only contains a TaskTracker and an HDFS DataNode is generally known as a slave. TaskTrackers periodically ping the JobTracker and check whether a free task is ready to work on or not. If the JobTracker has a task ready, it is sent to the TaskTracker, which performs it. Generally, on a large cluster the master is separate from the slaves, but on smaller clusters the master also runs a slave.

Chapter 4

Accomplishment
This chapter describes how the configuration of Eucalyptus and Hadoop was done. It describes one way to set up a Eucalyptus cloud and one way to run Hadoop MapReduce in it. While it describes one way, it should not be regarded as the definitive way to do it. Eucalyptus can run several different networking modes on top of different OSs, which means that the following configurations are not the only solution.

4.1 Preliminaries

The hardware available for this thesis was nine rack servers running Debian 5, kernel version 2.6.26, connected through one gigabit switch with virtual network support. Of these, one was the designated DHCP server of the subnet, only serving specific MAC addresses a specific IP address and not giving out any IP to an unknown MAC address. This server also had a shared NFS /home for the other eight servers in the subnet. Due to the other eight servers being dependent on this server, it was ruled out of the Eucalyptus setup. Of the eight available, four supported the Xen hypervisor and four supported the KVM hypervisor. These are the hardware specifications of the servers:
CPU: AMD Opteron 246 Quad-core on test01-04, AMD Opteron 2346 HE Quad-core on test05-08
RAM: 2 GB on test01-04, 4 GB on test05-08
HDD: 27 GB on /home, 145 GB per server
Initially, Eucalyptus version 1.5.1 was chosen for this thesis, but due to problematic configuration of the infrastructure a later version, 2.0.2, was used in the final testing. See section 4.2.1 for further explanation. For Hadoop MapReduce the latest version, 0.21.0, was chosen due to the fact that a major API change was made in this release. 0.21.0 also has a large number of bug fixes compared to earlier versions. None of the servers had Hadoop or Eucalyptus previously installed on them.

4.2 Setup, configuration and usage

The following sections contain a description of how the Eucalyptus and Hadoop MapReduce
software were set up. It is divided into subsections: the first focuses on how to configure
Eucalyptus on the available servers, the second on how to create an image with Hadoop
suitable for the environment, and finally the MapReduce implementation along with how to
get it running. Installing, configuring and running Eucalyptus requires a user with root access
to the systems it is installed on.
The configuration is based on a Debian host OS, as this was the system it was run
on. This means that some packages or commands either do not exist or have a different
command structure on other OSs like CentOS or Red Hat Linux.

4.2.1 Setting up Eucalyptus

Compared to Xen, KVM works more out of the box since it is tightly integrated with the
native Linux kernel. KVM was therefore chosen as the hypervisor, to avoid any problems
that might occur between the host OS and the hypervisor. This meant that four out of
eight servers could not be used as servers inside the cloud infrastructure. Eucalyptus can
probably be set up to use both Xen and KVM if it loads an image adapted to the correct
hypervisor, but that is out of the scope of this thesis.
Installing Eucalyptus can be done using the package manager of the OS. In Debian, apt-get can be used once the Eucalyptus repository has been added to /etc/apt/sources.list.
Depending on the version used, the repository location differs. To add Eucalyptus 2.0.2,
edit the sources.list file and add the following line:
deb http://eucalyptussoftware.com/downloads/repo/eucalyptus/2.0.2/debian/ squeeze main

Calling apt-get install eucalyptus-nc eucalyptus-cc eucalyptus-cloud eucalyptus-sc will install all the parts of Eucalyptus on the server it is called on. Starting, stopping and
rebooting Eucalyptus services is done through the init.d/eucalyptus-* scripts.
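As a rough sketch, the installation steps just described can be collected as follows; the split between front-end and node packages and the exact init script names (assumed here to follow the eucalyptus-* pattern) may differ between package versions.

echo "deb http://eucalyptussoftware.com/downloads/repo/eucalyptus/2.0.2/debian/ squeeze main" >> /etc/apt/sources.list
apt-get update
apt-get install eucalyptus-cloud eucalyptus-cc eucalyptus-sc    # on the front-end
apt-get install eucalyptus-nc                                   # on each node controller
/etc/init.d/eucalyptus-cc restart                               # restart a service after changing its configuration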
The physical network setup is to have one server act as the front-end, running the
cloud, cluster and storage controllers. The three other servers only run virtual
machines and talk to the front-end. The FE does not run any instances at all, to avoid
issues with networking and resources. Public IPs can be reserved in the CLC configuration
file, and for the test environment the IP range *.*.*.193-225 was set as available for
Eucalyptus. This needs to be communicated with the network admin, as the Eucalyptus
software must assume that these IP addresses are free and that no one else is using them.


Figure 4.1: Eucalyptus network layout on the test servers along with what services were running
on each server.

In terms of the networking, different modes have different benefits. However, due to the
configuration of the subnet DHCP server, the SYSTEM mode is not an option. The STATIC
configuration simplifies setting up new VMs but it lacks benefits like VM isolation and
especially the metadata service. The choice is therefore to use either the MANAGED or the
MANAGED-NOVLAN mode. By setting up a virtual network between two of the servers
one can verify whether the network is able to use virtual LANs; this is documented on the
Eucalyptus network configuration page [8]. The network verifies as VLAN clean, that is, it
is able to run VLANs.
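One common way to perform such a verification, sketched here under the assumption that the vlan package (providing vconfig) is installed, is to create a tagged interface on two servers and ping across it; the VLAN id and addresses are arbitrary examples, and the authoritative procedure is the one in the Eucalyptus network documentation [8].

modprobe 8021q                                          # load the VLAN kernel module
vconfig add eth0 10                                     # create the tagged interface eth0.10
ifconfig eth0.10 192.168.42.1 netmask 255.255.255.0 up  # use 192.168.42.2 on the second server
ping -c 3 192.168.42.2                                  # if this succeeds, the switch passes VLAN-tagged traffic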
Most of the problems encountered when setting up a Eucalyptus infrastructure are related to the network. The networking inside the infrastructure consists of several layers,
and with the error logs available (found in the /var/log/eucalyptus/*.log files) there is a
lot of information to go through when searching for errors. The configuration file that
Eucalyptus uses, /etc/eucalyptus/eucalyptus.conf, has a setting that controls the
verbosity level of the logging. When installing, at least the INFO level is recommended, while DEBUG can be set when there are errors whose source is hard to find.
When working with the Eucalyptus configuration file, it is important to note that the
different subsystems (NC, CC, CLC, SC and Walrus) use the same configuration file but
different parts of it. As an example, changing the VIRTIO_* settings on the Cloud controller
has no effect, as those are settings that only the Node controller uses. This might cause confusion
in the beginning of the configuration, but the file itself is by default very well documented.
When setting up the initial Eucalyptus version, 1.5.1, problems occurred with compatibility against newer libraries. Since the system libraries are relatively new compared
to Eucalyptus 1.5.1, it attempts to load libraries whose names and/or
locations have changed. The 1.5.1 version does for example attempt to load the Groovy library through an
import call that points to a non-existent location. To remedy these library problems, 1.6.2
was selected as the next Eucalyptus version to try. This demanded a kernel and distribution
upgrade from Debian 5 to 6.
Eucalyptus 1.6.2 has better support for KVM networking and calls to newer libraries,
so it should work better than 1.5.1. 1.6.2 does not run without problems though.
Booting the NCs on the three node servers (test06-08) as well as the Cluster controller on
the front-end (test05) goes without any problems, but the cloud controller silently dies a
few seconds after launching it. This is an error that cannot be found in the logs since it is
output directly to stderr from the daemon, which itself is redirected to /dev/null. By running
the CLC directly from /usr/sbin/eucalyptus-cloud instead of /etc/init.d/eucalyptus-clc one
can see that 1.6.2 still has dependency problems with newer Hibernate and Groovy libraries.
This can be solved by downgrading the libraries to earlier versions, however this can cause
compatibility issues with other software running on the servers.
To prevent compatibility issues the latest version, 2.0.2, was installed on the four servers
used. This proved to be a working setup. All the Eucalyptus services are running, but
they are not connected to each other: the CLC is not aware of any SC, Walrus or CC
running, and the CC is not aware of any nodes available. Since Eucalyptus is a web service
that runs on the Apache Web Server, one can verify that the services are running by calling
ps auxw | grep euca

to determine that the correct httpd daemons are running. This should show an httpd daemon running under the eucalyptus user.
There are two ways of connecting the different parts of the system: either the correct
settings can be set in the configuration file at the CLC, which is then rebooted, or
the euca_conf CLI program can be run. What euca_conf does is essentially change the configuration file
and reboot a specific part of the CLC. The cloud controller can then connect to the NCs
through REST calls (which can be seen by reading the /var/log/eucalyptus/cc.log file at the
front-end). This means that the Eucalyptus infrastructure is working in one sense, but the
virtual machines themselves can still be erroneously configured.
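As a sketch, registering the components with euca_conf on the front-end can look roughly like the following; the option names are taken from the 2.0-era euca_conf, the cluster name and host addresses are placeholders, and the exact flags should be checked against the administration guide [8].

euca_conf --register-walrus <front-end IP>
euca_conf --register-cluster cluster01 <front-end IP>
euca_conf --register-sc cluster01 <front-end IP>
euca_conf --register-nodes "<node1 IP> <node2 IP> <node3 IP>"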

Network configuration

Figure 4.2: The optimal physical layout (left) compared to the test environment (right).


Configuring the right network settings for the VMs that run inside the cloud is somewhat of a trial and error procedure. Even though there is documentation on the network setup,
it does not explain why the setup should be done in the documented way. It also assumes that
the cloud has a specific physical network setup. In a perfect environment the
front-end should have two NICs, with one NIC connected to the external network and one
to the internal. See Figure 4.2.
The front-end will act as a virtual switch in any case, which means that in a single-NIC
layout the traffic passes through the physical layer to the front-end, where it is switched and
then passed through the same network again, as shown in Figure 4.3. This means that on
the NCs the private and public interface settings are in fact the same NIC, with a bridge
specified so that virsh knows where to attach new VMs. On the front-end there are no virtual
machines, only a connection to the outside network (the NIC, eth0) and a connection
to the internal network (here a bridge).

Figure 4.3: Network traffic in the test environment. Dotted lines indicate a connection on the
same physical line.

While the documentation specifies that MANAGED should work even in the non-optimal
layout the physical network uses, none of the different combinations of settings on the
NC and FE could make the VMs properly connect to the outside network. An indication
of a faulty network configuration is to check the public IP address of a newly created
instance through euca2ools or another tool like Hybridfox, or to read through the logs. If
the public IP shows 0.0.0.0 it is generally an indication of a faulty network setting. To
properly configure the network for the VMs, the front-end and node controllers need
settings in the configuration file that match each other. Table 4.1 shows the
variables shared by the NC and front-end in the configuration file, with the settings
in the test environment that provide a working network.
Variable             Front-end value   NC value         Comment
VNET_MODE            MANAGED-NOVLAN    MANAGED-NOVLAN   Network mode used in the environment. Same on NC and FE.
VNET_PUBINTERFACE    eth0              eth0             The public interface to use.
VNET_PRIVINTERFACE   br0               eth0             The private interface to the nodes.
VNET_BRIDGE          -                 br0              Bridge to connect the VMs to.

Table 4.1: Configuration variables and their corresponding values in the environment.
This yields an internal network in which the VMs communicate through bridges.
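Expressed as an excerpt of /etc/eucalyptus/eucalyptus.conf on a node controller, the values in Table 4.1 amount to the following; the quoting follows the style of the default configuration file.

VNET_MODE="MANAGED-NOVLAN"
VNET_PUBINTERFACE="eth0"
VNET_PRIVINTERFACE="eth0"
VNET_BRIDGE="br0"

On the front-end the same file instead has VNET_PRIVINTERFACE="br0", pointing at the bridge that faces the internal network.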


When a VM is booted, the VM's network interface (which itself is a virtual bridged NIC) attaches to
the bridge specified in the configuration file. This requires that the node controller and the
front-end have created a bridge to the physical network.
When running Ubuntu or Debian as host OS, disabling the NetworkManager package
is required since it can - and most likely will - cause problems. NetworkManager can be
scheduled to check the network against a certain setting and reset it if it is configured in another
way. This problem was identified at least twice as a root cause of networking issues during the
course of this thesis.

Node Controller configuration


The Node controllers are responsible for communicating with the hypervisor on each machine. In Eucalyptus, this is done by talking to virsh, the command line interface of the libvirt virtualization library for
Linux. Virsh then takes care of telling KVM (in this case) which image/ramdisk/kernel to
load along with a few other settings. Most of these can be left untouched, but there are a few
things that need to be changed according to the Eucalyptus settings. Virsh is made to run
stand-alone, and in order for Eucalyptus to run virsh the eucalyptus user has to be allowed to run it.
These settings are specified in the Eucalyptus administration handbook, see the references [8].
If the VMs fail to receive an IP address one can specify in the virsh settings to use
VNC, which is a remote desktop system. Using VNC can be regarded as a security issue
and should only be used as a means to troubleshoot. VNC can be added to all the VMs
launched on the NC where the changes are made. This requires an image where
a username/password pair already exists, as opposed to the root access that
is given through keypairs in the default Eucalyptus environment. VNC was used in the initial
troubleshooting of the networking by editing the /usr/share/eucalyptus/gen_kvm_libvirt_xml
file. In the <devices> element, adding this gives VNC accessibility:
<graphics type='vnc' port='-1' autoport='yes' keymap='en-us' listen='0.0.0.0'/>

Node controllers running in MANAGED-NOVLAN mode require the NIC to be bridged.
This can be done using the Linux package bridge-utils, which provides the command brctl.
When brctl is installed, bridges can be added or removed by editing the network configuration
file, /etc/network/interfaces, and restarting the networking daemon through the command
/etc/init.d/networking restart. The NCs all have the same networking configuration on the
test system, which is:
iface eth0 inet dhcp

auto lo
iface lo inet loopback

auto br0
iface br0 inet dhcp
    bridge_ports eth0

This configuration means that eth0 has a DHCP stanza but is not brought up on its own (there
is no auto eth0 line), whilst the loopback interface (127.0.0.1 or localhost) is always brought up.
The bridge br0 is brought up automatically, obtains its address over DHCP and attaches the
eth0 interface through bridge_ports.
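After restarting the networking, the bridge can be verified with the bridge-utils tooling mentioned above; brctl show and ifconfig are standard commands, while the interface names follow the configuration listed here.

/etc/init.d/networking restart   # re-read /etc/network/interfaces
brctl show br0                   # eth0 should be listed as an attached interface
ifconfig br0                     # br0 should now hold the DHCP-assigned address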


In the Eucalyptus configuration file most of the settings can be left at their default values,
as the configuration file is mostly set up to run out of the box. Changing the NC_PORT setting can be useful when there are other services running on the server, but it requires
changing the setting on the front-end too, as this is the port through which the different Eucalyptus parts talk to each other. The only settings that need to be set on the NC are VNET_MODE,
VNET_PUBINTERFACE, VNET_PRIVINTERFACE, VNET_BRIDGE and HYPERVISOR
(either xen or kvm, where kvm was used in this thesis).

Front-end configuration
Front-end refers to the Cloud controller, which in this case also hosts the Walrus, Storage
Controller and Cluster controller. On larger clusters the SC, CC and Walrus should run
on other servers to minimize load. Cluster controllers also define what is referred to as
Availability Zones in AWS terms.
Configuring the front-end is done by editing the eucalyptus.conf file. The changes required here are more extensive than on the NCs, as this file specifies the networking mode,
subsystem locations and external software. Some of the changes can be made using the euca_conf script, but editing
the configuration file directly is suggested to ensure that all the changes are exactly the ones
needed. In the test environment the following settings were changed from the default (see Table 4.1
for the VNET modes of the front-end):
DISABLE_DNS
Enabling DNS allows the VMs to use Fully Qualified Domain Names. This requires
that the network admin adds the front-end address to the public nameserver as the
nameserver of the Eucalyptus VMs. This is further described in the handbook [8].
Set to N in the test environment, as FQDNs are needed due to Hadoop requirements.
SCHEDPOLICY
The policy by which the front-end chooses which NC to place a new VM on. ROUNDROBIN
is the default, which means that the front-end cycles through each NC when adding a new
VM. This gives an even distribution of VMs.
Set to ROUNDROBIN.
VNET_DHCPDAEMON
Specifies the location of the binary of the DHCP server daemon. The front-end will
run this on the internal net to distribute IP addresses to the VMs.
Set to /usr/sbin/dhcpd3.
VNET_DHCPUSER
Sets the user that runs the DHCP server. Can be either dhcpd or root.
Disabled in the test environment, which defaults to root.
VNET_SUBNET
Specifies which internal IP range the VMs should use inside the internal network.
These addresses cannot be accessed from outside. The IP range should be one that does
not exist on the current subnet.
Set to 10.10.0.0 in the test environment.


VNET_NETMASK
In conjunction with the subnet setting, the netmask specifies which addresses can be
given out of the subnet.
A netmask of 255.255.0.0, which is the test setting, will let VMs receive internal
addresses from 10.10.1.2 to 10.10.255.255. The address 10.10.1.1 is the address of the DHCP
server on the internal network (basically the front-end).
VNET_DNS
Address of the external DNS. When a properly configured image boots, it will fetch
this address from the metadata service and add it to its /etc/hosts file.
VNET_ADDRSPERNET
Determines how many addresses can be assigned per virtual subnet. Recommended is either
32 or 64 on large clouds. It is fairly irrelevant in the test environment, as there is not
enough physical hardware to support that many VMs.
Set to 32.
VNET_PUBLICIPS
This can be either a range or a comma-separated list of IP addresses. These are the
addresses on the public external net that are free to use.
Set to 130.239.48.193-130.239.48.225 in the test environment.
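Collected into an excerpt of the front-end's eucalyptus.conf, the settings above correspond to the following; VNET_DNS is omitted since its value is site specific.

DISABLE_DNS="N"
SCHEDPOLICY="ROUNDROBIN"
VNET_DHCPDAEMON="/usr/sbin/dhcpd3"
#VNET_DHCPUSER="dhcpd"           # left disabled, which defaults to root
VNET_SUBNET="10.10.0.0"
VNET_NETMASK="255.255.0.0"
VNET_ADDRSPERNET="32"
VNET_PUBLICIPS="130.239.48.193-130.239.48.225"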
As the front-end takes care of distributing IP addresses to new VMs in all modes except
SYSTEM, the front-end needs to have its internal DHCP server running. The DHCP
daemon is an external binary whose path has to be set in the configuration file. This binary will
only run when there are VMs running in the cloud, and it will only listen to, and answer
calls from, VMs. An issue with Eucalyptus 2.0.2 is that it is programmed to work with
the dhcpd3 daemon, whereas the latest daemon, which replaces dhcpd3 in the Debian repository,
is isc-dhcp-server. The isc server has an incompatible API which renders it unusable
to Eucalyptus 2.0.2. This means that to run the Eucalyptus DHCP service, the system needs
the older dhcpd3 server installed and not the isc server. This can be done by forcing apt-get
to install the older version of dhcpd3.
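A hedged sketch of doing so; the package name dhcp3-server (which provides /usr/sbin/dhcpd3) and the available version strings depend on the repositories configured, so this is an illustration rather than exact commands.

apt-cache policy dhcp3-server                   # list the versions the repositories offer
apt-get install dhcp3-server=<older version>    # pin a pre-isc version of the daemon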

Accessing and managing a Virtual Machine


To launch a virtual machine inside the Eucalyptus cloud, the user requires keypairs that
are created and retrieved through the Eucalyptus web interface. The default user,
the admin, has privileges to run, add, remove or edit any images inside the cloud, while an
ordinary user cannot do as much. In the test environment the admin privileges were used so
that images could easily be uploaded into the cloud.
The first thing required is to generate a keypair that is bound to the user (admin in
this case). This keypair is used in all managing of both Eucalyptus features and
instances. By logging into https://<front-end>:8443/#login and choosing Credentials, one
can download the required keypair files. Extracting these and running source eucarc in a CLI gives
the current shell access to the euca2ools credentials, which enables CLI interaction
with the Eucalyptus features. The keypair, which is a file ending with .private, is required to
SSH into an image.
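A sketch of this workflow with euca2ools follows; the archive, keypair and image names are placeholders, and euca-add-keypair is the standard euca2ools command for creating the .private key used for SSH.

unzip euca2-admin-x509.zip -d ~/.euca       # credentials downloaded from the web interface
source ~/.euca/eucarc                       # gives the current shell access to the cloud
euca-describe-availability-zones verbose    # check free VM slots per instance type
euca-add-keypair mykey > mykey.private      # keypair later used for SSH access
chmod 600 mykey.private
euca-run-instances emi-xxxxxxxx -k mykey -t m1.small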


On the default images created by Eucalyptus there are no users with a username/password
that allow SSH directly into the instance. When the image boots, it instead contacts the metadata
service through the generic address 169.254.169.254 in the MANAGED modes. The IP address
is automatically configured on the virtual switch on the front-end so that it lands on the front-end
when a VM calls it. Thus, during boot the rc.local script calls the metadata service, downloads the public key and adds it to root's authorized SSH keys. In short, anyone with the
.private key can log in to an instance using
ssh -i xxxxx.private root@<VM IP>
or
ssh -i xxxxx.private root@<VM FQDN>
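The key-fetching step in rc.local typically boils down to something like the following sketch; the path is the EC2-compatible public-keys endpoint that Eucalyptus emulates, and the exact script shipped in the image may differ.

mkdir -p /root/.ssh
curl -sf http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys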

The keypair can also be used with other tools. During the course of the thesis, Hybridfox
was used to manage instances. This is a graphical tool built into Firefox that simplifies
viewing, launching and stopping running instances. Hybridfox was thoroughly employed in
the testing phase, as it quickly shows which instances are running, their type and their current
status.
Usage                           Protocol   From Port   To Port   CIDR
Ping                            ICMP       -1          -1        0.0.0.0/0 (all)
SSH                             TCP        22          22        0.0.0.0/0
DNS                             TCP        53          53        0.0.0.0/0
Hadoop HDFS metadata            TCP        8020        8020      0.0.0.0/0
Hadoop NameNode access          TCP        9000        9000      0.0.0.0/0
HDFS data transfer              TCP        50010       50010     0.0.0.0/0
HDFS block/metadata recovery    TCP        50020       50020     0.0.0.0/0
Hadoop WS JobTracker            TCP        50030       50030     0.0.0.0/0
Hadoop WS TaskTracker           TCP        50060       50060     0.0.0.0/0
Hadoop WS NameNode              TCP        50070       50070     0.0.0.0/0
Hadoop FS custom port           TCP        54110       54111     0.0.0.0/0

Table 4.2: Security group with Hadoop rules.


When running in any MANAGED mode, Eucalyptus supports a feature called security
groups. Security groups are a software firewall feature that restricts access to instances
inside a specific group. Launching an instance requires the user to specify a security group to
launch it within (or use the default group). The user can then open access on certain ports,
from certain CIDR locations, for certain types of network traffic (UDP, TCP or ICMP), or a
combination of them all. The security groups do not restrict traffic between instances
inside the group, only access to and from locations outside the group. With Hadoop running
inside a security group, a group was set up to allow access to the web services and file
upload.
Table 4.2 details how the security group used in the test environment was defined. Some
ports in the security group were only opened to verify access in a test with an external
Hadoop NameNode.
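The rules in Table 4.2 can be created with euca2ools; the sketch below shows a few of them, assuming the group is called hadoop, and the -P/-p/-t/-s flags should be double-checked against the euca2ools version in use.

euca-add-group -d "Hadoop cluster" hadoop
euca-authorize -P icmp -t -1:-1 -s 0.0.0.0/0 hadoop     # ping
euca-authorize -P tcp -p 22 -s 0.0.0.0/0 hadoop         # SSH
euca-authorize -P tcp -p 50070 -s 0.0.0.0/0 hadoop      # NameNode web UI
euca-run-instances emi-xxxxxxxx -k mykey -g hadoop      # launch an instance inside the group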

4.2.2 Configuring an Hadoop image

Eucalyptus does not come with any images when installed, but there are several images
with different operating systems that have been made to fit the Eucalyptus private cloud
and that can be downloaded from the Eucalyptus homepage. The benefit of using these is that
everything needed to run them has already been prepared, including scripts, kernel and ramdisk
pairs and the OS on the image itself. The downside is that there is not much space for any
extra software on the images and there are unneeded binaries cluttering them.
The idea behind the Hadoop image is to have a preconfigured image with the proper
users, credentials, binaries and scripts already loaded onto it. When the image is
loaded as a VM, only minimal configuration has to be done on the Hadoop slaves and the
master. There are many ways to solve this, including having one type of image for the
master node and one for the slaves. However, in this thesis only a single image was used to
allow for homogeneity across the instances.
To simplify the image creation, the Eucalyptus Debian 5 x86_64 image was chosen as the
default image to work with. This image comes with a kernel and ramdisk pair which can
be uploaded immediately to Walrus and Eucalyptus using euca2ools. The downloaded image is
just an .img file, so to be able to edit the image it has to be mounted on a physical
OS. With a simple script, run as root, this can be done quite quickly. Assuming the image is named
debian.5.hadoop.img, one can use a few lines to fully edit it as a normal root OS:
mkdir mnt
mount -o loop debian.5.hadoop.img mnt
mount -o bind /proc mnt/proc
mount -o bind /sys mnt/sys
mount -o bind /dev mnt/dev
chroot mnt

A specialized script was created to help with mounting/unmounting the image, as well
as a specialized script adapted to bundle and unbundle images. Bundling an image is the
Eucalyptus action of splitting the image into smaller files, uploading them into Walrus and
preparing them to be loaded. This allows the NCs to fetch them and boot them on request.
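The bundling step follows the usual euca2ools three-step pattern sketched below; the bucket name is a placeholder, and the kernel/ramdisk pair is registered the same way using the --kernel true or --ramdisk true flags to euca-bundle-image.

euca-bundle-image -i debian.5.hadoop.img
euca-upload-bundle -b hadoop-image -m /tmp/debian.5.hadoop.img.manifest.xml
euca-register hadoop-image/debian.5.hadoop.img.manifest.xml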
VMs are always reset to their initial state when loaded. If there are changes to be made
at boot, editing the /etc/rc.local file is one way to ensure scripts are run. This is how some
automated on-boot settings are applied to the VM when it boots.
On the image, a user hadoop under the group hadoop was created. All the Hadoop
files are intended to be run under this user, so privileges have to be set accordingly. Hadoop
0.21.0 is downloaded and installed at the /opt/hadoop/hadoop-0.21.0 location. However,
this comes close to filling up the image space, so removing all the source code found in the
src/ folder is necessary. All of the Hadoop directories have read-write-execute privileges for the
hadoop user.
As the image does not have any space to store files or host HDFS on, storage can be
provided in several different ways. Similar to Amazon EBS, Eucalyptus can give dynamic
storage to instances. Neither the block storage (SC) nor the Walrus was used by the images
in this test, however. Instead the images used the ephemeral storage given to each VM on
boot, which is basically a piece of the local HDD of the machine the VM is loaded on. The size of
the ephemeral storage is adjusted by the admin in the Eucalyptus Web Interface and can
have different sizes depending on what type of VM is loaded. In KVM, the ephemeral storage is
an unformatted harddrive found at /dev/sda2. The image has to mount and format this
drive each time it boots, compared to Xen where it is already mounted and running an
ext2 filesystem. On the image, in rc.local, a script is thus added that mounts and formats
/dev/sda2.
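A sketch of that rc.local addition under KVM; the mount point /mnt/ephemeral and the ext3 filesystem are assumptions, while the device /dev/sda2 comes from the description above.

mkfs.ext3 -q /dev/sda2                # the ephemeral disk is unformatted on every boot under KVM
mkdir -p /mnt/ephemeral
mount /dev/sda2 /mnt/ephemeral
chown hadoop:hadoop /mnt/ephemeral    # Hadoop stores its data here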
The Hadoop settings are thus modified to use the ephemeral harddrive for data storage.
By changing /opt/hadoop/hadoop-0.21.0/conf/core-site.xml to point at the ephemeral
storage, all loaded instances will use their own ephemeral storage.
Hadoop requires changes in core-site.xml, mapred-site.xml and hdfs-site.xml, as well
as in the masters and slaves configuration files, each time the nodes are included in a virtual Hadoop
cluster. These files have thus been modified with placeholder variables that are replaced
remotely through a custom made script. The script is called through SSH and run with hadoop
privileges. What the script does is basically replace each variable with the data that is specific
for the cluster. As an example, hdfs-site.xml contains a variable that must point at the
master of the cluster. As the master can be different for each cluster, it has a default placeholder
of {MASTER} which the script replaces with the IP of the master node.
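The substitution the script performs can be sketched with sed; {MASTER} is the placeholder named above, while the variable name and the list of files are illustrative.

MASTER_IP=$1                          # supplied from hadoop-cc.conf by the calling script
for f in core-site.xml hdfs-site.xml mapred-site.xml masters slaves; do
    sed -i "s/{MASTER}/$MASTER_IP/g" $HADOOP_HOME/conf/$f
done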
A short list of changes that have to be made to the default image in order to have a
working Hadoop image:
Edit the rc.local file
Edit the rc.local file to add more boot features to the image. The Hadoop image
retrieves its FQDN with public/internal IP and sets it as the hostname. It also mounts
and formats the ephemeral data. Inside rc.local is also the keypair retrieval code,
which must be left untouched.
Create user hadoop
Create the group and user hadoop with a password. A password is not strictly necessary, as
the user will either be reached through the image root or through a remote keypair.
Save Hadoop
Store the Hadoop binaries at /opt/hadoop/hadoop-0.21.0 without the source code.
These files need ownership and privileges set to hadoop.
Set $HADOOP_HOME
The shell variable $HADOOP_HOME needs to be set to the Hadoop home directory.
This is set in /etc/profile to ensure that all users have it set.
Edit configuration files
Change the configuration files mentioned earlier to have default, replaceable, values
set.
Modify conf/hadoop-env.sh
Hadoop runs into issues when using IPv6, so hadoop-env.sh needs a line added
that forces Hadoop to use IPv4. Adding
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
forces Hadoop to use IPv4 when it is run.
An issue with Hadoop is that both the MapReduce and HDFS projects require the nodes
to know both the Fully Qualified Domain Name (FQDN) and the IP address. Eucalyptus
itself does not have an entirely working reverse DNS system that gives the nodes the ability
to do a reverse DNS lookup to determine their FQDN. As such, Hadoop would not be able
to run and would crash at startup with an UnknownHostException. To work around this, the
instance can use the Eucalyptus metadata service to fetch its internal and external IP, as
well as its FQDN. Thus, rc.local has another script that fetches these, adds them to
the /etc/hostname file and forces the system to reload the name with /bin/hostname -F /etc/hostname.
When these are set in the file, Hadoop does not try to use the nameservers for reverse DNS
but instead automatically resolves the name from the file.
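Sketched, the hostname step looks roughly as follows; the meta-data paths are the EC2-style local-hostname/local-ipv4 entries that Eucalyptus emulates, and hostname -F (set hostname from file) is used rather than -f (print the FQDN).

FQDN=$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)
LOCAL_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
echo "$FQDN" > /etc/hostname
echo "$LOCAL_IP $FQDN" >> /etc/hosts     # lets Hadoop resolve the name without reverse DNS
hostname -F /etc/hostname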

4.2.3 Running MapReduce on the cluster

The Hadoop cluster consists entirely of virtual machines. The master node is one machine in the cluster (generally the first booted), albeit with a slightly different configuration
compared to the slaves. To set up a cluster, VMs are booted either through the euca2ools
CLI or the Hybridfox Firefox plugin. Once the machines are booted and ready to run - i.e.
they have mounted and formatted their ephemeral drives and received an FQDN - scripts
that reside outside the cluster on a physical server perform the setup based on a configuration
file. The configuration file, hadoop-cc.conf, contains the external IP of the master node,
a list of IPs of all the slave nodes and the replication level.
Script/file name     Location                             Purpose
hadoop-cc.conf       Outside the cluster                  Contains the configuration settings for launching a cluster.
hadoop-cc.conf       Image: $HADOOP_HOME/cc-scripts/      The local counterpart. Is replaced by the config file outside the cluster.
update-cluster       Outside the cluster                  Reads from the .conf file and updates the settings on all Hadoop nodes.
boot-cluster         Outside the cluster                  Boots the cluster. Requires that the cluster has been updated beforehand.
terminate-cluster    Outside the cluster                  Terminates a running cluster. Does NOT terminate VMs, only Hadoop services.
update-hadoop.sh     Image: $HADOOP_HOME/cc-scripts/      Updates the local Hadoop configuration based on a hadoop-cc.conf found locally.
set-master.sh        Image: $HADOOP_HOME/cc-scripts/      Changes the masters in the local image. Called by update-hadoop.sh.
set-slaves.sh        Image: $HADOOP_HOME/cc-scripts/      Changes the slaves locally. Called by update-hadoop.sh.
set-replication.sh   Image: $HADOOP_HOME/cc-scripts/      Changes the replication level locally. Called by update-hadoop.sh.
master               Image: $HADOOP_HOME/conf/default/    Default file to replace the configuration file with.
slaves               Image: $HADOOP_HOME/conf/default/    Default file to replace the configuration file with.
conf-site.xml        Image: $HADOOP_HOME/conf/default/    Default file to replace the configuration file with.
hdfs-site.xml        Image: $HADOOP_HOME/conf/default/    Default file to replace the configuration file with.
mapred-site.xml      Image: $HADOOP_HOME/conf/default/    Default file to replace the configuration file with.

Table 4.3: Table showing scripts and custom config files to run the Hadoop cluster.
Table 4.3 lists the scripts and configuration files used to run a Hadoop cluster. The
procedure to boot a cluster and run a MapReduce job is the following:
1. Launch VMs through euca2ools or Hybridfox.
2. Change the master, slaves and the replication level in the hadoop-cc.conf file.
3. Run update-cluster.
Update-cluster will do several things; first it will connect to the master node and
generate an SSH key for the user hadoop. That key will then be distributed across all
the slaves, enabling passwordless SSH for user hadoop to all VMs. Secondly
it will log in on all the VMs, use the metadata service and fetch the IP/FQDN pairs.
Third, it will create an /etc/hosts file containing the IPs/FQDNs of all the VMs
in the cluster. Lastly it will distribute the /etc/hosts file to all nodes, distribute the
hadoop-cc.conf file to all nodes and tell them to run the update-hadoop.sh script.
4. Run boot-cluster.
This will start the Hadoop MapReduce and HDFS services as user hadoop. With the
SSH keypairs already distributed for user hadoop, the master node will take care of
starting them on all the nodes.
5. Upload the files to HDFS.
Since the database files do not reside on the cluster by default, they are transferred
to the master through SCP. With an SSH call to the master's HDFS client they can
then be uploaded to HDFS.
6. Download the JAR file containing the MapReduce implementation.
Fetching the JAR file containing the job is a simple matter of placing it in $HADOOP_HOME.
7. Run the MapReduce job using bin/hadoop jar hadtest.jar count articles out.
This will run the JAR hadtest.jar with the arguments count articles out. It basically means: run the count job on the articles HDFS folder and put the results in the
HDFS out folder.
8. Retrieve the result from the out folder.
This will be a list based on the implementation.
9. Calculate the job time.
This is done by visiting Hadoop's internal web service for the JobTracker. At
http://<Master External IP>:50030 the MapReduce framework provides feedback on
time, number of tasks etc.
The scripts that run outside the cluster require the presence of a keypair that gives
root access to the instances. This is because they log in as root, switch to user
hadoop on the VMs and then perform the relevant actions.
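Put together, a full test run with the helper scripts in Table 4.3 looks roughly like the sketch below; the script names come from the table, while the arguments, key name and file names are assumptions.

vi hadoop-cc.conf                                    # set master IP, slave IPs and replication level
./update-cluster                                     # distribute configuration, SSH keys and /etc/hosts
./boot-cluster                                       # start HDFS and MapReduce on all nodes
scp -i admin.private enwiki-*.xml root@<master IP>:/tmp/
ssh -i admin.private root@<master IP> "su - hadoop -c '/opt/hadoop/hadoop-0.21.0/bin/hadoop fs -mkdir articles'"
ssh -i admin.private root@<master IP> "su - hadoop -c '/opt/hadoop/hadoop-0.21.0/bin/hadoop fs -put /tmp/enwiki-*.xml articles'"
ssh -i admin.private root@<master IP> "su - hadoop -c 'cd /opt/hadoop/hadoop-0.21.0 && bin/hadoop jar hadtest.jar count articles out'"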
There are a few shortcomings in the Eucalyptus DNS nameserver for internal VMs:
the instances cannot resolve each other through the nameserver that
Eucalyptus provides. To circumvent this, the update-cluster script builds a large
hosts file containing a list of IP/FQDN pairs for all the nodes in the cluster. The computed
hosts file is distributed to all slaves and replaces the existing one. This lets Hadoop resolve
the nodes through the local file.
What is also missing in Eucalyptus is proper communication between VMs
based on their external IPs. For instance, assume a VM with 10.10.1.2 and 130.239.48.193
as its internal and external IP and another VM with 10.10.1.3/130.239.48.194 as its IPs.
The first VM can only contact the other VM through the internal IP, 10.10.1.3;
contacting it through the external IP, 130.239.48.194, will give a timeout. The reason for this is
the virtual switch inside the front-end. When booting a new instance, the front-end adjusts
its Netfilter rules using the iptables binary, in the same way as it gives itself the metadata
address. The problem here is that Eucalyptus automatically does this for the internal IPs,
but the external IPs are never adjusted for the VMs.
The boot-cluster and terminate-cluster scripts manually add and remove iptables rules
to solve this issue through:

sudo iptables -t nat -A POSTROUTING -s "<LOC_IP>" -j SNAT --to-source "<PUB_IP>"


and
sudo iptables -t nat -D POSTROUTING -s "<LOC_IP>" -j SNAT --to-source "<PUB_IP>"

Unfortunately this requires that the scripts are run on the front-end and with su privileges. Adding an SSH command would of course be sufficient if they were to be run remotely,
but they still require iptables changes on the front-end.

4.2.4 The MapReduce implementation

To test the MapReduce framework on the infrastructure, an embarrassingly parallel problem was chosen that could easily use MapReduce to distribute the tasks. The problem chosen
was to count all the links in a subset of Wikipedia articles, listing the article titles with the
most links to them. This is a problem well suited to MapReduce, as the database dumps
from Wikipedia are large XML files.
The size of the files fits HDFS, but with XML files there is a slight problem
of tags crossing file split boundaries. When splitting files, HDFS and MapReduce do
not account for any specific split locations by default, so a specific InputSplit had to be
implemented. Cloud9 [15] is a Hadoop MapReduce project with an implementation that
suits the Wikipedia databases and MapReduce. However, the XML parsing looks quite like
the Apache Mahout [12] implementation, and both use the old pre-0.20 Hadoop API.
The implementation was thus rewritten to fit the 0.21.0 API. It utilizes the MapReduce
RecordReader's ability to read outside of the given split, so the InputFormat can read from
a tag start to a tag end.
After the XML files have been parsed, a WikipediaArticle is passed to a Mapper. The
mapper extracts the link titles found in the XML string, outputting each
link as a key and a 1 as a value. The Reducer then merges the links and sums up the 1s
to a total count. However, when these are output, they are sorted alphabetically by
their keys and not by the number of links they have. This is solved by running a second,
very fast (about 3 minutes) job which just reads the output file, passes it through
a mapper that swaps the key and value, and passes it through a reducer to print it out.
This implementation allows the amount of data used to be adjusted, as the whole
Wikipedia database of current articles is a 27 GB XML file. Only using subset files
during the testing shows whether there is a difference in completion time between a
large amount of data to process and a smaller amount. The subsets used were the
article-only (no metadata, images, history or user discussions) subsets dumped at 2011-03-17 [25]. The subsets were 2.9 GB (the first three XML files), 4.0 GB (the first four) and 9.6 GB
(the first eight files).


To calculate the time taken to process the data the following was done:
1. Start a specific amount of Hadoop VMs.
2. Configure them accordingly.
3. Upload the data and the JAR-file to the master VM.
4. Put the data onto HDFS.
5. Run the job 5 times, making sure to remove the output folder between runs (see the sketch after this list).
6. Go to the JobTracker Web Service and analyse the time.
7. The total time of a job consists of the Count and Sort jobs added together.
8. Calculate an average of the 5 runs.
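For step 5, one iteration of the timing loop can be sketched as follows, run as the hadoop user from $HADOOP_HOME on the master; fs -rmr is the 0.21-era recursive delete.

for i in 1 2 3 4 5; do
    bin/hadoop fs -rmr out                      # remove the output from the previous run
    bin/hadoop jar hadtest.jar count articles out
done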
All of the nodes use only 1 core and 512 MB RAM each, and the master also ran a
slave in all configurations. When running with only one node in the cluster the replication
level was set to 1; in all other tests the replication was set to 2. The size of the ephemeral
HDD varied depending on the size of the database and the number of nodes in the cluster.
See Table 4.4, where the master also counts as a slave.
DB size   Nr. of nodes   Master HDD size   Slave HDD size   Replication level
2.9 GB    1              10 GB             -                1
2.9 GB    2              10 GB             5 GB             2
2.9 GB    4              10 GB             5 GB             2
2.9 GB    8              10 GB             5 GB             2
2.9 GB    12             10 GB             5 GB             2
4.0 GB    1              10 GB             -                1
4.0 GB    2              10 GB             5 GB             2
4.0 GB    4              10 GB             5 GB             2
4.0 GB    8              10 GB             5 GB             2
4.0 GB    12             10 GB             5 GB             2
9.6 GB    1              25 GB             -                1
9.6 GB    2              10 GB             10 GB            2
9.6 GB    4              10 GB             10 GB            2
9.6 GB    8              10 GB             10 GB            2
9.6 GB    12             10 GB             10 GB            2

Table 4.4: The ephemeral HDD size of the nodes in the running cluster.
To simplify deployment on the cluster, the JAR file was built to include all the libraries
it uses in its build path. This produces a relatively large file compared to one without these
libraries, but it removes the need to distribute libraries at launch through the MapReduce framework itself.


Chapter 5

Results
This chapter details the results of running Hadoop MapReduce jobs inside a virtual cluster.
The procedure used to perform the tests is explained in an earlier chapter, see section 4.2.4.
When running the example job and verifying the timings and map tasks run, the internal
web service of Hadoop was used to procure the job times and map times.
VMs run on Eucalyptus never crashed during the test period once they were properly
booted. However, some things were observed during the boot phase of a VM,
where most of the problems occurred. Once running, a machine kept running until it was
terminated either from inside the VM or externally. A few observations related
to running the Eucalyptus cloud:
Booting many instances at the same time increases the crash risk of VMs
When Eucalyptus receives requests to boot several instances at the same time, there
is a risk that a VM either kernel panics or fails to receive an IP or properly resolve
its FQDN. If a high number (more than three) of instances are booting simultaneously,
it is more likely that an instance will fail to boot. Requesting instances one by
one minimizes the risk, but increases the time to completely boot all the instances in
a cluster.
Node Controllers do not remove cached images that have been removed
Removing an image from the cloud removes it from Walrus, but the Node Controllers
retain a cached copy of it, even though it can no longer be booted by the infrastructure.
This can clutter the HDDs of the NCs and sometimes requires manual removal to free
up disk space.
Networking to a new instance is never instantly created
Even though the iptables rules and bridging are properly set up, the instances are not
instantly accessible. They can respond to ping, but the time it takes to access an
instance through SSH usually ends up at 1-2 minutes.
The network to instances is never at a steady speed
Transferring files through wget, curl or SCP never had a stable speed. It could go
up to 20 mb/s, and in the next moment stall for 15 seconds. This was frequently
observed, especially when communicating from outside the virtual cluster into a VM.
Whether this affected the speed between Hadoop nodes in a MapReduce job is unknown.
Hadoop MapReduce itself is built around redundancy, so if a machine stopped
working, the task would eventually be solved either on the same VM or on another in the
cluster. Running the MapReduce jobs did not induce any severe VM crashes or similar,
even though they could put a high load on the machines. Some things were observed
during the runs however:
Near linear increase in performance on few nodes
Increasing from 1 to 2 nodes gives a large increase in performance. Beyond that the performance increase is not linear, but still present.
Larger cluster means more repetition of the same task
If the cluster is large, the likelihood of a map task being done at least twice and then
dropped is increased. This is also based on the size of the database it runs on. As an
example, a job of 157 tasks is done exactly 157 times on a 1-node cluster, whereas on
an 8-node cluster it would be done on average 165 times and have 8 tasks dropped.
Tasks getting dropped more often on a large cluster
Related to the previous item, tests indicate that nodes time out more often the larger
the cluster is. Whether this is related to the MapReduce framework or the internal
networking between the nodes is unknown. This forces the task to be redone
later or at another location.
An important note here is that performance is defined in simplified terms. This thesis focuses
only on how fast the job finishes, which means that:
high performance = short job time

5.1 MapReduce performance times

Initially the testing was supposed to be done with a single database size, but once the
initial tests on the relatively small 2.9 GB database of Wikipedia articles were done,
MapReduce displayed results indicating that 12 nodes would not give an increase
in performance compared to 8 nodes. The database was therefore increased to 4.0 GB at first and
then to 9.6 GB to verify the results.
The increased size did not, however, display any significant changes from the initial analysis; when the size of the cluster was put to the maximum of 12 virtual
instances the gain in performance was marginal. Figures 5.1, 5.2 and 5.3 show the
performance curves based on the size of the database and the number of nodes in the cluster.


Figure 5.1: Runtimes on a 2.9 GB database.

Figure 5.2: Runtimes on a 4.0 GB database.

Figure 5.3: Runtimes on a 9.6 GB database.


As such, the only noticeable difference when nearing the maximum number of virtual nodes
available is that the max/min times of completing the job draw closer to the average
compared to the 8-node cluster. On a cluster with only a few nodes, the variation in
how long it takes to complete the job was very high compared to an 8- or 12-node
cluster.
The map times of the jobs did not change considerably with the size of the
database or the number of nodes in the cluster. The only noticeable difference was
that the larger the database, the higher the maximum time of an individual map task that was
observed. However, that could point at a problem regarding the network, where the
JobTracker would be unable to contact the TaskTracker of the individual map and thus
drop it from the job, redoing it later or on another node. See figures 5.4, 5.5 and 5.6.

Figure 5.4: Map task times on a 2.9 GB database.

Figure 5.5: Map task times on a 4.0 GB database.

Figure 5.6: Map task times on a 9.6 GB database.


This shows that a linear increase in performance based on the number of nodes is not
achievable in the test environment that was set up. The performance evens out when nearing the
maximum number of nodes in the virtual cluster. A performance increase is nevertheless obtained
by adding virtual nodes.

Chapter 6

Conclusions
While Hadoop MapReduce is designed to be used on top of physical commodity hardware
servers, testing has shown that running Hadoop MapReduce in a private cloud supplied by
Eucalyptus is viable. Using virtual machines gives the user the ability to supply more machines when needed, as long as the physical upper limits of the underlying
host machines are not reached. While setting up the cloud and Hadoop can prove problematic at first, it
should not pose a problem to someone experienced in scripting, command line interfaces,
networking and administration in a UNIX environment.
Using Hadoop in a virtual cluster provides the added benefit of reusing the hardware for
something completely different when no MapReduce job is running. If the cloud contains
several different images, it is quite viable to use a private cloud as a means to provide more
computing power when needed and use the hardware for something else when it is not.

6.1 Restrictions and limitations

While this is a test of setting up and using MapReduce in a private Eucalyptus cloud, the
number of physical servers severely restricts the size of the test. It does not reflect the
size of a cloud that most likely will be used in a company or an organization, although the
principles would be just as fitting. This thesis focuses more on what is required to make a
Hadoop cluster inside a private cloud than on actually perfecting the means of creating
and managing such a cluster, although creating a simple, streamlined and usable method is
preferred.
The way of setting up the Hadoop nodes using the keypair, CLI scripts and configuration
file is viable but far from complete. It takes a while to boot images that use
auto-mounting of ephemeral devices. API calls to Eucalyptus cannot always verify whether
the VM has booted correctly or not, and since the networks are bridged it requires
a certain amount of traffic to find the right node. This means a downtime when booting
that can range anywhere from 30 seconds to several minutes, after which there is
still no guarantee that the VM has not kernel panicked or hit some other kind of error
preventing it from booting.


With the current status of Eucalyptus in mind, the shortcomings regarding VMs connecting through their external IPs, the reverse DNS lookups and the incompatibilities with
some binaries and libraries (dhcpd, Hibernate and Groovy) make it problematic to keep an
up-to-date version of Eucalyptus. However, the API that Eucalyptus uses simplifies making
applications and tools that interact with it. It is a major benefit that tools once
used for AWS can just as easily be used in the private cloud infrastructure.
Eucalyptus also has two problems when booting a VM. The first is that if many
instances are requested to boot at the same time, there is a risk that they fail to either retrieve
the proper IP, resolve the proper FQDN, or fetch the SSH key from the metadata service.
While the first is most likely an issue related to the virtual switching and the bridging of
the network, the FQDN and SSH problems reside in a metadata service that cannot be
accessed or is not prepared. A slow network in general, for example stalling SCP file transfers,
can frequently be observed on the type of network that the test cloud is using. Setting up
another kind of network, for example MANAGED mode only and with proper switches, could possibly solve
this.
The second large problem is that when booting several VMs at the same time on the
same host machine, there is a risk of a kernel panic in the VM. Eucalyptus still regards the
VM as in a running state while it is not accessible at all. This is not an error in Eucalyptus itself;
merely that an error occurred and virsh cannot tell Eucalyptus properly that
the instance has failed on boot.

6.2 Future work

The size of both the private Eucalyptus cloud and Hadoop's virtual cluster is very small
compared to the sizes they are designed for. Having a large cloud spread over large racks,
and preferably over several geographically separated availability zones, would definitely
increase the difficulty of the maintenance and setup process. However, adding more Node Controllers to the already existing cluster should not pose a considerable problem, as the majority
of issues regarding the cloud network have already been dealt with.
Hadoop should preferably run on larger VMs with more RAM and HDD available to
them. It would be interesting to see whether the cluster shows the same decline in performance increase when the cluster reaches 1,000 or more virtual machines. It would also be interesting
to see whether the Hadoop cluster recognizes a VM that resides on a different physical
rack (but in the same virtual one) as a node inside the same rack. The rack-awareness inside
the Hadoop cluster relies on reverse DNS and FQDNs, which means that Eucalyptus
first needs an upgrade in how it handles iptables, virtual switching and DNS lookups.
As the VMs run on three physical servers, a comparison between the performance of
the virtual cluster and running Hadoop straight on top of the three servers would also be
interesting. Dropping the cloud service, and with it the ability to switch VMs depending on
need, would give fewer Hadoop nodes but each with higher individual performance. Whether this
gives greater performance than running more virtual nodes, and to what degree, would be
a good comparison.


The way to launch, check and terminate Hadoop nodes can easily be made more streamlined,
either as complete stand-alone software utilizing the open API of Eucalyptus or by adding
more scripts to use in a CLI environment. The fact that the API that Eucalyptus uses is a
web service opens up many different approaches and ways of interacting with the VMs.
Using the storage features found in Eucalyptus, like its Storage Controller and Walrus,
to simplify the booting of a new image could be an interesting take on how to actually
load the test data onto the virtual cluster. Hadoop MapReduce uses HDFS as file
storage, but it can run directly on top of Amazon's S3. If Walrus uses the same API calls, then Hadoop
should be able to run on top of Walrus instead. However, as Walrus is accessed through the
network, it can pose a limitation in terms of bandwidth (depending on the size of the database).


Chapter 7

Acknowledgements
I would like to thank my supervisor at CS UmU, Daniel Henriksson, and co-supervisor Lars
Larsson for giving me this great opportunity to work with something so high-end and still
usable. The help and feedback you have given me is duly appreciated.

I would also like to thank the system administrator at CS UmU, Tomas Ögren, for his
patience and support when I repeatedly failed to set up and prepare the servers, solving
problems that were out of the scope of this thesis.
A final thanks to Erik Elmroth, who managed to find several interesting distributed projects,
this one included, despite his many work obligations and the lack of time.


References
[1] Amazon. Amazon Elastic Block Storage (EBS). http://aws.amazon.com/ebs, 2011. Retrieved 2011-02-15.
[2] Amazon. Amazon Elastic Compute Cloud (EC2). http://aws.amazon.com/ec2, 2011. Retrieved 2011-02-15.
[3] Amazon. Amazon Simple Storage Service (S3). http://aws.amazon.com/s3, 2011. Retrieved 2011-02-15.
[4] Citrix. Citrix Open Cloud Platform. http://www.citrix.com/English/ps2/products/subfeature.asp?contentID=2303748, 2011. Retrieved 2011-04-26.
[5] Cloud.com. The cloud OS for the modern datacenter. http://cloud.com/, 2011. Retrieved 2011-04-26.
[6] Cloudera. Cloudera's Distribution Including Apache Hadoop. http://www.cloudera.com/hadoop/, 2011. Retrieved 2011-04-26.
[7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI'04: Sixth Symposium on Operating System Design and Implementation. Google Inc., 2004.
[8] Eucalyptus Systems, Inc. Eucalyptus Administration Guide (2.0), 2010.
[9] Apache Software Foundation. Apache Hadoop. http://hadoop.apache.org/, 2011. Retrieved 2011-02-17.
[10] Apache Software Foundation. Apache Whirr. http://incubator.apache.org/whirr/, 2011. Retrieved 2011-04-26.
[11] Apache Software Foundation. HDFS user guide. http://hadoop.apache.org/hdfs/docs/current/hdfs_user_guide.html, 2011. Retrieved 2011-04-06.
[12] Apache Software Foundation. Mahout. http://mahout.apache.org/, 2011. Retrieved 2011-03-10.
[13] M. Gug. Deploying a Hadoop cluster on EC2/UEC with Puppet and Ubuntu Maverick. http://ubuntumathiaz.wordpress.com/2010/09/27/deploying-a-hadoop-cluster-on-ec2uec-with-puppet-and-ubuntu-maverick/, 2010. Retrieved 2011-03-10.
[14] Puppet Labs. Puppet powers IT production. http://www.puppetlabs.com/puppet/introduction/, 2011. Retrieved 2011-04-26.
[15] J. Lin. Cloud9 - a MapReduce library for Hadoop. http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/, 2011. Retrieved 2011-03-10.
[16] T. Mather, S. Kumaraswamy, and S. Latif. Cloud Security and Privacy. O'Reilly, 2009.
[17] Microsoft. Microsoft private cloud. http://www.microsoft.com/virtualization/en/us/privatecloud.aspx, 2011. Retrieved 2011-04-26.
[18] Derek Gottfrid (New York Times). Self-service, prorated supercomputing fun! http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-supercomputing-fun/, 2007. Retrieved 2011-02-15.
[19] OpenNebula. OpenNebula - the open-source toolkit for cloud computing. http://opennebula.org/, 2011. Retrieved 2011-04-26.
[20] A. T. Velte, T. J. Velte, and R. Elsenpeter. Cloud Computing - A Practical Approach. The McGraw-Hill Companies, 2010.
[21] VMware. VMware vCloud. http://www.microsoft.com/virtualization/en/us/privatecloud.aspx, 2011. Retrieved 2011-04-26.
[22] W. von Hagen. Professional Xen Virtualization. John Wiley & Sons, 2008.
[23] T. White. Hadoop - The Definitive Guide. O'Reilly Media, 2nd edition, 2010.
[24] T. White. Hadoop - The Definitive Guide, page 260. O'Reilly Media, 2nd edition, 2010.
[25] Wikimedia. enwiki dump progress on 20110317. http://dumps.wikimedia.org/enwiki/20110317/, 2011. Retrieved 2011-03-18.
[26] C. Wolf and Erick Halter. Virtualization - From the Desktop to the Enterprise. Apress, Berkeley, CA, 2005.

Appendix A

Scripts and code


The scripts and source code can be downloaded at:

http://www8.cs.umu.se/~c07jnn/bsthesis/

