
AN EXPERIMENTAL APPROACH TO MULTIPLE SEQUENCE

ALIGNMENT IN OPENVZ USING HADOOP CLUSTER

BAKTAVATCHALAM.G (08MW03)

MASTER OF ENGINEERING

Branch: SOFTWARE ENGINEERING

of Anna University

September 2009

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


PSG COLLEGE OF TECHNOLOGY
(Autonomous Institution)
COIMBATORE – 641 004

AN EXPERIMENTAL APPROACH TO MULTIPLE SEQUENCE ALIGNMENT


IN OPENVZ USING HADOOP CLUSTER

Bona fide record of work done by

BAKTAVATCHALAM.G (08MW03)

MASTER OF ENGINEERING

Branch: COMPUTER SCIENCE AND ENGINEERING


of Anna University, Coimbatore.

September 2009

……………………..                              ……………………..
Mr. M. Gowri Shankar                        Dr. S. N. Sivanandam
Faculty Guide                               Head of the Department

Certified that the candidate was examined in the viva-voce examination held on ………………….

…………………….. …………………………..
(Internal Examiner) (External Examiner)

ACKNOWLEDGEMENT

We wish to express our sincere gratitude to our respected Principal Dr. R. Rudramoorthy for having given us the opportunity to undertake this project.

We also wish to express our sincere thanks to Dr. S. N. Sivanandam, Professor and Head of the Department of Computer Science and Engineering, for the encouragement and support he extended towards our project work.

We extend our sincere thanks to our internal guide and faculty in charge Mr. M. Gowri Shankar, Lecturer, Department of Computer Science and Engineering, for his guidance and help rendered for the successful completion of our project.


SYNOPSIS

Multiple alignment of protein sequences is an essential tool in molecular biology. It helps determine evolutionary linkage and predict molecular structures. The factors to be considered while aligning multiple sequences are the speed and accuracy of the alignment. Dynamic programming algorithms such as Needleman-Wunsch and Smith-Waterman produce accurate alignments, but they are computation intensive and are limited to a small number of short sequences.

In this project we propose a time-efficient approach to sequence alignment that produces quality alignments. The dynamic nature of the algorithm, coupled with the data and computational parallelism of Hadoop data grids, improves the accuracy and speed of sequence alignment. Further, owing to the scalability of the Hadoop framework, the proposed multiple sequence alignment is also highly suited to large-scale alignment problems.

The improved algorithm also overcomes the space limitations of the Needleman-Wunsch algorithm by dividing the sequences into blocks and processing the individual blocks in parallel. We also optimize the computation by performing alignment score computation in parallel, and the algorithm is designed to support platform virtualization (OpenVZ).


CONTENTS

Synopsis
List of Figures
List of Tables
1. INTRODUCTION
   1.1 Problem Definition
   1.2 Objective of the Project
   1.3 Significance of the Project
   1.4 Outline of the Project
2. SYSTEM STUDY
   2.1 Existing System
   2.2 Proposed System
3. SYSTEM ANALYSIS
   3.1 Requirement Analysis
   3.2 Feasibility Analysis
4. SYSTEM DESIGN
   4.1 Contextual Activity Diagram
5. SYSTEM IMPLEMENTATION
   5.1 Aligner Module
   5.2 FileUtil Module
   5.3 GAJobRunner Module
6. TESTING
   6.1 Unit Testing
   6.2 Integration Testing
   6.3 Sample Test Cases
7. SNAPSHOT
   7.1 Node Details
   7.2 Parallel Jobs
CONCLUSION
FUTURE ENHANCEMENTS
BIBLIOGRAPHY
APPENDIX


LIST OF TABLES

TABLE NO.    NAME

Table 6.1    Sample Test Cases


LIST OF FIGURES

FIGURE NO.   NAME

Fig. 2.1     System Architecture


CHAPTER 1

INTRODUCTION

This chapter provides a brief overview of the problem definition, the objectives and significance of the project, and an outline of the report.

1.1 PROBLEM DEFINITION


The project is the design and implementation of a parallel approach to MSA using Hadoop data clusters with virtualization. It overcomes the space limitations of the original Needleman-Wunsch algorithm by processing the sequence alignments in parallel, and parallel alignment score computation is proposed to improve computational efficiency.

1.2 OBJECTIVE OF THE PROJECT


Users are interested in the parallel execution of MSA and want accurate alignment results from a balanced, virtualized cluster. This project therefore provides a solution that increases efficiency through parallel alignment and virtualization, and reduces alignment time.

1.3 SIGNIFICANCE OF THE PROJECT


With the enormous growth in biological sequence data, there is a corresponding need for tools that enable fast and efficient alignment of sequences. Concurrent execution greatly reduces the running time of the alignment.

1.4 OUTLINE OF THE PROJECT


The rest of the report is structured as follows. Chapter 2 provides a detailed study of the existing system and the basic ideas of the proposed system. Chapter 3 discusses the requirements for the development of the system and an analysis of its feasibility. Chapter 4 presents the overall design of the system. Chapter 5 discusses the implementation details. Chapter 6 explains the various testing procedures conducted on the system. Chapter 7 contains snapshots of the system in operation. The last section summarizes the project.


CHAPTER 2

SYSTEM STUDY
This chapter elucidates the existing system and gives a brief description of the proposed system.

2.1 EXISTING SYSTEM

The existing system is the Needleman-Wunsch algorithm, which supports only linear, sequential alignment of very large DNA sequences; there is no parallelism across sequence alignments. The algorithm gives the best accuracy for a pair of sequences, but it needs a very large amount of space to align them. It also does not support MSA, and it is not compatible with platform virtualization.

2.2 PROPOSED SYSTEM

In our system, we use OpenVZ virtual environments as VMs (virtual machines), and each VM runs its own Hadoop DataNode and TaskTracker. Many isolated VMs can be created on a single kernel. Each VM has its own files, users, process tree, network, devices and IPC objects, and it also supports dynamic resource management and checkpointing (state dumping). The input sequences are given to the modified MSA, which rearranges them into pairs of sequences, as sketched below. Each pair is then executed in a VM in the Hadoop cluster, and this process is repeated until the user-specified level is finished. Finally, the results are gathered from all the VMs, combined, and displayed to the user.
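
To illustrate the pairing step, the following minimal Java sketch shows one way the input sequences could be expanded into pairwise jobs. The all-against-all pairing strategy and the PairBuilder name are assumptions made for illustration; the report does not specify the exact pairing scheme of the modified MSA.

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: expand the input sequences into the pairwise
// alignment jobs that are distributed across the Hadoop cluster.
public class PairBuilder {
    public static List<String[]> buildPairs(List<String> sequences) {
        List<String[]> pairs = new ArrayList<String[]>();
        for (int i = 0; i < sequences.size(); i++) {
            for (int j = i + 1; j < sequences.size(); j++) {
                // Each pair becomes one unit of work (one map task / VM).
                pairs.add(new String[] { sequences.get(i), sequences.get(j) });
            }
        }
        return pairs;
    }
}

Each element of the returned list would then be written into the job input, so that one pair is aligned per map task.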

[Fig. 2.1: System Architecture]

CHAPTER 3

SYSTEM ANALYSIS
This chapter describes the hardware and software specifications for the development of the system and an analysis of its feasibility.

3.1 REQUIREMENT ANALYSIS


3.1.1 Software Requirements
After experimenting with the various software available and analyzing their pros and cons, the following were chosen.
• Operating System – Platform Independent
• Programming Languages – Java 1.6+
• Front End - Java Swing
• Framework - Hadoop
• Virtualization Tool - OpenVZ

3.1.2 Hardware Requirements


The hardware requirements of the proposed system are as follows:
• Pentium III machine or above
• 256 MB RAM
• Hard disk with a capacity of 10 GB
• A network of computers with the above configuration for the cluster

3.2 FEASIBILITY ANALYSIS


Feasibility deals with a step-by-step analysis of the system. The analysis showed that this project is feasible in all respects. Three kinds of feasibility factors are considered:

• Economic Feasibility

• Technical Feasibility

• Operational Feasibility

3.2.1 Economic Feasibility

The system is developed using only software that is freely available and already widely used, so no new software needs to be purchased. Hence, the cost incurred for this project is negligible.

3.2.2 Technical Feasibility

3.2.2.1 Parallel MSA


The main aim of our project is to align the given sequences in parallel using
MSA.

3.2.2.2 Virtualization
The next important task in our project is to configure OpenVZ to incorporate platform virtualization and thereby increase concurrency.

3.2.3 Operational Feasibility


The functions to be performed by the system are all valid and without any conflicts. All functions and constraints specified in the requirements are completely operational, and the stated requirements are realistically testable.
The requirements are adaptable to changes without any large-scale effects on other system requirements, and the system is capable of accommodating future requirements if they arise.


CHAPTER 4

SYSTEM DESIGN

This chapter describes the functional decomposition of the system and illustrates the movement of data between external entities, the processes and the data stores within the system, with the help of a contextual activity diagram.

4.1 CONTEXTUAL ACTIVITY DIAGRAM

[Contextual activity diagram of the system]

CHAPTER 5

IMPLEMENTATION

This phase is broken up into two stages: development and implementation. The individual system components are built during the development stage, in which the programs are written and tried out. During implementation, the components built during development are put into operational use. In the development phase of our system, the following components were built:
• Aligner module
• FileUtil module
• GAJobRunner module

5.1 Aligner Module


This module contains:
• A procedure to align two input sequences using the standard Needleman-Wunsch algorithm.
• A procedure to compute the score for two given sequences using the score matrix.
A sketch of the scoring procedure is given below.
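
The following is a minimal sketch of the scoring procedure, assuming a linear gap penalty and a toy match/mismatch similarity function in place of the project's actual score matrix; the class and method names are illustrative only. The full recurrence and traceback are given in the Appendix.

public class Aligner {

    static final int GAP = -5; // linear gap penalty d (assumed value)

    // Placeholder similarity function; the real module would look the
    // characters up in its score matrix.
    static int s(char x, char y) {
        return (x == y) ? 1 : -1;
    }

    // Fills the F matrix of the Needleman-Wunsch recurrence and returns
    // the optimal global alignment score F(n,m).
    public static int score(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] f = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) f[i][0] = GAP * i;
        for (int j = 0; j <= m; j++) f[0][j] = GAP * j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = f[i - 1][j - 1] + s(a.charAt(i - 1), b.charAt(j - 1));
                int up   = f[i - 1][j] + GAP;   // gap in b
                int left = f[i][j - 1] + GAP;   // gap in a
                f[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        return f[n][m];
    }
}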

5.2 FileUtil Module


This module contains:
• A procedure to read file contents from HDFS.
• A procedure to write file contents to HDFS.
A sketch of these procedures is given below.
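
A minimal sketch of the two procedures using the Hadoop FileSystem API follows. The class layout and the choice to read and write whole files as strings are assumptions for illustration; the default file system is taken from the cluster configuration (fs.default.name).

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileUtil {

    // Read the whole contents of an HDFS file into a String.
    public static String readFromHdfs(String path) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path(path));
        StringBuilder sb = new StringBuilder();
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } finally {
            in.close();
        }
        return sb.toString();
    }

    // Write a String to an HDFS file, overwriting any existing file.
    public static void writeToHdfs(String path, String contents) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path(path), true);
        try {
            out.writeBytes(contents);
        } finally {
            out.close();
        }
    }
}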

5.3 GAJobRunner Module


This module contains:
• The implementation of the Hadoop Map/Reduce procedures.
• The specification of the Hadoop JobConf, InputFormat, OutputFormat, key/value pair design, and parallel job submission.
A sketch of the job runner is given below.
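
The sketch below uses the Hadoop org.apache.hadoop.mapred API named above (JobConf, InputFormat, OutputFormat) to submit the alignment work as a Map/Reduce job. The input record layout (one comma-separated sequence pair per line), the call to the Aligner sketch from Section 5.1, and the class names are assumptions for illustration.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class GAJobRunner {

    // Map: each input line holds one pair of sequences ("seqA,seqB");
    // emit the pair as key and its alignment score as value.
    public static class AlignMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String[] pair = value.toString().split(",");
            int score = Aligner.score(pair[0], pair[1]); // pairwise alignment
            output.collect(value, new IntWritable(score));
        }
    }

    // Reduce: pass the scores through; a fuller implementation would
    // merge the partial alignments here.
    public static class ScoreReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            while (values.hasNext()) {
                output.collect(key, values.next());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(GAJobRunner.class);
        conf.setJobName("msa-pairwise-alignment");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(AlignMapper.class);
        conf.setReducerClass(ScoreReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf); // blocks until the parallel job completes
    }
}

Because the input pairs are independent, Hadoop schedules the map tasks across the DataNode/TaskTracker VMs, which gives the parallel alignment described in Chapter 2.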


CHAPTER 6

TESTING
This chapter explains the various testing procedures conducted on the system. Testing is the process of executing a program with the intent of finding errors. A successful test is one that uncovers an as yet undiscovered error. A testing process cannot show the absence of defects; it can only show that software errors are present. It ensures that defined input will produce actual results that agree with the required results. A good testing methodology should include:
• Clearly defined testing roles, responsibilities and procedures
• A consistent testing process
• Streamlined testing requirements
• Overcoming the “requirements slow me down” mentality
• A common-sense process approach
• Use of some elements of the existing process, rather than an attempt to replace, rewrite or redefine it
• Finding defects early, to give developers adequate time for bug fixes
• An independent perspective in testing

Some of the testing principles used in this project are:


• Unit Testing
• Integration Testing

6.1 UNIT TESTING


Unit testing is a strategy by which the individual components that make up the system are tested first, to ensure that the system works to the desired extent. It focuses the verification effort on the smallest unit of the software design, i.e., the module. Various modules of the system are tested to see whether they perform their intended functions. Using the procedural design description, important control paths are tested to uncover errors within the boundary of the module. While accepting a connection using specified functions, we carry out unit testing in the respective modules. The unit test is normally a white-box test (a testing method in which the control structure of the procedural design is used to derive test cases).

6.1.1 Process Objectives


To test every unit of the software in isolation before integrating it with other units.

6.1.2 Definition of Unit


A unit is a module as identified during the size estimation process, with a size estimate that does not exceed 1000 LOC.
For GUI applications, each screen is a unit.
If the size estimate for a unit exceeds 1000 LOC and it is not feasible to break it into smaller, logically independent units that can be tested in isolation, the project lead, in concurrence with the SQA, can decide to define it as a unit.

6.1.3 Entry Criteria


The entry criteria for this process are the following:
• Unit completed
• Unit peer reviewed

6.1.4 Exit Criteria


The exit criteria for this process are the following:
• Unit test cases executed
• Any defects identified during unit testing that are not fixed before the unit enters component testing are listed in the test report and verified
• 100% statement coverage
If a unit is to be tested before its code review, this must be identified in the project plan; in such projects the developer self-reviews (desk checks) the code before unit testing.
In cases of exception handling for error conditions that are difficult to generate, making it impossible to achieve 100% statement coverage, the code should be formally reviewed with this additional criterion in mind.


6.2 INTEGRATION TESTING


Integration testing is a systematic technique for constructing the program structure while conducting tests to uncover errors associated with interfacing. It is a type of testing in which the individual modules of the system are combined and tested to check whether they work properly as a whole. The objective is to take unit-tested modules and build the program structure dictated by the design. Integration testing can be either incremental or non-incremental.
The objective of integration testing is to help engineers plan and execute the component and integration testing for their respective projects.
Integration testing should include the following objectives:
• Performed by the product group / development test team after the features are complete
• Determines that all product components function successfully together on a list of specific platforms (the list is specified in the master test plan)
• Performed in a basic product/platform environment (the basic environment is specified in the master test plan)
• Tests the product functionality against the specification
• Tests functionality with fake (pseudo-localized) languages, using sample single-byte and double-byte languages
• Tests scaling to an acceptable minimum level, as called out in the master test plan
• Tests performance and reliability to an acceptable level, as called out in the master test plan
• Final integration tests are done after all components are integrated, with the build in production format
The tasks of the project have been integrated, and the functioning of the entire system has been found to be satisfactory. The functionality of the entire system has been subjected to a series of tests, and all the modules have been found to interoperate properly.
Finally, integration testing was performed on the integrated system, which was found to work properly.


6.3 SAMPLE TEST CASES


Some of the sample test cases employed, along with their results, are described in the table below.

Table 6.1 Sample Test Cases

Test Description                                           Result

Is Hadoop stable when running more than one client job?   OK
Do the VMs return their results properly?                  OK
Does the MSA execute optimally?                            OK
Is the alignment computed accurately?                      OK
Does Hadoop run in OpenVZ without errors?                  OK


CHAPTER 7

SNAPSHOT
This chapter contains snapshots of the system in operation.

7.1 Node Details


7.2 Parallel Jobs


CONCLUSION

Thus, the analysis, design and implementation of MSA on Hadoop with OpenVZ have been completed successfully. The user is able to align very large DNA sequences and to view and set the virtual environments of OpenVZ. This is very useful for aligning DNA sequences in a platform-virtualized environment. Since the alignments run concurrently, higher performance is obtained.


FUTURE ENHANCEMENTS
Currently, Hadoop and OpenVZ must be installed and configured manually on every node, and the system does not provide fine-grained scheduling of alignment jobs. Future enhancements are to build an auto-configuration script that automatically installs and configures Hadoop and OpenVZ, and to design an efficient scheduling agent for executing alignment jobs.


BIBLIOGRAPHY

• Saul B. Needleman and Christian D. Wunsch, “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins”, Journal of Molecular Biology, Vol. 48, 1970, pp. 443-453.

• Wang-Sheng Juang and Shun-Feng Su, “Multiple Sequence Alignment Using Modified Dynamic Programming and Particle Swarm Optimization”, Journal of the Chinese Institute of Engineers, Vol. 31, No. 4, 2008, pp. 659-673.

• P. V. Lakshmi, Allam Appa Rao and G. R. Sridhar, “An Efficient Progressive Alignment Algorithm for Multiple Sequence Alignment”, International Journal of Computer Science and Network Security, Vol. 8, No. 10, October 2008.

• Jens Stoye, Vincent Moulton and Andreas W. M. Dress, “DCA: An Efficient Implementation of the Divide-and-Conquer Approach to Simultaneous Multiple Sequence Alignment”, Vol. 13, No. 6, 1997, pp. 625-626. Research Center for Interdisciplinary Studies on Structure Formation (FSPM), University of Bielefeld.

• Grady Booch, Object-Oriented Analysis and Design with Applications, Second Edition, Benjamin/Cummings, 1994, ISBN 0-8053-5340-2.

• Elliotte Rusty Harold, Java Network Programming, Second Edition, O'Reilly & Associates, Inc.

• Herbert Schildt and Patrick Naughton, Java 2: The Complete Reference, Fourth Edition, Tata McGraw-Hill Publishing Company Limited, 2001.

Websites
http://en.wikipedia.org/
http://www.omg.org/docs/formal/00-03-01.pdf
http://www.uml-forum.com/FAQ.htm


APPENDIX

Needleman-Wunsch Algorithm
The Needleman-Wunsch algorithm performs a global alignment on two sequences (called A and B here). It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was proposed in 1970 by Saul Needleman and Christian Wunsch in their paper “A general method applicable to the search for similarities in the amino acid sequence of two proteins”, J. Mol. Biol. 48(3):443-53.

The Needleman-Wunsch algorithm is an example of dynamic programming, and it is guaranteed to find the alignment with the maximum score. It was the first instance of dynamic programming being applied to biological sequence comparison.

Scores for aligned characters are specified by a similarity matrix. Here, S(i,j) is the
similarity of characters i and j. It uses a linear gap penalty, here called d.

For example, if the similarity matrix is

        A    G    C    T
   A   10   -1   -3   -4
   G   -1    7   -5   -3
   C   -3   -5    9    0
   T   -4   -3    0    8

then the alignment:

AGACTAGTTAC
CGA---GACGT

with a gap penalty of d = -5, would have the following score: S(A,C) + S(G,G) + S(A,A) + 3d + S(G,G) + S(T,A) + S(T,C) + S(A,G) + S(C,T) = -3 + 7 + 10 - 15 + 7 - 4 + 0 - 1 + 0 = 1.

To find the alignment with the highest score, a two-dimensional array (or matrix) is allocated. This matrix is often called the F matrix, and its (i,j)-th entry is denoted F(i,j). There is one column for each character in sequence A, and one row for each character in sequence B. Thus, if we are aligning sequences of sizes n and m, the amount of memory used by the algorithm is O(nm). (However, there is a modified version of the algorithm which uses only O(m + n) space, at the cost of a higher running time. This modification is in fact a general technique which applies to many dynamic programming algorithms; it was introduced in Hirschberg's algorithm for solving the longest common subsequence problem.)

As the algorithm progresses, F(i,j) is assigned the optimal score for the alignment of the first i characters of A and the first j characters of B. The principle of optimality is then applied as follows.

Basis:
    F(0,j) = d * j
    F(i,0) = d * i
Recursion, based on the principle of optimality:
    F(i,j) = max( F(i-1,j-1) + S(A(i), B(j)), F(i,j-1) + d, F(i-1,j) + d )

Here A(i) and B(j) denote the i-th character of A and the j-th character of B.

The pseudo-code for the algorithm to compute the F matrix therefore looks like this (array indexes start at 0, so the i-th character of A is A(i-1)):

for i = 0 to length(A)
    F(i,0) ← d * i
for j = 0 to length(B)
    F(0,j) ← d * j
for i = 1 to length(A)
    for j = 1 to length(B)
    {
        Choice1 ← F(i-1,j-1) + S(A(i-1), B(j-1))
        Choice2 ← F(i-1,j) + d
        Choice3 ← F(i,j-1) + d
        F(i,j) ← max(Choice1, Choice2, Choice3)
    }

Once the F matrix is computed, note that the bottom-right corner of the matrix holds the maximum score over all alignments. To recover the alignment that actually gives this score, start from the bottom-right cell and compare its value with the three possible sources (Choice1, Choice2 and Choice3 above) to see which it came from. If it was Choice1, then A(i-1) and B(j-1) are aligned; if it was Choice2, then A(i-1) is aligned with a gap; and if it was Choice3, then B(j-1) is aligned with a gap.

AlignmentA ← ""
AlignmentB ← ""
i ← length(A)
j ← length(B)
while (i > 0 and j > 0)
{
Score ← F(i,j)
ScoreDiag ← F(i - 1, j - 1)
ScoreUp ← F(i, j - 1)
ScoreLeft ← F(i - 1, j)
if (Score == ScoreDiag + S(A(i), B(j)))
{

19
APPENDIX

AlignmentA ← A(i-1) + AlignmentA


AlignmentB ← B(j-1) + AlignmentB
i←i-1
j←j-1
}
else if (Score == ScoreLeft + d)
{
AlignmentA ← A(i-1) + AlignmentA
AlignmentB ← "-" + AlignmentB
i←i-1
}
otherwise (Score == ScoreUp + d)
{
AlignmentA ← "-" + AlignmentA
AlignmentB ← B(j-1) + AlignmentB
j←j-1
}
}
while (i > 0)
{
AlignmentA ← A(i-1) + AlignmentA
AlignmentB ← "-" + AlignmentB
i←i-1
}
while (j > 0)
{
AlignmentA ← "-" + AlignmentA
AlignmentB ← B(j-1) + AlignmentB
j←j-1
}

Types of virtualization

In the context of this report, virtualization is a system or method of dividing computer resources into multiple isolated environments. Four types of such virtualization can be distinguished: emulation, paravirtualization, operating system-level virtualization, and multiserver (cluster) virtualization. Each virtualization type has its pros and cons that determine its appropriate applications.

Emulation makes it possible to run any unmodified operating system that supports the platform being emulated. Implementations in this category range from pure emulators (like Bochs) to solutions which let some code execute on the CPU natively in order to increase performance. The main disadvantages of emulation are low performance and low density. Examples: VMware products, QEMU, Bochs, Parallels.


Paravirtualization is a technique to run multiple modified OSs on top of a thin layer called a hypervisor, or virtual machine monitor. Paravirtualization has better performance than emulation, but the disadvantage is that the “guest” OS needs to be modified. Examples: Xen, UML.

Operating system-level virtualization enables multiple isolated execution environments within a single operating system kernel. It has the best possible (i.e., close to native) performance and density, and it features dynamic resource management. On the other hand, this technology does not allow running kernels from different OSs at the same time. Examples: FreeBSD Jail, Solaris Zones/Containers, Linux-VServer, OpenVZ and Virtuozzo.

OpenVZ kernel

The OpenVZ kernel is a modified Linux kernel which adds the following functionality: virtualization and isolation of various subsystems, resource management, and checkpointing. Virtualization and isolation enable many virtual environments (VEs) within a single kernel. The resource management subsystem limits (and in some cases guarantees) resources such as CPU, RAM, and disk space on a per-VE basis. Checkpointing is the process of “freezing” a VE and saving its complete state to a disk file, with the ability to “unfreeze” that state later. These components are described below.

Virtualization and isolation 

Each VE has its own set of resources provided by the operating system kernel.
Inside the kernel, those resources are either virtualized or isolated. Each VE has its own
set of objects, such as the ones described below.

Files – System libraries, applications, virtualized /proc and /sys, virtualized locks, etc.
Users and groups – Each VE has its own root user, as well as other users and groups.
Process tree – A VE sees only its own set of processes, starting from init. PIDs are
virtualized, so that the init PID is 1 as it should be.


Network – Virtual network device, which allows the VE to have its own IP addresses, as
well as a set of netfilter (iptables) and routing rules.
Devices – Some devices are virtualized. In addition, if there is a need, any VE can be
granted (an exclusive) access to real devices like network interfaces, serial ports, disk
partitions, etc.
IPC objects – Shared memory, semaphores, and messages.

Resource management 

Resource management is of paramount importance for operating system-level virtualization solutions, because there is a finite set of resources within a single kernel that is shared among multiple virtual environments. All these resources need to be controlled in a way that lets many VEs co-exist on a single system without influencing each other. The OpenVZ resource management subsystem consists of three components:

1. Two-level disk quota – The OpenVZ server administrator can set up per-VE disk
quotas in terms of disk space and number of inodes. This is the first level of disk quota.
The second level of disk quota lets the VE administrator (VE root) use standard UNIX
quota tools to set up per-user and per-group disk quotas.

2. “Fair” CPU scheduler – The OpenVZ CPU scheduler is also two-level. On the first level it decides which VE to give the time slice to, taking into account the VE's CPU priority and limit settings. On the second level, the standard Linux scheduler decides which process in the VE to give the time slice to, using standard process priorities.

3. User Beancounters – This is a set of per-VE counters, limits, and guarantees. There is a set of about 20 parameters, carefully chosen to cover all aspects of VE operation, so that no single VE can abuse any resource that is limited for the whole computer and thus cause harm to other VEs. The resources accounted and controlled are mainly memory and various in-kernel objects such as IPC shared memory segments, network buffers, etc.


Checkpointing and live migration 

Checkpointing allows the “live” migration of a VE to another physical server. The VE is “frozen” and its complete state is saved to a disk file. This file can then be transferred to another machine, where the VE can be “unfrozen” (restored). The whole process takes a few seconds and, from the client's point of view, it looks not like downtime but rather like a delay in processing, since the established network connections are also migrated.

OpenVZ Utilities 

1. vzctl
OpenVZ comes with the vzctl utility, which implements a high-level command-line interface for managing virtual environments. For example, creating and starting a new VE takes just two commands: vzctl create and vzctl start. The vzctl set command is used to change various VE parameters. Note that all the resources (for example, the VE virtual memory size) can be changed at runtime. This is usually impossible with other virtualization technologies such as emulation or paravirtualization.

2. Templates and vzpkg

Templates are pre-built images used to create a new VE. A template is a set of packages, and a template cache is an archive (tarball) of a chrooted environment with those packages installed. During the vzctl create stage, this tarball is unpacked. Using the template cache technique, a new VE can be created in seconds, enabling fast deployment scenarios. vzpkg is a set of tools that facilitates template cache creation. It currently supports rpm- and yum-based repositories. For example, to create a template of the Fedora Core 5 distribution, one needs to specify a set of (yum) repositories which have FC5 packages, and a set of packages to be installed. In addition, pre- and post-install scripts can be employed to further optimize or modify a template cache. All the above data (repositories, lists of packages, scripts, GPG keys, etc.) form the template metadata. With template metadata, a template cache can be created automatically by the vzpkgcache utility. It will download and install the listed packages into a temporary VE, and pack the result as a template cache. Template caches for non-RPM distributions can be created as well, although this is more of a manual process.

