BAKTAVATCHALAM.G (08MW03)
MASTER OF ENGINEERING
of Anna University
September 2009
…..…………………. ……….………………….
Ms. M.Gowri Shankar Dr. S.N.Sivanandam
Faculty Guide Head of the Department
Certified that the candidate was examined in the viva-voce examination held on ………………….
…………………….. …………………………..
(Internal Examiner) (External Examiner)
Acknowledgement
ACKNOWLEDGEMENT
We extend our sincere thanks to our internal guide and faculty in charge Mr. M.
Gowri Shankar, Lecturer, Department of Computer Science and Engineering, for
his guidance and help rendered for the successful completion of our project.
Synopsis
SYNOPSIS
Contents
CONTENTS
Synopsis
List of Figures
List of Tables
1. INTRODUCTION
1.1 Problem Definition
1.2 Objective of the Project
1.3 Significance of the Project
1.4 Outline of the Project
2. SYSTEM STUDY
2.1 Existing System
2.2 Proposed System
3. SYSTEM ANALYSIS
3.1 Requirement Analysis
3.2 Feasibility Study
4. SYSTEM DESIGN
4.1 Contextual Activity Diagram
5. SYSTEM IMPLEMENTATION
5.1 Aligner Module
5.2 FileUtil Module
5.3 GAJobRunner Module
6. TESTING
6.1 Unit Testing
6.2 Integration Testing
6.3 Sample Test Cases
7. SNAPSHOT
7.1 Nodes
7.2 Parallel Jobs
CONCLUSIONS
FUTURE ENHANCEMENTS
BIBLIOGRAPHY
APPENDIX
List of Tables
LIST OF TABLES
List of Figures
LIST OF FIGURES
Introduction Chapter 1
CHAPTER 1
INTRODUCTION
This chapter provides a brief overview of the company profile, the problem definition,
the objectives and significance of the project, and an outline of the report.
the system. Chapter 4 presents the overall design of the system. Chapter 5 discusses
the implementation details. Chapter 6 explains various testing procedures conducted on
the system. Chapter 7 contains the snapshot of various forms in our system. The last
section summarizes the project.
System Study Chapter 2
CHAPTER 2
SYSTEM STUDY
This chapter describes the existing system and gives a brief description of the
proposed system.
In our system, we use OpenVZ to provide VMs (Virtual Machines), and each VM runs its
own Hadoop DataNode and TaskTracker. Many isolated VMs can be created on a single
core kernel. Each VM has its own files, users, process tree, network, devices, and IPC
objects, and also supports dynamic resource management and checkpointing (state
dumping). The input sequences are given to the Modified MSA, which rearranges them
into pairs of sequences. Each pair is then executed in its own VM in the
Hadoop cluster. This process is repeated until the user-specified level is finished. Finally,
the results are gathered from all VMs, combined, and displayed to the user.
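The pairing and concurrent execution described above can be illustrated with a small sketch. It is not the project's Modified MSA or its Hadoop job code. The names `make_pairs`, `align_pair`, `run_level` and `run_levels` are hypothetical, the aligner is a stand-in that only joins its inputs, a local thread pool plays the role of the VMs, and treating a "level" as one round of pairing is only one possible reading of the text.

```python
from concurrent.futures import ThreadPoolExecutor


def make_pairs(sequences):
    """Group consecutive sequences into pairs; an odd one out is paired with None."""
    pairs = []
    for k in range(0, len(sequences), 2):
        first = sequences[k]
        second = sequences[k + 1] if k + 1 < len(sequences) else None
        pairs.append((first, second))
    return pairs


def align_pair(pair):
    """Placeholder for a real pairwise aligner (assumption, not the thesis code)."""
    first, second = pair
    if second is None:
        return first
    # A real aligner would return an alignment; this stand-in just joins the inputs.
    return first + "|" + second


def run_level(sequences, workers=4):
    """Process all pairs of one level concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(align_pair, make_pairs(sequences)))


def run_levels(sequences, levels):
    """Repeat the pairing for the requested number of levels (assumed meaning)."""
    current = list(sequences)
    for _ in range(levels):
        if len(current) <= 1:
            break
        current = run_level(current)
    return current
```

In the system described here, each call to `align_pair` would correspond to one alignment job running inside a VM of the Hadoop cluster.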
System Analysis Chapter 3
CHAPTER 3
SYSTEM ANALYSIS
This section describes the hardware and software specifications for the
development of the system and an analysis on the feasibility of the system.
• Economic Feasibility
• Technical Feasibility
• Operational Feasibility
The system is developed using only software that is already widely used in
the market, so no new software needs to be installed. Hence, the cost
incurred towards this project is negligible.
3.2.2.2 Virtualization
The next important step in our project is to configure OpenVZ to
provide platform virtualization, which increases concurrency.
System Design Chapter 4
CHAPTER 4
SYSTEM DESIGN
This chapter describes the functional decomposition of the system and illustrates
the movement of data between external entities, the processes and the data stores
within the system, with the help of data flow diagrams.
Implementation Chapter 5
CHAPTER 5
IMPLEMENTATION
This phase is broken up into two phases: Development and Implementation. The
individual system components are built during the development period. Programs are
written and tried by users. During Implementation, the components built during
development are put into operational use. In the development phase of our system, the
following system components were built.
• Aligner module
• FileUtil module
• GAJobRunner
Testing Chapter 6
CHAPTER 6
TESTING
This chapter explains the various testing procedures conducted on the system.
Testing is a process of executing a program with the intent of finding an error. A
successful test is one that uncovers an as yet undiscovered error. A testing process
cannot show the absence of defects but can only show that software errors are present.
It ensures that defined input will produce actual results that agree with the required
results. A good testing methodology should:
• Clearly define testing roles, responsibilities and procedures
• Establish a consistent testing process
• Streamline testing requirements
• Overcome the “requirements slow me down” mentality
• Take a common-sense process approach
• Use some elements of the existing process
• Not attempt to replace, rewrite or redefine the process
• Find defects early and give developers good time for bug fixes
• Bring an independent perspective to testing
errors within the boundary of the module. While accepting a connection using specified
functions, we carry out unit testing in the respective modules. The unit test is normally a
white box test (a testing method in which the control structure of the procedural design is
used to derive test cases).
Snapshot Chapter 7
CHAPTER 7
SNAPSHOT
This chapter contains various snapshots of our system.
Conclusion
CONCLUSION
Thus the analysis, design and implementation of MSA in Hadoop with OpenVZ
have been completed successfully. The user is able to align very large DNA
sequences and to view and set the virtual environments of OpenVZ. This is
very useful for aligning DNA sequences in a platform-virtualized environment. In addition, the
alignments run concurrently, which gives higher performance.
Future Enhancements
FUTURE ENHANCEMENTS
Currently, Hadoop and OpenVZ must be installed and configured manually on all nodes,
and the system lacks fine-grained scheduling of alignment jobs. Future
enhancements are to build an AutoConfigure script that automatically installs
and configures Hadoop and OpenVZ, and to design an efficient scheduling agent that
executes alignment jobs.
Bibliography
BIBLIOGRAPHY
• Saul B. Needleman and Christian D. Wunsch, “A General Method Applicable to the Search
for Similarities in the Amino Acid Sequence of Two Proteins”, Journal of Molecular
Biology, Vol. 48, No. 3, 1970, pp. 443-453.
• Wang-Sheng Juang and Shun-Feng Su, “Multiple Sequence Alignment Using Modified
Dynamic Programming and Particle Swarm Optimization”, Journal of the Chinese
Institute of Engineers, Vol. 31, No. 4, 2008, pp. 659-673.
• Jens Stoye, Vincent Moulton and Andreas W.M. Dress, “DCA: An Efficient
Implementation of the Divide-and-Conquer Approach to Simultaneous Multiple
Sequence Alignment”, Vol. 13, No. 6, 1997, pp. 625-626. Research Center for
Interdisciplinary Studies on Structure Formation (FSPM), University of Bielefeld.
• [Booch 1994] Booch, G., Object Oriented Analysis and Design with Applications
(second edition), Benjamin/Cummings, 1994, ISBN 0-8053-5340-2.
• Herbert Schildt and Patrick Naughton, “Java 2: The Complete Reference”, Fourth
Edition, Tata McGraw-Hill Publishing Company Limited, 2001.
Websites
http://en.wikipedia.org/
http://www.omg.org/docs/formal/00-03-01.pdf
http://www.uml-forum.com/FAQ.htm
APPENDIX
APPENDIX
Needleman-Wunsch Algorithm
The Needleman-Wunsch algorithm performs a global alignment on two
sequences (called A and B here). It is commonly used in bioinformatics to align protein
or nucleotide sequences. The algorithm was proposed in 1970 by Saul Needleman and
Christian Wunsch in their paper “A general method applicable to the search for
similarities in the amino acid sequence of two proteins”, J Mol Biol. 48(3):443-53.
Scores for aligned characters are specified by a similarity matrix. Here, S(i,j) is the
similarity of characters i and j. The algorithm uses a linear gap penalty, here called d.
     A    G    C    T
A   10   -1   -3   -4
G   -1    7   -5   -3
C   -3   -5    9    0
T   -4   -3    0    8
AGACTAGTTAC
CGA---GACGT
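As a worked example, the score of the alignment shown above can be computed directly from the similarity matrix. The sketch below assumes a gap penalty of d = -5, a value chosen for illustration because the text does not fix d. The names `S`, `GAP_PENALTY` and `alignment_score` are introduced here and are not part of the project code.

```python
# Similarity matrix S, copied from the table above.
S = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}

GAP_PENALTY = -5  # assumed value of d, for illustration only


def alignment_score(row_a, row_b, gap=GAP_PENALTY):
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap          # a character aligned with a gap
        else:
            total += S[x][y]      # two aligned characters
    return total


print(alignment_score("AGACTAGTTAC", "CGA---GACGT"))  # prints 1
```

Column by column the sum is -3 + 7 + 10 - 3×5 + 7 - 4 + 0 - 1 + 0 = 1.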
To find the alignment with the highest score, a two-dimensional array (or matrix)
is allocated. This matrix is often called the F matrix, and its (i,j)th entry is often denoted
$F_{ij}$. There is one column for each character in sequence A, and one row for each
character in sequence B. Thus, if we are aligning sequences of sizes n and m, the
amount of memory used by the algorithm is in O(nm). (However, there is a modified
version of the algorithm which uses only O(m + n) space, at the cost of a higher running
time. This modification is in fact a general technique which applies to many dynamic
programming algorithms; this method was introduced in Hirschberg's algorithm for
solving the longest common subsequence problem.)
As the algorithm progresses, $F_{ij}$ is assigned the optimal score for the
alignment of the first i characters in A and the first j characters in B. The principle of
optimality is then applied as follows.
Basis:
$$F_{0j} = d \cdot j$$
$$F_{i0} = d \cdot i$$
Recursion, based on the principle of optimality:
$$F_{ij} = \max\left(F_{i-1,j-1} + S(A_i, B_j),\; F_{i,j-1} + d,\; F_{i-1,j} + d\right)$$
The pseudo-code for the algorithm to compute the F matrix therefore looks like this
(array indexes start at 0):
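A minimal Python sketch of this computation follows. It applies the basis and recursion above, with index i running over A and j over B. The function name `fill_matrix`, the scoring function `match_score` (+1 for a match, -1 for a mismatch) and the gap penalty of -1 in the usage lines are illustrative assumptions, not values taken from the project.

```python
def fill_matrix(a, b, score, d):
    """Return the F matrix for sequences a and b.

    score(x, y) is the similarity of characters x and y; d is the linear
    gap penalty. F[i][j] is the best score for aligning a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Basis: aligning a prefix against nothing costs one gap per character.
    for i in range(n + 1):
        F[i][0] = d * i
    for j in range(m + 1):
        F[0][j] = d * j
    # Recursion: each cell is the best of its three possible sources.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choice1 = F[i - 1][j - 1] + score(a[i - 1], b[j - 1])  # align a[i-1] with b[j-1]
            choice2 = F[i - 1][j] + d                              # a[i-1] against a gap
            choice3 = F[i][j - 1] + d                              # b[j-1] against a gap
            F[i][j] = max(choice1, choice2, choice3)
    return F


def match_score(x, y):
    """Illustrative scoring: +1 for a match, -1 for a mismatch (assumption)."""
    return 1 if x == y else -1


F = fill_matrix("AG", "A", match_score, -1)
print(F)  # [[0, -1], [-1, 1], [-2, 0]]
```

The optimal score of the full alignment is the last entry, `F[len(a)][len(b)]`.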
Once the F matrix is computed, the bottom right hand corner of the
matrix holds the maximum score over all alignments. To compute which alignment actually
gives this score, start from the bottom right cell and compare its value with the
three possible sources (Choice1 from F(i - 1, j - 1), Choice2 from F(i - 1, j), and
Choice3 from F(i, j - 1)) to see which it came from. If it was Choice1, then A(i) and B(j)
are aligned; if it was Choice2, then A(i) is aligned with a gap; and if it was Choice3,
then B(j) is aligned with a gap.
AlignmentA ← ""
AlignmentB ← ""
i ← length(A)
j ← length(B)
while (i > 0 and j > 0)
{
Score ← F(i,j)
ScoreDiag ← F(i - 1, j - 1)
ScoreUp ← F(i, j - 1)
ScoreLeft ← F(i - 1, j)
if (Score == ScoreDiag + S(A(i), B(j)))
{
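The traceback described above can be sketched end to end in Python. This is an illustrative version rather than the project's Aligner module: the name `needleman_wunsch`, the scoring function `match_score` and the gap penalty used in the usage lines are assumptions. The loop also handles the case where one sequence is exhausted before the other, so the remaining characters are aligned with gaps.

```python
def needleman_wunsch(a, b, score, d):
    """Globally align a and b; return (aligned_a, aligned_b, optimal_score)."""
    n, m = len(a), len(b)
    # Fill the F matrix (basis rows and columns, then the recursion).
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = d * i
    for j in range(1, m + 1):
        F[0][j] = d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)
    # Trace back from the bottom right cell to the top left cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])   # both characters are aligned
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            out_a.append(a[i - 1])   # character of a aligned with a gap
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")        # character of b aligned with a gap
            out_b.append(b[j - 1])
            j -= 1
    # The rows were built back to front, so reverse them.
    return "".join(reversed(out_a)), "".join(reversed(out_b)), F[n][m]


def match_score(x, y):
    """Illustrative scoring: +1 for a match, -1 for a mismatch (assumption)."""
    return 1 if x == y else -1


print(needleman_wunsch("AG", "A", match_score, -1))  # ('AG', 'A-', 0)
```

When several alignments share the optimal score, this sketch returns the one that prefers the diagonal move first; other tie-breaking orders give equally valid alignments.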
Types of virtualization
In the context of this report, virtualization is a system or a method of dividing
computer resources into multiple isolated environments. It is possible to distinguish four
types of such virtualization: emulation, paravirtualization, operating system-level
virtualization, and multiserver (cluster) virtualization. Each virtualization type has its pros
and cons that condition its appropriate applications.
Emulation makes it possible to run any non-modified operating system which supports
the platform being emulated. Implementations in this category range from pure
emulators (like Bochs) to solutions which let some code be executed on the CPU
natively in order to increase performance. The main disadvantages of emulation are low
performance and low density. Examples: VMware products, QEMU, Bochs, Parallels.
OpenVZ kernel
The OpenVZ kernel is a modified Linux kernel which adds the following
functionality: virtualization and isolation of various subsystems, resource management,
and checkpointing. Virtualization and isolation enable many virtual environments (VEs) within
a single kernel. The resource management subsystem limits (and in some cases
guarantees) resources such as CPU, RAM, and disk space on a per-VE basis.
Checkpointing is the process of “freezing” a VE and saving its complete state to a disk file,
with the ability to “unfreeze” that state later. These components are described below.
Virtualization and isolation
Each VE has its own set of resources provided by the operating system kernel.
Inside the kernel, those resources are either virtualized or isolated. Each VE has its own
set of objects, such as the ones described below.
Files – System libraries, applications, virtualized /proc and /sys, virtualized locks, etc.
Users and groups – Each VE has its own root user, as well as other users and groups.
Process tree – A VE sees only its own set of processes, starting from init. PIDs are
virtualized, so that the init PID is 1 as it should be.
Network – Virtual network device, which allows the VE to have its own IP addresses, as
well as a set of netfilter (iptables) and routing rules.
Devices – Some devices are virtualized. In addition, if there is a need, any VE can be
granted access (optionally exclusive) to real devices like network interfaces, serial ports,
disk partitions, etc.
IPC objects – Shared memory, semaphores, and messages.
Resource management
1. Two-level disk quota – The OpenVZ server administrator can set up per-VE disk
quotas in terms of disk space and number of inodes. This is the first level of disk quota.
The second level of disk quota lets the VE administrator (VE root) use standard UNIX
quota tools to set up per-user and per-group disk quotas.
2. “Fair” CPU scheduler – The OpenVZ CPU scheduler is also two-level. On the first
level it decides which VE to give the time slice to, taking into account the VE’s CPU
priority and limit settings. On the second level, the standard Linux scheduler decides
which process in the VE to give the time slice to, using standard process priorities.
3. User Beancounters – This is a set of per-VE counters, limits, and guarantees. There
is a set of about 20 parameters which are carefully chosen to cover all the aspects of VE
operation, so no single VE can abuse any resource which is limited for the whole
computer and thus cause harm to other VEs. The resources accounted and controlled
are mainly memory and various in-kernel objects such as IPC shared memory
segments, network buffers etc.
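On an OpenVZ host these counters can be read from /proc/user_beancounters. The sketch below parses text in that layout, as far as the layout is understood here: a version line, a header line, then one row per resource with the columns held, maxheld, barrier, limit and failcnt, where the first row of each VE carries its ID. The sample text and all numbers in it are invented for illustration, and `parse_beancounters` is a hypothetical helper, not part of OpenVZ or of the project.

```python
# Made-up sample in the assumed layout of /proc/user_beancounters.
SAMPLE = """\
Version: 2.5
       uid  resource           held    maxheld    barrier      limit    failcnt
      101:  kmemsize         803866    1246758    2457600    2621440          0
            lockedpages           0          0         32         32          0
            privvmpages        5611       7709      49152      53575          2
"""


def parse_beancounters(text):
    """Return {ve_id: {resource: {column: value}}} from beancounter text."""
    columns = ("held", "maxheld", "barrier", "limit", "failcnt")
    counters = {}
    ve_id = None
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the version line and the column header.
        if not fields or fields[0] in ("Version:", "uid"):
            continue
        # The first row of each VE starts with "<id>:".
        if fields[0].endswith(":"):
            ve_id = int(fields[0][:-1])
            fields = fields[1:]
        name, values = fields[0], fields[1:]
        counters.setdefault(ve_id, {})[name] = dict(zip(columns, map(int, values)))
    return counters


ubc = parse_beancounters(SAMPLE)
# A non-zero failcnt means the VE hit the limit for that resource.
print([name for name, row in ubc[101].items() if row["failcnt"] > 0])
```

Watching the failcnt column is a simple way to see which limit a VE has run into.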
Checkpointing and live migration
OpenVZ Utilities
1 vzctl
OpenVZ comes with the vzctl utility, which implements a high-level command-line
interface to manage Virtual Environments. For example, creating and starting a new VE
takes just two commands: vzctl create and vzctl start. The vzctl set command is used to
change various VE parameters. Note that all the resources (for example, VE virtual
memory size) can be changed during runtime. This is usually impossible with other
virtualization technologies, like emulation or paravirtualization.
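The commands mentioned above can be assembled from a script. The sketch below only builds the argument lists and prints them; it does not run vzctl. The VE ID 101, the template name `centos-5-x86` and the `privvmpages` value are placeholders chosen for illustration, and the helper functions are hypothetical wrappers, not part of OpenVZ.

```python
def vzctl_create(ve_id, ostemplate):
    """Arguments for creating a VE from an OS template."""
    return ["vzctl", "create", str(ve_id), "--ostemplate", ostemplate]


def vzctl_start(ve_id):
    """Arguments for starting a VE."""
    return ["vzctl", "start", str(ve_id)]


def vzctl_set(ve_id, **params):
    """Arguments for changing VE parameters and saving them to the VE config."""
    cmd = ["vzctl", "set", str(ve_id)]
    for name, value in params.items():
        cmd += ["--" + name, str(value)]
    cmd.append("--save")
    return cmd


commands = [
    vzctl_create(101, "centos-5-x86"),           # placeholder ID and template
    vzctl_start(101),
    vzctl_set(101, privvmpages="49152:53575"),   # barrier:limit, placeholder values
]
for cmd in commands:
    print(" ".join(cmd))
```

On a real OpenVZ host, each list could be passed to `subprocess.run(cmd, check=True)` by a user with root privileges.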