Bioinformatics

and Role of Software Engineers In It

Abstract

Bioinformatics is the application of computers in biological sciences. It is concerned with
capturing, storing, graphically displaying, modeling and ultimately distributing biological
information. It is becoming an essential tool in molecular biology as genome projects generate
vast quantities of data.

The Human Genome Project has created the need for new kinds of scientific specialists
who can be creative at the interface of biology and other disciplines, such as computer science,
engineering, mathematics, physics, chemistry, and the social sciences. As the popularity of
genomic research increases, the demand for these specialists greatly exceeds the supply. In the
past, the genome project has benefited immensely from the talents of non-biological scientists,
and their participation in the future is likely to be even more crucial. Through this report I have
tried to analyze the future requirements for developing advanced technologies in this field and
the role that we, as software engineers, can play in developing them.

Contents

Chapter 1 Introduction to Bioinformatics
1.1 What is Bioinformatics?
1.2 Computers and Biology
1.3 Limitations in the use of computers
1.4 Current Stage of Research
1.5 Microbial, Plant and Animal Genomes
1.6 History (Stages of development)

Chapter 2 Basics of Molecular Biology
2.1 Nucleotide
2.2 Amino acid
2.3 Properties of Genetic Code
2.4 DNA (Deoxyribonucleic Acid)
2.5 Chromosomes
2.6 Gene
2.7 Protein
2.8 Sequencing
2.9 Genome
2.10 Clone
2.11 Model Organism

Chapter 3 Role of Software Engineers and Technology in Biotechnology
3.1 Need for software automation
3.2 Genetic Algorithms
3.2.1 Database Searching
3.2.2 Comparing Two Sequences
3.2.3 Multiple Sequence Alignment
3.3 Genome Projects
3.4 Goals for Advancements in Sequencing Technology
3.5 Developing Technology to handle Sequence Variations
3.6 Need of Technology in Functional Genomics
3.7 Bioinformatics and Computational Biology
3.8 Job Opportunities and Job Requirements
3.9 Training Goals included in the Human Genome Project Plan

Chapter 4 Human Genome Project
4.1 Introduction
4.2 Details of the Human Genome Project
4.3 U.S. Human Genome Project 5-Year Goals 1998-2003
4.3.1 Human DNA Sequencing
4.3.2 Sequencing Technology
4.3.3 Sequence Variation
4.3.4 Functional Genomics
4.3.5 Comparative Genomics
4.3.6 Ethical, Legal, and Social Implications (ELSI)
4.3.7 Bioinformatics and Computational Biology
4.3.8 Training

Chapter 5 Biological Databases
5.1 The Biological sequence/structure deficit
5.2 Biological Databases
5.3 Primary Sequence Databases
5.3.1 Nucleic acid Sequence Databases
5.3.2 Protein Sequence Databases
5.4 Composite Protein Sequences Databases
5.5 Secondary Databases
5.6 Tertiary Databases

Chapter 6 Applications of Bioinformatics
6.1 Application to the Ailments of Diseases
6.2 Application of Bioinformatics to Agriculture
6.2.1 Improvements in Crop Yield and Quality
6.3 Applications of Microbial Genomics
6.4 Risk Assessment
6.5 Evolution and Human Migration
6.6 DNA Forensics (Identification)

Bibliography

Chapter 1

Introduction to Bioinformatics

1.1 What is Bioinformatics?

Bioinformatics is the application of computers in the biological sciences, especially to the
analysis of biological sequence data. It is concerned with capturing, storing, graphically
displaying, modeling and ultimately distributing biological information. It is becoming an
essential tool in molecular biology as genome projects generate vast quantities of data. With new
sequences being added to DNA databases, on average, once every minute, there is a pressing
need to convert this information into biochemical and biophysical knowledge by deciphering the
structural, functional and evolutionary clues encoded in the language of biological sequences.
What Bioinformatics therefore offers the researcher, the entrepreneur, or the venture
capitalist is an enormous and exciting array of opportunities to discover how living systems
metabolise, grow, combat disease, reproduce and regenerate. Current knowledge represents
only the tip of the iceberg. Exciting and startling discoveries are being made every day through
Bioinformatics, which is building up an extensive encyclopedia from which life's mysteries will
be unraveled. The importance of computational science in collating this information, and its
simultaneous interpretation by biologists, is the underlying ethos of Bioinformatics.
An interest in biology and a strong inclination towards genetics are certainly helpful.
But from our point of view, the most important point is that biocomputing requires a large
number of software professionals, and there is more work for them to do than for the experts
in biology.

1.2 Computers and Biology

Bioinformatics is the symbiotic relationship between the computational and biological
sciences. The ability to sort and extract genetic codes from a human genomic database of 3
billion base pairs of DNA in a meaningful way is perhaps the simplest form of Bioinformatics.
Moving on to another level, Bioinformatics is useful in mapping different people's genomes and
deriving differences in their genetic make-up through pattern recognition software. But that is the
easiest part. What is more complex is to decipher the genetic code itself, to see what the
differences in genetic make-up between different people translate into in terms of physiological
traits. And there is yet another level, which is even more intricate, and that is the function of
the genetic code. The genetic code actually codes for amino acids and thereby proteins, and the
specific role played by each of these proteins controls the state of our health. The role or
function of each of our genes in coding for a specific protein, which in turn regulates a
particular metabolic pathway, is described as functional genomics. The true benefit of
Bioinformatics therefore lies in harnessing information pertaining to these genetic functions
in order to understand how human beings and other living systems operate.
Computational simulation of experimental biology, referred to as in silico testing, is an
important application of Bioinformatics. This area is likely to expand rapidly, given the need
to obtain a greater degree of predictability in animal and human clinical trials. In addition,
in silico testing offers a way to address the growing hostility towards animal testing. The
growth of this sector will largely depend on the acceptance of in silico testing by the
regulatory authorities. Irrespective of this, however, researchers will certainly find
computational modeling to be a vital tool in speeding up research, with enormous cost benefits.

1.3 Limitations in the use of computers


The last decade has witnessed the dawn of a new era of silicon-based biology, opening
the door, for the first time, to the possible investigation and comparative analysis of complete
genomes. Genome analysis means elucidating and characterizing the genes and gene products of
an organism. It depends on a number of pivotal concepts concerning the processes of evolution
(divergence and convergence), the mechanism of protein folding, and the manifestation of
protein function.
Today, our use of computers to model such processes is limited by, and must be placed in
the context of, the current limits of our understanding of these central themes. At the outset, it is
important to recognize that we do not yet fully understand the rules of protein folding; we cannot
invariably say that a particular sequence or a fold has arisen by divergent or convergent
evolution; and we cannot necessarily diagnose a protein function, given knowledge only of its
sequence or of its structure, in isolation. Accepting what we cannot do with computers plays an
essential role in forming an appreciation of what, in fact, we can do. Without this kind of
understanding, it is easy to be misled, as spurious arguments are often used to promote perhaps
rather overenthusiastic points of view about what particular programs and software packages can
achieve.
Nature has its own complex rules, which we only poorly understand and which we cannot
easily encapsulate within computer programs. No current algorithm can do biology. Programs
provide mathematical, and therefore infallible, models of biological systems. To interpret
correctly whether sequences or structures are meaningfully similar, whether they have arisen by
the processes of divergence or convergence, whether similar sequences or similar folds have the
same or different functions: these are the most challenging problems. There are no simple
solutions, and computers do not give us the answers; rather, given a sea of data, they help to
narrow the options down so that the users can begin to draw informed biologically reasonable
conclusions.

1.4 Current Stage of Research


In the field of Bioinformatics, the current research drive is to understand
evolutionary relationships in terms of the expression of protein function. Two computational
approaches have been brought to bear on the problem, tackling the identification of protein
function from the perspectives of sequence analysis and of structure analysis respectively. From
the point of view of sequence analysis, we are concerned with the detection of relationships
between newly determined sequences and those of known function (usually within a database).
This may mean pinpointing functional sites shared by disparate proteins (probably the result of
convergent evolution), or identifying related functions in similar proteins (most commonly the
result of divergent evolution).
The identification of protein function from sequence sounds straightforward, and indeed,
sequence analysis is usually a fruitful technique. However, function cannot be inferred from
sequence for about one-third of the proteins in any of the sequenced genomes, largely because
biological characterization cannot keep pace with the volume of data issuing from the genome
projects (a large number of database sequences thus either carry no annotation beyond the parent
gene name, or are simply designated as hypothetical proteins). Another important point is that, in
some instances, closely related sequences, which may be assumed to share a common structure,
may not share the same function. This means that, although sequence or structure analysis
can be used for deducing gene functions, neither technique can be applied infallibly without
reference to the underlying biology.

1.5 Microbial, Plant and Animal Genomes


Although the human genome appears to be the focal point of interest, microbial, plant and
animal genomes are equally exciting to explore through Bioinformatics. Mining plant genomes
has an important impact on opening up new vistas for research in agriculture. Microbial
genomics offers a dual opportunity of developing new fermentation-based products and
technologies as well as defining new ways of combating microbial infections. Exploring animal
genomics opens up unlimited scope to pursue research in veterinary science and transgenic
models.

1.6 History (Stages of development)

The science of sequencing began slowly. The earliest techniques were based on methods
for separation of proteins and peptides, coupled with methods for identification and
quantification of amino acids. Prior to 1945, there was not a single quantitative analysis available
for any one protein. However, significant progress with chromatographic and labeling
techniques over the next decade eventually led to the elucidation of the first complete sequence,
that of the peptide hormone insulin (1955). Yet it was another five years before the sequence of
the first enzyme (ribonuclease) was complete (1960). By 1965, around 20 proteins with more
than 100 residues had been sequenced, and by 1980, the number was estimated to be of the order
of 1500. Today, there are more than 300,000 sequences available.

Initially, the majority of protein sequences were obtained by the manual process of sequential
Edman degradation-dansylation. A key step towards the rapid increase in the number of sequenced
proteins was the development of automated sequencers, which, by 1980, offered a 10^4-fold
increase in sensitivity relative to the automated procedure implemented by Edman and Begg in
1967.

In the 1960s, scientists struggled to develop methods to sequence nucleic acids, but the
first techniques to emerge were really only applicable to tRNAs, because these are short (74 to 95
nucleotides in length) and individual molecules can be purified.

In contrast to RNA, DNA is very long: human chromosomal molecules may contain
between 55 × 10^6 and 250 × 10^6 base pairs. Assembling the complete nucleotide sequence of
such a DNA molecule is a huge task. Even if the sequence can be broken down into smaller
fragments, purification remains a problem. The advent of gene cloning provided a solution to
how the fragments can be separated. By 1977, two sequencing methods had emerged, using chain
termination and chemical degradation approaches. With only minor changes, the techniques
propagated to laboratories throughout the world and laid the foundation for the sequence
revolution of the next two decades, and the subsequent birth of Bioinformatics.

During the last decade, molecular biology has witnessed an information revolution as a
result both of the development of rapid DNA sequencing techniques and of the corresponding
progress in computer-based technologies, which are allowing us to cope with this information
deluge in increasingly efficient ways. The broad term that was coined in the mid-1980s to
encompass computer applications in the biological sciences is Bioinformatics. The term
Bioinformatics has been commandeered by several different disciplines to mean rather different
things. In its broadest sense, the term can be considered to mean information technology applied
to the management and analysis of biological data; this has implications in diverse
areas, ranging from artificial intelligence and robotics to genome analysis. In the context of
genome initiatives, the term was originally applied to the computational manipulation and
analysis of biological sequence data. However, in view of the recent rapid accumulation of
available protein structures, the term now also tends to be used to embrace the manipulation and
analysis of 3D structure data.


Chapter 2

Basics of Molecular Biology


This chapter briefly explains some of the common biological terms that are essential
for a clear understanding of what Bioinformatics is all about. I have avoided getting
into the intricacies of genetics because the basic aim of this report is to survey the latest
developments in the field of Bioinformatics, try to visualize where it is heading, understand what
it has to offer to the community, and exploit the opportunities available in this field.

2.1 Nucleotide

A nucleotide is a molecule made up of three sub-units: a pentose sugar, a nitrogenous
base and a phosphate. Nucleic acids are polymers of nucleotides. The pentose sugar is either
ribose or deoxyribose (this decides whether the nucleic acid formed is RNA or DNA). Nitrogenous
bases are of two types: purines (Adenine (A), Guanine (G)) and pyrimidines (Cytosine (C), Thymine
(T) and Uracil (U)).

2.2 Amino acid

It is the fundamental building block of proteins. There are 20 naturally occurring amino
acids in animals and around 100 more found only in plants. A sequence of three nucleotides
codes for one amino acid. The logic behind this is as follows. There are four types of nucleotides,
depending on the nitrogenous base: (A, G, C, T) in DNA and (A, G, C, U) in RNA. Twenty different
amino acids have to be coded using permutations of these 4 types of nucleotides. Words of one or
two nucleotides give only 4 or 4^2 = 16 combinations, which is insufficient, whereas words of
three nucleotides give 4^3 = 64 > 20; more than three would add unnecessary redundancy. The
sequence of three nucleotides specifying an amino acid is called a triplet code or codon (coding
unit). All 64 codons specify something: most of them specify amino acids, but a few are
instructions for starting and stopping the synthesis.
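
The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `code_words` is a hypothetical helper introduced here, not part of any library.

```python
from itertools import product

BASES = "ACGT"  # the four DNA bases named in the text


def code_words(length):
    """Return every possible 'word' of the given length over the four bases."""
    return ["".join(letters) for letters in product(BASES, repeat=length)]


for length in (1, 2, 3):
    count = len(code_words(length))
    verdict = "enough" if count >= 20 else "too few"
    print(f"words of length {length}: {count} ({verdict} for 20 amino acids)")
```

Words of length three are the shortest that give at least 20 distinct combinations, which is the reasoning behind the triplet code.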

2.3 Properties of Genetic Code


1. Three nucleotides in a DNA molecule code for one amino acid in the corresponding
protein. Such a triplet is called a codon.
2. The code is read from a fixed starting point.
3. Codes for starting and stopping are present, but not for a pause in the middle, or
comma.
4. The nucleotides are read three at a time in a non-overlapping manner.
5. Most of the 64 possible nucleotide triplets stand for one amino acid or another.
6. A few triplets stand for starting and stopping the synthesis.
7. The same amino acid can be specified by two or more different codons. Because of this,
the genetic code is said to be degenerate.
8. The code has polarity because it can be read only in one direction.
9. The code is universal. Practically all organisms use the same code.
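
Several of the properties listed above (fixed starting point, non-overlapping triplets, stop signals, degeneracy) can be illustrated with a small translation routine. The sketch below builds the standard genetic code table from a compact 64-letter string and assumes an upper-case DNA coding-strand sequence containing only A, C, G and T. The function name `translate` is a hypothetical helper, and starting at the first ATG found is a deliberate simplification of how real genes are located.

```python
# Standard genetic code, listed in TCAG order for each codon position.
# "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(dna):
    """Translate from the first ATG, three bases at a time, until a stop codon."""
    start = dna.find("ATG")          # fixed starting point (simplified)
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(dna) - 2, 3):   # non-overlapping triplets
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":                   # stop signal ends synthesis
            break
        protein.append(amino_acid)
    return "".join(protein)


# Degeneracy: count how many codons specify leucine (one-letter code "L").
leucine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")
print(translate("CCATGGCTTAAGG"), leucine_codons)
```

Here the reading starts at ATG, reads GCT as the next codon, and stops at TAA; the leading and trailing bases are ignored.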

2.4 DNA (Deoxyribonucleic Acid)

The long, thread-like DNA molecule consists of two strands that are joined to one another
all along their length. Each strand is a polymer made up of repeated sub-units (nucleotides).
Hence each strand is also called a polynucleotide. DNA is the basic genetic material in all
living organisms on this earth. The two essential capabilities possessed by DNA are (1)
transmission of hereditary characters and (2) self-duplication. In the DNA molecule,
two long polynucleotide chains are spirally twisted around each other. This is also called helical
coiling, and DNA is often referred to as a double helix. A polynucleotide chain has polarity,
and the two strands of a DNA molecule run in opposite directions; hence they are said to be
antiparallel. The two chains are joined together by hydrogen bonds between the nitrogenous
bases on the inside. Adenine (A) forms a bond only with Thymine (T), and Guanine (G) forms
a bond only with Cytosine (C). Because of this base pairing restriction, the two strands are always
complementary to each other.
The sequence of bases along the polynucleotide is not restricted in any way. An infinite
variety of combinations is possible. It is the precise sequence of bases that determines the genetic
information. There is no theoretical limit to the length of a DNA molecule.
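
Because of the pairing rule (A with T, G with C) and the antiparallel orientation, one strand fully determines the other. A minimal sketch of this in Python follows, assuming upper-case sequences over A, C, G and T; the function names are illustrative.

```python
# Pairing rule from the text: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Return the base-by-base partner of a strand (same left-to-right order)."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand):
    """Return the partner strand read in its own direction (antiparallel)."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # partner bases, position by position
print(reverse_complement("ATGC"))  # partner strand read in the opposite direction
```

Applying `reverse_complement` twice returns the original strand, which reflects the fact that the two strands are complementary to each other.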

2.5 Chromosomes

Chromosomes are the paired, self-replicating genetic structures of cells that contain the
cellular DNA; the nucleotide sequence of the DNA encodes the linear array of genes.

2.6 Gene

A gene is the fundamental physical and functional unit of heredity. A gene is an ordered
sequence of nucleotides located in a particular position on a particular chromosome that encodes
a specific functional product (i.e. a protein or RNA molecule).

2.7 Protein

Protein is a molecule composed of one or more chains of amino acids in a specific order.
The order is determined by the base sequence of nucleotides in the gene coding for the protein.
Proteins are required for the structure, function and regulation of cells, tissues and organs, each
protein having a specific role (e.g., hormones, enzymes and antibodies).
DNA carries the hereditary information, and it acts by directing the synthesis of
proteins; all the hereditary characteristics are then reflected in the activities of the body
cells through these proteins.

2.8 Sequencing

Sequencing means the determination of the order of nucleotides (base sequences) in a
DNA or RNA molecule, or the order of amino acids in a protein.

2.9 Genome

Genome of an organism means all the genetic material in its chromosomes. Its size is
generally given as its total number of base pairs. Genomes of different organisms can be
compared to identify similarities and disparities in the strategies for the Logic of Life.

2.10 Clone
A clone is an exact copy of biological material such as a DNA segment, a whole cell
or a complete organism. The process of creating a clone is called cloning.

2.11 Model Organism


Saccharomyces cerevisiae, commonly known as baker's yeast, has emerged as the
model organism. It has demonstrated the fundamental conservation of the basic informational
pathways found in almost all organisms. From the detailed study of the genomes of such
organisms (which is possible today), we can gain an insight into their functioning. All this data
will lead to fundamental insights into human biology.
The vast amount of genetic data available on this species provides important clues for
the ongoing research on human genetics. Saccharomyces cerevisiae has become the workhorse of
many biotechnology labs. It can exist in either a haploid or a diploid state and divides by the
vegetative process of budding. Yeast cultures can be easily propagated in labs. It has become the
model organism partly because of the ease with which genetic manipulations can be carried out.
Random mutations can be induced in the genome by treating live cells with chemicals
such as ethyl methanesulphonate or by exposure to ultraviolet rays. Targeted gene inactivations
can also be carried out; this property is very important in experiments for the unambiguous
assignment of gene functions.
Saccharomyces cerevisiae has a compact genome of about 12 million base pairs of DNA present
on 16 chromosomes. This presented a reasonable goal for complete sequencing and analysis of its
genome. The Saccharomyces Genome Database (SGD) was established at Stanford University
in 1995.
Knowing the complete sequence of a genome is only the first step in understanding how
the huge amount of information contained in genes is translated into functional proteins.

Chapter 3

Role of Software Engineers and Technology in Biotechnology

The tools of computer science, statistics and mathematics are critical for studying biology
as an informational science. Curiously, biology is the only science that, at its very heart,
employs a digital language. The grand challenge in biology is to determine how the digital
language of the chromosomes is converted into the 3-D and 4-D (time-varying) languages of
living organisms.

3.1 Need for software automation


DNA encodes the information necessary for building and maintaining life. DNA is a
non-branching, double-stranded macromolecule in which the nucleotide building blocks (A, C, G, T)
are linked. Bases are arranged in A-T and C-G pairs. Small viral genomes of the order of several
thousand bases were the first to be sequenced, in the 1970s. A few years later, genomes of the
order of 40 kilo base pairs represented the limit of what could reasonably be sequenced. At this
stage, the need for automation was recognized and methods were applied to the degree possible.
By the year 1997, the yeast genome consisting of 12 Mega base pairs was completed, and in 1998, the
conclusion of the 100 Mega base pairs nematode genome project was announced. Most recently,
the 180 Mega base pairs fruit-fly genome was also completed. All of these projects relied on
substantially higher levels of software automation. We are now in the midst of the most
ambitious project so far: sequencing of the 3 Giga base pairs Human Genome. For this effort,
and those yet to come, software automation lies at the very core of the planning and execution of
the project.
The need for automation is driven largely by the trend of handling ever larger sizes of
DNA and the corresponding increase in the amount of raw data this entails. Mathematical
analysis indicates that the size of a project is roughly proportional to the size of the genome. This
is because the amount of information obtained from an individual sequencing
experiment is relatively constant and is independent of the genome size. It is estimated that for
the human genome, on the order of 10^8 individual experiments are required to cover the genome.
To meet the projected goals, modern large-scale sequencing centers have developed throughput
capacities of the order of several million experiments per month, with data processing handled
on a continuous basis. Managing such large projects without a high degree of automation would
clearly be impossible in terms of cost and time requirements.
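
The proportionality between genome size and project size can be sketched numerically. The read length (500 bases per experiment) and the tenfold redundancy used below are illustrative assumptions, not figures from this report; only the genome sizes are taken from the text above.

```python
import math


def experiments_needed(genome_size_bp, read_length_bp, coverage):
    """Estimate the number of sequencing experiments (reads) for a genome.

    Each experiment yields a roughly constant number of bases, so the count
    grows in proportion to the genome size.
    """
    return math.ceil(coverage * genome_size_bp / read_length_bp)


READ_LENGTH = 500   # assumed bases obtained per experiment
COVERAGE = 10       # assumed redundancy (each base read ten times on average)

for name, size in [("yeast", 12_000_000),
                   ("nematode", 100_000_000),
                   ("human", 3_000_000_000)]:
    print(name, experiments_needed(size, READ_LENGTH, COVERAGE))
```

Under these assumptions the human genome needs tens of millions of experiments, which is within an order of magnitude of the estimate quoted above, and 250 times as many as yeast, matching the ratio of the genome sizes.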

So, DNA is the basic genetic material. It transmits hereditary characters from one
generation to the next. During protein synthesis, mRNA molecules, which act as messengers of
information (the exact genetic code), are built from DNA. Proteins are then synthesized using
these mRNA molecules. Protein interactions give rise to information pathways and networks, which
help in building cells that are identical to their parent cells. A cluster of many cells in a
predefined arrangement forms a tissue. An organ is a combination of tissues, and an organism is
an organization of organs. Refer to figure 3.1.
The challenge for computer professionals is to create tools that can capture and integrate
these different levels of biological information.

3.2 Genetic Algorithms

All that computers can do is implement algorithms. Hence, when we talk of using
computers for processing biological information, we have to define precise mathematical
algorithms. The following are a few of the most basic algorithms in Bioinformatics.

3.2.1 Database Searching


Database interrogation can take the form of text queries (e.g. "Display all the human
adrenergic receptors") or sequence similarity searches (e.g. "Given the sequence of a human
adrenergic receptor, display all the similar sequences in the database"). Sequence similarity
searches are straightforward because the data in the databases is mostly in the form of sequences.
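
The two kinds of query can be illustrated on a toy in-memory database. Everything in this sketch is hypothetical: the record identifiers, the descriptions and the short sequences are made-up placeholders, and counting shared three-letter fragments is only one simple stand-in for the similarity measures that real search tools use.

```python
# Toy database: record id -> (description, sequence). All entries are made up.
DATABASE = {
    "r1": ("human adrenergic receptor (toy entry)", "MKTAYIAKQR"),
    "r2": ("mouse adrenergic receptor (toy entry)", "MKTAYLAKQR"),
    "r3": ("human insulin (toy entry)", "GLVDQSSTWP"),
}


def text_query(database, query):
    """Return ids of records whose description contains every query word."""
    words = query.lower().split()
    return sorted(
        record_id
        for record_id, (description, _) in database.items()
        if all(word in description.lower() for word in words)
    )


def fragments(sequence, k=3):
    """Return the set of all overlapping fragments of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def similarity_search(database, query_sequence, k=3):
    """Rank records by the number of length-k fragments shared with the query."""
    query_fragments = fragments(query_sequence, k)
    hits = [
        (record_id, len(query_fragments & fragments(sequence, k)))
        for record_id, (_, sequence) in database.items()
    ]
    hits = [hit for hit in hits if hit[1] > 0]          # drop unrelated records
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


print(text_query(DATABASE, "human adrenergic"))
print(similarity_search(DATABASE, "MKTAYIAKQR"))
```

The text query matches on the annotation, whereas the similarity search ignores the annotation entirely and ranks records by their sequence content.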

3.2.2 Comparing Two Sequences


Let us take the case of comparing two protein sequences. The alphabet complexity is 20,
since a protein is a sequence of amino acids and there are 20 possible amino acids.
The naive approach is to line up the sequences against each other and insert additional
characters (gaps) to bring the two strings into vertical alignment. The more matches there are,
the closer the two sequences.
The quality of an alignment can be measured in terms of the number of gaps introduced and
the number of mismatches remaining in the alignment. A metric relating such parameters
represents the distance between two sequences.
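
One common way to turn this idea into an algorithm is dynamic programming over all prefixes of the two sequences, in the style of the Needleman-Wunsch global alignment method. The sketch below is an illustration under simple assumptions, not the method of any particular tool: every mismatch and every gap costs 1 by default, so the distance is the number of mismatches plus the number of gap characters. The function name `align` and its parameters are hypothetical.

```python
def align(a, b, mismatch=1, gap=1):
    """Globally align a and b; return (distance, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # dist[i][j] = minimum cost of aligning a[:i] with b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap
    for j in range(1, m + 1):
        dist[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            dist[i][j] = min(substitution,
                             dist[i - 1][j] + gap,    # gap in b
                             dist[i][j - 1] + gap)    # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        diagonal_cost = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + diagonal_cost:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return dist[n][m], "".join(reversed(top)), "".join(reversed(bottom))


distance, row_a, row_b = align("ACGT", "AGT")
print(distance)
print(row_a)
print(row_b)
```

Real protein comparisons replace the flat mismatch cost with a substitution matrix that scores each amino acid pair differently, but the table-filling structure stays the same.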

3.2.3 Multiple Sequence Alignment


In the previous subsection, we saw pairwise sequence alignment, which is fundamental
to sequence analysis. However, analysis of groups of sequences that form gene families requires
the ability to make connections between more than two members of the group, in order to reveal
subtle conserved family characteristics. The goal of multiple sequence alignment is to generate a
concise information-rich summary of sequence data in order to inform decision-making on the
relatedness of sequences to a gene family.

Multiple sequence alignment is a 2D table, in which the rows represent individual
sequences and the columns the residue positions. The sequences are laid onto this grid in such a
manner that (a) the relative positioning of residues within any one sequence is preserved, and (b)
similar residues in all the sequences are brought into vertical register.
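
The 2D table described above maps naturally onto a list of equal-length strings, with "-" marking a gap. The sketch below summarizes each column by its most common residue and the fraction of rows that carry it; the example rows are made-up, and the function name `column_summary` is hypothetical.

```python
from collections import Counter


def column_summary(rows):
    """For each column, return (most common residue, fraction of rows with it)."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows of an alignment must have the same length")
    summary = []
    for column in zip(*rows):                       # walk the table column by column
        counts = Counter(residue for residue in column if residue != "-")
        if counts:
            residue, count = counts.most_common(1)[0]
        else:
            residue, count = "-", 0                 # column consisting only of gaps
        summary.append((residue, count / len(rows)))
    return summary


# Rows are sequences, columns are residue positions (made-up example data).
alignment = ["ACG-T",
             "ACGAT",
             "A-GAT"]
summary = column_summary(alignment)
consensus = "".join(residue for residue, _ in summary)
conserved = [i for i, (_, fraction) in enumerate(summary) if fraction == 1.0]
print(consensus, conserved)
```

Columns in which every row carries the same residue are the fully conserved positions, which is the kind of family characteristic a multiple alignment is meant to reveal.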

3.3 Genome Projects

In the mid-1980s, the United States Department of Energy initiated a number of projects to
construct detailed genetic and physical maps of the human genome, to determine its complete
nucleotide sequence, and to localize its estimated 100,000 genes. Work on this scale required the
development of new computational methods for analysing genetic map and DNA sequence data,
and demanded the design of new techniques and instrumentation for detecting and analysing
DNA. To benefit the public most effectively, the projects also necessitated the use of advanced
means of information dissemination in order to make the results available as rapidly as possible
to scientists and physicians. The international effort arising from this vast initiative became
known as the Human Genome Project.
Similar research efforts were also launched to map and sequence the genomes of a variety
of organisms used extensively in research labs as model systems. By April 1998, although the
sequencing projects of only a small number of relatively small genomes had been completed, and
the human genome was not expected to be complete until after the year 2003, the results of such
projects were already beginning to pour into the public sequence databases in overwhelming
numbers.

3.4 Goals for Advancements in Sequencing Technology


DNA sequencing technology has improved dramatically since the genome projects began.
The amount of sequence produced each year is increasing steadily; individual centers are now
producing tens of millions of base pairs of sequence annually. In the future, de novo sequencing
of additional genomes, comparative sequencing of closely related genomes, and sequencing to
assess variation within genomes will become increasingly indispensable tools for biological and
medical research. Much more efficient sequencing technology will be needed than is currently
available. The incremental improvements made to date have not yet resulted in any fundamental
paradigm shifts. Nevertheless, the current state-of-the-art technology can still be significantly
improved, and resources should be invested to accomplish this. Beyond that, research must be
supported on new technologies that will make even higher throughput DNA sequencing efficient,
accurate, and cost-effective, thus providing the foundation for other advanced genomic analysis
tools. Progress must be achieved in three areas:
a) Continue to increase the throughput and reduce the cost of current sequencing
technology.
Increased automation, miniaturization, and integration of the approaches currently in use,
together with incremental, evolutionary improvements in all steps of the sequencing process,
are needed to yield further increases in throughput (to at least 500 Mb of finished sequence
per year by 2003) and reductions in cost. At least a twofold cost reduction from current levels
(which average $0.50 per base for finished sequence in large-scale centers) should be
achieved in the next 5 years. Production of the working draft of the human sequence will cost
considerably less per base pair.
b) Support research on novel technologies that can lead to significant improvements in
sequencing technology.
New conceptual approaches to DNA sequencing must be supported to attain substantial
improvements over the current sequencing paradigm. For example, microelectromechanical
systems (MEMS) may allow significant reduction of reagent use, increase in assay speed,
and true integration of sequencing functions. Rapid mass spectrometric analysis methods are
achieving impressive results in DNA fragment identification and offer the potential for very
rapid DNA sequencing. Other more revolutionary approaches, such as single-molecule
sequencing methods, must be explored as well. Significant investment in interdisciplinary
research in instrumentation, combining chemistry, physics, biology, computer science, and
engineering, will be required to meet this goal. Funding of far-sighted projects that may
require 5 to 10 years to reach fruition will be essential. Ultimately, technologies that could,
for example, sequence one vertebrate genome per year at affordable cost are highly desirable.
c) Develop effective methods for the advanced development and introduction of new
sequencing technologies into the sequencing process.
As the scale of sequencing increases, the introduction of improvements into the production
stream becomes more challenging and costly. New technology must therefore be robust and
be carefully evaluated and validated in a high-throughput environment before its
implementation in a production setting. A strong commitment from both the technology
developers and the technology users is essential in this process. It must be recognized that the
advanced development process will often require significantly more funds than proof-of-principle
studies. Targeted funding allocations and dedicated review mechanisms are needed
for advanced technology development.
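The targets in goal (a) can be turned into rough arithmetic. The Python sketch below uses only figures quoted in this report ($0.50 per finished base, a twofold cost reduction, 500 Mb of finished sequence per year, and a human genome of about 3 billion base pairs). The function and constant names are illustrative, and the result is a back-of-the-envelope estimate rather than a budget.

```python
# Back-of-the-envelope estimate built only from figures quoted in the text.
GENOME_SIZE_BP = 3_000_000_000       # about 3 billion base pairs (human genome)
COST_PER_BASE_NOW = 0.50             # US dollars per finished base
TARGET_THROUGHPUT_BP = 500_000_000   # 500 Mb of finished sequence per year


def sequencing_cost(bases: int, cost_per_base: float) -> float:
    """Total cost in dollars of finishing `bases` at a flat per-base price."""
    return bases * cost_per_base


def years_to_sequence(bases: int, throughput_per_year: int) -> float:
    """Years needed to finish `bases` at a constant yearly throughput."""
    return bases / throughput_per_year


if __name__ == "__main__":
    cost_now = sequencing_cost(GENOME_SIZE_BP, COST_PER_BASE_NOW)
    cost_target = sequencing_cost(GENOME_SIZE_BP, COST_PER_BASE_NOW / 2)
    years = years_to_sequence(GENOME_SIZE_BP, TARGET_THROUGHPUT_BP)
    print(f"One genome at the current cost: ${cost_now:,.0f}")
    print(f"One genome after a twofold reduction: ${cost_target:,.0f}")
    print(f"Years per genome at 500 Mb/year: {years:.1f}")
```

Under these assumptions a single human-sized genome would take about six years at the 500 Mb target rate, which suggests why the aspiration in goal (b) of one vertebrate genome per year depends on novel technology and not only on incremental improvement.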

3.5

Developing Technology to handle Sequence Variations

Natural sequence variation is a fundamental property of all genomes. Any two haploid
human genomes show multiple sites and types of polymorphism. Some of these have functional
implications, whereas many probably do not. The most common polymorphisms in the human
genome are single base-pair differences, also called single-nucleotide polymorphisms (SNPs).
When two haploid genomes are compared, SNPs occur every kilobase, on average. Other kinds
of sequence variation, such as copy number changes, insertions, deletions, duplications, and
rearrangements also exist, but at lower frequency, and their distribution is poorly understood. Basic
information about the types, frequencies, and distribution of polymorphisms in the human
genome and in human populations is critical for progress in human genetics. Better high-throughput
methods for using such information in the study of human disease are also needed.
SNPs are abundant, stable, widely distributed across the genome, and lend themselves to
automated analysis on a very large scale, for example, with DNA array technologies. Because of
these properties, SNPs will be a boon for mapping complex traits such as cancer, diabetes, and
mental illness. Dense maps of SNPs will make possible genome-wide association studies, which
are a powerful method for identifying genes that make a small contribution to disease risk. In
some instances, such maps will also permit prediction of individual differences in drug response.
Publicly available maps of large numbers of SNPs distributed across the whole genome, together
with technology for rapid, large-scale identification and scoring of SNPs, must be developed to
facilitate this research.
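As a software illustration of what "identifying SNPs" means, the following minimal Python sketch compares two haploid sequences base by base and reports every single-base difference. It assumes the two sequences are already aligned and of equal length, so insertions, deletions, and rearrangements are out of scope. The example sequences and function names are invented for illustration and are not real genomic data.

```python
from typing import List, Tuple


def find_snps(seq_a: str, seq_b: str) -> List[Tuple[int, str, str]]:
    """Return (position, base_a, base_b) for every single-base mismatch.

    Assumes the sequences are pre-aligned and of equal length.
    Positions are 0-based.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()))
        if a != b
    ]


def snps_per_kilobase(seq_a: str, seq_b: str) -> float:
    """SNP density expressed per 1,000 compared bases."""
    if not seq_a:
        raise ValueError("sequences must not be empty")
    return 1000 * len(find_snps(seq_a, seq_b)) / len(seq_a)


if __name__ == "__main__":
    # Two made-up 20-base haplotypes that differ at one position.
    hap1 = "ATGGCCTTAGCTAGGCTAAC"
    hap2 = "ATGGCCTTAGTTAGGCTAAC"
    print(find_snps(hap1, hap2))          # [(10, 'C', 'T')]
    print(snps_per_kilobase(hap1, hap2))  # 50.0
```

The toy density of 50 per kilobase is far higher than the roughly one SNP per kilobase quoted above; real comparisons run over millions of bases and need alignment tools before a position-by-position comparison like this one is meaningful.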
a) Develop technologies for rapid, large-scale identification of SNPs and other DNA
sequence variants. The study of sequence variation requires efficient technologies that can be
used on a large scale and that can accomplish one or more of the following tasks: rapid
identification of many thousands of new SNPs in large numbers of samples, and rapid, efficient
scoring of large numbers of samples for the presence or absence of already known SNPs. Although the
immediate emphasis is on SNPs, ultimately technologies that can be applied to polymorphisms of
any type must be developed. Technologies are also needed that can rapidly compare, by large-scale
identification of similarities and differences, the DNA of a species that is closely related to
one whose DNA has already been sequenced. The technologies that are developed should be
cost-effective and broadly accessible.
b) Identify common variants in the coding regions of the majority of identified genes
Initially, association studies involving complex diseases will likely test a large series of
candidate genes; eventually, sequences in all genes may be systematically tested. SNPs in coding
sequences (also known as cSNPs) and the associated regulatory regions will be immediately
useful as specific markers for disease. An effort should be made to identify such SNPs as soon as
possible. Ultimately, a catalog of all common variants in all genes will be desirable. This should
be cross-referenced with cDNA sequence data.
c) Create an SNP map of at least 100,000 markers. A publicly available SNP map of sufficient
density and informativeness to allow effective mapping in any population is the ultimate goal. A
map of 100,000 SNPs (one SNP per 30,000 nucleotides) is likely to be sufficient for studies in
some relatively homogeneous populations, while denser maps may be required for studies in
large, heterogeneous populations. Thus, during this 5-year period, the HGP authorities have
planned to create a map of at least 100,000 SNPs. If technological advances permit, a map of
greater density is desirable. Research should be initiated to estimate the number of SNPs needed
in different populations.
d) Develop the intellectual foundations for studies of sequence variation. The methods and
concepts developed for the study of single-gene disorders are not sufficient for the study of
complex, multigene traits. The study of the relationship between human DNA sequence variation,
phenotypic variation, and complex diseases depends critically on better methods. Effective
research design and analysis of linkage, linkage disequilibrium, and association data are areas
that need new insights. Questions such as which study designs are appropriate to which specific
populations, and with which population genetics characteristics, must be answered. Appropriate
statistical and computational tools and rigorous criteria for establishing and confirming
associations must also be developed.
e) Create public resources of DNA samples and cell lines. To facilitate SNP discovery it is
critical that common public resources of DNA samples and cell lines be made available as
rapidly as possible. To maximize discovery of common variants in all human populations, a
resource is needed that includes individuals whose ancestors derive from diverse geographic
areas. It should encompass as much of the diversity found in the human population as possible.
Samples in this initial public repository should be totally anonymous to avoid concerns that arise
with linked or identifiable samples.
DNA samples linked to phenotypic data and identified as to their geographic and other origins
will be needed to allow studies of the frequency and distribution of DNA polymorphisms in
specific populations and their relevance to disease. However, such collections raise many ethical,
legal, and social concerns that must be addressed. Credible scientific strategies must be
developed before creating these resources.
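Two of the goals above lend themselves to a short worked example. The Python sketch below first checks the marker spacing quoted in goal (c), then computes a basic allelic association statistic of the kind goal (d) alludes to. The Pearson chi-square test on a 2x2 table of allele counts is a standard choice that this report does not itself prescribe, and the case/control counts are invented purely for illustration.

```python
def snp_spacing(genome_size_bp: int, n_markers: int) -> float:
    """Average distance in base pairs between evenly spread markers."""
    return genome_size_bp / n_markers


def allelic_chi_square(case_a: int, case_b: int, ctrl_a: int, ctrl_b: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 table.

    Rows are cases and controls; columns are counts of allele A and allele B.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    if 0 in (row_case, row_ctrl, col_a, col_b):
        raise ValueError("every row and column must have a non-zero total")
    # Shortcut for a 2x2 table: n * (ad - bc)^2 / (product of the margins)
    return n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (
        row_case * row_ctrl * col_a * col_b
    )


if __name__ == "__main__":
    # 3 billion base pairs covered by 100,000 markers.
    print(snp_spacing(3_000_000_000, 100_000))   # 30000.0

    # Hypothetical allele counts: 60/40 in cases versus 40/60 in controls.
    print(allelic_chi_square(60, 40, 40, 60))    # 8.0
```

A statistic of 8.0 exceeds 3.84, the usual 5% threshold for one degree of freedom, so these made-up counts would suggest an association at a single marker. A genome-wide study tests many thousands of markers, however, so far stricter criteria are needed, which is part of what goal (d) asks for.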

3.6

Need of Technology in Functional Genomics


Functional genomics is the interpretation of the function of DNA sequence on a genomic
scale. Already, the availability of the sequence of entire organisms has demonstrated that many
genes and other functional elements of the genome are discovered only when the full DNA
sequence is known. Such discoveries will accelerate as sequence data accumulate. However,
knowing the structure of a gene or other element is only part of the answer. The next step is to
elucidate function, which results from the interaction of genomes with their environment. Current
methods for studying DNA function on a genomic scale include comparison and analysis of
sequence patterns directly to infer function, large-scale analysis of the messenger RNA and
protein products of genes, and various approaches to gene disruption. In the future, a host of
novel strategies will be needed for elucidating genomic function. This will be a challenge for all
of biology. The HGP will be contributing to this area by emphasizing the development of
technology that can be used on a large scale, is efficient, and is capable of generating complete
data for the genome as a whole. To the extent that available resources allow, expansion of current
approaches as well as innovative technology ideas should be supported in the areas described
below.
a) Develop cDNA resources. Complete sets of full-length cDNA clones and sequences for both
humans and model organisms would be enormously useful for biologists and are urgently
needed. Such resources would help in both gene discovery and functional analysis. High priority
should be placed on developing technology for obtaining full-length cDNAs. Complete and
validated inventories of full-length cDNA clones and corresponding sequences should be
generated and made available to the community once such technology is at hand.

b) Improved technologies are needed for global approaches to the study of non-protein-coding
sequences, including production of relevant libraries, comparative sequencing, and computational
analysis.
c) Develop technology for comprehensive analysis of gene expression. Information about the
spatial and temporal patterns of gene expression in both humans and model organisms offers one
key to understanding gene function. Efficient and cost-effective technology needs to be
developed to measure various parameters of gene expression reliably and reproducibly.
Complementary DNA sequences and validated sets of clones with unique identifiers will be
needed for array technologies, large-scale in situ hybridization, and other strategies for measuring
gene expression. Improved methods for quantifying, representing, analyzing, and archiving
expression data should also be developed.
d) Improve methods for genome-wide mutagenesis. Creating mutations that cause loss or
alteration of function is another prime approach to studying gene function. Technologies, both
gene- and phenotype-based, which can be used on a large scale in vivo or in vitro, are needed for
generating or finding such mutations in all genes. Such technologies should be piloted in
appropriate model systems, including both cell culture and whole organisms.
e) Develop technology for global protein analysis. A full understanding of genome function
requires an understanding of protein function on a genome-wide basis. Development of
experimental and computational methods to study global spatial and temporal patterns of protein
expression, protein-ligand interactions, and protein modification needs to be supported.
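Goal (c) calls for better methods of quantifying and analyzing expression data. One simple and widely used summary, not specified in this report and offered here only as an assumption, is the log2 fold change of each gene between two conditions. The sketch below uses hypothetical gene names and expression values.

```python
import math
from typing import Dict


def log2_fold_change(
    control: Dict[str, float],
    treated: Dict[str, float],
    pseudocount: float = 1.0,
) -> Dict[str, float]:
    """Return log2(treated / control) for genes present in both inputs.

    A pseudocount is added to both values so that unexpressed genes
    (value 0) do not cause a division by zero.
    """
    return {
        gene: math.log2(
            (treated[gene] + pseudocount) / (control[gene] + pseudocount)
        )
        for gene in control
        if gene in treated
    }


if __name__ == "__main__":
    # Hypothetical expression measurements for three made-up genes.
    control = {"geneA": 99.0, "geneB": 49.0, "geneC": 0.0}
    treated = {"geneA": 399.0, "geneB": 49.0, "geneC": 7.0}
    for gene, lfc in log2_fold_change(control, treated).items():
        print(f"{gene}: {lfc:+.1f}")
```

A value of +2.0 means a fourfold increase and 0.0 means no change. Real expression studies also need normalization and replicate-based statistics, which this sketch leaves out.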

3.7

Bioinformatics and Computational Biology


Bioinformatics support is essential to the implementation of genome projects and for
public access to their output. Bioinformatics needs for the genome project fall into two broad
areas: (i) databases and (ii) development of analytical tools. Collection, analysis, annotation, and
storage of the ever increasing amounts of mapping, sequencing, and expression data in publicly
accessible, user-friendly databases is critical to the project's success. In addition, the community
needs computational methods that will allow scientists to extract, view, annotate, and analyze
genomic information efficiently. Thus, the genome project must continue to invest substantially
in these areas. Conservation of resources through development of portable software should be
encouraged.
a) Improve content and utility of databases. Databases are the ultimate repository of genome
project data. As new kinds of data are generated and new biological relationships discovered,
databases must provide for continuous and rapid expansion and adaptation to the evolving needs
of the scientific community. To encourage broad use, databases should be responsive to a diverse
range of users with respect to data display, data deposition, data access, and data analysis.
Databases should be structured to allow the queries of greatest interest to the community to be
answered in a seamless way. Communication among databases must be improved. Achieving this
will require standardization of nomenclature. A database of human genomic information,
analogous to the model organism databases and including links to many types of phenotypic
information, is needed.
b) Develop better tools for data generation, capture, and annotation. Large-scale, high-throughput
genomics centers need readily available, transportable informatics tools for
commonly performed tasks such as sample tracking, process management, map generation,
sequence finishing, and primary annotation of data. Smaller users urgently need reliable tools to
meet their sequencing and sequence analysis needs. Readily accessible information about the
availability and utility of various tools should be provided, as well as training in the use of tools.
c) Develop and improve tools and databases for comprehensive functional studies. Massive
amounts of data on gene expression and function will be generated in the near future. Databases
that can organize and display this data in useful ways need to be developed. New statistical and
mathematical methods are needed for analysis and comparison of expression and function data,
in a variety of cells and tissues, at various times and under different conditions. Also needed are
tools for modeling complex networks and interactions.
d) Develop and improve tools for representing and analyzing sequence similarity and
variation. The study of sequence similarity and variation within and among species will become
an increasingly important approach to biological problems. There will be many forms of
sequence variation, of which SNPs will be only one type. Tools need to be created for capturing,
displaying, and analyzing information about sequence variation.

e) Create mechanisms to support effective approaches for producing robust, exportable
software that can be widely shared. Many useful software products are being developed in both
academia and industry that could be of great benefit to the community. However, these tools
generally are not robust enough to make them easily exportable to another laboratory.
Mechanisms are needed for supporting the validation and development of such tools into
products that can be readily shared and for providing training in the use of these products.
Participation by the private sector is strongly encouraged.
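To make goal (a) concrete for software engineers, here is a minimal sketch of a database in which each gene has one standardized symbol and can be linked to phenotypic information. It uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, gene symbols, and phenotype strings are all hypothetical and do not describe any existing genome database.

```python
import sqlite3

SCHEMA = """
CREATE TABLE gene (
    gene_id   INTEGER PRIMARY KEY,
    symbol    TEXT NOT NULL UNIQUE,   -- one standardized name per gene
    organism  TEXT NOT NULL
);
CREATE TABLE phenotype_link (
    gene_id    INTEGER NOT NULL REFERENCES gene(gene_id),
    phenotype  TEXT NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding a few made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO gene (gene_id, symbol, organism) VALUES (?, ?, ?)",
        [(1, "GENE1", "Homo sapiens"), (2, "GENE2", "Mus musculus")],
    )
    conn.executemany(
        "INSERT INTO phenotype_link (gene_id, phenotype) VALUES (?, ?)",
        [(1, "example phenotype X"), (1, "example phenotype Y")],
    )
    conn.commit()
    return conn


def phenotypes_for(conn: sqlite3.Connection, symbol: str) -> list:
    """Answer the query 'which phenotypes are linked to this gene symbol?'."""
    rows = conn.execute(
        "SELECT p.phenotype FROM phenotype_link AS p "
        "JOIN gene AS g ON g.gene_id = p.gene_id "
        "WHERE g.symbol = ? ORDER BY p.phenotype",
        (symbol,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_demo_db()
    print(phenotypes_for(db, "GENE1"))
    print(phenotypes_for(db, "GENE2"))
```

The UNIQUE constraint on the symbol column is one small way of enforcing the standardized nomenclature that the text identifies as a precondition for communication among databases.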

3.8

Job Opportunities and Job Requirements


The Human Genome Project has created the need for new kinds of scientific specialists
who can be creative at the interface of biology and other disciplines, such as computer science,
engineering, mathematics, physics, chemistry, and the social sciences. As the popularity of
genomic research increases, the demand for these specialists greatly exceeds the supply. In the
past, the genome project has benefited immensely from the talents of non-biological scientists,
and their participation in the future is likely to be even more crucial. There is an urgent need to
train more scientists in interdisciplinary areas that can contribute to genomics. Programs must be
developed that will encourage training of both biological and non-biological scientists for careers
in genomics. Especially critical is the shortage of individuals trained in Bioinformatics. Also
needed are scientists trained in the management skills required to lead large data-production
efforts. Another urgent need is for scholars who are trained to undertake studies on the societal
impact of genetic discoveries. Such scholars should be knowledgeable in both genome-related
sciences and in the social sciences. Ultimately, a stable academic environment for genomic
science must be created so that innovative research can be nurtured and training of new
individuals can be assured. The latter is the responsibility of the academic sector, but funding
agencies can encourage it through their grants programs.

3.9

Training Goals included in the Human Genome Project Plan

a) Nurture the training of scientists skilled in genomics research.


A number of approaches to training for genomics research should be explored. These
include providing fellowship and career awards and encouraging the development of institutional
training programs and curricula. Training that will facilitate collaboration among scientists from
different disciplines, as well as courses that introduce scientists to new technologies or
approaches, should also be included.
b) Encourage the establishment of academic career paths for genomic scientists.
Ultimately, a strong academic presence for genomic science is needed to generate the
training environment that will encourage individuals to enter the field. Currently, the high
demand for genome scientists in industry threatens the retention of genome scientists in
academia. Attractive incentives must be developed to maintain the critical mass essential for
sponsoring the training of the next generation of genome scientists.
c) Increase the number of scholars who are knowledgeable in both genomic and genetic
sciences and in ethics, law, or the social sciences.
As the pace of genetic discoveries increases, the need for individuals who have the necessary
training to study the social impact of these discoveries also increases. The ELSI program should
expand its efforts to provide postdoctoral and senior fellowship opportunities for cross-training.
Such opportunities should be provided both to scientists and health professionals who wish to
obtain training in the social sciences and humanities and to scholars trained in law, the social
sciences, or the humanities who wish to obtain training in genomic or genetic sciences.

Chapter 4

Human Genome Project


4.1

Introduction

Begun in 1990, the U.S. Human Genome Project is a 13-year effort coordinated by the
U.S. Department of Energy and the National Institutes of Health. The project originally was planned to
last 15 years, but effective use of resources and technological advances have accelerated the expected
completion date to 2003. Project goals are to

identify all the approximately 30,000 genes in human DNA,

determine the sequences of the 3 billion chemical base pairs that make up human DNA,

store this information in databases,

improve tools for data analysis,

transfer related technologies to the private sector, and

address the ethical, legal, and social issues that may arise from the project.

4.2

Details of the Human Genome Project


The Human Genome Project (HGP) is fulfilling its promise as the single most important
project in biology and the biomedical sciences--one that will permanently change biology and
medicine. With the recent completion of the genome sequences of several microorganisms,
including Escherichia coli and Saccharomyces cerevisiae, and the imminent completion of the
sequence of the metazoan Caenorhabditis elegans, the door has opened wide on the era of whole
genome science. The ability to analyze entire genomes is accelerating gene discovery and
revolutionizing the breadth and depth of biological questions that can be addressed in model
organisms. These exciting successes confirm the view that acquisition of a comprehensive, high-quality
human genome sequence will have unprecedented impact and long-lasting value for basic
biology, biomedical research, biotechnology, and health care. The transition to sequence-based
biology will spur continued progress in understanding gene-environment interactions and in
development of highly accurate DNA-based medical diagnostics and therapeutics.
Human DNA sequencing, the flagship endeavor of the HGP, is entering its decisive
phase. It will be the project's central focus during the next 5 years. While partial subsets of the
DNA sequence, such as expressed sequence tags (ESTs), have proven enormously valuable,
experience with simpler organisms confirms that there can be no substitute for the complete
genome sequence. In order to move vigorously toward this goal, the crucial task ahead is building
sustainable capacity for producing publicly available DNA sequence. The full and incisive use of
the human sequence, including comparisons to other vertebrate genomes, will require further
increases in sustainable capacity at high accuracy and lower costs. Thus, a high-priority
commitment to develop and deploy new and improved sequencing technologies must also be
made.
Availability of the human genome sequence presents unique scientific opportunities, chief
among them the study of natural genetic variation in humans. Genetic or DNA sequence variation
is the fundamental raw material for evolution. Importantly, it is also the basis for variations in
risk among individuals for numerous medically important, genetically complex human diseases.
An understanding of the relationship between genetic variation and disease risk promises to
change significantly the future prevention and treatment of illness. The new focus on genetic
variation, as well as other applications of the human genome sequence, raises additional ethical,
legal, and social issues that need to be anticipated, considered, and resolved.
The HGP has made genome research a central underpinning of biomedical research. It is
essential that it continue to play a lead role in catalyzing large-scale studies of the structure and
function of genes, particularly in functional analysis of the genome as a whole. However, full
implementation of such methods is a much broader challenge and will ultimately be the
responsibility of the entire biomedical research and funding communities.
Success of the HGP critically depends on Bioinformatics and computational biology as
well as training of scientists to be skilled in the genome sciences. The project must continue a
strong commitment to support of these areas.
As intended, the HGP has become a truly international effort to understand the structure
and function of the human genome. Many countries are participating according to their specific
interests and capabilities. Coordination is informal and generally effected at the scientist-to-scientist
level. The U.S. component of the project is sponsored by the National Human Genome
Research Institute at the National Institutes of Health (NIH) and the Office of Biological and
Environmental Research at the Department of Energy (DOE). The HGP has benefited greatly
from the contributions of its international partners. The private sector has also provided critical
assistance. These collaborations will continue, and many will expand. Both NIH and DOE
welcome participation of all interested parties in the accomplishment of the HGP's ultimate
purpose, which is to develop and make publicly available to the international community the
genomic resources that will expedite research to improve the lives of all people.

4.3

U.S. Human Genome Project 5-Year Goals 1998-2003

4.3.1 Human DNA Sequencing


Providing a complete, high-quality sequence of human genomic DNA to the research
community as a publicly available resource continues to be the HGP's highest priority goal. The
enormous value of the human genome sequence to scientists, and the considerable savings in
research costs its widespread availability will allow, are compelling arguments for advancing the
timetable for completion. Recent technological developments and experience with large-scale
sequencing provide increasing confidence that it will be possible to complete an accurate, high-quality
sequence of the human genome by the end of 2003, 2 years sooner than previously
predicted. NIH and DOE expect to contribute 60 to 70% of this sequence, with the remainder
coming from the effort at the Sanger Center and other international partners.
This is a highly ambitious goal, given that only about 6% of the human genome sequence
has been completed thus far. Sequence completion by the end of 2003 is a major challenge, but
within reach and well worth the risks and effort. Realizing the goal will require an intense and
dedicated effort and a continuation and expansion of the collaborative spirit of the international
sequencing community. Only sequence of high accuracy and long-range contiguity will allow a
full interpretation of all the information encoded in the human genome.

Availability of the human sequence will not end the need for large-scale sequencing. Full
interpretation of that sequence will require much more sequence information from many other
organisms, as well as information about sequence variation in humans. Thus, the development of
sustainable, long-term sequencing capacity is a critical objective of the HGP. Achieving the goals
below will require a capacity of at least 500 megabases (Mb) of finished sequence per year by the
end of 2003.
a) Finish the complete human genome sequence by the end of 2003.
To best meet the needs of the scientific community, the finished human DNA sequence
must be a faithful representation of the genome, with high base-pair accuracy and long-range
contiguity. Specific quality standards that balance cost and utility have already been established.
These quality standards should be reexamined periodically; as experience in using sequence data
is gained, the appropriate standards for sequence quality may change. One of the most important
uses for the human sequence will be comparison with other human and nonhuman sequences.
The sequence differences identified in such comparisons should, in nearly all cases, reflect real
biological differences rather than errors or incomplete sequence. Consequently, the current
standard for accuracy--an error rate of no more than 1 base in 10,000--remains appropriate.
The current public sequencing strategy is based on mapped clones and occurs in two
phases. The first, or "shotgun" phase, involves random determination of most of the sequence
from a mapped clone of interest. Methods for doing this are now highly automated and efficient.
Mapped shotgun data are assembled into a product ("working draft" sequence) that covers most
of the region of interest but may still contain gaps and ambiguities. In the second, finishing
phase, the gaps are filled and discrepancies resolved. At present, the finishing phase is more labor
intensive than the shotgun phase. Already, partially finished, working-draft sequence is
accumulating in public databases at about twice the rate of finished sequence.
b) Make the sequence totally and freely accessible.
The HGP was initiated because its proponents believed the human sequence is such a
precious scientific resource that it must be made totally and publicly available to all who want to
use it. Only the wide availability of this unique resource will maximally stimulate the research
that will eventually improve human health.
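The two-phase strategy described under goal (a) can be illustrated with a toy simulation. The Python sketch below places fixed-length reads at random positions along a clone and then lists the stretches that no read covers, which is one reason a "working draft" still contains gaps that the finishing phase must close. This is a deliberate simplification: real assembly infers read positions from sequence overlaps, whereas here positions are simply drawn at random. All names and parameter values are assumptions chosen for illustration.

```python
import random
from typing import List, Tuple


def simulate_shotgun(
    region_len: int, read_len: int, n_reads: int, seed: int = 0
) -> List[bool]:
    """Place reads at random start positions; return per-base coverage flags."""
    if not 0 < read_len <= region_len:
        raise ValueError("read_len must be between 1 and region_len")
    rng = random.Random(seed)
    covered = [False] * region_len
    for _ in range(n_reads):
        start = rng.randrange(region_len - read_len + 1)
        for i in range(start, start + read_len):
            covered[i] = True
    return covered


def find_gaps(covered: List[bool]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of bases that no read covers."""
    gaps = []
    gap_start = None
    for i, flag in enumerate(covered):
        if not flag and gap_start is None:
            gap_start = i                      # a gap opens here
        elif flag and gap_start is not None:
            gaps.append((gap_start, i))        # the gap closes just before i
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, len(covered)))
    return gaps


if __name__ == "__main__":
    # A 100 kb clone sampled with 1,000 reads of 500 bases (5-fold on average).
    coverage = simulate_shotgun(region_len=100_000, read_len=500, n_reads=1_000)
    gaps = find_gaps(coverage)
    print(f"fraction covered: {sum(coverage) / len(coverage):.3f}")
    print(f"gaps left for the finishing phase: {len(gaps)}")
```

Raising n_reads shrinks the gaps but with diminishing returns, which is in keeping with the observation above that finishing is more labor-intensive than the shotgun phase.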

4.3.2 Sequencing Technology


Create a long-term, sustainable sequencing capacity by improving current technology and
developing highly efficient novel technologies. Achieving this HGP goal will require current
sequencing capacity to be expanded 2-3 times, demanding further incremental advances in
standard sequencing technologies and improvements in efficiency and cost. For future
sequencing applications, planners emphasize the importance of supporting novel technologies
that may be 5-10 years in development.

4.3.3 Sequence Variation


Develop technologies for rapid identification of DNA sequence variants. A new priority
for the HGP is examining regions of natural variation that occur among genomes (except those of
identical twins). Goals specify development of methods to detect different types of variation,
particularly the most common type called single nucleotide polymorphisms (SNPs) that occur
about once every 1000 bases. Scientists believe SNP maps will help them identify genes
associated with complex diseases such as cancer, diabetes, vascular disease, and some forms of
mental illness. These associations are difficult to make using conventional gene hunting methods
because any individual gene may make only a small contribution to disease risk. DNA sequence
variations also underlie many individual differences in responses to the environment and
treatments.

4.3.4 Functional Genomics


Expand support for current approaches and innovative technologies. Efficient
interpretation of the functions of human genes and other DNA sequences requires developing the
resources and strategies to enable large-scale investigations across whole genomes. A technically
challenging first priority is to generate complete sets of full-length cDNA clones and sequences
for human and model organism genes. Other functional genomics goals include studies into gene
expression and control, creation of mutations that cause loss or alteration of function in
nonhuman organisms, and development of experimental and computational methods for protein
analyses.

4.3.5 Comparative Genomics


Obtain complete genomic sequences for C. elegans (1998), Drosophila (2002), and
mouse (2008). A first clue toward identifying and understanding the functions of human genes or
other DNA regions is often obtained by studying their parallels in nonhuman genomes. To enable
efficient comparisons, complete genomic sequences already have been obtained for the
bacterium E. coli and the yeast S. cerevisiae, and work continues on sequencing the genomes of
the roundworm, fruit fly, and mouse. Planners note that other genomes will need to be sequenced
to realize the full promise of comparative genomics, stressing the need to build a sustainable
sequencing capacity.

4.3.6 Ethical, Legal, and Social Implications (ELSI)

Analyze and address implications of identifying DNA sequence information for
individuals, families, and communities.

Facilitate safe and effective integration of genetic technologies.

Facilitate education about genomics in nonclinical and research settings.

Rapid advances in genetics and applications present new and complex ethical and policy
issues for individuals and society. ELSI programs that identify and address these implications
have been an integral part of the US HGP since its inception. These programs have resulted in a
body of work that promotes education and helps guide the conduct of genetic research and the
development of related health professional and public policies. Continuing and new challenges
include safeguarding the privacy of individuals and groups who contribute samples for large-scale
sequence variation studies; anticipating how resulting data may affect concepts of race and
ethnicity; identifying how genetic data could potentially be used in workplaces, schools, and
courts; commercial uses; and the impact of genetic advances on concepts of humanity and
personal responsibility.

4.3.7 Bioinformatics and Computational Biology


Improve current databases and develop new databases and better tools for data generation
and capture and comprehensive functional studies. Continued investment in current and new
databases and analytical tools is critical to the success of the Human Genome Project and to the
future usefulness of the data. Databases must be structured to adapt to the evolving needs of the
scientific community and allow queries to be answered easily. Planners suggest developing a
human genome database analogous to model organism databases with links to phenotypic
information. Also needed are databases and analytical tools for the expanding body of gene
expression and function data, for modeling complex biological networks and interactions, and for
collecting and analyzing sequence variation data.

4.3.8 Training
Nurture the training of genomic scientists and establish career paths.
Increase the number of scholars knowledgeable in genomics and ethics, law, or the social
sciences. Planners note that future genomics scientists will require training in interdisciplinary
areas that include biology, computer science, engineering, mathematics, physics, and chemistry.
Additionally, scientists with management skills will be needed for leading large data-production
efforts.

Chapter 5

Biological Databases

5.1 The Biological Sequence/Structure Deficit

At the beginning of 1998, more than 300,000 protein sequences had been deposited in
publicly available, non-redundant databases, and the number of partial sequences in public
and proprietary expressed sequence tag (EST) databases was estimated to run into millions.
By contrast, the number of unique 3D structures in the Protein Data Bank (PDB) was less than 1500.
Although structural information is far more complex to derive, store and manipulate than
sequence data, these figures nevertheless highlight an enormous information deficit. This
situation is likely to get worse as the genome projects around the world begin to bear fruit. Of
course, the acquisition of structural data is also accelerating, and future large-scale structure
determination efforts could conceivably furnish 2000 3D structures annually. But this is a
small yield by comparison with that of sequence databases, which are doubling in size every
year, with a new sequence being added, on average, once a minute.
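
The scale of this deficit can be illustrated with a little arithmetic. The sketch below is illustrative only: it starts from the figures quoted above (about 300,000 sequences and 1,500 structures in 1998) and assumes, purely for the sake of the example, that sequences keep doubling every year while structures are added at a constant 2,000 per year. The function name and the five-year horizon are my own choices, not part of any published projection.

```python
def project_deficit(years, sequences=300_000, structures=1_500,
                    structures_per_year=2_000):
    """Return a list of (year_offset, sequences, structures, ratio) tuples.

    Assumptions (not measurements): the sequence count doubles every year
    and the structure count grows by a fixed amount every year.
    """
    rows = []
    for year in range(years + 1):
        ratio = sequences / structures
        rows.append((year, sequences, structures, ratio))
        sequences *= 2                      # exponential growth (doubling)
        structures += structures_per_year   # linear growth
    return rows


if __name__ == "__main__":
    for year, seqs, structs, ratio in project_deficit(5):
        print(f"1998+{year}: {seqs:>10,} sequences, "
              f"{structs:>6,} structures, ratio {ratio:,.0f}:1")
```

Under these assumptions the ratio of sequences to structures keeps rising, because exponential growth always outpaces linear growth in the long run.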

5.2 Biological Databases

If we are to derive the maximum benefit from the deluge of sequence information, we
must deal with it in a concerted way; this means establishing, maintaining and disseminating
databases; providing easy to use software to access the information they contain; and designing
state-of-the-art analysis tools to visualize and interpret the structural and functional clues latent
in the data.
The first step, then, in analysing sequence information is to assemble it into central, shareable
resources, i.e. databases. Databases are effectively electronic filing cabinets, a convenient and
efficient method of storing vast amounts of information. There are many different database types,
depending both on the nature of the information being stored and on the manner of data
storage (e.g. whether in flat files, tables in a relational database or objects in an object-oriented
database).
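
To make the three storage styles concrete, here is a minimal sketch in Python. The flat-file record layout is invented for this example (the two-letter line codes only loosely imitate the style of real sequence databases), and the entry names, sequences, class name and table name are all hypothetical. The same two entries are read from a flat file, held as objects, and then loaded into a relational table using the standard `sqlite3` module.

```python
import sqlite3
from dataclasses import dataclass

# Invented flat-file format: "CODE   value" per line, "//" ends a record.
FLAT_FILE = """\
ID   DEMO1
DE   Hypothetical demo protein
SQ   MKTAYIAKQR
//
ID   DEMO2
DE   Another hypothetical protein
SQ   MSLNELLK
//
"""


@dataclass
class Entry:
    """Object-style representation of one database entry."""
    identifier: str
    description: str
    sequence: str


def parse_flat_file(text):
    """Turn the flat-file text into a list of Entry objects."""
    entries, fields = [], {}
    for line in text.splitlines():
        if line.startswith("//"):           # end of the current record
            entries.append(Entry(fields["ID"], fields["DE"], fields["SQ"]))
            fields = {}
        elif line.strip():
            code, value = line.split(maxsplit=1)
            fields[code] = value
    return entries


def load_into_table(entries):
    """Store the entries as rows of a relational table (in memory)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO entry VALUES (?, ?, ?)",
        [(e.identifier, e.description, e.sequence) for e in entries],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = load_into_table(parse_flat_file(FLAT_FILE))
    # A query that would need custom code against the raw flat file:
    print(conn.execute(
        "SELECT id FROM entry WHERE length(sequence) > 8").fetchall())
```

The flat file is easy to exchange and read by eye, whereas the relational form lets questions be asked declaratively. This trade-off is one reason different resources have chosen different storage styles.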
In the context of protein sequence analysis, we will encounter primary, composite and
secondary databases. Such resources store different levels of information in totally different
formats. In the past, this has led to a variety of communication problems, but emerging computer
technologies are beginning to provide solutions, allowing seamless, transparent access to
disparate, distributed data structures over the internet.
Primary and secondary databases are used to address different aspects of sequence
analysis, because they store different levels of protein sequence information.

The primary structure of a protein is its amino acid sequence; these are stored in primary
databases as linear alphabets that denote the constituent residues. The secondary structure of a
protein corresponds to regions of local regularity, which, in sequence alignments, are often
apparent as well conserved motifs; these are stored in secondary databases as patterns. The
tertiary structure of a protein arises from the packing of its secondary structure elements which
may form discrete domains within a fold, or may give rise to autonomous folding units or
modules; complete folds, domains and modules are stored in structure databases as sets of atomic
co-ordinates.

5.3 Primary Sequence Databases

In the early 1980s, sequence information started to become more abundant in the
scientific literature. Realising this, several laboratories saw that there might be advantages to
harvesting and storing these sequences in central repositories. Thus, several primary database
projects began to evolve in different parts of the world.

5.3.1 Nucleic acid Sequence Databases


The principal DNA sequence databases are GenBank (USA), EMBL (Europe) and DDBJ
(Japan), which exchange data on a daily basis to ensure comprehensive coverage at each of the
sites.
EMBL is the nucleotide sequence database from the European Bioinformatics Institute.
The rate of growth of DNA databases has been following an exponential trend, with a doubling
time of less than a year. EMBL data consist predominantly (more than 50%) of sequences from
model organisms.
DNA Data Bank of Japan is produced, distributed and maintained by the National
Institute of Genetics.
GenBank, the DNA database from the National Center for Biotechnology Information,
exchanges data with both EMBL and DDBJ to help ensure comprehensive coverage. The
database is split into 17 smaller discrete divisions.

5.3.2 Protein Sequence Databases


PIR, MIPS, SWISS-PROT, and TrEMBL are the major protein sequence databases.
PIR was developed for investigating evolutionary relations between proteins. In its
current form, the database is split into four distinct sections PIR1-PIR4, which differ in terms of
the quality of data and the level of annotation provided.

MIPS collects and processes sequence data for the tripartite PIR-International Protein
sequence Database Project.
SWISS-PROT is a protein sequence database which endeavors to provide high-level
annotations, including descriptions of the function of the protein, the structure of its
domains, its post-translational modifications and so on.
TrEMBL was created as a supplement to SWISS-PROT. It was designed to address
the need for a well-structured, SWISS-PROT-like resource that would allow very rapid access to
sequence data from the genome projects, without compromising the quality of
SWISS-PROT itself by incorporating sequences with insufficient analysis and annotation.

5.4 Composite Protein Sequence Databases

One solution to the problem of proliferating primary databases is to compile a composite,
i.e. a database that amalgamates a variety of different primary sources. Composite databases
render sequence searching much more efficient, because they obviate the need to interrogate
multiple resources. The interrogation process is streamlined still further if the composite has
been designed to be non-redundant, as this means that the same sequence need not be searched
more than once. The choices of different sources and the application of different redundancy
criteria have led to the emergence of different composites. The major composite databases are
Non-Redundant Database, OWL, MIPSX, and SWISS-PROT+TrEMBL.
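
The idea of a non-redundant composite can be sketched in a few lines of Python. This is an illustration under simple assumptions, not the procedure used by any of the databases named above: the redundancy criterion here is exact sequence identity (ignoring case and surrounding whitespace), sources listed first take priority, and the source names, entry identifiers and sequences are all made up.

```python
def build_composite(sources):
    """Merge several primary sources into one non-redundant collection.

    `sources` maps a source name to a dict of {entry_id: sequence}.
    Returns a dict mapping each distinct (normalized) sequence to the
    (source name, entry id) of the first source that supplied it.
    """
    composite = {}
    for source_name, entries in sources.items():
        for entry_id, sequence in entries.items():
            key = sequence.strip().upper()
            # Keep only the first occurrence of each sequence.
            composite.setdefault(key, (source_name, entry_id))
    return composite


if __name__ == "__main__":
    sources = {  # hypothetical primary sources, in priority order
        "source_a": {"A1": "MKTAYIAK", "A2": "MSLNELLK"},
        "source_b": {"B1": "mktayiak", "B2": "MGGHHW"},
    }
    for sequence, origin in build_composite(sources).items():
        print(sequence, origin)
```

Choosing a different priority order or a looser redundancy criterion (for example, treating near-identical sequences as duplicates) would yield a different composite from the same inputs, which is how different composites arise.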

5.5 Secondary Databases

Secondary databases contain the fruits of analyses of the sequences in the primary
resources. Because there are several different primary databases and a variety of ways of
analysing protein sequences, the information housed in each of the secondary resources is
different. Designing software tools that can search the different types of data, interpret the range
of outputs, and assess the biological significance of the results is not a trivial task. SWISS-PROT
has emerged as the most popular primary source and many secondary databases now use it as
their basis.
Some of the main secondary resources are as follows:

Secondary database    Primary source     Stored information

PROSITE               SWISS-PROT         Regular expressions
Profiles              SWISS-PROT         Weighted matrices
PRINTS                OWL                Aligned motifs
Pfam                  SWISS-PROT         Hidden Markov models
BLOCKS                PROSITE/PRINTS     Aligned motifs (blocks)
IDENTIFY              BLOCKS/PRINTS      Fuzzy regular expressions
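
As an illustration of the "regular expressions" entry for PROSITE, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It handles only the common pattern elements (x for any residue, square brackets for allowed residues, curly braces for forbidden residues, parentheses for repeat counts, and the < and > anchors when they stand outside brackets). The helper names and the test sequence are my own. The example pattern N-{P}-[ST]-{P} is the well-known N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified: an anchor written inside square brackets is not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")  # sequence ends
        element = element.replace("x", ".")                    # any residue
        element = element.replace("{", "[^").replace("}", "]") # forbidden set
        element = element.replace("(", "{").replace(")", "}")  # repeat counts
        parts.append(element)
    return "".join(parts)


def find_motifs(pattern, sequence):
    """Return (start_index, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motifs("N-{P}-[ST]-{P}", "MKNGTALNPSQ"))
```

In the made-up sequence, the asparagine at index 2 matches the motif, while the one at index 7 does not because it is followed by a proline. A pattern either matches or it does not, which is the main limitation that the weighted-matrix and hidden Markov model approaches in the table try to overcome.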

5.6 Tertiary Databases

Tertiary databases are the databases derived from information housed in secondary
(pattern) databases (e.g. the BLOCKS and eMOTIF databases, which draw on the data stored
within PROSITE and PRINTS). The value of such resources is in providing a different scoring
perspective on the same underlying data, allowing the possibility to diagnose relationships that
might be missed using the original implementation.
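
What "a different scoring perspective on the same underlying data" can mean is sketched below. The same small block of aligned motifs is turned into (a) a strict pattern listing the residues observed in each column and (b) a log-odds weight matrix. This is a generic textbook-style construction with a uniform background and a pseudocount of 1, both of which are assumptions; it is not claimed to be the scoring scheme of BLOCKS, eMOTIF or any other resource. The aligned block itself is invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_pattern(block):
    """Strict view: the set of residues observed in each aligned column."""
    return [set(column) for column in zip(*block)]


def matches_pattern(pattern, segment):
    """True only if every residue was seen in its column of the block."""
    return len(segment) == len(pattern) and all(
        residue in allowed for residue, allowed in zip(segment, pattern))


def log_odds_matrix(block, pseudocount=1.0):
    """Graded view: per-column log2(observed frequency / background)."""
    background = 1.0 / len(AMINO_ACIDS)   # uniform background (assumption)
    matrix = []
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def score(matrix, segment):
    """Sum the column scores for a segment of the same length as the block."""
    return sum(column[residue] for column, residue in zip(matrix, segment))


if __name__ == "__main__":
    block = ["GKSTL", "GKTTL", "GKSSL", "GRSTL"]   # made-up aligned motif
    pattern, matrix = column_pattern(block), log_odds_matrix(block)
    for candidate in ("GRTSL", "GKSTI", "WWWWW"):
        print(candidate, matches_pattern(pattern, candidate),
              round(score(matrix, candidate), 2))
```

The candidate GKSTI is rejected outright by the strict pattern because of its final residue, yet it still earns a clearly positive matrix score, whereas WWWWW scores below zero. A relationship missed by one representation can therefore be picked up by another built from the same alignment.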

Chapter 6

Applications of Bioinformatics
Substantial investment is being made in the field of biotechnology. In this chapter, I
have attempted to review the results obtained so far and what is expected
in the future.

6.1 Application to the Ailments of Diseases

The miraculous substance that contains all of our genetic instructions, DNA, is rapidly
becoming a key to modern medicine. By focusing on the diaphanous and extraordinarily long
filaments of DNA that we inherit from our parents, scientists are finding the root causes of
dozens of previously mysterious diseases: abnormal genes. These discoveries are allowing
researchers to make precise diagnoses and predictions, to design more effective drugs, and to
prevent many painful disorders. The new findings also pave the way for the development of the
ultimate therapy - substituting a normal gene for a malfunctioning one so as to correct a patient's
genetic defect permanently.
Recently, scientists have made spectacular progress against two fatal genetic diseases of
children, cystic fibrosis and Duchenne muscular dystrophy. In addition, they have identified the
genetic flaws that predispose people to more widespread, though still poorly understood ailments
- various forms of heart disease, breast and colon cancer, diabetes, arthritis - which are not
usually thought of as genetic in origin.
While many of the researchers who are exploring our genetic wilderness want to find the
sources of the nearly 4,000 disorders caused by defects in single genes, others have an even
broader goal: They hope to locate and map all of the 50,000 to 100,000 genes on our
chromosomes. This map of our complete biological inheritance - "the marvelous message, evolved
for 3 billion years or more, which gives rise to each one of us," as Robert Sinsheimer of the
University of California, Santa Barbara, calls it - will guide biological research for years to
come. And it will radically simplify the search for the genetic flaws that cause disease.

Once scientists have identified such a flaw, they need to understand just how it produces
a particular illness. They must determine the normal gene's function in human cells: What kind
of protein does it instruct the cells to make, in what quantities, at what times, and in what
specific places? Then the researchers can ask whether the genetic flaw results in too little protein,
the wrong kind of protein, or no protein at all - and how best to counteract the effects of this
failure.
For most genetic disorders, researchers are still at the very beginning of the trail. They
have no clues to the DNA error that causes a disease, and they are still trying to find large
families whose DNA patterns can help them track it down.
By contrast, scientists who work on cystic fibrosis and a few other diseases have covered
much of the trail. They have already succeeded in correcting the gene defect inside living human
cells by inserting healthy genes into these cells in a laboratory dish - an achievement that may
lead to gene therapy.
The farther scientists go along the trail, the broader the implications of their findings. For
example, the discovery of the gene defect that causes Duchenne muscular dystrophy, a
muscle-wasting disease, led scientists to identify a previously unknown protein that plays an important
role in all muscle function. This gives them a clearer view of how muscle cells work and allows
them to diagnose other muscle disorders with exceptional precision, as well as devise new
approaches to treatment.
Any new treatment will need to be tested on animals. In fact, the next explosion of
information in medical genetics is expected to come from the study of animals, particularly those
with defects that mimic human disorders. The techniques for producing animal models of disease are
improving rapidly. Even today, "designer mice" are playing an increasingly important role in
research.
The growth of powerful computerized databases is bringing further insights. Only a
month after the discovery of the genetic error involved in neurofibromatosis, a disfiguring and
sometimes disabling hereditary disease, a computer search revealed a match between the protein
made by normal copies of the newly uncovered gene and a protein that acts to suppress the
development of cancers of the lung, liver, and brain - a key finding for cancer researchers.

Such revelations are becoming increasingly frequent. "If a new sequence has no match in
the databases as they are, a week later a still newer sequence will match it," observes Walter
Gilbert of Harvard University.
Brain disorders such as schizophrenia or Alzheimer's disease may be next to yield to the
genetic approach. "We won't know what went wrong in most cases of mental disease until we
can find the gene that sets it off," says James Watson, co-discoverer of the structure of DNA.

6.2 Application of Bioinformatics to Agriculture

Techniques aimed at crop improvement have been utilized for centuries. Today, applied
plant science has three overall goals: increased crop yield, improved crop quality, and reduced
production costs. Biotechnology is proving its value in meeting these goals. Progress has,
however, been slower than with medical and other areas of research. Because plants are
genetically and physiologically more complex than single-cell organisms such as bacteria and
yeasts, the necessary technologies are developing more slowly.

6.2.1 Improvements in Crop Yield and Quality


In one active area of plant research, scientists are exploring ways to use genetic
modification to confer desirable characteristics on food crops. Similarly, agronomists are looking
for ways to harden plants against adverse environmental conditions such as soil salinity, drought,
alkaline earth metals, and anaerobic (lacking air) soil conditions.
Genetic engineering methods to improve fruit and vegetable crop characteristics - such as
taste, texture, size, color, acidity or sweetness, and ripening process - are being explored as a
potentially superior strategy to the traditional method of cross-breeding.
Research in this area of agricultural biotechnology is complicated by the fact that many
of a crop's traits are encoded not by one gene but by many genes working together. Therefore,
one must first identify all of the genes that function as a set to express a particular property. This
knowledge can then be applied to altering the germlines of commercially important food crops.
For example, it will be possible to transfer the genes regulating nutrient content from one variety
of tomatoes into a variety that naturally grows to a larger size. Similarly, by modifying the genes
that control ripening, agronomists can provide supplies of seasonal fruits and vegetables for
extended periods of time.
Biotechnological methods for improving field crops, such as wheat, corn and soybeans,
are also being sought, since seeds serve both as a source of nutrition for people and animals and
as the material for producing the next plant generation. By increasing the quality and quantity of
protein or varying the types in these crops, we can improve their nutritional value.

6.3 Applications of Microbial Genomics

new energy sources (biofuels)

environmental monitoring to detect pollutants

protection from biological and chemical warfare

safe, efficient toxic waste cleanup

understanding disease vulnerabilities and revealing drug targets

In 1994, taking advantage of new capabilities developed by the genome project, DOE
initiated the Microbial Genome Program to sequence the genomes of bacteria useful in energy
production, environmental remediation, toxic waste reduction, and industrial processing.
Despite our reliance on the inhabitants of the microbial world, we know little of their number or
their nature: estimates are that less than 0.01% of all microbes have been cultivated and
characterized. Programs like the DOE Microbial Genome Program help lay a foundation for
knowledge that will ultimately benefit human health and the environment. The economy will
benefit from further industrial applications of microbial capabilities.
Information gleaned from the characterization of complete genomes in MGP will lead to
insights into the development of such new energy-related biotechnologies as photosynthetic
systems, microbial systems that function in extreme environments, and organisms that can
metabolize readily available renewable resources and waste material with equal facility.
Expected benefits also include development of diverse new products, processes, and test
methods that will open the door to a cleaner environment. Biomanufacturing will use nontoxic
chemicals and enzymes to reduce the cost and improve the efficiency of industrial processes.
Already, microbial enzymes are being used to bleach paper pulp, stone wash denim, remove
lipstick from glassware, break down starch in brewing, and coagulate milk protein for cheese
production. In the health arena, microbial sequences may help researchers find new human genes
and shed light on the disease-producing properties of pathogens.
Microbial genomics will also help pharmaceutical researchers gain a better understanding
of how pathogenic microbes cause disease. Sequencing these microbes will help reveal
vulnerabilities and identify new drug targets.

Gaining a deeper understanding of the microbial world also will provide insights into the
strategies and limits of life on this planet. Data generated in this young program already have
helped scientists identify the minimum number of genes necessary for life and confirm the
existence of a third major kingdom of life. Additionally, the new genetic techniques now allow
us to establish more precisely the diversity of microorganisms and identify those critical to
maintaining or restoring the function and integrity of large and small ecosystems; this knowledge
also can be useful in monitoring and predicting environmental change. Finally, studies on
microbial communities provide models for understanding biological interactions and
evolutionary history.

6.4 Risk Assessment

assess health damage and risks caused by radiation exposure, including low-dose
exposures

assess health damage and risks caused by exposure to mutagenic chemicals and
cancer-causing toxins

reduce the likelihood of heritable mutations

Understanding the human genome will have an enormous impact on the ability to assess risks
posed to individuals by exposure to toxic agents. Scientists know that genetic differences make
some people more susceptible and others more resistant to such agents. Far more work must be
done to determine the genetic basis of such variability. This knowledge will directly address
DOE's long-term mission to understand the effects of low-level exposures to radiation and other
energy-related agents, especially in terms of cancer risk.

6.5 Bioarchaeology, Anthropology, Evolution, and Human Migration

study evolution through germline mutations in lineages

study migration of different population groups based on female genetic inheritance

study mutations on the Y chromosome to trace lineage and migration of males

compare breakpoints in the evolution of mutations with ages of populations and historical
events

Understanding genomics will help us understand human evolution and the common biology
we share with all of life. Comparative genomics between humans and other organisms such as
mice has already led to the identification of similar genes associated with diseases and traits.
Further comparative studies will help determine the yet-unknown function of thousands of other genes.
Comparing the DNA sequences of entire genomes of different microbes will provide new
insights about relationships among the three kingdoms of life: archaebacteria (archaea),
bacteria, and eukaryotes.

6.6 DNA Forensics (Identification)

identify potential suspects whose DNA may match evidence left at crime scenes

exonerate persons wrongly accused of crimes

identify crime and catastrophe victims

establish paternity and other family relationships

identify endangered and protected species as an aid to wildlife officials (could be used for
prosecuting poachers)

detect bacteria and other organisms that may pollute air, water, soil, and food

match organ donors with recipients in transplant programs

determine pedigree for seed or livestock breeds

authenticate consumables such as caviar and wine

Any type of organism can be identified by examination of DNA sequences unique to that
species. Identifying individuals is less precise at this time, although when DNA sequencing
technologies progress further, direct characterization of very large DNA segments, and possibly
even whole genomes, will become feasible and practical and will allow precise individual
identification.
To identify individuals, forensic scientists scan about 10 DNA regions that vary from person
to person and use the data to create a DNA profile of that individual (sometimes called a DNA
fingerprint). There is an extremely small chance that another person has the same DNA profile
for a particular set of regions.
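
Why the chance of a coincidental match becomes so small can be shown with a simple product of probabilities. The sketch below assumes that the regions are statistically independent and that the genotype seen at each region is shared by 10% of the population; both are assumptions chosen for illustration, and the frequencies are invented rather than taken from any population database. Real casework applies further statistical corrections that are not modeled here.

```python
from math import prod


def random_match_probability(genotype_frequencies):
    """Probability that an unrelated person matches at every region.

    Assumes the regions are independent, so the per-region genotype
    frequencies can simply be multiplied together.
    """
    return prod(genotype_frequencies)


if __name__ == "__main__":
    # Hypothetical profile: 10 regions, each genotype shared by 10% of people.
    frequencies = [0.10] * 10
    probability = random_match_probability(frequencies)
    print(f"match probability: {probability:.1e}")
    print(f"roughly 1 in {1 / probability:,.0f}")
```

With these made-up numbers, each additional region cuts the chance of a coincidental match by a factor of ten, so ten regions already give odds on the order of one in ten billion.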
