
S.S.Aamir | Vinayak.

CONTENTS
1) Automated Biological Analysers.
1.1) Routine biochemistry analysers.
1.2) Immuno-based analysers.
1.3) Hematology analysers.
2) Biological Databases.
2.1) Nucleic Acids Research Database
2.2) Species-specific databases
2.3) Applications
2.3.1) Mass Spectrometry - MALDI TOF
2.3.2) Protein Fingerprinting
2.3.3) Proteomics
3) The Human Genome Project
3.1) Sequencing Applications
3.2) Computational Analysis
3.3) Computational Engine for Genomic Sequences
3.4) Data mining and information retrieval.
3.5) Data warehousing.
3.6) Visualization for data and collaboration.
4) Bibliography


AUTOMATED BIOLOGICAL ANALYSERS


An automated analyser is a medical laboratory instrument designed to
measure different chemicals and other characteristics in a number of
biological samples quickly, with minimal human assistance. These measured
properties of blood and other fluids may be useful in the diagnosis of disease.
Many methods of introducing samples into the analyser have been invented.
This can involve placing test tubes of sample into racks, which can be moved
along a track, or inserting tubes into circular carousels that rotate to make the
sample available. Some analysers require samples to be transferred to sample
cups. However, the effort to protect the health and safety of laboratory staff
has prompted many manufacturers to develop analysers that feature closed
tube sampling, preventing workers from direct exposure to samples.
Samples can be processed singly, in batches, or continuously. The automation
of laboratory testing does not remove the need for human expertise (results
must still be evaluated by medical technologists and other qualified clinical
laboratory professionals), but it does reduce errors and ease concerns about
staffing and safety.
Routine biochemistry analysers
These are machines that process a large portion of the samples going into a
hospital or private medical laboratory. Automation of the testing process has
reduced testing time for many analytes from days to minutes. The history of
discrete sample analysis for the clinical laboratory began with the "Robot
Chemist", invented by Hans Baruch and introduced commercially in 1959.
The AutoAnalyzer is an automated analyzer that uses a special flow technique
called continuous flow analysis (CFA), invented in 1957 by Leonard Skeggs, PhD,
and first made by the Technicon Corporation.


The first applications were for clinical (medical) analysis. The AutoAnalyzer
profoundly changed the character of the chemical testing laboratory by
allowing significant increases in the numbers of samples that could be
processed. The design, based on separating a continuously flowing stream with
air bubbles, largely reduced the use of slow, clumsy, and error-prone manual
methods of analysis.
The types of tests required include enzyme levels (such as many of the liver
function tests), ion levels (e.g. sodium and potassium), and other tell-tale
chemicals (such as glucose, serum albumin, or creatinine).
Simple ions are often measured with ion-selective electrodes, which let one
type of ion through and measure voltage differences. Enzymes may be measured
by the rate at which they convert one coloured substance to another; in these
tests, the results for enzymes are given as an activity, not as a concentration
of the enzyme.
Other tests use colorimetric changes to determine the concentration of the
chemical in question. Turbidity may also be measured.
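As a hedged illustration of the colorimetric approach, the short Python sketch below converts a measured absorbance into a concentration using the Beer-Lambert law; the absorbance, molar absorptivity, and path length are invented example values, not parameters of any particular analyser.

    def concentration_from_absorbance(absorbance, molar_absorptivity, path_length_cm=1.0):
        """Beer-Lambert law: A = epsilon * l * c, so c = A / (epsilon * l)."""
        return absorbance / (molar_absorptivity * path_length_cm)

    # Example: an assumed absorbance of 0.45 for a chromophore with
    # epsilon = 6220 L mol^-1 cm^-1 (an illustrative value) in a 1 cm cuvette.
    c = concentration_from_absorbance(0.45, 6220.0)
    print(f"Concentration: {c:.2e} mol/L")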
Immuno-based analysers
Some analysers use antibodies to detect many substances by immunoassay and
other methods that exploit antibody-antigen reactions.
When the concentration of these compounds is too low to cause a measurable
increase in turbidity when bound to antibody, more specialised methods must
be used. Recent developments include automation for the
immunohaematology lab, also known as transfusion medicine.


Hematology analysers
These are used to perform complete blood counts, erythrocyte sedimentation
rates (ESRs), or coagulation tests.

Cell counters
Automated cell counters sample the blood, and
quantify, classify, and describe cell populations
using both electrical and optical techniques.
Electrical analysis involves passing a dilute
solution of the blood through an aperture across
which an electrical current is flowing. The
passage of cells through the current changes the
impedance between the terminals (the Coulter
principle). A lytic reagent is added to the blood
solution to selectively lyse the red cells (RBCs),
leaving only white cells (WBCs) and platelets
intact. Then the solution is passed through a
second detector. This allows the counts of RBCs,
WBCs, and platelets to be obtained. The platelet
count is easily separated from the WBC count by
the smaller impedance spikes they produce in
the detector due to their lower cell volumes.
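The following Python sketch illustrates the idea of impedance-based counting in a simplified form: each pulse height is taken as a proxy for cell volume and binned by assumed volume thresholds. The thresholds and pulse values are illustrative only, not those used by any actual haematology analyser.

    # A minimal sketch (not any vendor's algorithm) of Coulter-style counting:
    # each particle passing the aperture produces an impedance pulse whose
    # height scales with cell volume, so pulses can be binned into platelet,
    # RBC, and WBC classes by hypothetical volume thresholds (in femtolitres).
    # Real analysers also use a lysis step and a second detector, as described above.
    def classify_pulses(pulse_volumes_fl, platelet_max=20.0, rbc_max=150.0):
        counts = {"platelets": 0, "rbc": 0, "wbc": 0}
        for v in pulse_volumes_fl:
            if v < platelet_max:
                counts["platelets"] += 1
            elif v < rbc_max:
                counts["rbc"] += 1
            else:
                counts["wbc"] += 1
        return counts

    print(classify_pulses([8.5, 95.0, 88.0, 300.0, 11.0, 90.0]))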
Computer-Controlled Electron Microscopy
Computer-controlled scanning electron microscopy has been introduced as a
faster, more reliable, and more cost-effective alternative to conventional
electron microprobe analysis of biological samples. Many universities link
their electron microscopes to computer networks so that doctoral candidates
can operate the instrument and view images in real time from any point on the
network.


BIOLOGICAL DATABASES
Biological databases are libraries of life sciences information, collected from
scientific experiments, published literature, high-throughput experiment
technology, and computational analysis.[2] They contain information from
research areas including genomics, proteomics, metabolomics, microarray
gene expression, and phylogenetics. Information contained in biological
databases includes gene function, structure, localization (both cellular and
chromosomal), clinical effects of mutations as well as similarities of biological
sequences and structures.

Biological databases can be broadly classified into sequence and structure
databases. Nucleic acid and protein sequences are stored in sequence
databases, while structure databases store three-dimensional structures,
chiefly of proteins. These databases are important tools that help scientists
analyze and explain a host of biological phenomena, from the structure of
biomolecules and their interactions to the whole metabolism of organisms and
the evolution of species. This knowledge helps in fighting disease, developing
medications, predicting certain genetic diseases,

Biological knowledge is distributed among many different general and
specialized databases. This sometimes makes it difficult to ensure the
consistency of information. Integrative bioinformatics is one field attempting
to tackle this problem by providing unified access. One approach is for
biological databases to cross-reference other databases using accession
numbers, linking related knowledge together.
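A toy Python sketch of accession-number cross-referencing is shown below; the database contents and accession identifiers are hypothetical placeholders used only to show how records in separate databases can be linked into one view.

    # A toy illustration (not a real database schema) of cross-referencing
    # records between databases by accession number. The accessions and
    # values below are hypothetical placeholders.
    gene_db = {
        "GENE0001": {"symbol": "EXAMPLE1", "uniprot": "P0TEST1", "pdb": ["1ABC"]},
    }
    protein_db = {
        "P0TEST1": {"length": 104, "function": "example oxidoreductase"},
    }
    structure_db = {
        "1ABC": {"method": "X-ray", "resolution_angstrom": 1.9},
    }

    def linked_record(gene_accession):
        """Follow accession-number cross-references to assemble one view."""
        gene = gene_db[gene_accession]
        return {
            "gene": gene_accession,
            "protein": protein_db.get(gene["uniprot"]),
            "structures": [structure_db.get(p) for p in gene["pdb"]],
        }

    print(linked_record("GENE0001"))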


Relational database concepts of computer science and information retrieval
concepts of digital libraries are important for understanding biological
databases. Biological database design, development, and long-term
management is a core area of the discipline of bioinformatics.[4] Data
contents include gene sequences, textual descriptions, attributes and ontology
classifications, citations, and tabular data. These are often described as
semi-structured data and can be represented as tables, key-delimited records,
and XML structures.
Most biological databases are available through web sites that organise data
such that users can browse through the data online. In addition, the underlying
data is usually available for download in a variety of formats. Biological data
comes in many formats, including text, sequence data, protein structures, and
links. Each of these can be obtained from certain sources, for example:

Text formats are provided by PubMed and OMIM.
Sequence data is provided by GenBank (for DNA) and UniProt (for proteins).
Protein structures are provided by PDB, SCOP, and CATH.
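As a small illustration of handling downloaded sequence data, the Python sketch below parses FASTA-formatted text, the common plain-text format for sequence downloads; the record shown is made up, not a real database entry.

    # A minimal sketch of reading sequence data in FASTA format, the plain-text
    # convention used for sequence downloads from databases such as GenBank
    # and UniProt. The record below is a made-up example, not a real entry.
    def parse_fasta(text):
        """Yield (header, sequence) pairs from FASTA-formatted text."""
        header, seq_lines = None, []
        for line in text.strip().splitlines():
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq_lines)
                header, seq_lines = line[1:].strip(), []
            else:
                seq_lines.append(line.strip())
        if header is not None:
            yield header, "".join(seq_lines)

    example = """>example_entry hypothetical test sequence
    ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
    """
    for header, seq in parse_fasta(example):
        print(header, len(seq), "bases")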
Nucleic Acids Research Database
An important resource for finding biological databases is a special yearly issue
of the journal Nucleic Acids Research (NAR). The Database Issue of NAR is
freely available and categorizes many of the publicly available online
databases related to biology and bioinformatics. A companion database to the
issue called the Online Molecular Biology Database Collection lists 1,380
online databases. Other collections of databases exist such as MetaBase and
the Bioinformatics Links Collection.


Species-specific databases
Species-specific databases are available for some species, mainly those that are
often used in research. For example, Colibase is an E. coli database. Other
popular species-specific databases include Mouse Genome Informatics for the
laboratory mouse, Mus musculus; the Rat Genome Database for Rattus; ZFIN
for the zebrafish, Danio rerio; FlyBase for Drosophila; WormBase for the
nematodes Caenorhabditis elegans and Caenorhabditis briggsae; and Xenbase
for the frogs Xenopus tropicalis and Xenopus laevis.

Mass Spectrometry - MALDI TOF


The peptide masses measured by MALDI-TOF (matrix-assisted laser
desorption/ionisation time-of-flight) mass spectrometry can be used to search
protein sequence databases for matches with theoretical masses.
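A minimal Python sketch of this matching step is given below: measured peptide masses are compared against theoretical peptide masses for candidate proteins within a mass tolerance. The masses and protein entries are invented for illustration.

    # A minimal sketch of peptide-mass-fingerprint matching: measured peptide
    # masses from a MALDI-TOF run are compared against theoretical masses
    # computed for database proteins, within a mass tolerance. The database
    # entries and measured masses here are invented for illustration.
    def match_count(measured_masses, theoretical_masses, tolerance_da=0.2):
        """Count measured masses that match a theoretical mass within tolerance."""
        return sum(
            any(abs(m - t) <= tolerance_da for t in theoretical_masses)
            for m in measured_masses
        )

    theoretical_db = {
        "protein_A": [1045.5, 1534.8, 2211.1, 842.5],
        "protein_B": [987.4, 1301.6, 1764.9],
    }
    measured = [842.51, 1045.48, 1534.83, 1764.75]

    scores = {name: match_count(measured, masses) for name, masses in theoretical_db.items()}
    best = max(scores, key=scores.get)
    print(scores, "-> best candidate:", best)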
Protein Fingerprinting
Following digestion of a protein with reagents such as trypsin, the resulting
mixture of peptides can be separated by chromatographic methods. The resulting
pattern, or peptide map, is diagnostic of the protein and is therefore called
its fingerprint. These fingerprints can be stored in online databases for
reference by other biologists.
Proteomics
The proteome is the entire complement of proteins in a cell or organism.
Proteomics is the large-scale effort to identify and characterize all the proteins
encoded in an organism's genome, including their post-translational
modifications. Proteins resolved by two-dimensional electrophoresis are
subjected to trypsin digestion, and the highly accurate molar masses of the
resulting peptides are used as a fingerprint to identify the protein from
databases of real or predicted tryptic peptide sizes.
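The sketch below performs a simplified in-silico tryptic digest in Python, cleaving after lysine (K) or arginine (R) except before proline and summing approximate average residue masses; it is an illustration of the principle, not a production digestion model.

    # A simplified in-silico tryptic digest (an illustrative sketch, not a
    # full digestion model): trypsin cleaves C-terminal to lysine (K) and
    # arginine (R), except when the next residue is proline (P). Average
    # residue masses are approximate values used only for illustration.
    import re

    AVG_RESIDUE_MASS = {
        "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
        "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
        "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
        "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
    }
    WATER = 18.02  # mass of water added on hydrolysis

    def tryptic_peptides(sequence):
        """Split a protein sequence after K or R, unless followed by P."""
        return re.split(r"(?<=[KR])(?!P)", sequence)

    def peptide_mass(peptide):
        return sum(AVG_RESIDUE_MASS[aa] for aa in peptide) + WATER

    protein = "MKWVTFISLLFLFSSAYSRGVFRR"  # a made-up test sequence
    for pep in tryptic_peptides(protein):
        if pep:
            print(f"{pep:12s} {peptide_mass(pep):8.2f} Da")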


THE HUMAN GENOME PROJECT
Just a few short years ago, most of us knew
very little about our genes and their impact on
our lives. But more recently, it has been
virtually impossible to escape the popular
media's attention to a number of breathtaking
discoveries of human genes, especially those
related to diseases such as cystic fibrosis,
Huntington's chorea, and breast cancer. What
has brought this about is the Human Genome
Project, an international effort started in 1988
and sponsored in the United States by the
Department of Energy (DOE) and the
National Institutes of Health (NIH). The goal
of this project is to elucidate the information
that makes up the genetic blueprint of human
beings.

COMPUTING THE GENOME

Photograph: Ed Uberbacher shows a model of transcribing genes. (Photo by Tom Cerniglio.)

ORNL is part of a team that is designing and preparing to implement a new
computational engine to rapidly analyze large-scale genomic sequences, keeping
up with the flood of data from the Human Genome Project.

The Human Genome Project's success in sequencing the chemical bases of
DNA is virtually revolutionizing biology and biotechnology. It is creating new
knowledge about fundamental biological processes. It has increased our
ability to analyze, manipulate, and modify genes and to engineer organisms,
providing numerous opportunities for applications. Biotechnology in the
United States, virtually nonexistent a few years ago, is expected to become a
$50 billion industry before 2000, largely because of the Human Genome
Project.


Despite this project's impact, the pace of gene discovery has actually been
rather slow. The initial phase of the project, called mapping, has been
primarily devoted to fragmenting chromosomes into manageable ordered
pieces for later high-throughput sequencing. In this period, it has often taken
years to locate and characterize individual genes related to human disease.
Thus, biologists began to appreciate the value of computing to mapping and
sequencing.
A good illustration of the emerging impact of computing on genomics is the
search for the gene for Adrenoleukodystrophy (related to the disease in the
movie Lorenzo's Oil). A team of researchers in Europe spent about two years
searching for the gene using standard experimental methods. Then they
managed to sequence the region of the chromosome containing the gene.
Finally, they sent information on the sequence to the ORNL server containing
the ORNL-developed computer program called Gene Recognition and
Analysis Internet Link (GRAIL). Within a couple of minutes, GRAIL returned
the location of the gene within the sequence.
The Human Genome Project has entered a new phase. Six NIH genome
centers were funded recently to begin high-throughput sequencing, and plans
are under way for large-scale sequencing efforts at DOE genome centers at
Lawrence Berkeley National Laboratory (LBNL), Lawrence Livermore
National Laboratory (LLNL), and Los Alamos National Laboratory (LANL);
these centers have been integrated to form the Joint Genome Institute. As a
result, researchers are now focusing on the challenge of processing and
understanding much larger domains of the DNA sequence. It has been
estimated that, on average from 1997 to 2003, new sequence of approximately
2 million DNA bases will be produced every day. Each day's sequence will
represent approximately 70 new genes and their respective proteins. This
information will be made available immediately on the Internet and in central
genome databases.


Such information is of immeasurable value to medical researchers,
biotechnology firms, the pharmaceutical industry, and researchers in a host
of fields ranging from microorganism metabolism to structural biology.
Because only a small fraction of genes that cause human genetic disease have
been identified, each new gene revealed by genome sequence analysis has the
potential to significantly affect human health. Within the human genome is
an estimated total of 6000 genes that have a direct impact on the diagnosis
and treatment of human genetic diseases. The timely development of
diagnostic techniques and treatments for these diseases is worth billions of
dollars for the U.S. economy, and computational analysis is a key component
that can contribute significantly to the knowledge necessary to effect such
developments.
The rate of several megabase pairs per day at which the Human Genome and
microorganism sequencing projects will soon be producing data will exceed
current sequence analysis capabilities and infrastructure. Sequences are
already arriving at a rate and in forms that make analysis very difficult. For
example, a recent posting of a large clone (large DNA sequence fragment) by
a major genome center was made in several hundred thousand base fragments,
rather than as one long sequence, because the sequence database was unable
to input the whole sequence as a single long entry. Anyone who wishes to
analyze this sequence to determine which genes are present must manually
reassemble the sequence from these many small fragments, an absolutely
ridiculous task. The sequences of large genomic clones are being routinely
posted on the Internet with virtually no comment, analysis, or interpretation;
and mechanisms for their entry into public-domain databases are in many
cases inadequately defined. Valuable sequences are going unanalyzed because
methods and procedures for handling the data are lacking and because
current methods for doing analyses are time-consuming and inconvenient.
And in real terms, the flood of data is just beginning.


Computational Analysis
Computers can be used very effectively to indicate the location of genes and
of regions that control the expression of genes and to discover relationships
between each new sequence and other known sequences from many different
organisms. This process is referred to as sequence annotation. Annotation
(the elucidation and description of biologically relevant features in the
sequence) is the essential prerequisite before the genome sequence data can
become useful, and the quality with which annotation is done will directly
affect the value of the sequence. In addition to considerable organizational
issues, significant computational challenges must be addressed if the DNA
sequences produced are to be successfully annotated. It is clear that new
computational methods and a workable process must be implemented for
effective and timely analysis and management of these data.
In considering computing related to the large-scale sequence analysis and
annotation process, it is useful to examine previously developed models.
Procedures for high-throughput analysis have been most notably applied to
several microorganisms (e.g., Haemophilus influenzae and Mycoplasma
genitalium) using relatively simple methods designed to facilitate basically a
single pass through the data (a pipeline that produces a one-time result or
report).
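The following Python sketch shows what such a single-pass pipeline looks like in outline; the analysis steps are stubs standing in for real gene-finding and database-search tools.

    # A sketch of the single-pass "pipeline" model described above: each new
    # sequence is run once through a fixed series of analysis steps and a
    # one-time report is produced. The step functions are stubs standing in
    # for real tools (gene finding, database search, and so on).
    def find_genes(seq):
        return [{"start": 0, "end": min(len(seq), 300), "note": "stub prediction"}]

    def search_databases(seq):
        return ["stub_hit_1"]

    def annotate_once(name, seq):
        """Single pass: analyze, report, and never revisit the result."""
        report = {
            "sequence": name,
            "length": len(seq),
            "genes": find_genes(seq),
            "similar_sequences": search_databases(seq),
        }
        return report

    print(annotate_once("contig_001", "ATG" * 200))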
However, this is too simple a model for analyzing genomes as complex as the
human genome. For one thing, the analysis of genomic sequence regions
needs to be updated continually through the course of the Genome Project;
the analysis is never really done. On any given day, new information relevant
to a sequenced gene may show up in any one of many databases, and new links
to this information need to be discovered and presented. Additionally, our
capabilities for analyzing the sequence will change with time.
The analysis of DNA sequences by computer is a relatively immature science,
and we in the informatics community will be able to recognize many features
(like gene regulatory regions) better in a year than we can now. There will be
a significant advantage in reanalyzing sequences and updating our knowledge
of them continually as new sequences appear from many organisms, methods
improve, and databases with relevant information grow. In this model,
sequence annotation is a living thing that will develop richness and improve
in quality over the years. The single pass-through pipeline is simply not the
appropriate model for human genome analysis, because the rate at which new
and relevant information appears is staggering.
Computational Engine for Genomic Sequences
Researchers at ORNL, LBNL, Argonne National Laboratory (ANL), and several
other genome laboratories are teaming to design and implement a new kind
of computational engine for analysis of large-scale genomic sequences. This
sequence analysis engine, which has become a Computational Grand
Challenge problem, will integrate a suite of tools on high-performance
computing resources and manage the analysis results.
In addition to the need for state-of-the-art computers at several
supercomputing centers, this analysis system will require dynamic and
seamless management of contiguous distributed high-performance
computing processes, efficient parallel implementations of a number of new
algorithms, complex distributed data mining operations, and the application
of new inferencing and visualization methods. A process of analysis that will
be started in this engine will not be completed for seven to ten years.
The data flow in this analysis engine is shown on the next page (Fig. 1). Updates of
sequence data will be retrieved through the use of Internet retrieval agents
and stored in a local data warehouse. Most human genome centers will daily
post new sequences on publicly available Internet or World Wide Web sites,
and they will establish agreed-upon policies for Internet capture of their data.
These data will feed the analysis engine that will return results to the
warehouse for use in later or long-term analysis processes, visualization by
researchers, and distribution to community databases and genome
sequencing centers. Unlike the pipeline analysis model, the warehouse
maintains the sequence data, analysis results, and data links so that continual
update processes can be made to operate on the data over many years.
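A hedged Python sketch of the retrieval-agent and data-warehouse idea follows; the fetch function is a stub and the center names are hypothetical, since the actual retrieval policies and formats are those agreed with the genome centers.

    # A sketch of the Internet retrieval agent / data warehouse idea: agents
    # periodically fetch newly posted sequences and deposit them in a local
    # store for later analysis. The fetch function is a stub; the URLs and
    # record format are hypothetical, not those of any real genome center.
    import time

    def fetch_new_sequences(center_url):
        """Stand-in for an HTTP fetch of a center's daily sequence postings."""
        return [{"id": f"{center_url}/seq_{int(time.time())}", "sequence": "ATGC" * 50}]

    class Warehouse:
        def __init__(self):
            self.records = {}

        def deposit(self, record):
            # Keep sequence, analysis results, and links together for later updates.
            self.records[record["id"]] = {"sequence": record["sequence"],
                                          "analyses": [], "links": []}

    warehouse = Warehouse()
    for center in ["center-a.example.org", "center-b.example.org"]:
        for rec in fetch_new_sequences(center):
            warehouse.deposit(rec)

    print(len(warehouse.records), "sequences stored for long-term analysis")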


The analysis engine will combine a number of processes into a coherent system
running on distributed high-performance computing hardware at ORNL's Center
for Computational Sciences (CCS), LBNL's National Energy Research Scientific
Computing Center, and ANL's Center for Computational Science and Technology
facilities. A schematic of these processes is shown in Fig. 2. A process
manager will conditionally determine the necessary analysis steps and direct
the flow of tasks to massively parallel processing resources at these several
locations. These processes will include multiple statistical and
artificial-intelligence-based pattern-recognition algorithms (for locating
genes and other features in the sequence), computation for statistical
characterization of sequence domains, gene modeling algorithms to describe the
extent and structure of genes, and sequence comparison programs to search
databases for other sequences that may provide insight into a gene's function.
The process manager will also initiate multiple distributed information
retrieval and data mining processes to access remote databases for information
relevant to the genes (or corresponding proteins) discovered in a particular
DNA sequence region. Five significant technical challenges must be addressed
to implement such a system. A discussion of those challenges follows.
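The Python sketch below illustrates the process-manager idea in miniature: analysis steps are chosen conditionally for each sequence and dispatched to a worker pool standing in for distributed parallel resources. The step functions are stubs, not the engine's actual algorithms.

    # A sketch of the process-manager idea: for each incoming sequence the
    # manager conditionally decides which analysis steps are needed and
    # dispatches them to worker pools (standing in for massively parallel
    # resources at several sites). Step functions are stubs for real tools.
    from concurrent.futures import ThreadPoolExecutor

    def pattern_recognition(seq):  return {"step": "pattern_recognition", "features": 3}
    def statistical_profile(seq):  return {"step": "statistics", "gc": seq.count("G") / len(seq)}
    def gene_modeling(seq):        return {"step": "gene_modeling", "models": 1}
    def database_comparison(seq):  return {"step": "db_comparison", "hits": 2}

    def plan_steps(seq):
        """Conditionally choose steps; e.g., skip gene modeling for tiny fragments."""
        steps = [pattern_recognition, statistical_profile, database_comparison]
        if len(seq) > 1000:
            steps.append(gene_modeling)
        return steps

    def run_analysis(seq):
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(step, seq) for step in plan_steps(seq)]
            return [f.result() for f in futures]

    print(run_analysis("ATGC" * 500))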

THE ANALYSIS ENGINE

Fig. 1. The sequence analysis engine will input a genomic DNA sequence from
many sites using Internet retrieval agents, maintain it in a data warehouse,
and facilitate a long-term analysis process using high-performance computing
facilities.


Seamless high-performance computing


Megabases of DNA sequence being analyzed each day will strain the capacity
of existing supercomputing centers. Interoperability between high-performance
computing centers will be needed to provide the aggregate computing power,
managed through the use of sophisticated resource management tools. The
system must be tolerant of machine and network failures so that no data or
results are lost.
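As a small illustration of this requirement, the Python sketch below retries a failed task and checkpoints its result; the failing task is simulated, and the retry policy is an assumption rather than the system's actual resource-management logic.

    # A sketch of the fault-tolerance requirement: retry a failed analysis task
    # a few times and checkpoint completed results so nothing is lost if a
    # machine or network link goes down. The task and failure are simulated.
    import json, time

    def run_with_retries(task, arg, attempts=3, delay_s=1.0):
        for attempt in range(1, attempts + 1):
            try:
                return task(arg)
            except Exception:
                if attempt == attempts:
                    raise
                time.sleep(delay_s)  # back off, then resubmit the task

    _calls = {"n": 0}
    def flaky_analysis(seq):
        _calls["n"] += 1
        if _calls["n"] == 1:               # simulate one transient node failure
            raise RuntimeError("node unreachable")
        return {"length": len(seq)}

    result = run_with_retries(flaky_analysis, "ATGC" * 10)
    with open("checkpoint.json", "w") as fh:   # checkpoint so results survive restarts
        json.dump(result, fh)
    print("stored:", result)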

Parallel algorithms for sequence analysis


The recognition of important features in a sequence, such as genes, must be
highly automated to eliminate the need for time-consuming manual gene
model building. Five distinct types of algorithms (pattern recognition,
statistical measurement, sequence comparison, gene modeling, and data
mining) must be combined into a coordinated toolkit to synthesize the
complete analysis.
Fig. 3. Approximately 800 bases of DNA sequence (equivalent to 1/3,800,000
of the human genome), containing the first gene coding segment of four in
the human Ras gene. The coding portion of the gene is located between
bases 1624 and 1774. The remaining DNA around this does not contain a
genetic message and is often referred to as junk DNA.


Fig. 4. (a) The artificial neural network used in GRAIL combines the results
of a number of statistical tests to locate regions in the sequence that contain
portions of genes with the genetic code for proteins. (b) A diagram showing
the prediction of GRAIL on the human Ras gene (part of which is shown in
Fig. 3). The peaks correspond to gene coding segments, and the connected bars
represent the model of the gene predicted by the computer.
One of the key types of algorithms
needed is pattern recognition. Methods must be designed to detect the subtle
statistical patterns characteristic of biologically important sequence features,
such as genes or gene regulatory regions. DNA sequences are remarkably
difficult to interpret through visual examination. For example, in Fig. 3, it is
virtually impossible to tell that part of the sequence is a gene coding region.
However, when examined in the computer, DNA sequence has proven to be a
rich source of interesting patterns, having periodic, stochastic, and chaotic
properties that vary in different functional domains. These properties and
methods to measure them form the basis for recognizing the parts of the
sequence that contain important biological features.
In genomics and computational biology, pattern recognition systems often
employ artificial neural networks or other similar classifiers to distinguish
sequence regions containing a particular feature from those regions that do
not. Machine-learning methods allow computer-based systems to learn about
patterns from examples in DNA sequence. A well-known example of this is
ORNL's GRAIL gene detection system, deployed originally in 1991, which
combined seven statistical pattern measures using a simple feed-forward
neural network [Fig. 4(a)]. GRAIL is able to determine regions of the sequence
that contain genes [Fig. 4(b)], even genes it has never seen before, based on
its training from known gene examples.
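The toy Python sketch below captures the flavour of this approach: a few window statistics are combined by a small feed-forward network into a coding score. The features and weights are arbitrary illustrative values, not GRAIL's trained parameters.

    # A toy feed-forward network in the spirit of GRAIL's approach: several
    # statistical measures computed on a sequence window are combined by a
    # small neural net into a coding/non-coding score. The weights here are
    # arbitrary illustrative values, not GRAIL's trained parameters.
    import math

    def window_features(window):
        """A few simple statistics of a DNA window (illustrative, not GRAIL's)."""
        gc = (window.count("G") + window.count("C")) / len(window)
        cpg = window.count("CG") / max(1, len(window) - 1)
        stops = sum(window[i:i + 3] in ("TAA", "TAG", "TGA")
                    for i in range(0, len(window) - 2, 3)) / (len(window) // 3)
        return [gc, cpg, stops]

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # One hidden layer with two units, then a single output unit.
    W_HIDDEN = [[4.0, 2.0, -6.0], [1.5, 3.0, -2.0]]
    B_HIDDEN = [-2.0, -1.0]
    W_OUT, B_OUT = [3.0, 1.0], -2.0

    def coding_score(window):
        x = window_features(window)
        hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                  for ws, b in zip(W_HIDDEN, B_HIDDEN)]
        return sigmoid(sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT)

    print(round(coding_score("ATGGCGGCGCTGAGCGGCGGCGGAGGCGGC"), 3))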
High-speed sequence comparison represents another important class of
algorithms used to compare one DNA or protein sequence with another in a
way that extracts how and where the two sequences are similar. Many
organisms share many of the same basic genes and proteins, and information
about a gene or protein in one organism provides insight into the function of
its relatives or homologs in other organisms.
To get a sense of the scale, examination of the relationship between 2
megabases of sequence (one day's finished sequence) and a single database of
known gene sequence fragments (called ESTs) requires the calculation of about
10¹⁵ DNA base comparisons. And there are quite a number of databases to consider.
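A back-of-the-envelope check of that figure, assuming an EST database of roughly 5 x 10^8 bases (an assumed size used only for illustration):

    # A quick check of the scale estimate above, assuming an EST database of
    # roughly 5 x 10^8 bases (an assumed figure): an all-against-all base
    # comparison scales as the product of the two sizes.
    new_sequence_bases = 2e6        # one day's finished sequence
    est_database_bases = 5e8        # assumed database size
    comparisons = new_sequence_bases * est_database_bases
    print(f"{comparisons:.0e} base comparisons")   # 1e+15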

Two examples of sequence comparison for members of the same protein family
are shown in Fig. 5. One shows a very similar relative to the human protein
sequence query and the second a much weaker (and evolutionarily more distant)
relationship. The sequence databases (which contain sequences used for such
comparisons) are growing at an exponential rate, making it necessary to apply
ever-increasing computational power to this problem.
Data mining and information retrieval.
Methods are needed to locate and
retrieve information relevant to newly discovered genes. If similar genes or
proteins are discovered through sequence comparison, often experiments have
been performed on one or more
homologs that can provide insight into the newly discovered gene or protein.
Relevant information is contained in more than 100 databases scattered
throughout the world, including DNA and protein sequence databases,
genome mapping databases, metabolic pathway databases, gene expression
databases, gene function and phenotype databases, and protein structure
databases. These data can provide insight into a gene's biochemical or
whole-organism function, pattern of expression in tissues, protein structure type or
class, functional family, metabolic role, and potential relationship to disease
phenotypes. Using the Internet, researchers are developing automated
methods to retrieve, collate, fuse, and dynamically link such database
information to new regions of sequence. This is an important component of
ORNL's Functional Genomics Initiative, because it helps to link experimental
studies in the mouse to the larger context of the world's genomic information.
The target data resources are very heterogeneous (i.e., structured in a variety
of ways), and some are merely text-based and poorly formatted, making the
identification of relevant information and its retrieval difficult. Intelligent
information retrieval technology is being applied to this domain to improve
the reliability of such systems. One challenge here is that information relevant
to an important gene or protein may appear in any database at any time. As a
result, systems now being developed dynamically update the descriptions of
genes and proteins in our data warehouse and continually poll remote data
resources for new information.
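The Python sketch below illustrates the collation and fusion step: records about the same accession retrieved from differently structured sources are normalized onto a common schema. The sources, field names, and values are hypothetical.

    # A sketch of collating and fusing heterogeneous records about the same
    # gene retrieved from different resources. The field names and records
    # are hypothetical; the point is normalizing them under one accession.
    def normalize(source, record):
        """Map each source's own field names onto a common schema."""
        field_map = {
            "mapdb":  {"locus": "map_position"},
            "pathdb": {"pathway_name": "pathway"},
            "exprdb": {"tissue_levels": "expression"},
        }
        return {field_map[source].get(k, k): v for k, v in record.items()}

    retrieved = [
        ("mapdb",  {"accession": "GENE0001", "locus": "7q31.2"}),
        ("pathdb", {"accession": "GENE0001", "pathway_name": "ion transport"}),
        ("exprdb", {"accession": "GENE0001", "tissue_levels": {"lung": "high"}}),
    ]

    fused = {}
    for source, record in retrieved:
        entry = fused.setdefault(record["accession"], {})
        entry.update(normalize(source, record))

    print(fused)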
Data warehousing.
The information retrieved by intelligent agents or
calculated by the analysis system must be collected and stored in a local
repository from which it can be retrieved and used in further analysis
processes, seen by researchers, or downloaded into community databases.
Numerous data of many types need to be stored and managed in such a way
that descriptions of genomic regions and links to external data can be
maintained and updated continually.


In addition, large volumes of data in the warehouse must be accessible to the
analysis systems running at multiple sites at a moment's notice. Our plan
involves using the High-Performance Storage System being implemented at
ORNL's CCS.
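A minimal sketch of such a warehouse, using Python's built-in SQLite module; the table layout and sample values are illustrative only, and the real system would use far larger, distributed storage such as the HPSS mentioned above.

    # A minimal sketch of the warehouse idea using a local SQLite store:
    # sequence regions, analysis results, and links to external data are kept
    # together so they can be re-queried and updated for years. Table layout
    # and sample values are illustrative only.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE regions  (id TEXT PRIMARY KEY, sequence TEXT);
    CREATE TABLE analyses (region_id TEXT, tool TEXT, result TEXT, run_date TEXT);
    CREATE TABLE links    (region_id TEXT, database TEXT, accession TEXT);
    """)

    con.execute("INSERT INTO regions VALUES (?, ?)", ("region_001", "ATGC" * 100))
    con.execute("INSERT INTO analyses VALUES (?, ?, ?, ?)",
                ("region_001", "gene_finder", "1 gene predicted", "1997-06-01"))
    con.execute("INSERT INTO links VALUES (?, ?, ?)",
                ("region_001", "protein_db", "P0TEST1"))

    # Later analyses simply add rows; nothing is thrown away, so the region's
    # description can keep improving as methods and databases evolve.
    for row in con.execute("""SELECT a.tool, a.result, l.accession
                              FROM analyses a JOIN links l
                              ON a.region_id = l.region_id"""):
        print(row)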
Visualization for data and collaboration
The sheer volume and complexity of the analyzed information and links to
data in many remote databases require advanced data visualization methods
to allow user access to the data. Users need to interface with the raw sequence
data, the analysis process, and the resulting synthesis of gene models, features,
patterns, genome map data, anatomical or disease phenotypes, and other
relevant data.
In addition, collaborations among multiple sites are required for most large
genome analysis problems, so collaboration tools, such as video conferencing
and electronic notebooks, are very useful. Even more complex and
hierarchical displays are being developed that will be able to zoom in from
each chromosome to see the chromosome fragments (or clones) that have
been sequenced and then display the genes and other functional features at
the sequence level. Linked (or hyperlinked) to each feature will be detailed
information about its properties, the computational or experimental methods
used for its characterization, and further links to many remote databases that
contain additional information. Analysis processes and intelligent retrieval
agents will provide the feature details available in the interface and
dynamically construct links to remote data.
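As a small illustration of the hyperlinked-feature idea, the Python sketch below writes a simple HTML page in which each feature links to a remote database record; the features and link targets are hypothetical placeholders.

    # A sketch of the hyperlinked feature display idea: generate a small HTML
    # page where each predicted feature links out to a remote database record.
    # Feature data and link targets are hypothetical placeholders.
    import html

    features = [
        {"name": "gene_1", "start": 1624, "end": 1774,
         "link": "https://example.org/proteindb/P0TEST1"},
        {"name": "regulatory_region_1", "start": 900, "end": 1100,
         "link": "https://example.org/mapdb/REG0001"},
    ]

    rows = "\n".join(
        f'<li><a href="{html.escape(feat["link"])}">{html.escape(feat["name"])}</a> '
        f'({feat["start"]}-{feat["end"]})</li>'
        for feat in features
    )
    page = f"<html><body><h1>region_001 features</h1><ul>\n{rows}\n</ul></body></html>"

    with open("region_001.html", "w") as fh:
        fh.write(page)
    print("wrote region_001.html")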
The development of the sequence analysis engine is part of a rapid paradigm
shift in the biological sciences toward much greater use of computation,
networking, simulation and modeling, and sophisticated data management
systems. Unlike any other existing system in the genomics arena, the sequence
analysis engine will link components for data input, analysis, storage, update,
and submission in a single distributed high-performance framework that is
designed to carry out a dynamic and continual discovery process over a 10-year
period.

The approach outlined is flexible enough to use a variety of hardware and data
resources; configure analysis steps, triggers, conditions, and updates; and
provide the means to maintain and update the description of each genomic
region. Users can specify recurrent analysis and data mining operations that
continue for years. The combined computational process will provide new
knowledge about the genome on a scale that is impossible for individual
researchers using current methods. Such a process is absolutely necessary to
keep up with the flood of data that will be gathered over the remainder of the
Human Genome Project.


BIBLIOGRAPHY
1) Hyman, E. D. A new method of sequencing DNA. Analytical
Biochemistry 174, 423–436 (1988).
2) Metzker, M. L. Emerging technologies in DNA sequencing. Genome
Research 15, 1767–1776 (2005).
3) Oak Ridge National Laboratory
http://web.ornl.gov/info/ornlreview/v30n3-4/genome.htm
4) Wikipedia
https://en.wikipedia.org
5) Google
https://google.co.in

