HISTORICAL BACKGROUND
EVENTS YEAR
GENE MINING
BLAST
3. INTRODUCTION:-
CENTRAL DOGMA:-
The principal reason for gene mining is to identify and isolate genes that are
known to confer essential traits. The widespread use and availability of
molecular biological techniques have allowed the rapid development and
identification of nucleic acid-derived sequences. With laboratory equipment now
integrated with advanced computer software, researchers are able to conduct
advanced quantitative analyses and database comparisons and to run computational
algorithms to seek and identify gene sequences. Genetic databases for organisms
such as Escherichia coli, Haemophilus influenzae, Mycoplasma genitalium and
Mycoplasma pneumoniae, to name a few, are publicly available. These
biological databases store searchable information from which biological
information may be retrieved. This work illustrates the exploitation of publicly
available sequence databases on the Internet for the identification of useful genes.
2. Data Mining
8. ORIEL
Plant Material
The germplasm used in this allele mining study consists of leaf material of the
genotypes concerned, collected from the Genetic Resources Centers.
DNA Extraction
Total genomic DNA is isolated from fresh green leaves (approx. 5 g) according to
the method of Dellaporta et al., with minor modifications. The quality and
quantity of the extracted DNA are checked both spectrophotometrically and by
running the extracted DNA on 1.0% agarose gels stained with ethidium bromide.
PCR Analysis
PCR amplification of genomic DNA was carried out using gene-specific primers.
The PCR program consisted of a total of 40 cycles of denaturation (94°C for 1
min), annealing (55°C for 1 min 15 s), and elongation (72°C for 3 min 5 s). The
PCR-amplified products were electrophoresed in a 1.4% agarose gel in 1X
Tris-acetate-ethylenediaminetetraacetic acid (TAE) buffer. The gels were
photographed under an ultraviolet transilluminator.
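As a quick arithmetic check on the cycling program above, the hold times of the
three steps can be added up. The sketch below is illustrative only: it ignores
ramp times and any initial denaturation or final extension step (the text does
not specify them), and the names STEPS, CYCLES and total_cycling_seconds are
our own.

```python
# Hold time of each step in one cycle, in seconds, as stated in the text.
STEPS = {
    "denaturation (94 C)": 60,        # 1 min
    "annealing (55 C)": 60 + 15,      # 1 min 15 s
    "elongation (72 C)": 3 * 60 + 5,  # 3 min 5 s
}
CYCLES = 40


def total_cycling_seconds(steps=STEPS, cycles=CYCLES):
    """Total hold time over all cycles; ramp times are ignored (assumption)."""
    return cycles * sum(steps.values())


if __name__ == "__main__":
    seconds = total_cycling_seconds()
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    print(f"{seconds} s = {hours} h {minutes} min {secs} s")
```

Each cycle holds for 320 s, so 40 cycles amount to 12,800 s, or roughly three
and a half hours of thermocycler time before ramping is counted.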
Gene-specific primers amplify the DNA of each accession, and the amplified
product represents either the entire allele or some functional component of the
allele, such as the promoter or the coding sequences.
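In sequence terms, the amplified product is the stretch of template bounded by
the two primers. The following minimal sketch locates that stretch by exact
string matching. It is a toy under stated assumptions: the function names and
the example sequences are hypothetical, and real primer binding tolerates
mismatches and depends on the annealing conditions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def in_silico_pcr(template, forward, reverse):
    """Return the amplicon bounded by a primer pair, or None if either fails.

    forward is written 5'->3' on the top strand; reverse is written 5'->3'
    on the bottom strand, so its reverse complement is searched for on the
    top strand, downstream of the forward primer. Exact matches only.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


if __name__ == "__main__":
    # Hypothetical 30-bp template and primer pair, for illustration only.
    template = "GGATCCATGAAACCCGGGTTTTAAGAATTC"
    print(in_silico_pcr(template, "ATGAAA", "TTAAAA"))
```

Running the same primer pair against the sequence of each accession would
return one candidate allele per accession, which is the starting point for
comparing alleles.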
PROBLEMS
2. DATA Mining
Data mining is mainly about extracting information and knowledge from text.
Two definitions:
1. Any operation related to gathering and analyzing text from external sources for
business intelligence purposes.
2. The process of compiling, organizing, and analyzing large document
collections to support the delivery of targeted types of information to analysts and
decision makers, and to discover relationships between related facts that span wide
domains of inquiry.
Data mining systems induce knowledge from datasets that are huge, noisy
(incorrect), incomplete, inconsistent, imprecise (fuzzy), and uncertain.
One problem is that existing systems use a restrictive attribute-value language for
representing the training examples and the induced knowledge.
Furthermore, some important patterns are ignored because they are statistically
insignificant.
The rapid growth of data available in digital format increases the need for methods
to analyze them. Research on topics such as text classification, information
retrieval and automatic text summarization has therefore become an important field.
Public protein sequence databases such as SWISS-PROT are used in practice for
protein identification from Matrix-Assisted Laser Desorption
Ionization-Time Of Flight (MALDI-TOF) data, one of the popular approaches in
proteomic studies. However, because these databases hold little protein information
for specific plant species, it is necessary to construct a private protein database
containing sufficient protein information to interpret massive peptide mass
fingerprinting (PMF) results for each specific plant species. Thus we tried to build
the protein database by translating the large number of coding region sequences
obtained from EST analysis, together with PMF software working on these
databases. Therefore, in this study, we tried first to build individual systems for
EST-based data analysis, regulatory motif information from chromosomal mapping
of ESTs, microarray data and PMF information from bench work, and finally to
integrate these individual systems.
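The translation step described above can be sketched in a few lines. This is a
minimal illustration, not the authors' pipeline: it assumes each coding
sequence is already in the correct reading frame, uses the standard genetic
code, and the names translate and build_protein_db are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence in frame 1 up to the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def build_protein_db(coding_sequences):
    """Map each sequence identifier to its translated protein sequence."""
    return {seq_id: translate(seq) for seq_id, seq in coding_sequences.items()}


if __name__ == "__main__":
    # Hypothetical identifiers and sequences, for illustration only.
    print(build_protein_db({"est_001": "ATGGCCTAA", "est_002": "ATGTGGAAATGA"}))
```

A database built this way could then be searched by PMF software in the same
manner as a public protein database.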
1. Obtain a list of genes or gene products known to be involved with the target
disease from the CBioC[5] database.
2. Apply heuristics to unify variants of extracted names, and use HUGO [8] to
normalize both the set obtained in the previous step and the names stored in
CBioC. This will be referred to as the initial set.
4. Apply a heuristic scoring formula to the extended set to predict the proteins most
likely related to the disease.
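The source does not give the scoring formula, so the following sketch is only
one plausible reading of the steps above, offered as an assumption: names are
unified with a crude normalizer, and each protein outside the initial set is
scored by the number of distinct initial-set members it interacts with. The
functions normalize and score_candidates, and all gene and protein names, are
hypothetical.

```python
from collections import defaultdict


def normalize(name):
    """Crude unification of name variants: upper-case, drop spaces and hyphens."""
    return name.upper().replace("-", "").replace(" ", "")


def score_candidates(initial_names, interactions):
    """Rank proteins outside the initial set by their links into it.

    interactions is an iterable of (name_a, name_b) pairs. The score of a
    candidate is the number of distinct initial-set partners it has; this
    connectivity count is an illustrative assumption, not the source formula.
    """
    initial = {normalize(n) for n in initial_names}
    partners = defaultdict(set)
    for a, b in interactions:
        a, b = normalize(a), normalize(b)
        if a in initial and b not in initial:
            partners[b].add(a)
        elif b in initial and a not in initial:
            partners[a].add(b)
    scored = ((name, len(linked)) for name, linked in partners.items())
    return sorted(scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    initial = ["GENE-A", "Gene B"]
    pairs = [("gene a", "P1"), ("GENEB", "P2"), ("Gene-A", "P2")]
    print(score_candidates(initial, pairs))
```

In practice the normalizer would defer to an authoritative nomenclature such as
HUGO rather than to string clean-up alone.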
A human gene association study often involves several genomic markers such as
single nucleotide polymorphisms (SNPs) or short tandem repeat polymorphisms,
and many statistically significant markers may be identified during the study.
GenoWatch can efficiently extract up-to-date information about multiple markers
and their associated genes in batch mode from many relevant biological databases
in real-time. The comprehensive gene information retrieved includes gene
ontology, function, pathway, disease, related articles in PubMed and so on.
Subsequent SNP functional impact analysis and primer design of a target gene for
re-sequencing can also be done in a few clicks. The presentation of results has been
carefully designed to be as intuitive as possible to all users. GenoWatch is
available at the website http://genepipe.ngc.sinica.edu.tw/genowatch.
7. ORIEL
Introduction
The ORIEL Project (Online Research Information Environment for the Life
Sciences) is a European project that will develop tools and procedures to promote
access to and integration of a wide range of information resources in the life
sciences.
The tools developed through ORIEL will:
- enable effective linking of different types of biological information
(literature, factual and multimedia databases)
- make navigation easy, thereby encouraging the creative exploration of the
information landscape
- facilitate communication by making data presentation and information
visualisation user-friendly.
Project Description
Aims
Methodologies
Developments
Methods leading to the creation of new concepts of the scientific literature, based
on machine-understandable documents.
Technologies permitting effective linking of a wide range of biological digital
information sources, including molecular, genomic and multi-dimensional image
databases, promoting ease of cross-database navigation, leading to creative
exploration of the information landscape.
Protocols facilitating effective data representation and information
visualisation through the construction of adaptive interfaces that meet the needs of
individual users.
Background
have become important growth areas in the European life sciences industry.
Genomics research is characterized by the production of vast amounts of raw and
derived data. The integration of the exponentially growing amounts of these and
associated biological information in digital form (publications, sequence and
sequence-related information, digital image data) is presenting one of the most
demanding current challenges to information technology. There is an urgent need
to better exploit the potential of the Internet and other communication networks to
develop novel technology and intelligent middleware for the integration of large,
complex and disparate information resources.
Objectives
The ORIEL project will explore and further develop methods, technologies and
protocols aimed at the integration, dissemination and exploitation of large,
complex and disparate digital information resources. With a view to making such
technologies widely available, it will focus on the Life Sciences as a data-intensive
and highly demanding testbed that will:
- permit effective linking of different types of biological information displaying
complex inter-relationships (literature, factual and multi-media image databases)
- promote ease of navigation leading to creative exploration of the information
landscape and facilitate user-friendly data presentation and information
visualisation.
Milestones
The development of new concepts that will enhance the efficiency of integration of
different types of biological data currently maintained in a wide spectrum of digital
collections and resources across Europe.
This is a recently developed technique for the analysis of gene expression and has
the following features.
• The expression of many genes can be investigated at the same time (i.e. in one
experiment)
• Based on two RNA samples, a control and a sample of interest (e.g. heat stressed/
mutant)
Limitations:-
• High tech.
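For a two-sample comparison of the kind listed above (a control and a sample of
interest), expression changes are commonly summarised as a log2 ratio per gene.
The sketch below is a minimal illustration under stated assumptions: the
intensity values, the pseudocount of 1 and the two-fold threshold are all
hypothetical choices, not values taken from the text.

```python
import math


def log2_ratios(control, sample, pseudocount=1.0):
    """Per-gene log2(sample / control) for genes present in both samples.

    The pseudocount keeps the ratio defined when an intensity is zero.
    """
    return {
        gene: math.log2((sample[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in sample
    }


def changed_genes(ratios, threshold=1.0):
    """Genes whose absolute log2 ratio reaches the threshold (1.0 = two-fold)."""
    return sorted(gene for gene, ratio in ratios.items() if abs(ratio) >= threshold)


if __name__ == "__main__":
    # Hypothetical intensities for three genes.
    control = {"g1": 99, "g2": 99, "g3": 199}
    stressed = {"g1": 399, "g2": 99, "g3": 49}
    ratios = log2_ratios(control, stressed)
    print(ratios)
    print(changed_genes(ratios))
```

A positive ratio indicates higher expression in the sample of interest and a
negative ratio lower expression; real analyses also normalise the arrays and
test for statistical significance.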
1. Allele Mining for Stress Tolerance Genes in Oryza Species and Related
Germplasm.
2. Allele mining and sequence diversity at the wheat powdery mildew resistance
locus Pm3.
12. Gene mining in African rice germplasm to improve drought resistance in rainfed
production systems for resource-poor farmers of Africa
13. Mining the Epigenome for Methylated Genes in Lung Cancer
The international project to sequence the genome of Oryza sativa L. cv. Nipponbare
has made allele mining possible for all genes of rice. Scientists used a rice
calmodulin gene, a rice gene encoding a late embryogenesis-associated protein,
and a salt-inducible rice gene to optimize the polymerase chain reaction (PCR) for
allele mining of stress tolerance genes on identified accessions of rice and related
germplasm. Two sets of PCR primers were designed for each gene. Primers based
on the 5' and 3' untranslated regions of genes were found to be sufficiently
conserved to be effective over the entire range of germplasm in rice for which
the concept of allelism is applicable.
However, the primers based on the adjacent amino (N) and carboxy (C)
termini amplified additional loci. Field-based phenotyping of germplasm identifies
tolerant accessions, and biochemical and physiological analysis groups them. The
existing and emerging tools of genomics and proteomics help to identify key genes
or key members of a gene family involved in each mechanism. The technique of
choice for allele mining is PCR. Gene-specific primers amplify the DNA of each
accession, and the amplified product represents either the entire allele or some
functional component of the allele, such as the promoter or the coding sequences.
Plant science research has reached the post-genome era with the completion
of the genome sequences of both a dicotyledonous (Arabidopsis thaliana) (The
Arabidopsis Genome Initiative 2000) and a monocotyledonous (rice: Oryza sativa)
species (Yu et al. 2002). These genome sequences obtained through publicly
funded research have been made available through the Internet with new sequence
information appearing each day. In addition, large collections of cDNA libraries
derived from different plant tissues or growth conditions have been subjected to
single-pass sequencing, often from the 3' ends, to derive expressed sequence tag
(EST) databases (Bennetzen 1999, Quackenbush et al. 2000). All these data
present an opportunity for researchers to enhance studies on non-model crop plants
by identifying homologues in the more tractable model species. This can lead to
design of experiments, such as the study of mutants in the model plant, which can
provide rapid answers to gene function in the crop plant. The plant cell wall, often
containing a matrix of pectic components, is the first line of defence against fungal
pathogens (Esquerre-Tugaye et al. 2000). In addition, pectic fragments broken
down from plant cell walls are elicitors of the plant defense response (Boudart et
al. 1998). Several lines of evidence indicate that the polygalacturonase inhibiting
protein (PGIP), which is associated with the cell walls of many plants has a role to
play in plant resistance to fungal pathogens (De Lorenzo and Cervone 1997). This
work was initiated to identify a homologue of the gene for PGIP in A. thaliana
since it has relevance as a model system for protein-protein interactions, as well as
practical application in engineering fungal resistance in crop plants (Powell et al.
2000). PGIPs have been identified in a variety of plant species such as bean, pear
and apple (Toubart et al. 1992, Stotz et al. 1993, Arendse et al. 1999). PGIPs are
characterised by their ability to bind to fungal polygalacturonases (PGs), and this
has led to the hypothesis that PGIP plays a role in the plant defence response by
modulating the activity of endo-PGs produced by invading fungi (Cervone et al.
1989). In addition, PGIPs are interesting for protein-protein interaction studies,
since they are made up of leucine-rich repeats (LRRs) (De Lorenzo and Cervone
1997). The main model system for studying this process has been the interaction
between the Fusarium moniliforme PG and the bean PGIP (Desiderio et al. 1997).
These studies have given in vitro evidence for this protein-protein interaction and
enabled identification of specific PGIP amino acids in this interaction (Leckie et al.
1999). However, this is a heterologous system that uses a tobacco
expression system for production of variants of the bean PGIP. Testing of the
hypothesis in vivo has been hampered by lack of a model plant system, which can
be readily transformed and manipulated. This provided the rationale for searching
for a pgip homologue in the model dicotyledon A. thaliana.
was mined based on nucleotide and amino acid sequence alignment with the
existing aminopeptidase related sequences
ALL AML
Although there are some variations between the different classifiers, on average all
four subsets (three best trees and the feature set with the 20 relevant genes)
identified by our ensemble approach perform comparably with or better than the
feature set with all 2000 genes. Best tree 3, although it does not include the top
gene (M26383), achieves the highest performance across the multiple external
classifiers and even performs better than the feature set of the top 20 colon
tumor-relevant genes, with the highest performance (92.1%) attained using an SVM
with a polynomial 2-D kernel, which is the highest attained so far. The second best
feature set is the top 20 relevant genes, reflecting the fact that the relevant genes
are extracted from trees, which are in turn built with a target of high classification
performance given a data structure. Nevertheless, this feature set is neither
necessarily the most economical (minimal) nor the most efficient set for
classification or prediction because there are 'redundant' features among the top 20
genes (e.g. the two replicates of R39465). Indeed, mining these 'redundant' genes
is one of the major goals of ensemble decision analysis of microarrays.
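The passage says that the relevant genes are extracted from the trees of the
ensemble but does not spell out how. One simple possibility, shown here purely
as an assumption, is to count in how many trees each gene is used as a split
variable and keep the most frequent ones. The representation of a tree as a
list of gene identifiers, the function rank_genes and the identifiers G1 to G3
are all hypothetical.

```python
from collections import Counter


def rank_genes(trees, top_k=20):
    """Rank genes by the number of trees in the ensemble that use them.

    trees is a list in which each tree is given as the list of gene
    identifiers appearing at its split nodes. A gene is counted once per
    tree; ties are broken alphabetically.
    """
    counts = Counter()
    for split_genes in trees:
        counts.update(set(split_genes))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    ensemble = [["G1", "G2"], ["G1", "G3", "G1"], ["G2", "G1"]]
    print(rank_genes(ensemble, top_k=2))
```

Genes that are near-duplicates of one another tend to appear in different
trees of the ensemble, which is one way the 'redundant' features mentioned
above can surface in such a ranking.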
insurance, etc.). Furthermore, there may be a long latency between the analysis of
the genetic test and the clinical expression of the disease and wide differences in
the disease patterns. Consequently, information about some genetic test data may
stigmatize patients leading to poor quality of life. This has raised the issue of
‘genetic exceptionalism’ justifying specific regulation of use of genetic
information.
Discussions on how to handle sampling and data are ongoing within the industry
and the regulatory sphere, the European Agency for the Evaluation of Medicinal
Products (EMEA) having issued a position paper, the Council for International
Organizations of Medical Sciences (CIOMS) having a working group on this issue,
and the European Society of Human Genetics preparing a background paper on
‘Polymorphic sequence variants in medicine: Technical, social, legal and ethical
issues. Pharmacogenetics as an example’. Within the European project Privacy in
Research Ethics and Law (PRIVIREAL), recommendations for common European
guidelines for membership in research ethical committees have been discussed,
balancing the interests and assuring independence and legal competence. Good
decision making, assurance of the legality of protocols and assessment of data
protection are suggested to be part of any evaluation of protocols.
become a global public health burden, with 1.5 million deaths expected by 2010.
The high mortality from this disease stems from the lack of an effective screening
approach for early diagnosis and the refractoriness of advanced cancers to
conventional therapies, substantiating the need to develop more effective
targeted therapies and chemoprevention. Although smoking cessation does
reduce risk for lung cancer, approximately half of lung cancers diagnosed are in
former smokers. Adenocarcinoma is the major histologic type of cancer diagnosed
in smokers in the United States and now Europe. An incidence rate of 40% and up
to 80% has been reported for this histologic type of cancer in smokers and never
smokers, respectively, diagnosed with lung cancer. Non–small cell lung cancer
(NSCLC, comprising mainly adeno, squamous cell, and large cell carcinoma) is
diagnosed in approximately 80% of patients, while the remaining 20% of tumors
appear to be small cell lung cancer (SCLC). The detection of numerous cytogenetic
changes provided the first link to the molecular pathogenesis of lung cancer.
Mapping of chromosomal sites for rearrangement, breakpoints, and losses
revealed both common and distinct changes in SCLC and NSCLC. The commonality
for specific regions in the genome for allelic loss suggested the presence of tumor
suppressor genes (TSGs) within these loci. The retinoblastoma gene was the first
TSG linked to lung cancer. Loss of function of this gene through either deletion or
point mutation occurs in 90% of SCLC, while less than 15% of NSCLCs harbor
changes in this TSG. The second major TSG inactivated in lung cancer is p53.
Although p53 inactivation is common across many malignancies, the mutation
spectrum within this gene tracks with specific tumor types.
In lung cancer, the most common mutation seen is the G:C to T:A transversion, an
alteration potentially stemming from the inability to repair DNA damage caused
These studies were focused on breast and colon cancer, but most likely reflect the
paradigm seen in lung cancer studies that have evaluated candidate genes
discovered through various screening modalities. In this whole genome
sequencing study, approximately 80 gene mutations were identified that alter
amino acids. What was surprising was that the prevalence of the majority of these
mutations in primary tumors was less than 5%. The authors concluded that these
minor mutations would each be associated with a "small fitness advantage" that
would drive tumor progression, and thus, it is not the most common genetic
changes but these rare changes that dominate the cancer genome landscape.
While this is an interesting hypothesis, the emergence of epigenetic modifications
of critical regulatory genes indicates that the epigenome may play an equal, if not
greater, role in driving cancer initiation and progression than genetic mutations.
The most common epigenetic change in cancer is methylation of DNA at the fifth
position of the cytosine ring. Cytosine located 5' to guanine (CpG) is the prime
target of methylation in the mammalian genome, and this dinucleotide occurs at
a much higher frequency than expected from a random genome-wide
distribution in regions called CpG islands. About 50% of human promoters contain
CpG islands that often extend into exon 1 of many critical regulatory genes. When
DNA hypermethylation occurs within a CpG island located in the promoter region
of a gene, it is accompanied by histone modifications (such as acetylation,
methylation, or phosphorylation of histone tails) within the island. Together, these
two epigenetic changes create a closed chromatin configuration around the
promoter region, denying access to RNA polymerase and regulatory proteins
needed for transcription.
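Whether a stretch of sequence is CpG-rich in the sense described above can be
checked with two simple statistics: GC content and the ratio of observed to
expected CpG dinucleotides. The sketch below is illustrative; the thresholds
(length of at least 200 bp, GC content above 50%, observed/expected ratio above
0.6) are widely used conventions that are assumed here and do not come from
the text, and the function names are our own.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA string.

    Expected CpG count is (number of C * number of G) / sequence length.
    """
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_content = (c_count + g_count) / n if n else 0.0
    if c_count and g_count:
        obs_exp = (cpg_count * n) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC-content and observed/expected thresholds."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp


if __name__ == "__main__":
    # Synthetic sequences, for illustration only.
    print(looks_like_cpg_island("CG" * 100))
    print(looks_like_cpg_island("AT" * 100))
```

A sliding window over a promoter sequence, scored with these two statistics,
is the usual way to locate candidate islands before designing
methylation-specific assays.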
The end result of this process is loss of gene transcription and hence "silencing of
gene function." With the development of the methylation-specific PCR assay that
can screen for gene methylation in specific promoters, there has been
tremendous growth over the past decade in the identification of genes that are
silenced in lung cancer through promoter hypermethylation. Transcriptional
silencing by CpG island hypermethylation now rivals genetic changes that affect
coding sequence as a critical trigger for neoplastic development and progression.
Genes responsible for all types of normal cellular function are targeted for
5. FUTURE SCENARIO:-
last two years over 100 new genes have been identified that are associated with
risk of developing different types of diseases. Not just cancer, but other diseases
that often we regard as lifestyle diseases: the risk of developing diabetes, your
propensity towards obesity, high blood pressure, neurological diseases and so on.
These new genes offer tremendous opportunities for prevention and
even better therapy. There are also big ethical questions still to be answered
about tinkering with these basic building blocks of life.
Global gene mining and the pharmaceutical industry
Worldwide efforts are ongoing in optimizing medical treatment by searching for
the right medicine at the right dose for the individual. Metabolism is influenced by
polymorphisms, which may be tested by relatively simple SNP analysis, although
this requires DNA from the individuals tested. Target genes for the efficacy of a
given medicine or predisposition to a given disease are also subject to population
studies, e.g., in Iceland, Estonia, Sweden, etc. For hypothesis testing and
generation, several bio-banks with samples from patients and healthy persons
within the pharmaceutical industry have been established during the past 10
years. Thus, more than 100,000 samples are stored in the freezers of either the
pharmaceutical companies or their contractual partners at universities and test
institutions.
Ethical issues related to data protection of the individuals providing samples to
bio-banks are several: nature and extent of information prior to consent,
coverage of the consent given by the study person, labeling and storage of the
sample and data (coded or anonymized). In general, genetic test data, once
obtained, are permanent and cannot be changed. The test data may imply
information that is not beneficial to the patient and his/her family (e.g.,
The published mouse genome sequence has already made a huge impact on the
research community. Although only a draft, it is clear that the sequence is a very
high-quality product, with excellent coverage and reliability over large genomic
expanses. It is a huge asset to researchers, and its significance matches that of the
human genome. In the past six months, for example, the Ensembl genome
But there is one important difference between these two resources — the mouse
genome encodes an experimentally tractable organism. This means that it is now
truly possible to determine the function of each and every component gene by
experimental manipulation and evaluation, in the context of the whole organism.
6. CONCLUSION :-
Work on gene mining is expanding rapidly worldwide. Indian scientists are also
working hard, but some bottlenecks keep them from matching the pace. Gene mining
is a boon not only for plant biotechnology but equally for the animal sciences. It
provides molecular biologists with a powerful and usable tool for
extracting disease-relevant genes, a major theme in the post-genomic era. The
technique still leaves open questions about target-driven gene function.
BLAST TYPES:-
8. Patents (program)
Peptide Mass Fingerprinting Database Management program using
AMWISE and fBIND technique
9. REFERENCES:
1. R. Latha, L. Rubia, J. Bennett and M. S. Swaminathan. 2004. Allele mining for
stress tolerance genes in Oryza species and related germplasm.
Molecular Biotechnology 27: 101-108.
4. Graciela Gonzalez, Juan C. Uribe, Luis Tari, Colleen Brophy and Chitta Baral.
2007. Mining gene-disease relationships from biomedical literature:
weighting protein-protein interactions and connectivity measures. Pacific
Symposium on Biocomputing 12: 28-39.
On Line references:-
a) http://www.pubgene.org.
b) http://www.gene.ucl.ac.uk
c) http://microarray.princeton.com
d) http://www.ncbi.nlm.nih.gov/
e) http://www.tigr.org/tdb/tgi/agi/
f) http://genepipe.ngc.sinica.edu.tw/genowatch