Abstract
Introduction
Gene function and metabolomics vary over time because of changes in gene
expression: genes do not always remain active in the cell, and their
activation depends on the requirement for their products, which varies over time on the
8 P. Katara, M. Agarwal, G. Jeena, S. Karkra, Ishu Sharma and Vinay Sharma
basis of internal and external cellular conditions. Thus, to understand the cellular
function of any cell or organism, it is very important to understand its gene
regulation system.
The identification of the repertoire of regulatory elements in a genome is one of
the major challenges in modern biology and is also important for the study of gene
expression and its regulatory machinery. Proteins responsible for transcription
initiation and its rate are called transcription factors (TFs). Transcription is initiated by
the interaction of these TFs with cis-regulatory elements in DNA, which act as binding
sites and are known as transcription factor binding sites (TFBSs); these play an
essential role in transcription initiation (Alberts et al., 2002).
Regulatory elements are short DNA sequences (5–20 bp in length) that
determine the timing, location and level of gene expression (Dominic et al., 2004).
They are generally located upstream of a gene's transcription start site and,
through interaction with specific transcription factors, modulate the
expression patterns of the genes in a genome. In recent years, sequences with known
regulatory elements have become readily available owing to large-scale
sequencing of many genomes. Meanwhile, technologies such as microarrays and
ChIP-on-chip (or GSLA, for Genome-Scale Location Analysis) make it feasible to identify
potential targets of transcription factors. Among the available techniques, microarray
technologies allow direct measurement of the expression level of every gene
in a cell (DeRisi et al., 1997), so that sets of genes involved in a defined
cellular process can be characterized. In addition, many computational methods, such as
MEME, AlignACE and Consensus (Wasserman and Sandelin, 2004), and algorithms
such as phylogenetic footprinting (Mathieu and Martin, 2002), have been developed in the
past decade to predict regulatory elements and regulatory motifs (Yueyi et al., 2004;
Yoon et al., 2005).
Humans have been affected by various pathogens for years. These pathogens are
of various types; some are very dangerous and some are less harmful, but all
cause some harm to their host. It is therefore very important to
understand their genes and to gain insight into the activity of these pathogens and
their control.
M. tuberculosis is a small obligate aerobe and the causative agent of the chronic
infectious disease tuberculosis, which is responsible for the death of countless millions of
people; it was first described by Robert Koch in 1882. The complete genome
sequence of the laboratory strain M. tuberculosis H37Rv has already been determined and
analyzed in order to improve our understanding of the biology of this pathogen and to
help the conception of new prophylactic and therapeutic interventions. Its genome
comprises 4.41 million base pairs and contains approximately 4048 predicted
genes (Cole et al., 1998). Although antibiotics and other drugs have already been developed,
M. tuberculosis remains very dangerous because of its complex life cycle and acquired
resistance to them (Camus et al., 1998).
Our present work focuses on predicting the regulatory element sequence
patterns of the virulence genes responsible for the pathogenicity of M. tuberculosis by
combining information on coexpressed genes from gene expression data (Sand and
In-silico prediction of the regulatory element patterns of human pathogen 9
Methodology
In the present study, to mine the regulatory sequences of virulence genes of
M. tuberculosis, virulence genes were collected from available databases such as VFDB
(Chen et al., 2005; Yoon et al., 2007) and by data mining of various research
articles (Cole et al., 1998; Camus et al., 2002). The gene sequences of interest were
downloaded from these databases for further use. For M. tuberculosis, gene expression
data from the Stanford Microarray Database (SMD) (Sherlock et al., 2001) were stored and
scaled by taking the logarithm of the expression ratio, in preparation for clustering of
the genes (Catherine et al., 2005).
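The log-ratio scaling described above can be illustrated with a minimal sketch. The text does not state the logarithm base, so base 2 (common for two-channel microarray data) is an assumption here, and the gene names and intensities are hypothetical.

```python
import math


def log2_ratio(sample, reference):
    """Return log2(sample / reference), or None if the ratio is undefined."""
    if sample <= 0 or reference <= 0:
        return None  # non-positive intensities give no valid log ratio
    return math.log2(sample / reference)


# Hypothetical (sample, reference) intensities per gene; not data from the study.
raw = {"geneA": (800.0, 200.0), "geneB": (150.0, 600.0), "geneC": (500.0, 500.0)}

# Positive values mean higher expression in the sample, negative values lower.
scaled = {gene: log2_ratio(s, r) for gene, (s, r) in raw.items()}
```

On this scale a fourfold increase and a fourfold decrease become +2 and -2, so up- and down-regulation are treated symmetrically by the clustering step.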
For clustering genes on the basis of expression values (Schena et al., 1996;
Yenug et al., 2004), the Cluster software from the Eisen lab was used (Eisen et al., 1998). As
we are interested in genes responsible for virulence, after K-means
clustering only those clusters containing our genes of interest (virulence genes) were
selected, and all genes in such clusters were considered probable virulence
genes.
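The cluster selection step can be sketched as follows. This is only an illustration of the filtering logic, assuming the K-means output is available as a gene-to-cluster mapping; the function name and gene identifiers are hypothetical and this is not the Cluster software itself.

```python
from collections import defaultdict


def select_virulence_clusters(assignments, known_virulence):
    """Keep only clusters that contain at least one known virulence gene.

    assignments: dict mapping gene name -> cluster id (e.g. K-means output)
    known_virulence: iterable of known virulence gene names
    Returns (selected, probable): selected maps cluster id -> sorted gene list,
    probable is the set of co-clustered genes that are not already known.
    """
    known = set(known_virulence)
    clusters = defaultdict(list)
    for gene, cluster_id in assignments.items():
        clusters[cluster_id].append(gene)
    selected = {
        cid: sorted(genes)
        for cid, genes in clusters.items()
        if known.intersection(genes)
    }
    probable = {g for genes in selected.values() for g in genes} - known
    return selected, probable


# Hypothetical example: g1 and g5 are known virulence genes.
assignments = {"g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 2}
selected, probable = select_virulence_clusters(assignments, {"g1", "g5"})
```

Here cluster 1 is discarded because it holds no known virulence gene, and g2 becomes a probable virulence gene through co-clustering with g1.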
Further, to analyze the regulatory sequence patterns of coexpressed
virulence genes, we used the RSAT program: we first retrieved their upstream regions and then
performed oligo-analysis and dyad-analysis (van-Helden et al., 2000; Thomas-Chollier
et al., 2008; Defrance et al., 2008). The cluster information obtained above
served as input for upstream-region retrieval, oligo-analysis and dyad-analysis for M.
tuberculosis; this approach is based on comparative analysis (Lenhard and Wasserman, 2002).
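To make the idea behind oligo-analysis concrete, the following simplified sketch counts oligonucleotides (here hexamers) across a set of upstream sequences. It is not the RSAT implementation: it only counts occurrences and applies no background model or significance statistic. The word length of 6 and the example sequences are assumptions chosen for illustration.

```python
from collections import Counter


def count_oligos(sequences, k=6):
    """Count every overlapping k-mer made only of A, C, G, T in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # skip words containing N or other symbols
                counts[word] += 1
    return counts


# Hypothetical upstream sequences of one cluster of coexpressed genes.
upstream = ["TTGACATTGACA", "GGTTGACAGG"]
counts = count_oligos(upstream, k=6)
top_word, top_count = counts.most_common(1)[0]
```

In a real analysis the observed counts would be compared with the counts expected by chance before a word is reported as a candidate regulatory pattern.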
Upstream region
To analyze the upstream regions of virulence and probable virulence genes, all
coexpressed genes were submitted to the RSAT server. In total, 75 clusters comprising 1622 genes
were submitted to RSAT to find the regulatory upstream regions. As a result, we
found various patterns of upstream regions, differing in size, distance from the
transcription start site (TSS) and sequence, for all clusters.
[Figure: bar chart "Upstream pattern". X-axis: distance from TSS (0; 1 to 10; 11 to 20; 21 to 50; 51 to 100; 100 to 200; 201 to 500; 501 to 1000; above 1000). Y-axis: No. of upstream regions.]
Across these 75 clusters, 298 genes yielded no upstream region, 195 genes had upstream
regions 1 to 20 bp long, 194 genes had upstream regions of 21 to 50 bp, 587 genes had
upstream regions of 51 to 200 bp and 304 genes had upstream regions at a distance of
201 to 1000 bp from the TSS, the maximum distance being 1063 bp. Most of the upstream
regions were found in the range of 100 to 200 bp, which is a common range for microbes
(Wasserman and Sandelin, 2004); this range facilitates analysis of the proximal promoter
region and of more distant regions.
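The grouping of upstream lengths into distance classes can be sketched as below. The bin edges follow the classes on the figure axis, except that the class printed as "100 to 200" is treated as 101 to 200 so that the bins do not overlap; that choice, the function name and the example lengths are assumptions.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper bound of each class; anything larger falls in "above 1000".
UPPER_BOUNDS = [0, 10, 20, 50, 100, 200, 500, 1000]
LABELS = [
    "0", "1 to 10", "11 to 20", "21 to 50", "51 to 100",
    "101 to 200", "201 to 500", "501 to 1000", "above 1000",
]


def distance_class(length):
    """Return the distance class label for an upstream length in bp."""
    if length < 0:
        raise ValueError("upstream length cannot be negative")
    return LABELS[bisect_left(UPPER_BOUNDS, length)]


# Hypothetical upstream lengths (bp) for a handful of genes.
lengths = [0, 5, 20, 150, 180, 1063]
histogram = Counter(distance_class(n) for n in lengths)
```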
Oligo-analysis
All 75 clusters were analyzed for oligomer occurrence. Of the 75 clusters,
36 gave a positive result: 23 clusters had a single
oligo-pattern, 11 clusters showed 2 oligo-patterns and the remaining 2 clusters had
seven patterns each (figure 2). This indicates that most of the virulence genes are
regulated by a single type of regulatory pattern, but some are controlled by more than
one regulatory pattern.
Figure 2: Bar diagram of oligo-analysis using RSAT (x-axis: patterns per cluster; y-axis: No. of clusters). Bars: 1 pattern, 23 clusters; 2 patterns, 11 clusters; 3 to 6 patterns, 0 clusters; 7 patterns, 2 clusters.
Among the clusters giving a single pattern, three pairs of clusters shared the same
conserved motif within each pair. Among the clusters with two patterns, 5 clusters and
1 remaining cluster had distinct motifs, while a pair of clusters and a group of three
clusters were found to share the same oligomer patterns, respectively. The remaining two
clusters shared the same seven patterns.
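The tally of patterns per cluster reported above can be reproduced with a short sketch. The helper name and the example input are hypothetical; the `reported` counts are the ones given in the text (23, 11 and 2 clusters with 1, 2 and 7 patterns).

```python
from collections import Counter


def patterns_per_cluster(cluster_patterns):
    """Map 'number of patterns' -> 'number of clusters', skipping empty clusters."""
    return Counter(len(p) for p in cluster_patterns.values() if p)


# Hypothetical motif-discovery output: cluster id -> list of detected patterns.
example = {"c1": ["ACGTAC"], "c2": ["TTGACA", "GGATCC"], "c3": [], "c4": ["CCGGAA"]}
hist = patterns_per_cluster(example)

# Counts reported in the text for the oligo-analysis.
reported = {1: 23, 2: 11, 7: 2}
positive_clusters = sum(reported.values())  # clusters with at least one pattern
```

Summing the reported counts gives the 36 positive clusters stated for the oligo-analysis.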
Dyad analysis
All 75 clusters of M. tuberculosis were subjected to dyad analysis, and 36 clusters
gave dyad results (figure 3).
Figure 3: Bar diagram of dyad analysis using RSAT (x-axis: patterns per cluster, 1 to 8; y-axis: No. of clusters). Bars: 1 pattern, 21 clusters; 2 patterns, 9 clusters; 3 to 8 patterns, bar values 3, 2, 1, 0, 0, 0.
Dyad analysis shows that, of the 36 clusters in which dyad patterns were present,
most showed a single (21) or double (9) dyad pattern. This
suggests that the virulence genes of M. tuberculosis have highly conserved and common
types of dyad patterns. It was also observed that, among the 36 clusters, 10 clusters gave
results with a spacer of 1, which means they also appear in the oligo-analysis results and
show the best occurrence significance, while the remaining 28 clusters had
spacers between 2 and 18, representing the variation among dyad-pattern binding
sites.
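A spaced dyad can be pictured as two short words separated by a fixed number of unspecified bases. The sketch below counts such pairs for every spacer length in a range. It illustrates the concept only and is not the RSAT dyad-analysis algorithm: no significance is computed. The word length of 3, the spacer range of 0 to 18 (the range reported in this study) and the example sequences are assumptions.

```python
from collections import Counter


def count_dyads(sequences, monad=3, min_spacer=0, max_spacer=18):
    """Count (left word, spacer length, right word) triples in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for spacer in range(min_spacer, max_spacer + 1):
            span = 2 * monad + spacer  # total length covered by one dyad
            for i in range(len(seq) - span + 1):
                left = seq[i:i + monad]
                right = seq[i + monad + spacer:i + span]
                if set(left + right) <= set("ACGT"):
                    counts[(left, spacer, right)] += 1
    return counts


# Hypothetical upstream sequences that both contain CGG, 4 bases, CCG.
upstream = ["AACGGTTTTCCGAA", "TTCGGAAAACCGTT"]
dyads = count_dyads(upstream)
```

A dyad whose flanking words are conserved while the spacer content varies, such as CGG and CCG four bases apart in this example, is the kind of signal that contiguous oligomer counting tends to miss.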
Conclusion
Our analysis reveals that some genes of M. tuberculosis show considerable similarity in
gene expression to well-documented virulence genes and can be considered
probable virulence genes; we found that 1622 genes shared more than 60% gene
expression pattern similarity with the pattern of virulence genes. Analysis of
all these genes, in the form of clusters, for upstream regions showed that most of the
genes contained upstream regions at a distance of 100 to 200 base pairs, which is a common
range for microbes (Wasserman and Sandelin, 2004).
The regulatory motif analysis shows the presence of different and significant
patterns: contiguous oligonucleotides and spaced dyads. Some clusters were found to
have similar conserved patterns in both oligo-analysis and dyad analysis, indicating that these
clusters may have the same regulatory elements. In the oligo-analysis we
found 23 clusters with a single pattern, of which 17 patterns were distinct. Another
11 clusters gave two patterns each, with a total of 12 distinct patterns. The remaining
two clusters, with 7 patterns each, shared all of them.
In the dyad analysis, 72 patterns were obtained from 36 clusters. The
dyads had spacer values ranging from 0 to a maximum of 18.
The largest numbers of patterns were obtained with spacers of 1 and 18. Of these, 4 distinct
patterns with a spacer of 1, 3 distinct patterns with a spacer of 2 and 5 distinct
patterns with a spacer of 18 were obtained.
References
[1] Alberts B, Johnson A, Lewis J, Raff M, Roberts K and Walter P (2002).
Molecular Biology of the Cell - Fourth Edition; Garland Science, New York
[2] Ball C A, Awad I A B, Demeter J, Gollub J, Hebert J M, Hernandez-Boussard
T, Jin H, Matese J C, Nitzberg M, Wymore F, Zachariah Z K, Brown
P O and Sherlock G (2005). The Stanford Microarray Database accommodates
additional microarray platforms and data formats. Nucleic Acids Res. 33: 580–582
[3] Camus J C, Pryor M J, Medigue C and Cole S T (2002). Re-annotation of the
genome sequence of Mycobacterium tuberculosis H37Rv. Microbiology. 44:
2967–2973
[4] Chen L, Jian Y, Jun Y, Zhijian Y, Lilian S, Yan S and Qi J (2005).
VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res.
33: D325-D328
[5] Cole S T, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, Gordon SV,
Eiglmeier K, Gas S, Barry C E 3rd, Tekaia F, Badcock K, Basham D, Brown
D, Chillingworth T, Connor R, Davies R, Devlin K, Feltwell T, Gentles S,
Hamlin N, Holroyd S, Hornsby T, Jagels K, Krogh A, McLean J, Moule S,
Murphy L, Oliver K, Osborne J, Quail MA, Rajandream MA, Rogers J, Rutter
S, Seeger K, Skelton J, Squares R, Squares S, Sulston JE, Taylor K,
Whitehead S and Barrell BG (1998). Deciphering the biology of
Mycobacterium tuberculosis from the complete genome sequence. Nature.
393: 537–544
[6] Defrance M, Janky R, Sand O and van-Helden J (2008). Using
RSAT oligo-analysis and dyad-analysis tools to discover regulatory signals in
nucleic sequences. Nat Protoc. 3:1589-1603
[7] DeRisi J L, Iyer V R, Brown P O (1997). Exploring the metabolic and genetic
control of gene expression on a genomic scale. Science. 278(5338):680-6
[8] Dominic J A, Isaac S, and Atul J (2004). Quantifying the relationship between
co-expression, co-regulation and gene function. BMC Bioinformatics. 5: 18