
Infection, Genetics and Evolution 32 (2015) 330–337

Contents lists available at ScienceDirect

Infection, Genetics and Evolution


journal homepage: www.elsevier.com/locate/meegid

Elucidating evolutionary features and functional implications of orphan genes in Leishmania major

Sumit Mukherjee a,b, Arup Panda a, Tapash Chandra Ghosh a
a Bioinformatics Centre, Bose Institute, P 1/12, C.I.T. Scheme VII M, Kolkata 700 054, West Bengal, India
b Department of Physical Sciences, Indian Institute of Science Education and Research-Kolkata, Mohanpur 741246, Nadia, West Bengal, India

Article info

Article history:
Received 13 January 2015
Received in revised form 25 March 2015
Accepted 26 March 2015
Available online 2 April 2015
Keywords:
Orphan genes
Evolutionary rate
Protein disorder
Interaction and trafficking motifs
Host–parasite interaction
Lineage-specific adaptation

Abstract
Orphan genes are protein coding genes that lack recognizable homologs in other organisms. These genes
were reported to comprise a considerable fraction of coding regions in all sequenced genomes and are
thought to be allied with organisms' lineage-specific traits. However, their evolutionary persistence
and functional significance still remain elusive. Because they lack homologs in the host genome and
probably have lineage-specific functional roles, orphan gene products of pathogenic protozoa might be
considered as possible therapeutic targets. Leishmania major is an important parasitic protozoan of
the genus Leishmania that is associated with the disease cutaneous leishmaniasis. Therefore, evolutionary
and functional characterization of orphan genes in this organism may help in understanding the factors
governing pathogen evolution and parasitic adaptation. In this study, we systematically identified orphan
genes of L. major and employed several in silico analyses to understand their evolutionary and functional
attributes. To trace the signatures of molecular evolution, we compared their evolutionary rate
with that of non-orphan genes. In agreement with prior observations, we noticed that orphan genes evolve
at a higher rate than non-orphan genes. Lower sequence conservation of orphan genes was
previously attributed solely to their younger gene age. However, here we observed that together with
gene age, a number of genomic (like expression level, GC content, variation in codon usage) and
proteomic factors (like protein length, intrinsic disorder content, hydropathicity) could independently
modulate their evolutionary rate. We considered the interplay of all these factors and analyzed their
relative contribution to protein evolutionary rate by regression analysis. On the functional level, we observed
that orphan genes are associated with regulatory, growth factor and transport related processes.
Moreover, these genes were found to be enriched with various types of interaction and trafficking motifs,
implying their possible involvement in host–parasite interactions. Thus, our comprehensive analysis of L.
major orphan genes provided evidence for their extensive roles in host–pathogen interactions and
virulence.
© 2015 Elsevier B.V. All rights reserved.

Abbreviations: L. major, Leishmania major; BLAST, Basic Local Alignment Search
Tool; GRAVY, grand average of hydropathy index; Nc, effective number of codons;
CAI, Codon Adaptation Index; FPKM, Fragments Per Kilobase of exon per Million
fragments mapped.
Corresponding author. Tel.: +91 33 2355 6626; fax: +91 33 2355 3886.
E-mail address: tapash@jcbose.ac.in (T.C. Ghosh).
http://dx.doi.org/10.1016/j.meegid.2015.03.031
1567-1348/© 2015 Elsevier B.V. All rights reserved.

1. Introduction
Orphan genes are protein coding genes that do not share detectable sequence similarity with the genomes of other organisms
(Tautz and Domazet-Loso, 2011). Due to their phylogenetic restriction these genes are also called lineage-specific or taxonomically
restricted genes (Wilson et al., 2005). Orphan genes comprise a
considerable fraction of genes in all domains of life including
viruses (Khalturin et al., 2009; Wilson et al., 2005; Yin and
Fischer, 2008). These genes can be broadly classified into two categories: (i) taxon-specific orphan genes (TSOGs) that lack homology
outside of a focal taxonomic group and (ii) species-specific orphan
genes (SSOGs), a subset of TSOGs sharing no homology with any
gene in any other species (Wissler et al., 2013). Several hypotheses
have been put forward to explain the origin of orphan genes. For
instance, gene duplication and rearrangement processes followed
by rapid divergence were considered to be an important pathway
for the emergence of orphan genes in primates, Arabidopsis and
zebrafish (Donoghue et al., 2011; Toll-Riera et al., 2009; Yang
et al., 2013). In primates, it has been found that the majority of
orphan genes arise from frequent recruitment of transposable elements (Toll-Riera et al., 2009). Orphan genes may also arise de
novo from non-coding regions (Cai et al., 2008; Heinen et al., 2009;
Knowles and McLysaght, 2009; Neme and Tautz, 2013; Wu et al.,
2011; Xie et al., 2012; Yang and Huang, 2011). These genes were
also found to emerge from overlapping of anti-sense reading
frames and frameshift mutations in protein coding sequences
(Wissler et al., 2013).
Orphan genes are emerging as critical players in the lineage-specific adaptation of different species to a broad range of ecological conditions (Khalturin et al., 2009). These genes were reported
to play substantial roles in response to a variety of abiotic stresses
in plant genomes (Donoghue et al., 2011). Important roles of
orphan genes were also evidenced in several developmental processes. For instance, orphan gene products were found to be crucial
for human early brain development (Zhang et al., 2011) and also
for regulation of tentacle formation in Hydra species (Khalturin
et al., 2008). Lineage-specific putative surface antigens of Plasmodium were shown to be involved in host–parasite interactions
(Kuo and Kissinger, 2008). In 2010, Zhang and Matlashewski ectopically
expressed 14 Leishmania donovani-specific genes in Leishmania
major and observed that two of these genes could increase L. major
survival in visceral organs (Zhang and Matlashewski, 2010).
Studies conducted on different eukaryotes demonstrated that
orphan genes evolve faster than non-orphan genes (Cai et al.,
2006; Domazet-Loso and Tautz, 2003; Donoghue et al., 2011;
Kuo and Kissinger, 2008; Toll-Riera et al., 2009). An inverse
relationship between gene age and protein evolutionary rate has
been widely observed in a broad range of organisms including primates (Toll-Riera et al., 2009), mammals (Alba and Castresana,
2005), Drosophila (Domazet-Loso and Tautz, 2003), Plasmodium
(Kuo and Kissinger, 2008), fungi (Cai et al., 2006) and bacteria
(Daubin and Ochman, 2004). Since orphan genes are younger
genes in a particular lineage, it was hypothesized that these genes
evolve faster mainly due to their recent evolutionary origin (Cai
et al., 2006; Domazet-Loso and Tautz, 2003; Toll-Riera et al.,
2009). Later it was found that protein evolutionary rate could not
be determined by a single factor; rather, proteins' intrinsic properties as well as their evolutionary age independently modulate the
rates of protein evolution (Toll-Riera et al., 2012). Protein
evolutionary rate was shown to correlate with a number of gene
level and protein level attributes, such as expression level
(Drummond et al., 2005; Drummond et al., 2006; Pal et al.,
2001), number of protein–protein interactions (Fraser et al.,
2002), protein complex number (Chakraborty and Ghosh, 2013),
its centrality in the protein interaction network (Hahn and Kern,
2005), protein dispensability (Hirsh and Fraser, 2001), sequence
length (Marais and Duret, 2001), Codon Adaptation Index (CAI),
effective number of codons (Nc) (Pal et al., 2001; Wall et al.,
2005), protein disorder content (Chen et al., 2011; Podder and
Ghosh, 2010), etc. In spite of all these findings, the factors determining
the evolutionary rate of orphan genes are still under debate and
the relative contribution of different genomic and proteomic attributes on the evolutionary rates of orphan genes remains elusive.
With the availability of high-throughput genomic sequences
together with expression data and bioinformatics prediction tools,
it has now become easier to identify and characterize orphan genes
in different species. L. major is one of the most important protozoan parasites of the genus Leishmania. It is associated with the
disease cutaneous leishmaniasis, affecting more than 2 million
people throughout the world every year (Ivens et al., 2005). In spite
of multiple research endeavors, to date there is no available vaccine for this disease. Because of their absence in the host genomes,
orphan gene products in pathogenic protozoa were considered to
be possible therapeutic targets (Kuo and Kissinger, 2008).
Therefore, profiling orphan genes of L. major from the perspective
of protein evolutionary rates and comparing them with non-orphan genes, along with understanding their functional roles, will
be helpful to recognize the molecular signature of parasitic adaptation. With this aim we carried out rigorous analysis to understand
the functionality of orphan genes and investigated the evolutionary forces affecting orphan gene evolution. To evaluate the attributes of orphan genes in the evolutionary framework we
performed a comprehensive analysis comparing orphan genes with
the non-orphan genes. In this study our primary objective is to
characterize all the possible determinants that may have shaped
the evolutionary rate of orphan genes in L. major. One of the main
obstacles to such a study is the limited availability of data on
orphan genes. Therefore, in this study we considered several genomic
and proteomic attributes that could be easily identified from coding sequences and analyzed their relative influence on the
evolutionary rate heterogeneity between orphan and non-orphan
genes.
Confirming earlier observations, our study revealed that orphan
genes evolve faster than non-orphan genes (Domazet-Loso and
Tautz, 2003; Toll-Riera et al., 2009). However, contrary to the
suggestions of those studies, we found that gene age could
account for only a fraction of the variation in their evolutionary rate.
Together with gene age, a number of factors like gene
expression, codon bias, genic GC content, protein hydropathicity,
protein disorder content and protein length were found to contribute
substantially to the evolutionary rate difference
between orphan and non-orphan genes. On the functional level, we
found that sequences of orphan genes are endowed with host-targeting motifs, prenylation motifs, heparin-binding consensus
sequences, signal peptides and transmembrane domains, implying
their possible roles in host–parasite interactions. Thus, our study
on orphan genes of L. major sheds light on the factors governing
pathogen evolution and reveals their contribution to parasitic
adaptations.
2. Materials and methods
2.1. Collection of dataset and gene expression data
We retrieved the protein coding sequences of L. major (strain
Friedlin) from TriTrypDB version 7.0 (http://tritrypdb.org/tritrypdb/) (Aslett et al., 2010). CDS sequences containing internal stop
codons and partial codons were removed using CodonW
(http://codonw.sourceforge.net). Signal peptide, transmembrane
domain, epitope, paralog and pathway information for all L. major
genes was downloaded from TriTrypDB version 7.0. To compute
gene expression level, we retrieved high-throughput RNA-seq
expression profile data of the L. major promastigote stage from the
dataset of Rastrojo et al. (2013). We searched for protein domains
via InterProScan (Zdobnov and Apweiler, 2001).
2.2. Identification of orphan genes
To identify orphan gene models that are restricted to the
Leishmania genus, we used a systematic approach based on homology
search. First, a BLASTP followed by TBLASTN filtering approach
(E < 10⁻⁵ with low-complexity filters) was used against
NCBI nr databases. Additionally, to further screen for similarity
between sequences we employed Position-Specific Iterated BLAST
(PSI-BLAST) (Altschul et al., 1997), which can detect weaker homologous relationships that would otherwise be missed by the standard
BLAST algorithms.
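
The homology filter described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: it assumes that the BLAST results have already been reduced to (query ID, subject genus, E-value) tuples, a hypothetical simplified layout, and it treats a gene as a genus-restricted orphan when no hit below the E-value cut-off falls outside Leishmania. The gene names and hits are invented for illustration.

```python
E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def find_orphans(all_genes, hits, focal_genus="Leishmania", cutoff=E_VALUE_CUTOFF):
    """Return genes with no significant hit outside the focal genus.

    hits: iterable of (query_id, subject_genus, e_value) tuples. This
    three-column layout is a hypothetical simplification of BLAST output.
    """
    has_outside_homolog = set()
    for query_id, subject_genus, e_value in hits:
        # A hit counts as homology only if it passes the E-value cut-off
        # and comes from an organism outside the focal genus.
        if e_value < cutoff and subject_genus != focal_genus:
            has_outside_homolog.add(query_id)
    return sorted(g for g in all_genes if g not in has_outside_homolog)


# Invented example data
genes = ["geneA", "geneB", "geneC"]
hits = [
    ("geneA", "Leishmania", 1e-50),
    ("geneA", "Trypanosoma", 1e-20),   # significant hit outside the genus
    ("geneB", "Leishmania", 1e-30),
    ("geneB", "Homo", 0.5),            # not significant, ignored
    ("geneC", "Escherichia", 1e-8),    # significant hit outside the genus
]
orphans = find_orphans(genes, hits)
print(orphans)
```

In this toy input only geneB is retained, because its single hit outside the genus does not pass the cut-off.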
2.3. Calculation of nucleotide substitution rate
The ratio of the rate of non-synonymous substitutions (dN) to
the rate of synonymous substitutions (dS) is widely used as an
indicator of selective pressure acting on protein-coding genes
(Kryazhimskiy and Plotkin, 2008). dN/dS values of L. major genes
were calculated with respect to their one-to-one orthologous
sequences in four other Leishmania species: Leishmania infantum,
Leishmania braziliensis, Leishmania mexicana and L. donovani. To calculate dN/dS values, each orthologous gene pair was aligned
using ClustalW (Larkin et al., 2007). dN/dS values were calculated
by the Yang and Nielsen method using the PAML package v-4 (Yang
and Nielsen, 2000). We averaged the dN/dS values of each gene
and represented that as its evolutionary rate (Supplementary
dataset).
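
The averaging step can be sketched as follows. The sketch assumes that pairwise dN and dS estimates are already available for each gene against each comparison species; the input layout, the gene names and the decision to skip pairs with dS = 0 are assumptions made for illustration and are not taken from the study.

```python
from statistics import mean


def average_dn_ds(pairwise):
    """Average the pairwise dN/dS ratios of each gene.

    pairwise: dict mapping gene -> dict mapping species -> (dN, dS).
    Pairs with dS == 0 are skipped because the ratio is undefined
    (an assumption of this sketch).
    """
    rates = {}
    for gene, by_species in pairwise.items():
        ratios = [dn / ds for dn, ds in by_species.values() if ds > 0]
        if ratios:
            rates[gene] = mean(ratios)
    return rates


# Invented example values
example = {
    "gene1": {
        "L. infantum": (0.05, 0.25),
        "L. braziliensis": (0.09, 0.30),
    },
    "gene2": {
        "L. infantum": (0.01, 0.0),  # undefined ratio, skipped
    },
}
rates = average_dn_ds(example)
print(rates)
```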
2.4. Codon usage indices calculation
Codon Adaptation Index (CAI) and effective number of codons
(Nc) of L. major genes were computed using CodonW (Sharp
et al., 1986). To calculate CAI values, a set of highly expressed
L. major genes was prepared based on RNA-seq data of the promastigote
stage (Rastrojo et al., 2013). Overall genic GC content and the average protein hydropathy index were also computed using CodonW
(Sharp et al., 1986).
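
For orientation, the two simplest indices of this section can be computed directly from a sequence. The sketch below is only an illustration and does not reproduce CodonW: GC content is taken as the fraction of G and C bases in the coding sequence, and the average hydropathy (GRAVY) is the mean Kyte–Doolittle value over all residues. The function names and the example sequences are assumptions.

```python
# Kyte-Doolittle hydropathy values per amino acid
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gc_content(cds):
    """Fraction of G and C bases in a coding sequence."""
    seq = cds.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gravy(protein):
    """Mean Kyte-Doolittle hydropathy; negative values indicate hydrophilicity."""
    seq = protein.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Invented example sequences
print(gc_content("ATGGCCGCGTGA"))
print(gravy("MKRDE"))
```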

2.5. Calculation of gene age
To calculate the phylogenetic age of L. major genes we used a phylostratigraphic approach (Domazet-Loso and Tautz, 2010). Briefly,
according to the significant BLAST hits found in the most remote species (as documented in NCBI nr databases), L. major genes were
classified into four taxonomic levels: genes shared only by
Leishmania species, genes shared by Trypanosomatidae, genes distributed among basal Eukaryota and genes distributed among all
organisms. Genes for which BLAST hits were found only in the
Leishmania genus (genus-restricted orphan genes) were considered
the youngest genes, whereas genes distributed in prokaryotic
species were considered the oldest genes.
2.6. Prediction of protein intrinsic disorder
Protein disorder content was predicted using the IUPred algorithm
(Dosztanyi et al., 2005). Based on pairwise interaction energy,
IUPred assigns a score to each amino acid (Dosztanyi et al., 2005).
For each protein we calculated the proportion of amino acids with
disorder score ≥ 0.5 and represented this as its disorder content.
2.7. Prediction of GO term, subcellular localization and pathogenic protein
For Gene Ontology (GO) annotations of orphan genes we primarily focused on TriTrypDB v 7.0. However, we found that only 43
orphan genes have annotated GO terms. Therefore, for the rest of the
orphan genes in our dataset we predicted GO categories using the
ProtFun 2.2 webserver (http://www.cbs.dtu.dk/services/ProtFun/)
(Jensen et al., 2003). ProtFun 2.2 is a homology-independent
method that predicts protein function based on physicochemical properties. Therefore, this algorithm was considered to
be useful for prediction of protein function even for orphan genes
(Yang et al., 2013).
Subcellular localization of orphan genes was predicted using
two independent web servers: CELLO v.2.5 (http://cello.life.nctu.edu.tw/)
(Yu et al., 2004) and SubCellProt (http://www.databases.niper.ac.in/SubCellProt)
(Garg et al., 2009). CELLO predicts protein
subcellular localization using a two-level support vector machine
(SVM), while SubCellProt is based on two machine learning
approaches, k-Nearest Neighbor (k-NN) and Probabilistic Neural
Network (PNN). When two of these three approaches (k-NN, PNN
and SVM) predicted the same localization for a gene, we took that as its
subcellular localization.
We predicted the pathogenic ability of the orphan genes using the MP3
server (http://metagenomics.iiserb.ac.in/mp3/index.php) (Gupta
et al., 2014). This server predicts pathogenic and virulent proteins
from genomic and metagenomic datasets using an integrated
SVM-HMM approach.
2.8. Identification of interaction and trafficking motifs
We searched for host-cell targeting motifs RXLXE/D/Q
(where X is a neutral or a hydrophobic amino acid residue) that
were previously reported to export Plasmodium falciparum proteins
from the intracellular parasites (Bhattacharjee et al., 2012) to the
surrounding erythrocytes. We
also searched for the presence of the consensus sequences XBBXBX,
XBBBXXBX and XBBBXXBBBXXBBX (where X is a neutral or
hydrophobic amino acid residue and B is a basic amino acid residue), which were implicated in heparin binding (de Castro Cortes
et al., 2012). We searched all of these sequence patterns using
in-house Perl scripts.
2.9. Identification of CAAX prenylation motifs
We searched for the CaaX prenylation motifs (C is cysteine, a
is an aliphatic amino acid, and X is any amino acid) in orphan
genes using the PrePS webserver (http://mendel.imp.univie.ac.at/sat/PrePS)
(Maurer-Stroh and Eisenhaber, 2005). This server classifies
CaaX motifs as farnesyltransferase (FT), geranylgeranyltransferase
I (GGT1) and geranylgeranyltransferase II (GGT2) motifs.
2.10. Statistical analysis
For correlation analyses we used the non-parametric Spearman's
rank correlation ρ. Significant differences of variables between
orphan and non-orphan genes were evaluated using the
Mann–Whitney U test, as the variables were not normally distributed
(Kolmogorov–Smirnov test, P < 0.05). Here, P < 0.05 denotes
significance at the 95% confidence level. All tests were done
using the software SPSS (v-13.0).

3. Results and discussion
3.1. Searching for orphan genes in L. major
Basic Local Alignment Search Tool (BLAST) is a standalone
method to identify orphan genes in any sequenced genome
(Tautz and Domazet-Loso, 2011). The number of orphan genes in
an organism depends on the filtering procedures used during the identification steps. Previous studies have used BLASTP and TBLASTN
methods to identify orphan genes in various species (Lin et al.,
2010; Yang et al., 2013). Following these studies, we used rigorous
BLAST searches against NCBI non-redundant (nr) databases to
identify orphan genes in the L. major genome. To identify matches
missed by the BLASTP and TBLASTN searches we performed a PSI-BLAST search with a cut-off of E-value < 1 × 10⁻⁵. Finally, we identified 881 genus-specific orphan genes, corresponding to 10.65% of
all L. major protein coding sequences (Supplementary_dataset).
According to high-throughput RNA-seq expression data of the promastigote stage (Rastrojo et al., 2013), we found gene expression
intensity (FPKM values) for 864 (out of 881) orphan genes, which
indicates that orphan genes are not artifacts of genome annotation.

3.2. Evolutionary rate heterogeneity of orphan and non-orphan genes: effect of gene age
To examine whether orphan genes of L. major had come under
selective constraints, we computed their evolutionary rate and
compared it with that of non-orphan genes. Consistent with previous reports on other genomes (Domazet-Loso and Tautz, 2003;
Toll-Riera et al., 2009; Wissler et al., 2013), we found that the
evolutionary rate of orphan genes is significantly higher than
that of the rest of the genes in L. major (0.55 ± 0.06 vs.
0.24 ± 0.002 for orphan vs. non-orphan genes, P = 1 × 10⁻⁶,
Mann–Whitney U test). Previously, it was suggested that phylogenetically conserved genes are associated with basic cellular
processes and are functionally more constrained, whereas lineage-specific genes are functionally less constrained and could
evolve many new functions (Kuo and Kissinger, 2008). Thus, higher
sequence divergence of orphan genes was regarded to be due to
lesser selection on functional requirement (Toll-Riera et al.,
2009). Since newly evolved genes tend to be under weaker purifying selection, earlier studies reported that proteins with accelerated
evolutionary rate are younger in gene age (Alba and Castresana,
2005; Cai et al., 2006). Thus, a protein's evolutionary origin, i.e. its
gene age, was considered to be a potential determinant of its
evolutionary rate (Vishnoi et al., 2010). Orphan genes are evolutionarily younger genes that appeared at some time within a phylogenetic lineage towards an extant species (Toll-Riera et al., 2009).
Therefore, their accelerated evolutionary rates were thought to
be due to their recent evolutionary origin (Cai et al., 2006; Tautz
and Domazet-Loso, 2011). To test this hypothesis we assigned a
phyletic age to all the protein coding genes of L. major according
to their phylogenetic distribution: (i) genes restricted to the
Leishmania genus (orphan genes), (ii) Trypanosomatidae-restricted
genes, (iii) extensively distributed eukaryotic genes and (iv) genes
distributed among all organisms (including viruses, bacteria and all
other life forms). We found that evolutionary rate decreases
with increasing gene age (Table 1). Consequently, we found a negative correlation between gene age and evolutionary rate
(Spearman's ρ = −0.519, P = 1 × 10⁻⁶), which suggests that gene
age is an important correlate of protein evolutionary rate. To test
its independent impact on the evolutionary rate of L. major proteins, we performed linear regression taking gene age as the predictor variable and protein evolutionary rate as the dependent variable.
Results revealed that gene age could account for 12.4% of the variation
in protein evolutionary rate in our dataset (R² = 0.124,
F = 1054.839, P < 1 × 10⁻⁶). Thus, it is apparent that although gene
age makes a major contribution to the variation of protein evolutionary rate between orphan and non-orphan genes, there could be
other evolutionary force(s) behind this variation.
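
The two statistics used in this section can be sketched in a few lines of plain Python. The sketch is illustrative only (the analyses in this study were run in SPSS): Spearman's ρ is computed as the Pearson correlation of tie-averaged ranks, and the R² of a one-predictor linear regression is obtained as the squared Pearson correlation. The gene-age classes and dN/dS values in the example are invented.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def r_squared_simple(x, y):
    """R^2 of a simple (one-predictor) linear regression of y on x."""
    return pearson(x, y) ** 2


# Invented example: age class 1 = youngest (genus-restricted), 4 = oldest
gene_age = [1, 1, 2, 2, 3, 3, 4, 4]
dn_ds = [0.60, 0.50, 0.30, 0.25, 0.20, 0.15, 0.12, 0.10]
rho = spearman(gene_age, dn_ds)
r2 = r_squared_simple(gene_age, dn_ds)
print(rho, r2)
```

With age coded so that larger values mean older genes, a negative ρ corresponds to the pattern reported here, namely slower evolution in older genes.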
Table 1
Evolutionary rate of L. major genes according to their phylogenetic distribution.

| Phylogenetic class of L. major genes | dN/dS (mean ± SE) | P-value |
| --- | --- | --- |
| Genes restricted to Leishmania genus (orphan genes) | 0.55 ± 0.0068 | <1 × 10⁻⁶ |
| Trypanosomatidae restricted genes | 0.27 ± 0.0022 | |
| Extensively distributed eukaryotic genes | 0.17 ± 0.0040 | |
| Genes distributed among all organisms | 0.11 ± 0.0044 | |

3.3. Protein evolutionary rate: impact of gene level variants
Previously, it was shown that genes with higher GC content
tend to evolve at a slower rate than genes with lower GC content
(Xia et al., 2009). Thus, GC content was regarded as a strong predictor of protein evolutionary rate. In many species orphan genes
were shown to have low GC content as compared to non-orphan
genes (Johnson and Tsutsui, 2011; Lin et al., 2010; Palmieri et al.,
2014). However, our results revealed that orphan genes of L. major
have higher GC content than non-orphan genes (0.628 ± 0.001 vs.
0.613 ± 0.0004 for orphan vs. non-orphan genes; P = 1 × 10⁻⁶).
Thus, our results support the observation of Yang et al. in zebrafish
(Yang et al., 2013). Further, a positive correlation was observed
between genic GC content and protein evolutionary rate (Spearman's
ρ = 0.295, P = 1 × 10⁻⁶), suggesting that GC content has a strong
impact on the evolutionary rates of L. major genes.
It has long been known that codons encoding the same amino acid,
i.e. synonymous codons, are used with unequal frequencies due
to selection on protein-coding sequences (Ikemura, 1985). Both Nc
and CAI are used as measures of codon usage bias of a gene
towards a set of optimal codons (Gouy and Gautier, 1982;
Wright, 1990). Nc is a measure of the extent of codon usage bias
of a gene, with smaller values indicating more biased usage of synonymous codons (Wright, 1990). CAI measures the overall synonymous codon usage bias of protein coding sequences. However,
there is a fundamental difference between these two measures.
While Nc measures how far the codon usage of a gene departs from
equal usage of synonymous codons, CAI measures the deviation of
codon usage from that of a set of the most highly expressed genes (Botzman
and Margalit, 2011). Here, we observed that orphan genes have
lower CAI and higher Nc values than non-orphan genes
(CAI values 0.33 ± 0.002 vs. 0.405 ± 0.001 and Nc values 49.02 ±
0.15 vs. 44.69 ± 0.07 for orphan vs. non-orphan genes; P = 1 ×
10⁻⁶). Moreover, we found an overall strong positive correlation
between Nc and protein evolutionary rate (Spearman's ρ = 0.613,
P = 1 × 10⁻⁶) and a strong negative correlation between CAI and protein evolutionary rate (Spearman's ρ = −0.681, P = 1 × 10⁻⁶). These
findings suggest that newly evolved genes are under lesser selection for codon choice.
Gene expression intensity has been described as the strongest
correlate of protein evolutionary rate (Drummond et al., 2005).
To examine whether variation in gene expression level has any
influence on the evolutionary rate differences between orphan
and non-orphan genes we measured their gene expression levels.
Our analysis based on RNA-seq gene expression data revealed that
orphan genes have much lower average expression intensity
than non-orphan genes (22.12 ± 0.92 vs. 34.55 ± 0.67 for
orphan vs. non-orphan genes, P = 1 × 10⁻⁶). Consequently, a
significant negative correlation was detected between gene
expression and protein evolutionary rate (Spearman's ρ = −0.365,
P = 1 × 10⁻⁶). Thus, we infer that gene expression intensity may
have an impact on evolutionary rate differences between orphan
and non-orphan genes.
Thus, all these observations suggest that variation in codon
usage and genic GC content, along with gene expression intensity, have
a dominant role in shaping the evolutionary rates of orphan and non-orphan genes in the L. major genome.
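
The orphan versus non-orphan comparisons in this section rely on the Mann–Whitney U test. A minimal sketch of the statistic is given below, assuming a two-sided test with a normal approximation and no tie or continuity correction; statistical packages such as the one used in this study may apply such corrections, so the P-values need not match exactly. The FPKM values are invented.

```python
import math


def mann_whitney_u(sample1, sample2):
    """Return (U, two-sided P) using a plain normal approximation.

    Ties receive average ranks; no tie or continuity correction is
    applied to the variance (a simplification of this sketch).
    """
    combined = [(v, 0) for v in sample1] + [(v, 1) for v in sample2]
    combined.sort(key=lambda t: t[0])
    rank_sum1 = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean 1-based rank of the tied block
        rank_sum1 += avg_rank * sum(
            1 for k in range(i, j + 1) if combined[k][1] == 0
        )
        i = j + 1
    n1, n2 = len(sample1), len(sample2)
    u1 = rank_sum1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u, p_two_sided


# Invented expression values (FPKM)
orphan_fpkm = [5.0, 12.0, 20.0, 8.0]
non_orphan_fpkm = [30.0, 25.0, 40.0, 18.0, 35.0]
u_stat, p_value = mann_whitney_u(orphan_fpkm, non_orphan_fpkm)
print(u_stat, p_value)
```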
3.4. Protein evolutionary rate: impact of protein level variants
Being free from structural constraints, disordered proteins tend
to evolve at a faster rate than well-structured globular proteins
(Dunker et al., 2002; Dyson and Wright, 2005). In line with these
observations, we obtained a significant positive correlation
between protein disorder content and evolutionary rate in
our dataset (Spearman's ρ = 0.432, P = 1 × 10⁻⁶). In support of their
higher evolutionary rate we found that orphan genes encode
unstructured proteins with a significantly higher fraction of disordered residues compared to non-orphan genes (45.23 ± 0.97 vs.
22.52 ± 0.259 for orphan vs. non-orphan genes, P = 1 × 10⁻⁶).
Moreover, by regression analysis we found that protein disorder
content could independently explain 21.2% of the variability in protein
evolutionary rate (R² = 0.212, F = 2005.877, P < 1 × 10⁻⁶) in our
dataset. Intrinsically disordered proteins are frequently involved
in essential biological processes such as transcriptional and translational regulation, membrane fusion, transport and protein–protein interactions, etc. (Dunker et al., 2002; Dyson and Wright,
2005; Mohan et al., 2008; Uversky et al., 2000). It was reported that
high protein disorder in aerobic microbes provides an adaptive
opportunity to fit with the genomic and functional complexities
of the aerobic lifestyle (Panda and Ghosh, 2014). Therefore, elevated protein disorder in orphan genes of L. major may indicate
their possible functional roles in parasitic adaptive processes.
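
The disorder content compared above is the share of residues whose predicted disorder score reaches the 0.5 cut-off given in the methods. A minimal sketch of that calculation follows, assuming the per-residue scores have already been obtained from a disorder predictor; the scores shown are invented and the function name is hypothetical.

```python
def disorder_content(scores, threshold=0.5):
    """Fraction of residues with a disorder score >= threshold."""
    if not scores:
        return 0.0
    disordered = sum(1 for s in scores if s >= threshold)
    return disordered / len(scores)


# Invented per-residue scores for a short protein
scores = [0.1, 0.5, 0.9, 0.3]
content = disorder_content(scores)
print(f"{100 * content:.1f}% disordered residues")
```

Multiplying the fraction by 100 gives a percentage on the same scale as the group means quoted in the text.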
The grand average of hydropathy (GRAVY) index indicates the extent of protein solubility (Kyte and Doolittle, 1982). An overall positive GRAVY
score signifies hydrophobic properties, whereas an overall negative
GRAVY score indicates the hydrophilic nature of a protein. Here, we
noticed that orphan genes are more hydrophilic than
non-orphan genes (average GRAVY score −0.32 ± 0.012 vs.
−0.26 ± 0.004 for orphan vs. non-orphan genes, P = 1 × 10⁻⁶).
Additionally, a significant positive correlation was found between
protein hydropathy and protein evolutionary rate (Spearman's
ρ = 0.026, P = 1 × 10⁻⁶), implying its importance in evolutionary
rate heterogeneity between orphan and non-orphan genes.
Previous studies consistently observed that
orphan genes encode shorter proteins than non-orphan genes
(Donoghue et al., 2011; Kuo and Kissinger, 2008; Toll-Riera et al.,
2009; Yang et al., 2013). Interestingly, our findings highlighted that
orphan genes of L. major have longer protein length than
non-orphan genes (681.76 ± 18.78 vs. 607.52 ± 6.47 for orphan vs.
non-orphan genes; P = 1 × 10⁻⁶, Mann–Whitney U test). One
explanation for such an unexpected trend is that orphan genes of
L. major may have derived from duplicated genes that had undergone gene elongation and rapid divergence. Further, we found a
significant positive correlation between protein length and protein
evolutionary rate (Spearman's ρ = 0.339, P = 1 × 10⁻⁶), which suggests that proteins with higher evolutionary rate tend to be longer
than evolutionarily constrained proteins.
3.5. Relative contribution of the factors in determining evolutionary rate variation
To assess the independent influence of the aforementioned predictor variables on protein evolutionary rate, we performed categorical regression considering protein evolutionary
rate as the dependent variable and all the other factors as predictor
variables. According to our ANOVA model (R² = 0.578,
F = 1274.306, P < 1 × 10⁻⁶), gene age, GC content, codon usage bias
(Nc and CAI), gene expression level, protein length, percentage of
intrinsically disordered residues and protein hydrophilicity are
the attributes regulating the evolutionary rates of the orphan and
non-orphan genes (Table 2). The relative importance of these
factors in determining the rate of protein evolution is as follows:
disorder content > Nc > CAI > protein hydrophilicity > gene expression level > gene age > protein length.
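
One way to read such a ranking is to order the predictors by the magnitude of their standardized regression coefficients. The sketch below does this for the β scores listed in Table 2, entered as printed there; ranking by absolute value is an assumption of this illustration rather than a procedure stated in the text.

```python
# Beta scores as listed in Table 2
beta_scores = {
    "Intrinsic disorder content": 0.371,
    "Protein length": 0.035,
    "Protein hydrophilicity": 0.237,
    "Gene age": 0.072,
    "Expression level (FPKM)": 0.108,
    "CAI": 0.270,
    "GC content": 0.256,
    "Nc": 0.330,
}

# Order predictors from largest to smallest absolute coefficient
ranked = sorted(beta_scores, key=lambda name: abs(beta_scores[name]), reverse=True)
for name in ranked:
    print(f"{name}: {beta_scores[name]:.3f}")
```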
Table 2
Categorical regression to illustrate the independent influence of different variables on protein evolutionary rate.

| Parameter | β score | P-value |
| --- | --- | --- |
| Protein level properties | | |
| Intrinsic disorder content | 0.371 | <1 × 10⁻⁶ |
| Protein length | 0.035 | |
| Protein hydrophilicity | 0.237 | |
| Gene age | 0.072 | |
| Gene level properties | | |
| Expression level (FPKM) | 0.108 | |
| CAI | 0.270 | |
| GC content | 0.256 | |
| Nc | 0.330 | |

3.6. Functional attributes of orphan genes
The large number of orphan genes detected in different taxa, with
their possible lineage-specific functions, motivated us to investigate whether orphan genes confer any advantage for lineage-specific
adaptation of L. major. We searched for the Gene Ontology (GO)
annotations of orphan genes to explore their functional significance. GO annotations are currently available only
for a limited number of L. major orphan genes (43 genes with annotated GO terms out of 881 orphan genes in L. major) (Table 3).
Therefore, to trace the potential functional roles of orphan genes
we predicted GO annotations using the ProtFun 2.2 server (Jensen
et al., 2003). Using ProtFun we were able to predict the GO annotations for 674 orphan genes. Similar to the study of Yang et al. in zebrafish (Yang et al., 2013), our analysis revealed a non-random
distribution of orphan genes across different functional categories.
Here, we observed that growth factor is the most abundant
functional category for orphan genes (29.2%), followed by transcription regulation (24.18%) and cellular transportation (20.02%)
(Table 4). Therefore, the predicted functional annotations indicate that
the largest share of orphan genes lies in the growth factor category, which
could stimulate cell growth and proliferation and is important for
regulating a variety of cellular processes, suggesting that these
genes could be involved in various biochemical pathways leading to
parasitic lineage-specific adaptations.
Prediction of protein subcellular localization is an important
component of in silico prediction of protein function (Yu et al.,
2006). Computational prediction of subcellular localization of proteins may be error prone (Nair and Rost, 2003). Therefore, for the
prediction of subcellular localization of orphan genes we
employed two prediction servers, CELLO (Yu et al., 2004) and
SubCellProt (Garg et al., 2009), which are based on three different
methods (k-NN, PNN and SVM). We assigned a subcellular localization to a protein if at least two of those three methods predicted the
same localization. Predictions from these two web servers consistently suggest that orphan genes are mainly located within the nucleus and
plasma membrane (Supplementary_dataset). Further gene

Table 3
Functional categorization of orphan genes as per annotated GO term in TriTrypDB.

Annotated GO function                                     Number of orphan genes
Acid-amino acid ligase activity                           1
ATP binding                                               5
Calcium ion binding                                       1
DNA binding                                               2
Heat shock protein binding                                1
Heme binding                                              1
Magnesium-dependent protein serine/threonine              1
  phosphatase activity
Microtubule motor activity                                3
Nucleic acid binding                                      1
Protein binding                                           13
Protein tyrosine/serine/threonine phosphatase activity    2
RNA binding                                               5
Structural molecule activity                              1
Transferase activity                                      1
Transporter activity                                      1
Ubiquitin protein ligase binding                          1
Ubiquitin thiolesterase activity                          1
UDP-N-acetylglucosamine-dolichyl-phosphate                1
  N-acetylglucosaminephosphotransferase activity
Zinc ion binding                                          12

Note: Total 43 orphan genes were assigned to various GO terms in TriTrypDB. Some
orphan genes were assigned into multiple GO functional terms in TriTrypDB.

Table 4
Functional categorization of orphan genes as predicted by ProtFun. Using ProtFun we
were able to predict the GO term for 674 orphan genes out of 838 orphan genes with
unknown GO term. Percentage of orphan genes indicates the share of each GO
category among these 674 orphan genes.

Predicted GO function             Number of orphan genes   Percentage of orphan genes
Growth_factor                     197                      29.22
Transcription_regulation          163                      24.18
Transporter                       135                      20.02
Structural_protein                64                       9.49
Transcription                     51                       7.56
Central_intermediary_metabolism   15                       2.22
Receptor                          11                       1.63
Signal_transducer                 9                        1.33
Cation_channel                    8                        1.18
Ion_channel                       8                        1.18
Stress_response                   7                        1.03
Voltage-gated_ion_channel         6                        0.89

ontology (GO) analysis of the orphan genes assigned to the plasma membrane revealed that most of these genes are involved in processes
such as transport, ion channel and voltage-gated ion channel activity.
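The two-out-of-three rule used for subcellular localization can be sketched as a simple majority vote. This is a minimal illustration in Python, assuming the per-method predictions have already been collected into a dictionary; the function name, the input format and the example labels are hypothetical and do not reflect the output format of CELLO or SubCellProt.

```python
from collections import Counter


def consensus_localization(predictions, min_votes=2):
    """Return the location predicted by at least `min_votes` methods, else None.

    `predictions` maps a method name (e.g. "k-NN") to its predicted location.
    """
    if not predictions:
        return None
    # most_common(1) gives the single most frequent label and its vote count.
    label, votes = Counter(predictions.values()).most_common(1)[0]
    return label if votes >= min_votes else None


# Hypothetical predictions for one protein.
example = {"k-NN": "nucleus", "P-NN": "nucleus", "SVM": "cytoplasm"}
print(consensus_localization(example))
```

With three methods and a threshold of two votes, at most one location can qualify, so the rule never has to break a tie; proteins on which all three methods disagree are left unassigned.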
Involvement of genes in metabolic pathways indicates their
functional importance in biosynthetic processes. We therefore investigated whether orphan genes have any role
in the metabolic pathways of L. major. From TriTrypDB we
retrieved the list of L. major genes that are involved in various
pathways. In this way, we found evidence for the involvement of
three orphan genes in different biosynthetic pathways of L. major.
For instance, one orphan gene (Gene ID: LmjF.36.4180) was found
to be associated with the N-glycan biosynthesis pathway. Two other
orphan genes (Gene ID: LmjF.06.0780 and LmjF.35.0550) were
found to be associated with the ascorbate and aldarate metabolism
pathway, the ubiquinone and other terpenoid-quinone biosynthesis
pathway, and the glycosaminoglycan degradation pathway.
To explore their functional roles in those pathways, we considered their GO annotations and Enzyme Commission (EC) numbers.
The GO process annotated in TriTrypDB indicates that gene LmjF.36.4180
is involved in the dolichol-linked oligosaccharide biosynthetic process
of the N-glycan biosynthesis pathway (annotated GO function:
UDP-N-acetylglucosamine-dolichyl-phosphate
N-acetylglucosaminephosphotransferase activity; EC number: 2.7.8.15,
UDP-N-acetylglucosamine-dolichyl-phosphate
N-acetylglucosaminephosphotransferase). Annotated GO terms and EC numbers are unavailable for the genes LmjF.06.0780 and LmjF.35.0550
in TriTrypDB. Therefore, we considered the EC numbers
inferred from OrthoMCL (Li et al., 2003) for their functional assignment. These indicate that the two genes are possibly
involved in glycosidase activity in those pathways (EC number
inferred from OrthoMCL: 3.2.1.-; glycosidases, i.e. enzymes
hydrolyzing O- and S-glycosyl compounds). Thus, our findings
suggest that orphan genes of L. major could integrate into its
metabolic pathways to play important functional roles.
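The pathway screen described above amounts to intersecting the orphan gene list with the gene list of each pathway. The following Python sketch shows that step under stated assumptions: the pathway-to-genes dictionary is a hypothetical in-memory format (not the TriTrypDB export format), and apart from LmjF.36.4180 and its N-Glycan biosynthesis assignment, all IDs and pathway names in the toy input are made-up placeholders.

```python
def orphans_in_pathways(orphan_ids, pathway_genes):
    """Map each orphan gene found in any pathway to a sorted list of pathway names."""
    orphan_ids = set(orphan_ids)
    found = {}
    for pathway, genes in pathway_genes.items():
        # Set intersection keeps only the orphan genes annotated to this pathway.
        for gene in orphan_ids & set(genes):
            found.setdefault(gene, []).append(pathway)
    return {gene: sorted(paths) for gene, paths in found.items()}


# Toy input: LmjF.00.0001 and LmjF.00.0002 are placeholder IDs, not real genes.
pathway_genes = {
    "N-Glycan biosynthesis": {"LmjF.36.4180", "LmjF.00.0001"},
    "Hypothetical pathway": {"LmjF.00.0001"},
}
orphan_ids = {"LmjF.36.4180", "LmjF.00.0002"}
result = orphans_in_pathways(orphan_ids, pathway_genes)
print(result)
```

Orphan genes that appear in no pathway are simply absent from the result, which matches the observation that only a handful of the 881 orphan genes could be placed in a pathway.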
3.7. Orphan genes of L. major putatively involved in secretory
pathways, host-parasite interactions and virulence
Host-pathogen interactions are a type of environmental interaction in which intracellular pathogens exploit host cells to ensure their
survival and replication within the host (Tautz and
Domazet-Loso, 2011). Virulence is one of the potential outcomes of
host-pathogen interaction (Casadevall and Pirofski, 2001).
Therefore, involvement of orphan genes in host-pathogen interactions may suggest a crucial role in parasitic adaptation to the
host system. To understand the role of orphan
genes in host-parasite interactions, we investigated the presence of
different interaction and trafficking motifs in orphan genes.
Prenylation is a post-translational modification, comprising farnesylation and geranylgeranylation, that certain proteins require to
be fully functional (Zhang and Casey, 1996). Pathogen proteins
undergoing this type of host-directed post-translational modification are considered crucial for host-pathogen interactions (Amaya et al., 2011). We found prenylation motifs in five
orphan genes (Supplementary_dataset). This suggests
that orphan gene products may undergo prenylation, facilitating
interactions with the host. Heparin-binding proteins present at the surface of Leishmania have been shown to be involved in
host-pathogen interactions (de Castro Cortes et al., 2012).
Therefore, we searched for heparin-binding consensus sequences
in orphan genes. We found that the heparin-binding consensus
sequence XBBXBX is present in 143 orphan genes, whereas
XBBBXXBX is present in 15 orphan genes (Supplementary_dataset),
which suggests that these gene products possibly interact with
heparin/heparan sulfate at the surface of mammalian host cells. It has
been reported that P. falciparum employs host-targeting (HT) motifs
(RXLXE/D/Q) on secretory proteins to deliver hundreds of effectors
that must cross the haustorial/vacuolar membrane to enter the
cytoplasm of host cells (Bhattacharjee et al., 2012). Thus, the presence
of an HT motif is regarded as a key signature of virulence factors.
We found HT motifs in the protein sequences of 138 L. major orphan
genes (Supplementary_dataset), indicating that these proteins are possibly exported to the host during the intracellular stage of infection.
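The consensus motifs named above lend themselves to a regular-expression scan of each protein sequence. The Python sketch below is one possible way to do this and is not the authors' pipeline. It assumes that B stands for a basic residue (taken here as K, R or H) and that X stands for any residue; neither definition is given in this section, so both are labeled assumptions in the code. The test sequence is made up.

```python
import re

# Assumption: "B" in the heparin-binding consensus means a basic residue.
BASIC = "KRH"

# Assumption: "X" means any residue, written as "." in the patterns.
MOTIFS = {
    "heparin_XBBXBX": f".[{BASIC}]{{2}}.[{BASIC}].",
    "heparin_XBBBXXBX": f".[{BASIC}]{{3}}..[{BASIC}].",
    "host_targeting_RXLXE/D/Q": "R.L.[EDQ]",
}


def find_motifs(sequence):
    """Return {motif name: [0-based start positions]}, overlaps included."""
    seq = sequence.upper()
    hits = {}
    for name, pattern in MOTIFS.items():
        # A lookahead matches without consuming, so overlapping hits are kept.
        hits[name] = [m.start() for m in re.finditer(f"(?=({pattern}))", seq)]
    return hits


# Made-up sequence for illustration only.
print(find_motifs("MARKLAEAKKGRAT"))
```

Short degenerate patterns like these match by chance in many proteins, so a hit is best read as a candidate site and not as evidence of binding or export on its own.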
Membrane proteins and proteins destined for secretion are targeted to the appropriate intracellular membrane by their signal
peptides (Martoglio and Dobberstein, 1998). Transmembrane proteins mainly function as gateways that deny or permit the transport
of specific substances across the biological membrane and are
involved in a broad range of biological processes (Arinaminpathy
et al., 2009). These essential roles make them attractive drug targets. Our analysis showed that 239 orphan genes contain at
least one putative signal peptide or transmembrane domain
(Supplementary_dataset), which suggests that these proteins could
be exported to the host cell or integrated into the extracellular surface of the parasite, where they can interact with host cell receptors.
Secreted and surface-exposed antigens of parasites are thought
to provide target structures for detection by the host immune
system (Silverman et al., 2010). Previous studies showed that
many lineage-specific genes of Plasmodium and Theileria
encode surface antigens that interact with their hosts
(Kuo and Kissinger, 2008). We also found that the orphan genes of L.
major include several genus-restricted surface antigens and hydrophilic surface proteins (HASPA2) (Supplementary_dataset), which
implies their possible involvement in host-parasite interactions.
Proteophosphoglycans (PPG) are surface glycoproteins of L. major
that contribute to the binding of Leishmania to host cells and have
an immunomodulatory effect on macrophage function
(Piani et al., 1999). We noticed that one of the orphan genes
(Gene ID: LmjF.35.0550) encodes a proteophosphoglycan, which
indicates its possible role in attachment of the parasite to host
cells. The 57-residue small hydrophilic endoplasmic reticulum-associated proteins (SHERP) show a high level of stage-specific
expression and are considered important for modulating cellular
processes related to membrane organization and acidification during vector transmission of infective Leishmania (Moore et al., 2011).
We noticed that some orphan genes of L. major encode
57-residue small hydrophilic endoplasmic reticulum-associated
proteins. This result suggests that these genes could
play a crucial role in metacyclic parasites during transmission
to the mammalian host. Amastin-like surface proteins are
assumed to have evolved novel functions crucial to the growth of leishmanial parasites after the acquisition of a vertebrate host (Jackson,
2010). We found multiple copies of amastin-like surface proteins
in the orphan gene dataset (Supplementary_dataset), which suggests
that these genes may be important for survival of L. major within the host.
To further understand the role of orphan genes in virulence, we
retrieved predicted epitope information for L. major genes from
TriTrypDB. Prediction of epitopes in protein-coding genes is a
powerful approach for unbiased antigen discovery (Dumonteil,
2009). Our results suggest that five orphan genes contain epitopes
in their sequences (Supplementary_dataset), which indicates their
possible role in immunogenicity within the host.
Ubiquitination is a post-translational modification in which ubiquitin
is attached to a substrate protein. It can mark proteins for
degradation via the proteasome, alter their cellular location, affect
their activity, and promote or prevent protein-protein interactions
(Mukhopadhyay and Riezman, 2007). We found that three orphan
genes (Gene ID: LmjF.05.0620, LmjF.13.0730 and LmjF.35.2610)
contain a ubiquitin domain, which indicates their possible
role in host protein targeting and virulence. Finally, to test whether
orphan genes have virulence-like properties, we predicted their
propensity to be pathogenic proteins using the MP3 web server (Gupta
et al., 2014). Interestingly, we found that 90.98% of the proteins in the orphan
dataset were predicted to be pathogenic, compared with 54.29% in the
non-orphan dataset. The difference is statistically significant at the
99% level of confidence by z-test (Z score = 19.963). Overall, these
results suggest that orphan genes of L. major are likely to be
involved in the parasite's virulence. The various roles of
orphan genes in several biochemical processes of L. major thus provide
an indication that these genes could be potential targets for the
development of new therapeutics.
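The comparison of predicted pathogenic fractions between orphan and non-orphan proteins is a comparison of two proportions. A minimal Python sketch of a pooled two-proportion z-test is given below. The text does not state which variant of the z-test was used, so the pooled form is an assumption, and the counts in the example are hypothetical and are not the counts behind the reported Z score.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z statistic for hits_a/n_a versus hits_b/n_b."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled proportion under the null hypothesis that both groups share one rate.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se


# Hypothetical counts: 90 of 100 versus 50 of 100 predicted pathogenic.
z = two_proportion_z(90, 100, 50, 100)
print(f"z = {z:.3f}")
```

For a two-sided test, an absolute z value above about 2.576 corresponds to significance at the 99% level, which is the threshold the reported Z score is being compared against.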
4. Conclusion
Our study constitutes the first attempt to analyze the evolutionary
and functional features of genus-restricted orphan genes in
L. major. From the results of our multivariate regression
analysis, we concluded that, together with gene age, different genomic and proteomic attributes of orphan genes are responsible for
their faster evolutionary rate. Our functional analysis highlighted
the role of orphan genes in parasitic lineage-specific adaptation,
which influences both survival and virulence in the host.
Our study provides valuable information on L. major orphan genes
and calls for further analysis and experimental studies to
facilitate the development of novel therapeutic targets in the near
future.
Acknowledgements
We are thankful to the anonymous reviewers for their valuable
comments, which helped us immensely to improve our manuscript.
We thank the Department of Biotechnology, Govt. of India and Bose
Institute for financial support.
Appendix A. Supplementary data
All of the data sets supporting the results of this article are
available in the supplementary datasets as well as in the TriTrypDB
community files portal under the title: Orphan genes of Leishmania major
(http://tritrypdb.org/tritrypdb/showApplication.do).
Supplementary data associated with this article can be found, in
the online version, at http://dx.doi.org/10.1016/j.meegid.2015.03.031.
References
Alba, M.M., Castresana, J., 2005. Inverse relationship between evolutionary rate and
age of mammalian genes. Mol. Biol. Evol. 22, 598-606.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J.H., Zhang, Z., Miller, W., Lipman,
D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs. Nucleic Acids Res. 25, 3389-3402.
Amaya, M., Baranova, A., van Hoek, M.L., 2011. Protein prenylation: a new mode of
host-pathogen interaction. Biochem. Biophys. Res. Commun. 416, 1-6.
Arinaminpathy, Y., Khurana, E., Engelman, D.M., Gerstein, M.B., 2009.
Computational analysis of membrane proteins: the largest class of drug
targets. Drug Discovery Today 14, 1130-1135.
Aslett, M., Aurrecoechea, C., Berriman, M., Brestelli, J., Brunk, B.P., Carrington, M.,
Depledge, D.P., Fischer, S., Gajria, B., Gao, X., Gardner, M.J., Gingle, A., Grant, G.,
Harb, O.S., Heiges, M., Hertz-Fowler, C., Houston, R., Innamorato, F., Iodice, J.,
Kissinger, J.C., Kraemer, E., Li, W., Logan, F.J., Miller, J.A., Mitra, S., Myler, P.J.,
Nayak, V., Pennington, C., Phan, I., Pinney, D.F., Ramasamy, G., Rogers, M.B.,
Roos, D.S., Ross, C., Sivam, D., Smith, D.F., Srinivasamoorthy, G., Stoeckert Jr., C.J.,
Subramanian, S., Thibodeau, R., Tivey, A., Treatman, C., Velarde, G., Wang, H.,
2010. TriTrypDB: a functional genomic resource for the Trypanosomatidae.
Nucleic Acids Res. 38, D457-D462.
Bhattacharjee, S., Stahelin, R.V., Speicher, K.D., Speicher, D.W., Haldar, K., 2012.
Endoplasmic reticulum PI(3)P lipid binding targets malaria proteins to the host
cell. Cell 148, 201-212.
Botzman, M., Margalit, H., 2011. Variation in global codon usage bias among
prokaryotic organisms is associated with their lifestyles. Genome Biol. 12, 109.
Cai, J., Zhao, R., Jiang, H., Wang, W., 2008. De novo origination of a new protein-coding
gene in Saccharomyces cerevisiae. Genetics 179, 487-496.
Cai, J.J., Woo, P.C.Y., Lau, S.K.P., Smith, D.K., Yuen, K.-Y., 2006. Accelerated
evolutionary rate may be responsible for the emergence of lineage-specific
genes in Ascomycota. J. Mol. Evol. 63, 1-11.
Casadevall, A., Pirofski, L.A., 2001. Host-pathogen interactions: the attributes of
virulence. J. Infect. Dis. 184, 337-344.
Chakraborty, S., Ghosh, T.C., 2013. Evolutionary rate heterogeneity of core and
attachment proteins in yeast protein complexes. Genome Biol. Evol. 5,
1366-1375.
Chen, S.C.-C., Chuang, T.-J., Li, W.-H., 2011. The relationships among microRNA
regulation, intrinsically disordered regions, and other indicators of protein
evolutionary rate. Mol. Biol. Evol. 28, 2513-2520.
Daubin, V., Ochman, H., 2004. Bacterial genomes as new gene homes: the genealogy
of ORFans in E. coli. Genome Res. 14, 1036-1042.
de Castro Cortes, L.M., de Souza Pereira, M.C., da Silva, F.S., Santini Pereira, B.A., de
Oliveira Junior, F.O., de Araujo Soares, R.O., Brazil, R.P., Toma, L., Vicente, C.M.,
Nader, H.B., Madeira, M.d.F., Bello, F.J., Alves, C.R., 2012. Participation of heparin
binding proteins from the surface of Leishmania (Viannia) braziliensis
promastigotes in the adhesion of parasites to Lutzomyia longipalpis cells (Lulo)
in vitro. Parasit. Vectors 5, 142.
Domazet-Loso, T., Tautz, D., 2003. An evolutionary analysis of orphan genes in
Drosophila. Genome Res. 13, 2213-2219.
Domazet-Loso, T., Tautz, D., 2010. Phylostratigraphic tracking of cancer genes
suggests a link to the emergence of multicellularity in metazoa. BMC Biol. 8, 66.
Donoghue, M.T.A., Keshavaiah, C., Swamidatta, S.H., Spillane, C., 2011. Evolutionary
origins of Brassicaceae specific genes in Arabidopsis thaliana. BMC Evol. Biol. 11,
47.
Dosztanyi, Z., Csizmok, V., Tompa, P., Simon, I., 2005. IUPred: web server for the
prediction of intrinsically unstructured regions of proteins based on estimated
energy content. Bioinformatics 21, 3433-3434.
Drummond, D.A., Bloom, J.D., Adami, C., Wilke, C.O., Arnold, F.H., 2005. Why highly
expressed proteins evolve slowly. Proc. Natl. Acad. Sci. U.S.A. 102,
14338-14343.
Drummond, D.A., Raval, A., Wilke, C.O., 2006. A single determinant dominates the
rate of yeast protein evolution. Mol. Biol. Evol. 23, 327-337.
Dumonteil, E., 2009. Vaccine development against Trypanosoma cruzi and
Leishmania species in the post-genomic era. Infect. Genet. Evol. 9, 1075-1082.
Dunker, A.K., Brown, C.J., Lawson, J.D., Iakoucheva, L.M., Obradovic, Z., 2002.
Intrinsic disorder and protein function. Biochemistry 41, 6573-6582.
Dyson, H.J., Wright, P.E., 2005. Intrinsically unstructured proteins and their
functions. Nat. Rev. Mol. Cell Biol. 6, 197-208.
Fraser, H.B., Hirsh, A.E., Steinmetz, L.M., Scharfe, C., Feldman, M.W., 2002.
Evolutionary rate in the protein interaction network. Science 296, 750-752.
Garg, P., Sharma, V., Chaudhari, P., Roy, N., 2009. SubCellProt: predicting protein
subcellular localization using machine learning approaches. In Silico Biol. 9,
35-44.
Gouy, M., Gautier, C., 1982. Codon usage in bacteria: correlation with gene
expressivity. Nucleic Acids Res. 10, 7055-7074.
Gupta, A., Kapil, R., Dhakan, D.B., Sharma, V.K., 2014. MP3: a software tool for the
prediction of pathogenic proteins in genomic and metagenomic data. PLoS ONE
9, e93907.
Hahn, M.W., Kern, A.D., 2005. Comparative genomics of centrality and essentiality
in three eukaryotic protein-interaction networks. Mol. Biol. Evol. 22, 803-806.
Heinen, T.J.A.J., Staubach, F., Haeming, D., Tautz, D., 2009. Emergence of a new gene
from an intergenic region. Curr. Biol. 19, 1527-1531.
Hirsh, A.E., Fraser, H.B., 2001. Protein dispensability and rate of evolution. Nature
411, 1046-1049.
Ikemura, T., 1985. Codon usage and tRNA content in unicellular and multicellular
organisms. Mol. Biol. Evol. 2, 13-34.
Ivens, A.C., Peacock, C.S., Worthey, E.A., Murphy, L., Aggarwal, G., Berriman, M., Sisk,
E., Rajandream, M.A., Adlem, E., Aert, R., Anupama, A., Apostolou, Z., Attipoe, P.,
Bason, N., Bauser, C., Beck, A., Beverley, S.M., Bianchettin, G., Borzym, K., Bothe,
G., Bruschi, C.V., Collins, M., Cadag, E., Ciarloni, L., Clayton, C., Coulson, R.M.R.,
Cronin, Cruz, A.K., Davies, R.M., De Gaudenzi, J., Dobson, D.E., Duesterhoeft, A.,
Fazelina, G., Fosker, N., Frasch, A.C., Fraser, A., Fuchs, M., Gabel, C., Goble, A.,
Goffeau, A., Harris, D., Hertz-Fowler, C., Hilbert, H., Horn, D., Huang, Y.T., Klages,
S., Knights, A., Kube, M., Larke, N., Litvin, L., Lord, A., Louie, T., Marra, M., Masuy,
D., Matthews, K., Michaeli, S., Mottram, J.C., Muller-Auer, S., Munden, H., Nelson,
S., Norbertczak, H., Oliver, K., O'Neil, S., Pentony, M., Pohl, T.M., Price, C.,
Purnelle, B., Quail, M.A., Rabbinowitsch, E., Reinhardt, R., Rieger, M., Rinta, J.,
Robben, J., Robertson, L., Ruiz, J.C., Rutter, S., Saunders, D., Schafer, M., Schein, J.,
Schwartz, D.C., Seeger, K., Seyler, A., Sharp, S., Shin, H., Sivam, D., Squares, R.,
Squares, S., Tosato, V., Vogt, C., Volckaert, G., Wambutt, R., Warren, T., Wedler,
H., Woodward, J., Zhou, S.G., Zimmermann, W., Smith, D.F., Blackwell, J.M.,
Stuart, K.D., Barrell, B., Myler, P.J., . The genome of the kinetoplastid parasite,
Leishmania major. Science 309, 436-442.
Jackson, A.P., 2010. The evolution of amastin surface glycoproteins in
trypanosomatid parasites. Mol. Biol. Evol. 27, 33-45.
Jensen, L.J., Gupta, R., Staerfeldt, H.H., Brunak, S., 2003. Prediction of human protein
function according to Gene Ontology categories. Bioinformatics 19,
635-642.
Johnson, B.R., Tsutsui, N.D., 2011. Taxonomically restricted genes are associated
with the evolution of sociality in the honey bee. BMC Genomics 12, 164.
Khalturin, K., Anton-Erxleben, F., Sassmann, S., Wittlieb, J., Hemmrich, G., Bosch,
T.C.G., 2008. A novel gene family controls species-specific morphological traits
in hydra. PLoS Biol. 6, 2436-2449.
Khalturin, K., Hemmrich, G., Fraune, S., Augustin, R., Bosch, T.C.G., 2009. More than
just orphans: are taxonomically-restricted genes important in evolution?
Trends Genet. 25, 404-413.
Knowles, D.G., McLysaght, A., 2009. Recent de novo origin of human protein-coding
genes. Genome Res. 19, 1752-1759.
Kryazhimskiy, S., Plotkin, J.B., 2008. The population genetics of dN/dS. PLoS Genet. 4,
e1000304.
Kuo, C.-H., Kissinger, J.C., 2008. Consistent and contrasting properties of lineage-specific
genes in the apicomplexan parasites Plasmodium and Theileria. BMC
Evol. Biol. 8, 108.
Kyte, J., Doolittle, R.F., 1982. A simple method for displaying the hydropathic
character of a protein. J. Mol. Biol. 157, 105-132.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam,
H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J.,
Higgins, D.G., 2007. Clustal W and clustal X version 2.0. Bioinformatics 23,
2947-2948.
Li, L., Stoeckert, C.J., Roos, D.S., 2003. OrthoMCL: identification of ortholog groups for
eukaryotic genomes. Genome Res. 13 (9), 2178-2189.
Lin, H., Moghe, G., Ouyang, S., Iezzoni, A., Shiu, S.-H., Gu, X., Buell, C.R., 2010.
Comparative analyses reveal distinct sets of lineage-specific genes within
Arabidopsis thaliana. BMC Evol. Biol. 10, 41.
Marais, G., Duret, L., 2001. Synonymous codon usage, accuracy of translation, and
gene length in Caenorhabditis elegans. J. Mol. Evol. 52, 275-280.
Martoglio, B., Dobberstein, B., 1998. Signal sequences: more than just greasy
peptides. Trends Cell Biol. 8, 410-415.
Maurer-Stroh, S., Eisenhaber, F., 2005. Refinement and prediction of protein
prenylation motifs. Genome Biol. 6, R55.
Mohan, A., Sullivan Jr., W.J., Radivojac, P., Dunker, A.K., Uversky, V.N., 2008. Intrinsic
disorder in pathogenic and non-pathogenic microbes: discovering and
analyzing the unfoldomes of early-branching eukaryotes. Mol. BioSyst. 4,
328-340.
Moore, B., Miles, A.J., Guerra-Giraldez, C., Simpson, P., Iwata, M., Wallace, B.A.,
Matthews, S.J., Smith, D.F., Brown, K.A., 2011. Structural basis of molecular
recognition of the leishmania small hydrophilic endoplasmic reticulum-associated
protein (SHERP) at membrane surfaces. J. Biol. Chem. 286,
9246-9256.
Mukhopadhyay, D., Riezman, H., 2007. Proteasome-independent functions of
ubiquitin in endocytosis and signaling. Science 315, 201-205.
Nair, R., Rost, B., 2003. LOC3D: annotate sub-cellular localization for protein
structures. Nucleic Acids Res. 31, 3337-3340.
Neme, R., Tautz, D., 2013. Phylogenetic patterns of emergence of new genes support
a model of frequent de novo evolution. BMC Genomics 14, 117.
Pal, C., Papp, B., Hurst, L.D., 2001. Highly expressed genes in yeast evolve slowly.
Genetics 158, 927-931.
Palmieri, N., Kosiol, C., Schloetterer, C., 2014. The life cycle of Drosophila orphan
genes. Elife 3, e01311.
Panda, A., Ghosh, T.C., 2014. Prevalent structural disorder carries signature of
prokaryotic adaptation to oxic atmosphere. Gene 548, 134-141.
Piani, A., Ilg, T., Elefanty, A.G., Curtis, J., Handman, E., 1999. Leishmania major
proteophosphoglycan is expressed by amastigotes and has an
immunomodulatory effect on macrophage function. Microbes Infect. 1,
589-599.
Podder, S., Ghosh, T.C., 2010. Exploring the differences in evolutionary rates
between monogenic and polygenic disease genes in human. Mol. Biol. Evol. 27,
934-941.
Rastrojo, A., Carrasco-Ramiro, F., Martin, D., Crespillo, A., Reguera, R.M., Aguado, B.,
Requena, J.M., 2013. The transcriptome of Leishmania major in the axenic
promastigote stage: transcript annotation and relative expression levels by
RNA-seq. BMC Genomics 14, 223.
Sharp, P.M., Tuohy, T.M.F., Mosurski, K.R., 1986. Codon usage in yeast: cluster
analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids
Res. 14, 5125-5143.
Silverman, J.M., Clos, J., deOliveira, C.C., Shirvani, O., Fang, Y., Wang, C., Foster, L.J.,
Reiner, N.E., 2010. An exosome-based secretion pathway is responsible for
protein export from Leishmania and communication with macrophages. J. Cell
Sci. 123, 842-852.
Tautz, D., Domazet-Loso, T., 2011. The evolutionary origin of orphan genes. Nat. Rev.
Genet. 12, 692-702.
Toll-Riera, M., Bosch, N., Bellora, N., Castelo, R., Armengol, L., Estivill, X., Mar Alba,
M., 2009. Origin of primate orphan genes: a comparative genomics approach.
Mol. Biol. Evol. 26, 603-612.
Toll-Riera, M., Bostick, D., Mar Alba, M., Plotkin, J.B., 2012. Structure and age jointly
influence rates of protein evolution. PLoS Comput. Biol. 8, e1002542.
Uversky, V.N., Gillespie, J.R., Fink, A.L., 2000. Why are natively unfolded proteins
unstructured under physiologic conditions? Proteins 41, 415-427.
Vishnoi, A., Kryazhimskiy, S., Bazykin, G.A., Hannenhalli, S., Plotkin, J.B., 2010. Young
proteins experience more variable selection pressures than old proteins.
Genome Res. 20, 1574-1581.
Wall, D.P., Hirsh, A.E., Fraser, H.B., Kumm, J., Giaever, G., Eisen, M.B., Feldman, M.W.,
2005. Functional genomic analysis of the rates of protein evolution. Proc. Natl.
Acad. Sci. U.S.A. 102, 5483-5488.
Wilson, G.A., Bertrand, N., Patel, Y., Hughes, J.B., Feil, E.J., Field, D., 2005. Orphans as
taxonomically restricted and ecologically important genes. Microbiology-Sgm
151, 2499-2501.
Wissler, L., Gadau, J., Simola, D.F., Helmkampf, M., Bornberg-Bauer, E., 2013.
Mechanisms and dynamics of orphan gene emergence in insect genomes.
Genome Biol. Evol. 5, 439-455.
Wright, F., 1990. The effective number of codons used in a gene. Gene 87, 23-29.
Wu, D.-D., Irwin, D.M., Zhang, Y.-P., 2011. De novo origin of human protein-coding
genes. PLoS Genet. 7, e1002379.
Xia, Y., Franzosa, E.A., Gerstein, M.B., 2009. Integrated assessment of genomic
correlates of protein evolutionary rate. PLoS Comput. Biol. 5, e1000413.
Xie, C., Zhang, Y.E., Chen, J.-Y., Liu, C.-J., Zhou, W.-Z., Li, Y., Zhang, M., Zhang, R., Wei,
L., Li, C.-Y., 2012. Hominoid-specific de novo protein-coding genes originating
from long non-coding RNAs. PLoS Genet. 8, e1002942.
Yang, L., Zou, M., Fu, B., He, S., 2013. Genome-wide identification, characterization,
and expression analysis of lineage-specific genes within zebrafish. BMC
Genomics 14, 65.
Yang, Z., Huang, J., 2011. De novo origin of new genes with introns in Plasmodium
vivax. FEBS Lett. 585, 641-644.
Yang, Z.H., Nielsen, R., 2000. Estimating synonymous and nonsynonymous
substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17,
32-43.
Yin, Y., Fischer, D., 2008. Identification and investigation of ORFans in the viral
world. BMC Genomics 9, 24.
Yu, C.-S., Chen, Y.-C., Lu, C.-H., Hwang, J.-K., 2006. Prediction of protein subcellular
localization. Proteins 64, 643-651.
Yu, C.S., Lin, C.J., Hwang, J.K., 2004. Predicting subcellular localization of proteins for
Gram-negative bacteria by support vector machines based on n-peptide
compositions. Protein Sci. 13, 1402-1406.
Zdobnov, E.M., Apweiler, R., 2001. InterProScan - an integration platform for the
signature-recognition methods in InterPro. Bioinformatics 17, 847-848.
Zhang, F.L., Casey, P.J., 1996. Protein prenylation: molecular mechanisms and
functional consequences. Annu. Rev. Biochem. 65, 241-269.
Zhang, W.-W., Matlashewski, G., 2010. Screening Leishmania donovani-specific
genes required for visceral infection. Mol. Microbiol. 77, 505-517.
Zhang, Y.E., Landback, P., Vibranovski, M.D., Long, M., 2011. Accelerated recruitment
of new brain development genes into the human genome. PLoS Biol. 9,
e1001179.
