Computational Biology and Chemistry 53 (2014) 84–96


Orphan and gene related CpG Islands follow power-law-like
distributions in several genomes: Evidence of function-related and
taxonomy-related modes of distribution
Giannis Tsiagkas a, Christoforos Nikolaou b, Yannis Almirantis a,*

a Institute of Biosciences and Applications, National Center for Scientific Research “Demokritos”, 15310 Athens, Greece
b Computational Genomics Group, Department of Biology, University of Crete, 71409 Heraklion, Greece



Article history:
Available online 16 September 2014

CpG Islands (CGIs) are compositionally defined short genomic stretches, which have been studied in the
human, mouse, chicken and later in several other genomes. Initially, they were assigned the role of
transcriptional regulation of protein-coding genes, especially the house-keeping ones, while more
recent evidence indicates that they are involved in several other functions as well, which might
include regulation of the expression of RNA genes, DNA replication etc. Here, we investigate their
distributional characteristics in a variety of genomes, both for whole CGI populations and
for CGI subsets that lie away from known genes (gene-unrelated or “orphan” CGIs). In both cases
power-law-like linearity in double logarithmic scale is found. An evolutionary model, initially put
forward to explain a similar pattern found in gene populations, is implemented. It includes
segmental duplication events and eliminations of most of the duplicated CGIs, while a moderate rate of
non-duplicated CGI eliminations is also applied in some cases. Simulations reproduce all the main
features of the observed inter-CGI chromosomal size distributions. Our results on power-law-like
linearity found in orphan CGI populations suggest that the observed distributional pattern is
independent of the analogous pattern that protein-coding segments were reported to follow. On the
basis of the proposed evolutionary model, the power-law-like patterns in the genomic distributions of
CGIs described herein are found to be compatible with several other features of the composition,
abundance or functional role of CGIs reported in the current literature across several genomes.
© 2014 Elsevier Ltd. All rights reserved.

Orphan CpG-Islands
CpG dinucleotide
Power-law-like distribution
Genome evolution

1. Introduction
Genomic CpG islands, in which CpG dinucleotides are abundant
and non-methylated, were initially detected experimentally
and defined as short (hundreds of nucleotides long) stretches in
vertebrate genomes that offer cleavable sites for mCpG-sensitive restriction enzymes (HTF islands, see e.g. Bird, 1986). As
lengthy DNA sequences and whole genomes became progressively
available, sequence-based definitions of CpG Islands (CGIs) and
computational algorithms using a sliding window combined with
threshold values for key quantities were put forward. Thresholds
for sequence-based search of CGIs are considered for: (i) the
minimal island length (Lmin); (ii) the percentage of cytosine and
guanine content (C + G, abbreviated in the following as CG%); and
(iii) the observed over expected frequency of occurrence of the CpG

* Corresponding author. Tel.: +30 2106503619; fax: +30 2106511767.
E-mail address: (Y. Almirantis).
1476-9271/© 2014 Elsevier Ltd. All rights reserved.

dinucleotide (CpGo/e). Gardiner-Garden and Frommer (1987)
introduced the first widely used set of thresholds (Lmin = 200 bp,
CG% > 50%, CpGo/e > 0.6; hereafter referred to as the “relaxed
criteria”; bp stands for ‘base pairs’), while later, Takai and Jones
(2002) used more conservative threshold values, named hereafter
“stringent criteria” (Lmin = 500 bp, CG% > 55%, CpGo/e > 0.65). In the
following, these threshold choices will be abbreviated as G-G&F and
T&J respectively. Alternatively, methods based on the degree of
CpG dinucleotide clustering (Hackenberg et al., 2006; Glass et al.,
2007) and on entropic edge detection (Luque-Escamilla et al.,
2005) have also been introduced. In order to avoid false positives
(e.g. due to high C + G Alu sequences in the human genome),
especially when relaxed criteria are used, repeat-masking is
usually applied before searching for CGIs.
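
The three thresholds described above can be illustrated with a minimal sliding-window sketch. It is a simplification written for illustration only, not the authors' program: the function name `find_cgis` and its parameters are hypothetical, qualifying windows are simply merged when they overlap or touch, and the merged region is not re-tested against the thresholds.

```python
def find_cgis(seq, l_min=200, gc_min=0.50, oe_min=0.60):
    """Return half-open (start, end) intervals of candidate CpG islands.

    A window of length l_min qualifies when its C+G fraction exceeds gc_min
    and its CpG observed/expected ratio exceeds oe_min. Defaults follow the
    G-G&F ("relaxed") criteria quoted in the text.
    """
    seq = seq.upper()
    n = len(seq)
    if n < l_min:
        return []
    # Prefix sums: c[i], g[i] count C/G in seq[:i]; cpg[i] counts CpG
    # dinucleotides whose first base lies in seq[:i].
    c = [0] * (n + 1)
    g = [0] * (n + 1)
    cpg = [0] * (n + 1)
    for i, ch in enumerate(seq):
        c[i + 1] = c[i] + (ch == "C")
        g[i + 1] = g[i] + (ch == "G")
        cpg[i + 1] = cpg[i] + (ch == "C" and i + 1 < n and seq[i + 1] == "G")
    islands = []
    for start in range(n - l_min + 1):  # window slides by one nucleotide
        end = start + l_min
        n_c = c[end] - c[start]
        n_g = g[end] - g[start]
        if n_c == 0 or n_g == 0:
            continue
        n_cpg = cpg[end - 1] - cpg[start]  # CpGs lying entirely inside the window
        gc_fraction = (n_c + n_g) / l_min
        obs_over_exp = n_cpg * l_min / (n_c * n_g)
        if gc_fraction > gc_min and obs_over_exp > oe_min:
            if islands and start <= islands[-1][1]:
                islands[-1][1] = end  # extend the current island
            else:
                islands.append([start, end])
    return [tuple(island) for island in islands]
```

Calling `find_cgis(seq, l_min=500, gc_min=0.55, oe_min=0.65)` applies the T&J ("stringent") thresholds instead. Repeat-masking, as mentioned above, would be applied to `seq` beforehand and is not part of this sketch.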
CGIs are widely accepted as markers for the existence of
protein-coding genes, the promoter regions of which are usually
proximal to or overlapping with the islands. Consequently, the
search for CGIs was motivated by the need to detect as yet
unannotated protein-coding genes. Therefore, most of the

The persistence of orphan CGIs’ chromosomal distributions to form power-laws. while the same domains largely coincide with Polycomb-binding sites. CGIs cannot be generally seen as mutational cold spots. unclear. Genomewide methylation data were not available until recently. 2011). Klimopoulos et al. Algorithm for the extraction of CGIs coordinates in genomic sequences The algorithm we have used for the localization of CGIs is an implementation of the method described by Takai and Jones . 2009). Data for coordinates of CGIs derived from the methylation state of the sequence. based on human and chimpanzee genomes’ comparison.e. These are retained for long evolutionary time in other species where they remain functional.5) and the supplementary file “Supplementary Table”. Methods 2. we use different distance thresholds from the nearest gene in order to characterize a CGI as orphan. besides the “standard” G-G&F and T&J threshold sets. A separate study of populations of orphan CGIs is performed. Information about the origin of the genomic coordinates used for masking of transcription start sites (TSS) and of genes may also be found in “genome data”.2. In previous works (Sellis et al. 2. see Antequera and Bird (1993). Danio rerio. Tanay et al. Antequera (2003) and Jabbari and Bernardi (2004). Bird. when tested numerically. The term has been widely used to denote all CGIs which are not directly involved in the regulation of nearby genes. the independence of CpG deficiency and TpG. Matsuo et al. 2009). (2010) were downloaded from the additional material of this article. We used the annotation of these files in order to define the subset of CGIs which may be considered as orphan. In the following.e. There is not an unambiguous consensus about what an orphan CpG-Island really is. Homo sapiens.. after masking for known repeated sequences (for all organisms apart from S.e. Only entire chromosomes are considered. More specifically. 
The CpG depletion of the vertebrate genome is explained on the basis of heavy methylation and the consequent mutation of 5-methylcytosine to thymine. principally in order to exclude that the observed power-law-like distributions in whole CGI populations are a mere consequence of similar distributions already known to exist for the inter-genic distances (Sellis and Almirantis. Illingworth and co-workers formulated a hypothesis according to which most of orphan CGIs are related to functional promoters of unknown genes. as determined by Illingworth et al. (2010) are also used alongside with CGI coordinates derived from the implementation of compositional thresholds. 2. / Computational Biology and Chemistry 53 (2014) 84–96 introduced CGI finding algorithms were judged upon their ability to find CGIs in the proximity of known genes (see e. CGIs were shown to often coincide with Origins of Replication (ORIs) thus suggesting their functional roles may extend beyond transcriptional regulation (Cayrou et al. Such distributions present extended linear regions in log–log scale (see Methods). found regions with reduced CpG mutability.g. many of which could be functional RNA genes. at least in the germ line. Han and Zhao. cerevisiae). Bos taurus. by means of the algorithm described in the following sub-Section 2. including biologically plausible steps.nih. For all studied organisms. The most important of this work’s observations was that there exist numerous “orphan” CpG Islands (i. when taken alone. Power-lawlike distributions are extensively observed and through the proposed model we attempt to extend our understanding of the evolution and function of CGIs. e. protein-coding exon) coordinates were downloaded from the FTP site of NIH [ftp. researchers attempted to introduce several forms of epigenomic information. distances between protein-coding segments (PCS) often follow “power-law-like” size distributions in entire chromosomes. This model. Caenorhabditis elegans. 
the data of Illingworth et al. Later on. full-length gene masking. like the ones found in the CGI chromosomal distributions. Drosophila melanogaster.. (2010) provided a comprehensive list of CGIs in the human and mouse genomes based principally on methylation data. More recently. Saccharomyces cerevisiae. Additional data for gene or CDS (i. This lack of methylation on specific locations may be seen as the result either of protection from point-mutations (mutational “cold spots”) or of purifying selection due to functional roles which are fulfilled only if specific compositional traits are preserved in the underlying sequences.. Origin of the genomic and epigenomic data The genomes of the following organisms are used in this study: Apis melifera. can provide valuable insight into the overall organization of genome architecture and the mechanisms under which sequences are attributed with functionality. Monodelphis domestica.g. is shown to systematically generate power-law-like size distributions. Canis familiaris. Information about the origin of the genomic sequences used may be found in the supplementary file “genome data”. independently of the method used for their detection. Sellis and] for each of the organisms we have studied.ncbi. in large genomes. for more information see later on (sub-Section 2. In the cases of human and mouse genomes. with flanks of 5 kbp and 2 kbp (thousands of base pairs) at 50 and 30 ends respectively. The functional character of a CGI is related to the condition that its CpG dinucleotides are predominantly unmethylated. nonetheless. (2007) used a method combining a standard sliding window algorithm with available epigenomic data for human chromosomes 21 and 22. CpA excess on the level of DNA methylation) is held. 2007. 2009. verified by the observed increase in TpG and CpA abundances and is extensively discussed in relation to CGIs evolution and function. 2008. we have considered. 1987. Bock et al. in some cases. 
The propensity of 5-methylcytosine to quickly mutate to thymine has driven to the conjecture that the methylation is hindered in genomic sequences where CpG frequency is close to the expected one. To what extent the colocalization of CGI and origins of replication is due to a direct causal link or an indirect effect mediated by positional preferences of gene transcription start sites remains. but the exact proximity relationships between CGI and gene bodies vary according to the study.2. We have put forward an evolutionary model. Several or all chromosomes of each genome are studied. Our principal result herein is that CGIs also. as there are converging data that many CGIs in several lineages have been lost when they ceased to be under purifying selection. for the temporal evolution of genomic components that are subjected to purifying selection. as well as. After the systematic search for CGIs in the human and other genomes using the simple sequence criteria described above. Tsiagkas et al.1. (2007). depending on genome sizes.G. where however the opposite position (i. See Bird (1986). apart from an indication of their functional role and conservation through purifying selection. not connected to a known protein-coding gene). (1993). in human and mouse genome comparisons. Gallus gallus. 85 Here we investigate the large-scale genomic distribution of CpG Islands in several genomes. Mus musculus. several additional choices of sequence composition are also tested and the resulting inter-CGI distances’ distributions are critically presented. follow power-law-like size distributions. Illingworth et al. 2012) we have observed that distances between transposable elements (TE) belonging to the same family. These data are used for the sequence-based determination of CGIs. either studied as entire populations or when only orphan islands are considered.

The algorithm, following Takai and Jones (2002), has freely adjustable thresholds on Lmin, CG% and CpGo/e. In all applications the window slides by one nucleotide. Neighbor CGIs are merged together and then considered as a unique island, if they are separated by less than 100 base pairs and if the extended island they form meets the requirements put by the considered thresholds.

2.3. Methodology for the search of threshold values for CGIs localization in several genomes

As discussed in the introduction, the threshold values for CGIs detection in the cited literature are mostly determined for the purpose of the localization of new protein-coding genes, which are known to have promoters often intersecting with CGIs. Most of this work concerns the human and a few other, warm-blooded, animal genomes. Here, in order to further elaborate on alternative modes of CGI detection, for some chromosomes of the most widely studied genomes, we form a two-dimensional “parameter grid” of threshold values for CG% and CpGo/e (around the G-G&F and T&J threshold choices) for either L > 200 bp or L > 500 bp, tabulating the used thresholds alongside with the extent (E) of the complementary cumulative inter-CGI distances size distributions. These grids are given in the supplementary file “grid”. This is done in an attempt to determine the threshold values corresponding to functional islands, which are thus preserved in evolutionary time (see in the Discussion). The grid-driven search is done only for a limited number of chromosomes, for 8 organisms out of the 11 overall studied, because power-laws for the others are found to be rudimentary, with a linear extent below unity. The criterion adopted for choosing chromosomes for the creation of their parameter grids is to include the chromosome with the most extended linearity for the standard threshold choices G-G&F or T&J, and one or more chromosomes (taken at random) with a medium E value. In the case of C. elegans we derived a grid only for one chromosome, taken at random. Also provided in the Supplementary Table (see also in the Results and Table 1) are values of E given for the standard G-G&F and T&J threshold sets and for alternative sets for the same values of CG% and CpGo/e, interchanging the minimal length thresholds, denoted by altG-G&F (for L > 500 bp) and altT&J (for L > 200 bp).

2.4. Power-laws in size distributions

2.4.1. The equations in terms of probabilities

Suppose there is a large collection of n objects (in our case spacers between CGIs), each characterized by its length S. Let p(S) be the probability of a spacer having length between S − s/2 and S + s/2 (where s is the size of the bin width) and N*(S) the corresponding number of spacers. In such random collections of objects we can approximate the distribution of their sizes with an exponential distribution (like the runs of heads in a coin-tossing experiment):

$$N^{*}(S) = n\,p(S) \propto e^{-aS}, \quad a > 0$$

When scale-free clustering appears, long-range correlations extend to several length scales (ideally, in our case for the whole examined genomic region) and the spacers’ size distributions follow a power-law, which corresponds to a linear graph in a double logarithmic scale:

$$N^{*}(S) = n\,p(S) \propto S^{-\zeta} = S^{-1-\mu}, \quad \mu > 0$$

2.4.2. Original vs. cumulative size distribution

In the present work we use the “complementary cumulative distribution function” (Clauset et al., 2007) for the sizes of spacers (distances) between CpG Islands, defined as follows:

$$P(S) = \int_{S}^{\infty} p(r)\,dr$$

where p(r) is the original spacers’ size distribution. The cumulative form of a power-law size distribution is again a power-law characterized by an exponent (slope) equal to that of the original distribution plus 1 (i.e. minus one in absolute value): if $p(r) \propto r^{-1-\mu}$, then

$$N(S) = n\,P(S) \propto \int_{S}^{\infty} r^{-1-\mu}\,dr \propto S^{-\mu}$$

where N(S) is the number of spacers longer or equal to S. The cumulative distribution forms smoother “tails”, less affected by statistical fluctuations. It is also, by definition, independent of any binning choice: in a cumulative curve the value of P(S) for length S is not associated with the subset of spacers whose length falls in the same bin, as in the original distribution, but it corresponds to the number of all spacers longer than S. As it is pointed out by Sims et al. (2007) the use (in double-logarithmic plots) of the slope of the cumulative form of the distribution for the estimation of the exponent of the power-law gives much more precise results than the use of the slope of the original one. Although the cumulative form of the distribution may be seen as presenting more “inertia” thus overshadowing local features, the inclusion of ten surrogate data distribution curves, along with every genomic one (see below for details), allows for an objective estimation of the linearity trend in each case. For reviews on power-law size distributions, their properties and alternative forms see Adamic and Huberman (2002), Li (2002), Newman (2005), Clauset et al. (2007) and Stumpf and Porter (2012).

In finite size collections taken from physical systems, we cannot speak about “power-laws” but rather about “power-law-like” distributions. The intensity of the power-law-like pattern is quantified through the value of the extent (E) of the linear region in double logarithmic scale. This is particularly important for our study (see Section 4 for further details, where the proposed model is explained) as there appears to be no particular tendency for a unique (universal) exponent μ. Here, all figures also include a bundle of ten simulated size distributions (continuous lines) where an equal number of markers, corresponding to the CGIs of the original natural sequence, are randomly positioned in a sequence of equal length. The inclusion of these random (surrogate) data sets in the figures allows for a direct visualization of the difference between observed and random distribution patterns.

2.5. Details of the masking procedure

In order to define the “orphan” CGI subsets, for large genomes (actually, all the vertebrates included in our list): (i) either we masked the total length of genes by additionally masking 5 kbp upstream of the start-of-gene (TSS) and 2 kbp downstream of the end-of-gene (TES), or (ii) we masked only around the TSS by adding flanks of 500 bp upstream of the TSS and 100 bp downstream of the TSS. For small genomes (insects, worm, and S. cerevisiae): (iii) we either followed the TSS masking (narrowing the upstream masking to 100 bp) as in (ii), or (iv) we followed the TSS masking again, but with the additional masking of each protein coding segment, with symmetric flanks 50 bp upstream and downstream. The criterion adopted for choosing chromosomes for the study of the distribution of orphan CGIs is to include the chromosome with the most extended linearity for the standard threshold choices G-G&F or T&J, and several other chromosomes, with a medium E value.
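
The TSS-based masking of sub-Section 2.5 amounts to discarding every CGI that overlaps a flank around an annotated TSS. The sketch below illustrates case (ii), with 500 bp upstream and 100 bp downstream; the function name `orphan_cgis`, the strand encoding and the half-open interval convention are assumptions made for this example and are not taken from the authors' scripts.

```python
from bisect import bisect_left


def orphan_cgis(cgis, tss_sites, upstream=500, downstream=100):
    """Keep only CGIs that do not overlap any masked TSS flank.

    cgis      : list of half-open (start, end) intervals
    tss_sites : list of (position, strand) with strand '+' or '-'
    """
    # Build the masked intervals; "upstream" depends on the strand.
    masked = []
    for pos, strand in tss_sites:
        if strand == "+":
            masked.append((pos - upstream, pos + downstream + 1))
        else:
            masked.append((pos - downstream, pos + upstream + 1))
    masked.sort()
    # Merge overlapping masks so that they become disjoint and sorted.
    merged = []
    for start, end in masked:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    mask_starts = [m[0] for m in merged]
    kept = []
    for start, end in cgis:
        # Only the last mask starting before the CGI end can overlap it.
        i = bisect_left(mask_starts, end)
        if i > 0 and merged[i - 1][1] > start:
            continue  # overlaps a masked region: gene-related
        kept.append((start, end))
    return kept
```

Full-length gene masking, case (i), would follow the same pattern with one mask per gene, extended by 5 kbp upstream and 2 kbp downstream.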

in some instances of masked genomes.16 – 1.42 0. for the human and mouse genomes. We have also attempted a search for the determination of the form of distribution of distances between CGIs using Table 1 (i) Mean values of linearity extent E (M. we observe increase of the extent of linearity when orphan islands are considered instead of the full populations. as well as the fraction of the TSS-unrelated CGIs (Fig.-5 1.-5 – – – – . 21.29 1.V.67 2.V. which have a minimal distance of 5 kbp upstream or 2 kbp downstream of any RefSeq TSS and we name them “TSS-unrelated” in the present text.19 1. although the extent of linearity is somewhat reduced.V.V.37 1.V. circles and diamonds are used respectively.V.04 – M. 1000 markers (representing CGIs) are randomly inserted in a sequence 2 Mbp long.03 – Epigenetically determined CGIs (complete population) M.87 – – – M.V. The G-G&F or T&J thresholds values have been used here. The length reached by the simulated chromosome was 195 Mbp (steps i and ii of the model. along with the corresponding CGI subsets remaining after gene-masking (non-TSS CGIs.33 M.02 1.V. dealing with the study of masking of threshold-based CGIs. In (a). with lengths sampled from a uniform distribution not exceeding in size the 5% of the actual length of the simulated sequence. in some chromosomes for each organism.73 – 1.-5 1. Power-law-like distributions of orphan CGIs are widespread.61 0.18 1. Standard and alternative threshold values are considered. for one chromosome of each organism. 5(a–c). When a masking procedure is applied in the epigenetically determined CGI collection (for details see Methods. examples of these distributions are presented. Fig.V.74 M.40 1. After each such event a number of markers (CGIs) equal to 90% of the number of the duplicated ones are eliminated. we have formed the subset of Illingworth and co-workers original CGIs.-5 1. See also the Supplementary Plots.V. – – – – 1.93 – 1.59 M.29 1.79 1.08 – – – altG-G&F altT&J M. 
the ones we called “alternative” in sub-Section 2. More plots are included in “Supplementary Plots”. while in Supplementary Table all the studied cases are included. / Computational Biology and Chemistry 53 (2014) 84–96 We have also checked the existence of power-law-like distributions for the set of epigenetically determined CGIs (Illingworth et al. In Table 1 the following information is included: (i) mean values of linearity extent E (M. In Fig. 3b and e).59 2.) for all studied chromosomes from each organism.-5 – – – – 1. For genomic and for model-generated size distributions. This is done in order to test the distribution of CGIs that are unlikely to have any role in the transcriptional regulation of protein coding genes at mediumdistances (and probably spatially linked).66 – 1.01 1. Chromosomal distributions of CpG-Islands in various genomes In Fig. while.68 – 1.V.67 – – – T&J M. 1.5) the picture remains qualitatively the same. 1.15 – 1. In these Tables we have included along with the standard parameter sets. This filtering is analogous to the case (i) of the previous paragraph. In Table 2 mean values are given for the orphan CGI spacers’ size distribution in the same form with Table 1. as described in detail in the Methods. 1.13 – M. Here. We have also performed gene masking and subsequently we studied the spacers’ size distribution for orphan CGIs. sub-Section 2.82 2. 3a and d).56 M.54 0.15 M. 1.70 M.39 1. 1..51 1.87 1. see the Supplementary Table).66 1.45 M. In Supplementary Table all chromosomes are presented. (ii) mean values of linearity Extent E for the five more extended power-law incidences (M.V. in double logarithmic scale.-5 1.58 0.34 1. In (b) the segmental duplications were 87 (50% of those in (a)) and the final sequence length 22 Mbp.V. Simulations using the segmental duplication – CGI loss model Examples of simulations are given in Fig.V.V.78 1. (ii) mean values of linearity extent E for the five more extended power-law incidences (M. 3.V. 
1 we present examples of the complementary cumulative inter-CGI distances’ size distributions. 87 for whole chromosomes from the 11 considered genomes.88 1. The full set of epigenetically determined CGI populations and their orphan subsets are presented in the Supplementary Table.V.V.89 1.32 1.V. The principal result of this study is that power-law-like distributions of inter-CGIs distances are observed in several cases with linearity extent E > 2.49 2.6. 2. Alternatively.24 – 1. 3c and f). 2.-5 1.50 – 1.89 M.) for all studied chromosomes from each organism. 173 segmental duplication events occurred. – – – – M.09 M.33 1.70 M. In (c) the number of segmental duplications is as in (a) and the finally reached length 119 Mbp.V. 31 and human chr. In Table 2. Scripts in Perl and programs in FORTRAN used for parsing genomic data files. 1.05 1. 3. Results 3. chromosomal distributions for the islands determined by Illingworth et al.-5 1.38 M.52 1. In the framework of the present study. Simulations presented in Fig. the two classes of epigenetically determined CGIs defined above are considered as orphan.82 – 1. Tsiagkas et al.01 1. see sub-Section 2. Initially. We observe that power-law-like distributions are again formed. 1.4.37 1. serving as an additional estimator of the intensity of the appearance of the power-law-like pattern per organism. G-G&F Large genomes Bos taurus Canis familiaris Danio rerio Gallus gallus Homo sapiens Monodelphis domestica Mus musculus Small genomes Apis melifera Caenorhabditis elegans Drosophila melanogaster Saccharomyces cerevisiae M.1. 2010) not intersecting with TSSs and having a minimal distance of 100 bp from a TSS of the RefSeq annotation.74 1. In Fig.03 1.41 – – – M.44 1. like the ones found in the orphan CGI populations derived on the basis of compositional criteria.4.86 1.V.47 0.-5). 256 additional events of non-duplicated CGIs eliminations are also allowed (step iii of the model).70 1. throughout.V. 
the complete sets of epigenetically determined CGI are presented (Fig. summary information about non-gene-related epigenetically determined CGIs is also included. which in two instances surpass the three orders of magnitude (in dog chr. 1. see in the Section 4.90 1. Additionally.G. We name these Islands “non-TSS”.V.79 1. These eliminations represent cases of loss of function of ancestral CGIs with a subsequent progressive decomposition.26 – 1. 5 are representative of a large number of numerical experiments conducted. 1.-5). computing the presented size distributions and simulating the evolutionary model are available upon request.V.-5 2.-5 1.02 1. (2010) based on methylation data are studied and the power-law-like pattern is also found. for the same chromosomes as in Fig.33 1.
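
The quantities reported in this section, namely the complementary cumulative distribution of inter-CGI spacers, the slope and extent E of a linear region in log–log scale, and the surrogate (random) marker sets, can be sketched as follows. This is an illustrative outline under stated assumptions: the helper names are hypothetical, and the linear region is supplied by the caller through `s_lo` and `s_hi`, since the procedure used in the article to delimit it is not specified in this text.

```python
import math
import random


def spacers(positions):
    """Distances between consecutive marker positions."""
    pos = sorted(positions)
    return [b - a for a, b in zip(pos, pos[1:]) if b > a]


def ccdf(sizes):
    """Return (S, N(S)) pairs, N(S) being the number of spacers >= S."""
    ordered = sorted(sizes)
    n = len(ordered)
    points = []
    for i, value in enumerate(ordered):
        if i == 0 or value != ordered[i - 1]:
            points.append((value, n - i))
    return points


def loglog_fit(points, s_lo, s_hi):
    """Least-squares slope of log10 N(S) vs log10 S on [s_lo, s_hi].

    Also returns the extent of that range in orders of magnitude.
    """
    xs = [math.log10(s) for s, _ in points if s_lo <= s <= s_hi]
    ys = [math.log10(k) for s, k in points if s_lo <= s <= s_hi]
    if len(xs) < 2:
        raise ValueError("need at least two points in the fitting range")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, math.log10(s_hi / s_lo)


def surrogate_positions(n_markers, length, rng):
    """Equal number of markers placed at random in a sequence of equal length."""
    return sorted(rng.sample(range(length), n_markers))
```

For a genomic marker list `cgi_starts` on a chromosome of length `L`, `ccdf(spacers(cgi_starts))` gives the observed curve, and ten calls to `surrogate_positions(len(cgi_starts), L, rng)` give the accompanying bundle of random curves.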

threshold sets different from the standard and the “alternative” G-G&F and T&J. In the supplementary file “grid” the extents of linearities are presented in the form of grids of thresholds (see Methods) for all the studied cases, while characteristic plots are given in Fig. 4.

Fig. 1. Inter-CGI spacers’ complementary cumulative size distributions in whole chromosomes. Each sub-plot (a–f) corresponds to a chromosome of one of the examined genomes, for a specific choice of thresholds in the detection of CpG Islands. Either the Gardiner-Garden and Frommer, G-G&F (Lmin = 200 bp, G + C > 50%, CpGo/e > 0.6) or the Takai and Jones, T&J (Lmin = 500 bp, G + C > 55%, CpGo/e > 0.65) sets of thresholds have been used. Linearity in log–log scale (a power-law-like pattern) is observed. The linear segments are inferred by linear regression. Genomic curves are accompanied in each plot by 10 curves of surrogate data (continuous lines), corresponding to randomly distributed markers. Logarithms are to base 10.

3.2. An evolutionary scenario reproducing the observed power-law-like distributions based on segmental duplications and CGI loss

Segmental duplication events occurred continuously in the evolutionary past of virtually all eukaryotes (De Grassi et al., 2008;

Kehrer-Sawatzki and Cooper, 2008; Kirsch et al., 2008; McLysaght et al., 2002; Shapira and Finnerty, 1986). It has been shown that at least 10% of the non-repetitive human genome consists of identifiable (i.e. relatively recent) segmental duplications (Bailey et al., 2002). It is estimated that 50% of all genes in a genome are expected to duplicate giving an “offspring” at least once on time scales of 35–350 million years, not taking into account events of polyploidization (Lynch and Conery, 2000). Additionally, most extant taxa have experienced paleopolyploidy during their evolution (i.e. duplication of the whole genome and subsequent reduction to diploidy), see e.g. Adams and Wendel (2005), Gibson and Spring (2000), Sémon and Wolfe (2007) and references therein.

Fig. 2. Same cases of chromosomal distributions as in Fig. 1, sub-plot (a–f), but here excluding gene-related CGIs. Several alternative ways of retaining only orphan CGIs have been used. Again, power-law-like linearity is observed, often more extended than in the corresponding complete population. Surrogate data curves and linear regression are similar to Fig. 1. Logarithms are to base 10.

Segmental duplication and polyploidisation generate copies of some or all the genes of an organism, but also of other functional

genomic localizations, like CGIs, transposable elements (TE) or microsatellites. In the literature, there are reported three possibilities for the evolution of the duplicated genes (Adams and Wendel, 2005; De Grassi et al., 2008; Lynch and Conery, 2000): In some cases one member of the gene pair adopts a new function while the other remains unchanged. In other cases, the two copies continue to survive in the genome sharing the multiple functions of the initial gene between them. However, as all authors agree, the fate of most duplicated genes is that one copy is silenced, losing the ability to be transcribed, and then disintegrates progressively by random mutations, while it is also exposed to the possibility of excision due to recombination driven eliminations.

The above fate of duplicated genes is normal to have its parallel to the fate of CGIs accompanying genes and being involved in their regulatory mechanism. We can consider that CGIs with yet unknown roles, regarded as orphan, when duplicated often become superfluous and stop being under purifying selection. These orphan CGIs will be gradually decomposed and lost, with the accelerated rate imposed by the rapid turnover of the CpG dinucleotide (Nachman and Crowell, 2000).

The existence of another source of CGI loss can be supported by current findings of comparative genomics. Several cases of syntenic genes of different organisms have been shown to be under differential control of CGIs, with one gene being under the regulatory control of a CGI, while its syntenic counterpart in another genome appears to have been emancipated from the action of its ancestral CGI, which has been subsequently decomposed (for details see in the following sub-section).

Moreover, sequencing and analysis of the human genome has shown that segmental duplications and gene loss are widespread phenomena of genome remodeling during evolution. An example of relatively recent gene loss is that 60% of the olfactory receptors in the human genome (1000 genes and pseudogenes in total) have disrupted ORFs and appear to be pseudogenes, which is consistent with findings suggesting massive functional gene loss in the last 10 Myr (IHGSC, 2001). Even if there are not available quantitative data for the CGI corresponding populations, we may expect that loss of CGIs due to one of the above mentioned reasons, often connected to the loss of (protein-coding or not) genes due to changes of the needs of a given species, is also a widespread phenomenon in genomic evolution.

The implementation of a “genomic duplication–CGI loss” model is based on the following genomic phenomena: occasional gene (and associate CGI) silencing and subsequent degradation, orphan CGIs becoming redundant, emancipation of genes from their associate CGI, segmental duplications occurring during genomic evolution and (in some cases) complete genome duplications. This model includes events of the types:

i Segmental duplications of extended regions of chromosomes. This step may include as limiting case whole genome duplications, see e.g. Adams and Wendel (2005), Kasahara (2007), Lynch and Conery (2000), Sémon and Wolfe (2007).
ii Random eliminations of a number of CGIs which is lower or equal to the number of the duplicated ones.
iii Occasionally, additional eliminations of non-duplicated CGIs.
iv Insertions of sequences increasing the total chromosomal length (these could be transposable elements, retroviruses, microsatellite expansions etc).
v Deletions of sequence stretches (which usually are under weak or no purifying selection), although not explicitly considered in the present implementation.

The proposed evolutionary scenario reproduces power-law-like inter-CGI distances size distributions. Moreover, this property is proven numerically to be robust to quantitative modifications of all the involved types of molecular events. Only events i and ii are indispensable for the appearance of the power-law-like pattern, given that the population of the genomic elements we study are (principally) functional and thus preserved by purifying selection. In a completely different genomic framework, i.e. in the study of non-conserved elements, event types iii and iv (in that context, eliminations of repeat copies of the studied TE family, and insertions of TE families more recent than the studied one) are required instead of i

Table 2
Values of linearity extent E as: (i) Mean values (M.V.), (ii) mean values for the five more extended power-law incidences cases (M.V.-5), for several choices of gene-unrelated (orphan) sets of CGIs.
Column groups (each reported as M.V. and M.V.-5 under G-G&F and T&J): Masked genes (−5 kbp, +2 kbp), only for large genomes; Masked TSS (−500 bp, +100 bp) for large genomes, (−100 bp, +100 bp) for small genomes; Masked TSS (−500 bp, +100 bp), cds (±50), only for small genomes; Epigenetically determined CGIs (non-TSS CGIs only); Epigenetically determined CGIs (TSS-unrelated only).
Rows: large genomes (Bos taurus, Canis familiaris, Danio rerio, Gallus gallus, Homo sapiens, Monodelphis domestica, Mus musculus); small genomes (Apis mellifera, Drosophila melanogaster).
[Numerical entries of Table 2 are not recoverable from this copy; the complete values are given in the Supplementary Table.]

and ii (see Sellis et al., 2007; Klimopoulos et al., 2012). Events iii, iv and v are numerically shown to be dispensable for the emergence of power-laws in the computer simulations studied herein. Events of type iii, although not essential for the emergence of power-law-type linearity, are important, as several authors bring evidence about loss of CGIs over evolutionary time due to de novo methylation (see e.g. Antequera and Bird, 1993). The involvement of this type of events in the formation of the observed linearities is indicated by the relatively higher linearities observed in mouse, where higher island elimination rates are observed, as the above authors remark (see also discussion in the next sub-section).

Fig. 3. Inter-CGI spacers’ complementary cumulative size distributions in whole chromosomes for epigenetically determined CGIs from the human and mouse genome. Several cases of non-TSS or TSS-unrelated (orphan) islands are considered. Surrogate data curves and linear regression are similar to Fig. 1. Logarithms are to base 10.

The inclusion of events of type iv tests the robustness of the model,

because for many organisms important parts of the genome are generated by repeat proliferation. Events of type v represent either deletion of sequence regions, usually due to unequal recombination, or gradual shrinkage by a balance of indel (insertion/deletion) events favoring decrease of the sequence length.

Fig. 4. Plots of inter-CGI spacers’ complementary cumulative size distributions in whole chromosomes. Here non-standard threshold sets are used; for more details see sub-Section 2.4 and the supplementary file “grid”. Surrogate data curves and linear regression are similar to Fig. 1. Logarithms are to base 10.

The model presented herein is formally similar to another model introduced previously, in order to account for the appearance of a power-law-like pattern in the chromosomal distribution of genes (Sellis and Almirantis, 2009). Both are based on an analytically solvable model introduced by Takayasu et al. (1991) for the appearance of power-law size distributions in aggregative growth of particles in physicochemical systems. Notice that the model presented herein, conceived to describe the genomic dynamics of CGIs, is not analytically solvable and thus no universal exponents (slopes for the linear segment in log–log scale) may be obtained. This is verified by all our computer

simulations and is in accordance with the variety of slope values met in the study of genomic CGI distributions. For a recent in-depth view of the requirements for having a power-law, see Stumpf and Porter (2012). These authors state that these requirements include a statistically sound power-law (extended linearity in log–log scale with indications of convergence to a universal exponent) and a concrete underlying theory to support it. In our case we clearly show that the log–log linearities we observe in our genomic data (often being quite extended) lack a universal exponent (slope). More specifically, our data deviate from the typical power-law not only because they always have the linearity in log–log scale truncated at a lower and an upper cut-off (in fact, this is a feature common to all cases of naturally occurring “power-laws”) but mainly because they lack any universal exponent (slope). This is the principal reason why we call the pattern we have found “power-law-like” throughout this article. For a related discussion see Klimopoulos et al. (2012). However, this deviation from universality is a characteristic feature shared between the genomic inter-CGI distributional patterns we describe herein and the simulations of the proposed evolutionary model. This feature, along with the common dependence on the evolutionary parameters shared between model and genomic distributions, corroborates the hypothesis that the evolutionary dynamics described by this model is at the origin of the observed genomic patterns.

The similarity of the power-law-like pattern emerging in simulations of this model with the genomic distributions of inter-CGI distances cannot be considered, however, as an absolute proof that this model is at the origin of the genomic pattern. Moreover, the expansion–modification model proposed by Li (1991), shown to generate long-range correlations in nucleotide sequences, might have significantly contributed to power-law-like distributional patterns of several genomic components (repeats, protein coding segments, CGIs), especially in small genomes. Thus, it is probable that the observed distributions are the result of the proposed mechanism in synergy with other aspects of genomic dynamics.

Fig. 5. Simulations using the proposed evolutionary model. Linearity in log–log scale similar to the one observed in the plots of inter-CGI spacers’ complementary cumulative size distributions is found. For details see Section 2 and Section 4. Logarithms are to base 10.

In the simulation of Fig. 5a, only events of the types i and ii are included (segmental duplications followed by CGI loss). We see that power-law-like curves (linear in log–log scale) are formed in the inter-CGI spacers' size distributions, as observed in real chromosomes. A decrease in the number of allowed segmental duplications, depicted in Fig. 5b, decreases the extent of the linearity, as intuitively expected. If we include additional elimination of CGIs (events of type iii), the linearity is considerably increased (see Fig. 5c). These results are compatible with the extended power-law linearity found in cases where an extended genic remodeling in the recent past of the organism is plausible (perhaps, the case of the human genome), with many genes inactivated (e.g. the case of human olfactory receptors) and several genes having their regulatory pattern changed, as found in several instances. We may infer that gene inactivation and reassignment of the roles of genes (which may have caused their emancipation from an associated CGI) led several CGIs to cease being under purifying selection and thus they progressively decomposed.

4. Discussion

4.1. Analysis of the observed power-law-like chromosomal CGI distributions. Taxonomy-related evidence

The property of the formulated simple model to produce power-law-like distributions is an indication that it could be extended to include more detailed modeling of conservation through evolutionary time, in line with the view that functionality is a necessary condition for CpG islands to form stable power-laws.

In the previous paragraph we consider the possibility that the genomic power-law-like size distributions of CGIs might be an indication of functionality. The power-law size distributions of distances between TEs could be seen as a counter-example (Klimopoulos et al., 2012), as extensively discussed therein. On the other hand, TEs are long-lived genomic objects, under no purifying selection, recognizable in mammalian genomes up to

200 My (Ostertag and Kazazian, 2001), thus allowing the interplay of event types iii and iv of the proposed model. These are events not requiring evolutionary conservation of the involved genomic elements in order to generate power-law-like distributions. Contrary to TEs, CpG Islands retain their identity in the evolutionary time scale (i.e. they stay above given thresholds for CG% and for CpGo/e) only for relatively short times after ceasing to be under purifying selection. This is due to the mutability of CpG after being methylated, which makes its genomic turnover rate the highest of any dinucleotide; see Nachman and Crowell (2000) and Antequera (2003). CpG is, in fact, reported to disappear from a given position at least one order of magnitude faster than any other dinucleotide in the eukaryotic genome.

An immediate observation stemming from our study is that the standard threshold sets (G-G&F and T&J) may produce power-law-like distributions not only in human, mouse and other mammalian genomes, but also in a variety of other genomes. From the log–log linearities found herein one might infer that an important proportion of the CGIs detected in all examined genomes using these thresholds are functional, on the basis of the argument presented above, i.e. that log–log linearity implies functionality for CGI populations.

In H. sapiens, the maximum of the linearity in log–log scale is indeed observed for the standard threshold sets or for values in their close vicinity, with no signs of skewed deviations to any direction in the parameter (threshold) space for all analyzed chromosomes. The situation changes, to some extent, for several of the other studied genomes, as may be revealed by an inspection of the data presented in the supplementary file “grid”. In other genomes the optimum (maximum linear extent) is positioned at lower value combinations, while sporadically high values of thresholds may perform better (see data in the “grid”). The existence of several organisms with higher power-law linearities for lower thresholds and shorter minimum length (i.e. when Lmin = 200 performs better than Lmin = 500) is compatible with the finding that “many known functional CGIs are shorter than commonly assumed – the extreme example being Xenopus CGIs” (Hackenberg et al., 2006).

Especially in mouse, where CGIs are known to be eroded, i.e. they are decreased in number while the “surviving” ones are shorter and with lower CG% and CpGo/e values (Antequera and Bird, 1993; Matsuo et al., 1993), we observe that the maximum linear length is systematically positioned at lower threshold combinations than the standard ones. Human and mouse, and in fact other vertebrate genomes too, are losing CpG Islands over evolutionary time due to de novo methylation in the germ line followed by CpG loss through mutation, according to Antequera and Bird (1993). As already mentioned, type iii events of the proposed model occur, according to the cited literature, at a particularly higher rate in rodents. More specifically, considering mean values from all chromosomes as depicted in Tables 1 and 2, by simple inspection of Table 1 we see that the mouse genome presents for G-G&F the highest mean extent, this being a corroboration of the proposed model. The same optimum is reached if we examine the M.V.-5 values. Another point compatible with the model is that this optimum is reached for the relaxed (G-G&F) and not for the stringent (T&J) threshold set, as the erosion of mouse CGIs probably makes the relaxed thresholds more suitable for filtering in most of the functional CGIs. Note that this result does not contradict the earlier finding of Han and Zhao (2009) that the T&J threshold set is particularly suitable for identifying promoter-associated CGIs in human and probably other genomes as well. On the other hand, in our study we attempt to identify the distributional patterns of CGI populations of different functionalities and not only the gene-transcription related ones, often focussing on their non-gene-related subset (orphan islands).

On the other hand, the observation of Han and Zhao (2008) that in fish genomes stringent thresholds (T&J) gave the best results on functional CGI determination is compatible with our finding that in D. rerio, for both standard threshold choices, the lengthier log–log linear extents (the only ones above 2) are found in the higher threshold value region. The prevalence of the higher thresholds in the formation of the most extended linear segments in log–log scale suggests that in fish genomes, functional islands in general might be compositionally moved toward high CG% and CpGo/e values.

Another point where findings presented herein comply with findings of other research groups following a different approach is our observation that chromosome III of Saccharomyces cerevisiae is the one presenting the globally maximal extent of linear segment (E = 1.85) and the only one which exhibits linearity for the alternative threshold sets (altG-G&F and altT&J). This is in accordance with the findings of Li et al. (1998) and Bradnam et al. (1999), see also Sharp and Lloyd (1993), about chromosome III being atypical in many regards: It presents large-scale compositional (C + G content) inhomogeneity and a high extent of compositional mosaicism of genes found only in this chromosome. These authors consider as the more likely reason for the particularities of chromosome III that it contains the mating type loci, thus being under a selective pressure protecting it from disruptions and keeping it intact in evolutionary time, i.e. slowing down of the transposition dynamics. According to our proposed model, this also has as effect the protection of the power-law-like pattern from random interruptions of the inter-CGI spacers, thus maximizing linearity in log–log scale. This has also been shown numerically through computer simulations allowing random transpositions, where a decrease of the linearity extent in log–log scale was observed (figure not included herein). It is also worth noting that the two chromosomes (I and XI) reported by Bradnam et al. (1999) to be next to III in their degree of variation of silent-site C + G content are the ones with the next higher extents of linearity (1.6 and 1.5 respectively).

4.2. Gene-related vs. orphan CGIs

The distributions of coding segments and of genes in a variety of genomes are shown to follow a power-law-like spatial distribution, long-rangeness and fractality, by using several different approaches (Sellis and Almirantis, 2009; Athanasopoulou et al., 2010). In order to exclude the possibility that the power-law-like spatial distribution of CGIs is a mere consequence of a power-law distribution of genes, given the systematic connection of many CGIs with promoters of protein-coding genes, we extended our analysis to the subset of CGIs characterized as orphan, i.e. not spatially connected to genes. The observation that power-law-like linearity persists (and is sporadically more extended) in orphan CGI distributions is indicative of distinct functionalities for this type of genomic localizations. On the other hand, the separate study of orphan CGIs allows for the investigation of different modes of chromosomal distribution and thus of different functionalities of this class of genomic element compared to entire CGI populations.

The result of Han et al. (2008) that the dog genome hosts the highest percentage of gene-unrelated functional CGIs compared to any studied genome is compatible, in view of the proposed model, with our result (see Table 2) that the dog genome, when full gene and flanks masking is applied, represents the global optimum of linearity extent, i.e. the highest mean value of log–log linearity (M.V.) of all examined genomes. Along the same lines is the observation that this genome is characterized by the maximum increase of linearity when full gene and flanks masking is applied, either for G-G&F or for T&J thresholds (if masked and unmasked genomes are compared). Moreover, we note that this increase is much higher for the G-G&F than for the T&J threshold set (0.67 vs. 0.45 respectively). This

. C. Li. G + C content variation along and among saccharomyces cerevisiae chromosomes. L. Linearity in log– log scale was often found in the range between two and three orders of magnitude. lacking the clear 95 partition into isochores observed in warm-blooded animals. Gardiner-Garden..doi. Curr. Scaling properties and fractality in the distribution of coding segments in eukaryotic genomes revealed through a block entropy approach. Lanave. J.. P. C. 446–458. J.. Glass. Plant.. Han. Biol.. A. De Grassi. T. Luque-Escamilla. The observation we made on deviations in power-law linearity extent.. E. 60.D. 196. Glottometrics 3. Biol. Genome Res. Han. Cell.. A. Bradnam. We have studied in separate the population of CGIs which lie outside of genes and flanks considering them as gene-unrelated or orphan..A.E. 1438–1449.].F. strengthens the complementary roles of the two approaches in the prediction of functional CGIs... M. 051917. 1647–1658. 2003. R79. Clark. arXiv. Fazzari. C.1155/2008/565631.. T. Almirantis.. Khulan. Bird. More generally. M..P. Bird. J.. However..) of all chromosomes..T. 10. M. 35. at least for the time being. CpG island mapping by epigenome prediction.. provided by Illingworth et al. Opin. R. 1055–1070. K. Adams. Walter. Wendel. K. Olivier. Saccone.013..F.. studied herein.P. and they were also found to follow power-law-like distributions.L. which not only survives but increases when gene-unrelated (regarded as orphan) CGIs are studied. 1–6. S. Greally. Carpena. 666–675. Oliver.. L. M. 135–141..doi. formally analog to one previously proposed by our group for the understanding of genic segment distributions was formulated and simulations have shown that it might account for the distributional patterns reported herein... function and evolution of CpG island promoters. 2007. 2005. given that methylation is an in vivo reversible feature. which are compatible with the proposed model. Polyploidy and genome evolution in plants. Mol. Melnick. 
this is also an indication that two classes of islands (gene-related and orphan) might be distinguished using several compositional or distributional criteria in a converging way. 342–347. Biol.A. Supplementary data Supplementary data associated with this article can be found. we expect that a treatment like the one presented herein could allow for the deduction of more nuanced conclusions. U.H. Karamanos.. B. Several inferences based on known differences of compositional and other characteristics between species have been made. A.... E.. given that such locally maintained heterogeneities are indications of conservation due to functionality. A simple evolutionary model. Number of CpG islands and genes in human and mouse.W. University of Athens) for serving as academic advisors to G. We believe that only further research may resolve this ambiguity. Life Sci. 2006. P. C.. 8. Comp. L. (1999) about the compositional mosaicism of this chromosome. W. Myers. 2002.. CGIs derived on the basis of methylation data CGIs determined on the basis of methylation data. et al. 261–282. CpGcluster: a distance-based algorithm for CpG-island detection.. S. doi:http://dx. Bouhassira. . 259–264. At the present time only CGIs involved in gene transcription represent a well defined island set. C.. When a more refined annotation of functional roles of CGIs is available.A.. Natl. G. Martinez-Aroza.W.. 9.L. Power-law distributions in empirical data. Trend.. Zant. A. A limitation of the present work is that the chromosomal distribution of CGIs cannot be disentangled from the complex large-scale compositional characteristics of the genome. E. CG dinucleotide clustering is a species-specific property of the genome... 3. J.R. Zhao. presents a clearly inter-CGI linearity in log–log scale. L. Trans. Cayrou.. B. Seoighe. 
The observation of a clearly lower extent of power-law linearity in the methylation-derived CGI distributions in human and mouse raises a question about the prevalence of the methylation-based over the composition-based prediction of the CGI character. Comparative analysis of CpG islands in four fish genomes. R.1186/1471-2105-10-65.N. 209–213. E. Eichler. Z. Nature 321. Genet. A. Genomics doi:http://dx. Soc. P. 2008. Clauset... 2000.. Appendix A.. We deduced that both. 2011. 1987. Z. 2002.. S. P.V. 2007. J.. Zhao. M. melifera genome. Paulsen. Bird. Previti. Li. E 82. J.E. Wolfe.. Biochem. between compositionbased and methylation-state-based CGI populations. quantified by the mean value (M.compbiolchem. Y. Tables 1 and 2)..J. Zhao.. CpG islands or CpG clusters: how to identify functional GCrich regions in a genome? BMC Bioinformatics 10. Schwartz.. Hackenberg.1062v1 [physics. 1999. Acids Res. conserved by purifying selection) until recently. Oakley.. PloS Comp. Gene 421. Thompson. experimentally observed methylation might be due to some specific conditions of the studied cell population and its state. 1993. Bailey. A. Science 297. Genome-scale analysis of metazoan replication origins reveals their organization in specific but flexible sites defined by conserved features.. (2010) for human and mouse chromosomes often follow power-law-like distributions (Tables 1 and 2 and detailed data available in Supplementary Table). Nucl. Funct. Acad. / Computational Biology and Chemistry 53 (2014) 84–96 may represent further evidence that functional gene-unrelated (orphan) islands accomplish their role under less stringent compositional constraints. A.. Structure. 143–150.. Lengauer. Phys.. Mol. Sharp. 65. 1986. Zipf’s law and the internet. P. Samonte. F. Proc. Mol. it is noteworthy that even the A. 6798–6807. Tsiagkas et al..1016/j.G... An alternative explanation for compositionally predicted but methylated CGIs is that these sequences could be functional islands (thus. B. 
and Philippa Constantinou for having participated in an initial stage of this study.08. Gibson. F. 2007. We are particularly indebted to Professor Oliver Clay who predicted the particularly high power-law-linearity of yeast chromosome III on the basis of the findings of Bradnam et al..M. A 90.-H. K. V. One could think that a combination of CG% and CpGo/e clearly higher than the prevailing in the bulk genome can be seen as a trustworthy marker of any CGI functionality. Conclusions In this work we investigated the distributional features of CpG–Islands in several genomes by means of a study of inter-CGI distances considering only entire chromosomes. 28. in the online version.. Rev.doi. Such newly de-functionalized CGIs are also expected to participate to the linearity in log–log plots according to the proposed model. 0706. Coulombe. 1003–1007. 5. Z.L. Shalizi. 2008. Genome Biol. Spring. 3. K. at http://dx.R. 2009. M. Newman. L. C... BMC Bioinformatics 7...J. 2010. Figueroa.. CpG islands in vertebrate genomes. Vigneron. Antequera. Frommer. CpG islands as gene markers in the vertebrate nucleus.A.. 4. Acknowledgements We would like to acknowledge Professors Stavros J. the taxonomy of the considered organisms and the functional role of the studied island population leave their imprint on the observed distributions. CpG-rich islands and the function of DNA methylation. M. Hamodrakas and George C. Huberman. Evidence in favour of ancient octaploidy in the vertebrate genome. Gu.2014. 1987. J. Antequera. 2008. Golden.V. References Adamic. 21. which have lost their function although they have not yet changed their composition due to mutational dynamics. Rodakis (both at the Faculty of Biology.E. R.L..E.. Evol. E. Han. CpG island density and its correlations with genomic features in mammalian genomes. Recent segmental duplications in the human genome.J. although several other functions have been attributed to CGIs in the recent past. for both G-G&F and T&J threshold sets (cf. 
thus their distribution also reflects the large-scale features of the isochore partition of genomes and other compositional inhomogeneities. Adams. On the other hand. Athanasopoulos. 11995–11999... A. C. Evol 16. Sci. J.. J. Athanasopoulou.M. Su.3. K. Genome duplication and gene-family evolution: the case of three OXPHOS gene families.

Cell Mol. Conery. Tsiagkas et al. 5521–5526.. A. 543–555. Power laws: Pareto distributions and Zipf’s law. D. Tanay. Genome Res. Statistical properties of aggregation with injection. Silke. A 43. 2007. Smith. Bestor. Gomez-Lopera. Klimopoulos.H. Anim. 916–928. D. 179–183. 2005. PLoS Genetics 6. Andrews.doi. 2008.. Crowell...S.. 31. K. S.A. 860–921. Phys.E. Nachman..005.. Kazazian. Schempp.. D.J. T. J. D.076711.... M.gene. Reciprocal gene loss between Tetraodon and zebrafish after whole genome duplication in their ancestor.K. G. R. 2007. and uniformity between.. Kirsch.M. 76... Huber. Biol. 2009. M.. e1001134. Proc. Clay.. Takayasu. 46. Bernardi..P. Sémon. 501–538. Rate of spontaneous gene duplication at two loci of Drosophila melanogaster. possible paleopolyploidy and gene loss. M.96 G... Matsuo. An update. Phys.. Rev. 2008..N. U. S.. Rev. Turner.E. Extensive genomic duplication during early chordate evolution. 1991.1016/j. 8. Biology of mammalian L1 retrotransposons... Genome Res. 061925. Genet.02.P. 2002. W. 35. The 2 R hypothesis. Almirantis.W. A. Cytosine methylation and CpG. Immunol. Li. 3740–3745. K.. 2007. Li. 19. Webb. 2007.J.R.. 1998.A. Sci.. 21... M. D... 2012.. P. McLysaght.D. Genet.. Hokamp. 23. 547–552.L. Orphan CpG islands identify numerous conserved promoters in the mammalian genome. Stolovitzky. Glottometrics 5. Annu. Widespread occurrence of powerlaw distributions in inter-repeat distances shaped by genome dynamics. Illingworth. 18–28. Jiang.. E 71.. R. Somat. S. 725–745. Jabbari. Initial sequencing and analysis of the human genome.. Evol.. Acad. Kerr. T.. A. James. 447. Sellis. Kehrer-Sawatzki. Phys. Stat. Mathematics: critical truths about power laws. A.H. Phys. Lloyd.S. M. Provata.... PNAS 99. J.W. P..L. A.. H. W. Sims... Estimate of the mutation rate per nucleotide in humans. 29 doi:http://dx. 19.108.. Ecol. 2005..G.G. Righton. Science 290. P. J. 1993. K. Evidence of erosion of mouse CpG islands during mammalian evolution. 24. D. V. 
Power-laws in the genomic distribution of coding segments in several organisms: an evolutionary trace of segmental duplications. Batz.1101/gr. Z. Y. Li. W. 23. Nucleic Acids Res. D. 14–21. 2010.. J. S. Compositional searching of CpG islands in the human genome. D.. Bernaola-Galvan. Genetics 156. C. Provata. Mol.2012.H. 2000.. Cooper. The evolutionary fate and consequence of duplicate genes. Oliver.. R.. M. J. D. Porter. Gruenewald-Schneider. W. K. A.. Harrison. 323–351. J. Alu and LINE1 distributions in the human chromosomes: evidence of global genomic organization expressed in the form of power laws. 2007. Chromosome Res.. M. 1993. DNA sequences of yeast chromosomes. Evolutionary dynamics of segmental duplications from human Ychromosomal euchromatin/heterochromatin transition regions.. W.. Contemp.. Trends Genet. Molecular mechanisms of chromosomal rearrangement during primate evolution. M... 41–56. Damelin. 2385–2399. Finnerty.L. E. Sharp. TpG (CpA) and TpA frequencies. 2004. 222–229. 1991. Curr.. Chen. 200–204. C.. Luque-Escamilla. 2012. Expansion-modification systems: a model for spatial 1/f spectra. Takahashi. A. Schaffner. J. 159–167. K. G. J. C. U. O'Donnell. A. 2001. Jones. Eichler. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. 108–112. 665–666. . O.H.. G.. 1986. K. Takai.J. Almirantis. M. Gene doi:http://dx.. Nature 409.H. Newman.. Pitchford. Bird.M. Y. 2000. Nat. Stumpf. Martinez-Aroza. A. L. Kasahara.. The use of genetic complementation in the study of eukaryotic macromolecular evolution. 2002.. Oliver. / Computational Biology and Chemistry 53 (2014) 84–96 IHGSC. S. Minimizing errors in identifying Levy flight behaviour of organisms.T. Hyperconserved CpG domains underlie Polycomb-binding sites. 104. Genet. Takayasu. Zipf’s law everywhere. 5240–5260. 2002. 1151–1155... Sellis.. Opin. Almirantis. 143–149. E. 297–304. Evol..H. Lynch. 16. H. A. Mol. Ostertag.. H.. Sellis.doi. Rev. J. 
Regional base composition variation along yeast chromosome III: evolution of chromosome primary structure. Gene Munch.W.. 65. Z. J.. P. Science 335. Gene. Cheng. Compositional heterogeneity within. Wolfe. Y. RomanRoldan. Shapira.. Wolfe. Nat..
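As a reading aid for the evolutionary model described in the text (events i and ii: segmental duplications followed by elimination of at most as many CGIs as were duplicated), the following is a minimal sketch, not the authors' implementation. The function names `simulate_duplication_loss` and `spacer_ccdf`, all parameter values (number of CGIs, chromosome length, segment sizes, number of events) and the convention N(S) = number of spacers of size at least S are assumptions made here for illustration only. Events iii to v are deliberately left out.

```python
import math
import random


def simulate_duplication_loss(n_cgis=500, length=10_000_000, n_events=300,
                              max_seg=500_000, loss_ratio=1.0, seed=0):
    """Toy version of events i and ii; every default value is illustrative."""
    rng = random.Random(seed)
    # Start from CGIs placed uniformly at random on a linear chromosome.
    cgis = sorted(rng.sample(range(length), n_cgis))
    for _ in range(n_events):
        # Event i: copy a random segment (with its CGIs) to a random target.
        seg_len = rng.randint(1, max_seg)
        start = rng.randrange(length - seg_len + 1)
        offsets = [p - start for p in cgis if start <= p < start + seg_len]
        target = rng.randrange(length + 1)
        shifted = [p + seg_len if p >= target else p for p in cgis]
        cgis = sorted(shifted + [target + off for off in offsets])
        length += seg_len
        # Event ii: remove random CGIs, at most as many as were duplicated
        # (assumes loss_ratio <= 1).
        n_loss = min(len(cgis) - 2, round(loss_ratio * len(offsets)))
        for idx in sorted(rng.sample(range(len(cgis)), n_loss), reverse=True):
            del cgis[idx]
    return cgis, length


def spacer_ccdf(cgis):
    """Return (log10 S, log10 N(S)) points, N(S) = number of spacers >= S."""
    spacers = sorted(b - a for a, b in zip(cgis, cgis[1:]) if b > a)
    n = len(spacers)
    points = []
    for i, s in enumerate(spacers):
        # Emit one point per distinct spacer size, at its first occurrence.
        if i == 0 or s != spacers[i - 1]:
            points.append((math.log10(s), math.log10(n - i)))
    return points


if __name__ == "__main__":
    positions, final_length = simulate_duplication_loss(seed=1)
    curve = spacer_ccdf(positions)
    print(len(positions), final_length, len(curve))
```

Plotting the returned points and looking for an extended linear segment is the kind of inspection the figures of inter-CGI spacer distributions rely on. Whether and over what range such a segment appears depends on the assumed parameters, so no particular outcome is claimed here.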
