Computational Biology and Chemistry 53 (2014) 84–96


Orphan and gene related CpG Islands follow power-law-like
distributions in several genomes: Evidence of function-related and
taxonomy-related modes of distribution
Giannis Tsiagkas a, Christoforos Nikolaou b, Yannis Almirantis a,*
a Institute of Biosciences and Applications, National Center for Scientific Research “Demokritos”, 15310 Athens, Greece
b Computational Genomics Group, Department of Biology, University of Crete, 71409 Heraklion, Greece

ARTICLE INFO

Article history:
Available online 16 September 2014

ABSTRACT

CpG Islands (CGIs) are compositionally defined short genomic stretches, which have been studied in the
human, mouse, chicken and later in several other genomes. Initially, they were assigned the role of
transcriptional regulation of protein-coding genes, especially the house-keeping ones, while more
recently evidence has been found that they are involved in several other functions as well, which might
include regulation of the expression of RNA genes, DNA replication etc. Here, an investigation of their
distributional characteristics in a variety of genomes is undertaken for both whole CGI populations as
well as for CGI subsets that lie away from known genes (gene-unrelated or “orphan” CGIs). In both cases
power-law-like linearity in double logarithmic scale is found. An evolutionary model, initially put
forward for the explanation of a similar pattern found in gene populations, is implemented. It includes
segmental duplication events and eliminations of most of the duplicated CGIs, while a moderate rate of
non-duplicated CGI eliminations is also applied in some cases. Simulations reproduce all the main
features of the observed inter-CGI chromosomal size distributions. Our results on power-law-like
linearity found in orphan CGI populations suggest that the observed distributional pattern is
independent of the analogous pattern that protein coding segments were reported to follow. The
power-law-like patterns in the genomic distributions of CGIs described herein are found to be compatible
with several other features of the composition, abundance or functional role of CGIs reported in the
current literature across several genomes, on the basis of the proposed evolutionary model.
© 2014 Elsevier Ltd. All rights reserved.

Keywords:
CpG-Islands
CGIs
Orphan CpG-Islands
CpG dinucleotide
Power-law-like distribution
Genome evolution

1. Introduction
Genomic CpG islands, in which CpG dinucleotides are abundant
and non-methylated, were initially detected experimentally
and defined as short (hundreds of nucleotides-long) stretches in
vertebrate genomes, thus offered as cleavable sites for mCpG-sensitive restriction enzymes (HTF islands, see e.g. Bird, 1986). As
lengthy DNA sequences and whole genomes became progressively
available, sequence-based definitions of CpG Islands (CGIs) and
computational algorithms using a sliding window combined with
threshold values for key quantities were put forward. Thresholds
for sequence-based search of CGIs are considered for: (i) the
minimal island length (Lmin); (ii) the percentage of cytosine and
guanine content (C + G, abbreviated in the following as CG%); and
(iii) the observed over expected frequency of occurrence of the CpG

* Corresponding author. Tel.: +30 2106503619; fax: +30 2106511767.
E-mail address: yalmir@bio.demokritos.gr (Y. Almirantis).
http://dx.doi.org/10.1016/j.compbiolchem.2014.08.013
1476-9271/© 2014 Elsevier Ltd. All rights reserved.

dinucleotide (CpGo/e). Gardiner-Garden and Frommer (1987)
introduced the first widely used set of thresholds (Lmin = 200 bp,
CG > 50%, CpGo/e > 0.6; to which we will hereafter refer as “relaxed
criteria”; bp stands for ‘base pairs’), while later, Takai and Jones
(2002) used more conservative threshold values, named hereafter
“stringent criteria” (Lmin = 500 bp, CG > 55%, CpGo/e > 0.65). In the
following, these threshold choices will be abridged as G-G&F and
T&J respectively. Alternatively, methods based on the degree of
CpG dinucleotide clustering (Hackenberg et al., 2006; Glass et al.,
2007) and on entropic edge detection (Luque-Escamilla et al.,
2005) have also been introduced. In order to avoid false positives
(e.g. due to high C + G Alu sequences in the human genome),
especially when relaxed criteria are used, repeat-masking is
usually applied before searching for CGIs.
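The three thresholds above can be illustrated with a minimal sketch that evaluates a single candidate stretch against the G-G&F and T&J sets. This is not the sliding-window implementation used in this work; the names `Thresholds`, `cpg_stats` and `meets_thresholds` are hypothetical, and the observed over expected ratio is assumed to be computed in the commonly used form (number of CpG × length) / (number of C × number of G).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """One set of sequence-based CGI criteria (illustrative container)."""
    min_length: int   # Lmin, in bp
    min_gc: float     # CG%, expressed as a fraction
    min_oe: float     # CpGo/e


# Threshold sets quoted in the text.
GGF = Thresholds(min_length=200, min_gc=0.50, min_oe=0.60)  # "relaxed criteria"
TJ = Thresholds(min_length=500, min_gc=0.55, min_oe=0.65)   # "stringent criteria"


def cpg_stats(seq):
    """Return (CG fraction, CpG observed/expected) for one sequence stretch.

    Assumption: expected CpG count = (#C * #G) / length.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    expected = c * g / n
    oe = cpg / expected if expected > 0 else 0.0
    return gc_fraction, oe


def meets_thresholds(seq, thresholds):
    """True if the stretch satisfies all three criteria of the given set."""
    gc_fraction, oe = cpg_stats(seq)
    return (
        len(seq) >= thresholds.min_length
        and gc_fraction > thresholds.min_gc
        and oe > thresholds.min_oe
    )
```

A stretch that passes the relaxed set may fail the stringent one on length alone, which is why the two sets can yield different island populations for the same chromosome.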
CGIs are widely accepted as markers for the existence of
protein-coding genes, the promoter regions of which are usually
proximal to or overlapping with the islands. Consequently, the
search for CGIs was motivated by the need of detection of yet
unannotated protein coding genes. Therefore, most of the introduced CGI finding algorithms were judged upon their ability to find CGIs in the proximity of known genes (see e.g. Bird, 1987; Han and Zhao, 2008, 2009).
The CpG depletion of the vertebrate genome is explained on the basis of heavy methylation and the consequent mutation of 5-methylcytosine to thymine, verified by the observed increase in TpG and CpA abundances and is extensively discussed in relation to CGIs evolution and function. See Bird (1986), Antequera (2003) and Jabbari and Bernardi (2004), where however the opposite position (i.e. the independence of CpG deficiency and TpG, CpA excess on the level of DNA methylation) is held. The propensity of 5-methylcytosine to quickly mutate to thymine has driven to the conjecture that the methylation is hindered in genomic sequences where CpG frequency is close to the expected one. This lack of methylation on specific locations may be seen as the result either of protection from point-mutations (mutational “cold spots”) or of purifying selection due to functional roles which are fulfilled only if specific compositional traits are preserved in the underlying sequences. Tanay et al. (2007), based on human and chimpanzee genomes’ comparison, found regions with reduced CpG mutability, while the same domains largely coincide with Polycomb-binding sites. CGIs cannot be generally seen as mutational cold spots, as there are converging data that many CGIs in several lineages have been lost when they ceased to be under purifying selection, e.g. in human and mouse genome comparisons, see Antequera and Bird (1993), Matsuo et al. (1993). These are retained for long evolutionary time in other species where they remain functional.
After the systematic search for CGIs in the human and other genomes using the simple sequence criteria described above, researchers attempted to introduce several forms of epigenomic information. The functional character of a CGI is related to the condition that its CpG dinucleotides are predominantly unmethylated, at least in the germ line. Genomewide methylation data were not available until recently. Bock et al. (2007) used a method combining a standard sliding window algorithm with available epigenomic data for human chromosomes 21 and 22. Later on, Illingworth et al. (2010) provided a comprehensive list of CGIs in the human and mouse genomes based principally on methylation data. The most important of this work’s observations was that there exist numerous “orphan” CpG Islands (i.e. not connected to a known protein-coding gene). Illingworth and co-workers formulated a hypothesis according to which most of orphan CGIs are related to functional promoters of unknown genes, many of which could be functional RNA genes. More recently, CGIs were shown to often coincide with Origins of Replication (ORIs) thus suggesting their functional roles may extend beyond transcriptional regulation (Cayrou et al., 2011). To what extent the colocalization of CGI and origins of replication is due to a direct causal link or an indirect effect mediated by positional preferences of gene transcription start sites remains, nonetheless, unclear.
There is not an unambiguous consensus about what an orphan CpG-Island really is. The term has been widely used to denote all CGIs which are not directly involved in the regulation of nearby genes, but the exact proximity relationships between CGI and gene bodies vary according to the study. In the following, we use different distance thresholds from the nearest gene in order to characterize a CGI as orphan, depending on genome sizes. More specifically, we have considered, in large genomes, full-length gene masking, with flanks of 5 kbp and 2 kbp (thousands of base pairs) at 5′ and 3′ ends respectively; for more information see later on (sub-Section 2.5) and the supplementary file “Supplementary Table”.
Here we investigate the large-scale genomic distribution of CpG Islands in several genomes. In previous works (Sellis et al., 2007; Sellis and Almirantis, 2009; Klimopoulos et al., 2012) we have observed that distances between transposable elements (TE) belonging to the same family, as well as, in some cases, distances between protein-coding segments (PCS) often follow “power-law-like” size distributions in entire chromosomes. Such distributions present extended linear regions in log–log scale (see Methods). We have put forward an evolutionary model, including biologically plausible steps, for the temporal evolution of genomic components that are subjected to purifying selection. This model, when tested numerically, is shown to systematically generate power-law-like size distributions, like the ones found in the CGI chromosomal distributions. Our principal result herein is that CGIs also, either studied as entire populations or when only orphan islands are considered, follow power-law-like size distributions, independently of the method used for their detection. A separate study of populations of orphan CGIs is performed, principally in order to exclude that the observed power-law-like distributions in whole CGI populations are a mere consequence of similar distributions already known to exist for the inter-genic distances (Sellis and Almirantis, 2009). The persistence of orphan CGIs’ chromosomal distributions to form power-laws, when taken alone, apart from an indication of their functional role and conservation through purifying selection, can provide valuable insight into the overall organization of genome architecture and the mechanisms under which sequences are attributed with functionality. Power-law-like distributions are extensively observed and through the proposed model we attempt to extend our understanding of the evolution and function of CGIs.

2. Methods
2.1. Origin of the genomic and epigenomic data
The genomes of the following organisms are used in this study: Apis mellifera, Bos taurus, Caenorhabditis elegans, Canis familiaris, Danio rerio, Drosophila melanogaster, Gallus gallus, Homo sapiens, Monodelphis domestica, Mus musculus, Saccharomyces cerevisiae. Several or all chromosomes of each genome are studied. Only entire chromosomes are considered, after masking for known repeated sequences (for all organisms apart from S. cerevisiae). These data are used for the sequence-based determination of CGIs, by means of the algorithm described in the following sub-Section 2.2. Information about the origin of the genomic sequences used may be found in the supplementary file “genome data”.
In the cases of human and mouse genomes, the data of Illingworth et al. (2010) are also used alongside with CGI coordinates derived from the implementation of compositional thresholds. Data for coordinates of CGIs derived from the methylation state of the sequence, as determined by Illingworth et al. (2010) were downloaded from the additional material of this article.
For all studied organisms, besides the “standard” G-G&F and T&J threshold sets, several additional choices of sequence composition are also tested and the resulting inter-CGI distances’ distributions are critically presented.
Additional data for gene or CDS (i.e. protein-coding exon) coordinates were downloaded from the FTP site of NIH [ftp.ncbi.nih.gov/genomes/MapView] for each of the organisms we have studied. We used the annotation of these files in order to define the subset of CGIs which may be considered as orphan. Information about the origin of the genomic coordinates used for masking of transcription start sites (TSS) and of genes may also be found in “genome data”.

2.2. Algorithm for the extraction of CGIs coordinates in genomic sequences
The algorithm we have used for the localization of CGIs is an implementation of the method described by Takai and Jones
(2002), with free to be adjusted thresholds on Lmin, CG% and CpGo/e. In all applications the window is sliding per one nucleotide. Neighbor CGIs are merged together and then considered as a unique island, if they are separated by less than 100 base pairs and if the extended island they form meets the requirements put by the considered thresholds.

2.3. Power-laws in size distributions
2.3.1. The equations in terms of probabilities
Suppose there is a large collection of n objects (in our case spacers between CGIs), each characterized by its length S. In such random collections of objects we can approximate the distribution of their sizes with an exponential distribution (like the runs of heads in a coin-tossing experiment). Let p(S) be the probability of a spacer having length between S − s/2 and S + s/2 (where s is the size of the bin width) and N*(S) the number of spacers:
$N^{*}(S) = n\,p(S) \propto e^{-aS}, \quad a > 0$
When scale-free clustering appears, long-range correlations extend to several length scales (ideally, in our case for the whole examined genomic region) and the spacers’ size distributions follow a power-law, which corresponds to a linear graph in a double logarithmic scale:
$N^{*}(S) = n\,p(S) \propto S^{-\zeta} = S^{-1-\mu}, \quad \mu > 0$
In finite size collections taken from physical systems, we cannot speak about “power-laws” but rather about “power-law-like” distributions. For reviews on power-law size distributions, their properties and alternative forms see Adamic and Huberman (2002), Li (2002), Newman (2005), Clauset et al. (2007) and Stumpf and Porter (2012).

2.3.2. Original vs. cumulative size distribution
In the present work we use the “complementary cumulative distribution function” (Clauset et al., 2007) for the sizes of spacers (distances) between CpG Islands, defined as follows:
$P(S) = \int_{S}^{\infty} p(r)\,dr$
where p(r) is the original spacers’ size distribution. The cumulative form of a power-law size distribution is again a power-law characterized by an exponent (slope) equal to that of the original distribution plus 1 (i.e. minus one in absolute value): if $p(r) \propto r^{-1-\mu}$, then
$N(S) = n\,P(S) \propto \int_{S}^{\infty} r^{-1-\mu}\,dr \propto S^{-\mu}$
where N(S) is the number of spacers longer or equal to S.
The cumulative distribution forms smoother “tails”, less affected by statistical fluctuations. It is also, by definition, independent of any binning choice: in a cumulative curve the value of P(S) for length S is not associated with the subset of spacers whose length falls in the same bin, as in the original distribution, but it corresponds to the number of all spacers longer than S. As it is pointed out by Sims et al. (2007) the use (in double-logarithmic plots) of the slope of the cumulative form of the distribution for the estimation of the exponent of the power-law gives much more precise results than the use of the slope of the original one. This is particularly important for our study (see Section 4 for further details, where the proposed model is explained) as there appears to be no particular tendency for a unique (universal) exponent μ. Although the cumulative form of the distribution may be seen as presenting more “inertia” thus overshadowing local features, the inclusion of ten surrogate data distribution curves, along with every genomic one (see below for details), allows for an objective estimation of the linearity trend in each case.
The intensity of the power-law-like pattern is quantified through the value of the extent (E) of the linear region in double logarithmic scale. Additionally to CGIs genomic distributions, all figures also include a bundle of ten simulated size distributions (continuous lines) where an equal number of markers, corresponding to the CGIs of the original natural sequence, are randomly positioned in a sequence of equal length. The inclusion of these random (surrogate) data sets in the figures allows for a direct visualization of the difference between observed and random distribution patterns.

2.4. Methodology for the search of threshold values for CGIs localization in several genomes
As discussed in the introduction, the threshold values for CGIs detection in the cited literature are mostly determined for the purpose of the localization of new protein-coding genes, which are known to have promoters often intersecting with CGIs. Most of this work concerns the human and a few other, warm-blooded, animal genomes. Here, in order to further elaborate on alternative modes of CGI detection, for some chromosomes of the most widely studied genomes, we form a two-dimensional “parameter grid” of threshold values for CG% and CpGo/e (around the G-G&F and T&J threshold choices) for either L > 200 bp or L > 500 bp, tabulating the used thresholds alongside with the extent (E) of the complementary cumulative inter-CGI distances size distributions. These grids are given in the supplementary file “grid”. This is done in an attempt to determine the threshold values corresponding to functional islands, which are thus preserved in evolutionary time (see in the Discussion).
The grid-driven search is done only for a limited number of chromosomes, for 8 organisms out of the 11 overall studied. The criterion adopted for choosing chromosomes for the creation of their parameter grids is to include the chromosome with the most extended linearity for the standard threshold choices G-G&F or T&J, and one or more chromosomes (taken at random) with a medium E value. In the case of C. elegans we derived a grid only for one chromosome, with a medium E value, because power-laws for the others are found to be rudimentary, with a linear extent below unity.
Also provided in the Supplementary Table (see also in the Results and Table 1) are values of E given for the standard G-G&F and T&J threshold sets and for alternative sets for the same values of CG% and CpGo/e, interchanging the minimal length thresholds, denoted by altG-G&F (for L > 500 bp) and altT&J (for L > 200 bp).

2.5. Details of the masking procedure
In order to define the “orphan” CGI subsets, for large genomes (actually, all the vertebrates included in our list): (i) either we masked the total length of genes by additionally masking 5 kbp upstream of the start-of-gene (TSS) and 2 kbp downstream of the end-of-gene (TES), or (ii) we masked only around the TSS by adding flanks of 500 bp upstream of the TSS and 100 bp downstream of the TSS. For small genomes (insects, worm, and S. cerevisiae): (iii) we either followed the TSS masking (narrowing the upstream masking to 100 bp) as in (ii), or (iv) we followed the TSS masking again, but with the additional masking of each protein coding segment, with symmetric flanks 50 bp upstream and downstream.
The criterion adopted for choosing chromosomes for the study of the distribution of orphan CGIs is to include the chromosome with the most extended linearity for the standard threshold choices G-G&F or T&J, and several other chromosomes, taken at random, with a medium E value.
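The complementary cumulative size distribution of sub-Section 2.3.2 and the random surrogate sets can be sketched as follows. This is a minimal illustration under stated assumptions, not the Perl/FORTRAN code used for this work: islands are assumed to be given as (start, end) coordinate pairs, surrogate markers are treated as points without length, and the function names are hypothetical.

```python
import random


def inter_cgi_spacers(islands):
    """Lengths of the gaps between consecutive islands, given (start, end) pairs."""
    ordered = sorted(islands)
    return [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start > prev_end
    ]


def complementary_cumulative(sizes):
    """Return (S, N(S)) pairs, where N(S) is the number of spacers >= S."""
    ordered = sorted(sizes)
    n = len(ordered)
    points = []
    for i, s in enumerate(ordered):
        if i == 0 or s != ordered[i - 1]:
            points.append((s, n - i))  # n - i spacers have length >= s
    return points


def surrogate_spacers(n_markers, seq_length, rng):
    """Spacers between n_markers point markers placed uniformly at random.

    Assumption: markers have no length, so a spacer is a position difference.
    """
    positions = sorted(rng.sample(range(seq_length), n_markers))
    return [b - a for a, b in zip(positions, positions[1:])]
```

Plotting log10 N(S) against log10 S for the genomic spacers, together with ten calls to `surrogate_spacers` using the same marker count and chromosome length, gives the kind of comparison described in the text; no binning choice is involved at any step.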

We have also checked the existence of power-law-like distributions for the set of epigenetically determined CGIs (Illingworth et al., 2010) not intersecting with TSSs and having a minimal distance of 100 bp from a TSS of the RefSeq annotation. We name these Islands “non-TSS”. Additionally, we have formed the subset of Illingworth and co-workers original CGIs, which have a minimal distance of 5 kbp upstream or 2 kbp downstream of any RefSeq TSS and we name them “TSS-unrelated” in the present text. This filtering is analogous to the case (i) of the previous paragraph, dealing with the study of masking of threshold-based CGIs. This is done in order to test the distribution of CGIs that are unlikely to have any role in the transcriptional regulation of protein coding genes at medium distances (and probably spatially linked). In the framework of the present study, the two classes of epigenetically determined CGIs defined above are considered as orphan.

2.6. Simulations using the segmental duplication – CGI loss model
Examples of simulations are given in Fig. 5(a–c). Initially, 1000 markers (representing CGIs) are randomly inserted in a sequence 2 Mbp long. In (a), 173 segmental duplication events occurred, with lengths sampled from a uniform distribution not exceeding in size the 5% of the actual length of the simulated sequence. After each such event a number of markers (CGIs) equal to 90% of the number of the duplicated ones are eliminated. These eliminations represent cases of loss of function of ancestral CGIs with a subsequent progressive decomposition. The length reached by the simulated chromosome was 195 Mbp (steps i and ii of the model, see in the Section 4). In (b) the segmental duplications were 87 (50% of those in (a)) and the final sequence length 22 Mbp. In (c) the number of segmental duplications is as in (a) and the finally reached length 119 Mbp, while 256 additional events of non-duplicated CGIs eliminations are also allowed (step iii of the model). Simulations presented in Fig. 5 are representative of a large number of numerical experiments conducted. For genomic and for model-generated size distributions, circles and diamonds are used respectively, throughout.
Scripts in Perl and programs in FORTRAN used for parsing genomic data files, computing the presented size distributions and simulating the evolutionary model are available upon request.

3. Results
3.1. Chromosomal distributions of CpG-Islands in various genomes
In Fig. 1 we present examples of the complementary cumulative inter-CGI distances’ size distributions, in double logarithmic scale, for whole chromosomes from the 11 considered genomes. The G-G&F or T&J thresholds values have been used here. More plots are included in “Supplementary Plots”. The principal result of this study is that power-law-like distributions of inter-CGIs distances are observed in several cases with linearity extent E > 2, which in two instances surpass the three orders of magnitude (in dog chr. 31 and human chr. 21, see the Supplementary Table). In Table 1 the following information is included: (i) mean values of linearity extent E (M.V.) for all studied chromosomes from each organism; (ii) mean values of linearity extent E for the five more extended power-law incidences (M.V.-5), serving as an additional estimator of the intensity of the appearance of the power-law-like pattern per organism. In Supplementary Table all chromosomes are presented.
Alternatively, chromosomal distributions for the islands determined by Illingworth et al. (2010) based on methylation data are studied and the power-law-like pattern is also found, although the extent of linearity is somewhat reduced. In Fig. 3, for the human and mouse genomes, for one chromosome of each organism, the complete sets of epigenetically determined CGI are presented (Fig. 3a and d), along with the corresponding CGI subsets remaining after gene-masking (non-TSS CGIs, see sub-Section 2.5, Fig. 3b and e), as well as the fraction of the TSS-unrelated CGIs (Fig. 3c and f). When a masking procedure is applied in the epigenetically determined CGI collection (for details see Methods, sub-Section 2.5) the picture remains qualitatively the same. We observe that power-law-like distributions are again formed, like the ones found in the orphan CGI populations derived on the basis of compositional criteria. The full set of epigenetically determined CGI populations and their orphan subsets are presented in the Supplementary Table.

3.2. Power-law-like distributions of orphan CGIs are widespread
We have also performed gene masking and subsequently we studied the spacers’ size distribution for orphan CGIs, in some chromosomes for each organism, as described in detail in the Methods. In Fig. 2, examples of these distributions are presented, for the same chromosomes as in Fig. 1. See also the Supplementary Plots. Here, in some instances of masked genomes, we observe increase of the extent of linearity when orphan islands are considered instead of the full populations. In Table 2 mean values are given for the orphan CGI spacers’ size distribution in the same form with Table 1, while in Supplementary Table all the studied cases are included. In these Tables we have included along with the standard parameter sets, the ones we called “alternative” in sub-Section 2.4. In Table 2, summary information about non-gene-related epigenetically determined CGIs is also included.

Table 1
(i) Mean values of linearity extent E (M.V.) for all studied chromosomes from each organism; (ii) mean values of linearity extent E for the five more extended power-law incidences (M.V.-5). Standard and alternative threshold values are considered.
[Table body. Columns: G-G&F, T&J, altG-G&F, altT&J, Epigenetically determined CGIs (complete population), each reported as M.V. and M.V.-5. Rows, large genomes: Bos taurus, Canis familiaris, Danio rerio, Gallus gallus, Homo sapiens, Monodelphis domestica, Mus musculus; small genomes: Apis mellifera, Caenorhabditis elegans, Drosophila melanogaster, Saccharomyces cerevisiae. The numerical entries are not recoverable from this copy.]

We have also attempted a search for the determination of the form of distribution of distances between CGIs using
threshold sets different than the standard and the “alternative” G-G&F and T&J, while characteristic plots are given in Fig. 4. In the supplementary file “grid” the extents of linearities are presented in the form of grids of thresholds (see Methods) for all the studied cases.

Fig. 1. Inter-CGI spacers’ complementary cumulative size distributions in whole chromosomes, for a specific choice of thresholds in the detection of CpG-Islands. Each sub-plot (a–f) corresponds to a chromosome of one of the examined genomes. Either the Gardiner-Garden and Frommer, G-G&F (Lmin = 200 bp, GC > 50%, CpGo/e > 0.6) or the Takai and Jones, T&J (Lmin = 500 bp, G + C > 55%, CpGo/e > 0.65) sets of thresholds have been used. Linearity in log–log scale (a power-law-like pattern) is observed. The linear segments are inferred by linear regression. Genomic curves are accompanied in each plot by 10 curves of surrogate data (continuous lines), corresponding to randomly distributed markers. Logarithms are to base 10.

An evolutionary scenario reproducing the observed power-law-like distributions based on segmental duplications and CGI loss
Segmental duplication events occurred continuously in the evolutionary past of virtually all eukaryotes (De Grassi et al., 2008;
Kehrer-Sawatzki and Cooper, 2008; Kirsch et al., 2008; McLysaght et al., 2002; Shapira and Finnerty, 1986). It is estimated that 50% of all genes in a genome are expected to duplicate giving an “offspring” at least once on time scales of 35–350 million years, not taking into account events of polyploidization (Lynch and Conery, 2000). It has been shown that at least 10% of the non-repetitive human genome consists of identifiable (i.e. relatively recent) segmental duplications (Bailey et al., 2002). Additionally, most extant taxa have experienced paleopolyploidy during their evolution (i.e. duplication of the whole genome and subsequent reduction to diploidy), see e.g. Gibson and Spring (2000), Adams and Wendel (2005), Sémon and Wolfe (2007) and references therein.

Fig. 2. Same cases of chromosomal distributions as in Fig. 1, sub-plot (a–f), but here excluding gene-related CGIs. Several alternative ways of retaining only orphan CGIs have been used. Again, power-law-like linearity is observed, often more extended than in the corresponding complete population. Surrogate data curves and linear regression are similar to Fig. 1. Logarithms are to base 10.

Segmental duplication and polyploidisation generate copies of some or all the genes of an organism, but also of other functional
Insertions of sequences increasing the total chromosomal length (these could be transposable elements.-5 – – M. the two copies continue to survive in the genome sharing the multiple functions of the initial gene between them. 1. given that the population of the genomic elements we study are (principally) functional and thus preserved by purifying selection.V. Lynch and Conery.V. – – M. microsatellite expansions etc).69 1. emancipation of genes from their associate CGI.23 G-G&F M.V. 1.64 1.09 1. 2008. cds (50) only for small genomes G-G&F Large genomes Homo sapiens Mus musculus Small genomes Apis melifera Drosophila melanogaster M.29 1.12 M. the fate of most duplicated genes is that one copy is silenced.46 M.56 2.-5 1. transposable elements (TE) or microsatellites.32 – 2.V. 1.V.V. while it is also exposed to the possibility of excision due to recombination driven eliminations. Deletions of sequence stretches (which usually are under weak or no purifying selection). 1. In the literature.81 1. – – T&J M.22 0. / Computational Biology and Chemistry 53 (2014) 84–96 genomic localizations.V.V. Moreover.96 – 1. Lynch and Conery (2000). Sémon and Wolfe (2007). 2000): In some cases one member of the gene pair adopts a new function while the other remains unchanged.V. +100 bp) for small genomes G-G&F Large genomes Bos taurus Canis familiaris Danio rerio Gallus gallus Homo sapiens Monodelphis domestica Mus musculus Small genomes Apis melifera Drosophila melanogaster M. The above fate of duplicated genes is normal to have its parallel to the fate of CGIs accompanying genes and being involved in their regulatory mechanism. Random eliminations of a number of CGIs which is lower or equal to the number of the duplicated ones. Only events i and ii are indispensable for the appearance of the power-law-like pattern.V.82 1.97 – 1.41 M. eliminations of repeat copies of the studied TE family.48 1.-5 – – M. 
and insertions of TE families more recent than the studied one) are required instead of i Table 2 Values of linearity extent E as: (i) Mean values (M.V.35 1. – – M.48 – 2. In other cases. event types iii and iv (in that context.24 1. segmental duplications occurring during genomic evolution and (in some cases) complete genome duplications. 1. which has been subsequently decomposed (for details see in the following sub-section).V.98 1. losing the ability to be transcribed. 1. 1.V. +2 kbp) only for large genomes M. We can consider that CGIs with yet unknown roles. Tsiagkas et al. 2001).-5 – – Epigenetically determined CGIs (non-TSS CGIs only) Epigenetically determined CGIs (TSS-unrelated only) M.V.87 – T&J M. this property is proven numerically to be robust to quantitative modifications of all the involved types of molecular events.18 M.43 1. 1.V.V.25 1.02 1.25 Masked genes (5 kbp. often connected to the loss of (protein-coding or not) genes due to changes of the needs of a given species is also a widespread phenomenon in genomic evolution.V. An example of relatively recent gene loss is that 60% of the olfactory receptors in the human genome (1000 genes and pseudogenes in total) have disrupted ORFs and appear to be pseudogenes. while its syntenic counterpart in another genome appears to have been emancipated from the action of its ancestral CGI.69 1.V. +100 bp).V.98 2.68 1. De Grassi et al.-5 – – M.24 2.19 1. 1. with the accelerated rate imposed by the rapid turnover of the CpG dinucleotide (Nachman and Crowell.93 1.-5 1.28 1.41 1.93 1.37 M. This model includes events of the types: i Segmental duplications of extended regions of chromosomes. The existence of another source of CGI loss can be supported by current findings of comparative genomics.09 M. in the study of non-conserved elements.19 1. Kasahara (2007). when duplicated often become superfluous and stop being under purifying selection.V.36 – Masked TSS (500 bp.g. Adams and Wendel (2005).. as all authors agree. 
However.21 1.-5 – – .19 1. Occasionally.V. i.V.-5 1.78 M.).02 1.V.55 1. orphan CGIs becoming redundant.19 1. there are reported three possibilities for the evolution of the duplicated genes (Adams and Wendel.-5 1. Moreover. Even if there are not available quantitative data for the CGI corresponding populations.31 M. 2000).26 M.-5 – – M. with one gene being under the regulatory control of a CGI. additional eliminations of non-duplicated CGIs. we may expect that loss of CGIs due to one of the above mentioned reasons.-5).V. – – M.V.-5 1.V.90 G.43 1.91 1. sequencing and analysis of the human genome has shown that segmental duplications and gene loss are widespread phenomena of genome remodeling during evolution.97 – M.84 1. ii iii iv v This step may include as limiting case whole genome duplications.12 1. – – T&J M.V.-5 1.44 1.V. and then disintegrates progressively by random mutations. In a completely different genomic framework.V. although not explicitly considered in the present implementation.70 2. e.01 – M. (100 bp.11 – 1. Masked TSS (-500 bp.-5 1.27 – M.53 1. The implementation of a “genomic duplication–CGI loss” model is based on the following genomic phenomena: occasional gene (and associate CGI) silencing and subsequent degradation. regarded as orphan. The proposed evolutionary scenario reproduces power-law-like inter-CGI distances size distributions.19 M.-5 2. These orphan CGIs will be gradually decomposed and lost.96 0.44 M.-5 1.02 1. Several cases of syntenic genes of different organisms have been shown to be under differential control of CGIs.24 2.28 – 1.41 1.38 2.e.V. +100 bp) for large genomes. 2005.g. like CGIs.V.80 0. 1. which is consistent with findings suggesting massive functional gene loss in the last 10 Myr (IHGSC.24 M. – – M. see e.-5 1.V. retroviruses.V.V. (ii) mean values for the five more extended power-law incidences cases (M. for several choices of gene-unrelated (orphan) sets of CGIs.31 M.

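As a rough illustration of event types i and ii (segmental duplication followed by loss of most duplicated CGIs), the following minimal sketch simulates markers on a single linear chromosome and then computes the complementary cumulative size distribution of the spacers between them. It is not the implementation used for the simulations reported in this article: the function names, the parameter values, the rule that only markers carried by the inserted copy are eliminated, and the least-squares slope taken over the whole size range are all assumptions made for illustration.

```python
import math
import random


def simulate_duplication_loss(n_start=200, length=1_000_000, steps=300,
                              max_seg=50_000, keep_prob=0.1, seed=0):
    """Toy run of event types i and ii on one linear chromosome.

    All parameter names and default values are illustrative assumptions.
    Returns the sorted marker ("CGI") positions and the final length.
    """
    rng = random.Random(seed)
    positions = sorted(rng.randrange(length) for _ in range(n_start))
    for _ in range(steps):
        # Event i: copy the segment [a, b) and insert the copy at position p.
        a = rng.randrange(length)
        b = min(length, a + rng.randrange(1, max_seg))
        p = rng.randrange(length)
        seg_len = b - a
        copied = [x - a + p for x in positions if a <= x < b]
        # Markers at or beyond the insertion point move downstream.
        shifted = [x + seg_len if x >= p else x for x in positions]
        # Event ii: most markers carried by the copy are eliminated.
        survivors = [x for x in copied if rng.random() < keep_prob]
        positions = sorted(shifted + survivors)
        length += seg_len
    return positions, length


def spacer_ccdf(positions):
    """Return (size, fraction of spacers with size >= that size) pairs."""
    spacers = sorted(b - a for a, b in zip(positions, positions[1:]) if b > a)
    n = len(spacers)
    ccdf = {}
    for i, s in enumerate(spacers):
        # The first index at which a size occurs counts all spacers >= it.
        ccdf.setdefault(s, (n - i) / n)
    return sorted(ccdf.items())


def loglog_slope(ccdf):
    """Least-squares slope of log10(fraction) against log10(size)."""
    xs = [math.log10(s) for s, _ in ccdf]
    ys = [math.log10(p) for _, p in ccdf]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

Calling `simulate_duplication_loss()`, passing the returned positions to `spacer_ccdf`, and plotting the pairs on double logarithmic axes gives a curve of the kind discussed in the text. The slope returned by `loglog_slope` is fitted over the entire curve, so it is only a crude stand-in for a fit restricted to the linear segment.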
and ii (see Sellis et al., 2007; Klimopoulos et al., 2012).

Events iii, iv and v are numerically shown to be dispensable for the emergence of power-laws in the computer simulations studied herein. Events of type iii, although not essential for the emergence of power-law-type linearity, are important, as several authors bring evidence about loss of CGIs over evolutionary time due to de novo methylation (see e.g. Antequera and Bird, 1993). The involvement of this type of events in the formation of the observed linearities is indicated by the relatively higher linearities observed in mouse, where higher island elimination rates are observed, as the above authors remark (see also discussion in the next sub-section).

Fig. 3. Inter-CGI spacers' complementary cumulative size distributions in whole chromosomes for epigenetically determined CGIs from the human and mouse genome. Several cases of non-TSS or TSS-unrelated (orphan) islands are considered. Surrogate data curves and linear regression are similar to Fig. 1. Logarithms are to base 10.

The inclusion of events of type iv tests the robustness of the model,
because for many organisms important parts of the genome are generated by repeat proliferation. Events of type v represent either deletion of sequence regions, usually due to unequal recombination, or gradual shrinkage by a balance of indel (insertion/deletion) events favoring decrease of the sequence length.

Fig. 4. Plots of inter-CGI spacers' complementary cumulative size distributions in whole chromosomes. Here non-standard threshold sets are used; for more details see sub-Section 2.4 and the supplementary file “grid”. Surrogate data curves and linear regression are similar to Fig. 1. Logarithms are to base 10.

The model presented herein is formally similar to another model introduced previously in order to account for the appearance of a power-law-like pattern in the chromosomal distribution of genes (Sellis and Almirantis, 2009). Both are based on an analytically solvable model introduced by Takayasu et al. (1991) for the appearance of power-law size distributions in aggregative growth of particles in physicochemical systems. Notice that the model presented herein, conceived to describe the genomic dynamics of CGIs, is not analytically solvable and thus no universal exponents (slopes for the linear segment in log–log scale) may be obtained. This is verified by all our computer
simulations and is in accordance with the variety of slope values met in the study of genomic CGI distributions. Thus, this deviation from universality is a characteristic feature shared between the genomic inter-CGI distributional patterns we describe herein and the simulations of the proposed evolutionary model. This feature, along with the common dependence on the evolutionary parameters shared between model and genomic distributions, corroborates the hypothesis that the evolutionary dynamics described by this model is at the origin of the observed genomic patterns. The similarity of the power-law-like pattern emerging in simulations of this model with the genomic distributions of inter-CGI distances cannot be considered, however, as an absolute proof that this model is at the origin of the genomic pattern. On the other hand, it is probable that the observed distributions are the result of the proposed mechanism in synergy with other aspects of genomic dynamics. More specifically, the expansion–modification model proposed by Li (1991), shown to generate long-range correlations in nucleotide sequences, might have significantly contributed to power-law-like distributional patterns of several genomic components (repeats, protein coding segments, CGIs).

For a recent in-depth view of the requirements for having a power-law, see Stumpf and Porter (2012). These authors state that these requirements include a statistically sound power-law (extended linearity in log–log scale with indications of convergence to a universal exponent) and a concrete underlying theory to support it. In our case we clearly show that the log–log linearities we observe in our genomic data (often being quite extended) lack a universal exponent (slope). Moreover, our data deviate from the typical power-law not only because they always have the linearity in log–log scale truncated at a lower and an upper cut-off (in fact, this is a feature common to all cases of naturally occurring “power-laws”) but mainly because they lack any universal exponent (slope). This is the principal reason why we call the pattern we have found “power-law-like” throughout this article. For a related discussion see Klimopoulos et al. (2012).

In the simulation of Fig. 5a, only events of the types i and ii are included (segmental duplications followed by CGI loss). We see that, as intuitively expected, power-law-like curves (linear in log–log scale) are formed in the inter-CGI spacers' size distributions, as observed in real chromosomes. A decrease in the number of allowed segmental duplications, depicted in Fig. 5b, decreases the extent of the linearity, as found in several instances, especially in small genomes. If we include additional elimination of CGIs (events of type iii), the linearity is considerably increased (see Fig. 5c). These results are compatible with the extended power-law linearity found in cases where an extended genic remodeling in the recent past of the organism is plausible (perhaps, the case of the human genome), with many genes inactivated (e.g., the case of human olfactory receptors) and several genes having their regulatory pattern changed. We may infer that gene inactivation and reassignment of the roles of genes (which may have caused their emancipation from an associate CGI) led several CGIs to cease being under purifying selection and thus they progressively decomposed.

Fig. 5. Simulations using the proposed evolutionary model. Linearity in log–log scale similar to the one observed in the plots of inter-CGI spacers' complementary cumulative size distributions is found. For details see in Section 2 and Section 4. Logarithms are to base 10.

4. Discussion

4.1. Analysis of the observed power-law-like chromosomal CGI distributions. Taxonomy-related evidence

The property of the formulated simple model to produce power-law-like distributions is an indication that it could be extended to include more detailed modeling of conservation through evolutionary time.

In the previous paragraph we consider the possibility that the genomic power-law-like size distributions of CGIs might be an indication of functionality, in line with the view that functionality is a necessary condition for CpG islands to form stable power-laws. The power-law inter-repeat size distributions of distances between TEs could be seen as a counter-example (Klimopoulos et al., 2012). However, TEs are long-lived genomic objects, under no purifying selection, recognizable in mammalian genomes up to
200 My (Ostertag and Kazazian, 2001), thus allowing the interplay of event types iii and iv of the proposed model. These are events not requiring evolutionary conservation of the involved genomic elements in order to generate power-law-like distributions. Contrary to TEs, CpG Islands retain their identity in the evolutionary time scale (i.e. they stay above given thresholds for CG% and for CpGo/e) only for relatively short times after ceasing to be under purifying selection. This is due to the mutability of CpG after being methylated, which makes its genomic turnover rate the highest of any dinucleotide, see Nachman and Crowell (2000) and Antequera (2003). CpG is, in fact, reported to disappear from a given position at least one order of magnitude faster than any other dinucleotide in the eukaryotic genome.

An immediate observation stemming from our study is that the standard threshold sets (G-G&F and T&J) may produce power-law-like distributions not only in human, mouse and other mammalian genomes, from which they are principally derived, but also in a variety of other genomes. From log–log linearities found herein one might infer that an important proportion of the CGIs detected in all examined genomes using these thresholds are functional, on the basis of the argument presented above, i.e. that log–log linearity implies functionality for CGI populations.

In H. sapiens, the maximum of the linearity in log–log scale is indeed observed for the standard threshold sets or for values being in their close vicinity, with no signs of skewed deviations to any direction in the parameter (threshold) space for all analyzed chromosomes. The situation changes, to some extent, for several of the other studied genomes, as may be revealed by an inspection of the data presented in the supplementary file “grid”. In other genomes the optimum (maximum linear extent) is positioned at lower value combinations, while sporadically high values of thresholds may perform better (see data in the “grid”). The existence of several organisms with higher power-law linearities for lower thresholds and shorter minimum length (i.e. when Lmin = 200 performs better than Lmin = 500) is compatible with the finding that “many known functional CGIs are shorter than commonly assumed – the extreme example being Xenopus CGIs” (Hackenberg et al., 2006).

Especially in mouse, where CGIs are known to be eroded, i.e. they are decreased in number while the “surviving” ones are shorter and with lower CG% and CpGo/e values (Antequera and Bird, 1993; Matsuo et al., 1993), we observe that the maximum linear length is systematically positioned at lower threshold combinations than the standard ones. Human and mouse, according to Antequera and Bird (1993), and in fact other vertebrate genomes too, are losing CpG Islands over evolutionary time due to de novo methylation in the germ line followed by CpG loss through mutation. As already mentioned, type iii events of the proposed model occur, according to the cited literature, with a particularly higher rate in rodents. Considering mean values from all chromosomes as depicted in Tables 1 and 2, by simple inspection of Table 1 we see that the mouse genome presents for G-G&F the highest mean extent, i.e. the highest mean value of log–log linearity (M.V.) of all examined genomes, this being a corroboration of the proposed model. The same optimum is reached if we examine the M.V.-5 values. Another point compatible with the model is that this optimum is reached for the relaxed (G-G&F) and not for the stringent (T&J) threshold set, as the erosion of mouse CGIs probably makes the relaxed thresholds more suitable for filtering in most of the functional CGIs. Note that this result does not contradict the earlier finding of Han and Zhao (2009) that the T&J threshold set is particularly suitable for identifying promoter-associated CGIs in human and probably other genomes as well. On the other hand, in our study we attempt to identify the distributional patterns of CGI populations of different functionalities and not only the gene-transcription related, often focusing on their non-gene-related subset (orphan islands).

On the other hand, the observation of Han and Zhao (2008) that in fish genomes stringent thresholds (T&J) gave the best results on functional CGIs determination is compatible with our finding that in D. rerio, the lengthier log–log linear extents (the only ones above 2) are found in the higher threshold value region. The prevalence of the higher thresholds in the formation of the most extended linear segments in log–log scale suggests that in fish genomes, functional islands in general might be compositionally moved toward high CG% and CpGo/e values.

Another point where findings presented herein comply with findings of other research groups following a different approach is our observation that chromosome III of Saccharomyces cerevisiae is the one presenting the globally maximal extent of linear segment (E = 1.85) and the only one which exhibits linearity for the alternative threshold sets (altG-G&F and altT&J). This is in accordance with the findings of Li et al. (1998) and Bradnam et al. (1999), see also Sharp and Lloyd (1993), about chromosome III being atypical in many regards: it presents large-scale compositional (C + G content) inhomogeneity and a high extent of compositional mosaicism of genes found only in this chromosome. It is also worthy to note that the two chromosomes (I and XI) reported by Bradnam et al. (1999) to be next to III in their degree of variation of silent-site C + G content are the ones with the next higher extents of linearity (1.6 and 1.5 respectively). These authors consider as the more likely reason for the particularities of chromosome III that it contains the mating type loci, thus being under a selective pressure protecting it from disruptions and keeping it intact in evolutionary time. Moreover, slowing down of the transposition dynamics, according to our proposed model, also has as effect the protection of the power-law-like pattern from random interruptions of the inter-CGI spacers, thus maximizing linearity in log–log scale. This has also been shown numerically through computer simulations allowing random transpositions, where a decrease of the linearity extent in log–log scale was observed (figure not included herein).

4.2. Gene-related vs. orphan CGIs

The distributions of coding segments and of genes in a variety of genomes are shown to follow a power-law-like spatial distribution, long-rangeness and fractality, by using several different approaches (Sellis and Almirantis, 2009; Athanasopoulou et al., 2010). In order to exclude the possibility that the power-law-like spatial distribution of CGIs is a mere consequence of a power-law distribution of genes, given the systematic connection of many CGIs with promoters of protein-coding genes, we extended our analysis to the subset of CGIs characterized as orphan, i.e. not spatially connected to genes. Moreover, the separated study of orphan CGIs allows for the investigation of different modes of chromosomal distribution and thus of different functionalities of this class of genomic element compared to entire CGI populations. The observation that power-law-like linearity persists (and is sporadically more extended) in orphan CGI distributions is indicative of distinct functionalities for this type of genomic localizations.

The result of Han et al. (2008) that the dog genome hosts the highest percentage of gene-unrelated functional CGIs if compared to any studied genome is compatible, in view of the proposed model, with our result (see Table 2) that the dog genome, when full gene and flanks masking is applied, represents the global optimum of linearity extent, for both standard threshold choices. Along the same lines is the observation that this genome is characterized by the maximum increase of linearity when full gene and flanks masking is applied, either for G-G&F or for T&J thresholds (if masked and unmasked genomes are compared). More specifically, we note that this increase is much higher for the G-G&F than for the T&J threshold set (0.67 vs. 0.45 respectively). This

.compbiolchem. Number of CpG islands and genes in human and mouse. Antequera. 65.. Linearity in log– log scale was often found in the range between two and three orders of magnitude. Evol. Phys. K. When a more refined annotation of functional roles of CGIs is available. A. 1003–1007.. Z.. L. Gardiner-Garden. S. We believe that only further research may resolve this ambiguity. Comp... Structure. Antequera.A. Gibson. 2005. 2010. Frommer. at least for the time being. A. Sharp. V.. E 82.. L.N.. and Philippa Constantinou for having participated in an initial stage of this study. in the online version. Zhao. presents a clearly inter-CGI linearity in log–log scale. Genome-scale analysis of metazoan replication origins reveals their organization in specific but flexible sites defined by conserved features. Zant.M. M. G + C content variation along and among saccharomyces cerevisiae chromosomes... Tsiagkas et al. Saccone. arXiv. Gene 421. 1055–1070.. Funct. A.L. The observation we made on deviations in power-law linearity extent. On the other hand. Walter.. 28. Myers.W. J. A 90. W. et al. R.A.P. We deduced that both. F. 6798–6807... P.L. Shalizi. Hackenberg. R. CG dinucleotide clustering is a species-specific property of the genome.. G.. Mol. K. Thompson.E. T.data-an].2014.doi. 446–458. Bradnam. Cayrou. 051917. Trans.. Zhao. 1–6. Comparative analysis of CpG islands in four fish genomes. Appendix A. 21.R. Karamanos. Nucl. De Grassi. However.1062v1 [physics.. A.. CpG island mapping by epigenome prediction. . Athanasopoulos. Recent segmental duplications in the human genome..F. A. 0706.. 9.. Plant. 135–141.doi. 2008.. 1999. 259–264. C. CpG islands in vertebrate genomes. Soc. Several inferences based on known differences of compositional and other characteristics between species have been made.013. Adams. 1647–1658. which have lost their function although they have not yet changed their composition due to mutational dynamics. it is noteworthy that even the A. 
Supplementary data Supplementary data associated with this article can be found. C.A. 2006. J.. Trend. Athanasopoulou. Zhao.. Z. Genet. M. 261–282.D.F. 666–675.. Oliver. Schwartz.. R. K... 8. (2010) for human and mouse chromosomes often follow power-law-like distributions (Tables 1 and 2 and detailed data available in Supplementary Table). U. Wendel. Genome Res. Olivier. Han. doi:http://dx.J. lacking the clear 95 partition into isochores observed in warm-blooded animals. although several other functions have been attributed to CGIs in the recent past.E. Z. More generally... Huberman. 143–150...org/10. A. Tables 1 and 2).doi. Fazzari. Han. Evol 16. Zipf’s law and the internet. experimentally observed methylation might be due to some specific conditions of the studied cell population and its state. 4. E.. 1438–1449. Mol. C.org/10. Hamodrakas and George C. which are compatible with the proposed model. J.. 2007. Such newly de-functionalized CGIs are also expected to participate to the linearity in log–log plots according to the proposed model. between compositionbased and methylation-state-based CGI populations. Luque-Escamilla. conserved by purifying selection) until recently.. which not only survives but increases when gene-unrelated (regarded as orphan) CGIs are studied. 2007. B. J. Polyploidy and genome evolution in plants. C. The observation of a clearly lower extent of power-law linearity in the methylation-derived CGI distributions in human and mouse raises a question about the prevalence of the methylation-based over the composition-based prediction of the CGI character.. CpGcluster: a distance-based algorithm for CpG-island detection. 3.. Clark. Evidence in favour of ancient octaploidy in the vertebrate genome. Bock. Li. Melnick.E.-H. J.. function and evolution of CpG island promoters. Nature 321. J. B. for both G-G&F and T&J threshold sets (cf. at http://dx. M. Samonte. 2002. Bailey. Natl. Seoighe.. PloS Comp. S. Z. 
formally analog to one previously proposed by our group for the understanding of genic segment distributions was formulated and simulations have shown that it might account for the distributional patterns reported herein. Rodakis (both at the Faculty of Biology. Acids Res. At the present time only CGIs involved in gene transcription represent a well defined island set. Figueroa. Adams. Genome duplication and gene-family evolution: the case of three OXPHOS gene families.. References Adamic. CpG island density and its correlations with genomic features in mammalian genomes.L.J.W. Gu. 5.R. 196. J. M. R79.J. Acknowledgements We would like to acknowledge Professors Stavros J. Reinert. J. M. Biol. Acad. Khulan..1016/j. Genomics doi:http://dx. 2007. 11995–11999. E. the taxonomy of the considered organisms and the functional role of the studied island population leave their imprint on the observed distributions. Bouhassira. Newman. 60. we expect that a treatment like the one presented herein could allow for the deduction of more nuanced conclusions. 2008. Bird.. Lanave.L.. P. 35. M... E. this is also an indication that two classes of islands (gene-related and orphan) might be distinguished using several compositional or distributional criteria in a converging way.. Scaling properties and fractality in the distribution of coding segments in eukaryotic genomes revealed through a block entropy approach. B. L. K. Coulombe.3. 1987. 2009. CpG islands or CpG clusters: how to identify functional GCrich regions in a genome? BMC Bioinformatics 10.T. Opin.. F.P. Life Sci. Rev. K.G.J. / Computational Biology and Chemistry 53 (2014) 84–96 may represent further evidence that functional gene-unrelated (orphan) islands accomplish their role under less stringent compositional constraints.V. P. Biochem. S. 1993. Eichler. 209–213. melifera genome... Greally. Previti. 
We are particularly indebted to Professor Oliver Clay who predicted the particularly high power-law-linearity of yeast chromosome III on the basis of the findings of Bradnam et al. Paulsen.. C. Oakley.. E. Su. Curr. Martinez-Aroza. Vigneron.. Biol. Clauset. 1987.H. 3. Proc. Glass.. University of Athens) for serving as academic advisors to G.. Carpena.1186/1471-2105-10-65. An alternative explanation for compositionally predicted but methylated CGIs is that these sequences could be functional islands (thus.. 2003.1155/2008/565631. Glottometrics 3. J. C.. Golden.E. Cell.. 2008. One could think that a combination of CG% and CpGo/e clearly higher than the prevailing in the bulk genome can be seen as a trustworthy marker of any CGI functionality. T. Li. M.. A. C... Bird. Genome Biol. CpG-rich islands and the function of DNA methylation.. Biol.. Sci.. 1986. Conclusions In this work we investigated the distributional features of CpG–Islands in several genomes by means of a study of inter-CGI distances considering only entire chromosomes. A limitation of the present work is that the chromosomal distribution of CGIs cannot be disentangled from the complex large-scale compositional characteristics of the genome. strengthens the complementary roles of the two approaches in the prediction of functional CGIs. (1999) about the compositional mosaicism of this chromosome. Lengauer. and they were also found to follow power-law-like distributions. Wolfe. Science 297. CGIs derived on the basis of methylation data CGIs determined on the basis of methylation data... A simple evolutionary model. provided by Illingworth et al. given that methylation is an in vivo reversible feature. 2011. P.. Almirantis. quantified by the mean value (M.08. 2000. thus their distribution also reflects the large-scale features of the isochore partition of genomes and other compositional inhomogeneities. L. 
We have studied in separate the population of CGIs which lie outside of genes and flanks considering them as gene-unrelated or orphan. Spring.V. A.M. Power-law distributions in empirical data. 2002.org/ 10. Han. studied herein.A. L. Y.. M. P. E. CpG islands as gene markers in the vertebrate nucleus. given that such locally maintained heterogeneities are indications of conservation due to functionality.. Bird..) of all chromosomes.. Mol. 342–347. BMC Bioinformatics 7.

2001. Science 335.. J. Science 290.. 29 doi:http://dx.. Hokamp. Stolovitzky.. M. Nat. Silke. E 71. Evol. Bestor. Illingworth. Shapira. Mathematics: critical truths about power laws. M.. K. D. Nucleic Acids Res. Gomez-Lopera. and uniformity between. G. L. 41–56.. possible paleopolyploidy and gene loss. Phys.. E..E. 2005. 297–304. W.. 2004. Finnerty. Rev. W. Cell Mol.S. 2008. Webb.. Damelin. K. 665–666. J. Gene. .. 2007. O.... D..N. C. Takayasu..M. 104. Lloyd. Li. M.W. Biology of mammalian L1 retrotransposons. Pitchford. 222–229. J. DNA sequences of yeast chromosomes. Statistical properties of aggregation with injection. Gruenewald-Schneider. Acad. Li. H.1016/j. 23.. Kirsch.doi.. 2012. A... Sims. Proc. A.L. Klimopoulos. 2002. A. Eichler. Stumpf.. Compositional heterogeneity within.P.. D. 108–112. Evol. Li. Cooper. Kasahara. 2007. Provata. 46. Phys. Jabbari.H. R. Glottometrics 5. Chromosome Res. Genet. K.. S. Nachman. H. Genetics 156. Conery. Batz. 143–149. 2007.G. Phys. Tsiagkas et al. P. G.. An update.. Almirantis.. J. RomanRoldan.H. P. Andrews.S.2012. R. Lynch. D... Extensive genomic duplication during early chordate evolution. Expansion-modification systems: a model for spatial 1/f spectra. Rev.. Almirantis. Trends Genet. Sémon. 31. 2009. Oliver. Crowell. The evolutionary fate and consequence of duplicate genes. M. K. C.. Martinez-Aroza. O'Donnell. Cheng. Gene doi:http://dx.A. Luque-Escamilla. T. 061925.. Curr. J.. 547–552.. 2000. 501–538... 2007. Genet. P.. Tanay. Jones.. Bernardi. J. Rate of spontaneous gene duplication at two loci of Drosophila melanogaster.H.gene.. M. Sellis. A. Rev. J..1101/gr. / Computational Biology and Chemistry 53 (2014) 84–96 IHGSC. Mol. Provata..H. 24. Reciprocal gene loss between Tetraodon and zebrafish after whole genome duplication in their ancestor.A. Takai. A. Z.. S. Gene 333.R. Matsuo. Wolfe. A 43. Phys. 2000.. Stat. Porter. G.W. 2010.L. Smith. 35. U. Takayasu. W. TpG (CpA) and TpA frequencies. 2008. 1991. 
Initial sequencing and analysis of the human genome. K..K... Munch. Minimizing errors in identifying Levy flight behaviour of organisms.H. 5240–5260. D... Righton. PLoS Genetics 6... A. Y. Sharp. S. Alu and LINE1 distributions in the human chromosomes: evidence of global genomic organization expressed in the form of power laws. S. A. 323–351. Ostertag.. Huber. 65. Y. 2012. Contemp.D. Annu. Sellis. D.... Turner. Widespread occurrence of powerlaw distributions in inter-repeat distances shaped by genome dynamics. J. Chen.02..T. 2007. H.. 447. A. Bernaola-Galvan. U. 1993.J. Compositional searching of CpG islands in the human genome.. M. C. W. Molecular mechanisms of chromosomal rearrangement during primate evolution. Bird. W.. Y.. Clay.org/10. 2385–2399.. PNAS 99. 76. Biol.076711. James. Evolutionary dynamics of segmental duplications from human Ychromosomal euchromatin/heterochromatin transition regions.. 5521–5526.. 916–928.L. R. Schaffner. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. 200–204. D. 860–921. D. 23. Genome Res... 2001. 3740–3745. Nature 409... The 2 R hypothesis. Orphan CpG islands identify numerous conserved promoters in the mammalian genome... E. 16. 2002. 1991.P. 21. Genet. 543–555. 1151–1155.108. Takahashi.. Z.. Cytosine methylation and CpG. 18–28. A. Power laws: Pareto distributions and Zipf’s law. Evidence of erosion of mouse CpG islands during mammalian evolution.E. 1986. V. M. D. Mol. M. Nat... 8. Kehrer-Sawatzki. K. Wolfe. Power-laws in the genomic distribution of coding segments in several organisms: an evolutionary trace of segmental duplications.W.96 G. McLysaght. Immunol.. P. Opin. Anim. Schempp. T. The use of genetic complementation in the study of eukaryotic macromolecular evolution. 2005. Oliver. Harrison. 2002. Ecol. Almirantis. 1993.H.. Hyperconserved CpG domains underlie Polycomb-binding sites. e1001134. Jiang. Kerr. Sellis.G. 179–183. A. Newman. J. Genome Res. 19..org/10. Somat.. M. 725–745. 19. 
Kazazian.doi..M. J. Zipf’s law everywhere.J.J. Estimate of the mutation rate per nucleotide in humans. S.005. 1998. 159–167. Regional base composition variation along yeast chromosome III: evolution of chromosome primary structure. Sci.. 14–21..