Computational Biology and Chemistry 53 (2014) 84–96


Orphan and gene related CpG Islands follow power-law-like
distributions in several genomes: Evidence of function-related and
taxonomy-related modes of distribution
Giannis Tsiagkas a, Christoforos Nikolaou b, Yannis Almirantis a,*

a Institute of Biosciences and Applications, National Center for Scientific Research “Demokritos”, 15310 Athens, Greece
b Computational Genomics Group, Department of Biology, University of Crete, 71409 Heraklion, Greece

ARTICLE INFO

Article history:
Available online 16 September 2014

ABSTRACT

CpG Islands (CGIs) are compositionally defined short genomic stretches, which have been studied in the
human, mouse, chicken and later in several other genomes. Initially, they were assigned the role of
transcriptional regulation of protein-coding genes, especially the house-keeping ones, while more
recently evidence has been found that they are involved in several other functions as well, which might
include regulation of the expression of RNA genes, DNA replication, etc. Here, an investigation of their
distributional characteristics in a variety of genomes is undertaken for both whole CGI populations as
well as for CGI subsets that lie away from known genes (gene-unrelated or “orphan” CGIs). In both cases
power-law-like linearity in double logarithmic scale is found. An evolutionary model, initially put
forward for the explanation of a similar pattern found in gene populations, is implemented. It includes
segmental duplication events and eliminations of most of the duplicated CGIs, while a moderate rate of
non-duplicated CGI eliminations is also applied in some cases. Simulations reproduce all the main
features of the observed inter-CGI chromosomal size distributions. Our results on power-law-like
linearity found in orphan CGI populations suggest that the observed distributional pattern is
independent of the analogous pattern that protein coding segments were reported to follow. The
power-law-like patterns in the genomic distributions of CGIs described herein are found to be compatible
with several other features of the composition, abundance or functional role of CGIs reported in the
current literature across several genomes, on the basis of the proposed evolutionary model.
© 2014 Elsevier Ltd. All rights reserved.

Keywords:
CpG-Islands
CGIs
Orphan CpG-Islands
CpG dinucleotide
Power-law-like distribution
Genome evolution

* Corresponding author. Tel.: +30 2106503619; fax: +30 2106511767.
E-mail address: yalmir@bio.demokritos.gr (Y. Almirantis).
http://dx.doi.org/10.1016/j.compbiolchem.2014.08.013
1476-9271/© 2014 Elsevier Ltd. All rights reserved.

1. Introduction
Genomic CpG islands, in which CpG dinucleotides are abundant
and non-methylated, were initially detected experimentally
and defined as short (hundreds of nucleotides long) stretches in
vertebrate genomes, thus offering cleavable sites for mCpG-sensitive
restriction enzymes (HTF islands, see e.g. Bird, 1986). As
lengthy DNA sequences and whole genomes became progressively
available, sequence-based definitions of CpG Islands (CGIs) and
computational algorithms using a sliding window combined with
threshold values for key quantities were put forward. Thresholds
for sequence-based search of CGIs are considered for: (i) the
minimal island length (Lmin); (ii) the percentage of cytosine and
guanine content (C + G, abbreviated in the following as CG%); and
(iii) the observed over expected frequency of occurrence of the CpG
dinucleotide (CpGo/e). Gardiner-Garden and Frommer (1987)
introduced the first widely used set of thresholds (Lmin = 200 bp,
CG > 50%, CpGo/e > 0.6; to which we will hereafter refer as “relaxed
criteria”; bp stands for ‘base pairs’), while later, Takai and Jones
(2002) used more conservative threshold values, named hereafter
“stringent criteria” (Lmin = 500 bp, CG > 55%, CpGo/e > 0.65). In the
following, these threshold choices will be abbreviated as G-G&F and
T&J respectively. Alternatively, methods based on the degree of
CpG dinucleotide clustering (Hackenberg et al., 2006; Glass et al.,
2007) and on entropic edge detection (Luque-Escamilla et al.,
2005) have also been introduced. In order to avoid false positives
(e.g. due to high C + G Alu sequences in the human genome),
especially when relaxed criteria are used, repeat-masking is
usually applied before searching for CGIs.
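The three threshold quantities can be illustrated with a minimal sketch that evaluates a single candidate window. It assumes the commonly used definition of the observed over expected ratio, CpGo/e = (number of CpG × window length) / (number of C × number of G); the function names `cgi_window_stats` and `passes_criteria` are hypothetical and do not come from the scripts used in this study.

```python
# Threshold sets quoted in the text: (Lmin in bp, minimal CG%, minimal CpGo/e).
CRITERIA = {
    "G-G&F": (200, 50.0, 0.60),   # relaxed criteria
    "T&J": (500, 55.0, 0.65),     # stringent criteria
}


def cgi_window_stats(seq):
    """Return (CG%, CpGo/e) for one sequence window."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    cg_percent = 100.0 * (c + g) / n if n else 0.0
    # Assumed definition: observed CpG count over the count expected
    # from the C and G content of the same window.
    cpg_oe = (cpg * n) / (c * g) if c and g else 0.0
    return cg_percent, cpg_oe


def passes_criteria(seq, name="G-G&F"):
    """True if the window satisfies all three thresholds of the named set."""
    lmin, cg_min, oe_min = CRITERIA[name]
    cg_percent, cpg_oe = cgi_window_stats(seq)
    return len(seq) >= lmin and cg_percent > cg_min and cpg_oe > oe_min
```

A full CGI finder would slide such a window along a repeat-masked sequence and extend or merge the qualifying windows; that part is not sketched here.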
CGIs are widely accepted as markers for the existence of protein-coding genes, the promoter regions of which are usually proximal to or overlapping with the islands. Consequently, the search for CGIs was motivated by the need to detect as yet unannotated protein-coding genes. Therefore, most of the introduced CGI finding algorithms were judged upon their ability to find CGIs in the proximity of known genes (see e.g. Bird, 1987; Han and Zhao, 2008, 2009).

The functional character of a CGI is related to the condition that its CpG dinucleotides are predominantly unmethylated, at least in the germ line. The CpG depletion of the vertebrate genome is explained on the basis of heavy methylation and the consequent mutation of 5-methylcytosine to thymine, verified by the observed increase in TpG and CpA abundances, and is extensively discussed in relation to CGIs evolution and function. See Bird (1986), Antequera (2003) and Jabbari and Bernardi (2004), where however the opposite position (i.e. the independence of CpG deficiency and TpG, CpA excess on the level of DNA methylation) is held. The propensity of 5-methylcytosine to quickly mutate to thymine has led to the conjecture that methylation is hindered in genomic sequences where CpG frequency is close to the expected one. This lack of methylation on specific locations may be seen as the result either of protection from point-mutations (mutational “cold spots”) or of purifying selection due to functional roles which are fulfilled only if specific compositional traits are preserved in the underlying sequences. Later on, Tanay et al. (2007), based on human and chimpanzee genomes’ comparison, found regions with reduced CpG mutability, while the same domains largely coincide with Polycomb-binding sites. CGIs cannot be generally seen as mutational cold spots, as there are converging data that many CGIs in several lineages have been lost when they ceased to be under purifying selection; see Antequera and Bird (1993) and Matsuo et al. (1993), in human and mouse genome comparisons. These are retained for long evolutionary time in other species where they remain functional.

After the systematic search for CGIs in the human and other genomes using the simple sequence criteria described above, researchers attempted to introduce several forms of epigenomic information. Genome-wide methylation data were not available until recently. Bock et al. (2007) used a method combining a standard sliding window algorithm with available epigenomic data for human chromosomes 21 and 22. More recently, Illingworth et al. (2010) provided a comprehensive list of CGIs in the human and mouse genomes based principally on methylation data. The most important of this work’s observations was that there exist numerous “orphan” CpG Islands (i.e. not connected to a known protein-coding gene). Illingworth and co-workers formulated a hypothesis according to which most orphan CGIs are related to functional promoters of unknown genes, many of which could be functional RNA genes. CGIs were shown to often coincide with Origins of Replication (ORIs), thus suggesting that their functional roles may extend beyond transcriptional regulation (Cayrou et al., 2011). To what extent the colocalization of CGIs and origins of replication is due to a direct causal link or an indirect effect mediated by positional preferences of gene transcription start sites remains, nonetheless, unclear.

There is not an unambiguous consensus about what an orphan CpG-Island really is. The term has been widely used to denote all CGIs which are not directly involved in the regulation of nearby genes, but the exact proximity relationships between CGI and gene bodies vary according to the study. In the following, we use different distance thresholds from the nearest gene in order to characterize a CGI as orphan, depending on genome sizes. More specifically, we have considered, in large genomes, full-length gene masking, with flanks of 5 kbp and 2 kbp (thousands of base pairs) at the 5′ and 3′ ends respectively; for more information see later on (sub-Section 2.5) and the supplementary file “Supplementary Table”.

Here we investigate the large-scale genomic distribution of CpG Islands in several genomes. In previous works (Sellis et al., 2007; Sellis and Almirantis, 2009; Klimopoulos et al., 2012) we have observed that distances between transposable elements (TE) belonging to the same family, as well as distances between protein-coding segments (PCS), often follow “power-law-like” size distributions in entire chromosomes. Such distributions present extended linear regions in log–log scale (see Methods). We have put forward an evolutionary model, including biologically plausible steps, for the temporal evolution of genomic components that are subjected to purifying selection. This model, when tested numerically, is shown to systematically generate power-law-like size distributions, like the ones found in the CGI chromosomal distributions. Our principal result herein is that CGIs also, independently of the method used for their detection, follow power-law-like size distributions, either studied as entire populations or when only orphan islands are considered. A separate study of populations of orphan CGIs is performed, principally in order to exclude that the observed power-law-like distributions in whole CGI populations are a mere consequence of similar distributions already known to exist for the inter-genic distances (Sellis and Almirantis, 2009). The persistence of orphan CGIs’ chromosomal distributions to form power-laws, when taken alone, apart from an indication of their functional role and conservation through purifying selection, can provide valuable insight into the overall organization of genome architecture and the mechanisms under which sequences are attributed with functionality. For all studied organisms, besides the “standard” G-G&F and T&J threshold sets, several additional choices of sequence composition are also tested and the resulting inter-CGI distances’ distributions are critically presented. Power-law-like distributions are extensively observed and through the proposed model we attempt to extend our understanding of the evolution and function of CGIs.

2. Methods

2.1. Origin of the genomic and epigenomic data
The genomes of the following organisms are used in this study: Apis mellifera, Bos taurus, Caenorhabditis elegans, Canis familiaris, Danio rerio, Drosophila melanogaster, Gallus gallus, Homo sapiens, Monodelphis domestica, Mus musculus, Saccharomyces cerevisiae. Information about the origin of the genomic sequences used may be found in the supplementary file “genome data”. Several or all chromosomes of each genome are studied. Only entire chromosomes are considered. These data are used for the sequence-based determination of CGIs, after masking for known repeated sequences (for all organisms apart from S. cerevisiae), by means of the algorithm described in the following sub-Section 2.2. In the cases of human and mouse genomes, the data of Illingworth et al. (2010) are also used alongside CGI coordinates derived from the implementation of compositional thresholds. Data for coordinates of CGIs derived from the methylation state of the sequence, as determined by Illingworth et al. (2010), were downloaded from the additional material of this article. Additional data for gene or CDS (i.e. protein-coding exon) coordinates were downloaded from the FTP site of NIH [ftp.ncbi.nih.gov/genomes/MapView] for each of the organisms we have studied. We used the annotation of these files in order to define the subset of CGIs which may be considered as orphan. Information about the origin of the genomic coordinates used for masking of transcription start sites (TSS) and of genes may also be found in “genome data”.

2.2. Algorithm for the extraction of CGIs coordinates in genomic sequences
The algorithm we have used for the localization of CGIs is an implementation of the method described by Takai and Jones
(2002), with free to be adjusted thresholds on Lmin, CG% and CpGo/e. In all applications the window is sliding per one nucleotide. Neighbor CGIs are merged together and then considered as a unique island, if they are separated by less than 100 base pairs and if the extended island they form meets the requirements put by the considered thresholds.

2.3. Power-laws in size distributions

2.3.1. The equations in terms of probabilities
Suppose there is a large collection of n objects (in our case spacers between CGIs), each characterized by its length S. In such random collections of objects we can approximate the distribution of their sizes with an exponential distribution (like the runs of heads in a coin-tossing experiment). Let p(S) be the probability of a spacer having length between S − s/2 and S + s/2 (where s is the size of the bin width) and N*(S) the number of spacers:

\[ N^{*}(S) = n\,p(S) \propto e^{-aS}, \qquad a > 0 \]

When scale-free clustering appears, long-range correlations extend to several length scales (ideally, in our case for the whole examined genomic region) and the spacers’ size distributions follow a power-law, which corresponds to a linear graph in a double logarithmic scale:

\[ N^{*}(S) = n\,p(S) \propto S^{-z} = S^{-1-m}, \qquad m > 0 \]

In finite size collections taken from physical systems, we cannot speak about “power-laws” but rather about “power-law-like” distributions. The intensity of the power-law-like pattern is quantified through the value of the extent (E) of the linear region in double logarithmic scale. For reviews on power-law size distributions, their properties and alternative forms see Adamic and Huberman (2002), Li (2002), Newman (2005), Clauset et al. (2007) and Stumpf and Porter (2012).

2.3.2. Original vs. cumulative size distribution
In the present work we use the “complementary cumulative distribution function” (Clauset et al., 2007) for the sizes of spacers (distances) between CpG Islands, defined as follows:

\[ P(S) = \int_{S}^{\infty} p(r)\,dr, \]

where p(r) is the original spacers’ size distribution. The cumulative distribution forms smoother “tails”, less affected by statistical fluctuations. It is also, by definition, independent of any binning choice: in a cumulative curve the value of P(S) for length S is not associated with the subset of spacers whose length falls in the same bin, as in the original distribution, but it corresponds to the number of all spacers longer than S. The cumulative form of a power-law size distribution is again a power-law characterized by an exponent (slope) equal to that of the original distribution plus 1 (i.e. minus one in absolute value): if \( p(r) \propto r^{-1-m} \), then

\[ N(S) = n\,P(S) \propto \int_{S}^{\infty} r^{-1-m}\,dr \propto S^{-m}, \]

where N(S) is the number of spacers longer or equal to S. As it is pointed out by Sims et al. (2007), the use (in double-logarithmic plots) of the slope of the cumulative form of the distribution for the estimation of the exponent of the power-law gives much more precise results than the use of the slope of the original one. This is particularly important for our study (see Section 4 for further details, where the proposed model is explained) as there appears to be no particular tendency for a unique (universal) exponent m. Although the cumulative form of the distribution may be seen as presenting more “inertia”, thus overshadowing local features, the inclusion of ten surrogate data distribution curves, along with every genomic one (see below for details), allows for an objective estimation of the linearity trend in each case.

Additionally to CGIs genomic distributions, all figures also include a bundle of ten simulated size distributions (continuous lines) where an equal number of markers, corresponding to the CGIs of the original natural sequence, are randomly positioned in a sequence of equal length. The inclusion of these random (surrogate) data sets in the figures allows for a direct visualization of the difference between observed and random distribution patterns.

2.4. Methodology for the search of threshold values for CGIs localization in several genomes
As discussed in the introduction, the threshold values for CGIs detection in the cited literature are mostly determined for the purpose of the localization of new protein-coding genes, which are known to have promoters often intersecting with CGIs. Most of this work concerns the human and a few other, warm-blooded, animal genomes. Here, in order to further elaborate on alternative modes of CGI detection, we form a two-dimensional “parameter grid” of threshold values for CG% and CpGo/e (around the G-G&F and T&J threshold choices) for either L > 200 bp or L > 500 bp, tabulating the used thresholds alongside the extent (E) of the complementary cumulative inter-CGI distances size distributions. This is done in an attempt to determine the threshold values corresponding to functional islands, which are thus preserved in evolutionary time (see in the Discussion). The grid-driven search is done only for a limited number of chromosomes, for 8 organisms out of the 11 overall studied, because power-laws for the others are found to be rudimentary, with a linear extent below unity. These grids are given in the supplementary file “grid”. The criterion adopted for choosing chromosomes for the creation of their parameter grids is to include the chromosome with the most extended linearity for the standard threshold choices G-G&F or T&J, and one or more chromosomes (taken at random) with a medium E value. In the case of C. elegans we derived a grid only for one chromosome. Also provided in the Supplementary Table (see also in the Results and Table 1) are values of E given for the standard G-G&F and T&J threshold sets and for alternative sets for the same values of CG% and CpGo/e, interchanging the minimal length thresholds, denoted by altG-G&F (for L > 500 bp) and altT&J (for L > 200 bp), for some chromosomes of the most widely studied genomes.

2.5. Details of the masking procedure
In order to define the “orphan” CGI subsets, for large genomes (actually, all the vertebrates included in our list): (i) either we masked the total length of genes by additionally masking 5 kbp upstream of the start-of-gene (TSS) and 2 kbp downstream of the end-of-gene (TES), or (ii) we masked only around the TSS by adding flanks of 500 bp upstream of the TSS and 100 bp downstream of the TSS. For small genomes (insects, worm, and S. cerevisiae): (iii) we either followed the TSS masking (narrowing the upstream masking to 100 bp) as in (ii), or (iv) we followed the TSS masking again, with symmetric flanks 50 bp upstream and downstream, but with the additional masking of each protein coding segment. The criterion adopted for choosing chromosomes for the study of the distribution of orphan CGIs is to include the chromosome with the most extended linearity for the standard threshold choices G-G&F or T&J, and several other chromosomes, taken at random, with a medium E value.
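The full-length gene masking of case (i) can be illustrated with a minimal sketch. It rests on several assumptions that the text does not spell out: genes are given as (start, end, strand) tuples with start ≤ end, coordinates are inclusive, upstream and downstream are interpreted relative to the strand, and a CGI counts as gene-related as soon as it overlaps a masked interval by one base. The function names `masked_intervals` and `orphan_cgis` are hypothetical.

```python
def masked_intervals(genes, upstream=5000, downstream=2000):
    """Extend each gene by strand-aware flanks and return (lo, hi) intervals.

    genes: iterable of (start, end, strand) with start <= end, strand "+" or "-".
    Defaults follow case (i): 5 kbp upstream of the TSS, 2 kbp downstream
    of the gene end.
    """
    intervals = []
    for start, end, strand in genes:
        if strand == "+":
            lo, hi = start - upstream, end + downstream
        else:  # on the minus strand the TSS sits at the higher coordinate
            lo, hi = start - downstream, end + upstream
        intervals.append((max(lo, 0), hi))
    return intervals


def orphan_cgis(cgis, genes, upstream=5000, downstream=2000):
    """Keep only the CGIs (start, end) that overlap no masked interval."""
    masks = masked_intervals(genes, upstream, downstream)
    return [
        (s, e)
        for s, e in cgis
        # Assumption: any overlap at all makes the island gene-related.
        if not any(s <= hi and e >= lo for lo, hi in masks)
    ]
```

The TSS-only masking of cases (ii) to (iv) could be approximated with the same helper by passing one-base "genes" located at each TSS together with the corresponding flank sizes.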

V.41 – – – M.V..39 1. 1.01 1.40 1.78 1.79 1. which have a minimal distance of 5 kbp upstream or 2 kbp downstream of any RefSeq TSS and we name them “TSS-unrelated” in the present text.01 1.V.G. We observe that power-law-like distributions are again formed. in some instances of masked genomes.70 1. 87 for whole chromosomes from the 11 considered genomes. In Table 2 mean values are given for the orphan CGI spacers’ size distribution in the same form with Table 1.-5).-5 1. Results 3.66 1.V. In Fig.18 1. More plots are included in “Supplementary Plots”. circles and diamonds are used respectively. 1.V.74 M. For genomic and for model-generated size distributions. When a masking procedure is applied in the epigenetically determined CGI collection (for details see Methods.-5). the complete sets of epigenetically determined CGI are presented (Fig. the two classes of epigenetically determined CGIs defined above are considered as orphan. like the ones found in the orphan CGI populations derived on the basis of compositional criteria.6.82 – 1.70 M.V. 31 and human chr.V.58 0.02 1. These eliminations represent cases of loss of function of ancestral CGIs with a subsequent progressive decomposition.-5 2.90 1.V. as well as the fraction of the TSS-unrelated CGIs (Fig.73 – 1. 1000 markers (representing CGIs) are randomly inserted in a sequence 2 Mbp long. In Fig.V. 1. 1. in double logarithmic scale. as described in detail in the Methods. for the same chromosomes as in Fig.4. 21. chromosomal distributions for the islands determined by Illingworth et al. for one chromosome of each organism. 1 we present examples of the complementary cumulative inter-CGI distances’ size distributions.88 1.V.13 – M.-5 1. In (b) the segmental duplications were 87 (50% of those in (a)) and the final sequence length 22 Mbp. Alternatively.V.V.50 – 1.V.15 M. Standard and alternative threshold values are considered.-5 1. Chromosomal distributions of CpG-Islands in various genomes In Fig. 
(ii) mean values of linearity extent E for the five more extended power-law incidences (M. dealing with the study of masking of threshold-based CGIs.89 1.93 – 1. 3b and e).16 – 1.V. for the human and mouse genomes.67 2. see in the Section 4. Scripts in Perl and programs in FORTRAN used for parsing genomic data files. The length reached by the simulated chromosome was 195 Mbp (steps i and ii of the model.32 1. 3. 256 additional events of non-duplicated CGIs eliminations are also allowed (step iii of the model).) for all studied chromosomes from each organism.15 – 1. summary information about non-gene-related epigenetically determined CGIs is also included.V. In Table 1 the following information is included: (i) mean values of linearity extent E (M. (ii) mean values of linearity Extent E for the five more extended power-law incidences (M. This is done in order to test the distribution of CGIs that are unlikely to have any role in the transcriptional regulation of protein coding genes at mediumdistances (and probably spatially linked).V.37 1. In Table 2.04 – M.59 2. In Supplementary Table all chromosomes are presented.52 1. G-G&F Large genomes Bos taurus Canis familiaris Danio rerio Gallus gallus Homo sapiens Monodelphis domestica Mus musculus Small genomes Apis melifera Caenorhabditis elegans Drosophila melanogaster Saccharomyces cerevisiae M. Initially. 1.37 1.86 1. 1.V.51 1.33 1.47 0.87 1.49 2.33 M. In these Tables we have included along with the standard parameter sets.V. we have formed the subset of Illingworth and co-workers original CGIs.V. while in Supplementary Table all the studied cases are included. (2010) based on methylation data are studied and the power-law-like pattern is also found.-5 – – – – 1. The principal result of this study is that power-law-like distributions of inter-CGIs distances are observed in several cases with linearity extent E > 2. 
we observe increase of the extent of linearity when orphan islands are considered instead of the full populations.79 1.67 – – – T&J M.V. 5 are representative of a large number of numerical experiments conducted. Simulations presented in Fig. Tsiagkas et al. with lengths sampled from a uniform distribution not exceeding in size the 5% of the actual length of the simulated sequence. which in two instances surpass the three orders of magnitude (in dog chr.70 M.29 1.45 M. Fig. See also the Supplementary Plots. while.02 1.54 0.56 M. computing the presented size distributions and simulating the evolutionary model are available upon request.59 M. sub-Section 2.) for all studied chromosomes from each organism. although the extent of linearity is somewhat reduced.44 1.66 – 1. see the Supplementary Table).74 1.89 M. 173 segmental duplication events occurred.29 1.19 1. Here.05 1. We have also performed gene masking and subsequently we studied the spacers’ size distribution for orphan CGIs. in some chromosomes for each organism.V.1. 5(a–c).-5 1. 3. We have also attempted a search for the determination of the form of distribution of distances between CGIs using Table 1 (i) Mean values of linearity extent E (M.08 – – – altG-G&F altT&J M. In the framework of the present study. After each such event a number of markers (CGIs) equal to 90% of the number of the duplicated ones are eliminated. 1. We name these Islands “non-TSS”. 3a and d). 2. the ones we called “alternative” in sub-Section 2.68 – 1. see sub-Section 2. 2010) not intersecting with TSSs and having a minimal distance of 100 bp from a TSS of the RefSeq annotation. Additionally.82 2. Simulations using the segmental duplication – CGI loss model Examples of simulations are given in Fig.61 0. 3c and f). 2.-5 1. 1.-5 1.24 – 1.26 – 1.87 – – – M. 1.-5 – – – – .38 M. In (a). In (c) the number of segmental duplications is as in (a) and the finally reached length 119 Mbp. examples of these distributions are presented. 
– – – – M.33 1.V.4. Power-law-like distributions of orphan CGIs are widespread.34 1. The G-G&F or T&J thresholds values have been used here. along with the corresponding CGI subsets remaining after gene-masking (non-TSS CGIs.-5 1. throughout.V. This filtering is analogous to the case (i) of the previous paragraph. – – – – 1.09 M.03 1. / Computational Biology and Chemistry 53 (2014) 84–96 We have also checked the existence of power-law-like distributions for the set of epigenetically determined CGIs (Illingworth et al.03 – Epigenetically determined CGIs (complete population) M.42 0. serving as an additional estimator of the intensity of the appearance of the power-law-like pattern per organism.5) the picture remains qualitatively the same. The full set of epigenetically determined CGI populations and their orphan subsets are presented in the Supplementary Table.V.
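The simulations described in this section (random initial markers, segmental duplications bounded by 5% of the current sequence length, elimination of a number of markers equal to 90% of the duplicated ones) can be sketched as follows. This is only an illustration and not the FORTRAN code used for the figures: the choice of a uniformly random insertion point for each copy, the uniformly random choice of the markers to eliminate, and the names `simulate_duplication_loss` and `spacer_ccdf` are all assumptions. Step iii (eliminations of non-duplicated CGIs) is not included.

```python
import random


def simulate_duplication_loss(n_markers=1000, length=2_000_000, n_events=173,
                              max_frac=0.05, loss_frac=0.9, seed=0):
    """Return (sorted marker positions, final sequence length)."""
    rng = random.Random(seed)
    markers = rng.sample(range(length), n_markers)
    for _ in range(n_events):
        # Segment length: uniform, at most max_frac of the current length.
        seg_len = rng.randint(1, max(1, int(max_frac * length)))
        start = rng.randrange(length - seg_len + 1)
        offsets = [m - start for m in markers if start <= m < start + seg_len]
        # Assumption: the copy is inserted at a uniformly random position.
        ins = rng.randrange(length + 1)
        markers = [m if m < ins else m + seg_len for m in markers]
        markers.extend(ins + off for off in offsets)
        length += seg_len
        # Eliminate randomly chosen markers, 90% of the number duplicated.
        n_loss = min(round(loss_frac * len(offsets)), len(markers) - 2)
        if n_loss > 0:
            doomed = set(rng.sample(range(len(markers)), n_loss))
            markers = [m for i, m in enumerate(markers) if i not in doomed]
    return sorted(markers), length


def spacer_ccdf(sorted_markers):
    """Return (sizes, counts): counts[i] spacers are >= sizes[i]."""
    spacers = sorted(b - a for a, b in zip(sorted_markers, sorted_markers[1:]))
    sizes, counts = [], []
    for i, s in enumerate(spacers):
        if not sizes or s != sizes[-1]:
            sizes.append(s)
            counts.append(len(spacers) - i)
    return sizes, counts
```

Plotting log10(counts) against log10(sizes) gives the complementary cumulative curve; a surrogate curve can be obtained by passing an equal number of uniformly random positions in a sequence of the final length to `spacer_ccdf`.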

Tsiagkas et al. Each sub-plot (a–f) corresponds to a chromosome of one of the examined genomes. Either the Gardiner–Garden and Frommer.88 G. while characteristic plots are given in Fig. The linear segments are inferred by linear regression. corresponding to randomly distributed markers.2. Logarithms are to base 10. [(Fig. Linearity in log–log scale (a power-law-like pattern) is observed. Genomic curves are accompanied in each plot by 10 curves of surrogate data (continuous lines). 4. 3. for a specific choice of thresholds in the detection of CpG–Islands. . / Computational Biology and Chemistry 53 (2014) 84–96 threshold sets different than the standard and the “alternative” G-G&F and T&J..6) or the Takai and Jones. In the supplementary file “grid” the extents of linearities are presented in the form of grids of thresholds (see Methods) for all the studied cases. CpGo/e > 0. G-G&F (Lmin = 200 bp. An evolutionary scenario reproducing the observed power-lawlike distributions based on segmental duplications and CGI loss Segmental duplication events occurred continuously in the evolutionary past of virtually all eukaryotes (De Grassi et al.65) sets of thresholds have been used. 2008. CpGo/e > 0. Inter-CGI spacers’ complementary cumulative size distributions in whole chromosomes. GC > 50%. 1. T&J (Lmin = 500 bp._1)TD$IG] Fig. G + C > 55%.

_2)TD$IG] Fig. but here excluding gene-related CGIs. / Computational Biology and Chemistry 53 (2014) 84–96 Kehrer-Sawatzki and Cooper. Additionally.e. Again. Segmental duplication and polyploidisation generate copies of some or all the genes of an organism. 2008. power-law-like linearity is observed. but also of other functional [(Fig. often more extended than in the corresponding complete population. Kirsch et al. Sémon and Wolfe (2007) and references therein. 2008.. 1. Surrogate data curves and linear regression are similar to Fig.e. 1. 2002. 2000). 2. see e.. Logarithms are to base 10. 1986). Several alternative ways of retaining only orphan CGIs have been used. 89 Adams and Wendel (2005). relatively recent) segmental duplications (Bailey et al.g. not taking into account events of polyploidization (Lynch and Conery. McLysaght et al. Same cases of chromosomal distributions as in Fig. It has been shown that at least 10% of the non-repetitive human genome consists of identifiable (i.G. Gibson and Spring (2000). It is estimated that 50% of all genes in a genome are expected to duplicate giving an “offspring” at least once on time scales of 35–350 million years.. Tsiagkas et al. sub-plot (a–f). . 2002). Shapira and Finnerty. duplication of the whole genome and subsequent reduction to diploidy). most extant taxa have experienced paleopolyploidy during their evolution (i.

genomic localizations.

Table 2
Values of linearity extent E as: (i) mean values (M.V.); (ii) mean values for the five more extended power-law incidences cases (M.V.-5), for several choices of gene-unrelated (orphan) sets of CGIs.
Threshold sets: G-G&F; T&J.
Sets of CGIs: Masked genes (5 kbp, +2 kbp) only for large genomes, cds (50) only for small genomes; Masked TSS (-500 bp, +100 bp) for large genomes, (100 bp, +100 bp) for small genomes; Epigenetically determined CGIs (non-TSS CGIs only); Epigenetically determined CGIs (TSS-unrelated only).
Large genomes: Bos taurus, Canis familiaris, Danio rerio, Gallus gallus, Homo sapiens, Monodelphis domestica, Mus musculus. Small genomes: Apis mellifera, Drosophila melanogaster.
[Table body: numeric entries not recoverable from the source text.]

The proposed evolutionary scenario reproduces power-law-like inter-CGI distances size distributions. Moreover, this property is proven numerically to be robust to quantitative modifications of all the involved types of molecular events.

Moreover, sequencing and analysis of the human genome has shown that segmental duplications and gene loss are widespread phenomena of genome remodeling during evolution, see e.g. Lynch and Conery (2000), Adams and Wendel (2005), Kasahara (2007), Sémon and Wolfe (2007), De Grassi et al. (2008). In the literature, there are reported three possibilities for the evolution of the duplicated genes (Adams and Wendel, 2005; Lynch and Conery, 2000): In some cases one member of the gene pair adopts a new function while the other remains unchanged. In other cases, the two copies continue to survive in the genome sharing the multiple functions of the initial gene between them. However, as all authors agree, the fate of most duplicated genes is that one copy is silenced, losing the ability to be transcribed, and then disintegrates progressively by random mutations, while it is also exposed to the possibility of excision due to recombination driven eliminations.

The above fate of duplicated genes is normal to have its parallel to the fate of CGIs accompanying genes and being involved in their regulatory mechanism. We can consider that CGIs with yet unknown roles, regarded as orphan, when duplicated often become superfluous and stop being under purifying selection. These orphan CGIs will be gradually decomposed and lost, with the accelerated rate imposed by the rapid turnover of the CpG dinucleotide (Nachman and Crowell, 2000).

The existence of another source of CGI loss can be supported by current findings of comparative genomics. Several cases of syntenic genes of different organisms have been shown to be under differential control of CGIs, with one gene being under the regulatory control of a CGI, while its syntenic counterpart in another genome appears to have been emancipated from the action of its ancestral CGI. Even if there are not available quantitative data for the CGI corresponding populations, we may expect that loss of CGIs due to one of the above mentioned reasons, often connected to the loss of (protein-coding or not) genes due to changes of the needs of a given species is also a widespread phenomenon in genomic evolution. An example of relatively recent gene loss is that 60% of the olfactory receptors in the human genome (1000 genes and pseudogenes in total) have disrupted ORFs and appear to be pseudogenes, which is consistent with findings suggesting massive functional gene loss in the last 10 Myr (IHGSC, 2001).

The implementation of a “genomic duplication–CGI loss” model is based on the following genomic phenomena: occasional gene (and associate CGI) silencing and subsequent degradation, emancipation of genes from their associate CGI, which has been subsequently decomposed (for details see in the following sub-section), orphan CGIs becoming redundant, segmental duplications occurring during genomic evolution and (in some cases) complete genome duplications. This model includes events of the types:

i Segmental duplications of extended regions of chromosomes. This step may include as limiting case whole genome duplications, although not explicitly considered in the present implementation.
ii Random eliminations of a number of CGIs which is lower or equal to the number of the duplicated ones.
iii Occasionally, additional eliminations of non-duplicated CGIs.
iv Insertions of sequences increasing the total chromosomal length (these could be transposable elements, retroviruses, microsatellite expansions etc).
v Deletions of sequence stretches (which usually are under weak or no purifying selection).

Only events i and ii are indispensable for the appearance of the power-law-like pattern, given that the population of the genomic elements we study are (principally) functional and thus preserved by purifying selection, like CGIs. In a completely different genomic framework, i.e. in the study of non-conserved elements, e.g. transposable elements (TE) or microsatellites, event types iii and iv (in that context, eliminations of repeat copies of the studied TE family, and insertions of TE families more recent than the studied one) are required instead of i and ii (see Sellis et al., 2007; Klimopoulos et al., 2012).

Fig. 3. Inter-CGI spacers’ complementary cumulative size distributions in whole chromosomes for epigenetically determined CGIs from the human and mouse genome. Several cases of non-TSS or TSS-unrelated (orphan) islands are considered. Surrogate data curves and linear regression are similar to Fig. 1. Logarithms are to base 10.

Events iii, iv and v are numerically shown to be dispensable for the emergence of power-laws in the computer simulations studied herein. Events of type iii, although not essential for the emergence of power-law-type linearity, are important, as several authors bring evidence about loss of CGIs over evolutionary time due to de novo methylation (see e.g. Antequera and Bird, 1993). The involvement of this type of events in the formation of the observed linearities is indicated by the relatively higher linearities observed in mouse where higher island elimination rates are observed, as the above authors remark (see also discussion in the next sub-section). The inclusion of events type iv tests the robustness of the model, because for many organisms important parts of the genome are generated by repeat proliferation. Events of type v represent either deletion of sequence regions usually due to unequal recombination or gradual shrinkage by a balance of indel (insertion/deletion) events favoring decrease of the sequence length.

Fig. 4. Plots of inter-CGI spacers’ complementary cumulative size distributions in whole chromosomes. Here non-standard threshold sets are used, for more details see the sub-Section 2.4 and the supplementary file “grid”. Surrogate data curves and linear regression are similar to Fig. 1. Logarithms are to base 10.

The model presented herein is formally similar to another model introduced previously, in order to account for the appearance of a power-law-like pattern in the chromosomal distribution of genes (Sellis and Almirantis, 2009). Both are based on an analytically solvable model introduced by Takayasu et al. (1991) for the appearance of power-law size distributions in aggregative growth of particles in physicochemical systems. Notice that the model presented herein, conceived to describe the genomic dynamics of CGIs, is not analytically solvable and thus no universal exponents (slopes for the linear segment in log–log scale) may be obtained. This is verified by all our computer simulations and is in accordance with the variety of slope values met in the study of genomic CGI distributions.

For a recent in depth view of the requirements for having a power-law, see Stumpf and Porter (2012). These authors state that these requirements include a statistically sound power-law (extended linearity in log–log scale with indications of convergence to a universal exponent) and a concrete underlying theory to support it. In our case we clearly show that the log–log linearities we observe in our genomic data (often being quite extended) lack a universal exponent (slope). More specifically, our data deviate from the typical power-law not only because they always have the linearity in log–log scale truncated at a lower and an upper cut-off (in fact, this is a feature common to all cases of naturally occurring “power-laws”) but mainly because they lack any universal exponent (slope). This is the principal reason why we call the pattern we have found “power-law-like” throughout this article. On the other hand, this deviation from universality is a characteristic feature shared between the genomic inter-CGI distributional patterns we describe herein and the simulations of the proposed evolutionary model. This feature, along with the common dependence on the evolutionary parameters shared between model and genomic distributions, corroborates the hypothesis that the evolutionary dynamics described by this model is at the origin of the observed genomic patterns. For a related discussion see Klimopoulos et al. (2012).

Fig. 5. Simulations using the proposed evolutionary model. Linearity in log–log scale similar to the one observed in the plots of inter-CGI spacers’ complementary cumulative size distributions is found. For details see in Section 2 and Section 4. Logarithms are to base 10.

In the simulation of Fig. 5a, only events of the types i and ii are included (segmental duplications followed by CGI loss). We see that, as intuitively expected, power-law-like curves (linear in log–log scale) are formed in the inter-CGI spacers' size distributions. A decrease in the number of allowed segmental duplications, depicted in Fig. 5b, decreases the extent of the linearity. If we include additional elimination of CGI (events of type iii), the linearity is considerably increased (see Fig. 5c). These results are compatible with extended power-law linearity found, as observed in real chromosomes, in cases where an extended genic remodeling in the recent past of the organism is plausible (perhaps, the case of the human genome) with many genes inactivated (e.g. the case of human olfactory receptors) and several genes having their regulatory pattern changed. We may infer that gene inactivation and reassignment of the roles of genes (which may have caused their emancipation from an associate CGI) led several CGIs to cease being under purifying selection and thus they progressively decomposed.

4. Discussion

4.1. Analysis of the observed power-law-like chromosomal CGI distributions. Taxonomy-related evidence

The property of the formulated simple model to produce power-law-like distributions is an indication that it could be extended to include more detailed modeling of conservation through evolutionary time, in line with the view that functionality is a necessary condition for CpG islands to form stable power-laws. The similarity of the power-law-like pattern emerging in simulations of this model with the genomic distributions of inter-CGI distances cannot be considered, however, as an absolute proof that this model is at the origin of the genomic pattern. Thus, it is probable that the observed distributions are the result of the proposed mechanism in synergy with other aspects of genomic dynamics. Moreover, especially in small genomes, the expansion–modification model proposed by Li (1991), shown to generate long-range correlations in nucleotide sequences might have significantly contributed to power-law-like distributional patterns of several genomic components (repeats, protein coding segments, CGIs), as found in several instances.

In the previous paragraph we consider the possibility that the genomic power-law-like size distributions of CGIs might be an indication of functionality. The power-law inter-repeat size distributions of distances between TEs could be seen as a counter-example (Klimopoulos et al., 2012). However, as extensively discussed therein, TEs are long-lived genomic objects, recognizable in mammalian genomes up to 200 My (Ostertag and Kazazian, 2001) thus allowing the interplay of event types iii and iv of the proposed model. These are events not requiring evolutionary conservation of the involved genomic elements in order to generate power-law-like distributions. This has also been shown numerically through computer simulations allowing random transpositions. Contrary to TEs, CpG Islands retain their identity in the evolutionary time scale (i.e. they stay above given thresholds for CG% and for CpGo/e) only for relatively short times after ceasing to be under purifying selection. This is due to the mutability of CpG after being methylated, which makes its genomic turnover rate to be the highest of any other dinucleotide. CpG is, in fact, reported to disappear from a given position at least one order of magnitude faster than any other dinucleotide in the eukaryotic genome, see Nachman and Crowell (2000) and Antequera (2003).

An immediate observation stemming from our study is that the standard threshold sets (G-G&F and T&J) may produce power-law-like distributions not only in human, mouse and other mammalian genomes, but also in a variety of other genomes. From log–log linearities found herein one might infer that an important proportion of the CGIs detected in all examined genomes using these thresholds are functional, on the basis of the argument presented above: i.e., in view of the proposed model, that log–log linearity implies functionality for CGI populations. In H. sapiens, the maximum of the linearity in log–log scale is indeed observed for the standard threshold sets or for values being in their close vicinity, with no signs of skewed deviations to any direction in the parameter (threshold) space for all analyzed chromosomes. The situation changes, to some extent, for several of the other studied genomes as may be revealed by an inspection of the data presented in the supplementary file “grid”. More specifically, the observation of Han and Zhao (2008) that in fish genomes stringent thresholds (T&J) gave the best results on functional CGIs determination is compatible with our finding that in D. rerio, the lengthier log–log linear extents (the only ones above 2) are found in the higher threshold value region. The prevalence of the higher thresholds in the formation of the most extended linear segments in log–log scale suggests that in fish genomes, functional islands in general might be compositionally moved toward high CG % and CpGo/e values. In other genomes the optimum (maximum linear extent) is positioned at lower value combinations, while sporadically high values of thresholds may perform better (see data in the “grid”). The existence of several organisms with higher power law linearities for lower thresholds and shorter minimum length (i.e. when Lmin = 200 performs better than Lmin = 500) is compatible with the finding that “many known functional CGIs are shorter than commonly assumed – the extreme example being Xenopus CGIs” (Hackenberg et al., 2006). Especially in mouse, where CGIs are known to be eroded, i.e. they are decreased in number while the “surviving” ones are shorter and with lower CG% and CpGo/e values (Antequera and Bird, 1993; Matsuo et al., 1993) we observe that the maximum linear length is systematically positioned at lower threshold combinations than the standard ones.

Human and mouse, and in fact other vertebrate genomes too, according to Antequera and Bird (1993), are losing CpG Islands over evolutionary time due to de novo methylation in the germ line followed by CpG loss through mutation. On the other hand, type iii events of the proposed model occur according to the cited literature with a particularly higher rate in rodents. Considering mean values from all chromosomes as depicted in Tables 1 and 2, by simple inspection of Table 1 we see that the mouse genome presents for G-G&F the highest mean extent, i.e. represents the global optimum of linearity extent, this being a corroboration of the proposed model. The same optimum is reached if we examine the M.V.-5 values. Another point compatible with the model is that this optimum is reached for the relaxed (G-G&F) and not for the stringent (T&J) threshold set, as the erosion of mouse CGIs probably makes the relaxed thresholds more suitable for filtering in most of the functional CGIs. Note that this result does not contradict the earlier finding of Han and Zhao (2009) that the T&J threshold set is particularly suitable for identifying promoter-associated CGIs in human and probably other genomes as well. As already mentioned, in our study we attempt to identify the distributional patterns of CGI populations of different functionalities and not only the gene-transcription related, often focussing on their non-gene-related subset (orphan islands).

Another point where findings presented herein comply with findings of other research groups following a different approach is our observation that chromosome III of Saccharomyces cerevisiae is the one presenting the globally maximal extent of linear segment (E = 1.85) and the only one which exhibits linearity for alternative threshold set (altG-G&F and altT&J). This is in accordance with the findings of Li et al. (1998) and Bradnam et al. (1999), see also Sharp and Lloyd (1993), about chromosome III being atypical in many regards: It presents large-scale compositional (C + G content) inhomogeneity and high extent of compositional mosaicism of genes found only in this chromosome. These authors consider as the more likely reason for the particularities of chromosome III that it contains the mating type loci thus being under a selective pressure protecting it from disruptions and keeping it intact in evolutionary time. Moreover, slowing down of the transposition dynamics also has as effect the protection of the power-law-like pattern from random interruptions of the inter-CGI spacers thus maximizing linearity in log–log scale. It is also worthy to note that the two chromosomes (I and XI) reported by Bradnam et al. (1999) to be next to III in their degree of variation of silent-site C + G content are the ones with the next higher extents of linearity (1.6 and 1.5 respectively).

4.2. Gene-related vs. orphan CGIs

The distributions of coding segments and of genes in a variety of genomes are shown to follow a power-law-like spatial distribution, long-rangeness and fractality, by using several different approaches (Sellis and Almirantis, 2009; Athanasopoulou et al., 2010). In order to exclude the possibility that the power-law-like spatial distribution of CGIs is a mere consequence of a power-law distribution of genes, from which they are principally derived, given the systematic connection of many CGIs with promoters of protein-coding genes, we extended our analysis to the subset of CGIs characterized as orphan, i.e. not spatially connected to genes, to all examined genomes. On the other hand, the separated study of orphan CGIs allows for the investigation of different modes of chromosomal distribution and thus of different functionalities of this class of genomic element compared to entire CGI populations. The observation that power-law-like linearity persists (and is sporadically more extended) in orphan CGI distributions is indicative of distinct functionalities for this type of genomic localizations.

The result of Han et al. (2008) that the dog genome hosts the highest percentage of gene-unrelated functional CGIs if compared to any studied genome is compatible, according to our proposed model, with our result (see Table 2) that the dog genome, when full gene and flanks masking is applied, presents, for both standard threshold choices, the highest mean value of log–log linearity (M.V.) of all examined genomes. Along the same lines is the observation that this genome is characterized by the maximum of increase of linearity when full gene and flanks masking is applied either for G-G&F or for T&J thresholds (if masked and unmasked genomes are compared). On the other hand, we note that this increase is much higher for the G-G&F than for the T&J threshold set (0.67 vs. 0.45 respectively). This may represent further evidence that functional gene-unrelated (orphan) islands accomplish their role under less stringent compositional constraints.
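
The discussion above repeatedly refers to CGIs staying above thresholds for minimum length (Lmin), CG% and CpGo/e, under a relaxed (G-G&F) and a stringent (T&J) threshold set. The following is a minimal sketch of such a compositional test, not the procedure used in this study. The numerical thresholds (200 bp, 50%, 0.60 and 500 bp, 55%, 0.65) are the values commonly quoted for these two sets and are an assumption here, as is the usual observed/expected formula CpGo/e = (number of CpG x sequence length) / (number of C x number of G). All function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdSet:
    """Compositional criteria for calling a sequence window a CpG island."""
    min_length: int      # Lmin, in bp
    min_gc: float        # minimum C+G fraction (0..1)
    min_cpg_oe: float    # minimum observed/expected CpG ratio


# Assumed values: commonly quoted relaxed and stringent threshold sets.
G_GF = ThresholdSet(min_length=200, min_gc=0.50, min_cpg_oe=0.60)
T_J = ThresholdSet(min_length=500, min_gc=0.55, min_cpg_oe=0.65)


def gc_fraction(seq: str) -> float:
    """Fraction of C and G nucleotides in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("C") + seq.count("G")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact number of CpGs.
    return seq.count("CG") * len(seq) / (n_c * n_g)


def passes(seq: str, thresholds: ThresholdSet) -> bool:
    """True if the sequence satisfies all three criteria of the threshold set."""
    return (
        len(seq) >= thresholds.min_length
        and gc_fraction(seq) >= thresholds.min_gc
        and cpg_obs_exp(seq) >= thresholds.min_cpg_oe
    )
```

For instance, a 300 bp window can satisfy the relaxed set while failing the stringent one on length alone, which illustrates why shorter functional islands are only retained under Lmin = 200.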

D. 135–141.. Acad....1062v1 [physics. Bradnam. Greally. J.. Reinert. CGIs derived on the basis of methylation data CGIs determined on the basis of methylation data. P. C. 2000.. C.R. 2008. 2009.. A simple evolutionary model.2014. in the online version. 2006.T. 259–264. 4.. Biochem.. R. Gibson. 0706. Martinez-Aroza. Antequera. L.H...013. The observation of a clearly lower extent of power-law linearity in the methylation-derived CGI distributions in human and mouse raises a question about the prevalence of the methylation-based over the composition-based prediction of the CGI character. Gene 421.org/ 10.. 051917. Trans. Appendix A. J. L. M.. function and evolution of CpG island promoters. . at http://dx. 261–282.. G... Cell. C. However.. 2007. A limitation of the present work is that the chromosomal distribution of CGIs cannot be disentangled from the complex large-scale compositional characteristics of the genome. J.. Huberman. 1647–1658.E. BMC Bioinformatics 7. although several other functions have been attributed to CGIs in the recent past. lacking the clear 95 partition into isochores observed in warm-blooded animals. Proc. Comparative analysis of CpG islands in four fish genomes.. 65. M. PloS Comp. (1999) about the compositional mosaicism of this chromosome. Tables 1 and 2). A 90.W.. Bird. presents a clearly inter-CGI linearity in log–log scale. 1438–1449. Rodakis (both at the Faculty of Biology. E.. 1003–1007. U. A. E. A.. 1987.org/10.E.doi.1155/2008/565631. Several inferences based on known differences of compositional and other characteristics between species have been made. When a more refined annotation of functional roles of CGIs is available. Biol. 1999. Antequera. Phys... We deduced that both. 8.L.J.. J. Acids Res. C. S. 3. / Computational Biology and Chemistry 53 (2014) 84–96 may represent further evidence that functional gene-unrelated (orphan) islands accomplish their role under less stringent compositional constraints. 
Scaling properties and fractality in the distribution of coding segments in eukaryotic genomes revealed through a block entropy approach. An alternative explanation for compositionally predicted but methylated CGIs is that these sequences could be functional islands (thus.. R79. 1055–1070.A. Adams. Genet. Biol. Olivier. quantified by the mean value (M. M. Life Sci. Oakley.A. we expect that a treatment like the one presented herein could allow for the deduction of more nuanced conclusions.... E 82. Zhao. Samonte. Luque-Escamilla. Carpena. 2010. Khulan. Z.M. 196. We have studied in separate the population of CGIs which lie outside of genes and flanks considering them as gene-unrelated or orphan. Opin. Adams.. Trend. M... A. Bouhassira.L. strengthens the complementary roles of the two approaches in the prediction of functional CGIs. Karamanos. B..V. Funct. Newman.P. Figueroa. Genomics doi:http://dx. T.G. Z. G + C content variation along and among saccharomyces cerevisiae chromosomes. E..E. CpG-rich islands and the function of DNA methylation.. Z. L. 1993. Bird.A. Linearity in log– log scale was often found in the range between two and three orders of magnitude.A. Fazzari. Zipf’s law and the internet.L.M. Supplementary data Supplementary data associated with this article can be found. One could think that a combination of CG% and CpGo/e clearly higher than the prevailing in the bulk genome can be seen as a trustworthy marker of any CGI functionality. 21. Previti..F. Sharp.. Cayrou. doi:http://dx.. E. et al. L. Science 297. 9.J.. P. C. the taxonomy of the considered organisms and the functional role of the studied island population leave their imprint on the observed distributions. it is noteworthy that even the A. Tsiagkas et al.. 2007. Glottometrics 3. References Adamic. R.compbiolchem. B..R. Such newly de-functionalized CGIs are also expected to participate to the linearity in log–log plots according to the proposed model. Zhao..1186/1471-2105-10-65. 
(2010) for human and mouse chromosomes often follow power-law-like distributions (Tables 1 and 2 and detailed data available in Supplementary Table). A. Zhao. We are particularly indebted to Professor Oliver Clay who predicted the particularly high power-law-linearity of yeast chromosome III on the basis of the findings of Bradnam et al. Eichler..08. J. Lanave. Han. 2007. 11995–11999. A.. The observation we made on deviations in power-law linearity extent... 2002.. Acknowledgements We would like to acknowledge Professors Stavros J. Clark. 446–458. K.. 60. Number of CpG islands and genes in human and mouse. Soc. S. Spring. At the present time only CGIs involved in gene transcription represent a well defined island set. 5.J. Bock. Myers. M. Wendel. Plant.. this is also an indication that two classes of islands (gene-related and orphan) might be distinguished using several compositional or distributional criteria in a converging way.. 2003. Coulombe. V. at least for the time being. K. between compositionbased and methylation-state-based CGI populations. CpGcluster: a distance-based algorithm for CpG-island detection. Evol. Hackenberg. Gardiner-Garden. experimentally observed methylation might be due to some specific conditions of the studied cell population and its state. 2008. 35. L. which are compatible with the proposed model.data-an].. CpG islands in vertebrate genomes. 342–347. R. A. Evidence in favour of ancient octaploidy in the vertebrate genome. J.F. for both G-G&F and T&J threshold sets (cf. C. Mol. P. J. Glass. Recent segmental duplications in the human genome. M. M. CG dinucleotide clustering is a species-specific property of the genome.V.) of all chromosomes. 209–213. Oliver. Paulsen.E. E. Rev. M.. Genome Res. CpG islands or CpG clusters: how to identify functional GCrich regions in a genome? BMC Bioinformatics 10.. C. A. 28.. which not only survives but increases when gene-unrelated (regarded as orphan) CGIs are studied. Genome Biol.. Sci. 143–150. P. 
F.1016/j. Gu. 1986.. Z. J. Clauset. Walter. 6798–6807.P. T. Golden. Evol 16... Biol. Vigneron. Wolfe. 2002. 2011. given that such locally maintained heterogeneities are indications of conservation due to functionality. F. Han. Melnick.. 1987. Conclusions In this work we investigated the distributional features of CpG–Islands in several genomes by means of a study of inter-CGI distances considering only entire chromosomes. Zant. 3. We believe that only further research may resolve this ambiguity. arXiv. Curr. Bailey. Polyploidy and genome evolution in plants. Li. given that methylation is an in vivo reversible feature. Genome duplication and gene-family evolution: the case of three OXPHOS gene families. formally analog to one previously proposed by our group for the understanding of genic segment distributions was formulated and simulations have shown that it might account for the distributional patterns reported herein. melifera genome. J.. W... Han.W. Natl. studied herein... Schwartz. CpG islands as gene markers in the vertebrate nucleus.L.. Lengauer. Nature 321. More generally.org/10. Thompson. Genome-scale analysis of metazoan replication origins reveals their organization in specific but flexible sites defined by conserved features. Li. conserved by purifying selection) until recently. De Grassi.. Saccone.-H. thus their distribution also reflects the large-scale features of the isochore partition of genomes and other compositional inhomogeneities.J. University of Athens) for serving as academic advisors to G. and Philippa Constantinou for having participated in an initial stage of this study. K.3. Power-law distributions in empirical data.doi.N. K. Mol. P. Athanasopoulos. Comp. Seoighe. 2005.. Hamodrakas and George C.. Mol. Shalizi. and they were also found to follow power-law-like distributions. CpG island density and its correlations with genomic features in mammalian genomes. provided by Illingworth et al. Athanasopoulou... S.. A. 2008. 
which have lost their function although they have not yet changed their composition due to mutational dynamics.. Nucl. Bird. Su. 1–6.doi. 666–675. CpG island mapping by epigenome prediction. K. On the other hand. Almirantis. Frommer. Structure. B. Y.

. K.gene. Gene doi:http://dx. M.. O'Donnell.. E 71. R. D... Genet. Y. C.W. Nature 409. Nachman.1016/j. Illingworth. 2002. 2008. D. Statistical properties of aggregation with injection. 725–745. 2008. Matsuo.. Turner. H. 76. Kazazian. 35. Stat. Rev. H... 2007. 860–921. Almirantis. 2012. Sellis.. Cytosine methylation and CpG. 222–229. Compositional heterogeneity within. Immunol. Crowell.. M. Mol. D. W.. 547–552. 5521–5526. Tsiagkas et al. 1993.G.. W.. 143–149. 179–183. Contemp. Smith. Sims. Proc.. Minimizing errors in identifying Levy flight behaviour of organisms.S. A.. 2012.N. M. Initial sequencing and analysis of the human genome... Luque-Escamilla. 14–21. Sharp. J. 108–112.E. K. J.W. 916–928.. J. G. Jabbari. D.. Evolutionary dynamics of segmental duplications from human Ychromosomal euchromatin/heterochromatin transition regions.. P. M.W. . C. Ostertag. Genet.. Genome Res. 18–28. Sellis. 5240–5260.. Phys. Silke.doi. P.. Genet. Klimopoulos. Andrews. Schempp. 8.. 2007. Kirsch. U. 1151–1155. C. 2010. A... 1991.. 2002. Annu.1101/gr.T... Expansion-modification systems: a model for spatial 1/f spectra.. Bestor. Takayasu. 3740–3745.. Chromosome Res. A. J. S. 2001. Newman. Acad. Regional base composition variation along yeast chromosome III: evolution of chromosome primary structure.J. Oliver. J. Power-laws in the genomic distribution of coding segments in several organisms: an evolutionary trace of segmental duplications..M. Chen.076711. K. W.J. Bernardi. Munch.. A.org/10..L. Curr. Jiang. 323–351.. M. 543–555.J. Li. McLysaght. James.A. Gene.. Z.. TpG (CpA) and TpA frequencies...S. Gomez-Lopera. R. Eichler. 2000. Nat. PLoS Genetics 6. Bernaola-Galvan. Nucleic Acids Res.. Genome Res. 2005. Conery. and uniformity between. 1998.R. Takahashi.K.H... Tanay.. Hyperconserved CpG domains underlie Polycomb-binding sites. Estimate of the mutation rate per nucleotide in humans. 2009. 665–666. Rev. 16. Evidence of erosion of mouse CpG islands during mammalian evolution. Almirantis. 
Y. Gruenewald-Schneider.. D. G. Webb. P.. D. RomanRoldan.. 104.doi. J. Biol.H. 1986. A.org/10.H.96 G.D... Wolfe.. R. Hokamp. The 2 R hypothesis.. Mathematics: critical truths about power laws. Mol.. O. Takayasu.2012.. H. Finnerty. Huber. M. S... S. 23. A. Rev.G. K. E. G. Reciprocal gene loss between Tetraodon and zebrafish after whole genome duplication in their ancestor. J. possible paleopolyploidy and gene loss. A.H. 24. Biology of mammalian L1 retrotransposons. 2007. J. PNAS 99.. 159–167. Molecular mechanisms of chromosomal rearrangement during primate evolution. The use of genetic complementation in the study of eukaryotic macromolecular evolution. 2005. 2007.A. e1001134. D. 2001. K. The evolutionary fate and consequence of duplicate genes. 2002. 19. Power laws: Pareto distributions and Zipf’s law. 23. Bird... Stumpf. Damelin. Zipf’s law everywhere. Alu and LINE1 distributions in the human chromosomes: evidence of global genomic organization expressed in the form of power laws. Harrison. Trends Genet. DNA sequences of yeast chromosomes. Stolovitzky. 2004... Orphan CpG islands identify numerous conserved promoters in the mammalian genome. D. Lynch.. 2000.. Somat. Evol. Ecol... Li.M. Science 290. Kehrer-Sawatzki. Glottometrics 5.H..005. Cheng. 297–304. Sci. A.02. P. M. Gene 333. T. Sémon. 447.L. Oliver. 501–538. Almirantis. Extensive genomic duplication during early chordate evolution. 1991. D. Provata. Wolfe. Jones. 2007. Pitchford. Y. T.. W. Martinez-Aroza. Shapira... A 43. 31. An update.. 41–56... W. 46. Evol. Li. Opin. Righton. Kerr. Rate of spontaneous gene duplication at two loci of Drosophila melanogaster.. S.P. Cooper. Science 335. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. 200–204. Takai. A. Lloyd.P.. Clay... Schaffner. M. Z.108.E. Cell Mol. Anim. 061925.. E. / Computational Biology and Chemistry 53 (2014) 84–96 IHGSC. 2385–2399. Compositional searching of CpG islands in the human genome. Kasahara.. Genetics 156. 19. M.. 
29 doi:http://dx. Phys. Phys. 1993. Porter. K. 65. J. U. Provata. S. Phys. A. Batz. Widespread occurrence of powerlaw distributions in inter-repeat distances shaped by genome dynamics. Nat.. L. V.H.L. J. 21. Sellis.
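
As a supplementary illustration of the evolutionary model described in the text, the sketch below implements only event types i (segmental duplication) and ii (random elimination of as many CGIs as were duplicated), and then computes the complementary cumulative size distribution of inter-CGI spacers. It is a simplified sketch under stated assumptions, not the authors' implementation: CGIs are treated as points, the duplicated segment is re-inserted at a uniformly random position, and all parameter values and function names are hypothetical.

```python
import bisect
import random


def duplicate_and_lose(positions, length, rng, max_seg_fraction=0.1):
    """One round of event i (segmental duplication) and event ii (random CGI loss)."""
    seg_len = rng.uniform(0, max_seg_fraction * length)
    start = rng.uniform(0, length - seg_len)
    # Offsets of the CGIs lying inside the duplicated segment.
    copied = [x - start for x in positions if start <= x < start + seg_len]
    insert_at = rng.uniform(0, length)
    # Everything downstream of the insertion point is shifted by the segment length.
    shifted = [x + seg_len if x >= insert_at else x for x in positions]
    shifted.extend(insert_at + offset for offset in copied)
    # Event ii: eliminate as many randomly chosen CGIs as were just added
    # (the text allows "lower or equal"; equality keeps the population constant).
    rng.shuffle(shifted)
    survivors = shifted[len(copied):]
    return sorted(survivors), length + seg_len


def simulate(n_cgi=500, length=1e6, rounds=200, seed=0):
    """Start from uniformly random CGI positions and apply repeated rounds."""
    rng = random.Random(seed)
    positions = sorted(rng.uniform(0, length) for _ in range(n_cgi))
    for _ in range(rounds):
        positions, length = duplicate_and_lose(positions, length, rng)
    return positions, length


def spacer_ccdf(positions):
    """Return (size, P(spacer >= size)) pairs for the inter-CGI spacers."""
    spacers = sorted(b - a for a, b in zip(positions, positions[1:]))
    n = len(spacers)
    return [
        (size, (n - bisect.bisect_left(spacers, size)) / n)
        for size in sorted(set(spacers))
    ]
```

Plotting log10 of the second column against log10 of the first gives the kind of double-logarithmic curve discussed in the text; the extent of any linear segment will depend on the assumed parameters.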
