Site-To-Site Variation of Synonymous Substitution Rates

Site-to-Site Variation of Synonymous Substitution Rates
Sergei Kosakovsky Pond* and Spencer V. Muse

*Antiviral Research Center, University of California, San Diego; and Bioinformatics Research Center, North Carolina State University We develop a new model for studying the molecular evolution of protein-coding DNA sequences. In contrast to existing models, we incorporate the potential for site-to-site heterogeneity of both synonymous and nonsynonymous substitution rates. We demonstrate that within-gene heterogeneity of synonymous substitution rates appears to be common. Using the new family of models, we investigate the utility of a variety of new statistical inference procedures, and we pay particular attention to issues surrounding the detection of sites undergoing positive selection. We discuss how failure to model synonymous rate variation in the model can lead to misidentication of sites as positively selected.
Introduction One of the most fundamental goals of molecular evolutionary studies is the identication and quantication of sources of heterogeneity in rates and patterns of nucleotide substitution. To facilitate these studies, an ever-growing list of stochastic models of sequence evolution has appeared in the literature. Beginning with the simple model of Jukes and Cantor (1969), which assumed that all changes at all sites occurred at equal rates, researchers have incorporated such effects as unequal base frequencies (Felsenstein 1981), transition/transversion bias (Kimura 1980; Hasegawa, Kishino, and Yano 1985), site-to-site rate heterogeneity (Uzzel and Corbin 1971; Yang 1993), differences in synonymous and nonsynonymous substitution rates (Goldman and Yang 1994; Muse and Gaut 1994), and correlated rates or patterns of evolution among sites (Schoniger and von Haeseler 1994; Felsenstein and Churchill 1996). The development of these models allowed biologists to formally estimate the parameters governing the process of sequence evolution and to make statistically rigorous inferences about them. With rapid growth in molecular sequence databases coupled with continued improvements in the accessibility of likelihood methods in molecular evolution (Huelsenbeck and Crandall 1997), having a variety of biologically realistic models to use for data analysis is more important than ever. Models describing the evolution of protein-coding DNA sequences are a relatively new addition to the molecular evolution landscape. In 1994, the rst tractable models of codon evolution were introduced (Goldman and Yang 1994; Muse and Gaut 1994). In the decade since then, only a handful of modications of these original models have been published (e.g., Nielsen and Yang 1998; Pedersen, Wiuf, and Christiansen 1998; Yang et al. 2000). Of particular note are models that allow for variation in substitution rates among sites within a gene. Nielsen and his colleagues have published a fascinating series of papers exploiting codon models to identify lineages and codon sites that are undergoing adaptive or positive selection (e.g., Nielsen and Yang 1998; Yang and Nielsen 2002). The key element of these models is the ratio of nonsynonymous to synonymous substitution rates, x. By allowing x to vary across sites using methods similar to those developed for nucleotide models
Key words: synonymous substitution rate, positive selection, molecular evolution, codon model, model selection. E-mail: muse@stat.ncsu.edu.
Mol. Biol. Evol. 22(12):23752385. 2005 doi:10.1093/molbev/msi232 Advance Access publication August 17, 2005 The Author 2005. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oupjournals.org
(Yang 1993, 1994), individual sites with x . 1a hallmark of positive selectionmay be identied. One important technical choice was made in the formulation of all of these models: the rate of synonymous substitution was assumed to be constant across sites, implying that site-to-site variation in x is solely the result of corresponding variation in nonsynonymous rates. If this assumption is incorrect, variation in x is actually related to variation in both synonymous and nonsynonymous rates, and subsequent inferences about x may be awed. We would like to test this assumption and develop appropriate inference procedures for cases where tests suggest that synonymous rates are variable. While there have been many studies that showed large variation in synonymous substitution rates between different organisms, genomic regions, and individual genes (see, for example, Miyata and Yasunaga 1980; Wolfe, Li, and Sharp 1987; Wolfe, Sharp, and Li 1989; Gaut et al. 1996; Hanada, Suzuki, and Gojobori 2004) and while some evidence for variable silent substitution rates within a gene has been presented (Hurst and Pal 2001; Tuplin et al. 2002), to our knowledge there are no published results on modeling the site-to-site variation of synonymous rates in a single gene. Toward this end, we develop and study a class of stochastic models that allow for simultaneous variation in both categories of rates in coding sequences. We then provide a series of statistical inference procedures, including methods for identifying and quantifying amounts of site-to-site heterogeneity, locating sites that are targets of positive selection, and estimation of evolutionary distances. Materials and Methods A Model for Codon Evolution with Variable Rates of Synonymous and Nonsynonymous Substitution Existing Codon Models We describe our model as a modication of the MG94 codon model (Muse and Gaut 1994), although the principles can be easily applied to any substitution model containing two or more substitution rates (such as HKY85 or GY94). The MG94 model is dened by the instantaneous rate matrix whose entry QMG94 reects the rate of change x;y from codon x to codon y. Formally, 8 m ap > ny ; x/y is a synonymous one-step change > > > e:g:; ACGThreonine/ACTThreonine; < MG94 m Qx;y 5 bpny ; x/y is a nonsynonymous one-step change > > > e:g:; ACGThreonine/AGGArginine; > : 0; otherwisee:g:; ACG/TTG;
2376 Pond and Muse
where pmy refers to the frequency of the target nucleotide n in codon y at codon position m. For instance, the target nucleotide for the synonymous ACG/ACT substitution is T in the third codon position, and its corresponding rate is ap3 : T Under this model, the equilibrium frequency of codon ijk is the product of the constituent nucleotide frequencies, scaled to account for the absence of stop codons. For sequences obeying the universal genetic code these frequencies are pijk 5 pi pj pk
1 2 3 1 2 1 2 3 3 1 2 3
the likelihood function requires integrating over the positive quarter-plane R2 : 1 S N N Y PrDs j as ; bs ; T Lg; D; T 5
s51 0 0
3 has ; bs ; gdas dbs A closed form expression is unavailable and it is not practical to perform numerical integration, so we restrict ourselves to considering discrete (or discretized) distributions as Yang (1994) introduced for nucleotide models. For such distributions the likelihood can be computed by carrying out a double summation: Lg; D; T 5
S N M YXX s51 i51 j51
1 pT pA pA pT pG pA pT pA pG
Other genetic codes can be accommodated by making the obvious modications. Finally, note that the original version of the MG94 model assumed that nucleotide frequencies were shared by all three codon positions. For a given evolutionary lineage, the estimable substitution parameters when using the MG94 model are at and bt. Recognizing this natural confounding of evolutionary rates and times (Felsenstein 1981), we can arbitrarily set a 5 1. This formulation is similar to that of Goldman and Yang (1994), and it has the advantage of expressing the substitution rates in terms of the ratio of nonsynonymous and synonymous rates, x 5 b/a, which provides an indication of the magnitude of selective pressure. Originally, x was treated as a deterministic model parameter; however, one can easily envision x varying across sites. Nielsen and Yang (1998) and Yang et al. (2000) studied models with codon-to-codon variation of x described by a variety of probability distributions. Approaches of this type allow us to relate patterns of sequence evolution to various biological hypotheses of interest. For completeness and consistency of notation, we include details of how to dene and evaluate the phylogenetic likelihood in this context as Supplementary Material online (originally discussed in Nielsen and Yang 1998; Yang et al. 2000).
PrDs j as 5 rij ; bs 5 qij ; T
3 Pras 5 rij ; bs 5 qij ; g; where (rij, qij), i 5 1, ., M, j 5 1,., N denote the possible values for (as, bs). If as and bs are sampled from independent distributions, this expression can be rewritten as: Lg; D; T 5
S N YX s51 i51
Pras 5 ri ; g
M X j51
PrDs j as 5ri ; bs 5qj ; T Prbs 5 qj ; g:
Modications for Variable Synonymous Substitution Rates The basic modeling approach described above allows for site-to-site variation in x by treating the synonymous rate as constant across sites and introducing variable nonsynonymous rates. Using the notation of the original MG94 model, a is shared by all sites and b varies over sites. Our extension of this approach allows rates of both synonymous and nonsynonymous substitutions to vary across sites. The revised rate matrix can be written in the same form as the original MG94 model, but with the substitution rates at codon s considered as random quantities (as and bs) rather than deterministic model parameters. We describe (as, bs) as a random vector with a bivariate density h(as, bs;g) parameterized by g, which may also be a vector. Each site in a sequence is modeled with its own (unknown) value of (as, bs), leading to variation in xs 5 bs/as. Let our data consist of a set of aligned sequences D related by a phylogenetic tree T , with the sth column of the alignment denoted as Ds, s 5 1, ., S. Computing
The conditional probability PrDs jas 5rij ; bs 5qij ; T can be computed efciently using Felsensteins pruning algorithm (Felsenstein 1981), with numerical matrix exponentiation routines (Moler and Loan 1978) used to evaluate the transition probabilities needed for the site likelihoods (c.f. Goldman and Yang 1994; Muse and Gaut 1994). Note that the S sites produce only M 3 N sets of transition probabilities, eliminating the need for repeated calculations at each site. Additional computational savings in evaluating the likelihood function are described in Larget and Simon (1998) and Kosakovsky Pond and Muse (2004). Maximum likelihood estimates (MLEs) of g and the branch lengths in T can then be obtained numerically. Likelihood function evaluation can be distributed across multiple processors using the algorithm described in Kosakovsky Pond and Frost (2005c). Although the notation has grown cumbersome, the basic idea is simple: we have provided M rate categories for synonymous substitution rates, N rate categories for nonsynonymous substitution rates, and a discrete probability distribution for the M 3 N possible combinations of these categories. The model posits that each codon falls into one of the M 3 N categories, and the likelihood function is calculated by taking a weighted sum over all possible assignments of codons to rate categories. As is typical for studies of substitution rates, only products of rates and times can be estimated (Felsenstein 1981), and we can restrict one of the distributions to have a xed mean. We choose to restrict the distribution of as to have unit mean because this leads to a natural extension of current models, which set as 5 1 for all sites. This restriction is analogous to current rate heterogeneity models
Site-to-Site Silent Variation 2377
(e.g., Yang 1993) that use one-parameter gamma distributions with unit mean instead of the more general twoparameter gamma. Independent Gamma Distributions for Synonymous and Nonsynonymous Rates For purposes of illustrating the modeling approach, we will assume that both synonymous rates as and nonsynonymous rates bs are sampled independently from gamma distributions, with separate means and variances for the two types of substitutions. We dene the gamma density gx; A; B 5 BB x B1 eBx=A =AB CB; x; A; B . 0; with mean A and variance A2/B. The synonymous rates as are drawn from a unit mean gamma density g(as; 1, la). In the simplest case, we assume that nonsynonymous rates are drawn from a gamma density g(bs; x, lb). It is often desirable to include a category of invariable sites (Fitch 1971). To capture such behavior, we also introduce a model where nonsynonymous rates are described by a mixture of a gamma distribution with a point mass at 0. If a fraction P of the sites are invariable at the amino acid level, the density is given by gi (x; A, B, P) 5 Pd0(x) 1 (1 P)g(x; A, B), where d0(x) denotes the point mass dis tribution at 0, i.e., d0(x) 5 0 for x . 0 and N d0 x51: The 0 mean and variance of this Gamma 1 Inv distribution are E[bs] 5 (1 P)A and Varbs 51 P2 A2 =B: Under this framework, the extension of the simple bivariate gamma leads to nonsynonymous rates bs being sampled from gi(bs; x, lb, P). Using these models, the joint density for (as, bs) depends on either three or four parametersthe shape parameter for synonymous rates la, both shape and scale parameters for nonsynonymous rates lb and x, and in the second setting the proportion of invariant sites Pand can be written as either h(as, bs; la, lb, x) 5 g(as; la)g(bs; x, lb) or h(as, bs; la, lb, x, P) 5 g(as; la)gi(bs; x, lb, P) as appropriate. Using the discretization procedure described in the Supplementary Material online to convert the continuous bivariate distribution to a discrete one, we are able to t our models to a sequence alignment as described previously. Independent Discrete Distributions for Synonymous and Nonsynonymous Rates In general, there is no compelling reason to believe that substitution rates follow a gamma (or any other simple) distribution. The choice of the gamma distribution in Yang (1993) was justied largely by the exibility of its parametric family and by the simplicity of estimating a single shape parameter. Given a xed computational complexity (i.e., the number of rate classes D), the most general parametric approach to tting a rate distribution is to use a general discrete distribution (GDD) with D bins. Such a distribution is specied by enumerating its values di, i 5 1, ., D (without loss of generality we may assume that di , dj, for all i , j) and the probabilities of observing each value pi 5 P PrfX 5 dig, subject to D pi 51: A GDD distribution i51 with D bins will include 2D 1 free parameters to be estimated by maximum likelihood. When modeling variation in synonymous rates the mean of the distribution must be
restricted to 1 (cf. UGDD distribution in Kosakovsky Pond and Frost 2005a), reducing the number of free parameters to 2D 2. GDDs are parameter rich, and care must be taken not to include too many rate classes because it can easily lead to estimation problems such as slow convergence, multiple local maxima, and rapid performance degradation. For all but the largest alignments the value of D should not exceed 5 (unpublished data and Kosakovsky Pond and Frost 2005a). Model Extensions and Variations The previous section described the basic modeling approach we have used for investigating site-to-site heterogeneity of synonymous substitution rates. We are interested in improvements in model t, in the robustness of parameter estimation (particularly of the nonsynonymous:synonymous rate ratio x) to model choice, and in exploration of the extent to which synonymous rates vary from site to site within gene sequences. To facilitate these investigations, we need to describe a series of variations on the basic model designed to allow comparisons with the published literature and to address biological hypotheses of interest. Each specic model consists of four components: the rate heterogeneity model, the core rate matrix, the rate distribution used, and the discretization scheme, and each component is described below. The Core Rate Matrix Entries in the instantaneous rate matrix QMG94 take the form hpny : In the simplest case we had only two possibilities, apny for synonymous substitutions and bpny for nonsynonymous changes. However, one might want to include other common substitution biases, such as differences in transition and transversion rates. One can think of combining the basic MG94 model that accomodates separate synonymous and nonsynonymous rates with a nucleotidebased model to capture nucleotide-level factors. It is this combination that we dub the core rate matrix. We can create a core matrix in this manner using any time reversible 4 3 4 nucleotide model, but in the current work we will limit ourselves to the following core rate matrices. MG94 3 REV. In these models the Muse and Gaut (1994) model is enhanced with a separate substitution parameter for each nucleotide substitution type. The rate matrix elements describing the substitution of codon x with codon y take the form: 8 > as hij pny ; x/y one-step synonymous > > > substitution of nucleotide i with > > < nucleotide j; MG94 3 REV 5 Qx;y > bs hij pny ; x/y one-step nonsynonymous > > > substitution of nucleotide i with > > : nucleotide j: To ensure reversibility, hij 5 hji. Because we can only estimate the products of rates and times, one of the parameters hij cannot be identied, and without loss of generality we set hAG 5 1. This rate matrix is the most general one used in this study.
2378 Pond and Muse
MG94. Set all hij 5 1. MG94 3 HKY85. Set hAG 5 hCT 5 1 and hAC 5 hAT 5 hCG 5 hGT. GY94. The Goldman and Yang (1994) model does not quite follow the general form described in this work, differing slightly in the modeling of equilibrium frequencies. We include this core matrix so that we can use it for direct comparisons with published results. The GY94 differs from MG94 3 HKY85 in that each nonzero rate in the matrix is proportional to the frequency of the target codon py instead of the frequency of the target nucleotide. This feature leads to small (but typically negligible) differences from the MG94 3 HKY85 model. The Rate Heterogeneity Model We described above a basic model for allowing site-tosite rate heterogenetiy of both synonymous and nonsynonymous rates in which rates were described by independent random variables. For purposes of exploring the utility of allowing heterogeneity of synonymous rates, and specically to allow us to test various hypotheses about the relationship between synonymous and nonsynonymous rates, we now introduce several modications of that scheme. Constant: constant rates model. This model is the simplest used in this work. It sets as 5 1 and bs 5 x for every site s in the alignment. The original MG94 and GY94 models used this approach of having no site-to-site rate variation. Nonsynonymous: variable nonsynonymous rates model. This rate heterogeneity model was used by Yang et al. (2000) and described in the Existing Codon Models above. All sites have as 5 1, while the bs values are sampled from a given rate distribution (e.g., gamma or general discrete). Nonsynonymous assumes that synonymous rates do not vary over sites. However, the nonsynonymous-to-synonymous rate ratio does vary over sites as a result of the nonsynonymous rate variation. Proportional: proportional variable rates model. Proportional is perhaps the simplest approach for allowing both types of rate heterogeneity. The basic framework is similar toYang et al. (2000). Unlike that work, however, the synonymous rates are not constant over sites. Instead, the nonsynonymous rate is proportional to the synonymous rate, with the proportionality constant shared by all sites. The as values are drawn from a given unit mean rate distribution, and the nonsynonymous rates are then given by bs 5 xas, where x is the ratio of nonsynonymous to synonymous rates and is shared by all sites. Such a model might be effective if site-specic mutation rates were the driving factor in creating site-to-site heterogeneity (Kosakovsky Pond and Frost 2005a). Dual: dual variable rates model. Dual is the primary rate heterogeneity model described in the Modications for variable synonymous substitution rates above. The as and bs values are sampled from two given rate distributions, with as restricted to have mean 1. Dual assumes that synonymous and nonsynonymous rates are independent.
FIG. 1.Hierarchical presentation of rate heterogeneity models used in this work.
Lineage Dual: dual variable rates model, branch corrections. Model Lineage Dual is an extension of Dual that allows for lineage-specic adjustments to the ratio, x, of nonsynonymous to synonymous rates. The values of as and an ancillary quantity cs are sampled from two (independent) unit mean distributions and bs 5 xb 3 cs; xb denotes the E[nonsynonymous]/E[synonymous] ratio for branch b. The branch correction parameters, xb, are estimated separately for every branch in the phylogeny. They can be thought of as correction factors to the average value of x dened in Dual, with larger values of xb indicative of branches with x values higher than the tree average. These values are useful for identifying specic lineages that may have undergone adaptive evolution events (Yang 1998; Kosakovsky Pond and Frost 2005a). Dual is a particular case of Lineage Dual and can be used to test the null hypothesis xb 5 x for all branches b. Lineage Dual is the most general model considered in this study. A diagram illustrating the hierarchy of the models is found in gure 1. The Discretization Method Having selected a rate heterogeneity model based on gamma-distributed random variables, for practicality we
Table 1 Data Sets Analyzed for Presence of Synonymous Rate Variation

MG94 3 REV Nonsynonymous GDD 3 Data Sperm lysin Primate COXI Drosophila adh HIV-1 vif b-globin Inuenza A* Camelid VHH* Encephalitis env Flavivirus NS5 Hepatitis D antigen Reference (Yang and Swanson 2002) (Seo, Kishino, and Thorne 2004) (Yang et al. 2000) (Yang et al. 2000) (Yang et al. 2000) (Yang 2000) (Harmsen et al. 2000) (Yang et al. 2000) (Yang et al. 2000) (Anisimova and Yang 2004) Sequences Codons 25 21 23 29 17 349 212 23 18 33 135 506 254 192 144 329 96 500 183 196 log L 4,409 12,013.3 4,586.2 3,347.2 3,659.3 10,916.5 16,540.8 6,774.4 9,137.8 5,137.7 Tree Length 2.85 8.5 1.41 0.97 2.6 1.42 14.9 0.85 6.3 1.9 MG94 3 REV Dual GDD 3 3 3 log L Tree Length P Value 0.0001 ,0.0001 0.23 ,0.0001 0.0004 ,0.0001 ,0.0001 ,0.0001 ,0.0001 ,0.0001 DAIC 15.36 65.27 2.35 17.63 12.43 103.7 291.24 35.15 47.25 118.98
(0.06) 4,397.3 2.93 (0.06) (0.22) 11,976.6 5.8 (0.15) (0.03) 4,583.4 1.47 (0.03) (0.02) 3,334.4 0.99 (0.02) (0.08) 3,649.1 3.3 (0.1) (0.002) 10,860.7 1.42 (0.002) (0.04) 16,391.2 14.9 (0.04) (0.02) 6,752.8 0.89 (0.02) (0.19) 9,110.2 7.8 (0.24) (0.03) 5,074.2 2.02 (0.03)
NOTE.Tree lengths are reported in expected numbers of substitutions per unit time per nucleotide site; numbers in parentheses reect mean branch lengths. Reported P value is for the likelihood ratio test between the Nonsynonymous and Dual rate variation models, using the asymptotic distribution of v2 : DAIC is the difference in AIC scores 4 between the Dual and Nonsynonymous models: positive numbers indicate the preference for the Dual model. Analyses marked with the asterisk used a branch length approximation scheme similar to Yang (2000), discussed in detail in Kosakovsky Pond and Frost (2005b) to reduce computational burden.
must now convert those continuous distributions into discrete approximations. Using the schemes discussed in Supplementary Material online, discretization based on M 2 synonymous rate classes and N 2 nonsynonymous rate classes is denoted by M 3 N. The inclusion of models Nonsynonymous and Proportional, which have only one random rate, as well as the use of distributions with an invariant category for Dual and Lineage Dual, require us to dene some rules for discretization. For Nonsynonymous and Proportional, we discretize the single distribution into MN equiprobable rate classes. For Dual and Lineage Dual, we discretize the distribution of synonymous rates into M equiprobable classes and the distribution of nonsynonymous rates into N equiprobable classes. If an invariable class is present for nonsynonymous rates, we discretize the distribution of synonymous rates into M equiprobable classes, the continuous portion of the mixture of nonsynonymous rates into N 1 equiprobable classes, and add the invariable class at 0 to obtain N total nonsynonymous classes. Under this discretization strategy, all models have MN rate classes and equivalent computational costs. When GDDs are used, no discretization is needed and one only need specify the number of rate classes. With the three components in place, we now have a wide selection of models from which to choose. To completely specify a model, we list its rate variation model, core rate matrix, rate heterogeneity method, rate distributions used, and the discretization method. The model MG94 3 REV Dual Gamma 5 3 7 would use the MG94 core matrix composed with the REV nucleotide model, have separate independent gamma distributions for synonymous and nonsynonymous rates, discretized into ve synonymous categories and seven nonsynonymous categories. A Simple Test for Presence of Synonymous Rate Variation The Dual and Nonsynonymous models are nested if GDD distributions are used to model rate variation, making it possible to assess signicance using standard likelihood ratio tests. If N nonsynonymous rate categories are used both in Dual and in Nonsynonymous and if M synonymous
rate categories are employed in Dual, then there are 2M 2 more estimable parameters in Dual than in Nonsynonymous. Dual can be constrained so that Nonsynonymous is obtained. When enough alignment columns are available, the asymptotic v2 distribution with 2M 2 degrees of freedom can be used to compute P values. For smaller data sets a parametric bootstrap P value can be found. Alignments, Trees, and Programs To investigate the utility of the new models in exploring site-to-site rate heterogeneity, we made use of 10 previously studied codon alignments (table 1) of varying sizes and levels of divergence. The archive of alignments and phylogenetic trees in NEXUS format can be downloaded from http://www.hyphy.org/pubs/2rates.tgz. We examine the mammalian b-globin data set in great detail but for brevity focus only on tests for the presence of synonymous rate variation in other data sets. All analyses were performed using HyPhy version .99b (Kosakovsky Pond, Frost, and Muse 2004) for Mac OS X on a dual G5/2Ghz workstation or a 16-processor Linux MPI cluster. Both the program and the batch les for comparing the models discussed herein and result processing are included as standard analyses with the HyPhy distribution (http://www.hyphy.org). The chart generation option used for creating some of the graphics in this paper is presently only available in Mac OS and Windows. Results Nucleotide Substitution Biases In all data sets analyzed for this paper, the MG94 3 REV model of codon substitution was preferred over both MG94 3 HKY85 and traditional MG94 when assessed by the appropriate likelihood ratio test, and it was also chosen over GY94 using Akaike Information Criterion (AIC) (Supplementary Material online). This nding held even when different rate variation models were considered. For the remainder of this article, we consider only the MG94 3 REV core model unless specied otherwise.
2380 Pond and Muse
Table 2 Descriptive Statistics for the Inferred Distributions of Synonymous (as) and Nonsynonymous (bs) Substitution Rates
MG94 3 REV Nonsynonymous GDD 3 Data Sperm lysin Primate COXI Drosophila adh HIV-1 vif b-globin Inuenza A Camelid VHH Encephalitis env Flavivirus NS5 Hepatitis D antigen E(bs) 1.1 0.24 0.1 0.82 0.27 0.46 0.93 0.05 0.05 0.47 CV(bs) 0.91 11.7 1.45 1.39 1.23 1.76 0.95 2.04 1.71 1.23 CV(as) 0.75 0.18 0.36 0.93 1.47 0.8 0.74 0.57 0.39 0.81
MG94 3 REV Dual GDD 3 3 3 E(bs) 0.97 0.02 0.1 1.76 0.18 0.47 0.75 0.05 0.04 0.42 CV(bs) 0.91 1.99 1.45 1.39 1.23 1.76 0.94 2.04 1.71 1.24 E(bs/as) 4.67 N 0.11 1.31 0.31 0.75 1.25 0.07 N N CV(bs/as) 2.60 NA 1.58 1.6 1.39 2.03 1.30 2.65 NA NA
NOTE.E, expected value; CV, coefcient of variation; NA, not applicable.
Choice of Rate Distribution For every choice of rate variation models, GDD tends to have similar or better AIC scores than a discretized gamma (1Inv) distribution (Supplementary Material online). Fewer rate categories are needed by GDD to achieve good ts when compared to continuous discretized distributions, and consequently the analyses require less computer time to run. In general, a uniformly discretized gamma distribution may be a poor choice to capture the underlying pattern of rate variation, even if the true distribution of substitution rates is a gamma (Kosakovsky Pond and Frost 2005c). Adaptive discretization schemes or GDDs may be a better general choice for modeling distributions of rates. We chose to focus on the results derived from using GDDs of rates, although similar results have been obtained with a variety of other distributions (results not shown). Heterogeneity of Synonymous Rates Is Not Negligible Each of the ten data sets was analyzed for evidence of synonymous rate variation (table 1). The MG94 3 REV model, using GDD was t to each data set under two rate variation hypotheses: NullNo synonymous rate variation: Nonsynonymous with three rate categories. AlternativePresence of synonymous rate variation: Dual, with three rate categories for as and three for bs. Because the models are nested, a standard likelihood ratio test using the v2 asymptotic distribution can be used to 4 assess signicance. Additionally, we employed the AIC to quantify the relative improvement of t. The computational difculty of the Dual model was increased roughly by a factor of 3 compared to the Nonsynonymous model; however, if run in parallel on a cluster of computers with at least 10 nodes (Kosakovsky Pond and Frost 2005c) the models are roughly equivalent in terms of overall run time. The choice of three (or 3 3 3) rate categories was partly historical, e.g., the often cited M3 model of Yang et al. (2000) used a GDD with three rate categories and partly because nite sample sizes do not, in general, permit reliable inference of more than a few rate classes (Kosakovsky Pond and Frost 2005c).
In 9 of 10 cases the hypothesis of constant synonymous rates can be rejected at the 0.01 level. The Drosophila adh data set is the sole exception where the test failed to reject the assumption of constant synonymous rates. Dual was preferred by AIC in the same nine cases, with score improvements ranging from 12.43 to 291.24, with larger data sets yielding more pronounced AIC decreases. Coefcients of variation (CV: dened as the standard deviation of a distribution divided by its mean) can be used to quantify and compare the variability of different distributions. The CV for synonymous rate distributions ranged from 0.18 to 1.47 and in many cases were comparable to, or even greater than, the CV of the corresponding nonsynonymous distributions (table 2). Note that while for most data sets both models yielded similar estimates of overall tree lengths (table 1) and CV of the nonsynonymous rate distributions, there were some notable exceptions (COXI, b-globin, and avivirus NS5 data sets) for which there was a pronounced effect on the estimate of overall sequence divergence. This effect could be important, for example, in estimation of divergence times and denitely merits further exploration. Detailed Analysis of the b-Globin Data Set In the following paragraphs, we illustrate potential applications of the models developed above through a series of analyses of the b-globin data set. We examine the ts of various models and investigate the inferred rate heterogeneity distributions. Model Selection for the b-Globin Data Set A more thorough analysis of the b-globin data is presented in table 3. Introduction of branch-specic rate correction factors in Lineage Dual appears to lead to a signicant t improvement over Dual model as measured by the likelihood ratio test (asymptotic P value of 0.005). Inferred branch selection coefcients xb ranged from 0 to 3.7 (median 0.31). The Lineage Dual model can serve as a simple test for heterogeneity in average selection pressure among lineages when both substitution rates are allowed to vary among sites, complementing existing branch-site
Table 3 MG94 GDD 3 3 3 Applied to the b-Globin Alignment

Model Constant Proportional Nonsynonymous Dual Lineage Dual log L 3,768.25 3,712.2 3,659.3 3,649.7 3,622.5 AIC 7,610.5 7,506.4 7,400.7 7,389.3 7,395 CV(as) 0 0.88 0 0.8 0.68 E(bs) 0.23 0.2 0.27 0.22 NA CV(bS) 0 0.88 1.23 1.24 NA E(bS/aS) 0.23 0.2 0.27 0.31 NA CV(bS/aS) 0 0 1.23 1.42 NA
NOTE.log L: the logarithm of the maximum likelihood value for the model. AIC: the value of the Akaike Information Criterion (Akaike 1974) model selection index. AIC 5 2 log L 1 2(# par). AIC rewards models for good t but penalizes them for adding parameters. Small values of AIC indicate better models. E[bs]: expected value for the nonsynonymous rate distribution. Because our discretization scheme preserves rst moments, this value is the same for the continuous distribution and its discrete variant. CV[as]: coefcient of variation of the discretized synonymous rates distribution, i.e., the ratio of the standard deviation of the distribution and its mean. CV[bs]: coefcient of variation of the discretized nonsynonymous rates distribution. E[bs/as]: expected value for the distribution of the ratio of nonsynonymous and synonymous rates. CV[bs/as]: coefcient of variation of the distribution of the ratio of nonsynonymous and synonymous rates; NA, not applicable. N X v51 N X v51 M X u51
models (Yang 1998; Yang and Nielsen 2002). It is likely (Kosakovsky Pond and Frost 2005a) that a model where each branch has an adjustable xb is too parameter rich and that a number of models with only a few classes of xb will t the data as well, a possibility reinforced by the fact that the best model using AIC as a criterion is Dual. Note that for this particular data set, estimates of the various parameters regulating site-to-site rate heterogeneity appear quite insensitive to model selection, although the results from tables 1 and 2 indicate that this behavior is not universal. Examining Rate Patterns at Sites in the b-Globin Data Set ^ Once the MLEs, H; of the entire collection of unknown parameters, H, have been obtained, one can ask questions about rate patterns among sites in the data set. One of the results produced by a likelihood analysis is a table of conditional probabilities comprised of the entries ls ; s 5 1, ., S, j 5 1, ., M (number of synonymous rate jk classes), k 5 1, ., N (number of nonsynonymous rate clas^ ses): ls 5PrDs jas 5aj ; bs 5bk ; H: Given ls ; we can comjk jk pute the empirical Bayes posterior probability ps of site s jk belonging to rate class ( j, k) by applying the elementary Bayes rule: pjk 5 PM
s
Ebs j Ds 5 R
bv Prbs 5 bv j Ds 5 R
bv
ps : uv
Note that both Eas jDs and Ebs jDs shown in gure 3 show substantial variation (as suggested by their respective CV in table 2) but with little obvious spatial correlation. Selective Pressure For the Dual and Lineage Dual models, one can examine the conditional distribution of xs 5 bs/as (or Ms 5 bs as) over codon sites and attempt to identify sites under selective pressure. Because the distributions used for as and bs are always discrete (or discretized), the inferred distribution of xs (or Ms) is also discrete and can be obtained in a straightforward way. Following the empirical Bayes methodology of Nielsen and Yang (1998), given the posterior probabilities ps as dened above and a cutoff jk level 0 C 1, we label a codon site s as a site under selective pressure if X ps C: PSs 5 jk
j;k such that bk .aj
ljk Pras 5 aj ; bs 5 bk : PN s u51 v 5 1 luv Pras 5 au ; bs 5 bv
One could also employ Bayes factors to identify sites that are under selective pressure. A Bayes factor BF(s) for the hypothesis (or event) Ws 5fbs . as g at site s is dened as the ratio of posterior odds of the event and its prior odds: BFs 5 PrfWs j Ds g=1 PrfWs j Ds g : PrfWs g=1 PrfWs g
The posterior distribution of rates at a site is useful in determining the rate of evolution at a given site. For instance, one can simply assign site s to the rate class (j, k) that has the maximal posterior probability. Note that if all rate classes are equiprobable, the same results can be obtained simply by selecting the rate class with the largest conditional posterior probability ls (table 4). In gure 2, we show examples ij of sites assigned to different rate classes by various models. Additional indicators of the rate pattern present at site s are the expected posterior synonymous and nonsynonymous rates at that site, denoted by Eas jDs and Ebs jDs (g. 3): Eas j Ds 5
M X u51
Table 4 Rate Class Assignments Under MG94 3 REV Dual GDD 3 3 3 for b-Globin
Rates a 5 0.47 a 5 1.42 a 5 7.98 Column sum b 5 0.005 29 13 6 48 b 5 0.18 42 29 9 80 b 5 0.85 11 3 2 16 Row Sum 82 45 17 144
au Pras 5 au j Ds 5
M X u51
au
N X v51
puv ;
NOTE.Table entries are counts of sites assigned to each synonymous 3 nonsynonymous (row 3 column) rate category combination.
2382 Pond and Muse
FIG. 2.Examples of rate class assignments and positive selection identication for b-globin under MG94 3 REV Dual GDD 3 3 3, MG94 3 REV Nonsynonymous GDD 3, and M3 in (Yang et al. 2000).
A large value of BF(s), for instance BF(s) . 100 (decisive evidence, see Kass and Raftery 1995) indicates that the effect of the data in column s on the prior event of positive selection is amplifying and thus suggestive of the presence of selective pressure at site s. One of the advantages of using Bayes factors as opposed to a posterior probability cutoff is that Bayes factors properly account for the effect of the prior distribution. For example, rst suppose that PrfWsg 5 0.7 and PrfWs jDs g50:95: It is easy to check that BF(s) 5 8.14. Next, consider PrfWsg 5 0.05 and PrfWs jDs g50:95: Now, BF(s) 5 361. Both situations would be treated as equally strong evidence for positive seE[S|i]
5
lection with the posterior cutoff but vastly different under Bayes factors. It seems reasonable to consider the relative effect of the data upon the prior as measured by BF(s), if for no other reason than to make the conclusions of positive selection analyses based on different priors directly comparable. Under certain scenarios (Kosakovsky Pond and Frost 2005b), the reciprocals of Bayes factors serve as a useful approximate measure of the rates of false positives of the test for selection at a site. In practice, however, there seems to be little difference in inferring the location of sites under selection, whether posterior P values or Bayes factors are
0 25 50 75 100 125
E[NS|i]
Codon
FIG. 3.Expected posterior rates (E[S|i] for synonymous rates and E[NS|i] for nonsynonymous rates) for b-globin under MG94 3 REV Dual GDD 3 3 3.
FIG. 4.Comparison of site-by-site selection inference using MG94 3 REV Dual GDD 3 3 3 and MG94 3 REV Nonsynonymous GDD. Shaded areas in each plot show codons that are classied as positively selected by the Nonsynonymous model only (cutoff of 0.9 used for the posterior probability plot).
used (Kosakovsky Pond and Frost 2005b; unpublished simulation results). Empirical Bayes inference can be strongly inuenced by errors in MLE. Recently, several methods to correct for this effect have been suggested. One can either use a hierarchical Bayes model (Huelsenbeck and Dyer 2004) or employ Bayes Empirical Bayes inference (Yang, Wong, and Nielsen 2005). Because our Dual models include many estimated parameters, we caution that estimation errors can exert strong inuence on the empirical Bayes analysis, especially for smaller data sets. While we base our initial analyses on a simplistic application of empirical Bayes inference, both approaches to dealing with errors in MLEs can be adapted to include synonymous rate variation. Furthermore, new approaches (e.g., sampling importance resampling) could be brought to bear on handling errors in parameter estimates, and we plan to extend our methodology to account for the effect of such estimation errors.
Comparison with Existing Methods In Yang et al. (2000), model M3 was found to have the best t for the b-globin data (AIC 5 7,408.12). Note that M3 of Yang et al. (2000) can be denoted as GY94 Nonsynonymous GDD 3 using the conventions of this paper. GY94 Nonsynonymous GDD 3 and MG94 3 REV Dual GDD 3 3 3 (AIC 5 7,388.23) differ in three aspects: (1) modeling of rate variation, (2) compensating for nucleotide substitution biases, and (3) treatment of equilibrium frequencies. In order to focus specically on the effect of failing to allow for synonymous rate variation we also included results from MG94 3 REV Nonsynonymous GDD 3 (AIC 5 7,400.66). The AIC scores for the three models suggest that site-to-site synonymous variation is present and that it is important to model. In gure 2, we illustrate examples of sites whose rates can be categorized as high synonymous/low nonsynonymous (codon 55), high synonymous/high nonsynonymous (codon 85), low synonymous/high nonsynonymous (codon 48), and moderate synonymous/high nonsynonymous (codon 7). The presence of this variety of site categories suggests that it
might be important to include both types of site-to-site rate variation. While the model t is important, we focus now on the specic task of identifying sites under positive selection. A few sites illustrate important points with regard to the suitability of models for this task. The most crucial difference between methods is found at codon 85, a position with high synonymous and nonsynonymous diversity. Because of the high level of nonsynonymous diversity, models that assume equal synonymous rates identify such a site positively selected with high statistical signicance (note the Bayes factors/posterior probabilities of 5,545 and 0.998 in g. 2). However, an argument can be made that in reality site 85 is evolving neutrally or even under purifying selection. Using MG94 3 REV Dual GDD 3 3 3, the estimated ratio of nonsynonymous to synonymous rates at site 85 is 0.108. The model assigns high synonymous and nonsynonymous rates to such variable sites, leading to a negligible positive selection signal. A look at gure 3 suggests that while evolution at most sites in the alignment can be explained well by a single synonymous rate, there are a number of prominent exceptions where synonymous rate estimates are substantially elevated. Figure 4 illustrates the effects of including (or excluding) site-to-site variation in synonymous rates when searching for sites undergoing positive selection. We used models MG94 3 REV Nonsynonymous GDD 3 and MG94 3 REV Dual GDD 3 3 3 in order to focus solely on the performance differences arising from the inclusion or exclusion of site-to-site synonymous rate variation. The scatterplots of posterior expected xs and posterior probabilities for positive selection (bs . as) should form a line along the y 5 x axis if the models lead to the same inferences. Predictably, the models agree on a large subset of the sites. However, the most important feature of the plots is the tendency for some sites (7, 67, 85, and 123) with large support for positive selection under the Nonsynonymous model to have considerably weaker support when site-to-site synonymous variation is introduced using the Dual model. This behavior could lead to an elevated rate of false positives when analyzing data sets that include synonymous rate
2384 Pond and Muse
Table 5 Evidence of Positive Selection with and Without Synonymous Rate Heterogeneity
Codon 7 11 42 48 50 54 67 85 123 Nonsynonymous 170,763 3.1 3,800 5,956 3,521 1,224 76,465 5,545 602,744 (0.9999) (0.84) (0.997) (0.998) (0.9997) (0.991) (0.9998) (0.998) (0.99998) 10.5 398 501 222 161 373 6.7 0.4 79 Dual (0.41) (0.96) (0.97) (0.94) (0.91) (0.96) (0.31) (0.03) (0.84)
Supplementary Material Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe. oxfordjournals.org/). Acknowledgments The authors would like to thank Joseph Watkins, Simon D. W. Frost, Andy J. Leigh Brown, and Jesse Taylor for their comments and suggestions. This research was supported in part by the National Science Foundation (award number 0331654), National Institutes of Health (AI47745, AI43638, and AI57167), the University of California Universitywide AIDS Research Program (grant number IS02-SD-701), and by a University of California, San Diego Center for AIDS Research/ National Institute of Allergy and Infectious Diseases Developmental Award (AI36214). Literature Cited
Akaike, H. 1974. A new look at the statistical model identication. IEEE Trans. Autom. Contr. 19:716723. Anisimova, M., and Z. Yang. 2004. Molecular evolution of the hepatitis delta virus antigen gene: recombination or positive selection? J. Mol. Evol. 59:815826. Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368376. Felsenstein, J., and G. A. Churchill. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93104. Fitch, W. 1971. Towards dening the course of evolution: minimum change for a specic tree topology. Syst. Zool. 20:406416. Gaut, B. S., B. R. Morton, B. McCaig, and M. T. Clegg. 1996. Substitution rate comparisons between grasses and palms: synonymous rate differences at the nuclear gene Adh parallel rate differences at the plastid gene rbcl. Proc. Natl. Acad. Sci. USA 93:1027410279. Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725736. Hanada, K., Y. Suzuki, and T. Gojobori. 2004. A large variation in the rates of synonymous substitution for RNA viruses and its relationship to a diversity of viral infection and transmission modes. Mol. Biol. Evol. 21:10741080. Harmsen, M., R. Ruuls, I. Nijman, T. Niewold, Franken, L. G., and B. de Geus. 2000. Llama heavy-chain V regions consist of at least four distinct subfamilies revealing novel sequence features. Mol. Immunol. 37:579590. Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160174. Huelsenbeck, J., and K. Crandall. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28:437466. Huelsenbeck, J. P., and K. A. Dyer. 2004. Bayesian estimation of positively selected sites. J. Mol. Evol. 58:661672. Hurst, L., and C. Pal. 2001. Evidence for purifying selection acting on silent sites in BRCA1. Trends Genet. 17:6265. Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21132 in H. M. Munro, ed. Mammalian protein metabolism. Academic Press, New York.
NOTE.Columns two and four show the Bayes factors for positive selection for all sites (column one) having a Bayes factor of 100 or higher for either the GY94 Nonsynonymous GDD 3 model or the MG94 3 REV Dual GDD 3 3 3 model. Posterior probabilities are given in parentheses in columns three and ve.
variation with Nonsynonymous-type models (for an example, see Kosakovsky Pond and Frost 2005b). As a nal investigation of the relationship between model choice and identication of positively selected sites, in table 5, we display results for every site that had a Bayes factor for positive selection greater than 100 (which also translates to posterior probabilities of xs . 1 being over 0.9 in all cases) using either MG94 3 REV Dual GDD 3 3 3 or M3. For most sites the two models are in general agreement. However, large discrepancies are found at sites 7, 67, and 85, and to a lesser degree at site 123. In each of these cases the evidence for positive selection disappears when synonymous rate variation is introduced. Conclusions The results above clearly indicate that the inclusion of potential site-to-site variation in synonymous rates when studying the molecular evolution of protein-coding regions is desirable. From the statistical perspective, these models allow for rigorous testing of parameters regulating the process of sequence evolution. Applications of the new models and associated inference tools on ten data sets indicate that variation in synonymous substitution rates is a real phenomenon. This observation, if it turns out to be widespread, will no doubt be one of considerable interest: investigations using estimates of synonymous rates for molecular clock dating could be brought into question; synonymous rate variation might be used to explain the overdispersed clock; positive selection studies might require second looks. At the most basic level, it raises questions as to the biological sources of the heterogeneity. Selective arguments, perhaps hinging on localized codon usage patterns or avoidance of transcription factorbinding site motifs, could be crafted. An explanation drawing on variation of mutation rates at different dinucleotides has some appeal and has been described in Hepatitis C (Tuplin et al. 2002). One could also imagine an explanation based on properties of DNA winding on histones. In any event, the model provides for the rst time the tools to carefully investigate any of these phenomena that might actually be present in nature, and our analyses suggest that the phenomenon is ubiquitous enough to merit close attention.
Kass, R., and A. Raftery. 1995. Bayes factors. J. Am. Stat. Assoc. 90:773795. Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111120. Kosakovsky Pond, S. L., and S. D. Frost. 2005a. A genetic algorithm approach to detecting lineage-specic variation in selection pressure. Mol. Biol. Evol. 22:478485. . 2005b. Not so different after all: a comparison of methods for detecting amino-acid sites under selection. Mol. Biol. Evol. 22:12081222. . 2005c. A simple hierarchical approach to modeling distributions of substitution rates. Mol. Biol. Evol. 22:223234. Kosakovsky Pond, S. L., S. D. W. Frost, and S. V. Muse. 2004. Hyphy: hypothesis testing using phylogenies. Bioinformatics 21:676679. Kosakovsky Pond, S. L., and S. V. Muse. 2004. Column sorting: rapid calculation of the phylogenetic likelihood function. Syst. Biol. 53:685692. Larget, B., and D. Simon. 1998. Faster likelihood calculations on trees. Technical Report 98-02. Department of Mathematics and Computer Science, Duquesne University, Pittsburgh, Pa. Miyata, T., and T. Yasunaga. 1980. Molecular evolution of mRNA: a method for estimating evolutionary rates of synonymous and amino acid substitutions from homologous nucleotide sequences and its application. J. Mol. Evol. 16: 2336. Moler, C., and C. V. Loan. 1978. Nineteen dubious ways to compute the exponential of a matrix. SIAM Rev. 20:801836. Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates with application to the chloroplast genome. Mol. Biol. Evol. 11:715724. Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting positive selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929936. Pedersen, A. M. K., C. Wiuf, and F. B. Christiansen. 1998. A codon-based model designed to describe lentiviral evolution. Mol. Biol. Evol. 15:10691081. Schoniger, M., and A. von Haeseler. 1994. A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phylogenet. Evol. 3:240247. Seo, T.-K., H. Kishino, and J. Thorne. 2004. Estimating absolute rates of synonymous and nonsynonymous nucleotide substitu-
tion in order to characterize natural selection and date species divergences. Mol. Biol. Evol. 21:12011213. Tuplin, A., J. Wood, D. Evans, A. Patel, and P. Simmonds. 2002. Thermodynamic and phylogenetic prediction of RNA secondary structures in the coding region of hepatitis C virus. RNA 82:824841. Uzzel, T., and K. W. Corbin. 1971. Fitting discrete probability distributions to evolutionary events. Science 172:10891096. Wolfe, K. H., W.-H. Li, and P. M. Sharp. 1987. Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proc. Natl. Acad. Sci. USA 84:90549058. Wolfe, K. H., P. M. Sharp, and W.-H. Li. 1989. Rates of synonymous substitution in plant nuclear genes. J. Mol. Evol. 29:208211. Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:13961401. . 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:105111. . 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568573. . 2000. Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human inuenza virus A. J. Mol. Evol. 51:423432. Yang, Z., and W. Swanson. 2002. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol. Biol. Evol. 19: 4957. Yang, Z., W. S. Wong, and R. Nielsen. 2005. Bayes empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol. 22:11071118. Yang, Z. H., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specic lineages. Mol. Biol. Evol. 19:908917. Yang, Z. H., R. Nielsen, N. Goldman, and A. M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431449.
Brandon Gaut, Associate Editor Accepted June 27, 2005

Site-To-Site Variation of Synonymous Substitution Rates

Transféré par

Informations du document

Description originale:

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Site-To-Site Variation of Synonymous Substitution Rates

Transféré par

Droits d'auteur :

Formats disponibles

Site-to-Site Variation of Synonymous Substitution Rates

Sergei Kosakovsky Pond* and Spencer V. Muse

2376 Pond and Muse

PrDs j as 5 rij ; bs 5 qij ; T

PrDs j as 5ri ; bs 5qj ; T Prbs 5 qj ; g:

Site-to-Site Silent Variation 2377

2378 Pond and Muse

FIG. 1.Hierarchical presentation of rate heterogeneity models used in this work.

Site-to-Site Silent Variation 2379

Table 1 Data Sets Analyzed for Presence of Synonymous Rate Variation

2380 Pond and Muse

NOTE.E, expected value; CV, coefcient of variation; NA, not applicable.

Site-to-Site Silent Variation 2381

Table 3 MG94 GDD 3 3 3 Applied to the b-Globin Alignment

ljk Pras 5 aj ; bs 5 bk : PN s u51 v 5 1 luv Pras 5 au ; bs 5 bv

2382 Pond and Muse

Site-to-Site Silent Variation 2383

2384 Pond and Muse

Site-to-Site Silent Variation 2385

Brandon Gaut, Associate Editor Accepted June 27, 2005

Vous aimerez peut-être aussi