It has been suggested that, due to the structure of the genetic code, nonsynonymous transitions are less likely than transversions to cause radical changes in amino acid physicochemical properties so are on average less deleterious. This view was supported by some but not all mutagenesis experiments. Because laboratory measures of fitness effects have limited sensitivities and relative frequencies of different mutations in mutagenesis studies may not match those in nature, we here revisit this issue using comparative genomics. We extend the standard codon model of sequence evolution by adding the parameter η
Nucleotide changes between the two purines (A and G) and those between the two pyrimidines (C and T) are known as transitions, whereas changes between a purine and a pyrimidine are known as transversions. Because there are four types of transitions but eight types of transversions, the expected number of transitions relative to that of transversions (Ts/Tv) is 0.5 in DNA sequence evolution if all types of nucleotide changes have equal rates. In reality, however, Ts/Tv often exceeds 0.5 or even 1 (Nei and Kumar 2000; Yang 2006). This phenomenon, referred to as the transition bias, is well recognized and is commonly considered in nucleotide or codon substitution models of DNA sequence evolution that are used for estimating nucleotide substitution rates, inferring molecular phylogenies, and testing natural selection (Kimura 1980; Hasegawa et al. 1985; Li et al. 1985; Tamura and Nei 1993; Goldman and Yang 1994; Yang et al. 1998; Zhang et al. 1998).
The transition bias observed in the evolution of protein-coding DNA sequences has two sources of origin. First, the transition bias exists at the mutational level. Transitions require a much smaller distortion of the DNA double-helix structure than transversions so tend to occur more frequently in DNA replication. In addition, deamination, a common chemical change of nucleotides, leads to transitions. Genome-wide evidence for transition bias at the mutational level typically comes from mutation accumulation (MA) experiments, in which mutations are accumulated over many generations in organisms kept in extremely small populations to minimize the effect of selection. For instance, spontaneous mutations observed in MA experiments of Saccharomyces cerevisiae (Lynch et al. 2008; Zhu et al. 2014; Liu and Zhang 2019), Drosophila melanogaster (Haag-Liautard et al. 2008; Schrider et al. 2013), and Arabidopsis thaliana (Ossowski et al. 2010) show transition bias. Similar biases were found among single nucleotide polymorphisms of natural populations at noncoding or synonymous sites, which are presumably under little or no natural selection (Freudenberg-Hua et al. 2003; Rosenberg et al. 2003; Cutter 2006; Jiang and Zhao 2006; Hershberg and Petrov 2010). Notably, however, the transition bias at the mutational level was reported to be absent in the nematode Caenorhabditis elegans (Denver et al. 2009).
Second, the transition bias also results from natural selection. For example, Zhu et al. (2014) observed Ts/Tv = 0.95 from 867 spontaneous mutations in Saccharomyces cerevisiae MA lines, but Ts/Tv = 2.96 when the MA ancestral line is compared with the yeast reference genome. A similar inflation of Ts/Tv among natural polymorphisms relative to mutations is also observed in Caenorhabditis elegans (Denver et al. 2009). These results suggest that transitions are less deleterious and less likely to be purged by natural selection than transversions. The difference in purifying selection intensity on transitions and transversions has two potential causes in coding sequences. First, due to the structure of the genetic code, transitions are more likely than transversions to be synonymous, rendering transitions less often selected against than transversions and an inflated Ts/Tv. This mechanism is theoretically sound and empirically supported (Zhang 2000; Freudenberg-Hua et al. 2003; Schrider et al. 2013). Second, it has been suggested that, compared with nonsynonymous transversions, nonsynonymous transitions are less deleterious because they tend not to cause radical changes in amino acid physicochemical properties such as the charge, polarity, and size (Zhang 2000). For instance, Zhang (2000) grouped the 20 amino acids into different physicochemical bins and reported that nonsynonymous transitions are less likely than nonsynonymous transversions to cause amino acid changes from one bin to another. Furthermore, he reported a lower substitution rate for nonsynonymous transversion than nonsynonymous transition in mammalian gene evolution (Zhang 2000). Freudenberg-Hua et al. (2003) classified amino acid changes as radical or conservative according to the Grantham physicochemical distance (Grantham 1974) and showed a similar trend among human nonsynonymous polymorphisms. Although these studies support the existence of a selection strength difference between nonsynonymous transitions and transversions, Stoltzfus and Norris (2016) disagreed with this view. Based on eight mutagenesis studies that measured the fitness effects of 1,239 nonsynonymous mutations in six viruses and the beta-lactamase TEM-1 gene in Escherichia coli, the authors found no significant difference in fitness effect between nonsynonymous transitions and transversions. By contrast, Lyons and Lauring analyzed 11,282 nonsynonymous mutations from deep mutational scans in the influenza virus and human immunodeficiency virus and reported that nonsynonymous transversions are significantly more deleterious than nonsynonymous transitions (Lyons and Lauring 2017). Thus, whether nonsynonymous transversions are generally more deleterious than nonsynonymous transitions remains controversial.
We note that, although mutagenesis studies are powerful in its ability to test the fitness impact of any mutation of interest, it has a limited sensitivity. For instance, fitness difference smaller than 0.02% per generation is virtually undetectable in the lab (Gallet et al. 2012), whereas natural selection can detect a fitness difference that is larger than the inverse of the effective population size. In addition, we note that both nonsynonymous transitions and nonsynonymous transversions comprise a mixture of many types of amino acid changes whose relative frequencies depend on a number of factors such as codon frequencies, which vary among species. Hence, the relative frequencies of various amino acid changes in mutagenesis studies may not represent those in nature for the same species and genes, let alone other species and genes. Consequently, to answer whether nonsynonymous transitions are on average less deleterious than nonsynonymous transversions, we need to analyze evolutionary data. In this study, we built a codon substitution model that includes the parameter η that measures the fixation probability of nonsynonymous transitions relative to that of nonsynonymous transversions. Estimating η from each of 90 pairs of genomes across the tree of life, we find that η varies from significantly below 1 to significantly above 1. We show that this unexpected result is most likely due to the variation of amino acid exchangeabilities across evolutionary lineages.
Let η be the fixation probability or acceptability of nonsynonymous transitions relative to that of nonsynonymous transversions. To estimate η, we extended the Markov codon substitution model of Goldman and Yang (1994) such that the rate of substitution from codon u to codon v is given by
We performed computer simulations to verify the reliability of the above-described estimator of η. Specifically, we simulated coding sequence evolution under the model described above to produce a pair of homologous sequences of 500,000 codons. This long alignment was used because our purpose was to validate the maximum likelihood implementation of the model and to evaluate the potential bias in η estimation rather than the sampling error and because this length is close to the median alignment length (486,750 codons) of real sequences analyzed in this study. A series of η values ranging from 0.1 to 2 were used in the simulation. We also varied the genetic distance d, which is defined by the number of nucleotide substitutions per codon between the two sequences,


Simulations according to equation (1) show that the inferred η’s are unbiased when compared with the true values and are uncorrelated with (a) the genetic distance (d) between the two species in the clade, (b) transition bias at the mutational level (
After verifying the reliability of the η estimator, we applied the estimator to 90 clades across the tree of life (supplementary table S1, Supplementary Material online), including 15 eukaryotic, 6 archaeal, and 69 bacterial clades. The eukaryotic clades comprise six vertebrate, two insect, two fungal, three plant, and two protozoan clades. Each of the 90 clades includes a pair of relatively closely related species or strains with available genome sequences. With the genome-wide concatenated coding sequence alignment available for each clade (see Materials and Methods), we estimated η for each clade using codemlz.
The results are surprising (fig. 2 and supplementary table S1, Supplementary Material online). Within both eukaryotes and prokaryotes, some clades show η > 1, whereas other clades exhibit η < 1. In total, 34 clades have η > 1, with the largest being 2.0; whereas 56 clades have η < 1, with the smallest being 0.13. For example, for the clade containing two malaria pathogens Plasmodium vivax and P. knowlesi, the inferred η is 2.0, indicating that nonsynonymous transitions are twice as likely to be fixed as nonsynonymous transversions. By contrast, η is estimated to be 0.54 for the clade consisting of the ant species Atta cephalotes and Solenopsis invicta, meaning that the acceptability of nonsynonymous transitions is close to one half that of nonsynonymous transversions. Among the 90 clades, 81 have an η that deviates significantly from 1 (P < 0.05, likelihood ratio test followed by Bonferroni correction for multiple testing), including 30 cases of η > 1 and 51 cases of η < 1.


The estimated η varies among 90 clades sampled across the tree of life. Each clade is represented by an alignment of genome-wide orthologous coding sequences of two closely related species/strains. Statistical significance of η’s deviation from 1 is determined by an adjusted P value of <0.05 (likelihood ratio test followed by Bonferroni correction for multiple testing). Clade indices on the x-axis refer to those in supplementary table S1, Supplementary Material online.
To examine the potential variation of η within a genome, we randomly split the coding sequence alignment of a clade into two halves and estimated η from each half. The 90 estimated η’s from the first halves have a strong correlation (Spearman’s
Why does η vary among different clades, reaching values of both higher than 1 and lower than 1? Because the overall acceptability of nonsynonymous transitions relative to that of nonsynonymous transversions depends on the acceptability of individual nonsynonymous mutations, a mechanistic understanding requires considering the underlying relative exchangeabilities among amino acids (Zou and Zhang 2019). Here, the relative exchangeability between amino acids i and j, or REij, is the fixation probability of mutations converting between i and j, relative to the overall fixation probability of all nonsynonymous mutations. To study how amino acid exchangeabilities influence η, we simulated sequence evolution using the general codon model proposed in Yang et al. (1998), with slight modifications. Specifically, the rate of substitution from codon u to codon v equals
Here,
First, we simulated sequence evolution with a series of


Variations in
Second, we simulated sequence evolution under a series of
Third, we simulated 90 pairs of sequences using the same parameters except for the codon frequencies
Fourth, we also tested the effect of the genetic distance (d) between two species in a clade on the estimated η value. Sequence evolution under a series of d values between 0.05 and 5.5 was simulated, whereas all other parameters were unchanged. Although there is a positive correlation between the estimated η and d, η varies only within the range of 0.8–1.07 (fig. 3d), and no positive correlation exists between η and d in the real data (supplementary fig. S1c, Supplementary Material online). Hence, d cannot be a major factor driving the observed among-clade η variation in real data.
Although individual variations of the above parameters cannot explain the observed η variation, it remains possible that clade-specific combination of
Given the above set of largely negative findings, we used simulations to investigate the impact of the last component, REs, on η. Starting from the REs used above, we created a series of modified REs by randomly increasing or decreasing each original RE by a certain percentage, or by shuffling REs between different amino acid pairs (see Materials and Methods). Under constant


Variation of REs among clades can explain η variation. (a) Simulations with RE values based on the Grantham matrix (see Materials and Methods). For each of the ten new RE sets at a given level of deviation from or shuffled from the original values, five replicate sequence evolution simulations are conducted and the corresponding η estimates are plotted. Different RE sets at each deviation level and from each independent shuffle are distinguished by different (randomly assigned) colors. (b) The η’s estimated from the 90 clades simulated using the corresponding RE values of the real clades are plotted against the η’s estimated from the real clades. Dots are colored by the corresponding taxonomic group of the clades, as shown in the legend. The dashed red line indicates y = x. The y-axis value of each dot is the mean estimate from ten replicate simulations. (c) The expected η computed from the estimated REs and codon frequencies of each clade is plotted against the η estimated by the likelihood method from the alignment of the clade. The dashed red line indicates y = x. The y-axis value of each dot is the mean estimate from ten replicate simulations.
We recently used the codeml program in PAML to estimate
In theory, η is the mean RE for amino acid changes caused by nonsynonymous transitions, weighted by the corresponding frequencies of codon pairs, divided by its counterpart for nonsynonymous transversions (see Materials and Methods). To validate this relationship, we calculated η for each of the 90 clades from the estimated REs and codon frequencies of each clade. These calculated η values are very strongly correlated with the likelihood estimates of η from the real data (r = 0.97, P = 9E-54; fig. 4c), supporting the proposed mathematical relationship between η and REs.
Model-based analysis could yield misleading results if the assumed model differs from the reality and the analysis is sensitive to model misspecification (Zhang 1999). In our η estimator,
Additionally, in the presence of among-codon
We also simulated sequence evolution under equation (2) with among-codon variation of
Another major model simplification in our inference of η is that we considered only one aspect of mutational bias, transition bias. To confirm that this model simplification had not affected our conclusion, we simulated sequence evolution under equation (3) as follows:
Equation (3) is a more general version of equation (2) that allows a more complex mutational scheme. Here, mutations follow a General Time-Reversible (GTR) model, with mutability
The third major model simplification in our analyses is the negligence of potential natural selection on synonymous mutations. One can see from equation (1) that η is the transition bias of nonsynonymous substitutions (
To investigate the extent to which selection on synonymous mutations impacts our results, we modified and implemented in codemlz the FMutSel model (Yang and Nielsen 2008) that explicitly includes selection on synonymous mutations, as described by equation (4):
Here, Fv is twice the effective population size multiplied by the fitness of the genotype with codon v, k (=1, 2, or 3) is the position at which u and v differ, and
In this work, we studied a potential cause of the widespread phenomenon of transition bias in coding sequence evolution and tackled the controversy of whether nonsynonymous transversions are more deleterious than nonsynonymous transitions. We developed a likelihood estimator of η, the fixation probability of nonsynonymous transitions, relative to that of nonsynonymous transversions, and showed that this estimator is reliable. Surprisingly, however, applying this estimator to 90 two-species clades across the tree of life revealed a large variation of η from significantly above 1 to significantly below 1, whereas the difference between two random halves (or two halves with contrasting
As mentioned in Introduction, an intraspecific study in humans (Freudenberg-Hua et al. 2003) and an interspecific study in mammals (Zhang 2000) both concluded that nonsynonymous transversions are more deleterious than nonsynonymous transitions. Indeed, we observed η > 1 in three of the four mammalian clades surveyed here (the second to fourth clade in fig. 2). Nevertheless, our broader phylogenetic survey also found η < 1 in many other clades, including the clade of human and rhesus macaque (η = 0.92) (fig. 2). Thus, although these previous findings might not be wrong, they provided an incomplete picture.
In summary, our study showed that whether nonsynonymous transversions are overall more deleterious than nonsynonymous transitions varies with species. Of all possible factors we have investigated, the among-species variation in amino acid exchangeabilities is the primary cause of the among-species variation in η. The among-species variation in amino acid exchangeabilities is probably a result of proteome-wide changes in the physicochemical environments of amino acid residues during evolution (Zou and Zhang 2019), but more studies are required to gain a better understanding of its exact origin. Regardless, multiple recent studies have investigated amino acid exchangeabilities (or related relative substitution rates) and reported cases of species-specificity (Dang et al. 2010; Chen et al. 2019; Weber and Whelan 2019). The variations of amino acid exchangeabilities and η among species demonstrate that even some of the most fundamental parameters of protein and DNA sequence evolution vary among evolutionary lineages, which cautions against assuming a constant molecular evolutionary model across all life forms.
The sequence alignments used in this study were from Zou and Zhang (2019) and the full list of the 90 clades surveyed is in supplementary table S1, Supplementary Material online. Sequence data used were retrieved from various sources listed in supplementary table S1, Supplementary Material online. Specifically, coding sequence alignments of four mammalian clades, fruitflies, and yeasts were directly retrieved from respective databases. For each of the other eukaryotic clades, we queried in Ensembl (https://useast.ensembl.org/index.html; last accessed August 17, 2020) a list of all one-to-one orthologous genes for the pair of species and downloaded their coding sequences. The coding sequences were translated into protein sequences using MACSE v1.02 (Ranwez et al. 2011). Local pairwise protein sequence alignment was performed for each pair of orthologs by MAFFT v7.294b (Katoh and Standley 2013) using the L-INS-i algorithm. The corresponding coding sequence alignment was then derived using a custom Python script. All prokaryotic clades were sampled from strains available in the ATGC database (Kristensen et al. 2017). All alignments were filtered so that no gaps, missing data, or ambiguous codons exist. The alignments have been deposited to GitHub (https://github.com/ztzou/REvariation; last accessed August 17, 2020).
We modified the codeml program in PAML 4.8 (Yang 2007) and named the modified program “codemlz.” To use codemlz, one should use the following model setting (following the original codeml control file): seqtype = 1, CodonFreq = 3, clock = 0, model = 0, NSsites = 0, Mgene = 0, fix_alpha = 1, and alpha = 0. To conduct inferences under the FMutSel model, one should use: seqtype = 1, CodonFreq = 7, estFreq = 1, clock = 0, model = 0, NSsites = 0, Mgene = 0, fix_alpha = 1, and alpha = 0. Two options are added to the control file for η estimation: “fix_eta” and “eta.” Setting fix_eta = 0 allows inferring η with the initial value specified by eta, whereas setting fix_eta = 1 assumes a fixed η with the value specified by eta. The inferred η value is output to the “mlc” file generated by the program. The codemlz program can be accessed from GitHub (https://github.com/ztzou/codemlz; last accessed August 17, 2020).
We used codemlz to estimate η. We ran codemlz on a given sequence alignment 30 times, with three replicate runs of each of ten different initial η values (from 0.1 to 50), to avoid spurious results. The run yielding the highest likelihood provided the likelihood estimates of model parameters under the alternative hypothesis (H1) in which η is unconstrained. We further performed three replicate runs under the null hypothesis (H0) in which η = 1.0. The run with the highest likelihood offered the model parameters under H0. H1 and H0 were compared via a likelihood ratio test with one degree of freedom. Parameters inferred under H1 (
To estimate
To estimate parameters of the GTR model in each clade, we used the program baseml in PAML 4.9e. Four-fold degenerate sites in the concatenated coding sequence of each clade were used as input. Baseml was called with the parameter setting of model = 7, Mgene = 0, clock = 0, fix_kappa = 0, kappa = 5, fix_alpha = 0, and alpha = 0.5.
Simulations in figure 1 and supplementary figure S2, Supplementary Material online, followed the codon substitution model specified by equation (1). Simulations in figures 3 and 4 and supplementary figures S3 and S4, Supplementary Material online, followed the model specified by equation (2). Simulations in supplementary figure S5, Supplementary Material online, followed the model specified by equation (3). To simulate a clade with a pair of sequences, a transition matrix
The calculation of expected η follows the following:
Here, u and v are the codons before and after a single nucleotide substitution, respectively; i and j ≠ i are the amino acids encoded by u and v, respectively; NI and NV are the sets of nonsynonymous transitions and nonsynonymous transversions, respectively; and n is the number of codon pairs belonging to each set.
We thank Daohan Jiang, Daniel Lyons, and three anonymous reviewers for valuable comments. This work was supported by the U.S. National Institutes of Health research grant R01GM120093 to J.Z.