Learning about the roles that duplicate genes play in the origins of novel phenotypes requires an understanding of how their functions evolve. A previous method for achieving this goal, CDROM, employs gene expression distances as proxies for functional divergence and then classifies the evolutionary mechanisms retaining duplicate genes from comparisons of these distances in a decision tree framework. However, CDROM does not account for stochastic shifts in gene expression or leverage advances in contemporary statistical learning for performing classification, nor is it capable of predicting the parameters driving duplicate gene evolution. Thus, here we develop CLOUD, a multi-layer neural network built on a model of gene expression evolution that can both classify duplicate gene retention mechanisms and predict their underlying evolutionary parameters. We show that not only is the CLOUD classifier substantially more powerful and accurate than CDROM, but that it also yields accurate parameter predictions, enabling a better understanding of the specific forces driving the evolution and long-term retention of duplicate genes. Further, application of the CLOUD classifier and predictor to empirical data from Drosophila recapitulates many previous findings about gene duplication in this lineage, showing that new functions often emerge rapidly and asymmetrically in younger duplicate gene copies, and that functional divergence is driven by strong natural selection. Hence, CLOUD represents a major advancement in classifying retention mechanisms and predicting evolutionary parameters of duplicate genes, thereby highlighting the utility of incorporating sophisticated statistical learning techniques to address long-standing questions about evolution after gene duplication.
Gene duplication is a mutational process that creates copies of existing genes. Experimental studies in several diverse species have revealed that duplication occurs faster than all other types of spontaneous mutation (Lynch et al. 2008; Lipinski et al. 2011; Schrider et al. 2013; Keith et al. 2016; Konrad et al. 2018), thus serving as a major reservoir of genetic variation. Moreover, in contrast to other types of mutation, duplication generates redundancy, permitting the exploration of evolutionary space that may have been ancestrally forbidden (Ohno 1970). As a result, duplication has long been hypothesized to underlie the origins of novel phenotypes and complex biological systems (Ohno 1970). Indeed, mounting evidence of widespread duplication and its contribution to adaptation and speciation in all three biological domains (Zhang 2003; Kondrashov 2012) highlights its key role in evolution across the tree of life.
Yet, the evolutionary path leading from gene duplication to functional innovation remains unclear. According to traditional evolutionary models (Ohno 1970; Force et al. 1999; Stoltzfus 1999; Lynch and Force 2000; He and Zhang 2005; Rastogi and Liberles 2005), gene duplication generates a younger “child” copy that is identical to its older “parent” copy (fig. 1). Though such redundancy can promote adaptation through a relaxation of selective constraint (Ohno 1970), beneficial mutations are rare (Lynch and Force 2000). Hence, theory predicts that the most common outcome of gene duplication is nonfunctionalization, whereby one copy loses its function via an accumulation of deleterious mutations, leading to a reversion back to the ancestral single-copy state (Lynch and Force 2000). As a result, four mechanisms have been proposed to explain how numerous duplicate genes bypass nonfunctionalization and are retained over millions of years of evolution (Ohno 1970; Zhang 2003; Force et al. 1999; Stoltzfus 1999; He and Zhang 2005; Rastogi and Liberles 2005). First, either benefits of increased gene dosage (Ohno 1970) or recombination between gene copies (Zhang 2003) may result in conservation, whereby both copies maintain the ancestral function. Second, beneficial mutations in one gene copy may lead to neofunctionalization, whereby this copy acquires a new function while the other maintains the ancestral function (Ohno 1970). Third, deleterious mutations targeting different functional domains of each gene copy may result in subfunctionalization, whereby each copy maintains a distinct subset of the ancestral function (Force et al. 1999; Stoltzfus 1999). Fourth, a combination of deleterious and beneficial mutations targeting different functional domains of each gene copy may lead to specialization, whereby each copy maintains a subset of the ancestral function and also acquires a new function (He and Zhang 2005; Rastogi and Liberles 2005). Though mutations initiating the latter three retention mechanisms may take some time to appear, dosage balance can act as an intermediate state for preventing gene loss through nonfunctionalization during this waiting period (Hughes et al. 2007; Veitia et al. 2008; Edger and Pires 2009; Teufel et al. 2016; Raju 2020).


Hypothesized evolutionary trajectories of duplicate genes. Gene duplication results in two copies of an ancestral gene. Evolution may result in the loss of one functional copy by nonfunctionalization, or in the retention of two functional copies by either conservation, neofunctionalization, subfunctionalization, or specialization.
On the other hand, genomic studies from the past two decades show that the duplication process itself can often generate a child copy that is distinct from its parent copy (Cusack and Wolfe 2006; Hakes et al. 2007; Assis and Bachtrog 2013; Assis 2014; Cardoso-Moreira et al. 2016; Rogers et al. 2017). A key example is RNA-mediated duplication, which creates a child copy with its parent’s protein-coding sequence, but missing its introns and regulatory elements (Cusack and Wolfe 2006; Assis and Bachtrog 2013; Assis 2014). Duplicate gene copies arising from RNA-mediated duplication frequently display immediate sequence and expression differentiation consistent with functional independence (Cusack and Wolfe 2006; Assis and Bachtrog 2013; Assis 2014). Such differences are also more common after small-scale than whole-genome duplication events (Hakes et al. 2007), perhaps due to incomplete copying of short- and long-range regulatory elements, as well as after complex duplication events involving shuffling of exons and regulatory elements (Rogers et al. 2017). Thus, asymmetric duplication may directly contribute to the emergence of novel gene functions by reducing or removing the waiting period for new mutations required under traditional neofunctionalization, subfunctionalization, and specialization models (Ohno 1970; Zhang 2003; Force et al. 1999; Stoltzfus 1999; He and Zhang 2005; Rastogi and Liberles 2005) (fig. 1).
Due to their vastly different evolutionary forces and functional outcomes, differentiating among duplicate gene retention mechanisms is critical to understanding how gene duplication drives phenotypic innovation. Accordingly, many studies have tackled this problem through applications of comparative (Kondrashov et al. 2002; Kellis et al. 2004; He and Zhang 2005) and model-based approaches (Hughes and Liberles 2007; Konrad et al. 2011) to DNA sequence data. However, perhaps a more direct source of functional information about a gene is its expression profile, which captures its activity levels across multiple conditions (e.g., tissues, developmental stages, or disease states). In particular, gene expression profiles are ideal proxies for function due to their correlations with many other functional metrics, including protein-coding gene sequence divergence (Nuzhdin et al. 2004; Subramanian and Kumar 2004; Lemos et al. 2005; Hunt et al. 2013; Assis and Kondrashov 2014; Mahler et al. 2017; Assis 2019a), protein–protein interaction networks (Ge et al. 2001; Bhardwaj and Lu 2005; Lemos et al. 2005; French and Pavlidis 2011; Assis and Kondrashov 2014; Mahler et al. 2017; Assis 2019a), and biological processes and pathways (Zhou et al. 2002; Assis 2019a). Moreover, high-throughput gene expression data are widely available for numerous conditions and species, and simple to quantify and compare relative to alternative proxies of gene function.
With this in mind, Assis and Bachtrog (2013) designed a decision tree classification algorithm based on comparisons of differences between multi-tissue expression profiles of ancestral single-copy genes and their derived parent and child duplicate gene copies. Their approach (Assis and Bachtrog 2013), which was later generalized to other types of input data and implemented in the R package CDROM (Perry and Assis 2016), has been used to classify retention mechanisms of duplicate genes in Drosophila (Assis and Bachtrog 2013), mammals (Assis and Bachtrog 2015), honeybees (Chau and Goodisman 2017), and grasses (Jiang and Assis 2019). Together, these studies have demonstrated that duplicate genes are frequently retained by neofunctionalization (Assis and Bachtrog 2013, 2015; Assis 2014; Chau and Goodisman 2017; Jiang and Assis 2019), that child copies more often acquire new functions than parent copies (Assis and Bachtrog 2013, 2015; Assis 2014; Jiang and Assis 2019), and that new functions tend to be male-specific (Assis and Bachtrog 2013, 2015; Assis 2014; Chau and Goodisman 2017; Jiang and Assis 2019). These findings are concordant with earlier work showing that young animal and plant duplicate genes are often specifically expressed in male tissues (Betrán et al. 2002; Marques et al. 2005; Kaessmann 2010; Zhang et al. 2010; Wu et al. 2014). Further, earlier studies showed that levels of protein-coding sequence divergence are often consistent with retention mechanisms classified based on gene expression divergence (Assis and Bachtrog 2013, 2015; Assis 2014; Chau and Goodisman 2017), and a follow-up analysis in Drosophila revealed natural selection to play important roles in both whether and how duplicate genes are retained over evolutionary time (Jiang and Assis 2017).
However, there are two major shortcomings of the method implemented by CDROM (Assis and Bachtrog 2013; Perry and Assis 2016). First, it does not account for stochastic shifts in gene expression that may occur as a result of phenotypic drift (Oleksiak et al. 2002; Khaitovich et al. 2004). Second, it does not leverage the power provided by recent advances in statistical and machine learning (Hastie et al. 2009; Goodfellow et al. 2016). With these limitations in mind, we developed CLassification using Ornstein–Uhlenbeck of Duplicates (CLOUD), a novel classification algorithm that employs simulated training data generated by Ornstein–Uhlenbeck (OU) processes, which can model gene expression evolution along phylogenetic trees (Hansen 1997; Butler and King 2004; Bedford and Hartl 2008; Kalinka et al. 2010; Brawand et al. 2011; Perry et al. 2012; Rohlfs et al. 2014; Rohlfs and Nielsen 2015). In particular, because OU processes model Brownian motion with a pull toward an optimal state, they have a natural application to evolution, in which phenotypic drift is analogous to Brownian motion, natural selection to pull, and fittest phenotype to optimal state (Hansen 1997; Butler and King 2004).
Though OU processes have been used to model expression evolution of single-copy genes (Bedford and Hartl 2008; Kalinka et al. 2010; Brawand et al. 2011; Perry et al. 2012; Rohlfs et al. 2014; Rohlfs and Nielsen 2015), they have never been applied to the analogous problem after gene duplication. Thus, CLOUD adapts the OU framework to quantify expression evolution after gene duplication by modeling changes along a tree relating a pair of duplicate genes (parent and child copies) and their ancestral gene in a related sister species. Then, it utilizes the simulated output of these models to construct a multilayer feed-forward neural network for classifying duplicate genes as retained under conservation, neofunctionalization, subfunctionalization, or specialization. Moreover, this approach enables CLOUD to also predict parameters influencing the expression evolution of duplicate genes. Application of CLOUD to simulated data shows that it has high power to differentiate among classes, vastly outperforming CDROM for a wide range of parameter values, and also accurately predicts parameters shaping the expression evolution of retained duplicate genes. Further, application of CLOUD to empirical data from Drosophila (Assis and Bachtrog 2013; Assis 2019b) recapitulates a majority of the classified duplicate gene retention mechanisms presented by Assis and Bachtrog (2013), as well as generates parameter predictions that match theoretical expectations of these retention mechanisms. CLOUD has been implemented as an R package, and is freely available at http://assisgroup.fau.edu/software.html and https://github.com/rassis/CLOUD. Its input data can include gene expression measured for a single condition or multiple conditions of varying types (e.g., tissues, developmental stages, or disease states), making it applicable to a wide range of single- and multi-cellular biological systems.
In this section, we design our CLOUD classifier and predictor, evaluate its performance on simulated data, and apply it to an empirical data set from Drosophila. First, we introduce our OU framework for modeling expression evolution after gene duplication, which forms the basis of the CLOUD classifier and predictor. Next, we formally define the multilayer neural network architecture implemented by CLOUD for both classification and prediction tasks. We then employ simulations to evaluate the relative classification powers and accuracies of CDROM and CLOUD across a wide range of parameters, as well as in more targeted regions of the parameter space. We also use these simulations to probe its accuracy in predicting parameters driving gene expression evolution after duplication, specifically its ability to estimate optimal gene expression, selection strength, and phenotypic drift for each of the classified retention mechanisms. Last, we apply CLOUD to empirical data from Drosophila (Assis and Bachtrog 2013; Assis 2019b) to classify retention mechanisms and predict underlying evolutionary parameters after gene duplication in this lineage.
To design a model of expression evolution after gene duplication, we consider a pair of related species, Species 1 and Species 2, whose lineages diverged from that of a common ancestor at time (fig. 2A–C). Suppose that the common ancestor had a single-copy gene that underwent duplication, giving rise to a pair of duplicate genes at time

![Modeling expression evolution after gene duplication as an OU process. (A) Relationships between two species (black phylogeny) and their genes (green phylogeny). After the two species diverged, a blue gene in Species 1 (parent) underwent a duplication event to create a yellow copy (child). (B) Relationships among the parent gene copy (P) in Species 1, child gene copy (C) in Species 1, and ancestral single-copy gene (A) in Species 2. The duplication event occurred at time TPC, and both copies split from the ancestral gene at time TPCA. Optimal expression states for the parent, child, and ancestral genes are given by θP, θC, and θA, respectively. The internal branch and the branch above the root have optimal expression states θPC and θPCA, respectively. (C) Cartoon depicting expression profile changes (red lines) along the gene tree. Expression profiles change randomly through phenotypic drift with strength σ2, and toward the optimal expression state through selection with strength α. (D) Illustration of how we simulate multi-tissue expression vectors for parent, child, and ancestral genes. (E) Schematic of our feed-forward neural network architecture, which takes in p input units with values x1,x2,…,xp, has K output units with values y1,y2,…,yK, and has L hidden layers, where the number of units in layer ℓ is p[ℓ] and the value of unit k in layer ℓ is the activation ak[ℓ].](/dataresources/secured/content-1765993458804-7568aec4-0386-42c7-806d-54f8dae98f4c/assets/msaa267f2.jpg)
Modeling expression evolution after gene duplication as an OU process. (A) Relationships between two species (black phylogeny) and their genes (green phylogeny). After the two species diverged, a blue gene in Species 1 (parent) underwent a duplication event to create a yellow copy (child). (B) Relationships among the parent gene copy (P) in Species 1, child gene copy (C) in Species 1, and ancestral single-copy gene (A) in Species 2. The duplication event occurred at time
In each tissue, gene expression
Here, we assume that expression is independent across tissues. However, this approach can also be extended to account for the intertissue expression covariance structure using established approaches (Revell and Harmon 2008; Revell and Collar 2009; Clavel et al. 2015).
We denote the set of all genes with two copies in one species and one copy in the other as duplicate genes
We transform and compare the expression vector

| Feature number | Feature value |
|---|---|
| 1 |
|
| 2 to |
|
| 3 m + 2 to |
|
| 4 m + 2 |
|
| 4 m + 3 |
|
| 4 m + 4 |
|
| 4 m + 5 |
|
| 4 m + 6 |
|
| 4 m + 7 |
|
| 4 m + 8 |
|
| 4 m + 9 | rank of |
| 4 m + 10 | rank of |
| 4 m + 11 | rank of |
| 4 m + 12 | rank of |
| 4 m + 13 to 4 m + 20 |
kth moment of |
| 4m + 21 to 4 m + 28 |
kth moment of |
| 4m + 29 to 4 m + 36 |
kth moment of |
| 4m + 37 to 4 m + 44 |
kth moment of |
| 4m + 45 |
|
| 4 m + 46 |
|
| 4 m + 47 |
|
| 4 m + 48 |
|
| 4 m + 49 | rank of |
| 4 m + 50 | rank of |
| 4 m + 51 | rank of |
| 4 m + 52 | rank of |
| 4 m + 53 to 4 m + 60 |
kth moment of |
| 4m + 61 to 4 m + 68 |
kth moment of |
| 4m + 69 to 4 m + 76 |
kth moment of |
| 4m + 77 to 4 m + 84 |
kth moment of |
Given the input feature vector
When performing classification of duplicate gene retention mechanisms,
We consider a dense feed-forward neural network with
We define the values at unit
For hidden layer
For the prediction problem, we instead use the linear activation function (Goodfellow et al. 2016), such that the output for parameter prediction
This neural network was implemented in R (R Core Team 2013), using Keras (Allaire and Chollet 2017) with a TensorFlow backend (Abadi et al. 2015). A schematic of the neural network architecture is provided in figure 2E. Note that when L = 0, the neural network simplifies to a multinomial regression model (Hastie et al. 2009) for the classification problem, and to a linear regression model (Hastie et al. 2009) for the prediction problem.
To evaluate the classification power and accuracy of our multi-layer neural network classifier CLOUD, we trained and tested it on independent data sets simulated under each class of duplicate gene retention mechanisms (see Materials and Methods section). We assumed two hidden layers when training and testing CLOUD, as this resulted in the best cross-validation performance (see Materials and Methods section). Our training set consisted of 50,000 observations, of which 10,000 were simulated under each class. We trained CLOUD on these data, and explored evolutionary parameters drawn on a logarithmic scale across many orders of magnitude. Specifically, we independently drew the five parameters
Analysis of the resulting classifications reveals that CLOUD generally has substantially higher power (fig. 3A) and accuracy (fig. 3B) than CDROM. Specifically, across the wide range of test parameter values explored, CLOUD achieved an accuracy of 80.18%, whereas CDROM only reached 68.76% accuracy (fig. 3B). This represents a boost in over 16% classification accuracy of CLOUD relative to CDROM. In addition to increased overall accuracy, CLOUD yields similar accuracies across classes, illustrated by narrow ranges of correct classification rates (between 77.1 and 84.2%; diagonal cells of fig. 3B) and mis-classification rates (between 1.3 and 9.2%; non-diagonal cells of fig. 3B). In contrast, CDROM demonstrates a much higher correct classification rate for the “Specialized” class (96.9%) than for other classes (between 45.7 and 67.9%; fig. 3B), and a higher mis-classification rate toward the “Specialized” class (between 28.7 and 30.6%; fig. 3B). Moreover, CDROM experiences additional issues when classifying true “Subfunctionalized” observations, with an 18.3% mis-classification rate toward the “Conserved” class (fig. 3B).

![Classification results for CDROM and CLOUD with L = 2 hidden layers applied to data simulated under parameters α∈[1,103] and σ2∈[10−2,103]. (A) Receiver operating characteristic curves across the full range of false positive rates (left) and truncated at a false positive rate of 5% (right). Because CDROM is a decision tree classifier, its true positive and false positive rates are plotted as points. (B) Confusion matrices depicting the classification rates of each of the five duplicate gene retention classes for CDROM (left) and CLOUD (right).](/dataresources/secured/content-1765993458804-7568aec4-0386-42c7-806d-54f8dae98f4c/assets/msaa267f3.jpg)
Classification results for CDROM and CLOUD with L = 2 hidden layers applied to data simulated under parameters
In addition, CLOUD is much more conservative than CDROM for pairs of α and
In addition to its vastly improved classification performance relative to CDROM, a unique attribute of CLOUD is its ability to learn parameters underlying the expression evolution of duplicate genes. Thus, we next assessed the accuracy of the CLOUD predictor by training and testing it on the same independent simulated data sets that we employed for training and testing the CLOUD classifier. In particular, we trained CLOUD (again assuming two hidden layers) to make predictions for each of the five parameters (
To investigate prediction accuracy, we examined the distributions of mean parameter prediction errors across the six tissues (fig. 4). In general, all parameter estimates appear unbiased, with mean prediction errors centered on zero. Moreover, estimates of optimal expression states (

![Prediction results for application of CLOUD with L = 2 hidden layers to data simulated under parameters α∈[1,103] and σ2∈[10−2,103] for each of the five classes of duplicate gene retention mechanisms. Violin plots display distributions of mean parameter prediction errors across the m = 6 tissues for each simulated test set.](/dataresources/secured/content-1765993458804-7568aec4-0386-42c7-806d-54f8dae98f4c/assets/msaa267f4.jpg)
Prediction results for application of CLOUD with L = 2 hidden layers to data simulated under parameters

| Class | Optimal expression state at a given tissue |
|---|---|
| Conserved |
|
| Neofunctionalized parent |
|
| Neofunctionalized child |
|
| Subfunctionalized |
|
| Specialized |
|
As with classification, confidence in parameter predictions made by CLOUD also vary with α and
We showed that under ideal settings, CLOUD is a superior classifier to CDROM, and is also adept at predicting underlying evolutionary parameters. Thus, we next explored the performance of the trained CLOUD classifier and predictor on test data generated under scenarios that violated model assumptions of the training data. In particular, we considered test data in which the duplicate gene retention mechanism was non-uniform across the simulated tissues. Specifically, we evaluated scenarios in which
Comparisons among these diverse scenarios illustrates that the classification performance of CLOUD is dependent on a combination of the difference between flexibilities in model parameters of the two retention mechanisms and the numbers of tissues sharing each retention mechanism (supplementary fig. S6, Supplementary Material online). In particular, the retention mechanism with greater flexibility in model parameters is most frequently chosen by CLOUD unless a majority of tissues share the more constrained retention mechanism. For example, when Tissue Mechanism A is “Conserved” (the most constrained retention mechanism) and shared by
Despite its difficulty in classifying retention mechanisms under these mixed scenarios, CLOUD is generally unbiased in its parameter predictions (supplementary figs. S7–S11, Supplementary Material online). Though mixtures of retention mechanisms occasionally lead to biased parameter predictions, such as the overestimation of
Our simulation experiments highlight the exceptional classification performance of CLOUD relative to CDROM, as well as the unique ability of CLOUD to predict parameters underlying the evolution of duplicate genes. Hence, we next sought to use CLOUD to classify retention mechanisms and predict parameters of 208 duplicate genes in Drosophila (Assis and Bachtrog 2013) from their expression data in six tissues (Assis 2019b). Specifically, we first used PhyML (Guindon et al. 2010) to estimate a gene tree relating each parent, child, and ancestral gene in this data set of duplicate genes (Assis and Bachtrog 2013) (see Materials and Methods section). Next, as in our simulation studies, we trained CLOUD (assuming two hidden layers) on a large balanced simulated training set of 50,000 observations (10,000 from each of five classes), with evolutionary parameters
Analysis of the resulting classifications reveals that the predominant mechanism of duplicate gene retention in Drosophila is neofunctionalization in which the child copy acquires a new function (61.43%; fig. 5), mirroring the findings of Assis and Bachtrog (2013). Moreover, classifications of CLOUD are generally concordant with those of CDROM (59.29%), with three key differences. In particular, of the 167 duplicates classified as “Neofunctionalized child” by CDROM, 16 are classified as “Conserved” by CLOUD. In addition, of the 53 duplicates classified as “Conserved” by CDROM, 18 are classified as “Neofunctionalized child” and 14 as “Specialized” by CLOUD. Finally, of the 41 duplicates classified as “Specialized” by CDROM, 18 are classified as “Neofunctionalzied child” by CLOUD. Based on our simulation results (supplementary figs. S1 and S2, Supplementary Material online), it is likely that these discrepancies reflect differences in the abilities of the CLOUD and CDROM classifiers to handle gene expression stochasticity.


Classification and prediction results for application of CLOUD with L = 2 hidden layers to empirical data from Drosophila (Assis and Bachtrog 2013; Assis 2019b). Box plots overlaid onto strip plots show distributions of log-transformed parameter estimates for each of the five classes of duplicate gene retention mechanisms. Note that six estimates, corresponding to the six tissues in the empirical data set, are plotted for each parameter. The confusion matrix in the bottom right illustrates the high concordance in classifications of CLOUD and CDROM for these empirical data, with both methods classifying the majority of duplicate genes as retained by neofunctionalization of the child copy.
We next examined the parameter predictions of CLOUD. Here, our major question was whether these predictions match theoretical expectations of duplicate gene retention mechanisms. To answer this question, we examined distributions of empirical parameter estimates for each class obtained with the CLOUD classifier (fig. 5). We first considered optimal expression estimates
As a final analysis of the empirical data, we performed a case study of the child gene Dntf-2r and its parent Dntf-2. We chose this duplication event because it was well-characterized in earlier studies (Betrán and Long 2003; Assis and Bachtrog 2013; Jiang and Assis 2017), providing us with a baseline for comparing our findings. In particular, Dntf-2r arose in the D. melanogaster lineage after its divergence from the D. pseudoobscura lineage (Betrán and Long 2003; Assis and Bachtrog 2013). Several studies showed that Dntf-2r is specifically expressed in the testis and evolving under positive selection, whereas its parent Dntf-2 is expressed broadly across tissues and evolving under negative selection (Bhattacharya and Steward 2002; Betrán and Long 2003; Assis and Bachtrog 2013; Jiang and Assis 2017). Hence, it has been hypothesized that Dntf-2r underwent neofunctionalization and acquired a new male-specific function after duplication (Betrán and Long 2003; Assis and Bachtrog 2013). Consistent with this hypothesis, Dntf-2r and Dntf-2 are classified by both CDROM and CLOUD as retained by neofunctionalization of the child copy. Moreover, CLOUD estimates mean
In this study, we have demonstrated that modeling of expression evolution and application of modern statistical learning techniques substantially enhances performance in classifying the retention mechanisms of duplicate genes and predicting their underlying parameters. Specifically, our new method CLOUD has high power and accuracy in discriminating among five classes of duplicate gene retention mechanisms (fig. 3, supplementary figs. S1–S4, Supplementary Material online), and high accuracy in parameter estimation (fig. 4). It represents a major advancement over the only previously available expression-based method, CDROM (Assis and Bachtrog 2013; Perry and Assis 2016), which has much lower classification power and accuracy (fig. 3), displays strong classification bias (supplementary figs. S1 and S2, Supplementary Material online), and cannot perform the task of parameter prediction at all. Thus, CLOUD represents a major advancement in classifying duplicate gene retention mechanisms and predicting their evolutionary parameters. Moreover, though our study focuses on its application to gene expression data from multiple tissues, CLOUD can also be applied to gene expression data from multiple conditions of different types (e.g., developmental stages or disease states), or even to gene expression data from a single condition, which is always the case for single-celled organisms. As a result, CLOUD can be used to learn about evolution after gene duplication in many diverse biological systems.
When designing the multi-layer neural network architecture of CLOUD, we took measures to mitigate overfitting through elastic net-style regularization (Zou and Hastie 2005), which shrinks model weights through a mixture of L1- and L2-norm penalties (Hastie et al. 2009). However, several other approaches, such as early stopping (Bishop 1995; Sjöberg and Ljung 1995; Goodfellow et al. 2016) and dropout (Srivastava et al. 2014; Goodfellow et al. 2016), could have been used instead to achieve a similar goal. Of the two alternatives mentioned, the dropout regularization procedure is closer to our approach, with a key difference in that regularization proceeds in a more stochastic fashion. Specifically, regularization is performed by dropping some proportion
We also considered an alternative approach for constructing the CLOUD classifier and predictor by employing maximum likelihood estimation (Casella and Berger 2002; Brawand et al. 2011; Clavel et al. 2015). Specifically, given expression data for parent, child, and ancestral genes, one can use maximum likelihood to estimate the set of parameters
Application of CLOUD to empirical data from Drosophila (Assis and Bachtrog 2013; Assis 2019b) recapitulated many of the classifications previously inferred by CDROM (Assis and Bachtrog 2013), notably classifying the majority of duplicate genes as retained by neofunctionalization in which the child copy acquires a new function (fig. 5). Predicted parameters of Drosophila duplicate genes were also generally concordant with theoretical expectations of their classified retention mechanisms (table 2, fig. 5). In particular, observed differences among distributions of optimal expression estimates for parent (
In this section, we detail the algorithmic choices used to train CLOUD, the simulation setting used to compare its performance to the classifier CDROM, and the necessary steps for application of CLOUD and CDROM to empirical data from Drosophila.
Consider a set of Nk training observations for class
To train the neural network, we wish to identify the set of parameters (weights and biases)
To prevent overfitting, we include an elastic net-style regularization penalty term (Zou and Hastie 2005) on the weights of each layer with two tuning hyper parameters. Specifically, we reduce the complexity of the fitted model with the tuning parameter
In the classification problem, for training observation
In the prediction problem,
Simulated data have been successfully used to train models for learning about evolution from genomic data in many recent studies (Lin et al. 2011; Schrider and Kern 2016; Sheehan and Song 2016; Kern and Schrider 2018; Schrider et al. 2018; Sugden et al. 2018; Flagel et al. 2019; Mughal et al. 2020; Mughal and DeGiorgio 2019; Adrion et al. 2020). Therefore, to train CLOUD for both classification and prediction, we generated a balanced simulated data set with 104 observations from each of the five classes, for a total of N = 50,000 training observations. We assumed that tissues were independent, and that there were a total of m = 6 tissues as in an empirical data from Drosophila (Assis 2019b) that we later applied our method to, for a total of p = 108 input features.
To make the simulated data set more realistic, we drew model parameters
Given a number of hidden layers L and the pair of regularization tuning parameters λ and γ, we estimated the set of parameters
Previous studies have found that neural networks with enough hidden layers or units can approximate any function, and therefore lead to overfitting (Cybenko 1989; Goodfellow et al. 2016). Hence, based on simulations, we estimated that a neural network with
To compare the relative classification powers and accuracies of CDROM and CLOUD and explore the prediction accuracy of CLOUD, we simulated training and test data sets for duplicate genes based on an OU process, which is described in subsection Training the neural network on data simulated from OU processes above. However, in that subsection, we assumed that expression vectors for single-copy genes
For our simulated training and test sets, we generated a background set of 10,000 six-tissue expression vectors for single-copy genes that was inspired by those of the single-copy genes identified in Drosophila (Assis and Bachtrog 2013; Assis 2019b). Specifically, we applied the Brownian motion model (Felsenstein 1973) implemented in mvmorph (Clavel et al. 2015) to expression vectors of single-copy genes between Species 1 and 2, assuming that tissues were independent and that the root of the two-species phylogeny had height one, to estimate ancestral expression θ and variance
To test either the classifier or predictor, we generated a balanced set of duplicate gene expression vectors, such that each of the five classes had 1,000 observations, for a total of N = 5,000 test observations. We assumed that tissues were independent, and that there was a total of m = 6 tissues as in the training set. For a given class, we drew parameters
To assay how CLOUD performs in different portions of the parameter space, we also examined its accuracy on test sets drawn from restricted parameter values. Specifically we considered three distinct ranges for α of
We applied CDROM and CLOUD to empirical data consisting of Drosophila duplicate and single-copy genes (Assis and Bachtrog 2013) and their expression abundances in six tissues (Assis 2019b). In particular, duplicate and single-copy genes in D. melanogaster and D. pseudoobscura were obtained from Assis and Bachtrog (2013). In that study, pairs of duplicate genes in each species were identified via BLAST searches (Altschul et al. 1990) and supplemented with those from Chen et al. (2010). A table of orthologs, or genes that arose from the same common ancestor in 12 sequenced Drosophila species (Drosophila 12 Genomes Consortium 2007), was downloaded from FlyBase at https://www.flybase.org. Orthologs were used to determine presence or absence of duplicate gene copies in D. melanogaster and D. pseudoobscura. Gene duplication events that occurred after the divergence of the D. melanogaster and D. pseudoobscura lineages were defined as those for which one duplicate gene copy is present in both species (parent) and the second is only present in one species (child). Quantile-normalized gene expression abundances for carcass, female head, ovary, male head, testis, and accessory gland tissues in D. melanogaster and D. pseudoobscura were obtained from the Dryad data set associated with Assis (2019b) at https://doi.org/10.5061/dryad.742564m. In that study, paired-end RNA-sequencing reads were downloaded from modENCODE (Celniker et al. 2009) at https://www.modencode.com, and aligned to reference transcriptomes of each species with Bowtie 2 (Langmead et al. 2009). Abundances in fragments per kilobase of exon per million fragments mapped (FPKM, Trapnell et al. 2013) were calculated with eXpress (Roberts and Pachter 2013), quantile-normalized, and log-transformed. After examination of the distribution of these values, genes with little or no expression in all tissues were removed.
Because CLOUD requires estimates of
We thank the editor and two anonymous reviewers for their comments that helped improve this manuscript. This work was supported by National Science Foundation grants DEB-1555981 and DEB-2001059 to RA and DEB-1949268 and BCS-2001063 to MD, and by National Institutes of Health grant R35GM128590 to MD.