Competing Interests: The authors have declared that no competing interests exist.
Current address: TaiChi AI Ltd, London, United Kingdom
Demographic events shape a population’s genetic diversity, a process described by the coalescent-with-recombination model that relates demography and genetics by an unobserved sequence of genealogies along the genome. As the space of genealogies over genomes is large and complex, inference under this model is challenging. Formulating the coalescent-with-recombination model as a continuous-time and -space Markov jump process, we develop a particle filter for such processes, and use waypoints that under appropriate conditions allow the problem to be reduced to the discrete-time case. To improve inference, we generalise the Auxiliary Particle Filter for discrete-time models, and use Variational Bayes to model the uncertainty in parameter estimates for rare events, avoiding biases seen with Expectation Maximization. Using real and simulated genomes, we show that past population sizes can be accurately inferred over a larger range of epochs than was previously possible, opening the possibility of jointly analyzing multiple genomes under complex demographic models. Code is available at https://github.com/luntergroup/smcsmc.
The demographic history of a species has a profound impact on its genetic diversity. Changes in population size, migration and admixture events, and population splits and mergers, shape the genealogies describing how individuals in a population are related, which in turn shape the pattern and frequency of observed genetic variants in extant genomes. By modeling this process and integrating out the unobserved genealogies, it is possible to infer the population’s demographic history from the observed variants. However, in practice this is challenging, as individual mutations provide limited information about tree topologies and branch lengths. If many mutations were available to infer these genealogies this would not be problematic, but the expected number of observed mutations increases only logarithmically with the number of observed genomes, and recombination causes genealogies to change along the genome at a rate proportional to the mutation rate. As a result there is considerable uncertainty about the genealogies underlying a sample of genomes, and because the space of genealogies across the genome is vast, integrating out this latent variable is hard.
A number of approaches have been proposed to tackle this problem [reviewed in 1]. A common approximation is to treat recombination events as known and assume unlinked loci, either by treating each mutation as independent [2–7], or by first identifying tracts of genetic material unbroken by recombination [8–12]. To account for recombination while retaining power to infer earlier demographic events, it is necessary to model the genealogy directly. ARGWeaver [13] uses Markov chain Monte Carlo (MCMC) for inference, but does not allow the use of a complex demographic model, and since mutations are only weakly informative about genealogies this leaves the inferred trees biased towards the prior model and less suitable for inferring demography. Restricting itself to single diploid genomes, the Pairwise Sequentially Markovian Coalescent (PSMC) model [14] uses an elegant and efficient inference method, but with limited power to detect recent changes in population size or complex demographic events. Several other approaches exist that improve on PSMC in various ways [15–18], but they remain limited particularly in their ability to infer migration.
We here focus on the general problem of inferring demography from several whole-genome sequences, which is informative about demographic events in all but the most recent epochs [13, 14, 16]. A promising approach which so far has not been applied to this problem is to use a particle filter. Particle filters have many desirable properties [19–22], and applications to a range of problems in computational biology have started to appear [23–26]. Like MCMC methods, particle filters converge to the exact solution in the limit of infinite computational resources, are computationally efficient by focusing on realisations that are supported by the data, do not require the underlying model to be approximated, and generate explicit samples from the posterior distribution of the latent variable. Unlike MCMC, particle filters do not operate on complete realisations of the model, but construct samples sequentially, which is helpful since full genealogies over genomes are cumbersome to deal with.
To use particle filters, we use a formulation of the coalescent model in which the state is a genealogical tree at a particular genome locus, which “evolves” sequentially along the genome, rather than in evolutionary time. To avoid confusion, in this paper “time” by itself refers to the variable along which the model evolves, while evolutionary (coalescent, recombination) time refers to an actual time in the past on a genealogical tree.
Originally, particle filters were introduced for models with discrete time evolution and with either discrete or continuous state variables [19, 27]. In this paper, the latent variable is a piecewise constant sequence of genealogical trees along the genome, with trees changing only after recombination events that, in mammals, occur once every several hundred nucleotides. The observations of the model are genetic variants, which are similarly sparse. Realizations of the discrete-time model of this process (where “time” is the genome locus) are therefore stationary (remain in the same state) and silent (do not produce an observation) at most transitions, leading to inefficient algorithms. Instead, it seems natural to model the system as a Markov jump process (or purely discontinuous Markov process, [28]), a continuous-time stochastic process whose realisations are piecewise constant functions.
Particle filters have been generalised to continuous-time diffusions [29–31], as well as to Markov jump processes on discrete state spaces [32, 33], and hybrids of the two [34, 35], as well as to piecewise deterministic processes [36]; for a general treatment see [37, 38]. Here we focus on Markov jump processes that are continuous in both time and state space; to our knowledge the method has not been extended to this case. The algorithm we propose relies on Radon-Nikodym derivatives [see e.g. 31], and we establish criteria for choosing a finite set of “waypoints” that makes it possible to reduce the problem to the discrete-time case, while ensuring that particle degeneracy remains under control.
Although the algorithm generally works well, we found that for the CwR model we obtain biased inferences for some parameters. For example, coalescent rates for recent epochs are associated with tree nodes that persist across long genomic segments (the model exhibits “long forgetting times”), because their short descendant branches attract few recombinations. They have few informative mutations as well, and collecting these mutations therefore requires long lags in the fixed-lag smoothing procedure, in turn resulting in increased particle degeneracy [39]. For discrete-time models the Auxiliary Particle Filter [40] addresses a related problem by “guiding” the particle filter towards states that are likely to be relevant in future iterations, using an approximate likelihood that depends on data one step ahead. This approach does not work well for some continuous-time models, including ours, that have no single preferred time scale. Instead we introduce an algorithm that shapes the resampling process by an approximate “lookahead likelihood” that can depend on data at arbitrary distances ahead. Using simulations we show that this substantially reduces the bias.
The particle filter generates samples from the posterior distribution of the latent variable, here the sequence of genealogies along the genome, and we infer the model parameters from this sample. One strategy is to use stochastic expectation-maximization [SEM; 41]. However, such approaches yield point estimates, ignoring any uncertainty in the inferred parameters. Combined with the bias due to self-normalized importance sampling, which causes particle filters to under-sample low-rate events, this results in a non-zero probability of inferring zero event rates, which are fixed points of any SEM procedure. In principle this can be avoided by using an appropriate prior on the rate parameters. To implement this we use Variational Bayes to estimate an approximate joint posterior distribution over parameters and latent variables, partially accounting for the uncertainty in the inferred parameters, as well as providing a way to explicitly include a prior. In this way zero-rate estimates are avoided, and more generally we show that this approach further reduces the bias in parameter estimates.
Applying these ideas to the coalescent-with-recombination (CwR) model, we find that the combination of lookahead filter and Variational Bayes inference enables us to analyze four diploid human genomes simultaneously, and infer demographic parameters across epochs spanning more than 3 orders of magnitude, without making model approximations beyond passing to a continuous-locus model.
The remainder of the paper is structured as follows. We first introduce the particle filter, generalise it to continuous-time and -space Markov jump processes, describe how to choose waypoints, introduce the lookahead filter, and describe the Variational Bayes procedure for parameter inference. In the results section we first introduce the continuous-locus CwR process, then discuss the lookahead likelihood, choice of waypoints and parameter inference for this model, before applying the model to simulated data, and finally show the results of analyzing sets of four diploid genomes of individuals from three human populations. A discussion concludes the paper.
The coalescent-with-recombination (CwR) process, and the graph structures that are the realisations of the process, were first described by Hudson [42], and were given an elegant mathematical description by Griffiths [43], who named the resulting structure the Ancestral Recombination Graph (ARG). Like the coalescent process, these models run backwards in evolutionary time and consider the entire sequence at once, making it difficult to use them for inference on whole genomes. The first model of the CwR process that evolves sequentially rather than in the evolutionary time direction was introduced by Wiuf and Hein [44], opening up the possibility of inference over very long sequences. Like Griffiths’ process, the Wiuf-Hein algorithm operates on an ARG-like graph, but it is more efficient as it does not include many of the non-observable recombination events included in Griffiths’ process. The Sequential Coalescent with Recombination Model (SCRM) [45] further improved efficiency by modifying Wiuf and Hein’s algorithm to operate on a local genealogy rather than an ARG-like structure. Besides the “local” tree over the observed samples, this genealogy includes branches to non-contemporaneous tips that correspond to recombination events encountered earlier in the sequence. Recombinations on these “non-local” branches can be postponed until they affect observed sequences, and can sometimes be ignored altogether, leading to further efficiency gains while the resulting sample still follows the exact CwR process. An even more efficient but approximate algorithm is obtained by culling some non-local branches. In the extreme case of culling all non-local branches the SCRM approximation is equivalent to the SMC’ model [46, 47]. With a suitable definition of “current state” (i.e., the local tree including all non-local branches) these are all Markov processes, and can all be used in the Markov jump particle filter; here we use the SCRM model with tunable accuracy as implemented in [45].
The state space

Mutations follow a Poisson process whose rate at s depends on the state xs via μ(s)B(xs), where μ(s) is the mutation rate at s per nucleotide and per generation, and B(xs) is the total branch length of the genealogy xs. Mutations are not observed directly, but their descendants are; a complete observation is represented by a set

Particle filter methods, also known as Sequential Monte Carlo (SMC) [22], generate samples from complex probability distributions with high-dimensional latent variables. An SMC method uses importance sampling (IS) to approximate a target distribution using weighted random samples (particles) drawn from a tractable distribution. We briefly review the discrete-time case. Suppose that particles {(x(i), w(i))}i=1,…,N approximate a distribution with density p(x), such that






Note that the algorithm can be seen as a recipe to transform a sample from P(X) to a sample from P(X)P(Y|X)/P(Y) = P(X|Y), that is, an application of Bayes’ theorem. Following this interpretation we will refer to P(X) as the prior distribution, and P(X|Y) as the posterior.
The algorithm generates an approximation to
The marginal likelihood can be estimated (although with high variance, see [51]) by setting the weights to
Algorithm 1 Particle filter
Input:
Output: Particles
For s from 0 to L − 1
Loop invariant:
If ESS < N/2:
Resample, with replacement,
For i from 1 to N:
Sample
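To make the structure of Algorithm 1 concrete, the following minimal Python sketch implements the loop above for a generic discrete-time model. The callables sample_prior, propose, trans_pdf, prop_pdf and emit_pdf are illustrative placeholders (not part of the SMCSMC code base), and states are assumed to be stored in a NumPy array so that fancy indexing works.

```python
import numpy as np

def particle_filter(observations, n_particles, sample_prior, propose,
                    trans_pdf, prop_pdf, emit_pdf, rng=None):
    # Minimal discrete-time particle filter in the spirit of Algorithm 1.
    rng = np.random.default_rng() if rng is None else rng
    x = sample_prior(n_particles, rng)              # x_0^(i) ~ prior
    w = np.full(n_particles, 1.0 / n_particles)     # normalised weights
    for y_s in observations:
        ess = 1.0 / np.sum(w ** 2)                  # effective sample size
        if ess < n_particles / 2:                   # resampling criterion
            idx = rng.choice(n_particles, size=n_particles, p=w)
            x = x[idx]
            w = np.full(n_particles, 1.0 / n_particles)
        x_new = propose(x, y_s, rng)                # extend each particle
        # weight update: (prior transition * likelihood) / proposal density
        w = w * trans_pdf(x_new, x) * emit_pdf(y_s, x_new) / prop_pdf(x_new, x, y_s)
        w = w / np.sum(w)
        x = x_new
    return x, w
```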
For the hidden process we now consider Markov jump processes, whose realisations are piecewise constant functions
The complete model is defined by specifying the observation process. We consider models where observations Y are generated by a Poisson process whose intensity at time (i.e. locus) s depends on Xs [a Cox process, see e.g. 52]. The space of observations


The absence of events in an interval s ∈ [a, b) is also informative about the latent variable through the exponential factor in (8). In practice however, not all intervals may have been observed, so that events may or may not have occurred in these intervals. Assuming that the “observation process” is independent of the Markov jump process X, such unobserved intervals can simply be left out of integral (8).
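As an illustration, the sketch below computes the log-likelihood contribution of a Cox-process observation under these conventions. Here rate_fn is an assumed callable returning the intensity μ(s)B(xs) along the current particle, and the integral over each observed interval is approximated by simple quadrature; unobserved intervals are simply omitted.

```python
import numpy as np

def cox_interval_loglik(event_loci, rate_fn, observed_intervals, grid_points=1000):
    # Sum of log-intensities at the observed events ...
    log_lik = sum(np.log(rate_fn(s)) for s in event_loci)
    # ... minus the integrated intensity, taken over observed intervals only;
    # unobserved intervals are left out of the integral, as in eq. (8).
    for a, b in observed_intervals:
        grid = np.linspace(a, b, grid_points)
        rates = np.array([rate_fn(s) for s in grid])
        log_lik -= rates.mean() * (b - a)           # simple quadrature
    return log_lik
```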
Some more notation is needed to describe the Markov jump process version of Algorithm 1. As above πx denotes the prior distribution of the latent variable X, and ξx denotes the proposal distribution, both Markov processes on
Algorithm 2 Particle filter for Markov jump processes
Input:
Output: Particles
For j from 0 to K − 1
Loop invariant:
If
Resample
For i from 1 to N:
Sample
The choice of waypoints s1, …, sK is discussed below; in particular they need not be the same as the event loci
Note that by the nature of Markov jump processes, particles that start with identical latent variables have a positive probability of remaining identical after a finite time. Combined with resampling, this causes a considerable number of particles to have one or more identical siblings. For computational efficiency we represent such particles once, and keep track of their multiplicity k. When evolving a particle with multiplicity k > 1, we increase the exit rate k-fold, and when an event occurs one particle is spawned off while the remaining k − 1 continue unchanged.
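The following sketch illustrates this bookkeeping, under the simplifying assumption of a state-independent exit rate (in the CwR model the rate depends on the current genealogy); jump is a hypothetical callable performing one Markov-jump transition.

```python
import numpy as np

def evolve_with_multiplicity(state, k, exit_rate, span, jump, rng):
    # k identical particles are stored once.  The first jump among the k
    # copies occurs at a k-fold rate; when it fires, one copy is spawned off
    # (it would then evolve independently; that recursion is omitted here)
    # while the remaining k-1 copies continue unchanged.
    spawned = []
    t = rng.exponential(1.0 / (k * exit_rate))
    while t < span and k > 1:
        spawned.append(jump(state, rng))
        k -= 1
        t += rng.exponential(1.0 / (k * exit_rate))
    if t < span:
        state = jump(state, rng)   # the last remaining copy jumps too;
                                   # its subsequent evolution is omitted
    return state, k, spawned
```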
At the jth iteration, Algorithm 2 uses data up to waypoint sj to build particles approximating
In the continuous-time context it is natural to look an arbitrary distance ahead. Similar to APF, the lookahead distribution can be conditioned on the current state only, and must be an approximation of the true distribution. It should be heavy-tailed with respect to the true distribution to ensure that the variance of the estimator remains finite [22], which implies that the distribution should not depend on data too far beyond s; what is “too far” depends on how well the lookahead distribution approximates the true distribution.
The lookahead distribution is only evaluated on a fixed observation y, and is used to quantify the plausibility of a current state
Algorithm 3 Markov-jump particle filter with lookahead
Input:
Output: Particles
For j from 0 to K − 1
Loop invariant:
Loop invariant:
If
Resample
For i from 1 to N:
Sample
(see Appendix, “Proof of Algorithm 3”.) To implement the lookahead particle filter we need a tractable approximate likelihood of future data given a current genealogy. To do this we simplify the full likelihood, and ignore all data except for a digest of singletons and doubletons that are informative of the topology and branch lengths near the tips of the genealogy—in particular, singletons are informative of terminal branch lengths, and doubletons identify the existence of nodes with precisely two descendants (“cherries”). This digest consists of the distance si to the nearest future singleton for each haploid sequence, and ≤ n/2 mutually consistent cherries ck = (ak, bk) identified by their two descendants ak, bk, together with loci


a. Example of data digest. Lines represent genomes of 6 lineages, circles observed genetic variants. Of the data shown, one singleton (yellow) and five doubletons (red) contribute to the digest. Cherry c3 is supported by a single doubleton; r does not contribute because the mutation patterns p and q are incompatible with c3. Similarly, p does not contribute because it is incompatible with c2 and c3. b. Partial genealogy (unbroken lines) over 6 lineages. Open circles and arrowheads represent potential recombination and coalescence events that would change the terminal branch length for lineage 1 (t,u), and remove cherry c3 (x,y).
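The resampling step of Algorithm 3 can be sketched as follows: particles are resampled with probabilities shaped by the lookahead likelihood, which is then divided back out of the weights to correct for the induced bias. Here lookahead_lik is an assumed callable that evaluates the approximate likelihood of the data digest given a particle's current genealogy.

```python
import numpy as np

def lookahead_resample(particles, weights, lookahead_lik, rng):
    g = np.array([lookahead_lik(x) for x in particles])   # lookahead factor
    probs = weights * g
    probs /= probs.sum()
    idx = rng.choice(len(particles), size=len(particles), p=probs)
    new_w = 1.0 / g[idx]               # divide the lookahead factor back out
    new_w /= new_w.sum()
    return [particles[i] for i in idx], new_w
```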
The choice of waypoints sj can significantly impact the performance of the algorithm: choosing too few increases the variance of the approximation, and choosing too many slows down the algorithm without increasing its accuracy. Waypoints determine where the algorithm may perform a resampling step. A high density of waypoints is therefore always acceptable, but a low density may result in particle degeneracy. Choosing a waypoint at every event ensures that any weight variance induced at these sites is mitigated, but there is still the opportunity for weight variance to build up between events.
If ξx = πx, particle weights diverge only because different particles

To apply this to our situation, assume a panmictic population with constant diploid effective population size Ne. The variance of the total coalescent branch length in a sample of n individuals is 4 Σ_{j=1}^{n-1} 1/j^2, in coalescent units of 2Ne generations.
Because the assumptions mentioned above are in practice only met approximately, this minimum waypoint density should be taken as a guide; breakdown of the assumptions can be monitored by tracking the ESS, increasing the density of waypoints if necessary.
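A simple way to implement this monitoring is sketched below: an extra waypoint is inserted into any interval where the recorded ESS dropped below the resampling threshold. The function and its arguments are illustrative, not part of the SMCSMC interface.

```python
def refine_waypoints(waypoints, ess, n_particles, threshold=0.5):
    # ess[j] is the effective sample size recorded on arrival at
    # waypoints[j]; insert a midpoint wherever degeneracy was observed.
    refined = [waypoints[0]]
    for j in range(1, len(waypoints)):
        if ess[j] < threshold * n_particles:
            refined.append(0.5 * (waypoints[j - 1] + waypoints[j]))
        refined.append(waypoints[j])
    return refined
```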
Parameters can be inferred by stochastic expectation maximization (SEM), which involves maximizing the expected log likelihood over the posterior distribution of the latent variable. The probability density for a Poisson process is
πx(x | θ) ∝ θ^c(x) e^(−q(x)θ), where c(x) is the number of events of the given type in the realisation x and q(x) is the associated opportunity; maximizing the expected log likelihood then yields the update θ̂ = E[c(x)] / E[q(x)], with expectations taken over the posterior distribution of x.
To evaluate the expectations above we do not use the complete set of events in the full realisations x, since resampling causes early parts of x to become degenerate due to “coalescences” of the particle’s history of events along the sequence, which would lead to high variance of the estimates. Using only the most recent events is also problematic as these have not been shaped by many observations and mostly follow the prior πx(x|θ), resulting in highly biased estimates. Smoothing techniques such as two-filter smoothing [55] cannot be used here since finite-time transition probabilities are intractable. For discrete-time models fixed-lag smoothing is often effective [39]. For our model the optimal lag depends on the epoch, as the age of tree nodes strongly influences their correlation distance. For each epoch we determine the correlation distance empirically, and for the lag we use this distance multiplied by a factor α; we obtain good results with α = 1.
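The harvesting rule can be sketched as follows; records and corr_dist are assumed data structures holding the per-epoch event records and the empirically determined correlation distances.

```python
def harvest(records, s_current, corr_dist, alpha=1.0):
    # An event recorded at locus `locus` in epoch `epoch` contributes to the
    # sufficient statistics only once the filter has moved
    # alpha * corr_dist[epoch] past it; younger records stay pending so
    # they can still be reshaped by subsequent observations.
    ready, pending = [], []
    for epoch, locus, stats in records:
        if s_current - locus >= alpha * corr_dist[epoch]:
            ready.append((epoch, stats))
        else:
            pending.append((epoch, locus, stats))
    return ready, pending
```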
Particularly in cases where some event types are rare, Variational Bayes can improve on EM by iteratively estimating posterior distributions rather than point estimates of θ. A tractable algorithm is obtained if the joint posterior π(x, θ|y)dxdθ is approximated as a product of two independent distributions over x and θ, and an appropriate prior over θ is chosen. For the Poisson example above, combining a Γ(θ|α0, β0) prior with the likelihood θc e−qθ results in a Γ(θ|α0 + c, β0 + q) posterior. Similarly, with this choice the Variational Bayes approximation results in an inferred posterior distribution of the form
ϕθ(θ) = Γ(θ | α0 + Eϕx[c(x)], β0 + Eϕx[q(x)]), where the expectations are taken under the variational posterior ϕx over the latent variable.
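In code, the resulting update is a one-line computation; expected_count and expected_opportunity denote the expectations of c and q under the particle approximation of ϕx.

```python
def vb_gamma_update(alpha0, beta0, expected_count, expected_opportunity):
    # Gamma(alpha0, beta0) prior combined with a Poisson-process likelihood
    # theta^c e^{-q theta} gives a Gamma posterior; the VB approximation
    # replaces c and q by their expectations under phi_x.
    alpha = alpha0 + expected_count
    beta = beta0 + expected_opportunity
    return alpha, beta   # posterior mean alpha/beta stays positive even if
                         # no events were sampled, avoiding zero-rate fixed points
```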
Explicitly, for model (1) the parameters




We implemented the model and algorithm above in a Python/C++ program SMCSMC (Sequential Monte Carlo for the Sequentially Markovian Coalescent) and assessed it on simulated and real data.
To investigate the effect of the lookahead particle filter, we simulated four 50 megabase (Mb) diploid genomes under a constant population-size model (Ne = 10,000, μ = 2.5 × 10−8 and ρ = 10−8, both per generation and per site, generation time g = 30 years). We inferred population sizes Ne through evolutionary time, defined as the inverse of twice the instantaneous coalescent rate, as a piecewise constant function across 9 epochs (with boundaries at 400, 800, 1200, 2k, 4k, 8k, 20k, 40k and 60k generations), using the particle filters of Algorithms 2 and 3, as well as a recombination rate, which was taken to be constant through evolutionary time (and along the genome). Although the recombination rate can be inferred, here we focus on the accuracy of the inferred Ne through evolutionary time. Observations are often available as unphased genotypes, and we assessed both algorithms using phased and unphased data, using the same simulations for both. Experiments were run for 15 EM iterations and repeated 15 times (Fig 2a).


Accuracy of population size inferences in simulated data.
Shown are true population sizes (black) and median inferred population sizes across 15 independent runs (blue); shaded areas denote quartiles and full extent. a. Impact of lookahead, phasing and number of particles on the bias in population size estimates for recent epochs, for data simulated under a constant population size model. b. Inference in the “zigzag” model on phased data using lookahead and 30,000 particles, comparing inference using stochastic Expectation Maximization (SEM) and Variational Bayes (VB).
On phased data (Fig 2a, top rows), Ne values inferred without lookahead show a strong positive bias in recent epochs, corresponding to a negative bias in the inferred coalescence rate. Increasing the number of particles reduces this bias somewhat. By contrast, the lookahead filter shows no discernible bias on these data, even with as few as 1,000 particles. On unphased data (Fig 2a, bottom rows), the default particle filter continues to work reasonably well; in fact the bias appears somewhat reduced compared to phased data analyses, presumably because integrating over the phase makes the likelihood surface smoother, reducing particle degeneracy. By contrast, the lookahead particle filter shows an increased bias on these data compared to the default implementation. This is presumably due to the reliance of the lookahead likelihood on the distance to the next singleton; this statistic is much less informative for unphased data, making the lookahead procedure less effective, and even counterproductive for early epochs.
We next investigated the impact of using Variational Bayes instead of stochastic EM, using the lookahead filter on phased data. We simulated four 2 gigabase (Gb) diploid genomes using human-like evolutionary parameters (μ = 1.25 × 10−8, ρ = 3.5 × 10−8, g = 29, Ne(0) = 14312) under a “zigzag” model similar to that used in [16] and [18], and inferred Ne across 37 approximately exponentially spaced epochs; see Appendix (“Implementation Details”). Both approaches give accurate Ne inferences from 2,000 years up to 1 million years ago (Mya); other experiments indicate that population sizes can be inferred up to 10 Mya (but see Fig 3b). The upwards bias in the most recent epochs is reduced considerably by the Variational Bayes approach compared to SEM (Fig 2b), although some bias remains.


Population size inferences by SMCSMC on four diploid samples.
Left, three human populations (CEU, CHB, YRI), together with inferences from msmc using 1, 2 and 4 diploid samples. Right, simulated populations resembling CEU and YRI population histories. All inferences (SMCSMC, msmc) were run for 20 iterations.
We applied SMCSMC to three sets of four phased diploid samples, of Northern European (CEU), Han Chinese (CHB) and Yoruban (YRI) ancestry respectively, from the 1000 Genomes project. For comparison we also ran msmc [16] on the same data, and on subsets of 2 and 1 diploid samples. Inferences show good agreement where msmc has power (Fig 3). Since the inferences show some variation, particularly in more recent epochs, we simulated data under a demographic model closely resembling CEU and YRI ancestry as inferred by multiple methods (see Appendix, “Implementation Details”), and we inferred population sizes using SMCSMC and msmc as before. This confirmed the accuracy of SMCSMC inferences from about 5,000 to 5 million years ago, while inferences in more recent epochs show more variability. A representative comparison of run times is provided in Table 1.

| Algorithm | Run time, 2 haploids | Run time, 4 haploids |
|---|---|---|
| msmc | 5.2±0.5 | 107.3±18.7 |
| SMCSMC 5,000 particles | 109.2±5.7 | 277±15 |
| SMCSMC 10,000 particles | 219±11 | 673±32 |
Motivated by the problem of recovering a population’s demographic history from the genomes of a sample of its individuals [1], we have introduced a continuous-locus approximation of the CwR model, and developed a particle filter algorithm for continuous-time Markov jump processes with a continuous phase space, by evaluating the doubly-continuous process at a suitably chosen set of “waypoints”, and applying a standard particle filter to the resulting discrete-time continuous-state process. However, it proved very challenging to obtain reliable parameter inferences for our intended application using this approach. To overcome this challenge we have extended the standard particle filter algorithm in two ways. First, we have generalized the Auxiliary Particle Filter of Pitt and Shephard [40] from a discrete-time one-step-lookahead algorithm to a continuous-time unbounded-lookahead method. This helped to address a challenging feature of the CwR model, namely that recent demographic events induce “sticky” model states with very long forgetting times. With an appropriate lookahead likelihood function (and phased genotype data), we showed that the unbounded-lookahead algorithm mitigates the bias that is otherwise observed in the inferred parameters associated with these recent demographic events. Some bias however remained, particularly for very early epochs. We reduced this remaining bias by a Variational Bayes alternative to stochastic expectation maximization (SEM), which explicitly models part of the uncertainty in the inferred parameters, and avoids zero-rate estimates, which are fixed points of the SEM procedure. The combination of a continuous-time particle filter, the unbounded-lookahead method, and VB inference, allowed us to infer demographic parameters from up to four diploid genomes across many epochs, without making model approximations beyond passing to the continuous-locus limit.
On three sets of four diploid genomes, from individuals of central European, Han Chinese and Yoruban (Nigeria) ancestry respectively, we obtain inferences of effective population size over epochs ranging from 5,000 years to 5 million years ago. These inferences agree well with those made with other methods [14–18], and show higher precision across a wider range of epochs than was previously achievable by a single method. Despite the improvements from the unbounded-lookahead particle filter and the Variational Bayes inference procedure, the proposed method still struggles in very recent epochs (more recent than a few thousand years ago), and haplotype-based methods [e.g., 12] remain more suitable in this regime. In addition, methods focusing on recent demography benefit from the larger number of recent evolutionary events present in larger samples of individuals, and the proposed model will not scale well to such data, unless model approximations such as those proposed in [18] are used.
A key advantage of particle filters is that they are fundamentally simulation-based. This allowed us to perform inference under the full CwR model without having to resort to model approximations, such as requiring coalescences to occur at certain evolutionary times only, that characterize most other approaches. The same approach will make it possible to analyze complex demographic models, as long as forward simulation (along the sequential variable) is tractable. The proposed particle filter is based on the sequential coalescent simulator SCRM [45], which already implements complex models of demography that include migration, population splits and mergers, and admixture events. Although not the focus of this paper, it should therefore be straightforward to infer the associated model parameters, including directional migration rates. In addition, several aspects of the standard CwR model are known to be unrealistic. For instance, gene conversions and doublet mutations are common [56, 57], and background selection profoundly shapes the heterozygosity in the human genome [58]. These features are absent from current models aimed at inferring demography, but impact patterns of heterozygosity and may well bias inferences of demography if not included in the model. As long as it is possible to include such features in a coalescent simulator, a particle filter can model such effects, reducing the biases otherwise expected in other parameters due to model misspecification. Because a particle filter produces an estimate of the likelihood, any improved model fit resulting from adding any of these features can in principle be quantified, if these likelihoods can be estimated with sufficiently small variance. However, even improved models will capture only a fraction of relevant features of a population’s evolution, and the inferred effective population sizes will continue to have a complex relationship with census population size due to population substructure, variation in family size, and many other aspects [59].
A further advantage of a particle filter is that it provides a sample from the posterior distribution of ancestral recombination graphs (ARGs). Such explicit samples simplify the estimation of the age of mutations and recombinations, and explicit identification of sequence tracts with particular evolutionary histories, for instance tracts arising from admixture by a secondary population. In contrast to MCMC-based approaches [13], a particle filter can provide only one self-consistent sample of an ARG per run. However, for marginal statistics such as the expected age of a mutation or the expected number of recombinations in a sequence segment, a particle filter can provide weighted samples from the posterior in a single run.
The algorithm presented here scales in practice to about 4 diploid genomes, but requires increasingly large numbers of particles as larger numbers of genomes are analyzed jointly. This is because the space of possible tree topologies increases exponentially with the number of genomes observed, while the number of informative mutations grows much more slowly, resulting in increasing uncertainty in the topology given observed mutations. This uncertainty is further compounded by uncertainty in branch lengths. Nevertheless, the many effectively independent genealogies underlying even a single genome provide considerable information about past demographic events [14], and a joint analysis of even modest numbers of genomes under demographic models involving migration and admixture events enables more complex demographic scenarios to be investigated. Our results show that particle filters are a viable approach to demographic inference from whole-genome sequences, and the ability to handle complex models without having to resort to approximations opens possibilities for further model improvements, hopefully leading to more insight into our species’ recent demographic history.
Here we outline how to define a conditional distribution

The algorithm is proved by induction on j. For j = 0 the loop invariant holds, while for j = K it implies the output condition. Suppose the loop invariant is true for some j. If ESS < N/2, assume w.l.o.g. that


After sampling

To derive a criterion on the waypoints that limits the effect of weight variance build-up, let R(s) = f(Xs) be the stochastic variable that measures the instantaneous rate of occurrence of emission events for a particular (random) particle X, and let





In practice particles will not be drawn from the equilibrium distribution πx(X), but from the joint distribution on X and Y conditioned on observations y. However, for most likelihoods conditioning will reduce the variance of R as observations tend to constrain the distribution of likely particles, making this a conservative assumption. The other assumption that is likely not met is that R(t) is a Gaussian process; it is less clear whether making this approximation will in practice be conservative.
In formula (1), if s is a recombination point, xs is the genealogy just left of the recombination point and includes the infinite branch from the root, so that bu(xs) = 1 for u above the root.
The measure (1) describes the CwR process exactly as long as x encodes both the local genealogy and the non-local branches used by the SCRM algorithm. In practice the SCRM algorithm prunes some of these branches, and we use (1) on the pruned x.
Note that we take the view that the realisation x encodes not only the sequence of genealogies xs but also the number of recombinations |x| (some of which may not change the tree), their loci
We consider hidden Markov models where the latent variable follows a Markov jump process over


A Variational Bayes approach approximates the true joint posterior density π(x, θ|y) ∝ πxy(x, y|θ)πθ(θ), where πθ is a prior on the parameters, with a probability density ϕ(x, θ) that is easier to work with (here the constant of proportionality implied by “∝” hides a constant density λ(y)). Following Hinton and van Camp [61] and MacKay [62], we choose to constrain ϕ by requiring it to factorize as ϕ(x, θ) = ϕx(x)ϕθ(θ), and we choose to optimize it by minimizing the Kullback-Leibler divergence KL(ϕ||π), also referred to as the variational free energy [63],






Let si denote the distance along the genome to the nearest future singleton in each sequence, and let ck = (ak, bk) be ≤ n/2 mutually consistent cherries with loci
Note that recombinations result in a change of a terminal branch length (TBL) if either the recombination occurred in the branch itself and the new lineage does not coalesce back into it, or the recombination occurred outside the branch and the new lineage coalesces into it (Fig 1b). To compute the likelihood that the first singleton in lineage i occurs at locus si, we assume that all TBLs are equal to li, and that coalescences occur before li. Then, the total rate of events that change the TBL of lineage i is



To approximate the likelihood of the doubleton data, note that a node c with precisely two descendants (a, b) (a “cherry”) at height l changes if a recombination occurs in either branch a or b and the new lineage coalesces out, or a recombination occurs outside of a and b and coalesces into either (Fig 1b). Again assuming that all TBLs are l and coalescences occur before l, the total rate of change is


These likelihoods show good performance, but result in some negative bias in inferred population size for recent epochs. We traced this to the lack of correlation between li and
To deal with missing data, we reduce μ proportionally to the missing segment length and the number of lineages missing. For unphased mutation data, singletons and doubletons can still be extracted, and are greedily assigned to compatible lineages. The likelihoods are calculated similarly, by greedily assigning cherries to observed doubletons. Unphased singletons can result from mutations on either of the individual’s alleles; the likelihood term uses the sum of the two branch lengths for that individual to calculate the expected rate of unphased singletons.
While x1:s refers to the entire sequence of genealogies along the sequence segment 1: s, storing this sequence would require too much memory. Instead we only store the most recent genealogy xs (including non-local branches where appropriate), which is sufficient to simulate subsequent genealogies using the SCRM algorithm. To implement epoch-dependent lags when harvesting sufficient statistics, we do store records of the events (recombinations, coalescences and migrations) that changed x along the sequence, as well as the associated opportunities, for each particle and each epoch; this implicitly stores the full ARG. To avoid making copies of potentially many event records when particles are duplicated at resampling, these are stored in a linked list, and are shared by duplicated particles where appropriate, forming a tree structure. Records are removed dynamically after contributing to the summary statistics, and when particles fail to be resampled, ensuring that memory usage is bounded.
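A minimal sketch of this shared event-record structure (field and function names are illustrative):

```python
class EventRecord:
    # One node in the shared, singly-linked event history of a particle.
    # Duplicated particles point at the same tail, so the records form a
    # tree and resampling never copies history.
    __slots__ = ("epoch", "locus", "event", "opportunity", "parent")

    def __init__(self, epoch, locus, event, opportunity, parent=None):
        self.epoch, self.locus = epoch, locus
        self.event, self.opportunity = event, opportunity
        self.parent = parent        # shared tail; None at the root


def record_event(head, epoch, locus, event, opportunity):
    # Appending is O(1) and leaves the shared history untouched; records
    # become unreachable (and reclaimable) once harvested, or once all
    # particles referencing them fail to be resampled.
    return EventRecord(epoch, locus, event, opportunity, parent=head)
```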
The likelihood calculations involve many evaluations of the exponential function, often for small exponents. We use the continued-fraction approximation
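The exact formula used is not reproduced here; as one plausible instance of the approach, the sketch below uses the first-order Padé (continued-fraction) approximant exp(x) ≈ (2 + x)/(2 − x), accurate to O(x³) for small |x|.

```python
def exp_small(x):
    # First-order Pade / continued-fraction approximant to exp(x); the
    # error is O(x**3) near zero, so it is cheap and adequate for the
    # small exponents that dominate the likelihood calculations.
    return (2.0 + x) / (2.0 - x)
```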
Table 2 shows the commands used to generate the data for the three simulation experiments. Epoch boundaries for Ne inference (in generations) for the zigzag experiment were defined by taking interval boundaries −14312 log(1 − i/256)/2, i = 0, …, 255, and merging intervals according to the pattern 4 * 1 + 7 * 2 + 8 * 5 + 7 * 13 + 1 * 15 + 8 * 11 + 1 * 3 (37 epochs; see [14]); a code sketch of this recipe follows Table 2. For the real data experiments, epoch boundaries for the 32 epochs were logarithmically spaced from 133 to 133016 generations ago, using generation time g = 29 years, without merging intervals (command line option -P 133 133016 31*1).

| Experiment | Command |
|---|---|
| zigzag | scrm 8 1 -N0 14312 -t 1431200 -r 400736 2000000000 -eN 0 1 -eG 0.000582262 1318.18 -eG 0.00232905 -329.546 -eG 0.00931619 82.3865 -eG 0.0372648 -20.5966 -eG 0.149059 5.14916 -eN 0.596236 0.1 -seed 1 -T -L -p 10 -l 300000 |
| CEU | scrm 8 1 -N0 14312 -t 1789000 -r 500920 2500000000 -eN 0 10.4807 -eG 0.00120468 214.8965 -eG 0.0180702 -14.15827 -eG 0.180702 1.33255 -eG 1.084212 -0.563414 -eN 2.71053 2.096143 -seed 1 -T -L -p 10 -l 300000 |
| YRI | scrm 8 1 -N0 14312 -t 1789000 -r 500920 2500000000 -eN 0 10.4807 -eG 0.00120468 502.8635 -eG 0.00542106 0 -eG 0.0451755 -5.89189 -eG 0.180702 1.33255 -eG 1.084212 -0.563414 -eN 2.71053 2.096143 -seed 1 -T -L -p 10 -l 300000 |
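A code sketch of the zigzag epoch-boundary recipe described above; the interpretation of the merging pattern (each term reps * size keeps one boundary per size original intervals, reps times) and the treatment of the final, open-ended epoch are assumptions.

```python
import numpy as np

def zigzag_epoch_boundaries(n0=14312, n=256,
                            pattern=((4, 1), (7, 2), (8, 5), (7, 13),
                                     (1, 15), (8, 11), (1, 3))):
    # Raw boundaries -n0*log(1 - i/n)/2 for i = 0..n-1, then merge
    # consecutive intervals according to the pattern.  This yields 37
    # boundaries, i.e. 36 bounded epochs plus a final epoch extending
    # beyond the last boundary: 37 epochs in total.
    bounds = [-n0 * np.log(1 - i / n) / 2 for i in range(n)]
    merged, i = [bounds[0]], 0
    for reps, size in pattern:
        for _ in range(reps):
            i += size
            merged.append(bounds[i])
    return merged
```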
We thank Arnaud Doucet for helpful discussions, and Paul Staab for implementing the SCRM library.