Proceedings of the National Academy of Sciences of the United States of America
Inference of dynamic systems from noisy and sparse data via manifold-constrained Gaussian processes

Edited by Wing Hung Wong, Stanford University, Stanford, CA, and approved March 5, 2021 (received for review September 30, 2020)

Author contributions: S.Y., S.W.K.W., and S.C.K. designed research; S.Y., S.W.K.W., and S.C.K. performed research; S.Y. and S.W.K.W. contributed new reagents/analytic tools; S.Y. and S.W.K.W. analyzed data; and S.Y., S.W.K.W., and S.C.K. wrote the paper.

Significance

Ordinary differential equations are a ubiquitous tool for modeling behaviors in science, such as gene regulation, biological rhythms, epidemics, and ecology. An important problem is to infer, and characterize the uncertainty of, the parameters that govern the equations. We present an accurate and fast inference method using manifold-constrained Gaussian processes, such that the derivatives of the Gaussian process must satisfy the dynamics of the differential equations. Our method completely avoids the use of numerical integration and is thus fast to compute. Our construction is embedded in a principled statistical framework and is demonstrated to yield fast and reliable inference in a variety of practical problems. Our method works even when some system components are unobserved, which is a significant challenge for previous methods.

Abstract

Parameter estimation for nonlinear dynamic system models, represented by ordinary differential equations (ODEs), using noisy and sparse data is a vital task in many fields. We propose a fast and accurate method, manifold-constrained Gaussian process inference (MAGI), for this task. MAGI uses a Gaussian process model over time series data, explicitly conditioned on the manifold constraint that derivatives of the Gaussian process must satisfy the ODE system. By doing so, we completely bypass the need for numerical integration and achieve substantial savings in computational time. MAGI is also suitable for inference with unobserved system components, which often occur in real experiments. MAGI is distinct from existing approaches as we provide a principled statistical construction under a Bayesian framework, which incorporates the ODE system through the manifold constraint. We demonstrate the accuracy and speed of MAGI using realistic examples based on physical experiments.


Dynamic systems, represented as a set of ordinary differential equations (ODEs), are commonly used to model behaviors in scientific domains such as gene regulation (1), biological rhythms (2), the spread of disease (3), and ecology (4). We focus on models specified by a set of ODEs

$$\dot{x}(t) = \frac{dx(t)}{dt} = f(x(t), \theta, t), \quad t \in [0, T],$$ [1]
where the vector x(t) contains the system outputs that evolve over time t and θ is the vector of model parameters to be estimated from experimental/observational data. When f is nonlinear, solving x(t) given initial conditions x(0) and θ generally requires a numerical integration method, such as Runge–Kutta.
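To fix ideas, the following minimal Python sketch shows this conventional numerical approach, which MAGI will avoid; the system (logistic growth), parameter values, and time span here are purely illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(t, x, theta):
    # Toy ODE for illustration: logistic growth x' = r * x * (1 - x / K)
    r, K = theta
    return r * x * (1.0 - x / K)

theta = (0.5, 10.0)                      # assumed parameter values
sol = solve_ivp(f, (0.0, 20.0), [0.1], args=(theta,),
                method="RK45", dense_output=True)
x_hat = sol.sol(np.linspace(0.0, 20.0, 100))   # x(t) given theta and x(0)
```

An inference routine built on this likelihood must repeat such a solve for every candidate θ, which is the computational burden MAGI bypasses.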

Historically, ODEs have mainly been used for conceptual or theoretical understanding rather than data fitting, as experimental data were limited. Advances in experimental and data collection techniques have increased the capacity to follow dynamic systems closer to real time. Such data are generally recorded at discrete times and are subject to measurement error. Thus, we assume that we observe $y(\tau) = x(\tau) + \epsilon(\tau)$ at a set of observation time points τ, with error ε governed by noise level σ. Our focus here is the inference of θ given y(τ), with emphasis on nonlinear f, where specialized methods that exploit a linear structure (e.g., refs. 5 and 6) are not generally applicable. We shall present a coherent, statistically principled framework for dynamic system inference with the help of Gaussian processes (GPs). The key to our method is to restrict the GPs to a manifold that satisfies the ODE system; we thus name our method MAGI (manifold-constrained Gaussian process inference). Placing a GP on x(t) facilitates inference of θ without numerical integration, and our explicit manifold constraint is the key idea that addresses the conceptual incompatibility between the GP and the specification of the ODE model, as we shall discuss shortly when overviewing our method. We show that the resulting parameter inference is computationally efficient, statistically principled, and effective in a variety of practical scenarios. In particular, MAGI works even when some system components are unobserved. To the best of our knowledge, among the currently available software packages that do not use numerical integration, none can analyze systems with unobserved components.

Overview of Our Method

Following the Bayesian paradigm, we view the $D$-dimensional system $x(t)$ as a realization of the stochastic process $X(t) = (X_1(t), \ldots, X_D(t))$ and the model parameters θ as a realization of the random variable Θ. In Bayesian statistics, the basis of inference is the posterior distribution, obtained by combining the likelihood function with a chosen prior distribution on the unknown parameters and stochastic processes. Specifically, we impose a general prior distribution $\pi(\cdot)$ on θ and independent GP prior distributions on each component $X_d(t)$, so that $X_d(t) \sim \mathcal{GP}(\mu_d, \mathcal{K}_d)$, $t \in [0, T]$, where $\mathcal{K}_d : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a positive definite covariance kernel for the GP and $\mu_d : \mathbb{R} \to \mathbb{R}$ is the mean function. Then, for any finite set of time points $\tau_d$, $X_d(\tau_d)$ has a multivariate Gaussian distribution with mean vector $\mu_d(\tau_d)$ and covariance matrix $\mathcal{K}_d(\tau_d, \tau_d)$. Denote the observations by $y(\tau) = (y_1(\tau_1), \ldots, y_D(\tau_D))$, where $\tau = (\tau_1, \tau_2, \ldots, \tau_D)$ is the collection of all observation time points, and each component $X_d$ can have its own set of observation times $\tau_d = (\tau_{d,1}, \ldots, \tau_{d,N_d})$. If the $d$th component is not observed, then $N_d = 0$ and $\tau_d = \emptyset$; $N = N_1 + \cdots + N_D$ is the total number of observations. We note that, for the remainder of the paper, the notation $t$ refers to time generically, while $\tau$ refers specifically to the observation time points.

As an illustrative example, consider the dynamic system in ref. 1 that governs the oscillation of Hes1 mRNA (messenger ribonucleic acid) (M) and Hes1 protein (P) levels in cultured cells, where it is postulated that an Hes1-interacting (H) factor contributes to a stable oscillation, a manifestation of biological rhythm (2). The ODEs of the three-component system X=(P,M,H) are

$$f(X, \theta, t) = \begin{pmatrix} -aPH + bM - cP \\[4pt] -dM + \dfrac{e}{1 + P^2} \\[4pt] -aPH + \dfrac{f}{1 + P^2} - gH \end{pmatrix},$$
where θ=(a,b,c,d,e,f,g) are the associated parameters. In Fig. 1, left-most panel, we show noise-contaminated data generated from the system, which closely mimics the experimental setup described in ref. 1: P and M are observed at 15-min intervals for 4 h, but H is never observed. In addition, P and M observations are asynchronous: Starting at time 0, every 15 min we observe P; starting at 7.5 min, every 15 min we observe M; P and M are never observed at the same time. It can be seen that the mRNA and protein levels exhibit the behavior of regulation via negative feedback.
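To make the system concrete, the right-hand side above can be coded as follows; this is a small illustrative sketch (the (t, x) calling convention is an assumption, chosen to match SciPy-style solvers), not a component of MAGI itself.

```python
def hes1_rhs(t, x, theta):
    # x = (P, M, H); theta = (a, b, c, d, e, f, g), matching the ODEs above
    P, M, H = x
    a, b, c, d, e, f, g = theta
    dP = -a * P * H + b * M - c * P
    dM = -d * M + e / (1.0 + P**2)
    dH = -a * P * H + f / (1.0 + P**2) - g * H
    return [dP, dM, dH]
```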

Fig. 1. Inference by MAGI for the Hes1 partially observed asynchronous system on 2,000 simulated datasets. The red curve is the truth. MAGI recovers the system well without the use of any numerical solver: the green curve shows the median of the inferred trajectories among the 2,000 simulated datasets, and the 95% interval between the 2.5 and 97.5 percentiles of all inferred trajectories is shown via the blue area.

The goal here is to infer the seven parameters of the system: a,b govern the rate of protein synthesis in the presence of the interacting factor; c,d,g are the rates of decomposition; and e,f are inhibition rates. The unobserved H component poses a challenge for most existing methods that do not use numerical integration but is capably handled by MAGI: The P and M panels of Fig. 1 show that our inferred trajectories provide good fits to the observed data, and the H panel shows that the dynamics of the entirely unobserved H component are largely recovered as well. We emphasize that these trajectories are inferred without any use of numerical solvers. We shall return to the Hes1 example in detail in Results.

Intuitively, the GP prior on X(t) facilitates computation: the GP provides closed analytical forms for $\dot{X}(t)$ and $X(t)$, which can bypass the need for numerical integration. In particular, with a GP prior on X(t), the conditional distribution of $\dot{X}(t)$ given $X(t)$ is also a GP, with its mean function and covariance kernel completely specified. This GP specification for the derivatives $\dot{x}(t)$, however, is inherently incompatible with the ODE model, because Eq. 1 also completely specifies $\dot{x}(t)$ given $x(t)$ (via the function f). As a key contribution of our method, MAGI addresses this conceptual incompatibility by constraining the GP to satisfy the ODE model in Eq. 1. To do so, we first define a random variable W that quantifies the difference between the stochastic process X(t) and the ODE structure for a given value of the parameter θ:

$$W = \sup_{t \in [0,T],\; d \in \{1,\ldots,D\}} \left| \dot{X}_d(t) - f(X(t), \theta, t)_d \right|.$$ [2]
$W \equiv 0$ if and only if the ODEs with parameter θ are satisfied by X(t). Therefore, ideally, the posterior distribution for X(t) and θ, given the observations y(τ) and the ODE constraint $W = 0$, is (informally)

$$p_{\Theta, X(t) \mid W, Y(\tau)}\big(\theta, x(t) \mid W = 0, Y(\tau) = y(\tau)\big).$$ [3]
While Eq. 3 is the ideal posterior, in reality W is not generally computable. In practice, we approximate W by finite discretization on the set $I = (t_1, t_2, \ldots, t_n)$, such that $\tau \subset I \subset [0, T]$, and similarly define $W_I$ as

$$W_I = \max_{t \in I,\; d \in \{1,\ldots,D\}} \left| \dot{X}_d(t) - f(X(t), \theta, t)_d \right|.$$ [4]

Note that $W_I$ is the maximum over a finite set, and $W_I \to W$ monotonically as I becomes dense in [0, T]. Therefore, the practically computable posterior distribution is

$$p_{\Theta, X(I) \mid W_I, Y(\tau)}\big(\theta, x(I) \mid W_I = 0, Y(\tau) = y(\tau)\big),$$

which is the joint conditional distribution of θ and X(I) together. Thus, effectively, we simultaneously infer both the parameters and the unobserved trajectory X(I) from the noisy observations y(τ).
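As a small illustration of the discretized constraint, the quantity $W_I$ in Eq. 4 can be evaluated as follows, assuming candidate trajectory values and derivative values on the grid are available as arrays (names are ours, for illustration only).

```python
import numpy as np

def w_I(x_I, xdot_I, t_I, theta, f):
    # Eq. 4: largest absolute discrepancy between the candidate derivatives
    # xdot_I and the ODE right-hand side f, over all grid points in t_I and
    # all components d. x_I and xdot_I have shape (len(t_I), D).
    resid = [np.asarray(f(t, x, theta)) - xd
             for t, x, xd in zip(t_I, x_I, xdot_I)]
    return np.abs(np.array(resid)).max()
```

In MAGI itself, the derivative values are not free quantities; their conditional distribution given $X(I)$ is supplied by the GP, which is how the constraint $W_I = 0$ enters the posterior below.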

Under Bayes’ rule, we have

$$p_{\Theta, X(I) \mid W_I, Y(\tau)}\big(\theta, x(I) \mid W_I = 0, Y(\tau) = y(\tau)\big) \propto P\big(\Theta = \theta, X(I) = x(I), W_I = 0, Y(\tau) = y(\tau)\big),$$

where the right-hand side can be decomposed as

$$P\big(\Theta = \theta, X(I) = x(I), W_I = 0, Y(\tau) = y(\tau)\big) = \pi_\Theta(\theta) \times \underbrace{P\big(X(I) = x(I) \mid \Theta = \theta\big)}_{(1)} \times \underbrace{P\big(Y(\tau) = y(\tau) \mid X(I) = x(I), \Theta = \theta\big)}_{(2)} \times \underbrace{P\big(W_I = 0 \mid Y(\tau) = y(\tau), X(I) = x(I), \Theta = \theta\big)}_{(3)}.$$

The first term (1) can be simplified as $P(X(I) = x(I) \mid \Theta = \theta) = P(X(I) = x(I))$, due to the prior independence of X(I) and Θ; it corresponds to the GP prior on X. The second term (2) corresponds to the noisy observations. The third term (3) can be simplified as

$$\begin{aligned} P\big(W_I = 0 \mid Y(\tau) = y(\tau), X(I) = x(I), \Theta = \theta\big) &= P\big(\dot{X}(I) - f(x(I), \theta, t_I) = 0 \mid Y(\tau) = y(\tau), X(I) = x(I), \Theta = \theta\big) \\ &= P\big(\dot{X}(I) - f(x(I), \theta, t_I) = 0 \mid X(I) = x(I)\big) \\ &= P\big(\dot{X}(I) = f(x(I), \theta, t_I) \mid X(I) = x(I)\big), \end{aligned}$$

which is the conditional density of $\dot{X}(I)$ given $X(I)$, evaluated at $f(x(I), \theta, t_I)$. All three terms are multivariate Gaussian: the third term is Gaussian because $\dot{X}(I)$ given $X(I)$ has a multivariate Gaussian distribution, as long as the kernel $\mathcal{K}$ is twice differentiable.

Therefore, the practically computable posterior distribution simplifies to

$$p_{\Theta, X(I) \mid W_I, Y(\tau)}\big(\theta, x(I) \mid W_I = 0, Y(\tau) = y(\tau)\big) \propto \pi_\Theta(\theta) \exp\Bigg\{ -\frac{1}{2} \sum_{d=1}^{D} \Big[ \underbrace{|I| \log(2\pi) + \log|C_d| + \|x_d(I) - \mu_d(I)\|^2_{C_d^{-1}}}_{(1)} + \underbrace{|I| \log(2\pi) + \log|K_d| + \big\| f^{x,\theta}_{d,I} - \dot{\mu}_d(I) - m_d\{x_d(I) - \mu_d(I)\} \big\|^2_{K_d^{-1}}}_{(3)} + \underbrace{N_d \log(2\pi\sigma_d^2) + \|x_d(\tau_d) - y_d(\tau_d)\|^2_{\sigma_d^{-2}}}_{(2)} \Big] \Bigg\},$$ [5]

where $\|v\|^2_A = v^\top A v$, $|I|$ is the cardinality of I, and $f^{x,\theta}_{d,I}$ is short for the $d$th component of $f(x(I), \theta, t_I)$; the multivariate Gaussian covariance matrix $C_d$, the matrix $m_d$, and the matrix $K_d$ can be derived as follows for each component d (writing $\mathcal{K} = \mathcal{K}_d$ for brevity):
$$C_d = \mathcal{K}(I, I), \qquad m_d = {}'\mathcal{K}(I, I)\, \mathcal{K}(I, I)^{-1}, \qquad K_d = \mathcal{K}''(I, I) - {}'\mathcal{K}(I, I)\, \mathcal{K}(I, I)^{-1}\, \mathcal{K}'(I, I),$$ [6]

where ${}'\mathcal{K}(s, t) = \frac{\partial}{\partial s}\mathcal{K}(s, t)$, $\mathcal{K}'(s, t) = \frac{\partial}{\partial t}\mathcal{K}(s, t)$, and $\mathcal{K}''(s, t) = \frac{\partial^2}{\partial s\, \partial t}\mathcal{K}(s, t)$.

In practice, we choose the Matérn kernel $\mathcal{K}(s, t) = \phi_1 \frac{2^{1-\nu}}{\Gamma(\nu)} \big( \sqrt{2\nu}\, \frac{l}{\phi_2} \big)^{\nu} B_\nu\big( \sqrt{2\nu}\, \frac{l}{\phi_2} \big)$, where $l = |s - t|$, Γ is the Gamma function, $B_\nu$ is the modified Bessel function of the second kind, and the degree of freedom ν is set to 2.01 to ensure that the kernel is twice differentiable. $\mathcal{K}$ has two hyperparameters, $\phi_1$ and $\phi_2$; their meaning and specification are discussed in Materials and Methods.
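The sketch below assembles the Matérn kernel with ν = 2.01 and the matrices of Eq. 6 for one component. As a simplifying assumption, the kernel derivatives ′K, K′, and K″ are approximated by central finite differences; the actual MAGI implementation uses closed-form Matérn derivatives.

```python
import numpy as np
from scipy.special import gamma, kv

def matern(s, t, phi1, phi2, nu=2.01):
    # Matern kernel; by continuity, K(t, t) = phi1.
    l = np.abs(s - t)
    z = np.sqrt(2 * nu) * l / phi2
    with np.errstate(invalid="ignore"):
        out = phi1 * (2 ** (1 - nu) / gamma(nu)) * z**nu * kv(nu, z)
    return np.where(l > 0, out, phi1)

def gp_derivative_moments(I, phi1, phi2, h=1e-4):
    # C, m, K of Eq. 6, with 'K, K', K'' replaced by finite differences.
    s, t = I[:, None], I[None, :]
    K00 = matern(s, t, phi1, phi2)
    Ks = (matern(s + h, t, phi1, phi2) - matern(s - h, t, phi1, phi2)) / (2 * h)
    Kt = (matern(s, t + h, phi1, phi2) - matern(s, t - h, phi1, phi2)) / (2 * h)
    Kst = (matern(s + h, t + h, phi1, phi2) - matern(s + h, t - h, phi1, phi2)
           - matern(s - h, t + h, phi1, phi2)
           + matern(s - h, t - h, phi1, phi2)) / (4 * h**2)
    Kinv = np.linalg.inv(K00)
    C = K00                    # prior covariance of x(I)
    m = Ks @ Kinv              # 'K(I,I) K(I,I)^{-1}
    K = Kst - Ks @ Kinv @ Kt   # conditional covariance of the derivatives
    return C, m, K
```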

With the posterior distribution specified in Eq. 5, we use Hamiltonian Monte Carlo (HMC) (7) to obtain samples of X(I) and the parameters together. At the completion of HMC sampling, we take the posterior mean of X(I) as the inferred trajectory and the posterior means of the sampled parameters as the parameter estimates. Throughout the MAGI computation, no numerical integration is ever needed.
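For readers unfamiliar with HMC, the following generic sketch shows one update of the kind described in ref. 7; the state vector q stacks x(I) and the parameters, and the fixed step size eps stands in for the automatic step-size tuning used in the actual MAGI implementation.

```python
import numpy as np

def hmc_step(q, log_post, grad_log_post, eps, n_leapfrog, rng):
    # One HMC update: refresh momentum, run leapfrog integration of the
    # Hamiltonian dynamics, then apply a Metropolis accept/reject step.
    p = rng.standard_normal(q.size)
    q_new = q.copy()
    p_new = p + 0.5 * eps * grad_log_post(q_new)      # initial half step
    for i in range(n_leapfrog):
        q_new = q_new + eps * p_new
        if i < n_leapfrog - 1:
            p_new = p_new + eps * grad_log_post(q_new)
    p_new = p_new + 0.5 * eps * grad_log_post(q_new)  # final half step
    h_old = -log_post(q) + 0.5 * p @ p                # Hamiltonian before
    h_new = -log_post(q_new) + 0.5 * p_new @ p_new    # and after
    if rng.random() < np.exp(min(0.0, h_old - h_new)):
        return q_new, True
    return q, False
```

Here log_post would be the logarithm of the right-hand side of Eq. 5, whose gradient with respect to (x(I), θ) is analytically tractable because every term is Gaussian (given a differentiable ODE function f).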

Review of Related Work

The problem of dynamic system inference has been studied in the literature, which we now briefly review. We first note that a simple approach to constructing the “ideal” likelihood function is according to $p(y(t) \mid \hat{x}(t, \theta, x(0)), \sigma)$, where $\hat{x}(t, \theta, x(0))$ is the numerical solution of the ODE obtained by numerical integration given θ and the initial conditions. This approach suffers from a high computational burden: numerical integration is required for every θ sampled in an optimization or Markov chain Monte Carlo (MCMC) routine (8). Smoothing methods have been useful for eliminating the dependence on numerical ODE solutions; an innovative penalized likelihood approach (9) uses a B-spline basis to construct estimated functions that simultaneously satisfy the ODE system and fit the observed data. While in principle the method in ref. 9 can handle an unobserved system component, substantive manual input is required, as we show in Results, which contrasts with the ready-made solution that MAGI provides.

As an alternative to the penalized likelihood approach, GPs are a natural candidate for fulfilling the smoothing role in a Bayesian paradigm due to their flexibility and analytic tractability (10). The use of GPs to approximate the dynamic system and facilitate computation has been previously studied by a number of authors (8, 11–15). The basic idea is to specify a joint GP over $y, x, \dot{x}$ with hyperparameters ϕ and then provide a factorization of the joint density $p(y, x, \dot{x}, \theta, \phi, \sigma)$ that is suitable for inference. The main challenge is to find a coherent way to combine information from two distinct sources: the approximation to the system by the GP governed by hyperparameters ϕ and the actual dynamic system equations governed by parameters θ. In refs. 8 and 11, the proposed factorization is $p(y, x, \dot{x}, \theta, \phi, \sigma) = p(y \mid x, \sigma)\, p(\dot{x} \mid x, \theta, \phi)\, p(x \mid \phi)\, p(\phi)\, p(\theta)$, where $p(y \mid x, \sigma)$ comes from the observation model and $p(x \mid \phi)$ comes from the GP prior, as in our approach. However, there are significant conceptual difficulties in specifying $p(\dot{x} \mid x, \theta, \phi)$: on one hand, the distribution of $\dot{x}$ is completely determined by the GP given x; on the other hand, $\dot{x}$ is completely specified by the ODE system $\dot{x} = f(x, \theta, t)$; these two are incompatible. Previous authors have attempted to circumvent this incompatibility of the GP and ODE system: refs. 8 and 11 use a product of experts heuristic by letting $p(\dot{x} \mid x, \theta, \phi) \propto p(\dot{x} \mid x, \phi)\, p(\dot{x} \mid x, \theta)$, where the two distributions in the product come from the GP and a noisy version of the ODE, respectively. In ref. 15, the authors arrive at the same posterior as refs. 8 and 11 by assuming an alternative graphical model that bypasses the product of experts heuristic; nonetheless, the method requires working with an artificial noisy version of the ODE. In ref. 12, the authors start with a different factorization: $p(y, x, \dot{x}, \theta, \phi, \sigma) = p(y \mid \dot{x}, \phi, \sigma)\, p(\dot{x} \mid x, \theta)\, p(x \mid \phi)\, p(\phi)\, p(\theta)$, where $p(y \mid \dot{x}, \phi, \sigma)$ and $p(x \mid \phi)$ are given by the GP and $p(\dot{x} \mid x, \theta)$ is a Dirac delta distribution given by the ODE. However, this factorization is incompatible with the observation model $p(y \mid x, \sigma)$, as discussed in detail in ref. 16. There is other related work that uses GPs in an ad hoc, partial fashion to aid inference. In ref. 13, GP regression is used to obtain the means of $x$ and $\dot{x}$ for embedding within an approximate Bayesian computation estimation procedure. In ref. 14, GP smoothing is used during an initial burn-in phase as a proxy for the likelihood, before switching to the ideal likelihood to obtain final MCMC samples. While empirical results from the aforementioned studies are promising, a principled statistical framework for inference that addresses the previously noted conceptual incompatibility between the GP and ODE specifications is lacking. Our work presents one such principled statistical framework through the explicit manifold constraint. MAGI is therefore distinct from recent GP-based approaches (11, 15) or any other Bayesian analogs of ref. 9.

In addition to the conceptual incompatibility, none of the existing methods that do not use numerical integration offer a practical solution for a system with unobserved component(s), which highlights another unique and important contribution of our approach.

Results

We apply MAGI to three systems. We begin with an illustration that demonstrates the effectiveness of MAGI in practical problems with unobserved system component(s). Then, we make comparisons with other current methods on two benchmark systems, which show that our proposed method provides more accurate inference while having much faster run time.

Illustration: Hes1 Model.

The Hes1 model described in the Introduction demonstrates inference on a system with an unobserved component and asynchronous observation times. This section continues the inference for this model. Ref. 1 studied the theoretical oscillation behavior using parameter values a = 0.022, b = 0.3, c = 0.031, d = 0.028, e = 0.5, f = 20, g = 0.3, which leads to approximately one oscillation cycle every 2 h. Ref. 1 also set the initial condition at the lowest value of P when the system is in oscillation equilibrium: P = 1.439, M = 2.037, H = 17.904. The noise level in our simulation is derived from ref. 1, where the SE based on repeated measures is reported to be around 15% of the P (protein) and M (mRNA) levels; we therefore set the simulation noise to be multiplicative, following a log-normal distribution with SD 0.15, and throughout this example we assume the noise level σ is known to be 0.15 from the repeated measures reported in ref. 1. The H component is never observed. Owing to the multiplicative error on this strictly positive system, we apply our method to the log-transformed ODEs, so that the resulting error distributions are Gaussian. To the best of our knowledge, MAGI is the only method that provides a practical and complete solution for handling unobserved components in cases like this example.
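A sketch of generating one simulated dataset under this setup, reusing hes1_rhs from the earlier sketch; reading the 15% noise as Gaussian with SD 0.15 on the log scale is our interpretation of the description above.

```python
import numpy as np
from scipy.integrate import solve_ivp

theta_true = (0.022, 0.3, 0.031, 0.028, 0.5, 20.0, 0.3)
x0 = [1.439, 2.037, 17.904]          # P, M, H at time 0 (time in minutes)
t_P = np.arange(0.0, 240.1, 15.0)    # P observed every 15 min for 4 h
t_M = np.arange(7.5, 240.0, 15.0)    # M offset by 7.5 min; never same time as P

rng = np.random.default_rng(0)
sol = solve_ivp(hes1_rhs, (0.0, 240.0), x0, args=(theta_true,),
                dense_output=True, rtol=1e-8)
# Multiplicative log-normal noise; the H component is never observed.
y_P = sol.sol(t_P)[0] * np.exp(0.15 * rng.standard_normal(t_P.size))
y_M = sol.sol(t_M)[1] * np.exp(0.15 * rng.standard_normal(t_M.size))
```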

We generate 2,000 simulated datasets based on the above setup for the Hes1 system. The left-most panel in Fig. 1 shows one example dataset. For each dataset, we use MAGI to infer the trajectories and estimate the parameters. We use the posterior mean of $X(t) = (P, M, H)(t)$ as the inferred trajectories for the system components, which are generated by MAGI without using any numerical solver. Fig. 1 summarizes the inferred trajectories across the 2,000 simulated datasets, showing the median of the inferred trajectories of $X(t)$ together with the 95% interval of the inferred trajectories, represented by the 2.5 and 97.5 percentiles. The posterior mean of $\theta = (a, b, c, d, e, f, g)$ is our estimate of the parameters. Table 1 summarizes the parameter estimates across the 2,000 simulated datasets by showing their means and SDs. Fig. 1 shows that MAGI recovers the system well, including the completely unobserved H component. Table 1 shows that MAGI also recovers the system parameters well, except for the parameters that only appear in the equation for the unobserved H component, which we discuss shortly. Together, Fig. 1 and Table 1 demonstrate that MAGI can recover the entire system without any usage of a numerical solver, even in the presence of unobserved components.

Table 1.
Parameter inference in the Hes1 partially observed asynchronous system based on 2,000 simulation datasets
| θ | Truth | MAGI estimate | MAGI RMSE | Ref. 9 estimate | Ref. 9 RMSE |
| --- | --- | --- | --- | --- | --- |
| a | 0.022 | 0.021 ± 0.003 | **0.003** | 0.027 ± 0.026 | 0.026 |
| b | 0.3 | 0.329 ± 0.051 | **0.059** | 0.302 ± 0.086 | 0.086 |
| c | 0.031 | 0.035 ± 0.006 | **0.007** | 0.031 ± 0.010 | 0.010 |
| d | 0.028 | 0.029 ± 0.002 | **0.003** | 0.028 ± 0.003 | **0.003** |
| e | 0.5 | 0.552 ± 0.074 | 0.090 | 0.498 ± 0.088 | **0.088** |
| f | 20 | 13.759 ± 3.026 | **6.936** | 604.9 ± 5,084.8 | 5,117.0 |
| g | 0.3 | 0.141 ± 0.026 | **0.162** | 1.442 ± 9.452 | 9.519 |

Average parameter estimates based on MAGI and ref. 9 across the 2,000 simulated datasets are reported together with the SD; parameter RMSEs are reported in the adjacent columns. Bold highlights the best method in terms of parameter RMSE for each parameter.

Metrics for assessing the quality of system recovery.

To further assess the quality of the parameter estimates and the system recovery, we consider two metrics. First, as shown in Table 1, we examine the accuracy of the parameter estimates by directly calculating the root mean squared error (RMSE) of the parameter estimates relative to the true parameter values. We call this measure the parameter RMSE metric. Second, it is possible that a system might be insensitive to some of the parameters; in the extreme case, some parameters may not be fully identifiable given only the observed data and components. In these situations, the system trajectories implied by quite distinct parameter values can be similar to each other (or even close to the true trajectory). We thus consider an additional trajectory RMSE metric to account for possible parameter insensitivity and to measure how well the system components are recovered given the parameter and initial condition estimates. The trajectory RMSE is obtained by treating the numerical ODE solution based on the true parameter values as the ground truth: first, a numerical solver is used to reconstruct the trajectory based on the estimates of the parameters and initial conditions (from a given method); then, we calculate the RMSE of this reconstructed trajectory relative to the true trajectory at the observation time points. We emphasize that the trajectory RMSE metric is used only for evaluation purposes, to assess (and compare across methods) how well a method recovers the trajectories of the system components; throughout MAGI, no numerical solver is ever needed.
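A minimal sketch of the trajectory RMSE computation, assuming an ODE right-hand side f in the (t, x, theta) convention of the earlier sketches; as emphasized above, the numerical solver appears only in this evaluation step, never inside MAGI.

```python
import numpy as np
from scipy.integrate import solve_ivp

def trajectory_rmse(f, theta_hat, x0_hat, theta_true, x0_true, tau):
    # Reconstruct the trajectory implied by the estimates with a numerical
    # solver, and compare it at the observation times tau against the
    # numerical solution under the true parameter values (the ground truth).
    T = float(np.max(tau))
    sol_hat = solve_ivp(f, (0.0, T), x0_hat, args=(theta_hat,),
                        dense_output=True, rtol=1e-8)
    sol_true = solve_ivp(f, (0.0, T), x0_true, args=(theta_true,),
                         dense_output=True, rtol=1e-8)
    err = sol_hat.sol(tau) - sol_true.sol(tau)   # shape (D, len(tau))
    return np.sqrt(np.mean(err**2, axis=1))      # one RMSE per component
```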

We summarize the trajectory RMSEs of MAGI in Table 2 for the Hes1 system.

Table 2.
Trajectory RMSEs of the individual components in the Hes1 system, comparing the average trajectory RMSEs of MAGI and ref. 9 over the 2,000 simulated datasets
| Method | P | M | H |
| --- | --- | --- | --- |
| MAGI | **0.97** | **0.21** | **2.57** |
| Ref. 9 | 1.30 | 0.40 | 59.47 |

The best trajectory RMSE for each system component is shown in bold.

We compare MAGI with the benchmark provided by the B-spline-based penalization approach of ref. 9. To the best of our knowledge, among all of the existing methods that do not use numerical integration, ref. 9 is the only one with a software package that can be manually adapted to handle an unobserved component. We note, however, that this package is not ready made for this problem: it requires substantial manual input, as it has no default or built-in setup of its hyperparameters for the unobserved component. None of the other benchmark methods, including refs. 11 and 15, provide software that is equipped to handle an unobserved component. Table 1 compares our estimates against those given by ref. 9 based on the parameter RMSE, showing that the parameter RMSEs for MAGI are substantially smaller than those of ref. 9. Fig. 1 shows that the inferred trajectories from MAGI are very close to the truth. In contrast, the method in ref. 9 is not able to recover the unobserved component H or the associated parameters f and g; SI Appendix, Fig. S1 has the plots. Table 2 compares the trajectory RMSEs of the two methods; the trajectory RMSE of MAGI is substantially smaller than that of ref. 9. Further implementation details and comparisons are provided in SI Appendix.

Finally, we note that MAGI recovers the unobserved component H almost as well as the observed components of P and M, as measured by the trajectory RMSEs. In comparison, for the result of ref. 9 in Table 2, the trajectory RMSE of the unobserved H component is orders of magnitude worse than those of P and M. The numerical results thus illustrate the effectiveness of MAGI in borrowing information from the observed components to infer the unobserved component, which is made possible by explicitly conditioning on the ODE structure. The self-regulating parameter g and inhibition rate parameter f for the unobserved component appear to have high inference variation across the simulated datasets despite the small trajectory RMSEs. This suggests that the system itself could be insensitive to f and g when the H component is unobserved.

Comparison with Previous Methods Based on GPs.

To further assess MAGI, we compare with two methods: adaptive gradient matching (AGM) of ref. 11 and fast Gaussian process-based gradient matching (FGPGM) of ref. 15, representing the state of the art of inference methods based on GPs. For a fair comparison, we use the same benchmark systems and the scripts and software provided by the authors for performance assessment, and we run the software using the settings recommended by the authors. The benchmark systems include the FitzHugh–Nagumo (FN) equations (17) and a protein transduction model (18).

FN model.

The FN equations are a classic ion channel model that describes spike potentials. The system consists of X = (V, R), where V is the voltage across the neuron membrane and R is the recovery variable from neuron currents, satisfying the ODE

$$f(X, \theta, t) = \begin{pmatrix} c\left(V - \dfrac{V^3}{3} + R\right) \\[4pt] -\dfrac{1}{c}(V - a + bR) \end{pmatrix},$$
where $\theta = (a, b, c)$ are the associated parameters. As in refs. 11 and 15, the true parameters are set to $a = 0.2$, $b = 0.2$, $c = 3$, and we generate the true trajectories for this model using a numerical solver with initial conditions $V = -1$, $R = 1$.

To compare MAGI with FGPGM of ref. 15 and AGM of ref. 11, we simulated 100 datasets under the noise setting $\sigma_V = \sigma_R = 0.2$ with 41 observations. The noise level is chosen to be of similar magnitude to that of ref. 15, and it is set to be the same across the two components because the implementation of ref. 11 can only handle equal-variance noise. The number of repetitions (i.e., 100) is the same as in ref. 15, due to the high computing time of the alternative methods.
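A sketch of this simulation setup follows; the time span [0, 20] is an assumption for illustration (the text above does not state it), while the parameters, initial conditions, noise SD, and number of observations follow the stated values.

```python
import numpy as np
from scipy.integrate import solve_ivp

def fn_rhs(t, x, theta):
    # FN equations as given above: x = (V, R), theta = (a, b, c)
    V, R = x
    a, b, c = theta
    return [c * (V - V**3 / 3.0 + R), -(V - a + b * R) / c]

theta_true = (0.2, 0.2, 3.0)
t_obs = np.linspace(0.0, 20.0, 41)   # 41 equally spaced observations
rng = np.random.default_rng(0)
sol = solve_ivp(fn_rhs, (0.0, 20.0), [-1.0, 1.0], args=(theta_true,),
                dense_output=True, rtol=1e-8)
y = sol.sol(t_obs) + 0.2 * rng.standard_normal((2, t_obs.size))  # sigma = 0.2
```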

The parameter estimation results from the three methods are summarized in Table 3, where MAGI has the lowest parameter RMSEs among the three. Fig. 2 shows the inferred trajectories obtained by our method: MAGI recovers the system well, and the 95% interval band is so narrow around the truth that we can only see the band clearly after magnification (as shown in Fig. 2, Insets). SI Appendix provides a visual comparison of the inferred trajectories of the different methods, where MAGI gives the most consistent results across the simulations. Furthermore, to assess how well the methods recover the system components, we calculated the trajectory RMSEs; the results are summarized in Table 4, where MAGI significantly outperforms the others, reducing the trajectory RMSE over the best alternative method by 60% in V and 25% in R. We note that, compared with the true parameter values, all three methods show some bias in the parameter estimates, which is partly due to the GP prior as discussed in ref. 15; MAGI appears to have the smallest bias.

Table 3.
Parameter inference in the FN model based on 100 simulated datasets
| θ | MAGI estimate | MAGI RMSE | FGPGM (15) estimate | FGPGM RMSE | AGM (11) estimate | AGM RMSE |
| --- | --- | --- | --- | --- | --- | --- |
| a | 0.19 ± 0.02 | **0.02** | 0.22 ± 0.04 | 0.05 | 0.30 ± 0.03 | 0.10 |
| b | 0.35 ± 0.09 | **0.17** | 0.32 ± 0.13 | 0.18 | 0.36 ± 0.06 | **0.17** |
| c | 2.89 ± 0.06 | **0.13** | 2.85 ± 0.15 | 0.21 | 2.04 ± 0.14 | 0.97 |

For each method, average parameter estimates are reported together with SD; parameter RMSEs across simulations are also reported. Bold highlights the best method in terms of parameter RMSE for each parameter.

Fig. 2. Inferred trajectories by MAGI for each component of the FN system over 100 simulated datasets. Top is the V component, and Bottom is the R component. The blue shaded area represents the 95% interval. Insets magnify the corresponding segments.

Table 4.
Trajectory RMSEs of each component in the FN system, comparing the average trajectory RMSE of the three methods over 100 simulated datasets
| Method | V | R |
| --- | --- | --- |
| MAGI | **0.103** | **0.070** |
| FGPGM (15) | 0.257 | 0.094 |
| AGM (11) | 1.177 | 0.662 |

The best trajectory RMSE for each system component is shown in bold. MAGI reduces the RMSE by 60% in component V and by 25% in component R over the best alternative method.

In terms of computing cost, the average run time of MAGI for this system over the repetitions is 3 min, which is 145 times faster than FGPGM (15) and 90 times faster than AGM (11) on the same processor; we follow the authors' recommendations for running their methods (SI Appendix has details).

Protein transduction model.

This protein transduction example comes from systems biology, where components S and Sd represent a signaling protein and its degraded form, respectively. In the biochemical reaction, S binds to protein R to form the complex SR, which enables the activation of R into Rpp. X = (S, Sd, R, SR, Rpp) satisfies the ODE

$$f(X, \theta, t) = \begin{pmatrix} -k_1 S - k_2 S \cdot R + k_3 SR \\[2pt] k_1 S \\[2pt] -k_2 S \cdot R + k_3 SR + \dfrac{V \cdot R_{pp}}{K_m + R_{pp}} \\[4pt] k_2 S \cdot R - k_3 SR - k_4 SR \\[2pt] k_4 SR - \dfrac{V \cdot R_{pp}}{K_m + R_{pp}} \end{pmatrix},$$

where $\theta = (k_1, k_2, k_3, k_4, V, K_m)$ are the associated rate parameters.

We follow the same simulation setup as refs. 11 and 15 by taking t=0,1,2,4,5,7,10,15,20,30,40,50,60,80,100 as the observation times, X(0)=(1,0,1,0,0) as the initial values, and θ=(0.07,0.6,0.05,0.3,0.017,0.3) as the true parameter values. Two scenarios for additive observation noise are considered: σ=0.001 (low noise) and σ=0.01 (high noise). Note that the observation times are unequally spaced, with only a sparse number of observations from t=20 to t=100. Further, inference for this system has been noted to be challenging due to the nonidentifiability of the parameters, in particular Km and V (15). Therefore, the parameter RMSE is not meaningful for this system, and we focus our comparison on the trajectory RMSE.
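A sketch of the right-hand side above and of the high-noise simulation, using the observation times, initial values, and true parameters stated in the text.

```python
import numpy as np
from scipy.integrate import solve_ivp

def pt_rhs(t, x, theta):
    # x = (S, Sd, R, SR, Rpp); theta = (k1, k2, k3, k4, V, Km)
    S, Sd, R, SR, Rpp = x
    k1, k2, k3, k4, V, Km = theta
    mm = V * Rpp / (Km + Rpp)        # Michaelis-Menten dephosphorylation term
    return [-k1 * S - k2 * S * R + k3 * SR,
            k1 * S,
            -k2 * S * R + k3 * SR + mm,
            k2 * S * R - k3 * SR - k4 * SR,
            k4 * SR - mm]

t_obs = np.array([0, 1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100.0])
theta_true = (0.07, 0.6, 0.05, 0.3, 0.017, 0.3)
rng = np.random.default_rng(0)
sol = solve_ivp(pt_rhs, (0.0, 100.0), [1.0, 0.0, 1.0, 0.0, 0.0],
                args=(theta_true,), dense_output=True, rtol=1e-8)
y = sol.sol(t_obs) + 0.01 * rng.standard_normal((5, t_obs.size))  # high noise
```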

We compare MAGI with FGPGM of ref. 15 and AGM of ref. 11 on 100 simulated datasets for each noise setting (SI Appendix has method and implementation details). We plot the inferred trajectories of MAGI in the high-noise setting in Fig. 3, which closely recover the system. The 95% interval band from MAGI is so narrow that, for most of the inferred components, we need magnification (as shown in Fig. 3, Insets) to see the 95% band clearly. We then calculated the trajectory RMSEs; the results are summarized in Table 5 for each system component. In both noise settings, MAGI produces trajectory RMSEs that are uniformly smaller than both FGPGM (15) and AGM (11) for all system components. In the low-noise setting, the advantage of MAGI is especially apparent for components S, R, SR, and Rpp, with trajectory RMSEs less than half of those of the closest comparison method. In the high-noise setting, MAGI reduces the trajectory RMSE the most for Sd and Rpp (by 50%). AGM (11) struggles with this example at both noise settings. To visually compare the trajectory RMSEs in Table 5, plots of the corresponding reconstructed trajectories by the different methods at both noise settings are given in SI Appendix.

Fig. 3. Inferred trajectories by MAGI for each component of the protein transduction system in the high-noise setting. The red line is the truth, and the green line is the median inferred trajectory over 100 simulated datasets. The blue shaded area represents the 95% interval. Insets magnify the corresponding segments.

Table 5.
Trajectory RMSEs of the individual components in the protein transduction system, comparing the average RMSEs of the three methods over 100 simulated datasets
| Method | S | Sd | R | SR | Rpp |
| --- | --- | --- | --- | --- | --- |
| Low-noise case, σ = 0.001 | | | | | |
| MAGI | **0.0020** | **0.0013** | **0.0040** | **0.0017** | **0.0036** |
| FGPGM (15) | 0.0049 | 0.0016 | 0.0156 | 0.0036 | 0.0149 |
| AGM (11) | 0.0476 | 0.2881 | 0.3992 | 0.0826 | 0.2807 |
| High-noise case, σ = 0.01 | | | | | |
| MAGI | **0.0122** | **0.0043** | **0.0167** | **0.0135** | **0.0136** |
| FGPGM (15) | 0.0128 | 0.0089 | 0.0210 | 0.0136 | 0.0309 |
| AGM (11) | 0.0671 | 0.3125 | 0.4138 | 0.0980 | 0.2973 |

The method achieving the best RMSE for each system component is shown in bold.

The run time of MAGI for this system, averaged over the repetitions, is 18 min, which is 12 times faster than FGPGM (15) and 18 times faster than AGM (11) on the same processor; we follow the authors' recommendations for running their methods (SI Appendix has details).

Discussion

We have presented a methodology for the inference of dynamic systems using manifold-constrained GPs. A key feature that distinguishes our work from previous approaches is that it provides a principled statistical framework, firmly grounded in the Bayesian paradigm. Our method also outperformed currently available GP-based approaches in the accuracy of inference on benchmark examples, and its computation time is much shorter. Our method is robust and able to handle a variety of challenging systems, including unobserved components, asynchronous observations, and parameter nonidentifiability.

A robust software implementation is provided, with user interfaces available for R, MATLAB, and Python, as described in SI Appendix. The user may specify custom ODE systems in any of these languages for inference with our package by following the syntax in the examples that accompany this article. In practice, inference with MAGI using our software can be carried out with relatively few user interventions. The setting of hyperparameters and initial values is fully automatic, although may be overridden by the user.

The main setting that requires some tuning is the number of discretization points in I. In our examples, this was determined by gradually increasing the denseness of the points with short sampler runs until the results became indistinguishable. Note that further increasing the denseness of I has no ill effect apart from increasing the computation time. To illustrate the effect of the denseness of I on MAGI inference results, an empirical study is included in SI Appendix, Varying Number of Discretization, where we examined the results of the FN model with the discretization set I taken to be 41, 81, 161, and 321 equally spaced points. The results confirm that our proposal of gradually increasing the denseness of I works well. The inference results improve as we increase I from 41 to 161 points, and at 161 points the results stabilize. Further increasing the discretization to 321 points doubles the compute time with only a slight gain in accuracy, in terms of trajectory RMSEs, compared with 161 points. This empirical study also indicates that, as $W_I$ becomes an increasingly dense approximation of W, an inference limit would be expected. A theoretical study is a natural future direction of investigation.
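In pseudocode form, the tuning procedure can be sketched as below; run_magi is a hypothetical placeholder for a short pilot run of the sampler that returns posterior mean trajectories of shape (len(I), D), and is not the actual package interface.

```python
import numpy as np

def choose_discretization(y, tau, ode_f, T, start=41, tol=1e-2):
    # Double the denseness of I (41 -> 81 -> 161 -> ...) with short pilot
    # runs, until the inferred trajectories at the shared grid points stop
    # changing. Each refinement keeps the previous grid points, so coarse
    # and fine results can be compared directly via x_hat[::2].
    n, prev = start, None
    while True:
        I = np.linspace(0.0, T, n)
        x_hat = run_magi(y, tau, ode_f, I, n_iter=2000)  # hypothetical pilot run
        if prev is not None and np.max(np.abs(x_hat[::2] - prev)) < tol:
            return n   # results indistinguishable from the coarser grid
        prev, n = x_hat, 2 * (n - 1) + 1
```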

We also investigated the stability of MAGI when the observation time points are farther apart. This empirical study, based on the FN model with 21 observations, is included in SI Appendix, FN Model with Fewer Observations. The inferred trajectories from the 21 observations are still close to the truth, while the interval bands become wider, which is expected as we have less information in this case. We also found that the denseness of the discretization needs to be increased (to 321 time points in this case) to compensate for the sparser 21 observations.*

An inherent feature of the GP approximation is the tendency to favor smoother curves. This limitation has been previously acknowledged (11, 15). As a consequence, two potential forms of bias can exist. First, estimates derived from the posterior distributions of the parameters may have some statistical bias. Second, the trajectories reconstructed by a numerical solver based on the estimated parameters may differ slightly from the inferred trajectories. MAGI, which is built on a GP framework, does not entirely eliminate these forms of bias. However, as seen in the benchmark systems, the magnitude of our bias in both respects is significantly smaller than the current state of the art in refs. 11 and 15.

We considered the inference of dynamic systems specified by ODEs in this article. Such deterministic ODE models are often adequate to describe dynamics at the aggregate or population level (19). However, when the goal is to describe the behavior of individuals [e.g., individual molecules (20, 21)], models such as stochastic differential equations (SDEs) and continuous-time Markov processes, which explicitly incorporate intrinsic (stochastic) noise, often become the model of choice. Extending our method to the inference of SDEs and continuous-time Markov models is a future direction we plan to investigate. Finally, recent developments in deep learning have shown connections between deep neural networks and GPs (22, 23). It could thus also be interesting to explore the application of neural networks to model the ODE system outputs x(t) in conjunction with GPs.

Materials and Methods

For notational simplicity, we drop the dimension index d in this section when the meaning is clear.

Algorithm Overview.

We begin by summarizing the computational scheme of MAGI. Overall, we use HMC (7) to obtain samples of X(I) and the parameters from their joint posterior distribution. Details of the HMC sampling are included in SI Appendix, Hamiltonian Monte Carlo. At each iteration of HMC, X(I) and the parameters are updated together with a joint gradient, with leapfrog step sizes automatically tuned during the burn-in period to achieve an acceptance rate between 60 and 90%. At the completion of HMC sampling (and after discarding an appropriate burn-in period for convergence), we take the posterior means of X(I) as the inferred trajectories and the posterior means of the sampled parameters as the parameter estimates. The techniques we use to temper the posterior and speed up the computations are discussed in Prior Tempering and SI Appendix, Techniques for Computational Efficiency.

Several steps are taken to initialize the HMC sampler. First, we apply a GP fitting procedure to obtain values of ϕ and σ for the observed components; the computed values of the hyperparameters ϕ are subsequently held fixed during the HMC sampling, while the computed value of σ is used as the starting value in the HMC sampler (if σ is known, the GP fitting procedure is used to obtain values of ϕ only). Second, starting values of X(I) for the observed components are obtained by linearly interpolating between the observation time points. Third, starting values for the remaining quantities—θ and (X(I), ϕ) for any unobserved component(s)—are obtained by optimization of the posterior, as described below.

Setting Hyperparameters ϕ for Observed Components.

The GP prior $X_d(t) \sim \mathcal{GP}(\mu_d, \mathcal{K}_d)$, $t \in [0, T]$, is placed on each component $X_d(t)$ separately. The GP Matérn kernel $\mathcal{K}(l) = \phi_1 \frac{2^{1-\nu}}{\Gamma(\nu)} \big( \sqrt{2\nu}\, \frac{l}{\phi_2} \big)^{\nu} B_\nu\big( \sqrt{2\nu}\, \frac{l}{\phi_2} \big)$, with $l = |s - t|$, has two hyperparameters that are held fixed during sampling: $\phi_1$ controls the overall variance level of the GP, while $\phi_2$ controls the bandwidth, i.e., how strongly neighboring points of the GP affect each other.

When the observation noise level σ is unknown, the values of $(\phi_1, \phi_2, \sigma)$ are obtained jointly by maximizing the GP fit to the data, without conditioning on any ODE information, namely

$$(\tilde{\phi}, \tilde{\sigma}) = \underset{\phi, \sigma}{\arg\max}\; p\big(\phi, \sigma^2 \mid y(I_0)\big) = \underset{\phi, \sigma}{\arg\max}\; \pi_{\Phi_1}(\phi_1)\, \pi_{\Phi_2}(\phi_2)\, \pi_{\sigma}(\sigma^2)\, p\big(y(I_0) \mid \phi, \sigma^2\big),$$ [7]

where $y(I_0) \mid \phi, \sigma \sim N\big(0, \mathcal{K}_\phi + \sigma^2 \mathbf{I}\big)$. The index set $I_0$ is the smallest evenly spaced set such that all observation time points of this component are in $I_0$ (i.e., $\tau \subset I_0$). The priors $\pi_{\Phi_1}(\phi_1)$ and $\pi_\sigma(\sigma^2)$ for the variance parameters $\phi_1$ and $\sigma^2$ are set to be flat. The prior $\pi_{\Phi_2}(\phi_2)$ for the bandwidth parameter $\phi_2$ is set to be a Gaussian distribution: 1) the mean $\mu_{\Phi_2}$ is set to half of the period corresponding to the weighted average of all of the frequencies in the Fourier transform of y on $I_0$ (the values of y on $I_0$ are linearly interpolated from the observations at τ), where the weight on a given frequency is the squared modulus of the Fourier transform at that frequency; and 2) the SD is set such that T is three SDs away from $\mu_{\Phi_2}$. This Gaussian prior on $\phi_2$ serves to prevent it from being too extreme. In the subsequent sampling of $(\theta, X(I), \sigma^2)$, the hyperparameters ϕ are fixed at $\tilde{\phi}$, while $\tilde{\sigma}$ gives the starting value of σ in the HMC sampler.
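A sketch of this data-driven prior mean for $\phi_2$, assuming y has already been linearly interpolated onto the evenly spaced grid $I_0$; dropping the zero-frequency term (so that a constant offset does not distort the weighted average) is a detail we assume.

```python
import numpy as np

def phi2_prior_mean(y_I0, I0):
    # Weighted average of the frequencies in the Fourier transform of y on
    # I0, with weights equal to the squared modulus of the transform; the
    # prior mean of phi2 is half of the corresponding period.
    z = np.fft.rfft(y_I0)
    freq = np.fft.rfftfreq(len(y_I0), d=I0[1] - I0[0])
    w = np.abs(z[1:]) ** 2            # drop the zero-frequency term
    f_bar = np.sum(w * freq[1:]) / np.sum(w)
    return 0.5 / f_bar
```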

If σ is known, then values of (ϕ1,ϕ2) are obtained by maximizing

$$\tilde{\phi} = \underset{\phi}{\arg\max}\; p\big(\phi \mid y(I_0), \sigma^2\big) = \underset{\phi}{\arg\max}\; \pi_{\Phi_1}(\phi_1)\, \pi_{\Phi_2}(\phi_2)\, p\big(y(I_0) \mid \phi, \sigma^2\big),$$ [8]

and held fixed at $\tilde{\phi}$ in the subsequent HMC sampling of $(\theta, X(I))$. The priors for $\phi_1$ and $\phi_2$ are the same as previously defined.

Initialization of X(I) for the Observed Components.

To provide starting values of X(I) for the HMC sampler, we use the values of $y(\tau)$ at the observation time points and linearly interpolate at the remaining points in I.

Initialization of the Parameter Vector θ When All System Components Are Observed.

When there are no unobserved components, we provide starting values of θ for the HMC sampler by optimizing the posterior in Eq. 5 as a function of θ alone, holding X(I) and σ unchanged at their starting values. The optimized θ is then used as the starting value for the HMC sampler in this case.

Settings in the Presence of Unobserved System Components: Setting ϕ, Initializing X(I) for Unobserved Components, and Initializing θ.

Separate treatment is needed for setting ϕ and initializing (θ, X(I)) for the unobserved component(s). We use an optimization procedure that seeks to maximize the full posterior in Eq. 5 as a function of θ, together with ϕ and the whole curve X(I) for the unobserved components, while holding σ and the ϕ and X(I) of the observed components unchanged at their initial values discussed above. We thereby set ϕ for the unobserved component(s) and the starting values of θ and X(I) for the unobserved components at the optimized values. In the subsequent sampling, the hyperparameters are fixed at the optimized ϕ, while the HMC sampling starts at the θ and X(I) obtained by this optimization.

Prior Tempering.

After ϕ is set, we use a tempering scheme to control the influence of the GP prior relative to the likelihood during HMC sampling. Note that Eq. 5 can be written as

$$p_{\Theta, X(I) \mid Y(\tau), W_I}\big(\theta, x(I) \mid y(\tau), W_I = 0\big) \propto p_{\Theta, X(I) \mid W_I}\big(\theta, x(I) \mid W_I = 0\big)\; p_{Y(\tau) \mid X(\tau)}\big(y(\tau) \mid x(\tau)\big).$$ [9]

As the cardinality $|I|$ increases with more discretization points, the prior part $p_{\Theta, X(I) \mid W_I}(\theta, x(I) \mid W_I = 0)$ grows, while the likelihood part $p_{Y(\tau) \mid X(\tau)}(y(\tau) \mid x(\tau))$ stays unchanged. Thus, to balance the influence of the prior, we introduce a tempering hyperparameter β with the corresponding posterior

$$p^{(\beta)}_{\Theta, X(I) \mid W_I, Y(\tau)}\big(\theta, x(I) \mid 0, y(\tau)\big) \propto p_{\Theta, X(I) \mid W_I}\big(\theta, x(I) \mid W_I = 0\big)^{1/\beta}\; p_{Y(\tau) \mid X(I)}\big(y(\tau) \mid x(I)\big) \propto \pi_\Theta(\theta) \exp\Bigg\{ -\frac{1}{2} \sum_{d=1}^{D} \Big[ N_d \log(2\pi\sigma_d^2) + \|x_d(\tau_d) - y_d(\tau_d)\|^2_{\sigma_d^{-2}} + \frac{1}{\beta}\Big( \|x_d(I) - \mu_d(I)\|^2_{C_d^{-1}} + \big\| f^{x,\theta}_{d,I} - \dot{\mu}_d(I) - m_d\{x_d(I) - \mu_d(I)\} \big\|^2_{K_d^{-1}} \Big) \Big] \Bigg\}.$$ [10]

A useful setting that we recommend is $\beta = D \cdot |I| / N$, where D is the number of system components, $|I|$ is the number of discretization time points, and $N = \sum_{d=1}^{D} N_d$ is the total number of observations. This setting aims to balance the likelihood contribution from the observations with the total number of discretization points.
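For example, under a hypothetical configuration with D = 3 components, |I| = 161 discretization points, and N = 33 observations:

```python
D, I_size, N = 3, 161, 33   # hypothetical configuration
beta = D * I_size / N       # recommended tempering setting, about 14.6
```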

Acknowledgments

The research of S.W.K.W. is supported in part by Natural Sciences and Engineering Research Council of Canada Discovery Grant RGPIN-2019-04771. The research of S.C.K. is supported in part by NSF Grant DMS-1810914.

The authors declare no competing interest.
This article is a PNAS Direct Submission.
*This finding echoes the classical understanding that stiff systems require denser discretization (observations farther apart make the system appear relatively more stiff).
The parameters here refer to θ and σ. If the noise level σ is known a priori, the parameters then refer to θ only.
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2020397118/-/DCSupplemental.

Data Availability

All of the data used in the article are simulation data. The details, including the models to generate the simulation data, are described in Results and SI Appendix. Our software package, available at GitHub, https://github.com/wongswk/magi, also includes complete replication scripts for all of the data and examples.

References

1. H. Hirata, Oscillatory expression of the bHLH factor Hes1 regulated by a negative feedback loop. Science 298, 840–843 (2002).

2. D. B. Forger, Biological Clocks, Rhythms, and Oscillations: The Theory of Biological Timekeeping (MIT Press, 2017).

3. H. Miao, C. Dykes, L. M. Demeter, H. Wu, Differential equation modeling of HIV viral fitness experiments: Model identification, model selection, and multimodel inference. Biometrics 65, 292–300 (2009).

4. S. Busenberg, Differential Equations and Applications in Ecology, Epidemics, and Population Problems (Elsevier, 2012).

5. N. S. Gorbach, S. Bauer, J. M. Buhmann, “Scalable variational inference for dynamical systems” in Advances in Neural Information Processing Systems, I. Guyon, Eds. (Curran Associates Inc., Red Hook, NY, 2017), pp. 4806–4815.

6. L. Wu, X. Qiu, Y.-X. Yuan, H. Wu, Parameter estimation and variable selection for big systems of linear ordinary differential equations: A matrix-based approach. J. Am. Stat. Assoc. 114, 657–667 (2019).

7. R. M. Neal, “MCMC using Hamiltonian dynamics” in Handbook of Markov Chain Monte Carlo, S. Brooks, A. Gelman, G. Jones, X. L. Meng, Eds. (Chapman & Hall/CRC Handbooks of Modern Statistical Methods, CRC Press, 2011), pp. 113–162.

8. B. Calderhead, M. Girolami, N. D. Lawrence, “Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes” in Advances in Neural Information Processing Systems, D. Koller, D. Schuurmans, Y. Bengio, L. Bottou, Eds. (Curran Associates Inc., Red Hook, NY, 2008), pp. 217–224.

9. J. O. Ramsay, G. Hooker, D. Campbell, J. Cao, Parameter estimation for differential equations: A generalized smoothing approach. J. Roy. Stat. Soc. B 69, 741–796 (2007).

10. P. Hennig, M. A. Osborne, M. Girolami, Probabilistic numerics and uncertainty in computations. Proc. Math. Phys. Eng. Sci. 471, 20150142 (2015).

11. F. Dondelinger, D. Husmeier, S. Rogers, M. Filippone, “ODE parameter inference using adaptive gradient matching with Gaussian processes” in International Conference on Artificial Intelligence and Statistics, C. M. Carvalho, P. Ravikumar, Eds. (Machine Learning Research Press, Cambridge, MA, 2013), pp. 216–228.

12. D. Barber, Y. Wang, “Gaussian processes for Bayesian estimation in ordinary differential equations” in International Conference on Machine Learning, E. P. Xing, T. Jebara, Eds. (Machine Learning Research Press, Cambridge, MA, 2014), pp. 1485–1493.

13. S. Ghosh, S. Dasmahapatra, K. Maharatna, Fast approximate Bayesian computation for estimating parameters in differential equations. Stat. Comput. 27, 19–38 (2017).

14. A. Lazarus, D. Husmeier, T. Papamarkou, “Multiphase MCMC sampling for parameter inference in nonlinear ordinary differential equations” in International Conference on Artificial Intelligence and Statistics, A. Storkey, F. Perez-Cruz, Eds. (Machine Learning Research Press, Cambridge, MA, 2018), pp. 1252–1260.

15. P. Wenk, “Fast Gaussian process based gradient matching for parameter identification in systems of nonlinear ODEs” in International Conference on Artificial Intelligence and Statistics, K. Chaudhuri, M. Sugiyama, Eds. (Machine Learning Research Press, Cambridge, MA, 2019), pp. 1351–1360.

16. B. Macdonald, C. Higham, D. Husmeier, Controversy in mechanistic modelling with Gaussian processes. Proc. Mach. Learn. Res. 37, 1539–1547 (2015).

17. R. FitzHugh, Impulses and physiological states in theoretical models of nerve membrane. Biophys. J. 1, 445–466 (1961).

18. V. Vyshemirsky, M. A. Girolami, Bayesian ranking of biochemical system models. Bioinformatics 24, 833–839 (2007).

19. T. G. Kurtz, The relationship between stochastic and deterministic models for chemical reactions. J. Chem. Phys. 57, 2976–2978 (1972).

20. S. C. Kou, X. S. Xie, Generalized Langevin equation with fractional Gaussian noise: Subdiffusion within a single protein molecule. Phys. Rev. Lett. 93, 180603 (2004).

21. S. C. Kou, B. J. Cherayil, W. Min, B. P. English, X. S. Xie, Single-molecule Michaelis–Menten equations. J. Phys. Chem. B 109, 19068–19081 (2005).

22. A. Jacot, F. Gabriel, C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, S. Bengio, Eds. (Curran Associates Inc., Red Hook, NY, 2018), pp. 8580–8589.

23. J. Lee, “Deep neural networks as Gaussian processes” in International Conference on Learning Representations, I. Murray, M. Ranzato, O. Vinyals, Eds. (OpenReview.net, Amherst, MA, 2018), pp. 1–17.