Learning Dynamical Systems from Multiple Sparse Datasets: A Hierarchical Bayesian Modeling Approach
Summary
Proposes a hierarchical Bayesian framework for meta-learning in dynamical systems from multiple sparse, noisy datasets, using gradient-based MCMC with an embedded ODE solver for efficient posterior inference of shared and dataset-specific parameters.
View Cached Full Text
Cached at: 06/25/26, 05:08 AM
# Learning Dynamical Systems from Multiple Sparse Datasets: A Hierarchical Bayesian Modeling Approach
Source: [https://arxiv.org/html/2606.24966](https://arxiv.org/html/2606.24966)
Lea MultererMarco ForgioneLaura AzzimontiSUPSI, Dalle Molle Institute for Artificial Intelligence \(IDSIA USI\-SUPSI\), Lugano, Switzerland \(e\-mail: lea\.multerer@idsia\.ch\)
###### Abstract
Estimating parameters of dynamical systems from sparse, noisy, and irregularly sampled data is often severely ill\-conditioned\. When multiple related datasets are available, they provide additional information if the shared structure and variability are properly modeled\. We propose a hierarchical Bayesian framework for probabilistic meta\-learning in dynamical systems, modeling dataset\-specific parameters as draws from a shared population distribution\. A numerical ODE solver is embedded within gradient\-based MCMC to enable efficient posterior inference of the shared population and dataset\-specific parameter distribution\. Experiments show improved predictive performance over unpooled methods, highlighting the potential for data\-efficient system identification in settings with sparse data\.
###### keywords:
System identification, Hierarchical Bayesian modeling, Probabilistic meta\-learning, Sparse data, Non\-identifiability\.
## 1Introduction
Gray\-box models based on ordinary differential equations \(ODEs\) are a standard description of time\-evolving physical systems across domains such as engineering, biology and epidemiology\. In many applications, their parameters must be estimated from small, noisy and irregularly sampled datasets\. When data informativity is extremely low, this inverse problem may be ill\-conditioned, rendering the estimates unreliable or even undefined\(Geverset al\.,[2009](https://arxiv.org/html/2606.24966#bib.bib32)\)\. This is a manifestation of practical, as opposed to structural, non\-identifiability\(Raueet al\.,[2009](https://arxiv.org/html/2606.24966#bib.bib30)\)\.
In system identification, the classical remedy for low data informativity is regularization: prior information is injected to stabilize the estimate, either through an explicit penalty or through a tunable kernel whose hyperparameters are adjusted from the data by marginal\-likelihood maximization\(Pillonettoet al\.,[2022](https://arxiv.org/html/2606.24966#bib.bib26)\)\. A complementary approach is Bayesian inference, where a probabilistic prior over parameters is updated by conditioning on the data to yield a full posterior over the unknowns\.
A second lever appears when a*collection*of experiments generated by*similar*dynamical systems is available; its usefulness is governed by how much such systems vary\. Assume all systems are identical and the collection reduces to one long, likely informative experiment; assume them completely independent and the shared structure is discarded, leaving each weak dataset to fend for itself\. Often, neither extreme fits: systems are related but not identical, and the variability between them is itself unknown, and must be inferred across the entire collection of datasets\.
Estimating the across\-dataset variability is precisely what mixed\-effects, or multilevel, modeling does, a construction with a long tradition in applied statistics\(Pinheiro and Bates,[2000](https://arxiv.org/html/2606.24966#bib.bib28)\)\. Its Bayesian instantiation, which we adopt in this work, is known as Hierarchical Bayesian Modeling \(HBM\)\(Gelmanet al\.,[2013](https://arxiv.org/html/2606.24966#bib.bib27)\)\. In HBM, the dataset\-specific parameters are treated as draws from a shared distribution whose hyperparameters, including the spread, are inferred jointly with the parameters\. Sparse longitudinal data, as in epidemiology\(Congdon,[2019](https://arxiv.org/html/2606.24966#bib.bib3); Lawson,[2018](https://arxiv.org/html/2606.24966#bib.bib12)\), and population pharmacokinetics\(Wakefield,[1996](https://arxiv.org/html/2606.24966#bib.bib29)\)are among the most widespread applications\.
A limitation of current HBM implementations is that they require the parameter\-to\-observation map to be linear or, at least, available in closed form, which significantly restricts applicability\. For dynamical systems, in particular, this map runs through the ODE solution, which generally admits no closed form in the nonlinear case, and can therefore only be evaluated approximately through a numerical integration scheme\.
As a result, Bayesian inference for ODEs is largely limited to the non\-hierarchical single\-dataset case\(Stuart,[2010](https://arxiv.org/html/2606.24966#bib.bib10); Roda,[2020](https://arxiv.org/html/2606.24966#bib.bib4); Greenet al\.,[2015](https://arxiv.org/html/2606.24966#bib.bib34)\)\. Hierarchical multi\-dataset extensions only exist for special cases where analytical ODE solutions are known, reducing to nonlinear mixed\-effects regression\(Multereret al\.,[2021](https://arxiv.org/html/2606.24966#bib.bib5)\), or where the ODE is replaced by basis\-function or collocation surrogates\(Looset al\.,[2018](https://arxiv.org/html/2606.24966#bib.bib9); Huanget al\.,[2020](https://arxiv.org/html/2606.24966#bib.bib1)\)\. Approaches that both pool across related datasets and support a numerical ODE solver within a Bayesian inference scheme remain rare\.
A related line of work addresses low data informativity through meta\-learning\(Forgioneet al\.,[2023](https://arxiv.org/html/2606.24966#bib.bib23); Chakrabartyet al\.,[2025](https://arxiv.org/html/2606.24966#bib.bib24); Lakshminarayananet al\.,[2025](https://arxiv.org/html/2606.24966#bib.bib25)\)\. By exploiting shared structure across a population of systems, meta\-learning compensates for individually weak datasets, allowing the identification of even high\-dimensional black\-box models \(e\.g\., neural networks\)\. However, current approaches require a very large collection of related datasets, which in practice is feasible only when a digital twin of the process can synthesize them at scale\.
In this work, we instead target the regime of a modest number of genuinely measured datasets, each individually weak\. We formalize probabilistic meta\-learning for gray\-box ODE systems within a HBM framework, with the ODE coefficients and noise parameters of each dataset forming the dataset\-specific level of the hierarchy\. Approximate posterior inference is carried out by Markov Chain Monte Carlo \(MCMC\) sampling\. Because the posterior dimension grows with the number of datasets through the dataset\-specific parameters, gradient\-free samplers become impractical; we therefore adopt the No\-U\-Turn Sampler \(NUTS\)\(Hoffman and Gelman,[2014](https://arxiv.org/html/2606.24966#bib.bib33)\), a gradient\-based sampler that scales to high\-dimensional probabilistic problems\. The enabling ingredient is a differentiable ODE solver: embedding it in the computational graph allows gradients of the likelihood to propagate through the numerical integration step, making NUTS viable in the ODE setting\. On a Lotka–Volterra benchmark, we show that systematic pooling across sparse, noisy, and irregularly sampled trajectories markedly improves practical identifiability and predictive accuracy over unpooled Bayesian and nonlinear least\-squares baselines\.
The rest of this paper is organized as follows\. We introduce the problem formulation of meta\-datasets with a shared structure in Section[2](https://arxiv.org/html/2606.24966#S2)\. In Section[3](https://arxiv.org/html/2606.24966#S3), we develop the meta\-learning approach within the HBM framework\. We detail the implementation in Section[4](https://arxiv.org/html/2606.24966#S4)and present a numerical example in Section[5](https://arxiv.org/html/2606.24966#S5)\. Finally, conclusions and directions for further work are discussed in Section[6](https://arxiv.org/html/2606.24966#S6)\.
## 2Problem Formulation
### 2\.1System Description
We consider dynamical systems described by a system of first\-order ODEs:
ddtx\(t\)=f\(x\(t\),t;θf\),x\(0\)=x0,\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}t\}x\(t\)=f\\big\(x\(t\),t;\\;\\theta\_\{f\}\\big\),\\qquad x\(0\)=x\_\{0\},\(1a\)wherex\(t\)∈ℝnxx\(t\)\\in\\mathbb\{R\}^\{n\_\{x\}\}denotes the state vector evolving over continuous timet∈ℝ\+t\\in\\mathbb\{R\}^\{\+\},f:ℝnx→ℝnxf:\\mathbb\{R\}^\{n\_\{x\}\}\\to\\mathbb\{R\}^\{n\_\{x\}\}defines the state dynamics, andθf∈ℝnθf\\theta\_\{f\}\\in\\mathbb\{R\}^\{n\_\{\\theta\_\{f\}\}\}are the uncertain parameters offf, which vary from system to system\.
The systems are partially observed at a finite set ofTTmeasurement instants irregularly spaced over time\. Let𝐭=\{t1,t2,…,tT\}\\mathbf\{t\}=\\\{t\_\{1\},t\_\{2\},\\ldots,t\_\{T\}\\\},ti∈ℝ\+t\_\{i\}\\in\\mathbb\{R\}^\{\+\}and𝐲=\{y1,y2,…,yT\}\\mathbf\{y\}=\\\{y\_\{1\},y\_\{2\},\\ldots,y\_\{T\}\\\},yi∈ℝnyy\_\{i\}\\in\\mathbb\{R\}^\{n\_\{y\}\}denote the set of sampling times and corresponding output measurements, respectively\. The measurementyi∈ℝnyy\_\{i\}\\in\\mathbb\{R\}^\{n\_\{y\}\}at time steptit\_\{i\}is:
y\(ti\)=g\(x\(ti\);θg\)\+ηi,y\(t\_\{i\}\)=g\\big\(x\(t\_\{i\}\);\\;\\theta\_\{g\}\\big\)\+\\eta\_\{i\},\(1b\)where the functiong:ℝnx→ℝnyg:\\mathbb\{R\}^\{n\_\{x\}\}\\to\\mathbb\{R\}^\{n\_\{y\}\}has uncertain parametersθg∈ℝnθg\\theta\_\{g\}\\in\\mathbb\{R\}^\{n\_\{\\theta\_\{g\}\}\}, also varying across system instances\. The functional forms offfandggare assumed to be known\. Furthermore,ηi∈ℝny\\eta\_\{i\}\\in\\mathbb\{R\}^\{n\_\{y\}\}is additive measurement noise generated by a probability density functionpη\(⋅,θη\)p\_\{\\eta\}\(\\cdot,\\theta\_\{\\eta\}\)that depends on further, potentially unknown, parametersθη∈ℝnθη\\theta\_\{\\eta\}\\in\\mathbb\{R\}^\{n\_\{\\theta\_\{\\eta\}\}\}\. In the following, we denote the complete set of a system’s uncertain parameters to be estimated byθ=\[θf,θg,θη\]⊤∈ℝnθ\\theta=\[\\theta\_\{f\},\\theta\_\{g\},\\theta\_\{\\eta\}\]^\{\\top\}\\in\\mathbb\{R\}^\{n\_\{\\theta\}\}\.
### 2\.2Meta\-Dataset and Shared Structure
A finite collection ofJJdatasets is available\. All datasets are generated by systems of the defined ODE structure \([1](https://arxiv.org/html/2606.24966#S2.E1)\)\. However, each system is characterized by a different realization of the parametersθ\\thetaand a potentially different set of measurement instants\. In line with the meta\-learning literature, this collection is referred to as the*meta\-dataset*:
𝒟=\{D1,D2,…,DJ\},\\mathcal\{D\}=\\\{D^\{1\},D^\{2\},\\ldots,D^\{J\}\\\},\(2a\)where thejj\-th datasetDj∈𝒟D^\{j\}\\in\\mathcal\{D\}is characterized byDj=\{𝐭j,𝐲j\}D^\{j\}=\\\{\\mathbf\{t\}^\{j\},\\mathbf\{y\}^\{j\}\\\}, with𝐭j=\{t1j,…,tTjj\}\\mathbf\{t\}^\{j\}=\\\{t\_\{1\}^\{j\},\\dots,t\_\{T\_\{j\}\}^\{j\}\\\}and𝐲j=\{y1j,…,yTjj\}\\mathbf\{y\}^\{j\}=\\\{\{y\}\_\{1\}^\{j\},\\dots,\{y\}\_\{T\_\{j\}\}^\{j\}\\\}\.
Some \(or all\) datasetsDj∈𝒟D^\{j\}\\in\\mathcal\{D\}are assumed to be individually*information\-poor*\. This means that applying standard supervised learning techniques \(e\.g\., maximum likelihood, maximum a posteriori, or full Bayesian inference under a weak prior\) to a single dataset would yield unacceptably large uncertainty, or even an ill\-posed estimation problem\.
However, the meta\-dataset exhibits a strong, yet a priori unknown*similarity pattern*\. Specifically, an unknown distributionpo\(θ\)p^\{o\}\(\\theta\)characterizes the generation ofθj\\theta^\{j\}and, therefore, the common structure across datasets\. We posit that, ifpo\(θ\)p^\{o\}\(\\theta\)were known, the parameter posteriorp\(θj∣Dj;po\)p\(\\theta^\{j\}\\mid D^\{j\};\\;p^\{o\}\)over each individual dataset, formally defined by:
p\(θj∣Dj;po\)=p\(Dj∣θj\)po\(θj\)p\(Dj\)p\(\\theta^\{j\}\\mid D^\{j\};\\;p^\{o\}\)=\\frac\{p\(D^\{j\}\\mid\\theta^\{j\}\)\\,p^\{o\}\(\\theta^\{j\}\)\}\{p\(D^\{j\}\)\}\(2b\)would be sufficiently concentrated around the trueθj\\theta^\{j\}to support reliable inference, for allj=1,2,…,Jj=1,2,\\dots,J\.
### Example: LTI Meta\-Dataset
Consider a meta\-dataset𝒟\\mathcal\{D\}generated by first\-order linear\-time\-invariant \(LTI\) dynamical systems:
ddtx\(t\)=−1τ\(x\(t\)−d\),\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}t\}x\(t\)=\-\\frac\{1\}\{\\tau\}\\big\(x\(t\)\-d\\big\),\(3a\)whereτ\\tauis the time constant,ddis the steady\-state value andx0=0x\_\{0\}=0is the initial state, assumed known\. Note that in this simple example, the ODE solutionx\(t\)=d\+\(x0−d\)e−t/τx\(t\)=d\+\(x\_\{0\}\-d\)\\,e^\{\-t/\\tau\}is available in closed\-form\. The measurementyycorresponds to the state variablexxcorrupted by white Gaussian noise:y\(ti\)=x\(ti\)\+ηi,ηi∼𝒩\(0,ση2\),y\(t\_\{i\}\)=x\(t\_\{i\}\)\+\\eta\_\{i\},\\qquad\\eta\_\{i\}\\sim\\mathcal\{N\}\(0,\\sigma\_\{\\eta\}^\{2\}\),\(3b\)where the noise standard deviation \(ση=0\.05\\sigma\_\{\\eta\}=0\.05\) is known and fixed in all datasets\.
Instead, the coefficientsθ=\[τd\]⊤\\theta=\[\\tau\\;d\]^\{\\top\}vary across datasets and are characterized by the \(unknown\) distributionpo\(θ\)p^\{o\}\(\\theta\):
τ∼U\(0\.9,1\.1\),d∼U\(0\.5,2\.0\)\.\\tau\\sim U\(0\.9,1\.1\),\\qquad d\\sim U\(0\.5,2\.0\)\.\(4\)
The meta\-dataset containsJ=50J=50realizations\. The first4848containT=2T=2samples randomly spaced in the interval\[0,10\]\[0,10\], while the last two \(49thand 50th\) feature a single measurement \(T=1T=1\) at instantst149=1t\_\{1\}^\{49\}=1andt150=10t\_\{1\}^\{50\}=10, respectively\. Fig\.[1](https://arxiv.org/html/2606.24966#S2.F1)illustrates four representative datasets\. For each of them, the measured pointsy\(ti\)y\(t\_\{i\}\), the true statex\(t\)x\(t\), and the posterior state mean—derived under knowledge ofpop^\{o\}—with 95% credible bands are shown\.
Figure 1:LTI datasets: noisy measurementsy\(ti\)y\(t\_\{i\}\)\(black circles\), true statex\(t\)x\(t\)\(black line\), and posterior mean trajectory with 95% credible band under the true priorpop^\{o\}, see Eq\. \([4](https://arxiv.org/html/2606.24966#S2.E4)\) \(green\), and under the weak prior uniform in the range, see Eq\. \([5](https://arxiv.org/html/2606.24966#S2.E5)\) \(pink\)\.We remark that, even for this simple example, the posterior lacks a closed\-form analytical solution, and is here approximated with MCMC sampling\. In this case, however, the MCMC implementation is simplified by the availability of the analytical solution of the ODE, allowing the likelihood to be evaluated without numerical integration\. A comprehensive discussion of approximate probabilistic inference is deferred to Section[4](https://arxiv.org/html/2606.24966#S4)\.
It appears that, given knowledge of the generative priorpop^\{o\}, the predictive uncertainty over the state remains remarkably narrow across all datasets\. At the same time, if the priorpop^\{o\}was not known, some of the datasets in the collection would be very weakly informative\. For instance, inD5D^\{5\}, the two data points are close to each other and both at steady\-state, preventing practical identification of the dynamical parametersτ\\tauanddd\. Moreover,D49D^\{49\}andD50D^\{50\}, which contain a single data point each, are structurally non\-identifiable\.
Now assume that the model structure \([3](https://arxiv.org/html/2606.24966#S2.E3)\) and constantsx0=0,ση=0\.05x\_\{0\}=0,\\sigma\_\{\\eta\}=0\.05are known, while the generative distributionpop^\{o\}in \([4](https://arxiv.org/html/2606.24966#S2.E4)\) is not\. The only \(correct, yet vague\) background knowledge is that:
τ∈\[0\.1,6\.0\],d∈\[−6,6\]\\displaystyle\\tau\\in\[0\.1,\\;6\.0\],\\qquad d\\in\[\-6,\\;6\]\(5\)i\.e\., that the system is stable, the time constantτ\\tauis not extremely small, and the absolute value of the steady\-stateddis not extremely large\. Based on this background, the Bayesian system identification practitioner may postulate a weak priorpweakp^\{\\mathrm\{weak\}\}, uniform in the range given in Eq\. \([5](https://arxiv.org/html/2606.24966#S2.E5)\)111Alternative prior hypotheses compatible with \([5](https://arxiv.org/html/2606.24966#S2.E5)\) are possible; the uniform over the interval is chosen for simplicity\.\. As shown in Fig\.[1](https://arxiv.org/html/2606.24966#S2.F1), under this weak prior the predictive uncertainty over the state trajectory becomes very large forD5D^\{5\},D49D^\{49\}, andD50D^\{50\}, reflecting the difficulty of learning the dynamics without population\-level information\.
## 3Probabilistic Meta\-Learning of Dynamical Systems
The shared structure assumption illustrated in Section[2\.2](https://arxiv.org/html/2606.24966#S2.SS2)is the key motivation for a meta\-learning approach\. Rather than instantiating independent learning problems for each dataset, where individual information may be insufficient, we instead devise a*global*learning approach that explicitly takes advantage of this similarity across datasets\. In principle, a probabilistic meta\-learning approach aims to jointly estimate:
- •A shared prior distributionp^\(θ\)\\hat\{p\}\(\\theta\)capturing the cross\-dataset structure,
- •The individual posterior distributionsp\(θj∣Dj;p^\)p\(\\theta^\{j\}\\mid D^\{j\};\\;\\hat\{p\}\)for each dataset, constructed usingp^\(θ\)\\hat\{p\}\(\\theta\)as prior\.
The interplay between prior learning and posterior inference enables the meta\-dataset as a whole to inform parameter estimates for any single dataset\.
### 3\.1Hierarchical Bayesian Modeling
The approach followed in this paper formalizes the prior learning objective by adopting an HBM framework\. Fundamentally, HBM postulates that the meta\-dataset is generated by the following hierarchical probabilistic process:
- •Shared*hyperparameters*ϕ\\phiare drawn from a*hyperprior*distribution: ϕ∼p\(ϕ\)\.\\phi\\sim p\(\\phi\)\.\(6a\)
- •For each datasetj∈\{1,…,J\}j\\in\\\{1,\\dots,J\\\}, the system\-specific parametersθj\\theta^\{j\}are drawn from a shared prior conditioned on those hyperparameters: θj∼p\(θj∣ϕ\),\\theta^\{j\}\\sim p\(\\theta^\{j\}\\mid\\phi\),\(6b\)withp\(θj∣ϕ\)p\(\\theta^\{j\}\\mid\\phi\)being identical for all datasets\.
- •For each measurement instanti∈\{1,…,Tj\}i\\in\\\{1,\\dots,T\_\{j\}\\\}within datasetjj, the observationyijy\_\{i\}^\{j\}is drawn from a measurement probability distribution conditioned on the specific system parameters and the sampling timetijt\_\{i\}^\{j\}: yij∼p\(yij∣tij,θj\)\.y\_\{i\}^\{j\}\\sim p\(y\_\{i\}^\{j\}\\mid t\_\{i\}^\{j\},\\theta^\{j\}\)\.\(6c\)
For systems of the form \([1](https://arxiv.org/html/2606.24966#S2.E1)\), the measurement conditional density in \([6c](https://arxiv.org/html/2606.24966#S3.E6.3)\) is induced by the ODE solution and the observation model\. In the case of known initial state and additive observation noise, it is given by
p\(yij∣tij,θj\)\\displaystyle p\(y\_\{i\}^\{j\}\\mid t\_\{i\}^\{j\},\\theta^\{j\}\)=pη\(yij−y^ij;θηj\),\\displaystyle=p\_\{\\eta\}\(y\_\{i\}^\{j\}\-\\hat\{y\}\_\{i\}^\{j\};\\theta\_\{\\eta\}^\{j\}\),\(7a\)y^ij\\displaystyle\\hat\{y\}\_\{i\}^\{j\}=g\(x^ij,θgj\),\\displaystyle=g\(\\hat\{x\}\_\{i\}^\{j\},\\theta\_\{g\}^\{j\}\),\(7b\)x^ij\\displaystyle\\hat\{x\}\_\{i\}^\{j\}=x0\+∫0tijf\(xj\(s\),s,θfj\)ds\.\\displaystyle=x\_\{0\}\+\\int\_\{0\}^\{t\_\{i\}^\{j\}\}f\(x^\{j\}\(s\),s,\\theta\_\{f\}^\{j\}\)\\,\\mathrm\{d\}s\.\(7c\)
The hierarchical formulation captures the shared structure across the meta\-dataset\. By modeling the parametersθj\\theta^\{j\}as originating from this shared priorp\(θj∣ϕ\)p\(\\theta^\{j\}\\mid\\phi\), the datasetsDjD^\{j\}are structurally linked\. Importantly, a vague hyperprior might be chosen in order to accommodate a wide range of plausible posterior scenarios\.
The probabilistic model \([6](https://arxiv.org/html/2606.24966#S3.E6)\) is then*conditioned*on the observed meta\-dataset to obtain the posterior distribution of hyperparameters and parameters of interest:
p\(ϕ,θ1:J∣𝒟\)=p\(ϕ,θ1:J,𝒟\)p\(𝒟\),p\(\\phi,\\theta^\{1:J\}\\mid\\mathcal\{D\}\)=\\frac\{p\(\\phi,\\theta^\{1:J\},\\mathcal\{D\}\)\}\{p\(\\mathcal\{D\}\)\},\(8\)which describes the posterior belief ofϕ\\phiandθ1:J\\theta^\{1:J\}\.
In general, \([8](https://arxiv.org/html/2606.24966#S3.E8)\) does not have a closed\-form expression, as its denominatorp\(𝒟\)p\(\\mathcal\{D\}\)is intractable\. However, the*joint*probability density functionp\(ϕ,θ1:J,𝒟\)p\(\\phi,\\theta^\{1:J\},\\mathcal\{D\}\)of hyperparameters, parameters and observations, i\.e\.,
p\(ϕ,θ1:J,𝒟\)=p\(ϕ\)∏j=1J\(p\(θj∣ϕ\)∏i=1Tjp\(yij∣tij,θj\)\),p\(\\phi,\\theta^\{1:J\},\\mathcal\{D\}\)=p\(\\phi\)\\prod\_\{j=1\}^\{J\}\\bigg\(p\(\\theta^\{j\}\\mid\\phi\)\\prod\_\{i=1\}^\{T^\{j\}\}p\(y\_\{i\}^\{j\}\\mid t\_\{i\}^\{j\},\\theta^\{j\}\)\\bigg\),\(9\)for*fixed*𝒟\\mathcal\{D\}, is proportional to the density of interest \([8](https://arxiv.org/html/2606.24966#S3.E8)\)\. Therefore, an MCMC algorithm can be applied to obtain samples from \([8](https://arxiv.org/html/2606.24966#S3.E8)\), using \([9](https://arxiv.org/html/2606.24966#S3.E9)\) as*unnormalized*target distribution\.
We denote the sequence ofMMposterior samples \(or trace\) for both the shared hyperparameters and the system\-specific parameters obtained with MCMC by:
𝒯=\{ϕm,θ1;m,…,θJ;m\}m=1M∼p\(ϕ,θ1:J∣𝒟\),\\mathcal\{T\}=\\Big\\\{\\phi^\{m\},\\theta^\{1;m\},\\dots,\\theta^\{J;m\}\\Big\\\}\_\{m=1\}^\{M\}\\sim p\(\\phi,\\theta^\{1:J\}\\mid\\mathcal\{D\}\),\(10\)whereθj;m\\theta^\{j;m\}denotes themm\-th posterior sample of the parameters associated with datasetjj\. From the perspective of applied Bayesian analysis, the probabilistic model is considered*solved*once the trace𝒯\\mathcal\{T\}is computed, because these empirical samples characterize the otherwise intractable posterior distribution \([8](https://arxiv.org/html/2606.24966#S3.E8)\)\. Other predictive tasks, such as computing the posterior of a state trajectoryxj\(t\)x^\{j\}\(t\), can be accomplished by forward\-simulating the ODE for each parameter sample\{θj;m\}m=1M\\big\\\{\\theta^\{j;m\}\\big\\\}\_\{m=1\}^\{M\}in the trace\.
### Example: HBM for the LTI Meta\-Dataset
To apply the HBM framework to the LTI example, we first translate the deterministic belief \([5](https://arxiv.org/html/2606.24966#S2.E5)\) on the plausible parameter ranges into a hierarchical probabilistic prior\. We choose:
τ0\\displaystyle\\tau^\{0\}∼𝒩\(3\.05,0\.982\),\\displaystyle\\sim\\mathcal\{N\}\(3\.05,0\.98^\{2\}\),\\qquadστ\\displaystyle\\sigma\_\{\\tau\}∼\|𝒩\(0,22\)\|\\displaystyle\\sim\|\\mathcal\{N\}\(0,2^\{2\}\)\|\(11a\)d0\\displaystyle d^\{0\}∼𝒩\(0,22\),\\displaystyle\\sim\\mathcal\{N\}\(0,2^\{2\}\),\\qquadσd\\displaystyle\\sigma\_\{d\}∼\|𝒩\(0,22\)\|\\displaystyle\\sim\|\\mathcal\{N\}\(0,2^\{2\}\)\|\(11b\)τj\\displaystyle\\tau^\{j\}∼𝒩\(τ0,στ2\),\\displaystyle\\sim\\mathcal\{N\}\(\\tau^\{0\},\\sigma\_\{\\tau\}^\{2\}\),\\qquadj\\displaystyle j=1,…,J\\displaystyle=1,\\dots,J\(11c\)dj\\displaystyle d^\{j\}∼𝒩\(d0,σd2\),\\displaystyle\\sim\\mathcal\{N\}\(d^\{0\},\\sigma\_\{d\}^\{2\}\),\\qquadj\\displaystyle j=1,…,J,\\displaystyle=1,\\dots,J,\(11d\)where\|𝒩\(0,σ2\)\|\|\\mathcal\{N\}\(0,\\sigma^\{2\}\)\|denotes a half\-normal distribution with scaleσ2\\sigma^\{2\}\. In \([11](https://arxiv.org/html/2606.24966#S3.E11)\), the coefficients induce prior distributions forτj\\tau^\{j\}anddjd^\{j\}with a wide support that fully covers the ranges in \([5](https://arxiv.org/html/2606.24966#S2.E5)\)\.
The hierarchical prior \([11](https://arxiv.org/html/2606.24966#S3.E11)\) is complemented with the Gaussian likelihood:
yij∼𝒩\(dj\(1−e−t/τj\),0\.052\),y\_\{i\}^\{j\}\\sim\\mathcal\{N\}\\big\(d^\{j\}\(1\-e^\{\-t/\\tau^\{j\}\}\\big\),0\.05^\{2\}\),\(12\)wherej=1,…,Jj=1,\\dots,Jandi=1,…,Tji=1,\\dots,T^\{j\}, exploiting the closed\-form solution of the first\-order linear ODE withx0=0x\_\{0\}=0and the known form of the noise distribution\.
Overall, \([11](https://arxiv.org/html/2606.24966#S3.E11)\)\-\([12](https://arxiv.org/html/2606.24966#S3.E12)\) defines an HBM whose parameters can be estimated \(i\.e\., conditioned on𝒟\\mathcal\{D\}\) with MCMC\. We implemented the model in PyMC\(Abril\-Plaet al\.,[2023](https://arxiv.org/html/2606.24966#bib.bib13)\)using NUTS and thus obtained a trace containingM=10 000M=10\\,000posterior samples from:
p\(τ0,στ,d0,σd,τ1:J,d1:J∣𝒟\)\.p\(\\tau^\{0\},\\sigma\_\{\\tau\},d^\{0\},\\sigma\_\{d\},\\tau^\{1:J\},d^\{1:J\}\\mid\\mathcal\{D\}\)\.\(13\)Fig\.[2](https://arxiv.org/html/2606.24966#S3.F2)shows the posterior mean trajectories and 95% credible bands of the hidden statex\(t\)x\(t\)for datasetsD4,D5,D49,D50D^\{4\},D^\{5\},D^\{49\},D^\{50\}\.
Figure 2:LTI datasets: data pointsy\(ti\)y\(t\_\{i\}\)\(black circles\), true statex\(t\)x\(t\)\(solid black line\), posterior mean trajectory with 95% credible band under the hierarchical model \(blue\)\.We observe that the prediction quality is comparable to that obtained under knowledge of the true generative prior \([4](https://arxiv.org/html/2606.24966#S2.E4)\), visualized in Fig\.[1](https://arxiv.org/html/2606.24966#S2.F1)\.
### 3\.2Meta\-Learned Prior for New Tasks
To gain insights into the shared parameter structure learned by the HBM framework, it is instructive to visualize the distribution:
p^\(θJ\+1\)=p\(θJ\+1∣𝒟\),\\hat\{p\}\(\\theta^\{J\+1\}\)=p\(\\theta^\{J\+1\}\\mid\\mathcal\{D\}\),\(14\)namely the posterior belief for the parameters of a*new*, unobserved datasetDJ\+1D^\{J\+1\}, assumed to share the same generation mechanism asD1:JD^\{1:J\}\. We can identify this posterior predictive distribution as the*learned prior*, since it can be used as an informative prior when performing inference on a future dataset\. Formally, we can obtainp^\\hat\{p\}by marginalizing the*conditional joint*p\(ϕ,θJ\+1∣𝒟\)p\(\\phi,\\theta^\{J\+1\}\\mid\\mathcal\{D\}\):
p^\(θJ\+1\)\\displaystyle\\hat\{p\}\(\\theta^\{J\+1\}\)=p\(θJ\+1∣𝒟\)=∫p\(ϕ,θJ\+1∣𝒟\)𝑑ϕ\\displaystyle=p\(\\theta^\{J\+1\}\\mid\\mathcal\{D\}\)=\\int p\(\\phi,\\theta^\{J\+1\}\\mid\\mathcal\{D\}\)\\;d\\phi\(15\)=∫p\(θJ\+1∣ϕ\)p\(ϕ∣𝒟\)𝑑ϕ\.\\displaystyle=\\int p\(\\theta^\{J\+1\}\\mid\\phi\)p\(\\phi\\mid\\mathcal\{D\}\)\\;d\\phi\.
In practice, given the sample\-based approximation ofp\(ϕ∣𝒟\)p\(\\phi\\mid\\mathcal\{D\}\)obtained by solving the HBM with MCMC, samples ofp^\(θJ\+1\)\\hat\{p\}\(\\theta^\{J\+1\}\)can be obtained by sampling, for eachϕm\\phi^\{m\}in the trace, a correspondingθJ\+1;j\{\\theta^\{J\+1;j\}\}fromp\(θJ\+1∣ϕm\)p\(\\theta^\{J\+1\}\\mid\\phi^\{m\}\)\.
Fig\.[3](https://arxiv.org/html/2606.24966#S3.F3)visualizes the learned priorp^\\hat\{p\}for the previous LTI example\.
Figure 3:Histogram of10 00010\\,000samples of the learned prior forτ\\tau\(left\) anddd\(right\)\. The uniform distribution corresponding to the “true” prior \([4](https://arxiv.org/html/2606.24966#S2.E4)\) is visualized as a superposed yellow area\. The deterministic bounds \([5](https://arxiv.org/html/2606.24966#S2.E5)\) corresponding to the prior knowledge are visualized as superposed blue shaded areas and extend beyond the horizontal limit of the figure\.In particular, the learned prior forτ\\tauis strongly concentrated around the unit value, aligning with the underlying true generative distribution in \([4](https://arxiv.org/html/2606.24966#S2.E4)\)\. Given this fundamental additional information captured by the HBM through conditioning on𝒟\\mathcal\{D\}, it becomes evident why joint estimation of bothτ\\tauandddis possible even in information\-weak datasets likeD4,D49D^\{4\},D^\{49\}, andD50D^\{50\}\.
## 4Implementation
We recall that the practical implementation of MCMC requires repeated evaluations of the unnormalized target distribution, which in our case is the joint density \([9](https://arxiv.org/html/2606.24966#S3.E9)\)\. Assuming the hyperprior, the dataset\-specific prior densities, the measurement density, and the functionsffandggare inexpensive to evaluate, the primary technical difficulty in evaluating the joint distribution is the repeated solution of the ODE in \([7c](https://arxiv.org/html/2606.24966#S3.E7.3)\)\. In particular, when the analytical solution is unavailable, a numerical ODE integration scheme must be applied\. From a software implementation perspective, obtaining the conditional distribution \([8](https://arxiv.org/html/2606.24966#S3.E8)\) for general systems of the form \([1](https://arxiv.org/html/2606.24966#S2.E1)\) requires the coupling of a probabilistic inference engine implementing MCMC with a numerical ODE solver\.
A key practical consideration in this coupling is whether the chosen ODE solver provides derivatives of the simulated state variables with respect to its parameters\. A differentiable solver can provide not only the simulated statex^ij\\hat\{x\}\_\{i\}^\{j\}, but also the gradients ofx^ij\\hat\{x\}\_\{i\}^\{j\}with respect toθj\\theta^\{j\}, for example via a discrete or continuous adjoint method, as implemented in JAX\-based solvers such as Diffrax\(Bradburyet al\.,[2018](https://arxiv.org/html/2606.24966#bib.bib14); Kidger,[2021](https://arxiv.org/html/2606.24966#bib.bib35)\)\.
If such derivatives are available, the likelihood is differentiable with respect toθj\\theta^\{j\}through the full map
θj↦x^ij↦y^ij↦p\(yij∣tij,θj\),\\theta^\{j\}\\mapsto\\hat\{x\}\_\{i\}^\{j\}\\mapsto\\hat\{y\}\_\{i\}^\{j\}\\mapsto p\(y\_\{i\}^\{j\}\\mid t\_\{i\}^\{j\},\\theta^\{j\}\),\(16\)defined by the ODE solution in \([7c](https://arxiv.org/html/2606.24966#S3.E7.3)\)\. Access to these gradients allows for the use of gradient\-based MCMC methods such as Hamiltonian Monte Carlo or NUTS, which are attractive in high\-dimensional landscapes because they can explore such posteriors efficiently\. This is particularly relevant in the hierarchical setting, where the dimension of the posterior distribution grows with the number of datasets through the dataset\-specific parametersθ1:J\\theta^\{1:J\}, in addition to the shared hyperparametersϕ\\phi\.
If the numerical solver is instead treated as a black\-box routine, the joint density \([9](https://arxiv.org/html/2606.24966#S3.E9)\) can still be evaluated, but gradients with respect toθj\\theta^\{j\}andϕ\\phiare not available\. This is the case, for instance, when standard solver implementations, such as SciPy’ssolve\_ivpare used\. Inference must then rely on gradient\-free samplers, such as Metropolis\-Hastings or slice sampling\. While these methods may be adequate for low\-dimensional or relatively simple models, they tend to become inefficient or numerically unstable for nonlinear ODE systems with strong posterior correlations across datasets\.
In this work, we adopt a JAX\-based implementation, using NumPyro\(Binghamet al\.,[2019](https://arxiv.org/html/2606.24966#bib.bib18); Phanet al\.,[2019](https://arxiv.org/html/2606.24966#bib.bib17)\)for probabilistic inference together with Diffrax for ODE solving\. We use a 5th\-order Runge\-Kutta solver to numerically solve the ODEs, together with a PID step\-size controller\. In our setup, the numerical integration step is part of the JAX computational graph, so that derivatives of the simulated states with respect to the sampled parameters are available to the gradient\-based NUTS sampler\. The code is available online222https://gitlab\-core\.supsi\.ch/dti\-idsia/hierarchical\_bayesian\_odes\. The experiments reported in the following section were run on a server equipped with two AMD EPYC 7742 64\-Core processors \(128 physical cores, 256 logical threads\) and 629 GiB of RAM\.
## 5Numerical Experiments
As a case study, we consider the Lotka–Volterra predator–prey model, a classical description of oscillatory population cycles in ecology\. We use parameter values representative of the classical Canadian lynx and snowshoe hare setting\(Stensethet al\.,[1997](https://arxiv.org/html/2606.24966#bib.bib15); Carpenter,[2018](https://arxiv.org/html/2606.24966#bib.bib19); Brunkhorst,[2023](https://arxiv.org/html/2606.24966#bib.bib20)\)and simulate multiple datasets that we interpret as scenarios arising from different, but related ecological systems\. For example, these could correspond to distinct regions with varying environmental conditions\.
### 5\.1System Description
The Lotka\-Volterra model describes the interaction between predator and prey populations as:
ddtx=ddt\[x1x2\]=\[αx1−βx1x2δx1x2−γx2\]=f\(x,t;θf\),\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}t\}x=\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}t\}\\begin\{bmatrix\}x\_\{1\}\\\\ x\_\{2\}\\end\{bmatrix\}=\\begin\{bmatrix\}\\alpha x\_\{1\}\-\\beta x\_\{1\}x\_\{2\}\\\\ \\delta x\_\{1\}x\_\{2\}\-\\gamma x\_\{2\}\\end\{bmatrix\}=f\\big\(x,t;\\,\\theta\_\{f\}\\big\),\(17\)where the statesx1x\_\{1\},x2x\_\{2\}represent the prey and predator populations in thousands of individuals, respectively, andθf=\[αβγδ\]⊤\\theta\_\{f\}=\[\\alpha\\;\\beta\\;\\gamma\\;\\delta\]^\{\\top\}\.
At a finite set of irregularly\-spaced sampling instants, noisy measurements of both states are collected, corrupted by a mean\-preserving multiplicative log\-normal noise:
\[y1\(ti\)y2\(ti\)\]=\[x1\(ti\)exp\(ση,1ϵ1,i−12ση,12\)x2\(ti\)exp\(ση,2ϵ2,i−12ση,22\)\],\\begin\{bmatrix\}y\_\{1\}\(t\_\{i\}\)\\\\ y\_\{2\}\(t\_\{i\}\)\\end\{bmatrix\}=\\begin\{bmatrix\}x\_\{1\}\(t\_\{i\}\)\\exp\\\!\\bigg\(\\sigma\_\{\\eta,1\}\\epsilon\_\{1,i\}\-\\frac\{1\}\{2\}\\sigma\_\{\\eta,1\}^\{2\}\\bigg\)\\\\\[8\.53581pt\] x\_\{2\}\(t\_\{i\}\)\\exp\\\!\\bigg\(\\sigma\_\{\\eta,2\}\\epsilon\_\{2,i\}\-\\frac\{1\}\{2\}\\sigma\_\{\\eta,2\}^\{2\}\\bigg\)\\end\{bmatrix\},\(18\)withϵ1,i,ϵ2,i∼𝒩\(0,1\)\\epsilon\_\{1,i\},\\epsilon\_\{2,i\}\\sim\\mathcal\{N\}\(0,1\)andση,1,ση,2\>0\\sigma\_\{\\eta,1\},\\sigma\_\{\\eta,2\}\>0unknown noise parameters\.
### 5\.2Meta\-Dataset Generation
We generateJ=20J=20Lotka\-Volterra systems and corresponding datasets
𝒟=\{D1,D2,…,D20\}\.\\mathcal\{D\}=\\\{D^\{1\},D^\{2\},\\ldots,D^\{20\}\\\}\.\(19\)
In all datasets, we fixx0=\[35 4\]⊤x\_\{0\}=\[35\\;4\]^\{\\top\}, corresponding to 35 thousand prey \(hares\) and 4 thousand predators \(lynx\)\. The ODE coefficientsθf=\[αβγδ\]⊤\\theta\_\{f\}=\[\\alpha\\;\\beta\\;\\gamma\\;\\delta\]^\{\\top\}, and the noise coefficientsθη=\[ση,1ση,2\]⊤\\theta\_\{\\eta\}=\[\\sigma\_\{\\eta,1\}\\;\\sigma\_\{\\eta,2\}\]^\{\\top\}vary across system realizations and form the parameter vectorθ=\[θf⊤θη⊤\]∈ℝnθ\\theta=\[\\theta\_\{f\}^\{\\top\}\\;\\theta\_\{\\eta\}^\{\\top\}\]\\in\\mathbb\{R\}^\{n\_\{\\theta\}\}withnθ=6n\_\{\\theta\}=6\. Each parameter in each dataset is generated from a scaled Beta family:
θkj∼λmin,k\+\(λmax,k−λmin,k\)Beta\(κk,νk\),\\theta\_\{k\}^\{j\}\\sim\\lambda\_\{\\min,k\}\+\(\\lambda\_\{\\max,k\}\-\\lambda\_\{\\min,k\}\)\\mathrm\{Beta\}\(\\kappa\_\{k\},\\nu\_\{k\}\),\(20\)fork=1,…,nθk=1,\\dots,n\_\{\\theta\}andj=1,…,Jj=1,\\dots,J\. The hyperparameters of the scaled Beta distributions are reported in Table[1](https://arxiv.org/html/2606.24966#S5.T1)\.
Table 1:Parameter distributional values used in the dataset\-generating process\.For each datasetDjD^\{j\}, we collectT=5T=5noisy state samples irregularly spaced over the interval\[0,tend\]\[0,t\_\{\\mathrm\{end\}\}\], wheretendt\_\{\\mathrm\{end\}\}is 27 years, corresponding to around 3 population cycles\. The sampling instants are randomized across datasets, increasing the diversity in the meta dataset\.
### 5\.3Probabilistic Meta\-Learning with HBM
We assume access to the meta\-dataset𝒟\\mathcal\{D\}in \([19](https://arxiv.org/html/2606.24966#S5.E19)\), knowledge of the functional form of the data\-generating mechanism \([17](https://arxiv.org/html/2606.24966#S5.E17)\) and of the noise structure \([18](https://arxiv.org/html/2606.24966#S5.E18)\)\. However, the values of the parametersθ\\thetafor each dataset are unknown, as are their probabilistic structure \([20](https://arxiv.org/html/2606.24966#S5.E20)\) and the hyperparameters in Table[1](https://arxiv.org/html/2606.24966#S5.T1)\.
Instead, we assume some weak background knowledge on*plausible*parameter ranges, informed byCarpenter \([2018](https://arxiv.org/html/2606.24966#bib.bib19)\)\. Upper and lower range constituting weak prior knowledge onθf\\theta\_\{f\}are given in Table[2](https://arxiv.org/html/2606.24966#S5.T2)\.
Table 2:Weak prior knowledge on the system parametersθf\\theta\_\{f\}\.Note that these intervals are wider than the ranges used to generate the datasets\. For the HBM parameter estimation, this weak information is encoded through a hierarchical prior on the dataset\-specific parameters\. For each datasetjjand the ODE parametersθfj\\theta\_\{f\}^\{j\}, we introduce a non\-centered Gaussian hierarchy on an unconstrained variable,
zθj=μθ\+τθϵθj,ϵθj∼𝒩\(0,1\),z\_\{\\theta\}^\{j\}=\\mu\_\{\\theta\}\+\\tau\_\{\\theta\}\\epsilon\_\{\\theta\}^\{j\},\\qquad\\epsilon\_\{\\theta\}^\{j\}\\sim\\mathcal\{N\}\(0,1\),whereμθ\\mu\_\{\\theta\}represents the population location andτθ\\tau\_\{\\theta\}the between\-dataset variability of parameterθfj\\theta\_\{f\}^\{j\}on the unconstrained scale\. We then mapzθjz\_\{\\theta\}^\{j\}to the admissible interval using a logistic transform combined with a min\-max scaling\. This construction keeps all dataset\-specific parameters within their weakly plausible ranges while allowing their population location and variability to be learned from the meta\-dataset\. The population\-level hyperparametersμθ\\mu\_\{\\theta\}andτθ\\tau\_\{\\theta\}are assigned weakly informative priors,
μθ∼𝒩\(0,1\),τθ∼\|𝒩\(0,\(2/3\)2\)\|,\\mu\_\{\\theta\}\\sim\\mathcal\{N\}\(0,1\),\\qquad\\tau\_\{\\theta\}\\sim\\lvert\\mathcal\{N\}\(0,\(2/3\)^\{2\}\)\\rvert,for all four parameters\.
For the weak prior knowledge onθη\\theta\_\{\\eta\}, lognormal priors centered at the nominal valuesσ¯η,1=0\.30\\bar\{\\sigma\}\_\{\\eta,1\}=0\.30andσ¯η,2=0\.34\\bar\{\\sigma\}\_\{\\eta,2\}=0\.34, are assigned, with half\-normal priors on the corresponding between\-dataset log\-scale variation\. The HBM was sampled with four NUTS chains, using 2000 warm\-up iterations and 2500 posterior draws per chain, yielding a trace𝒯\\mathcal\{T\}ofM=10 000M=10\\,000posterior samples\.333Convergence diagnostics \(R^\\hat\{R\}statistics, effective sample sizes, and trace plots\) are provided in the code\.
As a reference, we consider two models that estimate the parameters separately for each datasetDjD^\{j\}, without exploiting the shared structure of the meta\-dataset\. As a Bayesian baseline, we consider an unpooled model that treats the datasets as independent\. It uses the same likelihood, parameter transformations, and sampling setup as the HBM described above, but assigns independent bounded weak priors to the transformed ODE parameters and independent priors to the two dataset\-specific observation\-noise scales, without any shared population\-level hierarchy\.
As a non\-probabilistic baseline, we compute nonlinear least\-squares \(NLS\) estimates with box constraints over the weak ranges in Table[2](https://arxiv.org/html/2606.24966#S5.T2)\. We use a multi\-start optimization strategy and residuals defined as differences between the logarithms of simulated and observed states, and return a single point estimate for the four ODE parametersθf\\theta\_\{f\}in each dataset, without estimating observation\-noise parametersθη\\theta\_\{\\eta\}\.
We compare the performance of the three estimation approaches with respect to both the accuracy of the trajectories and the parameter recovery\. Trajectory accuracy is assessed on a fine grid of 1000 points by comparing estimated state trajectories with the true ones\. For NLS, we obtain a point estimate of the state trajectory by simulating \([17](https://arxiv.org/html/2606.24966#S5.E17)\) with the obtained estimate ofθf\\theta\_\{f\}\. For the Bayesian models, we use as point estimate the posterior mean, computed as the point\-wise mean over trajectories obtained by simulating \([17](https://arxiv.org/html/2606.24966#S5.E17)\) with parameters from the MCMC posterior trace\. We report average root mean squared error \(RMSE\) values for the trajectories\. Furthermore, for the Bayesian models we assess the latent\-trajectory uncertainty using the latent\-state continuous ranked probability score \(CRPS\)\(Gneiting and Raftery,[2007](https://arxiv.org/html/2606.24966#bib.bib22)\), averaged over time on the fine grid\.
For parameter estimation, we first compute the relative error between the true dataset\-specific parameters and the corresponding point estimates\. We summarize overall recovery accuracy as a normalized RMSE \(nRMSE\), computed for each dataset as the root mean squared value of these relative errors across parameters, and then averaged across datasets\. For the Bayesian models, point estimates of the parameters are defined as the posterior mean of the parameters\. To quantify parameter uncertainty, we also report the mean relative width of the corresponding 95% highest density intervals \(HDI\)\.
### 5\.4Results
The complete HBM run required 20 minutes on the computing setup described in Section[4](https://arxiv.org/html/2606.24966#S4)\. In terms of mean trajectory RMSE, the HBM model achieves better performance \(RMSE of 8\.26, in the units of the ODE states\), compared the unpooled Bayesian model \(11\.77\) and the NLS \(17\.17\)\. This represents a substantial improvement in trajectory accuracy for the hierarchical approach, corresponding to a reduction of approximately 30% over the unpooled Bayesian model and more than 50% over NLS\. Fig\.[4](https://arxiv.org/html/2606.24966#S5.F4)visualizes the latent trajectory estimates for three representative datasetsD1D^\{1\},D8D^\{8\},D19D^\{19\}\.
Figure 4:Latent trajectory estimates for three representative Lotka–Volterra datasets\. Black circles denote the noisy observations, solid black curves the true latent states for prey and predators, blue curves the HBM posterior mean trajectories, pink curves the unpooled Bayesian posterior mean trajectories, and orange dashed curves the bounded NLS fits\. Shaded bands indicate 50% and 95% posterior credible intervals for the Bayesian models\.From the figure, it is evident that the performance of the three methods depends on the informativeness of each dataset\. ForD1D^\{1\}, where the observations are well\-positioned to constrain the dynamics, both the HBM and the unpooled model recover the true trajectory accurately, although the HBM already achieves a better fit for the predator state\. For datasetD8D^\{8\}, the HBM starts to outperform both baseline models more clearly, and forD19D^\{19\}, where the sparse and irregularly sampled observations provide limited information, the performance gain is evident\. This improvement is also reflected in the distributional trajectory score\. The HBM achieves a mean latent CRPS of 3\.59, compared to 5\.18 for the unpooled model\.
For parameter recovery, the HBM is more accurate \(parameter nRMSE of 0\.07\) compared to the unpooled model \(0\.12\) and the NLS \(0\.20\)\. The mean and standard deviation of the relative error for each parameter are reported in Tab\.[3](https://arxiv.org/html/2606.24966#S5.T3)\.
Table 3:Mean relative bias and standard deviation \(in brackets\) of parameter estimates across the 20 datasets\.Across all four parameters, the HBM exhibits consistently lower bias and notably lower variability across datasets\. The posterior uncertainty, quantified by the mean relative width of the 95% HDI, also demonstrates the benefit of hierarchical pooling\. The HBM achieves a mean relative HDI width of 0\.35, compared to 0\.58 for the unpooled model\. This indicates that the HBM produces credible intervals that are approximately 40% narrower on average\. Importantly, because this reduction in uncertainty is accompanied by lower trajectory RMSE, lower CRPS, and improved parameter recovery, the results indicate that the HBM is not simply overconfident but provides more accurate and well\-calibrated estimates\.
## 6Conclusions
We introduced a hierarchical Bayesian approach to probabilistic meta\-learning for dynamical systems, enabling joint inference across related datasets\. By learning a shared prior distribution from the meta\-dataset, the proposed approach mitigates practical non\-identifiability arising in sparse and irregularly sampled settings, where single\-dataset inference is unreliable\. The integration of differentiable ODE solvers with gradient\-based MCMC allows efficient numerical inference, for which we provide an open\-access implementation to facilitate adoption\. Numerical results on a Lotka–Volterra benchmark demonstrate that the method consistently outperforms standard unpooled Bayesian and nonlinear least\-squares approaches\. This highlights the potential for data\-efficient system identification in applications with sparse and noisy data, such as those arising in the biomedical domain\. As a promising direction, the framework could be extended to hybrid gray\-box models that combine dynamical systems with Bayesian neural network components for increased expressivity\.
## References
- O\. Abril\-Pla, V\. Andreani, C\. Carroll, L\. Dong, C\. Fonnesbeck, M\. Kochurov, R\. Kumar, J\. Lao, C\. Luhmann, O\. Martin, M\. Osthege, R\. Vieira, T\. Wiecki, and R\. Zinkov \(2023\)PyMC: a modern, and comprehensive probabilistic programming framework in Python\.PeerJ Computer Science9,pp\. e1516\.External Links:ISSN 2376\-5992,[Document](https://dx.doi.org/10.7717/peerj-cs.1516)Cited by:[§3](https://arxiv.org/html/2606.24966#S3.SSx1.p3.2)\.
- E\. Bingham, J\. Chen, M\. Jankowiak, F\. Obermeyer, N\. Pradhan, T\. Karaletsos, R\. Singh, P\. Szerlip, P\. Horsfall, and N\. Goodman \(2019\)Pyro: Deep universal probabilistic programming\.Vol\.20,pp\. 973–978\.External Links:ISSN 1532\-4435,[Document](https://dx.doi.org/10.48550/arXiv.1810.09538)Cited by:[§4](https://arxiv.org/html/2606.24966#S4.p5.1)\.
- J\. Bradbury, R\. Frostig, P\. Hawkins, M\. Johnson, Y\. Katariya, C\. Leary, D\. Maclaurin, G\. Necula, A\. Paszke, J\. VanderPlas, S\. Wanderman\-Milne, and Q\. Zhang \(2018\)JAX: composable transformations of Python\+NumPy programs\.External Links:[Link](http://github.com/jax-ml/jax)Cited by:[§4](https://arxiv.org/html/2606.24966#S4.p2.3)\.
- G\. Brunkhorst \(2023\)ODE Lotka\-Volterra with Bayesian inference in multiple ways\.External Links:[Link](https://www.pymc.io/projects/examples/en/latest/ode_models/ODE_Lotka_Volterra_multiple_ways.html)Cited by:[§5](https://arxiv.org/html/2606.24966#S5.p1.1)\.
- B\. Carpenter \(2018\)Predator\-prey population dynamics: the Lotka\-Volterra model in Stan\.Stan Case Study\.External Links:[Link](https://mc-stan.org/learn-stan/case-studies/lotka-volterra-predator-prey.html)Cited by:[§5\.3](https://arxiv.org/html/2606.24966#S5.SS3.p2.1),[§5](https://arxiv.org/html/2606.24966#S5.p1.1)\.
- A\. Chakrabarty, G\. Wichern, V\. Deshpande, A\. Vinod, K\. Berntorp, and C\. Laughman \(2025\)Meta\-learning for physically constrained neural system identification\.Neurocomputing651,pp\. 130945\.External Links:ISSN 09252312,[Document](https://dx.doi.org/10.1016/j.neucom.2025.130945)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p7.1)\.
- P\. Congdon \(2019\)Bayesian hierarchical models: With applications using R, second edition\.Chapman and Hall/CRC,New York\.External Links:ISBN 978\-0\-429\-11335\-2,[Document](https://dx.doi.org/10.1201/9780429113352)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p4.1)\.
- M\. Forgione, F\. Pura, and D\. Piga \(2023\)From system models to class models: an in\-context learning paradigm\.IEEE Control Systems Letters7,pp\. 3513–3518\.External Links:ISSN 2475\-1456,[Document](https://dx.doi.org/10.1109/LCSYS.2023.3335036)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p7.1)\.
- A\. Gelman, J\. Carlin, H\. Stern, D\. Dunson, A\. Vehtari, and D\. Rubin \(2013\)Bayesian data analysis\.Third edition,Texts in statistical science series,CRC,Boca Raton, Florida\.External Links:ISBN 9781439840955 1439840954Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p4.1)\.
- M\. Gevers, A\. Bazanella, X\. Bombois, and L\. Mišković \(2009\)Identification and the information matrix: how to get just sufficiently rich?\.IEEE Transactions on Automatic Control54\(12\),pp\. 2828–2840\.External Links:[Document](https://dx.doi.org/10.1109/TAC.2009.2034199)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p1.1)\.
- T\. Gneiting and A\. Raftery \(2007\)Strictly proper scoring rules, prediction, and estimation\.Journal of the American Statistical Association102\(477\),pp\. 359–378\.External Links:ISSN 0162\-1459, 1537\-274X,[Document](https://dx.doi.org/10.1198/016214506000001437)Cited by:[§5\.3](https://arxiv.org/html/2606.24966#S5.SS3.p7.1)\.
- P\. Green, E\. Cross, and K\. Worden \(2015\)Bayesian system identification of dynamical systems using highly informative training data\.Mechanical Systems and Signal Processing56,pp\. 109–122\.External Links:ISSN 08883270,[Document](https://dx.doi.org/10.1016/j.ymssp.2014.10.003)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p6.1)\.
- M\. Hoffman and A\. Gelman \(2014\)The No\-U\-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo\.Journal of Machine Learning Research15\(1\),pp\. 1593–1623\.Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p8.1)\.
- H\. Huang, A\. Handel, and X\. Song \(2020\)A Bayesian approach to estimate parameters of ordinary differential equation\.Computational Statistics35\(3\),pp\. 1481–1499\.External Links:ISSN 0943\-4062, 1613\-9658,[Document](https://dx.doi.org/10.1007/s00180-020-00962-8)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p6.1)\.
- P\. Kidger \(2021\)On Neural Differential Equations\.Ph\.D\. Thesis,University of Oxford\.Cited by:[§4](https://arxiv.org/html/2606.24966#S4.p2.3)\.
- B\. Lakshminarayanan, M\. Guerrero, and C\. Rojas \(2025\)Fine\-tuning a simulation\-driven estimator\.IEEE Control Systems Letters9,pp\. 2975–2980\.External Links:ISSN 2475\-1456,[Document](https://dx.doi.org/10.1109/LCSYS.2025.3647070)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p7.1)\.
- A\. Lawson \(2018\)Bayesian disease mapping: hierarchical modeling in spatial epidemiology\.3 edition,Chapman and Hall/CRC,New York\.External Links:ISBN 978\-1\-351\-27176\-9Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p4.1)\.
- C\. Loos, S\. Krause, and J\. Hasenauer \(2018\)Hierarchical optimization for the efficient parametrization of ODE models\.Bioinformatics34\(24\),pp\. 4266–4273\(en\)\.External Links:ISSN 1367\-4803, 1367\-4811,[Document](https://dx.doi.org/10.1093/bioinformatics/bty514)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p6.1)\.
- L\. Multerer, T\. Glass, F\. Vanobberghen, and T\. Smith \(2021\)Analysis of contamination in cluster randomized trials of malaria interventions\.Trials22\(1\),pp\. 613\.External Links:ISSN 1745\-6215,[Document](https://dx.doi.org/10.1186/s13063-021-05543-8)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p6.1)\.
- D\. Phan, N\. Pradhan, and M\. Jankowiak \(2019\)Composable effects for flexible and accelerated probabilistic programming in NumPyro\.External Links:[Document](https://dx.doi.org/10.48550/arXiv1912.11554)Cited by:[§4](https://arxiv.org/html/2606.24966#S4.p5.1)\.
- G\. Pillonetto, T\. Chen, A\. Chiuso, G\. De Nicolao, L\. Ljung,et al\.\(2022\)Regularized system identification: learning dynamic models from data\.Communications and Control Engineering,Springer,Cham\.External Links:ISBN 978\-3\-030\-95859\-6 978\-3\-030\-95860\-2,[Document](https://dx.doi.org/10.1007/978-3-030-95860-2)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p2.1)\.
- J\. Pinheiro and D\. Bates \(2000\)Mixed\-effects models in S and S\-PLUS\.Statistics and Computing,Springer,New York\.External Links:ISBN 978\-0\-387\-98957\-0,[Document](https://dx.doi.org/10.1007/b98882)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p4.1)\.
- A\. Raue, C\. Kreutz, T\. Maiwald, J\. Bachmann, M\. Schilling, U\. Klingmüller, and J\. Timmer \(2009\)Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood\.Bioinformatics25\(15\),pp\. 1923–1929\.External Links:[Document](https://dx.doi.org/10.1093/bioinformatics/btp358)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p1.1),[Remark 4](https://arxiv.org/html/2606.24966#Thmremark4.p1.4.4)\.
- A\. Raue, C\. Kreutz, F\. Theis, and J\. Timmer \(2013\)Joining forces of Bayesian and frequentist methodology: a study for inference in the presence of non\-identifiability\.Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences371\(1984\),pp\. 20110544\.External Links:[Document](https://dx.doi.org/10.1098/rsta.2011.0544)Cited by:[Remark 4](https://arxiv.org/html/2606.24966#Thmremark4.p1.4.4)\.
- W\. Roda \(2020\)Bayesian inference for dynamical systems\.Infectious Disease Modelling5,pp\. 221–232\.External Links:ISSN 24680427,[Document](https://dx.doi.org/10.1016/j.idm.2019.12.007)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p6.1)\.
- N\. Stenseth, W\. Falck, O\. Bjørnstad, and C\. Krebs \(1997\)Population regulation in snowshoe hare and Canadian lynx: Asymmetric food web configurations between hare and lynx\.Proceedings of the National Academy of Sciences94\(10\),pp\. 5147–5152\.External Links:ISSN 0027\-8424, 1091\-6490,[Document](https://dx.doi.org/10.1073/pnas.94.10.5147)Cited by:[§5](https://arxiv.org/html/2606.24966#S5.p1.1)\.
- A\. Stuart \(2010\)Inverse problems: A Bayesian perspective\.Acta Numerica19,pp\. 451–559\.External Links:ISSN 0962\-4929, 1474\-0508,[Document](https://dx.doi.org/10.1017/S0962492910000061)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p6.1)\.
- J\. Wakefield \(1996\)The Bayesian analysis of population pharmacokinetic models\.Journal of the American Statistical Association91\(433\),pp\. 62–75\.External Links:[Document](https://dx.doi.org/10.1080/01621459.1996.10476664)Cited by:[§1](https://arxiv.org/html/2606.24966#S1.p4.1)\.Similar Articles
Learning Manifold and It\^o Dynamics with Branched Neural Rough Differential Equations
This paper introduces Branched Neural Rough Differential Equations, a method for learning manifold and Itô dynamics by combining rough path theory with neural networks, enabling the modeling of complex stochastic and geometric structures.
Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models
Introduces Bayesian Manifold Curriculum (BMC), an adaptive curriculum learning method for LLMs that leverages the model's latent geometry to allocate training effort across diverse problem types, improving efficiency beyond traditional difficulty-based curricula.
Stochastic Neural Networks for hierarchical reinforcement learning
OpenAI researchers propose a framework using stochastic neural networks for hierarchical reinforcement learning that pre-trains useful skills guided by a proxy reward, then leverages these skills for faster learning in downstream tasks with sparse rewards or long horizons.
Parallel Adaptive Multi-Objective Evolutionary Learning of Discretized Bayesian Network Classifiers for Clinical Data
This paper introduces a parallelization strategy and adaptive steering mechanism for the Baymex algorithm to efficiently learn discretized Bayesian network classifiers for clinical data, achieving speedups over 54x on a 16-core CPU and comparable or better predictive performance than traditional models while maintaining explainability.
Deep Spectral Learning of Embedded Latent Transfer Operators for Stochastic Dynamical Systems
Proposes a spectral learning method for stochastic nonlinear dynamical systems using deep feature spaces and an operator-based latent state-space model, demonstrating stable performance in forecasting and filtering tasks.